COMPUTE! ISSUE 167 / AUGUST 1994 / PAGE 34

Pentium Speed (Hardware Clinic)
by Mark Minasi

The clock speeds of Pentiums and 486s seem similar. But the Pentium is faster than the 486--how does it get faster?

Pentium chips are offered in 60- and 66-MHz varieties because Intel had trouble producing Pentiums that could handle 66 MHz. Many of those failed 66-MHz chips could perform reliably at 60 MHz, so Intel offered the 60-MHz chip. Most of the less expensive Pentium machines are based on the 60-MHz chips. A Pentium running at 60 or 66 MHz doesn't sound like that much of an improvement over the 486DX2-66 chip, but it's actually much faster. One reason for the higher speed is that the 60- and 66-MHz Pentiums run both internally and externally at those speeds. The 486DX2-66, in contrast, runs at 66 MHz internally but only at 33 MHz when communicating with the rest of the PC's circuitry.

A new Pentium chip, the P54C, runs at 90 or 100 MHz, but unlike the original Pentiums, it isn't a pure 90- or 100-MHz chip. Instead, the P54C is a 60- or 66-MHz Pentium equipped with a one-and-a-half-times clock circuit. The motherboard runs at, say, 60 MHz, but internal P54C operations run 50 percent faster: 60 MHz times 1.5 gives 90 MHz, and 66 MHz times 1.5 gives roughly 100 MHz.

Potentially, this P54C could be offered as an upgrade chip for existing Pentium systems, but only with a special socket. In any case, be aware that a 100-MHz Pentium system really has a jazzed-up 66-MHz Pentium at its heart.

Another way the Pentium racks up better speed is as a result of instruction pipelining. A CPU executes a program in memory by first fetching the instruction from RAM, then executing the instruction. Those two steps are unchanged regardless of the CPU. But before the instruction can be executed, the CPU must figure out what kind of instruction it is. For example, some instructions only require one byte of information, like the simplest CPU instruction, NOP. NOP means "no operation"; when the CPU encounters this instruction, it just moves along to the next instruction. Its opcode is hex 90. Here are the steps the CPU must go through in executing NOP.

1. Read the 90h opcode.

2. Recognize that there are no other bytes to read in this instruction.

3. Increment the instruction pointer (IP) by one so that the CPU knows where to get the next instruction.

4. Fetch the next program instruction.

By contrast, a command like ADD AX,[44], which tells the CPU to add the value at memory location 44 to the value in the AX register, requires many more steps. In hex, it's three bytes: 03, 06, 44. Here's what the CPU must do to perform this operation (a sketch of the whole fetch-decode-execute cycle appears after this list).

1. Read the 03 opcode.

2. Recognize it as an ADD command, which requires at least one more byte.

3. Read the next byte, 06.

4. Recognize that the 03, 06 combination requires a third byte.

5. Read the next byte, 44.

6. Fetch the value at location 44 from memory.

7. Add the value in memory location 44 to the value currently in the AX register.

8. Put the result into the AX register.

9. Add three to the instruction pointer so that it can find the next instruction.

10. Fetch the next instruction from memory.
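
If you'd like to see those steps as a program, here's a minimal sketch in C of the fetch-decode-execute loop, handling just the two instructions above. The tiny memory array, the one-byte address, and the two-entry instruction table are simplifications invented for illustration; real x86 decoding covers hundreds of opcodes and multibyte operands.

/* A toy fetch-decode-execute loop in C. It models only the two
   instructions discussed above: the one-byte NOP (90h) and a
   simplified three-byte ADD AX,[addr] (03h 06h addr). */

#include <stdio.h>

unsigned char memory[256] = { 0x90, 0x03, 0x06, 0x44 };
int ax = 0;        /* simulated AX register         */
int ip = 0;        /* simulated instruction pointer */

void step(void)
{
    unsigned char opcode = memory[ip];        /* 1. read the opcode       */
    if (opcode == 0x90) {                     /* NOP: no operand bytes    */
        ip += 1;                              /* advance IP by one        */
    } else if (opcode == 0x03 && memory[ip + 1] == 0x06) {
        unsigned char addr = memory[ip + 2];  /* read the address byte    */
        ax += memory[addr];                   /* fetch operand, add to AX */
        ip += 3;                              /* advance IP by three      */
    }
}   /* the next call to step() fetches the next instruction */

int main(void)
{
    memory[0x44] = 7;      /* the value waiting at location 44 */
    step();                /* executes the NOP                 */
    step();                /* executes ADD AX,[44]             */
    printf("AX = %d\n", ax);
    return 0;
}

Run it, and the loop executes the NOP, then the three-byte ADD, leaving 7 in the simulated AX register.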

Early microprocessors would perform steps 1-10 above, and only when the tenth step was finished would they start working on the next instruction. This would be like running an automobile factory by making an entire car without starting work on the second car until the first is completely finished. That's silly, as it's obvious that one group of people can be working on an engine while another group works on the wheels while another works on the doors, and so on. Microprocessors can do the same thing, and Intel's 80x86 processors have done it from the start: even the original 8086 incorporated 6 bytes of prefetch queue or pipeline. (The 80386 has 16 bytes, the 486 has 32 bytes, and the Pentium has two 64-byte queues.)

The prefetch queue can speed up a microprocessor in two ways. The first is what I just described--the assembly-line approach to decoding instructions. Before I explain the second, I need to explain the bottleneck between memory and the CPU.

The CPU spends a lot of time retrieving data from the system's memory. Typically, that requires two cycles of the computer's clock, during which the CPU is not decoding and executing instructions. Therefore, the prefetch queue can save time in another way: It can get the instructions out of memory in parallel with the decoder unit. While the CPU's execution unit is executing instruction X, the decoder unit is decoding instruction X+1, and the prefetch unit is retrieving instruction X+2 from RAM.
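
Here's a toy C program that shows that overlap by printing which instruction each unit is handling on each clock tick. The assumption that every stage takes exactly one tick is a simplification for illustration; real instructions take varying numbers of cycles.

/* Each tick, three instructions are in flight at once: one being
   fetched, one being decoded, one being executed. */

#include <stdio.h>

#define NUM_INSTRUCTIONS 5

int main(void)
{
    for (int tick = 0; tick < NUM_INSTRUCTIONS + 2; tick++) {
        int f = tick;       /* prefetch unit's instruction  */
        int d = tick - 1;   /* decoder unit's instruction   */
        int e = tick - 2;   /* execution unit's instruction */

        printf("tick %d:", tick);
        if (f < NUM_INSTRUCTIONS)           printf("  fetch X+%d", f);
        if (d >= 0 && d < NUM_INSTRUCTIONS) printf("  decode X+%d", d);
        if (e >= 0 && e < NUM_INSTRUCTIONS) printf("  execute X+%d", e);
        printf("\n");
    }
    return 0;
}

By the third tick, all three units are busy at once--the assembly line in action.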

This sounds great. Unfortunately, it doesn't always work. If the CPU's execution unit is in the middle of executing an instruction that moves data to or from memory, then the pathways between the memory and the CPU are already occupied for the moment, and the prefetch unit must wait. Despite that bottleneck, the instruction prefetch considerably speeds up CPU operations.

That's not the end of the memory story, however. We discuss CPU speeds in terms of megahertz--and the more megahertz, the better. We discuss memory speeds in nanoseconds (ns) of access time--and the fewer nanoseconds, the better. As CPUs get faster, memory must get faster as well. The relationship between CPU speed and memory speed isn't straightforward, but you can get a rough equivalence using this formula: If you have a CPU of M megahertz, then it will require RAM with an access time of about 2000/M nanoseconds. (One clock cycle at M MHz lasts 1000/M ns, and a memory access takes two cycles--hence the 2000.) For example, a 50-MHz system would require RAM with an access time of 2000/50 ns--which works out to 40 ns.
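
The rule of thumb is easy to put into code; here's a quick C rendering, with a few sample clock speeds chosen just for demonstration.

/* One clock cycle at M MHz lasts 1000/M ns, and a memory access
   takes two cycles, so RAM needs an access time of about 2000/M ns. */

#include <stdio.h>

double required_ram_ns(double cpu_mhz)
{
    return 2000.0 / cpu_mhz;    /* two clock cycles, in ns */
}

int main(void)
{
    double mhz[] = { 33.0, 50.0, 66.0, 100.0 };
    for (int i = 0; i < 4; i++)
        printf("%3.0f-MHz CPU -> about %2.0f-ns RAM\n",
               mhz[i], required_ram_ns(mhz[i]));
    return 0;
}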

Most PCs use dynamic RAM, which is cheaper but slower than static RAM. Static RAM is about ten times as expensive as dynamic RAM, takes up more physical space in a computer, and generates more heat. For those reasons, dynamic RAM is usually the primary RAM used in a PC. But dynamic RAM doesn't come much faster than about 65 ns, which is too slow for modern processor speeds. How can engineers design a machine with RAM that can keep up with the fastest CPUs?

The answer implemented most often is to use a small amount of the faster, more expensive static RAM and a much larger amount of the slower, cheaper dynamic RAM. The small amount of static memory is called an external cache. A chip called a cache controller looks into the future, guesses what data the CPU will soon need, and preloads that information into faster cache memory from the slower dynamic memory. Then, each time the CPU tries to read data from the memory, the cache controller checks to see if the data the CPU needs is in the cache. If it is, the cache controller zaps the data straight into the CPU, and as a result, the CPU only waits two clock cycles for the data to arrive.

If, on the other hand, the cache controller guessed wrong and the requested data is not in the cache, the cache controller will tell the CPU to twiddle its thumbs for a few extra clock cycles (known as wait states) while it goes through the time-consuming process of copying data from the slower dynamic memory into the faster static cache memory.
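
Here's a rough C sketch of that hit-or-miss decision, using a tiny direct-mapped cache. The eight-line cache, one-byte lines, and three-cycle miss penalty are invented numbers for illustration, not values from any real cache controller.

/* A tiny direct-mapped cache in C. Each address maps to exactly
   one cache line; a stored tag records which address the line
   currently holds. */

#include <stdio.h>

#define CACHE_LINES 8

struct line { int valid; unsigned tag; unsigned char data; };

struct line cache[CACHE_LINES];
unsigned char dram[1024];            /* the slow dynamic memory */

unsigned char cache_read(unsigned addr, int *cycles)
{
    unsigned index = addr % CACHE_LINES;
    unsigned tag   = addr / CACHE_LINES;

    if (cache[index].valid && cache[index].tag == tag) {
        *cycles += 2;                /* hit: the normal two cycles    */
    } else {
        *cycles += 2 + 3;            /* miss: add wait states         */
        cache[index].valid = 1;      /* copy from DRAM into the cache */
        cache[index].tag   = tag;
        cache[index].data  = dram[addr];
    }
    return cache[index].data;
}

int main(void)
{
    int cycles = 0;
    dram[100] = 42;
    cache_read(100, &cycles);                    /* miss: pays wait states */
    unsigned char v = cache_read(100, &cycles);  /* hit: two cycles        */
    printf("value %d, total cycles %d\n", v, cycles);
    return 0;
}

The first read misses and pays the wait states; the second finds the data already in the cache and takes only the normal two cycles.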

The 486 takes the process even further by incorporating a small amount of cache memory inside the microprocessor (the internal cache). Data can be fetched from this memory in one cycle rather than two. There is only 8K of processor cache on most 486s (the 486DX4 has 16K, and the Cyrix 486 replacement chip for 386 machines has only 1K). But that cache may have heavy demands placed upon it, particularly because the prefetch unit fetches instructions at the same time the execution unit may be accessing cache memory. Suppose the CPU were executing an instruction that involved a memory operation. On a 486, the prefetch unit must idle, waiting for the execution unit to yield access to memory so that the prefetch queue can retrieve the next instruction.

The Pentium improves upon that with two caches: an 8K cache for instructions and an 8K cache for data. A Pentium gets things done faster than a 486 of the same clock speed partly because its prefetch queue can almost always run in parallel with its execution unit.

Pipelining makes for a powerful solution to the memory access problem, but it's prone to a major hitch: program branching. The way I've described pipelining pretty much assumes that the CPU just plunks along in a linear fashion through RAM, going from the instruction at location X to the instruction at location X+1, and then X+2, and so on. But that's not always true. Programs very often jump from one location to another, a process called branching. (If you've ever used a GOTO, an IF/THEN/ELSE, a SELECT, a WHILE/WEND, a CALL, or a GOSUB in a BASIC program, you've caused your CPU to branch.) Branching is bad news because it essentially says to the pipeline, "Well, guys, I know you've been working hard at picking apart the next 64 bytes of instructions, but dump it all--we're moving someplace else, and we don't need those next 64 bytes." A branch forces a 486 or earlier processor to flush the pipeline and start gathering instructions all over again in a new section of the program.

The Pentium's answer is branch prediction. The Pentium features a look-ahead algorithm that sees branch instructions coming up in the pipeline (such as "If AX is greater than BX, then skip ahead 100 bytes"), guesses which way the branch will go, and starts fetching and decoding instructions at the place where it guesses the CPU will end up. This algorithm isn't always right, but it is right about 90 percent of the time.
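
Here's a small C sketch of one classic branch-prediction scheme, the two-bit saturating counter, which captures the flavor of the idea; it's a textbook technique, not necessarily the circuit Intel built.

/* A two-bit saturating counter. States 0-1 predict "not taken";
   states 2-3 predict "taken." Repeated outcomes push the counter
   toward one end, so a single surprise doesn't flip the prediction. */

#include <stdio.h>

static int counter = 2;              /* start out weakly "taken" */

int predict(void) { return counter >= 2; }

void update(int taken)
{
    if (taken  && counter < 3) counter++;
    if (!taken && counter > 0) counter--;
}

int main(void)
{
    /* A loop branch that is taken nine times, then falls through. */
    int history[10] = { 1, 1, 1, 1, 1, 1, 1, 1, 1, 0 };
    int correct = 0;

    for (int i = 0; i < 10; i++) {
        if (predict() == history[i]) correct++;
        update(history[i]);
    }
    printf("predicted %d of 10 branches correctly\n", correct);
    return 0;
}

For a loop branch that's taken nine times and then falls through, the predictor is right 9 times out of 10--much like the hit rate mentioned above.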

The Pentium initially sounds like a nonstarter, with clock speeds about the same as the fastest 486s. But if you take the time to look under the hood, you can see that Intel has pulled just about every possible trick to make the Pentium get your work done more quickly. You have to wonder what improvements are left for the Hexium.