Tony Duell wrote:
Hehe...
abracadabra.
All I can say is that it's a wonderful feeling when you finally
understand how a CPU works at this level. All the magic has gone away,
everything makes sense again.
Yeah, CPUs are really very simple. Here's my half-drunken rant about
them. Skip this if you know how CPUs work. :-)
Basically, a CPU fetches an opcode (instruction) and its data, decodes
the instruction, executes it, then increments its Program Counter
register (or changes it to a new value), and repeats. The ones with
microcode (or RISC) are the easiest to understand, although you don't
get to see the microcode.
They're all pretty similar in that they have a register called the PC
that points to the currently executing opcode (and an instruction
register, or IR, that holds the opcode itself).
RISC CPUs and microcoded CPUs use the bits in the opcode to turn on
circuits that do what the opcode says. With some CPUs you can almost
"see" the circuitry that does this just by looking at the opcode
tables. Granted, things can get complex if you're dealing with x86, but
even so, if you were able to see what the x86 chips really do underneath
the hidden decoding layer, each piece is simple. It's just that there
are many, many, many pieces, which makes for a complex device.
The hard part of assembly language is that it's too simple: each opcode
does something so tiny that it's hard to see how to program in it.
Once you figure this part out and get over it, it becomes easier.
So you can imagine a very simple machine like this:
A CPU is hooked up to a bunch of memory. The memory is addressable
linearly (on most CPUs), and some of it is ROM, because RAM loses its
data on power off.
The CPU has a register called the Program Counter (or PC). When you
power this machine on, it sets the PC to a specific value and starts
executing code. The PC is usually incremented automatically (that is,
made to point to the next instruction) each time an instruction is
fetched.
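
If it helps, here's a minimal sketch in C of that loop for a completely
made-up machine: a handful of invented opcodes, a tiny "ROM" holding a
program, and a PC that starts at 0 on power-on. It's only an
illustration of the fetch-decode-execute idea, not any real chip:

    /* Toy fetch-decode-execute loop; the opcodes and the machine are
     * invented purely for illustration. */
    #include <stdio.h>
    #include <stdint.h>

    enum { OP_HALT = 0, OP_LOAD_IMM = 1, OP_ADD = 2, OP_PRINT = 3 };

    int main(void) {
        uint8_t mem[16] = {             /* "ROM" with a tiny program */
            OP_LOAD_IMM, 0, 2,          /* r0 = 2                    */
            OP_LOAD_IMM, 1, 3,          /* r1 = 3                    */
            OP_ADD, 0, 1,               /* r0 = r0 + r1              */
            OP_PRINT, 0,                /* print r0                  */
            OP_HALT
        };
        uint8_t reg[2] = {0, 0};
        unsigned pc = 0;                /* reset: PC starts at 0     */

        for (;;) {
            uint8_t op = mem[pc];       /* fetch                     */
            switch (op) {               /* decode and execute        */
            case OP_LOAD_IMM: reg[mem[pc + 1]] = mem[pc + 2]; pc += 3; break;
            case OP_ADD:      reg[mem[pc + 1]] += reg[mem[pc + 2]]; pc += 3; break;
            case OP_PRINT:    printf("%d\n", reg[mem[pc + 1]]); pc += 2; break;
            case OP_HALT:     return 0;
            }
        }
    }

A real CPU does the same thing, just in silicon and with far more
opcodes and registers.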
CPUs have a bunch of other registers that they use to do work with. In
CISC CPUs (Complex Instruction Set Computers) you see very few
registers, which makes them inefficient. RISC CPUs (Reduced Instruction
Set Computers) typically have very few instructions (except for PowerPC)
and a lot of registers. Some do fancy tricks with their registers
(SPARC, for instance, uses register windows). CISC CPUs normally don't
execute the opcode they fetched directly. Instead, they have a built-in
ROM containing code called microcode. You can think of RISC chips as
having the microcode as their native language (this isn't exactly true,
but it's a good analogy).
Registers are simply sets of flip-flops (for example, a 64-bit register
has 64 flip-flops, one per bit) that make up a single variable, and
they're used for calculations, indexes, status, and so forth.
To execute, the CPU fetches a single instruction from memory, then does
what that instruction says. Some CPUs use a fixed-size opcode; this
means that every instruction (opcode) is the same size. The opcode also
contains information on what is being accessed: a source register and a
destination register are typical, although some will have addressing
modes, or memory addresses, inside the opcode.
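
As a made-up illustration of a fixed-size opcode, imagine a 16-bit
instruction where the top four bits are the operation, the next four the
destination register, and the four after that the source register (this
layout is invented, not taken from any real architecture). Decoding the
fields is just shifting and masking:

    #include <stdint.h>

    /* Hypothetical 16-bit instruction layout:
     * [15..12] operation  [11..8] dest reg  [7..4] src reg  [3..0] unused
     * e.g. 0x2310 would mean: operation 2, dest r3, src r1. */
    static unsigned op_field(uint16_t instr)  { return (instr >> 12) & 0xF; }
    static unsigned dst_field(uint16_t instr) { return (instr >> 8)  & 0xF; }
    static unsigned src_field(uint16_t instr) { return (instr >> 4)  & 0xF; }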
The opcode and its operands (the information that says which registers
to work on) are typically stored in special registers inside the CPU
that aren't accessible to the programmer. On microcoded CPUs there are
also registers that drive the memory bus and say which address should
be accessed.
Typically, the bits inside the instruction (microcode or RISC opcode)
light up various circuits in the CPU. For example, an opcode that does
an addition will turn on the circuits in the Arithmetic and Logic Unit
that enable an ADD to occur, and they'll also load the two input
registers and the one output register. In a lot of CPUs the output
register is one of the inputs.
Other instructions can light up the barrel shifter, which can do bit
rotates or shifts in either direction. Still others light up circuits
that fetch or store data from memory, copy one register to another, and
so forth, or access the arithmetic and logic unit (ALU), the floating
point unit (FPU), a vector unit (AltiVec, for example), etc.
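
In software you'd model that "lighting up" of circuits as a switch on
the decoded operation; in hardware the opcode bits enable the adder, the
shifter and so on more or less directly. A rough C sketch, with
invented operation names:

    #include <stdint.h>

    enum alu_op { ALU_ADD, ALU_SUB, ALU_AND, ALU_SHL, ALU_ROR };

    /* One ALU "cycle": pick which piece of circuitry does the work. */
    static uint32_t alu(enum alu_op op, uint32_t a, uint32_t b) {
        uint32_t s = b & 31;
        switch (op) {
        case ALU_ADD: return a + b;
        case ALU_SUB: return a - b;
        case ALU_AND: return a & b;
        case ALU_SHL: return a << s;                    /* barrel shifter */
        case ALU_ROR: return s ? (a >> s) | (a << (32 - s)) : a;
        }
        return 0;
    }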
In more modern CPUs, you typically have multiple instructions in
flight. This is because the fetch-decode-execute cycle (which can be
split up into even finer stages) would otherwise only keep one circuit
lit up at a time. So, in order to be more efficient, you can have one
instruction in the fetch stage, another in the decode stage, and another
in the execute stage.
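
A crude way to picture a three-stage pipeline is as three slots that
shift along every clock: while one instruction executes, the next one is
being decoded and the one after that is being fetched. A toy C sketch
(the stages and the "program" are made up):

    #include <stdio.h>

    int main(void) {
        const char *prog[] = {"I1", "I2", "I3", "I4", "I5"};
        const char *fetch = "-", *decode = "-", *execute = "-";
        int next = 0, total = 5;

        /* Each loop iteration is one clock cycle: instructions move one
         * stage along, and a new one is fetched behind them. */
        for (int cycle = 1; cycle <= total + 2; cycle++) {
            execute = decode;
            decode  = fetch;
            fetch   = (next < total) ? prog[next++] : "-";
            printf("cycle %d: fetch=%s decode=%s execute=%s\n",
                   cycle, fetch, decode, execute);
        }
        return 0;
    }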
More modern CPUs have several FPUs, ALUs and vector units, and can
issue several instructions at the same time to each of them, if the
compiler did a good job of scheduling them.
Even more fun: CPUs are far, far faster than the memory they're
attached to, because that memory is dynamic RAM, which is cheaper but
slower. You could use static RAM (SRAM), but that is far more
expensive. Most of the time, when the CPU reads from address X, it will
read from X+1 next, then X+2, then X+3, and so forth. Because of this
you can predict the next memory access, so you can use some SRAM to
pre-fetch a lot of the memory. This is called caching, and it can get
very complicated if you want it to be effective.
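
You can see the effect of that predictability from ordinary code:
walking an array sequentially keeps the cache and prefetcher happy,
while striding through it defeats them. A sketch (an illustration, not
a real benchmark):

    #include <stddef.h>

    #define N (1 << 20)

    /* Sequential walk: each access is at X+1, X+2, ... so the cache and
     * prefetcher can stay ahead of the CPU. */
    long sum_sequential(const int *a) {
        long s = 0;
        for (size_t i = 0; i < N; i++)
            s += a[i];
        return s;
    }

    /* Strided walk: touches a new cache line on almost every access, so
     * the CPU spends much more time waiting on DRAM. */
    long sum_strided(const int *a, size_t stride) {
        long s = 0;
        for (size_t start = 0; start < stride; start++)
            for (size_t i = start; i < N; i += stride)
                s += a[i];
        return s;
    }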
Modern CPUs often have two (or more) cores on one chip. These
typically share a cache, so that you don't get inconsistencies, and for
multithreaded code they run your programs a lot faster. If all you're
going to run is one single-threaded application at a time, you won't see
a difference, but if you run multithreaded code, or many
applications/services at once, you will notice the difference.
None of the above is very exciting on its own, since we need some way
to interact with the machine, which means I/O. Some CPUs have special
I/O opcodes; others use the memory bus to map I/O devices as if they
were memory. Since computers are thousands to billions of times faster
than humans, it doesn't make sense for them to waste their power
checking to see whether the end user has pressed a key. You could
program them to periodically check whether a key is pressed - that's
called polled I/O - but it's very inefficient.
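
Polled I/O looks roughly like this in C. The keyboard register
addresses and the ready bit here are completely made up; on a real
machine they'd come out of the hardware manual:

    #include <stdint.h>

    /* Hypothetical memory-mapped keyboard: a status register and a data
     * register at made-up addresses. */
    #define KBD_STATUS (*(volatile uint8_t *)0xFFFF0000u)
    #define KBD_DATA   (*(volatile uint8_t *)0xFFFF0001u)
    #define KBD_READY  0x01u

    uint8_t read_key_polled(void) {
        /* Spin, burning CPU time, until the device says a key is ready. */
        while ((KBD_STATUS & KBD_READY) == 0)
            ;
        return KBD_DATA;
    }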
Instead, I/O devices have a way to signal the CPU. These signals are
called interrupts, and there are typically several of them - one per
device - although in some systems IRQs (Interrupt ReQuests) are shared,
and then the CPU has to poll all the devices tied to that IRQ number to
figure out which one to deal with. IRQs are just signals. Just as you,
in real life, have to stop what you're doing and answer when the phone
rings, or the door bell, or an alarm clock, the CPU does exactly the
same. It saves its state on the stack, handles whatever device needs to
be dealt with, and then reloads its state from the stack and resumes the
job it was working on.
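
In C, an interrupt handler is just a small function the hardware (or the
OS) arranges to jump to; the register save and restore happens in the
hardware and/or in compiler-generated code around it. A hand-wavy
sketch, again using a made-up device register:

    #include <stdint.h>

    #define KBD_DATA (*(volatile uint8_t *)0xFFFF0001u)   /* made up */

    static volatile uint8_t last_key;

    /* On a real system this would be registered in the interrupt vector
     * table and usually marked with a compiler-specific "interrupt"
     * attribute, so the right registers get saved and an
     * interrupt-return instruction is used on the way out. */
    void keyboard_isr(void) {
        last_key = KBD_DATA;    /* service the device                 */
    }                           /* state restored, main code resumes  */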
IRQs have priorities. For example, if the phone rings at the same time
the smoke alarm goes off because you left something on the stove too
long, you need to handle the stove before you deal with the phone. So
does the CPU. Some devices are more critical than others in that
they're time sensitive, so they have a higher priority than others.
When the CPU is taking care of an important task, it can choose to
ignore interrupts. This is sort of like you shutting off your phone and
getting on with your work. Of course, some interrupts are non-maskable
interrupts (NMIs) - the smoke alarm, for example.
There are other things that happen inside a computer that can be
optimized. For instance, say a network packet comes in. An interrupt
is generated, but because the CPU can only deal with one unit at a time
- typically a byte or a word - it has to sit there and grab roughly
1500 bytes, one at a time, off the I/O bus and copy them into memory.
This is of course inefficient. So more advanced devices have something
called Direct Memory Access (DMA). They can transfer their data to
memory without bothering the CPU - until the transfer is done.
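
Programming a DMA transfer usually boils down to telling the controller
where the data should land, how much of it there is, and then waiting
for an interrupt to say it's finished. The register names and addresses
below are invented for illustration:

    #include <stdint.h>

    /* Hypothetical DMA controller registers (addresses are made up). */
    #define DMA_DST   (*(volatile uint32_t *)0xFFFF1000u)  /* where in RAM   */
    #define DMA_COUNT (*(volatile uint32_t *)0xFFFF1004u)  /* how many bytes */
    #define DMA_CTRL  (*(volatile uint32_t *)0xFFFF1008u)  /* go / status    */
    #define DMA_START 0x1u

    void receive_packet_dma(void *buffer, uint32_t length) {
        DMA_DST   = (uint32_t)(uintptr_t)buffer;
        DMA_COUNT = length;
        DMA_CTRL  = DMA_START;
        /* The CPU is now free to do other work; the controller raises an
         * interrupt once the whole packet has landed in memory. */
    }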
I mentioned something called the stack. This is simply a bunch of
memory that's accessed last-in-first-out, or LIFO. There's a special
register called the Stack Pointer, or SP. The stack (on most machines)
grows downwards. That is, it starts at some high address, and as data
is pushed onto the stack, the SP gets decremented - typically by 4 on a
32-bit machine, or 8 on a 64-bit machine. When data is popped (aka
pulled) from the stack, the SP gets incremented. When an interrupt is
triggered, the current state is saved by pushing the critical registers
onto the stack. These are usually the status register, the PC, and
perhaps other registers.
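
A quick C sketch of a downward-growing stack makes the bookkeeping
obvious; I've assumed a 32-bit word here, so the SP moves by 4 on every
push and pop:

    #include <stdint.h>
    #include <string.h>

    static uint8_t  memory[1024];
    static uint32_t sp = sizeof memory;   /* stack starts at the top      */

    void push(uint32_t value) {
        sp -= 4;                          /* grows downwards: SP -= 4     */
        memcpy(&memory[sp], &value, 4);
    }

    uint32_t pop(void) {
        uint32_t value;
        memcpy(&value, &memory[sp], 4);
        sp += 4;                          /* popping moves the SP back up */
        return value;
    }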
When the Interrupt Service Routine (ISR) - a bit of code that's part of
the operating system or part of that device's driver - is done, the
state is restored by being popped off the stack.
The stack is also what allows you to have functions that call other
functions. SPARC CPUs do wonderful things with their registers for
function calls - they use a set of registers called register windows.
Remember when, earlier on, I said that access to memory is a lot slower
than the CPU? SPARCs avoid some of that cost by rotating some of their
registers. In one function, a set of registers labeled "out" is used to
hold the parameters for the next function call. When that function is
called, those "out" registers become the called function's "in"
registers. This saves the parameters from being written to the stack
and then pulled off the stack again, thus speeding things up a lot.
If you trace code with a debugger, you'll see that function calls are
very expensive!
When you have, say, 2 CPUs in your machine, each running at 1GHz, you
don't really get the equivalent of a 2GHz CPU. It's more like a
1.5-1.8GHz machine. This is because the CPUs have to fight over the
various busses (memory, I/O) to get access to resources, and some
resources can't safely be accessed by two CPUs at once, so semaphores -
or locks - are needed to prevent data corruption or cache coherency
issues.
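
That fighting over shared data is why multiprocessor code ends up full
of locks. Here's a minimal sketch using POSIX threads, where the shared
counter just stands in for any shared resource:

    #include <pthread.h>

    static long shared_counter;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* Without the lock, two CPUs could both read the same old value,
     * both add one, and one of the increments would be lost. */
    void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&lock);
            shared_counter++;
            pthread_mutex_unlock(&lock);
        }
        return 0;
    }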
Dual-core CPUs sort of alleviate some, but by no means all, of these
issues, because they're at least able to share their caches.
Since I've digressed to modern CPUs: things like Intel chips (and AMD,
and Transmeta chips) have to waste a lot of circuitry dealing with
backwards compatibility all the way back to the 8086. Because of this,
they tend to be a lot less efficient than the competition, even though
they're a lot more popular. Even with the introduction of 32-bit CPUs,
Intel and its clones have a limited set of registers.
Remember that access to a register is a lot faster than access to
memory! So the more registers you have and can efficiently make use of,
the faster your code runs. In order to battle this, Intel chips have to
do some insane stuff like register renaming: the hardware notices that a
register, say EAX, gets reused for an unrelated value later in the
pipeline, so the first use can be assigned to one internal register and
the later use to another internal register, and all that code can be
executed at the same time.
Transmeta did something more interesting. They actually had something
like a 64- or 128-bit VLIW core whose built-in software decoded and
recompiled Intel code into an internal format that would run much
faster - well, in theory anyway. This is exactly the same way modern
Java JIT (Just-In-Time compiler) JVMs work. They translate one set of
instructions into another (hence the name Transmeta?)
Because of the overhead in Intel chips for backwards compatibility,
pure RISC chips such as SPARC, ARM and HP's PA-RISC are a lot faster and
run cooler. They don't need all the overhead of backwards
compatibility. Intel tried to get away from this mess with Itanium, but
it didn't work out too well: AMD extended the IA-32 (Intel Architecture,
32-bit) instruction set to 64 bits and won some momentum, so Intel had
to do the same. Sad, that. I don't recall exactly, but I believe
Itanium had elements from PA-RISC and register windows from SPARC.
I'm not sure exactly which chips use the Very Long Instruction Word
(VLIW) approach, but it basically allows a CPU to execute multiple
instructions at the same time by dispatching a bundle of instructions
to many parts of the same chip at once.
As an aside, I'd like to also mention AT&T's Hobbit CPU which was not
quite a RISC and not quite a CISC, but had some interesting features.
As far as I know the Hobbit was used in two machines: AT&T's EO, which
competed with and was killed by Apple's Newton, and the prototype
BeBoxes. The Hobbit was designed to speed up C code. Instead of
internal registers, it used part of its cache to store its registers on
the stack. That way your registers were offsets from the stack pointer,
so your code ran faster. It's not considered a RISC chip because it
didn't have a fixed-size opcode, but it's certainly a weird and
interesting little chip.
Assembly and machine code aren't very difficult. They're actually
quite simple. Too simple. Imagine that you tell your dog "go fetch" and
throw a ball across the back yard. From a high-level language, you
might have a library call that does this for you, or maybe you code it
yourself. In assembly, you have to instruct the dog (CPU) at each step.
Instructions like: look ahead of you; do you see the ball? Is it in
front of you? If so, move your left front leg this way, move your right
front leg that way. Else, look a little to the left; is the ball ahead
of you now? Look a little more to the left, and so forth.
Some CPUs are a pleasure to write assembly for. These include the
6502, the Motorola 68000, and the SPARCs. Some are a horrible
nightmare: the Intel chips, for example, and to a far, far lesser
degree, MIPS. It's all a matter of "taste" and experience. But if you
want to code for Intel, I highly recommend that you don't look at 68000
code or SPARC code, because once you see how easy those are compared to
Intel, you'll want to cry that you now have to code in Intel assembly.
:-)