How a CPU Executes Your Code (The Short Version)
You write return a + b. Somewhere between that and a result appearing on screen, a few billion transistors do something. Most engineers never think about what, exactly - and most of the time, they don’t need to. But the mental model matters. It explains why some code is fast and some is slow, why compilers make the choices they do, and why certain bugs are genuinely mysterious until you understand what’s underneath.
This is the short version.
The gap between source code and the machine
Your source code is not what runs. There is a chain of translation between what you write and what the CPU sees.
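A compiler (or a JIT, at run time) takes your source code and produces machine code - sequences of binary instructions specific to the CPU architecture. (An interpreter never translates your program this way, but the interpreter itself is machine code, so the same picture applies one level down.) On x86-64, adding two integers looks something like this: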
mov eax, [rbp-8] ; load variable 'a' into register eax
add eax, [rbp-4] ; add variable 'b' to eax
ret ; return (result is in eax)
This assembly is still human-readable. What actually runs is the binary encoding of these instructions - a sequence of bytes like 8B 45 F8 03 45 FC C3. The CPU reads those bytes directly.
The point: by the time your code runs, the CPU has no idea what language you wrote it in. It sees instructions and data. Nothing else.
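To see this translation yourself, here is a minimal C++ sketch of the kind of function behind the assembly above. The name add and the file name add.cpp are illustrative choices. The exact instructions you get will vary with compiler, flags, and calling convention, so treat the listing above as one plausible outcome rather than the guaranteed one.

```cpp
// add.cpp - a hypothetical one-line function to inspect.
// To emit assembly instead of an object file:  g++ -O0 -S add.cpp
// (-S is supported by both GCC and Clang; the output goes to add.s).
int add(int a, int b) {
    // On x86-64 at -O0 this commonly becomes loads from the stack,
    // an add, and a ret. With optimization it is often a single lea plus ret.
    return a + b;
}
```

Comparing the -O0 and -O2 output for the same source is a quick way to see how much the compiler is free to rearrange.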
What a CPU actually is
At the level that matters for this mental model, a CPU has three parts:
Registers - tiny, extremely fast storage slots built directly into the processor. On x86-64 there are 16 general-purpose registers (rax, rbx, rcx, and so on) plus special-purpose ones. There is no “variable” concept at this level. Almost everything the CPU computes passes through a register; x86 lets an instruction name a memory operand directly, as the add above does, but the value is still pulled into the core before any work happens on it.
The ALU (Arithmetic Logic Unit) - the part that does the actual math: addition, subtraction, bitwise operations, comparisons. It takes inputs from registers and puts results back into registers.
The control unit - the part that drives the whole process: fetches the next instruction, decodes what it means, and routes data to the right place.
The fetch-decode-execute cycle
Every instruction goes through the same three steps, over and over, billions of times per second.
Fetch - The control unit reads the next instruction from memory. It knows where to look because of the program counter (also called the instruction pointer) - a register that holds the address of the next instruction to fetch. After every fetch, the program counter advances past the instruction it just read.
Decode - The CPU figures out what the instruction means. The binary bytes are parsed into an operation (add, move, compare, jump) and its operands (which registers or memory locations to use).
Execute - The ALU does the work, or data is moved between registers and memory, or the program counter is changed (for branches and function calls). The result is written to its destination.
Then it starts again.
At clock speeds of three to five gigahertz, a modern CPU does this billions of times per second. Per core.
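The loop is easier to picture as code. Below is a minimal sketch of a toy machine in C++. The three-instruction set (LoadImm, Add, Halt), the four registers, and every name here are invented for illustration and do not correspond to any real CPU. It performs no bounds checking on register indices.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical instruction set, invented for this example only.
enum class Op : std::uint8_t { LoadImm, Add, Halt };

struct Instr {
    Op op;
    std::uint8_t dst;  // destination register index (0..3)
    std::int32_t src;  // immediate for LoadImm, source register index for Add
};

struct ToyCpu {
    std::array<std::int32_t, 4> regs{};  // four general-purpose registers
    std::size_t pc = 0;                  // program counter: index of the next instruction
};

// Runs until Halt, or until pc runs off the end of the program.
// Returns the final machine state.
ToyCpu run(const std::vector<Instr>& program) {
    ToyCpu cpu;
    while (cpu.pc < program.size()) {
        const Instr instr = program[cpu.pc];  // fetch
        cpu.pc += 1;                          // advance the program counter
        switch (instr.op) {                   // decode
            case Op::LoadImm:                 // execute: write a constant
                cpu.regs[instr.dst] = instr.src;
                break;
            case Op::Add:                     // execute: the "ALU" adds two registers
                cpu.regs[instr.dst] += cpu.regs[static_cast<std::size_t>(instr.src)];
                break;
            case Op::Halt:
                return cpu;
        }
    }
    return cpu;
}
```

Running the program LoadImm r0, 2; LoadImm r1, 3; Add r0, r1; Halt through run leaves 5 in register 0. A real CPU does the same job in hardware, with the decode step wired into circuitry rather than written as a switch.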
Where memory fits in
Registers are fast because they sit inside the core, right next to the execution units - access takes about a single clock cycle. Main memory (RAM) is orders of magnitude slower: a read can take hundreds of cycles.
The CPU bridges this with a cache hierarchy. L1 cache is small and lives closest to the execution units - access takes ~4 cycles. L2 is larger and slower (~12 cycles). L3 is larger still (~40 cycles). Main memory is the fallback (~200+ cycles).
When you access a variable, the CPU first checks L1. If it’s not there (a cache miss), it checks L2, then L3, then RAM. This is why access patterns matter. Iterating over an array sequentially is fast - the CPU prefetches ahead and most accesses hit cache. Jumping around memory randomly is slow - once the data is larger than the caches, every access is likely to miss.
This is also why keeping the data a hot loop touches compact matters more than you might expect. Often the bottleneck is not the number of instructions - it’s whether the data is in cache when the instruction needs it.
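A common way to feel the difference is to sum the same matrix in two traversal orders. The sketch below assumes a matrix stored row-major in one contiguous std::vector. The function names are illustrative. Both functions return the same sum; only the order of memory accesses differs.

```cpp
#include <cstddef>
#include <vector>

// Sums a rows x cols matrix stored row-major in one contiguous vector.
// Assumes m.size() == rows * cols.
// The inner loop walks adjacent addresses, which suits caches and prefetchers.
long long sum_row_major(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    long long sum = 0;
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            sum += m[r * cols + c];
        }
    }
    return sum;
}

// Same result, but the inner loop jumps cols elements on every step.
// For large matrices, consecutive accesses tend to land on different cache lines.
long long sum_col_major(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    long long sum = 0;
    for (std::size_t c = 0; c < cols; ++c) {
        for (std::size_t r = 0; r < rows; ++r) {
            sum += m[r * cols + c];
        }
    }
    return sum;
}
```

Timed on a matrix much larger than the caches, the row-major version is usually the faster one. The size of the gap depends on the hardware and on what the compiler does with the loops, so it is worth measuring rather than assuming.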
Branches and prediction
Not all code runs in sequence. if statements, loops, and function calls all change the program counter - they compile down to branch instructions.
A modern CPU doesn’t wait to resolve a branch before continuing. It predicts which way the branch will go and starts executing ahead. If the prediction is right, work that would have been wasted waiting is already done. If the prediction is wrong, the CPU has to throw away that speculative work and restart from the correct path - a branch misprediction penalty of 10–20 cycles.
Branch predictors are surprisingly good. A loop that runs 100 times and then exits will typically be mispredicted only once, at the exit. The cost shows up in unpredictable patterns - like iterating over data where the branch depends on values the predictor can’t anticipate.
This is the real reason code like this:
// Assumes data is an int array of length N and sum is a zero-initialized accumulator.
// Slower: branch outcome depends on unsorted data
for (int i = 0; i < N; i++) {
    if (data[i] > 128) sum += data[i];
}
// Faster: branch is now predictable after the sort (std::sort is in <algorithm>)
std::sort(data, data + N);
for (int i = 0; i < N; i++) {
    if (data[i] > 128) sum += data[i];
}
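can be meaningfully faster when the data is sorted - not because sorting is free, but because when you run the loop many times, predictable branches cost almost nothing. (An optimizing compiler may replace the branch with a conditional move or vector instructions, in which case the gap shrinks or disappears.)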
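If you want to measure this rather than take it on faith, a small harness is enough. The sketch below wraps the loop in a function and times it with std::chrono. The names sum_above, make_data, and time_ms, the fixed seed, and the value range 0 to 255 are all assumptions made for the example.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// The loop from the text, wrapped in a function (hypothetical name).
std::int64_t sum_above(const std::vector<int>& data, int threshold) {
    std::int64_t sum = 0;
    for (int x : data) {
        if (x > threshold) sum += x;  // the branch under test
    }
    return sum;
}

// Fills a vector with n pseudo-random values in [0, 255] using a fixed seed,
// so repeated runs see the same data.
std::vector<int> make_data(std::size_t n) {
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> dist(0, 255);
    std::vector<int> data(n);
    for (int& x : data) x = dist(rng);
    return data;
}

// Returns wall-clock milliseconds spent calling sum_above `repeats` times.
// Results are accumulated into `sink` so the compiler is less likely to
// discard the work as unused.
double time_ms(const std::vector<int>& data, int repeats, std::int64_t& sink) {
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i) sink += sum_above(data, 128);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

To compare, call time_ms once on the output of make_data and once on a copy passed through std::sort, with the same repeats. Try both a low and a high optimization level: if the compiler has emitted branchless code, the two timings will be close, which is itself a useful thing to learn.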
Function calls are not free
A function call is a branch with bookkeeping. Around the jump to the function’s code, the compiled program has to:
- Place arguments in argument registers or on the stack
- Save any registers whose values must survive the call
- Save the return address (so it knows where to resume)
- Jump to the function (on x86, the call instruction does this and the previous step together)
On return, it does the reverse. The bookkeeping lives in a stack frame, one per active call, on the stack - a region of memory that tracks the state of each active function.
The stack itself is usually in cache. But the overhead exists, and it’s why compilers aggressively inline short functions - replacing the call with a copy of the function body, eliminating the overhead entirely.
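Inlining is easy to show by hand. In this sketch, square is a hypothetical helper, and the second function is what the first one effectively becomes once the call is replaced by the helper's body. Compilers usually make this substitution on their own at normal optimization levels, so writing the second form manually is rarely necessary.

```cpp
#include <vector>

// A short helper: the kind of function compilers tend to inline.
int square(int x) { return x * x; }

// Calls the helper once per element. Without inlining, each iteration
// pays for argument setup, the call, and the return.
long long sum_squares_call(const std::vector<int>& v) {
    long long total = 0;
    for (int x : v) total += square(x);
    return total;
}

// The same loop with the helper's body substituted in by hand.
// No call remains, so there is no call overhead to pay.
long long sum_squares_inlined(const std::vector<int>& v) {
    long long total = 0;
    for (int x : v) total += x * x;
    return total;
}
```

Both functions compute the same value. The difference is only in the instructions executed per iteration, and only if the compiler did not already inline the call.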
Where this model simplifies reality
Everything above is accurate as a mental model for reasoning about performance. It’s not a complete picture of what modern CPUs do, and knowing where it simplifies matters when you want to go deeper.
The fetch-decode-execute cycle is serial in description, not in practice. Modern CPUs are deeply pipelined - while one instruction is executing, the next is being decoded, and the one after that is being fetched. A modern out-of-order core can have well over a hundred instructions in flight simultaneously, executing them in whatever order avoids stalls, then committing results in program order. The CPU you picture as “doing one thing at a time” is actually juggling far more than that.
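One place this shows up in ordinary code is dependency chains. In the sketch below (function names are illustrative), the first loop forms one long chain where each addition waits on the previous one. The second keeps four independent running sums that an out-of-order core can overlap. This is a sketch of the idea, not a tuning recommendation: compilers sometimes do this transformation themselves, and with floating point the changed order of additions can alter the last bits of the result.

```cpp
#include <cstddef>
#include <vector>

// One accumulator: every addition depends on the one before it.
double sum_single(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v) s += x;
    return s;
}

// Four accumulators: four independent chains that can be in flight at once.
double sum_four_way(const std::vector<double>& v) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    const std::size_t n = v.size();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; ++i) s0 += v[i];  // leftover elements when n is not a multiple of 4
    return (s0 + s1) + (s2 + s3);
}
```

Whether the second version is faster depends on the core, the compiler, and the flags, so it is something to measure on the target machine.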
Branch prediction is more sophisticated than a simple “guess.” Modern predictors maintain tables of branch history and can recognize complex patterns. They’re correct over 99% of the time on well-behaved code. The mental model of “predict and speculate, roll back on miss” is right, but the prediction mechanism is much more involved - and it’s also the source of vulnerabilities like Spectre, where speculative execution can be exploited to leak memory across security boundaries.
Registers are more numerous and managed differently than the mental model suggests. x86-64 exposes 16 named general-purpose registers (rax, rbx, etc.), but modern CPUs have well over a hundred physical registers internally and use “register renaming” to eliminate false dependencies between instructions. What looks like a dependency on the register name may not be a real dependency at all.
Multicore and hyperthreading. The mental model describes one core. A modern CPU has anywhere from a handful of cores to dozens (more on server parts), each running independent instruction streams. Hyperthreading (Intel’s name for simultaneous multithreading) runs two logical threads on one physical core by sharing its execution units. Shared caches and cache coherence protocols mean that what happens in one core’s L1 cache affects what other cores see, and when they see it. This is where concurrency bugs live.
The simplified model - registers, ALU, fetch-decode-execute, cache hierarchy, branch prediction - gives you the vocabulary to reason about single-threaded performance. It’s the right starting point. The complexities above are what you reach for when the simple model doesn’t explain what you’re observing.
What this changes about how you read code
None of this means you should write assembly or obsess over every instruction. Compilers are remarkably good at translating high-level code into efficient machine code, and most performance problems live much higher up in the stack.
But the mental model pays off in specific situations:
- When a tight loop is slower than expected, think: is this hitting cache? Is there an unpredictable branch?
- When a profiler points at something that looks trivial, think: is this function being called in a hot path where the call overhead adds up?
- When you read about compiler optimizations - loop unrolling, inlining, autovectorization - you now have a reason for them, not just a name.
The CPU is not magic. It’s a very fast machine that does very simple things, very many times, with a few clever tricks layered on top. Once you see it that way, a lot of things that seemed mysterious stop being mysterious.