How a CPU Executes Your Code (The Short Version)


You write return a + b. Somewhere between that and a result appearing on screen, a few billion transistors do something. Most engineers never think about what, exactly - and most of the time, they don’t need to. But the mental model matters. It explains why some code is fast and some is slow, why compilers make the choices they do, and why certain bugs are genuinely mysterious until you understand what’s underneath.

This is the short version.

The gap between source code and the machine

Your source code is not what runs. There is a chain of translation between what you write and what the CPU sees.

A compiler (or interpreter, or JIT) takes your source code and produces machine code - sequences of binary instructions specific to the CPU architecture. On x86-64, adding two integers looks something like this:

mov eax, [rbp-8]    ; load variable 'a' into register eax
add eax, [rbp-4]    ; add variable 'b' to eax
ret                 ; return (result is in eax)

This assembly is still human-readable. What actually runs is the binary encoding of these instructions - a sequence of bytes like 8B 45 F8 03 45 FC C3. The CPU reads those bytes directly.
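For orientation, here is the kind of function that compiles to the assembly above - a sketch assuming an unoptimized (-O0) build, which is what keeps the variables in the stack frame at [rbp-8] and [rbp-4] (exact offsets vary by compiler). The function name `add` is invented for illustration.

```cpp
// At -O0, a compiler keeps 'a' and 'b' in the stack frame,
// which is why the instructions above load from [rbp-8] and [rbp-4].
int add(int a, int b) {
    return a + b;
}
```

With optimization enabled, the compiler would typically skip the stack entirely and add the argument registers directly - on x86-64, something like a single lea eax, [rdi+rsi].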

The point: by the time your code runs, the CPU has no idea what language you wrote it in. It sees instructions and data. Nothing else.

What a CPU actually is

At the level that matters for this mental model, a CPU has three parts:

Registers - tiny, extremely fast storage slots built directly into the processor. A modern 64-bit x86 CPU has sixteen general-purpose registers (rax, rbx, rcx, rdx, rsi, rdi, rbp, rsp, and r8 through r15) plus special-purpose ones. There is no “variable” concept at this level. Everything that the CPU works on must be in a register first.

The ALU (Arithmetic Logic Unit) - the part that does the actual math: addition, subtraction, bitwise operations, comparisons. It takes inputs from registers and puts results back into registers.

The control unit - the part that drives the whole process: fetches the next instruction, decodes what it means, and routes data to the right place.

The fetch-decode-execute cycle

Every instruction goes through the same three steps, over and over, billions of times per second.

Fetch - The control unit reads the next instruction from memory. It knows where to look because of the program counter (also called the instruction pointer) - a register that holds the address of the current instruction. After every fetch, the program counter advances.

Decode - The CPU figures out what the instruction means. The binary bytes are parsed into an operation (add, move, compare, jump) and its operands (which registers or memory locations to use).

Execute - The ALU does the work, or data is moved between registers and memory, or the program counter is changed (for branches and function calls). The result is written to its destination.

Then it starts again.

A modern CPU runs this cycle three to five billion times per second. Per core. (And it overlaps the steps: while one instruction executes, the next is being decoded and a third fetched - that’s pipelining.)
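The loop is concrete enough to mimic in software. Below is a toy machine running the same fetch-decode-execute cycle - the four-instruction set (LOADI, ADD, JMPNZ, HALT) is invented for illustration, not real x86 opcodes:

```cpp
#include <cstdint>
#include <vector>

// Invented opcodes for a toy machine (not real x86).
enum Op : uint8_t { LOADI, ADD, JMPNZ, HALT };

struct Instr { Op op; int a, b, c; };  // opcode plus up to three operands

// Runs the fetch-decode-execute loop over 'program'.
// Returns the contents of register 0 when HALT is reached.
int run(const std::vector<Instr>& program) {
    int reg[4] = {0, 0, 0, 0};
    int pc = 0;                                            // program counter
    while (true) {
        Instr in = program[pc];                            // fetch
        pc++;                                              // advance the program counter
        switch (in.op) {                                   // decode...
            case LOADI: reg[in.a] = in.b;                  break;  // ...and execute
            case ADD:   reg[in.a] = reg[in.b] + reg[in.c]; break;
            case JMPNZ: if (reg[in.a] != 0) pc = in.b;     break;  // a branch: rewrite pc
            case HALT:  return reg[0];
        }
    }
}
```

A seven-instruction program - load 0 and 5, set a decrement of -1, then repeatedly add and count down until the counter hits zero - returns 15 through this loop. The JMPNZ at the end is exactly the kind of branch discussed below.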

Where memory fits in

Registers are fast because they’re on the processor die itself - access takes a single clock cycle. Main memory (RAM) is orders of magnitude slower. Reading from RAM can take hundreds of cycles.

The CPU bridges this with a cache hierarchy. L1 cache is small and lives closest to the execution units - access takes ~4 cycles. L2 is larger and slower (~12 cycles). L3 is larger still (~40 cycles). Main memory is the fallback (~200+ cycles).

When you access a variable, the CPU first checks L1. If it’s not there (a cache miss), it checks L2, then L3, then RAM. This is why access patterns matter. Iterating over an array sequentially is fast - the CPU prefetches ahead and everything hits cache. Jumping around memory randomly is slow - every access is likely to miss.

This is also why keeping frequently used data in hot loops compact matters more than you might expect. It’s not about the number of instructions - it’s about whether the data is in cache when the instruction needs it.
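One way to see this on your own machine - a sketch, where sum_in_order and time_us are helpers invented here - is to sum the same array twice, once in sequential order and once through a shuffled copy of the same indices:

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Sums 'data', visiting elements in the order given by 'order'.
// With a sequential order, the hardware prefetcher keeps nearly
// every access in cache; with a shuffled order, most accesses miss.
long long sum_in_order(const std::vector<int>& data,
                       const std::vector<std::size_t>& order) {
    long long sum = 0;
    for (std::size_t i : order) sum += data[i];
    return sum;
}

// Measures one call in microseconds.
template <typename F>
long long time_us(F f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}
```

Over a few million elements, the shuffled traversal is typically several times slower - even though both calls perform identical additions and return identical sums. Only the access pattern differs.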

Branches and prediction

Not all code runs in sequence. if statements, loops, and function calls all change the program counter - they’re branch instructions.

A modern CPU doesn’t wait to resolve a branch before continuing. It predicts which way the branch will go and starts executing ahead. If the prediction is right, work that would have been wasted waiting is already done. If the prediction is wrong, the CPU has to throw away that speculative work and restart from the correct path - a branch misprediction penalty of 10–20 cycles.

Branch predictors are surprisingly good. A loop that runs 100 times and then exits will be predicted correctly 99 times. The cost shows up in unpredictable patterns - like iterating over data where the branch depends on values the predictor can’t anticipate.

This is the real reason code like this:

// Slower: branch outcome depends on unsorted data
for (int i = 0; i < N; i++) {
    if (data[i] > 128) sum += data[i];
}

// Faster: branch is now predictable after the sort
std::sort(data, data + N);
for (int i = 0; i < N; i++) {
    if (data[i] > 128) sum += data[i];
}

can be meaningfully faster when the data is sorted - not because sorting is free, but because when you run the loop many times, predictable branches cost almost nothing.
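Sorting is one fix. Another, when the work inside the branch is cheap, is to remove the branch altogether - a sketch of the same threshold sum written branchlessly (the function names are invented for illustration):

```cpp
#include <cstddef>

// Branchy version: the CPU must predict data[i] > 128 every iteration.
long long sum_over_threshold(const int* data, std::size_t n) {
    long long sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        if (data[i] > 128) sum += data[i];
    return sum;
}

// Branchless version: the comparison becomes a 0-or-1 multiplier,
// so there is no branch for the predictor to get wrong. Compilers
// typically lower this to a conditional move or vectorized code.
long long sum_over_threshold_branchless(const int* data, std::size_t n) {
    long long sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        sum += (data[i] > 128) * static_cast<long long>(data[i]);
    return sum;
}
```

Whether the branchless form actually wins depends on the data and the compiler - measure before committing to it.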

Function calls are not free

A function call is a branch with bookkeeping. Before jumping to the function’s code, the CPU has to:

  1. Place arguments in argument registers or push them onto the stack
  2. Save any registers the function might overwrite (if their values are still needed)
  3. Save the current program counter (so it knows where to return)
  4. Jump to the function’s code

On return, it does the reverse. Most of this bookkeeping lives in the call frame on the stack - a region of memory that tracks the state of each active function.

The stack itself is usually in cache. But the overhead exists, and it’s why compilers aggressively inline short functions - replacing the call with a copy of the function body, eliminating the overhead entirely.
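To see the trade-off the compiler is making, compare a function it is free to inline with one it is forbidden to - a sketch assuming GCC or Clang for the noinline attribute; the function names are invented:

```cpp
// A function this small is a prime inlining candidate: the call
// bookkeeping would cost more than the body itself.
inline int square(int x) { return x * x; }

// Forbidding inlining (GCC/Clang attribute) forces the full call
// sequence every time: argument setup, jump, return.
__attribute__((noinline)) int square_no_inline(int x) { return x * x; }

// In a hot loop, the difference is the per-call overhead times the
// iteration count; both variants compute the same value.
long long sum_of_squares(int n, bool allow_inline) {
    long long sum = 0;
    for (int i = 0; i < n; ++i)
        sum += allow_inline ? square(i) : square_no_inline(i);
    return sum;
}
```

Both variants produce identical results - which is exactly why the compiler is free to pick whichever is cheaper.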

What this changes about how you read code

None of this means you should write assembly or obsess over every instruction. Compilers are remarkably good at translating high-level code into efficient machine code, and most performance problems live much higher up in the stack.

But the mental model pays off in specific situations:

  • When a tight loop is slower than expected, think: is this hitting cache? Is there an unpredictable branch?
  • When a profiler points at something that looks trivial, think: is this function being called in a hot path where the call overhead adds up?
  • When you read about compiler optimizations - loop unrolling, inlining, autovectorization - you now have a reason for them, not just a name.

The CPU is not magic. It’s a very fast machine that does very simple things, very many times, with a few clever tricks layered on top. Once you see it that way, a lot of things that seemed mysterious stop being mysterious.