How LLMs Work: A Mental Model for Engineers


Most engineers who use LLMs in their work have a vague intuition that it’s “some kind of neural network trained on text.” That intuition is good enough for using these systems, but not good enough for building with them reliably. When an LLM gives you a wrong answer, makes up a function that doesn’t exist, or behaves differently than you expected, the model’s architecture explains why - and knowing why lets you design around the failure modes.

This article builds a mental model. Not a PhD-level explanation of transformer mathematics, but enough to reason about what these systems can and can’t do.

Text is Tokens

The first thing to understand is that LLMs don’t operate on characters or words - they operate on tokens.

A tokenizer splits text into sub-word units. Common words like “the” or “cat” are typically one token. Longer or rarer words get split: “tokenization” might become [“token”, “ization”]. Code often gets fragmented further: function_name might become [“function”, “name”] or [“function”, "", “name”] depending on the tokenizer.

# Using tiktoken (OpenAI's tokenizer)
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")

enc.encode("Hello, world!")
# [9906, 11, 1917, 0]  -- 4 tokens

enc.encode("supercalifragilistic")
# [13066, 1286, 278, 3163, 321, 4633]  -- 6 tokens

The practical implication: token count, not word count or character count, determines what fits in a model’s context window. A 128,000-token context window holds roughly 100,000 words of English text, but code or unusual formatting can be tokenized much less efficiently.

The Context Window is Everything

An LLM has no memory between conversations. When you send a message, the model receives the entire conversation history - every message from the start - and generates the next token.

The context window is the maximum number of tokens the model can process in a single call. If a conversation grows longer than the context window, you have to drop or summarize earlier content. When you do, the model has no access to what was dropped - it’s as if those messages never existed.

This explains why LLMs can seem to “forget” things mentioned early in a long conversation. In many implementations, they literally have - the early context was truncated.

It also explains why position matters. In most transformer architectures, the model pays more attention to recent tokens and to tokens at the very beginning of the context (which often contains the system prompt). Information buried in the middle of a long context is reliably processed but often underweighted.

Predicting the Next Token

The core thing a language model does is predict, for a given sequence of tokens, what token is most likely to come next. Every output token is generated by running this prediction - then appending the predicted token to the sequence and predicting the next one, and so on.

This “next token prediction” framing is how these models are trained: on massive amounts of text, learning the statistical patterns of what comes next in a sequence. The training objective is simple. The capability that emerges from it at scale is not.

What this means practically: LLMs are not lookup tables or search engines. They don’t retrieve facts from a database. They generate tokens that are statistically likely to follow the input. When a model gives you a plausible-sounding wrong answer, it’s because “plausible-sounding answer to this kind of question” is the statistical pattern it has learned.

This is why LLMs “hallucinate” - they generate text that sounds right but is factually wrong. From the model’s perspective, there is no distinction between “I know this” and “I’m generating a plausible continuation.” Both produce text. The text happens to be accurate more often than not for common topics and less often for rare ones.

Attention: How Context Gets Used

The key mechanism in transformer models is the attention mechanism. When predicting the next token, the model doesn’t just look at the token immediately before it - it can attend to any token in the context window, weighted by relevance.

The model learns which tokens to pay attention to during training. When generating code, it learns to attend heavily to variable names, function signatures, and type annotations that were defined earlier. When answering a question, it attends to the relevant parts of a document.

This is what makes transformers powerful compared to earlier sequence models: the ability to consider long-range context efficiently. But attention has quadratic complexity with sequence length - doubling the context length quadruples the computation. This is why context windows are finite and why extending them requires architectural modifications.

Temperature and Sampling

At inference time, the model produces a probability distribution over all possible next tokens. The highest-probability token is not always what gets selected - there’s a sampling process.

Temperature controls how peaked or flat the distribution is before sampling. Temperature 0 selects the highest-probability token deterministically (greedy sampling). Higher temperature flattens the distribution, making lower-probability tokens more likely to be selected. This increases variety and creativity at the cost of coherence.

# Low temperature: more deterministic, repetitive, "safe"
# High temperature: more varied, creative, potentially incoherent

Top-p sampling (nucleus sampling) considers only tokens whose cumulative probability reaches some threshold p. This avoids sampling from the long tail of low-probability tokens that would produce gibberish.

For code generation, low temperature (0 or 0.1) generally produces better results. For brainstorming or creative writing, higher temperatures (0.7-1.0) produce more varied output.

Instruction Following and RLHF

Raw language models trained only on text prediction produce completions - they continue whatever text you give them. They don’t follow instructions, they don’t have safety behaviors, and they’ll happily continue harmful text.

The models you use via APIs have been additionally trained to follow instructions. The primary technique is RLHF - Reinforcement Learning from Human Feedback. Annotators rate model outputs, and the model is trained to produce outputs that humans rate more highly.

This is why the same base model can produce outputs with very different characteristics depending on how it was fine-tuned. The instruction-following behavior and safety constraints are learned, not inherent to the architecture.

It also explains why prompt phrasing matters more than it should. The model has learned patterns like “when someone asks X in Y format, respond with Z.” A small change in phrasing can activate a different pattern and produce meaningfully different output.

What This Means When You Build With LLMs

Hallucinations are structural, not bugs: the model generates plausible tokens. For facts you care about, verify externally or give the model the facts in the context. Don’t ask an LLM to recall a specific function’s signature from memory if you can provide the documentation.

Context is scarce and precious: if you’re building a system that puts context into an LLM, be intentional about what goes in. Irrelevant content doesn’t help and may hurt by diluting the relevant signal.

Determinism requires temperature 0: if you need reproducible outputs, set temperature to 0. Even then, floating-point non-determinism can cause occasional variation.

Recency bias in context: if you need the model to use a specific piece of information reliably, put it close to the end of the context (before the user message) or in the system prompt, not buried in the middle.

Output is always probabilistic: the same prompt run twice may produce different outputs. Build systems that tolerate this: parse outputs structurally (JSON with a schema), validate results, and have fallbacks.

The mental model that serves engineers best: an LLM is a very capable text-completion engine that has been trained to complete text in ways that humans find helpful. It has broad knowledge encoded in its weights, no access to external information unless you provide it, and no persistent state between calls. Design your systems around these properties and the behavior becomes predictable.



Read more