Distributed Tracing: Following a Request Across Services
When a request in a monolithic application is slow, you have a reasonable shot at finding why: add some timing, look at the logs, run a profiler. When a request spans five microservices, the same question becomes much harder. Each service has its own logs, its own metrics, and no service has a complete picture of what happened.
Distributed tracing solves this by threading a trace context through every service a request touches, so the timing data each service records can be assembled into a complete picture of the request’s journey.
What a Trace Is
A trace represents the end-to-end lifecycle of a single request. It’s composed of spans - individual units of work, each with a start time, duration, name, and a set of attributes.
Spans have parent-child relationships. The root span might be “HTTP GET /checkout”. That span’s children could be “validate cart”, “charge payment”, “update inventory”, “send confirmation email” - each call to a downstream service or significant operation. Each of those children can have their own children.
[HTTP GET /checkout 200ms total]
  [validate cart 20ms]
  [charge payment 120ms]
    [call payment-service 100ms]
      [validate card 30ms]
      [process charge 60ms]
  [update inventory 15ms]
  [send email 45ms]
The waterfall view shows you exactly what ran, in what order, for how long, and which services were involved. A slow request that you thought was “the checkout service” might actually be “the payment service’s validate_card call” - something you’d never find by looking at the checkout service’s logs alone.
Context Propagation
Traces work by passing a trace context from service to service. When service A calls service B, it includes the trace ID and the current span ID in the request (typically in HTTP headers). Service B reads these headers, creates a new span as a child of the incoming span, and propagates the context to any services it calls in turn.
The standard for this is W3C Trace Context, supported by most tracing libraries:
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             ^  ^trace-id                        ^span-id         ^flags
             version
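For illustration, the header can be split into its four fields with a few lines of standard-library Python. This is a sketch that handles only the version 00 layout shown above; TraceParent, parse_traceparent, and child_header are hypothetical names, not part of any tracing library.

```python
import re
import secrets
from typing import NamedTuple, Optional


class TraceParent(NamedTuple):
    version: str
    trace_id: str  # 32 hex chars, shared by every span in the trace
    span_id: str   # 16 hex chars, the caller's span (the new span's parent)
    flags: str     # 2 hex chars; the lowest bit means "sampled"

    @property
    def sampled(self) -> bool:
        return (int(self.flags, 16) & 0x01) == 1


_PATTERN = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _PATTERN.match(header.strip())
    if match is None:
        return None
    parsed = TraceParent(*match.groups())
    # Version "ff" and all-zero IDs are invalid per the spec.
    if parsed.version == "ff":
        return None
    if set(parsed.trace_id) == {"0"} or set(parsed.span_id) == {"0"}:
        return None
    return parsed


def child_header(incoming: TraceParent) -> str:
    """Build the header for an outgoing call: same trace, new span ID."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex chars
    return f"00-{incoming.trace_id}-{new_span_id}-{incoming.flags}"


tp = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(tp.trace_id, tp.span_id, tp.sampled)
print(child_header(tp))
```

The trace ID stays constant across every hop while the span ID changes at each one, which is what lets the backend stitch the spans back into a tree.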
In code, you typically don’t manage this manually. An instrumentation library handles it:
# OpenTelemetry Python - automatic HTTP instrumentation
from opentelemetry.instrumentation.requests import RequestsInstrumentor
RequestsInstrumentor().instrument()
# Now any requests.get() call automatically propagates trace context
# and creates a span
The key is that context propagation is transparent to business logic when instrumented at the framework level. HTTP clients and message producers and consumers can be instrumented to carry trace context, and database clients to emit spans for each query, all without touching application code.
The Three Components
SDK / instrumentation library: code running in your service that creates spans and records timing. OpenTelemetry is the emerging standard - vendor-neutral instrumentation that can export to any backend.
Collector / exporter: the SDK sends span data to a collector (or directly to a backend). The collector buffers, batches, and forwards to storage. This decouples your services from the specific tracing backend.
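The buffering and batching step can be sketched in a few lines. BatchingForwarder is a hypothetical class written for this example, not a real collector API; production collectors typically also flush on a timer and bound their memory use, which this sketch omits.

```python
from typing import Callable, Dict, List

SpanData = Dict[str, object]


class BatchingForwarder:
    """Hypothetical collector stage: buffer spans, forward them in batches."""

    def __init__(self, export: Callable[[List[SpanData]], None], max_batch: int = 3):
        self._export = export
        self._max_batch = max_batch
        self._buffer: List[SpanData] = []

    def on_span(self, span: SpanData) -> None:
        self._buffer.append(span)
        if len(self._buffer) >= self._max_batch:
            self.flush()

    def flush(self) -> None:
        # Swap the buffer out before exporting so a slow export
        # does not see spans that arrive afterwards.
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._export(batch)


sent: List[List[SpanData]] = []  # stands in for the tracing backend
forwarder = BatchingForwarder(export=sent.append, max_batch=3)
for i in range(7):
    forwarder.on_span({"name": f"span-{i}"})
forwarder.flush()  # drain whatever is left, e.g. on shutdown

print([len(batch) for batch in sent])  # [3, 3, 1]
```

Because the services only ever talk to the forwarder, swapping the backend means changing the export function and nothing else.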
Backend / UI: stores traces and provides the interface for querying them. Jaeger and Zipkin are popular open-source options. Datadog, Honeycomb, Lightstep, and AWS X-Ray are managed options.
Adding Custom Instrumentation
Auto-instrumentation covers framework-level operations (HTTP requests, database queries, message consumption). Custom spans let you trace business-level operations:
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# fetch_items, apply_discount, and charge_and_fulfill are application
# functions defined elsewhere.
def process_order(order_id: str):
    # Opens a span and makes it the current one for the duration of the block
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        items = fetch_items(order_id)
        span.set_attribute("order.item_count", len(items))
        if apply_discount(items):
            span.set_attribute("order.discount_applied", True)
        return charge_and_fulfill(items)
When this trace lands in your backend, you’ll see the process_order span with the order ID, item count, and whether a discount was applied. If fetch_items, apply_discount, and charge_and_fulfill create their own spans or make auto-instrumented calls, those appear as child spans alongside it.
Using Traces Effectively
The main use case is latency investigation: a request is slow, and you want to know why.
Start from the root span. Is the total duration in the trace what you expect? Find the longest child span. Is that span itself slow, or is it slow because of its children? Keep drilling until you find the leaf span where time is actually being spent.
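The drill-down described above can be written as a small loop over a span tree. This is a sketch under simplifying assumptions: spans are plain objects with a duration and a list of children, children run sequentially, and the Span class and drill_down function are hypothetical names for this example. The durations come from the checkout waterfall shown earlier.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    name: str
    duration_ms: float
    children: List["Span"] = field(default_factory=list)

    def self_time_ms(self) -> float:
        # Time not accounted for by children (assumes sequential children).
        return self.duration_ms - sum(c.duration_ms for c in self.children)


def drill_down(span: Span) -> List[Span]:
    """Follow the slowest child until a span's own work dominates."""
    path = [span]
    while span.children:
        slowest = max(span.children, key=lambda c: c.duration_ms)
        if slowest.duration_ms < span.self_time_ms():
            break  # this span itself is where the time goes
        span = slowest
        path.append(span)
    return path


checkout = Span("HTTP GET /checkout", 200, [
    Span("validate cart", 20),
    Span("charge payment", 120, [
        Span("call payment-service", 100, [
            Span("validate card", 30),
            Span("process charge", 60),
        ]),
    ]),
    Span("update inventory", 15),
    Span("send email", 45),
])

print(" > ".join(s.name for s in drill_down(checkout)))
# HTTP GET /checkout > charge payment > call payment-service > process charge
```

Tracing UIs do this visually, but the same logic is useful when scanning many exported traces for a recurring bottleneck.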
Common findings:
- N+1 patterns: the trace shows 200 database query spans where you expected 2
- External API latency: a third-party payment provider is responding in 800ms
- Synchronous calls that could be parallelized: two operations running sequentially where they could run concurrently
- Lock contention: a span that spends most of its time waiting rather than executing
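The first finding in the list, an N+1 pattern, is easy to spot mechanically once span names are available. A minimal sketch, assuming the span names of one trace have been exported as a list of strings; the function name, the threshold, and the example span names are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def repeated_operations(span_names: Iterable[str], threshold: int = 10) -> Dict[str, int]:
    """Return operations that occur at least `threshold` times in one trace."""
    counts = Counter(span_names)
    return {name: n for name, n in counts.items() if n >= threshold}


# One trace: a single orders query followed by one item query per order.
names = ["HTTP GET /orders", "SELECT orders"] + ["SELECT order_items"] * 200

print(repeated_operations(names))  # {'SELECT order_items': 200}
```

A sensible threshold depends on the endpoint; a batch job may legitimately issue hundreds of queries where a page load should not.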
Traces also help with error investigation. An error span appears in red in most UIs, with the exception details attached. Following the parent chain from an error shows you the request context that led to it.
Sampling
A busy service might handle thousands of requests per second. Recording every span for every request would be expensive. Sampling reduces this by only recording a fraction of traces.
Head-based sampling: decide at the start of the trace (the root span) whether to record it. Simple, but you can’t preferentially keep “interesting” traces (slow ones, error ones) because you haven’t seen them yet.
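One way to make a head-based decision is to derive it from the trace ID itself, so the outcome is deterministic for a given trace. The function below is an illustrative sketch similar in spirit to ratio-based samplers; should_sample is a hypothetical name and this is not the exact algorithm of any particular SDK.

```python
def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Decide at the root whether to record a trace, given a ratio in [0, 1]."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Treat the low 64 bits of the (random) trace ID as a uniform number
    # and keep the trace if it falls below the ratio's share of that range.
    low_bits = int(trace_id_hex[-16:], 16)
    return low_bits < int(ratio * 2**64)


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
print(should_sample(trace_id, 0.5))  # False
print(should_sample(trace_id, 0.7))  # True
```

Downstream services usually do not repeat this calculation; they honor the sampled flag carried in the traceparent header so that a trace is recorded either everywhere or nowhere.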
Tail-based sampling: buffer spans until the trace is complete, then decide whether to keep it based on the outcome (keep all errors, keep slow requests, sample the rest). More useful but requires a stateful collector that holds spans in memory.
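The tail-based decision itself is simple once all spans of a trace are in hand; the hard part is the buffering. A sketch of the decision only, with hypothetical names (FinishedSpan, keep_trace) and arbitrary example thresholds:

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FinishedSpan:
    name: str
    duration_ms: float
    is_error: bool = False
    is_root: bool = False


def keep_trace(
    spans: List[FinishedSpan],
    slow_ms: float = 500.0,
    baseline_ratio: float = 0.05,
    rand: Callable[[], float] = random.random,
) -> bool:
    """Decide whether to keep a completed trace."""
    # Keep every trace that contains an error.
    if any(s.is_error for s in spans):
        return True
    # Keep every trace whose root span exceeded the latency threshold.
    root_duration = max((s.duration_ms for s in spans if s.is_root), default=0.0)
    if root_duration >= slow_ms:
        return True
    # Otherwise keep a random fraction for baseline visibility.
    return rand() < baseline_ratio


slow = [FinishedSpan("HTTP GET /checkout", 900.0, is_root=True)]
print(keep_trace(slow))  # True
```

The rand parameter exists only so the random branch can be controlled in tests.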
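A common approach: export every span to a tail-sampling collector that keeps all error traces and all traces above a latency threshold, plus a random 1-10% of the rest for baseline visibility. Head-sampling at 1-10% in front of that collector would not work, because most error and slow traces would be dropped before the tail sampler could see them. This keeps storage manageable while ensuring you always have data for the cases you care about most.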
What Tracing Doesn’t Replace
Tracing tells you where time goes and how requests flow. It doesn’t replace logs (for structured event records) or metrics (for aggregate system state). The three work together: a metric alert fires, you look at the dashboard to understand scope, you find a specific trace that exemplifies the problem, you look at the logs from the relevant span to see the details.
The value of tracing increases significantly with more services. For a single service, a profiler or detailed logging might give you the same information more easily. For a system with five or more services, tracing is often the only practical way to see the complete request picture.