Jun 3, 2026

Distributed Tracing: Following a Request Across Services

When a request in a monolithic application is slow, you have a reasonable shot at finding why: add some timing, look at the logs, run a profiler. When a request spans five microservices, the same question becomes much harder. Each service has its own logs, its own metrics, and no service has a complete picture of what happened.

Distributed tracing solves this by threading a trace context through every service a request touches, assembling a complete picture of the request’s journey.

What a Trace Is

A trace represents the end-to-end lifecycle of a single request. It’s composed of spans - individual units of work, each with a start time, duration, name, and a set of attributes.

Spans have parent-child relationships. The root span might be “HTTP GET /checkout”. That span’s children could be “validate cart”, “charge payment”, “update inventory”, “send confirmation email” - each one a call to a downstream service or a significant operation. Each of those children can have its own children.

[HTTP GET /checkout                           200ms total]
  [validate cart          20ms]
  [charge payment                       120ms        ]
    [call payment-service     100ms    ]
      [validate card  30ms]
      [process charge      60ms ]
  [update inventory   15ms]
  [send email  45ms              ]

The waterfall view shows you exactly what ran, in what order, for how long, and which services were involved. A slow request that you thought was “the checkout service” might actually be “the payment service’s validate_card call” - something you’d never find by looking at the checkout service’s logs alone.
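To make the parent-child structure concrete, here is a minimal sketch of how a backend might turn a flat list of spans into the indented view above. The `Span` fields and the `render_waterfall` helper are illustrative assumptions, not any particular backend's data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int             # offset from the start of the trace
    duration_ms: int


def render_waterfall(spans: list[Span]) -> list[str]:
    """Return one indented line per span, parents before children."""
    children: dict[Optional[str], list[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines: list[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        # Siblings are ordered by start time, as in a waterfall view.
        for span in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
            lines.append(f"{'  ' * depth}{span.name} ({span.duration_ms}ms)")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


# Spans arrive from different services in no particular order.
spans = [
    Span("c", "a", "charge payment", 20, 120),
    Span("a", None, "HTTP GET /checkout", 0, 200),
    Span("d", "c", "call payment-service", 30, 100),
    Span("b", "a", "validate cart", 0, 20),
]

waterfall = render_waterfall(spans)
print("\n".join(waterfall))
```

The only link between spans is the parent ID, which is why every service has to record it correctly for the tree to assemble.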

Context Propagation

Traces work by passing a trace context from service to service. When service A calls service B, it includes the trace ID and the current span ID in the request (typically in HTTP headers). Service B reads these headers, creates a new span as a child of the incoming span, and propagates the context to any services it calls in turn.

The standard for this is W3C Trace Context, supported by most tracing libraries:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             ^  ^trace-id                        ^span-id         ^flags
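For illustration, here is roughly what an instrumentation library does with that header on the receiving side: parse it, mint a new span ID, and build the header to send downstream. This is a simplified sketch that only accepts version 00; the helper names `parse_traceparent` and `child_traceparent` are hypothetical.

```python
import re
import secrets
from typing import NamedTuple, Optional


class TraceParent(NamedTuple):
    version: str
    trace_id: str   # 32 hex chars, shared by every span in the trace
    parent_id: str  # 16 hex chars, the caller's span ID
    flags: str      # 2 hex chars; the lowest bit means "sampled"


# Sketch assumption: only version 00 with exactly four fields is accepted.
_TRACEPARENT = re.compile(r"(00)-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header: str) -> Optional[TraceParent]:
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    parsed = TraceParent(*match.groups())
    # All-zero IDs are invalid in W3C Trace Context.
    if set(parsed.trace_id) == {"0"} or set(parsed.parent_id) == {"0"}:
        return None
    return parsed


def child_traceparent(incoming: TraceParent) -> tuple[str, str]:
    """Return (new_span_id, header to send downstream)."""
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex chars
    header = f"{incoming.version}-{incoming.trace_id}-{span_id}-{incoming.flags}"
    return span_id, header


incoming = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
span_id, outgoing = child_traceparent(incoming)
print(outgoing)
```

The trace ID and flags pass through unchanged; only the span ID field is replaced, so the next service records this service's span as its parent.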

In code, you typically don’t manage this manually. An instrumentation library handles it:

# OpenTelemetry Python - automatic HTTP instrumentation
from opentelemetry.instrumentation.requests import RequestsInstrumentor
RequestsInstrumentor().instrument()

# Now any requests.get() call automatically propagates trace context
# and creates a span

The key is that context propagation is transparent to business logic when instrumented at the framework level. HTTP clients and message producers and consumers can be instrumented to carry trace context, and database clients to record a span for each query, all without touching application code.

The Three Components

SDK / instrumentation library: code running in your service that creates spans and records timing. OpenTelemetry is the emerging standard - vendor-neutral instrumentation that can export to any backend.

Collector / exporter: the SDK sends span data to a collector (or directly to a backend). The collector buffers, batches, and forwards to storage. This decouples your services from the specific tracing backend.
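The buffer-batch-forward behavior can be sketched in a few lines. This is a toy model, not a real collector: `BatchingForwarder`, its `send` callback, and the dict-shaped spans are assumptions made for the example.

```python
from typing import Callable


class BatchingForwarder:
    """Buffers finished spans and forwards them to a backend in batches."""

    def __init__(self, send: Callable[[list[dict]], None], max_batch_size: int = 512):
        self._send = send
        self._max_batch_size = max_batch_size
        self._buffer: list[dict] = []

    def on_span_end(self, span: dict) -> None:
        self._buffer.append(span)
        if len(self._buffer) >= self._max_batch_size:
            self.flush()

    def flush(self) -> None:
        if not self._buffer:
            return
        # Swap the buffer out before sending so new spans are not lost.
        batch, self._buffer = self._buffer, []
        self._send(batch)


# Stand-in for a network call to the tracing backend.
sent_batches: list[list[dict]] = []
forwarder = BatchingForwarder(send=sent_batches.append, max_batch_size=2)

for name in ["validate cart", "charge payment", "update inventory"]:
    forwarder.on_span_end({"name": name})
forwarder.flush()  # a real collector would also flush on a timer and at shutdown

print([len(batch) for batch in sent_batches])
```

Because services only know about the `send` callback, swapping the backend means changing the collector's configuration rather than redeploying every service.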

Backend / UI: stores traces and provides the interface for querying them. Jaeger and Zipkin are popular open-source options. Datadog, Honeycomb, Lightstep, and AWS X-Ray are managed options.

Adding Custom Instrumentation

Auto-instrumentation covers framework-level operations (HTTP requests, database queries, message consumption). Custom spans let you trace business-level operations:

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def process_order(order_id: str):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)

        items = fetch_items(order_id)
        span.set_attribute("order.item_count", len(items))

        if apply_discount(items):
            span.set_attribute("order.discount_applied", True)

        return charge_and_fulfill(items)

When this trace lands in your backend, you’ll see the process_order span with the order ID, item count, and whether a discount was applied. Any spans created inside fetch_items, apply_discount, and charge_and_fulfill, whether by auto-instrumentation or by custom spans of their own, appear as its children.

Using Traces Effectively

The main use case is latency investigation: a request is slow, and you want to know why.

Start from the root span. Is the total duration in the trace what you expect? Find the longest child span. Is that span itself slow, or is it slow because of its children? Keep drilling until you find the leaf span where time is actually being spent.
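The drill-down procedure above is mechanical enough to sketch in code. The `SpanNode` type and `slowest_path` function are hypothetical, and the self-time calculation assumes children run one after another rather than concurrently.

```python
from dataclasses import dataclass, field


@dataclass
class SpanNode:
    name: str
    duration_ms: int
    children: list["SpanNode"] = field(default_factory=list)

    def self_time_ms(self) -> int:
        # Time not covered by any child; assumes children do not overlap.
        return self.duration_ms - sum(c.duration_ms for c in self.children)


def slowest_path(root: SpanNode) -> list[str]:
    """Follow the longest child at each level, stopping when a span's own
    time outweighs its longest child."""
    path = [root.name]
    node = root
    while node.children:
        longest = max(node.children, key=lambda c: c.duration_ms)
        if node.self_time_ms() > longest.duration_ms:
            break  # the time is spent in this span itself
        node = longest
        path.append(node.name)
    return path


# The checkout trace from the waterfall example earlier in the article.
trace_root = SpanNode("HTTP GET /checkout", 200, [
    SpanNode("validate cart", 20),
    SpanNode("charge payment", 120, [
        SpanNode("call payment-service", 100, [
            SpanNode("validate card", 30),
            SpanNode("process charge", 60),
        ]),
    ]),
    SpanNode("update inventory", 15),
    SpanNode("send email", 45),
])

path = slowest_path(trace_root)
print(" > ".join(path))
```

For the checkout example the path ends at "process charge", two services away from where the slow request was first noticed.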

Common findings:

N+1 patterns: the trace shows 200 database query spans where you expected 2
External API latency: a third-party payment provider is responding in 800ms
Synchronous calls that could be parallelized: two operations running sequentially where they could run concurrently
Lock contention: a span that spends most of its time waiting rather than executing
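Some of these findings can be detected automatically from span data. As one example, an N+1 pattern shows up as many spans with the same name under the same parent. The sketch below assumes spans are plain dicts with `parent_id` and `name` keys, and the threshold of 10 is arbitrary.

```python
from collections import Counter


def repeated_operations(spans: list[dict], threshold: int = 10) -> dict[tuple[str, str], int]:
    """Count spans per (parent_id, name) and return those at or above threshold."""
    counts = Counter((s["parent_id"], s["name"]) for s in spans)
    return {key: n for key, n in counts.items() if n >= threshold}


# One query for the orders, then one query per order: the classic N+1 shape.
spans = [{"parent_id": "root", "name": "SELECT orders"}]
spans += [{"parent_id": "root", "name": "SELECT order_items"} for _ in range(200)]

suspects = repeated_operations(spans)
print(suspects)
```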

Traces also help with error investigation. An error span appears in red in most UIs, with the exception details attached. Following the parent chain from an error shows you the request context that led to it.

Sampling

A busy service might handle thousands of requests per second. Recording every span for every request would be expensive. Sampling reduces this by only recording a fraction of traces.

Head-based sampling: decide at the start of the trace (the root span) whether to record it. Simple, but you can’t preferentially keep “interesting” traces (slow ones, error ones) because you haven’t seen them yet.
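One way to make the head-based decision is to derive it from the trace ID itself, so the same trace always gets the same answer. The sketch below is loosely modelled on how trace-ID ratio samplers work; `should_sample` is a hypothetical helper, not a library function.

```python
def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Deterministic head-sampling decision derived from the trace ID."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Treat the low 64 bits of the (random) trace ID as a uniform value.
    low_bits = int(trace_id_hex, 16) & 0xFFFFFFFFFFFFFFFF
    return low_bits < int(ratio * (1 << 64))


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
print(should_sample(trace_id, 0.10))
print(should_sample(trace_id, 1.0))
```

Note that the decision uses nothing but the ID and the ratio: the request's latency and outcome are not yet known, which is exactly the limitation described above.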

Tail-based sampling: buffer spans until the trace is complete, then decide whether to keep it based on the outcome (keep all errors, keep slow requests, sample the rest). More useful but requires a stateful collector that holds spans in memory.
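A minimal sketch of the tail-based decision, assuming dict-shaped spans with `trace_id`, `duration_ms`, and an optional `error` key. The `TailSampler` class is hypothetical, and it sidesteps the hard part of a real collector, which is knowing when a trace is actually complete.

```python
import random
from collections import defaultdict
from typing import Optional


class TailSampler:
    def __init__(self, latency_threshold_ms: int, baseline_ratio: float,
                 rng: Optional[random.Random] = None):
        self._latency_threshold_ms = latency_threshold_ms
        self._baseline_ratio = baseline_ratio
        self._rng = rng or random.Random()
        # The in-memory state that makes tail sampling expensive.
        self._pending: dict[str, list[dict]] = defaultdict(list)

    def add_span(self, span: dict) -> None:
        self._pending[span["trace_id"]].append(span)

    def finish_trace(self, trace_id: str) -> list[dict]:
        """Decide once the trace is complete; returns the spans to keep."""
        spans = self._pending.pop(trace_id, [])
        if not spans:
            return []
        has_error = any(s.get("error", False) for s in spans)
        # Approximates trace duration; assumes the root span covers its children.
        duration_ms = max(s["duration_ms"] for s in spans)
        if has_error or duration_ms >= self._latency_threshold_ms:
            return spans
        if self._rng.random() < self._baseline_ratio:
            return spans
        return []


# baseline_ratio=0.0 makes the example deterministic: only "interesting" traces survive.
sampler = TailSampler(latency_threshold_ms=500, baseline_ratio=0.0)
sampler.add_span({"trace_id": "t1", "duration_ms": 40, "error": True})
sampler.add_span({"trace_id": "t2", "duration_ms": 900})
sampler.add_span({"trace_id": "t3", "duration_ms": 35})

kept = {tid: len(sampler.finish_trace(tid)) for tid in ("t1", "t2", "t3")}
print(kept)
```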

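A common approach: send every span to a tail-sampling collector, keep all error traces and all traces above a latency threshold, and keep a random 1-10% of the rest for baseline visibility. (Head-sampling at 1-10% upstream would discard most error and slow traces before the collector could see them.) This keeps storage manageable while ensuring you always have data for the cases you care about most.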

What Tracing Doesn’t Replace

Tracing tells you where time goes and how requests flow. It doesn’t replace logs (for structured event records) or metrics (for aggregate system state). The three work together: a metric alert fires, you look at the dashboard to understand scope, you find a specific trace that exemplifies the problem, you look at the logs from the relevant span to see the details.
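Moving from a trace to its logs only works if the logs carry the trace ID. Here is a minimal sketch using the standard library, assuming the trace ID is stored in a context variable. The `current_trace_id` variable and `TraceIdFilter` class are illustrative; tracing libraries generally offer their own logging integration.

```python
import contextvars
import io
import logging

# Hypothetical holder for the active trace ID; "-" means no active trace.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Stamps every log record with the trace ID of the current request."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True


# In-memory stream so the output can be inspected; a service would log to stderr.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
logger.info("card declined")

print(stream.getvalue(), end="")
```

With the ID on every line, searching the logs for a trace ID returns exactly the events from that request.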

The value of tracing increases significantly with more services. For a single service, a profiler or detailed logging might give you the same information more easily. For a system with five or more services, tracing is the most practical way to get the complete request picture, because no single service’s logs or metrics contain it.
