Data Pipelines: ETL, ELT, and When Each Makes Sense
Most backend engineers encounter data pipelines eventually - either because the company grows and needs analytics, or because they’re handed a data engineering task with no context. The terminology is a maze: ETL, ELT, data lakes, data warehouses, batch processing, streaming. Underneath it all, the core problem is simple: data is in one place, and you need it somewhere else, in a different shape.
This article focuses on the ETL vs ELT distinction and what it actually means in practice.
What a Data Pipeline Does
At its most basic, a data pipeline moves data from a source to a destination, usually transforming it along the way. Sources are things like production databases, third-party APIs, event streams, or files. Destinations are typically data warehouses, data lakes, or other databases built for analytics queries.
The reason you don’t just run analytics queries directly against your production database comes down to a few problems:
Schema optimized for transactions, not analysis: relational schemas with normalized tables and foreign keys are fast for read-modify-write operations. They’re slow for aggregate queries that scan millions of rows across many joined tables. Analytical queries need differently structured data.
Load and isolation: a complex analytical query that scans your entire orders table for the past five years will compete with users trying to complete purchases. Separating analytical and transactional workloads prevents one from degrading the other.
History: production databases often retain limited history. Analytical systems are built to query data across years.
ETL: Transform Before Loading
ETL stands for Extract, Transform, Load. The traditional approach:
- Extract data from source systems (export from Postgres, read from API, consume from Kafka)
- Transform the data in a separate compute environment (clean it, join it with other sources, apply business logic, reshape it)
- Load the transformed data into the destination warehouse
Source DB --> [Extract] --> Raw Data --> [Transform] --> Clean Data --> [Load] --> Warehouse
ETL made sense when storage was expensive and compute was cheap. By transforming before loading, you stored only the data you needed in the final shape you needed it. The warehouse never held raw data.
The practical problem with ETL: the transform step is a bottleneck. You have to know the final shape of the data before you load it. If the analytics team later wants a metric you didn’t transform for, you have to go back, modify the pipeline, and reload.
ELT: Load Raw, Transform in Place
ELT flips the order: Extract, Load, then Transform.
- Extract data from source systems
- Load the raw data directly into the warehouse (or data lake)
- Transform it inside the warehouse using SQL or warehouse-native compute
Source DB --> [Extract] --> [Load] --> Raw Tables in Warehouse --> [Transform] --> Analytics Tables
This became practical as cloud data warehouses (BigQuery, Snowflake, Redshift) became extremely cheap to store data in and extremely fast at running SQL over large datasets. Storage is now so cheap that keeping raw data “forever” is a reasonable default.
The advantage: you always have the raw data. When someone asks for a metric that wasn’t in the original schema, you can transform from raw data rather than re-extracting from the source. The transformation logic lives in SQL, which most analysts can read and modify without engineering help.
The tool most associated with ELT transformation is dbt (data build tool). It lets you write SQL transformation logic as version-controlled models:
-- models/marts/orders_summary.sql
SELECT
date_trunc('day', created_at) AS order_date,
COUNT(*) AS total_orders,
SUM(amount_cents) / 100.0 AS total_revenue,
COUNT(DISTINCT customer_id) AS unique_customers
FROM {{ ref('raw_orders') }}
WHERE status = 'completed'
GROUP BY 1
dbt runs these models against your warehouse, creating views or materialized tables that analytics tools can query.
The Engineering Tradeoffs
Before choosing between ETL and ELT, it helps to name what you’re actually trading off. These are engineering decisions, not data-world conventions.
Latency: how fresh does the data need to be? Most data pipelines run on a schedule - nightly, hourly, every 15 minutes. The batch interval is the floor on data freshness. A daily ETL job means dashboards are always at least 0 to 24 hours stale. For most analytics use cases, this is fine. For fraud detection or operational monitoring, it’s not. If your use case requires freshness in seconds, you’re not building a batch pipeline at all - that’s a streaming problem. If hourly or daily freshness is acceptable, the simpler batch approach is almost always the right call.
Cost: where does the compute happen? ETL transforms data before loading - compute happens outside the warehouse, in your pipeline infrastructure. ELT transforms data inside the warehouse - compute is billed by the warehouse. Modern cloud warehouses (BigQuery, Snowflake) charge per query or per byte scanned. For large datasets with complex transformations run frequently, this can be significant. The flip side: ETL requires maintaining separate transformation infrastructure, which has its own operational cost. The honest answer is that ELT is usually cheaper for teams that don’t have dedicated data infrastructure, and the cost question is secondary to the operational simplicity question.
Governance: who can see what, and in what form? ELT loads raw data first - which means your warehouse will hold the raw copy of every source table, including PII, sensitive financial data, and anything else in the source system. This is often fine in a well-controlled warehouse with proper access controls. But it’s a governance commitment: you need column-level access controls, data masking policies, audit logging. Some regulated environments or contractual obligations prohibit storing raw PII at all - which forces ETL (mask before loading) regardless of other considerations. If your data is clean and non-sensitive, governance doesn’t drive the decision. If your data includes medical records, financial transactions, or identifiers tied to individuals, governance might be the most important factor.
Which to Use
The practical answer in 2026 is: ELT for most things, with ETL reserved for specific cases.
Use ELT when:
- You have a modern cloud data warehouse (BigQuery, Snowflake, Redshift, DuckDB)
- You want analytics teams to be able to define their own transformations
- You need flexibility - requirements will change, and you want raw data as a fallback
- Storage cost is not a significant constraint
Consider ETL when:
- You’re working with sensitive data that must not land in its raw form (PII that needs to be masked or tokenized before storage)
- You’re loading into a destination with strict schemas where raw data would be rejected
- Your transformation is computationally heavy in ways the warehouse isn’t good at (complex machine learning, geospatial computation)
- Your destination isn’t a warehouse (loading into another operational database)
The Extract Layer: Getting Data Out
Extracting from a production relational database without killing it requires care. The standard approaches:
Full extract: read the entire table every time. Simple, reliable, scales poorly as tables grow.
Incremental extract: read only rows modified since the last run. Requires a reliable updated_at timestamp column or similar. Misses hard deletes.
Change Data Capture (CDC): read the database’s binary log (Postgres replication log, MySQL binlog) to capture every insert, update, and delete in real time. Most reliable but operationally complex. Tools like Debezium handle this.
For APIs, extraction is typically scheduled API polling or webhook event consumption. Rate limits and pagination are the usual problems.
Orchestration
A pipeline is a series of steps that must run in order, on a schedule, with error handling and retries. Orchestration tools manage this:
Airflow - the industry standard. Python-defined DAGs (directed acyclic graphs), strong ecosystem, complex to operate.
Prefect / Dagster - modern alternatives to Airflow with better developer experience, managed hosting options.
dbt Cloud / dbt Core - if you’re doing ELT with dbt, it has scheduling built in for the transform layer.
For simple pipelines, a cron job or GitHub Actions scheduled workflow is perfectly adequate. Don’t reach for Airflow until the complexity justifies it.
Data Quality
The most common silent failure in data pipelines: bad data flows through without anyone noticing. A source table gains a new NULL column. An upstream API changes a field format. A timezone handling bug doubles some event counts.
The basic defensive pattern: add assertions at load time and after transformations.
-- dbt test example: orders should never have negative amounts
SELECT id FROM {{ ref('raw_orders') }} WHERE amount_cents < 0
If this query returns rows, dbt fails the run. The analyst investigating the dashboard discrepancy finds a dbt failure, not a confusing wrong number.
Data quality tooling like Great Expectations or dbt’s built-in tests can formalize this: schema checks, uniqueness constraints, value range assertions, relationship integrity. None of it is exciting, but all of it saves you from explaining to stakeholders why the revenue numbers were wrong for a week.
The Thing to Internalize
Data pipelines are plumbing. The goal is to make data flow reliably from where it is to where it’s needed, in the shape it’s needed. Modern ELT with a cloud warehouse, a CDC-based extractor, and dbt for transformations covers most use cases.
The complexity isn’t in choosing the pattern - it’s in the edge cases: schema evolution in source systems, handling late-arriving data, managing dependencies between transformations, debugging why a number changed. That’s the actual work. The ETL vs ELT choice is just the starting point.