👋 In this document, you'll learn about the two main types of pipelines used in data engineering: ETL and ELT. We'll examine their differences and compare how each is implemented on Google Cloud Platform (GCP).
What are ETL and ELT?
ETL (Extract, Transform, Load) pipelines extract data, transform it, and then load it into a target storage system such as a database or data warehouse.
In contrast, ELT (Extract, Load, Transform) pipelines begin by extracting and loading the data, with transformation taking place later, once the data has reached its destination.
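As a rough illustration of that difference in ordering, here is a minimal Python sketch. The `source`, `staging_area`, and `warehouse` objects and their methods are purely hypothetical placeholders for whatever systems you actually use, not a real library API.

```python
# Minimal sketch of the two orders of operations.
# The objects and their methods are hypothetical placeholders.

def etl_pipeline(source, staging_area, warehouse):
    raw = source.extract()                 # 1. Extract from the source
    clean = staging_area.transform(raw)    # 2. Transform outside the warehouse
    warehouse.load(clean)                  # 3. Load only the prepared data

def elt_pipeline(source, warehouse):
    raw = source.extract()                 # 1. Extract from the source
    warehouse.load(raw)                    # 2. Load the raw data as-is
    warehouse.transform()                  # 3. Transform later, inside the warehouse
```

The two functions are identical except for where the transformation step sits relative to the load, which is exactly the distinction the rest of this comparison builds on.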
These two approaches differ fundamentally in the way they manipulate data and interact with different types of storage systems. A thorough understanding of these differences is essential for organizations that want to make informed decisions and optimize the performance and efficiency of their data flows.
The major distinction between the two lies in the order of the processing and loading steps. However, they also differ in several other important aspects of their architecture and use.
What are the differences?
| Aspect | ETL | ELT |
| --- | --- | --- |
| Order of operations | Data is extracted, transformed in a staging area, and then loaded into the target system. | Data is extracted, loaded into the target system, and then transformed as required for analysis. |
| Location of transformations | Data is moved to a staging area, where it is transformed before being delivered to the final system. | Data is transformed directly in the destination system, eliminating the need for a dedicated staging area. |
| Technology maturity | ETL is a proven approach, in use for more than two decades, with a rich ecosystem of supporting tools. | ELT is a newer approach, with comparatively fewer tools natively integrated into existing platforms. |
| Data access | ETL systems only process and load the data specifically designated during warehouse and pipeline design. | ELT systems load all raw data, giving users the flexibility to choose which data to analyze at any time. |
| Calculations | Calculations replace or modify existing columns so that the results can be integrated into the target table. | Calculations are added directly to the existing dataset, without altering the source data in the target system. |
| Compatible storage systems | ETL systems are typically integrated with structured, relational data warehouses. | ELT systems can ingest unstructured data from a variety of sources, such as data lakes. |
| Security and compliance | Sensitive information can be masked or anonymized before being loaded into the data warehouse, offering stronger built-in protection. | Data must be loaded before it can be anonymized, which can briefly expose it and make it more vulnerable. |
| Data volume | ETL is particularly well suited to smaller datasets that require complex, predefined transformations. | ELT excels at handling very large volumes of data, both structured and unstructured. |
| Latency | Initial loading is slower (data is transformed first), but subsequent analysis is faster because the data is already ready for use. | Loading is very fast, since ingestion does not wait for transformations; analysis can be slower because transformations run on demand, at query time (see the sketch after this table). |
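To make the "transformations run at query time" row above concrete, here is a hedged sketch using the google-cloud-bigquery Python client: the raw table is assumed to be loaded already, and the transformation is expressed as a view, so it only executes when the view is queried. The project, dataset, table, and column names are hypothetical.

```python
# ELT-style transformation in BigQuery: the raw data is already loaded,
# and the "transform" step is a SQL view evaluated at query time.
# Project, dataset, table, and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # assumes default credentials

view = bigquery.Table("my-project.analytics.orders_clean")
view.view_query = """
    SELECT
      order_id,
      LOWER(customer_email) AS customer_email,
      SAFE_CAST(amount AS NUMERIC) AS amount
    FROM `my-project.raw_layer.orders`
    WHERE status != 'cancelled'
"""
client.create_table(view, exists_ok=True)
```

Because nothing is rewritten in the raw table, analysts keep access to all of the loaded data, and new transformations can be added later without re-ingesting anything.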
👉 Understanding the nuances, advantages and disadvantages of each approach (ETL vs. ELT) is crucial to choosing and implementing the solution best suited to your specific needs. What's more, in an ever-changing technological and business landscape, mastering these data systems paves the way for engineering innovative solutions to tomorrow's business challenges.
How do you implement ETL vs ELT on GCP?
| Aspect | ETL on GCP | ELT on GCP |
| --- | --- | --- |
| Extraction tools | Cloud Storage, Pub/Sub, Cloud Functions | Cloud Storage, Pub/Sub, Cloud Logging |
| Processing tools | Dataflow, Dataproc, Cloud Functions (see the Dataflow sketch after this table) | Dataform |
| Loading tools | BigQuery, Cloud SQL | BigQuery |
| Orchestration | Cloud Composer (Airflow) | Cloud Composer (Airflow) |
| Ideal use case | Sensitive data requiring anonymization, complex transformations, moderate volumes | Large data volumes, ad hoc analysis, SQL transformations |
| Costs | Higher for processing (Dataflow/Dataproc), lower for BigQuery storage | Lower for processing, higher for BigQuery storage and queries |
| Complexity | More complex (separate processing infrastructure) | Simpler (everything happens in BigQuery) |
| Flexibility | Less flexible for ad hoc analysis | More flexible for creating new transformations |
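As a rough, non-authoritative example of the ETL column above, here is a sketch using the Apache Beam Python SDK (the framework that Dataflow executes): it extracts CSV rows from Cloud Storage, hashes an email field in flight (the anonymize-before-load pattern mentioned earlier), and loads the result into BigQuery. The bucket, project, dataset, and schema are hypothetical, and the pipeline options would need to match your own environment.

```python
# Hedged ETL sketch on GCP with Apache Beam / Dataflow:
# extract from Cloud Storage, transform (anonymize) in flight, load into BigQuery.
# Bucket, project, dataset, and schema are hypothetical.
import hashlib

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def to_row(line):
    """Parse a CSV line and mask the email before it ever reaches the warehouse."""
    order_id, email, amount = line.split(",")
    return {
        "order_id": order_id,
        "email_hash": hashlib.sha256(email.encode()).hexdigest(),
        "amount": float(amount),
    }

options = PipelineOptions(
    runner="DataflowRunner",          # use "DirectRunner" to test locally
    project="my-project",
    region="europe-west1",
    temp_location="gs://my-bucket/tmp",
    save_main_session=True,           # ship module-level imports to the workers
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Extract" >> beam.io.ReadFromText("gs://my-bucket/orders.csv",
                                            skip_header_lines=1)
        | "Transform" >> beam.Map(to_row)
        | "Load" >> beam.io.WriteToBigQuery(
            "my-project:analytics.orders",
            schema="order_id:STRING,email_hash:STRING,amount:FLOAT",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```

The equivalent ELT flow would load the raw file as-is and leave the cleanup to Dataform models or scheduled SQL inside BigQuery, with the whole sequence typically orchestrated by a Cloud Composer (Airflow) DAG in either case.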
Key points to remember
Currently, ETL systems (which extract, transform and then load) and ELT systems (which extract, load and then transform) represent the two dominant paradigms for building efficient data pipelines. These pipelines are essential for getting data to where it can generate value.