ELT, which stands for “Extract, Load, Transform,” is another type of data integration process, similar to its counterpart ETL, “Extract, Transform, Load”. This process moves raw data from a source system to a destination resource, such as a data warehouse. While similar to ETL, ELT is a fundamentally different approach to data pre-processing which has only more recently gained adoption with the transition to cloud environments.
ELT consists of three primary stages; Extract, Load, and Transform. Each of these stages is detailed below.
During data extraction, data is copied or exported from source locations to a staging area. The data set can consist of many data types and come from virtually any structured or unstructured source, including but not limited to:
That said, it is more typically used with unstructured data.
In this step, the transformed data is moved from the staging area into a data storage area, such as a data warehouse or data lake.
For most organizations, the data loading process is automated, well-defined, continuous and batch-driven. Typically, ELT takes place during business hours when traffic on the source systems and the data warehouse is at its peak and consumers are waiting to use the data for analysis or otherwise.
In this stage, a schema-on-write approach is employed, which applies the schema for the data using SQL, or transforms the data, prior to analysis. This stage can involve the following:
It’s possible to confuse ELT with its sister process known by a nearly identical acronym. However, there are several distinct differences between ELT and ETL, which stands for extract, transform and load. It is a data integration process that combines data from multiple data sources into a single, consistent data store that is loaded into a data warehouse or other target system. Traditional ETL tools were designed to create data warehousing in support of Business Intelligence (BI) and Artificial Intelligence (AI) applications.
The obvious difference is the ELT process performs the Load function before the Transform function – a reversal of the second and third steps of the ETL process. ELT copies or exports the data from the source locations, but instead of moving it to a staging area for transformation, it loads the raw data directly to the target data store, where it can be transformed as needed. ELT does not transform any data in transit.
However, the order of steps is not the only difference. In ELT, the target data store can be a data warehouse, but more often it is a data lake, which is a large central store designed to hold both structured and unstructured data at massive scale.
Data lakes are managed using a big data platform (such as Apache Hadoop) or a distributed NoSQL data management system. They can support business intelligence, but more often, they’re created to support artificial intelligence, machine learning, predictive analytics and applications driven by real-time data and event streams.
There are other differences between ETL and ELT, too. For example, because it transforms data before moving it to the central repository, ETL can make data privacy compliance simpler, or more systematic, than ELT (e.g., If analysts don’t transform sensitive data before they need to use it, it could sit unmasked in the data lake). However, data scientists might prefer ELT, which lets them play in a “sandbox” of raw data and do their own data transformation tailored to specific applications. But, in most cases, the choice between ETL and ELT will depend on the choice between on available business resources and needs.
ELT provides several advantages for users who integrate the process into their workflows. Let’s take a look at some of the notable benefits:
When large amounts of streaming data are generated, ELT allows that data to be loaded immediately, and transforms the data after it reaches its destination. This prevents any slowdown that can often occur if the transformation occurs before the Load function, such as in ETL. Often, decisions need to be made in relation to this data, and delays are unacceptable. An example of this is the stock market, which generates large amounts of data that is consumed in real-time. In scenarios such as this, ELT is the solution of choice because the transformation occurs after the data reaches its destination.
Because the data is transformed when it arrives at its destination, ELT allows the recipient of the data to control data manipulation. With ELT, the decoupling of the transformation and load stages ensures that a coding error or other error in the transformation stage doesn’t affect another stage.
ELT utilizes the power and size of the data warehouse to enable transformation, or scalable compute, on a large scale. The destination data warehouse can increase or decrease nodes as needed, particularly in a cloud scenario where there are multiple nodes within each cluster, and multiple clusters that can be utilized. This allows for on-demand flexibility and scalability.
ELT requires a less-powerful server for data transformation and takes advantage of resources already in the warehouse. This results in cost savings and resource efficiencies.
ELT enables the use of the destination repository of choice, for cost and resource flexibility. Data warehouses use MPP architecture (Massively Parallel Processing), including columnar memory-based storage of volumes of data. Data lake processes that apply a schema, or transformation model, as soon as the data is received (also coined “schema-on-read”) are also supported. These efficient processes provide flexibility for large amounts of data.
Continuous operation is ideal for any environment which requires fast access to data. ELT is well suited for data utilized within cloud environments that will often include applications that are accessed on-demand continuously. Likewise, cloud-native ELT transformation provides the aforementioned scalability and flexibility.
An organization may choose to transition from an ETL to an ELT architecture. The reason for the transition could be a change in the use of its product or service such that real-time response and interaction is needed, or the amount of data has grown exponentially and the transformation is delaying the Load stage due to high volume processing demands on the infrastructure. An organization may also choose to transition from ETL to ELT if it has moved to the cloud and would like to offload the processing or use data in the destination location sooner.
In a transition scenario, it is realistic to expect to encounter challenges. First and foremost, completely different logic and code are used in ELT vs. ETL. This could require a complete reconfiguration and possibly new infrastructure or a new provider with infrastructure in the cloud. In addition, with ELT, the raw data is sent to the destination warehouse. Therefore, security is a consideration and must be implemented to keep the data safe.
ELT is not new technology. Staging tables were previously used to move data into a warehouse for processing and transformation, often using SQL scripts. SQL scripts are hard-coded, and therefore subject to potential coding errors. With the use of SQL, customers had to choose between native warehouse execution using SQL scripts and declarative programming, aka declarative authoring. Declarative authoring provides the benefits of more modern, cloud-based data warehouse environments through the creation of code that describes what the program must achieve rather than how it will accomplish it. This process prevents the coding errors inherent in other processes, particularly when the transformation occurs prior to the Load function.
ELT is typically used in high-volume or real-time data use environments. Specific examples are:
IBM Cloud Pak for Data is an open, extensible data platform that provides a data fabric to make all data available for AI and analytics, on any cloud.
AI is unlocking the value of data in new ways. Organize your data to make it ready for an AI and multicloud world with DataOps solutions.
Data integration allows you to transform structured and unstructured data and deliver it to any system on a scalable big data platform.
IBM offers several data integration services and solutions designed to support a business-ready data pipeline and give your enterprise the tools it needs to scale efficiently. IBM, a leader in data integration, gives enterprises the confidence that they need when managing big data projects, applications, and machine learning technology. With industry-leading platforms like IBM Cloud Pak® for Data, organizations can modernize their DataOps processes while being able to use best-in-class virtualization tools to achieve the speed and scalability their business needs now and in the future.