This In-depth Tutorial on ETL Process Explains Process Flow & Steps Involved in the ETL (Extraction, Transformation, and Load) Process in Data Warehouse:
This tutorial in the series explains: What is ETL Process? Data Extraction, Transformation, Loading, Flat Files, What is Staging? ETL Cycle, etc.
What You Will Learn:
- ETL (Extract, Transform, Load) Process Fundamentals
- What Is ETL Process In Data Warehouse?
- Data Extraction
- Data Transformation
- Data Loading
- Flat Files
- Purpose Of Staging Area
- Designing The Staging Area
- ETL Process Flow
ETL (Extract, Transform, Load) Process Fundamentals
- Data warehouse/ETL developers and testers.
- Database professionals with basic knowledge of database concepts.
- Database administrators/big data experts who want to understand Data warehouse/ETL areas.
- College graduates/Freshers who are looking for Data warehouse jobs.
What Is ETL Process In Data Warehouse?
We all know that Data warehouse is a collection of huge volumes of data, to provide information to the business users with the help of Business Intelligence tools.
To serve this purpose DW should be loaded at regular intervals. The data into the system is gathered from one or more operational systems, flat files, etc. The process which brings the data to DW is known as ETL Process. Extraction, Transformation, and Loading are the tasks of ETL.
#1) Extraction: All the preferred data from various source systems such as databases, applications, and flat files is identified and extracted. Data extraction can be completed by running jobs during non-business hours.
#2) Transformation: Most of the extracted data can’t be directly loaded into the target system. Based on the business rules, some transformations can be done before loading the data.
For Example, a target column data may expect two source columns concatenated data as input. Likewise, there may be complex logic for data transformation that needs expertise. Some data that does not need any transformations can be directly moved to the target system.
The transformation process also corrects the data, removes any incorrect data and fixes any errors in the data before loading it.
#3) Loading: All the gathered information is loaded into the target Data Warehouse tables.
Data extraction plays a major role in designing a successful DW system. Different source systems may have different characteristics of data, and the ETL process will manage these differences effectively while extracting the data.
“Logical data map” is a base document for data extraction. This shows which source data should go to which target table, and how the source fields are mapped to the respective target table fields in the ETL process.
Below are the steps to be performed during Logical Data Map Designing:
- A Data warehouse architect designs the logical data map document.
- By referring to this document, the ETL developer will create ETL jobs and ETL testers will create test cases.
- All the specific data sources and the respective data elements that support the business decisions will be mentioned in this document. These data elements will act as inputs during the extraction process.
- Data from all the source systems are analyzed and any kind of data anomalies are documented so that this helps in designing the correct business rules to stop extracting the wrong data into DW. Such data is rejected here itself.
- Once the final source and target data model is designed by the ETL architects and the business analysts, they can conduct a walk through with the ETL developers and the testers. By this, they will get a clear understanding of how the business rules should be performed at each phase of Extraction, Transformation, and Loading.
- By going through the mapping rules from this document, the ETL architects, developers and testers should have a good understanding of how data flows from each table as dimensions, facts, and any other tables.
- Any kind of data manipulation rules or formulas is also mentioned here to avoid the extraction of wrong data. For example, extract only the last 40 days of data, etc.
- It is the responsibility of the ETL team to drill down into the data as per the business requirements, to bring out every useful source system, tables, and columns data to be loaded into DW.
Logical data map document is generally a spreadsheet which shows the following components:[table “” not found /]
Extraction Flow Diagram:
State about the time window to run the jobs to each source system in advance, so that no source data would be missed during the extraction cycle.
With the above steps, extraction achieves the goal of converting data from different formats from different sources into a single DW format, that benefits the whole ETL processes. Such logically placed data is more useful for better analysis.
Extraction Methods In Data Warehouse
Depending on the source and target data environments and the business needs, you can select the extraction method suitable for your DW.
#1) Logical Extraction Methods
Data extraction in a Data warehouse system can be a one-time full load that is done initially (or) it can be incremental loads that occur every time with constant updates.
- Full Extraction: As the name itself suggests, the source system data is completely extracted to the target table. Each time this kind of extraction loads the entire current source system data without considering the last extracted time stamps. Preferably you can use full extraction for the initial loads or tables with fewer data.
- Incremental Extraction: The data which is added/modified from a specific date will be considered for incremental extraction. This date is business-specific as last extracted date (or) last order date etc. We can refer to a timestamp column from the source table itself (or) a separate table can be created to track only the extraction date details. Referring to the timestamp is a significant method during Incremental extraction. Logics without timestamp may fail if the DW table has large data.
#2) Physical Extraction Methods
Depending on the source systems' capabilities and the limitations of data, the source systems can provide the data physically for extraction as online extraction and offline extraction. This supports any of the logical extraction types.
- Online Extraction:: We can directly connect to any source system databases with the connection strings to extract data directly from the source system tables.
- Offline Extraction:: We will not directly connect to the source system database here, instead the source system provides data explicitly in a pre-defined structure. Source systems can provide data in the form of Flat files, Dump files, Archive logs and Tablespaces.
ETL tools are best suited to perform any complex data extractions, any number of times for DW though they are expensive.
Extracting Changed Data
Once the initial load is completed, it is important to consider how to extract the data that is changed from the source system further. The ETL Process team should design a plan on how to implement extraction for the initial loads and the incremental loads, at the beginning of the project itself.
Mostly you can consider the “Audit columns” strategy for the incremental load to capture the data changes. In general, the source system tables may contain audit columns, that store the time stamp for each insertion (or) modification.
The timestamp may get populated by database triggers (or) from the application itself. You must ensure the accuracy of the audit columns' data even if they are loading by any means, to not to miss the changed data for incremental loads.
During the incremental load, you can consider the maximum date and time of when the last load has happened and extract all the data from the source system with the time stamp greater than the last load time stamp.
While Extracting the Data:
- Use queries optimally to retrieve only the data that you need.
- Do not use the Distinct clause much as it slows down the performance of the queries.
- Use SET operators such as Union, Minus, Intersect carefully as it degrades the performance.
- Use comparison key words such as like, between, etc in where clause, rather than functions such as substr(), to_char(), etc.
Transformation is the process where a set of rules is applied to the extracted data before directly loading the source system data to the target system. The extracted data is considered as raw data.
The transformation process with a set of standards brings all dissimilar data from various source systems into usable data in the DW system. Data transformation aims at the quality of the data. You can refer to the data mapping document for all the logical transformation rules.
Based on the transformation rules if any source data is not meeting the instructions, then such source data is rejected before loading into the target DW system and is placed into a reject file or reject table.
The transformation rules are not specified for the straight load columns data (does not need any change) from source to target. Hence, data transformations can be classified as simple and complex. Data transformations may involve column conversions, data structure reformatting, etc.
Given below are some of the tasks to be performed during Data Transformation:
#1) Selection: You can select either the entire table data or a specific set of columns data from the source systems. The selection of data is usually completed at the Extraction itself.
There may be cases where the source system does not allow to select a specific set of columns data during the extraction phase, then extract the whole data and do the selection in the transformation phase.
#2) Splitting/joining: You can manipulate the selected data by splitting or joining it. You will be asked to split the selected source data even more during the transformation.
For example, if the whole address is stored in a single large text field in the source system, the DW system may ask to split the address into separate fields as a city, state, zip code, etc. This is easy for indexing and analysis based on each component individually.
Whereas joining/merging two or more columns data is widely used during the transformation phase in the DW system. This does not mean merging two fields into a single field.
For Example, if information about a particular entity is coming from multiple data sources, then gathering the information as a single entity can be called as joining/merging the data.
#3) Conversion: The extracted source systems data could be in different formats for each data type, hence all the extracted data should be converted into a standardized format during the transformation phase. The same kind of format is easy to understand and easy to use for business decisions.
#4) Summarization: In some situations, DW will look for summarized data rather than low-level detailed data from the source systems. Because low-level data is not best suited for analysis and querying by the business users.
For example, sales data for every checkout may not be required by the DW system, daily sales by-product (or) daily sales by the store is useful. Hence summarization of data can be performed during the transformation phase as per the business requirements.
#5) Enrichment: When a DW column is formed by combining one or more columns from multiple records, then data enrichment will re-arrange the fields for a better view of data in the DW system.
#6) Format revisions: Format revisions happen most frequently during the transformation phase. The data type and its length are revised for each column.
For example, a column in one source system may be numeric and the same column in another source system may be a text. To standardize this, during the transformation phase the data type for this column is changed to text.
#7) Decoding of fields: When you are extracting data from multiple source systems, the data in various systems may be decoded differently.
For example, one source system may represent customer status as AC, IN, and SU. Another system may represent the same status as 1, 0 and -1.
During the data transformation phase, you need to decode such codes into proper values that are understandable by the business users. Hence, the above codes can be changed to Active, Inactive and Suspended.
#8) Calculated and derived values: By considering the source system data, DW can store additional column data for the calculations. You have to do the calculations based on the business logic before storing it into DW.
#9) Date/Time conversion: This is one of the key data types to concentrate on. The date/time format may be different in multiple source systems.
For example, one source may store the date as November 10, 1997. Another source may store the same date in 11/10/1997 format. Hence, during the data transformation, all the date/time values should be converted into a standard format.
#10) De-duplication: In case the source system has duplicate records, then ensure that only one record is loaded to the DW system.
Transformation Flow Diagram:
How To Implement Transformation?
Depending on the complexity of data transformations you can use manual methods, transformation tools (or) combination of both whichever is effective.
#1) Manual Techniques
Manual techniques are adequate for small DW systems. Data analysts and developers will create the programs and scripts to transform the data manually. This method needs detailed testing for every portion of the code.
The maintenance cost may become high due to the changes that occur in business rules (or) due to the chances of getting errors with the increase in the volumes of data. You should take care of metadata initially and also with every change that occurs in the transformation rules.
#2) Transformation Tools
If you want to automate most of the transformation process, then you can adopt the transformation tools depending on the budget and time frame available for the project. While automating you should spend good quality time to select the tools, configure, install and integrate them with the DW system.
Practically Complete transformation with the tools itself is not possible without manual intervention. But the data transformed by the tools is certainly efficient and accurate.
To achieve this, we should enter proper parameters, data definitions, and rules to the transformation tool as input. From the inputs given, the tool itself will record the metadata and this metadata gets added to the overall DW metadata.
If there are any changes in the business rules, then just enter those changes to the tool, the rest of the transformation modifications will be taken care of by the tool itself. Hence a combination of both methods is efficient to use.
Extracted and transformed data gets loaded into the target DW tables during the Load phase of the ETL process. The business decides how the loading process should happen for each table.
The loading process can happen in the below ways:
- Initial load: Loading the data to populate the respective DW tables for the first time.
- Incremental load: Once the DW tables are loaded, the rest of the ongoing changes are applied periodically.
- Full refresh: If any tables that are in use needs a refresh, then the current data from that table is completely removed and then reloaded. Reloading is similar to the initial load.
Look at the below example, for better understanding of the loading process in ETL:
|Product ID||Product Name||Sold Date|
|1||Grammar book||3rd June 2007|
|2||Marker||3rd June 2007|
|3||Back bag||4th June 2007|
|4||Cap||4th June 2007|
|5||Shoes||5th June 2007|
#1) During the initial load, the data which is sold on 3rd June 2007 gets loaded into the DW target table because it is the initial data from the above table.
#2) During the Incremental load, we need to load the data which is sold after 3rd June 2007. We should consider all the records with the sold date greater than (>) the previous date for the next day. Hence, on 4th June 2007, fetch all the records with sold date > 3rd June 2007 by using queries and load only those two records from the above table.
On 5th June 2007, fetch all the records with sold date > 4th June 2007 and load only one record from the above table.
#3) During Full refresh, all the above table data gets loaded into the DW tables at a time irrespective of the sold date.
The loaded data is stored in the respective dimension (or) fact tables. The data can be loaded, appended or merged to the DW tables as follows:
#4) Load: The data gets loaded into the target table if it is empty. If the table has some data exist, the existing data is removed and then gets loaded with the new data.
Existing Table Data
Data After Loading
#5) Append: Append is an extension of the above load as it works on already data existing tables. In the target tables, Append adds more data to the existing data. If any duplicate record is found with the input data, then it may be appended as duplicate (or) it may be rejected.
Existing Table Data
Data After Appending
#6) Destructive merge: Here the incoming data is compared with the existing target data based on the primary key. If there is a match, then the existing target record gets updated. If no match is found, then a new record gets inserted into the target table.
Existing Table Data
Data After Constructive Merge
#7) Constructive merge: Unlike destructive merge, if there is a match with the existing record, then it leaves the existing record as it is and inserts the incoming record and marks it as the latest data (timestamp) with respect to that primary key.
Existing Table Data
Data After Constructive Merge
Technically, refresh is easier than updating the data. The update needs a special strategy to extract only the specific changes and apply them to the DW system whereas Refresh just replaces the data. But refreshing the data takes longer times depending on the volumes of data.
If you have such refresh jobs to run daily, then you may need to bring down the DW system to load the data. Instead of bringing down the entire DW system to load data every time, you can divide and load data in the form of few files.
Make a note of the run time for each load while testing. If any data is not able to get loaded into the DW system due to any key mismatches etc, then give them the ways to handle such kind of data. Ensure that loaded data is tested thoroughly.
Loading Flow Diagram:
Flat files are widely used to exchange data between heterogeneous systems, from different source operating systems and from different source database systems to Data warehouse applications. Flat files are most efficient and easy to manage for homogeneous systems as well.
Flat files are primarily used for the following purposes:
#1) Delivery of source data: There may be few source systems that will not allow DW users to access their databases due to security reasons. In such cases, the data is delivered through flat files.
Similarly, the data is sourced from the external vendors or mainframes systems essentially in the form of flat files, and these will be FTP’d by the ETL users.
#2) Working/staging tables: ETL process creates staging tables for its internal purpose. The association of staging tables with the flat files is much easier than the DBMS because reads and writes to a file system are faster than inserting and querying a database.
#3) Preparation for bulk load: Once the Extraction and Transformation processes have been done, If the in-stream bulk load is not supported by the ETL tool (or) If you want to archive the data then you can create a flat-file. This flat file data is read by the processor and loads the data into the DW system.
Flat files can be created in two ways as “Fixed-length flat files” and “Delimited flat files”. Flat files can be created by the programmers who work for the source system.
Let us see how do we process these flat files:
Processing Fixed Length Flat Files
In general, flat files are of fixed length columns, hence they are also called as Positional flat files. Below is the layout of a flat-file which shows the exact fields and their positions in a file.
|First name||10||1||10||Text||First name of customer|
|Middle name||5||11||15||Text||Middle name of customer|
|Last name||10||16||25||Text||Last name of customer|
The layout contains the field name, length, starting position at which the field character begins, the end position at which the field character ends, the data type as text, numeric, etc., and comments if any.
Depending on the data positions, the ETL testing team will validate the accuracy of the data in a fixed-length flat file.
Processing Delimited Flat Files
In Delimited Flat Files, each data field is separated by delimiters. This delimiter indicates the starting and end position of each field. In general, a comma is used as a delimiter, but you can use any other symbol or a set of symbols.
Delimited files can be of .CSV extension (or).TXT extension (or) of no extension. The developers who create the ETL files will indicate the actual delimiter symbol to process that file. In the delimited file layout, the first row may represent the column names.
Same as the positional flat files, the ETL testing team will explicitly validate the accuracy of the delimited flat file data.
Purpose Of Staging Area
The main purpose of the staging area is to store data temporarily for the ETL process. The staging area is referred to as the backroom to the DW system. ETL architect decides whether to store data in the staging area or not.
Staging will help to get the data from source systems very fast. At the same time in case the DW system fails, then you need not start the process again by gathering data from the source systems if the staging data exists already.
After the data extraction process, here are the reasons to stage data in the DW system:
#1) Recoverability: The populated staging tables will be stored in the DW database itself (or) they can be moved into file systems and can be stored separately. At some point, the staging data can act as recovery data if any transformation or load step fails.
There may be chances that the source system has overwritten the data used for ETL, hence keeping the extracted data in staging helps us for any reference.
#2) Backup: It is difficult to take back up for huge volumes of DW database tables. But backups are a must for any disaster recovery. Hence if you have the staging data which is extracted data, then you can run the jobs for transformation and load, thereby the crashed data can be reloaded.
To back up the staging data, you can frequently move the staging data to file systems so that it is easy to compress and store in your network. Whenever required just uncompress files, load into staging tables and run the jobs to reload the DW tables.
#3) Auditing: Sometimes an audit can happen on the ETL system, to check the data linkage between the source system and the target system. The auditors can validate the original input data against the output data based on the transformation rules.
The staging data and it’s back up are very helpful here even if the source system has the data available or not. As audit can happen at any time and on any period of the present (or) past data. The architecture of the staging area should be well planned.
Designing The Staging Area
In the Data warehouse, the staging area data can be designed as follows:
With every new load of data into staging tables, the existing data can be deleted (or) maintained as historical data for reference. If data is deleted, then it is called a “Transient staging area”.
If data is maintained as history, then it is called a “Persistent staging area”. You can also design a staging area with a combination of the above two types which is “Hybrid”.
Here are the basic rules to be known while designing the staging area:
- Only the ETL team should have access to the data staging area. Querying the staging data is restricted to other users.
- Tables in the staging area can be added, modified or dropped by the ETL data architect without involving any other users. As the staging area is not a presentation area to generate reports, it just acts as a workbench.
- ETL architect should estimate the data storage measure of the staging area to provide the details to DBA and OS administrators. Administrators will allocate space for staging databases, file systems, directories, etc.
If the staging area and DW database are using the same server then you can easily move the data to the DW system. If the servers are different then use FTP (or) database links.
ETL Process Flow
A standard ETL cycle will go through the below process steps:
- Kick off the ETL cycle to run jobs in sequence.
- Make sure all the metadata is ready.
- ETL cycle helps to extract the data from various sources.
- Validate the extracted data.
- If staging tables are used, then the ETL cycle loads the data into staging.
- ETL performs transformations by applying business rules, by creating aggregates, etc
- If there are any failures, then the ETL cycle will bring it to notice in the form of reports.
- Then ETL cycle loads data into the target tables.
- Earlier data which needs to be stored for historical reference is archived.
- The rest of the data which need not be stored is cleaned.
ETL Process Flow Diagram:
In this tutorial, we learned about the major concepts of the ETL Process in Data Warehouse. By now, you should be able to understand what is Data Extraction, Data Transformation, Data Loading, and the ETL process flow.
Read the upcoming tutorial to know more about Data Warehouse Testing!!