This Tutorial Covers Goals & Significance of Data Warehouse Testing, ETL Testing Responsibilities, Errors in DW and ETL Deployment in detail:
In this In-Depth Data Warehouse Training Series, we had a look at the What Is ETL Process in Data Warehouse in detail in our previous tutorial.
This tutorial will give you an understanding of how Data Warehouse Testing can be accomplished in an organization. You will also get to know about the goals of DW testing, how and what kind of testing can be performed at the backend, who are all involved in this process, DW errors and ETL deployment in detail.
- Data Warehouse/ETL developers and testers.
- Database professionals with basic knowledge of database concepts.
- Database administrators/Big data experts who want to understand Data Warehouse/ETL concepts.
- College graduates/Freshers who are looking for Data Warehouse jobs.
What You Will Learn:
Data Warehouse (ETL) Testing
What is the significance of testing Data Warehouse and Business Intelligence systems?
Testing plays a critical role in the success of any of the above two systems, by ensuring the correctness of data that builds the faith of end-users.
In general, a defect found at the later stages of the software development life cycle costs more to fix that defect. This situation in the DW can be worsened because the wrong data found at the later stages might have been used in important business decisions by that time.
Thus, the fix in the DW is more expensive in terms of process, people and technology changes. You can begin the DW testing right from the requirements gathering phase.
A requirement traceability matrix is prepared & reviewed, and this mainly maps the DW features with their respective business requirements. The traceability matrix acts as an input to the DW test plan that is prepared by the testers. The test plan describes the tests to be performed to validate the DW system.
It also describes the types of tests that will be performed on the system. After the test plan is ready all the detailed test cases will be prepared for various DW scenarios. Then all the test cases will be executed and defects will be logged.
There is a standard in the operational world that maintains different environments for development, testing, and production. In the DW world, both the developers and testers will make sure that the development and test environments are available with the replica of production data before starting their work.
This is copied for a list of tables with limited or full data depending on the project needs, as the production data is really large. The developers develop their code in the developer’s environment and deliver it to the testers.
The testers will test the code delivered in the testing environments to ensure if all the systems are working. Then the code will go live in the production environments. The DW code is also maintained in different versions based on the defects fixed in each release. Maintaining multiple environments and code versions helps to build a good quality system.
Goals Of Data Warehouse (ETL) Testing
Let's take a look at the Goals Of Data Warehouse Testing.
#1) Data Completeness: Ensure that all data from various sources is loaded into a Data Warehouse. The testing team validates if all the DW records are loaded, against the source database and flat files by following the below sample strategies.
- The total number of records count uploaded from the source system should match the total number of records loaded into DW. If there is a difference then you can think about the rejected records.
- Compare the data loaded into each field of DW with the source system data fields. This will bring out the data errors if any.
#2) Data Transformation: While uploading the source data to the Data warehouse, few fields can be directly loaded with the source data but few fields will be loaded with the data that is transformed as per the business logic. This is the complex portion of testing DW (ETL).
Below are the sample strategies to test this:
- You can test by creating and comparing data in spreadsheets. Load the source transformed data and DW data into spreadsheets and do a comparison. There should not be any mismatch.
- Testers should write the queries as per the transformation logic to compare the DW data with the source data. Query execution will guarantee that the data validation for any of the fields is not missing.
#3) Data Quality: Data warehouse (ETL) system must ensure the quality of the data loaded into it by rejecting (or) correcting the data.
DW may reject a few of the source system data based on the business requirements logic. For Example, reject a record if a certain field has non-numeric data. All the rejected records are loaded into the reject table for reference.
The rejected data is reported to the clients because there is no chance of getting to know about this missed data, as it will not be loaded into the DW system. DW may correct the data by loading zero in the place of null values etc.
#4) Scalability and Performance: Data warehouse must ensure the scalability of the system with increasing loads. With this, there should not be any degradation in the performance while executing the queries, with anticipated results in specific time frames. Thus performance testing uncovers any issues and fixes it before the production.
Below are sample strategies for Performance and Scalability Testing:
- Do the performance testing by loading production volumes of data and ensure that the time frames are not missed.
- Validate the performance of each query with bulk data. Test the performance by using simple joins and multiple joins.
- Load double (or) triple to the volumes of data expected to calculate the capacity of the system approximately.
- Test by running jobs for all the listed reports at the same time.
#5) Integration Testing: Data warehouse should perform Integration Testing with other upstream and downstream applications. If possible, it is better to copy the production data into the test environment for Integration Testing.
All system teams should be involved in this phase to bridge the gaps while understanding and testing all the systems together.
#6) Unit Testing: This is performed by the individual developers on their deliverables. Developers will prepare unit test scenarios based on their understanding of the requirements, run the unit tests and document the results. This helps the developers to fix any bugs if found, before delivering the code to the testing team.
#7) Regression Testing: Validates that the DW system is not malfunctioning after fixing any defects. This is performed many times with every new code change.
#8) User Acceptance Testing: This testing is performed by business users to validate system functionality. UAT environment is different from the QA environment. The sign off from UAT implies that we are ready to move the code to production.
From the Data Warehouse and Business Intelligence system perspective, business users can validate various reports through a User Interface (UI). They can validate the report specifications against the requirements, can validate the correctness of data in the reports, can validate how quickly the system is returning the results, etc.
DW Testing Flow Diagram:
Data Warehouse Testing Responsibilities
Enlisted below are the various teams involved in delivering a successful DW system:
- Business Analysts: Gather all the business requirements for the system and document those for everyone's preference.
- Infrastructure Team: Set up various environments as required for both developers and testers.
- Developers: Develop ETL code as per the requirements and perform unit tests.
- QA (Quality Assurance)/Testers: Develop test plan, test cases, etc. Identifies defects in the system by executing the test cases. Perform various levels of testing.
- DBAs: DBAs take charge of converting logical ETL database scenarios into physical ETL database scenarios and also involve in performance testing.
- Business Users: Involve in User Acceptance Testing, run queries and reports on DW tables.
Errors In Data Warehouse
When you are Extracting, Transforming and Loading (ETL) data from multiple sources there are chances that you will get bad data that may abort the long-running jobs.
Following are the key causes of failure in the DW system:
#1) Business Rule Violations (Logical Errors): Logically wrong data violates the business rules. Such data can be handled mostly during transformation or loading phases.
#2) Data Rule Violations (Data Errors): Data errors occur inside the DW database system like data type mismatches, data constraint failures, etc.
This is the phase where all your efforts go live. All the production support documents should be prepared.
The documentation will tell others about the sequence of jobs to run, failure recovery scenarios, training materials to the DW support teams to monitor the system after deployment and to the administrative support team to execute the reports.
We learned about the Goals of Data Warehouse Testing, ETL Testing Responsibilities, Errors in DW and ETL Deployment in detail in this tutorial.
We hope you got an idea of how detailed testing can be done in a Data Warehouse (ETL) System.