This tutorial describes data validation checks or tests for ETL and Data Migration projects to help improve data quality:
This article is for software testers who work on ETL or Data Migration projects and want to focus their tests on the Data Quality aspects. These projects involve huge volumes of data that are stored in source storage, operated upon by some logic in the software, and then moved to the target storage.
Data validation tests ensure that the data present in the final target systems is valid, accurate, as per business requirements, and fit for use in the live production system.
The number of Data Quality aspects that can be tested is huge, and the list below gives an introduction to the topic.
What Is Data Validation?
In simple terms, Data Validation is the act of validating that the data moved as part of ETL or data migration jobs is consistent, accurate, and complete in the target production live systems so that it serves the business requirements.
Example: The address of a student in the Student table was 2000 characters in the source system. Data validation verifies whether the exact same value resides in the target system. It checks whether the data was truncated or whether certain special characters were removed.
In this article, we will discuss many of these data validation checks. As testers on ETL or data migration projects, we add tremendous value if we uncover data quality issues that might otherwise get propagated to the target systems and disrupt business processes.
Why Validate Data For ETL Projects?
In ETL projects, data is extracted from the source, worked upon by applying some logic in the software, transformed, and then loaded into the target storage. In many cases, the transformation is done to change the source data into a more usable format for the business requirements.
Here, data validation is required to confirm that the data loaded into the target system is complete and accurate, and that there is no data loss or discrepancy.
Example: An e-commerce application has an ETL job that picks all OrderIds against each CustomerID from the Orders table, sums up the TotalDollarsSpend per customer, and loads it into a new CustomerValue table, marking each CustomerRating as a High/Medium/Low-value customer based on some complex algorithm.
A simple data validation test is to see that the CustomerRating is correctly calculated.
Another test is to verify that the TotalDollarsSpend is correctly calculated, with no defects in rounding the values or maximum value overflows. A sketch of such a check is shown below.
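The following is a minimal sketch of that validation, assuming hypothetical rating thresholds (High above 10,000, Low below 1,000) and small in-memory sample rows in place of real queries against the source Orders table and the target CustomerValue table:

```python
# Recompute TotalDollarsSpend from the source Orders rows and compare against
# the loaded CustomerValue rows. Thresholds and data are illustrative only.
source_orders = [
    {"CustomerID": 1, "OrderID": 101, "OrderTotal": 7500.00},
    {"CustomerID": 1, "OrderID": 102, "OrderTotal": 4200.50},
    {"CustomerID": 2, "OrderID": 103, "OrderTotal": 250.00},
]
target_customer_value = [
    {"CustomerID": 1, "TotalDollarsSpend": 11700.50, "CustomerRating": "High"},
    {"CustomerID": 2, "TotalDollarsSpend": 250.00, "CustomerRating": "Low"},
]

def expected_rating(total):
    # Hypothetical stand-in for the real (more complex) rating algorithm.
    if total >= 10000:
        return "High"
    if total >= 1000:
        return "Medium"
    return "Low"

expected_totals = {}
for row in source_orders:
    expected_totals[row["CustomerID"]] = round(
        expected_totals.get(row["CustomerID"], 0) + row["OrderTotal"], 2)

for row in target_customer_value:
    cust = row["CustomerID"]
    assert row["TotalDollarsSpend"] == expected_totals[cust], f"Total mismatch for {cust}"
    assert row["CustomerRating"] == expected_rating(expected_totals[cust]), f"Rating mismatch for {cust}"
print("TotalDollarsSpend and CustomerRating checks passed")
```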
Why Validate Data For Data Migration Projects?
In Data Migration projects, the huge volumes of data that are stored in the Source storage are migrated to different Target storage for multiple reasons like infrastructure upgrade, obsolete technology, optimization, etc. For example, companies might migrate their huge data-warehouse from legacy systems to newer and more robust solutions on AWS or Azure.
The primary motive for such projects is to move data from the source system to a target system such that the data in the target is highly usable without any disruption or negative impact to the business.
Here again, data validation is required to confirm that the data in the source is the same in the target after the movement.
Example: Suppose for the e-commerce application, the Orders table, which had 200 million rows, was migrated to the target system on Azure. A simple data validation test is to verify that all 200 million rows of data are available in the target system.
Another test could be to confirm that the date formats match between the source and target system.
There are various aspects that testers can test in such projects like functional tests, performance tests, security tests, infra tests, E2E tests, regression tests, etc.
Recommended Reading=> Data Migration Testing, ETL Testing Data Warehouse Testing Tutorial
In this article, we will only look at the data aspect of tests for ETL & Migration projects.
Data Mapping Sheet
To begin with, create a Data Mapping sheet for your Data project. Data mapping is the process of matching entities between the source and target tables. Start by documenting all the tables and their entities in the source system in a spreadsheet. Now document the corresponding values for each of these rows that are expected to match in the target tables. Note down the transformation rules in a separate column, if any.
Data mapping sheets contain a lot of information picked from the data models provided by Data Architects. Initially, testers can create a simplified version and add more information as they proceed. See the example of a Data Mapping sheet below.
Download a template of the Simplified Data Mapping Sheet.
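A simplified mapping sheet can also be kept in a machine-readable form so the tests described later can consume it directly. Below is a minimal sketch, where the table and column names and the CSV layout are hypothetical:

```python
import csv, io

# Each row maps a source entity to its target entity plus any transformation rule.
mapping_csv = """source_table,source_column,target_table,target_column,transformation_rule
Customers,CustomerID,Customers,CustomerID,
Customers,CustomerType,,,dropped in target
Orders,OrderDate,Orders,OrderDate,convert MM/DD/YYYY to YYYY-MM-DD
"""

data_mapping = list(csv.DictReader(io.StringIO(mapping_csv)))
for row in data_mapping:
    print(row["source_table"], row["source_column"], "->",
          row["target_table"] or "(absent)", row["target_column"] or "")
```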
Data Validation Tests
#1) Data Uniformity
Data uniformity tests are conducted to verify that the actual value of an entity is an exact match at the different places it is stored. We have two types of tests possible here:
(i) Checks within the same schema:
- The data entity might exist in two tables within the same schema (either source system or target system)
- Example: As you can see in the below image, ProductID is present in the OrderDetails and Products tables. Do an exact match verification for ProductID in the OrderDetails table vs the Products table.
(ii) Checks across schemas:
- The data entity might be migrated as-is into the Target schema i.e. it is present in the source system as well as the target system
- Example: As you can see in the above image, ProductID is present in the Products table in the source system and Products table in the target system. Do an exact match verification for ProductId in Products table in the source system to ProductId in Products table in the target system.
Note: It is best to highlight (color code) matching data entities in the Data Mapping sheet for quick reference.
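A minimal sketch of such an exact-match check, with small in-memory sets standing in for the real OrderDetails and Products query results:

```python
# Values referenced in OrderDetails must all exist in Products (within one
# schema), and Products.ProductID must match exactly across source and target.
order_details_product_ids = {10, 11, 12, 13}
products_product_ids_source = {10, 11, 12, 13, 14}
products_product_ids_target = {10, 11, 12, 14}      # 13 missing: a defect

missing_in_products = order_details_product_ids - products_product_ids_source
print("ProductIDs in OrderDetails but not in Products:", missing_in_products)

only_in_source = products_product_ids_source - products_product_ids_target
only_in_target = products_product_ids_target - products_product_ids_source
print("ProductIDs only in source:", only_in_source)
print("ProductIDs only in target:", only_in_target)
```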
#2) Entity Presence
In this type of test, we need to validate that all the entities (Tables and Fields) are matched between source and target. There are two possibilities, an entity might be present or absent as per the Data Model design.
(i) Validate that all the Tables (and columns) that have a corresponding presence in both source and target match. We pull a list of all Tables (and columns) and do a text compare. This sanity test works only if the same entity names are used across both systems.
Sometimes different table names are used, and hence a direct comparison might not work. We might have to map this information in the Data Mapping sheet and validate it for failures.
Another possibility is the absence of data. There are cases where the data model requires that a table (or column) in the source system does not have a corresponding presence in the target system (or vice versa). Have tests to validate this.
- Example: As you can see in the below image, CustDemographic Table is present in the target system and not in the source system.
- CustomerType field in Customers table has data only in the source system and not in the target system.
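A minimal sketch of the entity presence comparison, using hypothetical (table, column) pairs such as would be pulled from each system's catalog (for example INFORMATION_SCHEMA.COLUMNS), plus the expected absences from the data model:

```python
# Entities captured from the source and target catalogs.
source_entities = {("Customers", "CustomerID"), ("Customers", "CustomerType"),
                   ("Orders", "OrderID")}
target_entities = {("Customers", "CustomerID"), ("Orders", "OrderID"),
                   ("CustDemographic", "CustomerID")}

# Entities expected to exist on only one side, as per the data model design.
expected_only_in_source = {("Customers", "CustomerType")}
expected_only_in_target = {("CustDemographic", "CustomerID")}

unexpected_missing = (source_entities - target_entities) - expected_only_in_source
unexpected_extra = (target_entities - source_entities) - expected_only_in_target
print("Unexpectedly missing in target:", unexpected_missing)
print("Unexpectedly present in target:", unexpected_extra)
```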
#3) Data Accuracy
As the name suggests, we validate if the data is logically accurate. There are two categories for this type of test. With this, the tester can catch the data quality issues even in the source system.
Note: Run this test in the target system and backcheck in the source system for any defects.
(i) Non-numerical type: Under this classification, we verify the accuracy of non-numerical content. Examples are emails, PIN codes, and phone numbers in a valid format.
(ii) Domain analysis: In this type of test, we pick domains of data and validate for errors. There are three groupings for this:
- Based on Value: Here we create a list of values that can occur for a field (column in the table). Then validate if the column values are a subset of our list.
- Example: Verify if the Gender column contains either M or F.
- Based on Range: Here we set minimum and maximum range for valid data values for a column, based on logical or business reasoning. We then validate if the column values fall within this range.
- Example: 0 to 120 for Age.
- Reference File: Here the system uses an external validity file.
- Example: Are country codes valid? Do they pick the right value from the reference file? Are country codes the same between the QA and Production environments? If a country code was updated in the reference file, is it correctly updated in the DB?
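A minimal sketch combining the three domain analysis groupings (value-based, range-based, and a reference file of country codes); all sample values and column names are illustrative:

```python
rows = [
    {"Gender": "M", "Age": 34, "CountryCode": "US"},
    {"Gender": "X", "Age": 150, "CountryCode": "ZZ"},   # seeded bad row
]

allowed_genders = {"M", "F"}                  # value-based domain
age_range = (0, 120)                          # range-based domain
reference_country_codes = {"US", "IN", "DE"}  # loaded from the reference file

for i, row in enumerate(rows):
    if row["Gender"] not in allowed_genders:
        print(f"Row {i}: invalid Gender {row['Gender']!r}")
    if not age_range[0] <= row["Age"] <= age_range[1]:
        print(f"Row {i}: Age {row['Age']} outside {age_range}")
    if row["CountryCode"] not in reference_country_codes:
        print(f"Row {i}: CountryCode {row['CountryCode']!r} not in reference file")
```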
#4) Metadata Validation
In Metadata validation, we validate that the Table and Column data type definitions for the target are correctly designed and, once designed, that they are executed as per the data model design specifications.
There are two groupings here:
(i) Metadata design: The first check is to validate that the data model is correctly designed as per the business requirements for the target tables. Data architects may migrate schema entities or can make modifications when they design the target system.
The next check should be to validate that the right scripts were created using the data models.
For each category below, we first verify if the metadata defined for the target system meets the business requirement and secondly, if the tables and field definitions were created accurately.
A few of the metadata checks are given below:
- Data Type Check: Example: Will Total Sales work correctly with Decimal (8, 16, or 20 bytes) or Double type?
- Data Length Check: Example: Will the data length for the Address field be sufficient at 500 chars? It might be a case where data migration is being done because a new geography is getting added to the company. The addresses of the new geography might have an exceedingly long format, and sticking to the original length might break a use case.
- Index check: Example: Is there indexing done for the OrderId column in the target system? What if a merger of companies happened, requiring data migration and the Orders table grows 100 times in size in the target system?
- Metadata Check Across Environments: Under this check verify that Metadata matches between the QA test and the Production environment. Tests might pass in the QA environment but fail in other environments.
(ii) Delta change: These tests uncover defects that arise when the project is in progress and, midway, changes are made to the source system’s metadata that do not get implemented in the target system.
Example: A new field CSI (Customer Satisfaction Index) was added to the Customer table in the source, but the change was not made to the target system.
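A minimal sketch that compares column metadata (data type and length) between the two systems, using hypothetical dictionaries in place of real catalog queries:

```python
# (table, column) -> (data_type, max_length) as captured from each environment.
source_metadata = {
    ("Customers", "Address"): ("varchar", 500),
    ("Customers", "CSI"): ("decimal", 5),          # delta change added in source
    ("Orders", "TotalSales"): ("decimal", 20),
}
target_metadata = {
    ("Customers", "Address"): ("varchar", 200),    # too short: likely a defect
    ("Orders", "TotalSales"): ("decimal", 20),
}

for entity, (dtype, length) in source_metadata.items():
    if entity not in target_metadata:
        print(f"{entity}: missing in target (possible delta change defect)")
        continue
    t_dtype, t_length = target_metadata[entity]
    if t_dtype != dtype or t_length < length:
        print(f"{entity}: source {dtype}({length}) vs target {t_dtype}({t_length})")
```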
#5) Data Integrity
Here, we mainly validate the integrity constraints like Foreign key, Primary key reference, Unique, Default, etc.
For foreign keys, we need to check if there are orphan records in the child table where the foreign key used is not present in the parent table.
Example: The Customers table has CustomerID as a Primary key. The Orders table has CustomerID as a Foreign key. The Orders table might have a CustomerID that is not present in the Customers table. We need to have tests to uncover such integrity constraint violations. The Data Mapping sheet will give you clarity on which tables have these constraints.
Note: Run this test in the target system and backcheck in the source system if there are defects.
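A minimal sketch of an orphan record check using sqlite3 from the Python standard library, with small sample Customers and Orders tables seeded in memory in place of the real target database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY);
    CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER);
    INSERT INTO Customers VALUES (1), (2);
    INSERT INTO Orders VALUES (101, 1), (102, 2), (103, 99);  -- 99 is an orphan
""")

# Orders whose CustomerID has no matching parent row in Customers.
orphans = conn.execute("""
    SELECT o.OrderID, o.CustomerID
    FROM Orders o
    LEFT JOIN Customers c ON c.CustomerID = o.CustomerID
    WHERE c.CustomerID IS NULL
""").fetchall()
print("Orphan Orders rows:", orphans)
```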
#6) Data Completeness
These are sanity tests that uncover missing records or row count mismatches between the source and target tables, and they can be run frequently once automated.
There are two types of tests:
(i) Record count: Here, we compare the total count of records for matching tables between the source and target systems. This is a quick sanity check run after the ETL or Migration job completes. We have a defect if the counts do not match.
At times, there are rejected records during the job run. Some of these may be valid, but as testers we should call them out and confirm they are expected.
(ii) Column data profiling: This type of sanity test is valuable when record counts are huge. Here, we create logical sets of data that reduce the record count and then do a comparison between source and target.
- Where feasible, filter all unique values in a column. For example, ProductID might occur multiple times in the OrderDetails table. Pick a unique list of ProductIDs from both the target and source tables and validate them. This greatly reduces the record count and speeds up sanity tests.
- Like the above tests, we can also pick all the major columns and check whether KPIs (minimum, maximum, average, maximum or minimum length, etc.) match between the target and source tables. Example: Take the average, minimum, and maximum values of the Price column in OrderDetails and compare these values between the target and source tables for mismatches.
- Another check can be done for Null values. Pick important columns and filter out a list of rows where the column contains Null values. Compare these rows between the target and source systems for the mismatch.
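A minimal sketch of the record count and column profiling comparisons, with small in-memory lists standing in for the real source and target query results on the Price column:

```python
source_prices = [10.0, 25.5, None, 99.9]
target_prices = [10.0, 25.5, None]          # one row lost in the load

def profile(prices):
    # KPIs for one column: counts plus min/max/average of the non-null values.
    non_null = [p for p in prices if p is not None]
    return {
        "row_count": len(prices),
        "null_count": prices.count(None),
        "min": min(non_null),
        "max": max(non_null),
        "avg": round(sum(non_null) / len(non_null), 2),
    }

src, tgt = profile(source_prices), profile(target_prices)
for kpi in src:
    if src[kpi] != tgt[kpi]:
        print(f"Mismatch on {kpi}: source={src[kpi]} target={tgt[kpi]}")
```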
#7) Data Transformation
These tests form the core tests of the project. Review the requirements document to understand the transformation requirements. Prepare test data in the source systems to reflect different transformation scenarios. These have a multitude of tests and should be covered in detail under ETL testing topics.
Below is a concise list of tests covered under this:
(i) Transformation:
- Example: ETL code might have logic to reject invalid data. Verify these against requirements.
- ETL code might also contain logic to auto-generate certain keys like surrogate keys. We need to have tests to verify the correctness (technical and logical) of these.
- Validate the correctness of joining or splitting of field values after an ETL or Migration job is done.
- Have tests to verify referential integrity checks. For example, a type of defect could be ProductId used in Orders table is not present in parent table Products. Have a test to verify how the orphan records behave during an ETL job.
- At times, missing data is inserted using the ETL code. Verify the correctness of these.
- ETL or Migration scripts sometimes have logic to correct data. Verify data correction works.
- Verify if invalid/rejected/errored out data is reported to users.
- Create a spreadsheet of scenarios of input data and expected results and validate these with the business customer.
(ii) Edge cases: Verify that Transformation logic holds good at the boundaries.
- Example: What happens when TotalSales with a value of 1 Trillion is run through an ETL job? Do the end-to-end cases work? Identify fields that can potentially have large values and run tests with these large values. They should include numerical and non-numerical values.
- For date fields, include the entire range of dates expected: leap years, 28/29 days for February, and 30/31 days for other months.
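A minimal sketch of boundary tests for a hypothetical transformation function (a stand-in for the real ETL logic), feeding in very large numeric values and edge-of-range dates:

```python
from datetime import date
from decimal import Decimal

def transform_total_sales(value):
    # Hypothetical stand-in for the real ETL transformation under test.
    return Decimal(value).quantize(Decimal("0.01"))

edge_values = [Decimal("0.01"), Decimal("999999999999.99"), Decimal("1e12")]
for value in edge_values:
    result = transform_total_sales(value)
    # The value should round-trip without overflow or silent truncation.
    assert result == value.quantize(Decimal("0.01")), f"Unexpected result for {value}"

# Date boundaries: leap day and month-end dates should survive the transformation.
edge_dates = [date(2024, 2, 29), date(2023, 2, 28), date(2023, 4, 30), date(2023, 12, 31)]
print("Edge cases exercised:", edge_values, edge_dates)
```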
#8) Data Uniqueness Or Duplication
In this type of test, identify the columns that should have unique values as per the data model. Also take into consideration any business logic that weeds out such data. Run tests to verify if they are unique in the system. Next, run tests to identify the actual duplicates.
- Example: Filter for duplicate data and verify if it is authentic. For example, Employee dependent record contains the same sibling data twice.
- The user phone number should be unique in the system (business requirement).
- The business requirement says that the combination of ProductID and ProductName in the Products table should be unique, since ProductName alone can be duplicated.
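A minimal sketch of a duplicate check for a column that should be unique (phone number) and a composite key (ProductID, ProductName), with illustrative rows:

```python
from collections import Counter

customers = [
    {"CustomerID": 1, "Phone": "555-0100"},
    {"CustomerID": 2, "Phone": "555-0101"},
    {"CustomerID": 3, "Phone": "555-0100"},      # duplicate phone: a defect
]
products = [
    {"ProductID": 10, "ProductName": "Mouse"},
    {"ProductID": 11, "ProductName": "Mouse"},   # allowed: name alone may repeat
]

phone_counts = Counter(row["Phone"] for row in customers)
print("Duplicate phone numbers:", [p for p, n in phone_counts.items() if n > 1])

combo_counts = Counter((row["ProductID"], row["ProductName"]) for row in products)
print("Duplicate (ProductID, ProductName):", [k for k, n in combo_counts.items() if n > 1])
```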
#9) Mandatory
In this type of test, identify all fields marked as Mandatory and validate that the mandatory fields have values. If there are default values associated with a field in the DB, verify that the default is populated correctly when data is not present.
- Example: If BillDate is not entered, then CurrentDate is the BillDate.
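A minimal sketch of a mandatory-field and default-value check for the BillDate example, using illustrative invoice rows and an assumed load date:

```python
from datetime import date

invoices = [
    {"InvoiceID": 1, "BillDate": date(2023, 5, 1)},
    {"InvoiceID": 2, "BillDate": None},            # mandatory value missing
]

load_date = date(2023, 5, 2)   # the date the ETL/migration job ran (assumed)
for row in invoices:
    if row["BillDate"] is None:
        print(f"Invoice {row['InvoiceID']}: BillDate missing; "
              f"expected default (CurrentDate at load time = {load_date})")
```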
#10) Timeliness
Always document tests that verify that you are working with data from the agreed-upon timelines.
- Example: ProductDiscount was updated 15 days back and business domain states ProductDiscount changes every seven days. This means your tests are not being done with the right discount values.
- A predictive analytics report for the Customer satisfaction index was supposed to work with the last 1-week data, which was a sales promotion week at Walmart. But the ETL job was designed to run at a frequency of 15 days. This is a major defect that testers can uncover.
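A minimal sketch of a timeliness check that flags data older than the agreed refresh window (seven days for ProductDiscount in the example above); the last-updated date is illustrative:

```python
from datetime import date, timedelta

agreed_refresh_days = 7
last_updated = {"ProductDiscount": date.today() - timedelta(days=15)}

for entity, updated_on in last_updated.items():
    age = (date.today() - updated_on).days
    if age > agreed_refresh_days:
        print(f"{entity} is {age} days old; exceeds the {agreed_refresh_days}-day window")
```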
#11) Null Data
In this type of test, we focus on the validity of null data and verify that important columns cannot be null.
- Example: Filter out all null data and validate if null is allowed.
- If there are important columns for business decisions, make sure nulls are not present.
#12) Range Check
Data entities where ranges make business sense should be tested.
- Example: Order quantity per invoice cannot be more than 5K in the software category.
- Age should not be more than 120.
#13) Business Rules
Document any business requirements for fields and run tests for the same.
- Example: Resources with age less than 20 years are not eligible. Data validation checks are required if this rule is applied to the data.
- Termination date should be null if Employee Active status is True/Deceased.
- FROM Date should be less than TO Date.
- Item-level purchase amounts should sum to the order-level amount (see the sketch below).
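A minimal sketch of that last rule, checking that item-level amounts sum to the order-level total; table layout and values are illustrative:

```python
order_totals = {1001: 150.00, 1002: 80.00}
order_items = [
    {"OrderID": 1001, "Amount": 100.00},
    {"OrderID": 1001, "Amount": 50.00},
    {"OrderID": 1002, "Amount": 75.00},     # sums to 75, not 80: a defect
]

item_sums = {}
for item in order_items:
    item_sums[item["OrderID"]] = round(item_sums.get(item["OrderID"], 0) + item["Amount"], 2)

for order_id, total in order_totals.items():
    if item_sums.get(order_id) != total:
        print(f"Order {order_id}: items sum to {item_sums.get(order_id)} but order total is {total}")
```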
#14) Aggregate Functions
Aggregate functions are built into the functionality of the database. Document all aggregates in the source system and verify if aggregate usage gives the same values in the target system (sum, max, min, count).
Quite often, the tools on the source system are different from those on the target system. Check if both tools execute aggregate functions in the same way.
#15) Data Truncation & Rounding
In these types of tests, we identify fields with truncation and rounding logic concerning the business. We then document and get signoff on the truncation and rounding logic with the Product owners and test it with production-representative data.
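A minimal sketch of a truncation and rounding check against documented rules (the rules here are assumed: Address truncated to 500 characters, Price rounded half-up to two decimals), comparing illustrative source and target rows:

```python
from decimal import Decimal, ROUND_HALF_UP

source_rows = [{"Address": "A" * 600, "Price": Decimal("19.995")}]
target_rows = [{"Address": "A" * 500, "Price": Decimal("19.99")}]   # truncated and rounded

for src, tgt in zip(source_rows, target_rows):
    if len(tgt["Address"]) < len(src["Address"]):
        print(f"Address truncated from {len(src['Address'])} to {len(tgt['Address'])} chars")
    # Half-up rounding of 19.995 should give 20.00, so 19.99 is flagged.
    expected_price = src["Price"].quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    if tgt["Price"] != expected_price:
        print(f"Price rounding mismatch: expected {expected_price}, found {tgt['Price']}")
```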
#16) Encoding Tests
Validate if there are encoded values in the source system and verify if the data is rightly populated post the ETL or data migration job into the target system.
- Example: Double-byte characters for FirstName in Chinese were accepted and encoded in the source system. Verify the behavior of this field when it is moved to the target system.
- The Password field was encoded and migrated. Ensure it works fine post-migration.
#17) Regression Tests
This is a basic testing concept where testers rerun their entire critical test case suite, generated using the above checklist, after a change is made to the source or target system.
Conclusion
So, we have seen that data validation is an interesting area to explore for data-intensive projects and forms the most important tests. The data mapping sheet is a critical artifact that testers must maintain to achieve success with these tests. They can maintain multiple versions with color highlights to form inputs for any of the tests above.
Care should be taken to maintain the delta changes across versions.
We request readers to share other areas of testing that they have come across during their work, to benefit the tester community.