Test Data Management: What is Test Data and How to Design It

A Comprehensive Test Data Design and Management Guide (Part I):

In the current era of rapid growth in information and technology, testers commonly consume large amounts of test data throughout the software testing life cycle.

Testers don’t only collect and maintain data from existing sources; they also generate huge volumes of test data to ensure a quality contribution to delivering the product for real-world use.

Therefore, we as testers must continuously explore, learn and apply the most efficient approaches for data collection, generation, maintenance, automation and comprehensive data management for all types of functional and non-functional testing.

In this tutorial, I will provide tips on how to prepare test data so that no important test case is missed because of improper data or an incomplete test environment setup.

What is Test Data and Why It’s Important

According to a study conducted by IBM in 2016, searching, managing, maintaining, and generating test data takes up 30%-60% of testers’ time. This is clear evidence that data preparation is a time-consuming phase of software testing.

Figure 1: Testers Average Time Spent on TDM

Similarly, across many disciplines, most data scientists spend 50%-80% of their model development time organizing data. Add legislation and the handling of Personally Identifiable Information (PII), and the testers’ involvement in data preparation becomes considerable.

Today, business owners treat the credibility and reliability of test data as non-negotiable. Product owners see ghost copies of test data as the biggest challenge, because they reduce the reliability of an application at a time when clients demand strong quality assurance.

Given the significance of test data, the vast majority of software owners don’t accept applications that were tested with fake data or with weak security measures.

At this point, let’s recall what test data is. When we write test cases to verify and validate the features and scenarios of the application under test, we need information to use as input so that the tests can identify and locate defects.

This information needs to be precise and complete to expose bugs. It is what we call test data. Some of it, such as names and countries, is not sensitive, whereas contact information, SSNs, medical history, and credit card information are sensitive in nature.

The data may be in any form like:

If you are writing test cases, you need input data for any kind of test. The tester may provide this input data when executing the test cases, or the application may pick the required input data from predefined data locations.

The data may be any kind of input to the application, any kind of file that is loaded by the application or entries read from the database tables.

Preparing proper input data is part of a test setup. Generally, testers call it a testbed preparation. In testbed, all software and hardware requirements are set using the predefined data values.

If you don’t have a systematic approach for building data while writing and executing test cases, there is a chance of missing some important test cases. Testers can create their own data according to their testing needs.

Don’t rely on the data created by other testers or standard production data. Always create a fresh set of data according to your requirements.

Sometimes it’s not possible to create a completely new set of data for each and every build. In such cases, you can use standard production data. But remember to add your own data sets to this existing database. One good way to create data is to use the existing sample data or testbed and append your new test case data each time you get the same module for testing. This way you can build a comprehensive data set over time.

Test Data Sourcing Challenges

One area testers consider in test data generation is the data sourcing requirement for a subset. For instance, you have over one million customers, and you need one thousand of them for testing. This sample data should be consistent and statistically represent the distribution of the targeted group. In other words, we are supposed to find the right person to test, which is one of the most useful methods of testing the use cases.
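
One way to keep a subset statistically representative is proportional (stratified) sampling. The sketch below is only an illustration: the `segment` field, the 70/20/10 split and the function name `stratified_sample` are assumptions made up for this example, not part of any particular tool.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, sample_size, seed=42):
    """Pick about sample_size records, keeping each group's share of the total."""
    rng = random.Random(seed)          # fixed seed so the subset is repeatable
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)

    total = len(records)
    sample = []
    for members in groups.values():
        # Proportional allocation; keep at least one record per group.
        quota = max(1, round(sample_size * len(members) / total))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample

# Hypothetical population: 10,000 customers in three segments (70/20/10 split).
customers = (
    [{"id": i, "segment": "retail"} for i in range(7000)]
    + [{"id": i, "segment": "business"} for i in range(7000, 9000)]
    + [{"id": i, "segment": "premium"} for i in range(9000, 10000)]
)
subset = stratified_sample(customers, key="segment", sample_size=100)
```

Because each quota is rounded, the result can be slightly larger or smaller than `sample_size` for other group sizes, so treat the size as approximate.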

Additionally, there are some environmental constraints in the process. One of them is mapping PII policies. As privacy is a significant obstacle, the testers need to classify PII data.

Test data management tools are designed to address this issue. These tools suggest policies based on the standards or catalogs they have. This is not an entirely safe exercise, but it still offers the opportunity to audit what one is doing.

To keep up with current and future challenges, we should always ask questions such as: When and where should we start TDM? What should be automated? How much should companies invest in testing, both in ongoing skills development for staff and in newer TDM tools? Should we start with functional or non-functional testing? And many more questions like these.
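
To make the idea of classifying and protecting PII concrete, here is a minimal sketch that masks the fields a project has classified as sensitive. The field names in `SENSITIVE_FIELDS` and the `mask_record` helper are assumptions for illustration; which fields count as sensitive is a decision for your project and its regulations.

```python
import hashlib

# Assumed classification: adjust to your own PII policy.
SENSITIVE_FIELDS = {"ssn", "phone", "credit_card"}

def mask_record(record, sensitive_fields=SENSITIVE_FIELDS):
    """Return a copy of record with sensitive values replaced by a stable token."""
    masked = {}
    for field, value in record.items():
        if field in sensitive_fields:
            digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
            masked[field] = "MASKED-" + digest[:8]   # same input gives same token
        else:
            masked[field] = value
    return masked

patient = {"name": "Test Patient", "country": "US", "ssn": "000-00-0000"}
safe = mask_record(patient)
```

The token is stable, so joins between tables still line up after masking. Note that a plain hash of a short value such as an SSN is not strong anonymization on its own; real projects usually add a secret salt or use a dedicated masking tool.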

Some of the most common challenges of Test Data Sourcing are mentioned below:

On the white box side of data testing, the developers prepare the production data. That is where QA engineers need to touch base with the developers to extend the test coverage of the AUT. One of the biggest challenges is to incorporate all possible scenarios (100% of test cases), including every possible negative case.

In this section, we talked about test data challenges. You can add more challenges as you encounter and resolve them. Next, let’s explore different approaches to test data design and management.

Strategies for Test Data Preparation

We know from everyday practice that players in the testing industry are continuously trying different ways and means to improve testing efforts and, most importantly, their cost efficiency. In the short course of information technology evolution, we have seen that output increases substantially when tools are incorporated into production and testing environments.

When we talk about the completeness and full coverage of testing, it mainly depends on the quality of the data. As testing is the backbone for attaining the quality of the software, test data is the core element in the process of testing.

Figure 2: Strategies for Test Data Management (TDM)

Creation of flat files based on the mapping rules. It is always practical to create the subset of the data you need from the production environment where developers designed and coded the application. Indeed, this approach reduces the testers’ efforts of data preparation, and it maximizes the use of the existing resources for avoiding further expenditures.

Typically, we need to create the data or at least identify it based on the type of the requirements each project has in the very beginning.

We can apply the following strategies for handling the process of TDM:

  1. Data from the production environment
  2. SQL queries that extract data from the client’s existing databases
  3. Automated Data Generation Tools
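
As a small illustration of the second strategy, the sketch below pulls a filtered subset of rows with a SQL query. It uses Python’s built-in `sqlite3` module and an in-memory database; the `customers` table, its columns and the `extract_subset` helper are hypothetical stand-ins for a real client database.

```python
import sqlite3

# In-memory database standing in for a client's existing database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT, active INTEGER)"
)
conn.executemany(
    "INSERT INTO customers (country, active) VALUES (?, ?)",
    [("US", 1), ("US", 0), ("DE", 1), ("IN", 1), ("IN", 1)],
)

def extract_subset(connection, country, limit):
    """Pull a small, filtered subset of rows to use as test data."""
    query = (
        "SELECT id, country, active FROM customers "
        "WHERE country = ? AND active = 1 ORDER BY id LIMIT ?"
    )
    # Parameters are bound with placeholders rather than string formatting.
    return connection.execute(query, (country, limit)).fetchall()

rows = extract_subset(conn, "IN", 10)
```

The `ORDER BY` clause keeps the extract repeatable between runs, which makes failures easier to reproduce.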

Testers should back up their testing with complete data by considering the elements shown in Figure 3. Testers in agile development teams generate the necessary data for executing their test cases. When we talk about test cases, we mean cases for various types of testing, such as white box, black box, performance, and security.

At this point, we know that data for performance testing should make it possible to determine how fast the system responds under a given workload, so it should be very close to real or live large-volume data with significant coverage.

For white box testing, the developers prepare their required data to cover as many branches as possible, all paths in the program source code, and the negative Application Program Interface (API).

Figure 3: Test Data Generation Activities

In the end, everybody working in the software development life cycle (SDLC), such as BAs, developers, and product owners, should be well engaged in the process of test data preparation. It can be a joint effort. Now let’s turn to the issue of corrupted test data.

Corrupted Test Data

Before executing any test cases on our existing data, we should make sure that the data is not corrupted or outdated and that the application under test can read the data source. Typically, when more than one tester works on different modules of an AUT in the same testing environment at the same time, the chances of data getting corrupted are high.

In the same environment, testers modify the existing data according to the needs of their test cases. Mostly, when testers are done with the data, they leave it as it is. When the next tester picks up the modified data and runs another test, that test may fail for reasons that have nothing to do with a code error or defect.

In most cases, this is how data becomes corrupted or outdated, which leads to failures. To avoid or minimize the chance of data discrepancies, we can apply the solutions below. And of course, you can add more solutions in the comments section at the end of this tutorial.
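
One simple way to check that a shared data file is still readable and unchanged before a run is to compare it against a recorded checksum. This is a minimal sketch under that assumption; the helper names `file_checksum` and `data_is_intact` and the temporary CSV file are made up for the demonstration.

```python
import hashlib
import os
import tempfile

def file_checksum(path):
    """SHA-256 of a file, read in chunks so large files are fine."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def data_is_intact(path, expected_checksum):
    """True if the file can be read and still matches the recorded checksum."""
    try:
        return file_checksum(path) == expected_checksum
    except OSError:
        return False   # missing or unreadable data source

# Demonstration with a temporary file standing in for a shared data file.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("id,name\n1,Test Patient\n")
    data_path = tmp.name

baseline = file_checksum(data_path)
intact_before = data_is_intact(data_path, baseline)

with open(data_path, "a") as handle:   # simulate another tester changing the file
    handle.write("2,Changed Row\n")
intact_after = data_is_intact(data_path, baseline)
os.remove(data_path)
```

A checksum only tells you that something changed, not what changed, so it works best as a quick gate before the test run starts.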



  1. Keep a backup of your data
  2. Return your modified data to its original state
  3. Divide the data among the testers
  4. Keep the data warehouse administrator updated on any data change or modification
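
The first two solutions can be combined into one habit: take a backup before the test and restore it afterwards, whatever happens in between. Here is a minimal sketch using an in-memory list as the shared data set; the `preserved` context manager and the sample patient records are assumptions for illustration only.

```python
import copy
from contextlib import contextmanager

@contextmanager
def preserved(dataset):
    """Back up a mutable list-based dataset and restore it when the test ends."""
    backup = copy.deepcopy(dataset)
    try:
        yield dataset
    finally:
        # Restore in place so every other reference sees the original data again.
        dataset[:] = backup

shared_patients = [{"id": 1, "name": "Test Patient", "status": "active"}]

with preserved(shared_patients) as data:
    data[0]["status"] = "discharged"          # change needed by one test
    data.append({"id": 2, "name": "Temp", "status": "active"})
    status_during_test = data[0]["status"]
```

The `finally` clause means the restore also runs when the test raises an exception, which is exactly the case where modified data tends to be left behind.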

How to keep your data intact in any test environment?

Most of the time, many testers are responsible for testing the same build. In this case, more than one tester has access to common data, and each will try to manipulate the common data set according to their needs.

If you have prepared data for some specific modules, the best way to keep your data set intact is to keep backup copies of it.

Test Data for the Performance Test Case

Performance tests require a very large data set. Sometimes manually created data will not expose subtle bugs that can only be caught by actual data created by the application under test. If you want real-time data, which is impossible to create manually, ask your lead/manager to make it available from the live environment.

This data will be useful to ensure the smooth functioning of the application for all valid inputs.
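
When live data is not available, a script can at least produce the volume. The sketch below streams synthetic rows into CSV so memory use stays flat; the column names, the name list and the helpers `generate_rows` and `write_csv` are assumptions for illustration, and synthetic rows will not reproduce the quirks of real application data.

```python
import csv
import io
import random

def generate_rows(count, seed=0):
    """Yield synthetic patient-like rows one at a time."""
    rng = random.Random(seed)                 # seeded so runs are repeatable
    first_names = ["Alex", "Sam", "Jordan", "Taylor"]
    for row_id in range(1, count + 1):
        yield {
            "id": row_id,
            "name": rng.choice(first_names),
            "age": rng.randint(0, 99),
        }

def write_csv(rows, stream):
    """Write rows to an open text stream and return how many were written."""
    writer = csv.DictWriter(stream, fieldnames=["id", "name", "age"])
    writer.writeheader()
    written = 0
    for row in rows:
        writer.writerow(row)
        written += 1
    return written

buffer = io.StringIO()   # swap for open("perf_data.csv", "w", newline="") to write to disk
row_count = write_csv(generate_rows(10_000), buffer)
```

Raising the count to millions of rows only changes the argument, because the generator never holds the whole data set in memory.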

What is the ideal test data?

Data can be said to be ideal if the minimum-size data set identifies all the application errors. Try to prepare data that covers all application functionality without exceeding the cost and time constraints for preparing data and running tests.
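
If you can list which test conditions each candidate record exercises, picking a small covering set becomes a set cover problem. The sketch below uses the standard greedy heuristic, which gives a small set but not always the smallest possible one. The record names, the condition labels and the function `smallest_covering_set` are made up for this example.

```python
def smallest_covering_set(candidates, required):
    """Greedy set cover: keep picking the record that covers the most uncovered conditions."""
    uncovered = set(required)
    chosen = []
    while uncovered:
        best = max(candidates, key=lambda name: len(candidates[name] & uncovered))
        gained = candidates[best] & uncovered
        if not gained:
            raise ValueError(f"no candidate covers: {sorted(uncovered)}")
        chosen.append(best)
        uncovered -= gained
    return chosen

# Hypothetical records mapped to the conditions they exercise.
candidates = {
    "empty_form": {"no_data"},
    "valid_patient": {"valid", "save_to_db"},
    "negative_age": {"invalid", "error_message"},
    "bad_date_format": {"illegal_format", "error_message"},
    "age_at_limit": {"boundary", "valid"},
    "valid_only": {"valid"},                  # redundant: adds nothing new
}
required = {
    "no_data", "valid", "invalid", "illegal_format",
    "boundary", "save_to_db", "error_message",
}
selected = smallest_covering_set(candidates, required)
```

In this example the redundant `valid_only` record is never selected, which is the point: every record you keep should pay for itself in coverage.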

How to Prepare Data that will Ensure Maximum Test Coverage?

Design your data considering the following categories:

1) No data: Run your test cases on blank or default data. See if proper error messages are generated.

2) Valid data set: Create it to check if the application is functioning as per requirements and valid input data is properly saved in database or files.

3) Invalid data set: Prepare an invalid data set to check application behavior for negative values and alphanumeric string inputs.

4) Illegal data format: Make one data set of illegal data format. The system should not accept data in invalid or illegal format. Also, check that proper error messages are generated.

5) Boundary condition data set: A data set containing out-of-range data. Identify application boundary cases and prepare a data set that covers both lower and upper boundary conditions.

6) Data set for performance, load and stress testing: This data set should be large in volume.

Creating separate data sets for each test condition in this way helps ensure complete test coverage.
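
To show how these categories might look for a single field, here is a small sketch. The rule (an `age` field that must be an integer from 18 to 65) and the reference check `is_valid_age` are assumptions invented for the example; the large-volume category is left out because it differs only in size.

```python
# Hypothetical rule: "age" must be an integer from 18 to 65 inclusive.
AGE_MIN, AGE_MAX = 18, 65

datasets = {
    "no_data": [None, ""],
    "valid": [18, 30, 65],
    "invalid": [-5, 200, "abc123"],
    "illegal_format": ["thirty", "3O", "18.5.1"],
    "boundary": [AGE_MIN - 1, AGE_MIN, AGE_MIN + 1, AGE_MAX - 1, AGE_MAX, AGE_MAX + 1],
}

def is_valid_age(value):
    """Reference check used to confirm each data set behaves as labelled."""
    is_integer = isinstance(value, int) and not isinstance(value, bool)
    return is_integer and AGE_MIN <= value <= AGE_MAX

# For every category, record whether each value would be accepted.
results = {
    name: [is_valid_age(value) for value in values]
    for name, values in datasets.items()
}
```

Keeping the categories in one dictionary makes it easy to loop over them in a parameterized test and to spot a category that has no data yet.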

Data for Black Box Testing

Quality assurance testers perform integration testing, system testing, and acceptance testing, which are known as black box testing. In this method of testing, the testers do not work with the internal structure, design, or code of the application under test.

The testers’ primary purpose is to identify and locate errors. To do so, we apply either functional or non-functional testing using different black box testing techniques.

Figure 4: Black Box Data Design Methods

At this point, the testers need test data as input for applying the black box testing techniques. The testers should prepare data that examines all application functionality without exceeding the given cost and time.

We can design the data for our test cases considering data set categories like no data, valid data, invalid data, illegal data format, boundary condition data, equivalence partition, decision table data, state transition data, and use case data. Before going into the data set categories, the testers start by gathering data and analyzing the existing resources of the application under test (AUT).

In line with the earlier points about keeping your data warehouse up to date, you should document the data requirements at the test-case level and mark them reusable or non-reusable when you script your test cases. This ensures that the data required for testing is clear and documented from the very beginning, so you can refer to it later.
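
One lightweight way to record data requirements per test case is a small structured record with a reusable flag. The sketch below is only an illustration; the `DataRequirement` class, the test case IDs and the sample values are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class DataRequirement:
    """Data needed by one test case, with a flag for whether it can be reused."""
    test_case_id: str
    description: str
    reusable: bool
    values: dict = field(default_factory=dict)

requirements = [
    DataRequirement("TC-001", "Valid login credentials", reusable=True,
                    values={"username": "qa_user"}),
    DataRequirement("TC-002", "New patient created once per run", reusable=False),
]

# Reusable data can be loaded once; the rest must be recreated for every run.
reusable_ids = [req.test_case_id for req in requirements if req.reusable]
```

The same structure can be exported to a spreadsheet or kept next to the test scripts, whichever your team already uses for test documentation.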

Test Data Example for Open EMR AUT

For our current tutorial, we have the Open EMR as the Application Under Test (AUT).

=> Please find the link for Open EMR application here for your reference/practice.

The table below illustrates a sample of the data requirement gathering that can be part of the test case documentation and is updated when you write the test cases for your test scenarios.

Creation of manual data for testing Open EMR application

Let’s step forward to the creation of manual data for testing Open EMR application for the given data set categories.

1) No Data: The tester validates the Open EMR application URL and the “Search or Add Patient” function with no data.

2) Valid Data: The tester validates the Open EMR application URL and the “Search or Add Patient” function with valid data.

3) Invalid Data: The tester validates the Open EMR application URL and the “Search or Add Patient” function with invalid data.

4) Illegal Data Format: The tester validates the Open EMR application URL and the “Search or Add Patient” function with data in an illegal format.

Test Data for 1-4 data set categories:

5) Boundary Condition Data Set: It is to determine input values for boundaries that are either inside or outside of the given values as data.

6) Equivalence Partition Data Set: It is the testing technique that divides your input data into partitions of valid and invalid input values.

Test Data for the 5th and 6th data set categories, which is for the Open EMR username and password:
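
Both techniques can be scripted once the valid range is known. The sketch below assumes a username length rule of 5 to 20 characters purely for illustration; this is not taken from Open EMR, so check the real rule in your AUT. The helper names are also made up.

```python
def boundary_values(low, high):
    """Values just outside, on, and just inside each end of an inclusive range."""
    return [low - 1, low, low + 1, high - 1, high, high + 1]

def partition_representatives(low, high):
    """One representative value per partition: below, inside, and above the range."""
    return {
        "below_range": low - 1,           # invalid partition
        "in_range": (low + high) // 2,    # valid partition
        "above_range": high + 1,          # invalid partition
    }

MIN_LEN, MAX_LEN = 5, 20                  # assumed username length limits

boundary_lengths = boundary_values(MIN_LEN, MAX_LEN)
boundary_usernames = ["u" * length for length in boundary_lengths]
partitions = partition_representatives(MIN_LEN, MAX_LEN)
```

Equivalence partitioning keeps the data set small by testing one value per partition, while boundary values add the edge cases where off-by-one defects usually hide.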

  

7) Decision Table Data Set: It is the technique for qualifying your data with a combination of inputs to produce various results. This method of black box testing helps you reduce the effort of verifying each and every combination of test data. Additionally, this technique can help you ensure complete test coverage.

Please see below the decision table data set for Open EMR application’s username and the password.

The calculation of the combinations done in the table above is described for your detailed information as below. You may need it when you do more than four combinations.
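
For conditions that are each either true or false, the number of combinations is 2 raised to the number of conditions, so two conditions (username and password) give 4 rows and three give 8. A minimal sketch that enumerates them is shown here; the condition names and the outcome rule (login succeeds only when both are valid) are assumptions for illustration.

```python
from itertools import product

conditions = ["username_valid", "password_valid"]

def decision_table(condition_names):
    """Enumerate every True/False combination: 2 ** n rows for n conditions."""
    rows = []
    for combo in product([True, False], repeat=len(condition_names)):
        rule = dict(zip(condition_names, combo))
        rule["login_succeeds"] = all(combo)   # assumed outcome rule
        rows.append(rule)
    return rows

table = decision_table(conditions)
```

Generating the rows this way avoids missing a combination by hand once you go beyond a handful of conditions.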

8) State Transition Test Data Set: It is the testing technique that helps you to validate the state transition of the Application Under Test (AUT) by providing the system with the input conditions.

For example, we log in to the Open EMR application by providing the correct username and password on the first attempt, and the system gives us access. If we enter incorrect login data, the system denies access. State transition testing validates how many login attempts you can make before Open EMR closes.

The table below indicates how the system responds to correct or incorrect login attempts:

9) Use Case Test Data: It is the testing method that identifies our test cases by capturing the end-to-end testing of a particular feature.
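
The login flow can be modelled as a small state machine to derive the attempt sequences you need as test data. In this sketch the limit of three attempts, the state names and the function `login_states` are assumptions for illustration; the real limit depends on how the AUT is configured.

```python
MAX_ATTEMPTS = 3   # assumed limit for illustration; check the real setting in the AUT

def login_states(attempts, max_attempts=MAX_ATTEMPTS):
    """Walk a list of True/False login attempts and return the visited states."""
    states = ["login_page"]
    failures = 0
    for correct in attempts:
        if states[-1] in ("access_granted", "locked"):
            break                          # terminal states: no further transitions
        if correct:
            states.append("access_granted")
        else:
            failures += 1
            states.append("locked" if failures >= max_attempts else f"retry_{failures}")
    return states

first_try = login_states([True])
third_try = login_states([False, False, True])
locked_out = login_states([False, False, False, True])
```

Each returned path is one row of state transition test data: the inputs are the attempt sequence, and the expected result is the final state.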

Example, Open EMR Login:

Also read => Data design techniques

Conclusion

Creating complete software test data in compliance with industry standards, legislation and the baseline documents of the project is among the core responsibilities of testers. The more efficiently we manage the test data, the more reliably we can deploy reasonably bug-free products for real-world users.

Test data management (TDM) is the process of analyzing challenges and then introducing and applying the best tools and methods to address the identified issues without compromising the reliability and full coverage of the end output (product).

We always need to keep asking questions in search of innovative and more cost-effective ways of analyzing and selecting testing methods, including the use of tools for generating data. It is widely accepted that well-designed data helps us identify defects of the application under test in every phase of a multi-phase SDLC.

We need to be creative and collaborate with all the members within and outside our agile team. Please share your feedback, experience, questions, and comments so that we can keep our technical discussions going and maximize our positive impact on the AUT by managing data.

Preparing proper test data is a core part of “project test environment setup”. We can’t simply skip a test case by saying that complete data was not available for testing. The tester should create his/her own test data in addition to the existing standard production data. Your data set should be ideal in terms of cost and time.

Be creative, and use your own skill and judgment to create different data sets instead of relying on standard production data.

Part II – The second part of this tutorial is on the “Test Data Generation with GEDIS Studio Online Tool”.

About the Authors: Haroon and Parwana wrote this hands-on tutorial as the guest authors. They both believe in continuous learning and the application of the learning in everyday work.

Over to you

Have you faced the problem of incomplete test data for testing? How did you manage it? Please share your tips, experience, comments, and questions to further enrich this topic of discussion.