Comprehensive Hadoop Testing Tutorial | Big Data Testing Guide

By Vijay

Updated March 7, 2024

This tutorial explains the Basics, Testing Types, Plans, Required Environment, Testing Process, and Validation & Verification for Hadoop and BigData Testing:

In this tutorial, we will cover a basic introduction to Hadoop and BigData Testing, including when and where testing comes into the picture and what needs to be tested as a part of Hadoop Testing.

We will also discuss the following topics in detail:

  • Roles and Responsibilities of Hadoop Testing
  • Testing Approach for Hadoop/ BigData Testing

=> Check Here To See The A-Z Of BigData Training Tutorials.

Introduction to BigData and Hadoop Testing

Storing And Processing Data In Hadoop

To perform these processes on the Hadoop system, the manpower is categorized into four roles.

  • Hadoop Administrators are responsible for setting up the environment and have the Administration Rights to access the Hadoop Systems.
  • Hadoop Developers develop the programs for pulling, storing, and processing the data from different locations into centralized locations.
  • Hadoop Testers validate and verify the data before it is pulled from the different locations and after it is pulled to the centralized location; validation & verification is also done while loading the data to the client environment.
  • Hadoop Analysts operate when data loading is done and when the data reaches the warehouse at the client location. They use this data for report and dashboard generation. The analysts perform the data analysis for growth and business development.

We know that Hadoop is not a single system; it contains multiple systems and machines. The data is split and stored across multiple machines, and if we want to access it again, we need to combine it and pull it back into reports and so on.

The developer is responsible for writing programs in JAVA and Python to extract the data and store it.

The other job of a developer is to process the data. Hadoop has two layers: one for storing, i.e. Hadoop HDFS, and another for processing, i.e. Hadoop MapReduce.

Storing means whatever data we have at the source simply gets stored/inserted into the system. Processing means the data is split across multiple machines, processed, and then combined and sent to the client.

Thus, Storing and Processing are done by programming scripts, and the developer is responsible for writing the scripts.

Apart from programming, the other method to store and process the data in Hadoop is using database applications like Hive, Impala, HBase, etc. These tools don’t need any programming knowledge.
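To make the processing layer concrete, below is a minimal, illustrative sketch of what a developer's scripts might look like as a Hadoop Streaming job in Python. The word-count logic and the file names (mapper.py, reducer.py) are assumptions for illustration only, not the scripts of any real project; Hadoop Streaming simply pipes HDFS data through such scripts via standard input and output.

#!/usr/bin/env python3
# mapper.py -- reads raw input lines from stdin and emits "word<TAB>1" pairs.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

#!/usr/bin/env python3
# reducer.py -- Hadoop Streaming sorts mapper output by key, so counts for the
# same word arrive together and can be summed in a single pass.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))

Such a pair is typically submitted to the cluster with the hadoop-streaming JAR, with HDFS directories given as the input and output paths.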

BigData And Hadoop Testing

Once storing and processing are done by the developer, the data goes for report generation. Before that, we need to verify the processed data for accuracy and check whether it has been loaded and processed correctly or not.

So the program and/or scripts created by a developer need to be verified by the Hadoop or BigData Tester. The tester needs to know basic programming like Mapper, Hive, Pig Scripts, etc. to verify the scripts and to execute the commands.

So, before testing, the testers need to know which programs and scripts are involved and how the code is written, and then think about how to test them. Testing can be done either manually or by using automation tools.

Hadoop involves various kinds of testing like Unit Testing, Regression Testing, System Testing, Performance Testing, etc. These are the common testing types that we use in our normal testing as well as in Hadoop and BigData testing.

We have the same kind of testing terminologies, like test strategy, test scenarios, test cases, etc., in Hadoop and BigData Testing. Only the environment is different, and there are different kinds of techniques that we use to test the BigData and Hadoop System, because here we need to test the data and not the application.

How do we test BigData, and what all needs to be tested in BigData?

For BigData testing, we need to have some plans and strategies.

Thus we need to consider the following points:

  • What is the strategy or plan of testing for the BigData?
  • What kind of testing approaches are applied to BigData?
  • What is the environment required?
  • How to validate and verify the BigData?
  • What are the tools used in BigData Testing?

Let’s try to get the answers to all the above questions.

What Is The Strategy Or Plan For Testing BigData?

BigData testing means Verification and Validation of data while storing and processing it into the Data Warehouse.

While testing BigData, we need to test the Volume and Variety of data extracted from different databases and loaded as well as processed in the Data Warehouse or Hadoop System; this testing comes under Functional Testing.

We need to test the Velocity of the Data downloaded from various databases and uploaded to the Hadoop System, which is a part of Performance Testing.

So, as a plan or strategy, we need to concentrate on both the Functional and Performance Testing of BigData.

In BigData Testing, the tester has to verify the processing of a huge amount of data using Commodity Hardware and related components. Hence, the quality of data also plays an important role in the testing of BigData. It is essential to verify and validate the quality of the data.

Testing Types For BigData Testing

In the previous section, we saw that Functional Testing and Performance Testing play a vital role in BigData Testing. Apart from that, as a BigData Tester, we need to do a few more types of testing, like Database Testing as well as Architectural Testing.

These testing types are also as important as Functional and Performance Testing.

#1) Architectural Testing

This testing is done to ensure that the processing of data is proper and meets the requirements. The Hadoop System processes huge volumes of data and is highly resource-intensive.

If the architecture is improper, performance may degrade, data processing may be interrupted, and data loss may occur.

#2) Database Testing

Here, process validation comes into the picture, and we need to validate the data from various databases, i.e. we need to ensure that the data fetched from the source or local databases is correct and proper.

Also, we need to check that the data available in the Source Databases matches the data loaded into the Hadoop System.


Similarly, we need to validate that the data in the Hadoop System is correct and proper after processing (or say after transformation) and is loaded to the client's environment with proper validation and verification.

As a part of Database Testing, we need to go through the CRUD operations, i.e. Create the data in the local databases, then Retrieve the data and search for it; it should be available in the database before and after loading into the Data Warehouse, and again when moving from the Data Warehouse to the Client's Environment.

Any Updated data must be verified at every stage of storing, loading, and processing, and any corrupted, duplicate, or null data must be deleted.
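As a small illustration of these checks, the sketch below compares a source extract with the data pulled back from the Hadoop side and flags count mismatches, duplicate keys, and null fields. The CSV file names and the customer_id key column are purely hypothetical assumptions, not part of any real project.

# db_check.py -- a minimal sketch of one Database Testing check, assuming both the
# source table and the Hadoop-side table have been exported to CSV files with a
# "customer_id" key column (file names and column name are hypothetical).
import csv
from collections import Counter

def load_rows(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

source_rows = load_rows("source_customers.csv")
target_rows = load_rows("hadoop_customers.csv")

# 1. Record counts must match between the source database and the Hadoop System.
print("count match:", len(source_rows) == len(target_rows))

# 2. Flag duplicate keys in the loaded data.
key_counts = Counter(row["customer_id"] for row in target_rows)
duplicates = [k for k, c in key_counts.items() if c > 1]
print("duplicate keys:", duplicates)

# 3. Flag null/empty fields that should have been cleaned before loading.
nulls = [row["customer_id"] for row in target_rows
         if any(value in ("", "NULL", None) for value in row.values())]
print("rows with null fields:", nulls)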

#3) Performance Testing

As a part of Performance Testing, we need to check the loading and processing speed of data, i.e. metrics like IOPS (Input/Output Operations Per Second).

We need to check the speed of data ingestion, i.e. data coming as input from various databases into the Data Warehouse or Hadoop System, and from the Hadoop System or Data Warehouse to the Client's Environment.

We must also check the velocity of the data coming out of the various databases and out of the Data Warehouse as output. This is what we measure as Input/Output Operations Per Second or IOPS.

Apart from that, another aspect is to check the performance of the Data Absorption and Data Distribution, and how fast the data is consumed by the Data Warehouse from various Databases and by the Client’s System from the Hadoop System.

Also as a Tester, we need to check the performance of the Data Distribution like, how fast the data is distributed to various files available in the Hadoop System or in the Data Warehouse. Similarly, the same process happens while distributing data to Client Systems.

The Hadoop System or the Data Warehouse consists of multiple components, so a Tester needs to check the performance of all those components like the MapReduce Jobs, data insertion and consumption, the response time of queries and their performance as well as the performance of the search operations. All these are included in Performance Testing.
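As a rough sketch of how one such figure, the load throughput in records per second, could be captured from a test harness, consider the snippet below. The load_batch hook and the dummy batches are hypothetical placeholders; in a real run the hook would call the actual ingestion path (a Sqoop job, an HDFS put, a Hive insert, and so on).

# throughput_check.py -- a rough sketch of measuring load throughput (records/second).
import time

def measure_throughput(batches, load_batch):
    """Return records per second observed while loading the given batches."""
    total_records = 0
    start = time.perf_counter()
    for batch in batches:
        load_batch(batch)          # hypothetical hook into the real ingestion path
        total_records += len(batch)
    elapsed = time.perf_counter() - start
    return total_records / elapsed if elapsed > 0 else 0.0

# Example with a dummy loader; replace the lambda with the real load routine.
if __name__ == "__main__":
    dummy_batches = [[("id", i) for i in range(10_000)] for _ in range(10)]
    rate = measure_throughput(dummy_batches, load_batch=lambda b: None)
    print(f"observed throughput: {rate:,.0f} records/sec")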

#4) Functional Testing

Functional Testing contains testing of all the sub-components, programs & scripts, tools used for performing the operations of Storing or Loading and Processing, etc.

For a Tester, these are the four important types and stages through which the data needs to be filtered so that the client gets accurate and error-free data.

Tools For BigData Hadoop Testing

There are various tools and components involved in testing BigData:

  • HDFS (Hadoop Distributed File System) for storing the BigData.
  • Hadoop MapReduce for processing the BigData.
  • For NoSQL or HQL: Cassandra DB, ZooKeeper, HBase, etc.
  • Cloud-based server tools like EC2.

Testing Environments And Settings

For any type of testing, the Tester needs proper settings and the environment.

Given below is the list of requirements:

  1. Type of data and application that is going to be tested.
  2. Storing and processing require a large amount of space for the huge volume of data.
  3. Proper distribution of files across all the DataNodes in the cluster.
  4. While processing the data, hardware utilization should be kept to a minimum.
  5. Runnable programs and scripts as per the requirement of the application.

Roles And Responsibilities Of Hadoop Testing

As a Hadoop Tester, we are responsible for understanding the requirements, preparing the testing estimations, planning the test cases, getting the test data needed to execute those test cases, being involved in test bed creation, executing the test plans, and reporting & retesting defects.

Also, we need to be responsible for Daily Status Reporting and Test Completion.


The first thing we are going to discuss is the Test Strategy. Once we have a proposed solution to our problem, we need to go ahead and plan or strategize our testing: we may discuss the automation strategy that we will use, the testing schedule that depends upon our delivery dates, and also resource planning.

The automation strategy is something that is going to help us in reducing the manual efforts required in testing the product. The Test Schedule is important as it will ensure the timely delivery of the product.

Resource Planning will be crucial, as we need to plan how many man-hours are needed for our testing and how many Hadoop resources are required to execute our Test Planning.

Once we strategize our testing, we need to go ahead and create the Test Development Plans, which include creating Test Plans, creating Test Scripts that will help us automate our testing, and also identifying the Test Data that is going to be used in the Test Plans and will help us execute them.

Once we are done with the Test Development that includes Creating Test Plans, Test Scripts, and Test Data, we go ahead and start executing those Test Plans.

When we execute the Test Plans, there might be certain scenarios where the actual output is not as expected; those are called defects. Whenever there is a defect, we need to retest it as well, and we need to create and maintain the metrics for those.

All these things fall under the next category which is Defect Management.

What Is Defect Management?

Defect Management consists of Bug Tracking, Bug Fixing, and Bug Verification. Whenever a Test Plan is executed against any of our products and a particular bug or defect is identified, that defect needs to be reported to, or assigned to, the developer.

So the Developer can look into it and start working on it. As Testers, we need to track the progress of the bug and check whether it has been fixed. If the bug has been fixed as reported, then we need to go ahead, retest it & verify that it is resolved.

Once all the bugs are fixed, closed and verified, then we need to go ahead and deliver an OKAY Tested product. But before we deliver the product we must make sure that the UAT (User Acceptance Testing) is completed successfully.

We make sure that installation testing and the requirement verification are done properly, i.e. the product delivered to the client or end-user is as per the requirements mentioned in the Software Requirement Document.

The steps that we have discussed are generic; whichever testing scenarios or testing approaches we use, we go through these steps (or say phases) to test our product and deliver the end result, which is an OKAY Tested product.

Let’s go ahead and discuss this in detail and correlate it with the Hadoop Testing.

We know that Hadoop is used for Batch Processing, and we also know that ETL is one of the fields where Hadoop is used a lot. ETL stands for Extraction, Transformation and Loading. We will discuss these processes in detail when we discuss the Test Plan and Test Strategy from a Hadoop Testing point of view.

As per the diagram mentioned below, assume that we have four different Data Sources: an Operational System, CRM (Customer Relationship Management) and ERP (Enterprise Resource Planning), which are RDBMS (Relational DataBase Management System) sources, and also a bunch of Flat Files, which may be logs, files, records, etc.

We might be using Sqoop or Flume or some other particular product to get the data or records from these Data Sources into our staging directory, which is the first phase of our process, called Extraction.

Once the data is in the Staging Directory, which actually happens to be HDFS (Hadoop Distributed File System), we will use a scripting language such as PIG to transform that data. That transformation will be according to the data that we have.

Once the data is transformed accordingly using whatever scripting technology we have, we will load that data into the Data Warehouse. From the Data Warehouse, that data will be used for OLAP Analysis, Reporting and Data Mining or for Analytics.

Let’s go ahead and discuss which all phases we can use for Hadoop Testing.

The first phase will be the Extraction phase. Here, we get the data from our source databases or from flat files, and what we can do is verify that all the data has been copied successfully and correctly from the source to the Staging Directory.

It may include verifying the number of records, the type of records and the type of fields, etc.
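A minimal sketch of such an Extraction-phase check is shown below; it compares record counts and per-record field counts between a local source flat file and its staged copy in HDFS. The file paths and the pipe delimiter are assumptions for illustration, and the hdfs dfs -cat call assumes the Hadoop client is on the tester's PATH.

# extraction_check.py -- a minimal sketch of the Extraction-phase check, assuming the
# source is a local delimited flat file and the staged copy sits in HDFS (paths and
# delimiter are hypothetical; adjust to the real feed).
import subprocess

SOURCE_FILE = "/data/feeds/customers.dat"        # hypothetical source flat file
STAGED_FILE = "/staging/customers/customers.dat" # hypothetical HDFS path
DELIMITER = "|"

def read_local(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if line.strip()]

def read_hdfs(path):
    out = subprocess.run(["hdfs", "dfs", "-cat", path],
                         capture_output=True, text=True, check=True).stdout
    return [line for line in out.splitlines() if line.strip()]

source = read_local(SOURCE_FILE)
staged = read_hdfs(STAGED_FILE)

# Record count must match after the copy into the Staging Directory.
print("record count match:", len(source) == len(staged))

# Each staged record should still have the same number of fields as its source record.
mismatched = [i for i, (s, t) in enumerate(zip(source, staged))
              if len(s.split(DELIMITER)) != len(t.split(DELIMITER))]
print("records with field-count mismatch:", mismatched)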

Once this data is copied to the Staging Directory, we will go ahead and trigger the second phase, which is Transformation. Here, some business logic will act on the data copied from the source systems and will transform it as per the required business rules.

Transformation may include Sorting the Data, Filtering the Data, Joining the Data from two different Data Sources and certain other operations.

Once the data is transformed, we will go ahead with the test plans that are ready and check whether we are getting the output as expected, whether all the output meets the expected result, and whether the data types, field values, ranges, etc. are falling into place.
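A tiny, rule-driven sketch of such a check on the transformed output might look like this; the columns, types, and ranges in EXPECTED_SCHEMA are made-up examples, not a real specification.

# transform_check.py -- an illustrative sketch of validating transformed records against
# expected data types and value ranges (the schema below is an assumed example).
EXPECTED_SCHEMA = {
    "customer_id": {"type": int, "min": 1},
    "age":         {"type": int, "min": 0, "max": 120},
    "country":     {"type": str, "allowed": {"US", "UK", "IN"}},
}

def validate_record(record):
    """Return a list of human-readable problems found in one transformed record."""
    problems = []
    for column, rule in EXPECTED_SCHEMA.items():
        value = record.get(column)
        if not isinstance(value, rule["type"]):
            problems.append(f"{column}: expected {rule['type'].__name__}, got {value!r}")
            continue
        if "min" in rule and value < rule["min"]:
            problems.append(f"{column}: {value} below minimum {rule['min']}")
        if "max" in rule and value > rule["max"]:
            problems.append(f"{column}: {value} above maximum {rule['max']}")
        if "allowed" in rule and value not in rule["allowed"]:
            problems.append(f"{column}: {value!r} not in allowed set")
    return problems

# Example usage with one good and one bad record.
print(validate_record({"customer_id": 7, "age": 34, "country": "US"}))   # []
print(validate_record({"customer_id": 0, "age": 150, "country": "FR"}))  # three problems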

Once it is correct, we can go ahead and load the data into Data Warehouse.

In the Loading phase, we check whether the number of records in the Stage and the number of records in the Data Warehouse are in sync; they might not be identical, but they are supposed to be in sync. We also check whether the type of data that has been transformed is in sync.

After that, this data will be used for OLAP Analysis, Reporting and Data Mining, which is the last layer of our product, and we can have Test Plans available for all these layers as well.

Whenever we get some Data from the Source into the destination, then we always need to make sure that only Authenticated Persons have Authorized access to the Data.

  • Authentication
  • Authorization

What do we mean by both of these terms?

To understand this, let’s get the things in perspective from the ETL Diagram.

ETL

As per the above diagram, we are getting our Data from Source RDBMS Engines and from Flat Files into HDFS, and that phase is called Extraction.

Let's discuss Authentication first. Certain businesses have data that is restricted by its nature; this type of data is called PII Data as per United States standards.

PII stands for Personally Identifiable Information; any information such as Date of Birth, SSN, Mobile Number, Email Address, House Address, etc. falls under PII. It is restricted and cannot be shared with everyone.

The data should be shared only with the persons who need it the most and those who need it for actual processing. Having this check in place as the first line of defense is called Authentication.

For example, suppose we are using a laptop with Windows installed; we might have a user account created on our Windows Operating System and protected with a password.

This way, only the person who has the credentials for that particular user account can log in to the system, and that is how we safeguard our data from theft or unnecessary access. The other layer is Authorization.

For example, we have two different user accounts on our Windows Operating System: one account is ours (the administrator) and the other might be a guest account. The administrator has the right to do all kinds of operations, like installation and uninstallation of software, creation of new files and deletion of existing files, etc.

On the other hand, the guest user might not have all this kind of access. The guest is authenticated to log in to the system but doesn't have the authority to delete or create files, or to install or uninstall any software on the system.

However, because the guest account is authenticated, it has the right to read the files that are created and use the software that is already installed.

This is how Authentication and Authorization work, and this is what we need to test: whatever data is available in HDFS or any of the file systems needs to be checked for proper Authentication and Authorization of the data.
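The sketch below shows one way such an Authorization check on restricted (PII) data in HDFS could be scripted: it parses the hdfs dfs -ls listing and flags files that are world-accessible or owned by an unexpected account. The /secure/pii directory and the etl_user owner are hypothetical assumptions, and the permission parsing is deliberately simplified (it ignores ACL markers, for instance).

# hdfs_access_check.py -- a hedged sketch of checking Authorization on restricted (PII)
# data in HDFS: restricted files should not be world-readable and should belong to the
# expected owner. Directory and owner names below are assumptions.
import subprocess

PII_DIR = "/secure/pii"          # hypothetical restricted HDFS directory
EXPECTED_OWNER = "etl_user"      # hypothetical service account

listing = subprocess.run(["hdfs", "dfs", "-ls", PII_DIR],
                         capture_output=True, text=True, check=True).stdout

violations = []
for line in listing.splitlines():
    parts = line.split()
    if line.startswith("Found") or len(parts) < 8:
        continue  # skip the "Found N items" header line
    permissions, owner, path = parts[0], parts[2], parts[-1]
    world_bits = permissions[-3:]          # last three characters = "other" permissions
    if world_bits != "---":
        violations.append(f"{path}: world-accessible ({permissions})")
    if owner != EXPECTED_OWNER:
        violations.append(f"{path}: unexpected owner {owner}")

print("authorization violations:", violations or "none")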

Testing Approach For Hadoop Testing / BigData Testing


The Testing Approach is common to all kinds of testing, not just BigData or Hadoop Testing; whether we do normal Manual Testing, Automation Testing, Security Testing or Performance Testing, every kind of testing follows the same approach.

Requirements

As a part of the Testing Approach, we need to start with the Requirements. A requirement is the basic thing; nowadays, in the Agile process, we call them Stories and Epics. An Epic is nothing but a bigger requirement, whereas Stories are smaller requirements.

The requirement basically contains all the Data Models, Targets and Sources, as well as what kind of transformations we need to apply and what kind of tools we have to use. All these kinds of details will be available in the Requirements.

This is basically the Client or Customer Requirement. Based on this requirement, we will start our Testing Process.

Estimation

Another part of the approach is Estimation: how much time we need for the entire testing activity. We do test planning, prepare the test scenarios, prepare the test cases and execute them, find and report defects, and prepare the testing reports as well.

All these activities take some time; working out how much time we need to complete them is basically called the Estimation. We need to give a rough estimate to the management.

Test Planning

Test Planning is nothing but a description of the processes: what to test and what not to test, the scope of testing, the schedules, how many resources are required, the hardware and software requirements, the timelines and testing cycles to be used, the testing levels required, etc.

During Test Planning, certain resources are allocated to the project; which models we have, how many resources are required and what kind of skill sets are required, etc., all these aspects are included in the Test Planning phase.

Most of the time, lead-level or management-level people do the Test Planning.

Test Scenarios And Test Cases

Once we are done with the Test Planning, we need to prepare Test Scenarios and Test Cases. Especially for BigData Testing, we require a few documents along with the requirement document. What all do we need along with this requirement document?

We need the Requirement Document that contains the needs of the client; along with this, we need the input document, i.e. Data Models. Data Model in the sense of the database schemas, the tables and the relationships; all this information will be available in the Data Models.

Also, we have the Mapping Documents. For example, in relational databases we have some tables, and after loading the data through ETL into the Data Warehouse in HDFS, what are all the mappings we need to do, i.e. mapping of data types?

For example, if we have a Customer Table in the source database, then in HDFS we have a CUSTOMER_TARGET Table, or the same table may be in HIVE as well.

Data Mapping

In this Customer Table we have certain columns, and in the CUSTOMER_TARGET Table we have certain columns, as shown in the diagram. We dumped the data from the Customer Table into the CUSTOMER_TARGET Table, i.e. Source to Target.

Then we need to check the exact mapping: the data present in Column 1, Row 1 of the Source (Customer) Table, call it C1R1, should be mapped to C1R1 of the CUSTOMER_TARGET Table. This is basically called Mapping.

How will we know which mappings we need to verify? These mappings will be present in the Mapping Document, where the customer provides all kinds of mappings.
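A small illustrative sketch of a mapping-driven check is given below. The COLUMN_MAPPING dictionary stands in for the customer's Mapping Document, and the column names and sample rows are invented for the example.

# mapping_check.py -- an illustrative sketch of a mapping check driven by a (made-up)
# mapping document: each source column of the CUSTOMER table maps to a column of
# CUSTOMER_TARGET, and cell values must match row by row.
COLUMN_MAPPING = {            # source column -> target column (hypothetical)
    "cust_id":   "customer_id",
    "cust_name": "customer_name",
    "dob":       "date_of_birth",
}

def compare_mapped_cells(source_rows, target_rows):
    """Yield a description of every cell whose value differs between source and target."""
    for row_no, (src, tgt) in enumerate(zip(source_rows, target_rows), start=1):
        for src_col, tgt_col in COLUMN_MAPPING.items():
            if src[src_col] != tgt[tgt_col]:
                yield f"row {row_no}: {src_col}={src[src_col]!r} vs {tgt_col}={tgt[tgt_col]!r}"

# Example usage with two tiny in-memory tables.
source = [{"cust_id": 1, "cust_name": "Asha", "dob": "1990-01-01"}]
target = [{"customer_id": 1, "customer_name": "Asha", "date_of_birth": "1990-01-02"}]
print(list(compare_mapped_cells(source, target)))
# -> ["row 1: dob='1990-01-01' vs date_of_birth='1990-01-02'"]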

Also, we require a Design Document, which is needed by both the Development Team as well as the QA Team, because in the Design Document the customer specifies what kind of MapReduce Jobs they are going to implement, what inputs those MapReduce Jobs take, and what outputs they produce.

Similarly, if we have HIVE or PIG, we need to know all the UDFs the customer has created, as well as what inputs they take and what kind of output they produce, etc.

To prepare Test Scenarios and Test Cases, we need to have all these Documents by hand:

  • Requirement Document
  • Data Model
  • Mapping Document
  • Design Document

These can vary from one organization to another, and there is no mandatory rule that we must have all these documents. Sometimes we have all the documents, sometimes we have only two or three, and sometimes we need to rely on just one document; it depends on project complexity, company schedules, and so on.

Reviews On Test Scenarios And Test Cases

We need to perform a review of the Test Scenarios and Test Cases because in some cases we forget or miss some Test Cases; nobody can think of all the possible things that can be done with the requirements, and in such conditions we need to take help from third-party tools or from somebody else.

So, whenever we prepare some documents or perform something, we need somebody from the same team, like Developers or Testers, to review the work. They will give proper suggestions to include something more, or suggest updating or modifying the Test Scenarios and Test Cases.

They provide all the comments, and based on these we update our Test Scenarios and Test Cases, releasing multiple versions of the document across the team until the document is fully updated as per the requirement.

Test Execution

Once the document is ready, we get the sign-off from the higher-level team to start the execution process, which is basically called Test Case Execution.

To execute our Test Cases, we need certain inputs from the Developer. In normal Functional Testing, some other testing, or Automation Testing, we would require a Build; but here, from the Hadoop or BigData Testing point of view, the Developer provides MapReduce Jobs instead.

We also need the HDFS files, i.e. information about whatever files have been copied into HDFS so that we can check their privileges; the HIVE scripts created by the Developers to verify the data in the HIVE tables; the HIVE UDFs developed by the Developers; and the PIG scripts and PIG UDFs.

These are all the things we need to get from Developers. Before going for the execution we should have all these things.

For MapReduce Jobs, the Developers will provide JAR files; as a part of HDFS, they have already loaded the data into HDFS and the files should be ready; and there will be HIVE scripts to validate the data in the HIVE tables. Whatever UDFs they have implemented will be available as HIVE UDFs. We require the same for PIG scripts and UDFs as well.
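As a loose sketch of how these developer artifacts could be driven from a small execution script, consider the snippet below; the JAR name, main class, HDFS paths and HIVE script name are all hypothetical placeholders, and it simply assumes the hadoop and hive command-line clients are available to the tester.

# run_cycle.py -- a hedged sketch of driving one execution cycle with the artifacts
# received from the developers (all names and paths below are placeholders).
import subprocess
import sys

STEPS = [
    # 1. Run the developer-supplied MapReduce job against the staged input.
    ["hadoop", "jar", "customer_etl.jar", "com.example.CustomerJob",
     "/staging/customers", "/warehouse/customers"],
    # 2. Run the HIVE validation script that checks the loaded HIVE table.
    ["hive", "-f", "validate_customer_target.hql"],
]

for step in STEPS:
    print("running:", " ".join(step))
    result = subprocess.run(step)
    if result.returncode != 0:
        sys.exit(f"step failed with exit code {result.returncode}: {' '.join(step)}")

print("execution cycle completed; proceed with result verification and defect logging")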

Defect Reporting & Tracking

Once we execute our Test Cases, we find some defects, i.e. cases where the actual result is not equal to the expected result. We need to list them out and provide them to the development team for resolution; this is basically called Defect Reporting.

Suppose we find some defect in a MapReduce Job; we report it to the Developer, who will rework the MapReduce Job, make some code-level modifications, and then provide the latest MapReduce Job, which we need to test again.

This is an ongoing process: once the fixed job is retested and passes, we report it back to the Developer and pick up the next one for testing. This is how the Defect Reporting and Tracking activity is accomplished.

Test Reports

Once we are done with the entire Testing Process and the defects have been closed, we need to create our Test Reports. A Test Report documents whatever we have done to complete the Testing Process so far: all the planning, test case writing & execution, the output we got, etc., everything is documented together in the form of Test Reports.

We need to send these reports on a daily basis or on a weekly basis or as per the Customer’s needs. Nowadays organizations are using the AGILE Model, so every Status Report needs to be updated during the Daily Scrums.

Conclusion

In this tutorial, we walked through:

  • The strategy or plan of testing the BigData.
  • Required environment for BigData Testing.
  • BigData Validation and Verifications.
  • Tools used in testing the BigData.

We also learned about –

  • How the Test Strategy, Test Development, Test Executions, Defect Management and Delivery work in the Roles and Responsibilities as a part of Hadoop Testing.
  • Testing Approach for Hadoop/ BigData Testing which includes Requirement Gathering, Estimation, Test Planning, Creation of Test Scenarios & Test Cases along with the reviews.
  • We also came to know about Test Execution, Defect Reporting & Tracking and Test Reporting.

We hope this BigData Testing Tutorial was helpful to you!

=> Check ALL BigData Tutorials Here.
