Hadoop Components – MapReduce With Hadoop YARN:
In our previous tutorial on Hadoop components, we learned about Hadoop MapReduce and the stages of its processing mechanism: INPUT, SPLITTING, MAPPING, SHUFFLING, REDUCING and FINAL RESULT.
In this tutorial we will explore:
- How does MapReduce work with YARN?
- Application Workflow of Hadoop YARN.
MapReduce With Hadoop YARN
Let’s understand how MapReduce uses YARN to execute jobs over the Hadoop Cluster. But before we proceed, the first question that comes to mind is: what is the full form of YARN? In other words, what does YARN stand for?
YARN means Yet Another Resource Negotiator.
It is the one that allocates the resources for various jobs that need to be executed over the Hadoop Cluster. It was introduced in Hadoop 2.0.
Until Hadoop 1.0, MapReduce was the only framework, or the only processing unit, that could execute over the Hadoop Cluster. However, YARN was introduced in Hadoop 2.0, and with it we are able to go beyond MapReduce as well.
As you can see in the diagram, we have HDFS at the bottom and YARN in between, and using YARN, many frameworks are able to connect to and utilize HDFS. So even MapReduce connects through YARN to request resources, and only then can it execute the Job over HDFS, i.e. the Hadoop Cluster.
Similarly, Spark, Storm, search engines, and other frameworks can connect to HDFS. HBase, which is a NoSQL database, can also connect to it. So the applications of HDFS became huge, simply because YARN opened the gate for other frameworks and other Big Data analytics tools as well.
What is the difference between MapReduce Version1 (MRv1) and MapReduce Version2 (MRv2)?
MRv1 was essentially a part of Hadoop framework 1. With Hadoop 2, YARN came into the picture and MapReduce was upgraded to MRv2 with several changes in its classes. The classes were updated; however, the syntax for writing a MapReduce program remains the same.
In this scenario, MapReduce now connects with YARN to access HDFS.
Along with YARN, Resource Manager and Node Manager are the new Daemons that were introduced into the Hadoop Cluster.
Previously, these roles were filled by the Job Tracker and the Task Tracker. However, they were removed in Hadoop 2.0, and the Resource Manager and Node Manager were introduced along with YARN into the Hadoop framework.
Hadoop 2.x Daemons
Let’s have a quick look at the newly introduced Daemons in Hadoop 2.0 that run the components i.e. Storage and Processing.
In the HDFS tutorial, we covered the storage Daemons, i.e. NameNode and DataNode, in detail. In this tutorial, we will understand how the Resource Manager and Node Manager work in a Hadoop 2.x Cluster to manage the processing and the jobs that need to be executed in the Hadoop Cluster.
So, what is the Resource Manager? The Resource Manager is the Master Daemon that runs on the Master Machine, i.e. the high-end machine that hosts the NameNode. The Node Manager, on the other hand, is the Daemon that runs on the Slave Machines, i.e. the DataNodes, alongside the DataNode process.
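The layout described above can be summarized as a small lookup table. This is only an illustrative sketch in Python, not a Hadoop API: the dictionary name `HADOOP_2X_DAEMONS` and the helper `daemons_on` are hypothetical names chosen for this example.

```python
# Toy lookup table (not a Hadoop API): which Hadoop 2.x daemon runs where.
# Keys are (component, machine role); values are the daemon names.
HADOOP_2X_DAEMONS = {
    ("storage", "master"): "NameNode",
    ("storage", "slave"): "DataNode",
    ("processing", "master"): "ResourceManager",
    ("processing", "slave"): "NodeManager",
}


def daemons_on(machine_role):
    """Return the daemons expected on a machine with the given role."""
    return sorted(
        name
        for (_, role), name in HADOOP_2X_DAEMONS.items()
        if role == machine_role
    )


print(daemons_on("master"))  # ['NameNode', 'ResourceManager']
print(daemons_on("slave"))   # ['DataNode', 'NodeManager']
```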
Hadoop 2.x MapReduce YARN Components
Let’s explore the other components of YARN below.
- Client: It is the unit that submits the Job, such as the Command Line Interface (CLI). The Client could also be a Java application.
- Resource Manager: It is the Master Daemon to which all the Jobs are submitted from the Client, and it is the one that allocates all the Cluster-level Resources for executing a particular Job. It runs on a high-end machine with good-quality hardware and a good configuration, as it is the Master Machine that has to manage everything over the cluster.
- Node Manager: It is a Slave Daemon that runs on the Slave Machines or DataNodes, so every Slave Machine has a Node Manager running. It monitors the resources of its particular DataNode. In other words, the Resource Manager manages the Cluster resources and the Node Manager manages the DataNode resources.
- Job History Server: It is the unit that keeps track of all the Jobs that have been executed over the Cluster or have been submitted to the Cluster. It keeps track of their status as well, and also keeps the log files of every execution that has happened over the Hadoop Cluster.
- Application Master: It is a component that runs on a Node Machine (a Slave Machine) and is created by the Resource Manager to execute and manage a Job. It is the one that negotiates the resources from the Resource Manager and finally coordinates with the Node Manager to execute the task.
- Container: It is allocated by the Resource Manager and created by the Node Manager, and all the Jobs are finally executed within Containers.
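To make the split between cluster-level and node-level resource management more concrete, here is a minimal toy model in Python. It is not the YARN API: the class names mirror the components above, but the fields (`free_memory_mb`), the method names (`allocate`, `launch_container`), the host names, and the "pick the node with the most free memory" policy are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Tracks the resources of one slave machine (toy model)."""
    host: str
    free_memory_mb: int
    containers: list = field(default_factory=list)

    def launch_container(self, container_id, memory_mb):
        # The Node Manager creates the Container on its own machine.
        if memory_mb > self.free_memory_mb:
            raise ValueError(f"{self.host} has too little free memory")
        self.free_memory_mb -= memory_mb
        self.containers.append(container_id)


class ResourceManager:
    """Holds the cluster-wide view and decides where a Container goes."""

    def __init__(self, node_managers):
        self.node_managers = node_managers
        self._next_id = 1

    def allocate(self, memory_mb):
        # Naive policy (an assumption): pick the node with the most free memory.
        node = max(self.node_managers, key=lambda nm: nm.free_memory_mb)
        if node.free_memory_mb < memory_mb:
            return None  # the cluster cannot satisfy the request right now
        container_id = f"container_{self._next_id}"
        self._next_id += 1
        return container_id, node


# Hypothetical two-node cluster.
nodes = [NodeManager("slave-1", 4096), NodeManager("slave-2", 8192)]
rm = ResourceManager(nodes)

container_id, node = rm.allocate(2048)      # cluster-level decision
node.launch_container(container_id, 2048)   # node-level bookkeeping
print(container_id, node.host, node.free_memory_mb)  # container_1 slave-2 6144
```

The point of the sketch is the division of labor: the Resource Manager only decides where a Container should go, while the Node Manager on that machine actually creates it and tracks what is left.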
YARN Work Flow
As shown in the above diagram, there is a Resource Manager to which all the Jobs are submitted and there is a Cluster in which there are Slave Machines, and on every Slave Machine, there is a Node Manager running.
Resource Manager has two components i.e. Scheduler and Application Manager.
What is the difference between Application Master and Application Manager?
The Application Manager is a component of the Resource Manager which ensures that every task is executed and that an Application Master is created for it. The Application Master, on the other hand, is the one that executes the task and requests all the resources required for its execution.
Let’s say a job is submitted to the Resource Manager. As soon as the job is submitted, the Scheduler schedules it. Once the Scheduler has scheduled the Job for execution, the Application Manager will create a Container in one of the DataNodes, and within this Container, the Application Master will be started.
This Application Master will then register with the Resource Manager and request Containers to execute the task. As soon as the Containers are allocated, the Application Master will connect with the Node Manager and request it to launch them.
As we can see, the Application Master was allocated Containers on DataNodes D and E, and it then requested the Node Managers of DataNode D and DataNode E to launch those Containers.
As soon as the Containers are launched, the Application Master will execute the task within the Containers and the result will be sent back to the Client.
Let’s understand this in a more sequential manner.
In the below diagram, we have four components. The first one is the Client, the second one is the Resource Manager, the third one is the Node Manager and the fourth one is the Application Master.
So let’s see how these steps are executed between them.
In the very first step, the Client submits the Job to the Resource Manager. In the second step, the Resource Manager allocates a Container to start the Application Master on one of the Slave Machines. In the third step, the Application Master registers with the Resource Manager.
As soon as it registers, it requests Containers from the Resource Manager to execute the task, which is the fourth step. In step five, the Application Master notifies the Node Manager on which the Containers need to be launched.
In step six, once the Node Manager has launched the Containers, the Application Master will execute the code within these Containers.
Finally, in the seventh step, the Client contacts the Resource Manager or the Application Master to monitor the application status.
In the end, the Application Master will unregister itself from the Resource Manager and the result is given back to the Client. So this is one simple sequential flow of how a MapReduce program is executed using the YARN framework.
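The sequence described above can be written down as an ordered list of (actor, action) pairs. The following Python sketch is only a toy trace of those steps, not real YARN client code; the function name `run_yarn_job` and the job name `wordcount` are hypothetical.

```python
def run_yarn_job(job_name):
    """Return the ordered (actor, action) steps of a toy YARN job run."""
    steps = []

    def log(actor, action):
        steps.append((actor, action))

    # Steps 1-7 follow the sequential flow described in the text.
    log("Client", f"submit {job_name} to Resource Manager")
    log("Resource Manager", "allocate a Container to start the Application Master")
    log("Application Master", "register with Resource Manager")
    log("Application Master", "request Containers from Resource Manager")
    log("Application Master", "notify Node Manager to launch the Containers")
    log("Application Master", "execute the code within the Containers")
    log("Client", "contact Resource Manager or Application Master for status")
    # Final clean-up once the job has finished.
    log("Application Master", "unregister from Resource Manager")
    return steps


for number, (actor, action) in enumerate(run_yarn_job("wordcount"), start=1):
    print(f"{number}. {actor}: {action}")
```

Reading the trace top to bottom shows that the Client talks to the Resource Manager first, and that from step three onward the Application Master drives most of the work.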
So, in this tutorial, we learned the following points:
- YARN means Yet Another Resource Negotiator.
- YARN was introduced in Hadoop 2.0.
- Resource Manager and Node Manager were introduced along with YARN into the Hadoop framework.
- YARN Components include the Client, Resource Manager, Node Manager, Job History Server, Application Master, and Container.
In the upcoming tutorial, we will discuss the testing techniques for Big Data and the challenges faced in Big Data Testing. We will also get to know how to overcome those challenges, along with any workarounds that make Big Data Testing easier.