In this tutorial, we will look at what MapReduce is, what its advantages are, and how Hadoop MapReduce works, with examples.
In the previous tutorial, we learned about Hadoop HDFS and its reading and writing mechanisms. Now let us explore another Hadoop component, i.e. MapReduce.
We will look at the following in detail:
- What is MapReduce?
- Its benefits
- What is the exact approach of MapReduce?
What Is MapReduce?
As explained earlier, Hadoop has two main components, i.e. Hadoop HDFS and Hadoop MapReduce.
Hadoop HDFS (the Hadoop Distributed File System) is the storage layer. It stores huge amounts of data across many machines, which are typically arranged in multiple racks.
Here we will discuss the processing unit of Hadoop, i.e. MapReduce.
The next question that arises is: what is MapReduce, and why is it required?
Hadoop MapReduce is the “Processing Unit” of Hadoop. Using this component, we can process the Big Data stored on Hadoop HDFS.
But what is the exact requirement? Why do we need this component of Hadoop?
Big Data stored on Hadoop HDFS is not stored the way a traditional file is. The data is divided into chunks (blocks), and these chunks are stored on different DataNodes. So the entire data set is not stored in one centralized location.
Hence an ordinary client application, such as a standalone Java program, cannot process the data in this form. We need a special framework that can process the fragmented data blocks stored on the respective DataNodes.
This processing is done using Hadoop MapReduce.
MapReduce In A Nutshell
The above diagram gives an overview of MapReduce, its features and its uses.
Let us start with the applications of MapReduce and where it is used. For example, it is used for classifiers, indexing and searching, and building recommendation engines on e-commerce sites (Flipkart, Amazon, etc.). Several companies also use it for analytics.
From a features perspective, it is a programming model for large-scale distributed processing on top of storage such as Hadoop HDFS, and its support for parallel programming is what makes it so useful.
In terms of functions, two functions get executed in MapReduce, i.e. the Map Function and the Reduce Function.
This technology has been implemented by major organizations like Google, Yahoo and Facebook. It has also been adopted across the Apache Hadoop ecosystem: tools such as Pig and Hive use it to process data stored in HDFS, and it can also process Big Data stored in HBase, which is a NoSQL database.
Advantages Of MapReduce
There are two main advantages to this technology.
#1) Parallel Processing
The very first advantage is parallel processing. Using MapReduce, we can process the data in parallel.
As per the above diagram, there are five slave machines, and some data resides on each of them. The data on all five machines is processed in parallel using Hadoop MapReduce, and thus processing becomes fast.
What actually happens is that Hadoop HDFS divides the entire data set into HDFS blocks, and MapReduce processes these blocks at the same time rather than one after another, which is why processing is fast.
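The idea of processing blocks in parallel can be sketched in a few lines of Python. This is only an illustration, not how Hadoop is implemented: the five in-memory lists are hypothetical stand-ins for HDFS blocks on five slave machines, and threads on one machine stand in for the machines themselves.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for HDFS blocks, one per slave machine.
blocks = [
    ["Bigdata Hadoop MapReduce"],
    ["MapReduce Hive Bigdata"],
    ["Hive Hadoop Hive MapReduce"],
    ["Hadoop HDFS"],
    ["Hive"],
]

def count_words_in_block(block):
    """Process one block on its own, without looking at any other block."""
    return sum(len(line.split()) for line in block)

# Each block is handed to its own worker, so the blocks are processed concurrently.
with ThreadPoolExecutor(max_workers=len(blocks)) as pool:
    per_block_counts = list(pool.map(count_words_in_block, blocks))

# Only the small per-block results are combined at the end.
total_words = sum(per_block_counts)
print(per_block_counts, total_words)  # [3, 3, 4, 2, 1] 13
```

Because no block depends on another, adding more workers (or, in Hadoop's case, more machines) lets more blocks be processed at once.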
#2) Data Locality
This is another valuable capability of Hadoop MapReduce, i.e. we can process the data where it is stored.
What does it mean?
In the previous HDFS tutorial, we saw that the data we move into a Hadoop cluster is divided into HDFS blocks, and these blocks are saved on slave machines, or DataNodes. MapReduce sends the processing logic to the respective slave nodes or DataNodes where the data resides as HDFS blocks.
The processing is executed over smaller chunks of data in multiple locations in parallel. This saves a lot of time, as well as the network bandwidth that would be required to move Big Data from one location to another.
Remember that the data we are processing is Big Data broken down into chunks. If we moved all of it over the network to a centralized machine and processed it there, we would gain nothing, because we would consume the entire bandwidth just moving the data to the centralized server.
So with Hadoop MapReduce we are not just doing “Parallel Processing”. We are also processing the data on the respective slave nodes or DataNodes where the chunks of data are present, and hence we are also “Saving a lot of Network Bandwidth”, which is very beneficial.
Finally, once the slave machines are done processing the data stored on them, they send their results back to the master machine. These results are much smaller than the blocks stored on the slave machines, so sending them does not use a lot of bandwidth.
The master machine aggregates these results, and the final result is sent back to the client machine that submitted the job.
Here one question arises: who decides which data should be processed on which DataNode?
The client submits the job to the Resource Manager, and the Resource Manager directs the job to execute on the respective DataNodes where the data resides. It chooses the nearest available DataNode that holds the data, so that little network bandwidth is used.
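To make the "prefer the node that already holds the data" rule concrete, here is a toy sketch. It is not the Resource Manager's real scheduling algorithm: the block names, node names, slot counts and the `assign_tasks` helper are all hypothetical, chosen only to illustrate the preference for local data.

```python
# Hypothetical cluster layout: which DataNodes hold a replica of each block.
block_locations = {
    "block-1": ["node-A", "node-B"],
    "block-2": ["node-B", "node-C"],
    "block-3": ["node-C", "node-A"],
}

def assign_tasks(block_locations, free_slots):
    """Prefer a free node that already stores the block; otherwise use any free node."""
    slots = dict(free_slots)  # copy, so the caller's dict is not modified
    assignments = {}
    for block, holders in block_locations.items():
        local = [node for node in holders if slots.get(node, 0) > 0]
        candidates = local or [node for node, free in slots.items() if free > 0]
        if not candidates:
            raise RuntimeError(f"no free slot for {block}")
        node = candidates[0]
        slots[node] -= 1
        # Record the chosen node and whether the data is local to it.
        assignments[block] = (node, node in holders)
    return assignments

# Every node has one free slot, so every block is processed where it is stored.
all_local = assign_tasks(block_locations, {"node-A": 1, "node-B": 1, "node-C": 1})
print(all_local)

# Only node-A is free, so block-2 has to be read over the network.
one_node = assign_tasks(block_locations, {"node-A": 3})
print(one_node)
```

In the second call, block-2 is the only assignment marked as non-local, which is exactly the case where network bandwidth would be spent moving data.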
Traditional Vs. MapReduce Way
To explain this, we will take a real-life analogy: counting the policy applications received by an insurance company. Most people are familiar with insurance policies, and most big insurance companies have branches in various cities.
In those branches, there are “n” people who have applied for life insurance policies.
Let’s take a scenario where we have five insurance company branches where people come and apply for life insurance policies. The company also has one headquarters, which knows about all the branches and where they are located.
However, when people apply for a life insurance policy at the respective branches A, B, C, D and E, the policy applications are kept at those branches, and that information is not shared with the insurance company headquarters.
Let’s see how the applications would be processed traditionally. In the traditional approach, all the applications are first moved to the insurance company headquarters, and only then does the application processing start.
In this case, we need to gather all the applications from the branches and take them to the headquarters, which is a costly affair.
Along with the cost, this activity involves a huge amount of effort.
Another problem is the overburdened headquarters, which has to process every application submitted at every branch.
Because the headquarters is processing the applications from all the branches, it is going to take a long time. In the end, this process doesn’t work very well.
Let’s see how MapReduce solves this problem.
MapReduce follows data locality, i.e. it does not bring all the applications to the insurance company headquarters. Instead, it processes the applications at the respective branches, in parallel.
Once the applications submitted at each branch have been processed, the branches send the processed details back to the headquarters.
Now the headquarters just has to aggregate the numbers of processed applications sent from the respective branches and keep the details in its database or storage centre.
In this way, the processing is much easier and quicker, and the policyholders get their benefits sooner.
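The difference between the two approaches can be sketched with a small, made-up data set. The application IDs and the two helper functions below are hypothetical; the point is only to compare how many items have to travel to the headquarters in each case.

```python
# Hypothetical application records held at each of the five branches.
branches = {
    "A": ["app-1", "app-2", "app-3"],
    "B": ["app-4"],
    "C": ["app-5", "app-6"],
    "D": [],
    "E": ["app-7", "app-8", "app-9", "app-10"],
}

def traditional_count(branches):
    """Ship every application to headquarters, then count there."""
    shipped = [app for apps in branches.values() for app in apps]
    return len(shipped), len(shipped)  # (total applications, items moved)

def mapreduce_count(branches):
    """Count at each branch, then send only one number per branch."""
    branch_counts = {name: len(apps) for name, apps in branches.items()}
    return sum(branch_counts.values()), len(branch_counts)

print(traditional_count(branches))  # (10, 10)
print(mapreduce_count(branches))    # (10, 5)
```

Both approaches arrive at the same total, but the second one moves one small number per branch instead of every application, and the gap widens as the number of applications grows.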
MapReduce In Detail
In our previous example, the input (the applications) was distributed among various branches, and each input was processed by a Map Function.
We know that MapReduce has two functions, i.e. the Map Function and the Reduce Function.
The processing done at the respective branches was done by the Map Function. So each input (application) in every branch was processed using the Map Function. After that, the processed details were sent to the insurance company headquarters, where the aggregation was done by the Reduce Function.
The aggregated application details are given as the Output.
This is what happened in our previous example. The entire process was divided into the Map Task and the Reduce Task.
The Map Task gets an Input, the Output of the Map Task is given to the Reduce Task as its Input, and the Reduce Task finally gives the Output to the Client.
To understand it in a better manner, let’s go through the anatomy of MapReduce.
A MapReduce Task works on Key-Value pairs. The Map takes a Key-Value pair as Input and gives a list of Key-Value pairs as Output. This list goes through a Shuffle phase, which groups the Values by Key, so the Input to the Reducer is a Key together with the list of its Values.
Finally, the Reducer gives us a list of Key-Value pairs.
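The shapes of these inputs and outputs can be written down as a small in-memory sketch. This is not Hadoop's API: `run_map_reduce` is a hypothetical helper, and sorting the keys before reducing is an assumption made here to keep the output order predictable.

```python
from collections import defaultdict
from typing import Callable, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")
V3 = TypeVar("V3")

def run_map_reduce(
    records: Iterable[Tuple[K1, V1]],
    mapper: Callable[[K1, V1], Iterable[Tuple[K2, V2]]],
    reducer: Callable[[K2, List[V2]], Iterable[Tuple[K2, V3]]],
) -> List[Tuple[K2, V3]]:
    # Map: one (key, value) pair in, a list of (key, value) pairs out.
    mapped = [pair for key, value in records for pair in mapper(key, value)]
    # Shuffle: collect all values that share the same key.
    grouped = defaultdict(list)
    for key, value in mapped:
        grouped[key].append(value)
    # Reduce: one (key, list of values) in, a list of (key, value) pairs out.
    return [pair for key in sorted(grouped) for pair in reducer(key, grouped[key])]

# Hypothetical input: (branch, application id) pairs.
records = [("A", "app-1"), ("B", "app-2"), ("A", "app-3")]
result = run_map_reduce(
    records,
    mapper=lambda branch, app: [(branch, 1)],
    reducer=lambda branch, ones: [(branch, sum(ones))],
)
print(result)  # [('A', 2), ('B', 1)]
```

Only the mapper and the reducer change from one job to another; the shuffle step in between stays the same.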
MapReduce Example – Word Count Process
Let’s take another example, i.e. the Word Count process, the MapReduce way. Word Count is to MapReduce what “Hello World” is to Java programming: the standard introductory example.
As per the diagram, we have an Input, and this Input gets divided, or split, into several Inputs. This process is called Input Splitting. In this example, the Input is split on the newline character, so each line becomes one split.
The first line is the first Input, i.e. Bigdata Hadoop MapReduce; the second line is the second Input, i.e. MapReduce Hive Bigdata; and the third Input is Hive Hadoop Hive MapReduce.
Let’s move on to the next phase, i.e. the Mapping phase. In the Mapping phase, we create a list of Key-Value pairs. The Input is itself a Key and a Value: the Key identifies the position of the line, and the Value is the entire line.
So, for line 1, the Key is its position and the Value is Bigdata Hadoop MapReduce. In a real Hadoop job reading a text file, the Key is the byte offset at which the line starts in the file (a long integer), not a line number. However, to keep it easy, we will simply use the line numbers 1, 2 and 3.
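In Hadoop's default text input format, the Key of each record is the byte offset at which the line starts in the file. The sketch below computes those offsets for the three sample lines; `read_records` is a hypothetical helper written for this illustration, and the rest of the walkthrough keeps using the simpler labels 1, 2 and 3.

```python
def read_records(text):
    """Yield (byte_offset, line) pairs, one per line of the input."""
    offset = 0
    for raw_line in text.encode("utf-8").splitlines(keepends=True):
        yield offset, raw_line.decode("utf-8").rstrip("\n")
        offset += len(raw_line)  # the next line starts right after this one

input_text = (
    "Bigdata Hadoop MapReduce\n"
    "MapReduce Hive Bigdata\n"
    "Hive Hadoop Hive MapReduce\n"
)

records = list(read_records(input_text))
for key, value in records:
    print(key, value)
# 0 Bigdata Hadoop MapReduce
# 25 MapReduce Hive Bigdata
# 48 Hive Hadoop Hive MapReduce
```

The first line is 24 characters plus a newline, so the second line starts at offset 25, and so on.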
So line 1 is the Key and the entire line is the Value. When this pair passes through the Mapping Function, the function creates a list of Key-Value pairs: it reads every word of the line and emits the word as a Key, followed by a comma and one (1) as the Value.
For line 1, this gives Bigdata, 1 Hadoop, 1 and MapReduce, 1. Here the question is: why are we putting one (1) after each word?
It is because each occurrence of a word counts once. Bigdata occurs once, so we get Bigdata, 1. Similarly, Hadoop and MapReduce each occur once, so we get Hadoop, 1 and MapReduce, 1. In the same manner, line 2 is MapReduce Hive Bigdata.
So the Mapping Function again creates a list of Key-Value pairs for it: MapReduce, 1 Hive, 1 and Bigdata, 1.
We get a similar result from the Mapping Function for line 3, i.e. Hive, 1 Hadoop, 1 Hive, 1 and MapReduce, 1. Note that Hive occurs twice in this line, so the pair Hive, 1 is emitted twice; the counts are only added up later, in the Reducing phase.
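A minimal Python sketch of the Mapping Function for this example follows. The function name `word_count_mapper` is an assumption made for illustration, and the line numbers 1 to 3 are used as Keys as in the text. A word that repeats within a line, such as Hive in line 3, is emitted once per occurrence.

```python
def word_count_mapper(key, line):
    """Emit (word, 1) for every word in the line; the input key is not used."""
    return [(word, 1) for word in line.split()]

lines = {
    1: "Bigdata Hadoop MapReduce",
    2: "MapReduce Hive Bigdata",
    3: "Hive Hadoop Hive MapReduce",
}

mapped = {number: word_count_mapper(number, line) for number, line in lines.items()}
for number, pairs in mapped.items():
    print(number, pairs)
# 1 [('Bigdata', 1), ('Hadoop', 1), ('MapReduce', 1)]
# 2 [('MapReduce', 1), ('Hive', 1), ('Bigdata', 1)]
# 3 [('Hive', 1), ('Hadoop', 1), ('Hive', 1), ('MapReduce', 1)]
```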
Let’s move on to the Shuffling phase. In this phase, a list of Values is prepared for every Key. For example, the Shuffling phase finds every appearance of the Key Bigdata and adds its Values to one list. Let’s see what is happening here.
As we can see, there are two incoming arrows: the first arrow comes from list 1 and the other comes from list 2, so the result is Bigdata, (1, 1).
Similarly, for the word Hadoop, another list of Values is prepared. Again two incoming arrows point to Shuffling: the word Hadoop is picked up from list 1 and list 3, so the result after Shuffling is Hadoop, (1, 1).
In the same fashion, we get the rest of the words, Hive, (1, 1, 1) and MapReduce, (1, 1, 1), each with its list of Values, i.e. one count for every occurrence of the word in the respective lists.
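The grouping done by the Shuffling phase can be sketched as follows. The `shuffle` helper is hypothetical, and the mapper output is written out by hand so that the snippet runs on its own; sorting the Keys is an assumption made here for a predictable order.

```python
from collections import defaultdict

# Output of the Mapping phase for the three lines, written out by hand.
mapped_pairs = [
    ("Bigdata", 1), ("Hadoop", 1), ("MapReduce", 1),             # from line 1
    ("MapReduce", 1), ("Hive", 1), ("Bigdata", 1),               # from line 2
    ("Hive", 1), ("Hadoop", 1), ("Hive", 1), ("MapReduce", 1),   # from line 3
]

def shuffle(pairs):
    """Group all Values that share a Key into one list per Key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))

shuffled = shuffle(mapped_pairs)
print(shuffled)
# {'Bigdata': [1, 1], 'Hadoop': [1, 1], 'Hive': [1, 1, 1], 'MapReduce': [1, 1, 1]}
```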
Now we come to the Reducing phase. In this phase, we aggregate the Values present in the list for every Key. For Bigdata, there are two Values in the list, i.e. (1, 1), so these Values are summed to give Bigdata, 2.
Similarly, for Hadoop the Values (1, 1) are summed to give Hadoop, 2.
In the same manner, for Hive and MapReduce the Reducing Function produces the sums Hive, 3 and MapReduce, 3 respectively.
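The Reducing Function for Word Count is just a sum over each list of Values. In the sketch below, `word_count_reducer` is a hypothetical name and the shuffled input is written out by hand so that the snippet runs on its own.

```python
# Output of the Shuffling phase, written out by hand.
shuffled = {
    "Bigdata": [1, 1],
    "Hadoop": [1, 1],
    "Hive": [1, 1, 1],
    "MapReduce": [1, 1, 1],
}

def word_count_reducer(word, counts):
    """Add up all the ones collected for a single word."""
    return word, sum(counts)

reduced = dict(word_count_reducer(word, counts) for word, counts in shuffled.items())
print(reduced)
# {'Bigdata': 2, 'Hadoop': 2, 'Hive': 3, 'MapReduce': 3}
```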
At last, the final result is sent back to the Client, as shown in the diagram below, “The Overall MapReduce Word Count Process”.
The Overall MapReduce Word Count Process
This is how the entire Word Count process works the MapReduce way.
In this tutorial, we learned the following:
- Hadoop MapReduce is the “Processing Unit” of Hadoop.
- To process the Big Data stored in Hadoop HDFS, we use Hadoop MapReduce.
- It is used in searching and indexing, classification, recommendation, and analytics.
- Its features are that it is a programming model, it supports parallel programming, and it works as a large-scale distributed model.
- Design patterns of MapReduce include summarization, classification of top records, sorting, and analytics such as join and selection.
- It has only two functions, i.e. the Mapper Function and the Reducer Function.
- Parallel Processing and Data Locality are the main advantages of Hadoop MapReduce.
- The MapReduce process is divided into six phases, i.e. INPUT, SPLITTING, MAPPING, SHUFFLING, REDUCING and FINAL RESULT.
That’s all for this tutorial. In our upcoming tutorials, we will cover:
- How does MapReduce work with YARN and its components?
- Application Workflow of YARN.
- What is Spark and what is the difference between Hadoop and Spark?