This Apache Hadoop tutorial for beginners explains Big Data Hadoop, its features, framework, and architecture in detail:
In the previous tutorial, we discussed Big Data in detail. The question now is how we can handle and process such a large volume of data and still get reliable, accurate results.
Apache provides a solution for this: the Java-based Hadoop framework.
What Is Hadoop?
Apache Hadoop is an open-source framework to manage all types of data (Structured, Unstructured, and Semi-structured).
If we want to store, process, and manage data, an RDBMS is the usual first choice. However, an RDBMS expects data in a structured format. Also, as the volume of data grows, a single-server RDBMS becomes hard to scale, and we often have to clean up or archive the database regularly.
Such clean-up can discard historical data, which makes it harder to generate accurate and reliable results in industries such as weather forecasting, banking, insurance, and sales. Another problem is that if the main database server goes down and there is no replica or backup, we may lose important data.
In this tutorial, we will see how we can overcome these problems with Apache Hadoop.
Hadoop includes a distributed file system (HDFS) that can store large volumes of data, from terabytes to petabytes. It processes data in parallel across many machines, which gives high throughput on large data sets, and it is highly fault tolerant, so results remain reliable even when individual machines fail.
Hadoop is a Java-based, open-source programming framework that supports the storage and processing of large data sets in a distributed computing environment.
Hadoop is based on the cluster concept and runs on commodity hardware. It does not need specialized machines, so we can set up a Hadoop environment with inexpensive, standard hardware.
In simple words, a cluster is a group of machines that work together as one system. Hadoop stores data on the cluster in replicated form on multiple machines, so that when a failure or disaster affects one location where the data resides, a duplicate copy of that data is still safely available at another location.
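The replication idea can be illustrated with a minimal Python sketch. It is not Hadoop code: the function names, node names, and the random placement are assumptions made for illustration (real HDFS uses a rack-aware placement policy). It only shows why keeping three copies on different machines lets the data survive the loss of one machine.

```python
import random


def place_replicas(block_ids, nodes, replication=3, seed=0):
    """Assign each block to `replication` distinct nodes (random placement,
    a simplification of the real rack-aware policy)."""
    rng = random.Random(seed)
    return {block: set(rng.sample(nodes, replication)) for block in block_ids}


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one copy on a live node."""
    failed = set(failed_nodes)
    return {block for block, holders in placement.items() if holders - failed}


nodes = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical machines
placement = place_replicas(["blk_1", "blk_2", "blk_3"], nodes)

# Whichever single machine fails, every block is still readable,
# because each block lives on three different machines.
for failed in nodes:
    assert readable_blocks(placement, [failed]) == set(placement)
```

A block becomes unreadable only if every machine that holds one of its copies fails at the same time.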
Hadoop Vs RDBMS
Listed below are some points that describe the advantages of Hadoop over RDBMS.
|Feature|Hadoop|RDBMS|
|---|---|---|
|Architecture|Based on HDFS, MapReduce, and YARN.|Based on relational tables and ACID transactions.|
|Volume|Can handle very large volumes of data.|Hard to scale to very large volumes of data.|
|Variety/Types of Data|Can handle structured, semi-structured, and unstructured data such as video, images, CSV files, XML, etc.|Handles only structured data.|
|Speed|Fast processing of large amounts of data.|Slow when processing very large amounts of data.|
|Throughput|High throughput.|Low throughput.|
|Fault-tolerance|Very good; data is replicated across nodes.|Cannot recover lost data if the main server goes down without a backup or replica.|
|Storage|Very high storage capacity.|Not designed to store Big Data.|
|Reliable|Very reliable; can generate accurate historical and current reports.|Less reliable for Big Data workloads.|
We now know the definition of Hadoop. Let’s move one step forward, get familiar with the terminology used in Hadoop, learn its architecture, and see how exactly it works on Big Data.
The Hadoop framework is based on the following concepts or modules:
- Hadoop YARN
- Hadoop Common
- Hadoop HDFS (Hadoop Distributed File System)
- Hadoop MapReduce
#1) Hadoop YARN: YARN stands for “Yet Another Resource Negotiator”. It manages the resources of the cluster (such as CPU and memory) and is used for job scheduling.
#2) Hadoop Common: These are the shared libraries and utilities that the other Hadoop modules (YARN, MapReduce, and HDFS) depend on.
#3) Hadoop HDFS: The Hadoop Distributed File System is used to store a high volume of data across the cluster and to access that data from the cluster.
#4) Hadoop MapReduce: MapReduce is the main feature of Hadoop that is responsible for the processing of data in the cluster. It splits a job into map and reduce tasks that run in parallel on the nodes. In Hadoop 1, MapReduce also handled job scheduling and monitoring; from Hadoop 2 onward, YARN handles that.
Here, we have just included the definition of these features, but we will see a detailed description of all these features in our upcoming tutorials.
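As a small preview of the MapReduce model mentioned above, here is a sketch in plain Python that runs on a single machine. It does not use the Hadoop API; the function names are illustrative assumptions. It shows the three logical steps of the model: map, shuffle (group by key), and reduce, using the classic word-count example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["big data needs hadoop", "hadoop stores big data"]))
```

In a real cluster, the map calls run in parallel on the nodes that hold the input blocks, and the grouped keys are distributed across several reducers.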
Let’s learn the architecture of the framework and see what components are used in it. This framework follows a master-slave architecture in the cluster.
The following are the three important components of the Hadoop architecture. We should also understand some of the related terminology and see how these components work.
- Name Node
- Data Node
- Secondary Name Node
#1) Name Node
Name Node is the master node in HDFS. It holds the metadata of HDFS, such as file information, the directory structure, block information, and the information about every Data Node. The Name Node does not store the file data itself; it handles client requests for file access and tells the client which Data Nodes hold the blocks. It also tracks all the transactions or changes made to files.
It mainly works with two files, i.e. FsImage and EditLogs. The Name Node also tracks how many blocks each Data Node holds and the heartbeat of each Data Node. In Hadoop 1, a separate master daemon called the JobTracker, often run on the same machine as the Name Node, keeps the job scheduling details of the cluster, such as which node is running which task.
In short, the JobTracker coordinates the TaskTrackers that run on the Data Node machines. From Hadoop 2 onward, YARN takes over this role.
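The metadata kept by the Name Node can be pictured as two lookup tables: one from a file to its blocks, and one from a block to the Data Nodes that hold it. The following Python sketch is only a mental model; the class, its methods, the path, and the node names are hypothetical and are not part of the Hadoop API.

```python
class NameNodeMetadata:
    """Toy model of Name Node metadata. It stores no file contents."""

    def __init__(self):
        self.file_blocks = {}      # file path -> ordered list of block IDs
        self.block_locations = {}  # block ID -> set of Data Node names

    def add_file(self, path, block_ids):
        """Record which blocks make up a file."""
        self.file_blocks[path] = list(block_ids)

    def report_block(self, block_id, data_node):
        """Record that a Data Node holds a copy of a block."""
        self.block_locations.setdefault(block_id, set()).add(data_node)

    def locate(self, path):
        """Answer a client read request: each block and where to fetch it."""
        return [(block, sorted(self.block_locations.get(block, ())))
                for block in self.file_blocks[path]]


name_node = NameNodeMetadata()
name_node.add_file("/logs/day1.txt", ["blk_1", "blk_2"])
for block, node in [("blk_1", "dn1"), ("blk_1", "dn2"), ("blk_2", "dn3")]:
    name_node.report_block(block, node)

print(name_node.locate("/logs/day1.txt"))
```

The client then reads the block contents directly from the listed Data Nodes, so the file data never passes through the Name Node.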
#2) Data Node
Data Node is the slave node in HDFS. Data Node is responsible for the actual storage of data and serves read and write requests for it. A file is divided into blocks, and by default three copies of each block are stored on different Data Nodes. Processing tasks then run on the nodes that hold the data.
In Hadoop 1, each Data Node machine also runs a TaskTracker, which executes the tasks assigned to that node and reports their progress and completion to the JobTracker. The Data Node itself regularly reports to the Name Node, and each time a Data Node starts, it sends a full report of the blocks it holds to the Name Node.
#3) Secondary Name Node
Secondary Name Node supports fault tolerance of the Name Node. The Name Node is the single point of failure, so there are two scenarios where a Name Node problem affects the full Hadoop structure.
(i) If the Name Node restarts due to any issue, it takes a long time to come up again, because it has to rebuild its metadata by replaying a large number of edit log entries.
(ii) In the case of a Name Node crash in which its metadata is lost, the HDFS data can no longer be located and cannot be recovered, as the Name Node is the single point of failure. To reduce these risks, the Secondary Name Node is there. It also keeps a copy of the Namespace image and the Edit logs from the Name Node.
After a certain period, it copies the Namespace image and the Edit logs from the Name Node, merges them into an updated Namespace image (a checkpoint), and sends it back. This keeps the Edit logs short, so the Name Node restarts faster, and the checkpoint can be used to restore the metadata after a failure. Note that the Secondary Name Node is not a hot standby: it does not automatically take over as the primary Name Node. Automatic failover requires a Standby Name Node, which is available with HDFS High Availability from Hadoop 2 onward.
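The periodic copy-and-merge step described above can be sketched as follows. This is a simplified model, not the real on-disk format: the namespace is a plain dictionary, and the three operation names (`create`, `delete`, `rename`) and the function names are assumptions chosen for illustration.

```python
def apply_edits(fsimage, edit_log):
    """Replay edit-log entries on top of a namespace snapshot.

    fsimage:  dict mapping file path -> list of block IDs
    edit_log: list of (operation, path, argument) tuples
    """
    namespace = dict(fsimage)  # work on a copy, leave the old snapshot intact
    for operation, path, argument in edit_log:
        if operation == "create":
            namespace[path] = argument             # argument: list of block IDs
        elif operation == "delete":
            namespace.pop(path, None)
        elif operation == "rename":
            namespace[argument] = namespace.pop(path)  # argument: new path
    return namespace


def checkpoint(fsimage, edit_log):
    """Merge the edits into a new snapshot and start with an empty log."""
    return apply_edits(fsimage, edit_log), []


fsimage = {"/a.txt": ["blk_1"]}
edit_log = [("create", "/b.txt", ["blk_2"]), ("rename", "/a.txt", "/c.txt")]
fsimage, edit_log = checkpoint(fsimage, edit_log)
print(fsimage, edit_log)
```

After a checkpoint, a restart only has to load the new snapshot and replay the few edits made since, instead of the full history.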
Blocks are the smallest unit of storage in HDFS. Hadoop can handle huge files because it divides them into blocks. We can say that blocks are nothing but pieces of the data of a huge file. The default size of each block is 128 MB (64 MB in Hadoop 1), and it is configurable. These blocks are saved on Data Nodes, where the data is then processed.
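As a quick worked example of the block arithmetic, the sketch below splits a file size into blocks, assuming a 128 MB block size. The function name and the 500 MB file are illustrative assumptions.

```python
BLOCK_SIZE_MB = 128  # assumed block size


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size in MB of each block that a file is divided into."""
    if file_size_mb <= 0:
        return []
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    # The last block only holds the data that is left over.
    return [block_size_mb] * full_blocks + ([remainder] if remainder else [])


blocks = split_into_blocks(500)
print(len(blocks), blocks)
```

A 500 MB file therefore needs three full 128 MB blocks plus one final block of 116 MB; the last block takes up only as much space as the data it holds.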
Now, let’s go through the architecture of Hadoop to understand how it works.
Hadoop Distributed File System (HDFS) is the file system that is used in the Hadoop cluster. HDFS is mainly used to store Hadoop data in the cluster. It is designed for sequential (streaming) access to large files. As we already know, it is based on a master-slave architecture.
All the metadata of the file system is saved on the Name Node, and the actual data is stored on the Data Nodes of HDFS. In Hadoop 1, the JobTracker on the master side and the TaskTrackers on the Data Node machines manage the processing jobs.
MapReduce is responsible for the processing of data. Whenever a file comes into the cluster, it is divided into blocks. Each block holds up to 128 MB of data by default (64 MB in Hadoop 1). Each block is then replicated twice more, so that three copies are stored on different Data Nodes in the cluster.
All this information is sent to the Name Node, and the Name Node stores it in the form of metadata. The actual processing of the data then starts on the Data Nodes. Each Data Node sends a heartbeat to the Name Node every three seconds, so that the Name Node knows that this Data Node is still working.
If any one of the Data Nodes stops sending heartbeats, the Name Node treats it as failed, creates new replicas of its blocks on other Data Nodes, and the processing continues there.
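The heartbeat check and the re-replication decision can be sketched in Python as below. This is a simulation with explicit timestamps rather than real network calls. The class, the function, the node names, and the 30-second timeout are hypothetical values chosen for illustration; only the three-second heartbeat interval comes from the text above.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time of each Data Node (times in seconds)."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s  # hypothetical: longer silence means "failed"
        self.last_seen = {}

    def heartbeat(self, node, now):
        """Record a heartbeat received from `node` at time `now`."""
        self.last_seen[node] = now

    def dead_nodes(self, now):
        """Return the nodes whose last heartbeat is older than the timeout."""
        return sorted(node for node, seen in self.last_seen.items()
                      if now - seen > self.timeout_s)


def missing_replicas(block_locations, dead, replication=3):
    """Map each under-replicated block to the number of copies to re-create."""
    missing = {}
    for block, holders in block_locations.items():
        live = set(holders) - set(dead)
        if len(live) < replication:
            missing[block] = replication - len(live)
    return missing


monitor = HeartbeatMonitor(timeout_s=30)
for node in ("dn1", "dn2", "dn3", "dn4"):
    monitor.heartbeat(node, now=0)
# dn1..dn3 keep reporting every 3 seconds; dn4 goes silent after t=0.
for t in range(3, 61, 3):
    for node in ("dn1", "dn2", "dn3"):
        monitor.heartbeat(node, now=t)

dead = monitor.dead_nodes(now=60)
blocks = {"blk_1": {"dn1", "dn2", "dn4"}, "blk_2": {"dn1", "dn2", "dn3"}}
print(dead, missing_replicas(blocks, dead))
```

Here only `blk_1` lost a copy, so only that block needs one new replica on another live node.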
A snapshot of all this information is stored in the FsImage. Whenever a transaction is done, it is recorded in the edit log, and the edit log is periodically merged into the FsImage so that a fresh copy of the metadata is always available.
When the same task is attempted on more than one node, the result of the attempt that finishes first is taken. The node reports the result to the master, and the master takes action accordingly.
Throughout this process, YARN supports the system by providing the required resources, so that data processing and speed are not affected. After the data is processed, the results are saved in HDFS for further analysis.
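To give a rough feel for what "providing resources" means, the sketch below hands out memory to task containers with a simple first-fit rule. This is an assumption made for illustration only: real YARN schedulers use more elaborate policies, and the function name, node names, and memory sizes are hypothetical.

```python
def allocate(requests_mb, free_mb):
    """Place each container request on the first node with enough free memory.

    requests_mb: list of memory requests in MB, one per container
    free_mb:     dict mapping node name -> free memory in MB
    Returns (placement, remaining): the node chosen per request (None if the
    request has to wait) and the free memory left on each node.
    """
    remaining = dict(free_mb)
    placement = []
    for request in requests_mb:
        node = next((name for name, free in remaining.items() if free >= request), None)
        if node is not None:
            remaining[node] -= request
        placement.append(node)
    return placement, remaining


placement, remaining = allocate([2048, 4096, 4096], {"node1": 4096, "node2": 4096})
print(placement, remaining)
```

The third request gets no node because neither machine has 4096 MB free any more; in a real cluster such a request waits until running tasks finish and release their resources.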
In this tutorial, we learned what Hadoop is, the differences between RDBMS and Hadoop, and the advantages, components, and architecture of Hadoop.
This framework is responsible for processing and analyzing big data. We saw how MapReduce, YARN, and HDFS work together in the cluster.
Note: The following are sample hardware configuration details of the Name Node and Data Node. The Secondary Name Node will have the same configuration as the Name Node.
Name Node Configuration:
Processors: 2 Quad Core CPUs running @ 2 GHz
RAM: 128 GB
Disk: 6 x 1TB SATA
Network: 10 Gigabit Ethernet
Data Node Configuration:
Processors: 2 Quad Core CPUs running @ 2 GHz
RAM: 64 GB
Disk: 12-24 x 1TB SATA
Network: 10 Gigabit Ethernet
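To relate this hardware to storage capacity, the following back-of-the-envelope calculation estimates how much data a cluster of such Data Nodes can hold. It assumes a replication factor of three, a hypothetical cluster of 10 Data Nodes with 12 disks each, and it ignores the space needed for the operating system and intermediate data, so the real figure would be lower.

```python
def cluster_capacity_tb(data_nodes, disks_per_node, disk_tb=1, replication=3):
    """Return (raw, usable) capacity in TB.

    Every block is stored `replication` times, so the usable capacity is the
    raw capacity divided by the replication factor.
    """
    raw = data_nodes * disks_per_node * disk_tb
    return raw, raw / replication


raw, usable = cluster_capacity_tb(data_nodes=10, disks_per_node=12)
print(raw, usable)
```

With these assumptions, 120 TB of raw disk space holds about 40 TB of actual data.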