Hadoop HDFS – Hadoop Distributed File System

By Vijay

Updated March 7, 2024

This Tutorial Explains Hadoop HDFS – Hadoop Distributed File System, its Components, and Cluster Architecture. You will also learn about the Rack Awareness Algorithm:

As we learned in the previous tutorial, the biggest issue with Big Data is storing it in an existing system. And even if we somehow stored part of it in an existing system, processing that Big Data took years.

The results that you wanted in minutes took weeks or even months, and by then the value of those results was lost.

=> Watch Out The Simple BigData Training Series Here.

[Image: Components of Hadoop – Hadoop Distributed File System]

To resolve this problem, or to cope with this issue, we now have HADOOP. Hadoop solved this Big Data problem using Hadoop HDFS.

[Image: Storing Big Data]

Hadoop HDFS resolved the storage problem of Big Data, and Hadoop Map Reduce resolved the issues related to processing that Big Data.

[Image: Components of Hadoop]

Now, we know that Hadoop essentially has a Distributed File System …BUT WHY?

Why is Hadoop a Distributed File System?

Let’s try to understand what a Distributed File System is and what its advantages are.

Distributed File System

Let’s take an example of reading 1 TB of data. We have a good high-end server with 4 I/O (Input/Output) channels, each channel having a bandwidth of 100 MB/s. Using this machine, you will be able to read this 1 TB of data in about 43 minutes.

Now, if we bring in 10 machines exactly like this one, what will happen?

[Image: Distributed File System]

The time is reduced to just 4.3 minutes. This is because the entire effort gets divided across 10 machines, and so the time taken to read 1 TB of data reduces to 1/10th, i.e. about 4.3 minutes.

Similarly, when we consider Big Data, that data gets divided into multiple chunks which are processed separately and in parallel, and that is why Hadoop chose a Distributed File System over a Centralized File System.
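
To make the arithmetic above concrete, here is a minimal back-of-the-envelope sketch in Python (an illustration only, not Hadoop code); the result is close to the 43 and 4.3 minute figures quoted above, and the exact value depends on whether you count 1 TB as 1000 GB or 1024 GB:

    # Back-of-the-envelope read-time estimate for 1 TB of data.
    TOTAL_DATA_MB = 1024 * 1024          # 1 TB expressed in MB
    CHANNELS_PER_MACHINE = 4             # I/O channels per server
    BANDWIDTH_PER_CHANNEL_MB_S = 100     # bandwidth of each channel in MB/s

    def read_time_minutes(machines):
        """Time to read the data when it is split evenly across the machines."""
        throughput_mb_s = machines * CHANNELS_PER_MACHINE * BANDWIDTH_PER_CHANNEL_MB_S
        return TOTAL_DATA_MB / throughput_mb_s / 60

    print(round(read_time_minutes(1), 1))   # ~43.7 minutes on one machine
    print(round(read_time_minutes(10), 1))  # ~4.4 minutes on ten machines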

Components Of Hadoop

Hadoop has 2 main components to solve the issues with Big Data.

  • The first component is the Hadoop HDFS to store Big Data.
  • The second component is the Hadoop Map Reduce to Process Big Data.

[Image: Main Hadoop Components]

Now when we see the architecture of Hadoop (image given below), it has two wings where the left wing is “Storage” and the right wing is “Processing”. That means the left wing is HDFS, i.e. the Hadoop Distributed File System, and the right wing is YARN and Map Reduce, i.e. the processing part.

Using HDFS, Hadoop enables us to store Big Data and using YARN & Map Reduce, Hadoop enables us to process the same Big Data that we are storing in HDFS.

[Image: Hadoop Core Components]

As you can see in the image above, HDFS has two major daemons, or you can call them processes or threads, which are nothing but Java processes running within a JVM – the NameNode and the DataNode.

NameNode is a master daemon that runs on the Master Machine, i.e. essentially a high-end machine, while DataNode is a slave daemon that runs on Slave Machines, which are commodity hardware. There can be many DataNodes, as there are more Slave Machines than Master Machines.

So we always have one NameNode and multiple DataNodes running on the Slave Machines.

Similarly, we have YARN on the other side, which again has two daemons: the Resource Manager, which runs on the Master Machine, and the Node Manager, which runs on the Slave Machines just like the DataNode. So every Slave Machine has two daemons – one is the DataNode and the other is the Node Manager.

The Master Machine has the NameNode and the Resource Manager running. The NameNode is responsible for managing the data in the Hadoop Distributed File System, and the Resource Manager is responsible for executing the processing tasks over this stored data.
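
As a quick mental model of the layout just described, here is a minimal sketch in plain Python (not Hadoop code; the machine names are invented for illustration) showing which daemons run on which kind of machine:

    # Hypothetical three-node cluster: one master machine and two slave machines.
    daemons_by_machine = {
        "master-1": ["NameNode", "ResourceManager"],   # master daemons: storage + processing
        "slave-1": ["DataNode", "NodeManager"],        # every slave runs both slave daemons
        "slave-2": ["DataNode", "NodeManager"],
    }

    for machine, daemons in daemons_by_machine.items():
        print(machine, "runs:", ", ".join(daemons))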

NameNode And DataNode

We will go deep into the HDFS architecture, and hence it is important to understand what a NameNode and a DataNode are, as these are the two main daemons that actually run HDFS.

NameNode

  • It is the Master Daemon.
  • It manages and maintains the DataNodes.
  • It records the metadata of the data stored on the DataNodes.
  • It receives heartbeats and block reports from all the DataNodes.

DataNode

  • It is a Slave Daemon.
  • The actual data is stored here.
  • It serves read and write requests from the clients.

[Image: NameNode and DataNodes]

Just focus on the diagram: as you can see, there is a centralized machine, the NameNode, that is controlling the various DataNodes, which are commodity hardware. So the NameNode is nothing but the Master Daemon that maintains all the DataNodes.

The NameNode has all the information about the data that is stored on the DataNodes. The DataNode, as the name itself suggests, stores the data that is in the Hadoop Cluster.

The NameNode only has the information about which data is stored on which DataNode. So, what we can say is that the NameNode stores the metadata of the data that is stored on the DataNodes.

The DataNode also does another task: it regularly sends a heartbeat back to the NameNode. Heartbeats tell the NameNode that this DataNode is still alive.

For Example, DataNodes send heartbeats back to the NameNode, and this way the NameNode has a picture of which DataNodes are alive, so it can use those DataNodes to store more data or read data from them.
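
To illustrate the two ideas above, metadata about block locations and heartbeats, here is a highly simplified sketch in plain Python. This is not how the real NameNode is implemented; the class, the method names, and the 30-second timeout are all invented for this illustration:

    import time

    class TinyNameNode:
        """Toy model of a NameNode: tracks block locations and DataNode heartbeats."""
        HEARTBEAT_TIMEOUT_S = 30   # assumed value for the sketch, not a Hadoop default

        def __init__(self):
            self.block_locations = {}   # block id -> list of DataNode ids (metadata only)
            self.last_heartbeat = {}    # DataNode id -> time of the last heartbeat

        def record_block(self, block_id, datanode_id):
            self.block_locations.setdefault(block_id, []).append(datanode_id)

        def receive_heartbeat(self, datanode_id):
            self.last_heartbeat[datanode_id] = time.time()

        def live_datanodes(self):
            now = time.time()
            return [dn for dn, ts in self.last_heartbeat.items()
                    if now - ts < self.HEARTBEAT_TIMEOUT_S]

    # The NameNode never stores the file data itself, only where each block lives.
    nn = TinyNameNode()
    nn.receive_heartbeat("datanode-1")
    nn.record_block("block-A", "datanode-1")
    print(nn.live_datanodes())   # ['datanode-1']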

Now we come to the DataNodes. DataNodes are nothing but the Slave Daemons that actually store the data sent to the Hadoop Cluster. These DataNodes are the ones that actually serve the read and write requests made by the clients.

If somebody wants to read data from the Hadoop Cluster, then these requests are processed by the DataNodes where the data is residing.

Hadoop Cluster Architecture

In the previous topic related to the NameNode and DataNode, we used the term “Hadoop Cluster”. Let’s take a quick look at what exactly it is.

[Image: Hadoop Cluster Architecture]

The above image shows the overview of a Hadoop Cluster Architecture. A Hadoop Cluster is nothing but a Master-Slave topology, in which there is a Master Machine, as you can see at the top of the Hadoop Cluster. On this Master Machine, the NameNode and the Resource Manager are running, i.e. the Master Daemons.

The Master Machine is connected to all the Slave Machines using core switches, because these DataNodes are actually placed in various racks. As you can see, Computer 1, Computer 2, Computer 3 up to Computer N are nothing but the Slave Machines, or DataNodes, and they are all present in one rack.

“The rack is actually a group of machines that are present physically at one particular location and are connected to each other.”

Thus the network distance between machines within the same rack is as small as possible. Similarly, there are more racks, however, they are not present at the same location; hence, we can have an “n” number of racks and also an “n” number of DataNodes, or computers, or Slave Machines within these racks.

This is how the Slave Machines are actually distributed over the cluster, while at the same time being connected to each other.

How Is Data Stored In HDFS?

Now we are slowly moving into the details of how HDFS works. Here we will explore the architecture of HDFS.

When we store a file in HDFS, the data gets stored as Blocks. The entire file is not stored as a single piece in one place, because, as you know, Hadoop is a Distributed File System.

So if you have a file size of maybe 1 PB (Petabyte), then this kind of storage is not present on a single machine, as the Hadoop cluster is built using commodity hardware. The storage on one single machine would be something around 1 TB or 2 TB.

Thus the entire file needs to be broken down into chunks of data, which are called HDFS Blocks.
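
A quick bit of arithmetic (a sketch using the assumed numbers from the paragraph above) shows why the file has to be spread out:

    # Why a 1 PB file cannot live on a single commodity machine.
    FILE_SIZE_TB = 1024          # 1 PB expressed in TB
    MACHINE_CAPACITY_TB = 2      # storage of one commodity machine, as mentioned above

    machines_needed = FILE_SIZE_TB / MACHINE_CAPACITY_TB
    print(machines_needed)       # 512.0 -> the file must be spread over hundreds of machines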

  • Each file is stored on HDFS as Blocks.
  • The default size of each block is 128 MB in Apache Hadoop 2.x (and 64 MB in the previous version, i.e. Apache Hadoop 1.x).
  • There is a facility to increase or decrease the block size using the configuration file, i.e. hdfs-site.xml, that comes with the Hadoop package (a sample snippet is shown right after this list).
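
For example, a block size override would typically be placed in hdfs-site.xml as in the hedged snippet below; dfs.blocksize is the Hadoop 2.x property name, and the 256 MB value is only an illustration:

    <!-- hdfs-site.xml: example override of the default block size -->
    <configuration>
      <property>
        <name>dfs.blocksize</name>
        <!-- 256 MB expressed in bytes; choose a value that suits your workload -->
        <value>268435456</value>
      </property>
    </configuration>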

Let’s take an example to understand this mechanism and see how these blocks are created.

Let us consider a file of 248 MB. Now, if we move this file into a Hadoop 2.x cluster, it will get broken down into one block, i.e. Block A of 128 MB, and another block, i.e. Block B of 120 MB.

As you can see, the first block is of 128 MB, i.e. the very first slab is cut there, and that is why the other block is of 120 MB and not 128 MB, i.e. HDFS is not going to waste any space if the remaining file size is smaller than the default block size.
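
The split described above can be sketched with a few lines of Python (an illustration of the arithmetic only, not how HDFS actually splits files):

    # Split a file size into HDFS-style blocks: full 128 MB blocks plus a smaller last block.
    BLOCK_SIZE_MB = 128

    def split_into_blocks(file_size_mb):
        blocks = []
        remaining = file_size_mb
        while remaining > 0:
            blocks.append(min(BLOCK_SIZE_MB, remaining))
            remaining -= blocks[-1]
        return blocks

    print(split_into_blocks(248))   # [128, 120] -> Block A = 128 MB, Block B = 120 MB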

[Image: Hadoop Cluster Example Diagram]

Now we have another issue in front of us: is it safe to keep a single copy of each block?

The answer is NO, because there is a chance that the system might fail (it is nothing but commodity hardware), and then we might be in big trouble. To overcome this issue, Hadoop HDFS has a good solution, i.e. “The Replication of Blocks”.

Hadoop Architecture Block Replication

Hadoop creates replicas of every block that gets stored in the Hadoop Distributed File System, and this is how Hadoop is a Fault-Tolerant system, i.e. even if your system fails, or your DataNode fails, or a copy is lost, you will have multiple other copies present on the other DataNodes or servers, so that you can always pick those copies from there.

[Image: Block Replication]

As seen in the above diagram that represents Block Replication, there are five different blocks of a file, i.e. Blocks 1, 2, 3, 4, and 5. Let’s check Block 1 first: you will find copies of Block 1 in Node 1, Node 2 and Node 4.

Similarly, Block 2 has also got three copies i.e. Node 2, Node 3 and Node 4 and so the same for Block 3, 4 and 5 in the respective Nodes.

So, as you can see, every block has been replicated three times, i.e. Hadoop follows a default replication factor of three, which means any file that you copy into the Hadoop Distributed File System gets replicated thrice.

In other words, if you copy 1 GB of a file into the Hadoop Distributed File System, it actually occupies 3 GB of storage in HDFS. The good part is that the default replication factor is changeable by making a change in the configuration files of Hadoop, as shown in the sample below.
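
As a hedged example, the cluster-wide replication factor can be changed in hdfs-site.xml; dfs.replication is the property name, and the value 2 below is only an illustration (remember that the physical storage used is roughly the file size multiplied by the replication factor):

    <!-- hdfs-site.xml: example override of the default replication factor of 3 -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>2</value>   <!-- each block will be stored on 2 DataNodes -->
      </property>
    </configuration>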

How does Hadoop decide where to store the replicas?

Hadoop actually follows the concept of Rack Awareness to decide where to store which replica of a Block.

Given below is the diagram depicting the Rack Awareness Algorithm.

[Image: Rack Awareness Algorithm]

There are three different Racks i.e. Rack-1, Rack-2, and Rack-3.

Rack-1 has four DataNodes, and so do Rack-2 and Rack-3; thus, in total, the entire Hadoop Cluster consists of three racks and 12 DataNodes.

Let’s say Block A is copied to DataNode 1 in Rack-1. As per the concept of Rack Awareness, the replicas of Block A cannot be created in the same rack; they need to be created in a rack other than Rack-1, as the original copy already exists in Rack-1.

If we create the replicas of Block A in the same Rack-1 and the entire Rack-1 fails, then we will lose the data for sure; thus the replicas have to be stored in a rack other than Rack-1.

So the replicas are going to be created on DataNode 6 and DataNode 8 of Rack-2. Similarly, for Block B and Block C, the replicas are going to be created in different racks, as shown in the above diagram.
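
The placement rule described above (first copy on one rack, the remaining copies together on a different rack) can be sketched as follows. This is a simplified illustration of the idea, not Hadoop's actual block placement code, and the node names are invented:

    import random

    def place_replicas(racks, first_rack, replication=3):
        """racks: dict mapping rack name -> list of DataNode names."""
        # First replica goes on a node of the writer's own rack.
        placement = [random.choice(racks[first_rack])]
        # The remaining replicas go together on one different rack.
        other_racks = [r for r in racks if r != first_rack]
        remote_rack = random.choice(other_racks)
        placement += random.sample(racks[remote_rack], replication - 1)
        return placement

    racks = {
        "Rack-1": ["DN1", "DN2", "DN3", "DN4"],
        "Rack-2": ["DN5", "DN6", "DN7", "DN8"],
        "Rack-3": ["DN9", "DN10", "DN11", "DN12"],
    }
    print(place_replicas(racks, "Rack-1"))   # e.g. ['DN1', 'DN6', 'DN8']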

Conclusion

We learned the following pointers from this tutorial:

  • Hadoop HDFS resolves the storage problem of Big Data.
  • Hadoop Map Reduce resolves the issues related to the processing of Big Data.
  • NameNode is a Master Daemon and is used to manage and maintain the DataNodes.
  • DataNode is a Slave Daemon and the actual data is stored here. It serves read and write requests from the clients.
  • In a Hadoop Cluster, a rack is a group of machines that are physically present at one particular location and are connected to each other.
  • Each file is stored on HDFS as Blocks.
  • The default size of each block is 128 MB in Apache Hadoop 2.x (64 MB in the previous version, i.e. Apache Hadoop 1.x).
  • There is a facility to increase or decrease the block size using the configuration file, i.e. hdfs-site.xml, that comes with the Hadoop package.

In the next tutorial on HDFS, we will learn about the HDFS architecture and the Read & Write mechanisms.

=> Visit Here To See The BigData Training Series For All.
