In this tutorial, we will learn about the Hadoop HDFS architecture, the HDFS read and write mechanisms, and how to use Hadoop HDFS commands:
In the previous tutorial, we learned how Hadoop HDFS solves the Big Data storage problem and how Hadoop MapReduce helps with processing Big Data.
We also found out how the NameNode interacts with the DataNodes and how the Hadoop cluster uses rack awareness to arrange data across different DataNodes.
=> Check Here For The A-Z Of Big Data Training Tutorials.
Hadoop HDFS Architecture
So far in this series, we have seen that HDFS has two main daemons, i.e. the NameNode and the DataNode, and that files are stored as blocks. We also learned about block replication, which is applied to every block that is copied into the Hadoop cluster.
Now let’s understand the complete picture of the HDFS Architecture.
As per the HDFS architecture diagram below, the NameNode, as we already know, is the master daemon of the HDFS architecture. It stores the metadata of all the DataNodes in the cluster and the information about all the blocks stored on each of these DataNodes.
We have racks, such as Rack-1, which has three DataNodes, and Rack-2, which has two DataNodes, as per the diagram.
A replication factor is also applied to every block stored on these nodes. There is a client that can read data from the DataNodes and another client that can write data to these DataNodes.
Thus, essentially two mechanisms run in parallel: the read mechanism, where a client raises a request to read data, and the write mechanism, where a client raises a request to write data. This is how the client moves data into the DataNodes, i.e. into the Hadoop cluster, and reads it back.
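To relate this picture to a running cluster, you can ask the NameNode which DataNodes it currently knows about and check the configured replication factor. A minimal sketch, assuming shell access to a machine with the Hadoop client configured (your cluster layout will of course differ from the diagram):
$ hdfs dfsadmin -report
$ hdfs getconf -confKey dfs.replication
The first command lists the live DataNodes along with their capacity and usage, and the second prints the default replication factor (commonly 3).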
HDFS Read And Write Mechanism
Write:
Here we will understand the mechanism behind writing a file to HDFS and reading a file from HDFS.
The diagram above shows the HDFS write mechanism. A client can raise a request to write a file or to read a file. In Step 1, the client sends a write request for Block A to the NameNode, and the NameNode returns the list of IP addresses of the DataNodes where the client can write the block, i.e. Block A.
The client then connects to the switch and sends a notification to DataNode 1, DataNode 4, and DataNode 6. It contacts these particular DataNodes because they are the ones returned by the NameNode; the NameNode specified that the data must be written to DataNodes 1, 4, and 6, which is why the client connects to all three of them at the same time.
As the very first step, the client takes an acknowledgment from all these DataNodes to confirm whether they are ready to perform the write operation or not. A DataNode could be busy executing another task and unavailable at that moment, so the very first step is to collect their acknowledgments.
As soon as they confirm they are ready, the write pipeline is created. The client then sends the write request to DataNodes 1, 4, and 6, and the very first copy is created on DataNode 1, i.e. Block A is written to DataNode 1.
The next copy is then created on DataNode 4 by DataNode 1 itself; as you can see from the arrow through the core switch, the copy is first created on DataNode 1, and DataNode 1 then creates the second copy on DataNode 4.
Note that it is not the client doing this job; it is the DataNode that actually creates the copy on DataNode 4, and finally DataNode 4 creates a copy on DataNode 6, i.e. the third copy of Block A.
Once the third copy is created on DataNode 6, it sends an acknowledgment back to DataNode 4, DataNode 4 sends the acknowledgment back to DataNode 1, and DataNode 1, in turn, sends the acknowledgment back to the client. Finally, the client sends a message of successful write back to the NameNode as an acknowledgment.
Once the NameNode receives the success message, it updates the metadata on its end, i.e. it stores the information that Block A has been written to DataNodes 1, 4, and 6. So this is the entire write mechanism for Block A, and the replica copies are created sequentially.
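You can observe this behavior from the command line by writing a file and then asking the NameNode where its block replicas were placed. A minimal sketch, assuming an illustrative local file sample.txt and an HDFS home directory /user/hadoop:
$ hadoop fs -put sample.txt /user/hadoop/sample.txt
$ hdfs fsck /user/hadoop/sample.txt -files -blocks -locations
The fsck report lists each block of the file together with the DataNodes that hold its replicas, which is exactly the placement decided by the NameNode during the write.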
It gets a little more complex when there is more than one block. The diagram above shows that when there is more than one block, i.e. Block A and Block B, the same process is followed: the client sends a request to the NameNode.
The NameNode gives a list of DataNodes back to the client, and the client then connects to the core switch. When the client connects to the core switch, it writes the first copy of both blocks in parallel.
In the diagram, 1A and 1B mark the first step of copying Block A and Block B. Then, in steps 2A and 2B, one request goes to the switch of Rack 1 and the other to the switch of Rack 5; the switch of Rack 1 connects to DataNode 1 and creates a copy of Block A.
At the same time, the client also creates a copy of Block B on DataNode 7. So the very first copy of every block, here Block A and Block B, and even if there are more blocks, is created in parallel and not sequentially.
Once the first copy is created, the remaining replicas are created sequentially. Looking at Block A first, after the copy on DataNode 1 is written, another copy of Block A is created on DataNode 4 of Rack 4.
Once this copy is created, another copy of Block A is created on DataNode 6 of the same Rack 4. Similarly, a copy of Block B is created on DataNode 9, and the next copy is then created sequentially on DataNode 3 of Rack 1, as per the request sent to the switch.
To summarize, the first copy of every block is written into HDFS in parallel, whereas the replicas of those blocks are created sequentially by the DataNodes themselves.
Read:
The read mechanism is easier to understand, as it is very simple. Just like with writing, the client is the one who makes the request: it asks the NameNode for a particular block, and the NameNode sends back the list of DataNodes where that block is stored.
As you can see in the diagram below, the client has requested Blocks A and B from the NameNode, and the NameNode sends the addresses of DN1, i.e. DataNode 1, and DN3, i.e. DataNode 3, back to the client machine.
The client machine then makes a connection with the core switch, reads Block A from DataNode 1 and Block B from DataNode 3, and the data is sent back to the client machine through the core switch. The client can use this data for whatever purpose it needs.
The NameNode ensures that the client does not have to do any extra work to locate or read the data. It makes sure that the DataNodes serving the data are as close to the client as possible, so the client does not have to consume a lot of network bandwidth just to read the data. This is a crucial optimization handled by the NameNode, and it helps a lot.
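From the command line, a read is just as simple. A minimal sketch, reusing the illustrative path from the write example above:
$ hadoop fs -cat /user/hadoop/sample.txt
$ hadoop fs -get /user/hadoop/sample.txt ./sample-copy.txt
The first command streams the file's contents to the console, while the second copies the file from HDFS to the local file system; in both cases, the NameNode tells the client which DataNodes to read the blocks from.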
HDFS / Hadoop Commands
Listed below are the frequently used Hadoop/HDFS commands. These commands are run through FsShell, the Hadoop file system shell. First, go to your home or working directory in the command prompt and then run the commands.
#1) To see the list of available commands in HDFS.
$ hadoop fs -help
#2) To create directories in HDFS.
$ hadoop fs -mkdir <path>
#3) To see the contents under a particular directory.
$ hadoop fs -ls
Also, we could use -d, -h, or -R with the ls command, as shown in the example after this list.
- -d Directories are listed as plain files.
- -h Formats the sizes of files in a human-readable fashion rather than a number of bytes.
- -R Recursively list the contents of directories
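For example, to list a hypothetical /user directory recursively with human-readable file sizes:
$ hadoop fs -ls -R -h /user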
#4) To see the contents of a file.
$ hadoop fs -cat <path of file>
#5) To copy a file from the local file system to HDFS.
$ hadoop fs -put <source-path> <destination-path>
Also, we could use -p or -f with the put command (see the example after this block).
- -p preserves access and modification times.
- -f overwrites the destination.
Or
$ hadoop fs -copyFromLocal <source-path> <destination-path>
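For example, to copy an illustrative local file into a hypothetical HDFS directory and overwrite any existing copy (the -copyFromLocal form works the same way):
$ hadoop fs -put -f sample.txt /user/hadoop/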
#6) To copy a file from HDFS to local.
$ hadoop fs -get <source-hadoop-path> <destination-local-path>
Or
$ hadoop fs -copyToLocal <source-hadoop-path> <destination-local-path>
#7) To remove the file or empty directory identified by the path.
$ hadoop fs -rm <file path>
#8) To copy the file or directory identified by src to dest, within HDFS.
$ hadoop fs -cp <src> <dest>
#9) To move the file or directory indicated by src to dest, within HDFS.
$ hadoop fs -mv <src> <dest>
#10) To show disk usage, in bytes.
$ hadoop fs -du <path>
#11) To append the contents of all the given local files to the given destination file on HDFS.
$ hadoop fs -appendToFile <local files separated by space> <hdfs destination file>
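For example, assuming two illustrative local log files and a destination file on HDFS:
$ hadoop fs -appendToFile log1.txt log2.txt /user/hadoop/all-logs.txt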
#12) To show the last 1 KB of the file.
$ hadoop fs -tail <filename>
#13) To create a file of zero length in HDFS.
$ hadoop fs -touchz <path>
#14) To change the replication factor of a file to a specific value instead of the default replication factor used for the rest of HDFS (see the example after this list).
$ hadoop fs -setrep <replication factor number> <file/path name>
Also, we can use -R or -w with -setrep.
- -w requests that the command waits for the replication to complete.
- -R is accepted for backward compatibility.
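For example, to reduce the replication factor of the illustrative sample file used earlier to 2 and wait until re-replication completes:
$ hadoop fs -setrep -w 2 /user/hadoop/sample.txt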
#15) To merge a list of files in one directory on HDFS into a single file on the local file system.
$ hadoop fs -getmerge <src> <localDest>
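For example, assuming a hypothetical HDFS output directory produced by a job:
$ hadoop fs -getmerge /user/hadoop/output ./merged-output.txt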
#16) To count the number of directories, files, and bytes under the given paths.
$ hadoop fs -count <path>
#17) To show the amount of space used, in bytes.
$ hadoop fs -du <path>
Also, we can use -s or -h with the du command.
- -s shows the total (summary) size.
- -h formats the sizes of files in a human-readable fashion
#18) To change the permissions of a file.
$ hadoop fs -chmod [-R] <mode,mode,...> <path> ...
Here, -R modifies the files recursively.
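For example, to recursively give the owner full permissions and everyone else read and execute permissions on a hypothetical directory:
$ hadoop fs -chmod -R 755 /user/hadoop/data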
#19) To change the group of a file or path.
$ hadoop fs -chgrp [-R] <groupname> <path>
#20) To change the owner and group of a file.
$ hadoop fs -chown [-R] [OWNER][:[GROUP]] PATH
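For example, assuming a hypothetical user hdfsadmin and group analytics (changing ownership typically requires HDFS superuser rights):
$ hadoop fs -chown -R hdfsadmin:analytics /user/hadoop/data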
Conclusion
In this tutorial, we learned about the Hadoop architecture and the write and read mechanisms of HDFS, and saw how the Hadoop Distributed File System works with the data.
We learned how HDFS handles the client's requests and how the activities at the NameNode and DataNode level are acknowledged. We also explained how the core switch works as a mediator between the client and the DataNodes.
When multiple blocks are written, the first copy of each block is written to its first DataNode in parallel through the core switch, and then the DataNodes themselves take care of copying the replicas sequentially.
In the next tutorial, we will discuss another component, i.e. Hadoop MapReduce, which is used to process the data as per the client's requirement.
=> Check Out The Perfect BigData Training Guide Here.