Introduction to HDFS Architecture
HDFS (Hadoop Distributed File System) is the open-source storage component of the Apache Hadoop framework, which is managed by the Apache Software Foundation. It stores data across multiple machines in a cluster using DataNodes. Some of the important features of HDFS are availability, scalability, and replication. The architecture of HDFS includes components such as the NameNode, Secondary NameNode, DataNode, Checkpoint node, Backup node, and blocks. HDFS is fault-tolerant, and it achieves this through replication. The NameNode and DataNodes coordinate to store very large files in a distributed structure across the cluster.
Features of HDFS
The main features of HDFS are as follows:
HDFS replicates data across DataNodes by creating copies (replicas) of each block on other DataNodes. If a hardware failure or error occurs, users can still read their data from another DataNode that holds a replica.
In HDFS, data is stored on multiple DataNodes in the form of blocks, and the storage capacity of the cluster can be increased whenever needed. HDFS supports two scalability mechanisms: horizontal scalability (adding more DataNodes to the cluster) and vertical scalability (adding more resources, such as disks, to existing nodes).
Availability is another key feature of HDFS: because blocks are replicated, users can still access their data easily when hardware fails.
HDFS follows a master-slave architecture, which has the following components:
1. NameNode
NameNode is also known as the master node because it manages the metadata for all the blocks stored on the DataNodes.
NameNode performs the following tasks:
- Manages all the DataNode blocks
- Gives file access to users
- Keeps records of all the blocks present on the DataNodes
- Records all changes to file metadata. For example, if a file is renamed, modified, or deleted, the NameNode immediately records that change in the EditLogs
- Receives regular reports (heartbeats and block reports) from the DataNodes to confirm that the DataNodes are alive and that all the blocks are present
- If a DataNode fails, it immediately selects other DataNodes on which to create new replicas, and it manages communication with all the DataNodes
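The liveness tracking described in the tasks above can be sketched in a few lines of Python. This is only a toy model, not the NameNode's actual implementation: the class name `NameNodeSketch`, its methods, and the 30-second timeout are all hypothetical and chosen for illustration.

```python
import time

# Hypothetical timeout for this sketch; real HDFS derives its own value
# from cluster configuration.
HEARTBEAT_TIMEOUT_SECONDS = 30.0


class NameNodeSketch:
    """Toy model of how a NameNode might track DataNode liveness."""

    def __init__(self, timeout=HEARTBEAT_TIMEOUT_SECONDS):
        self.timeout = timeout
        self.last_heartbeat = {}   # DataNode id -> time of last heartbeat
        self.block_locations = {}  # block id -> set of DataNode ids

    def receive_heartbeat(self, datanode_id, now=None):
        # Record when this DataNode was last heard from.
        self.last_heartbeat[datanode_id] = time.monotonic() if now is None else now

    def receive_block_report(self, datanode_id, block_ids):
        # Drop stale entries for this DataNode, then record what it reports now.
        for holders in self.block_locations.values():
            holders.discard(datanode_id)
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(datanode_id)

    def dead_datanodes(self, now=None):
        # A DataNode counts as dead once its last heartbeat is older than the timeout.
        now = time.monotonic() if now is None else now
        return {dn for dn, seen in self.last_heartbeat.items()
                if now - seen > self.timeout}

    def live_replicas(self, block_id, now=None):
        # Replicas that sit on DataNodes still considered alive.
        return self.block_locations.get(block_id, set()) - self.dead_datanodes(now)
```

A block whose `live_replicas` set shrinks below the desired replica count is a candidate for re-replication on another DataNode.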
Types of files in NameNode
The NameNode maintains two types of files: FsImage and EditLogs.
i. FsImage: It is also called a file system image because it contains the complete state of the filesystem namespace. It stores all the directories and files of the filesystem in serialized form.
ii. EditLogs: The most recent modifications made to the files of the filesystem, that is, the changes made since the last FsImage was written, are stored in the EditLogs.
2. Secondary NameNode
Secondary NameNode is also called a checkpoint node because it performs regular checkpoints. It acts as a helper for the primary NameNode, not as a standby replacement for it.
Secondary NameNode performs the following tasks:
- It combines the FsImage and EditLogs from the NameNode.
- It reads the filesystem metadata from the NameNode's memory and writes it to disk.
- It downloads the FsImage and EditLogs from the NameNode at regular intervals, reads the modifications recorded in the EditLogs, and applies them to the FsImage. This process creates a new FsImage, which is then sent back to the NameNode. Whenever the NameNode starts, it uses this FsImage.
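The checkpoint process above amounts to replaying the EditLogs on top of the FsImage. Here is a minimal Python sketch of that idea. The record format (dictionaries with `op`, `path`, `src`, and `dst` keys) and the function names are assumptions made for this example; the real on-disk formats are different.

```python
import copy


def apply_edit(namespace, edit):
    """Apply one edit-log record to a namespace dict (path -> metadata)."""
    op = edit["op"]
    if op == "create":
        namespace[edit["path"]] = {"blocks": list(edit.get("blocks", []))}
    elif op == "rename":
        namespace[edit["dst"]] = namespace.pop(edit["src"])
    elif op == "delete":
        namespace.pop(edit["path"], None)
    else:
        raise ValueError(f"unknown edit operation: {op}")


def checkpoint(fsimage, edit_log):
    """Merge an FsImage with its EditLogs; return (new FsImage, empty log)."""
    new_image = copy.deepcopy(fsimage)  # leave the old image untouched
    for edit in edit_log:               # replay the edits in order
        apply_edit(new_image, edit)
    return new_image, []                # the log starts fresh after a checkpoint
```

Starting from the merged image means the NameNode has fewer edits to replay at startup.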
3. DataNode
DataNode is also known as a slave node because it stores the actual data on a slave machine. A DataNode stores its blocks as files on the local file system, such as ext3 or ext4.
A DataNode performs the following tasks:
- All the actual data is stored on DataNodes
- It serves read and write requests from users, for example, reading file content and writing new data to files
- It also follows the instructions given by the NameNode, for example, creating, deleting, and replicating blocks
4. Checkpoint Node
A Checkpoint node is a node that creates checkpoints of the namespace at regular intervals. The Checkpoint node in HDFS downloads the FsImage and EditLogs from the NameNode, merges them to create a new image, and sends that new image back to the NameNode. The latest checkpoint is stored in a directory with the same structure as the NameNode's directory. Because of this, the checkpointed image is always available to the NameNode if it is needed.
5. Backup Node
The Backup node performs the same checkpointing task as the Checkpoint node. In addition, the Backup node keeps an up-to-date, in-memory copy of the file system namespace. There is no need to download the FsImage and EditLogs files from the active NameNode to create a checkpoint in the Backup node, because its namespace is always synchronized with the state of the active NameNode. The checkpoint process on the Backup node is therefore more efficient: it only needs to save the namespace into the local FsImage file and reset the EditLogs.
6. Blocks
All user data is stored in HDFS files, which are divided into small segments. These segments are stored on the DataNodes and are called blocks. The default block size is 128 MB. The block size can be changed to suit users' requirements by configuring HDFS.
If the remaining data is smaller than the block size, the block only takes up as much space as the data. For example, if the data is 135 MB, HDFS creates 2 blocks: one of the default size, 128 MB, and another of only 7 MB, not 128 MB. Because of this, a lot of disk space is saved.
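The block arithmetic above can be checked with a short Python function. The name `split_into_blocks` is hypothetical, and sizes are whole megabytes to match the example.

```python
DEFAULT_BLOCK_SIZE_MB = 128  # default block size stated in the text


def split_into_blocks(file_size_mb, block_size_mb=DEFAULT_BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file would be split into."""
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("file size must be >= 0 and block size must be > 0")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        # The last block only occupies as much space as the leftover data.
        sizes.append(remainder)
    return sizes
```

For the 135 MB file in the example, `split_into_blocks(135)` returns `[128, 7]`.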
Replication Management in HDFS Architecture
HDFS is fault-tolerant. Fault tolerance is the ability of a system to keep working when failures occur, and it describes how the system responds to errors and difficult conditions. In HDFS, fault tolerance is based on replica creation. Copies of users' data are saved on multiple machines in the HDFS cluster. Hence, if there is any breakdown or failure in the system, a copy of that data can be accessed from the other machines of the HDFS cluster. By default, each block in HDFS has 3 replicas, which are stored on different DataNodes. The NameNode keeps track of the copies available on the DataNodes, and it adds or deletes copies when blocks are under-replicated or over-replicated.
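The under-replicated and over-replicated check can be sketched as a comparison between each block's live replica count and the target of 3. The function name and the dictionary input are assumptions for this example, not part of any Hadoop API.

```python
def classify_replication(replica_counts, target=3):
    """Split blocks into under- and over-replicated groups.

    replica_counts maps a block id to its number of live replicas.
    Returns two dicts: block id -> replicas to add, block id -> replicas to remove.
    """
    under = {block: target - count
             for block, count in replica_counts.items() if count < target}
    over = {block: count - target
            for block, count in replica_counts.items() if count > target}
    return under, over
```

For example, with counts of `{"blk_1": 3, "blk_2": 1, "blk_3": 4}`, block `blk_2` needs 2 more replicas and block `blk_3` has 1 replica too many.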
To write a file to HDFS, the client first communicates with the NameNode for metadata. The NameNode responds with the number of blocks, their locations, the number of copies, and so on. The client divides the file into multiple blocks based on the NameNode's information and then starts sending them to the DataNodes. First, the client sends block A to DataNode 1, along with information about the other DataNodes. When DataNode 1 receives block A from the client, it copies the block to DataNode 2, which in this example is on the same rack. Because both DataNodes are on the same rack, the block is transferred through the rack switch. DataNode 2 then copies the same block to DataNode 3. Because these two DataNodes are on different racks, the block is transferred through an out-of-rack switch. Once a DataNode receives the client's block, it sends a confirmation to the NameNode. The same process is repeated for each block of the file.
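The write path can be pictured as a chain: each DataNode stores the block and passes it on to the next one. The Python sketch below simulates this in memory. The names `DataNodeSketch` and `write_file` are hypothetical, and the sketch uses one fixed pipeline for every block, whereas a real cluster asks the NameNode for target DataNodes per block and streams the data over the network.

```python
class DataNodeSketch:
    """Toy DataNode that stores blocks in memory and forwards them downstream."""

    def __init__(self, name, rack):
        self.name = name
        self.rack = rack   # informational only in this sketch
        self.blocks = {}   # block id -> bytes

    def receive(self, block_id, data, downstream):
        # Store the block locally, then forward it along the rest of the pipeline.
        self.blocks[block_id] = data
        acks = [self.name]
        if downstream:
            acks += downstream[0].receive(block_id, data, downstream[1:])
        return acks


def write_file(data, block_size, pipeline):
    """Split data into blocks and push each block through the pipeline.

    Returns a map of block id -> names of DataNodes that confirmed the block.
    """
    if block_size <= 0:
        raise ValueError("block size must be positive")
    block_map = {}
    for offset in range(0, len(data), block_size):
        block_id = f"blk_{offset // block_size}"
        chunk = data[offset:offset + block_size]
        # The client only talks to the first DataNode in the pipeline.
        block_map[block_id] = pipeline[0].receive(block_id, chunk, pipeline[1:])
    return block_map
```

With DataNode 1 and DataNode 2 on one rack and DataNode 3 on another, every block ends up on all three nodes, which matches the three replicas described above.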
For a read operation, the client first communicates with the NameNode for metadata. The client sends the NameNode the file name and location. The NameNode responds with the block numbers, locations, copies, and other information. After that, the client communicates with the DataNodes. Based on the information received from the NameNode, the client starts reading data in parallel from the DataNodes. When all the blocks of the file have been received by the client or application, it combines these blocks to restore the original file.
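The read path can be sketched in the same spirit: take the ordered block list from the NameNode, fetch the blocks in parallel, and join them in order. In this illustrative Python example, `read_file` and its two arguments are assumptions, and plain dictionaries stand in for the DataNodes.

```python
from concurrent.futures import ThreadPoolExecutor


def read_file(block_locations, datanodes):
    """Fetch blocks in parallel and reassemble the original file.

    block_locations: ordered list of (block id, [DataNode names]) pairs,
                     standing in for the NameNode's answer.
    datanodes:       DataNode name -> {block id: bytes}.
    """
    def fetch(entry):
        block_id, holders = entry
        # Try each replica in turn until one DataNode has the block.
        for name in holders:
            stored = datanodes.get(name, {})
            if block_id in stored:
                return stored[block_id]
        raise IOError(f"no reachable replica for {block_id}")

    with ThreadPoolExecutor(max_workers=4) as pool:
        # map() yields results in input order, so the blocks stay in sequence.
        chunks = list(pool.map(fetch, block_locations))
    return b"".join(chunks)
```

If the first DataNode listed for a block is unavailable, the sketch falls back to the next replica, which is how replication keeps reads working after a failure.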
With the help of the NameNode and DataNodes, HDFS reliably stores very large files across machines in a large cluster. Because of its fault tolerance, data remains accessible during software or hardware failures. This is how the HDFS architecture works.
This has been a guide to HDFS Architecture. Here we discussed the basic concepts with different types of Architecture, features, and replication management of HDFS Architecture.