Introduction to Hadoop Architecture
Hadoop is an open-source framework for processing very large data sets quickly using distributed computing, with the data spread across the nodes of a cluster. Its architecture follows a master-slave structure and is divided into two parts: storing data and processing it. HDFS (the Hadoop Distributed File System) does the storing, while MapReduce does the processing.
- The basic idea of this architecture is that storage and processing are handled by two separate layers. Processing is done through MapReduce programming, and storage is done on HDFS.
- It has a master-slave architecture for both data storage and data processing. The master node for data storage in Hadoop is the name node. There is also a master node for processing, the job tracker, which monitors and parallelizes data processing using Hadoop MapReduce.
- The slaves are the other machines in the Hadoop cluster, which store data and perform complex computations. Each slave node runs a data node, which stores blocks, and a task tracker, which runs tasks on behalf of the job tracker and keeps them synchronized with it. This type of system can be set up either in the cloud or on-premises.
- The name node is a single point of failure when it is not running in high availability mode. For that reason, the Hadoop architecture has provisions for maintaining a standby name node that can take over if the active one fails. Earlier versions had only a secondary name node, which keeps a checkpointed copy of the metadata and is not an automatic failover replacement for the primary name node.
FSimage and Edit Log
- The name node keeps all file system metadata and makes it persistent by storing it in two files: the FSimage and the edit log. The FSimage holds a complete snapshot of the file system metadata at a given point in time. Changes made after that point also need to be recorded. These incremental changes, such as renaming a file or appending data to it, are stored in the edit log.
- Appending each change to the edit log is a better option than creating a new FSimage every time something changes, because writing a complete snapshot is far more expensive than logging a single change. A new FSimage is created only at a checkpoint, when the accumulated edits are merged into the previous snapshot. If the name node fails, it can restore its previous state by loading the latest FSimage and replaying the edit log. The secondary name node performs this merge periodically and so keeps an up-to-date copy of the FSimage. As a result, even if the name node goes down, a recent checkpoint of the metadata survives on the secondary name node, which limits how much can be lost. The name node does not have to push these files to the secondary name node; the secondary name node fetches them itself.
- HDFS is designed to store very large files reliably and to serve them at high throughput. It stores data across the machines of a large cluster. Each file is stored as a sequence of blocks, and these blocks are replicated for fault tolerance. The block size and the replication factor are configurable per file. By default, the replication factor is 3. It can be specified when a file is created and changed later.
- The name node makes all decisions regarding these replicas. It periodically receives a heartbeat and a block report from each data node in the cluster. Receipt of a heartbeat implies that the data node is working properly. A block report lists all blocks present on that data node.
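The FSimage and edit log mechanism described in this section can be illustrated with a small in-memory model. This is a minimal sketch, not the real HDFS on-disk format: the class `ToyNameNode`, its method names, and the three operation types are hypothetical names chosen for illustration, and the metadata is reduced to a mapping from path to file size.

```python
class ToyNameNode:
    """Toy model of name node metadata persistence (hypothetical, not HDFS code)."""

    def __init__(self):
        self.fsimage = {}    # last checkpoint: path -> file size in bytes
        self.edit_log = []   # changes recorded since the last checkpoint
        self.namespace = {}  # live in-memory metadata

    @staticmethod
    def _apply(namespace, op):
        """Apply one logged operation to a namespace dictionary."""
        kind, args = op[0], op[1:]
        if kind == "create":
            path, size = args
            namespace[path] = size
        elif kind == "rename":
            src, dst = args
            namespace[dst] = namespace.pop(src)
        elif kind == "append":
            path, extra = args
            namespace[path] += extra
        else:
            raise ValueError(f"unknown operation: {kind}")

    def record(self, *op):
        """Append the change to the edit log, then apply it in memory."""
        self.edit_log.append(op)
        self._apply(self.namespace, op)

    def checkpoint(self):
        """Merge the edit log into a new FSimage and start an empty log."""
        image = dict(self.fsimage)
        for op in self.edit_log:
            self._apply(image, op)
        self.fsimage = image
        self.edit_log = []

    def restart(self):
        """Rebuild in-memory state from the FSimage plus replayed edits."""
        namespace = dict(self.fsimage)
        for op in self.edit_log:
            self._apply(namespace, op)
        self.namespace = namespace


nn = ToyNameNode()
nn.record("create", "/logs/a.txt", 100)
nn.checkpoint()                                    # snapshot now holds a.txt
nn.record("rename", "/logs/a.txt", "/logs/b.txt")  # only logged, no new snapshot
nn.record("append", "/logs/b.txt", 50)
nn.namespace = {}                                  # simulate losing in-memory state
nn.restart()
print(nn.namespace)  # {'/logs/b.txt': 150}
```

The point of the design is that `record` only appends a small entry, while the expensive full copy happens in `checkpoint`, which can run rarely and on another machine.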
Placement of Replicas
- The placement of replicas is critical to the reliability and performance of HDFS. The replica placement policy aims to balance data reliability, availability, and network bandwidth utilization. A large cluster is typically spread across many racks, and the replicas of a block are not all kept on one rack. With the default replication factor of 3, no more than two replicas of a block are placed on the same rack, and the remaining replica is placed on a different rack so that the data survives the loss of an entire rack.
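- Nodes on different racks communicate through switches, so bandwidth between racks is usually lower than bandwidth within a rack. The name node knows the rack id of each data node. Placing every replica on a different rack would prevent data loss when a whole rack fails and would allow reads to use bandwidth from multiple racks, but it makes writes more expensive because each block has to be sent to several racks. Keeping two replicas on one rack cuts inter-rack write traffic and improves write performance. Because the chance of a rack failure is much lower than that of a node failure, this does not noticeably weaken reliability. The trade-off is lower aggregate network bandwidth for reads, since a block sits on two unique racks rather than three.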
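To make the block and rack arithmetic concrete, here is a minimal sketch of splitting a file into blocks and choosing nodes for three replicas of one block. It rests on several assumptions: the 128 MB block size is only an example value, the function names and rack topology are hypothetical, and the node choice is deterministic, whereas a real cluster would also weigh load and free space. The policy shown puts one replica on the writer's node and the other two on different nodes of a single remote rack, so no rack holds more than two replicas.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed example block size in bytes
REPLICATION = 3                 # default replication factor


def block_count(file_size, block_size=BLOCK_SIZE):
    """Number of blocks needed for a file (ceiling division)."""
    return (file_size + block_size - 1) // block_size


def raw_storage(file_size, replication=REPLICATION):
    """Bytes consumed across the cluster once every block is replicated."""
    return file_size * replication


def place_replicas(writer_node, racks):
    """Pick (rack, node) pairs for three replicas of one block.

    racks maps a rack id to the list of data nodes on that rack.
    """
    local_rack = next(
        (rack for rack, nodes in racks.items() if writer_node in nodes), None
    )
    if local_rack is None:
        raise ValueError(f"{writer_node} is not in any rack")
    # The remote rack must have two nodes to host replicas two and three.
    remote_rack = next(
        (r for r in sorted(racks) if r != local_rack and len(racks[r]) >= 2),
        None,
    )
    if remote_rack is None:
        raise ValueError("no remote rack with at least two nodes")
    second, third = sorted(racks[remote_rack])[:2]
    return [
        (local_rack, writer_node),
        (remote_rack, second),
        (remote_rack, third),
    ]


racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5"]}
print(block_count(300 * 1024 * 1024))  # 3
print(place_replicas("n1", racks))
# [('rack1', 'n1'), ('rack2', 'n3'), ('rack2', 'n4')]
```

In this example the block crosses the inter-rack switch only once on write (from rack1 to rack2), yet losing either rack still leaves at least one replica.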
MapReduce is used to process the data stored on HDFS. It is a framework for writing distributed applications that process large amounts of data efficiently. These applications run on large clusters of commodity hardware in a reliable, fault-tolerant manner. The core of MapReduce consists of three operations: mapping the input to key-value pairs, shuffling the pairs so that values with the same key are grouped together, and reducing each group to a result.
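The flow of a MapReduce job can be sketched in a single process: a map step emits key-value pairs, a shuffle step groups them by key, and a reduce step combines each group. The word count below is a minimal illustration under that assumption. It does not use the Hadoop API, and the function names are hypothetical. In a real cluster the map and reduce calls would run as parallel tasks on different nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key, values):
    """Combine all values collected for one key."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


print(word_count(["big data big", "data node"]))
# {'big': 2, 'data': 2, 'node': 1}
```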
Conclusion – Hadoop Architecture
Hadoop is an open-source framework for building fault-tolerant data systems. It can store large amounts of data reliably. Its two parts, storing data in HDFS and processing it through MapReduce, work together to handle that data correctly and efficiently. The name node manages the metadata for all blocks and keeps the most recent state by persisting it in the FSimage and the edit log. The replication factor ensures that copies of the data exist and can be used to recover whenever there is a failure. HDFS can also move deleted files to a trash directory, from which they can be restored until the trash is emptied and the space is reclaimed.
This has been a guide to Hadoop Architecture. Here we have discussed the architecture, map-reduce, placement of replicas, data replication.