Introduction to Hadoop Architecture
Hadoop architecture is an open-source framework that is used to process large data easily by making use of the distributed computing concepts where the data is spread across different nodes of the clusters. This architecture follows a master-slave structure where it is divided into two steps of processing and storing data. These steps are performed by the Map-reduce and HDFS where the processing is done by the MapReduce while the storing is done by the HDFS.
- The basic idea of this architecture is that the entire storing and processing are done in two steps and in two ways. The first step is processing which is done by Map reduce programming and the second-way step is of storing the data which is done on HDFS.
- It has a master-slave architecture for storage and data processing. The master node for data storage in Hadoop is the name node. There is also a master node that does the work of monitoring and parallels data processing by making use of Hadoop Map Reduce.
- The slaves are other machines in the Hadoop cluster which help in storing data and also perform complex computations. Each slave node has been assigned with a task tracker and a data node has a job tracker which helps in running the processes and synchronizing them effectively. This type of system can be set up either on the cloud or on-premise.
- The Name Node is a single point of failure when it is not running on high availability mode. The Hadoop architecture also has provisions for maintaining a stand by Name node in order to safeguard the system from failures. Previously there were secondary name nodes that acted as a backup when the primary name node was down.
FSimage and Edit Log
- FSimage and Edit Log ensure Persistence of File System Metadata to keep up with all information and name node stores the metadata in two files. These files are the FSimage and the edit log. The job of FSimage is to keep a complete snapshot of the file system at a given time. The changes that are constantly being made in a system need to be kept a record of. These incremental changes like renaming or appending details to file are stored in the edit log.
- The framework provides a better option of rather than creating a new FSimage every time, a better option being able to store the data while a new file for FSimage. FSimage creates a new snapshot every time changes are made If Name node fails it can restore its previous state. The secondary name node can also update its copy whenever there are changes in FSimage and edit logs. Thus, it ensures that even though the name node is down, in the presence of secondary name node there will not be any loss of data. Name node does not require that these images have to be reloaded on the secondary name node.
- HDFS is designed to process data fast and provide reliable data. It stores data across machines and in large clusters. All files are stored in a series of blocks. These blocks are replicated for fault tolerance. The block size and replication factor can be decided by the users and configured as per the user requirements. By default, the replication factor is 3. The replication factor can be specified at the time of file creation and it can be changed later.
- All decisions regarding these replicas are made by the name node. The name node keeps sending heartbeats and block report at regular intervals for all data nodes in the cluster. The receipt of heartbeat implies that the data node is working properly. Block report specifies the list of all blocks present on the data node.
Placement of Replicas
- The placement of replicas is a very important task in Hadoop for reliability and performance. All the different data blocks are placed on different racks. The implementation of replica placement can be done as per reliability, availability and network bandwidth utilization. The cluster of computers can be spread across different racks. Not more than two nodes can be placed on the same rack. The third replica should be placed on a different rack to ensure more reliability of data.
- The two nodes on rack communicate through different switches. The name node has the rack id for each data node. But placing all nodes on different racks prevents loss of any data and allows usage of bandwidth from multiple racks. It also cuts the inter-rack traffic and improves performance. Also, the chance of rack failure is very less as compared to that of node failure. It reduces the aggregate network bandwidth when data is being read from two unique racks rather than three.
Map Reduce is used for the processing of data which is stored on HDFS. It writes distributed data across distributed applications which ensures efficient processing of large amounts of data. They process on large clusters and require commodity which is reliable and fault-tolerant. The core of Map-reduce can be three operations like mapping, collection of pairs, and shuffling the resulting data.
Conclusion – Hadoop Architecture
Hadoop is an open-source framework that helps in a fault-tolerant system. It can store large amounts of data and helps in storing reliable data. The two parts of storing data in HDFS and processing it through map-reduce help in working properly and efficiently. It has an architecture that helps in managing all blocks of data and also having the most recent copy by storing it in FSimage and edit logs. The replication factor also helps in having copies of data and getting them back whenever there is a failure. HDFS also moves removed files to the trash directory for optimal usage of space.
This has been a guide to Hadoop Architecture. Here we have discussed the architecture, map-reduce, placement of replicas, data replication. You can also go through our other suggested articles to learn more –