Introduction to Hadoop Architecture
Hadoop is an open source framework that makes it easy to process large datasets. It helps in building applications that process huge volumes of data at high speed. It uses distributed computing concepts, where data is spread across the different nodes of a cluster. Applications built with Hadoop run on commodity computers, which are easily available in the market at low cost, so greater computational power is achieved inexpensively. All the data in Hadoop resides on HDFS, the Hadoop Distributed File System, instead of a local file system. The model is based on data locality: the computational logic is sent to the cluster nodes that already hold the data, rather than moving large volumes of data to the computation.
The basic idea of this architecture is that storage and processing are handled by two parts: processing is done with the MapReduce programming model, and storage is done on HDFS. Hadoop has a master-slave architecture for both storage and data processing. The master node for data storage is the NameNode, and there is also a master node that monitors and schedules parallel data processing with Hadoop MapReduce (the JobTracker in classic MapReduce). The slaves are the other machines in the Hadoop cluster, which store the data and perform the computations. Each slave node runs a DataNode for storage and a TaskTracker for processing, which execute the tasks and keep them synchronized with the masters. This type of system can be set up either on the cloud or on-premise. The NameNode is a single point of failure when it is not running in high availability mode, so the Hadoop architecture also provides for a standby NameNode to safeguard the system from failures. In addition, a secondary NameNode periodically checkpoints the file system metadata, although it is not a hot standby that automatically takes over when the primary NameNode is down.
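As a rough illustration of this split between a metadata master and worker nodes, the following sketch shows how a client application addresses the NameNode through the Hadoop FileSystem API; the host name, port, and path are hypothetical placeholders rather than values from any particular cluster.

    // Minimal client sketch: the NameNode address and path below are hypothetical.
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsClientSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // fs.defaultFS tells the client where the NameNode (the storage master) listens.
            conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

            try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:9000"), conf)) {
                // The NameNode serves only metadata; file contents are read from DataNodes.
                boolean exists = fs.exists(new Path("/data/input"));
                System.out.println("Path exists: " + exists);
            }
        }
    }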
FSimage and Edit Log
FSimage and the edit log ensure persistence of the file system metadata: the NameNode stores its metadata in these two files. The job of the FSimage is to keep a complete snapshot of the file system namespace at a given point in time. The changes that are constantly being made to the file system also have to be recorded, so incremental changes such as renaming a file or appending to it are written to the edit log. Rather than creating a new FSimage for every change, which would be expensive, the framework records each change in the edit log and periodically merges the log into a new FSimage during a checkpoint. If the NameNode fails, it can restore its previous state by loading the latest FSimage and replaying the edit log. The secondary NameNode performs this checkpointing: it fetches the FSimage and edit log, merges them, and keeps its copy updated whenever there are changes. Thus, even if the NameNode goes down, an up-to-date copy of the metadata is available and the namespace is not lost. It also means the NameNode does not have to replay a long edit log when it restarts, since the checkpoint has already been prepared.
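To make the interplay concrete, here is a toy sketch of the recovery idea; it deliberately uses plain Java collections rather than Hadoop's actual classes, and the paths and operations are invented for illustration: load the last FSimage snapshot, then replay the edit-log records on top of it to reach the latest state.

    // Toy sketch (not Hadoop code): FSimage snapshot + edit-log replay.
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class MetadataRecoverySketch {

        // The "FSimage": a full snapshot of file metadata as of the last checkpoint.
        static Map<String, Long> loadFsImage() {
            Map<String, Long> namespace = new HashMap<>();
            namespace.put("/logs/app.log", 1024L); // path -> file size at checkpoint time
            return namespace;
        }

        // The "edit log": incremental changes recorded since that checkpoint.
        static List<String[]> loadEditLog() {
            List<String[]> edits = new ArrayList<>();
            edits.add(new String[] {"APPEND", "/logs/app.log", "512"});
            edits.add(new String[] {"RENAME", "/logs/app.log", "/logs/app.old"});
            return edits;
        }

        public static void main(String[] args) {
            Map<String, Long> namespace = loadFsImage();
            // Replaying the edit log brings the snapshot up to the most recent state.
            for (String[] edit : loadEditLog()) {
                if ("APPEND".equals(edit[0])) {
                    namespace.merge(edit[1], Long.parseLong(edit[2]), Long::sum);
                } else if ("RENAME".equals(edit[0])) {
                    namespace.put(edit[2], namespace.remove(edit[1]));
                }
            }
            System.out.println(namespace); // prints {/logs/app.old=1536}
        }
    }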
HDFS is designed to store data reliably and serve it quickly. It stores data across machines in large clusters. All files are stored as a series of blocks, and these blocks are replicated for fault tolerance. The block size and replication factor can be configured as per the user requirements. By default, the replication factor is 3. The replication factor can be specified at the time of file creation and changed later. All decisions regarding these replicas are made by the NameNode. Each DataNode sends a heartbeat and a block report to the NameNode at regular intervals. The receipt of a heartbeat implies that the DataNode is working properly, and the block report lists all the blocks present on that DataNode.
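As a brief illustration, the sketch below uses the Hadoop FileSystem API to pick a replication factor and block size when creating a file and to change the replication factor afterwards; the path and the chosen values are hypothetical, and in practice the defaults normally come from the cluster configuration.

    // Sketch: per-file replication factor and block size; path and values are hypothetical.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/data/events.txt");

            // Specify a replication factor of 2 and a 128 MB block size at creation time.
            try (FSDataOutputStream out =
                     fs.create(file, true, 4096, (short) 2, 128L * 1024 * 1024)) {
                out.writeUTF("sample record");
            }

            // The replication factor can also be changed after the file exists.
            fs.setReplication(file, (short) 3);
        }
    }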
Placement of Replicas
The placement of replicas is a very important task in Hadoop for reliability and performance. The replicas of a data block are spread across different racks, and the placement can be tuned for reliability, availability and network bandwidth utilization. The cluster of computers can be spread across different racks. Under the default policy, no more than two replicas of a block are placed on the same rack, and the third replica is placed on a different rack to ensure better reliability of the data. Nodes on different racks communicate through switches, and the NameNode keeps the rack id of each DataNode. Spreading replicas across racks prevents loss of data when an entire rack fails and allows bandwidth from multiple racks to be used when reading. At the same time, keeping two of the replicas on one rack cuts the inter-rack write traffic and improves performance, and the chance of a rack failure is much smaller than that of a node failure. It also reduces the aggregate network bandwidth used when data is being read, since a block is placed on only two unique racks rather than three.
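The effect of this placement policy can be observed from a client. The sketch below lists, for a hypothetical file, which rack and host each block replica landed on; the path is an assumption, and the topology strings depend on how rack awareness is configured on the cluster.

    // Sketch: inspect block replica locations; the file path is hypothetical.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicaPlacementSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/data/events.txt"));

            // One BlockLocation per block, covering the whole file.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("Block at offset " + block.getOffset());
                for (String location : block.getTopologyPaths()) {
                    System.out.println("  replica on " + location); // e.g. /default-rack/datanode1
                }
            }
        }
    }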
MapReduce is used for processing the data that is stored on HDFS. It lets developers write applications that process large amounts of data in parallel across the distributed cluster, running on large clusters of commodity hardware in a reliable, fault-tolerant manner. At its core, MapReduce works in three phases: a map phase that produces intermediate key-value pairs, a shuffle and sort phase that groups those pairs by key, and a reduce phase that aggregates the grouped results.
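The classic word count job illustrates these phases. The sketch below follows the standard Hadoop MapReduce API; the input and output paths are supplied on the command line and are assumptions, not values from this article.

    // Word count sketch: map emits (word, 1), the shuffle groups by word, reduce sums the counts.
    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: emit (word, 1) for every word in the input split.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: sum the counts collected for each word after the shuffle.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // optional local aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged into a jar and submitted with the hadoop command, such a job writes one line per word with its total count to the output directory.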
Conclusion – Hadoop Architecture
Hadoop is an open source framework that provides a fault-tolerant system for storing and processing large amounts of data reliably. Its two parts, storing data in HDFS and processing it through MapReduce, work together properly and efficiently. The architecture manages all blocks of data and keeps the most recent copy of the metadata by storing it in the FSimage and edit logs. The replication factor ensures there are copies of the data that can be recovered whenever there is a failure, and HDFS also moves deleted files to a trash directory before permanently removing them, so space can be reclaimed safely.
This has been a guide to Hadoop Architecture. Here we have discussed the architecture, MapReduce, the placement of replicas, and data replication. You can also go through our other suggested articles to learn more.