Introduction to Hadoop Ecosystem
Apache Hadoop is an open-source framework for reliably storing and processing very large amounts of data across many commodity computers. Its roots trace back to the 'Google File System' paper, published in October 2003. Doug Cutting, who was working at Yahoo at the time, named the project Hadoop after his son's toy elephant. The core of Apache Hadoop has two parts: the storage part, known as the Hadoop Distributed File System (HDFS), and the processing part, known as the MapReduce programming model. Hadoop splits a huge file into blocks and stores them on multiple nodes across the cluster.
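As a rough illustration (not Hadoop's actual implementation), splitting a large file into fixed-size blocks and spreading them across nodes can be sketched like this; the node names are made up, and only the 128 MB default block size comes from HDFS:

```python
# Illustrative sketch only: splitting a file into fixed-size blocks
# and assigning them to data nodes (node names are hypothetical).
BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size: 128 MB

def split_into_blocks(file_size_bytes, block_size=BLOCK_SIZE):
    """Return (block_index, block_length) pairs covering the file."""
    blocks = []
    offset = 0
    while offset < file_size_bytes:
        length = min(block_size, file_size_bytes - offset)
        blocks.append((len(blocks), length))
        offset += length
    return blocks

def assign_round_robin(blocks, nodes):
    """Assign each block to a node, cycling through the cluster."""
    return {idx: nodes[idx % len(nodes)] for idx, _ in blocks}

blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file
placement = assign_round_robin(blocks, ["node1", "node2", "node3"])
print(len(blocks))   # 3 blocks: 128 MB + 128 MB + 44 MB
print(placement)
```

Real HDFS placement is replica-aware and rack-aware; the round-robin here only conveys the idea of distributing blocks across the cluster.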
The concept of Hadoop Ecosystem
The Apache Hadoop framework mainly consists of the following modules:
- Hadoop Common: contains the libraries and utilities needed by the other Hadoop modules.
- Hadoop Distributed File System (HDFS): a distributed file system that stores huge amounts of data across commodity machines. It also delivers very high aggregate bandwidth across the cluster.
- Hadoop YARN: introduced in 2012 with Hadoop 2.0, it manages the compute resources of all the machines in the cluster and schedules users' applications onto them based on resource availability.
- Hadoop MapReduce: helps process large-scale data through the map-reduce programming methodology.
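The map-reduce methodology itself can be illustrated without Hadoop at all. Here is a minimal word-count sketch in plain Python: the map phase emits (word, 1) pairs, a shuffle step groups them by key as Hadoop does between map and reduce, and the reduce phase sums the counts.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values by key (Hadoop does this between phases).
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data big cluster", "big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
result = reduce_phase(shuffle(pairs))
print(result)  # {'big': 3, 'data': 2, 'cluster': 1}
```

In real Hadoop the map and reduce tasks run in parallel on different nodes and read their input from HDFS; this single-process version only shows the data flow.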
Because Apache Hadoop is open source and the hardware it runs on is commonly available, it helps reduce IT costs for storing and processing huge volumes of data.
Open Source Software + Commodity Hardware = IT Costs reduction
For example, suppose we receive 942,787 files and directories per day, requiring 4,077,936 blocks, for a total of 5,020,723 objects. With at least 1.46 PB of configured capacity, the distributed file system would use about 1.09 PB to handle that load, roughly 75% of the configured capacity, across 178 live nodes and 24 dead nodes.
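The arithmetic behind that example is easy to verify:

```python
# Verifying the capacity example above.
files_and_dirs = 942_787
blocks = 4_077_936
total_objects = files_and_dirs + blocks
print(total_objects)          # 5020723 objects in total

configured_pb = 1.46          # configured capacity in PB
used_pb = 1.09                # capacity actually used in PB
utilization = used_pb / configured_pb * 100
print(round(utilization, 2))  # about 74.66% of configured capacity
```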
The Hadoop ecosystem is mainly designed for storing and processing big data, which is usually described by the key characteristics below:
Volume stands for the size of the data that is stored and generated; the size of a data set largely determines whether it counts as big data.
Variety stands for the nature, structure, and type of the data being used.
Velocity stands for the speed at which data is generated and stored in a particular process flow.
Veracity signifies the quality of the captured data, which helps the analysis reach the intended result.
HDFS is mainly designed to store very large amounts of data (terabytes or petabytes) across a large number of machines in a cluster. It maintains some common characteristics: it provides data reliability, runs on commodity hardware, uses blocks to store a file or parts of a file, and follows a 'write once, read many' model.
HDFS follows the architecture below, built around the concepts of the Name Node and Data Nodes.
The responsibility of the Name Node (Master):
– manages the file system namespace
– maintains cluster configuration
– manages block replication
The responsibility of Data Node (Slaves):
– Store data in the local file system
– Periodically report back to the name node by means of heartbeat
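The heartbeat relationship between the Data Nodes and the Name Node can be sketched with a toy model; the class, node names, and 30-second timeout below are illustrative only (real HDFS defaults to a 3-second heartbeat and a much longer dead-node timeout):

```python
import time

class ToyNameNode:
    """Toy model: tracks the last heartbeat time of each data node."""
    def __init__(self, dead_after_seconds=30.0):
        self.last_heartbeat = {}
        self.dead_after = dead_after_seconds

    def receive_heartbeat(self, datanode_id, now=None):
        # A data node periodically reports in; record when we last saw it.
        self.last_heartbeat[datanode_id] = now if now is not None else time.time()

    def dead_nodes(self, now=None):
        # Any node silent for longer than the timeout is considered dead.
        now = now if now is not None else time.time()
        return [node for node, seen in self.last_heartbeat.items()
                if now - seen > self.dead_after]

nn = ToyNameNode(dead_after_seconds=30.0)
nn.receive_heartbeat("dn1", now=0.0)
nn.receive_heartbeat("dn2", now=0.0)
nn.receive_heartbeat("dn1", now=25.0)  # dn2 stops reporting
print(nn.dead_nodes(now=40.0))         # ['dn2'] -- only dn2 timed out
```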
HDFS Write Operation:
Hadoop follows the steps below to write any big file:
- On receiving a write request from an HDFS client, the Name Node creates the file and updates the FS image.
- The client gets the block locations (Data Node details) from the Name Node.
- The client writes the packets to the individual Data Nodes in a pipelined fashion.
- The Data Nodes acknowledge completion of each packet write, and the result is sent back to the Hadoop client.
HDFS Block Replication Pipeline:
- The client retrieves a list of Datanodes from the Namenode that will host a replica of that block
- The client then flushes the data block to the first Datanode
- The first Datanode receives a block, writes it and transfers it to the next data node in the pipeline
- When all replicas are written, the Client moves on to the next block in the file
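The pipeline steps above can be sketched as a chain of data nodes, each storing its replica and forwarding the block to the next node. This is a deliberate simplification of the real streaming protocol (HDFS streams packets and acknowledgments through the pipeline), and all names here are hypothetical:

```python
def write_block_pipeline(block, datanodes, storage):
    """Toy pipeline: each data node stores its replica, then the block
    is handed to the next node in the list (simplified, not streamed)."""
    for node in datanodes:
        storage.setdefault(node, []).append(block)  # node writes its replica
        # ...the loop then "forwards" the block to the next node.
    return storage

def write_file(blocks, pick_pipeline, storage):
    """Write each block through a pipeline chosen by the name node."""
    for block in blocks:
        pipeline = pick_pipeline(block)  # name node picks the replica hosts
        write_block_pipeline(block, pipeline, storage)
    return storage

# Replication factor 3: the name node returns three hosts per block.
pick = lambda block: ["dn1", "dn2", "dn3"]
storage = write_file(["block-0", "block-1"], pick, {})
print(storage)  # every block ends up on all three data nodes
```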
HDFS Fault Tolerance:
If a Data Node suddenly goes down, HDFS can manage that scenario automatically. The Name Node receives a periodic heartbeat from every Data Node; if the heartbeats from a Data Node stop arriving, the Name Node considers that node dead and immediately re-replicates all of its blocks onto the remaining nodes to satisfy the replication factor.
If the Name Node detects a new Data Node in the cluster, blocks can be rebalanced across the cluster so that the added node takes its share of the load.
If the Name Node itself is lost or fails, a backup node holding an FS image of the Name Node can replay the FS operations and bring a Name Node back up. But in that case manual intervention is required, and the entire Hadoop framework is down for some time while the new Name Node is set up. The Name Node is therefore a single point of failure. To avoid this scenario, HDFS High Availability runs a standby Name Node, and ZooKeeper can coordinate automatic failover to that standby; HDFS Federation additionally allows multiple Name Nodes, each managing a part of the namespace.
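Re-replication after a node failure can be illustrated in the same toy style: every block the dead node hosted is copied to other live nodes until the replication factor is satisfied again. The placement below is naive (real HDFS placement is rack-aware), and all node names are made up:

```python
def rereplicate(block_map, dead_node, live_nodes, replication=3):
    """block_map: {block: set of nodes holding a replica}.
    Drop the dead node, then top each block back up to `replication`
    replicas using live nodes (naive placement, not rack-aware)."""
    for block, holders in block_map.items():
        holders.discard(dead_node)  # the dead node's replica is gone
        target = min(replication, len(live_nodes))
        for node in live_nodes:
            if len(holders) >= target:
                break
            holders.add(node)       # copy the block to another live node
    return block_map

blocks = {
    "b1": {"dn1", "dn2", "dn3"},
    "b2": {"dn2", "dn3", "dn4"},
}
live = ["dn1", "dn3", "dn4"]  # dn2 has died
rereplicate(blocks, "dn2", live)
print(blocks)  # every block has 3 replicas again, none on dn2
```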
Examples of Hadoop Ecosystem
A full Hadoop ecosystem example can be explained with the figure below:
Data can come from any kind of source: a data warehouse, a managed document repository, file shares, normal RDBMS databases, or cloud and external sources. All that data arrives in HDFS in structured, semi-structured, or unstructured form. HDFS then stores it in a distributed way, spreading it across the commodity machines of the cluster.
The Hadoop ecosystem is mainly designed for storing and processing huge data sets that exhibit at least two of the three factors: volume, velocity, and variety. It stores data in a distributed file system that runs on commodity hardware. Considering the full Hadoop ecosystem process, HDFS distributes the data blocks and MapReduce provides the programming framework to read and process data from files stored in HDFS.
This has been a guide to the Hadoop Ecosystem. Here we have discussed the basic concepts of the Hadoop ecosystem, its architecture, HDFS operations, HDFS fault tolerance, and examples. You may also look at the following articles to learn more –