Introduction to Hadoop Components
The main Hadoop components are HDFS, MapReduce, and YARN. We live in a digital age where data is produced at a very high rate: by common estimates, about 2.5 quintillion bytes per day. Although the storage capacity of disks keeps increasing, the seek rate has not increased correspondingly for this volume of data. To overcome this, we need to read the data in parallel. Hadoop achieves this with HDFS (Hadoop Distributed File System), which stores datasets as blocks (for more details, refer to the HDFS section) so that data can be read in parallel and processed at a higher rate. Data is processed to extract or forecast meaningful information, or to find trends and patterns. MapReduce is the processing model used to obtain the desired information; Map and Reduce are its two phases.
Major Components of Hadoop
The major components of Hadoop are described below:
1. Hadoop Distributed File System (HDFS)
HDFS is the storage layer for Big Data. It runs on a cluster of many machines, and the data stored in it can be processed using Hadoop. Once data is pushed to HDFS, we can process it at any time; the data resides in HDFS until we delete the files manually. HDFS stores data as blocks. The default block size is 128 MB in Hadoop 2.x; in 1.x it was 64 MB. HDFS also replicates the blocks: if data were stored on only one machine, it would be lost when that machine failed, so each block is copied to different machines. The replication factor is 3 by default; we can change it in hdfs-site.xml (the dfs.replication property) or with the command hadoop fs -setrep -w 3 /dir. Replication keeps the blocks on different machines for high availability.
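The block and replication arithmetic above can be sketched in a few lines of Python. This is only an illustration, not Hadoop code: the function name block_layout and the 300 MB sample file are assumptions made for the example, while the 128 MB block size and replication factor 3 are the defaults mentioned in the text.

```python
MB = 1024 * 1024

def block_layout(file_size_bytes, block_size_bytes=128 * MB, replication=3):
    """Return (number of blocks, size of the last block, total raw bytes stored).

    Hypothetical helper for illustration; it only does the arithmetic.
    """
    if file_size_bytes == 0:
        return 0, 0, 0
    # Ceiling division: a partially filled block still counts as a block.
    num_blocks = -(-file_size_bytes // block_size_bytes)
    # The last block holds only the remaining bytes; it is not padded.
    last_block = file_size_bytes - (num_blocks - 1) * block_size_bytes
    # Every byte is stored once per replica.
    raw_bytes = file_size_bytes * replication
    return num_blocks, last_block, raw_bytes

blocks, last, raw = block_layout(300 * MB)
print(blocks, last // MB, raw // MB)  # 3 44 900
```

So a 300 MB file would occupy three blocks (128 MB, 128 MB, and 44 MB) and, with three replicas, about 900 MB of raw disk space across the cluster.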
HDFS has a master-slave architecture, with the NameNode as the master and the DataNodes as slaves. The NameNode is the machine that stores the metadata for all the blocks stored on the DataNodes.
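To make the NameNode's role more concrete, here is a toy Python model of block metadata. It is a sketch under stated assumptions: the functions place_blocks and surviving_replicas, the node names, and the round-robin placement are all invented for illustration. Real HDFS uses a rack-aware placement policy, which is not modelled here.

```python
def place_blocks(block_ids, datanodes, replication=3):
    """Build a NameNode-style map: block id -> list of DataNodes holding a replica.

    Round-robin placement is a simplification chosen for this example.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    metadata = {}
    for i, block in enumerate(block_ids):
        metadata[block] = [
            datanodes[(i + r) % len(datanodes)] for r in range(replication)
        ]
    return metadata

def surviving_replicas(metadata, failed_node):
    """Return the replica locations that remain after one DataNode fails."""
    return {
        block: [node for node in nodes if node != failed_node]
        for block, nodes in metadata.items()
    }

metadata = place_blocks(["blk_0", "blk_1", "blk_2"], ["dn1", "dn2", "dn3", "dn4"])
after_failure = surviving_replicas(metadata, "dn3")
print(after_failure["blk_0"])  # ['dn1', 'dn2']
```

In this model, losing dn3 still leaves two replicas of every block, which is the point of keeping the copies on different machines.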
The Hadoop ecosystem is a cost-effective, scalable, and flexible way of working with such large datasets. Hadoop is a framework that uses a particular programming model, called MapReduce, to break computation into tasks that can be distributed across a cluster of commodity machines using the Hadoop Distributed File System (HDFS).
2. MapReduce
MapReduce consists of two different tasks, Map and Reduce; the Map phase precedes the Reduce phase. As the name suggests, the Map phase maps the data into key-value pairs, since Hadoop uses key-value pairs for processing. The Reduce phase is where the actual logic is implemented. Between these two phases, the framework also performs a shuffle and sort phase.
The Mapper is the class where the input file is converted into key-value pairs for further processing. The data is also read as key-value pairs: with the default text input format, the key is the byte offset of the line within the file and the value is the entire record.
E.g. suppose we have a file Diary.txt containing two lines, i.e. two records:
This is a wonderful day
we should enjoy
Here, the offset of ‘T’ is 0 and the offset of ‘w’ is 24 (white spaces and the newline character are also counted as characters), so the mapper reads the data as (key, value) pairs: (0, This is a wonderful day), (24, we should enjoy).
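The offset keys can be checked with a short Python sketch. It mimics, in simplified form, how a line-oriented input format pairs each line with the byte offset at which it starts; the function name records_with_offsets is hypothetical, and the sample content assumes Diary.txt uses single-byte characters and Unix newlines. Counting every byte of the first line, including its trailing newline, gives the key of the second record.

```python
def records_with_offsets(data: bytes):
    """Return (byte offset, line text) pairs, one per line of the input."""
    offset = 0
    records = []
    for line in data.splitlines(keepends=True):
        # The key is where the line starts; the value drops the line ending.
        records.append((offset, line.rstrip(b"\r\n").decode("utf-8")))
        # Advance by the full line length, newline included.
        offset += len(line)
    return records

diary = b"This is a wonderful day\nwe should enjoy\n"
print(records_with_offsets(diary))
# [(0, 'This is a wonderful day'), (24, 'we should enjoy')]
```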
The Reducer is the class that accepts the keys and values output by the mapper phase as its input for further processing. A reducer accepts data from multiple mappers and aggregates this intermediate data into a smaller number of keys and values, which form the final output. We will see this in the example.
Apart from the mapper and reducer classes, we need one more class: the Driver class. This code is necessary for MapReduce because it is the bridge between the framework and the implemented logic. It specifies the configuration, the input data path, the output storage path, and, most importantly, which mapper and reducer classes to use. Many other settings can also be configured in this class; e.g. in the driver class we can specify the separator for the output file, as shown in the driver class of the example below.
3. Yet Another Resource Negotiator (YARN)
YARN was introduced in Hadoop 2.x. Prior to that, Hadoop used a JobTracker for resource management. The JobTracker was the master, with TaskTrackers as the slaves. The JobTracker took care of scheduling jobs and allocating resources. The TaskTrackers ran the Map and Reduce tasks and periodically reported their status to the JobTracker. This type of resource manager had a scalability limit, and concurrent execution of tasks was also limited. YARN addresses these issues and takes care of resource allocation and job scheduling on a cluster. Executing a MapReduce job needs cluster resources, and YARN allocates them to the job. YARN determines which job runs and on which machine. It has information about the available cores and memory in the cluster and tracks memory consumption. It also interacts with the NameNode to find out where the data resides when deciding on resource allocation.
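The idea of matching task requests against the free memory and cores of each machine can be illustrated with a toy allocator in Python. This is not how YARN's schedulers actually work: the first-fit strategy, the function name allocate, and the node and task names are all assumptions chosen to keep the sketch small.

```python
def allocate(requests, nodes):
    """First-fit placement of task requests onto nodes.

    requests: list of (task name, memory in MB, cores)
    nodes: dict of node name -> {"memory_mb": int, "vcores": int}
    Returns (placements, pending tasks, remaining free capacity).
    """
    # Work on a copy so the caller's view of the cluster is not modified.
    free = {name: dict(capacity) for name, capacity in nodes.items()}
    placements, pending = {}, []
    for task, memory_mb, vcores in requests:
        for name, capacity in free.items():
            if capacity["memory_mb"] >= memory_mb and capacity["vcores"] >= vcores:
                capacity["memory_mb"] -= memory_mb
                capacity["vcores"] -= vcores
                placements[task] = name
                break
        else:
            # No node has enough free resources; the task has to wait.
            pending.append(task)
    return placements, pending, free

cluster = {
    "node1": {"memory_mb": 4096, "vcores": 2},
    "node2": {"memory_mb": 2048, "vcores": 2},
}
tasks = [("map-1", 2048, 1), ("map-2", 2048, 1), ("reduce-1", 2048, 1), ("map-3", 1024, 1)]
placements, pending, free = allocate(tasks, cluster)
print(placements)  # {'map-1': 'node1', 'map-2': 'node1', 'reduce-1': 'node2'}
print(pending)     # ['map-3']
```

Here map-3 stays pending even though node2 still has a free core, because no node has 1024 MB of memory left. Tracking both resources per machine is what lets a resource manager decide where a task can run.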
Consider a dataset from a travel agency, from which we need to calculate how many people chose to travel to a particular destination. To achieve this, we take the destination as the key and, for counting, the value 1. So, in the mapper phase, we map each destination to the value 1. In the shuffle and sort phase after the mapper, all values belonging to a particular key are grouped together. E.g. if the destination MAA occurs twice, each occurrence is mapped to 1, and after shuffling and sorting we get MAA,(1,1), where (1,1) is the value. In the reducer phase, the implemented logic adds up the values to get the total count of tickets booked for the destination. This is the flow of MapReduce.
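The flow described above can be simulated in plain Python, without Hadoop, to see what each phase produces. This is a sketch under assumptions: the record format "passenger,destination", the passenger names, and the second destination DEL are invented for the example, and the function names simply mirror the phases.

```python
from collections import defaultdict

def mapper(record):
    """Map phase: emit (destination, 1) for one booking record."""
    _, destination = record.split(",")
    yield destination.strip(), 1

def shuffle_and_sort(pairs):
    """Shuffle and sort phase: group all values by key, ordered by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reducer(key, values):
    """Reduce phase: add up the 1s to get the bookings per destination."""
    return key, sum(values)

def run_job(records):
    mapped = [pair for record in records for pair in mapper(record)]
    return [reducer(key, values) for key, values in shuffle_and_sort(mapped)]

bookings = ["Asha,MAA", "Ravi,DEL", "Meena,MAA"]
print(shuffle_and_sort(pair for record in bookings for pair in mapper(record)))
# [('DEL', [1]), ('MAA', [1, 1])]
print(run_job(bookings))
# [('DEL', 1), ('MAA', 2)]
```

The intermediate output shows MAA paired with the values (1, 1), and the reducer turns that into a count of 2.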
Below are the screenshots of the program implemented for the above example.
1. Driver Class
2. Mapper Class
3. Reducer Class
Executing the Hadoop Job
To execute the job, we first need to build the jar; then we can run it with the command below:
hadoop jar example.jar /input.txt /output.txt
Here we have discussed the core components of Hadoop: HDFS, MapReduce, and YARN. Hadoop is a distributed cluster computing framework that helps store and process data and perform the required analysis on the captured data. It is flexible, reliable in terms of data because data is replicated, and scalable, i.e. we can add more machines to the cluster for storing and processing data.