Introduction to Hadoop Framework
Before we dive deep into the technical framework of Hadoop, let us start with a simple example.
A farm harvests tomatoes and stores them in a single storage area. As the demand for vegetables grew, the farm began to harvest potatoes and carrots as well, and it hired more farmers to keep up with the work. After some time the single storage area ran short of space, so the vegetables were distributed across several storage areas. When it comes to retrieving the vegetables, the farmers all work in parallel, each with their own storage space.
So how is this story related to big data?
Earlier we had limited data, handled by a single processor and one storage unit. Then the generation of data increased, leading to high volume and many varieties: structured, semi-structured and unstructured. The solution was to give each processor its own distributed storage, which made it easier to store and access the data.
So we can now think of the vegetables as the different kinds of data, the storage areas as the distributed places where the data is stored, and the farmers as the individual processors.
So big data is the challenge and Hadoop plays the part of the solution.
1. A solution for big data: it deals with the complexities of high volume, velocity, and variety of data.
2. A set of open-source projects.
3. Stores a huge volume of data reliably and allows huge distributed computations.
4. Hadoop’s key attributes are redundancy and reliability (data is replicated, so the failure of a single machine does not cause data loss).
5. Primarily focuses on batch processing.
6. Runs on commodity hardware – you don’t need to buy any special expensive hardware.
Hadoop Framework:
1. Common utilities
2. HDFS
3. Map Reduce
4. YARN Framework
1. Common Utilities:
Also called Hadoop Common. These are the Java libraries, files, scripts, and utilities that the other Hadoop components require in order to run.
2. HDFS: Hadoop Distributed File System
Why did Hadoop choose to incorporate a distributed file system?
Let’s understand this with an example. We need to read 1 TB of data and we have one machine with 4 I/O channels, each channel reading 100 MB/s. Together the channels read 400 MB/s, so reading the entire data takes about 42 minutes. Now the same amount of data is read by 10 machines, each with 4 I/O channels at 100 MB/s per channel. Guess how long it takes to read the data? About 4.2 minutes, one tenth of the time.
HDFS solves the problem of storing big data. The two main components of HDFS are the name node and the data node.
The name node is the master. It maintains and manages the data nodes by storing metadata, such as which data node holds which block of a file. A secondary name node may run as well. Despite its name it is not a hot standby for the primary name node: it periodically merges the name node's edit log into a checkpoint of the metadata, which helps recovery if the primary name node stops working.
The data node is the slave, which typically runs on low-cost commodity hardware. We can have multiple data nodes, and they store the actual data.
HDFS replicates each block of data across several data nodes according to a replication factor (3 by default). If one data node goes down, the data can still be read from another data node that holds a replica. This improves the accessibility of the data and protects against data loss.
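The read-time arithmetic in the example above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes 1 TB = 1,000,000 MB, that every channel reads in parallel at its full rate, and that there is no seek, network, or coordination overhead. The function name `read_time_minutes` is made up for this illustration.

```python
# Back-of-the-envelope estimate; ignores seek time, network and coordination overhead.

def read_time_minutes(data_mb, machines, channels_per_machine, mb_per_s_per_channel):
    """Minutes needed to read data_mb megabytes when all channels read in parallel."""
    total_throughput = machines * channels_per_machine * mb_per_s_per_channel  # MB/s
    return data_mb / total_throughput / 60

ONE_TB_IN_MB = 1_000_000  # assumption: decimal terabyte

single = read_time_minutes(ONE_TB_IN_MB, machines=1,
                           channels_per_machine=4, mb_per_s_per_channel=100)
cluster = read_time_minutes(ONE_TB_IN_MB, machines=10,
                            channels_per_machine=4, mb_per_s_per_channel=100)

print(f"1 machine:   {single:.1f} minutes")   # 41.7 minutes
print(f"10 machines: {cluster:.1f} minutes")  # 4.2 minutes
```

Under these idealized assumptions the read time shrinks in direct proportion to the number of machines, which is the point of spreading the data over many nodes. Real clusters will be somewhat slower because of the overheads ignored here.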
3. Map Reduce:
It solves the problem of processing big data. Let’s understand the concept of Map Reduce by solving a real-world problem. ABC company wants to calculate its total sales, city wise. A single in-memory hash table on one machine won’t work here because the data runs to terabytes, so we will use the Map Reduce concept.
There are two phases: a) MAP. b) REDUCE
a) Map: First, we split the data into smaller chunks, and each chunk is handed to a mapper. The mappers emit key/value pairs. Here the key is the city name and the value is the sales amount. Each mapper gets one month’s data and outputs each city name with its corresponding sales.
b) Reduce: The reducers receive these piles of key/value pairs, and each reducer is responsible for one group of cities, for example North, West, East, or South. The work of the reducer is to collect the small amounts for a particular city and combine them into a single total by adding them up.
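The two phases can be imitated on one machine in plain Python. This is a minimal sketch of the idea, not Hadoop's actual API: the sample figures and the helper names `map_chunk`, `shuffle`, and `reduce_city` are invented for illustration. For simplicity it groups by city rather than assigning cities to regional reducers.

```python
from collections import defaultdict

# Hypothetical sample data: each inner list stands for one month's chunk of records.
monthly_chunks = [
    [("Delhi", 100), ("Mumbai", 250), ("Delhi", 50)],
    [("Mumbai", 300), ("Chennai", 120)],
    [("Delhi", 75), ("Chennai", 80)],
]

def map_chunk(chunk):
    """Map phase: emit one (city, sales) key/value pair per record in the chunk."""
    return [(city, sales) for city, sales in chunk]

def shuffle(mapped_outputs):
    """Group the values from all mappers by key, so each city's values end up together."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for city, sales in pairs:
            grouped[city].append(sales)
    return grouped

def reduce_city(city, sales_values):
    """Reduce phase: add up all sales values for one city."""
    return city, sum(sales_values)

mapped = [map_chunk(chunk) for chunk in monthly_chunks]   # one mapper per chunk
grouped = shuffle(mapped)
totals = dict(reduce_city(city, values) for city, values in grouped.items())

print(totals)  # {'Delhi': 225, 'Mumbai': 550, 'Chennai': 200}
```

In a real cluster the mappers and reducers run on different machines and the grouping step moves data over the network. Here everything runs in one process so that only the data flow is visible.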
4. YARN Framework: Yet Another Resource Negotiator
The initial version of Hadoop had just two components: Map Reduce and HDFS. Later it was realized that Map Reduce couldn’t solve a lot of big data problems. The idea was to take the resource management and job scheduling responsibilities away from the old map-reduce engine and give it to a new component. So this is how YARN came into the picture. It is the middle layer between HDFS and Map Reduce which is responsible for managing cluster resources.
It has two key roles to perform: a) Job Scheduling. b) Resource management
a) Job scheduling: When a large amount of data is given for processing, the work needs to be broken down and distributed as different tasks/jobs. The job scheduler decides which job gets top priority, the time interval between two jobs, and the dependencies among the jobs, and it checks that the running jobs do not overlap.
b) Resource Management: We need resources both for processing the data and for storing it. The resource manager provides, manages and maintains the resources used to store and process the data.
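To make the two roles concrete, here is a toy scheduler in Python. It is only a sketch of the general idea (priorities, dependencies, and a fixed pool of resources) and does not reproduce YARN's real algorithms or API. The `schedule` function, the job names, and the notion of "slots" are all assumptions made for this illustration. A lower priority number means the job is considered first.

```python
import heapq

def schedule(jobs, total_slots):
    """Plan jobs in rounds; return a list of rounds, each a list of job names.

    jobs maps a job name to a dict with:
      "priority": int, lower numbers are considered first
      "slots":    int, resources the job needs while it runs
      "deps":     set of job names that must finish before this job starts
    """
    for name, spec in jobs.items():
        if spec["slots"] > total_slots:
            raise ValueError(f"job {name!r} needs more slots than the cluster has")

    done = set()
    pending = dict(jobs)
    rounds = []
    while pending:
        # A job is ready once all of its dependencies have finished.
        ready = [(spec["priority"], name)
                 for name, spec in pending.items() if spec["deps"] <= done]
        if not ready:
            raise ValueError("circular or unsatisfiable dependencies")
        heapq.heapify(ready)

        free = total_slots
        current = []
        while ready:
            _, name = heapq.heappop(ready)       # highest-priority ready job
            if pending[name]["slots"] <= free:   # resource check
                free -= pending[name]["slots"]
                current.append(name)

        for name in current:
            del pending[name]
        done.update(current)
        rounds.append(current)
    return rounds

# Hypothetical workload on a cluster with 4 slots.
jobs = {
    "load":   {"priority": 1, "slots": 2, "deps": set()},
    "clean":  {"priority": 2, "slots": 2, "deps": {"load"}},
    "report": {"priority": 1, "slots": 3, "deps": {"clean"}},
    "audit":  {"priority": 3, "slots": 2, "deps": set()},
}
plan = schedule(jobs, total_slots=4)
print(plan)  # [['load', 'audit'], ['clean'], ['report']]
```

The scheduling part is the ordering by priority and dependency. The resource management part is the check that the jobs in one round never need more slots than the cluster provides.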
So now we are clear about the concept of Hadoop and how it solves the challenges created by big data.
This has been a guide to the Hadoop framework. Here we have also discussed the four main components of Hadoop.