Introduction to Hadoop Framework
The Hadoop framework is a popular open-source big data framework used to process large volumes of unstructured, semi-structured, and structured data for analytics. It is released under the Apache License. The framework covers two main concerns: storing data and processing it. It includes the Hadoop Distributed File System (HDFS) for storage, Map-Reduce for computation, YARN for resource management and job scheduling, and a set of common utilities that support the other components and help manage Hadoop clusters. Hadoop is implemented mainly in Java. It is designed for batch processing and can run on commodity hardware.
1. Solution for big data: it deals with the complexities of high-volume, high-velocity, and highly varied data.
2. It is an open-source project.
3. Stores huge volumes of data reliably and allows massive distributed computations.
4. Hadoop’s key attributes are redundancy and reliability: data is replicated so that the failure of a single node does not cause data loss.
5. Primarily focuses on batch processing.
6. Runs on commodity hardware – you don’t need to buy any special expensive hardware.
The Hadoop framework has four main components:
1. Common Utilities
2. HDFS
3. Map Reduce
4. YARN Framework
1. Common Utilities
Also called Hadoop Common. These are the Java libraries, files, scripts, and utilities that the other Hadoop components require in order to run.
2. HDFS: Hadoop Distributed File System
Why has Hadoop chosen to incorporate a Distributed file system?
Let’s understand this with an example. We need to read 1 TB of data, and we have one machine with 4 I/O channels, each delivering 100 MB/s. At a combined 400 MB/s, reading the entire data set takes about 43.7 minutes (taking 1 TB as 1,048,576 MB). Now suppose the same data is spread across 10 machines, each with 4 I/O channels of 100 MB/s, all reading in parallel. Guess how long the read takes now? About 4.4 minutes.
HDFS solves the problem of storing big data. Its two main components are the name node and the data nodes. The name node is the master: it maintains and manages the data nodes by storing metadata, such as which data node holds which piece of data. A cluster may also run a secondary name node, which periodically checkpoints the name node’s metadata; it is not an automatic standby, although its checkpoint can help in recovering from a name node failure. The data nodes are the workers (slaves), typically low-cost commodity machines, and there can be many of them. They store the actual data. HDFS keeps several copies of the data according to a configurable replication factor, so if one data node goes down, the data can still be read from another data node that holds a replica. This improves the availability of the data and guards against loss of information.
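Two parts of the HDFS description above can be sketched in a few lines of Python. The first part checks the read-time arithmetic, assuming 1 TB means 1024 × 1024 MB (the text does not say which convention it uses). The second part is a toy model of how name node metadata lets a read survive a data node failure. The block IDs, node names, and the `block_locations` dictionary are hypothetical and are not part of any Hadoop API; three replicas per block is used because 3 is the default HDFS replication factor.

```python
# Part 1: read-time arithmetic from the example.
# Assumption: 1 TB is taken as 1024 * 1024 MB.
TOTAL_MB = 1024 * 1024
CHANNELS_PER_MACHINE = 4
MB_PER_SECOND_PER_CHANNEL = 100


def read_minutes(machines):
    """Minutes to read TOTAL_MB when the data is spread evenly over machines."""
    throughput = machines * CHANNELS_PER_MACHINE * MB_PER_SECOND_PER_CHANNEL
    return TOTAL_MB / throughput / 60


# Part 2: a toy model of name node metadata and replication.
# block_locations is a hypothetical stand-in for the name node's record of
# which data nodes hold a copy of each block.
block_locations = {
    "block-1": ["datanode-a", "datanode-b", "datanode-c"],
    "block-2": ["datanode-b", "datanode-c", "datanode-d"],
}


def readable_replicas(block_id, failed_nodes):
    """Return the data nodes that can still serve the given block."""
    return [n for n in block_locations[block_id] if n not in failed_nodes]


if __name__ == "__main__":
    print(f"1 machine:   {read_minutes(1):.1f} minutes")
    print(f"10 machines: {read_minutes(10):.1f} minutes")
    # datanode-a is down, but two replicas of block-1 remain readable.
    print(readable_replicas("block-1", {"datanode-a"}))
```

Under the stated assumption, one machine needs about 43.7 minutes and ten machines need about 4.4 minutes, so the speed-up comes entirely from reading in parallel.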
3. Map Reduce
It solves the problem of processing big data. Let’s understand the concept of Map-Reduce by working through a real-world problem. ABC company wants to calculate its total sales, city by city. A simple hash table on a single machine won’t work because the data runs to terabytes, so we use the Map-Reduce idea.
There are two phases:
a) Map: First, the data is split into smaller chunks, and each chunk is handed to a mapper, which emits key/value pairs. Here the key is the city name and the value is a sales amount. Each mapper gets one month’s data and outputs each city name with its corresponding sales.
b) Reduce: The reducers receive these piles of key/value pairs, with each reducer responsible for a group of cities, for example the North, West, East, or South cities. A reducer’s job is to collect the small amounts for a particular city and combine them into one total by adding them up.
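The two phases above can be imitated on a single machine in plain Python. This is only a minimal sketch of the idea, not Hadoop's MapReduce API: the sample records, the function names, and the explicit `shuffle` step (the grouping by key that the framework would do between the two phases) are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical sample records: (month, city, sales). The values are made up
# purely for illustration.
RECORDS = [
    ("Jan", "Delhi", 100),
    ("Jan", "Mumbai", 250),
    ("Feb", "Delhi", 150),
    ("Feb", "Chennai", 75),
    ("Mar", "Mumbai", 50),
]


def map_phase(chunk):
    """Mapper: emit one (city, sales) key/value pair per input record."""
    return [(city, sales) for _month, city, sales in chunk]


def shuffle(pairs):
    """Group every emitted value under its key."""
    grouped = defaultdict(list)
    for city, sales in pairs:
        grouped[city].append(sales)
    return grouped


def reduce_phase(grouped):
    """Reducer: add up the values collected for each city."""
    return {city: sum(values) for city, values in grouped.items()}


def total_sales_by_city(records):
    # Split the input by month so that each "mapper" sees one month's data.
    chunks = defaultdict(list)
    for record in records:
        chunks[record[0]].append(record)
    pairs = []
    for chunk in chunks.values():
        pairs.extend(map_phase(chunk))
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    print(total_sales_by_city(RECORDS))
```

In a real cluster the mappers and reducers would run on different machines. Here they run one after another, which is enough to show how the key/value pairs flow from one phase to the next.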
4. YARN Framework: Yet Another Resource Negotiator.
The initial version of Hadoop had just two components: Map Reduce and HDFS. Later it was realized that Map Reduce alone couldn’t solve a lot of big data problems. The idea was to take the resource management and job scheduling responsibilities away from the old Map-Reduce engine and give them to a new component. This is how YARN came into the picture. It is the middle layer between HDFS and Map Reduce and is responsible for managing cluster resources.
It has two key roles to perform: a) Job scheduling. b) Resource management.
a) Job scheduling: When a large amount of data is given for processing, the work needs to be broken down and distributed as different tasks/jobs. The job scheduler decides which job gets top priority, the time interval between two jobs, and the dependencies among jobs, and it checks that running jobs do not overlap.
b) Resource management: Processing and storing data both require resources. The resource manager provides, manages, and maintains the resources needed to store and process the data.
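The two roles described above can be pictured with a toy scheduler. This is a sketch under stated assumptions, not YARN's actual algorithm or API: real YARN uses pluggable schedulers such as the Capacity Scheduler and the Fair Scheduler. The `Job` class, the `schedule` function, the sample jobs, and the rule that a higher priority number runs first are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    priority: int      # assumption: a higher number is admitted first
    memory_mb: int     # memory the job asks the cluster for
    depends_on: list = field(default_factory=list)


# Hypothetical jobs, used only to exercise the sketch.
EXAMPLE_JOBS = [
    Job("ingest", priority=1, memory_mb=4096),
    Job("audit", priority=2, memory_mb=4096),
    Job("clean", priority=5, memory_mb=2048, depends_on=["ingest"]),
    Job("report", priority=3, memory_mb=4096, depends_on=["clean"]),
    Job("archive", priority=1, memory_mb=8192, depends_on=["ingest"]),
]


def schedule(jobs, cluster_memory_mb):
    """Return rounds of job names; jobs in one round run side by side.

    A job becomes eligible once everything it depends on has finished.
    Within a round, eligible jobs are admitted in priority order for as
    long as they still fit into the cluster's memory.
    """
    finished, rounds = set(), []
    pending = list(jobs)
    while pending:
        eligible = [j for j in pending if set(j.depends_on) <= finished]
        eligible.sort(key=lambda j: -j.priority)
        free, this_round = cluster_memory_mb, []
        for job in eligible:
            if job.memory_mb <= free:
                free -= job.memory_mb
                this_round.append(job)
        if not this_round:
            raise ValueError("remaining jobs can never be scheduled")
        rounds.append([j.name for j in this_round])
        finished.update(j.name for j in this_round)
        pending = [j for j in pending if j.name not in finished]
    return rounds


if __name__ == "__main__":
    for number, names in enumerate(schedule(EXAMPLE_JOBS, 8192), start=1):
        print(f"round {number}: {names}")
```

The scheduling side shows up in the priority ordering and the dependency check. The resource management side shows up in the memory budget that limits how many jobs can run in the same round.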
So now we are clear about the concept of Hadoop and how it solves the challenges created by big data.
This has been a guide to the Hadoop framework. We have discussed its basic meaning and its four main components in detail.