Difference Between Hadoop and MapReduce
The roots of Hadoop date back to 2002, when Doug Cutting was working on an open source project named Nutch (which indexed web pages and used the index for searching, the same thing Google does). He was facing scalability issues in terms of both storage and computing. In 2003 Google published its paper on GFS (Google File System), and in 2004 the Nutch project created NDFS (Nutch Distributed File System). After Google announced MapReduce as the computational brain behind its sorting algorithms, Doug was able to run Nutch on NDFS using MapReduce in 2005, and in 2006 Hadoop was born.
Hadoop is an ecosystem of open source projects such as Hadoop Common, the Hadoop Distributed File System (HDFS), Hadoop YARN, and Hadoop MapReduce. Hadoop as such is an open source framework for storing and processing huge datasets. Storage is handled by HDFS and processing is taken care of by MapReduce. MapReduce, on the other hand, is a programming model that allows you to process the huge datasets stored in Hadoop. Let us understand Hadoop and MapReduce in detail in this post.
Head to Head Comparison between Hadoop and MapReduce (Infographics)
Below are the top 5 comparisons between Hadoop and MapReduce.
Key Differences between Hadoop and MapReduce
The following are the key differences between Hadoop and MapReduce:
- To differentiate Hadoop and MapReduce in layman’s terms: Hadoop is like a car, which has everything needed to travel a distance, while MapReduce is like the car’s engine. The engine cannot run without a car around it, but the exterior of the car may change (that is, MapReduce can run on other distributed file systems (DFS)).
- The basic idea behind Hadoop is that data storage must be reliable and scalable. Reliable means that the data must remain available even in case of a disaster or network failure, and Hadoop’s framework achieves this using Name Nodes and Data Nodes.
- Some basic ideas about Data Nodes and Name Nodes:
- The Name Node and the Data Nodes follow a master/slave architecture, where the master (Name Node) stores the location of the data and the slaves (Data Nodes) store the data itself. The data is split into blocks of 64 MB (the default in early Hadoop releases; Hadoop 2 and later default to 128 MB), the blocks are saved on the Data Nodes, and the registry of these blocks is maintained at the Name Node. Each block is replicated three times by default for reliability. As for scalability, hardware can be added on the go, which increases the storage capacity and makes the system scalable.
- Now coming to MapReduce, there are three phases:
- Map Phase
- Shuffle Phase
- Reduce Phase
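Before moving on to MapReduce itself, the block splitting and replication described for the Name Node and Data Nodes can be sketched with a little arithmetic. This is only an illustration, not part of any Hadoop API: the function name `hdfs_footprint` and the 200 MB file size are assumptions made for the example, while the 64 MB block size and the replication factor of 3 are the values quoted above.

```python
import math

BLOCK_SIZE_MB = 64   # block size quoted in the text
REPLICATION = 3      # default replication factor quoted in the text


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Hypothetical helper: return (blocks, block replicas, raw storage in MB)."""
    # The file is cut into fixed-size blocks; the last block may be partly filled.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Every block is stored `replication` times on different Data Nodes.
    replicas = blocks * replication
    # A partly filled last block only occupies its actual size on disk.
    raw_storage_mb = file_size_mb * replication
    return blocks, replicas, raw_storage_mb


blocks, replicas, raw_mb = hdfs_footprint(200)
print(blocks, replicas, raw_mb)  # 4 12 600
```

So a 200 MB file becomes 4 blocks, the Name Node tracks 12 block replicas for it, and the cluster spends about 600 MB of raw disk to hold it.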
Let’s take an example to understand these three phases better. MapReduce, being a programming framework, also has a “hello world” program, known as the word count program.
The word count program produces key-value pairs of each word and its frequency in a paragraph, an article, or any other data source. To make it easy to understand, let’s take the following as example data.
As we can see, the dataset contains three words: bus, car, and train. The row labelled Input holds the data as it appears in the dataset, and the row labelled Output holds the data at the intermediate stage, on which the shuffling will take place.
Here we use a comma (,) as the splitter to separate the words. The splitter can be a comma, a space, a new line, etc.
Input | Set of data | caR, CAR, car, BUS, TRAIN, bus, train, bus, TRAIN, BUS, buS, Car, bus, car, train, car, bus, car |
Output | Convert into another set of data (Key,Value) | (Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1), (TRAIN,1), (BUS,1), (buS,1), (caR,1), (CAR,1), (car,1), (BUS,1), (TRAIN,1) |
The output of this intermediate stage is given to the reducer, and the final output of the program is shown below.
Input (output of Map function) | Set of tuples | (Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1), (TRAIN,1), (BUS,1), (buS,1), (caR,1), (CAR,1), (car,1), (BUS,1), (TRAIN,1) |
Output | Converts into a smaller set of tuples | (BUS,7), (CAR,7), (TRAIN,4) |
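The word count flow above can be imitated in a few lines of plain Python. This is a single-machine sketch of the map, shuffle, and reduce phases, not Hadoop’s actual Java API; the function names `map_phase`, `shuffle_phase`, and `reduce_phase` are hypothetical. It also assumes that keys are upper-cased while grouping, so that caR, CAR, and car end up together as in the final output above. In a real Hadoop job the shuffle groups by exact key, so this normalization would usually be done in the mapper instead.

```python
from collections import defaultdict

DATA = ("caR, CAR, car, BUS, TRAIN ,bus, train, bus, TRAIN,BUS, "
        "buS, Car, bus, car, train, car, bus, car")


def map_phase(text, splitter=","):
    """Map: emit one (word, 1) pair per token, keeping the original case."""
    return [(token.strip(), 1) for token in text.split(splitter) if token.strip()]


def shuffle_phase(pairs):
    """Shuffle: group the 1s by key (assumption: keys are upper-cased here)."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word.upper()].append(count)
    return groups


def reduce_phase(groups):
    """Reduce: sum the grouped values to get one count per word."""
    return {word: sum(counts) for word, counts in sorted(groups.items())}


result = reduce_phase(shuffle_phase(map_phase(DATA)))
print(result)  # {'BUS': 7, 'CAR': 7, 'TRAIN': 4}
```

The map step yields 18 pairs, one per word in the dataset, and the reduce step collapses them into the three totals shown in the table.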
- One of the key differences between Hadoop and other big data processing frameworks is that Hadoop sends the code (the MapReduce code) to the cluster nodes where the data is stored, rather than sending the data to the code. Because the datasets are terabytes or sometimes petabytes in size, moving the data would be a tedious task.
Hadoop vs MapReduce Comparison Table
Below is the primary comparison between Hadoop and MapReduce:
Basis for Comparison | Hadoop | MapReduce |
Meaning | “Hadoop” was the name of a toy elephant belonging to Doug Cutting’s son. He gave the project this name because it was easy to pronounce. | The name “MapReduce” comes from the functionality itself: mapping and reducing key-value pairs. |
Concept | Apache Hadoop is an ecosystem that provides a reliable, scalable environment ready for distributed computing. | MapReduce is a submodule of this project. It is a programming model used to process huge datasets that sit on HDFS (Hadoop Distributed File System). |
Pre-requisites | Hadoop implements and runs on HDFS (Hadoop Distributed File System). | MapReduce can run on HDFS, GFS, NDFS, or any other distributed file system, for example MapR-FS. |
Language | Hadoop is a collection of modules and hence may include other programming/scripting languages too. | MapReduce is basically written in the Java programming language. |
Framework | Hadoop not only has a storage framework, which stores the data by creating Name Nodes and Data Nodes, but also has other frameworks, which include MapReduce itself. | MapReduce is a programming framework that uses key-value mappings to sort/process the data. |
The below figure will help in differentiating MapReduce from Hadoop.
MapReduce Framework
- As the picture above shows, MapReduce is a distributed processing framework, whereas Hadoop is a collection of all the frameworks.
Conclusion
Hadoop, being open source, gained popularity because it was free to use and programmers could change the code to suit their needs. The Hadoop ecosystem has been developed continuously over the years to make it as bug-free as possible.
With the ever-changing needs of the world, technology changes rapidly and it becomes difficult to keep track of the changes. The volume of data being generated keeps multiplying, and the need for faster processing of datasets has led to many other programming frameworks such as MapReduce 2, Spark, etc.