Overview of Installing Hadoop
This article outlines the key modules of the Apache Hadoop framework and walks through a step-by-step Hadoop installation. Apache Hadoop is a collection of software that enables distributed storage and processing of large datasets across a cluster of computers, which may be of different types. Currently, Hadoop remains the most widely used analytics platform for big data (“Sanchita Lobo, Author at Analytics Training Blog,” n.d.).
The Apache Hadoop framework consists of the following key modules.
- Apache Hadoop Common
- Apache Hadoop Distributed File System (HDFS)
- Apache Hadoop MapReduce
- Apache Hadoop YARN (Yet Another Resource Negotiator)
Apache Hadoop Common
The Apache Hadoop Common module consists of shared libraries that are used by all other modules, including key management, generic I/O packages, libraries for metric collection, and utilities for the registry, security, and streaming.
Apache Hadoop Distributed File System (HDFS)
HDFS is modeled on the Google File System and is designed to run on low-cost hardware. It is fault tolerant and is intended for applications that work with large datasets.
Apache Hadoop MapReduce
MapReduce is an inherently parallel programming model for data processing, and Hadoop can run MapReduce programs written in various languages, such as Java. MapReduce works by splitting the processing into a map phase and a reduce phase.
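To make the two phases concrete, here is a minimal sketch of a word count in plain Python. It does not use Hadoop at all; the function names (`map_phase`, `shuffle`, `reduce_phase`) are illustrative assumptions, and the grouping step stands in for the shuffle that the framework performs between the map and reduce phases.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every whitespace-separated word in a line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on an iterable of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["Hadoop stores data", "Hadoop processes data"]))
```

On a real cluster the map calls run in parallel on different blocks of the input and the reduce calls run in parallel on different keys; this sketch only shows the data flow.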
Apache Hadoop YARN
Apache Hadoop YARN is a core component that provides resource management and job scheduling in the Hadoop distributed processing framework.
In this article, we will discuss the installation and configuration of Hadoop 2.7.4 on a single-node cluster and test the configuration by running the MapReduce example program wordcount, which counts how often each word occurs in a file. We will then look at a few important Hadoop File System commands.
Steps to Install Hadoop
The following is a summary of the tasks involved in the configuration of Apache Hadoop.
Task 1: The first task in the Hadoop installation was to set up a virtual machine template configured with CentOS 7. The packages required to run Hadoop, such as the Java SDK 1.8 and its runtime, were downloaded, and the Java environment variable for Hadoop was configured by editing .bashrc.
Task 2: The Hadoop 2.7.4 release package was downloaded from the Apache website and extracted into the opt folder. The extracted folder was then renamed to Hadoop for easy access.
Task 3: Once the Hadoop package was extracted, the next step was to configure the environment variable for the Hadoop user and then the Hadoop node XML files. In this step, the NameNode was configured within core-site.xml and the DataNode within hdfs-site.xml. The resource manager and node manager were configured within yarn-site.xml.
Task 4: The firewall was disabled, and YARN and DFS were started. The jps command was used to verify that the relevant daemons were running in the background. The Hadoop web interface was then accessed at http://localhost:50070/
Task 5: The next few steps verified and tested Hadoop. For this, a temporary test file was created in the input directory for the wordcount program. The MapReduce examples jar, hadoop-mapreduce-examples-2.7.4.jar, was used to count the words in the file. The results were checked on localhost, and the logs of the submitted application were analyzed. All submitted MapReduce applications can be viewed in the web interface, whose default port number is 8088.
Task 6: In the final task, we introduce some basic Hadoop File System commands and check their usage. We will see how to create a directory within the Hadoop file system, list the contents of a directory, and display its size in bytes. We will also see how to delete a specific directory or file.
Results of Hadoop Installation
The following shows the results of each of the above tasks:
Result of Task 1
A new virtual machine with a CentOS 7 image was configured to run Apache Hadoop. Figure 1 shows how the CentOS 7 image was configured in the virtual machine. Figure 1.2 shows the Java environment variable configuration within .bashrc.
Figure 1: Virtual Machine configuration
Figure 1.2: Java environment variable configuration
Result of Task 2
Figure 2 shows the extraction of the Hadoop 2.7.4 package into the opt folder.
Figure 2: Extraction of Hadoop 2.7.4 package
Result of Task 3
Figure 3 shows the configuration of the environment variable for the Hadoop user. Figures 3.1 to 3.4 show the XML files required for the Hadoop configuration.
Figure 3: Configuring the environment variable for Hadoop user
Figure 3.1: Configuration of core-site.xml
Figure 3.2: Configuration of hdfs-site.xml
Figure 3.3: Configuration of mapred-site.xml file
Figure 3.4: Configuration of yarn-site.xml file
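Because the figures are not reproduced here, the following is a minimal sketch of what the four site files could contain on a single-node setup, generated with a small Python helper. The property names are standard Hadoop 2.x keys, but the values (host `localhost`, port `9000`, replication factor `1`) and the helper name `render_site_xml` are assumptions for illustration, not the values used in the figures.

```python
from xml.sax.saxutils import escape


def render_site_xml(properties):
    """Render a dict of property names and values as a Hadoop *-site.xml document."""
    lines = ['<?xml version="1.0"?>', "<configuration>"]
    for name, value in properties.items():
        lines += [
            "  <property>",
            f"    <name>{escape(name)}</name>",
            f"    <value>{escape(str(value))}</value>",
            "  </property>",
        ]
    lines.append("</configuration>")
    return "\n".join(lines) + "\n"


# Assumed single-node values; adjust host, port, and replication to your setup.
SITE_FILES = {
    "core-site.xml": {"fs.defaultFS": "hdfs://localhost:9000"},
    "hdfs-site.xml": {"dfs.replication": 1},
    "mapred-site.xml": {"mapreduce.framework.name": "yarn"},
    "yarn-site.xml": {"yarn.nodemanager.aux-services": "mapreduce_shuffle"},
}

if __name__ == "__main__":
    # Print each file instead of writing it, so nothing on disk is overwritten.
    for filename, props in SITE_FILES.items():
        print(f"--- {filename} ---")
        print(render_site_xml(props))
```

Printing rather than writing keeps the sketch safe to run; the output would be copied into the configuration directory of the Hadoop installation by hand.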
Result of Task 4
Figure 4 shows the jps command being used to check that the relevant daemons are running in the background, and the following figure shows Hadoop’s web user interface.
Figure 4: jps command to verify running daemons.
Figure 4.1: Accessing the Hadoop web interface at http://hadoop1.example.com:50070/
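The jps check can also be scripted. The sketch below parses jps output, which lists one `<pid> <name>` pair per line, and reports which daemons are missing. The expected set is the usual one for a single-node Hadoop 2.x cluster and is an assumption here, as are the function names; the sample output is made up for illustration and is not taken from Figure 4.

```python
# Daemons usually present on a single-node Hadoop 2.x cluster (assumed set).
EXPECTED_DAEMONS = {
    "NameNode",
    "DataNode",
    "SecondaryNameNode",
    "ResourceManager",
    "NodeManager",
}


def running_processes(jps_output):
    """Parse jps output, one '<pid> <name>' pair per line, into a set of names."""
    names = set()
    for line in jps_output.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2 and parts[0].isdigit():
            names.add(parts[1].strip())
    return names


def missing_daemons(jps_output, expected=EXPECTED_DAEMONS):
    """Return the expected daemon names that do not appear in the jps output."""
    return sorted(set(expected) - running_processes(jps_output))


if __name__ == "__main__":
    # Illustrative sample; on a live system the text could come from
    # subprocess.run(["jps"], capture_output=True, text=True).stdout
    sample = "2481 NameNode\n2615 DataNode\n3190 ResourceManager\n3702 Jps\n"
    print(missing_daemons(sample))
```

An empty list means every expected daemon was found.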
Result of Task 5
Figure 5 shows the result of the MapReduce program wordcount, which counts how often each word occurs in the file. The next two figures display the YARN resource manager’s web user interface for the submitted task.
Figure 5: MapReduce program results
Figure 5.1: Submitted MapReduce application.
Figure 5.2: Logs for submitted MapReduce application.
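The wordcount example writes its result as text lines, each holding a word and its count separated by a tab (typically in a file named part-r-00000 in the output directory). As a minimal sketch of how such a result could be inspected once copied out of HDFS, the helper below parses that format. The function names and the sample text are assumptions for illustration, not output from Figure 5.

```python
def parse_wordcount_output(text):
    """Parse reducer output lines of the form 'word<TAB>count' into a dict."""
    counts = {}
    for line in text.splitlines():
        if "\t" not in line:
            continue  # skip blank or malformed lines
        word, _, count = line.rpartition("\t")
        counts[word] = int(count)
    return counts


def top_words(counts, n=3):
    """Return the n most frequent words, breaking ties alphabetically."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


if __name__ == "__main__":
    sample = "Hadoop\t2\ndata\t2\nprocesses\t1\nstores\t1\n"
    counts = parse_wordcount_output(sample)
    print(sum(counts.values()), "words in total")
    print(top_words(counts, 2))
```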
Result of Task 6
Figure 6 shows how to create a directory within the Hadoop file system and list the contents of an HDFS directory.
Figure 6: Creating a directory within the Hadoop file system
Figure 6.1 shows how to put a file onto the Hadoop Distributed File System, and Figure 6.2 shows the created file in the dirB directory.
Figure 6.1: Creating a file in HDFS.
Figure 6.2: New file created.
The next few figures show how to list the contents of particular directories:
Figure 6.3: Content of dirA
Figure 6.4: Content of dirB
The next figure shows how file and directory size can be displayed:
Figure 6.5: Display a file and directory size.
A directory or a file can be deleted with the -rm command.
Figure 6.6: Deleting a file.
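The file system operations in this section can be summarized as a sequence of `hdfs dfs` invocations. The sketch below only builds and prints the command lines, so it runs without a cluster. The directory names dirA and dirB come from the figures, but placing them at the HDFS root, the local file name `sample.txt`, and the helper names are assumptions for illustration.

```python
import shlex


def hdfs_dfs(*args):
    """Build the argument list for one 'hdfs dfs' invocation."""
    return ["hdfs", "dfs", *args]


def demo_commands(local_file="sample.txt"):
    """Return the commands for create, put, list, size, and delete, in order."""
    return [
        hdfs_dfs("-mkdir", "-p", "/dirA"),       # create directories
        hdfs_dfs("-mkdir", "-p", "/dirB"),
        hdfs_dfs("-put", local_file, "/dirB/"),  # copy a local file into HDFS
        hdfs_dfs("-ls", "/dirA"),                # list directory contents
        hdfs_dfs("-ls", "/dirB"),
        hdfs_dfs("-du", "/dirB"),                # show sizes in bytes
        hdfs_dfs("-rm", f"/dirB/{local_file}"),  # delete a file
        hdfs_dfs("-rm", "-r", "/dirA"),          # delete a directory recursively
    ]


if __name__ == "__main__":
    for argv in demo_commands():
        # On a machine with Hadoop on the PATH, each list could be passed to
        # subprocess.run(argv, check=True) instead of being printed.
        print(shlex.join(argv))
```

Note that deleting a directory needs the -r option in addition to -rm; without it, the command removes files only.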
Big data has played a very important role in shaping today’s world market, and the Hadoop framework makes a data analyst’s work easier when handling large datasets. The configuration of Apache Hadoop was quite simple, and the web user interface gave the user multiple options to tune and manage the application. Hadoop is widely used in organizations for data storage, machine learning analytics, and data backup. Its distributed environment and MapReduce make managing large amounts of data practical. Compared with relational databases, Hadoop offers additional tuning and performance options for large datasets. Apache Hadoop is a user-friendly, low-cost solution for managing and storing big data efficiently, with HDFS providing the distributed storage.
This is a guide to installing Hadoop. Here we discussed an introduction to Hadoop, the step-by-step installation of Hadoop, and the results of the Hadoop installation.