EDUCBA Logo

EDUCBA

MENUMENU
  • Explore
    • EDUCBA Pro
    • PRO Bundles
    • Featured Skills
    • New & Trending
    • Fresh Entries
    • Finance
    • Data Science
    • Programming and Dev
    • Excel
    • Marketing
    • HR
    • PDP
    • VFX and Design
    • Project Management
    • Exam Prep
    • All Courses
  • Blog
  • Enterprise
  • Free Courses
  • Log in
  • Sign Up
Home Data Science Data Science Tutorials Hadoop Tutorial Hadoop Streaming
 

Hadoop Streaming

Priya Pedamkar
Article byPriya Pedamkar

Updated March 18, 2023

Hadoop Streaming

 

 

What is Hadoop Streaming?

Hadoop Streaming is defined as a utility which comes Hadoop distribution that is used to execute program analysis of big data using programming languages such as Jave, Unix, Perl, Python, Scala, etc. as this gives the user the liberty to create and run MapReduce jobs with the scripts hence it’s used for real-time data ingestion which can be used in different real-time apps(like watching stock portfolio, share market analysis, narrating weather report, etc.) It is a Hadoop distribution with utility. Utility helps us to create and run specific MapReduce jobs with an executable or the script as the mapper and/or reducer.

Watch our Demo Courses and Videos

Valuation, Hadoop, Excel, Mobile Apps, Web Development & many more.

Understanding

There are java utilities provided by the Hadoop distribution, which are called Hadoop streaming. The utility is packaged in a JAR file. Using the utility, we can create and run MapReduce jobs with an executable script. Moreover, we can create executable scripts to run mapper and reducer functions. The executable scripts are passed to Hadoop streaming using a command. After the scripts are passed to Hadoop streaming, the Hadoop streaming utility creates a map and reduce jobs and submit them to the cluster. These jobs can also be monitored with this utility.

How does it Work?

The script specified for mapper and reducer works as below-

After the complete initialization of the mapper script, it will launch the instance of the script with different process ids. The mapper task, while running, takes the input lines and passes them to the standard input. At the same time, the outputs from the process’s standard output are collected by the mapper. It converts each line into a key-value pair. The set of key-value pairs is then collected as the output from the mapper. The key-value pair is selected based on the first tab character. The part of the line up to the initial tab is selected as key, while the rest of the line is selected as a valuable part. In case the tab is not present in a line, then the total line is selected as key, and there is no value part for the line. This can be adjusted according to business needs.

Purpose

It is used for real-time data ingestion, which can be used in different real-time apps. There are different real-time apps like watching stock portfolios, share market analysis, narrating weather report, traffic alerts which are done using Hadoop streaming.

Working of Hadoop Streaming

Below is a simple example of how it works:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper org.apache.hadoop.mapred.lib.IdentityMapper \
-reducer /bin/wc

The input command is used to provide the input directory, while the output command is used to provide the output directory. The mapper command is used to specify the executable mapper class, while the reducer command is used to specify the executable reducer class.

Advantages

Below are the advantages explained:

advantages of haddop streaming

1. Availability

This doesn’t require any extra separate software to be installed and managed. There are other tools like a pig, hive which can be installed I need to be managed separately.

2. Learning

It doesn’t require learning new technologies. It can be leveraged with minimum Unix skills for data analysis.

3. Reduce Development Time

It requires to write mapper and reducer code while developing streaming applications in Unix, whereas doing the same work using Java MapReduce application is more complex and needs to be compiled first, then test, then package, followed by exporting JAR file, and then run.

4. Faster Conversion

It takes very little time to convert data from one format to another using Hadoop streaming. We can use it for converting data from text file to sequence file and then again from sequence file to text file and many others. This can be achieved using input format and output format options in Hadoop streaming.

5. Testing

Input and output data can be quickly tested by using it with Unix or Shell Script.

6. Requirement for Business

For simple business requirements like simple filtering operations and simple aggregation operation, we can use this with Unix.

7. Performance

Using this, we can get better performance while working with streaming data. There are also several disadvantages of Hadoop streaming which are addressed by using other tools in the Hadoop package like Kafka, Flume, spark.

Why do we need Hadoop Streaming?

It helps in real-time data analysis, which is much faster using MapReduce programming running on a multi-node cluster. There are different Technologies like spark Kafka and others which helps in real-time Hadoop streaming.

How will this technology help you in career growth?

Nowadays, all major enterprises are moving to Hadoop for their data analysis, and many of them may require analysis of real-time data. The demand for the use of real-time data and processing of the same day by day and this technology is creating a lot of scope for individual career growth.

Conclusion

It offers a huge range of advantages for different real-time data processing using streaming data.

Recommended Articles

This is a guide to Hadoop Streaming. Here we discuss the basic concept, need, purpose, working advantages, and disadvantages along with the career growth of Hadoop Streaming. You can also go through our other suggested articles to learn more –

  1. What is Hadoop Cluster?
  2. What Is Data Mining?
  3. What is Data Visualization
  4. What is Data Modeling?
  5. Complete Guide to Kafka Tools

Primary Sidebar

Footer

Follow us!
  • EDUCBA FacebookEDUCBA TwitterEDUCBA LinkedINEDUCBA Instagram
  • EDUCBA YoutubeEDUCBA CourseraEDUCBA Udemy
APPS
EDUCBA Android AppEDUCBA iOS App
Blog
  • Blog
  • Free Tutorials
  • About us
  • Contact us
  • Log in
Courses
  • Enterprise Solutions
  • Free Courses
  • Explore Programs
  • All Courses
  • All in One Bundles
  • Sign up
Email
  • [email protected]

ISO 10004:2018 & ISO 9001:2015 Certified

© 2025 - EDUCBA. ALL RIGHTS RESERVED. THE CERTIFICATION NAMES ARE THE TRADEMARKS OF THEIR RESPECTIVE OWNERS.

EDUCBA

*Please provide your correct email id. Login details for this Free course will be emailed to you
EDUCBA

*Please provide your correct email id. Login details for this Free course will be emailed to you

EDUCBA
Free Data Science Course

Hadoop, Data Science, Statistics & others

By continuing above step, you agree to our Terms of Use and Privacy Policy.
*Please provide your correct email id. Login details for this Free course will be emailed to you
EDUCBA

*Please provide your correct email id. Login details for this Free course will be emailed to you
EDUCBA Login

Forgot Password?

Loading . . .
Quiz
Question:

Answer:

Quiz Result
Total QuestionsCorrect AnswersWrong AnswersPercentage

Explore 1000+ varieties of Mock tests View more

🚀 Limited Time Offer! - ENROLL NOW