Spark Streaming

By Priya Pedamkar

Introduction to Spark Streaming

Spark Streaming is an extension of the core Spark API that enables fault-tolerant, high-throughput, scalable processing of live data streams. It provides a high-level abstraction called the discretized stream, or DStream, which supports operations such as transformations (map, flatMap, filter, and union) and the update-state-by-key operation. Internally, Spark Streaming receives the live input data stream and divides it into batches; these batches are then processed by the Spark engine to produce the final stream of results, also in batches.
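
To make this batching model concrete, here is a minimal PySpark sketch (the socket source on localhost:9999 and the 5-second batch interval are illustrative assumptions, not details from the original text): it receives a live text stream, slices it into micro-batches, and exposes the result as a DStream.

Code:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Live input is received, cut into 5-second batches, and each batch is handed to the Spark engine.
sc = SparkContext("local[2]", "DStreamIntro")    # at least two threads: one receiver, one processor
ssc = StreamingContext(sc, 5)                    # micro-batch interval of 5 seconds

lines = ssc.socketTextStream("localhost", 9999)  # each batch of lines becomes one RDD in this DStream
lines.pprint()                                   # print the first elements of every batch

ssc.start()                                      # start receiving and processing
ssc.awaitTermination()                           # run until stopped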

How Does Spark Streaming Work?

  • In Spark Streaming, the incoming data stream is divided into fixed-interval batches, represented as DStreams, where each DStream is internally a sequence of RDDs. The RDDs are processed using the Spark API, and the results are likewise returned in batches. Discretized stream operations, whether stateful or stateless transformations, also include output operations, input DStream operations, and receivers. DStreams are the basic abstraction provided by Spark Streaming: a continuous stream of Spark RDDs.
  • Spark Streaming also provides fault tolerance for DStreams, much as Spark does for RDDs: as long as a copy of the data is available, any lost state can be recomputed using Spark’s lineage graph over the underlying RDDs. The key point is that DStreams translate their operations into operations on their underlying set of RDDs, and these RDD-based transformations are computed by the Spark engine. The DStream API hides these low-level details and gives the developer a high-level API for development; see the sketch after this list.
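
As a rough sketch of the batch-of-RDDs model described above (PySpark; the socket source is an assumed, illustrative endpoint), foreachRDD exposes the underlying RDD of every batch, so the usual RDD API and Spark’s lineage-based recovery apply to each one.

Code:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "DStreamAsRDDs")
ssc = StreamingContext(sc, 10)                   # every 10 seconds of input forms one batch, i.e. one RDD

lines = ssc.socketTextStream("localhost", 9999)

def process_batch(time, rdd):
    # 'rdd' is an ordinary Spark RDD holding the records of one batch
    print("Batch at %s contains %d records" % (time, rdd.count()))

lines.foreachRDD(process_batch)

ssc.start()
ssc.awaitTermination()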

Advantages of Spark Streaming

There are several reasons why using Spark Streaming is advantageous:

  • Unification of streaming, batch, and interactive workloads: Datasets can be easily integrated and used across all of these workloads, something that was never easy in continuous-operator systems. Spark therefore serves as a single engine for all of them.
  • Advanced analytics, machine learning, and SQL queries: Complex workloads often require models that are continuously learned and updated. A major strength of this component of Spark is that it integrates easily with MLlib or any other dedicated machine learning library.
  • Fast failure and straggler recovery: Failure recovery and fault tolerance are among the core features available in Spark Streaming.
  • Load balancing: Bottlenecks often arise between systems because of unevenly distributed load, so it is important to balance the load evenly; this component of Spark handles that automatically.
  • Performance: Because of its in-memory computation technique, which relies on memory rather than on disk, Spark’s performance is very good and efficient compared to other Hadoop-based systems.

Spark Streaming Operations

Below are the operations of Spark Streaming:

1. Transformation operations on Spark Streaming

Just as data is transformed in a set of RDDs, data in a DStream is transformed as well; DStreams offer many of the transformations available on normal Spark RDDs. Some of them are listed below, followed by a consolidated sketch after the list:

  • map(): This returns a new DStream by passing each element of the source DStream through a function.
    For example, data.map(line => (line, line.length))
  • flatMap(): This is similar to map, but each input item can be mapped to zero or more output items.
    For example, data.flatMap(line => line.split(" "))
  • filter(): This returns a new DStream containing only the records that pass the given predicate.
    For example, data.filter(value => value == "spark")
  • union(): This returns a new DStream that contains the data of the input DStream combined with other DStreams.
    For example, Dstream1.union(Dstream2).union(Dstream3)
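
Here is a consolidated sketch of the transformations listed above, written in PySpark (the two socket sources are illustrative assumptions; the Scala-style snippets above show the same calls):

Code:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[3]", "DStreamTransformations")   # two receivers plus one processing thread
ssc = StreamingContext(sc, 5)

stream1 = ssc.socketTextStream("localhost", 9999)
stream2 = ssc.socketTextStream("localhost", 9998)

words = stream1.flatMap(lambda line: line.split(" "))     # flatMap: one line -> many words
pairs = words.map(lambda w: (w, len(w)))                  # map: word -> (word, length)
sparks = words.filter(lambda w: w == "spark")             # filter: keep matching records only
combined = stream1.union(stream2)                         # union: merge two DStreams

pairs.pprint()
sparks.pprint()
combined.pprint()

ssc.start()
ssc.awaitTermination()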

2. Update state by key operation

This operation allows you to maintain arbitrary state while continuously updating it with new information. To use it, you define the state, which can be of any type, and a state update function that specifies how to compute the new state from the previous state and the new values arriving in the input stream. In every batch, Spark applies the same state update function to all existing keys.

Example:

Code:

def updateFunction(newValues, runningCount):
    # newValues: values arriving for this key in the current batch
    # runningCount: state from previous batches (None when the key is new)
    if runningCount is None:
        runningCount = 0
    return sum(newValues, runningCount)  # new running count for the key
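
For context, here is a hedged sketch of how an update function like this might be wired into updateStateByKey in PySpark (the socket source and the checkpoint directory are illustrative assumptions; updateFunction is the function defined above):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# updateFunction(newValues, runningCount) is defined just above
sc = SparkContext("local[2]", "StatefulWordCount")
ssc = StreamingContext(sc, 5)
ssc.checkpoint("/tmp/spark-streaming-checkpoint")          # stateful operations require a checkpoint directory

pairs = (ssc.socketTextStream("localhost", 9999)
            .flatMap(lambda line: line.split(" "))
            .map(lambda word: (word, 1)))

runningCounts = pairs.updateStateByKey(updateFunction)     # per-key state is carried across batches
runningCounts.pprint()

ssc.start()
ssc.awaitTermination()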

Conclusion

Spark Streaming is one of the most efficient systems for building real-time streaming pipelines, and it overcomes many of the issues encountered with traditional systems and methods. Developers who are learning the Spark Streaming component are therefore stepping onto a single framework that can meet most of their development needs, so we can safely say that its use enhances productivity and performance in projects and companies that are adopting, or looking to adopt, the big data ecosystem. I hope you liked our article. Stay tuned for more articles like these.

Recommended Articles

This is a guide to Spark Streaming. Here we discuss the introduction to Spark Streaming, how it works, along with its advantages and operations. You can also go through our other related articles to learn more –

  1. What is Hadoop Streaming?
  2. Spark Commands
  3. Tutorials on How to Install Spark
  4. Top 6 Components of Spark