MODULE 14
CREATE A SPARK STREAMING APP
1. Architecture and Abstraction
Spark Streaming uses a "micro-batch" architecture, in which the streaming computation is treated as a continuous series of batch computations on small batches of data. Spark Streaming receives data from various input sources and groups it into small batches. New batches are created at regular time intervals: at the beginning of each interval a new batch is created, any data that arrives during that interval is added to it, and at the end of the interval the batch stops growing. The size of the time intervals is determined by a parameter called the batch interval, typically between 500 milliseconds and several seconds, as configured by the application developer. Each input batch forms an RDD and is processed by Spark jobs to create other RDDs. The processed results can then be pushed out to external systems in batches.
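As a minimal sketch, the batch interval is fixed once, when the StreamingContext is constructed, exactly as in the examples later in this module:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("example")
// Incoming records are grouped into a new batch (one RDD) every second.
val ssc = new StreamingContext(conf, Seconds(1))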
The programming abstraction in Spark Streaming is the discretized stream, or DStream, which is a sequence of RDDs where each RDD holds one time slice of the data in the stream.
You can create DStreams either from external input sources or by applying transformations to other DStreams. In the simple example in this module, we create a DStream from data received through a socket and then apply a filter() transformation to it.
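Because a DStream is just a sequence of RDDs, the RDD behind each batch can be inspected directly. A minimal sketch, assuming ssc is the StreamingContext created in the sketch above:

// Each batch interval produces one RDD; foreachRDD exposes it.
val lines = ssc.socketTextStream("localhost", 7777)
lines.foreachRDD { (rdd, time) =>
  println(s"Batch at $time contains ${rdd.count()} lines")
}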
Transformations
Transformations on DStreams can be grouped into either stateless or stateful:
- In stateless transformations, the processing of each batch does not depend on the data of previous batches. They include the common RDD transformations such as map(), filter(), and reduceByKey(). See the sketch after this list.
- Stateful transformations, in contrast, use data or intermediate results from previous batches to compute the results of the current batch. They include transformations based on sliding windows and on tracking state across time.
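A minimal sketch contrasting the two kinds of transformations; lines is assumed to be an existing DStream[String], such as the socket stream built in the next section:

import org.apache.spark.streaming.Seconds

// Stateless: each batch is filtered on its own, independently of earlier batches.
val errorLines = lines.filter(_.contains("error"))

// Stateful: a 30-second sliding window, recomputed every 10 seconds, combines
// the data of several consecutive batches before counting it.
val windowedErrorCounts = errorLines.window(Seconds(30), Seconds(10)).count()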
Output Operations
Output operations specify what needs to be done with the final transformed data in a
stream (e.g., pushing it to an external database or printing it to the screen).
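A few common output operations, sketched here against a hypothetical DStream[String] named errorLines (the same name is used in the example below):

// Print the first 10 elements of every batch on the driver's console.
errorLines.print()

// Save every batch as a set of text files named "errors-<batch time>.txt".
errorLines.saveAsTextFiles("errors", "txt")

// Run arbitrary per-batch code, e.g. to push results to an external system.
errorLines.foreachRDD { rdd =>
  println(s"Batch size: ${rdd.count()}")
}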
2. Creating a Socket Stream Application
We will receive a stream of newline-delimited lines of text from a server running at port 7777, keep only the lines that contain the word "error", and print them.
- Move to the SPARK_HOME folder:
$ cd $SPARK_HOME
- Create a directory to save the source code:
$ mkdir -p examples/socket-stream/src/main/scala
- Create a file named SocketStream.scala in the
$SPARK_HOME/examples/socket-stream/src/main/scala folder:
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.Duration
import org.apache.spark.streaming.Seconds

object SocketStream {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Socket-Stream")
    // Create a StreamingContext with a 1-second batch size from a SparkConf
    val ssc = new StreamingContext(conf, Seconds(1))
    // Create a DStream using data received after connecting to port 7777 on the local machine
    val lines = ssc.socketTextStream("localhost", 7777)
    // Filter our DStream for lines with "error"
    val errorLines = lines.filter(_.contains("error"))
    // Print out the lines with errors
    errorLines.print()
    // Start our streaming context and wait for it to "finish"
    ssc.start()
    // Wait for the job to finish
    ssc.awaitTermination()
  }
}
- Create a file named build.sbt in $SPARK_HOME/examples/socket-stream:
name := "socket-stream"
version := "0.0.1"
scalaVersion := "2.11.12"
// additional libraries
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.4.1" % "provided",
"org.apache.spark" %% "spark-streaming" % "2.4.1"
)
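The "provided" scope keeps spark-core out of the packaged jar, because spark-submit already puts the Spark runtime on the classpath; the versions above (Spark 2.4.1, Scala 2.11) are assumed to match the Spark distribution you run against.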
- Build the application (from the $SPARK_HOME/examples/socket-stream directory):
$ sbt clean package
- Submit and run in Spark:
$ $SPARK_HOME/bin/spark-submit --class SocketStream target/scala-2.11/socket-stream_2.11-0.0.1.jar
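Note that socketTextStream uses a receiver, which permanently occupies one thread. When running locally, make sure the master has at least two threads (for example by adding --master "local[2]" to the command above); otherwise the data will be received but never processed.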
- Open another terminal, start a netcat server listening on port 7777, and type some lines of text (the application prints the lines that contain "error"):
$ nc -l localhost -p 7777
3. Creating a Log Analyzer
- Move to the SPARK_HOME folder:
$ cd $SPARK_HOME
- Create a directory to save the source code:
$ mkdir -p examples/logs-analyzer/src/main/scala
- Download the sample log file:
https://drive.google.com/file/d/184RPO2pxbyDXUXIb3__nWxnI5iz-WNOC/view?usp=sharing
- Create a file named ApacheAccessLog.scala in the
$SPARK_HOME/examples/logs-analyzer/src/main/scala folder:
/** An entry of Apache access log. */
case class ApacheAccessLog(ipAddress: String,
                           clientIdentd: String,
                           userId: String,
                           dateTime: String,
                           method: String,
                           endpoint: String,
                           protocol: String,
                           responseCode: Int,
                           contentSize: Long)

object ApacheAccessLog {
  val PATTERN = """^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+)""".r

  /**
   * Parse log entry from a string.
   *
   * @param log A string, typically a line from a log file
   * @return An entry of Apache access log
   * @throws RuntimeException Unable to parse the string
   */
  def parseLogLine(log: String): ApacheAccessLog = {
    log match {
      case PATTERN(ipAddress, clientIdentd, userId, dateTime, method, endpoint, protocol, responseCode, contentSize) =>
        ApacheAccessLog(ipAddress, clientIdentd, userId, dateTime, method, endpoint, protocol, responseCode.toInt, contentSize.toLong)
      case _ => throw new RuntimeException(s"""Cannot parse log line: $log""")
    }
  }
}
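As a quick sanity check of the parser, a hypothetical Common Log Format line (illustrative only, not taken from the downloaded sample) can be parsed like this:

// Hypothetical sample line for illustration.
val sample = """127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326"""
val entry = ApacheAccessLog.parseLogLine(sample)
println(entry.ipAddress)    // 127.0.0.1
println(entry.responseCode) // 200
println(entry.contentSize)  // 2326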
- Create a file named LogAnalyzerStreaming.scala in the
$SPARK_HOME/examples/logs-analyzer/src/main/scala folder:
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

object LogAnalyzerStreaming {
  def main(args: Array[String]) {
    val WINDOW_LENGTH = Seconds(30)
    val SLIDE_INTERVAL = Seconds(10)

    val sparkConf = new SparkConf().setAppName("Log Analyzer Streaming in Scala")
    val streamingContext = new StreamingContext(sparkConf, SLIDE_INTERVAL)

    val logLinesDStream: DStream[String] = streamingContext.socketTextStream("localhost", 9999)
    val accessLogsDStream: DStream[ApacheAccessLog] = logLinesDStream.map(ApacheAccessLog.parseLogLine).cache()
    val windowDStream: DStream[ApacheAccessLog] = accessLogsDStream.window(WINDOW_LENGTH, SLIDE_INTERVAL)

    windowDStream.foreachRDD(accessLogs => {
      if (accessLogs.count() == 0) {
        println("No access logs received in this time interval")
      } else {
        // Calculate statistics based on the content size.
        val contentSizes: RDD[Long] = accessLogs.map(_.contentSize).cache()
        println("Content Size Avg: %s, Min: %s, Max: %s".format(
          contentSizes.reduce(_ + _) / contentSizes.count,
          contentSizes.min,
          contentSizes.max
        ))

        // Compute Response Code to Count.
        val responseCodeToCount: Array[(Int, Long)] = accessLogs
          .map(_.responseCode -> 1L)
          .reduceByKey(_ + _)
          .take(100)
        println(s"""Response code counts: ${responseCodeToCount.mkString("[", ",", "]")}""")

        // Any IPAddress that has accessed the server more than 10 times.
        val ipAddresses: Array[String] = accessLogs
          .map(_.ipAddress -> 1L)
          .reduceByKey(_ + _)
          .filter(_._2 > 10)
          .map(_._1)
          .take(100)
        println(s"""IPAddresses > 10 times: ${ipAddresses.mkString("[", ",", "]")}""")

        // Top Endpoints.
        val topEndpoints: Array[(String, Long)] = accessLogs
          .map(_.endpoint -> 1L)
          .reduceByKey(_ + _)
          .top(10)(Ordering.by[(String, Long), Long](_._2))
        println(s"""Top Endpoints: ${topEndpoints.mkString("[", ",", "]")}""")
      }
    })

    // Start the streaming computation.
    streamingContext.start()
    // Wait for the computation to terminate.
    streamingContext.awaitTermination()
  }
}
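A note on the window parameters: both the window length and the slide interval must be multiples of the source DStream's batch interval. Here the batch interval is SLIDE_INTERVAL (10 seconds), so each 30-second window combines the three most recent batches and is recomputed every 10 seconds.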
- Create a file named build.sbt in $SPARK_HOME/examples/logs-analyzer:
name := "log-analyzer"
version := "0.0.1"
scalaVersion := "2.11.12"
// additional libraries
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.4.1" % "provided",
"org.apache.spark" %% "spark-streaming" % "2.4.1"
)
- Create a shell script named "stream.sh" that emulates a network stream by periodically sending portions of the sample log file to a network socket:
#!/bin/sh
set -o nounset
set -o errexit

test $# -eq 1 || ( echo "Incorrect number of arguments" ; exit 1 )

file="$1"
network_port=9999
lines_in_batch=100
interval_sec=10

n_lines=$(cat $file | wc -l)
cursor=1
while test $cursor -le $n_lines
do
  tail -n +$cursor $file | head -$lines_in_batch | nc -l $network_port
  cursor=$(($cursor + $lines_in_batch))
  sleep $interval_sec
done
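Make the script executable before using it:
$ chmod +x stream.sh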
- Build the application (from the $SPARK_HOME/examples/logs-analyzer directory):
$ sbt clean package
- Submit and run in Spark:
$ $SPARK_HOME/bin/spark-submit --class "LogAnalyzerStreaming" target/scala-2.11/log-analyzer_2.11-0.0.1.jar
- Open another terminal and stream the sample log file to port 9999:
$ ./stream.sh log.txt