Intro to batch processing
STREAMING CONCEPTS

Mike Metzger
Data Engineer
What is batch processing?
Processing data in groups
Runs from start of process to finish
No data added in between

Typically run as the result of an interval or a starting event

Processed in a certain size (batch size)

An instance of a batch process is often referred to as a job
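
As a rough illustration of the points above, the sketch below groups records into fixed-size batches and runs one job over them from start to finish. The names `batches` and `run_job`, and the choice to sum each batch, are assumptions made for this example only.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive groups of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller, batch
        yield batch


def run_job(records: Iterable[int], batch_size: int) -> List[int]:
    """One job: process every batch from start to finish, returning per-batch sums."""
    return [sum(batch) for batch in batches(records, batch_size)]


print(run_job(range(1, 8), batch_size=3))  # [6, 15, 7]
```

The last batch is smaller than the batch size because no data is added once the job has started.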

Common batch processing scenarios
Reading files or parts of files (text, mp3, etc)

Sending / receiving email

Printing

Why batch?
Simple
Generally consistent

Multiple ways to improve performance

Scaling batch processing

What is scaling?
Improving performance
Processing more quickly
Less time to process the same amount of data
Processing more data
More data processed in the same amount of time

Vertical scaling
Better computing
Faster CPU

Faster IO

More memory

Typically the easiest kind of scaling


Least complexity

Rarely requires changing underlying programs / algorithms

Vertical scaling cons
Inherently limited

Can be expensive / low ROI

Industry improvements are not guaranteed

Horizontal scaling
Splitting a task into multiple parts
More computers

Could also be more CPUs

Best done on tasks that are "embarrassingly parallel"
Tasks that can be easily divided among workers

Can be very cost effective

Can have near-linear performance improvements for certain types of processes
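
A minimal sketch of an embarrassingly parallel task follows. Threads from Python's standard `concurrent.futures` module stand in for the extra computers or CPUs, which is an assumption made to keep the example self-contained. `word_count` and `count_words_parallel` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def word_count(line: str) -> int:
    """Work on one item; it needs nothing from any other item."""
    return len(line.split())


def count_words_parallel(lines: List[str], workers: int) -> int:
    """Divide the lines among workers, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(word_count, lines))


lines = ["a b c", "d e", "f"]
print(count_words_parallel(lines, workers=2))  # 6
```

Because each line is counted independently, adding workers changes only how the work is divided, not the result.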

Horizontal scaling cons
Complexity
Requires a processing framework (like Apache Spark or Dask)

Requires more extensive networking

Ongoing management

Can be expensive depending on requirements

"Non-parallel" tasks

Batch issues

Delays
Time until data is ready to process
Is all data available?
Time until process begins
When does the next interval start?
Time to process data
How long until completion?
Time until processed data is available for use
How long until users can use the data?

Example #1
Waiting on the source data

Machines sending log files at times of low utilization

Works ok during normal utilization


High utilization would limit ability to send logs, potentially hiding issues.

Example #2
Waiting on the process

100GB log files per day

Currently takes 23 hrs to process


Approximately 4.3GB/hr

Grows at 5% per month

Next month would be 105GB and take ~24 hrs

Following month would be ~110GB and take ~25 hrs

Takes longer than a day to process one day's worth of data!
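
The arithmetic in this example can be checked with a short sketch. It assumes throughput stays fixed at 100 GB per 23 hours while the daily volume grows 5% each month. The function names are illustrative.

```python
RATE_GB_PER_HR = 100 / 23   # current throughput, roughly 4.35 GB/hr
MONTHLY_GROWTH = 0.05       # 5% more data each month


def hours_to_process(daily_gb: float) -> float:
    """Hours needed for one day's logs at the fixed throughput."""
    return daily_gb / RATE_GB_PER_HR


def projection(months: int, start_gb: float = 100.0):
    """Return (month, daily GB, processing hours) for month 0 through `months`."""
    rows = []
    daily_gb = start_gb
    for month in range(months + 1):
        rows.append((month, round(daily_gb, 2), round(hours_to_process(daily_gb), 2)))
        daily_gb *= 1 + MONTHLY_GROWTH
    return rows


for month, gb, hours in projection(2):
    print(f"month {month}: {gb} GB/day takes about {hours} hrs")
```

Once the processing hours exceed 24, each day's job finishes after the next day's data has already arrived, so the backlog only grows.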

Example #3
Waiting on the data to be available

How long until analytics are available?

Sales report must wait for all information to generate

Sum of delays is minimum time to generate new report


Amount of time to collect / prepare data: 1 day

Time required to process data: 7 hrs

Time to update systems: 5 hrs

Time to generate report: 2 min

Total time for each report: 1.5 days
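
The total can be reproduced by summing the individual delays. The sketch below uses Python's standard `timedelta`. The dictionary keys are just labels for the four delays listed above.

```python
from datetime import timedelta

delays = {
    "collect / prepare data": timedelta(days=1),
    "process data": timedelta(hours=7),
    "update systems": timedelta(hours=5),
    "generate report": timedelta(minutes=2),
}

# The delays happen one after another, so the minimum report time is their sum.
total = sum(delays.values(), timedelta())
print(total)                                 # 1 day, 12:02:00
print(round(total / timedelta(days=1), 2))   # 1.5
```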

Intro to event-based computing

Old days
Shared access
Batched jobs

Programs run by operators and results returned to users

Delays, missing results, etc

Personal computers
Often single user

But still behaved in a batch manner

Computer would basically run tasks in order as provided

GUI gave rise to event-based interactivity

Event-based processing
Doesn't run at a specific time
Tasks run when an event occurs
User clicks a button

A new file is uploaded to a directory

Can still start a batch process

Event-based systems wait for something to occur
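
One way to picture this is a tiny in-process dispatcher, sketched below. `EventBus`, its methods, and the event names are hypothetical and not taken from any particular framework.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class EventBus:
    """Tiny in-process dispatcher: handlers run only when their event occurs."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def on(self, event_name: str, handler: Callable[[dict], None]) -> None:
        """Register a handler to run whenever event_name is emitted."""
        self._handlers[event_name].append(handler)

    def emit(self, event_name: str, payload: dict) -> int:
        """Run every handler registered for this event; return how many ran."""
        handlers = self._handlers.get(event_name, [])
        for handler in handlers:
            handler(payload)
        return len(handlers)


processed: List[str] = []
bus = EventBus()
# An event can still kick off a batch-style task, such as processing an uploaded file.
bus.on("file_uploaded", lambda event: processed.append(event["path"]))

bus.emit("file_uploaded", {"path": "incoming/report.csv"})
bus.emit("button_clicked", {"id": "ok"})  # no handler registered, so nothing runs
print(processed)  # ['incoming/report.csv']
```

Nothing runs on a schedule here. Work happens only in response to `emit`.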

Example event-based task
Web click-stream monitoring

User activity occurs when clicking on links / components of a webpage

The client application determines what resources are needed and requests these from a server

The server returns the appropriate info and often logs the request

These clicks (user events) are often stored or sent to a central location for storage and later analysis.

Queuing

What is queuing?
Basically, a line
Useful for processing in order

First-in, first-out (FIFO)

Sometimes referred to as a buffer

Details vary a lot by implementation

Why queues?
Queues allow tracking of processing order

Can be processed by a single person / program or multiple

Can be disconnected from the remainder of the processing pipeline

Reasonably easy to scale vertically or horizontally
Vertical scaling by adding faster hardware
Horizontal scaling by adding more executors
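
A small sketch of a FIFO queue served by more than one executor is shown below. It uses `collections.deque` from the standard library. The customer and register names are made up, and strict turn-taking between registers is a simplifying assumption.

```python
from collections import deque

queue = deque()  # acts as the line / buffer

# Producers add to the back of the line.
for customer in ["ann", "bob", "cy", "dee"]:
    queue.append(customer)

# Two "registers" (executors) take turns pulling from the front.
served = {"register-1": [], "register-2": []}
registers = list(served)
turn = 0
while queue:
    customer = queue.popleft()  # first in, first out
    served[registers[turn % len(registers)]].append(customer)
    turn += 1

print(served)  # {'register-1': ['ann', 'cy'], 'register-2': ['bob', 'dee']}
```

Adding a third register would be horizontal scaling. Making a single register faster would be vertical scaling.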

Queue issues
Bad data or processing errors
Customer pays with invalid credit card
Data size variances
Supermarket fast lane with 100 items
Sometimes difficult to know the length of the queue
First preview showing of a movie
Scaling limits
Not enough space for more registers

Single system data streaming

Intro to streaming
What is streaming?

Data doesn't stop until processed


Once initially processed, may have other data processing components

Is open-ended (no specific end event)

Is defined by the flow of data, not the content

Logs
Store event information

Could be a simple text or binary file

Or a system to export information to multiple clients (e.g., Apache Kafka)

Will store information until resources are exhausted / pruned

Purpose of the log depends on the application

Example log:
210507-162356 - SUCCESS: Open vvlj45.txt
210507-162254 - ERROR: Open hjry57.txt failed
210507-161523 - SUCCESS: Open kbhn78.txt
210507-161235 - ERROR: Open ldge12.txt failed
210507-160127 - WARNING: keop98.txt exists
210507-155958 - SUCCESS: Open hqaz64.txt
210507-155439 - SUCCESS: Open neuf36.txt
210507-152335 - SUCCESS: Open mqpa91.txt
210507-144756 - ERROR: Open pqzi32.txt failed
210507-143541 - SUCCESS: Open urmn15.txt
210507-143152 - SUCCESS: Open fgty82.txt
210507-141732 - SUCCESS: Open mlwe96.txt

System event log
Present on Windows, Mac, Linux
Processes and stores various system event information
Windows EventLog, Mac / Linux syslog

Components:
Listener: Accepts messages
Parser: Understands how to read messages
Logic: Decides what to do
Writer: Stores the messages for later
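
The four components can be sketched as a small pipeline. This is an illustration only, not how EventLog or syslog are implemented. It assumes the `YYMMDD-HHMMSS - LEVEL: message` layout seen in the sample log lines, and the rule of keeping only warnings and errors is invented for the example.

```python
from datetime import datetime
from typing import Dict, Iterable, List, Optional


def parse(raw: str) -> Optional[Dict[str, object]]:
    """Parser: split 'YYMMDD-HHMMSS - LEVEL: message' into fields."""
    try:
        stamp, rest = raw.split(" - ", 1)
        level, message = rest.split(": ", 1)
        when = datetime.strptime(stamp, "%y%m%d-%H%M%S")
    except ValueError:
        return None  # not in the expected format
    return {"when": when, "level": level, "message": message}


def should_store(event: Dict[str, object]) -> bool:
    """Logic: an assumed rule that keeps only warnings and errors."""
    return event["level"] in {"WARNING", "ERROR"}


class Writer:
    """Writer: keeps messages in memory; a real one would persist them."""

    def __init__(self) -> None:
        self.stored: List[Dict[str, object]] = []

    def write(self, event: Dict[str, object]) -> None:
        self.stored.append(event)


def listen(lines: Iterable[str], writer: Writer) -> None:
    """Listener: accept each incoming message and pass it down the pipeline."""
    for raw in lines:
        event = parse(raw.strip())
        if event is not None and should_store(event):
            writer.write(event)


writer = Writer()
listen(
    [
        "210507-162356 - SUCCESS: Open vvlj45.txt",
        "210507-162254 - ERROR: Open hjry57.txt failed",
        "210507-160127 - WARNING: keop98.txt exists",
        "not a log line",
    ],
    writer,
)
print([event["level"] for event in writer.stored])  # ['ERROR', 'WARNING']
```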

Batching vs. streaming

Quick review
Batch processes handle data in groups, or batches
The most important details about batch processing are the batch size and the batch frequency

Queues store / process data in order of insertion

Queues are batches, with a batch size of one!

Streams handle data without pausing along the way

Streams don't have a defined end

Streams maintain order!

Fire!
Bucket brigade
Batch size (how large is the bucket)
Batch frequency (how fast to pass bucket)

Fire hose
Continuous amount of data
Not sure how much water

How to determine the best approach?
Depends on requirements
If we can process in groups, batching often best due to simplicity

If we need order, but it's okay to pause, use a queue

If we need continuous data, or we don't know how much data, try streaming

If we can't stop until the data is processed, use streaming
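
These rules of thumb can be written as a small decision helper. The function name, its parameters, and especially the order in which the rules are checked are assumptions made for illustration. Real requirements rarely reduce to three flags.

```python
def suggest_approach(can_pause: bool, known_volume: bool, needs_order: bool) -> str:
    """Map the rules of thumb to a suggestion (the precedence is an assumption)."""
    if not can_pause or not known_volume:
        return "streaming"  # continuous or open-ended data
    if needs_order:
        return "queue"      # order matters and pausing is acceptable
    return "batch"          # data can be grouped, so take the simplest option


print(suggest_approach(can_pause=True, known_volume=True, needs_order=False))   # batch
print(suggest_approach(can_pause=True, known_volume=True, needs_order=True))    # queue
print(suggest_approach(can_pause=False, known_volume=True, needs_order=True))   # streaming
```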

Intro to real-time streaming

What is 'real-time'?
Definition varies depending on context
Typically defines a response timeframe

The response timeframe is defined as a sort of guarantee

Could be:
1 day

1 hour

1 minute

Real world example
Post office

Different classes of service

Delivery timeframe varies based on service class

Only so much capacity for faster service

Costs are proportional to service speed

Service selection is up to the sender based on options

Relationship to streaming?
How does real-time relate to streaming data?

Streaming processes are limited by available resources
How quickly can data be transported?

... processed?

... delivered?

How much does it cost?

Resources define implementation
Helps define our requirements for streaming data processes

Speed of transport

Processing latency

Delivery

Data storage

Cost!

Vertically scaling streaming systems

Why scale?
Process the same data in less time
Process more data in the same time

Deliver data more quickly (reduce latency)

Meet guarantees (SLAs)

Vertical scaling
Improve the capabilities of a single system

Faster / better components


CPU, RAM, Disk, Network

All can affect streaming performance

Faster CPU / GPU performance
Faster execution

Better execution
New / improved instruction sets

GPU processing
Machine learning

Deep learning

Image processing

Matrix operations

How does this affect streaming?
Streaming processes don't stop until complete

Different items can be in different parts of the pipeline, but total processing capacity is limited by the system performance

Certain components have a greater effect than others, depending on workload

Benchmark / test!

Horizontally scaling streaming systems

Horizontal scaling refresher
Instead of scaling "up", scale "out"
Typically means adding processing capability by adding more, rather than faster / better

Works best with embarrassingly parallel situations
Tasks that can be split easily

E.g. processing a large group of non-interdependent images

Horizontal scaling with streaming
Streaming data processing typically has minimal delays

Can make transfer of data between workers tricky

Best to process a full stream within a single pipeline

Create copies of the pipelines

Pipeline copies
As events occur, they initially enter a pipeline

All tasks related to that process are self-contained within the pipeline, until completion

Scale by adding more pipelines

Can still vertically scale within a pipeline

Additional considerations
Other components may be required
Load balancer / director
Card dealer

Least busy node

Eventually hit bottlenecks
Disk write performance

Consider shortening streaming pipeline
Remove need to immediately process data
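
The two load balancer / director strategies mentioned above can be sketched as follows. "Card dealer" is read here as simple round-robin assignment, which is an interpretation. The pipeline names and load figures are made up.

```python
from itertools import cycle
from typing import Dict, Iterator, List


def card_dealer(pipelines: List[str]) -> Iterator[str]:
    """Deal events to pipelines in turn, like dealing cards (round robin)."""
    return cycle(pipelines)


def least_busy(load: Dict[str, int]) -> str:
    """Pick the pipeline with the fewest events currently in progress."""
    return min(load, key=load.get)


pipelines = ["pipeline-a", "pipeline-b", "pipeline-c"]

dealer = card_dealer(pipelines)
print([next(dealer) for _ in range(5)])
# ['pipeline-a', 'pipeline-b', 'pipeline-c', 'pipeline-a', 'pipeline-b']

load = {"pipeline-a": 4, "pipeline-b": 1, "pipeline-c": 3}
target = least_busy(load)
load[target] += 1  # the chosen pipeline takes on the new event
print(target, load[target])  # pipeline-b 2
```

Round robin needs no knowledge of the pipelines' state. Least busy requires the director to track load, which adds complexity of its own.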

Streaming roadblocks

Scaling review
Vertical scaling - compute resources
CPU
RAM
Disk (capacity and IO)
Network

Horizontal scaling - more nodes
Add machines as nodes / workers

Initial concerns
Compute resources
Inadequate or slow resources

More nodes
Requires more connectivity

Some form of shared resources

Added complexity

Usually some form of cluster management

Communication issues
Types of messaging problems:

Missing messages

Delayed messages
Out of order messages

Repeat messages

Missing messages
Represent events that never appear
Can be difficult to detect

Sometimes handled with a sequence identifier

Requesting the missing messages can delay further responses

Delayed messages
Similar to missing messages
May cause issues with the processing pipeline due to delays

Often related to system resource issues

Out of order messages
Combination of missing / delayed messages
Results when an older message appears after newer ones

Requires some measure of sequence or state to detect

Handling these issues depends on the type of data process being run

Repeat messages
Occur when the same message is sent multiple times or resent due to system issues
Require sequence handling to completely avoid, but might be safe to ignore

Sometimes not an issue (consider a temperature measurement)
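
The sequence-identifier idea that runs through the four problem types can be sketched with a small tracker. `SequenceTracker` is a hypothetical class. It assumes each message carries an integer sequence number starting at 1, and it says nothing about how a real system would recover.

```python
from typing import List


class SequenceTracker:
    """Classify messages by sequence number; assumes numbering starts at 1."""

    def __init__(self) -> None:
        self.seen = set()
        self.highest = 0

    def classify(self, seq: int) -> str:
        if seq in self.seen:
            return "repeat"
        self.seen.add(seq)
        if seq < self.highest:
            return "out of order"  # an older message arrived after newer ones
        self.highest = seq
        return "ok"

    def missing(self) -> List[int]:
        """Sequence numbers below the highest seen that have not arrived yet."""
        return [n for n in range(1, self.highest) if n not in self.seen]


tracker = SequenceTracker()
results = [(seq, tracker.classify(seq)) for seq in [1, 2, 4, 4, 3, 7]]
print(results)
# [(1, 'ok'), (2, 'ok'), (4, 'ok'), (4, 'repeat'), (3, 'out of order'), (7, 'ok')]
print(tracker.missing())  # [5, 6]
```

Messages 5 and 6 may be missing or only delayed. The tracker cannot tell which until they arrive or a timeout is applied.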

Popular streaming systems

Streaming tools
Various tools available depending on needs
Allows designers to specify the best tool for the job

Common systems include:


Celery

Kafka

Spark Streaming

Celery
Distributed task queue / FIFO

Used primarily as a job or task queue

Often used for asynchronous tasks


Sending password reset emails

Fulfilling digital orders

Resizing images

Allows for real-time processing of a significant quantity of messages

Provides functionality for management and scaling (vertical & horizontal)

Apache Kafka
Distributed event streaming system
Designed to send events between producers and consumers
Producers create events on a topic

Topics are basically named categories of messages of a specified form

Consumers receive new events

Different consumers can handle events as needed (logging, transformations, relaying, etc)

Handles storing of events as specified

Extremely powerful, but can be tricky to set up
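
To make the producer / topic / consumer relationship concrete, here is a toy in-memory model. It is not the Kafka client API. `ToyBroker` and its methods are invented for illustration, and the per-consumer position only loosely mirrors the idea of consumers tracking where they are in a stored stream of events.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Tuple


class ToyBroker:
    """In-memory stand-in for an event broker; NOT the Kafka client API."""

    def __init__(self) -> None:
        self._topics: DefaultDict[str, List[dict]] = defaultdict(list)
        self._positions: Dict[Tuple[str, str], int] = {}

    def produce(self, topic: str, event: dict) -> None:
        """Producer side: append an event to the topic's stored log."""
        self._topics[topic].append(event)

    def consume(self, topic: str, consumer: str) -> List[dict]:
        """Consumer side: return events this consumer has not yet received."""
        start = self._positions.get((topic, consumer), 0)
        events = self._topics[topic][start:]
        self._positions[(topic, consumer)] = start + len(events)
        return events


broker = ToyBroker()
broker.produce("clicks", {"page": "/home"})
broker.produce("clicks", {"page": "/pricing"})

# Different consumers each receive the same events and handle them as needed.
print(len(broker.consume("clicks", "logger")))     # 2
print(len(broker.consume("clicks", "analytics")))  # 2

broker.produce("clicks", {"page": "/docs"})
print(broker.consume("clicks", "logger"))  # [{'page': '/docs'}]
```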

Kafka applications
Best used for passing data between multiple systems
Single source of truth

Change data capture

Data backups

Data system migrations

Spark streaming
Part of Apache Spark
Designed to process streaming data

Builds upon the capabilities of Spark to process data in Scala, Python, SQL, and so forth

Useful for processing large amounts of data and in machine learning scenarios

Able to transition from batch to stream processing fairly easily

Not designed to store or log events, but primarily to process or modify the data

Real-world use case: streaming music service

Streaming music
Consider the scenario

Not focusing on the actual music being streamed

More interested in the user(s)
Interactions
Music preferences
Other details

Interactions
Primary questions
What?
When?
Where?

Like / Don't play
Next / Previous / Skip
Select Channel / Playlist
Add / remove song from playlist

How to store data
Data is archived as a log
Sent as interactions occur

Number of interactions will vary considerably between users

Logged data can be analyzed later on

Analytics
What about preferences?
Can be obtained from logged data

Favorite genres, bands, etc

Most popular times of day

Other details?
Most popular app platform / version

Location data from stream
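
As a sketch of how such preferences might be pulled from logged interactions, the example below counts liked genres, active hours, and platforms with `collections.Counter`. The log records and their field names are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Hypothetical logged interactions; the field names are assumptions for illustration.
interaction_log = [
    {"user": "u1", "action": "like", "genre": "jazz", "at": "2021-05-07T08:15:00", "platform": "ios"},
    {"user": "u1", "action": "skip", "genre": "pop", "at": "2021-05-07T08:17:00", "platform": "ios"},
    {"user": "u2", "action": "like", "genre": "jazz", "at": "2021-05-07T20:05:00", "platform": "web"},
    {"user": "u2", "action": "like", "genre": "rock", "at": "2021-05-07T20:30:00", "platform": "web"},
]

# Favorite genres: count only the interactions that signal a preference.
liked_genres = Counter(e["genre"] for e in interaction_log if e["action"] == "like")
# Most popular times of day: bucket every interaction by hour.
busy_hours = Counter(datetime.fromisoformat(e["at"]).hour for e in interaction_log)
# Most popular app platform.
platforms = Counter(e["platform"] for e in interaction_log)

print(liked_genres.most_common(1))  # [('jazz', 2)]
print(busy_hours.most_common(2))    # [(8, 2), (20, 2)]
print(dict(platforms))              # {'ios': 2, 'web': 2}
```

Because the data is archived as a log, these questions can be asked later without changing how the interactions are sent.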

Real-world use case: sensor data

What is sensor data?
Automated devices that monitor some aspect of interest
Temperature monitors

Electricity usage monitors

Vehicle presence detection

Many others

Tend to communicate with centralized services for management and data reporting

Can range from a few sensors to millions or even billions

Connected doorbell
Monitors primarily for doorbell presses
Contains extra video / audio capabilities

Allows live, remote interaction

Can use camera / environmental sensors for additional detection capabilities

What are we monitoring?
Button presses
Movement detection

Sound detection

Requires more intensive interaction than just logging of events

Data handling
How to store data
General event data (button press)

Sensor-based events (light sensor or audio pickup)

Raw data for later analysis

Different services can have different SLAs, even in the same product

Real-world use case: vaccination clinic

Data processing review
Batch
Great for large sets of data

Potentially poor latency

Queue
Awesome to maintain order

Can be tricky to manage

Stream
Fantastic for latency / unknown data characteristics

Scaling considerations

Complex systems
Not all processes fit within a single processing type

Many real-world scenarios may require multiple components to build the best processing model

Concepts can be applied to various components as required

Vaccination clinic
Multiple, simultaneous moving pieces
Vary based on locale and requirements

Consider a large self-contained clinic

Concepts apply to smaller pharmacies / doctor's offices as well

Vaccination clinic areas
Arrival / entrance
Entry and temperature check, with a single line

Registration
Check-in & validation of info, multiple registrars

Vaccine administration
Actual application of vaccine, multiple stations

Monitoring
Patients checked for any post-application reactions, many seats

Departure
Exit from clinic

Congratulations!

Next steps
Learn more about specific streaming platforms
Apache Kafka (from the Apache Kafka project or Confluent)

Apache Spark

Apply current data implementations to stream processes

Work with data consumers to better determine best processing options for a given situation

Thank you!