Intro to batch
processing
STREAMING CONCEPTS
Mike Metzger
Data Engineer
What is batch processing?
Processing data in groups
Runs from start of process to finish
No data added in between
Typically run as result of
an interval
starting event
Processed in a certain size (batch size)
An instance of a batch process is often referred to as a job
STREAMING CONCEPTS
Common batch processing scenarios
Reading files or parts of files (text, mp3, etc)
Sending / receiving email
Printing
STREAMING CONCEPTS
Why batch?
Simple
Generally consistent
Multiple ways to improve performance
STREAMING CONCEPTS
Let's practice!
STREAMING CONCEPTS
Scaling batch
processing
STREAMING CONCEPTS
Mike Metzger
Data Engineer
What is scaling?
Improving performance
Processing more quickly
Less time to process the same amount of data
Processing more data
More data processed in the same amount of time
STREAMING CONCEPTS
Vertical scaling
Better computing
Faster CPU
Faster IO
More memory
Typically the easiest kind of scaling
Least complexity
Rarely requires changing underlying
programs / algorithms
1 Images courtesy https://unsplash.com/@jeremy0
STREAMING CONCEPTS
Vertical scaling cons
Inherently limited
Can be expensive / low ROI
Industry improvements are not guaranteed
STREAMING CONCEPTS
Horizontal scaling
Splitting a task into multiple parts
More computers
Could also be more CPUs
Best done on tasks that are
"embarrassingly parallel"
Tasks that can be easily divided among
workers
Can be very cost effective
Can have near-linear performance
improvements for certain types of
processes
STREAMING CONCEPTS
Horizontal scaling cons
Complexity
Requires a processing framework (like Apache Spark or Dask)
Requires more extensive networking
Ongoing management
Can be expensive depending on requirements
"Non-parallel" tasks
STREAMING CONCEPTS
Let's practice!
STREAMING CONCEPTS
Batch issues
STREAMING CONCEPTS
Mike Metzger
Data Engineer
Delays
Time until data is ready to process
Is all data available?
Time until process begins
When does the next interval start?
Time to process data
How long until completion?
Time until processed data is available for use
How long until users can use the data?
STREAMING CONCEPTS
Example #1
Waiting on the source data
Machines sending log files at times of low utilization
Works ok during normal utilization
High utilization would limit ability to send logs, potentially hiding issues.
STREAMING CONCEPTS
Example #2
Waiting on the process
100GB log files per day
Currently takes 23 hrs to process
Approximately 4.4GB/hr
Grows at 5% per month
Next month would be 105GB and take ~24 hrs
Following month would be ~110GB and take ~25 hrs
Takes longer than a day to process one day's worth of data!
STREAMING CONCEPTS
Example #3
Waiting on the data to be available
How long until analytics are available?
Sales report must wait for all information to generate
Sum of delays is minimum time to generate new report
Amount of time to collect / prepare data: 1 day
Time required to process data: 7 hrs
Time to update systems: 5 hrs
Time to generate report: 2 min
Total time for each report: 1.5 days
STREAMING CONCEPTS
Let's practice!
STREAMING CONCEPTS
Intro to event-based
computing
STREAMING CONCEPTS
Mike Metzger
Data Engineer
Old days
Shared access
Batched jobs
Programs run by operators and results
returned to users
Delays, missing results, etc
STREAMING CONCEPTS
Personal computers
Often single user
But still behaved in a batch manner
Computer would basically run tasks in
order as provided
GUI gave rise to event-based interactivity
STREAMING CONCEPTS
Event-based processing
Doesn't run at a specific time
Tasks run when an event occurs
User clicks a button
A new file is uploaded to a directory
Can still start a batch process
Event-based systems wait for something to
occur
STREAMING CONCEPTS
Example event-based task
Web click-stream monitoring
User activity occurs when clicking on links
/ components of a webpage
The client application determines what
resources are needed and requests these
from a server
The server returns the appropriate info and
often logs the request
These clicks (user events) are often stored
or sent to a central location for storage and
later analysis.
STREAMING CONCEPTS
Let's practice!
STREAMING CONCEPTS
Queuing
STREAMING CONCEPTS
Mike Metzger
Data Engineer
What is queuing?
Basically, a line
Useful for processing in order
First-in, first-out (FIFO)
Sometimes referred to as a buffer
Details vary a lot by implementation
1 Photo by Joshua Tsu on Unsplash
STREAMING CONCEPTS
Why queues?
Queues allow tracking of processing order
Can be processed by a single person /
program or multiple
Can be disconnected from the remainder
of the processing pipeline
Reasonably easy to scale vertically or
horizontally
Vertical scaling by adding faster
hardware
Horizontal scaling by adding more
executors
STREAMING CONCEPTS
Queue issues
Bad data or processing errors
Customer pays with invalid credit card
Data size variances
Supermarket fast lane with 100 items
Sometimes difficult to know the length of
the queue
First preview showing of a movie
Scaling limits
Not enough space for more registers
STREAMING CONCEPTS
Let's practice!
STREAMING CONCEPTS
Single system data
streaming
STREAMING CONCEPTS
Mike Metzger
Data Engineer
Intro to streaming
What is streaming?
Data doesn't stop until processed
Once initially processed, may have other
data processing components
Is open-ended (no specific end event)
Is defined by the flow of data, not the
content
STREAMING CONCEPTS
Logs
Store event information 210507-162356 - SUCCESS: Open vvlj45.txt
Could be a simple text or binary file 210507-162254 - ERROR: Open hjry57.txt failed
210507-161523 - SUCCESS: Open kbhn78.txt
Or a system to export information to 210507-161235 - ERROR: Open ldge12.txt failed
multiple clients (ie, Apache Kafka) 210507-160127 - WARNING: keop98.txt exists
210507-155958 - SUCCESS: Open hqaz64.txt
Will store information until resources are
210507-155439 - SUCCESS: Open neuf36.txt
exhausted / pruned
210507-152335 - SUCCESS: Open mqpa91.txt
Purpose of the log depends on the 210507-144756 - ERROR: Open pqzi32.txt failed
application 210507-143541 - SUCCESS: Open urmn15.txt
210507-143152 - SUCCESS: Open fgty82.txt
210507-141732 - SUCCESS: Open mlwe96.txt
STREAMING CONCEPTS
System event log
Present on Windows, Mac, Linux Components:
Processes and stores various system event
Listener: Accepts messages
information
Parser: Understands how to read messages
Windows EventLog, Mac / Linux syslog
Logic: Decides what to do
Writer: Stores the messages for later
STREAMING CONCEPTS
Let's practice!
STREAMING CONCEPTS
Batching vs.
streaming
STREAMING CONCEPTS
Mike Metzger
Data Engineer
Quick review
Batch processes handle data in groups, or batches
The most important details about batch processing is the batch size, and the batch
frequency
Queues store / process data in order of insertion
Queues are batches, with a batch size of one!
Streams handle data without pausing along the way
Streams don't have a defined end
Streams maintain order!
STREAMING CONCEPTS
Fire!
Bucket brigade Fire hose
Batch size (how large is the bucket) Continuous amount of data
Batch frequency (how fast to pass Not sure how much water
bucket)
1Albert B. Kinne, Public domain, via Wikimedia Commons 2 Commander, U.S. Naval Forces Europe-Africa/U.S. 6th
Fleet, Public domain, via Wikimedia Commons
STREAMING CONCEPTS
How to determine the best approach?
Depends on requirements
If we can process in groups, batching often best due to simplicity
If we need order, but it's okay to pause, use a queue
If we need continuous data, or we don't know how much data, try streaming
If we can't stop until the data is processed, use streaming
STREAMING CONCEPTS
Let's practice!
STREAMING CONCEPTS
Intro to real-time
streaming
STREAMING CONCEPTS
Mike Metzger
Data Engineer
What is 'real-time'?
Definition varies depending on context
Typically defines a response timeframe
The response timeframe is defined as a sort of guarantee
Could be:
1 day
1 hour
1 minute
STREAMING CONCEPTS
Real world example
Post office
Different classes of service
Delivery timeframe varies based on service
class
Only so much capacity for faster service
Costs are proportional to service speed
Service selection is up to the sender based
on options
STREAMING CONCEPTS
Relationship to streaming?
How does real-time relate to streaming data?
Streaming processes are limited by
available resources
How quickly can data be transported?
... processed?
... delivered?
How much does it cost?
STREAMING CONCEPTS
Resources define implementation
Helps define our requirements for
streaming data processes
Speed of transport
Processing latency
Delivery
Data storage
Cost!
STREAMING CONCEPTS
Let's practice!
STREAMING CONCEPTS
Vertically scaling
streaming systems
STREAMING CONCEPTS
Mike Metzger
Data Engineer
Why scale?
Process the same data in less time
Process more data in the same time
Deliver data more quickly (reduce latency)
Meet guarantees (SLAs)
STREAMING CONCEPTS
Vertical scaling
Improve the capabilities of a single system
Faster / better components
CPU, RAM, Disk, Network
All can affect streaming performance
STREAMING CONCEPTS
Faster CPU / GPU performance
Faster execution
Better execution
New / improved instruction sets
GPU processing
Machine learning
Deep learning
Image processing
Matrix operations
STREAMING CONCEPTS
How does this affect streaming?
Streaming processes don't stop until
complete
Different items can be in different parts of
the pipeline, but total processing capacity
is limited by the system performance
Certain components have a greater effect
than others, depending on workload
Benchmark / test!
STREAMING CONCEPTS
Let's practice!
STREAMING CONCEPTS
Horizontally scaling
streaming systems
STREAMING CONCEPTS
Mike Metzger
Data Engineer
Horizontal scaling refresher
Instead of scaling "up", scale "out"
Typically means adding processing
capability by adding more, rather than
faster / better
Works best with embarrassingly parallel
situations
Tasks that can be split easily
E.g. processing a large group of non-
interdependent images
STREAMING CONCEPTS
Horizontal scaling with streaming
Streaming data processing typically has
minimal delays
Can make transfer of data between
workers tricky
Best to process a full stream within a single
pipeline
Create copies of the pipelines
STREAMING CONCEPTS
Pipeline copies
As events occur, they initially enter a
pipeline
All tasks related to that process are self-
contained within the pipeline, until
completion
Scale by adding more pipelines
Can still vertically scale within a pipeline
STREAMING CONCEPTS
Additional considerations
Other components may be required
Load balancer / director
Card dealer
Least busy node
Eventually hit bottlenecks
Disk write performance
Consider shortening streaming pipeline
Remove need to immediately process
data
STREAMING CONCEPTS
Let's practice!
STREAMING CONCEPTS
Streaming
roadblocks
STREAMING CONCEPTS
Mike Metzger
Data Engineer
Scaling review
Vertical scaling - compute resources Horizontal scaling - more nodes
CPU Add machines as nodes / workers
RAM
Disk (capacity and IO)
Network
STREAMING CONCEPTS
Initial concerns
Compute resources
Lack of adequate or slow resources
More nodes
Requires more connectivity
Some form of shared resources
Added complexity
Usually some form of cluster management
STREAMING CONCEPTS
Communication issues
Types of messaging problems:
Missing messages
Delayed messages
Out of order messages
Repeat messages
STREAMING CONCEPTS
Missing messages
Represent events that never appear
Can be difficult to detect
Sometimes handled with a sequence identifier
Requesting the missing messages can delay further responses
STREAMING CONCEPTS
Delayed messages
Similar to missing messages
May cause issues with the processing pipeline due to delays
Often related to system resource issues
STREAMING CONCEPTS
Out of order messages
Combination of missing / delayed messages
Results when an older message appears after newer ones
Requires some measure of sequence or state to detect
Handling these issues depends on the type of data process being run
STREAMING CONCEPTS
Repeat messages
Occurs when the same message is sent multiple times or resent due to systems issues
Requires sequence handling to completely avoid, but might be safe to ignore
Sometimes is not an issue (consider a temperature measurement)
STREAMING CONCEPTS
Let's practice!
STREAMING CONCEPTS
Popular streaming
systems
STREAMING CONCEPTS
Mike Metzger
Data engineer
Streaming tools
Various tools available depending on needs
Allows designers to specify the best tool for
the job
Common systems include:
Celery
Kafka
Spark Streaming
STREAMING CONCEPTS
Celery
Distributed task queue / FIFO
Used primarily as a job or task queue
Often used for asynchronous tasks
Sending password reset emails
Fulfilling digital orders
Resizing images
Allows for real-time processing of
significant quantity of messages
Provides functionality for management and
scaling (vertical & horizontal)
STREAMING CONCEPTS
Apache Kafka
Distributed event streaming system Different consumers can handle events as
Designed to send events between needed (logging, transformations, relaying,
etc)
producers and consumers
Producers create events on a topic Handles storing of events as specified
Topics are basically messages of a Extremely powerful, but can be tricky to
specified form set up
Consumers receive new events
STREAMING CONCEPTS
Kafka applications
Best used for passing data between
multiple systems
Single source of truth
Change data capture
Data backups
Data system migrations
STREAMING CONCEPTS
Spark streaming
Part of Apache Spark
Designed to process streaming data
Builds upon the capabilities of Spark to
process data in Scala, Python, SQL, and so
forth
Useful for processing large amounts of
data and in machine learning scenarios
Able to transition from batch to stream
processing fairly easily
Not designed to store or log events, but
primarily to process or modify the data
STREAMING CONCEPTS
Let's practice!
STREAMING CONCEPTS
Real-world use case:
streaming music
service
STREAMING CONCEPTS
Mike Metzger
Data Engineer
Streaming music
Consider the scenario
Not focusing on the actual music being
streamed
More interested on the user(s)
Interactions
Music preferences
Other details
STREAMING CONCEPTS
Interactions
Primary questions Like / Don't play
Next / Previous / Skip
What?
Select Channel / Playlist
When?
Where? Add / remove song from playlist
STREAMING CONCEPTS
How to store data
Data is archived as a log
Sent as interactions occur
Number of interactions will vary
considerably between users
Logged data can be analyzed later on
STREAMING CONCEPTS
Analytics
What about preferences?
Can be obtained from logged data
Favorite genres, bands, etc
Most popular times of day
Other details?
Most popular app platform / version
Location data from stream
STREAMING CONCEPTS
Let's practice!
STREAMING CONCEPTS
Real-world use case:
sensor data
STREAMING CONCEPTS
Mike Metzger
Data Engineer
What is sensor data?
Automated devices that monitor some
aspect of interest
Temperature monitors
Electricity usage monitors
Vehicle presence detection
Many others
Tend to communicate with centralized
services for management and data
reporting
Can range from a few sensors to millions or
even billions
STREAMING CONCEPTS
Connected doorbell
Monitors primarily for doorbell presses
Contains extra video / audio capabilities
Allows live, remote interaction
Can use camera / environmental sensors
for additional detection capabilities
1 C05731, CC BY-SA 4.0 [removed], via Wikimedia Commons
STREAMING CONCEPTS
What are we monitoring?
Button presses
Movement detection
Sound detection
Requires more intensive interaction than
just logging of events
STREAMING CONCEPTS
Data handling
How to store data
General event data (button press)
Sensor-based events (light sensor or
audio pickup)
Raw data for later analysis
Different services can have different SLAs,
even in same product
STREAMING CONCEPTS
Let's practice!
STREAMING CONCEPTS
Real-world use case:
vaccination clinic
STREAMING CONCEPTS
Mike Metzger
Data Engineer
Data processing review
Batch
Great for large sets of data
Potentially poor latency
Queue
Awesome to maintain order
Can be tricky to manage
Stream
Fantastic for latency / unknown data
characteristics
Scaling considerations
STREAMING CONCEPTS
Complex systems
Not all processes fit within a single
processing type
Many real-world scenarios may require
multiple components to build the best
processing model
Concepts can be applied to various
components as required
STREAMING CONCEPTS
Vaccination clinic
Multiple, simultaneous moving pieces
Vary based on locale and requirements
Consider a large self-contained clinic
Concepts apply to smaller pharmacies /
doctor's offices as well
STREAMING CONCEPTS
Vaccination clinic areas
Arrival / entrance Monitoring
Entry and temperature check, with a Patients checked for any post-
single line application reactions, many seats
Registration Departure
Check-in & validation on info, multiple Exit from clinic
registrars
Vaccine administration
Actual application of vaccine, multiple
stations
STREAMING CONCEPTS
Let's practice!
STREAMING CONCEPTS
Congratulations!
STREAMING CONCEPTS
Mike Metzger
Data Engineer
Next steps
Learn more about specific streaming platforms
Apache Kafka (Apache Kafka) or (Confluent)
Apache Spark (Apache Spark)
Apply current data implementations to stream processes
Work with data consumers to better determine best processing options for a given situation
STREAMING CONCEPTS
Thank you!
STREAMING CONCEPTS