MODERN DATA ARCHITECTURES
FOR BIG DATA II
INTRODUCING STREAMING DATA
Agenda
● What’s a real-time system?
○ Soft & Near real-time
● Real-time vs Streaming systems
○ Streams Data Systems
● Streams Data Systems terminology
○ Event vs Processing/Stream time
○ Windows of data
○ Message delivery semantics
● Use cases
1.
WHAT’S A
REAL-TIME
SYSTEM?
What’s a real-time system?
● Real-time systems & computing have been around
for decades.
● They have become popular lately causing
ambiguity and debate though.
● Let’s clarify it by understanding this classification:
○ Hard real-time
○ Soft real-time
○ Near real-time
What’s a real-time system?
Stream processing & real-time analytics
Soft & Near real-time
● Latency measured in ms, sec or even min.
● Tolerance for delay between low and high:
○ No system failure, no life at risk
○ What a relief!
● Differentiating soft & near real-time becomes
blurry and, often times, it’s simplified to just
real-time.
Soft & Near real-time
● A generic real-time system with consumers:
Tightly coupled;
hard to make it
scalable!
2.
REAL-TIME
VS
STREAMING
SYSTEMS
Streams Data System
● A non-hard real-time system that makes its data
available at the moment a client application
needs it.
● Clients may not be consuming the data in real time
due to network delays, application design, or
client applications
aren’t even running.
Streams Data System
● A data stream is a continuous flow of data that is
usually modeled as a sequence of elements and,
theoretically, is unbounded/infinite in size.
● Streaming data systems process huge amount of
data, and processing this data might need some
algorithms with approximations or aggregations:
○ Sampling the stream
○ Filtering the stream to keep relevant elements
○ Estimating number of different elements
○ ...
Streams Data System
● Streaming data architectural blueprint we’ll use
as a reference during this course:
Already covered in the
“Introduction to Big Data
Architecture” course
3.
STREAMS
DATA SYSTEMS
TERMINOLOGY
Event vs Processing/Stream time
● More often than not, elements of a data stream
are called events.
● In regards to timing related to events, this is it:
○ Event time.- time at which the event occurs.
○ Processing/Stream time.- time at which an event is
processed/enters the streaming system.
Event vs Processing/Stream time
● Ideally, event time == processing/stream time.
● Processing-time lag - how much delay is
observed between when the events for a
given time occurred and when they were
processed.
● Event-time skew - how far behind the ideal
(in event time) the pipeline is currently.
● They’re just two ways of looking at the same
thing; processing-time lag and event-time
skew at any given point in time are identical.
Windows of data
● Due to the infinite size and never-ending nature
of streams of data, data doesn’t keep in memory.
● Computation/processing is possible by using
windows of data - certain amount of data that can
be processed.
Windows of data
● Attributes common to windowing techniques:
○ Trigger policy - notifies that it’s time to process all
the data that is in the window.
○ Eviction policy - decides if a data element should be
evicted from the window.
● Common windowing strategies based on:
○ Number of events - tumbling windows
○ Time - fixed, sliding and sessions windows
Windows of data
● Tumbling windows:
● Time based windowing strategies:
Message delivery semantics
● Within the context of producers, brokers,
consumers and messages that will be studied later.
Message delivery semantics
● Three common semantic guarantees when looking
at message queuing:
○ At most once - A message may get lost, but there
won’t be duplicates at consumer side.
○ At least once - A message will never get lost, but
there might be duplicates at consumer side.
○ Exactly once - A message is never lost and is read by a
consumer once and only once.
4.
USE CASES
Clickstream/site Analytics
● Stream of visitors (identified with cookies)
● Applications:
○ find hot-links
○ frequent customers
○ probability of click
over new content.
High-frequency trading
● Stream of stock trades
● Applications:
○ Exploit trading opportunities
that may open up for
ms or secs.
○ Taking advantage of
small price imbalances
to generate
sizable profits.
Predictive maintenance
● Stream of measures (whatever can be measured)
● Applications:
○ Condition based maintenance.
○ Unnecessary maintenance
and use of spare parts.
○ Guaranteeing levels of
availability.
Fraud detection
● Stream of actions (whatever can be done)
● Applications:
○ Credit card fraud
○ Stock trading fraud
○ Video game cheaters
○ Cyber Security risks