CLICKSTREAM ANALYSIS
WITH APACHE SPARK
Andreas Zitzelsberger
THE CHALLENGE
ONE POT TO RULE THEM ALL
Data sources: Web Tracking, Ad Tracking, ERP, CRM

ERP
▪ Products
▪ Inventory
▪ Margins

CRM
▪ Customer
▪ Orders
▪ Creditworthiness

Ad Tracking
▪ Ad Impressions
▪ Ad Costs

Web Tracking
▪ Clicks & Views
▪ Conversions
ONE POT TO RULE THEM ALL
Retention, Reach, Monetization
Steer …
▪ Campaigns
▪ Offers
▪ Contents
REACT TO WEBSITE
TRAFFIC IN REAL TIME
Image: https://www.flickr.com/photos/nick-m/3663923048
SAMPLE RESULTS
Geolocated and gender-specific conversions.
Frequency of visits
Performance of an ad campaign
THE CONCEPTS
Image: Randy Paulino
THE FIRST SKETCH
(= real-time)
SQL
CALCULATING USER JOURNEYS
Event stream (web / ad tracking) → user journeys:
[Diagram: raw events are grouped into one journey per user, e.g. "C V VT VT VT C X".]
KPIs:
▪ Unique users
▪ Conversions
▪ Ad costs / conversion value
▪ …
Legend: C = Click, V = View, VT = View Time, X = Conversion
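A minimal plain-Java sketch (without Spark) of the same idea: group the event stream into one journey per user, then derive KPIs such as unique users and conversions. The Event record and the KPI selection are illustrative assumptions, not the deck's actual model.

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical minimal event model: a user id and an event type
// (C = click, V = view, VT = view time, X = conversion).
record Event(long userId, String type) {}

public class UserJourneys {

    // Group the raw event stream into one journey (event list) per user.
    static Map<Long, List<Event>> sessionize(List<Event> stream) {
        return stream.stream()
                .collect(Collectors.groupingBy(Event::userId));
    }

    public static void main(String[] args) {
        List<Event> stream = List.of(
                new Event(1, "C"), new Event(1, "V"), new Event(1, "X"),
                new Event(2, "V"), new Event(2, "VT"),
                new Event(3, "C"), new Event(3, "X"));

        Map<Long, List<Event>> journeys = sessionize(stream);
        long uniqueUsers = journeys.size();
        long conversions = stream.stream()
                .filter(e -> e.type().equals("X")).count();

        System.out.println(uniqueUsers + " unique users, "
                + conversions + " conversions");
    }
}
```

The Spark version later in the deck does the same grouping, only distributed across partitions.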
THE ARCHITECTURE
Big Data
„LARRY & FRIENDS“ ARCHITECTURE
Does not perform well beyond 1 TB of data in terms of
ingestion speed, query time
and optimization effort
Image: adweek.com
Nope.
Sorry, no Big Data.
„HADOOP & FRIENDS“ ARCHITECTURE
Aggregation
takes too long
Cumbersome
programming model
(can be solved with
pig, cascading et al.)
Not
interactive
enough
Nope. Too sluggish.
Κ-ARCHITECTURE
Cumbersome
programming model
Over-engineered: We only need
15min real-time ;-)
Stateful aggregations (unique x,
conversions) require a separate DB
with high throughput and fast
aggregations & lookups.
Λ-ARCHITECTURE
Cumbersome
programming model
Complex
architecture
Redundant
logic
FEELS OVER-ENGINEERED…
http://www.brainlazy.com/article/random-nonsense/over-engineered
The Final Architecture*
*) Maybe called µ-architecture one day ;-)
FUNCTIONAL ARCHITECTURE
[Diagram: Collection → Ingestion (raw event stream) → Processing (atomic event frames) → Analytics Warehouse (fact entries); all events, including strange ones, land in the Data Lake; master data integration feeds the warehouse.]
▪ Buffers load peaks
▪ Ensures message delivery (fire & forget for the client)
▪ Creates user journeys and unique user sets
▪ Enriches dimensions
▪ Aggregates events to KPIs
▪ Ability to replay for schema evolution
▪ The representation of truth
▪ Multidimensional data model
▪ Interactive queries for real-time actions and data exploration
▪ Eternal memory for all events (even strange ones)
▪ One schema per event type, time-partitioned
▪ Fault-tolerant message handling
▪ Event handling: apply schema, time-partitioning, de-duplication, sanity checks, pre-aggregation, filtering, fraud detection
▪ Tolerates delayed events
▪ High throughput, moderate latency (~1 min)
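One of the event-handling steps above, de-duplication, can be sketched in a few lines. The event-id key is an assumption; a real pipeline would scope the seen-set per time partition and bound its size.

```java
import java.util.*;

// Sketch of the de-dup step: drop events whose id has already been seen.
// HashSet.add returns false for duplicates, which doubles as the filter.
public class Dedup {
    private final Set<String> seen = new HashSet<>();

    // Returns true if the event is new and should be forwarded downstream.
    public boolean accept(String eventId) {
        return seen.add(eventId);
    }

    public static void main(String[] args) {
        Dedup d = new Dedup();
        System.out.println(d.accept("e1")); // first delivery: forwarded
        System.out.println(d.accept("e1")); // redelivery: dropped
    }
}
```

With at-least-once messaging (fire & forget on the client side), redeliveries are expected, so this filter belongs before any aggregation.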
SERIAL CONNECTION OF STREAMING AND BATCHING
[Diagram: Collection → Ingestion (raw event stream) → Data Lake → Processing (atomic event frames) → Analytics Warehouse (fact entries), with an SQL interface on top.]
▪ Cool programming model
▪ Uniform dev & ops
▪ Simple solution
▪ High compression ratio due to column-oriented storage
▪ High scan speed
▪ Cool programming model
▪ Uniform dev & ops
▪ High performance
▪ Interface to R out of the box
▪ Useful libs: MLlib, GraphX, NLP, …
▪ Good connectivity (JDBC, ODBC, …)
▪ Interactive queries
▪ Uniform ops
▪ Can easily be replaced thanks to the Hive Metastore
▪ Obvious choice for cloud-scale messaging
▪ By far the best throughput and scalability of all evaluated alternatives
public Map<Long, UserJourney> sessionize(JavaRDD<AtomicEvent> events) {
    return events
        // Convert to a pair RDD with the userId as key
        .mapToPair(e -> new Tuple2<>(e.getUserId(), e))
        // Build user journeys
        .<UserJourneyAcc>combineByKey(
            UserJourneyAcc::create,
            UserJourneyAcc::add,
            UserJourneyAcc::combine)
        // Convert to a Java map
        .collectAsMap();
}
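The UserJourneyAcc referenced above is not shown in the deck. A plain-Java sketch of the three functions combineByKey requires might look like this; the AtomicEvent shape is an assumption for illustration.

```java
import java.util.*;

// Hypothetical minimal stand-in for the deck's AtomicEvent.
record AtomicEvent(long userId, String type) {}

// combineByKey's contract: create a combiner from the first value,
// fold further values into it, and merge two partial combiners
// coming from different partitions.
class UserJourneyAcc {
    final List<AtomicEvent> events = new ArrayList<>();

    // Called once per key and partition for the first event seen.
    static UserJourneyAcc create(AtomicEvent first) {
        UserJourneyAcc acc = new UserJourneyAcc();
        return add(acc, first);
    }

    // Folds one more event of the same user into the accumulator.
    static UserJourneyAcc add(UserJourneyAcc acc, AtomicEvent e) {
        acc.events.add(e);
        return acc;
    }

    // Merges two partial journeys built on different partitions.
    static UserJourneyAcc combine(UserJourneyAcc a, UserJourneyAcc b) {
        a.events.addAll(b.events);
        return a;
    }
}
```

Because merging happens per partition before the shuffle, combineByKey avoids moving every raw event across the network, unlike a plain groupByKey.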
STREAM VERSUS BATCH
https://en.wikipedia.org/wiki/Tanker_(ship)#/media/File:Sirius_Star_2008b.jpg
https://blog.allstate.com/top-5-safety-tips-at-the-gas-pump/
APACHE FLINK
■ Also has a nice, Spark-like API
■ Promises similar or better performance than Spark
■ Looks like the best solution for a κ-architecture
■ But it's also the newest kid on the block
EVENT VERSUS PROCESSING TIME
■ There’s a difference between event time (te) and processing time (tp).
■ Events arrive out of order even during normal operation.
■ Events may arrive arbitrarily late.
Apply a grace period before processing events.
Allow arbitrary update windows of metrics.
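The grace-period rule reduces to a simple predicate: a window over event time may only be finalized once processing time has passed its end by the grace period, so late arrivals still land in the right window. Names follow the deck's legend (tp, dtw); the concrete semantics here are an assumption.

```java
// Sketch of the grace-period check: window end and grace period in
// event-time milliseconds, "now" in processing-time milliseconds.
public class GracePeriod {

    // True once processing time tp has passed windowEnd + dtw.
    static boolean windowClosed(long windowEndTe, long dtwMs, long nowTp) {
        return nowTp >= windowEndTe + dtwMs;
    }

    public static void main(String[] args) {
        long windowEnd = 60_000; // window covers event time [0, 60s)
        long grace = 15_000;     // dtw = 15s
        System.out.println(windowClosed(windowEnd, grace, 70_000)); // still open
        System.out.println(windowClosed(windowEnd, grace, 80_000)); // closed
    }
}
```

Events arriving after the window has closed are handled by the second rule above: update the already-emitted metric instead of dropping the event.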
EXAMPLE
[Diagram: facts at resolutions Minute, Hour, Day, Week, Month, Quarter and Year; an incoming event inserts a fact at the finest resolution and updates the facts at all coarser resolutions.]

Legend:
tp: Processing time
ti: Ingestion time
te: Event time
dtp: Aggregation time frame
dtw: Grace period
I: Insert fact
U: Update fact
LESSONS LEARNED
Image: http://hochmeister-alpin.at
BEST-OF-BREED INSTEAD OF COMMODITY SOLUTIONS
ETL
Analytics
Realtime Analytics
Slice & Dice
Data Exploration
Polyglot Processing
http://datadventures.ghost.io/2014/07/06/polyglot-processing
POLYGLOT ANALYTICS
Data Lake
Analytics
Warehouse
SQL lane
R lane
Timeseries lane
Reporting, Data Exploration, Data Science
NO RETENTION PARANOIA
Data Lake
Analytics
Warehouse
▪ Eternal memory
▪ Close to raw events
▪ Allows replays and refills into the warehouse

Aggressive forgetting with a clearly defined retention policy per aggregation level, e.g.:
▪ 15min:30d
▪ 1h:4m
▪ …
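A policy entry like "15min:30d" reads "keep 15-minute aggregates for 30 days". A sketch of how such a policy could drive deletion follows; the policy table and the reading of "4m" as four months are assumptions for illustration.

```java
import java.time.Duration;
import java.util.Map;

// Sketch of a per-aggregation-level retention policy: each resolution
// maps to how long its aggregates are kept in the warehouse.
public class Retention {
    static final Map<String, Duration> POLICY = Map.of(
            "15min", Duration.ofDays(30),    // 15min:30d
            "1h", Duration.ofDays(120));     // 1h:4m, read as four months

    // True if an aggregate at this level is older than its retention.
    static boolean expired(String level, long ageMillis) {
        return ageMillis > POLICY.get(level).toMillis();
    }

    public static void main(String[] args) {
        long age = Duration.ofDays(31).toMillis();
        System.out.println(expired("15min", age)); // past 30d, drop it
        System.out.println(expired("1h", age));    // within 4 months, keep
    }
}
```

Because the data lake keeps all raw events forever, dropped aggregates can always be recomputed by a replay, which is what makes the aggressive forgetting safe.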
Events
Strange Events
BEWARE OF THE HIPSTERS
Image: h&m
ENSURE YOUR SOFTWARE RUNS LOCALLY
The entire architecture must be able to run locally. Keep round trips low for development and testing.
TUNE CONTINUOUSLY
Throughput and reaction times need to be monitored continuously. Tune your software and the underlying frameworks as needed.
[Diagram: the full pipeline (Collection → Ingestion → Data Lake → Processing → Analytics Warehouse, with SQL interface) under test; a load generator feeds Collection, throughput & latency probes cover each stage, and system, container and process monitoring spans the cluster.]
IN NUMBERS
Overall dev effort until the first release: 250 person days
Dimensions: 10
KPIs: 26
Integrated 3rd-party systems: 7
Inbound data volume per day: 80 GB
New data in DWH per day: 2 GB
Total price of the cheapest cluster able to handle production load:
THANK YOU
@andreasz82
andreas.zitzelsberger@qaware.de
BONUS SLIDES
CALCULATING UNIQUE USERS
■ We need an exact unique user count.
■ If you can, you should use an approximation such as HyperLogLog.
[Diagram: users U1, U2, U3, U1, U4 over time; the two windows contain 3 UU and 2 UU, 4 UU combined since U1 appears in both.]
Flajolet, P.; Fusy, E.; Gandouet, O.; Meunier, F. (2007). "HyperLogLog: the analysis of a near-optimal
cardinality estimation algorithm". AOFA ’07: Proceedings of the 2007 International Conference on the
Analysis of Algorithms.
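Exact unique-user counting boils down to a set of user ids per time window: duplicates collapse, but memory grows with cardinality, which is exactly why the slide recommends HyperLogLog when an approximation is acceptable. A minimal sketch:

```java
import java.util.*;

// Exact unique-user count for one time window: a set of user ids.
// Memory is O(cardinality); HyperLogLog trades a small error for
// constant memory per window.
public class UniqueUsers {
    public static void main(String[] args) {
        List<String> window = List.of("U1", "U2", "U3", "U1");
        Set<String> unique = new HashSet<>(window);
        System.out.println(unique.size() + " UU"); // U1 counted once
    }
}
```

Note that exact sets from two windows can be unioned to get the combined count, which per-window scalar counts cannot (3 UU + 2 UU above is 4 UU, not 5).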
CHARTING TECHNOLOGY
https://github.com/qaware/big-data-landscape
CHOOSING WHERE TO AGGREGATE
[Diagram: Ingestion (1) → Data Lake → Processing (2, atomic event frames) → Analytics Warehouse (fact entries) → Analytics (3).]
1. Ingestion: enrichment, preprocessing, validation
2. Processing: the hard lifting
3. Analytics: processing steps that can be done at query time; interactive queries
