CLICKSTREAM ANALYSIS
WITH APACHE SPARK
Andreas Zitzelsberger
THE CHALLENGE
ONE POT TO RULE THEM ALL
Data sources: Web Tracking, Ad Tracking, ERP, CRM

ERP
▪ Products
▪ Inventory
▪ Margins

CRM
▪ Customer
▪ Orders
▪ Creditworthiness

Ad Tracking
▪ Ad Impressions
▪ Ad Costs

Web Tracking
▪ Clicks & Views
▪ Conversions
ONE POT TO RULE THEM ALL
Retention, Reach, Monetization
Steer …
▪ Campaigns
▪ Offers
▪ Contents
REACT TO WEBSITE
TRAFFIC IN REAL TIME
Image: https://www.flickr.com/photos/nick-m/3663923048
SAMPLE RESULTS
Geolocated and gender-specific conversions.
Frequency of visits
Performance of an ad campaign
THE CONCEPTS
Image: Randy Paulino
THE FIRST SKETCH
(= real-time)
SQL
CALCULATING USER JOURNEYS
Event stream (web / ad tracking) → user journeys:
[Diagram: raw events are grouped into one journey per user, e.g. "C V VT VT VT C X".]
KPIs:
▪ Unique users
▪ Conversions
▪ Ad costs / conversion value
▪ …
Legend: C = Click, V = View, VT = View Time, X = Conversion
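A minimal plain-Java sketch (without Spark) of the same idea: group the event stream into one journey per user, then derive KPIs such as unique users and conversions. The Event record and the KPI selection are illustrative assumptions, not the deck's actual model.

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical minimal event model: a user id and an event type
// (C = click, V = view, VT = view time, X = conversion).
record Event(long userId, String type) {}

public class UserJourneys {

    // Group the raw event stream into one journey (event list) per user.
    static Map<Long, List<Event>> sessionize(List<Event> stream) {
        return stream.stream()
                .collect(Collectors.groupingBy(Event::userId));
    }

    public static void main(String[] args) {
        List<Event> stream = List.of(
                new Event(1, "C"), new Event(1, "V"), new Event(1, "X"),
                new Event(2, "V"), new Event(2, "VT"),
                new Event(3, "C"), new Event(3, "X"));

        Map<Long, List<Event>> journeys = sessionize(stream);
        long uniqueUsers = journeys.size();
        long conversions = stream.stream()
                .filter(e -> e.type().equals("X")).count();

        System.out.println(uniqueUsers + " unique users, "
                + conversions + " conversions");
    }
}
```

The Spark version later in the deck does the same grouping, only distributed across partitions.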
THE ARCHITECTURE
Big Data
„LARRY & FRIENDS“ ARCHITECTURE
Does not perform well beyond 1 TB of data in terms of
ingestion speed, query time
and optimization effort
Image: adweek.com
Nope.
Sorry, no Big Data.
„HADOOP & FRIENDS“ ARCHITECTURE
Aggregation
takes too long
Cumbersome
programming model
(can be solved with
pig, cascading et al.)
Not
interactive
enough
Nope. Too sluggish.
Κ-ARCHITECTURE
Cumbersome
programming model
Over-engineered: We only need
15min real-time ;-)
Stateful aggregations (unique x,
conversions) require a separate DB
with high throughput and fast
aggregations & lookups.
Λ-ARCHITECTURE
Cumbersome
programming model
Complex
architecture
Redundant
logic
FEELS OVER-ENGINEERED…
http://www.brainlazy.com/article/random-nonsense/over-engineered
The Final Architecture*
*) Maybe called µ-architecture one day ;-)
FUNCTIONAL ARCHITECTURE
[Diagram: Collection → Ingestion (raw event stream) → Processing (atomic event frames) → Analytics Warehouse (fact entries); all events, including strange ones, land in the Data Lake; master data integration feeds the warehouse.]
▪ Buffers load peaks
▪ Ensures message delivery (fire & forget for the client)
▪ Creates user journeys and unique user sets
▪ Enriches dimensions
▪ Aggregates events to KPIs
▪ Ability to replay for schema evolution
▪ The representation of truth
▪ Multidimensional data model
▪ Interactive queries for real-time actions and data exploration
▪ Eternal memory for all events (even strange ones)
▪ One schema per event type, time-partitioned
▪ Fault-tolerant message handling
▪ Event handling: apply schema, time-partitioning, de-duplication, sanity checks, pre-aggregation, filtering, fraud detection
▪ Tolerates delayed events
▪ High throughput, moderate latency (~1 min)
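One of the event-handling steps above, de-duplication, can be sketched in a few lines. The event-id key is an assumption; a real pipeline would scope the seen-set per time partition and bound its size.

```java
import java.util.*;

// Sketch of the de-dup step: drop events whose id has already been seen.
// HashSet.add returns false for duplicates, which doubles as the filter.
public class Dedup {
    private final Set<String> seen = new HashSet<>();

    // Returns true if the event is new and should be forwarded downstream.
    public boolean accept(String eventId) {
        return seen.add(eventId);
    }

    public static void main(String[] args) {
        Dedup d = new Dedup();
        System.out.println(d.accept("e1")); // first delivery: forwarded
        System.out.println(d.accept("e1")); // redelivery: dropped
    }
}
```

With at-least-once messaging (fire & forget on the client side), redeliveries are expected, so this filter belongs before any aggregation.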
SERIAL CONNECTION OF STREAMING AND BATCHING
[Diagram: Collection → Ingestion (raw event stream) → Data Lake → Processing (atomic event frames) → Analytics Warehouse (fact entries), with an SQL interface on top.]
▪ Cool programming model
▪ Uniform dev & ops
▪ Simple solution
▪ High compression ratio due to column-oriented storage
▪ High scan speed
▪ Cool programming model
▪ Uniform dev & ops
▪ High performance
▪ Interface to R out of the box
▪ Useful libs: MLlib, GraphX, NLP, …
▪ Good connectivity (JDBC, ODBC, …)
▪ Interactive queries
▪ Uniform ops
▪ Can easily be replaced thanks to the Hive Metastore
▪ Obvious choice for cloud-scale messaging
▪ By far the best throughput and scalability of all evaluated alternatives
public Map<Long, UserJourney> sessionize(JavaRDD<AtomicEvent> events) {
    return events
        // Convert to a pair RDD with the userId as key
        .mapToPair(e -> new Tuple2<>(e.getUserId(), e))
        // Build user journeys
        .<UserJourneyAcc>combineByKey(
            UserJourneyAcc::create,
            UserJourneyAcc::add,
            UserJourneyAcc::combine)
        // Convert to a Java map
        .collectAsMap();
}
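The UserJourneyAcc referenced above is not shown in the deck. A plain-Java sketch of the three functions combineByKey requires might look like this; the AtomicEvent shape is an assumption for illustration.

```java
import java.util.*;

// Hypothetical minimal stand-in for the deck's AtomicEvent.
record AtomicEvent(long userId, String type) {}

// combineByKey's contract: create a combiner from the first value,
// fold further values into it, and merge two partial combiners
// coming from different partitions.
class UserJourneyAcc {
    final List<AtomicEvent> events = new ArrayList<>();

    // Called once per key and partition for the first event seen.
    static UserJourneyAcc create(AtomicEvent first) {
        UserJourneyAcc acc = new UserJourneyAcc();
        return add(acc, first);
    }

    // Folds one more event of the same user into the accumulator.
    static UserJourneyAcc add(UserJourneyAcc acc, AtomicEvent e) {
        acc.events.add(e);
        return acc;
    }

    // Merges two partial journeys built on different partitions.
    static UserJourneyAcc combine(UserJourneyAcc a, UserJourneyAcc b) {
        a.events.addAll(b.events);
        return a;
    }
}
```

Because merging happens per partition before the shuffle, combineByKey avoids moving every raw event across the network, unlike a plain groupByKey.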
STREAM VERSUS BATCH
https://en.wikipedia.org/wiki/Tanker_(ship)#/media/File:Sirius_Star_2008b.jpg
https://blog.allstate.com/top-5-safety-tips-at-the-gas-pump/
APACHE FLINK
■ Also has a nice, Spark-like API
■ Promises similar or better performance than Spark
■ Looks like the best solution for a κ-architecture
■ But it's also the newest kid on the block
EVENT VERSUS PROCESSING TIME
■ There’s a difference between event time (te) and processing time (tp).
■ Events arrive out of order even during normal operation.
■ Events may arrive arbitrarily late.
Apply a grace period before processing events.
Allow arbitrary update windows of metrics.
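The grace-period rule reduces to a simple predicate: a window over event time may only be finalized once processing time has passed its end by the grace period, so late arrivals still land in the right window. Names follow the deck's legend (tp, dtw); the concrete semantics here are an assumption.

```java
// Sketch of the grace-period check: window end and grace period in
// event-time milliseconds, "now" in processing-time milliseconds.
public class GracePeriod {

    // True once processing time tp has passed windowEnd + dtw.
    static boolean windowClosed(long windowEndTe, long dtwMs, long nowTp) {
        return nowTp >= windowEndTe + dtwMs;
    }

    public static void main(String[] args) {
        long windowEnd = 60_000; // window covers event time [0, 60s)
        long grace = 15_000;     // dtw = 15s
        System.out.println(windowClosed(windowEnd, grace, 70_000)); // still open
        System.out.println(windowClosed(windowEnd, grace, 80_000)); // closed
    }
}
```

Events arriving after the window has closed are handled by the second rule above: update the already-emitted metric instead of dropping the event.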
EXAMPLE
[Diagram: facts at resolutions Minute, Hour, Day, Week, Month, Quarter and Year; an incoming event inserts a fact at the finest resolution and updates the facts at all coarser resolutions.]

Legend:
tp: Processing time
ti: Ingestion time
te: Event time
dtp: Aggregation time frame
dtw: Grace period
I: Insert fact
U: Update fact
LESSONS LEARNED
Image: http://hochmeister-alpin.at
BEST-OF-BREED INSTEAD OF COMMODITY SOLUTIONS
ETL
Analytics
Realtime Analytics
Slice & Dice
Data Exploration
Polyglot Processing
http://datadventures.ghost.io/2014/07/06/polyglot-processing
POLYGLOT ANALYTICS
Data Lake
Analytics
Warehouse
SQL lane
R lane
Timeseries lane
Reporting, Data Exploration, Data Science
NO RETENTION PARANOIA
Data Lake
Analytics
Warehouse
▪ Eternal memory
▪ Close to raw events
▪ Allows replays and refills into the warehouse

Aggressive forgetting with a clearly defined retention policy per aggregation level, e.g.:
▪ 15min:30d
▪ 1h:4m
▪ …
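A policy entry like "15min:30d" reads "keep 15-minute aggregates for 30 days". A sketch of how such a policy could drive deletion follows; the policy table and the reading of "4m" as four months are assumptions for illustration.

```java
import java.time.Duration;
import java.util.Map;

// Sketch of a per-aggregation-level retention policy: each resolution
// maps to how long its aggregates are kept in the warehouse.
public class Retention {
    static final Map<String, Duration> POLICY = Map.of(
            "15min", Duration.ofDays(30),    // 15min:30d
            "1h", Duration.ofDays(120));     // 1h:4m, read as four months

    // True if an aggregate at this level is older than its retention.
    static boolean expired(String level, long ageMillis) {
        return ageMillis > POLICY.get(level).toMillis();
    }

    public static void main(String[] args) {
        long age = Duration.ofDays(31).toMillis();
        System.out.println(expired("15min", age)); // past 30d, drop it
        System.out.println(expired("1h", age));    // within 4 months, keep
    }
}
```

Because the data lake keeps all raw events forever, dropped aggregates can always be recomputed by a replay, which is what makes the aggressive forgetting safe.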
Events
Strange Events
BEWARE OF THE HIPSTERS
Image: h&m
ENSURE YOUR SOFTWARE RUNS LOCALLY
The entire architecture must be able to run locally. Keep round trips low for development and testing.
TUNE CONTINUOUSLY
Throughput and reaction times need to be monitored continuously. Tune your software and the underlying frameworks as needed.
[Diagram: the full pipeline (Collection → Ingestion → Data Lake → Processing → Analytics Warehouse, with SQL interface) under test; a load generator feeds Collection, throughput & latency probes cover each stage, and system, container and process monitoring spans the cluster.]
IN NUMBERS
Overall dev effort until the first release: 250 person days
Dimensions: 10
KPIs: 26
Integrated 3rd-party systems: 7
Inbound data volume per day: 80 GB
New data in DWH per day: 2 GB
Total price of the cheapest cluster able to handle production load:
THANK YOU
@andreasz82
andreas.zitzelsberger@qaware.de
BONUS SLIDES
CALCULATING UNIQUE USERS
■ We need an exact unique user count.
■ If you can, you should use an approximation such as HyperLogLog.
[Diagram: users U1, U2, U3, U1, U4 over time; the two windows contain 3 UU and 2 UU, 4 UU combined since U1 appears in both.]
Flajolet, P.; Fusy, E.; Gandouet, O.; Meunier, F. (2007). "HyperLogLog: the analysis of a near-optimal
cardinality estimation algorithm". AOFA ’07: Proceedings of the 2007 International Conference on the
Analysis of Algorithms.
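Exact unique-user counting boils down to a set of user ids per time window: duplicates collapse, but memory grows with cardinality, which is exactly why the slide recommends HyperLogLog when an approximation is acceptable. A minimal sketch:

```java
import java.util.*;

// Exact unique-user count for one time window: a set of user ids.
// Memory is O(cardinality); HyperLogLog trades a small error for
// constant memory per window.
public class UniqueUsers {
    public static void main(String[] args) {
        List<String> window = List.of("U1", "U2", "U3", "U1");
        Set<String> unique = new HashSet<>(window);
        System.out.println(unique.size() + " UU"); // U1 counted once
    }
}
```

Note that exact sets from two windows can be unioned to get the combined count, which per-window scalar counts cannot (3 UU + 2 UU above is 4 UU, not 5).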
CHARTING TECHNOLOGY
https://github.com/qaware/big-data-landscape
CHOOSING WHERE TO AGGREGATE
[Diagram: Ingestion (1) → Data Lake → Processing (2, atomic event frames) → Analytics Warehouse (fact entries) → Analytics (3).]
1. Ingestion: enrichment, preprocessing, validation
2. Processing: the hard lifting
3. Analytics: processing steps that can be done at query time; interactive queries
