Lambda architecture

A solution for Big Data

Mario A. Santini

A solution born at Twitter

Nathan Marz
Author of Big Data: http://www.manning.com/marz/

How big is big?
● OpenStreetMap.org: ~1.5 M users, ~2.2 nodes (http://j.mp/OSM-stats)
● Wikipedia: 32 M pages, 20 M users (http://en.wikipedia.org/wiki/Wikipedia:Statistics)
● Facebook: 1.3 G users (http://www.statisticbrain.com/facebook-statistics/)
● Twitter: 645 M users (http://www.statisticbrain.com/twitter-statistics/)
● But also:
  – Monitoring systems
  – Any near-real-time system

Lambda

query = function(allData);
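
The formula above can be read literally: a query is a pure function applied to the complete data set. The following is a minimal sketch of that idea, assuming a hypothetical page-view count as the query and an in-memory list standing in for "all data"; the event fields and function name are illustrative only.

```python
from collections import Counter


def page_view_query(all_data):
    """Hypothetical query: count views per URL over the whole data set."""
    return Counter(event["url"] for event in all_data)


# Illustrative stand-in for "all data": every raw event ever received.
all_data = [
    {"url": "/home", "user": "a"},
    {"url": "/about", "user": "b"},
    {"url": "/home", "user": "c"},
]

result = page_view_query(all_data)
print(result["/home"])
```

Running the function over everything is simple and always correct, but it becomes too slow as the data grows, which is what the layers on the following slides address.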

Lambda Architecture
[Diagram: input data → All Data → Batch Layer → Batch Views → Serving Layer → Query]

Batch Layer
● Stores an immutable input data set
● Continuously recomputes the batch views
● Simple & distributed
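
The two responsibilities of the batch layer can be sketched as follows. This is a single-process illustration under stated assumptions, not a distributed implementation: `MasterDataset` and `compute_batch_view` are hypothetical names, and the page-view count is an example query.

```python
from collections import Counter


class MasterDataset:
    """Append-only store: records are added, never updated or deleted."""

    def __init__(self):
        self._records = []

    def append(self, record):
        # Store a copy so later changes to the caller's dict do not leak in.
        self._records.append(dict(record))

    def scan(self):
        # Return a snapshot of every record received so far.
        return tuple(self._records)


def compute_batch_view(master):
    """Recompute the view from scratch over the full data set."""
    return dict(Counter(record["url"] for record in master.scan()))


master = MasterDataset()
for url in ["/home", "/about", "/home"]:
    master.append({"url": url})

batch_view = compute_batch_view(master)

# New data arrives; the next batch run simply recomputes everything.
master.append({"url": "/about"})
next_batch_view = compute_batch_view(master)
```

Because every run starts from the raw records, a bug in the view logic can be fixed and the view rebuilt without any data repair.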

Serving Layer
● Indexes the batch views
● Provides access to the batch views
● Updated by the Batch Layer
● Trivial read-only database:
  – Quick
  – Very simple
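
A minimal sketch of such a read-only store, assuming a key-indexed batch view that fits in memory. The `ServingLayer` class and its `publish` and `get` methods are hypothetical names; `MappingProxyType` from the standard library is used only to make the published view read-only.

```python
from types import MappingProxyType


class ServingLayer:
    """Read-only, key-indexed access to the latest batch view."""

    def __init__(self):
        self._view = MappingProxyType({})

    def publish(self, batch_view):
        # Called by the batch layer: replace the whole view in one step.
        self._view = MappingProxyType(dict(batch_view))

    def get(self, key, default=0):
        # Random reads by key; there is no write path for clients.
        return self._view.get(key, default)


serving = ServingLayer()
serving.publish({"/home": 2, "/about": 1})
print(serving.get("/home"))
```

Since clients never write, the store needs no random-write support, which is what keeps it quick and very simple.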

Batch Layer + Serving Layer
● Robust and fault-tolerant
● Scalable
● General
● Extensible
● Allows ad hoc queries
● Minimal maintenance
● Debuggable

What's missing?
While the Batch Layer computes the query over the full
data set, a pretty big chunk of data has just arrived
and been stored.
Should we wait a couple of hours to query this
data?

Speed Layer
Near-real-time views

[Diagram: New Data → Speed Layer → Query]

All together now!
[Diagram: All Data → Batch Layer → Batch Views → Serving Layer → Query; New Data → Speed Layer → Query]
How should all this mess work?
● All new data are sent to both the batch and the speed
layers (data are raw, immutable, and append-only)
● The batch layer continuously precomputes the query
functions over the whole data set to produce the batch views
● The serving layer indexes the batch views
● In the end, the data in the batch views are a couple of hours old

How should all this mess work?
● The speed layer processes only the new data
● It uses a fast read/write database and
incremental processing algorithms
● It produces the near-real-time views
● Queries are resolved by merging the results of the
real-time and batch views
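
The last two points can be sketched together: an incremental real-time view plus a query that merges it with the batch view. This is a toy illustration under stated assumptions; `SpeedLayer`, `on_new_data` and `query_page_views` are hypothetical names, an in-memory `Counter` stands in for the fast read/write database, and the batch view contents are example data.

```python
from collections import Counter

# Example batch view, precomputed by the batch layer and a few hours old.
batch_view = {"/home": 2, "/about": 1}


class SpeedLayer:
    """Keeps a real-time view covering only data the batch view has not seen."""

    def __init__(self):
        self._realtime_view = Counter()

    def on_new_data(self, event):
        # Incremental update: adjust the affected key, no full recomputation.
        self._realtime_view[event["url"]] += 1

    def get(self, key):
        # A Counter returns 0 for keys it has never seen.
        return self._realtime_view[key]


def query_page_views(url, batch_view, speed_layer):
    """Resolve a query by merging the batch and real-time results."""
    return batch_view.get(url, 0) + speed_layer.get(url)


speed = SpeedLayer()
speed.on_new_data({"url": "/home"})
speed.on_new_data({"url": "/contact"})

print(query_page_views("/home", batch_view, speed))
```

For a count the merge is a plain sum; other queries need their own merge logic, which is part of the design cost of the speed layer.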

Batch Layer - tools
● Hadoop
  – YARN: a framework for job scheduling and cluster
resource management
  – MapReduce: a YARN-based system for parallel processing of
huge amounts of data
  – HDFS: a distributed file system with
high-throughput access to application data
  – And more...
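
To make the Map / Reduce model concrete, here is a single-process sketch of its three steps (map, shuffle, reduce) using a word count. It is plain Python and does not use any Hadoop API; the function names and the sample records are illustrative assumptions. In a real cluster each phase would run in parallel across many machines.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (key, value) pair for every word in one input record."""
    for word in record.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


records = ["big data", "Big views", "data data"]
pairs = [pair for record in records for pair in map_phase(record)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)
```

Because each record is mapped independently and each key is reduced independently, both phases can be spread over many workers, which is why the model suits batch views computed over the whole data set.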

Serving Layer - tools
● ElephantDB
  – Read-only database, very small, very fast
● Anything with the same features
will do here
● Cloudera Impala

Speed Layer - tools
● Storm project
  – Very fast distributed computation system
● Apache HBase
● MongoDB

Query - tools
● Cloudera Impala
