Lambda Architecture
A solution for Big Data

Mario A. Santini
A solution born at Twitter

Nathan Marz
Author of Big Data:
http://www.manning.com/marz/
When is big really big?
● OpenStreetMap.org: ~1.5 M users, ~2.2 nodes (http://j.mp/OSM-stats)
● Wikipedia: 32 M pages, 20 M users (http://en.wikipedia.org/wiki/Wikipedia:Statistics)
● Facebook: 1.3 G users (http://www.statisticbrain.com/facebook-statistics/)
● Twitter: 645 M users (http://www.statisticbrain.com/twitter-statistics/)
● But also:
  – Monitoring systems
  – Any near real time system
Lambda

query = function(allData);
Input data
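The idea that a query is a function of all the data can be sketched in a few lines of Python. This is only an illustration: the page-view records and the `pageviews_query` name are assumptions made for the example, not part of the original slides.

```python
# Hypothetical raw input: one (url, timestamp) tuple per page view.
all_data = [
    ("/home", 1),
    ("/about", 2),
    ("/home", 3),
]


def pageviews_query(all_data, url):
    """Answer the query by scanning every record: query = function(allData)."""
    return sum(1 for record_url, _ in all_data if record_url == url)


home_views = pageviews_query(all_data, "/home")
```

Scanning everything on each query is correct but too slow at Big Data scale, which is the problem the layers in the following slides address.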

Lambda Architecture
[Diagram: All Data → Batch Layer → Batch Views → Serving Layer → Query]
Batch Layer
● Stores an immutable input data set
● Continuously computes the batch views
● Simple & distributed
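A minimal single-process sketch of the two responsibilities listed above, assuming page-view records of the form `(url, timestamp)`. The `MasterDataset` class and `compute_batch_view` function are hypothetical names chosen for this example; a real batch layer would run on a distributed system.

```python
from collections import Counter


class MasterDataset:
    """Append-only store: records are added, never updated or deleted."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)

    def scan(self):
        # Hand out an immutable snapshot so callers cannot alter the data.
        return tuple(self._records)


def compute_batch_view(master):
    """Recompute the whole view from scratch over the full data set."""
    return dict(Counter(url for url, _ in master.scan()))


master = MasterDataset()
for record in [("/home", 1), ("/about", 2), ("/home", 3)]:
    master.append(record)

batch_view = compute_batch_view(master)
```

Recomputing from scratch keeps the logic simple: a bug in the view code is fixed by correcting the function and running it again over the untouched raw data.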
Serving Layer
● Indexes the batch views
● Provides access to the batch views
● Updated by the Batch Layer
● A trivial read-only database:
  – Quick
  – Very simple
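The serving layer described above can be approximated by a read-only key/value index that is swapped wholesale whenever the batch layer delivers a new view. The `ServingLayer` class below is a hypothetical illustration, not the API of any real serving database.

```python
from types import MappingProxyType


class ServingLayer:
    """Read-only index over the most recent batch view."""

    def __init__(self):
        self._view = MappingProxyType({})

    def load(self, batch_view):
        # Called by the batch layer: replace the whole view in one step.
        # Clients never write individual keys.
        self._view = MappingProxyType(dict(batch_view))

    def get(self, key, default=0):
        # Fast random read by key.
        return self._view.get(key, default)


serving = ServingLayer()
serving.load({"/home": 2, "/about": 1})
home_views = serving.get("/home")
```

Because there are no random writes, the store needs no locking or conflict handling on the read path, which is what makes it quick and very simple.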
Batch Layer + Serving Layer
● Robust and fault tolerant
● Scalable
● General
● Extensible
● Allows ad hoc queries
● Minimal maintenance
● Debuggable
What's missing?
While the Batch Layer computes the query on the full
data set, a pretty big chunk of new data has just arrived
and been stored.
Should we wait a couple of hours to query this
data?
Speed Layer
[Diagram: New Data → Speed Layer → Near real time views → Query]
All together now!
[Diagram: All Data → Batch Layer → Batch Views → Serving Layer → Query; New Data → Speed Layer → Near real time views → Query]
How should all this mess work?
● All new data are sent to both the batch and the speed layer (data are raw, immutable and append-only)
● The batch layer continuously precomputes the query functions over the whole dataset to produce the batch views
● The serving layer indexes the batch views
● In the end, the data in the batch views are a couple of hours old
How should all this mess work?
● The speed layer processes only the new data
● It uses a fast read/write database and incremental processing algorithms
● It produces the near real time views
● The query merges the real time and batch view results to resolve the queries
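The steps above can be sketched as follows, assuming a page-view count as the query. The `SpeedLayer` class, the `query` function and the sample numbers are hypothetical; an in-memory `Counter` stands in for the fast read/write database.

```python
from collections import Counter

# Precomputed by the batch layer; assumed to be a couple of hours old.
batch_view = {"/home": 100, "/about": 40}


class SpeedLayer:
    """Keeps a near real time view covering only data newer than the batch view."""

    def __init__(self):
        self.realtime_view = Counter()

    def on_new_record(self, url):
        # Incremental update: touch one counter, never rescan old data.
        self.realtime_view[url] += 1

    def reset(self):
        # Once a fresh batch view has absorbed these records, drop them here.
        self.realtime_view.clear()


def query(url, batch_view, realtime_view):
    """Resolve a query by merging the batch and near real time results."""
    return batch_view.get(url, 0) + realtime_view.get(url, 0)


speed = SpeedLayer()
for url in ["/home", "/home", "/contact"]:
    speed.on_new_record(url)

home_views = query("/home", batch_view, speed.realtime_view)
```

Merging is a plain sum here because counts are additive; other queries may need a different merge function.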
Batch Layer - tools
● Hadoop
  – YARN: a framework for job scheduling and cluster resource management
  – MapReduce: a YARN-based system for parallel processing of huge amounts of data
  – HDFS: a distributed file system with high-throughput access to application data
  – And even more...
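To give a feel for the MapReduce model mentioned above, here is a single-process word count in plain Python. It does not use the Hadoop API; the function names `map_phase`, `shuffle`, `reduce_phase` and `run_job` are assumptions for this sketch, and on a real cluster each phase would run in parallel across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all the values of one key into a single result."""
    return key, sum(values)


def run_job(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


word_counts = run_job(["big data", "Big views"])
```

Each call to `map_phase` depends only on its own line and each call to `reduce_phase` only on its own key, which is what allows the work to be spread across a cluster.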
Serving Layer – tools
● ElephantDB
  – Read-only database, very small, very fast
● Here we need anything that has the same features
● Cloudera Impala
Speed Layer - tools
● Storm project
  – Very fast distributed computation system
● Apache HBase
● MongoDB
Query - tools
● Cloudera Impala
