Approximate Queries on Very Large Data
UC Berkeley
Sameer Agarwal
Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael Jordan, Samuel Madden, Ion Stoica
MIT
About Me
1. Software Engineer at Databricks in San Francisco, CA
2. PhD in Databases from the University of California, Berkeley (2014)
3. Actively work as part of the open source community and the AMPLab to create BDAS (the Berkeley Data Analytics Stack), which includes Apache Spark, Tachyon, BlinkDB, Mesos, etc.
Query Execution on Samples
[Figure: querying 10 TB on 100 machines takes roughly ½-1 hour from hard disks and 1-5 minutes from memory; the goal is an answer in about 1 second.]
Query Execution on Samples

What is the average buffering ratio in the table?

ID  City      Buff Ratio
1   NYC       0.78
2   NYC       0.13
3   Berkeley  0.25
4   NYC       0.19
5   NYC       0.11
6   Berkeley  0.09
7   NYC       0.18
8   NYC       0.15
9   Berkeley  0.13
10  Berkeley  0.49
11  NYC       0.19
12  Berkeley  0.10

True answer (full table): 0.2325

Query Execution on Samples

Uniform sample at rate 1/4:

ID  City      Buff Ratio  Sampling Rate
2   NYC       0.13        1/4
6   Berkeley  0.25        1/4
8   NYC       0.19        1/4

Estimate: 0.19, and with an error bar: 0.19 +/- 0.05 (true answer: 0.2325)

Query Execution on Samples

Uniform sample at rate 1/2:

ID  City      Buff Ratio  Sampling Rate
2   NYC       0.13        1/2
3   Berkeley  0.25        1/2
5   NYC       0.19        1/2
6   Berkeley  0.09        1/2
8   NYC       0.18        1/2
12  Berkeley  0.49        1/2

Estimate: 0.22 +/- 0.02 (true answer: 0.2325; the 1/4 sample gave 0.19 +/- 0.05)
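To make the error-bar arithmetic concrete, here is a small illustrative Python sketch (not BlinkDB code) that computes the sample mean and a normal-approximation 95% interval for the 1/4 sample above; with only three rows the interval is rough and will not exactly match the slide's +/- 0.05.

```python
import math

# The three Buff Ratio values from the 1/4 uniform sample above.
sample = [0.13, 0.25, 0.19]

n = len(sample)
mean = sum(sample) / n
# Unbiased sample variance and the standard error of the mean.
var = sum((x - mean) ** 2 for x in sample) / (n - 1)
stderr = math.sqrt(var / n)

# Normal-approximation 95% confidence interval (z ~= 1.96).
print(f"estimate = {mean:.4f} +/- {1.96 * stderr:.4f}")
```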
Speed/Accuracy Trade-off
[Chart: error vs. execution time (sample size). Running on the entire dataset takes about 30 minutes; interactive queries need answers in about 2 seconds.]
Sampling vs. No Sampling
[Chart: query response time (seconds) vs. fraction of the full data (1, 10^-1, ..., 10^-5). Response time drops from 1020 s on the full data to 103 s on a 10% sample, a roughly 10x speedup since response time is dominated by I/O, and then to 18 s, 13 s, 10 s, and 8 s on smaller samples.]

Sampling vs. No Sampling
[Same chart with error bars: roughly 0.02%, 0.07%, 1.1%, 3.4%, and 11% error for the successively smaller samples.]
Sampling Error
Typically, the error depends on the sample size (n) and not on the original data size, i.e., the error is proportional to 1/sqrt(n)*
* Conditions Apply
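A quick illustrative simulation (plain Python, not BlinkDB code) of that claim: for a fixed sample size n, the error of the sample mean barely changes as the population grows, while increasing n shrinks it roughly like 1/sqrt(n).

```python
import random
import statistics

def mean_error(population, n, trials=200):
    """Average absolute error of a size-n uniform sample's mean, over many trials."""
    true_mean = statistics.fmean(population)
    errs = [abs(statistics.fmean(random.sample(population, n)) - true_mean)
            for _ in range(trials)]
    return statistics.fmean(errs)

random.seed(0)
for pop_size in (10_000, 1_000_000):           # error is (nearly) independent of data size
    pop = [random.random() for _ in range(pop_size)]
    for n in (100, 400, 1600):                 # quadrupling n roughly halves the error
        print(pop_size, n, round(mean_error(pop, n), 4))
```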
Speed/Accuracy Trade-off
SELECT avg(sessionTime)
FROM Table
WHERE city='San Francisco'
WITHIN 1 SECONDS
Result: 234.23 ± 15.32

Speed/Accuracy Trade-off
SELECT avg(sessionTime)
FROM Table
WHERE city='San Francisco'
WITHIN 2 SECONDS
Result: 239.46 ± 4.96 (the 1-second query returned 234.23 ± 15.32)
Speed/Accuracy Trade-off
SELECT avg(sessionTime)
FROM Table
WHERE city='San Francisco'
WITHIN 1 SECONDS
Supported: AVG, COUNT, SUM, STDEV, PERCENTILE, etc.

Speed/Accuracy Trade-off
SELECT avg(sessionTime)
FROM Table
WHERE city='San Francisco'
WITHIN 1 SECONDS
Supported: FILTERS, GROUP BY clauses

Speed/Accuracy Trade-off
SELECT avg(sessionTime)
FROM Table
LEFT OUTER JOIN logs2 ON very_big_log.id = logs.id
WHERE city='San Francisco'
WITHIN 1 SECONDS
Supported: JOINS, nested queries, etc.

Speed/Accuracy Trade-off
SELECT my_function(sessionTime)
FROM Table
LEFT OUTER JOIN logs2 ON very_big_log.id = logs.id
WHERE city='San Francisco'
WITHIN 1 SECONDS
Supported: ML primitives, user-defined functions
Speed/Accuracy Trade-off
SELECT avg(sessionTime)
FROM Table
WHERE city='San Francisco'
ERROR 0.1 CONFIDENCE 95.0%
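As a rough illustration of what such a target implies (a back-of-the-envelope Python sketch, not BlinkDB's actual planner), the closed form for AVG says roughly n ≈ (z·σ/ε)² rows are needed, where σ is the column's standard deviation, ε the error target (treated here as absolute), and z the normal quantile for the requested confidence:

```python
import math

def required_sample_size(sigma, error, z=1.96):
    """Rows needed so an AVG estimate lands within +/- error at ~95% confidence,
    assuming the CLT applies; sigma is an estimate of the column's stddev."""
    return math.ceil((z * sigma / error) ** 2)

# Hypothetical numbers: sessionTime stddev ~80, absolute error target of +/- 5.
print(required_sample_size(sigma=80.0, error=5.0))   # ~984 rows
```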
What is BlinkDB?
A framework built on Spark that …
- creates and maintains a variety of uniform and stratified samples from underlying data
- returns fast, approximate answers with error bars by executing queries on samples of data
- verifies the correctness of the error bars that it returns at runtime
Uniform Samples
[Diagram: the original table, stored as blocks 1-4 across the cluster, and a uniform sample U drawn from it.]
ID  City      Data
1   NYC       0.78
2   NYC       0.13
3   Berkeley  0.25
4   NYC       0.19
5   NYC       0.11
6   Berkeley  0.09
7   NYC       0.18
8   NYC       0.15
9   Berkeley  0.13
10  Berkeley  0.49
11  NYC       0.19
12  Berkeley  0.10

How the sample is built:
1. FILTER rand() < 1/3
2. Add per-row weights
3. In-memory shuffle

Resulting uniform sample U:

ID  City      Data  Weight
2   NYC       0.13  1/3
8   NYC       0.25  1/3
6   Berkeley  0.09  1/3
11  NYC       0.19  1/3

Doesn't change Spark RDD semantics.
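A minimal sketch of those three steps using the public PySpark DataFrame API (an assumption for illustration; this is not BlinkDB's actual sampling module):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("uniform-sample-sketch").getOrCreate()

rows = [(1, "NYC", 0.78), (2, "NYC", 0.13), (3, "Berkeley", 0.25), (4, "NYC", 0.19),
        (5, "NYC", 0.11), (6, "Berkeley", 0.09), (7, "NYC", 0.18), (8, "NYC", 0.15),
        (9, "Berkeley", 0.13), (10, "Berkeley", 0.49), (11, "NYC", 0.19), (12, "Berkeley", 0.10)]
df = spark.createDataFrame(rows, ["id", "city", "data"])

fraction = 1.0 / 3.0
uniform_sample = (df.where(F.rand(seed=42) < fraction)        # 1. FILTER rand() < 1/3
                    .withColumn("weight", F.lit(fraction))    # 2. add a per-row weight
                    .cache())                                  # 3. keep the sample in memory
uniform_sample.show()
```

Because the result is just another DataFrame with an extra weight column, downstream operators keep their usual Spark semantics, which is the point of the note above.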
Stratified Samples
[Diagram: the same table (blocks 1-4) and a stratified sample S drawn from it.]
The construction starts from the same 12-row table (ID, City, Data) shown above.

SPLIT the table and GROUP by city to count the rows in each stratum:

City      Count
NYC       7
Berkeley  5

Compute a per-city sampling ratio (a cap of 2 rows per city gives 2/7 and 2/5):

City      Count  Ratio
NYC       7      2/7
Berkeley  5      2/5

JOIN the ratios back and sample each city at its ratio. The resulting stratified sample:

ID  City      Data  Weight
2   NYC       0.13  2/7
8   NYC       0.25  2/7
6   Berkeley  0.09  2/5
12  Berkeley  0.49  2/5

Doesn't change Shark RDD semantics.
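A minimal PySpark sketch of that plan (again an illustration using the public DataFrame API, not BlinkDB's code; the per-city cap K = 2 is assumed from the ratios on the slide), reusing the `df` from the uniform-sample sketch above:

```python
from pyspark.sql import functions as F

K = 2  # assumed cap: keep roughly K rows per city
counts = df.groupBy("city").agg(F.count("*").alias("cnt"))             # GROUP
fractions = {r["city"]: min(1.0, K / r["cnt"]) for r in counts.collect()}
stratified_sample = (df.sampleBy("city", fractions, seed=42)            # per-stratum sampling
                       .join(counts, "city")                            # JOIN the counts back
                       .withColumn("weight",
                                   F.least(F.lit(1.0), F.lit(float(K)) / F.col("cnt"))))
stratified_sample.show()
```

Keeping the per-city sampling ratio (2/7 or 2/5) as a weight column is what lets later aggregates scale their answers back up to the full table.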
What is BlinkDB?
A framework built on Spark that …
- creates and maintains a variety of uniform and
stratified samples from underlying data
- returns fast, approximate answers with error bars
by executing queries on samples of data
- verifies the correctness of the error bars that it
returns at runtime
Error Estimation
Closed Form Aggregate Functions
- Central Limit Theorem
- Applicable to AVG, COUNT, SUM, VARIANCE and STDEV
Error Estimation
Closed Form Aggregate Functions

The following results hold asymptotically in the sample size n. They are not directly useful as stated, since they depend on unknown properties of the underlying distribution; in practice we plug in the sample values (e.g., in place of µ we use (1/n) Σᵢ Xᵢ, where Xᵢ is the i-th sample value). Here p is the selectivity of the query's predicate, µ and σ² are the column's mean and variance, and µ₄ its fourth central moment. The Mean, Variance, and Stddev forms assume no filtering (p = 1); filtering increases the variance, especially for highly selective queries:

1. Count:    N(np, np(1 - p))
2. Sum:      N(npµ, np(σ² + (1 - p)µ²))
3. Mean:     N(µ, σ²/n)
4. Variance: N(σ², (µ₄ - σ⁴)/n)
5. Stddev:   N(σ, (µ₄ - σ⁴)/(4σ²n))
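As an illustration (plain Python, not BlinkDB's implementation), the closed form for Mean turns directly into an error bar: estimate σ² from the sample and report the mean ± z·sqrt(σ²/n).

```python
import math
import random
import statistics

def avg_with_error(sample, confidence_z=1.96):
    """Closed-form (CLT) error bar for AVG: mean +/- z * sqrt(var / n)."""
    n = len(sample)
    mean = statistics.fmean(sample)
    var = statistics.variance(sample)          # plug-in estimate of sigma^2
    return mean, confidence_z * math.sqrt(var / n)

random.seed(0)
sample = [random.gauss(200.0, 40.0) for _ in range(1000)]   # hypothetical sessionTime sample
est, err = avg_with_error(sample)
print(f"{est:.2f} +/- {err:.2f} (95% confidence)")
```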
Error Estimation
Closed Form Aggregate Functions
[Diagram: evaluating AVG / SUM / COUNT / STDEV / VARIANCE over the sample produces an answer A; the closed forms above attach an error bar, giving A ± ε.]
Error Estimation
Generalized Aggregate Functions
- Statistical Bootstrap
- Applicable to complex and nested queries, UDFs,
joins etc.
Error Estimation
Generalized Aggregate Functions
[Diagram: the sample is resampled into bootstrap replicates A1, A2, …, A100; the query runs on every replicate, and the spread of those answers gives the result with an error bar ± ε.]
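For aggregates without closed forms, the bootstrap sketch below (illustrative plain Python, not BlinkDB's optimized implementation) resamples the sample about 100 times, reruns the query on each replicate, and reads the error bar off the spread of the results; here the "query" is a median, which has no simple closed form.

```python
import random
import statistics

def bootstrap_error(sample, query, replicates=100, confidence=0.95):
    """Run `query` on many with-replacement resamples and report a percentile error bar."""
    n = len(sample)
    estimates = sorted(query(random.choices(sample, k=n)) for _ in range(replicates))
    lo = estimates[int((1 - confidence) / 2 * replicates)]
    hi = estimates[int((1 + confidence) / 2 * replicates) - 1]
    return query(sample), (lo, hi)

random.seed(0)
sample = [random.expovariate(1 / 200.0) for _ in range(1000)]   # hypothetical sessionTime sample
answer, (lo, hi) = bootstrap_error(sample, query=lambda xs: statistics.median(xs))
print(f"median ~= {answer:.1f}, 95% interval ({lo:.1f}, {hi:.1f})")
```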
What is BlinkDB?
A framework built on Spark that …
- creates and maintains a variety of uniform and
stratified samples from underlying data
- returns fast, approximate answers with error bars
by executing queries on samples of data
- verifies the correctness of the error bars that it
returns at runtime
Error Verification
[Chart: error vs. sample size. More data → higher accuracy; 300 data points → 97% accuracy.] [KDD’13] [SIGMOD’14]
What is BlinkDB?
A framework built on Spark that …
- creates and maintains a variety of uniform and stratified samples from underlying data [Offline Process]
- returns fast, approximate answers with error bars by executing queries on samples of data [Online Process]
- verifies the correctness of the error bars that it returns at runtime [Online Process]
BlinkDB Architecture
[Diagram: a Sampling Module reads the original data (TABLE).]
Offline sampling: creates an optimal set of samples on native tables and materialized views, based on query history and workload characteristics.
BlinkDB Architecture
[Diagram: the Sampling Module maintains in-memory samples and on-disk samples alongside the original data.]
Sample placement: samples are striped over 100s or 1,000s of machines, both on disk and in memory.
BlinkDB Architecture
[Diagram: a HiveQL/SQL query such as SELECT foo(*) FROM TABLE WITHIN 2 is compiled into a query plan; a sample-selection step chooses among the in-memory and on-disk samples.]
Online sample selection picks the best sample(s) based on the query's latency and accuracy requirements.
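A toy sketch of that selection step (illustrative Python; the real planner profiles error and latency from past runs and is not this simple): given a catalog of pre-built samples and a crude cost model where scan time grows with sample size, pick the largest sample that fits the time budget, since larger samples give lower error.

```python
# Hypothetical catalog of pre-built samples: name -> number of rows.
SAMPLES = {"t_sample_0001": 10_000, "t_sample_001": 100_000, "t_sample_01": 1_000_000}

ROWS_PER_SECOND = 400_000  # assumed scan rate; a real system would calibrate this per cluster

def pick_sample(time_budget_s):
    """Pick the largest sample whose estimated scan time fits within the budget."""
    feasible = {name: rows for name, rows in SAMPLES.items()
                if rows / ROWS_PER_SECOND <= time_budget_s}
    if not feasible:
        return min(SAMPLES, key=SAMPLES.get)   # fall back to the smallest sample
    return max(feasible, key=feasible.get)

print(pick_sample(time_budget_s=2.0))   # e.g., for a query ending in WITHIN 2 SECONDS
```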
BlinkDB Architecture
[Diagram: the new query plan runs on Hive/Spark/Presto over the selected samples and returns the result with error bars and confidence intervals, e.g., 182.23 ± 5.56 (95% confidence).]
Parallel query execution on multiple samples striped across multiple machines.
BlinkDB is Fast!
- 5 queries, 5 machines
- 20 GB samples (0.001%-1% of original data)
- 1-5% error
[Charts: response time in seconds, showing overall query execution, then the added error-estimation overhead, then the added error-verification overhead.]
Coming Soon: Native Spark Integration
BlinkDB Prototype
1. Alpha 0.2.0 released and available at http://blinkdb.org
2. Allows you to create samples on native tables and materialized
views
3. Adds approximate aggregate functions with statistical closed
forms to HiveQL
4. Compatible with Apache Hive, Spark and Facebook’s Presto
(storage, serdes, UDFs, types, metadata)
http://blinkdb.org
Native Spark Integration Coming Soon!

  • 58.