Approximate Queries on Very Large Data
UC Berkeley
Sameer Agarwal
Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael Jordan, Samuel Madden, Ion Stoica
MIT
About Me
1. Software Engineer at Databricks in San Francisco, CA
2. PhD in Databases from the University of California, Berkeley (2014)
3. Actively work as part of the open source community and the AMPLab to create BDAS (the Berkeley Data Analytics Stack), which includes Apache Spark, Tachyon, BlinkDB, Mesos, etc.
Query Execution on Samples
[Figure: querying 10 TB on 100 machines takes roughly ½-1 hour from hard disks and 1-5 minutes from memory; the goal is an answer in about 1 second.]
Query Execution on Samples

What is the average buffering ratio in the table?

ID  City      Buff Ratio
1   NYC       0.78
2   NYC       0.13
3   Berkeley  0.25
4   NYC       0.19
5   NYC       0.11
6   Berkeley  0.09
7   NYC       0.18
8   NYC       0.15
9   Berkeley  0.13
10  Berkeley  0.49
11  NYC       0.19
12  Berkeley  0.10

True answer (full table): 0.2325

Query Execution on Samples

Uniform sample at rate 1/4:

ID  City      Buff Ratio  Sampling Rate
2   NYC       0.13        1/4
6   Berkeley  0.25        1/4
8   NYC       0.19        1/4

Estimate: 0.19, and with an error bar: 0.19 +/- 0.05 (true answer: 0.2325)

Query Execution on Samples

Uniform sample at rate 1/2:

ID  City      Buff Ratio  Sampling Rate
2   NYC       0.13        1/2
3   Berkeley  0.25        1/2
5   NYC       0.19        1/2
6   Berkeley  0.09        1/2
8   NYC       0.18        1/2
12  Berkeley  0.49        1/2

Estimate: 0.22 +/- 0.02 (true answer: 0.2325; the 1/4 sample gave 0.19 +/- 0.05)
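To make the error-bar arithmetic concrete, here is a small illustrative Python sketch (not BlinkDB code) that computes the sample mean and a normal-approximation 95% interval for the 1/4 sample above; with only three rows the interval is rough and will not exactly match the slide's +/- 0.05.

```python
import math

# The three Buff Ratio values from the 1/4 uniform sample above.
sample = [0.13, 0.25, 0.19]

n = len(sample)
mean = sum(sample) / n
# Unbiased sample variance and the standard error of the mean.
var = sum((x - mean) ** 2 for x in sample) / (n - 1)
stderr = math.sqrt(var / n)

# Normal-approximation 95% confidence interval (z ~= 1.96).
print(f"estimate = {mean:.4f} +/- {1.96 * stderr:.4f}")
```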
Speed/Accuracy Trade-off
[Chart: error vs. execution time (sample size). Running on the entire dataset takes about 30 minutes; interactive queries need answers in about 2 seconds.]
Sampling vs. No Sampling
[Chart: query response time (seconds) vs. fraction of the full data (1, 10^-1, ..., 10^-5). Response time drops from 1020 s on the full data to 103 s on a 10% sample, a roughly 10x speedup since response time is dominated by I/O, and then to 18 s, 13 s, 10 s, and 8 s on smaller samples.]

Sampling vs. No Sampling
[Same chart with error bars: roughly 0.02%, 0.07%, 1.1%, 3.4%, and 11% error for the successively smaller samples.]
Sampling Error
Typically, the error depends on the sample size (n) and not on the original data size, i.e., the error is proportional to 1/sqrt(n)*
* Conditions Apply
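A quick illustrative simulation (plain Python, not BlinkDB code) of that claim: for a fixed sample size n, the error of the sample mean barely changes as the population grows, while increasing n shrinks it roughly like 1/sqrt(n).

```python
import random
import statistics

def mean_error(population, n, trials=200):
    """Average absolute error of a size-n uniform sample's mean, over many trials."""
    true_mean = statistics.fmean(population)
    errs = [abs(statistics.fmean(random.sample(population, n)) - true_mean)
            for _ in range(trials)]
    return statistics.fmean(errs)

random.seed(0)
for pop_size in (10_000, 1_000_000):           # error is (nearly) independent of data size
    pop = [random.random() for _ in range(pop_size)]
    for n in (100, 400, 1600):                 # quadrupling n roughly halves the error
        print(pop_size, n, round(mean_error(pop, n), 4))
```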
Speed/Accuracy Trade-off
SELECT avg(sessionTime)
FROM Table
WHERE city='San Francisco'
WITHIN 1 SECONDS
Result: 234.23 ± 15.32

Speed/Accuracy Trade-off
SELECT avg(sessionTime)
FROM Table
WHERE city='San Francisco'
WITHIN 2 SECONDS
Result: 239.46 ± 4.96 (the 1-second query returned 234.23 ± 15.32)
Speed/Accuracy Trade-off
SELECT avg(sessionTime)
FROM Table
WHERE city='San Francisco'
WITHIN 1 SECONDS
Supported: AVG, COUNT, SUM, STDEV, PERCENTILE, etc.

Speed/Accuracy Trade-off
SELECT avg(sessionTime)
FROM Table
WHERE city='San Francisco'
WITHIN 1 SECONDS
Supported: FILTERS, GROUP BY clauses

Speed/Accuracy Trade-off
SELECT avg(sessionTime)
FROM Table
LEFT OUTER JOIN logs2 ON very_big_log.id = logs.id
WHERE city='San Francisco'
WITHIN 1 SECONDS
Supported: JOINS, nested queries, etc.

Speed/Accuracy Trade-off
SELECT my_function(sessionTime)
FROM Table
LEFT OUTER JOIN logs2 ON very_big_log.id = logs.id
WHERE city='San Francisco'
WITHIN 1 SECONDS
Supported: ML primitives, user-defined functions
Speed/Accuracy Trade-off
SELECT avg(sessionTime)
FROM Table
WHERE city='San Francisco'
ERROR 0.1 CONFIDENCE 95.0%
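As a rough illustration of what such a target implies (a back-of-the-envelope Python sketch, not BlinkDB's actual planner), the closed form for AVG says roughly n ≈ (z·σ/ε)² rows are needed, where σ is the column's standard deviation, ε the error target (treated here as absolute), and z the normal quantile for the requested confidence:

```python
import math

def required_sample_size(sigma, error, z=1.96):
    """Rows needed so an AVG estimate lands within +/- error at ~95% confidence,
    assuming the CLT applies; sigma is an estimate of the column's stddev."""
    return math.ceil((z * sigma / error) ** 2)

# Hypothetical numbers: sessionTime stddev ~80, absolute error target of +/- 5.
print(required_sample_size(sigma=80.0, error=5.0))   # ~984 rows
```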
What is BlinkDB?
A framework built on Spark that …
- creates and maintains a variety of uniform and stratified samples from underlying data
- returns fast, approximate answers with error bars by executing queries on samples of data
- verifies the correctness of the error bars that it returns at runtime
Uniform Samples
[Diagram: the original table, stored as blocks 1-4 across the cluster, and a uniform sample U drawn from it.]
ID  City      Data
1   NYC       0.78
2   NYC       0.13
3   Berkeley  0.25
4   NYC       0.19
5   NYC       0.11
6   Berkeley  0.09
7   NYC       0.18
8   NYC       0.15
9   Berkeley  0.13
10  Berkeley  0.49
11  NYC       0.19
12  Berkeley  0.10

How the sample is built:
1. FILTER rand() < 1/3
2. Add per-row weights
3. In-memory shuffle

Resulting uniform sample U:

ID  City      Data  Weight
2   NYC       0.13  1/3
8   NYC       0.25  1/3
6   Berkeley  0.09  1/3
11  NYC       0.19  1/3

Doesn't change Spark RDD semantics.
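A minimal sketch of those three steps using the public PySpark DataFrame API (an assumption for illustration; this is not BlinkDB's actual sampling module):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("uniform-sample-sketch").getOrCreate()

rows = [(1, "NYC", 0.78), (2, "NYC", 0.13), (3, "Berkeley", 0.25), (4, "NYC", 0.19),
        (5, "NYC", 0.11), (6, "Berkeley", 0.09), (7, "NYC", 0.18), (8, "NYC", 0.15),
        (9, "Berkeley", 0.13), (10, "Berkeley", 0.49), (11, "NYC", 0.19), (12, "Berkeley", 0.10)]
df = spark.createDataFrame(rows, ["id", "city", "data"])

fraction = 1.0 / 3.0
uniform_sample = (df.where(F.rand(seed=42) < fraction)        # 1. FILTER rand() < 1/3
                    .withColumn("weight", F.lit(fraction))    # 2. add a per-row weight
                    .cache())                                  # 3. keep the sample in memory
uniform_sample.show()
```

Because the result is just another DataFrame with an extra weight column, downstream operators keep their usual Spark semantics, which is the point of the note above.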
Stratified Samples
[Diagram: the same table (blocks 1-4) and a stratified sample S drawn from it.]
The construction starts from the same 12-row table (ID, City, Data) shown above.

SPLIT the table and GROUP by city to count the rows in each stratum:

City      Count
NYC       7
Berkeley  5

Compute a per-city sampling ratio (a cap of 2 rows per city gives 2/7 and 2/5):

City      Count  Ratio
NYC       7      2/7
Berkeley  5      2/5

JOIN the ratios back and sample each city at its ratio. The resulting stratified sample:

ID  City      Data  Weight
2   NYC       0.13  2/7
8   NYC       0.25  2/7
6   Berkeley  0.09  2/5
12  Berkeley  0.49  2/5

Doesn't change Shark RDD semantics.
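A minimal PySpark sketch of that plan (again an illustration using the public DataFrame API, not BlinkDB's code; the per-city cap K = 2 is assumed from the ratios on the slide), reusing the `df` from the uniform-sample sketch above:

```python
from pyspark.sql import functions as F

K = 2  # assumed cap: keep roughly K rows per city
counts = df.groupBy("city").agg(F.count("*").alias("cnt"))             # GROUP
fractions = {r["city"]: min(1.0, K / r["cnt"]) for r in counts.collect()}
stratified_sample = (df.sampleBy("city", fractions, seed=42)            # per-stratum sampling
                       .join(counts, "city")                            # JOIN the counts back
                       .withColumn("weight",
                                   F.least(F.lit(1.0), F.lit(float(K)) / F.col("cnt"))))
stratified_sample.show()
```

Keeping the per-city sampling ratio (2/7 or 2/5) as a weight column is what lets later aggregates scale their answers back up to the full table.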
What is BlinkDB?
A framework built on Spark that …
- creates and maintains a variety of uniform and
stratified samples from underlying data
- returns fast, approximate answers with error bars
by executing queries on samples of data
- verifies the correctness of the error bars that it
returns at runtime
Error Estimation
Closed Form Aggregate Functions
- Central Limit Theorem
- Applicable to AVG, COUNT, SUM, VARIANCE and STDEV
Error Estimation
Closed Form Aggregate Functions

The following results hold asymptotically in the sample size n. They are not directly useful as stated, since they depend on unknown properties of the underlying distribution; in practice we plug in the sample values (e.g., in place of µ we use (1/n) Σᵢ Xᵢ, where Xᵢ is the i-th sample value). Here p is the selectivity of the query's predicate, µ and σ² are the column's mean and variance, and µ₄ its fourth central moment. The Mean, Variance, and Stddev forms assume no filtering (p = 1); filtering increases the variance, especially for highly selective queries:

1. Count:    N(np, np(1 - p))
2. Sum:      N(npµ, np(σ² + (1 - p)µ²))
3. Mean:     N(µ, σ²/n)
4. Variance: N(σ², (µ₄ - σ⁴)/n)
5. Stddev:   N(σ, (µ₄ - σ⁴)/(4σ²n))
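As an illustration (plain Python, not BlinkDB's implementation), the closed form for Mean turns directly into an error bar: estimate σ² from the sample and report the mean ± z·sqrt(σ²/n).

```python
import math
import random
import statistics

def avg_with_error(sample, confidence_z=1.96):
    """Closed-form (CLT) error bar for AVG: mean +/- z * sqrt(var / n)."""
    n = len(sample)
    mean = statistics.fmean(sample)
    var = statistics.variance(sample)          # plug-in estimate of sigma^2
    return mean, confidence_z * math.sqrt(var / n)

random.seed(0)
sample = [random.gauss(200.0, 40.0) for _ in range(1000)]   # hypothetical sessionTime sample
est, err = avg_with_error(sample)
print(f"{est:.2f} +/- {err:.2f} (95% confidence)")
```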
Error Estimation
Closed Form Aggregate Functions
[Diagram: evaluating AVG / SUM / COUNT / STDEV / VARIANCE over the sample produces an answer A; the closed forms above attach an error bar, giving A ± ε.]
Error Estimation
Generalized Aggregate Functions
- Statistical Bootstrap
- Applicable to complex and nested queries, UDFs,
joins etc.
Error Estimation
Generalized Aggregate Functions
[Diagram: the sample is resampled into bootstrap replicates A1, A2, …, A100; the query runs on every replicate, and the spread of those answers gives the result with an error bar ± ε.]
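For aggregates without closed forms, the bootstrap sketch below (illustrative plain Python, not BlinkDB's optimized implementation) resamples the sample about 100 times, reruns the query on each replicate, and reads the error bar off the spread of the results; here the "query" is a median, which has no simple closed form.

```python
import random
import statistics

def bootstrap_error(sample, query, replicates=100, confidence=0.95):
    """Run `query` on many with-replacement resamples and report a percentile error bar."""
    n = len(sample)
    estimates = sorted(query(random.choices(sample, k=n)) for _ in range(replicates))
    lo = estimates[int((1 - confidence) / 2 * replicates)]
    hi = estimates[int((1 + confidence) / 2 * replicates) - 1]
    return query(sample), (lo, hi)

random.seed(0)
sample = [random.expovariate(1 / 200.0) for _ in range(1000)]   # hypothetical sessionTime sample
answer, (lo, hi) = bootstrap_error(sample, query=lambda xs: statistics.median(xs))
print(f"median ~= {answer:.1f}, 95% interval ({lo:.1f}, {hi:.1f})")
```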
What is BlinkDB?
A framework built on Spark that …
- creates and maintains a variety of uniform and
stratified samples from underlying data
- returns fast, approximate answers with error bars
by executing queries on samples of data
- verifies the correctness of the error bars that it
returns at runtime
Error Verification
[Chart: error vs. sample size. More data → higher accuracy; 300 data points → 97% accuracy.] [KDD’13] [SIGMOD’14]
What is BlinkDB?
A framework built on Spark that …
- creates and maintains a variety of uniform and stratified samples from underlying data [Offline Process]
- returns fast, approximate answers with error bars by executing queries on samples of data [Online Process]
- verifies the correctness of the error bars that it returns at runtime [Online Process]
BlinkDB Architecture
[Diagram: a Sampling Module reads the original data (TABLE).]
Offline sampling: creates an optimal set of samples on native tables and materialized views, based on query history and workload characteristics.
BlinkDB Architecture
[Diagram: the Sampling Module maintains in-memory samples and on-disk samples alongside the original data.]
Sample placement: samples are striped over 100s or 1,000s of machines, both on disk and in memory.
BlinkDB Architecture
[Diagram: a HiveQL/SQL query such as SELECT foo(*) FROM TABLE WITHIN 2 is compiled into a query plan; a sample-selection step chooses among the in-memory and on-disk samples.]
Online sample selection picks the best sample(s) based on the query's latency and accuracy requirements.
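A toy sketch of that selection step (illustrative Python; the real planner profiles error and latency from past runs and is not this simple): given a catalog of pre-built samples and a crude cost model where scan time grows with sample size, pick the largest sample that fits the time budget, since larger samples give lower error.

```python
# Hypothetical catalog of pre-built samples: name -> number of rows.
SAMPLES = {"t_sample_0001": 10_000, "t_sample_001": 100_000, "t_sample_01": 1_000_000}

ROWS_PER_SECOND = 400_000  # assumed scan rate; a real system would calibrate this per cluster

def pick_sample(time_budget_s):
    """Pick the largest sample whose estimated scan time fits within the budget."""
    feasible = {name: rows for name, rows in SAMPLES.items()
                if rows / ROWS_PER_SECOND <= time_budget_s}
    if not feasible:
        return min(SAMPLES, key=SAMPLES.get)   # fall back to the smallest sample
    return max(feasible, key=feasible.get)

print(pick_sample(time_budget_s=2.0))   # e.g., for a query ending in WITHIN 2 SECONDS
```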
BlinkDB Architecture
[Diagram: the new query plan runs on Hive/Spark/Presto over the selected samples and returns the result with error bars and confidence intervals, e.g., 182.23 ± 5.56 (95% confidence).]
Parallel query execution on multiple samples striped across multiple machines.
BlinkDB is Fast!
- 5 queries, 5 machines
- 20 GB samples (0.001%-1% of original data)
- 1-5% error
[Charts: response time in seconds, showing overall query execution, then the added error-estimation overhead, then the added error-verification overhead.]
Coming Soon: Native Spark Integration
BlinkDB Prototype
1. Alpha 0.2.0 released and available at http://blinkdb.org
2. Allows you to create samples on native tables and materialized
views
3. Adds approximate aggregate functions with statistical closed
forms to HiveQL
4. Compatible with Apache Hive, Spark and Facebook’s Presto
(storage, serdes, UDFs, types, metadata)
http://blinkdb.org
Native Spark Integration Coming Soon!

  • 58.