MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
Myria: Scalable Analytics as a Service
Bill Howe, PhD
University of Washington
with Dan Suciu, Magda Balazinska, Dan Halperin, and many students
MMDS 2014, Berkeley CA
Today
• Three observations about Big Data
• Myria: Scalable Analytics as a Service
• Parallel Flow-based Graph Clustering
(if time, but there won’t be)
How can we deliver 1000 little SDSSs
to anyone who wants one?
How much time do you spend “handling
data” as opposed to “doing science”?
Mode answer: “90%”
[Bar chart: results on Benchmark 1 and Benchmark 2 for "Old system", "Your system", and "Our system"; y-axis 0 to 120.]
A typical Computer Science paper….
slide src: Dan Halperin
[The same bar chart with a fourth series, "What people use"; the y-axis now runs to 12,500.]
The reality of the situation….
slide src: Dan Halperin
“[This was hard] due to the large amount of data (e.g. data indexes for data retrieval,
dissection into data blocks and processing steps, order in which steps are performed
to match memory/time requirements, file formats required by software used).
In addition we actually spend quite some time in iterations fixing problems with
certain features (e.g. capping ENCODE data), testing features and feature products
to include, identifying useful test data sets, adjusting the training data (e.g. 1000G vs
human-derived variants)
So roughly 50% of the project was testing and improving the model, 30% figuring out
how to do things (engineering) and 20% getting files and getting them into the right
format.
I guess in total [I spent] 6 months [on this project].”
At least 3 months on issues of
scale, file handling, and feature
engineering.
Martin Kircher, Genome Sciences

Why?
3k NSF postdocs in 2010 × $50k per postdoc × at least 50% overhead ≈ maybe $75M annually, at NSF alone?
Data Science Workflow:
1) Preparing to run a model
2) Running the model
3) Interpreting the results
Gathering, cleaning, integrating, restructuring,
transforming, loading, filtering, deleting, combining,
merging, verifying, extracting, shaping, massaging
“80% of the work”
-- Aaron Kimball
“The other 80% of the work”
Your cool algorithmic problem is not the bottleneck
Observation 1
Symbolic Reasoning and Algebraic Optimization
N = ((z*2)+((z*3)+0))/1
Algebraic Laws:
1. (+) identity: x+0 = x
2. (/) identity: x/1 = x
3. (*) distributes: (n*x+n*y) = n*(x+y)
4. (*) commutes: x*y = y*x
Apply rules 1, 3, 4, 2:
N = (2+3)*z
two operations instead of five, no division operator
Every database does this kind of optimization
every time you issue a query
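To make the analogy concrete, here is a minimal sketch (plain Python, not any particular database's optimizer) of rule-based rewriting over a small expression tree, applying the same laws to N = ((z*2)+((z*3)+0))/1:

# Minimal sketch of rule-based algebraic rewriting (illustrative only).
# Expressions are nested tuples ('op', left, right), numbers, or variable names.
def simplify(e):
    """Apply the rewrite rules bottom-up until no rule fires."""
    if not isinstance(e, tuple):
        return e
    op, a, b = e[0], simplify(e[1]), simplify(e[2])
    if op == '+' and b == 0:                      # Rule 1: x + 0 = x
        return a
    if op == '/' and b == 1:                      # Rule 2: x / 1 = x
        return a
    if (op == '+' and isinstance(a, tuple) and isinstance(b, tuple)
            and a[0] == b[0] == '*' and a[1] == b[1]):
        # Rules 3 + 4 together: z*x + z*y = z*(x+y) = (x+y)*z
        return simplify(('*', ('+', a[2], b[2]), a[1]))
    return (op, a, b)

# N = ((z*2) + ((z*3) + 0)) / 1
expr = ('/', ('+', ('*', 'z', 2), ('+', ('*', 'z', 3), 0)), 1)
print(simplify(expr))   # ('*', ('+', 2, 3), 'z')  i.e. (2+3)*z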
SELECT x.strain, x.chr, x.region as snp_region, x.start_bp as snp_start_bp
, x.end_bp as snp_end_bp, w.start_bp as nc_start_bp, w.end_bp as nc_end_bp
, w.category as nc_category
, CASE WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
THEN x.end_bp - x.start_bp + 1
WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
THEN x.end_bp - w.start_bp + 1
WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
THEN w.end_bp - x.start_bp + 1
END AS len_overlap
FROM [koesterj@washington.edu].[hotspots_deserts.tab] x
INNER JOIN [koesterj@washington.edu].[table_noncoding_positions.tab] w
ON x.chr = w.chr
WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
ORDER BY x.strain, x.chr ASC, x.start_bp ASC
Non-programmers can write very complex queries
(rather than relying on staff programmers)
Example: Computing the overlaps of two sets of BLAST results
We see thousands of
queries written by
non-programmers
Howe, et al., CISE 2013
Steven Roberts
SQL as a lab notebook: http://bit.ly/16Xj2JP
[Workflow figures: a spreadsheet-based pipeline that takes a GFF of methylated CG locations, a GFF of all CG locations, a GFF of all genes, and gene descriptions through repeated joins, column reorders, counts, compute/trim steps, and one misstep (a join with the wrong fill) to calculate the number of methylated CGs, the number of all CGs, and the methylation ratio linked to gene descriptions; followed by the same analysis expressed as a handful of declarative steps over the same four inputs.]
Popular service for Bioinformatics Workflows
A = LOAD('points.txt', id:int, x:float, y:float)
E = LIMIT(A, 4);
F = SEQUENCE();
Centroids = [FROM E EMIT (id=F.next, x=E.x, y=E.y)];
Kmeans = [FROM A EMIT (id=id, x=x, y=y, cluster_id=0)]
DO
I = CROSS(Kmeans, Centroids);
J = [FROM I EMIT (Kmeans.id, Kmeans.x, Kmeans.y, Centroids.cluster_id,
$distance(Kmeans.x, Kmeans.y, Centroids.x, Centroids.y))];
K = [FROM J EMIT id, distance=$min(distance)];
L = JOIN(J, id, K, id)
M = [FROM L WHERE J.distance <= K.distance EMIT
(id=J.id, x=J.x, y=J.y, cluster_id=J.cluster_id)];
Kmeans' = [FROM M EMIT (id, x, y, $min(cluster_id))];
Delta = DIFF(Kmeans', Kmeans)
Kmeans = Kmeans'
Centroids = [FROM Kmeans' EMIT (cluster_id, x=avg(x), y=avg(y))];
WHILE Delta != {}
K-Means in relational algebra
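For readers less used to the relational phrasing, here is a rough plain-Python mirror of the same loop (illustrative only, not Myria code; the points and k below are made up): cross points with centroids, keep each point's nearest centroid, recompute centroids, and stop when assignments stop changing.

# Rough Python mirror of the relational K-means above (illustrative sketch).
import math

points = [(0, 1.0, 1.0), (1, 1.5, 2.0), (2, 8.0, 8.0), (3, 8.5, 9.0)]   # (id, x, y)
k = 2
centroids = {cid: (x, y) for cid, (_, x, y) in enumerate(points[:k])}    # LIMIT(A, k)
assign = {pid: 0 for pid, _, _ in points}                                # cluster_id = 0

while True:
    # CROSS points with centroids, keep each point's nearest centroid (the min-distance join)
    new_assign = {}
    for pid, x, y in points:
        new_assign[pid] = min(
            centroids,
            key=lambda c: math.hypot(x - centroids[c][0], y - centroids[c][1]))
    if new_assign == assign:            # DIFF(Kmeans', Kmeans) is empty: converged
        break
    assign = new_assign
    # Recompute centroids as per-cluster averages (the GROUP BY cluster_id / AVG step)
    for cid in centroids:
        members = [(x, y) for pid, x, y in points if assign[pid] == cid]
        if members:
            centroids[cid] = (sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members))

print(assign, centroids)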
“SQL” vs. “ML” is a false dichotomy
Observation 2
(Relational Algebra) (Linear Algebra)
• SIGMOD 2009: Vertica 100x < Hadoop (Grep, Aggregation, Join)
• VLDB 2010: HaLoop ~100x < Hadoop (PageRank, ShortestPath)
• SIGMOD 2010: Pregel (no comparisons)
• HotCloud 2010: Spark ~100x < Hadoop (logistic regression, ALS)
• ICDE 2011: SystemML ~100x < Hadoop
• ICDE 2011: Asterix ~100x < Hadoop (K-Means)
• VLDB 2012: GraphLab ~100x < Hadoop, GraphLab 5x > Pregel, GraphLab ~ MPI (Recommendation/ALS, CoSeq/GMM, NER)
• NSDI 2012: Spark 20x < Hadoop (logistic regression, PageRank)
• VLDB 2012: Asterix (no comparisons)
• SIGMOD 2013: Cumulon 5x < SystemML
• VLDB 2013: Giraph ~ GraphLab (Connected Components)
• SIGMOD 2014: SimSQL vs. Spark vs. GraphLab vs. Giraph (GMM,
Bayesian regression, HMM, LDA, imputation)
• VLDB 2014: epiC ~ Impala, epiC ~ Shark, epiC 2x < Hadoop (Grep, Sort,
TPC-H Q3, PageRank)
A quick meta-analysis of some Big Data systems literature
[Timeline figure, 2008-2014: Hadoop and Hive (Thusoo) at the start; then Pregel (Malewicz), HaLoop (Bu), Spark (Zaharia), Vertica (Pavlo), Dremel (Melnik), SystemML (Ghoting), Hyracks (Borkar), GraphLab (Low), Cumulon (Huang), Giraph (Tian), Impala (Cloudera), Shark (Xin), SimSQL (Cai), and epiC (Jiang). Comparison arrows between systems are labeled "~100x faster", "faster", or "comparable or inconclusive"; the early years are captioned "The good old days", the later ones "The age of uncertainty".]
Anything based on Hadoop is 100x
slower than the state-of-the-art
Observation 3
…but the rest of the story is not clear
1) Big Data experiments are ridiculously labor-intensive
– N systems x M real-world applications
– Big clusters and big datasets
2) No “one size fits all” solution
– Realistic environments will use more than one system
3) A return to distributed, federated databases
– Erase the distinction between ETL and Analytics
We need big data middleware
Relational Analytics-as-a-Service
Version 2
http://myria.cs.washington.edu
Magda Balazinska, Bill Howe, and Dan Suciu
Dan Halperin (technical lead)
Victor Almeida
Andrew Whitaker
PhD Students
Shumo Chu
Eric Gribkoff
Jeremy Hyrkas
Paris Koutris
Ryan Maas
Dominik Moritz
Laurel Orr
Jennifer Ortiz
Emad Soroush
Jingjing Wang
ShengLiang Xu
Undergraduate Students
Lee Lee Choo
Vaspol Ruamviboonsuk
Myria Team
Myria is…
• MyriaQ: An optimizing compiler and
middleware for multiple iterative source
languages and multiple target big data
systems
• MyriaX: A parallel, shared-nothing,
iterative execution engine
• MyriaWeb: An IDE and RESTful service
for algorithm development
Myria Architecture
[Architecture figure: a Web UI and REST server sit in front of a coordinator holding the language parser, the Myria compiler, a logical optimizer for RA+While, and a catalog. MyriaQ (Python) emits a JSON query plan; MyriaX (Java) workers, each with a local catalog and an RDBMS reached over JDBC (HDFS is also shown), execute it and communicate via netty protocols. A C compiler / Grappa path, SciDB, and Hadoop appear as alternative backends. Source languages: Datalog, SQL, MyriaL.]
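The slide only names the moving parts (a REST server accepting a JSON query plan). Purely as a hypothetical sketch of what driving such a service might look like, with a made-up endpoint, port, and plan schema rather than Myria's actual API:

# Hypothetical sketch of submitting a JSON query plan to a coordinator's REST server.
# The endpoint path, port, and plan fields below are illustrative assumptions only.
import json
import urllib.request

plan = {
    "rawQuery": "T = SCAN(public:adhoc:sc_points); STORE(T, out);",  # made-up example
    "logicalRa": "Scan -> Store",
    "plan": {"type": "SubQuery", "fragments": []},                   # placeholder fragments
}

req = urllib.request.Request(
    "http://localhost:8753/query",          # hypothetical coordinator address
    data=json.dumps(plan).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:   # a real service would return a query ID/status
    print(resp.status, resp.read().decode())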
MyriaQ
[Compiler-stack figure: source languages (MyriaL, Datalog, SQL, and more to come) are translated into a common intermediate form, Relational Algebra + Iteration; per-backend compilers then target MyriaX, Grappa, serial C++, SciDB, and SQL engines.]
Application domains: Oceanography, Astronomy, Biology, Medical Informatics
Ex: SeaFlow
[Instrument schematic: a laser, microscope objective, and pinhole lens aimed at the sample stream from a nozzle; detectors at distances d1 and d2 measure forward scatter (FSC), plus orange and red fluorescence.]
Francois Ribalet, Jarred Swalwell, Ginger Armbrust
Ex: SeaFlow
[Cytogram panels: "ps3.fcs…Focus" (D1/FSC vs. D2/FSC), "ps3.fcs…subset" (692-40 RED fluorescence vs. FSC) separating Picoplankton from Nanoplankton, and "P35-surf" (580-30 vs. FSC) separating Prochlorococcus, Ultraplankton, and small stuff; all axes are log-scale, 10^0 to 10^4.]
• Continuous observations of various phytoplankton groups from 1-20 µm in size
• Based on RED fluo: Prochlorococcus, Pico-, Ultra- and Nanoplankton
• Based on ORANGE fluo: Synechococcus, Cryptophytes
• Based on FSC: Coccolithophores
Francois Ribalet, Jarred Swalwell, Ginger Armbrust
SeaFlow in Myria
• “That 5-line MyriaL program was 100x faster than my R cluster,
and much simpler”
Dan Halperin, Sophie Clayton

Lowering barrier to entry

Algorithmic insight
Shumo Chu, Dominik Moritz

Performance analysis
[Figure: per-worker matrix, source node vs. destination node.]
Shumo Chu, Dominik Moritz
A = LOAD('points.txt', id:int, x:float, y:float)
E = LIMIT(A, 4);
F = SEQUENCE();
Centroids = [FROM E EMIT (id=F.next, x=E.x, y=E.y)];
Kmeans = [FROM A EMIT (id=id, x=x, y=y, cluster_id=0)]
DO
I = CROSS(Kmeans, Centroids);
J = [FROM I EMIT (Kmeans.id, Kmeans.x, Kmeans.y, Centroids.cluster_id,
$distance(Kmeans.x, Kmeans.y, Centroids.x, Centroids.y))];
K = [FROM J EMIT id, distance=$min(distance)];
L = JOIN(J, id, K, id)
M = [FROM L WHERE J.distance <= K.distance EMIT
(id=J.id, x=J.x, y=J.y, cluster_id=J.cluster_id)];
Kmeans' = [FROM M EMIT (id, x, y, $min(cluster_id))];
Delta = DIFF(Kmeans', Kmeans)
Kmeans = Kmeans'
Centroids = [FROM Kmeans' EMIT (cluster_id, x=avg(x), y=avg(y))];
WHILE Delta != {}
K-Means in the language MyriaL
CurGood = SCAN(public:adhoc:sc_points);
DO
mean = [FROM CurGood EMIT val=AVG(v)];
std = [FROM CurGood EMIT val=STDEV(v)];
NewBad = [FROM CurGood WHERE ABS(CurGood.v - mean) > 2 * std EMIT *];
CurGood = CurGood - NewBad;
continue = [FROM NewBad EMIT COUNT(NewBad.v) > 0];
WHILE continue;
DUMP(CurGood);
Sigma-clipping, V0
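For reference, a direct Python transcription of the same naive idea (a sketch, not MyriaL), recomputing the mean and standard deviation over the surviving points on every pass:

# Naive sigma-clipping: recompute mean/stdev over the surviving points each pass.
import statistics

def sigma_clip(values, k=2.0):
    good = list(values)
    while True:
        mean = statistics.mean(good)
        std = statistics.stdev(good)                       # sample standard deviation
        bad = [v for v in good if abs(v - mean) > k * std]
        if not bad:                                        # nothing left to clip
            return good
        good = [v for v in good if abs(v - mean) <= k * std]

print(sigma_clip([9.9, 10.1, 10.0, 9.8, 10.2, 55.0]))      # the outlier 55.0 is removed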
CurGood = P
sum = [FROM CurGood EMIT SUM(val)];
sumsq = [FROM CurGood EMIT SUM(val*val)]
cnt = [FROM CurGood EMIT CNT(*)];
NewBad = []
DO
sum = sum - [FROM NewBad EMIT SUM(val)];
sumsq = sumsq - [FROM NewBad EMIT SUM(val*val)];
cnt = cnt - [FROM NewBad EMIT CNT(*)];
mean = sum / cnt
std = sqrt(1/(cnt*(cnt-1)) * (cnt * sumsq - sum*sum))
NewBad = FILTER([ABS(val-mean) > 2 * std], CurGood)
CurGood = CurGood - NewBad
WHILE NewBad != {}
Sigma-clipping, V1: Incremental
Points = SCAN(public:adhoc:sc_points);
aggs = [FROM Points EMIT _sum=SUM(v), sumsq=SUM(v*v), cnt=COUNT(v)];
newBad = []
bounds = [FROM Points EMIT lower=MIN(v), upper=MAX(v)];
DO
new_aggs = [FROM newBad EMIT _sum=SUM(v), sumsq=SUM(v*v), cnt=COUNT(v)];
aggs = [FROM aggs, new_aggs EMIT _sum=aggs._sum - new_aggs._sum,
sumsq=aggs.sumsq - new_aggs.sumsq, cnt=aggs.cnt - new_aggs.cnt];
stats = [FROM aggs EMIT mean=_sum/cnt,
std=SQRT(1.0/(cnt*(cnt-1)) * (cnt * sumsq - _sum * _sum))];
newBounds = [FROM stats EMIT lower=mean - 2 * std, upper=mean + 2 * std];
tooLow = [FROM Points, bounds, newBounds WHERE newBounds.lower > v
AND v >= bounds.lower EMIT v=Points.v];
tooHigh = [FROM Points, bounds, newBounds WHERE newBounds.upper < v
AND v <= bounds.upper EMIT v=Points.v];
newBad = UNIONALL(tooLow, tooHigh);
bounds = newBounds;
continue = [FROM newBad EMIT COUNT(v) > 0];
WHILE continue;
output = [FROM Points, bounds WHERE Points.v > bounds.lower AND
Points.v < bounds.upper EMIT v=Points.v];
DUMP(output);
Sigma-clipping, V2
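The same incremental trick in plain Python, as a rough sketch: keep a running sum, sum of squares, and count, subtract out newly clipped points instead of re-aggregating, and only test points that fall between the old and new bounds.

# Incremental sigma-clipping: maintain sum, sum of squares, and count,
# subtracting newly clipped points rather than rescanning everything.
import math

def sigma_clip_incremental(values, k=2.0):
    pts = list(values)
    s = sum(pts)
    sq = sum(v * v for v in pts)
    cnt = len(pts)
    lower, upper = min(pts), max(pts)
    while True:
        mean = s / cnt
        std = math.sqrt((cnt * sq - s * s) / (cnt * (cnt - 1)))
        new_lower, new_upper = mean - k * std, mean + k * std
        # Only points between the old and new bounds can be newly clipped.
        new_bad = [v for v in pts
                   if (lower <= v < new_lower) or (new_upper < v <= upper)]
        if not new_bad:
            return [v for v in pts if new_lower <= v <= new_upper]
        s -= sum(new_bad)
        sq -= sum(v * v for v in new_bad)
        cnt -= len(new_bad)
        lower, upper = new_lower, new_upper

print(sigma_clip_incremental([9.9, 10.1, 10.0, 9.8, 10.2, 55.0]))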
Takeaways
• Myria: Analytics-as-a-Service
– Lower barrier to entry, iterative processing, state-of-the-art internals
• Blur the distinction between “Query” and “Algorithm”
• Relational Algebra is at least as important as Linear Algebra
http://escience.washington.edu
@billghowe
billhowe@cs.washington.edu
http://myria.cs.washington.edu
https://demo.myria.cs.washington.edu/
(Relational Algebra) (Linear Algebra)
Huffman coding refresher
symbol : frequency (descending)
Rosvall and Bergstrom 2007,
2010
A Random Walk….
http://www.mapequation.org/apps/MapDemo.html
…generates a sequence of symbols with frequencies, so we can generate a Huffman code for that sequence….
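As a refresher, here is a minimal Huffman-code builder over a symbol:frequency table (the standard textbook construction, not tied to the map-equation implementation; the frequencies below are made up):

# Minimal Huffman code construction over a symbol:frequency table.
import heapq
from itertools import count

def huffman(freqs):
    """Return {symbol: bitstring} for a dict of symbol frequencies."""
    tiebreak = count()                       # keeps heap comparisons well-defined
    heap = [(f, next(tiebreak), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)      # two least-frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

# Visit frequencies produced by a random walk, say:
print(huffman({"a": 0.45, "b": 0.25, "c": 0.20, "d": 0.10}))
# frequent symbols get short codewords, e.g. a -> 0, b -> 10, d -> 110, c -> 111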
A two-level coding:
• a global index codebook indicates which module you are in (derived from the relative rates at which a random walker enters each module)
• a local module codebook indicates which vertex is visited (derived from the relative rates at which a random walker visits each node OR exits the module)
Rosvall and Bergstrom 2007, 2010
MapEquation intuition
With a bad two-level encoding, you might be
frequently jumping between modules
With a bad two-level encoding, your modules might
have too many vertices and require long codebooks
A good, short encoding means a walker spends a lot
of time within modules rather than moving between
them, while keeping module size to a minimum
Rosvall and Bergstrom 2007,
2010
A good graph clustering
MapEquation
Rosvall and Bergstrom 2007, 2010
[Equation figure: the first term describes movements between modules; the second describes movements within module i.]
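The equation itself appears only as an image on the slide; for reference, the map equation as Rosvall and Bergstrom define it in the cited papers is

L(\mathsf{M}) = \underbrace{q_{\curvearrowright}\, H(\mathcal{Q})}_{\text{movements between modules}}
              + \underbrace{\sum_{i=1}^{m} p_{\circlearrowright}^{i}\, H(\mathcal{P}^{i})}_{\text{movements within module } i}

where q↷ is the rate at which the walker switches modules, H(Q) is the entropy of the module-name codebook, p↻^i is the fraction of steps spent within (or exiting) module i, and H(P^i) is the entropy of module i's codebook.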
Third-party benchmarks
Lancichinetti, Fortunato, “Community detection algorithms: a comparative analysis”, Phys. Rev. E, 2009
“We conclude that the Infomap method by Rosvall and Bergstrom
is the best performing on the set of benchmarks we have
examined here.”
Serial Algorithm (simplified)
Compute visit probability of each vertex (PageRank)
While the code length L has not converged:
Put the vertices in random order
For each vertex v:
greedily move v to best neighboring module
do global bookkeeping
….plus several optimizations
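A rough Python sketch of that greedy loop, with a deliberately simplified stand-in objective (the real algorithm scores each move by its change in the map-equation code length L; here a vertex simply joins the neighboring module to which it has the most edge weight):

# Greedy module-move sketch (simplified stand-in for the serial algorithm;
# the real method scores each move by its change in map-equation code length).
import random
from collections import defaultdict

def greedy_modules(edges, num_vertices, max_sweeps=50):
    module = {v: v for v in range(num_vertices)}           # start with one module per vertex
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    for _ in range(max_sweeps):
        moved = False
        order = list(range(num_vertices))
        random.shuffle(order)                               # "put the vertices in random order"
        for v in order:
            weight_to = defaultdict(float)
            for u, w in adj[v]:
                weight_to[module[u]] += w                   # placeholder score: edge weight per module
            if weight_to:
                best = max(weight_to, key=weight_to.get)
                if best != module[v]:
                    module[v] = best                        # greedily move v to its best neighboring module
                    moved = True
        if not moved:                                       # stand-in for "code length converged"
            break
    return module

# Two triangles joined by a weak edge: expect two modules.
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
         (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0), (2, 3, 0.1)]
print(greedy_modules(edges, 6))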
How do we parallelize it?
Naïve 1st Attempt: Drop all locks
Compute visit probability of each vertex (PageRank)
While the code length L has not converged:
Put the vertices in random order
For each vertex v:
greedily move v to best neighboring module
do global bookkeeping
parallel
serial, but fast
serial
Naïve lock-free scheme
converges prematurely
Seung-Hee
Bae
ICDM 2013
How do we parallelize it?
2nd Attempt: RelaxMap
Compute visit probability of each vertex (PageRank)
While the code length L has not converged:
Put the vertices in random order
For each vertex v:
greedily move v to best neighboring module
grab a lock
do global bookkeeping
parallel
serial, but fast
serial
ICDM 2013
Seung-Hee
Bae
ICDM 2013
Converges faster
Same quality
Parallel efficiency is … ok
Seung-Hee
Bae
ICDM 2013
Side excursion: Prioritization
Observation: Certain vertices contribute to improving
the objective function more than others.
Which ones?
[Figure: a vertex v with its neighbors (n1, n2, n3), its module siblings (m1, m2), candidate modules c1, c2, c3, and vertices mn1, mn2, mn3 in neighboring modules Ma, Mb, Mc.]
Seung-Hee
Bae
DMKD 2014
(submitted)
vertex neighbors are red; module siblings are blue; vertices in neighboring modules are green
Seung-Hee
Bae
DMKD 2014
(submitted)
Seung-Hee
Bae
DMKD 2014
(submitted)
How do we parallelize it?
3rd Attempt:
Approximate the objective function by just
moving every vertex along its heaviest
edge
Ignore the terms that require
synchronization
How do we parallelize it?
4th Attempt: Fully Asynchronous +
Gossiping
Each vertex
1) tells its neighbors when it
moves
2) propagates other messages
To decide whether to move, just
use the information you have
Seung-Hee
Bae
A closer look at an example
ROI(id, start, stop) is a set of “regions of interest”
Read(id, start, stop) is a set of “reads” from a sequencer
Task: For each region of interest, count the number
of reads it contains
[Figure: a read interval (start, stop) contained within a region-of-interest interval (start, stop).]
SELECT roi.id, count(rd.id)
FROM regions_of_interest roi, reads rd
WHERE roi.start <= rd.start AND rd.[end] <= roi.[end]
GROUP BY roi.id​
As a query
“region of interest”
sequence “read”
SELECT roi.id, count(rd.start)
FROM regions_of_interest roi, reads rd
WHERE roi.start <= rd.start AND rd.[end] <= roi.[end]
GROUP BY roi.id​
Why databases get
a bad reputation
many minutes
SELECT roi.id, count(rd.start) as cnt
FROM regions_of_interest roi, indexed_reads rd
WHERE roi.start <= rd.start AND rd.start <= roi.[end]
AND roi.start <= rd.[end] AND rd.[end] <= roi.[end]
GROUP BY roi.id
3 seconds!
[Figure: the roi and read intervals, annotated "two-sided index scan" and "one-sided index scan, plus filter".]
The broken promise of declarative query…
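To play with the same effect locally, here is a small sketch using SQLite from Python (made-up data; whether a given planner exploits the redundant bound, and how much it helps, will differ from the system timed on the slide):

# Local experiment with the same rewrite, using SQLite (illustrative only;
# planner behavior and timings differ from the system on the slide).
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE regions_of_interest (id INTEGER, start INTEGER, stop INTEGER);
    CREATE TABLE reads (id INTEGER, start INTEGER, stop INTEGER);
    CREATE INDEX idx_reads_start ON reads(start);
""")
rois, reads = [], []
for i in range(1000):
    s = random.randrange(1_000_000)
    rois.append((i, s, s + 5000))
for i in range(10_000):
    s = random.randrange(1_000_000)
    reads.append((i, s, s + 100))
conn.executemany("INSERT INTO regions_of_interest VALUES (?, ?, ?)", rois)
conn.executemany("INSERT INTO reads VALUES (?, ?, ?)", reads)

# Containment query plus the redundant predicate rd.start <= roi.stop,
# which gives the planner a range on the indexed reads(start) column.
rows = conn.execute("""
    SELECT roi.id, COUNT(rd.id)
    FROM regions_of_interest roi, reads rd
    WHERE roi.start <= rd.start AND rd.start <= roi.stop   -- redundant bound enables a range scan
      AND rd.stop <= roi.stop
    GROUP BY roi.id
""").fetchall()
print(len(rows))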
Maslow’s Needs Hierarchy
“As each need is satisfied, the
next higher level in the hierarchy
dominates conscious functioning.”
-- Maslow, 1943
A “Needs Hierarchy” of Science Data Management
storage
sharing
query
integration
analytics
“As each need is satisfied, the
next higher level in the hierarchy
dominates conscious functioning.”
-- Maslow, 1943
A “Needs Hierarchy” of Science Data Management
storage
sharing
integration
query
analytics
“As each need is satisfied, the
next higher level in the hierarchy
dominates conscious functioning.”
-- Maslow, 1943

Editor's Notes

  • #26 (SeaFlow): advantages and inconveniences of sheath fluid (aligning particles with the laser, sheath fluid replacement, loading samples into the instrument) versus a sheathless design
  • #55-56 (fully asynchronous + gossiping): like HogWild