The Other HPC: High
Productivity Computing in
Polystore Environments
Bill Howe, Ph.D.
Associate Director, eScience Institute
Senior Data Science Fellow, eScience Institute
Affiliate Associate Professor, Computer Science & Engineering
11/23/2015
What is the rate-limiting step in data understanding?
[Chart: amount of data in the world and processing power (Moore's Law) both grow exponentially over time]
What is the rate-limiting step in data understanding?
[Chart: amount of data in the world and processing power (Moore's Law) keep growing exponentially, but human cognitive capacity stays flat]
Idea adapted from "Less is More" by Bill Buxton (2001)
slide src: Cecilia Aragon, UW HCDE
How much time do you spend “handling
data” as opposed to “doing science”?
Mode answer: “90%”
“[This was hard] due to the large amount of data (e.g. data indexes for data retrieval,
dissection into data blocks and processing steps, order in which steps are performed
to match memory/time requirements, file formats required by software used).
In addition we actually spend quite some time in iterations fixing problems with
certain features (e.g. capping ENCODE data), testing features and feature products
to include, identifying useful test data sets, adjusting the training data (e.g. 1000G vs
human-derived variants)
So roughly 50% of the project was testing and improving the model, 30% figuring out
how to do things (engineering) and 20% getting files and getting them into the right
format.
I guess in total [I spent] 6 months [on this project].”
At least 3 months on issues of
scale, file handling, and feature
extraction.
Martin Kircher,
Genome Sciences
Why?
3k NSF postdocs in 2010
$50k / postdoc
at least 50% overhead
maybe $75M annually
at NSF alone?
Where does the time go? (2)
Productivity
How long I have to wait for results
[Time axis: milliseconds · seconds · minutes · hours · days · weeks · months]
HPC
Systems
Databases
feasibility threshold
interactivity threshold
These two performance
thresholds are really important;
other requirements are
situation-specific
[Diagram: data models (Table, Graph, Array, Matrix, Key-Value, Dataframe) and the systems built around them (MATLAB, GEMS, GraphX, Neo4J, Dato, RDBMS, HIVE, Spark, R, Pandas, Ibis, Accumulo, SciDB, HDF5, Myria), unified by a Polystore Algebra]
Desiderata for a Polystore Algebra
• Captures user intent
• Affords reasoning and optimization
• Accommodates best-known algorithms
Why do we care? Algebraic Optimization
N = ((z*2)+((z*3)+0))/1
Algebraic Laws:
1. (+) identity: x+0 = x
2. (/) identity: x/1 = x
3. (*) distributes: (n*x+n*y) = n*(x+y)
4. (*) commutes: x*y = y*x
Apply rules 1, 3, 4, 2:
N = (2+3)*z
two operations instead of five, no division operator
Same idea works with the Relational Algebra
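To make the rewrite idea concrete, here is a minimal Python sketch of rule-based rewriting over an expression tree; this is illustrative only, not the raco implementation, and the representation and rule set are assumptions.

# Minimal sketch (not the raco implementation) of rule-based rewriting
# over an arithmetic expression tree; representation and rules are illustrative.

class Op:
    def __init__(self, kind, left=None, right=None, value=None):
        self.kind, self.left, self.right, self.value = kind, left, right, value

def const(c): return Op("const", value=c)
def var(n): return Op("var", value=n)
def add(a, b): return Op("+", a, b)
def mul(a, b): return Op("*", a, b)
def div(a, b): return Op("/", a, b)

def rewrite(e):
    # Rewrite bottom-up: children first, then try each rule at this node.
    if e.kind in ("const", "var"):
        return e
    e = Op(e.kind, rewrite(e.left), rewrite(e.right))
    if e.kind == "+" and e.right.kind == "const" and e.right.value == 0:
        return e.left                      # (+) identity: x + 0 = x
    if e.kind == "/" and e.right.kind == "const" and e.right.value == 1:
        return e.left                      # (/) identity: x / 1 = x
    if (e.kind == "+" and e.left.kind == "*" and e.right.kind == "*"
            and e.left.left.kind == "var" and e.right.left.kind == "var"
            and e.left.left.value == e.right.left.value):
        # (*) distributes (with commutativity): z*a + z*b = z*(a + b)
        return rewrite(mul(e.left.left, add(e.left.right, e.right.right)))
    if e.left.kind == "const" and e.right.kind == "const" and e.kind in ("+", "*"):
        f = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}[e.kind]
        return const(f(e.left.value, e.right.value))   # constant folding
    return e

# N = ((z*2) + ((z*3) + 0)) / 1  rewrites to  z * 5
z = var("z")
N = div(add(mul(z, const(2)), add(mul(z, const(3)), const(0))), const(1))
optimized = rewrite(N)   # Op("*", var z, const 5)

A relational optimizer works the same way, except the tree nodes are operators like select, project, and join, and the rules are algebraic laws over relations.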
The Myria Algebra is…
Relational Algebra
+ While / Sequence
+ Flatmap
+ Window Ops
+ Sample
(+ Dimension Bounds)
https://github.com/uwescience/raco/
[Architecture diagram: MyriaL programs compile into a Polystore Algebra; rewrite rules translate it to back-end algebras (Parallel Algebra, Array Algebra) behind the MyriaX, Radish, SciDB, and Graph APIs, over the MyriaX, Radish, SciDB, and GEMS engines; the middleware adds orchestration plus services for visualization, logging, discovery, history, and browsing]
How does this actually work?
(1) Client submits a program
in one of several Big Data
languages….
(2) Program is parsed as
an expression tree….
(or programs directly against the API…)
(3) Expression tree is optimized
into a parallel, federated
execution plan involving one
or more Big Data platforms.
(4) Depending on the back end,
parallel plan may be directly
compiled into executable
code
How does this actually work?
(5) Orchestrates the parallel,
federated plan execution
across the platforms
[Diagram: Client → MyriaQ → Sys1, Sys2]
How does this actually work?
(6) Exposes query execution
logs and results through
a REST API and a visual
web-based interface
How does this actually work?
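A hypothetical client-side sketch of this round trip in Python follows; the host, port, endpoint paths, and payload fields are illustrative assumptions, not the actual Myria REST API.

# Hypothetical sketch of the client workflow described above; consult the
# Myria documentation for the real REST interface.
import requests

COORDINATOR = "http://localhost:8753"   # assumed middleware address

program = """
points = scan(public:adhoc:sc_points);
kept = select * from points where v > 100;
store(kept, public:adhoc:kept_points);
"""

# (1) Submit a MyriaL program; the middleware parses it into an expression
#     tree, optimizes it into a federated plan, and orchestrates execution.
resp = requests.post(f"{COORDINATOR}/query",
                     json={"language": "MyriaL", "query": program})
resp.raise_for_status()
query_id = resp.json()["queryId"]

# (6) The same REST interface exposes status, logs, and results.
status = requests.get(f"{COORDINATOR}/query/{query_id}").json()
print(status["status"])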
What can you do with a Polystore Algebra?
1) Facilitate Experiments
– Provide reference implementations
– Apply shared optimizations for apples-to-apples
comparisons
– K-means, Markov chain, Naïve Bayes, TPC-H,
Betweenness Centrality, Sigma-clipping, Linear
Algebra
– LANL is using this idea to express algorithms that solve the
governing equations of heat transfer models!
What can you do with a Polystore Algebra?
2) Rapidly develop new applications
– Microbial Oceanography
– Neuroanatomy
– Music Analytics
– Video Analytics
– Clinical Analytics
– Astronomical Image de-noising
Ex: SeaFlow
[Instrument diagram: laser, microscope objective, pinhole lens, and nozzle (d1, d2), with detectors for FSC (forward scatter), orange fluorescence, and red fluorescence]
Francois Ribalet
Jarred Swalwell
Ginger Armbrust
Ex: SeaFlow
[Cytograms on log-log axes: 692-40 RED fluorescence vs. FSC, with gated populations labeled Prochlorococcus, Picoplankton, Ultraplankton, Nanoplankton, and small stuff (ps3.fcs subset; P35-surf; 580-30 IS)]
• Continuous observations of various phytoplankton groups from 1–20 µm in size
• Based on RED fluo: Prochlorococcus, Pico-, Ultra- and Nanoplankton
• Based on ORANGE fluo: Synechococcus, Cryptophytes
• Based on FSC: Coccolithophores
Francois
Ribalet
Jarred
Swalwell
Ginger
Armbrust
SeaFlow in Myria
• “That 5-line MyriaL program was 100x faster than my R cluster,
and much simpler”
Dan Halperin Sophie Clayton
select a.annotation
, var_samp(d.density) as var
from density d join annotation a
on d.x = a.x
and d.y = a.y
and d.z = a.z
group by a.annotation
order by var desc
limit 10
Sample variance by annotation
across all experiments
Are two regions connected?
adjacent(r1, r2) :-
annotation(experiment, x1, y1, z1, r1),
annotation(experiment, x2, y2, z2, r2),
x2 = x1+1 or y2 = y1+1 or z2 = z1+1
connected(r1, r2) :- adjacent(r1,r2)
connected(r1, r3) :- connected(r1, r2), adjacent(r2, r3)
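For intuition, here is a small Python sketch of how such recursive rules can be evaluated by semi-naive iteration; modeling adjacent as a set of pairs already derived from the annotation table is an illustrative simplification.

# Sketch of evaluating the recursive connected() rules with semi-naive iteration.
def transitive_closure(adjacent):
    connected = set(adjacent)      # connected(r1, r2) :- adjacent(r1, r2)
    delta = set(adjacent)          # facts derived in the previous round
    while delta:
        # connected(r1, r3) :- connected(r1, r2), adjacent(r2, r3)
        # Joining only the new facts avoids re-deriving everything each round.
        new = {(r1, r3)
               for (r1, r2) in delta
               for (a, r3) in adjacent if a == r2} - connected
        connected |= new
        delta = new
    return connected

print(transitive_closure({("A", "B"), ("B", "C"), ("C", "D")}))
# adds ("A", "C"), ("A", "D"), ("B", "D") to the input pairs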
Music Analytics
segments = scan(Jeremy:MSD:SegmentsTable);
songs = scan(Jeremy:MSD:SongsTable);
seg_count = select song_id, count(segment_number) as c
            from segments;
density = select songs.song_id,
                 (seg_count.c / songs.duration) as density
          from songs, seg_count
          where songs.song_id = seg_count.song_id;
store(density, public:adhoc:song_density);
Computing song density
Million-Song Dataset
http://musicmachinery.com/2011/09/04/how-to-process-a-million-songs-in-20-minutes/
Blog post on how to run it in 20 minutes on Hadoop…
-- calculate probability of outcomes
Poe = select input_sp.id as inputId,
sum(CondP.lp) as lprob,
CondP.outcome as outcome
from CondP, input_sp
where CondP.index=input_sp.index
and CondP.value=input_sp.value;
-- select the max probability outcome
classes = select inputId,
ArgMax(outcome, lprob)
from Poe;
Naïve Bayes Classification:
Million Song Dataset
Predict song year in a 515,345-song
dataset using eight timbre features,
discretized into intervals of size 10
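A minimal Python sketch of what the two statements above compute, with dictionaries standing in for the relations; the layouts of CondP and input_sp here are illustrative assumptions.

# Sum conditional log-probs per (input, outcome), then take the argmax.
from collections import defaultdict

def classify(inputs, cond_logp):
    # inputs:    {input_id: [(feature_index, discretized_value), ...]}
    # cond_logp: {(feature_index, discretized_value, outcome): log-prob}
    by_feature = defaultdict(list)           # hash index on the join key
    for (index, value, outcome), lp in cond_logp.items():
        by_feature[(index, value)].append((outcome, lp))

    # Poe: join inputs with CondP, summing log-probs per (input, outcome)
    lprob = defaultdict(float)
    for input_id, features in inputs.items():
        for feat in features:
            for outcome, lp in by_feature.get(feat, []):
                lprob[(input_id, outcome)] += lp

    # classes: ArgMax(outcome, lprob) per input
    best = {}
    for (input_id, outcome), lp in lprob.items():
        if input_id not in best or lp > best[input_id][1]:
            best[input_id] = (outcome, lp)
    return {i: outcome for i, (outcome, _) in best.items()}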
[Plots: average heart rate (beats/minute) and average relative heart rate variance vs. time (hours), with annotations "bad data?" and "lower heart rate variance"]
MIMIC Information Flow
[Diagram: Client (headless Octave + web interface) → Myria Middleware (REST interface, optimization, orchestration) → MyriaX (structured data) and SciDB (waveform data); the client/server boundary sits at the middleware]
https://metanautix.com/tr/01_big_data_techniques_for_media_graphics.pdf
Ollie Lo, Los Alamos National Lab
What can you do with a Polystore Algebra?
3) Reason about algorithms
• Apply application-specific optimizations (in addition to
automatic optimizations)
CurGood = SCAN(public:adhoc:sc_points);
DO
mean = [FROM CurGood EMIT val=AVG(v)];
std = [FROM CurGood EMIT val=STDEV(v)];
NewBad = [FROM CurGood WHERE ABS(CurGood.v - mean) > 2 * std EMIT *];
CurGood = CurGood - NewBad;
continue = [FROM NewBad EMIT COUNT(NewBad.v) > 0];
WHILE continue;
DUMP(CurGood);
Sigma-clipping, V0
CurGood = P;
sum = [FROM CurGood EMIT SUM(val)];
sumsq = [FROM CurGood EMIT SUM(val*val)];
cnt = [FROM CurGood EMIT COUNT(*)];
NewBad = [];
DO
  sum = sum - [FROM NewBad EMIT SUM(val)];
  sumsq = sumsq - [FROM NewBad EMIT SUM(val*val)];
  cnt = cnt - [FROM NewBad EMIT COUNT(*)];
  mean = sum / cnt;
  std = sqrt(1/(cnt*(cnt-1)) * (cnt * sumsq - sum*sum));
  NewBad = FILTER([ABS(val-mean) > 2 * std], CurGood);
  CurGood = CurGood - NewBad;
WHILE NewBad != {};
Sigma-clipping, V1: Incremental
Points = SCAN(public:adhoc:sc_points);
aggs = [FROM Points EMIT _sum=SUM(v), sumsq=SUM(v*v), cnt=COUNT(v)];
newBad = []
bounds = [FROM Points EMIT lower=MIN(v), upper=MAX(v)];
DO
new_aggs = [FROM newBad EMIT _sum=SUM(v), sumsq=SUM(v*v), cnt=COUNT(v)];
aggs = [FROM aggs, new_aggs EMIT _sum=aggs._sum - new_aggs._sum,
sumsq=aggs.sumsq - new_aggs.sumsq, cnt=aggs.cnt - new_aggs.cnt];
stats = [FROM aggs EMIT mean=_sum/cnt,
std=SQRT(1.0/(cnt*(cnt-1)) * (cnt * sumsq - _sum * _sum))];
newBounds = [FROM stats EMIT lower=mean - 2 * std, upper=mean + 2 * std];
tooLow = [FROM Points, bounds, newBounds WHERE newBounds.lower > v
AND v >= bounds.lower EMIT v=Points.v];
tooHigh = [FROM Points, bounds, newBounds WHERE newBounds.upper < v
AND v <= bounds.upper EMIT v=Points.v];
newBad = UNIONALL(tooLow, tooHigh);
bounds = newBounds;
continue = [FROM newBad EMIT COUNT(v) > 0];
WHILE continue;
output = [FROM Points, bounds WHERE Points.v > bounds.lower AND
Points.v < bounds.upper EMIT v=Points.v];
DUMP(output);
Sigma-clipping, V2
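The incremental idea behind V1 and V2, restated as a minimal Python sketch (illustrative, not the Myria execution): keep running sum, sum of squares, and count, and subtract only the newly rejected points each round instead of rescanning the surviving set.

# Incremental sigma-clipping with running aggregates.
import math

def sigma_clip(points, k=2.0):
    good = list(points)
    s = sum(good)                      # running SUM(v)
    sq = sum(v * v for v in good)      # running SUM(v*v)
    n = len(good)                      # running COUNT(v)
    while True:
        mean = s / n
        # sample std from the running aggregates, as in the queries above
        std = math.sqrt((n * sq - s * s) / (n * (n - 1)))
        new_bad = [v for v in good if abs(v - mean) > k * std]
        if not new_bad:
            return good
        s -= sum(new_bad)              # subtract only the rejected points
        sq -= sum(v * v for v in new_bad)
        n -= len(new_bad)
        good = [v for v in good if abs(v - mean) <= k * std]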
What can you do with a Polystore Algebra?
4) Orchestrate Federated Workflows
[Diagram: Client → MyriaX → SciDB]
More Orchestrating Federated Workflows
[Diagram: MyriaQ orchestrating Spark, Hadoop, and an RDBMS]
What can you do with a Polystore Algebra?
5) Study the price of abstraction
Compiling the Myria algebra to bare metal PGAS programs
RADISH
ICDE 15
Brandon Myers
Query compilation for distributed processing
[Diagram: Radish compiles each pipeline as parallel code, and a parallel compiler emits machine code [Myers ’14]; prior systems compile per-node pipeline fragments with a sequential compiler into machine code [Crotty ’14, Li ’14, Seo ’14, Murray ‘11]]
1% selection microbenchmark, 20GB
Avoid long code paths
Q2 SP2Bench, 100M triples, multiple self-joins
Communication optimization
Graph Patterns
• SP2Bench, 100 million triples
• Queries compiled to a PGAS C++ language layer, then
compiled again by a low-level PGAS compiler
• One of Myria’s supported back ends
• Comparison with Shark/Spark, which itself has been shown to
be 100X faster than Hadoop-based systems
• …plus PageRank, Naïve Bayes, and more
RADISH
ICDE 15
select A.i, B.k, sum(A.val*B.val)
from A, B
where A.j = B.j
group by A.i, B.k
Matrix multiply in RA
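The same computation as a Python sketch over (i, j, val) triples: a hash join on j followed by a group-by on (i, k), mirroring the plan the SQL describes (illustrative only).

# Sparse matrix multiply as join + aggregate over coordinate triples.
from collections import defaultdict

def spgemm(A, B):
    # A, B: iterables of (row, col, val) triples; returns C = A @ B as a dict.
    by_j = defaultdict(list)          # index B on its join key j
    for j, k, bv in B:
        by_j[j].append((k, bv))
    C = defaultdict(float)            # group by (i, k), sum the products
    for i, j, av in A:
        for k, bv in by_j[j]:
            C[(i, k)] += av * bv
    return dict(C)

A = [(0, 0, 1.0), (0, 1, 2.0)]
B = [(0, 0, 3.0), (1, 0, 4.0)]
print(spgemm(A, B))   # {(0, 0): 11.0}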
Complexity of matrix multiply
[Chart: complexity exponent vs. sparsity exponent r (such that m = n^r), where n = number of rows and m = number of non-zeros]
• naïve sparse algorithm: mn
• best known sparse algorithm: m^0.7 n^1.2 + n^2
• best known dense algorithm: n^2.38
• lots of room between these bounds
slide adapted from R. Yuster and U. Zwick, Fast Sparse Matrix Multiplication
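As a quick worked check of these bounds: at sparsity exponent r = 1.2 (that is, m = n^1.2), the naïve sparse algorithm costs mn = n^2.2, the best known sparse bound costs m^0.7 n^1.2 + n^2 = n^2.04 + n^2, and the best known dense bound costs n^2.38. At that sparsity the sparse approaches clearly beat the dense one, which is the room the chart highlights.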
BLAS vs. SpBLAS vs. SQL (10k)
[Chart annotations: "off-the-shelf database", "15X"]
Relative Speedup of SpBLAS vs. HyperDB
- speedup = T_HyperDB / T_SpBLAS
- benchmark datasets with r = 1.2, plus the real data cases (the three largest
datasets: 1.17 < r < 1.20)
- on star (nTh = 12), on dragon (nTh = 60)
- As n increases, the relative speedup of SpBLAS over HyperDB shrinks
- soc-Pokec: the speedup is only around 5X
- On star, HyperDB got stuck thrashing on the soc-Pokec data
20k X 20k matrix multiply by sparsity
CombBLAS, MyriaX, Radish
50k X 50k matrix multiply by sparsity
CombBLAS, MyriaX, Radish
Filter to upper left corner of result matrix
What can you do with a Polystore Algebra?
6) Provide new services over a Polystore Ecosystem
Lowering barrier to entry
Exposing Performance Issues
Dominik Moritz
EuroSys 15
Source worker
Destination worker
Kanit "Ham"
Wongsuphasawat
Voyager: Visualization
Recommendation
InfoVis 15
Seung-Hee Bae
Scalable Graph Clustering
Version 1: Parallelize best-known serial algorithm (ICDM 2013)
Version 2: Free 30% improvement for any algorithm (TKDD 2014, SC 2015)
Version 3: Distributed approximate algorithm, 1.5B edges
Viziometrics: Analysis of Visualization
in the Scientific Literature
[Chart: proportion of non-quantitative figures in a paper vs. paper impact, grouped into 5% percentiles]
Poshen Lee
http://escience.washington.edu
http://myria.cs.washington.edu
http://uwescience.github.io/sqlshare/
Editor's Notes

  • #3 And processing power, either as raw processor speed or via novel multi-core and many-core architectures, is also continuing to increase exponentially…
  • #4 … but human cognitive capacity is remaining constant. How can computing technologies help scientists make sense out of these vast and complex data sets?
  • #5 We want to give a little background of our project before we launch into it, so we will discuss the problem we are trying to solve. Essentially, we want to remove the speed-bump of data handling from the scientists.
  • #11 Express these plans; optimize these plans; compile these plans; execute these plans.
  • #12 So our approach is to model this overlap in capabilities as its own language. We start
  • #13 Matrices and linear algebra are a terrible programming model, but there's just so god damn much math developed around them that they're here to stay. The functional programming crowd has been poised to solve all the world's ills for 60 years, but they tend to have trouble pulling their heads out of their own navels long enough to solve someone's actual problem in practice. Objects and methods are great for building software systems, but get in the way for data analysis. Files and scripts aren't really data analysis – they are low-level operating system concepts. Data frames are just relations. Key-value pairs – I'll talk more about this in a bit. "While the community was skeptical that this new method could possibly outperform hand-coding, it reduced the number of programming statements necessary to operate a machine by a factor of 20, and quickly gained acceptance." "The relational model was buggy and slow, but you only had to write 5% of the code you used to have to write."
  • #16 We hoist
  • #17 MyriaL is an imperative language we like; I’ll show you some examples of that. The whole program is chained together as one big expression, perhaps with loops
  • #18 The logical plan is translated into a possibly federated, typically parallel back-end specific physical plan. Optimization rules are applied as appropriate. We’ve gotten more mileage than we expected out of just a simple rule-based optimizer, for two reasons: we have tried to make it very easy to add new rules on the fly, and we have made some algorithmic developments. For example, there’s been a lot of recent work on worst-case optimal join algorithms that scale with the size of the output rather than (only) the size of the input. One of our students has developed a variant of these worst-case optimal, multi-way join algorithm that looks like it could subsume the need for a lot of fretting about join order, skew handling, broadcasting, merge vs. hash, etc.
  • #19 Single interface to multiple big data systems * No one size fits all – there WILL be multiple systems and multiple tasks in play in realistic scenarios * Developer attention span is the bottleneck: Your data scientists can’t/won’t do the plumbing to make these systems talk to each other * Every system either a) claims to do everything or b) claims nobody else can do “their” thing. We need to stop the madness and do some good science. We need a middleware
  • #23 Advantages/inconveniences of sheath fluid: alignment of particles with the laser; sheath fluid replacement; loading samples into the instrument. Advantages/inconveniences of sheathless.
  • #45 And that's just using a parallel database. If we instead generate parallel programs and compile them the way the HPC folks do, we can beat up on Spark/Shark, basically due to aggregating messages and removing serialization overhead.
  • #47 NOTES: What optimizations does this enable? With better semantics on a hash-table join with UDFs, we can do redundant computation elimination and code motion out of the UDF.
  • #52 Can you just run this in a database and expect good performance? Of course not. But is this a fundamentally bad idea to run it this way? Maybe not.
  • #53 This is the complexity of three matrix multiply algorithms plotted against the sparsity – a naïve sparse