MathWorks Interview Lecture
Selling an executable
data flow graph based IR
John Yates
Order of presentation
• Who am I and why am I here?
• 2010: Netezza needs a new architecture
• A family of statically typed acyclic DFG IRs
• (Time permitting: Some engineering details)
• Q&A
“Who am I and why am I here?”
(with apologies to Adm. Stockdale)
1970: Maybe I’ll be a programmer
• NYC hippie, ponytail, curled handlebar mustache
• Liberal arts high school, lousy student
• Wanted to build things, real things
• Computers seemed interesting and intuitive
• Luckily in 1970 programmers were scarce
40 years…
– 1970: learning the craft, various jobs (all in assembler)
– 1978: Digital Equipment Corp
• Pascal frontend, dynamic programming code selector
– 1983: Apollo Computer
• Designed RISC ISP w/ explicit parallel dispatch (pre-VLIW)
• Lead architect for RISC backend optimizer; built team
• 1st commercial: SSA IR, SW pipeliner, lattice const prop
– 1992: Binary translation: DEC (sw), Chromatic (hw-support)
• More SSA IR, lowering; built teams; lots of patents (many hw)
– 1999: Everfile - NFS-like Win32 internet file system
– 2002: Netezza, badge #26
• Storage: compression, indices, access methods, txns, CBTs
20+ years
2010: Netezza needs
a new architecture
Data parallel analytics engine
• Data partitioned across a cluster of nodes
– Multiple “slices” per node to exploit multi-core
• Execution model:
– Leader accepts query, produces an execution plan
– Leader broadcasts plan’s parallel components
– Cluster performs data parallel work
– Leader performs work requiring a single locus
• Competition: Teradata, Greenplum, DB2, …
Netezza’s architecture
[Pipeline diagram: PG plan → Split1/Split2 → Gen FPGA / Gen C++ → Compile → Bcast → Load DLL / Load FPGA → Execute, fanned out across N workers, with latency accumulating along the whole chain]
Netezza’s problems
[Same pipeline diagram as the previous slide, across N workers]
Very simplistic code generator:
– Lowering across an enormous semantic gulf
– No intermediate representation
– Very complex, very fragile
– Difficult to implement much more than general-case code patterns
Hardware development time scales
Garth’s incomplete Marlin vision
• What is the real input to the interpreter?
• How do we get from query plan to that form?
[Diagram: PG plan → Split → Bcast → Interpret (faster?) on N workers, with an “unspecified miracle” bridging plan to interpreter input and “multi-core?” left open]
A family of statically typed
acyclic data flow graph IRs
Working backwards
• Graph
• Dataflow
• Acyclic
• Statically typed
• A family of … IRs
Graph
• Operators
– Label names a function
– Edge connections in and out
• Edges
– Directed (“dataflow”)
Dataflow
• Dataflow machines
– Apply history, wisdom, insights to the interpreter
• Value semantics
– All edges carry data
– No other kinds of edges (i.e. no anti-dependence)
– No updatable shared state (i.e. no store)
• Expose all opportunities for concurrency
Acyclic
• No backedges ≡ no cycles ☺
• Can exploit topological ordering
– Fact propagation: rDFS (forward) or DFS (reverse)
– No iteration, guaranteed termination
– Linear algorithms, O(graph)
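To make that payoff concrete, here is a minimal sketch (my own illustration, not XG code) of forward fact propagation: because the operator vector is already topologically ordered, every producer precedes its consumers, so a single linear pass reaches a fixed point with no iteration.

#include <algorithm>
#include <cstddef>
#include <vector>

using Fact = int;   // placeholder dataflow fact (here: depth from sources)

// inputs[x] lists the operators feeding operator x, in topological order.
std::vector<Fact> propagateForward(
    std::vector<std::vector<std::size_t>> const& inputs)
{
    std::vector<Fact> facts(inputs.size(), 0);
    for (std::size_t x = 0; x < inputs.size(); ++x) {
        for (std::size_t src : inputs[x])
            facts[x] = std::max(facts[x], facts[src]);   // meet over inputs
        facts[x] += 1;                                   // transfer function
    }
    return facts;   // one pass, O(graph) work, guaranteed termination
}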
Statically typed
• Edges initially have unknown type
• A well-formed graph can be statically typed
– Linear pass over topologically ordered Operators
– Assign edge types per Operator descriptors
– Inconsistencies can be diagnosed and reported
• Well-nested subsets of edge type vocabularies
• Constraining edge types constrains operators
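A sketch of how such a linear typing pass might look; the descriptor shape and the type vocabulary below are assumptions for illustration, not the real XG structures.

#include <cstddef>
#include <cstdio>
#include <vector>

enum class Ty { Unknown, Tuple, NullableValue, Value };

struct OpDesc {              // per-Operator descriptor (assumed shape)
    std::vector<Ty> in;      // required input edge types
    Ty out;                  // output edge type produced
};

// ops is topologically ordered; inEdges[x][i] names the producer feeding
// input i of operator x. One pass assigns every output type and reports
// each inconsistency it meets.
bool typeCheck(std::vector<OpDesc> const& ops,
               std::vector<std::vector<std::size_t>> const& inEdges,
               std::vector<Ty>& outTy)
{
    bool ok = true;
    outTy.assign(ops.size(), Ty::Unknown);
    for (std::size_t x = 0; x < ops.size(); ++x) {
        for (std::size_t i = 0; i < ops[x].in.size(); ++i)
            if (outTy[inEdges[x][i]] != ops[x].in[i]) {
                std::fprintf(stderr, "op %zu: input %zu type mismatch\n", x, i);
                ok = false;
            }
        outTy[x] = ops[x].out;   // producer precedes all its consumers
    }
    return ok;
}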
A family of … IRs
[Diagram: PG plan → Lower-and-Opt (tree patterns) → Lower-and-Opt (Graph1 patterns) → Lower-and-Opt (Graph2 patterns) → Split → Topo expand, insert CLONEs → Bcast → Interpret on N workers]
Four IR levels share a common pattern notation:
– High level tree: tuples
– High level graph: tuples
– Mid level graph: nullable values
– Low level graph: values
Nothing convinces like working code
• First delivery
– Table-driven operator semantics
– Utilities: build, edit & expand
– Topological sort
– Type check & report errors
[Diagram: graph assembly program → graph assembler → Split → Topo expand, insert CLONEs → Bcast → Interpret on N workers]
Sold!
• Working code rendered my successive-lowerings idea credible
• Overall, Marlin added ~10 engineers; I got 3
• My team got its first end-to-end test case working
[Same lowering-pipeline diagram as two slides back]
IBM killed the Marlin program…
• Marlin was a clean-up project promising…
– Performance and shorter development cycles
– But no new features or functionality
• It is always hard to fund significant clean-up
– Especially if not legitimately tied to a coveted feature
• Harder if your company is under duress
• Harder still if DB2 is gunning for your headcount
Questions?
Some engineering details
Why clone?
• After expansion all edges are point-to-point
– No output is multiply-consumed
• Chunk handoff along an edge becomes trivial
– Think C++11’s new move semantics
• So only clones implement reference counting
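A small illustration of the idea in C++11 terms; Chunk and these two functions are stand-ins of my own, not Netezza code.

#include <memory>
#include <utility>
#include <vector>

using Chunk = std::vector<char>;   // stand-in for a data chunk

// Ordinary point-to-point edge: exactly one consumer, so the chunk is
// handed off by move; no reference count is ever touched.
Chunk handoff(Chunk&& c) { return std::move(c); }

// CLONE operator: the only place fan-out (and thus reference counting)
// survives expansion; every consumer shares one immutable chunk.
std::vector<std::shared_ptr<Chunk const>> clone(Chunk&& c, unsigned nOut)
{
    auto shared = std::make_shared<Chunk const>(std::move(c));
    return std::vector<std::shared_ptr<Chunk const>>(nOut, shared);
}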
Broadcast
• Serialize / deserialize
• On the network, size matters
• Graph object
– Small number of scalar members
– Handful of C++ vectors (some ephemeral)
– Position independent (no pointers in vectors)
No pointers
• Pointers index the linear address space
– Implicit context (there is only one address space)
• Unsigned integers as vector indices
– User must provide explicit context (vector base)
– 32-bit indices are ½ the size of 64-bit pointers
– Position independence simplifies serialization
– Position independence simplifies serialization
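For instance (member names taken from the connectivity slides; the rest is an assumed sketch), a pointer-free edge vector can go to the wire verbatim:

#include <cstdint>
#include <cstdio>
#include <vector>

using OperatorIndex = std::uint32_t;
using EdgeOutIndex  = std::uint32_t;

struct EdgeIn {
    OperatorIndex dstOp_;   // index into vecOp, not an Operator*
    EdgeOutIndex  src_;     // index into vecOut, not an EdgeOut*
};

// Position independence in action: with no embedded pointers, a whole
// edge vector can be shipped as raw bytes and used as-is on arrival.
void put(std::FILE* net, std::vector<EdgeIn> const& vecIn)
{
    std::fwrite(vecIn.data(), sizeof(EdgeIn), vecIn.size(), net);
}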
The graph object
• Exposed read-only data
– Vector of Operator objects
– Vector of EdgeIn objects
– Vector of EdgeOut objects
– Literal table and pool
• Private data (may be missing or elided)
– Vector of EdgeIn next links
– Vector of Operator BreadCrumbs
Discardable elements
• vecBc: BreadCrumbs vector
• vecNxt: EdgeIn sibling links
• LiteralPool hash table array
Graph vector details
Vector     Index Type      Element Type   Element Size
g.vecOp    OperatorIndex   Operator       16 bytes
g.vecOut   EdgeOutIndex    EdgeOut        8 bytes
g.vecIn    EdgeInIndex     EdgeIn         8 bytes
g.lit      LiteralKey      Literal        multiple of 8 bytes
g.vecNxt   EdgeInIndex     EdgeInIndex    4 bytes
g.vecBc    OperatorIndex   BreadCrumb     4 bytes
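A hypothetical layout consistent with the sizes above: the base/index members come from the connectivity slides that follow, while the two extra Operator members are pure guesses to fill out the 16 bytes.

#include <cstdint>

using OperatorIndex = std::uint32_t;
using EdgeInIndex   = std::uint32_t;
using EdgeOutIndex  = std::uint32_t;
using BreadCrumb    = std::uint32_t;   // 4 bytes, per the table

struct EdgeIn  { OperatorIndex dstOp_; EdgeOutIndex src_; };   // 8 bytes
struct EdgeOut { OperatorIndex srcOp_; EdgeInIndex  dst_; };   // 8 bytes

struct Operator {                      // 16 bytes
    std::uint32_t opcodeAndFlags_;     // guessed: opcode, locus, expansion
    std::uint32_t literal_;            // guessed: LiteralKey
    EdgeInIndex   baseIn_;             // from the connectivity slides
    EdgeOutIndex  baseOut_;
};

static_assert(sizeof(EdgeIn)   == 8,  "matches the table");
static_assert(sizeof(EdgeOut)  == 8,  "matches the table");
static_assert(sizeof(Operator) == 16, "matches the table");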
Connectivity: Operator objects
• Operator private members
– Operator’s edges are sub-vectors of g.vecIn, g.vecOut
– Start of EdgeIn objects: EdgeInIndex baseIn_;
– Start of EdgeOut objects: EdgeOutIndex baseOut_;
• Number of connections
– Inputs: vecOp[x+1].baseIn_ - vecOp[x].baseIn_
– Outputs: vecOp[x+1].baseOut_ - vecOp[x].baseOut_
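In code, as a sketch: the arithmetic above only works if a sentinel Operator sits one past the last real one, so vecOp[x+1] is always valid; that sentinel is my assumption, the subtraction itself is straight from the slide.

#include <cstdint>
#include <vector>

using EdgeInIndex  = std::uint32_t;
using EdgeOutIndex = std::uint32_t;

struct Operator {
    EdgeInIndex  baseIn_;    // first EdgeIn slot in g.vecIn
    EdgeOutIndex baseOut_;   // first EdgeOut slot in g.vecOut
    // ... opcode etc. ...
};

// Arities need no per-Operator count fields: the next Operator's bases
// delimit this Operator's edge sub-vectors.
unsigned numIn(std::vector<Operator> const& vecOp, std::uint32_t x) {
    return vecOp[x + 1].baseIn_ - vecOp[x].baseIn_;
}
unsigned numOut(std::vector<Operator> const& vecOp, std::uint32_t x) {
    return vecOp[x + 1].baseOut_ - vecOp[x].baseOut_;
}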
Connectivity: EdgeIn objects
• EdgeIn private members
– Sink Operator: OperatorIndex dstOp_;
– Source EdgeOut: EdgeOutIndex src_;
• EdgeIn connection position
– Use pointer arithmetic:
this - (vecIn + vecOp[dstOp_].baseIn_);
Connectivity: EdgeOut objects
• EdgeOut private members
– Source Operator: OperatorIndex srcOp_;
– Sink EdgeIn: EdgeInIndex dst_;
• EdgeOut connection position
– Use pointer arithmetic:
this - (vecOut + vecOp[srcOp_].baseOut_);
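Both position computations as one self-contained sketch; the member names come from the slides, the surrounding scaffolding is assumed.

#include <cstdint>
#include <vector>

using OperatorIndex = std::uint32_t;
using EdgeInIndex   = std::uint32_t;
using EdgeOutIndex  = std::uint32_t;

struct Operator { EdgeInIndex baseIn_; EdgeOutIndex baseOut_; };
struct EdgeIn   { OperatorIndex dstOp_; EdgeOutIndex src_; };
struct EdgeOut  { OperatorIndex srcOp_; EdgeInIndex  dst_; };

// An edge recovers its own connection position from its address: its
// offset from the owning Operator's base within the shared vector.
unsigned positionOf(EdgeIn const& e, std::vector<EdgeIn> const& vecIn,
                    std::vector<Operator> const& vecOp) {
    return unsigned(&e - (vecIn.data() + vecOp[e.dstOp_].baseIn_));
}
unsigned positionOf(EdgeOut const& e, std::vector<EdgeOut> const& vecOut,
                    std::vector<Operator> const& vecOp) {
    return unsigned(&e - (vecOut.data() + vecOp[e.srcOp_].baseOut_));
}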
Working with XG
Thin graph construction
Method: graph.add(BreadCrumb, Op, Locus, Expansion,
                  unsigned nVarIn = 0, unsigned nVarOut = 0);
Effect: Add an Operator and its Edge resources

Method: graph.connect(OperatorIndex srcOp, unsigned srcPos,
                      OperatorIndex dstOp, unsigned dstPos);
Effect: Guarantee a srcOp[srcPos] → dstOp[dstPos] edge exists
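A hypothetical usage sketch of those two calls: the opcode, locus, and expansion values are invented, and I assume add returns the new OperatorIndex (the table leaves the return type unstated).

#include <cstdint>

using OperatorIndex = std::uint32_t;
using BreadCrumb    = std::uint32_t;
enum Op        { OP_SCAN, OP_FILTER, OP_RETURN };     // invented opcodes
enum Locus     { LOCUS_NODE, LOCUS_LEADER };
enum Expansion { EXPAND_PER_SLICE, EXPAND_SINGLE };

struct Graph {   // stub carrying just the two slide signatures
    OperatorIndex add(BreadCrumb, Op, Locus, Expansion,
                      unsigned nVarIn = 0, unsigned nVarOut = 0);
    void connect(OperatorIndex srcOp, unsigned srcPos,
                 OperatorIndex dstOp, unsigned dstPos);
    void done();
};

void buildScanFilterReturn(Graph& g)
{
    OperatorIndex scan = g.add(0, OP_SCAN,   LOCUS_NODE,   EXPAND_PER_SLICE);
    OperatorIndex filt = g.add(1, OP_FILTER, LOCUS_NODE,   EXPAND_PER_SLICE);
    OperatorIndex ret  = g.add(2, OP_RETURN, LOCUS_LEADER, EXPAND_SINGLE);
    g.connect(scan, 0, filt, 0);   // scan output 0 → filter input 0
    g.connect(filt, 0, ret,  0);   // filter output 0 → return input 0
    g.done();                      // topo sort and type check
}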
Whole graph operations
Operation: Graph();
Effect: Construct an empty Graph

Operation: void done();
Effect: Topo sort and type check

Operation: Graph(Graph const& thinGraph, bool forSpu);
Effect: Partitioning constructor

Operation: BinStream& operator<<(BinStream&, Graph const&);
Effect: Put to a BinStream (cheap)

Operation: BinStream& operator>>(BinStream&, Graph&);
Effect: Get from a BinStream (cheap)

Operation: void expand(bool forSpu, Environment const& env);
Effect: Expand, insert clones, etc.
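Reading the table end to end suggests a leader-side lifecycle along these lines; the stubs stand in for the real classes and the ordering is my interpretation, not documented behavior.

struct Environment;   // stand-ins; real definitions not shown on the slides
struct BinStream;

struct Graph {
    Graph();
    Graph(Graph const& thinGraph, bool forSpu);        // partitioning ctor
    void done();                                       // topo sort + type check
    void expand(bool forSpu, Environment const& env);
};
BinStream& operator<<(BinStream&, Graph const&);
BinStream& operator>>(BinStream&, Graph&);

void planAndBroadcast(Graph& thin, BinStream& net, Environment const& env)
{
    thin.done();                              // validate the thin graph
    Graph nodeSide(thin, /*forSpu=*/true);    // carve out node-side subset
    net << nodeSide;                          // cheap: flat, pointer-free vectors
    Graph leaderSide(thin, /*forSpu=*/false);
    leaderSide.expand(/*forSpu=*/false, env); // insert clones, fan edges
}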
Graph states and conversions
• Start with a “thin” graph
– Leader plus one representative node and data slice
– Operators tagged with a locus and expansion rule
– Outputs can have multiple consumers
• Partition into leader-side & node-side subsets
• Expand based on loci and system topology
– Duplicate operators, adjust in and out arities, add sites
– Expand edges: fan-in, fan-out, parallel
– Introduce clones as needed
Graph overlay
• Template object publicly derived from Graph
• Macro hides lots of template boilerplate
• User-supplied types for parallel vectors
– MyOperator ovOp[OperatorIndex]
– MyEdgeIn ovIn[EdgeInIndex]
– MyEdgeOut ovOut[EdgeOutIndex]
• Constructor shares vectors and LiteralTable
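One possible shape for what that macro generates; the slides show only the member names and indexing, so this is an assumed sketch.

#include <cstddef>
#include <vector>

struct Graph {                       // stub base: the shared structure
    std::size_t numOps()  const;
    std::size_t numIns()  const;
    std::size_t numOuts() const;
    Graph(Graph const& other);       // shares vectors and LiteralTable
};

// Parallel per-element user data, indexed by the same index types as the
// base graph's own vectors.
template <class MyOperator, class MyEdgeIn, class MyEdgeOut>
struct Overlay : public Graph {
    std::vector<MyOperator> ovOp;    // ovOp[OperatorIndex]
    std::vector<MyEdgeIn>   ovIn;    // ovIn[EdgeInIndex]
    std::vector<MyEdgeOut>  ovOut;   // ovOut[EdgeOutIndex]

    explicit Overlay(Graph const& g)
        : Graph(g), ovOp(numOps()), ovIn(numIns()), ovOut(numOuts()) {}
};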
1973: Began 2-axis controller
I wrote every line of code (in assembler)
1975: First installation
0.5 megawatt torch cutting up to ¾” steel plate at Marion Power Shovel
1975: Torch on… I was hooked!