Custom Pregel Algorithms in ArangoDB
Feature Preview: Custom Pregel
Complex Graph Algorithms made Easy
@arangodb @joerg_schad @hkernbach
tl;dr
● “Many practical computing problems concern large
graphs.”
● ArangoDB is a “Beyond Graph Database”
supporting multiple data models around a scalable
graph foundation
● Pregel is a framework for distributed graph
processing
○ ArangoDB supports predefined Pregel algorithms, e.g.
PageRank, Single-Source Shortest Path and Connected
Components.
● Programmable Pregel Algorithms (PPA) allow
adding/modifying algorithms on the fly
Disclaimer
This is an experimental feature. The language specification
(front-end) in particular is still under development!
Jörg Schad, PhD
Head of Engineering and ML
@ArangoDB
● Suki.ai
● Mesosphere
● Architect @SAP Hana
● PhD Distributed DB
Systems
● Twitter: @joerg_schad
Heiko Kernbach
Core Engineer (Graphs Team)
@
● Graph
● Custom Pregel
● Geo / UI
● Twitter: @hkernbach
● Slack:
hkernbach.ArangoDB
● Open Source
● Beyond Graph Database
○ Stores, K/V, Documents connected by
scalable Graph Processing
● Scalable
○ Distributed Graphs
● AQL - SQL-like multi-model query language
● ACID Transactions including Multi Collection
Transactions
https://blog.acolyer.org/2015/05/26/pregel-a-system-for-large-scale-graph-processing/
Pregel Max Value
While not converged (each iteration is one superstep):
Communicate: send own value to neighbours
Compute: own value = max value from all messages (+ own value)
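The max-value loop can be sketched as a toy, single-process simulation. This is only an illustration of the communicate/compute cycle, not how ArangoDB distributes the work. The function name `pregelMaxValue`, the adjacency-list graph shape and the sample data are assumptions made for this example.

```javascript
// Toy, single-process simulation of the Pregel "max value" example.
// Function name, graph shape and sample data are illustrative assumptions.
function pregelMaxValue(values, edges, maxSupersteps = 100) {
  // values: { vertexId: number }, edges: { vertexId: [neighbourIds] }
  const current = { ...values };
  // In the first superstep every vertex is active and sends its value.
  let active = new Set(Object.keys(current));
  let supersteps = 0;

  while (active.size > 0 && supersteps < maxSupersteps) {
    // Communicate: each active vertex sends its own value to its neighbours.
    const inbox = {};
    for (const v of active) {
      for (const n of edges[v] || []) {
        (inbox[n] = inbox[n] || []).push(current[v]);
      }
    }
    // Compute: own value = max of all received messages and the own value.
    const changed = new Set();
    for (const [v, messages] of Object.entries(inbox)) {
      const best = Math.max(current[v], ...messages);
      if (best > current[v]) {
        current[v] = best;
        changed.add(v);
      }
    }
    // Only vertices whose value changed stay active; converged when none did.
    active = changed;
    supersteps += 1;
  }
  return { values: current, supersteps };
}

// A four-vertex chain a - b - c - d with edges in both directions.
const maxValueResult = pregelMaxValue(
  { a: 3, b: 6, c: 2, d: 1 },
  { a: ["b"], b: ["a", "c"], c: ["b", "d"], d: ["c"] }
);
console.log(maxValueResult.values, maxValueResult.supersteps);
```

In this sketch a vertex only sends again when its value changed in the previous superstep, so the loop stops on its own once the maximum has reached every vertex.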
ArangoDB and Pregel: Status Quo
● https://www.arangodb.com/docs/stable/graphs-pregel.html
● https://www.arangodb.com/pregel-community-detection/
Available Algorithms
● PageRank
● Seeded PageRank
● Single-Source Shortest Path
● Connected Components
○ Component
○ WeaklyConnected
○ StronglyConnected
● Hyperlink-Induced Topic Search (HITS)
● Vertex Centrality
● Effective Closeness
● LineRank
● Label Propagation
● Speaker-Listener Label Propagation
// Run the built-in PageRank algorithm on a named graph (arangosh).
var pregel = require("@arangodb/pregel");
pregel.start("pagerank", "graphname", {maxGSS: 100,
threshold: 0.00000001, resultField: "rank"});
● Pregel support since 2014
● Predefined algorithms
○ Could be extended via C++
● Same platform used for PPA
Challenges
Add and modify Algorithms
Programmable Pregel Algorithms (PPA)
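// Start a custom algorithm; "<custom-algorithm>" stands for the algorithm definition.
const pregel = require("@arangodb/pregel");
const pregelID = pregel.start("air", graphName, "<custom-algorithm>");
// Query the execution status by its ID.
const status = pregel.status(pregelID);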
● Add/Modify algorithms on-the-fly
○ Without C++ code
○ Without restarting the Database
● Efficiency (as with Pregel in general) depends on sharding
○ SmartGraphs
○ Required: co-location of vertices and edges
Custom Algorithm
{
"resultField": "<string>",
"maxGSS": "<number>",
"dataAccess": {
"writeVertex": "<program>",
"readVertex": "<array>",
"readEdge": "<array>"
},
"vertexAccumulators": "<object>",
"globalAccumulators": "<object>",
"customAccumulators": "<object>",
"phases": "<array>"
}
Accumulators
Accumulators are used to consume and process messages which are being
sent to them during the computational phase (initProgram, updateProgram,
onPreStep, onPostStep) of a superstep. After a superstep is done, all messages
will be processed.
● max: stores the maximum of all messages received.
● min: stores the minimum of all messages received.
● sum: sums up all messages received.
● and: computes the logical AND of all messages received.
● or: computes the logical OR of all messages received.
● store: holds the last received value (non-deterministic).
● list: stores all received values in list (order is non-deterministic).
● custom
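The built-in accumulator types listed above can be pictured as simple fold functions over the messages received in a superstep. The sketch below is only an illustration of those semantics, not ArangoDB's implementation. The `accumulators` table, the `accumulate` helper and the initial values are assumptions made for this example.

```javascript
// Illustrative accumulator semantics; NOT ArangoDB's implementation.
// The table layout and the initial values are assumptions for this sketch.
const accumulators = {
  max: { init: -Infinity, update: (acc, msg) => Math.max(acc, msg) },
  min: { init: Infinity, update: (acc, msg) => Math.min(acc, msg) },
  sum: { init: 0, update: (acc, msg) => acc + msg },
  and: { init: true, update: (acc, msg) => acc && msg },
  or: { init: false, update: (acc, msg) => acc || msg },
  // store keeps whichever message arrives last, so the result depends on
  // arrival order (hence "non-deterministic" in a distributed run).
  store: { init: null, update: (_acc, msg) => msg },
  // list keeps every message; the order again depends on arrival order.
  list: { init: [], update: (acc, msg) => [...acc, msg] },
};

// Fold all messages of one superstep into a single accumulator value.
function accumulate(type, messages) {
  const { init, update } = accumulators[type];
  return messages.reduce((acc, msg) => update(acc, msg), init);
}

console.log(accumulate("max", [3, 9, 4]));
console.log(accumulate("and", [true, false, true]));
console.log(accumulate("list", [3, 9, 4]));
```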
Custom Algorithm
● resultField (string, optional): Name of the document attribute in which to store the result. The
vertex computation results are written to this attribute on every vertex.
● maxGSS (number, required): The maximum number of global supersteps. Once this number of
supersteps is reached, the Pregel execution stops.
● dataAccess (object, optional): Defines writeVertex, readVertex and readEdge.
○ writeVertex: A program that writes the results into vertices. If writeVertex is
used, resultField is ignored.
○ readVertex: An array of strings and/or nested arrays, where each nested array represents
a path.
■ string: A single attribute at the top level.
■ array of strings: A nested path.
○ readEdge: An array of strings and/or nested arrays, where each nested array represents
a path.
■ string: A single attribute at the top level (not nested).
■ array of strings: A nested path.
● vertexAccumulators (object, optional): Definition of all used vertex accumulators.
● globalAccumulators (object, optional): Definition of all used global accumulators. Global
accumulators are able to access variables at a shared global level.
● customAccumulators (object, optional): Definition of all used custom accumulators.
● phases (array): Array of one or more phase definitions.
● debug (optional): See Debugging.
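To make the two entry forms of readVertex and readEdge concrete, here is a minimal sketch that resolves such entries against a plain document. The helper `readPaths`, the sample vertex and the dotted keys in the result are hypothetical and chosen only for this illustration; they say nothing about how ArangoDB stores or exposes the data it reads.

```javascript
// Hypothetical helper: resolve readVertex/readEdge style entries against a
// document. A string names a top-level attribute, an array of strings names
// a nested path. The dotted result keys are an assumption of this sketch.
function readPaths(doc, paths) {
  const result = {};
  for (const path of paths) {
    const segments = Array.isArray(path) ? path : [path];
    let value = doc;
    for (const key of segments) {
      // Stop descending as soon as an intermediate attribute is missing.
      value = value != null ? value[key] : undefined;
    }
    result[segments.join(".")] = value;
  }
  return result;
}

// Sample vertex document (made up for this example).
const sampleVertex = { name: "Berlin", geo: { lat: 52.52, lon: 13.4 }, population: 3600000 };
const projected = readPaths(sampleVertex, ["name", ["geo", "lat"]]);
console.log(projected);
```

Listing only the attributes an algorithm needs keeps the rest of each document out of the computation, which is presumably the point of declaring them up front.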
Phases - Execution order
Step 1: Initialization
1. onPreStep (Conductor, executed on Coordinator instances)
2. initProgram (Worker, executed on DB-Server instances)
3. onPostStep (Conductor)
Steps 2, ..., n: Computation
1. onPreStep (Conductor)
2. updateProgram (Worker)
3. onPostStep (Conductor)
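The execution order above can be written down as a small trace generator. This is a sketch of the ordering only: `phaseTrace` is a hypothetical name, and nothing here runs on a Conductor or Worker.

```javascript
// Sketch of the hook order within one phase; it only records names.
// The function name and the "step:hook" trace format are assumptions.
function phaseTrace(supersteps) {
  const trace = [];
  for (let step = 1; step <= supersteps; step++) {
    // Step 1 runs initProgram on the workers, later steps run updateProgram.
    const workerProgram = step === 1 ? "initProgram" : "updateProgram";
    // onPreStep and onPostStep (Conductor) wrap the worker program.
    for (const hook of ["onPreStep", workerProgram, "onPostStep"]) {
      trace.push(`${step}:${hook}`);
    }
  }
  return trace;
}

console.log(phaseTrace(3).join("\n"));
```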
Program - Arango Intermediate Representation (AIR)
A Lisp-like intermediate representation, expressed in JSON and
supporting JSON's data types
Specification
● Language Primitives
○ Basic Algebraic Operators
○ Logical operators
○ Comparison operators
○ Lists
○ Sort
○ Dicts
○ Lambdas
○ Reduce
○ Utilities
○ Functional
○ Variables
○ Debug operators
● Math Library
● Special Forms
○ let statement
○ seq statement
○ if statement
○ match statement
○ for-each statement
○ quote and quote-splice statements
○ quasi-quote, unquote and unquote-splice statements
○ cons statement
○ "and" and "or" statements
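To give a feel for what "Lisp-like, expressed in JSON" means, the following toy evaluator treats a JSON array as a call whose first element names the operator. It is not AIR: the operator set, the semantics given to "seq" here, and the `evaluate` function are assumptions made for this illustration, and the actual forms are defined by the AIR specification.

```javascript
// Toy evaluator for JSON-encoded, Lisp-style expressions. NOT the AIR
// interpreter: operators and their semantics are assumptions of this sketch.
const operators = {
  "+": (...args) => args.reduce((a, b) => a + b, 0),
  "*": (...args) => args.reduce((a, b) => a * b, 1),
  max: (...args) => Math.max(...args),
};

function evaluate(expr) {
  // Anything that is not an array is a plain JSON literal.
  if (!Array.isArray(expr)) return expr;
  const [head, ...rest] = expr;
  if (head === "seq") {
    // Evaluate sub-expressions in order and return the last result.
    let last = null;
    for (const sub of rest) last = evaluate(sub);
    return last;
  }
  const op = operators[head];
  if (!op) throw new Error(`unknown operator: ${head}`);
  // Arguments are evaluated first, then passed to the operator.
  return op(...rest.map(evaluate));
}

console.log(evaluate(["+", 1, ["*", 2, 3]]));
console.log(evaluate(["seq", ["+", 1, 1], ["max", 4, 9, 2]]));
```

Because every array is read as a call, a representation like this needs a way to mark an array as plain data, which is the role quote-style forms usually play in Lisp-like languages.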
Pregelator
Simple Foxx-service-based IDE
https://github.com/arangodb-foxx/pregelator
PPA: What is next?
- Gather Feedback
- In particular use-cases
- Missing functions & functionality
- User-friendly Front-End language
- Improve Scale/Performance of underlying
Pregel platform
- Algorithm library
- Blog Post (including Jupyter example)
ArangoDB 3.8 (end of year)
- Experimental Feature
- Initial Library
ArangoDB 3.9 (Q1 21)
- Draft for Front-End
- Extended Library
- Platform Improvements
ArangoDB 4.0 (Mid 21)
- GA
Pregel vs AQL
When to (not) use Pregel…
- Can the algorithm be expressed efficiently in Pregel?
- Counterexample: Topological Sort
- Is the graph large enough to be worth the loading cost?
● AQL: all models (Graph, Document, Key-Value, Search, …); online queries
● Pregel: iterative graph processing; large graphs, multiple iterations
How can I start?
● Docker Image: arangodb/enterprise-preview:3.8.0-milestone.3
● Check existing algorithms
● Preview documentation
● Give Feedback
○ https://slack.arangodb.com/ -> custom-pregel
Thanks for listening!
Reach out with Feedback/Questions!
• @arangodb
• https://www.arangodb.com/
• docker pull arangodb
Test-drive Oasis
14 days for free
