KEMBAR78
Distributed processing of large graphs in python | PDF
Jose Quesada
Director, Data Science Retreat
jose@datascienceretreat.com
@quesada
• Mentors are world-class. CTOs, library authors, inventors,
founders of fast-growing companies, etc
• DSR accepts fewer than 5% of the applications
• Strong focus on commercial awareness
• 5 years of working experience on average
• 30+ partner companies in Europe
DSR participants do a portfolio
project
Why is DSR talking about Scala/Spark?
They are b
IBM is behind this
They hired
What is a good question?
What is a good question?
• Business case
• Data available
• Technology to answer the question is available
• We know when the solution worked
Does he look like a bitch?
What is a good question?
• Business case
• Data available
• Technology to answer the question is available
• We know when the solution worked
The question: When should I tweet
to influence the right account?

Or ‘beat Buffer at their own game’
What is a good question?
• Business case
DJ J & MAX RECORDS
DJ J & MAX RECORDS
DJ J & MAX RECORDS
DJ J & MAX RECORDS
DJ J & MAX RECORDS
DJ J & MAX RECORDS
Overlap Tweet hours
Tweet frequency per UTC hour
What is a good question?
• Business case
• Data available
24GB
What is a good question?
• Business case
• Data available
• Technology to answer the question is available
What is a good question?
• Business case
• Data available
• Technology to answer the question is available
• We know when the solution worked
Graph theory parts we can
use to solve this problem
Graph theory primer
• Random walk
• Shortest path
• Sampling
Sampling in networks
Sampling in Networks
Note that sampling in Networks is fraught with difficulties. One cannot simply
sample the edges and nodes and expect that the sample be representative of the
original network. In the graph below, a sample that missed node 1 or 2 would
disconnect the two clusters, and would not have the same properties as the
original
Node 11
Node 2
Random surfer
Random surfer
A
B
C
D
Random surfer
A
B
C
D
Random surfer
A
B
C
D
E
Visited more often:
• Nodes with many links
• Coming from frequently visited nodes
Computing Pagerank
 
 
A
B
C
D
E
 
Computing Pagerank
 
 
A
B
C
D
E
 
 
Computing Pagerank
 
 
A
B
C
D
E
 
 
Computing Pagerank
 
 
A
B
C
D
E
 
 
Computing Pagerank
 
 
A
B
C
D
E
 
 
Computing Pagerank
 
 
A
B
C
D
E
 
 
Teleport
A
B
C
D
E
Teleport
A
B
C
D
E
Teleport
A
B
C
D
E
Teleport
A
B
C
D
E
   
  
 
 
 
 
Teleport
A
B
C
D
E
At regular node: invoke
teleport operation with
probability α and standard
random walk with
probability (1-α)
 
 
 
 
 
 
(1-α)
α
Personalized pagerank
A
B
C
D
E
At regular node: invoke
teleport operation with
probability α and standard
random walk with
probability (1-α). When
teleporting, go to target
node
 
 
 
 
 
(1-α)
Personalized pagerank
A
B
C
D
E
At regular node: invoke
teleport operation with
probability α and standard
random walk with
probability (1-α). When
teleporting, go to target
node
(1-α)
α
Personalized pagerank
• Special case of Pagerank with priors (distribution of weights
over the nodes)
Implementation
A partitioned, distributed graph processing engine
is significantly more complex and difficult to build
GraphX and graphframes (new in spark
2.0)
• GraphX is to RDD as graphframe is to dataframe
• GraphX is lower level, and the API is scala-only. Graphframe is
very new:
• It’s not designed to be a graph database, as neo4J. Nodes and
edges can contain metadata, but the query engine is not as
complete as cypher
Advantages of graphframes
• Graphframes have a python API
• Graphframes give you simple querying for free.  GraphFrame
vertices and edges are stored as DataFrames, many queries are
just DataFrame (or SQL) queries
• They contain most of the algorithms in graphX, but the API is
less well-tested
• Pyspark shell instead of spark-shell
Distributed PageRank
• Problem: Computing PageRank on graph too large for one
machine
• Algorithm:
– Shard edges randomly,
– compute on each machine
– average results
• Basic idea: Duplicate edges from low-degree nodes. Gives an
unbiased estimator
• Nodes: 41.652.230
• Edges:
1.468.365.182
Summary of implementation, benefits
• Graph theory is a really flexible way to represent a problem
• Data structures to represent graphs are mature
• You can do now out-of-core, distributed graph analysis for
cheap
• Implementations are there for even state-of-the-art methods
Summary, finding a problem
• We live in an age of abundance (methods, data, hardware, ideas)
• Finding the question is more than half of the battle
• I had about a week to prepare this talk, but I managed to put
together something that showcases what you can do with large
graphs today, and it could be effective as a startup idea
• My question is not great because you cannot demonstrate that it
works till you use it (common problem for unsupervised methods)
The question: When should I tweet
to influence the right account?

Or ‘beat Buffer at their own game’
References: Drawing graphs
• Graphs in this slide set have been drawn with Gephi
• If you use Zeppelin notebook, you can draw graphs with:
drawGraph(org.apache.spark.graphx.util.
GraphGenerators.rmatGraph(sc,32,60))


25 videos explaining ML on spark, 50 more
to come. A bunch on graphX
• For people who already know ML
• http://datascienceretreat.com/videos/data-science-with-
scala-and-spark
About learning new tech over seven
weekends…
About learning new tech over seven
weekends
• You have time and enjoy using it to learn alone: learn it ‘the
hard way’
• You are extremely motivated and talented, have money: Apply
for DSR
• You want your weekends for yourself. You are already very
good but want to switch jobs. Apply for codekitt
Thanks!
Jose Quesada
Director, Data Science Retreat
jose@datascienceretreat.com
@quesada
http://datascienceretreat.com/
codekitt.com

Distributed processing of large graphs in python