Graph Based Machine Learning on Relational Data
Problems and Methods
Machine Learning using Graphs
- Machine Learning is iterative but iteration can also be seen as traversal.
- Many domains have structures already modeled as graphs (health records, finance)
- Important analyses are graph algorithms: clusters, influence propagation, centrality.
- Performance benefits on sparse data
- More understandable implementation
Iterative PageRank in Python
# Iterative PageRank, adapted from the gist linked in the speaker notes:
# https://gist.github.com/diogojc/1338222
import numpy as np
from scipy.sparse import csc_matrix

def pageRank(G, s=.85, maxerr=.001):
    n = G.shape[0]

    # transform G into a Markov matrix M
    M = csc_matrix(G, dtype=float)
    rsums = np.array(M.sum(1))[:, 0]
    ri, ci = M.nonzero()
    M.data /= rsums[ri]

    sink = rsums == 0  # bool array of sink states

    # compute pagerank r until we converge
    ro, r = np.zeros(n), np.ones(n)
    while np.sum(np.abs(r - ro)) > maxerr:
        ro = r.copy()
        for i in range(n):
            Ii = np.array(M[:, i].todense())[:, 0]  # in-links of state i
            Si = sink / float(n)                    # account for sink states
            Ti = np.ones(n) / float(n)              # account for teleportation
            r[i] = ro.dot(Ii * s + Si * s + Ti * (1 - s))

    return r / sum(r)  # return normalized pagerank
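
For reference, the speaker notes (note #9) include a small example adjacency matrix to run this on. A minimal usage sketch, assuming the pageRank function above is already defined:

import numpy as np

# Example adjacency matrix from the speaker notes (nonzero = link).
G = np.array([[0,0,1,0,0,0,0],
              [0,1,1,0,0,0,0],
              [1,0,1,1,0,0,0],
              [0,0,0,1,1,0,0],
              [0,0,0,0,0,0,1],
              [0,0,0,0,0,1,1],
              [0,0,0,1,1,0,1]])

# Scores are normalized to sum to 1; s is the damping factor.
print(pageRank(G))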
Graph-Based PageRank in Gremlin
// Groovy/Gremlin sketch from the mailing-list thread linked in the speaker notes.
// Assumes `uris` is the collection of vertices and `rand` is a java.util.Random.
pagerank = [:].withDefault{ 0 }
size = uris.size()
uris.each {
    count = it.outE.count()
    if (count == 0 || rand.nextDouble() > 0.85) {
        rank = pagerank[it]
        uris.each {
            pagerank[it] = pagerank[it] / uris.size()
        }
    }
    rank = pagerank[it] / it.outE.count()
    it.out.each {
        pagerank[it] = pagerank[it] + rank
    }
}
Learning by Example
- Machine Learning requires many instances with which to fit a model to make predictions.
- Current large-scale analytical methods (Pregel, Giraph, GraphLab) are in-memory, without data storage components.
- And while Neo4j, OrientDB, and Titan are ok...
- Most (active) data sits in relational databases, where users interact with it in real time via transactions in web applications.
Is it because relational data is a legacy system we must support?
Is it purely because of inertia?
NO! It’s because Relational Data is awesome!
Awesome sauce relational data of the future.
Requirements
- Ability to express queries/algorithms using a declarative, graph-domain specific language like SQL, or at the very least via UDFs.
- Ability to explore and identify hidden or implicit graphs in the database (see the sketch after this list).
- Combine in-memory analytics with some disk storage facility that is transactional.
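
As a toy illustration of the first two requirements, here is a minimal sketch of expressing a traversal over an implicit graph with plain SQL from Python. The schema (an employees table whose manager_id column hides a reporting graph) and the sqlite3 backend are illustrative assumptions, not part of the slides.

import sqlite3

conn = sqlite3.connect("hr.db")  # hypothetical database

# The employees(id, name, manager_id) table implicitly encodes a graph;
# a recursive CTE expresses the traversal declaratively, in SQL itself.
rows = conn.execute("""
    WITH RECURSIVE chain(id, name, depth) AS (
        SELECT id, name, 0 FROM employees WHERE manager_id IS NULL
        UNION ALL
        SELECT e.id, e.name, c.depth + 1
        FROM employees e JOIN chain c ON e.manager_id = c.id
    )
    SELECT id, name, depth FROM chain
""").fetchall()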
Approach 1: ETL Methods
[Diagram: at t = 0 the graph is extracted, transformed, and loaded out of the relational store; for t > 0 it is synchronized and re-analyzed. A minimal sketch of this flow follows the pros and cons below.]
Approach 1: ETL Methods
The Good
- Processing is not physical layer dependent
- Relational data storage with real time interaction
- Analytics can scale in size to Hadoop or in speed to in-memory computation frameworks.
The Bad
- Must know structure of graph in relational database ahead of time, no exploration.
- Synchronization can cause inconsistency.
- OLAP processes incur resource penalty (I/O or CPU depending on location).
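
A minimal sketch of the ETL pattern, under illustrative assumptions only: a hypothetical orders(customer_id, product_id) table, sqlite3 as the relational store, and NetworkX standing in for whichever in-memory framework does the analysis.

import sqlite3
import networkx as nx

# Extract: pull an implicit co-purchase edge list out of the relational store.
conn = sqlite3.connect("shop.db")  # hypothetical database
edges = conn.execute("""
    SELECT a.customer_id, b.customer_id
    FROM   orders a
    JOIN   orders b ON a.product_id = b.product_id
    WHERE  a.customer_id < b.customer_id
""").fetchall()

# Transform + Load: build the in-memory graph.
G = nx.Graph()
G.add_edges_from(edges)

# Analyze: graph algorithms now run outside the database.
ranks = nx.pagerank(G, alpha=0.85)

# For t > 0, the extracted graph must be synchronized with the live tables,
# which is where the inconsistency noted above can creep in.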
Approach 2: Store Graph in RDBMS
The Good
- Can utilize relational devices like indices and parallel joins for graph-specific queries on existing data.
- Simply use SQL for the data access mechanism.
- Transactional storage of the data.
The Bad
- Constrained to graph-specific schema.
- Many joins required for traversal (see the sketch after this list).
- Depending on storage mechanisms there may be too few or too many tables in the database for applications.
- Must convert existing database to this structure.
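
A minimal sketch of the graph-specific schema this approach implies, with hypothetical table names and sqlite3: nodes and edges tables, where every traversal hop costs another self-join on edges.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE edges (src INTEGER, dst INTEGER);
    CREATE INDEX edges_src ON edges (src);
""")

# Two-hop traversal from node 1: one join per hop.
two_hop = conn.execute("""
    SELECT e2.dst
    FROM   edges e1
    JOIN   edges e2 ON e1.dst = e2.src
    WHERE  e1.src = 1
""").fetchall()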
Approach 3: Use Graph Query Language
[Architecture diagram: a Graph DSL Query enters the API; the Query Translator produces SQL Queries, the Optimizer produces the Final SQL Queries, and a Query Result is returned.]
Approach 3: Use Graph Query Language
The Good
- A DSL in the graph domain that easily expresses graph analytics but also relational semantics.
- Can use existing relational schemas; allows for exploration and identification of graphs.
- Computation is offloaded into in-memory processing.
The Bad
- Many graphs or big graphs can cause too many joins without optimal query translation (see the toy sketch after this list).
- The user is required to define how the relational structure maps to a graph representation.
- May not leverage relational resources.
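
A toy sketch of the translation step in the diagram above, under obvious simplifying assumptions (a single edges(src, dst) table, and "neighbors within N hops" as the entire DSL); a real translator would also have to optimize the generated joins.

def translate_neighbors(node_id, hops=1):
    """Translate a toy 'neighbors within N hops' query into SQL over edges(src, dst)."""
    # One self-join per additional hop: this is where the
    # "too many joins" problem comes from on deep traversals.
    sql = f"SELECT e{hops}.dst FROM edges e1"
    for i in range(2, hops + 1):
        sql += f" JOIN edges e{i} ON e{i - 1}.dst = e{i}.src"
    return sql + f" WHERE e1.src = {node_id}"

print(translate_neighbors(42, hops=3))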
Any Questions?
Thank you!
Presented By:
Konstantinos Xirogiannopoulos <kostasx@cs.umd.edu>
Benjamin Bengfort <bengfort@cs.umd.edu>
May 7, 2015


Editor's Notes

  • #2 Hi, my name is Kostas and this is Ben. Today we’re going to present the research challenges and existing methods of using graph analyses on relational data stores. [SLIDE CHANGE] Today I’d like to talk a little bit about conducting Graph Based Machine Learning on Relational Data
  • #3 As we’ve read and discussed in class, graphs are a valuable data structure, well suited for a range of non-trivial analyses and machine learning tasks. [SLIDE CHANGE] So as we’ve recently read about and talked about in class, graphs are an interesting data structure that is well suited not only for trivial graph analyses but also for complex machine learning tasks.
  • #4 To motivate the usage of graphs and graph oriented frameworks even further
  • #6 Analyses like finding clusters, influence propagation by means of Pagerank for example, or centrality, are in essence graph algorithms.
  • #8 But more importantly, graphs, and graph-specific languages and frameworks, provide a substantially more comprehensible way of implementing algorithms.
  • #9 https://gist.github.com/diogojc/1338222 G = np.array([[0,0,1,0,0,0,0], [0,1,1,0,0,0,0], [1,0,1,1,0,0,0], [0,0,0,1,1,0,0], [0,0,0,0,0,0,1], [0,0,0,0,0,1,1], [0,0,0,1,1,0,1]])
  • #10 https://groups.google.com/forum/#!topic/gremlin-users/LAm4mzzg8NY Expressing graph algorithms through this framework is a lot more intuitive!
  • #11 So hopefully I’ve convinced you about why we’d want to use Graphs for these types of analyses. Now we also know that Machine learning...
  • #14 Most active data actually sits inside relational databases, where users interact with it in real time via transactions in the web applications that we use every day.
  • #15 Now why is it that we don’t move on? A rusting old jalopy in Hackberry, a small Arizona town just outside the middle of nowhere. https://flic.kr/p/dqG9Ad
  • #16 1996 McLaren F1 GTR https://flic.kr/p/oc8gUh. Awesome because: strong semantics (durability, fault tolerance, integrity constraints); they support truly ACID transactions; they provide assurance because they are mature.