Web of Data Usage Mining

Markus Luczak-Roesch
@mluczak | http://markus-luczak.de

What you should learn:
•  describe the architectural differences between content
negotiation and Linked Data queries;
•  develop applications that use different strategies to
consume Linked Data;
•  develop usage mining methods that exploit the atomic parts
of the SPARQL query language.

Linked Data principles
1.  Use URIs as names for
“Things” (resources).
2.  Use HTTP URIs to allow the
access to resources on the
Web.
3.  On resource access, deliver
meaningful information
conforming to Web standards
(RDF, SPARQL).
4.  Set RDF links to resources
published by other parties to
allow the discovery of more
resources.
Example: http://dbpedia.org/resource/Berlin
•  Content negotiation redirects to http://dbpedia.org/page/Berlin or http://dbpedia.org/data/Berlin
•  Example triple: yago-res:Berlin (S) owl:sameAs (P) dbpedia:Berlin (O)
Source: http://www.w3.org/DesignIssues/LinkedData.html


Linked Data exploits RDF
•  <http://markus-luczak.de#me> foaf:name "Markus Luczak-Roesch"
•  <http://markus-luczak.de#me> foaf:knows <http://hannes.muehleisen.org#me>

Linked Data vocabularies
•  Vocabulary reuse:
–  Geo
–  FOAF
–  GoodRelations
–  SIOC
–  DOAP
–  …
•  Vocabulary development:
–  Thing
•  Person
–  OfficeHolder
–  …
•  …
http://dbpedia.org/ontology/Person
http://dbpedia.org/ontology/OfficeHolder
http://xmlns.com/foaf/0.1/knows

Linked Data vocabularies
•  Mixing:
– Geo
– FOAF
– Dublin Core
– DBpedia Ontology
–  ...
http://xmlns.com/foaf/0.1/Person
http://www.w3.org/2003/01/geo/wgs84_pos#lat
http://dbpedia.org/ontology/leader
http://dbpedia.org/ontology/City

Linked Data is self-descriptive
[Figure: on the instance level, the resources int:resA, int:resB and ext:resA are connected by owl:sameAs and carry foaf:name literals such as "ABC"; rdf:type links them to the schema level, where ont:ClassA, foaf:Person and foaf:Agent are related by rdfs:subClassOf and owl:equivalentClass.]

Relational table (person):
u_id   firstname   surname
45     Markus      Luczak-Roesch
…      …           …
RDF mapping: rdf:type foaf:Person, foaf:name

Relational table (city):
c_id   city     country   inhabitants
67     Berlin   Germany   3.375.222
…      …        …         …
RDF mapping: dbpedia:Berlin dbp:population "3.375.222"

[Figure: the person resource (rdf:type foaf:Person, foaf:name) is linked via dbp:birthPlace to dbpedia:Berlin (dbp:population "3.375.222").]

[Figure, built up in two steps: the person is described with foaf:basedNear (http://markus-luczak.de/res/Soton) and dbp:birthPlace (dbpedia:Berlin); dbpedia:Berlin and dbpedia:Southampton both have skos:subject dbpedia:Cities_in_Europe; in the second step an rdfs:seeAlso link is added.]
rdfs:seeAlso

[Figure: graph with the edges foaf:basedNear, rdfs:seeAlso, rdf:type, owl:equivalentClass and dbp:birthPlace connecting foaf:Person, dbp:Person, dbpedia:Southampton and dbpedia:Benny_Hill.]

Linked Data Infrastructure
Image source: Tom Heath and Christian Bizer (2011) Linked Data: Evolving the Web into a
Global Data Space (1st edition). Synthesis Lectures on the Semantic Web: Theory and
Technology, 1:1, 1-136. Morgan & Claypool.

Consuming Linked Data
•  stateless
•  request-response
[Figure, derived from R. Tolksdorf: timeline (t) between client and server within the TCP life cycle: open connection, request, response, close connection.]

Client request (derived from R. Tolksdorf):
GET / HTTP/1.1
User-Agent: Mozilla/5.0 … Firefox/10.0.3
Host: markus-luczak.de:80
Accept: */*
Server response:
HTTP/1.1 200 OK
Server: Apache/2.0.49
Content-Language: en
Content-Type: text/html
Content-Length: 2990
<!DOCTYPE html>
<html xml:lang="en"
…

[Figure: the client sends an HTTP GET for the information resource http://example.com/content/index; the server holds two representations of it, index.html (representation 1) and index.rdf (representation 2).]
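How a server might choose between the two representations can be sketched as follows. This is a minimal illustration, assuming a simplified reading of the Accept header (exact media types and `*/*` only, no `type/*` wildcards); the function names and the mapping of media types to files are assumptions, not taken from the slides.

```python
# Representations of the information resource from the slide
# (http://example.com/content/index). The media-type mapping is an assumption.
REPRESENTATIONS = {
    "text/html": "index.html",
    "application/rdf+xml": "index.rdf",
}


def parse_accept(header):
    """Return (media_range, q) pairs sorted by descending q-value."""
    ranges = []
    for position, part in enumerate(header.split(",")):
        fields = [field.strip() for field in part.split(";")]
        media_range = fields[0]
        if not media_range:
            continue
        q = 1.0  # default quality when no q parameter is given
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    q = float(field[2:])
                except ValueError:
                    q = 0.0
        # Negative q sorts the highest quality first; position keeps header order on ties.
        ranges.append((-q, position, media_range))
    return [(media_range, -neg_q) for neg_q, _, media_range in sorted(ranges)]


def negotiate(accept_header, default="text/html"):
    """Pick the file to serve for a given Accept header, or None if nothing fits."""
    for media_range, q in parse_accept(accept_header):
        if q <= 0:
            continue
        if media_range == "*/*":
            return REPRESENTATIONS[default]
        if media_range in REPRESENTATIONS:
            return REPRESENTATIONS[media_range]
    return None  # a real server would typically answer 406 Not Acceptable


html_choice = negotiate("*/*")
rdf_choice = negotiate("text/html;q=0.5, application/rdf+xml")
```

Because the choice depends on a request header, the server log records which kind of client asked for which representation of a resource.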

•  Discover URIs
– Lookup services
•  http://rkbexplorer.com
– Web of Data search engines
•  http://sindice.com
•  http://ws.nju.edu.cn/falcons/objectsearch/index.jsp

•  Discover additional data for the resource at hand
•  follow links (“follow your nose”)
–  rdfs:seeAlso
–  owl:sameAs
•  Co-Reference services
–  http://sameas.org
•  Web of Data search engines

Linked Data
Source: http://wifo5-03.informatik.uni-mannheim.de/bizer/pub/LinkedDataTutorial/
The server can trace this usage.

Linked Data is queryable
[Figure: graph pattern in which the variables ?s and ?o are connected by foaf:name and foaf:knows.]

SPARQL-recap
•  Basic principle: pattern matching
– describe a pattern
– query an RDF triple set (“RDF graph”)
– the matching subset forms the result

SPARQL-recap
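[Figure: the pattern ?s dbp:birthPlace dbp:Berlin is matched against a graph that contains dbp:Klaus_Wowereit, dbp:Reinhard_Mey, dbp:Axel_Springer and the literal "Berlino"; the results for ?s are dbp:Klaus_Wowereit and dbp:Reinhard_Mey.]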
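The pattern matching principle can be sketched in a few lines. The graph below is a toy stand-in for the slide's example: the triples about dbp:Axel_Springer and the "Berlino" label are illustrative assumptions, and `match` is a hypothetical helper, not part of any SPARQL library.

```python
# Toy RDF graph as a set of (subject, predicate, object) tuples.
GRAPH = {
    ("dbp:Klaus_Wowereit", "dbp:birthPlace", "dbp:Berlin"),
    ("dbp:Reinhard_Mey", "dbp:birthPlace", "dbp:Berlin"),
    ("dbp:Axel_Springer", "dbp:birthPlace", "dbp:Hamburg"),  # illustrative assumption
    ("dbp:Berlin", "rdfs:label", "Berlino"),                  # illustrative assumption
}


def is_variable(term):
    """Variables are written with a leading question mark, as in SPARQL."""
    return term.startswith("?")


def match(pattern, graph):
    """Return one variable binding (a dict) per triple that matches the pattern."""
    solutions = []
    for triple in graph:
        binding = {}
        for pattern_term, triple_term in zip(pattern, triple):
            if is_variable(pattern_term):
                # A variable used twice must bind to the same value both times.
                if binding.setdefault(pattern_term, triple_term) != triple_term:
                    break
            elif pattern_term != triple_term:
                break
        else:
            solutions.append(binding)
    return solutions


born_in_berlin = sorted(
    binding["?s"] for binding in match(("?s", "dbp:birthPlace", "dbp:Berlin"), GRAPH)
)
```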

SPARQL queries on the Web
•  RESTful service endpoint
GET /sparql?query=PREFIX+rdf… HTTP/1.1
Host: dbpedia.org
•  Result formats: http://www.w3.org/TR/rdf-sparql-XMLres/, http://www.w3.org/TR/rdf-sparql-json-res/
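A request like the one above can be assembled with the Python standard library. The sketch only builds the URL and headers and sends nothing; the example query and the helper name `build_sparql_get` are assumptions for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_sparql_get(endpoint, query):
    """Encode a SPARQL query into the query string of a GET request URL."""
    return endpoint + "?" + urlencode({"query": query})


# Example query; any SPARQL query string would be handled the same way.
QUERY = """PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?s WHERE { ?s rdf:type <http://dbpedia.org/ontology/City> } LIMIT 5"""

url = build_sparql_get("http://dbpedia.org/sparql", QUERY)

# The Accept header asks for the JSON result format.
request_headers = {"Accept": "application/sparql-results+json"}

# Decoding the URL again recovers the original query text, which is
# what a usage miner does when it reads such requests from a log.
recovered = parse_qs(urlsplit(url).query)["query"][0]
```

Since the whole query travels in the request line, an ordinary access log of the endpoint already contains every query that clients issued.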

Querying Linked Data
[Figure: dbp:Klaus_Wowereit, dbp:Reinhard_Mey and http://www.markus-luczak.de/me are connected to dbp:Berlin by dbp:birthPlace edges.]

Querying Linked Data
•  the distribution of data across sources makes querying it challenging
•  Query approaches
–  follow-up queries: application-dependent, proprietary
–  query a central data repository (e.g. LOD cache): trivial
–  federated queries: more interesting
•  idea: query a mediator that distributes the sub-queries and returns the
aggregated result (as of SPARQL 1.1)
–  link traversal: very interesting
•  idea: follow links in the results retrieved from a source to expand the data
dynamically
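The link traversal idea can be sketched with an in-memory stand-in for the Web. Everything here is an assumption for illustration: `WEB` maps a URI to the triples its document would return, `dereference` replaces an HTTP GET, and the triples themselves are toy data. Real link traversal engines interleave traversal with query evaluation; this sketch simply collects data first and filters afterwards.

```python
from collections import deque

# Hypothetical mini "Web": dereferencing a URI yields the triples of its document.
WEB = {
    "http://www.markus-luczak.de/me": [
        ("http://www.markus-luczak.de/me", "dbp:birthPlace", "dbp:Berlin"),
    ],
    "dbp:Berlin": [
        ("dbp:Klaus_Wowereit", "dbp:birthPlace", "dbp:Berlin"),
        ("dbp:Reinhard_Mey", "dbp:birthPlace", "dbp:Berlin"),
    ],
}


def dereference(uri):
    """Stand-in for an HTTP GET on the URI; unknown URIs yield no triples."""
    return WEB.get(uri, [])


def traverse(seed, max_documents=10):
    """Breadth-first traversal: look up a URI, then follow the URIs it mentions."""
    seen, queue, triples = set(), deque([seed]), set()
    while queue and len(seen) < max_documents:
        uri = queue.popleft()
        if uri in seen:
            continue
        seen.add(uri)
        for s, p, o in dereference(uri):
            triples.add((s, p, o))
            for term in (s, o):
                if term not in seen:
                    queue.append(term)
    return triples


collected = traverse("http://www.markus-luczak.de/me")
born_in_berlin = sorted(
    s for s, p, o in collected if p == "dbp:birthPlace" and o == "dbp:Berlin"
)
```

Starting from a single seed URI, the data about the two other persons is discovered only because the first document links to dbp:Berlin.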

Two ways in which a user client/application accesses a dataset over HTTP:
•  Resource-centered access: GET /resource/resA → graph creation and content negotiation → application/rdf+xml, …
•  Query pattern access: GET /sparql?query=SELECT… → query processing (evaluate and perform query, create result set) → text/xml, … → process and select result
Roles: data publisher, data consumer

A game of pairs with SPARQL
Image: http://www.flickr.com/photos/therichbrooks/4040197666/, CC-BY 2.0, https://creativecommons.or…

SPARQL queries are self-descriptive data themselves
{
?s1 foaf:name "Markus Luczak-Roesch" .    (triple pattern, TP)
?s1 rdf:type dbp:Person                   (triple pattern, TP)
}                                         (both together: basic graph pattern, BGP)

[Figure: the primitives used in the query (rdf:type, foaf:name, foaf:Person) are checked against the dataset; one is marked ✔ and two are marked ✗.]

Example query with three triple patterns:
{
dbpedia:Benny_Hill dbp:birthPlace ?o1 .
?s dbp:basedNear ?o1 .
?s foaf:name ?o2
}
[Figure: result marks for the query and its parts: ✔ ✗ ✗ ✗]

Each triple pattern evaluated on its own:
{ dbpedia:Benny_Hill dbp:birthPlace ?o1 } ✔
{ ?s dbp:basedNear ?o1 } ✔
{ ?s foaf:name ?o2 } ✔

all TP
all TP in successful BGP
all TP in successful queries
all TP in failing queries
all TP in failing BGP
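The categories above can be derived by evaluating a logged BGP as a whole and each of its triple patterns in isolation. The sketch below is a minimal illustration under stated assumptions: queries are already parsed into lists of triple patterns, the dataset is a small in-memory toy graph, and `evaluate` and `classify` are hypothetical helpers rather than the method's actual implementation.

```python
# Toy dataset; all triples are illustrative assumptions.
GRAPH = [
    ("dbpedia:Benny_Hill", "dbp:birthPlace", "dbpedia:Southampton"),
    ("ex:alice", "dbp:basedNear", "dbpedia:Berlin"),
    ("ex:alice", "foaf:name", "Alice"),
]


def is_variable(term):
    return term.startswith("?")


def evaluate(bgp, graph):
    """Naive join: extend every partial binding with each triple pattern in turn."""
    solutions = [{}]
    for pattern in bgp:
        extended = []
        for binding in solutions:
            # Replace variables that are already bound by their values.
            bound = tuple(binding.get(term, term) for term in pattern)
            for triple in graph:
                candidate = dict(binding)
                for pattern_term, triple_term in zip(bound, triple):
                    if is_variable(pattern_term):
                        if candidate.setdefault(pattern_term, triple_term) != triple_term:
                            break
                    elif pattern_term != triple_term:
                        break
                else:
                    extended.append(candidate)
        solutions = extended
    return solutions


def classify(bgp, graph):
    """Report whether the BGP succeeds and whether each TP succeeds on its own."""
    return {
        "bgp_successful": bool(evaluate(bgp, graph)),
        "triple_patterns": [(tp, bool(evaluate([tp], graph))) for tp in bgp],
    }


QUERY_A = [
    ("dbpedia:Benny_Hill", "dbp:birthPlace", "?o1"),
    ("?s", "dbp:basedNear", "?o1"),
    ("?s", "foaf:name", "?o2"),
]
QUERY_B = [("?s", "foaf:name", "?o2")]

report_a = classify(QUERY_A, GRAPH)
report_b = classify(QUERY_B, GRAPH)
```

In this toy setting QUERY_A fails as a whole although each of its triple patterns matches on its own, so its patterns would be counted under "all TP in failing BGP".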

Statistical analysis
missing facts
inconsistent data
•  ns:Band ns:knownFor ?x
•  ns:Band ns:nationality ?y
•  ns:Band ns:instrument ?x
•  ns:Band ns:genre ?y
•  ns:Band ns:associatedBand ?z

Statistical analysis
(a) SWC (b) DBpedia (c) LGD
Figure 20: Usage of the concepts of the multi-ontologies (edges are hidden)
Source: own illustration
… of these datasets still has great potential for improvement. For example, there are
opportunities to use a larger number of specialised concepts. Likewise, in theory more
concepts from areas other than persons, places and … can …
Source: Master's thesis of Markus Bischoff

Estimating the effects of change
… to be added to the DBpedia 3.4 data set conforming to our approach.
Table 7.14: Recommended predicates to be added to the data set and the estimated
effects of change.
Primitive to add     Effects of change   Exists in data set
dbp:manufacturer     0.004505372         x
dbp:firstFlight      0.004505372         x
dbp:introduced       0.004505372         x
dbp:nationalOrigin   0.004505372
dbo:thumbnail        0.021986718         x
dbo:director         0.025047524
dbp:director         0.02503915          x
dbp:abstract         0.025797024         x
dbo:starring         0.034066643
dbp:starring         0.034066643         x
dbp:stars            0.034066643         x
skos:Concept         0.040946128         x
skos:broader         0.04116386          x
dbp:redirect         0.066441677         x

Log files → Selected log files → Preprocessed queries → Decomposed queries and transaction tables → Patterns → Change recommendations [0,1]

What’s in your SPARQL shopping bag?
[Figure: a sequence of logged queries, e.g. { ?s foaf:name ?o2 }, is grouped into transactions T1, T2, … Grouping criteria: 30 mins., same IP, same user agent.]
The visualization shows how primitives on the left hand side (LHS) of a rule imply
particular ones on the right hand side (RHS) and which likelihood such an association
has. In our specific case this allows us to analyze which primitives are queried
together frequently in failing queries. We spot two characteristic usage patterns: (1)
the properties and classes queried in the context of http://dbpedia.org/ontology/Aircraft;
(2) the properties and classes queried in the context of an object variable.
These can be further analyzed by exporting the association rules to GraphML and
visualizing the network with a network visualization and analysis tool such as Gephi
(http://gephi.org/). Figure 7.13 depicts one filtered network representation for our example
case. Nodes with a degree lower than 5 are filtered out (k-core network with k = 5)
to derive a well-arranged visualization of the most important primitives in failing
queries. Nodes represent LHS and RHS of the computed rules. Edges point from the
LHS to the RHS of the particular rules.
Figure 7.13: Filtered visualization of the association rule network (k-core 5 filter
applied to remove nodes with degree lower than 5).
Table 7.14 lists an exemplary set of primitives which would be recommended …
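How LHS → RHS rules with a likelihood arise from transactions can be sketched with pairwise rules only. This is a simplified stand-in, not the rule mining algorithm used in the thesis (which the excerpt does not name): it computes support and confidence for single-item rules, and the transactions and thresholds are illustrative assumptions.

```python
from collections import Counter
from itertools import permutations


def pair_rules(transactions, min_support=0.5, min_confidence=0.8):
    """Return rules (lhs, rhs, support, confidence) between single primitives."""
    total = len(transactions)
    item_count = Counter()
    pair_count = Counter()
    for transaction in transactions:
        items = set(transaction)
        item_count.update(items)
        # Ordered pairs, so (a, b) and (b, a) are counted as separate rules.
        pair_count.update(permutations(sorted(items), 2))
    rules = []
    for (lhs, rhs), together in pair_count.items():
        support = together / total              # share of transactions with both
        confidence = together / item_count[lhs]  # share of LHS transactions with RHS
        if support >= min_support and confidence >= min_confidence:
            rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


# Illustrative transactions of primitives taken from failing queries.
FAILING = [
    {"dbo:Aircraft", "dbp:manufacturer", "dbp:firstFlight"},
    {"dbo:Aircraft", "dbp:manufacturer"},
    {"dbo:Aircraft", "dbp:firstFlight"},
    {"foaf:name"},
]

rules = pair_rules(FAILING)
```

Each rule is a directed edge from LHS to RHS, so a list like this can be exported as a graph and filtered by node degree as described above.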
[Figure: query applied to dataset; the primitives rdf:type, foaf:name and foaf:Person are marked ✔ or ✗.]
The server can trace detailed usage.

Linked Data Fragments
Querying Datasets on the Web with High Availability
[Fig. 1: spectrum of Linked Data Fragments types. One end: data dump (generic requests, high client effort, high server availability). Other end: SPARQL result (specific requests, high server effort, low server availability). Linked Data documents and triple pattern fragments lie in between.]
Fig. 1: All HTTP triple interfaces offer Linked Data Fragments of a dataset. They differ
in the specificity of the data they contain, and thus the effort needed to create them.
3.2 Formal definitions
As a basis for our formalization, we use the following concepts of the RDF data
model [16] and the SPARQL query language [12]. We write $U$, $B$, $L$, and $V$ to
denote the sets of all URIs, blank nodes, literals, and variables, respectively.
Then, $T = (U \cup B) \times U \times (U \cup B \cup L)$ is the (infinite) set of all RDF triples. Any
tuple $tp \in (U \cup V) \times (U \cup V) \times (U \cup L \cup V)$ is a triple pattern. Any finite set of
such triple patterns is a basic graph pattern (BGP). Any more complex SPARQL
graph pattern, typically denoted by $P$, combines triple patterns (or BGPs) using
specific operators [12,20]. The standard (set-based) query semantics for SPARQL
defines the query result of such a graph pattern $P$ over a set of RDF triples
$G \subseteq T$ as a set that we denote by $[\![P]\!]_G$ and that consists of partial mappings
$\mu : V \to (U \cup B \cup L)$, which are called solution mappings. An RDF triple $t$ is
a matching triple for a triple pattern $tp$ if there exists a solution mapping $\mu$
such that $t = \mu[tp]$, where $\mu[tp]$ denotes the triple (pattern) that we obtain by
replacing the variables in $tp$ according to $\mu$.
For the sake of a more straightforward formalization, in this paper, we assume
without loss of generality that every dataset $G$ published via some kind of
fragments on the Web is a finite set of blank-node-free RDF triples; i.e., $G \subseteq T^*$
where $T^* = U \times U \times (U \cup L)$. Each fragment of such a dataset contains triples
that somehow belong together; they have been selected based on some condition,
which we abstract through the notion of a selector: …
xxx.xxx.xxx.xxx - - [17/Oct/2014:07:43:02 +0000]  
"GET /2014/en?subject=&predicate=&object=dbpedia%3AAustin HTTP/1.1" 200
1309 "http://fragments.dbpedia.org/2014/en" …
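A log line like this already carries the requested triple pattern in its query string. The following sketch extracts it with the standard library, assuming the common access log layout shown above; the function name and the returned dictionary are hypothetical.

```python
import re
from urllib.parse import parse_qs, urlsplit

# host, identity, user, [timestamp], "METHOD target PROTOCOL", status, size
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\]\s+'
    r'"(?P<method>\S+) (?P<target>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_tpf_request(line):
    """Extract timestamp, status and the requested triple pattern from a log line."""
    found = LOG_RE.match(line)
    if found is None:
        return None
    params = parse_qs(urlsplit(found.group("target")).query, keep_blank_values=True)

    def term(name):
        # An empty or missing parameter means this position is unconstrained.
        return params.get(name, [""])[0] or None

    return {
        "time": found.group("time"),
        "status": int(found.group("status")),
        "pattern": (term("subject"), term("predicate"), term("object")),
    }


line = (
    'xxx.xxx.xxx.xxx - - [17/Oct/2014:07:43:02 +0000]  '
    '"GET /2014/en?subject=&predicate=&object=dbpedia%3AAustin HTTP/1.1" 200 '
    '1309 "http://fragments.dbpedia.org/2014/en"'
)
entry = parse_tpf_request(line)
```

For the example line, the request corresponds to the triple pattern with only the object bound to dbpedia:Austin.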
… fetches the first page of the corresponding LDF. This page contains the cnt metadata,
which tells us how many matches the dataset has for each triple pattern.
The pattern is then decomposed by evaluating it using a) a triple pattern iterator
for the triple pattern with the smallest number of matches, and b) a new
BGP iterator for the remainder of the pattern. This results in a dynamic pipeline
for each of the mappings of its predecessor, as visualized in Fig. 2. Each pipeline
is optimized locally for a specific mapping, reducing the number of requests.
To evaluate a SPARQL query over a triple pattern fragment collection, we proceed
as follows. For each BGP of the query, a BGP iterator is created. Dedicated
iterators are necessary for other SPARQL constructs such as UNION and OPTIONAL,
but their implementation need not be LDF-specific; they can reuse the triple
pattern fragment BGP iterators. The predecessor of the first iterator is a start
iterator. We continuously pull solution mappings from the last iterator in the
pipeline and output them as solutions of the query, until the last iterator responds
with nil. This pull-based process is able to deliver results incrementally.
[Fig. 2 example: B = { ?person a Architect. ?person birthPlace ?city. ?city subject Capitals_in_Europe. }
•  ?city subject Capitals_in_Europe. yields Zagreb, Budapest, Rome, ...
•  for Zagreb: B' = { ?person a Architect. ?person birthPlace Zagreb. }
•  ?person birthPlace Zagreb. yields Alen_Peternac, Drago_Ibler, Juraj_Neidhardt, ...
•  for Drago_Ibler: B'' = { Drago_Ibler a Architect. }]
Fig. 2: A BGP iterator decomposes a BGP $B = \{tp_1, \ldots, tp_n\}$ into a triple pattern
iterator for an optimal $tp_i$ and, for each resulting solution mapping $\mu$ of $tp_i$, creates
a BGP iterator for the remaining pattern $B' = \{tp \mid tp = \mu[tp_j] \wedge tp_j \in B\} \setminus \{\mu[tp_i]\}$.
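The decomposition described in Fig. 2 can be sketched as a recursive generator. This is a simplified local stand-in, not the authors' client: match counts come from scanning an in-memory list instead of the cnt metadata of an HTTP fragment, and the dataset and all function names are assumptions.

```python
# Toy dataset; the triples are illustrative assumptions.
DATA = [
    ("Zagreb", "subject", "Capitals_in_Europe"),
    ("Budapest", "subject", "Capitals_in_Europe"),
    ("Alen_Peternac", "birthPlace", "Zagreb"),
    ("Drago_Ibler", "birthPlace", "Zagreb"),
    ("Juraj_Neidhardt", "birthPlace", "Zagreb"),
    ("Drago_Ibler", "a", "Architect"),
    ("Juraj_Neidhardt", "a", "Architect"),
    ("Some_Architect", "a", "Architect"),
    ("Some_Architect", "birthPlace", "Elsewhere"),
]


def is_variable(term):
    return term.startswith("?")


def matches(tp, dataset):
    """Yield one solution mapping per triple that matches the triple pattern."""
    for triple in dataset:
        mapping = {}
        for pattern_term, triple_term in zip(tp, triple):
            if is_variable(pattern_term):
                if mapping.setdefault(pattern_term, triple_term) != triple_term:
                    break
            elif pattern_term != triple_term:
                break
        else:
            yield mapping


def substitute(tp, mapping):
    """Compute mu[tp]: replace the variables of tp that the mapping binds."""
    return tuple(mapping.get(term, term) for term in tp)


def count(tp, dataset):
    """Local stand-in for the match count a fragment would report as metadata."""
    return sum(1 for _ in matches(tp, dataset))


def evaluate_bgp(bgp, dataset):
    """Pick the pattern with the fewest matches, then recurse on the remainder."""
    if not bgp:
        yield {}
        return
    best = min(range(len(bgp)), key=lambda i: count(bgp[i], dataset))
    rest = bgp[:best] + bgp[best + 1:]
    for mapping in matches(bgp[best], dataset):
        remainder = [substitute(tp, mapping) for tp in rest]  # B' for this mapping
        for extension in evaluate_bgp(remainder, dataset):
            yield {**mapping, **extension}


B = [
    ("?person", "a", "Architect"),
    ("?person", "birthPlace", "?city"),
    ("?city", "subject", "Capitals_in_Europe"),
]
results = list(evaluate_bgp(B, DATA))
```

Because the choice of the next pattern is made again for every mapping, the evaluation order can differ between Zagreb and Budapest, which is the point of a dynamic pipeline.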
Pre-print of a paper accepted to the International Semantic Web Conference 2014 (ISWC 2014).
The final publication is available at link.springer.com.
4.2 Dynamic iterator pipelines
A common approach to implement query execution in database sy…
iterators that are typically arranged in a tree or a pipeline, based …
results are computed recursively [10]. Such a pipelined approac…
studied for Linked Data query processing [13,15]. In order to en…
results and allow the straightforward addition of SPARQL oper…
ment a triple pattern fragments client using iterators.
The previous algorithm, however, cannot be implemented by …
pipeline. For instance, consider a query for architects born in Eu…
SELECT ?person ?city WHERE {
?person a dbpedia-owl:Architect. # tp1
?person dbpprop:birthPlace ?city. # tp2
?city dc:subject dbpedia:Capitals_in_Europe. # tp3
} LIMIT 100
Suppose the pipeline begins by finding ?city mappings for tp…
to choose whether it will next consider tp1 or tp2. The optimal …
differs depending on the value of ?city:
– For dbpedia:Paris, there are ±1,900 matches for tp2, and …
for tp1, so there will be less HTTP requests if we continue w…
– For dbpedia:Vilnius, there are 164 matches for tp2, and ±1…
tp1, so there will be less HTTP requests if we continue with …
With a static pipeline, we would have to choose the pipeline stru…
and subsequently reuse it.
In order to generate an optimized pipeline for each (sub-)qu…
a divide-and-conquer strategy in which a query is decomposed d…

Wikidata
•  API access to
•  items
•  edit history
•  items’ discussions
•  items’ access statistics
•  and more
•  Linked Data interface
•  MediaWiki API
•  Wikidata Query
•  SPARQL
•  Linked Data Fragments
Access to more than
“just” usage.

Thank you very much!
@mluczak | http://markus-luczak.de

References
•  Luczak-Rösch, M., & Bischoff, M. (2011). Statistical analysis of web of data usage. In Joint Workshop on Knowledge Evolution and
Ontology Dynamics (EvoDyn2011), CEUR WS.
•  Luczak-Rösch, M. (2014). Usage-dependent maintenance of structured Web data sets (Doctoral dissertation, Freie Universität Berlin,
Germany), http://edocs.fu-berlin.de/diss/receive/FUDISS_thesis_000000096138.
•  Elbedweihy, K., Mazumdar, S., Cano, A. E., Wrigley, S. N., & Ciravegna, F. (2011). Identifying Information Needs by Modelling Collective
Query Patterns. COLD, 782.
•  Elbedweihy, K., Wrigley, S. N., & Ciravegna, F. (2012). Improving Semantic Search Using Query Log Analysis. Interacting with Linked Data
(ILD 2012), 61.
•  Raghuveer, A. (2012). Characterizing machine agent behavior through SPARQL query mining. In Proceedings of the International
Workshop on Usage Analysis and the Web of Data, Lyon, France.
•  Arias, M., Fernández, J. D., Martínez-Prieto, M. A., & de la Fuente, P. (2011). An empirical study of real-world SPARQL queries. arXiv
preprint arXiv:1103.5043.
•  Hartig, O., Bizer, C., & Freytag, J. C. (2009). Executing SPARQL queries over the web of linked data (pp. 293-309). Springer Berlin
Heidelberg.
•  Verborgh, R., Hartig, O., De Meester, B., Haesendonck, G., De Vocht, L., Vander Sande, M., ... & Van de Walle, R. (2014). Querying
datasets on the web with high availability. In The Semantic Web–ISWC 2014 (pp. 180-196). Springer International Publishing.
•  Verborgh, R., Vander Sande, M., Colpaert, P., Coppens, S., Mannens, E., & Van de Walle, R. (2014, April). Web-Scale Querying through
Linked Data Fragments. In LDOW.
