Web of Data Usage Mining
Markus Luczak-Roesch
@mluczak | http://markus-luczak.de
What you should learn:
•  describe the architectural differences between content
negotiation and Linked Data queries;
•  develop applications that use different strategies to
consume Linked Data;
•  develop usage mining methods that exploit the atomic parts
of the SPARQL query language.
Linked Data principles
1.  Use URIs as names for
“Things” (resources).
2.  Use HTTP URIs to allow the
access to resources on the
Web.
3.  On resource access, deliver
meaningful information
conforming to Web standards
(RDF, SPARQL).
4.  Set RDF links to resources
published by other parties to
allow the discovery of more
resources.
Content Negotiation
http://dbpedia.org/resource/Berlin (the resource)
http://dbpedia.org/page/Berlin (HTML representation)
http://dbpedia.org/data/Berlin (RDF representation)
RDF link (S P O): yago-res:Berlin owl:sameAs dbpedia:Berlin
Source: http://www.w3.org/DesignIssues/LinkedData.html
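The content negotiation step behind the three Berlin URIs can be sketched in a few lines of Python. This is a simplified illustration, not how DBpedia is implemented: the `REPRESENTATIONS` table and the functions `parse_accept` and `negotiate` are hypothetical names, and the parser only handles exact media types, `q` values and the `*/*` wildcard.

```python
# Hypothetical table of representations for one resource (URLs from the slide).
REPRESENTATIONS = {
    "text/html": "http://dbpedia.org/page/Berlin",
    "application/rdf+xml": "http://dbpedia.org/data/Berlin",
}


def parse_accept(header):
    """Return (media_type, q) pairs from an Accept header, best first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0]
        if not media_type:
            continue
        q = 1.0  # default quality value when none is given
        for param in fields[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        prefs.append((media_type, q))
    # sorted() is stable, so equal q values keep the client's order.
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def negotiate(accept_header, representations=REPRESENTATIONS):
    """Pick the URL of the representation the client prefers, or None."""
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue
        if media_type in representations:
            return representations[media_type]
        if media_type == "*/*":
            # Wildcard: fall back to the first representation on offer.
            return next(iter(representations.values()))
    return None


print(negotiate("application/rdf+xml"))
print(negotiate("text/html,application/xhtml+xml;q=0.9,*/*;q=0.8"))
```

A browser-style Accept header resolves to the /page/ URL, an RDF client to the /data/ URL. In both cases the server sees which resource was requested, which is the usage signal the later slides build on.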
Linked Data exploits RDF
<http://markus-luczak.de#me> foaf:name "Markus Luczak-Roesch" .
<http://markus-luczak.de#me> foaf:knows <http://hannes.muehleisen.org#me> .
Linked Data vocabularies
•  Vocabulary reuse:
–  Geo
–  FOAF
–  GoodRelations
–  SIOC
–  DOAP
–  …
•  Vocabulary development:
–  Thing
•  Person
–  OfficeHolder
–  …
•  …
http://dbpedia.org/ontology/Person
http://dbpedia.org/ontology/OfficeHolder
http://xmlns.com/foaf/0.1/knows
Linked Data vocabularies
•  Mixing:
– Geo
– FOAF
– Dublin Core
– DBpedia Ontology
–  ...
http://xmlns.com/foaf/0.1/Person
http://www.w3.org/2003/01/geo/wgs84_pos#lat
http://dbpedia.org/ontology/leader
http://dbpedia.org/ontology/City
Linked Data is self-descriptive
[Figure: an RDF graph split into instance level and schema level. Instance level: int:resA, int:resB, ext:resA and the literal "ABC". Schema level: ont:ClassA, foaf:Person, foaf:Agent. Edge labels: rdf:type, foaf:name, owl:sameAs, owl:equivalentClass, rdfs:subClassOf.]
[Figure, reconstructed as triples: a relational row (u_id 45, firstname Markus, surname Luczak-Roesch) expressed as RDF]
<http://markus-luczak.de#me> rdf:type foaf:Person .
<http://markus-luczak.de#me> foaf:name "Markus Luczak-Roesch" .
[Figure, reconstructed as triples: a relational row (c_id 67, city Berlin, country Germany, inhabitants 3.375.222) expressed as RDF]
dbpedia:Berlin dbp:population "3.375.222" .
[Figure, reconstructed as triples: both graphs connected by a link]
<http://markus-luczak.de#me> rdf:type foaf:Person .
<http://markus-luczak.de#me> foaf:name "Markus Luczak-Roesch" .
<http://markus-luczak.de#me> dbp:birthPlace dbpedia:Berlin .
dbpedia:Berlin dbp:population "3.375.222" .
[Figure, reconstructed as triples: further links into DBpedia]
<http://markus-luczak.de#me> foaf:basedNear <http://markus-luczak.de/res/Soton> .
<http://markus-luczak.de#me> dbp:birthPlace dbpedia:Berlin .
dbpedia:Berlin skos:subject dbpedia:Cities_in_Europe .
dbpedia:Southampton skos:subject dbpedia:Cities_in_Europe .
<http://markus-luczak.de/res/Soton> rdfs:seeAlso dbpedia:Southampton .
[Figure, reconstructed as triples: links at instance and schema level]
<http://markus-luczak.de#me> foaf:basedNear <http://markus-luczak.de/res/Soton> .
<http://markus-luczak.de/res/Soton> rdfs:seeAlso dbpedia:Southampton .
<http://markus-luczak.de#me> rdf:type foaf:Person .
foaf:Person owl:equivalentClass dbp:Person .
dbpedia:Benny_Hill rdf:type dbp:Person .
dbpedia:Benny_Hill dbp:birthPlace dbpedia:Southampton .
Linked Data Infrastructure
Image source: Tom Heath and Christian Bizer (2011) Linked Data: Evolving the Web into a Global Data Space (1st edition). Synthesis Lectures on the Semantic Web: Theory and Technology, 1:1, 1-136. Morgan & Claypool.
Consuming Linked Data
•  stateless
•  request-response
[Figure: timeline (t) between client and server: open connection, request, response, close connection (TCP life cycle). Derived from R. Tolksdorf.]
Consuming Linked Data
Client request:
GET / HTTP/1.1
User-Agent: Mozilla/5.0 … Firefox/10.0.3
Host: markus-luczak.de:80
Accept: */*
Server response:
HTTP/1.1 200 OK
Server: Apache/2.0.49
Content-Language: en
Content-Type: text/html
Content-Length: 2990
<!DOCTYPE html>
<html xml:lang="en"
…
(derived from R. Tolksdorf)
Consuming Linked Data
[Figure: a client sends an HTTP GET for the information resource http://example.com/content/index; the server holds two representations of it, index.html (Representation 1) and index.rdf (Representation 2).]
Consuming Linked Data
•  Discover URIs
– Lookup services
•  http://rkbexplorer.com
– Web of Data search engines
•  http://sindice.com
•  http://ws.nju.edu.cn/falcons/objectsearch/index.jsp
Consuming Linked Data
•  Discover additional data for the resource at hand
•  follow links ("follow your nose")
–  rdfs:seeAlso
–  owl:sameAs
•  Co-Reference services
–  http://sameas.org
•  Web of Data search engines
Linked Data
Source: http://wifo5-03.informatik.uni-mannheim.de/bizer/pub/LinkedDataTutorial/
The server can trace this usage.
Linked Data is queryable
?s foaf:name "Markus Luczak-Roesch" .
<http://markus-luczak.de#me> foaf:knows ?o .
SPARQL-recap
•  Basic principle: pattern matching
– describe a pattern
– query an RDF triple set ("RDF graph")
– the matching subset forms the result
?s
http://dbpedia.org/resource/Berlin
SPARQL-recap
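[Figure: the pattern ?s dbp:birthPlace dbp:Berlin matched against an RDF graph with the nodes dbp:Klaus_Wowereit, dbp:Reinhard_Mey, dbp:Axel_Springer, dbp:Berlin and the literal "Berlino"; result bindings for ?s: dbp:Klaus_Wowereit, dbp:Reinhard_Mey.]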
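The pattern-matching principle can be sketched without any RDF library. The snippet below is a minimal illustration under stated assumptions: terms are plain strings, variables start with "?", and the triples in `GRAPH` are illustrative data loosely based on the slide (the `rdfs:label` predicate for "Berlino" is an assumption). `is_var`, `match_pattern` and `evaluate` are hypothetical helper names.

```python
def is_var(term):
    """Variables are written with a leading question mark, as in SPARQL."""
    return term.startswith("?")


def match_pattern(pattern, triple, binding=None):
    """Match one triple pattern against one triple.

    Returns the (possibly extended) variable binding, or None on mismatch.
    """
    binding = dict(binding or {})
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # Bind the variable, or check it against an earlier binding.
            if binding.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return binding


def evaluate(pattern, graph):
    """Collect the bindings of all triples in the graph that match."""
    results = []
    for triple in graph:
        binding = match_pattern(pattern, triple)
        if binding is not None:
            results.append(binding)
    return results


# Illustrative triples; not an excerpt of the real DBpedia data set.
GRAPH = [
    ("dbp:Klaus_Wowereit", "dbp:birthPlace", "dbp:Berlin"),
    ("dbp:Reinhard_Mey", "dbp:birthPlace", "dbp:Berlin"),
    ("dbp:Berlin", "rdfs:label", '"Berlino"'),
]

print(evaluate(("?s", "dbp:birthPlace", "dbp:Berlin"), GRAPH))
```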
SPARQL queries on the Web
•  RESTful service endpoint
GET /sparql?query=PREFIX+rdf… HTTP/1.1
Host: dbpedia.org
Result formats: http://www.w3.org/TR/rdf-sparql-XMLres/ and http://www.w3.org/TR/rdf-sparql-json-res/
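Because the query travels in the URL, the request line that ends up in the server's access log contains the full SPARQL text. The following sketch builds such a GET URL and recovers the query from it again, using only the Python standard library. The endpoint path `/sparql` is taken from the slide; the query itself and the helper name `sparql_get_url` are illustrative assumptions.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def sparql_get_url(endpoint, query):
    """Encode a SPARQL query as the 'query' parameter of a GET URL."""
    return endpoint + "?" + urlencode({"query": query})


# Illustrative query; any SPARQL text is encoded the same way.
QUERY = (
    "SELECT ?s WHERE { ?s <http://dbpedia.org/ontology/birthPlace> "
    "<http://dbpedia.org/resource/Berlin> } LIMIT 10"
)

url = sparql_get_url("http://dbpedia.org/sparql", QUERY)
print(url)

# What a usage miner does with a logged request: decode the query again.
logged_query = parse_qs(urlsplit(url).query)["query"][0]
print(logged_query == QUERY)
```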
Querying Linked Data
[Figure: RDF graph with the nodes dbp:Klaus_Wowereit, dbp:Reinhard_Mey, http://www.markus-luczak.de/me and dbp:Berlin, connected by dbp:birthPlace edges.]
Querying Linked Data
•  distribution of data creates challenges for querying it
•  Query approaches
–  follow-up queries ← application-dependent, proprietary
–  query a central data repository (e.g. LOD cache) ← trivial
–  federated queries ← more interesting
•  idea: query a mediator that distributes the sub-queries and returns the aggregated result (as of SPARQL 1.1)
–  link traversal ← very interesting
•  idea: follow links in the results retrieved from a source to expand the data dynamically
[Figure: two access paths over HTTP between the data consumer (user, client/application) and the data publisher (dataset).
Resource-centered access: GET /resource/resA → graph creation and content negotiation → application/rdf+xml, …
Query pattern access: GET /sparql?query=SELECT… → query processing (evaluate and perform query, create result set) → text/xml, … → the client processes and selects the result.]
Photo: http://www.flickr.com/photos/therichbrooks/4040197666/, CC-BY 2.0, https://creativecommons.or
A game of pairs with SPARQL
SPARQL queries are self-descriptive data themselves
{
  ?s1 foaf:name "Markus Luczak-Roesch" .
  ?s1 rdf:type dbp:Person
}
Each line is a triple pattern (TP); the two TPs together form a basic graph pattern (BGP).
SPARQL queries are self-descriptive data themselves
{
  ?s1 foaf:name "Markus Luczak-Roesch" .   ✔
  ?s1 rdf:type dbp:Person                  ✗
}                                          ✗
Dataset:
<http://markus-luczak.de#me> foaf:name "Markus Luczak-Roesch" .
<http://markus-luczak.de#me> rdf:type foaf:Person .
SPARQL queries are self-descriptive data themselves
{
  dbpedia:Benny_Hill dbp:birthPlace ?o1 .
  ?s dbp:basedNear ?o1 .
  ?s foaf:name ?o2
}
✔ ✗ ✗ ✗
The same triple patterns, each on its own:
{ dbpedia:Benny_Hill dbp:birthPlace ?o1 }   ✔
{ ?s dbp:basedNear ?o1 }   ✔
{ ?s foaf:name ?o2 }   ✔
[Figure: sets of triple patterns: all TP; all TP in successful BGP; all TP in successful queries; all TP in failing queries; all TP in failing BGP.]
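The check marks on the preceding slides can be reproduced with a small sketch that evaluates each triple pattern on its own and then the whole BGP as a join. It is a toy evaluator over string tuples, not a SPARQL engine; `match`, `solutions` and `classify` are hypothetical names, and `DATA` holds the two triples from the example dataset.

```python
def match(pattern, triple, binding):
    """Extend a binding so that pattern equals triple, or return None."""
    new = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if p_term.startswith("?"):
            if new.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return new


def solutions(bgp, graph):
    """Join the triple patterns of a BGP, one after the other."""
    bindings = [{}]
    for tp in bgp:
        extended = []
        for binding in bindings:
            for triple in graph:
                candidate = match(tp, triple, binding)
                if candidate is not None:
                    extended.append(candidate)
        bindings = extended
    return bindings


def classify(bgp, graph):
    """Return (success flag per triple pattern, success flag of the BGP)."""
    tp_ok = [bool(solutions([tp], graph)) for tp in bgp]
    return tp_ok, bool(solutions(bgp, graph))


ME = "http://markus-luczak.de#me"
DATA = [
    (ME, "foaf:name", '"Markus Luczak-Roesch"'),
    (ME, "rdf:type", "foaf:Person"),
]
BGP = [
    ("?s1", "foaf:name", '"Markus Luczak-Roesch"'),
    ("?s1", "rdf:type", "dbp:Person"),
]

print(classify(BGP, DATA))
```

Running this over every query in a log yields the sets named on the slide: each triple pattern can be filed under successful or failing patterns, BGPs and queries.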
Statistical analysis
missing facts
inconsistent data
•  ns:Band ns:knownFor ?x
•  ns:Band ns:nationality ?y
•  ns:Band ns:instrument ?x
•  ns:Band ns:genre ?y
•  ns:Band ns:associatedBand ?z
Statistical analysis
(a) SWC (b) DBpedia (c) LGD
Figure 20: Usage of the concepts of the multi-ontologies (edges are hidden)
Source: own illustration
[…] of these data sets still has great potential for improvement. For example, it is possible to use a higher number of specialized concepts. Likewise, more concepts from areas other than persons, places and […] could theoretically […]
Source: Master's thesis of Markus Bischoff
Estimating the effects of change
[…] to be added to the DBpedia 3.4 data set conforming to our approach.
Table 7.14: Recommended predicates to be added to the data set and the estimated effects of change.
Primitive to add      Effects of change   Exists in data set
dbp:manufacturer      0.004505372         x
dbp:firstFlight       0.004505372         x
dbp:introduced        0.004505372         x
dbp:nationalOrigin    0.004505372
dbo:thumbnail         0.021986718         x
dbo:director          0.025047524
dbp:director          0.02503915          x
dbp:abstract          0.025797024         x
dbo:starring          0.034066643
dbp:starring          0.034066643         x
dbp:stars             0.034066643         x
skos:Concept          0.040946128         x
skos:broader          0.04116386          x
dbp:redirect          0.066441677         x
Log files → Selected log files → Preprocessed queries → Decomposed queries and transaction tables → Patterns → Change recommendations [0,1]
What’s in your SPARQL shopping bag?
[Figure: the query log as a sequence of transactions T1, T2, T1, …; grouping criterion: 30 mins., same IP, same user agent. The two queries below recur in the log.]
{
  ?s1 foaf:name "Markus Luczak-Roesch" .
  ?s1 rdf:type dbp:Person
}
{
  dbpedia:Benny_Hill dbp:birthPlace ?o1 .
  ?s dbp:basedNear ?o1 .
  ?s foaf:name ?o2
}
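The grouping criterion on this slide (30 mins., same IP, same user agent) can be sketched as a simple sessionizer. One assumption is made explicit here: the 30 minutes are read as a maximum inactivity gap between two consecutive requests of the same client, which is only one possible interpretation. The entry layout `(timestamp, ip, user_agent, query)` and the name `sessionize` are hypothetical.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed meaning of "30 mins."


def sessionize(entries, gap=SESSION_GAP):
    """Group (timestamp, ip, user_agent, query) log entries into sessions.

    A session collects the queries of one (ip, user_agent) pair as long as
    consecutive requests are at most `gap` apart.
    """
    sessions = []
    open_sessions = {}  # (ip, user_agent) -> (session index, last timestamp)
    for ts, ip, agent, query in sorted(entries, key=lambda e: e[0]):
        key = (ip, agent)
        last = open_sessions.get(key)
        if last is not None and ts - last[1] <= gap:
            sessions[last[0]].append(query)
            open_sessions[key] = (last[0], ts)
        else:
            sessions.append([query])
            open_sessions[key] = (len(sessions) - 1, ts)
    return sessions


t0 = datetime(2014, 10, 17, 7, 0)
LOG = [
    (t0, "10.0.0.1", "agent-a", "q1"),
    (t0 + timedelta(minutes=10), "10.0.0.1", "agent-a", "q2"),
    (t0 + timedelta(minutes=50), "10.0.0.1", "agent-a", "q1"),
    (t0 + timedelta(minutes=5), "10.0.0.2", "agent-b", "q2"),
]

print(sessionize(LOG))
```

Each resulting list is one "shopping bag"; replacing the query labels by the triple patterns or primitives of each query gives the transaction tables used for pattern mining.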
Linked Data
Source: http://wifo5-03.informatik.uni-mannheim.de/bizer/pub/LinkedDataTutorial/
The server can trace this usage.
SPARQL
7. Evaluation
The visualization shows how primitives on the left hand side (LHS) of a rule imply particular ones on the right hand side (RHS) and which likelihood such an association has. In our specific case this allows us to analyze which primitives are queried together frequently in failing queries. We spot two characteristic usage patterns: (1) the properties and classes queried in the context of http://dbpedia.org/ontology/Aircraft; (2) the properties and classes queried in the context of an object variable. These can be further analyzed by exporting the association rules to GraphML and visualizing the network by use of a network visualization and analysis tool like Gephi (http://gephi.org/), for example. Figure 7.13 depicts one filtered network representation for our example case. Nodes with a degree lower than 5 are filtered out (k-core network with k = 5) to derive a well-arranged visualization of the most important primitives in failing queries. Nodes represent LHS and RHS of the computed rules. Edges point from the LHS to the RHS of the particular rules.
Figure 7.13: Filtered visualization of the association rule network (k-core 5 filter applied to remove nodes with degree lower than 5).
Table 7.14 lists an exemplary set of primitives which would be recommended […]
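How LHS → RHS rules with a likelihood arise from transactions can be illustrated with a deliberately small sketch. It only mines rules between pairs of primitives and is not the tool or algorithm used in the thesis; `pair_rules`, the thresholds and the transactions in `FAILING` are illustrative assumptions (the primitive names are borrowed from the table and text above).

```python
from itertools import combinations


def pair_rules(transactions, min_support=0.5, min_confidence=0.8):
    """Mine rules lhs -> rhs between single items.

    support    = share of transactions containing both items
    confidence = share of transactions with lhs that also contain rhs
    """
    n = len(transactions)
    item_count = {}
    pair_count = {}
    for transaction in transactions:
        items = sorted(set(transaction))
        for item in items:
            item_count[item] = item_count.get(item, 0) + 1
        for pair in combinations(items, 2):
            pair_count[pair] = pair_count.get(pair, 0) + 1
    rules = []
    for (a, b), count in pair_count.items():
        support = count / n
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_count[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


# Illustrative transactions: primitives that occur together in failing queries.
FAILING = [
    ["dbo:Aircraft", "dbp:manufacturer"],
    ["dbo:Aircraft", "dbp:manufacturer", "dbp:firstFlight"],
    ["dbo:Aircraft", "dbp:firstFlight"],
    ["dbo:starring"],
]

for lhs, rhs, support, confidence in pair_rules(FAILING):
    print(f"{lhs} -> {rhs} (support {support}, confidence {confidence})")
```

Each rule corresponds to one edge from LHS to RHS in the network described above; a full implementation would also consider larger item sets.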
Query applied to dataset:
{
  ?s1 foaf:name "Markus Luczak-Roesch" .   ✔
  ?s1 rdf:type dbp:Person                  ✗
}                                          ✗
Dataset:
<http://markus-luczak.de#me> foaf:name "Markus Luczak-Roesch" .
<http://markus-luczak.de#me> rdf:type foaf:Person .
The server can trace detailed usage.
Linked Data Fragments
Querying Datasets on the Web with High Availability
[Figure: spectrum of HTTP interfaces from generic requests (high client effort, high server availability) to specific requests (high server effort, low server availability): data dump, Linked Data document, triple pattern fragments, SPARQL result. All are types of Linked Data Fragments.]
Fig. 1: All HTTP triple interfaces offer Linked Data Fragments of a dataset. They differ in the specificity of the data they contain, and thus the effort needed to create them.
3.2 Formal definitions
As a basis for our formalization, we use the following concepts of the RDF data model [16] and the SPARQL query language [12]. We write $U$, $B$, $L$, and $V$ to denote the sets of all URIs, blank nodes, literals, and variables, respectively. Then, $T = (U \cup B) \times U \times (U \cup B \cup L)$ is the (infinite) set of all RDF triples. Any tuple $tp \in (U \cup V) \times (U \cup V) \times (U \cup L \cup V)$ is a triple pattern. Any finite set of such triple patterns is a basic graph pattern (BGP). Any more complex SPARQL graph pattern, typically denoted by $P$, combines triple patterns (or BGPs) using specific operators [12,20]. The standard (set-based) query semantics for SPARQL defines the query result of such a graph pattern $P$ over a set of RDF triples $G \subseteq T$ as a set that we denote by $[\![P]\!]_G$ and that consists of partial mappings $\mu : V \to (U \cup B \cup L)$, which are called solution mappings. An RDF triple $t$ is a matching triple for a triple pattern $tp$ if there exists a solution mapping $\mu$ such that $t = \mu[tp]$, where $\mu[tp]$ denotes the triple (pattern) that we obtain by replacing the variables in $tp$ according to $\mu$.
For the sake of a more straightforward formalization, in this paper, we assume without loss of generality that every dataset $G$ published via some kind of fragments on the Web is a finite set of blank-node-free RDF triples; i.e., $G \subseteq T^*$ where $T^* = U \times U \times (U \cup L)$. Each fragment of such a dataset contains triples that somehow belong together; they have been selected based on some condition, which we abstract through the notion of a selector: […]
xxx.xxx.xxx.xxx - - [17/Oct/2014:07:43:02 +0000] "GET /2014/en?subject=&predicate=&object=dbpedia%3AAustin HTTP/1.1" 200 1309 "http://fragments.dbpedia.org/2014/en" …
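A log line like the one above already contains a complete triple pattern. The sketch below extracts it with the Python standard library. It assumes that an empty `subject`, `predicate` or `object` parameter stands for a variable; `REQUEST_RE` and `triple_pattern_from_log` are hypothetical names, and the regular expression only covers the request part of the line.

```python
import re
from urllib.parse import urlsplit, parse_qs

LOG_LINE = (
    'xxx.xxx.xxx.xxx - - [17/Oct/2014:07:43:02 +0000] '
    '"GET /2014/en?subject=&predicate=&object=dbpedia%3AAustin HTTP/1.1" 200 1309'
)

# Matches the quoted request line: method, request target, protocol version.
REQUEST_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<target>\S+) HTTP/[\d.]+"')


def triple_pattern_from_log(line):
    """Return the (subject, predicate, object) pattern of a fragment request."""
    found = REQUEST_RE.search(line)
    if found is None:
        return None
    query_string = urlsplit(found.group("target")).query
    params = parse_qs(query_string, keep_blank_values=True)

    def term(name, variable):
        # Assumption: an empty or missing parameter means "any value".
        value = params.get(name, [""])[0]
        return value if value else variable

    return (term("subject", "?s"), term("predicate", "?p"), term("object", "?o"))


print(triple_pattern_from_log(LOG_LINE))
```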
[…] fetches the first page of the corresponding LDF. This page contains the cnt metadata, which tells us how many matches the dataset has for each triple pattern. The pattern is then decomposed by evaluating it using a) a triple pattern iterator for the triple pattern with the smallest number of matches, and b) a new BGP iterator for the remainder of the pattern. This results in a dynamic pipeline for each of the mappings of its predecessor, as visualized in Fig. 2. Each pipeline is optimized locally for a specific mapping, reducing the number of requests.
To evaluate a SPARQL query over a triple pattern fragment collection, we proceed as follows. For each BGP of the query, a BGP iterator is created. Dedicated iterators are necessary for other SPARQL constructs such as UNION and OPTIONAL, but their implementation need not be LDF-specific; they can reuse the triple pattern fragment BGP iterators. The predecessor of the first iterator is a start iterator. We continuously pull solution mappings from the last iterator in the pipeline and output them as solutions of the query, until the last iterator responds with nil. This pull-based process is able to deliver results incrementally.
[Figure: dynamic iterator pipeline.
B = { ?person a Architect. ?person birthPlace ?city. ?city subject Capitals_in_Europe. }
?city subject Capitals_in_Europe. → Zagreb, Budapest, Rome, ...
B' = { ?person a Architect. ?person birthPlace Zagreb. }
?person birthPlace Zagreb. → Alen_Peternac, Drago_Ibler, Juraj_Neidhardt, ...
B'' = { Drago_Ibler a Architect. }]
Fig. 2: A BGP iterator decomposes a BGP $B = \{tp_1, \ldots, tp_n\}$ into a triple pattern iterator for an optimal $tp_i$ and, for each resulting solution mapping $\mu$ of $tp_i$, creates a BGP iterator for the remaining pattern $B' = \{tp \mid tp = \mu[tp_j] \wedge tp_j \in B\} \setminus \{\mu[tp_i]\}$.
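The decomposition described above can be sketched as a recursive generator. This is a toy version under clear assumptions: the dataset is an in-memory list of triples rather than a remote triple pattern fragments server, counting local matches stands in for the cnt metadata, and `matches`, `substitute`, `bgp_iterator` and the triples in `GRAPH` are illustrative, not the authors' client.

```python
def matches(tp, graph):
    """Yield one solution mapping per triple that matches the pattern."""
    for triple in graph:
        mu = {}
        for p_term, t_term in zip(tp, triple):
            if p_term.startswith("?"):
                if mu.setdefault(p_term, t_term) != t_term:
                    break
            elif p_term != t_term:
                break
        else:
            yield mu


def substitute(tp, mu):
    """Replace the variables of a triple pattern according to mu."""
    return tuple(mu.get(term, term) for term in tp)


def bgp_iterator(bgp, graph):
    """Yield the solution mappings of a BGP, cheapest pattern first."""
    if not bgp:
        yield {}
        return
    # Stand-in for the cnt metadata of each fragment's first page.
    counts = [sum(1 for _ in matches(tp, graph)) for tp in bgp]
    if min(counts) == 0:
        return  # one pattern has no match, so the BGP has no solution
    i = counts.index(min(counts))
    rest = bgp[:i] + bgp[i + 1:]
    for mu in matches(bgp[i], graph):
        # New BGP iterator for the remaining pattern, specialised by mu.
        remaining = [substitute(tp, mu) for tp in rest]
        for tail in bgp_iterator(remaining, graph):
            yield {**mu, **tail}


# Illustrative triples using the names from Fig. 2.
GRAPH = [
    ("Drago_Ibler", "a", "Architect"),
    ("Drago_Ibler", "birthPlace", "Zagreb"),
    ("Alen_Peternac", "birthPlace", "Zagreb"),
    ("Zagreb", "subject", "Capitals_in_Europe"),
    ("Rome", "subject", "Capitals_in_Europe"),
]
BGP = [
    ("?person", "a", "Architect"),
    ("?person", "birthPlace", "?city"),
    ("?city", "subject", "Capitals_in_Europe"),
]

print(list(bgp_iterator(BGP, GRAPH)))
```

Because the pattern order is chosen again for every mapping, each branch of the recursion can take a different path, which is the difference from a static pipeline. From a usage mining perspective, every call to `matches` would be a separate HTTP request in the server log.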
Pre-print of a paper accepted to the International Semantic Web Conference 2014 (ISWC 2014).
The final publication is available at link.springer.com.
4.2 Dynamic iterator pipelines
A common approach to implement query execution in database sy[…]
iterators that are typically arranged in a tree or a pipeline, based […]
results are computed recursively [10]. Such a pipelined approac[…]
studied for Linked Data query processing [13,15]. In order to en[…]
results and allow the straightforward addition of SPARQL oper[…]
[…]ment a triple pattern fragments client using iterators.
The previous algorithm, however, cannot be implemented by […]
pipeline. For instance, consider a query for architects born in Eu[…]
SELECT ?person ?city WHERE {
?person a dbpedia-owl:Architect. # tp1
?person dbpprop:birthPlace ?city. # tp2
?city dc:subject dbpedia:Capitals_in_Europe. # tp3
} LIMIT 100
Suppose the pipeline begins by finding ?city mappings for tp[…]
to choose whether it will next consider tp1 or tp2. The optimal […]
differs depending on the value of ?city:
– For dbpedia:Paris, there are ±1,900 matches for tp2, and […]
for tp1, so there will be less HTTP requests if we continue w[…]
– For dbpedia:Vilnius, there are 164 matches for tp2, and ±1[…]
tp1, so there will be less HTTP requests if we continue with […]
With a static pipeline, we would have to choose the pipeline stru[…]
and subsequently reuse it.
In order to generate an optimized pipeline for each (sub-)qu[…]
a divide-and-conquer strategy in which a query is decomposed d[…]
Wikidata
•  API access to
•  items
•  edit history
•  items’ discussions
•  items’ access statistics
•  and more
•  Linked Data interface
•  MediaWiki API
•  Wikidata Query
•  SPARQL
•  Linked Data Fragments
Access to more than
“just” usage.
Thank you very much!
@mluczak | http://markus-luczak.de
Photo: http://www.flickr.com/photos/therichbrooks/4040197666/, CC-BY 2.0, https://creativecommons.or
References
•  Luczak-Rösch, M., & Bischoff, M. (2011). Statistical analysis of web of data usage. In Joint Workshop on Knowledge Evolution and
Ontology Dynamics (EvoDyn2011), CEUR WS.
•  Luczak-Rösch, M. (2014). Usage-dependent maintenance of structured Web data sets (Doctoral dissertation, Freie Universität Berlin,
Germany), http://edocs.fu-berlin.de/diss/receive/FUDISS_thesis_000000096138.
•  Elbedweihy, K., Mazumdar, S., Cano, A. E., Wrigley, S. N., & Ciravegna, F. (2011). Identifying Information Needs by Modelling Collective
Query Patterns. COLD, 782.
•  Elbedweihy, K., Wrigley, S. N., & Ciravegna, F. (2012). Improving Semantic Search Using Query Log Analysis. Interacting with Linked Data
(ILD 2012), 61.
•  Raghuveer, A. (2012). Characterizing machine agent behavior through SPARQL query mining. In Proceedings of the International
Workshop on Usage Analysis and the Web of Data, Lyon, France.
•  Arias, M., Fernández, J. D., Martínez-Prieto, M. A., & de la Fuente, P. (2011). An empirical study of real-world SPARQL queries. arXiv
preprint arXiv:1103.5043.
•  Hartig, O., Bizer, C., & Freytag, J. C. (2009). Executing SPARQL queries over the web of linked data (pp. 293-309). Springer Berlin
Heidelberg.
•  Verborgh, R., Hartig, O., De Meester, B., Haesendonck, G., De Vocht, L., Vander Sande, M., ... & Van de Walle, R. (2014). Querying
datasets on the web with high availability. In The Semantic Web–ISWC 2014 (pp. 180-196). Springer International Publishing.
•  Verborgh, R., Vander Sande, M., Colpaert, P., Coppens, S., Mannens, E., & Van de Walle, R. (2014, April). Web-Scale Querying through
Linked Data Fragments. In LDOW.

Web of Data Usage Mining

  • 1.
    Web of DataUsage Mining Markus Luczak-Roesch @mluczak | http://markus-luczak.de
  • 2.
    What you shouldlearn: •  describe the architectural differences between content negotiation and Linked Data queries; •  develop applications that use different strategies to consume Linked Data; •  develop usage mining methods that exploit the atomic parts of the SPARQL query language.
  • 3.
    Linked Data principles 1. Use URIs as names for “Things” (resources). 2.  Use HTTP URIs to allow the access to resources on the Web. 3.  On resource access, deliver meaningful information conforming to Web standards (RDF, SPARQL). 4.  Set RDF links to resources published by other parties to allow the discovery of more resources. http://dbpedia.org/resource/Berlin http://dbpedia.org/page/Berlin http://dbpedia.org/data/Berlin yago-res:Berlin S owl:sameAs P dbpedia:Berlin O h"p://www.w3.org/DesignIssues/LinkedData.html Content Negotiation
  • 4.
    Linked Data principles 1. Use URIs as names for “Things” (resources). 2.  Use HTTP URIs to allow the access to resources on the Web. 3.  On resource access, deliver meaningful information conforming to Web standards (RDF, SPARQL). 4.  Set RDF links to resources published by other parties to allow the discovery of more resources. h"p://www.w3.org/DesignIssues/LinkedData.html http://dbpedia.org/resource/Berlin http://dbpedia.org/page/Berlin http://dbpedia.org/data/Berlin yago-res:Berlin S owl:sameAs P dbpedia:Berlin O Content Negotiation
  • 5.
    Linked Data principles 1. Use URIs as names for “Things” (resources). 2.  Use HTTP URIs to allow the access to resources on the Web. 3.  On resource access, deliver meaningful information conforming to Web standards (RDF, SPARQL). 4.  Set RDF links to resources published by other parties to allow the discovery of more resources. h"p://www.w3.org/DesignIssues/LinkedData.html http://dbpedia.org/resource/Berlin http://dbpedia.org/page/Berlin http://dbpedia.org/data/Berlin yago-res:Berlin S owl:sameAs P dbpedia:Berlin O Content Negotiation
  • 6.
    Linked Data principles 1. Use URIs as names for “Things” (resources). 2.  Use HTTP URIs to allow the access to resources on the Web. 3.  On resource access, deliver meaningful information conforming to Web standards (RDF, SPARQL). 4.  Set RDF links to resources published by other parties to allow the discovery of more resources. h"p://www.w3.org/DesignIssues/LinkedData.html http://dbpedia.org/resource/Berlin http://dbpedia.org/page/Berlin http://dbpedia.org/data/Berlin yago-res:Berlin S owl:sameAs P dbpedia:Berlin O Content Negotiation
  • 7.
    Linked Data principles 1. Use URIs as names for “Things” (resources). 2.  Use HTTP URIs to allow the access to resources on the Web. 3.  On resource access, deliver meaningful information conforming to Web standards (RDF, SPARQL). 4.  Set RDF links to resources published by other parties to allow the discovery of more resources. h"p://www.w3.org/DesignIssues/LinkedData.html http://dbpedia.org/resource/Berlin http://dbpedia.org/page/Berlin http://dbpedia.org/data/Berlin yago-res:Berlin S owl:sameAs P dbpedia:Berlin O Content Negotiation
  • 8.
    Linked Data exploitsRDF h"p://markus-luczak.de#me “Markus Luczak-Roesch“ foaf:name h"p://markus-luczak.de#me h"p://hannes.muehleisen.org#me foaf:knows
  • 9.
    Linked Data vocabularies • Vocabulary reuse: –  Geo –  FOAF –  GoodRelations –  SIOC –  DOAP –  … •  Vocabulary development: –  Thing •  Person –  OfficeHolder –  … •  … http://dbpedia.org/ontology/Person http://dbpedia.org/ontology/OfficeHolder http://xmlns.com/foaf/0.1/knows
  • 10.
    Linked Data vocabularies • Mixing: – Geo – FOAF – Dublin Core – DBpedia Ontology –  ... http://xmlns.com/foaf/0.1/Person http://www.w3.org/2003/01/geo/wgs84_pos#lat http://dbpedia.org/ontology/leader http://dbpedia.org/ontology/City
  • 11.
    Linked Data isself-descriptive Instance level Schema level int:resA ont:ClassA owl:sameAs „ABC“ foaf:name ext:resA int:resB rdf:type owl:equivalentClass rdf:type foaf:name rdf:type rdf:type rdf:type rdfs:subClassOf foaf:Agent rdf:type foaf:Person rdfs:subClassOf owl:sameAs owl:equivalentClass
  • 12.
  • 13.
    “3.375.222“ dbpedia:Berlin c_id city country inhabitants 67 Berlin Germany 3.375.222 … … … dbp:populaVon
  • 14.
  • 15.
  • 16.
  • 17.
  • 18.
  • 19.
    Consuming Linked Data • stateless •  request-response t Client Server request response TCP life cycle derived from R. Tolksdorf Open connection Close connection
  • 20.
    Consuming Linked Data GET/ HTTP/1.1 User-Agent: Mozilla/5.0 … Firefox/10.0.3 Host: markus-luczak.de:80 Accept: */* HTTP/1.1 200 OK Server: Apache/2.0.49 Content-Language: en Content-Type: text/html Content-length: 2990 <!DOCTYPE html> <html xml:lang="en" … Client Server derived from R. Tolksdorf
  • 21.
    Server Consuming Linked Data Representation1 index.html Representation 2 index.rdf Information Resource http://example.com/content/index Client HTTP GET
  • 22.
    Consuming Linked Data • Discover URIs – Lookup services •  http://rkbexplorer.com – Web of Data search engines •  http://sindice.com •  http://ws.nju.edu.cn/falcons/objectsearch/index.jsp
  • 23.
    Consuming Linked Data • Discover additional data for the resource at hand •  follow links („follow your nose“) –  rdfs:seeAlso –  owl:sameAs •  Co-Reference services –  http://sameas.org •  Web of Data search engines
  • 24.
  • 25.
    Linked Data isqueryable ?s “Markus Luczak-Roesch“ foaf:name h"p://markus-luczak.de#me ?o foaf:knows
  • 26.
    SPARQL-recap •  Basic principle:pattern matching – describe pattern – query RDF triple set („RDF graph“) – matching subset comes into results ?s http://dbpedia.org/resource/Berlin
  • 27.
  • 28.
    SPARQL queries onthe Web •  RESTful service endpoint GET /sparql?query=PREFIX+rdf… HTTP/1.1 Host: dbpedia.org h"p://www.w3.org/TR/rdf-sparql-XMLres/ h"p://www.w3.org/TR/rdf-sparql-json-res/
  • 29.
  • 30.
    Querying Linked Data • distribution of data creates challenges for querying them •  Query approaches –  follow-up queries ß application-dependent, proprietary –  query a central data repository (e.g. LOD cache) ß trivial –  federated queries ß more interesting •  idea: query a mediator that distributes the sub-queries and returns aggregated result (as of SPARQL 1.1) –  link traversal ß very interesting •  idea: follow links in the results retrieved from a source to expand the data dynamically
  • 31.
  • 32.
  • 33.
    SPARQL queries areself-descriptive data themselves { ?s1 foaf:name “Markus Luczak-Roesch”. ?s1 rdf:type dbp:Person } TP TP BGP
  • 34.
    SPARQL queries areself-descriptive data themselves { ?s1 foaf:name “Markus Luczak-Roesch”. ?s1 rdf:type dbp:Person } h"p://markus-luczak.de#me “Markus Luczak-Roesch“ rdf:type foaf:name foaf:Person ✔ ✗ ✗
  • 35.
    SPARQL queries areself-descriptive data themselves { dbpedia:Benny_Hill dbp:birthPlace ?o1 . ?s dbp:basedNear ?o1 . ?s foaf:name ?o2 } ✔ ✗ ✗ ✗
  • 36.
    SPARQL queries areself-descriptive data themselves { dbpedia:Benny_Hill dbp:birthPlace ?o1 } ✔
  • 37.
    SPARQL queries areself-descriptive data themselves { ?s dbp:basedNear ?o1 } ✔
  • 38.
    SPARQL queries areself-descriptive data themselves { ?s foaf:name ?o2 } ✔
  • 39.
  • 40.
    Statistical analysis missing facts inconsistent data •  ns:Band ns:knownFor ?x • ns:Band ns:naVonality ?y •  ns:Band ns:instrument ?x •  ns:Band ns:genre ?y •  ns:Band ns:associatedBand ?z
  • 41.
    Statistical analysis (a) SWC(b) DBpedia (c) LGD Abbildung 20: Nutzung der Konzepte der Multi-Ontologien (Kanten sind ausgeblendet) Quelle: eigene Darstellung dieser Datensets besitzt noch ein großes Verbesserungspotential. Beispielsweise sind di M¨oglichkeiten gegeben, eine h¨ohere Anzahl an speziellen Konzepten zu nutzen. Eben so k¨onnen theoretisch mehr Konzepte aus anderen Bereichen als Personen, Orte unSource: Masterthesis of Markus Bischoff
  • 42.
    Estimating the effectsof change o be added to the DBpedia 3.4 data set conforming to our approach16 . able 7.14: Recommended predicates to be added to the data set and the estimate ↵ects of change. Primitive to add E↵ects of change Exists in data set dbp:manufacturer 0.004505372 x dbp:firstFlight 0.004505372 x dbp:introduced 0.004505372 x dbp:nationalOrigin 0.004505372 dbo:thumbnail 0.021986718 x dbo:director 0.025047524 dbp:director 0.02503915 x dbp:abstract 0.025797024 x dbo:starring 0.034066643 dbp:starring 0.034066643 x dbp:stars 0.034066643 x skos:Concept 0.040946128 x skos:broader 0.04116386 x dbp:redirect 0.066441677 x
  • 43.
  • 44.
    What’s in yourSPARQL shopping bag? { ?s1 foaf:name “Markus Luczak-Roesch”. ?s1 rdf:type dbp:Person } { dbpedia:Benny_Hill dbp:birthPlace ?o1 . ?s dbp:basedNear ?o1 . ?s foaf:name ?o2 } { ?s1 foaf:name “Markus Luczak-Roesch”. ?s1 rdf:type dbp:Person } { dbpedia:Benny_Hill dbp:birthPlace ?o1 . ?s dbp:basedNear ?o1 . ?s foaf:name ?o2 } { ?s1 foaf:name “Markus Luczak-Roesch”. ?s1 rdf:type dbp:Person } { dbpedia:Benny_Hill dbp:birthPlace ?o1 . ?s dbp:basedNear ?o1 . ?s foaf:name ?o2 } T1 T2 T1 … … 30 mins., same IP, same user agent … … …
  • 45.
  • 46.
  • 47.
    SPARQL 7. Evaluation The visualizationshows how primitives on the left hand side (LHS) of a rule imply particular ones on the right hand side (RHS) and which likelihood such an associa- tion has. In our specific case this allows us to analyze which primitives are queried together frequently in failing queries. We spot two characteristic usage patterns: (1) the properties and classes queried in the context of http://dbpedia.org/ontology/ Aircraft; (2) the properties and classes queried in the context of an object variable. These can be further analyzed by exporting the association rules to GraphML and vi- sualizing the network by use of a network visualization and analysis tool like Gephi15 for example. Figure 7.13 depicts one filtered network representation for our example case. Nodes with a degree lower than 5 are filtered out (k-core network with k = 5) to derive a well-arranged visualization of the most important primitives in failing queries. Nodes represent LHS and RHS of the computed rules. Edges point from the LHS to the RHS of the particular rules. Figure 7.13: Filtered visualization of the association rule network (k-core 5 filter applied to reduce nodes with degree lower than 5). Table 7.14 lists the an exemplary set of primitives which would be recommended 15http://gephi.org/ 177 { ?s1 foaf:name “Markus Luczak-Roesch”. ?s1 rdf:type dbp:Person } h"p://markus-luczak.de#me “Markus Luczak-Roesch“ rdf:type foaf:name foaf:Person ✔ ✗ ✗ query applied to dataset The server can trace detailed usage.
  • 48.
    Linked Data Fragments QueryingDatasets on the Web with High Availability 5 generic requests high client effort high server availability specific requests high server effort low server availability data dump Linked Data document sparql result triple pattern fragments various types of Linked Data Fragments Fig. 1: All http triple interfaces offer Linked Data Fragments of a dataset. They differ in the specificity of the data they contain, and thus the effort needed to create them. 3.2 Formal definitions As a basis for our formalization, we use the following concepts of the rdf data model [16] and the sparql query language [12]. We write U, B, L, and V to denote the sets of all uris, blank nodes, literals, and variables, respectively. Then, T = (U [ B) ⇥ U ⇥ (U [ B [ L) is the (infinite) set of all rdf triples. Any tuple tp 2 (U [ V) ⇥ (U [ V) ⇥ (U [ L [ V) is a triple pattern. Any finite set of such triple patterns is a basic graph pattern (bgp). Any more complex sparql graph pattern, typically denoted by P, combines triple patterns (or bgps) using specific operators [12,20]. The standard (set-based) query semantics for sparql defines the query result of such a graph pattern P over a set of rdf triples G ✓ T as a set that we denote by [[P]]G and that consists of partial mappings µ : V ! (U [ B [ L), which are called solution mappings. An rdf triple t is a matching triple for a triple pattern tp if there exists a solution mapping µ such that t = µ[tp], where µ[tp] denotes the triple (pattern) that we obtain by replacing the variables in tp according to µ. For the sake of a more straightforward formalization, in this paper, we as- sume without loss of generality that every dataset G published via some kind of fragments on the Web is a finite set of blank-node-free rdf triples; i.e., G ✓ T ⇤ where T ⇤ = U ⇥ U ⇥ (U [ L). 
    Each fragment of such a dataset contains triples that somehow belong together; they have been selected based on some condition, which we abstract through the notion of a selector: T
    xxx.xxx.xxx.xxx - - [17/Oct/2014:07:43:02 +0000] "GET /2014/en?subject=&predicate=&object=dbpedia%3AAustin HTTP/1.1" 200 1309 "http://fragments.dbpedia.org/2014/en"
    … fetches the first page of the corresponding LDF. This page contains the cnt metadata, which tells us how many matches the dataset has for each triple pattern. The pattern is then decomposed by evaluating it using a) a triple pattern iterator for the triple pattern with the smallest number of matches, and b) a new BGP iterator for the remainder of the pattern. This results in a dynamic pipeline for each of the mappings of its predecessor, as visualized in Fig. 2. Each pipeline is optimized locally for a specific mapping, reducing the number of requests.
    To evaluate a SPARQL query over a triple pattern fragment collection, we proceed as follows. For each BGP of the query, a BGP iterator is created. Dedicated iterators are necessary for other SPARQL constructs such as UNION and OPTIONAL, but their implementation need not be LDF-specific; they can reuse the triple pattern fragment BGP iterators. The predecessor of the first iterator is a start iterator. We continuously pull solution mappings from the last iterator in the pipeline and output them as solutions of the query, until the last iterator responds with nil. This pull-based process is able to deliver results incrementally.
    [Figure content:]
    $B$ = { ?person a Architect. ?person birthPlace ?city. ?city subject Capitals_in_Europe. }; triple pattern "?city subject Capitals_in_Europe." yields Zagreb, Budapest, Rome, ...
    $B'$ = { ?person a Architect. ?person birthPlace Zagreb. }; triple pattern "?person birthPlace Zagreb." yields Alen_Peternac, Drago_Ibler, Juraj_Neidhardt, ...
    $B''$ = { Drago_Ibler a Architect. } ...
    Fig. 2: A BGP iterator decomposes a BGP $B = \{tp_1, \ldots, tp_n\}$ into a triple pattern iterator for an optimal $tp_i$ and, for each resulting solution mapping $\mu$ of $tp_i$, creates a BGP iterator for the remaining pattern $B' = \{tp \mid tp = \mu[tp_j] \wedge tp_j \in B\} \setminus \{\mu[tp_i]\}$.
    Pre-print of a paper accepted to the International Semantic Web Conference 2014 (ISWC 2014).
    The final publication is available at link.springer.com.
    Querying Datasets on the Web with High Av
    4.2 Dynamic iterator pipelines
    A common approach to implement query execution in database sy
    iterators that are typically arranged in a tree or a pipeline, based
    results are computed recursively [10]. Such a pipelined approac
    studied for Linked Data query processing [13,15]. In order to en
    results and allow the straightforward addition of SPARQL oper
    ment a triple pattern fragments client using iterators.
    The previous algorithm, however, cannot be implemented by
    pipeline. For instance, consider a query for architects born in Eu
    SELECT ?person ?city WHERE {
      ?person a dbpedia-owl:Architect. # tp1
      ?person dbpprop:birthPlace ?city. # tp2
      ?city dc:subject dbpedia:Capitals_in_Europe. # tp3
    }
    LIMIT 100
    Suppose the pipeline begins by finding ?city mappings for tp
    to choose whether it will next consider tp1 or tp2. The optimal
    differs depending on the value of ?city:
    – For dbpedia:Paris, there are ±1,900 matches for tp2, and
      for tp1, so there will be less HTTP requests if we continue w
    – For dbpedia:Vilnius, there are 164 matches for tp2, and ±1
      tp1, so there will be less HTTP requests if we continue with
    With a static pipeline, we would have to choose the pipeline stru
    and subsequently reuse it.
    In order to generate an optimized pipeline for each (sub-)qu
    a divide-and-conquer strategy in which a query is decomposed d
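The decomposition described on this slide (pick the triple pattern with the fewest matches, then build a new BGP iterator for the remainder under each mapping) can be sketched as follows. This is a simplified illustration, not the client from the paper: an in-memory list stands in for the server, `fragment` is a hypothetical substitute for an HTTP request that returns matches plus the count metadata, and the toy triples are only loosely named after Fig. 2.

```python
def is_var(term):
    """Assumption of this sketch: variables are strings starting with '?'."""
    return isinstance(term, str) and term.startswith("?")


def match(tp, triple):
    """Return a mapping mu with triple == mu[tp], or None if there is none."""
    mu = {}
    for p, t in zip(tp, triple):
        if is_var(p):
            if mu.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return mu


def substitute(mu, tp):
    """Replace the variables in tp according to mu."""
    return tuple(mu.get(term, term) for term in tp)


def fragment(dataset, tp):
    """Stand-in for one triple pattern fragment: (solution mappings, count)."""
    mappings = [m for m in (match(tp, t) for t in dataset) if m is not None]
    return mappings, len(mappings)


def evaluate_bgp(dataset, bgp, mu=None):
    """Yield solution mappings for bgp, re-choosing the cheapest pattern
    for every mapping instead of fixing one static pipeline."""
    mu = mu or {}
    if not bgp:
        yield mu
        return
    fragments = [fragment(dataset, tp) for tp in bgp]
    # Index of the pattern with the smallest number of matches.
    best = min(range(len(bgp)), key=lambda i: fragments[i][1])
    for m in fragments[best][0]:
        rest = [substitute(m, tp) for i, tp in enumerate(bgp) if i != best]
        yield from evaluate_bgp(dataset, rest, {**mu, **m})


# Toy dataset (illustrative only).
dataset = [
    ("Drago_Ibler", "a", "Architect"),
    ("Drago_Ibler", "birthPlace", "Zagreb"),
    ("Alen_Peternac", "birthPlace", "Zagreb"),
    ("Zagreb", "subject", "Capitals_in_Europe"),
    ("Rome", "subject", "Capitals_in_Europe"),
]
bgp = [
    ("?person", "a", "Architect"),
    ("?person", "birthPlace", "?city"),
    ("?city", "subject", "Capitals_in_Europe"),
]
solutions = list(evaluate_bgp(dataset, bgp))
```

Because `evaluate_bgp` is a generator, solutions are produced incrementally as they are pulled. Every call to `fragment` corresponds to one request that a server could log, so usage becomes visible at the level of individual triple patterns.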
  • 49.
    Wikidata • API access to • items • edit history • items’ discussions • items’ access statistics • and more • Linked Data interface • MediaWiki API • Wikidata Query • SPARQL • Linked Data Fragments Access to more than “just” usage.
  • 50.
    Thank you very much! @mluczak | http://markus-luczak.de http://www.flickr.com/photos/therichbrooks/4040197666/, CC-BY 2.0, https://creativecommons.or
  • 51.
    References
    • Luczak-Rösch, M., & Bischoff, M. (2011). Statistical analysis of web of data usage. In Joint Workshop on Knowledge Evolution and Ontology Dynamics (EvoDyn2011), CEUR WS.
    • Luczak-Rösch, M. (2014). Usage-dependent maintenance of structured Web data sets (Doctoral dissertation, Freie Universität Berlin, Germany), http://edocs.fu-berlin.de/diss/receive/FUDISS_thesis_000000096138.
    • Elbedweihy, K., Mazumdar, S., Cano, A. E., Wrigley, S. N., & Ciravegna, F. (2011). Identifying Information Needs by Modelling Collective Query Patterns. COLD, 782.
    • Elbedweihy, K., Wrigley, S. N., & Ciravegna, F. (2012). Improving Semantic Search Using Query Log Analysis. Interacting with Linked Data (ILD 2012), 61.
    • Raghuveer, A. (2012). Characterizing machine agent behavior through SPARQL query mining. In Proceedings of the International Workshop on Usage Analysis and the Web of Data, Lyon, France.
    • Arias, M., Fernández, J. D., Martínez-Prieto, M. A., & de la Fuente, P. (2011). An empirical study of real-world SPARQL queries. arXiv preprint arXiv:1103.5043.
    • Hartig, O., Bizer, C., & Freytag, J. C. (2009). Executing SPARQL queries over the web of linked data (pp. 293-309). Springer Berlin Heidelberg.
    • Verborgh, R., Hartig, O., De Meester, B., Haesendonck, G., De Vocht, L., Vander Sande, M., ... & Van de Walle, R. (2014). Querying datasets on the web with high availability. In The Semantic Web–ISWC 2014 (pp. 180-196). Springer International Publishing.
    • Verborgh, R., Vander Sande, M., Colpaert, P., Coppens, S., Mannens, E., & Van de Walle, R. (2014, April). Web-Scale Querying through Linked Data Fragments. In LDOW.