Creating Knowledge out of Interlinked Data
http://lod2.eu

ISWC – 2013/10/23 – Page 1

Integrating NLP using Linked Data
Sebastian Hellmann, Jens Lehmann, Sören Auer and Martin Brümmer

http://slideshare.net/kurzum
http://nlp2rdf.org
http://lod2.eu

AKSW, Universität Leipzig

Introduction

Core problems in integrating NLP:
1. Too much heterogeneity
2. Almost no open standards available
3. Lack of open collaboration
4. Difficult and large domain


Problem analysis
Hardly any reusability in NLP
• Free software (as in free beer), but no open licenses
• Few standards and few mappings
• Integration is hard-wired (you have to write software)
– for each tool, for each framework
Main benefits of using RDF, OWL and Linked Data are:
• lower entry barrier (as a client / user)
• easy data integration (linking, mapping)
• reusability of tools and conceptualisations (ontologies)
• off-the-shelf solutions for common tasks


The Semantic Gap


NLP2RDF project
NLP2RDF (http://nlp2rdf.org)
- community project bootstrapped by LOD2
- develops NLP Interchange Format (NIF)
- umbrella project to combine (and consolidate) existing work


NIF Overview
The NLP Interchange Format (NIF) is an RDF/OWL-based format that aims to
achieve interoperability between Natural Language Processing (NLP) tools,
language resources and annotations.
→ to create an ecosystem of interoperable web services


• Reuse of existing standards such as RDF, OWL2, the PROV Ontology, LAF (ISO 24612), Unicode and RFC 5147
• Standardize access parameters, annotations (e.g. tokenization), validation and log messages
• Reuse of existing ontologies
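
A minimal sketch of what a NIF annotation could look like, assuming the NIF 2.0 core namespace and the class and property names as I understand them (nif:Context, nif:String, nif:RFC5147String, nif:isString, nif:referenceContext, nif:anchorOf, nif:beginIndex, nif:endIndex). The document URI http://example.org/doc1, the sample sentence and the helper names are hypothetical. The Turtle is assembled by hand to keep the example dependency-free; a real implementation would more likely use an RDF library.

```python
NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"
XSD = "http://www.w3.org/2001/XMLSchema#"


def char_uri(base: str, begin: int, end: int) -> str:
    """URI for the substring [begin, end), using an RFC 5147 style fragment."""
    return f"{base}#char={begin},{end}"


def literal(value: str) -> str:
    """Quote a Python string as a Turtle string literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def to_nif_turtle(base: str, text: str, begin: int, end: int) -> str:
    """Serialize one context (the whole text) and one annotated substring."""
    context = char_uri(base, 0, len(text))
    mention = char_uri(base, begin, end)
    return "\n".join([
        f"@prefix nif: <{NIF}> .",
        f"@prefix xsd: <{XSD}> .",
        "",
        f"<{context}> a nif:Context , nif:RFC5147String ;",
        f"    nif:isString {literal(text)} .",
        "",
        f"<{mention}> a nif:String , nif:RFC5147String ;",
        f"    nif:referenceContext <{context}> ;",
        f"    nif:anchorOf {literal(text[begin:end])} ;",
        f'    nif:beginIndex "{begin}"^^xsd:nonNegativeInteger ;',
        f'    nif:endIndex "{end}"^^xsd:nonNegativeInteger .',
    ])


# Hypothetical input: annotate the name inside a sample sentence.
text = "My favourite actress is Natalie Portman."
begin = text.index("Natalie Portman")
print(to_nif_turtle("http://example.org/doc1", text, begin, begin + len("Natalie Portman")))
```

The point of the design is that the substring itself gets a URI, so any tool can attach further triples to it without knowing about the other tools.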

Example NIF Workflow

A NIF workflow cannot, however, provide better performance (F-measure, speed) than a properly configured UIMA or GATE pipeline with the same components.
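
The integration idea behind such a workflow can be sketched as follows: when two tools describe the same substring URI, combining their output is a plain union of triples. This is an illustration only. Triples are modelled as Python tuples, the two tool functions are hypothetical stand-ins rather than real services, and the property names nif:oliaLink and itsrdf:taIdentRef are used as I understand them from NIF 2.0 and ITS 2.0.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def pos_tagger(mention_uri: str) -> Set[Triple]:
    """Hypothetical part-of-speech tagger emitting an OLiA link."""
    return {(mention_uri, "nif:oliaLink", "penn:NNP")}


def entity_linker(mention_uri: str) -> Set[Triple]:
    """Hypothetical entity linker emitting an ITS 2.0 identity reference."""
    return {(mention_uri, "itsrdf:taIdentRef", "http://dbpedia.org/resource/Leipzig")}


def describe(graph: Set[Triple], subject: str) -> List[Tuple[str, str]]:
    """Return all (predicate, object) pairs known about one subject."""
    return sorted((p, o) for s, p, o in graph if s == subject)


# Both tools annotate the same substring, identified by the same URI.
mention = "http://example.org/doc1#char=14,21"
merged = pos_tagger(mention) | entity_linker(mention)

for predicate, obj in describe(merged, mention):
    print(predicate, obj)
```

No tool-specific adapter is needed for the merge itself; the cost sits in agreeing on URIs and vocabularies up front.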

Use Cases
• Internationalization TagSet 2.0
• Part of Speech Tagging
• Wikifier API access via RDFaCE (Entity Linking)


UC1 - Internationalisation Tagset 2.0

• NIF will be the recommended RDF conversion of the Internationalisation Tagset 2.0 of W3C (ITS 2.0) - http://www.w3.org/TR/its20/
• NIF turns out to have a unique selling proposition regarding NLP and RDF
• There was no suitable alternative RDF vocabulary available for this conversion.

Source: http://www.w3.org/TR/its20/#EX-HTML-whitespace-normalization


ITS 2.0

RDFa parsers lose all provenance information:
<http://examples.com/books/wikinomics> dc:title "Wikinomics" .

Source: https://en.wikipedia.org/wiki/RDFa
UC1 - Internationalisation Tagset 2.0

String offset based on:
- Unicode NFC, code points
- ISO 24612
- RFC 5147
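
A small sketch of why the counting unit matters, assuming offsets are zero-based and counted in Unicode code points as listed above. The sample string and helper names are hypothetical. Python strings are indexed by code point, so the offsets map directly to an RFC 5147 style `char=` fragment, while UTF-16 code units or UTF-8 bytes would give different numbers for the same substring.

```python
from typing import Tuple


def code_point_span(text: str, needle: str) -> Tuple[int, int]:
    """Begin and end offsets of needle in text, counted in code points."""
    begin = text.index(needle)
    return begin, begin + len(needle)


def utf16_offset(text: str, code_point_offset: int) -> int:
    """The same position counted in UTF-16 code units."""
    return len(text[:code_point_offset].encode("utf-16-le")) // 2


def utf8_offset(text: str, code_point_offset: int) -> int:
    """The same position counted in UTF-8 bytes."""
    return len(text[:code_point_offset].encode("utf-8"))


# U+1D11E (musical G clef) lies outside the Basic Multilingual Plane.
text = "G clef \U0001D11E then Leipzig"
begin, end = code_point_span(text, "Leipzig")
fragment = f"char={begin},{end}"

print(fragment)                   # code point offsets
print(utf16_offset(text, begin))  # one more, the clef needs two UTF-16 units
print(utf8_offset(text, begin))   # three more, the clef needs four UTF-8 bytes
```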


UC2 – Part of Speech Tagging

Please see the paper:

http://purl.org/olia

UC3 – Wikifier API access via RDFaCE

https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki

http://rdface.aksw.org/


Evaluation
Please see the paper!
1) Quantitative Analysis with Google Wikilinks Corpus as NIF RDF
• Crawl of 3 million web sites, 40 million Wikipedia links
• ~ 477 million triples in NIF
2) Questionnaire and Developers Study for NIF 1.0
• NIF 1.0 was released in September 2009
• Over 30 known implementations (22 not from authors)
• 14 developers participated in the study
• Minimal NIF implementation requires less than 500 LoC
3) Qualitative Comparison with other Frameworks and Formats


State of NIF 2.0
Corpora as Linked Data
• Wikilinks corpus - http://wiki-link.nlp2rdf.org
• KORE 50 - http://www.yovisto.com/labs/ner-benchmarks/
• DBpedia Spotlight dataset
Tools
• entityclassifier.eu – http://entityclassifier.eu
• Spotlight - http://spotlight.dbpedia.org
• OpenNLP
• Stanford CoreNLP - https://github.com/NLP2RDF/software
• Validator - https://github.com/NLP2RDF/software


State of NIF 2.0
• Rollout is in progress
• Distributed implementation at different speed and quality
• Software lifecycle:
  • Implementation
  • Testing/Validation
  • Integration in the main software
  • Deployment as a web service
• Hosted web services are often out of date even when the code base is current


How to join - http://nlp2rdf.org


For ontology creators
NLP2RDF provides infrastructure for your NLP ontologies

• Redundant, persistent hosting
• Maven packages
• Code and documentation generation
• Continuous Integration (planned)
• Indexing
• Validation of instance data

Please write to me or the mailing list
nlp2rdf@lists.informatik.uni-leipzig.de


Take home message
• Early industrial uptake
  • OpenLink, Vistatech.ie, Zemanta, Tenforce, Unister
  • ITS 2.0 W3C standard was driven by localization industry
• NIF is open and free (CC0 planned)
• NIF is designed to be a cost-saver
• Not primarily aimed at increasing features or performance (F-Measure)

Thanks for your attention
Open Community – All feedback is welcome!
http://slideshare.net/kurzum
Websites:
http://nlp2rdf.org
http://lod2.eu


Annotations


Scalability - Salzburg Research KMT

https://bitbucket.org/srfgkmt/stanbol-nlp


Unicode Normal Form C

• Recommendation for RDF Literals
• http://unicode.org/reports/tr15/#Norm_Forms
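
A short sketch of why normalization has to happen before offsets are counted, using Python's standard `unicodedata` module. The sample string and the helper name `nfc_span` are hypothetical. The same visible text can contain a different number of code points depending on whether accented letters are stored composed or decomposed, so offsets are only comparable after both sides have normalized to NFC.

```python
import unicodedata
from typing import Tuple


def nfc_span(text: str, needle: str) -> Tuple[int, int]:
    """Code point offsets of needle in text after NFC normalization of both."""
    norm_text = unicodedata.normalize("NFC", text)
    norm_needle = unicodedata.normalize("NFC", needle)
    begin = norm_text.index(norm_needle)
    return begin, begin + len(norm_needle)


# "a" followed by U+0308 (combining diaeresis) renders like the single letter U+00E4.
decomposed = "Universita\u0308t Leipzig"
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))    # lengths differ by one code point
print(decomposed.index("Leipzig"))       # offset in the raw, decomposed string
print(nfc_span(decomposed, "Leipzig"))   # offsets after normalization
```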


Tokenization

Christian Chiarcos, Julia Ritz, Manfred Stede: By all these lovely tokens... Merging conflicting tokenizations.
Language Resources and Evaluation 46(1): 53-74 (2012)


Validation over specification

• SPARQL queries produce (find) errors
• http://persistence.uni-leipzig.org/nlp2rdf/ontologies/testcase/lib/nif-2.0-suite.t
• RLOG – An RDF Logging Ontology
• ./validate.jar -i nif-erroneous-model.ttl -t file
• Demo → character count
• Demo → all errors
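
The kind of check such a test could perform can be sketched in plain Python. This is only an analogue under stated assumptions: the validator named above runs SPARQL queries over RDF and reports through RLOG, whereas the function below works on a hypothetical dictionary representation and returns error messages as strings. It checks that the indices lie inside the context and that the anchored text matches the substring they select.

```python
from typing import Dict, List


def validate_annotation(context: str, annotation: Dict) -> List[str]:
    """Return a list of error messages; an empty list means the annotation is consistent."""
    errors: List[str] = []
    begin = annotation["beginIndex"]
    end = annotation["endIndex"]

    # Indices must be ordered and must not run past the end of the context.
    if not 0 <= begin <= end <= len(context):
        errors.append(
            f"indices {begin},{end} do not fit a context of length {len(context)}"
        )
        return errors

    # The anchored string must equal the substring selected by the indices.
    if context[begin:end] != annotation["anchorOf"]:
        errors.append(
            f"anchorOf {annotation['anchorOf']!r} does not match "
            f"substring {context[begin:end]!r} at {begin},{end}"
        )
    return errors


context = "My favourite actress is Natalie Portman."
good = {"beginIndex": 24, "endIndex": 39, "anchorOf": "Natalie Portman"}
bad = {"beginIndex": 25, "endIndex": 39, "anchorOf": "Natalie Portman"}

print(validate_annotation(context, good))
print(validate_annotation(context, bad))
```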

ALL DEMOS ARE AVAILABLE AT:
http://nlp2rdf.org/leipzig-24-9-2013

NIF

Demo:
http://nlp2rdf.lod2.eu/demo.php


OLiA

http://purl.org/olia
