Applied semantic technology and linked data
APPLIED LINKED DATA AND SEMANTIC TECHNOLOGY
Expanding a Neurobiology Dataset
Today we are discussing…
• What is the use case and who requested it?
• How do you import and normalize thousands of RDF triples worth of gene data?
• How do we enrich the normalized gene data with parallel research data sets?
• Creating instance pages without knowing exactly what will be displayed on them.
• Demonstration of the initial use cases
• Question and answer session
Why?
• Prototype: How do we assemble the data mine and refine the authoring tools?
• How do we expand this to the research community?
  • How do we expand ownership of the data to research professionals?
  • How do we build systems in a way that research professionals can author and link the data?
  • How do we publish these new relationships to the wider research community?
What is the Allen Institute for Brain Science?
• Launched in 2003 with seed funding from founder and philanthropist Paul G. Allen.
• Serving the scientific community is at the center of our mission to accelerate progress toward understanding the brain and neurological systems.
• The Allen Institute's multidisciplinary staff includes neuroscientists, molecular biologists, informaticists, and engineers.

“The Allen Institute for Brain Science is an independent 501(c)(3) nonprofit medical research organization dedicated to accelerating the understanding of how the human brain works.”
Human Brain Map
• Open, public online access
• A detailed, interactive three-dimensional anatomic atlas of the "normal" human brain
• Data from multiple human brains
• Genomic analysis of every brain structure, providing a quantitative inventory of which genes are turned on where
• High-resolution atlases of key brain structures, pinpointing where selected genes are expressed down to the cellular level
• Navigation and analysis tools for accessing and mining the data
Biological Linked Data Map
• Open, public online access
• Data from multiple RDF data stores
• Complete import pipeline using LDIF framework
• Outlines of each imported instance embedding inline wiki properties and providing views of imported properties from original RDF datasets
• Charting tools that "pivot" SPARQL queries providing several views of each query
• Navigation and composition tools for accessing and mining the data
Where did we get the data?
• KEGG : Kyoto Encyclopedia of Genes and Genomes
  • “KEGG GENES is a collection of gene catalogs for all complete genomes generated from publicly available resources, mostly NCBI RefSeq.”
• Diseasome
  • “The Diseasome website is a disease/disorder relationships explorer and a sample of an innovative map-oriented scientific work. Built by a team of researchers and engineers, it uses the Human Disease Network data set.”
• DrugBank
  • “The DrugBank database is a unique bioinformatics and cheminformatics resource that combines detailed drug data with comprehensive drug target information.”
• SIDER
  • “SIDER contains information on marketed medicines and their recorded adverse drug reactions. The information is extracted from public documents and package inserts.”
New ontology map for import
• Genes
  • DrugBank : 4,553
  • Diseasome : 3,919
  • KEGG : 9,841
• Diseases
  • Diseasome : 4,213
  • KEGG : 459
• Effects
  • SIDER : 1,737
• Drugs
  • DrugBank : 4,772
  • KEGG : 2,482
  • SIDER : 924
• Pathways
  • KEGG : 28,442

We chose to intentionally simplify the ontology due to disagreements between researchers about entity relationships and subclasses.
Importing and mapping the Linked Data
• R2R
  • 32,900 instances were converted to the wiki ontology.
  • 583,746 properties mapped
  • Pathways were ignored for wiki ontology import, but are available within the triple store KEGG Pathway graph.
• SIEVE
  • 20,849 instances available in wiki ontology after SILK normalization
  • Instance merging affected drugs, genes, and diseases across datasets.
• Triple Store SPARQL Update

Process diagram labels: Networked Storage / Download / Local Storage / R2R Mapping Engine (Maps Entities to New Ontology) / Import to Wiki / Sieve Mapping Engine (Normalizes Entities across data sources) / Normalize Entities / Triple Store (Available with SPARQL Queries)
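
The per-source counts on the "New ontology map for import" slide, leaving out the KEGG pathways, appear to add up to the 32,900 converted instances quoted on this slide. The short check below is only a sanity calculation over the numbers as printed; the variable names are illustrative and not part of the original deck.

```python
# (source, instance count) pairs as printed on the ontology slide, pathways excluded.
NON_PATHWAY_COUNTS = [
    ("DrugBank", 4772), ("KEGG", 2482), ("SIDER", 924),
    ("Diseasome", 4213), ("KEGG", 459),
    ("DrugBank", 4553), ("Diseasome", 3919), ("KEGG", 9841), ("SIDER", 1737),
]

CONVERTED = 32900        # instances converted to the wiki ontology
AFTER_MERGING = 20849    # instances available after normalization

total = sum(count for _, count in NON_PATHWAY_COUNTS)

# Totals per source, summed across all entity categories.
per_source = {}
for source, count in NON_PATHWAY_COUNTS:
    per_source[source] = per_source.get(source, 0) + count

# Difference between converted instances and those available after merging.
merged_away = CONVERTED - AFTER_MERGING
```

Read this way, the drop from 32,900 to 20,849 is the 12,051-instance difference between what was converted and what remained available after normalization.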
LDIF: LINKED DATA INTEGRATION FRAMEWORK
Expanding a Neurobiology Dataset
Linked Data challenges
• Data sources that overlap in content may:
  • Use a wide range of different RDF vocabularies
  • Use different identifiers for the same real-world entity
  • Provide conflicting values for the same properties
• Implications
  • Queries become hand-crafted for a specific RDF data set, no different than using a proprietary API.
  • Individual, improvised and manual merging techniques for data sets.
• Integrating public datasets with internal databases poses the same problems
Linked Data Integration Framework
• LDIF normalizes the Linked Data from multiple sources into a clean, local target representation while keeping track of data provenance.
1. Collect data: Managed download and update
2. Translate data into a single, target vocabulary
3. Resolve identifier aliases into local target URIs
4. Cleanse data and resolve conflicting values
5. Output to local file system or triple store
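
The five steps above can be read as a chain of transformations over quads, where the fourth element (the graph) carries provenance. The sketch below is a minimal illustration of that idea only; it is not LDIF's implementation, and the stage functions, prefixes, and sample data are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# A quad is (subject, predicate, object, graph); the graph names the source.
Quad = Tuple[str, str, str, str]
Stage = Callable[[List[Quad]], List[Quad]]


def run_pipeline(quads: Iterable[Quad], stages: Iterable[Stage]) -> List[Quad]:
    """Apply each stage in order, feeding one stage's output to the next."""
    data = list(quads)
    for stage in stages:
        data = stage(data)
    return data


def translate(quads: List[Quad]) -> List[Quad]:
    # Step 2 (illustrative): rewrite a source predicate into the target vocabulary.
    mapping = {"src:name": "target:label"}
    return [(s, mapping.get(p, p), o, g) for s, p, o, g in quads]


def resolve(quads: List[Quad]) -> List[Quad]:
    # Step 3 (illustrative): rewrite known aliases into one local target URI.
    aliases = {"a:Ca": "local:Calcium", "b:calcium": "local:Calcium"}
    return [(aliases.get(s, s), p, o, g) for s, p, o, g in quads]


def cleanse(quads: List[Quad]) -> List[Quad]:
    # Step 4 (illustrative): keep the first value seen per (subject, predicate).
    first: Dict[Tuple[str, str], Quad] = {}
    for quad in quads:
        first.setdefault((quad[0], quad[1]), quad)
    return list(first.values())


# Step 1 stands in as a literal list; step 5 would serialize `result`.
collected = [
    ("a:Ca", "src:name", "Calcium", "graph:sourceA"),
    ("b:calcium", "target:label", "calcium", "graph:sourceB"),
]
result = run_pipeline(collected, [translate, resolve, cleanse])
```

Keeping the graph in every quad is what lets the final store still say which source a surviving value came from.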
LDIF Pipeline
1. Collect data
   Supported Data Formats
   • RDF Files (Multiple Formats)
   • SPARQL Endpoints
   • Crawling Linked Data
2. Translate data
3. Resolve identities
4. Cleanse data
5. Output data
Component Stack
LDIF Pipeline
1. Collect data
2. Translate data
   Sources use a wide range of different RDF vocabularies
   R2R: dbpedia-owl:City, schema:Place, fb:location.citytown → location:City
3. Resolve identities
4. Cleanse data
5. Output data
Component Stack
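
As a rough illustration of the translation step on this slide, the sketch below rewrites class names from several source vocabularies into one target class. It assumes a plain lookup table; real R2R mappings are written in their own mapping language, and `CLASS_MAP` and `translate_types` are hypothetical names used only here.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical mapping table: source class -> target class, following the slide.
CLASS_MAP: Dict[str, str] = {
    "dbpedia-owl:City": "location:City",
    "schema:Place": "location:City",
    "fb:location.citytown": "location:City",
}


def translate_types(triples: List[Triple]) -> List[Triple]:
    """Rewrite the object of rdf:type triples into the target vocabulary."""
    translated = []
    for subject, predicate, obj in triples:
        if predicate == "rdf:type":
            # Unknown classes pass through unchanged.
            obj = CLASS_MAP.get(obj, obj)
        translated.append((subject, predicate, obj))
    return translated


source_triples = [
    ("ex:London", "rdf:type", "dbpedia-owl:City"),
    ("ex:London2", "rdf:type", "fb:location.citytown"),
    ("ex:London", "rdfs:label", "London"),
]
target_triples = translate_types(source_triples)
```

After this step a single query against `location:City` can reach entities that were originally typed in three different vocabularies.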
LDIF Pipeline
1. Collect data
2. Translate data
3. Resolve identities
   Sources use different identifiers for the same entity
   SILK compares "London" against the candidates:
   • London, England
   • London, MA, USA
   • London, TN, USA
   • London, TX, USA
   Result: London = London, England
4. Cleanse data
5. Output data
Component Stack
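
Once a linker has decided which identifiers denote the same entity, the aliases still have to be collapsed into one URI. The sketch below assumes the matches arrive as sameAs-style pairs and uses a small union-find to pick one representative per cluster; it does not reproduce how SILK finds the matches, and the URIs are made up for illustration.

```python
from typing import Dict, Iterable, Tuple


def canonical_uris(same_as: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map every URI mentioned in the links to one representative per cluster."""
    parent: Dict[str, str] = {}

    def find(uri: str) -> str:
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving keeps chains short
            uri = parent[uri]
        return uri

    for left, right in same_as:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            # Assumption: the lexicographically smaller URI becomes the representative.
            keep, drop = sorted((root_left, root_right))
            parent[drop] = keep
    return {uri: find(uri) for uri in parent}


links = [
    ("dbpedia:London", "geo:London_England"),
    ("geo:London_England", "fb:london"),
]
canonical = canonical_uris(links)
```

Rewriting subjects and objects through `canonical` gives every source the same identifier for London, which is the precondition for merging their properties onto one page.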
LDIF Pipeline
1. Collect data
2. Translate data
3. Resolve identities
4. Cleanse data
   Sources provide different values for the same property
   • London, England has a population of 8.174M people
   • London, England has a population of 9.2M people
   SILK: rdfs:population: 8.174M
5. Output data
Component Stack
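
The slide does not say which rule picked 8.174M over 9.2M. One plausible policy, shown below purely as an assumption, is to keep the value from the most trusted source graph for each single-valued property. The trust scores, graph names, and the `fuse` helper are hypothetical.

```python
from typing import Dict, List, Tuple

Quad = Tuple[str, str, str, str]  # (subject, predicate, object, graph)


def fuse(quads: List[Quad], trust: Dict[str, float]) -> List[Quad]:
    """Keep one value per (subject, predicate): the one from the most trusted graph."""
    best: Dict[Tuple[str, str], Quad] = {}
    for quad in quads:
        key = (quad[0], quad[1])
        current = best.get(key)
        # Graphs without a score count as 0.0; ties keep the value seen first.
        if current is None or trust.get(quad[3], 0.0) > trust.get(current[3], 0.0):
            best[key] = quad
    return list(best.values())


conflicting = [
    ("ex:London", "ex:population", "9200000", "graph:sourceB"),
    ("ex:London", "ex:population", "8174000", "graph:sourceA"),
]
fused = fuse(conflicting, {"graph:sourceA": 0.9, "graph:sourceB": 0.5})
```

This only suits properties that should hold a single value; multi-valued properties such as side effects would need a different policy, for example keeping the union.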
LDIF Pipeline
1. Collect data
2. Translate data
3. Resolve identities
4. Cleanse data
5. Output data
   Supported Output Formats
   • N-Quads
   • N-Triples
   • SPARQL Update Stream
   Provenance tracking using Named Graphs
Component Stack
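
To make the link between N-Quads and provenance concrete, the sketch below writes one statement as an N-Quads line whose fourth term names the graph it came from. It is a simplified serializer for plain string literals and IRIs only (no language tags, datatypes, or blank nodes), and the example IRIs are placeholders.

```python
def to_nquad(subject: str, predicate: str, obj: str, graph: str,
             literal: bool = False) -> str:
    """Serialize one statement as an N-Quads line; the graph IRI records its source."""
    if literal:
        # Escape backslash first, then the characters N-Quads requires escaping.
        escaped = (
            obj.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
            .replace("\r", "\\r")
        )
        obj_term = f'"{escaped}"'
    else:
        obj_term = f"<{obj}>"
    return f"<{subject}> <{predicate}> {obj_term} <{graph}> ."


line = to_nquad(
    "http://example.org/London",
    "http://www.w3.org/2000/01/rdf-schema#label",
    "London",
    "http://example.org/graph/sourceA",
    literal=True,
)
```

Dropping the fourth term from each line gives the N-Triples form, at the cost of losing the per-source graph.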
LDIF Architecture
Normalized Linked Data is not always
pretty.
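# Prefixes (drugbank:, sider:, rdfs:) are not declared on the slide.
# "{{{1}}}" is the wiki template parameter that carries the gene name.
SELECT DISTINCT ?group1 ?item1 ?group2 ?item2 {
  # DrugBank graph: drugs whose target has the requested gene name.
  GRAPH ?G {
    ?target drugbank:geneName "{{{1}}}" ;
            drugbank:geneName ?geneName ;
    .
    ?drug drugbank:target ?target ;
          drugbank:genericName ?item2 ;
          drugbank:affectedOrganism ?group2 ;
    .
  }
  # SIDER graph: joined on the drug name, returning each side effect label.
  GRAPH ?G1 {
    ?siderDrug sider:drugName ?item2 ;
               rdfs:label ?group1 ;
               sider:sideEffect ?effect ;
    .
    ?effect rdfs:label ?item1 .
  }
}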
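
The query returns rows of (?group1, ?item1, ?group2, ?item2), and the charts later in the deck show an inner and an outer ring. One way such rows might be pivoted into ring data is sketched below: count the distinct items per group, once for each pair of columns. The `ring_counts` helper and the sample rows are hypothetical and do not come from the prototype or its data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set

Row = Dict[str, str]


def ring_counts(rows: Iterable[Row], group_key: str, item_key: str) -> Dict[str, int]:
    """Count distinct items per group, e.g. distinct drugs per affected organism."""
    members: Dict[str, Set[str]] = defaultdict(set)
    for row in rows:
        members[row[group_key]].add(row[item_key])
    return {group: len(items) for group, items in members.items()}


# Placeholder result rows shaped like the query's SELECT variables.
rows = [
    {"group1": "DrugA", "item1": "Nausea", "group2": "Humans", "item2": "DrugA"},
    {"group1": "DrugA", "item1": "Headache", "group2": "Humans", "item2": "DrugA"},
    {"group1": "DrugB", "item1": "Nausea", "group2": "Humans", "item2": "DrugB"},
]

inner_ring = ring_counts(rows, "group2", "item2")  # drugs by affected organism
outer_ring = ring_counts(rows, "group1", "item1")  # side effects by drug
```

Because both rings come from the same result set, one query can feed several views, which seems to be what the deck means by "pivoting" a SPARQL query.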
Semantic MediaWiki
Semantic MediaWiki is a full-fledged framework, in
conjunction with many spinoff extensions, that can turn a
wiki into a powerful and flexible knowledge management
system. All data created within SMW can easily be
published via the Semantic Web, allowing other systems to
use this data seamlessly.
Four initial templates for each instance by category
1. Custom infobox within outline template
   • Visible inline properties
2. Outline template providing instance information
3. Widget template displaying dynamic charts or third party services
   • Donut charts and AIBS gene feed
4. Broad table SPARQL queries showing instance relationships
5. Hidden inline properties for other extensions
Creating instance wiki pages
• The Triple Store now contained tens of thousands of recognized category instances. Creating the pages requires a bot.
1. Fetch the RDF dumps from an active D2R server
2. Use regex to fetch the rdfs:label property that was mapped by R2R as an instance name
3. Open category-specific text file of wiki markup (page of template includes)
4. Contact Neurowiki and request a new page from the list of names with the category content

Data flow diagram labels: Create List of Page Names / 1.0 / RDF Data / Download / Sanitize Script / 2.0 / Create CSV / Category Page Names / Text of Wiki markup for page instance / Read Open / 3.0 Create MediaWiki Page (MediaWiki Gateway rb Framework) / REST interface / 4.0 / Neurowiki Instance Page
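
The first steps of the bot can be sketched offline: pull labels out of an N-Triples dump with a regular expression, sanitize them into page names, and write a CSV of category and page name. This is a minimal sketch under stated assumptions: the label property is taken to be rdfs:label, the sample dump is invented, and the final page-creation call (done in the original through a Ruby MediaWiki gateway) is deliberately left out.

```python
import csv
import io
import re
from typing import List

# Matches: <subject> <rdfs:label IRI> "literal" and captures the literal text.
LABEL_RE = re.compile(
    r'<[^>]+>\s+<http://www\.w3\.org/2000/01/rdf-schema#label>\s+"((?:[^"\\]|\\.)*)"'
)
# Characters MediaWiki does not allow in page titles.
FORBIDDEN = re.compile(r"[#<>\[\]|{}]")


def page_names(ntriples: str) -> List[str]:
    """Extract labels and turn them into de-duplicated wiki page names."""
    names: List[str] = []
    for match in LABEL_RE.finditer(ntriples):
        name = re.sub(r"\s+", " ", FORBIDDEN.sub("", match.group(1))).strip()
        if not name:
            continue
        name = name[0].upper() + name[1:]  # page titles start with a capital
        if name not in names:
            names.append(name)
    return names


def to_csv(category: str, names: List[str]) -> str:
    """Write one 'category,page name' row per page."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for name in names:
        writer.writerow([category, name])
    return buffer.getvalue()


dump = (
    '<http://example.org/drug/1> <http://www.w3.org/2000/01/rdf-schema#label> "propofol" .\n'
    '<http://example.org/drug/2> <http://www.w3.org/2000/01/rdf-schema#label> "Calcium [test]" .\n'
    '<http://example.org/drug/3> <http://www.w3.org/2000/01/rdf-schema#label> "Propofol" .\n'
)
names = page_names(dump)
csv_text = to_csv("Drug", names)
```

A regex is enough for a dump with one plain-string label per line; a dump with language tags or escaped quotes inside labels would be safer to read with an RDF parser.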
Final application stack
• JavaScript View Layer (Highcharts / SproutCore / jQuery)
• Semantic MediaWiki
• Triple Store (Virtuoso) and Relational Database (MySQL)
• LDIF and AIBS REST API (Gene Heat Map Data)
• Data sources: AIBS, Diseasome, DrugBank, SIDER, KEGG
NEUROWIKI
Expanding a Neurobiology Dataset
How are base entities like Calcium represented?
1. The wiki page and corresponding template components are rendered.
2. Relations are pulled from the normalized data store of linked data.
3. The JavaScript components are populated via a data feed

Diagram labels: Drug Search / 1.0 Wiki Page / Aggregate Page of Components / 2.0 Calcium Relations / Neurobase Data Stores / 3.0 Selected Widget for Display
How are base entities like Calcium represented?
• Because so many organisms contain calcium, the mappings to affected species were never created, in order to conserve space in the data store.

Chart: Drug and Disease Class Ratios of Calcium
Inner Circle: Drugs by Affected Species, Outer Circle: Disease Ratios by Class
What are the dangers of Propofol?
1. Propofol DrugBank relations are rendered in corresponding JavaScript components.
2. The Diseasome disease relations show classes of illness Propofol affects.
3. An aggregate of SIDER side effects is rendered in relation to Propofol and disease classes.

Diagram labels: Drug Search / Neurobase Data Stores / 1.0 Propofol Relations / 2.0 Propofol Disease Relations / 3.0 Propofol Side Effects / Aggregate Components
Which drugs are used in Chemotherapy?
1. Diseasome disease relations normalized by LDIF.
2. DrugBank and AIBS relations to genes affected by both the disease and drug.
3. SIDER side effects related to the gene, disease, and drug.
4. DrugBank drug glossary definition specifying various forms of Cancer treatment.

Diagram labels: Disease Search / Neurobase Data Stores / 1.0 Disease Relations / 2.0 Gene Drug Relations / 3.0 Drug Side Effects / 4.0 Drug Info Box / Aggregate Components
Chart: Drug and Disease Class Ratios of AR
Inner Circle: Drugs by Affected Species, Outer Circle: Disease Ratios by Class
Chart: Drug and Side Effect Ratios of AR
Inner Circle: Drugs by Affected Species, Outer Circle: Side Effect Ratios of Drugs
Chart: Drug and Disease Class Ratios of Nilutamide
Inner Circle: Drugs by Affected Species, Outer Circle: Disease Ratios by Class
Chart: Drug and Disease Class Ratios of Bicalutamide
Inner Circle: Drugs by Affected Species, Outer Circle: Disease Ratios by Class
Expanding the Prototype
• Semantic MediaWiki query construction
  • Could this be done in SPARQL?
• Authoring SILK / R2R mappings for the LDIF Pipeline
  • Extremely difficult and the editors are not intuitive
• How do you get data owners to fuse the sets and create the data store themselves?
  • Tested with Aura Wiki prototype
• Expand authoring provenance
  • How do we ensure new data / links come from an authoritative source?
Today we discussed…
• The Allen Institute for Brain Science (AIBS)
• Four similar research data sets to interlink with the AIBS data set
• An import pipeline named Linked Data Integration Framework (LDIF)
• The interlinking process for 5 concurrent research data sets (AIBS, DrugBank, Diseasome, KEGG, SIDER)
• A prototype neurobiology authoring platform.
• Creating instance pages to display the new connections.
• Demonstration of the initial use cases.
QUESTIONS?
COMMENTS?
Expanding a Neurobiology Dataset
THANK YOU.
Expanding a Neurobiology Dataset


Editor's Notes

  • #2 Hello, my name is William Smith and today we will be talking about a project near and dear to my heart. I served as project manager for a prototype application, worked closely with 2 German teams, and we were the first customer for several of the tools used to assemble this application. I was also the chief integration point into Vulcan, so I am well aware of the technologies, code bases, and data sets that went into assembling this project…
  • #3 So what are we discussing today? First and foremost, this was a project for an internal organization at Vulcan involved in mapping the human brain. This, of course, generates petabytes of data and millions of triples worth of gene mappings, but we took a smaller slice of a couple hundred thousand genes for the initial prototype. There were also several parallel research programs generating data in a format we could use, and a conference of industry professionals was held to find the interlinking pieces of these datasets. Finally, I’m going to walk through the data pipeline, the application itself, and a set of our original use cases.
  • #4 Why? Well, a core problem in neurobiology, and in most sciences for that matter, is the inability of industry professionals to share and author sets of data across projects. This leaves an odd gap where people with computer science degrees are linking data they don’t fully understand, and the people who understand the data don’t have the ability to add the interlinks for greater vision into the data. With this problem known, our original prototype soon expanded into how we get these tools into the hands of the research community, and that in itself created 3 core questions: ownership, authorship, and publishing provenance of the newly linked data.
  • #5 The organization that chartered this project and provided the original data sets is the Allen Institute for Brain Science, or AIBS. When you hear me say AIBS on accident I’m referring to this organization. It was launched in 2003 by Paul G. Allen and has the explicit focus of mapping the human brain to accelerate our understanding of the brain and neurological systems. Furthermore, the institute is a 501(c)(3) nonprofit medical research organization employing hundreds of neuroscientists, molecular biologists, informaticists, and engineers within the Seattle area. ----- Meeting Notes (1/28/14 12:15) ----- So who requested this? accelerate our understanding
  • #6 And this is the Institute’s core product… or several screenshots of the core product. Here we have gene heat maps… some location data… where it all is, location-wise, in the human brain. As odd as those screen caps are, they are accessed by thousands of researchers daily and this is considered a major success. It’s open; the public can go to this site right now and browse the catalog. There are currently 3 human brains fully mapped with a 4th in progress. Each of these donors has generated a genomic analysis of brain structure and a thorough catalog of genes with respect to location. While the captions are small, they are part of a much larger suite of atlas navigation tools with several components, e.g. the heat map, pinpointing genes expressed down to the cellular level. And most importantly, for our purposes, they generate terabytes of data with industry-wide IDs we can link to other sources!
  • #7 And here’s our prototype in screenshots. No page is hand typed, no graph is hand entered: 4 static templates pull data from our normalized mine, creating all these pretty pictures and full pages of text. There are over 30 thousand of these pages. We will be discussing the first two points in depth, RDF and the LDIF pipeline. Charting tools use SPARQL, which we will not be discussing in depth; however, I have a hidden slide of the details should somebody be really malicious and want to ask about SPARQL queries. Finally, our navigation closely resembles the common MediaWiki installation, which everybody who has been on the internet in the last 10 years is familiar with… editing, on the other hand, is very different, and currently only bots create and maintain the pages.
  • #8 Which brings us to these parallel tracks of research data I keep mentioning. To choose these sets we had a conference of industry researchers and data professionals go through the hundreds of biology mines looking for useful projects that closely relate to genes found in the human brain. The 4 prototype sets chosen were <read slides>
  • #9 Our original cross section of data found these connections. Not the full dump, but with roughly 15 thousand gene connections plenty of pages produced relevant connections and filled pages with interesting data points. <read numbers> And to the right we have our simplified ontology. Looks incredible, right… hey, they can’t all be winners, and don’t blame me, blame Protégé. This was generated with basic 1-1 relations and domain-range logic where applicable. <joke about line colors> The simplification was created in part because nobody that does anything in neuroscience agrees with another person that does the same thing. We could get them to agree that in some gray-area way these things are related on the domain-range level… so that generates that, and it looks way worse if I try to spread the boxes out in any other way.
  • #10 Which brings us to the pretty graph I hate… because it makes unifying things into that ugly Protégé graph look easy. It’s not, but it does give a good overall view of what we are able to convert directly into the wiki: 32 thousand 900 instances turned directly into pages, with over 500 thousand properties across the set. Even more important, after “same as” connections were made we had 20 thousand fully populated pages, and these are the pages with connections across the datasets. That brings up an important point: if I imported all of the gene data I would end up with a huge wiki by page count, but the better part of these pages would be nothing more than a page title and empty templates. Hence the importance of finding these connections and only tracking the useful data points, like pages with more than a title. On the right we have the simplified process, which I will be going into in more detail very soon <read right graph> ----- Meeting Notes (1/28/14 12:15) ----- But it does give a good overall view of what we are able to convert directly into the wiki.
  • #11 And those parts that just turned red - <read red parts> - are the process we will be discussing in a section I like to call: Linked Data Integration Framework
  • #12 ----- Meeting Notes (1/28/14 13:28) ----- Created over the last 4 years. Created by Free University of Berlin. Same team that helped build the prototype. First customer. Still active, last update late 2013. 2 main components, R2R and SILK.
  • #13 And this is why I don’t like the oversimplification of that process chart. Plenty of difficult computer science problems, and none of them cut and dried to solve… Assuming we can find overlapping data sources, you then have to unify vocabularies, meaning the predicate of the triple. Once this is done and you can agree on what the name of the entity is, you will still have data sets with the same entity going by a range of names and ids. Finally, once you’ve located the same entities, there’s no guarantee the normalized vocabularies will be referencing the same value. Without the normalization pipeline, LDIF, this creates queries that are siloed to a specific data set, basically creating an API… and that’s good for companies like Facebook and Google but terrible for independent research. The last point is less of a problem for us because we decided long ago this was a philanthropic prototype with 501(c)(3) data, but it is something to be considered when working with, say, national security data.
  • #14 Lucky for us, as customer 1 of the LDIF framework, we get to test all of the steps in normalization and hope for the best or fix it ourselves! If this works right we will… <read steps>
  • #20 And here’s the LDIF architecture. All this stuff on the bottom is the 5 data sets; the arrows don’t really apply because they didn’t link up that well before LDIF. Then it’s on to the pipeline. After processing and re-releasing, the arrows apply, and then we shove all of that into our own public triple store for use in the application.
  • #21 And here’s your application.
  • #22 ----- Meeting Notes (1/28/14 13:28) ----- Pubby created 5 years ago. Used in DBpedia. Free University of Berlin. No search, have to follow links. Not very modern viewing experience. No expression of data via links.
  • #23 Less than helpful – FINE.
  • #24 Well, I am in this business to please the consumer, and my consumer understands common web architectures, even if they don’t know they do, so let’s try an installation of Semantic MediaWiki. Invented roughly 5 years ago, it’s a series of plugins that run on MediaWiki, which was created by the good folks that invented Wikipedia! Millions of people see it every day while researching homework they don’t feel like doing, when sloppily referencing college term papers, or, in my opinion, while creating one of the most accurate and comprehensive encyclopedias humanity has to date. Even better, we can display the semantic properties of our normalized data inline! <show arrows> <can you expand> Of course I can.
  • #25 I’m going to build you 4 base templates by category: Gene, Drug, Disease, and Side Effect. These templates will have the base information displaying our semantic properties - <run through wireframe>
  • #26 ----- Meeting Notes (1/28/14 12:15) ----- This created a problem: namely, how do I create 30,000 pages and not get fired for entering data over the course of 2 years? Well, a lot of what you see on Wikipedia isn't actually input or maintained by humans. The gene pages all have very complex info boxes tracking ids, regions, and a variety of known properties mined from other sources. The pieces of code that do this mining and page creation are called wiki-bots. We wrote a wiki-bot for each page type to create our 30,000 pages, and this is the creation pipeline these bots utilized.
  • #30 ----- Meeting Notes (1/28/14 12:15) ----- I'll be running through 3 core use cases we used to test the project and explaining how the pages and graphs were generated. All of the graphs related to the genes, diseases, drugs, and side effects within the next few slides are generated from the wiki. However, it's far easier to view the wiki when you have access behind the Vulcan firewall... so I had to rely on screenshots for this portion.
  • #31 ----- Meeting Notes (1/28/14 12:15) ----- Calcium - difficult use case - within all creatures - has lots of connections to other entities - but we don't want to create all the pages
  • #33 ----- Meeting Notes (1/28/14 13:28) ----- 15 minutes of fame 5 years ago - Powerful sedative used in anesthesiology (you should not use it as a sleep aid) - Listed as cause of death for popular musician
  • #40 Fix this
  • #44 ----- Meeting Notes (1/28/14 12:15) ----- Finally, we head over to DrugBank and search for an obscure drug page... Bicalutamide... It's an oral nonsteroidal anti-androgen used in the treatment of cancer that affects the androgen receptor, thus validating our links across the data. This is an example of how a not-so-simple correlation of data can give researchers deeper insight by merging sets and presenting the interlinks.
  • #46 ----- Meeting Notes (1/28/14 12:15) -----Aura wiki - it was used to test crowd sourcing of data authoring for a proto-AI.
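
The normalization steps described in the notes for slides #10 and #13 (unify the predicates, collapse "same as" entities onto one id, then only keep pages that have more than a title) can be sketched roughly as follows. This is a minimal illustration in plain Python, not LDIF's actual API and not the R2R or SILK configuration used in the project. Every identifier, prefix, predicate name, and the `PREDICATE_MAP` table below is a made-up assumption for the example.

```python
from collections import defaultdict

# Hypothetical mapping from source-specific predicates to one shared vocabulary.
PREDICATE_MAP = {
    "aibs:gene_symbol": "label",
    "drugbank:geneName": "label",
}


def normalize_predicates(triples, predicate_map):
    """Step 1: rewrite each predicate into the shared vocabulary when a mapping exists."""
    return [(s, predicate_map.get(p, p), o) for s, p, o in triples]


def resolve_identities(triples, same_as):
    """Step 2: collapse ids joined by 'same as' links onto one canonical id (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in same_as:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Pick the lexicographically smaller id as canonical so the result is stable.
            parent[max(ra, rb)] = min(ra, rb)

    return [(find(s), p, find(o)) for s, p, o in triples]


def pages_worth_creating(triples, title_predicate="label"):
    """Step 3: keep only entities that have at least one property besides a title."""
    props = defaultdict(set)
    for s, p, o in triples:
        props[s].add((p, o))
    return {
        s: sorted(v)
        for s, v in props.items()
        if any(p != title_predicate for p, _ in v)
    }


# Made-up sample data: two sources describing the same gene, plus one gene with only a title.
triples = [
    ("aibs:GENE_A", "aibs:gene_symbol", "GENE_A"),
    ("drugbank:target/1", "drugbank:geneName", "GENE_A"),
    ("drugbank:target/1", "drugbank:targetedBy", "drugbank:DRUG_X"),
    ("aibs:GENE_B", "aibs:gene_symbol", "GENE_B"),
]
same_as = [("aibs:GENE_A", "drugbank:target/1")]

normalized = normalize_predicates(triples, PREDICATE_MAP)
merged = resolve_identities(normalized, same_as)
pages = pages_worth_creating(merged)

for page, properties in pages.items():
    print(page, properties)
```

In this toy run only the gene with a cross-dataset connection survives as a page, and the title-only gene is dropped, which is the same reasoning behind creating 20 thousand populated pages instead of importing every gene.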