Text Mining: the next data frontier. Beyond Open Access
#openminted_eu
beyond Open Access
Text Mining: the next data frontier
Natalia Manola
Athena Research & Innovation Centre
OpenCon Satellite Berlin, 25 Nov 2016
A few sobering facts on content production
● 1.8 billion websites and 3.46 billion internet users as of 25 September 2016.
● 24 million wireless sensors and actuators worldwide (up 553% between 2011 and 2016)
● 16 zettabytes of useful data (16 trillion GB) by 2020
● YouTube claims that 24 hours of video are uploaded every minute, making the site a hugely significant data aggregator.
● On average, around 6,000 tweets are sent on Twitter every second, which corresponds to over 350,000 tweets per minute, more than 500 million tweets per day and around 200 billion tweets per year.
● 74,200,000 pages existed on Facebook, with 7 million apps and websites integrated with Facebook, as of 30 May 2016
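The tweet rates and the zettabyte figure above can be cross-checked with simple arithmetic. The sketch below assumes a 365-day year and decimal (SI) units, where 1 ZB is 10^21 bytes and 1 GB is 10^9 bytes; the variable names are illustrative only.

```python
# Cross-check the figures quoted on the slide
# (assumes a 365-day year and decimal SI units).
TWEETS_PER_SECOND = 6_000  # figure quoted on the slide

tweets_per_minute = TWEETS_PER_SECOND * 60
tweets_per_day = tweets_per_minute * 60 * 24
tweets_per_year = tweets_per_day * 365

# 1 zettabyte = 10**21 bytes, 1 gigabyte = 10**9 bytes (decimal units).
gigabytes_in_16_zb = 16 * 10**21 // 10**9

print(f"{tweets_per_minute:,} tweets per minute")
print(f"{tweets_per_day:,} tweets per day")
print(f"{tweets_per_year:,} tweets per year")
print(f"{gigabytes_in_16_zb:,} GB in 16 ZB")
```

This gives 360,000 tweets per minute, about 518 million per day and about 189 billion per year, in line with the rounded figures on the slide, and confirms that 16 ZB is 16 trillion GB.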
… And some facts on scientific literature
The global research community generates ~2.5 million new scholarly
articles per year (English only)
The STM report (2015)
… some 90% of papers … are never cited (82% in the humanities)
… of those articles that are cited, only 20 percent have actually been read
… 50% of papers are never read by anyone other than their authors,
referees and journal editors
Lokman I. Meho, The rise and rise of citation analysis, 2007
… one paper published every 12 seconds
… 70,000 papers published on a single protein, the tumor suppressor p53
Spangler et al., Automated Hypothesis Generation based on Mining Scientific
Literature, 2014
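The "one paper every 12 seconds" rate can be converted to a yearly total to compare it with the ~2.5 million articles per year quoted above. This is a rough check that assumes a 365-day year and a constant publication rate; the names are illustrative.

```python
# Convert "one paper every 12 seconds" into papers per year
# (assumes a 365-day year and a constant rate).
SECONDS_PER_PAPER = 12  # figure quoted on the slide
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

papers_per_year = SECONDS_PER_YEAR // SECONDS_PER_PAPER
print(f"{papers_per_year:,} papers per year")
```

The result, roughly 2.6 million papers per year, is of the same order as the ~2.5 million figure.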
How can we make sense
of this data?
Emerging solutions
Machine reading
process textual sources, organise and classify in various dimensions, extract
main (indexical) information items,
… and “understanding”
identify and extract entities and relations between entities, facilitate the
transformation of unstructured textual sources into structured data
… and predicting
enable the multidimensional analysis of structured data to extract meaningful
insights and improve the ability to predict
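As a toy illustration of the three steps above (reading text, extracting entities and relations, and analysing the structured result), the sketch below applies a hand-written regular expression to made-up sentences. The sentences, the pattern and the `Relation` type are assumptions for illustration; real text mining systems rely on trained language processing components rather than a single pattern.

```python
import re
from collections import Counter
from typing import NamedTuple


class Relation(NamedTuple):
    subject: str
    verb: str
    obj: str


# Made-up placeholder sentences standing in for unstructured text.
SENTENCES = [
    "GeneA regulates GeneB.",
    "GeneB inhibits GeneA.",
    "GeneA activates GeneC.",
]

# Hypothetical pattern: <entity> <verb> <entity>, for a small fixed verb list.
PATTERN = re.compile(r"(\w+) (regulates|inhibits|activates) (\w+)")


def extract_relations(sentences):
    """Turn free text into structured (subject, verb, object) records."""
    relations = []
    for sentence in sentences:
        match = PATTERN.search(sentence)
        if match:
            relations.append(Relation(*match.groups()))
    return relations


relations = extract_relations(SENTENCES)

# Simple analysis over the structured data: how often each entity occurs.
mentions = Counter()
for rel in relations:
    mentions[rel.subject] += 1
    mentions[rel.obj] += 1

print(relations)
print(mentions.most_common(1))
```

Once the text is in this structured form, counting, filtering and joining with other data become ordinary data analysis tasks.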
However, …
Multitude of solutions catering for different
Text Types
Newswire
Scientific Literature
Tweets/blogs
Patents
Clinical/medical records
Textbooks, monographs
Online forums
….
Languages
English
French
German
Spanish
Portuguese
Italian
Polish
….
Tasks
Translation
Information Extraction
Semantic Search
Question Answering
Sentiment Analysis
Summarization
Knowledge Discovery
….
Domains
Finance/Business
Health
Biology
Social Sciences
Humanities
….
Creating a fragmented landscape
A glimpse of the TDM landscape
Resource: FutureTDM project (www.futuretdm.eu)
What can we do?
1. Share content
• Document literature content
• Share in a meaningful way: what does Open Access really mean?
IPR and licensing
• Study IPR restrictions for reuse of sources as well as possible exceptions
• Promote clarity and standardisation of legal rights and obligations
Challenges
• Rights statement vs. Open licenses (for repositories)
• No access to full text. We live in a metadata world
• No standard protocols, formats and APIs for access and retrieval
• No capacity to handle extra traffic
Proposed solution: make TDM-enabled hubs
[Diagram labels: Literature Repositories, OA Journals, Data Repositories, Aggregators, Archives; Metadata, Full text, Data; OpenAIRE, CORE, Europe PMC, …; Guidelines, APIs; TDM, Research networks, Wikipedia/Media/Research, …]
2. Share TDM Services
• Document language processing/text mining services and workflows in a
meaningful way for domain discipline researchers
• Document language/knowledge resources, data categories taxonomies,
provenance information
Interoperable services
• Common way of presenting annotated results
• Combine services into workflows
• Combine content and language resources with services and workflows
• Combine automatic and manual/crowdsourcing annotation services
IPR and licensing
• Translate the legal & policy aspects into specifications for lawful user-to-service and service-to-service interactions
Challenges
• Bring text miners closer to researchers' problems and needs
• Semantic interoperability (not just technical)
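One way to picture "a common way of presenting annotated results" and "combining services into workflows" is a shared annotation record that every component reads and extends. The sketch below is a minimal illustration under that assumption; the `Annotation` and `Document` types, the two components and `run_workflow` are hypothetical names and do not describe the OpenMinTeD API.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Annotation:
    start: int   # character offset where the span begins
    end: int     # character offset just past the span
    label: str   # annotation type, e.g. "token"
    source: str  # provenance: which component produced it


@dataclass
class Document:
    text: str
    annotations: List[Annotation] = field(default_factory=list)


# A component is any function that takes a Document and returns it enriched.
Component = Callable[[Document], Document]


def tokenizer(doc: Document) -> Document:
    """Hypothetical component: annotate whitespace-separated tokens."""
    pos = 0
    for word in doc.text.split():
        start = doc.text.index(word, pos)
        end = start + len(word)
        doc.annotations.append(Annotation(start, end, "token", "tokenizer"))
        pos = end
    return doc


def acronym_tagger(doc: Document) -> Document:
    """Hypothetical component: mark all-uppercase tokens as acronyms."""
    for ann in [a for a in doc.annotations if a.label == "token"]:
        span = doc.text[ann.start:ann.end]
        if span.isalpha() and span.isupper() and len(span) > 1:
            doc.annotations.append(
                Annotation(ann.start, ann.end, "acronym", "acronym_tagger")
            )
    return doc


def run_workflow(doc: Document, components: List[Component]) -> Document:
    """Apply components in order; each one sees earlier annotations."""
    for component in components:
        doc = component(doc)
    return doc


doc = run_workflow(Document("TDM builds on open content"), [tokenizer, acronym_tagger])
acronyms = [doc.text[a.start:a.end] for a in doc.annotations if a.label == "acronym"]
print(acronyms)
```

Because both components agree on the same record (offsets, label, provenance), they can be chained in any order that respects their dependencies, and manual or crowdsourced annotations could be added in the same form.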
OpenMinted
Establish an open and sustainable Text and Data
Mining (TDM) platform and infrastructure where
researchers can discover, collaboratively create, share
and re-use knowledge from a wide range of text-based
scientific and scholarly-related sources.
A step from Open Access to Open Science
HIGH LEVEL ARCHITECTURE
Policies &
guidelines
Our Services
Register and Discover TDM Services and tools
Link to Content hubs
Run a TDM job and share results
Get people’s knowledge - Crowdsourced Annotation
Build your own service – Combine components into a Workflow and SHARE
Our Users
End users
• Researchers, database curators, Research Infrastructure operators
• Novice: use services to advance their science
• Advanced: combine TDM components into complex workflows
Content and service providers
- Publishers, libraries, scientific database centres, …
- TDM researchers
- SMEs
Community Driven
Scholarly Comm.: feature extraction, data citation, research analytics
Life Sciences: curation of databases and lexica in Chembolomics & neuroinformatics
Agriculture: extracting information from tables for food safety alerts
Social Sciences: data citation
From the very beginning…
Requirements, content, barriers, expected outcomes.
… to the very end
Create applications, validate and evaluate the results.
Examples of OpenAIRE TDM services
we want to share
@openaire_eu
Discover research in context
Research Trends and correlations
Text and data mining with
domain specific knowledge
Interactive visualization for
drill-down information
…
Trends in science
Correlations of funding programmes
Within a funder, or
across countries
What will it look like?
The OpenMinTeD registry
Browse TDM resources & tools/services
Register, document, share tools
Create your corpus, annotate, share
How does this all bind together?
[Diagram labels: OpenAIRE, CORE, CrossRef, …; OpenMinted REGISTRY; CLARIN, META-SHARE; OpenMinted WORKFLOWS; TDM TOOLS; Repositories, (OA) Journals, Other textual resources (e.g. medical records, PSI); Language resources, …]
How DOES open Science help?
What’s next
Participate with your ideas
• Give us your feedback on our pending guidelines and APIs
• Provide us with your TDM requirements – we have the
experts to advise you
• Register your TDM services
• Test out the system when it goes live (spring)
Watch out for
• OpenAIRE’s datathons, tenders and challenges (60K in total)
• OpenMinTeD’s tenders and challenges (240K in total)
THANK YOU!
Natalia Manola
natalia@di.uoa.gr
twitter.com/openminted_eu
facebook.com/openminted
bit.do/openmintedlinkedin
vimeo.com/openminted
bit.do/openmintedplus
