Extracting science from the archive
Royal Society 2018,
London, UK, 2018-06-12
Extracting Data from Early Scientific Diagrams
Peter Murray-Rust [1,2]
[1] University of Cambridge
[2] TheContentMine
ContentMine extracts data from modern diagrams on a high-throughput scale.
Do the same tools work for C19th scientific diagrams?
Text, data on maps, chemical formulae, plots, phylogenetics
Images from ContentMine (CC BY), Wikimedia (CC BY-SA) and ProcRoySoc (PD)
pm286@cam.ac.uk
peter@contentmine.org
Modern Diagram Mining
4500 separate images
Phylogenetic tree
supertree
A machine-compiled microbial supertree from figure-mining thousands of papers, Ross Mounce, Peter Murray-Rust, Matthew A Wills, 2017
https://riojournal.com/article/13589/
Original scanned bitmap
Extracted by Tesseract
Errors all due to untrained punctuation
Text Mining bitmap
Original: V.D. vs Temperature
Automatic Extraction: Grid, curve and points
Plot Mining
About 50% accurate
Probably hand-drawn plot
binarization, segmentation, OCR
Extraction of chemistry: lines correct, some atom corruption
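
The binarization step named on this slide can be illustrated with a minimal sketch. It assumes a greyscale image held as a list of rows of integers 0-255 and uses Otsu's global threshold, which is a common choice but is not stated on the slide to be the method ContentMine uses. The function names `otsu_threshold` and `binarize` are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that maximises between-class variance."""
    # Histogram of grey levels.
    hist = [0] * 256
    for row in pixels:
        for value in row:
            hist[value] += 1

    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_level, best_variance = 0, -1.0
    weight_dark = 0   # number of pixels at or below the candidate level
    sum_dark = 0      # sum of grey values at or below the candidate level
    for level in range(256):
        weight_dark += hist[level]
        sum_dark += level * hist[level]
        if weight_dark == 0:
            continue
        weight_light = total - weight_dark
        if weight_light == 0:
            break
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance, best_level = variance, level
    return best_level


def binarize(pixels, threshold=None):
    """Mark pixels at or below the threshold as 1 (ink), all others as 0 (paper)."""
    if threshold is None:
        threshold = otsu_threshold(pixels)
    return [[1 if value <= threshold else 0 for value in row] for row in pixels]


if __name__ == "__main__":
    # Assumed toy scan: dark ink on light paper.
    scan = [[20, 30, 220],
            [25, 230, 240]]
    print(binarize(scan))
```

A global threshold is the simplest option. Unevenly lit or stained C19th scans may need a locally adaptive threshold instead.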
[Figure labels on this slide: 3, 4, 15, 3, 4, 5, 21, 1, 1]
John Snow’s map of deaths in Broad Street, 1854
Onion-ring pixel analysis
Segmentation and area/object count
https://en.wikipedia.org/wiki/1854_Broad_Street_cholera_outbreak
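
As an illustration of "segmentation and area/object count", the sketch below labels 4-connected groups of ink pixels in a binarized image and reports the pixel area of each. It is a generic connected-component count and does not attempt the onion-ring analysis named on the slide. The function name `object_areas` and the toy grid are assumptions made for the example.

```python
from collections import deque


def object_areas(binary):
    """Return the pixel area of each 4-connected object of 1-pixels, in scan order."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    areas = []

    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill from an unvisited ink pixel.
            queue = deque([(y, x)])
            seen[y][x] = True
            area = 0
            while queue:
                cy, cx = queue.popleft()
                area += 1
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and binary[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            areas.append(area)
    return areas


if __name__ == "__main__":
    # Assumed toy bitmap: three separate marks.
    grid = [[1, 1, 0, 0],
            [0, 0, 0, 1],
            [1, 0, 0, 1]]
    areas = object_areas(grid)
    print(len(areas), "objects with areas", areas)
```

On a map such as Snow's, the object count per region would stand in for the number of marks drawn there, provided the marks do not touch each other or the street outlines.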
https://en.wikipedia.org/wiki/Tree_of_life_(biology)
Darwin’s Phylogenetic Tree
Binarized notebook
segmentation
Topological
tree
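
One possible way to move from a segmented drawing towards a topological tree is to thin the lines to a one-pixel skeleton and then classify the skeleton pixels by how many neighbours they have. The sketch below covers only the classification step and assumes the skeleton is already available as a set of (row, column) pairs. It is not presented as the method used for the slide, and `classify_skeleton` is a hypothetical name.

```python
def classify_skeleton(skeleton):
    """Split skeleton pixels into tips (one neighbour) and branch points (three or more).

    `skeleton` is a set of (row, col) tuples; neighbours are counted with
    8-connectivity, so diagonal steps count as links.
    """
    offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0)]
    tips, branches = [], []
    for (y, x) in sorted(skeleton):
        degree = sum((y + dy, x + dx) in skeleton for dy, dx in offsets)
        if degree == 1:
            tips.append((y, x))
        elif degree >= 3:
            branches.append((y, x))
    return tips, branches


if __name__ == "__main__":
    # Assumed toy skeleton: a Y shape with its fork at (2, 2).
    y_shape = {(4, 2), (3, 2), (2, 2), (1, 1), (0, 0), (1, 3), (0, 4)}
    tips, branches = classify_skeleton(y_shape)
    print("tips:", tips)
    print("branch points:", branches)
```

Tips and branch points become the nodes of the tree, and the pixel runs between them become its edges. Real skeletons often produce small clusters of adjacent branch pixels at a fork, which would need merging before the tree is built.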
https://commons.wikimedia.org/wiki/File:See_No_Evil,_Hear_No_Evil,_Speak_No_Evil.jpg
Parodied by User:petermr, CC BY-SA
3 11 13
European Copyright Made Simple
Mine no Content
Link no Content
Upload no Content
Read Julia Reda’s explanation at https://juliareda.eu/eu-copyright-reform/ and write to your MEP
Peter Murray-Rust
All ContentMine Software is Free/Open
• Contentmine.org
• http://github.com/contentmine
• http://github.com/petermr
• http://discuss.contentmine.org/t/d3-text-processing-infrastructure/486/14
• Main site: Contentmine.org
• Software: http://github.com/contentmine (production) and http://github.com/petermr (development forks)
• Discussion and open notebooks: http://discuss.contentmine.org
– http://discuss.contentmine.org/t/extracting-science-from-early-scientific-documents/613/
– http://discuss.contentmine.org/t/extracting-data-from-early-scientific-maps/614
