Extracting science from the archive

Royal Society 2018,
London, UK, 2018-06-12
Extracting Data from Early Scientific Diagrams
Peter Murray-Rust [1,2]
[1] University of Cambridge
[2] TheContentMine
ContentMine extracts data from modern diagrams on a high-throughput scale.
Do the same tools work for 19th-century scientific diagrams?
Text, data on maps, chemical formulae, plots, phylogenetics.
Images from ContentMine (CC BY), Wikimedia (CC BY-SA) and ProcRoySoc (PD)
pm286@cam.ac.uk
peter@contentmine.org

Modern Diagram Mining
4500 separate images
Phylogenetic tree
supertree
A machine-compiled microbial supertree from figure-mining thousands of papers, Ross Mounce, Peter Murray-Rust, Matthew A Wills, 2017
https://riojournal.com/article/13589/

Text Mining bitmap
Original scanned bitmap
Extracted by Tesseract
Errors all due to untrained punctuation

Plot Mining
Original: V.D. vs Temperature
Automatic Extraction: Grid, curve and points
About 50% accurate
Probably hand-drawn plot
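
Once grid lines and points have been located in pixel space, recovering data values needs a mapping from pixel coordinates to axis units. The following is a minimal sketch, assuming linear axes and two known reference ticks per axis. The function names and all calibration numbers are hypothetical illustrations, not values from the original plot or from the ContentMine code.

```python
def make_axis_mapper(pixel_a, value_a, pixel_b, value_b):
    """Return a function mapping a pixel coordinate to a data value.

    Assumes a linear axis calibrated by two reference ticks:
    pixel_a corresponds to value_a, pixel_b to value_b.
    """
    if pixel_a == pixel_b:
        raise ValueError("reference ticks must be at different pixels")
    scale = (value_b - value_a) / (pixel_b - pixel_a)

    def to_value(pixel):
        return value_a + (pixel - pixel_a) * scale

    return to_value


# Hypothetical calibration (illustrative numbers only).
to_x = make_axis_mapper(100, 0.0, 500, 200.0)  # horizontal axis
to_y = make_axis_mapper(400, 1.0, 50, 8.0)     # vertical axis; pixel rows grow downward


def pixels_to_data(points):
    """Convert extracted (x_pixel, y_pixel) points to data coordinates."""
    return [(to_x(px), to_y(py)) for px, py in points]
```

A hand-drawn plot may have uneven grid spacing, in which case a piecewise mapping between neighbouring grid lines would be a better fit than a single linear scale.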

binarization, segmentation, OCR
Extraction of chemistry: lines correct, some atom corruption
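
Binarization is the first step named on this slide. The sketch below uses Otsu's method on a grayscale image held as a list of rows of 0-255 integers. Otsu's method is swapped in here purely for illustration; the slides do not say which thresholding ContentMine uses.

```python
def otsu_threshold(gray):
    """Return the threshold t that maximises between-class variance.

    Pixels with value <= t form the dark class, pixels > t the light class.
    """
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # pixel count in the dark class
    sum0 = 0  # intensity sum in the dark class
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def binarize(gray, threshold=None):
    """Return a grid of 1 (ink, dark) and 0 (paper, light)."""
    if threshold is None:
        threshold = otsu_threshold(gray)
    return [[1 if v <= threshold else 0 for v in row] for row in gray]
```

Aged paper with uneven staining often defeats a single global threshold, so a locally adaptive threshold may be needed for archive scans.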

John Snow’s map of deaths in Broad Street, 1854
Onion-ring pixel analysis
Segmentation and area/object count
https://en.wikipedia.org/wiki/1854_Broad_Street_cholera_outbreak
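
Segmentation followed by an area/object count can be sketched as connected-component labelling on a binarized image. This is a minimal illustration assuming 8-connectivity and a grid of 0/1 values. It is not the ContentMine implementation, and the onion-ring analysis named on the slide is not reproduced here.

```python
from collections import deque


def connected_components(binary):
    """Return a list of components, each a list of (row, col) ink pixels.

    Uses breadth-first flood fill with 8-connectivity.
    """
    rows = len(binary)
    cols = len(binary[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if not binary[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def object_areas(binary, min_area=1):
    """Return the pixel area of each object, dropping specks below min_area."""
    return [len(p) for p in connected_components(binary) if len(p) >= min_area]
```

On a map, marks of a known unit area (for example, one bar per death) could then be counted by dividing each object's area by that unit, though touching marks would need to be split first.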

https://en.wikipedia.org/wiki/Tree_of_life_(biology)
Darwin’s Phylogenetic Tree
Binarized notebook
segmentation
Topological tree
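
Going from a segmented drawing to a topological tree requires finding the tips and branch points. The following is a rough sketch, assuming the drawing has already been thinned to a one-pixel-wide skeleton stored as a 0/1 grid. Classifying pixels by their 8-neighbour count is a simplification chosen for illustration (it can over-report junctions on real skeletons) and is not claimed to be the method used for the slide.

```python
def classify_skeleton(skeleton):
    """Classify skeleton pixels by the number of 8-connected ink neighbours.

    Returns (tips, junctions): tips have exactly one neighbour,
    junctions have three or more.
    """
    rows = len(skeleton)
    cols = len(skeleton[0]) if rows else 0
    tips, junctions = [], []
    for r in range(rows):
        for c in range(cols):
            if not skeleton[r][c]:
                continue
            n = sum(
                1
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
                if (dy or dx)
                and 0 <= r + dy < rows
                and 0 <= c + dx < cols
                and skeleton[r + dy][c + dx]
            )
            if n == 1:
                tips.append((r, c))
            elif n >= 3:
                junctions.append((r, c))
    return tips, junctions


# A small Y-shaped skeleton: two upper branches joining a single stem.
y_shape = [
    [1, 0, 0, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]
tips, junctions = classify_skeleton(y_shape)
```

Tracing the pixel paths between tips and junctions would then give the edges of the tree.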

Image: https://commons.wikimedia.org/wiki/File:See_No_Evil,_Hear_No_Evil,_Speak_No_Evil.jpg, parodied by User:petermr, CC BY-SA
European Copyright Made Simple
3: Mine no Content
11: Link no Content
13: Upload no Content
See Julia Reda’s explanation at https://juliareda.eu/eu-copyright-reform/ and write to your MEP
Peter Murray-Rust

All ContentMine Software is Free/Open
• Main site: Contentmine.org
• Software: http://github.com/contentmine (production) and http://github.com/petermr (development forks)
• Discussion and open notebooks: http://discuss.contentmine.org
– http://discuss.contentmine.org/t/d3-text-processing-infrastructure/486/14
– http://discuss.contentmine.org/t/extracting-science-from-early-scientific-documents/613/
– http://discuss.contentmine.org/t/extracting-data-from-early-scientific-maps/614
