Data curation
Data curation is the organization and integration of data collected from various sources. It involves
annotation, publication and presentation of the data such that the value of the data is maintained over time,
and the data remains available for reuse and preservation. Data curation includes "all the processes needed
for principled and controlled data creation, maintenance, and management, together with the capacity to
add value to data".[1] In science, data curation may refer to the process in which experts extract important
information from scientific texts, such as research articles, and convert it into an electronic
format, such as an entry in a biological database.[2]
In the modern era of big data, the curation of data has become more prominent, particularly for software
that processes high-volume, complex data systems.[3] The term is also used in historical contexts and the
humanities,[4] where the growing body of cultural and scholarly data from digital humanities projects requires the
expertise and analytical practices of data curation.[5] In broad terms, curation means a range of activities
and processes done to create, manage, maintain, and validate a component.[6] Specifically, data curation is
the attempt to determine what information is worth saving and for how long.[7]
History and practice
The user, rather than the database itself, typically initiates data curation and maintains metadata.[8]
According to the University of Illinois' Graduate School of Library and Information Science, "Data
curation is the active and on-going management of data through its lifecycle of interest and usefulness to
scholarship, science, and education; curation activities enable data discovery and retrieval, maintain quality,
add value, and provide for re-use over time."[9] The data curation workflow is distinct from data quality
management, data protection, lifecycle management, and data movement.[8]
Census data has been available in tabulated punch card form since the early 20th century and has been
electronic since the 1960s.[10] The Inter-university Consortium for Political and Social Research (ICPSR)
website marks 1962 as the date of their first Survey Data Archive.[11]
Deep background on data libraries appeared in a 1982 issue of the Illinois journal, Library Trends.[12] For
historical background on the data archive movement, see "Social Scientific Information Needs for Numeric
Data: The Evolution of the International Data Archive Infrastructure."[13] The exact curation process
undertaken within any organisation depends on the volume of data, how much noise the data contains, and
how the expected future use of the data affects its dissemination.[3]
The crises in space data led to the 1999 creation of the Open Archival Information System (OAIS)
model,[14] stewarded by the Consultative Committee for Space Data Systems (CCSDS), which was formed
in 1982.[15]
The term data curation is sometimes used in the context of biological databases, where specific biological
information is first obtained from a range of research articles and then stored within a specific category of
a database. For instance, information about antidepressant drugs can be obtained from various sources and,
after checking whether it is already available in a database, saved under the antidepressant category of a
drug database. Enterprises also use data curation within their operational and strategic
processes to ensure data quality and accuracy.[16][17]
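The biological-database example above can be illustrated with a minimal sketch. It is not taken from any real curation system: the class names Finding and CuratedDatabase, the placeholder drug names, and the article identifiers are all hypothetical. The sketch assumes that "checking availability" means a case-insensitive lookup by drug name within a category, and it records every source article so that provenance is kept.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Finding:
    """One fact extracted from a research article (hypothetical shape)."""
    drug: str
    category: str
    source: str  # identifier of the article the fact came from


@dataclass
class CuratedDatabase:
    """Toy in-memory store: category -> drug name (lowercased) -> set of sources."""
    entries: dict = field(default_factory=dict)

    def contains(self, drug: str, category: str) -> bool:
        # Assumed matching rule: drug names are compared case-insensitively.
        return drug.lower() in self.entries.get(category, {})

    def curate(self, finding: Finding) -> bool:
        """Store a finding; return True only if the drug is new to its category."""
        bucket = self.entries.setdefault(finding.category, {})
        key = finding.drug.lower()
        is_new = key not in bucket
        # Always record the source, so an existing entry gains provenance.
        bucket.setdefault(key, set()).add(finding.source)
        return is_new


db = CuratedDatabase()
findings = [
    Finding("Drug A", "antidepressant", "article-1"),
    Finding("drug a", "antidepressant", "article-2"),  # same drug, new source
    Finding("Drug B", "antidepressant", "article-2"),
]
added = [db.curate(f) for f in findings]
```

In this sketch a repeated drug is not stored twice; instead its entry accumulates the additional source. A real biological database would likely need richer identifiers than a lowercased name, which is outside the scope of this illustration.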
Projects and studies
The Dissemination Information Packages (DIPS) for Information Reuse (DIPIR) project is studying
research data produced and used by quantitative social scientists, archaeologists, and zoologists. The
intended audience is researchers who use secondary data and the digital curators, digital repository
managers, data center staff, and others who collect, manage, and store digital information.[18]
The Protein Data Bank was established in 1971 at Brookhaven National Laboratory, and has grown into a
global project.[19] A database for three-dimensional structural data of proteins and other large biological
molecules, the PDB contains over 120,000 structures, all standardized, validated against experimental data,
and annotated.
FlyBase, the primary repository of genetic and molecular data for the insect family Drosophilidae, dates
back to 1992. FlyBase annotates the entire Drosophila melanogaster genome.[20]
The Linguistic Data Consortium is a data repository for linguistic data, dating back to 1992.[21]
The Sloan Digital Sky Survey began surveying the night sky in 2000.[22] Computer scientist Jim Gray,
while working on the data architecture of the SDSS, championed the idea of data curation in the
sciences.[23]
DataNet was a research program of the U.S. National Science Foundation Office of Cyberinfrastructure,
funding data management projects in the sciences.[24] DataONE (Data Observation Network for Earth) is
one of the projects funded through DataNet, helping the environmental science community preserve and
share data.[25]
See also
    Biocurator
    Data archaeology
    Data degradation
    Data format management
    Data preservation
    Data stewardship
    Data wrangling
    Digital curation – the curation of published documents, rather than raw data[7]
    Digital preservation
    Informationist – an individual with extensive expertise in data curation
References
 1. Renée J. Miller, “Big Data Curation” (http://comad.in/comad2014/Proceedings/Keynote2.pdf)
    in 20th International Conference on Management of Data (COMAD) 2014, Hyderabad, India,
    December 17–19, 2014
 2. BioCreative Glossary (https://biocreative.sourceforge.net/biocreative_glossary.html).
    Retrieved on 3 October 2016.
 3. Furht, Borko; Armando Escalante (2011). Handbook of Data Intensive Computing
    (https://play.google.com/store/books/details?id=gsk6XpZgGYwC). Springer Science & Business
    Media. p. 32. ISBN 9781461414155. Retrieved 2 October 2016.
 4. Sabharwal, Arjun (2015). Digital Curation in the Digital Humanities: Preserving and
    Promoting Archival and Special Collections
    (https://play.google.com/store/books/details?id=GpiKBAAAQBAJ). Chandos Publishing. p. 60.
    ISBN 9780081001783. Retrieved 2 October 2016.
 5. "An Introduction to Humanities Data Curation" by Julia Flanders and Trevor Muñoz
    http://guide.dhcuration.org/intro/. Not available any more: archive.org
    (https://web.archive.org/web/20170925151437/http://guide.dhcuration.org/contents/intro/)
 6. Pilin Glossary (http://www.pilin.net.au/Project_Documents/Glossary.htm). Not available any
    more: archive.org
    (https://web.archive.org/web/20120330202110/http://www.pilin.net.au/Project_Documents/Glossary.htm)
 7. Borgman, C (2015). Big data, little data, no data: Scholarship in the networked world
    (https://archive.org/details/bigdatalittledat0000borg/page/13). Cambridge, Massachusetts: MIT
    Press. p. 13 (https://archive.org/details/bigdatalittledat0000borg/page/13).
    ISBN 978-0-262-02856-1.
 8. Chessell, Mandy; Nigel L Jones; Jay Limburn; David Radley; Kevin Shank (2015).
    Designing and Operating a Data Reservoir
    (https://play.google.com/store/books/details?id=-BWrCQAAQBAJ). IBM Redbooks. pp. 111–113.
    ISBN 9780837440668. Retrieved 2 October 2016.
 9. Cragin, Melissa; Heidorn, P. Bryan; Palmer, Carole L.; Smith, Linda C. (2007). "An
    Educational Program on Data Curation" (https://www.ideals.illinois.edu/handle/2142/3493).
    ALA Science & Technology Section Conference. Retrieved 7 October 2013.
10. "Preserving Digital Information (PDI) report"
    (https://www.clir.org/wp-content/uploads/sites/6/2016/09/pub63watersgarrett.pdf) (PDF). 1996.
    Retrieved 2018-03-13.
11. "ICPSR: History" (https://www.icpsr.umich.edu/icpsrweb/content/about/history/).
    www.icpsr.umich.edu. Retrieved 2018-03-15.
12. Heim, Kathleen M. (November 29, 1982). "Library Trends 30 (3) Winter 1982: Data Libraries
    for the Social Sciences" (https://www.ideals.illinois.edu/handle/2142/7218). Library Trends –
    via www.ideals.illinois.edu.
13. Kathleen M. Heim, "Social Scientific Information Needs for Numeric Data: The Evolution of
    the International Data Archive Infrastructure." in Collection Management 9 (Spring 1987):
    1-53.
14. "The OAIS reference model"
    (https://www.oclc.org/research/publications/library/2000/lavoie-oais.html). 2015-12-09.
    Retrieved 2018-03-15.
15. "CCSDS.org - The Consultative Committee for Space Data Systems (CCSDS)"
    (https://public.ccsds.org/default.aspx). public.ccsds.org. Retrieved 2018-03-14.
16. E. Curry, A. Freitas, and S. O’Riáin, “The Role of Community-Driven Data Curation for
    Enterprises,” (http://3roundstones.com/led_book/led-curry-et-al.html) Archived
    (https://web.archive.org/web/20120123161104/http://3roundstones.com/led_book/led-curry-et-al.html)
    2012-01-23 at the Wayback Machine in Linking Enterprise Data, D. Wood, Ed. Boston, MA:
    Springer US, 2010, pp. 25-47. ISBN 978-1-4419-7664-2
17. A. Freitas, E. Curry, “Big Data Curation,”
    (https://www.insight-centre.org/sites/default/files/publications/newhorizons_online.pdf) Archived
    (https://web.archive.org/web/20160913163450/https://www.insight-centre.org/sites/default/files/publications/newhorizons_online.pdf)
    2016-09-13 at the Wayback Machine in New Horizons for a Data-Driven Economy, Springer
    (Open Access), 2015.
18. Dissemination Information Packages for Information Reuse (DIPIR) project
    http://www.oclc.org/research/themes/user-studies/dipir.html
19. "RCSB PDB: About the PDB Archive and the RCSB PDB" (https://www.rcsb.org/pages/aboutus).
    About the PDB Archive and the RCSB PDB. Retrieved 15 March 2018.
20. Gramates, LS; Marygold, SJ; dos Santos, G; Urbano, J-M; Antonazzo, G; Matthews, BB; Rey,
    AJ; Tabone, CJ; Crosby, MA; Emmert, DB; Falls, K; Goodman, JL; Hu, Y; Ponting, L;
    Schroeder, AJ; Strelets, VB; Thurmond, J; Zhou, P; FlyBase Consortium (2017). "FlyBase at
    25: looking to the future" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5210523). Nucleic
    Acids Res. 45 (D1): D663–D671. doi:10.1093/nar/gkw1016
    (https://doi.org/10.1093%2Fnar%2Fgkw1016). PMC 5210523
    (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5210523).
    PMID 27799470 (https://pubmed.ncbi.nlm.nih.gov/27799470).
21. "About LDC" (https://www.ldc.upenn.edu/about). Linguistic Data Consortium. Retrieved
    15 March 2018.
22. "Sloan Digital Sky Survey" (http://www.sdss.org/). SDSS. Retrieved 15 March 2018.
23. Palmer, Carole L.; Weber, Nicholas M.; Muñoz, Trevor; Renear, Allen H. (June 2013).
    "Foundations of Data Curation: The Pedagogy and Practice of "Purposeful Work" with
    Research Data". Archive Journal. 3. hdl:2142/78099 (https://hdl.handle.net/2142%2F78099).
24. "Sustainable Digital Data Preservation and Access Network Partners (DataNet) Program
    Summary" (https://www.nsf.gov/funding/pgm_summ.jsp?pims_id=503141). National Science
    Foundation. September 28, 2007. Retrieved March 15, 2018.
25. "What is DataONE?" (https://www.dataone.org/what-dataone). What is DataONE?.
    Retrieved 15 March 2018.
External links
    Curation of ecological and environmental data: DataONE (http://www.dataone.org/)
    Data management tools and services spanning multiple scientific disciplines:
    DataConservancy (http://www.dataconservancy.org/)