While computing technologies are young by comparison with other efforts of
human ingenuity, their origins go back to many early chapters in the human
quest for the understanding and realization of mechanical aids to knowledge.
Giuseppe Primiero (2020, 7)
Computing is symbol processing. Any automaton capable of processing symbol
structures is a computer . . . We may choose to call such symbol structures
information, data, or knowledge, depending on the particular ‘culture’ within
computer science to which we belong.
Subrata Dasgupta (2016, 121)
Data science is the process by which the power of data is realised, it is how we
find actionable insights from among the swathes of data that are available.
David Stuart (2020, xvi)
Introduction
Technology, from the Greek techné, meaning art, skill or craft, is usually
taken to mean the understanding of how to use tools, in the broadest sense
of the word. The term information technology was first used in the 1950s to
describe the application of mechanised documentation and then-new digital
computers, and became widely used in the 1980s to describe the widespread use of digital
technology (Zorkoczy, 1982).
Information technology is usually associated with computers and
networks. But, in a wider sense stemming from the original meaning of the
word, the technologies of information include all the tools and machines
which have been used to assist the creation and dissemination of information
throughout history, as discussed in Chapter 2; from ink and paper, through
printing to microforms, document reproduction and photocopying, and
mechanised documentation technologies such as card indexes, punched cards,
edge-notched cards, and optical coincidence cards. Krajewski (2011) examines
the idea of card index files as a ‘universal paper machine’, in a sense the
forerunner of the computer, between the 16th and 20th centuries.
Our focus here is on digital technologies, echoing the view expressed by
Gilster (1997) that all information today is digital, has been digital or may be
digital. This chapter covers the basics of digital technologies and the
handling of data; the following chapter deals with information systems.
Digital technologies
We will describe these aspects only in outline; see Ince (2011) and Dasgupta
(2016) for more detailed but accessible accounts, and Primiero (2020) for a
more advanced summary; for accounts of the historical development of the
computer, see Ceruzzi (2012) and Haigh and Ceruzzi (2021).
Any digital device represents data in the form of binary digits or bits.
Patterns of bits, bit strings with bits conventionally represented by 0 and 1,
may represent data or instructions. A collection of eight bits is known as a
byte. Quantities of data are represented as multiples of bytes, for example: a
kilobyte, defined as 2^10 bytes, i.e. 1024 bytes.
Any character or symbol may be represented by a bit string, but this
requires an agreed coding. The most widely used code since the beginning
of the computer age was the American National Standards Institute’s ASCII
(American Standard Code for Information Interchange) 7-bit code, but it is
limited to the Latin alphabet, Arabic numerals and a few other symbols. It
has been supplanted by Unicode, which can handle a wider variety of
symbols and scripts. It can do this by using codes of up to 32 bits per character; the more bits in the
code, the more different symbols can be coded. Codes provide arbitrary
representations of characters; for example, the ASCII code for the letter L is
the bit string 1001100.
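The relationship between bits, bytes and character codes can be checked directly. The following is a minimal sketch using only built-in Python functions; the sample characters are arbitrary, and UTF-8 and UTF-32 are named here as two common Unicode encodings, which the text above does not itself distinguish.

```python
# Bits, bytes and character codes, using only built-in Python functions.

KILOBYTE = 2 ** 10  # 1024 bytes, as defined in the text

# The 7-bit ASCII code for the letter L, written as a bit string.
ascii_bits = format(ord("L"), "07b")

# Characters outside ASCII need more bits; Unicode encodings provide them.
# UTF-8 uses a variable number of bytes per character; UTF-32 always uses four.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in ["L", "é", "λ", "中"]}
utf32_length = len("L".encode("utf-32-be"))  # big-endian, no byte-order mark

print(KILOBYTE)
print(ascii_bits)
print(utf8_lengths)
print(utf32_length)
```

The bit string printed for L should match the one quoted above, while the byte counts show why a longer code is needed once scripts beyond the Latin alphabet are included.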
The idea that a general-purpose computing machine might be possible
stems from the theoretical studies of the British mathematician Alan Turing
(1912–54), whose concept of the Turing machine, developed in the 1930s,
showed that a single machine limited to a few simple operations (read,
write, delete, compare) could undertake any computable task if appropriately
programmed.
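To give a feel for how little machinery is involved, the following is a minimal sketch of a one-tape Turing machine simulator. The function name, the rule-table layout and the example program (which simply inverts each bit on the tape) are our own illustrative assumptions, not Turing's notation.

```python
def run_turing_machine(tape, rules, state="start", blank="_", max_steps=1000):
    """Run a one-tape Turing machine and return the final tape as a string."""
    cells = dict(enumerate(tape))  # sparse tape: position -> symbol
    head = 0
    for _ in range(max_steps):
        if state == "halt":
            break
        symbol = cells.get(head, blank)                    # read
        new_symbol, move, state = rules[(state, symbol)]   # compare with the rule table
        cells[head] = new_symbol                           # write (delete = write a blank)
        head += 1 if move == "R" else -1                   # move one cell right or left
    return "".join(cells[i] for i in sorted(cells)).strip(blank)


# Hypothetical program: invert every bit, then halt at the first blank cell.
INVERT_RULES = {
    ("start", "0"): ("1", "R", "start"),
    ("start", "1"): ("0", "R", "start"),
    ("start", "_"): ("_", "R", "halt"),
}

print(run_turing_machine("10110", INVERT_RULES))
```

Changing the behaviour of the machine means changing only the rule table, not the simulator; this is the sense in which a single machine can be programmed for different tasks.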
The basic architecture for a practical digital computer was set out by John
von Neumann (1903–57), the Hungarian-American mathematician, in the 1940s (Figure 9.1). His
design, which gave the first formal description of a
single-memory stored-program computer, is shown in Figure 9.2.
DIGITAL TECHNOLOGIES AND DATA SYSTEMS 169
Figure 9.1 John von Neumann
(Alan Richards photographer. From the Shelby White and
Leon Levy Archives Center, Institute for Advanced Study, Princeton, NJ, USA)
Figure 9.2 Von Neumann computer architecture (Wikimedia Commons, CC BY-SA)
This architecture is general purpose, in the sense that it can run a variety
of programs. This distinguishes it from special-purpose digital computers
which carry out only one task, as in digital cameras, kitchen appliances, cars,
etc. A von Neumann machine loads and runs programs as necessary, to
accomplish very different tasks.
The heart of the computer, the processor, often referred to as the central
processing unit (CPU), carries out a set of very basic arithmetical and logical
operations with instructions and data pulled in from the memory, also
referred to as main or working memory. This sequence is referred to as the
fetch–execute cycle. Two components of the processor are sometimes
distinguished: an arithmetic and logic unit, which carries out the operations,
and a control unit, which governs the operations of the cycle.
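The fetch–execute cycle can be illustrated with a toy stored-program machine, in which instructions and data share one memory. This is a sketch under simplifying assumptions: the four-instruction set (LOAD, ADD, STORE, HALT) and the single accumulator register are invented for illustration and do not correspond to any real processor.

```python
def run(memory):
    """Execute the program held in `memory`; return the memory when the machine halts."""
    pc = 0            # program counter: address of the next instruction
    accumulator = 0   # the single working register
    while True:
        opcode, operand = memory[pc]   # fetch the instruction from memory
        pc += 1
        if opcode == "LOAD":           # execute it
            accumulator = memory[operand]
        elif opcode == "ADD":
            accumulator += memory[operand]
        elif opcode == "STORE":
            memory[operand] = accumulator
        elif opcode == "HALT":
            return memory
        else:
            raise ValueError(f"unknown opcode: {opcode}")


# Instructions and data live side by side in the same memory.
program = [
    ("LOAD", 4),    # address 0: copy the value at address 4 into the accumulator
    ("ADD", 5),     # address 1: add the value at address 5
    ("STORE", 6),   # address 2: write the result to address 6
    ("HALT", 0),    # address 3: stop
    2,              # address 4: data
    3,              # address 5: data
    0,              # address 6: the result is stored here
]

result = run(list(program))
print(result[6])
```

The `if`/`elif` chain plays the part of the arithmetic and logic unit, while the loop and program counter play the part of the control unit.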
While programs and data are being used they are kept in the memory.
While not being used they are stored long-term in file storage. Items in
memory are accessible much more rapidly than items in file storage, but
memory is more expensive, so computers have much more file storage than
memory.
Data comes into the computer through its input devices and is sent into the
outside world through the output devices. The components are linked
together through circuits usually denoted as a bus, sometimes referred to
more specifically as a data bus or address bus.
All of these components have undergone considerable change since the
first computers were designed. Processor design has gone through three
main technological stages. The first generation of computers used valves as
processor components, and the second generation used transistors.
Computers of the third generation, including all present-day computers, use
circuits on ‘silicon chips’, by very-large-scale integration (VLSI), which
allows millions of components to be placed on a single computer chip, thus
increasing processing speed.
The storage elements of the computer’s memory comprise regular arrays
of silicon-based units, each holding a bit of data, and created using the same
VLSI methods as processors. Earlier generations of computers used core
storage, with data held on tiny, magnetised elements arranged in a three-dimensional lattice.
File storage, holding data and programs that are not needed immediately,
has always used magnetic media, which can hold large volumes of data
cheaply, provided that quick access is not required. Earlier generations of
computers used a variety of tapes, drums and disks; current computers use
hard disks, rapidly spinning magnetisable disks with moveable arms with
read/write heads, to read data from, or write data to, any area of the disk. Regardless of the
technology, each area holds one bit of information,
according to its magnetic state.
Input devices fall into three categories: those which take input
interactively from the user; those which accept data from other digital
sources; and those which convert paper data into digital form. The first
category comprises the venerable ‘QWERTY’ keyboard, originally
developed for typewriters in the 1870s, together with more recently
developed devices: the mouse and other pointing devices and the touchscreen. The second category
comprises the silicon-memory data stick,
replacing the various forms of portable magnetic floppy disks used
previously, and the ports and circuits by which the computer communicates
with networked resources. The third category is that of the scanner,
digitising print and images from paper sources.
Output devices are similarly categorised; display screens, which allow
user interaction through visual and sound output; devices which output
digital data, such as data sticks, network circuits and ports; and devices
which print paper output, typically laser or inkjet printers.
The most recent development is the trend towards mixed reality and
immersive environments, which may require specific input/output devices
(Greengard, 2019; Pangilinan, Lukos and Mohan, 2019). Virtual reality (VR)
provides immersive experiences in a computer-generated world, often using
head-mounted display devices, such as Oculus Rift. Augmented reality (AR)
overlays computer-generated information, often visual but sometimes
multisensory, on the real world. Mixed reality is the combination of real and
virtual environments, a spectrum with AR at one end and VR at the other.
These are of increasing relevance to the provision of information services
(Robinson, 2015; Varnum, 2019; Dahya et al., 2021; Kroski, 2021).
Networks
Virtually all computers are connected via some form of network to others, to
enable communication and information access and sharing. Since the 1990s
the internet and the World Wide Web have become ubiquitous, such that it
has become difficult to think about digital technologies without these at
centre stage.
The growth of networked computing has been driven by three factors:
communications technology, software and standards. In terms of
technology, older forms of wired network, originating in the twisted copper-pair cables used for
telegraph systems from the mid-19th century, have been
succeeded by fibre-optic cables and various forms of wireless transmission. These allow much faster
transmission speeds and greater information-carrying capacity; such systems are described loosely as
broadband. Software
systems have improved the efficiency and reliability of transmission greatly:
an important example is packet switching, by which messages may be split up
and their constituents sent by the fastest available route, being recombined
before delivery.
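The principle of packet switching can be sketched in a few lines. This is an illustration only: real protocols add addressing, error checking and retransmission, and the function names and packet size here are our own assumptions.

```python
import random


def split_into_packets(message, size):
    """Split a message into numbered packets of at most `size` characters."""
    return [(seq, message[i:i + size])
            for seq, i in enumerate(range(0, len(message), size))]


def reassemble(packets):
    """Recombine packets in sequence order, whatever order they arrived in."""
    return "".join(payload for _, payload in sorted(packets))


message = "Information management: a proposal"
packets = split_into_packets(message, size=8)

# Simulate packets taking different routes and arriving out of order.
arrived = packets[:]
random.Random(42).shuffle(arrived)

print(reassemble(arrived) == message)
```

The sequence number carried by each packet is what allows the receiver to restore the original order.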
Standards are fundamentally important, so that different forms of
network, and different kinds of computers connected to them, can
communicate effectively. The internet, originating in 1960s networks built
for defence research in the USA, is a worldwide network of networks,
integrated by common standards: the internet control protocols, commonly
known as TCP/IP (Transmission Control Protocol/Internet Protocol).
The World Wide Web, often spoken of as if it were synonymous with the
internet, is in fact a major internet application. A system for allowing access to
interlinked hypertext documents stored on networked computers, it
originated in the work of Sir Tim Berners-Lee at CERN, the European nuclear
research establishment. Berners-Lee introduced the idea in 1989 in an internal
memorandum with the modest title ‘Information management: a proposal’.
The web is based on client–server architecture: a web browser, the client, on
the user’s computer accesses the website on the remote server computer,
through the internet, and downloads the required web page. This relies on a
number of standards. For example, web pages must be created using a mark-up language, typically
HTML, sometimes referred to as the lingua franca of
the internet, and be identified by a Uniform Resource Identifier, commonly
referred to as a Uniform Resource Locator or URL; a Hypertext Transfer
Protocol (HTTP) enables communication between client and server.
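The way these standards fit together can be seen by taking a URL apart and composing the request a browser would send. The following is a minimal sketch using Python's standard `urllib.parse` module; the address is hypothetical, and the request shown is a bare-bones HTTP/1.1 GET rather than everything a real browser transmits.

```python
from urllib.parse import urlsplit


def build_get_request(url):
    """Return the text of a minimal HTTP/1.1 GET request for `url`."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {parts.netloc}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )


url = "https://www.example.org/catalogue/search?author=Turing"  # hypothetical address
parts = urlsplit(url)

# The URL names the protocol, the server and the resource wanted from it.
print(parts.scheme, parts.netloc, parts.path, parts.query)
print(build_get_request(url))
```

The server's reply would normally carry the HTML of the requested page, which the browser then renders.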
The Internet of Things (Greengard, 2021) is a generic term for the
aggregation of network-enabled ‘smart’ devices of all kinds, able to collect
and share data.
The internet and web, despite their great reach and influence, are by no
means the only significant computer networks. Many private, local and
regional networks exist, generally using the standards of the internet and
web for compatibility.
Software
Software, in its various forms, provides the instructions for digital
computers to carry out their tasks and can be categorised into systems
software (firmware and operating systems) and application software; for short
overviews, see Ince (2011), White and Downs (2015) and Dasgupta (2016). Firmware, such as BIOS
(Basic Input Output System), EFI (Extensible
Firmware Interface) and UEFI (Unified Extensible Firmware Interface),
comes pre-installed on all computers and is activated when the computer is
powered up, to initialise the hardware and start the operating system.
The operating system, specific to a particular type of computer, controls the
hardware, allocates system resources, such as memory, and runs
applications. Users interact with the operating system to manage files, use
input and output devices, and run applications. Examples of operating
systems are Windows, Unix and several variants of Unix, including Linux
and Mac OS.
Users interact with the operating system via a program termed a shell
because it forms an outer layer around an operating system; examples of
shells are Bash, zsh and PowerShell. Most users will rely on a graphical user
interface to the shell for convenience. An alternative form of interface uses a
command line, with the user typing instructions (such interfaces are
commonly referred to as shells, although strictly the shell is the program to
which they are the interface). This type of interface offers a more direct
control over computer operations than with a graphical interface, but
requires the user to understand the set of commands and the syntax, and to
have some understanding of the workings of the operating system.
Applications software, discrete programs for carrying out specific tasks,
may either be installed on individual computers, as, for
example, word processors, spreadsheets, and web browsers typically are, or
may be accessed on the web. The latter is usual for search engines, library
management systems, and social media platforms. From the users’
perspectives, application programs cause the computer to carry out tasks
defined at a fairly high level: search the bibliographic database for authors
with this name; calculate the mean of this column of figures; insert this
image into this blog post; and so on.
All software must be written in a programming language, though this will
be invisible to the user, who will have no reason to know what language any
particular application is written in. At the risk of over-simplification, we can
say there are four kinds of software language: low-level and high-level
programming languages, scripting languages and mark-up languages.
Low-level programming languages, also referred to as assembly languages
or machine code, encode instructions at the very detailed level of processor
operations; these languages are therefore necessarily specific to a particular
type of computer. This kind of programming is complex and difficult, but
results in very efficient operation; it is reserved for situations where reliably
fast processing is essential.
High-level programming languages express instructions in terms closer to
user intentions, and are converted by other software systems, compilers and
interpreters into processor instructions in machine code for a particular
hardware. Programs are therefore much easier to write, understand and
modify than those in machine code. Such languages can be used on different
types of computer. There have been many such languages, some general
purpose and some aimed at a particular kind of application. The first
examples were developed in the 1950s, and two of the earliest are still in use:
FORTRAN, still a language of choice for scientific and engineering
applications, and COBOL, still used in many legacy systems for business
applications. Currently popular languages are Java, C++, R and Python.
Compilers convert a whole program into machine code, whereas
interpreters convert one line at a time. Programming languages are described
as either compiled, for example C++, or interpreted, for example Python.
Scripting languages are a particularly significant form of high-level
language, designed to support interaction, particularly with web resources;
for example, to update a web page in response to a user’s input rather than
reloading the page each time, or to create a mashup, an integration of data
from several web sources. Examples are JavaScript, PHP and Perl.
Mark-up languages are somewhat different in nature, as they are designed
to annotate text so as to denote either structural elements (‘this is an author
name’), presentation (‘print this in italic’) or both. The first such languages
were designed for formatting documents for printing; an example is LaTeX.
More recent examples control the format and structure of web resources;
examples are HTML and XML.
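Structural mark-up of the ‘this is an author name’ kind is what allows a program to pick out elements of a document reliably. The following sketch parses a small XML record with Python's standard `xml.etree.ElementTree` module; the record and its element names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# A hypothetical record: the tags say what each piece of text is, not how it looks.
record = """
<book>
  <author><name>A. N. Author</name></author>
  <title>An Example Title</title>
  <year>2020</year>
</book>
"""

root = ET.fromstring(record)
author = root.findtext("author/name")   # follow the structure to the author's name
title = root.findtext("title")
year = int(root.findtext("year"))

print(author, title, year)
```

Nothing in the record says how the title should be displayed; presentation would be supplied separately, for example by a stylesheet.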
Regardless of the language used, software is created in essentially the
same way. The abstract process to be followed, the algorithm, is determined
and its steps are converted into statements in an appropriate language; this
is the process of coding or programming. For small-scale software
development, perhaps for a piece of code to be used temporarily or just by
one person, writing the code and testing that it works correctly is sufficient.
For a major piece of software, other stages will include design, testing and
maintenance; these stages, together with the coding stages, make up system
development or software engineering. For accessible introductions to these
topics, see Louridas (2020) and Montfort (2021).
One influence on the way software is created is the availability of libraries
of subroutines, packages of code in a particular language to undertake
specific tasks which can be included in programs by a single call. If a
subroutine library (also called a code library) is available for a particular topic,
it means that a coder can create a program very quickly and economically by a series of calls to
subroutines, without having to write the code in detail; see
Wintjen (2020) for detailed examples. There are an increasing number of
subroutine libraries available for topics of relevance to the information
disciplines. For example, the Python language has libraries of subroutines
for text processing, for extracting data from URLs, for extracting data from
mark-up languages such as HTML and XML and for processing data from
Wikipedia.
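As a small illustration of how much work a library call can save, the following sketch uses the `html.parser` module from Python's standard library to collect the links in a fragment of HTML. The class name and the sample page are our own; third-party libraries, not shown here, offer more convenient interfaces for the same job.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The library does the parsing; we only say what to do with each tag.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Hypothetical page fragment.
page = '<p>See <a href="https://example.org/a">A</a> and <a href="/b">B</a>.</p>'

collector = LinkCollector()
collector.feed(page)
print(collector.links)
```

All of the detailed work of recognising tags and attributes is done inside the library; the coder writes only the few lines specific to the task.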
An innovation in the way software is provided has come from the open
source movement. Commercial software is usually provided as a ‘black box’;
the user has no access to the program code and therefore cannot modify the
system at all, nor even know exactly how it works. With open source software,
the user is given the full source code – the original programs – which they are
free to modify. This makes it easy to customise software to meet local needs
and preferences. It also allows users to collaborate in extending and
improving the systems, correcting errors, etc. Most such software is free,
leading it to be known as FOSS (Free and Open Source Software). Well-known
examples of full FOSS systems relevant to library and information
management are the Koha and FOLIO-ERM library management systems, the
EPrints repository system, the Drupal content management system and the
Open Journal Systems (OJS) e-journal publishing system; see, for example,
Breeding (2017) and Choo and Pruett (2019). Many smaller-scale FOSS
software products are available; the GitHub repository is the most commonly
used means for making these available (Beer, 2017). GitHub is commonly
integrated with the Jupyter notebook environment, which combines simple
word processing with the ability to create and run programs; this integration
allows for testing, sharing and distribution of open source software and is
widely used for data-wrangling functionality, including library/information
applications (Wintjen, 2020).
There are a number of reasons why information professionals may wish to
do their own coding, particularly now that relevant code libraries make it so
much quicker and easier to do so. The main reasons would be to customise
a procedure to a specific situation, to merge data from different sources, to
create tidy, consistent data or to link systems and data together in a
particular context.
Artificial intelligence
Artificial intelligence (AI) encompasses a complex array of concepts and
processes, directed to making it possible for digital computers to do the kind
of things that minds can do; see Boden (2016), Cantwell Smith (2019) and Mitchell (2019) for
accessible overviews, Floridi (2019) for ideas of its future
and Mitchell (2021) for caveats about how much we can expect from AI,
especially with regard to endowing machines with common knowledge and
common sense. It is ‘a growing resource of interactive, autonomous, self-learning agency, which
enables computational artefacts to perform tasks
that otherwise would require human intelligence to be carried out
successfully’ (Taddeo and Floridi, 2018, 751).
Although AI as such has been possible only since the middle of the 20th
century, it has been preceded by several centuries of interest in mechanical
automata, game-playing machines, calculating machines, automated
reasoning and the like; see Pickover (2019) for an informal history. Although
Ada Lovelace mused on a ‘calculus of the nervous system’, and the
possibility of mechanical devices composing music, AI became a prospect
only with the development of the digital computer.
Alan Turing’s 1950 article ‘Computing Machinery and Intelligence’ was
the first serious philosophical investigation of the question ‘can machines
think?’ In order to avoid the problem that a machine might be doing what
might be regarded as thinking but in a very different way from a human,
Turing proposed that the question be replaced by the question as to whether
machines could pass what is now known as the Turing Test. Turing proposed
an ‘Imitation Game’, in which a judge would interact remotely with a human
and a computer by typed questions and answers, and would decide which
was the computer; if the judge could not reliably decide which was the
computer, then the machine passed the test, as Turing believed that digital
computers would be able to do. Turing’s article also proposed for the first
time how a machine might learn, and noted a series of objections to the idea
of AI which are still raised today.
The term ‘artificial intelligence’ is attributed to John McCarthy (1927–
2011) and appears in a proposal for a workshop at Dartmouth College, in
which ‘the artificial intelligence problem is taken to be that of making a
machine behave in ways that would be called intelligent if a human were so
behaving’ (McCarthy et al., 1955). The workshop, held in 1956, established
the topic as a field for realistic academic study (Nilsson, 2012). A
fundamental insight was provided by another founder of AI, Marvin
Minsky (1927–2016), who argued that intelligence was not the result of a
single sophisticated agent or principle but, rather, the result of the
interaction of many simple processes (Minsky, 1987). This suggested that AI
might be achieved by the combination of relatively simple programs.
The first approaches to practical AI were centred on attempting to capture
in a logic-based program the knowledge and reasoning processes of a human
expert: a doctor diagnosing disease, or a chemist interpreting analytical data
or planning the synthesis of new compounds. This symbolic approach to AI,
now sometimes referred to as Good Old Fashioned AI (GOFAI), had some
successes, but proved to be limited by the difficulty of entering the large
amounts of knowledge, particularly general ‘common sense’ knowledge,
required for even seemingly simple tasks.
A newer generation, or ‘second wave’, of AI systems relies on programs
which enable machines to learn for themselves, given very large datasets
and massive computing power. (Big Data is not necessarily a sine qua non for
AI; Floridi (2019) notes that in some cases small datasets of high quality may
be more valuable.) For example, in 2011, IBM’s Watson AI program beat a
human champion at the game Jeopardy, which requires general knowledge
questions, including puns, jokes and cultural references, to be solved
quickly, without access to any external resources. Watson’s success was due
to massive computing power using probabilistic reasoning, plus extensive
resources, such as the whole of Wikipedia and many other reference works,
held in computer memory. In 2017 the program AlphaZero was given only
the rules of chess as a starting point, and trained its neural networks by
playing against itself many millions of times. Within hours it was at a level
to compete with the best chess programs. This ‘brute force’ approach has led
to greater success, but at the cost that it may be impossible to understand
how and why the system is working; as Turing foresaw, machines may be
‘thinking’ but they do it differently. In particular, we should not attribute
any element of ‘understanding’ to an AI because it can equal, or better,
human performance at tasks which would require a human to understand a
situation.
At the core of most modern AI systems is machine learning, a way of
programming computers to learn from experience, using large sets of data
for training (Alpaydin, 2016). This comes in three forms: supervised learning,
in which the algorithms are given large sets of labelled examples;
unsupervised learning, in which the algorithms discover patterns de novo in
sets of unlabelled data; and reinforcement learning, in which the system
receives feedback as to whether the change it made was good or bad. This
approach encompasses the neural net data structure, a complex linkage of
many simple processing units, and evolutionary or genetic algorithms, which
allow programs to develop themselves to improve their performance. So-called deep learning uses a
neural net with many layers by which machines
train themselves to perform tasks (Kelleher, 2019). This can lead to very high
performance, at the cost of creating models of such complexity that it may
be difficult to understand why the artificial agent is making its judgements.
All machine-learning systems face such problems of explanation, and also
problems of algorithmic bias if the original data is incorrect or biased in any
way.
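Supervised learning can be illustrated at a very small scale. The following sketch uses a nearest-neighbour classifier, chosen here because it is short enough to read in full; it is not a neural net, and the training data (page and illustration counts for two kinds of document) are invented for the purpose.

```python
from math import dist

# Labelled training examples: (features, label).
# Hypothetical data: (number of pages, number of illustrations) -> kind of document.
training = [
    ((4, 0), "leaflet"), ((6, 1), "leaflet"), ((8, 2), "leaflet"),
    ((200, 10), "book"), ((320, 0), "book"), ((150, 40), "book"),
]


def predict(features, examples=training):
    """Return the label of the training example closest to `features`."""
    _, label = min(examples, key=lambda example: dist(example[0], features))
    return label


print(predict((10, 3)))
print(predict((250, 5)))
```

The program is never told a rule for distinguishing the two kinds of document; its behaviour comes entirely from the labelled examples, which is also why errors or bias in those examples carry straight through to its judgements.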
Although somewhat displaced by machine learning and statistical
algorithms, research continues on knowledge representation, using data
structures such as frames or semantic nets, so as to allow artificial agents to
simulate human reasoning, applying logical deduction, application of rules,
or probabilistic reasoning. This is incorporated in expert systems, dealing
with one specific task, typically of diagnosis, classification, recommendation
or prediction, in scientific, medical, legal and financial domains. The two
approaches, symbolic and statistical, are beginning to be used together in
systems combining the power of both (Boden, 2016).
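The ‘application of rules’ at the heart of a simple expert system can be sketched as forward chaining: known facts are matched against if–then rules until nothing new can be concluded. The rules and fact names below are invented for illustration and are far simpler than those of any practical system.

```python
# Each rule: (set of conditions that must all hold, conclusion to add).
RULES = [
    ({"has_isbn"}, "is_book"),
    ({"has_issn"}, "is_serial"),
    ({"is_book", "older_than_100_years"}, "candidate_for_special_collections"),
]


def forward_chain(facts, rules=RULES):
    """Apply the rules repeatedly until no new conclusions can be drawn."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


conclusions = forward_chain({"has_isbn", "older_than_100_years"})
print(sorted(conclusions))
```

Unlike a machine-learning model, every conclusion here can be traced back to the rules that produced it; the difficulty, as noted above, lies in writing down enough rules.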
AI has raised ethical concerns, initially because of fears about ‘killer
robots’ and ‘superhuman intelligences’, more recently and realistically
because of dangers of algorithmic bias (Coeckelbergh, 2020); a full
discussion is given in Chapter 15.
Robotics, the design of machines to undertake tasks without direct human
control, is linked with AI, since the machines must be programmed to
undertake activities which would otherwise require the direction of an
intelligent person. Indeed, a robot may be understood as an AI given
autonomous agency, albeit usually of very limited scope, in the physical
world. Agents on the internet which emulate a human user are termed bots,
by analogy with robots in the physical world; those which are able to
simulate human conversation may be termed chatbots. The term ‘robot’ was
first used in English in Karel Čapek’s play R. U. R. (Rossum’s Universal
Robots) of 1921, from Czech ‘robota’, meaning ‘forced labour’. The use of
robots is well established in many areas, including libraries and archives,
typically undertaking repetitive tasks; the number of robots in use depends
on the definition used, but is certainly in the millions.
Application of AI in libraries, archives and similar institutions is, for the
most part, at an initial exploratory stage, but the possibilities are becoming
evident; see Griffey (2019) and Rolan et al. (2019). They include metadata
creation, information extraction and text and data mining.
Interfaces, interaction and information architecture
The ways in which people interact with computers, generally described
under the headings of human–computer interaction (HCI), human–information
interaction, interaction design or usability, are an area of evident importance to
the information sciences. It is particularly concerned with the design and
evaluation of interfaces to information systems of all kinds (Sharp, Preece
and Rogers, 2019).
Information architecture is the organising and structuring of complex digital
information spaces, typically on websites. It draws from principles of
information design on the printed page, of HCI and of categorisation and
taxonomy. The term was coined in the 1970s by Richard Saul Wurman, originally a
‘real’ architect, who came to believe that the principles behind the design of
buildings and physical places could be applicable to information spaces.
Information architecture came into its own from the late 1990s, with the
expansion of the web and the need to bring order and structure to web-based information. For
overviews, see Rosenfeld, Morville and Arango
(2015) and Arango (2018).
Data systems
The growth of data science, and the increased emphasis on managing and
curating data, as distinct from information, has been a major influence on
information science since 2010. For an accessible general introduction, see
Kelleher and Tierney (2018), for an overview aimed at library/information
applications, see Stuart (2020), and for a more detailed and practical
treatment, see Shah (2020) and Wintjen (2020). For an analysis of the idea of
data itself, see Hjørland (2020), and for examples of data practices in science,
including the issues discussed later in this chapter, see Leonelli and Tempini
(2020). We discuss this under two main headings: data wrangling, the
processes of acquiring, cleaning and storing data; and techniques for finding
meaning in data.
Data wrangling
The term data wrangling is sometimes used to refer to all activities in the
handling of data. We will consider them, and the digital tools which support
them, under four headings: collecting; storing; cleaning; and combining and
reusing. A useful overview from an information science perspective is given
in Stuart (2020, chapter 4).
Collecting
Collecting data is most simply achieved by typing it in from scratch. If it
already exists in printed form, with a recognisable structure, then optical
character recognition (OCR) may be used, though this is rarely error-free,
and entry must be followed by cleaning. If the data exists in some digital
form, then it may be selected and copied; a tool such as OpenRefine may be
useful in collecting and parsing data, and in converting between formats.
If the required data is to be found on one or more web pages, then a
process of web scraping can be used to obtain it. This is the automatic
extraction of information from websites, taking elements of unstructured
website data and creating structured data from it. Web scraping can be done
with code written for the purpose, or by using a browser extension such as
Crawly or Parsehub. This is easier if an Application Programming Interface
(API) can be accessed. Described as ‘an interface for software rather than
people’, an API in general is a part of a software system designed to be
accessed and manipulated by other programs rather than by human users.
Web APIs allow programs to extract structured data from a web page using
a formal syntax for queries and a standard structure for output; see Lynch,
Gibson and Han (2020) for a metadata aggregation example.
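The ‘formal syntax for queries and a standard structure for output’ can be illustrated without contacting any real service. In the following sketch the endpoint, the parameter names and the sample response are all hypothetical; a real API documents its own.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://api.example.org/records"  # hypothetical endpoint


def build_query(author, limit=10):
    """Compose a query URL from named parameters."""
    return BASE_URL + "?" + urlencode({"author": author, "limit": limit})


def extract_titles(response_text):
    """Pull (title, year) pairs out of a JSON response."""
    data = json.loads(response_text)
    return [(record["title"], record["year"]) for record in data["records"]]


# A made-up response of the kind such an API might return.
sample_response = """
{"total": 2,
 "records": [{"title": "First title", "year": 1936},
             {"title": "Second title", "year": 1950}]}
"""

print(build_query("Alan Turing"))
print(extract_titles(sample_response))
```

Because the response is already structured, no guesswork about page layout is needed, which is what makes an API easier to work with than scraping.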
Storing
Data may be stored in a wide variety of formats and systems. These include
database systems such as MySQL, Oracle and Microsoft Access, and data-analysis
packages such as SAS or SPSS. Most common is the spreadsheet,
such as Microsoft Excel and Google Sheets, widely used for data entry and
storage, with some facilities for analysis and visualisation of data. All kinds
of structured data and text are stored in a two-dimensional row–column
arrangement of cells, each cell containing one data element. Consistency is
of great importance, in the codes and names used for variables, the format
for dates, etc., and can be supported by a data dictionary, a separate file
stating the name and type of each data element (Broman and Woo, 2018).
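One way a data dictionary might support consistency is sketched below in Python. The element names, the choice of ISO 8601 dates and the functions parse_value and check_row are assumptions made for illustration, and the dictionary is held in the code rather than in a separate file only to keep the example self-contained.

```python
from datetime import date

# Hypothetical data dictionary: element name -> expected type.
DATA_DICTIONARY = {
    "item_id": int,
    "title": str,
    "date_acquired": date,   # assumed to be held as ISO 8601 text (YYYY-MM-DD)
}


def parse_value(raw, expected_type):
    """Convert a raw text cell to the type named in the data dictionary."""
    if expected_type is date:
        return date.fromisoformat(raw)
    return expected_type(raw)


def check_row(row):
    """Return a list of problems found in one row (empty if consistent)."""
    problems = []
    for name, expected_type in DATA_DICTIONARY.items():
        if name not in row:
            problems.append(f"missing element: {name}")
            continue
        try:
            parse_value(row[name], expected_type)
        except ValueError:
            problems.append(
                f"bad {expected_type.__name__} in {name}: {row[name]!r}"
            )
    for name in row:
        if name not in DATA_DICTIONARY:
            problems.append(f"unknown element: {name}")
    return problems


good = {"item_id": "17", "title": "Atlas", "date_acquired": "2020-03-15"}
bad = {"item_id": "17a", "title": "Atlas", "date_acquired": "15/03/2020"}
print(check_row(good))
print(check_row(bad))
```

The first row passes, while the second is reported for a non-numeric identifier and a date that does not follow the agreed format.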
Exchange formats are relatively simple, used to move data between more
complex formats or systems. One widely used example is the CSV (Comma
Separated Values) format. It holds data in plain text form as a series of values
separated by commas in a series of lines, the values and lines being
equivalent to the cells and rows of a spreadsheet. JSON (JavaScript Object
Notation) is a more complex XML-like format used for the same purposes; it
can represent diverse data structures and relationships, rather than CSV’s
two-dimensional flat files.
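The difference between the two exchange formats can be illustrated with a short Python sketch using the standard csv and json modules. The sample values are invented for illustration.

```python
import csv
import io
import json

# Invented sample data: three columns, two rows.
CSV_TEXT = "id,title,year\n1,Atlas,1998\n2,Gazetteer,2004\n"

# Each CSV line becomes one row; each comma-separated value becomes one cell.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# JSON can nest structures: here one record holds a list of authors,
# which a flat CSV row cannot express without extra conventions.
record = {
    "id": 1,
    "title": "Atlas",
    "year": 1998,
    "authors": [{"name": "A. Author"}, {"name": "B. Author"}],
}
json_text = json.dumps(record, indent=2)
restored = json.loads(json_text)

print(rows)
print(json_text)
```

Note that the CSV reader returns every value as text, so numbers and dates have to be converted afterwards, whereas JSON keeps the distinction between numbers and strings.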
Cleaning
Data collections are typically bedevilled by factors such as missing data,
duplicate data records, multiple values in a cell, meaningless values and
inconsistent data presentation.
Data-cleaning processes are applied to create tidy data as opposed to messy
data. For a data sheet, tidy data requires that: each item is in a row; each
variable is in a column; each value has its own cell; there are no empty cells;
and the values of all variables are consistent.
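Some of these requirements can be checked automatically. The Python sketch below reports empty cells, cells holding more than one value and duplicate rows; the function name find_problems, the sample rows and the assumption that multiple values are separated by a semicolon are all illustrative.

```python
def find_problems(rows, multi_value_separator=";"):
    """Report empty cells, multi-valued cells and duplicate rows.

    Each problem is a tuple: (row number, column name or None, description).
    """
    problems = []
    seen = set()
    for number, row in enumerate(rows, start=1):
        for name, value in row.items():
            if value is None or value.strip() == "":
                problems.append((number, name, "empty cell"))
            elif multi_value_separator in value:
                problems.append((number, name, "multiple values in cell"))
        key = tuple(sorted(row.items()))
        if key in seen:
            problems.append((number, None, "duplicate row"))
        seen.add(key)
    return problems


# Invented sample data containing one example of each problem.
sample = [
    {"title": "Atlas", "subject": "maps"},
    {"title": "Gazetteer", "subject": "maps; places"},
    {"title": "Atlas", "subject": "maps"},
    {"title": "Almanac", "subject": ""},
]
for problem in find_problems(sample):
    print(problem)
```

A check of this kind only locates problems; deciding how to correct them usually still needs human judgement.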
To achieve this, data may be scanned by eye, and errors corrected
manually. An automated process is more efficient and less error-prone.
Examples are the use of Python scripts for identification of place names in
collections of historical documents (Won, Murrieta-Flores and Martins,
2018) and for the normalisation of Latin place names in a catalogue of rare
books (Davis, 2020); see Walsh (2021) for a variety of examples. Data-handling
packages, such as SAS and SPSS, have their own data-cleaning
functions.
One specific data-cleaning tool has been much used by library/information
practitioners, as well as by scholars in the digital humanities.
OpenRefine, a tool designed for improving messy data, provides a
consistent approach to automatically identifying and correcting errors,
grouping similar items so inconsistencies can be readily spotted by eye and
identifying outliers, and hence possible errors, in numeric values. Examples
of its use include the cleaning of metadata for museum collections (van
Hooland et al., 2013) and extraction in a consistent form of proper names
from descriptive metadata (van Hooland et al., 2015).
The most precise means of algorithmic data cleaning uses Regular
Expressions (RegEx), a subset of a formal language for describing patterns in
text, which ensures consistency. They were devised originally within
theoretical computer science by the American mathematician Stephen Kleene
in the 1950s, using the mathematics of set theory as a way of formally
describing all the operations that can be performed by computers in
handling text.
In practice, regular expressions are search strings with special characters
to denote exactly what is wanted. Examples of these special characters are:
. matches any character (‘wildcard’)
[] matches any of the characters within the brackets
^ matches beginning of a line of text
These expressions are used primarily for searching, parsing, cleaning and
processing text, and also for the formal modelling of documents and
databases; see applications to inconsistencies in bibliographic records
(Monaco, 2020), errors in DOIs (Xu et al., 2019), metadata aggregation and identification of
locations in the text of
journal articles (Karl, 2019).
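The three special characters listed above can be tried out with Python's standard re module, as in the following sketch. The sample strings are invented, and the final date-normalising pattern uses further syntax (\d, {n} and numbered groups) that is not described above; it also assumes that the dates are written day/month/year.

```python
import re

# . matches any single character, so both spellings are found.
spellings = re.findall(r"organi.ation", "organisation and organization")

# [] matches any one of the characters within the brackets.
variants = re.findall(r"gr[ae]y", "grey and gray")

# ^ matches the beginning of a line (with MULTILINE, of every line).
text = "Title: Atlas\nSubtitle: no Title: here\nTitle: Gazetteer"
title_lines = re.findall(r"^Title: .*", text, flags=re.MULTILINE)

# Cleaning example: rewrite day/month/year dates in ISO 8601 form.
cleaned = re.sub(r"(\d{2})/(\d{2})/(\d{4})", r"\3-\2-\1", "Acquired 15/03/2020")

print(spellings)    # ['organisation', 'organization']
print(variants)     # ['grey', 'gray']
print(title_lines)  # ['Title: Atlas', 'Title: Gazetteer']
print(cleaned)      # Acquired 2020-03-15
```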
Combining and reusing
Data in digital form offers the possibilities of combining and integrating
different formats and media. The results are sometimes described as
mashups, usually meaning data from more than one source displayed in a
single view. Data may also be reused for different purposes. The term remix
is used for the amending or repurposing of data to create something novel;
for examples, see Engard (2014). These processes may offer a reimagining of
a dataset or collection of datasets so as to show different perspectives, and
have the potential of creating new knowledge.
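At its simplest, combining data from two sources means matching records that share an identifier. The Python sketch below merges a hypothetical catalogue with hypothetical loan counts; the identifiers, field names and the function combine are invented for illustration, and records without a match are simply dropped, which would not always be the right choice.

```python
# Two hypothetical sources describing the same items, keyed by identifier.
catalogue = {"b1": {"title": "Atlas"}, "b2": {"title": "Gazetteer"}}
loans = [("b1", 12), ("b2", 3), ("b3", 7)]   # (identifier, times borrowed)


def combine(catalogue, loans):
    """Merge loan counts into catalogue records that share an identifier."""
    combined = []
    for identifier, count in loans:
        record = catalogue.get(identifier)
        if record is None:
            continue  # no matching catalogue entry; skipped in this sketch
        combined.append({"id": identifier, **record, "loans": count})
    return combined


view = combine(catalogue, loans)
print(view)
```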
Finding meaning in data
One important strand of data handling in the information sciences is to
analyse datasets so as to find meaning in them. We will outline the methods
used in four broad and over-lapping themes: data mining and statistical
analysis; data visualisation; text visualisation; and text mining. For more
detailed treatment of this area, see Shah (2020) and Wintjen (2020).
Data mining and statistical analysis
Data mining and statistical analysis identify nuggets of value in large
volumes of unhelpful data (hence the mining analogy) and test the
significance of what is found by statistical methods. This is most commonly
carried out using specialist subroutine libraries with the Python and R
programming languages. A variety of techniques may be used, including:
association, correlation and regression; classification and clustering; analysis
of variance and model building; and time series. When dealing with large
sets of data, and typically investigating many possible correlations and
interactions, it is essential to have a thorough understanding of statistical
significance and of ways to guard against misleading and chance findings.
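The risk of chance findings when many correlations are investigated can be shown with a little arithmetic, sketched here in Python. The example assumes that the tests are independent, and it uses the Bonferroni correction, which is only one of several possible safeguards and is not one singled out in the sources cited here; the function names and p-values are invented for illustration.

```python
def chance_of_false_positive(tests, alpha=0.05):
    """Probability of at least one chance 'finding' among independent tests
    when no real effect exists."""
    return 1 - (1 - alpha) ** tests


def bonferroni_threshold(tests, alpha=0.05):
    """Stricter per-test threshold, keeping the overall error rate at or
    below alpha."""
    return alpha / tests


def significant(p_values, alpha=0.05):
    """Positions of the p-values that survive the Bonferroni correction."""
    threshold = bonferroni_threshold(len(p_values), alpha)
    return [i for i, p in enumerate(p_values) if p < threshold]


print(round(chance_of_false_positive(20), 3))    # 0.642
print(significant([0.04, 0.001, 0.2, 0.03]))     # [1]
```

With twenty independent tests at the conventional 0.05 level, there is roughly a 64% chance of at least one spurious 'significant' result. In the second example, three of the four p-values fall below 0.05, but only one remains significant once the threshold is tightened to 0.05/4.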
Data visualisation
Data visualisation is now easier than ever before, but the results are not
always pleasing or informative. There are many ways of creating visualisations,
including routines in coding libraries, functions in spreadsheet and database
software, and specialist software packages. The ease of creating a
visualisation means that the emphasis is now on representing the data in an
understandable way, free from bias and distortion. See Tufte’s classic 1990
book, Healy (2019), Kirk (2019) and Dick (2020).
One useful form of visualisation of particular value for information
science is the network diagram, displaying nodes, with their size, and links,
with their strength. There are numerous software packages available for this
purpose, such as the NodeXL extension to Microsoft Excel (Ahmed and
Lugovic, 2019), and packages such as VOSviewer and CitNetExplorer,
designed for analysis and display of citation networks in published
literature (Eck and Waltman, 2017; Williams, 2020).
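The data behind a network diagram is simply a set of nodes with sizes and a set of links with strengths. As a minimal sketch, assuming invented keyword lists for four documents, the following Python code counts how often each keyword occurs (node size) and how often each pair occurs together (link strength); it does not reproduce the methods of any of the packages named above.

```python
from collections import Counter
from itertools import combinations

# Hypothetical records: the keywords assigned to each of four documents.
documents = [
    ["metadata", "cataloguing"],
    ["metadata", "linked data"],
    ["metadata", "cataloguing", "linked data"],
    ["cataloguing"],
]


def build_network(documents):
    """Return node sizes (occurrences) and link strengths (co-occurrences)."""
    nodes = Counter()
    links = Counter()
    for keywords in documents:
        unique = sorted(set(keywords))      # sort so each pair has one form
        nodes.update(unique)
        links.update(combinations(unique, 2))
    return nodes, links


nodes, links = build_network(documents)
print(nodes)
print(links)
```

The two counters could then be passed to a drawing tool, which would decide the layout.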
Text visualisation
There are many ways of visualising textual data; a survey by Linnaeus
University identified 440 distinct techniques (Kucher and Kerren, 2015).
Few, however, have gained wide use.
The most common way of visualising text documents or collections of
documents is the word cloud, a relatively simple tool for producing images
comprising the words of the text, with greater prominence given to the
words that appear most frequently. They are popular for giving quickly
assimilated summaries and comparisons of the contents of lengthy
documents, but have been criticised for giving overly simplistic views of
complex data. The same is true of other simple text-visualisation methods,
such as the termsberry and the streamgraph. Some suggestions for
improvement by manipulating the layout of word clouds to make them
more meaningful are given by Hearst et al. (2020).
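Behind any word cloud lies a simple count of word frequencies. The Python sketch below shows the idea; the short stop-word list, the sample sentence and the function name word_weights are assumptions made for illustration, and a real tool would go on to map the counts to font sizes and positions.

```python
import re
from collections import Counter

# Illustrative list of common words to ignore; real lists are much longer.
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is"}


def word_weights(text, top=3):
    """Return the most frequent words, with counts, ignoring stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return counts.most_common(top)


sample = "Data about data is metadata. The data is stored in the catalogue."
print(word_weights(sample, top=1))   # [('data', 3)]
```

The sketch also hints at why word clouds can be simplistic: every word is counted in isolation, so 'data' and 'metadata' are treated as unrelated, and context is lost.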
Text mining
Text mining involves analysing and displaying the structure of texts. There
are a variety of means to achieve this. Coding with specialised libraries, such
as the Python language’s TextBlob, gives greatest flexibility. General-purpose
qualitative analysis packages, such as NVivo and MAXQDA, can be
used for text mining, but are complex solutions to what may be only a
simple need. Web-based environments for analysis, visualisation and distant
reading of texts, such as Voyant Tools, provide a variety of analyses,
including word clouds, word-frequency lists, common words shown in
context, and occurrence of terms in different sections of a web document.
Network analysis software such as VOSviewer can be used with bodies of text, producing visual
displays of the frequency and co-occurrence of words
and phrases. For library/information applications of this type of software,
see Tokarz (2017), Miller (2018) and Hendrigan (2019). A more specialised
tool, the n-gram viewer, allows the comparison of frequency of occurrence of
words or phrases between datasets, or over time in a single dataset. The best-known
example is the Google Books Ngram Viewer, which provides a picture of
how word, phrase and name occurrences have changed over time in
Google’s corpus of digitised books.
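The principle of an n-gram viewer can be sketched in a few lines of Python: split each text into overlapping sequences of n words, then compare the relative frequency of a chosen phrase across texts. The two tiny 'corpora' and the function names below are invented for illustration, and no claim is made that any existing viewer is implemented in this way.

```python
from collections import Counter


def ngram_counts(text, n=2):
    """Count every sequence of n consecutive words in the text."""
    words = text.lower().split()
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


def relative_frequency(phrase, text):
    """Share of the text's n-grams that are exactly the given phrase."""
    n = len(phrase.split())
    counts = ngram_counts(text, n)
    total = sum(counts.values())
    return counts[phrase.lower()] / total if total else 0.0


# Hypothetical miniature corpora for two years.
by_year = {
    1990: "the card index and the card catalogue",
    2020: "the search engine and the search box and the card index",
}
trend = {
    year: relative_frequency("card index", text)
    for year, text in by_year.items()
}
print(trend)
```

Relative rather than absolute frequencies are used so that texts or years with different amounts of material can be compared fairly.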
Summary
Digital technologies and data systems form the bedrock of the infosphere.
An understanding of their nature and significance, and an ability to make
practical use of relevant aspects, are essential for all information professionals.