Human Language Understanding and Reasoning
The last decade has yielded dramatic and quite surprising breakthroughs in natural
When scientists consider artificial intelligence, they mostly think of
modeling or recreating the capabilities of an individual human brain.
But modern human intelligence is much more than the intelligence of
an individual brain. Human language is powerful and has been transformative to
our species because it gives groups of people a way to network human brains together.
An individual human may not be much more intelligent than our close relatives,
the chimpanzees and bonobos. These apes have been shown to possess many
of the hallmark skills of human intelligence, such as using tools and planning;
moreover, they have better short-term memory than we do.1 When humans in-
vented language is still, and perhaps will forever be, quite uncertain, but within
the long evolutionary history of life on Earth, human beings developed language
incredibly recently. The common ancestor of prosimians, monkeys, and apes
dates to perhaps sixty-five million years ago; humans separated from chimps per-
haps six million years ago, while human language is generally assumed to be only
a few hundred thousand years old.2 Once humans developed language, the pow-
er of communication quickly led to the ascendancy of Homo sapiens over other
creatures, even though we are not as strong as an elephant nor as fast as a cheetah.
It was much more recently that humans developed writing (only a bit more than
five thousand years ago), allowing knowledge to be communicated across distances
of time and space. In just a few thousand years, this information-sharing mechanism
took us from the Bronze Age to the smartphones of today. A high-fidelity
code allowing both rational discussion among humans and the distribution of in-
formation has allowed the cultural evolution of complex societies and the knowl-
edge underlying modern technologies. The power of language is fundamental to
human societal intelligence, and language will retain an important role in a future
world in which human abilities are augmented by artificial intelligence tools.
For these reasons, the field of natural language processing (NLP) emerged in
tandem with the earliest developments in artificial intelligence. Indeed, initial
work on the NLP problem of machine translation, including the famous George-
The history of natural language processing until now can be roughly divided
into four eras. The first era runs from 1950 to 1969. NLP research began
as research in machine translation. It was imagined that translation could
quickly build on the great successes of computers in code breaking during World
War II. On both sides of the Cold War, researchers sought to develop systems ca-
pable of translating the scientific output of other nations. Yet, at the beginning
of this era, almost nothing was known about the structure of human language,
artificial intelligence, or machine learning. The amount of computation and data
available was, in retrospect, comically small. Although initial systems were pro-
moted with great fanfare, the systems provided little more than word-level trans-
lation lookups and some simple, not very principled rule-based mechanisms to
deal with the inflectional forms of words (morphology) and word order.
The second era, from 1970 to 1992, saw the development of a whole series of
NLP demonstration systems that showed sophistication and depth in handling
phenomena like syntax and reference in human languages. These systems includ-
ed SHRDLU by Terry Winograd, LUNAR by Bill Woods, Roger Schank’s systems
such as SAM, Gary Hendrix’s LIFER, and GUS by Danny Bobrow.4 These were all
hand-built, rule-based systems, but they started to model and use some of the
complexity of human language understanding. Some systems were even deployed
operationally for tasks like database querying.5 Linguistics and knowledge-based
artificial intelligence were rapidly developing, and in the second decade of this
era, a new generation of hand-built systems emerged, which had a clear separa-
tion between declarative linguistic knowledge and its procedural processing,
and which benefited from the development of a range of more modern linguistic
theories.
128 Dædalus, the Journal of the American Academy of Arts & Sciences
Christopher D. Manning
However, the direction of work changed markedly in the third era, from rough-
ly 1993 to 2012. In this period, digital text became abundantly available, and the
compelling direction was to develop algorithms that could achieve some level of
language understanding over large amounts of natural text and that used the ex-
istence of this text to help provide this ability. This led to a fundamental reorien-
tation of the field around empirical machine learning models of NLP, an orienta-
tion that still dominates the field today. At the beginning of this period, the dom-
inant modus operandi was to get hold of a reasonable quantity of online text–in
accumulated, and this knowledge can then be deployed for tasks of interest, such
as question answering or text classification.
In hindsight, the development of large-scale self-supervised learning ap-
proaches may well be viewed as the fundamental change, and the third era might
be extended until 2017. The impact of pretrained self-supervised approaches has
been revolutionary: it is now possible to train models on huge amounts of unla-
beled human language material in such a way as to produce one large pretrained
model that can be very easily adapted, via fine-tuning or prompting, to give strong
I cannot give here a full description of the now-dominant neural network models
of human language, but I can offer an inkling. These models represent everything
via vectors of real numbers and are able to learn good representations after
exposure to many pieces of data by back-propagation of errors (which
comes down to doing differential calculus) from some prediction task back to the
representations of the words in a text. Since 2018, the dominant neural network
model for NLP applications has been the transformer neural network.7 With sev-
eral ideas and parts, a transformer is a much more complex model than the simple
neural networks for sequences of words that were explored in earlier decades. The
dominant idea is one of attention, by which a representation at a position is com-
puted as a weighted combination of representations from other positions. A com-
mon self-supervision objective in a transformer model is to mask out occasional
words in a text. The model works out what word used to be there. It does this by
calculating from each word position (including mask positions) vectors that rep-
resent a query, key, and value at that position. The query at a position is compared
with the key at every position to calculate how much attention to pay to each
position; based on this, a weighted average of the values at all positions is calculated.
This operation is repeated many times at each level of the transformer neural net,
and the resulting value is further manipulated through a fully connected neural
net layer and through use of normalization layers and residual connections to pro-
duce a new vector for each word. This whole process is repeated many times, giv-
ing extra layers of depth to the transformer neural net. At the end, the representa-
tion above a mask position should capture the word that was there in the original
text: for instance, committee as illustrated in Figure 1.
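As a rough illustration of the attention calculation just described, the following Python sketch computes, for each query, a weighted average of the values at all positions. It is a minimal sketch under stated assumptions, not the implementation of any particular model: the function names (softmax, dot, attend) and the toy vectors are hypothetical, and the scaling by the square root of the key dimension and the softmax normalization follow the standard transformer formulation rather than anything spelled out above. Real models also learn the projections that produce the query, key, and value vectors; here those vectors are simply given.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attend(queries, keys, values):
    """For each query, compare it with every key, then average the values.

    Assumption: scores are scaled by sqrt(d), where d is the key dimension,
    as in the standard transformer formulation.
    """
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # How well does this query match the key at each position?
        scores = [dot(q, k) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the values at all positions.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical toy data: two positions, two-dimensional vectors.
toy_keys = [[10.0, 0.0], [0.0, 10.0]]
toy_values = [[1.0, 0.0], [0.0, 1.0]]
toy_queries = [[10.0, 0.0], [0.0, 0.0]]

toy_outputs, toy_weights = attend(toy_queries, toy_keys, toy_values)
```

In this toy setup, the first query closely matches the first key, so nearly all of its attention goes to the first position. The second query, being all zeros, scores every key equally, so its output is the plain average of the values.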
It is not at all obvious what can be achieved or learned by the many simple cal-
culations of a transformer neural net. At first, this may sound like some kind of
complex statistical association learner. However, given a very powerful, flexible,
and high-parameter model like a transformer neural net and an enormous amount
Figure 1. Details of the Attention Calculations in One Part of a Transformer Neural Net Model
of data to practice predictions on, these models discover and represent much of
the structure of human languages. Indeed, work has shown that these models
learn and represent the syntactic structure of a sentence and will learn to memo-
rize many facts of the world, since each of these things helps the model to predict
masked words successfully.8 Moreover, while predicting a masked word initial-
ly seems a rather simple and low-level task–a kind of humorless Mad Libs–and
not something sophisticated, like diagramming a sentence to show its grammati-
cal structure, this task turns out to be very powerful because it is universal: every
form of linguistic and world knowledge, from sentence structure and word connotations
to facts about the world, helps one to do this task better. As a result, these
models assemble a broad general knowledge of the language and world to which
they are exposed. A single such large pretrained language model (LPLM) can be
deployed for many particular NLP tasks with only a small amount of further in-
struction. The standard way of doing this from 2018 to 2020 was fine-tuning the
model via a small amount of additional supervised learning, training it on the ex-
act task of interest. But very recently, researchers have surprisingly found that the
largest of these models, such as GPT-3 (Generative Pre-trained Transformer-3),
can perform novel tasks very well with just a prompt. Give them a human language
description or several examples of what one wants them to do, and they can per-
form many tasks for which they were never otherwise trained.9
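To make the idea of prompting concrete, here is a small Python sketch that assembles a prompt from a task description and several worked examples. Everything in it is an assumption made for illustration: the function name build_few_shot_prompt, the "Review:" and "Sentiment:" layout, and the sample reviews are hypothetical, and no model is actually called. The resulting text would be handed to a large pretrained language model, which is expected to continue it with the missing label.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a plain-text prompt: description, worked examples, new case.

    The layout is a hypothetical convention chosen for this sketch.
    """
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # The final case is left unlabeled for the model to complete.
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)


# Hypothetical examples for a product-sentiment task.
demo_prompt = build_few_shot_prompt(
    "Classify each product review as positive or negative.",
    [
        ("Great battery life.", "positive"),
        ("The screen cracked within a week.", "negative"),
    ],
    "Fast and reliable.",
)
print(demo_prompt)
```

No weights are updated here: the examples live only in the input text, which is the contrast with fine-tuning drawn above.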
Traditional natural language processing models were elaborately composed
from several usually independently developed components, frequently
built into a pipeline, which first tried to capture the sentence structure
and low-level entities of a text and then something of the higher-level meaning,
turning pages that are suggested to hold relevant information, as in the early gen-
erations of Web search). Question answering has many straightforward commer-
cial applications, including both presale and postsale customer support. Modern
neural network question-answering systems have high accuracy in extracting an
answer present in a text and are even fairly good at working out that no answer is
present. For example, from this passage:
Samsung saved its best features for the Galaxy Note 20 Ultra, including a more refined
design than the Galaxy S20 Ultra–a phone I don’t recommend. You’ll find an excep-
One can get answers to questions like the following (using the UnifiedQA model):14
For common traditional NLP tasks like marking person or organization names
in a piece of text or classifying the sentiment of a text about a product (as posi-
tive or negative), the best current systems are again based on LPLMs, usually fine-
tuned by providing a set of examples labeled in the desired way. While these tasks
could be done quite well even before recent large language models, the greater
breadth of knowledge of language and the world in these models has further im-
proved performance on these tasks.
Finally, LPLMs have led to a revolution in the ability to generate fluent and
connected text. In addition to many creative uses, such systems have prosaic uses
ranging from writing formulaic news articles like earnings or sports reports to
automating summarization. For example, such a system can help a radiologist by
suggesting the impression (or summary) based on the radiologist’s findings. For
the findings below, we can see that the system-generated impression is quite sim-
ilar to a radiologist-generated impression:15
Findings: lines/tubes: right ij sheath with central venous catheter tip overlying the
svc. on initial radiograph, endotracheal tube between the clavicular heads, and enteric
tube with side port at the ge junction and tip below the diaphragm off the field-of-
view; these are removed on subsequent film. mediastinal drains and left thoracosto-
my tube are unchanged. lungs: low lung volumes. retrocardiac airspace disease, slight-
ly increased on most recent film. pleura: small left pleural effusion. no pneumothorax.
heart and mediastinum: postsurgical widening of the cardiomediastinal silhouette.
These recent NLP systems perform very well on many tasks. Indeed, given a
fixed task, they can often be trained to perform it as well as human beings, on av-
erage. Nevertheless, there are still reasons to be skeptical as to whether these sys-
tems really understand what they are doing, or whether they are just very elabo-
rate rewriting systems, bereft of meaning.
The dominant approach to describing meaning, not only in linguistics and
philosophy of language but also for programming languages, is a denotational
semantics approach or a theory of reference: the meaning of a word,
phrase, or sentence is the set of objects or situations in the world that it describes
(or a mathematical abstraction thereof). This contrasts with the simple distributional
semantics (or use theory of meaning) of modern empirical work in NLP, whereby
the meaning of a word is simply a description of the contexts in which it appears.16
Some have suggested that the latter is not a theory of semantics at all but
just a regurgitation of distributional or syntactic facts.17 I would disagree. Mean-
ing is not all or nothing; in many circumstances, we partially appreciate the mean-
ing of a linguistic form. I suggest that meaning arises from understanding the net-
work of connections between a linguistic form and other things, whether they be
objects in the world or other linguistic forms. If we possess a dense network of
connections, then we have a good sense of the meaning of the linguistic form. For
example, if I have held an Indian shehnai, then I have a reasonable idea of the mean-
ing of the word, but I would have a richer meaning if I had also heard one being
played. Going in the other direction, if I have never seen, felt, or heard a shehnai,
but someone tells me that it’s like a traditional Indian oboe, then the word has some
meaning for me: it has connections to India, to wind instruments that use reeds,
and to playing music. If someone added that it has holes sort of like a recorder, but it
has multiple reeds and a flared end more like an oboe, then I have more network con-
nections to objects and attributes. Conversely, I might not have that information
but just a couple of contexts in which the word has been used, such as: From a week
before, shehnai players sat in bamboo machans at the entrance to the house, playing their
pipes. Bikash Babu disliked the shehnai’s wail, but was determined to fulfil every convention-
al expectation the groom’s family might have.18 Then, in some ways, I understand the
meaning of the word shehnai rather less, but I still know that it is a pipe-like musi-
cal instrument, and my meaning is not a subset of the meaning of the person who
has simply held a shehnai, for I know some additional cultural connections of the
indeed, one might be able to do it simply with natural language instructions. This
resulting convergence on a small number of models carries several risks: the
groups capable of building these models may have excessive power and influence,
many end users might suffer from any biases present in these models, and it will
be difficult to tell if models are safe to use in particular contexts because the mod-
els and their training data are so large. Nevertheless, the ability of these models to
deploy knowledge gained from a huge amount of training data to many different
runtime tasks will make these models powerful, and they will for the first time
endnotes
1 Frans de Waal, Are We Smart Enough to Know How Smart Animals Are? (New York: W. W.
Norton, 2017).
2 Mark Pagel, “Q&A: What Is Human Language, When Did It Evolve and Why Should We
Care?” BMC Biology 15 (1) (2017): 64.
3 W. John Hutchins, “The Georgetown-IBM Experiment Demonstrated in January 1954,”
in Machine Translation: From Real Users to Research, ed. Robert E. Frederking and Kathryn
B. Taylor (New York: Springer, 2004), 102–114.
4 A survey of these systems and references to individual systems appears in Avron Barr,
“Natural Language Understanding,” AI Magazine, Fall 1980.
5 Larry R. Harris, “Experience with Robot in 12 Commercial, Natural Language Data Base
Query Applications,” in Proceedings of the 6th International Joint Conference on Artificial
Intelligence, IJCAI-79 (Santa Clara, Calif.: International Joint Conferences on Artificial
Intelligence Organization, 1979), 365–371.
6 Glenn Carroll and Eugene Charniak, “Two Experiments on Learning Probabilistic De-
pendency Grammars from Corpora,” in Working Notes of the Workshop Statistically-Based
NLP Techniques, ed. Carl Weir, Stephen Abney, Ralph Grishman, and Ralph Weischedel
(Menlo Park, Calif.: AAAI Press, 1992).
7 Ashish Vaswani, Noam Shazeer, Niki Parmar, et al., “Attention Is All You Need,” Advances
20 Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “BERT: Pre-train-
ing of Deep Bidirectional Transformers for Language Understanding,” in Proceedings of
the 2019 Conference of the North American Chapter of the Association for Computational Linguistics
(Stroudsburg, Pa.: Association for Computational Linguistics, 2019), 4171–4186.
21 Robert Logan, Nelson F. Liu, Matthew E. Peters, et al., “Barack’s Wife Hillary: Using
Knowledge Graphs for Fact-Aware Language Modeling,” in Proceedings of the 57th Annu-
al Meeting of the Association for Computational Linguistics (Stroudsburg, Pa.: Association for
Computational Linguistics, 2019), 5962–5971; and Kelvin Guu, Kenton Lee, Zora Tung,
et al., “REALM: Retrieval-Augmented Language Model Pre-Training,” Proceedings of Ma-