PNAS PLUS
Phoneme and word recognition in the auditory
ventral stream
Iain DeWitt1 and Josef P. Rauschecker1
Laboratory of Integrative Neuroscience and Cognition, Department of Neuroscience, Georgetown University Medical Center, Washington, DC 20007

Edited by Mortimer Mishkin, National Institute for Mental Health, Bethesda, MD, and approved December 19, 2011 (received for review August 17, 2011)

Spoken word recognition requires complex, invariant representations. Using a meta-analytic approach incorporating more than 100 functional imaging experiments, we show that preference for complex sounds emerges in the human auditory ventral stream in a hierarchical fashion, consistent with nonhuman primate electrophysiology. Examining speech sounds, we show that activation associated with the processing of short-timescale patterns (i.e., phonemes) is consistently localized to left mid-superior temporal gyrus (STG), whereas activation associated with the integration of phonemes into temporally complex patterns (i.e., words) is consistently localized to left anterior STG. Further, we show left mid- to anterior STG is reliably implicated in the invariant representation of phonetic forms and that this area also responds preferentially to phonetic sounds, above artificial control sounds or environmental sounds. Together, this shows increasing encoding specificity and invariance along the auditory ventral stream for temporally complex speech sounds.

functional MRI | meta-analysis | auditory cortex | object recognition | language

NEUROSCIENCE

Spoken word recognition presents several challenges to the brain. Two key challenges are the assembly of complex auditory representations and the variability of natural speech (SI Appendix, Fig. S1) (1). Representation at the level of primary auditory cortex is precise: fine-grained in scale and local in spectrotemporal space (2, 3). The recognition of complex spectrotemporal forms, like words, in higher areas of auditory cortex requires the transformation of this granular representation into Gestalt-like, object-centered representations. In brief, local features must be bound together to form representations of complex spectrotemporal contours, which are themselves the constituents of auditory “objects” or complex sound patterns (4, 5). Next, representations must be generalized and abstracted. Coding in primary auditory cortex is sensitive even to minor physical transformations. Object-centered coding in higher areas, however, must be invariant (i.e., tolerant of natural stimulus variation) (6). For example, whereas the phonemic structure of a word is fixed, there is considerable variation in physical, spectrotemporal form (attributable to accent, pronunciation, body size, and the like) among utterances of a given word. It has been proposed for visual cortical processing that a feed-forward, hierarchical architecture (7) may be capable of simultaneously solving the problems of complexity and variability (8–12). Here, we examine these ideas in the context of auditory cortex.

In a hierarchical pattern-recognition scheme (8), coding in the earliest cortical field would reflect the tuning and organization of primary auditory cortex (or core) (2, 3, 13). That is, single-neuron receptive fields (more precisely, frequency-response areas) would be tuned to particular center frequencies and would have minimal spectrotemporal complexity (i.e., a single excitatory zone and one-to-two inhibitory side bands). Units in higher fields would be increasingly pattern selective and invariant to natural variation. Pattern selectivity and invariance respectively arise from neural computations similar in effect to “logical-AND” and “logical-OR” gates. In the auditory system, neurons whose tuning is combination sensitive (14–21) perform the logical-AND gate-like operation, conjoining structurally simple representations in lower-order units into the increasingly complex representations (i.e., multiple excitatory and inhibitory zones) of higher-order units. In the case of speech sounds, these neurons conjoin representations for adjacent speech formants or, at higher levels, adjacent phonemes. Although the mechanism by which combination sensitivity (CS) is directionally selective in the temporal domain is not fully understood, some propositions exist (22–26). As an empirical matter, direction selectivity is clearly present early in auditory cortex (19, 27). It is also observed to operate at time scales (50–250 ms) sufficient for phoneme concatenation, as long as 250 ms in the zebra finch (15) and 100 to 150 ms in macaque lateral belt (18). Logical-OR gate-like computation, technically proposed to be a soft maximum operation (28–30), is posited to be performed by spectrotemporal-pooling units. These units respond to suprathreshold stimulation from any member of their connected lower-order pool, thus creating a superposition of the connected lower-order representations and abstracting them. With respect to speech, this might involve the pooling of numerous, rigidly tuned representations of different exemplars of a given phoneme into an abstracted representation of the entire pool. Spatial pooling is well documented in visual cortex (7, 31, 32) and there is some evidence for its analog, spectrotemporal pooling, in auditory cortex (33–35), including the observation of complex cells when A1 is developmentally reprogrammed as a surrogate V1 (36). However, a formal equivalence is yet to be demonstrated (37, 38).

Auditory cortex’s predominant processing pathways, ventral and dorsal (39, 40), appear to be optimized for pattern recognition and action planning, respectively (17, 18, 40–44). Speech-specific models generally concur (45–48), creating a wide consensus that word recognition is performed in the auditory ventral stream (refs. 42, 45, 47–50, but see refs. 51–53). The hierarchical model predicts an increase in neural receptive field size and complexity along the ventral stream. With respect to speech, there is a discontinuity in the processing demands associated with the recognition of elemental phonetic units (i.e., phonemes or something phone-like) and concatenated units (i.e., multisegmental forms, both sublexical forms and word forms). Phoneme recognition requires sensitivity to the arrangement of constellations of spectrotemporal features (i.e., the presence and absence of energy at particular center frequencies and with particular temporal offsets). Word-form recognition requires sensitivity to the temporal arrangement of phonemes. Thus, phoneme recognition requires spectrotemporal CS and operates

Author contributions: I.D. designed research; I.D. performed research; I.D. analyzed data; and I.D. and J.P.R. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
1 To whom correspondence may be addressed. E-mail: id32@georgetown.edu or rauschej@georgetown.edu.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1113427109/-/DCSupplemental.
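As a rough illustration of the two computations described in the introduction, the Python sketch below pairs an AND-like combination-sensitive unit with an OR-like pooling unit implemented as a soft maximum. The function names, the threshold, the sharpness parameter `beta`, and the toy input values are assumptions made for illustration only; they are not taken from the models cited in the text (28–30).

```python
import math

def combination_sensitive(inputs, threshold=0.5):
    """AND-like unit: responds only if every lower-order input is suprathreshold."""
    if all(x > threshold for x in inputs):
        return min(inputs)  # response limited by the weakest component
    return 0.0

def soft_max_pool(inputs, beta=10.0):
    """OR-like unit: exponentially weighted average of the inputs.

    beta = 0 gives the plain mean; large beta approaches max(inputs).
    """
    weights = [math.exp(beta * x) for x in inputs]
    return sum(w * x for w, x in zip(weights, inputs)) / sum(weights)

# Hypothetical responses of two formant detectors to three exemplars
# of the same phoneme (values are invented for illustration).
exemplar_inputs = [(0.9, 0.8), (0.2, 0.9), (0.7, 0.6)]

# One rigidly tuned AND-like unit per exemplar ...
exemplar_units = [combination_sensitive(pair) for pair in exemplar_inputs]

# ... pooled by an OR-like unit into a more abstract phoneme response.
phoneme_unit = soft_max_pool(exemplar_units)
```

In this toy setup the second exemplar unit stays silent because one of its inputs is subthreshold, while the pooling unit still responds, driven mostly by its strongest input. That is the sense in which pooling abstracts over exemplars.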



www.pnas.org/cgi/doi/10.1073/pnas.1113427109                                                                                        PNAS Early Edition | 1 of 10
on low-level acoustic features (SI Appendix, Fig. S1B, second layer), whereas word-form recognition requires only temporal CS (i.e., concatenation of phonemes) and operates on higher-order features that may also be perceptual objects in their own right (SI Appendix, Fig. S1B, top layer). If word-form recognition is implemented hierarchically, we might expect this discontinuity in processing to be mirrored in cortical organization, with concatenative phonetic recognition occurring distal to elemental phonetic recognition.

Primate electrophysiology identifies CS as occurring as early as core’s supragranular layers and in lateral belt (16, 17, 19, 37). In the macaque, selectivity for communication calls, which are similar in spectrotemporal structure to phonemes or consonant-vowel (CV) syllables, is observed in belt area AL (54) and, to an even greater degree, in a more anterior field, RTp (55). Further, for macaques trained to discriminate human phonemes, categorical coding is present in the single-unit activity of AL neurons as well as in the population activity of area AL (1, 56). Human homologs to these sites putatively lie on or about the anterior-lateral aspect of Heschl’s gyrus and in the area immediately posterior to it (13, 57–59). Macaque PET imaging suggests there is also an evolutionary predisposition to left-hemisphere processing for conspecific communication calls (60). Consistent with macaque electrophysiology, human electrocorticography recordings from superior temporal gyrus (STG), in the region immediately posterior to the anterior-lateral aspect of Heschl’s gyrus (i.e., mid-STG), show the site to code for phoneme identity at the population level (61). Mid-STG is also the site of peak high-gamma activity in response to CV sounds (62–64). Similarly, human functional imaging studies suggest left mid-STG is involved in processing elemental speech sounds. For instance, in subtractive functional MRI (fMRI) comparisons, after partialing out variance attributable to acoustic factors, Leaver and Rauschecker (2010) showed selectivity in left mid-STG for CV speech sounds as opposed to other natural sounds (5). This implies the presence of a local density of neurons with receptive-field tuning optimized for the recognition of elemental phonetic sounds [i.e., areal specialization (AS)]. Furthermore, the region exhibits fMRI-adaptation phenomena consistent with invariant representation (IR) (65, 66). That is, response diminishes when the same phonetic content is repeatedly presented even though a physical attribute of the stimulus, one unrelated to phonetic content, is changed; here, the speaker’s voice (5). Similarly, using speech sound stimuli on the /ga/–/da/ continuum and comparing response to exemplar pairs that varied only in acoustics or that varied both in acoustics and in phonetic content, Joanisse and colleagues (2007) found adaptation specific to phonetic content in left mid-STG, again implying IR (67).

The site downstream of mid-STG, performing phonetic concatenation, should possess neurons that respond to late components of multisegmental sounds (i.e., latencies >60 ms). These units should also be selective for specific phoneme orderings. Nonhuman primate data for regions rostral to A1 confirm that latencies increase rostrally along the ventral stream (34, 55, 68, 69), with the median latency to peak response approaching 100 ms in area RT (34), consistent with the latencies required for phonetic concatenation. In a rare human electrophysiology study, Creutzfeldt and colleagues (1989) report vigorous single-unit responses to words and sentences in mid- to anterior STG (70). This included both feature-tuned units and late-component-tuned units. Although the relative location of feature and late-component units is not reported, and the late-component units do not clearly evince temporal CS, the mixture of response types supports the supposition of temporal combination-sensitive units in human STG. Imaging studies localize processing of multisegmental forms to anterior STG/superior temporal sulcus (STS). This can be seen in peak activation to word-forms in electrocorticography (71) and magnetoencephalography (72). FMRI investigations of stimulus complexity, comparing activation to word-form and pure-tone stimuli, report similar localization (47, 73, 74). Invariant tuning for word forms, as inferred from fMRI-adaptation studies, also localizes to anterior STG/STS (75–77). Studies investigating cross-modal repetition effects for auditory and visual stimuli confirm anterior STG/STS localization and, further, show it to be part of unimodal auditory cortex (78, 79). Finally, application of electrical cortical interference to anterior STG disrupts auditory comprehension, producing patient reports of speech as being like “a series of meaningless utterances” (80).

Here, we use a coordinate-based meta-analytic approach [activation likelihood estimation (ALE)] (81) to make an unbiased assessment of the robustness of functional-imaging evidence for the aforementioned speech-recognition model. In short, the method assesses the stereotaxic concordance of reported effects. First, we investigate the strength of evidence for the predicted anatomical dissociation between elemental phonetic recognition (mid-STG) and concatenative phonetic recognition (anterior STG). To assess this, two functional imaging paradigms are meta-analyzed: speech vs. acoustic-control sounds (a proxy for CS, as detailed later) and repetition suppression (RS). For each paradigm, separate analyses are performed for studies of elemental phonetic processing (i.e., phoneme- and CV-length stimuli) and for studies involving concatenative phonetic processing (i.e., word-length stimuli). Although the aforementioned model is principally concerned with word-form recognition, for comparative purposes, we meta-analyze studies of phrase-length stimuli as well. Second, we investigate the strength of evidence for the predicted ventral-stream colocalization of CS and IR phenomena. To assess this, the same paradigms are reanalyzed with two modifications: (i) for IR, a subset of RS studies meeting heightened criteria for fMRI-adaptation designs is included (Methods); (ii) to attain sufficient sample size, analyses are collapsed across stimulus lengths.

We also investigate the strength of evidence for AS, which has been suggested as an organizing principle in higher-order areas of the auditory ventral stream (5, 82–85) and is a well-established organizing principle in the visual system’s analogous pattern-recognition pathway (86–89). In the interest of comparing the organizational properties of the auditory ventral stream with those of the visual ventral stream, we assess the colocalization of AS phenomena with CS and IR phenomena. CS and IR are examined as described earlier. AS is examined by meta-analysis of speech vs. nonspeech natural-sound paradigms.

At a deep level, both our AS and CS analyses putatively examine CS-dependent tuning for complex patterns of spectrotemporal energy. Acoustic-control sounds lack the spectrotemporal feature combinations requisite for driving combination-sensitive neurons tuned to speech sounds. For nonspeech natural sounds, the same is true, but there should also exist combination-sensitive neurons tuned to these stimuli, as they have been repeatedly encountered over development. For an effect to be observed in the AS analyses, not only must there be a population of combination-sensitive speech-tuned neurons, but these neurons must also cluster together such that a differential response is observable at the macroscopic scale of fMRI and PET.

Results
Phonetic-length-based analyses of CS studies (i.e., speech sounds vs. acoustic control sounds) were performed twice. In the first analyses, tonal control stimuli were excluded on grounds that they do not sufficiently match the spectrotemporal energy distribution of speech. That is, for a strict test of CS, we required acoustic control stimuli to model low-level properties of speech (i.e., contain spectrotemporal features coarsely similar to speech), not merely to drive primary and secondary auditory cortex. Under this preparation, spatial concordance was greatest in STG/STS across each phonetic length-based analysis (Table 1).
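To make the notion of spatial concordance more concrete, the following Python sketch shows one common way an ALE-style statistic can be computed at a single location: each reported focus is blurred with a 3D Gaussian, each experiment contributes the largest blurred value among its foci, and the per-experiment values are combined as a probabilistic union. This is a simplified illustration, not the exact procedure of the ALE implementation cited in the text (81). The function names, the 10-mm FWHM, the `peak` scaling, and the example coordinates are all assumptions.

```python
import math

def modeled_activation(voxel, foci, fwhm_mm=10.0, peak=0.5):
    """Largest Gaussian-blurred contribution of one experiment's foci at `voxel`.

    `peak` is an arbitrary illustrative scaling, not a calibrated probability.
    """
    sigma = fwhm_mm / (2.0 * math.sqrt(2.0 * math.log(2.0)))
    best = 0.0
    for focus in foci:
        dist_sq = sum((v - f) ** 2 for v, f in zip(voxel, focus))
        best = max(best, peak * math.exp(-dist_sq / (2.0 * sigma ** 2)))
    return best

def ale_at(voxel, experiments, fwhm_mm=10.0, peak=0.5):
    """Union across experiments: 1 - prod(1 - MA_i) evaluated at `voxel`."""
    not_active = 1.0
    for foci in experiments:
        not_active *= 1.0 - modeled_activation(voxel, foci, fwhm_mm, peak)
    return 1.0 - not_active

# Invented foci (x, y, z in mm) for three hypothetical experiments.
toy_experiments = [
    [(-58, -20, 2)],
    [(-56, -18, 0), (54, -2, 2)],
    [(-60, -22, 4)],
]

near_cluster = ale_at((-58, -20, 2), toy_experiments)
far_away = ale_at((0, 0, 0), toy_experiments)
```

Because of the arbitrary `peak` scaling, the values from this sketch are not comparable to the peak ALE values in Table 1. The point is only that locations where foci from several independent experiments converge score higher than locations supported by one experiment or none.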

Table 1. Results for phonetic length-based analyses

                                                                Center of mass  Peak coordinates
Analysis/anatomy     BA     Cluster concordance  Volume, mm³     x     y     z     x     y     z  Peak ALE

CS
  Phoneme length
    Left STG         42/22  0.93                       3,624   −57   −25     1   −58   −20     2     0.028
    Right STG/RT     42/22  0.21                         512    56   −11    −2    54    −2     2     0.015
  Word length
    Left STG         42/22  0.56                       2,728   −57   −17    −1   −56   −16    −2     0.021
    Right STG        22     0.13                         192    55   −17     0    56   −16     0     0.014
  Phrase length
    Left STS         21     0.58                       2,992   −56    −8    −8   −56    −8    −8     0.038
    Left STS         21     0.42                       1,456   −52     7   −16   −52     8   −16     0.035
    Right STS        21     0.32                       2,264    54    −3    −9    56    −6    −6     0.032
    Left STS         22     0.32                         840   −54   −35     1   −54   −34     0     0.028
    Left PreCG       6      0.32                         664   −47    −7    47   −48    −8    48     0.025
    Left IFG         47     0.21                         456   −42    25   −12   −42    24   −12     0.021
    Left IFG         44     0.16                         200   −48    11    20   −48    10    20     0.020
RS
  Phoneme length
    Left STG         42/22  0.33                         640   −58   −21     4   −58   −20     4     0.018
  Word length
    Left STG         42/22  0.50                       1,408   −56    −9    −3   −56   −10    −4     0.027
    Left STG         42/22  0.19                         288   −58   −28     2   −58   −28     2     0.017

  BA, Brodmann area; IFG, inferior frontal gyrus; PreCG, precentral gyrus; RT, rostrotemporal area.



Within STG/STS, results were left-biased across peak ALE-statistic value, cluster volume, and the percentage of studies reporting foci within a given cluster, hereafter “cluster concordance.” The predicted differential localization for phoneme- and word-length processing was confirmed, with phoneme-length effects most strongly associated with left mid-STG and word-length effects with left anterior STG (Fig. 1 and SI Appendix, Fig. S2). Phrase-length studies showed a similar leftward processing bias. Further, peak processing for phrase-length stimuli localized to a site anterior and subjacent to that of word-length stimuli, suggesting a processing gradient for phonetic stimuli that progresses from mid-STG to anterior STG and then into STS.
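The “cluster concordance” measure defined above can be sketched in a few lines of Python. The sketch assumes each study is a list of reported foci and that cluster membership is given by a caller-supplied test. The spherical cluster used in the example, its radius, and the foci are invented for illustration and do not correspond to the actual cluster shapes or data in this analysis.

```python
def cluster_concordance(studies, in_cluster):
    """Fraction of studies that report at least one focus inside the cluster."""
    if not studies:
        return 0.0
    hits = sum(1 for foci in studies if any(in_cluster(f) for f in foci))
    return hits / len(studies)

def sphere(center, radius_mm):
    """Hypothetical stand-in for a cluster: a sphere around `center`."""
    def contains(focus):
        dist_sq = sum((a - b) ** 2 for a, b in zip(focus, center))
        return dist_sq <= radius_mm ** 2
    return contains

# Invented foci (x, y, z in mm) for four hypothetical studies.
toy_studies = [
    [(-58, -20, 2)],
    [(-57, -25, 1), (54, -2, 2)],
    [(54, -2, 2)],
    [],
]

left_mid_stg = sphere(center=(-57, -25, 1), radius_mm=8.0)
concordance = cluster_concordance(toy_studies, left_mid_stg)  # 2 of 4 studies
```

A study counts once no matter how many of its foci fall inside the cluster, so the measure reflects agreement across studies rather than the raw number of foci.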




Fig. 1. Foci meeting inclusion criteria for length-based CS analyses (A–C) and ALE-statistic maps for regions of significant concordance (D–F) (p < 10⁻³, k > 150 cm³).
Analyses show leftward bias and an anterior progression in peak effects with phoneme-length studies showing greatest concordance in left mid-STG
(A and D; n = 14), word-length studies showing greatest concordance in left anterior STG (B and E; n = 16), and phrase-length analyses showing greatest
concordance in left anterior STS (C and F; n = 19). Sample size is given with respect to the number of contrasts from independent experiments contributing to
an analysis.


Although individual studies report foci for left frontal cortex in each of the length-based cohorts, only in the phrase-length analysis do focus densities reach statistical significance.

Second, to increase sample size and enable lexical status-based subanalyses, we included studies that used tonal control stimuli. Under this preparation the same overall pattern of results was observed with one exception: the addition of a pair of clusters in left ventral prefrontal cortex for the word-length analysis (SI Appendix, Fig. S3 and Table S1). Next, we further subdivided word-length studies according to lexical status: real word or pseudoword. A divergent pattern of concordance was observed in left STG (Fig. 2 and SI Appendix, Fig. S4 and Table S1). Peak processing for real-word stimuli robustly localized to anterior STG. For pseudoword stimuli, a bimodal distribution was observed, peaking both in mid- and anterior STG and coextensive with the real-word cluster.

Third, to assess the robustness of the predicted STG stimulus-length processing gradient, length-based analyses were performed on foci from RS studies. For both phoneme- and word-length stimuli, concordant foci were observed to be strictly left-lateralized and exclusively within STG (Table 1). The predicted processing gradient was also observed. Peak concordance for phoneme-length stimuli was seen in mid-STG, whereas peak concordance for word-length stimuli was seen in anterior STG (Fig. 3 and SI Appendix, Fig. S5). For the word-length analysis, a secondary cluster was observed in mid-STG. This may reflect repetition effects concurrently observed for phoneme-level representation or, as the site is somewhat inferior to that of phoneme-length effects, it may be tentative evidence of a secondary processing pathway within the ventral stream (63, 90).

Fourth, to assess colocalization of CS, IR, and AS, we performed length-pooled analyses (Fig. 4, Table 2, and SI Appendix, Fig. S6). Robust CS effects were observed in STG/STS. Again, they were left-biased across peak ALE-statistic value, cluster volume, and cluster concordance. Significant concordance was also found in left frontal cortex. A single result was observed in the IR analysis, localizing to left mid- to anterior STG. This cluster was entirely coextensive with the primary left-STG CS cluster. Finally, analysis of AS foci found concordance in STG/STS. It was also left-biased in peak ALE-statistic value, cluster volume, and cluster concordance. Further, a left-lateralized ventral prefrontal result was observed. The principal left STG/STS cluster was coextensive with the region of overlap between the CS and IR analyses. Within superior temporal cortex, the AS analysis was also generally coextensive with the CS analysis. In left ventral prefrontal cortex, the AS and CS results were not coextensive but were nonetheless similarly localized. Fig. 5 shows exact regions of overlap across length-based and pooled analyses.

Discussion
Meta-analysis of speech processing shows a left-hemisphere optimization for speech and an anterior-directed processing gradient. Two unique findings are presented. First, dissociation is observed for the processing of phonemes, words, and phrases: elemental phonetic processing is most strongly associated with mid-STG; auditory word-form processing is most strongly associated with anterior STG; and phrasal processing is most strongly associated with anterior STS. Second, evidence for CS, IR, and AS colocalizes in mid- to anterior STG. Each finding supports the presence of an anterior-directed ventral-stream pattern-recognition pathway. This is in agreement with Leaver and Rauschecker (2010), who tested colocalization of AS and IR in a single sample using phoneme-length stimuli (5). Recent meta-analyses that considered related themes affirm aspects of the present work. In a study that collapsed across phoneme and pseudoword processing, Turkeltaub and Coslett (2010) localized sublexical processing to mid-STG (91). This is consistent with our more specific localization of elemental phonetic processing. Samson and colleagues (2011), examining preferential tuning for speech over music, report peak concordance in left anterior STG/STS (92), consistent with our more general areal-specialization analysis. Finally, our results support Binder and colleagues’ (2000) anterior-directed, hierarchical account of word recognition (47) and Cohen and colleagues’ (2004) hypothesis of an auditory word-form area in left anterior STG (78).

Classically, auditory word-form recognition was thought to localize to posterior STG/STS (93). This perspective may have been biased by the spatial distribution of middle cerebral artery accidents. The artery’s diameter decreases along the Sylvian fissure, possibly increasing the prevalence of posterior infarcts. Current methods in aphasia research are better controlled and more precise. They implicate mid- and anterior temporal regions in speech comprehension, including anterior STG (94, 95). Although evidence for an anterior STG/STS localization of auditory word-form processing has been present in the functional imaging literature since inception (96–99), perspectives advancing this view have been controversial and the localization is still not uniformly accepted. We find strong agreement among word-




Fig. 2. Foci meeting liberal inclusion criteria for lexically based word-length CS analyses (A and B) and ALE-statistic maps for regions of significant concordance
(C and D) (p < 10⁻³, k > 150 cm³). Similar to the CS analyses in Fig. 1, a leftward bias and an anterior progression in peak effects are shown.
Pseudoword studies show greatest concordance in left mid- to anterior STG (A and C; n = 13). Notably, the distribution of concordance effects is bimodal,
peaking both in mid- (−60, −26, 6) and anterior (−56, −10, 2) STG. Real-word studies show greatest concordance in left anterior STG (B and D; n = 22).


Fig. 3. Foci meeting inclusion criteria for length-based RS analyses (A and B) and ALE-statistic maps for regions of significant concordance (C and D) (p < 10⁻³, k > 150 cm³).
Analyses show left lateralization and an anterior progression in peak effects with phoneme-length studies showing greatest concordance in left
mid-STG (A and C; n = 12) and word-length studies showing greatest concordance in left anterior STG (B and D; n = 16). Too few studies exist for phrase-length
analyses (n = 4).



processing experiments, both within and across paradigms, each supporting relocation of auditory word-form recognition to anterior STG. Through consideration of phoneme- and phrasal-processing experiments, we show the identified anterior-STG word-form recognition site to be situated between sites robustly associated with phoneme and phrase processing. This comports with hierarchical processing and thereby further supports anterior-STG localization for auditory word-form recognition.

It is important to note that some authors define “posterior” STG to be posterior of the anterior-lateral aspect of Heschl’s gyrus or of the central sulcus. These definitions include the region we discuss as “mid-STG,” the area lateral of Heschl’s gyrus. We differentiate mid- from posterior STG on the basis of proximity to primary auditory cortex and the putative course of the ventral stream. As human core auditory fields lie along or about Heschl’s gyrus (13, 57–59, 100), the ventral stream’s course can be inferred to traverse portions of planum temporale. Specifically, the ventral stream is associated with macaque areas RTp and AL (54–56), which lie anterior to and lateral of A1 (13). As human A1 lies on or about the medial aspect of Heschl’s gyrus, with core running along its extent (57, 100), a processing cascade emanating from core areas, progressing both laterally, away from core itself, and anteriorly, away from A1, will necessarily traverse the anterior-lateral portion of planum temporale. Further, this implies mid-STG is the initial STG waypoint of the ventral stream.

Nominal issues aside, support for a posterior localization could be attributed to a constellation of effects pertaining to




Fig. 4. Foci meeting inclusion criteria for length-pooled analyses (A–C) and ALE-statistic maps for regions of significant concordance (D–F) (p < 10⁻³, k > 150
mm³). Analyses show leftward bias in the CS (A and D; n = 49) and AS (C and F; n = 15) analyses and left lateralization in the IR (B and E; n = 11) analysis. Foci are
color coded by stimulus length: phoneme length, red; word length, green; and phrase length, blue.


Table 2. Results for aggregate analyses

                                                              Center of mass      Peak coordinates
Analysis/anatomy   BA      Cluster concordance   Volume, mm³     x     y     z       x     y     z    Peak ALE

CS
 Left STG          42/22   0.82                       11,944   −57   −19    −1     −58   −18     0    0.056
 Right STG         42/22   0.47                        6,624    55   −10    −3      56    −6    −6    0.045
 Left STS          21      0.18                        1,608   −51     8   −14     −50     8   −14    0.039
 Left PreCG        6       0.12                          736   −47    −7    48     −48    −8    48    0.031
 Left IFG          44      0.10                          744   −45    12    21     −46    12    20    0.025
 Left IFG          47      0.08                          240   −42    25   −12     −42    24   −12    0.022
 Left IFG          45      0.04                          200   −50    21    12     −50    22    12    0.020
IR*
 Left STG          22/21   0.45                        1,200   −58   −16    −1     −56   −14    −2    0.020
AS
 Left STG          42/22   0.87                        3,976   −58   −22     2     −58   −24     2    0.031
 Right STG         42/22   0.53                        2,032    51   −23     2      54   −16     0    0.026
 Left IFG          47/45   0.13                          368   −45    17     3     −44    18     2    0.018

*Broader inclusion criteria for the IR analysis (SI Appendix, Table S3) yield equivalent results with the following qualifications: cluster volume 1,008 mm³ and
cluster concordance 0.33.



aspects of speech or phonology that localize to posterior STG/STS (69), for instance: speech production (101–108),
phonological/articulatory working memory (109, 110), reading (111–113) [putatively attributable to
orthography-to-phonology translation (114–116)], and aspects of audiovisual language processing (117–122). Although
these findings relate to aspects of speech and phonology, they do so in terms of multisensory processing and
sensorimotor integration and are not the key paradigms indicated by computational theory for demonstrating the
presence of pattern recognition networks (8–12, 123). Those paradigms (CS and adaptation), systematically
meta-analyzed here, find anterior localization.




Fig. 5. Flat-map presentation of ALE cluster overlap for (A) the CS analyses shown in Fig. 1, (B) the word-length lexical status analyses shown in Fig. 2, (C) the
RS analyses shown in Fig. 3, and (D) the length-pooled analyses shown in Fig. 4. For orientation, prominent landmarks are shown on the left hemisphere of A,
including the circular sulcus (CirS), central sulcus (CS), STG, and STS.


6 of 10 | www.pnas.org/cgi/doi/10.1073/pnas.1113427109                                                                                    DeWitt and Rauschecker
   The segregation of phoneme and word-form processing along STG implies a growing encoding specificity for complex
phonetic forms by higher-order ventral-stream areas. More specifically, it suggests the presence of a hierarchical
network performing phonetic concatenation at a site anatomically distinct from and downstream of the site performing
elemental phonetic recognition. Alternatively, the phonetic-length effect could be attributed to semantic confound:
semantic content increases from phonemes to word forms. In an elegant experiment, Thierry and colleagues (2003)
report evidence against this (82). After controlling for acoustics, they show that left anterior STG responds more
to speech than to semantically matched environmental sounds. Similarly, Belin and colleagues (2000, 2002), after
controlling for acoustics, show that left anterior STG is not merely responding to the vocal quality of phonetic
sounds; rather, it responds preferentially to the phonetic quality of vocal sounds (83, 84).
   Additional comment on the localization and laterality of auditory word and pseudoword processing, as well as on
processing gradients in STG/STS, is provided in SI Appendix, Discussion.
   The auditory ventral stream is proposed to use CS to conjoin lower-order representations and thereby to synthesize
complex representations. As the tuning of higher-order combination-sensitive units is contingent upon sensory
experience (124, 125), phrases and sentences would not generally be processed as Gestalt-like objects. Although we
have analyzed studies involving phrase- and sentence-level processing, their inclusion is for context and because
word-form recognition is a constituent part of sentence processing. In some instances, however, phrases are
processed as objects (126). This status is occasionally recognized in orthography (e.g., “nonetheless”). Such
phrases ought to be recognized by the ventral-stream network. This, however, would be the exception, not the rule.
Hypothetically, the opposite may also occur: a word form’s length might exceed the network’s integrative capacity
(e.g., “antidisestablishmentarianism”). We speculate the network is capable of concatenating sequences of at least
five to eight phonemes: five to six phonemes is the modal length of English word forms and seven- to
eight-phoneme-long word forms comprise nearly one fourth of English words (SI Appendix, Fig. S7 and Discussion).
This estimate is also consistent with the time constant of echoic memory (∼2 s). (Notably, there is a similar issue
concerning the processing of text in the visual system’s ventral stream, where, for longer words, fovea-width
representations must be “temporally” conjoined across microsaccades.) Although some phrases may be recognized in
the word-form recognition network, the majority of STS activation associated with phrase-length stimuli (Fig. 1F)
is likely related to aspects of syntax and semantics. This observation enables us to subdivide the intelligibility
network, broadly defined by Scott and colleagues (2000) (127). The first two stages involve elemental and
concatenative phonetic recognition, extending from mid-STG to anterior STG and, possibly, into subjacent STS.
Higher-order syntactic and semantic processing is conducted throughout STS and continues into prefrontal cortex
(128–133).
   A qualification to the propositions advanced here for word-form recognition is that this account pertains to
perceptually fluent speech recognition (e.g., native language conversational discourse). Both left ventral and
dorsal networks likely mediate nonfluent speech recognition (e.g., when processing neologisms or recently acquired
words in a second language). Whereas ventral networks are implicated in pattern recognition, dorsal networks are
implicated in forward- and inverse-model computation (42, 44), including sensorimotor integration (42, 45, 48,
134). This supports a role for left dorsal networks in mapping auditory representations onto the somatomotor frame
of reference (135–139), yielding articulator-encoded speech. This ventral–dorsal dissociation is illustrated in an
experiment by Buchsbaum and colleagues (2005) (110). Using a verbal working memory task, they demonstrated the time
course of left anterior STG/STS activation to be consistent with strictly auditory encoding: activation was locked
to auditory stimulation and it was not sustained throughout the late phase of item rehearsal. In contrast, they
observed the activation time course in the dorsal stream to be modality independent and to coincide with late-phase
rehearsal (i.e., it was associated with verbal rehearsal independent of input modality, auditory or visual).
Importantly, late-phase rehearsal can be demonstrated behaviorally, by articulatory suppression, to be mediated by
subvocalization (i.e., articulatory rehearsal in the phonological loop) (140).
   There are some notable differences between auditory and visual word recognition. Spoken language was intensely
selected for during evolution (141), whereas reading is a recent cultural innovation (111). The age of acquisition
of phoneme representation is in the first year of life (124), whereas it is typically in the third year for
letters. A similar developmental lag is present with respect to acquisition of the visual lexicon. Differences
aside, word recognition in each modality requires similar processing, including the concatenation of elemental
forms, phonemes or letters, into sublexical forms and word forms. If the analogy between auditory and visual
ventral streams is correct, our results predict a similar anatomical dissociation for elemental and concatenative
representation in the visual ventral stream. This prediction is also made by models of text processing (10).
Although we are aware of no study that has investigated letter and word recognition in a single sample, support for
the dissociation is present in the literature. The visual word-form area, the putative site of visual word-form
recognition (142), is located in the left fusiform gyrus of inferior temporal cortex (IT) (143). Consistent with
expectation, the average site of peak activation to single letters in IT (144–150) is more proximal to V1, by
approximately 13 mm. A similar anatomical dissociation can be seen in paradigms probing IR. Ordinarily, nonhuman
primate IT neurons exhibit a degree of mirror-symmetric invariant tuning (151). Letter recognition, however,
requires nonmirror IR (e.g., to distinguish “b” from “d”). When assessing identity-specific RS (i.e., repetition
effects specific to non–mirror-inverted repetitions), letter and word effects differentially localize: effects for
word stimuli localize to the visual word-form area (152), whereas effects for single-letter stimuli localize to the
lateral occipital complex (153), a site closer to V1. Thus, the anatomical dissociation observed in auditory cortex
for phonemes and words appears to reflect a general hierarchical processing architecture also present in other
sensory cortices.
   In conclusion, our analyses show the human functional imaging literature to support a hierarchical model of
object recognition in auditory cortex, consistent with nonhuman primate electrophysiology. Specifically, our
results support a left-biased, two-stage model of auditory word-form recognition with analysis of phonemes
occurring in mid-STG and word recognition occurring in anterior STG. A third stage extends the model to
phrase-level processing in STS. Mechanistically, left mid- to anterior STG exhibits core qualities of a pattern
recognition network, including CS, IR, and AS.

Methods
To identify prospective studies for inclusion, a systematic search of the PubMed database was performed for
variations of the query, “(phonetics OR ‘speech sounds’ OR phoneme OR ‘auditory word’) AND (MRI OR fMRI OR PET).”
This yielded more than 550 records (as of February 2011). These studies were screened for compliance with formal
inclusion criteria: (i) the publication of stereotaxic coordinates for group-wise fMRI or PET results in a
peer-reviewed journal and (ii) report of a contrast of interest (as detailed later). Exclusion criteria were the
use of pediatric or clinical samples. Inclusion/exclusion criteria admitted 115 studies. For studies reporting
multiple suitable contrasts per sample, to avoid sampling bias, a single contrast was selected. For CS analyses,
contrasts of interest compared activation to speech stimuli (i.e., phonemes/syllables, words/pseudowords, and
phrases/sentences/
pseudoword sentences) with activation to matched, nonnaturalistic acoustic control stimuli (i.e., various tonal,
noise, and complex artificial nonspeech stimuli). A total of 84 eligible contrasts were identified, representing
1,211 subjects and 541 foci. For RS analyses, contrasts compared activation to repeated and nonrepeated speech
stimuli. A total of 31 eligible contrasts were identified, representing 471 subjects and 145 foci. For IR analyses,
a subset of the RS cohort was selected that used designs in which “repeated” stimuli also varied acoustically but
not phonetically (e.g., two different utterances of the same word). The RS cohort was used for phonetic length-based
analyses as the more restrictive criteria for IR yielded insufficient sample sizes (as detailed later). For AS
analyses, contrasts compared activation to speech stimuli and to other naturalistic stimuli (e.g., animal calls,
music, tool sounds). A total of 17 eligible contrasts were identified, representing 239 subjects and 100 foci. All
retained contrasts were binned for phonetic length-based analyses according to the estimated mean number of
phonemes in their stimuli: (i) “phoneme length,” one or two phonemes, (ii) “word length,” three to 10 phonemes, and
(iii) “phrase length,” more than 10 phonemes. SI Appendix, Tables S2–S4, identify the contrasts included in each
analysis.
   The minimum sample size for meta-analyses was 10 independent contrasts. Foci reported in Montreal Neurological
Institute coordinates were transformed into Talairach coordinates according to the ICBM2TAL transformation (154).
Foci concordance was assessed by the method of ALE (81) in a random-effects implementation (155) that controls for
within-experiment effects (156). Under ALE, foci are treated as Gaussian probability distributions, which reflect
localization uncertainty. Pooled Gaussian focus maps were tested against a null distribution reflecting a random
spatial association between different experiments. Correction for multiple comparisons was obtained through
estimation of false discovery rate (157). Two significance criteria were used: minimum p value was set at 10⁻³ and
minimum cluster extent was set at 150 mm³. Analyses were conducted in GINGERALE (Research Imaging Institute), AFNI
(National Institute of Mental Health), and MATLAB (Mathworks). For visualization, CARET (Washington University in
St. Louis) was used to project foci and ALE clusters from volumetric space onto the cortical surface of the
Population-Average, Landmark- and Surface-based atlas (158). Readers should note that this procedure can introduce
slight localization artifacts (e.g., projection may distribute one volumetric cluster discontinuously over two
adjacent gyri).

ACKNOWLEDGMENTS. We thank Max Riesenhuber, Marc Ettlinger, and two anonymous reviewers for comments helpful to the
development of this manuscript. This work was supported by National Science Foundation Grants BCS-0519127 and
OISE-0730255 (to J.P.R.) and National Institute on Deafness and Other Communication Disorders Grant 1RC1DC010720
(to J.P.R.).
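
The phonetic-length binning rule stated in Methods (one or two phonemes, three to 10 phonemes, more than 10 phonemes) can be sketched as a small function. This is an illustrative sketch only: the function name, the contrast labels, and the phoneme counts are hypothetical, and the handling of fractional means (compared against upper bounds of 2 and 10) is an assumption that the text does not specify.

```python
from collections import Counter

# Upper bounds taken from the Methods text; everything else is illustrative.
PHONEME_MAX = 2   # "phoneme length": one or two phonemes
WORD_MAX = 10     # "word length": three to 10 phonemes


def length_bin(mean_phonemes: float) -> str:
    """Assign a contrast to a phonetic-length bin from its estimated mean phoneme count."""
    if mean_phonemes < 1:
        raise ValueError("a stimulus must contain at least one phoneme")
    if mean_phonemes <= PHONEME_MAX:
        return "phoneme length"
    if mean_phonemes <= WORD_MAX:   # assumption: fractional means above 2 fall here
        return "word length"
    return "phrase length"


# Hypothetical contrasts with made-up mean phoneme counts (not data from the study).
contrasts = {"syllable_vs_tone": 2, "word_vs_noise": 5, "sentence_vs_rotated": 24}
bins = Counter(length_bin(n) for n in contrasts.values())
```

Keeping the two bounds as named constants makes it easy to check how sensitive a length-based analysis would be to where the bin edges are drawn.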

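The statement in Methods that, under ALE, foci are treated as Gaussian probability distributions can be illustrated at a single voxel. The sketch below is not the GINGERALE implementation. It assumes a fixed kernel width (`fwhm_mm`), a fixed voxel volume, a per-experiment map formed as the maximum over that experiment's foci (one common way of limiting within-experiment effects), and an ALE value formed as the probabilistic union across experiments. All names and coordinates are hypothetical.

```python
import math


def focus_probability(voxel, focus, fwhm_mm=10.0, voxel_volume_mm3=8.0):
    """Probability that a focus lies in this voxel, under an isotropic 3-D Gaussian."""
    sigma = fwhm_mm / (2.0 * math.sqrt(2.0 * math.log(2.0)))  # FWHM -> standard deviation
    d2 = sum((v - f) ** 2 for v, f in zip(voxel, focus))       # squared distance in mm
    density = math.exp(-d2 / (2.0 * sigma ** 2)) / ((2.0 * math.pi) ** 1.5 * sigma ** 3)
    return min(1.0, density * voxel_volume_mm3)


def modeled_activation(voxel, foci, **kernel):
    """Per-experiment value: maximum over the experiment's foci (assumed convention)."""
    return max(focus_probability(voxel, f, **kernel) for f in foci)


def ale(voxel, experiments, **kernel):
    """ALE at a voxel: union of per-experiment probabilities, 1 - prod(1 - MA_i)."""
    prod = 1.0
    for foci in experiments:
        prod *= 1.0 - modeled_activation(voxel, foci, **kernel)
    return 1.0 - prod


# Two hypothetical experiments with made-up foci (mm); not data from the study.
experiments = [
    [(-58.0, -18.0, 0.0)],
    [(-56.0, -14.0, -2.0), (54.0, -16.0, 0.0)],
]
near = ale((-58.0, -18.0, 0.0), experiments)  # voxel close to foci from both experiments
far = ale((0.0, 60.0, 40.0), experiments)     # voxel far from every focus
```

In practice the kernel width is usually tied to each experiment's sample size, and significance is judged against a null distribution of spatially random foci; neither step is shown here.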

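The two significance criteria named in Methods (a false-discovery-rate correction and a minimum cluster extent of 150 mm³) can be sketched as follows. The Benjamini-Hochberg step-up procedure is used here as one standard FDR method; whether it matches the cited implementation, and how the FDR estimate was combined with the 10⁻³ threshold, is not stated in the text, so treat both as assumptions. Function names, p values, and voxel sizes are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the test is declared significant at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices from smallest p to largest
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / m:                  # step-up criterion
            cutoff_rank = rank                           # keep the largest rank that passes
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[i] = True
    return rejected


def meets_extent(n_voxels, voxel_volume_mm3, min_extent_mm3=150.0):
    """Cluster-extent criterion: total cluster volume must exceed the minimum extent."""
    return n_voxels * voxel_volume_mm3 > min_extent_mm3


# Made-up voxel-wise p values, for illustration only.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
significant = benjamini_hochberg(p_values, q=0.05)
```

With 2 × 2 × 2 mm voxels (8 mm³ each), a cluster would need at least 19 voxels to exceed 150 mm³ under this reading of the criterion.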
 1. Steinschneider M (2011) Unlocking the role of the superior temporal gyrus for speech sound categorization. J Neurophysiol 105:2631–2633.
 2. Brugge JF, Merzenich MM (1973) Responses of neurons in auditory cortex of the macaque monkey to monaural and binaural stimulation. J Neurophysiol 36:1138–1158.
 3. Bitterman Y, Mukamel R, Malach R, Fried I, Nelken I (2008) Ultra-fine frequency tuning revealed in single neurons of human auditory cortex. Nature 451:197–201.
 4. Griffiths TD, Warren JD (2004) What is an auditory object? Nat Rev Neurosci 5:887–892.
 5. Leaver AM, Rauschecker JP (2010) Cortical representation of natural complex sounds: effects of acoustic features and auditory object category. J Neurosci 30:7604–7612.
 6. Luce P, McLennan C (2005) Spoken word recognition: The challenge of variation. Handbook of Speech Perception, eds Pisoni D, Remez R (Blackwell, Malden, MA), pp 591–609.
 7. Hubel DH, Wiesel TN (1962) Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. J Physiol 160:106–154.
 8. Riesenhuber M, Poggio TA (2002) Neural mechanisms of object recognition. Curr Opin Neurobiol 12:162–168.
 9. Husain FT, Tagamets M-A, Fromm SJ, Braun AR, Horwitz B (2004) Relating neuronal dynamics for auditory object processing to neuroimaging activity: A computational modeling and an fMRI study. Neuroimage 21:1701–1720.
10. Dehaene S, Cohen L, Sigman M, Vinckier F (2005) The neural code for written words: a proposal. Trends Cogn Sci 9:335–341.
11. Hoffman KL, Logothetis NK (2009) Cortical mechanisms of sensory learning and object recognition. Philos Trans R Soc Lond B Biol Sci 364:321–329.
12. Larson E, Billimoria CP, Sen K (2009) A biologically plausible computational model for auditory object recognition. J Neurophysiol 101:323–331.
13. Hackett TA (2011) Information flow in the auditory cortical network. Hear Res 271:133–146.
14. Suga N, O’Neill WE, Manabe T (1978) Cortical neurons sensitive to combinations of information-bearing elements of biosonar signals in the mustache bat. Science 200:778–781.
15. Margoliash D, Fortune ES (1992) Temporal and harmonic combination-sensitive neurons in the zebra finch’s HVc. J Neurosci 12:4309–4326.
16. Rauschecker JP, Tian B, Hauser M (1995) Processing of complex sounds in the macaque nonprimary auditory cortex. Science 268:111–114.
17. Rauschecker JP (1997) Processing of complex sounds in the auditory cortex of cat, monkey, and man. Acta Otolaryngol Suppl 532:34–38.
18. Rauschecker JP (1998) Parallel processing in the auditory cortex of primates. Audiol Neurootol 3:86–103.
19. Sadagopan S, Wang X (2009) Nonlinear spectrotemporal interactions underlying selectivity for complex sounds in auditory cortex. J Neurosci 29:11192–11202.
20. Medvedev AV, Chiao F, Kanwal JS (2002) Modeling complex tone perception: grouping harmonics with combination-sensitive neurons. Biol Cybern 86:497–505.
21. Willmore BDB, King AJ (2009) Auditory cortex: representation through sparsification? Curr Biol 19:1123–1125.
22. Voytenko SV, Galazyuk AV (2007) Intracellular recording reveals temporal integration in inferior colliculus neurons of awake bats. J Neurophysiol 97:1368–1378.
23. Peterson DC, Voytenko S, Gans D, Galazyuk A, Wenstrup J (2008) Intracellular recordings from combination-sensitive neurons in the inferior colliculus. J Neurophysiol 100:629–645.
24. Ye CQ, Poo MM, Dan Y, Zhang XH (2010) Synaptic mechanisms of direction selectivity in primary auditory cortex. J Neurosci 30:1861–1868.
25. Rao RP, Sejnowski TJ (2000) Predictive sequence learning in recurrent neocortical circuits. Advances in Neural Information Processing Systems, eds Solla SA, Leen TK, Muller KR (MIT Press, Cambridge), Vol 12.
26. Carr CE, Konishi M (1988) Axonal delay lines for time measurement in the owl’s brainstem. Proc Natl Acad Sci USA 85:8311–8315.
27. Tian B, Rauschecker JP (2004) Processing of frequency-modulated sounds in the lateral auditory belt cortex of the rhesus monkey. J Neurophysiol 92:2993–3013.
28. Fukushima K (1980) Neocognitron: A self organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol Cybern 36:193–202.
29. Riesenhuber M, Poggio TA (1999) Hierarchical models of object recognition in cortex. Nat Neurosci 2:1019–1025.
30. Kouh M, Poggio TA (2008) A canonical neural circuit for cortical nonlinear operations. Neural Comput 20:1427–1451.
31. Lampl I, Ferster D, Poggio T, Riesenhuber M (2004) Intracellular measurements of spatial integration and the MAX operation in complex cells of the cat primary visual cortex. J Neurophysiol 92:2704–2713.
32. Finn IM, Ferster D (2007) Computational diversity in complex cells of cat primary visual cortex. J Neurosci 27:9638–9648.
33. Bendor D, Wang X (2007) Differential neural coding of acoustic flutter within primate auditory cortex. Nat Neurosci 10:763–771.
34. Bendor D, Wang X (2008) Neural response properties of primary, rostral, and rostrotemporal core fields in the auditory cortex of marmoset monkeys. J Neurophysiol 100:888–906.
35. Atencio CA, Sharpee TO, Schreiner CE (2008) Cooperative nonlinearities in auditory cortical neurons. Neuron 58:956–966.
36. Roe AW, Pallas SL, Kwon YH, Sur M (1992) Visual projections routed to the auditory pathway in ferrets: receptive fields of visual neurons in primary auditory cortex. J Neurosci 12:3651–3664.
37. Atencio CA, Sharpee TO, Schreiner CE (2009) Hierarchical computation in the canonical auditory cortical circuit. Proc Natl Acad Sci USA 106:21894–21899.
38. Ahmed B, Garcia-Lazaro JA, Schnupp JWH (2006) Response linearity in primary auditory cortex of the ferret. J Physiol 572:763–773.
39. Rauschecker JP, Tian B (2000) Mechanisms and streams for processing of “what” and “where” in auditory cortex. Proc Natl Acad Sci USA 97:11800–11806.
40. Romanski LM, et al. (1999) Dual streams of auditory afferents target multiple domains in the primate prefrontal cortex. Nat Neurosci 2:1131–1136.
41. Kaas JH, Hackett TA (1999) ‘What’ and ‘where’ processing in auditory cortex. Nat Neurosci 2:1045–1047.
42. Rauschecker JP, Scott SK (2009) Maps and streams in the auditory cortex: nonhuman primates illuminate human speech processing. Nat Neurosci 12:718–724.
43. Romanski LM, Averbeck BB (2009) The primate cortical auditory system and neural representation of conspecific vocalizations. Annu Rev Neurosci 32:315–346.
44. Rauschecker JP (2011) An expanded role for the dorsal auditory pathway in sensorimotor control and integration. Hear Res 271:16–25.
45. Hickok G, Poeppel D (2007) The cortical organization of speech processing. Nat Rev Neurosci 8:393–402.
46. Scott SK, Wise RJS (2004) The functional neuroanatomy of prelexical processing in speech perception. Cognition 92:13–45.
47. Binder JR, et al. (2000) Human temporal lobe activation by speech and nonspeech sounds. Cereb Cortex 10:512–528.
48. Wise RJ, et al. (2001) Separate neural subsystems within ‘Wernicke’s area’. Brain 124:83–95.
49. Patterson RD, Johnsrude IS (2008) Functional imaging of the auditory processing applied to speech sounds. Philos Trans R Soc Lond B Biol Sci 363:1023–1035.
50. Weiller C, Bormann T, Saur D, Musso M, Rijntjes M (2011) How the ventral pathway got lost: and what its recovery might mean. Brain Lang 118:29–39.
51. Whalen DH, et al. (2006) Differentiation of speech and nonspeech processing within primary auditory cortex. J Acoust Soc Am 119:575–581.
52. Nelken I (2008) Processing of complex sounds in the auditory system. Curr Opin Neurobiol 18:413–417.



53. Recanzone GH, Cohen YE (2010) Serial and parallel processing in the primate auditory cortex revisited. Behav Brain Res 206:1–7.
54. Tian B, Reser D, Durham A, Kustov A, Rauschecker JP (2001) Functional specialization in rhesus monkey auditory cortex. Science 292:290–293.
55. Kikuchi Y, Horwitz B, Mishkin M (2010) Hierarchical auditory processing directed rostrally along the monkey’s supratemporal plane. J Neurosci 30:13021–13030.
56. Tsunada J, Lee JH, Cohen YE (2011) Representation of speech categories in the primate auditory cortex. J Neurophysiol 105:2634–2646.
57. Galaburda AM, Sanides F (1980) Cytoarchitectonic organization of the human auditory cortex. J Comp Neurol 190:597–610.
58. Chevillet M, Riesenhuber M, Rauschecker JP (2011) Functional correlates of the anterolateral processing hierarchy in human auditory cortex. J Neurosci 31:9345–9352.
59. Glasser MF, Van Essen DC (2011) Mapping human cortical areas in vivo based on myelin content as revealed by T1- and T2-weighted MRI. J Neurosci 31:11597–11616.
60. Poremba A, et al. (2004) Species-specific calls evoke asymmetric activity in the monkey’s temporal poles. Nature 427:448–451.
61. Chang EF, et al. (2010) Categorical speech representation in human superior temporal gyrus. Nat Neurosci 13:1428–1432.
62. Chang EF, et al. (2011) Cortical spatio-temporal dynamics underlying phonological target detection in humans. J Cogn Neurosci 23:1437–1446.
63. Steinschneider M, et al. (2011) Intracranial study of speech-elicited activity on the human posterolateral superior temporal gyrus. Cereb Cortex 21:2332–2347.
64. Edwards E, et al. (2009) Comparison of time-frequency responses and the event-related potential to auditory speech stimuli in human cortex. J Neurophysiol 102:377–386.
65. Miller EK, Li L, Desimone R (1991) A neural mechanism for working and recognition memory in inferior temporal cortex. Science 254:1377–1379.
66. Grill-Spector K, Malach R (2001) fMR-adaptation: A tool for studying the functional properties of human cortical neurons. Acta Psychol (Amst) 107:293–321.
67. Joanisse MF, Zevin JD, McCandliss BD (2007) Brain mechanisms implicated in the preattentive categorization of speech sounds revealed using FMRI and a short-interval habituation trial paradigm. Cereb Cortex 17:2084–2093.
92. Samson F, Zeffiro TA, Toussaint A, Belin P (2011) Stimulus complexity and categorical effects in human auditory cortex: An activation likelihood estimation meta-analysis. Front Psychol 1:241.
93. Geschwind N (1970) The organization of language and the brain. Science 170:940–944.
94. Bates E, et al. (2003) Voxel-based lesion-symptom mapping. Nat Neurosci 6:448–450.
95. Dronkers NF, Wilkins DP, Van Valin RD, Jr., Redfern BB, Jaeger JJ (2004) Lesion analysis of the brain areas involved in language comprehension. Cognition 92:145–177.
96. Mazziotta JC, Phelps ME, Carson RE, Kuhl DE (1982) Tomographic mapping of human cerebral metabolism: Auditory stimulation. Neurology 32:921–937.
97. Petersen SE, Fox PT, Posner MI, Mintun M, Raichle ME (1988) Positron emission tomographic studies of the cortical anatomy of single-word processing. Nature 331:585–589.
98. Wise RJS, et al. (1991) Distribution of cortical neural networks involved in word comprehension and word retrieval. Brain 114:1803–1817.
99. Démonet JF, et al. (1992) The anatomy of phonological and semantic processing in normal subjects. Brain 115:1753–1768.
100. Rademacher J, et al. (2001) Probabilistic mapping and volume measurement of human primary auditory cortex. Neuroimage 13:669–683.
101. Hamberger MJ, Seidel WT, Goodman RR, Perrine K, McKhann GM (2003) Temporal lobe stimulation reveals anatomic distinction between auditory naming processes. Neurology 60:1478–1483.
102. Hashimoto Y, Sakai KL (2003) Brain activations during conscious self-monitoring of speech production with delayed auditory feedback: An fMRI study. Hum Brain Mapp 20:22–28.
103. Warren JE, Wise RJS, Warren JD (2005) Sounds do-able: Auditory-motor transformations and the posterior temporal plane. Trends Neurosci 28:636–643.
104. Guenther FH (2006) Cortical interactions underlying the production of speech sounds. J Commun Disord 39:350–365.
105. Tourville JA, Reilly KJ, Guenther FH (2008) Neural mechanisms underlying auditory feedback control of speech. Neuroimage 39:1429–1443.
106. Towle VL, et al. (2008) ECoG gamma activity during a language task: Differentiating expressive and receptive speech areas. Brain 131:2013–2027.
68. Scott BH, Malone BJ, Semple MN (2011) Transformation of temporal processing
                                                                                              107. Takaso H, Eisner F, Wise RJS, Scott SK (2010) The effect of delayed auditory feedback
    across auditory cortex of awake macaques. J Neurophysiol 105:712–730.
                                                                                                   on activity in the temporal lobe while speaking: A positron emission tomography
69. Kusmierek P, Rauschecker JP (2009) Functional specialization of medial auditory belt
                                                                                                   study. J Speech Lang Hear Res 53:226–236.
    cortex in the alert rhesus monkey. J Neurophysiol 102:1606–1622.
                                                                                              108. Zheng ZZ, Munhall KG, Johnsrude IS (2010) Functional overlap between regions
70. Creutzfeldt O, Ojemann G, Lettich E (1989) Neuronal activity in the human lateral
                                                                                                   involved in speech perception and in monitoring one’s own voice during speech
    temporal lobe. I. Responses to speech. Exp Brain Res 77:451–475.
                                                                                                   production. J Cogn Neurosci 22:1770–1781.
71. Pei X, et al. (2011) Spatiotemporal dynamics of electrocorticographic high gamma
                                                                                              109. Buchsbaum BR, Padmanabhan A, Berman KF (2011) The neural substrates of
    activity during overt and covert word repetition. Neuroimage 54:2960–2972.
                                                                                                   recognition memory for verbal information: Spanning the divide between short-
72. Marinkovic K, et al. (2003) Spatiotemporal dynamics of modality-specific and
                                                                                                   and long-term memory. J Cogn Neurosci 23:978–991.
    supramodal word processing. Neuron 38:487–497.
                                                                                              110. Buchsbaum BR, Olsen RK, Koch P, Berman KF (2005) Human dorsal and ventral
73. Binder JR, Frost JA, Hammeke TA, Rao SM, Cox RW (1996) Function of the left
                                                                                                   auditory streams subserve rehearsal-based and echoic processes during verbal
    planum temporale in auditory and linguistic processing. Brain 119:1239–1247.
                                                                                                   working memory. Neuron 48:687–697.
74. Binder JR, et al. (1997) Human brain language areas identified by functional
                                                                                              111. Vinckier F, et al. (2007) Hierarchical coding of letter strings in the ventral stream:
    magnetic resonance imaging. J Neurosci 17:353–362.
                                                                                                   dissecting the inner organization of the visual word-form system. Neuron 55:
75. Dehaene-Lambertz G, et al. (2006) Functional segregation of cortical language areas
                                                                                                   143–156.
    by sentence repetition. Hum Brain Mapp 27:360–371.
                                                                                              112. Dehaene S, et al. (2010) How learning to read changes the cortical networks for
76. Sammler D, et al. (2010) The relationship of lyrics and tunes in the processing of
                                                                                                   vision and language. Science 330:1359–1364.
    unfamiliar songs: A functional magnetic resonance adaptation study. J Neurosci 30:
                                                                                              113. Pallier C, Devauchelle A-D, Dehaene S (2011) Cortical representation of the
    3572–3578.                                                                                     constituent structure of sentences. Proc Natl Acad Sci USA 108:2522–2527.
77. Hara NF, Nakamura K, Kuroki C, Takayama Y, Ogawa S (2007) Functional                      114. Graves WW, Desai R, Humphries C, Seidenberg MS, Binder JR (2010) Neural systems
    neuroanatomy of speech processing within the temporal cortex. Neuroreport 18:                  for reading aloud: A multiparametric approach. Cereb Cortex 20:1799–1815.
    1603–1607.                                                                                115. Jobard G, Crivello F, Tzourio-Mazoyer N (2003) Evaluation of the dual route theory
78. Cohen L, Jobert A, Le Bihan D, Dehaene S (2004) Distinct unimodal and multimodal               of reading: A metanalysis of 35 neuroimaging studies. Neuroimage 20:693–712.
    regions for word processing in the left temporal cortex. Neuroimage 23:1256–1270.         116. Turkeltaub PE, Gareau L, Flowers DL, Zeffiro TA, Eden GF (2003) Development of
79. Buchsbaum BR, D’Esposito M (2009) Repetition suppression and reactivation in                   neural mechanisms for reading. Nat Neurosci 6:767–773.
    auditory-verbal short-term recognition memory. Cereb Cortex 19:1474–1485.                 117. Hamberger MJ, Goodman RR, Perrine K, Tamny T (2001) Anatomic dissociation of
80. Matsumoto R, et al. (2011) Left anterior temporal cortex actively engages in speech            auditory and visual naming in the lateral temporal cortex. Neurology 56:56–61.
    perception: A direct cortical stimulation study. Neuropsychologia 49:1350–1354.           118. Hamberger MJ, McClelland S, III, McKhann GM, II, Williams AC, Goodman RR (2007)
81. Turkeltaub PE, Eden GF, Jones KM, Zeffiro TA (2002) Meta-analysis of the functional             Distribution of auditory and visual naming sites in nonlesional temporal lobe
    neuroanatomy of single-word reading: method and validation. Neuroimage 16:                     epilepsy patients and patients with space-occupying temporal lobe lesions. Epilepsia
    765–780.                                                                                       48:531–538.
82. Thierry G, Giraud AL, Price CJ (2003) Hemispheric dissociation in access to the human     119. Blau V, van Atteveldt N, Formisano E, Goebel R, Blomert L (2008) Task-irrelevant
    semantic system. Neuron 38:499–506.                                                            visual letters interact with the processing of speech sounds in heteromodal and
83. Belin P, Zatorre RJ, Lafaille P, Ahad P, Pike B (2000) Voice-selective areas in human          unimodal cortex. Eur J Neurosci 28:500–509.
    auditory cortex. Nature 403:309–312.                                                      120. van Atteveldt NM, Blau VC, Blomert L, Goebel R (2010) fMR-adaptation indicates
84. Belin P, Zatorre RJ, Ahad P (2002) Human temporal-lobe response to vocal sounds.               selectivity to audiovisual content congruency in distributed clusters in human
    Brain Res Cogn Brain Res 13:17–26.                                                             superior temporal cortex. BMC Neurosci 11:11.
85. Petkov CI, et al. (2008) A voice region in the monkey brain. Nat Neurosci 11:367–374.     121. Beauchamp MS, Nath AR, Pasalar S (2010) fMRI-Guided transcranial magnetic
86. Desimone R, Albright TD, Gross CG, Bruce C (1984) Stimulus-selective properties of             stimulation reveals that the superior temporal sulcus is a cortical locus of the McGurk
    inferior temporal neurons in the macaque. J Neurosci 4:2051–2062.                              effect. J Neurosci 30:2414–2417.
87. Gaillard R, et al. (2006) Direct intracranial, FMRI, and lesion evidence for the causal   122. Nath AR, Beauchamp MS (2011) Dynamic changes in superior temporal sulcus
    role of left inferotemporal cortex in reading. Neuron 50:191–204.                              connectivity during perception of noisy audiovisual speech. J Neurosci 31:1704–1714.
88. Tsao DY, Freiwald WA, Tootell RBH, Livingstone MS (2006) A cortical region                123. Ison MJ, Quiroga RQ (2008) Selectivity and invariance for visual object perception.
    consisting entirely of face-selective cells. Science 311:670–674.                              Front Biosci 13:4889–4903.
89. Kanwisher N, Yovel G (2006) The fusiform face area: A cortical region specialized for     124. Kuhl PK (2004) Early language acquisition: Cracking the speech code. Nat Rev
    the perception of faces. Philos Trans R Soc Lond B Biol Sci 361:2109–2128.                     Neurosci 5:831–843.
90. Edwards E, et al. (2010) Spatiotemporal imaging of cortical activation during verb        125. Glezer LS, Jiang X, Riesenhuber M (2009) Evidence for highly selective neuronal
    generation and picture naming. Neuroimage 50:291–301.                                          tuning to whole words in the “visual word form area”. Neuron 62:199–204.
91. Turkeltaub PE, Coslett HB (2010) Localization of sublexical speech perception             126. Cappelle B, Shtyrov Y, Pulvermüller F (2010) Heating up or cooling up the brain?
    components. Brain Lang 114:1–15.                                                               MEG evidence that phrasal verbs are lexical units. Brain Lang 115:189–201.



127. Scott SK, Blank CC, Rosen S, Wise RJS (2000) Identification of a pathway for intelligible speech in the left temporal lobe. Brain 123:2400–2406.
128. Binder JR, Desai RH, Graves WW, Conant LL (2009) Where is the semantic system? A critical review and meta-analysis of 120 functional neuroimaging studies. Cereb Cortex 19:2767–2796.
129. Rogalsky C, Hickok G (2011) The role of Broca’s area in sentence comprehension. J Cogn Neurosci 23:1664–1680.
130. Obleser J, Meyer L, Friederici AD (2011) Dynamic assignment of neural resources in auditory comprehension of complex sentences. Neuroimage 56:2310–2320.
131. Humphries C, Binder JR, Medler DA, Liebenthal E (2006) Syntactic and semantic modulation of neural activity during auditory sentence comprehension. J Cogn Neurosci 18:665–679.
132. Tyler LK, Marslen-Wilson W (2008) Fronto-temporal brain systems supporting spoken language comprehension. Philos Trans R Soc Lond B Biol Sci 363:1037–1054.
133. Friederici AD, Kotz SA, Scott SK, Obleser J (2010) Disentangling syntax and intelligibility in auditory language comprehension. Hum Brain Mapp 31:448–457.
134. Guenther FH (1994) A neural network model of speech acquisition and motor equivalent speech production. Biol Cybern 72:43–53.
135. Cohen YE, Andersen RA (2002) A common reference frame for movement plans in the posterior parietal cortex. Nat Rev Neurosci 3:553–562.
136. Hackett TA, et al. (2007) Sources of somatosensory input to the caudal belt areas of auditory cortex. Perception 36:1419–1430.
137. Smiley JF, et al. (2007) Multisensory convergence in auditory cortex, I. Cortical connections of the caudal superior temporal plane in macaque monkeys. J Comp Neurol 502:894–923.
138. Hackett TA, et al. (2007) Multisensory convergence in auditory cortex, II. Thalamocortical connections of the caudal superior temporal plane. J Comp Neurol 502:924–952.
139. Dhanjal NS, Handunnetthi L, Patel MC, Wise RJS (2008) Perceptual systems controlling speech production. J Neurosci 28:9969–9975.
140. Baddeley A (2003) Working memory: Looking back and looking forward. Nat Rev Neurosci 4:829–839.
141. Fitch WT (2000) The evolution of speech: A comparative review. Trends Cogn Sci 4:258–267.
142. McCandliss BD, Cohen L, Dehaene S (2003) The visual word form area: Expertise for reading in the fusiform gyrus. Trends Cogn Sci 7:293–299.
143. Wandell BA, Rauschecker AM, Yeatman JD (2012) Learning to see words. Ann Rev Psychol 63:31–53.
144. Turkeltaub PE, Flowers DL, Lyon LG, Eden GF (2008) Development of ventral stream representations for single letters. Ann N Y Acad Sci 1145:13–29.
145. Joseph JE, Cerullo MA, Farley AB, Steinmetz NA, Mier CR (2006) fMRI correlates of cortical specialization and generalization for letter processing. Neuroimage 32:806–820.
146. Pernet C, Celsis P, Démonet J-F (2005) Selective response to letter categorization within the left fusiform gyrus. Neuroimage 28:738–744.
147. Callan AM, Callan DE, Masaki S (2005) When meaningless symbols become letters: Neural activity change in learning new phonograms. Neuroimage 28:553–562.
148. Longcamp M, Anton J-L, Roth M, Velay J-L (2005) Premotor activations in response to visually presented single letters depend on the hand used to write: A study on left-handers. Neuropsychologia 43:1801–1809.
149. Flowers DL, et al. (2004) Attention to single letters activates left extrastriate cortex. Neuroimage 21:829–839.
150. Longcamp M, Anton J-L, Roth M, Velay J-L (2003) Visual presentation of single letters activates a premotor area involved in writing. Neuroimage 19:1492–1500.
151. Logothetis NK, Pauls J (1995) Psychophysical and physiological evidence for viewer-centered object representations in the primate. Cereb Cortex 5:270–288.
152. Dehaene S, et al. (2010) Why do children make mirror errors in reading? Neural correlates of mirror invariance in the visual word form area. Neuroimage 49:1837–1848.
153. Pegado F, Nakamura K, Cohen L, Dehaene S (2011) Breaking the symmetry: Mirror discrimination for single letters but not for pictures in the visual word form area. Neuroimage 55:742–749.
154. Lancaster JL, et al. (2007) Bias between MNI and Talairach coordinates analyzed using the ICBM-152 brain template. Hum Brain Mapp 28:1194–1205.
155. Eickhoff SB, et al. (2009) Coordinate-based activation likelihood estimation meta-analysis of neuroimaging data: a random-effects approach based on empirical estimates of spatial uncertainty. Hum Brain Mapp 30:2907–2926.
156. Turkeltaub PE, et al. (2012) Minimizing within-experiment and within-group effects in activation likelihood estimation meta-analyses. Hum Brain Mapp 33:1–13.
157. Genovese CR, Lazar NA, Nichols T (2002) Thresholding of statistical maps in functional neuroimaging using the false discovery rate. Neuroimage 15:870–878.
158. Van Essen DC (2005) A Population-Average, Landmark- and Surface-based (PALS) atlas of human cerebral cortex. Neuroimage 28:635–662.





El leon no es como lo pintan

  • 1.
    PNAS PLUS Phoneme andword recognition in the auditory ventral stream Iain DeWitt1 and Josef P. Rauschecker1 Laboratory of Integrative Neuroscience and Cognition, Department of Neuroscience, Georgetown University Medical Center, Washington, DC 20007 Edited by Mortimer Mishkin, National Institute for Mental Health, Bethesda, MD, and approved December 19, 2011 (received for review August 17, 2011) Spoken word recognition requires complex, invariant representa- gate–like operation, conjoining structurally simple representa- tions. Using a meta-analytic approach incorporating more than 100 tions in lower-order units into the increasingly complex repre- functional imaging experiments, we show that preference for sentations (i.e., multiple excitatory and inhibitory zones) of complex sounds emerges in the human auditory ventral stream in higher-order units. In the case of speech sounds, these neurons a hierarchical fashion, consistent with nonhuman primate electro- conjoin representations for adjacent speech formants or, at physiology. Examining speech sounds, we show that activation higher levels, adjacent phonemes. Although the mechanism by associated with the processing of short-timescale patterns (i.e., which combination sensitivity (CS) is directionally selective in phonemes) is consistently localized to left mid-superior temporal the temporal domain is not fully understood, some propositions gyrus (STG), whereas activation associated with the integration of exist (22–26). As an empirical matter, direction selectivity is phonemes into temporally complex patterns (i.e., words) is con- clearly present early in auditory cortex (19, 27). It is also ob- sistently localized to left anterior STG. 
Further, we show left mid- served to operate at time scales (50–250 ms) sufficient for pho- to anterior STG is reliably implicated in the invariant representation neme concatenation, as long as 250 ms in the zebra finch (15) of phonetic forms and that this area also responds preferentially to and 100 to 150 ms in macaque lateral belt (18). Logical- NEUROSCIENCE phonetic sounds, above artificial control sounds or environmental OR gate–like computation, technically proposed to be a soft sounds. Together, this shows increasing encoding specificity and maximum operation (28–30), is posited to be performed by invariance along the auditory ventral stream for temporally spectrotemporal-pooling units. These units respond to supra- complex speech sounds. threshold stimulation from any member of their connected lower-order pool, thus creating a superposition of the connected functional MRI | meta-analysis | auditory cortex | object recognition | lower-order representations and abstracting them. With respect language to speech, this might involve the pooling of numerous, rigidly tuned representations of different exemplars of a given phoneme S poken word recognition presents several challenges to the brain. Two key challenges are the assembly of complex au- ditory representations and the variability of natural speech (SI into an abstracted representation of the entire pool. Spatial pooling is well documented in visual cortex (7, 31, 32) and there is some evidence for its analog, spectrotemporal pooling, in Appendix, Fig. S1) (1). Representation at the level of primary auditory cortex (33–35), including the observation of complex auditory cortex is precise: fine-grained in scale and local in cells when A1 is developmentally reprogrammed as a surrogate spectrotemporal space (2, 3). The recognition of complex spec- V1 (36). However, a formal equivalence is yet to be demon- trotemporal forms, like words, in higher areas of auditory cortex strated (37, 38). 
requires the transformation of this granular representation Auditory cortex’s predominant processing pathways, ventral into Gestalt-like, object-centered representations. In brief, local and dorsal (39, 40), appear to be optimized for pattern recog- features must be bound together to form representations of nition and action planning, respectively (17, 18, 40–44). Speech- complex spectrotemporal contours, which are themselves the specific models generally concur (45–48), creating a wide con- constituents of auditory “objects” or complex sound patterns (4, sensus that word recognition is performed in the auditory ventral 5). Next, representations must be generalized and abstracted. stream (refs. 42, 45, 47–50, but see refs. 51–53). The hierarchical Coding in primary auditory cortex is sensitive even to minor model predicts an increase in neural receptive field size and physical transformations. Object-centered coding in higher areas, complexity along the ventral stream. With respect to speech, however, must be invariant (i.e., tolerant of natural stimulus there is a discontinuity in the processing demands associated variation) (6). For example, whereas the phonemic structure of a with the recognition of elemental phonetic units (i.e., phonemes word is fixed, there is considerable variation in physical, spec- or something phone-like) and concatenated units (i.e., multi- trotemporal form—attributable to accent, pronunciation, body segmental forms, both sublexical forms and word forms). Pho- size, and the like—among utterances of a given word. It has been neme recognition requires sensitivity to the arrangement of proposed for visual cortical processing that a feed-forward, hi- constellations of spectrotemporal features (i.e., the presence and erarchical architecture (7) may be capable of simultaneously absence of energy at particular center frequencies and with solving the problems of complexity and variability (8–12). Here, particular temporal offsets). 
Word-form recognition requires we examine these ideas in the context of auditory cortex. sensitivity to the temporal arrangement of phonemes. Thus, In a hierarchical pattern-recognition scheme (8), coding in the phoneme recognition requires spectrotemporal CS and operates earliest cortical field would reflect the tuning and organization of primary auditory cortex (or core) (2, 3, 13). That is, single-neu- ron receptive fields (more precisely, frequency-response areas) would be tuned to particular center frequencies and would have Author contributions: I.D. designed research; I.D. performed research; I.D. analyzed data; minimal spectrotemporal complexity (i.e., a single excitatory and I.D. and J.P.R. wrote the paper. zone and one-to-two inhibitory side bands). Units in higher fields The authors declare no conflict of interest. would be increasingly pattern selective and invariant to natural This article is a PNAS Direct Submission. variation. Pattern selectivity and invariance respectively arise 1 To whom correspondence may be addressed. E-mail: id32@georgetown.edu or from neural computations similar in effect to “logical-AND” and rauschej@georgetown.edu. “logical-OR” gates. In the auditory system, neurons whose tun- This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10. ing is combination sensitive (14–21) perform the logical-AND 1073/pnas.1113427109/-/DCSupplemental. www.pnas.org/cgi/doi/10.1073/pnas.1113427109 PNAS Early Edition | 1 of 10
  • 2.
    on low-level acousticfeatures (SI Appendix, Fig. S1B, second vestigations of stimulus complexity, comparing activation to word- layer), whereas word-form recognition requires only temporal form and pure-tone stimuli, report similar localization (47, 73, 74). CS (i.e., concatenation of phonemes) and operates on higher- Invariant tuning for word forms, as inferred from fMRI-adapta- order features that may also be perceptual objects in their own tion studies, also localizes to anterior STG/STS (75–77). Studies right (SI Appendix, Fig. S1B, top layer). If word-form recognition investigating cross-modal repetition effects for auditory and visual is implemented hierarchically, we might expect this discontinuity stimuli confirm anterior STG/STS localization and, further, show in processing to be mirrored in cortical organization, with con- it to be part of unimodal auditory cortex (78, 79). Finally, appli- catenative phonetic recognition occurring distal to elemental cation of electrical cortical interference to anterior STG disrupts phonetic recognition. auditory comprehension, producing patient reports of speech as Primate electrophysiology identifies CS as occurring as early as being like “a series of meaningless utterances” (80). core’s supragranular layers and in lateral belt (16, 17, 19, 37). In Here, we use a coordinate-based meta-analytic approach [ac- the macaque, selectivity for communication calls—similar in tivation likelihood estimation (ALE)] (81) to make an unbiased spectrotemporal structure to phonemes or consonant-vowel assessment of the robustness of functional-imaging evidence for (CV) syllables—is observed in belt area AL (54) and, to an even the aforementioned speech-recognition model. In short, the greater degree, in a more anterior field, RTp (55). Further, for method assesses the stereotaxic concordance of reported effects. 
macaques trained to discriminate human phonemes, categorical First, we investigate the strength of evidence for the predicted coding is present in the single-unit activity of AL neurons as well anatomical dissociation between elemental phonetic recognition as in the population activity of area AL (1, 56). Human homologs (mid-STG) and concatenative phonetic recognition (anterior to these sites putatively lie on or about the anterior-lateral aspect STG). To assess this, two functional imaging paradigms are of Heschl’s gyrus and in the area immediately posterior to it (13, meta-analyzed: speech vs. acoustic-control sounds (a proxy for 57–59). Macaque PET imaging suggests there is also an evolu- CS, as detailed later) and repetition suppression (RS). For each tionary predisposition to left-hemisphere processing for con- paradigm, separate analyses are performed for studies of ele- specific communication calls (60). Consistent with macaque mental phonetic processing (i.e., phoneme- and CV-length electrophysiology, human electrocorticography recordings from stimuli) and for studies involving concatenative phonetic pro- superior temporal gyrus (STG), in the region immediately pos- cessing (i.e., word-length stimuli). Although the aforementioned terior to the anterior-lateral aspect of Heschl’s gyrus (i.e., mid- model is principally concerned with word-from recognition, for STG), show the site to code for phoneme identity at the pop- comparative purposes, we meta-analyze studies of phrase-length ulation level (61). Mid-STG is also the site of peak high-gamma stimuli as well. Second, we investigate the strength of evidence activity in response to CV sounds (62–64). Similarly, human for the predicted ventral-stream colocalization of CS and IR functional imaging studies suggest left mid-STG is involved in phenomena. To assess this, the same paradigms are reanalyzed processing elemental speech sounds. 
For instance, in subtractive functional MRI (fMRI) comparisons, after partialing out variance attributable to acoustic factors, Leaver and Rauschecker (2010) showed selectivity in left mid-STG for CV speech sounds as opposed to other natural sounds (5). This implies the presence of a local density of neurons with receptive-field tuning optimized for the recognition of elemental phonetic sounds [i.e., areal specialization (AS)]. Furthermore, the region exhibits fMRI-adaptation phenomena consistent with invariant representation (IR) (65, 66). That is, response diminishes when the same phonetic content is repeatedly presented even though a physical attribute of the stimulus, one unrelated to phonetic content, is changed; here, the speaker’s voice (5). Similarly, using speech sound stimuli on the /ga/–/da/ continuum and comparing response to exemplar pairs that varied only in acoustics or which varied both in acoustics and in phonetic content, Joanisse and colleagues (2007) found adaptation specific to phonetic content in left mid-STG, again implying IR (67).

The site downstream of mid-STG, performing phonetic concatenation, should possess neurons that respond to late components of multisegmental sounds (i.e., latencies >60 ms). These units should also be selective for specific phoneme orderings. Nonhuman primate data for regions rostral to A1 confirm that latencies increase rostrally along the ventral stream (34, 55, 68, 69), with the median latency to peak response approaching 100 ms in area RT (34), consistent with the latencies required for phonetic concatenation. In a rare human electrophysiology study, Creutzfeldt and colleagues (1989) report vigorous single-unit responses to words and sentences in mid- to anterior STG (70). This included both feature-tuned units and late-component-tuned units. Although the relative location of feature and late-component units is not reported, and the late-component units do not clearly evince temporal CS, the mixture of response types supports the supposition of temporal combination-sensitive units in human STG. Imaging studies localize processing of multisegmental forms to anterior STG/superior temporal sulcus (STS). This can be seen in peak activation to word-forms in electrocorticography (71) and magnetoencephalography (72). FMRI in-

with two modifications: (i) For IR, a subset of RS studies meeting heightened criteria for fMRI-adaptation designs is included (Methods); (ii) to attain sufficient sample size, analyses are collapsed across stimulus lengths.

We also investigate the strength of evidence for AS, which has been suggested as an organizing principle in higher-order areas of the auditory ventral stream (5, 82–85) and is a well established organizing principle in the visual system’s analogous pattern recognition pathway (86–89). In the interest of comparing the organizational properties of the auditory ventral stream with those of the visual ventral stream, we assess the colocalization of AS phenomena with CS and IR phenomena. CS and IR are examined as described earlier. AS is examined by meta-analysis of speech vs. nonspeech natural-sound paradigms.

At a deep level, both our AS and CS analyses putatively examine CS-dependent tuning for complex patterns of spectrotemporal energy. Acoustic-control sounds lack the spectrotemporal feature combinations requisite for driving combination-sensitive neurons tuned to speech sounds. For nonspeech natural sounds, the same is true, but there should also exist combination-sensitive neurons tuned to these stimuli, as they have been repeatedly encountered over development. For an effect to be observed in the AS analyses, not only must there be a population of combination-sensitive speech-tuned neurons, but these neurons must also cluster together such that a differential response is observable at the macroscopic scale of fMRI and PET.

Results

Phonetic-length-based analyses of CS studies (i.e., speech sounds vs. acoustic control sounds) were performed twice. In the first analyses, tonal control stimuli were excluded on grounds that they do not sufficiently match the spectrotemporal energy distribution of speech. That is, for a strict test of CS, we required acoustic control stimuli to model low-level properties of speech (i.e., contain spectrotemporal features coarsely similar to speech), not merely to drive primary and secondary auditory cortex. Under this preparation, spatial concordance was greatest in STG/STS across each phonetic length-based analysis (Table 1).

www.pnas.org/cgi/doi/10.1073/pnas.1113427109
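The length-based analyses rest on the binning rule given in Methods: contrasts are grouped by the estimated mean number of phonemes in their stimuli into phoneme length (one or two), word length (three to 10), and phrase length (more than 10). A minimal Python sketch of that rule follows. The function name, the example contrasts, and the treatment of fractional means between 2 and 3 (assigned here to word length) are illustrative assumptions, not details taken from the study.

```python
def phonetic_length_bin(mean_phonemes: float) -> str:
    """Return the length bin for a contrast, given the estimated mean
    number of phonemes in its stimuli.

    Bin edges follow Methods. How fractional means between 2 and 3 were
    handled is not stated in the paper; sending them to "word length"
    is an assumption of this sketch.
    """
    if mean_phonemes < 1:
        raise ValueError("a speech stimulus needs at least one phoneme")
    if mean_phonemes <= 2:
        return "phoneme length"
    if mean_phonemes <= 10:
        return "word length"
    return "phrase length"


# Hypothetical contrasts (names and phoneme counts invented for illustration).
example_contrasts = {"CV syllables": 2, "monosyllabic words": 3.4, "sentences": 25}
binned = {name: phonetic_length_bin(n) for name, n in example_contrasts.items()}
```

Under these assumptions, `binned` maps the three hypothetical contrasts to the phoneme-, word-, and phrase-length cohorts, respectively.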
Table 1. Results for phonetic length-based analyses

Analysis/anatomy | BA | Cluster concordance | Volume, mm³ | Center of mass (x, y, z) | Peak coordinates (x, y, z) | Peak ALE
CS, phoneme length
Left STG | 42/22 | 0.93 | 3,624 | −57, −25, 1 | −58, −20, 2 | 0.028
Right STG/RT | 42/22 | 0.21 | 512 | 56, −11, −2 | 54, −2, 2 | 0.015
CS, word length
Left STG | 42/22 | 0.56 | 2,728 | −57, −17, −1 | −56, −16, −2 | 0.021
Right STG | 22 | 0.13 | 192 | 55, −17, 0 | 56, −16, 0 | 0.014
CS, phrase length
Left STS | 21 | 0.58 | 2,992 | −56, −8, −8 | −56, −8, −8 | 0.038
Left STS | 21 | 0.42 | 1,456 | −52, 7, −16 | −52, 8, −16 | 0.035
Right STS | 21 | 0.32 | 2,264 | 54, −3, −9 | 56, −6, −6 | 0.032
Left STS | 22 | 0.32 | 840 | −54, −35, 1 | −54, −34, 0 | 0.028
Left PreCG | 6 | 0.32 | 664 | −47, −7, 47 | −48, −8, 48 | 0.025
Left IFG | 47 | 0.21 | 456 | −42, 25, −12 | −42, 24, −12 | 0.021
Left IFG | 44 | 0.16 | 200 | −48, 11, 20 | −48, 10, 20 | 0.020
RS, phoneme length
Left STG | 42/22 | 0.33 | 640 | −58, −21, 4 | −58, −20, 4 | 0.018
RS, word length
Left STG | 42/22 | 0.50 | 1,408 | −56, −9, −3 | −56, −10, −4 | 0.027
Left STG | 42/22 | 0.19 | 288 | −58, −28, 2 | −58, −28, 2 | 0.017

BA, Brodmann area; IFG, inferior frontal gyrus; PreCG, precentral gyrus; RT, rostrotemporal area.

Within STG/STS, results were left-biased across peak ALE-statistic value, cluster volume, and the percentage of studies reporting foci within a given cluster, hereafter “cluster concordance.” The predicted differential localization for phoneme- and word-length processing was confirmed, with phoneme-length effects most strongly associated with left mid-STG and word-length effects with left anterior STG (Fig. 1 and SI Appendix, Fig. S2). Phrase-length studies showed a similar leftward processing bias. Further, peak processing for phrase-length stimuli localized to a site anterior and subjacent to that of word-length stimuli, suggesting a processing gradient for phonetic stimuli that progresses from mid-STG to anterior STG and then into STS.

Fig. 1. Foci meeting inclusion criteria for length-based CS analyses (A–C) and ALE-statistic maps for regions of significant concordance (D–F) (p < 10⁻³, k > 150 mm³). Analyses show leftward bias and an anterior progression in peak effects with phoneme-length studies showing greatest concordance in left mid-STG (A and D; n = 14), word-length studies showing greatest concordance in left anterior STG (B and E; n = 16), and phrase-length analyses showing greatest concordance in left anterior STS (C and F; n = 19). Sample size is given with respect to the number of contrasts from independent experiments contributing to an analysis.
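Cluster concordance, as defined for Table 1, is the proportion of contributing studies that report at least one focus inside a given cluster. The following Python sketch illustrates that ratio under simplifying assumptions: a cluster is represented as a set of integer voxel coordinates, a focus counts as "within" the cluster only if it coincides with one of those voxels, and all names and example coordinates are hypothetical rather than drawn from the study's data.

```python
from typing import Dict, List, Set, Tuple

Voxel = Tuple[int, int, int]


def cluster_concordance(foci_by_experiment: Dict[str, List[Voxel]],
                        cluster_voxels: Set[Voxel]) -> float:
    """Fraction of experiments with at least one focus inside the cluster."""
    if not foci_by_experiment:
        raise ValueError("at least one experiment is required")
    hits = sum(
        1
        for foci in foci_by_experiment.values()
        if any(focus in cluster_voxels for focus in foci)
    )
    return hits / len(foci_by_experiment)


# Hypothetical data: three experiments and a two-voxel cluster.
example_foci = {
    "experiment_a": [(-58, -20, 2)],
    "experiment_b": [(-56, -16, -2), (54, -2, 2)],
    "experiment_c": [(56, -16, 0)],
}
example_cluster = {(-58, -20, 2), (-56, -16, -2)}
example_value = cluster_concordance(example_foci, example_cluster)
```

Here two of the three hypothetical experiments contribute a focus to the cluster, so `example_value` is 2/3. An experiment with several foci inside the same cluster is still counted once.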
Although individual studies report foci for left frontal cortex in each of the length-based cohorts, only in the phrase-length analysis do focus densities reach statistical significance.

Second, to increase sample size and enable lexical status-based subanalyses, we included studies that used tonal control stimuli. Under this preparation the same overall pattern of results was observed with one exception: the addition of a pair of clusters in left ventral prefrontal cortex for the word-length analysis (SI Appendix, Fig. S3 and Table S1). Next, we further subdivided word-length studies according to lexical status: real word or pseudoword. A divergent pattern of concordance was observed in left STG (Fig. 2 and SI Appendix, Fig. S4 and Table S1). Peak processing for real-word stimuli robustly localized to anterior STG. For pseudoword stimuli, a bimodal distribution was observed, peaking both in mid- and anterior STG and coextensive with the real-word cluster.

Third, to assess the robustness of the predicted STG stimulus-length processing gradient, length-based analyses were performed on foci from RS studies. For both phoneme- and word-length stimuli, concordant foci were observed to be strictly left-lateralized and exclusively within STG (Table 1). The predicted processing gradient was also observed. Peak concordance for phoneme-length stimuli was seen in mid-STG, whereas peak concordance for word-length stimuli was seen in anterior STG (Fig. 3 and SI Appendix, Fig. S5). For the word-length analysis, a secondary cluster was observed in mid-STG. This may reflect repetition effects concurrently observed for phoneme-level representation or, as the site is somewhat inferior to that of phoneme-length effects, it may be tentative evidence of a secondary processing pathway within the ventral stream (63, 90).

Fourth, to assess colocalization of CS, IR, and AS, we performed length-pooled analyses (Fig. 4, Table 2, and SI Appendix, Fig. S6). Robust CS effects were observed in STG/STS. Again, they were left-biased across peak ALE-statistic value, cluster volume, and cluster concordance. Significant concordance was also found in left frontal cortex. A single result was observed in the IR analysis, localizing to left mid- to anterior STG. This cluster was entirely coextensive with the primary left-STG CS cluster. Finally, analysis of AS foci found concordance in STG/STS. It was also left-biased in peak ALE-statistic value, cluster volume, and cluster concordance. Further, a left-lateralized ventral prefrontal result was observed. The principal left STG/STS cluster was coextensive with the region of overlap between the CS and IR analyses. Within superior temporal cortex, the AS analysis was also generally coextensive with the CS analysis. In left ventral prefrontal cortex, the AS and CS results were not coextensive but were nonetheless similarly localized. Fig. 5 shows exact regions of overlap across length-based and pooled analyses.

Fig. 2. Foci meeting liberal inclusion criteria for lexically based word-length CS analyses (A and B) and ALE-statistic maps for regions of significant concordance (C and D) (p < 10⁻³, k > 150 mm³). Similar to the CS analyses in Fig. 1, a leftward bias and an anterior progression in peak effects are shown. Pseudoword studies show greatest concordance in left mid- to anterior STG (A and C; n = 13). Notably, the distribution of concordance effects is bimodal, peaking both in mid- (−60, −26, 6) and anterior (−56, −10, 2) STG. Real-word studies show greatest concordance in left anterior STG (B and D; n = 22).

Discussion

Meta-analysis of speech processing shows a left-hemisphere optimization for speech and an anterior-directed processing gradient. Two unique findings are presented. First, dissociation is observed for the processing of phonemes, words, and phrases: elemental phonetic processing is most strongly associated with mid-STG; auditory word-form processing is most strongly associated with anterior STG, and phrasal processing is most strongly associated with anterior STS. Second, evidence for CS, IR, and AS colocalizes in mid- to anterior STG. Each finding supports the presence of an anterior-directed ventral-stream pattern-recognition pathway. This is in agreement with Leaver and Rauschecker (2010), who tested colocalization of AS and IR in a single sample using phoneme-length stimuli (5). Recent meta-analyses that considered related themes affirm aspects of the present work. In a study that collapsed across phoneme and pseudoword processing, Turkeltaub and Coslett (2010) localized sublexical processing to mid-STG (91). This is consistent with our more specific localization of elemental phonetic processing. Samson and colleagues (2011), examining preferential tuning for speech over music, report peak concordance in left anterior STG/STS (92), consistent with our more general areal-specialization analysis. Finally, our results support Binder and colleagues’ (2000) anterior-directed, hierarchical account of word recognition (47) and Cohen and colleagues’ (2004) hypothesis of an auditory word-form area in left anterior STG (78).

Classically, auditory word-form recognition was thought to localize to posterior STG/STS (93). This perspective may have been biased by the spatial distribution of middle cerebral artery accidents. The artery’s diameter decreases along the Sylvian fissure, possibly increasing the prevalence of posterior infarcts. Current methods in aphasia research are better controlled and more precise. They implicate mid- and anterior temporal regions in speech comprehension, including anterior STG (94, 95). Although evidence for an anterior STG/STS localization of auditory word-form processing has been present in the functional imaging literature since inception (96–99), perspectives advancing this view have been controversial and the localization is still not uniformly accepted. We find strong agreement among word-processing experiments, both within and across paradigms, each supporting relocation of auditory word-form recognition to anterior STG. Through consideration of phoneme- and phrasal-processing experiments, we show the identified anterior-STG word-form recognition site to be situated between sites robustly associated with phoneme and phrase processing. This comports with hierarchical processing and thereby further supports anterior-STG localization for auditory word-form recognition.

Fig. 3. Foci meeting inclusion criteria for length-based RS analyses (A and B) and ALE-statistic maps for regions of significant concordance (C and D) (p < 10⁻³, k > 150 mm³). Analyses show left lateralization and an anterior progression in peak effects with phoneme-length studies showing greatest concordance in left mid-STG (A and C; n = 12) and word-length studies showing greatest concordance in left anterior STG (B and D; n = 16). Too few studies exist for phrase-length analyses (n = 4).

It is important to note that some authors define “posterior” STG to be posterior of the anterior-lateral aspect of Heschl’s gyrus or of the central sulcus. These definitions include the region we discuss as “mid-STG,” the area lateral of Heschl’s gyrus. We differentiate mid- from posterior STG on the basis of proximity to primary auditory cortex and the putative course of the ventral stream. As human core auditory fields lie along or about Heschl’s gyrus (13, 57–59, 100), the ventral stream’s course can be inferred to traverse portions of planum temporale. Specifically, the ventral stream is associated with macaque areas RTp and AL (54–56), which lie anterior to and lateral of A1 (13). As human A1 lies on or about the medial aspect of Heschl’s gyrus, with core running along its extent (57, 100), a processing cascade emanating from core areas, progressing both laterally, away from core itself, and anteriorly, away from A1, will necessarily traverse the anterior-lateral portion of planum temporale. Further, this implies mid-STG is the initial STG waypoint of the ventral stream.

Fig. 4. Foci meeting inclusion criteria for length-pooled analyses (A–C) and ALE-statistic maps for regions of significant concordance (D–F) (p < 10⁻³, k > 150 mm³). Analyses show leftward bias in the CS (A and D; n = 49) and AS (C and F; n = 15) analyses and left lateralization in the IR (B and E; n = 11) analysis. Foci are color coded by stimulus length: phoneme length, red; word length, green; and phrase length, blue.

Nominal issues aside, support for a posterior localization could be attributed to a constellation of effects pertaining to aspects of speech or phonology that localize to posterior STG/STS (69), for instance: speech production (101–108), phonological/articulatory working memory (109, 110), reading (111–113) [putatively attributable to orthography-to-phonology translation (114–116)], and aspects of audiovisual language processing (117–122). Although these findings relate to aspects of speech and phonology, they do so in terms of multisensory processing and sensorimotor integration and are not the key paradigms indicated by computational theory for demonstrating the presence of pattern recognition networks (8–12, 123). Those paradigms (CS and adaptation), systematically meta-analyzed here, find anterior localization.

Table 2. Results for aggregate analyses

Analysis/anatomy | BA | Cluster concordance | Volume, mm³ | Center of mass (x, y, z) | Peak coordinates (x, y, z) | Peak ALE
CS
Left STG | 42/22 | 0.82 | 11,944 | −57, −19, −1 | −58, −18, 0 | 0.056
Right STG | 42/22 | 0.47 | 6,624 | 55, −10, −3 | 56, −6, −6 | 0.045
Left STS | 21 | 0.18 | 1,608 | −51, 8, −14 | −50, 8, −14 | 0.039
Left PreCG | 6 | 0.12 | 736 | −47, −7, 48 | −48, −8, 48 | 0.031
Left IFG | 44 | 0.10 | 744 | −45, 12, 21 | −46, 12, 20 | 0.025
Left IFG | 47 | 0.08 | 240 | −42, 25, −12 | −42, 24, −12 | 0.022
Left IFG | 45 | 0.04 | 200 | −50, 21, 12 | −50, 22, 12 | 0.020
IR*
Left STG | 22/21 | 0.45 | 1,200 | −58, −16, −1 | −56, −14, −2 | 0.020
AS
Left STG | 42/22 | 0.87 | 3,976 | −58, −22, 2 | −58, −24, 2 | 0.031
Right STG | 42/22 | 0.53 | 2,032 | 51, −23, 2 | 54, −16, 0 | 0.026
Left IFG | 47/45 | 0.13 | 368 | −45, 17, 3 | −44, 18, 2 | 0.018

*Broader inclusion criteria for the IR analysis (SI Appendix, Table S3) yield equivalent results with the following qualifications: cluster volume 1,008 mm³ and cluster concordance 0.33.

Fig. 5. Flat-map presentation of ALE cluster overlap for (A) the CS analyses shown in Fig. 1, (B) the word-length lexical status analyses shown in Fig. 2, (C) the RS analyses shown in Fig. 3, and (D) the length-pooled analyses shown in Fig. 4. For orientation, prominent landmarks are shown on the left hemisphere of A, including the circular sulcus (CirS), central sulcus (CS), STG, and STS.
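The ALE statistic behind these maps and tables treats each reported focus as a Gaussian probability distribution reflecting localization uncertainty and pools the resulting maps across experiments (Methods). The sketch below shows one common formulation at a single voxel: each experiment's modeled activation is the maximum Gaussian probability over its foci, and the ALE value is the probability that at least one experiment is active there, 1 − ∏(1 − MAᵢ). Whether this matches the exact implementation cited in Methods is an assumption, as are the fixed 10-mm FWHM, the 8-mm³ voxel volume, the function names, and the example foci. Null-distribution testing and cluster thresholding are not shown.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def gaussian_prob(focus: Point, voxel: Point,
                  fwhm_mm: float = 10.0, voxel_volume_mm3: float = 8.0) -> float:
    """Probability mass of a 3-D isotropic Gaussian centred on `focus`,
    approximated at `voxel` as density times voxel volume."""
    sigma = fwhm_mm / (2.0 * math.sqrt(2.0 * math.log(2.0)))
    dist_sq = sum((f - v) ** 2 for f, v in zip(focus, voxel))
    density = math.exp(-dist_sq / (2.0 * sigma ** 2)) / (
        (2.0 * math.pi) ** 1.5 * sigma ** 3
    )
    return min(1.0, density * voxel_volume_mm3)


def modeled_activation(foci: Sequence[Point], voxel: Point) -> float:
    """Per-experiment map value: maximum over that experiment's foci, so an
    experiment reporting many nearby foci is not over-weighted."""
    return max(gaussian_prob(focus, voxel) for focus in foci)


def ale(experiments: List[Sequence[Point]], voxel: Point) -> float:
    """Union across experiments: 1 - prod(1 - MA_i)."""
    prob_none_active = 1.0
    for foci in experiments:
        prob_none_active *= 1.0 - modeled_activation(foci, voxel)
    return 1.0 - prob_none_active


# Hypothetical experiments, each a list of reported foci (mm coordinates).
example_experiments = [
    [(-58.0, -20.0, 2.0)],
    [(-56.0, -16.0, -2.0), (54.0, -2.0, 2.0)],
    [(-56.0, -8.0, -8.0)],
]
ale_near = ale(example_experiments, (-58.0, -20.0, 2.0))
ale_far = ale(example_experiments, (0.0, 0.0, 0.0))
```

In this toy example, `ale_near` exceeds `ale_far` because several hypothetical experiments report foci close to the first voxel. A full analysis would evaluate every voxel in the brain volume and compare the result against a null distribution.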
The segregation of phoneme and word-form processing along STG implies a growing encoding specificity for complex phonetic forms by higher-order ventral-stream areas. More specifically, it suggests the presence of a hierarchical network performing phonetic concatenation at a site anatomically distinct from and downstream of the site performing elemental phonetic recognition. Alternatively, the phonetic-length effect could be attributed to semantic confound: semantic content increases from phonemes to word forms. In an elegant experiment, Thierry and colleagues (2003) report evidence against this (82). After controlling for acoustics, they show that left anterior STG responds more to speech than to semantically matched environmental sounds. Similarly, Belin and colleagues (2000, 2002), after controlling for acoustics, show that left anterior STG is not merely responding to the vocal quality of phonetic sounds; rather, it responds preferentially to the phonetic quality of vocal sounds (83, 84).

Additional comment on the localization and laterality of auditory word and pseudoword processing, as well as on processing gradients in STG/STS, is provided in SI Appendix, Discussion.

The auditory ventral stream is proposed to use CS to conjoin lower-order representations and thereby to synthesize complex representations. As the tuning of higher-order combination-sensitive units is contingent upon sensory experience (124, 125), phrases and sentences would not generally be processed as Gestalt-like objects. Although we have analyzed studies involving phrase- and sentence-level processing, their inclusion is for context and because word-form recognition is a constituent part of sentence processing. In some instances, however, phrases are processed as objects (126). This status is occasionally recognized in orthography (e.g., “nonetheless”). Such phrases ought to be recognized by the ventral-stream network. This, however, would be the exception, not the rule. Hypothetically, the opposite may also occur: a word form’s length might exceed the network’s integrative capacity (e.g., “antidisestablishmentarianism”). We speculate the network is capable of concatenating sequences of at least five to eight phonemes: five to six phonemes is the modal length of English word forms and seven- to eight-phoneme-long word forms comprise nearly one fourth of English words (SI Appendix, Fig. S7 and Discussion). This estimate is also consistent with the time constant of echoic memory (∼2 s). (Notably, there is a similar issue concerning the processing of text in the visual system’s ventral stream, where, for longer words, fovea-width representations must be “temporally” conjoined across microsaccades.) Although some phrases may be recognized in the word-form recognition network, the majority of STS activation associated with phrase-length stimuli (Fig. 1F) is likely related to aspects of syntax and semantics. This observation enables us to subdivide the intelligibility network, broadly defined by Scott and colleagues (2000) (127). The first two stages involve elemental and concatenative phonetic recognition, extending from mid-STG to anterior STG and, possibly, into subjacent STS. Higher-order syntactic and semantic processing is conducted throughout STS and continues into prefrontal cortex (128–133).

A qualification to the propositions advanced here for word-form recognition is that this account pertains to perceptually fluent speech recognition (e.g., native language conversational discourse). Both left ventral and dorsal networks likely mediate nonfluent speech recognition (e.g., when processing neologisms or recently acquired words in a second language). Whereas ventral networks are implicated in pattern recognition, dorsal networks are implicated in forward- and inverse-model computation (42, 44), including sensorimotor integration (42, 45, 48, 134). This supports a role for left dorsal networks in mapping auditory representations onto the somatomotor frame of reference (135–139), yielding articulator-encoded speech. This ventral–dorsal dissociation is illustrated in an experiment by Buchsbaum and colleagues (2005) (110). Using a verbal working memory task, they demonstrated the time course of left anterior STG/STS activation to be consistent with strictly auditory encoding: activation was locked to auditory stimulation and it was not sustained throughout the late phase of item rehearsal. In contrast, they observed the activation time course in the dorsal stream to be modality independent and to coincide with late-phase rehearsal (i.e., it was associated with verbal rehearsal independent of input modality, auditory or visual). Importantly, late-phase rehearsal can be demonstrated behaviorally, by articulatory suppression, to be mediated by subvocalization (i.e., articulatory rehearsal in the phonological loop) (140).

There are some notable differences between auditory and visual word recognition. Spoken language was intensely selected for during evolution (141), whereas reading is a recent cultural innovation (111). The age of acquisition of phoneme representation is in the first year of life (124), whereas it is typically in the third year for letters. A similar developmental lag is present with respect to acquisition of the visual lexicon. Differences aside, word recognition in each modality requires similar processing, including the concatenation of elemental forms, phonemes or letters, into sublexical forms and word forms. If the analogy between auditory and visual ventral streams is correct, our results predict a similar anatomical dissociation for elemental and concatenative representation in the visual ventral stream. This prediction is also made by models of text processing (10). Although we are aware of no study that has investigated letter and word recognition in a single sample, support for the dissociation is present in the literature. The visual word-form area, the putative site of visual word-form recognition (142), is located in the left fusiform gyrus of inferior temporal cortex (IT) (143). Consistent with expectation, the average site of peak activation to single letters in IT (144–150) is more proximal to V1, by approximately 13 mm. A similar anatomical dissociation can be seen in paradigms probing IR. Ordinarily, nonhuman primate IT neurons exhibit a degree of mirror-symmetric invariant tuning (151). Letter recognition, however, requires nonmirror IR (e.g., to distinguish “b” from “d”). When assessing identity-specific RS (i.e., repetition effects specific to non–mirror-inverted repetitions), letter and word effects differentially localize: effects for word stimuli localize to the visual word-form area (152), whereas effects for single-letter stimuli localize to the lateral occipital complex (153), a site closer to V1. Thus, the anatomical dissociation observed in auditory cortex for phonemes and words appears to reflect a general hierarchical processing architecture also present in other sensory cortices.

In conclusion, our analyses show the human functional imaging literature to support a hierarchical model of object recognition in auditory cortex, consistent with nonhuman primate electrophysiology. Specifically, our results support a left-biased, two-stage model of auditory word-form recognition with analysis of phonemes occurring in mid-STG and word recognition occurring in anterior STG. A third stage extends the model to phrase-level processing in STS. Mechanistically, left mid- to anterior STG exhibits core qualities of a pattern recognition network, including CS, IR, and AS.

Methods

To identify prospective studies for inclusion, a systematic search of the PubMed database was performed for variations of the query, “(phonetics OR ‘speech sounds’ OR phoneme OR ‘auditory word’) AND (MRI OR fMRI OR PET).” This yielded more than 550 records (as of February 2011). These studies were screened for compliance with formal inclusion criteria: (i) the publication of stereotaxic coordinates for group-wise fMRI or PET results in a peer-reviewed journal and (ii) report of a contrast of interest (as detailed later). Exclusion criteria were the use of pediatric or clinical samples. Inclusion/exclusion criteria admitted 115 studies. For studies reporting multiple suitable contrasts per sample, to avoid sampling bias, a single contrast was selected. For CS analyses, contrasts of interest compared activation to speech stimuli (i.e., phonemes/syllables, words/pseudowords, and phrases/sentences/pseudoword sentences) with activation to matched, nonnaturalistic acoustic control stimuli (i.e., various tonal, noise, and complex artificial nonspeech stimuli). A total of 84 eligible contrasts were identified, representing 1,211 subjects and 541 foci. For RS analyses, contrasts compared activation to repeated and nonrepeated speech stimuli. A total of 31 eligible contrasts were identified, representing 471 subjects and 145 foci. For IR analyses, a subset of the RS cohort was selected that used designs in which “repeated” stimuli also varied acoustically but not phonetically (e.g., two different utterances of the same word). The RS cohort was used for phonetic length-based analyses as the more restrictive criteria for IR yielded insufficient sample sizes (as detailed later). For AS analyses, contrasts compared activation to speech stimuli and to other naturalistic stimuli (e.g., animal calls, music, tool sounds). A total of 17 eligible contrasts were identified, representing 239 subjects and 100 foci. All retained contrasts were binned for phonetic length-based analyses according to the estimated mean number of phonemes in their stimuli: (i) “phoneme length,” one or two phonemes, (ii) “word length,” three to 10 phonemes, and (iii) “phrase length,” more than 10 phonemes. SI Appendix, Tables S2–S4, identify the contrasts included in each analysis.

The minimum sample size for meta-analyses was 10 independent contrasts. Foci reported in Montreal Neurological Institute coordinates were transformed into Talairach coordinates according to the ICBM2TAL transformation (154). Foci concordance was assessed by the method of ALE (81) in a random-effects implementation (155) that controls for within-experiment effects (156). Under ALE, foci are treated as Gaussian probability distributions, which reflect localization uncertainty. Pooled Gaussian focus maps were tested against a null distribution reflecting a random spatial association between different experiments. Correction for multiple comparisons was obtained through estimation of false discovery rate (157). Two significance criteria were used: minimum p value was set at 10⁻³ and minimum cluster extent was set at 150 mm³. Analyses were conducted in GINGERALE (Research Imaging Institute), AFNI (National Institute of Mental Health), and MATLAB (Mathworks). For visualization, CARET (Washington University in St. Louis) was used to project foci and ALE clusters from volumetric space onto the cortical surface of the Population-Average, Landmark- and Surface-based atlas (158). Readers should note that this procedure can introduce slight localization artifacts (e.g., projection may distribute one volumetric cluster discontinuously over two adjacent gyri).

ACKNOWLEDGMENTS. We thank Max Riesenhuber, Marc Ettlinger, and two anonymous reviewers for comments helpful to the development of this manuscript. This work was supported by National Science Foundation Grants BCS-0519127 and OISE-0730255 (to J.P.R.) and National Institute on Deafness and Other Communication Disorders Grant 1RC1DC010720 (to J.P.R.).

1. Steinschneider M (2011) Unlocking the role of the superior temporal gyrus for speech sound categorization. J Neurophysiol 105:2631–2633.
2. Brugge JF, Merzenich MM (1973) Responses of neurons in auditory cortex of the macaque monkey to monaural and binaural stimulation. J Neurophysiol 36:1138–1158.
3. Bitterman Y, Mukamel R, Malach R, Fried I, Nelken I (2008) Ultra-fine frequency tuning revealed in single neurons of human auditory cortex. Nature 451:197–201.
4. Griffiths TD, Warren JD (2004) What is an auditory object? Nat Rev Neurosci 5:887–892.
5. Leaver AM, Rauschecker JP (2010) Cortical representation of natural complex sounds: effects of acoustic features and auditory object category. J Neurosci 30:7604–7612.
6. Luce P, McLennan C (2005) Spoken word recognition: The challenge of variation. Handbook of Speech Perception, eds Pisoni D, Remez R (Blackwell, Malden, MA), pp 591–609.
7. Hubel DH, Wiesel TN (1962) Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. J Physiol 160:106–154.
8. Riesenhuber M, Poggio TA (2002) Neural mechanisms of object recognition. Curr Opin Neurobiol 12:162–168.
9. Husain FT, Tagamets M-A, Fromm SJ, Braun AR, Horwitz B (2004) Relating neuronal dynamics for auditory object processing to neuroimaging activity: A computational modeling and an fMRI study. Neuroimage 21:1701–1720.
10. Dehaene S, Cohen L, Sigman M, Vinckier F (2005) The neural code for written words: a proposal. Trends Cogn Sci 9:335–341.
11. Hoffman KL, Logothetis NK (2009) Cortical mechanisms of sensory learning and object recognition. Philos Trans R Soc Lond B Biol Sci 364:321–329.
12. Larson E, Billimoria CP, Sen K (2009) A biologically plausible computational model for auditory object recognition. J Neurophysiol 101:323–331.
13. Hackett TA (2011) Information flow in the auditory cortical network. Hear Res 271:133–146.
14. Suga N, O’Neill WE, Manabe T (1978) Cortical neurons sensitive to combinations of information-bearing elements of biosonar signals in the mustache bat. Science 200:778–781.
15. Margoliash D, Fortune ES (1992) Temporal and harmonic combination-sensitive neurons in the zebra finch’s HVc. J Neurosci 12:4309–4326.
16. Rauschecker JP, Tian B, Hauser M (1995) Processing of complex sounds in the macaque nonprimary auditory cortex. Science 268:111–114.
17. Rauschecker JP (1997) Processing of complex sounds in the auditory cortex of cat, monkey, and man. Acta Otolaryngol Suppl 532:34–38.
18. Rauschecker JP (1998) Parallel processing in the auditory cortex of primates. Audiol Neurootol 3:86–103.
19. Sadagopan S, Wang X (2009) Nonlinear spectrotemporal interactions underlying selectivity for complex sounds in auditory cortex. J Neurosci 29:11192–11202.
20. Medvedev AV, Chiao F, Kanwal JS (2002) Modeling complex tone perception: grouping harmonics with combination-sensitive neurons. Biol Cybern 86:497–505.
21. Willmore BDB, King AJ (2009) Auditory cortex: representation through sparsification? Curr Biol 19:1123–1125.
22. Voytenko SV, Galazyuk AV (2007) Intracellular recording reveals temporal integration in inferior colliculus neurons of awake bats. J Neurophysiol 97:1368–1378.
23. Peterson DC, Voytenko S, Gans D, Galazyuk A, Wenstrup J (2008) Intracellular recordings from combination-sensitive neurons in the inferior colliculus. J Neurophysiol 100:629–645.
24. Ye CQ, Poo MM, Dan Y, Zhang XH (2010) Synaptic mechanisms of direction selectivity in primary auditory cortex. J Neurosci 30:1861–1868.
25. Rao RP, Sejnowski TJ (2000) Predictive sequence learning in recurrent neocortical circuits. Advances in Neural Information Processing Systems, eds Solla SA, Leen TK, Muller KR (MIT Press, Cambridge), Vol 12.
26. Carr CE, Konishi M (1988) Axonal delay lines for time measurement in the owl’s brainstem. Proc Natl Acad Sci USA 85:8311–8315.
27. Tian B, Rauschecker JP (2004) Processing of frequency-modulated sounds in the lateral auditory belt cortex of the rhesus monkey. J Neurophysiol 92:2993–3013.
28. Fukushima K (1980) Neocognitron: A self organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol Cybern 36:193–202.
29. Riesenhuber M, Poggio TA (1999) Hierarchical models of object recognition in cortex. Nat Neurosci 2:1019–1025.
30. Kouh M, Poggio TA (2008) A canonical neural circuit for cortical nonlinear operations. Neural Comput 20:1427–1451.
31. Lampl I, Ferster D, Poggio T, Riesenhuber M (2004) Intracellular measurements of spatial integration and the MAX operation in complex cells of the cat primary visual cortex. J Neurophysiol 92:2704–2713.
32. Finn IM, Ferster D (2007) Computational diversity in complex cells of cat primary visual cortex. J Neurosci 27:9638–9648.
33. Bendor D, Wang X (2007) Differential neural coding of acoustic flutter within primate auditory cortex. Nat Neurosci 10:763–771.
34. Bendor D, Wang X (2008) Neural response properties of primary, rostral, and rostrotemporal core fields in the auditory cortex of marmoset monkeys. J Neurophysiol 100:888–906.
35. Atencio CA, Sharpee TO, Schreiner CE (2008) Cooperative nonlinearities in auditory cortical neurons. Neuron 58:956–966.
36. Roe AW, Pallas SL, Kwon YH, Sur M (1992) Visual projections routed to the auditory pathway in ferrets: receptive fields of visual neurons in primary auditory cortex. J Neurosci 12:3651–3664.
37. Atencio CA, Sharpee TO, Schreiner CE (2009) Hierarchical computation in the canonical auditory cortical circuit. Proc Natl Acad Sci USA 106:21894–21899.
38. Ahmed B, Garcia-Lazaro JA, Schnupp JWH (2006) Response linearity in primary auditory cortex of the ferret. J Physiol 572:763–773.
39. Rauschecker JP, Tian B (2000) Mechanisms and streams for processing of “what” and “where” in auditory cortex. Proc Natl Acad Sci USA 97:11800–11806.
40. Romanski LM, et al. (1999) Dual streams of auditory afferents target multiple domains in the primate prefrontal cortex. Nat Neurosci 2:1131–1136.
41. Kaas JH, Hackett TA (1999) ‘What’ and ‘where’ processing in auditory cortex. Nat Neurosci 2:1045–1047.
42. Rauschecker JP, Scott SK (2009) Maps and streams in the auditory cortex: nonhuman primates illuminate human speech processing. Nat Neurosci 12:718–724.
43. Romanski LM, Averbeck BB (2009) The primate cortical auditory system and neural representation of conspecific vocalizations. Annu Rev Neurosci 32:315–346.
44. Rauschecker JP (2011) An expanded role for the dorsal auditory pathway in sensorimotor control and integration. Hear Res 271:16–25.
45. Hickok G, Poeppel D (2007) The cortical organization of speech processing. Nat Rev Neurosci 8:393–402.
46. Scott SK, Wise RJS (2004) The functional neuroanatomy of prelexical processing in speech perception. Cognition 92:13–45.
47. Binder JR, et al. (2000) Human temporal lobe activation by speech and nonspeech sounds. Cereb Cortex 10:512–528.
48. Wise RJ, et al. (2001) Separate neural subsystems within ‘Wernicke’s area’. Brain 124:83–95.
49. Patterson RD, Johnsrude IS (2008) Functional imaging of the auditory processing applied to speech sounds. Philos Trans R Soc Lond B Biol Sci 363:1023–1035.
50. Weiller C, Bormann T, Saur D, Musso M, Rijntjes M (2011) How the ventral pathway got lost: and what its recovery might mean. Brain Lang 118:29–39.
51. Whalen DH, et al. (2006) Differentiation of speech and nonspeech processing within primary auditory cortex. J Acoust Soc Am 119:575–581.
52. Nelken I (2008) Processing of complex sounds in the auditory system. Curr Opin Neurobiol 18:413–417.
53. Recanzone GH, Cohen YE (2010) Serial and parallel processing in the primate auditory cortex revisited. Behav Brain Res 206:1–7.
54. Tian B, Reser D, Durham A, Kustov A, Rauschecker JP (2001) Functional specialization in rhesus monkey auditory cortex. Science 292:290–293.
55. Kikuchi Y, Horwitz B, Mishkin M (2010) Hierarchical auditory processing directed rostrally along the monkey’s supratemporal plane. J Neurosci 30:13021–13030.
56. Tsunada J, Lee JH, Cohen YE (2011) Representation of speech categories in the primate auditory cortex. J Neurophysiol 105:2634–2646.
57. Galaburda AM, Sanides F (1980) Cytoarchitectonic organization of the human auditory cortex. J Comp Neurol 190:597–610.
58. Chevillet M, Riesenhuber M, Rauschecker JP (2011) Functional correlates of the anterolateral processing hierarchy in human auditory cortex. J Neurosci 31:9345–9352.
59. Glasser MF, Van Essen DC (2011) Mapping human cortical areas in vivo based on myelin content as revealed by T1- and T2-weighted MRI. J Neurosci 31:11597–11616.
60. Poremba A, et al. (2004) Species-specific calls evoke asymmetric activity in the monkey’s temporal poles. Nature 427:448–451.
61. Chang EF, et al. (2010) Categorical speech representation in human superior temporal gyrus. Nat Neurosci 13:1428–1432.
62. Chang EF, et al. (2011) Cortical spatio-temporal dynamics underlying phonological target detection in humans. J Cogn Neurosci 23:1437–1446.
63. Steinschneider M, et al. (2011) Intracranial study of speech-elicited activity on the human posterolateral superior temporal gyrus. Cereb Cortex 21:2332–2347.
64. Edwards E, et al. (2009) Comparison of time-frequency responses and the event-related potential to auditory speech stimuli in human cortex. J Neurophysiol 102:377–386.
65. Miller EK, Li L, Desimone R (1991) A neural mechanism for working and recognition memory in inferior temporal cortex. Science 254:1377–1379.
66. Grill-Spector K, Malach R (2001) fMR-adaptation: A tool for studying the functional properties of human cortical neurons. Acta Psychol (Amst) 107:293–321.
67. Joanisse MF, Zevin JD, McCandliss BD (2007) Brain mechanisms implicated in the preattentive categorization of speech sounds revealed using FMRI and a short-interval habituation trial paradigm. Cereb Cortex 17:2084–2093.
68. Scott BH, Malone BJ, Semple MN (2011) Transformation of temporal processing across auditory cortex of awake macaques. J Neurophysiol 105:712–730.
69. Kusmierek P, Rauschecker JP (2009) Functional specialization of medial auditory belt cortex in the alert rhesus monkey. J Neurophysiol 102:1606–1622.
70. Creutzfeldt O, Ojemann G, Lettich E (1989) Neuronal activity in the human lateral temporal lobe. I. Responses to speech. Exp Brain Res 77:451–475.
71. Pei X, et al. (2011) Spatiotemporal dynamics of electrocorticographic high gamma activity during overt and covert word repetition. Neuroimage 54:2960–2972.
72. Marinkovic K, et al. (2003) Spatiotemporal dynamics of modality-specific and supramodal word processing. Neuron 38:487–497.
73. Binder JR, Frost JA, Hammeke TA, Rao SM, Cox RW (1996) Function of the left planum temporale in auditory and linguistic processing. Brain 119:1239–1247.
74. Binder JR, et al. (1997) Human brain language areas identified by functional magnetic resonance imaging. J Neurosci 17:353–362.
75. Dehaene-Lambertz G, et al. (2006) Functional segregation of cortical language areas by sentence repetition. Hum Brain Mapp 27:360–371.
76. Sammler D, et al. (2010) The relationship of lyrics and tunes in the processing of unfamiliar songs: A functional magnetic resonance adaptation study. J Neurosci 30:3572–3578.
77. Hara NF, Nakamura K, Kuroki C, Takayama Y, Ogawa S (2007) Functional neuroanatomy of speech processing within the temporal cortex. Neuroreport 18:1603–1607.
78. Cohen L, Jobert A, Le Bihan D, Dehaene S (2004) Distinct unimodal and multimodal regions for word processing in the left temporal cortex. Neuroimage 23:1256–1270.
79. Buchsbaum BR, D’Esposito M (2009) Repetition suppression and reactivation in auditory-verbal short-term recognition memory. Cereb Cortex 19:1474–1485.
80. Matsumoto R, et al. (2011) Left anterior temporal cortex actively engages in speech perception: A direct cortical stimulation study. Neuropsychologia 49:1350–1354.
81. Turkeltaub PE, Eden GF, Jones KM, Zeffiro TA (2002) Meta-analysis of the functional neuroanatomy of single-word reading: method and validation. Neuroimage 16:765–780.
82. Thierry G, Giraud AL, Price CJ (2003) Hemispheric dissociation in access to the human semantic system. Neuron 38:499–506.
83. Belin P, Zatorre RJ, Lafaille P, Ahad P, Pike B (2000) Voice-selective areas in human auditory cortex. Nature 403:309–312.
84. Belin P, Zatorre RJ, Ahad P (2002) Human temporal-lobe response to vocal sounds. Brain Res Cogn Brain Res 13:17–26.
85. Petkov CI, et al. (2008) A voice region in the monkey brain. Nat Neurosci 11:367–374.
86. Desimone R, Albright TD, Gross CG, Bruce C (1984) Stimulus-selective properties of inferior temporal neurons in the macaque. J Neurosci 4:2051–2062.
87. Gaillard R, et al. (2006) Direct intracranial, FMRI, and lesion evidence for the causal role of left inferotemporal cortex in reading. Neuron 50:191–204.
88. Tsao DY, Freiwald WA, Tootell RBH, Livingstone MS (2006) A cortical region consisting entirely of face-selective cells. Science 311:670–674.
89. Kanwisher N, Yovel G (2006) The fusiform face area: A cortical region specialized for the perception of faces. Philos Trans R Soc Lond B Biol Sci 361:2109–2128.
90. Edwards E, et al. (2010) Spatiotemporal imaging of cortical activation during verb generation and picture naming. Neuroimage 50:291–301.
91. Turkeltaub PE, Coslett HB (2010) Localization of sublexical speech perception components. Brain Lang 114:1–15.
92. Samson F, Zeffiro TA, Toussaint A, Belin P (2011) Stimulus complexity and categorical effects in human auditory cortex: An activation likelihood estimation meta-analysis. Front Psychol 1:241.
93. Geschwind N (1970) The organization of language and the brain. Science 170:940–944.
94. Bates E, et al. (2003) Voxel-based lesion-symptom mapping. Nat Neurosci 6:448–450.
95. Dronkers NF, Wilkins DP, Van Valin RD, Jr., Redfern BB, Jaeger JJ (2004) Lesion analysis of the brain areas involved in language comprehension. Cognition 92:145–177.
96. Mazziotta JC, Phelps ME, Carson RE, Kuhl DE (1982) Tomographic mapping of human cerebral metabolism: Auditory stimulation. Neurology 32:921–937.
97. Petersen SE, Fox PT, Posner MI, Mintun M, Raichle ME (1988) Positron emission tomographic studies of the cortical anatomy of single-word processing. Nature 331:585–589.
98. Wise RJS, et al. (1991) Distribution of cortical neural networks involved in word comprehension and word retrieval. Brain 114:1803–1817.
99. Démonet JF, et al. (1992) The anatomy of phonological and semantic processing in normal subjects. Brain 115:1753–1768.
100. Rademacher J, et al. (2001) Probabilistic mapping and volume measurement of human primary auditory cortex. Neuroimage 13:669–683.
101. Hamberger MJ, Seidel WT, Goodman RR, Perrine K, McKhann GM (2003) Temporal lobe stimulation reveals anatomic distinction between auditory naming processes. Neurology 60:1478–1483.
102. Hashimoto Y, Sakai KL (2003) Brain activations during conscious self-monitoring of speech production with delayed auditory feedback: An fMRI study. Hum Brain Mapp 20:22–28.
103. Warren JE, Wise RJS, Warren JD (2005) Sounds do-able: Auditory-motor transformations and the posterior temporal plane. Trends Neurosci 28:636–643.
104. Guenther FH (2006) Cortical interactions underlying the production of speech sounds. J Commun Disord 39:350–365.
105. Tourville JA, Reilly KJ, Guenther FH (2008) Neural mechanisms underlying auditory feedback control of speech. Neuroimage 39:1429–1443.
106. Towle VL, et al. (2008) ECoG gamma activity during a language task: Differentiating expressive and receptive speech areas. Brain 131:2013–2027.
107. Takaso H, Eisner F, Wise RJS, Scott SK (2010) The effect of delayed auditory feedback on activity in the temporal lobe while speaking: A positron emission tomography study. J Speech Lang Hear Res 53:226–236.
108. Zheng ZZ, Munhall KG, Johnsrude IS (2010) Functional overlap between regions involved in speech perception and in monitoring one’s own voice during speech production. J Cogn Neurosci 22:1770–1781.
109. Buchsbaum BR, Padmanabhan A, Berman KF (2011) The neural substrates of recognition memory for verbal information: Spanning the divide between short- and long-term memory. J Cogn Neurosci 23:978–991.
110. Buchsbaum BR, Olsen RK, Koch P, Berman KF (2005) Human dorsal and ventral auditory streams subserve rehearsal-based and echoic processes during verbal working memory. Neuron 48:687–697.
111. Vinckier F, et al. (2007) Hierarchical coding of letter strings in the ventral stream: dissecting the inner organization of the visual word-form system. Neuron 55:143–156.
112. Dehaene S, et al. (2010) How learning to read changes the cortical networks for vision and language. Science 330:1359–1364.
113. Pallier C, Devauchelle A-D, Dehaene S (2011) Cortical representation of the constituent structure of sentences. Proc Natl Acad Sci USA 108:2522–2527.
114. Graves WW, Desai R, Humphries C, Seidenberg MS, Binder JR (2010) Neural systems for reading aloud: A multiparametric approach. Cereb Cortex 20:1799–1815.
115. Jobard G, Crivello F, Tzourio-Mazoyer N (2003) Evaluation of the dual route theory of reading: A metanalysis of 35 neuroimaging studies. Neuroimage 20:693–712.
116. Turkeltaub PE, Gareau L, Flowers DL, Zeffiro TA, Eden GF (2003) Development of neural mechanisms for reading. Nat Neurosci 6:767–773.
117. Hamberger MJ, Goodman RR, Perrine K, Tamny T (2001) Anatomic dissociation of auditory and visual naming in the lateral temporal cortex. Neurology 56:56–61.
118. Hamberger MJ, McClelland S, III, McKhann GM, II, Williams AC, Goodman RR (2007) Distribution of auditory and visual naming sites in nonlesional temporal lobe epilepsy patients and patients with space-occupying temporal lobe lesions. Epilepsia 48:531–538.
119. Blau V, van Atteveldt N, Formisano E, Goebel R, Blomert L (2008) Task-irrelevant visual letters interact with the processing of speech sounds in heteromodal and unimodal cortex. Eur J Neurosci 28:500–509.
120. van Atteveldt NM, Blau VC, Blomert L, Goebel R (2010) fMR-adaptation indicates selectivity to audiovisual content congruency in distributed clusters in human superior temporal cortex. BMC Neurosci 11:11.
121. Beauchamp MS, Nath AR, Pasalar S (2010) fMRI-Guided transcranial magnetic stimulation reveals that the superior temporal sulcus is a cortical locus of the McGurk effect. J Neurosci 30:2414–2417.
122. Nath AR, Beauchamp MS (2011) Dynamic changes in superior temporal sulcus connectivity during perception of noisy audiovisual speech. J Neurosci 31:1704–1714.
123. Ison MJ, Quiroga RQ (2008) Selectivity and invariance for visual object perception. Front Biosci 13:4889–4903.
124. Kuhl PK (2004) Early language acquisition: Cracking the speech code. Nat Rev Neurosci 5:831–843.
125. Glezer LS, Jiang X, Riesenhuber M (2009) Evidence for highly selective neuronal tuning to whole words in the “visual word form area”. Neuron 62:199–204.
126. Cappelle B, Shtyrov Y, Pulvermüller F (2010) Heating up or cooling up the brain? MEG evidence that phrasal verbs are lexical units. Brain Lang 115:189–201.
127. Scott SK, Blank CC, Rosen S, Wise RJS (2000) Identification of a pathway for intelligible speech in the left temporal lobe. Brain 123:2400–2406.
128. Binder JR, Desai RH, Graves WW, Conant LL (2009) Where is the semantic system? A critical review and meta-analysis of 120 functional neuroimaging studies. Cereb Cortex 19:2767–2796.
129. Rogalsky C, Hickok G (2011) The role of Broca’s area in sentence comprehension. J Cogn Neurosci 23:1664–1680.
130. Obleser J, Meyer L, Friederici AD (2011) Dynamic assignment of neural resources in auditory comprehension of complex sentences. Neuroimage 56:2310–2320.
131. Humphries C, Binder JR, Medler DA, Liebenthal E (2006) Syntactic and semantic modulation of neural activity during auditory sentence comprehension. J Cogn Neurosci 18:665–679.
132. Tyler LK, Marslen-Wilson W (2008) Fronto-temporal brain systems supporting spoken language comprehension. Philos Trans R Soc Lond B Biol Sci 363:1037–1054.
133. Friederici AD, Kotz SA, Scott SK, Obleser J (2010) Disentangling syntax and intelligibility in auditory language comprehension. Hum Brain Mapp 31:448–457.
134. Guenther FH (1994) A neural network model of speech acquisition and motor equivalent speech production. Biol Cybern 72:43–53.
135. Cohen YE, Andersen RA (2002) A common reference frame for movement plans in the posterior parietal cortex. Nat Rev Neurosci 3:553–562.
136. Hackett TA, et al. (2007) Sources of somatosensory input to the caudal belt areas of auditory cortex. Perception 36:1419–1430.
137. Smiley JF, et al. (2007) Multisensory convergence in auditory cortex, I. Cortical connections of the caudal superior temporal plane in macaque monkeys. J Comp Neurol 502:894–923.
138. Hackett TA, et al. (2007) Multisensory convergence in auditory cortex, II. Thalamocortical connections of the caudal superior temporal plane. J Comp Neurol 502:924–952.
139. Dhanjal NS, Handunnetthi L, Patel MC, Wise RJS (2008) Perceptual systems controlling speech production. J Neurosci 28:9969–9975.
140. Baddeley A (2003) Working memory: Looking back and looking forward. Nat Rev Neurosci 4:829–839.
141. Fitch WT (2000) The evolution of speech: A comparative review. Trends Cogn Sci 4:258–267.
142. McCandliss BD, Cohen L, Dehaene S (2003) The visual word form area: Expertise for reading in the fusiform gyrus. Trends Cogn Sci 7:293–299.
143. Wandell BA, Rauschecker AM, Yeatman JD (2012) Learning to see words. Ann Rev Psychol 63:31–53.
144. Turkeltaub PE, Flowers DL, Lyon LG, Eden GF (2008) Development of ventral stream representations for single letters. Ann N Y Acad Sci 1145:13–29.
145. Joseph JE, Cerullo MA, Farley AB, Steinmetz NA, Mier CR (2006) fMRI correlates of cortical specialization and generalization for letter processing. Neuroimage 32:806–820.
146. Pernet C, Celsis P, Démonet J-F (2005) Selective response to letter categorization within the left fusiform gyrus. Neuroimage 28:738–744.
147. Callan AM, Callan DE, Masaki S (2005) When meaningless symbols become letters: Neural activity change in learning new phonograms. Neuroimage 28:553–562.
148. Longcamp M, Anton J-L, Roth M, Velay J-L (2005) Premotor activations in response to visually presented single letters depend on the hand used to write: A study on left-handers. Neuropsychologia 43:1801–1809.
149. Flowers DL, et al. (2004) Attention to single letters activates left extrastriate cortex. Neuroimage 21:829–839.
150. Longcamp M, Anton J-L, Roth M, Velay J-L (2003) Visual presentation of single letters activates a premotor area involved in writing. Neuroimage 19:1492–1500.
151. Logothetis NK, Pauls J (1995) Psychophysical and physiological evidence for viewer-centered object representations in the primate. Cereb Cortex 5:270–288.
152. Dehaene S, et al. (2010) Why do children make mirror errors in reading? Neural correlates of mirror invariance in the visual word form area. Neuroimage 49:1837–1848.
153. Pegado F, Nakamura K, Cohen L, Dehaene S (2011) Breaking the symmetry: Mirror discrimination for single letters but not for pictures in the visual word form area. Neuroimage 55:742–749.
154. Lancaster JL, et al. (2007) Bias between MNI and Talairach coordinates analyzed using the ICBM-152 brain template. Hum Brain Mapp 28:1194–1205.
155. Eickhoff SB, et al. (2009) Coordinate-based activation likelihood estimation meta-analysis of neuroimaging data: a random-effects approach based on empirical estimates of spatial uncertainty. Hum Brain Mapp 30:2907–2926.
156. Turkeltaub PE, et al. (2012) Minimizing within-experiment and within-group effects in activation likelihood estimation meta-analyses. Hum Brain Mapp 33:1–13.
157. Genovese CR, Lazar NA, Nichols T (2002) Thresholding of statistical maps in functional neuroimaging using the false discovery rate. Neuroimage 15:870–878.
158. Van Essen DC (2005) A Population-Average, Landmark- and Surface-based (PALS) atlas of human cerebral cortex. Neuroimage 28:635–662.