Foundations and Trends® in Signal Processing 7:3-4

Deep Learning: Methods and Applications
Li Deng and Dong Yu

now
the essence of knowledge

Deep Learning: Methods and Applications provides an overview of general deep learning
methodology and its applications to a variety of signal and information processing tasks. The
application areas are chosen with the following three criteria in mind: (1) expertise or knowledge
of the authors; (2) the application areas that have already been transformed by the successful
use of deep learning technology, such as speech recognition and computer vision; and (3) the
application areas that have the potential to be impacted significantly by deep learning and that
have been benefitting from recent research efforts, including natural language and text
processing, information retrieval, and multimodal information processing empowered by
multi-task deep learning.

Deep Learning: Methods and Applications is a timely and important book for researchers and
students with an interest in deep learning methodology and its applications in signal and
information processing.

“This book provides an overview of a sweeping range of up-to-date deep learning
methodologies and their application to a variety of signal and information processing tasks,
including not only automatic speech recognition (ASR), but also computer vision, language
modeling, text processing, multimodal learning, and information retrieval. This is the first and
the most valuable book for “deep and wide learning” of deep learning, not to be missed by
anyone who wants to know the breathtaking impact of deep learning on many facets of
information processing, especially ASR, all of vital importance to our modern technological
society.”
Sadaoki Furui, President of Toyota Technological Institute at Chicago, and
Professor at the Tokyo Institute of Technology

This book is originally published as Foundations and Trends® in Signal Processing,
Volume 7, Issues 3-4, ISSN: 1932-8346.

Foundations and Trends® in Signal Processing
Vol. 7, Nos. 3–4 (2013) 197–387
© 2014 L. Deng and D. Yu
DOI: 10.1561/2000000039

Deep Learning: Methods and Applications

Li Deng
Microsoft Research
One Microsoft Way
Redmond, WA 98052; USA
deng@microsoft.com

Dong Yu
Microsoft Research
One Microsoft Way
Redmond, WA 98052; USA
Dong.Yu@microsoft.com

Contents

1 Introduction 198

1.1 Definitions and background . . . . . . . . . . . . . . . . . 198
1.2 Organization of this monograph . . . . . . . . . . . . . . 202

2 Some Historical Context of Deep Learning 205

3 Three Classes of Deep Learning Networks 214


3.1 A three-way categorization . . . . . . . . . . . . . . . . . 214
3.2 Deep networks for unsupervised or generative learning . . . 216
3.3 Deep networks for supervised learning . . . . . . . . . . . 223
3.4 Hybrid deep networks . . . . . . . . . . . . . . . . . . . . 226

4 Deep Autoencoders — Unsupervised Learning 230


4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 230
4.2 Use of deep autoencoders to extract speech features . . . 231
4.3 Stacked denoising autoencoders . . . . . . . . . . . . . . . 235
4.4 Transforming autoencoders . . . . . . . . . . . . . . . . . 239

5 Pre-Trained Deep Neural Networks — A Hybrid 241


5.1 Restricted Boltzmann machines . . . . . . . . . . . . . . . 241
5.2 Unsupervised layer-wise pre-training . . . . . . . . . . . . 245
5.3 Interfacing DNNs with HMMs . . . . . . . . . . . . . . . 248


6 Deep Stacking Networks and Variants — Supervised Learning 250

6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 250
6.2 A basic architecture of the deep stacking network . . . . . 252
6.3 A method for learning the DSN weights . . . . . . . . . . 254
6.4 The tensor deep stacking network . . . . . . . . . . . . . . 255
6.5 The Kernelized deep stacking network . . . . . . . . . . . 257

7 Selected Applications in Speech and Audio Processing 262

7.1 Acoustic modeling for speech recognition . . . . . . . . . . 262
7.2 Speech synthesis . . . . . . . . . . . . . . . . . . . . . . . 286
7.3 Audio and music processing . . . . . . . . . . . . . . . . . 288

8 Selected Applications in Language Modeling and Natural Language Processing 292

8.1 Language modeling . . . . . . . . . . . . . . . . . . . . . 293
8.2 Natural language processing . . . . . . . . . . . . . . . . . 299

9 Selected Applications in Information Retrieval 308

9.1 A brief introduction to information retrieval . . . . . . . . 308
9.2 SHDA for document indexing and retrieval . . . . . . . . . 310
9.3 DSSM for document retrieval . . . . . . . . . . . . . . . . 311
9.4 Use of deep stacking networks for information retrieval . . 317

10 Selected Applications in Object Recognition and Computer Vision 320

10.1 Unsupervised or generative feature learning . . . . . . . . 321
10.2 Supervised feature learning and classification . . . . . . . . 324

11 Selected Applications in Multimodal and Multi-task Learning 331

11.1 Multi-modalities: Text and image . . . . . . . . . . . . . . 332
11.2 Multi-modalities: Speech and image . . . . . . . . . . . . 336
11.3 Multi-task learning within the speech, NLP or image . . . 339

12 Conclusion 343

References 349

Abstract

This monograph provides an overview of general deep learning methodology and its
applications to a variety of signal and information processing tasks. The application
areas are chosen with the following three criteria in mind: (1) expertise or knowledge
of the authors; (2) the application areas that have already been transformed by the
successful use of deep learning technology, such as speech recognition and computer
vision; and (3) the application areas that have the potential to be impacted
significantly by deep learning and that have been experiencing research growth,
including natural language and text processing, information retrieval, and multimodal
information processing empowered by multi-task deep learning.

L. Deng and D. Yu. Deep Learning: Methods and Applications. Foundations and Trends® in
Signal Processing, vol. 7, nos. 3–4, pp. 197–387, 2013.
DOI: 10.1561/2000000039.

1 Introduction

1.1 Definitions and background

Since 2006, deep structured learning, or more commonly called deep learning or
hierarchical learning, has emerged as a new area of machine learning research [20, 163].
During the past several years, the techniques developed from deep learning research have
already been impacting a wide range of signal and information processing work within the
traditional and the new, widened scopes including key aspects of machine learning and
artificial intelligence; see overview articles in [7, 20, 24, 77, 94, 161, 412], and also
the media coverage of this progress in [6, 237]. A series of workshops, tutorials, and
special issues or conference special sessions in recent years have been devoted
exclusively to deep learning and its applications to various signal and information
processing areas. These include:

• 2008 NIPS Deep Learning Workshop;
• 2009 NIPS Workshop on Deep Learning for Speech Recognition and Related Applications;
• 2009 ICML Workshop on Learning Feature Hierarchies;
• 2011 ICML Workshop on Learning Architectures, Representations, and Optimization for
  Speech and Visual Information Processing;
• 2012 ICASSP Tutorial on Deep Learning for Signal and Information Processing;
• 2012 ICML Workshop on Representation Learning;
• 2012 Special Section on Deep Learning for Speech and Language Processing in IEEE
  Transactions on Audio, Speech, and Language Processing (T-ASLP, January);
• 2010, 2011, and 2012 NIPS Workshops on Deep Learning and Unsupervised Feature Learning;
• 2013 NIPS Workshops on Deep Learning and on Output Representation Learning;
• 2013 Special Issue on Learning Deep Architectures in IEEE Transactions on Pattern
  Analysis and Machine Intelligence (T-PAMI, September);
• 2013 International Conference on Learning Representations;
• 2013 ICML Workshop on Representation Learning Challenges;
• 2013 ICML Workshop on Deep Learning for Audio, Speech, and Language Processing;
• 2013 ICASSP Special Session on New Types of Deep Neural Network Learning for Speech
  Recognition and Related Applications.

The authors have been actively involved in deep learning research and in organizing or
providing several of the above events, tutorials, and editorials. In particular, they
gave tutorials and invited lectures on this topic at various places. Part of this
monograph is based on their tutorials and lecture material.

Before embarking on describing details of deep learning, let’s provide necessary
definitions. Deep learning has various closely related definitions or high-level
descriptions:

• Definition 1: A class of machine learning techniques that exploit many layers of
  non-linear information processing for supervised or unsupervised feature extraction
  and transformation, and for pattern analysis and classification.
• Definition 2: “A sub-field within machine learning that is based on algorithms for
  learning multiple levels of representation in order to model complex relationships
  among data. Higher-level features and concepts are thus defined in terms of lower-level
  ones, and such a hierarchy of features is called a deep architecture. Most of these
  models are based on unsupervised learning of representations.” (Wikipedia on “Deep
  Learning” around March 2012.)
• Definition 3: “A sub-field of machine learning that is based on learning several levels
  of representations, corresponding to a hierarchy of features or factors or concepts,
  where higher-level concepts are defined from lower-level ones, and the same lower-level
  concepts can help to define many higher-level concepts. Deep learning is part of a
  broader family of machine learning methods based on learning representations. An
  observation (e.g., an image) can be represented in many ways (e.g., a vector of
  pixels), but some representations make it easier to learn tasks of interest (e.g., is
  this the image of a human face?) from examples, and research in this area attempts to
  define what makes better representations and how to learn them.” (Wikipedia on “Deep
  Learning” around February 2013.)
• Definition 4: “Deep learning is a set of algorithms in machine learning that attempt to
  learn in multiple levels, corresponding to different levels of abstraction. It
  typically uses artificial neural networks. The levels in these learned statistical
  models correspond to distinct levels of concepts, where higher-level concepts are
  defined from lower-level ones, and the same lower-level concepts can help to define
  many higher-level concepts.” See Wikipedia http://en.wikipedia.org/wiki/Deep_learning
  on “Deep Learning” as of this most recent update in October 2013.
• Definition 5: “Deep Learning is a new area of Machine Learning research, which has been
  introduced with the objective of moving Machine Learning closer to one of its original
  goals: Artificial
  Intelligence. Deep Learning is about learning multiple levels of representation and
  abstraction that help to make sense of data such as images, sound, and text.” See
  https://github.com/lisa-lab/DeepLearningTutorials

Note that the deep learning that we discuss in this monograph is about learning with deep
architectures for signal and information processing. It is not about deep understanding
of the signal or information, although in many cases they may be related. It should also
be distinguished from the overloaded term in educational psychology: “Deep learning
describes an approach to learning that is characterized by active engagement, intrinsic
motivation, and a personal search for meaning.”
http://www.blackwellreference.com/public/tocnode?id=g9781405161251_chunk_g97814051612516_ss1-1

Common among the various high-level descriptions of deep learning above are two key
aspects: (1) models consisting of multiple layers or stages of nonlinear information
processing; and (2) methods for supervised or unsupervised learning of feature
representation at successively higher, more abstract layers. Deep learning is in the
intersections among the research areas of neural networks, artificial intelligence,
graphical modeling, optimization, pattern recognition, and signal processing. Three
important reasons for the popularity of deep learning today are the drastically increased
chip processing abilities (e.g., general-purpose graphics processing units or GPGPUs),
the significantly increased size of data used for training, and the recent advances in
machine learning and signal/information processing research. These advances have enabled
the deep learning methods to effectively exploit complex, compositional nonlinear
functions, to learn distributed and hierarchical feature representations, and to make
effective use of both labeled and unlabeled data.

Active researchers in this area include those at University of Toronto, New York
University, University of Montreal, Stanford University, Microsoft Research (since 2009),
Google (since about 2011), IBM Research (since about 2011), Baidu (since 2012), Facebook
(since 2013), UC-Berkeley, UC-Irvine, IDIAP, IDSIA, University College London, University
of Michigan, Massachusetts Institute of Technology, University of Washington, and
numerous other places; see
http://deeplearning.net/deep-learning-research-groups-and-labs/ for a more detailed list.
These researchers have demonstrated empirical successes of deep learning in diverse
applications of computer vision, phonetic recognition, voice search, conversational
speech recognition, speech and image feature coding, semantic utterance classification,
natural language understanding, hand-writing recognition, audio processing, information
retrieval, robotics, and even in the analysis of molecules that may lead to discovery of
new drugs as reported recently by [237].

In addition to the reference list provided at the end of this monograph, which may be
outdated not long after the publication of this monograph, there are a number of
excellent and frequently updated reading lists, tutorials, software, and video lectures
online at:

• http://deeplearning.net/reading-list/
• http://ufldl.stanford.edu/wiki/index.php/UFLDL_Recommended_Readings
• http://www.cs.toronto.edu/~hinton/
• http://deeplearning.net/tutorial/
• http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial

1.2 Organization of this monograph

The rest of the monograph is organized as follows:

In Section 2, we provide a brief historical account of deep learning, mainly from the
perspective of how speech recognition technology has been hugely impacted by deep
learning, and how the revolution got started and has gained and sustained immense
momentum.

In Section 3, a three-way categorization scheme for a majority of the work in deep
learning is developed. The categories include unsupervised, supervised, and hybrid deep
learning networks, where in the last category unsupervised learning (or pre-training) is
exploited to assist the subsequent stage of supervised learning when the final tasks
pertain to classification. The supervised and hybrid deep networks often have the
same type of architectures or the structures in the deep networks, but the unsupervised
deep networks tend to have different architectures from the others.

Sections 4–6 are devoted, respectively, to three popular types of deep architectures, one
from each of the classes in the three-way categorization scheme reviewed in Section 3. In
Section 4, we discuss in detail deep autoencoders as a prominent example of the
unsupervised deep learning networks. No class labels are used in the learning, although
supervised learning methods such as back-propagation are cleverly exploited when the
input signal itself, instead of any label information of interest to possible
classification tasks, is treated as the “supervision” signal.

In Section 5, as a major example in the hybrid deep network category, we present in
detail the deep neural networks with unsupervised and largely generative pre-training to
boost the effectiveness of supervised training. This benefit is found critical when the
training data are limited and no other appropriate regularization approaches (e.g.,
dropout) are exploited. The particular pre-training method based on restricted Boltzmann
machines and the related deep belief networks described in this section has been
historically significant as it ignited the intense interest in the early applications of
deep learning to speech recognition and other information processing tasks. In addition
to this retrospective review, subsequent development and different paths from the more
recent perspective are discussed.

In Section 6, the basic deep stacking networks and their several extensions are discussed
in detail, which exemplify the discriminative, supervised deep learning networks in the
three-way classification scheme. This group of deep networks operate in many ways that
are distinct from the deep neural networks. Most notably, they use target labels in
constructing each of many layers or modules in the overall deep networks. Assumptions
made about part of the networks, such as linear output units in each of the modules,
simplify the learning algorithms and enable a much wider variety of network architectures
to be constructed and learned than the networks discussed in Sections 4 and 5.

In Sections 7–11, we select a set of typical and successful applications of deep learning
in diverse areas of signal and information processing. In Section 7, we review the
applications of deep learning to speech recognition, speech synthesis, and audio
processing. Subsections surrounding the main subject of speech recognition are created
based on several prominent themes on the topic in the literature.

In Section 8, we present recent results of applying deep learning to language modeling
and natural language processing, where we highlight the key recent development in
embedding symbolic entities such as words into low-dimensional, continuous-valued
vectors.

Section 9 is devoted to selected applications of deep learning to information retrieval
including web search.

In Section 10, we cover selected applications of deep learning to image object
recognition in computer vision. The section is divided into two main classes of deep
learning approaches: (1) unsupervised feature learning, and (2) supervised learning for
end-to-end and joint feature learning and classification.

Selected applications to multi-modal processing and multi-task learning are reviewed in
Section 11, divided into three categories according to the nature of the multi-modal data
as inputs to the deep learning systems. For single-modality data of speech, text, or
image, a number of recent multi-task learning studies in the literature, based on deep
learning methods, are reviewed.

Finally, conclusions are given in Section 12 to summarize the monograph and to discuss
future challenges and directions.

This short monograph contains the material expanded from two tutorials that the authors
gave, one at APSIPA in October 2011 and the other at ICASSP in March 2012. Substantial
updates have been made based on the literature up to January 2014 (including the
materials presented at NIPS-2013 and at IEEE-ASRU-2013 both held in December of 2013),
focusing on practical aspects in the fast development of deep learning research and
technology during the interim years.
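
To make the Section 4 preview above concrete, the following minimal Python sketch trains a
toy autoencoder in which the input itself serves as the “supervision” signal. It is an
illustration under stated assumptions rather than a method taken from this monograph: the
model is a shallow, linear, tied-weight autoencoder with a single hidden unit, and the
data set, the names `reconstruct`, `mean_error` and `train`, the learning rate, and the
epoch count are all hypothetical choices.

```python
import random

random.seed(0)

# Hypothetical unlabeled data: 2-D points lying close to the direction (0.6, 0.8).
data = []
for _ in range(200):
    t = random.uniform(-1.0, 1.0)
    data.append((0.6 * t + random.gauss(0.0, 0.01),
                 0.8 * t + random.gauss(0.0, 0.01)))


def reconstruct(w, x):
    """Encode x into one hidden value h = w.x, then decode with the same (tied) weights."""
    h = w[0] * x[0] + w[1] * x[1]
    return h, (w[0] * h, w[1] * h)


def mean_error(w):
    """Mean squared reconstruction error; the training target is the input itself."""
    total = 0.0
    for x in data:
        _, x_hat = reconstruct(w, x)
        total += (x[0] - x_hat[0]) ** 2 + (x[1] - x_hat[1]) ** 2
    return total / len(data)


def train(w, lr=0.1, epochs=50):
    """Per-sample gradient descent on the squared reconstruction error."""
    w = list(w)
    for _ in range(epochs):
        for x in data:
            h, x_hat = reconstruct(w, x)
            r = (x[0] - x_hat[0], x[1] - x_hat[1])  # residual: input minus reconstruction
            rw = r[0] * w[0] + r[1] * w[1]
            # Chain rule through decoder and encoder for the tied weights:
            # d|r|^2 / dw_j = -2 * (r_j * h + (r . w) * x_j)
            g = [-2.0 * (r[j] * h + rw * x[j]) for j in range(2)]
            w = [w[j] - lr * g[j] for j in range(2)]
    return w


w_init = [0.1, -0.2]
w_trained = train(w_init)
print("reconstruction error before training:", mean_error(w_init))
print("reconstruction error after training: ", mean_error(w_trained))
```

No labels appear anywhere in the loop; the gradient comes from applying the chain rule to
the reconstruction error, which is the sense in which back-propagation is reused for
unsupervised learning. A deep autoencoder would stack several nonlinear layers in place of
the single linear unit used here.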
2 Some Historical Context of Deep Learning

Until recently, most machine learning and signal processing techniques had exploited
shallow-structured architectures. These architectures typically contain at most one or
two layers of nonlinear feature transformations. Examples of the shallow architectures
are Gaussian mixture models (GMMs), linear or nonlinear dynamical systems, conditional
random fields (CRFs), maximum entropy (MaxEnt) models, support vector machines (SVMs),
logistic regression, kernel regression, and multilayer perceptrons (MLPs) with a single
hidden layer including extreme learning machines (ELMs). For instance, SVMs use a shallow
linear pattern separation model with one feature transformation layer when the kernel
trick is used and none otherwise. (Notable exceptions are the recent kernel methods that
have been inspired by and integrated with deep learning; e.g., [9, 53, 102, 377].)
Shallow architectures have been shown effective in solving many simple or
well-constrained problems, but their limited modeling and representational power can
cause difficulties when dealing with more complicated real-world applications involving
natural signals such as human speech, natural sound and language, and natural image and
visual scenes.

Human information processing mechanisms (e.g., vision and audition), however, suggest the
need of deep architectures for extracting complex structure and building internal
representation from rich sensory inputs. For example, human speech production and
perception systems are both equipped with clearly layered hierarchical structures in
transforming the information from the waveform level to the linguistic level
[11, 12, 74, 75]. In a similar vein, the human visual system is also hierarchical in
nature, mostly in the perception side but interestingly also in the “generation” side
[43, 126, 287]. It is natural to believe that the state-of-the-art can be advanced in
processing these types of natural signals if efficient and effective deep learning
algorithms can be developed.

Historically, the concept of deep learning originated from artificial neural network
research. (Hence, one may occasionally hear the discussion of “new-generation neural
networks.”) Feed-forward neural networks or MLPs with many hidden layers, which are often
referred to as deep neural networks (DNNs), are good examples of the models with a deep
architecture. Back-propagation (BP), popularized in the 1980s, has been a well-known
algorithm for learning the parameters of these networks. Unfortunately, BP alone did not
work well in practice then for learning networks with more than a small number of hidden
layers (see a review and analysis in [20, 129]). The pervasive presence of local optima
and other optimization challenges in the non-convex objective function of the deep
networks is the main source of difficulties in the learning. BP is based on local
gradient information, and usually starts at some random initial points. It often gets
trapped in poor local optima when the batch-mode or even stochastic gradient descent BP
algorithm is used. The severity increases significantly as the depth of the networks
increases. This difficulty is partially responsible for steering away most of the machine
learning and signal processing research from neural networks to shallow models that have
convex loss functions (e.g., SVMs, CRFs, and MaxEnt models), for which the global optimum
can be efficiently obtained at the cost of reduced modeling power, although there had
been continuing work on neural networks with limited scale and impact (e.g.,
[42, 45, 87, 168, 212, 263, 304]).
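
The dependence of gradient-based learning on its random starting point, described above,
can be illustrated with a one-dimensional toy problem. The sketch below is an illustration
under stated assumptions, not a neural network: the objective `loss`, its two minima, the
step size, and the function names are hypothetical, chosen only so that plain gradient
descent ends in different optima depending on where it starts.

```python
import random


def loss(w):
    """Hypothetical non-convex objective with one deep and one shallow minimum."""
    return w ** 4 - 3.0 * w ** 2 + w


def grad(w):
    """Derivative of loss with respect to w."""
    return 4.0 * w ** 3 - 6.0 * w + 1.0


def gradient_descent(w0, lr=0.01, steps=2000):
    """Follow the local gradient downhill from the starting point w0."""
    w = float(w0)
    for _ in range(steps):
        w -= lr * grad(w)
    return w


random.seed(0)
# Random initial points, mimicking random weight initialization.
starts = [random.uniform(-2.0, 2.0) for _ in range(8)]
for w0 in starts:
    w = gradient_descent(w0)
    print(f"start {w0:+.3f} -> end {w:+.3f}, loss {loss(w):+.3f}")
```

Starts to the right of the local maximum near w = 0.17 settle in the shallower minimum
near w = 1.13, while starts to the left reach the deeper one near w = -1.30. The text
above attributes the training difficulty of deep networks to this kind of effect in a
much higher-dimensional parameter space.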

The optimization difficulty associated with the deep models was empirically alleviated
when a reasonably efficient, unsupervised learning algorithm was introduced in the two
seminal papers [163, 164]. In these papers, a class of deep generative models, called
deep belief network (DBN), was introduced. A DBN is composed of a stack of restricted
Boltzmann machines (RBMs). A core component of the DBN is a greedy, layer-by-layer
learning algorithm which optimizes DBN weights at time complexity linear to the size and
depth of the networks. Separately and with some surprise, initializing the weights of an
MLP with a correspondingly configured DBN often produces much better results than that
with the random weights. As such, MLPs with many hidden layers, or deep neural networks
(DNNs), which are learned with unsupervised DBN pre-training followed by back-propagation
fine-tuning, are sometimes also called DBNs in the literature [67, 260, 258]. More
recently, researchers have been more careful in distinguishing DNNs from DBNs [68, 161],
and when a DBN is used to initialize the training of a DNN, the resulting network is
sometimes called the DBN–DNN [161].

Independently of the RBM development, in 2006 two alternative, non-probabilistic,
non-generative, unsupervised deep models were published. One is an autoencoder variant
with greedy layer-wise training much like the DBN training [28]. Another is an
energy-based model with unsupervised learning of sparse over-complete representations
[297]. They both can be effectively used to pre-train a deep neural network, much like
the DBN.

In addition to the supply of good initialization points, the DBN comes with other
attractive properties. First, the learning algorithm makes effective use of unlabeled
data. Second, it can be interpreted as a probabilistic generative model. Third, the
over-fitting problem, which is often observed in the models with millions of parameters
such as DBNs, and the under-fitting problem, which occurs often in deep networks, can be
effectively alleviated by the generative pre-training step. An insightful analysis on
what kinds of speech information DBNs can capture is provided in [259].

Using hidden layers with many neurons in a DNN significantly improves the modeling power
of the DNN and creates many closely optimal configurations. Even if parameter learning is
trapped into a local optimum, the resulting DNN can still perform quite well since the
chance of having a poor local optimum is lower than when a small number of neurons are
used in the network. Using deep and wide neural networks, however, would cast great
demand to the computational power during the training process and this is one of the
reasons why it is not until recent years that researchers have started exploring both
deep and wide neural networks in a serious manner.

Better learning algorithms and different nonlinearities also contributed to the success
of DNNs. Stochastic gradient descent (SGD) algorithms are the most efficient algorithms
when the training set is large and redundant, as is the case for most applications [39].
Recently, SGD has been shown to be effective for parallelizing over many machines with an
asynchronous mode [69] or over multiple GPUs through pipelined BP [49]. Further, SGD can
often allow the training to jump out of local optima due to the noisy gradients estimated
from a single or a small batch of samples. Other learning algorithms such as Hessian free
[195, 238] or Krylov subspace methods [378] have shown a similar ability.

For the highly non-convex optimization problem of DNN learning, it is obvious that better
parameter initialization techniques will lead to better models since optimization starts
from these initial models. What was not obvious until more recently, however, is how to
efficiently and effectively initialize DNN parameters and how the use of large amounts of
training data can alleviate the learning problem
[28, 20, 100, 64, 68, 163, 164, 161, 323, 376, 414]. The DNN parameter initialization
technique that attracted the most attention is the unsupervised pretraining technique
proposed in [163, 164] discussed earlier.

The DBN pretraining procedure is not the only one that allows effective initialization of
DNNs. An alternative unsupervised approach that performs equally well is to pretrain DNNs
layer by layer by considering each pair of layers as a de-noising autoencoder regularized
by setting a random subset of the input nodes to zero [20, 376]. Another alternative is
to use contractive autoencoders for the same purpose by favoring representations that are
more robust to the input variations, i.e., penalizing the gradient of the activities of
the hidden units with respect to the inputs [303]. Further, Ranzato et al. [294]
developed the
209 210 Some Historical Context of Deep Learning

sparse encoding symmetric machine (SESM), which has a very similar


architecture to RBMs as building blocks of a DBN. The SESM may also
be used to effectively initialize the DNN training. In addition to unsu-
pervised pretraining using greedy layer-wise procedures [28, 164, 295],
the supervised pretraining, or sometimes called discriminative pretrain-
ing, has also been shown to be effective [28, 161, 324, 432] and in cases
where labeled training data are abundant performs better than the
unsupervised pretraining techniques. The idea of the discriminative
pretraining is to start from a one-hidden-layer MLP trained with the
BP algorithm. Every time when we want to add a new hidden layer we
replace the output layer with a randomly initialized new hidden and
output layer and train the whole new MLP (or DNN) using the BP
algorithm. Different from the unsupervised pretraining techniques, the
discriminative pretraining technique requires labels.
Researchers who apply deep learning to speech and vision analyzed
what DNNs capture in speech and images. For example, [259] applied
a dimensionality reduction method to visualize the relationship among
the feature vectors learned by the DNN. They found that the DNN’s
hidden activity vectors preserve the similarity structure of the feature
vectors at multiple scales, and that this is especially true for the
filterbank features. A more elaborate visualization method, based on
a top-down generative process in the reverse direction of the
classification network, was recently developed by Zeiler and Fergus [436]
for examining what features the deep convolutional networks capture
from the image data. The power of the deep networks is shown to
be their ability to extract appropriate features and do discrimination
jointly [210].

As another way to concisely introduce the DNN, we can review the
history of artificial neural networks using a “hype cycle,” which is a
graphic representation of the maturity, adoption and social
application of specific technologies. The 2012 version of the hype cycle
graph compiled by Gartner is shown in Figure 2.1. It intends to show how
a technology or application will evolve over time (according to five
phases: technology trigger, peak of inflated expectations, trough of
disillusionment, slope of enlightenment, and plateau of productivity), and
to provide a source of insight to manage its deployment.

Figure 2.1: Gartner hype cycle graph representing five phases of a technology
(http://en.wikipedia.org/wiki/Hype_cycle).

Applying the Gartner hype cycle to the artificial neural network
development, we created Figure 2.2 to align different generations of
the neural network with the various phases designated in the hype
cycle. The peak activities (“expectations” or “media hype” on the
vertical axis) occurred in the late 1980s and early 1990s, corresponding to
the height of what is often referred to as the “second generation” of
neural networks. The deep belief network (DBN) and a fast algorithm for
training it were invented in 2006 [163, 164]. When the DBN was used
to initialize the DNN, the learning became highly effective and this has
inspired the subsequent fast growing research (“enlightenment” phase
shown in Figure 2.2). Applications of the DBN and DNN to
industry-scale speech feature extraction and speech recognition started in 2009
when leading academic and industrial researchers with both deep
learning and speech expertise collaborated; see reviews in [89, 161]. This
collaboration quickly expanded the work of speech recognition using deep
learning methods to increasingly larger successes [94, 161, 323, 414],

Figure 2.2: Applying the Gartner hype cycle graph to analyzing the history of artificial
neural network technology (We thank our colleague John Platt during 2012 for
bringing this type of “Hype Cycle” graph to our attention for concisely analyzing
the neural network history).

Figure 2.3: The famous NIST plot showing the historical speech recognition error
rates achieved by the GMM-HMM approach for a number of increasingly difficult
speech recognition tasks. Data source:
http://itl.nist.gov/iad/mig/publications/ASRhistory/index.html

many of which will be covered in the remainder of this monograph.
The height of the “plateau of productivity” phase, not yet reached in
our opinion, is expected to be higher than that in the stereotypical
curve (circled with a question mark in Figure 2.2), and is marked by
the dashed line that moves straight up.

We show in Figure 2.3 the history of speech recognition, which
has been compiled by NIST, organized by plotting the word error rate
(WER) as a function of time for a number of increasingly difficult
speech recognition tasks. Note that all WER results were obtained using
the GMM–HMM technology. When one particularly difficult task
(Switchboard) is extracted from Figure 2.3, we see a flat curve over many
years using the GMM–HMM technology, but after the DNN technology
is used the WER drops sharply (marked by the red star in Figure 2.4).
Figure 2.4: Extracting WERs of one task from Figure 2.3 and adding the signifi-
cantly lower WER (marked by the star) achieved by the DNN technology.
In the next section, an overview is provided on the various
architectures of deep learning, followed by more detailed expositions of a few
widely studied architectures and methods and by selected applications
in signal and information processing including speech and audio,
natural language, information retrieval, vision, and multi-modal processing.

3 Three Classes of Deep Learning Networks

3.1 A three-way categorization

As described earlier, deep learning refers to a rather wide class of
machine learning techniques and architectures, with the hallmark
of using many layers of non-linear information processing that are
hierarchical in nature. Depending on how the architectures and
techniques are intended for use, e.g., synthesis/generation or
recognition/classification, one can broadly categorize most of the work
in this area into three major classes:

1. Deep networks for unsupervised or generative learning,
which are intended to capture high-order correlation of the
observed or visible data for pattern analysis or synthesis purposes
when no information about target class labels is available.
Unsupervised feature or representation learning in the literature refers
to this category of the deep networks. When used in the generative
mode, they may also be intended to characterize joint statistical
distributions of the visible data and their associated classes when
available and being treated as part of the visible data. In the


latter case, the use of Bayes rule can turn this type of generative
networks into a discriminative one for learning.

2. Deep networks for supervised learning, which are intended
to directly provide discriminative power for pattern classification
purposes, often by characterizing the posterior distributions
of classes conditioned on the visible data. Target label data are
always available in direct or indirect forms for such supervised
learning. They are also called discriminative deep networks.

3. Hybrid deep networks, where the goal is discrimination which
is assisted, often in a significant way, with the outcomes of
generative or unsupervised deep networks. This can be accomplished by
better optimization or/and regularization of the deep networks
in category (2). The goal can also be accomplished when
discriminative criteria for supervised learning are used to estimate the
parameters in any of the deep generative or unsupervised deep
networks in category (1) above.

Note the use of “hybrid” in (3) above is different from that used
sometimes in the literature, which refers to the hybrid systems for
speech recognition feeding the output probabilities of a neural network
into an HMM [17, 25, 42, 261].

By the commonly adopted machine learning tradition (e.g.,
Chapter 28 in [264] and Reference [95]), it may be natural to just
classify deep learning techniques into deep discriminative models (e.g., deep
neural networks or DNNs, recurrent neural networks or RNNs,
convolutional neural networks or CNNs, etc.) and generative/unsupervised
models (e.g., restricted Boltzmann machines or RBMs, deep belief
networks or DBNs, deep Boltzmann machines or DBMs, regularized
autoencoders, etc.). This two-way classification scheme, however,
misses a key insight gained in deep learning research about how
generative or unsupervised-learning models can greatly improve the training
of DNNs and other deep discriminative or supervised-learning
models via better regularization or optimization. Also, deep networks for
unsupervised learning may not necessarily need to be probabilistic or be
able to meaningfully sample from the model (e.g., traditional
autoencoders, sparse coding networks, etc.). We note here that more recent
studies have generalized the traditional denoising autoencoders so that
they can be efficiently sampled from and thus have become
generative models [5, 24, 30]. Nevertheless, the traditional two-way
classification indeed points to several key differences between deep networks
for unsupervised and supervised learning. Comparing the two,
deep supervised-learning models such as DNNs are usually more
efficient to train and test, more flexible to construct, and more suitable for
end-to-end learning of complex systems (e.g., no approximate inference
and learning such as loopy belief propagation). On the other hand, the
deep unsupervised-learning models, especially the probabilistic
generative ones, are easier to interpret, easier to embed domain knowledge,
easier to compose, and easier to handle uncertainty, but they are
typically intractable in inference and learning for complex systems. These
distinctions are retained also in the proposed three-way classification
which is hence adopted throughout this monograph.

Below we review representative work in each of the above three
categories, where several basic definitions are summarized in Table 3.1.
Applications of these deep architectures, with varied ways of
learning including supervised, unsupervised, or hybrid, are deferred to
Sections 7–11.

3.2 Deep networks for unsupervised or generative learning

Unsupervised learning refers to learning that uses no task-specific
supervision information (e.g., target class labels) in the learning process.
Many deep networks in this category can be used to meaningfully generate
samples by sampling from the networks, with examples of RBMs, DBNs, DBMs,
and generalized denoising autoencoders [23], and are thus generative
models. Some networks in this category, however, cannot be easily
sampled, with examples of sparse coding networks and the original forms
of deep autoencoders, and are thus not generative in nature.

Among the various subclasses of generative or unsupervised deep
networks, the energy-based deep models are the most common [28, 20,
213, 268]. The original form of the deep autoencoder [28, 100, 164],
which we will give more detail about in Section 4, is a typical example

Table 3.1: Basic deep learning terminologies.

Deep Learning: a class of machine learning techniques, where many
layers of information processing stages in hierarchical supervised
architectures are exploited for unsupervised feature learning and for
pattern analysis/classification. The essence of deep learning is to
compute hierarchical features or representations of the observational
data, where the higher-level features or factors are defined from
lower-level ones. The family of deep learning methods has been
growing increasingly rich, encompassing those of neural networks,
hierarchical probabilistic models, and a variety of unsupervised and
supervised feature learning algorithms.

Deep belief network (DBN): probabilistic generative models
composed of multiple layers of stochastic, hidden variables. The top
two layers have undirected, symmetric connections between them.
The lower layers receive top-down, directed connections from the
layer above.

Boltzmann machine (BM): a network of symmetrically connected,
neuron-like units that make stochastic decisions about whether to be
on or off.

Restricted Boltzmann machine (RBM): a special type of BM
consisting of a layer of visible units and a layer of hidden units with
no visible-visible or hidden-hidden connections.

Deep neural network (DNN): a multilayer perceptron with many
hidden layers, whose weights are fully connected and are often
(although not always) initialized using either an unsupervised or a
supervised pretraining technique. (In the literature prior to 2012, a
DBN was often used incorrectly to mean a DNN.)

Deep autoencoder: a “discriminative” DNN whose output targets
are the data input itself rather than class labels; hence an
unsupervised learning model. When trained with a denoising
criterion, a deep autoencoder is also a generative model and can be
sampled from.

Distributed representation: an internal representation of the
observed data in such a way that they are modeled as being explained
by the interactions of many hidden factors. A particular factor
learned from configurations of other factors can often generalize well
to new configurations. Distributed representations naturally occur in
a “connectionist” neural network, where a concept is represented by a
pattern of activity across a number of units and where at the same
time a unit typically contributes to many concepts. One key
advantage of such many-to-many correspondence is that it provides
robustness in representing the internal structure of the data in terms
of graceful degradation and damage resistance. Another key
advantage is that it facilitates generalizations of concepts and
relations, thus enabling reasoning abilities.

of this unsupervised model category. Most other forms of deep
autoencoders are also unsupervised in nature, but with quite different
properties and implementations. Examples are transforming autoencoders
[160], predictive sparse coders and their stacked version, and de-noising
autoencoders and their stacked versions [376].

Specifically, in de-noising autoencoders, the input vectors are first
corrupted by, for example, randomly selecting a percentage of the
inputs and setting them to zeros or adding Gaussian noise to them.
Then the parameters are adjusted for the hidden encoding nodes to
reconstruct the original, uncorrupted input data using criteria such as
mean square reconstruction error and KL divergence between the
original inputs and the reconstructed inputs. The encoded representations
transformed from the uncorrupted data are used as the inputs to the
next level of the stacked de-noising autoencoder.

Another prominent type of deep unsupervised models with
generative capability is the deep Boltzmann machine or DBM [131, 315, 316,
348]. A DBM contains many layers of hidden variables, and has no
connections between the variables within the same layer. This is a special
case of the general Boltzmann machine (BM), which is a network of

symmetrically connected units that are on or off based on a stochastic
mechanism. While having a simple learning algorithm, the general BMs
are very complex to study and very slow to train. In a DBM, each layer
captures complicated, higher-order correlations between the activities
of hidden features in the layer below. DBMs have the potential of
learning internal representations that become increasingly complex, highly
desirable for solving object and speech recognition problems. Further,
the high-level representations can be built from a large supply of
unlabeled sensory inputs and very limited labeled data can then be used to
only slightly fine-tune the model for a specific task at hand.

When the number of hidden layers of a DBM is reduced to one, we
have the restricted Boltzmann machine (RBM). As in the DBM, there are no
hidden-to-hidden and no visible-to-visible connections in the RBM. The
main virtue of the RBM is that via composing many RBMs, many hidden
layers can be learned efficiently using the feature activations of one
RBM as the training data for the next. Such composition leads to the deep
belief network (DBN), which we will describe in more detail, together
with RBMs, in Section 5.

The standard DBN has been extended to the factored higher-order
Boltzmann machine in its bottom layer, with strong results obtained
for phone recognition [64] and for computer vision [296]. This model,
called the mean-covariance RBM or mcRBM, recognizes the limitation
of the standard RBM in its ability to represent the covariance structure
of the data. However, it is difficult to train mcRBMs and to use them
at the higher levels of the deep architecture. Further, the strong results
published are not easy to reproduce. In the architecture described by
Dahl et al. [64], the mcRBM parameters in the full DBN are not
fine-tuned using the discriminative information, which is used for fine tuning
the higher layers of RBMs, due to the high computational cost.
Subsequent work showed that when speaker adapted features are used, which
remove more variability in the features, mcRBM was not helpful [259].

Another representative deep generative network that can be used
for unsupervised (as well as supervised) learning is the sum–product
network or SPN [125, 289]. An SPN is a directed acyclic graph with
the observed variables as leaves, and with sum and product operations
as internal nodes in the deep network. The “sum” nodes give mixture
models, and the “product” nodes build up the feature hierarchy.
Properties of “completeness” and “consistency” constrain the SPN in a
desirable way. The learning of SPNs is carried out using the EM algorithm
together with back-propagation. The learning procedure starts with a
dense SPN. It then finds an SPN structure by learning its weights,
where zero weights indicate removed connections. The main difficulty
in learning SPNs is that the learning signal (i.e., the gradient) quickly
dilutes when it propagates to deep layers. Empirical solutions have been
found to mitigate this difficulty as reported in [289]. It was pointed
out in that early paper that despite the many desirable generative
properties in the SPN, it is difficult to fine tune the parameters using
the discriminative information, limiting its effectiveness in
classification tasks. However, this difficulty has been overcome in the
subsequent work reported in [125], where an efficient BP-style discriminative
training algorithm for SPN was presented. Importantly, the standard
gradient descent, based on the derivative of the conditional likelihood,
suffers from the same gradient diffusion problem well known in the
regular DNNs. The trick to alleviate this problem in learning SPNs
is to replace the marginal inference with the most probable state of
the hidden variables and to propagate gradients through this “hard”
alignment only. Excellent results on small-scale image recognition tasks
were reported by Gens and Domingos [125].

Recurrent neural networks (RNNs) can be considered as another
class of deep networks for unsupervised (as well as supervised) learning,
where the depth can be as large as the length of the input data sequence.
In the unsupervised learning mode, the RNN is used to predict the data
sequence in the future using the previous data samples, and no
additional class information is used for learning. RNNs are very powerful
for modeling sequence data (e.g., speech or text), but until recently
they had not been widely used partly because they are difficult to train
to capture long-term dependencies, giving rise to gradient vanishing or
gradient explosion problems which were known in the early 1990s [29, 167].
These problems can now be dealt with more easily [24, 48, 85, 280].
Recent advances in Hessian-free optimization [238] have also partially
overcome this difficulty using approximated second-order information
or stochastic curvature estimates. In the more recent work [239], RNNs

that are trained with Hessian-free optimization are used as a
generative deep network in the character-level language modeling tasks, where
gated connections are introduced to allow the current input characters
to predict the transition from one latent state vector to the next. Such
generative RNN models are demonstrated to be well capable of
generating sequential text characters. More recently, Bengio et al. [22] and
Sutskever [356] have explored variations of stochastic gradient descent
optimization algorithms in training generative RNNs and shown that
these algorithms can outperform Hessian-free optimization methods.
Mikolov et al. [248] have reported excellent results on using RNNs for
language modeling. Most recently, Mesnil et al. [242] and Yao et al.
[403] reported the success of RNNs in spoken language understanding.
We will review this set of work in Section 8.

There has been a long history in speech recognition research
where human speech production mechanisms are exploited to
construct dynamic and deep structure in probabilistic generative models;
for a comprehensive review, see the monograph by Deng [76].
Specifically, the early work described in [71, 72, 83, 84, 99, 274] generalized
and extended the conventional shallow and conditionally independent
HMM structure by imposing dynamic constraints, in the form of
polynomial trajectory, on the HMM parameters. A variant of this approach
has been more recently developed using different learning techniques
for time-varying HMM parameters and with the applications extended
to speech recognition robustness [431, 416]. Similar trajectory HMMs
also form the basis for parametric speech synthesis [228, 326, 439, 438].
Subsequent work added a new hidden layer into the dynamic model to
explicitly account for the target-directed, articulatory-like properties in
human speech generation [45, 73, 74, 83, 96, 75, 90, 231, 232, 233, 251,
282]. More efficient implementation of this deep architecture with
hidden dynamics is achieved with non-recursive or finite impulse response
(FIR) filters in more recent studies [76, 107, 105]. The above
deep-structured generative models of speech can be shown as special cases
of the more general dynamic network model and even more general
dynamic graphical models [35, 34]. The graphical models can comprise
many hidden layers to characterize the complex relationship between
the variables in speech generation. Armed with powerful graphical
modeling tool, the deep architecture of speech has more recently been
successfully applied to solve the very difficult problem of single-channel,
multi-talker speech recognition, where the mixed speech is the visible
variable while the un-mixed speech becomes represented in a new
hidden layer in the deep generative architecture [301, 391]. Deep generative
graphical models are indeed a powerful tool in many applications due
to their capability of embedding domain knowledge. However, they are
often used with inappropriate approximations in inference, learning,
prediction, and topology design, all arising from inherent intractability
in these tasks for most real-world applications. This problem has been
addressed in the recent work of Stoyanov et al. [352], which provides
an interesting direction for making deep generative graphical models
potentially more useful in practice in the future. An even more drastic
way to deal with this intractability was proposed recently by Bengio
et al. [30], where the need to marginalize latent variables is avoided
altogether.

The standard statistical methods used for large-scale speech
recognition and understanding combine (shallow) hidden Markov models
for speech acoustics with higher layers of structure representing
different levels of natural language hierarchy. This combined hierarchical
model can be suitably regarded as a deep generative architecture, whose
motivation and some technical detail may be found in Section 7 of the
recent monograph [200] on “Hierarchical HMM” or HHMM. Related
models with greater technical depth and mathematical treatment can
be found in [116] for HHMM and [271] for Layered HMM. These early
deep models were formulated as directed graphical models, missing the
key aspect of “distributed representation” embodied in the more recent
deep generative networks of the DBN and DBM discussed earlier in this
chapter. Filling in this missing aspect would help improve these
generative models.

Finally, dynamic or temporally recursive generative models based
on neural network architectures can be found in [361] for human motion
modeling, and in [344, 339] for natural language and natural scene
parsing. The latter model is particularly interesting because the learning
algorithms are capable of automatically determining the optimal model
structure. This contrasts with other deep architectures such as DBN

where only the parameters are learned while the architectures need to
be pre-defined. Specifically, as reported in [344], the recursive
structure commonly found in natural scene images and in natural language
sentences can be discovered using a max-margin structure prediction
architecture. It is shown that the units contained in the images or
sentences are identified, and the way in which these units interact with
each other to form the whole is also identified.

3.3 Deep networks for supervised learning

Many of the discriminative techniques for supervised learning in signal
and information processing are shallow architectures such as HMMs
[52, 127, 147, 186, 188, 290, 394, 418] and conditional random fields
(CRFs) [151, 155, 281, 400, 429, 446]. A CRF is intrinsically a
shallow discriminative architecture, characterized by the linear relationship
between the input features and the transition features. The shallow
nature of the CRF is made most clear by the equivalence established
between the CRF and the discriminatively trained Gaussian models
and HMMs [148]. More recently, deep-structured CRFs have been
developed by stacking the output in each lower layer of the CRF, together
with the original input data, onto its higher layer [428]. Various
versions of deep-structured CRFs are successfully applied to phone
recognition [410], spoken language identification [428], and natural language
processing [428]. However, at least for the phone recognition task, the
performance of deep-structured CRFs, which are purely
discriminative (non-generative), has not been able to match that of the hybrid
approach involving DBN, which we will take on shortly.

Morgan [261] gives an excellent review on other major existing
discriminative models in speech recognition based mainly on the
traditional neural network or MLP architecture using back-propagation
learning with random initialization. It argues for the importance of
both the increased width of each layer of the neural networks and the
increased depth. In particular, a class of deep neural network models
forms the basis of the popular “tandem” approach [262], where the
output of the discriminatively learned neural network is treated as part
of the observation variable in HMMs. For some representative recent
work in this area, see [193, 283].

In more recent work of [106, 110, 218, 366, 377], a new deep learning
architecture, sometimes called deep stacking network (DSN), together
with its tensor variant [180, 181] and its kernel version [102], are
developed that all focus on discrimination with scalable, parallelizable,
block-wise learning relying on little or no generative component. We
will describe this type of discriminative deep architecture in detail in
Section 6.

As discussed in the preceding section, recurrent neural networks
(RNNs) have been used as a generative model; see also the neural
predictive model [87] with a similar “generative” mechanism. RNNs can
also be used as a discriminative model where the output is a label
sequence associated with the input data sequence. Note that such
discriminative RNNs or sequence models were applied to speech a long
time ago with limited success. In [17], an HMM was trained jointly with
the neural networks, with a discriminative probabilistic training
criterion. In [304], a separate HMM was used to segment the sequence during
training, and the HMM was also used to transform the RNN
classification results into label sequences. However, the use of the HMM for
these purposes does not take advantage of the full potential of RNNs.

A set of new models and methods were proposed more recently
in [133, 134, 135, 136] that enable the RNNs themselves to perform
sequence classification while embedding the long-short-term memory
into the model, removing the need for pre-segmenting the training data
and for post-processing the outputs. Underlying this method is the idea
of interpreting RNN outputs as the conditional distributions over all
possible label sequences given the input sequences. Then, a
differentiable objective function can be derived to optimize these conditional
distributions over the correct label sequences, where the segmentation
of the data is performed automatically by the algorithm. The
effectiveness of this method has been demonstrated in handwriting recognition
tasks and in a small speech task [135, 136] to be discussed in more
detail in Section 7 of this monograph.

Another type of discriminative deep architecture is the
convolutional neural network (CNN), in which each module consists of

a convolutional layer and a pooling layer. These modules are often
stacked up with one on top of another, or with a DNN on top of it, to
form a deep model [212]. The convolutional layer shares many weights,
and the pooling layer subsamples the output of the convolutional layer
and reduces the data rate from the layer below. The weight sharing
in the convolutional layer, together with appropriately chosen
pooling schemes, endows the CNN with some “invariance” properties (e.g.,
translation invariance). It has been argued that such limited
“invariance” or equi-variance is not adequate for complex pattern recognition
tasks and more principled ways of handling a wider range of invariance
may be needed [160]. Nevertheless, CNNs have been found highly
effective and been commonly used in computer vision and image recognition
[54, 55, 56, 57, 69, 198, 209, 212, 434]. More recently, with
appropriate changes from the CNN designed for image analysis to that taking
into account speech-specific properties, the CNN is also found
effective for speech recognition [1, 2, 3, 81, 94, 312]. We will discuss such
applications in more detail in Section 7 of this monograph.

It is useful to point out that the time-delay neural network (TDNN)
[202, 382] developed for early speech recognition is a special case and
predecessor of the CNN when weight sharing is limited to one of the
two dimensions, i.e., time dimension, and there is no pooling layer. It
was not until recently that researchers have discovered that the
time-dimension invariance is less important than the frequency-dimension
invariance for speech recognition [1, 3, 81]. A careful analysis on the
underlying reasons is described in [81], together with a new strategy for
designing the CNN’s pooling layer demonstrated to be more effective
than all previous CNNs in phone recognition.

It is also useful to point out that the model of hierarchical
temporal memory (HTM) [126, 143, 142] is another variant and extension of
the CNN. The extension includes the following aspects: (1) Time or
temporal dimension is introduced to serve as the “supervision”
information for discrimination (even for static images); (2) Both bottom-up
and top-down information flows are used, instead of just bottom-up in
the CNN; and (3) A Bayesian probabilistic formalism is used for fusing
information and for decision making.

Finally, the learning architecture developed for bottom-up,
detection-based speech recognition proposed in [214] and developed
further since 2004, notably in [330, 332, 427] using the DBN–DNN
technique, can also be categorized in the discriminative or
supervised-learning deep architecture category. There is no intent and
mechanism in this architecture to characterize the joint probability of data
and recognition targets of speech attributes and of the higher-level
phone and words. The most current implementation of this approach
is based on the DNN, or neural networks with many layers using
back-propagation learning. One intermediate neural network layer in the
implementation of this detection-based framework explicitly represents
the speech attributes, which are simplified entities from the “atomic”
units of speech developed in the early work of [101, 355]. The
simplification lies in the removal of the temporally overlapping properties
of the speech attributes or articulatory-like features. Embedding such
more realistic properties in the future work is expected to improve the
accuracy of speech recognition further.

3.4 Hybrid deep networks

The term “hybrid” for this third category refers to the deep architecture
that either comprises or makes use of both generative and
discriminative model components. In the existing hybrid architectures published
in the literature, the generative component is mostly exploited to help
with discrimination, which is the final goal of the hybrid architecture.
How and why generative modeling can help with discrimination can be
examined from two viewpoints [114]:

• The optimization viewpoint where generative models trained in
an unsupervised fashion can provide excellent initialization points
in highly nonlinear parameter estimation problems (The
commonly used term of “pre-training” in deep learning has been
introduced for this reason); and/or

• The regularization perspective where the unsupervised-learning
models can effectively provide a prior on the set of functions
representable by the model.

The study reported in [114] provided an insightful analysis and
experimental evidence supporting both of the viewpoints above.
The DBN, a generative, deep network for unsupervised learning
discussed in Section 3.2, can be converted to and used as the initial model
of a DNN for supervised learning with the same network structure,
which is further discriminatively trained or fine-tuned using the target
labels provided. When the DBN is used in this way we consider this
DBN–DNN model as a hybrid deep model, where the model trained
using unsupervised data helps to make the discriminative model
effective for supervised learning. We will review details of the discriminative
DNN for supervised learning in the context of RBM/DBN generative,
unsupervised pre-training in Section 5.
Another example of the hybrid deep network is developed in [260],
where the DNN weights are also initialized from a generative DBN
but are further fine-tuned with a sequence-level discriminative
criterion, which is the conditional probability of the label sequence given
the input feature sequence, instead of the frame-level criterion of
cross-entropy commonly used. This can be viewed as a combination of the
static DNN with the shallow discriminative architecture of CRF. It can
be shown that such a DNN–CRF is equivalent to a hybrid deep architecture
of DNN and HMM whose parameters are learned jointly using the
full-sequence maximum mutual information (MMI) criterion between
the entire label sequence and the input feature sequence. A closely
related full-sequence training method designed and implemented for
much larger tasks is carried out more recently with success for a shallow
neural network [194] and for a deep one [195, 353, 374]. We note that
the origin of the idea for joint training of the sequence model (e.g., the
HMM) and of the neural network came from the early work of [17, 25],
where shallow neural networks were trained with small amounts of
training data and with no generative pre-training.
Here, it is useful to point out a connection between the above
pretraining/fine-tuning strategy associated with hybrid deep networks
and the highly popular minimum phone error (MPE) training technique
for the HMM (see [147, 290] for an overview). To make MPE training
effective, the parameters need to be initialized using an algorithm (e.g.,
Baum-Welch algorithm) that optimizes a generative criterion (e.g.,
maximum likelihood). This type of methods, which uses
maximum-likelihood trained parameters to assist in the discriminative HMM
training can be viewed as a “hybrid” approach to train the shallow
HMM model.
Along the line of using discriminative criteria to train parameters in
generative models as in the above HMM training example, we here
discuss the same method applied to learning other hybrid deep networks.
In [203], the generative model of RBM is learned using the discriminative
criterion of posterior class-label probabilities. Here the label vector
is concatenated with the input data vector to form the combined
visible layer in the RBM. In this way, RBM can serve as a stand-alone
solution to classification problems and the authors derived a
discriminative learning algorithm for RBM as a shallow generative model. In
the more recent work by Ranzato et al. [298], the deep generative model
of DBN with gated Markov random field (MRF) at the lowest level is
learned for feature extraction and then for recognition of difficult image
classes including occlusions. The generative ability of the DBN
facilitates the discovery of what information is captured and what is lost
at each level of representation in the deep model, as demonstrated in
[298]. A related study on using the discriminative criterion of empirical
risk to train deep graphical models can be found in [352].
A further example of hybrid deep networks is the use of generative
models of DBNs to pre-train deep convolutional neural networks (deep
CNNs) [215, 216, 217]. Like the fully connected DNN discussed
earlier, pre-training also helps to improve the performance of deep CNNs
over random initialization. Pre-training DNNs or CNNs using a set of
regularized deep autoencoders [24], including denoising autoencoders,
contractive autoencoders, and sparse autoencoders, is also a similar
example of the category of hybrid deep networks.
The final example given here for hybrid deep networks is based
on the idea and work of [144, 267], where one task of discrimination
(e.g., speech recognition) produces the output (text) that serves
as the input to the second task of discrimination (e.g., machine
translation). The overall system, giving the functionality of speech
translation (translating speech in one language into text in another
language), is a two-stage deep architecture consisting of both
generative and discriminative elements. Both models of speech
recognition (e.g., HMM) and of machine translation (e.g., phrasal
mapping and non-monotonic alignment) are generative in nature, but
their parameters are all learned for discrimination of the ultimate
translated text given the speech data. The framework described in
[144] enables end-to-end performance optimization in the overall deep
architecture using the unified learning framework initially published
in [147]. This hybrid deep learning approach can be applied to not
only speech translation but also all speech-centric and possibly other
information processing tasks such as speech information retrieval,
speech understanding, cross-lingual speech/text understanding and
retrieval, etc. (e.g., [88, 94, 145, 146, 366, 398]).
In the next three chapters, we will elaborate on three prominent
types of models for deep learning, one from each of the three classes
reviewed in this chapter. These are chosen to serve the tutorial purpose,
given the simplicity of their architectural and mathematical
descriptions. The three architectures described in the following three chapters
should not be interpreted as the most representative and influential work
in each of the three classes.

4 Deep Autoencoders: Unsupervised Learning

This section and the next two will each select one prominent example
deep network for each of the three categories outlined in Section 3.
Here we begin with the category of the deep models designed mainly
for unsupervised learning.

4.1 Introduction

The deep autoencoder is a special type of the DNN (with no class
labels), whose output vectors have the same dimensionality as the input
vectors. It is often used for learning a representation or effective encod-
ing of the original data, in the form of input vectors, at hidden layers.
Note that the autoencoder is a nonlinear feature extraction method
without using class labels. As such, the features extracted aim at con-
serving and better representing information instead of performing clas-
sification tasks, although sometimes these two goals are correlated.
An autoencoder typically has an input layer which represents the
original data or input feature vectors (e.g., pixels in image or
spectra in speech), one or more hidden layers that represent the
transformed feature, and an output layer which matches the input layer for
reconstruction. When the number of hidden layers is greater than one,
the autoencoder is considered to be deep. The dimension of the hidden
layers can be either smaller (when the goal is feature compression) or
larger (when the goal is mapping the feature to a higher-dimensional
space) than the input dimension.
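
The layer structure described above can be made concrete with a small sketch. It assumes logistic hidden units, a linear output layer, random untrained weights, and a squared-error reconstruction measure; the layer sizes and every name in it (`affine`, `random_layer`, `autoencoder_forward`, `reconstruction_error`) are illustrative choices, not taken from the monograph.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(weights, bias, inputs):
    """Compute W x + b for a weight matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def random_layer(n_out, n_in, rng):
    """Small random weights and zero biases (an arbitrary initialization)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def autoencoder_forward(layers, x):
    """Propagate x through all layers: logistic hidden layers, linear output."""
    activation = x
    for index, (weights, bias) in enumerate(layers):
        pre = affine(weights, bias, activation)
        is_output = index == len(layers) - 1
        activation = pre if is_output else [sigmoid(p) for p in pre]
    return activation

def reconstruction_error(x, x_hat):
    """Squared error between the input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

rng = random.Random(0)
# Input, three hidden layers (narrowest one is the code layer), output.
sizes = [8, 4, 2, 4, 8]
layers = [random_layer(n_out, n_in, rng) for n_in, n_out in zip(sizes, sizes[1:])]
x = [rng.random() for _ in range(sizes[0])]
x_hat = autoencoder_forward(layers, x)
error = reconstruction_error(x, x_hat)
```

Because the output layer has the same width as the input, `error` is well defined for any input vector; training would adjust the weights to reduce it.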
An autoencoder is often trained using one of the many back-
propagation variants, typically the stochastic gradient descent method.
Though often reasonably effective, there are fundamental problems
when using back-propagation to train networks with many hidden
layers. Once the errors get back-propagated to the first few layers,
they become minuscule, and training becomes quite ineffective. Though
more advanced back-propagation methods help with this problem to
some degree, it still results in slow learning and poor solutions, espe-
cially with limited amounts of training data. As mentioned in the pre-
vious chapters, the problem can be alleviated by pre-training each layer
as a simple autoencoder [28, 163]. This strategy has been applied to
construct a deep autoencoder to map images to short binary code for
fast, content-based image retrieval, to encode documents (called semantic
hashing), and to encode spectrogram-like speech features which we
review below.

Figure 4.1: The architecture of the deep autoencoder used in [100] for extracting
binary speech codes from high-resolution spectrograms. [after [100], @Elsevier].
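
The shrinking of back-propagated errors in networks with many hidden layers, noted in Section 4.1, can be illustrated with a deliberately simplified sketch. It assumes a chain of scalar logistic units with a shared weight; `input_gradient` and its default arguments are hypothetical names chosen for the illustration. Each layer multiplies the gradient by the weight times the logistic derivative, which never exceeds 0.25.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def input_gradient(depth, weight=1.0, x=0.5):
    """d(output)/d(input) for a chain of `depth` scalar logistic units."""
    activation, gradient = x, 1.0
    for _ in range(depth):
        activation = sigmoid(weight * activation)
        # Chain rule: sigma'(z) = sigma(z) * (1 - sigma(z)) <= 0.25.
        gradient *= weight * activation * (1.0 - activation)
    return gradient

# Gradient magnitude reaching the input for increasingly deep chains.
gradients = {depth: input_gradient(depth) for depth in (1, 2, 4, 8)}
```

With unit weights the gradient is bounded by 0.25 raised to the depth, so the values in `gradients` fall off geometrically as layers are added.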

4.2 Use of deep autoencoders to extract speech features

Here we review a set of work, some of which was published in [100],
in developing an autoencoder for extracting binary speech codes from
the raw speech spectrogram data in an unsupervised manner (i.e., no
speech class labels). The discrete representations in terms of a binary
code extracted by this model can be used in speech information retrieval
or as bottleneck features for speech recognition.
A deep generative model of patches of spectrograms that contain
256 frequency bins and 1, 3, 9, or 13 frames is illustrated in
Figure 4.1. An undirected graphical model called a Gaussian-Bernoulli
RBM is built that has one visible layer of linear variables with
Gaussian noise and one hidden layer of 500 to 3000 binary latent
variables. After learning the Gaussian-Bernoulli RBM, the activation
probabilities of its hidden units are treated as the data for training
another Bernoulli-Bernoulli RBM. These two RBMs can then be composed
to form a deep belief net (DBN) in which it is easy to infer the
states of the second layer of binary hidden units from the input in a
single forward pass. The DBN used in this work is illustrated on the left
side of Figure 4.1, where the two RBMs are shown in separate boxes.
(See more detailed discussions on the RBM and DBN in Section 5).
The deep autoencoder with three hidden layers is formed by
“unrolling” the DBN using its weight matrices. The lower layers of
this deep autoencoder use the matrices to encode the input and the
upper layers use the matrices in reverse order to decode the input.
This deep autoencoder is then fine-tuned using error back-propagation
to minimize the reconstruction error, as shown on the right side of
Figure 4.1. After learning is complete, any variable-length spectrogram

can be encoded and reconstructed as follows. First, N consecutive
overlapping frames of 256-point log power spectra are each normalized
to zero-mean and unit-variance across samples per feature to provide
the input to the deep autoencoder. The first hidden layer then uses the
logistic function to compute real-valued activations. These real values
are fed to the next, coding layer to compute “codes.” The real-valued
activations of hidden units in the coding layer are quantized to be
either zero or one with 0.5 as the threshold. These binary codes are
then used to reconstruct the original spectrogram, where individual
fixed-frame patches are reconstructed first using the two upper layers
of network weights. Finally, the standard overlap-and-add technique in
signal processing is used to reconstruct the full-length speech
spectrogram from the outputs produced by applying the deep autoencoder to
every possible window of N consecutive frames. We show some
illustrative encoding and reconstruction examples below.
At the top of Figure 4.2 is the original, un-coded speech, followed
by the speech utterances reconstructed from the binary codes (zero
or one) at the 312 unit bottleneck code layer with encoding window
lengths of N = 1, 3, 9, and 13, respectively. The lower reconstruction
errors for N = 9 and N = 13 are clearly seen.

Figure 4.2: Top to bottom: The original spectrogram; reconstructions using input
window sizes of N = 1, 3, 9, and 13 while forcing the coding units to take values of
zero or one (i.e., a binary code). [after [100], @Elsevier].

Encoding error of the deep autoencoder is qualitatively examined
in comparison with the more traditional codes via vector quantization
(VQ). Figure 4.3 shows various aspects of the encoding errors. At the
top is the original speech utterance’s spectrogram. The next two
spectrograms are the blurry reconstruction from the 312-bit VQ and the
much more faithful reconstruction from the 312-bit deep autoencoder.
Coding errors from both coders, plotted as a function of time, are
shown below the spectrograms, demonstrating that the autoencoder
(red curve) is producing lower errors than the VQ coder (blue curve)
throughout the entire span of the utterance. The final two spectrograms
show detailed coding error distributions over both time and frequency
bins.

Figure 4.3: Top to bottom: The original spectrogram from the test set;
reconstruction from the 312-bit VQ coder; reconstruction from the 312-bit
autoencoder; coding errors as a function of time for the VQ coder (blue) and
autoencoder (red); spectrogram of the VQ coder residual; spectrogram of the deep
autoencoder’s residual. [after [100], @Elsevier].

Figure 4.4: The original speech spectrogram and the reconstructed counterpart.
A total of 312 binary codes are used, with one for each single frame.

Figure 4.5: Same as Figure 4.4 but with a different TIMIT speech utterance.

Figures 4.4 to 4.10 show additional examples (unpublished) for the
original un-coded speech spectrograms and their reconstructions using
the deep autoencoder. They give a diverse number of binary codes for
either a single or three consecutive frames in the spectrogram samples.
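
The pre- and post-processing steps of the coding procedure in this section (per-feature normalization, quantization of the coding layer at 0.5, and overlap-add of the per-window outputs) can be sketched as follows. This is a minimal illustration under stated assumptions: the trained autoencoder is replaced by a stand-in that reconstructs each window perfectly, overlapping frames are averaged, and an activation exactly equal to 0.5 is mapped to zero. The function names are hypothetical.

```python
import math

def normalize_features(frames):
    """Scale each feature (column) to zero mean and unit variance across frames."""
    n = len(frames)
    columns = list(zip(*frames))
    means = [sum(col) / n for col in columns]
    # A constant feature has zero deviation; fall back to 1.0 to avoid dividing by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / n) or 1.0
            for col, m in zip(columns, means)]
    return [[(v - m) / s for v, m, s in zip(frame, means, stds)] for frame in frames]

def binarize(activations, threshold=0.5):
    """Quantize real-valued coding-layer activations to a 0/1 code."""
    return [1 if a > threshold else 0 for a in activations]

def overlap_add(window_outputs, n_frames, window):
    """Combine per-window reconstructions into one full-length frame sequence.

    window_outputs[t] holds the reconstructed frames for the window that starts
    at frame t. Averaging the overlaps is an assumption made for this sketch.
    """
    dim = len(window_outputs[0][0])
    totals = [[0.0] * dim for _ in range(n_frames)]
    counts = [0] * n_frames
    for start, frames in enumerate(window_outputs):
        for offset in range(window):
            counts[start + offset] += 1
            for k, value in enumerate(frames[offset]):
                totals[start + offset][k] += value
    return [[v / c for v in row] for row, c in zip(totals, counts)]

spectrogram = [[1.0, 4.0], [2.0, 3.0], [3.0, 5.0], [4.0, 2.0], [5.0, 6.0]]
window = 3
normalized = normalize_features(spectrogram)
# Stand-in for the trained network: every window is reconstructed exactly.
window_outputs = [normalized[t:t + window]
                  for t in range(len(normalized) - window + 1)]
reconstructed = overlap_add(window_outputs, len(normalized), window)
code = binarize([0.2, 0.7, 0.5])
```

With a real autoencoder, `window_outputs` would hold the decoder outputs computed from the binary codes, and the averaged result would approximate rather than equal the normalized input.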

Figure 4.6: The original speech spectrogram and the reconstructed counterpart.
A total of 936 binary codes are used for three adjacent frames.

Figure 4.7: Same as Figure 4.6 but with a different TIMIT speech utterance.

Figure 4.8: Same as Figure 4.6 but with yet another TIMIT speech utterance.

Figure 4.9: The original speech spectrogram and the reconstructed counterpart.
A total of 2000 binary codes with one for each single frame.

Figure 4.10: Same as Figure 4.9 but with a different TIMIT speech utterance.

4.3 Stacked denoising autoencoders

In early years of autoencoder research, the encoding layer had smaller
dimensions than the input layer. However, in some applications, it is
desirable that the encoding layer is wider than the input layer, in which
case techniques are needed to prevent the neural network from learning
the trivial identity mapping function. One of the reasons for using a
higher dimension in the hidden or encoding layers than the input layer
is that it allows the autoencoder to capture a rich input distribution.
The trivial mapping problem discussed above can be prevented by
methods such as using sparseness constraints, or using the “dropout”
trick by randomly forcing certain values to be zero and thus introducing
distortions at the input data [376, 375] or at the hidden layers [166]. For
example, in the stacked denoising autoencoder detailed in [376], random
noises are added to the input data. This serves several purposes. First,
by forcing the output to match the original undistorted input data the
model can avoid learning the trivial identity solution. Second, since
the noises are added randomly, the model learned would be robust to
the same kind of distortions in the test data. Third, since each distorted

input sample is different, it greatly increases the training set size and
thus can alleviate the overfitting problem.
It is interesting to note that when the encoding and decoding
weights are forced to be the transpose of each other, such denoising
autoencoder with a single sigmoidal hidden layer is strictly equivalent
to a particular Gaussian RBM, but instead of training it by the
technique of contrastive divergence (CD) or persistent CD, it is trained
by a score matching principle, where the score is defined as the
derivative of the log-density with respect to the input [375]. Furthermore,
Alain and Bengio [5] generalized this result to any parameterization
of the encoder and decoder with squared reconstruction error and
Gaussian corruption noise. They show that as the amount of noise
approaches zero, such models estimate the true score of the underlying
data generating distribution. Finally, Bengio et al. [30] show that
any denoising autoencoder is a consistent estimator of the underlying
data generating distribution within some family of distributions.
This is true for any parameterization of the autoencoder, for any
type of information-destroying corruption process with no constraint
on the noise level except being positive, and for any reconstruction
loss expressed as a conditional log-likelihood. The consistency of the
estimator is achieved by associating the denoising autoencoder with
a Markov chain whose stationary distribution is the distribution
estimated by the model, and this Markov chain can be used to sample
from the denoising autoencoder.

4.4 Transforming autoencoders

The deep autoencoder described above can extract faithful codes for
feature vectors due to many layers of nonlinear processing. However, the
code extracted in this way is transformation-variant. In other words,
the extracted code would change in ways chosen by the learner when the
input feature vector is transformed. Sometimes, it is desirable to have
the code change predictably to reflect the underlying
transformation-invariant property of the perceived content. This is the goal of the
transforming autoencoder proposed in [162] for image recognition.
The building block of the transforming autoencoder is a “capsule,”
which is an independent sub-network that extracts a single parameterized
feature representing a single entity, be it visual or audio. A
transforming autoencoder receives both an input vector and a target output
vector, which is transformed from the input vector through a simple
global transformation mechanism; e.g., translation of an image and
frequency shift of speech (the latter due to the vocal tract length
difference). An explicit representation of the global transformation is
assumed known. The coding layer of the transforming autoencoder
consists of the outputs of several capsules.
During the training phase, the different capsules learn to extract
different entities in order to minimize the error between the final output
and the target.
In addition to the deep autoencoder architectures described here,
there are many other types of generative architectures in the literature,
all characterized by the use of data alone (i.e., free of classification
labels) to automatically derive higher-level features.
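
Returning to the denoising criterion of Section 4.3, the sketch below shows one way the training objective might be evaluated for a single-hidden-layer autoencoder with tied weights. It assumes masking noise (values forced to zero at random), a logistic encoder, a linear decoder that reuses the transposed encoder matrix, and a squared-error loss; these choices and all names are illustrative, and no training loop is included.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def corrupt(x, drop_prob, rng):
    """Masking noise: set each input value to zero with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else v for v in x]

def encode(weights, hidden_bias, x):
    """Logistic hidden layer; weights has one row per hidden unit."""
    return [sigmoid(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, hidden_bias)]

def decode(weights, visible_bias, h):
    """Tied weights: the decoder uses the transpose of the encoder matrix."""
    return [sum(weights[j][i] * h[j] for j in range(len(h))) + visible_bias[i]
            for i in range(len(visible_bias))]

def denoising_loss(weights, hidden_bias, visible_bias, clean, rng, drop_prob=0.3):
    """Squared error between the clean input and the reconstruction of its corrupted copy."""
    noisy = corrupt(clean, drop_prob, rng)
    reconstruction = decode(weights, visible_bias,
                            encode(weights, hidden_bias, noisy))
    return sum((c - r) ** 2 for c, r in zip(clean, reconstruction))

rng = random.Random(0)
n_visible, n_hidden = 4, 6   # encoding layer wider than the input
weights = [[rng.gauss(0.0, 0.1) for _ in range(n_visible)] for _ in range(n_hidden)]
hidden_bias = [0.0] * n_hidden
visible_bias = [0.0] * n_visible
clean = [0.9, 0.1, 0.8, 0.2]
loss = denoising_loss(weights, hidden_bias, visible_bias, clean, rng)
```

The target of the loss is the clean vector, not the corrupted one, which is what keeps a wide hidden layer from settling on the identity mapping.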
5 Pre-Trained Deep Neural Networks: A Hybrid

In this section, we present the most widely used hybrid deep
architecture, the pre-trained deep neural network (DNN), and discuss
the related techniques and building blocks including the RBM and
DBN. We discuss the DNN example here in the category of hybrid
deep networks before the examples in the category of deep networks for
supervised learning (Section 6). This is partly due to the natural flow
from the unsupervised learning models to the DNN as a hybrid model.
The discriminative nature of artificial neural networks for supervised
learning has been widely known, and thus would not be required for
understanding the hybrid nature of the DNN that uses unsupervised
pre-training to facilitate the subsequent discriminative fine tuning.
Part of the review in this chapter is based on recent publications in
[68, 161, 412].

5.1 Restricted Boltzmann machines

An RBM is a special type of Markov random field that has one layer of
(typically Bernoulli) stochastic hidden units and one layer of (typically
Bernoulli or Gaussian) stochastic visible or observable units. RBMs can
be represented as bipartite graphs, where all visible units are connected
to all hidden units, and there are no visible–visible or hidden–hidden
connections.
In an RBM, the joint distribution $p(v, h; \theta)$ over the visible units $v$
and hidden units $h$, given the model parameters $\theta$, is defined in terms
of an energy function $E(v, h; \theta)$ of

$$p(v, h; \theta) = \frac{\exp(-E(v, h; \theta))}{Z},$$

where $Z = \sum_{v} \sum_{h} \exp(-E(v, h; \theta))$ is a normalization factor or
partition function, and the marginal probability that the model assigns to
a visible vector $v$ is

$$p(v; \theta) = \frac{\sum_{h} \exp(-E(v, h; \theta))}{Z}.$$

For a Bernoulli (visible)-Bernoulli (hidden) RBM, the energy function
is defined as

$$E(v, h; \theta) = -\sum_{i=1}^{I} \sum_{j=1}^{J} w_{ij} v_i h_j - \sum_{i=1}^{I} b_i v_i - \sum_{j=1}^{J} a_j h_j,$$

where $w_{ij}$ represents the symmetric interaction term between visible
unit $v_i$ and hidden unit $h_j$, $b_i$ and $a_j$ the bias terms, and $I$ and $J$ are
the numbers of visible and hidden units. The conditional probabilities
can be efficiently calculated as

$$p(h_j = 1 \mid v; \theta) = \sigma\left(\sum_{i=1}^{I} w_{ij} v_i + a_j\right),$$

$$p(v_i = 1 \mid h; \theta) = \sigma\left(\sum_{j=1}^{J} w_{ij} h_j + b_i\right),$$

where $\sigma(x) = 1/(1 + \exp(-x))$.
Similarly, for a Gaussian (visible)-Bernoulli (hidden) RBM, the
energy is

$$E(v, h; \theta) = -\sum_{i=1}^{I} \sum_{j=1}^{J} w_{ij} v_i h_j + \frac{1}{2} \sum_{i=1}^{I} (v_i - b_i)^2 - \sum_{j=1}^{J} a_j h_j.$$
The corresponding conditional probabilities become

$$p(h_j = 1 \mid v; \theta) = \sigma\left(\sum_{i=1}^{I} w_{ij} v_i + a_j\right),$$

$$p(v_i \mid h; \theta) = \mathcal{N}\left(\sum_{j=1}^{J} w_{ij} h_j + b_i,\ 1\right),$$

where $v_i$ takes real values and follows a Gaussian distribution with
mean $\sum_{j=1}^{J} w_{ij} h_j + b_i$ and variance one. Gaussian-Bernoulli RBMs can
be used to convert real-valued stochastic variables to binary stochastic
variables, which can then be further processed using the
Bernoulli-Bernoulli RBMs.
The above discussion used two of the most common conditional
distributions for the visible data in the RBM: Gaussian (for
continuous-valued data) and binomial (for binary data). More general
types of distributions in the RBM can also be used. See [386] for the
use of general exponential-family distributions for this purpose.
Taking the gradient of the log likelihood $\log p(v; \theta)$ we can derive
the update rule for the RBM weights as:

$$\Delta w_{ij} = E_{\text{data}}(v_i h_j) - E_{\text{model}}(v_i h_j),$$

where $E_{\text{data}}(v_i h_j)$ is the expectation observed in the training set (with
$h_j$ sampled given $v_i$ according to the model), and $E_{\text{model}}(v_i h_j)$ is that
same expectation under the distribution defined by the model.
Unfortunately, $E_{\text{model}}(v_i h_j)$ is intractable to compute. The contrastive
divergence (CD) approximation to the gradient was the first efficient method
proposed to approximate this expected value, where $E_{\text{model}}(v_i h_j)$ is
replaced by running the Gibbs sampler initialized at the data for one
or more steps. The steps in approximating $E_{\text{model}}(v_i h_j)$ are summarized
as follows:

• Initialize $v_0$ at data
• Sample $h_0 \sim p(h \mid v_0)$
• Sample $v_1 \sim p(v \mid h_0)$
• Sample $h_1 \sim p(h \mid v_1)$

Here, $(v_1, h_1)$ is a sample from the model, as a very rough estimate
of $E_{\text{model}}(v_i h_j)$. The use of $(v_1, h_1)$ to approximate $E_{\text{model}}(v_i h_j)$ gives
rise to the algorithm of CD-1. The sampling process can be pictorially
depicted in Figure 5.1.

Figure 5.1: A pictorial view of sampling from a RBM during RBM learning
(courtesy of Geoff Hinton).

Note that CD-k generalizes this to more steps of the Markov chain.
There are other techniques for estimating the log-likelihood gradient of
RBMs, in particular the stochastic maximum likelihood or persistent
contrastive divergence (PCD) [363, 406]. Both work better than CD
when using the RBM as a generative model.
Careful training of RBMs is essential to the success of applying
RBM and related deep learning techniques to solve practical problems.
See Technical Report [159] for a very useful practical guide for training
RBMs.
The RBM discussed above is both a generative and an unsupervised
model, which characterizes the input data distribution using hidden
variables and there is no label information involved. However, when
the label information is available, it can be used together with the
data to form the concatenated “data” set. Then the same CD learning
can be applied to optimize the approximate “generative” objective
function related to data likelihood. Further, and more interestingly, a
“discriminative” objective function can be defined in terms of
conditional likelihood of labels. This discriminative RBM can be used to
“fine tune” RBM for classification tasks [203].
Ranzato et al. [297, 295] proposed an unsupervised learning
algorithm called sparse encoding symmetric machine (SESM), which is
quite similar to RBM. They both have a symmetric encoder and

decoder, and a logistic nonlinearity on the top of the encoder. The main
difference is that whereas the RBM is trained using (very approximate)
maximum likelihood, SESM is trained by simply minimizing the aver-
age energy plus an additional code sparsity term. SESM relies on the
sparsity term to prevent flat energy surfaces, while RBM relies on an
explicit contrastive term in the loss, an approximation of the log par-
tition function. Another difference is in the coding strategy in that the
code units are “noisy” and binary in the RBM, while they are quasi-
binary and sparse in SESM. The use of SESM in pre-training DNNs
for speech recognition can be found in [284].
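
As a recap of the Bernoulli-Bernoulli RBM of this section, the following sketch implements the energy function, the two conditional distributions, and a single CD-1 step that follows the four sampling steps listed earlier. It is a minimal illustration under stated assumptions: one training vector per update, a fixed learning rate, and bias updates taken from the analogous data-minus-model differences (the text gives only the weight update). All function names are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def energy(w, a, b, v, h):
    """E(v, h) = -sum_ij w[i][j] v[i] h[j] - sum_i b[i] v[i] - sum_j a[j] h[j]."""
    interaction = sum(w[i][j] * v[i] * h[j]
                      for i in range(len(v)) for j in range(len(h)))
    return (-interaction
            - sum(bi * vi for bi, vi in zip(b, v))
            - sum(aj * hj for aj, hj in zip(a, h)))

def hidden_probs(w, a, v):
    """p(h_j = 1 | v) = sigma(sum_i w[i][j] v[i] + a[j])."""
    return [sigmoid(sum(w[i][j] * v[i] for i in range(len(v))) + a[j])
            for j in range(len(a))]

def visible_probs(w, b, h):
    """p(v_i = 1 | h) = sigma(sum_j w[i][j] h[j] + b[i])."""
    return [sigmoid(sum(w[i][j] * h[j] for j in range(len(h))) + b[i])
            for i in range(len(b))]

def sample(probs, rng):
    """Draw one Bernoulli sample per unit."""
    return [1 if rng.random() < p else 0 for p in probs]

def cd1_update(w, a, b, v0, rng, learning_rate=0.1):
    """One CD-1 step on a single training vector; updates w, a, b in place."""
    h0 = sample(hidden_probs(w, a, v0), rng)   # h0 ~ p(h | v0)
    v1 = sample(visible_probs(w, b, h0), rng)  # v1 ~ p(v | h0)
    h1 = sample(hidden_probs(w, a, v1), rng)   # h1 ~ p(h | v1)
    for i in range(len(b)):
        for j in range(len(a)):
            # Data statistic minus the one-step model statistic.
            w[i][j] += learning_rate * (v0[i] * h0[j] - v1[i] * h1[j])
    for i in range(len(b)):
        b[i] += learning_rate * (v0[i] - v1[i])   # assumed bias update
    for j in range(len(a)):
        a[j] += learning_rate * (h0[j] - h1[j])   # assumed bias update
    return v1, h1

rng = random.Random(0)
n_visible, n_hidden = 4, 3
w = [[0.0] * n_hidden for _ in range(n_visible)]
a = [0.0] * n_hidden
b = [0.0] * n_visible
v0 = [1, 0, 1, 0]
initial_probs = hidden_probs(w, a, v0)
initial_energy = energy(w, a, b, v0, [1, 1, 0])
v1, h1 = cd1_update(w, a, b, v0, rng)
```

CD-k would repeat the two middle sampling steps k times before collecting the model statistics; practical implementations also average the update over mini-batches.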

5.2 Unsupervised layer-wise pre-training

Here we describe how to stack up RBMs just described to form a
DBN as the basis for DNN’s pre-training. Before delving into details,
we first note that this procedure, proposed by Hinton and
Salakhutdinov [163] is a more general technique of unsupervised layer-wise
pretraining. That is, not only RBMs can be stacked to form deep
generative (or discriminative) networks, but other types of networks can
also do the same, such as autoencoder variants as proposed by Bengio
et al. [28].
Stacking a number of the RBMs learned layer by layer from bottom
up gives rise to a DBN, an example of which is shown in Figure 5.2. The
stacking procedure is as follows. After learning a Gaussian-Bernoulli
RBM (for applications with continuous features such as speech) or
Bernoulli-Bernoulli RBM (for applications with nominal or binary
features such as black-white image or coded text), we treat the activation
probabilities of its hidden units as the data for training the
Bernoulli-Bernoulli RBM one layer up. The activation probabilities of the
second-layer Bernoulli-Bernoulli RBM are then used as the visible data input
for the third-layer Bernoulli-Bernoulli RBM, and so on. Some
theoretical justification of this efficient layer-by-layer greedy learning
strategy is given in [163], where it is shown that the stacking procedure
above improves a variational lower bound on the likelihood of the
training data under the composite model. That is, the greedy procedure
above achieves approximate maximum likelihood learning. Note that
this learning procedure is unsupervised and requires no class label.
When applied to classification tasks, the generative pre-training
can be followed by or combined with other, typically discriminative,
learning procedures that fine-tune all of the weights jointly to improve
the performance of the network. This discriminative fine-tuning is
performed by adding a final layer of variables that represent the desired
outputs or labels provided in the training data. Then, the
back-propagation algorithm can be used to adjust or fine-tune the network
weights in the same way as for the standard feed-forward neural
network. What goes to the top, label layer of this DNN depends on the
application. For speech recognition applications, the top layer, denoted
by “$l_1, l_2, \ldots, l_j, \ldots, l_L$,” in Figure 5.2, can represent either syllables,
phones, sub-phones, phone states, or other speech units used in the
HMM-based speech recognition system.

Figure 5.2: An illustration of the DBN-DNN architecture.

The generative pre-training described above has produced better
phone and speech recognition results than random initialization on
a wide variety of tasks, which will be surveyed in Section 7.
Further research has also shown the effectiveness of other pre-training
strategies. As an example, greedy layer-by-layer training may be carried
out with an additional discriminative term to the generative cost
function at each level. And without generative pre-training, purely
discriminative training of DNNs from random initial weights using the
traditional stochastic gradient descent method has been shown to work very
well when the scales of the initial weights are set carefully and the
mini-batch sizes, which trade off noisy gradients with convergence speed,
used in stochastic gradient descent are adapted prudently (e.g., with
an increasing size over training epochs). Also, randomization order in
creating mini-batches needs to be judiciously determined. Importantly,
it was found effective to learn a DNN by starting with a shallow neural
network with a single hidden layer. Once this has been trained
discriminatively (using early stops to avoid overfitting), a second hidden layer is
inserted between the first hidden layer and the labeled softmax output
units and the expanded deeper network is again trained
discriminatively. This can be continued until the desired number of hidden layers
is reached, after which a full backpropagation “fine tuning” is applied.
This discriminative “pre-training” procedure is found to work well in
practice [324, 419], especially with a reasonably large amount of
training data. When the amount of training data is increased even more,
then some carefully designed random initialization methods can work
well also without using the above pre-training schemes.
In any case, pre-training based on the use of RBMs to stack up in
forming the DBN has been found to work well in most cases, regardless
of a large or small amount of training data. It is useful to point out
that there are other ways to perform pre-training in addition to the
use of RBMs and DBNs. For example, denoising autoencoders have
now been shown to be consistent estimators of the data generating
distribution [30]. Like RBMs, they are also shown to be generative
models from which one can sample. Unlike RBMs, however, an
unbiased estimator of the gradient of the training objective function
can be obtained by the denoising autoencoders, avoiding the need for
MCMC or variational approximations in the inner loop of training.
Therefore, the greedy layer-wise pre-training may be performed as
effectively by stacking the denoising autoencoders as by stacking the

Further, a general framework for layer-wise pre-training can be
found in many deep learning papers; e.g., Section 2 of [21]. This
includes, as a special case, the use of RBMs as the single-layer
building block as discussed in this section. The more general framework
can cover the RBM/DBN as well as any other unsupervised feature
extractor. It can also cover the case of unsupervised pre-training of the
representation only followed by a separate stage of learning a classifier
on top of the unsupervised, pre-trained features [215, 216, 217].

5.3 Interfacing DNNs with HMMs

The pre-trained DNN as a prominent example of the hybrid deep
networks discussed so far in this chapter is a static classifier with
input vectors having a fixed dimensionality. However, many practical
pattern recognition and information processing problems, including
speech recognition, machine translation, natural language understanding,
video processing and bio-information processing, require sequence
recognition. In sequence recognition, sometimes called classification
with structured input/output, the dimensionality of both inputs and
outputs is variable.
The HMM, based on dynamic programming operations, is a
convenient tool to help port the strength of a static classifier to
handle dynamic or sequential patterns. Thus, it is natural to combine
feed-forward neural networks and HMMs to bridge the gap between
the static and sequence pattern recognition, as was done in the early
days of neural networks for speech recognition [17, 25, 42]. A popular
architecture to fulfill this role with the use of the DNN is shown
in Figure 5.3. This architecture has been successfully used in speech
recognition experiments as reported in [67, 68].
It is important to note that the unique elasticity of temporal
dynamics of speech as elaborated in [45, 73, 76, 83] would require temporally
RBMs each as a single-layer learner.
5.3. Interfacing DNNs with HMMs 249
correlated models more powerful than HMMs for the ultimate success of speech recognition.
Integrating such dynamic models that have realistic co-articulatory properties with the DNN
and possibly other deep learning models to form the coherent dynamic deep architecture is a
challenging new research direction.

Figure 5.3: Interface between DBN/DNN and HMM to form a DNN–HMM. This architecture, developed
at Microsoft, has been successfully used in speech recognition experiments reported in
[67, 68]. [after [67, 68], @IEEE].

6
Deep Stacking Networks and Variants — Supervised Learning

6.1 Introduction

While the DNN just reviewed has been shown to be extremely powerful in connection with
performing recognition and classification tasks including speech recognition and image
classification, training a DNN has proven to be difficult computationally. In particular,
conventional techniques for training DNNs at the fine tuning phase involve the utilization of a
stochastic gradient descent learning algorithm, which is difficult to parallelize across
machines. This makes learning at large scale nontrivial. For example, it has been possible to
use one single, very powerful GPU machine to train DNN-based speech recognizers with dozens to
a few hundreds or thousands of hours of speech training data with remarkable results. It is
less clear, however, how to scale up this success with much more training data. See [69] for
recent work in this direction.
Here we describe a new deep learning architecture, the deep stacking network (DSN), which was
originally designed with the learning scalability problem in mind. This chapter is based in
part on the recent publications of [106, 110, 180, 181] with expanded discussions.

The central idea of the DSN design relates to the concept of stacking, as proposed and explored
in [28, 44, 392], where simple modules of functions or classifiers are composed first and then
they are “stacked” on top of each other in order to learn complex functions or classifiers.
Various ways of implementing stacking operations have been developed in the past, typically
making use of supervised information in the simple modules. The new features for the stacked
classifier at a higher level of the stacking architecture often come from concatenation of the
classifier output of a lower module and the raw input features. In [60], the simple module used
for stacking was a conditional random field (CRF). This type of deep architecture was further
developed with hidden states added for successful natural language and speech recognition
applications where segmentation information is unknown in the training data [429].
Convolutional neural networks, as in [185], can also be considered as a stacking architecture,
but the supervision information is typically not used until the final stacking module.
The DSN architecture was originally presented in [106] and was referred to as the deep convex
network or DCN to emphasize the convex nature of a major portion of the algorithm used for
learning the network. The DSN makes use of supervision information for stacking each of the
basic modules, which takes the simplified form of a multilayer perceptron. In the basic module,
the output units are linear and the hidden units are sigmoidal nonlinear. The linearity in the
output units permits highly efficient, parallelizable, and closed-form estimation (a result of
convex optimization) for the output network weights given the hidden units’ activities. Due to
the closed-form constraints between the input and output weights, the input weights can also be
elegantly estimated in an efficient, parallelizable, batch-mode manner, which we will describe
in some detail in Section 6.3.
The name “convex” used in [106] accentuates the role of convex optimization in learning the
output network weights given the hidden units’ activities in each basic module. It also points
to the importance of the closed-form constraints, derived from the convexity, between the input
and output weights. Such constraints make the learning of the remaining network parameters
(i.e., the input network weights) much easier than otherwise, enabling batch-mode learning of
the DSN that can be distributed over CPU clusters. And in more recent publications, the name
DSN was used when the key operation of stacking is emphasized.

6.2 A basic architecture of the deep stacking network

A DSN, as shown in Figure 6.1, includes a variable number of layered modules, wherein each
module is a specialized neural network consisting of a single hidden layer and two trainable
sets of weights. In Figure 6.1, only four such modules are illustrated, where each module is
shown with a separate color. In practice, up to a few hundreds of modules have been efficiently
trained and used in image and speech classification experiments.
The lowest module in the DSN comprises a linear layer with a set of linear input units, a
hidden nonlinear layer with a set of nonlinear units, and a second linear layer with a set of
linear output units. A sigmoidal nonlinearity is typically used in the hidden layer. However,
other nonlinearities can also be used. If the DSN is utilized in connection with recognizing an
image, the input units can correspond to a number of pixels (or extracted features) in the
image, and can be assigned values based at least in part upon intensity values, RGB values, or
the like corresponding to the respective pixels. If the DSN is utilized in connection with
speech recognition, the set of input units may correspond to samples of speech waveform, or the
extracted features from speech waveforms, such as power spectra or cepstral coefficients. The
output units in the linear output layer represent the targets of classification. For instance,
if the DSN is configured to perform digit recognition, then the output units may be
representative of the values 0, 1, 2, 3, and so forth up to 9 with a 0–1 coding scheme. If the
DSN is configured to perform speech recognition, then the output units may be representative of
phones, HMM states of phones, or context-dependent HMM states of phones.
The lower-layer weight matrix, which we denote by W, connects the linear input layer and the
hidden nonlinear layer. The upper-layer weight matrix, which we denote by U, connects the
nonlinear hidden layer with the linear output layer. The weight matrix U can be determined
through a closed-form solution given the weight matrix W when the mean square error training
criterion is used.
Figure 6.1: A DSN architecture using input–output stacking. Four modules are illustrated, each
with a distinct color. Dashed lines denote copying layers. [after [366], @IEEE].

As indicated above, the DSN includes a set of serially connected, overlapping, and layered
modules, wherein each module has the same architecture: a linear input layer followed by a
nonlinear hidden layer, which is connected to a linear output layer. Note that the output units
of a lower module are a subset of the input units of an adjacent higher module in the DSN. More
specifically, in a second module that is directly above the lowest module in the DSN, the input
units can include the output units of the lowest module and optionally the raw input feature.
This pattern of including output units in a lower module as a portion of the input units in an
adjacent higher module and thereafter learning a weight matrix that describes connection
weights between hidden units and linear output units via convex optimization can continue for
many modules. A resultant learned DSN may then be deployed in connection with an automatic
classification task such as frame-level speech phone or state classification. Connecting the
DSN’s output to an HMM or any dynamic programming device enables continuous speech recognition
and other forms of sequential pattern recognition.

6.3 A method for learning the DSN weights

Here, we provide some technical details on how the use of linear output units in the DSN
facilitates the learning of the DSN weights. A single module is used to illustrate the
advantage for simplicity reasons. First, it is clear that the upper layer weight matrix U can
be efficiently learned once the activity matrix H over all training samples in the hidden layer
is known. Let’s denote the training vectors by $X = [x_1, \ldots, x_i, \ldots, x_N]$, in which
each vector is denoted by $x_i = [x_{1i}, \ldots, x_{ji}, \ldots, x_{Di}]^T$, where D is the
dimension of the input vector, which is a function of the block, and N is the total number of
training samples. Denote by L the number of hidden units and by C the dimension of the output
vector. Then the output of a DSN block is $y_i = U^T h_i$, where $h_i = \sigma(W^T x_i)$ is the
hidden-layer vector for sample i, U is an L × C weight matrix at the upper layer of a block, W
is a D × L weight matrix at the lower layer of a block, and $\sigma(\cdot)$ is a sigmoid
function. Bias terms are implicitly represented in the above formulation if $x_i$ and $h_i$ are
augmented with ones.
Given target vectors in the full training set with a total of N samples,
$T = [t_1, \ldots, t_i, \ldots, t_N]$, where each vector is
$t_i = [t_{1i}, \ldots, t_{ji}, \ldots, t_{Ci}]^T$, the parameters U and W are learned so as to
minimize the average of the total square error below:

$$E = \frac{1}{2}\sum_i \|y_i - t_i\|^2 = \frac{1}{2}\mathrm{Tr}\left[(Y - T)(Y - T)^T\right]$$

where the output of the network is

$$y_i = U^T h_i = U^T \sigma(W^T x_i) = G_i(U, W)$$

which depends on both weight matrices, as in the standard neural net. Assuming
$H = [h_1, \ldots, h_i, \ldots, h_N]$ is known or, equivalently, W is known, setting the error
derivative with respect to U to zero gives

$$U = (HH^T)^{-1} H T^T = F(W), \quad \text{where } h_i = \sigma(W^T x_i).$$

This provides an explicit constraint between U and W, which were treated independently in the
conventional backpropagation algorithm.
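As a rough illustration of the closed-form step above, the following NumPy sketch fits U for a
single module with W held fixed and checks that the error derivative with respect to U
vanishes. The toy sizes, the random data, and the random choice of W are assumptions made only
for this demo, and the helper name fit_upper_weights is hypothetical. The last line shows the
input-output stacking described in Section 6.2, where the next module sees the raw input
together with the current module's output.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_upper_weights(X, T, W):
    """Return (U, H) with U = (H H^T)^{-1} H T^T for fixed lower-layer weights W.

    X: D x N inputs (columns are samples), T: C x N targets, W: D x L weights.
    """
    H = sigmoid(W.T @ X)                   # L x N hidden activities
    U = np.linalg.solve(H @ H.T, H @ T.T)  # L x C, solves (H H^T) U = H T^T
    return U, H

rng = np.random.default_rng(0)
D, L, C, N = 5, 8, 3, 200                  # toy sizes (assumed)
X = rng.standard_normal((D, N))
T = rng.standard_normal((C, N))
W = rng.standard_normal((D, L))            # random W, an assumption for the demo

U, H = fit_upper_weights(X, T, W)
Y = U.T @ H                                # C x N module outputs
grad_U = H @ (Y - T).T                     # dE/dU, numerically zero at the optimum

# Input-output stacking: raw input concatenated with this module's output
X_next = np.vstack([X, Y])                 # (D + C) x N input to the next module
```

Solving the linear system rather than forming the inverse explicitly is a numerical
convenience, not something the text prescribes.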
Now, given the equality constraint $U = F(W)$, let’s use the Lagrangian multiplier method to
solve the optimization problem in learning W. Optimizing the Lagrangian

$$E = \frac{1}{2}\sum_i \|G_i(U, W) - t_i\|^2 + \lambda \|U - F(W)\|,$$

we can derive a batch-mode gradient descent learning algorithm where the gradient takes the
following form [106, 413]:

$$\frac{\partial E}{\partial W} = 2X\left[H^T \circ (1 - H)^T \circ \left[H^\dagger (H T^T)(T H^\dagger) - T^T (T H^\dagger)\right]\right],$$

where $H^\dagger = H^T (HH^T)^{-1}$ is the pseudo-inverse of H and the symbol $\circ$ denotes
element-wise multiplication.
Compared with conventional backpropagation, the above method has less noise in gradient
computation due to the exploitation of the explicit constraint $U = F(W)$. As such, it was
found experimentally that, unlike backpropagation, batch training is effective, which aids
parallel learning of the DSN.

6.4 The tensor deep stacking network

The above DSN architecture has recently been generalized to its tensorized version, which we
call the tensor DSN (TDSN) [180, 181]. It has the same scalability as the DSN in terms of
parallelizability in learning, but it generalizes the DSN by providing higher-order feature
interactions missing in the DSN.
The architecture of the TDSN is similar to that of the DSN in the way that the stacking
operation is carried out. That is, modules of the TDSN are stacked up in a similar way to form
a deep architecture. The differences between the TDSN and the DSN lie mainly in how each module
is constructed. In the DSN, we have one set of hidden units forming a hidden layer, as denoted
at the left panel of Figure 6.2. In contrast, each module of a TDSN contains two independent
hidden layers, denoted as “Hidden 1” and “Hidden 2” in the middle and right panels of
Figure 6.2. As a result of this difference, the upper-layer weights, denoted by “U” in
Figure 6.2, change from a matrix (a two dimensional array) in the DSN to a tensor (a three
dimensional array) in the TDSN, shown as a cube labeled by “U” in the middle panel.

Figure 6.2: Comparisons of a single module of a DSN (left) and that of a tensor DSN (TDSN). Two
equivalent forms of a TDSN module are shown to the right. [after [180], @IEEE].

The tensor U has a three-way connection, one to the prediction layer and the remaining to the
two separate hidden layers. An equivalent form of this TDSN module is shown in the right panel
of Figure 6.2, where the implicit hidden layer is formed by expanding the two separate hidden
layers into their outer product. The resulting large vector contains all possible pair-wise
products for the two sets of hidden-layer vectors. This turns tensor U into a matrix again
whose dimensions are (1) size of the prediction layer; and (2) product of the two hidden
layers’ sizes. Such equivalence enables the same convex optimization for learning U developed
for the DSN to be applied to learning tensor U. Importantly, higher-order hidden feature
interactions are enabled in the TDSN via the outer product construction for the large, implicit
hidden layer.
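The equivalence between the tensor form and the matrix form of a TDSN module can be checked
numerically. The sketch below is only an illustration under stated assumptions: both hidden
layers are taken to be sigmoidal functions of the same input X with random weights W1 and W2,
the sizes are toy values, and a small ridge term is added purely for numerical stability. It
builds the implicit hidden layer from per-sample outer products, solves for U in matrix form
with the same least-squares step used for the DSN, and confirms that reshaping U into a
three-way tensor gives identical predictions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
D, L1, L2, C, N = 6, 4, 5, 3, 300          # toy sizes (assumed)
X = rng.standard_normal((D, N))            # inputs, columns are samples
T = rng.standard_normal((C, N))            # targets
W1 = rng.standard_normal((D, L1))          # weights of "Hidden 1" (random, assumed)
W2 = rng.standard_normal((D, L2))          # weights of "Hidden 2" (random, assumed)

H1 = sigmoid(W1.T @ X)                     # L1 x N
H2 = sigmoid(W2.T @ X)                     # L2 x N

# Implicit hidden layer: all pair-wise products h1_i * h2_j for every sample
H_big = np.einsum("in,jn->ijn", H1, H2).reshape(L1 * L2, N)

# Same closed-form least squares as for the DSN, now on the expanded layer
ridge = 1e-6                               # stabilizer, not part of the text
A = H_big @ H_big.T + ridge * np.eye(L1 * L2)
U_mat = np.linalg.solve(A, H_big @ T.T)    # (L1*L2) x C matrix form of U
U_tensor = U_mat.reshape(L1, L2, C)        # L1 x L2 x C tensor form of U

Y_mat = U_mat.T @ H_big                    # prediction via the matrix form
Y_tensor = np.einsum("ijc,in,jn->cn", U_tensor, H1, H2)  # via the tensor form
```

The implicit layer has L1 × L2 units, so its size grows multiplicatively with the two hidden
layers while the number of lower-layer weights grows only additively.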
Figure 6.3: Stacking of TDSN modules by concatenating prediction vector with input vector.
[after [180], @IEEE].

Figure 6.4: Stacking of TDSN modules by concatenating two hidden-layers’ vectors with the input
vector.

Stacking the TDSN modules to form a deep architecture proceeds in a similar way to the DSN by
concatenating various vectors. Two examples are shown in Figures 6.3 and 6.4. Note that
stacking by concatenating hidden layers with input (Figure 6.4) would be difficult for the DSN
since its hidden layer tends to be too large for practical purposes.

6.5 The Kernelized deep stacking network

The DSN architecture has also recently been generalized to its kernelized version, which we
call the kernel-DSN (K-DSN) [102, 171]. The motivation of the extension is to increase the
number of hidden units in each DSN module without increasing the number of free parameters to
learn. This goal can be easily accomplished using the kernel trick, resulting in the K-DSN
which we describe below.
In the DSN architecture reviewed above, optimizing the weight matrix U given the hidden layers’
outputs in each module is a convex optimization problem. However, the problem of optimizing
weight matrix W and thus the whole network is nonconvex. In a recent extension of DSN, a tensor
structure was imposed, shifting most of the nonconvex learning burden for W to the convex
optimization of U [180, 181]. In the new K-DSN extension, we completely eliminate nonconvex
learning for W using the kernel trick.
To derive the K-DSN architecture and the associated learning algorithm, we first take the
bottom module of DSN as an example and generalize the sigmoidal hidden layer
$h_i = \sigma(W^T x_i)$ in the DSN module into a generic nonlinear mapping function G(X) from
the raw input feature X, with high dimensionality in G(X) (possibly infinite) determined only
implicitly by a kernel function to be chosen. Second,

we formulate the constrained optimization problem of

$$\text{minimize} \quad \frac{1}{2}\mathrm{Tr}[EE^T] + \frac{C}{2} U^T U$$
$$\text{subject to} \quad T - U^T G(X) = E.$$

Third, we make use of dual representations of the above constrained optimization problem to
obtain $U = G^T a$, where vector a takes the following form:

$$a = (CI + K)^{-1} T$$

and $K = G(X) G^T(X)$ is a symmetric kernel matrix with elements $K_{nm} = g^T(x_n) g(x_m)$.
Finally, for each new input vector x in the test or dev set, we obtain the K-DSN (bottom)
module’s prediction as

$$y(x) = U^T g(x) = a^T G(X) g(x) = k^T(x) (CI + K)^{-1} T,$$

where the kernel vector k(x) is so defined that its elements have values of
$k_n(x) = k(x_n, x)$, in which $x_n$ is a training sample and x is the current test sample.
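The bottom-module formulas above have the same form as kernel ridge regression, which the
following NumPy sketch illustrates. It is a minimal example under stated assumptions: samples
are stored as rows (so T is N × C, matching $a = (CI + K)^{-1} T$), the Gaussian kernel is
parametrized as exp(−‖a − b‖² / (2σ²)), and the function names, data, and hyper-parameter
values are hypothetical.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Pairwise k(a, b) = exp(-||a - b||^2 / (2 sigma^2)); rows are samples."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def fit_kdsn_module(X_train, T, sigma, C):
    """Dual coefficients a = (C I + K)^{-1} T for one module."""
    K = gaussian_kernel(X_train, X_train, sigma)            # N x N kernel matrix
    return np.linalg.solve(C * np.eye(len(X_train)) + K, T)

def predict_kdsn_module(X_train, a, X_new, sigma):
    """y(x) = k(x)^T a, with k_n(x) = k(x_n, x), for every row x of X_new."""
    return gaussian_kernel(X_new, X_train, sigma) @ a

rng = np.random.default_rng(0)
X_train = rng.standard_normal((30, 3))     # N x D toy inputs (assumed)
T = rng.standard_normal((30, 2))           # N x C toy targets (assumed)
sigma, C = 0.5, 1e-3                       # kernel width and regularizer (assumed)

a = fit_kdsn_module(X_train, T, sigma, C)
pred_train = predict_kdsn_module(X_train, a, X_train, sigma)
pred_new = predict_kdsn_module(X_train, a, rng.standard_normal((4, 3)), sigma)
```

No hidden-layer output G(X) is ever formed here; only the N × N kernel matrix is needed, which
is also why the cost grows with the number of training samples.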
For the l-th module in the K-DSN, where l ≥ 2, the kernel matrix is modified to

$$K = G([X \,|\, Y^{(l-1)} \,|\, Y^{(l-2)} \,|\, \ldots \,|\, Y^{(1)}]) \, G^T([X \,|\, Y^{(l-1)} \,|\, Y^{(l-2)} \,|\, \ldots \,|\, Y^{(1)}]).$$

The key advantages of K-DSN can be analyzed as follows. First, unlike DSN which needs to
compute hidden units’ output, the K-DSN does not need to explicitly compute hidden units’
output G(X) or $G([X \,|\, Y^{(l-1)} \,|\, Y^{(l-2)} \,|\, \ldots \,|\, Y^{(1)}])$. When
Gaussian kernels are used, the kernel trick equivalently gives us an infinite number of hidden
units without the need to compute them explicitly. Further, we no longer need to learn the
lower-layer weight matrix W in DSN as described in [102], and the kernel parameter (e.g., the
single variance parameter σ in the Gaussian kernel) makes K-DSN much less subject to
overfitting than DSN. Figure 6.5 illustrates the basic architecture of a K-DSN using the
Gaussian kernel and using three modules.

Figure 6.5: An example architecture of the K-DSN with three modules each of which uses a
Gaussian kernel with different kernel parameters. [after [102], @IEEE].

The entire K-DSN with Gaussian kernels is characterized by two sets of module-dependent
hyper-parameters: $\sigma^{(l)}$ and $C^{(l)}$, the kernel smoothing parameter and
regularization parameter, respectively. While both parameters are intuitive and their tuning
(via line search or leave-one-out cross validation) is straightforward for a single bottom
module, tuning the full network with all the modules is more difficult. For example, if the
bottom module is tuned too well, then adding more modules would not benefit much. In contrast,
when the lower modules are loosely tuned (i.e., relaxed from the results obtained from
straightforward methods), the overall K-DSN often performs much better. The experimental
results reported by Deng et al. [102] are obtained using a set of empirically determined tuning
schedules to adaptively regularize the K-DSN from bottom to top modules.
The K-DSN described here has a set of highly desirable properties from the machine learning and
pattern recognition perspectives. It combines the power of deep learning and kernel learning in
a principled
way and, unlike the basic DSN, there is no longer a nonconvex optimization problem involved in
training the K-DSN. The computation steps make the K-DSN easier to scale up for parallel
computing in distributed servers than the DSN and TDSN. There are many fewer parameters in the
K-DSN to tune than in the DSN, TDSN, and DNN, and there is no need for pre-training. It is
found in the study of [102] that regularization plays a much more important role in the K-DSN
than in the basic DSN and TDSN. Further, effective regularization schedules developed for
learning the K-DSN weights can be motivated by intuitive insight from useful optimization
tricks such as the heuristic in Rprop or resilient backpropagation algorithm [302].
However, as inherent in any kernel method, the scalability becomes an issue also for the K-DSN
as the training and testing samples become very large. A solution is provided in the study by
Huang et al. [171], based on the use of random Fourier features, which possess the strong
theoretical property of approximating the Gaussian kernel while rendering efficient computation
in both training and evaluation of the K-DSN with large training samples. It is empirically
demonstrated that just like the conventional K-DSN exploiting rigorous Gaussian kernels, the
use of random Fourier features also enables successful stacking of kernel modules to form a
deep architecture.

7
Selected Applications in Speech and Audio Processing

7.1 Acoustic modeling for speech recognition

As discussed in Section 2, speech recognition is the very first successful application of deep
learning methods at an industry scale. This success is a result of close academic-industrial
collaboration, initiated at Microsoft Research, with the involved researchers identifying and
acutely attending to the industrial need for large-scale deployment
[68, 89, 109, 161, 323, 414]. It is also a result of carefully exploiting the strengths of the
deep learning and the then-state-of-the-art speech recognition technology, including notably
the highly efficient decoding techniques.
Speech recognition has long been dominated by the GMM–HMM
method, with an underlying shallow or flat generative model of context-
dependent GMMs and HMMs (e.g., [92, 93, 187, 293]). Neural networks
once were a popular approach but had not been competitive with the
GMM–HMM [42, 87, 261, 382]. Generative models with deep hidden
dynamics likewise have also not been clearly competitive (e.g., [45, 73,
108, 282]).
Deep learning and the DNN started making their impact in speech
recognition in 2010, after close collaborations between academic and

industrial researchers; see reviews in [89, 161]. The collaborative work started in phone
recognition tasks [89, 100, 135, 136, 257, 260, 258, 309, 311, 334], demonstrating the power of
hybrid DNN architectures discussed in Section 5 and of subsequent new architectures with
convolutional and recurrent structure. The work also showed the importance of raw speech
features of spectrogram, moving back from the long-popular MFCC features toward but not yet
reaching the raw speech-waveform level [183, 327]. The collaboration continued to large
vocabulary tasks with more convincing, highly positive results
[67, 68, 94, 89, 161, 199, 195, 223, 323, 353, 399, 414]. The success in large vocabulary
speech recognition is in large part attributed to the use of a very large DNN output layer
structured in the same way as the GMM–HMM speech units (senones), motivated partially by the
speech researchers’ desires to take advantage of the context-dependent phone modeling
techniques that have been proven to work well in the GMM–HMM framework, and to keep the change
of the already highly efficient decoder software’s infrastructure developed for the GMM–HMM
systems to a minimum. In the meantime, this body of work also demonstrated the possibility to
reduce the need for the DBN-like pre-training in effective learning of DNNs when a large amount
of labeled data is available. A combination of three factors helped to quickly spread the
success of deep learning in speech recognition to the entire speech industry and academia:
(1) significantly lowered errors compared with the then-state-of-the-art GMM–HMM systems;
(2) minimal decoder changes required to deploy the new DNN-based speech recognizer due to the
use of senones as the DNN output; and (3) reduced system complexity empowered by the DNN’s
strong modeling power. By the ICASSP-2013 timeframe, at least 15 major speech recognition
groups worldwide confirmed experimentally the success of DNNs with very large tasks and with
the use of raw speech spectral features other than MFCCs. The most notable groups include major
industrial speech labs worldwide: Microsoft [49, 89, 94, 324, 399, 430], IBM
[195, 309, 311, 307, 317], Google [69, 150, 184, 223], iFlyTek, and Baidu. Their results
represent a new state-of-the-art in speech recognition widely deployed in these companies’
voice products and services with extensive media coverage in recent years.
In the remainder of this chapter, we review a wide range of speech recognition work based on
deep learning methods according to several major themes expressed in the section titles.

7.1.1 Back to primitive spectral features of speech

Deep learning, also referred to as representation learning or (unsupervised) feature learning,
sets an important goal of automatic discovery of powerful features from raw input data
independent of application domains. For speech feature learning and for speech recognition,
this goal is condensed to the use of primitive spectral or possibly waveform features. Over the
past 30 years or so, largely “hand-crafted” transformations of speech spectrogram have led to
significant accuracy improvements in the GMM-based HMM systems, despite the known loss of
information from the raw speech data. The most successful transformation is the non-adaptive
cosine transform, which gave rise to Mel-frequency cepstral coefficients (MFCC) features. The
cosine transform approximately de-correlates feature components, which is important for the use
of GMMs with diagonal covariance matrices. However, when GMMs are replaced by deep learning
models such as DNNs, deep belief nets (DBNs), or deep autoencoders, such de-correlation becomes
irrelevant due to the very strength of the deep learning methods in modeling data correlation.
As discussed in detail in Section 4, early work of [100] demonstrated this strength and in
particular the benefit of spectrograms over MFCCs in effective coding of bottleneck speech
features using autoencoders in an unsupervised manner.
The pipeline from speech waveforms (raw speech features) to MFCCs and their temporal
differences goes through intermediate stages of log-spectra and then (Mel-warped) filter-banks,
with learned parameters based on the data. An important character of deep learning is to move
away from separate design of feature representations and of classifiers. This idea of jointly
learning classifier and feature transformation for speech recognition was already explored in
early studies on the GMM–HMM based systems; e.g., [33, 50, 51, 299]. However, greater speech
recognition performance gain is obtained only recently
in the recognizers empowered by deep learning methods. For example,


Mohamed et al. [259], Li et al. [221], and Deng et al. [94] showed signif-
icantly lowered speech recognition errors using large-scale DNNs when
moving from the MFCC features back to more primitive (Mel-scaled)
filter-bank features. These results indicate that DNNs can learn a bet-
ter transformation than the original fixed cosine transform from the
Mel-scaled filter-bank features.
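To make the relation between the two feature types concrete, the sketch below writes the
fixed cosine transform as a matrix applied to log Mel filter-bank features. It is an
illustration only: the filter-bank input is random stand-in data, the sizes (40 filters, 13
cepstral coefficients) are common but assumed values, and the orthonormal DCT-II scaling is
one of several conventions in use. Because the step is a fixed linear map, a network fed with
filter-bank features can in principle represent it, or something different, in its first
layer of weights.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis of size n x n; row k is the k-th cosine basis function."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    M = np.sqrt(2.0 / n) * np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    M[0] /= np.sqrt(2.0)
    return M

n_frames, n_filters, n_ceps = 100, 40, 13  # assumed sizes
rng = np.random.default_rng(0)
# Stand-in for log Mel filter-bank energies (frames x filters); not real speech data
log_fbank = rng.standard_normal((n_frames, n_filters))

dct = dct_matrix(n_filters)
mfcc = log_fbank @ dct[:n_ceps].T          # keep the first 13 cosine coefficients

# The full transform is invertible; truncating to n_ceps coefficients is what loses information
recovered = (log_fbank @ dct.T) @ dct
```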
Compared with MFCCs, “raw” spectral features not only retain
more information, but also enable the use of convolution and pool-
ing operations to represent and handle some typical speech variabil-
ity — e.g., vocal tract length differences across speakers, distinct speak-
ing styles causing formant undershoot or overshoot, etc. — expressed
explicitly in the frequency domain. For example, the convolutional neu-
ral network (CNN) can only be meaningfully and effectively applied to
speech recognition [1, 2, 3, 94] when spectral features, instead of MFCC
features, are used.
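A toy example can show why convolution and pooling along the frequency axis tolerate small
spectral shifts of the kind caused by vocal tract length differences. The sketch below is a
deliberately simplified illustration, not a CNN from the cited work: the 40-bin "spectrum"
contains one synthetic peak, the three-tap template and the pooling width of 4 are assumed
values, and only a one-bin shift that stays inside a pooling window is tested.

```python
import numpy as np

def conv_and_pool(spectrum, template, pool_width):
    """Valid cross-correlation along frequency, then non-overlapping max pooling."""
    response = np.correlate(spectrum, template, mode="valid")
    usable = (len(response) // pool_width) * pool_width
    pooled = response[:usable].reshape(-1, pool_width).max(axis=1)
    return response, pooled

template = np.array([1.0, 2.0, 1.0])       # hypothetical formant-like detector
spec_a = np.zeros(40)
spec_a[10:13] = template                   # synthetic peak centred on bin 11
spec_b = np.roll(spec_a, 1)                # same peak shifted one bin higher

resp_a, pooled_a = conv_and_pool(spec_a, template, pool_width=4)
resp_b, pooled_b = conv_and_pool(spec_b, template, pool_width=4)

# The strongest convolution response moves with the peak, but the strongest
# pooled unit and its value stay the same for this small shift.
peak_positions = (int(resp_a.argmax()), int(resp_b.argmax()))
pooled_positions = (int(pooled_a.argmax()), int(pooled_b.argmax()))
```

The other pooled units do change slightly, and a shift that crosses a pooling boundary would
move the strongest unit, so the invariance is local rather than exact.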
More recently, Sainath et al. [307] went one step further toward raw features by learning the
parameters that define the filter-banks on power spectra. That is, rather than using Mel-warped
filter-bank features as the input features as in [1, 3, 50, 221], the weights corresponding to
the Mel-scale filters are only used to initialize the parameters, which are subsequently
learned together with the rest of the deep network as the classifier. The overall architecture
of the jointly learned feature generator and classifier is shown in Figure 7.1. Substantial
speech recognition error reduction is reported in [307].

Figure 7.1: Illustration of the joint learning of filter parameters and the rest of the deep
network. [after [307], @IEEE].

It has been shown that not only is learning the spectral aspect of the features beneficial for
speech recognition, but learning the temporal aspect of the features is also helpful [332].
Further, Yu et al. [426] carefully analyzed the properties of different layers in the DNN as
the layer-wise extracted features starting from the lower raw filter-bank features. They found
that the improved speech recognition accuracy achieved by the DNNs is partially attributed to
the DNN’s ability to extract discriminative internal representations that are robust to the
many sources of variability in speech signals. They also show that these representations become
increasingly insensitive to small perturbations in the input at higher layers, which helps to
achieve better speech recognition accuracy.
To the extreme end, deep learning would promote the use of the lowest level of raw features of
speech, i.e., speech sound waveforms, for speech recognition, and learn the transformation
automatically. As an initial attempt toward this goal, the study carried out by Jaitly and
Hinton [183] makes use of speech sound waves as the raw input feature to an RBM with a
convolutional structure as the classifier. With the use of rectified linear units in the hidden
layer [130], it is possible, to a limited extent, to automatically normalize the amplitude
variation in the waveform signal. Although the final results are disappointing, the work shows
that much work is needed along this direction. For example, just as Sainath et al. [307]
demonstrated that the use of raw spectra as features requires more attention to normalization
than MFCCs, the use of speech waveforms demands even more attention to normalization [327].
This is true for both GMM-based and deep learning based methods.

7.1.2 The DNN–HMM architecture versus use of
DNN-derived features

Another major theme in the recent studies reported in the literature
on applying deep learning methods to speech recognition concerns two
disparate ways of using the DNN: (1) direct application of the
DNN–HMM architecture, as discussed in Section 5.3, to perform speech
recognition; and (2) the use of DNNs to extract or derive features,
which are then fed into a separate sequence classifier. In the speech
recognition literature [42], a system in which a neural network’s
output is directly used to estimate the emission probabilities of an
HMM is often called an ANN/HMM hybrid system. This should be
distinguished from the use of “hybrid” in Section 5 and throughout
this monograph, where a hybrid of unsupervised pre-training and of
supervised fine tuning is exploited to learn the parameters of DNNs.

7.1.2.1 The DNN–HMM architecture as a recognizer

An early DNN–HMM architecture [257] was presented at the NIPS
Workshop [109], developed, analyzed, and assisted by University of
Toronto and MSR speech researchers. In this work, a five-layer DNN
(called the DBN in the paper) was used to replace the Gaussian mixture
models in the GMM–HMM system, and the monophone state was used as the
modeling unit. Although monophones are generally accepted as a weaker
phonetic representation than triphones, the DNN–HMM approach with
monophones was shown to achieve higher phone recognition accuracy than
the state-of-the-art triphone GMM–HMM systems. Further, the DNN
results were found to be slightly superior to the then-best-performing
single system based on the generative hidden trajectory model (HTM) in
the literature [105, 108], evaluated on the same TIMIT task commonly
used by many speech researchers [107, 108, 274, 313]. At MSR, Redmond,
the error patterns produced by these two separate systems (the DNN
vs. the HTM) were carefully analyzed and found to be very different,
reflecting distinct core capabilities of the two approaches and
igniting intensive further studies on the DNN–HMM approach described
below.

MSR and University of Toronto researchers [67, 68, 414] extended the
DNN–HMM system from the monophone phonetic representation of the DNN
outputs to the triphone or context-dependent counterpart and from
phone recognition to large vocabulary speech recognition. Experiments
conducted at MSR on the 24-hour and 48-hour Bing mobile voice search
datasets collected under the real usage scenario demonstrate that the
context-dependent DNN–HMM significantly outperforms the
state-of-the-art GMM–HMM system. Three factors, in addition to the use
of the DNN, contribute to the success: the use of tied triphones as
the DNN modeling units, the use of the best available tri-phone
GMM–HMM to generate the tri-phone state alignment, and the effective
exploitation of a long window of input features. Experiments also
indicate that the decoding time of a five-layer DNN–HMM is almost the
same as that of the state-of-the-art triphone GMM–HMM.

The success was quickly extended to large vocabulary speech
recognition tasks with hundreds and even thousands of hours of
training data and with thousands of tri-phone states, including the
Switchboard and Broadcast News databases, and Google’s voice search
and YouTube tasks [94, 161, 184, 309, 311, 324]. For example, on the
Switchboard benchmark, the context-dependent DNN–HMM (CD-DNN–HMM) is
shown to cut error by one third compared to the state-of-the-art
GMM–HMM system [323]. As a summary, we show in Table 7.1 some
quantitative recognition error rates in relatively early literature
produced by the basic DNN–HMM architecture in comparison with those
by the previous state-of-the-art systems based on the generative
models. (More advanced architectures have produced better results
than shown here.) Note that from sub-tables A to D, the training data
are increased approximately one order of magnitude from one task to
the next. Not only does the computation scale up well (i.e., almost
linearly) with the training size, but most importantly the relative
error rate reduction increases substantially with increasing amounts
of training data, from approximately 10% to 20%, and then to 30%.
This set of results highlights the strongly desirable properties of
the DNN-based methods, despite the conceptual simplicity of the
overall DNN–HMM architecture and some known weaknesses.
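
The relative error rate reductions quoted above can be checked
directly against the error rates listed in Table 7.1. The short Python
sketch below is only an illustration: the dictionary layout and the
helper name relative_reduction are assumptions made here, and the
numbers are simply copied from the table.

```python
# Error rates (%) copied from Table 7.1 as (GMM-based baseline, DNN).
# The dictionary layout and the helper name are illustrative assumptions.
TABLE_7_1 = {
    "A: TIMIT phone recognition (3 h)": (24.8, 23.0),
    "B: Voice Search SER (24-48 h)": (36.2, 30.1),
    "C: Switchboard WER (309 h)": (23.6, 15.8),
    "D: Switchboard WER (2000 h)": (21.7, 14.6),
}


def relative_reduction(baseline: float, new: float) -> float:
    """Relative error rate reduction, in percent, of `new` over `baseline`."""
    return 100.0 * (baseline - new) / baseline


# One relative reduction per sub-table, keyed by task name.
reductions = {task: relative_reduction(gmm, dnn)
              for task, (gmm, dnn) in TABLE_7_1.items()}

for task, value in reductions.items():
    print(f"{task}: {value:.1f}% relative reduction")
```

With these inputs the reductions come out at roughly 7%, 17%, 33%, and
33% for sub-tables A to D, in line with the rounded progression of
about 10%, 20%, and 30% described in the text.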

Table 7.1: Comparisons of the DNN–HMM architecture with the generative
model (e.g., the GMM–HMM) in terms of phone or word recognition error
rates. From sub-tables A to D, the training data are increased by
approximately three orders of magnitude.

Features   Setup                 Error Rates

A: TIMIT phone recognition (3 hours of training)
GMM        w. Hidden dynamics    24.8%
DNN        5 layers × 2048       23.0%

B: Voice Search SER (24–48 hours of training)
GMM        MPE (760 24-mix)      36.2%
DNN        5 layers × 2048       30.1%

C: Switchboard WER (309 hours of training)
GMM        BMMI (9K 40-mix)      23.6%
DNN        7 layers × 2048       15.8%

D: Switchboard WER (2000 hours of training)
GMM        BMMI (18K 72-mix)     21.7%
DNN        7 layers × 2048       14.6%

7.1.2.2 The use of DNN-derived features in a separate recognizer

One clear weakness of the above DNN–HMM architecture for speech
recognition is that many of the highly effective techniques for the
GMM–HMM systems, including discriminative training (in both feature
space and model space), unsupervised speaker adaptation, noise
robustness, and scalable batch training tools for big training data,
developed over the past 20-some years, may not be directly applicable
to the new systems, although similar techniques have been recently
developed for DNN–HMMs. To remedy this problem, the “tandem” approach,
developed originally by Hermansky et al. [154], has been adopted,
where the outputs of the neural networks, in the form of posterior
probabilities for phone classes, are used, often in conjunction with
the acoustic features to form new augmented input features, in a
separate GMM–HMM system.

This tandem approach is used by Vinyals and Ravuri [379], where a
DNN’s outputs are extracted to serve as the features for mismatched
noisy speech. It is reported that DNNs outperform the neural networks
with a single hidden layer under the clean condition, but the gains
slowly diminish as the noise level is increased. Furthermore, using
MFCCs in conjunction with the posteriors computed from DNNs
outperforms using the DNN features alone in low to moderate noise
conditions with the tandem architecture. Comparisons of this tandem
approach with the direct DNN–HMM approach are made by Tüske et al.
[368] and Imseng et al. [182].

An alternative way of extracting the DNN features is to use the
“bottleneck” layer, which is narrower than other layers in the DNN,
to restrict the capacity of the network. Then, such bottleneck
features are fed to a GMM–HMM system, often in conjunction with the
original acoustic features and some dimensionality reduction
techniques. The bottleneck features derived from the DNN are believed
to capture information complementary to conventional acoustic
features derived from the short-time spectra of the input. A speech
recognizer based on the above bottleneck feature approach is built by
Yu and Seltzer [425], with the overall architecture shown in
Figure 7.2. Several variants of the DNN-based bottleneck-feature
approach have been explored; see details in [16, 137, 201, 285, 308,
368].

Figure 7.2: Illustration of the use of bottleneck (BN) features
extracted from a DNN in a GMM–HMM speech recognizer. [after [425],
@IEEE].

Yet another method to derive the features from the DNN is to feed
its top-most hidden layer as the new features for a separate speech

recognizer. In [399], a GMM–HMM is used as such a recognizer, and the
high-dimensional, DNN-derived features are subject to dimensionality
reduction before feeding them into the recognizer. More recently, a
recurrent neural network (RNN) is used as the “backend” recognizer
receiving the high-dimensional, DNN-derived features as the input
without dimensionality reduction [48, 85]. These studies also show
that the use of the top-most hidden layer of the DNN as features is
better than other hidden layers and also better than the output layer
in terms of recognition accuracy for the RNN sequence classifier.

7.1.3 Noise robustness by deep learning

The study of noise robustness in speech recognition has a long
history, mostly before the recent rise of deep learning. One major
contributing factor to the often observed brittleness of speech
recognition technology is the inability of the standard
GMM–HMM-based acoustic model to accurately model noise-distorted
speech test data that differs in character from the training data,
which may or may not be distorted by noise. A wide range of
noise-robust techniques developed over the past 30 years can be
analyzed and categorized using five different criteria:
(1) feature-domain versus model-domain processing, (2) the use of
prior knowledge about the acoustic environment distortion, (3) the
use of explicit environment-distortion models, (4) deterministic
versus uncertainty processing, and (5) the use of acoustic models
trained jointly with the same feature enhancement or model adaptation
process used in the testing stage. See a comprehensive review in
[220] and some additional review literature or original work in
[4, 82, 119, 140, 230, 370, 404, 431, 444].

Many of the model-domain techniques developed for GMM–HMMs (e.g.,
model-domain noise robustness techniques surveyed by Li et al. [220]
and Gales [119]) are not directly applicable to the new deep learning
models for speech recognition. The feature-domain techniques,
however, can be directly applied to the DNN system. A detailed
investigation of the use of DNNs for noise robust speech recognition
in the feature domain was reported by Seltzer et al. [325], who
applied the C-MMSE [415] feature enhancement algorithm on the input
feature used in the DNN. By processing both the training and testing
data with the same algorithm, any consistent errors or artifacts
introduced by the enhancement algorithm can be learned by the
DNN–HMM recognizer. This study also successfully explored the use of
the noise aware training paradigm for training the DNN, where each
observation was augmented with an estimate of the noise. Strong
results were obtained on the Aurora4 task. More recently, Kashiwagi
et al. [191] applied the SPLICE feature enhancement technique [82] to
a DNN speech recognizer. In that study the DNN’s output layer was
determined on clean data instead of on noisy data as in the study
reported by Seltzer et al. [325].

Besides the DNN, other deep architectures have also been proposed to
perform feature enhancement and noise-robust speech recognition. For
example, Maas et al. [235] applied a deep recurrent auto encoder
neural network to remove noise in the input features for robust
speech recognition. The model was trained on stereo (noisy and clean)
speech features to predict clean features given noisy input, similar
to the SPLICE setup but using a deep model instead of a GMM. Vinyals
and Ravuri [379] investigated the tandem approaches to noise-robust
speech recognition, where DNNs were trained directly with noisy
speech to generate posterior features. Finally, Rennie et al. [300]
explored the use of a version of the RBM, called the factorial hidden
RBM, for noise-robust speech recognition.

7.1.4 Output representations in the DNN

Most deep learning methods for speech recognition and other
information processing applications have focused on learning
representations from input acoustic features without paying attention
to output representations. The recent 2013 NIPS Workshop on Learning
Output Representations
(http://nips.cc/Conferences/2013/Program/event.php?ID=3714) was
dedicated to bridging this gap. For example, the Deep Visual-Semantic
Embedding Model (described in [117], to be discussed more in
Section 11) exploits continuous-valued output representations
obtained from the text embeddings to assist in the

branch of the deep network for classifying images. For speech
recognition, the importance of designing effective linguistic
representations for the output layers of deep networks is highlighted
in [79].

Most current DNN systems use a high-dimensional output representation
to match the context-dependent phonetic states in the HMMs. For this
reason, the output layer evaluation can cost 1/3 of the total
computation time. To improve the decoding speed, techniques such as
low-rank approximation are typically applied to the output layer. In
[310] and [397], the DNN with high-dimensional output layer was
trained first. The singular value decomposition (SVD)-based dimension
reduction technique was then performed on the large output-layer
matrix. The resulting matrices are further combined, and as a result
the original large weight matrix is approximated by a product of two
much smaller matrices. This technique in essence converts the original
large output layer to two layers, a bottleneck linear layer and a
nonlinear output layer, both with smaller weight matrices. The
converted DNN with reduced dimensionality is further refined. The
experimental results show that no speech recognition accuracy
reduction was observed even when the size is cut to half, while the
run-time computation is significantly reduced.

The output representations for speech recognition can benefit from
the structured design of the symbolic or phonological units of speech
as presented in [79]. The rich phonological structure of symbolic
nature in human speech has been well known for many years. Likewise,
it has also been well understood for a long time that the use of
phonetic or its finer state sequences, even with contextual
dependency, in engineering speech recognition systems, is inadequate
in representing such rich structure [86, 273, 355], thus leaving a
promising open direction to improve the speech recognition systems’
performance. Basic theories about the internal structure of speech
sounds and their relevance to speech recognition technology in terms
of the specification, design, and learning of possible output
representations of the underlying speech model for speech target
sequences are surveyed in [76] and more recently in [79].

There has been a growing body of deep learning work in speech
recognition with its focus placed on designing output representations
related to linguistic structure. In [383, 384], a limitation of the
output representation design, based on the context-dependent phone
units as proposed in [67, 68], is recognized and a solution is
offered. The root cause of this limitation is that all
context-dependent phone states within a cluster created by the
decision tree share the same set of parameters, and this reduces its
resolution power for fine-grained states during the decoding phase.
The solution proposed formulates output representations of the
context-dependent DNN as an instance of the canonical state modeling
technique, making use of broad phonetic classes. First, triphones are
clustered into multiple sets of shorter bi-phones using broad phone
contexts. Then, the DNN is trained to discriminate the bi-phones
within each set. Logistic regression is used to transform the
canonical states into the detailed triphone state output
probabilities. That is, the overall design of the output
representation of the context-dependent DNN is hierarchical in
nature, solving both the data sparseness and low-resolution problems
at the same time.

Related work on designing the output linguistic representations for
speech recognition can be found in [197] and in [241]. While the
designs are in the context of GMM–HMM-based speech recognition
systems, they both can be extended to deep learning models.

7.1.5 Adaptation of the DNN-based speech recognizers

The DNN–HMM is an advanced version of the artificial neural network
and HMM “hybrid” system developed in the 1990s, for which several
adaptation techniques have been developed. Most of these techniques
are based on linear transformation of the network weights of either
input or output layers. A number of exploratory studies on DNN
adaptation made use of the same or related linear transformation
methods [223, 401, 402]. However, compared with the earlier narrower
and shallower neural network systems, the DNN–HMM has significantly
more parameters due to the wider and deeper hidden layers used and the
much larger output layer designed to model context dependent phones
and states. This difference poses special challenges for adapting the
DNN–HMM, especially when the adaptation data is small. Here we discuss

representative recent studies on overcoming such challenges in
adapting the large-sized DNN weights in distinct ways.

Yu et al. [430] proposed a regularized adaptation technique for DNNs.
It adapts the DNN weights conservatively by forcing the distribution
estimated from the adapted model to be close to that estimated from
the model before the adaptation. This constraint is realized by
adding Kullback–Leibler divergence (KLD) regularization to the
adaptation criterion. This type of regularization is shown to be
equivalent to a modification of the target distribution in the
conventional backpropagation algorithm, and thus the training of the
DNN remains largely unchanged. The new target distribution is derived
to be a linear interpolation of the distribution estimated from the
model before adaptation and the ground truth alignment of the
adaptation data. This interpolation prevents overtraining by keeping
the adapted model from straying too far from the speaker-independent
model. This type of adaptation differs from L2 regularization, which
constrains the model parameters themselves rather than the output
probabilities.

In [330], adaptation of the DNN was applied not on the conventional
network weights but on the hidden activation functions. In this way,
the main limitation of current adaptation techniques based on
adaptable linear transformation of the network weights in either the
input or the output layer is effectively overcome, since the new
method only needs to adapt a more limited number of hidden activation
functions.

Several studies were carried out with success on unsupervised or
semi-supervised adaptation of DNN acoustic models with different
types of input features [223, 405].

Most recently, Saon et al. [317] explored a new and highly effective
method for adapting DNNs for speech recognition. The method combined
I-vector features with fMLLR (feature-domain maximum likelihood linear
regression) features as the input into a DNN. I-vectors or (speaker)
identity vectors are commonly used for speaker verification and
speaker recognition applications, as they encapsulate relevant
information about a speaker’s identity in a low-dimensional feature
vector. The fMLLR is an effective adaptation technique developed for
GMM–HMM systems. Since I-vectors do not obey locality in frequency,
they must be combined carefully with the fMLLR features that obey
locality. The architecture of the multi-scale CNN–DNN was shown to be
effective for the combination of these two different types of
features. During both training and decoding, the speaker-specific
I-vector was appended to the frame-based fMLLR features.

7.1.6 Better architectures and nonlinear units

Over recent years, since the success of the (fully-connected) DNN–HMM
hybrid system was demonstrated in [67, 68, 109, 161, 257, 258, 308,
309, 324, 429], many new architectures and nonlinear units have been
proposed and evaluated for speech recognition. Here we provide an
overview of this progress, extending the overview provided in [89].

The tensor version of the DNN is reported by Yu et al. [421, 422],
which extends the conventional DNN by replacing one or more of its
layers with a double-projection layer and a tensor layer. In the
double-projection layer, each input vector is projected into two
nonlinear subspaces. In the tensor layer, two subspace projections
interact with each other and jointly predict the next layer in the
overall deep architecture. An approach is developed to map the tensor
layers to the conventional sigmoid layers so that the former can be
treated and trained in a similar way to the latter. With this mapping
the tensor version of the DNN can be treated as the DNN augmented
with double-projection layers so that the backpropagation learning
algorithm can be cleanly derived and relatively easily implemented.

A related architecture to the above is the tensor version of the DSN
described in Section 6, also usefully applied to speech classification
and recognition [180, 181]. The same approach applies to mapping the
tensor layers (i.e., the upper layer in each of the many modules in
the DSN context) to the conventional sigmoid layers. Again, this
mapping simplifies the training algorithm so that it becomes not so
far apart from that for the DSN.

As discussed in Section 3.2, the concept of convolution in time
originated in the TDNN (time-delay neural network) as a shallow
neural network [202, 382] developed during the early days of speech
recognition. Only recently and when deep architectures (e.g., deep
convolutional neural network or deep CNN) were used, it has been

found that frequency-dimension weight sharing is more effective for
high-performance phone recognition, when the HMM is used to handle
the time variability, than time-domain weight sharing as in the
previous TDNN in which the HMM was not used [1, 2, 3, 81]. These
studies also show that designing the pooling scheme in the deep CNN
to properly trade off between invariance to vocal tract length and
discrimination among speech sounds, together with a regularization
technique of “dropout” [166], leads to even better phone recognition
performance. This set of work further points to the direction of
trading off between trajectory discrimination and invariance
expressed in the whole dynamic pattern of speech defined in mixed
time and frequency domains using convolution and pooling. Moreover,
the most recent studies reported in [306, 307, 312] show that CNNs
also benefit large vocabulary continuous speech recognition. They
further demonstrate that multiple convolutional layers provide even
more improvement when the convolutional layers use a large number of
convolution kernels or feature maps. In particular, Sainath et al.
[306] extensively explored many variants of the deep CNN. In
combination with several novel methods the deep CNN is shown to
produce state of the art results in a few large vocabulary speech
recognition tasks.

In addition to the DNN, CNN, and DSN, as well as their tensor
versions, other deep models have also been developed and reported in
the literature for speech recognition. For example, the
deep-structured CRF, which stacks many layers of CRFs, has been
usefully applied to the task of language identification [429], phone
recognition [410], sequential labeling in natural language processing
[428], and confidence calibration in speech recognition [423]. More
recently, Demuynck and Triefenbach [70] developed the deep GMM
architecture, where the aspects of DNNs that lead to strong
performance are extracted and applied to build hierarchical GMMs.
They show that by going “deep and wide” and feeding windowed
probabilities of a lower layer of GMMs to a higher layer of GMMs, the
performance of the deep-GMM system can be made comparable to a DNN.
One advantage of staying in the GMM space is that the decades of work
in GMM adaptation and discriminative learning remain applicable.

Perhaps the most notable deep architecture among all is the recurrent
neural network (RNN) as well as its stacked or deep versions
[135, 136, 153, 279, 377]. While the RNN saw its early success in
phone recognition [304], it was not easy to duplicate due to the
intricacy in training, let alone to scale up for larger speech
recognition tasks. Learning algorithms for the RNN have been
dramatically improved since then, and much better results have been
obtained recently using the RNN [48, 134, 235], especially when the
bi-directional LSTM (long short-term memory) is used [135, 136]. The
basic information flow in the bi-directional RNN and a cell of LSTM
is shown in Figures 7.3 and 7.4, respectively.

Figure 7.3: Information flow in the bi-directional RNN, with both
diagrammatic and mathematical descriptions. W’s are weight matrices,
not shown but can be easily inferred in the diagram. [after [136],
@IEEE].

Learning the RNN parameters is known to be difficult due to vanishing
or exploding gradients [280]. Chen and Deng [48] and Deng and

Chen [85] developed a primal-dual training method that formulates the
learning of the RNN as a formal optimization problem, where cross
entropy is minimized subject to the condition that the infinity norm
of the recurrent matrix of the RNN is less than a fixed value to
guarantee the stability of RNN dynamics. Experimental results on
phone recognition demonstrate: (1) the primal-dual technique is
highly effective in learning RNNs, with superior performance to the
earlier heuristic method of truncating the size of the gradient;
(2) the use of a DNN to compute high-level features of speech data to
feed into the RNN gives much higher accuracy than without using the
DNN; and (3) the accuracy drops progressively as the DNN features are
extracted from higher to lower hidden layers of the DNN.

Figure 7.4: Information flow in an LSTM unit of the RNN, with both
diagrammatic and mathematical descriptions. W’s are weight matrices,
not shown but can easily be inferred in the diagram. [after [136],
@IEEE].

A special case of the RNN is reservoir models or echo state networks,
where the output layers are fixed to be linear instead of nonlinear
as in the regular RNN, and where the recurrent matrices are carefully
designed but not learned. The input matrices are also fixed and not
learned, due partly to the difficulty of learning. Only the weight
matrices between the hidden and output layers are learned. Since the
output layer is linear, the learning is very efficient, with the
global optimum achievable by a closed-form solution. But because many
parameters are not learned, the hidden layer needs to be very large
in order to obtain good results. Triefenbach et al. [365] applied
such models to phone recognition, with reasonably good accuracy
obtained.

Palangi et al. [276] presented an improved version of the reservoir
model in which both the input and recurrent matrices are learned.
These matrices were fixed in the previous model, which makes use of
the linear output (or readout) units only to simplify the learning of
the output matrix in the RNN. In the improved version, a special
technique is devised that takes advantage of the linearity in the
output units in the reservoir model to learn the input and recurrent
matrices. Compared with the backpropagation through time (BPTT)
algorithm commonly used in learning the general RNNs, the proposed
technique makes use of the linearity in the output units to provide
constraints among various matrices in the RNN, enabling the
computation of the gradients as the learning signal in an analytical
form instead of by recursion as in the BPTT.

In addition to the recent innovations in better architectures of deep
learning models for speech recognition reviewed above, there is also
a growing body of work on developing and implementing better
nonlinear units. Although sigmoidal and tanh functions are the most
commonly used nonlinear types in DNNs, their limitations are well
known. For example, it is slow to learn the whole network due to weak
gradients when the units are close to saturation in both directions.
Jaitly and Hinton [183] appear to be the first to apply the rectified
linear units (ReLU) in the DNNs to speech recognition to overcome the
weakness of the sigmoidal units. ReLU refers to the units in a neural
network that use the activation function f(x) = max(0, x). Dahl et
al. [65]

and Maas et al. [234] successfully applied ReLU to large vocabulary
speech recognition, with the best accuracy obtained when combining
ReLU with the “Dropout” regularization technique.

Another new type of DNN units demonstrated more recently to be useful
for speech recognition is the “maxout” units, which were used for
forming the deep maxout network as described in [244]. A deep maxout
network consists of multiple layers which generate hidden activations
via the maximum or “maxout” operation over a fixed number of weighted
inputs called a “group.” This is the same operation as the max
pooling used in the CNN as discussed earlier for both speech
recognition and computer vision. The maximal value within each group
is taken as the output from the previous layer. Most recently, Zhang
et al. [441] generalize the above “maxout” units to two new types.
The “soft-maxout” type of units replaces the original max operation
with the soft-max function. The second, p-norm type of units uses the
nonlinearity y = ||x||_p. It is shown experimentally that the p-norm
units with p = 2 perform consistently better than the maxout, tanh,
and ReLU units. In Gulcehre et al. [138], techniques that
automatically learn the p-norm were proposed and investigated.

Finally, Srivastava et al. [350] propose yet another new type of
nonlinear units, called winner-take-all units. Here, local
competition among neighboring neurons is incorporated into the
otherwise regular feed-forward architecture, which is then trained
via backpropagation with different gradients than the normal one.
Winner-take-all is an interesting new form of nonlinearity, and it
forms groups of (typically two) neurons where all the neurons in a
group are made zero-valued except the one with the largest value.
Experiments show that the network does not forget as much as networks
with standard sigmoidal nonlinearity. This new type of nonlinear
units is yet to be evaluated in speech recognition tasks.

7.1.7 Better optimization and regularization

Another area where significant advances have been made recently in
applying deep learning to acoustic modeling for speech recognition is
in optimization criteria and methods, as well as in the related
regularization techniques to help prevent overfitting during the deep
network training.

One of the early studies on DNNs for speech recognition, conducted at
Microsoft Research and reported in [260], first recognizes the
mismatch between the desired error rate and the cross-entropy
training criterion in the conventional DNN training. The solution is
provided by replacing the frame-based, cross-entropy training
criterion with the full-sequence-based maximum mutual information
optimization objective, in a similar way to defining the training
objective for the shallow neural network interfaced with an HMM
[194]. Equivalently, this amounts to putting the model of conditional
random field (CRF) at the top of the DNN, replacing the original
softmax layer which naturally leads to cross entropy. (Note the DNN
was called the DBN in the paper.) This new sequential discriminative
learning technique is developed to jointly optimize the DNN weights,
CRF transition weights, and bi-phone language model. Importantly, the
speech task is defined in TIMIT, with the use of a simple
bi-phone-gram “language” model. The simplicity of the bi-gram
language model enables the full-sequence training to be carried out
without the need to use lattices, drastically reducing the training
complexity.

As another way to motivate the full-sequence training method of
[260], we note that the earlier DNN phone recognition experiments
made use of the standard frame-based objective function in static
pattern classification, cross-entropy, to optimize the DNN weights.
The transition parameters and language model scores were obtained
from an HMM and were trained independently of the DNN weights.
However, it has been known during the long history of the HMM
research that sequence classification criteria can be very helpful in
improving speech and phone recognition accuracy. This is because the
sequence classification criteria are more directly correlated with
the performance measure (e.g., the overall word or phone error rate)
than frame-level criteria. More specifically, the use of frame-level
cross entropy to train the DNN for phone sequence recognition does
not explicitly take into account the fact that the neighboring frames
have smaller distances between the assigned probability distributions
over phone class labels. To overcome this deficiency, one can
optimize the conditional probability of the
whole sequence of labels, given the whole visible feature utterance or
equivalently the hidden feature sequence extracted by DNN. To
optimize the log conditional probability on the training data, the
gradient can be taken over the activation parameters, transition
parameters and lower-layer weights, and then pursue back-propagation of the
error defined at the sentence level. We remark that in a much earlier
study [212], combining a neural network with a CRF-like structure was
done, where the mathematical formulation appears to include CRFs as
a special case. Also, the benefit of using the full-sequence classification
criteria was shown earlier on shallow neural networks in [194, 291].
In implementing the above full-sequence learning algorithm for the
DNN system as described in [260], the DNN weights are initialized
using the frame-level cross entropy as the objective. The transition
parameters are initialized from the combination of the HMM
transition matrices and the “bi-phone language” model scores, and are
then further optimized by tuning the transition features while fixing
the DNN weights before the joint optimization. Using joint
optimization with careful scheduling to reduce overfitting, it is shown that the
full-sequence training outperforms the DNN trained with frame-level
cross entropy by approximately 5% relative [260]. Without the effort
to reduce overfitting, it is found that the DNN trained with MMI is
much more prone to overfitting than that trained with frame-level cross
entropy. This is because the correlations across frames in speech tend
to be different among the training, development, and test data.
Importantly, such differences do not show when frame-based objective
functions are used for training.
For large vocabulary speech recognition where more complex
language models are in use, the optimization methods for full-sequence
training of the DNN–HMM are much more sophisticated. Kingsbury
et al. [195] reported the first success of such training using parallel,
second-order, Hessian-free optimization techniques, which are carefully
implemented for large vocabulary speech recognition. Sainath et al.
[305] improved and speeded up the Hessian-free techniques by
reducing the number of Krylov subspace solver iterations [378], which are
used for implicit estimation of the Hessian. They also use sampling
methods to decrease the amount of training data to speed up the
training. While the batch-mode, second-order Hessian-free techniques prove
successful for full-sequence training of large-scale DNN–HMM systems,
the success of the first-order stochastic gradient descent methods is also
reported recently [353]. It is found that heuristics are needed to handle
the problem of lattice sparseness. That is, the DNN must be adjusted to
the updated numerator lattices by additional iterations of frame-based
cross-entropy training. Further, artificial silence arcs need to be added
to the denominator lattices, or the maximum mutual information
objective function needs to be smoothed with the frame-based cross entropy
objective. The conclusion is that for large vocabulary speech
recognition tasks with sparse lattices, the implementation of the sequence
training requires much greater engineering skills than the small tasks
such as reported in [260], although the objective function as well as the
gradient derivation are essentially the same. Similar conclusions are
reached by Vesely et al. [374] when carrying out full-sequence training
of DNN–HMMs for large-vocabulary speech recognition. However,
different heuristics from [353] are shown to be effective in the training.
Separately, Wiesler et al. [390] investigated the Hessian-free
optimization method for training the DNN with the cross-entropy objective and
empirically analyzed the properties of the method. And finally, Dognin
and Goel [113] combined stochastic average gradient and Hessian-free
optimization for sequence training of deep neural networks with
success in that the training procedure converges in about half the time
compared with the full Hessian-free sequence training.
For large DNN–HMM systems with either frame-level or
sequence-level optimization objectives, speeding up the training is essential
to take advantage of large amounts of training data and of large
model sizes. In addition to the methods described above, Dean et al.
[69] reported the use of the asynchronous stochastic gradient descent
(ASGD) method, the adaptive gradient descent (Adagrad) method, and
the large-scale limited-memory BFGS (L-BFGS) method for very large
vocabulary speech recognition. Sainath et al. [312] provided a review
of a wide range of optimization methods for speeding up the training
of DNN-based systems for large speech recognition tasks.
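Of the first-order methods named above, Adagrad is compact enough to illustrate. The following is a minimal NumPy sketch of the standard per-coordinate Adagrad update, not the distributed implementation of [69]. The function names, the learning rate, the `eps` constant, and the toy quadratic objective are all assumptions made for illustration.

```python
import numpy as np

def adagrad_step(params, grads, accum, lr=0.1, eps=1e-8):
    """Apply one Adagrad update in place and return (params, accum)."""
    accum += grads ** 2                             # running sum of squared gradients
    params -= lr * grads / (np.sqrt(accum) + eps)   # per-coordinate step size
    return params, accum

def quadratic_loss(w, scale):
    """Toy objective (an assumption): 0.5 * sum(scale * w^2)."""
    return 0.5 * float(np.sum(scale * w ** 2))

def minimize_quadratic(steps=200):
    """Run Adagrad on a badly scaled quadratic and return (w, final loss)."""
    scale = np.array([100.0, 1.0])   # curvatures differ by a factor of 100
    w = np.array([1.0, 1.0])
    accum = np.zeros_like(w)
    for _ in range(steps):
        grads = scale * w            # gradient of the toy objective
        w, accum = adagrad_step(w, grads, accum)
    return w, quadratic_loss(w, scale)
```

Because each coordinate is divided by the root of its own accumulated squared gradient, the steep and the shallow direction of the toy objective make progress at a similar rate, which is the property that makes the method attractive for large, poorly conditioned models.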
In addition to the advances described above focusing on
optimization with the fully supervised learning paradigm, where all
training data contain the label information, the semi-supervised training
paradigm is also exploited for learning DNN–HMM systems for speech
recognition. Liao et al. [223] reported the exploration of using
semi-supervised training on the DNN–HMM system for the very challenging
task of recognizing YouTube speech. The main technique is based on the
use of “island of confidence” filtering heuristics to select useful training
segments. Separately, semi-supervised training of DNNs is explored by
Vesely et al. [374], where self-training strategies are used as the basis
for data selection using both the utterance-level and frame-level
confidences. Frame-selection based on per-frame confidences derived from
confusion in a lattice is found beneficial. Huang et al. [176] reported
another variant of semi-supervised training technique in which
multi-system combination and confidence recalibration are applied to select
the training data. Further, Thomas et al. [362] overcome the problem
of lacking sufficient training data for acoustic modeling in a number of
low-resource scenarios. They make use of transcribed multilingual data
and semi-supervised training to build the proposed feature front-ends
for subsequent speech recognition.
Finally, we see important progress in deep learning based speech
recognition in recent years with the introduction of new
regularization methods based on “dropout” originally proposed by Hinton et al.
[166]. Overfitting is very common in DNN training and co-adaptation is
prevalent within the DNN with multiple activations adapting together
to explain input acoustic data. Dropout is a technique to limit
co-adaptation. It operates as follows. On each training instance, each
hidden unit is randomly omitted with a fixed probability (e.g., p = 0.5).
Then, decoding is done normally except with straightforward scaling
of the DNN weights (by a factor of 1 − p). Alternatively, the scaling of
the DNN weights can be done during training [by a factor of 1/(1 − p)]
rather than in decoding. The benefits of dropout regularization for
training DNNs are to make a hidden unit in the DNN act strongly by
itself without relying on others, and to serve as a way to do model
averaging of different networks. These benefits are most pronounced when the
training data is limited, or when the DNN size is disproportionally large
with respect to the size of the training data. Dahl et al. [65] applied
dropout in conjunction with the ReLU units and to only the top few
layers of a fully-connected DNN. Seltzer and Yu [325] applied it to noise
robust speech recognition. Deng et al. [81], on the other hand, applied
dropout to all layers of a deep convolutional neural network, including
both the top fully connected DNN layers and the bottom locally
connected CNN layer and the pooling layer. It is found that the dropout
rate needs to be substantially smaller for the convolutional layer.
Subsequent work on applying dropout includes the study by Miao
and Metze [243], where DNN-based speech recognition is constrained
by low resources with sparse training data. Most recently, Sainath et al.
[306] combined dropout with a number of novel techniques described
in this section (including the use of deep CNNs, Hessian-free sequence
learning, the use of ReLU units, and the use of joint fMLLR and
filter-bank features, etc.) to obtain state of the art results on several large
vocabulary speech recognition tasks.
As a summary, the initial success of deep learning methods for
speech analysis and recognition reported around 2010 has come a long
way over the past three years. An explosive growth in the work and
publications on this topic has been observed, and huge excitement has
been ignited within the speech recognition community. We expect that
the growth in the research on deep learning based speech recognition
will continue, at least in the near future. It is also fair to say that the
continuing large-scale success of deep learning in speech recognition as
surveyed in this chapter (up to the ASRU-2013 time frame) is a key
stimulant to the large-scale exploration and applications of the deep
learning methods to other areas, which we will survey in Sections 8–11.

7.2 Speech synthesis

In addition to speech recognition, the impact of deep learning has
recently spread to speech synthesis, aimed to overcome the limitations
of the conventional approach in statistical parametric synthesis based
on Gaussian-HMM and decision-tree-based model clustering. The goal
of speech synthesis is to generate speech sounds directly from text and
possibly with additional information. The first set of papers appeared at
ICASSP, May 2013, where four different deep learning approaches are
reported to improve the traditional HMM-based statistical
parametric speech synthesis systems built based on “shallow” speech models,
which we briefly review here after providing appropriate background
information.
Statistical parametric speech synthesis emerged in the mid-1990s,
and is currently the dominant technology in speech synthesis. See a
recent overview in [364]. In this approach, the relationship between
texts and their acoustic realizations is modeled using a set of
stochastic generative acoustic models. Decision tree-clustered
context-dependent HMMs with a Gaussian distribution as the output of an
HMM state are the most popular generative acoustic model used. In
such HMM-based speech synthesis systems, acoustic features including
the spectra, excitation and segment durations of speech are modeled
simultaneously within a unified context-dependent HMM framework.
At the synthesis time, a text analysis module extracts a sequence of
contextual factors including phonetic, prosodic, linguistic, and
grammatical descriptions from an input text to be synthesized. Given
the sequence of contextual factors, a sentence-level context-dependent
HMM corresponding to the input text is composed, where its model
parameters are determined by traversing the decision trees. The
acoustic features are predicted so as to maximize their output
probabilities from the sentence HMM under the constraints between static and
dynamic features. Finally, the predicted acoustic features are sent to a
waveform synthesis module to reconstruct the speech waveforms. It
has been known for many years that the speech sounds generated
by this standard approach are often muffled compared with natural
speech. The inadequacy of acoustic modeling based on the
shallow-structured HMM is conjectured to be one of the reasons. Several very
recent studies have adopted deep learning approaches to overcome such
deficiency. One significant advantage of deep learning techniques is
their strong ability to represent the intrinsic correlation or mapping
relationship among the units of a high-dimensional stochastic vector
using a generative (e.g., the RBM and DBN discussed in Section 3.2)
or discriminative (e.g., the DNN discussed in Section 3.3) modeling
framework. The deep learning techniques are thus expected to help the
acoustic modeling aspect of speech synthesis in overcoming the
limitations of the conventional shallow modeling approach.
A series of studies have been carried out recently on ways of overcoming
the above limitations using deep learning methods, inspired partly by
the intrinsically hierarchical processes in human speech production
and the successful applications of a number of deep learning methods
in speech recognition as reviewed earlier in this chapter. In Ling
et al. [227, 229], the RBM and DBN as generative models are used
to replace the traditional Gaussian models, achieving significant
quality improvement, in both subjective and objective measures,
of the synthesized voice. In the approach developed in [190], the
DBN as a generative model is used to represent the joint distribution of
linguistic and acoustic features. Both the decision trees and Gaussian
models are replaced by the DBN. The method is very similar to that
used for generating digit images by the DBN, where the issue of
temporal sequence modeling specific to speech (non-issue for image)
is by-passed via the use of the relatively large, syllable-sized units in
speech synthesis. On the other hand, in contrast to the generative
deep models (RBMs and DBNs) exploited above, the study reported
in [435] makes use of the discriminative model of the DNN to represent
the conditional distribution of the acoustic features given the linguistic
features. Finally, in [115], the discriminative model of the DNN is used
as a feature extractor that summarizes high-level structure from the
raw acoustic features. Such DNN features are then used as the input
for the second stage for the prediction of prosodic contour targets
from contextual features in the full speech synthesis system.
The application of deep learning to speech synthesis is in its infancy,
and much more work is expected from that community in the near
future.

7.3 Audio and music processing

Similar to speech recognition but to a lesser extent, in the area of audio
and music processing, deep learning has also become of intense interest
but only quite recently. As an example, the first major event of deep
learning for speech recognition took place in 2009, followed by a series of
events including a comprehensive tutorial on the topic at ICASSP-2012
and with the special issue at IEEE Transactions on Audio, Speech, and
Language Processing, the premier publication for speech recognition,
in the same year. The first major event of deep learning for audio and
music processing appears to be the special session at ICASSP-2014,
titled Deep Learning for Music [14].
In the general field of audio and music processing, the areas
impacted by deep learning include mainly music signal processing and music
information retrieval [15, 22, 141, 177, 178, 179, 319]. Deep learning
presents a unique set of challenges in these areas. Music audio signals
are time series where events are organized in musical time, rather than
in real time, which changes as a function of rhythm and expression. The
measured signals typically combine multiple voices that are
synchronized in time and overlapping in frequency, mixing both short-term and
long-term temporal dependencies. The influencing factors include
musical tradition, style, composer and interpretation. The high complexity
and variety give rise to the signal representation problems well-suited
to the high levels of abstraction afforded by the perceptually and
biologically motivated processing techniques of deep learning.
In the early work on audio signals as reported by Lee et al. [215]
and their follow-up work, the convolutional structure is imposed on
the RBM while building up a DBN. Convolution is made in time by
sharing weights between hidden units in an attempt to detect the same
“invariant” feature over different times. Then a max-pooling operation
is performed where the maximal activations over small temporal
neighborhoods of hidden units are obtained, inducing some local temporal
invariance. The resulting convolutional DBN is applied to audio as well
as speech data for a number of tasks including music artist and genre
classification, speaker identification, speaker gender classification, and
phone classification, with promising results presented.
The RNN has also been recently applied to music processing
applications [22, 40, 41], where the use of ReLU hidden units instead of
logistic or tanh nonlinearities is explored in the RNN. As reviewed in
Section 7.2, ReLU units compute y = max(x, 0), and lead to sparser
gradients, less diffusion of credit and blame in the RNN, and faster
training. The RNN is applied to the task of automatic recognition of
chords from audio music, an active area of research in music information
retrieval. The motivation of using the RNN architecture is its power
in modeling dynamical systems. The RNN incorporates an internal
memory, or hidden state, represented by a self-connected hidden layer
of neurons. This property makes it well suited to model temporal
sequences, such as frames in a magnitude spectrogram or chord labels
in a harmonic progression. When well trained, the RNN is endowed
with the power to predict the output at the next time step given the
previous ones. Experimental results show that the RNN-based
automatic chord recognition system is competitive with existing
state-of-the-art approaches [275]. The RNN is capable of learning basic musical
properties such as temporal continuity, harmony and temporal
dynamics. It can also efficiently search for the most musically plausible chord
sequences when the audio signal is ambiguous, noisy or weakly
discriminative.
A recent review article by Humphrey et al. [179] provides a detailed
analysis on content-based music informatics, and in particular on why
the progress is decelerating throughout the field. The analysis
concludes that hand-crafted feature design is sub-optimal and
unsustainable, that the power of shallow architectures is fundamentally limited,
and that short-time analysis cannot encode musically meaningful
structure. These conclusions motivate the use of deep learning methods
aimed at automatic feature learning. By embracing feature learning, it
becomes possible to optimize a music retrieval system’s internal feature
representation or discover it directly, since deep architectures are
especially well-suited to characterize the hierarchical nature of music.
Finally, we review the very recent work by van den Oord, et al. [371]
on content-based music recommendation using deep learning methods.
Automatic music recommendation has become an increasingly
significant and useful technique in practice. Most recommender systems rely
on collaborative filtering, suffering from the cold start problem where
it fails when no usage data is available. Thus, collaborative filtering is
not effective for recommending new and unpopular songs. Deep learning
methods power the latent factor model for recommendation, which
predicts the latent factors from music audio when they cannot be obtained
from usage data. A traditional approach using a bag-of-words
representation of the audio signals is compared with deep CNNs with rigorous
evaluation made. The results show highly sensible recommendations
produced by the predicted latent factors using deep CNNs. The study
demonstrates that a combination of convolutional neural networks and
richer audio features leads to such promising results for content-based
music recommendation.
Like speech recognition and speech synthesis, much more work is
expected from the music and audio signal processing community in the
near future.

8 Selected Applications in Language Modeling and Natural Language Processing

Research in language, document, and text processing has seen
increasing popularity recently in the signal processing community,
and has been designated as one of the main focus areas by the IEEE
Signal Processing Society’s Speech and Language Processing Technical
Committee. Applications of deep learning to this area started with
language modeling (LM), where the goal is to provide a probability
to any arbitrary sequence of words or other linguistic symbols (e.g.,
letters, characters, phones, etc.). Natural language processing (NLP)
or computational linguistics also deals with sequences of words or
other linguistic symbols, but the tasks are much more diverse (e.g.,
translation, parsing, text classification, etc.), not focusing on providing
probabilities for linguistic symbols. The connection is that LM is
often an important and very useful component of NLP systems.
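As a worked form of the LM goal stated above, the probability of a word sequence can be written with the chain rule, which is a standard identity rather than something specific to this monograph. The symbols $w_1, \dots, w_T$ for the words and $T$ for the sequence length are notation assumed here for illustration.

```latex
P(w_1, \dots, w_T) \;=\; \prod_{t=1}^{T} P\left(w_t \mid w_1, \dots, w_{t-1}\right)
```

Each factor is a next-symbol prediction given the preceding symbols, which is the quantity an LM has to estimate.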
Applications to NLP are currently among the most active areas in
deep learning research, and deep learning is also considered a
promising direction by the NLP research community. However, the
intersection between the deep learning and NLP researchers is so far
not nearly as large as that for the application areas of speech or vision.
This is partly because the hard evidence for the superiority of deep
learning over the current state of the art NLP methods has not been
as strong as speech or visual object recognition.

8.1 Language modeling

Language models (LMs) are a crucial part of many successful
applications, such as speech recognition, text information retrieval, statistical
machine translation and other tasks of NLP. Traditional techniques for
estimating the parameters in LMs are based on N-gram counts. Despite
known weaknesses of N-grams and huge efforts of research communities
across many fields, N-grams remained the state-of-the-art until neural
network and deep learning based methods were shown to significantly
lower the perplexity of LMs, one common (but not ultimate) measure of
the LM quality, over several standard benchmark tasks [245, 247, 248].
Before we discuss neural network based LMs, we note the use of
hierarchical Bayesian priors in building up deep and recursive
structure for LMs [174]. Specifically, the Pitman-Yor process is exploited as the
Bayesian prior, from which a deep (four layers) probabilistic
generative model is built. It offers a principled approach to LM smoothing
by incorporating the power-law distribution for natural language. As
discussed in Section 3, this type of prior knowledge embedding is more
readily achievable in the generative probabilistic modeling setup than
in the discriminative neural network based setup. The reported results
on LM perplexity reduction are not nearly as strong as that achieved
by the neural network based LMs, which we discuss next.
There has been a long history [19, 26, 27, 433] of using (shallow)
feed-forward neural networks in LMs, called the NNLM. The use of
DNNs in the same way for LMs appeared more recently in [8]. An LM
is a function that captures the salient statistical characteristics of the
distribution of sequences of words in natural language. It allows one to
make probabilistic predictions of the next word given preceding ones.
An NNLM is one that exploits the neural network’s ability to learn
distributed representations in order to reduce the impact of the curse
of dimensionality. The original NNLM, with a feed-forward neural
network structure, works as follows: the input of the N-gram NNLM is
formed by using a fixed length history of N − 1 words. Each of the
previous N − 1 words is encoded using the very sparse 1-of-V coding,
where V is the size of the vocabulary. Then, this 1-of-V orthogonal
representation of words is projected linearly to a lower dimensional space,
using the projection matrix shared among words at different positions
in the history. This type of continuous-space, distributed representation
of words is called “word embedding,” very different from the common
symbolic or localist representation [26, 27]. After the projection layer,
a hidden layer with nonlinear activation function, which is either a
hyperbolic tangent or a logistic sigmoid, is used. An output layer of
the neural network then follows the hidden layer, with the number of
output units equal to the size of the full vocabulary. After the network
is trained, the output layer activations represent the “N-gram” LM’s
probability distribution.
The main advantage of NNLMs over the traditional counting-based
N-gram LMs is that history is no longer seen as an exact sequence of N − 1
words, but rather as a projection of the entire history into some lower
dimensional space. This leads to a reduction of the total number of
parameters in the model that have to be trained, resulting in automatic
clustering of similar histories. Compared with the class-based N-gram
LMs, the NNLMs are different in that they project all words into the
same low dimensional space, in which there can be many degrees of
similarity between words. On the other hand, NNLMs have much larger
computational complexity than N-gram LMs.
Let’s look at the strengths of the NNLMs again from the
viewpoint of distributed representations. A distributed representation of a
symbol is a vector of features which characterize the meaning of the
symbol. Each element in the vector participates in representing the
meaning. With an NNLM, one relies on the learning algorithm to
discover meaningful, continuous-valued features. The basic idea is to learn
to associate each word in the dictionary with a continuous-valued
vector representation, which in the literature is called a word embedding,
where each word corresponds to a point in a feature space. One can
imagine that each dimension of that space corresponds to a semantic
or grammatical characteristic of words. The hope is that functionally
similar words get to be closer to each other in that space, at least along
some directions. A sequence of words can thus be transformed into a
sequence of these learned feature vectors. The neural network learns to
map that sequence of feature vectors to the probability distribution over
the next word in the sequence. The distributed representation approach
to LMs has the advantage that it allows the model to generalize well to
sequences that are not in the set of training word sequences, but that
are similar in terms of their features, i.e., their distributed
representation. Because neural networks tend to map nearby inputs to nearby
outputs, the predictions corresponding to word sequences with similar
features are mapped to similar predictions.
The above ideas of NNLMs have been implemented in various
studies, some involving deep architectures. The idea of structuring
hierarchically the output of an NNLM in order to handle large
vocabularies was introduced in [18, 262]. In [252], the temporally
factored RBM was used for language modeling. Unlike the traditional
N-gram model, the factored RBM uses distributed representations
not only for context words but also for the words being predicted.
This approach is generalized to deeper structures as reported in [253].
Subsequent work on NNLM with “deep” architectures can be found in
[205, 207, 208, 245, 247, 248]. As an example, Le et al. [207] describe
an NNLM with structured output layer (SOUL–NNLM) where the
processing depth in the LM is focused in the neural network’s output
representation. Figure 8.1 illustrates the SOUL–NNLM architecture with
hierarchical structure in the output layers of the neural network, which
shares the same architecture with the conventional NNLM up to the
hidden layer. The hierarchical structure for the network’s output
vocabulary is in the form of a clustering tree, shown to the right of Figure 8.1,
where each word belongs to only one class and ends in a single leaf node
of the tree. As a result of the hierarchical structure, the SOUL–NNLM
enables the training of the NNLM with a full, very large vocabulary.
This gives advantages over the traditional NNLM which requires
short-lists of words in order to carry out the efficient computation in training.

Figure 8.1: The SOUL–NNLM architecture with hierarchical structure in the
output layers of the neural network [after [207], @IEEE].

As another example of neural-network-based LMs, the work described
in [247, 248] and [245] makes use of RNNs to build large scale language
models, called RNNLMs. The main difference between the feed-forward
and the recurrent architecture for LMs is the different ways of representing
the word history. For the feed-forward NNLM, the history is still just the
previous several words. But for the RNNLM, an effective representation
of history is learned from the data during training. The hidden layer of
the RNN represents all previous history and not just N − 1 previous words,
thus the model can theoretically represent long context patterns. A
further important advantage of the RNNLM over the feed-forward
counterpart is the possibility to represent more advanced patterns in the
word sequence. For example, patterns that rely on words that could
have occurred at variable positions in the history can be encoded much
more efficiently with the recurrent architecture. That is, the RNNLM
can simply remember some specific word in the state of the hidden
layer, while the feed-forward NNLM would need to use parameters for
each specific position of the word in the history.
The RNNLM is trained using the algorithm of back-propagation
through time; see details in [245], which provided Figure 8.2 to show
during training how the RNN unfolds as a deep feed-forward network
(with three time steps back in time).
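To make the unfolding concrete, the sketch below runs the forward pass of a small Elman-style recurrent LM over a word sequence and scores it by perplexity. It is an illustrative NumPy sketch under stated assumptions, not the implementation of [245]: the weight names `U`, `W`, `V`, the sigmoid hidden layer, and the softmax output are choices made here, and training by back-propagation through time is omitted.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def rnnlm_forward(word_ids, U, W, V):
    """Unroll the recurrence over word_ids.

    U: (hidden, vocab) input weights, W: (hidden, hidden) recurrent weights,
    V: (vocab, hidden) output weights. Returns one next-word distribution
    per input word, as an array of shape (len(word_ids), vocab).
    """
    hidden = np.zeros(W.shape[0])          # initial hidden state
    dists = []
    for w in word_ids:
        # U[:, w] equals U times the 1-of-V encoding of word w
        pre = U[:, w] + W @ hidden
        hidden = 1.0 / (1.0 + np.exp(-pre))   # sigmoid hidden layer
        dists.append(softmax(V @ hidden))     # distribution over the next word
    return np.array(dists)

def perplexity(word_ids, U, W, V):
    """Perplexity of the sequence under the model (first word is context only)."""
    dists = rnnlm_forward(word_ids[:-1], U, W, V)
    targets = word_ids[1:]
    log_probs = np.log(dists[np.arange(len(targets)), targets])
    return float(np.exp(-log_probs.mean()))
```

The same weights are applied at every position, so the loop is exactly a feed-forward network with one layer per time step and tied parameters; the hidden state carried between iterations is what lets the model summarize the whole history rather than a fixed window of N − 1 words.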
Figure 8.2: During the training of RNNLMs, the RNN unfolds into a deep
feed-forward network; based on Figure 3.2 of [245].

The training of the RNNLM achieves stability and fast convergence,
helped by capping the growing gradient in training RNNs. Adaptation
schemes for the RNNLM are also developed by sorting the training
data with respect to their relevance and by training the model during
processing of the test data. Empirical comparisons with other
state-of-the-art counting-based N-gram LMs show much better performance of
RNNLM in the perplexity measure, as reported in [247, 248] and [245].
A separate work on applying RNN to an LM with the unit
of characters instead of words can be found in [153, 357]. Many
interesting properties such as predicting long-term dependencies (e.g.,
matching opening and closing quotes in a paragraph) are demonstrated.
However, the usefulness of characters instead of words as units in
practical applications is not clear because the word is such a powerful
representation for natural language. Changing words to characters in
LMs may limit most practical application scenarios and the training
becomes more difficult. Word-level models currently remain superior.
In the most recent work, Mnih and Teh [255] and Mnih and
Kavukcuoglu [254] have developed a fast and simple training algorithm
for NNLMs. Despite their superior performance, NNLMs have been
used less widely than standard N-gram LMs due to the much longer
training time. The reported algorithm makes use of a method called
noise-contrastive estimation or NCE [139] to achieve much faster
training for NNLMs, with time complexity independent of the vocabulary
size; hence a flat instead of tree-structured output layer in the NNLM
is used. The idea behind NCE is to perform nonlinear logistic
regression to discriminate between the observed data and some artificially
generated noise. That is, to estimate parameters in a density model of
observed data, we can learn to discriminate between samples from the
data distribution and samples from a known noise distribution. As an
important special case, NCE is particularly attractive for unnormalized
distributions (i.e., free from partition functions in the denominator). In
order to apply NCE to train NNLMs efficiently, Mnih and Teh [255]
and Mnih and Kavukcuoglu [254] first formulate the learning problem
as one which takes the objective function as the distribution of the word
in terms of a scoring function. The NNLM then can be viewed as a way
to quantify the compatibility between the word history and a candidate
next word using the scoring function. The objective function for
training the NNLM thus becomes exponentiation of the scoring function,
normalized by the same constant over all possible words. Removing
the costly normalization factor, NCE is shown to speed up the NNLM
training over an order of magnitude.
A similar concept to NCE is used in the recent work of [250], which
is called negative sampling. This is applied to a simplified version of
an NNLM, for the purpose of constructing word embedding instead of computing probabilities of word sequences. Word embedding is an important concept for NLP applications, which we discuss next.

8.2 Natural language processing

Machine learning has been a dominant tool in NLP for many years. However, the use of machine learning in NLP has been mostly limited to numerical optimization of weights for human designed representations and features from the text data. The goal of deep or representation learning is to automatically develop features or representations from the raw text material appropriate for a wide range of NLP tasks.

Recently, neural network based deep learning methods have been shown to perform well on various NLP tasks such as language modeling, machine translation, part-of-speech tagging, named entity recognition, sentiment analysis, and paraphrase detection. The most attractive aspect of deep learning methods is their ability to perform these tasks without external hand-designed resources or time-intensive feature engineering. To this end, deep learning develops and makes use of an important concept called “embedding,” which refers to the representation of symbolic information in natural language text at word-level, phrase-level, and even sentence-level in terms of continuous-valued vectors.

The early work highlighting the importance of word embedding came from [62], [367], and [63], although the original form came from [26] as a side product of language modeling. Raw symbolic word representations are transformed from the sparse vectors via 1-of-V coding with a very high dimension (i.e., the vocabulary size V or its square or even its cube) into low-dimensional, real-valued vectors via a neural network and then used for processing by subsequent neural network layers. The key advantage of using the continuous space to represent words (or phrases) is its distributed nature, which enables sharing or grouping the representations of words with a similar meaning. Such sharing is not possible in the original symbolic space, constructed by 1-of-V coding with a very high dimension, for representing words. Unsupervised learning is used where “context” of the word is used as the learning signal in neural networks. Excellent tutorials were recently given by Socher et al. [338, 340] to explain how the neural network is trained to perform word embedding. More recent work proposes new ways of learning word embeddings that better capture the semantics of words by incorporating both local and global document contexts and better account for homonymy and polysemy by learning multiple embeddings per word [169]. Also, there is strong evidence that the use of RNNs can also provide empirically good performance in learning word embeddings [245]. While the use of NNLMs, whose aim is to predict the future words in context, also induces word embeddings as its by-product, much simpler ways of achieving the embeddings are possible without the need to do word prediction. As shown by Collobert and Weston [62], the neural networks used for creating word embeddings need much smaller output units than the huge size typically required for NNLMs.

In the same early paper on word embedding, Collobert and Weston [62] developed and employed a convolutional network as the common model to simultaneously solve a number of classic problems including part-of-speech tagging, chunking, named entity tagging, semantic role identification, and similar word identification. More recent work reported in [61] further developed a fast, purely discriminative approach for parsing based on the deep recurrent convolutional architecture. Collobert et al. [63] provide a comprehensive review on ways of applying unified neural network architectures and related deep learning algorithms to solve NLP problems from “scratch,” meaning that no traditional NLP methods are used to extract features. The theme of this line of work is to avoid task-specific, “man-made” feature engineering while providing versatility and unified features constructed automatically from deep learning applicable to all natural language processing tasks. The systems described in [63] automatically learn internal representations or word embedding from vast amounts of mostly unlabeled training data while performing a wide range of NLP tasks.

The recent work by Mikolov et al. [246] derives word embeddings by simplifying the NNLM described in Section 8.1. It is found that the NNLM can be successfully trained in two steps. First, continuous word vectors are learned using a simple model which eliminates the
nonlinearity in the upper neural network layer and shares the projection layer for all words. And second, the N-gram NNLM is trained on top of the word vectors. So, after removing the second step in the NNLM, the simple model is used to learn word embeddings, where the simplicity allows the use of very large amounts of data. This gives rise to a word embedding model called Continuous Bag-of-Words Model (CBOW), as shown in Figure 8.3a. Further, since the goal is no longer computing probabilities of word sequences as in LMs, the word embedding system here is made more effective by not only predicting the current word based on the context but also performing the inverse prediction, known as the “Skip-gram” model, as shown in Figure 8.3b. In the follow-up work [250] by the same authors, this word embedding system including the Skip-gram model is extended by a much faster learning method called negative sampling, similar to NCE discussed in Section 8.1.

Figure 8.3: The CBOW architecture (a) on the left, and the Skip-gram architecture (b) on the right. [after [246], @ICLR].

In parallel with the above development, Mnih and Kavukcuoglu [254] demonstrate that NCE training of lightweight word embedding models is a highly efficient way of learning high-quality word representations, much like the somewhat earlier lightweight LMs developed by Mnih and Teh [255] described in Section 8.1. Consequently, results that used to require very considerable hardware and software infrastructure can now be obtained on a single desktop with minimal programming effort and using less time and data. This most recent work also shows that for representation learning, only five noise samples in NCE can be sufficient for obtaining strong results for word embedding, far fewer than required for LMs. The authors also used an “inversed language model” for computing word embeddings, similar to the way in which the Skip-gram model is used in [250].

Figure 8.4: The extended word-embedding model using a recursive neural network that takes into account not only local context but also global context. The global context is extracted from the document and put in the form of a global semantic vector, as part of the input into the original word-embedding model with local context. Taken from Figure 1 of [169]. [after [169], @ACL].

Huang et al. [169] recognized the limitation of the earlier work on word embeddings in that these models were built with only local context and one representation per word. They extended the local context models to one that can incorporate global context from full sentences or the entire document. This extended model accounts for homonymy and polysemy by learning multiple embeddings for each word. An illustration of this model is shown in Figure 8.4. In the earlier work by the same research group [344], a recursive neural network with local context was developed to build a deep architecture. The network, despite missing global context, was already shown to be capable of successful
merging of natural language words based on the learned semantic transformations of their original features. This deep learning approach provided excellent performance on natural language parsing. The same approach was also demonstrated to be reasonably successful in parsing natural scene images. In related studies, a similar recursive deep architecture is used for paraphrase detection [346], and for predicting sentiment distributions from text [345].

We now turn to selected applications of deep learning methods including the use of neural network architectures and word embeddings to practically useful NLP tasks. Machine translation is one of such tasks, pursued by NLP researchers for many years based typically on shallow statistical models. The work described in [320] is perhaps the first comprehensive report on the successful application of neural-network-based language models with word embeddings, trained on a GPU, for large machine translation tasks. They address the problem of high computation complexity, and provide a solution that allows training on 500 million words in 20 hours. Strong results are reported, with perplexity down from 71 to 60 in LMs and the corresponding BLEU score improved by 1.8 points using the neural-network-based language models with word embeddings compared with the best back-off LM.

A more recent study on applying deep learning methods to machine translation appears in [121, 123], where the phrase-translation component, rather than the LM component in the machine translation system, is replaced by the neural network models with semantic word embeddings. As shown in Figure 8.5 for the architecture of this approach, a pair of source (denoted by f) and target (denoted by e) phrases are projected into continuous-valued vector representations in a low-dimensional latent semantic space (denoted by the two y vectors). Then their translation score is computed by the distance between the pair in this new space. The projection is performed by two deep neural networks (not shown here) whose weights are learned on parallel training data. The learning aims to directly optimize the quality of end-to-end machine translation results. Experimental evaluation has been performed on two standard Europarl translation tasks used by the NLP community, English–French and German–English. The results show that the new semantic-based phrase translation model significantly improves the performance of a state-of-the-art phrase-based statistical machine translation system, leading to a gain close to 1.0 BLEU point.

Figure 8.5: Illustration of the basic approach reported in [122] for machine translation. Parallel pairs of source (denoted by f) and target (denoted by e) phrases are projected into continuous-valued vector representations (denoted by the two y vectors), and their translation score is computed by the distance between the pair in this continuous space. The projection is performed by deep neural networks (denoted by the two arrows) whose weights are learned on parallel training data. [after [121], @NIPS].

A related approach to machine translation was developed by Schwenk [320]. The estimation of the translation model probabilities of a phrase-based machine translation system is carried out using neural networks. The translation probability of phrase pairs is learned using continuous-space representations induced by neural networks. A simplification is made that decomposes the translation probability of a phrase or a sentence to a product of n-gram probabilities as in a standard n-gram language model. No joint representations of a phrase in the source language and the translated version in the target language are exploited as in the approach reported by Gao et al. [122, 123].

Yet another deep learning approach to machine translation appeared in [249]. As in other approaches, a corpus of words in one language is compared with the same corpus of words translated into another, and words and phrases in such bilingual data that share similar statistical properties are considered equivalent. A new technique is
proposed that automatically generates dictionaries and phrase tables that convert one language into another. It does not rely on versions of the same document in different languages. Instead, it uses data mining techniques to model the structure of a source language and then compares it to the structure of the target language. The technique is shown to translate missing word and phrase entries by learning language structures based on large monolingual data and mapping between languages from small bilingual data. It is based on vector-valued word embeddings as discussed earlier in this chapter and it learns a linear mapping between vector spaces of source and target languages.

An earlier study on applying deep learning techniques with DBNs was provided in [111] to attack a machine transliteration problem, a much easier task than machine translation. This type of deep architectures and learning may be generalized to the more difficult machine translation problem but no follow-up work has been reported. As another early NLP application, Sarikaya et al. [318] applied DNNs (called DBNs in the paper) to perform a natural language call-routing task. The DNNs use unsupervised learning to discover multiple layers of features that are then used to optimize discrimination. Unsupervised feature discovery is found to make DBNs far less prone to overfitting than the neural networks initialized with random weights. Unsupervised learning also makes it easier to train neural networks with many hidden layers. DBNs are found to produce better classification results than several other widely used learning techniques, e.g., maximum entropy and boosting based classifiers.

One of the most interesting NLP tasks recently tackled by deep learning methods is that of knowledge base (ontology) completion, which is instrumental in question-answering and many other NLP applications. An early work in this space came from [37], where a process is introduced to automatically learn structured distributed embeddings of knowledge bases. The proposed representations in the continuous-valued vector space are compact and can be efficiently learned from large-scale data of entities and relations. A specialized neural network architecture, a generalization of the “Siamese” network, is used. In the follow-up work that focuses on multi-relational data [36], the semantic matching energy model is proposed to learn vector representations for both entities and relations. More recent work [340] adopts an alternative approach, based on the use of neural tensor networks, to attack the problem of reasoning over a large joint knowledge graph for relation classification. The knowledge graph is represented as triples of a relation between two entities, and the authors aim to develop a neural network model suitable for inference over such relationships. The model they presented is a neural tensor network, with one layer only. The network is used to represent entities as fixed-dimensional vectors, which are created separately by averaging pre-trained word embedding vectors. It then learns the tensor with the newly added relationship element that describes the interactions among all the latent components in each of the relationships. The neural tensor network can be visualized in Figure 8.6, where each dashed box denotes one of the two slices of the tensor. Experimentally, the paper [340] shows that this tensor model can effectively classify unseen relationships in WordNet and FreeBase.

Figure 8.6: Illustration of the neural tensor network described in [340], with two relationships shown as two slices in the tensor. The tensor is denoted by $W^{[1:2]}$. The network contains a bilinear tensor layer that directly relates the two entity vectors (shown as $e_1$ and $e_2$) across three dimensions. Each dashed box denotes one of the two slices of the tensor. [after [340], @NIPS].

As the final example of deep learning applied successfully to NLP, we discuss here sentiment analysis applications based on recursive deep
models published recently by Socher et al. [347]. Sentiment analysis is the task of algorithmically estimating the positive or negative opinion expressed in input text. As we discussed earlier in this chapter, word embeddings in the semantic space achieved by neural network models have been very useful but it is difficult for them to express the meaning of longer phrases in a principled way. For sentiment analysis with the input data from typically many words and phrases, the embedding model requires compositionality properties. To this end, Socher et al. [347] developed the recursive neural tensor network, where each layer is constructed similarly to that of the neural tensor network described in [340] with an illustration shown in Figure 8.6. The recursive construction of the full network exhibiting properties of compositionality follows that of [344] for the regular, non-tensor network. When trained on a carefully constructed sentiment analysis database, the recursive neural tensor network is shown to outperform all previous methods on several metrics. The new model pushes the state of the art in single sentence positive/negative classification accuracy from 80% up to 85.4%. The accuracy of predicting fine-grained sentiment labels for all phrases reaches 80.7%, an improvement of 9.7% over bag-of-features baselines.

9 Selected Applications in Information Retrieval

9.1 A brief introduction to information retrieval

Information retrieval (IR) is a process whereby a user enters a query into the automated computer system that contains a collection of many documents with the goal of obtaining a set of most relevant documents. Queries are formal statements of information needs, such as search strings in web search engines. In IR, a query does not uniquely identify a single document in the collection. Instead, several documents may match the query with different degrees of relevancy.

A document, sometimes called an object as a more general term which may include not only a text document but also an image, audio (music or speech), or video, is an entity that contains information and is represented as an entry in a database. In this section, we limit the “object” to only text documents. User queries in IR are matched against the documents’ representation stored in the database. Documents themselves often are not kept or stored directly in the IR system. Rather, they are represented in the system by metadata. Typical IR systems compute a numeric score on how well each document in the database matches the query, and rank the objects according to this value. The top-ranking documents from the system are then shown to

the user. The process may then be iterated if the user wishes to refine the query.

Based partly on [236], common IR methods consist of several categories:

• Boolean retrieval, where a document either matches a query or does not.
• Algebraic approaches to retrieval, where models are used to represent documents and queries as vectors, matrices, or tuples. The similarity of the query vector and document vector is represented as a scalar value. This value can be used to produce a list of documents that are rank-ordered for a query. Common models and methods include vector space model, topic-based vector space model, extended Boolean model, and latent semantic analysis.
• Probabilistic approaches to retrieval, where the process of IR is treated as a probabilistic inference. Similarities are computed as probabilities that a document is relevant for a given query, and the probability value is then used as the score in ranking documents. Common models and methods include binary independence model, probabilistic relevance model with the BM25 relevance function, methods of inference with uncertainty (http://en.wikipedia.org/wiki/Uncertain_inference), probabilistic language modeling, and the technique of latent Dirichlet allocation.
• Feature-based approaches to retrieval, where documents are viewed as vectors of values of feature functions. Principled methods of “learning to rank” are devised to combine these features into a single relevance score. Feature functions are arbitrary functions of document and query, and as such feature-based approaches can easily incorporate almost any other retrieval model as just yet another feature.

Deep learning applications to IR are rather recent. The approaches in the literature so far belong mostly to the category of feature-based approaches. The use of deep networks is mainly for extracting semantically meaningful features for subsequent document ranking stages. We will review selected studies in the recent literature in the remainder of this section.

9.2 Semantic hashing with deep autoencoders for document indexing and retrieval

Here we discuss the “semantic hashing” approach for the application of deep autoencoders to document indexing and retrieval as published in [159, 314]. It is shown that the hidden variables in the final layer of a DBN not only are easy to infer after using an approximation based on feed-forward propagation, but they also give a better representation of each document, based on the word-count features, than the widely used latent semantic analysis and the traditional TF-IDF approach for information retrieval. Using the compact code produced by deep autoencoders, documents are mapped to memory addresses in such a way that semantically similar text documents are located at nearby addresses to facilitate rapid document retrieval. The mapping from a word-count vector to its compact code is highly efficient, requiring only a matrix multiplication and a subsequent sigmoid function evaluation for each hidden layer in the encoder part of the network.

A deep generative model of DBN is exploited for the above purpose as discussed in [165]. Briefly, the lowest layer of the DBN represents the word-count vector of a document and the top layer represents a learned binary code for that document. The top two layers of the DBN form an undirected associative memory and the remaining layers form a Bayesian (also called belief) network with directed, top-down connections. This DBN, composed of a set of stacked RBMs as we reviewed in Section 5, produces a feed-forward “encoder” network that converts word-count vectors to compact codes. By composing the RBMs in the opposite order, a “decoder” network is constructed that maps compact code vectors into reconstructed word-count vectors. Combining the encoder and decoder, one obtains a deep autoencoder (subject to further fine-tuning as discussed in Section 4) for document coding and subsequent retrieval.

After the deep model is trained, the retrieval process starts with mapping each query into a 128-bit binary code by performing a forward
pass through the model with thresholding. Then the Hamming distances between the query binary code and all the documents’ 128-bit binary codes, especially those of the “neighboring” documents defined in the semantic space, are computed extremely efficiently. The efficiency is accomplished by looking up the neighboring bit vectors in the hash table. The same idea as discussed here for coding text documents for information retrieval has been explored for audio document retrieval and speech feature coding problems with some initial exploration reported in [100], discussed in Section 4 in detail.

9.3 Deep-structured semantic modeling (DSSM) for document retrieval

Here we discuss the more advanced and recent approach to large-scale document retrieval (Web search) based on a specialized deep architecture, called deep-structured semantic model or deep semantic similarity model (DSSM), as published in [172], and its convolutional version (C-DSSM), as published in [328].

Modern search engines retrieve Web documents mainly by matching keywords in documents with those in a search query. However, lexical matching can be inaccurate due to the fact that a concept is often expressed using different vocabularies and language styles in documents and queries. Latent semantic models are able to map a query to its relevant documents at the semantic level where lexical matching often fails [236]. These models address the language discrepancy between Web documents and search queries by grouping different terms that occur in a similar context into the same semantic cluster. Thus, a query and a document, represented as two vectors in the lower-dimensional semantic space, can still have a high similarity even if they do not share any term. Probabilistic topic models such as probabilistic latent semantic models and latent Dirichlet allocation models have been proposed for semantic matching to partially overcome such difficulties. However, the improvement on IR tasks has not been as significant as originally expected because of two main factors: (1) most state-of-the-art latent semantic models are based on linear projection, and thus are inadequate in capturing effectively the complex semantic properties of documents; and (2) these models are often trained in an unsupervised manner using an objective function that is only loosely coupled with the evaluation metric for the retrieval task. In order to improve semantic matching for IR, two lines of research have been conducted to extend the above latent semantic models. The first is the semantic hashing approach reviewed in Section 9.2 above, based on the use of deep autoencoders [165, 314]. While the hierarchical semantic structure embedded in the query and the document can be extracted via deep learning, the deep learning approach used for their models still adopts an unsupervised learning method where the model parameters are optimized for the reconstruction of the documents rather than for differentiating the relevant documents from the irrelevant ones for a given query. As a result, the deep neural network models do not significantly outperform strong baseline IR models that are based on lexical matching. In the second line of research, click-through data, which consists of a list of queries and the corresponding clicked documents, is exploited for semantic modeling so as to bridge the language discrepancy between search queries and Web documents in recent studies [120, 124]. These models are trained on click-through data using objectives tailored to the document ranking task. However, these click-through-based models are still linear, suffering from the issue of expressiveness. As a result, these models need to be combined with the keyword matching models (such as BM25) in order to obtain a significantly better performance than baselines.

The DSSM approach reported in [172] aims to combine the strengths of the above two lines of work while overcoming their weaknesses. It uses the DNN architecture to capture complex semantic properties of the query and the document, and to rank a set of documents for a given query. Briefly, a nonlinear projection is performed first to map the query and the documents to a common semantic space. Then, the relevance of each document given the query is calculated as the cosine similarity between their vectors in that semantic space. The DNNs are trained using the click-through data such that the conditional likelihood of the clicked document given the query is maximized. Different from the previous latent semantic models that are learned in an unsupervised fashion, the DSSM is optimized directly for Web
document ranking, and thus gives superior performance. Furthermore, to deal with large vocabularies in Web search applications, a new word hashing method is developed, through which the high-dimensional term vectors of queries or documents are projected to low-dimensional letter based n-gram vectors with little information loss.

Figure 9.1: The DNN component of the DSSM architecture for computing semantic features. The DNN uses multiple layers to map high-dimensional sparse text features, for both Queries and Documents, into low-dimensional dense features in a semantic space. [after [172], @CIKM].

Figure 9.1 illustrates the DNN part in the DSSM architecture. The DNN is used to map high-dimensional sparse text features into low-dimensional dense features in a semantic space. The first hidden layer, with 30k units, accomplishes word hashing. The word-hashed features are then projected through multiple layers of non-linear projections. The final layer’s neural activities in this DNN form the feature in the semantic space.

To show the computational steps in the various layers of the DNN in Figure 9.1, denoting $x$ as the input term vector, $y$ as the output vector, $l_i$, $i = 1, \ldots, N-1$, as the intermediate hidden layers, $W_i$ as the $i$th projection matrix, and $b_i$ as the $i$th bias vector, we have

\[ l_1 = W_1 x, \]
\[ l_i = f(W_i l_{i-1} + b_i), \quad i = 2, \ldots, N-1, \]
\[ y = f(W_N l_{N-1} + b_N), \]

where the tanh function is used at the output layer and the hidden layers $l_i$, $i = 2, \ldots, N-1$:

\[ f(x) = \frac{1 - e^{-2x}}{1 + e^{-2x}}. \]

The semantic relevance score between a query $Q$ and a document $D$ can then be computed as the cosine similarity

\[ R(Q, D) = \mathrm{cosine}(y_Q, y_D) = \frac{y_Q^{T} y_D}{\|y_Q\| \, \|y_D\|}, \]

where $y_Q$ and $y_D$ are the concept vectors of the query and the document, respectively. In Web search, given the query, the documents can be sorted by their semantic relevance scores.

Learning of the DNN weights $W_i$ and $b_i$ shown in Figure 9.1 is an important contribution of the study of [172]. Compared with the DNNs used in speech recognition where the targets or labels of the training data are readily available, the DNN in the DSSM does not have such label information well defined. That is, rather than using the common cross entropy or mean square errors as the training objective function, IR-centric loss functions need to be developed in order to train the DNN weights in the DSSM using the available data such as click-through logs.

The click-through logs consist of a list of queries and their clicked documents. A query is typically more relevant to the documents that are clicked on than those that are not. This weak supervision information can be exploited to train the DSSM. More specifically, the weight matrices in the DSSM, $W_i$, are learned to maximize the posterior probability of the clicked documents given the queries

\[ P(D \mid Q) = \frac{\exp(\gamma R(Q, D))}{\sum_{D' \in \mathbf{D}} \exp(\gamma R(Q, D'))} \]

defined on the semantic relevance score $R(Q, D)$ between the Query ($Q$) and the Document ($D$), where $\gamma$ is a smoothing factor set empirically on a held-out data set, and $\mathbf{D}$ denotes the set of candidate documents to be ranked. Ideally, $\mathbf{D}$ should contain all possible documents, as in the maximum mutual information training for speech recognition where all possible negative candidates may be considered [147]. However in

this case the candidate set $\mathbf{D}$ is of Web scale and thus is intractable
in practice. In the implementation of DSSM learning described in [172],
a subset of the negative candidates is used, following the common
practice adopted in MCE (Minimum Classification Error) training in
speech recognition [52, 118, 417, 418]. In other words, for each query and
clicked-document pair, denoted by $(Q, D^+)$ where Q is a query and $D^+$
is the clicked document, the set $\mathbf{D}$ is approximated by including
$D^+$ and only four randomly selected unclicked documents, denoted by
$\{D_j^-;\ j = 1, \ldots, 4\}$. In the study reported in [172], no significant
difference was found when different sampling strategies were used to
select the unclicked documents.
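The layer-wise projection, the cosine relevance score R(Q, D), and the posterior restricted to one clicked and four unclicked documents can be illustrated with a small sketch. This is not the implementation of [172]: the choice f = tanh, the toy layer sizes, the random weights, the value of gamma, and all function names are assumptions made only for illustration.

```python
import math
import random


def linear(weights, vec):
    """Matrix-vector product W v, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def dssm_forward(x, first_weights, upper_layers, f=math.tanh):
    """l_1 = W_1 x, then l_i = f(W_i l_{i-1} + b_i); the last output is y.

    f = tanh is an assumption of this sketch, not taken from the text.
    """
    h = linear(first_weights, x)
    for weights, bias in upper_layers:
        h = [f(a + b) for a, b in zip(linear(weights, h), bias)]
    return h


def cosine(a, b):
    """R(Q, D): cosine of the angle between two concept vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


def posterior_clicked(y_q, y_pos, y_negs, gamma):
    """P(D+ | Q): softmax of gamma * R(Q, D) over D+ and the sampled negatives."""
    scores = [gamma * cosine(y_q, y_pos)]
    scores += [gamma * cosine(y_q, y_neg) for y_neg in y_negs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    return exps[0] / sum(exps)


# Toy demo with hypothetical sizes: 8-dim input, layers of 6, 4 and 3 units.
rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


W1 = rand_matrix(6, 8)
upper = [(rand_matrix(4, 6), [0.1] * 4), (rand_matrix(3, 4), [0.1] * 3)]


def embed(x):
    """The same weights are used for the query and for every document."""
    return dssm_forward(x, W1, upper)


query = [1, 0, 2, 0, 0, 1, 0, 0]
clicked = [1, 0, 1, 0, 0, 1, 0, 1]
unclicked = [[0, 3, 0, 1, 0, 0, 2, 0], [0, 0, 0, 2, 1, 0, 0, 0],
             [2, 1, 0, 0, 0, 0, 0, 3], [0, 0, 1, 0, 4, 0, 1, 0]]

p = posterior_clicked(embed(query), embed(clicked),
                      [embed(d) for d in unclicked], gamma=10.0)
print(f"P(D+ | Q) = {p:.4f}, log P(D+ | Q) = {math.log(p):.4f}")
```

With untrained random weights the printed posterior carries no meaning; the sketch only shows how the quantities fit together. Sharing one `embed` function between the query and all documents mirrors the weight tying across the DNNs.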
With the above simplification the DSSM parameters are estimated
to maximize the approximate likelihood of the clicked documents given
the queries across the training set

\[
L(\Lambda) = \log \prod_{(Q, D^+, D_j^-)} P(D^+ \mid Q),
\]

where $\Lambda$ denotes the parameter set of the DNN weights $\{W_i\}$ in the
DSSM. In Figure 9.2, we show the overall DSSM architecture that
contains several DNNs. All these DNNs share the same weights but take
different documents (one positive and several negatives) as inputs when
training the DSSM parameters. Details of the gradient computation
of this approximate loss function with respect to the DNN weights
tied across documents and queries can be found in [172] and are not
elaborated here.

Most recently, the DSSM described above has been extended to its
convolutional version, or C-DSSM [328]. In the C-DSSM, semantically
similar words within context are projected to vectors that are close
to each other in the contextual feature space through a convolutional
structure. The overall semantic meaning of a sentence is found to be
determined by a few key words in the sentence, and thus the C-DSSM
uses an additional max pooling layer to extract the most salient local
features to form a fixed-length global feature vector. The global feature
vector is then fed to the remaining nonlinear DNN layer(s) to map it
to a point in the shared semantic space.

Figure 9.2: Architectural illustration of the DSSM for document retrieval (from
[170, 171]). All DNNs shown have shared weights. A set of n documents is shown
here to illustrate the random negative sampling discussed in the text for simplifying
the training procedure for the DSSM. [after [172], @CIKM].

The convolutional neural network component of the C-DSSM is
shown in Figure 9.3, where a window size of three is illustrated for
the convolutional layer. The overall C-DSSM architecture is similar
to the DSSM architecture shown in Figure 9.2 except that the
fully-connected DNNs are replaced by the convolutional neural networks
with locally-connected tied weights and additional max-pooling layers.
The model component shown in Figure 9.3 contains (1) a word hashing
layer to transform words into letter-tri-gram count vectors in the same
way as the DSSM; (2) a convolutional layer to extract local contextual
features for each context window; (3) a max-pooling layer to extract
and combine salient local contextual features to form a global feature
vector; and (4) a semantic layer to represent the high-level semantic
information of the input word sequence.

The main motivation for using the convolutional structure in the
C-DSSM is its ability to map a variable-length word sequence to a
low-dimensional vector in a latent semantic space. Unlike most previous
models that treat a query or a document as a bag of words, a query
or a document in the C-DSSM is viewed as a sequence of words with
contextual structures. By using the convolutional structure, local
contextual information at the word n-gram level is modeled first. Then,

salient local features in a word sequence are combined to form a global
feature vector. Finally, the high-level semantic information of the word
sequence is extracted to form a global vector representation. Like the
DSSM just described, the C-DSSM is also trained on click-through data
by maximizing the conditional likelihood of the clicked documents given
a query using the back-propagation algorithm.

Figure 9.3: The convolutional neural network component of the C-DSSM, with a
window size of three illustrated for the convolutional layer. [after [328], @WWW].

9.4 Use of deep stacking networks for information retrieval

In parallel with the IR studies reviewed above, the deep stacking
network (DSN) discussed in Section 6 has also been explored recently
for IR with insightful results [88]. The experimental results suggest
that the classification error rate using the binary decision of “relevant”
versus “non-relevant” from the DSN, which is closely correlated with
the DSN training objective, is also generally correlated well with the
NDCG (normalized discounted cumulative gain) as the most common
IR quality measure. The exception is found in the region of high IR
quality.

As described in Section 6, the simplicity of the DSN’s training
objective, the mean square error (MSE), drastically facilitates its
successful applications to image recognition, speech recognition, and speech
understanding. The MSE objective and classification error rate have
been shown to be well correlated in these speech or image applications.
For information retrieval (IR) applications, however, the inconsistency
between the MSE objective and the desired objective (e.g., NDCG)
is much greater than that for the above classification-focused
applications. For example, the NDCG as a desirable IR objective function is
a highly non-smooth function of the parameters to be learned, with a
very different nature from the nonlinear relationship between MSE and
classification error rate. Thus, it is of interest to understand to what
extent the NDCG is reasonably well correlated with classification rate
or MSE where the relevance level in IR is used as the DSN prediction
target. Further, can the advantage of learning simplicity in the DSN
be applied to improve IR quality measures such as the NDCG? Our
experimental results presented in [88] provide largely positive answers
to both of the above questions. In addition, the special care that needs to
be taken in implementing DSN learning algorithms when moving from
classification to IR applications is addressed.

The IR task in the experiments of [88] is the sponsored search
related to ad placement. In addition to the organic web search
results, commercial search engines also provide supplementary
sponsored results in response to the user’s query. The sponsored search
results are selected from a database pooled by advertisers who bid to
have their ads displayed on the search result pages. Given an input
query, the search engine will retrieve relevant ads from the database,
rank them, and display them at the proper place on the search result
page; e.g., at the top or right hand side of the web search results.
Finding relevant ads to a query is quite similar to common web search. For
instance, although the documents come from a constrained database,
the task resembles typical search ranking that aims at predicting
document relevance to the input query. The experiments conducted for
this task are the first with the use of deep learning techniques (based
on the DSN architecture) on the ad-related IR problem. The
preliminary results from the experiments are the close correlation between the
MSE as the DSN training objective with the NDCG as the IR quality
measure over a wide NDCG range.

10
Selected Applications in Object Recognition and Computer Vision

Over the past two years or so, tremendous progress has been made in
applying deep learning techniques to computer vision, especially in the
field of object recognition. The success of deep learning in this area
is now commonly accepted by the computer vision community. It is
the second area in which the application of deep learning techniques
is successful, following the speech recognition area as we reviewed and
analyzed in Sections 2 and 7.
Excellent surveys on the recent progress of deep learning for
computer vision are available in the NIPS-2013 tutorial
(https://nips.cc/Conferences/2013/Program/event.php?ID=4170 with video
recording at http://research.microsoft.com/apps/video/default.aspx?id=206976&l=i)
and slides at http://cs.nyu.edu/~fergus/presentations/nips2013_final.pdf,
and also in the CVPR-2012 tutorial
(http://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12). The reviews provided
in this section below are based partly on these tutorials, in connection
with the earlier deep learning material in this monograph. Another
excellent source which this section draws from is the most recent Ph.D.
thesis on the topic of deep learning for computer vision [434].


Over many years, object recognition in computer vision has relied
on hand-designed features such as SIFT (scale invariant
feature transform) and HOG (histogram of oriented gradients), akin to
the reliance of speech recognition on hand-designed features such as
MFCC and PLP. However, features like SIFT and HOG only capture
low-level edge information. The design of features to effectively capture
mid-level information such as edge intersections or high-level
representation such as object parts becomes much more difficult. Deep learning
aims to overcome such challenges by automatically learning hierarchies
of visual features in both unsupervised and supervised manners directly
from data. The review below categorizes the many deep learning
methods applied to computer vision into two classes: (1) unsupervised
feature learning where the deep learning is used to extract features only,
which may be subsequently fed to a relatively simple machine learning
algorithm for classification or other tasks; and (2) supervised learning
methods where end-to-end learning is adopted to jointly optimize
feature extractor and classifier components of the full system when large
amounts of labeled training data are available.

10.1 Unsupervised or generative feature learning

When labeled data are relatively scarce, unsupervised learning
algorithms have been shown to learn useful visual feature hierarchies. In
fact, prior to the demonstration of remarkable successes of CNN
architectures with supervised learning in the 2012 ImageNet competition,
much of the work in applying deep learning methods to computer
vision had been on unsupervised feature learning. The original
unsupervised deep autoencoder that exploits DBN pre-training was developed
and demonstrated by Hinton and Salakhutdinov [164] with success on
the image recognition and dimensionality reduction (coding) tasks of
MNIST with only 60,000 samples in the training set; see details of this
task in http://yann.lecun.com/exdb/mnist/ and an analysis in [78].
It is interesting to note that the gain of coding efficiency using the
DBN-based autoencoder on the image data over the conventional method of
principal component analysis as demonstrated in [164] is very similar to
the gain reported in [100] and described in Section 4 of this monograph
on the speech data over the traditional technique of vector
quantization. Also, Nair and Hinton [265] developed a modified DBN where the
top-layer model uses a third-order Boltzmann machine. This type of
DBN is applied to the NORB database, a three-dimensional object
recognition task. An error rate close to the best published result on this
task is reported. In particular, it is shown that the DBN substantially
outperforms shallow models such as SVMs. In [358], two strategies to
improve the robustness of the DBN are developed. First, sparse
connections in the first layer of the DBN are used as a way to regularize the
model. Second, a probabilistic de-noising algorithm is developed. Both
techniques are shown to be effective in improving robustness against
occlusion and random noise in a noisy image recognition task. DBNs
have also been successfully applied to create compact but
meaningful representations of images [360] for retrieval purposes. On this large
collection image retrieval task, deep learning approaches also produced
strong results. Further, the use of a temporally conditional DBN for
video sequence and human motion synthesis was reported in [361]. The
conditional RBM and DBN make the RBM and DBN weights
associated with a fixed time window conditioned on the data from previous
time steps. The computational tool offered in this type of temporal
DBN and the related recurrent networks may provide the opportunity
to improve the DBN-HMMs towards efficient integration of
temporal-centric human speech production mechanisms into a DBN-based speech
production model.

Deep learning methods have a rich family, including hierarchical
probabilistic and generative models (neural networks or otherwise).
One recent example of this type developed and applied to facial
expression datasets is the stochastic feed-forward neural network that
can be learned efficiently and that can induce a rich multiple-mode
distribution in the output space not possible with the standard,
deterministic neural networks [359]. In Figure 10.1, we show the architecture
of a typical stochastic feed-forward neural network with four hidden
layers with mixed deterministic and stochastic neurons (left) used to
model multi-mode distributions illustrated on the right. The stochastic
network here is a deep, directed graphical model, where the generation

process starts from input x, a neutral face, and generates the output
y, the facial expression. In face expression classification experiments,
the learned unsupervised hidden features generated from this
stochastic network are appended to the image pixels and helped to obtain
superior accuracy to the baseline classifier based on the conditional
RBM/DBN [361].

Figure 10.1: Left: A typical architecture of the stochastic feed-forward neural
network with four hidden layers. Right: Illustration of how the network can produce
a distribution with two distinct modes and use them to represent two or more
different facial expressions y given a neutral face x. [after [359], @NIPS].

Perhaps the most notable work in the category of unsupervised deep
feature learning for computer vision (prior to the recent surge of the
work on CNNs) is that of [209], a nine-layer locally connected sparse
autoencoder with pooling and local contrast normalization. The model
has one billion connections, trained on the dataset with 10 million
images downloaded from the Internet. The unsupervised feature
learning methods allow the system to train a face detector without having to
label images as containing a face or not. And the control experiments
show that this feature detector is robust not only to translation but
also to scaling and out-of-plane rotation.

Another set of popular studies on unsupervised deep feature
learning for computer vision is based on deep sparse coding models [226].
This type of deep model produced state-of-the-art accuracy results on
the ImageNet object recognition tasks prior to the rise of the CNN
architectures armed with supervised learning to perform joint feature
learning and classification, which we turn to now.

10.2 Supervised feature learning and classification

The origin of the applications of deep learning to object recognition
tasks can be traced to the convolutional neural networks (CNNs)
in the early 90s; see a comprehensive overview in [212]. The
CNN-based architectures in the supervised learning mode have captured
intense interest in computer vision since October 2012 shortly after
the ImageNet competition results were released
(http://www.image-net.org/challenges/LSVRC/2012/). This is mainly due to the huge
recognition accuracy gain over competing approaches when large
amounts of labeled data are available to efficiently train large CNNs
using GPU-like high-performance computing platforms. Just as
DNN-based deep learning methods have outperformed previous
state-of-the-art approaches in speech recognition in a series of benchmark
tasks including phone recognition, large-vocabulary speech recognition,
noise-robust speech recognition, and multi-lingual speech recognition,
CNN-based deep learning methods have demonstrated the same in a
set of computer vision benchmark tasks including category-level object
recognition, object detection, and semantic segmentation.

The basic architecture of the CNN described in [212] is shown in
Figure 10.2. To incorporate the relative invariance of the spatial
relationship in typical image pixels with respect to the location, the CNN
uses a convolutional layer with local receptive fields and with tied
filter weights, much like 2-dimensional FIR filters in image processing.
The output of the FIR filters is then passed through a nonlinear
activation function to create activation maps, followed by another
nonlinear pooling (labeled as “subsampling” in Figure 10.2) layer that
reduces the data rate while providing invariance to slightly
different input images. The output of the pooling layer is fed to a few
fully connected layers as in the DNN discussed in earlier chapters.
The whole architecture above is also called the deep CNN in the
literature.

Deep models with convolution structure such as CNNs have been
found effective and have been in use in computer vision and image
recognition since the 90s [57, 185, 192, 198, 212]. The most notable advance
was achieved in the 2012 ImageNet LSVRC competition, in which

the task is to train a model with 1.2 million high-resolution images
to classify unseen images to one of the 1000 different image classes.
On the test set consisting of 150k images, the deep CNN approach
described in [198] achieved error rates considerably lower than the
previous state-of-the-art. Very large deep-CNNs are used, consisting of
60 million weights, and 650,000 neurons, and five convolutional layers
together with max-pooling layers. Two additional fully-connected layers
as in the DNN described previously are used on top of the CNN layers.
Although all the above structures were developed separately in earlier
work, their best combination accounted for a major part of the success.
See the overall architecture of the deep CNN system in Figure 10.3. Two
additional factors contribute to the final success. The first is a powerful
regularization technique called “dropout”; see details in [166] and a
series of further analysis and improvement in [10, 13, 240, 381, 385]. In
particular, Warde-Farley et al. [385] analyzed the disentangling effects
of dropout and showed that it helps because different members of the
bag share parameters. Applications of the same “dropout” techniques
are also successful for some speech recognition tasks [65, 81]. The
second factor is the use of non-saturating neurons or rectified linear
units (ReLU) that compute $f(x) = \max(x, 0)$, which significantly
speeds up the overall training process especially with efficient GPU
implementation. This deep-CNN system achieved a winning top-5 test
error rate of 15.3% using extra training data from ImageNet Fall 2011
release, or 16.4% using only supplied training data in ImageNet-2012,
significantly lower than 26.2% achieved by the second-best system
which combines scores from many classifiers using a set of
hand-crafted features such as SIFT and Fisher vectors. See details in
http://www.image-net.org/challenges/LSVRC/2012/oxford_vgg.pdf about
the best competing method. It is noted, however, that the
Fisher-vector-encoding approach has recently been extended by Simonyan
et al. [329] via stacking in multiple layers to form deep Fisher
networks, which achieve competitive results with deep CNNs at a smaller
computational learning cost.

Figure 10.2: The original convolutional neural network that is composed of
multiple alternating convolution and pooling layers followed by fully connected layers.
[after [212], @IEEE].

Figure 10.3: The architecture of the deep-CNN system which won the 2012
ImageNet competition by a large margin over the second-best system and the state of
the art by 2012. [after [198], @NIPS].

The state of the art performance demonstrated in [198] using the
deep-CNN approach is further improved by another significant
margin during 2013, using a similar approach but with bigger models
and larger amounts of training data. A summary of top-5 test error
rates from 11 top-performing teams participating in the 2013
ImageNet ILSVRC competition is shown in Figure 10.4, with the best
result of the 2012 competition shown at the far right as the baseline.
Here we see rapid error reduction on the same task from the lowest
pre-2012 error rate of 26.2% (non-neural networks) to 15.3% in 2012
and further to 11.2% in 2013, both achieved with deep-CNN
technology. It is also interesting to observe that all major entries in the 2013
ImageNet ILSVRC competition are based on deep learning approaches.
For example, the Adobe system shown in Figure 10.4 is based on the
deep-CNN reported in [198] including the use of dropout. The network

architecture is modified to include more filters and connections. At test
time, image saliency is used to obtain 9 crops from original images,
which are combined with the standard five multiview crops. The NUS
system uses a non-parametric, adaptive method to combine the
outputs from multiple shallow and deep experts, including deep-CNN,
kernel, and GMM methods. The VGG system is described in [329]
and uses a combination of the deep Fisher vector network and the
deep-CNN. The ZF system is based on a combination of a large CNN
with a range of different architectures. The choice of architectures was
assisted by visualization of model features using a deconvolutional
network as described by Zeiler et al. [437], Zeiler and Fergus [435, 436],
and Zeiler [434]. The CognitiveVision system uses an image
classification scheme based on a DNN architecture. The method is inspired
by cognitive psychophysics about how the human vision system first
learns to classify the basic-level categories and then learns to
classify categories at the subordinate level for fine-grained object
recognition. Finally, the best-performing system called Clarifai in Figure 10.4
is based on a large and deep CNN with dropout regularization. It
augments the amount of training data by down-sampling images to
256 pixels. The system contains a total of 65M parameters. Multiple
such models were averaged together to further boost performance. The
main novelty is to use the visualization technique based on the
deconvolutional networks as described in [434, 437] to identify what makes the
deep model perform well, based on which a powerful deep architecture
was chosen. See more details of these systems in
http://www.image-net.org/challenges/LSVRC/2013/results.php.

Figure 10.4: Summary results of ImageNet Large Scale Visual Recognition
Challenge 2013 (ILSVRC2013), representing the state-of-the-art performance of
object recognition systems. Data source:
http://www.image-net.org/challenges/LSVRC/2013/results.php.

While the deep CNN has demonstrated remarkable classification
performance on object recognition tasks, there has been no clear
understanding of why they perform so well until recently. Zeiler and Fergus
[435, 436] conducted research to address just this issue, and then used
the gained understanding to further improve the CNN systems, which
yielded excellent performance as shown in Figure 10.4 with labels “ZF”
and “Clarifai.” A novel visualization technique is developed that gives
insight into the function of intermediate feature layers of the deep CNN.
The technique also sheds light onto the operation of the full network
acting as a classifier. The visualization technique is based on a
deconvolutional network, which maps the neural activities in intermediate
layers of the original convolutional network back to the input pixel
space. This allows the researchers to examine what input pattern
originally caused a given activation in the feature maps. Figure 10.5 (the
top portion) illustrates how a deconvolutional network is attached to
each of its layers, thereby providing a closed loop back to image pixels
as the input to the original CNN. The information flow in this closed
loop is as follows. First, an input image is presented to the deep CNN in
a feed-forward manner so that the features at all layers are computed.
To examine a given CNN activation, all other activations in the layer
are set to zero and the feature maps are passed as input to the attached
deconvolutional network’s layer. Then, successive operations, opposite
to the feed-forward computation in the CNN, are carried out including
unpooling, rectifying, and filtering. This allows the reconstruction of
the activity in the layer beneath that gave rise to the chosen
activation. These operations are repeated until the input layer is reached. During
unpooling, non-invertibility of the max pooling operation in the CNN is

resolved by an approximate inverse, where the locations of the maxima
within each pooling region are recorded in a set of “switch” variables.
These switches are used to place the reconstructions from the layer
above into appropriate locations, preserving the structure of the
stimulus. This procedure is shown at the bottom portion of Figure 10.5.

Figure 10.5: The top portion shows how a deconvolutional network’s layer (left)
is attached to a corresponding CNN’s layer (right). The deconvolutional network
reconstructs an approximate version of the CNN features from the layer below. The
bottom portion is an illustration of the unpooling operation in the deconvolutional
network, where “Switches” are used to record the location of the local max in each
pooling region during pooling in the CNN. [after [436], @arXiv].

In addition to the deep-CNN architecture described above, the DNN
architecture has also been shown to be highly successful in a number
of computer vision tasks [54, 55, 56, 57]. We have not found in the
literature direct comparisons among the CNN, DNN, and other
related architectures on identical tasks.

Finally, the most recent study on supervised learning for computer
vision shows that the deep CNN architecture is not only successful for
object/image classification discussed earlier in this section but also
successful for object detection in whole images [128]. The detection
task is substantially more complex than the classification task.

As a brief summary of this chapter, deep learning has made huge
inroads into computer vision, soon after its success in speech
recognition discussed in Section 7. So far, it is the supervised learning paradigm
based on the deep CNN architecture and the related classification
techniques that are making the greatest impact, showcased by the ImageNet
competition results from 2012 and 2013. These methods can be used
for not only object recognition but also many other computer vision
tasks. There has been some debate as to the reasons for the success of
these CNN-based deep learning methods, and about their limitations.
Many questions are still open as to how these methods can be
tailored to certain computer vision applications and how to scale up the
models and training data. Finally, we discussed a number of studies on
unsupervised and generative approaches of deep learning to computer
vision and image modeling problems in the earlier part of this chapter.
Their performance has not been competitive with the supervised
learning approach on object recognition tasks with ample training data. To
achieve long term and ultimate success in computer vision, it is likely
that unsupervised learning will be needed. To this end, many open
problems in unsupervised feature learning and deep learning need to
be addressed and much more research needs to be carried out.
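The unpooling step of the deconvolutional network described in this chapter, in which "switch" variables recorded during max pooling are used to place values back at the locations of the maxima, can be sketched as follows. This is only a minimal illustration, assuming non-overlapping square pooling regions on a single 2-D feature map; the function names are hypothetical, and the rectifying and filtering steps of the deconvolutional network are not shown.

```python
def max_pool_with_switches(fmap, size):
    """Non-overlapping size x size max pooling over a 2-D list.

    Returns the pooled map and, for each pooling region, the (row, col)
    position of its maximum (the "switch").
    """
    rows, cols = len(fmap), len(fmap[0])
    pooled, switches = [], []
    for r0 in range(0, rows, size):
        pooled_row, switch_row = [], []
        for c0 in range(0, cols, size):
            # Candidates in this region as (value, row, col) triples.
            region = [(fmap[r][c], r, c)
                      for r in range(r0, min(r0 + size, rows))
                      for c in range(c0, min(c0 + size, cols))]
            value, r, c = max(region, key=lambda item: item[0])
            pooled_row.append(value)
            switch_row.append((r, c))
        pooled.append(pooled_row)
        switches.append(switch_row)
    return pooled, switches


def unpool(pooled, switches, shape):
    """Approximate inverse of max pooling: put each value back at its switch."""
    rows, cols = shape
    out = [[0.0] * cols for _ in range(rows)]
    for pooled_row, switch_row in zip(pooled, switches):
        for value, (r, c) in zip(pooled_row, switch_row):
            out[r][c] = value
    return out


feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 5],
               [0, 1, 7, 2],
               [2, 6, 3, 1]]
pooled, switches = max_pool_with_switches(feature_map, 2)
restored = unpool(pooled, switches, (4, 4))
for row in restored:
    print(row)
```

All positions other than the recorded maxima are filled with zeros, which is why the inverse is only approximate: the non-maximal values in each region are lost during pooling.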
11
Selected Applications in Multimodal and Multi-task Learning

Multi-task learning is a machine learning approach that learns to solve
several related problems at the same time, using a shared
representation. It can be regarded as one of the two major classes of transfer
learning or learning with knowledge transfer, which focuses on
generalizations across distributions, domains, or tasks. The other major class
of transfer learning is adaptive learning, where knowledge transfer is
carried out in a sequential manner, typically from a source task to a
target task [95]. Multi-modal learning is a closely related concept to
multi-task learning, where the learning domains or “tasks” cut across
several modalities for human-computer interactions or other
applications embracing a mixture of textual, audio/speech, touch, and visual
information sources.

The essence of deep learning is to automate the process of
discovering effective features or representations for any machine
learning task, including automatically transferring knowledge from one task
to another concurrently. Multi-task learning is often applied to
conditions where no or very little training data are available for the
target task domain, and hence is sometimes called zero-shot or one-shot
learning. It is evident that difficult multi-task learning naturally fits the
paradigm of deep learning or representation learning where the shared
representations and statistical strengths across tasks (e.g., those
involving separate modalities of audio, image, touch, and text) are expected
to greatly facilitate many machine learning scenarios under low- or
zero-resource conditions. Before deep learning methods were adopted,
there had been numerous efforts in multi-modal and multi-task
learning. For example, a prototype called MiPad for multi-modal
interactions involving capturing, learning, coordinating, and rendering a mix
of speech, touch, and visual information was developed and reported
in [175, 103]. And in [354, 443], mixed sources of information from
multiple-sensory microphones with separate bone-conductive and
air-borne paths were exploited to de-noise speech. These early studies all
used shallow models and learning methods and achieved worse than
desired performance. With the advent of deep learning, it is hoped
that the difficult multi-modal learning problems can be solved with
eventual success to enable a wide range of practical applications. In
this chapter, we will review selected applications in this area,
organized according to different combinations of more than one modality
or learning task. Much of the work reviewed here is on-going research,
and readers should expect follow-up publications in the future.

11.1 Multi-modalities: Text and image

The underlying mechanism for potential effectiveness of multi-modal
learning involving text and image is the common semantics associated
with the text and image. The relationship between the text and image
may come, for example, from the text annotations of an image (as the
training data for a multi-modal learning system). If the related text
and image share the same representation in a common semantic space,
the system can generalize to the unseen situation where either text
or image is unavailable. It can thus be naturally used for zero-shot
learning for image or text. In other words, multi-modality learning can
use text information to help image/visual recognition, and vice versa.
Exploiting text information for image/visual recognition constitutes
most of the work done in this space, which we review in this section
below.
The deep architecture called DeViSE (deep visual-semantic embedding), developed by Frome et al. [117], is a typical example of multi-modal learning where text information is used to improve the image recognition system, especially for performing zero-shot learning. Image recognition systems are often limited in their ability to scale to a large number of object categories, due in part to the increasing difficulty of acquiring sufficient training data with text labels as the number of image categories grows. The multi-modal DeViSE system aims to leverage text data to train the image models. The joint model is trained to identify image classes using both labeled image data and the semantic information learned from unannotated text. An illustration of the DeViSE architecture is shown in the center portion of Figure 11.1. It is initialized with the parameters pre-trained at the lower layers of two models: the deep CNN for image classification in the left portion of the figure and the text embedding model in the right portion of the figure. The part of the deep CNN labeled “core visual model” in Figure 11.1 is further trained to predict the target word-embedding vector using a projection layer labeled “transformation” and using a similarity metric. The loss function used in training adopts a combination of dot-product similarity and max-margin, hinge rank loss. The former is the un-normalized version of the cosine loss function used for training the DSSM model in [170] as described in Section 9.3. The latter is similar to the earlier joint image-text model called WSABIE (web scale annotation by image embedding), developed by Weston et al. [388, 389]. The results show that the information provided by text improves zero-shot image predictions, achieving good hit rates (close to 15%) across thousands of labels never seen by the image model.

The earlier WSABIE system as described in [388, 389] adopted a shallow architecture and trained a joint embedding model of both images and labels. Rather than using deep architectures to derive the highly nonlinear image (as well as text-embedding) feature vectors as in DeViSE, WSABIE uses simple image features and a linear mapping to arrive at the joint embedding space. Further, it uses an embedding vector for each possible label. Thus, unlike DeViSE, WSABIE could not generalize to new classes.

Figure 11.1: Illustration of the multi-modal DeViSE architecture. The left portion is an image recognition neural network with a softmax output layer. The right portion is a skip-gram text model providing word embedding vectors; see Section 8.2 and Figure 8.3 for details. The center is the joint deep image-text model of DeViSE, with the two Siamese branches initialized by the image and word embedding models below the softmax layers. The layer labeled “transformation” is responsible for mapping the outputs of the image (left) and text (right) branches into the same semantic space. [after [117], @NIPS].

It is also interesting to compare the DeViSE architecture of Figure 11.1 with the DSSM architecture of Figure 9.2 in Section 9. The branches of “Query” and “Documents” in DSSM are analogous to the branches of “image” and “text-label” in DeViSE. Both DeViSE and DSSM use an objective function related to the cosine distance between two vectors for training the network weights in an end-to-end fashion. One key difference, however, is that the two sets of inputs to the DSSM are both text (i.e., “Query” and “Documents” designed for IR), and thus mapping “Query” and “Documents” to the same semantic space is conceptually more straightforward compared with the need in DeViSE for mapping from one modality (image) to another (text). Another key difference is that the generalization ability of DeViSE to unseen image classes comes from computing text embedding vectors for many unsupervised text sources (i.e., with no image counterparts) that would cover the text labels corresponding to the unseen classes. The generalization ability of the DSSM over unseen words, however, is derived from a special coding scheme for words in terms of their constituent letters.

The DeViSE architecture has inspired a more recent method, which maps images into the semantic embedding space via convex
combination of embedding vectors for the text label and the image classes [270]. Here is the main difference. DeViSE replaces the last, softmax layer of a CNN image classifier with a linear transformation layer. The new transformation layer is then trained together with the lower layers of the CNN. The method in [270] is much simpler: it keeps the softmax layer of the CNN and does not train the CNN. For a test image, the CNN first produces the top N-best candidates. Then, the convex combination of the corresponding N embedding vectors in the semantic space is computed. This gives a deterministic transformation from the outputs of the softmax classifier into the embedding space.
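The deterministic transformation just described can be sketched in a few lines. This is only an illustration under stated assumptions, not the implementation of [270]: the weights of the convex combination are taken here to be the renormalized softmax probabilities of the top N classes, the nearest unseen label is found by cosine similarity, and all label names, embedding vectors, and probabilities are toy values invented for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def convex_combination(class_probs, label_vecs, top_n):
    """Map softmax outputs into the embedding space via the top-N classes."""
    # Keep the N most probable training classes.
    top = sorted(class_probs.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    # Renormalize so the weights are non-negative and sum to one (assumption).
    total = sum(p for _, p in top)
    dim = len(label_vecs[top[0][0]])
    combined = [0.0] * dim
    for name, p in top:
        weight = p / total
        for i, x in enumerate(label_vecs[name]):
            combined[i] += weight * x
    return combined


# Hypothetical 2-d word embeddings for seen (training) and unseen labels.
seen_label_vecs = {"cat": [1.0, 0.0], "dog": [0.0, 1.0], "truck": [-1.0, 0.0]}
unseen_label_vecs = {"lion": [0.9, 0.3], "wolf": [0.2, 0.9]}

# Hypothetical softmax output of a fixed, already trained CNN for one test image.
softmax_out = {"cat": 0.6, "dog": 0.3, "truck": 0.1}

image_vec = convex_combination(softmax_out, seen_label_vecs, top_n=2)
prediction = max(unseen_label_vecs, key=lambda n: cosine(image_vec, unseen_label_vecs[n]))
```

Because the CNN is never retrained, the only choices in this sketch are N and the embedding table, which is what makes the approach simpler than training a new transformation layer.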
This simple multi-modal learning method is shown to work very well on the ImageNet zero-shot learning task.

Another thread of studies, separate from but related to the above work on multi-modal learning involving text and image, has centered on the use of multi-modal embeddings, where data from multiple sources with separate modalities of text and image are projected into the same vector space. For example, Socher and Fei-Fei [341] project words and images into the same space using kernelized canonical correlation analysis. Socher et al. [342] map images to single-word vectors so that the constructed multi-modal system can classify images without seeing any examples of the class, i.e., zero-shot learning similar to the capability of DeViSE. The most recent work by Socher et al. [343] extends their earlier work from single-word embeddings to those of phrases and full-length sentences. The mechanism for mapping sentences instead of the earlier single words into the multi-modal embedding space is derived from the power of the recursive neural network described in Socher et al. [347] as summarized in Section 8.2, and its extension with dependency tree.

In addition to mapping text to image (or vice versa) into the same vector space, or creating the joint image/text embedding space, multi-modal learning for text and image can also be cast in the framework of language models. In [196], the focus of the study is a model of natural language that is conditioned on other modalities such as images. This type of multi-modal language model is used to (1) retrieve images given complex description queries, (2) retrieve phrase descriptions given image queries, and (3) generate text conditioned on images. Word representations and image features are jointly learned by training the multi-modal language model together with a convolutional network. An illustration of the multi-modal language model is shown in Figure 11.2.

Figure 11.2: Illustration of the multi-modal language model. [after [196], @NIPS].

11.2 Multi-modalities: Speech and image

Ngiam et al. [268, 269] propose and evaluate an application of deep networks to learn features over audio/speech and image/video modalities. They demonstrate cross-modality feature learning, where better features for one modality (e.g., image) are learned when multiple modalities (e.g., speech and image) are present at feature learning time. A bi-modal deep autoencoder architecture for separate audio/speech and video/image input channels is shown in Figure 11.3. The essence of this architecture is to use a shared, middle layer to represent both types of modalities. This is a straightforward generalization from the single-modal deep autoencoder for speech shown in Figure 4.1 of Section 4 to its bi-modal counterpart. The authors further show how to
learn a shared audio and video representation, and evaluate it on a fixed task, where the classifier is trained with audio-only data but tested with video-only data and vice versa. The work concludes that deep learning architectures are generally effective in learning multi-modal features from unlabeled data and in improving single-modality features through cross-modality information transfer. One exception is the cross-modality setting using the CUAVE dataset. The results presented in [269, 268] show that learning video features with both video and audio outperforms that with only video data. However, the same paper also shows that a model of [278], which combines a sophisticated signal processing technique for extracting visual features with the uncertainty-compensation method developed originally for robust speech recognition [104], gives the best classification accuracy in the cross-modal learning task, beating the features derived from the generative deep architecture designed for this task.

Figure 11.3: The architecture of a deep denoising autoencoder for multi-modal audio/speech and visual features. [after [269], @ICML].

While the deep generative architecture for multimodal learning described in [268, 269] is based on non-probabilistic autoencoder neural nets, a probabilistic version based on the deep Boltzmann machine (DBM) has appeared more recently for the same multimodal application. In [348], a DBM is used to extract a unified representation integrating separate modalities, useful for both classification and information retrieval tasks. Rather than using the “bottleneck” layers in the deep autoencoder to represent multimodal inputs, here a probability density is defined on the joint space of multimodal inputs, and states of suitably defined latent variables are used for the representation. The advantage of this probabilistic formulation, possibly lacking in the traditional deep autoencoder, is that the missing modality’s information can be filled in naturally by sampling from its conditional distribution. More recent work on autoencoders [22, 30] shows the capability of generalized denoising autoencoders in carrying out sampling; thus they may overcome the earlier problem of filling in the missing modality’s information. For the bi-modal data consisting of image and text, the multimodal DBM was shown to slightly outperform the traditional version of the deep multimodal autoencoder as well as the multimodal DBN in classification and information retrieval tasks. No results on comparisons with the generalized version of deep autoencoders have been reported, but they may appear soon.

The several architectures discussed so far in this chapter for multi-modal processing and learning can be regarded as special cases of more general multi-task learning and transfer learning [22, 47]. Transfer learning, encompassing both adaptive and multi-task learning, refers to the ability of a learning architecture and technique to exploit common hidden explanatory factors among different learning tasks. Such exploitation permits sharing of aspects of diverse types of input data sets, thus allowing the possibility of transferring knowledge across seemingly different learning tasks. As argued in [22], the learning architecture shown in Figure 11.4 and the associated learning algorithms have an advantage for such tasks because they learn representations that capture underlying factors, a subset of which may be relevant for each particular task. We will discuss a number of such multi-task learning applications in the remainder of this chapter that are confined to a single modality of speech, natural language processing, or image domain.
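To make the idea of a shared middle layer in the bi-modal autoencoder of this section concrete, the following is a minimal forward-pass sketch. It is not the architecture of [268, 269]: the layer sizes, the single encoding layer per modality, the random untrained weights, and the convention of feeding zeros for an absent modality are all assumptions made for illustration.

```python
import math
import random

rng = random.Random(0)
AUDIO_DIM, VIDEO_DIM, SHARED_DIM = 4, 6, 3  # hypothetical sizes


def make_matrix(n_out, n_in):
    """Random (untrained) weight matrix with n_out rows and n_in columns."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]


def matvec(matrix, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]


enc_audio = make_matrix(SHARED_DIM, AUDIO_DIM)
enc_video = make_matrix(SHARED_DIM, VIDEO_DIM)
dec_audio = make_matrix(AUDIO_DIM, SHARED_DIM)
dec_video = make_matrix(VIDEO_DIM, SHARED_DIM)


def encode(audio, video):
    """Both modalities contribute to one shared middle-layer code."""
    pre = [a + v for a, v in zip(matvec(enc_audio, audio), matvec(enc_video, video))]
    return [math.tanh(p) for p in pre]


def decode(shared):
    """Reconstruct both modalities from the shared code."""
    return matvec(dec_audio, shared), matvec(dec_video, shared)


# Audio-only input: the missing video channel is fed as zeros (assumed convention).
audio = [0.1, 0.2, -0.3, 0.4]
missing_video = [0.0] * VIDEO_DIM
shared_code = encode(audio, missing_video)
audio_hat, video_hat = decode(shared_code)
```

Since the shared code has the same size whichever modality is present, a classifier trained on codes from one modality can in principle be applied to codes from the other, which is the setting evaluated in the audio-only versus video-only experiments described above.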
Figure 11.4: A DNN architecture for multitask learning that is aimed to discover hidden explanatory factors shared among three tasks A, B, and C. [after [22], @IEEE].

11.3 Multi-task learning within the speech, NLP or image domain

Within the speech domain, one of the most interesting applications of multi-task learning is multi-lingual or cross-lingual speech recognition, where speech recognition for different languages is considered as different tasks. Various approaches have been taken to attack this rather challenging acoustic modeling problem for speech recognition, where the difficulty lies in the lack of transcribed speech data due to economic considerations in developing speech recognition systems for all languages in the world. Cross-language data sharing and data weighting are common and useful approaches for the GMM–HMM system [225]. Another successful approach for the GMM–HMM is to map pronunciation units across languages either via knowledge-based or data-driven methods [420]. But they are much inferior to the DNN–HMM approach, which we now summarize.

Figure 11.5: A DNN architecture for multilingual speech recognition. [after [170], @IEEE].

In the recent papers [94, 170] and [150], two research groups independently developed closely related DNN architectures with multi-task learning capabilities for multilingual speech recognition. See Figure 11.5 for an illustration of this type of architecture. The idea behind these architectures is that the hidden layers in the DNN, when learned appropriately, serve as increasingly complex feature transformations sharing common hidden factors across the acoustic data in different languages. The final softmax layer representing a log-linear classifier makes use of the most abstract feature vectors represented in the top-most hidden layer. While the log-linear classifier is necessarily separate for different languages, the feature transformations can be shared across languages. Excellent multilingual speech recognition results are reported, far exceeding the earlier results using the GMM–HMM based approaches [225, 420]. The implication of this set of work is significant and far reaching. It points to the possibility of quickly building a high-performance DNN-based system for a new language from an existing multilingual DNN. This huge benefit would require only a small amount of training data from the target language, although having more data would further improve the performance. This multitask learning approach can reduce the need for the unsupervised pre-training stage, and can train the DNN with far fewer epochs. An extension of this set of work would be to efficiently build a language-universal speech recognition system. Such a system can not only recognize many languages and improve the accuracy for each individual language, but
also expand the languages supported by simply stacking softmax layers on the DNN for new languages.

A closely related DNN architecture, as shown in Figure 11.6, with multitask learning capabilities was also recently applied to another acoustic modeling problem: learning joint representations for two separate sets of acoustic data [94, 221]. The set that consists of the speech data with a 16 kHz sampling rate is wideband and of high quality, and is often collected from increasingly popular smart phones under the voice search scenario. The other, narrowband data set has a lower sampling rate of 8 kHz and is often collected using telephony speech recognition systems.

Figure 11.6: A DNN architecture for speech recognition trained with mixed-bandwidth acoustic data with 16-kHz and 8-kHz sampling rates; [after [221], @IEEE].

As a final example of multi-task learning within the speech domain, let us consider phone recognition and word recognition as separate “tasks.” That is, phone recognition results are used not for producing text outputs but for language-type identification or for spoken document retrieval. Then, the use of a pronunciation dictionary in almost all speech systems can be considered as multi-task learning that shares the tasks of phone recognition and word recognition. More advanced frameworks in speech recognition have pushed this direction further by advocating the use of even finer units of speech than phones to bridge the raw acoustic information of speech to the semantic content of speech via a hierarchy of linguistic structure. These atomic speech units include “speech attributes” in the detection-based and knowledge-rich modeling framework for speech recognition, whose accuracy has been significantly boosted recently by the use of deep learning methods [332, 330, 427].

Within the natural language processing domain, the best known example of multi-task learning is the comprehensive studies reported in [62, 63], where a range of separate “tasks” of part-of-speech tagging, chunking, named entity tagging, semantic role identification, and similar-word identification in natural language processing are attacked using a common representation of words and a unified deep learning approach. A summary of these studies can be found in Section 8.2.

Finally, within the domain of image/vision as a single modality, deep learning has also been found effective in multi-task learning. Srivastava and Salakhutdinov [349] present a multi-task learning approach based on hierarchical Bayesian priors in a DNN system applied to various image classification data sets. The priors are combined with a DNN, which improves discriminative learning by encouraging information sharing among tasks and by discovering similar classes among which knowledge is transferred. More specifically, methods are developed to jointly learn to classify images and a hierarchy of classes, such that “poor classes,” for which there are relatively few training examples, can benefit from similar “rich classes,” for which more training examples are available. This work can be considered as an excellent instance of learning output representations, in addition to learning the input representation of the DNN, which is the focus of nearly all deep learning work reported in the literature.

As another example of multi-task learning within the single-modality domain of image, Ciresan et al. [58] applied the architecture of deep CNNs to character recognition tasks for Latin and for Chinese. The deep CNNs trained on Chinese characters are shown to be easily capable of recognizing uppercase Latin letters. Further, learning Chinese characters is accelerated by first pre-training a CNN on a small subset of all classes and then continuing to train on all classes.
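The multilingual DNN of Figure 11.5 discussed earlier in this section, with hidden layers shared across languages and one softmax layer per language, can be sketched as follows. This is a forward-pass illustration only, under assumed sizes and random untrained weights; the language names, layer dimensions, and the helper `add_language` are hypothetical and not taken from [94, 170] or [150].

```python
import math
import random

rng = random.Random(0)
FEATURE_DIM, HIDDEN_DIM = 8, 5  # hypothetical sizes


def make_matrix(n_out, n_in):
    """Random (untrained) weight matrix with n_out rows and n_in columns."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]


def matvec(matrix, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]


def softmax(z):
    peak = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Hidden layers are shared by every language; each language owns a softmax layer.
shared_layers = [make_matrix(HIDDEN_DIM, FEATURE_DIM), make_matrix(HIDDEN_DIM, HIDDEN_DIM)]
softmax_heads = {"lang_a": make_matrix(4, HIDDEN_DIM), "lang_b": make_matrix(6, HIDDEN_DIM)}


def shared_features(frame):
    """Language-independent feature transformation of one acoustic frame."""
    hidden = frame
    for weights in shared_layers:
        hidden = [math.tanh(v) for v in matvec(weights, hidden)]
    return hidden


def posteriors(frame, language):
    """Language-specific log-linear classifier on top of the shared features."""
    return softmax(matvec(softmax_heads[language], shared_features(frame)))


def add_language(language, n_units):
    """Support a new language by stacking a new softmax layer on the shared DNN."""
    softmax_heads[language] = make_matrix(n_units, HIDDEN_DIM)


add_language("lang_c", 3)
frame = [0.1 * i for i in range(FEATURE_DIM)]
probs_c = posteriors(frame, "lang_c")
```

In this arrangement only the new softmax layer has to be estimated for the added language, which is consistent with the observation above that a small amount of target-language data can suffice.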
12
Conclusion

This monograph first presented a brief history of deep learning (focusing on speech recognition) and developed a categorization scheme to analyze the existing deep networks in the literature into unsupervised (many of which are generative), supervised, and hybrid classes. The deep autoencoder, the DSN (as well as many of its variants), and the DBN–DNN or pre-trained DNN architectures, one in each of the three classes, are discussed and analyzed in detail, as they appear to be popular and promising approaches based on the authors’ personal research experiences. Applications of deep learning in five broad areas of information processing are also reviewed, including speech and audio (Section 7), natural language modeling and processing (Section 8), information retrieval (Section 9), object recognition and computer vision (Section 10), and multi-modal and multi-task learning (Section 11). There are other interesting yet non-mainstream applications of deep learning, which are not covered in this monograph. Interested readers may consult recent papers on the applications of deep learning to optimal control in [219], to reinforcement learning in [256], to malware classification in [66], to compressed sensing in [277], to recognition confidence prediction in [173], to acoustic-articulatory inversion mapping in [369], to emotion recognition from video in [189], to emotion recognition from speech in [207, 222], to spoken language understanding in [242, 366, 403], to speaker recognition in [351, 372], to language-type recognition in [112], to dialogue state tracking for spoken dialogue systems in [94, 152], to automatic voice activity detection in [442], to speech enhancement in [396], to voice conversion in [266], and to single-channel source separation in [132, 387].

The literature on deep learning is vast, mostly coming from the machine learning community. The signal processing community embraced deep learning only within the past four years or so (starting around the end of 2009), and the momentum has been growing fast ever since. This monograph is written mainly from the signal and information processing perspective. Beyond surveying the existing deep learning work, a classificatory scheme based on the architectures and on the nature of the learning algorithms is developed, and an analysis and discussions with concrete examples are presented. It is our hope that the survey conducted in this monograph will provide insight for readers to better understand the capability of the various deep learning systems discussed in the monograph, the connection among different but similar deep learning methods, and how to design proper deep learning algorithms under different circumstances.

Throughout this review, the important message is conveyed that building and learning deep hierarchies of features are highly desirable. We have discussed the difficulty of learning parameters in all layers of deep networks in one shot due to optimization difficulties that need to be better understood. The unsupervised pre-training method in the hybrid architecture of the DBN–DNN, which we reviewed in detail in Section 5, appears to have offered a useful, albeit empirical, solution to poor local optima in optimization and to regularization for the deep model containing massive parameters, even though a solid theoretical foundation is still lacking. The effectiveness of the pre-training method, which was one factor that stimulated the interest in deep learning by the signal processing community in 2009 via collaborations between academic and industrial researchers, is most prominent when the supervised training data are limited.

Deep learning is an emerging technology. Despite the promising empirical results reported so far, much more work needs to be carried
out. Importantly, it has not been the experience of deep learning researchers that a single deep learning technique can be successful for all classification tasks. For example, while the popular learning strategy of generative pre-training followed by discriminative fine-tuning seems to work well empirically for many tasks, it failed to work for some other tasks that have been explored (e.g., language identification or speaker recognition; unpublished). For these tasks, the features extracted at the generative pre-training phase seem to describe the underlying speech variations well but do not contain sufficient information to distinguish between different languages. A learning strategy that can extract discriminative yet also invariant features is expected to provide better solutions. This idea has also been called “disentangling” and is developed further in [24]. Further, extracting discriminative features may greatly reduce the model size needed in many of the current deep learning systems. Domain knowledge, such as what kind of invariance is useful for a specific task at hand (e.g., vision, speech, or natural language) and what kind of regularization in terms of parameter constraints is appropriate, is key to the success of applying deep learning methods. Moreover, new types of DNN architectures and learning beyond the several popular ones discussed in this monograph are currently under active development by the deep learning research community (e.g., [24, 89]), holding the promise to improve the performance of deep learning models in more challenging applications in signal processing and in artificial intelligence.

Recently published work showed that there is vast room to improve the current optimization techniques for learning deep architectures [69, 208, 238, 239, 311, 356, 393]. To what extent pre-training is essential to learning the full set of parameters in deep architectures is currently under investigation, especially when very large amounts of labeled training data are available, reducing or even obliterating the need for model regularization. Some preliminary results have been discussed in this monograph and in [55, 161, 323, 429].

In recent years, machine learning has become increasingly dependent on large-scale data sets. For instance, many of the recent successes of deep learning as discussed in this monograph have relied on the access to massive data sets and massive computing power. It would become increasingly difficult to explore the new algorithmic space without access to large, real-world data sets and without the related engineering expertise. How well deep learning algorithms behave would depend heavily on the amount of data and computing power available. As we showed with speech recognition examples, a deep learning algorithm that appears to be performing not so remarkably on small data sets can begin to perform considerably better when these limitations are removed, which is one of the main reasons for the recent resurgence in neural network research. As an example, the DBN pre-training that ignited a new era of (deep) machine learning research appears unnecessary if enough data and computing power are used.

As a consequence, effective and scalable parallel algorithms are critical for training deep models with large data sets, as in many common information processing applications such as speech recognition and machine translation. The popular mini-batch stochastic gradient technique is known to be difficult to parallelize over computers. The common practice nowadays is to use GPGPUs to speed up the learning process, although recent advances in developing asynchronous stochastic gradient descent learning have shown promise by using large-scale CPU clusters [69, 209] and GPU clusters [59]. In this interesting computing architecture, many different replicas of the DNN compute gradients on different subsets of the training data in parallel. These gradients are communicated to a central parameter server that updates the shared weights. Even though each replica typically computes gradients using parameter values not immediately updated, stochastic gradient descent is robust to the slight errors this introduces. To make deep learning techniques scalable to very large training data, theoretically sound parallel learning and optimization algorithms together with novel architectures need to be further developed [31, 39, 49, 69, 181, 322, 356]. Optimization methods specific to speech recognition problems may need to be taken into account in order to push speech recognition advances to the next level [46, 149, 393].

One major barrier to the application of DNNs and related deep models is that it currently requires considerable skill and experience to
choose sensible values for hyper-parameters such as the learning rate have so far dealt with unstructured or “flat” classification problems.
schedule, the strength of the regularizer, the number of layers and the For example, although speech recognition is a sequential classification
number of units per layer, etc. Sensible values for one hyper-parameter problem by nature, in the most successful and large-scale systems, a
may depend on the values chosen for other hyper-parameters and separate HMM is used to handle the sequence structure and the DNN
hyper-parameter tuning in DNNs is especially expensive. Some inter- is only used to produce the frame-level, unstructured posterior dis-
esting methods for solving the problem have been developed recently, including random sampling [32] and Bayesian optimization procedures [337]. Further research is needed in this important area.

This monograph, mainly in Sections 8 and 11 on natural language and multi-modal applications, has touched on some recent work on using deep learning methods to do reasoning, moving beyond the more straightforward pattern recognition using supervised, unsupervised, or hybrid learning methods to which much of this monograph has been devoted. In principle, since deep networks are naturally equipped with distributed representations (cf. Table 3.1) using their layer-wise collections of units for coding relations and coding entities, concepts, events, topics, etc., they can potentially perform powerful reasoning over structures, as argued in various historical publications as well as recent essays [38, 156, 286, 288, 292, 335, 336]. While initial explorations of this capability of deep networks have recently appeared in the literature, as reviewed in Sections 8 and 11, much research is needed. If successful, this new type of deep learning "machine" will open up many novel and exciting applications in applied artificial intelligence as a "thinking brain." We expect a growing body of deep learning work in this area, full of new challenges, in the future.

Further, solid theoretical foundations of deep learning need to be established in a myriad of aspects. As an example, the success of deep learning in unsupervised learning has not been demonstrated as much as for supervised learning; yet the essence and major motivation of deep learning lie right in unsupervised learning for automatically discovering data representations. The issues involve appropriate objectives for learning effective feature representations and the right deep learning architectures/algorithms for distributed representations to effectively disentangle the hidden explanatory factors of variation in the data. Unfortunately, a majority of the successful deep learning techniques

tributions. Recent proposals have called for and investigated moving beyond the "flat" representations and incorporating structures in both the deep learning architectures and the input and output representations [79, 136, 338, 349].

Finally, deep learning researchers have been advised by neuroscientists to seriously consider a broader set of issues and learning architectures so as to gain insight into biologically plausible representations in the brain that may be useful for practical applications [272]. How can computational neuroscience models of hierarchical brain structure help improve engineered deep learning architectures? How may the biologically feasible learning styles in the brain [158, 395] help design more effective and more robust deep learning algorithms? All these issues and those discussed earlier in this section will need intensive research in order to further push the frontier of deep learning.
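As a small illustration of the random-sampling approach to hyper-parameter selection cited earlier in this section [32], the sketch below draws configurations independently at random and keeps the best one. It is only a sketch under stated assumptions: the two hyper-parameters, their ranges, and the function `toy_validation_error` are hypothetical stand-ins chosen for illustration and do not come from the monograph. In practice the objective would train a network and return its validation error.

```python
import math
import random


def sample_config(rng):
    """Draw one hypothetical hyper-parameter configuration."""
    return {
        # Learning rate sampled log-uniformly from [1e-4, 1e-1] (assumed range).
        "learning_rate": 10 ** rng.uniform(-4, -1),
        # Number of hidden layers sampled uniformly from 1..5 (assumed range).
        "num_layers": rng.randint(1, 5),
    }


def toy_validation_error(config):
    """Stand-in for 'train a model, return validation error'.

    A made-up smooth function whose minimum is at
    learning_rate = 1e-2 and num_layers = 3.
    """
    lr_term = (math.log10(config["learning_rate"]) + 2.0) ** 2
    depth_term = 0.1 * (config["num_layers"] - 3) ** 2
    return lr_term + depth_term


def random_search(objective, num_trials, seed=0):
    """Evaluate num_trials random configurations and return the best one."""
    rng = random.Random(seed)
    best_config, best_error = None, float("inf")
    for _ in range(num_trials):
        config = sample_config(rng)
        error = objective(config)
        if error < best_error:
            best_config, best_error = config, error
    return best_config, best_error


best_config, best_error = random_search(toy_validation_error, num_trials=50)
print(best_config, best_error)
```

Because each trial is independent, the trials could be evaluated in parallel, and adding more trials with the same seed can only keep or improve the best error found.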
References

[1] O. Abdel-Hamid, L. Deng, and D. Yu. Exploring convolutional neural network structures and optimization for speech recognition. In Proceedings of Interspeech. 2013.
[2] O. Abdel-Hamid, L. Deng, D. Yu, and H. Jiang. Deep segmental neural networks for speech recognition. In Proceedings of Interspeech. 2013.
[3] O. Abdel-Hamid, A. Mohamed, H. Jiang, and G. Penn. Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2012.
[4] A. Acero, L. Deng, T. Kristjansson, and J. Zhang. HMM adaptation using vector Taylor series for noisy speech recognition. In Proceedings of Interspeech. 2000.
[5] G. Alain and Y. Bengio. What regularized autoencoders learn from the data generating distribution. In Proceedings of International Conference on Learning Representations (ICLR). 2013.
[6] G. Anthes. Deep learning comes of age. Communications of the Association for Computing Machinery (ACM), 56(6):13–15, June 2013.
[7] I. Arel, C. Rose, and T. Karnowski. Deep machine learning - a new frontier in artificial intelligence. IEEE Computational Intelligence Magazine, 5:13–18, November 2010.
[8] E. Arisoy, T. Sainath, B. Kingsbury, and B. Ramabhadran. Deep neural network language models. In Proceedings of the Joint Human Language Technology Conference and the North American Chapter of the Association of Computational Linguistics (HLT-NAACL) Workshop. 2012.
[9] O. Aslan, H. Cheng, D. Schuurmans, and X. Zhang. Convex two-layer modeling. In Proceedings of Neural Information Processing Systems (NIPS). 2013.
[10] J. Ba and B. Frey. Adaptive dropout for training deep neural networks. In Proceedings of Neural Information Processing Systems (NIPS). 2013.
[11] J. Baker, L. Deng, J. Glass, S. Khudanpur, C.-H. Lee, N. Morgan, and D. O’Shaughnessy. Research developments and directions in speech recognition and understanding. IEEE Signal Processing Magazine, 26(3):75–80, May 2009.
[12] J. Baker, L. Deng, J. Glass, S. Khudanpur, C.-H. Lee, N. Morgan, and D. O’Shaughnessy. Updated MINDS report on speech recognition and understanding. IEEE Signal Processing Magazine, 26(4), July 2009.
[13] P. Baldi and P. Sadowski. Understanding dropout. In Proceedings of Neural Information Processing Systems (NIPS). 2013.
[14] E. Battenberg, E. Schmidt, and J. Bello. Deep learning for music, special session at International Conference on Acoustics Speech and Signal Processing (ICASSP) (http://www.icassp2014.org/special_sections.html#ss8), 2014.
[15] E. Battenberg and D. Wessel. Analyzing drum patterns using conditional deep belief networks. In Proceedings of International Symposium on Music Information Retrieval (ISMIR). 2012.
[16] P. Bell, P. Swietojanski, and S. Renals. Multi-level adaptive networks in tandem and hybrid ASR systems. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2013.
[17] Y. Bengio. Artificial neural networks and their application to sequence recognition. Ph.D. Thesis, McGill University, Montreal, Canada, 1991.
[18] Y. Bengio. New distributed probabilistic language models. Technical Report, University of Montreal, 2002.
[19] Y. Bengio. Neural net language models. Scholarpedia, 3, 2008.
[20] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
[21] Y. Bengio. Deep learning of representations for unsupervised and transfer learning. Journal of Machine Learning Research Workshop and Conference Proceedings, 27:17–37, 2012.
[22] Y. Bengio. Deep learning of representations: Looking forward. In Statistical Language and Speech Processing, pages 1–37. Springer, 2013.
[23] Y. Bengio, N. Boulanger-Lewandowski, and R. Pascanu. Advances in optimizing recurrent networks. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2013.
[24] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 35(8):1798–1828, 2013.
[25] Y. Bengio, R. De Mori, G. Flammia, and R. Kompe. Global optimization of a neural network-hidden Markov model hybrid. IEEE Transactions on Neural Networks, 3:252–259, 1992.
[26] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. In Proceedings of Neural Information Processing Systems (NIPS). 2000.
[27] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003.
[28] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In Proceedings of Neural Information Processing Systems (NIPS). 2006.
[29] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5:157–166, 1994.
[30] Y. Bengio, E. Thibodeau-Laufer, and J. Yosinski. Deep generative stochastic networks trainable by backprop. arXiv:1306.1091, 2013. Also accepted to appear in Proceedings of International Conference on Machine Learning (ICML), 2014.
[31] Y. Bengio, L. Yao, G. Alain, and P. Vincent. Generalized denoising autoencoders as generative models. In Proceedings of Neural Information Processing Systems (NIPS). 2013.
[32] J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281–305, 2012.
[33] A. Biem, S. Katagiri, E. McDermott, and B. Juang. An application of discriminative feature extraction to filter-bank-based speech recognition. IEEE Transactions on Speech and Audio Processing, 9:96–110, 2001.
[34] J. Bilmes. Dynamic graphical models. IEEE Signal Processing Magazine, 27:29–42, 2010.
[35] J. Bilmes and C. Bartels. Graphical model architectures for speech recognition. IEEE Signal Processing Magazine, 22:89–100, 2005.
[36] A. Bordes, X. Glorot, J. Weston, and Y. Bengio. A semantic matching energy function for learning with multi-relational data - application to word-sense disambiguation. Machine Learning, May 2013.
[37] A. Bordes, J. Weston, R. Collobert, and Y. Bengio. Learning structured embeddings of knowledge bases. In Proceedings of Association for the Advancement of Artificial Intelligence (AAAI). 2011.
[38] L. Bottou. From machine learning to machine reasoning: An essay. Journal of Machine Learning Research, 14:3207–3260, 2013.
[39] L. Bottou and Y. LeCun. Large scale online learning. In Proceedings of Neural Information Processing Systems (NIPS). 2004.
[40] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In Proceedings of International Conference on Machine Learning (ICML). 2012.
[41] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent. Audio chord recognition with recurrent neural networks. In Proceedings of International Symposium on Music Information Retrieval (ISMIR). 2013.
[42] H. Bourlard and N. Morgan. Connectionist Speech Recognition: A Hybrid Approach. Kluwer, Norwell, MA, 1993.
[43] J. Bouvrie. Hierarchical learning: Theory with applications in speech and vision. Ph.D. thesis, MIT, 2009.
[44] L. Breiman. Stacked regressions. Machine Learning, 24:49–64, 1996.
[45] J. Bridle, L. Deng, J. Picone, H. Richards, J. Ma, T. Kamm, M. Schuster, S. Pike, and R. Reagan. An investigation of segmental hidden dynamic models of speech coarticulation for automatic speech recognition. Final Report for 1998 Workshop on Language Engineering, CLSP, Johns Hopkins, 1998.
[46] P. Cardinal, P. Dumouchel, and G. Boulianne. Large vocabulary speech recognition on parallel architectures. IEEE Transactions on Audio, Speech, and Language Processing, 21(11):2290–2300, November 2013.
[47] R. Caruana. Multitask learning. Machine Learning, 28:41–75, 1997.
[48] J. Chen and L. Deng. A primal-dual method for training recurrent neural networks constrained by the echo-state property. In Proceedings of International Conference on Learning Representations. April 2014.
[49] X. Chen, A. Eversole, G. Li, D. Yu, and F. Seide. Pipelined back-propagation for context-dependent deep neural networks. In Proceedings of Interspeech. 2012.
[50] R. Chengalvarayan and L. Deng. HMM-based speech recognition using state-dependent, discriminatively derived transforms on Mel-warped DFT features. IEEE Transactions on Speech and Audio Processing, pages 243–256, 1997.
[51] R. Chengalvarayan and L. Deng. Use of generalized dynamic feature parameters for speech recognition. IEEE Transactions on Speech and Audio Processing, pages 232–242, 1997a.
[52] R. Chengalvarayan and L. Deng. Speech trajectory discrimination using the minimum classification error learning. IEEE Transactions on Speech and Audio Processing, 6(6):505–515, 1998.
[53] Y. Cho and L. Saul. Kernel methods for deep learning. In Proceedings of Neural Information Processing Systems (NIPS), pages 342–350. 2009.
[54] D. Ciresan, A. Giusti, L. Gambardella, and J. Schmidhuber. Deep neural networks segment neuronal membranes in electron microscopy images. In Proceedings of Neural Information Processing Systems (NIPS). 2012.
[55] D. Ciresan, U. Meier, L. Gambardella, and J. Schmidhuber. Deep, big, simple neural nets for handwritten digit recognition. Neural Computation, December 2010.
[56] D. Ciresan, U. Meier, J. Masci, and J. Schmidhuber. A committee of neural networks for traffic sign classification. In Proceedings of International Joint Conference on Neural Networks (IJCNN). 2011.
[57] D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In Proceedings of Computer Vision and Pattern Recognition (CVPR). 2012.
[58] D. C. Ciresan, U. Meier, and J. Schmidhuber. Transfer learning for Latin and Chinese characters with deep neural networks. In Proceedings of International Joint Conference on Neural Networks (IJCNN). 2012.
[59] A. Coates, B. Huval, T. Wang, D. Wu, A. Ng, and B. Catanzaro. Deep learning with COTS HPC. In Proceedings of International Conference on Machine Learning (ICML). 2013.
[60] W. Cohen and R. V. de Carvalho. Stacked sequential learning. In Proceedings of International Joint Conference on Artificial Intelligence (IJCAI), pages 671–676. 2005.
[61] R. Collobert. Deep learning for efficient discriminative parsing. In Proceedings of Artificial Intelligence and Statistics (AISTATS). 2011.
[62] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of International Conference on Machine Learning (ICML). 2008.
[63] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537, 2011.
[64] G. Dahl, M. Ranzato, A. Mohamed, and G. Hinton. Phone recognition with the mean-covariance restricted Boltzmann machine. In Proceedings of Neural Information Processing Systems (NIPS), volume 23, pages 469–477. 2010.
[65] G. Dahl, T. Sainath, and G. Hinton. Improving deep neural networks for LVCSR using rectified linear units and dropout. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2013.
[66] G. Dahl, J. Stokes, L. Deng, and D. Yu. Large-scale malware classification using random projections and neural networks. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2013.
[67] G. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent DBN-HMMs in large vocabulary continuous speech recognition. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2011.
[68] G. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent, pre-trained deep neural networks for large vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30–42, January 2012.
[69] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proceedings of Neural Information Processing Systems (NIPS). 2012.
[70] K. Demuynck and F. Triefenbach. Porting concepts from DNNs back to GMMs. In Proceedings of the Automatic Speech Recognition and Understanding Workshop (ASRU). 2013.
[71] L. Deng. A generalized hidden Markov model with state-conditioned trend functions of time for the speech signal. Signal Processing, 27(1):65–78, 1992.
[72] L. Deng. A stochastic model of speech incorporating hierarchical nonstationarity. IEEE Transactions on Speech and Audio Processing, 1(4):471–475, 1993.
[73] L. Deng. A dynamic, feature-based approach to the interface between phonology and phonetics for speech modeling and recognition. Speech Communication, 24(4):299–323, 1998.
[74] L. Deng. Computational models for speech production. In Computational Models of Speech Pattern Processing, pages 199–213. Springer Verlag, 1999.
[75] L. Deng. Switching dynamic system models for speech articulation and acoustics. In Mathematical Foundations of Speech and Language Processing, pages 115–134. Springer-Verlag, New York, 2003.
[76] L. Deng. Dynamic Speech Models - Theory, Algorithm, and Application. Morgan & Claypool, December 2006.
[77] L. Deng. An overview of deep-structured learning for information processing. In Proceedings of Asian-Pacific Signal & Information Processing Annual Summit and Conference (APSIPA-ASC). October 2011.
[78] L. Deng. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6), November 2012.
[79] L. Deng. Design and learning of output representations for speech recognition. In Neural Information Processing Systems (NIPS) Workshop on Learning Output Representations. December 2013.
[80] L. Deng. A tutorial survey of architectures, algorithms, and applications for deep learning. In Asian-Pacific Signal & Information Processing Association Transactions on Signal and Information Processing. 2013.
[81] L. Deng, O. Abdel-Hamid, and D. Yu. A deep convolutional neural network using heterogeneous pooling for trading acoustic invariance with phonetic confusion. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2013.
[82] L. Deng, A. Acero, L. Jiang, J. Droppo, and X. Huang. High performance robust speech recognition using stereo training data. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2001.
[83] L. Deng and M. Aksmanovic. Speaker-independent phonetic classification using hidden Markov models with state-conditioned mixtures of trend functions. IEEE Transactions on Speech and Audio Processing, 5:319–324, 1997.
[84] L. Deng, M. Aksmanovic, D. Sun, and J. Wu. Speech recognition using hidden Markov models with polynomial regression functions as nonstationary states. IEEE Transactions on Speech and Audio Processing, 2(4):507–520, 1994.
[85] L. Deng and J. Chen. Sequence classification using the high-level features extracted from deep neural networks. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2014.
[86] L. Deng and K. Erler. Structural design of a hidden Markov model based speech recognizer using multi-valued phonetic features: Comparison with segmental speech units. Journal of the Acoustical Society of America, 92(6):3058–3067, 1992.
[87] L. Deng, K. Hassanein, and M. Elmasry. Analysis of correlation structure for a neural predictive model with application to speech recognition. Neural Networks, 7(2):331–339, 1994.
[88] L. Deng, X. He, and J. Gao. Deep stacking networks for information retrieval. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2013c.
[89] L. Deng, G. Hinton, and B. Kingsbury. New types of deep neural network learning for speech recognition and related applications: An overview. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2013b.
[90] L. Deng and X. D. Huang. Challenges in adopting speech recognition. Communications of the Association for Computing Machinery (ACM), 47(1):11–13, January 2004.
[91] L. Deng, B. Hutchinson, and D. Yu. Parallel training of deep stacking networks. In Proceedings of Interspeech. 2012b.
[92] L. Deng, M. Lennig, V. Gupta, F. Seitz, P. Mermelstein, and P. Kenny. Phonemic hidden Markov models with continuous mixture output densities for large vocabulary word recognition. IEEE Transactions on Signal Processing, 39(7):1677–1681, 1991.
[93] L. Deng, M. Lennig, F. Seitz, and P. Mermelstein. Large vocabulary word recognition using context-dependent allophonic hidden Markov models. Computer Speech and Language, 4(4):345–357, 1990.
[94] L. Deng, J. Li, J.-T. Huang, K. Yao, D. Yu, F. Seide, M. Seltzer, G. Zweig, X. He, J. Williams, Y. Gong, and A. Acero. Recent advances in deep learning for speech research at Microsoft. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2013a.
[95] L. Deng and X. Li. Machine learning paradigms in speech recognition: An overview. IEEE Transactions on Audio, Speech, and Language Processing, 21:1060–1089, May 2013.
[96] L. Deng and J. Ma. Spontaneous speech recognition using a statistical coarticulatory model for the vocal tract resonance dynamics. Journal of the Acoustical Society of America, 108:3036–3048, 2000.
[97] L. Deng and D. O’Shaughnessy. Speech Processing - A Dynamic and Optimization-Oriented Approach. Marcel Dekker, 2003.
[98] L. Deng, G. Ramsay, and D. Sun. Production models as a structural basis for automatic speech recognition. Speech Communication, 22(2–3):93–111, August 1997.
[99] L. Deng and H. Sameti. Transitional speech units and their representation by regressive Markov states: Applications to speech recognition. IEEE Transactions on Speech and Audio Processing, 4(4):301–306, July 1996.
[100] L. Deng, M. Seltzer, D. Yu, A. Acero, A. Mohamed, and G. Hinton. Binary coding of speech spectrograms using a deep autoencoder. In Proceedings of Interspeech. 2010.
[101] L. Deng and D. Sun. A statistical approach to automatic speech recognition using the atomic speech units constructed from overlapping articulatory features. Journal of the Acoustical Society of America, 95(5):2702–2719, 1994.
[102] L. Deng, G. Tur, X. He, and D. Hakkani-Tur. Use of kernel deep convex networks and end-to-end learning for spoken language understanding. In Proceedings of IEEE Workshop on Spoken Language Technologies. December 2012.
[103] L. Deng, K. Wang, A. Acero, H. W. Hon, J. Droppo, C. Boulis, Y. Wang, D. Jacoby, M. Mahajan, C. Chelba, and X. Huang. Distributed speech processing in MiPad’s multimodal user interface. IEEE Transactions on Speech and Audio Processing, 10(8):605–619, 2002.
[104] L. Deng, J. Wu, J. Droppo, and A. Acero. Dynamic compensation of HMM variances using the feature enhancement uncertainty computed from a parametric model of speech distortion. IEEE Transactions on Speech and Audio Processing, 13(3):412–421, 2005.
[105] L. Deng and D. Yu. Use of differential cepstra as acoustic features in hidden trajectory modeling for phonetic recognition. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2007.
[106] L. Deng and D. Yu. Deep convex network: A scalable architecture for speech pattern classification. In Proceedings of Interspeech. 2011.
[107] L. Deng, D. Yu, and A. Acero. A bidirectional target filtering model of speech coarticulation: Two-stage implementation for phonetic recognition. IEEE Transactions on Audio, Speech, and Language Processing, 14(1):256–265, January 2006.
[108] L. Deng, D. Yu, and A. Acero. Structured speech modeling. IEEE Transactions on Audio, Speech, and Language Processing, 14(5):1492–1504, September 2006.
[109] L. Deng, D. Yu, and G. Hinton. Deep learning for speech recognition and related applications. Neural Information Processing Systems (NIPS) Workshop, 2009.
[110] L. Deng, D. Yu, and J. Platt. Scalable stacking and learning for building deep architectures. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2012a.
[111] T. Deselaers, S. Hasan, O. Bender, and H. Ney. A deep learning approach to machine transliteration. In Proceedings of 4th Workshop on Statistical Machine Translation, pages 233–241. Athens, Greece, March 2009.
[112] A. Diez. Automatic language recognition using deep neural networks. Thesis, Universidad Autonoma de Madrid, Spain, September 2013.
[113] P. Dognin and V. Goel. Combining stochastic average gradient and Hessian-free optimization for sequence training of deep neural networks. In Proceedings of the Automatic Speech Recognition and Understanding Workshop (ASRU). 2013.
[114] D. Erhan, Y. Bengio, A. Courville, P. Manzagol, P. Vincent, and S. Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, pages 201–208, 2010.
[115] R. Fernandez, A. Rendel, B. Ramabhadran, and R. Hoory. F0 contour prediction with a deep belief network-Gaussian process hybrid model. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP), pages 6885–6889. 2013.
[116] S. Fine, Y. Singer, and N. Tishby. The hierarchical hidden Markov model: Analysis and applications. Machine Learning, 32:41–62, 1998.
[117] A. Frome, G. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov. DeViSE: A deep visual-semantic embedding model. In Proceedings of Neural Information Processing Systems (NIPS). 2013.
[118] Q. Fu, X. He, and L. Deng. Phone-discriminating minimum classification error (p-mce) training for phonetic recognition. In Proceedings of Interspeech. 2007.
[119] M. Gales. Model-based approaches to handling uncertainty. In Robust Speech Recognition of Uncertain or Missing Data: Theory and Application, pages 101–125. Springer, 2011.
[120] J. Gao, X. He, and J.-Y. Nie. Clickthrough-based translation models for web search: From word models to phrase models. In Proceedings of Conference on Information and Knowledge Management (CIKM). 2010.
[121] J. Gao, X. He, W. Yih, and L. Deng. Learning semantic representations for the phrase translation model. In Proceedings of Neural Information Processing Systems (NIPS) Workshop on Deep Learning. December 2013.
[122] J. Gao, X. He, W. Yih, and L. Deng. Learning semantic representations for the phrase translation model. MSR-TR-2013-88, September 2013.
[123] J. Gao, X. He, W. Yih, and L. Deng. Learning continuous phrase representations for translation modeling. In Proceedings of Association for Computational Linguistics (ACL). 2014.
[124] J. Gao, K. Toutanova, and W.-T. Yih. Clickthrough-based latent semantic models for web search. In Proceedings of Special Interest Group on Information Retrieval (SIGIR). 2011.
[125] R. Gens and P. Domingos. Discriminative learning of sum-product networks. Neural Information Processing Systems (NIPS), 2012.
[126] D. George. How the brain might work: A hierarchical and temporal model for learning and recognition. Ph.D. thesis, Stanford University, 2008.
[127] M. Gibson and T. Hain. Error approximation and minimum phone error acoustic model estimation. IEEE Transactions on Audio, Speech, and Language Processing, 18(6):1269–1279, August 2010.
[128] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv:1311.2524v1, 2013.
[129] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feed-forward neural networks. In Proceedings of Artificial Intelligence and Statistics (AISTATS). 2010.
[130] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In Proceedings of Artificial Intelligence and Statistics (AISTATS). April 2011.
[131] I. Goodfellow, M. Mirza, A. Courville, and Y. Bengio. Multi-prediction deep Boltzmann machines. In Proceedings of Neural Information Processing Systems (NIPS). 2013.
[132] E. Grais, M. Sen, and H. Erdogan. Deep neural networks for single channel source separation. arXiv:1311.2746v1, 2013.
[133] A. Graves. Sequence transduction with recurrent neural networks. Representation Learning Workshop, International Conference on Machine Learning (ICML), 2012.
[134] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: Labeling unsegmented sequence data with recurrent neural networks. In Proceedings of International Conference on Machine Learning (ICML). 2006.
[135] A. Graves, N. Jaitly, and A. Mohamed. Hybrid speech recognition with deep bidirectional LSTM. In Proceedings of the Automatic Speech Recognition and Understanding Workshop (ASRU). 2013.
[136] A. Graves, A. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2013.
[137] F. Grezl and P. Fousek. Optimizing bottle-neck features for LVCSR. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2008.
[138] C. Gulcehre, K. Cho, R. Pascanu, and Y. Bengio. Learned-norm pooling for deep feedforward and recurrent neural networks. http://arxiv.org/abs/1311.1780, 2014.
[139] M. Gutmann and A. Hyvarinen. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research, 13:307–361, 2012.
[140] T. Hain, L. Burget, J. Dines, P. Garner, F. Grezl, A. Hannani, M. Huijbregts, M. Karafiat, M. Lincoln, and V. Wan. Transcribing meetings with the AMIDA systems. IEEE Transactions on Audio, Speech, and Language Processing, 20:486–498, 2012.
[141] P. Hamel and D. Eck. Learning features from music audio with deep belief networks. In Proceedings of International Symposium on Music Information Retrieval (ISMIR). 2010.
[142] J. Hawkins, S. Ahmad, and D. Dubinsky. Hierarchical temporal memory including HTM cortical learning algorithms. Numenta Technical Report, December 10 2010.
[143] J. Hawkins and S. Blakeslee. On Intelligence: How a New Understanding of the Brain will lead to the Creation of Truly Intelligent Machines. Times Books, New York, 2004.
[144] X. He and L. Deng. Speech recognition, machine translation, and speech translation - a unifying discriminative framework. IEEE Signal Processing Magazine, 28, November 2011.
[145] X. He and L. Deng. Optimization in speech-centric information processing: Criteria and techniques. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2012.
[146] X. He and L. Deng. Speech-centric information processing: An optimization-oriented approach. In Proceedings of the IEEE. 2013.
[147] X. He, L. Deng, and W. Chou. Discriminative learning in sequential pattern recognition - a unifying review for optimization-oriented speech recognition. IEEE Signal Processing Magazine, 25:14–36, 2008.
[148] G. Heigold, H. Ney, P. Lehnen, T. Gass, and R. Schluter. Equivalence of generative and log-linear models. IEEE Transactions on Audio, Speech, and Language Processing, 19(5):1138–1148, February 2011.
[149] G. Heigold, H. Ney, and R. Schluter. Investigations on an EM-style optimization algorithm for discriminative training of HMMs. IEEE Transactions on Audio, Speech, and Language Processing, 21(12):2616–2626, December 2013.
[150] G. Heigold, V. Vanhoucke, A. Senior, P. Nguyen, M. Ranzato, M. Devin, and J. Dean. Multilingual acoustic models using distributed deep neural networks. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2013.
[151] I. Heintz, E. Fosler-Lussier, and C. Brew. Discriminative input stream combination for conditional random field phone recognition. IEEE Transactions on Audio, Speech, and Language Processing, 17(8):1533–1546, November 2009.
[152] M. Henderson, B. Thomson, and S. Young. Deep neural network approach for the dialog state tracking challenge. In Proceedings of Special Interest Group on Discourse and Dialogue (SIGDIAL). 2013.
[153] M. Hermans and B. Schrauwen. Training and analysing deep recurrent neural networks. In Proceedings of Neural Information Processing Systems (NIPS). 2013.
[154] H. Hermansky, D. Ellis, and S. Sharma. Tandem connectionist feature extraction for conventional HMM systems. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2000.
[155] Y. Hifny and S. Renals. Speech recognition using augmented conditional random fields. IEEE Transactions on Audio, Speech, and Language Processing, 17(2):354–365, February 2009.
[156] G. Hinton. Mapping part-whole hierarchies into connectionist networks. Artificial Intelligence, 46:47–75, 1990.
[157] G. Hinton. Preface to the special issue on connectionist symbol processing. Artificial Intelligence, 46:1–4, 1990.
[158] G. Hinton. The ups and downs of Hebb synapses. Canadian Psychology, 44:10–13, 2003.
[159] G. Hinton. A practical guide to training restricted Boltzmann machines. UTML Tech Report 2010-003, Univ. Toronto, August 2010.
[160] G. Hinton. A better way to learn features. Communications of the Association for Computing Machinery (ACM), 54(10), October 2011.
[161] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29(6):82–97, November 2012.
[162] G. Hinton, A. Krizhevsky, and S. Wang. Transforming autoencoders. In Proceedings of International Conference on Artificial Neural Networks. 2011.
[163] G. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554, 2006.
[164] G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, July 2006.
[165] G. Hinton and R. Salakhutdinov. Discovering binary codes for documents by learning deep generative models. Topics in Cognitive Science, pages 1–18, 2010.
[166] G. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580v1, 2012.
[167] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Technische Universität München, 1991.
References 363 364 References

[168] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9:1735–1780, 1997.
[169] E. Huang, R. Socher, C. Manning, and A. Ng. Improving word representations via global context and multiple word prototypes. In Proceedings of Association for Computational Linguistics (ACL). 2012.
[170] J. Huang, J. Li, L. Deng, and D. Yu. Cross-language knowledge transfer using multilingual deep neural networks with shared hidden layers. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2013.
[171] P. Huang, L. Deng, M. Hasegawa-Johnson, and X. He. Random features for kernel deep convex network. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2013.
[172] P. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck. Learning deep structured semantic models for web search using clickthrough data. Association for Computing Machinery (ACM) International Conference Information and Knowledge Management (CIKM), 2013.
[173] P. Huang, K. Kumar, C. Liu, Y. Gong, and L. Deng. Predicting speech recognition confidence using deep learning with word identity and score features. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2013.
[174] S. Huang and S. Renals. Hierarchical bayesian language models for conversational speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 18(8):1941–1954, November 2010.
[175] X. Huang, A. Acero, C. Chelba, L. Deng, J. Droppo, D. Duchene, J. Goodman, and H. Hon. Mipad: A multimodal interaction prototype. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2001.
[176] Y. Huang, D. Yu, Y. Gong, and C. Liu. Semi-supervised GMM and DNN acoustic model training with multi-system combination and confidence re-calibration. In Proceedings of Interspeech, pages 2360–2364. 2013.
[177] E. Humphrey and J. Bello. Rethinking automatic chord recognition with convolutional neural networks. In Proceedings of International Conference on Machine Learning and Application (ICMLA). 2012a.
[178] E. Humphrey, J. Bello, and Y. LeCun. Moving beyond feature design: Deep architectures and automatic feature learning in music informatics. In Proceedings of International Symposium on Music Information Retrieval (ISMIR). 2012.
[179] E. Humphrey, J. Bello, and Y. LeCun. Feature learning and deep architectures: New directions for music informatics. Journal of Intelligent Information Systems, 2013.
[180] B. Hutchinson, L. Deng, and D. Yu. A deep architecture with bilinear modeling of hidden representations: Applications to phonetic recognition. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2012.
[181] B. Hutchinson, L. Deng, and D. Yu. Tensor deep stacking networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35:1944–1957, 2013.
[182] D. Imseng, P. Motlicek, P. Garner, and H. Bourlard. Impact of deep MLP architecture on different modeling techniques for under-resourced speech recognition. In Proceedings of the Automatic Speech Recognition and Understanding Workshop (ASRU). 2013.
[183] N. Jaitly and G. Hinton. Learning a better representation of speech sound waves using restricted boltzmann machines. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2011.
[184] N. Jaitly, P. Nguyen, and V. Vanhoucke. Application of pre-trained deep neural networks to large vocabulary speech recognition. In Proceedings of Interspeech. 2012.
[185] K. Jarrett, K. Kavukcuoglu, and Y. LeCun. What is the best multi-stage architecture for object recognition? In Proceedings of International Conference on Computer Vision, pages 2146–2153. 2009.
[186] H. Jiang and X. Li. Parameter estimation of statistical models using convex optimization: An advanced method of discriminative training for speech and language processing. IEEE Signal Processing Magazine, 27(3):115–127, 2010.
[187] B. Juang, S. Levinson, and M. Sondhi. Maximum likelihood estimation for multivariate mixture observations of Markov chains. IEEE Transactions on Information Theory, 32:307–309, 1986.
[188] B.-H. Juang, W. Chou, and C.-H. Lee. Minimum classification error rate methods for speech recognition. IEEE Transactions On Speech and Audio Processing, 5:257–265, 1997.
[189] S. Kahou et al. Combining modality specific deep neural networks for emotion recognition in video. In Proceedings of International Conference on Multimodal Interaction (ICMI). 2013.

[190] S. Kang, X. Qian, and H. Meng. Multi-distribution deep belief network for speech synthesis. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP), pages 8012–8016. 2013.
[191] Y. Kashiwagi, D. Saito, N. Minematsu, and K. Hirose. Discriminative piecewise linear transformation based on deep learning for noise robust automatic speech recognition. In Proceedings of the Automatic Speech Recognition and Understanding Workshop (ASRU). 2013.
[192] K. Kavukcuoglu, P. Sermanet, Y. Boureau, K. Gregor, M. Mathieu, and Y. LeCun. Learning convolutional feature hierarchies for visual recognition. In Proceedings of Neural Information Processing Systems (NIPS). 2010.
[193] H. Ketabdar and H. Bourlard. Enhanced phone posteriors for improving speech recognition systems. IEEE Transactions on Audio, Speech, and Language Processing, 18(6):1094–1106, August 2010.
[194] B. Kingsbury. Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2009.
[195] B. Kingsbury, T. Sainath, and H. Soltau. Scalable minimum bayes risk training of deep neural network acoustic models using distributed hessian-free optimization. In Proceedings of Interspeech. 2012.
[196] R. Kiros, R. Zemel, and R. Salakhutdinov. Multimodal neural language models. In Proceedings of Neural Information Processing Systems (NIPS) Deep Learning Workshop. 2013.
[197] T. Ko and B. Mak. Eigentriphones for context-dependent acoustic modeling. IEEE Transactions on Audio, Speech, and Language Processing, 21(6):1285–1294, 2013.
[198] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In Proceedings of Neural Information Processing Systems (NIPS). 2012.
[199] Y. Kubo, T. Hori, and A. Nakamura. Integrating deep neural networks into structural classification approach based on weighted finite-state transducers. In Proceedings of Interspeech. 2012.
[200] R. Kurzweil. How to Create a Mind. Viking Books, December 2012.
[201] P. Lal and S. King. Cross-lingual automatic speech recognition using tandem features. IEEE Transactions on Audio, Speech, and Language Processing, 21(12):2506–2515, December 2013.
[202] K. Lang, A. Waibel, and G. Hinton. A time-delay neural network architecture for isolated word recognition. Neural Networks, 3(1):23–43, 1990.
[203] H. Larochelle and Y. Bengio. Classification using discriminative restricted boltzmann machines. In Proceedings of International Conference on Machine Learning (ICML). 2008.
[204] D. Le and P. Mower. Emotion recognition from spontaneous speech using hidden markov models with deep belief networks. In Proceedings of the Automatic Speech Recognition and Understanding Workshop (ASRU). 2013.
[205] H. Le, A. Allauzen, G. Wisniewski, and F. Yvon. Training continuous space language models: Some practical issues. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP), pages 778–788. 2010.
[206] H. Le, I. Oparin, A. Allauzen, J. Gauvain, and F. Yvon. Structured output layer neural network language model. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2011.
[207] H. Le, I. Oparin, A. Allauzen, J.-L. Gauvain, and F. Yvon. Structured output layer neural network language models for speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 21(1):197–206, January 2013.
[208] Q. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Ng. On optimization methods for deep learning. In Proceedings of International Conference on Machine Learning (ICML). 2011.
[209] Q. Le, M. Ranzato, R. Monga, M. Devin, G. Corrado, K. Chen, J. Dean, and A. Ng. Building high-level features using large scale unsupervised learning. In Proceedings of International Conference on Machine Learning (ICML). 2012.
[210] Y. LeCun. Learning invariant feature hierarchies. In Proceedings of European Conference on Computer Vision (ECCV). 2012.
[211] Y. LeCun and Y. Bengio. Convolutional networks for images, speech, and time series. In M. Arbib, editor, The Handbook of Brain Theory and Neural Networks, pages 255–258. MIT Press, Cambridge, Massachusetts, 1995.
[212] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86:2278–2324, 1998.

[213] Y. LeCun, S. Chopra, M. Ranzato, and F. Huang. Energy-based models in document recognition and computer vision. In Proceedings of International Conference on Document Analysis and Recognition (ICDAR). 2007.
[214] C.-H. Lee. From knowledge-ignorant to knowledge-rich modeling: A new speech research paradigm for next-generation automatic speech recognition. In Proceedings of International Conference on Spoken Language Processing (ICSLP), pages 109–111. 2004.
[215] H. Lee, R. Grosse, R. Ranganath, and A. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of International Conference on Machine Learning (ICML). 2009.
[216] H. Lee, R. Grosse, R. Ranganath, and A. Ng. Unsupervised learning of hierarchical representations with convolutional deep belief networks. Communications of the Association for Computing Machinery (ACM), 54(10):95–103, October 2011.
[217] H. Lee, Y. Largman, P. Pham, and A. Ng. Unsupervised feature learning for audio classification using convolutional deep belief networks. In Proceedings of Neural Information Processing Systems (NIPS). 2010.
[218] P. Lena, K. Nagata, and P. Baldi. Deep spatiotemporal architectures and learning for protein structure prediction. In Proceedings of Neural Information Processing Systems (NIPS). 2012.
[219] S. Levine. Exploring deep and recurrent architectures for optimal control. arXiv:1311.1761v1.
[220] J. Li, L. Deng, Y. Gong, and R. Haeb-Umbach. An overview of noise-robust automatic speech recognition. IEEE/Association for Computing Machinery (ACM) Transactions on Audio, Speech, and Language Processing, pages 1–33, 2014.
[221] J. Li, D. Yu, J. Huang, and Y. Gong. Improving wideband speech recognition using mixed-bandwidth training data in CD-DNN-HMM. In Proceedings of IEEE Spoken Language Technology (SLT). 2012.
[222] L. Li, Y. Zhao, D. Jiang, and Y. Zhang etc. Hybrid deep neural network–hidden markov model (DNN-HMM) based speech emotion recognition. In Proceedings Conference on Affective Computing and Intelligent Interaction (ACII), pages 312–317. September 2013.
[223] H. Liao. Speaker adaptation of context dependent deep neural networks. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2013.
[224] H. Liao, E. McDermott, and A. Senior. Large scale deep neural network acoustic modeling with semi-supervised training data for youtube video transcription. In Proceedings of the Automatic Speech Recognition and Understanding Workshop (ASRU). 2013.
[225] H. Lin, L. Deng, D. Yu, Y. Gong, A. Acero, and C.-H. Lee. A study on multilingual acoustic modeling for large vocabulary ASR. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2009.
[226] Y. Lin, F. Lv, S. Zhu, M. Yang, T. Cour, K. Yu, L. Cao, and T. Huang. Large-scale image classification: Fast feature extraction and SVM training. In Proceedings of Computer Vision and Pattern Recognition (CVPR). 2011.
[227] Z. Ling, L. Deng, and D. Yu. Modeling spectral envelopes using restricted boltzmann machines and deep belief networks for statistical parametric speech synthesis. IEEE Transactions on Audio Speech Language Processing, 21(10):2129–2139, 2013.
[228] Z. Ling, L. Deng, and D. Yu. Modeling spectral envelopes using restricted boltzmann machines for statistical parametric speech synthesis. In International Conference on Acoustics Speech and Signal Processing (ICASSP), pages 7825–7829. 2013.
[229] Z. Ling, K. Richmond, and J. Yamagishi. Articulatory control of HMM-based parametric speech synthesis using feature-space-switched multiple regression. IEEE Transactions on Audio, Speech, and Language Processing, 21, January 2013.
[230] L. Lu, K. Chin, A. Ghoshal, and S. Renals. Joint uncertainty decoding for noise robust subspace gaussian mixture models. IEEE Transactions on Audio, Speech, and Language Processing, 21(9):1791–1804, 2013.
[231] J. Ma and L. Deng. A path-stack algorithm for optimizing dynamic regimes in a statistical hidden dynamical model of speech. Computer, Speech and Language, 2000.
[232] J. Ma and L. Deng. Efficient decoding strategies for conversational speech recognition using a constrained nonlinear state-space model. IEEE Transactions on Speech and Audio Processing, 11(6):590–602, 2003.
[233] J. Ma and L. Deng. Target-directed mixture dynamic models for spontaneous speech recognition. IEEE Transactions on Speech and Audio Processing, 12(1):47–58, 2004.

[234] A. Maas, A. Hannun, and A. Ng. Rectifier nonlinearities improve neural network acoustic models. International Conference on Machine Learning (ICML) Workshop on Deep Learning for Audio, Speech, and Language Processing, 2013.
[235] A. Maas, Q. Le, T. O’Neil, O. Vinyals, P. Nguyen, and P. Ng. Recurrent neural networks for noise reduction in robust ASR. In Proceedings of Interspeech. 2012.
[236] C. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2009.
[237] J. Markoff. Scientists see promise in deep-learning programs. New York Times, November 24 2012.
[238] J. Martens. Deep learning with hessian-free optimization. In Proceedings of International Conference on Machine Learning (ICML). 2010.
[239] J. Martens and I. Sutskever. Learning recurrent neural networks with hessian-free optimization. In Proceedings of International Conference on Machine Learning (ICML). 2011.
[240] D. McAllester. A PAC-bayesian tutorial with a dropout bound. arXiv:1307.2118, July 2013.
[241] I. McGraw, I. Badr, and J. R. Glass. Learning lexicons from speech using a pronunciation mixture model. IEEE Transactions on Audio, Speech, and Language Processing, 21(2):357–366, February 2013.
[242] G. Mesnil, X. He, L. Deng, and Y. Bengio. Investigation of recurrent-neural-network architectures and learning methods for spoken language understanding. In Proceedings of Interspeech. 2013.
[243] Y. Miao and F. Metze. Improving low-resource CD-DNN-HMM using dropout and multilingual DNN training. In Proceedings of Interspeech. 2013.
[244] Y. Miao, S. Rawat, and F. Metze. Deep maxout networks for low resource speech recognition. In Proceedings of the Automatic Speech Recognition and Understanding Workshop (ASRU). 2013.
[245] T. Mikolov. Statistical language models based on neural networks. Ph.D. thesis, Brno University of Technology, 2012.
[246] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. In Proceedings of International Conference on Learning Representations (ICLR). 2013.
[247] T. Mikolov, A. Deoras, D. Povey, L. Burget, and J. Cernocky. Strategies for training large scale neural network language models. In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). 2011.
[248] T. Mikolov, M. Karafiat, L. Burget, J. Cernocky, and S. Khudanpur. Recurrent neural network based language model. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP), pages 1045–1048. 2010.
[249] T. Mikolov, Q. Le, and I. Sutskever. Exploiting similarities among languages for machine translation. arXiv:1309.4168v1, 2013.
[250] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Proceedings of Neural Information Processing Systems (NIPS). 2013.
[251] Y. Minami, E. McDermott, A. Nakamura, and S. Katagiri. A recognition method with parametric trajectory synthesized using direct relations between static and dynamic feature vector time series. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP), pages 957–960. 2002.
[252] A. Mnih and G. Hinton. Three new graphical models for statistical language modeling. In Proceedings of International Conference on Machine Learning (ICML), pages 641–648. 2007.
[253] A. Mnih and G. Hinton. A scalable hierarchical distributed language model. In Proceedings of Neural Information Processing Systems (NIPS), pages 1081–1088. 2008.
[254] A. Mnih and K. Kavukcuoglu. Learning word embeddings efficiently with noise-contrastive estimation. In Proceedings of Neural Information Processing Systems (NIPS). 2013.
[255] A. Mnih and W.-T. Teh. A fast and simple algorithm for training neural probabilistic language models. In Proceedings of International Conference on Machine Learning (ICML), pages 1751–1758. 2012.
[256] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. Neural Information Processing Systems (NIPS) Deep Learning Workshop, 2013. also arXiv:1312.5602v1.
[257] A. Mohamed, G. Dahl, and G. Hinton. Deep belief networks for phone recognition. In Proceedings of Neural Information Processing Systems (NIPS) Workshop Deep Learning for Speech Recognition and Related Applications. 2009.

[258] A. Mohamed, G. Dahl, and G. Hinton. Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, & Language Processing, 20(1), January 2012.
[259] A. Mohamed, G. Hinton, and G. Penn. Understanding how deep belief networks perform acoustic modelling. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2012.
[260] A. Mohamed, D. Yu, and L. Deng. Investigation of full-sequence training of deep belief networks for speech recognition. In Proceedings of Interspeech. 2010.
[261] N. Morgan. Deep and wide: Multiple layers in automatic speech recognition. IEEE Transactions on Audio, Speech, & Language Processing, 20(1), January 2012.
[262] N. Morgan, Q. Zhu, A. Stolcke, K. Sonmez, S. Sivadas, T. Shinozaki, M. Ostendorf, P. Jain, H. Hermansky, D. Ellis, G. Doddington, B. Chen, O. Cretin, H. Bourlard, and M. Athineos. Pushing the envelope - aside [speech recognition]. IEEE Signal Processing Magazine, 22(5):81–88, September 2005.
[263] F. Morin and Y. Bengio. Hierarchical probabilistic neural network language models. In Proceedings of Artificial Intelligence and Statistics (AISTATS). 2005.
[264] K. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.
[265] V. Nair and G. Hinton. 3-d object recognition with deep belief nets. In Proceedings of Neural Information Processing Systems (NIPS). 2009.
[266] T. Nakashika, R. Takashima, T. Takiguchi, and Y. Ariki. Voice conversion in high-order eigen space using deep belief nets. In Proceedings of Interspeech. 2013.
[267] H. Ney. Speech translation: Coupling of recognition and translation. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 1999.
[268] J. Ngiam, Z. Chen, P. Koh, and A. Ng. Learning deep energy models. In Proceedings of International Conference on Machine Learning (ICML). 2011.
[269] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Ng. Multimodal deep learning. In Proceedings of International Conference on Machine Learning (ICML). 2011.
[270] M. Norouzi, T. Mikolov, S. Bengio, J. Shlens, A. Frome, G. Corrado, and J. Dean. Zero-shot learning by convex combination of semantic embeddings. arXiv:1312.5650v2, 2013.
[271] N. Oliver, A. Garg, and E. Horvitz. Layered representations for learning and inferring office activity from multiple sensory channels. Computer Vision and Image Understanding, 96:163–180, 2004.
[272] B. Olshausen. Can ‘deep learning’ offer deep insights about visual representation? Neural Information Processing Systems (NIPS) Workshop on Deep Learning and Unsupervised Feature Learning, 2012.
[273] M. Ostendorf. Moving beyond the ‘beads-on-a-string’ model of speech. In Proceedings of the Automatic Speech Recognition and Understanding Workshop (ASRU). 1999.
[274] M. Ostendorf, V. Digalakis, and O. Kimball. From HMMs to segment models: A unified view of stochastic modeling for speech recognition. IEEE Transactions on Speech and Audio Processing, 4(5), September 1996.
[275] L. Oudre, C. Fevotte, and Y. Grenier. Probabilistic template-based chord recognition. IEEE Transactions on Audio, Speech, and Language Processing, 19(8):2249–2259, November 2011.
[276] H. Palangi, L. Deng, and R. Ward. Learning input and recurrent weight matrices in echo state networks. Neural Information Processing Systems (NIPS) Deep Learning Workshop, December 2013.
[277] H. Palangi, R. Ward, and L. Deng. Using deep stacking network to improve structured compressive sensing with multiple measurement vectors. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2013.
[278] G. Papandreou, A. Katsamanis, V. Pitsikalis, and P. Maragos. Adaptive multimodal fusion by uncertainty compensation with application to audiovisual speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 17:423–435, 2009.
[279] R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio. How to construct deep recurrent neural networks. In Proceedings of International Conference on Learning Representations (ICLR). 2014.
[280] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In Proceedings of International Conference on Machine Learning (ICML). 2013.
[281] J. Peng, L. Bo, and J. Xu. Conditional neural fields. In Proceedings of Neural Information Processing Systems (NIPS). 2009.

[282] P. Picone, S. Pike, R. Regan, T. Kamm, J. Bridle, L. Deng, Z. Ma, H. Richards, and M. Schuster. Initial evaluation of hidden dynamic models on conversational speech. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 1999.
[283] J. Pinto, S. Garimella, M. Magimai-Doss, H. Hermansky, and H. Bourlard. Analysis of MLP-based hierarchical phone posterior probability estimators. IEEE Transactions on Audio, Speech, and Language Processing, 19(2), February 2011.
[284] C. Plahl, T. Sainath, B. Ramabhadran, and D. Nahamoo. Improved pre-training of deep belief networks using sparse encoding symmetric machines. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2012.
[285] C. Plahl, R. Schlüter, and H. Ney. Hierarchical bottleneck features for LVCSR. In Proceedings of Interspeech. 2010.
[286] T. Plate. Holographic reduced representations. IEEE Transactions on Neural Networks, 6(3):623–641, May 1995.
[287] T. Poggio. How the brain might work: The role of information and learning in understanding and replicating intelligence. In G. Jacovitt, A. Pettorossi, R. Consolo, and V. Senni, editors, Information: Science and Technology for the New Century, pages 45–61. Lateran University Press, 2007.
[288] J. Pollack. Recursive distributed representations. Artificial Intelligence, 46:77–105, 1990.
[289] H. Poon and P. Domingos. Sum-product networks: A new deep architecture. In Proceedings of Uncertainty in Artificial Intelligence. 2011.
[290] D. Povey and P. Woodland. Minimum phone error and I-smoothing for improved discriminative training. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2002.
[291] R. Prabhavalkar and E. Fosler-Lussier. Backpropagation training for multilayer conditional random field based phone recognition. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2010.
[292] A. Prince and P. Smolensky. Optimality: From neural networks to universal grammar. Science, 275:1604–1610, 1997.
[293] L. Rabiner. A tutorial on hidden markov models and selected applications in speech recognition. In Proceedings of the IEEE, pages 257–286. 1989.
[294] M. Ranzato, Y. Boureau, and Y. LeCun. Sparse feature learning for deep belief networks. In Proceedings of Neural Information Processing Systems (NIPS). 2007.
[295] M. Ranzato, S. Chopra, Y. LeCun, and F.-J. Huang. Energy-based models in document recognition and computer vision. In Proceedings of International Conference on Document Analysis and Recognition (ICDAR). 2007.
[296] M. Ranzato and G. Hinton. Modeling pixel means and covariances using factorized third-order boltzmann machines. In Proceedings of Computer Vision and Pattern Recognition (CVPR). 2010.
[297] M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Efficient learning of sparse representations with an energy-based model. In Proceedings of Neural Information Processing Systems (NIPS). 2006.
[298] M. Ranzato, J. Susskind, V. Mnih, and G. Hinton. On deep generative models with applications to recognition. In Proceedings of Computer Vision and Pattern Recognition (CVPR). 2011.
[299] C. Rathinavalu and L. Deng. Construction of state-dependent dynamic parameters by maximum likelihood: Applications to speech recognition. Signal Processing, 55(2):149–165, 1997.
[300] S. Rennie, K. Fouset, and P. Dognin. Factorial hidden restricted boltzmann machines for noise robust speech recognition. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2012.
[301] S. Rennie, H. Hershey, and P. Olsen. Single-channel multi-talker speech recognition: graphical modeling approaches. IEEE Signal Processing Magazine, 33:66–80, 2010.
[302] M. Riedmiller and H. Braun. A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In Proceedings of the IEEE International Conference on Neural Networks. 1993.
[303] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio. Contractive autoencoders: Explicit invariance during feature extraction. In Proceedings of International Conference on Machine Learning (ICML), pages 833–840. 2011.
[304] A. Robinson. An application of recurrent nets to phone probability estimation. IEEE Transactions on Neural Networks, 5:298–305, 1994.
[305] T. Sainath, L. Horesh, B. Kingsbury, A. Aravkin, and B. Ramabhadran. Accelerating hessian-free optimization for deep neural networks by implicit pre-conditioning and sampling. arXiv: 1309.1508v3, 2013.

[306] T. Sainath, B. Kingsbury, A. Mohamed, G. Dahl, G. Saon, H. Soltau, T. Beran, A. Aravkin, and B. Ramabhadran. Improvements to deep convolutional neural networks for LVCSR. In Proceedings of the Automatic Speech Recognition and Understanding Workshop (ASRU). 2013.
[307] T. Sainath, B. Kingsbury, A. Mohamed, and B. Ramabhadran. Learning filter banks within a deep neural network framework. In Proceedings of The Automatic Speech Recognition and Understanding Workshop (ASRU). 2013.
[308] T. Sainath, B. Kingsbury, and B. Ramabhadran. Autoencoder bottleneck features using deep belief networks. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2012.
[309] T. Sainath, B. Kingsbury, B. Ramabhadran, P. Novak, and A. Mohamed. Making deep belief networks effective for large vocabulary continuous speech recognition. In Proceedings of the Automatic Speech Recognition and Understanding Workshop (ASRU). 2011.
[310] T. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2013.
[311] T. Sainath, B. Kingsbury, H. Soltau, and B. Ramabhadran. Optimization techniques to improve training speed of deep neural networks for large speech tasks. IEEE Transactions on Audio, Speech, and Language Processing, 21(11):2267–2276, November 2013.
[312] T. Sainath, A. Mohamed, B. Kingsbury, and B. Ramabhadran. Convolutional neural networks for LVCSR. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2013.
[313] T. Sainath, B. Ramabhadran, M. Picheny, D. Nahamoo, and D. Kanevsky. Exemplar-based sparse representation features: From TIMIT to LVCSR. IEEE Transactions on Speech and Audio Processing, November 2011.
[314] R. Salakhutdinov and G. Hinton. Semantic hashing. In Proceedings of Special Interest Group on Information Retrieval (SIGIR) Workshop on Information Retrieval and Applications of Graphical Models. 2007.
[315] R. Salakhutdinov and G. Hinton. Deep boltzmann machines. In Proceedings of Artificial Intelligence and Statistics (AISTATS). 2009.
[316] R. Salakhutdinov and G. Hinton. A better way to pretrain deep boltzmann machines. In Proceedings of Neural Information Processing Systems (NIPS). 2012.
[317] G. Saon, H. Soltau, D. Nahamoo, and M. Picheny. Speaker adaptation of neural network acoustic models using i-vectors. In Proceedings of the Automatic Speech Recognition and Understanding Workshop (ASRU). 2013.
[318] R. Sarikaya, G. Hinton, and B. Ramabhadran. Deep belief nets for natural language call-routing. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP), pages 5680–5683. 2011.
[319] E. Schmidt and Y. Kim. Learning emotion-based acoustic features with deep belief networks. In Proceedings IEEE of Signal Processing to Audio and Acoustics. 2011.
[320] H. Schwenk. Continuous space translation models for phrase-based statistical machine translation. In Proceedings of Computational Linguistics. 2012.
[321] H. Schwenk, A. Rousseau, and A. Mohammed. Large, pruned or continuous space language models on a gpu for statistical machine translation. In Proceedings of the Joint Human Language Technology Conference and the North American Chapter of the Association of Computational Linguistics (HLT-NAACL) 2012 Workshop on the future of language modeling for Human Language Technology (HLT), pages 11–19.
[322] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. On parallelizability of stochastic gradient descent for speech DNNs. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2014.
[323] F. Seide, G. Li, X. Chen, and D. Yu. Feature engineering in context-dependent deep neural networks for conversational speech transcription. In Proceedings of the Automatic Speech Recognition and Understanding Workshop (ASRU), pages 24–29. 2011.
[324] F. Seide, G. Li, and D. Yu. Conversational speech transcription using context-dependent deep neural networks. In Proceedings of Interspeech, pages 437–440. 2011.
[325] M. Seltzer, D. Yu, and E. Wang. An investigation of deep neural networks for noise robust speech recognition. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2013.
[326] M. Shannon, H. Zen, and W. Byrne. Autoregressive models for statistical parametric speech synthesis. IEEE Transactions on Audio, Speech, Language Processing, 21(3):587–597, 2013.

[327] H. Sheikhzadeh and L. Deng. Waveform-based speech recognition using hidden filter models: Parameter selection and sensitivity to power normalization. IEEE Transactions on Speech and Audio Processing (ICASSP), 2:80–91, 1994.
[328] Y. Shen, X. He, J. Gao, L. Deng, and G. Mesnil. Learning semantic representations using convolutional neural networks for web search. In Proceedings World Wide Web. 2014.
[329] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep fisher networks for large-scale image classification. In Proceedings of Neural Information Processing Systems (NIPS). 2013.
[330] M. Siniscalchi, J. Li, and C. Lee. Hermitian polynomial for speaker adaptation of connectionist speech recognition systems. IEEE Transactions on Audio, Speech, and Language Processing, 21(10):2152–2161, 2013a.
[331] M. Siniscalchi, T. Svendsen, and C.-H. Lee. A bottom-up modular search approach to large vocabulary continuous speech recognition. IEEE Transactions on Audio, Speech, Language Processing, 21, 2013.
[332] M. Siniscalchi, D. Yu, L. Deng, and C.-H. Lee. Exploiting deep neural networks for detection-based speech recognition. Neurocomputing, 106:148–157, 2013.
[339] R. Socher, Y. Bengio, and C. Manning. Deep learning for NLP. Tutorial at Association for Computational Linguistics (ACL), 2012, and North American Chapter of the Association of Computational Linguistics (NAACL), 2013. http://www.socher.org/index.php/DeepLearningTutorial.
[340] R. Socher, D. Chen, C. Manning, and A. Ng. Reasoning with neural tensor networks for knowledge base completion. In Proceedings of Neural Information Processing Systems (NIPS). 2013.
[341] R. Socher and L. Fei-Fei. Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora. In Proceedings of Computer Vision and Pattern Recognition (CVPR). 2010.
[342] R. Socher, M. Ganjoo, H. Sridhar, O. Bastani, C. Manning, and A. Ng. Zero-shot learning through cross-modal transfer. In Proceedings of Neural Information Processing Systems (NIPS). 2013b.
[343] R. Socher, Q. Le, C. Manning, and A. Ng. Grounded compositional semantics for finding and describing images with sentences. Neural Information Processing Systems (NIPS) Deep Learning Workshop, 2013c.
[344] R. Socher, C. Lin, A. Ng, and C. Manning. Parsing natural scenes
[333] M. Siniscalchi, D. Yu, L. Deng, and C.-H. Lee. Speech recognition using and natural language with recursive neural networks. In Proceedings of
long-span temporal patterns in a deep network model. IEEE Signal International Conference on Machine Learning (ICML). 2011.
Processing Letters, 20(3):201–204, March 2013. [345] R. Socher, J. Pennington, E. Huang, A. Ng, and C. Manning. Dynamic
[334] G. Sivaram and H. Hermansky. Sparse multilayer perceptrons for pooling and unfolding recursive autoencoders for paraphrase detection.
phoneme recognition. IEEE Transactions on Audio, Speech, & Lan- In Proceedings of Neural Information Processing Systems (NIPS). 2011.
guage Processing, 20(1), January 2012. [346] R. Socher, J. Pennington, E. Huang, A. Ng, and C. Manning. Semi-
[335] P. Smolensky. Tensor product variable binding and the representation supervised recursive autoencoders for predicting sentiment distribu-
of symbolic structures in connectionist systems. Artificial Intelligence, tions. In Proceedings of Empirical Methods in Natural Language Pro-
46:159–216, 1990. cessing (EMNLP). 2011.

[336] P. Smolensky and G. Legendre. The Harmonic Mind — From Neu- [347] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. Manning, A. Ng, and
ral Computation to Optimality-Theoretic Grammar. The MIT Press, C. Potts. Recursive deep models for semantic compositionality over a
Cambridge, MA, 2006. sentiment treebank. In Proceedings of Empirical Methods in Natural
Language Processing (EMNLP). 2013.
[337] J. Snoek, H. Larochelle, and R. Adams. Practical bayesian optimization
of machine learning algorithms. In Proceedings of Neural Information [348] N. Srivastava and R. Salakhutdinov. Multimodal learning with deep
Processing Systems (NIPS). 2012. boltzmann machines. In Proceedings of Neural Information Processing
Systems (NIPS). 2012.
[338] R. Socher. New directions in deep learning: Structured models, tasks,
and datasets. Neural Information Processing Systems (NIPS) Workshop
on Deep Learning and Unsupervised Feature Learning, 2012.
References 379 380 References

[349] N. Srivastava and R. Salakhutdinov. Discriminative transfer learning [361] G. Taylor, G. E. Hinton, and S. Roweis. Modeling human motion using
with tree-based priors. In Proceedings of Neural Information Processing binary latent variables. In Proceedings of Neural Information Processing
Systems (NIPS). 2013. Systems (NIPS). 2007.
[350] R. Srivastava, J. Masci, S. Kazerounian, F. Gomez, and J. Schmidhuber. [362] S. Thomas, M. Seltzer, K. Church, and H. Hermansky. Deep neural
Compete to compute. In Proceedings of Neural Information Processing network features and semi-supervised training for low resource speech
Systems (NIPS). 2013. recognition. In Proceedings of Interspeech. 2013.
[351] T. Stafylakis, P. Kenny, M. Senoussaoui, and P. Dumouchel. Prelimi- [363] T. Tieleman. Training restricted boltzmann machines using approx-
nary investigation of boltzmann machine classifiers for speaker recogni- imations to the likelihood gradient. In Proceedings of International
tion. In Proceedings of Odyssey, pages 109–116. 2012. Conference on Machine Learning (ICML). 2008.
[352] V. Stoyanov, A. Ropson, and J. Eisner. Empirical risk minimization of [364] K. Tokuda, Y. Nankaku, T. Toda, H. Zen, H. Yamagishi, and K. Oura.
graphical model parameters given approximate inference, decoding, and Speech synthesis based on hidden markov models. Proceedings of the
model structure. In Proceedings of Artificial Intelligence and Statistics IEEE, 101(5):1234–1252, 2013.
(AISTATS). 2011. [365] F. Triefenbach, A. Jalalvand, K. Demuynck, and J.-P. Martens. Acoustic
[353] H. Su, G. Li, D. Yu, and F. Seide. Error back propagation for sequence modeling with hierarchical reservoirs. IEEE Transactions on Audio,
training of context-dependent deep networks for conversational speech Speech, and Language Processing, 21(11):2439–2450, November 2013.
transcription. In Proceedings of International Conference on Acoustics [366] G. Tur, L. Deng, D. Hakkani-Tür, and X. He. Towards deep under-
Speech and Signal Processing (ICASSP). 2013. standing: Deep convex networks for semantic utterance classification. In
[354] A. Subramanya, L. Deng, Z. Liu, and Z. Zhang. Multi-sensory speech Proceedings of International Conference on Acoustics Speech and Signal
processing: Incorporating automatically extracted hidden dynamic Processing (ICASSP). 2012.
information. In Proceedings of IEEE International Conference on Mul- [367] J. Turian, L. Ratinov, and Y. Bengio. Word representations: A simple
timedia & Expo (ICME). Amsterdam, July 2005. and general method for semi-supervised learning. In Proceedings of
[355] J. Sun and L. Deng. An overlapping-feature based phonological model Association for Computational Linguistics (ACL). 2010.
incorporating linguistic constraints: Applications to speech recognition. [368] Z. Tüske, M. Sundermeyer, R. Schlüter, and H. Ney. Context-dependent
Journal on Acoustical Society of America, 111(2):1086–1101, 2002. MLPs for LVCSR: TANDEM, hybrid or both? In Proceedings of Inter-
[356] I. Sutskever. Training recurrent neural networks. Ph.D. Thesis, Univer- speech. 2012.
sity of Toronto, 2013. [369] B. Uria, S. Renals, and K. Richmond. A deep neural network for
[357] I. Sutskever, J. Martens, and G. Hinton. Generating text with recurrent acoustic-articulatory speech inversion. Neural Information Processing
neural networks. In Proceedings of International Conference on Machine Systems (NIPS) Workshop on Deep Learning and Unsupervised Feature
Learning (ICML). 2011. Learning, 2011.
[358] Y. Tang and C. Eliasmith. Deep networks for robust visual recogni- [370] R. van Dalen and M. Gales. Extended VTS for noise-robust speech
tion. In Proceedings of International Conference on Machine Learning recognition. IEEE Transactions on Audio, Speech, and Language Pro-
(ICML). 2010. cessing, 19(4):733–743, 2011.
[359] Y. Tang and R. Salakhutdinov. Learning Stochastic Feedforward Neural [371] A. van den Oord, S. Dieleman, and B. Schrauwen. Deep content-based
Networks. NIPS, 2013. music recommendation. In Proceedings of Neural Information Process-
[360] A. Tarralba, R. Fergus, and Y. Weiss. Small codes and large image ing Systems (NIPS). 2013.
databases for recognition. In Proceedings of Computer Vision and Pat- [372] V. Vasilakakis, S. Cumani, and P. Laface. Speaker recognition by means
tern Recognition (CVPR). 2008. of deep belief networks. In Proceedings of Biometric Technologies in
Forensic Science. 2013.
References 381 382 References

[373] K. Vesely, A. Ghoshal, L. Burget, and D. Povey. Sequence-discriminative [385] D. Warde-Farley, I. Goodfellow, A. Courville, and Y. Bengi. An empir-
training of deep neural networks. In Proceedings of Interspeech. 2013. ical analysis of dropout in piecewise linear networks. In Proceedings of
[374] K. Vesely, M. Hannemann, and L. Burget. Semi-supervised training of International Conference on Learning Representations (ICLR). 2014.
deep neural networks. In Proceedings of the Automatic Speech Recogni- [386] M. Welling, M. Rosen-Zvi, and G. Hinton. Exponential family harmo-
tion and Understanding Workshop (ASRU). 2013. niums with an application to information retrieval. In Proceedings of
[375] P. Vincent. A connection between score matching and denoising autoen- Neural Information Processing Systems (NIPS). 2005.
coder. Neural Computation, 23(7):1661–1674, 2011. [387] C. Weng, D. Yu, M. Seltzer, and J. Droppo. Single-channel mixed speech
[376] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. Manzagol. recognition using deep neural networks. In Proceedings of International
Stacked denoising autoencoders: Learning useful representations in a Conference on Acoustics Speech and Signal Processing (ICASSP). 2014.
deep network with a local denoising criterion. Journal of Machine [388] J. Weston, S. Bengio, and N. Usunier. Large scale image annotation:
Learning Research, 11:3371–3408, 2010. Learning to rank with joint word-image embeddings. Machine Learning,
[377] O. Vinyals, Y. Jia, L. Deng, and T. Darrell. Learning with recursive 81(1):21–35, 2010.
perceptual representations. In Proceedings of Neural Information Pro- [389] J. Weston, S. Bengio, and N. Usunier. Wsabie: Scaling up to large
cessing Systems (NIPS). 2012. vocabulary image annotation. In Proceedings of International Joint
[378] O. Vinyals and D. Povey. Krylov subspace descent for deep learning. In Conference on Artificial Intelligence (IJCAI). 2011.
Proceedings of Artificial Intelligence and Statistics (AISTATS). 2012. [390] S. Wiesler, J. Li, and J. Xue. Investigations on hessian-free optimization
[379] O. Vinyals and S. Ravuri. Comparing multilayer perceptron to deep for cross-entropy training of deep neural networks. In Proceedings of
belief network tandem features for robust ASR. In Proceedings of Interspeech. 2013.
International Conference on Acoustics Speech and Signal Processing [391] M. Wohlmayr, M. Stark, and F. Pernkopf. A probabilistic interac-
(ICASSP). 2011. tion model for multi-pitch tracking with factorial hidden markov model.
[380] O. Vinyals, S. Ravuri, and D. Povey. Revisiting recurrent neural net- IEEE Transactions on Audio, Speech, and Language Processing, 19(4),
works for robust ASR. In Proceedings of International Conference on May 2011.
Acoustics Speech and Signal Processing (ICASSP). 2012. [392] D. Wolpert. Stacked generalization. Neural Networks, 5(2):241–259,
[381] S. Wager, S. Wang, and P. Liang. Dropout training as adaptive reg- 1992.
ularization. In Proceedings of Neural Information Processing Systems [393] S. J. Wright, D. Kanevsky, L. Deng, X. He, G. Heigold, and H. Li.
(NIPS). 2013. Optimization algorithms and applications for speech and language pro-
[382] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang. Phoneme cessing. IEEE Transactions on Audio, Speech, and Language Processing,
recognition using time-delay neural networks. IEEE Transactions on 21(11):2231–2243, November 2013.
Acoustical Speech, and Signal Processing, 37:328–339, 1989. [394] L. Xiao and L. Deng. A geometric perspective of large-margin training
[383] G. Wang and K. Sim. Context-dependent modelling of deep neural of gaussian models. IEEE Signal Processing Magazine, 27(6):118–123,
network using logistic regression. In Proceedings of the Automatic Speech November 2010.
Recognition and Understanding Workshop (ASRU). 2013. [395] X. Xie and S. Seung. Equivalence of backpropagation and contrastive
[384] G. Wang and K. Sim. Regression-based context-dependent modeling hebbian learning in a layered network. Neural computation, 15:441–454,
of deep neural networks for speech recognition. IEEE/Association for 2003.
Computing Machinery (ACM) Transactions on Audio, Speech, and Lan- [396] Y. Xu, J. Du, L. Dai, and C. Lee. An experimental study on speech
guage Processing, 2014. enhancement based on deep neural networks. IEEE Signal Processing
Letters, 21(1):65–68, 2014.
References 383 384 References

[397] J. Xue, J. Li, and Y. Gong. Restructuring of deep neural network [409] D. Yu and L. Deng. Solving nonlinear estimation problems using splines.
acoustic models with singular value decomposition. In Proceedings of IEEE Signal Processing Magazine, 26(4):86–90, July 2009.
Interspeech. 2013. [410] D. Yu and L. Deng. Deep-structured hidden conditional random fields
[398] S. Yamin, L. Deng, Y. Wang, and A. Acero. An integrative and discrimi- for phonetic recognition. In Proceedings of Interspeech. September 2010.
native technique for spoken utterance classification. IEEE Transactions [411] D. Yu and L. Deng. Accelerated parallelizable neural networks learning
on Audio, Speech, and Language Processing, 16:1207–1214, 2008. algorithms for speech recognition. In Proceedings of Interspeech. 2011.
[399] Z. Yan, Q. Huo, and J. Xu. A scalable approach to using DNN-derived [412] D. Yu and L. Deng. Deep learning and its applications to signal and
features in GMM-HMM based acoustic modeling for LVCSR. In Pro- information processing. IEEE Signal Processing Magazine, pages 145–
ceedings of Interspeech. 2013. 154, January 2011.
[400] D. Yang and S. Furui. Combining a two-step CRF model and a joint [413] D. Yu and L. Deng. Efficient and effective algorithms for training single-
source-channel model for machine transliteration. In Proceedings of hidden-layer neural networks. Pattern Recognition Letters, 33:554–558,
Association for Computational Linguistics (ACL), pages 275–280. 2010. 2012.
[401] K. Yao, D. Yu, L. Deng, and Y. Gong. A fast maximum likelihood non- [414] D. Yu, L. Deng, and G. E. Dahl. Roles of pre-training and fine-tuning in
linear feature transformation method for GMM-HMM speaker adapta- context-dependent DBN-HMMs for real-world speech recognition. Neu-
tion. Neurocomputing, 2013a. ral Information Processing Systems (NIPS) 2010 Workshop on Deep
[402] K. Yao, D. Yu, F. Seide, H. Su, L. Deng, and Y. Gong. Adaptation of Learning and Unsupervised Feature Learning, December 2010.
context-dependent deep neural networks for automatic speech recogni- [415] D. Yu, L. Deng, J. Droppo, J. Wu, Y. Gong, and A. Acero. Robust
tion. In Proceedings of International Conference on Acoustics Speech speech recognition using cepstral minimum-mean-square-error noise
and Signal Processing (ICASSP). 2012. suppressor. IEEE Transactions on Audio, Speech, and Language Pro-
[403] K. Yao, G. Zweig, M. Hwang, Y. Shi, and D. Yu. Recurrent neural cessing, 16(5), July 2008.
networks for language understanding. In Proceedings of Interspeech. [416] D. Yu, L. Deng, Y. Gong, and A. Acero. A novel framework and training
2013. algorithm for variable-parameter hidden markov models. IEEE Trans-
[404] T. Yoshioka and T. Nakatani. Noise model transfer: Novel approach to actions on Audio, Speech and Language Processing, 17(7):1348–1360,
robustness against nonstationary noise. IEEE Transactions on Audio, 2009.
Speech, and Language Processing, 21(10):2182–2192, 2013. [417] D. Yu, L. Deng, X. He, and A. Acero. Large-margin minimum clas-
[405] T. Yoshioka, A. Ragni, and M. Gales. Investigation of unsupervised sification error training: A theoretical risk minimization perspective.
adaptation of DNN acoustic models with filter bank input. In Pro- Computer Speech and Language, 22(4):415–429, October 2008.
ceedings of International Conference on Acoustics Speech and Signal [418] D. Yu, L. Deng, X. He, and X. Acero. Large-margin minimum classi-
Processing (ICASSP). 2013. fication error training for large-scale speech recognition tasks. In Pro-
[406] L. Younes. On the convergence of markovian stochastic algorithms with ceedings of International Conference on Acoustics Speech and Signal
rapidly decreasing ergodicity rates. Stochastics and Stochastic Reports, Processing (ICASSP). 2007.
65(3):177–228, 1999. [419] D. Yu, L. Deng, G. Li, and F. Seide. Discriminative pretraining of deep
[407] D. Yu, X. Chen, and L. Deng. Factorized deep neural networks for adap- neural networks. U.S. Patent Filing, November 2011.
tive speech recognition. International Workshop on Statistical Machine [420] D. Yu, L. Deng, P. Liu, J. Wu, Y. Gong, and A. Acero. Cross-lingual
Learning for Speech Processing, March 2012b. speech recognition under runtime resource constraints. In Proceedings
[408] D. Yu, D. Deng, and S. Wang. Learning in the deep-structured con- of International Conference on Acoustics Speech and Signal Processing
ditional random fields. Neural Information Processing Systems (NIPS) (ICASSP). 2009b.
2009 Workshop on Deep Learning for Speech Recognition and Related
Applications, 2009.
References 385 386 References

[421] D. Yu, L. Deng, and F. Seide. Large vocabulary speech recognition using [433] F. Zamora-Martínez, M. Castro-Bleda, and S. España-Boquera. Fast
deep tensor neural networks. In Proceedings of Interspeech. 2012c. evaluation of connectionist language models. International Conference
[422] D. Yu, L. Deng, and F. Seide. The deep tensor neural network with on Artificial Neural Networks, pages 144–151, 2009.
applications to large vocabulary speech recognition. IEEE Transactions [434] M. Zeiler. Hierarchical convolutional deep learning in computer vision.
on Audio, Speech, and Language Processing, 21(2):388–396, 2013. Ph.D. Thesis, New York University, January 2014.
[423] D. Yu, J.-Y. Li, and L. Deng. Calibration of confidence measures in [435] M. Zeiler and R. Fergus. Stochastic pooling for regularization of deep
speech recognition. IEEE Transactions on Audio, Speech and Language, convolutional neural networks. In Proceedings of International Confer-
19:2461–2473, 2010. ence on Learning Representations (ICLR). 2013.
[424] D. Yu, F. Seide, G. Li, and L. Deng. Exploiting sparseness in deep [436] M. Zeiler and R. Fergus. Visualizing and understanding convolutional
neural networks for large vocabulary speech recognition. In Proceedings networks. arXiv:1311.2901, pages 1–11, 2013.
of International Conference on Acoustics Speech and Signal Processing [437] M. Zeiler, G. Taylor, and R. Fergus. Adaptive deconvolutional networks
(ICASSP). 2012. for mid and high level feature learning. In Proceedings of International
[425] D. Yu and M. Seltzer. Improved bottleneck features using pre-trained Conference on Computer vision (ICCV). 2011.
deep neural networks. In Proceedings of Interspeech. 2011. [438] H. Zen, M. Gales, J. F. Nankaku, and Y. K. Tokuda. Product of
[426] D. Yu, M. Seltzer, J. Li, J.-T. Huang, and F. Seide. Feature learning in experts for statistical parametric speech synthesis. IEEE Transactions
deep neural networks — studies on speech recognition. In Proceedings on Audio, Speech, and Language Processing, 20(3):794–805, March 2012.
of International Conference on Learning Representations (ICLR). 2013. [439] H. Zen, Y. Nankaku, and K. Tokuda. Continuous stochastic feature
[427] D. Yu, S. Siniscalchi, L. Deng, and C. Lee. Boosting attribute and mapping based on trajectory HMMs. IEEE Transactions on Audio,
phone estimation accuracies with deep neural networks for detection- Speech, and Language Processings, 19(2):417–430, February 2011.
based speech recognition. In Proceedings of International Conference [440] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech syn-
on Acoustics Speech and Signal Processing (ICASSP). 2012. thesis using deep neural networks. In Proceedings of International Con-
[428] D. Yu, S. Wang, and L. Deng. Sequential labeling using deep-structured ference on Acoustics Speech and Signal Processing (ICASSP), pages
conditional random fields. Journal of Selected Topics in Signal Process- 7962–7966. 2013.
ing, 4:965–973, 2010. [441] X. Zhang, J. Trmal, D. Povey, and S. Khudanpur. Improving deep
[429] D. Yu, S. Wang, Z. Karam, and L. Deng. Language recognition using neural network acoustic models using generalized maxout networks. In
deep-structured conditional random fields. In Proceedings of Interna- Proceedings of International Conference on Acoustics Speech and Signal
tional Conference on Acoustics Speech and Signal Processing (ICASSP), Processing (ICASSP). 2014.
pages 5030–5033. 2010. [442] X. Zhang and J. Wu. Deep belief networks based voice activity detec-
[430] D. Yu, K. Yao, H. Su, G. Li, and F. Seide. KL-divergence regularized tion. IEEE Transactions on Audio, Speech, and Language Processing,
deep neural network adaptation for improved large vocabulary speech 21(4):697–710, 2013.
recognition. In Proceedings of International Conference on Acoustics [443] Z. Zhang, Z. Liu, M. Sinclair, A. Acero, L. Deng, J. Droppo, X. Huang,
Speech and Signal Processing (ICASSP). 2013. and Y. Zheng. Multi-sensory microphones for robust speech detection,
[431] K. Yu, M. Gales, and P. Woodland. Unsupervised adaptation with dis- enhancement and recognition. In Proceedings of International Confer-
criminative mapping transforms. IEEE Transactions on Audio, Speech, ence on Acoustics Speech and Signal Processing (ICASSP). 2004.
and Language Processing, 17(4):714–723, 2009. [444] Y. Zhao and B. Juang. Nonlinear compensation using the gauss-newton
[432] K. Yu, Y. Lin, and H. Lafferty. Learning image representations from method for noise-robust speech recognition. IEEE Transactions on
the pixel level via hierarchical sparse coding. In Proceedings Computer Audio, Speech, and Language Processing, 20(8):2191–2206, 2012.
Vision and Pattern Recognition (CVPR). 2011.
References 387

[445] W. Zou, R. Socher, D. Cer, and C. Manning. Bilingual word embed-


dings for phrase-based machine translation. In Proceedings of Empirical
Methods in Natural Language Processing (EMNLP). 2013.
[446] G. Zweig and P. Nguyen. A segmental CRF approach to large vocab-
ulary continuous speech recognition. In Proceedings of the Automatic
Speech Recognition and Understanding Workshop (ASRU). 2009.
