Transcript Lec1
I have a terrible confession to make. This class is actually being run not by me, but by these
two guys:
Alfredo Canziani and Mark Goldstein, whose names are here [points to slide]. They are the
TAs, and
you'll talk to them much more often than you'll talk to me.
That's the first thing. The other confession I have to make is that if you have questions
about this class,
don't ask them at the end of this course because I have to run right after the class to catch
an airplane.
Some very basic course information. There is a website as you can see.
I will do what I can to post the PDF of the slides on the website.
Probably just before the lecture; usually just a few minutes before, in fact.
But it should be there by the time you get to class, or at least by the time I get to class.
There are going to be nine lectures that I'm going to teach, on Monday evenings.
There are also practice sessions, which the TAs will be running. They'll go through some of the, you know, practical
mathematics that are necessary for this, and basic concepts. Some tutorials on how to use
PyTorch and various other
software tools.
And there are going to be three guest lectures. The names of the guest lecturers are not
finalized.
And they're going to take some of those sessions, you know, around March.
And the evaluation will be done on the midterm and on a final project.
And, you can sort of, you know, band together in groups of two.
The project will probably have to do with a combination of self-supervised learning and
autonomous driving. We are ...
Okay, let me talk a little bit about, so this first lecture is really going to be sort of a broad
introduction about
what deep learning really is, and what it can do and what it cannot do.
So, we'll go through the entire arc, if you want, of the class
sort of, a broad high-level idea of all the topics we're talking about. And whenever I talk
about a particular topic, I'll show a
picture.
So, there is a prerequisite for the class, which is, you know ...
Which is good.
But I'm not going to assume that you know everything about this. Particularly, I am not going
to assume that you know a lot of the,
you know, sort of, deep underlying techniques. OK, so here is the course plan, and I can
go faster on certain sections that you think are too obvious because you've played with this
before, or other things.
So intro to
Supervised Learning, Neural Nets, Deep Learning. That's what I'm going to talk about today.
What deep learning can do, what it cannot do, what are good features. Deep learning is
about learning representations.
Next week will be about back propagation and basic architectural components. So things
like
the fact that you build neural nets out of modules that you connect with each other. Then
you compute gradients, you get automatic differentiation;
activation functions, you know, different modules; tricks like weight sharing and weight
tying;
multiplicative interactions;
macro architectures, like mixtures of experts, Siamese nets, hyper networks, etc.
So we'll dive pretty quickly in and that's appropriate if you've already played with some of
those things.
Then there will be either one or two lectures – I haven't completely decided yet — about
convolutional nets and their applications.
Then, more specifically, about deep learning architectures that are useful in special cases.
So things like recurrent neural nets with back propagation through time, which is the way
you train recurrent neural nets.
And ...
applications of recurrent neural nets to things like control and, you know, producing time series and stuff
like that.
Then architectures that have gating
really at the basis of their architecture, like memory networks, transformers, adapters, etc.,
which are sort of very recent
architectures that have become extremely popular in things like NLP and
other areas. And then a little bit about graph neural nets, which I'm not going to talk about
this a lot because there is another
course that you can take by Joan Bruna where he spends a lot of time on graph neural nets.
Then...
then we'll talk about how we get those deep learning systems to work. And, so, various
tricks to get them to work.
Sort of understanding the type of optimization that takes place in neural nets. So...
Now, there are certain results about optimization in the convex case that are well understood.
But the non-convex case, which is the case for most deep learning systems, is not very well
understood, because the cost function is not ...
is not convex.
It has local minima, and saddle points, and things like this. So it's important to understand
the geometry of the objective function.
the big secret here is that nobody actually understands it. So ...
tricks come up through a combination of intuition and a little bit of theoretical analysis,
and many of them do not work. And then something a little exotic called target prop and the
Lagrangian formulation of back prop.
Then I'll switch to my favorite topic, which is energy-based models. So this is sort of a
general formulation of a lot of different, sort of, approaches to learning — whether they are
supervised, unsupervised, self-supervised.
Like, you know, searching for the value of variables that nobody tells you the value of, but
that your process
So you could think of reasoning in neural nets as a process by which you have
some energy function that's being optimized with respect to some variables. And the value
you get as a result of this optimization
is the value of those variables you were trying to find. And so,
there is sort of the common view that a neural net is just a function that computes its
output as a function of its
input. So you just run through the neural net, you get an output. But that's a restricted
form of inference, in the sense that you can only produce one output for a given input.
But very often there are multiple possible answers to a given input. And so, how do you,
kind of, represent problems of this type where there are multiple answers, multiple
possible answers, to a given input? And one answer to this is:
You make those answers the minima of some energy function, and your inference
algorithm is going to find values of those
variables that minimize this objective function. And there might be multiple minima. So
that means your model might produce multiple answers.
So, energy-based models are kind of a way of doing this. A special case of those energy-
based models are
graphical models, Bayesian nets, and things like this. Energy methods are a little more
general. So a little less specific.
So special cases of this include things like what people used to call structured prediction.
And then there are a lot of applications of this in what's called self-supervised learning. And
that will be the topic of the next couple lectures.
And probably something that's going to become really dominant in the future. It's already ...
... in the space of a year, it's become dominant in natural language processing. And, in the
last few months, just three months
there's been a few papers that show that self-supervised learning methods actually
work really well in things like computer vision as well. And so my guess is the same thing is going to happen there
in the next few years. So, I think it's useful to hear about it in this class.
I'm not going to go through a laundry list of this — but there are
things that you may have heard of: like variational auto-encoders, de-noising auto-encoders,
transformer architectures that are trained for natural language processing. They are trained
with self-supervised learning and they are a special case of a de-noising auto-encoder.
A lot of things you may have heard of, without realizing they were all, kind of —
they can all be understood in the context of this sort of energy-based approach. And that
includes also generative adversarial networks (GANs).
So, you know, how do we get machines to really kind of become really intelligent? They're
not superintelligent.
They're not very intelligent right now. They can solve very narrow problems very well,
sometimes with superhuman performance. But no machine has any kind of common
sense. And, the most intelligent machines that we have
probably have less common sense than a house cat. So, how do we get to cat-level
intelligence first, and then maybe human-level intelligence?
And, I don't pretend to have the answer, but I have, you know —
a few ideas that are interesting to discuss in the context of self-supervised learning,
and some applications. Any questions? So that's the plan of the course. Okay, it might
change dynamically.
...Okay...
[Student Question: Will we also be having assignments in the course?] Yeah, yeah, there
are assignments.
[...Inaudible...]
Okay, so for those of you who didn't hear Alfredo, because he didn't speak very loudly:
the assignments are practice to get familiar with all the techniques that you will need for the class.
They will probably be boring for some of you who've already played with those things, but let's start
from the basics.
Deep learning is inspired by what people have observed about the brain, but the inspiration
is just an inspiration.
there are a lot of details about the brain that get left out, and we don't know if they are
actually relevant to intelligence.
And, it's a little bit the same as, you know, airplanes being inspired by birds.
The underlying principles of flight for birds and airplanes are essentially the same, but
the details are extremely different.
They both have wings. They both generate lift by propelling themselves through air, but, you
know, airplanes don't have feathers and don't flap their wings.
So it's a bit of the same idea. And the history of this goes back to a field
that has kind of almost disappeared, or, at least changed names now, called Cybernetics.
If you want a specialist about the history of cybernetics, he is sitting right there:
Joe Lemelin
So, Joe is actually a philosopher. And he is interested in — he actually has a seminar on,
kind of, the history of AI.
He knows everything about, you know, the history of cybernetics. So it started in the 40's
with
two gentlemen: McCulloch and Pitts. Their picture is on the top right here.
And, they got the idea that if neurons are basically threshold units that are on or off,
then by connecting neurons with each other, you can build Boolean circuits and you can
basically do logical inference with neurons.
The brain is basically a logical inference machine, because the neurons are binary. And this
idea is that a neuron
computes a weighted sum of its inputs and then compares the weighted sum to a
threshold. It turns on if the sum is above the threshold, and stays off otherwise.
Then there was, you know, quasi-simultaneously, Donald Hebb, who had the idea that
the brain learns by modifying the strength of the connections between the neurons, which are
called the synapses. And he had the idea
of what's now called Hebbian learning, which is that if two neurons fire together, then the
connection that links them gets stronger.
That's not quite a learning algorithm, but it's sort of a first idea, perhaps.
And then cybernetics was proposed by this guy Norbert Wiener, who is here. [Bottom Right]
This is the whole idea that by having systems that, kind of, have sensors
and have actuators, you can have feedback loops and you can have, you know,
self-regulating systems.
And, what's the theory behind this? You know, we sort of take that for granted now. But the
idea that
for example:
You know, you drive your car, right? You turn the wheel, and
there's a so-called PID controller that actually turns the wheels in proportion to how you turn
the steering wheel.
It basically measures the position of the steering wheel, measures the position of
the wheels of the car. And then, if there is a difference between the two, it kind of corrects the
wheels of the car so that the difference goes away.
Then people came up with learning algorithms that modified the weights of very simple neural nets. And what you see
here at the bottom
the two pictures here [Bottom Left]: This is Frank Rosenblatt, and this is the Perceptron.
This was a physical machine. It had
optical sensors, so you could show it pictures. It was very low resolution.
it had neurons that could compute a weighted sum, and the weights could be adapted. And
the weights were potentiometers.
The potentiometers had motors on them, so the learning algorithm could rotate them.
So it was electro-mechanical. And what he's holding here in his hand is a module of eight
weights
with (you can count them), with those potentiometers, motorized potentiometers on them.
Okay, so
that was a little bit of history of where neural nets come from.
Another interesting piece of history is that this whole idea of, sort of, trying to build
intelligent machines by basically simulating networks of neurons
was born in the 40's, kind of took off a little bit in the late-50's, and then died down
when people realized that with the kind of learning algorithms and architectures that
people were proposing at the time
you couldn't do much. You know, you could do some basic, very simple pattern
recognition, but you couldn't do much.
So between
1968 or 1969, roughly, and the mid-80's, essentially nobody worked on
neural nets,
except a few kind of isolated researchers, mostly in Japan. Japan is its own, kind of,
relatively isolated ecosystem for funding research. People don't listen to the same kind of
...
... fashions, if you want. And then the field took off again in 1985, roughly, with the
backpropagation algorithm.
People were looking for something like this in the 60's and basically didn't find it.
And the reason they didn't find it was because they had the wrong neurons.
The way to get backpropagation to work is to use an activation function that is continuous;
that was the idea of using continuous neurons. Back then neurons were binary, and so people didn't think that you could
train those systems with gradients.
Now there's another reason for this, which is that if you have a neural net with binary
neurons,
you never need to compute multiplications. You never need to multiply two numbers.
You only need to add numbers, right. If your neuron is active, you just add the weight to the
weighted sum.
But with continuous neurons, you need to multiply the activation of a neuron by a weight to get a contribution to the
weighted sum.
It turns out, before the 1980's, multiplying two numbers, particularly floating point
numbers, on
any sort of non-ridiculously expensive computer was extremely slow. And so there was an
incentive to not use multiplications.
So the reason why backprop didn't emerge earlier than the mid 80's is because that's
when multiplying numbers became fast and cheap enough.
People didn't think of it this way, but that's, you know, kind of retrospectively, that's pretty
much what happened.
So,
this wave lasted about 10 years, until 1995. In 1995, it died again. People in machine learning moved on to other methods until
the late 2000's, early 2010's. So, around 2009/2010, people realized that you could use deep neural nets
and get an improvement for speech recognition. It didn't start with ImageNet. It started with
speech recognition, around 2010.
Within a couple of years, every major player in speech recognition had deployed commercial speech recognition
systems that used
neural nets. So, if you had an Android phone and you were using any of the speech
recognition features in an Android phone
around 2012,
you were using something that used neural nets. That was probably the first really, really wide
deployment of, kind of, modern forms of deep learning, if you want.
Then at the end of 2012 / early-2013, the same thing happened in computer vision, where
the computer vision community realized
deep learning, convolutional nets in particular, work much better than whatever it is that
they were using before, and
started to switch to using convolutional nets, and basically abandoned all previous
techniques.
So that created a second revolution, now in computer vision. And then three years later,
around 2016 or so,
the same thing happened in natural language processing — in language translation, and
things like this: 2015/16.
And, now we're going to see — it hasn't happened yet — but we're going to see the same
revolution occur
revolution occur
in things like robotics, and control, and, you know, a whole bunch of application areas.
Okay. So, you all know what supervised learning is, I'm sure.
And this is really what the vast majority,
90-some percent,
of applications of deep learning use: supervised learning as kind of the main thing. So
supervised learning is the process by which you collect a bunch
of examples of, let's say, images together with a category (if you want to do image
recognition); or a bunch of audio together with its
transcription; or a bunch of text in one language with the translation in another language,
etc.
And you tweak the parameters of that function in such a way that the output gets closer to
the one you want.
You show a picture of a car; if the system doesn't say "car", you tweak the
parameters. The parameters in the neural net are going to be the weights. You
tweak
the knobs so that the output gets closer to the one you want.
The trick in neural nets is: How do you figure out in which direction, and by how much, to
tweak the knobs so that the output gets closer to the one you want?
That's what gradient computation and backpropagation are about. But before we get to this,
a little bit of history again. So there was a flurry of models, basic models for
classification. You know, starting with the Perceptron, there was another competing model
called the Adaline,
which is on the top right here. They are based on the same kind of basic architectures:
compute the weighted sum of the inputs and compare it to a threshold.
What you see here, the Adaline, the thing that Bernie Widrow is tweaking, is actually a
physical machine too.
It's like the Perceptron, but much smaller, in many ways.
The reason I tell you about this is that the Perceptron actually was a two-layer neural
net,
with adaptive weights. But the first layer was fixed. In fact, most of the time, with most
experiments, it was
neurons that would, you know, be threshold neurons with random weights, essentially.
This is what they called the associative layer. And that basically
became the basis for the sort of conceptual design of a pattern recognition system for the
next four decades.
you take the raw input and run it through a feature extractor that is supposed to extract the relevant features.
So, you want to recognize a face? Can you detect an eye? How do you detect an eye? You want to
recognize a car? You know, wheels are kind of dark, round things, etc.
So, what this feature extractor produces is a vector of
features, which are things that may be numbers, or they may be on or off.
Okay, so it's just a list of numbers, a vector. And you're going to feed that vector to a trainable
classifier. In the case of the
Perceptron or a simple neural net, it's gonna be just the system that computes a weighted
sum, compares it to a threshold.
The problem is that you have to engineer the feature extractor. So the entire literature of
pattern recognition
(statistical pattern recognition at least)
and a lot of computer vision (at least the part of computer vision that's interested in
recognition) was focused on
this part: the feature extractor. How do you design a feature extractor for a particular
problem? You want to do, say, character recognition: what are good
features for recognizing Hangul? And how can you extract them using all kinds of
algorithmic tricks?
How do you pre-process the images? You know, how do you normalize their size? You know,
things like that.
How do you skeletonize them? How do you segment them from their background?
So the entire literature was devoted to this [Feature Extractor], very very little was devoted
to that [Trainable Classifier].
And what deep learning brought to the table is this idea that, instead of having this kind of
two-stage process,
you learn the entire task end-to-end. You build the system
as a cascade or a sequence of trainable modules,
and you stack multiple layers of them, which is why it's called deep
learning.
So the only reason for the "deep" word in deep learning is the fact that there are multiple
layers. There is nothing more to that.
And then you train the entire thing end-to-end. So the complication here, of course, is the
fact that
the parameters that are in the first box: how do you know how to tune them so that the
output does the right thing? And why does each module have to be
nonlinear? It's because, if you have two successive modules, and they're both linear, you
can collapse them into a single linear module:
the composition of two linear functions is a linear function. Take a vector,
multiply it by a matrix, and then multiply the result by a second matrix.
It's as if you had pre-computed the product of those two matrices, and then multiplied the
input vector by that
composite matrix. So there's no point having multiple layers if those layers are linear.
So what is the simplest module that has parameters you can tune (things like weights in the neural net) and is non-linear?
So ...
Take an input. An input can be represented as a vector, right. An image is just a list of
numbers.
Think of it as a vector, ignore the fact that it's an image for now.
Piece of audio, whatever it is that your sensors or your data set gives you, is a vector.
Multiply this vector by a matrix. The coefficients in this matrix are the tunable parameters.
And then take the resulting vector, right — when you multiply a matrix by a vector, you get a
vector.
And,
if you want to have the simplest possible nonlinear function, use something like
what's shown at the top here [ReLU(x) = max(x, 0)], which people in neural nets call the
ReLU; people in engineering call it half-wave rectification.
So, apply this nonlinear function to every component of the vector that results from
multiplying the input vector by the matrix.
Okay. Now you get a new vector, which has lots of zeros in it, because whenever the
weighted sum was less than zero,
the output is zero, if you pass through the ReLU. And then repeat the process: Take that
vector, multiply it by a weight matrix;
pass the result through a pointwise non-linearity; take the result, multiply it by another
matrix; pass the result through the
nonlinearity again; and so on.
Okay now
Why is that called a neural net at all? It's because when you take a vector and you multiply
a vector by a matrix,
to compute each
component of the output, you actually compute a weighted sum of the components of the
input,
weighted by the corresponding row in the matrix, right. So this little symbol here computes a
weighted sum. And you do this for every row, and that gives you the
result, right.
So, the number of units after the multiplication by a matrix is going to be equal to the
number of rows of your matrix.
And the number of columns of the matrix, of course has to be equal to the size of the input.
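To make this concrete, here is a minimal sketch of that stack of matrix multiplications and pointwise ReLUs in PyTorch (the sizes are arbitrary choices for illustration, not from the lecture):

```python
import torch

x = torch.randn(784)        # input vector, e.g. a flattened image
W1 = torch.randn(100, 784)  # 100 rows -> 100 units; 784 columns = input size
W2 = torch.randn(10, 100)   # second layer: 10 outputs

h = torch.relu(W1 @ x)      # weighted sums, then pointwise ReLU (zeros out negatives)
y = W2 @ h                  # multiply by the next matrix
print(y.shape)              # torch.Size([10])
```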
Okay. So supervised learning — in slightly more formal terms than the one I showed earlier
— is the idea by which you're going to compare the output that the system produces...
So, right, you show an input, you run through the neural net, you get an output. You're going
to compare this output with a target output.
You compute the distance, for example the Euclidean distance, between a target vector and the
vector that
the neural net, the deep learning system, produces.
And then you can compute the average of this cost function, which is just a scalar number.
So, a training set is composed of a bunch of pairs of inputs and outputs; compute the
average of this over the training set.
The function you want to minimize with respect to the parameters of the system (the
tunable knobs) is that average.
Okay. So, you want to find the value of the parameters that minimizes the average error
between the
output you want and the output you get, averaged over a training set of samples.
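As a rough sketch of that objective, assuming a squared-error cost (the model and data here are stand-ins):

```python
import torch
import torch.nn as nn

model = nn.Linear(784, 10)        # stand-in for a deep net
inputs = torch.randn(100, 784)    # a training set of 100 input samples
targets = torch.randn(100, 10)    # the outputs you want

# squared error between what you get and what you want,
# averaged over the whole training set
loss = ((model(inputs) - targets) ** 2).mean()
```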
So...
I'm sure the vast majority of people here, sort of, have at least an intuitive understanding of
what gradient descent is.
So, basically, the way to minimize this is to compute the gradient, right.
It's like you are lost in the mountains:
there is fog, it's night, and you want to get to the village in the valley.
So you turn around and you see which way is down, and you take a step down in the direction
of steepest descent.
Okay. So, this search for the direction that goes down: that's called "computing a gradient"
or, technically, a negative gradient.
Okay, then you take a step down. That's taking a step down in the
direction of the negative gradient. And if you keep doing this and your steps are small enough —
small enough so that when you take a step, you don't jump to the other side of the
mountain — then
you will end up at the bottom of the valley, if the valley is convex. If there is, kind of, a mountain lake in the
middle where, you know,
there's kind of a local
minimum, then you're going to get stuck in that minimum; the valley might be lower, but you
don't see it.
So this is important as a concept.
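A minimal sketch of plain gradient descent using PyTorch's autograd (the toy objective and step size are made up):

```python
import torch

w = torch.randn(2, requires_grad=True)   # the "knobs"
target = torch.tensor([3.0, -1.0])

for step in range(100):
    loss = ((w - target) ** 2).sum()     # a toy convex objective
    loss.backward()                      # compute the gradient
    with torch.no_grad():
        w -= 0.1 * w.grad                # small step down the negative gradient
        w.grad.zero_()
```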
But here is another concept, which is the concept of stochastic gradient, which I'm sure
again a lot of you have heard [of].
We'll see this in more detail. The objective function you're computing is an average over many, many
samples.
But it turns out it is more efficient to just take one sample, or a small group of samples,
compute the error that this sample makes, then compute the gradient of that error with respect to the
parameters, and take a step.
A small step.
Then you take another sample, and you get a new
value for the error and another value for the gradient, which may be in a different direction
because it's a different sample.
So you go down the cost surface, but in kind of a noisy way; there are going to be a lot of fluctuations.
So what is shown here is an example of this. This is stochastic gradient applied to a very
simple problem with two dimensions
And it looks kind of semi-periodic because the examples are always shown in the same
order,
which is not what you're supposed to do with stochastic gradient. But as you can see the
path is really erratic.
And the other reason is that you actually get better generalization in the end.
So if you measure the performance of the system on a separate set that you —
I assume you all know the concepts of "training set" and "test set" and "validation set" but
—
So if you test the performance of the system on a different set, you get generally better
generalization if you use stochastic gradient than if you actually use the real, true gradient
descent.
No.
So let me tell you why: I mean, this is something we're gonna talk about again when we talk
about optimization.
But let me tell you: I give you a training set with a million training samples. It's actually
100 repetitions of the same training set of 10,000 samples. Okay. So my actual
data set is 10,000 training samples.
I scramble it, and I tell you: here is my training set with a million training samples.
If you compute the full gradient, you're going to compute the same values a hundred times. You're gonna spend a hundred times more
work than necessary.
Real data sets have a lot of redundancy in them; very many samples are very similar to each other, etc.
And if you don't use stochastic gradient, you're not going to be able to take advantage of that
redundancy.
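A minimal sketch of a stochastic gradient loop with shuffled mini-batches (model, data, and learning rate are all toy choices):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
X, Y = torch.randn(10000, 10), torch.randn(10000, 1)  # toy training set

for epoch in range(5):
    perm = torch.randperm(len(X))      # shuffle: don't show samples in a fixed order
    for idx in perm.split(32):         # small groups of samples (mini-batches)
        loss = ((model(X[idx]) - Y[idx]) ** 2).mean()
        opt.zero_grad()
        loss.backward()                # gradient for this mini-batch only
        opt.step()                     # a small, noisy step
```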
Okay.
Don't pay attention to the formula. Don't get scared because we're going to come back to
this in more detail.
But,
You can think of a neural net of the type that I showed you earlier as a bunch of modules that are stacked on
top of each other.
And you all know the basic rule of calculus, the chain rule. How do you compute the derivative
of a composition of two functions, F(G(X))? It's the derivative of F at the point G(X),
multiplied by the derivative of G at the point X. Right. So you get the product of the two
derivatives. So this is the same thing, except that the functions, instead of being scalar
functions, are vector functions.
They take vectors as inputs and produce vectors as outputs. More generally, actually,
they take multi-dimensional arrays as input and multi-dimensional arrays as output, but
that doesn't matter.
So what does the chain rule look like in the case of
functional modules that have multiple inputs and multiple outputs, that you can view as vector
functions? Right.
And, basically, it's the same rule if you, kind of, blindly apply it — it's the same rule as
you applied for scalar functions.
You know,
what you see in the end is that if you want to compute the derivative of
the difference between the output you want and the output you get,
which is the value of your objective function, with respect to any variable inside of the
network,
then you have to kind of back, you know, propagate derivatives backwards and kind of
multiply things on the way.
All right. We'll be much more formal about this next week. For now,
you just need to know why it's called backpropagation: because you propagate derivatives backwards through the layers.
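A tiny sketch of that product of derivatives, checked against PyTorch's autograd (the functions are arbitrary toy choices):

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

g = x ** 2            # G(x)
f = torch.sin(g)      # F(G(x))
f.backward()          # propagate derivatives backwards

# chain rule by hand: dF/dx = F'(G(x)) * G'(x) = cos(x^2) * 2x
manual = torch.cos(x.detach() ** 2) * 2 * x.detach()
print(x.grad, manual)  # the two values match
```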
So, an image, even sort of a relatively low resolution image, is typically like, you know, a few
hundred pixels on the side.
256 x 256, to take a random example. OK, a car image: 256 x 256. So it's got
65,536 pixels, times three, because you have R, G, and B components;
you have three values for each pixel. And so that's roughly 200,000 numbers.
If you have a matrix that is going to multiply this vector, this matrix is going to have to have
two hundred thousand columns.
And,
depending on how many units you have here in the first layer,
it's going to be 200,000 by, you know, maybe some large number. That's a huge matrix.
You know, that's already a very, very large matrix: billions of values.
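A quick back-of-the-envelope check (the hidden-layer size of 1,000 is an arbitrary assumption):

```python
pixels = 256 * 256           # 65,536
inputs = pixels * 3          # RGB: 196,608 values, roughly 200,000
hidden = 1000                # an arbitrary first-layer size
weights = inputs * hidden    # 196,608,000: ~200 million weights for one dense layer
print(inputs, weights)
```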
So you can't afford a full matrix. What you're going to have to do, if you want to deal with things like images, is
make some hypothesis about the structure of this matrix so that it's not a completely full
matrix, you know.
At least for a lot of practical applications. So this is where inspiration from the brain comes
back.
There was classic work in neuroscience in the 1960s by the gentlemen at the top here:
Hubel and Wiesel. They actually won a Nobel Prize for this in 1981,
but their work was from the late 50s and early 60s. And what they did was poke
electrodes into the visual cortex of cats and
monkeys.
So, first of all, well, this is a human brain; I mean, this chart is from much later. But all
mammalian
visual systems are organized in a similar way. You have signals coming in to your eyes,
striking your retina. You have a few layers of neurons in your retina in front of your
photoreceptors that, kind of, pre-process the
signal, if you want. They kind of compress it, because you can't have—
you know the human eye is something like a hundred million pixels.
But the problem is you cannot have a hundred million fibers coming out of your eyes, because otherwise
your optic nerve would be enormous. So the retina does
compression. It doesn't do JPEG compression, but it does compression, so that the
signal can be compressed into one million fibers. Right.
You have one million fibers coming out of each of your eyes. And
that makes your, you know, optical nerve about this big, which means, you know, you can
carry the signal and turn your eyes.
Invertebrates are not like that. Actually, our design is a big mistake, because
the neurons that process the signal sit in front of your retina, so the wires have to kind of
run in front of your retina, blocking part of the view, if you want. And
then they have to punch a hole through your retina to get to your brain.
So there's a blind spot in your visual field because that's where your optical nerve punches
through your retina.
It's as if you had a camera with the wires coming out the front, and then,
you know, you dug a hole in your sensor to route the wires back. It's much better if the wires come
out the back, right? And
you know, like squid and octopus actually have wires coming out the back. They're much
luckier.
But anyway.
So, the signal goes from your eyes to a little piece of brain called the lateral geniculate
nucleus,
And
then that goes to the back of your brain, where the primary visual cortex, the area called
V1, is.
It's called V1 in humans. And there's something called the ventral pathway: V1, V2, V4, IT,
which is a bunch of brain areas going from the back to the side. And
the infero-temporal cortex, right here, is where object categories are represented.
You have a bunch of neurons firing that represent your grandmother in this area. And it
doesn't matter exactly how or where she appears.
And those things have been discovered by experiments with patients that had to have their
skull open for a few weeks, and
where, you know, people poked electrodes and had them watch movies, and realized there is a
neuron that turns on if
Jennifer Aniston is in the movie. And it only turns on for Jennifer Aniston.
So this gave people the
idea that somehow the visual cortex, you know, can do pattern recognition and seems to
have this sort of hierarchical,
multi-layer structure.
There's barely enough time for the signal to go from your retina to the infero-temporal
cortex.
There's a few milliseconds of delay per neuron that you have to go through, and recognition takes about
100 milliseconds, so you barely have time for, you know, a few spikes to go through the
entire system.
So there's no time for like, you know, recurrent connections and like, you know, etc. Doesn't
mean that there are no recurrent connections.
this gentleman here, Kunihiko Fukushima, had the idea of taking inspiration from
Hubel and Wiesel in the 70s, and sort of built a neural net model on the computer based on
the idea that a neuron reacts to a small area of the visual field,
and the neuron next to it will react to another area that's next to the first one, right,
which means that neighboring neurons react to neighboring regions in the visual field.
What Hubel and Wiesel also realized is that there are groups of neurons that all react to the same area in the
visual field, and they seem to
turn on for edges at a particular orientation. So one neuron will turn on if
its receptive field has an edge, a vertical edge, and then the one next to it if the
edge is a little slanted, and then the one next to it if the edge is a little more
rotated, etc.
So you can think of V1 basically as groups of neurons that look at a small receptive
field and react to orientations. And those groups of neurons that react to multiple
orientations are replicated over the entire visual field.
So this guy Fukushima said: well, why don't I build a neural net that does this? I'm not
going to hand-design the
features, but I'm going to use some sort of unsupervised learning algorithm
to train it.
So he was not training his system end-to-end. He was training it layer by layer in some sort
of unsupervised fashion,
so he used the concept that those neurons were replicated across the visual field, and
then he used another concept from Hubel and Wiesel called complex cells.
So complex cells are units that pool the activities of a bunch of simple cells, which are
those
oriented edge detectors. A complex cell,
since it integrates the outputs from all those simple cells, will stay activated even if the edge shifts a little.
So those complex cells build a little bit of shift invariance in the representation.
You can shift an edge a little bit, and it will not change the activity of one of
those complex cells.
So that's
what we now call "convolution" and "pooling" in the context of convolutional nets.
This whole line of work led me, in the mid-80s or late-80s, to come up with convolutional nets. So they are basically
networks where the connections are local; they are replicated across the visual field; and
you
intersperse,
sort of, feature detection layers that detect those local features with pooling operation.
We'll talk about this at length in three weeks.
But it has,
it recycles this idea from Hubel and Wiesel and Fukushima that
(...if I can get my pointer...)
that, basically, every neuron in one layer computes a weighted sum over a small area of the
input.
And those weights are replicated across the input, so every neuron in a layer uses the same
set of weights. OK, so this is the idea of weight tying or weight sharing.
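A minimal PyTorch sketch of these two ingredients, local weight-shared feature detection (convolution) followed by pooling, with arbitrary channel counts:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 28, 28)          # a small grayscale image

conv = nn.Conv2d(1, 8, kernel_size=5)  # 8 feature detectors, each a 5x5 weighted sum;
                                       # the same 5x5 weights are used at every location
pool = nn.MaxPool2d(2)                 # "complex cell": pool over a neighborhood,
                                       # giving a little shift invariance

h = pool(torch.relu(conv(x)))
print(h.shape)                         # torch.Size([1, 8, 12, 12])
```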
So using backprop, we were able to train neural nets like this to recognize handwritten characters.
And this is me when I was about your age, maybe a little older; I'm about thirty there ...
It's a New Jersey number. And I hit a key, and there is this neural net running on a 386 PC
with a special accelerator card
recognizing those characters, running a neural net very similar to the one I just showed you
the animation of.
And so this was kind of new at the time, because this was back when
you had to engineer the feature extractor by hand. This could basically learn the entire task end-to-end.
You know, basically, the first few layers of that neural net would play the role of a
feature extractor,
but it was trained from data.
The only tasks for which there was enough data were either character recognition or speech
recognition.
We could train it not just on single characters,
but to recognize groups of characters, multiple characters at a time. And it's the
convolutional nature of the network that
basically allowed those systems to just, you know, be applied to a large image, and then
they will just turn on, somewhere in their
field of view, whenever they see a shape that they can recognize.
So, basically, if you have a large image, you train a convolutional net that has a small
input window, and you swipe it
over the entire image, and whenever it turns on, it means it's detected the thing it was trained on.
So here the system, you know, is capable of doing simultaneous segmentation and
recognition.
Traditionally, people in pattern recognition would have an explicit program that would separate individual
objects from their background and from each other, and then send each object to the classifier.
But with this, you can do both at the same time.
You don't have to worry about it. You don't have to build any special program for it.
So in particular this could be applied to natural images for things like facial detection,
pedestrian detection, things like this. Right.
You train a convolutional net to
distinguish between an image where you have a face and an image where you don't have a
face, train this with several thousand examples, and then swipe it over a large image; wherever it turns on,
there is a face. Of course, the face could be bigger than the window,
so you sub-sample the image: you make it smaller, and you swipe your network again, and
then make it smaller again, swipe your network again, and so on.
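A hedged sketch of that multi-scale sliding-window scheme; the `detector` network, window size, stride, and threshold here are all hypothetical placeholders:

```python
import torch
import torch.nn.functional as F

def detect_multiscale(image, detector, window=32, threshold=0.9, n_scales=4):
    """Swipe a fixed-size detector over a (1, C, H, W) image at several scales.
    `detector` is any net mapping a window-sized patch to a score."""
    hits = []
    for scale in range(n_scales):
        H, W = image.shape[-2:]
        for i in range(0, H - window, 8):          # stride of 8 pixels
            for j in range(0, W - window, 8):
                patch = image[..., i:i+window, j:j+window]
                if detector(patch) > threshold:    # the net "turns on" here
                    hits.append((scale, i, j))
        # sub-sample: make the image smaller and swipe again
        image = F.interpolate(image, scale_factor=0.5, mode="bilinear")
    return hits
```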
OK.
In particular, you can use this to drive robots. So these are things that were done before deep
learning became popular, OK.
It's applied to the image coming from a camera on a, you know, running robot.
The network classifies every window of the image as to whether the central pixel in that window is on the ground or is an obstacle, right.
Whatever it classifies as being an obstacle is colored red, or purple if it's at the foot of the obstacle.
And then you can sort of map this to a map, which you see at the top.
And then do planning in this map to reach a particular goal, and then use this to navigate.
This is Raia Hadsell on the right and Pierre Sermanet on the left, who are annoying this poor robot.
They're pretty confident the robot is not going to break their legs, since they actually wrote the code
and trained it.
Pierre Sermanet is a research scientist at Google Brain in California working on robotics.
So semantic segmentation is the idea that you can, again, with this kind of sliding window
approach,
you can train a convolutional net to classify the central pixel using a window as a context.
Not just obstacles from non-obstacles: it's trained to classify something like 30 categories.
This is down ...
And, you know, it knows about roads, and people, and plants, and
trees, and whatever—but it finds, you know, desert in the middle of Washington Square
Park, which is not...
So it's not perfect. At the time it was state of the art, though; that was the best you could do.
I was running around giving talks like trying to evangelize people about deep learning back
then. This was around 2010.
So this is before the, kind of, deep learning revolution, if you want.
A theoretician from Israel was sitting in one of my talks, and he was really kind of
transfixed by the results. He decided to
take a sabbatical and work for a company called Mobileye, which was a start-up in Israel at
the time—
working on autonomous driving. And so, a couple of months after he heard my talk, he
started working at Mobileye.
He told the Mobileye people, you know: "You should try this convolutional net stuff.
This works really well." And the engineers there said: "Nah. No, we don't believe in that
stuff. We have our own method." But he tried it anyway, and it worked better.
And all of a sudden the whole company switched to using convolutional nets.
So Mobileye built a vision system for cars that, you know, can keep a car in its lane on a highway, and can brake if there is
a pedestrian or
cyclist
crossing. I'll come back to this in a minute. They were basically using this technique:
Semantic segmentation, very similar to the one I showed for the robot before.
You should also be aware of the fact that back in the 80s, people were really interested in,
sort of,
implementing special types of hardware that could run neural nets really fast.
And these are a few examples of neural net chips that were actually implemented. I
had something to do with some of them,
or they were implemented by people working in the same group as I was, at Bell
Labs in New Jersey.
So this was kind of a hot topic in the 1980s, and then of course with the
interest in neural nets dying in the mid-90s people weren't working on this anymore, until a
few years ago. Now
the hottest topic in chip design, in the chip industry, is neural net accelerators.
If you go to a computer architecture
or chip conference, like ISSCC, which is the big kind of solid-state circuits conference, half the
talks are about
neural net accelerators.
OK, so then
this revolution happened in speech recognition, image recognition, natural language processing, and it's continuing.
We're in the middle of it now for other topics.
And what happened, and I'm really sad to say it didn't happen in my lab, but
we started, with Yoshua Bengio and Geoff Hinton, back in the early 2000s. We knew, you
know,
we knew that the whole community was making a mistake by dismissing neural nets and
deep learning. And so—
we didn't use the term deep learning yet. We invented it a few years later—so, around 2003
or so, 2004,
we started kind of a conspiracy, if you want. We got together and we said we're just going to
try to, kind of, beat some records, and some data sets, invent some new algorithms that
will allow us to train very large neural nets.
And we would collect very large data sets, so that we could show the world that those
things really work.
That really kind of succeeded beyond our wildest dreams. In particular, in 2012
Geoff Hinton had a student, Alex Krizhevsky, who spent a lot of time implementing
convolutional nets on
GPUs, which were kind of new at the time—they were not entirely new but they were
starting to become really
high-performance.
He used them to train on the ImageNet dataset. The ImageNet dataset is a bunch of natural photos in
1,000 different categories. And the training set had 1.3 million samples.
He trained a large convolutional net, pretty much on the same model as what we had before, implemented on
GPUs, and let it run. And it beat everything else
by a large margin. So this is the error rate on ImageNet, going back to 2010. So in 2010 ...
So, basically you get an error if the correct category is not in the top 5 among 1,000. OK, so
it's kind of a mild
measure of error.
In 2011 it was 25.8%, and the system that was able to do this was actually very, very large.
It was sort of somewhat convolutional-net-like, but it wasn't trained end-to-end. I mean, only the last
layer was trained.
And then
Geoff and his team got it down to 16.4%, and that was a
watershed moment for the computer vision community. A lot of people said: okay, you
know, now we know that this thing works.
The community went from refusing every paper that had a convolutional net in it to refusing
every paper that does not have a convolutional net in it, by 2016.
So now it's the new religion, right. You can't get a paper in a computer vision conference
unless you use ConvNets somehow.
And the error rate went down really quickly, you know people found all kinds of really cute
architectural tricks
that, sort of, made those things work better. And what you'd see in there is that there was
an inflation of the number of layers.
So my convolutional nets from the 90s and early 2000s had 7 layers or so. And
then
GoogLeNet had, I don't know how many, because it's hard to figure out how you count. And
then the workhorse now of the field ...
some, you know, some networks have 100 layers or so.
So Alfredo, a few years ago, put together this chart that compares architectures.
The x-axis is the number of billions of operations you need to do to compute the output.
The y-axis is the top-1 accuracy on ImageNet. So it's not the same measure of
performance as the one I showed you before.
And the size of the blob is the memory occupancy, so the
number of
millions of
floats that you need to store the weight values. Now people are very
smart about compressing those things;
there are entire teams at Google, Facebook, and various other places that only work on
optimizing those networks and compressing the weights.
Because,
to give you a rough idea,
the number of times Facebook, for example, runs a convolutional net on its servers per day is in the tens of
billions.
Okay. So there's a huge incentive to optimize the amount of computation necessary for
this.
So one reason this works is
compositionality.
Compositionality is
the
property by which objects are composed of
parts; parts are composed of sub-parts; sub-parts are really combinations of motifs; and motifs
are combinations of
contours or edges,
right, or textures.
And those are just combinations of pixels. Okay, so there's this so-called compositional
hierarchy that natural data seems to
form.
And so if you, kind of, mimic this compositional hierarchy in the architecture of the
network, and you let it learn the appropriate way the features of one layer,
you know, form the features of the next layer, that's really what deep learning is.
Okay. Learning to represent the world and exploit the structure of the world; and the
structure of the world being the fact that it is compositional. As Einstein said:
The most incomprehensible thing about the world is that the world is comprehensible.
Like, you know, the world could be
extremely complicated.
And it looks like a conspiracy that we are able to understand at least part of the world.
And so Stuart Geman's version of this is that the world is compositional... or there is a God.
So this has led to incredible progress in things like computer vision, as you know, from
you know, being able to unreliably identify, you know, detect people, to being able to
generate masks for every object,
accurate masks, and then even to figure out the pose, and then do this in real time on a
mobile platform, you know.
I mean the progress has been sort of nothing short of incredible, and most of those things
are based on
object detection/recognition
architectures called RetinaNet, or feature pyramid networks; there are various names for it.
Or U-Net. Then another type called Mask R-CNN. Both of them actually originated from
Facebook.
Or, the people who originated them are now at Facebook—they sometimes came up with it
before they came to Facebook.
But, you know, those things work really well, you know, they can do things like that: detect
objects that are partially occluded, and,
you know, draw a mask of every object. So basically, this is a neural net, a convolutional net
where the input is an image.
But the output is also an image. In fact, the output is a whole bunch of images, one per
category. And for each category, the output map tells you which pixels belong to that category.
Those things can also do what's called "instance segmentation." So if you have a whole
bunch of sheep,
it can tell you, you know, not just that this region is sheep,
but actually pick out the individual sheep and tell them apart, and it will count the
sheep, right, and fall asleep.
That's what you're supposed to do, right? To fall asleep, you count sheep.
And
one great thing about deep learning is that a lot of the community has embraced the whole concept that
research has to be done in the open.
So a lot of the stuff that we're gonna be talking about in the class is, as you probably know,
published with code. And it's not just code; it's actually pre-trained models that you
can just download and run.
So that's
really new. I mean people didn't use to do research this way, particularly in industry.
But even in academia people weren't used to kind of distributing their code.
And somehow the race has kind of driven people to be more open about research.
So there's a lot of applications of all this, as I said, you know self-driving cars.
This is actually a video from Mobileye, and Mobileye was pretty early in this in using
convolutional nets for autonomous driving.
They had managed to shoehorn a convolutional net onto a chip that they had designed for
some other purpose. And they sold the system to car manufacturers.
I mean, not really self-driving; it's driving assistance, right.
Cars that can keep in lane on the highway and change lanes had this Mobileye system.
And that's pretty cool. So that's a convolutional net. It's a little chip that sits, you
know,
just behind the rearview mirror, and it looks out the windshield.
Since then (this was, you know, four or five years ago), this kind of technology has
been deployed much more widely. Now,
in some European countries, every single car that comes out, even low-end cars, has
convolutional-net-based vision systems. And they call this
AEBS: automatic emergency braking systems.
So not every car on the road has them yet, because, you know, people keep their cars for a
long time.
So this is probably the hottest topic in radiology these days: how to use deep learning
for radiology. This [slide image] is lifted from a paper by some of our colleagues here at
NYU,
where they analyzed MRI images. So there's one big advantage to convolutional nets here ...
The technique the system
uses: it's a 3D convolutional net that looks at the entire volume of an MRI image and
then produces, you know,
a segmentation; it uses a technique very similar to the one I was showing before for semantic segmentation.
And it produces,
you know, here, a segmentation of a femur bone; but, you know, it could be a
malignant tumor in another application.
And there are, you know, various other projects in medical imaging that are going on.
Okay.
Lots of applications in science and physics, bioinformatics, you know, whatever, which
we'll come back to. So ...
Okay.
So there are a bunch of mysteries in deep learning. They're not complete mysteries, because
people have some understanding of all this,
but they are mysteries in the sense that we don't have, like, a nice theory for everything.
One question that theoreticians were asking many years ago, when I was trying to convince the world that
deep learning was a good idea,
was this. They would tell me: Well, you can approximate any function with just two layers,
so why do you need more? And I'll come back to this in a minute.
I talked about the compositionality of natural images, or natural data in general. This is
true for speech also, and for all kinds of natural signals.
These are objections people were
throwing at neural nets; people who'd never played with neural nets were throwing them out
back in the old days. Like:
you know, you have no guarantee that your algorithm will converge. You know,
the way we train neural nets breaks everything that every textbook in statistics tells you:
if you have n data points, you shouldn't have more than n parameters.
You know, you might regularize. If you're a Bayesian, you might throw in a prior
(which is equivalent).
But
with neural nets, it's different: neural nets are wildly over-parametrized. We train neural nets with
hundreds of millions of parameters, routinely. They're used in production. And
the number of training samples is nowhere near that. How does that work?
But it works!
OK, so things we can do with deep learning today: You know, we can
have safer cars; we can have better medical analysis, medical image analysis systems;
we can have pretty good language translation, far from perfect, but useful;
stupid chatbots;
And,
you know, lots of applications in energy management and production, and all kinds of stuff;
manufacturing, environmental protection.
But we don't have really intelligent machines. We don't have machines with common
sense. We don't have
You know, I mean, there's a lot of things we don't know how to do, right. Which is why we
still do research.
But first we should say what representations are. So I talked about the
traditional model of pattern recognition.
But...
you know, you have your raw data, and you want to turn it into a form that is useful,
somehow.
Ideally, you'd like to turn it into a form that's useful regardless of what you want to do with it.
Sort of "useful" in a general way. OK, and it's not entirely clear what that means.
But, at least, you want to turn it into a representation that's useful for the task that you are
envisioning.
The question is whether there are, sort of, general ways to pre-process natural data in such a way that you produce good
representations of it.
There are things like tiling the space, or doing random projections. So random projection is
actually, kind of, a very old idea.
That was the idea behind the Perceptron. So the first layer of a perceptron is a layer of
random projections.
which you know, has a smaller output dimension than input dimension,
with some sort of non-linearity at the end, right. So think about a single layer neural net
with nonlinearities, but the weights are random.
Those are random projections.
Every once in a while, people come back to this idea, claiming that it's great because you don't have to do multi-layer training.
It came back in the 60s, and then it came back again in the 1980s, and then it came back
again.
And now it's back once more. There's a whole community, mostly in Asia, that works on
two-layer neural nets where the first layer is random; they call these "extreme learning
machines," OK.
It's, you know, from pixels to edges to textons, motifs, parts, objects. In text you have
characters, words, word groups, clauses, sentences, stories.
OK, so here are many attempts at dismissing the whole idea of deep learning. OK, first
thing. And these are things that I've heard for decades, OK,
from mostly theoreticians, but a lot of people, and you have to know about them because
they're going to come back in five years when people say, "Oh, deep learning sucks."
So, I'm sure many of you have heard about kernel machines and support-vector machines.
Who has at least a rough idea of what this is? OK, a few hands. Who has no idea what a
support-vector machine is?
OK, like most people haven't raised their hands for either.
OK, come on all the way up. Cool, all right. Who has no idea what it is? Don't be shy, it's
okay. [Alfredo: inaudible]
Right, so here's the support-vector machine. A support-vector machine is a two-layer neural
net.
It's not really a neural net, people don't like when it's formulated this way, but really you can
think of it this way.
each unit in the first layer compares the input vector X to one of the training samples X^i's.
OK, so you take your training samples, let's say you have a thousand of them—
and you have some function K that is going to compare X and X^i.
A good example of a function to compare the two: you take the dot product between X and
X^i, and you pass the result
through some nonlinearity, and that gives you a
response.
And, then you take those scores coming out of this K function that compares the input to
every sample and
you compute a weighted sum of them. And what you're going to learn are the weights, the
alphas.
Okay, so it's a two-layer neural net in which the second layer is trainable and the first layer
is fixed.
But in a way you can think of the first layer as being trained in an unsupervised manner,
because it uses the data
from the training set, but it only uses the X's, doesn't use the Y's.
It uses the data in the stupidest way you can imagine, which is: you store every X and compare against all of them.
Okay.
That's what support vector machine is. You can write a thousand page book about the cute
mathematics
behind that. But the bottom line is it's a two-layer neural net where the first layer is trained in
a very
stupid way, unsupervised, and the second layer is just a linear classifier.
So, it's basically glorified template matching, because it basically compares the input
vector to all the training samples.
The problem is, first of all, for every image you're gonna have to compare it with a million images,
or maybe a little less if you're smart about how you train it.
That's going to be very expensive, and the kind of comparison you're making is basically
what solves the problem; the weighted sum you're gonna get at the end is really the cherry
on the cake.
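A small sketch of that two-layer view of a kernel machine; the Gaussian kernel and the sizes are illustrative choices, and the alphas here are random rather than actually trained:

```python
import torch

X_train = torch.randn(1000, 50)     # 1000 stored training samples X^i
alphas = torch.randn(1000)          # the only trainable parameters

def kernel_machine(x):
    # first "layer" (fixed): compare x to every stored training sample;
    # K(x, x_i) is a Gaussian kernel here, one common choice
    scores = torch.exp(-((X_train - x) ** 2).sum(dim=1))
    # second layer (trainable): a weighted sum of the comparison scores
    return (alphas * scores).sum()

y = kernel_machine(torch.randn(50))
```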
So...
You can have theorems that show that you can approximate any
function you want, as close as you want, by
tuning the alphas.
And so, if you were to talk to a theoretician, they'll tell you: Why do you need deep learning?
The problem is that the number of terms in that sum can be very large, and nobody tells you what kernel
function to use. And so ...
Or, you can use a two-layer neural net, OK. So this is at the top right here. The first layer is a
nonlinear function
F applied to the product of a matrix W^0 with the input vector; and then the second layer
multiplies the result by a second matrix W^1.
OK, so this is a composition of two linear and non-linear operations. Again, you can show
that under some conditions
you can approximate any function you want with something like this.
OK, so if the dimension of what comes out of the first layer is high enough,
potentially infinite, you can approximate any function you want, as close as you want, by
making this layer bigger and bigger.
So again, you talk to theoreticians and they tell you: Why do you need layers? I can
approximate anything I want with two layers.
And...
For some of you this may sound familiar. For most of you probably not.
OK, so, when you design logic circuits, right, you have AND-gates and OR-gates and... or
NAND-gates, right.
And if you..
You can show that any Boolean function can be written as,
you know, a bunch of ANDs and then an OR on top of this. That's called disjunctive normal
form (DNF).
The problem is that for most functions, the number of terms you need in the middle is
exponential in the size of the input.
For example: if I give you N bits, and ask you to construct such a circuit that tells me if the number of bits that
are on is
even or odd (the parity function), the number of AND terms you need is exponential in N.
Think of it as a program: there are only two sequential steps that are necessary to run your program. So basically your
program has two sequential instructions.
You can have, you can run as many instructions as you want in your program
but they have to run in parallel, most of them. And you're only allowed two sequential
steps.
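A sketch contrasting the two regimes on the parity example (pure Python, tiny N):

```python
from itertools import product

N = 8
bits = [1, 0, 1, 1, 0, 0, 1, 0]

# "Two-step" DNF-style circuit: one AND term per odd-parity input pattern
odd_patterns = [p for p in product([0, 1], repeat=N) if sum(p) % 2 == 1]
print(len(odd_patterns))     # 128 = 2**(N-1): exponential in N
parity_dnf = any(tuple(bits) == p for p in odd_patterns)

# Sequential program: N steps, one XOR at a time, linear in N
parity_seq = 0
for b in bits:
    parity_seq ^= b

print(parity_dnf, bool(parity_seq))  # both False: four bits are on, which is even
```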
OK.
And, the kind of instructions you have access to are things like, you know, linear
combinations, nonlinearities—
So for most
problems,
the number of intermediate values you're going to have to compute in the first step is exponential in the size of the input.
There's only a tiny number of problems for which you're going to be able to get away with a
non-exponential number of minterms.
But if you allow your program to run multiple steps sequentially, then all of a sudden, you
know, you need far fewer
resources. People who design computer circuits know this, right. You can design, for
example, a circuit that adds two binary numbers. And
the simplest design computes one bit at a time: you know, taking the carry into account,
that gives you the second bit of the result, and then you propagate the
carry again, and so on.
The problem with this is that it takes time proportional to the size of the numbers. So circuit designers use tricks
so that the number of steps necessary to do an addition is actually not N, it's much less
than that. But you pay for it in the complexity of the circuit: the number of gates, like the area it takes on the chip.
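A sketch of the naive sequential adder described above (bit order and width are arbitrary choices):

```python
def ripple_add(a_bits, b_bits):
    """Add two binary numbers, least-significant bit first:
    one step per bit, each step waiting on the previous carry."""
    out, carry = [], 0
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)            # sum bit
        carry = (a & b) | (carry & (a ^ b))  # carry into the next step
    return out + [carry]

# 6 + 3 = 9, with 4-bit inputs (LSB first)
print(ripple_add([0, 1, 1, 0], [1, 1, 0, 0]))  # [1, 0, 0, 1, 0] -> 9
```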
So there's this exchange between time and space, or, between depth and
width...
So, what do we call "deep" models? You know, a two-layer neural net, one that has one hidden layer,
I don't call that "deep," even though technically it uses backprop. But, eh, you know, it
doesn't really learn
complex representations.
So there's this idea of hierarchy in deep learning. SVMs definitely aren't deep.
Unless you learn complicated kernels, but then they're not SVMs anymore.
So, here's an example I like. There is something called the manifold hypothesis, and it's the
idea that
natural data lives on low-dimensional surfaces. Imagine a
camera with a 1,000 x 1,000 pixel resolution. That's 1 million pixels, so each image is a vector with 3 million
components.
Among all the possible vectors with 3 million components, how many of them correspond
to what we would call natural images?
We can tell when we see a picture whether it's a natural image or not.
We have a model in our visual system that tells us this looks like a real, like a real image.
Natural images are a tiny, tiny subset of the set of all possible images. There are way more ways of combining
random pixels into
nonsensical images than there are ways of combining pixels into things that look like
natural images.
So natural images live on, kind of, a thin surface in that huge
ambient space.
I take lots of pictures of a person making faces, right. So the person is in front of a white
background.
And she, kind of, moves her head around and, you know, makes faces, etc.
The set of all images of that person (say, I take a long video of that person) forms
a surface in that 3-million-dimensional space.
So a question I have for you is, What's the dimension of that surface?
Yeah, you've probably heard my spiel before, but... [Speaker: What did the person say?]
OK, so for whoever hasn't heard this, you have a shot, another shot at an answer.
Anyone. You can look down your laptop, but, you know, I can point at you or something.
Yes.
You, any idea? Maybe you heard what he said. [Inaudible student comment.]
It's a 1D space.
Any idea?
OK, the images I'm taking are a million pixels. OK, so the ambient space is 3 million
dimensions.
And the person can move the head, you know, turn around, things like this. But not really
move the whole body.
I mean you only see the face, it's mostly centered.
[Student: A thousand.]
The surface area of the person. Right. So it's bounded by the number of pixels
occupied by the person. That's for sure. That's an upper bound.
Yes.
Those pixels, of course, are not gonna take all possible values. So that's a loose upper
bound. Any other idea?
So there are 3 degrees of freedom due to the fact that you can tilt your head this way, that
way, or that way.
Then there is translation, this way, that way. Maybe this way and that way, maybe up or
down. That's 6.
And then the number of muscles in your face, right. So you can
smile. You can,
you know, pout. You can do all kinds of stuff, right. And you can do this, you know,
independently. You close one eye.
So, add however many independent muscles you have; not counting the tongue, because there are
tons of muscles in the tongue.
So,
locally, if you want to parameterize the surface occupied by all those pictures, to move from
one picture to another, you only need that small number of
parameters to determine the position of a point on that surface. Of course it's a highly
nonlinear surface.
So what you'd like is an ideal feature extractor to be able to disentangle the explanatory
factors of variation of what you're observing.
it's not just I move my muscles and I move my head around—each of those is an
independent factor of variation. Again
You know, the lighting could change. That's another set of, you know, variable—
variables. And what you'd like is a representation that basically individually represents each
of those factors of variations.
And the bottom line is that nobody has any idea how to do this in general. OK. That's the ideal of
representation learning.
Yes.
OK, so the question is: Is there some sort of pre-processing, like PCA, that will find those
factors? Yeah, so if the surface were a plane,
then PCA would find the dimension of that plane; principal component analysis, right. But it's not a plane.
If you take me and my oldest son, who looks like me,
the distance between our images will be relatively small, even though we're not the same
person. But if you take my image and shift it by
20 pixels,
there's more distance between me and myself shifted than there is between me and my
son, OK.
So...
the manifold of my face, you know, is some complicated manifold in that space.
My son's is a slightly different manifold, which does not intersect mine.
Yet these two manifolds are very close to each other; they're closer to each
other than
two samples from my own manifold can be from each other. So PCA is not
going to tell you anything, basically.
OK, here is another reason why that surface is not a plane.
Now imagine the manifold, a one-dimensional manifold, of me
turning my head all the way around, 360 degrees. That traces out a closed curve, not a straight line.
OK.
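A small sketch of that point: a closed 1-D curve (like the head turning 360 degrees) embedded in a high-dimensional space shows up in PCA as two principal components, not one. The data here are synthetic:

```python
import math
import torch

# a 1-D closed manifold (a circle) embedded in 100 dimensions
t = torch.linspace(0, 2 * math.pi, 500)
circle = torch.stack([torch.cos(t), torch.sin(t)], dim=1)  # 500 x 2
embedding = torch.randn(2, 100)                            # random linear embedding
X = circle @ embedding                                     # 500 points in R^100

# PCA via SVD of the centered data
X = X - X.mean(dim=0)
S = torch.linalg.svdvals(X)
print(S[:4])  # two large singular values, the rest ~0: PCA says "2 dimensions"
              # even though the manifold is intrinsically one-dimensional
```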