Transcript Lec1

OK, so first of all

I have a terrible confession to make. This class is actually being run not by me, but by these
two guys:

Alfredo Canziani and Mark Goldstein, whose names are here [points to slide]. They are the
TAs and

you'll talk to them much more often than you'll talk to me.

That's the first thing. The other confession I have to make is that if you have questions
about this class,

don't ask them at the end of this course because I have to run right after the class to catch
an airplane.

But that can wait until next week.

OK, so let's start right in.

Some very basic course information. There is a website as you can see.

I will do what I can to post the PDF of the slides on the website.

Probably just before the lecture, probably just a few minutes before the lecture, usually.

But, but it should be there by the time you get to class, or at least by the time I get to class.

There's going to be nine lectures that I'm going to teach, on Monday evenings.

There is also a practical session every Tuesday night

that, further on, Mark

will be running. So, they'll go through some of the, you know, practical

questions. Some refreshers on the sort of mathematics that is necessary for this, and basic concepts. Some tutorials on how to use
PyTorch and various other

software tools.

And there's going to be three guest lectures. The names of the guest lecturers are not
finalized.

But it's gonna be on topics like natural language processing,

computer vision, self-supervised learning, things like that.


There's going to be a midterm exam, or at least we think there is.

And it's going to take one of those sessions, you know, around March.

And the evaluation will be done on the midterm and on a final project.

And, you can sort of, you know, band together in groups of two.

Did we say two or three, or just two?

We didn't decide yet, we'll see.

The project will probably have to do with a combination of self-supervised learning and
autonomous driving. We are

discussing with various people about data and

things like that.

Okay, let me talk a little bit about, so this first lecture is really going to be sort of a broad
introduction about

what deep learning is really, and what it can do and what it cannot do.

So it will serve as an introduction to the entire thing.

So, we'll go through the entire arc, if you want, of the class

but in, sort of, very superficial terms

so that you get,

sort of, a broad high-level idea of all the topics we're talking about. And whenever I talk
about a particular topic

you'll see where it fits in this kind of whole

picture.

But before that...

So, there is a prerequisite for the class which is, you know:

You need to be kind of familiar with machine learning or at least basic

concepts in machine learning.

Who here has

played with PyTorch, TensorFlow, has trained a neural net?


OK. Who has not done that? Don't be shy, OK. OK? So the majority has.

Which is good.

But I'm not going to assume that you know everything about this. Particularly, I am not going
to assume that you know a lot of the,

you know, sort of deep underlying techniques. OK, so here is the course plan and

depending on what you tell me

I can adjust this and sort of, you know,

go faster on certain sections that you think are too obvious because you've played with this
before, or other things.

So intro to

Supervised Learning, Neural Nets, Deep Learning. That's what I'm going to talk about today.

What deep learning can do, what it cannot do, what are good features. Deep learning is
about learning representations.

That's what I'm going to talk about.

Next week will be about back propagation and basic architectural components. So things
like

the fact that you build neural nets out of modules that you connect with each other. Then
you compute gradients, you get automatic

differentiation. And then these various types of architectures, loss functions,

activation functions — you know, different modules. Tricks like weight sharing and weight
tying,

multiplicative interactions,

attention gating — things like this, right.

And then particular

macro architectures, like mixture of experts, Siamese net, hyper networks, etc.

So we'll dive pretty quickly in and that's appropriate if you've already played with some of
those things.
Then there will be either one or two lectures – I haven't completely decided yet — about
convolutional nets and their applications.

One of them might end up being a guest lecture.

Then, more specifically, about deep learning architectures that are useful in special cases.

So things like recurrent neural nets with back propagation through time, which is the way
you train recurrent neural nets.

And ...

... and sort of ...

... applications of ...

recurrent neural nets to things like control and, you know, producing time series and stuff
like that.

Then things that combine recurrence and

gating

and multiplicative interactions like gated recurrent units and LSTM.

And then things that really use multiplicative interactions as kind of

really a basis of their architecture like memory networks, transformers, adapters, etc.,
which are sort of very recent

architectures that have become extremely popular in things like NLP and other

other areas. And then a little bit about graph neural nets, which I'm not going to talk about
a lot because there is another

course that you can take by Joan Bruna where he spends a lot of time on graph neural nets.

Then...

then we'll talk about how we get those deep learning systems to work. And, so, various
tricks to get them to work.

Sort of understanding the type of optimization that takes place in neural nets. So...

...the type of...

you know, we...


we use, of course — learning is always about, almost always about, optimization. And deep
learning is almost always about gradient-based optimization.

And there are certain rules about optimization in the convex case that are well understood.

But they're not well understood when the training is stochastic,

which is the case for most deep learning systems. And they're not very well understood

also in deep learning, because the cost function is not ...

is not convex.

It has local minima, and saddle points, and things like this. So it's important to understand
the geometry of the objective function.

I say it's important to understand but

the big secret here is that nobody actually understands. So ...

it's important to understand that nobody understands. Okay?

But there are a few tricks that have

come up through a combination of intuition and a little bit of theoretical analysis and

empirical search. Things like initialization tricks,

normalization tricks, and regularization tricks like dropout.

(Gradient clipping is more for optimization.) Things like momentum,

averaged SGD, the various methods for parallelizing SGD.

Many of which do not work. And then something a little exotic called target prop and the
Lagrangian formulation of back prop.

Then I'll switch to my favorite topic, which is energy-based models. So this is sort of a

general formulation of a lot of different, sort of, approaches to learning — whether they are
supervised, unsupervised, self-supervised.

And whether they involve things like inference.

Like, you know, searching for the value of variables that nobody tells you the value of, but
that your process

your system is supposed to infer.


So that could be thought of as sort of a way of implementing reasoning with neural nets.

So you could think of reasoning in neural nets as a process by which you have

some energy function that's being optimized with respect to some variables. And the value
you get as a result of this optimization

is the value of those variables you were trying to find. And so,

there is sort of the common view that a neural net is just a function that computes its
output as a function of its

input. So you just run through the neural net, you get an output.

But that's a fairly restrictive

form of inference in the sense that you can only produce one output for a given input.

But very often there are multiple possible answers to a given input. And so, how do you,

kind of, represent problems of this type where there are multiple answers, multiple
possible answers, to a given input? And one answer to this is:

You make those answers the minima of some energy function, and your inference
algorithm is going to find values of those

variables that minimize this objective function. And there might be multiple minima. So
that means your model might produce multiple

outputs for a given input. Okay?

So, energy-based models, it's kind of a way of doing this. Special cases of those energy-based models are

probabilistic models: like, you know Bayesian methods,

graphical models, Bayesian nets, and things like this. Energy-based methods are a little more
general, so a little less specific.

So special cases of this include things like what people used to call structured prediction.

And then there is a lot of applications of this in what's called self-supervised learning. And
that will be the topic of the next couple lectures.

So self-supervised learning is a very very active topic of research today.

And probably something that's going to become really dominant in the future. It's already ...
... in the space of a year, it's become dominant in natural language processing. And, in the
last few months, just three months

there's been a few papers that show that self-supervised learning methods actually

work really well in things like computer vision as well. And so my guess,

my guess is that ...

self-supervised learning is going to take over the world

in the next few years. So, I think it's useful to hear about it in this class.

The things like —

I'm not going to go through a laundry list of this — but there are

things that you may have heard of: like variational auto-encoders, de-noising auto-
encoders,

BERT, which is, you know, those

transformer architectures that are trained for natural language processing. They are trained
with self-supervised learning and they are a special case of a de-noising auto-encoder.

So a lot of those things

you may have heard of without realizing they were all, kind of —

it can be all understood in the context of this sort of energy-based approach. And that
includes also generative adversarial networks (GANs),

which I'm sure many of you have heard of.

And then there is self-supervised learning and beyond.

So, you know, how do we get machines to really kind of become really intelligent? They're
not superintelligent.

They're not very intelligent right now. They can solve very narrow problems very well,

sometimes with superhuman performance. But no machine has any kind of common
sense. And, the most intelligent machines that we have

probably have less common sense than a house cat. So, how do we get to cat-level
intelligence first, and then maybe human-level intelligence?

And, I don't pretend to have the answer, but I have, you know —
a few ideas that are interesting to discuss in the context of self-supervised learning, there.

I've had some applications. Any questions? So that's the plan of the course. Okay. It might
change, dynamically.

But, at least, that's the intent. Any questions?

...Okay...

[Student Question: Will we also be having assignments in the course?] Yeah, yeah, there
are assignments.

[...Inaudible...]

Okay, so for those of you who didn't hear Alfredo, because he didn't speak very loudly:

The final project is actually going to be a competition between the teams.

So there is going to be a leaderboard and everything.

And, in preparation for this, the assignments will be basically

practice to get familiar with all the techniques that you would need for

deep learning in general, but for the final project in particular.

[Also for the midterm.]

Right, also for midterm, obviously.

Okay, so most of you probably know that — and this is gonna

probably be boring for some of you who've already played with those things — but let's start
from the basics.

Deep learning is inspired by what people have observed about the brain, but the inspiration
is just an inspiration.

It's not, the attempt is not to copy the brain because

there are a lot of details about the brain that are irrelevant, or that we don't know whether they
are actually relevant to human intelligence.

So the inspiration is at kind of the conceptual level.

And, it's a little bit the same as, you know, airplanes being inspired by birds.

The underlying principles of flight for birds and airplanes are essentially the same, but the
details are extremely different.
They both have wings. They both generate lift by propelling themselves through air, but, you
know, airplanes don't have feathers and don't flap their wings.

So it's a bit of the same idea. And the history of this goes back to a field

that has kind of almost disappeared, or, at least changed names now, called Cybernetics.

If you want a specialist on the history of cybernetics, he is sitting right there:

Joe Lemelin

Can you raise your hand?

So, Joe is actually a philosopher. And he is interested in the — he actually has a seminar on,
kind of, the history of AI.

In what department is this?

[Joe: Media, Culture, and Communication.]

Media, Culture, and Communication. So, not CS.

He knows everything about, you know, the history of cybernetics. So it started in the 40's
with

two gentlemen: McCulloch and Pitts. Their picture is on the top right here.

And they came up with the idea that,

you know, people at the time were interested in logic, but

neuroscience was a very sort of nascent field.

And, they got the idea that if neurons are basically threshold units that are on or off,

then by connecting neurons with each other, you can build Boolean circuits and you can
basically do logical inference with neurons.

So they say, you know:

The brain is basically a logical inference machine because the neurons are binary. And this
idea —

So the idea was that a neuron

computes a weighted sum of its inputs and then compares the weighted sum to a
threshold. It turns on

if it's above the threshold, turns off if it's below.


Which is a sort of simplified view of how real neurons work — a very, very simplified view.
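Just to make that model concrete, here is a minimal sketch of such a threshold unit in Python. The weights, inputs, and threshold below are made up for the example, not taken from the lecture.

    # A McCulloch-Pitts-style threshold unit: weighted sum of binary inputs, compared to a threshold.
    def threshold_unit(inputs, weights, threshold):
        s = sum(w * x for w, x in zip(weights, inputs))   # the weighted sum
        return 1 if s >= threshold else 0                 # turns on above the threshold, off below

    # With suitable weights and threshold it behaves like a logic gate, e.g. a two-input AND,
    # which is how binary neurons can be wired into Boolean circuits:
    print(threshold_unit([1, 1], [1, 1], threshold=2))    # 1
    print(threshold_unit([1, 0], [1, 1], threshold=2))    # 0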

That model kind of stuck with the field for decades.

Almost four decades.

Actually, a full four decades.

Then there was, you know, quasi-simultaneously Donald Hebb, who had the idea that

neurons in the brain — it's an old idea that

the brain learns by modifying the strength of the connections between the neurons, which are
called synapses. And he had the idea

of what's now called Hebbian learning, which is that if two neurons fire together, then the
connection that links them

increases. And if they don't fire together, maybe it decreases.

That's not really a learning algorithm, but it's sort of a first idea, perhaps.

And then cybernetics was proposed by this guy Norbert Wiener, who is here. [Bottom Right]

This is the whole idea that by having systems that, kind of, have sensors

and have actuators, you can have feedback loops and you can have, you know, self-
regulating systems.

And, what's the theory behind this? You know, we sort of take that for granted now. But the
idea that

you know, you have things, like kind of,

for example:

You know, you drive your car, right? You turn the wheel and

there's a so-called PID controller that actually turns the wheel in proportion to how you turn
the steering wheel.

And it's a feedback mechanism

that basically measures the position of the steering wheel, measures the position of

the wheel of the car. And, then, if there is a difference between the two, kind of corrects the
wheels of the car so that

they match the orientation of the steering wheel. That's a feedback mechanism.


That, the stability of this, and the rules about this basically all come initially from this work.

That led to a gentleman by the name of Frank Rosenblatt to basically imagine

learning algorithms that modified the weights of very simple neural nets. And what you see
here at the bottom

the two pictures here [Bottom Left]: This is Frank Rosenblatt, and this is the Perceptron.
This was a physical

analog computer. It was not a three-line Python program,

which is what it is now.

It was a gigantic machine with, you know, wires and

optical sensors so you could show it pictures. It was very low resolution.

And then it had —

it had neurons that could compute a weighted sum, and the weights could be adapted. And
the weights were potentiometers.

The potentiometers had motors on them so they could rotate for the learning algorithm.

So it was electro-mechanical. And what he's holding here in his hand is a module of eight
weights

with (you can count them), with those potentiometers, motorized potentiometers on them.

Okay, so

that was a little bit of history of where neural nets come from.

Another interesting piece of history is that this whole idea of, sort of, trying to build
intelligent machines by basically simulating networks of neurons

was born in the 40's, kind of took off a little bit in the late-50's, and

completely died in the 1960's, in the late-1960s

when people realized that with the kind of learning algorithms and architectures that
people were proposing at the time

you couldn't do much. You know, you could do some basic, very simple pattern
recognition, but you couldn't do much.

So between
1969, roughly, and —

or 1968 — and

1984, I would say,

basically, nobody in the world was working on

neural nets

except a few kind of isolated researchers mostly in Japan. Japan is its own, kind of,

relatively isolated ecosystem for funding research. People don't listen to the same kind of
...

... fashions, if you want. And then the field took off again in 1985, roughly, with the

emergence of backpropagation [backprop]. So backpropagation is

an algorithm for training multi-layer neural nets, as many of you know.

People were looking for something like this in the 60's and basically didn't find it.

And the reason they didn't find it was because they had the wrong neurons.

They were using

McCulloch-Pitts neurons that are binary.

The way to get backpropagation to work is to use an activation function that is continuous and

differentiable, or at least continuous.

And people just didn't have ...

you know, the idea of using continuous neurons. And so, they didn't think that you could
train those systems with

gradient descent, because things were not differentiable.

Now there's another reason for this, which is that if you have a neural net with binary
neurons,

you never need to compute multiplications. You never need to multiply two numbers.

You only need to add numbers, right. If your neuron is active, you just add the weight to the
weighted sum.

If it is inactive, you don't do anything.


If you have continuous neurons,

you need to multiply the activation of a neuron by a weight to get a contribution to the
weighted sum.

It turns out, before the 1980's, multiplying two numbers, particularly floating point
numbers, on

any sort of non-ridiculously expensive computer was extremely slow. And so there was an
incentive to not use

continuous neurons for that reason.

So the reason why backprop didn't emerge earlier than the mid 80's is because that's
when, you know,

computers became fast enough to do floating point multiplications, pretty much.

People didn't think of it this way, but that's, you know, kind of retrospectively, that's pretty
much what happened.

So,

there was a wave of interest in neural nets between 1985 and

1995 — lasted about 10 years. In 1995, it died again. People in machine learning

basically abandoned the idea of using neural nets.

(For reasons that I'm not gonna go into right now.)

And that lasted until

the late 2000s, early 2010s. So, around 2009/2010 people realized that you could use

multi-layer neural nets, training with backprop,

and get an improvement for speech recognition. It didn't start with ImageNet. It started with
speech recognition, around 2010.

And within 18 months of the first papers being published on this,

every major player in speech recognition had deployed commercial speech recognition
systems that use

neural nets. So, if you had an Android phone and you were using any of the speech
recognition features in an Android phone

around 2012

that used neural nets. That was probably the first really, really wide

deployment of, kind of, modern forms of deep learning, if you want.

Then at the end of 2012 / early-2013, the same thing happened in computer vision, where
the computer vision community realized

deep learning, convolutional nets in particular, work much better than whatever it is that
they were using before, and

started to switch to using convolutional nets, and basically abandoned all previous

techniques.

So that created a second revolution, now in computer vision. And then three years later,
around 2016 or so,

the same thing happened in natural language processing — in language translation, and
things like this: 2015/16.

And, now we're going to see — it's not happened yet — but we're going to see the same
revolution occur

in things like robotics, and control, and, you know, a whole bunch of application areas.

But let me get to this:

Okay. So, you all know what supervised learning is, I'm sure.

And the vast majority,

90-some percent,

of applications of deep learning use supervised learning as kind of the main thing. So
supervised learning is the process by which

you collect a bunch of pairs of inputs and outputs

of examples of, let's say, images together with a category (if you want to do image
recognition). Or a bunch of

audio clips with their text

transcription; a bunch of text in one language with the translation in another language,
etc.

and you feed


one example to the machine. It produces an output. If the output is correct

you don't do anything, or you don't do much. If the output is incorrect

then you tweak the parameters of the machine. (Think of it as a parametrized

function of some kind.)

And you tweak the parameters of that function in such a way that the output gets closer to
the one you want.

Okay. This is in non-technical terms what supervised learning is all about.

Show a picture of a car, if the system doesn't say car, tweak the

parameters. The parameters in the neural net are going to be the weights,

you know, that compute weighted sums in those simulated neurons.

You tweak

the knobs so that the output gets closer to the one you want.

The trick in neural nets is: How do you figure out in which direction and by how much to
tweak the knobs so that the output

gets closer to the one you want?

That's what gradient computation and backpropagation is about. But before we get to this,

a little bit of history again. So there was a flurry of models, basic models for

classification. You know, starting with the Perceptron, there was another competing model
called the Adaline,

which is on the top right here. They are based on the same kind of basic architecture:
compute the weighted sum of the inputs,

compare it to a threshold. If it's above the threshold,

turn on; if it's below the threshold, turn off.

What you see, the Adaline here, the thing that Bernie Widrow is tweaking is actually a
physical

analog computer again.

So it's like the Perceptron, but it was, you know, much smaller, in many ways.
The reason I tell you about this is that the Perceptron actually was a two-layer neural
net, a two-layer neural net in

which the second layer was trainable

with adaptive weights. But the first layer was fixed. In fact, most of the time, with most
experiments, it was

determined randomly. You would, like, randomly connect

input pixels of an image to

neurons that would, you know, be threshold neurons with random weights, essentially.

This is what they called the associative layer. And that basically

became the basis for the sort of conceptual design of a pattern recognition system for the
next four decades.

I want to say four decades.

Yeah, pretty much.

So, that model is one by which you take an input,

you run it through a feature extractor that is supposed to extract the relevant

characteristics of the input that will be useful for the task.

So, you want to recognize a face? Can you detect an eye? How do you detect an eye?

Well, there is probably a dark circle somewhere.

Things like that, right? You want to

recognize a car, you know. Well, they are kind of dark, round things, etc.

So, the problem here — and so, what this feature extractor produces is a vector of

features, which are things that may be numbers, or they may be on or off.

Okay, so it's just a list of numbers, a vector. And you're going to feed that vector to a trainable
classifier. In the case of the

Perceptron or a simple neural net, it's gonna be just a system that computes a weighted
sum, compares it to a threshold.
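As a rough illustration of that two-stage picture (this is just a sketch I'm adding, with made-up sizes and data, not the actual Perceptron), here is a fixed random feature extractor followed by a trainable linear classifier in PyTorch:

    import torch, torch.nn as nn

    torch.manual_seed(0)
    x = torch.randn(32, 784)             # a batch of 32 fake "images" with 784 pixels each (made-up data)
    y = torch.randint(0, 10, (32,))      # made-up labels among 10 categories

    extractor = nn.Linear(784, 256)      # the fixed "associative layer": random weights, never trained
    for p in extractor.parameters():
        p.requires_grad_(False)

    classifier = nn.Linear(256, 10)      # the trainable classifier on top of the features
    opt = torch.optim.SGD(classifier.parameters(), lr=0.1)

    features = (extractor(x) > 0).float()                     # binary, threshold-style features
    loss = nn.functional.cross_entropy(classifier(features), y)
    loss.backward()                                           # only the classifier's weights get gradients
    opt.step()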

The problem is that you have to engineer the feature extractor. So the entire literature of
pattern recognition
(statistical pattern recognition at least)

and a lot of computer vision (at least the part of computer vision that's interested in
recognition) was focused on

this part: the feature extractor. How do you design a feature extractor for a particular
problem? You want to do,

I don't know, Hangul character recognition. What are the relevant

features for recognizing Hangul? And how can you extract them using all kinds of
algorithmic tricks?

How do you pre-process the images? You know, how do you normalize their size? You know,
things like that.

How do you skeletonize them? How do you segment them from their background?

So the entire literature was devoted to this [Feature Extractor], very very little was devoted
to that [Trainable Classifier].

And what deep learning brought to the table is this idea that instead of having this kind of
two-stage process

for pattern recognition, where one stage is built by hand,

where the representation of the input is the result of,

you know, some hand-engineered program, essentially.

The idea of deep learning is that you learn the entire task end-to-end.

Okay, so basically you build your

pattern recognition system, or whatever it is that you want to do with it

as a cascade or a sequence of

modules. All of those modules have tunable parameters.

All of them have some sort of nonlinearity in them.

And then you stack them, you stack multiple layers of them, which is why it's called deep
learning.

So the only reason for the "deep" word in deep learning is the fact that there are multiple
layers. There is nothing more to that.
And then you train the entire thing end-to-end. So the complication here, of course, is the
fact that

the parameters that are in the first box: How do you know how to tune them so that the
output does—

you know, goes closer to the output you want?

That's what backpropagation does for you.

Okay, why do all those modules have to be

nonlinear? It's because, if you have two successive modules, and they're both linear, you
can collapse them into a single linear module.

Right, the product of two linear

functions, or the composition of two linear functions, is a linear function. Take a vector,
multiply it by a matrix, and then multiply it by a second matrix.

It's as if you had pre-computed the product of those two matrices, and then multiplied the
input vector by that

composite matrix. So there's no point having multiple layers if those layers are linear.

Okay, there's actually a point, but it's a minor point.
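You can check this numerically in a couple of lines (random matrices, purely for illustration):

    import torch

    torch.manual_seed(0)
    x  = torch.randn(5)        # some input vector
    W1 = torch.randn(4, 5)     # first linear layer, as a matrix
    W2 = torch.randn(3, 4)     # second linear layer

    print(torch.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x))   # True: two linear layers collapse into one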

So, since they have to be nonlinear, what is the simplest

multi-layer architecture you can imagine that has

parameters that you can tune — things like weights in the neural nets — and is nonlinear?

And you realize quickly that

it has to look something like this:

So ...

Take an input. An input can be represented as a vector, right. An image is just a list of
numbers.

Think of it as a vector, ignore the fact that it's an image for now.

Piece of audio, whatever it is that your sensors or your data set gives you, is a vector.

Multiply this vector by a matrix. The coefficients in this matrix are the tunable parameters.
And then take the resulting vector, right — when you multiply a matrix by a vector, you get a
vector.

Pass each component of this vector through a nonlinear function.

And,

if you want to have the simplest possible nonlinear function, use something like

what's shown at the top here [ReLU(x) = max(x, 0)], which people in neural nets call the
ReLU; people in engineering call this half wave rectification;

people in math call this positive part.

Whatever you want to call it, okay.

So, apply this nonlinear function to every component of the vector that results from
multiplying the input vector by the matrix.

Okay. Now you get a new vector, which has lots of zeros in it, because whenever the
weighted sum was less than zero,

the output is zero, if you pass it through the ReLU. And then repeat the process: take that
vector, multiply it by a weight matrix;

pass the result through a pointwise nonlinearity; take the result, multiply it by another
matrix; pass the result through

nonlinearities.

That's a basic neural net, essentially.

Okay now

Why is that called a neural net at all? It's because when you take a vector and you multiply

a vector by a matrix,

to compute each

component of the output, you actually compute a weighted sum of the components of the
input,

weighted by a corresponding row in the matrix, right. So this little symbol here — there's a bunch of

components of the vector

coming into this layer. And,


you take a row of the matrix, compute a weighted sum of those values where the weights
are the

values in the row of that matrix, and that gives you a

weighted sum. And you do this for every row, and that gives you the

result, right.

So, the number of units after the multiplication by a matrix is going to be equal to the
number of rows of your matrix.

And the number of columns of the matrix, of course has to be equal to the size of the input.
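In PyTorch, that stack of matrix multiplications and pointwise ReLUs looks like this. The layer sizes here are arbitrary example choices, not anything prescribed in the lecture.

    import torch, torch.nn as nn

    net = nn.Sequential(
        nn.Linear(784, 128),   # a weight matrix with 128 rows and 784 columns (784 = size of the input)
        nn.ReLU(),             # the positive part / half-wave rectification, applied to each component
        nn.Linear(128, 64),
        nn.ReLU(),
        nn.Linear(64, 10),     # say, one output per category
    )

    x = torch.randn(1, 784)    # a made-up input vector, e.g. a flattened image
    print(net(x).shape)        # torch.Size([1, 10])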

Okay. So supervised learning — in slightly more formal terms than the one I showed earlier

— is the idea by which you're going to compare the output that the system produces...

So, right, you show an input, you run through the neural net, you get an output. You're going
to compare this output with a target output.

And you are going to have an objective function,

a loss module, that computes a distance, a discrepancy,

penalty — whatever you want to call it.

Divergence — okay, there's various names for it.

And then you're going to compute the average of that.

So the output of this cost function is a scalar, right.

It computes the distance, for example the Euclidean distance, between a target vector and the
vector that

the neural net, the deep learning system, produces.

And then you can compute the average of this cost function, which is just a scalar.

You're going to average it over a training set, right.

So, a training set is composed of a bunch of pairs of inputs and outputs; compute the
average of this over the training set.

The function you want to minimize with respect to the parameters of the system (the
tunable knobs) is that average.
Okay. So, you want to find the value of the parameters that minimizes the average error
between the

output you want and the output you get, averaged over a training set of samples.
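In symbols, the thing being minimized is L(w) = (1/N) Σ_i C(f(x_i, w), y_i): the per-sample cost averaged over the N training pairs. A minimal sketch of that objective in PyTorch (the model, the choice of cost, and the data are placeholders for illustration):

    import torch, torch.nn as nn

    model = nn.Linear(10, 3)             # stand-in for the parametrized function f(x, w)
    cost = nn.MSELoss()                  # one possible discrepancy: mean squared (Euclidean) distance

    X = torch.randn(100, 10)             # 100 made-up training inputs
    Y = torch.randn(100, 3)              # 100 made-up target outputs

    objective = cost(model(X), Y)        # the cost averaged over the whole training set
    print(objective.item())              # a single scalar, minimized with respect to the weights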

So...

I'm sure the vast majority of people here, sort of, have at least an intuitive understanding of
what gradient descent is.

So, basically, the way to minimize this is to compute the gradient, right.

It's like you are in a mountain, you're lost in the mountain (and

it's a very smooth mountain), but

there is fog and it's night, and you want to go to the village in the valley.

And so, the best thing you can do is

you turn around and you see which way is down, and you take a step down in the direction
of steepest descent.

Okay. So, this search for the direction that goes down: that's called "computing a gradient"
or, technically, a negative gradient.

Okay, then you take a step down. That's taking a step down in the

direction of negative gradient. And if you keep doing this and your steps are small enough —

small enough so that when you take a step, you don't jump to the other side of the
mountain — then

eventually you're going to converge to the valley, if

the valley is convex. Which means that there is no, kind of, lake, no mountain lake in the
middle where, you know,

there's kind of a

local minimum, and you're going to get stuck in that minimum; the valley might be lower, but you
don't see it.

Okay. So that's why convexity

is important as a concept.
But here is another concept, which is the concept of stochastic gradient, which I'm sure
again a lot of you have heard [of].

I'll come back to that

in more detail. The objective function you're computing is an average over many many
samples.

You can compute the objective function and its gradient

over the entire training set by averaging

the value over the entire training set.

But it turns out it is more efficient to just take one sample or a small group of samples,

compute the error that

this sample makes, then compute the gradient of that error with respect to the
parameters, and take a step.

A small step.

Then a new

sample comes in — you're going to get another

value for the error and another value for the gradient, which may be in a different direction
because it's a different sample.

And take a step in that direction.

And if you keep doing this, you're gonna go down

the cost surface, but in kind of a noisy way — there's going to be a lot of fluctuations.

So what is shown here is an example of this. This is stochastic gradient applied to a very
simple problem with two dimensions

where you only have two weights.

And it looks kind of semi-periodic because the examples are always shown in the same
order,

which is not what you're supposed to do with stochastic gradient. But as you can see the
path is really erratic.

Why do people use this? There's various reasons.


One reason is that, empirically it converges much, much faster, particularly if you have a
very large training set.

And the other reason is that you actually get better generalization in the end.

So if you measure the performance of the system on a separate set that you —

I assume you all know the concepts of "training set" and "test set" and "validation set".

So if you test the performance of the system on a different set, you generally get better

generalization if you use stochastic gradient than if you actually use the real, true gradient
descent.

The problem is... yes

[inaudible student question]

No.

It's worse. So computing the entire gradient of the entire dataset —

It is computationally feasible. I mean you can do it. It's

not any more expensive than...

...you know... [inaudible student comment]

It'll be less noisy but it will be slower.

So let me tell you why: I mean, this is something we're gonna talk about again when we talk
about optimization.

But let me tell you: I give you a training set with a million training samples. It's actually

100 repetitions of the same training set of 10,000 samples. Okay. So my actual
training set is 10,000 samples.

I replicate it 100 times and I claim that, you know,

I scrambled it. I tell you here is my training set with a million training samples.

So if you do a full gradient, you're going to

compute the same values a hundred times. You're gonna spend a hundred times more
work than necessary.

Without knowing it.


Okay. So this only works because of repetition. But it also works in, kind of, more normal
situations in machine learning where you have

samples that have a lot of

redundancy in them, like very many samples that are very similar to each other, etc.

So if there is any kind of coherence — if your system is capable of generalization,

then that means stochastic gradient is going to be more efficient, because

if you don't use stochastic gradient, you're not going to be able to take advantage of that
redundancy.

So that's one case where noise is good for you.

Okay.

Don't pay attention to the formula. Don't get scared because we're going to come back to
this in more detail.

But,

why is backpropagation called backpropagation?

Again, this is very informal.

It's basically a practical application of the chain rule. So you can think of

a neural net of the type that I showed you earlier as a bunch of modules that are stacked on
top of each other.

And you can think of this as compositions of functions.

And you all know the basic rule of calculus. You know, how do you compute the derivative
of a function

composed with another function?

Well the derivative of, you know, F composed with G is the

derivative of F at the point of G of X,

multiplied by the derivative of G at the point X. Right. So you get the product of the two

derivatives. So this is the same thing except that the functions, instead of being scalar
functions, are vector functions.

They get vectors as inputs and produce vectors as outputs. More generally, actually,
they take multi-dimensional arrays as input and multi-dimensional arrays as output, but
that doesn't matter.

Basically, what is the generalization of this chain rule

in the case of

functional modules that have multiple inputs, multiple outputs that you can view as
functions? Right.

And, basically, it's the same rule if you, kind of, blindly apply it — it's the same rule

as you apply for regular derivatives.

(Except here you have to use partial derivatives.)

You know,

what you see in the end is that if you want to compute the derivative of

the difference between the output you want and the output you get,

which is the value of your objective function, with respect to any variable inside of the
network,

then you have to kind of back, you know, propagate derivatives backwards and kind of
multiply things on the way.

All right. We'll be much more formal about this next week. For now,

you just know why it's called backpropagation: because it applies to multiple layers.
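As a tiny illustration of that chain rule (with made-up scalar functions, nothing to do with a real network), PyTorch's autograd computes exactly this product of derivatives for you:

    import torch

    x = torch.tensor(2.0, requires_grad=True)
    g = x ** 2                       # g(x) = x^2,     dg/dx = 2x
    f = torch.sin(g)                 # f(g) = sin(g),  df/dg = cos(g)

    f.backward()                     # backpropagate: df/dx = cos(x^2) * 2x
    print(x.grad)                                        # what autograd computed
    print(torch.cos(x.detach() ** 2) * 2 * x.detach())   # the chain rule written out by hand; they match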

OK, so the picture I showed earlier of this neural net

is nice, but what if the input is actually an image? Right.

So, an image, even sort of a relatively low resolution image, is typically like, you know, a few
hundred pixels on the side.

OK. So let's say

256 x 256, to take a random example. OK, a car image: 256 x 256. So it's got

65,536 pixels, times three, because you have R, G, and B components to it, you know,
you have three values for each pixel. And so

that's, you know, roughly two hundred thousand values.


OK, so your vector here is a vector with two hundred thousand components.

If you have a matrix that is going to multiply this vector, this matrix is going to have to have
two hundred thousand rows —

Columns, I'm sorry.

And

depending on how many units you have here in the first layer,

there's going to be a 200,000 x, you know, maybe a large number. That's a huge matrix.

Right. Even if it's 200,000 x 100,000, so you have a

compression in the first layer,

you know, that's already a very, very large matrix: billions of weights.
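The arithmetic, spelled out (the 100,000-unit first layer is just an arbitrary example size):

    pixels  = 256 * 256 * 3        # an RGB image: 196,608 values, i.e. roughly two hundred thousand
    weights = pixels * 100_000     # a fully-connected first layer with 100,000 units
    print(pixels, weights)         # 196608 19660800000 -> about 20 billion weights in one matrix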

So it's not really practical to think of this as a

full matrix. What you're going to have to do if you want to deal with things like images is

make some hypothesis about the structure of this matrix so that it's not a completely full
matrix that, you know,

connects everything to everything.

That would be impractical.

At least for a lot of practical applications. So this is where inspiration from the brain comes
back.

There was some work,

classic work, in neuroscience in the 1960s by the gentlemen at the top here:

Hubel and Wiesel. They actually won a Nobel Prize for this in the 70s,

but their work is from the late 50s—early 60s. And what they did was that they poked
electrodes in the

visual cortex of various animals: you know, cats,

monkeys,

mice, you know, whatever.

(I think they like cats a lot.)


And

they tried to figure out

what the neurons in the visual cortex were doing.

And what they discovered was that—

so, first of all, well, this is a human brain. But, I mean, this chart is from much later. But all
mammalian

visual systems are organized in a similar way. You have signals coming in to your eyes,

striking your retina. You have a few layers of neurons in your retina in front of your
photoreceptors that, kind of, pre-process the

signal, if you want. They kind of compress it, because you can't have—

you know the human eye is something like a hundred million pixels.

So it's like a hundred-million-pixel camera, a hundred-megapixel camera.

But the problem is you cannot have a hundred million fibers coming out of your eyes,
because otherwise

your optical nerve would be this big. And you

wouldn't be able to move your eyes.

So those neurons in front of your retina do

compression. They don't do JPEG compression, but they do compression. So that the
signal can be compressed to one million fibers. Right.

You have one million fibers coming out of each of your eyes. And

that makes your, you know, optical nerve about this big, which means, you know, you can
carry the signal and turn your eyes.

This is actually a mistake that evolution made for vertebrates.

Invertebrates are not like that. Invertebrates have actually—so, it's a big mistake because

the wires

collecting the information from your retina,

because the neurons that process the signal in front of your retina, the wires have to kind
of—
they'll be in front of your retina, and so blocking part of the view, if you want. And

then they have to punch a hole through your retina to get to your brain.

So there's a blind spot in your visual field because that's where your optical nerve punches
through your retina.

So it's kind of ridiculous

if you have a camera like this to have the wires coming out the front and then,

you know, dig a hole in your sensor to get the wires back. It's much better if the wires come
out the back, right? And

vertebrates got that wrong. Invertebrates got it right. So,

you know, like squid and octopus actually have wires coming out the back. They're much
luckier.

But anyway.

So, the signal goes from your eyes to a little piece of brain called the lateral geniculate
nucleus,

which is under your brain actually—like at the base of it.

It does a little bit of

contrast normalization. We'll talk about this

again in a few lectures.

And,

and, then that goes to the back of your brain where the primary visual cortex, the area called
V1, is.

It's called V1 in humans. And there's something called the ventral hierarchy: V1, V2, V4, IT,

which is a bunch of brain areas going from the back to the side. And

in the infero-temporal cortex right here, this is where object categories are represented.

So, when you go around and you see your grandmother,

you have a bunch of neurons firing that represent your grandmother in this area. And it
doesn't matter

what your grandmother is wearing,


you know, what what position she is in, if there is occlusion or whatever—those neurons
will fire if you see your grandmother.

So the sort of category-level things.

And those things have been discovered by experiments with patients that had to have their
skull open for a few weeks, and

where, you know, people poked electrodes and had them watch movies, and realized there is a
neuron that turns on if

Jennifer Aniston is in the movie. And it only turns on for Jennifer Aniston.

So with the

idea that somehow the visual cortex, you know, can do pattern recognition and seems to
have this sort of hierarchical structure,

multi-layer structure,

there's also the idea that the visual

process is essentially a feed-forward process. So the process by which you recognize an


everyday object is very fast. It takes about 100 milliseconds.

There's barely enough time for the signal to go from your retina to the infero-temporal
cortex.

It's a few milliseconds of delay per neuron that you have to go through.

In 100 milliseconds you barely have time for, you know, just a few spikes to go through the
entire system.

So there's no time for like, you know, recurrent connections and like, you know, etc. Doesn't
mean that there are no recurrent connections.

There's tons of them but somehow

fast recognition is done without them.

So this is called the feed-forward ventral pathway. And

this gentleman here, Kunihiko Fukushima, had the idea of taking inspiration from

Hubel and Wiesel in the 70s, and sort of built a neural net model on the computer that had
this idea that,

first there were layers,


But also the idea that Hubel and Wiesel discovered, that individual neurons only react to a
small part of the visual field.

So they poked electrodes in

neurons in V1 and they realized that this neuron in V1 only reacts to

motifs that appear in a very small area in the visual field.

And then the neuron next to it will react to another area that's next to the first one, right.

So the neurons seem to be organized in what's called a retinotopic way,

which means that neighboring neurons react to neighboring regions in the visual field.

What they also realized is that there are groups of neurons that all react to the same area in the
visual field, and they seem to

turn on for edges at a particular orientation. So one neuron will turn on if

its receptive field has an edge, a vertical edge, and then the one next to it if the

edge is a little slanted, and then the one next to it if the edge is a little

rotated, etc.

And so they had this picture of

V1 basically as

orientation selectivity: neurons that look at a local

field and then react to orientations. And those groups of neurons that react to multiple
orientations are replicated over the entire visual field.

So this guy Fukushima said, well, why don't I build a neural net that does this? I'm not
going to,

you know, necessarily insist that my system extracts oriented

features, but I'm going to use some sort of unsupervised learning algorithm to

train it.

So he was not training his system end-to-end. He was training it layer by layer in some sort
of unsupervised fashion,

which I'm not going to go into the details of.


And then he used another concept from,

so he used the concept that those neurons were replicated across the visual field, and

then he used another concept from Hubel and Wiesel called complex cells.

So complex cells are units that pool the activities of a bunch of simple cells, which are
those

orientation-selective units. And

they pool them in such a way that if

an oriented edge is moved a little bit,

it will activate different simple cells, but the complex cell,

since it integrates the outputs from all those simple cells, will stay activated

until the edge moves beyond its receptive field.

So those complex cells build a little bit of shift invariance in the representation.

You can shift an edge a little bit, and it will not change the activity of one of
those complex cells.

So that's

what we now call "convolution" and "pooling" in the context of convolutional nets.

And that basically is what

led me in the mid-80s or late-80s to come up with convolutional nets. So they are basically

networks where the connections are local; they are replicated across the visual field; and

you

intersperse,

sort of, feature detection layers that detect those local features with pooling operations.
We'll talk about this at length in three weeks.

So I'm not going to go into every detail.

But it has,

it recycles this idea from Hubel and Wiesel and Fukushima that
(...if I can get my pointer...)

that, basically, every neuron in one layer computes a weighted sum of a small area of the
input, and

the weighted sum uses those weights.

But those weights are replicated across the input, so every neuron in a layer uses the same

set of weights. OK, so this is the idea of weight tying or weight sharing.
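A minimal sketch of that idea in PyTorch (all the sizes here are arbitrary example choices): local weighted sums whose weights are shared across the whole image, interspersed with pooling for a little shift invariance.

    import torch, torch.nn as nn

    net = nn.Sequential(
        nn.Conv2d(1, 8, kernel_size=5),   # 8 feature maps; each reuses one 5x5 set of weights everywhere
        nn.ReLU(),
        nn.MaxPool2d(2),                  # "complex cell"-style pooling: small shifts barely change the output
        nn.Conv2d(8, 16, kernel_size=5),
        nn.ReLU(),
        nn.MaxPool2d(2),
    )

    x = torch.randn(1, 1, 32, 32)         # one made-up grayscale image
    print(net(x).shape)                   # torch.Size([1, 16, 5, 5])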

So using backprop we were able to train neural nets like this to recognize

handwritten digits. This is back from the late-80s early-90s.

And this is me when I was about your age, maybe a little older. I'm about thirty

in this video. And

this is my phone number when I was working at Bell Labs.

Doesn't work anymore.

It's a New Jersey number. And I hit a key, and there is this neural net running on a 386 PC
with a special accelerator card

recognizing those characters, running a neural net very similar to the one I just showed you
the animation of.

And the thing could,

you know, recognize characters of any style,

including very strange styles,

including even stranger styles.

And so this was kind of new at the time because this was back when

character recognition, or pattern recognition in general,

were still on the model of:

we extract features and then we train a classifier on top.

And this could basically train the entire, like, learn the entire task from end-to-end.

You know, basically, the first few layers of that neural net would play the role of a
feature extractor,
but it was trained from data.

The reason why we used character recognition is because

this was the only thing for which we had data.

The only task for which there was enough data was either character recognition or speech
recognition.

Speech recognition experiments were somewhat successful, but not as much.

Pretty quickly, we realized we could use those

convolutional nets not just to recognize individual characters,

but to recognize groups of characters, so multiple characters at a single time. And it's
because of this convolutional nature of the network,

which I'll come back to in three lectures, that

basically allowed those systems to just, you know, be applied to a large image and then
they will just

turn on whenever they see in their

field of view, whenever they see a shape that they can recognize.

So, basically, if you have a large image, you train a convolutional net that has a small
input window, and you swipe it

over the entire image, and whenever it turns on it means it's detected

the object that you trained it to detect.

So here the system, you know, is capable of doing simultaneous segmentation and
recognition.

You know, back in the—before that

people in pattern recognition would have an explicit program that would separate individual
objects from their background and from each other, and then send each

individual object, character for example, to a recognizer.

But with this you could, you can do both at the same time.

You don't have to worry about it. You don't have to build any special program for it.
So in particular this could be applied to natural images for things like facial detection,
pedestrian detection, things like this. Right.

Same thing, train a

convolutional net to

distinguish between an image where you have a face and an image where you don't have a
face, train this with several thousand examples, and

then take that window, swipe it over an image: whenever it turns on

there is a face. Of course, the face could be bigger than the window

so you sub-sample the image: you make it smaller, and you swipe your network again, and
then make it smaller again, swipe your network

again. So now you can detect faces regardless of size.
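A rough sketch of that multi-scale sliding-window idea (the network, the threshold, and the sizes are made-up placeholders, not the actual detector): because the net is convolutional, applying it to the whole image gives a map of scores, and shrinking the image lets bigger faces fit in the window.

    import torch, torch.nn as nn
    import torch.nn.functional as F

    detector = nn.Sequential(                 # fully convolutional, so it "swipes" over the image for free
        nn.Conv2d(1, 8, 5), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(8, 1, 14),                  # one score per 32x32 window of the input
    )

    image = torch.randn(1, 1, 256, 256)       # a made-up large grayscale image
    for scale in range(3):                    # a small pyramid of scales
        scores = detector(image)              # detection scores over the whole image at this scale
        hits = (scores > 2.0).nonzero()       # arbitrary threshold: "the unit turned on here"
        print(scale, scores.shape, len(hits))
        image = F.interpolate(image, scale_factor=0.5, mode='bilinear', align_corners=False)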

OK.

In particular you can use this to drive robots. So these are things that were done before deep
learning became popular, OK.

So this is an example where the network here is a convolutional net.

It's applied to the image coming from a camera, from a you know, running robot.

And it's trying to classify

every small window—like, 40 x 40 pixels or so, even less—

as to whether the central pixel in that window is on the ground or is an obstacle, right.

So whatever it classifies as being on the ground is green.

Whatever it classifies as being an obstacle is red, or purple if it's at the foot of the obstacle.

And then you can sort of map this to a map, which you see at the top.

And then do planning in this map to reach a particular goal, and then use this to navigate.

And so these are two former PhD students.

Raia Hadsell on the right, Pierre Sermanet on the Left, who are annoying this poor robot.

Pretty confident the robot is not going to break their legs, since they actually wrote the code
and trained it.
Pierre Sermanet is a research scientist at Google Brain in California working on robotics.

Raia Hadsell is head of robotics research—

Director of Robotics research at DeepMind. They did pretty well.

So a similar idea can be used for what's called semantic segmentation.

So semantic segmentation is the idea that you can, again, with this kind of sliding window
approach,

you can train a convolutional net to classify the central pixel using a window as a context.

But here it's not just trained to classify

obstacles from non-obstacles. It's trained to classify something like 30 categories. This is—

this is down

Washington Place, I think. This is Washington Square Park.

And, you know, it knows about roads, and people, and plants, and

trees, and whatever—but it finds, you know, desert in the middle of Washington Square
Park, which is not...

There's no beach that I'm aware of...

So it's not perfect. At the time it was state of the art, though. That was the best

system there was to do this kind of semantic segmentation.
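The same kind of sketch for semantic segmentation (again just an illustration with made-up sizes, not the system in the video): instead of one face/no-face score, the net outputs a score for each of roughly 30 categories at every pixel.

    import torch, torch.nn as nn

    num_classes = 30
    segmenter = nn.Sequential(
        nn.Conv2d(3, 16, 5, padding=2), nn.ReLU(),
        nn.Conv2d(16, num_classes, 5, padding=2),   # per-pixel class scores, same spatial size as the input
    )

    frame = torch.randn(1, 3, 240, 320)             # a made-up RGB frame
    labels = segmenter(frame).argmax(dim=1)         # the winning category at each pixel
    print(labels.shape)                             # torch.Size([1, 240, 320])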

I was running around giving talks like trying to evangelize people about deep learning back
then. This was around 2010.

So this is before the, kind of, deep learning revolution, if you want.

And one person, a professor from

Israel, was sitting in one of my talks. And he's a theoretician, but he was really kind of
transfixed by the

potential applications of this, and he was just about to

take a sabbatical and work for a company called Mobileye, which was a start-up in Israel at
the time—

working on autonomous driving. And so, a couple of months after he heard my talk, he
started working at Mobileye.
He told the Mobileye people, you know—"You should try this convolutional net stuff.

This works really well." And the engineers there said—Nah. No, we don't believe in that
stuff. We have our own method.

So he implemented it, and tried it himself, beat the hell out of

all the benchmarks they had.

And all of a sudden the whole company switched to using convolutional nets.

And they were the first company to actually come up with a

vision system for cars that, you know, can keep a car on a highway, and can brake if there is
a pedestrian or

cyclist

crossing. I'll come back to this in a minute. They were basically using this technique.

Semantic segmentation, very similar to the one I showed for the robot before.

This was a guy by the name of Shai Shalev-Schwartz.

You have to be aware of the fact also that back in the 80s, people were really interested in,
sort of,

implementing special types of hardware that could run neural nets really fast.

And these are a few examples of neural net chips that were actually implemented—I
had something to do with some of them—

but they were implemented by people working in the same group as I was at Bell
Labs in New Jersey.

So this was kind of a hot topic in the 1980s, and then of course with the

interest in neural nets dying in the mid-90s, people weren't working on this anymore, until a
few years ago. Now

the hottest topic in chip design, in the chip industry, is neural net accelerators.

You go to any conference on

computer architecture

or chip design, like ISSCC, which is the big solid-state circuits conference—half the
talks are about
neural net accelerators.

And I worked on a few of those things.

OK, so then

something happened, as I told you, around 2010, -13, -15

in speech recognition, image recognition, natural language processing, and it's continuing.
We're in the middle of it now for other topics.

And what happened—and I'm really sad to say it didn't happen in my lab—but

with our friends,

we started, with Yoshua Bengio and Geoff Hinton, back in the early 2000s—we knew, you
know

that deep learning was working really well and

we knew that the whole community was making a mistake by dismissing neural nets and
deep learning. And so—

we didn't use the term deep learning yet. We invented it a few years later—so, around 2003
or so, 2004,

we started kind of a conspiracy, if you want. We got together and we said we're just going to

try to, kind of, beat some records on some data sets, invent some new algorithms that
would allow us to train very large neural nets,

and collect very large data sets, so that we would show the world that those
things really work,

because nobody really believed it.

That really kind of succeeded beyond our wildest dreams. In particular, in 2012

Geoff Hinton had a student, Alex Krizhevsky, who spent a lot of time implementing

convolutional nets on

GPUs, which were kind of new at the time—they were not entirely new but they were
starting to become really

high-performance.

So he was very good at sort of hacking that,


and then they were able to train much larger neural nets, convolutional nets, than anybody
was able to do before.

And so they used it

to train on the ImageNet dataset. The ImageNet dataset is a bunch of natural photos.

And the system is supposed to recognize

the main objects in the photo among

1,000 different categories. And the training set had 1.3 million samples.

Which is kinda large.

So what they did was build this really large,

and very deep

convolutional net, pretty much on the same model as what we had before, implemented on
GPUs, and let it run

for a couple weeks. And with that they beat

the performance of best competing systems

by a large margin. So this is the error rate on ImageNet going back to 2010. In 2010

it was about 28% error, top-5.

So, basically you get an error if the correct category is not in the top 5 among 1,000. OK, so
it's kind of a mild

measure of error.
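
(As a quick illustration of this top-5 metric—a hypothetical sketch, not the official ImageNet evaluation code—you count a prediction as correct if the true label appears anywhere among the model's five highest-scoring categories.)

```python
import torch

def top5_error(logits, targets):
    """logits: (N, 1000) class scores; targets: (N,) true class indices."""
    top5 = logits.topk(5, dim=1).indices              # five best-scoring classes per sample
    hit = (top5 == targets.unsqueeze(1)).any(dim=1)   # is the true label among them?
    return 1.0 - hit.float().mean().item()            # fraction of misses

# toy example with random scores over 1,000 categories
logits = torch.randn(8, 1000)
targets = torch.randint(0, 1000, (8,))
print(top5_error(logits, targets))
```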

In 2011 it was 25.8%; the system that was able to do this was actually very, very large.

It was sort of somewhat convolutional-net-like, but it wasn't trained. I mean only the last
layer was trained.

And then

Geoff and his team got it down to 16.4%, and then that was a

watershed moment for the computer vision community. A lot of people said, Okay, you
know now we know that this thing works.

And the whole community went from


basically rejecting every paper that had neural nets in it in 2011 and 2012

to rejecting every paper that does not have a convolutional net in it in 2016.

So now it's the new religion, right. You can't get a paper in a computer vision conference
unless you use ConvNets somehow.

And the error rate went down really quickly, you know people found all kinds of really cute
architectural tricks

that, sort of, made those things work better. And what you'd see in there is that there was
an inflation of the number of layers.

So my convolutional nets from the 90s and the early 2000s had 7 layers or so. And
then

AlexNet had, I don't know, 12.

Then VGG, the year after that, had 19.

GoogLeNet had I don't know how many, because it's hard to figure out how you count. And
then the workhorse now of

object recognition, the standard

backbone, as people call them, has 50 layers.

It's called ResNet-50.

But some, you know, some networks have 100 layers or so.

So Alfredo a few years ago put together this chart that shows

where each of those blobs is a particular network architecture.

And the x-axis is the number of billions of operations you need to do to compute the output.

Okay, those things are really big—billions of connections.

The y-axis is the top-1 accuracy on ImageNet. So it's not the same measure of
performance as the one I showed you before.

So the best systems are at around 84% today.

And the size of the blob is the memory occupancy, so the

number of

millions of
floats that you need to store the weight values. Now people are very
smart about compressing those things like

you know quantizing them, and

there are entire teams at Google, Facebook, and various other places that only work on
optimizing those networks and compressing the

things so they can run fast.

Because,

to give you just a

rough idea,

the number of times

Facebook, for example, runs a convolutional net on its servers per day is in the tens of
billions.

Okay. So there's a huge incentive to optimize the amount of computation necessary for
this.

So one...

one reason why

convolutional nets are so successful

is that they exploit a property of natural data, which is compositionality.

So

compositionality is

the

property by which

a scene is composed of objects;

objects are composed of parts;

parts are composed of sub-parts; sub-parts are really combinations of motifs; and motifs
are combinations of

contours or edges,
right, or textures.

And those are just combinations of pixels. Okay, so there's this so-called compositional
hierarchy that,

you know, particular combinations of

objects at one layer in the hierarchy

form

objects at the next layer.

And so if you, kind of, mimic this compositional hierarchy in the architecture of the
network, and you let it learn the appropriate

combinations of features at one layer that,

you know, form the features of the next layer, that's really what deep learning is.

Okay. Learning to represent the world and exploit the structure of the world—the
structure being the fact that

there is organization in the world, because the world is compositional.

A statistician by the name of Stuart Geman, who is at Brown University, said—

so he was kind of playing on the famous Einstein quote, Einstein said:

The most incomprehensible thing about the world is that the world is comprehensible.

Like, the world could be

extremely complicated,

so complicated that we have no way of understanding it.

And it looks like a conspiracy that we are able to understand at least part of the world.

And so Stuart Geman's version of this is that the world is compositional... or there is a God.

(Because you need supernatural

things to be able to understand it if the world is not compositional.)

So this has led to incredible progress in things like computer vision, as you know, from

you know, being able to unreliably detect people, to being able to
generate accurate masks for every object,
and then even to figure out the pose, and then do this in real time on a
mobile platform, you know.

I mean the progress has been sort of nothing short of incredible, and most of those things
are based on

two basic families of architectures.

These sort of so-called one-pass

object detection/recognition

architectures called RetinaNet, feature pyramid networks—there are various names for it.

Or U-Net. Then another type called Mask R-CNN. Both of them actually originated from
Facebook.

Or, the people who originated them are now at Facebook—they sometimes came up with it
before they came to Facebook.

But, you know, those things work really well, you know, they can do things like that: detect
objects that are partially occluded, and,

you know, draw a mask of every object. So basically, this is a neural net, a convolutional net
where the input is an image.

But the output is also an image. In fact, the output is a whole bunch of images, one per
category. And for each category it

outputs the mask of the object from that category.
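
(A minimal sketch of that image-in, maps-out shape—the layer sizes and category count are made up, and this is not any specific published architecture: a fully convolutional network whose output has one channel per category, each channel acting as a score map, or mask, for that category.)

```python
import torch
import torch.nn as nn

num_categories = 21          # illustrative: e.g. 20 object classes + background

# toy fully convolutional net: an image goes in, one score map per category comes out
seg_net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, num_categories, kernel_size=1),
)

image = torch.rand(1, 3, 256, 256)
score_maps = seg_net(image)           # shape (1, 21, 256, 256): one "image" per category
masks = score_maps.argmax(dim=1)      # per-pixel predicted category
print(score_maps.shape, masks.shape)
```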

Those things can also do what's called "instance segmentation." So if you have a whole
bunch of sheep,

it can tell you, you know, not just that this region is sheep,

but actually pick out the individual sheep and tell them apart, and it will count the
sheep, right, and fall asleep.

That's what you're supposed to do, right—to fall asleep you count sheep, right?

And

the cool thing about

deep learning is that a lot of the community has embraced the whole concept that
research has to be done in the open.
So a lot of the stuff that we're gonna be talking about, as you probably know,

in the class is

not just published, but it's

you know, published with code. It's not just code, it's actually pre-trained models that you
can just download and run.

All open source. All free to use.

So that's

really new. I mean people didn't use to do research this way, particularly in industry.

But even in academia people weren't used to kind of distributing their code.

But with deep learning, sort of,

somehow the race has kind of driven people to be more open about research.

So there's a lot of applications of all this, as I said, you know self-driving cars.

This is actually a video from Mobileye, and Mobileye was pretty early in using
convolutional nets for autonomous driving.

To the point that in 2015

they had managed to shoehorn a convolutional net on the chip that they had designed for
some other purpose. And they

licensed the technology to Tesla. So the first self-driving Teslas—

I mean self-driving, not really self-driving, they have driving assistance, right;

they can keep in lane on the highway and change lanes—had this Mobileye system.

And that's pretty cool. So that's a convolutional net. It's a little chip that, you
know,

looks out the window from just behind

the rear-view mirror.

Since then—this was, you know, four or five years ago—this kind of technology has
been

very widely deployed by a lot of different companies.


Mobileye has since been bought by Intel, and they have like 70 or 80 percent of the market for
those vision systems.

But there are a lot of

companies—and car manufacturers—

that use those things. So in fact

in some European countries, every single car that comes out, even low-end cars, has
convolutional-net-based vision systems. And they call this:

an advanced emergency braking system, or automated

emergency braking system—

AEBS—

which is deployed in every car in France, for example.

It reduces collisions by 40%.

So not every car on the road has one yet, because you know people keep their cars for a
long time.

But what that means is that it saves lives.

So a very positive application of deep learning.

Another big category of applications, of course, is medical imaging.

So this is probably the hottest topic in radiology these days—how to use

AI (which means convolutional nets)

for radiology. This [slide image] is lifted from a paper by some of our colleagues here at
NYU,

where they analyzed MRI images. So there's one big advantage to convolutional nets:
they don't need to look at the screen to

look at an MRI. In particular

to be able to look at an MRI, they don't have to slice it into 2D images.

They can look at the entire 3D volume. This is

one property that this thing

uses. It's a 3D convolutional net that looks at the entire volume of an MRI image, and it
uses a technique very similar to the semantic segmentation I was showing before.

It basically turns on pixels in the output image wherever there is some—

you know, here a femur bone, but, you know, it could be something else—

so this is the kind of result it produces.

It works better in 3D than in 2D slices. Or, it can turn on when it detects a

malignant tumor in

mammograms. (That one is 2D, not 3D.)

And there's, you know, various other projects in medical imaging that are going around.

Okay.

Lots of applications in science and physics, bioinformatics, you know, whatever, which
we'll come back to so...

Okay.

So there's a bunch of mysteries in deep learning. They're not complete mysteries, because
people have some understanding of all this,

but they are mysteries in the sense that we don't have, like, a nice theory for everything.

Why do they work so well?

So one big question

that theoreticians were asking many years ago, when I was trying to convince the world that
deep learning was a good idea,

was—they would tell me: Well, you can approximate any function with just two layers,

why do you need more? And I'll come back to this in a minute.

What's so special about convolutional nets?

I talked about the compositionality of natural images, or natural data in general. This is
true for speech also, and for other natural signals.

But it seems a little contrived.


How is it that we can train the system, despite the fact that the objective function we're
minimizing is very non-convex? We may have

lots of local minima.

This was a big criticism that people were

throwing at neural nets—people who'd never played with neural nets were throwing out
neural nets back in the old days. They'd say, like,

you know, you have no guarantee that your algorithm will converge—you know,

it's too scary. I'm not gonna use it.

And, the last one is: why is it that

the way we train neural nets breaks everything that every textbook in statistics tells you?

Every textbook in statistics tells you,

if you have n data points, you shouldn't have more than n parameters,

because you're going to overfit like crazy.

You know, you might regularize. If you're a Bayesian, you might throw in a prior.

But...

(which is equivalent)

But

what guarantee do you have? And

with neural nets—neural nets are wildly over-parameterized. We train neural nets with
hundreds of millions of parameters, routinely. They're used in production. And

the number of training samples is nowhere near that. How does that work?

But it works!

OK, so things we can do with deep learning today: You know, we can

have safer cars; we can have better medical image analysis systems;

we can have pretty good language translation, far from perfect, but useful;

stupid chatbots;

you know, very good information search, retrieval, and filtering.


Google and Facebook nowadays are completely built around deep learning. You take deep
learning out of them and they crumble.

And,

you know, lots of applications in energy management and production, and all kinds of stuV;
manufacturing, environmental protection.

But we don't have really intelligent machines. We don't have machines with common
sense. We don't have

intelligent personal assistants. We don't have,

you know, smart chatbots. We don't have household robots.

You know, I mean, there's a lot of things we don't know how to do, right. Which is why we
still do research.

OK, so, deep learning is really about learning representations.

But really we should know in advance what representations are. So I talked about the
traditional model of pattern recognition.

But...

Representation is really about

you know, you have your raw data, and you want to turn it into a form that is useful,
somehow.

Ideally, you'd like to turn it into a form that's useful regardless of what you want to do with it.

Sort of "useful" in a general way. OK, and it's not entirely clear what that means.

But, at least, you want to turn it into a representation that's useful for the task that you are
envisioning.

And there's been a lot of ideas over the decades on,

sort of, general ways to pre-process natural data in such a way that you produce good
representations of it.

I'm not going to go through the details of this laundry list.

But there are things like tiling the space, doing random projections. So random projection is
actually, kind of,

you know, like a monster that

rears its head periodically, like every five years. And you have to whack it on the head every
time it pops up.

That was the idea behind the Perceptron. So the first layer of a perceptron is a layer of
random projections.

What does that mean? A random projection is a random matrix,

which you know, has a smaller output dimension than input dimension,

with some sort of non-linearity at the end, right. So think about a single layer neural net
with nonlinearities, but the weights are random.

So you can think of this as

random projections.
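
(Here is a minimal sketch of that idea—the sizes, the tanh nonlinearity, and the least-squares fit are all arbitrary choices for illustration: the first layer is a fixed random matrix followed by a nonlinearity, and only the second, linear layer gets trained.)

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_hidden = 100, 50                              # illustrative sizes
W_random = rng.standard_normal((d_hidden, d_in))      # first layer: random, never trained

def features(X):
    # random projection followed by a nonlinearity, like the Perceptron's first layer
    return np.tanh(X @ W_random.T)

# Only the second (linear) layer is fit, here by plain least squares on toy data.
X = rng.standard_normal((500, d_in))
y = np.sign(X[:, 0] + 0.1 * rng.standard_normal(500))   # toy labels
H = features(X)
w_out, *_ = np.linalg.lstsq(H, y, rcond=None)            # train only the top layer
print(np.mean(np.sign(H @ w_out) == y))                   # training accuracy
```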

And a lot of people are rediscovering that wheel periodically,

claiming that it's great because you don't have to do multi-layer training.

And so it started with the Perceptron, and then you know

it came back in the 60s, and then it came back again in the 1980s, and then it came back
again.

And now it has come back again. There's a whole community, mostly in Asia, that calls

two-layer neural nets where the first layer is random "extreme learning
machines," OK.

It's like, it's ridiculous, but it exists.

They're not "extreme," I mean they're extremely stupid, but—you know.

Right, so I was mentioning the compositionality of the world.

It's, you know, from pixels to edges to textons, motifs, parts, objects. In text you have
characters, words, word groups, clauses, sentences, stories.

In speech it's the same, you have individual samples.

You have, you know,

spectral bands, sounds, phones, phonemes, words, etc.

You always have this kind of hierarchy.

OK, so here are many attempts at dismissing the whole idea of deep learning. OK, first
thing. And these are things that I've heard for decades, OK—

from mostly theoreticians, but a lot of people, and you have to know about them because

they're going to come back in five years when people say, "Oh, deep learning sucks."

Why not use support-vector machines? OK, so,

here is support-vector machines here on the top left. Support-vector machine is a,

and I'm sure many of you have heard about kernel machines and support-vector machines.

Who knows what this is?

I mean, even if you have just a rough idea of what this is. OK, a few hands. Who has no idea what a
support-vector machine is?

Don't be shy. Yeah. Yeah, I mean it's okay if you don't.

OK, like most people haven't raised their hands for either.

[Alfredo: Hands up, please, who knows support-vector machines.]

OK, come on all the way up. Cool, all right. Who has no idea what it is? Don't be shy, it's
okay. [Alfredo: inaudible]

All right, good.

Right, so here's the support-vector machine. A support-vector machine is a two-layer neural
net.

It's not really a neural net, people don't like when it's formulated this way, but really you can
think of it this way.

It's a two layer neural net,

where the first layer, which is symbolized by this function K here,

each unit in the first layer compares the input vector X to one of the training samples X^i's.

OK, so you take your training samples, let's say you have a thousand of them—

so you have a thousand X^i's, from i = 1 to 1,000,

and you have some function K that is going to compare X and X^i.

Good example of a function to compare the two is you take the dot product between X and
X^i, and you pass the result
through,

like, exponential minus square or something. So you get a Gaussian

response,

as a function of the distance between X and X^i, OK.

So it's a way of comparing two vectors; it doesn't matter what it is.

And, then you take those scores coming out of this K function that compares the input to
every sample and

you compute a weighted sum of them. And what you're going to learn are the weights, the
alphas.

Okay, so it's a two-layer neural net in which the second layer is trainable and the first layer
is fixed.

But in a way you can think of the first layer as being trained in an unsupervised manner,
because it uses the data

from the training set, but it only uses the X's, doesn't use the Y's.

It uses the data in the stupidest way you can imagine, which is you store every X and

use every single X as the weight of a neuron, if you want.

Okay.

That's what a support-vector machine is. You can write a thousand-page book about the cute
mathematics

behind that. But the bottom line is it's a two-layer neural net where the first layer is trained
in a very

stupid way, unsupervised, and the second layer is just a linear classifier.
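
(A minimal sketch of that view—purely illustrative; a real SVM would pick the alphas, and usually only a few non-zero ones, by solving a constrained optimization problem: the first "layer" compares the input to every stored training sample with a Gaussian kernel, and the second layer is just a weighted sum.)

```python
import numpy as np

rng = np.random.default_rng(0)

def kernel_layer(x, X_train, gamma=0.5):
    # "first layer": compare x to every stored training sample with a Gaussian kernel
    dists = np.sum((X_train - x) ** 2, axis=1)
    return np.exp(-gamma * dists)                 # one score per training sample

def kernel_machine(x, X_train, alphas, b=0.0):
    # "second layer": a plain weighted sum of those scores
    return kernel_layer(x, X_train) @ alphas + b

# toy data: the X's act as the fixed, "unsupervised" first-layer weights
X_train = rng.standard_normal((1000, 20))
alphas = 0.01 * rng.standard_normal(1000)         # what training would actually adjust
x = rng.standard_normal(20)
print(kernel_machine(x, X_train, alphas))
```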

So, it's basically glorified template matching, because it compares the input
vector to all the training samples.

And so, it doesn't work if you want to do like,

you know, computer vision with raw images. If X is an image and

the X^i's are a million images from ImageNet—

first of all, for every image you're gonna have to compare it with a million images,
or maybe a little less if you're smart about how you train it.

That's going to be very expensive, and the kind of comparison you're making is basically

what solves the problem. The weighted sum you're gonna get at the end is really the cherry
on the cake.

I use that analogy too often, actually.

So...

You can approximate, you can have theorems that show that you can approximate any
function you want, as close as you want, by

tuning the

K function and the alphas.

And so, if you were to talk to a theoretician, they'll tell you: Why do you need deep learning?

I can approximate any function I want with a kernel machine.

The number of terms in that sum can be very large, and nobody tells you what kernel
function you can use. And so,

that doesn't solve the problem.

You can use a two-layer neural net, OK. So this is the top right here. The first layer is a
nonlinear function

F applied to the product of a matrix W^0 with the input vector; and then the second layer
multiplies that by a second matrix, and then

passes it through another non-linearity.

OK, so this is a composition of two linear and non-linear operations. Again, you can show
that under some conditions

you can approximate any function you want with something like this.

Given that you have a large enough

vector in the middle.

OK, so if the dimension of what comes out of the first layer is high enough

—potentially infinite—you can approximate any function you want as close as you want, by
making this layer go to infinity.
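
(Written out, the two-layer network on that slide is something like y = F(W^1 F(W^0 x)); here is a minimal sketch—the matrix sizes, the tanh nonlinearity, and the name W^1 for the second matrix are assumptions for illustration—and the universal-approximation argument is exactly about letting that hidden dimension grow as large as you like.)

```python
import numpy as np

rng = np.random.default_rng(0)

def two_layer_net(x, W0, W1, F=np.tanh):
    # y = F(W1 @ F(W0 @ x)): one hidden layer, as on the slide
    return F(W1 @ F(W0 @ x))

d_in, d_hidden, d_out = 10, 10_000, 1      # a huge hidden layer is what the theorem asks for
W0 = rng.standard_normal((d_hidden, d_in))
W1 = rng.standard_normal((d_out, d_hidden)) / np.sqrt(d_hidden)
print(two_layer_net(rng.standard_normal(d_in), W0, W1))
```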
So again, you talk to theoreticians and they tell you: Why do you need layers? I can
approximate anything I want with two layers.

But there is an argument, which is

it could be very, very expensive to do it in two layers.

And...

For some of you this may sound familiar. For most of you probably not.

Let's say I want to design a logic circuit.

OK, so, when you design logic circuits, right, you have AND-gates and OR-gates and... or
NAND-gates, right.

You can do everything with just NANDs, right—negated ANDs.

You can show that any Boolean function can be written as an OR of a bunch of ANDs—

you know, a bunch of ANDs and then an OR on top of this. That's called disjunctive normal
form (DNF).

So any function can be written in two layers.

The problem is that for most functions, the number of terms you need in the middle is
exponential in the size of the input.

So, for example,

if I give you N bits, and ask you to construct a circuit that tells me if the number of bits that
are on

in the input string is

even or odd.

OK, it's a simple Boolean function: 1 or 0 on the output.

The number of gates that you need in the middle is essentially exponential,

if you do it in two layers.

If you allow yourself to do it in

log(N) layers, where N is the number of input bits,


then it's linear. OK. So you go from exponential complexity to linear complexity if you allow
yourself to use multiple layers.
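
(A sketch of that depth argument—illustrative, not a formal proof: computing the parity of N bits as a balanced tree of pairwise XORs uses about N - 1 gates arranged in roughly log2(N) layers, whereas a two-layer DNF circuit for parity needs on the order of 2^(N-1) AND terms.)

```python
def parity_tree(bits):
    """Parity of N bits as a tree of pairwise XORs: depth ~ log2(N), about N-1 gates."""
    layer, depth = list(bits), 0
    while len(layer) > 1:
        # XOR adjacent pairs; an odd leftover element passes through to the next layer
        nxt = [layer[i] ^ layer[i + 1] for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2:
            nxt.append(layer[-1])
        layer, depth = nxt, depth + 1
    return layer[0], depth

value, depth = parity_tree([1, 0, 1, 1, 0, 1, 0, 0])
print(value, depth)        # parity 0, depth 3 = log2(8)
# A two-layer DNF for the same 8-bit function would need ~2**7 = 128 AND terms.
```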

It's as if, you know, when you write a program—

I'll tell you

write the program in such a way that

there are only two sequential steps that are necessary to run your program. So basically your
program has two sequential instructions.

You can have, you can run as many instructions as you want in your program

but they have to run in parallel, most of them. And you're only allowed two sequential
steps.

OK.

And, the kind of instructions you have access to are things like, you know, linear
combinations, nonlinearities—

like simple things, right. Not like entire sub-programs.

So for most,

most problems,

the number of intermediate values you're going to have to compute in the first step

is going to be exponential in the size of the input.

There's only a tiny number of problems for which you're going to be able to get away with a
non-exponential number of minterms.

But if you allow your program to run multiple steps sequentially then all of a sudden, you
know,

it can be much simpler. It will run slower, but

it will take a lot less memory. It will

take a lot less stuff—

resources. So people who design computer circuits know this, right. You can design, for
example, a circuit that adds two binary numbers. And

there is a very simple way to do this,


which is that you first take the first two bits, you add them, and then you propagate the
carry to the

second pair of bits,

you know, taking the carry into account, which gives you the second bit of the result, and then
you

propagate the carry, and you do this sequentially, right.

So the problem with this is that it takes time that's proportional to the size of the

numbers that you're trying to add. So circuit designers

have a way of,

basically, pre-computing the carry, it's called carry lookahead,

so that the number of steps necessary to do an addition is actually not N, it's much less
than that.

OK. But that's at the expense of a huge increase in the

complexity of the circuit—the area that it takes on the chip.
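
(For reference, here is a toy ripple-carry adder—a sketch, not how real hardware is described: the carry has to propagate through all N bit positions one after another, which is exactly the N sequential steps that carry-lookahead circuits spend extra hardware to avoid.)

```python
def ripple_carry_add(a_bits, b_bits):
    """Add two N-bit numbers given least-significant-bit first; the carry ripples through N steps."""
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)                   # sum bit at this position
        carry = (a & b) | (carry & (a ^ b))         # carry into the next position
    return out + [carry]

# 6 + 3 = 9, with 4-bit inputs written least-significant-bit first
print(ripple_carry_add([0, 1, 1, 0], [1, 1, 0, 0]))   # [1, 0, 0, 1, 0] -> 9
```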

So this trade-off between time and space, or between depth and,

kind of, time, is known.

So what do we call deep

models? So, you know, a two-layer neural net, one that has one hidden layer,

I don't call that "deep," even though technically it uses backprop. But, eh, you know, it
doesn't really learn

complex representations.

So there's this idea of hierarchy in deep learning. SVMs definitely aren't deep.

Unless you learn complicated kernels, but then they're not SVMs anymore.

So what are good features? What are good representations?

So, here's an example I like. There is something called the manifold hypothesis, and it's the
fact that
natural data—

So, if I take a picture of this room

with, you know, a

camera with a 1,000 x 1,000 pixel resolution. That's 1 million pixels and 3 million values.

It lives—you can think of it as a vector with 3 million

components.

Among all the possible vectors with 3 million components, how many of them correspond
to what we would call natural images?

We can tell when we see a picture whether it's a natural image or not.

We have a model in our visual system that tells us this looks like a real, like a real image.

And we can tell when it's not. So the number of

combinations of pixels that actually are things that we

think of as natural images is a tiny, tiny, tiny,

tiny, tiny subset of the set of all possible images. There are way more ways of combining
random pixels into

nonsensical images than there are ways of combining pixels into things that look like
natural images.

So the manifold hypothesis is that the set of

things that,

you know, look natural to us lives on a low-dimensional

surface inside the high-dimensional

ambient space.

And a good example to convince yourself of this: Imagine

I take lots of pictures of a person making faces, right. So the person is in front of a white
background.

Her hair not moving.

And she, kind of, moves her head around and, you know, makes faces, etc.
The set of all images of that person—so I take a long video of that person—the set of all
images of that person

lives in a low dimensional surface.

So a question I have for you is, What's the dimension of that surface?

What order of magnitude, okay. Any guess? Yes?

[Inaudible student comment.]

Yeah, you've probably heard my spiel before, but... [Speaker: What did the person say?]

Huh? [Speaker: What did they say?]

OK, so for whoever hasn't heard this, you have a shot, another shot at an answer.

OK, any guess?

No? Don't be shy. I want like multiple proposals.

Anyone. You can look down your laptop, but, you know, I can point at you or something.

OK, any idea?

Yes.

No idea? It's OK.

You, any idea? Maybe you heard what he said. [Inaudible student comment.]

Linear, what does that mean? [Inaudible student comment.]

It's a 1D space.

OK, a one-dimensional subspace.

OK, any other proposal?

Any idea?

OK, the images I'm taking are a million pixels. OK, so the ambient space is 3 million
dimensions.

[Inaudible student comment]

They don't change, no.

And the person can move the head, you know, turn around, things like this. But not really
move the whole body.
I mean you only see the face, it's mostly centered.

[Student: A thousand.]

A thousand, OK. Why?

[Inaudible student comment.]

OK yeah, that's a good guess.

At least the motivation.

[Inaudible student comment.]

Say again. [Student: The surface area of the person.]

The surface area of the person. Right. So it's bounded by the number of pixels

occupied by the person. That's for sure. That's a, that's an upper bound.

Yes.

Those pixels, of course, are not gonna take all possible values. So that's a wide upper
bound. Any other idea?

OK. So, basically

the

dimension of that, as you said,

is bounded by the number of muscles in the face of the person.

Right. The number of degrees of freedom

that you observe in that person

is the number of muscles in their face.

The number of independently movable muscles, right.

So there's 3 degrees of freedom due to the fact that you can tilt your head this way, that
way, or that way.

That's 3, right there.

Then there is translation, this way, that way. Maybe this way and that way, maybe up or
down. That's 6.

And then the number of muscles in your face, right. So you can
smile. You can,

you know, pout. You can do all kinds of stuff, right. And you can do this, you know,
independently. You can close one eye.

You can smile in one direction, you know, I mean...

So, however many independent muscles you have—not counting the tongue, because there are
tons of muscles in the tongue.

And that's about 50.

Maybe a little more.

So,

regardless, it's less than 100. OK, so the surface,

locally, if you want to parameterize the surface occupied by all those pictures—move from
one picture to another—

it's a surface with less than 100

parameters that determine the position of a point on that surface. Of course it's a highly
nonlinear surface.

It's not like this beautiful Calabi-Yau manifold here,

but it is a surface nonetheless.

Of course the answer was in the slide so, you know.

So what you'd like is an ideal feature extractor to be able to disentangle the explanatory
factors of variation of what you're observing.

Right. So the different aspects of my face, you know—

it's not just that I move my muscles and I move my head around; each of those is an
independent factor of variation. Again,

I can also remove my glasses.

You know, the lighting could change. That's another set of, you know,

variables. And what you'd like is a representation that basically individually represents each
of those factors of variation.

So if there is a criterion to satisfy in learning good representations


it's this: finding the independent explanatory factors of variation of the data that you're
looking at.

And the bottom line is that nobody has any idea how to do this. OK.

But that would be the ultimate goal of

representation learning.

And we basically are at the end. OK.

I'll take two more questions, if there is any.

Yes.

[Inaudible student question.]

OK, so the question is: Is there some sort of pre-processing like PCA that will find those
vectors? Yeah, so PCA will find

those if the manifold is linear. So if you assume that the surface

occupied by all those examples of faces is a plane,

then PCA will find the dimension of that plane—principal component analysis, right.
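
(As a quick sketch of that point—toy data, not faces: if the samples really did lie on a low-dimensional plane inside a high-dimensional space, PCA would recover its dimension from the number of non-negligible singular values.)

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data: 500 points on a 5-dimensional plane embedded in 1,000 dimensions
basis = rng.standard_normal((5, 1000))
coords = rng.standard_normal((500, 5))
X = coords @ basis

X_centered = X - X.mean(axis=0)
singular_values = np.linalg.svd(X_centered, compute_uv=False)
print((singular_values > 1e-8 * singular_values[0]).sum())   # prints 5: the plane's dimension
```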

But, no, it's not linear unfortunately, right. Let me...

Yeah, let me give you an example.

If you take me and my oldest son, who looks like me, and

you place us making the same face in the same position,

the distance between our images will be relatively small even though we're not the same
person.

Now if you take my face and my face shifted by

20 pixels,

there's more distance between me and myself shifted than there is between me and my
son, OK.

So...

What that means is that, you know,

the manifold of my face, you know, is some complicated manifold in that space.
My son is a slightly different manifold, which does not intersect mine.

Yet those two manifolds are very close to each other, and they can be closer to each
other than

two samples from my manifold, or two samples from his manifold, are to each other. So PCA is not
going to tell you anything, basically.

OK, here is another reason why that surface is not a plane.

You're looking at me right now.

Now imagine the manifold—a one-dimensional manifold—of me
turning my head all the way around, 360 degrees.

OK.

That manifold is topologically identical to a circle. It's not flat.

It can't be—it can't be a line. So PCA is not going to find it.

OK, I gotta blast off. Thanks! See you next week.
