Learning from Big Data
Lecture 19: Image recognition and CNNs
Dr. Lloyd T. Elliott, Fall 2022
Why image recognition is difficult
'Typographic attack': pen and paper fool AI into thinking apple is an iPod
• Recognizing objects in real scenes
• Variation in lighting and viewpoint
• Definition of objects
• Requires huge amounts of knowledge (even for segmentation and viewpoint / lighting)
State of the art
• ADOP: Approximate differentiable one-pixel point rendering (University of
Erlangen-Nuremberg)
• https://www.youtube.com/watch?v=WJRyu1JUtVw
Things that make it hard
• Segmentation: real scenes are cluttered with other objects:
• Hard to tell which pieces go together
• Parts of an object can be hidden or clipped (occlusion)
• Lighting: Intensities are as much determined by lighting as by nature of object
• Deformation: Wide variety of shapes have the same name
• Affordances: For many objects, function is more important than shape for definition
More things that make it hard to recognize objects
• Viewpoint: wide variety of viewpoints for the same object
• "Information hops between input dimensions" dimension hopping
• We don't see this for many types of structured data (medical for example)
Viewpoint invariance
• Each time we look at an object, we have a different viewpoint, unlike in other machine learning tasks
• Humans are so good at viewpoint variation, it's hard to appreciate how difficult it is
• One of the main difficulties in computer vision
• Typical approaches:
• Use redundant invariant features
• Bounding boxes
• Replicated features with pooling ("convolutional neurons")
Invariant feature approach
• Extract a large, overlapping / redundant set of features invariant to transformations (rotation, scaling, translation, shear, stretch)
• Example: centre / surround for the visual field
• Problem: features will overlap with objects that are not in the foreground ("parts of different objects")
• Put a box around objects
• Normalize within the box
• Choosing the box is difficult (chicken / egg problem)
Brute force normalization
• When training the recognizer, use well-segmented upright images to fit the correct box
• At test time, try all possible boxes over a range of positions and scales (see the sketch below)
• This approach was often used in computer vision around 2015
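A minimal sketch of this brute-force search (my own, not from the slides): the recognizer score_box, the box sizes, and the stride are hypothetical placeholders.

import numpy as np

def brute_force_detect(image, score_box, box_sizes=(32, 48, 64), stride=8):
    # score_box is assumed to be a recognizer trained on well-segmented,
    # upright crops; it takes a square patch and returns a confidence score
    best_score, best_box = -np.inf, None
    height, width = image.shape[:2]
    for size in box_sizes:                             # range of scales
        for y in range(0, height - size + 1, stride):  # range of vertical positions
            for x in range(0, width - size + 1, stride):
                patch = image[y:y + size, x:x + size]
                score = score_box(patch)               # normalize / classify within the box
                if score > best_score:
                    best_score, best_box = score, (x, y, size)
    return best_box, best_score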
Convolutional neural nets
• LeNet 1990s
• Use many different copies of the same feature detector at different positions
• A feature detector useful in one place in the
image is likely useful in other areas too
• When we learn, we keep the red arrows all
having the same weights as each other
(Figure: red connections all have the same weight)
Convolutional neural nets
• Replication greatly reduces the number of free parameters to be learned
• In this example, 27 -> 9 weights (see the Keras sketch below)
• Make many maps, each one with replicates of the same feature. Different maps learn to detect different features.
• Each patch of the image can then be represented by features of many different types
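To see the parameter saving concretely, here is a small Keras sketch (my own, not from the slides; the layer sizes are chosen for illustration and differ from the 27 -> 9 example above). One 3x3 detector replicated over a 5x5 image has 9 shared weights, while a fully connected layer producing the same nine outputs has 225 free parameters.

from tensorflow import keras
from tensorflow.keras import layers

# One 3x3 feature detector replicated across a 5x5 image: 9 shared weights,
# no matter how many positions it is copied to
shared = keras.Sequential([
    keras.Input(shape=(5, 5, 1)),
    layers.Conv2D(filters=1, kernel_size=(3, 3), use_bias=False),
])

# The same nine outputs from a fully connected layer: every output unit gets
# its own weight to every pixel, so 25 x 9 = 225 free parameters
unshared = keras.Sequential([
    keras.Input(shape=(5, 5, 1)),
    layers.Flatten(),
    layers.Dense(units=9, use_bias=False),
])

print(shared.count_params())    # 9
print(unshared.count_params())  # 225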
Backpropagation with weight constraints
• It's easy to modify the backpropagation algorithm to incorporate linear
constraints between the weights
• We compute gradients as usual, but we modify gradients so that they satisfy
the constraints
• To constrain w1 = w2, we need Δw1 = Δw2
• This is done as follows:
• Compute ∂E/∂w1 and ∂E/∂w2
• Use ∂E/∂w1 + ∂E/∂w2 as the update for both w1 and w2
• We can thus force backpropagation to use replicated features
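A minimal sketch (not from the slides) of what this update looks like in code, assuming the two gradients g1 and g2 have already been computed by ordinary backpropagation and an illustrative learning rate:

# g1 = dE/dw1 and g2 = dE/dw2, computed as usual by backpropagation
def tied_update(w1, w2, g1, g2, lr=0.1):
    g = g1 + g2                        # combined gradient satisfying the constraint
    return w1 - lr * g, w2 - lr * g    # w1 and w2 stay equal if they start equal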
What does replicating the features achieve?
• Equivariant activities: the neural activities in the next layer are not invariant to
translation, but they are equivariant
• The representation changes by as much as the image does
• Invariant knowledge: if a feature can be detected in one location, it can be
detected in other locations too
Pooling the output of replicated feature detectors
• To get invariance in activity, we must pool the output of the convolutional
layer
• Average (or take the maximum of) neighbouring replicated detectors to give a single output to the next level (illustrated in the sketch below)
• Reduces the number of inputs to the next layer (meaning we can learn more features)
• Problem: after several levels of this pooling, we lose information about the precise location of the object (that's fine for kilns, for example)
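The equivariance-then-invariance behaviour described above can be seen in a few lines of numpy; this is a small sketch of my own (not from the slides), using a 1-D signal and a two-tap edge detector:

import numpy as np

def correlate_valid(x, k):
    # 'valid' cross-correlation: the replicated feature detector applied at every position
    return np.array([np.dot(x[i:i + len(k)], k) for i in range(len(x) - len(k) + 1)])

kernel  = np.array([1.0, -1.0])                  # a simple edge detector
x       = np.array([0., 0., 1., 1., 0., 0.])
x_shift = np.array([0., 0., 0., 1., 1., 0.])     # same pattern, shifted by one pixel

feat, feat_shift = correlate_valid(x, kernel), correlate_valid(x_shift, kernel)
print(feat)                          # [ 0. -1.  0.  1.  0.] -- equivariant: the response
print(feat_shift)                    # [ 0.  0. -1.  0.  1.]    pattern shifts with the input
print(feat.max(), feat_shift.max())  # 1.0 1.0 -- invariant after max pooling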
LeNet5
• Yann LeCun and collaborators developed the first good recognizer for handwritten digits using backpropagation in a feedforward net
• Many hidden layers, many maps of replicated units, pooling between layers.
Did not require segmentation
• Was deployed by the USPS, handling ~10% of zip code reading in the USA in the early 2000s
LeNet5 in tensorflow
• medium.com/@mgazar
from tensorflow import keras
from tensorflow.keras import layers

# LeNet5-style architecture: two convolution + average-pooling stages,
# followed by fully connected layers and a 10-way softmax output
model = keras.Sequential()
model.add(layers.Conv2D(filters=6, kernel_size=(3, 3), activation='relu', input_shape=(32, 32, 1)))
model.add(layers.AveragePooling2D())
model.add(layers.Conv2D(filters=16, kernel_size=(3, 3), activation='relu'))
model.add(layers.AveragePooling2D())
model.add(layers.Flatten())
model.add(layers.Dense(units=120, activation='relu'))
model.add(layers.Dense(units=84, activation='relu'))
model.add(layers.Dense(units=10, activation='softmax'))
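As a usage sketch (my own, not from the Medium post): the model above can be trained on MNIST after padding the 28x28 digits to the 32x32 input the first layer expects. The optimizer and epoch count are illustrative choices.

import numpy as np

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = np.pad(x_train, ((0, 0), (2, 2), (2, 2)))[..., np.newaxis] / 255.0
x_test  = np.pad(x_test,  ((0, 0), (2, 2), (2, 2)))[..., np.newaxis] / 255.0

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))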
Prior knowledge in machine learning
• LeNet5 prior knowledge:
• Connectivity
• Weight constraints
• Activation functions
• Less intrusive than hand engineering features, but still pushes the network
towards a particular way of solving the problem
• Alternative: use prior knowledge to create more training data, e.g. augment the training data with simulated data (Hofman 1993)
More tricks
• Data augmentation
• Subsample & transform training images (AugMix Hendrycks et al. 2019)
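A minimal sketch of simple image augmentation with Keras preprocessing layers (my own; this illustrates random transforms of training images in general, not the AugMix procedure, and the transform parameters are illustrative):

from tensorflow import keras
from tensorflow.keras import layers

# Random transforms applied to each training image on the fly
augment = keras.Sequential([
    layers.RandomTranslation(height_factor=0.1, width_factor=0.1),
    layers.RandomRotation(factor=0.05),
    layers.RandomZoom(height_factor=0.1),
])

# Example: augment a batch of MNIST digits (x_train from the LeNet5 sketch above)
augmented_batch = augment(x_train[:32], training=True)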
Thank you