Convnets 2

» Convolutional Layer

∗ Recall we can apply several filters to the same input and stack their outputs together. E.g. convolving input a[0] (32 × 32 × 3) with two kernels of size 3 × 3 × 3 gives an output w[1] ∗ a[0] of size 30 × 30 × 2.

∗ To get a complete convolutional layer we pass the elements of the output through a nonlinearity, usually after adding a bias.
∗ Kernel weights w[1] , input a[0] , bias/offset b[1] (weights w[1] and
bias b[1] are unknown parameters that need to be learned).
∗ After convolution output is w[1] ∗ a[0]
∗ Add bias to get z[1] = w[1] ∗ a[0] + b[1]
∗ Final output a[1] = g(z[1] ), for nonlinear activation function g(·).
Note: g(·) is applied separately to each element of z[1] .
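∗ As a minimal sketch of these equations (a single-channel input and one kernel, with assumed example values, not taken from the slides), using scipy for the cross-correlation that CNNs call "convolution":

import numpy as np
from scipy.signal import correlate2d

# one convolutional layer step: z[1] = w[1] * a[0] + b[1], a[1] = g(z[1]) with g = ReLU
a0 = np.random.rand(32, 32)                   # single-channel 32x32 input (illustrative)
w1 = np.random.randn(3, 3)                    # one 3x3 kernel (illustrative)
b1 = 0.1                                      # bias (illustrative)
z1 = correlate2d(a0, w1, mode="valid") + b1   # "valid" -> 30x30 output, no padding
a1 = np.maximum(z1, 0.0)                      # element-wise ReLU g(.)
print(a1.shape)                               # (30, 30)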
» Choice of Activation Function g(·)

∗ ReLU (Rectified Linear Unit): g(x) = x for x ≥ 0 and g(x) = 0 for x < 0, i.e. g(x) = max(0, x)
∗ Almost universally used nowadays (older choices were sigmoid
and tanh). Quick to compute, observed to work pretty well.
∗ But can lead to “dead” neurons where output is always zero
→ leaky ReLU
[Plot: the ReLU, sigmoid and tanh activation functions f(x) for x in [-2, 2]]
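∗ A minimal sketch of ReLU and the leaky-ReLU variant mentioned above (not from the slides; the slope 0.01 for x < 0 is an assumed typical value):

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def leaky_relu(x, alpha=0.01):            # small non-zero slope avoids "dead" neurons
    return np.where(x >= 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))         # [0.  0.  0.  1.5]
print(leaky_relu(x))   # [-0.02  -0.005  0.  1.5]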
» Combining Convolutional Layers
∗ We can use the output from one convolution layer as the input
to another convolution layer
∗ E.g. Suppose input to first layer is 32 × 32 × 3 and convolve
this with 16 kernels of size 3 × 3 × 3 → output is 30 × 30 × 16
∗ Now use this 30 × 30 × 16 output as input to a second layer with 8 kernels of size 3 × 3 × 16 → output is a 28 × 28 × 8 tensor
∗ All layers use ReLU activation function. Stride is 1.
∗ Typical way of drawing this schematically:

input 32 × 32 × 3 → [conv 3 × 3, 16] → 30 × 30 × 16 → [conv 3 × 3, 8] → 28 × 28 × 8

∗ Notes:
∗ “conv 3 × 3, 16” means convolutional layer with 3 × 3 kernel and
16 output channels.
∗ Number of channels in each kernel must match number of input
channels e.g. 3 × 3 × 3 for 3 input channels and 3 × 3 × 16 for 16
input channels, no choice here. So usually abbreviate to 3 × 3.
∗ Depth of cube roughly indicates #output channels.
» Combining Convolutional Layers

Some more notes:


∗ No padding is used, so the output is smaller than the input. We could keep the size the same by using padding.
∗ Number of kernel weights/parameters for first layer is
16 × 3 × 3 × 3 = 432, and for second layer 8 × 3 × 3 × 16 = 1152
∗ Using equations:
∗ Input a[0] to first layer, output is a[1] = g(w[1] ∗ a[0] + b[1] )
∗ Input a[1] to second layer, output is a[2] = g(w[2] ∗ a[1] + b[2] )
where w[1] , w[2] are layer kernel weights, b[1] , b[2] layer bias
parameters and g(·) is ReLU.
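∗ A minimal keras sketch of this two-layer network (not from the slides; note keras also counts the per-channel biases, so its summary reports 448 = 432 + 16 and 1160 = 1152 + 8 parameters):

from tensorflow import keras
from tensorflow.keras.layers import Conv2D

model = keras.Sequential()
model.add(Conv2D(16, kernel_size=(3, 3), input_shape=(32, 32, 3), activation="relu"))  # -> 30x30x16
model.add(Conv2D(8, kernel_size=(3, 3), activation="relu"))                            # -> 28x28x8
model.summary()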
» Pooling Layer

∗ Pooling layers are used to reduce the size of the matrices (channels) in a tensor
∗ E.g. Suppose we want to downsample a 4 × 4 matrix to a 2 × 2 matrix:
1 2 3 4
1 3 2 3

3 2 1 4
6 1 1 2

∗ Use max-pooling with 2 × 2 block size and stride 2:

1 2 3 4
1 3 2 3      3 4
3 2 1 4  →   6 4
6 1 1 2

1. Partition the input matrix into 2 × 2 blocks; a stride of 2 means the blocks don't overlap.
2. Calculate the max element in each block.
3. Use this max as the value of the corresponding output element (see the sketch below).
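∗ A minimal numpy sketch of this pooling step (not from the slides), using the 4 × 4 example above:

import numpy as np

x = np.array([[1, 2, 3, 4],
              [1, 3, 2, 3],
              [3, 2, 1, 4],
              [6, 1, 1, 2]])

pooled = np.zeros((2, 2), dtype=x.dtype)
for i in range(2):
    for j in range(2):
        pooled[i, j] = x[2*i:2*i+2, 2*j:2*j+2].max()   # max over each non-overlapping 2x2 block
print(pooled)   # [[3 4]
                #  [6 4]]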
» Pooling Layer

∗ E.g. Max-pooling with 3 × 3 block size and stride 1: slide the 3 × 3 block over the 4 × 4 input one step at a time, producing the four output elements in turn:

1 2 3 4
1 3 2 3      3 4
3 2 1 4  →   6 4
6 1 1 2

∗ But mostly use stride=block size → no overlap between blocks


∗ Pooling block size and stride must be chosen compatible with
size of input matrix
∗ As well as max-pooling there is average pooling → output is
average of elements in a block. But rarely used.
» Down-sampling Using Strided Convolution
∗ Recall that we can use strides > 1 in a convolutional layer → also reduces
size of output
∗ E.g. Applying the 2 × 2 kernel

1 -1
1 -1

with stride 2 to the 4 × 4 input (one non-overlapping 2 × 2 block at a time):

1 2 3 4
1 3 2 3      -3 -2
3 2 1 4  →    6 -4
6 1 1 2
→ for 4 × 4 input the output is reduced to 2 × 2
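∗ A minimal numpy sketch (not from the slides) that reproduces this strided-convolution example:

import numpy as np

x = np.array([[1, 2, 3, 4],
              [1, 3, 2, 3],
              [3, 2, 1, 4],
              [6, 1, 1, 2]], dtype=float)
k = np.array([[1, -1],
              [1, -1]], dtype=float)

out = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        block = x[2*i:2*i+2, 2*j:2*j+2]   # stride 2 -> non-overlapping 2x2 blocks
        out[i, j] = np.sum(block * k)     # element-wise multiply then sum
print(out)   # [[-3. -2.]
             #  [ 6. -4.]]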
∗ Often works well, e.g. see Striving For Simplicity: The All Convolutional Net
https://arxiv.org/pdf/1412.6806.pdf
∗ Not quite the same as using a (2,2) kernel with stride 1 and same padding followed by (2,2) max-pooling:
∗ A (2,2) kernel with stride 1 and same padding does 16 convolutions, whereas a (2,2) kernel with stride 2 computes only 4 convolutions (so it is faster and computationally cheaper)
∗ Max-pooling combines info from all 4 convolutions involving 2 × 2 block
whereas (2,2) kernel with stride 2 only uses info from 1 convolution per
2 × 2 block (uses less info)
» Fully-Connected Layer
∗ Fully-connected (FC) layer = one layer of MLP. Called dense
layer in keras.
∗ Each output is a function of a weighted sum of all of the inputs
∗ Input is a vector x (not a tensor or matrix). Output is y = f(w^T x), where w are the weights/parameters and f(·) is a nonlinear function.
[Diagram: inputs x1, x2, ..., xn feeding a single unit f with output y]

∗ If the input is the output from a convolution layer, i.e. a tensor, we need to flatten it before it can be used as input to a FC layer.
∗ flattening → take all elements of the tensor and write them as a list/array
∗ e.g. two channels

1 2      4 5
3 4  ,   6 7   → [1, 2, 3, 4, 4, 5, 6, 7].
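∗ A minimal numpy sketch of flattening the two-channel example above (not from the slides; keras' Flatten layer acts on a channels-last tensor, so its element order differs, but the idea is the same):

import numpy as np

c1 = np.array([[1, 2], [3, 4]])
c2 = np.array([[4, 5], [6, 7]])
flat = np.concatenate([c1.flatten(), c2.flatten()])   # flatten each channel, then join
print(flat)   # [1 2 3 4 4 5 6 7]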
» Fully-Connected Layer
∗ A FC-layer can have multiple outputs, e.g. input x and two outputs y1 = f(w^T x), y2 = f(v^T x). Here w is the weight vector for y1 and v is the weight vector for y2.
[Diagram: inputs x1, x2, ..., xn feeding two units, with outputs y1 and y2]
∗ If input vector x has n elements and have m outputs then
FC-layer has n × m parameters.
∗ Suppose have h0 × w0 × c0 input and h1 × w1 × c1 output.
∗ Convolution layer has c1 × k × k × c0 parameters for k × k kernel
∗ FC-layer has h0 × w0 × c0 × h1 × w1 × c1 parameters
∗ E.g. h0 = w0 = 32, c0 = 32, h1 = w1 = 32, c1 = 32: a conv 3 × 3 layer has 9216 parameters, while the FC layer has about 10^9 parameters.
∗ Common to use FC-layer as the last layer in a ConvNet i.e. the
layer which generates the (smallish number of) final outputs.
∗ How to choose nonlinear function f(·)?
∗ Common choice: softmax.
∗ Recall softmax = multi-class logistic regression model.
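∗ A minimal numpy sketch of softmax (not from the slides; the scores are illustrative):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())      # subtract the max for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])    # scores for 3 classes
print(softmax(z))                # ~[0.659 0.242 0.099], sums to 1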
» Convolutional Network Example

MNIST Dataset [1]
∗ Training data: 60K images of handwritten digits 0-9. Test
data 10K images
∗ Each image is 28 × 28 pixels, gray scale
∗ Task is to predict which digit an image shows.
∗ Widely studied, relatively easy task. Best performance to
date is 99.8% accuracy using ConvNet

[1] https://en.wikipedia.org/wiki/MNIST_database#cite_note-Gradient-9
» Convolutional Network Example

input 28 × 28, 1 → [conv 3 × 3, 32, stride 2] → 13 × 13, 32 → [conv 3 × 3, 64, stride 2] → 6 × 6, 64 → [softmax]

∗ Uses strides to downsample the image.


∗ Input 28 × 28 × 1 → 13 × 13 × 32 → 6 × 6 × 64
∗ Number of channels increases as we move through network (1 → 32 → 64),
size of image decreases (28 × 28 → 13 × 13 → 6 × 6)
∗ We use final softmax layer/logistic regression to map from ConvNet features
to final output (flatten step not shown in schematic)
∗ Output is 10 × 1 → there are 10 classes, corresponding to digits 0-9,
elements of output vector are probability of each class. To make
prediction pick the class with highest probability.
» Convolutional Network Example
∗ We'll use the Python keras package for ConvNets (it's a front end to TensorFlow)
import numpy as np
from tensorflow import keras
from tensorflow.keras import regularizers
from tensorflow.keras.layers import Dense, Flatten, Conv2D, MaxPooling2D, Dropout

num_classes = 10
input_shape = (28, 28, 1)

# Load MNIST dataset
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
# Scale images to the [0, 1] range
x_train = x_train.astype("float32") / 255
x_test = x_test.astype("float32") / 255
# Make sure images have shape (28, 28, 1)
x_train = np.expand_dims(x_train, -1)
x_test = np.expand_dims(x_test, -1)
# Convert integer labels to one-hot vectors (needed for categorical_crossentropy)
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

model = keras.Sequential()
# 3x3 kernel with stride 2, 32 output channels.
model.add(Conv2D(32, kernel_size=(3, 3), strides=(2,2), input_shape=input_shape, activation="relu"))
# 3x3 kernel with stride 2, 64 output channels.
model.add(Conv2D(64, kernel_size=(3, 3), strides=(2,2), activation="relu"))
# use CNN output as input to a logistic regression classifier. Regularise logistic loss with L2 penalty.
model.add(Flatten())
model.add(Dense(num_classes, activation="softmax", activity_regularizer=regularizers.l2(0.01)))
model.summary()

model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

model.fit(x_train, y_train, batch_size=32, epochs=5, validation_split=0.2)
score = model.evaluate(x_test, y_test, verbose=0)
print("Test loss: %f accuracy: %f" % (score[0], score[1]))

∗ Note: we use regularisation on FC-layers but usually not on convolutional layers. Why?
» Convolutional Network Example
∗ Typical output:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d (Conv2D) (None, 13, 13, 32) 320
_________________________________________________________________
conv2d_1 (Conv2D) (None, 6, 6, 64) 18496
_________________________________________________________________
flatten (Flatten) (None, 2304) 0
_________________________________________________________________
dense (Dense) (None, 10) 23050
=================================================================
Total params: 41,866
Trainable params: 41,866
Non−trainable params: 0
_________________________________________________________________
Epoch 1/5
3000/3000 [==============================] − 6s 2ms/step − loss: 0.1927 − accuracy: 0.9447 − val_loss: 0.0916 −
val_accuracy: 0.9765
Epoch 2/5
3000/3000 [==============================] − 6s 2ms/step − loss: 0.0788 − accuracy: 0.9788 − val_loss: 0.0755 −
val_accuracy: 0.9814
Epoch 3/5
3000/3000 [==============================] − 6s 2ms/step − loss: 0.0584 − accuracy: 0.9850 − val_loss: 0.0700 −
val_accuracy: 0.9820
Epoch 4/5
3000/3000 [==============================] − 6s 2ms/step − loss: 0.0466 − accuracy: 0.9882 − val_loss: 0.0723 −
val_accuracy: 0.9819
Epoch 5/5
3000/3000 [==============================] − 6s 2ms/step − loss: 0.0384 − accuracy: 0.9908 − val_loss: 0.0616 −
val_accuracy: 0.9858
Test loss: 0.051263 accuracy: 0.987100

∗ Achieves 98.7% accuracy on test data, model takes about 30s to train
∗ Baseline for comparison:
∗ Logistic regression: 73s to train, achieves 92% accuracy
∗ Kernelised SVM: 711s to train, achieves 94% accuracy
» Convolutional Network Example

∗ Can also use dropout rather than an L2 penalty for regularisation → dropout is popular in ConvNets
model = keras.Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), strides=(2,2), input_shape=input_shape, activation="relu"))
model.add(Conv2D(64, kernel_size=(3, 3), strides=(2,2), activation="relu"))
model.add(Dropout(0.5))   # randomly zero 50% of the activations during training
model.add(Flatten())
model.add(Dense(num_classes, activation="softmax"))

∗ Again, note that we use regularisation on FC-layers but usually not on convolutional layers.
» Convolutional Network Example

An alternative (but very similar) architecture:

input 28 × 28, 1 → [conv 3 × 3, 32] → 28 × 28, 32 → [max-pool (2, 2)] → 14 × 14, 32 → [conv 3 × 3, 64] → 14 × 14, 64 → [max-pool (2, 2)] → 7 × 7, 64 → [softmax]

∗ Use “same” padding in conv layers → output is same size as input.
∗ Use max-pool to downsample, stride = pool size = 2
∗ 28 × 28 × 1 → 28 × 28 × 32 → 14 × 14 × 32 → 14 × 14 × 64 → 7 × 7 × 64
∗ Using same padding plus max-pool like this is currently popular ... but that might well change
∗ Python keras code:
model = keras.Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), input_shape=input_shape, padding="same", activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, kernel_size=(3, 3), padding="same", activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(num_classes, activation="softmax", activity_regularizer=regularizers.l2(0.01)))
model.summary()
» Convolutional Network Example
∗ Typical output:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d (Conv2D) (None, 28, 28, 32) 320
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 14, 14, 32) 0
_________________________________________________________________
conv2d_1 (Conv2D) (None, 14, 14, 64) 18496
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 7, 7, 64) 0
_________________________________________________________________
flatten (Flatten) (None, 3136) 0
_________________________________________________________________
dense (Dense) (None, 10) 31370
=================================================================
Total params: 50,186
Trainable params: 50,186
Non−trainable params: 0
_________________________________________________________________
Epoch 1/5
3000/3000 [==============================] − 21s 7ms/step − loss: 0.1490 − accuracy: 0.9565 − val_loss: 0.0627 −
val_accuracy: 0.9854
Epoch 2/5
3000/3000 [==============================] − 22s 7ms/step − loss: 0.0570 − accuracy: 0.9850 − val_loss: 0.0527 −
val_accuracy: 0.9886
Epoch 3/5
3000/3000 [==============================] − 22s 7ms/step − loss: 0.0432 − accuracy: 0.9898 − val_loss: 0.0567 −
val_accuracy: 0.9849
Epoch 4/5
3000/3000 [==============================] − 22s 7ms/step − loss: 0.0345 − accuracy: 0.9920 − val_loss: 0.0504 −
val_accuracy: 0.9877
Epoch 5/5
3000/3000 [==============================] − 21s 7ms/step − loss: 0.0284 − accuracy: 0.9941 − val_loss: 0.0474 −
val_accuracy: 0.9901
Test loss: 0.044836 accuracy: 0.989100

∗ Achieves 98.9% accuracy on test data
∗ Takes 100s to train (longer than when using strides to downsample, why?)
» Cross-validation
∗ Training by minimising cost function and using
cross-validation to select hyperparameters (not just
regularisation penalty but also number of convolutional
output channels etc) is best practice
∗ But ...
∗ ... it often takes ages to train ConvNets. Even in the above very easy example it takes a minute or so; with bigger networks and more data, training can easily take days even with a good GPU rig
∗ So k-fold cross-validation is usually impractical, it just takes too long
∗ Instead, often just keep a hold-out test set and use that to evaluate hyperparameter choices. Also, often evaluate only a few hyperparameter values, as otherwise it takes too long (a sketch of this hold-out approach is given below).
∗ It's not great, but we have little choice. It also means you can see many conflicting/random views on the web for how to approach the same ML task.
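∗ A minimal keras sketch of this hold-out approach using the earlier strided MNIST network (not from the slides; the candidate L2 penalty values are illustrative, not recommendations):

import numpy as np
from tensorflow import keras
from tensorflow.keras import regularizers
from tensorflow.keras.layers import Conv2D, Dense, Flatten

(x_train, y_train), _ = keras.datasets.mnist.load_data()
x_train = np.expand_dims(x_train.astype("float32") / 255, -1)
y_train = keras.utils.to_categorical(y_train, 10)

def build_model(l2_penalty):
    model = keras.Sequential([
        Conv2D(32, (3, 3), strides=(2, 2), activation="relu", input_shape=(28, 28, 1)),
        Conv2D(64, (3, 3), strides=(2, 2), activation="relu"),
        Flatten(),
        Dense(10, activation="softmax", activity_regularizer=regularizers.l2(l2_penalty)),
    ])
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model

for penalty in [0.001, 0.01, 0.1]:   # illustrative candidate values
    model = build_model(penalty)
    # validation_split holds out the last 20% of the training data as a fixed validation set
    hist = model.fit(x_train, y_train, batch_size=32, epochs=5, validation_split=0.2, verbose=0)
    print(penalty, hist.history["val_accuracy"][-1])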
