FACE RECOGNITION SYSTEM BLOCK DIAGRAM
DATA COLLECTION
First, we collect images of each person to be recognized. We then label the images and store
them in a directory, as shown in the figure below. The labelled images are then passed on for
further processing.
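As a rough illustration of the labelling step, the sketch below reads a labelled image directory into (path, name) pairs. It assumes one sub-directory per person, named after that person (root/<person>/<image>); this layout, the function name load_labelled_images and the list of accepted extensions are assumptions made for the example, not details taken from the figure.

```python
from pathlib import Path

# Assumed set of image file extensions; adjust to the files actually collected.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def load_labelled_images(root):
    """Return (image_path, person_name) pairs from a root/<person>/<image> layout."""
    samples = []
    for person_dir in sorted(Path(root).iterdir()):
        if not person_dir.is_dir():
            continue  # skip stray files next to the person folders
        for image_path in sorted(person_dir.iterdir()):
            if image_path.suffix.lower() in IMAGE_EXTENSIONS:
                # The folder name serves as the label for every image inside it.
                samples.append((image_path, person_dir.name))
    return samples
```

Using the directory name as the label keeps the data set self-describing: adding a new person only requires adding a new folder.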
DETECTING AND ALIGNING FACE USING MTCNN
We first need to detect and align the face before feeding it to the FaceNet model. This is
achieved with a multi-task cascaded convolutional neural network (MTCNN). Large visual
variations of faces, such as occlusions, large pose variations and extreme lighting, pose
great challenges for face detection and alignment in real-world applications, and MTCNN is
designed to cope with them. Its pipeline is a cascaded framework made up of three stages of
multi-task deep convolutional networks:
STAGE 1: The first stage quickly produces candidate windows through a shallow CNN. A fully
convolutional network, called the Proposal Network (P-Net), is used to obtain the candidate
facial windows and their bounding box regression vectors. The candidates are then calibrated
using the estimated bounding box regression vectors. After that, non-maximum suppression
(NMS) is applied to merge highly overlapping candidates.
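The NMS step can be sketched in plain Python as follows. This is a minimal greedy version for illustration, not the MTCNN implementation itself: it keeps the highest-scoring box and discards any lower-scoring box that overlaps a kept box too strongly. The box format (x1, y1, x2, y2), the function names and the default iou_threshold of 0.5 are assumptions made for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes that survive, highest score first."""
    # Visit candidates from the most to the least confident.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any already-kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

For example, two near-identical windows around the same face collapse to the one with the higher score, while a window around a second face elsewhere in the image is kept.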
STAGE 2: The second stage refines the windows by rejecting a large number of non-face windows
through a more complex CNN. All candidates are fed to another CNN, called the Refine Network
(R-Net), which further rejects a large number of false candidates, performs calibration with
bounding box regression, and applies NMS.
STAGE 3: Finally, a more powerful CNN (the Output Network, O-Net, in the MTCNN design) refines
the result again and outputs the positions of five facial landmarks. This stage is similar to
the second stage, but here the aim is to identify face regions with more supervision.
TRAINING THE ALIGNED IMAGES USING FACENET
To train on the aligned images we use FaceNet, which is based on a deep convolutional network.
It is a one-shot model that directly learns a mapping from face images to a compact Euclidean
space in which distances directly correspond to a measure of face similarity. One-shot learning
aims to learn information about object categories from one, or only a few, training images. In
contrast, most machine-learning-based object categorization algorithms require training on
hundreds or thousands of images and very large datasets, which is not feasible when only a few
images of each person are available.
Fig: Model structure
Our network consists of a batch input layer and a deep CNN followed by L2 normalization,
which results in the face embedding. This is followed by the triplet loss during training.
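The L2 normalization step can be illustrated with a short sketch: it rescales the raw network output so that the embedding has unit length, which places every face on the same hypersphere and makes distances between embeddings comparable. The function name l2_normalize and the small eps guard against division by zero are assumptions made for the example.

```python
import math


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit Euclidean length (eps guards against a zero vector)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / max(norm, eps) for v in vector]
```

After normalization, the Euclidean distance between any two embeddings lies between 0 and 2, which makes it easier to choose a single decision threshold.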
Fig: Triplet loss before and after training
Here, the anchor denotes the reference image fed to the neural network, the positive denotes
an image that matches the anchor (same person), and the negative denotes an image that does
not match the anchor (different person).
The Triplet Loss minimizes the distance between an anchor and a positive, both of which have
the same identity, and maximizes the distance between the anchor and a negative of a different
identity.
To train the network, FaceNet uses triplets of roughly aligned matching / non-matching face
patches. A triplet is a collection of one anchor image, one image that matches the anchor and
one image that does not match the anchor. As a result, the distance in the embedding space
becomes smaller for similar faces and larger for dissimilar faces.
A more precise mathematical expression for the triplet loss is given below. In addition to
what is described above, the triplet loss addresses the open-set problem by adding a parameter
α to the loss function, so that the model not only classifies images but also enforces a
margin between different classes.
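The triplet loss is given by:

$$L = \sum_{i=1}^{N} \left[ \lVert f(x_i^a) - f(x_i^p) \rVert_2^2 - \lVert f(x_i^a) - f(x_i^n) \rVert_2^2 + \alpha \right]_+$$

where the embedding is represented by $f(x) \in \mathbb{R}^d$; it embeds an image $x$ into a
d-dimensional Euclidean space. $x_i^a$ is the anchor image of a person, $x_i^p$ is the positive
(matching) image of the same person and $x_i^n$ is the negative (non-matching) image of a
different person. $\alpha$ is the margin that is enforced between positive and negative pairs,
$N$ is the number of triplets, and $[\cdot]_+$ denotes $\max(\cdot, 0)$.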
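The triplet loss described here can be sketched in plain Python. This is a minimal illustration on ready-made embedding vectors, not the training code of the system: the function names and the default margin alpha=0.2 are assumptions made for the example. A triplet contributes to the loss only when the negative is not yet at least alpha further from the anchor than the positive.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchors, positives, negatives, alpha=0.2):
    """Sum of max(d(a, p) - d(a, n) + alpha, 0) over all triplets."""
    total = 0.0
    for a, p, n in zip(anchors, positives, negatives):
        violation = squared_distance(a, p) - squared_distance(a, n) + alpha
        # Triplets that already satisfy the margin contribute nothing.
        total += max(violation, 0.0)
    return total
```

A triplet whose negative is already far from the anchor yields zero loss, so training effort concentrates on the triplets that still violate the margin.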
FACE RECOGNITION
Once the FaceNet model is trained, we can create the embedding of a face by feeding its image
into the model. To compare two images, we create an embedding for each by passing them through
the model separately. We then compute the Euclidean distance between the two embeddings, which
is low for similar faces and high for different faces. If an unknown face is found, that is,
one whose Euclidean distance to the stored faces is greater than the threshold distance, the
system alerts the admin.
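The Euclidean distance between the embeddings of two face images $x_1$ and $x_2$ can be
calculated as:

$$\operatorname{dist}(x_1, x_2) = \lVert f(x_1) - f(x_2) \rVert_2 = \sqrt{\sum_{k=1}^{d} \left( f(x_1)_k - f(x_2)_k \right)^2}$$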
The above can be summarized as follows:
Let us see how faces are recognized using everything described above. At this point we have a
corpus of 128-dimensional embeddings with the corresponding employee names. Whenever an
employee faces the detection camera, the captured image is run through the pre-trained network
to create a 128-dimensional embedding, which is then compared with the stored embeddings using
the Euclidean (L2) distance. If the lowest distance between the captured embedding and the
stored embeddings is less than a threshold value, the system recognizes that person as the
employee corresponding to the closest embedding.
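The matching step can be sketched as a nearest-neighbour search with a threshold. This is a minimal illustration, assuming the stored embeddings are kept in a dictionary that maps each employee name to one embedding vector; the function names, the dictionary layout and the threshold value passed in are assumptions made for the example. Returning None for the name corresponds to the unknown-face case in which the admin is alerted.

```python
import math


def euclidean_distance(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def identify(query, known_embeddings, threshold):
    """Return (name, distance) of the closest stored embedding.

    The name is None when even the closest embedding is not within the
    threshold, i.e. the face is treated as unknown.
    """
    best_name, best_distance = None, math.inf
    for name, embedding in known_embeddings.items():
        distance = euclidean_distance(query, embedding)
        if distance < best_distance:
            best_name, best_distance = name, distance
    if best_distance < threshold:
        return best_name, best_distance
    return None, best_distance
```

In practice the threshold would be tuned on held-out image pairs, since a value that is too low rejects genuine employees and one that is too high accepts strangers.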