First Evaluation Report

This page outlines the work completed as of the first evaluation in the GSoC program. It includes code examples showing how to use the module, some image results, the relevant theory, and explanations of the design. Rather than being sectioned by week, the page is organized by the submodules of the system.

To start off, the following is the work that was proposed to be completed by the first evaluation (present on the last page of the proposal):


I have fortunately been able to complete all the tasks that I had detailed out here. We go into more detail now.

Face Detector


As stated in the proposal, every face detector is implemented as a subclass of the abstract base class (ABC) FaceDetector. This design was chosen so that all face detectors added in the future will have a clearly defined scope in terms of sufficiency and necessity: it is sufficient to implement only the abstract methods of the base class, and it is necessary to implement them. This keeps a face detector from growing too large in function, an avoidable nuisance given that our main goal is a strict API that detects faces and returns their bounding boxes.
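The sufficiency and necessity argument can be illustrated with a minimal sketch. This is not the module's actual base class: only the names FaceDetector and _detect come from this report, while DummyDetector, IncompleteDetector and the return format are hypothetical and chosen for illustration.

```python
from abc import ABC, abstractmethod


class FaceDetector(ABC):
    """Hypothetical sketch of the base class; the real one may differ."""

    @abstractmethod
    def _detect(self, images):
        """Return, per image, a list of (score, (x, y, w, h)) tuples (assumed format)."""


class DummyDetector(FaceDetector):
    """Toy detector that reports one fixed box per image."""

    def _detect(self, images):
        return [[(1.0, (0, 0, 10, 10))] for _ in images]


class IncompleteDetector(FaceDetector):
    """Forgets to implement _detect, so it cannot be instantiated."""


# Sufficiency: implementing the abstract method is all a detector needs.
detections = DummyDetector()._detect(["image-a", "image-b"])

# Necessity: Python refuses to instantiate a subclass that skips it.
try:
    IncompleteDetector()
    instantiated = True
except TypeError:
    instantiated = False
```

Because the check happens at instantiation time, a detector that does not honour the interface fails early instead of deep inside a recognition pipeline.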

The first detector implemented was Viola Jones, using the OpenCV interface. As this was the detector used in the PoC, the implementation did not present any particular hurdles. There were two disadvantages here. The first is speed: the Viola Jones method used does not allow parallel/batch processing, so a batch of images is processed sequentially. The second is the scoring of the detections: Viola Jones as implemented in OpenCV returns the faces based on the parameters provided, but does not return confidence scores. This is not an issue for the face recognition module. However, the lack of scores prevented us from running the evaluator on the Viola Jones detection results. (More information on the evaluation techniques is given below.)

The simple interface to ViolaJones to detect faces in a single image can be seen below. The presence of ABCs let us have a unified way to call any face detector.

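import cv2
import matplotlib.pyplot as plt
from matplotlib import patches
from rekognition.face_detection import FaceBoxes, ViolaJones

# OpenCV reads images as BGR, so convert to RGB before detection and display
image = cv2.cvtColor(cv2.imread('group-selfie.jpg'), cv2.COLOR_BGR2RGB)
classifier = ViolaJones() # replace ViolaJones with FaceBoxes
faces = classifier._detect([image])

ax = plt.gca()
ax.imshow(image)
for a, (x,y,w,h) in faces:
    rect = patches.Rectangle((x,y), w, h, linewidth=1, edgecolor='r', facecolor='none')
    ax.add_patch(rect)  # draw the bounding box on the axes
plt.show()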

The second detector implemented was FaceBoxes, a deep neural network architecture that detects faces in real time. FaceBoxes was implemented using TensorFlow, making use of the model provided in this custom repository (Faceboxes-TF Repo). Using the scores returned by FaceBoxes, we ran the evaluator and documented the results. The PR Curve and the Average Precision for the WIDER FACE dataset are also shown in the next section for convenience. As shown at the beginning of this section, FaceBoxes gives a large qualitative improvement over ViolaJones.

Face Detector Evaluation

Satisfying the code structure required by the abstract class is not, on its own, enough of a check before adding a new face detector to the module; it is important not to fill the module with poorly performing face detectors. For this we need a baseline for the quality of face detections, and future face detectors must be of similar or better quality when compared to FaceBoxes.

This was the core reason behind my decision to add a module-specific face detection evaluator. The evaluator had to be simple to use on the specific performance dataset (more on this below), yet usable by anyone with their own custom face detection ground truths and predictions. The latter requirement is why we could not use the detection evaluators provided by the various face detection challenges; each is specific to its own dataset. So FaceDetectionEvaluator was designed with this generality, with logic to mark faces as ignorable, as some datasets (WIDER FACE) require.

FaceDetectionEvaluator computes the following:

  • PR Curve: computing the PR Curve involves calculating the detection precision and recall at various thresholds of the detector score and plotting them against each other (this is why we could not evaluate the performance of ViolaJones, as mentioned above). For reference, with \(TP, FP, FN\) being the numbers of true positive, false positive and false negative detections,

    \[\text{Precision} = \frac{TP}{TP+FP},\ \text{Recall} = \frac{TP}{TP+FN}\]
  • Average Precision: Average Precision corresponds to the area under the PR Curve. However, there exist various definitions of AP in the literature, including 11-point interpolated average precision and trapezoidal interpolated average precision. Based on recent papers and the evaluation criteria they use, we use the trapezoidal method used in the PASCAL Object Detection Challenge. Here we take every pair of consecutive recall points (that have different values) and calculate the area under the curve between the two points. To smooth out local dips in precision, the precision values are replaced by the values of their upper hull, computed starting from the largest recall point.
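The Average Precision computation described above can be sketched as follows. This is a minimal illustration and not the FaceDetectionEvaluator code: the function name average_precision and its arguments are hypothetical, each detection is assumed to be already marked as a true or false positive, and the area is summed in the PASCAL VOC style as rectangles under the upper hull of the precision values.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the PR curve, using the upper hull of the precision values."""
    # Rank detections from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    recalls, precisions = [], []
    tp = fp = 0
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))

    # Pad the curve so that it spans recall 0 to 1.
    recalls = [0.0] + recalls + [1.0]
    precisions = [0.0] + precisions + [0.0]

    # Upper hull: sweep back from the largest recall point so that
    # precision never increases as recall increases.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])

    # Add up the area between consecutive points whose recall differs.
    ap = 0.0
    for k in range(1, len(recalls)):
        if recalls[k] != recalls[k - 1]:
            ap += (recalls[k] - recalls[k - 1]) * precisions[k]
    return ap


# Four detections against two ground-truth faces (made-up values).
ap_example = average_precision(
    scores=[0.9, 0.8, 0.7, 0.6],
    is_true_positive=[True, False, True, False],
    num_ground_truth=2,
)
```

Lowering the score threshold corresponds to walking further down the ranked list, which is why a detector without confidence scores yields only a single point instead of a curve.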

An example of the PR Curve generated for FaceBoxes on the specific dataset:


Evaluation Dataset (WIDER)

The dataset chosen for face detection performance evaluation is WIDER FACE. WIDER FACE has a large variety of images, ordered by category, with each face annotated with its bounding box and difficulty (easy, medium and hard). WIDER FACE is also frequently cited and used as a baseline in face detection papers. We use the validation set for analysis, which consists of 3000+ images.

This led to our next conundrum: how do we package the dataset? The dataset had been modified from its original structure, with additional files to ease its usage, so it could not simply be downloaded from the WIDER FACE source. One option was to store it in the cloud, but this had its own issues with the storage account and with user request bandwidth limits. Hence I opted for the git approach: the tar.gz of the WIDER dataset and its metadata is stored under Git LFS (Large File Storage).

This has a couple of benefits: the user can clone the repository without having to download the large dataset, which is present only as a reference; the dataset remains in the git ecosystem, so the developer need not connect to third parties to download it; and, if and when needed, the developer can download it at any point. The way to do this is detailed in the project README.

Along with the dataset, a ready-made notebook for calculating the performance scores and the PR Curve is present in tests/performance/face_detection/. The work up to this point corresponds to the first two weeks of the timeline.

Face Alignment

Detected faces come in different orientations and sizes. To improve face recognition results, all the faces are converted, or aligned, to a similar structure. This structure enforces equal image dimensions and equal positions of certain facial elements across all faces. This is what the Face Alignment methods do. Like FaceDetector, FaceAlignment is an ABC from which future face alignment methods must derive; the philosophy behind this is the same as described for the face detector. What differs are the required functions. Face alignment comes in two steps. The first step is to locate the various features in a face; these features typically correspond to points around the eyes, nose and jaw. The second step is to use these landmarks and the initial bounding box to generate an aligned face.

The first alignment method implemented was Kazemi-Sullivan, an algorithm based on a cascade of gradient-boosted regression trees that detects facial landmarks at over 400 faces per second. Kazemi-Sullivan was implemented using the dlib interface. It returns 68 well-defined points across the face. We then use the points defined for the corners of the two eyes and the base of the nose to align the image to predefined alignment locations on a fixed-dimension canvas. This warping is done via an affine transformation; the transformation matrix is generated from the three points mentioned above.
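To make the three-point affine step concrete, here is a minimal sketch of how a 2x3 transformation matrix can be recovered from three point correspondences. It is an illustration under stated assumptions, not the module's code: the helper names and all coordinates are hypothetical, and the linear system is solved with Cramer's rule in plain Python. In practice OpenCV's cv2.getAffineTransform and cv2.warpAffine cover the same job; whether the module calls them is not stated in this report.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def affine_from_three_points(src, dst):
    """Return the 2x3 matrix that maps each src point onto its dst point."""
    system = [[x, y, 1.0] for x, y in src]
    d = det3(system)
    if abs(d) < 1e-12:
        raise ValueError("source points are collinear")
    rows = []
    for axis in range(2):  # solve once for the x outputs, once for the y outputs
        target = [p[axis] for p in dst]
        row = []
        for col in range(3):  # Cramer's rule: swap one column for the target
            swapped = [r[:] for r in system]
            for i in range(3):
                swapped[i][col] = target[i]
            row.append(det3(swapped) / d)
        rows.append(row)
    return rows


def apply_affine(matrix, point):
    """Map a single (x, y) point through a 2x3 affine matrix."""
    x, y = point
    return tuple(a * x + b * y + c for a, b, c in matrix)


# Hypothetical landmarks in the detected face (two eye corners, nose base)
# and hypothetical template positions on a fixed-size canvas.
src = [(30.0, 40.0), (70.0, 42.0), (50.0, 80.0)]
dst = [(20.0, 30.0), (76.0, 30.0), (48.0, 70.0)]
matrix = affine_from_three_points(src, dst)
mapped = [apply_affine(matrix, p) for p in src]
```

Three non-collinear correspondences determine the six unknowns of the matrix exactly, which is why three landmarks are enough to fix the alignment.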

Since KazemiSullivan models are large, we have not included any directly in the source code. However, one model is present as an LFS file at tests/models/KazemiSullivanModel.dat for use.

Finding the facial landmarks of a single face can be done using this code snippet:

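# NOTE: import path assumed by analogy with rekognition.face_detection
from rekognition.face_alignment import KazemiSullivan

# reuses image, faces and plt from the face detection snippet above
aligner = KazemiSullivan(model='shape_predictor_68_face_landmarks.dat')
# one list of bounding boxes per image; here a single box for a single image
landmarks = aligner._landmarkDetection([image], [[faces[0][1]]])

# show landmarks
landmark = landmarks[0][0]
plt.imshow(image)
for (x,y) in landmark:
    plt.plot(x, y, 'r.')  # mark each landmark with a red dot
plt.show()

# for more info on the indexing visit the function docs
aligned_face = aligner._align([image], [[faces[0][1]]])[0][0]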

An example face alignment operation using KazemiSullivan:


Face Alignment Evaluation

For the same reasons as for the face detector above, face alignment also required an evaluator, and generality is just as important here. So the interface to FaceAlignmentEvaluator takes only the predicted landmarks and the ground-truth landmarks as input. The plots and the quantitative scoring of the predictions require only these two inputs.

FaceAlignmentEvaluator computes the following:

  • (Average) Error: the average error is computed across the landmark predictions of all the faces. The error for a face is computed as the RMS of the residuals of its landmarks. Since such an error would be highly dependent on the face size, it is normalized. The normalization factor can be the distance between the outer eye points, the distance between the pupils, or the length of the diagonal of the face’s bounding box.

  • CED Curve: the Cumulative Error Distribution curve computes the fraction of faces in the dataset that are landmarked under a given threshold of error, plotted against the error threshold. Ideally the curve must rise at an early error threshold and saturate as soon as possible.

  • Failure Rate: Within the CED Curve, we put a limit on the error (the failure threshold) such that any face with an error above that value is considered to be erroneously landmarked. By default this is taken as 0.08. The failure rate is the fraction of faces that are wrongly landmarked.

  • AUC: As stated above, the larger the area under the CED curve, the better the predictions generally are. The AUC is computed by a simple integration of the CED curve from an error threshold of zero to the failure threshold.
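The four quantities above can be sketched in a few lines. This is a hedged illustration, not the FaceAlignmentEvaluator implementation: the function names, the sampling grid and the choice to divide the AUC by the failure threshold (so that a perfect predictor scores 1) are assumptions made for this example.

```python
import math


def normalized_error(predicted, truth, norm_distance):
    """RMS of the per-landmark residuals, divided by a face-size measure."""
    squared = [(px - tx) ** 2 + (py - ty) ** 2
               for (px, py), (tx, ty) in zip(predicted, truth)]
    return math.sqrt(sum(squared) / len(squared)) / norm_distance


def ced_metrics(errors, failure_threshold=0.08, steps=1000):
    """Return (failure_rate, auc) for a list of per-face normalized errors."""
    n = len(errors)
    # Faces whose error exceeds the threshold count as failures.
    failure_rate = sum(e > failure_threshold for e in errors) / n

    # Sample the CED curve on a uniform grid from 0 to the failure threshold.
    grid = [failure_threshold * k / steps for k in range(steps + 1)]
    ced = [sum(e <= t for e in errors) / n for t in grid]

    # Trapezoidal integration, normalized by the threshold (an assumption).
    width = failure_threshold / steps
    area = sum((ced[k] + ced[k + 1]) / 2 * width for k in range(steps))
    return failure_rate, area / failure_threshold


# Made-up example: two landmarks on one face, normalized by a distance of 10.
example_error = normalized_error([(0, 0), (3, 4)], [(0, 0), (0, 0)], 10.0)

# Made-up per-face errors; only the last one exceeds the 0.08 threshold.
failure_rate, auc = ced_metrics([0.02, 0.04, 0.06, 0.10])
```

A curve that rises early accumulates more area before the threshold, so the AUC rewards predictors that are accurate on most faces and not just below the failure limit.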

An example of the CED curve generated for Kazemi-Sullivan is shown here (run on the iBUG dataset):


Evaluation Dataset (iBUG)

The initially proposed dataset was MENPO. However, the MENPO dataset does not come with bounding boxes for the faces to be landmarked. This required a change of dataset, and we settled on iBUG, a subset of the 300W face annotation challenge. We chose only this subset of 130+ images because it is labelled the “challenge” dataset in various papers. Although landmarking accuracy will be lower on this harder subset, it provides us with a tough baseline as face annotation methods improve. For the same reasons as given for the WIDER FACE dataset, we have modified the dataset structure, added metadata, and put up the tar.gz as an LFS file (in tests/dataset/).

Along with the dataset, just as in Face Detection, a notebook is also present in tests/performance/face_detection/ which is ready-made for calculating the performance scores and the CED Curve.

Next Step

The first evaluation period goals have all been reached. The next step is to finish the DAN face alignment network and to have a draft of the recognition module which integrates Face Detection and Face Alignment. Work then starts on Face Identification and on completing the face recognition module. By the end of Phase II, work will have started on the video processing module.