Final Evaluation Report

This page outlines the work that has been completed and the final state of the project, with emphasis on the work done since the second evaluation.

To start off, the following is the work that was proposed to be completed by the third evaluation:


I have fortunately been able to complete all except the Shot Segmentation Evaluator, which is thankfully not a crucial part of the project.

For a more detailed look at what has been done, it is helpful to read through the previous two reports as well. This report mainly looks at the VPM and the web-app for Rekognition.

Shot Segmentation

The initial work this month was on Shot Segmentation, a submodule used by VPM. A base class, ShotSegmentation, was created, along with a derived class, PSDSegmenter, an easy-to-use wrapper around PySceneDetect. Using it on a video file is as simple as:

segmenter = PSDSegmenter(filename=video_name)
scene_timings = segmenter.segment()
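
To illustrate how a base class with interchangeable segmenters can be laid out, here is a minimal sketch. Only the names ShotSegmentation, the filename argument and segment() come from this report; the class body, the (start, end) return format in seconds, and FixedLengthSegmenter (a toy stand-in that does not read the video at all) are assumptions made for illustration, not the project's actual code.

```python
from abc import ABC, abstractmethod
from typing import List, Tuple


class ShotSegmentation(ABC):
    """Assumed shape of the base class: stores the file name, defers the segmenting."""

    def __init__(self, filename: str) -> None:
        self.filename = filename

    @abstractmethod
    def segment(self) -> List[Tuple[float, float]]:
        """Return one (start_seconds, end_seconds) pair per scene."""


class FixedLengthSegmenter(ShotSegmentation):
    """Hypothetical segmenter (not in the project): cuts every `scene_length` seconds."""

    def __init__(self, filename: str, duration: float, scene_length: float = 10.0) -> None:
        super().__init__(filename)
        self.duration = duration          # total video length, supplied by the caller
        self.scene_length = scene_length  # fixed scene length in seconds

    def segment(self) -> List[Tuple[float, float]]:
        scenes = []
        start = 0.0
        while start < self.duration:
            end = min(start + self.scene_length, self.duration)
            scenes.append((start, end))
            start = end
        return scenes


# Same calling pattern as PSDSegmenter above, with one extra (assumed) argument.
segmenter = FixedLengthSegmenter(filename="clip.mp4", duration=25.0)
scene_timings = segmenter.segment()
```

Because every segmenter exposes the same segment() call, code that consumes the scene timings does not need to know which one produced them.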

Video Processing Module (VPM)

The Video Processing Module is the final piece on top of the work done over the last three months. It integrates the FRM (Face Recognition Module), Shot Segmentation, Video Generation and Feedback Learning.

The first step of VPM was to merge Shot Segmentation and FRM, so that we could get the face recognition results for celebrities per scene. For every scene, the analytics and logs are collected separately. This gives VPM a higher-level interface, as only the directory locations are now needed to perform computations.
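
One way to picture the per-scene merge is the sketch below. It assumes scene timings are (start, end) pairs in seconds and that each recognition result carries a timestamp and a predicted label; the function name group_by_scene and both data layouts are hypothetical and are not taken from the VPM code.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, List, Tuple


def group_by_scene(
    scene_timings: List[Tuple[float, float]],
    detections: List[Tuple[float, str]],
) -> Dict[int, List[str]]:
    """Bucket (timestamp, label) detections by the scene that contains them.

    Assumes scene_timings is sorted by start time and scenes do not overlap.
    """
    starts = [start for start, _ in scene_timings]
    per_scene: Dict[int, List[str]] = defaultdict(list)
    for timestamp, label in detections:
        # Index of the last scene whose start is <= timestamp.
        idx = bisect_right(starts, timestamp) - 1
        if idx < 0 or timestamp >= scene_timings[idx][1]:
            continue  # timestamp falls outside every scene
        per_scene[idx].append(label)
    return dict(per_scene)


scene_timings = [(0.0, 4.0), (4.0, 9.5)]
detections = [(1.2, "Person A"), (3.9, "Person B"), (4.0, "Person A"), (12.0, "Person C")]
grouped = group_by_scene(scene_timings, detections)
```

Keeping the results keyed by scene index is what allows analytics and logs to be collected separately for each scene.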

For training on a directory of faces, we just have to give:


And for testing on a video of faces, we just have to give:


While interacting with FRM directly is not hard, the VPM interface is much easier and faster. The scene details encompass every bounding box detected, as well as its predicted result. This is necessary for feedback.


Ideally, it should be easy to learn from examples. To this end, every bounding box has a signature triplet code s:f:b, based on the scene number (s), the frame number within the scene (f), and the bounding box number within the frame (b). This makes the code hierarchical, which makes records easier to access. Had we used the frame number within the whole video instead, it would have grown large very quickly; counting frames within each scene avoids that problem. The hierarchy also makes each triplet unique.
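
The s:f:b code can be sketched as a small value type plus a nested lookup. Everything here is illustrative: BoxCode, apply_feedback and the records layout (scene, then frame, then box) are assumed names and structures, and whether the real numbering starts at 0 or 1 is not stated in this report.

```python
from typing import Dict, NamedTuple


class BoxCode(NamedTuple):
    """Hypothetical holder for the s:f:b triplet."""

    scene: int  # scene number (s)
    frame: int  # frame number within the scene (f)
    box: int    # bounding box number within the frame (b)

    def __str__(self) -> str:
        return f"{self.scene}:{self.frame}:{self.box}"

    @classmethod
    def parse(cls, text: str) -> "BoxCode":
        parts = text.strip().split(":")
        if len(parts) != 3:
            raise ValueError(f"expected 's:f:b', got {text!r}")
        scene, frame, box = (int(part) for part in parts)
        return cls(scene, frame, box)


# Assumed record layout: records[scene][frame][box] -> predicted label.
Records = Dict[int, Dict[int, Dict[int, str]]]


def apply_feedback(records: Records, code: str, label: str) -> None:
    """Overwrite the label of the box addressed by `code` with the corrected one."""
    parsed = BoxCode.parse(code)
    records[parsed.scene][parsed.frame][parsed.box] = label


records: Records = {3: {12: {0: "unknown", 1: "Person A"}}}
apply_feedback(records, "3:12:0", "Person B")
```

Because each level of the code indexes one level of the nesting, a record is reached in three dictionary lookups without scanning the whole video's detections.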

The triplet code gives the user an easier way to denote a bounding box. It is integrated with the Video Generation subpart: when the video is generated, every detected bounding box is accompanied by its code (as shown in the following picture), so that the user can quickly pass it as feedback (with the correct label) in the GUI or through the VPM API.


Web-App (GUI)

The final task before code cleaning, documentation and reports was to build a simple yet effective GUI for Rekognition to showcase its abilities. The web-app was built on the existing Flask prototype that was used for the Proof of Concept.

The GUI includes a JSON-based input for choosing the alignment method, detector and embedder as we wish. We can train using some submodules and test using others, and doing so remains very simple.
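
As a rough idea of how such a JSON input can be turned into a module selection, consider the sketch below. The keys (aligner, detector, embedder), the placeholder option names and the load_selection helper are all assumptions made for illustration; the actual schema accepted by the web-app is not reproduced in this report.

```python
import json
from typing import Dict

# Placeholder option names; the real module names are not listed in this report.
AVAILABLE = {
    "aligner": {"aligner_a", "aligner_b"},
    "detector": {"detector_a", "detector_b"},
    "embedder": {"embedder_a", "embedder_b"},
}
DEFAULTS = {"aligner": "aligner_a", "detector": "detector_a", "embedder": "embedder_a"}


def load_selection(raw_json: str) -> Dict[str, str]:
    """Parse a JSON object of stage -> module name, filling in defaults."""
    requested = json.loads(raw_json)
    unknown_stages = set(requested) - set(AVAILABLE)
    if unknown_stages:
        raise ValueError(f"unknown stages: {sorted(unknown_stages)}")
    selection = dict(DEFAULTS)
    for stage, name in requested.items():
        if name not in AVAILABLE[stage]:
            raise ValueError(f"unknown {stage}: {name!r}")
        selection[stage] = name
    return selection


# Different submodules can be chosen for training and for testing.
train_selection = load_selection('{"detector": "detector_b"}')
test_selection = load_selection('{"detector": "detector_a", "embedder": "embedder_b"}')
```

Validating the names up front means a typo in the JSON is reported before any training or testing starts.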

The major difference this time was that building the web-app became easier. Thanks to the extremely easy interface to Rekognition via VPM, the calls needed to perform a particular task were very short. More effort went into giving it a uniform UI, and into the initial proposal of clicking on the bounding box. However, video-based JS extensions were either too big or required extra computation to reduce lag. This led me to opt for the code-based technique described above. More of the GUI can be seen here: Demo.


In terms of the project, this Poor Man’s Rekognition was built on the basis of extensive modularity and stability. I wanted the final project to be robust, yet capable of being extended with more features. This is reflected in the GUI too: one can pick and choose the modules at any time.

Another philosophy I kept in mind throughout was to ensure the robustness of the models. I needed a unified performance evaluation system for all submodules so that no breakdowns happen in the future.

Participating in this GSoC and creating such a fleshed-out pipeline, which at the same time leaves a lot of future scope, has been a great learning experience. I faced more troubles and delays than expected, but was able to complete the project on time. I thank CCExtractor, my mentor Johannes Lochter and Google for this opportunity.

Next Steps

The next plan is to bring face tracking into this, as that would prevent missing faces in a scene due to temporary occlusion. This will go hand in hand with resolving bugs!