Carefully chosen anchor boxes of varying sizes and aspect ratios help create diverse regions of interest.
Single-shot detectors must place special emphasis on the issue of multiple scales because they detect objects in a single pass through the CNN framework. If objects are detected from the final CNN layers alone, only large items will be found, because smaller items lose too much signal during downsampling in the pooling layers. To address this problem, single-shot detectors typically look for objects within multiple CNN layers, including earlier layers where higher resolution remains. Despite the precaution of using multiple feature maps, single-shot detectors notoriously struggle to detect small objects, especially those in tight groupings like a flock of birds.
Feature maps from multiple CNN layers help predict objects at multiple scales.
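As a rough sketch of this multi-scale idea (the layer widths, anchor count, and class count below are illustrative assumptions, not any particular model's configuration), a small detection head can be attached to several feature maps of different resolutions so that earlier, higher-resolution maps remain available for smaller objects:

```python
import torch
import torch.nn as nn

# Sketch: attach a small detection head to several feature maps of different
# resolutions, single-shot style, so earlier high-resolution maps can catch
# smaller objects. Channel counts, anchor count, and class count are assumptions.
class MultiScaleHeads(nn.Module):
    def __init__(self, channels=(256, 512, 1024), num_anchors=4, num_classes=80):
        super().__init__()
        out = num_anchors * (num_classes + 4)  # class scores + 4 box offsets per anchor
        self.heads = nn.ModuleList(nn.Conv2d(c, out, kernel_size=3, padding=1)
                                   for c in channels)

    def forward(self, feature_maps):
        # One prediction tensor per feature map; each grid cell predicts
        # num_anchors boxes at that map's scale.
        return [head(f) for head, f in zip(self.heads, feature_maps)]

# Toy usage: three feature maps at decreasing resolution from a backbone.
feats = [torch.randn(1, 256, 64, 64),
         torch.randn(1, 512, 32, 32),
         torch.randn(1, 1024, 16, 16)]
preds = MultiScaleHeads()(feats)
print([p.shape for p in preds])
```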
The feature pyramid network (FPN) takes the concept of multiple feature maps one step further. Images first pass through the typical CNN pathway, yielding semantically rich final layers. Then, to regain better resolution, FPN creates a top-down pathway by upsampling this feature map. While the top-down pathway helps detect objects of varying sizes, the upsampled layers can skew spatial positions. Lateral connections are added between the original feature maps and the corresponding reconstructed layers to improve object localization. FPN currently provides one of the leading ways to detect objects at multiple scales, and YOLO was augmented with this technique in its third version.
The feature pyramid network detects objects of varying sizes by reconstructing high resolution layers from layers with greater semantic strength.
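A minimal FPN-style sketch in PyTorch follows (the channel widths and the nearest-neighbor upsampling choice are assumptions for illustration): 1×1 lateral convolutions project each backbone map to a common width, the top-down pathway upsamples the coarser map, and the two are summed before a 3×3 smoothing convolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal FPN sketch. Backbone maps c3 (fine) through c5 (coarse) are merged
# top-down: lateral 1x1 convs align channels, upsampling restores resolution,
# and element-wise sums form the lateral connections described above.
class TinyFPN(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, kernel_size=1)
                                     for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
                                    for _ in in_channels)

    def forward(self, c3, c4, c5):
        p5 = self.lateral[2](c5)
        p4 = self.lateral[1](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.lateral[0](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        return [self.smooth[0](p3), self.smooth[1](p4), self.smooth[2](p5)]

# Toy usage: coarse, semantically rich c5 gets merged back into higher-resolution maps.
c3, c4, c5 = (torch.randn(1, 256, 64, 64),
              torch.randn(1, 512, 32, 32),
              torch.randn(1, 1024, 16, 16))
p3, p4, p5 = TinyFPN()(c3, c4, c5)
print(p3.shape, p4.shape, p5.shape)
```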
The limited amount of annotated data currently available for object detection proves to be another substantial hurdle. Object detection datasets typically contain ground truth examples for about a dozen to a hundred classes of objects, while image classification datasets can include upwards of 100,000 classes. Furthermore, crowdsourcing often produces image classification tags for free (for example, by parsing the text of user-provided photo captions). Gathering ground truth labels along with accurate bounding boxes for object detection, however, remains incredibly tedious work.
The COCO dataset, provided by Microsoft, currently stands as some of the best object detection data available. COCO contains 300,000 segmented images covering 80 categories of objects with very precise location labels. Each image contains about 7 objects on average, and objects appear across a very broad range of scales. As helpful as this dataset is, object types outside of these 80 select classes will not be recognized if training solely on COCO.
A very interesting approach to alleviating data scarcity comes from YOLO9000, the second version of YOLO. YOLO9000 incorporates many important updates into YOLO, but it also aims to narrow the dataset gap between object detection and image classification. YOLO9000 trains simultaneously on both COCO and ImageNet, an image classification dataset with tens of thousands of object classes. COCO information helps precisely locate objects, while ImageNet increases YOLO’s classification “vocabulary.” A hierarchical WordTree allows YOLO9000 to first detect an object’s concept (such as “animal/dog”) and to then drill down into specifics (such as “Siberian husky”). This approach appears to work well for concepts known to COCO, like animals, but performs poorly on less prevalent concepts, since region-of-interest proposals are learned solely from COCO training data.
YOLO9000 trains on both COCO and ImageNet to increase classification "vocabulary."
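A toy sketch of the WordTree scoring idea is below (the tree and probabilities are made up for illustration and are not YOLO9000's actual hierarchy): the score for a specific class is the product of the conditional probabilities along its path back to the root, which is what lets the model commit to "dog" even when it is unsure of the exact breed.

```python
# Toy WordTree: each node maps to its parent (None marks the root).
tree = {
    "physical object": None,
    "animal": "physical object",
    "dog": "animal",
    "Siberian husky": "dog",
}

# Hypothetical conditional probabilities P(node | parent) predicted by the network.
cond_prob = {
    "physical object": 1.0,
    "animal": 0.9,
    "dog": 0.8,
    "Siberian husky": 0.6,
}

def path_probability(node):
    """Multiply conditional probabilities from the node up to the root."""
    prob = 1.0
    while node is not None:
        prob *= cond_prob[node]
        node = tree[node]
    return prob

print(path_probability("dog"))             # 0.9 * 0.8 = 0.72
print(path_probability("Siberian husky"))  # 0.9 * 0.8 * 0.6 = 0.432
```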
Class imbalance proves to be an issue for most classification problems, and object detection is no exception. Consider a typical photograph. More likely than not, the photograph contains a few main objects, and the remainder of the image is filled with background. Recall that selective search in R-CNN produces 2,000 candidate RoIs per image; just imagine how many of these regions do not contain objects and are counted as negatives!
A recent approach called focal loss is implemented in RetinaNet and helps diminish the impact of class imbalance. In the optimization loss function, focal loss replaces the traditional log loss when penalizing misclassifications: \[ FL(p_u) = -\overbrace{(1-p_u)^\gamma\;}^{*} \log(p_u)\] where \(p_u \) is the predicted class probability for the true class and \(\gamma > 0\). The additional factor (*) reduces loss for well-classified examples with high probabilities, and the overall effect de-emphasizes classes with many examples that the model knows well, such as the background class. Objects of interest occupying minority classes, therefore, receive more significance and see improved accuracy.
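A minimal PyTorch sketch of the focal loss above follows (the value of \(\gamma\) is a commonly used default, not one prescribed here); `logits` holds per-box class scores and `targets` holds the true class index for each box:

```python
import torch
import torch.nn.functional as F

# Focal loss sketch: (1 - p_u)^gamma * (-log p_u), averaged over boxes.
# gamma=2.0 is a common default; shapes and the toy data are assumptions.
def focal_loss(logits, targets, gamma=2.0):
    log_p = F.log_softmax(logits, dim=-1)                           # log-probabilities per class
    log_p_true = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)   # log p_u for the true class
    p_true = log_p_true.exp()                                       # p_u
    # (1 - p_u)^gamma down-weights examples the model already classifies confidently,
    # such as the abundant, easy background boxes.
    return (-(1.0 - p_true) ** gamma * log_p_true).mean()

# Toy usage: 4 boxes, 3 classes, with class 0 playing the role of "background."
logits = torch.tensor([[4.0, 0.1, 0.2],
                       [3.5, 0.0, 0.3],
                       [0.2, 2.5, 0.1],
                       [0.1, 0.2, 0.4]])
targets = torch.tensor([0, 0, 1, 2])
print(focal_loss(logits, targets))
```

With \(\gamma = 0\) the expression reduces to the ordinary log loss; larger values of \(\gamma\) push the easy, well-classified background boxes closer to zero loss so that the rarer foreground classes dominate the gradient.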
Object detection is customarily considered to be much harder than image classification, particularly because of these five challenges: dual priorities, speed, multiple scales, limited data, and class imbalance. Researchers have dedicated much effort to overcoming these difficulties, oftentimes with amazing results; however, significant challenges still persist.
Nearly all object detection frameworks continue to struggle with small objects, especially those bunched together with partial occlusions. Real-time detection with top-level classification and localization accuracy remains challenging, and practitioners must often prioritize one or the other when making design decisions. Video tracking may see improvements in the future if some continuity between frames is assumed rather than processing each frame individually. Furthermore, an interesting enhancement that may see more exploration would extend the current two-dimensional bounding boxes into three-dimensional bounding cubes. Even though many object detection obstacles have seen creative solutions, these additional considerations, and plenty more, signal that object detection research is certainly not done!