Published: 3 January 2024
Contributors: Jacob Murel Ph.D., Eda Kavlakoglu
Object detection is a technique that uses neural networks to localize and classify objects in images. This computer vision task has a wide range of applications, from medical imaging to self-driving cars.
Object detection is a computer vision task that aims to locate objects in digital images. As such, it is an instance of artificial intelligence that consists of training computers to see as humans do, specifically by recognizing and classifying objects according to semantic categories.1 Object localization is a technique for determining the location of specific objects in an image by demarcating each object with a bounding box. Object classification is another technique that determines to which category a detected object belongs. The object detection task combines the subtasks of object localization and classification to simultaneously estimate the location and type of object instances in one or more images.2
Object detection overlaps with other computer vision techniques, but developers nevertheless treat it as a discrete endeavor.
Image classification (or image recognition) aims to classify images according to defined categories. A rudimentary example of this is CAPTCHA image tests, in which a group of images may be organized as images with stop signs and images without. Image classification assigns one label to a whole image.
Object detection, by comparison, delineates individual objects in an image according to specified categories. While image classification divides images among those that have stop signs and those that do not, object detection locates and categorizes all of the road signs in an image, as well as other objects such as cars and people.
Image segmentation (or semantic segmentation) is similar to object detection, albeit more precise. Like object detection, segmentation delineates objects in an image according to semantic categories. But rather than mark objects using boxes, segmentation demarcates objects at the pixel level.
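The difference between the three tasks shows most clearly in the shape of their outputs. The following is a minimal sketch with hypothetical type names and made-up values (none of them come from a real library): classification yields one label per image, detection yields a labeled box per object, and segmentation yields a label per pixel.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ClassificationResult:
    """Image classification: a single label for the whole image."""
    label: str


@dataclass
class Detection:
    """Object detection: one labeled bounding box per object instance."""
    label: str
    box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


# Segmentation: one class label for every pixel (rows of labels).
SegmentationMask = List[List[str]]

# Illustrative outputs for the same hypothetical street image.
classification = ClassificationResult(label="stop sign")
detections = [
    Detection(label="stop sign", box=(12, 8, 40, 36)),
    Detection(label="car", box=(50, 30, 120, 90)),
]
mask: SegmentationMask = [
    ["road", "road", "car"],
    ["road", "car", "car"],
]
```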
Understanding object detection’s inner workings requires a foundation in computer vision and digital image processing more broadly. This section provides a general overview.
In computer vision, images are expressed as continuous functions on a 2D coordinate plane represented as f(x,y). When digitized, images undergo two primary processes called sampling and quantization, which, in short, together convert the continuous image function into a discrete grid structure of pixel elements. The computer can then segment an image into discrete regions according to visual similarity and proximity of pixels.3
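Sampling and quantization can be illustrated with a small sketch. The continuous image function `f` below is an arbitrary stand-in chosen for illustration, and the helper name `digitize` is hypothetical; real pipelines receive already-digitized images from a sensor or file.

```python
import math
from typing import List


def f(x: float, y: float) -> float:
    """Hypothetical continuous image: intensity in [0, 1] at any real (x, y)."""
    return 0.5 + 0.5 * math.sin(2 * math.pi * x) * math.cos(2 * math.pi * y)


def digitize(width: int, height: int, levels: int = 256) -> List[List[int]]:
    """Sample f on a width x height grid, then quantize to integer gray levels."""
    image = []
    for row in range(height):
        y = (row + 0.5) / height  # sampling: evaluate f at each pixel centre
        pixels = []
        for col in range(width):
            x = (col + 0.5) / width
            value = f(x, y)
            # Quantization: map the continuous intensity to one of `levels` integers.
            pixels.append(min(levels - 1, int(value * levels)))
        image.append(pixels)
    return image


grid = digitize(4, 4)        # 4 x 4 pixels, 256 gray levels
coarse = digitize(4, 4, 2)   # same sampling, only 2 gray levels
```

Increasing `width` and `height` refines the sampling, while increasing `levels` refines the quantization; the two choices are independent.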
By labeling images using an annotation interface, users define a specific object as a region of specific pixel-level features (for example, area, gray-value, and so on). When given an input image, the object detection model recognizes regions with similar features to those defined in the training dataset as the same object. In this way, object detection is a form of pattern recognition. Object detection models do not recognize objects per se, but rather aggregates of properties such as size, shape, color, and so on, and classify regions according to visual patterns inferred from manually annotated training data.4
An object detection model for a self-driving car, for example, does not recognize pedestrians but a set of features that form the general pattern characterizing pedestrian objects (as defined in the training data).
While different model families use different architectures, deep learning models for object detection follow a general structure. They consist of a backbone, neck, and head.
The backbone extracts features from an input image. Often, the backbone is derived from part of a pretrained classification model. The feature extraction produces many feature maps of varying resolutions, which the backbone passes to the neck. The neck concatenates the feature maps for each image. The architecture then passes the layered feature maps to the head, which predicts bounding boxes and classification scores for each feature set.
Two-stage detectors separate object localization and classification in the head, while single-stage detectors combine these tasks. The former generally return higher localization accuracy while the latter perform more quickly.5
Intersection over union (IoU) is a common evaluation metric used in object detection models. A bounding box is the rectangular output demarcating a detected object as predicted by the model. IoU calculates the ratio of two bounding boxes’ area of intersection (that is, the area of the boxes’ overlapping section) over their area of union (that is, the total area covered by the two boxes, with the overlap counted once):6
$$\mathrm{IoU} = \frac{\text{area of intersection}}{\text{area of union}}$$
Models use IoU to measure prediction accuracy by calculating the IoU between a predicted box and ground truth box. Model architectures also use IoU to generate final bounding box predictions. Because models often initially generate several hundred bounding box predictions for a single detected object, models use IoU to weigh and consolidate bounding box predictions into a single box per detected object.
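The ratio of intersection area to union area described above can be computed directly from box coordinates. This is a minimal sketch that assumes axis-aligned boxes given as `(x_min, y_min, x_max, y_max)`; the function names and the example boxes are illustrative, not taken from any particular library.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x_min, y_min, x_max, y_max)


def area(box: Box) -> float:
    """Area of an axis-aligned box; degenerate or inverted boxes count as 0."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes, in the range [0, 1]."""
    # The intersection is itself a box: the overlap of the two coordinate ranges.
    intersection = area((max(a[0], b[0]), max(a[1], b[1]),
                         min(a[2], b[2]), min(a[3], b[3])))
    # Union counts the overlapping region only once.
    union = area(a) + area(b) - intersection
    return intersection / union if union > 0 else 0.0


predicted = (1.0, 1.0, 3.0, 3.0)
ground_truth = (0.0, 0.0, 2.0, 2.0)
score = iou(predicted, ground_truth)  # intersection 1, union 7
```

Here the two 2 x 2 boxes share a 1 x 1 square, so the score is 1/7, or roughly 0.14.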
Other metrics may be used for different evaluations of object detection models. Generalized intersection over union (GIoU) is a modified version of IoU that accounts for improvements in object localization for which basic IoU can still return a value of zero, as when a predicted box moves closer to the ground truth box without yet overlapping it.7 Object detection research also employs common information retrieval metrics, such as mean average precision and recall.
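A short sketch can show how GIoU differs from basic IoU. It assumes the common formulation in which IoU is reduced by the fraction of the smallest enclosing box that the union leaves uncovered, and it assumes axis-aligned boxes given as `(x_min, y_min, x_max, y_max)`. Function names and example values are illustrative.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x_min, y_min, x_max, y_max)


def area(box: Box) -> float:
    """Area of an axis-aligned box; degenerate or inverted boxes count as 0."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def giou(a: Box, b: Box) -> float:
    """Generalized IoU: IoU minus the uncovered fraction of the enclosing box."""
    intersection = area((max(a[0], b[0]), max(a[1], b[1]),
                         min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - intersection
    # Smallest axis-aligned box that contains both a and b.
    hull = area((min(a[0], b[0]), min(a[1], b[1]),
                 max(a[2], b[2]), max(a[3], b[3])))
    if union <= 0 or hull <= 0:
        return 0.0
    return intersection / union - (hull - union) / hull


ground_truth = (0.0, 0.0, 1.0, 1.0)
near = giou(ground_truth, (2.0, 0.0, 3.0, 1.0))  # disjoint, one unit away
far = giou(ground_truth, (4.0, 0.0, 5.0, 1.0))   # disjoint, three units away
```

Both predictions have a basic IoU of zero, yet `near` (about -0.33) scores higher than `far` (-0.6), so the metric still rewards the closer box.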
There are a number of machine learning approaches to object detection tasks. Examples include the Viola-Jones framework8 and the histogram of oriented gradients.9 Recent object detection research and development, however, has focused largely on convolutional neural networks (CNNs). As such, this page focuses on two types of CNNs most discussed in object detection research. Note that these models are tested and compared using benchmark datasets, such as the Microsoft COCO dataset or ImageNet.
R-CNN (region-based convolutional neural network) is a two-stage detector that uses a method called region proposals to generate roughly 2,000 candidate regions per image. R-CNN then warps the extracted regions to a uniform size and runs those regions through separate networks for feature extraction and classification. Each region is ranked according to the confidence of its classification. R-CNN then rejects regions that have a certain IoU overlap with a higher scoring selected region. The remaining non-overlapping and top-ranking classified regions are the model’s output.10 As expected, this architecture is computationally expensive and slow. Fast R-CNN and Faster R-CNN are later modifications that reduce the size of the R-CNN’s architecture and thereby decrease processing time while also increasing accuracy.11
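The rejection step described above, in which a region is dropped when it overlaps too much with a higher scoring selected region, is commonly called non-maximum suppression. The following is a minimal greedy sketch, not R-CNN's actual implementation; the threshold of 0.5, the box format `(x_min, y_min, x_max, y_max)`, and all names are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x_min, y_min, x_max, y_max)


def area(box: Box) -> float:
    """Area of an axis-aligned box; degenerate or inverted boxes count as 0."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    intersection = area((max(a[0], b[0]), max(a[1], b[1]),
                         min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - intersection
    return intersection / union if union > 0 else 0.0


def non_max_suppression(boxes: Sequence[Box], scores: Sequence[float],
                        iou_threshold: float = 0.5) -> List[int]:
    """Return indices of kept boxes, highest score first."""
    # Visit candidates from most to least confident.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) < iou_threshold for j in kept):
            kept.append(i)
    return kept


boxes = [(0.0, 0.0, 10.0, 10.0), (1.0, 1.0, 11.0, 11.0), (20.0, 20.0, 30.0, 30.0)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
```

In this example the second box overlaps the first with an IoU of about 0.68, so it is suppressed, and the indices kept are 0 and 2.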
YOLO (You Only Look Once) is a family of single-stage detection architectures built on Darknet, an open-source CNN framework. First developed in 2016, the YOLO architecture prioritizes speed. Indeed, YOLO's speed makes it preferable for real-time object detection and has earned it the common descriptor of state-of-the-art object detector. YOLO differs from R-CNN in several ways. First, while R-CNN passes extracted image regions through multiple networks that separately extract features and classify images, YOLO condenses these actions into a single network. Second, compared to R-CNN’s ~2,000 region proposals, YOLO makes fewer than 100 bounding box predictions per image. In addition to being faster than R-CNN, YOLO also produces fewer background false positives, although it has a higher localization error.12 There have been many updates to YOLO since its inception, generally focusing on speed and accuracy.13
Though originally developed for object detection, later versions of R-CNN and YOLO can also train classification and segmentation models. Specifically, Mask R-CNN combines both object detection and segmentation, while YOLOv5 can train separate classification, detection, and segmentation models.
Of course, there are many other model architectures beyond R-CNN and YOLO. SSD and RetinaNet are two additional models that use a simplified architecture similar to YOLO.14 DETR is another architecture, developed by Facebook (now Meta), that combines a CNN with a transformer model and shows performance comparable to Faster R-CNN.15
In many use cases, object detection is not an end in itself but one stage in a larger computer vision task.
Self-driving cars make wide use of object detection to recognize objects such as cars and pedestrians. One such example is Tesla’s Autopilot AI. Because of their greater speed, simple architectures like YOLO and SimpleNet are often better suited to autonomous driving.16
Object detection can assist in visual inspection tasks. For instance, a substantive body of object detection research investigates metrics and models for identifying physiological indicators of disease in medical images like X-rays and MRI scans. In this area, much research has focused on improving dataset imbalances given the scarcity of such medical images of disease.17
Video surveillance may employ real-time object detection to track crime-associated objects, such as guns or knives in security camera footage. By detecting such objects, security systems can further predict and prevent crime. Researchers have developed gun detection algorithms using both R-CNN and YOLO.18
Imbalanced datasets are one issue plaguing object detection tasks, as negative samples (that is, images without the object of interest) vastly outnumber positive samples in many domain-specific datasets. This is a particular issue with medical images, where positive samples of diseases are difficult to acquire. Recent research utilizes data augmentation to expand and diversify limited datasets for improved model performance.19
Past developments in object detection have largely focused on 2D images. More recently, researchers have turned to object detection applications for 3D images and video. Motion blur and shifting camera focus cause problems in identifying objects across video frames. Researchers have explored a range of methods and architectures to help track objects across frames in spite of such conditions, such as the recurrent neural network architecture long short-term memory (LSTM)20 and transformer-based models.21 Transformers have also been used to speed up object detection models for real-time detection tasks. Parallel processing techniques are a further notable area of study in this endeavor.22
