What is object detection?

Published: 3 January 2024
Contributors: Jacob Murel Ph.D., Eda Kavlakoglu

Object detection is a technique that uses neural networks to localize and classify objects in images. This computer vision task has a wide range of applications, from medical imaging to self-driving cars.

Object detection is a computer vision task that aims to locate objects in digital images. As such, it is an instance of artificial intelligence that consists of training computers to see as humans do, specifically by recognizing and classifying objects according to semantic categories.1 Object localization is a technique for determining the location of specific objects in an image by demarcating each object with a bounding box. Object classification is another technique that determines to which category a detected object belongs. The object detection task combines the subtasks of object localization and classification to simultaneously estimate the location and type of object instances in one or more images.2

Computer vision tasks

Object detection overlaps with other computer vision techniques, but developers nevertheless treat it as a discrete endeavor.

Image classification (or image recognition) aims to classify images according to defined categories. A rudimentary example is the CAPTCHA image test, in which a group of images may be sorted into images with stop signs and images without. Image classification assigns one label to a whole image.

Object detection, by comparison, delineates individual objects in an image according to specified categories. While image classification divides images among those that have stop signs and those that do not, object detection locates and categorizes all of the road signs in an image, as well as other objects such as cars and people.

Image segmentation (or semantic segmentation) is similar to object detection, albeit more precise. Like object detection, segmentation delineates objects in an image according to semantic categories. But rather than mark objects using boxes, segmentation demarcates objects at the pixel level.
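
One way to see the difference between these three tasks is to compare the shape of what each one returns. The sketch below uses hypothetical Python types (ClassificationResult, Detection, and SegmentationMask are illustrative names, not taken from any library) and made-up values for a street scene.

```python
from dataclasses import dataclass
from typing import List, Tuple


# Hypothetical result types, for illustration only.
@dataclass
class ClassificationResult:
    label: str    # one label for the whole image
    score: float


@dataclass
class Detection:
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels
    label: str
    score: float


# Segmentation: one class index per pixel, stored row by row.
SegmentationMask = List[List[int]]

# Image classification: a single answer per image.
classification = ClassificationResult(label="contains stop sign", score=0.97)

# Object detection: one box, label, and score per object instance.
detections = [
    Detection(box=(12.0, 30.0, 58.0, 76.0), label="stop sign", score=0.94),
    Detection(box=(80.0, 40.0, 200.0, 120.0), label="car", score=0.88),
]

# Image segmentation: a tiny 3x4 mask where 0 = background and 1 = stop sign.
mask: SegmentationMask = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]


def pixels_of_class(m: SegmentationMask, cls: int) -> int:
    """Count how many pixels in the mask carry the given class index."""
    return sum(row.count(cls) for row in m)
```

Classification yields one record, detection yields a list of boxes, and segmentation yields a label for every pixel, which is why it is described as the more precise of the three.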

How object detection works

Understanding the inner workings of object detection requires a foundation in computer vision and digital image processing more broadly. This section provides a general overview.

Image processing

In computer vision, images are expressed as continuous functions on a 2D coordinate plane represented as f(x,y). When digitized, images undergo two primary processes called sampling and quantization, which, in short, together convert the continuous image function into a discrete grid structure of pixel elements. The computer can then segment an image into discrete regions according to visual similarity and proximity of pixels.3
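
Sampling and quantization can be sketched in a few lines. The example below is a minimal illustration, not a real imaging pipeline: continuous_image is a made-up function standing in for f(x,y), and the grid size and the 256 gray levels are assumptions chosen for readability.

```python
import math


def continuous_image(x: float, y: float) -> float:
    """A made-up continuous image f(x, y) with values in [0, 1]."""
    return 0.5 + 0.5 * math.sin(2 * math.pi * x) * math.cos(2 * math.pi * y)


def sample(f, width: int, height: int):
    """Sampling: evaluate f at the centre of each cell of a grid over the unit square."""
    return [
        [f((col + 0.5) / width, (row + 0.5) / height) for col in range(width)]
        for row in range(height)
    ]


def quantize(samples, levels: int = 256):
    """Quantization: map each sample in [0, 1] to one of `levels` integer gray values."""
    top = levels - 1
    return [[min(top, max(0, round(v * top))) for v in row] for row in samples]


# A 4x3 grid of 8-bit gray values: the discrete pixel structure described above.
pixels = quantize(sample(continuous_image, width=4, height=3))
```

Sampling fixes where the image is measured (the grid), while quantization fixes how finely each measurement is recorded (the number of gray levels).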

By labeling images using an annotation interface, users define a specific object as a region of specific pixel-level features (for example, area, gray-value, and so on). When given an input image, the object detection model recognizes regions with similar features to those defined in the training dataset as the same object. In this way, object detection is a form of pattern recognition. Object detection models do not recognize objects per se, but rather aggregates of properties such as size, shape, color, and so on, and classify regions according to visual patterns inferred from manually annotated training data.4

An object detection model for a self-driving car, for example, does not recognize pedestrians but a set of features that form the general pattern characterizing pedestrian objects (as defined in the training data).

Model architecture

While different model families use different architectures, deep learning models for object detection follow a general structure. They consist of a backbone, neck, and head.

The backbone extracts features from an input image. Often, the backbone is derived from part of a pretrained classification model. Feature extraction produces a myriad of feature maps of varying resolutions, which the backbone passes to the neck. The neck concatenates the feature maps for each image. The architecture then passes the layered feature maps to the head, which predicts bounding boxes and classification scores for each feature set.
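
The flow of data through backbone, neck, and head can be followed by tracking tensor shapes alone. The sketch below does only that bookkeeping and contains no neural network. The strides, channel counts, and function names are assumptions for illustration, and real necks are usually more elaborate than a plain resize-and-concatenate.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FeatureMap:
    channels: int
    height: int
    width: int


def backbone(image_height: int, image_width: int) -> List[FeatureMap]:
    """Return feature maps at several resolutions (assumed strides and channels)."""
    stages = [(8, 128), (16, 256), (32, 512)]  # (stride, channels), hypothetical
    return [FeatureMap(c, image_height // s, image_width // s) for s, c in stages]


def neck(maps: List[FeatureMap]) -> FeatureMap:
    """Resize every map to the finest resolution, then concatenate along channels."""
    target = max(maps, key=lambda m: m.height * m.width)
    return FeatureMap(sum(m.channels for m in maps), target.height, target.width)


def head(fused: FeatureMap, num_classes: int, boxes_per_location: int = 1):
    """Predict 4 box coordinates and num_classes scores per box per location."""
    num_boxes = fused.height * fused.width * boxes_per_location
    return {"boxes": (num_boxes, 4), "class_scores": (num_boxes, num_classes)}


maps = backbone(256, 256)           # three maps: 32x32, 16x16, and 8x8
fused = neck(maps)                  # one map with 128 + 256 + 512 channels
out = head(fused, num_classes=3)    # output shapes for boxes and class scores
```

For a 256 x 256 input under these assumptions, the head emits 1,024 candidate boxes, one per location of the fused 32 x 32 map.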

Two-stage detectors separate object localization and classification in the head, while single-stage detectors combine these tasks. The former generally return higher localization accuracy while the latter perform more quickly.5

Evaluation metrics

Intersection over union (IoU) is a common evaluation metric used in object detection models. A bounding box is the rectangular output demarcating a detected object as predicted by the model. IoU calculates the ratio of two bounding boxes’ area of intersection (that is, area of boxes’ overlapping sections) over their area of union (that is, total area of both boxes combined):6

$$\mathrm{IoU}(A, B) = \frac{\mathrm{area}(A \cap B)}{\mathrm{area}(A \cup B)}$$

Here A and B are the two bounding boxes being compared.
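
The ratio described above can be computed directly for axis-aligned boxes. This is a minimal sketch that assumes boxes are given as (x_min, y_min, x_max, y_max) tuples; the function names are illustrative.

```python
def box_area(box):
    """Area of an axis-aligned box; zero if the box is empty or inverted."""
    x_min, y_min, x_max, y_max = box
    return max(0.0, x_max - x_min) * max(0.0, y_max - y_min)


def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    # The intersection is itself a box, possibly empty.
    intersection = box_area((
        max(box_a[0], box_b[0]),
        max(box_a[1], box_b[1]),
        min(box_a[2], box_b[2]),
        min(box_a[3], box_b[3]),
    ))
    # Subtract the intersection once so it is not counted twice in the union.
    union = box_area(box_a) + box_area(box_b) - intersection
    return intersection / union if union > 0 else 0.0


predicted = (0, 0, 2, 2)
ground_truth = (1, 1, 3, 3)
score = iou(predicted, ground_truth)  # intersection 1, union 7
```

Two 2 x 2 boxes that overlap in a 1 x 1 square have an intersection of 1 and a union of 4 + 4 - 1 = 7, so their IoU is 1/7. Identical boxes score 1, and boxes that do not touch score 0.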

Models use IoU to measure prediction accuracy by calculating the IoU between a predicted box and ground truth box. Model architectures also use IoU to generate final bounding box predictions. Because models often initially generate several hundred bounding box predictions for a single detected object, models use IoU to weigh and consolidate bounding box predictions into a single box per detected object.
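
As a sketch of how IoU feeds into accuracy measurement, the example below counts a prediction as correct when it overlaps an unmatched ground truth box by at least a threshold. The 0.5 threshold, the greedy highest-score-first matching, and the function names are assumptions for illustration; benchmarks define their own matching rules.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_matches(predictions, ground_truths, iou_threshold=0.5):
    """predictions: list of (box, score); ground_truths: list of boxes.

    Returns (true_positives, false_positives, false_negatives).
    """
    unmatched = list(range(len(ground_truths)))
    tp = fp = 0
    # Visit predictions from most to least confident.
    for box, _score in sorted(predictions, key=lambda p: p[1], reverse=True):
        best_idx, best_iou = None, 0.0
        for idx in unmatched:
            overlap = iou(box, ground_truths[idx])
            if overlap > best_iou:
                best_idx, best_iou = idx, overlap
        if best_idx is not None and best_iou >= iou_threshold:
            tp += 1
            unmatched.remove(best_idx)  # each ground truth box is matched once
        else:
            fp += 1
    return tp, fp, len(unmatched)


truths = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [((1, 1, 10, 10), 0.9), ((0, 0, 9, 9), 0.8), ((50, 50, 60, 60), 0.7)]
tp, fp, fn = count_matches(preds, truths)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
```

In this example the first prediction matches the first ground truth box, the second is a duplicate of an already matched object, and the third hits nothing, giving one true positive, two false positives, and one missed object.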

Other metrics may be used for different evaluations of object detection models. Generalized intersection over union (GIoU) is a modified version of IoU that accounts for improvements in object localization for which basic IoU still returns zero, as happens whenever the predicted and ground truth boxes do not overlap at all.7 Object detection research also employs common information retrieval metrics, such as mean average precision and recall.
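
A short sketch shows what GIoU adds. It follows the definition in the GIoU paper cited above: IoU minus the fraction of the smallest enclosing box that is not covered by the union. The box format (x_min, y_min, x_max, y_max) and the function names are assumptions of this example.

```python
def area(box):
    """Area of an axis-aligned box; zero if the box is empty or inverted."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def giou(a, b):
    """Generalized IoU of two (x_min, y_min, x_max, y_max) boxes."""
    inter = area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    # Smallest axis-aligned box that encloses both inputs.
    enclosing = area((min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3])))
    if union <= 0 or enclosing <= 0:
        return 0.0  # degenerate input; a choice made for this sketch
    return inter / union - (enclosing - union) / enclosing


truth = (0, 0, 1, 1)
near_miss = (2, 0, 3, 1)   # no overlap, one unit away
far_miss = (9, 0, 10, 1)   # no overlap, eight units away

near_score = giou(truth, near_miss)
far_score = giou(truth, far_miss)
```

Plain IoU is 0 for both misses, so it cannot tell them apart. GIoU gives -1/3 for the near miss and -0.8 for the far miss, so a prediction that moves closer to the target registers as an improvement even before the boxes overlap.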

Object detection algorithms and architectures

There are a number of machine learning approaches to object detection tasks. Examples include the Viola-Jones framework8 and the histogram of oriented gradients.9 Recent object detection research and development, however, has focused largely on convolutional neural networks (CNNs). As such, this page focuses on two types of CNNs most discussed in object detection research. Note that these models are tested and compared using benchmark datasets, such as the Microsoft COCO dataset or ImageNet.

R-CNN (region-based convolutional neural network) is a two-stage detector that uses a method called region proposals to generate roughly 2,000 candidate regions per image. R-CNN then warps the extracted regions to a uniform size and runs those regions through separate networks for feature extraction and classification. Each region is ranked according to the confidence of its classification. R-CNN then rejects any region whose IoU overlap with a higher-scoring selected region exceeds a set threshold. The remaining top-ranking classified regions are the model’s output.10 Because every proposed region is processed separately, this architecture is computationally expensive and slow. Fast R-CNN and Faster R-CNN are later modifications that reduce the size of the R-CNN’s architecture and thereby decrease processing time while also increasing accuracy.11
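
The rank-then-reject step is commonly called greedy non-maximum suppression. The sketch below illustrates that step only, not R-CNN itself; the 0.5 threshold and the function names are assumptions of this example.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """detections: list of (box, score). Returns the kept detections, best first."""
    kept = []
    # Rank regions by classification confidence, highest first.
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        # Keep a region only if it does not overlap too much with
        # any higher-scoring region that was already selected.
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


regions = [
    ((0, 0, 10, 10), 0.9),
    ((1, 1, 11, 11), 0.8),    # IoU of about 0.68 with the first region
    ((20, 20, 30, 30), 0.7),
]
selected = non_max_suppression(regions)
```

Here the second region is rejected because it overlaps the higher-scoring first region too heavily, while the third region survives because it covers a different part of the image.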

YOLO (You Only Look Once) is a family of single-stage detection architectures built on Darknet, an open-source CNN framework. First developed in 2016, the YOLO architecture prioritizes speed. Indeed, YOLO's speed makes it preferable for real-time object detection and has earned it the common descriptor of state-of-the-art object detector. YOLO differs from R-CNN in several ways. First, while R-CNN passes extracted image regions through multiple networks that separately extract features and classify regions, YOLO condenses these actions into a single network. Second, compared to R-CNN’s ~2,000 region proposals, YOLO makes fewer than 100 bounding box predictions per image. In addition to being faster than R-CNN, YOLO also produces fewer background false positives, although it has a higher localization error.12 There have been many updates to YOLO since its inception, generally focusing on speed and accuracy.13
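
The small number of predictions follows from how the original YOLO model lays out its output. The sketch below assumes the configuration reported in the original YOLO paper (a 7 x 7 grid, 2 boxes per grid cell, and the 20 PASCAL VOC classes); later versions use different layouts, and the function names are illustrative.

```python
GRID_SIZE = 7        # S: the image is divided into an S x S grid
BOXES_PER_CELL = 2   # B: each grid cell predicts B boxes
NUM_CLASSES = 20     # C: class probabilities are predicted once per cell


def total_boxes(s: int, b: int) -> int:
    """Number of bounding boxes predicted for one image."""
    return s * s * b


def output_shape(s: int, b: int, c: int):
    """Each box carries x, y, w, h, and a confidence (5 values); classes are per cell."""
    return (s, s, b * 5 + c)


def responsible_cell(cx: float, cy: float, image_w: float, image_h: float, s: int):
    """Grid cell (row, col) containing an object's centre point."""
    col = min(s - 1, int(cx / image_w * s))
    row = min(s - 1, int(cy / image_h * s))
    return row, col


boxes = total_boxes(GRID_SIZE, BOXES_PER_CELL)
shape = output_shape(GRID_SIZE, BOXES_PER_CELL, NUM_CLASSES)
cell = responsible_cell(224, 224, 448, 448, GRID_SIZE)
```

Under these assumptions the network emits 7 x 7 x 2 = 98 boxes in a single 7 x 7 x 30 output, compared with the thousands of separately processed regions in R-CNN.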

Though originally developed for object detection, later versions of R-CNN and YOLO can also train classification and segmentation models. Specifically, Mask R-CNN combines both object detection and segmentation, while YOLOv5 can train separate classification, detection, and segmentation models.

Of course, there are many other model architectures beyond R-CNN and YOLO. SSD and RetinaNet are two additional models that use a simplified architecture similar to YOLO.14 DETR is another architecture, developed by Facebook (now Meta), that combines a CNN with a transformer model and shows performance comparable to Faster R-CNN.15

Example use cases

In many use cases, object detection is not an end in itself but one stage in a larger computer vision task.

Autonomous driving

Self-driving cars widely use object detection to recognize objects such as cars and pedestrians. One such example is Tesla’s Autopilot AI. Because of their speed, simple architectures like YOLO and SimpleNet are well suited to autonomous driving.16

Medical imaging

Object detection can assist in visual inspection tasks. For instance, a substantial body of object detection research investigates metrics and models for identifying physiological indicators of disease in medical images like X-rays and MRI scans. In this area, much research has focused on mitigating dataset imbalances, given the scarcity of medical images that show disease.17


Security

Video surveillance may employ real-time object detection to track crime-associated objects, such as guns or knives in security camera footage. By detecting such objects, security systems can further predict and prevent crime. Researchers have developed gun detection algorithms using both R-CNN and YOLO.18

Recent research

Imbalanced datasets are one issue plaguing object detection tasks, as negative samples (that is, images without the object of interest) vastly outnumber positive samples in many domain-specific datasets. This is a particular issue with medical images, where positive samples of diseases are difficult to acquire. Recent research utilizes data augmentation to expand and diversify limited datasets for improved model performance.19
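
As a minimal illustration of data augmentation for detection, the sketch below mirrors an image horizontally and moves its bounding boxes to match. It is far simpler than the methods in the cited research. It assumes a non-empty image stored as a list of pixel rows and boxes given as (x_min, y_min, x_max, y_max) with x measured from the left edge; the function name is illustrative.

```python
def horizontal_flip(image, boxes):
    """Mirror an image left to right and adjust its bounding boxes.

    image: list of rows, each a list of pixel values (assumed non-empty).
    boxes: list of (x_min, y_min, x_max, y_max) in pixel coordinates.
    """
    width = len(image[0])
    flipped_image = [list(reversed(row)) for row in image]
    # A box spanning [x_min, x_max] now spans [width - x_max, width - x_min].
    flipped_boxes = [
        (width - x_max, y_min, width - x_min, y_max)
        for x_min, y_min, x_max, y_max in boxes
    ]
    return flipped_image, flipped_boxes


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
]
boxes = [(0, 0, 1, 2)]  # covers the leftmost pixel column

new_image, new_boxes = horizontal_flip(image, boxes)
```

The flipped copy is a new labeled training sample obtained without any extra annotation work: the box that covered the leftmost column now covers the rightmost one.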

Past developments in object detection have largely focused on 2D images. More recently, researchers have turned to object detection applications for 3D images and video. Motion blur and shifting camera focus cause problems in identifying objects across video frames. Researchers have explored a range of methods and architectures to help track objects across frames in spite of such conditions, such as the recurrent neural network architecture long short-term memory (LSTM)20 and transformer-based models.21 Transformers have also been used to speed up object detection models for real-time detection tasks. Parallel processing techniques are a further notable area of study in this endeavor.22

1 Bogusław Cyganek, Object Detection and Recognition in Digital Images: Theory and Practice, Wiley, 2013.

2 Kemal Oksuz, Baris Can Cam, Sinan Kalkan, and Emre Akbas, "Imbalance Problems in Object Detection: A Review," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 43, No. 10, 2021, pp. 3388-3415, https://ieeexplore.ieee.org/document/9042296.

3 Arcangelo Distante and Cosimo Distante, Handbook of Image Processing and Computer Vision, Vol. 1, Springer, 2020. Milan Sonka, Vaclav Hlavac, and Roger Boyle, Image Processing, Analysis, and Machine Vision, 4th edition, Cengage, 2015.

4 Arcangelo Distante and Cosimo Distante, Handbook of Image Processing and Computer Vision, Vol. 3, Springer, 2020. Milan Sonka, Vaclav Hlavac, and Roger Boyle, Image Processing, Analysis, and Machine Vision, 4th edition, Cengage, 2015.

5 Benjamin Planche and Eliot Andres, Hands-On Computer Vision with TensorFlow 2, Packt Publishing, 2019. Van Vung Pham and Tommy Dang, Hands-On Computer Vision with Detectron2, Packt Publishing, 2023. Licheng Jiao, Fan Zhang, Fang Liu, Shuyuan Yang, Lingling Li, Zhixi Feng, and Rong Qu, "A survey of deep learning-based object detection," IEEE Access, Vol. 7, 2019, pp. 128837-128868, https://ieeexplore.ieee.org/document/8825470. Richard Szeliski, Computer Vision: Algorithms and Applications, 2nd edition, Springer, 2021.

6 Richard Szeliski, Computer Vision: Algorithms and Applications, 2nd edition, Springer, 2021.

7 Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese, "Generalized intersection over union: A metric and a loss for bounding box regression," Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 658-666.

8 P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2001, https://ieeexplore.ieee.org/document/990517.

9 N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2005, pp. 886-893, https://ieeexplore.ieee.org/document/1467360.

10 Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," Proceedings of the 2014 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2014, https://arxiv.org/abs/1311.2524.

11 Ross Girshick, "Fast R-CNN," Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1440-1448, https://arxiv.org/abs/1504.08083. Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," Advances in Neural Information Processing Systems (NIPS 2015), Vol. 28, https://proceedings.neurips.cc/paper_files/paper/2015/hash/14bfa6bb14875e45bba028a21ed38046-Abstract.html.

12 Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi, "You Only Look Once: Unified, Real-Time Object Detection," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779-788, https://arxiv.org/abs/1506.02640.

13 Joseph Redmon and Ali Farhadi, "YOLOv3: An Incremental Improvement," 2018, https://arxiv.org/abs/1804.02767. Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao, "YOLOv4: Optimal Speed and Accuracy of Object Detection," European Conference on Computer Vision, 2020, https://arxiv.org/abs/2004.10934. Xin Huang, Xinxin Wang, Wenyu Lv, Xiaying Bai, Xiang Long, Kaipeng Deng, Qingqing Dang, Shumin Han, Qiwen Liu, Xiaoguang Hu, Dianhai Yu, Yanjun Ma, and Osamu Yoshie, "PP-YOLOv2: A Practical Object Detector," 2021, https://arxiv.org/abs/2104.10419. Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao, "YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors," 2022, https://arxiv.org/abs/2207.02696.

14 Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg, "SSD: Single Shot MultiBox Detector," Proceedings of the European Conference on Computer Vision (ECCV), 2016, pp. 21-37, https://arxiv.org/abs/1512.02325. Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár, "Focal Loss for Dense Object Detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 42, No. 2, 2020, pp. 318-327, https://arxiv.org/abs/1708.02002.

15 Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko, "End-to-End Object Detection with Transformers," Proceedings of the European Conference on Computer Vision (ECCV), 2020, https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123460205.pdf.

16 Abhishek Balasubramaniam and Sudeep Pasricha, "Object Detection in Autonomous Vehicles: Status and Open Challenges," 2022, https://arxiv.org/abs/2201.07706. Gene Lewis, "Object Detection for Autonomous Vehicles," 2016, https://web.stanford.edu/class/cs231a/prev_projects_2016/object-detection-autonomous.pdf.

17 Trong-Hieu Nguyen-Mau, Tuan-Luc Huynh, Thanh-Danh Le, Hai-Dang Nguyen, and Minh-Triet Tran, "Advanced Augmentation and Ensemble Approaches for Classifying Long-Tailed Multi-Label Chest X-Rays," Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2023, pp. 2729-2738, https://openaccess.thecvf.com/content/ICCV2023W/CVAMD/html/Nguyen-Mau_Advanced_Augmentation_and_Ensemble_Approaches_for_Classifying_Long-Tailed_Multi-Label_Chest_ICCVW_2023_paper.html. Changhyun Kim, Giyeol Kim, Sooyoung Yang, Hyunsu Kim, Sangyool Lee, and Hansu Cho, "Chest X-Ray Feature Pyramid Sum Model with Diseased Area Data Augmentation Method," Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2023, pp. 2757-2766, https://openaccess.thecvf.com/content/ICCV2023W/CVAMD/html/Kim_Chest_X-Ray_Feature_Pyramid_Sum_Model_with_Diseased_Area_Data_ICCVW_2023_paper.html.

18 Palash Yuvraj Ingle and Young-Gab Kim, "Real-Time Abnormal Object Detection for Video Surveillance in Smart Cities," Sensors, Vol. 22, No. 10, 2022, https://www.mdpi.com/1424-8220/22/10/3862.

19 Manisha Saini and Seba Susan, "Tackling class imbalance in computer vision: a contemporary review," Artificial Intelligence Review, Vol. 56, 2023, pp. 1279–1335, https://link.springer.com/article/10.1007/s10462-023-10557-6.

20 Kai Kang, Hongsheng Li, Tong Xiao, Wanli Ouyang, Junjie Yan, Xihui Liu, and Xiaogang Wang, "Object Detection in Videos With Tubelet Proposal Networks," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 727-735, https://openaccess.thecvf.com/content_cvpr_2017/html/Kang_Object_Detection_in_CVPR_2017_paper.html.

21 Sipeng Zheng, Shizhe Chen, and Qin Jin, "VRDFormer: End-to-End Video Visual Relation Detection With Transformers," Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 18836-18846, https://openaccess.thecvf.com/content/CVPR2022/html/Zheng_VRDFormer_End-to-End_Video_Visual_Relation_Detection_With_Transformers_CVPR_2022_paper.html.

22 Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko, "End-to-End Object Detection with Transformers," Proceedings of the European Conference on Computer Vision (ECCV), 2020, pp. 213-229, https://link.springer.com/chapter/10.1007/978-3-030-58452-8_13. Mekhriddin Rakhimov, Jamshid Elov, Utkir Khamdamov, Shavkatjon Aminov, and Shakhzod Javliev, "Parallel Implementation of Real-Time Object Detection using OpenMP," International Conference on Information Science and Communications Technologies (ICISCT), 2021, https://ieeexplore.ieee.org/document/9670146. Yoon-Ki Kim and Yongsung Kim, "DiPLIP: Distributed Parallel Processing Platform for Stream Image Processing Based on Deep Learning Model Inference," Electronics, Vol. 9, No. 10, 2020, https://www.mdpi.com/2079-9292/9/10/1664.