Computer vision is a field of artificial intelligence (AI) that enables computers and systems to derive meaningful information from digital images, videos and other visual inputs — and take actions or make recommendations based on that information.
If artificial intelligence enables computers to think, computer vision enables them to see, observe and understand.
Computer vision works much the same as human vision, except humans have a head start. Human sight has the advantage of a lifetime of context to learn to tell objects apart, judge how far away they are, detect whether they are moving and much more. Computer vision trains machines to perform these functions, but it has to do so in a much shorter time — using cameras, data and algorithms rather than retinas, optic nerves and a visual cortex.
To catch up, computer vision needs data — lots of data. It runs analyses of the data over and over until it discerns distinctions and ultimately recognizes images. For example, to train a computer to recognize apples, it must be fed vast quantities of apple images and apple-related items so that it learns the differences and can reliably identify an apple.
Two essential technologies are used to accomplish this: a type of machine learning called deep learning and a type of neural network called a convolutional neural network (CNN).
Machine learning uses algorithmic models that enable a computer to teach itself about the context of visual data. If enough data is fed through the model, the computer will “look” at the data and teach itself to tell one image from another. Algorithms enable the machine to learn by itself, rather than a programmer programming it to recognize an image.
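The idea of a model teaching itself from data can be sketched with a toy example. Below, a minimal logistic-regression classifier (a stand-in for the far larger models used in practice; the "bright versus dark" task and every number are illustrative assumptions) learns to separate two classes purely from labeled examples — no rule for what makes an image "bright" is ever programmed in:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "images": 4-pixel vectors. Class 1 is bright on average, class 0 is dark.
X = np.vstack([rng.uniform(0.6, 1.0, (50, 4)),   # bright examples
               rng.uniform(0.0, 0.4, (50, 4))])  # dark examples
y = np.array([1] * 50 + [0] * 50)

# Train by gradient descent: the model adjusts its weights from the data
# alone, which is the "teaching itself" step described above.
w, b = np.zeros(4), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))   # predicted probability of class 1
    grad_w = X.T @ (p - y) / len(y)      # gradient of the log loss
    grad_b = np.mean(p - y)
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b

accuracy = np.mean((p > 0.5) == y)
print(accuracy)
```

Because the two classes here are cleanly separated, the learned weights classify every example correctly; real image data is far messier, which is why vast quantities of it are needed.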
The CNN helps the machine learning or deep learning model “look” by breaking images down into pixels that are given tags or labels. It uses the labels to perform convolutions (a mathematical operation on two functions that produces a third function) and makes predictions about what it is “seeing.” It runs convolutions and checks the accuracy of its predictions over a series of iterations until the predictions become accurate. At that point, it is recognizing or seeing images in a way similar to humans.
Much like a human making out an image at a distance, the CNN first discerns hard edges and simple shapes, then fills in information as it runs iterations of its predictions.
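The convolution step can be sketched directly. In this minimal NumPy example, a hand-built vertical-edge kernel (chosen for illustration; a real CNN learns its kernel values during training) slides over a tiny synthetic image and responds strongly exactly where the brightness changes — the "hard edges" the network picks up first:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image, taking a weighted sum at each position."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for yy in range(out.shape[0]):
        for xx in range(out.shape[1]):
            out[yy, xx] = np.sum(image[yy:yy + kh, xx:xx + kw] * kernel)
    return out

# A tiny 5x5 "image": bright left half, dark right half.
image = np.array([[1, 1, 1, 0, 0]] * 5, dtype=float)

# A vertical-edge kernel: positive on the left, negative on the right,
# so it responds where brightness changes from left to right.
kernel = np.array([[1, 0, -1]] * 3, dtype=float)

response = convolve2d(image, kernel)
print(response)  # zero on flat regions, strong response at the edge
```

Stacking many such learned kernels, layer after layer, is what lets a CNN progress from edges and simple shapes to whole objects.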
A CNN is used to understand single images. A recurrent neural network (RNN) is used in a similar way for video applications, helping computers understand how the pictures in a series of frames relate to one another.
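The frame-to-frame memory of an RNN can be sketched as a single recurrent layer. Here the per-frame feature vectors and weight matrices are random placeholders (in a real system a CNN would supply the features and training would set the weights); the point is that the hidden state carries information from one frame to the next, so each step depends on everything seen so far:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-frame feature vectors: 10 frames, 8 features each.
frames = rng.normal(size=(10, 8))

W_x = rng.normal(scale=0.1, size=(8, 16))   # input-to-hidden weights
W_h = rng.normal(scale=0.1, size=(16, 16))  # hidden-to-hidden weights
h = np.zeros(16)                            # hidden state: the "memory"

for x in frames:
    # Each update mixes the current frame with the state built from
    # all previous frames — this is what relates frames to one another.
    h = np.tanh(x @ W_x + h @ W_h)

print(h.shape)  # the final state summarizes the whole sequence
```

A CNN applied frame by frame would treat each picture independently; the recurrent state is precisely what a plain CNN lacks for video.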
Two areas that are related to but different from computer vision are image processing and image analysis. These fields focus more on enhancing the clarity or other aspects of an image for human review or manipulation, rather than recognizing and understanding the image itself.(1)
Another term sometimes incorrectly associated with computer vision is computer vision syndrome, an eye-strain condition resulting from prolonged focusing on a computer screen.
Computer vision is used today in industries ranging from agriculture to automotive — and the market is growing. It’s expected to reach USD 48.6 billion by 2022, according to Forbes.