What is image recognition?

12 November 2024

Authors

Tim Mucci

IBM Writer

Image recognition is an application of machine learning (ML) that enables software and devices to identify objects, places, people, writing and actions in digital images or video.

This technology enables computers to identify product defects, helps medical professionals spot anomalies and is integral to the development of autonomous vehicles.

Image recognition is a core task of computer vision, the broader field of artificial intelligence (AI) technology that enables software and machines to understand and react to visual data.

Engineers use both traditional machine learning and deep learning models for image recognition. Whether these approaches are used independently or combined depends on the specific problem and its resource requirements.

Image recognition with traditional machine learning

Machine learning uses algorithms that classify images based on features manually extracted by human engineers. Engineers preprocess the images and analyze them according to the specific goal or problem that they’re attempting to solve through image recognition.

Maybe it’s identifying faces, detecting objects or classifying textures. In each case, the engineer uses their domain knowledge to preprocess the images and train the algorithm.

Normalization

Engineers prepare images for analysis by normalizing them: scaling pixel values to a standard range, typically 0 to 1 or -1 to 1, so the data is consistent and easier for machine learning models to process.

Preprocessing also includes resizing images, converting them to grayscale to reduce computational complexity or removing noise by using Gaussian filtering techniques. “Noise” in image recognition refers to any unwanted or random variation in pixels, for example, a speckled, grainy, blurry or distorted image.
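A minimal numpy sketch of these preprocessing steps, assuming 8-bit RGB input and the standard luminance weights for grayscale conversion:

```python
import numpy as np

def preprocess(image: np.ndarray) -> np.ndarray:
    """Normalize an 8-bit RGB image to [0, 1] and convert it to grayscale."""
    normalized = image.astype(np.float32) / 255.0  # scale pixels to [0, 1]
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    return normalized @ weights                    # standard luminance weights

# A tiny 2x2 RGB image: white, black, gray and red pixels.
img = np.array([[[255, 255, 255], [0, 0, 0]],
                [[128, 128, 128], [255, 0, 0]]], dtype=np.uint8)
gray = preprocess(img)
print(gray.shape)  # (2, 2) -- one intensity value per pixel
```

In practice, libraries such as OpenCV or Pillow handle resizing and filtering, but the underlying arithmetic is this simple.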

Feature extraction

Next, engineers must select the features that provide the most meaningful information: edges when detecting shapes, for example, or color intensity when distinguishing objects by hue. Because traditional machine learning models rely on manually extracted features, data annotation is used to label the essential information.

By annotating objects of interest within images, engineers help the models recognize and classify specific objects such as “cat” or “dog.” Precisely annotated data allows machine learning algorithms to learn the visual features of each category accurately.

Encoding for machine learning

Engineers extract these features and encode them as numerical vectors, making it easier for machine learning models to process and compare images. Each image is translated into a fixed-length feature vector: a list of numbers summarizing its key visual characteristics.
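As an illustrative encoding, a color histogram turns any image into a fixed-length vector; the bin count and the `color_histogram` helper are assumptions for this sketch, not a prescribed feature set:

```python
import numpy as np

def color_histogram(image: np.ndarray, bins: int = 4) -> np.ndarray:
    """Encode an RGB image as a fixed-length feature vector:
    per-channel intensity histograms, concatenated and normalized."""
    features = []
    for channel in range(3):
        hist, _ = np.histogram(image[..., channel], bins=bins, range=(0, 256))
        features.append(hist)
    vec = np.concatenate(features).astype(np.float32)
    return vec / vec.sum()  # normalize so vectors compare across image sizes

img = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)
vec = color_histogram(img)
print(vec.shape)  # (12,) -- 3 channels x 4 bins, regardless of image size
```

Because every image maps to the same vector length, a classifier such as an SVM or k-nearest neighbors can compare them directly.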

Image recognition with deep learning

In contrast, deep learning models can learn directly from the image. Deep learning, a subset of machine learning, uses layered neural networks to accomplish complex image preprocessing and recognition tasks, but at the cost of higher computational and data requirements.

Convolutional neural networks (CNNs) are deep learning architectures whose convolutional layers exploit the grid-like structure of image data.

Input layer

A CNN takes in an image’s raw pixel values and passes them through successive layers of the network to extract patterns and ultimately make predictions about the image.

The network begins with the input layer, which treats the image’s raw pixel values as a grid of numerical intensities and passes them on to subsequent layers for pattern extraction.

Feature extraction

Next, convolutional layers slide small filters, or kernels, over the image to detect local patterns such as edges or textures. Convolution reduces the need for manual feature extraction because the network learns these patterns directly from the data.

After each convolution, an activation function introduces nonlinearity into the model, allowing the network to learn complex patterns, shapes and objects by stacking multiple layers.
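The convolution step can be sketched in plain numpy. This is a minimal illustration with a hand-set Sobel-style edge filter; a real CNN learns its kernel values during training, and, like most deep learning libraries, this code computes cross-correlation rather than strict convolution:

```python
import numpy as np

def conv2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Valid-mode 2D convolution (cross-correlation, as in most CNN libraries)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow), dtype=np.float32)
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

relu = lambda x: np.maximum(x, 0.0)  # nonlinearity applied after the convolution

# A vertical-edge image: dark left half, bright right half.
img = np.zeros((5, 5), dtype=np.float32)
img[:, 3:] = 1.0
# A Sobel-style filter that responds to vertical edges.
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float32)
response = relu(conv2d(img, sobel_x))
print(response)  # the filter fires where the dark-to-bright edge is
```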

Pooling and flattening

Pooling layers downsample the feature maps, reducing their size while retaining important features. This keeps the model computationally efficient and helps it tolerate small variations such as slight rotations or shifts in the image.

After the network extracts features, it flattens the data into a one-dimensional vector and passes it through fully connected layers. These layers integrate the learned patterns from earlier stages to identify complex relationships and refine the classification process.
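A minimal numpy sketch of pooling and flattening, using max pooling with a 2x2 window and stride 2, one common choice:

```python
import numpy as np

def max_pool(feature_map: np.ndarray, size: int = 2) -> np.ndarray:
    """Max pooling with stride equal to the pool size."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size  # drop edge rows/cols that don't fit
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))     # keep the strongest response per block

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 1],
                 [0, 1, 5, 2],
                 [2, 2, 3, 4]], dtype=np.float32)
pooled = max_pool(fmap)
print(pooled)          # [[4. 2.] [2. 5.]]
flat = pooled.ravel()  # flattened vector fed to the fully connected layers
print(flat.shape)      # (4,)
```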

Output layer

Finally, the data reaches the output layer, which consolidates the extracted features and produces a final prediction. This prediction is compared to the annotated training dataset to calculate errors and adjust the network’s weights for improved accuracy.
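The output stage can be sketched as a softmax over raw class scores plus a cross-entropy loss against the annotated label; the class names and logit values here are purely illustrative:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Turn the output layer's raw scores into class probabilities."""
    shifted = logits - logits.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def cross_entropy(probs: np.ndarray, true_class: int) -> float:
    """Error compared against the annotated label; its gradient
    drives the weight updates during training."""
    return float(-np.log(probs[true_class]))

logits = np.array([2.0, 0.5, -1.0])  # raw scores for ["cat", "dog", "bird"]
probs = softmax(logits)
print(probs.argmax())  # 0 -> the model predicts "cat"
print(cross_entropy(probs, 0))  # low loss when the annotated label is "cat"
```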

For example, to train a model to recognize images of cats, engineers might use supervised learning, labeling thousands of images with tags such as “cat” or “not cat” so the model can learn key features such as fur texture, whiskers and ear shape.

Alternatively, in unsupervised learning, the model works with unlabeled data to discover patterns independently. The model identifies relationships without predefined categories by clustering images based on shared characteristics (for example, similar shapes or textures).

This approach is helpful for tasks such as fraud detection, quality control and pattern analysis when labeled data is unavailable. In unsupervised learning, the model would independently cluster images based on shared patterns, grouping all cat images without explicitly knowing they are cats.
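As a toy illustration of unsupervised clustering, the sketch below runs plain k-means over two synthetic groups of 2D feature vectors standing in for image features; the group locations and the `kmeans` helper are assumptions for this example, not a production method:

```python
import numpy as np

def kmeans(X: np.ndarray, k: int, iters: int = 20, seed: int = 0) -> np.ndarray:
    """Cluster feature vectors into k groups with plain k-means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each vector to its nearest center.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # Move each center to the mean of its assigned vectors.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Two well-separated groups stand in for "cat-like" and "dog-like"
# images -- no labels are provided at any point.
rng = np.random.default_rng(1)
group_a = rng.normal(loc=0.0, scale=0.3, size=(20, 2))
group_b = rng.normal(loc=5.0, scale=0.3, size=(20, 2))
labels = kmeans(np.vstack([group_a, group_b]), k=2)
print(labels)  # each group ends up in its own cluster
```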

A third approach, self-supervised learning, starts with unlabeled data but generates pseudo-labels from the data’s inherent structure. This lets models learn meaningful representations without traditional labels, making the approach powerful for tasks with limited labeled datasets.

With self-supervised learning, the model could analyze parts of an image, such as reconstructing a partially obscured cat face, to identify patterns and features. Ultimately, the trained model—whether using machine learning or deep learning—could accurately identify and classify new, unseen images of cats, distinguishing them from other animals or objects.

Challenges in image recognition

While image recognition technologies have advanced, they still face challenges that impact accuracy and reliability. Engineers mitigate these issues by combining improved model architectures, diverse training datasets and preprocessing techniques.

Cluttered or obscured images

Real-world images are rarely as clean as training examples: objects can be partially hidden behind other objects or lost in busy backgrounds. Occlusion and clutter deprive the model of the features it learned to rely on, so engineers train on images with realistic backgrounds and use data augmentation, such as randomly masking parts of training images, to build robustness.

Angle and perspective variations

An object can look very different depending on the camera’s angle and perspective; a cat photographed from above shares few visual features with the same cat seen head-on. Models trained on images captured from varied viewpoints, often supplemented with augmentation such as rotation and flipping, generalize better across these variations.

Lighting conditions

Changes in lighting, such as shadows, brightness variations or low-light environments, can impact the performance of image recognition systems. Bright spots might wash out details, while shadows might obscure critical features, causing the model to misinterpret an object’s shape or color.

Advanced methods such as adaptive histogram equalization or incorporating training data with varying lighting conditions help models perform better under different lighting scenarios.
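The simpler, global form of histogram equalization can be sketched in numpy; adaptive variants such as CLAHE apply the same idea per image tile with contrast limiting:

```python
import numpy as np

def equalize(gray: np.ndarray) -> np.ndarray:
    """Global histogram equalization for an 8-bit grayscale image."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    cdf = hist.cumsum()
    # Map each intensity through the normalized cumulative distribution.
    lut = np.round(255 * cdf / cdf[-1]).astype(np.uint8)
    return lut[gray]

# A low-contrast image: all values squeezed into [100, 120].
img = np.random.default_rng(0).integers(100, 121, size=(64, 64)).astype(np.uint8)
out = equalize(img)
print(img.max() - img.min(), out.max() - out.min())  # contrast range expands
```

Spreading the intensities across the full range makes features visible to the model that a dim or washed-out capture would otherwise hide.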

Limitations in training data

The performance of an image recognition model depends on the diversity and quality of its training data. Models trained on datasets that primarily feature high-resolution, idealized images might struggle when encountering lower-quality or real-world variations.

To mitigate this, engineers curate diverse datasets that represent real-world conditions. Techniques such as transfer learning enable models to use pretrained knowledge from large, robust datasets and improve performance even with limited data.

Object size and proximity

The size of an object in an image, influenced by its proximity to the camera, can affect the model’s ability to identify it accurately. Small objects might not have enough detail for recognition, while overly close objects might appear distorted or too large for the model to classify correctly.

Engineers train models on datasets that include objects of varying sizes and distances to handle this. Multiscale image processing techniques and feature pyramids are also employed to help ensure that the model can handle objects across a wide range of sizes.
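A minimal numpy sketch of multiscale processing: build an image pyramid by repeatedly averaging 2x2 blocks, so a fixed-size detector can be run at each scale. The `downsample` and `pyramid` helpers are illustrative, not a specific library API:

```python
import numpy as np

def downsample(image: np.ndarray) -> np.ndarray:
    """Halve each dimension by averaging 2x2 blocks."""
    h, w = image.shape[0] // 2 * 2, image.shape[1] // 2 * 2
    blocks = image[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return blocks.mean(axis=(1, 3))

def pyramid(image: np.ndarray, levels: int = 3) -> list:
    """The same scene at several scales, so a fixed-size detector
    can match both small and large objects."""
    out = [image.astype(np.float32)]
    for _ in range(levels - 1):
        out.append(downsample(out[-1]))
    return out

for level in pyramid(np.ones((64, 64), dtype=np.float32)):
    print(level.shape)  # (64, 64), (32, 32), (16, 16)
```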

Image recognition and object detection

Object detection extends image recognition by identifying objects and pinpointing their locations within an image. This technique allows the system to answer questions such as, “Where is the cat in this picture?” or “How many cats are in the scene?” Object detection gives more context, as it recognizes objects and their positions, sizes and orientations within the frame.

For example, instead of identifying “a cat” in a picture, object detection allows the computer to specify, “There’s a cat sitting on the sofa in the left corner of the picture,” providing a spatial understanding of the scene and relationships between objects.

Image recognition tasks can also vary in complexity. Image categorization, or image classification, assigns a single label to an entire image based on its content, answering the question, “What is in this image?”

For example, a model trained on labeled datasets of cats and dogs learns to distinguish between the two by identifying their unique features. When presented with a new image, the model analyzes these features to predict whether it shows a cat or a dog.

In object detection, models use bounding boxes to outline individual objects, separating them from the background and marking where each object begins and ends. This precision is critical for applications such as autonomous driving, where accurately detecting vehicles, pedestrians and road signs is essential for safety.
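Detection quality is commonly scored by intersection over union (IoU), which compares a predicted bounding box with the annotated one; a small self-contained sketch, with the box coordinates purely illustrative:

```python
def iou(box_a, box_b) -> float:
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (zero width/height if the boxes don't intersect).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

predicted = (10, 10, 50, 50)     # detector's box around the cat
ground_truth = (12, 12, 52, 52)  # annotated box
print(round(iou(predicted, ground_truth), 3))  # 0.822 -- a close match
```

An IoU of 1.0 means a perfect match; detection benchmarks typically count a prediction as correct above a threshold such as 0.5.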

Evolving uses of image recognition

Image recognition is advancing rapidly, paving the way for more sophisticated applications across numerous industries and use cases. Here are some dominant real-world applications of image recognition:

Autonomous vehicles

Autonomous vehicles rely on image recognition to interpret their surroundings in real time, detecting lanes, road signs, pedestrians and other vehicles so the driving system can react safely.

Facial recognition

Facial recognition also lets users unlock smartphones by looking at the screen. In security and surveillance, it identifies individuals from video feeds: law enforcement agencies use it to track suspects in public areas, while companies use it in building security to control access.

Social media management and moderation

Social media platforms use image recognition to suggest tags in photos by recognizing the faces of friends and family. They also power AR filters that detect facial landmarks to position virtual elements, such as glasses or animal ears, so they align with facial movements.

In addition, these platforms use image recognition to moderate content by filtering inappropriate images, maintaining platform safety and enhancing user experience.

Smart glasses and real-time information

Building on current augmented reality (AR) applications in mobile devices, smart glasses equipped with image recognition software can offer users augmented views of their surroundings, overlaying real-time information about objects and locations.

AR technology provides contextual data on anything the user looks at, from identifying landmarks to retrieving product details in stores.

Home appliances

Image recognition in home appliances enables features such as inventory tracking in smart refrigerators, obstacle detection in robotic vacuums and human or object recognition in security cameras.

It also powers functions such as fabric type detection in washing machines, food recognition in smart ovens and facial analysis in smart mirrors or baby monitors.

Robotics

Delivery robots rely on image recognition to navigate environments, detect obstacles and identify delivery locations for accurate and efficient autonomous deliveries.

In warehouses and industrial settings, robots use the same technology to scan and retrieve items, perform quality checks, assemble parts and sort materials.

Medical imaging

Medical image analysis assists healthcare professionals in analyzing X-rays, MRIs and CT scans. These systems can detect anomalies that the human eye might miss, such as early signs of lung cancer, brain strokes or tumors, leading to more timely diagnoses.

Merative, formerly IBM Watson® Health, applies image recognition to analyze complex imaging data, supporting radiologists in identifying critical findings.

Medical image recognition is advancing with AI-powered diagnostics, enabling systems to assist in detecting early-stage diseases with greater accuracy.

Already enhancing areas such as tumor detection, the technology supports specialists with a highly trained “second set of eyes” for advanced diagnostics, particularly in areas where minute details are critical.

Optical character recognition (OCR)

OCR technology digitizes printed text by scanning documents, books and receipts. Apps use OCR to recognize and convert printed text into digital formats that users can edit or search. OCR was a critical early use case for image recognition, which helped pave the way for widespread digitization in every industry.

Document processing

Banks and financial institutions use image recognition to automate the verification of checks, IDs and other documents, reducing fraud and streamlining customer onboarding. The technology scans document images for key details, authenticates them and flags anomalies for review.
