What is computer vision?

Computer vision is a field of artificial intelligence (AI) that enables computers and systems to derive meaningful information from digital images, videos and other visual inputs — and take actions or make recommendations based on that information. If AI enables computers to think, computer vision enables them to see, observe and understand.

Computer vision works much the same as human vision, except humans have a head start. Human sight has the advantage of lifetimes of context to learn how to tell objects apart, how far away they are, whether they are moving and much more. Computer vision trains machines to perform these functions, but it must do so in much less time, with cameras, data and algorithms rather than retinas, optic nerves and a visual cortex.

Computer vision is used in industries ranging from agriculture to automotive, and the market is growing. It is expected to reach USD 48.6 billion by 2022.(1)

Augmenting human perception

See how a 3D computer-vision-driven task completion prototype interacts with a visually impaired user to guide them in solving a jigsaw puzzle in a natural and intuitive way.


How computer vision works

Computer vision needs lots of data. It runs analyses of that data over and over until it can discern distinctions and ultimately recognize images. For example, to train a computer to recognize apples, it needs to be fed vast quantities of images of apples and apple-like objects so it can learn the differences and reliably recognize an apple.

Two essential technologies are used to accomplish this: a type of machine learning called deep learning and a convolutional neural network (CNN).

Machine learning uses algorithmic models that enable a computer to teach itself about the context of visual data. If enough data is fed through the model, the computer will “look” at the data and teach itself to tell one image from another. Algorithms enable the machine to learn by itself, rather than someone programming it to recognize an image.

A CNN helps a machine learning or deep learning model “look” by breaking images down into pixels that are given tags or labels. It uses the labels to perform convolutions (a mathematical operation on two functions that produces a third function) and makes predictions about what it is “seeing.” The neural network runs convolutions and checks the accuracy of its predictions over a series of iterations until the predictions begin to match the labels. It is then recognizing or seeing images in a way similar to humans.

Much like a human making out an image at a distance, a CNN first discerns hard edges and simple shapes, then fills in information as it runs iterations of its predictions. A CNN is used to understand single images. A recurrent neural network (RNN) is used in a similar way for video applications to help computers understand how pictures in a series of frames are related to one another.
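The convolution operation described above can be illustrated with a short, self-contained sketch. The function and the edge-detection kernel below are illustrative choices, not taken from any particular CNN library: a small kernel slides across the image, and the sum of element-wise products at each position measures how strongly that patch matches the pattern the kernel encodes (here, a vertical edge, the kind of hard edge a CNN picks up first).

```python
def convolve2d(image, kernel):
    """Slide a kernel over a 2D image ("valid" mode) and sum element-wise products."""
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            total = 0
            for ki in range(kh):
                for kj in range(kw):
                    total += image[i + ki][j + kj] * kernel[ki][kj]
            row.append(total)
        out.append(row)
    return out

# A vertical-edge kernel: responds strongly where brightness changes left to right.
edge_kernel = [[1, 0, -1],
               [1, 0, -1],
               [1, 0, -1]]

# A tiny "image": dark left half (0), bright right half (9).
image = [[0, 0, 9, 9],
         [0, 0, 9, 9],
         [0, 0, 9, 9],
         [0, 0, 9, 9]]

# Large-magnitude outputs mark positions where the kernel straddles the edge.
print(convolve2d(image, edge_kernel))
```

In a real CNN the kernel values are not hand-set like this; they are learned from data during training, which is what lets the network discover for itself which patterns (edges, textures, object parts) matter.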

Uncovering ancient geoglyphs with AI

See how researchers are using AI and computer vision to analyze drone and satellite images to identify geoglyphs in southern Peru.


The evolution of computer vision

Scientists and engineers have been trying to develop ways for machines to see and understand visual data for about 60 years. Experimentation began in 1959 when neurophysiologists showed a cat an array of images, attempting to correlate responses in its brain. They discovered that it responded first to hard edges or lines, which implied that image processing starts with simple shapes such as straight edges.(2)

At about the same time, the first computer image scanning technology was developed, enabling computers to digitize and acquire images. Another milestone was reached in 1963 when computers were able to transform two-dimensional images into three-dimensional forms. In the 1960s, AI emerged as an academic field of study, and it also marked the beginning of the AI quest to solve the human vision problem.

1974 saw the introduction of optical character recognition (OCR) technology, which could recognize text printed in any font or typeface.(3) Similarly, intelligent character recognition (ICR) could decipher handwritten text using neural networks.(4) Since then, OCR and ICR have found their way into document and invoice processing, vehicle plate recognition, mobile payments, machine translation and other common applications.

In 1982, neuroscientist David Marr established that vision works hierarchically and introduced algorithms for machines to detect edges, corners, curves and similar basic shapes. Concurrently, computer scientist Kunihiko Fukushima developed a network of cells that could recognize patterns. The network, called the Neocognitron, included convolutional layers in a neural network.

By 2000, the focus of study was on object recognition, and by 2001, the first real-time face recognition applications appeared. Standards for how visual data sets are tagged and annotated emerged through the 2000s. In 2010, the ImageNet data set became available. It contains millions of tagged images across a thousand object classes and provides a foundation for the CNNs and deep learning models used today. In 2012, a team from the University of Toronto entered a CNN into an image recognition contest. The model, called AlexNet, significantly reduced the error rate for image recognition. Since this breakthrough, error rates have fallen to just a few percent.(5)

Look deeper

Computer vision and multimedia at IBM AI Research

Access videos, papers, workshops and more.

IBM Watson and computer vision on Medium

Get more resources on computer vision from the Medium publication platform.

Computer vision and augmented reality at IBM Research

Gain insights into technology and solutions for object recognition and augmented reality.

Why is computer vision important?

There is a lot of research being done in the computer vision field, but it’s not just research. Real-world applications demonstrate how important computer vision is to endeavors in business, entertainment, transportation, healthcare and everyday life. A key driver for the growth of these applications is the flood of visual information flowing from smartphones, security systems, traffic cameras and other visually instrumented devices. The information creates a test bed to train computer vision applications and a launchpad for them to become part of a range of human activities:

  • IBM used computer vision to create My Moments for the 2018 Masters golf tournament. IBM Watson watched hundreds of hours of Masters footage and could identify the sights (and sounds) of significant shots. It curated these key moments and delivered them to fans as personalized highlight reels.
  • Google Translate lets users point a smartphone camera at a sign in another language and almost immediately obtain a translation of the sign in their preferred language.(6)
  • The development of self-driving vehicles relies on computer vision to make sense of the visual input from a car’s cameras and other sensors. It’s essential to identify other cars, traffic signs, lane markers, pedestrians, bicycles and all of the other visual information encountered on the road.
  • IBM is applying computer vision technology along with a fluorescent dye and an infrared camera to help surgeons spot cancer using photons. And it’s applying computer vision to help diagnose skin cancer.
  • Computer vision is being used to help ensure facial analysis technology is built and trained responsibly.

Identifying cancer with AI

Read how AI and computer vision technology are helping surgeons spot cancer using photons.


Getting started

Many organizations don’t have the resources to fund computer vision labs and create deep learning models and neural networks. They may also lack the computing power required to process huge sets of visual data. Companies such as IBM are helping by offering computer vision software development services. These services deliver pre-built learning models available from the cloud — and also ease demand on computing resources. Users connect to the services through an application programming interface (API) and use them to develop computer vision applications.

IBM has also introduced a computer vision platform that addresses both developmental and computing resource concerns. IBM PowerAI Vision includes tools that enable subject matter experts to label, train and deploy deep learning vision models — without coding or deep learning expertise. The tools use IBM Power Systems servers to offer an AI development platform with the required performance characteristics. The vision models can be deployed in local data centers, the cloud and edge devices.

While it’s getting easier to obtain resources to develop computer vision applications, an important question to answer early on is: What exactly will these applications do? Understanding and defining specific computer vision tasks can focus and validate projects and applications and make it easier to get started.

Here are a few examples of established computer vision tasks:

  • Image classification sees an image and can classify it (a dog, an apple, a person’s face). More precisely, it is able to accurately predict that a given image belongs to a certain class. For example, a social media company might want to use it to automatically identify and segregate objectionable images uploaded by users.
  • Object detection can use image classification to identify a certain class of object and then detect and tabulate its appearances in an image or video. Examples include detecting damage on an assembly line or identifying machinery that requires maintenance.
  • Object tracking follows or tracks an object once it is detected. This task is often executed with images captured in sequence or real-time video feeds. Autonomous vehicles, for example, need to not only classify and detect objects such as pedestrians, other cars and road infrastructure, they need to track them in motion to avoid collisions and obey traffic laws.(7)
  • Content-based image retrieval uses computer vision to browse, search and retrieve images from large data stores, based on the content of the images rather than metadata tags associated with them. This task can incorporate automatic image annotation that replaces manual image tagging. These tasks can be used for digital asset management systems and can increase the accuracy of search and retrieval.
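As a minimal illustration of the content-based retrieval idea, the sketch below ranks stored images against a query by comparing their actual pixel content rather than metadata tags. Everything here is a hypothetical simplification (the tiny hard-coded "database", the bin count, the histogram-intersection score); production systems typically compare learned feature embeddings instead.

```python
def histogram(pixels, bins=4, max_val=256):
    """Normalized intensity histogram of a flat list of grayscale pixel values."""
    counts = [0] * bins
    for px in pixels:
        counts[min(px * bins // max_val, bins - 1)] += 1
    total = len(pixels)
    return [c / total for c in counts]

def similarity(h1, h2):
    """Histogram intersection: 1.0 for identical histograms, 0.0 for disjoint ones."""
    return sum(min(a, b) for a, b in zip(h1, h2))

# Hypothetical "database" of tiny grayscale images, stored as flat pixel lists.
database = {
    "dark":   [10, 20, 15, 30, 25, 5],
    "bright": [200, 220, 240, 210, 230, 250],
    "mixed":  [10, 240, 20, 230, 15, 250],
}

# The query is another dark image; retrieval ranks stored images by similarity.
query_h = histogram([12, 18, 22, 28, 8, 14])
ranked = sorted(database,
                key=lambda name: similarity(query_h, histogram(database[name])),
                reverse=True)
print(ranked[0])  # most similar stored image: "dark"
```

Swapping the histogram for vectors produced by a trained CNN, and the linear scan for an approximate nearest-neighbor index, turns this toy into the shape of a real content-based retrieval system.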

Training computer vision models

See how to easily train highly accurate models to classify and detect objects in images and videos.

Resources

Dialog-based interactive image retrieval

Learn about a natural language-based system for interactive image retrieval that is more expressive than conventional systems.

Using AI to improve tea inspection

See how IBM AI scientists developed an automated tea quality inspection machine, using computer vision and a deep neural network detection model.

IBM Research blog

IBM Research is one of the world’s largest corporate research labs. Learn more about research being done across industries.

Solutions

Watson Visual Recognition

Quickly and accurately tag, classify and search visual content using machine learning.

IBM PowerAI Vision

Train highly accurate models to classify images and detect objects in images and videos — without deep learning expertise.