What is computer vision?

Computer vision is a field of artificial intelligence (AI) that enables computers and systems to derive meaningful information from digital images, videos and other visual inputs — and take actions or make recommendations based on that information.

If artificial intelligence enables computers to think, computer vision enables them to see, observe and understand.

Computer vision works much the same as human vision, except humans have a head start. Human sight has the advantage of lifetimes of context to train itself to tell objects apart, how far away they are, whether they are moving and much more. Computer vision trains machines to perform these functions but has to do it in a much shorter period of time — using cameras, data and algorithms rather than retinas, optic nerves and a visual cortex.

To catch up, computer vision needs data — lots of data. It runs analyses of the data over and over until it discerns distinctions and ultimately recognizes images. For example, to train a computer to recognize apples, it must be fed vast quantities of apple images and apple-related items so it can learn the differences and recognize an apple.

Two essential technologies are used to accomplish this: a type of machine learning called deep learning and a type of neural network called a convolutional neural network (CNN).

Machine learning uses algorithmic models that enable a computer to teach itself about the context of visual data. If enough data is fed through the model, the computer will “look” at the data and teach itself to tell one image from another. Algorithms enable the machine to learn by itself, rather than requiring a programmer to write explicit rules for recognizing an image.
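To make that idea concrete, here is a minimal sketch using scikit-learn (our choice of library; the article names none) in which a classifier is never given rules for what each digit looks like. It infers the distinctions from labeled example images alone:

    # A minimal "learning from data" sketch: the model receives only
    # labeled example images and teaches itself to tell digits apart.
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    digits = load_digits()  # 1,797 small (8x8) grayscale digit images
    X_train, X_test, y_train, y_test = train_test_split(
        digits.data, digits.target, test_size=0.25, random_state=0)

    model = LogisticRegression(max_iter=2000)  # no hand-coded rules
    model.fit(X_train, y_train)                # learns from examples only
    print(f"accuracy on unseen images: {model.score(X_test, y_test):.2f}")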

The CNN helps the machine learning or deep learning model “look” by breaking images down into pixels that are given tags or labels. It uses the labels to perform convolutions (a mathematical operation on two functions to produce a third function) and makes predictions about what it is “seeing.” The network runs convolutions and checks the accuracy of its predictions over a series of iterations until the predictions begin to match the actual content of the images. At that point, it is recognizing or seeing images in a way similar to humans.
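To show what a convolution does to an image, here is a minimal NumPy sketch (an illustration of the operation, not code from any system named in this article) that slides a 3x3 edge-detecting kernel over a tiny grayscale image. A trained CNN learns kernels like this on its own; here we supply one by hand:

    import numpy as np

    def convolve2d(image, kernel):
        """Slide `kernel` over `image`, summing elementwise products at
        each position. As in deep learning frameworks, the kernel is not
        flipped (strictly speaking, this is cross-correlation)."""
        kh, kw = kernel.shape
        out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
        for y in range(out.shape[0]):
            for x in range(out.shape[1]):
                out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
        return out

    # A tiny "image": dark on the left half, bright on the right half.
    image = np.array([[0, 0, 0, 9, 9, 9]] * 6, dtype=float)

    # A hand-made vertical edge detector.
    kernel = np.array([[-1, 0, 1],
                       [-1, 0, 1],
                       [-1, 0, 1]], dtype=float)

    print(convolve2d(image, kernel))  # large values mark the vertical edge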

Much like a human making out an image at a distance, the CNN first discerns hard edges and simple shapes, then fills in information as it runs iterations of its predictions.

A CNN is used to understand single images. A recurrent neural network (RNN) is used in a similar way for video applications, helping computers understand how the pictures in a series of frames are related to one another.
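A minimal sketch of that pairing, written in PyTorch (a framework we chose for illustration; the article prescribes none): a small CNN turns each frame into a feature vector, and an RNN relates those per-frame vectors across time to produce one prediction per clip.

    import torch
    import torch.nn as nn

    class FrameCNN(nn.Module):
        """Encodes one video frame into a compact feature vector."""
        def __init__(self, feat_dim=32):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(4), nn.Flatten(),
                nn.Linear(8 * 4 * 4, feat_dim))

        def forward(self, x):
            return self.net(x)

    class VideoModel(nn.Module):
        """CNN per frame, then an RNN (a GRU) across the frame sequence."""
        def __init__(self, feat_dim=32, num_classes=5):
            super().__init__()
            self.cnn = FrameCNN(feat_dim)
            self.rnn = nn.GRU(feat_dim, 64, batch_first=True)
            self.head = nn.Linear(64, num_classes)

        def forward(self, clips):                    # (batch, time, 3, H, W)
            b, t, c, h, w = clips.shape
            feats = self.cnn(clips.view(b * t, c, h, w)).view(b, t, -1)
            _, hidden = self.rnn(feats)              # relate frames in time
            return self.head(hidden[-1])             # one prediction per clip

    clips = torch.randn(2, 16, 3, 64, 64)            # 2 fake 16-frame clips
    print(VideoModel()(clips).shape)                 # torch.Size([2, 5])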

Two areas that are related to but different from computer vision are image processing and image analysis. These fields focus more on enhancing the clarity or other aspects of an image for human review or manipulation, rather than recognizing and understanding the image itself.(1)

Another term sometimes incorrectly associated with computer vision is computer vision syndrome, an eye-strain condition resulting from prolonged focusing on a computer screen. 

Computer vision is used today in industries ranging from agriculture to automotive — and the market is growing. It’s expected to reach USD 48.6 billion by 2022, according to Forbes.

The evolution of computer vision

Scientists and engineers have been trying to develop ways for machines to see and understand visual data for about 60 years. According to Rostyslav Demush at Hacker Noon, experimentation began in 1959 when neurophysiologists David Hubel and Torsten Wiesel showed a cat an array of images, attempting to correlate responses in its brain with what it saw. After initial failures, they discovered that the animal responded first to hard edges or lines, something they learned accidentally while switching slides. Scientifically, this meant that there are simple and complex neurons in the visual cortex and that image processing starts with simple shapes like straight edges. (Sound familiar?)

At about the same time as Hubel and Wiesel’s work, the first computer image scanning technology was developed, enabling computers to digitize and acquire images. Another important milestone was reached in 1963 when computers became able to transform two-dimensional images into three-dimensional forms.

In the 1960s, artificial intelligence emerged as an academic field of study and the Summer Vision Project was launched to teach machines how to see. Although largely a failure, it marked the birth of AI’s quest to solve the human vision problem.

In 1974, Ray Kurzweil introduced optical character recognition (OCR) technology that could recognize text printed in virtually any font or typeface. Similarly, intelligent character recognition (ICR) can decipher handwritten text using neural networks. Initially developed to recognize and read text aloud for the blind, OCR and ICR have found their way into document and invoice processing, vehicle plate recognition, mobile payments, machine translation and other fairly common applications today.(2)(3)

In 1982, British neuroscientist David Marr (building on the work of Hubel and Wiesel) established that vision works hierarchically and introduced algorithms for machines to detect edges, corners, curves and similar basic shapes. Concurrently, Kunihiko Fukushima, a Japanese computer scientist, developed a network of cells that could recognize patterns. The network, called the Neocognitron, included convolutional layers in a neural network.

By 2000, the focus of study was on object recognition. By 2001, the first real-time face recognition applications appeared. Through the 2000s, standardization of how visual data sets are tagged and annotated emerged. In 2010 the ImageNet data set was made available. It contains millions of tagged images across a thousand object classes and provides a foundation for convolutional neural networks and deep learning models used today.

In 2012, a team from the University of Toronto entered a convolutional neural network model into an image recognition contest: the ImageNet Large Scale Visual Recognition Challenge. The model, called AlexNet after researcher Alex Krizhevsky, drastically reduced the error rate for image recognition. Since this breakthrough, error rates at the competition have fallen to just a few percent, thanks to convolutional neural networks.(4)

Look deeper

Computer Vision and Multimedia at IBM AI Research

Access videos, papers, workshops and more.

Visit IBM Watson and computer vision on Medium

Get more resources on computer vision from the Medium publication platform.

Computer Vision and Augmented Reality at IBM Research

Gain insights into technology and solutions for object recognition and augmented reality.

Why is computer vision important?

There is a lot of research being done in the computer vision field, but it’s not just research. Real-world applications demonstrate how important computer vision is to endeavors in business, entertainment, transportation, healthcare and everyday life.

One of the key drivers for the growing application of computer vision is the flood of visual information flowing from smartphones, security systems, traffic cameras and a host of other visually instrumented devices. This information creates a testbed to train computer vision applications and a launchpad for them to become part of a range of human activities:

  • Google Translate, for example, lets users point a smartphone camera at a sign in another language and almost immediately obtain a translation of the sign in their preferred language.(5)
  • IBM® used computer vision to create My Moments for the 2018 Masters golf tournament. IBM Watson® watched hundreds of hours of Masters footage and could identify the sights (and sounds) of significant shots. It curated these key moments and delivered them to fans as personalized highlight reels.
  • IBM Watson also helped create a trailer for the AI horror-thriller Morgan. It analyzed the film’s footage and generated the visual basis for the trailer.
  • The automotive industry’s pursuit of autonomous vehicles relies on computer vision to make sense out of all of the visual input from a self-driving car’s cameras, light detection and ranging (LIDAR) and other sensors. It’s essential to identify other cars, traffic signs and signals, lane markers, pedestrians, walkways, bicycles and all of the other visual information encountered on the road.
  • IBM is applying computer vision technology along with a fluorescent dye and an infrared camera to help surgeons spot cancer using photons. And it’s applying computer vision to help diagnose skin cancer.
  • Computer vision is also being used to help ensure facial analysis technology is built and trained responsibly.

Research to reality

Watch the Masters with Watson

IBM and the Masters used computer vision to identify meaningful moments from the Masters tournament and deliver customized highlights, including a spoiler-free mode that lets fans catch up without ruining the finish.

Identifying skin cancer with computer vision

Skin cancer is the most commonly diagnosed cancer in the United States. Read how skin image analysis, along with machine learning, computer vision and cloud computing, is making a difference.

AI - Augmenting Human Perception

At IBM Research – Ireland, our team has built a 3D computer vision-driven task-completion prototype called the “Puzzle Solving Toolkit,” which interacts with a visually impaired user to guide them through solving a jigsaw puzzle in a natural and intuitive way.

Getting started

Many organizations don’t have the research and development resources to fund computer vision labs and create deep learning models and neural networks. They also often lack the raw computing power required to process huge sets of visual data.

Companies such as IBM, Google and Microsoft are helping by offering computer vision software development services. These services deliver pre-built learning models available from the cloud — so they can also ease demand on computing resources. Users connect to the services through an application programming interface (API) and use them to develop innovative computer vision applications.
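The pattern is typically a simple HTTP request: send an image, get back predictions. The sketch below shows the general shape of such a call; the URL, header, field names and response format are hypothetical placeholders, not the actual API of IBM, Google or Microsoft.

    import requests

    # Hypothetical endpoint and credentials; consult your provider's
    # documentation for the real URL, authentication and payload format.
    API_URL = "https://api.example.com/v1/vision/classify"
    API_KEY = "your-api-key"

    # Upload a local image and print the labels the service returns.
    with open("photo.jpg", "rb") as f:
        response = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"image": f})

    response.raise_for_status()
    for label in response.json().get("labels", []):  # assumed response shape
        print(label)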

IBM has also introduced a computer vision platform that addresses both developmental and computing resource concerns. IBM PowerAI Vision includes a set of software tools that enable subject matter experts to label, train and deploy deep learning vision models — without coding or deep learning expertise. The tools use IBM Power® Systems servers to offer an AI development platform for computer vision with the required performance characteristics. Trained models can then be deployed in local data centers, in the cloud and on edge devices.

While it’s getting easier to obtain resources to develop computer vision applications, an important question to answer early on is: What exactly will these applications do? Understanding and defining specific computer vision tasks can focus and validate projects and applications and make it easier to get started.

Here are a few examples of established computer vision tasks:

  • Image classification sees an image and can classify it (a dog, an apple, a formula one race car, a person’s face). More precisely, it is able to accurately predict that a given image belongs to a certain class (see the sketch after this list). For example, a social media company might want to use image classification to automatically identify and segregate objectionable images uploaded by users.
  • Object detection can use image classification to identify a certain class of image and then detect and tabulate its appearances in an image or video. Examples include detecting damage on assembly lines, identifying machinery that requires maintenance or performing other visual inspections.
  • Object tracking follows or tracks an object once it is detected. This task is often executed with images captured in sequence or real-time video feeds. Autonomous vehicles, for example, need to not only classify and detect objects such as pedestrians, other cars and road infrastructure, they need to track them in motion to avoid collisions and obey traffic laws.(6)
  • Content-based image retrieval uses computer vision to browse, search and retrieve images from large data stores based on the content of the images rather than metadata tags associated with them. This task can incorporate automatic image annotation that replaces manual image tagging. It can support digital asset management systems and increase the accuracy of search and retrieval.
  • Pose estimation estimates an object’s position and orientation in an image or sequence of images, enabling computer vision applications to identify an object’s position relative to its surroundings. This task is often applied in robotics, where a robot needs to visually understand where objects are in order to perform tasks and avoid running into them.(1)
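As a concrete starting point for the first of these tasks, here is a minimal image-classification sketch using PyTorch and torchvision (our choice of tools, assuming a recent torchvision release; any comparable framework would do). A ResNet-18 pretrained on ImageNet predicts the most likely of 1,000 object classes for a single image file:

    import torch
    from torchvision import models, transforms
    from PIL import Image

    # Standard ImageNet preprocessing: resize, crop, normalize.
    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    # Pretrained weights API assumes torchvision >= 0.13.
    weights = models.ResNet18_Weights.DEFAULT
    model = models.resnet18(weights=weights)
    model.eval()

    # "photo.jpg" is a placeholder path for any local image.
    image = preprocess(Image.open("photo.jpg")).unsqueeze(0)  # add batch dim
    with torch.no_grad():
        probs = torch.softmax(model(image), dim=1)

    top_prob, top_class = probs.max(dim=1)
    labels = weights.meta["categories"]  # the 1,000 ImageNet class names
    print(f"{labels[top_class.item()]} ({top_prob.item():.1%})")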

Tools and trials

Watson Visual Recognition

Quickly and accurately tag, classify and search visual content using machine learning.

IBM PowerAI Vision

Train highly accurate models to classify images and detect objects in images and videos — without deep learning expertise.

More resources

Join the Custom Object Detection Beta

Train the object detection model to recognize objects important to a workflow or domain. For example, detect damage to cars, find machines that need maintenance, or perform visual inspections.

Dialog-Based Interactive Image Retrieval

A natural language-based system for interactive image retrieval that is more expressive than conventional systems based on binary or fixed-form feedback.

IBM to release world’s largest annotation for studying bias in facial analysis

Society is paying more attention than ever to the question of bias in artificial intelligence. See how IBM is helping to ensure facial recognition technology is built and trained responsibly.

IBM Research Blog

With more than 3,000 researchers in 12 labs located across six continents, IBM Research is one of the world’s largest and most influential corporate research labs. Read the blog that covers the research across industries and topics.