Cloud scaling, Part 3
Explore video analytics in the cloud
Using methods, tools, and system design for video and image analysis, monitoring, and security
Digital video, encoded using standards such as those from the Moving Picture Experts Group (MPEG) to compress, transport, uncompress, and display it, has led to a revolution in computing, ranging from social networking media and amateur digital cinema to improved training and education. Tools for decoding and consuming digital video are used by nearly everyone every day, but video analytics also requires tools to encode video and to analyze uncompressed video frames, such as the Open Source Computer Vision Library (OpenCV). One of the most readily available and capable tools for encoding and decoding digital video is FFmpeg; for still images, the GNU Image Manipulation Program (GIMP) is quite useful (see Related topics for links). With these three basic tools, an open source developer is fully equipped to start exploring computer vision (CV) and video analytics. Before exploring these tools and development methods, however, let's first define these terms better and consider applications.
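As a concrete example of the encode/decode role FFmpeg plays in this toolchain, the following sketch builds the standard ffmpeg command line for decoding a compressed video into still frames that OpenCV or GIMP can then analyze. The file names are illustrative assumptions:

```python
# Sketch: build the standard ffmpeg invocation for decoding a video file
# into still frames for analysis. File names are illustrative; the
# -vf fps=N filter samples N frames per second of video.
def frame_extract_cmd(video="input.mpg", pattern="frame%04d.png", fps=1):
    """Build an ffmpeg command line that decodes `video` into PNG frames."""
    return ["ffmpeg", "-i", video, "-vf", f"fps={fps}", pattern]

print(" ".join(frame_extract_cmd("clip.mpg")))
# ffmpeg -i clip.mpg -vf fps=1 frame%04d.png
```

Running the printed command produces numbered PNG frames that can be loaded one at a time into an OpenCV program or inspected interactively in GIMP.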
The first article in this series, Cloud scaling, Part 1: Build your own and scale with HPC on demand, provided a simple example using OpenCV that implements a Canny edge transformation on continuous real-time video from a Linux® web cam. This is an example of a CV application that you could use as a first step in segmenting an image. In general, CV applications involve acquisition, digital image formats for pixels (picture elements that represent points of illumination), images and sequences of them (movies), processing and transformation, segmentation, recognition, and ultimately scene descriptions. The best way to understand what CV encompasses is to look at examples. Figure 1 shows face and facial feature detection analysis using OpenCV. Note that in this simple example, using the Haar Cascade method (a machine learning algorithm) for detection analysis, the algorithm best detects faces and eyes that are not occluded (for example, my youngest son's face is turned to the side) or shadowed and when the subject is not squinting. This is perhaps one of the most important observations that can be made regarding CV: It's not a trivial problem. Researchers in this field often note that although much progress has been made since its advent more than 50 years ago, most applications still can't match the scene segmentation and recognition performance of a 2-year-old child, especially when the ability to generalize and perform recognition in a wide range of conditions (lighting, size variation, orientation and context) is considered.
Figure 1. Using OpenCV for facial recognition
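To make the Canny edge example from Part 1 more concrete, here is a minimal sketch of the gradient step at the heart of any edge transform, in plain Python on a tiny synthetic image. A real Canny implementation (for example, OpenCV's cv2.Canny) adds Gaussian smoothing, non-maximum suppression, and hysteresis thresholding on top of this:

```python
# Sketch of the Sobel gradient-magnitude step that underlies edge
# detection, using plain Python lists as a stand-in for image arrays.
def sobel_magnitude(img):
    """Approximate gradient magnitude |Gx| + |Gy| for a 2D list of pixels."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Horizontal gradient: right column minus left column
            gx = (img[y-1][x+1] + 2*img[y][x+1] + img[y+1][x+1]
                  - img[y-1][x-1] - 2*img[y][x-1] - img[y+1][x-1])
            # Vertical gradient: bottom row minus top row
            gy = (img[y+1][x-1] + 2*img[y+1][x] + img[y+1][x+1]
                  - img[y-1][x-1] - 2*img[y-1][x] - img[y-1][x+1])
            out[y][x] = abs(gx) + abs(gy)
    return out

# A 5x6 image: dark on the left, bright on the right -> one vertical edge.
img = [[0, 0, 0, 255, 255, 255] for _ in range(5)]
edges = sobel_magnitude(img)
```

The result is large gradient values along the dark-to-bright boundary and zeros in the flat regions, which is exactly the raw material that thresholding and non-maximum suppression then turn into thin, connected edges.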
To help you understand the analytical methods used in CV, I have created a small test set of images from the Anchorage, Alaska area that is available for download. The images have been processed using GIMP and OpenCV. I developed C/C++ code to use the OpenCV application programming interface with a Linux web cam, precaptured images, or MPEG movies. The use of CV to understand video content (sequences of images), either in real time or from precaptured databases of image sequences, is typically referred to as video analytics.
Defining video analytics
Video analytics is broadly defined as analysis of digital video content from cameras (typically visible light, but it could be from other parts of the spectrum, such as infrared) or stored sequences of images. Video analytics involves several disciplines but at least includes:
- Image acquisition and encoding. Capturing video as a sequence of images or groups of compressed images. This stage of video analytics can be complex, including photometer (camera) technology, analog decoding, digital formats for arrays of light samples (pixels) in frames and sequences, and methods of compressing and decompressing this data.
- CV. The inverse of graphical rendering, where acquired scenes are converted into descriptions compared to rendering a scene from a description. Most often, CV assumes that this process of using a computer to "see" should operate wherever humans do, which often distinguishes it from machine vision. The goal of seeing like a human does most often means that CV solutions employ machine learning.
- Machine vision. Again, the inverse of rendering but most often in a well-controlled environment for the purpose of process control—for example, inspecting printed circuit boards or fabricated parts to make sure they are geometrically correct within tolerances.
- Image processing. A broad application of digital signal processing methods to samples from photometers and radiometers (detectors that measure electromagnetic radiation) to understand the properties of an observation target.
- Machine learning. Algorithms that are refined through training data, such that performance improves and generalizes when the algorithm is tested with new data.
- Real-time and interactive systems. Systems that must respond by a deadline relative to a request for service, or at least with a quality of service that meets service-level agreements (SLAs) with customers or users of the services.
- Storage, networking, database, and computing. All required to process digital data used in video analytics, but a subtle, yet important distinction is that this is an inherently data-centric compute problem, as was discussed in Part 2 of this series.
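To put the acquisition and encoding stage above in perspective, this back-of-the-envelope sketch compares the raw pixel-array data rate of an uncompressed 1080p, 30-frame/s stream against a compressed stream; the 8Mb/s bit rate is an illustrative assumption for an H.264-class encoding, not a measured figure:

```python
# Back-of-envelope sketch of why encoding matters in the acquisition
# stage: raw pixel-array frames versus a typical compressed stream.
width, height = 1920, 1080       # 1080p frame
bytes_per_pixel = 3              # 8-bit RGB samples
fps = 30

raw_frame = width * height * bytes_per_pixel     # bytes per frame
raw_rate = raw_frame * fps                       # bytes per second
compressed_rate = 8_000_000 / 8                  # assumed 8 Mb/s -> bytes/s

print(f"raw: {raw_rate / 1e6:.1f} MB/s")                       # 186.6 MB/s
print(f"compression ratio: {raw_rate / compressed_rate:.0f}:1")
```

Roughly two orders of magnitude separate the raw and compressed rates, which is why a video analytics pipeline must decide carefully where in the system frames are decompressed for analysis.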
Video analytics, therefore, is broader in scope than CV and is a system design problem that might include mobile elements like a smart phone (for example, Google Goggles) and cloud-based services for the CV aspects of the overall system. For example, IBM has developed a video analytics system known as the video correlation and analysis suite (VCAS); it is a good example of a system design concept. Detailed focus on each system design discipline involved in a video analytics solution is beyond the scope of this article, but many pointers to more information for system designers are available in Related topics. The rest of this article focuses on CV processing examples and applications.
Basic structure of video analytics applications
You can break the architecture of cloud-based video analytics systems down into two major segments: embedded intelligent sensors (such as smart phones, tablets with a camera, or customized smart cameras) and cloud-based processing for analytics that can't be directly computed on the embedded device. Why break the architecture into two segments rather than solve the whole problem on the smart embedded device? Embedding CV in transportation systems, smart phones, and products is not always practical. Even when a smart camera is embedded, the compressed video or scene description is often back-hauled to a cloud-based video analytics system simply to offload the resource-limited embedded device. Perhaps more important than resource limitations, though, is that video transported to the cloud for analysis can be correlated with larger data sets and annotated with up-to-date global information for augmented reality (AR) returned to the devices.
The smart camera devices for applications like gesture and facial expression recognition must be embedded. However, more intelligent inference to identify people and objects and fully parse scenes is likely to require scalable data-centric systems that can be more efficiently scaled in a data center. Furthermore, data processing acceleration at scale ranging from the Khronos OpenVX CV acceleration standards to the latest MPEG standards and feature-recognition databases are key to moving forward with improved video analytics, and two-segment cloud plus smart camera solutions allow for rapid upgrades.
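One way to picture the two-segment split described above is the compact scene description a smart camera might back-haul in place of raw video: local detection on the device, heavier inference in the cloud. The message fields and names below are illustrative assumptions, not a real API:

```python
# Sketch of the embedded-device side of a two-segment video analytics
# system: cheap on-camera detection produces a compact scene description
# that is uploaded for cloud-side correlation and annotation.
import json

def scene_description(camera_id, frame_no, detections):
    """Compact message a smart camera might back-haul instead of raw video."""
    return json.dumps({
        "camera": camera_id,
        "frame": frame_no,
        "detections": [                     # bounding boxes + labels only
            {"label": label, "box": box} for label, box in detections
        ],
    })

msg = scene_description("cam-07", 1423, [("face", [120, 80, 64, 64])])
print(len(msg), "bytes instead of a multi-megabyte raw frame")
```

A message of this shape is a few hundred bytes, so it conserves the embedded device's uplink while still giving the cloud side enough structure to correlate against larger data sets and return AR annotations.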
With sufficient data-centric computing capability leveraging the cloud and smart cameras, the dream of inverse rendering can perhaps be realized where, in the ultimate "Turing-like" test that can be demonstrated for CV, scene parsing and re-rendered display and direct video would be indistinguishable for a remote viewer. This is essentially done now in digital cinema with photorealistic rendering, but this rendering is nowhere close to real time or interactive.
Video analytics apps: Individual scenarios
Killer applications for CV and video analytics are being thought of every day, some perhaps years from realization because of computing requirements or implementation cost. Nevertheless, here is a list of interesting applications:
- AR views of scenes for improved understanding. If you have ever looked at, for example, a landing plane and thought, I wish I could see the cockpit view with instrumentation, this is perhaps possible. I worked in Space Shuttle mission control long ago, where a large development team meticulously re-created a view of the avionics for ground controllers that shadowed what astronauts could see—all graphical, but imagine fusion of both video and graphics to annotate and re-create scenes with metadata. A much simplified example is presented here in concept to show how an aircraft observed via a tablet computer camera could be annotated with attitude and altitude estimation data (see the example in this article).
- Skeletal transformations to track the movement and estimate the intent and trajectory of an animal that might jump onto a highway. See the example in this article.
- Fully autonomous or mostly autonomous vehicles with human supervisory control only. Think of the steps between today's cruise control and tomorrow's fully autonomous car. Cars that can parallel park themselves today are a great example of this stepwise development.
- Beyond face detection to reliable recognition and, perhaps more importantly, for expression feedback. Is the driver of a semiautonomous vehicle aggravated, worried, surprised?
- Virtual shopping (AR to try products). Shoppers can see themselves in that new suit.
- Signage that interacts with viewers. This is based on expressions, likes and dislikes, and data that the individual has made public.
- Two-way television and interactive digital cinema. Entertainment for which viewers can influence the experience, almost as if they were actors in the content.
- Interactive telemedicine. This is available any time with experts from anywhere in the world.
I make no attempt in this article to provide an exhaustive list of applications, but I explore more by looking closely at both AR (annotated views of the world through a camera and display—think heads-up displays such as fighter pilots have) and skeletal transformations for interactive tracking. To learn more beyond these two case studies and for more in-depth application-specific uses of CV and video analytics in medicine, transportation safety, security and surveillance, mapping and remote sensing, and an ever-increasing list of system automation that includes video content analysis, consult the many entries in Related topics. The tools available can help anyone with computer engineering skills get started. You can also download a larger set of test images as well as all OpenCV code I developed for this article.
Example: Augmented reality
Real-time video analytics can change the face of reality by augmenting the view a consumer has with a smart phone held up to products or our view of the world (for example, while driving a vehicle) and can allow for a much more interactive experience for users for everything from movies to television, shopping, and travel to how we work. In AR, the ideal solution provides seamless transition from scenes captured with digital video to scenes generated by rendering for a user in real time, mixing both digital video and graphics in an AR view for the user. Poorly designed AR systems distract a user from normal visual cues, but a well-designed AR system can increase overall situation awareness, fusing metrics with visual cues (think fighter pilot heads-up displays).
The use of CV and video analytics in intelligent transportation systems has significant value for safety improvement, and perhaps eventually CV may be the key technology for self-driving vehicles. This appears to be the case based on the U.S. Defense Advanced Research Projects Agency challenge and the Google car, although use of the full spectrum with forward-looking infrared and instrumentation in addition to CV has made autonomous vehicles possible. Another potentially significant application is air traffic safety, especially for airports to detect and prevent runway incursion scenarios. The imagined AR view of an aircraft on final approach at Ted Stevens airport in Anchorage shows a Hough linear transform that might be used to segment and estimate aircraft attitude and altitude visually, as shown in Figure 2. Runway incursion safety is of high interest to the U.S. Federal Aviation Administration (FAA), and statistics for these events can be found in Related topics.
Figure 2. AR display example
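The Hough linear transform used in Figure 2 can be sketched in miniature: each edge point votes for every (angle, distance) line that could pass through it, and the accumulator bin with the most votes identifies the dominant line, whose angle could feed an attitude estimate. This plain-Python version is a teaching sketch, not OpenCV's optimized cv2.HoughLines:

```python
# Minimal Hough line transform: accumulate votes in (angle, rho) space
# over a set of edge points and return the angle of the strongest line.
import math

def dominant_angle(points, angle_steps=180):
    """Angle (degrees) of the normal to the strongest line through points."""
    votes = {}
    for x, y in points:
        for a in range(angle_steps):
            theta = math.radians(a)
            # rho: signed distance from the origin to the candidate line
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            votes[(a, rho)] = votes.get((a, rho), 0) + 1
    (angle, _), _ = max(votes.items(), key=lambda kv: kv[1])
    return angle

# Edge points along the horizontal line y = 5; its normal points straight
# up, so the dominant angle is 90 degrees.
points = [(x, 5) for x in range(200)]
print(dominant_angle(points))  # 90
```

In an attitude-estimation setting, the edge points would come from an edge transform of the aircraft image, and the recovered line angle for the wings or fuselage would be compared against the horizon.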
For intelligent transportation, drivers will most likely want to participate even as systems become more intelligent, so a balance of automation and human participation and intervention should be kept in mind (for autonomous or semiautonomous vehicles).
Skeletal transformation examples: Tracking movement for interactive systems
Skeletal transformations are useful for applications like gesture recognition or gait analysis of humans or animals—any application where the motion of a body's skeleton (rigid members) must be tracked can benefit from a skeletal transformation. Most often, this transformation is applied to bodies or limbs in motion, which further enables the use of background elimination for foreground tracking. However, it can still be applied to a single snapshot, as shown in Figure 3, where a picture of a moose is first converted to a gray map, then a threshold binary image, and finally the medial distance is found for each contiguous region and thinned to a single pixel, leaving just the skeletal structure of each object. Notice that the ears on the moose are back—an indication of the animal's intent (a higher-resolution skeletal transformation might be able to detect this as well as the gait of the animal).
Figure 3. Skeletal transformation of a moose
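The threshold-then-thin pipeline behind Figure 3 can be sketched as follows. The classic Zhang-Suen thinning algorithm stands in here for the medial-distance thinning described above; OpenCV provides comparable building blocks (cv2.threshold, and cv2.ximgproc.thinning in the contrib modules):

```python
# Sketch of a skeletal transformation: threshold a gray image to binary,
# then thin it with Zhang-Suen until a 1-pixel-wide skeleton remains.
def threshold(gray, t):
    return [[1 if p > t else 0 for p in row] for row in gray]

def zhang_suen(img):
    """Iteratively peel removable border pixels, leaving the skeleton."""
    img = [row[:] for row in img]
    h, w = len(img), len(img[0])
    changed = True
    while changed:
        changed = False
        for step in (0, 1):
            to_clear = []
            for y in range(1, h - 1):
                for x in range(1, w - 1):
                    if not img[y][x]:
                        continue
                    # neighbors P2..P9, clockwise from north
                    p = [img[y-1][x], img[y-1][x+1], img[y][x+1],
                         img[y+1][x+1], img[y+1][x], img[y+1][x-1],
                         img[y][x-1], img[y-1][x-1]]
                    b = sum(p)                          # nonzero neighbors
                    a = sum((p[i] == 0) and (p[(i + 1) % 8] == 1)
                            for i in range(8))          # 0->1 transitions
                    if step == 0:
                        cond = p[0]*p[2]*p[4] == 0 and p[2]*p[4]*p[6] == 0
                    else:
                        cond = p[0]*p[2]*p[6] == 0 and p[0]*p[4]*p[6] == 0
                    if 2 <= b <= 6 and a == 1 and cond:
                        to_clear.append((y, x))
            for y, x in to_clear:
                img[y][x] = 0
                changed = True
    return img

# A bright 3-pixel-wide bar on a dark background thins toward a line.
gray = [[200 if 2 <= x <= 4 else 10 for x in range(7)] for _ in range(9)]
skeleton = zhang_suen(threshold(gray, 128))
```

The skeleton is a subset of the thresholded foreground with far fewer pixels, which is what makes skeletal data cheap to track from frame to frame for gesture or gait analysis.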
Skeletal transformations can certainly be useful in tracking animals that might cross highways or charge a hiker, but the transformation has also become of high interest for gesture recognition in entertainment, such as in the Microsoft® Kinect® software developer kit (SDK). Gesture recognition can be used for entertainment but also has many practical purposes, such as automatic sign language recognition—not yet available as a product but a concept in research. Certainly skeletal transformation CV can analyze the human gait for diagnostic or therapeutic purposes in medicine or to capture human movement for animation in digital cinema.
Skeletal transformations are widely used in gesture-recognition systems for entertainment. Creative and Intel have teamed up to create an SDK for Windows® called the Creative Interactive Gesture Camera Developer Kit (see Related topics for a link) that uses a time-of-flight light detection and ranging sensor, camera, and stereo microphone. This SDK is similar to the Kinect SDK but is intended as early access for developers to build gesture-recognition applications for the device. The SDK is amazingly affordable and could become the basis for some breakthrough consumer devices now that it is in the hands of a broad development community. To get started, you can purchase the device from Intel and then download the Intel® Perceptual Computing SDK. The demo images are included as an example, along with numerous additional SDK examples to help developers understand what the device can do. You can use the finger-tracking example shown in Figure 4 right away just by installing the SDK for Microsoft Visual Studio® and running the Gesture Viewer sample.
Figure 4. Skeletal transformation using the Intel Perceptual Computing SDK and Creative Interactive Gesture Camera Developer Kit
The future of video analytics
This article makes an argument for the use of video analytics primarily to improve public safety; for entertainment, social networking, telemedicine, and medical augmented diagnostics; and to envision products and services as a consumer. Machine vision has quietly helped automate industry and process control for years, but CV and video analytics in the cloud now show promise for providing vision-based automation in the everyday world, where the environment is not well controlled. This will be a challenge both in terms of algorithms for image processing and machine learning and in terms of the data-centric computer architectures discussed in this series. The challenges for high-performance video analytics (in terms of receiver operating characteristics and throughput) should not be underestimated, but with careful development, this rapidly growing technology promises a wide range of new products and even human vision system prosthetics for those with sight impairments or loss of vision. Based on the value of vision to humans, no doubt it is also fundamental to intelligent computing systems.
- Learning OpenCV by Gary Bradski and Adrian Kaehler (O'Reilly, 2008) is probably the best place to start learning about CV.
- Numerous excellent academic textbooks with algorithm details and fundamental theory are available, including:
- Computer Vision: Models, Learning, and Inference by Simon J.D. Prince (Cambridge UP, 2012)
- Computer and Machine Vision: Theory, Algorithms, Practicalities by E.R. Davies (Academic Press, 2012)
- Computer Vision: Algorithms and Applications by Richard Szeliski (Springer, 2011)
- Computer Vision by Linda Shapiro and George Stockman (Prentice Hall, 2001)
- Courses at universities on CV, video analytics, and interactive or real-time systems are becoming more widely available at both the graduate and undergraduate levels.
- Universities such as Carnegie Mellon (the Computer Vision Group), Stanford (the Stanford Vision Lab in the Stanford AI Lab), and the Massachusetts Institute of Technology (the CSAIL Computer Vision Research Group) have large research and teaching programs.
- I work at two state universities that have significant coursework, including undergraduate courses and research at University of Alaska Anchorage in the Computer Prototype and Assembly Lab for classes such as Computer and Machine Vision and the University of Colorado at Boulder in the Embedded Certificate Program as an adjunct professor.
- The courses at CU-Boulder in Real-time embedded systems are offered by the Electrical Computer and Energy Engineering department on campus and via distance for summer courses, including Real-Time Digital Media and a summer version of Real-Time Embedded Systems taught via the Center for Advanced Engineering and Technology Education.
- It is also possible to learn more about these topics through Udacity, such as this great course, Introduction to Artificial Intelligence, which covers machine learning and artificial intelligence-related image processing and computer vision and Introduction to Parallel Programming, which covers the use of GP-GPUs that can be used to speed up graphics and CV processing.
- Research by IBM and partners in CV and video analytics includes IBM Exploratory Computer Vision, IBM Smart Surveillance Research, and IBM Augmented Reality.
- Medical uses for video analytics and CV range from the Artificial Retina Project to smart microscopes and radiology equipment, most often not to fully replace medical clinicians but rather to assist them or extend their reach to rural areas through telemedicine. The Medical Vision Group at MIT is a good place to start.
- Of course, you need to download and install OpenCV. I found this OpenCV installation procedure for Ubuntu easy to follow, and it includes a great facial-recognition example from OpenCV for Haar Cascade detection.
- Learn more about VCAS.
- Video analytics requires image analysis as well as encode/decode tools for digital video. A great place to start is with open systems software and hardware, including OpenCV, the OpenVX hardware acceleration standard, and FFmpeg tools for encoding/decoding digital video. Finally, GIMP tools are great for interactive work—for example, choosing thresholds based on histogram analysis or taking a quick look at a Sobel edge transformation.
- CV methods can be used for search, such as the Google Image search services, and to detect faces for social networking, such as the Facebook face detection used to assist with tagging friends in photos. That feature has not been without controversy, as described and explored in detail at the 2011 FTC Face Facts Forum. Google image search works well for finding identical matches but not so well for true recognition (for example, my picture of cows returned no other images of cows). Either way, facial recognition, which might include automatic identification of individuals rather than just segmentation of the face, involves public policy controversy. Recently, Facebook acquired Face.com, and a host of interesting features could come from that acquisition, including mood and age estimation.
- Download Microsoft Kinect SDK.
- Purchase the Creative Interactive Gesture Camera Developer Kit from Intel.
- Download the Intel Perceptual Computing SDK.