The past decade has been defined by an unprecedented explosion of visual content generated by humans – from social media, to entertainment and manufacturing, to the satellites circling Earth high above the buzz of daily life. With recent advances in cognitive technologies like large-scale deep learning and semantic, facet-based visual modeling, we have begun to accelerate our ability to discover insights in this data, but going beyond a baseline level of recognized detail has remained a challenge.
Today, IBM is taking an important step forward in advancing this ability by rolling out a significant update to the image classifier capability in Watson Visual Recognition, a service that allows users to understand the contents of an image or video frame. Its active vocabulary is more than 2.5 times larger than the previous model's, with a built-in set of tens of thousands of visual labels. This enhancement greatly improves the service's ability to recognize highly specific visual concepts.
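To give a feel for what the classifier returns, here is a minimal sketch of filtering the labels produced for one image. The JSON shape below is illustrative only (a label name, a confidence score, and an optional type hierarchy); the exact field names and structure of the live service's response may differ, and the sample labels and scores are invented for this example.

```python
# Illustrative classify-style response for a single image; the field
# names and values here are assumptions for demonstration purposes.
sample_response = {
    "images": [{
        "classifiers": [{
            "classes": [
                {"class": "beer garden", "score": 0.94,
                 "type_hierarchy": "/outdoor/beer garden"},
                {"class": "restaurant", "score": 0.81},
                {"class": "person", "score": 0.62},
            ]
        }]
    }]
}

def labels_above(response, threshold=0.5):
    """Collect (label, score) pairs for the first image, most confident first."""
    classes = response["images"][0]["classifiers"][0]["classes"]
    hits = [(c["class"], c["score"]) for c in classes if c["score"] >= threshold]
    return sorted(hits, key=lambda pair: -pair[1])

print(labels_above(sample_response))
# -> [('beer garden', 0.94), ('restaurant', 0.81), ('person', 0.62)]
```

An application would typically tune the confidence threshold to trade label coverage against precision.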
These new, built-in labels cover a broad set of visual concepts that includes objects, people, places, activities, scenes, and many more categories, as well as fine-grained attributes, such as specific colors. Each category has also been deepened with more specific labels. The result is a built-in classifier that, for many typical photos, can give both more specific and more accurate labels. It augments the description with more general tags based on a hierarchy – such as knowing a “horse” is an “animal”. The service also makes fine distinctions that produce highly specific labels. For example, given a photo of people having an enjoyable dining experience, the service can now recognize that the scene is not just a restaurant but specifically a beer garden, based on its visual appearance.
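The hierarchy-based augmentation described above (a “horse” is also an “animal”) can be sketched with a toy parent map. The map below is a stand-in for the service's far larger internal taxonomy; every entry in it is illustrative, not part of the actual vocabulary.

```python
# Toy label hierarchy: each label points at its more general parent.
# These entries are hypothetical stand-ins for the real taxonomy.
PARENT = {
    "beer garden": "restaurant",
    "restaurant": "building",
    "horse": "ungulate",
    "ungulate": "animal",
}

def augment_with_ancestors(label):
    """Return the label plus every more general label above it in the hierarchy."""
    chain = [label]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

print(augment_with_ancestors("horse"))  # -> ['horse', 'ungulate', 'animal']
```

This is how a single specific prediction can yield several tags at different levels of generality.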
This level of specificity is enabled because Visual Recognition now provides on average nine or more labels for each image, up from an average of two to three labels per image in our previous version. We achieved this major step forward by using a very large set of training images from a broad variety of photographic scenes and a distributed network of Graphics Processing Units (GPUs). Watson soaked up all that information into a convolutional neural network with tens of thousands of tags in its vocabulary. We also developed new methods for inferencing that use semantic reasoning to optimize the specificity, saliency, and accuracy of the tags produced by the service.
Of course, many enterprises have proprietary data for which they want to create their own private classifiers, and Watson Visual Recognition also features custom training and classification. When there is a need to learn a new set of image labels for a specific domain, like a company’s product portfolio, the service allows developers to quickly train and “plug in” new custom models, just by providing example images. Applications can then use the custom models in conjunction with the base tagging service to provide both domain-specific custom-learned labels and a broad set of built-in labels. Custom classifiers can also be improved over time by adding new training examples.
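As a rough sketch of the “train by providing example images” workflow, the helper below names the upload fields for a custom-training request. The naming convention shown, one zip of positive example images per class plus an optional zip of negative examples, is an assumption for illustration, not a confirmed API contract; consult the service documentation for the actual request format.

```python
def training_fields(positive_classes, include_negatives=False):
    """Name the per-class upload fields for a custom-classifier training call.

    ASSUMPTION: the '<class>_positive_examples' / 'negative_examples'
    convention here is hypothetical, chosen to illustrate the idea of
    training a custom model purely from zipped example images per class.
    """
    fields = [f"{cls}_positive_examples" for cls in positive_classes]
    if include_negatives:
        fields.append("negative_examples")
    return fields

print(training_fields(["sneaker", "boot"], include_negatives=True))
# -> ['sneaker_positive_examples', 'boot_positive_examples', 'negative_examples']
```

Each field would carry a zip archive of labeled images; retraining with additional examples is how a custom classifier improves over time.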
This update to Visual Recognition is an important step in our continuing journey of bringing the power of sight to Watson. It builds on a growing foundation of world-class research and development in visual comprehension that is breaking new ground on challenges ranging from using image analysis to improve the care of patients with skin cancer, to advancing technology for automatic image captioning, to pushing the boundaries of AI and creativity by making the world’s first cognitive film trailer.
Are you ready to bring the power of Watson’s vision to your images and data? You can learn more about our Visual Recognition service here.