Quantity matters when training computers to accurately recognize what’s in an image. The more they see, the more they learn. But training new visual recognition models from a large number of images using deep learning can quickly become a bottleneck, especially in cloud environments that use commodity hardware and GPUs. Commodity machines with an average of two to four GPUs are not equipped to process large datasets in a timely fashion.
Our research team has overcome this hurdle by building a distributed deep learning system that breaks up a large visual training task into smaller, decentralized training jobs that run in parallel. It trains individual models on multiple computing nodes, each working on a smaller segment of the training data, and the models are exchanged asynchronously during training via a central parameter server. This way, every node is periodically updated on what the other nodes have learned, and that information is incorporated into the individual models as they continue to learn. Our innovation has been to make this type of asynchronous distributed training both efficient and accurate as it scales to very large image datasets.
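To make the idea concrete, here is a minimal sketch of asynchronous training with a central parameter server. It is a toy illustration, not our production system: the `ParameterServer` and `Worker` classes, the `mix` blending factor, the `sync_every` interval, and the tiny linear model are all assumptions chosen for readability, and asynchrony is simulated by letting a seeded random scheduler decide which node takes the next step.

```python
import random


class ParameterServer:
    """Holds the shared model that workers exchange with at their own pace."""

    def __init__(self, dim, mix=0.5):
        self.weights = [0.0] * dim
        self.mix = mix  # hypothetical blending factor, not from the article

    def exchange(self, worker_weights):
        # Blend the worker's model into the shared model, then hand the
        # blended model back so the worker picks up what others have learned.
        self.weights = [
            (1 - self.mix) * shared + self.mix * local
            for shared, local in zip(self.weights, worker_weights)
        ]
        return list(self.weights)


class Worker:
    """Trains a private copy of the model on its own shard of the data."""

    def __init__(self, shard, dim, lr=0.05, sync_every=5):
        self.shard = shard
        self.weights = [0.0] * dim
        self.lr = lr
        self.sync_every = sync_every  # local steps between exchanges
        self.steps = 0

    def step(self, server, rng):
        # One stochastic gradient step on a squared-error loss.
        features, target = rng.choice(self.shard)
        prediction = sum(w * x for w, x in zip(self.weights, features))
        error = prediction - target
        self.weights = [
            w - self.lr * error * x for w, x in zip(self.weights, features)
        ]
        self.steps += 1
        # Periodically exchange models with the server, independent of
        # where the other workers are in their own training.
        if self.steps % self.sync_every == 0:
            self.weights = server.exchange(self.weights)


def train(num_workers=4, total_steps=4000, seed=0):
    rng = random.Random(seed)
    # Synthetic data for y = 2x + 1; each example is ((x, bias), y).
    data = []
    for _ in range(400):
        x = rng.uniform(-1.0, 1.0)
        data.append(((x, 1.0), 2.0 * x + 1.0))
    # Split the data into one smaller segment per worker.
    shards = [data[i::num_workers] for i in range(num_workers)]
    server = ParameterServer(dim=2)
    workers = [Worker(shard, dim=2) for shard in shards]
    # Simulated asynchrony: a random worker advances at each tick, so
    # workers reach their exchange points at different, unordered times.
    for _ in range(total_steps):
        rng.choice(workers).step(server, rng)
    return server.weights
```

Calling `train()` returns the shared weights held by the server, which should land close to the slope and intercept of the synthetic data. No worker ever waits for another: each one trains on its own shard and only touches the server when its own step counter says so, which is the property that lets this style of training spread across many nodes.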
Once we created the distributed deep learning system, we set our sights on evaluating it by training visual models on multiple large datasets, including ImageNet 22K, which consists of tens of millions of images labeled with tens of thousands of visual categories. We were encouraged when the system was able to learn models from this data with higher accuracy than previously published results.
We subsequently used the distributed deep learning system to train models for the Watson Visual Recognition service, which allows users to understand the contents of an image or video frame. We developed and applied additional methods for hierarchical reasoning over the outputs of the model to provide the most meaningful set of labels for each input image. The resulting models enable developers to tap into a much larger vocabulary of labels trained from massive amounts of image data, and we started to see some great feedback from users.
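As a rough illustration of what reasoning over a label hierarchy can look like, the sketch below propagates classifier scores up a small taxonomy and keeps only the most specific labels that clear a confidence threshold. The taxonomy in `PARENT`, the threshold value, and the max-propagation rule are hypothetical choices made for this example and are not the methods used in the service.

```python
# Hypothetical taxonomy: each label maps to its parent (None for a root).
PARENT = {
    "fireworks": "event",
    "harbor": "waterfront",
    "waterfront": "outdoor scene",
    "event": None,
    "outdoor scene": None,
}


def ancestors(label, parent):
    """Return the chain of ancestors for a label, nearest first."""
    chain = []
    node = parent.get(label)
    while node is not None:
        chain.append(node)
        node = parent.get(node)
    return chain


def hierarchical_labels(raw_scores, parent, threshold=0.6):
    """Pick the most specific labels whose score clears the threshold."""
    # An ancestor is at least as likely as its most confident descendant.
    scores = dict(raw_scores)
    for label, score in raw_scores.items():
        for ancestor in ancestors(label, parent):
            scores[ancestor] = max(scores.get(ancestor, 0.0), score)

    confident = {label for label, score in scores.items() if score >= threshold}

    # Drop any label that is implied by a more specific confident label.
    redundant = set()
    for label in confident:
        redundant.update(ancestors(label, parent))

    return sorted(confident - redundant, key=lambda label: -scores[label])
```

With scores such as `{"fireworks": 0.9, "harbor": 0.4, "waterfront": 0.7}`, the weak "harbor" prediction falls below the threshold, so the function backs off to its parent "waterfront" rather than returning nothing for that branch of the taxonomy.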
Since this initial roll-out, we’ve continued to expand the semantic coverage of the visual recognition vocabulary by almost doubling the number of training images. As a result, the Watson Visual Recognition service has greatly expanded its ability to accurately recognize the overall ‘scene’ of an image, allowing it to more completely label the who, what, and where of each image, as well as the activities and dominant colors. For example, shown the image of fireworks below, it can tell you that the scene has fireworks in a waterfront setting, and it can also comment on some of the colors seen in the scene.
Our external testing shows a 16 percent increase in user preference for labels generated using this new capability. With the new model released in production for the Watson Visual Recognition service, users will have even more capabilities to extract information and insights from images.
Beyond improving built-in visual recognition for the Watson service, we will also be able to apply new base models learned from large training data sets to the Watson Visual Recognition custom learning service, which will shorten the time-to-value for developers creating custom solutions for different domains.
We have also used some clever engineering techniques to reduce the latency of reasoning about an image with the trained models. Developers have had the capability of submitting a .zip file with multiple images, but now the service is twice as fast when images are submitted this way. We have also significantly lowered the latency that can occur when the system is experiencing high demand.
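For developers who want to try batch submission, here is a small sketch of bundling several images into a single .zip payload using only the Python standard library. The `bundle_images` helper and the file names are hypothetical, and the sketch deliberately stops before the upload step; check the service documentation for the actual request format.

```python
import io
import zipfile


def bundle_images(images):
    """Pack a mapping of file name -> image bytes into one in-memory .zip."""
    buffer = io.BytesIO()
    # Entries are stored uncompressed (the ZipFile default), since formats
    # such as JPEG are already compressed.
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, data in images.items():
            archive.writestr(name, data)
    return buffer.getvalue()
```

The returned bytes can be written to disk or attached to a request body, so a whole batch of images travels in one round trip instead of one request per image.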
By continually teaching Watson with larger data sets of training images, and engineering it in such a way that it can learn quickly using distributed deep learning, we are rapidly growing its visual IQ. This helps users and developers in any industry tap into a highly robust Visual Recognition service that can make sense of visual data and help discover game-changing insights. The price of not seeing is simply too high. What will you do with Watson Visual Recognition?