Today, we’re introducing our latest AI research in the form of a new beta feature: the IBM Watson Visual Recognition food model. The feature recognizes more than 2,000 different foods in images, with greater specificity and accuracy in this domain than Visual Recognition’s general tagging feature. Using the food model, restaurant diners can easily compare their meals to ones from previous visits to the establishment, while restaurants can better understand how often their food is being shared across social media. The food model is the first of many pre-built custom models that will shorten the time it takes developers to build custom solutions for different domains using Watson Visual Recognition. Like free refill French fries – we think the possibilities are bottomless!
Photo of a platter of oysters with results of the returned tags provided by the Watson Visual Recognition food model
Our efforts began with the observation that users of food- and nutrition-logging apps get frustrated by the manual process of tracking their meals.
What if we could train a system to automatically identify the foods at popular restaurant chains and simplify food logging? With frequent lunch-time trips to restaurants near the lab, we took photos of known foods and trained a first version of the food recognition model. This use case was an example of “food in context” – where the system recognized foods from known menus. We could always refer back to the menu if we, or the system, were unsure of what it was seeing. We were never hungry, and often the results of our daily training experiments ended up as leftovers for dinner! But, like a good plate of brownies, we found food visual recognition to be addictive!
The larger challenge we took on was what we called “food in the wild,” where the system doesn’t know the restaurant menu or a user’s food history. We started by searching for images of many different foods online, which produced an initial noisy data set with weakly labeled images. We did a lot of work to match the correct foods to the correct labels to clean up the data set, and today we have more than 1.5 million labeled food images corresponding to 2,000+ different foods, the largest such collection known. We further developed a taxonomy around the foods that allowed us to classify foods hierarchically. To improve the system’s accuracy, we came up with a novel idea to exploit this food hierarchy in combination with deep learning methods for fine-grained recognition. This model forms the basis of the Visual Recognition food model.
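To make the idea of a food taxonomy concrete, here is a minimal sketch of how a hierarchy might be stored and walked. The labels and parent links are hypothetical examples chosen for illustration; they are not taken from the actual taxonomy behind the food model.

```python
# Hypothetical fragment of a food taxonomy: each label maps to its parent.
# These entries are illustrative assumptions, not the real Watson taxonomy.
PARENT = {
    "strawberry dipped in chocolate": "strawberry",
    "strawberry": "fruit dish",
    "fruit salad": "fruit dish",
    "fruit dish": "food",
    "oyster": "seafood dish",
    "seafood dish": "food",
}


def path_to_root(label):
    """Return the label followed by its ancestors, most specific first."""
    path = [label]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


if __name__ == "__main__":
    print(path_to_root("strawberry dipped in chocolate"))
```

With a structure like this, one fine-grained label also yields every broader category above it, which is what makes hierarchical classification possible.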
Using the food model in the Visual Recognition API, Watson focuses specifically on the food shown in the photo. This differs from general visual tagging, which identifies everything it can in a picture of food, such as a plate, knife, blanket, strawberry, table, and people.
Photo of chocolate-covered strawberries and the returned tags provided by the Watson Visual Recognition food model
With the food model, the system homes in only on the food in the photo – in the example here, this would be the strawberries. The accuracy of food identification is only one piece of our model. The system’s recognition goes deeper by performing fine-grained recognition of the foods. In the case of the strawberry dish, it might also tag the photo as “strawberry dipped in chocolate” when that label applies. Using the hierarchy, the service might also label the photo as a “fruit dish,” which gives a higher-level category for the food. Traditionally, deep learning gives you a flat list of classification scores, but by utilizing the hierarchy and fine-grained classification, we trained the deep learning model to make better mistakes even when a food cannot be identified accurately [i].
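One way to picture a “better mistake” is a back-off rule: when no specific food is confident enough, fall back to the most specific category that is. The sketch below illustrates this at prediction time only. It is a simplified stand-in, not the training method described in [i], and the taxonomy, scores, and threshold are all assumed values.

```python
# Hypothetical taxonomy (child -> parent); not the real Watson hierarchy.
PARENT = {
    "strawberry dipped in chocolate": "fruit dish",
    "fruit salad": "fruit dish",
    "chocolate cake": "dessert",
    "fruit dish": "food",
    "dessert": "food",
}


def depth(label):
    """Number of steps from a label up to the root of the taxonomy."""
    steps = 0
    while label in PARENT:
        label = PARENT[label]
        steps += 1
    return steps


def roll_up(leaf_scores):
    """Add each leaf score to every ancestor of that leaf."""
    totals = dict(leaf_scores)
    for leaf, score in leaf_scores.items():
        node = leaf
        while node in PARENT:
            node = PARENT[node]
            totals[node] = totals.get(node, 0.0) + score
    return totals


def most_specific_confident(leaf_scores, threshold=0.6):
    """Pick the deepest label whose rolled-up score meets the threshold."""
    totals = roll_up(leaf_scores)
    confident = [label for label, score in totals.items() if score >= threshold]
    return max(confident, key=lambda n: (depth(n), totals[n]), default=None)


if __name__ == "__main__":
    # Flat scores where no single food is confident on its own.
    uncertain = {
        "strawberry dipped in chocolate": 0.45,
        "fruit salad": 0.35,
        "chocolate cake": 0.20,
    }
    print(most_specific_confident(uncertain))
```

In the uncertain example, neither fruit label clears the threshold alone, but together they make “fruit dish” a confident answer, so the output is a coarser label rather than a wrong specific one.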
Just as important as teaching the system “what is a plate of strawberries,” we had to teach it what is food and what is not food. To make the service as efficient as possible, the food/non-food classifier and the fine-grained food recognition classifier share most of the deep learning network and have separate branches only at the top level of the network. To make a prediction on a test image, the system needs only a single, very fast forward pass through the food model to detect and categorize the foods.
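The shared-network design can be sketched in a few lines: the features are computed once, and two small heads read from them. Everything here is a toy assumption made for illustration (the hand-written `shared_trunk`, the weights, and the three labels); a real model would use a deep convolutional network in place of the trunk.

```python
import math

LABELS = ["strawberry", "oyster", "french fries"]  # hypothetical labels

# Made-up parameters for the two heads (weights and bias).
FOOD_HEAD_WEIGHTS, FOOD_HEAD_BIAS = [2.0, 1.0], -1.0
CLASS_HEAD_WEIGHTS = [[1.0, 0.5], [-0.5, 1.0], [0.2, -0.3]]


def dot(weights, features):
    return sum(w * f for w, f in zip(weights, features))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def shared_trunk(pixels):
    """Toy stand-in for the shared layers: returns a 2-value feature vector."""
    mean = sum(pixels) / len(pixels)
    spread = max(pixels) - min(pixels)
    return [mean, spread]


def predict(pixels):
    """Single forward pass: one trunk evaluation feeds both heads."""
    features = shared_trunk(pixels)
    # Head 1: probability that the image contains food at all.
    p_food = sigmoid(dot(FOOD_HEAD_WEIGHTS, features) + FOOD_HEAD_BIAS)
    # Head 2: distribution over fine-grained food labels.
    probs = softmax([dot(row, features) for row in CLASS_HEAD_WEIGHTS])
    return p_food, dict(zip(LABELS, probs))


if __name__ == "__main__":
    p_food, label_probs = predict([0.2, 0.8, 0.5])
    print(p_food, label_probs)
```

Because both heads reuse the same features, adding the food/non-food check costs little beyond one extra small layer, which is the efficiency argument made above.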
Now that Watson has become an expert in recognizing what you’re eating, we’re excited to see the applications and interpretations developers and data scientists will build on our technology!
[i] Hui Wu, Michele Merler, Rosario Uceda-Sosa and John Smith. “Learning to make better mistakes: semantics-aware visual food recognition”. ACM Multimedia Conference, 2016.