Posted in: Cognitive Computing, Thomas J Watson Research Center

Training Watson to see what’s on your plate

Today, we’re introducing our latest AI research in the form of a new beta feature: the IBM Watson Visual Recognition food model. This feature provides a built-in capability for recognizing more than 2,000 different foods within images, offering greater specificity and accuracy in this content domain than Visual Recognition’s general tagging feature. Using the food model, restaurant diners can easily compare their meals to ones from previous visits to the establishment, while restaurants can better understand how often their food is being shared across social media. The food model is the first of many pre-built custom models that will accelerate time-to-value for developers creating custom solutions for different domains with Watson Visual Recognition. Like free-refill French fries – we think the possibilities are bottomless!


Photo of a platter of oysters with results of the returned tags provided by the Watson Visual Recognition food model

Our efforts began with the observation that users of food- and nutrition-logging apps are frustrated by the manual process of tracking their meals.

What if we could train a system to automatically identify the foods at popular restaurant chains and simplify food logging? With frequent lunch-time trips to restaurants near the lab, we took photos of known foods and trained a first version of the food recognition model. This use case was an example of “food in context” – where the system recognized foods from known menus. We could always refer back to the menu if we, or the system, were unsure of what it was seeing. We were never hungry, and often the results of our daily training experiments ended up as leftovers for dinner! But, like a good plate of brownies, we found food visual recognition to be addictive!

The larger challenge we took on was what we called “food in the wild,” where the system doesn’t know the restaurant menu or a user’s food history. We started by searching for images of many different foods online, which produced an initial noisy data set with weakly labeled images. We did a lot of work to match the correct foods to the correct labels to clean up the data set, and today we have the largest known collection of more than 1.5 million labeled food images corresponding to 2,000+ different foods. We further developed a taxonomy around the foods that allowed us to classify foods hierarchically. To improve the system’s accuracy, we came up with a novel idea to exploit this food hierarchy in combination with deep learning methods for fine-grained recognition. This model forms the basis of the Visual Recognition food model.

Using the food model in the Visual Recognition API, Watson focuses specifically on the food shown in the photo. Thus, it is different from general visual tagging, which identifies other information in a photo, such as a plate, knife, blanket, strawberry, table, and people in a picture of food.


Photo of chocolate-covered strawberries and the returned tags provided by the Watson Visual Recognition food model

With the food model, the system homes in only on the food in the photo – in the example here, the strawberries. The accuracy of food identification is only one piece of our model. The system’s recognition goes deeper by performing fine-grained recognition of the foods. In the case of the strawberry dish, it might also tag the photo as “strawberry dipped in chocolate” when that label applies. Using the hierarchy, the service might also label the photo as a “fruit dish,” which gives a higher-level category for the food. Traditionally, deep learning gives you a flat list of classification scores, but by utilizing the hierarchy and fine-grained classification, we trained the deep learning model to make better mistakes even when a food cannot be identified exactly [i].
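The hierarchy-aware “better mistakes” idea can be sketched as follows; the taxonomy, food labels, and scores below are made-up illustrations, not the model’s actual taxonomy or output.

```python
# Sketch: rolling fine-grained classifier scores up a food taxonomy.
# The taxonomy and scores are hypothetical, not Watson's actual ones.

# Hypothetical parent map: fine-grained label -> higher-level category.
TAXONOMY = {
    "strawberry dipped in chocolate": "fruit dish",
    "fruit salad": "fruit dish",
    "brownie": "dessert",
    "cheesecake": "dessert",
}

def rollup(scores):
    """Sum fine-grained scores into their parent categories."""
    parents = {}
    for label, score in scores.items():
        parent = TAXONOMY.get(label, "food")
        parents[parent] = parents.get(parent, 0.0) + score
    return parents

# Even if the top fine-grained guess is wrong, the parent category can
# still be right -- a "better mistake" than a flat classifier would make.
fine_scores = {
    "strawberry dipped in chocolate": 0.45,
    "fruit salad": 0.30,
    "brownie": 0.15,
    "cheesecake": 0.10,
}
coarse = rollup(fine_scores)
best_parent = max(coarse, key=coarse.get)  # "fruit dish"
```

Here, even if the model confuses the two fruit labels with each other, the rolled-up “fruit dish” prediction remains correct.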

As important as teaching the system “what is a plate of strawberries,” we had to teach it what is food and what is not. To make the service as efficient as possible, the food/non-food classifier and the fine-grained food recognition classifier share most of the deep learning network while having separate branches at the top level. To make a prediction on a test image, the system needs only a single, very fast forward pass through the food model to detect and categorize the foods.
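The shared-trunk design can be sketched as below; the backbone and both heads are toy stand-ins for the real deep network, with made-up thresholds and labels.

```python
# Toy sketch of the shared-trunk, two-head design described above:
# one "backbone" pass is reused by the food-vs-non-food branch and
# the fine-grained branch. All numbers and labels are hypothetical.

def backbone(image):
    # Stand-in for the shared deep network: one forward pass
    # producing a small feature vector.
    return [sum(image) / len(image), max(image), min(image)]

def is_food_head(features):
    # Hypothetical binary branch: is this food at all?
    return features[0] > 0.5

def fine_grained_head(features):
    # Hypothetical classification branch over food labels.
    if features[1] > 0.9:
        return "strawberry dipped in chocolate"
    return "fruit salad"

def classify(image):
    feats = backbone(image)        # single, fast forward pass...
    if not is_food_head(feats):    # ...shared by both branches
        return "non-food"
    return fine_grained_head(feats)
```

Because both heads read the same features, adding the food/non-food check costs almost nothing beyond the shared forward pass.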

Now that Watson has become an expert in recognizing what you’re eating, we’re excited to see the applications and interpretations developers and data scientists will build on our technology!

[i] Hui Wu, Michele Merler, Rosario Uceda-Sosa and John Smith. “Learning to make better mistakes: semantics-aware visual food recognition”. ACM Multimedia Conference, 2016.

  • Hui Wu says:

    Agreed. Measuring accurate calorie intake just by using a visual classifier is challenging. For example, the calorie count can vary a lot even if the food/drink looks identical. Good app design should consider these cases and can ask for user input and leverage metadata (like dieting history/habits, restaurant menu, etc.) to help disambiguate difficult cases.

  • Fernanda Frigo says:

    How can we access this corpus?

  • Mike Lord says:

    I really like the idea of training Watson to understand our food better. If we could only get Watson to help prevent us from eating the wrong foods, this would be a gold mine. I have health concerns and want to only put healthy foods in my body. I would love it if an app could help me make better and better choices – and eventually the best choice – when selecting what to eat: what’s highly rated, and healthy too, on the menu at this particular restaurant.

  • Andy Moore says:

    Has the Lose It! app been using a beta of this service?

    And did this come from IBM’s internal innovation project incubator? I’m sure I saw something like this the last time our “Shark Tank / Dragons’ Den” website was running.

  • Lesley Bolden says:

    Absolutely amazing…possibilities are endless.

  • Hello,

    See a very tasty application in this LinkedIn Pulse article (in French):

    It works very well,

  • Hui Wu says:

    Cute application! This work also belongs to Watson Visual Recognition service:

    The Food API is part of the pre-built model, whereas the above work uses a trainable model that can take user-submitted data (like images of fancy chocolate) and allows the user to build their own classifiers.

  • Naga Katreddi says:

    Interesting! It would be great if there were a way to extend this app to handle calorie counts based on the amount of each food on the plate. Calculating the actual weight from a picture could be tricky, but having that feature would be really useful and could bring this app close to consumers.

  • Hui Wu says:

    Actually some people are starting to look at that:

    I haven’t tested their food recognition accuracy, but the number of food classes they can recognize seems to be roughly half of what the Watson API can recognize.

  • Kunal Dutt Sharma says:

    Amazing breakthrough… endless possibilities to assist visually impaired people worldwide. It would be really powerful if this could be paired with regular eyewear to aid the visually impaired.

  • Bjorn Gevert says:

    Hi, they have an app like this in the latest episode of HBO’s Silicon Valley. Since SV is a satire, the app only has two replies: “Hot Dog” or “Not Hot Dog”. No other comparisons intended – I am sure our app will be amazing!

  • Hui Wu says:

    Yeah, that app is hilarious. I also tested the food API on cute puppies and blueberries – thinking about adding one for hot dogs 🙂

  • Glenn Grant says:

    I am interested in whether you could use the data to help people with allergies – e.g., if Watson could recognise the food on a plate and then give a likelihood of certain common allergens being present?

    I am thinking in particular of food chain restaurants, where they have a very prescribed set of ingredients for their food. Clearly it would be a lot harder for freshly cooked food.

  • Hui Wu says:

    Yeah, that’s a great application. Imagine using the food visual API in combination with food-substance sensors like SCiO. Visual results and chemical sensor results can complement each other to give a more accurate food allergy prediction.

  • Amine says:

    Just trying to understand: how will visual recognition differentiate between coffee and Coca-Cola if they are in the same cup?

  • Hui Wu says:

    As long as the drinks/foods are visually discernible to humans, we can train a machine to recognize the difference. But you will need to collect good training images to capture that subtle difference.

  • Dave says:

    Will there be a calorie count added into the app? That would be very useful.

  • Hui Wu says:

    We are providing the food API not as an end-to-end solution, but as a building block for many possible food/nutrition-related apps! In your example, a developer might design a calorie-logging app which calls the food API to automatically recognize the food type (saving the user the trouble of manual input) and then retrieves the nutrition info using the recognized food names.
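The flow described in this reply can be sketched as follows; the `recognize()` stub, food names, and calorie values are hypothetical placeholders, not real API calls or nutrition data.

```python
# Sketch of a calorie-logging flow: recognize the food in a photo,
# then look its calories up in a nutrition table. The recognize()
# stub and all values here are hypothetical.

NUTRITION_DB = {  # made-up kcal-per-serving values
    "strawberry dipped in chocolate": 60,
    "french fries": 365,
    "brownie": 230,
}

def recognize(image_path):
    # Stand-in for a call to the Visual Recognition food model,
    # which would return recognized food labels for the image.
    return ["strawberry dipped in chocolate"]

def log_calories(image_path):
    """Total the calories of every recognized food in the photo."""
    foods = recognize(image_path)
    return sum(NUTRITION_DB.get(food, 0) for food in foods)

total = log_calories("dessert.jpg")  # 60 in this stubbed example
```

An app would replace the stub with the actual API call and a real nutrition database; the structure stays the same.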

  • Dave says:

    Lots of possibilities with this. Any chance there will be a caloric-intake identifier for the various foods – for example, the strawberry is X calories, the chocolate covering is Y calories, and display the total? (Assuming it can tell the difference between real chocolate and sugar-based coatings.)

  • Hui Wu says:

    If you have a really fine-grained nutrition database, then linking it with the result from the food API can potentially give you a good calorie estimate. But of course, there are always challenging corner cases, like Diet Coke vs. Coca-Cola, or coffee with skim milk vs. whole milk. Then again, these cases are hard even for a human nutritionist working just from images.

  • Zi Kang Cao says:

    This is awesome!
    Perhaps consider building a new model for plants? Then, when a Watson-powered app user wanders in a botanic garden, it can easily tell them the name of that strange flower 🙂

  • Hui Wu says:

    It’s totally possible! In fact, if you already have training data for different flowers, you can try using the custom training API to train your own flower model 🙂

  • bryan says:

    The timing couldn’t be better, with the latest season of Silicon Valley on HBO prompting such an app idea.

  • Hui Wu says:

    Agreed! Although I think their app only recognizes hot dogs, which is a much easier problem than recognizing more than 2,000 kinds of foods.

  • -->
    Hui Wu, Research Staff Member Multimedia and Vision Group

    Hui Wu

    Research Staff Member, Multimedia and Vision Group, IBM Research