New model augments visual recognition to help AI identify unfamiliar objects


Blog co-authors: Maria-Irina Nicolae, Vincent Lonij and Ambrish Rawat, Research Staff Members, IBM Research – Ireland

Applications of AI are quickly becoming ubiquitous, powered by algorithms that learn from large amounts of data. Humans, on the other hand, learn very differently: they are able to reason based on a small number of assumptions and a set of logical rules. Our IBM Research team designed a method capable of combining these two learning styles, augmenting large data sets with structured human-generated knowledge and logical rules to improve performance of visual recognition.

Different learning styles

Most state-of-the-art AI systems use statistical learning, which relies on detecting patterns in large amounts of annotated data. Having captured meaningful patterns, these systems are then expected to make accurate predictions when faced with new data. For instance, a system trained to differentiate between images of different animals is first trained on a large set of labeled images and is subsequently able to classify new images into one of the known categories of animals.

By comparison, humans are capable of incrementally building their knowledge from only a few examples and of reasoning with the knowledge they have gained. This allows them to understand and interpret entirely new objects. When faced with, say, an animal they have never seen before, a human is able to analyze its characteristics based on previous experience. They can then make decisions based on these characteristics (without needing to know the name of the animal), for example, "this looks like a predator, run away."

Figure 1: (left) Example image input to our model (a grasshopper). (right) Selected output of our model consisting of properties (links in knowledge-graph).

Our research is focused on bridging the gap between these two learning paradigms. In our recent paper, we describe how we augment a statistical visual recognition system with a structured knowledge database. This enables the visual recognition system to predict properties of objects even if those object categories were not available when training the system.

Our approach

Our model consists of two separate components: a knowledge graph and an image recognition system. Knowledge graphs are a rich source of structured information; they can be thought of as a network of interconnected nodes where each node represents a concept and connections between nodes represent relationships. We make this knowledge consumable by a statistical learning system by converting each concept into a point in a semantic space. A semantic space is a space with the property that nearby locations represent similar concepts. For example, in the figure below, car is one of the concepts and images of various types of cars correspond to points in the vicinity of the concept. Next, we train an image recognition system to associate each image with a location in the semantic space, instead of a class label as is the standard approach in image recognition.
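The idea of classifying by location in a semantic space, rather than by a fixed class label, can be sketched in a few lines. The embeddings and the cosine-similarity lookup below are hypothetical stand-ins; in the actual system, concept embeddings are derived from the knowledge graph and the image embedding is produced by a trained image recognition network.

```python
import numpy as np

# Hypothetical 4-d semantic embeddings for a few known concepts
# (real embeddings would be derived from the knowledge graph).
concept_embeddings = {
    "car":   np.array([0.9, 0.1, 0.8, 0.0]),
    "truck": np.array([0.8, 0.2, 0.9, 0.1]),
    "cat":   np.array([0.1, 0.9, 0.0, 0.7]),
}

def nearest_concept(image_embedding: np.ndarray) -> str:
    """Return the known concept whose embedding is closest
    (by cosine similarity) to the image embedding."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(concept_embeddings,
               key=lambda c: cosine(image_embedding, concept_embeddings[c]))

# A trained image encoder (not shown) would map a photo into the same
# space; here we stand in for its output with a vector close to "car".
predicted = nearest_concept(np.array([0.85, 0.15, 0.75, 0.05]))
```

Because the classifier's output is a point in a continuous space rather than an index into a fixed label set, new concepts can be added after training simply by placing their embeddings into the same space.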

Figure 2: Schematic of a semantic space. Similar concepts are located near each other in the space. The semantic meaning of novel concepts can be inferred if their location in the space is known.


The benefit of this approach is that we can predict the properties of objects even if images of their respective categories were unavailable at training time, provided that they are somewhat similar to known object categories. For example, if the system has seen cars during training, but not trucks and then encounters a truck, it can still recognize that it has wheels and is a type of vehicle.
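The car/truck example above can be made concrete with a minimal sketch. All names and values here are illustrative: a toy knowledge graph stores property links for known concepts only, and a never-seen category inherits properties from its nearest neighbor in semantic space.

```python
import numpy as np

# Hypothetical property links from a small knowledge graph.
# Note: "truck" is deliberately absent from both dictionaries.
concept_properties = {
    "car":  {"has_wheels", "is_vehicle"},
    "bird": {"has_wings", "can_fly"},
}
concept_embeddings = {
    "car":  np.array([0.9, 0.1, 0.8]),
    "bird": np.array([0.1, 0.9, 0.2]),
}

def predict_properties(embedding: np.ndarray) -> set:
    """Transfer properties from the nearest known concept in semantic space."""
    def dist(concept):
        return float(np.linalg.norm(embedding - concept_embeddings[concept]))
    nearest = min(concept_embeddings, key=dist)
    return concept_properties[nearest]

# A truck was never seen in training, but its embedding lands near "car",
# so the system still infers that it has wheels and is a type of vehicle.
truck_embedding = np.array([0.8, 0.2, 0.9])
props = predict_properties(truck_embedding)
```

The choice of nearest-neighbor transfer here is a simplification; the key point is that property predictions follow from proximity in the semantic space, not from having seen labeled images of the novel category.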

The final system can accurately predict properties of object categories even when these were absent from the knowledge graph used for training. This shows that our system is capable of truly open-world operation, in which both the image training datasets and the initial knowledge of the world are incomplete.

Our method can improve many visual recognition systems currently deployed in the real world, with potential impact on major industries including self-driving vehicles, retail, and security.
