Applications of AI are quickly becoming ubiquitous, powered by algorithms that learn from large amounts of data. Humans, on the other hand, learn very differently: they can reason from a small number of assumptions and a set of logical rules. Our IBM Research team designed a method that combines these two learning styles, augmenting large data sets with structured human-generated knowledge and logical rules to improve the performance of visual recognition.
Different learning styles
Most state-of-the-art AI systems use statistical learning, which relies on detecting patterns in large amounts of annotated data. Having captured meaningful patterns, these systems are then expected to make accurate predictions when faced with new data. For instance, a system that differentiates between images of different animals is first trained on a large set of labeled images and is subsequently able to classify new images into one of the known categories of animals.
By comparison, humans are capable of incrementally building their knowledge from only a few examples and of learning to reason with the knowledge they have gained. This allows them to understand and interpret entirely new objects. When faced with, say, an animal they have never seen before, people are able to analyze its characteristics based on previous experience. They can then make decisions based on these characteristics without needing to know the name of the animal, for example, “this looks like a predator, run away.”
Our research is focused on bridging the gap between these two learning paradigms. In our recent paper, we describe how we augment a statistical visual recognition system with a structured knowledge database. This enables the visual recognition system to predict properties of objects even if those object categories were not available when training the system.
Our model consists of two separate components: a knowledge graph and an image recognition system. Knowledge graphs are a rich source of structured information; they can be thought of as a network of interconnected nodes where each node represents a concept and connections between nodes represent relationships. We make this knowledge consumable by a statistical learning system by converting each concept into a point in a semantic space. A semantic space is a space with the property that nearby locations represent similar concepts. For example, in the figure below, car is one of the concepts and images of various types of cars correspond to points in the vicinity of the concept. Next, we train an image recognition system to associate each image with a location in the semantic space, instead of a class label as is the standard approach in image recognition.
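To make the idea of a semantic space concrete, here is a minimal Python sketch. It is not the model from our paper: the three-dimensional concept vectors are made-up values, `image_point` stands in for the output of a trained image recognition system, and `nearest_concepts` is a hypothetical helper name. The sketch only shows how an image mapped to a location in the space can be related to nearby concepts.

```python
import math

# Hypothetical concept embeddings. The values are invented for illustration;
# in practice they would be derived from the knowledge graph.
CONCEPT_VECTORS = {
    "car":   (0.9, 0.1, 0.0),
    "truck": (0.8, 0.3, 0.0),
    "dog":   (0.0, 0.2, 0.9),
    "cat":   (0.1, 0.1, 0.9),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_concepts(point, concept_vectors, k=2):
    """Return the names of the k concepts most similar to the given point."""
    ranked = sorted(
        concept_vectors,
        key=lambda name: cosine_similarity(point, concept_vectors[name]),
        reverse=True,
    )
    return ranked[:k]


# Stand-in for the location an image model would predict for a photo of a car.
image_point = (0.85, 0.15, 0.05)
print(nearest_concepts(image_point, CONCEPT_VECTORS))
```

Cosine similarity is an assumption made for this sketch; any measure under which nearby locations represent similar concepts would serve the same purpose.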
The benefit of this approach is that we can predict the properties of objects even if images of their categories were unavailable at training time, provided that those categories are somewhat similar to known ones. For example, if the system has seen cars during training but not trucks, and then encounters a truck, it can still recognize that the truck has wheels and is a type of vehicle.
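The truck example can be illustrated with a small Python sketch. This is a simplification, not the method described in our paper: the two-dimensional vectors, the property sets, and the `predict_properties` helper are all hypothetical, and the nearest-neighbor rule is just one simple way to transfer properties from known concepts to an unseen one.

```python
import math

# Hypothetical known concepts with invented embeddings and properties.
# Note that "truck" is deliberately missing.
KNOWN_CONCEPTS = {
    "car":     {"vector": (0.9, 0.1), "properties": {"has wheels", "is a vehicle"}},
    "bicycle": {"vector": (0.7, 0.3), "properties": {"has wheels", "is a vehicle"}},
    "dog":     {"vector": (0.1, 0.9), "properties": {"has fur", "is an animal"}},
}


def predict_properties(point, known, k=2):
    """Return the k nearest known concepts and the properties they all share."""
    nearest = sorted(
        known,
        key=lambda name: math.dist(point, known[name]["vector"]),
    )[:k]
    shared = set.intersection(*(known[name]["properties"] for name in nearest))
    return nearest, shared


# Stand-in for the location an image model would predict for a photo of a truck.
truck_point = (0.85, 0.2)
neighbors, properties = predict_properties(truck_point, KNOWN_CONCEPTS)
print(neighbors)           # the known concepts closest to the truck image
print(sorted(properties))  # properties shared by those neighbors
```

Because the truck image lands near the car and bicycle concepts, the sketch attributes their shared properties to it without ever naming the object as a truck.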
The final system can accurately predict properties of object categories even when these were absent from the knowledge graph used for training. This shows that our system is capable of truly open-world operation, in which both the image training datasets and the initial knowledge of the world are incomplete.
Our method can improve many visual recognition systems currently in use in the real world. This could impact major industries including self-driving vehicles, retail, and security systems.