While developing household robots that can sense and act in the physical world is an important goal of the computer vision and robotics communities, directly training models with real robots is expensive and often involves safety risks. This has resulted in a trend toward using simulators to train and evaluate AI algorithms. In recent years, the development of 3D virtual environments such as AI2-THOR, Gibson, Habitat and VirtualHome, which can simulate photo-realistic scenes, has served as a major driving force for the progress of vision-based robot navigation and human/AI collaboration.
However, to date, most tasks defined in these virtual environments have focused on visual navigation in high-quality synthetic scenes or real-world RGB-D scans, while paying little or no attention to physical interaction. Recently, platforms such as Sapien and iGibson have coupled photorealistic rendering with high-fidelity physics simulations, but their interactions are still mostly limited to opening doors and pushing objects out of the way.
Challenge Facing Embodied AI
To truly train robots that can serve as home assistants, we must develop Embodied AI systems that can perceive and act in realistic, cluttered physical environments to fulfill a goal. In other words, these agents must be capable of physical interactions that move and change the state of objects within the environment.
Such a task falls within the domain of Task and Motion Planning (TAMP), where the goal is to operate a robot in environments containing many objects, with the robot taking actions to move and change the state of those objects to perform specified tasks, such as rearranging furniture or uncovering hidden objects.
Up to now, the field of Embodied AI has lacked challenging benchmarks with a clear task and evaluation metric that can test embodied agents’ task-and-motion planning abilities in 3D-simulated physical home environments. Such a benchmark involves complex physical scene understanding that combines visual perception, reasoning and hierarchical planning to solve challenging tasks in the physical world. In addition, most implementations of embodied agents lack physically mapped action spaces that allow them to interact with the environment and effectively change both object and scene state.
By providing a benchmark that includes a well-defined task, an agent capable of complex physical interactions, and a clear evaluation metric, the TDW Transport Challenge seeks to address these gaps in the field.
The TDW Transport Challenge
This benchmark task has been structured as an open challenge that the team feels will empower researchers to develop more intelligent physics-driven robots for the physical world. The code for the TDW Transport Challenge is available on GitHub.
An overview of the ThreeDWorld Transport Challenge. In this example task, the agent must transport objects scattered across multiple rooms and place them on the bed (marked with a green bounding box) in the bedroom. The agent can first pick up a container, put objects into it, and then transport them to the goal location.
The specifics of the TDW Transport Challenge task:
An embodied agent is spawned randomly inside a simulated physical home environment. The agent must find a small set of target objects scattered around the house, pick them up, and transport them to a specified final location within a given interaction budget (defined as a maximum episode length in steps).
Various containers are positioned around the house, which the agent can find and use to collect and transport several objects together. Without using a container as a tool, the agent can only transport up to two objects at a time. However, while the containers help the agent transport items efficiently, finding them also uses up some of the valuable interaction budget; therefore, the agent must reason about an optimal plan on a case-by-case basis.
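The success criterion implied by this setup can be sketched as a simple transport rate: the fraction of target objects that sit inside the goal region when the interaction budget is exhausted. This is a hedged illustration; the function name and the circular goal region below are assumptions for clarity, not the benchmark’s exact scoring code.

```python
def transport_rate(target_positions, goal_center, goal_radius):
    """Fraction of target objects whose final position lies within the goal region.

    Positions are (x, z) floor-plane coordinates; the goal region is modeled
    here as a disc around goal_center (an illustrative simplification).
    """
    delivered = sum(
        1
        for pos in target_positions
        if sum((p - g) ** 2 for p, g in zip(pos, goal_center)) ** 0.5 <= goal_radius
    )
    return delivered / len(target_positions)

# One of two target objects has reached the goal region, so the rate is 0.5.
rate = transport_rate([(0.2, 0.1), (4.0, 3.0)], (0.0, 0.0), 1.0)
```

A metric like this makes the container trade-off concrete: spending steps to fetch a container only pays off if it lets the agent deliver more objects before the budget runs out.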
The challenges of the Challenge
This task poses several challenges for embodied agents beyond the semantic exploration of unknown environments: synergy between navigation and interaction (the agent cannot directly move to grasp an object if the path to it is obstructed, e.g. by a table), physics-aware interaction (grasping might fail if the agent’s arm cannot reach an object), physics-aware navigation (collisions with obstacles might cause objects to be dropped), reasoning about tool use, and hierarchical planning for such a long-horizon task.
We built the challenge on ThreeDWorld (TDW), an interactive, multi-modal physical simulation platform developed by MIT in conjunction with the MIT-IBM Watson AI Lab and built on the state-of-the-art game development platform Unity. TDW has been designed to be highly flexible and general, allowing for a wide range of use cases.
The platform has been instrumental in a wide range of studies, including generating large-scale datasets for training networks to generalize to real-world images; evaluating object material and mass via impact sound generation; and training and evaluating physically realistic forward prediction algorithms.
For this challenge the team first created a dataset of 3D multi-room home environments filled with furniture and prop objects, all of which respond to physics. To train AI agents to navigate and physically interact within these virtual environments, the team further developed a fully physics-driven robot-like agent — the Magnebot — with articulated arms and magnet-like end effectors with 9 degrees of freedom. A corresponding high-level navigation and interaction API drives the Magnebot’s motion and arm articulation.
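The high-level API drives the Magnebot through discrete navigation and arm actions. As a rough illustration of the kind of control loop this enables, here is a self-contained sketch with a stub environment standing in for the real simulator; every class and method name here (`StubEnv`, `move_to`, `grasp`, `drop_all`) is an assumption for illustration, not the actual Magnebot API, whose signatures live in its GitHub repository.

```python
from enum import Enum


class ActionStatus(Enum):
    """Outcome of a high-level action, mirroring the success/failure
    statuses a physics-driven agent must handle."""
    success = 0
    failed = 1


class StubEnv:
    """Stand-in for the simulator: actions trivially succeed here, whereas
    in the real environment they can fail for physical reasons (blocked
    paths, unreachable objects)."""

    def __init__(self):
        self.holding = []

    def move_to(self, target):
        return ActionStatus.success

    def grasp(self, obj):
        self.holding.append(obj)
        return ActionStatus.success

    def drop_all(self, at):
        delivered, self.holding = self.holding, []
        return delivered


def transport(env, targets, goal):
    """Greedy plan: visit each target, grasp it, then carry everything to the goal."""
    for obj in targets:
        if env.move_to(obj) is ActionStatus.success:
            env.grasp(obj)
    env.move_to(goal)
    return env.drop_all(goal)
```

Even this toy loop shows why the task is hard: in the physical simulation each `move_to` or `grasp` can fail, so a real agent needs recovery behaviors and replanning rather than a fixed action sequence.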
The team conducted preliminary evaluations of various agent models on this task and found that a pure reinforcement learning model struggled to succeed at the task, while a hierarchical planning-based agent achieved better performance, transporting some objects, but was still far from solving the task.
Empowering development of more intelligent, physics-driven robots for the physical world
Based on the results achieved so far, the team believes this challenge will be of great value in assessing AI agents’ abilities to rearrange multiple objects in a physically realistic environment. By lowering the barrier to entry to the TAMP field, it lets other researchers study this type of challenging TAMP-style embodied AI task in the face of realistic physical constraints. To add more difficulty and realism to the challenge, the team plans to include deformable or soft-body objects in future versions.
Our team is also actively developing a second challenge that utilizes the multi-modal aspects of TDW to focus on a common but potentially serious occurrence on a robot production line – finding a small object that has somehow been dropped. TDW’s audio implementation provides full audio spatialization and reverberation that respects the geometry of interior spaces. In addition, the platform’s PyImpact library uses modal synthesis and the physics of the scene to generate plausible, realistic impact sounds in real-time, based on the masses and materials of colliding objects.
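Modal synthesis, the technique PyImpact builds on, approximates an impact sound as a sum of exponentially decaying sinusoids, one per resonant mode of the struck object. The sketch below is a minimal, generic illustration of that idea; the mode frequencies, amplitudes, and decay rates are invented, whereas PyImpact derives them from the masses and materials of the colliding objects.

```python
import math


def modal_impact(modes, sample_rate=44100, duration=0.25):
    """Synthesize an impact sound as a sum of exponentially damped sinusoids.

    `modes` is a list of (frequency_hz, amplitude, decay_per_second) tuples,
    one per resonant mode of the struck object.
    """
    n = int(sample_rate * duration)
    samples = []
    for i in range(n):
        t = i / sample_rate
        s = sum(
            a * math.exp(-d * t) * math.sin(2 * math.pi * f * t)
            for f, a, d in modes
        )
        samples.append(s)
    return samples


# Three made-up resonant modes, loosely evoking a small, hard object.
sound = modal_impact([(440.0, 1.0, 30.0), (1230.0, 0.5, 60.0), (2600.0, 0.25, 90.0)])
```

Because the mode parameters depend on material and mass, the same collision geometry produces audibly different sounds for, say, a metal pot versus a plastic bowl, which is exactly the cue an agent can exploit to identify a dropped object.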
In this new task, an embodied agent is spawned inside a single-room environment and “hears” an unknown object fall to the ground somewhere in the room. The agent must locate and identify this target object using both visual and auditory modalities. Objects may be behind a sofa, on top of a cabinet, inside a containing object, or occluded by other objects that the agent needs to physically move aside to reveal the target.
Such a task would be extremely difficult, if not impossible, for an agent to solve by vision alone, without the spatial cues and the object material clues provided by TDW’s audio implementation. By incorporating the auditory modality into this work, the MIT/IBM team hopes to further empower researchers to develop more intelligent, multi-sensory robots for the physical world.
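One of the simplest spatial cues audio provides is the time difference of arrival between two ears or microphones: a sound arrives slightly later at the receiver farther from the source. The brute-force cross-correlation sketch below estimates that delay; it is a generic signal-processing illustration, not TDW’s actual audio pipeline.

```python
def tdoa_lag(left, right):
    """Return the delay (in samples) of `right` relative to `left`,
    estimated by brute-force cross-correlation.

    A positive result means the sound reached the `right` channel later,
    i.e. the source is closer to the `left` receiver.
    """
    n = len(left)
    best_lag, best_score = 0, float("-inf")
    for lag in range(-n + 1, n):
        score = sum(
            left[i] * right[i + lag] for i in range(n) if 0 <= i + lag < n
        )
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

Given the microphone spacing and the speed of sound, a delay like this converts into a bearing toward the fallen object, giving the agent a direction to explore before vision can confirm the target.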