MIT and IBM announce ThreeDWorld Transport Challenge for physically realistic Embodied AI

While developing household robots that can sense and act in the physical world is an important goal of the computer vision and robotics communities, directly training models with real robots is expensive and often involves safety risks. This has resulted in a trend toward using simulators to train and evaluate AI algorithms. In recent years, the development of 3D virtual environments such as AI2-THOR, Gibson, Habitat and VirtualHome, which can simulate photo-realistic scenes, has served as a major driving force for the progress of vision-based robot navigation and human/AI collaboration.

However, to date, most tasks defined in these virtual environments have focused on visual navigation in high-quality synthetic scenes or real-world RGB-D scans, while paying little or no attention to physical interaction. Recently, platforms such as Sapien and iGibson have coupled photorealistic rendering with high-fidelity physics simulations, but their interactions are still mostly limited to opening doors and pushing objects out of the way.

Challenge Facing Embodied AI

To truly train robots that can serve as home assistants, we must develop Embodied AI systems that can perceive and act in realistic, cluttered physical environments to fulfill a goal. In other words, these agents must be capable of physical interactions that move and change the state of objects within the environment.

To that end, a team at MIT Brain and Cognitive Sciences, in collaboration with the MIT-IBM Watson AI Lab, has developed a new Embodied AI benchmark, the ThreeDWorld (TDW) Transport Challenge, described in “The ThreeDWorld Transport Challenge: A Visually Guided Task-and-Motion Planning Benchmark for Physically Realistic Embodied AI,” currently in preprint on arXiv. The challenge aims to measure an Embodied AI agent’s ability to change the states of multiple objects to accomplish a complex task, performed within a photo- and physically realistic virtual environment.

Such a task falls within the domain of Task and Motion Planning (TAMP), where the goal is to operate a robot in environments containing many objects, with the robot taking actions to move and change the state of those objects to perform specified tasks, such as rearranging furniture or uncovering hidden objects.

Until now, the field of Embodied AI has lacked challenging benchmarks with a clear task and evaluation metric that test embodied agents’ task-and-motion planning abilities in 3D-simulated physical home environments. Such planning requires complex physical scene understanding that combines visual perception, reasoning and hierarchical planning to solve challenging tasks in the physical world. In addition, most implementations of embodied agents lack physically mapped action spaces that allow them to interact with the environment and effectively change both object and scene state.

By providing a benchmark that includes a well-defined task, an agent capable of complex physical interactions and a clear evaluation metric, the TDW Transport Challenge seeks to address these gaps in the field.

The TDW Transport Challenge

This benchmark task has been structured as an open challenge that the team feels will empower researchers to develop more intelligent physics-driven robots for the physical world. The code for the TDW Transport Challenge is available on GitHub.

An overview of the ThreeDWorld Transport Challenge. In this example task, the agent must transport objects scattered across multiple rooms and place them on the bed (marked with a green bounding box) in the bedroom. The agent can first pick up a container, put objects into it, and then transport them to the goal location.


The specifics of the TDW Transport Challenge task:

An embodied agent is spawned randomly inside a simulated physical home environment. The agent must find a small set of target objects scattered around the house, pick them up, and transport them to a specified final location within a given interaction budget (defined as a maximum episode length in steps).

Various containers are positioned around the house, which the agent can find and use to collect and transport several objects together. Without using a container as a tool, the agent can only transport up to two objects at a time. However, while the containers help the agent transport items efficiently, finding them also uses up some of the valuable interaction budget; therefore, the agent must reason about the best plan case by case.
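To make the container trade-off concrete, here is a minimal sketch of a toy cost model in Python. It is not part of the challenge code: the function names, the constant round-trip cost, the container capacity and the search cost are all hypothetical simplifications chosen for illustration. Only the two-object carry limit comes from the task description above.

```python
import math

# Toy cost model (illustrative only, not the challenge's own code).
# Assumption: every round trip between the objects and the goal costs
# the same number of steps, which ignores how objects are scattered.

def steps_without_container(num_objects, round_trip_steps, carry_limit=2):
    """Steps needed when carrying at most `carry_limit` objects per trip."""
    trips = math.ceil(num_objects / carry_limit)
    return trips * round_trip_steps

def steps_with_container(num_objects, round_trip_steps,
                         container_search_steps, container_capacity):
    """Steps needed when the agent first spends steps finding a container.

    `container_search_steps` and `container_capacity` are hypothetical
    parameters; the source does not give values for either.
    """
    trips = math.ceil(num_objects / container_capacity)
    return container_search_steps + trips * round_trip_steps

def should_use_container(num_objects, round_trip_steps,
                         container_search_steps, container_capacity):
    """Return True when the container plan is strictly cheaper."""
    with_container = steps_with_container(
        num_objects, round_trip_steps,
        container_search_steps, container_capacity)
    without_container = steps_without_container(num_objects, round_trip_steps)
    return with_container < without_container

if __name__ == "__main__":
    # Six objects: the search cost pays for itself (280 vs. 300 steps).
    print(should_use_container(6, 100, 80, 4))
    # Two objects: one bare-handed trip is cheaper (100 vs. 180 steps).
    print(should_use_container(2, 100, 80, 4))
```

Under these assumed numbers, the container is worth finding only when there are enough objects to save at least one full trip, which is the kind of case-by-case reasoning the task asks of the agent.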

The challenges of the Challenge

This task poses several challenges for embodied agents beyond the semantic exploration of unknown environments, including synergy between navigation and interaction (the agent cannot directly move to grasp an object if the path to it is obstructed, e.g. by a table), physics-aware interaction (grasping might fail if the agent’s arm cannot reach an object), physics-aware navigation (collision with obstacles might cause objects to be dropped), reasoning about tool usage, and hierarchical planning for such a long-horizon task.

We used TDW, a multi-modal platform for interactive physical simulation developed by MIT in conjunction with the MIT-IBM Watson AI Lab and built on the state-of-the-art game development platform Unity. TDW has been designed to be highly flexible and general, allowing for a wide range of use cases.

The platform has been instrumental in a wide range of studies, including generating large-scale datasets for training networks to generalize to real-world images; evaluating object material and mass via impact sound generation; and training and evaluating physically realistic forward prediction algorithms.

For this challenge the team first created a dataset of 3D multi-room home environments filled with furniture and prop objects, all of which respond to physics. To train AI agents to navigate and physically interact within these virtual environments, the team further developed a fully physics-driven robot-like agent — the Magnebot — with articulated arms and magnet-like end effectors with 9 degrees of freedom. A corresponding high-level navigation and interaction API drives the Magnebot’s motion and arm articulation.

The team conducted preliminary evaluations of various agent models on this task and found that a pure reinforcement learning model struggled to succeed at the task, while a hierarchical planning-based agent achieved better performance, transporting some objects, but was still far from solving the task.
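The post does not describe the internals of the hierarchical planning-based agent. Purely as an illustration of what the high level of such an agent might decide, the sketch below picks a sub-goal from a simplified state. The state fields, sub-goal names and decision rules are all assumptions made for this example and are not taken from the team's agent or from the Magnebot API; a low-level controller (not shown) would be responsible for executing each sub-goal.

```python
from dataclasses import dataclass

# Carry limit without a container, as stated in the task description.
CARRY_LIMIT = 2

@dataclass
class AgentState:
    """Simplified, hypothetical summary of what the agent knows."""
    held: int          # target objects currently held
    remaining: int     # target objects not yet picked up
    at_goal: bool      # True when the agent is at the goal location
    near_object: bool  # True when a target object is within reach

def choose_subgoal(state: AgentState) -> str:
    """High-level policy: map the current state to the next sub-goal."""
    if state.held > 0 and state.at_goal:
        return "drop"
    # Head to the goal when the hands are full or nothing is left to collect.
    if state.held == CARRY_LIMIT or (state.held > 0 and state.remaining == 0):
        return "navigate_to_goal"
    if state.remaining > 0 and state.near_object:
        return "pick_up"
    if state.remaining > 0:
        return "explore"
    return "done"

if __name__ == "__main__":
    start = AgentState(held=0, remaining=3, at_goal=False, near_object=False)
    print(choose_subgoal(start))
```

Separating the choice of sub-goal from its execution is one common way to handle long-horizon tasks: the high level reasons over a handful of symbolic decisions while the low level deals with navigation and arm motion.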

Empowering development of more intelligent, physics-driven robots for the physical world

Based on the results achieved so far, the team believes this challenge will be of great value in assessing AI agents’ abilities to rearrange multiple objects in a physically realistic environment. By removing the barrier to entry to the TAMP field, the challenge lets other researchers study this type of TAMP-style embodied AI task under realistic physical constraints. To add more difficulty and realism to the challenge, the team plans to include deformable or soft-body objects in future versions.

Our team is also actively developing a second challenge that uses the multi-modal aspects of TDW to focus on a common but potentially serious occurrence on a robot production line: finding a small object that has somehow been dropped. TDW’s audio implementation provides full audio spatialization and reverberation that respects the geometry of interior spaces. In addition, the platform’s PyImpact library uses modal synthesis and the physics of the scene to generate plausible, realistic impact sounds in real time, based on the masses and materials of colliding objects.

In this new task, an embodied agent is spawned inside a single-room environment and “hears” an unknown object fall to the ground somewhere in the room. The agent must locate and identify this target object by using both visual and auditory modalities. The object may be behind a sofa, on top of a cabinet, inside a container, or occluded by other objects that the agent needs to physically move to reveal it.

Such a task would be extremely difficult, if not impossible, for an agent to solve by vision alone, without the spatial cues and the object material clues provided by TDW’s audio implementation. By incorporating the auditory modality into this work, the MIT/IBM team hopes to further empower researchers to develop more intelligent, multi-sensory robots for the physical world.

Submit your solution to The TDW Transport Challenge before the June 1, 2021 deadline.
