Researchers at MIT are helping robots get one step closer to seeing like humans. The team created a model that lets a robot perceive its physical environment in real time, much as a person does, so it can carry out high-level commands. High-level tasks have been difficult for robots because they must turn the raw pixel values from their cameras into an understanding of the environment.
The model is called 3D Dynamic Scene Graphs. It lets a robot quickly generate a 3D map of its surroundings that includes objects and their semantic labels. The robot can then query the map to find the locations of objects and rooms, or to track the movement of people in its path.
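To make that concrete, here is a minimal sketch of the kind of query a robot could run over a semantically labeled 3D map. The labels and coordinates are invented for illustration and do not reflect the authors' actual data structures.

```python
# Illustrative only (hypothetical map contents, not the authors' code):
# querying a semantically labeled 3D map for "where are the chairs?"
# or "is there a person nearby?"
semantic_map = {
    "chair_1":  {"label": "chair",  "position": (2.0, 1.5, 0.0)},
    "desk_3":   {"label": "desk",   "position": (3.2, 0.8, 0.0)},
    "person_1": {"label": "person", "position": (1.1, 4.0, 0.0)},
}

def locate(label):
    """Return the positions of all map entries with the given semantic label."""
    return [e["position"] for e in semantic_map.values() if e["label"] == label]

print(locate("chair"))    # -> [(2.0, 1.5, 0.0)]
print(locate("person"))   # -> [(1.1, 4.0, 0.0)]
```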
Currently, there are two main methods for robot navigation. The first is 3D mapping, which lets a robot reconstruct its environment in 3D as it explores in real time. The second is semantic segmentation, which classifies features in the environment as semantic objects and is mostly done on 2D images.
3D Dynamic Scene Graphs combines the two: it models spatial perception by generating a 3D map of the environment in real time and labeling the objects, people, and structures within that map.
A key component of the model is Kimera, an open-source library the team developed to construct a 3D geometric model of an environment while simultaneously encoding the size and shape of objects; it is a mix of 3D mapping and semantic understanding. Kimera takes in streams of images from a robot’s camera and inertial measurements from its onboard sensors to estimate the robot’s trajectory and reconstruct the scene as a 3D mesh in real time.
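As a rough illustration of that pipeline, the sketch below shows the kind of loop a Kimera-style visual-inertial mapper might run. The `camera`, `imu`, `vio`, and `mesher` objects are assumed placeholder interfaces introduced here for illustration, not Kimera's actual API.

```python
# Minimal sketch (assumed interfaces, not Kimera's real API): fuse camera
# frames and IMU readings into a trajectory estimate and an incremental 3D mesh.
def mapping_loop(camera, imu, vio, mesher):
    trajectory = []             # estimated robot poses over time
    mesh = mesher.empty_mesh()  # incrementally grown 3D surface model
    while camera.has_frames():
        frame = camera.next_frame()                  # next image from the robot's camera
        imu_batch = imu.readings_since(frame.stamp)  # inertial measurements since last frame
        pose = vio.update(frame, imu_batch)          # visual-inertial odometry step
        trajectory.append(pose)
        mesh = mesher.integrate(frame, pose, mesh)   # fuse the new view into the mesh
    return trajectory, mesh
```

In the actual system this kind of loop runs onboard, so the trajectory and mesh are updated continuously as new frames arrive.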
Kimera generates the semantic 3D mesh using existing neural networks that were trained on millions of real-world images to predict the label of each pixel. It projects these labels into 3D using ray casting, a technique commonly used in computer graphics. The result is a map of the robot’s environment that resembles a dense 3D mesh.
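The geometric idea of lifting a 2D pixel label into 3D can be sketched with a pinhole-camera back-projection. The intrinsics, depth value, and label below are made up for illustration, and a per-pixel depth is assumed here, whereas the actual system casts rays against the reconstructed mesh.

```python
# Minimal sketch (pinhole camera model, made-up numbers; not the authors'
# implementation): assign a 2D semantic label to the 3D point seen at a pixel.
import numpy as np

def pixel_to_3d(u, v, depth, K, cam_pose):
    """Back-project pixel (u, v) with known depth into world coordinates."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Direction through the pixel in the camera frame, scaled by the depth.
    point_cam = np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth, 1.0])
    return (cam_pose @ point_cam)[:3]      # transform into the world frame

# Toy example: label the 3D point seen at pixel (320, 240) as "chair".
K = np.array([[525.0, 0.0, 320.0], [0.0, 525.0, 240.0], [0.0, 0.0, 1.0]])
cam_pose = np.eye(4)                       # camera at the world origin
labeled_point = (pixel_to_3d(320, 240, 2.0, K, cam_pose), "chair")
print(labeled_point)
```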
But relying on the mesh alone is computationally expensive and time-consuming. To overcome this, the team built on Kimera and created algorithms that construct a 3D dynamic scene graph from the dense 3D semantic mesh.
With 3D dynamic scene graphs, the associated algorithms break the 3D semantic mesh down into distinct semantic layers, and the robot "sees" the environment through whichever layer it needs to recognize objects. The layers ascend in a hierarchy from objects to people to open spaces and structures to whole buildings. This spares the robot from having to make sense of the billions of points and faces in the original mesh.
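A minimal sketch of such a layered hierarchy, with hypothetical node names, might look like the following; the real scene graph also stores geometry and timestamps, which are omitted here.

```python
# Illustrative structure only: each layer abstracts the one below it, so a
# planner can reason over a handful of nodes instead of billions of mesh faces.
layers = {
    "mesh":      ["face_0", "face_1", "..."],   # dense geometry (lowest layer)
    "objects":   ["chair_1", "desk_3"],
    "people":    ["person_1"],
    "places":    ["corridor_a", "office_2"],    # open spaces and structures
    "buildings": ["building_1"],                # whole building (highest layer)
}

# Each node points to its parent in the layer above, e.g. chair_1 is in office_2.
parent = {"chair_1": "office_2", "desk_3": "office_2",
          "person_1": "corridor_a",
          "office_2": "building_1", "corridor_a": "building_1"}

def room_of(node):
    """Walk up the hierarchy until a place-level node is reached."""
    while node not in layers["places"]:
        node = parent[node]
    return node

print(room_of("chair_1"))   # -> "office_2"
print(room_of("person_1"))  # -> "corridor_a"
```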
The team also developed algorithms that track the movement and shape of humans in the environment in real time.
They tested the model in a photo-realistic simulator in which a robot navigates dynamic office environments with people moving around.
The team presented their research at the Robotics: Science and Systems Conference. A paper on this research can be accessed here.