DeepMind Unveils RT-2: A Model That Translates Vision and Language into Action

DeepMind has presented a new model called Robotic Transformer 2 (RT-2), which translates visual and language data into specific robot actions. RT-2 is a vision-language-action (VLA) model trained on both Internet-scale data and robotics data, and it converts this knowledge into generalized instructions for controlling robots.

RT-2 builds on the previous model, Robotic Transformer 1 (RT-1), which was trained on multi-task demonstrations and can learn combinations of tasks and objects found in robotics data.

The new model receives images from a robot's camera and directly predicts the actions the robot should perform.
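
As a rough illustration of that image-to-action interface, here is a minimal sketch in Python. The model handle (`vla_model.generate`), the token layout, the bin count, and the delta range are all assumptions made for this example, not DeepMind's actual API; the grounded idea is simply that the model emits discrete tokens that are then decoded into a robot command.

```python
# Illustrative sketch only: the model handle, token format, and bin ranges
# are hypothetical, not DeepMind's actual interface.
from dataclasses import dataclass
from typing import List

@dataclass
class RobotAction:
    """A simple end-effector command decoded from model output tokens."""
    dx: float          # translation deltas in metres
    dy: float
    dz: float
    gripper_closed: bool

def decode_action_tokens(tokens: List[int], num_bins: int = 256,
                         max_delta: float = 0.05) -> RobotAction:
    """Map discrete action tokens back to continuous motion commands.

    Each translation token indexes one of `num_bins` uniform bins over
    [-max_delta, +max_delta]; the last token is a binary gripper flag.
    """
    def bin_to_delta(token: int) -> float:
        return -max_delta + (token / (num_bins - 1)) * 2 * max_delta

    dx, dy, dz = (bin_to_delta(t) for t in tokens[:3])
    return RobotAction(dx=dx, dy=dy, dz=dz, gripper_closed=bool(tokens[3]))

# Hypothetical usage: a camera frame and a language instruction go in,
# action tokens come out, and the decoder turns them into a command.
# tokens = vla_model.generate(image=camera_frame,
#                             instruction="pick up the bag about to fall off the table")
tokens = [200, 128, 40, 1]                 # example output, for illustration only
print(decode_action_tokens(tokens))
```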

RT-2 demonstrates improved generalization and a deeper semantic and visual understanding than previous models. It can interpret commands it has never seen before and perform rudimentary reasoning about object categories and high-level descriptions in response to user instructions.

This reasoning carries over into the actions the model predicts for the robot. For example, when instructed to find an object that could be used to hammer in a nail, it reasoned that the robot should pick up a rock.

RT-2 can also handle more complex commands that require reasoning about the intermediate steps needed to complete a task. Because it is built on a vision-language model (VLM) backbone, it can plan its actions from both images and text commands.
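
To make the idea of reasoning through intermediate steps concrete, the sketch below shows one way such a prompt could be laid out. The prompt format, function name, and example completion are assumptions for illustration; the pattern of asking for an explicit plan before the action is only loosely modeled on DeepMind's description.

```python
# Illustrative sketch: the prompt layout and completion shown here are
# assumptions, not RT-2's published interface. The idea is that the model
# first emits a short natural-language plan, then the action that executes it.
def build_plan_prompt(instruction: str) -> str:
    """Compose a chain-of-thought style prompt that asks for an explicit plan step."""
    return f"Instruction: {instruction}\nPlan:"

prompt = build_plan_prompt("pick up something that could be used to hammer in a nail")
print(prompt)
# A hypothetical completion could look like:
#   Plan: pick up the rock.
#   Action: 1 200 128 40 ...
```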

The new model demonstrates that vision-language models can directly control robots when VLM pre-training is combined with robotics data. It not only improves on existing VLMs but also opens the door to a general-purpose physical robot that can reason, solve problems, and carry out a wide range of tasks in the real world.
