Meta reveals an AI capable of anticipating the laws of physics to better control robots


Meta is active on all fronts of artificial intelligence. Alongside its Llama 4 models, the social network giant has just unveiled V-JEPA 2, an artificial intelligence model that is ambitious to say the least, since it aims to teach a machine to reason about the physical world the way a human would. This new version continues the group's work on the JEPA architecture, begun in 2022, while clearly going even further. It is not only capable of predicting with some accuracy the consequences of an action in a video, but also of planning tasks for a robot in an environment that is completely unknown to it. It is one more step towards what the company calls advanced machine intelligence (AMI): agents capable of moving through and adapting to our world. Put more simply, this model could form the basis of how tomorrow's fully autonomous robots behave in physical space.

Video presentation of V-JEPA 2 (in English)

Understanding the world

Before going further, let's dwell on what a "world model" is. It is a notion that we humans use constantly without even realizing it. If someone throws a ball at you, you do not wait for it to reach your hands before moving; you anticipate its trajectory to position yourself in the right place. In the same way, if you throw a tennis ball in the air, you know it will fall back down. You would be more than surprised if it started to float, changed direction, or turned into an apple mid-flight.

This physical intuition, which seems so natural to us, is the fruit of an internal model of the world that our brain has built through observation and experience. It is this internal simulator that allows us to imagine the likely consequences of our actions before executing them. For an AI agent to "think before acting" in a similar way, Meta believes it must master three fundamental capabilities: first, understanding what it observes, such as recognizing objects or movements in a video; then prediction, that is, the ability to anticipate how the world will evolve, whether on its own or in response to an action; and finally planning, which builds on prediction to work out a sequence of actions in order to achieve a goal.

An architecture fed on video

To build such a model, Meta relies on its JEPA architecture (Joint Embedding Predictive Architecture), of which V-JEPA 2 is the latest incarnation, with 1.2 billion parameters. The principle is to learn by observation, without direct human supervision and therefore without costly annotations. The model is made up of two main building blocks: an encoder, which analyzes raw video to extract useful semantic information, and a predictor, which anticipates the future state from that information.
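
To make that two-part structure more concrete, here is a minimal PyTorch-style sketch of an encoder/predictor pair working purely in embedding space. The class names, dimensions and the use of plain transformer layers are illustrative assumptions for the sake of the example, not Meta's actual implementation.

```python
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    """Illustrative encoder: maps patchified video frames to semantic embeddings."""
    def __init__(self, patch_dim=768, embed_dim=1024, depth=4):
        super().__init__()
        self.proj = nn.Linear(patch_dim, embed_dim)  # raw patches -> tokens
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patches):                      # patches: (B, T, patch_dim)
        return self.backbone(self.proj(patches))     # embeddings: (B, T, embed_dim)

class Predictor(nn.Module):
    """Illustrative predictor: anticipates future embeddings from past ones."""
    def __init__(self, embed_dim=1024, depth=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.core = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, embed_dim)

    def forward(self, context_embeddings):           # (B, T_ctx, embed_dim)
        return self.head(self.core(context_embeddings))  # predicted future embeddings
```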

Training takes place in two stages. First comes an "action-free" pre-training phase, during which the model ingested more than a million hours of video and a million images from various sources. This enormous amount of visual data allows it to learn fundamental concepts about how the world works: how objects move and how they interact with each other or with humans. According to Meta, V-JEPA 2 already demonstrates strong comprehension and prediction capabilities at this stage.
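
Under the same assumptions as the sketch above, a single "action-free" training step could look roughly like this: the predictor tries to guess the embeddings of the unseen part of a clip, and the loss is measured in embedding space rather than on pixels. The context/future split, the frozen target encoder and the L1 loss are simplifications for illustration, not Meta's published recipe.

```python
import torch
import torch.nn.functional as F

def pretrain_step(encoder, target_encoder, predictor, clip_patches, optimizer):
    """One simplified 'action-free' step: predict embeddings of the future half of a clip."""
    _, T, _ = clip_patches.shape
    context, future = clip_patches[:, : T // 2], clip_patches[:, T // 2 :]

    with torch.no_grad():                        # targets come from a frozen copy of the encoder
        target = target_encoder(future)

    pred = predictor(encoder(context))           # imagine the future in embedding space
    loss = F.l1_loss(pred, target[:, : pred.shape[1]])  # compare embeddings, not pixels

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```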

An explanatory video from Meta on what "world models" are (in English)

From the virtual to the real world

While this first stage of training allows the model to predict how the world might evolve, it does not take into account the specific actions an agent could take. This is where the second training phase comes in. To make the model useful for robotics, the researchers fine-tuned it with robot data, combining video observations with the control commands that were executed. The predictor thus learns to link an action to its consequence. Interestingly, this phase does not require a mountain of data: Meta indicates that 62 hours of robot data were enough to obtain a model usable for planning and control.
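
One plausible way to implement this second phase is to feed the predictor an action embedding alongside the state embedding, then regress the embedding of the observed next state. Everything below (the class, the shapes, the 7-dimensional action vector) is a hypothetical sketch of that idea, not V-JEPA 2's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionConditionedPredictor(nn.Module):
    """Illustrative: predict the next state embedding from (current state, action)."""
    def __init__(self, embed_dim=1024, action_dim=7):
        super().__init__()
        self.action_proj = nn.Linear(action_dim, embed_dim)
        self.net = nn.Sequential(
            nn.Linear(2 * embed_dim, 2 * embed_dim),
            nn.GELU(),
            nn.Linear(2 * embed_dim, embed_dim),
        )

    def forward(self, state_emb, action):        # (B, E), (B, action_dim)
        a = self.action_proj(action)
        return self.net(torch.cat([state_emb, a], dim=-1))  # predicted next state (B, E)

def finetune_step(predictor, state_emb, action, next_state_emb, optimizer):
    """Link an action to its consequence: regress the observed next embedding."""
    pred = predictor(state_emb, action)
    loss = F.l1_loss(pred, next_state_emb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```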

The result? V-JEPA 2 can be used for task planning by a robot in "zero-shot" mode, that is, without having been specifically trained on the objects it must manipulate or the environment in which it operates. This is a major difference from other foundation models for robots, which often require data specific to the robot and its deployment environment. Concretely, for a simple task like grasping an object, the robot is given an image of the goal to be reached. From its current state, it uses the predictor to imagine the consequences of a series of possible actions and chooses the one that brings it closest to the goal. It repeats this planning process at each step until it succeeds. For longer tasks, such as picking up an object and placing it in a specific location, the goal can be broken down into a series of visual sub-goals that the robot tries to reach one after the other. With this method, V-JEPA 2 already achieves success rates of 65 to 80% on manipulation tasks involving unknown objects in new environments.
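
The planning loop described above behaves like a simple sampling-based model-predictive controller: sample candidate action sequences, "imagine" their outcomes in embedding space with the predictor, and execute the first action of the sequence whose predicted end state lands closest to the goal image's embedding. This sketch reuses the hypothetical action-conditioned predictor from the previous example; the random sampling strategy and all names are assumptions for illustration.

```python
import torch

def plan_one_step(encoder, predictor, current_patches, goal_patches,
                  horizon=5, num_candidates=256, action_dim=7):
    """Pick the first action of the candidate sequence whose imagined outcome
    ends up closest (in embedding space) to the goal image."""
    with torch.no_grad():
        state = encoder(current_patches).mean(dim=1)   # (1, E) pooled current state
        goal = encoder(goal_patches).mean(dim=1)       # (1, E) pooled goal state

        candidates = torch.randn(num_candidates, horizon, action_dim)  # random action sequences
        states = state.expand(num_candidates, -1).clone()

        for t in range(horizon):                       # imagine consequences step by step
            states = predictor(states, candidates[:, t])

        dist = (states - goal).norm(dim=-1)            # distance to goal in embedding space
        best = dist.argmin()
    return candidates[best, 0]                         # execute only the first action, then replan
```

In practice such loops usually rely on more refined samplers than pure random shooting, but the principle of replanning after every executed action, as described in the article, stays the same.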

Finally, true to its recent practice in AI, Meta is not only publishing its new model but also providing the community with three new benchmarks designed to assess AI's ability to reason about the physical world from videos.
