Humanoid robot development has advanced at a snail's pace for the better part of two decades, but it is now accelerating fast, courtesy of a collaboration between Figure AI and OpenAI that has produced the most stunning real humanoid robot video I've ever seen.
On Wednesday, Figure AI, a robotics startup, released a video update (see below) of its Figure 01 robot running a new visual language model (VLM), which has somehow transformed the bot from a somewhat dull automaton into a full-fledged sci-fi robot with capabilities approaching those of C-3PO.
Figure 01 is shown standing behind a table with a plate, an apple, and a cup. There is a drainer on the left. A person stands in front of the robot and says, “Figure 01, what do you see right now?”
After a few seconds, Figure 01 responds in a remarkably human-sounding voice (there is no face, just an animated light that moves in time with the speech), describing everything on the table, as well as the man standing in front of it.
“That’s cool,” I thought.
Then the man says, “Hey, can I have something to eat?”
Figure 01 says, “Sure thing,” and then picks up the apple with a deft flourish of fluid movement and hands it to the guy.
“Woah,” I exclaimed.
The man then empties some crumpled trash from a bin in front of Figure 01 and asks, "Can you explain why you did what you just did while picking up this trash?"
Figure 01 wastes no time explaining its reasoning as it places the paper back in the bin: "So, I gave you the apple because it's the only edible item I could provide you with from the table."
I thought, “This can’t be real.”
It is, though, according to Figure AI.
Speech-to-speech
According to the company, Figure 01 uses "speech-to-speech" reasoning to understand images and text, relying on the entire vocal conversation to form its responses. This differs from, say, OpenAI's GPT-4, which focuses on written prompts.
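To make that description concrete, here is a minimal sketch of what such a speech-to-speech loop might look like: incoming audio is transcribed, the transcript and a camera frame are passed to a multimodal model along with the conversation so far, and the reply is spoken back. Every function name below is my own placeholder, not anything Figure or OpenAI has published.

```python
# Hypothetical stand-ins for the real models; none of these names come
# from Figure or OpenAI.
def transcribe(audio_bytes: bytes) -> str:
    """Stand-in for a speech-to-text model."""
    return "Can I have something to eat?"

def multimodal_reason(camera_frame, conversation: list[str]) -> str:
    """Stand-in for a vision-language model that sees the camera frame
    plus the full spoken conversation history."""
    return "Sure thing."

def synthesize(text: str) -> bytes:
    """Stand-in for a text-to-speech model."""
    return text.encode()

conversation: list[str] = []

def handle_utterance(audio_bytes: bytes, camera_frame) -> bytes:
    # 1. Convert the human's speech to text.
    user_text = transcribe(audio_bytes)
    conversation.append(f"Human: {user_text}")
    # 2. Reason over the camera image plus the whole conversation so far.
    reply = multimodal_reason(camera_frame, conversation)
    conversation.append(f"Robot: {reply}")
    # 3. Speak the reply back (behavior selection would happen separately).
    return synthesize(reply)

if __name__ == "__main__":
    audio_out = handle_utterance(b"...", camera_frame=None)
    print(audio_out.decode())
```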
It also employs what the manufacturer calls "learned low-level bimanual manipulation." To control movement, the system combines precise image calibrations (down to the pixel level) with its neural network. "These networks take in onboard images at 10hz, and generate 24-DOF actions (wrist poses and finger joint angles) at 200hz," the company said in a press release.
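Reading between the lines of that quote, the setup sounds like a two-rate loop: a vision policy running at roughly 10 Hz writes a 24-DOF target, while a faster whole-body controller tracks that target at roughly 200 Hz. The sketch below is only my interpretation of that split, with invented names and stubbed-out models rather than Figure's actual code.

```python
import time
import threading
import numpy as np

# Hypothetical placeholder for the learned policy: takes a camera frame
# and returns a 24-DOF action target (wrist poses + finger joint angles).
def policy_network(image: np.ndarray) -> np.ndarray:
    return np.zeros(24)  # stand-in for a trained model's output

# Shared setpoint written by the slow (10 Hz) loop, read by the fast (200 Hz) loop.
latest_action = np.zeros(24)
lock = threading.Lock()

def perception_loop(stop: threading.Event):
    """Run the vision policy at roughly 10 Hz."""
    global latest_action
    while not stop.is_set():
        image = np.zeros((224, 224, 3))  # stand-in for an onboard camera frame
        action = policy_network(image)   # 24-DOF target from the learned policy
        with lock:
            latest_action = action
        time.sleep(0.1)                  # ~10 Hz

def control_loop(stop: threading.Event):
    """Send whole-body commands at roughly 200 Hz, tracking the latest target."""
    while not stop.is_set():
        with lock:
            target = latest_action.copy()
        # A real controller would interpolate toward `target` and send
        # joint commands to the actuators here.
        time.sleep(0.005)                # ~200 Hz

if __name__ == "__main__":
    stop = threading.Event()
    threads = [threading.Thread(target=perception_loop, args=(stop,)),
               threading.Thread(target=control_loop, args=(stop,))]
    for t in threads:
        t.start()
    time.sleep(1.0)                      # run briefly for illustration
    stop.set()
    for t in threads:
        t.join()
```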
The company states that all behavior in the video is learned, not teleoperated, meaning there is no one behind the scenes puppeteering Figure 01.
Without seeing Figure 01 in person and asking my own questions, it is difficult to verify these claims. It is possible that Figure 01 has run this routine before. This could have been the 100th take, which would explain its speed and smoothness.
Or maybe this is 100% true, in which case, amazing. Just wow.