Commonsense knowledge of how to behave can be formulated as a goal-conditioned policy function. The prior knowledge that a state closer to the button is closer to success can be formulated as a value function. The ability to recognize the success state can be formulated as a 0-1 success-reward function, which equals 1 only if the task succeeds.
We assume the success-reward prior is relatively precise, given the simplicity of binary classification in determining success. The value and policy priors are noisier.
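For concreteness, the three priors can be written as follows; the notation is ours and chosen only for illustration.

$$
\pi_{\text{prior}}(a \mid s, g), \qquad
V_{\text{prior}}(s, g), \qquad
R_{\text{succ}}(s, g) =
\begin{cases}
1, & \text{if the task succeeds in state } s, \\
0, & \text{otherwise,}
\end{cases}
$$

where $s$ is the current state (image), $g$ the goal, and $a$ the action.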
(1) Policy Regularization from Policy Prior
(2) Reward Shaping from Value Prior
(3) 0-1 Success Feedback from Success Prior
FAC leverages foundation policy guidance and an automatic reward function, enabling the agent to efficiently learn from abundant prior knowledge.
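As a rough illustration, the sketch below shows one way these three mechanisms could enter a DrQ-v2-style actor-critic update: potential-based reward shaping from the value prior on top of the 0-1 success reward, and a regularizer that pulls the learned policy toward the policy prior. The function names, the MSE form of the regularizer, and the coefficients `alpha` and `beta` are our assumptions for illustration; FAC's exact losses may differ.

```python
import torch.nn.functional as F

def shaped_reward(r_success, v_prior_s, v_prior_next, gamma=0.99, alpha=1.0):
    """0-1 success reward plus potential-based shaping from the (noisy) value prior."""
    # Potential-based shaping leaves the optimal policy unchanged even if V_prior is noisy.
    return r_success + alpha * (gamma * v_prior_next - v_prior_s)

def actor_loss(actor, critic, obs, goal, a_prior, beta=0.1):
    """Actor loss (DrQ-v2 style) plus regularization toward the foundation policy prior."""
    dist = actor(obs, goal)               # e.g. a squashed Gaussian, as in DrQ-v2
    a = dist.rsample()
    q = critic(obs, goal, a)
    reg = F.mse_loss(dist.mean, a_prior)  # pull the learned policy toward the prior action
    return (-q).mean() + beta * reg
```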
Foundation Priors
(1) Code-as-Policy. We apply the code policy on the real robot. Before code generation, we define a set of primitive skills and implement the corresponding interface between these skills and the control system, so that the generated code can be executed directly by the robot.
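The snippet below is a hypothetical sketch of such a primitive-skill interface; the class name `RobotAPI`, the skill set, and the controller calls (`goto_cartesian`, `set_gripper`) are placeholders, since the actual API depends on the robot and its controller.

```python
import numpy as np

class RobotAPI:
    """Thin wrapper exposing primitive skills on top of a low-level controller (hypothetical)."""

    def __init__(self, controller):
        self.controller = controller                 # e.g. a Cartesian position controller

    def move_to(self, xyz):
        self.controller.goto_cartesian(xyz)          # assumed controller method

    def close_gripper(self):
        self.controller.set_gripper(closed=True)     # assumed controller method

    def open_gripper(self):
        self.controller.set_gripper(closed=False)    # assumed controller method

# Generated code only calls these primitives, e.g. for button pressing:
def press_button(robot, button_xyz):
    button_xyz = np.asarray(button_xyz, dtype=float)
    hover = button_xyz + np.array([0.0, 0.0, 0.05])  # 5 cm above the button
    robot.move_to(hover)
    robot.move_to(button_xyz)                        # press down
    robot.move_to(hover)                             # retract
```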
(2) UniPi. In simulation, we fine-tune a conditioned video diffusion model (Seer) with 10 videos per task and use a pre-trained inverse dynamics model to infer actions from the generated videos (a sketch of this inference step follows the task list below). Each generated video has 16 frames; examples are as follows.
- bin-picking-v2: pick the green bin from the red box and place it on the table.
- button-press-topdown-v2: press down the red button with the red robotic arm.
- door-open-v2: open the door by turning the handle.
- door-unlock-v2: unlock the door with the red robotic arm.
- drawer-close-v2: close the green drawer with the red robotic arm.
- drawer-open-v2: open the green drawer with the red robotic arm.
- hammer-v2: drive the nail into the wall with the hammer.
- window-close-v2: close the window with the red robotic arm.
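As mentioned above, actions are recovered from the generated frames with an inverse dynamics model. The sketch below illustrates that step; the tensor layout, the `inverse_dynamics` interface, and the function name are assumptions, not the actual Seer or inverse-dynamics API.

```python
import torch

@torch.no_grad()
def actions_from_video(frames, inverse_dynamics):
    """Infer an action sequence from a generated 16-frame video (illustrative).

    frames           : (T, C, H, W) tensor of generated frames, T = 16 here
    inverse_dynamics : model mapping a frame pair (o_t, o_{t+1}) to the action a_t
    """
    actions = []
    for t in range(frames.shape[0] - 1):
        pair = torch.stack([frames[t], frames[t + 1]]).unsqueeze(0)  # (1, 2, C, H, W)
        actions.append(inverse_dynamics(pair).squeeze(0))
    return torch.stack(actions)                                      # (T - 1, action_dim)
```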
Value Priors
We choose the VIP model as the value foundation prior. It provides rough value estimates for states, taking the current image and the goal image as input. Although noisy, it still offers useful guidance for the agent. Here is a value example.
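A common way to turn VIP into such a goal-conditioned value is the negative distance between the embeddings of the current and goal images, as sketched below; whether FAC uses exactly this form is an assumption on our part, and model loading and image preprocessing are omitted.

```python
import torch

@torch.no_grad()
def vip_value(vip_encoder, obs_image, goal_image):
    """Rough goal-conditioned value from VIP embeddings (illustrative).

    vip_encoder : pretrained VIP visual encoder
    obs_image   : (1, 3, H, W) current camera image
    goal_image  : (1, 3, H, W) goal image
    """
    z_obs = vip_encoder(obs_image)
    z_goal = vip_encoder(goal_image)
    # Closer to the goal in embedding space -> higher (less negative) value.
    return -torch.norm(z_obs - z_goal, dim=-1)
```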
Success Priors
We use GPT-4V to give a 0/1 response based on the given images and prompts. The success prior is relatively precise, given the simplicity of binary classification in determining success. Here is a success prompt example.
- Does the robotic arm water the plants? Attention, if the spout orients horizontally over the plant, you should output 1 for yes. Otherwise, you should output 0 for no without any space first. Be sure of your answer and explain your reason afterward.
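The sketch below shows how such a query could be issued with the OpenAI Python client; the model name, the single-frame input, and the answer parsing are illustrative assumptions.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SUCCESS_PROMPT = (
    "Does the robotic arm water the plants? Attention, if the spout orients "
    "horizontally over the plant, you should output 1 for yes. Otherwise, you "
    "should output 0 for no without any space first. Be sure of your answer "
    "and explain your reason afterward."
)

def query_success(image_path, prompt=SUCCESS_PROMPT, model="gpt-4-vision-preview"):
    """Ask GPT-4V for a 0/1 success judgment on one image (illustrative)."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    answer = response.choices[0].message.content.strip()
    return 1 if answer.startswith("1") else 0
```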
4. Experiments
The proposed Foundation Actor-Critic (FAC) is built upon DrQ-v2. We conduct experiments in both real-world and Meta-World environments.
(1) Sample-efficient learning. Across the 5 tasks on real robots, FAC achieves an 86\% success rate after 1 hour of real-time learning. Across the 8 tasks in the simulated Meta-World, FAC achieves 100\% success rates on 7/8 tasks within 100k frames (about 1 hour of training), surpassing baseline methods that rely on manually designed rewards even after 1M frames.
(2) Minimal and effective reward engineering. The reward function is derived from the value and success-reward prior knowledge, eliminating the need for human-specified dense rewards or teleoperated demonstrations.
(3) Agnostic to the form of the foundation priors and robust against noisy priors. FAC remains resilient under quantization errors in simulation.
Real-World
Meta-World
FAC is robust to noisy priors
(1) We discretize the policy prior into {-1, 0, 1}, so it contains only rough directional information.
(2) On top of the discretized policy prior, we replace the prior action with uniform noise at 20\% or 50\% probability.
(3) We also test robustness against systematically wrong policy priors, where actions have a 20\% or 50\% chance of being inverted (e.g., -1 to 1); see the sketch after this list.
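The sketch below shows one way to implement these corruptions of the policy prior; whether the noise and inversion are applied per step or per episode, and per action or per dimension, is an assumption here.

```python
import numpy as np

def corrupt_prior_action(a_prior, noise_p=0.0, invert_p=0.0, rng=np.random):
    """Degrade the policy prior as in the robustness study (illustrative).

    a_prior  : (action_dim,) continuous prior action in [-1, 1]
    noise_p  : probability of replacing the action with uniform noise
    invert_p : probability of inverting the discretized action (systematically wrong prior)
    """
    # (1) Discretize into {-1, 0, 1}: keep only rough directional information.
    a = np.sign(np.round(np.asarray(a_prior, dtype=float)))
    # (2) With probability noise_p, replace the prior action with uniform noise.
    if rng.random() < noise_p:
        a = rng.uniform(-1.0, 1.0, size=a.shape)
    # (3) With probability invert_p, flip the direction (e.g., -1 becomes 1).
    if rng.random() < invert_p:
        a = -a
    return a
```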
Even with the discretized policy prior and 50\% noise, FAC can still reach 100\% success rates in many environments.
The results indicate that the proposed FAC is robust to the quality of the foundation priors: it performs well with 20\% of the directions inverted but struggles at 50\%, where misleading information is abundant.
Moreover, the better the foundation prior, the more sample-efficient FAC is.