Reinforcement Learning with Foundation Priors:
Let the Embodied Agent Efficiently Learn on Its Own

1Tsinghua University, 2Shanghai Qi Zhi Institute, 3Shanghai Artificial Intelligence Laboratory, 4UC Berkeley

Abstract

Reinforcement learning (RL) is a promising approach for solving robotic manipulation tasks. However, it is challenging to apply RL algorithms directly in the real world. For one thing, RL is data-intensive and typically requires millions of interactions with the environment, which is impractical in real scenarios. For another, manually designing reward functions demands heavy engineering effort. To address these issues, we leverage foundation models in this paper. We propose Reinforcement Learning with Foundation Priors (RLFP) to utilize guidance and feedback from policy, value, and success-reward foundation models. Within this framework, we introduce the Foundation-guided Actor-Critic (FAC) algorithm, which enables embodied agents to explore more efficiently with automatic reward functions. The benefits of our framework are threefold: (1) sample-efficient learning; (2) minimal and effective reward engineering; (3) agnosticism to foundation model forms and robustness to noisy priors. Our method achieves remarkable performance on various manipulation tasks, both on real robots and in simulation. Across 5 dexterous tasks with real robots, FAC achieves an average success rate of 86% after one hour of real-time learning. Across 8 tasks in the simulated Meta-world, FAC achieves 100% success rates on 7/8 tasks within 100k frames (about 1 hour of training), outperforming baseline methods that use manually designed rewards and 1M frames. We believe the RLFP framework can enable future robots to explore and learn autonomously in the physical world for more tasks.

1. Reinforcement Learning with Foundation Priors (RLFP)

The prior knowledge from foundation models can be formulated as three functions. Taking button pressing as an example: the commonsense of how to behave can be formulated as a goal-conditioned policy function; the prior knowledge that a state closer to the button is closer to success can be formulated as a value function; and the ability to recognize the success state can be formulated as a 0-1 success-reward function, which equals 1 only if the task succeeds. We assume the success-reward prior is relatively precise, given the simplicity of binary classification in determining success, whereas the value and policy priors are noisier.
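To make the formulation concrete, the following is a minimal sketch of the three prior interfaces that RLFP assumes. The class names (PolicyPrior, ValuePrior, SuccessPrior) and signatures are illustrative, not the exact API used in the paper.

from abc import ABC, abstractmethod
import numpy as np

class PolicyPrior(ABC):
    """Behavioral commonsense: proposes an action for a state and a goal."""
    @abstractmethod
    def action(self, observation: np.ndarray, goal: str) -> np.ndarray: ...

class ValuePrior(ABC):
    """Rough progress estimate: states closer to success get higher values."""
    @abstractmethod
    def value(self, observation: np.ndarray, goal: str) -> float: ...

class SuccessPrior(ABC):
    """0-1 success detector: returns 1.0 only if the task is judged solved."""
    @abstractmethod
    def success(self, observation: np.ndarray, goal: str) -> float: ...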


2. Foundation-guided Actor-Critic (FAC)

(1) Policy Regularization from Policy Prior

(2) Reward Shaping from Value Prior

(3) 0-1 Success Feedback from Success Prior

FAC leverages foundation policy guidance and an automatic reward function, enabling the agent to efficiently learn from abundant prior knowledge.
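As a rough illustration, the snippet below sketches how the three priors could enter an actor-critic update: a potential-based shaping term from the value prior plus the sparse success signal forms the reward, and the actor is regularized toward the policy prior. The coefficient names (alpha, beta) and exact loss form are assumptions; see the paper for the precise formulation.

import torch.nn.functional as F

def shaped_reward(v_prior_s, v_prior_next_s, success, gamma=0.99, alpha=1.0):
    # (2) Potential-based shaping from the value prior, plus
    # (3) the sparse 0-1 success feedback.
    return success + alpha * (gamma * v_prior_next_s - v_prior_s)

def actor_loss(actor, critic, obs, prior_action, beta=0.1):
    # (1) Standard policy improvement, regularized toward the policy prior action.
    action = actor(obs)
    q_value = critic(obs, action)
    regularization = F.mse_loss(action, prior_action)
    return -q_value.mean() + beta * regularization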


3. Acquiring Foundation Priors

Policy Priors

(1) Code-as-Policy. We use code-generation policies for the real robots. Before code generation, we define a set of primitive skills and implement the interface between these skills and the control system, so that the generated code can be executed directly by the robot.
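The snippet below is a hedged sketch of such an interface. The skill names (move_to, open_gripper, close_gripper) and the press_button routine are illustrative of the kind of code a language model might generate on top of predefined primitives, not the exact skills used on our robot.

class RobotSkills:
    # Thin wrappers around the low-level control system.
    def move_to(self, xyz):       # Cartesian move of the end-effector
        ...
    def open_gripper(self):
        ...
    def close_gripper(self):
        ...

def press_button(robot: RobotSkills, button_xyz):
    # Example of generated code composed only of the predefined primitives.
    above = (button_xyz[0], button_xyz[1], button_xyz[2] + 0.05)
    robot.move_to(above)        # hover above the button
    robot.close_gripper()       # form a pressing posture
    robot.move_to(button_xyz)   # push down onto the button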

 

(2) UniPi. In simulation, we fine-tune a text-conditioned video diffusion model (Seer) with 10 videos per task and use a pre-trained inverse dynamics model to infer actions from the generated videos; a sketch of this pipeline is given after the task list below. Each generated video contains 16 frames; examples are shown for the following tasks.

  • bin-picking-v2: pick the green bin from the red box and place it on the table.
  • button-press-topdown-v2: press down the red button with the red robotic arm.
  • door-open-v2: open the door by turning the handle.
  • door-unlock-v2: unlock the door with the red robotic arm.
  • drawer-close-v2: close the green drawer with the red robotic arm.
  • drawer-open-v2: open the green drawer with the red robotic arm.
  • hammer-v2: push the nail into the wall with the hammer.
  • window-close-v2: close the window with the red robotic arm.
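
Below is a minimal sketch of this UniPi-style pipeline: a fine-tuned video model imagines a 16-frame rollout from the current observation and the task text, and an inverse dynamics model maps consecutive frames to actions. The video_model and inverse_dynamics objects and their methods are stand-ins for the fine-tuned Seer model and the pre-trained inverse dynamics model.

import numpy as np

def policy_prior_action(video_model, inverse_dynamics, current_frame, task_text):
    # Imagine a short rollout (16 frames) starting from the current
    # observation, conditioned on the task description.
    frames = video_model.generate(first_frame=current_frame,
                                  text=task_text, num_frames=16)
    # Infer the action that moves from the current frame to the first
    # imagined frame; later frames can be used for re-planning.
    action = inverse_dynamics.predict(current_frame, frames[0])
    return np.asarray(action)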
 
Value Priors
We choose the VIP model as the value foundation prior. Taking the current image and the goal image as input, it provides rough value estimates for states. Although noisy, these estimates still offer useful guidance for the agent.
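
A hedged sketch of querying such a value prior is given below, where the value of a state is taken as the negative embedding distance between the current image and the goal image. Here vip_encoder stands in for the released VIP visual encoder, and the exact distance used is an assumption.

import torch

def value_prior(vip_encoder, current_image, goal_image):
    # Images are assumed to be preprocessed torch tensors of shape (C, H, W).
    with torch.no_grad():
        z_cur = vip_encoder(current_image.unsqueeze(0))
        z_goal = vip_encoder(goal_image.unsqueeze(0))
    # Closer to the goal in embedding space -> higher (less negative) value.
    return -torch.norm(z_cur - z_goal, dim=-1).item()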
 
Success Priors
We use GPT-4V to give a 0/1 response from the given images and prompts. The success prior is relatively precise, given the simplicity of binary classification in determining success. An example prompt:
  • Does the robotic arm water the plants? Attention, if the spout orients horizontally over the plant, you should output 1 for yes. Otherwise, you should output 0 for no without any space first. Be sure of your answer and explain your reason afterward.
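
A minimal sketch of turning the vision-language model's answer into the 0-1 success reward is shown below; query_vlm is a placeholder for an API call to GPT-4V with the rendered image and the task-specific prompt.

def success_prior(query_vlm, image, prompt):
    # The prompt asks the model to answer "1" (success) or "0" (failure)
    # first, followed by its reasoning.
    answer = query_vlm(image=image, prompt=prompt).strip()
    return 1.0 if answer.startswith("1") else 0.0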

4. Experiments

The proposed Foundation-guided Actor-Critic (FAC) is built upon DrQ-v2. We conduct experiments in both real-world and simulated Meta-world environments.

(1) Sample-efficient learning. Across the 5 tasks on real robots, FAC achieves an average success rate of 86% after 1 hour of real-time learning. Across the 8 tasks in the simulated Meta-world, FAC achieves 100% success rates on 7/8 tasks within 100k frames (about 1 hour of training), surpassing baseline methods that rely on manually designed rewards and 1M frames.

(2) Minimal and effective reward engineering. The reward function is derived from the value and success-reward prior knowledge, eliminating the need for human-specified dense rewards or teleoperated demonstrations.

(3) Agnostic to the forms of the prior foundation models and robust to noisy priors. FAC remains effective under quantization errors and injected noise in simulation.

Real-World

Meta-World

FAC is robust to noisy priors

(1) We discretized the policy prior into {-1, 0, 1}, which makes the policy prior only contain rough directional information.

(2) Under the discretized policy prior, we replace the prior action with uniform noise with 20% or 50% probability.

(3) We also tested robustness with systematically wrong policy priors, where actions have a 20% or 50% chance of being inverted (e.g., -1 to 1).

Even using the discretized policy prior with 50% noise, FAC can still reach 100% success rates in many environments, indicating that FAC is robust to the quality of the foundation prior. It also performs well with 20% wrong directions but struggles at 50%, where misleading information is abundant. Moreover, the better the foundation prior, the more sample-efficient FAC becomes.
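
For reference, the snippet below sketches the three corruptions applied to the policy prior in this study: discretization to rough directions, replacement by uniform noise, and direction inversion. The function names and per-call sampling are illustrative.

import numpy as np

def discretize(action):
    # (1) Keep only rough directional information in {-1, 0, 1}.
    return np.sign(action)

def with_uniform_noise(action, p, rng=None):
    # (2) Replace the discretized prior action with uniform noise w.p. p.
    rng = rng or np.random.default_rng()
    a = discretize(action)
    return rng.uniform(-1.0, 1.0, size=a.shape) if rng.random() < p else a

def with_inverted_directions(action, p, rng=None):
    # (3) Flip the discretized prior action to the wrong direction w.p. p.
    rng = rng or np.random.default_rng()
    a = discretize(action)
    return -a if rng.random() < p else a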

BibTeX

@misc{ye2024reinforcementlearningfoundationpriors,
      title={Reinforcement Learning with Foundation Priors: Let the Embodied Agent Efficiently Learn on Its Own}, 
      author={Weirui Ye and Yunsheng Zhang and Haoyang Weng and Xianfan Gu and Shengjie Wang and Tong Zhang and Mengchen Wang and Pieter Abbeel and Yang Gao},
      year={2024},
      eprint={2310.02635},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2310.02635}, 
}