First Order Model-Based RL through Decoupled Backpropagation (DMO)

New York University, LAAS-CNRS, ANITI

Abstract

We present Decoupled forward-backward Model-based policy Optimization (DMO), a first-order gradient RL method that unrolls trajectories using a high-fidelity simulator while computing gradients via a learned differentiable dynamics model. This decoupling avoids compounding prediction errors in model rollouts and preserves the benefits of analytical gradients without requiring differentiable physics. Empirically, DMO improves sample and wall-clock efficiency across locomotion and manipulation benchmarks and deploys on a Unitree Go2 robot for both quadrupedal and bipedal locomotion tasks with robust sim-to-real transfer.

Additional Clips

Quadrupedal Hardware Experiments (Go2 Walking)

Bipedal Hardware Experiments (Go2 Front-Legs Balancing)

Simulation Demos

What is First-Order Gradient RL?

First-order gradient reinforcement learning (RL) computes policy updates using analytical gradients of the RL objective with respect to policy parameters. Unlike zero-order methods, which estimate gradients using sampled perturbations, first-order methods leverage the chain rule, requiring access to derivatives of both the reward and environment dynamics. This enables more informative, lower-variance policy updates and often dramatically improves sample efficiency—provided that these gradients are available.

First-Order RL Objective Formula
Equation: Discounted return estimate as used in first-order policy optimization for RL.
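
The equation itself is rendered as an image on the original page. For reference, a common form of this estimate, written here as a hedged sketch rather than the paper's exact notation (the batch size N, horizon h, and terminal value V_\phi are assumptions in the style of SHAC-like short-horizon methods), is:

```latex
% Sketch (assumed notation, not copied from the page's figure): discounted return
% over N parallel short-horizon rollouts with a terminal value estimate, to be
% differentiated analytically with respect to the policy parameters \theta.
J(\theta) \;=\; \frac{1}{N} \sum_{i=1}^{N} \left[ \sum_{t=t_0}^{t_0+h-1} \gamma^{\,t-t_0}\, r\!\left(s_t^i, a_t^i\right) \;+\; \gamma^{h}\, V_\phi\!\left(s_{t_0+h}^i\right) \right],
\qquad a_t^i = \pi_\theta\!\left(s_t^i\right), \quad s_{t+1}^i = f\!\left(s_t^i, a_t^i\right).
```

The policy gradient \nabla_\theta J is then obtained by applying the chain rule through r, f, and V_\phi along each rollout, which is exactly where derivatives of the dynamics are needed.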

Limitations of Prior First-Order Methods

Previous first-order methods have taken two major paths:

  • Analytic policy gradients (APG) through differentiable simulators: compute exact gradients by backpropagating through a physics-based simulation. However, differentiable simulators are rarely available for realistic or complex robotic environments and can be impractical for multi-physics or contact-rich scenarios.
  • Model-Based RL (MBRL): Use a learned differentiable model to provide gradients. While flexible and general, prediction errors compound along simulated rollouts, which can degrade policy optimization.

APG vs MBRL: Differentiable Sim vs Learned Model
Comparison: APG leverages differentiable simulators for true gradients; MBRL relies on learned models, but rollouts diverge due to prediction errors.

DMO: Decoupled Forward-Backward Model-Based Policy Optimization

DMO (Decoupled forward-backward Model-based policy Optimization) is a new first-order gradient RL method that decouples trajectory generation from gradient computation:

  • Trajectory unrolling uses a high-fidelity simulator, ensuring realistic transitions and eliminating the compounding of model prediction errors.
  • Gradients are computed via a learned differentiable model, enabling efficient backpropagation even when the simulator isn’t itself differentiable.
DMO Architecture
Architecture: DMO decouples forward simulation (top) from gradient computation through the learned model (bottom); both reference the same simulator transitions.
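
A minimal sketch of how such a decoupled step can be implemented in a PyTorch-style autodiff setting is shown below; `sim_step` and `model` are illustrative placeholders rather than the authors' code, and the gradient re-attachment trick is one common way to realize "forward with the simulator, backward through the model".

```python
# Minimal sketch (assumptions: `sim_step` wraps a non-differentiable simulator,
# `model` is a learned differentiable dynamics network f_phi(s, a) -> s').
import torch

def decoupled_step(s, a, model, sim_step):
    """Return the simulator's next state while routing gradients through the model."""
    with torch.no_grad():
        s_sim = sim_step(s, a)        # high-fidelity, non-differentiable transition
    s_model = model(s, a)             # differentiable one-step prediction
    # Forward value equals s_sim exactly; the backward pass sees only d(model)/d(s, a).
    return s_model + (s_sim - s_model).detach()
```

Numerically the returned state equals the simulator's, so rollouts never drift from the high-fidelity dynamics, while autodiff only ever differentiates the learned model.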

DMO thus combines the best of both worlds, bringing high sample efficiency, robust optimization, and reliable sim-to-real transfer. DMO can be applied on top of any first-order RL algorithm via this forward-backward decoupling.

DMO applied to SHAC Loop Diagram
DMO Workflow: When applied to SHAC, DMO cycles through parallel simulation, learned-model training, value function learning, and policy updates via the decoupled gradient.
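
As a hedged, self-contained toy illustration of this cycle (a 1-D point-mass "simulator", made-up reward, crude value targets, and arbitrary hyperparameters; none of this is the paper's actual setup), the loop might look like:

```python
import torch
import torch.nn as nn

def sim_step(s, a):                       # stand-in for a black-box, non-differentiable simulator
    return s + 0.1 * a

model  = nn.Linear(2, 1)                  # learned dynamics f_phi([s, a]) -> s'
policy = nn.Linear(1, 1)                  # policy pi_theta(s) -> a
critic = nn.Linear(1, 1)                  # value function V_psi(s)
opt_model  = torch.optim.Adam(model.parameters(),  lr=1e-2)
opt_policy = torch.optim.Adam(policy.parameters(), lr=1e-2)
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-2)
gamma, horizon = 0.99, 8

for iteration in range(100):
    # 1) Parallel simulation: short-horizon rollout in the simulator,
    #    gradients re-attached at every step through the learned model.
    s, ret, transitions = torch.randn(1, 1), 0.0, []
    for t in range(horizon):
        a = policy(s)
        with torch.no_grad():
            s_sim = sim_step(s, a)
        s_model = model(torch.cat([s, a], dim=-1))
        s_next = s_model + (s_sim - s_model).detach()      # forward: sim, backward: model
        ret = ret + gamma ** t * (-(s_next ** 2).sum())    # toy reward: drive the state to zero
        transitions.append((s.detach(), a.detach(), s_sim))
        s = s_next
    ret = ret + gamma ** horizon * critic(s).sum()         # terminal value bootstrap, as in SHAC

    # 2) Policy update: first-order gradient through the decoupled rollout.
    opt_policy.zero_grad()
    (-ret).backward()
    opt_policy.step()

    # 3) Learned-model training: one-step regression on the fresh simulator transitions.
    s_b = torch.cat([tr[0] for tr in transitions])
    a_b = torch.cat([tr[1] for tr in transitions])
    s_next_b = torch.cat([tr[2] for tr in transitions])
    opt_model.zero_grad()
    ((model(torch.cat([s_b, a_b], dim=-1)) - s_next_b) ** 2).mean().backward()
    opt_model.step()

    # 4) Value function learning (a crude stand-in for SHAC's TD-style targets).
    opt_critic.zero_grad()
    ((critic(s_b) + (s_next_b ** 2)) ** 2).mean().backward()
    opt_critic.step()
```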

Results

We evaluated DMO across a suite of diverse continuous control benchmarks spanning locomotion and manipulation in the GPU-accelerated DFlex simulator, as well as on real Unitree Go2 quadruped hardware with policies trained in IsaacGym. We compared against strong baselines: PPO (model-free), SAC (model-free), and MAAC (first-order model-based).

Benchmark Environments Visualized
Benchmark environments: Visualization of the simulation environments used in our experiments. From left to right: Ant, SNU Humanoid, Cheetah, Hopper, Allegro Hand, and Humanoid. These diverse tasks span locomotion and manipulation challenges.
Go2 Quadruped: Sim-to-Real Results
Go2 quadruped experiments: Policies trained in simulation with IsaacGym and deployed directly on the real Unitree Go2 robot, for both quadrupedal walking and bipedal front-legs balancing tasks.
  • Sample efficiency: DMO achieves high final performance with an order of magnitude fewer environment interactions than PPO and SAC, and consistently outperforms model-based MAAC.
  • Wall-clock time: Despite additional computation for gradient backpropagation, DMO converges significantly faster in real time than all baselines.
  • Ablation (Decoupling): We show that the key to DMO’s performance is the decoupling of rollouts and gradient computation. Using a learned model for both leads to much lower returns.
  • Gradient analysis: Gradients computed via DMO remain much closer (in cosine similarity) to those from a ground-truth differentiable simulator than gradients from standard MBRL rollouts, supporting more reliable optimization.
Sample efficiency
Sample efficiency: DMO achieves a high normalized return using dramatically fewer samples than PPO, SAC, and MAAC.
Wall-clock time efficiency
Wall-clock time efficiency: DMO reaches high performance faster in wall-clock time than PPO, SAC, and MAAC on aggregate benchmarks.
Ablation: Decoupling vs. Model-Based Forward
Ablation study: Performance with DMO's decoupled update (blue) is nearly double that of a conventional model-based approach (pink) that uses the learned model for both rollouts and gradients.
Gradient quality analysis
Gradient alignment: Cosine similarity between DMO policy gradients and those from a differentiable simulator is significantly higher than for gradients from standard model-based rollouts, validating DMO's reliable optimization.
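
For reference, the alignment metric itself is simple to compute; a minimal sketch (the function name and inputs are illustrative, not the paper's code):

```python
import torch

def grad_cosine_similarity(grads_a, grads_b):
    """Cosine similarity between two policy gradients, each given as a list of
    per-parameter gradient tensors (e.g., DMO's vs. a differentiable simulator's)."""
    g_a = torch.cat([g.reshape(-1) for g in grads_a])
    g_b = torch.cat([g.reshape(-1) for g in grads_b])
    return torch.nn.functional.cosine_similarity(g_a, g_b, dim=0)
```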

BibTeX

@inproceedings{amigo2025dmo,
  title={First Order Model-Based RL through Decoupled Backpropagation},
  author={Amigo, Joseph and Khorrambakht, Rooholla and Chane-Sane, Elliot and Righetti, Ludovic and Mansard, Nicolas},
  booktitle={Conference on Robot Learning (CoRL)},
  year={2025}
}