Reinforcement Learning and Optimal Control for Autonomous Systems I — Project

Train a Unitree Go2 walking policy in Isaac Lab and iterate from a weak baseline to a robust gait.

Instructor: Prof. Ludovic Righetti, NYU Tandon School of Engineering.

Quadruped Parkour Inspiration

Recent quadruped videos demonstrate agile jumping, stair climbing, and dynamic maneuvers that motivate this project’s performance goals for gait quality and robustness.

Many state-of-the-art locomotion policies are trained with deep reinforcement learning algorithms such as PPO in massively parallel GPU simulators like NVIDIA Isaac Gym and its modern successor, Isaac Lab.

Unitree in the wild: dynamic maneuvers and jumps.

Extreme Parkour with Legged Robots (CMU).

Note: The examples above are external videos meant as inspiration; they showcase what robust locomotion policies and systems can achieve when trained and tuned extensively in simulators such as Isaac Lab.

This is what an Isaac Lab RL training run looks like in the Isaac Sim simulator:

What PPO training looks like in Isaac Lab (not rendered in real time).

Project overview

The goal is to start from a minimal Isaac Lab training pipeline and significantly improve the learned Go2 walking policy through reward design and robustness techniques.

The provided baseline rewards only linear and angular velocity tracking, which yields a poor, unstable gait that will not transfer to hardware without additional reward shaping and constraints.
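For concreteness, here is a minimal sketch of what such a velocity-tracking-only reward can look like, written as framework-agnostic PyTorch rather than the actual baseline code; the function name, tensor shapes, weights, and the 0.25 temperature are all illustrative assumptions:

import torch

def baseline_velocity_reward(
    lin_vel_cmd: torch.Tensor,   # commanded base xy velocity, shape (num_envs, 2)
    lin_vel: torch.Tensor,       # measured base xy velocity, shape (num_envs, 2)
    yaw_rate_cmd: torch.Tensor,  # commanded yaw rate, shape (num_envs,)
    yaw_rate: torch.Tensor,      # measured yaw rate, shape (num_envs,)
    sigma: float = 0.25,         # tracking temperature; value is illustrative
) -> torch.Tensor:
    """Exponential velocity-tracking reward per environment, of the kind
    commonly used in GPU-parallel locomotion training."""
    lin_err = torch.sum((lin_vel_cmd - lin_vel) ** 2, dim=-1)
    yaw_err = (yaw_rate_cmd - yaw_rate) ** 2
    return torch.exp(-lin_err / sigma) + 0.5 * torch.exp(-yaw_err / sigma)

Because nothing in this reward constrains posture, contacts, or smoothness, the optimizer is free to exploit erratic gaits that track velocity well but would be unusable on hardware.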

What you will do

  • Extend the reward beyond velocity tracking by adding terms for posture stabilization, foot clearance, foot slip minimization, smooth actions, contact regularization, and collision penalties (see the shaping sketch after this list).
  • Use domain randomization to improve robustness across commanded speeds, external disturbances, and ground properties (a randomization sketch follows below).
  • Benchmark with metrics such as velocity tracking error, base orientation error, slip count, episode length, and energy proxies to guide iteration (a metrics sketch closes this section).
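A minimal sketch of what such shaping terms can look like, assuming per-environment tensors for base orientation, actions, and foot states are available; every name, shape, and weight below is a hypothetical placeholder to be tuned during iteration:

import torch

def shaping_penalties(
    projected_gravity: torch.Tensor,  # gravity in base frame, (num_envs, 3)
    actions: torch.Tensor,            # current actions, (num_envs, num_joints)
    prev_actions: torch.Tensor,       # previous actions, same shape
    foot_vel_xy: torch.Tensor,        # foot xy velocities, (num_envs, num_feet, 2)
    foot_contact: torch.Tensor,       # contact flags in {0, 1}, (num_envs, num_feet)
    joint_torques: torch.Tensor,      # applied torques, (num_envs, num_joints)
) -> torch.Tensor:
    """Common shaping penalties added on top of velocity tracking."""
    # Posture: penalize base tilt via the xy components of gravity in the base frame.
    flat_orientation = torch.sum(projected_gravity[:, :2] ** 2, dim=-1)
    # Smoothness: penalize large action changes between consecutive steps.
    action_rate = torch.sum((actions - prev_actions) ** 2, dim=-1)
    # Foot slip: penalize horizontal foot velocity only while the foot is in contact.
    slip = torch.sum(foot_contact.float() * torch.norm(foot_vel_xy, dim=-1), dim=-1)
    # Energy proxy: penalize squared joint torques.
    energy = torch.sum(joint_torques ** 2, dim=-1)
    # Weights are placeholders; tuning them is a core part of the project.
    return -(1.0 * flat_orientation + 0.01 * action_rate + 0.1 * slip + 1e-4 * energy)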
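For domain randomization, the idea is to resample physics parameters per environment at reset and to perturb the robot during rollouts. Isaac Lab exposes such hooks through its event configuration; the sketch below is a generic illustration with made-up ranges, not that API:

import torch

def randomize_episode(num_envs: int, device: str = "cpu") -> dict:
    """Sample per-environment physics parameters at episode reset.
    All ranges are illustrative assumptions."""
    return {
        "friction": torch.empty(num_envs, device=device).uniform_(0.4, 1.2),
        "added_base_mass": torch.empty(num_envs, device=device).uniform_(-1.0, 3.0),
        "motor_strength_scale": torch.empty(num_envs, device=device).uniform_(0.8, 1.2),
    }

def push_robots(base_lin_vel: torch.Tensor, max_push: float = 0.5) -> torch.Tensor:
    """Occasionally apply a random velocity impulse to the base so the
    policy learns to recover from disturbances."""
    push = torch.empty_like(base_lin_vel).uniform_(-max_push, max_push)
    return base_lin_vel + push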
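Finally, a sketch of how the benchmark metrics might be aggregated from logged rollout tensors; the tensor layout and the slip threshold are assumptions for illustration:

import torch

def evaluate_rollout(
    lin_vel_cmd: torch.Tensor,        # (T, num_envs, 2) commanded xy velocity
    lin_vel: torch.Tensor,            # (T, num_envs, 2) measured xy velocity
    projected_gravity: torch.Tensor,  # (T, num_envs, 3) gravity in base frame
    foot_contact: torch.Tensor,       # (T, num_envs, num_feet), flags in {0, 1}
    foot_vel_xy: torch.Tensor,        # (T, num_envs, num_feet, 2)
    slip_threshold: float = 0.1,      # m/s; illustrative cutoff
) -> dict:
    """Aggregate simple gait-quality metrics over a logged rollout."""
    vel_err = torch.norm(lin_vel_cmd - lin_vel, dim=-1).mean()
    tilt = torch.norm(projected_gravity[..., :2], dim=-1).mean()
    # A foot "slips" when it moves horizontally faster than the threshold while in contact.
    slipping = (foot_contact > 0.5) & (torch.norm(foot_vel_xy, dim=-1) > slip_threshold)
    return {
        "mean_vel_tracking_error": vel_err.item(),
        "mean_base_tilt": tilt.item(),
        "slip_events_per_env": slipping.float().sum(dim=(0, 2)).mean().item(),
    }

Tracking these numbers across training runs makes it possible to attribute gait improvements to specific reward or randomization changes rather than judging by eye alone.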

Baseline policy (weak)

This starting point rewards only linear and yaw velocity tracking, which typically yields unstable contacts, slipping feet, and poor base control, none of which transfers to the real robot.

Your task is to improve this policy using principled shaping and robustness strategies.

Target gait example

The clips below illustrate the type of stable, symmetric trot and directional tracking expected by the end of the project, emphasizing clean footfalls, limited slip, and consistent base orientation.

Stable trot at nominal speed with low slip (Hardware).

Example of a good policy trained in simulation.

Treat these as qualitative references for gait quality and document the design choices that lead to similar behavior.

Resources