TD-CD-MPPI: Temporal-Difference Constraint-Discounted Model Predictive Path Integral Control

1 Industrial Engineering Department, University of Trento, Trento, Italy
2 LAAS-CNRS, Université de Toulouse, CNRS, Toulouse, France
3 Artificial and Natural Intelligence Toulouse Institute (ANITI), Toulouse
Accepted for publication in IEEE Robotics and Automation Letters (RAL), 2025
TD-CD-MPPI Overview
TD-CD-MPPI controller structure, based on MPPI sampling-based optimization using the forward dynamics computed by a contact simulator. A terminal value function (VF), learned offline via temporal-difference learning, enables long-term reasoning with short roll-outs. A constraint manager modulates trajectory discounts based on violations, enforcing feasibility without penalty shaping.
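To make the structure concrete, the following is a minimal sketch of the two pieces the overview mentions: per-trajectory returns augmented with a learned terminal value, and the standard MPPI exponential weighting over sampled rollouts. The function names (`trajectory_returns`, `mppi_weights`) and the reward-maximizing convention are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def trajectory_returns(stage_rewards, terminal_states, value_fn):
    """Per-rollout objective: summed stage rewards plus a learned terminal
    value approximating the long-term return-to-go at the final state.

    stage_rewards:   (K, H) rewards for K rollouts over a short horizon H.
    terminal_states: (K, D) final states of the rollouts.
    value_fn:        callable mapping a batch of states to estimated values
                     (here, the VF learned offline via TD learning).
    """
    return stage_rewards.sum(axis=1) + value_fn(terminal_states)

def mppi_weights(returns, temperature=1.0):
    """Exponential MPPI weighting: better rollouts get larger weights."""
    shifted = returns - returns.max()      # shift for numerical stability
    w = np.exp(shifted / temperature)
    return w / w.sum()
```

The control update is then a weighted average of the sampled control perturbations using these weights; because the terminal VF summarizes everything beyond step H, the horizon H can be kept short without losing long-term reasoning.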

Abstract

Path Integral methods have demonstrated remarkable capabilities for solving non-linear stochastic optimal control problems through sampling-based optimization. However, their computational complexity grows linearly with the prediction horizon, limiting long-term reasoning, while constraints are merely enforced through handcrafted penalties. In this work, we propose a unified and efficient framework for enabling long-horizon reasoning and constraint enforcement within Model Predictive Path Integral (MPPI) control. First, we introduce a practical method to incorporate a terminal value function, learned offline via temporal-difference learning, to approximate the long-term cost-to-go. This allows for significantly shorter roll-outs while enabling infinite-horizon reasoning, thereby improving computational efficiency and motion performance. Second, we propose a discount modulation strategy that adjusts the return of sampled trajectories based on constraint violations. This provides a more interpretable and effective mechanism for enforcing constraints compared to traditional cost shaping. Our formulation retains the flexibility and sampling efficiency of MPPI while supporting structured integration of long-term objectives and constraint handling. We validate our approach on both simulated and real-world robotic locomotion tasks, demonstrating improved performance, constraint-awareness, and generalization under reduced computational budgets.
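The abstract's discount-modulation idea can be illustrated with a small sketch: when a rollout violates a constraint, its running discount is cut sharply, so rewards collected after the violation contribute little to the trajectory's return, down-weighting that rollout in the MPPI average without any handcrafted penalty term. The specific rule below (a one-shot drop from `gamma` to `gamma_viol`) is one plausible instantiation for illustration; the paper's exact modulation law may differ.

```python
def constraint_discounted_return(rewards, violations, gamma=0.99, gamma_viol=0.1):
    """Accumulate a discounted return where the per-step discount collapses
    after a constraint violation.

    rewards:    iterable of per-step rewards for one rollout.
    violations: iterable of booleans flagging a violation at each step.
    """
    ret, disc = 0.0, 1.0
    for r, v in zip(rewards, violations):
        ret += disc * r
        # A violation shrinks the discount, muting all subsequent rewards.
        disc *= gamma_viol if v else gamma
    return ret
```

With identical rewards, a rollout that violates a constraint early accumulates a much smaller return than a feasible one, which is the interpretable alternative to cost shaping that the abstract describes.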

Experimental Results

The proposed TD-CD-MPPI controller was evaluated both in simulation and on the Solo12 quadruped robot. We compared its performance against DIAL-MPC and Vanilla MPPI for various prediction horizons and sample numbers. Results indicate that incorporating a terminal value function enables long-horizon reasoning and constraint satisfaction while retaining real-time feasibility.

Average Reward vs Horizon
Average Reward vs Number of Samples
Average reward across horizons and number of samples: TD-CD-MPPI outperforms DIAL-MPC and Vanilla MPPI, maintaining high performance even with reduced horizon length or fewer sampled trajectories.

Overall, TD-CD-MPPI matches or exceeds the performance of DIAL-MPC while reducing the required sampling budget by up to 10× and GPU memory consumption by 30%. The controller generalizes effectively to out-of-distribution conditions such as ramp and stair climbing, maintaining stability and constraint satisfaction.

Horizon Comparison

This comparison illustrates the impact of the MPC planning horizon on control performance. While DIAL-MPC becomes unstable for short horizons, the proposed TD-CD-MPPI maintains stable walking even with reduced horizon length, thanks to the terminal value function that extends the controller’s effective planning horizon.

Figure 1 – Effect of the MPC horizon on locomotion stability. TD-CD-MPPI remains stable even with short horizons where DIAL-MPC fails.

Number of Samples Comparison

Here, we evaluate the controller’s robustness when the number of trajectory rollouts is reduced. TD-CD-MPPI preserves performance and constraint satisfaction with significantly fewer samples compared to DIAL-MPC, demonstrating its computational efficiency.

Figure 2 – Comparison of locomotion with different numbers of samples. TD-CD-MPPI achieves stable walking even with 5–10× fewer samples.

Stair Climbing Comparison

The following videos compare DIAL-MPC (left) and TD-CD-MPPI (right) when climbing stairs — a scenario outside the training distribution. Both controllers use the correct terrain model, but TD-CD-MPPI generalizes without retraining the value function, maintaining stable and constraint-consistent motion.

(a) DIAL-MPC baseline

(b) TD-CD-MPPI (ours)

Figure 3 – Comparison of stair climbing performance. TD-CD-MPPI adapts online and completes the climb successfully, while DIAL-MPC loses balance under the same conditions.

Hardware Experiments

Finally, the proposed controller was deployed on the real Solo12 quadruped robot. The experiments include flat-ground walking, ramp traversal, and staircase climbing. Despite dynamics mismatches and additional payload, TD-CD-MPPI transfers to real hardware without any fine-tuning or domain adaptation.

(a) Ramp traversal

(b) Ramp traversal

(c) Model weight mismatch

Figure 4 – Real-world deployment on the Solo12 robot. TD-CD-MPPI generalizes across different terrains and maintains constraint satisfaction under modeling uncertainties.

BibTeX

@ARTICLE{11248834,
  author={Crestaz, Pietro Noah and De Matteis, Ludovic and Chane-Sane, Elliot and Mansard, Nicolas and Del Prete, Andrea},
  journal={IEEE Robotics and Automation Letters}, 
  title={TD-CD-MPPI: Temporal-Difference Constraint-Discounted Model Predictive Path Integral Control}, 
  year={2026},
  volume={11},
  number={1},
  pages={498-505},
  keywords={Trajectory;Costs;Cognition;Optimal control;Temporal difference learning;Stochastic processes;Robot sensing systems;Computational modeling;Collision avoidance;Planning;Optimization and optimal control;legged robots;whole-body motion planning and control},
  doi={10.1109/LRA.2025.3632612}}