Simulation Evaluation

Always evaluate in simulation first, even if you have a real robot. Sim evaluation is fast, safe, and gives you a reproducible baseline number you can compare to after retraining.

```bash
source ~/lerobot-env/bin/activate

# Evaluate your best checkpoint (replace step_050000 with your checkpoint step)
python -m lerobot.scripts.eval \
  --pretrained-policy-name-or-path \
    ~/lerobot-policies/pick-place-v1/checkpoints/step_050000 \
  --env.name gym_pusht/PushT-v0 \
  --eval.n-episodes 20 \
  --eval.use-async-envs false

# Outputs: success_rate, mean_reward, episode_videos/
```
What to expect: A well-trained policy on 50 sim demonstrations should achieve a 60–85% success rate in simulation. Below 40% suggests a dataset quality issue. Above 85% means the task is too easy or the sim environment is too forgiving — try a harder variant.
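These bands can be turned into a quick triage helper. A minimal sketch (the function name and the handling of the unstated 40–60% gap are assumptions, not part of LeRobot):

```python
def triage_sim_result(success_rate: float) -> str:
    """Rough triage of a simulation success rate using the bands
    described above (hypothetical helper, not a LeRobot API)."""
    if success_rate < 0.40:
        return "likely dataset quality issue - re-inspect demonstrations"
    if success_rate > 0.85:
        return "task may be too easy - try a harder variant"
    return "reasonable baseline - proceed to real-robot evaluation"

print(triage_sim_result(0.70))  # -> reasonable baseline - proceed to real-robot evaluation
```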

Real Robot Safety Checklist

If you are evaluating on a real robot, run through this checklist before your first rollout. An untested policy can move in unexpected ways.

  • Clear the workspace of any objects not part of the task. The policy learned to act in a specific visual context — unexpected objects can cause erratic behavior.
  • Stay at the emergency stop (E-stop) or be ready to press Ctrl+C for the entire evaluation session. Do not walk away from a running policy.
  • Start with speed limited to 50% of maximum. Reduce to 30% if the first trial looks jerky or imprecise.
  • Position objects to match your training workspace setup exactly. Use the same camera angle, same lighting, same object colors. Distribution shift is the most common cause of zero real-world success rate.
  • Never command motion beyond the physical joint limits of your robot. Check these in your robot config before the first run.
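The speed and joint-limit rules above can also be enforced in software as a final safety net between the policy and the motors. A rough NumPy sketch (the function, limit values, and step size are illustrative assumptions, not LeRobot APIs):

```python
import numpy as np

def clamp_action(action, joint_min, joint_max, prev_action, max_step):
    """Clip a policy action to joint limits, then rate-limit how far it
    can move per control step (a crude software speed limit)."""
    action = np.clip(action, joint_min, joint_max)
    delta = np.clip(action - prev_action, -max_step, max_step)
    return prev_action + delta

# Example: an out-of-range, fast command gets clipped and slowed down
prev = np.zeros(3)
a = clamp_action(np.array([2.0, -2.0, 0.1]),
                 joint_min=np.array([-1.5, -1.5, -1.5]),
                 joint_max=np.array([1.5, 1.5, 1.5]),
                 prev_action=prev, max_step=0.2)
# a is [0.2, -0.2, 0.1]: within limits and at most 0.2 from the previous command
```

Halving max_step roughly halves the maximum joint speed, which is one way to implement the 50% speed cap from the checklist.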

Real Robot Evaluation Protocol

Run exactly 20 trials. This gives you a usable success rate estimate, though the uncertainty is still substantial: at 20 trials, the 95% confidence interval is roughly ±20 percentage points for rates near 50%. Record each trial on video — you will need the footage to diagnose failure modes.

```bash
# Run the policy on your real robot
python -m lerobot.scripts.control_robot \
  --robot-path lerobot/configs/robot/so100.yaml \
  --control-mode eval \
  --pretrained-policy-name-or-path \
    ~/lerobot-policies/pick-place-v1/checkpoints/step_050000 \
  --eval.n-episodes 20 \
  --record-video 1
```

After each trial, manually score it: 1 for complete task success, 0 for any failure (partial grasps, drops, misses). Your success rate is the sum divided by 20.
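The sum-divided-by-20 point estimate is more useful when paired with a confidence interval, so you know how much to trust it. A sketch using the Wilson score interval (standard statistics, not a LeRobot utility):

```python
import math

def success_rate_ci(successes: int, n: int, z: float = 1.96):
    """Point estimate and ~95% Wilson score interval for a success
    rate estimated from n binary (success/failure) trials."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return p, max(0.0, center - half), min(1.0, center + half)

# Example: 14 successes out of 20 trials
p, lo, hi = success_rate_ci(successes=14, n=20)
# p = 0.70, with the 95% interval spanning roughly 0.48 to 0.85
```

The wide interval is the point: with 20 trials, small differences in measured success rate between two checkpoints are often noise.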

Diagnosing Failure Modes

Watch your video recordings and categorize failures. Most failures fall into one of three categories:

Data quality

Inconsistent approach trajectory — the arm never fully commits to the grasp

The policy is averaging across multiple grasp strategies in your training data. This happens when some demonstrations approach from the left and others from the right, or when gripper close timing is inconsistent. Fix: re-record with a single, deliberate strategy throughout all demonstrations.
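This averaging failure is easy to reproduce numerically: a regressor trained with a mean-squared-error loss on two inconsistent strategies converges toward the mean of the two — a trajectory nobody demonstrated. A toy illustration with made-up numbers:

```python
import numpy as np

# Approach angles (radians) from two inconsistent demo strategies
left_demos = np.full(25, -0.6)   # demos approaching from the left
right_demos = np.full(25, 0.6)   # demos approaching from the right

# The optimum of an MSE loss is the mean of the targets...
predicted = np.concatenate([left_demos, right_demos]).mean()

# ...which here is 0.0: straight down the middle, matching neither strategy.
print(predicted)  # -> 0.0
```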

Model capacity

Trajectory looks reasonable but precision is off by 1–2cm consistently

The model is learning the right behavior but lacks the capacity to be precise. This happens when chunk_size is too short (not enough planning horizon) or when dim_feedforward is too small. Fix: increase chunk_size to 150, retrain. Or add more diverse demonstrations to regularize the network.
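For action-chunking policies like ACT, chunk_size is the planning horizon measured in control steps, so the horizon in seconds is chunk_size divided by the control frequency. A quick sanity check (the 30 Hz rate is an assumption; substitute your robot's control rate):

```python
control_hz = 30  # assumed control frequency; use your robot's actual rate
for chunk_size in (100, 150):
    horizon_s = chunk_size / control_hz
    print(f"chunk_size={chunk_size} -> {horizon_s:.2f} s of planned motion")
```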

Distribution shift

Works perfectly in some positions, fails completely in others

The object positions during evaluation are outside the distribution of your training data. The policy has not seen those positions before. Fix: collect more demonstrations with more diverse object positions, or constrain your evaluation to positions that are well-represented in your training data.
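A quick way to flag out-of-distribution placements is to compare each evaluation position against the footprint of your training positions. A crude sketch, assuming you can export object (x, y) positions from your training episodes (the function and data layout are illustrative assumptions):

```python
import numpy as np

def in_training_support(eval_xy, train_xy, margin=0.0):
    """True if eval_xy falls inside the axis-aligned bounding box of the
    training positions, expanded by `margin`. Crude, but it catches
    obviously out-of-distribution object placements."""
    lo = train_xy.min(axis=0) - margin
    hi = train_xy.max(axis=0) + margin
    return bool(np.all((eval_xy >= lo) & (eval_xy <= hi)))

# Made-up training positions (metres in the robot frame)
train = np.array([[0.10, 0.20], [0.30, 0.25], [0.22, 0.18]])
inside = in_training_support(np.array([0.20, 0.22]), train)   # True
outside = in_training_support(np.array([0.50, 0.40]), train)  # False
```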

Unit 5 Complete When...

You have run 20 evaluation trials (in sim or on your real robot) and measured a success rate. You have watched all failure-mode videos and identified whether the primary failure is data quality, model capacity, or distribution shift. You have this diagnosis written down — you will use it to guide your data collection in Unit 6.