Benchmarking memory-augmented robotic generalist policies
(Data updated on 03/01/2026)
A benchmark for evaluating memory-augmented robotic policies across 16 tasks in four suites: Counting, Permanence, Reference, and Imitation.
Models are evaluated on 800 test episodes (50 per task across the 16 tasks). We recommend reporting results averaged over multiple runs with different random seeds.
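The evaluation protocol above (16 tasks, 50 episodes each, results averaged over seeds) can be sketched as a small aggregation routine. The results layout, the task names, and the even four-tasks-per-suite split are illustrative assumptions, not the official format:

```python
from statistics import mean, stdev

# Hypothetical aggregation of RoboMME results into per-suite and overall
# success rates, averaged across random seeds. The dict layout below is an
# assumption for illustration only.
SUITES = ["Counting", "Permanence", "Reference", "Imitation"]

def success_rate(episodes):
    """Fraction of successful episodes (1 = success, 0 = failure)."""
    return sum(episodes) / len(episodes)

def aggregate(results):
    """results: {seed: {(suite, task): [0/1 outcome per episode]}}.

    Returns {metric: (mean, std across seeds)} for each suite and Overall.
    """
    per_seed = []
    for tasks in results.values():
        row = {}
        for suite in SUITES:
            # Pool all episodes belonging to this suite's tasks.
            eps = [e for (s, _), outcomes in tasks.items() if s == suite
                   for e in outcomes]
            row[suite] = success_rate(eps)
        # Overall rate over all 800 episodes for this seed.
        row["Overall"] = success_rate(
            [e for outcomes in tasks.values() for e in outcomes])
        per_seed.append(row)
    # Mean and std across seeds, as recommended above.
    report = {}
    for key in SUITES + ["Overall"]:
        vals = [row[key] for row in per_seed]
        report[key] = (mean(vals), stdev(vals) if len(vals) > 1 else 0.0)
    return report
```

Reporting mean and standard deviation across seeds makes small per-suite differences between policies easier to judge than a single-run score.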
To get started:
(1) Download the dataset and the RoboMME repository.
(2) Train and evaluate your policy.
(3) Submit a PR following the submission instructions.