๐Ÿ† RoboMME Leaderboard

Benchmarking memory-augmented robotic generalist policies

16 Tasks · 800 Test Episodes · — Models Evaluated

(Data updated on 03/01/2026)

📊 Leaderboard


📋 About

A benchmark for evaluating memory-augmented robotic policies across 16 tasks in four suites: Counting, Permanence, Reference, and Imitation.

โš–๏ธ Evaluation

Models are evaluated on 800 test episodes (50 per task). Reporting results averaged over multiple runs with different random seeds is recommended.
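The aggregation implied by this protocol (per-task success rates averaged over seeds, then over tasks) can be sketched as below. This is an illustrative assumption, not the official evaluation code; the `mean_success` helper and the results layout are hypothetical.

```python
# Hypothetical sketch of RoboMME-style score aggregation.
# Assumed layout: results[task][seed] = list of 0/1 episode outcomes.

NUM_TASKS = 16
EPISODES_PER_TASK = 50  # 16 tasks x 50 episodes = 800 test episodes

def mean_success(results):
    """Average per-task success rates across seeds first, then across tasks."""
    per_task = []
    for task, seeds in results.items():
        seed_rates = [sum(eps) / len(eps) for eps in seeds.values()]
        per_task.append(sum(seed_rates) / len(seed_rates))
    return sum(per_task) / len(per_task)
```

Averaging seeds before tasks keeps every task equally weighted even if some runs drop episodes.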

🔄 Submit

(1) Download the dataset and the RoboMME repository, (2) train and evaluate your policy, (3) submit a PR following the submission instructions.