Benchmarking memory-augmented robotic generalist policies
(Data updated on 03/01/2026)
A benchmark for evaluating memory-augmented robotic policies across 16 tasks in four suites: Counting, Permanence, Reference, and Imitation.
Models are evaluated on 800 test episodes (50 per task across the 16 tasks). We recommend reporting results averaged over multiple runs with different random seeds.
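The evaluation protocol above (16 tasks, 50 episodes each, results averaged over seeds) can be sketched as a small aggregation routine. The results layout, the task names, and the even four-tasks-per-suite split are illustrative assumptions, not the official format:

```python
from statistics import mean, stdev

# Hypothetical aggregation of RoboMME results into per-suite and overall
# success rates, averaged across random seeds. The dict layout below is an
# assumption for illustration only.
SUITES = ["Counting", "Permanence", "Reference", "Imitation"]

def success_rate(episodes):
    """Fraction of successful episodes (1 = success, 0 = failure)."""
    return sum(episodes) / len(episodes)

def aggregate(results):
    """results: {seed: {(suite, task): [0/1 outcome per episode]}}.

    Returns {metric: (mean, std across seeds)} for each suite and Overall.
    """
    per_seed = []
    for tasks in results.values():
        row = {}
        for suite in SUITES:
            # Pool all episodes belonging to this suite's tasks.
            eps = [e for (s, _), outcomes in tasks.items() if s == suite
                   for e in outcomes]
            row[suite] = success_rate(eps)
        # Overall rate over all 800 episodes for this seed.
        row["Overall"] = success_rate(
            [e for outcomes in tasks.values() for e in outcomes])
        per_seed.append(row)
    # Mean and std across seeds, as recommended above.
    report = {}
    for key in SUITES + ["Overall"]:
        vals = [row[key] for row in per_seed]
        report[key] = (mean(vals), stdev(vals) if len(vals) > 1 else 0.0)
    return report
```

Reporting mean and standard deviation across seeds makes small per-suite differences between policies easier to judge than a single-run score.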
To get started:
(1) Download the dataset and the RoboMME repository.
(2) Train and evaluate your policy.
(3) Submit a PR following the submission instructions.