
RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies

★Equal Advising
1University of Michigan, 2Stanford University, 3Figure AI

TL;DR: RoboMME is a large-scale benchmark for memory-augmented robotic manipulation, evaluating how well models remember, reason, and act across temporal, spatial, object, and procedural memory.

RoboMME is a large-scale, cognitively motivated robotic benchmark for memory-augmented manipulation, comprising four task suites that target distinct memory types: (1) The Counting task suite emphasizes temporal memory, requiring robots to accumulate and reason over past events; (2) The Permanence task suite focuses on spatial memory, requiring tracking of object locations under occlusion and environmental changes; (3) The Reference task suite evaluates object memory, requiring identification under varied referential cues; (4) The Imitation task suite assesses procedural memory, measuring the ability to reproduce previously demonstrated behaviors.

Abstract

Memory is critical for long-horizon and history-dependent robotic manipulation. Such tasks often involve counting repeated actions or manipulating objects that become temporarily occluded. Recent vision-language-action (VLA) models have begun to incorporate memory mechanisms; however, their evaluations remain confined to narrow, non-standardized settings, which limits systematic understanding, comparison, and progress measurement. To address these challenges, we introduce RoboMME: a large-scale standardized benchmark for evaluating and advancing VLA models in long-horizon, history-dependent scenarios. Our benchmark comprises 16 manipulation tasks constructed under a carefully designed taxonomy that evaluates temporal, spatial, object, and procedural memory. We further develop a suite of 14 memory-augmented VLA variants built on the π0.5 backbone to systematically explore different memory representations across multiple integration strategies. Experimental results show that the effectiveness of memory representations is highly task-dependent, with each design offering distinct advantages and limitations across different tasks.

RoboMME Task Examples

Task 1.1: BinFill
Instruction: Put two green cubes into the bin and press the button to stop.
Task 1.2: PickXtimes
Instruction: Pick up the blue cube and place it on the target, repeating this pick-and-place action three times, then press the button to stop.
Task 1.3: SwingXtimes
Instruction: Pick up the red cube, move it to the right-side target and then to the left-side target, repeating this right-to-left swing motion three times, then put down the cube and press the button to stop.
Spatial directions (e.g., left/right) follow the robot base coordinate frame.
Task 1.4: StopCube
Instruction: Press the button to stop the cube exactly at the target on its third visit.
Task 2.1: VideoUnmask
Instruction: Watch the video carefully, then pick up the container hiding the green cube.
Red-bordered frames denote the video-based initial observation prior to execution.
Task 2.2: ButtonUnmask
Instruction: First press the button, then pick up the container hiding the green cube, finally pick up another container hiding the red cube.
Task 2.3: VideoUnmaskSwap
Instruction: Watch the video carefully, then pick up the container hiding the blue cube, finally pick up another container hiding the red cube.
Task 2.4: ButtonUnmaskSwap
Instruction: First press both buttons on the table, then pick up the container hiding the red cube.
Task 3.1: PickHighlight
Instruction: First press the button, then pick up all highlighted cubes, finally press the button again to stop.
Task 3.2: VideoRepick
Instruction: Watch the video carefully, then pick up the same cube that was previously picked up twice, and finally press the button to stop.
Task 3.3: VideoPlaceButton
Instruction: Watch the video carefully and place the blue cube on the target where it was placed immediately before the button was pressed.
Task 3.4: VideoPlaceOrder
Instruction: Watch the video carefully and place the green cube on the third target where it was placed.
Task 4.1: MoveCube
Instruction: Watch the video carefully, then move the cube to the target in the same manner shown in the video.
Task 4.2: InsertPeg
Instruction: Watch the video carefully, then grasp the same peg at the same end and insert it into the same side of the box as in the video.
Task 4.3: PatternLock
Instruction: Watch the video carefully, then use the stick attached to the robot to retrace the same pattern shown in the video.
Task 4.4: RouteStick
Instruction: Watch the video carefully, then use the stick attached to the robot to navigate around the sticks on the table, following the same path shown in the video.

MME-VLA Suite

Building on RoboMME, we construct a family of memory-augmented vision-language-action (VLA) models based on the π0.5 backbone, collectively termed the MME-VLA suite. We systematically compare different memory representations and their integration mechanisms.

MME-VLA Architecture
Framework of MME-VLA Suite. The top part illustrates three memory representations, each with two instantiations: (1) Symbolic Memory summarizes past interactions as high-level abstractions via language-based subgoals, optionally grounded to image pixels; (2) Perceptual Memory encodes history as raw visual tokens, using either token dropping to remove redundancy or uniform frame sampling to preserve essential context; (3) Recurrent Memory compresses history into fixed-size latent states through test-time training or recurrent memory transformers. The bottom part shows three end-to-end integration mechanisms for differentiable memory (perceptual and recurrent): (1) Memory-as-Context directly concatenates memory tokens with observation tokens; (2) Memory-as-Modulator applies adaptive LayerNorm to modulate the action expert; (3) Memory-as-Expert adds a separate lightweight memory expert that interacts with other experts through block-wise causal attention.
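The two simplest integration mechanisms above can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the function names, tensor shapes, and the random projection used for the adaptive LayerNorm parameters are all assumptions (Memory-as-Expert is omitted, since it requires a full multi-expert attention stack).

```python
import numpy as np

def memory_as_context(obs_tokens, mem_tokens):
    """Memory-as-Context: concatenate memory tokens with observation
    tokens along the sequence axis, so attention sees both."""
    return np.concatenate([mem_tokens, obs_tokens], axis=0)

def memory_as_modulator(h, mem_vector, eps=1e-5):
    """Memory-as-Modulator: a pooled memory vector predicts a per-feature
    scale and shift that modulate LayerNorm-ed hidden states (adaptive
    LayerNorm). The projection W is a random stand-in for a learned layer."""
    d = h.shape[-1]
    rng = np.random.default_rng(0)
    W = rng.standard_normal((mem_vector.shape[-1], 2 * d)) * 0.02  # hypothetical learned projection
    scale_shift = mem_vector @ W
    scale, shift = scale_shift[:d], scale_shift[d:]
    normed = (h - h.mean(-1, keepdims=True)) / np.sqrt(h.var(-1, keepdims=True) + eps)
    return normed * (1.0 + scale) + shift

obs = np.zeros((8, 16))   # 8 observation tokens, feature dim 16
mem = np.zeros((4, 16))   # 4 memory tokens
ctx = memory_as_context(obs, mem)
print(ctx.shape)          # (12, 16)
```

Concatenation grows the sequence (and attention cost) with memory length, while modulation keeps sequence length fixed, which is one reason the two mechanisms trade off differently in the experiments below.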

RoboMME Experiments

We fine-tune a total of 14 memory-augmented VLA variants based on π0.5:

  • Symbolic Memory Variants: SimpleSG (Simple Subgoal) or GroundSG (Grounded Subgoal), paired with Gemini (prompt-based Gemini-2.5-Pro), QwenVL (fine-tuned Qwen3-VL-4B), or Oracle (simulator ground truth) as the subgoal predictor → 2 VLA variants evaluated with 3 subgoal predictors
  • Perceptual Memory Variants: TokenDrop (token dropping) or FrameSamp (frame sampling), combined with the Context (memory-as-context), Modul (memory-as-modulator), or Expert (memory-as-expert) integration mechanism → 2 × 3 = 6 VLA variants
  • Recurrent Memory Variants: TTT (Test-Time Training) or RMT (Recurrent Memory Transformer), combined with the Context (memory-as-context), Modul (memory-as-modulator), or Expert (memory-as-expert) integration mechanism → 2 × 3 = 6 VLA variants
Memory Representation | Method               | Subgoal Predictor      | Integration Mechanism
Symbolic              | SimpleSG, GroundSG   | Gemini, QwenVL, Oracle | --
Perceptual            | TokenDrop, FrameSamp | --                     | Context, Modul, Expert
Recurrent             | TTT, RMT             | --                     | Context, Modul, Expert

Naming Convention: Method+Integration Mechanism for perceptual and recurrent variants, or Method+Subgoal Predictor for symbolic variants, e.g., FrameSamp+Modul or SimpleSG+QwenVL
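The naming convention above can be made concrete by enumerating all variant names; this is a small illustrative snippet (the variable names are our own, the method lists come from the table above):

```python
from itertools import product

# Symbolic memory: 2 fine-tuned VLA variants, each evaluated with 3 subgoal predictors.
symbolic = [f"{m}+{p}" for m, p in product(["SimpleSG", "GroundSG"],
                                           ["Gemini", "QwenVL", "Oracle"])]
# Perceptual and recurrent memory: 2 methods x 3 integration mechanisms each.
perceptual = [f"{m}+{i}" for m, i in product(["TokenDrop", "FrameSamp"],
                                             ["Context", "Modul", "Expert"])]
recurrent = [f"{m}+{i}" for m, i in product(["TTT", "RMT"],
                                            ["Context", "Modul", "Expert"])]

# 2 symbolic + 6 perceptual + 6 recurrent = 14 fine-tuned VLA variants in total.
print(2 + len(perceptual) + len(recurrent))  # 14
```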

Key Takeaways:

  • There is no "one-size-fits-all" memory design: No single representation or integration strategy dominates across all tasks.
  • Symbolic memory excels at counting and visual grounding, while perceptual memory is essential for motion mimicking.
  • Memory-as-Modulator is the most effective integration strategy for perceptual memory.

More specifically, we investigate the following research questions (RQs):

▶ RQ1: Which memory representations and integration mechanisms yield the strongest performance?

Across all MME-VLA variants:

  • Memory Representation: Perceptual Memory > Symbolic Memory > Recurrent Memory
  • Integration Mechanism: Memory-as-Modulator > Memory-as-Expert > Memory-as-Context
  • Within Perceptual Memory: Frame Sampling > Token Dropping
  • Within Symbolic Memory: Grounded Subgoal > Simple Subgoal
  • Recurrent Memory performs worst, likely due to training instability
MME-VLA Suite Results
▶ RQ2: Is high-level symbolic reasoning alone sufficient for memory-augmented manipulation?

High-level symbolic reasoning is powerful yet not sufficient on its own:

  • As the upper bound of symbolic memory, GroundSG+Oracle successfully solves many tasks (84% overall success rate), confirming the strong representational capacity of language for high-level reasoning (see leaderboard).
  • Performance still degrades on manipulation-intensive tasks such as StopCube and InsertPeg, where language offers limited guidance and precise visuomotor control becomes the primary bottleneck.
  • The VLA policy still struggles in cluttered scenes, often manipulating wrong objects and causing unintended collisions despite unambiguous subgoals (see simulation demo below).
▶ RQ3: How do humans perform on RoboMME?

Humans are strong but not perfect on RoboMME:

  • We reformulate each task into an online VideoQA problem: videos are revealed incrementally, humans choose the next high-level action from a fixed candidate set, and an oracle planner executes actions perfectly.
  • Humans reach 90.5% success but still fail to fully solve the benchmark, with consistent errors on long-horizon tasks (e.g., PatternLock) and time-sensitive tasks (e.g., StopCube) (see leaderboard).
  • RoboMME therefore remains challenging and serves as a rigorous testbed for memory-augmented policies.
  • Play with an interactive Hugging Face demo to test your memory →
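The online VideoQA protocol above can be sketched as a simple loop. This is a hypothetical illustration of the described procedure, not the actual evaluation harness; the function names and callback signatures are assumptions.

```python
def run_human_eval(frames, candidate_actions, choose_action, oracle_execute):
    """Online VideoQA sketch: frames are revealed one at a time, the human
    picks a high-level action from a fixed candidate set (or waits), and an
    oracle planner executes each chosen action perfectly."""
    revealed, trajectory = [], []
    for frame in frames:
        revealed.append(frame)                                # incremental reveal
        action = choose_action(revealed, candidate_actions)   # human decision (None = wait)
        if action is not None:
            trajectory.append(action)
            oracle_execute(action)                            # perfect execution
    return trajectory

# Toy usage: a "human" who acts only once the whole 3-frame video has been seen.
executed = []
traj = run_human_eval(
    frames=[0, 1, 2],
    candidate_actions=["pick", "press_button"],
    choose_action=lambda seen, cands: cands[0] if len(seen) == 3 else None,
    oracle_execute=executed.append,
)
print(traj)  # ['pick']
```

Because the oracle removes low-level control errors, any remaining human failures isolate memory and reasoning mistakes rather than motor skill.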

▶ RQ4: How does the effectiveness of different memory designs depend on task characteristics?

Different memory designs provide complementary strengths:

  • Symbolic Memory excels on Event-Salient (e.g., counting) and Short-Horizon Video Reasoning tasks
  • Perceptual Memory excels on Motion-Centric and Time-Sensitive tasks
  • MemER (Sridhar et al., 2025) excels on Dynamic Scene-Change tasks due to its reuse of all past keyframe images
Task Memory Correspondence

To better analyze the effectiveness of memory representations, we group the 16 tasks by their primary functional requirements as shown below:

Task Category

View the radar chart by task characteristics on the leaderboard →

▶ RQ5: How does memory affect the efficiency-performance balance of memory-augmented policies?

Perceptual memory achieves the best efficiency-performance balance:

  • FrameSamp/TokenDrop+Modul: consistent gains with modest cost increase
  • SimpleSG/GroundSG+QwenVL: ~3× the computation of π0.5
  • MemER (Sridhar et al., 2025): ~5× the computation of π0.5
Efficiency-Performance Comparison
▶ RQ6: Do the trends observed on RoboMME transfer to real-world robotic manipulation?

Yes. We evaluate four real-world tasks, one mirroring a simulation task from each task suite:

  • Counting: PutFruits → BinFill
  • Permanence: TrackCube → VideoUnmask/VideoUnmaskSwap
  • Reference: RepickBlock → VideoRepick
  • Imitation: DrawPattern → PatternLock

The results exhibit similar patterns: Symbolic Memory performs best on counting (PutFruits), while Perceptual Memory excels on motion-centric tasks (DrawPattern). On the remaining tasks, both achieve comparable performance.

Real World Tasks
Real World Results

View more real-world robot rollout demos below →

Simulation Rollout Demos

Task 1.1: BinFill Task Goal: put one red cube into the bin, then press the button to stop
Task 1.2: PickXtimes Task Goal: pick up the green cube and place it on the target, repeating this action three times, then press the button to stop
Task 1.3: SwingXtimes Task Goal: pick up the green cube, move it to the top of the right-side target, then move it to the top of the left-side target, repeating this back and forth motion two times, finally press the button to stop
Task 1.4: StopCube Task Goal: press the button to stop the cube just as it reaches the target for the second time
Task 2.1: VideoUnmask Task Goal: watch the video carefully, then pick up the container hiding the green cube
Task 2.2: ButtonUnmask Task Goal: first press the button, then pick up the container hiding the red cube
Task 2.3: VideoUnmaskSwap Task Goal: watch the video carefully, then pick up the container hiding the green cube, finally pick up another container hiding the blue cube
Task 2.4: ButtonUnmaskSwap Task Goal: first press both buttons on the table, then pick up the container hiding the blue cube, finally pick up another container hiding the green cube
Task 3.1: PickHighlight Task Goal: first press the button, then pick up all cubes that have been highlighted with white areas on the table
Task 3.2: VideoRepick Task Goal: watch the video carefully, then pick up and put down the same block that was previously picked up, repeating three times, finally press the button to stop
Task 3.3: VideoPlaceButton Task Goal: watch the video carefully, then place the green cube on the target right after the button was pressed
Task 3.4: VideoPlaceOrder Task Goal: watch the video carefully, then place the red cube on the first target it was previously placed on
Task 4.1: MoveCube Task Goal: watch the video carefully, then move the cube to the target in the same manner as before
Task 4.2: InsertPeg Task Goal: watch the video carefully, then grasp the same end of the same peg you've picked before and insert it into the same side of the box
Task 4.3: PatternLock Task Goal: watch the video carefully, then use the stick attached to the robot to retrace the same pattern
Task 4.4: RouteStick Task Goal: watch the video carefully, then use the stick attached to the robot to navigate around the sticks on the table, following the same path

For demonstration purposes, we visualize only the first 10 episodes for each task (the full evaluation contains 50). In simulation experiments, we use the front-view images for memory feature construction or VLM subgoal prediction. For the GroundSG policy, we overlay the predicted grounding information as yellow dots on the front-view images for visualization when it is available. Red-bordered frames indicate the video-based initial observation before execution.

Real Robot Rollout Demos

We evaluate four real-world tasks (PutFruits, TrackCube, RepickBlock, and DrawPattern) designed to mirror the BinFill, VideoUnmask/VideoUnmaskSwap, VideoRepick, and PatternLock tasks in simulation. Real-world results exhibit similar patterns to simulation, validating the transferability of our findings.

Task 1: PutFruits
Task Goal: put 1 fruit from the basket into the bin and press the button to stop.
GroundSG+QwenVL
FrameSamp+Modul
Task 2: TrackCube
Task Goal: watch the video carefully, then pick up the cup that hides the yellow cube.
GroundSG+QwenVL
FrameSamp+Modul
Task 3: RepickBlock
Task Goal: watch the video carefully, then pick up all the blocks that have been picked up before.
GroundSG+QwenVL
FrameSamp+Modul
Task 4: DrawPattern
Task Goal: watch the video carefully, then replicate the same path.
GroundSG+QwenVL
FrameSamp+Modul

In real-world experiments, we use the right-shoulder view images for memory feature construction or VLM subgoal prediction. For the GroundSG policy, we overlay the predicted grounding information as red dots on the right-shoulder view images for visualization when it is available. Red-bordered frames indicate the video-based initial observation before execution.