
RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies

★Equal Advising
1University of Michigan, 2Stanford University, 3Figure AI

TL;DR: RoboMME is a large-scale benchmark for memory-augmented robotic manipulation, evaluating how well models remember, reason, and act across temporal, spatial, object, and procedural memory.

RoboMME is a large-scale, cognitively motivated robotic benchmark for memory-augmented manipulation, comprising four task suites that target distinct memory types: (1) The Counting task suite emphasizes temporal memory, requiring robots to accumulate and reason over past events; (2) The Permanence task suite focuses on spatial memory, requiring tracking of object locations under occlusion and environmental changes; (3) The Reference task suite evaluates object memory, requiring identification under varied referential cues; (4) The Imitation task suite assesses procedural memory, measuring the ability to reproduce previously demonstrated behaviors.

Abstract

Memory is critical for long-horizon and history-dependent robotic manipulation. Such tasks often involve counting repeated actions or manipulating objects that become temporarily occluded. Recent vision-language-action (VLA) models have begun to incorporate memory mechanisms; however, their evaluations remain confined to narrow, non-standardized settings, which limits systematic understanding, comparison, and progress measurement. To address these challenges, we introduce RoboMME: a large-scale standardized benchmark for evaluating and advancing VLA models in long-horizon, history-dependent scenarios. Our benchmark comprises 16 manipulation tasks constructed under a carefully designed taxonomy that evaluates temporal, spatial, object, and procedural memory. We further develop a suite of 14 memory-augmented VLA variants built on the π0.5 backbone to systematically explore different memory representations across multiple integration strategies. Experimental results show that the effectiveness of memory representations is highly task-dependent, with each design offering distinct advantages and limitations across different tasks.

RoboMME Task Examples

Task 1.1: BinFill
Instruction: Put two green cubes into the bin and press the button to stop.
Task 1.2: PickXtimes
Instruction: Pick up the blue cube and place it on the target, repeating this pick-and-place action three times, then press the button to stop.
Task 1.3: SwingXtimes
Instruction: Pick up the red cube, move it to the right-side target and then to the left-side target, repeating this right-to-left swing motion three times, then put down the cube and press the button to stop.
Spatial directions (e.g., left/right) follow the robot base coordinate frame.
Task 1.4: StopCube
Instruction: Press the button to stop the cube exactly at the target on its third visit.
Task 2.1: VideoUnmask
Instruction: Watch the video carefully, then pick up the container hiding the green cube.
Red-bordered frames denote the video-based initial observation prior to execution.
Task 2.2: ButtonUnmask
Instruction: First press the button, then pick up the container hiding the green cube, finally pick up another container hiding the red cube.
Task 2.3: VideoUnmaskSwap
Instruction: Watch the video carefully, then pick up the container hiding the blue cube, finally pick up another container hiding the red cube.
Task 2.4: ButtonUnmaskSwap
Instruction: First press both buttons on the table, then pick up the container hiding the red cube.
Task 3.1: PickHighlight
Instruction: First press the button, then pick up all highlighted cubes, finally press the button again to stop.
Task 3.2: VideoRepick
Instruction: Watch the video carefully, then pick up the same cube that was previously picked up twice, and finally press the button to stop.
Task 3.3: VideoPlaceButton
Instruction: Watch the video carefully and place the blue cube on the target where it was placed immediately before the button was pressed.
Task 3.4: VideoPlaceOrder
Instruction: Watch the video carefully and place the green cube on the third target where it was placed.
Task 4.1: MoveCube
Instruction: Watch the video carefully, then move the cube to the target in the same manner shown in the video.
Task 4.2: InsertPeg
Instruction: Watch the video carefully, then grasp the same peg at the same end and insert it into the same side of the box as in the video.
Task 4.3: PatternLock
Instruction: Watch the video carefully, then use the stick attached to the robot to retrace the same pattern shown in the video.
Task 4.4: RouteStick
Instruction: Watch the video carefully, then use the stick attached to the robot to navigate around the sticks on the table, following the same path shown in the video.

MME-VLA Suite

Building on RoboMME, we construct a family of memory-augmented vision-language-action (VLA) models based on the π0.5 backbone, collectively termed the MME-VLA suite. We systematically compare different memory representations and their integration mechanisms.

MME-VLA Architecture
Framework of MME-VLA Suite. The top part illustrates three memory representations, each with two instantiations: (1) Symbolic Memory summarizes past interactions as high-level abstractions via language-based subgoals, optionally grounded to image pixels; (2) Perceptual Memory encodes history as raw visual tokens, using either token dropping to remove redundancy or uniform frame sampling to preserve essential context; (3) Recurrent Memory compresses history into fixed-size latent states through test-time training or recurrent memory transformers. The bottom part shows three end-to-end integration mechanisms for differentiable memory (perceptual and recurrent): (1) Memory-as-Context directly concatenates memory tokens with observation tokens; (2) Memory-as-Modulator applies adaptive LayerNorm to modulate the action expert; (3) Memory-as-Expert adds a separate lightweight memory expert that interacts with other experts through block-wise causal attention.
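The two simplest integration mechanisms above can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the function names, tensor shapes, and the random projection used for the adaptive LayerNorm parameters are all assumptions (Memory-as-Expert is omitted, since it requires a full multi-expert attention stack).

```python
import numpy as np

def memory_as_context(obs_tokens, mem_tokens):
    """Memory-as-Context: concatenate memory tokens with observation
    tokens along the sequence axis, so attention sees both."""
    return np.concatenate([mem_tokens, obs_tokens], axis=0)

def memory_as_modulator(h, mem_vector, eps=1e-5):
    """Memory-as-Modulator: a pooled memory vector predicts a per-feature
    scale and shift that modulate LayerNorm-ed hidden states (adaptive
    LayerNorm). The projection W is a random stand-in for a learned layer."""
    d = h.shape[-1]
    rng = np.random.default_rng(0)
    W = rng.standard_normal((mem_vector.shape[-1], 2 * d)) * 0.02  # hypothetical learned projection
    scale_shift = mem_vector @ W
    scale, shift = scale_shift[:d], scale_shift[d:]
    normed = (h - h.mean(-1, keepdims=True)) / np.sqrt(h.var(-1, keepdims=True) + eps)
    return normed * (1.0 + scale) + shift

obs = np.zeros((8, 16))   # 8 observation tokens, feature dim 16
mem = np.zeros((4, 16))   # 4 memory tokens
ctx = memory_as_context(obs, mem)
print(ctx.shape)          # (12, 16)
```

Concatenation grows the sequence (and attention cost) with memory length, while modulation keeps sequence length fixed, which is one reason the two mechanisms trade off differently in the experiments below.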

RoboMME Experiments

We fine-tune a total of 14 memory-augmented VLA variants based on π0.5:

  • Symbolic Memory Variants: SimpleSG (Simple Subgoal) or GroundSG (Grounded Subgoal), paired with Gemini (prompt-based Gemini-2.5-Pro), QwenVL (fine-tuned Qwen3-VL-4B), or Oracle (simulator ground truth) as the subgoal predictor → 2 VLA variants evaluated with 3 subgoal predictors
  • Perceptual Memory Variants: TokenDrop (token dropping) or FrameSamp (frame sampling), combined with the Context (memory-as-context), Modul (memory-as-modulator), or Expert (memory-as-expert) integration mechanism → 2 × 3 = 6 VLA variants
  • Recurrent Memory Variants: TTT (Test-Time Training) or RMT (Recurrent Memory Transformer), combined with the Context (memory-as-context), Modul (memory-as-modulator), or Expert (memory-as-expert) integration mechanism → 2 × 3 = 6 VLA variants
Memory Representation | Method               | Subgoal Predictor      | Integration Mechanism
Symbolic              | SimpleSG, GroundSG   | Gemini, QwenVL, Oracle | --
Perceptual            | TokenDrop, FrameSamp | --                     | Context, Modul, Expert
Recurrent             | TTT, RMT             | --                     | Context, Modul, Expert

Naming Convention: Method+Integration Mechanism for perceptual and recurrent variants, or Method+Subgoal Predictor for symbolic variants, e.g., FrameSamp+Modul or SimpleSG+QwenVL
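The naming convention above can be made concrete by enumerating all variant names; this is a small illustrative snippet (the variable names are our own, the method lists come from the table above):

```python
from itertools import product

# Symbolic memory: 2 fine-tuned VLA variants, each evaluated with 3 subgoal predictors.
symbolic = [f"{m}+{p}" for m, p in product(["SimpleSG", "GroundSG"],
                                           ["Gemini", "QwenVL", "Oracle"])]
# Perceptual and recurrent memory: 2 methods x 3 integration mechanisms each.
perceptual = [f"{m}+{i}" for m, i in product(["TokenDrop", "FrameSamp"],
                                             ["Context", "Modul", "Expert"])]
recurrent = [f"{m}+{i}" for m, i in product(["TTT", "RMT"],
                                            ["Context", "Modul", "Expert"])]

# 2 symbolic + 6 perceptual + 6 recurrent = 14 fine-tuned VLA variants in total.
print(2 + len(perceptual) + len(recurrent))  # 14
```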

Key Takeaways:

  • There is no "one-size-fits-all" memory design: No single representation or integration strategy dominates across all tasks.
  • Symbolic memory excels at counting and visual grounding, while perceptual memory is essential for motion mimicking.
  • Memory-as-Modulator is the most effective integration strategy for perceptual memory.

More specifically, we investigate the following research questions (RQs):

▶ RQ1: Which memory representations and integration mechanisms yield the strongest performance?

Across all MME-VLA variants:

  • Memory Representation: Perceptual Memory > Symbolic Memory > Recurrent Memory
  • Integration Mechanism: Memory-as-Modulator > Memory-as-Expert > Memory-as-Context
  • Within Perceptual Memory: Frame Sampling > Token Dropping
  • Within Symbolic Memory: Grounded Subgoal > Simple Subgoal
  • Recurrent Memory performs worst, likely due to training instability
MME-VLA Suite Results
▶ RQ2: Is high-level symbolic reasoning alone sufficient for memory-augmented manipulation?

High-level symbolic reasoning is powerful yet not sufficient on its own:

  • As the upper bound of symbolic memory, GroundSG+Oracle successfully solves many tasks (84% overall success rate), confirming the strong representational capacity of language for high-level reasoning (see leaderboard).
  • Performance still degrades on manipulation-intensive tasks such as StopCube and InsertPeg, where language offers limited guidance and precise visuomotor control becomes the primary bottleneck.
  • The VLA policy still struggles in cluttered scenes, often manipulating wrong objects and causing unintended collisions despite unambiguous subgoals (see simulation demo below).
▶ RQ3: How do humans perform on RoboMME?

Humans are strong but not perfect on RoboMME:

  • We reformulate each task into an online VideoQA problem: videos are revealed incrementally, humans choose the next high-level action from a fixed candidate set, and an oracle planner executes actions perfectly.
  • Humans reach 90.5% success but still fail to fully solve the benchmark, with consistent errors on long-horizon tasks (e.g., PatternLock) and time-sensitive tasks (e.g., StopCube) (see leaderboard).
  • RoboMME therefore remains challenging and serves as a rigorous testbed for memory-augmented policies.
  • Play with an interactive Hugging Face demo to test your memory →
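The online VideoQA protocol above can be sketched as a simple loop. This is a hypothetical illustration of the described procedure, not the actual evaluation harness; the function names and callback signatures are assumptions.

```python
def run_human_eval(frames, candidate_actions, choose_action, oracle_execute):
    """Online VideoQA sketch: frames are revealed one at a time, the human
    picks a high-level action from a fixed candidate set (or waits), and an
    oracle planner executes each chosen action perfectly."""
    revealed, trajectory = [], []
    for frame in frames:
        revealed.append(frame)                                # incremental reveal
        action = choose_action(revealed, candidate_actions)   # human decision (None = wait)
        if action is not None:
            trajectory.append(action)
            oracle_execute(action)                            # perfect execution
    return trajectory

# Toy usage: a "human" who acts only once the whole 3-frame video has been seen.
executed = []
traj = run_human_eval(
    frames=[0, 1, 2],
    candidate_actions=["pick", "press_button"],
    choose_action=lambda seen, cands: cands[0] if len(seen) == 3 else None,
    oracle_execute=executed.append,
)
print(traj)  # ['pick']
```

Because the oracle removes low-level control errors, any remaining human failures isolate memory and reasoning mistakes rather than motor skill.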

▶ RQ4: How does the effectiveness of different memory designs depend on task characteristics?

Different memory designs provide complementary strengths:

  • Symbolic Memory excels on Event-Salient (e.g., counting) and Short-Horizon Video Reasoning tasks
  • Perceptual Memory excels on Motion-Centric and Time-Sensitive tasks
  • MemER (Sridhar et al., 2025) excels on Dynamic Scene-Change tasks due to its reuse of all past keyframe images
Task Memory Correspondence

To better analyze the effectiveness of memory representations, we group the 16 tasks by their primary functional requirements as shown below:

Task Category

View the radar chart by task characteristics on the leaderboard →

▶ RQ5: How does memory affect the efficiency-performance balance of memory-augmented policies?

Perceptual memory achieves the best efficiency-performance balance:

  • FrameSamp/TokenDrop+Modul: consistent gains with modest cost increase
  • SimpleSG/GroundSG+QwenVL: ~3× the computation of π0.5
  • MemER (Sridhar et al., 2025): ~5× the computation of π0.5
Efficiency-Performance Comparison
▶ RQ6: Do the trends observed on RoboMME transfer to real-world robotic manipulation?

Yes. We evaluate four real-world tasks, one mirroring a simulation task from each task suite:

  • Counting: PutFruits → BinFill
  • Permanence: TrackCube → VideoUnmask/VideoUnmaskSwap
  • Reference: RepickBlock → VideoRepick
  • Imitation: DrawPattern → PatternLock

The results exhibit similar patterns: Symbolic Memory performs best on counting (PutFruits), while Perceptual Memory excels on motion-centric tasks (DrawPattern). On the remaining tasks, both achieve comparable performance.

Real World Tasks
Real World Results

View more real-world robot rollout demos below →

Simulation Rollout Demos

Task 1.1: BinFill Task Goal: put one red cube into the bin, then press the button to stop
Task 1.2: PickXtimes Task Goal: pick up the green cube and place it on the target, repeating this action three times, then press the button to stop
Task 1.3: SwingXtimes Task Goal: pick up the green cube, move it to the top of the right-side target, then move it to the top of the left-side target, repeating this back and forth motion two times, finally press the button to stop
Task 1.4: StopCube Task Goal: press the button to stop the cube just as it reaches the target for the second time
Task 2.1: VideoUnmask Task Goal: watch the video carefully, then pick up the container hiding the green cube
Task 2.2: ButtonUnmask Task Goal: first press the button, then pick up the container hiding the red cube
Task 2.3: VideoUnmaskSwap Task Goal: watch the video carefully, then pick up the container hiding the green cube, finally pick up another container hiding the blue cube
Task 2.4: ButtonUnmaskSwap Task Goal: first press both buttons on the table, then pick up the container hiding the blue cube, finally pick up another container hiding the green cube
Task 3.1: PickHighlight Task Goal: first press the button, then pick up all cubes that have been highlighted with white areas on the table
Task 3.2: VideoRepick Task Goal: watch the video carefully, then pick up and put down the same block that was previously picked up, repeating three times, finally press the button to stop
Task 3.3: VideoPlaceButton Task Goal: watch the video carefully, then place the green cube on the target right after the button was pressed
Task 3.4: VideoPlaceOrder Task Goal: watch the video carefully, then place the red cube on the first target it was previously placed on
Task 4.1: MoveCube Task Goal: watch the video carefully, then move the cube to the target in the same manner as before
Task 4.2: InsertPeg Task Goal: watch the video carefully, then grasp the same end of the same peg you've picked before and insert it into the same side of the box
Task 4.3: PatternLock Task Goal: watch the video carefully, then use the stick attached to the robot to retrace the same pattern
Task 4.4: RouteStick Task Goal: watch the video carefully, then use the stick attached to the robot to navigate around the sticks on the table, following the same path

For demonstration purposes, we visualize only the first 10 episodes for each task (the full evaluation contains 50). In simulation experiments, we use the front-view images for memory feature construction or VLM subgoal prediction. For the GroundSG policy, we overlay the predicted grounding information as yellow dots on the front-view images for visualization when it is available. Red-bordered frames indicate the video-based initial observation before execution.

Real Robot Rollout Demos

We evaluate four real-world tasks (PutFruits, TrackCube, RepickBlock, and DrawPattern) designed to mirror the BinFill, VideoUnmask/VideoUnmaskSwap, VideoRepick, and PatternLock tasks in simulation. Real-world results exhibit similar patterns to simulation, validating the transferability of our findings.

Task 1: PutFruits
Task Goal: put 1 fruit from the basket into the bin and press the button to stop.
GroundSG+QwenVL
FrameSamp+Modul
Task 2: TrackCube
Task Goal: watch the video carefully, then pick up the cup that hides the yellow cube.
GroundSG+QwenVL
FrameSamp+Modul
Task 3: RepickBlock
Task Goal: watch the video carefully, then pick up all the blocks that have been picked up before.
GroundSG+QwenVL
FrameSamp+Modul
Task 4: DrawPattern
Task Goal: watch the video carefully, then replicate the same path.
GroundSG+QwenVL
FrameSamp+Modul

In real-world experiments, we use the right-shoulder view images for memory feature construction or VLM subgoal prediction. For the GroundSG policy, we overlay the predicted grounding information as red dots on the right-shoulder view images for visualization when it is available. Red-bordered frames indicate the video-based initial observation before execution.