Vision-Language-Action Models

VLA-Pro: Cross-Task Procedural Memory Transfer for Vision-Language-Action Models

arXiv preprint

Overview of VLA-Pro.
VLA-Pro retrieves task-relevant procedural memories and fuses task-specific LoRA adapters during inference.
59.3%: RoboTwin success with VLA-Pro on pi0.5
20.9%: RLBench zero-shot average success
65.0%: Real-world average success with VLA-Pro
+207%: RoboTwin gain on RDT backbone
+51%: RLBench improvement over pi0.5 baseline
+59.2: Real-world percentage-point improvement

Abstract

Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation, yet they still struggle to generalize to unseen tasks that require transferring relevant experience across objects, scenes, and action patterns. This paper proposes VLA-Pro, a plug-and-play framework that enhances cross-task generalization by storing task-relevant procedural memories at training time and transferring them at inference time.

VLA-Pro stores task-specific LoRA adapters as parameterized procedural memories during training. At inference time, it retrieves relevant procedural memories based on the current multi-modal context and dynamically fuses them to generate the current action chunk. Experiments on RoboTwin, RLBench, and real-world manipulation tasks show that VLA-Pro consistently improves cross-task generalization across multiple backbones.

Highlights

Procedural Memory

Stores seen-task procedural states and task-specific LoRA adapters as retrievable, executable memories.

Action-Aware Retrieval

Matches the current procedural state using action type, object geometry, end-effector orientation, and target interaction point.

LoRA Fusion

Converts memory similarities into fusion coefficients and merges top-k task adapters for the current execution stage.

Cross-Task Transfer

Improves unseen-task execution on RoboTwin, RLBench, and real-world robotic manipulation scenarios.
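To make the action-aware retrieval highlight concrete, here is a minimal sketch of what a procedural state and a similarity score over it could look like. The four fields mirror the cues listed above; everything else (the class name `ProceduralState`, the vector encodings, the per-field weights, and the scoring formula) is an assumption made for illustration and is not taken from the paper.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class ProceduralState:
    """Hypothetical container for the four retrieval cues."""
    action_type: str             # e.g. "press", "pull", "place"
    object_geometry: np.ndarray  # assumed shape descriptor vector
    ee_orientation: np.ndarray   # assumed end-effector direction vector
    target_point: np.ndarray     # assumed xyz target interaction point


def _cosine(a, b):
    """Cosine similarity, returning 0.0 for zero-length vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def state_similarity(query, memory, weights=(0.4, 0.2, 0.2, 0.2)):
    """Weighted mix of per-field similarities (weights are illustrative)."""
    w_act, w_geo, w_ori, w_pt = weights
    s_act = 1.0 if query.action_type == memory.action_type else 0.0
    s_geo = _cosine(query.object_geometry, memory.object_geometry)
    s_ori = _cosine(query.ee_orientation, memory.ee_orientation)
    # Map Euclidean distance between target points into (0, 1].
    dist = np.linalg.norm(query.target_point - memory.target_point)
    s_pt = 1.0 / (1.0 + dist)
    return float(w_act * s_act + w_geo * s_geo + w_ori * s_ori + w_pt * s_pt)


press_bell = ProceduralState("press", np.array([1.0, 0.2]),
                             np.array([0.0, 0.0, -1.0]), np.array([0.3, 0.0, 0.1]))
press_button = ProceduralState("press", np.array([0.9, 0.3]),
                               np.array([0.0, 0.0, -1.0]), np.array([0.3, 0.1, 0.1]))
pull_drawer = ProceduralState("pull", np.array([0.1, 1.0]),
                              np.array([1.0, 0.0, 0.0]), np.array([0.5, 0.0, 0.3]))

sim_related = state_similarity(press_bell, press_button)
sim_unrelated = state_similarity(press_bell, pull_drawer)
```

In this toy setup the two pressing states score higher against each other than against the drawer-pulling state, which is the kind of ordering a retrieval step would rely on.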

Method

VLA-Pro extracts a structured procedural state, retrieves relevant memories, and injects a fused LoRA adapter into the VLA backbone.

Figure 2: VLA-Pro method overview.
1. Base LoRA: Train a shared base LoRA on seen tasks to capture general manipulation knowledge.
2. Memory Construction: Extract procedural states and fine-tune task-specific LoRA adapters for source tasks.
3. Online Retrieval: Before each action chunk, query the memory bank with the current procedural state.
4. Adapter Fusion: Softmax-normalize top-k similarities and merge retrieved LoRA adapters.
5. Task Execution: Load the fused adapter, execute the current chunk, unload it, and repeat.
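The retrieval and fusion steps above can be sketched in a few lines of NumPy. This is a sketch under stated assumptions, not the authors' implementation: it assumes procedural states are already encoded as vectors, uses cosine similarity for retrieval, and merges adapters at the level of the low-rank update `B @ A`. The function names, the `temperature` parameter, and the toy shapes are all hypothetical.

```python
import numpy as np


def softmax(scores, temperature=1.0):
    """Numerically stable softmax; `temperature` is a hypothetical knob."""
    z = np.asarray(scores, dtype=float) / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def retrieve_top_k(query, memory_keys, k):
    """Return indices and cosine similarities of the k most similar memories."""
    q = query / np.linalg.norm(query)
    keys = memory_keys / np.linalg.norm(memory_keys, axis=1, keepdims=True)
    sims = keys @ q
    top = np.argsort(-sims)[:k]
    return top, sims[top]


def fuse_adapters(adapters, coeffs):
    """Weighted sum of low-rank updates: sum_i coeff_i * (B_i @ A_i)."""
    return sum(c * (B @ A) for c, (A, B) in zip(coeffs, adapters))


def fused_update_for_chunk(query, memory_keys, adapters, k=2):
    """Retrieve top-k memories, softmax their similarities, merge adapters."""
    top, sims = retrieve_top_k(query, memory_keys, k)
    coeffs = softmax(sims)
    delta_w = fuse_adapters([adapters[i] for i in top], coeffs)
    return top, coeffs, delta_w


# Toy memory bank: 3 memories with 3-d keys and rank-2 adapters
# for a 5x4 weight matrix (A is 2x4, B is 5x2). All values are made up.
rng = np.random.default_rng(0)
memory_keys = np.eye(3)
adapters = [(rng.normal(size=(2, 4)), rng.normal(size=(5, 2))) for _ in range(3)]

query = np.array([1.0, 0.5, 0.0])
top, coeffs, delta_w = fused_update_for_chunk(query, memory_keys, adapters, k=2)
```

In a per-chunk loop, `delta_w` would be added to the corresponding backbone weight before executing the chunk and removed afterwards, matching the load, execute, unload cycle in step 5.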

Memory Bank Examples

Representative source memories and transfer targets from RoboTwin and RLBench.

RoboTwin, click_bell: Transfers precise target-contact behavior in held-out simulation tasks.

RLBench, close_jar: Object-level source memory for jar lid interaction and retrieval.

RoboTwin, place_bread_behind: Stores spatial-relation placement patterns for unseen object layouts.

RLBench, open_drawer: Pulling and spatial-interaction memory for drawer-like objects.

RoboTwin, place_glue_stand: Provides grasp-and-place experience for upright placement targets.

RLBench, push_buttons: Target-point memory for pressing small articulated objects.

Results

VLA-Pro is evaluated across RoboTwin simulation, RLBench zero-shot transfer, and real-world robot manipulation.

RoboTwin Simulation

Success rate (%) on held-out RoboTwin tasks:

Backbone   Base   VLA-Pro   Relative gain
X-VLA      17.0   30.0      +76%
RDT        11.1   34.1      +207%
pi0.5      40.4   59.3      +47%

Best gain: RDT backbone, +207%
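The gains reported for RoboTwin are relative improvements over each backbone's base success rate, not percentage-point differences. The short check below recomputes them from the base and VLA-Pro numbers reported above; the variable and function names are illustrative.

```python
# Base and VLA-Pro success rates (%) from the RoboTwin results.
robotwin = {
    "X-VLA": (17.0, 30.0),
    "RDT": (11.1, 34.1),
    "pi0.5": (40.4, 59.3),
}


def relative_gain(base, new):
    """Relative improvement over the base value, in percent."""
    return 100.0 * (new - base) / base


gains = {name: round(relative_gain(base, new))
         for name, (base, new) in robotwin.items()}
best_backbone = max(gains, key=gains.get)

print(gains)          # {'X-VLA': 76, 'RDT': 207, 'pi0.5': 47}
print(best_backbone)  # RDT
```

RDT shows the largest relative gain partly because it starts from the lowest base rate; in absolute terms, RDT gains 23.0 points and pi0.5 gains 18.9 points.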

RLBench Zero-Shot

Average success rate (%) on RLBench zero-shot tasks:

Method          Avg.   Comparison
RDT             10.2   general baseline
pi0.5           13.8   backbone baseline
AtomicVLA       14.7   skill-expert baseline
VLA-Pro (k=1)   16.9
VLA-Pro (k=2)   20.9   +51% vs pi0.5
VLA-Pro (k=3)   16.4

Real-World Robot

Per-task success rate (%) with VLA-Pro:

Task           Success
bottle box     20
microphone     85
cup box        95
shake chem.    75
flick bottle   50
tap chips      65

Metric      Base   VLA-Pro
Average     5.8    65.0
Best task   15.0   95.0

Improvement: +59.2 percentage points
Held-out tasks: 6 real-world tasks
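The real-world summary can be reproduced from the per-task numbers. The sketch below assumes the six per-task values are VLA-Pro success rates, which is consistent with their mean and maximum matching the reported VLA-Pro average and best task; the variable names are illustrative.

```python
# Per-task success rates (%), assumed to be the VLA-Pro results.
per_task = {
    "bottle box": 20,
    "microphone": 85,
    "cup box": 95,
    "shake chem.": 75,
    "flick bottle": 50,
    "tap chips": 65,
}
base_average = 5.8  # reported base average (%)

vla_pro_average = sum(per_task.values()) / len(per_task)
best_task = max(per_task, key=per_task.get)
point_gain = round(vla_pro_average - base_average, 1)

print(vla_pro_average)  # 65.0
print(best_task)        # cup box
print(point_gain)       # 59.2
```

Unlike the simulation results, the real-world improvement is stated in percentage points (65.0 minus 5.8) rather than as a relative gain.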

RoboTwin

VLA-Pro improves all three backbones on held-out simulation tasks, with the largest relative gain on RDT.

RLBench

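On RLBench, k=2 achieves the best average success rate (20.9%, versus 16.9% for k=1 and 16.4% for k=3), suggesting that fusing a small number of related memories transfers better than relying on a single memory or adding further ones.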

Real-World

Using pi0.5 as the backbone, VLA-Pro increases real-world average success from 5.8% to 65.0% over six held-out tasks.

Videos

RoboTwin simulation clips and real-world experiment clips from the project assets.

RoboTwin Simulation Success: place_apple_stand, close_microwave, stack_bowls

Real-World Experiments: four clips (Real-world 1 to Real-world 4)

BibTeX

citation.bib
@misc{vlapro2026,
  title         = {VLA-Pro: Cross-Task Procedural Memory Transfer for Vision-Language-Action Models},
  author        = {YOUR_AUTHOR_LIST},
  year          = {2026},
  eprint        = {YOUR_ARXIV_ID},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  url           = {https://arxiv.org/abs/YOUR_ARXIV_ID}
}