Vision-Language-Action Models

VLA-Pro: Cross-Task Procedural Memory Transfer for Vision-Language-Action Models

arXiv preprint


Overview of VLA-Pro. VLA-Pro retrieves task-relevant procedural memories and fuses task-specific LoRA adapters during inference.

59.3% RoboTwin success with VLA-Pro on pi0.5

20.9% RLBench zero-shot average success

65.0% real-world average success with VLA-Pro

+207% RoboTwin gain on RDT backbone

+51% RLBench improvement over pi0.5 baseline

+59.2 real-world percentage point improvement

Abstract

Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation, yet they still struggle to generalize to unseen tasks that require transferring relevant experience across objects, scenes, and action patterns. This paper proposes VLA-Pro, a plug-and-play framework that enhances cross-task generalization by storing task-relevant procedural memories at training time and transferring them during inference.

VLA-Pro stores task-specific LoRA adapters as parameterized procedural memories during training. At inference time, it retrieves relevant procedural memories based on the current multi-modal context and dynamically fuses them to generate the current action chunk. Experiments on RoboTwin, RLBench, and real-world manipulation tasks show that VLA-Pro consistently improves cross-task generalization across multiple backbones.

Highlights

Procedural Memory

Stores seen-task procedural states and task-specific LoRA adapters as retrievable, executable memories.

Action-Aware Retrieval

Matches the current procedural state using action type, object geometry, end-effector orientation, and target interaction point.

LoRA Fusion

Converts memory similarities into fusion coefficients and merges top-k task adapters for the current execution stage.

Cross-Task Transfer

Improves unseen-task execution on RoboTwin, RLBench, and real-world robotic manipulation scenarios.
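
The page names the four cues used for action-aware retrieval but not how they are scored. The following is a minimal sketch of one way such a match could work, assuming a weighted sum of per-cue scores. The `ProceduralState` container, the vector encodings of geometry and orientation, the weights, and the toy memory entries are all hypothetical and are not taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ProceduralState:
    """Hypothetical container for the four cues named in the highlight."""
    action_type: str        # e.g. "press", "pull", "twist"
    object_geometry: tuple  # assumed fixed-length geometry descriptor
    ee_orientation: tuple   # assumed end-effector approach direction (3-vector)
    target_point: tuple     # assumed interaction point (x, y, z) in metres


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def similarity(query, memory, weights=(0.4, 0.2, 0.2, 0.2)):
    """Weighted sum of per-cue scores; the weights are illustrative only."""
    w_action, w_geom, w_orient, w_point = weights
    action_score = 1.0 if query.action_type == memory.action_type else 0.0
    geom_score = cosine(query.object_geometry, memory.object_geometry)
    orient_score = cosine(query.ee_orientation, memory.ee_orientation)
    # Closer interaction points score higher; the score decays with distance.
    point_score = math.exp(-math.dist(query.target_point, memory.target_point))
    return (w_action * action_score + w_geom * geom_score
            + w_orient * orient_score + w_point * point_score)


def retrieve_top_k(query, memory_bank, k=2):
    """Return the k (task_name, score) pairs with the highest similarity."""
    scored = [(name, similarity(query, state)) for name, state in memory_bank.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy memory bank keyed by task names that appear on this page; values are made up.
memory_bank = {
    "push_buttons": ProceduralState("press", (1.0, 0.0), (0.0, 0.0, -1.0), (0.3, 0.0, 0.1)),
    "open_drawer": ProceduralState("pull", (0.0, 1.0), (1.0, 0.0, 0.0), (0.5, 0.1, 0.2)),
    "close_jar": ProceduralState("twist", (0.7, 0.7), (0.0, 0.0, -1.0), (0.3, 0.1, 0.15)),
}
query = ProceduralState("press", (0.9, 0.1), (0.0, 0.0, -1.0), (0.3, 0.0, 0.12))
top = retrieve_top_k(query, memory_bank, k=2)
print(top)
```

In this toy setup a pressing query ranks `push_buttons` first because all four cues agree, while `close_jar` comes second on orientation and target proximity alone. A learned embedding could replace the hand-set weights without changing the retrieval interface.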

Method

VLA-Pro extracts a structured procedural state, retrieves relevant memories, and injects a fused LoRA adapter into the VLA backbone.

Base LoRA

Train a shared base LoRA on seen tasks to capture general manipulation knowledge.

Memory Construction

Extract procedural states and fine-tune task-specific LoRA adapters for source tasks.

Online Retrieval

Before each action chunk, query the memory bank with the current procedural state.

Adapter Fusion

Softmax-normalize top-k similarities and merge retrieved LoRA adapters.

Task Execution

Load the fused adapter, execute the current chunk, unload it, and repeat.
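
The Adapter Fusion step above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes each adapter is a pair of low-rank matrices `(B, A)`, that fusion is a coefficient-weighted sum of the updates `B @ A`, and that a `temperature` parameter scales the softmax. The helper names and the toy adapter values are hypothetical.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax; temperature is an assumed knob."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(left, right):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(l * r for l, r in zip(row, col)) for col in zip(*right)]
            for row in left]


def fuse_adapters(retrieved, adapters, temperature=1.0):
    """Merge retrieved LoRA adapters into a single weight update.

    retrieved: list of (task_name, similarity) pairs, already cut to top-k.
    adapters:  task_name -> (B, A), B of shape (d_out, r), A of shape (r, d_in).
    Assumption: the fused update is sum_i coeff_i * (B_i @ A_i).
    """
    names = [name for name, _ in retrieved]
    coeffs = softmax([score for _, score in retrieved], temperature)
    fused = None
    for name, coeff in zip(names, coeffs):
        b_matrix, a_matrix = adapters[name]
        delta = matmul(b_matrix, a_matrix)
        if fused is None:
            fused = [[0.0] * len(delta[0]) for _ in delta]
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                fused[i][j] += coeff * value
    return dict(zip(names, coeffs)), fused


# Toy rank-1 adapters for a 2x2 layer; the numbers are made up for illustration.
adapters = {
    "push_buttons": ([[1.0], [0.0]], [[1.0, 0.0]]),
    "close_jar": ([[0.0], [1.0]], [[0.0, 1.0]]),
}
retrieved = [("push_buttons", 0.99), ("close_jar", 0.54)]
coeffs, fused_delta = fuse_adapters(retrieved, adapters)
print(coeffs)
print(fused_delta)
```

Following the Task Execution step, the fused update would be loaded into the backbone before an action chunk and unloaded afterwards, so each chunk gets a freshly retrieved mixture. Summing the full updates rather than averaging `A` and `B` separately keeps the result a linear combination of the individual adapters, at the cost of a higher effective rank.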

Memory Bank Examples

Representative source memories and transfer targets from RoboTwin and RLBench.

RoboTwin

click_bell

Transfers precise target-contact behavior in held-out simulation tasks.

RLBench

close_jar

Object-level source memory for jar lid interaction and retrieval.

RoboTwin

place_bread_behind

Stores spatial-relation placement patterns for unseen object layouts.

RLBench

open_drawer

Pulling and spatial-interaction memory for drawer-like objects.

RoboTwin

place_glue_stand

Provides grasp-and-place experience for upright placement targets.

RLBench

push_buttons

Target-point memory for pressing small articulated objects.

Results

VLA-Pro is evaluated across RoboTwin simulation, RLBench zero-shot transfer, and real-world robot manipulation.

RoboTwin Simulation

Backbone	Base success (%)	VLA-Pro success (%)	Relative gain
X-VLA	17.0	30.0	+76%
RDT	11.1	34.1	+207%
pi0.5	40.4	59.3	+47%
Best gain	RDT backbone		+207%
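
The gain column in the table above is consistent with a relative improvement over the base success rate, rounded to the nearest percent. The short check below recomputes it from the table values; the function name `relative_gain` is only illustrative.

```python
def relative_gain(base, improved):
    """Relative improvement over the baseline, in percent."""
    return (improved - base) / base * 100.0


# (base, VLA-Pro) success rates in percent, copied from the RoboTwin table.
robotwin = {"X-VLA": (17.0, 30.0), "RDT": (11.1, 34.1), "pi0.5": (40.4, 59.3)}
gains = {name: round(relative_gain(base, improved))
         for name, (base, improved) in robotwin.items()}
print(gains)  # {'X-VLA': 76, 'RDT': 207, 'pi0.5': 47}
```

Relative gain favors weak baselines: RDT shows the largest relative gain (+207%) from a gain of 23.0 percentage points, while pi0.5 gains 18.9 percentage points from a much higher starting point. The real-world result is reported in percentage points instead.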

RLBench Zero-Shot

Method	Avg. success (%)	Comparison
RDT	10.2	general baseline
pi0.5	13.8	backbone baseline
AtomicVLA	14.7	skill-expert baseline
VLA-Pro (k=1)	16.9	
VLA-Pro (k=2)	20.9	+51% vs pi0.5
VLA-Pro (k=3)	16.4	

Real-World Robot

Held-out real-world tasks: bottle box, microphone, cup box, shake chem., flick bottle, tap chips

Metric	Base (%)	VLA-Pro (%)
Average	5.8	65.0
Best task	15.0	95.0
Average improvement	+59.2 percentage points
Held-out tasks	6 real-world tasks

RoboTwin

VLA-Pro improves all three backbones on held-out simulation tasks, with the largest relative gain on RDT.

RLBench

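On RLBench, k=2 achieves the best average success rate (20.9%, versus 16.9% for k=1 and 16.4% for k=3), indicating that fusing a small number of related memories improves transfer, while retrieving too few or too many reduces the benefit.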

Real-World

Using pi0.5 as the backbone, VLA-Pro increases real-world average success from 5.8% to 65.0% over six held-out tasks.

Videos

RoboTwin simulation clips and real-world experiment clips from the project assets.

RoboTwin Simulation Success

place_apple_stand

close_microwave

stack_bowls

Real-World Experiments

Real-world 1

Real-world 2

Real-world 3

Real-world 4

BibTeX

citation.bib

@misc{vlapro2026,
  title         = {VLA-Pro: Cross-Task Procedural Memory Transfer for Vision-Language-Action Models},
  author        = {YOUR_AUTHOR_LIST},
  year          = {2026},
  eprint        = {YOUR_ARXIV_ID},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  url           = {https://arxiv.org/abs/YOUR_ARXIV_ID}
}