Vision-language-action (VLA) models, finetuned from powerful pretrained vision-language models (VLMs), promise to create generalist robots. However, this finetuning process often degrades the very representations that make them powerful, limiting generalization. We propose a framework that preserves these pretrained features while adapting them for robot manipulation. Our method introduces a dual-encoder design to retain features, a string-based action tokenizer to align actions with language, and a co-training strategy to balance robot and vision-language data. Our evaluations show significant improvements in robustness, generalization, and overall task success.
 
        Our framework is built on three key ideas to prevent representation degradation. (1) Partially-Frozen Visual Encoders: We use two encoders—one frozen to preserve robust, pretrained VLM features and one trainable to adapt to the specific robot task. (2) String-Based Action Tokenizer: We represent continuous robot actions as strings, unifying them with the text-based pretraining of the language model. (3) Co-Training Strategy: We mix robot demonstration data with vision-language datasets that emphasize spatial reasoning, preventing the model from overfitting to robot-specific data and enhancing its generalization capabilities.
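Below is a minimal sketch of the first two ideas, assuming a generic PyTorch vision backbone. The class and function names (`DualVisionEncoder`, `action_to_string`), the feature-fusion layer, and the binning scheme are illustrative assumptions, not the paper's actual implementation; the co-training strategy simply amounts to interleaving robot batches with vision-language batches at a fixed ratio.

```python
# Sketch of the partially-frozen dual encoder and the string-based action
# tokenizer. All names and hyperparameters here are illustrative assumptions.
import copy
import torch
import torch.nn as nn


class DualVisionEncoder(nn.Module):
    """Runs a frozen copy and a trainable copy of the same backbone, then fuses features."""

    def __init__(self, backbone: nn.Module, feat_dim: int):
        super().__init__()
        self.frozen = copy.deepcopy(backbone)
        for p in self.frozen.parameters():   # preserve pretrained VLM features
            p.requires_grad_(False)
        self.trainable = backbone            # adapts to the robot manipulation task
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            f_frozen = self.frozen(images)
        f_train = self.trainable(images)
        return self.fuse(torch.cat([f_frozen, f_train], dim=-1))


def action_to_string(action, low=-1.0, high=1.0, bins=256) -> str:
    """Discretize each continuous action dimension into an integer bin and render it
    as plain text, so actions live in the same token space as language."""
    tokens = []
    for a in action:
        a = min(max(a, low), high)
        idx = int((a - low) / (high - low) * (bins - 1))
        tokens.append(str(idx))
    return " ".join(tokens)


# Example: a 7-DoF end-effector action becomes a short string the language model can emit.
print(action_to_string([0.1, -0.5, 0.0, 0.2, 0.0, 0.0, 1.0]))
```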
 
            Generalization benchmarks: SimplerEnv Visual Matching, SimplerEnv Variant Aggregation (OOD), and Language Robustness.
A key challenge for robots is generalizing to novel scenes and instructions. Baseline models show a significant drop in performance when backgrounds change or instructions are paraphrased. Our approach demonstrates substantially stronger generalization, maintaining high success rates even with out-of-distribution (OOD) visual and language inputs.
 
        Real-World Performance and VQA Benchmark Performance.
In real-world tests, our models consistently outperform baselines, especially in the presence of distracting objects. While baseline models often get confused by distractors (e.g., picking a carrot instead of a knife), our model demonstrates a more robust understanding of the task, successfully completing the instructed action. Additionally, standard VLA finetuning harms the model's ability to perform general visual reasoning. Our training recipe allows the model to retain significantly higher performance on standard VQA benchmarks (solid lines vs. dashed lines), demonstrating that it doesn't just learn robotic actions but also preserves its core reasoning abilities.
 
        t-SNE visualizations of vision encoder features on CIFAR-10.
We visualize how different training approaches affect the learned visual representations using t-SNE on CIFAR-10. Comparing (i) the original visual backbone from the VLM before VLA training, (ii) the backbone after direct VLA finetuning on robot data, and (iii) the backbone after applying our approach, we observe that our method yields noticeably tighter and better-separated class clusters. The numbers in the visualization indicate linear-probe classification performance on CIFAR-10 using the corresponding features. For both OpenVLA and π0 models, our approach leads to better linear-probe performance, indicating superior preservation of the semantic structure in the pretrained visual representations. This demonstrates that our framework maintains the rich representational capacity of the original vision-language models while adapting them for robotic tasks.
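The probe itself is straightforward to reproduce. Below is a minimal sketch, assuming access to a `features` array extracted from CIFAR-10 images with a given encoder checkpoint; the function name `probe_features` and the train/test split are assumptions for illustration, not the paper's exact protocol.

```python
# Sketch of the feature-quality analysis: t-SNE for visualization plus a
# linear probe to measure how much semantic structure the features retain.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def probe_features(features: np.ndarray, labels: np.ndarray, seed: int = 0):
    """Return 2-D t-SNE coordinates and linear-probe accuracy for one encoder."""
    coords = TSNE(n_components=2, random_state=seed).fit_transform(features)
    x_tr, x_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.2, random_state=seed, stratify=labels)
    probe = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    return coords, probe.score(x_te, y_te)


# Usage: run once per checkpoint (pretrained VLM, naive VLA finetune, ours)
# and compare cluster separation and probe accuracy across the three.
```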
Here we showcase sample demonstrations of our model's performance in various scenarios, highlighting its generalization capabilities.
 
              Task 1: Put carrot on plate.
 
              Task 2: Put knife on cloth.
 
              Task 3: Place the carrot on the plate.
 
              Task 4: Place carrot on plate.
 
              Task 5: Put carrot on plate.
 
              Task 6: Put carrot on yellow plate.
@article{grover2025enhancing,
  title={Enhancing Generalization in Vision-Language-Action Models by Preserving Pretrained Representations},
  author={Grover, Shresth and Gopalkrishnan, Akshay and Ai, Bo and Christensen, Henrik I. and Su, Hao and Li, Xuanlin},
  journal={arXiv preprint arXiv:2509.11417},
  year={2025}
}