Vision-language-action (VLA) models, finetuned from powerful pretrained vision-language models (VLMs), promise to create generalist robots. However, this finetuning process often degrades the very representations that make them powerful, limiting generalization. We propose a framework that preserves these pretrained features while adapting them for robot manipulation. Our method introduces a dual-encoder design to retain features, a string-based action tokenizer to align actions with language, and a co-training strategy to balance robot and vision-language data. Our evaluations show significant improvements in robustness, generalization, and overall task success.
 
        Our framework is built on three key ideas to prevent representation degradation. (1) Partially-Frozen Visual Encoders: We use two encoders—one frozen to preserve robust, pretrained VLM features and one trainable to adapt to the specific robot task. (2) String-Based Action Tokenizer: We represent continuous robot actions as strings, unifying them with the text-based pretraining of the language model. (3) Co-Training Strategy: We mix robot demonstration data with vision-language datasets that emphasize spatial reasoning, preventing the model from overfitting to robot-specific data and enhancing its generalization capabilities.
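Below is a minimal sketch of the first two ideas, assuming a generic PyTorch vision backbone. The class and function names (`DualVisionEncoder`, `action_to_string`), the feature-fusion layer, and the binning scheme are illustrative assumptions, not the paper's actual implementation; the co-training strategy simply amounts to interleaving robot batches with vision-language batches at a fixed ratio.

```python
# Sketch of the partially-frozen dual encoder and the string-based action
# tokenizer. All names and hyperparameters here are illustrative assumptions.
import copy
import torch
import torch.nn as nn


class DualVisionEncoder(nn.Module):
    """Runs a frozen copy and a trainable copy of the same backbone, then fuses features."""

    def __init__(self, backbone: nn.Module, feat_dim: int):
        super().__init__()
        self.frozen = copy.deepcopy(backbone)
        for p in self.frozen.parameters():   # preserve pretrained VLM features
            p.requires_grad_(False)
        self.trainable = backbone            # adapts to the robot manipulation task
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            f_frozen = self.frozen(images)
        f_train = self.trainable(images)
        return self.fuse(torch.cat([f_frozen, f_train], dim=-1))


def action_to_string(action, low=-1.0, high=1.0, bins=256) -> str:
    """Discretize each continuous action dimension into an integer bin and render it
    as plain text, so actions live in the same token space as language."""
    tokens = []
    for a in action:
        a = min(max(a, low), high)
        idx = int((a - low) / (high - low) * (bins - 1))
        tokens.append(str(idx))
    return " ".join(tokens)


# Example: a 7-DoF end-effector action becomes a short string the language model can emit.
print(action_to_string([0.1, -0.5, 0.0, 0.2, 0.0, 0.0, 1.0]))
```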
 
            Generalization benchmarks: SimplerEnv Visual Matching, SimplerEnv Variant Aggregation (OOD), and Language Robustness.
A key challenge for robots is generalizing to novel scenes and instructions. Baseline models show a significant drop in performance when backgrounds change or instructions are paraphrased. Our approach demonstrates substantially stronger generalization, maintaining high success rates even with out-of-distribution (OOD) visual and language inputs.
 
        Real-World Performance and VQA Benchmark Performance.
In real-world tests, our models consistently outperform baselines, especially in the presence of distracting objects. While baseline models often get confused by distractors (e.g., picking a carrot instead of a knife), our model demonstrates a more robust understanding of the task, successfully completing the instructed action. Additionally, standard VLA finetuning harms the model's ability to perform general visual reasoning. Our training recipe allows the model to retain significantly higher performance on standard VQA benchmarks (solid lines vs. dashed lines), demonstrating that it doesn't just learn robotic actions but also preserves its core reasoning abilities.
 
        t-SNE visualizations of vision encoder features on CIFAR-10.
We visualize how different training approaches affect the learned visual representations using t-SNE on CIFAR-10. Comparing (i) the original visual backbone from the VLM before VLA training, (ii) the backbone after direct VLA finetuning on robot data, and (iii) the backbone after applying our approach, we observe that our method yields noticeably tighter and better-separated class clusters. The numbers in the visualization indicate linear-probe classification performance on CIFAR-10 using the corresponding features. For both OpenVLA and π0 models, our approach leads to better linear-probe performance, indicating superior preservation of the semantic structure in the pretrained visual representations. This demonstrates that our framework maintains the rich representational capacity of the original vision-language models while adapting them for robotic tasks.
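The probe itself is straightforward to reproduce. Below is a minimal sketch, assuming access to a `features` array extracted from CIFAR-10 images with a given encoder checkpoint; the function name `probe_features` and the train/test split are assumptions for illustration, not the paper's exact protocol.

```python
# Sketch of the feature-quality analysis: t-SNE for visualization plus a
# linear probe to measure how much semantic structure the features retain.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def probe_features(features: np.ndarray, labels: np.ndarray, seed: int = 0):
    """Return 2-D t-SNE coordinates and linear-probe accuracy for one encoder."""
    coords = TSNE(n_components=2, random_state=seed).fit_transform(features)
    x_tr, x_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.2, random_state=seed, stratify=labels)
    probe = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    return coords, probe.score(x_te, y_te)


# Usage: run once per checkpoint (pretrained VLM, naive VLA finetune, ours)
# and compare cluster separation and probe accuracy across the three.
```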
Here we showcase sample demonstrations of our model's performance in various scenarios, highlighting its generalization capabilities.
 
              Task 1: Put carrot on plate.
 
              Task 2: Put knife on cloth.
 
              Task 3: Place the carrot on the plate.
 
              Task 4: Place carrot on plate.
 
              Task 5: Put carrot on plate.
 
              Task 6: Put carrot on yellow plate.
@article{grover2025enhancing,
  title={Enhancing Generalization in Vision-Language-Action Models by Preserving Pretrained Representations},
  author={Grover, Shresth and Gopalkrishnan, Akshay and Ai, Bo and Christensen, Henrik I. and Su, Hao and Li, Xuanlin},
  journal={arXiv preprint arXiv:2509.11417},
  year={2025}
}