Lossless Adaptation of Pretrained Vision Models For Robotic Manipulation

by Mohit Sharma, et al.

Recent work has shown that large models pretrained on common visual learning tasks can provide useful representations for a wide range of specialized perception problems, as well as for a variety of robotic manipulation tasks. While prior work on robotic manipulation has predominantly used frozen pretrained features, we demonstrate that in robotics this approach can fail to reach optimal performance, and that fine-tuning the full model can yield significantly better results. Unfortunately, fine-tuning disrupts the pretrained visual representation, causing representational drift toward the fine-tuned task and thus a loss of the original model's versatility. We introduce "lossless adaptation" to address this shortcoming of classical fine-tuning. We demonstrate that appropriate placement of our parameter-efficient adapters can significantly close the performance gap between frozen pretrained representations and full end-to-end fine-tuning without changing the original representation, thereby preserving the original capabilities of the pretrained model. We perform a comprehensive investigation across three major model architectures (ViTs, NFNets, and ResNets) with supervised (ImageNet-1K classification) and self-supervised (CLIP, BYOL, Visual MAE) pretrained weights, spanning 3 task domains and 35 individual tasks, and demonstrate that our claims are strongly validated across these settings.
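The key property claimed above is that adapters add capacity without altering the pretrained representation. A common way to achieve this is a residual bottleneck adapter whose up-projection is zero-initialized, so the module is exactly the identity at the start of training and the frozen backbone's features pass through unchanged. The sketch below illustrates that idea with NumPy; the class name, sizes, and activation are illustrative assumptions, not the paper's exact architecture or placement.

```python
import numpy as np

class BottleneckAdapter:
    """Illustrative parameter-efficient adapter: down-project, nonlinearity,
    up-project, added residually to frozen backbone features.
    (A generic residual bottleneck sketch, not the paper's exact design.)"""

    def __init__(self, dim, bottleneck, rng):
        self.W_down = rng.normal(scale=0.02, size=(dim, bottleneck))
        self.b_down = np.zeros(bottleneck)
        # Zero-initialized up-projection: at initialization the adapter is the
        # identity map, so the pretrained behavior is preserved exactly.
        self.W_up = np.zeros((bottleneck, dim))
        self.b_up = np.zeros(dim)

    def __call__(self, x):
        h = np.maximum(x @ self.W_down + self.b_down, 0.0)  # ReLU bottleneck
        return x + h @ self.W_up + self.b_up                # residual add

rng = np.random.default_rng(0)
adapter = BottleneckAdapter(dim=768, bottleneck=32, rng=rng)

features = rng.normal(size=(4, 768))  # stand-in for frozen backbone features
out = adapter(features)

# At init, the adapter leaves the pretrained features untouched.
assert np.allclose(out, features)
```

Because only the adapter's small down/up projections are trained while the backbone stays frozen, the original model can still be used unchanged for its other capabilities, which is the sense in which the adaptation is "lossless".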


On the Effectiveness of Adapter-based Tuning for Pretrained Language Model Adaptation

Adapter-based tuning has recently arisen as an alternative to fine-tunin...

Policy-Induced Self-Supervision Improves Representation Finetuning in Visual RL

We study how to transfer representations pretrained on source tasks to t...

ViT2EEG: Leveraging Hybrid Pretrained Vision Transformers for EEG Data

In this study, we demonstrate the application of a hybrid Vision Transfo...

Vision Models Can Be Efficiently Specialized via Few-Shot Task-Aware Compression

Recent vision architectures and self-supervised training methods enable ...

Supervised Fine-tuning Evaluation for Long-term Visual Place Recognition

In this paper, we present a comprehensive study on the utility of deep c...

Standardizing and Centralizing Datasets to Enable Efficient Training of Agricultural Deep Learning Models

In recent years, deep learning models have become the standard for agric...

To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks

While most previous work has focused on different pretraining objectives...