Compressing And Debiasing Vision-Language Pre-Trained Models for Visual Question Answering
Despite the excellent performance of large-scale vision-language pre-trained models (VLPs) on conventional visual question answering task, they still suffer from two problems: First, VLPs tend to rely on language biases in datasets and fail to generalize to out-of-distribution (OOD) data. Second, they are inefficient in terms of memory footprint and computation. Although promising progress has been made in both problems, most existing works tackle them independently. To facilitate the application of VLP to VQA tasks, it is imperative to jointly study VLP compression and OOD robustness, which, however, has not yet been explored. In this paper, we investigate whether a VLP can be compressed and debiased simultaneously by searching sparse and robust subnetworks. To this end, we conduct extensive experiments with LXMERT, a representative VLP, on the OOD dataset VQA-CP v2. We systematically study the design of a training and compression pipeline to search the subnetworks, as well as the assignment of sparsity to different modality-specific modules. Our results show that there indeed exist sparse and robust LXMERT subnetworks, which significantly outperform the full model (without debiasing) with much fewer parameters. These subnetworks also exceed the current SoTA debiasing models with comparable or fewer parameters. We will release the codes on publication.
READ FULL TEXT