The canonical approach to video-text retrieval leverages a coarse-graine...
Model merging (e.g., via interpolation or task arithmetic) fuses multipl...
Vision transformers (ViTs) have achieved impressive results on various
c...
Fine-tuning large pre-trained models on downstream tasks has been adopte...
Recently, fine-tuning language models pre-trained on large text corpora ...
During typical gradient-based training of deep neural networks, all of t...