ELVIS: Empowering Locality of Vision Language Pre-training with Intra-modal Similarity

04/11/2023
by Sumin Seo, et al.

Deep learning has shown great potential in assisting radiologists in reading chest X-ray (CXR) images, but its need for expensive annotations to improve performance prevents widespread clinical application. Vision-language pre-training (VLP) can alleviate the burden and cost of annotation by leveraging the reports routinely generated for radiographs, which exist in large quantities and in paired form (image-text pairs). Extensions to locality-aware VLP have also been proposed to meet the need for accurate localization of abnormalities in computer-aided diagnosis (CAD) for CXR. However, we find that the formulations proposed in the locality-aware VLP literature actually lead to a loss of the spatial relationships required for downstream localization tasks. We therefore propose ELVIS (Empowering Locality of VLP with Intra-modal Similarity), a VLP method aware of intra-modal locality that better preserves locality within radiographs and reports, enhancing the ability to comprehend location references in text reports. Our locality-aware VLP method significantly outperforms state-of-the-art baselines on multiple segmentation tasks and on the MS-CXR phrase grounding task. Qualitatively, ELVIS focuses well on the regions of interest described in the report text compared to prior approaches, allowing for enhanced interpretability.
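The abstract does not spell out the objective, but the core idea it describes, keeping the similarity structure among local features of one modality intact so that location references survive pre-training, can be illustrated with a small PyTorch sketch. Everything below (the function name, the KL-divergence formulation, the temperature, and the toy shapes) is a hypothetical illustration under our own assumptions, not the loss defined in the paper.

```python
# Hypothetical sketch of an intra-modal locality-preservation term.
# It is NOT the exact ELVIS objective; it only illustrates the general idea:
# the similarity structure among local features of one modality (image patches
# or report tokens) should be preserved after projection into the joint
# vision-language embedding space.
import torch
import torch.nn.functional as F


def locality_preservation_loss(local_feats, projected_feats, tau=0.07):
    """KL divergence between intra-modal similarity distributions computed
    before and after projection into the shared embedding space.

    local_feats:     (B, N, D_backbone) local features from one modality
    projected_feats: (B, N, D_joint)    the same features after projection
    """
    p = F.normalize(local_feats, dim=-1)
    q = F.normalize(projected_feats, dim=-1)

    # Row-wise similarity distributions over the N local positions.
    p_sim = F.softmax(p @ p.transpose(1, 2) / tau, dim=-1)      # target
    q_log = F.log_softmax(q @ q.transpose(1, 2) / tau, dim=-1)  # prediction

    # KL(target || prediction), averaged over batch and positions.
    return F.kl_div(q_log, p_sim, reduction="batchmean") / p.size(1)


if __name__ == "__main__":
    patches_backbone = torch.randn(2, 49, 768)  # e.g., a 7x7 CXR patch grid
    patches_joint = torch.randn(2, 49, 256)     # after a projection head
    print(locality_preservation_loss(patches_backbone, patches_joint).item())
```

In an actual pre-training pipeline, a term like this would typically be added to the standard image-text contrastive loss with a weighting coefficient; that coefficient, like everything else in the sketch, is an assumption for illustration only.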

Related research

EfficientCLIP: Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling (09/10/2021)
While large scale pre-training has achieved great achievements in bridgi...

Knowledge-enhanced Pre-training for Auto-diagnosis of Chest Radiology Images (02/27/2023)
Despite the success of multi-modal foundation models pre-trained on l...

MixGen: A New Multi-Modal Data Augmentation (06/16/2022)
Data augmentation is a necessity to enhance data efficiency in deep lear...

Self-supervised Image-text Pre-training With Mixed Data In Chest X-rays (03/30/2021)
Pre-trained models, e.g., from ImageNet, have proven to be effective in ...

LocTex: Learning Data-Efficient Visual Representations from Localized Textual Supervision (08/26/2021)
Computer vision tasks such as object detection and semantic/instance seg...

Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing (01/11/2023)
Self-supervised learning in vision-language processing exploits semantic...

Towards Generalizable Forgery Detection with Locality-aware AutoEncoder (09/13/2019)
With advancements of deep learning techniques, it is now possible to gen...
