ConTEXTual Net: A Multimodal Vision-Language Model for Segmentation of Pneumothorax

by Zachary Huemann, et al.

Clinical imaging databases contain not only medical images but also text reports generated by physicians. These narrative reports often describe the location, size, and shape of the disease, but using descriptive text to guide medical image analysis has been understudied. Vision-language models are increasingly used for multimodal tasks like image generation, image captioning, and visual question answering but have been scarcely used in medical imaging. In this work, we develop a vision-language model for the task of pneumothorax segmentation. Our model, ConTEXTual Net, detects and segments pneumothorax in chest radiographs guided by free-form radiology reports. ConTEXTual Net achieved a Dice score of 0.72 ± 0.02, which was similar to the level of agreement between the primary physician annotator and the other physician annotators (0.71 ± 0.04). ConTEXTual Net also outperformed a U-Net. We demonstrate that descriptive language can be incorporated into a segmentation model for improved performance. Through an ablative study, we show that it is the text information that is responsible for the performance gains. Additionally, we show that certain augmentation methods worsen ConTEXTual Net's segmentation performance by breaking the image-text concordance. We propose a set of augmentations that maintain this concordance and improve segmentation training.
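For reference, the Dice score used to evaluate the model above is a standard overlap metric between a predicted and a ground-truth segmentation mask. The sketch below is a minimal NumPy implementation of that metric (not the authors' code); the function name and the small smoothing term `eps` are illustrative choices.

```python
import numpy as np

def dice_score(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Dice similarity coefficient between two binary masks.

    Dice = 2 * |pred AND target| / (|pred| + |target|),
    with a small eps to avoid division by zero on empty masks.
    """
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

# Example: predicted mask covers two pixels, ground truth one of them.
pred = np.array([[1, 1], [0, 0]])
target = np.array([[1, 0], [0, 0]])
print(round(dice_score(pred, target), 2))  # 2*1 / (2+1) -> 0.67
```

A Dice score of 1.0 means perfect overlap; the paper's reported 0.72 ± 0.02 is judged against the 0.71 ± 0.04 inter-annotator agreement, i.e., roughly human-level consistency on this task.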




