Unify, Align and Refine: Multi-Level Semantic Alignment for Radiology Report Generation

by Yaowei Li et al.

Automatic radiology report generation has attracted enormous research interest due to its practical value in reducing the workload of radiologists. However, it remains challenging to simultaneously establish global correspondences between an image (e.g., a chest X-ray) and its related report, and local alignments between image patches and keywords. To this end, we propose an Unify, Align and then Refine (UAR) approach to learn multi-level cross-modal alignments, and introduce three novel modules: the Latent Space Unifier (LSU), the Cross-modal Representation Aligner (CRA), and the Text-to-Image Refiner (TIR). Specifically, the LSU unifies multimodal data into discrete tokens, enabling a shared network to flexibly learn common knowledge across modalities. The modality-agnostic CRA first learns discriminative features via a set of orthonormal bases and a dual-gate mechanism, and then globally aligns visual and textual representations under a triplet contrastive loss. The TIR boosts token-level local alignment by calibrating text-to-image attention with a learnable mask. Additionally, we design a two-stage training procedure so that UAR gradually grasps cross-modal alignments at different levels, imitating radiologists' workflow: writing sentence by sentence first, then checking word by word. Extensive experiments and analyses on the IU-Xray and MIMIC-CXR benchmark datasets demonstrate the superiority of our UAR over various state-of-the-art methods.
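The abstract does not give the exact form of the CRA's triplet contrastive loss, but the standard margin-based formulation it alludes to (pulling a paired image-report embedding together while pushing a mismatched one apart) can be sketched as follows. The function name, cosine-similarity choice, and margin value here are illustrative assumptions, not the paper's definition.

```python
import numpy as np

def triplet_contrastive_loss(anchor, positive, negative, margin=0.2):
    """Margin-based triplet loss over embedding similarities (illustrative sketch).

    anchor:   e.g. a global image embedding
    positive: the embedding of its paired report
    negative: the embedding of a mismatched report

    The loss is zero once the anchor is more similar to its paired report
    than to the mismatched one by at least `margin`. The exact similarity
    measure and margin used by UAR are not specified in the abstract.
    """
    def cos(a, b):
        # Cosine similarity between two vectors.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    return max(0.0, margin - cos(anchor, positive) + cos(anchor, negative))
```

For example, when the positive embedding matches the anchor exactly and the negative is orthogonal, the margin is already satisfied and the loss is zero; reversing the roles yields a positive loss that training would push down.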

