Exploring the Capabilities of Large Language Model Encoders for Image-Text Retrieval in Chest X-rays

arXiv CS | Tuesday 02 June 2026, 04:00 UTC | By Hanbin Ko, Gihun Cho, Inhyeok Baek, Donguk Kim, Joonbeom Koo, Changi Kim, Dongheon Lee, Chang Min Park


arXiv:2509.15234v2

Abstract: Multimodal learning from paired medical images and clinical text is a central challenge in medical data-driven informatics, where effective cross-modal alignment is critical for scalable analysis and retrieval. In chest radiography, vision-language pretraining is constrained by heterogeneous radiology reports that contain abbreviations, impression-only notes, and institution-specific writing styles. Unlike general-domain settings, naively aggregating large collections of noisy reports can plateau or even degrade multimodal learning when reporting styles differ substantially. We propose a domain-adapted bidirectional large language model text encoder for chest radiograph reports, trained with masked token prediction and supervised contrastive learning on stylistically diverse but clinically equivalent report variants to produce robust, generalizable text embeddings. We then integrate this encoder into a dual-tower contrastive vision-language framework using parameter-efficient adaptation to improve image-text alignment. Across 1.6 million paired studies from public datasets and a de-identified hospital cohort, the proposed models improve bidirectional retrieval accuracy and external generalization, achieving GREEN scores of 0.308 on MIMIC-CXR and 0.618 on Open-I, while reducing the degradation observed when abbreviation-rich, impression-only hospital reports are added to training.
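For readers unfamiliar with the dual-tower contrastive setup the abstract refers to, the general idea can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the authors' implementation: the function names, the toy embeddings, and the temperature of 0.07 are all assumptions, and the symmetric cross-entropy objective shown here is the common CLIP-style formulation rather than anything the abstract specifies. Recall@k is included as one common way to score bidirectional retrieval; the abstract does not say which retrieval metric the paper uses.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(image_embs, text_embs):
    """Cosine similarity between every image (rows) and every report (columns)."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image losses (temperature is an assumed value)."""
    sims = similarity_matrix(image_embs, text_embs)
    logits = [[s / temperature for s in row] for row in sims]
    logits_t = [list(col) for col in zip(*logits)]  # transpose for text-to-image
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

def recall_at_k(sims, k):
    """Fraction of queries (rows) whose paired item (same index) ranks in the top k."""
    hits = 0
    for i, row in enumerate(sims):
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += i in top_k
    return hits / len(sims)

# Toy 2-D embeddings standing in for the outputs of an image tower and a text tower.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
text_embs = [[2.0, 0.1], [0.1, 3.0]]
sims = similarity_matrix(image_embs, text_embs)
print("image-to-text recall@1:", recall_at_k(sims, 1))
print("loss:", symmetric_contrastive_loss(image_embs, text_embs))
```

In a real system the two embedding lists would come from trained encoders (here, a vision model and the adapted language model text encoder), and the loss would be minimized over large batches of paired studies. Transposing the similarity matrix before calling `recall_at_k` gives the text-to-image direction.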


Originally published by arXiv CS.
