Vision-Language Modeling in 3D Medical Imaging (VLM3D) Workshops

ICCV 2025 – Full-Day Workshop

Date & Venue: ICCV 2025, Honolulu, Hawaii
Format: invited talks · paper track (ICCV proceedings) · live benchmark reveal & poster session
Paper submission is now closed.


Watch the Workshop Recording
Access Recording

Passcode: Vn+8YLt5

📸 Workshop Photo Highlights

Workshop Schedule (08:00–17:00)

Time Session
08:00–08:30 Opening and Introduction, Prof. Bjoern Menze
08:30–09:15 Keynote: Dr. Daguang Xu
09:15–09:30 Coffee Break
09:30–10:15 Keynote: Prof. Björn Ommer
10:15–11:00 Keynote: Prof. Pranav Rajpurkar
11:00–11:15 Coffee Break
11:15–11:35 Oral Presentation 1: CTFlow: Video-Inspired Latent Flow Matching for 3D CT Synthesis
11:40–12:00 Oral Presentation 2: Unified Supervision For Vision-Language Modeling in 3D Computed Tomography
12:00–13:30 Lunch and Poster Session
13:30–14:15 Keynote: Dr. Fernando Pérez-García
14:15–15:00 Keynote: Prof. Akshay Chaudhari
15:00–15:15 Coffee Break
15:15–15:35 Oral Presentation 3: T3D: Advancing 3D Medical Vision-Language Pre-training by Learning Multi-View Visual Consistency
15:40–16:00 Oral Presentation 4: Foundation Models for Multimodal MRI Synthesis with Language Guidance
16:00–16:45 Keynote: Dr. Zongwei Zhou
16:45–17:00 Concluding Remarks

Why VLM3D?

VLM3D is the first ICCV workshop devoted entirely to vision-language methods for volumetric (3D) medical data. Our goal is to create a forum where computer-vision, NLP, and clinical-AI researchers can:

  • share state-of-the-art techniques for 3D report generation, abnormality reasoning, and generative modeling;
  • discuss open problems—efficient volumetric representation learning, clinical grounding, and trustworthy evaluation;
  • build new collaborations that accelerate translation of multimodal AI into radiology practice.

Confirmed Speakers

Prof. Bjoern Menze is a professor at the University of Zurich and a leading expert in biomedical image analysis. Formerly a W3 professor at TU Munich, he has held positions at Inria, ETH Zurich, MIT, and Harvard. His work has earned awards such as the MICCAI Best Paper and Young Scientist Impact Award. Recently, he has pioneered 3D vision-language modeling in medical imaging with large-scale efforts like CT-RATE, CT-CLIP, and GenerateCT.

Prof. Björn Ommer is a professor at Ludwig Maximilian University of Munich, where he leads the Computer Vision & Learning Group. He is renowned for his work in generative AI, notably as a co-developer of Stable Diffusion, a widely used open-source text-to-image model. His research encompasses semantic image understanding, visual synthesis, and explainable AI. Ommer's contributions have been recognized with the 2024 German AI Prize and the Eduard Rhein Foundation Technology Prize. His work has applications in various domains, including medical imaging, where it aids in the automated analysis of medical image data.

Dr. Daguang Xu leads healthcare AI research at NVIDIA’s AI-Infra group, focusing on 3D medical imaging, EHR mining, and vision-language modeling. He has co-led major open-source projects like MONAI and developed influential 3D models such as UNETR and MAISI. With 90+ peer-reviewed papers and ~50 patents, his work bridges cutting-edge 3D AI and clinical impact.

Prof. Akshay Chaudhari is an Assistant Professor of Radiology and Biomedical Data Science at Stanford University and Interim Division Chief of Integrative Biomedical Imaging Informatics. He leads the Machine Intelligence in Medical Imaging (MIMI) group, developing multimodal foundation models and physics-guided AI techniques that transform both image acquisition and analysis across vision, language, and EHR data. He co-founded Cognita, a company building next-generation multimodal AI systems to deliver fast, trustworthy diagnostics for radiology workflows.

Prof. Pranav Rajpurkar is an Associate Professor at Harvard University’s Department of Biomedical Informatics. He designs algorithms and curates datasets to advance trustworthy, clinician-level AI across medical imaging, clinical text, and electronic health records. He co-founded a2z Radiology AI, a company developing comprehensive diagnostic-imaging systems that serve as an AI safety net for radiologists. Rajpurkar also co-hosts The AI Health Podcast, edits the Doctor Penguin AI Health Newsletter, and teaches the “AI for Medicine” Coursera series and the AI for Healthcare Bootcamp.

Dr. Fernando Pérez-García is a Senior Research Engineer at Microsoft Research Health Futures. His work focuses on vision–language foundation models for healthcare and their translation to clinical practice. Prior to joining Microsoft, he was at the Centre for Neuroimaging at the Paris Brain Institute, building histological and MRI brain atlases for deep brain stimulation. He then moved to UCL and King’s College London for his PhD in Medical Imaging, where he investigated the potential of AI to improve the treatment of epilepsy, developing open-source software tools such as TorchIO along the way.

Dr. Zongwei Zhou is currently an Assistant Research Scientist in Computer Science at Johns Hopkins University and a member of the Malone Center for Engineering in Healthcare. His work has been recognized with the AMIA Doctoral Dissertation Award, the MICCAI Young Scientist Award, and the Elsevier-MedIA Best Paper Award. He has authored over 60 peer-reviewed publications and holds 10 U.S. patents. His contributions have been ranked among the most popular in IEEE TMI and the most cited in EJNMMI Research.

Accepted Papers

Oral Presentations

[11:15] CTFlow: Video-Inspired Latent Flow Matching for 3D CT Synthesis
Jiayi Wang, Hadrien Reynaud, Franciskus Xaverius Erick, Bernhard Kainz
[11:40] Unified Supervision For Vision-Language Modeling in 3D Computed Tomography
Hao-Chih Lee, Zelong Liu, Hamza Ahmed, Spencer Kim, Sean Huvers, Vishwesh Nath, Zahi A. Fayad, Timothy Deyer, Xueyan Mei
[15:15] T3D: Advancing 3D Medical Vision-Language Pre-training by Learning Multi-View Visual Consistency
Che Liu, Cheng Ouyang, Yinda Chen, César Quilodrán-Casas, Lei Ma, Jie Fu, Yike Guo, Anand Shah, Wenjia Bai, Rossella Arcucci
[15:40] Foundation Models for Multimodal MRI Synthesis with Language Guidance
Mahmut Yurt, Xiaozhi Cao, Zihan Zhou, Kawin Setsompop, Shreyas Vasanawala, John Pauly

Poster Presentations

All posters will be presented during the lunch session (12:00–13:30).

Generative Text-Guided 3D Vision-Language Pretraining for Unified Medical Image Segmentation
Yinda Chen, Che Liu, Wei Huang, Xiaoyu Liu, Haoyuan Shi, Sibo Cheng, Rossella Arcucci, Zhiwei Xiong
Prediction Degeneracy in Medical Vision-Language Models: Implications for Robustness and Interpretability
Martin Goetze, Dennis Eschweiler, Brendan Huang, Gustav Müller-Franzes, Carolina Ramirez, Madeline Hess, Lavinia Goldermann, Sharmila Majumdar, Daniel Truhn
RadZero3D: Bridging Self-Supervised Video Models and Medical Vision-Language Alignment for Zero-Shot Chest CT Interpretation
Jonggwon Park, Kyoyun Choi, Byungmu Yoon, Hong Geun Cho, Bumcheol Hwang
Patch-wise Intensity Mapping for Individualized Brain Abnormality Detection in Alzheimer’s Disease
Yangshuang Xu, Rongjie Liu, Chao Huang
CranioCaption3D: Language-Grounded 3-D Skull Phenotyping for Rare Craniofacial Syndromes via Self-Supervised Captioning and Ontology-Guided Retrieval
Khartik Uppalapati, Bora Yimenicioglu, Adan Eftekhari, Shakeel Abdulkareem, Bhavya Uppalapati
CT-GRAPH: Hierarchical Graph Attention Network for Anatomy-Guided CT-Report Generation
Hamza Kalisch, Fabian Hörst, Jens Kleesiek, Ken Herrmann, Constantin Seibold
Preprocessing–Architecture Co-Design for Multi-Label Classification in 3D Chest CT Scans
Jung Jun Ah, Juhui Lee, Jihye Heo, Dongheon Lee