Multimodal Large Language Model-Enabled Video Translation: A Role-Oriented Survey

arXiv CS, Tuesday 02 June 2026, 04:00 UTC. By Bingzheng Qu, Kehai Chen, Xuefeng Bai, Min Zhang

arXiv:2604.11283v2

Abstract: Recent progress in multimodal large language models (MLLMs) is reshaping video translation from a cascaded pipeline of automatic speech recognition, machine translation, text-to-speech, and lip synchronization into a unified multimodal reasoning and generation problem. High-quality video translation requires not only semantic fidelity, but also temporal alignment, speaker consistency, and emotional expressiveness across visual, acoustic, and linguistic streams. This survey provides a focused review of MLLM-enabled video translation through a role-oriented taxonomy. We organize MLLM-enabled and MLLM-relevant studies into three functional roles: Semantic Reasoner, which grounds translation in video understanding, temporal reasoning, and multimodal fusion; Expressive Performer, which supports controllable and context-aware speech generation; and Visual Synthesizer, which enables lip synchronization and visually coherent speaker rendering. We further summarize representative datasets, benchmarks, and metrics for each role, and discuss how current evaluation protocols fall short of end-to-end video translation requirements. Finally, we identify open challenges in long-form video understanding, temporal modeling, multimodal alignment, multilingual robustness, and responsible deployment, outlining future directions for natural and trustworthy cross-lingual video communication.
