Attend to Anything: Foundation Model for Unified Human Attention Modeling

arXiv CS Wednesday 03 June 2026, 04:00 UTC By Wenzhuo Zhao, Ronghao Xian, Keren Fu, Qijun Zhao

Key Points

arXiv:2606.03540v1 (new). Abstract: Existing human attention (saliency) modeling methods remain highly fragmented across modalities, scenes, and task formulations. Consequently, even with increasing model capacity and data scale, current models remain predominantly scene-dependent and task-specific, and fail to generalize in real-world applications. To address these fundamental limitations, we present the Attend to Anything Model (AAM), a multi-modal foundation model that unifies attention modeling across various image, video, and audio-visual tasks and scenes. AAM reformulates attention as a cognitive entailment relationship organized in a general-to-specific hierarchy, implemented through language prompts with hierarchical embeddings in hyperbolic space. Furthermore, to unify static image and dynamic video attention, we adopt a fluid-dynamics perspective, formulating video-frame attention as a diffusive temporal evolution governed by the Fokker–Planck equation. Extensive experiments on 16 benchmarks demonstrate that AAM consistently outperforms state-of-the-art methods by an average of 6% across various scenarios, while achieving approximately a 4× speedup in video inference. Overall, these results demonstrate that AAM provides a principled foundation for future research on attention and saliency-related tasks. The dataset and code will be available at https://github.com/wz-zhao/Attend-to-Anything.
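For readers unfamiliar with the Fokker–Planck equation named in the abstract, the sketch below shows what "diffusive temporal evolution" of an attention map can look like numerically. It is not AAM's implementation: the abstract does not specify the drift term, the diffusion coefficient, or the discretization, so everything here is an assumption. The code uses the generic textbook form dp/dt = -div(v p) + D * laplacian(p), a constant drift, unit grid spacing, periodic boundaries, and an explicit Euler step; the function name `fokker_planck_step` and all parameter values are hypothetical.

```python
import numpy as np

def fokker_planck_step(p, drift_x, drift_y, diffusion, dt):
    """One explicit Euler step of dp/dt = -div(v p) + D * laplacian(p).

    Assumes unit grid spacing and periodic boundaries (via np.roll).
    Axis 1 is treated as x (columns), axis 0 as y (rows).
    """
    # Probability flux v * p in each direction.
    flux_x = drift_x * p
    flux_y = drift_y * p
    # Central-difference divergence of the flux.
    div = 0.5 * (np.roll(flux_x, -1, axis=1) - np.roll(flux_x, 1, axis=1)) \
        + 0.5 * (np.roll(flux_y, -1, axis=0) - np.roll(flux_y, 1, axis=0))
    # Standard 5-point Laplacian.
    lap = (np.roll(p, 1, axis=0) + np.roll(p, -1, axis=0)
           + np.roll(p, 1, axis=1) + np.roll(p, -1, axis=1) - 4.0 * p)
    return p + dt * (-div + diffusion * lap)

# Toy "attention map": all mass on one pixel of a 64 x 64 frame.
p = np.zeros((64, 64))
p[32, 32] = 1.0

# Hypothetical constant drift to the right (a stand-in for scene motion).
vx = np.full_like(p, 0.1)
vy = np.zeros_like(p)

# Evolve for 20 steps; diffusion * dt = 0.1 keeps the explicit scheme stable.
for _ in range(20):
    p = fokker_planck_step(p, vx, vy, diffusion=0.2, dt=0.5)

# Mean column index of the evolved map (shifts with the drift).
mean_col = float((p.sum(axis=0) * np.arange(64)).sum())
```

In this toy setting the map stays a normalized distribution while its peak spreads out and its center drifts, which is the qualitative behavior the abstract's description suggests for attention carried from one video frame to the next.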


Originally published by arXiv CS.

