VFEM: Visual Feature Empowered Multivariate Time Series Forecasting with Cross-Modal Fusion

arXiv CS, Tuesday 09 June 2026, 04:00 UTC
By Yanlong Wang, Hang Yu, Jian Xu, Fei Ma, Hongkang Zhang, Tongtong Feng, Zijian Zhang, Shao-Lun Huang, Danny Dongning Sun, Xiao-Ping Zhang


arXiv:2510.03244v2

Abstract: Large time series foundation models often adopt channel-independent architectures to handle varying data dimensions, but this design ignores crucial cross-channel dependencies. Meanwhile, existing cross-modal methods predominantly rely on textual modalities, leaving the spatial pattern recognition capabilities of vision models underexplored for time series analysis. To address these limitations, we propose VFEM, a cross-modal forecasting model that leverages pre-trained large vision models (LVMs) to capture complex cross-variable patterns. VFEM transforms multivariate time series into visual representations, enabling LVMs to perceive spatial relationships that are not explicitly modeled by channel-independent models. Through a dual-branch architecture, visual and temporal features are independently extracted and then fused via cross-modal attention, allowing complementary information from both modalities to enhance forecasting. By freezing the LVM and training only 7.45% of the total parameters, VFEM achieves competitive performance on multiple benchmarks, offering a new perspective on multivariate time series forecasting.
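The abstract does not say how the series is rendered as an image or how the attention is wired, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a simple min-max scaling of a (variables x time) matrix into a grayscale grid, and a plain scaled dot-product cross-attention in which temporal features act as queries over visual features. The function names `to_image` and `cross_attention` are hypothetical.

```python
import math

def to_image(series):
    """Min-max scale a (variables x time) series into a 2D grid in [0, 1].

    Assumption: a basic grayscale rendering; the paper's actual
    visual representation is not described in the abstract.
    """
    flat = [v for row in series for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero for constant input
    return [[(v - lo) / span for v in row] for row in series]

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys.

    Assumption: queries come from the temporal branch, keys and values
    from the (frozen) visual branch; no learned projections are shown.
    """
    d = len(queries[0])
    fused = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused

# Toy usage: two variables, two time steps, and hand-made feature vectors.
image = to_image([[0.0, 5.0], [10.0, 2.5]])
temporal_features = [[1.0, 0.0]]            # one temporal token (query)
visual_features = [[1.0, 0.0], [0.0, 1.0]]  # two visual tokens (keys)
visual_values = [[2.0, 0.0], [0.0, 2.0]]
fused = cross_attention(temporal_features, visual_features, visual_values)
```

In a real model the visual features would come from a frozen pre-trained vision backbone and only the fusion and forecasting layers would be trained, which is consistent with the small trainable fraction the abstract reports.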


