Interpretable Modeling of Driver Attention Shifts with a Vision-Language Model

arXiv CS, Tuesday 02 June 2026, 04:00 UTC
By Kaiser Hamid, Khandakar Ashrafi Akbar, Peihang Li, Nade Liang

Key Points

arXiv:2508.05852v2

Abstract: Driver gaze is commonly modeled as a spatial heatmap, but heatmaps alone are difficult for humans to interpret because they do not explain which road object or region is being monitored or why an attention shift may matter. This study examines whether minimal human-grounded supervision can steer a vision-language model (VLM) toward interpretable descriptions of driver attention shifts. Using selected high-change gaze moments from the Berkeley DeepDrive-Attention dataset, we compare zero-shot, one-shot, and LoRA fine-tuned VLM conditions against human-refined reference descriptions and expert ratings. Results show that fine-tuning with 80 expert-refined attention examples improves ROUGE-L, METEOR, Entity Alignment F1, and Human Alignment Score relative to unsteered VLM outputs. The findings suggest that language-based descriptions can complement gaze heatmaps by making driver attention more accessible for human-factors analysis, driver-monitoring review, and situation-awareness support.
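To make the reported text-overlap metrics concrete, the sketch below shows how a generated attention description might be scored against a reference description. It is illustrative only: the abstract does not specify the authors' scoring code, tokenization, or how Entity Alignment F1 is defined. Here ROUGE-L is approximated as a plain F1 over the longest common subsequence of whitespace tokens, and `entity_f1` is a hypothetical set-overlap F1 over entity labels, which is an assumption and may differ from the paper's definition.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L: F1 of LCS-based precision and recall.

    Assumption: lowercase whitespace tokenization, no stemming.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def entity_f1(predicted, reference):
    """Hypothetical entity-level F1: set overlap of entity labels.

    Assumption: entities are already extracted as strings; the paper's
    Entity Alignment F1 may be computed differently.
    """
    pred, ref = set(predicted), set(reference)
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    generated = "the driver looks at the pedestrian"
    refined = "the driver shifts gaze to the pedestrian"
    print(round(rouge_l_f1(generated, refined), 3))
    print(entity_f1({"pedestrian", "traffic light"}, {"pedestrian", "cyclist"}))
```

Published evaluations typically rely on an established metric package rather than a hand-rolled function, so scores from this sketch should not be expected to match the paper's numbers.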


Originally published by arXiv CS.

