DTFSal: Audio-Visual Dynamic Token Fusion for Video Saliency Prediction

University of Tehran
Teaser Image

Overview of our DTFSal model, which integrates a multi-scale encoder, a hierarchical multi-decoder, LTEB, DLTFB, and AMFB for efficient and accurate audio-visual saliency prediction.


Abstract

Audio-visual saliency prediction aims to mimic human visual attention by identifying salient regions in videos through the integration of visual and auditory information. Although visual-only approaches have advanced significantly, effectively incorporating auditory cues remains challenging due to complex spatio-temporal interactions and high computational demands. To address these challenges, we propose Dynamic Token Fusion Saliency (DTFSal), a novel audio-visual saliency prediction framework designed to balance accuracy with computational efficiency. Our approach features a multi-scale visual encoder equipped with two novel modules: the Learnable Token Enhancement Block (LTEB), which adaptively weights tokens to emphasize crucial saliency cues, and the Dynamic Learnable Token Fusion Block (DLTFB), which employs a shifting operation to reorganize and merge features, effectively capturing long-range dependencies and detailed spatial information. In parallel, an audio branch processes raw audio signals to extract meaningful auditory features. Visual and audio features are then integrated by our Adaptive Multimodal Fusion Block (AMFB), which combines local, global, and adaptive fusion streams for precise cross-modal fusion. The fused features are processed by a hierarchical multi-decoder structure to produce accurate saliency maps. Extensive evaluations on six audio-visual benchmarks demonstrate that DTFSal achieves state-of-the-art performance while maintaining computational efficiency.
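To give a concrete picture of the three blocks named above, the sketch below shows one plausible PyTorch rendering of token re-weighting (LTEB-style), shift-based token fusion (DLTFB-style), and gated audio-visual fusion (AMFB-style). All class names, layer sizes, the gating design, and the shift scheme are illustrative assumptions, not the released DTFSal implementation.

import torch
import torch.nn as nn


class LearnableTokenEnhancement(nn.Module):
    # LTEB-style sketch: a small scoring head assigns each token a weight in
    # (0, 1), and tokens are residually re-weighted so saliency-relevant ones
    # are emphasized. Hidden size and gating design are assumptions.
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.score = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim)
        w = self.score(tokens)          # (batch, num_tokens, 1)
        return tokens + w * tokens      # residual token re-weighting


class ShiftTokenFusion(nn.Module):
    # DLTFB-style sketch: half of the channels are rolled along the token axis
    # and merged back with the unshifted half, mixing information across
    # distant positions at low cost. Shift size and split ratio are assumptions.
    def __init__(self, dim: int, shift: int = 1):
        super().__init__()
        self.shift = shift
        self.merge = nn.Linear(dim, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        half = tokens.shape[-1] // 2
        shifted = torch.roll(tokens[..., :half], shifts=self.shift, dims=1)
        mixed = torch.cat([shifted, tokens[..., half:]], dim=-1)
        return tokens + self.merge(mixed)


class AdaptiveMultimodalFusion(nn.Module):
    # AMFB-style sketch: a local (token-wise) stream, a global (pooled) stream,
    # and a learned gate that adaptively mixes audio into the visual tokens.
    def __init__(self, dim: int):
        super().__init__()
        self.local = nn.Linear(2 * dim, dim)
        self.global_proj = nn.Linear(2 * dim, dim)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, vis: torch.Tensor, aud: torch.Tensor) -> torch.Tensor:
        # vis: (B, N, D) visual tokens; aud: (B, D) clip-level audio feature
        aud_tok = aud.unsqueeze(1).expand_as(vis)
        pair = torch.cat([vis, aud_tok], dim=-1)                      # (B, N, 2D)
        local = self.local(pair)                                      # token-wise
        global_ctx = self.global_proj(pair.mean(dim=1, keepdim=True))
        g = self.gate(pair)                                           # adaptive mix
        return vis + g * local + (1 - g) * global_ctx


if __name__ == "__main__":
    vis = torch.randn(2, 196, 256)   # dummy visual tokens
    aud = torch.randn(2, 256)        # dummy clip-level audio embedding
    x = ShiftTokenFusion(256)(LearnableTokenEnhancement(256)(vis))
    print(AdaptiveMultimodalFusion(256)(x, aud).shape)  # torch.Size([2, 196, 256])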

DTFSal: Experimental Results Across Audio-Visual Datasets

[Table: Results on six audio-visual saliency benchmarks]

Comparison with previous methods on six audio-visual saliency datasets. For our model, we indicate the percentage (%) change in performance relative to the second-best result, or to the best result if ours is not the top performer. The best results are highlighted in red, the second-best in blue, and the third-best in dark green.

Performance on Visual-Only Datasets

[Table: Results on DHF1K and UCF Sports]

Comparison with Previous Methods on DHF1K and UCF Sports Datasets. For our model, we indicate the percentage (%) change in performance relative to the second-best result, or to the best result if ours is not the top performer. The best results are highlighted in red, the second-best in blue, and the third-best in green.

DTFSal: Qualitative Results of Saliency Maps

Qualitative Results of Audio-Visual Saliency Maps


Figure: Qualitative comparison of our DTFSal model with previous SOTA audio-visual saliency prediction methods.



Figure: Additional qualitative comparisons of our DTFSal model with previous SOTA audio-visual saliency prediction methods.


Qualitative Results of Visual-Only Saliency Maps

[Figure: Visual-only qualitative comparison]

Figure: Qualitative comparison of our DTFSal model with previous SOTA visual-only saliency prediction methods.


BibTeX

@misc{hoshanfar2025dtfsal,
  author = {Kiana Hoshanfar and Alireza Hosseini and Ahmad Kalhor and Babak Nadjar Araabi},
  title  = {DTFSal: Audio-Visual Dynamic Token Fusion for Video Saliency Prediction},
  year   = {2025},
}