Audio-Visual Transformer Based Crowd Counting

09/04/2021
by   Usman Sajid, et al.

Crowd estimation is a very challenging problem. The most recent study tries to exploit auditory information to aid the visual models; however, its performance is limited by the lack of an effective approach for feature extraction and integration. The paper proposes a new audiovisual multi-task network that addresses the critical challenges in crowd counting by effectively utilizing both visual and audio inputs for better modality association and productive feature extraction. The proposed network introduces the notion of auxiliary and explicit image patch-importance ranking (PIR) and patch-wise crowd estimate (PCE) information to produce a third (run-time) modality. These modalities (audio, visual, run-time) undergo a transformer-inspired cross-modality co-attention mechanism to finally output the crowd estimate. To acquire rich visual features, we propose a multi-branch structure with transformer-style fusion in-between. Extensive experimental evaluations show that the proposed scheme outperforms the state-of-the-art networks under all evaluation settings with up to 33.8% improvement. We also analyze and compare the vision-only variant of our network and empirically demonstrate its superiority over previous approaches.
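The abstract does not spell out the co-attention mechanism, so the following is only a minimal sketch of generic scaled dot-product cross-attention between two modalities, not the authors' network. It assumes toy hand-written feature vectors and omits the learned query/key/value projections, multi-head structure, and the run-time modality; the names `cross_attention`, `visual`, and `audio` are hypothetical and chosen for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Each query token (one modality) attends over the tokens of another modality.

    Assumption: no learned projection matrices, so this only shows the
    attention arithmetic, not a trainable layer.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        # Scaled dot-product similarity between this query and every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Hypothetical toy features: 3 visual patch tokens and 2 audio tokens, dimension 4.
visual = [[1.0, 0.0, 0.5, 0.2], [0.1, 0.9, 0.3, 0.0], [0.4, 0.4, 0.4, 0.4]]
audio = [[0.2, 0.1, 0.0, 0.7], [0.6, 0.3, 0.8, 0.1]]

# Co-attention in both directions: each modality queries the other.
visual_attended = cross_attention(visual, audio, audio)
audio_attended = cross_attention(audio, visual, visual)
```

Because the attention weights are non-negative and sum to one, every attended vector is a convex combination of the other modality's tokens, which is what lets one modality's features be re-expressed in terms of the other's.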


