ActionFormer: Localizing Moments of Actions with Transformers
Self-attention based Transformer models have demonstrated impressive results for image classification and object detection, and more recently for video understanding. Inspired by this success, we investigate the application of Transformer networks for temporal action localization in videos. To this end, we present ActionFormer – a simple yet powerful model to identify actions in time and recognize their categories in a single shot, without using action proposals or relying on pre-defined anchor windows. ActionFormer combines a multiscale feature representation with local self-attention, and uses a lightweight decoder to classify every moment in time and estimate the corresponding action boundaries. We show that this orchestrated design results in major improvements over prior works. Without bells and whistles, ActionFormer achieves 65.6% mAP at tIoU=0.5 on THUMOS14, outperforming the best prior model by 8.7 absolute percentage points and crossing the 60% mAP mark for the first time. Further, ActionFormer demonstrates strong results on ActivityNet 1.3 (36.0% average mAP) and the more recent EPIC-Kitchens 100 (+13.5% average mAP over prior works). Our code is available at http://github.com/happyharrycn/actionformer_release
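To make the single-shot, anchor-free decoding described above concrete, here is a minimal sketch in PyTorch. It assumes a multiscale feature pyramid already produced by the Transformer encoder; the class and function names (AnchorFreeHead, decode_segments) and all hyperparameters are illustrative and are not taken from the released code.

```python
import torch
import torch.nn as nn


class AnchorFreeHead(nn.Module):
    """Illustrative single-shot decoder: for every moment (time step) on a
    feature pyramid, predict class scores and distances to the action
    start and end. A simplified sketch, not the authors' implementation."""

    def __init__(self, dim: int = 256, num_classes: int = 20):
        super().__init__()
        self.cls_head = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(dim, num_classes, kernel_size=3, padding=1),
        )
        self.reg_head = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(dim, 2, kernel_size=3, padding=1),  # (dist to start, dist to end)
        )

    def forward(self, pyramid):
        # pyramid: list of tensors, each (batch, dim, T_level),
        # e.g. the multiscale output of the Transformer encoder
        cls_logits, offsets = [], []
        for feats in pyramid:
            cls_logits.append(self.cls_head(feats))           # (B, C, T_level)
            offsets.append(torch.relu(self.reg_head(feats)))  # non-negative distances
        return cls_logits, offsets


def decode_segments(cls_logits, offsets, stride, score_thresh=0.1):
    """Turn per-moment predictions at one pyramid level into
    (start, end, score, label) candidates, before any NMS step."""
    scores = cls_logits.sigmoid()                              # (B, C, T)
    B, C, T = scores.shape
    t = torch.arange(T, dtype=torch.float32, device=scores.device) * stride
    starts = t - offsets[:, 0] * stride                        # (B, T)
    ends = t + offsets[:, 1] * stride
    results = []
    for b in range(B):
        best_scores, labels = scores[b].max(dim=0)             # per-moment top class
        keep = best_scores > score_thresh
        results.append(torch.stack(
            [starts[b][keep], ends[b][keep],
             best_scores[keep], labels[keep].float()], dim=1))
    return results
```

In practice, candidates from all pyramid levels would be pooled and deduplicated (e.g., with non-maximum suppression) to produce the final action detections.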