Action Recognition for Depth Video using Multi-view Dynamic Images
Dynamic images are a recently proposed action representation that compactly captures the temporal evolution of a video and is well suited to deep convolutional neural networks (CNNs). Inspired by their early success on RGB video, we extend them to the depth domain. To better exploit the 3D characteristics of depth video, we propose multi-view dynamic images. Specifically, the raw depth video is densely projected onto different imaging viewpoints by rotating a virtual camera around the specific instances in 3D space. A dynamic image is then extracted from each of the resulting multi-view depth videos, and together these constitute the multi-view dynamic images. As a result, multi-view dynamic images carry more view-tolerant information than their single-view counterpart. We further propose a novel CNN learning model for feature learning on multi-view dynamic images. Dynamic images from different views share the same convolutional layers but use different fully-connected layers. This design aims to improve the tuning of the shallow convolutional layers by alleviating vanishing gradients. Furthermore, to address spatial variation, we propose an action proposal method based on Faster R-CNN, and dynamic images are extracted only from the action proposal regions. In experiments, our approach achieves state-of-the-art performance on three challenging datasets (NTU RGB-D, Northwestern-UCLA, and UWA3DII).
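The construction of multi-view dynamic images described above can be illustrated with a minimal Python sketch. The abstract does not state how the dynamic image is computed or how the virtual camera projection is carried out, so this sketch rests on stated assumptions: the dynamic image uses approximate rank pooling weights (the closed-form weighting commonly used for RGB dynamic images), and each view is rendered by an orthographic re-projection after rotating the depth point cloud about the vertical axis through its centroid, with depth treated in pixel-like units. All function names here are hypothetical and are not taken from the paper.

```python
import numpy as np


def rank_pooling_coefficients(num_frames):
    """Approximate rank pooling weights alpha_t for t = 1..T (assumed scheme)."""
    T = num_frames
    # harmonic[t] holds H_t = sum_{i=1}^{t} 1/i, with H_0 = 0.
    harmonic = np.concatenate(([0.0], np.cumsum(1.0 / np.arange(1, T + 1))))
    t = np.arange(1, T + 1)
    return 2.0 * (T - t + 1) - (T + 1) * (harmonic[T] - harmonic[t - 1])


def dynamic_image(frames):
    """Collapse a (T, H, W) video into one (H, W) image as a weighted frame sum."""
    frames = np.asarray(frames, dtype=np.float64)
    alpha = rank_pooling_coefficients(frames.shape[0])
    return np.tensordot(alpha, frames, axes=(0, 0))


def rotate_depth_frame(depth, angle_deg):
    """Re-render a depth map from a virtual camera rotated about the vertical axis.

    Simplifying assumptions: orthographic projection, zero depth means
    "no measurement", and rotation about the point-cloud centroid.
    """
    depth = np.asarray(depth, dtype=np.float64)
    h, w = depth.shape
    out = np.zeros((h, w), dtype=np.float64)
    ys, xs = np.nonzero(depth > 0)
    if ys.size == 0:
        return out
    points = np.stack([xs, ys, depth[ys, xs]], axis=1).astype(np.float64)
    center = points.mean(axis=0)
    theta = np.deg2rad(angle_deg)
    c, s = np.cos(theta), np.sin(theta)
    rot_y = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    rotated = (points - center) @ rot_y.T + center
    u = np.rint(rotated[:, 0]).astype(int)
    v = np.rint(rotated[:, 1]).astype(int)
    z = rotated[:, 2]
    keep = (u >= 0) & (u < w) & (v >= 0) & (v < h) & (z > 0)
    u, v, z = u[keep], v[keep], z[keep]
    # Simple z-buffer: write far points first so nearer points overwrite them.
    order = np.argsort(-z)
    out[v[order], u[order]] = z[order]
    return out


def multi_view_dynamic_images(depth_video, angles_deg):
    """Return one dynamic image per virtual viewpoint."""
    return [
        dynamic_image([rotate_depth_frame(frame, angle) for frame in depth_video])
        for angle in angles_deg
    ]
```

For example, `multi_view_dynamic_images(video, [-30, 0, 30])` would return three dynamic images for a depth video of shape `(T, H, W)`. In a full pipeline, each image would then be fed to the shared convolutional layers, with one set of fully-connected layers per view.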