Action Representation Using Classifier Decision Boundaries
Most popular deep-learning-based models for action recognition generate separate predictions over short temporal windows, which are then aggregated heuristically to assign an action label to the full video segment. Because not all frames of a video characterize the underlying action, pooling schemes that give equal importance to all frames may be unfavorable. To address this challenge, we propose a novel pooling scheme, dubbed SVM pooling, based on the notion that among the bag of features generated by a CNN on all temporal windows, there is at least one feature that characterizes the action. To this end, we learn a decision hyperplane that separates this unknown yet useful feature from the rest. Applying multiple instance learning in an SVM setup, we use the parameters of this separating hyperplane as a descriptor for the video. Since these parameters are directly related to the support vectors in a max-margin framework, they serve as robust representations for pooling the CNN features. We devise a joint optimization objective and an efficient solver that learns these hyperplanes per video together with the corresponding action classifiers over the hyperplanes. Experiments on the standard HMDB and UCF101 datasets demonstrate state-of-the-art performance.
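To make the idea of using hyperplane parameters as a video descriptor concrete, the following is a minimal sketch and not the joint objective or multiple instance solver described above. It assumes that a video's per-window CNN features are treated as positives and that some set of background features (a hypothetical choice, not specified in the abstract) serves as negatives. The function name `svm_pool` and all hyperparameters are illustrative.

```python
import numpy as np

def svm_pool(video_feats, background_feats, reg=1e-2, lr=0.1, epochs=200):
    """Fit a linear SVM by subgradient descent and return [w, b] as the descriptor.

    Assumption: video_feats are positives and background_feats are negatives.
    This simplification omits the multiple instance constraint from the abstract.
    """
    X = np.vstack([video_feats, background_feats]).astype(float)
    y = np.concatenate([np.ones(len(video_feats)),
                        -np.ones(len(background_feats))])
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1.0  # samples that violate the margin
        # Subgradient of the L2-regularized mean hinge loss.
        grad_w = reg * w - (y[active][:, None] * X[active]).sum(axis=0) / len(X)
        grad_b = -y[active].sum() / len(X)
        w -= lr * grad_w
        b -= lr * grad_b
    return np.append(w, b)

# Synthetic stand-ins for CNN features: 20 temporal windows, 8 dimensions each.
rng = np.random.default_rng(0)
video = rng.normal(loc=2.0, size=(20, 8))
background = rng.normal(loc=-2.0, size=(20, 8))

descriptor = svm_pool(video, background)
print(descriptor.shape)  # (9,): 8 weights plus one bias
```

Each video yields one fixed-length vector regardless of how many temporal windows it has, so a standard classifier can then be trained over these descriptors.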