Visual Word Selection without Re-Coding and Re-Pooling
The Bag-of-Words (BoW) representation is widely used in computer vision. The size of the codebook impacts the time and space complexity of the applications that use BoW. Thus, given a training set for a particular computer vision task, a key problem is pruning a large codebook to select only a subset of visual words. Evaluating possible selections of words to be included in the pruned codebook can be computationally prohibitive; in a brute-force scheme, evaluating each pruned codebook requires re-coding of all features extracted from training images to words in the candidate codebook and then re-pooling the words to obtain a representation of each image, e.g., histogram of visual word frequencies. In this paper, a method is proposed that selects and evaluates a subset of words from an initially large codebook, without the need for re-coding or re-pooling. Formulations are proposed for two commonly-used schemes: hard and soft (kernel) coding of visual words with average-pooling. The effectiveness of these formulations is evaluated on the 15 Scenes and Caltech 10 benchmarks.
READ FULL TEXT