PnP-DETR: Towards Efficient Visual Analysis with Transformers

09/15/2021
by   PetsTime, et al.

Recently, DETR pioneered the solution of vision tasks with transformers: it directly translates the image feature map into the object detection result. Though effective, translating the full feature map can be costly due to redundant computation in some areas, such as the background. In this work, we encapsulate the idea of reducing spatial redundancy into a novel poll and pool (PnP) sampling module, with which we build an end-to-end PnP-DETR architecture that adaptively allocates its computation spatially to be more efficient. Concretely, the PnP module abstracts the image feature map into fine foreground object feature vectors and a small number of coarse background contextual feature vectors. The transformer models information interaction within the fine-coarse feature space and translates the features into the detection result. Moreover, the PnP-augmented model can instantly achieve various desired trade-offs between performance and computation with a single model by varying the sampled feature length, without needing to train multiple models as existing methods do. It thus offers greater flexibility for deployment in diverse scenarios with varying computation constraints. We further validate the generalizability of the PnP module on panoptic segmentation and the recent transformer-based image recognition model ViT, and show consistent efficiency gains. We believe our method takes a step toward efficient visual analysis with transformers, wherein spatial redundancy is commonly observed. Code will be available at <https://github.com/twangnh/pnp-detr>.
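The poll-and-pool idea described in the abstract can be made concrete with a short sketch. The PyTorch snippet below is a minimal, hypothetical illustration, not the authors' implementation: the class name, the `poll_ratio` argument, and `num_coarse` are assumptions, and for simplicity the pool step aggregates over the full feature map rather than only the non-polled (background) locations.

```python
import torch
import torch.nn as nn

class PnPSampler(nn.Module):
    """Hypothetical sketch of a poll-and-pool sampler.

    Poll step: score every spatial location and keep the top fraction
    (poll_ratio) as fine foreground feature vectors. Pool step: compress
    the feature map into a small fixed number of coarse background
    context vectors via learned aggregation weights.
    """

    def __init__(self, dim, num_coarse=64):
        super().__init__()
        self.score = nn.Linear(dim, 1)         # poll: per-location informativeness score
        self.agg = nn.Linear(dim, num_coarse)  # pool: aggregation weights over locations
        self.proj = nn.Linear(dim, dim)        # pool: value projection

    def forward(self, feats, poll_ratio=0.33):
        # feats: (B, N, C) flattened image feature map
        B, N, C = feats.shape
        k = max(1, int(N * poll_ratio))        # vary at inference for the speed/accuracy trade-off

        # poll: select the k highest-scoring locations as fine foreground vectors
        scores = self.score(feats).squeeze(-1)                       # (B, N)
        top = scores.topk(k, dim=1).indices                          # (B, k)
        fine = feats.gather(1, top.unsqueeze(-1).expand(-1, -1, C))  # (B, k, C)
        # modulate fine features by their scores so the scorer receives gradients
        fine = fine * scores.gather(1, top).sigmoid().unsqueeze(-1)

        # pool: aggregate the map into num_coarse context vectors
        w = self.agg(feats).softmax(dim=1)                           # (B, N, M)
        coarse = torch.einsum('bnm,bnc->bmc', w, self.proj(feats))   # (B, M, C)

        # fine-coarse feature space fed to the transformer
        return torch.cat([fine, coarse], dim=1)                      # (B, k + M, C)
```

Because `poll_ratio` is only read at forward time, a single trained model can trade accuracy for computation at deployment, e.g. `PnPSampler(dim=256)(feats, poll_ratio=0.5)` versus `poll_ratio=0.2`, mirroring the single-model flexibility the abstract describes.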


Related research

04/11/2022
Consistency Learning via Decoding Path Augmentation for Transformers in Human Object Interaction Detection
Human-Object Interaction detection is a holistic visual recognition task...

04/22/2021
Multiscale Vision Transformers
We present Multiscale Vision Transformers (MViT) for video and image rec...

03/12/2022
The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy
Vision transformers (ViTs) have gained increasing popularity as they are...

03/19/2020
Spatially Adaptive Inference with Stochastic Feature Sampling and Interpolation
In the feature maps of CNNs, there commonly exists considerable spatial ...

01/09/2022
Glance and Focus Networks for Dynamic Visual Recognition
Spatial redundancy widely exists in visual recognition tasks, i.e., disc...

05/20/2021
Content-Augmented Feature Pyramid Network with Light Linear Transformers
Recently, plenty of work has tried to introduce transformers into comput...

09/05/2023
Compressing Vision Transformers for Low-Resource Visual Learning
Vision transformer (ViT) and its variants have swept through visual lear...
