Conditional DETR V2: Efficient Detection Transformer with Box Queries

by   Xiaokang Chen, et al.

In this paper, we are interested in Detection Transformer (DETR), an end-to-end object detection approach based on a transformer encoder-decoder architecture that requires no hand-crafted postprocessing such as NMS. Inspired by Conditional DETR, an improved DETR with fast training convergence that introduced box queries (originally called spatial queries) for the internal decoder layers, we reformulate the object query as a box query: a composition of the embedding of a reference point and a transformation of the box with respect to that point. This reformulation highlights the connection between the object query in DETR and the anchor box widely studied in Faster R-CNN. We further learn the box queries from the image content, improving the detection quality of Conditional DETR while retaining its fast training convergence. In addition, we adopt axial self-attention to reduce the memory cost and accelerate the encoder. The resulting detector, called Conditional DETR V2, achieves better results than Conditional DETR while consuming less memory and running more efficiently. For example, with the DC5-ResNet-50 backbone, our approach achieves 44.8 AP at 16.4 FPS on the COCO val set; compared to Conditional DETR, it runs 1.6× faster, saves 74% of the overall memory cost, and improves AP by 1.0.
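The box-query idea above — an embedding of a reference point combined with a content-dependent transformation — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the sinusoidal embedding helper, the elementwise combination, and all shapes are assumptions for the sketch.

```python
import numpy as np

def sinusoidal_embed(coords, num_feats=128, temperature=10000.0):
    """Hypothetical sinusoidal positional embedding of 2-D reference points.

    coords: (N, 2) array of normalized (x, y) reference points in [0, 1].
    Returns an (N, 2 * num_feats) embedding.
    """
    dim_t = temperature ** (2 * (np.arange(num_feats) // 2) / num_feats)
    pos = coords[..., None] * 2 * np.pi / dim_t        # (N, 2, num_feats)
    emb = np.empty_like(pos)
    emb[..., 0::2] = np.sin(pos[..., 0::2])            # sine on even dims
    emb[..., 1::2] = np.cos(pos[..., 1::2])            # cosine on odd dims
    return emb.reshape(coords.shape[0], -1)

def box_query(ref_points, transform):
    """Compose a box query: reference-point embedding scaled elementwise
    by a transformation predicted from the image content (assumed shape)."""
    return sinusoidal_embed(ref_points) * transform

# Toy usage: 4 queries with 256-dimensional embeddings.
rng = np.random.default_rng(0)
refs = rng.random((4, 2))         # reference points (normalized)
T = rng.standard_normal((4, 256)) # stand-in for the content-derived transform
q = box_query(refs, T)
print(q.shape)  # (4, 256)
```

In the paper, the transformation is learned from the image content rather than sampled; the sketch only shows how the two factors compose into one query vector per object.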




