Dual Semantic Fusion Network for Video Object Detection

09/16/2020
by   Lijian Lin, et al.

Video object detection is challenging because video sequences captured in complex environments often suffer from degraded quality. The area is currently dominated by feature-enhancement methods, which distill useful semantic information from multiple frames and fuse it to generate enhanced features. However, the distillation and fusion operations are usually performed at either the frame level or the instance level, and they rely on external guidance from additional information such as optical flow or a feature memory. In this work, we propose a dual semantic fusion network (DSFNet) that fully exploits both frame-level and instance-level semantics in a unified fusion framework without external guidance. We also introduce a geometric similarity measure into the fusion process to alleviate the information distortion caused by noise. As a result, DSFNet generates more robust features through multi-granularity fusion and is not affected by the instability of external guidance. We evaluate DSFNet with extensive experiments on the ImageNet VID dataset. Notably, it achieves 84.1% mAP with ResNet-101 and 85.4% mAP with ResNeXt-101 without any post-processing steps, which is, to the best of our knowledge, the best performance among current state-of-the-art video object detectors.
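To make the idea of similarity-guided fusion concrete, here is a minimal sketch of one generic way to fuse features from several frames into a reference feature. It is not the DSFNet implementation: the abstract does not specify the geometric similarity measure, so cosine similarity with a softmax weighting is assumed here purely for illustration, and the names `fuse_features` and `temperature` are hypothetical.

```python
import numpy as np


def fuse_features(reference, supports, temperature=1.0):
    """Fuse support feature vectors into a reference feature vector.

    reference: array of shape (d,), the feature of the current frame.
    supports:  array of shape (n, d), features from neighbouring frames.

    Assumption for illustration only: weights are a softmax over cosine
    similarities to the reference. The paper's actual measure may differ.
    """
    # Include the reference itself so its own content is preserved.
    candidates = np.vstack([reference[None, :], supports])

    # Cosine similarity between each candidate and the reference.
    eps = 1e-8
    ref_unit = reference / (np.linalg.norm(reference) + eps)
    cand_unit = candidates / (np.linalg.norm(candidates, axis=1, keepdims=True) + eps)
    sims = cand_unit @ ref_unit

    # Numerically stable softmax over the similarities.
    logits = sims / temperature
    logits = logits - logits.max()
    weights = np.exp(logits)
    weights = weights / weights.sum()

    # Weighted sum: dissimilar (possibly noisy) features contribute less.
    fused = weights @ candidates
    return fused, weights


if __name__ == "__main__":
    ref = np.array([1.0, 0.0])
    sup = np.array([[1.0, 0.0], [0.0, 1.0]])
    fused, weights = fuse_features(ref, sup)
    print(fused, weights)
```

In this sketch, a support feature that points in the same direction as the reference receives the same weight as the reference itself, while an orthogonal one is down-weighted. The same weighting could in principle be applied to whole-frame features or to per-instance (proposal) features, which is the distinction the abstract draws between frame-level and instance-level fusion.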

research
05/24/2023

DynStatF: An Efficient Feature Fusion Strategy for LiDAR 3D Object Detection

Augmenting LiDAR input with multiple previous frames provides richer sem...
research
01/10/2023

Video Semantic Segmentation with Inter-Frame Feature Fusion and Inner-Frame Feature Refinement

Video semantic segmentation aims to generate accurate semantic maps for ...
research
10/02/2021

Light Field Saliency Detection with Dual Local Graph Learning and Reciprocative Guidance

The application of light field data in salient object detection is beco...
research
03/24/2020

RN-VID: A Feature Fusion Architecture for Video Object Detection

Consecutive frames in a video are highly redundant. Therefore, to perfor...
research
06/18/2021

Multi-Granularity Network with Modal Attention for Dense Affective Understanding

Video affective understanding, which aims to predict the evoked expressi...
research
08/13/2023

Video Captioning with Aggregated Features Based on Dual Graphs and Gated Fusion

The application of video captioning models aims at translating the conte...
research
12/05/2022

BiSTNet: Semantic Image Prior Guided Bidirectional Temporal Feature Fusion for Deep Exemplar-based Video Colorization

How to effectively explore the colors of reference exemplars and propaga...