Towards Robust Video Instance Segmentation with Temporal-Aware Transformer

01/20/2023
by Zhenghao Zhang, et al.

Most existing transformer-based video instance segmentation methods extract per-frame features independently, which makes it difficult to handle appearance deformation. In this paper, we observe that temporal information is also important, and we propose TAFormer, which aggregates spatio-temporal features in both the transformer encoder and decoder. Specifically, in the transformer encoder, we propose a novel spatio-temporal joint multi-scale deformable attention module that dynamically integrates spatial and temporal information to obtain enriched spatio-temporal features. In the transformer decoder, we introduce a temporal self-attention module that enhances the frame-level box queries with temporal relations. Moreover, TAFormer adopts an instance-level contrastive loss to increase the discriminability of instance query embeddings, which reduces the tracking errors caused by visually similar instances. Experimental results show that TAFormer effectively leverages spatial and temporal information to obtain context-aware feature representations and outperforms state-of-the-art methods.
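The abstract does not give the exact form of the instance-level contrastive loss. As an illustration of the general idea only, the following is a minimal sketch of an InfoNCE-style loss over instance query embeddings collected across frames: embeddings that share an instance ID are pulled together, all others are pushed apart. The function name `instance_contrastive_loss`, the `temperature` value, and the toy embeddings are assumptions made for this example and are not taken from the paper.

```python
import math


def _normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def instance_contrastive_loss(embeddings, instance_ids, temperature=0.1):
    """InfoNCE-style loss (illustrative, not the paper's exact formulation).

    embeddings   : list of query embedding vectors, one per (frame, instance)
    instance_ids : list of instance identities, aligned with `embeddings`
    temperature  : hypothetical softmax temperature chosen for this sketch
    """
    z = [_normalize(e) for e in embeddings]
    n = len(z)
    losses = []
    for i in range(n):
        positives = [j for j in range(n)
                     if j != i and instance_ids[j] == instance_ids[i]]
        if not positives:
            continue  # an anchor with no positive contributes nothing
        # Cosine similarities to every other embedding, scaled by temperature.
        logits = [_dot(z[i], z[j]) / temperature for j in range(n) if j != i]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        for j in positives:
            losses.append(log_denom - _dot(z[i], z[j]) / temperature)
    return sum(losses) / len(losses) if losses else 0.0


# Toy example: two instances, each observed in two frames.
queries = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
consistent = instance_contrastive_loss(queries, [0, 0, 1, 1])
inconsistent = instance_contrastive_loss(queries, [0, 1, 0, 1])
```

In this toy setup the loss is lower when embeddings of the same instance are close (`consistent`) than when the identity labels pair up dissimilar embeddings (`inconsistent`), which is the property that makes such a loss useful for separating visually similar instances during tracking.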


