FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting

by Rui Liu, et al.

The Transformer, a strong and flexible architecture for modelling long-range relations, has been widely explored in vision tasks. However, when applied to video inpainting, which requires fine-grained representation, existing methods still yield blurry edges in fine detail because of hard patch splitting. Here we tackle this problem by proposing FuseFormer, a Transformer model designed for video inpainting via fine-grained feature fusion based on novel Soft Split and Soft Composition operations. The soft split divides a feature map into many patches with a given overlapping interval. Conversely, the soft composition stitches the patches back into a whole feature map, summing the pixels in overlapping regions. These two modules are first used in tokenization before the Transformer layers and in de-tokenization after the Transformer layers, for effective mapping between tokens and features. Sub-patch level information interaction is thereby enabled, giving more effective feature propagation between neighboring patches and resulting in vivid synthesized content for hole regions in videos. Moreover, in FuseFormer, we insert the soft composition and soft split into the feed-forward network, enabling the 1D linear layers to model 2D structure and further enhancing the sub-patch level feature fusion ability. In both quantitative and qualitative evaluations, our proposed FuseFormer surpasses state-of-the-art methods. We also conduct detailed analysis to examine its superiority.
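
The soft split and soft composition described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single-channel 2D feature map stored as a list of rows, no padding, and hypothetical function names (`soft_split`, `soft_composition`) and parameters (`k` for patch size, `s` for stride, with `s < k` so that neighboring patches overlap).

```python
def soft_split(feature, k, s):
    """Cut a 2D map into flattened k x k patches taken every s pixels.

    With s < k, adjacent patches share pixels (the overlapping interval).
    Assumption: no padding, so only fully contained patches are kept.
    """
    h, w = len(feature), len(feature[0])
    patches = []
    for top in range(0, h - k + 1, s):
        for left in range(0, w - k + 1, s):
            patches.append(
                [feature[top + i][left + j] for i in range(k) for j in range(k)]
            )
    return patches


def soft_composition(patches, h, w, k, s):
    """Stitch patches back into an h x w map, summing overlapping pixels."""
    out = [[0.0] * w for _ in range(h)]
    idx = 0
    # Visit patch positions in the same order soft_split produced them.
    for top in range(0, h - k + 1, s):
        for left in range(0, w - k + 1, s):
            patch = patches[idx]
            idx += 1
            for i in range(k):
                for j in range(k):
                    out[top + i][left + j] += patch[i * k + j]
    return out


# Usage: a 4 x 4 map of ones, 2 x 2 patches, stride 1.
feature = [[1.0] * 4 for _ in range(4)]
patches = soft_split(feature, k=2, s=1)
restored = soft_composition(patches, h=4, w=4, k=2, s=1)
```

Because overlapping pixels are summed, each value in `restored` equals the number of patches covering that pixel: corners are covered once, edge pixels twice, and interior pixels four times. In a full model the patches would pass through Transformer layers between the two steps, so the summation is where information from neighboring tokens is fused at the sub-patch level.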



ViT-FOD: A Vision Transformer based Fine-grained Object Discriminator

Recently, several Vision Transformer (ViT) based methods have been propo...

TransFG: A Transformer Architecture for Fine-grained Recognition

Fine-grained visual classification (FGVC) which aims at recognizing obje...

Multi-direction and Multi-scale Pyramid in Transformer for Video-based Pedestrian Retrieval

In video surveillance, pedestrian retrieval (also called person re-ident...

Coarse-to-Fine Vision Transformer

Vision Transformers (ViT) have made many breakthroughs in computer visio...

Exploring and Improving Mobile Level Vision Transformers

We study the vision transformer structure in the mobile level in this pa...

DeViT: Deformed Vision Transformers in Video Inpainting

This paper proposes a novel video inpainting method. We make three main ...

The Piano Inpainting Application

Autoregressive models are now capable of generating high-quality minute-...
