Language-based Video Editing via Multi-Modal Multi-Level Transformer

04/02/2021
by Tsu-Jui Fu, et al.

Video editing tools are widely used for digital design. Although demand for these tools is high, the prior knowledge they require makes it difficult for novices to get started. Systems that follow natural language instructions to perform automatic editing would significantly improve accessibility. This paper introduces the language-based video editing (LBVE) task, in which a model edits a source video into a target video under the guidance of a text instruction. LBVE has two defining features: 1) the scenario of the source video is preserved rather than replaced by a completely different video; 2) the semantics are presented differently in the target video, and all changes are controlled by the given instruction. We propose a Multi-Modal Multi-Level Transformer (M^3L-Transformer) to carry out LBVE. The M^3L-Transformer dynamically learns the correspondence between video perception and language semantics at different levels, which benefits both video understanding and video frame synthesis. We build three new datasets for evaluation: two diagnostic datasets and one built from natural videos with human-labeled text. Extensive experimental results show that the M^3L-Transformer is effective for video editing and that LBVE can open a new direction in vision-and-language research.


