Foundation Model for Endoscopy Video Analysis via Large-scale Self-supervised Pre-train

by   Zhao Wang, et al.

Foundation models have exhibited remarkable success in various applications, such as disease diagnosis and text report generation. To date, a foundation model for endoscopic video analysis is still lacking. In this paper, we propose Endo-FM, a foundation model specifically developed using massive endoscopic video data. First, we build a video transformer, which captures both local and global long-range dependencies across spatial and temporal dimensions. Second, we pre-train our transformer model using global and local views via a self-supervised manner, aiming to make it robust to spatial-temporal variations and discriminative across different scenes. To develop the foundation model, we construct a large-scale endoscopy video dataset by combining 9 publicly available datasets and a privately collected dataset from Baoshan Branch of Renji Hospital in Shanghai, China. Our dataset overall consists of over 33K video clips with up to 5 million frames, encompassing various protocols, target organs, and disease types. Our pre-trained Endo-FM can be easily adopted for a given downtream task via fine-tuning by serving as the backbone. With experiments on 3 different types of downstream tasks, including classification, segmentation, and detection, our Endo-FM surpasses the current state-of-the-art self-supervised pre-training and adapter-based transfer learning methods by a significant margin, such as VCL (3.1 segmentation, and 5.5 classification, 9.6 datasets, and models are released at


page 1

page 2

page 3

page 4


DiT: Self-supervised Pre-training for Document Image Transformer

Image Transformer has recently achieved significant progress for natural...

TaCA: Upgrading Your Visual Foundation Model with Task-agnostic Compatible Adapter

Visual foundation models like CLIP excel in learning feature representat...

Hierarchically Self-Supervised Transformer for Human Skeleton Representation Learning

Despite the success of fully-supervised human skeleton sequence modeling...

Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning

Recently, large-scale pre-trained language-image models like CLIP have s...

WinDB: HMD-free and Distortion-free Panoptic Video Fixation Learning

To date, the widely-adopted way to perform fixation collection in panopt...

Self-supervised Video Transformer

In this paper, we propose self-supervised training for video transformer...

EarthPT: a foundation model for Earth Observation

We introduce EarthPT – an Earth Observation (EO) pretrained transformer....

Please sign up or login with your details

Forgot password? Click here to reset