Exploring Data Redundancy in Real-world Image Classification through Data Selection

06/25/2023
by   Zhenyu Tang, et al.

Deep learning models often require large amounts of training data, which drives up costs. The problem is particularly acute in medical imaging, where distributed data must be gathered for centralized training and obtaining quality labels remains tedious. Many methods have been proposed to address this issue under various training paradigms, e.g., continual learning, active learning, and federated learning, each of which involves some form of data valuation. However, existing methods are either overly intuitive or limited to common clean/toy datasets in their experiments. In this work, we present two data valuation metrics, based on Synaptic Intelligence and gradient norms respectively, to study the redundancy in real-world image data. Novel online and offline data selection algorithms are then proposed via clustering and grouping based on the examined data values. Our online approach effectively evaluates data using layerwise model parameter updates and gradients in each epoch, and can accelerate model training with fewer epochs and a subset (e.g., 19 […] accuracy in a variety of datasets. It also extends to offline coreset construction, producing subsets of only 18 […] the proposed adaptive data selection and coreset computation are available (https://github.com/ZhenyuTANG2023/data_selection).
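To make the gradient-norm idea concrete, here is a minimal sketch of scoring samples by the norm of their loss gradient and keeping the highest-scoring fraction. It is not the authors' algorithm: the linear softmax classifier, the top-k selection rule, and the names `grad_norm_score`, `select_subset`, and `keep_fraction` are all assumptions made for illustration, and the Synaptic Intelligence metric and the clustering/grouping steps are not reproduced here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_norm_score(weights, x, label):
    """Norm of the cross-entropy gradient w.r.t. the weights of a linear
    softmax classifier (an assumed stand-in model, not the paper's network).

    With logits = W x, dL/dW = (p - onehot(label)) x^T, so the Frobenius
    norm factorises as ||p - onehot(label)|| * ||x||.
    """
    logits = [sum(w * v for w, v in zip(row, x)) for row in weights]
    probs = softmax(logits)
    err = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
    err_norm = math.sqrt(sum(e * e for e in err))
    x_norm = math.sqrt(sum(v * v for v in x))
    return err_norm * x_norm


def select_subset(weights, samples, keep_fraction):
    """Return (sorted indices of kept samples, all scores).

    Hypothetical selection rule: keep the samples with the largest
    gradient norms, i.e. those the current model fits worst.
    """
    scores = [grad_norm_score(weights, x, y) for x, y in samples]
    n_keep = max(1, round(keep_fraction * len(samples)))
    ranked = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep]), scores


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]           # two classes, two features
    data = [([5.0, 0.0], 0),               # confidently correct
            ([5.0, 0.0], 1),               # confidently wrong
            ([0.1, 0.1], 0)]               # uncertain, small input
    kept, scores = select_subset(W, data, keep_fraction=1 / 3)
    print(kept, [round(s, 3) for s in scores])
```

In this toy setup the confidently misclassified sample receives the largest score, while the confidently correct one receives a score near zero and would be treated as redundant. An online variant would recompute the scores each epoch as the weights change.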

research
03/29/2022

Online Continual Learning on a Contaminated Data Stream with Blurry Task Boundaries

Learning under a continuously changing data distribution with incorrect ...
research
09/07/2023

Privacy-preserving Continual Federated Clustering via Adaptive Resonance Theory

With the increasing importance of data privacy protection, various priva...
research
04/07/2023

Asynchronous Federated Continual Learning

The standard class-incremental continual learning setting assumes a set ...
research
07/15/2022

Suppressing Poisoning Attacks on Federated Learning for Medical Imaging

Collaboration among multiple data-owning entities (e.g., hospitals) can ...
research
05/19/2022

FedILC: Weighted Geometric Mean and Invariant Gradient Covariance for Federated Learning on Non-IID Data

Federated learning is a distributed machine learning approach which enab...
research
06/07/2021

Continual Active Learning for Efficient Adaptation of Machine Learning Models to Changing Image Acquisition

Imaging in clinical routine is subject to changing scanner protocols, ha...
research
08/14/2023

CTP: Towards Vision-Language Continual Pretraining via Compatible Momentum Contrast and Topology Preservation

Vision-Language Pretraining (VLP) has shown impressive results on divers...
