Dataset Bias Mitigation Through Analysis of CNN Training Scores

by Ekberjan Derman, et al.

Training datasets are crucial for convolutional neural network (CNN) based algorithms and directly affect their overall performance. A well-structured dataset with a minimal level of bias is therefore always desirable. In this paper, we propose a novel, domain-independent approach, called score-based resampling (SBR), to locate the under-represented samples of the original training dataset based on the model prediction scores obtained on that training set. In our method, once the CNN model is trained, we use the same model to run inference on its own training samples and obtain prediction scores. Based on the distance between each predicted score and its ground-truth label, we identify the samples whose predictions are far from their ground truth and augment them in the original training set. The temperature term of the Sigmoid function is decreased to better differentiate the scores. For experimental evaluation, we selected one Kaggle dataset for gender classification. We first used a CNN-based classifier with a relatively standard structure, trained it on the training images, and evaluated it on the provided validation samples of the original dataset. Then, we assessed it on an entirely new test dataset consisting of light male, light female, dark male, and dark female groups. The obtained accuracies varied across these groups, revealing the existence of categorical bias against certain groups in the original dataset. Subsequently, we retrained the model after resampling the training set with our proposed approach. We also compared our method with a previously proposed variational autoencoder (VAE) based algorithm. The obtained results confirmed the validity of our proposed method for identifying under-represented samples in the original dataset and thereby decreasing categorical bias in classifying certain groups. Although tested on gender classification, the proposed algorithm can be used to investigate the dataset structure of any CNN-based task.
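The selection step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the temperature is assumed to enter as sigmoid(z / T), the distance is taken as the absolute difference between score and binary label, the threshold value is arbitrary, and "augmenting" is approximated by simply repeating the selected indices. All function names (`sigmoid_with_temperature`, `select_far_samples`, `resample_indices`) are hypothetical.

```python
import numpy as np


def sigmoid_with_temperature(logits, temperature=1.0):
    """Sigmoid with a temperature term, assumed here to be sigmoid(z / T)."""
    z = np.asarray(logits, dtype=float) / temperature
    return 1.0 / (1.0 + np.exp(-z))


def select_far_samples(logits, labels, temperature=1.0, threshold=0.5):
    """Return indices of training samples whose score is far from the label.

    The distance is assumed to be |score - label| for binary labels in {0, 1};
    the threshold is an illustrative choice, not a value from the paper.
    """
    scores = sigmoid_with_temperature(logits, temperature)
    distances = np.abs(scores - np.asarray(labels, dtype=float))
    return np.flatnonzero(distances > threshold), distances


def resample_indices(n_samples, far_indices, extra_copies=1):
    """Build a resampled index list: every sample once, far samples repeated.

    Plain repetition stands in for whatever augmentation is actually applied.
    """
    base = np.arange(n_samples)
    repeated = np.repeat(np.asarray(far_indices, dtype=int), extra_copies)
    return np.concatenate([base, repeated])


if __name__ == "__main__":
    # Hypothetical logits produced by a trained model on its own training set.
    logits = [4.0, -3.0, 0.2, -0.1, 2.5]
    labels = [1, 0, 0, 1, 0]
    far, dist = select_far_samples(logits, labels, temperature=0.5)
    print("far samples:", far.tolist())
    print("resampled indices:", resample_indices(len(labels), far).tolist())
```

In a real pipeline the logits would come from running the trained CNN over its training images, and the resampled indices would feed a data loader that applies image augmentation to the repeated samples before retraining.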




