L_0-ARM: Network Sparsification via Stochastic Binary Optimization

by Yang Li, et al.

We consider network sparsification as an L_0-norm regularized binary optimization problem, where each unit of a neural network (e.g., a weight, neuron, or channel) is equipped with a stochastic binary gate whose parameters are jointly optimized with the original network parameters. We investigate the Augment-Reinforce-Merge (ARM) estimator, a recently proposed unbiased gradient estimator, for this binary optimization problem. Compared to the hard concrete gradient estimator of Louizos et al., ARM achieves superior pruning rates while retaining almost the same accuracies as the baseline models. Like the hard concrete estimator, ARM also enables conditional computation during model training, but with improved effectiveness due to its exact binary stochasticity. Thanks to the flexibility of ARM, many smooth or non-smooth parametric functions, such as the scaled sigmoid or hard sigmoid, can be used to parameterize this binary optimization problem while preserving the unbiasedness of the estimator; in contrast, the hard concrete estimator must rely on the hard sigmoid function to achieve conditional computation and thus accelerated training. Extensive experiments on multiple public datasets demonstrate state-of-the-art pruning rates with almost the same accuracies as the baseline models. The resulting algorithm, L_0-ARM, sparsifies the Wide-ResNet models on CIFAR-10 and CIFAR-100, whereas the hard concrete estimator cannot. We plan to release our code to facilitate research in this area.
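The core computational step the abstract describes is estimating the gradient of an expectation over Bernoulli gates without biased relaxations. As a minimal sketch (not the authors' released code), the single-sample ARM estimator for a gate parameterized by logits `phi`, with gate probability `sigmoid(phi)`, can be written as follows; the function names `sigmoid` and `arm_gradient` are illustrative, and `f` stands in for the loss evaluated with a given binary gate configuration:

```python
import numpy as np

def sigmoid(x):
    """Logistic function; gate probability is sigmoid(phi)."""
    return 1.0 / (1.0 + np.exp(-x))

def arm_gradient(f, phi, rng):
    """Single-sample ARM estimate of d/dphi E_{z ~ Bernoulli(sigmoid(phi))}[f(z)].

    Draws one uniform vector u and evaluates f at two antithetically
    coupled binary gate configurations, so the estimate is unbiased
    while the gates themselves stay exactly binary.
    """
    u = rng.uniform(size=np.shape(phi))
    z_anti = (u > sigmoid(-phi)).astype(float)   # antithetic gate sample
    z_main = (u < sigmoid(phi)).astype(float)    # primary gate sample
    return (f(z_anti) - f(z_main)) * (u - 0.5)
```

Averaged over many draws, this converges to the exact gradient; for example, with `f(z) = z` the true gradient is `sigmoid(phi) * (1 - sigmoid(phi))`, which a Monte Carlo average of `arm_gradient` recovers. In the pruning setting, `f` would be the training loss plus the L_0 penalty, evaluated with the sampled gates masking the corresponding weights, neurons, or channels.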

