Coresets for Scalable Bayesian Logistic Regression

by Jonathan H. Huggins, et al.

The use of Bayesian methods in large-scale data settings is attractive because of the rich hierarchical models, uncertainty quantification, and prior specification they provide. Standard Bayesian inference algorithms are computationally expensive, however, making their direct application to large datasets difficult or infeasible. Recent work on scaling Bayesian inference has focused on modifying the underlying algorithms to, for example, use only a random data subsample at each iteration. We leverage the insight that data is often redundant to instead obtain a weighted subset of the data (called a coreset) that is much smaller than the original dataset. We can then use this small coreset in any number of existing posterior inference algorithms without modification. In this paper, we develop an efficient coreset construction algorithm for Bayesian logistic regression models. We provide theoretical guarantees on the size and approximation quality of the coreset -- both for fixed, known datasets, and in expectation for a wide class of data generative models. Crucially, the proposed approach also permits efficient construction of the coreset in both streaming and parallel settings, with minimal additional effort. We demonstrate the efficacy of our approach on a number of synthetic and real-world datasets, and find that, in practice, the size of the coreset is independent of the original dataset size. Furthermore, constructing the coreset takes a negligible amount of time compared to that required to run MCMC on it.
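To make the idea of a weighted subset concrete, here is a minimal sketch, not the construction developed in the paper. It assumes labels in {-1, +1} and draws the coreset by importance sampling from a vector of sampling probabilities `probs`. The function names (`build_coreset`, `weighted_log_lik`) are hypothetical, and the uniform probabilities in the usage lines are only a stand-in, since the abstract does not state how the paper chooses them. The point is that the weighted log-likelihood over the small subset can be handed to an existing inference routine in place of the full-data log-likelihood.

```python
import numpy as np

def log_lik_terms(theta, X, y):
    """Per-observation logistic log-likelihood, with labels y in {-1, +1}."""
    # log sigmoid(y * x.theta) = -log(1 + exp(-y * x.theta)), computed stably
    return -np.logaddexp(0.0, -y * (X @ theta))

def build_coreset(X, y, probs, M, rng):
    """Draw M indices with replacement according to probs (assumed given).

    Returns the distinct sampled points and weights count / (M * prob),
    so the weighted sum is an unbiased estimate of the full-data sum.
    """
    counts = rng.multinomial(M, probs)
    idx = np.flatnonzero(counts)
    weights = counts[idx] / (M * probs[idx])
    return X[idx], y[idx], weights

def weighted_log_lik(theta, Xc, yc, weights):
    """Coreset approximation to the full-data log-likelihood."""
    return float(np.dot(weights, log_lik_terms(theta, Xc, yc)))

# Usage on synthetic data; uniform probs are a placeholder assumption.
rng = np.random.default_rng(0)
N, D, M = 10_000, 3, 500
X = rng.normal(size=(N, D))
theta_true = np.array([1.0, -0.5, 0.25])
y = np.where(rng.random(N) < 1.0 / (1.0 + np.exp(-(X @ theta_true))), 1.0, -1.0)

probs = np.full(N, 1.0 / N)
Xc, yc, w = build_coreset(X, y, probs, M, rng)

full = float(log_lik_terms(theta_true, X, y).sum())
approx = weighted_log_lik(theta_true, Xc, yc, w)
print(f"coreset size: {len(w)}, full: {full:.1f}, coreset: {approx:.1f}")
```

Because the sampler or optimizer only sees `weighted_log_lik`, the downstream inference code stays unchanged; only the data it is given shrinks from N points to at most M.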




PASS-GLM: polynomial approximate sufficient statistics for scalable Bayesian GLM inference

Generalized linear models (GLMs) -- such as logistic regression, Poisson...

Using bagged posteriors for robust inference and model criticism

Standard Bayesian inference is known to be sensitive to model misspecifi...

Patterns of Scalable Bayesian Inference

Datasets are growing not just in size but in complexity, creating a dema...

Scalable Bayesian inference for time series via divide-and-conquer

Bayesian computational algorithms tend to scale poorly as data size incr...

Automated Scalable Bayesian Inference via Hilbert Coresets

The automation of posterior inference in Bayesian data analysis has enab...

Parallelizing MCMC with Random Partition Trees

The modern scale of data has brought new challenges to Bayesian inferenc...

Probably the Best Itemsets

One of the main current challenges in itemset mining is to discover a sm...
