Subdata selection for big data regression: an improved approach

04/29/2023
by Vasilis Chasiotis, et al.

In the big data era, researchers face a series of problems. Even standard methodologies, such as linear regression, can become difficult or problematic with huge volumes of data. Traditional approaches to regression on big datasets may suffer from the large sample size, either because they involve inverting huge data matrices or because the data cannot fit in memory. Proposed remedies are based on selecting representative subdata on which to run the regression. Existing approaches select the subdata using information criteria and/or properties of orthogonal arrays. In the present paper we improve on existing algorithms by providing a new algorithm based on the D-optimality approach. We provide simulation evidence of its performance. Evidence on the parameters of the proposed algorithm is also provided in order to clarify the trade-offs between execution time and information gain. Real data applications are also provided.
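
To make the idea of subdata selection under a D-optimality criterion concrete, here is a minimal Python sketch. It is not the algorithm proposed in the paper, whose details are not given in this abstract. It assumes a simple extreme-value heuristic (for each covariate, keep the rows with the smallest and largest values) and compares the log-determinant of the resulting information matrix with that of a uniform random subsample of the same size. The function names, the simulated data, and the values of n, p and k are all hypothetical.

```python
import numpy as np


def log_det_information(X):
    """Log-determinant of Z^T Z, where Z = [1, X] adds an intercept column."""
    Z = np.column_stack([np.ones(len(X)), X])
    sign, logdet = np.linalg.slogdet(Z.T @ Z)
    return logdet if sign > 0 else -np.inf


def select_extremes(X, k):
    """Illustrative heuristic (an assumption, not the paper's method):
    for each covariate in turn, take the r smallest and r largest values
    among the rows not yet selected, with r = k // (2 * p)."""
    n, p = X.shape
    r = k // (2 * p)
    available = np.ones(n, dtype=bool)
    chosen = []
    for j in range(p):
        idx = np.flatnonzero(available)
        order = idx[np.argsort(X[idx, j])]  # available rows sorted by column j
        picked = np.concatenate([order[:r], order[-r:]])
        chosen.append(picked)
        available[picked] = False
    return np.concatenate(chosen)


# Simulated data; every value below is hypothetical.
rng = np.random.default_rng(0)
n, p, k = 100_000, 5, 200
X = rng.standard_normal((n, p))
beta = np.arange(1, p + 2, dtype=float)  # intercept followed by p slopes
y = beta[0] + X @ beta[1:] + rng.standard_normal(n)

sub = select_extremes(X, k)                 # heuristic subdata
rand = rng.choice(n, size=k, replace=False)  # uniform random subsample

# D-optimality compares determinants of the information matrix.
logdet_sub = log_det_information(X[sub])
logdet_rand = log_det_information(X[rand])

# Ordinary least squares on the selected subdata only.
Z_sub = np.column_stack([np.ones(k), X[sub]])
beta_hat = np.linalg.lstsq(Z_sub, y[sub], rcond=None)[0]
```

Because the rows are chosen from the covariates alone, least squares on the subdata remains unbiased for the coefficients. A larger determinant of the information matrix corresponds to a smaller generalized variance of the estimates, which is the quantity a D-optimality criterion targets.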


research
05/02/2023

On the selection of optimal subdata for big data regression based on leverage scores

Regression can be really difficult in case of big datasets, since we hav...
research
06/08/2020

A Survey of Bayesian Statistical Approaches for Big Data

The modern era is characterised as an era of information or Big Data. Th...
research
05/30/2021

Orthogonal Subsampling for Big Data Linear Regression

The dramatic growth of big datasets presents a new challenge to data sto...
research
11/04/2021

Auto Tuning of Hadoop and Spark parameters

Data of the order of terabytes, petabytes, or beyond is known as Big Dat...
research
02/11/2020

Big Data and model-based survey sampling

Big Data are huge amounts of digital information that are automatically ...
research
08/12/2020

Sampling Based Approximate Skyline Calculation on Big Data

The existing algorithms for processing skyline queries cannot adapt to b...
research
10/02/2017

Online and Distributed Robust Regressions under Adversarial Data Corruption

In today's era of big data, robust least-squares regression becomes a mo...
