Analyzing the Fine Structure of Distributions

08/15/2019
by Michael C. Thrun, et al.

One aim of data mining is the identification of interesting structures in data. Basic properties of the empirical distribution, such as skewness and possible clipping (i.e., hard limits in value ranges), need to be assessed. Of particular interest is the question of whether the data originate from one process or contain subsets related to different states of the data-producing process. Data visualization tools should deliver a sensitive picture of the univariate probability density function (PDF) of each feature. Visualization tools for PDFs are typically kernel density estimates and range from the classical histogram to modern tools like bean or violin plots. Conventional methods have difficulty visualizing the PDF of uniform, multimodal, skewed, and clipped data if the density estimation parameters remain at their default settings. As a consequence, a new visualization tool called the Mirrored Density plot (MD plot) is proposed, which is specifically designed to discover interesting structures in continuous features. The MD plot does not require any adjustment of density estimation parameters, which makes it compelling for non-experts. The visualization tools are evaluated in comparison to statistical tests for the typical challenges of exploratory distribution analysis. The results are presented on bimodal Gaussian and skewed distributions as well as several features with published PDFs. In an exploratory data analysis of 12 features describing quarterly financial statements, where statistical testing becomes a demanding task, only the MD plots can identify the structure of their PDFs. Overall, the MD plot can outperform the methods mentioned above.
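
The sensitivity to density estimation parameters described in the abstract can be illustrated with a minimal sketch. It assumes a plain Gaussian kernel density estimate, which is swapped in here for illustration and is not claimed to be the estimator behind the MD plot. The sample, the two bandwidth values, and the helper names (`gaussian_kde`, `count_modes`, `mirrored_outline`) are all hypothetical. The sketch evaluates the same bimodal sample with a narrow and a wide bandwidth, counts the modes each estimate shows, and mirrors a density around zero to obtain a violin-like outline.

```python
import math
import random


def gaussian_kde(data, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `data` at each grid point."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)
        for x in grid
    ]


def count_modes(density):
    """Count strict local maxima in a sampled density curve."""
    return sum(
        1
        for i in range(1, len(density) - 1)
        if density[i - 1] < density[i] > density[i + 1]
    )


def mirrored_outline(density):
    """Mirror the density around zero: (left edge, right edge) per grid point."""
    return [(-d, d) for d in density]


random.seed(0)
# Hypothetical bimodal sample: two Gaussians centred at -2 and +2.
data = [random.gauss(-2.0, 0.5) for _ in range(300)]
data += [random.gauss(2.0, 0.5) for _ in range(300)]

# Evaluation grid from -5 to 5 in steps of 0.1.
grid = [-5.0 + 0.1 * i for i in range(101)]

# Illustrative bandwidths (assumptions, not defaults of any particular library).
narrow = gaussian_kde(data, grid, bandwidth=0.5)
wide = gaussian_kde(data, grid, bandwidth=3.0)

print("modes with narrow bandwidth:", count_modes(narrow))
print("modes with wide bandwidth:", count_modes(wide))

# Outline of a mirrored (violin-like) shape for the narrow estimate.
outline = mirrored_outline(narrow)
```

In this sketch, an overly wide bandwidth smooths the two components into a single bump, which is the kind of structure loss that a fixed parameter setting can cause on multimodal data.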

research
02/20/2013

Estimating Continuous Distributions in Bayesian Classifiers

When modeling a probability distribution with a Bayesian network, we are...
research
03/03/2022

Statistical visualisation for tidy and geospatial data in R via kernel smoothing methods in the eks package

Kernel smoothers are essential tools for data analysis due to their abil...
research
11/25/2022

Copula Density Neural Estimation

Probability density estimation from observed data constitutes a central ...
research
07/19/2022

AccuStripes: Adaptive Binning for the Visual Comparison of Univariate Data Distributions

Understanding and comparing distributions of data (e.g., regarding their...
research
12/24/2018

bigMap: Big Data Mapping with Parallelized t-SNE

We introduce an improved unsupervised clustering protocol specially suit...
research
10/15/2018

Bounding Entities within Dense Subtensors

Group-based fraud detection is a promising methodology to catch frauds o...
