Ranking the information content of distance measures

04/30/2021
by Aldo Glielmo, et al.

Real-world data typically contain many features, which are often heterogeneous in nature, relevance, and units of measure. When assessing the similarity between data points, one can build various distance measures from subsets of these features. Using as few features as possible while still retaining sufficient information about the system is crucial in many statistical learning approaches, particularly when data are sparse. We introduce a statistical test that assesses the relative information retained by two different distance measures and determines whether they are equivalent, independent, or whether one is more informative than the other. This in turn allows the most informative distance measure to be selected from a pool of candidates. The approach is applied to find the most relevant policy variables for controlling the Covid-19 epidemic and to find compact yet informative representations of atomic structures, but its potential applications span many branches of science.
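
The abstract does not spell out how the test is computed. As a rough illustration only, the sketch below assumes a neighbour-rank comparison: for each point, take its nearest neighbour under distance A and look up that neighbour's rank under distance B. A low average rank suggests that A retains the information in B. The function names `pairwise_ranks` and `rank_imbalance`, the Euclidean distance, and the 2/N normalisation are assumptions made for this example, not details taken from the text above.

```python
import numpy as np

def pairwise_ranks(X):
    """ranks[i, j] = rank of point j among the neighbours of point i (1 = nearest).

    Assumes Euclidean distance on the rows of X (an illustrative choice).
    """
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # a point is never its own neighbour
    order = np.argsort(d, axis=1)          # neighbour indices, nearest first
    n = len(X)
    ranks = np.empty_like(order)
    rows = np.arange(n)[:, None]
    ranks[rows, order] = np.arange(1, n + 1)[None, :]
    return ranks

def rank_imbalance(XA, XB):
    """Hypothetical score: scaled mean rank in B of each point's nearest neighbour in A.

    Close to 0 when A predicts B's neighbourhoods, close to 1 when it does not.
    """
    rA = pairwise_ranks(XA)
    rB = pairwise_ranks(XB)
    n = len(XA)
    nearest_in_A = np.argmin(rA, axis=1)   # index holding rank 1 in each row
    return 2.0 / n * rB[np.arange(n), nearest_in_A].mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    full, subset = X, X[:, :1]             # all features vs. the first feature only
    print("full -> subset:", rank_imbalance(full, subset))
    print("subset -> full:", rank_imbalance(subset, full))
```

Under these assumptions, comparing the two directions gives the four cases the abstract lists: both scores low (equivalent), both high (independent), or one low and one high (one measure more informative than the other).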


