Similarity of Objects and the Meaning of Words

by Rudi Cilibrasi, et al.

We survey the emerging area of compression-based, parameter-free, similarity distance measures useful in data-mining, pattern recognition, learning and automatic semantics extraction. Given a family of distances on a set of objects, a distance is universal up to a certain precision for that family if it minorizes every distance in the family between every two objects in the set, up to the stated precision (we do not require the universal distance to be an element of the family). We consider similarity distances for two types of objects: literal objects that as such contain all of their meaning, like genomes or books, and names for objects. The latter may have literal embodiments like the first type, but may also be abstract like "red" or "christianity." For the first type we consider a family of computable distance measures corresponding to parameters expressing similarity according to particular features between pairs of literal objects. For the second type we consider similarity distances generated by web users corresponding to particular semantic relations between the (names for) the designated objects. For both families we give universal similarity distance measures, incorporating all particular distance measures in the family. In the first case the universal distance is based on compression and in the second case it is based on Google page counts related to search terms. In both cases experiments on a massive scale give evidence of the viability of the approaches.
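To make the two kinds of distance concrete, here is a minimal Python sketch of the formulas commonly given for the normalized compression distance (for literal objects) and the normalized Google distance (for names). The abstract itself does not spell these formulas out, so treat the code as an illustration under assumptions: zlib stands in for whatever compressor one would actually use, the function names are hypothetical, and the page counts fx, fy, fxy and the index size n are supplied by the caller rather than fetched from any search engine.

```python
import math
import zlib


def compressed_size(data: bytes) -> int:
    # Length of the zlib-compressed input; zlib is an assumed stand-in
    # for a real-world compressor, not a choice taken from the abstract.
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    # Normalized compression distance:
    # (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def ngd(fx: int, fy: int, fxy: int, n: int) -> float:
    # Normalized Google distance from page counts:
    # fx, fy = pages containing each term, fxy = pages containing both,
    # n = assumed total number of indexed pages (all must be positive).
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))
```

With a real compressor the first value is only an approximation of the underlying universal distance, so small positive results for identical inputs are expected. The page counts passed to ngd would in practice come from search queries; any numbers used to try it out are made up for illustration.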




Information Distance in Multiples

Information distance is a parameter-free similarity measure based on com...

A New Family of Near-metrics for Universal Similarity

We propose a family of near-metrics based on local graph diffusion to ca...

The similarity metric

A new class of distances appropriate for measuring similarity relations ...

Generalized quantum similarity learning

The similarity between objects is significant in a broad range of areas....

Web Similarity

Normalized web distance (NWD) is a similarity or normalized semantic dis...

Zero-error dissimilarity based classifiers

We consider general non-Euclidean distance measures between real world o...

Tensor SimRank for Heterogeneous Information Networks

We propose a generalization of SimRank similarity measure for heterogene...
