On synthetic data with predetermined subject partitioning and cluster profiling, and partially specified categorical variable marginal correlation structure

09/04/2017
by Michail Papathomas, et al.

A standard approach for assessing the performance of partition or mixture models is to create synthetic data sets with a pre-specified clustering structure, and evaluate how well the model reveals this structure. Typically, subjects are assigned to different clusters, and variable observations are simulated so that subjects within the same cluster have similar profiles, allowing for some variability. In this manuscript, we focus on observations from categorical variables. First, theoretical results are derived to explore the dependence structure between the variables, in relation to the clustering structure for the subjects. Then, a novel approach is proposed that allows partial control over the marginal correlation structure of the variables. Practical examples are shown and additional theoretical results are derived. To illustrate our methods, we focus on simulating observations that emulate Single Nucleotide Polymorphisms. We compare a synthetic dataset to a real one, to demonstrate the extent to which the correlation structure for the variables is controlled.
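
The standard simulation setup described above can be illustrated with a minimal sketch. It is not the approach proposed in the manuscript: it only assigns subjects to predetermined clusters and draws each categorical variable independently given the cluster, so it offers no control over the marginal correlation structure. The function names, cluster sizes, and profile probabilities are hypothetical, chosen so that three-level variables loosely resemble genotype codes 0, 1, 2.

```python
import random


def simulate_clustered_categorical(cluster_sizes, profiles, seed=0):
    """Draw categorical observations for subjects with predetermined clusters.

    profiles[c][j] is a list of category probabilities for variable j in
    cluster c (hypothetical values, not taken from the manuscript).
    Variables are drawn independently given the cluster.
    """
    rng = random.Random(seed)
    labels, data = [], []
    for c, size in enumerate(cluster_sizes):
        for _ in range(size):
            row = [rng.choices(range(len(p)), weights=p)[0] for p in profiles[c]]
            labels.append(c)
            data.append(row)
    return labels, data


def pearson(x, y):
    """Sample Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Two clusters, three variables, three categories each (assumed values).
profiles = [
    [[0.80, 0.15, 0.05], [0.80, 0.15, 0.05], [0.10, 0.20, 0.70]],
    [[0.05, 0.15, 0.80], [0.05, 0.15, 0.80], [0.70, 0.20, 0.10]],
]
labels, data = simulate_clustered_categorical([200, 200], profiles, seed=1)

# Correlation between the first two variables, pooled over all subjects.
marginal_corr = pearson([r[0] for r in data], [r[1] for r in data])

# The same correlation restricted to subjects in cluster 0.
in_c0 = [r for r, c in zip(data, labels) if c == 0]
within_corr = pearson([r[0] for r in in_c0], [r[1] for r in in_c0])

print(f"marginal correlation: {marginal_corr:.2f}")
print(f"within-cluster correlation: {within_corr:.2f}")
```

Under these assumed profiles, the first two variables are independent within each cluster yet clearly correlated once clusters are pooled, which is the kind of link between clustering structure and variable dependence that the theoretical results examine.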
