Clustering Longitudinal Life-Course Sequences using Mixtures of Exponential-Distance Models
Sequence analysis is an increasingly popular approach for the analysis of life-courses represented by categorical sequences, i.e. as the ordered collection of activities experienced by subjects over a given time period. Several criteria have been introduced in the literature to measure pairwise dissimilarities among sequences. Typically, dissimilarity matrices are employed as the input to heuristic clustering algorithms, with the aim of identifying the most relevant patterns in the data. Here, we propose a model-based clustering approach for categorical sequence data. The technique is applied to a popular survey data set containing information on the career trajectories, in terms of monthly labour market activities, of a cohort of Northern Irish youths tracked from the age of 16 to the age of 22. Specifically, we develop a family of methods for clustering sequence data directly based on mixtures of exponential-distance models, which we call MEDseq. The Hamming distance and weighted variants thereof are employed as the distance metric. The existence of a closed-form expression for the normalising constant using these metrics facilitates the development of an ECM algorithm for model fitting. We allow the probability of component membership to depend on fixed covariates. The MEDseq models can also accommodate sampling weights, which are typically associated with life-course data. Including the weights and covariates in the clustering process in a holistic manner allows new insights to be gleaned from the Northern Irish data.
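The closed-form normalising constant mentioned above can be illustrated with a minimal sketch. It assumes the simplest case only: the unweighted Hamming distance and a single precision parameter, so that a sequence s of length T over v states has density proportional to exp(-lambda * d_H(s, theta)) around a central sequence theta. Summing over all v^T sequences factorises by position, giving the constant (1 + (v - 1) * exp(-lambda))^T. The function names, state labels, and parameter values below are hypothetical and chosen for illustration; they are not taken from the authors' software, and the weighted Hamming variants, covariates, and sampling weights are not covered.

```python
from itertools import product
from math import exp, log


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def log_normalising_constant(lam, n_states, length):
    """Closed form: each position contributes 1 (match) + (v - 1) * exp(-lam)."""
    return length * log(1.0 + (n_states - 1) * exp(-lam))


def log_density(seq, central, lam, n_states):
    """Log-density of an exponential-distance model under the Hamming distance."""
    return -lam * hamming(seq, central) - log_normalising_constant(
        lam, n_states, len(central)
    )


def brute_force_constant(central, lam, states):
    """Enumerate every possible sequence; feasible only for tiny examples."""
    return sum(
        exp(-lam * hamming(s, central))
        for s in product(states, repeat=len(central))
    )


# Hypothetical toy setting: three states, sequences of length four.
states = "ABC"
central = ("A", "B", "A", "C")
lam = 0.7

closed_form = exp(log_normalising_constant(lam, len(states), len(central)))
enumerated = brute_force_constant(central, lam, states)
```

In this toy setting the two constants agree up to floating-point error. The practical point is that the closed form avoids enumerating v^T sequences, which would be infeasible for monthly sequences spanning several years.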