The authors discuss how to compensate for missing values when clustering joint categorical sequences. For example, one might have a set of sequences with both nominal values, such as family size, and binary values, such as marital status. Such sequences usually contain missing values, but discarding every incomplete sequence would leave too few sequences to support valid conclusions. Some missing values, such as age, can be inferred, but often this is not the case.
The authors consider various ways to address missing values. The principal tool is an edit distance, which measures the number and size of the changes required to transform one sequence into another. When each entry in a sequence consists of multiple values, one can use the average of the edit distances computed for each category. Given a distance, one can then infer a set of clusters of “like” sequences.
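To make the averaging idea concrete, here is a minimal Python sketch. It is not the authors' implementation: it assumes a plain Levenshtein distance with unit costs for every insertion, deletion, and substitution, and it does not model missing values at all. The function names (edit_distance, joint_distance) and the two sample records are hypothetical, chosen only for illustration.

```python
from typing import Hashable, Mapping, Sequence


def edit_distance(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Levenshtein distance with unit costs (an assumption of this sketch)."""
    # prev[j] holds the distance between the first i-1 items of a
    # and the first j items of b.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,              # delete x from a
                curr[j - 1] + 1,          # insert y into a
                prev[j - 1] + (x != y),   # substitute x by y, or match at no cost
            ))
        prev = curr
    return prev[-1]


def joint_distance(
    s: Mapping[str, Sequence[Hashable]],
    t: Mapping[str, Sequence[Hashable]],
) -> float:
    """Average the per-category edit distances of two joint sequences."""
    if s.keys() != t.keys():
        raise ValueError("both records must have the same categories")
    return sum(edit_distance(s[k], t[k]) for k in s) / len(s)


# Hypothetical records: one binary category and one nominal category.
person_a = {"married": [0, 0, 1, 1], "family_size": [1, 1, 2, 3]}
person_b = {"married": [0, 1, 1, 1], "family_size": [1, 2, 2, 3]}
```

Calling joint_distance(person_a, person_b) averages one distance per category. A matrix of such pairwise distances is the kind of input a distance-based clustering method would take.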
Using a study of income dynamics as an example that exhibits both binary and nominal values, the authors provide a detailed description of the process. They give experimental results comparing various choices of distance. The income dynamics sequences provide sufficient variety to allow several distance measures.
By using dimension reduction techniques, the authors are able to provide visual representations of the clusters corresponding to the distance choices. The paper is a useful guide to the available techniques, with sufficient illustrative examples so that readers can apply the ideas.