Distance-based clustering of mixed data

Iodice D'Enza, Alfonso
In press

Abstract

Cluster analysis comprises several unsupervised techniques that aim to identify a subgroup (cluster) structure underlying the observations of a data set. The desired cluster allocation assigns similar observations to the same subgroup. Depending on the field of application and on domain-specific requirements, different approaches to the clustering problem exist. In distance-based clustering, a distance metric is used to determine the similarity between data objects. The metric can be used to cluster observations either by considering the distances between objects directly or by considering the distances between objects and cluster centroids (or some other cluster representative points). Most distance metrics, and hence most distance-based clustering methods, work with either continuous-only or categorical-only data. In applications, however, observations are often described by a combination of continuous and categorical variables. Such data sets are referred to as mixed or mixed-type data. In this review, we consider different methods for distance-based cluster analysis of mixed data. In particular, we distinguish three streams: basic data pre-processing (where all variables are converted to the same scale), the use of specific distance measures for mixed data, and so-called joint data reduction methods (a combination of dimension reduction and clustering) specifically designed for mixed data.
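As an illustration of the second stream (a distance measure defined directly on mixed data), the following is a minimal sketch of a Gower-style distance. The abstract does not name a specific measure, so the choice of Gower's distance is an assumption made here for illustration, as are the function name `gower_distance`, the equal weighting of variables, and the toy data.

```python
import numpy as np

def gower_distance(num, cat):
    """Pairwise Gower-style distance for mixed data (illustrative sketch).

    num: (n, p) array of continuous variables
    cat: (n, q) array of categorical labels
    """
    num = np.asarray(num, dtype=float)
    cat = np.asarray(cat)
    # Continuous part: absolute difference scaled by each variable's range.
    ranges = num.max(axis=0) - num.min(axis=0)
    ranges[ranges == 0] = 1.0  # constant columns contribute zero distance
    d_num = np.abs(num[:, None, :] - num[None, :, :]) / ranges
    # Categorical part: simple mismatch (0 if equal, 1 otherwise).
    d_cat = (cat[:, None, :] != cat[None, :, :]).astype(float)
    # Average over all variables, each weighted equally (an assumption).
    total = d_num.sum(axis=2) + d_cat.sum(axis=2)
    return total / (num.shape[1] + cat.shape[1])

# Hypothetical toy data: two continuous and two categorical variables.
num = np.array([[1.0, 10.0], [2.0, 20.0], [5.0, 30.0]])
cat = np.array([["a", "x"], ["a", "y"], ["b", "y"]])
D = gower_distance(num, cat)
```

The resulting matrix `D` is symmetric with a zero diagonal, so it could in principle be passed to any clustering procedure that accepts precomputed pairwise distances, such as hierarchical clustering or partitioning around medoids.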
Use this identifier to cite or link to this document: http://hdl.handle.net/11580/70191