Hierarchical clustering of massive, high dimensional data sets by exploiting ultrametric embedding

Fionn Murtagh, Pedro Contreras Albornoz, Geoff Downs

Research output: Contribution to journalArticlepeer-review

78 Downloads (Pure)

Abstract

Coding of data, usually upstream of data analysis, has crucial impli- cations for the data analysis results. By modifying the data coding – through use of less than full precision in data values – we can aid appre- ciably the effectiveness and efficiency of the hierarchical clustering. In our first application, this is used to lessen the quantity of data to be hierar- chically clustered. The approach is a hybrid one, based on hashing and on the Ward minimum variance agglomerative criterion. In our second appli- cation, we derive a hierarchical clustering from relationships between sets of observations, rather than the traditional use of relationships between the observations themselves. This second application uses embedding in a Baire space, or longest common prefix ultrametric space. We compare this second approach, which is of O(n log n) complexity, to k-means.
Original languageEnglish
Pages (from-to)707-730
Number of pages24
JournalSIAM Journal on Scientific Computing
Volume30
DOIs
Publication statusPublished - 2008

Cite this