Hierarchical clustering of massive, high dimensional data sets by exploiting ultrametric embedding. / Murtagh, Fionn; Contreras Albornoz, Pedro; Downs, Geoff.

In: SIAM Journal on Scientific Computing, Vol. 30, 2008, p. 707-730.

Research output: Contribution to journalArticle





Coding of data, usually upstream of data analysis, has crucial impli- cations for the data analysis results. By modifying the data coding – through use of less than full precision in data values – we can aid appre- ciably the effectiveness and efficiency of the hierarchical clustering. In our first application, this is used to lessen the quantity of data to be hierar- chically clustered. The approach is a hybrid one, based on hashing and on the Ward minimum variance agglomerative criterion. In our second appli- cation, we derive a hierarchical clustering from relationships between sets of observations, rather than the traditional use of relationships between the observations themselves. This second application uses embedding in a Baire space, or longest common prefix ultrametric space. We compare this second approach, which is of O(n log n) complexity, to k-means.
Original languageEnglish
Pages (from-to)707-730
Number of pages24
JournalSIAM Journal on Scientific Computing
Publication statusPublished - 2008
This open access research output is licenced under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.

ID: 4568489