Hierarchical clustering of massive, high dimensional data sets by exploiting ultrametric embedding

Fionn Murtagh; Pedro Contreras Albornoz; Geoff Downs

doi:10.1137/060676532

Research output: Contribution to journal › Article › peer-review

Abstract

Coding of data, usually upstream of data analysis, has crucial implications for the data analysis results. By modifying the data coding – through use of less than full precision in data values – we can aid appreciably the effectiveness and efficiency of the hierarchical clustering. In our first application, this is used to lessen the quantity of data to be hierarchically clustered. The approach is a hybrid one, based on hashing and on the Ward minimum variance agglomerative criterion. In our second application, we derive a hierarchical clustering from relationships between sets of observations, rather than the traditional use of relationships between the observations themselves. This second application uses embedding in a Baire space, or longest common prefix ultrametric space. We compare this second approach, which is of O(n log n) complexity, to k-means.
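The longest common prefix idea mentioned in the abstract can be illustrated with a minimal Python sketch. It is not taken from the paper: it assumes each observation has already been coded as a string of decimal digits, takes the distance between two codes to be base^(-k) where k is the length of their shared prefix, and assigns distance 0 to identical codes. The names `common_prefix_length`, `baire_distance`, and `prefix_buckets` are hypothetical.

```python
from collections import defaultdict


def common_prefix_length(a: str, b: str) -> int:
    """Count how many leading characters two digit strings share."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n


def baire_distance(a: str, b: str, base: int = 10) -> float:
    """Longest common prefix distance (assumed form: base ** -k).

    Identical codes are given distance 0.0; codes that differ in their
    first digit are given distance 1.0.
    """
    if a == b:
        return 0.0
    return float(base) ** -common_prefix_length(a, b)


def prefix_buckets(codes, precision: int) -> dict:
    """Group codes by their first `precision` digits in one hashing pass.

    Each bucket is one cluster at the hierarchy level that corresponds to
    keeping only `precision` digits of every value.
    """
    buckets = defaultdict(list)
    for code in codes:
        buckets[code[:precision]].append(code)
    return dict(buckets)


codes = ["4251", "4278", "4911", "1300"]
level_1 = prefix_buckets(codes, 1)  # coarse clusters, one digit of precision
level_2 = prefix_buckets(codes, 2)  # finer clusters, two digits of precision
```

Reading off the buckets at increasing precision gives nested partitions, which is how a hierarchy can emerge from reduced-precision coding without computing pairwise distances between observations.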

Original language	English
Pages (from-to)	707-730
Number of pages	24
Journal	SIAM Journal on Scientific Computing
Volume	30
DOIs	https://doi.org/10.1137/060676532
Publication status	Published - 2008

Access to Document

https://doi.org/10.1137/060676532

Ultrametrization oct06d (submitted manuscript, 579 KB)

New Mathematical approaches for structuring and searching through, very large compressed encrypted textual data stores
Murtagh, F. & Contreras Albornoz, P.
Engineering and Physical Sciences Research Council (EPSRC)
1/11/06 → 31/03/10
Project: Research

Cite this

@article{519afaba5f21455e99f3b8cf40e96170,
  title = "Hierarchical clustering of massive, high dimensional data sets by exploiting ultrametric embedding",
  abstract = "Coding of data, usually upstream of data analysis, has crucial implications for the data analysis results. By modifying the data coding – through use of less than full precision in data values – we can aid appreciably the effectiveness and efficiency of the hierarchical clustering. In our first application, this is used to lessen the quantity of data to be hierarchically clustered. The approach is a hybrid one, based on hashing and on the Ward minimum variance agglomerative criterion. In our second application, we derive a hierarchical clustering from relationships between sets of observations, rather than the traditional use of relationships between the observations themselves. This second application uses embedding in a Baire space, or longest common prefix ultrametric space. We compare this second approach, which is of O(n log n) complexity, to k-means.",
  author = "Fionn Murtagh and {Contreras Albornoz}, Pedro and Geoff Downs",
  year = "2008",
  doi = "10.1137/060676532",
  language = "English",
  volume = "30",
  pages = "707--730",
  journal = "SIAM Journal on Scientific Computing",
  issn = "1064-8275",
  publisher = "Society for Industrial and Applied Mathematics Publications",
}

TY  - JOUR
T1  - Hierarchical clustering of massive, high dimensional data sets by exploiting ultrametric embedding
AU  - Murtagh, Fionn
AU  - Contreras Albornoz, Pedro
AU  - Downs, Geoff
PY  - 2008
Y1  - 2008
N2  - Coding of data, usually upstream of data analysis, has crucial implications for the data analysis results. By modifying the data coding – through use of less than full precision in data values – we can aid appreciably the effectiveness and efficiency of the hierarchical clustering. In our first application, this is used to lessen the quantity of data to be hierarchically clustered. The approach is a hybrid one, based on hashing and on the Ward minimum variance agglomerative criterion. In our second application, we derive a hierarchical clustering from relationships between sets of observations, rather than the traditional use of relationships between the observations themselves. This second application uses embedding in a Baire space, or longest common prefix ultrametric space. We compare this second approach, which is of O(n log n) complexity, to k-means.
AB  - Coding of data, usually upstream of data analysis, has crucial implications for the data analysis results. By modifying the data coding – through use of less than full precision in data values – we can aid appreciably the effectiveness and efficiency of the hierarchical clustering. In our first application, this is used to lessen the quantity of data to be hierarchically clustered. The approach is a hybrid one, based on hashing and on the Ward minimum variance agglomerative criterion. In our second application, we derive a hierarchical clustering from relationships between sets of observations, rather than the traditional use of relationships between the observations themselves. This second application uses embedding in a Baire space, or longest common prefix ultrametric space. We compare this second approach, which is of O(n log n) complexity, to k-means.
U2  - 10.1137/060676532
DO  - 10.1137/060676532
M3  - Article
SN  - 1064-8275
VL  - 30
SP  - 707
EP  - 730
JO  - SIAM Journal on Scientific Computing
JF  - SIAM Journal on Scientific Computing
ER  - 
