Abstract
Coding of data, usually upstream of data analysis, has crucial implications for the data analysis results. By modifying the data coding, through use of less than full precision in data values, we can appreciably aid the effectiveness and efficiency of hierarchical clustering. In our first application, this is used to lessen the quantity of data to be hierarchically clustered. The approach is a hybrid one, based on hashing and on the Ward minimum variance agglomerative criterion. In our second application, we derive a hierarchical clustering from relationships between sets of observations, rather than the traditional use of relationships between the observations themselves. This second application uses embedding in a Baire space, or longest common prefix ultrametric space. We compare this second approach, which is of O(n log n) complexity, to k-means.
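To make the longest common prefix idea concrete, the following is a minimal sketch, not the paper's implementation. It assumes each observation has already been coded as a fixed-length digit string, and it assumes a distance of 2 raised to minus the shared prefix length; the record does not state the base or the exact definition. The function names `common_prefix_length`, `baire_distance`, and `prefix_buckets` are hypothetical.

```python
from collections import defaultdict


def common_prefix_length(a: str, b: str) -> int:
    """Number of leading characters shared by two digit strings."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n


def baire_distance(a: str, b: str, base: float = 2.0) -> float:
    """Longest common prefix distance between equal-length digit strings.

    Assumed convention: 0 for identical strings, otherwise base ** -k,
    where k is the length of the shared prefix.
    """
    if a == b:
        return 0.0
    return base ** -common_prefix_length(a, b)


def prefix_buckets(values, k):
    """Group digit strings by their first k digits (reduced precision).

    One pass with a hash table; each bucket is a cluster at level k
    of the prefix hierarchy.
    """
    buckets = defaultdict(list)
    for v in values:
        buckets[v[:k]].append(v)
    return dict(buckets)
```

Calling `prefix_buckets` with increasing `k` gives successively finer partitions, each nested in the previous one, which is what makes the result a hierarchy. Truncating precision in this way also reduces the number of distinct values that a subsequent agglomerative step would need to handle.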
| Original language | English |
|---|---|
| Pages (from-to) | 707-730 |
| Number of pages | 24 |
| Journal | SIAM Journal on Scientific Computing |
| Volume | 30 |
| Publication status | Published - 2008 |
Projects
- 1 Finished
- New Mathematical approaches for structuring and searching through, very large compressed encrypted textual data stores
Murtagh, F. (PI) & Contreras Albornoz, P. (CoI)
Eng & Phys Sci Res Council EPSRC
1/11/06 → 31/03/10
Project: Research