Coreset-based Protocols for Machine Learning Classification

Nery Riquelme Granada

Research output: Thesis (Doctoral Thesis)


This thesis addresses the question of whether algorithms from Reliable Machine
Learning (RML), a family of machine learning frameworks that allow for theoretically
bounded prediction errors, can be accelerated in order to allow their use over large
data-sets. In particular, this investigation focuses on two fundamental members of
the RML family: Conformal Prediction (CP) and Venn-Abers Predictors (VAP).
The former is a machine learning framework that complements the output
of traditional supervised-learning algorithms with reliability measures to model the
uncertainty in the predictions. The latter allows for non-probabilistic classifiers
to produce well-calibrated probabilities over all possible classes. The additional
information that these methods generate, however, comes at the price of non-negligible
computational overhead. Current state-of-the-art approaches propose the
use of these methods in an inductive setting, which indeed reduces their computing
cost meaningfully. Still, the use of these methods remains restricted to relatively
small data-sets. This thesis proposes the acceleration of CP and VAP by using the
idea of data summarisation. Specifically, we design methods that rely on reducing
the available input data into a coreset: a representation of the input data which
retains its main properties while being orders of magnitude smaller than the original
data-set. This idea is formalised with two contributions: Coreset-based Inductive
Conformal Prediction (C-ICP) and Inductive Venn-Abers Predictors with Enclosing
Balls (IVAP-WEB). The obtained results show that both methods indeed give
a substantial speed-up to CP and VAP, while largely preserving their predictive
performance. This work’s third and fourth original contributions are protocols that
accelerate the computation of certain types of coresets: Accelerated Clustering
via Sampling (ACvS) and Regressed Data Summarisation Framework (RDSF).
The numerical results obtained indicate that both ACvS and RDSF efficiently
produce high-quality summaries of data, and hence these methods can constitute
important tools alongside coresets.
This investigation studies the above frameworks and protocols through the lens
of two fundamental machine learning methods: Logistic Regression and Support
Vector Machines, both well studied in the RML and coreset communities. The
proposed methodologies constitute, to the best of our knowledge, the first foray
into using coresets alongside CP and VAP; we believe they open the door
for the study of further interactions between data summarisation techniques and
other members of the RML family, such as Venn Machines, Mondrian Prediction
and Conformal Predictive Distributions.
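
For readers unfamiliar with the inductive setting mentioned above, the following is a minimal sketch of how an inductive conformal predictor turns nonconformity scores into p-values and a prediction set. The function names and the toy scores are illustrative assumptions and are not taken from the thesis; the coreset construction and the underlying classifier are not shown.

```python
from typing import Dict, Hashable, List, Sequence


def icp_p_value(calibration_scores: Sequence[float], test_score: float) -> float:
    """Standard (non-smoothed) inductive conformal p-value.

    Counts the calibration scores that are at least as nonconforming as the
    test score, then applies the usual +1 correction to both the count and
    the calibration set size.
    """
    at_least_as_strange = sum(1 for s in calibration_scores if s >= test_score)
    return (at_least_as_strange + 1) / (len(calibration_scores) + 1)


def prediction_set(
    calibration_scores: Sequence[float],
    candidate_scores: Dict[Hashable, float],
    significance: float,
) -> List[Hashable]:
    """Return every candidate label whose p-value exceeds the significance level."""
    return [
        label
        for label, score in candidate_scores.items()
        if icp_p_value(calibration_scores, score) > significance
    ]


# Toy (hypothetical) nonconformity scores from a held-out calibration set.
calibration = [0.1, 0.2, 0.3, 0.4]
# Hypothetical nonconformity score of one test object under each candidate label.
candidates = {"a": 0.25, "b": 0.9}

labels = prediction_set(calibration, candidates, significance=0.2)
```

In this sketch the cost of a prediction is one pass over the calibration scores per candidate label, which is why the size of the data used for training and calibration matters for large data-sets.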
Original language: English
Awarding Institution
  • Royal Holloway, University of London
Supervisors
  • Luo, Zhiyuan, Supervisor
  • Nguyen, Dr. Khuong An, Supervisor
Thesis sponsors
Award date: 1 Jul 2022
Publication status: Unpublished - 2022


  • Coresets
  • Data Compression
  • Conformal Prediction
  • Machine Learning
