Coreset-based Protocols for Machine Learning Classification

Nery Riquelme Granada

Department of Computer Science

Research output: Thesis › Doctoral Thesis

Abstract

This thesis addresses the question of whether algorithms from Reliable Machine
Learning (RML), a family of machine learning frameworks that allow for theoretically
bounded prediction errors, can be accelerated in order to allow their use over large
data-sets. In particular, this investigation focuses on two fundamental members of
the RML family: Conformal Prediction (CP) and Venn-Abers Predictors (VAP).
The former is a machine learning framework that complements the output
of traditional supervised-learning algorithms with reliability measures to model the
uncertainty in the predictions. The latter allows non-probabilistic classifiers
to produce well-calibrated probabilities over all possible classes. The additional
information that these methods generate, however, comes at the price of non-negligible
computational overhead. Current state-of-the-art approaches propose the
use of these methods in an inductive setting, which indeed reduces their computing
cost meaningfully. Still, the use of these methods remains restricted to relatively
small data-sets. This thesis proposes the acceleration of CP and VAP by using the
idea of data summarisation. Specifically, we design methods that rely on reducing
the available input data into a coreset: a representation of the input data which
retains its main properties while being orders of magnitude smaller than the original
data-set. This idea is formalised with two contributions: Coreset-based Inductive
Conformal Prediction (C-ICP) and Inductive Venn-Abers Predictors with Enclosing
Balls (IVAP-WEB). The obtained results show that both methods indeed give
a substantial speed-up to CP and VAP, while largely preserving their predictive
performance. This work’s third and fourth original contributions are protocols that
accelerate the computation of certain types of coresets: Accelerated Clustering
via Sampling (ACvS) and Regressed Data Summarisation Framework (RDSF).
The numerical results obtained indicate that both ACvS and RDSF efficiently
produce high-quality summaries of data, and hence these methods can constitute
important tools alongside coresets.
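
As a rough illustration of the idea behind C-ICP, the following Python sketch fits a model on a small weighted summary of the training data and then computes inductive conformal p-values on a held-out calibration set. It is not the thesis's method: the uniform subsample only stands in for a proper coreset construction, the nearest-centroid model replaces Logistic Regression or SVMs, and every function name and the synthetic data are assumptions made for this example.

```python
import random

def uniform_summary(data, m, rng):
    """Stand-in for a coreset (assumption): m uniformly sampled points, each weighted n/m."""
    weight = len(data) / m
    return [(x, y, weight) for x, y in rng.sample(data, m)]

def weighted_centroids(summary):
    """Weighted mean of the 1-D feature for each class label."""
    totals = {}
    for x, y, w in summary:
        sum_x, sum_w = totals.get(y, (0.0, 0.0))
        totals[y] = (sum_x + w * x, sum_w + w)
    return {y: sum_x / sum_w for y, (sum_x, sum_w) in totals.items()}

def nonconformity(centroids, x, y):
    """Distance to the centroid of the hypothesised label (larger means stranger)."""
    return abs(x - centroids[y])

def icp_p_values(centroids, calibration, x):
    """Inductive conformal p-value of every candidate label for test point x."""
    scores = [nonconformity(centroids, xc, yc) for xc, yc in calibration]
    p_values = {}
    for label in centroids:
        alpha = nonconformity(centroids, x, label)
        at_least = sum(1 for s in scores if s >= alpha)
        p_values[label] = (at_least + 1) / (len(scores) + 1)
    return p_values

# Synthetic two-class data (hypothetical): class 0 around -2, class 1 around +2.
rng = random.Random(0)
data = [(rng.gauss(-2.0, 1.0), 0) for _ in range(2000)]
data += [(rng.gauss(2.0, 1.0), 1) for _ in range(2000)]
rng.shuffle(data)
proper_train, calibration = data[:3000], data[3000:]

summary = uniform_summary(proper_train, 100, rng)  # 100 weighted points instead of 3000
centroids = weighted_centroids(summary)
p_values = icp_p_values(centroids, calibration, x=2.1)
region = {label for label, p in p_values.items() if p > 0.05}
```

Here `region` is the prediction set at significance level 0.05: every label whose p-value exceeds 0.05. Only the model-fitting step sees the reduced data; the calibration step is unchanged, which is the part of the pipeline a summary is meant to make cheaper.
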
This investigation studies the above frameworks and protocols through the lens
of two fundamental machine learning methods: Logistic Regression and Support
Vector Machines, both well studied in the RML and coreset communities. The
proposed methodologies constitute, to the best of our knowledge, the first incursion
into using coresets alongside CP and VAP; and, we believe, they open the door
for the study of further interactions between data summarisation techniques and
other members of the RML family, such as Venn Machines, Mondrian Prediction
and Conformal Predictive Distributions.
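
To make the Venn-Abers idea concrete, here is a minimal Python sketch of an inductive Venn-Abers predictor in its usual textbook form: the test score is added to the calibration set once with label 0 and once with label 1, isotonic regression is fitted each time, and the two fitted values at the test score give a probability interval. The helper names and the toy calibration scores are assumptions for illustration, and the sketch does not include the enclosing-ball summarisation step of IVAP-WEB.

```python
def pava(values, weights):
    """Pool Adjacent Violators: weighted non-decreasing least-squares fit."""
    blocks = []  # each block holds [mean, total weight, number of entries]
    for v, w in zip(values, weights):
        blocks.append([v, w, 1])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2, c1 + c2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted

def isotonic_at(points, s):
    """Fit isotonic regression to (score, label) pairs; return the fitted value at score s."""
    groups = {}  # tied scores are averaged before pooling
    for score, label in points:
        total, count = groups.get(score, (0, 0))
        groups[score] = (total + label, count + 1)
    scores = sorted(groups)
    means = [groups[x][0] / groups[x][1] for x in scores]
    counts = [groups[x][1] for x in scores]
    return pava(means, counts)[scores.index(s)]

def ivap(calibration, s):
    """Return (p0, p1): calibrated lower and upper probabilities of class 1 at score s."""
    p0 = isotonic_at(calibration + [(s, 0)], s)
    p1 = isotonic_at(calibration + [(s, 1)], s)
    return p0, p1

# Toy calibration set (hypothetical): (classifier score, true label) pairs.
calibration = [(0.1, 0), (0.2, 0), (0.3, 1), (0.4, 0), (0.7, 1), (0.8, 1), (0.9, 1)]
p0, p1 = ivap(calibration, 0.5)
merged = p1 / (1.0 - p0 + p1)  # common way to merge the pair into one probability
```

The width of the interval `[p0, p1]` indicates how uncertain the calibration is at that score. The scores themselves can come from any classifier, for instance SVM decision values.
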

Original language	English
Qualification	Ph.D.
Awarding Institution	Royal Holloway, University of London
Supervisors/Advisors	Zhiyuan Luo (Supervisor), Dr. Khuong An Nguyen (Supervisor)
Thesis sponsors	Paraguayan National Scholarship Program for Graduate Studies Abroad “Don Carlos Antonio López”
Award date	1 Jul 2022
Publication status	Unpublished - 2022

Keywords

Coresets
Data Compression
Conformal Prediction
Machine Learning

Cite this

@phdthesis{b5ed109e87b6489e97800c85e684c788,

title = "Coreset-based Protocols for Machine Learning Classification",

keywords = "Coresets, Data Compression, Conformal Prediction, Machine Learning",

author = "{Riquelme Granada}, Nery",

year = "2022",

language = "English",

school = "Royal Holloway, University of London",

}

TY - BOOK

T1 - Coreset-based Protocols for Machine Learning Classification

AU - Riquelme Granada, Nery

PY - 2022

Y1 - 2022

KW - Coresets

KW - Data Compression

KW - Conformal Prediction

KW - Machine Learning

M3 - Doctoral Thesis

ER -