Abstract
This thesis addresses the question of whether algorithms from Reliable Machine
Learning (RML), a family of machine learning frameworks that allow for theoretically
bounded prediction errors, can be accelerated in order to allow their use over large
datasets. In particular, this investigation focuses on two fundamental members of
the RML family: Conformal Prediction (CP) and Venn-Abers Predictors (VAP).
The former is a machine learning framework that complements the output
of traditional supervised-learning algorithms with reliability measures that model the
uncertainty in the predictions. The latter allows non-probabilistic classifiers
to produce well-calibrated probabilities over all possible classes. The additional
information that these methods generate, however, comes at the price of a non-negligible computational overhead. Current state-of-the-art approaches propose the
use of these methods in an inductive setting, which indeed reduces their computing
cost meaningfully. Still, the use of these methods remains restricted to relatively
small datasets. This thesis proposes the acceleration of CP and VAP by using the
idea of data summarisation. Specifically, we design methods that rely on reducing
the available input data into a coreset: a representation of the input data which
retains its main properties while being orders of magnitude smaller than the original
dataset. This idea is formalised with two contributions: Coreset-based Inductive
Conformal Prediction (CICP) and Inductive Venn-Abers Predictors with Enclosing
Balls (IVAP-WEB). The obtained results show that both methods indeed give
a substantial speedup to CP and VAP, while largely preserving their predictive
performance. This work’s third and fourth original contributions are protocols that
accelerate the computation of certain types of coresets: Accelerated Clustering
via Sampling (ACvS) and Regressed Data Summarisation Framework (RDSF).
The numerical results obtained indicate that both ACvS and RDSF efficiently
produce high-quality summaries of data, and hence these methods can constitute
important tools alongside coresets.
This investigation studies the above frameworks and protocols through the lens
of two fundamental machine learning methods: Logistic Regression and Support
Vector Machines, both well studied in the RML and coreset communities. The
proposed methodologies constitute, to the best of our knowledge, the first incursion
into using coresets alongside CP and VAP; and, we believe, they open the door for the study of further interactions between data summarisation techniques and
other members of the RML family, such as Venn Machines, Mondrian Prediction
and Conformal Predictive Distributions.
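As a concrete illustration of the inductive conformal prediction setting discussed above, the following is a minimal sketch of how calibration-set nonconformity scores yield a prediction set at a chosen significance level. This is illustrative code only, not the thesis's implementation: the nonconformity scores are assumed to be precomputed by some underlying classifier, and the label set and scores are toy values.

```python
import numpy as np

def icp_prediction_set(scores_cal, score_test_per_label, epsilon):
    """Inductive Conformal Prediction (sketch): return every label whose
    p-value exceeds the significance level epsilon.

    scores_cal: nonconformity scores of the calibration examples
    score_test_per_label: {label: nonconformity score of the test object
                           when it is tentatively assigned that label}
    """
    n = len(scores_cal)
    prediction_set = []
    for label, s in score_test_per_label.items():
        # p-value: fraction of calibration scores at least as nonconforming
        # as the test score (the test example itself counts in the n + 1)
        p = (np.sum(scores_cal >= s) + 1) / (n + 1)
        if p > epsilon:
            prediction_set.append(label)
    return prediction_set

# Toy example: five calibration scores, two candidate labels
cal = np.array([0.1, 0.2, 0.25, 0.3, 0.9])
test = {"A": 0.15, "B": 0.95}
print(icp_prediction_set(cal, test, epsilon=0.2))  # prints ['A']
```

Because the calibration scores are computed once on a held-out set, the per-prediction cost is a ranking over those scores; the coreset-based methods above aim to shrink the data the underlying model is trained on, which is where the bulk of the computational overhead lies.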
Original language: English
Qualification: Ph.D.
Award date: 1 Jul 2022
Publication status: Unpublished (2022)
Keywords
- Coresets
- Data Compression
- Conformal Prediction
- Machine Learning