Abstract
With the proliferation of increasingly complicated Deep Learning architectures, data synthesis is a highly promising technique to address the demands of data-hungry models. However, reliably assessing the quality of a ‘synthesiser’ model’s output is an open research question with significant associated risks for high-stakes domains. To address this challenge, we propose a unique synthesis algorithm that generates data from high-confidence feature space regions based on the Conformal Prediction framework. We support our proposed algorithm with a comprehensive exploration of the core parameter’s influence, an in-depth discussion of practical advice, and an extensive empirical evaluation on five benchmark datasets. To show our approach’s versatility on ubiquitous real-world challenges, the datasets were carefully selected for their variety of difficult characteristics: low sample count, class imbalance, and non-separability. In all trials, training sets extended with our confident synthesised data performed at least as well as the original set and frequently improved Deep Learning performance significantly, by up to 61 percentage points in F1-score.
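The abstract does not spell out the algorithm, so the following is only an illustrative sketch of the general idea of keeping synthetic samples from high-confidence regions, not the authors' method. It assumes a class-conditional inductive conformal predictor with a distance-to-centroid nonconformity score, candidates drawn by adding Gaussian jitter to training points, and a significance level `epsilon`. All function names and parameters (`conformal_p_values`, `synthesise_confident`, `noise`, `n_candidates`) are hypothetical.

```python
import numpy as np


def conformal_p_values(x, centroids, cal_scores_by_class):
    """Return one conformal p-value per candidate label for a single point x."""
    p = np.empty(len(centroids))
    for k, centroid in enumerate(centroids):
        # Assumed nonconformity score: Euclidean distance to the class centroid.
        score = np.linalg.norm(x - centroid)
        cal = cal_scores_by_class[k]
        # Fraction of calibration scores at least as nonconforming as x.
        p[k] = (np.sum(cal >= score) + 1) / (len(cal) + 1)
    return p


def synthesise_confident(X_train, y_train, X_cal, y_cal,
                         n_candidates=500, epsilon=0.1, noise=0.5, seed=0):
    """Keep jittered candidates whose conformal prediction set is a single label."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y_train)
    centroids = [X_train[y_train == k].mean(axis=0) for k in classes]
    cal_scores = [np.linalg.norm(X_cal[y_cal == k] - c, axis=1)
                  for k, c in zip(classes, centroids)]

    # Assumed candidate generator: Gaussian jitter around random training points.
    idx = rng.integers(0, len(X_train), size=n_candidates)
    candidates = X_train[idx] + rng.normal(scale=noise,
                                           size=(n_candidates, X_train.shape[1]))

    X_new, y_new = [], []
    for x in candidates:
        in_set = conformal_p_values(x, centroids, cal_scores) > epsilon
        if in_set.sum() == 1:  # singleton prediction set: confident region
            X_new.append(x)
            y_new.append(classes[np.argmax(in_set)])
    return np.array(X_new).reshape(-1, X_train.shape[1]), np.array(y_new)


# Toy usage on two well-separated Gaussian blobs (synthetic data for illustration).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, size=(150, 2)),
               rng.normal(6.0, 1.0, size=(150, 2))])
y = np.repeat([0, 1], 150)
perm = rng.permutation(len(X))
train, cal = perm[:200], perm[200:]
X_new, y_new = synthesise_confident(X[train], y[train], X[cal], y[cal])
print(X_new.shape, np.bincount(y_new))
```

In this sketch, a smaller `epsilon` enlarges the prediction sets, so fewer candidates end up with a single label and the filter becomes stricter; the accepted samples would then be appended to the original training set.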
Original language | English |
---|---|
Article number | 57 |
Number of pages | 37 |
Journal | Machine Learning |
Volume | 114 |
DOIs | |
Publication status | Published - 6 Feb 2025 |