Abstract
With the proliferation of increasingly complicated Deep Learning architectures, data synthesis is a highly promising technique for meeting the demands of data-hungry models. However, reliably assessing the quality of a ‘synthesiser’ model’s output remains an open research question, with significant associated risks in high-stakes domains. To address this challenge, we propose a unique synthesis algorithm that generates data from high-confidence regions of the feature space, based on the Conformal Prediction framework. We support the proposed algorithm with a comprehensive exploration of the core parameter’s influence, an in-depth discussion of practical advice, and an extensive empirical evaluation on five benchmark datasets. To show our approach’s versatility on ubiquitous real-world challenges, the datasets were carefully selected for their variety of difficult characteristics: low sample count, class imbalance, and non-separability. In all trials, training sets extended with our confident synthesised data performed at least as well as the original set, and frequently improved Deep Learning performance significantly, by up to 61 percentage points in F1-score.
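As a rough illustration of the idea of keeping only synthetic points that fall in high-confidence regions, the sketch below uses split conformal prediction with a nearest-centroid nonconformity score. It is not the algorithm from the paper: the function names, the distance-to-centroid score, the uniform candidate sampling, and the significance level `epsilon` are all assumptions chosen for brevity.

```python
import numpy as np

def fit_centroids(X, y):
    """Return the sorted class labels and one mean vector per class."""
    labels = np.unique(y)
    return labels, np.stack([X[y == c].mean(axis=0) for c in labels])

def nonconformity(X, centroids):
    """Distance of every row of X to every class centroid, shape (n, k)."""
    return np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)

def conformal_p_values(X_new, centroids, cal_scores):
    """Split-conformal p-value of each candidate label for each row of X_new."""
    scores = nonconformity(X_new, centroids)
    # Count calibration scores at least as large as each candidate score.
    ge = (cal_scores[None, None, :] >= scores[:, :, None]).sum(axis=2)
    return (ge + 1) / (len(cal_scores) + 1)

def confident_synthesis(X, y, n_candidates=2000, epsilon=0.1, seed=0):
    """Hypothetical helper: sample candidates and keep the confident ones."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    half = len(X) // 2
    train, cal = idx[:half], idx[half:]
    labels, centroids = fit_centroids(X[train], y[train])

    # Calibration scores: distance of each held-out point to its own class.
    cal_all = nonconformity(X[cal], centroids)
    cal_scores = cal_all[np.arange(len(cal)), np.searchsorted(labels, y[cal])]

    # Assumption: candidates are drawn uniformly from the data bounding box.
    cand = rng.uniform(X.min(axis=0), X.max(axis=0),
                       size=(n_candidates, X.shape[1]))
    in_set = conformal_p_values(cand, centroids, cal_scores) > epsilon

    # Keep candidates whose prediction set contains exactly one label.
    keep = in_set.sum(axis=1) == 1
    return cand[keep], labels[in_set[keep].argmax(axis=1)]

# Toy usage on two well-separated Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
X_syn, y_syn = confident_synthesis(X, y)
```

In this sketch, lowering `epsilon` widens the prediction sets, so fewer candidates end up with a single label and the retained synthetic set becomes smaller but more conservative.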
| Original language | English |
|---|---|
| Article number | 57 |
| Number of pages | 37 |
| Journal | Machine Learning |
| Volume | 114 |
| Publication status | Published - 6 Feb 2025 |