Coreset-Based data compression for Logistic Regression

Nery Riquelme-Granada, Khuong An Nguyen, Zhiyuan Luo

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


The coreset paradigm is a fundamental tool for analysing complex and large datasets. Although coresets are used as an acceleration technique for many learning problems, the algorithms used for constructing them may become computationally expensive in some settings. We show that this can easily happen when computing coresets for learning a logistic regression classifier. We overcome this issue with two methods: Accelerating Clustering via Sampling (ACvS) and the Regressed Data Summarisation Framework (RDSF). The former is an acceleration procedure based on a simple theoretical observation about using Uniform Random Sampling for clustering problems; the latter is a coreset-based data-summarising framework that builds on ACvS and extends it by using a regression algorithm as part of the construction. We tested both procedures on five public datasets and observed that computing the coreset and learning from it is 11 times faster than learning directly from the full input data in the worst case, and 34 times faster in the best case. We further observed that the best regression algorithm for creating summaries of data using RDSF is Ordinary Least Squares (OLS).
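The pipeline described in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: it only assumes, from the abstract, that clustering is run on a uniform random subsample, that a regression model (here OLS) extends per-point importance scores from the subsample to the full data, and that the summary is then drawn by importance sampling. The function names (`lloyd_kmeans`, `summarise`), the parameters, and in particular the distance-based placeholder score are hypothetical; the paper's actual importance scores are not given on this page.

```python
import numpy as np

def lloyd_kmeans(X, k, iters=10, rng=None):
    """Plain Lloyd's algorithm; returns the k cluster centres."""
    rng = np.random.default_rng(rng)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centre.
        d = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centre if a cluster empties
                centres[j] = members.mean(axis=0)
    return centres

def summarise(X, m, sample_size, k=4, seed=0):
    """Return indices and weights of an m-point weighted summary of X."""
    rng = np.random.default_rng(seed)
    n = len(X)

    # Step 1 (assumed ACvS-style step): cluster a uniform random
    # subsample instead of the full data.
    idx = rng.choice(n, size=sample_size, replace=False)
    S = X[idx]
    centres = lloyd_kmeans(S, k, rng=rng)

    # Step 2: importance scores on the subsample only.
    # PLACEHOLDER score (1 + distance to nearest centre), not the
    # paper's definition.
    sq = ((S[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    scores_S = 1.0 + np.sqrt(sq.min(axis=1))

    # Step 3 (assumed RDSF-style step): fit OLS from features to scores
    # and predict scores for every point in the full data.
    A = np.column_stack([np.ones(len(S)), S])
    coef, *_ = np.linalg.lstsq(A, scores_S, rcond=None)
    scores = np.column_stack([np.ones(n), X]) @ coef
    scores = np.clip(scores, 1e-12, None)  # probabilities must be positive

    # Step 4: importance sampling with inverse-probability weights.
    p = scores / scores.sum()
    chosen = rng.choice(n, size=m, replace=True, p=p)
    weights = 1.0 / (m * p[chosen])
    return chosen, weights
```

Under these assumptions, a weighted logistic regression would then be trained on `X[chosen]` with `weights` as sample weights, in place of training on all of `X`.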
Original language: English
Title of host publication: Data Management Technologies and Applications - 9th International Conference, DATA 2020, Revised Selected Papers
Editors: Slimane Hammoudi, Christoph Quix, Jorge Bernardino
Number of pages: 28
ISBN (Print): 9783030830137
Publication status: Published - 23 Jul 2021

Publication series

Name: Communications in Computer and Information Science


  • Coresets
  • Logistic Regression
  • Data compression
