Abstract
The evolution of malware has long plagued machine learning-based detection systems, as malware authors develop innovative strategies to evade detection and chase profits. This induces concept drift: the test distribution diverges from the training distribution, causing performance decay that requires constant monitoring and adaptation.
In this work, we analyze the adaptation strategy used by DroidEvolver, a state-of-the-art learning system that self-updates using pseudo-labels to avoid the high overhead of obtaining new ground truth. After removing sources of experimental bias present in the original evaluation, we identify a number of flaws in the generation and integration of these pseudo-labels, which lead to a rapid onset of performance degradation as the model poisons itself. We propose DroidEvolver++, a more robust variant of DroidEvolver, to address these issues, and we highlight the role of pseudo-labels in addressing concept drift. We test the tolerance of the adaptation strategy under different degrees of pseudo-label noise and propose the adoption of methods to ensure that only high-quality pseudo-labels are used for updates.
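As a minimal sketch of the kind of quality filter described here (not the authors' implementation), pseudo-labels can be gated by prediction confidence before being used for self-updates. The threshold value and the `predict_proba` model interface below are illustrative assumptions, not details from the paper.

```python
import numpy as np

def select_pseudo_labels(model, X_unlabeled, threshold=0.9):
    """Keep only pseudo-labels the model predicts with high confidence.

    `model` is assumed to expose a scikit-learn-style predict_proba;
    the 0.9 threshold is an illustrative choice, not a value from the paper.
    """
    proba = model.predict_proba(X_unlabeled)   # shape: (n_samples, n_classes)
    confidence = proba.max(axis=1)             # highest class probability per sample
    pseudo_labels = proba.argmax(axis=1)       # predicted class per sample
    keep = confidence >= threshold             # discard low-confidence predictions
    return X_unlabeled[keep], pseudo_labels[keep]

# Usage (hypothetical): update only on confidently pseudo-labeled drift
# samples, reducing the risk of the self-poisoning feedback loop described
# above.
# X_kept, y_pseudo = select_pseudo_labels(clf, X_drift, threshold=0.9)
```

Gating updates this way trades update coverage for label quality: fewer samples are incorporated, but each is less likely to inject noise back into the model.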
Ultimately, we conclude that pseudo-labeling remains a promising solution to limited labeling capacity, but great care must be taken when designing update mechanisms to avoid negative feedback loops and self-poisoning, which have catastrophic effects on performance.
Field | Value
---|---
Original language | English
Title of host publication | AISec '21
Subtitle of host publication | Proceedings of the 14th ACM Workshop on Artificial Intelligence and Security
Publisher | ACM
Pages | 123-134
Number of pages | 12
DOIs |
Publication status | Published - 15 Nov 2021