
How Data Augmentation Affects Evolutionary Algorithms in Feature Selection: An Experimental Study

Emanuele Nardone, Tiziana D'Alessandro, Claudio De Stefano, Francesco Fontanella
2025-01-01

Abstract

The rapid growth of machine learning has led to an increase in the number of features used to represent data, often resulting in superfluous or irrelevant features that negatively impact model performance. Feature selection techniques address this issue by identifying the smallest subset of relevant features. This paper examines the integration of a novel data augmentation algorithm with evolutionary algorithms for feature selection across ten datasets from widely varying domains. We investigate the effectiveness of Genetic Algorithms, Particle Swarm Optimization, and Differential Evolution on datasets augmented by 10–50% and compare their performance with standard filter-based and wrapper methods. Our experiments demonstrate that data augmentation significantly boosts the efficacy of evolutionary algorithms, improving accuracy by up to 5% and reducing feature sets by an average of 40%. While Differential Evolution generally outperforms the other algorithms, our findings reveal that the efficacy of combining data augmentation and feature selection varies across datasets. Optimal performance is typically observed at augmentation levels between 30% and 50%, though excessive augmentation can occasionally lead to slight performance degradation, emphasizing the need for careful calibration. This research paves the way for future studies on the interplay between data augmentation and feature selection, including investigations into explainability and generalizability across different machine learning paradigms. By providing insight into this complex interplay, our study contributes to the development of more robust and efficient algorithms across various domains.
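The abstract does not detail the novel augmentation algorithm or the exact evolutionary setup, so the following is only a minimal sketch of the general pipeline it describes: a dataset is enlarged by a fixed percentage with synthetic samples, and a genetic algorithm then searches binary feature masks whose fitness is cross-validated classifier accuracy. Every specific choice here (Gaussian-jitter augmentation, the 30% augmentation level, scikit-learn's breast-cancer dataset, the decision-tree classifier, and the GA hyper-parameters) is an illustrative assumption, not the authors' method.

```python
# Sketch: wrapper-style feature selection with a genetic algorithm on an
# augmented dataset. The paper's augmentation algorithm is not described in
# the abstract; plain Gaussian jitter of existing samples stands in for it.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)

# Augment the data by ~30% with jittered copies of randomly chosen samples.
n_extra = int(0.30 * len(X))
idx = rng.integers(0, len(X), size=n_extra)
noise = rng.normal(0.0, 0.01 * X.std(axis=0), (n_extra, X.shape[1]))
X_aug = np.vstack([X, X[idx] + noise])
y_aug = np.concatenate([y, y[idx]])

def fitness(mask):
    """Cross-validated accuracy of a classifier restricted to the selected
    features. The abstract does not specify the fitness; accuracy alone is
    used here (a parsimony term could additionally reward small subsets)."""
    if mask.sum() == 0:
        return 0.0
    clf = DecisionTreeClassifier(random_state=0)
    return cross_val_score(clf, X_aug[:, mask.astype(bool)], y_aug, cv=5).mean()

# Simple generational GA over binary feature masks.
pop_size, n_gen, n_feat = 30, 20, X.shape[1]
pop = rng.integers(0, 2, size=(pop_size, n_feat))
for gen in range(n_gen):
    scores = np.array([fitness(ind) for ind in pop])
    # Tournament selection (size 3).
    parents = pop[[max(rng.choice(pop_size, 3), key=lambda i: scores[i])
                   for _ in range(pop_size)]]
    # One-point crossover on consecutive parent pairs.
    children = parents.copy()
    cuts = rng.integers(1, n_feat, size=pop_size // 2)
    for k, c in enumerate(cuts):
        children[2 * k, c:] = parents[2 * k + 1, c:]
        children[2 * k + 1, c:] = parents[2 * k, c:]
    # Bit-flip mutation with probability 1/n_feat per gene.
    flip = rng.random(children.shape) < 1.0 / n_feat
    pop = np.where(flip, 1 - children, children)

best = pop[np.argmax([fitness(ind) for ind in pop])]
print(f"selected {best.sum()} of {n_feat} features, CV accuracy {fitness(best):.3f}")
```

Particle Swarm Optimization or Differential Evolution would replace only the selection/crossover/mutation loop above; the augmentation step and the mask-based fitness evaluation stay the same, which is what allows the study to compare the three algorithms under identical conditions.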

Use this identifier to cite or link to this document: https://hdl.handle.net/11580/117365