Malware detection using random forest method trained on a balanced synthetic dataset

Matsobane, Neo Onica

dc.contributor.advisor	Mokwena, S. N.
dc.contributor.author	Matsobane, Neo Onica
dc.date.accessioned	2025-01-30T11:05:22Z
dc.date.available	2025-01-30T11:05:22Z
dc.date.issued	2024
dc.identifier.uri	http://hdl.handle.net/10386/4846
dc.description	Thesis (M.Sc. (eScience Data Science)) -- University of Limpopo, 2024	en_US
dc.description.abstract	Malicious software (malware) poses a significant threat to the security and integrity of computer systems. Traditional malware detection approaches are often hampered by small-scale and imbalanced datasets, which reduce detection accuracy and reliability. In this research, we proposed a novel approach that addresses these issues by training a Random Forest model on a balanced synthetic dataset. The primary objective of the study was to investigate the impact of the Random Forest technique on malware detection. To this end, we first used Generative Adversarial Networks (GANs) to create a balanced synthetic dataset based on the CICMalDroid2020 dataset, and then trained the Random Forest model on it. The model's performance was evaluated using detection accuracy, precision, recall, balanced accuracy, geometric metrics, and F1-score. Intensive analyses assessed how accurately and robustly the proposed approach detects malware samples compared with traditional detection methods. The results give insight into the performance improvements achieved when Random Forest is trained on a balanced synthetic dataset, thus contributing to the advancement of malware detection techniques. The test results showed that Random Forest can detect malware attacks with an accuracy of 91%, recall of 100%, precision of 85%, F1-score of 92%, balanced accuracy of 95% and geometric metrics of 84%. From the results, we inferred that Random Forest has the capacity to detect malware attacks.	en_US
dc.format.extent	vi, 67 leaves	en_US
dc.language.iso	en	en_US
dc.relation.requires	PDF	en_US
dc.subject	Random Forest	en_US
dc.subject	Malware detection	en_US
dc.subject	Synthetic dataset	en_US
dc.subject	Balanced dataset	en_US
dc.subject	Generative Adversarial Networks (GANs)	en_US
dc.subject.lcsh	Malware (Computer software)	en_US
dc.subject.lcsh	Computer viruses	en_US
dc.subject.lcsh	Data sets	en_US
dc.title	Malware detection using random forest method trained on a balanced synthetic dataset	en_US
dc.type	Thesis	en_US
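
The evaluation metrics named in the abstract can all be derived from a binary confusion matrix. The following is a minimal sketch, not the thesis's own code: the function names and the example counts are hypothetical, and it assumes that "geometric metrics" refers to the geometric mean of recall and specificity, which the record does not confirm. The final lines check only that the reported F1-score is consistent with the reported precision and recall.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute the metrics listed in the abstract from confusion-matrix counts.

    tp/fp/tn/fn are true/false positives/negatives, with malware as the
    positive class (an assumption made for this illustration).
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # also called sensitivity
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
        "balanced_accuracy": (recall + specificity) / 2,
        # Assumed reading of "geometric metrics": the G-mean.
        "g_mean": math.sqrt(recall * specificity),
    }


def f1_from(precision, recall):
    """F1-score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only; these are NOT the thesis's data.
example = binary_metrics(tp=90, fp=10, tn=80, fn=20)

# Consistency check using the figures reported in the abstract:
# precision 85% and recall 100% imply an F1-score of about 92%.
reported_f1 = round(f1_from(0.85, 1.00), 2)
```

Working from raw counts keeps the relationships between the metrics explicit; in practice a library such as scikit-learn would typically be used for the same calculations.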