Search | VHL Regional Portal

Evaluating Word Embedding Feature Extraction Techniques for Host-Based Intrusion Detection Systems.

Mvula, Paul K; Branco, Paula; Jourdan, Guy-Vincent; Viktor, Herna L.

Discov Data ; 1(1): 2, 2023.

Article in English | MEDLINE | ID: mdl-37035459

ABSTRACT

Research into Intrusion and Anomaly Detectors at the Host level typically pays much attention to extracting attributes from system call traces. These include window-based, Hidden Markov Models, and sequence-model-based attributes. Recently, several works have been focusing on sequence-model-based feature extractors, specifically Word2Vec and GloVe, to extract embeddings from the system call traces due to their ability to capture semantic relationships among system calls. However, due to the nature of the data, these extractors introduce inconsistencies in the extracted features, causing the Machine Learning models built on them to yield inaccurate and potentially misleading results. In this paper, we first highlight the research challenges posed by these extractors. Then, we conduct experiments with new feature sets assessing their suitability to address the detected issues. Our experiments show that Word2Vec is prone to introducing more duplicated samples than GloVe. Regarding the solutions proposed, we found that concatenating the embedding vectors generated by Word2Vec and GloVe yields the overall best balanced accuracy. In addition to resolving the challenge of data leakage, this approach enables an improvement in performance relative to other alternatives.

A systematic literature review of cyber-security data repositories and performance assessment metrics for semi-supervised learning.

Mvula, Paul K; Branco, Paula; Jourdan, Guy-Vincent; Viktor, Herna L.

Discov Data ; 1(1): 4, 2023.

Article in English | MEDLINE | ID: mdl-37038388

ABSTRACT

In Machine Learning, the datasets used to build models are one of the main factors limiting what these models can achieve and how good their predictive performance is. Machine Learning applications for cyber-security or computer security are numerous including cyber threat mitigation and security infrastructure enhancement through pattern recognition, real-time attack detection, and in-depth penetration testing. Therefore, for these applications in particular, the datasets used to build the models must be carefully thought to be representative of real-world data. However, because of the scarcity of labelled data and the cost of manually labelling positive examples, there is a growing corpus of literature utilizing Semi-Supervised Learning with cyber-security data repositories. In this work, we provide a comprehensive overview of publicly available data repositories and datasets used for building computer security or cyber-security systems based on Semi-Supervised Learning, where only a few labels are necessary or available for building strong models. We highlight the strengths and limitations of the data repositories and sets and provide an analysis of the performance assessment metrics used to evaluate the built models. Finally, we discuss open challenges and provide future research directions for using cyber-security datasets and evaluating models built upon them.

COVID-19 malicious domain names classification.

Mvula, Paul K; Branco, Paula; Jourdan, Guy-Vincent; Viktor, Herna L.

Expert Syst Appl ; 204: 117553, 2022 Oct 15.

Article in English | MEDLINE | ID: mdl-35611122

ABSTRACT

Due to the rapid technological advances that have been made over the years, more people are changing their way of living from traditional ways of doing business to those featuring greater use of electronic resources. This transition has attracted (and continues to attract) the attention of cybercriminals, referred to in this article as "attackers", who make use of the structure of the Internet to commit cybercrimes, such as phishing, in order to trick users into revealing sensitive data, including personal information, banking and credit card details, IDs, passwords, and more important information via replicas of legitimate websites of trusted organizations. In our digital society, the COVID-19 pandemic represents an unprecedented situation. As a result, many individuals were left vulnerable to cyberattacks while attempting to gather credible information about this alarming situation. Unfortunately, by taking advantage of this situation, specific attacks associated with the pandemic dramatically increased. Regrettably, cyberattacks do not appear to be abating. For this reason, cyber-security corporations and researchers must constantly develop effective and innovative solutions to tackle this growing issue. Although several anti-phishing approaches are already in use, such as the use of blacklists, visuals, heuristics, and other protective solutions, they cannot efficiently prevent imminent phishing attacks. In this paper, we propose machine learning models that use a limited number of features to classify COVID-19-related domain names as either malicious or legitimate. Our primary results show that a small set of carefully extracted lexical features, from domain names, can allow models to yield high scores; additionally, the number of subdomain levels as a feature can have a large influence on the predictions.

ABSTRACT

ABSTRACT

ABSTRACT

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL