SELID: Selective Event Labeling for Intrusion Detection Datasets.

Jang, Woohyuk; Kim, Hyunmin; Seo, Hyungbin; Kim, Minsong; Yoon, Myungkeun.

Sensors (Basel) ; 23(13)2023 Jul 02.

Article in English | MEDLINE | ID: mdl-37447954

ABSTRACT

A large volume of security events, generally collected by distributed monitoring sensors, overwhelms human analysts at security operations centers and causes alert fatigue. Machine learning is expected to mitigate this problem by automatically distinguishing true alerts (attacks) from falsely reported ones. Machine learning models must first be trained on correctly labeled datasets, but the labeling process itself requires considerable human resources. In this paper, we present a new selective sampling scheme for efficient data labeling via unsupervised clustering. The new scheme transforms the byte sequence of an event into a fixed-size vector through content-defined chunking and feature hashing. Then, a clustering algorithm is applied to the vectors, and only a few samples from each cluster are selected for manual labeling. The experimental results demonstrate that the new scheme can select only 2% of the data for labeling without degrading the F1-score of the machine learning model. The experiments use two datasets: a private dataset from a real security operations center and, for experimental reproducibility, a public dataset from the Internet.

Subject(s)

Algorithms , Internet , Humans , Reproducibility of Results , Cluster Analysis , Machine Learning
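
The vectorization and selection steps described in the SELID abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the CRC32-based boundary test, the window and mask values, the vector dimension, and the per-cluster sample count are all assumptions chosen for illustration, and the clustering step itself is left to whatever algorithm produces the cluster ids.

```python
import hashlib
import random
import zlib
from collections import defaultdict
from typing import List


def cdc_chunks(data: bytes, window: int = 4, mask: int = 0x0F) -> List[bytes]:
    """Content-defined chunking: cut where a hash of the trailing window matches a pattern.

    The window size and mask are illustrative assumptions, not values from the paper.
    """
    chunks, start = [], 0
    for i in range(len(data)):
        if i + 1 - start < window:
            continue  # current chunk is still shorter than the hashing window
        h = zlib.crc32(data[i + 1 - window:i + 1])
        if h & mask == 0:  # boundary depends only on local content
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks


def event_to_vector(event: bytes, dim: int = 64) -> List[int]:
    """Feature hashing: each chunk increments one bucket of a fixed-size vector."""
    vec = [0] * dim
    for chunk in cdc_chunks(event):
        digest = hashlib.md5(chunk).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1
    return vec


def select_for_labeling(cluster_ids: List[int], per_cluster: int = 2, seed: int = 0) -> List[int]:
    """Pick at most `per_cluster` sample indices from each cluster for manual labeling."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        groups[cid].append(idx)
    picked = []
    for cid in sorted(groups):
        members = groups[cid]
        picked.extend(rng.sample(members, min(per_cluster, len(members))))
    return sorted(picked)
```

Because chunk boundaries depend only on local content, two events that share a payload fragment tend to share chunks, and therefore vector buckets, even when the fragment sits at different offsets.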

FILM: Filtering and Machine Learning for Malware Detection in Edge Computing.

Kim, Young Jae; Park, Chan-Hyeok; Yoon, MyungKeun.

Sensors (Basel) ; 22(6)2022 Mar 10.

Article in English | MEDLINE | ID: mdl-35336322

ABSTRACT

Machine learning with static-analysis features extracted from malware files has been adopted to detect malware variants, which is desirable for resource-constrained edge computing and Internet-of-Things devices with sensors; however, the learned model suffers from a misclassification problem because some malicious files have almost the same static-analysis features as benign ones. In this paper, we present a new detection method for edge computing that can use existing machine learning models to classify a suspicious file as benign, malicious, or unpredictable, whereas existing models make only a binary decision of benign or malicious. The new method can use any existing deep learning model developed for malware detection after appending a simple sigmoid function to the model. By interpreting the sigmoid value during the testing phase, the new method determines whether the model is confident about its prediction; it can therefore accept only high-accuracy predictions, which reduces incorrect predictions on ambiguous static-analysis features. Through experiments on real malware datasets, we confirm that the new scheme significantly enhances the accuracy, precision, and recall of existing deep learning models. For example, the accuracy is enhanced from 0.96 to 0.99, while the files classified as unpredictable can be entrusted to the cloud for further dynamic or human analysis.

Subject(s)

Machine Learning , Humans
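
The three-way decision described in the FILM abstract can be sketched as a pair of thresholds on the sigmoid output. This is only an illustration under stated assumptions: the abstract does not give threshold values or say which end of the sigmoid range means malicious, so `low`, `high`, the function names, and the convention that values near 1 are malicious are all hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def classify(logit: float, low: float = 0.1, high: float = 0.9) -> str:
    """Map a model's raw output to benign, malicious, or unpredictable.

    Assumption: sigmoid values near 1 mean malicious and values near 0 mean
    benign; anything between the two thresholds is treated as not confident.
    """
    p = sigmoid(logit)
    if p >= high:
        return "malicious"
    if p <= low:
        return "benign"
    return "unpredictable"  # defer to cloud-side dynamic or human analysis
```

For example, `classify(5.0)` falls above the upper threshold and `classify(0.3)` falls between the two, so only the first yields a decision on the edge device. Widening the gap between `low` and `high` trades coverage for accuracy on the files that are decided locally.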

Advanced Network Sampling with Heterogeneous Multiple Chains.

Lee, Jaekoo; Yoon, MyungKeun; Noh, Song.

Sensors (Basel) ; 21(5)2021 Mar 09.

Article in English | MEDLINE | ID: mdl-33803175

ABSTRACT

Recently, researchers have paid attention to many types of huge networks, such as the Internet of Things, sensor networks, social networks, and traffic networks, because of their untapped potential for theoretical and practical outcomes. A major obstacle in studying large-scale networks is that their size tends to increase exponentially. In addition, access to large network databases is limited for security or physical connection reasons. In this paper, we propose a novel sampling method that works effectively for large-scale networks. The proposed approach builds multiple heterogeneous Markov chains by adjusting random-walk traits on the given network to explore the target space efficiently. Compared with previous random-walk-based sampling approaches, this approach provides better unbiased sampling results with reduced asymptotic variance within a reasonable execution time. We perform various experiments on large network databases ranging from synthetic to real-world applications. The results demonstrate that the proposed method outperforms existing network sampling methods.
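
The idea of running several random-walk chains with different traits can be illustrated with a generic sketch. This is not the method proposed in the paper: it uses a standard Metropolis-Hastings random walk, which targets a uniform distribution over the nodes of a connected undirected graph, and varies only one trait per chain, a hypothetical `laziness` probability of staying in place. The graph representation and all names are assumptions.

```python
import random
from typing import Dict, List

Graph = Dict[int, List[int]]  # undirected adjacency lists, no isolated nodes


def mh_walk(graph: Graph, start: int, steps: int, laziness: float,
            rng: random.Random) -> List[int]:
    """Lazy Metropolis-Hastings random walk; returns the node visited at each step."""
    node, visited = start, []
    for _ in range(steps):
        if rng.random() >= laziness:  # otherwise stay put for this step
            cand = rng.choice(graph[node])
            # Accept with min(1, deg(node) / deg(cand)) to remove degree bias.
            if rng.random() < min(1.0, len(graph[node]) / len(graph[cand])):
                node = cand
        visited.append(node)
    return visited


def multi_chain_sample(graph: Graph, steps: int, laziness_values: List[float],
                       seed: int = 0) -> List[int]:
    """Run one chain per laziness value and pool the visited nodes."""
    rng = random.Random(seed)
    nodes = sorted(graph)
    samples: List[int] = []
    for laziness in laziness_values:
        start = rng.choice(nodes)  # each chain starts from its own random node
        samples.extend(mh_walk(graph, start, steps, laziness, rng))
    return samples
```

A call such as `multi_chain_sample(graph, 1000, [0.0, 0.3, 0.6])` pools three chains that mix at different speeds; in practice an initial burn-in portion of each chain would usually be discarded before using the samples.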
