Open Access
Applied Sciences (Switzerland), volume 15, issue 6, pages 2978

Deep Defense Against Mal-Doc: Utilizing Transformer and SeqGAN for Detecting and Classifying Document Type Malware

Gati Lother Martin 1
Sang-Min Lee 2
Jonghyun Kim 3
Young‐Seob Jeong 4
Ah Reum Kang 5
Jiyoung Woo 1
Publication type: Journal Article
Publication date: 2025-03-10
Scimago: Q2
SJR: 0.508
CiteScore: 5.3
Impact factor: 2.5
ISSN: 2076-3417
Abstract

Non-executable malware is increasingly prevalent and poses a major threat to users, including large public institutions and corporations. Although malware detection has been studied extensively, document-type malware has received far less attention than executable files. The proposed model addresses this gap by detecting document-type malware and classifying it into families using the script code embedded in documents: the tags used to write the document and the script languages used to execute malicious functions. This script code offers insight into how the malware was constructed and how it operates on the victim's system. Our approach also leverages language models. First, we develop MalCode2Vec, which learns associations between source codes and represents them as numeric vectors. Next, we design a Transformer-based model for document malware detection and family classification. Detection is performed at both the stream level and the file level. To address class imbalance among malware families, we use a generative adversarial network to generate additional malware samples. Our experiments focus on the Hangul (Korean) word processor, a tool notably used by North Korea in targeting the South Korean government.
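
The abstract states that detection runs at both the stream and file levels but does not say how stream-level results are combined into a file-level decision. The following is a minimal sketch of one plausible rule, assuming each stream of a document already has a malware probability and that a file is flagged when its highest-scoring stream reaches a threshold. The function name `file_level_verdict`, the 0.5 threshold, and the example stream names and scores are illustrative assumptions, not the authors' method or data.

```python
from typing import Dict


def file_level_verdict(stream_scores: Dict[str, float], threshold: float = 0.5) -> bool:
    """Return True if the file should be flagged as malicious.

    Assumption (not from the paper): a file is malicious when at least one
    of its streams has a malware probability at or above `threshold`.
    """
    if not stream_scores:
        # No streams were scored, so there is no evidence to flag the file.
        return False
    return max(stream_scores.values()) >= threshold


# Hypothetical per-stream probabilities for one document.
example_scores = {
    "BodyText/Section0": 0.03,
    "Scripts/DefaultJScript": 0.91,
    "BinData/BIN0001.ps": 0.12,
}

if __name__ == "__main__":
    print(file_level_verdict(example_scores))
```

A max rule reflects the intuition that a single malicious stream is enough to make the whole document dangerous; averaging the scores instead would let many benign streams mask one malicious stream.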

