PhD dissertation in Information Systems: Enhancing webshell detection with deep learning-powered methods (Vietnamese title: Nghiên cứu một số phương pháp học sâu trong phát hiện đoạn mã độc)

The increasing prevalence of webshell attacks poses a significant threat to web application security, necessitating the development of robust detection mechanisms. The dissertation clearl

Webshell Dataset Collection

Webshell datasets are essential for detecting webshells, as they provide the foundational resources needed for training and evaluating detection models. These datasets contain a variety of samples that help researchers understand the characteristics and behaviors of webshell malware. By analyzing them, security professionals can gain insight into the tactics, techniques, and procedures (TTPs) that threat actors use to compromise web servers and evade detection. This analysis reveals common patterns and trends, from which meaningful features can be extracted to improve detection algorithms and techniques.

Webshell datasets are also essential for developing and evaluating signature-based detection mechanisms, which identify known webshell artifacts through predefined patterns and heuristics. By compiling samples of webshell scripts, encoded payloads, and command-and-control (C2) communications, researchers can identify features that distinguish webshell activity from legitimate server operations. These features form the basis for detection rules written in signature languages such as YARA, allowing webshell instances to be detected automatically from their distinct behavioral signatures and code structures. In addition, webshell datasets enable detection rules to be validated and refined through real-world testing in live web server environments, helping researchers evaluate how effective signature-based detection is against webshell threats.

Other primary functions of webshell datasets are to facilitate the training and

Webshell detection approaches that leverage machine learning (ML) and statistical models rely on labeled datasets of benign and malicious webshell instances. These datasets are crucial for training algorithms to differentiate between normal web server activity and malicious behavior that indicates the presence of a webshell. They also make it possible to evaluate detection models against a variety of webshell variants, improving their generalizability and effectiveness across diverse environments. Furthermore, webshell datasets allow researchers to measure the performance of detection algorithms using key metrics such as accuracy, precision, recall, and false positive rate, offering insight into the strengths and limitations of different detection strategies and aiding in their refinement.
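The four metrics named above follow directly from the counts of a binary confusion matrix. The sketch below is illustrative only: the class name, function name, and the example counts are assumptions, not values from the dissertation.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    tp: int  # webshells correctly flagged as malicious
    fp: int  # benign files wrongly flagged as malicious
    tn: int  # benign files correctly passed
    fn: int  # webshells that were missed


def detection_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Compute accuracy, precision, recall and false positive rate."""

    def ratio(num: int, den: int) -> float:
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    total = c.tp + c.fp + c.tn + c.fn
    return {
        "accuracy": ratio(c.tp + c.tn, total),
        "precision": ratio(c.tp, c.tp + c.fp),
        "recall": ratio(c.tp, c.tp + c.fn),
        "false_positive_rate": ratio(c.fp, c.fp + c.tn),
    }


# Hypothetical evaluation run: 100 webshells and 100 benign files.
example = detection_metrics(ConfusionCounts(tp=90, fp=5, tn=95, fn=10))
```

Reporting the false positive rate alongside recall matters in this setting because a detector that flags many benign server files is hard to deploy even when it misses few webshells.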

Table 1.2: Some widely used webshell datasets

Dataset        Year   Samples       Languages                        Noisy samples
Tennc          2022   About 2,000   PHP, JSP, ASP, ASPX, PY, etc.    Numerous
JohnTroony     2020   119           PHP                              Few
WebSHArk 1.0   2015   809           PHP, ASP, ASP.NET, JSP, …        Few

Several earlier datasets are available for malicious webshell research, as shown in Table 1.2. Developed in 2013, Tennc⁴ is a pioneering dataset of malicious webshells. Its continuous updates and growth are driven by an open-source webshell project, making it a significant resource. However, the difficulty of maintaining quality control over samples contributed from around the world has resulted in a considerable number of noisy entries, including benign files mixed in with the webshells and duplicate webshell samples.

⁴ Tennc, a webshell open-source project, https://github.com/tennc/webshell

JohnTroony⁵, first produced in 2014, originally featured 132 malicious webshells. The dataset has since been updated to remove duplicates and refine sample names, resulting in a current total of 119 high-quality malicious webshell examples. Another early dataset, WebSHArk 1.0, was introduced in 2015.

The collection contains few noisy samples, but WebSHArk 1.0 has become outdated. Current research on malicious webshells emphasizes the analysis of newly discovered samples. A significant addition is the Cyc1e183 PHP-Webshell-Dataset, a large and recently released dataset. It includes malicious samples that were carefully selected from twelve earlier datasets, including the Tennc and JohnTroony PHP webshell datasets. Additionally, a new malicious webshell family dataset, known as MWF, has been introduced in recent studies.

The MWF article presents a collection of 1,359 PHP malicious webshell samples, categorized into 78 distinct families and 22 outlier groups. Each sample is accompanied by metadata detailing the dynamic function calls observed during execution in a sandbox environment. Explicit family labels are also provided to support multi-class classification of the samples.

The primary limitation of existing webshell datasets is their focus on popular variants, neglecting the newer obfuscated webshells commonly used in APT attacks. To address this gap, we developed a comprehensive webshell dataset sourced from reputable platforms, including a diverse array of webshells from highly rated repositories on GitHub. However, initial testing revealed the presence of benign files, which made the dataset noisy and could compromise model training accuracy. To improve dataset quality, we carried out a meticulous cleaning process with the support of cybersecurity experts. We also incorporated samples from real-world APT attacks, which, although limited in number, are critical for understanding advanced evasion techniques and tactics and ultimately for refining detection models.

⁵ JohnTroony PHP Webshell, https://github.com/JohnTroony/php-webshells

⁶ Cyc1e183 PHP Webshell Dataset, https://github.com/Cyc1e183/PHP-Webshell-Dataset

1.3 RELATED WORKS

… to detect new types of webshells more effectively and accurately.

Research statistics on webshell detection reveal that, of 41 studies, about 42% (17 studies) employed machine learning techniques, about 29% (12 studies) used deep learning, and the remaining 29% (12 studies) introduced alternative solutions.

Besides the research applying AI algorithms to improve the ability to detect new types of webshells, there are still some studies using other approaches [104, 50, 85, 19], which also have notable points.

Cubismo enhances malware detection tools by using counterfactual execution to explore all potential execution paths and reveal hidden code within PHP scripts. The process starts by normalizing the original scripts, eliminating unnecessary lines, comments, and whitespace, before subjecting them to counterfactual execution. During this analysis, exceptions, runtime errors, and nested predicates are ignored in order to reach new paths, while recursive deobfuscation handles multi-layer encryption. Each newly explored path and dynamic construct generates new program files, which are executed in a sandbox environment for potential detection. These program files are then used as inputs to existing malware detection tools, and a script is identified as a webshell if any of its derived input files is flagged.

PHPMalScan, a malware detection tool derived from Cubismo, specializes in identifying webshells by using counterfactual execution within a sandbox and a virtual environment to analyze all potential code execution paths. The tool enables collaboration between sandboxes by sharing the artifacts each needs for its analysis. PHP functions are classified as either safe or potentially harmful, and two key metrics, the maliciousness score (MS) and the potentially malicious functions ratio (PMFR), are introduced to evaluate the presence and severity of malicious functions. Whether a script is judged malicious depends on established thresholds for both MS and PMFR.
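A threshold decision over two such scores might be sketched as follows. This is not PHPMalScan's implementation: the exact definitions of MS and PMFR are not given here, so the sketch assumes MS is a sum of per-function risk weights and PMFR is the share of calls that hit potentially harmful functions. The weights, thresholds, regex-based call extraction (a stand-in for sandboxed execution), and the rule that both thresholds must be met are all hypothetical.

```python
import re

# Hypothetical risk weights for potentially harmful PHP functions.
RISK_WEIGHTS = {
    "eval": 3.0,
    "system": 3.0,
    "shell_exec": 3.0,
    "passthru": 3.0,
    "base64_decode": 1.0,
}

# Language constructs that look like calls but are not functions.
CONTROL_KEYWORDS = {"if", "elseif", "while", "for", "foreach", "switch",
                    "catch", "function", "array", "isset", "empty"}

# Static approximation: an identifier followed by an opening parenthesis.
CALL_RE = re.compile(r"\b([A-Za-z_][A-Za-z0-9_]*)\s*\(")


def score_script(php_source: str) -> tuple[float, float]:
    """Return (MS, PMFR) under the assumed definitions above."""
    calls = [name.lower() for name in CALL_RE.findall(php_source)
             if name.lower() not in CONTROL_KEYWORDS]
    risky = [name for name in calls if name in RISK_WEIGHTS]
    ms = sum(RISK_WEIGHTS[name] for name in risky)
    pmfr = len(risky) / len(calls) if calls else 0.0
    return ms, pmfr


def is_malicious(php_source: str,
                 ms_threshold: float = 3.0,
                 pmfr_threshold: float = 0.5) -> bool:
    """Flag a script only when both assumed thresholds are reached."""
    ms, pmfr = score_script(php_source)
    return ms >= ms_threshold and pmfr >= pmfr_threshold
```

Combining an absolute score with a ratio is one way to avoid flagging large benign applications that happen to contain a single risky call among many safe ones.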

IEEE Xplore, ACM Digital Library, SpringerLink, Wiley Online Library, ScienceDirect

The authors propose a webshell classification tool that employs similarity analysis to identify derivatives of known webshells. The method decodes PHP scripts to uncover any obfuscation layers, extracts user-defined function names and bodies through PHP script tokenization, and fuzzy-hashes the scripts for storage alongside the source files. Similarity matrices of function names, function bodies, and file hashes are then generated. Visualization tools, including heatmaps and dendrograms, are used to illustrate the similarities among the analyzed samples.
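The similarity-matrix step can be illustrated with a small sketch. It is not the authors' tool: a regex replaces real PHP tokenization, and `difflib.SequenceMatcher` from the Python standard library is swapped in for fuzzy hashing, purely to keep the example self-contained. All function names are hypothetical.

```python
import re
from difflib import SequenceMatcher

# Simplified stand-in for tokenization: named function definitions only.
FUNC_DEF_RE = re.compile(r"\bfunction\s+([A-Za-z_][A-Za-z0-9_]*)\s*\(")


def function_names(php_source: str) -> set[str]:
    """Collect user-defined function names (PHP names are case-insensitive)."""
    return {name.lower() for name in FUNC_DEF_RE.findall(php_source)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Set overlap in [0, 1]; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_matrices(samples: dict[str, str]):
    """Return (labels, name_matrix, text_matrix) for the given scripts."""
    labels = list(samples)
    name_sets = {k: function_names(v) for k, v in samples.items()}
    name_matrix = [[jaccard(name_sets[a], name_sets[b]) for b in labels]
                   for a in labels]
    # Whole-file text similarity, used here instead of a fuzzy hash.
    text_matrix = [[SequenceMatcher(None, samples[a], samples[b]).ratio()
                    for b in labels] for a in labels]
    return labels, name_matrix, text_matrix
```

Either matrix could then be passed to a plotting library to draw the heatmaps or dendrograms mentioned above; derivatives of a known webshell would appear as clusters of high pairwise similarity.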

The authors of [56] propose search software for detecting ASP webshells, which looks for key features such as calls to specific ASP components, suspicious statements, and custom encryption functions. The tool alerts administrators by reporting potentially harmful files for further investigation. It takes a semi-automatic detection approach tailored to the nuances of the ASP programming language.
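A feature search of this kind can be sketched as a pattern scan that reports files for manual review. The pattern list below is an assumption for illustration; the features actually used in [56] are not enumerated here, and the detection of custom encryption functions is omitted.

```python
import re

# Hypothetical feature list: components and statements often abused by
# classic ASP webshells. Not the feature set of the tool described in [56].
SUSPICIOUS_PATTERNS = {
    "component: WScript.Shell":
        re.compile(r"wscript\.shell", re.IGNORECASE),
    "component: Scripting.FileSystemObject":
        re.compile(r"scripting\.filesystemobject", re.IGNORECASE),
    "component: ADODB.Stream":
        re.compile(r"adodb\.stream", re.IGNORECASE),
    "statement: eval/execute on request input":
        re.compile(r"\b(?:executeglobal|execute|eval)\s*\(?\s*request\b",
                   re.IGNORECASE),
}


def scan_asp(source: str) -> list[str]:
    """Return the labels of all suspicious features found in one file."""
    return [label for label, pattern in SUSPICIOUS_PATTERNS.items()
            if pattern.search(source)]


def report_suspicious(files: dict[str, str]) -> dict[str, list[str]]:
    """Map each flagged path to its findings; clean files are left out."""
    report = {}
    for path, source in files.items():
        findings = scan_asp(source)
        if findings:
            report[path] = findings
    return report
```

The output is a list of candidates rather than a verdict, which matches the semi-automatic workflow: an administrator still inspects each reported file.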

The authors in [89] describe a sandbox-based environment for the static and dynamic analysis of PHP scripts, intended to detect webshells semi-automatically. The environment first deobfuscates and normalizes PHP shells and then performs a statistical analysis. Malicious scripts are indexed, stored in a database, and executed safely within the sandbox for behavioral analysis. This execution allows calls to exploitable functions, such as command execution, information disclosure, and filesystem operations, to be reported along with their origins. However, the proposed environment struggles to execute PHP files that incorporate other scripts, such as JavaScript and CSS. Moreover, the deobfuscation process is limited to eval() and preg_replace() calls with explicit string arguments, which restricts its detection ability to specific kinds of webshells.
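The limitation to explicit string arguments can be made concrete with a small sketch. It is not the environment from [89]: it handles only eval() with a literal string, ignores preg_replace(), and does not process escape sequences or variable interpolation inside the string. The function name and the round limit are hypothetical.

```python
import re

# Matches eval('...'); or eval("..."); where the argument is one literal
# string. Escaped quotes inside the literal are not handled in this sketch.
EVAL_LITERAL_RE = re.compile(r"""eval\s*\(\s*(['"])(.*?)\1\s*\)\s*;""",
                             re.DOTALL)


def unwrap_literal_evals(php_source: str, max_rounds: int = 10) -> str:
    """Repeatedly replace eval('<code>'); with <code> until nothing changes."""
    current = php_source
    for _ in range(max_rounds):
        unwrapped = EVAL_LITERAL_RE.sub(lambda m: m.group(2), current)
        if unwrapped == current:
            break
        current = unwrapped
    return current


# Two literal layers are peeled off one round at a time.
layered = """eval('eval("phpinfo();");');"""
peeled = unwrap_literal_evals(layered)

# A variable argument is left untouched, which is exactly the blind spot:
# the payload is only known at run time.
dynamic = "eval($payload);"
untouched = unwrap_literal_evals(dynamic)
```

Webshells that assemble the argument from variables, request parameters, or nested decoding calls pass through such a deobfuscator unchanged, which is why coverage stays limited to specific kinds of samples.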

GuruWS is a hybrid platform developed to detect malicious webshells and vulnerabilities in web applications. It has two main modules: grMalwrScanner, which focuses on webshell detection, and grVulnScanner, which identifies vulnerabilities in web applications. For simple PHP scripts, grMalwrScanner uses taint analysis to pinpoint risky function calls and their arguments, while for more complex scripts it applies YARA rules to identify malicious code. It also incorporates a statistical analysis that ranks files according to five key statistical features, enhancing its detection capabilities.
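Ranking files by a statistical feature can be sketched as follows. The five features used by GuruWS are not listed here, so this example assumes character-level Shannon entropy as a single stand-in feature; it is not GuruWS code, and the function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def rank_by_entropy(files: dict[str, str]) -> list[tuple[str, float]]:
    """Sort files from most to least random-looking content."""
    scored = ((path, shannon_entropy(source)) for path, source in files.items())
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Encoded or packed payloads tend to have a flatter character distribution than hand-written code, so they rise to the top of such a ranking. A single feature is a weak signal on its own, which is presumably why several are combined in practice.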

…tures is optionally provided for users.

1.3.2 AI-Powered Source Code Analysis Approaches

Source code analysis methods provide a holistic view of webshell code [39, 109,

cross-validation with DS2

Title	Enhancing webshell detection with deep learning-powered methods
Author	Le Viet Ha
Supervisors	Associate Professor Nguyen Ngoc Hoa, Doctor Phung Van On
Institution	Vietnam National University Hanoi University of Engineering and Technology
Major	Information Systems
Type	PhD dissertation
Year of publication	2024
City	Ha Noi

Pages	139
File size	59.24 MB