Ngoc-Hoa NGUYEN, VNU University of Engineering and Technology, Hanoi, Vietnam, hoa.nguyen@vnu.edu.vn
Viet-Ha LE, Office of the Government, Hanoi, Vietnam, levietha@chinhphu.vn
Van-On PHUNG, Office of the Government, Hanoi, Vietnam, phungvanon@gmail.com
Phuong-Hanh DU, VNU University of Engineering and Technology, Hanoi, Vietnam, hanhdp@vnu.edu.vn
ABSTRACT
The most efficient way of securing Web applications is to search for and eliminate the threats therein (both malware and vulnerabilities). When the source code of a Web application is available, Web security can be improved by detecting malicious code, such as Web shells. In this paper, we propose a model using a deep learning approach to detect and identify malicious code inside PHP source files. Our method relies on (i) pattern matching techniques that apply Yara rules to build malicious and benign datasets, (ii) converting the PHP source code to a numerical sequence of PHP opcodes, and (iii) applying a Convolutional Neural Network model to predict whether a PHP file embeds malicious code such as a webshell. We validate our approach with different webshell collections from reliable sources published on GitHub. The experimental results show that the proposed method achieves an accuracy of 99.02% with a 0.85% false positive rate.
CCS CONCEPTS
• Security and privacy → Malware and its mitigation; Web
application security;
KEYWORDS
pattern matching, yara rules, deep learning, CNN, opcode sequence,
webshell detection
ACM Reference Format:
Ngoc-Hoa NGUYEN, Viet-Ha LE, Van-On PHUNG, and Phuong-Hanh DU. 2019. Toward a Deep Learning Approach for Detecting PHP Webshell. In The Tenth International Symposium on Information and Communication Technology (SoICT 2019), December 4–6, 2019, Hanoi - Ha Long Bay, Viet Nam. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3368926.3369733
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
SoICT 2019, December 4–6, 2019, Hanoi - Ha Long Bay, Viet Nam
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-7245-9/19/12 $15.00
https://doi.org/10.1145/3368926.3369733
1 INTRODUCTION
Nowadays, web applications are everywhere, and Web security has received a lot of attention from both researchers and managers.
According to Internet Live Stats, as of September 2019 [13], an enormous number of websites are attacked every day (from 25,000 hacked websites per day in April 2015 to 61,750 hacked websites per day in September 2019), with a direct and significant impact on nearly 4.43 billion Internet users. When the source code of a Web application is available, Web security can be improved by detecting malicious code such as a webshell, which is defined as a script installed in the source code of a web application to enable remote administration of the infected server. A webshell can be injected into the system directly by attackers or through a malicious plugin installed by the webmaster [7]. An essential feature of a webshell is command execution. With this unsophisticated weapon, an attacker can do many things, such as manipulating files and folders, listing active processes, or using it as a backdoor. These webshells can be extremely small, but their capabilities are diverse and highly adaptable. Besides that, they sometimes use encoding methods such as base64 or gzinflate to encode themselves for self-defense. Because everything is wrapped in a single file, this type of webshell can be injected quickly.
A webshell can also be installed as another kind of backdoor. For example, CryptoPHP is a hidden backdoor found by FoxIT¹. CryptoPHP is a threat that compromises Web servers on a large scale through the installation of unoriginal WordPress, Joomla, and Drupal themes and plug-ins. CryptoPHP has the following activities and properties: (i) it integrates with popular content management systems such as Drupal, WordPress, and Joomla, for instance by injecting hyperlinks into post content (for Black Hat SEO purposes²); (ii) it uses asymmetric cryptography³ (RSA public key⁴) for communication between the victim's server and the C&C server; (iii) if the C&C servers or domains are taken down repeatedly, it can encrypt its data and send it via email to specific mail addresses; (iv) it supports manual control via HTTP requests; (v) it automatically updates the list of C&C servers; and (vi) it can receive a new version from the C&C server and update itself.
Several popular approaches for securing web applications [3] have been investigated, for example safe web development [11], intrusion detection and protection systems, code review, and web application firewalls. Masood et al. [10] presented an efficient way of securing web applications by searching for and eliminating the vulnerabilities therein. In fact, an attack campaign is temporary. However, attackers might upload their backdoors to the system for persistence, so that they can come back to interact with it and steal information at any time without exploiting any vulnerability. This situation leads to serious consequences [12], since these backdoors are Web shells that allow attackers to remotely control files and databases and to execute commands. They are not only flexible but also countless.
1 https://fox-it.com/
2 https://en.wikipedia.org/wiki/Search_engine_optimization
3 https://en.wikipedia.org/wiki/Public-key_cryptography
4 https://en.wikipedia.org/wiki/RSA_(cryptosystem)
Figure 1: Command execution in webshell b374k
Indeed, the main root causes are web developers' lack of secure programming awareness and their limited ability to discover both malicious web shells and web vulnerabilities. These issues in web application security create a demand for a solution that allows web developers and security penetration testers to detect security-related problems in the easiest way.
In this research, we propose a model using a deep learning approach to detect and identify malicious code inside PHP source files. We focus on web applications written in PHP because of its popularity among server-side programming languages: about 79.0% of all websites (as of September 2019) [2]. Our method relies on three techniques. First, we use pattern matching techniques that apply Yara rules to build malicious and benign datasets. Second, we convert the PHP source code to a numerical sequence of PHP opcodes. Finally, we apply a Convolutional Neural Network model to predict whether a PHP file embeds malicious code such as a webshell.
The remainder of this paper is organized as follows. In Section 2, we review some basic principles and related work in malware detection and deep learning techniques. In Section 3, we describe our proposed solution, which combines the three techniques mentioned above to detect malicious code in web application source code. In Section 4, we present our experimental results, evaluate our work, and provide benchmarks. The last section is dedicated to conclusions and future work.
2 PRELIMINARIES AND RELATED WORK
2.1 Yara and Pattern Matching
A Yara ruleset is a list of rules that define strings, called patterns, and a logical condition over the matches and non-matches of those patterns that determines the final result.
rule example
{
    meta:
        description = "An example of YARA rule"
    strings:
        $a = "hello"
        $b = { 01 23 45 67 89 AB CD EF }
        $c = /md5: [0-9a-zA-Z]{32}/
    condition:
        $a or $b and $c
}
Each Yara rule consists of three components:
• Meta: stores metadata such as the description, creation date, references, etc.
• Strings: defines the patterns to be matched by the rule. Three types of strings can be defined: text, hexadecimal, and regular expression.
• Condition: a Boolean expression that determines how the results of matching each string are combined.
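To make the interplay of strings and condition concrete, the following is a minimal sketch in plain Python that mimics the example rule. It is not how the YARA engine is implemented; the names `TEXT_A`, `HEX_B`, `REGEX_C`, and `rule_matches` are hypothetical, and the condition is read with `and` binding tighter than `or`.

```python
import re

# Stand-ins for the three string types of the example rule.
TEXT_A = b"hello"                                  # $a: text string
HEX_B = bytes.fromhex("0123456789abcdef")          # $b: hexadecimal string
REGEX_C = re.compile(rb"md5: [0-9a-zA-Z]{32}")     # $c: regular expression


def rule_matches(data: bytes) -> bool:
    """Evaluate the condition `$a or $b and $c` against raw bytes."""
    a = TEXT_A in data
    b = HEX_B in data
    c = REGEX_C.search(data) is not None
    # Read as: a or (b and c).
    return a or (b and c)


print(rule_matches(b"<?php echo 'hello'; ?>"))
```

The text pattern alone is enough for a match, whereas the hexadecimal pattern only counts when the regular expression also matches.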
Based on algorithm design, there are three types of pattern matching technique: prefix-based matching, suffix-based matching, and factor-based matching [15].
• Prefix-based matching: the matching process starts searching from the beginning of the sliding window. All characters in the text are read and checked; on a mismatch, the window moves to the next character. This is the simplest strategy, but the number of comparisons is large, so execution is slow.
• Suffix-based matching: the matching process starts searching from the end of the sliding window. It does not read all consecutive characters in the text; instead, it skips characters based on the result of comparing the characters at the end of the sliding window. This is the basis for reducing the number of comparisons and the complexity of the algorithm.
• Factor-based matching: the matching process starts searching from the end of the sliding window. It does not read all consecutive characters in the text, but compares selected characters to predict the set of factors (substrings) of the original pattern.
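As an illustration of the suffix-based strategy, the sketch below uses the Boyer-Moore-Horspool algorithm, a well-known member of that family. It is chosen here only as an example; the paper does not state which concrete algorithm its matcher uses, and the function name is hypothetical.

```python
def horspool_search(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    if m > n:
        return -1
    # Shift table: distance from the last occurrence of each character
    # (final position excluded) to the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    pos = 0
    while pos <= n - m:
        # Compare from the end of the window towards its start.
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            return pos
        # Skip ahead based on the character under the window's last cell.
        pos += shift.get(text[pos + m - 1], m)
    return -1


print(horspool_search("<?php eval($_POST['cmd']); ?>", "eval("))  # 6
```

Characters that do not occur in the pattern let the window jump by the full pattern length, which is where the reduction in comparisons comes from.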
All algorithms have two stages: pre-processing and searching. The pre-processing stage builds the Yara ruleset, while the searching stage applies the pattern matching techniques (using regular expressions) based on the Yara ruleset. Figure 2 illustrates the flowchart of PHP webshell detection using Yara and the pattern matching approach.
Figure 2: Webshell detection process using Yara.
When malicious code is detected by pattern matching with a Yara ruleset, the pattern matching technique only determines the resource usage and computation speed. The accuracy is determined entirely by the Yara ruleset. In this study, we use the latest Yara ruleset for detecting PHP webshells from GitHub⁵ in conjunction with the rules we collected during our research.
An opcode (operation code) is the portion of a machine language instruction that specifies the operation to be performed⁶. For a program written in PHP or any other language, we can extract the list of opcodes used [4]. When we compare statistics of the opcodes generated from benign files and from malicious files, a large difference between them is apparent. This can be explained by the fact that malicious files tend to use opcodes that perform data theft, affect the system to gain control, or check the system environment in order to hide their behavior, while benign files rarely do these things. For example, malicious files often call functions related to virtualization to check whether they are being executed in a virtualized environment; if so, they do not execute their malicious behavior in order to avoid detection. Because of this, machine learning approaches often use the sequence of opcodes to predict whether a file is malicious or benign. According to the statistics of Bragen [5], the 15 opcodes most used by malicious files are shown in Table 1.
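The kind of opcode statistics described above can be sketched as follows. This is only an illustration on toy data: the helper names are hypothetical, and the opcode names in the sample lists are placeholders rather than measurements from the paper.

```python
from collections import Counter


def opcode_frequencies(sequences):
    """Relative frequency of each opcode over a list of opcode sequences."""
    counts = Counter(op for seq in sequences for op in seq)
    total = sum(counts.values())
    return {op: n / total for op, n in counts.items()}


def exclusive_opcodes(malicious, benign):
    """Opcodes seen in malicious sequences but never in benign ones,
    ordered from most to least frequent."""
    seen_benign = {op for seq in benign for op in seq}
    counts = Counter(
        op for seq in malicious for op in seq if op not in seen_benign
    )
    return [op for op, _ in counts.most_common()]


# Toy opcode sequences; real ones would come from an opcode dump.
benign = [["ECHO", "ASSIGN", "RETURN"], ["ASSIGN", "ECHO", "RETURN"]]
malicious = [["ASSIGN", "INCLUDE_OR_EVAL", "RETURN"],
             ["INCLUDE_OR_EVAL", "DO_FCALL"]]

print(exclusive_opcodes(malicious, benign))
```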
Table 1: Top 15 opcodes used exclusively by malware
Opcode      Description
stosq       Store String
syscall     Fast System Call
setno       Set Byte on Condition - not overflow (OF=0)
cvtsd2si    Convert Scalar Double-FP Value to DW Integer
movmskpd    Extract Packed Double-FP Sign Mask
prefetcht1  Prefetch Data Into Caches
fprem       Partial Remainder (for compatibility with i8087 and i287)
cmpsq       Compare String Operands
lodsq       Load String
scasq       Scan String
cvtss2si    Convert Scalar Single-FP Value to DW Integer
fnsave      Store x87 FPU State
orpd        Bitwise Logical OR of Double-FP Values
fxsave      Save x87 FPU, MMX, XMM, and MXCSR State
movmskps    Extract Packed Single-FP Sign Mask
5 https://github.com/Yara-Rules/rules
6 https://en.wikipedia.org/wiki/Opcode
2.2 Webshell Detection by Deep Learning Approaches
Deep learning is the application of deep neural networks to machine learning. Deep learning is capable of approximating complex functions by learning deep nonlinear network structures to solve complex problems. A neural network consists of an input layer, followed by a list of hidden layers, and ends with an output layer. The output of each layer becomes the input of the next layer. Unlike traditional machine learning techniques, deep learning is trained by learning features rather than by task-specific algorithms. Different layers of a neural network automatically learn features at different levels. Therefore, it can work on raw data without any need for manual feature engineering. One preeminent advantage of deep learning is that more training data lets it learn more robust features. One of the most famous examples of deep learning techniques is the Convolutional Neural Network (CNN), in which the local receptive field from the previous layer is processed in a sliding window. Because of these advantages, more and more research applies deep learning techniques to malware detection [8].
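The sliding-window idea behind a convolution layer can be sketched in a few lines of plain Python. This is a simplified illustration with a single hand-picked filter and hypothetical names; a trained CNN learns many such filters and usually works on learned vector representations rather than raw ids.

```python
def conv1d(sequence, kernel, bias=0):
    """Slide `kernel` over `sequence` and return one weighted sum per window."""
    width = len(kernel)
    outputs = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]     # local receptive field
        outputs.append(sum(x * w for x, w in zip(window, kernel)) + bias)
    return outputs


# A toy sequence of opcode ids and one filter of size 3.
opcode_ids = [4, 4, 9, 1, 9, 1]
print(conv1d(opcode_ids, kernel=[1, 0, -1]))  # [-5, 3, 0, 0]
```

Each output value depends only on a short window of the input, so the same filter can respond to a local opcode pattern wherever it occurs in the file.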
2.3 Related Work
In this section, we briefly introduce related research and solutions regarding malware, including some popular Web shell detectors and malware detection based on deep learning.
Web Shell Detector⁷ is a Python tool that helps detect Web shells. This product is quite a good solution, as it is easy to use, develop, and customize. However, the Web shell pattern set in the Web Shell Detector database is not up to date and is also very limited.
PHP Malware Finder⁸ is also an effective tool for scanning Web shells with its YARA-based rules. Because the detection mechanism of this product is quite simple, the false positive rate in its final results is somewhat high. Also, PHP Malware Finder can only flag suspicious files; it does not show whether a file is precisely a Web shell or a dangerous file.
VirusTotal⁹ is an online service that analyzes suspicious files, including viruses, worms, and Web application malware, by running them through tens of anti-virus products. However, it accepts at most one file at a time. This restriction makes scanning time-consuming, so it is hardly suitable for validating a whole Web project.
Liu and Wang [9] proposed a malware detection system using deep learning on API calls. Using Cuckoo Sandbox¹⁰, a solution for automated malicious code analysis, they extracted the API call sequences of malicious programs and then used several deep learning techniques (GRU, BGRU, LSTM, SimpleRNN, and BLSTM) to train and test on a dataset of 21,378 samples. The results show that BLSTM has the best performance for malware detection, reaching an accuracy of 97.85%.
7 http://www.shelldetector.com/
8 https://github.com/nbs-system/php-malware-finder
9 https://virustotal.com
10 https://cuckoosandbox.org/
Özkan et al. [18] used image processing techniques to detect malicious code. Observing that some image-based techniques had been developed together with feature extraction and classifiers to discover the relations between malware binaries represented as grayscale images, they applied CNN features to the malware detection problem. With a dataset of 12,279 malware samples, the classifier has an accuracy of 85%, which increases to 99% with a dataset containing 9,339 samples.
Another study that uses a CNN to detect webshells, by Tian et al. [14], focuses on the HTTP requests of web services. They use the word2vec technique on HTTP requests segmented into HTTP symbol words, so that each HTTP request can be represented as a matrix. Given the matrix representation, they apply a CNN to extract features and train a model for detecting malicious webshells.
Using 35 different features extracted from packet flows, Yeo et al. [17] proposed an automated malware detection method based on a convolutional neural network (CNN), multi-layer perceptron (MLP), support vector machine (SVM), and random forest (RF). With a netflow capture from Stratosphere IPS, in which nine different public malware packets and normal-state packets were converted to flow data, they show more than 85% accuracy, precision, and recall for all classes using CNN and RF.
3 PHP WEBSHELL DETECTION BY DEEP LEARNING METHOD
In this section, we propose a solution that combines pattern matching for malicious code detection using a Yara ruleset with a CNN-based approach.
3.1 Approach
Each technique has its own advantages and disadvantages. For the pattern matching method, the true positive rate for known types of malicious code is extremely high, but the method has difficulty predicting unknown types of malicious code. As for the CNN deep learning method, the prediction model only achieves high accuracy if we build a correct training dataset. While researching and developing the training dataset, we had difficulty finding malicious code samples. For the dataset of benign PHP files, it is not difficult to collect files from the source code of popular PHP content management systems (CMS) such as WordPress, Joomla, or Drupal. As for the malicious code dataset, although we tried to use the most reliable data sources, most of the datasets we found also contained clean files, which led to inaccurate training results. With thousands of files in each dataset, it is difficult to remove clean files manually. Therefore, our idea is to use a malware detection method based on the Yara ruleset to standardize the dataset of malicious code files used as training input for the CNN model.
Accordingly, our method to detect PHP webshells is based on three stages:
• Convert the PHP source files to numerical sequences of PHP opcodes. These opcode sequences are used to remove duplicate PHP files from both the benign and webshell datasets.
• Build clean datasets of both benign and webshell samples for both the training and testing sets. For this, pattern matching with Yara rules is used to generate the clean datasets.
• Build the Convolutional Neural Network model with the clean datasets. This model is used to predict whether a PHP file embeds malicious code such as a webshell.
We detail the last two stages in the next subsections.
3.2 Building Clean Datasets
Our idea for building clean datasets is shown in Figure 3:
Figure 3: Building Clean Datasets using Yara rulesets
As shown in Figure 3, we first eliminate the fake malicious files in the webshell datasets by applying the Yara-based webshell detection, with its Yara rulesets, to the raw datasets. After that, the training dataset, consisting of benign PHP files and malicious PHP files, is translated into opcode sequences by an Opcode Converter. This converter also eliminates duplicate opcode sequences during conversion so that they do not affect the accuracy of the trained model. Completely different PHP files can produce duplicate opcode sequences because an opcode sequence is a sequence of numbers representing the list of operation codes that are called; if two files happen to call the same list of opcodes, their opcode sequences are identical. At the end of the process, we have clean benign and webshell datasets for both the training and testing phases.
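The cleaning process of this subsection can be sketched as a single pass that filters by the rule-based detector and drops repeated opcode sequences. This is a minimal sketch under assumptions: `matches_rules` is a hypothetical callable standing in for the Yara-based detector, opcode sequences are assumed to be lists of integers, and the function names are not taken from the authors' implementation.

```python
import hashlib


def sequence_key(opcodes):
    """Stable fingerprint of an opcode sequence (a list of ints)."""
    joined = ",".join(str(op) for op in opcodes)
    return hashlib.sha256(joined.encode("ascii")).hexdigest()


def build_clean_dataset(samples, matches_rules, label):
    """Filter and deduplicate one raw dataset.

    samples:       iterable of (file_name, opcode_sequence) pairs
    matches_rules: hypothetical callable, True if the rule-based detector
                   flags the file as a webshell
    label:         "webshell" or "benign"
    """
    keep_if_flagged = (label == "webshell")
    seen, clean = set(), []
    for name, opcodes in samples:
        if matches_rules(name) != keep_if_flagged:
            continue                  # fake webshell or flagged benign file
        key = sequence_key(opcodes)
        if key in seen:
            continue                  # same opcode sequence already kept
        seen.add(key)
        clean.append((name, opcodes))
    return clean


raw = [("a.php", [1, 2, 3]), ("b.php", [1, 2, 3]), ("c.php", [4, 5])]
flagged = {"a.php", "b.php"}
print(build_clean_dataset(raw, lambda name: name in flagged, "webshell"))
```

Hashing the sequence keeps the set of seen items small even when individual opcode sequences are tens of thousands of entries long.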
3.3 Detecting Webshell by CNN Model
We use a CNN model to implement our deep learning approach for detecting webshells in PHP source files. The following figure illustrates our training and testing model:
Figure 4: Webshell Detection Using CNN Model
The training input data consists of the benign and webshell datasets. As mentioned in the previous section, because the webshell dataset is collected from many different sources, it also contains benign files, so we use the pattern matching method with the Yara ruleset to make the webshell data as accurate as possible. The PHP files of the standardized dataset are then converted into opcode sequences, which become the training data for the CNN. The trained model is used to classify the test datasets into benign files and webshells.
In our research, the Convolutional Neural Network is applied to malware detection using opcodes as its raw input data, as shown in Figure 5. The opcodes go through a sequence of convolution layers at different levels. At the end, an output layer outputs the probabilities of the file being malware or benign. By providing a large amount of training data, we can expect the neural network to learn specific patterns of the malware family as well as powerful invariant features that distinguish malware from benign files over time.
Figure 5: CNN Architecture for Detecting Webshell
4 EXPERIMENT AND EVALUATION
Based on the proposed method, we built and implemented our solution, named WSDetector, in Python. The experiments were performed on a computer with 2 x Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz (45MB cache, 18 cores per CPU), 128GB of main memory, CentOS Linux release 7.4.1708, and Python release 2.7. For the deep learning platform, we use tensorflow v.1.14.0, scikit-learn v.0.20.4, scipy v.1.2.2, numpy v.1.16.5, and yara-python v.3.10.0.
4.1 Evaluation Metrics
To evaluate the ability of PHP webshell detection tools, we use two different test sets: one contains malicious PHP web shells and the other is a collection of clean, benign PHP code. We count the True Positive (TP), False Positive (FP), False Negative (FN), and True Negative (TN) samples, then compute the Accuracy, Precision, Recall (sensitivity, or True Positive Rate, TPR), F1-score, and False Positive Rate (FPR) with the following formulas [15]:
\[ \text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}, \qquad \text{Precision} = \frac{TP}{TP + FP} \]
\[ \text{Recall} = \frac{TP}{TP + FN}, \qquad \text{FPR} = \frac{FP}{FP + TN} \]
\[ \text{F1-score} = \frac{2\,TP}{2\,TP + FP + FN} \]
4.2 Datasets
To build the webshell dataset, we collected a wide range of webshells from reliable, highly starred sources on GitHub¹¹, for a total of 4,171 PHP webshell files. For the benign dataset, different PHP frameworks, forums, and content management systems were collected from their official sites. They include Laravel, WordPress, Joomla, phpMyAdmin, phpPgAdmin, and phpBB¹². After removing non-PHP files, the benign set contains 7,400 files in total. In order to train and validate our proposed method of detecting PHP webshells, we divided the benign and webshell datasets into two parts with a ratio of 7:3 as a rule of thumb [16]. Based on the distribution of files in the dataset sources, the training/testing split is made by whole sources. The following table shows our final datasets for training and testing.
11 /tennc/webshell, /bartblaze/PHP-backdoors, /b374k/b374k, /JohnTroony/php-webshells, /xl7dev/WebShell, /BlackArch/webshells, /fuzzdb-project/fuzzdb, /LuciferoO/webshell-collector, /ysrc/webshell-sample, /webshellpub/awsome-webshell, /PHP-WebShell-Bypass-WAF, /linuxsec/indoxploit-shell
12 Github: https://github.com/laravel/laravel; https://github.com/WordPress/WordPress; https://github.com/joomla/joomla-cms; https://github.com/phpmyadmin/phpmyadmin; https://github.com/phppgadmin/phppgadmin; https://github.com/phpbb/
Table 2: Raw Benign and Webshell Datasets
                  Training Set  Testing Set
Benign Dataset    5,802         1,598
Webshell Dataset  3,684         487
To convert the PHP files into opcodes, we use the vld extension of the PHP engine¹³ to implement the opcode converter. Using this tool, the raw datasets are first cleaned by removing duplicate opcode sequences. The resulting non-duplicate datasets are shown in Table 3:
Table 3: Non-duplicate Benign and Webshell Datasets
                  Training Set  Testing Set
Benign Dataset    4,875         1,182
Webshell Dataset  1,049         275
4.3 Experiment Results
a Pattern Matching based Detection
From the non-duplicate webshell training dataset, we generated 3,242 Yara rules based on our previous research [15]. We used these rules to detect PHP webshells in the non-duplicate testing datasets (both benign and webshell). Table 4 shows the resulting confusion matrix.
Table 4: Confusion matrix of PHP webshell detection by using Yara rules
                    Real Benign  Real Webshell
Predicted Benign    1,180        25
Predicted Webshell  2            250
The performance of our Yara-based PHP webshell detector is shown in the following table:
Table 5: Accuracy, Precision, F1-score and FPR of Yara based testing (%)
          Accuracy  Precision  Recall  F1-Score  FPR
Benign    98.15     97.93      99.83   98.87     9.09
Webshell  98.15     99.21      90.91   94.88     0.17
These results are clearly better than the results published in [15], which reported a detection F1-score of 92%.
13 See more about VLD at: https://github.com/derickr/vld
b CNN based Detection
As in the previous experiment, we used the non-duplicate training datasets to train the CNN model with the tensorflow engine. The maximum length of the opcode sequences in our datasets is 44,335. Therefore, we pad all training opcode sequences with the value 0 (meaning no operation) so that they have the same maximum length.
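A minimal sketch of this encoding and padding step is given below. Two details are assumptions made for illustration, since the text does not specify them: opcode ids start at 1 so that 0 stays free for padding, and the zeros are appended on the right. The function names are hypothetical.

```python
def build_vocabulary(sequences):
    """Map each opcode name to a positive integer; 0 is reserved for padding."""
    vocab = {}
    for seq in sequences:
        for op in seq:
            vocab.setdefault(op, len(vocab) + 1)
    return vocab


def encode_and_pad(sequences, vocab, max_len):
    """Turn opcode-name sequences into equal-length lists of integers."""
    padded = []
    for seq in sequences:
        ids = [vocab[op] for op in seq][:max_len]  # truncate overly long files
        ids += [0] * (max_len - len(ids))          # right-pad with 0
        padded.append(ids)
    return padded


sequences = [["ECHO", "RETURN"], ["ASSIGN", "ECHO", "RETURN"]]
vocab = build_vocabulary(sequences)
print(encode_and_pad(sequences, vocab, max_len=4))  # [[1, 2, 0, 0], [3, 1, 2, 0]]
```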
The CNN is configured with a maximum of 100,000 inputs, 128 outputs, and three 1D-convolution layers. After several training runs, we finally chose filter sizes of 3, 4, and 5 for the three layers respectively; dropout of 0.5; softmax as the activation function; adam as the optimizer; a learning_rate of 0.08; categorical_crossentropy as the loss function; a validation set of 10%; a batch_size of 96; and 32 epochs.
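For readers unfamiliar with two of the settings named above, the sketch below spells out what a softmax activation and a categorical cross-entropy loss compute for a single sample. It is written in plain Python for illustration and is not the tensorflow implementation; the two-class ordering [benign, webshell] is an assumption.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)                        # subtract the max for stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot, probs, eps=1e-12):
    """Loss for one sample: -sum(y_i * log(p_i))."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(one_hot, probs))


probs = softmax([2.0, 0.0])                   # scores for [benign, webshell]
loss = categorical_crossentropy([1, 0], probs)
print(probs, loss)
```

The loss is small when the probability assigned to the true class is close to 1 and grows as that probability shrinks.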
Using this CNN model, we evaluated the test datasets and obtained the results shown in the confusion matrix in Table 6 and the scores in Table 7.
Table 6: Confusion matrix of PHP webshell detection by using CNN model
                    Real Benign  Real Webshell
Predicted Benign    1,157        10
Predicted Webshell  25           265
Table 7: Accuracy, Precision, F1-score and FPR of CNN based testing (%)
          Accuracy  Precision  Recall  F1-Score  FPR
Benign    97.60     99.14      97.88   98.51     3.64
Webshell  97.60     91.38      96.36   93.81     2.12
c Yara and CNN based Detection
In the above experiment, it is clear that the CNN-based detection model has a lower F1-Score and accuracy, and a higher FPR for the webshell class, than the Yara-based detection model. However, after reviewing the misdetected samples, we found that these samples merely contain very common functions, such as the fread or file_put_contents functions, which read or write the contents of a file. We also looked in detail at the raw datasets and found that they contain some wrong samples: files in the webshell datasets that are actually benign, and vice versa for the benign datasets.
Therefore, we decided to combine the Yara-based detector with the CNN-based model. First, we clean all non-duplicate datasets using the Yara-based detector in order to remove the fake webshells. We then use the cleaned datasets to train and test the CNN-based webshell detection model. The cleaned datasets are summarized in Table 8:
Table 8: Cleaned Benign and Webshell Datasets
                  Training Set  Testing Set
Benign Dataset    4,871         1,180
Webshell Dataset  618           250
Using these datasets, we trained the CNN model with the same settings as in the previous experiment. After that, the cleaned test datasets were used to evaluate this model. The results are shown in the confusion matrix in Table 9 and the scores in Table 10.
Table 9: Confusion matrix of PHP webshell detection by using Yara+CNN model
                    Real Benign  Real Webshell
Predicted Benign    1,170        4
Predicted Webshell  10           246
Table 10: Accuracy, Precision, F1-score and FPR of Yara+CNN based Detection (%)
              Accuracy  Precision  Recall  F1-Score  FPR
Benign        99.02     99.66      99.15   99.41     1.60
Webshell      99.02     96.09      98.40   97.23     0.85
Micro Avg     99.02     99.66      99.15   99.41     0.85
Macro Avg     99.02     97.88      98.78   98.32     1.22
Weighted Avg  99.02     99.04      99.02   99.03     1.47
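As a worked check, the Webshell row of Table 10 can be recomputed from the confusion matrix in Table 9 with the formulas of Section 4.1. The sketch below takes "webshell" as the positive class, so TP = 246, FP = 10, FN = 4, and TN = 1,170; the function name is hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return the five scores of Section 4.1 as percentages."""
    return {
        "accuracy": 100.0 * (tp + tn) / (tp + fp + fn + tn),
        "precision": 100.0 * tp / (tp + fp),
        "recall": 100.0 * tp / (tp + fn),
        "f1": 100.0 * 2 * tp / (2 * tp + fp + fn),
        "fpr": 100.0 * fp / (fp + tn),
    }


# Table 9 with "webshell" as the positive class.
webshell = binary_metrics(tp=246, fp=10, fn=4, tn=1170)
for name, value in webshell.items():
    print(f"{name}: {value:.2f}")
# accuracy: 99.02, precision: 96.09, recall: 98.40, f1: 97.23, fpr: 0.85
```

Swapping the roles of the two classes (TP = 1,170, FP = 4, FN = 10, TN = 246) gives the Benign row in the same way.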
We also performed k-fold cross validation for this model. The following table shows the results we obtained with k=5 folds.
Table 11: 5-fold Cross Validation Results (%)
         Accuracy  F1-Score  FPR
Fold 1   99.40     98.20     0.00
Fold 2   99.23     97.67     0.92
Fold 3   98.63     95.98     0.93
Fold 4   98.97     96.88     0.00
Fold 5   98.63     96.02     1.14
Average  98.97     96.95     0.60
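The mechanics of k-fold cross validation can be sketched as follows. The split below uses contiguous folds without shuffling or stratification, which is an assumption made for simplicity; the text does not say how the folds were drawn, and the function name is hypothetical. The last lines recompute the average accuracy reported in Table 11.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [
        n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)
    ]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Every sample is used for testing exactly once across the k folds.
for train, test in k_fold_indices(10, 5):
    print(test)

# Average of the per-fold accuracies reported in Table 11.
accuracy = [99.40, 99.23, 98.63, 98.97, 98.63]
print(round(sum(accuracy) / len(accuracy), 2))  # 98.97
```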
These results confirm that the CNN model built from the datasets cleaned by the Yara detector is better overall than either the Yara-only or the CNN-only approach.
4.4 Evaluation
To assess the performance of our PHP webshell detection method based on Yara and CNN, we compare our results with other approaches. For the time being, we have not run the evaluation on the same machine and the same datasets (moreover, the source code and clean datasets of the other approaches are not published). Thus, we show only the results of each approach as published by its authors. Note that we use only the accuracy, F1-score, and FPR metrics in this comparison. The following table compares our Yara+CNN model with other approaches:
Table 12: Comparison of different webshell detection approaches (%)
                        Accuracy  F1-Score  FPR
php-malware-finder [1]  94.23     96.46     4.49
Word2Vec+CNN [14]       98.6      98.6      -
RF-GBDT [6]             99.16     99.09     0.68
GuruWS [15]             85.56     92.00     0.00
Our Yara+CNN            99.02     99.41     0.85
5 CONCLUSION
More and more unknown malicious code is being developed for installation into the source code of the web applications that dominate cyberspace, which is a huge challenge for cybersecurity researchers today. In this paper, we proposed an efficient model that combines a deep learning approach with pattern matching based on Yara rules and with the conversion of PHP source code to a numerical sequence of opcodes, in order to predict whether a PHP file embeds malicious code. Our experimental results show that the proposed method (Yara+CNN) achieves an accuracy of 99.02% with a 0.85% false positive rate.
For future work, we aim to extend our method to other programming languages such as ASP, ASP.NET, Java, and Python. Besides that, we will study and test other deep learning methods such as LSTM, compare them with the current method, and then select the most accurate predictive model.
ACKNOWLEDGMENTS
This work is partially supported by the national research project No. KC.01.19/16-20, granted by the Ministry of Science and Technology of Vietnam (MOST).
REFERENCES
[1] 2019. PHP Malware Finder. https://github.com/nbs-system/php-malware-finder
[2] 2019. Web Technology Surveys. http://w3techs.com/technologies/overview/programming_language/all/
[3] G. P. Bherde and M. A. Pund. 2016. Recent attack prevention techniques in web service applications. In 2016 International Conference on Automatic Control and Dynamic Optimization Techniques (ICACDOT). 1174–1180. https://doi.org/10.1109/ICACDOT.2016.7877771
[4] Daniel Bilar. 2007. Malware detection through opcode sequence analysis using machine learning. Int. J. Electronic Security and Digital Forensics 1 (2007). https://doi.org/10.1504/IJESDF.2007.016865
[5] Simen Rune Bragen. 2015. Opcodes as predictor for malware. VDP::Mathematics and natural science: 400::Information and communication science: 420::Security and vulnerability: 424 1 (01 2015).
[6] H. Cui, D. Huang, Y. Fang, L. Liu, and C. Huang. 2018. Webshell Detection Based on Random Forest–Gradient Boosting Decision Tree Algorithm. In 2018 IEEE Third International Conference on Data Science in Cyberspace (DSC). 153–160. https://doi.org/10.1109/DSC.2018.00030
[7] Z. Cui, F. Xue, X. Cai, Y. Cao, G. Wang, and J. Chen. 2018. Detection of Malicious Code Variants Based on Deep Learning. IEEE Transactions on Industrial Informatics 14, 7 (July 2018), 3187–3196. https://doi.org/10.1109/TII.2018.2822680
[8] Z. Cui, F. Xue, X. Cai, Y. Cao, G. Wang, and J. Chen. 2018. Detection of Malicious Code Variants Based on Deep Learning. IEEE Transactions on Industrial Informatics 14, 7 (July 2018), 3187–3196. https://doi.org/10.1109/TII.2018.2822680
[9] Y. Liu and Y. Wang. 2019. A Robust Malware Detection System Using Deep Learning on API Calls. In 2019 IEEE 3rd Information Technology, Networking, Electronic and Automation Control Conference (ITNEC). 1456–1460. https://doi.org/10.1109/ITNEC.2019.8728992
[10] A. Masood and J. Java. 2015. Static analysis for web service security - Tools & techniques for a secure development life cycle. In 2015 IEEE International Symposium on Technologies for Homeland Security (HST). 1–6. https://doi.org/10.1109/THS.2015.7225337
[11] M. Mazumder and T. Braje. 2016. Safe Client/Server Web Development with Haskell. In 2016 IEEE Cybersecurity Development (SecDev). 150–150. https://doi.org/10.1109/SecDev.2016.040
[12] M. A. E. Mohd Efendi, Z. Ibrahim, M. N. Ahmad Zawawi, F. Abdul Rahim, N. A. Muhamad Pahri, and A. Ismail. 2019. A Survey on Deception Techniques for Securing Web Application. In 2019 IEEE 5th Intl Conference on Big Data Security on Cloud (BigDataSecurity), IEEE Intl Conference on High Performance and Smart Computing, (HPSC) and IEEE Intl Conference on Intelligent Data and Security (IDS). 328–331. https://doi.org/10.1109/BigDataSecurity-HPSC-IDS.2019.00066
[13] Internet Live Stats. 2019. Internet Usage and Social Media Statistics. https://www.internetlivestats.com/
[14] Yifan Tian, Jiabao Wang, Zhenji Zhou, and Shengli Zhou. 2017. CNN-Webshell: Malicious Web Shell Detection with Convolutional Neural Network. In Proceedings of the 2017 VI International Conference on Network, Communication and Computing (ICNCC 2017). ACM, New York, NY, USA, 75–79. https://doi.org/10.1145/3171592.3171593
[15] Le V-G, Nguyen H-T, Pham D-P, Phung V-O, and N-H Nguyen. 2019. GuruWS: A Hybrid Platform for Detecting Malicious Web Shells and Web Application Vulnerabilities. Transactions on Computational Collective Intelligence, Springer, Berlin, Heidelberg 11370, XXXII (01 2019), 184–208.
[16] Le V-G, Nguyen H-T, Lu L-D, and N-H Nguyen. 2016. A solution for automatically malicious Web shell and Web application vulnerability detection. In Computational Collective Intelligence, Volume 9875 of the series Lecture Notes in Computer Science. Springer-Verlag, Berlin, Heidelberg, 367–378.
[17] M. Yeo, Y. Koo, Y. Yoon, T. Hwang, J. Ryu, J. Song, and C. Park. 2018. Flow-based malware detection using convolutional neural network. In 2018 International Conference on Information Networking (ICOIN). 910–913. https://doi.org/10.1109/ICOIN.2018.8343255
[18] K. Özkan, Ş. Işık, and Y. Kartal. 2018. Evaluation of convolutional neural network features for malware detection. In 2018 6th International Symposium on Digital Forensic and Security (ISDFS). 1–5. https://doi.org/10.1109/ISDFS.2018.8355390