Toward a Deep Learning Approach for Detecting PHP Webshell



Ngoc-Hoa NGUYEN
VNU University of Engineering and Technology, Hanoi, Vietnam
hoa.nguyen@vnu.edu.vn

Viet-Ha LE
Office of the Government, Hanoi, Vietnam
levietha@chinhphu.vn

Van-On PHUNG
Office of the Government, Hanoi, Vietnam
phungvanon@gmail.com

Phuong-Hanh DU
VNU University of Engineering and Technology, Hanoi, Vietnam
hanhdp@vnu.edu.vn

ABSTRACT

The most efficient way of securing Web applications is to search for and eliminate the threats therein (both malware and vulnerabilities). When the source code of a Web application is available, Web security can be improved by detecting malicious code, such as Web shells. In this paper, we propose a model using a deep learning approach to detect and identify malicious code inside PHP source files. Our method relies on (i) pattern matching techniques, applying Yara rules to build malicious and benign datasets, (ii) converting the PHP source code to a numerical sequence of PHP opcodes, and (iii) applying a Convolutional Neural Network model to predict whether a PHP file embeds malicious code such as a webshell. We validate our approach with different webshell collections from reliable sources published on GitHub. The experimental results show that the proposed method achieves an accuracy of 99.02% with a 0.85% false positive rate.

CCS CONCEPTS

• Security and privacy → Malware and its mitigation; Web

application security;

KEYWORDS

pattern matching, yara rules, deep learning, CNN, opcode sequence,

webshell detection

ACM Reference Format:

Ngoc-Hoa NGUYEN, Viet-Ha LE, Van-On PHUNG, and Phuong-Hanh DU. 2019. Toward a Deep Learning Approach for Detecting PHP Webshell. In The Tenth International Symposium on Information and Communication Technology (SoICT 2019), December 4–6, 2019, Hanoi - Ha Long Bay, Viet Nam. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3368926.3369733

1 INTRODUCTION

Nowadays, web applications are everywhere, and Web security has received a lot of attention from both researchers and managers.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

SoICT 2019, December 4–6, 2019, Hanoi - Ha Long Bay, Viet Nam

ACM ISBN 978-1-4503-7245-9/19/12 $15.00

https://doi.org/10.1145/3368926.3369733

According to Internet Live Stats, as of September 2019 [13], an enormous number of websites are attacked every day (from 25,000 hacked websites per day in April 2015 to 61,750 hacked websites per day in September 2019), with a direct and significant impact on nearly 4.43 billion Internet users. When the source code of a Web application is available, Web security can be improved by detecting malicious code such as a webshell, which is defined as a script installed in the source code of a web application to enable remote administration of the infected server. A webshell can be injected into the system directly by attackers or through a malicious plugin installed by the webmaster [7]. An essential feature of a webshell is command execution. With this unsophisticated weapon, an attacker can do many things, such as manipulating files and folders, listing active processes, or using it as a backdoor. These webshells are often extremely small, but their capabilities are diverse and highly adaptable. In addition, they sometimes encode themselves with methods such as base64 or gzinflate for self-defense. Because everything is wrapped in a single file, this type of webshell can be injected quickly.

A webshell can also be installed as another kind of backdoor. For example, CryptoPHP is a hidden backdoor found by Fox-IT (footnote 1). CryptoPHP is a threat that compromises Web servers on a large scale through the installation of unofficial WordPress, Joomla, and Drupal themes and plug-ins. CryptoPHP has the following activities and properties: (i) it integrates with popular content management systems such as Drupal, WordPress and Joomla, for example by injecting hyperlinks into post content (for Black Hat SEO purposes, footnote 2); (ii) it uses asymmetric cryptography (footnote 3) with RSA public keys (footnote 4) for communication between the victim's server and the command and control (C&C) server; (iii) if C&C servers or domains are taken down repeatedly, it can encrypt its data and send it by email to specific addresses; (iv) it supports manual control via HTTP requests; (v) it automatically updates its list of C&C servers; and (vi) it can receive a new version from the C&C server and update itself.

Several popular approaches for securing web applications [3] have been investigated, for example safe web development [11], intrusion detection and protection systems, code review, and web application firewalls. Masood and Java [10] presented an efficient way of securing web applications by searching for and eliminating the vulnerabilities therein. In fact, an attack campaign

1 https://fox-it.com/

2 https://en.wikipedia.org/wiki/Search_engine_optimization

3 https://en.wikipedia.org/wiki/Public-key_cryptography

4 https://en.wikipedia.org/wiki/RSA_(cryptosystem)


is temporary. However, attackers might upload backdoors to the system for persistence, so that they can come back to interact with it and steal information at any time without exploiting any vulnerability. This situation leads to serious consequences [12], since these backdoors are Web shells that allow attackers to remotely control files and databases and to execute commands. They are not only flexible but also countless.

Figure 1: Command execution in webshell b374k

Indeed, the main root causes are web developers' lack of secure programming awareness and their limited ability to discover both malicious web shells and web vulnerabilities. These issues in web application security create a demand for a solution that allows web developers and security penetration testers to detect security-related problems in the easiest way.

In this research, we propose a model using a deep learning approach to detect and identify malicious code inside PHP source files. We focus on web applications written in PHP because of its popularity among server-side programming languages: it is used by about 79.0% of all websites (as of September 2019) [2]. Our method relies on three techniques. First, we use pattern matching techniques, applying Yara rules to build malicious and benign datasets. Second, we convert the PHP source code to a numerical sequence of PHP opcodes. Finally, we apply a Convolutional Neural Network model to predict whether a PHP file embeds malicious code such as a webshell.

The rest of this paper is organized as follows. In Section 2, we review some basic principles and related work in malware detection and deep learning techniques. In Section 3, we describe our proposed solution, a combination of the three techniques mentioned above, for detecting malicious code in web application source code. In Section 4, we present our experimental results, evaluate our work and provide benchmarks. The last section is dedicated to conclusions and future work.

2 PRELIMINARIES AND RELATED WORK

2.1 Yara and Pattern Matching

A Yara ruleset is a list of rules. Each rule defines strings, called patterns, and a logical condition over the matches and non-matches of those patterns that determines the final result.

rule example
{
    meta:
        description = "An example of YARA rule"
    strings:
        $a = "hello"                        // text string
        $b = { 01 23 45 67 89 ab cd ef }    // hexadecimal string
        $c = /md5: [0-9a-zA-Z]{32}/         // regular expression
    condition:
        $a or $b and $c
}

Each Yara rule consists of 3 components:

• Meta: stores metadata such as the description, creation date, references, etc.

• Strings: defines the patterns to be matched by the rule. Three types of strings can be defined: text, hexadecimal and regular expression.

• Condition: a Boolean expression that determines how the pattern matching results of the individual strings are combined.
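To make the three string types and the condition concrete, the sketch below emulates the example rule in plain Python instead of calling the Yara engine. The function name `match_example_rule` and the constants are hypothetical, and the regular expression assumes that the intended quantifier is `{32}` (an MD5 digest printed in hexadecimal has 32 characters).

```python
import re

# Patterns mirroring the three string types of the example rule.
TEXT_A = b"hello"                                # text string ($a)
HEX_B = bytes.fromhex("0123456789abcdef")        # hexadecimal string ($b)
REGEX_C = re.compile(rb"md5: [0-9a-zA-Z]{32}")   # regular expression ($c), assumed {32}


def match_example_rule(data: bytes) -> bool:
    """Return True when `data` satisfies the condition `$a or $b and $c`."""
    a = TEXT_A in data
    b = HEX_B in data
    c = REGEX_C.search(data) is not None
    # `and` binds tighter than `or`, so the condition reads $a or ($b and $c).
    return a or (b and c)
```

A file matches either because it contains the text pattern, or because it contains both the byte sequence and the MD5-like string; the byte sequence alone is not enough.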

Based on algorithm design, there are three types of pattern matching techniques: prefix-based matching, suffix-based matching and factor-based matching [15].

• Prefix-based matching: the matching process starts from the beginning of the sliding window. Every character in the text is read and checked; on a mismatch, the window moves to the next character. This is the simplest strategy, but the number of comparisons is large, so execution is slow.

• Suffix-based matching: the matching process starts from the end of the sliding window. It does not read all consecutive characters of the text; instead, it skips characters based on the result of comparing the characters at the end of the sliding window. This is the basis for reducing the number of comparisons and the complexity of the algorithm.

• Factor-based matching: the matching process starts from the end of the sliding window. It does not read all consecutive characters of the text, but compares each special character to predict the set of factors (substrings) of the original pattern.
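The difference between the first two strategies can be sketched in a few lines of Python. The code below contrasts a naive left-to-right scan with the Boyer-Moore-Horspool algorithm, used here as one well-known representative of suffix-based matching. It is an illustration only, not a description of how the Yara engine searches, and the function names are chosen for this example.

```python
def naive_search(text: str, pattern: str) -> int:
    """Prefix-based scan: test every window from left to right, shifting by one."""
    n, m = len(text), len(pattern)
    for start in range(n - m + 1):
        if text[start:start + m] == pattern:
            return start
    return -1


def horspool_search(text: str, pattern: str) -> int:
    """Suffix-based scan (Boyer-Moore-Horspool): compare from the window's end."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    # Shift table: distance from the last occurrence of each character
    # (ignoring the final pattern position) to the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    start = 0
    while start <= n - m:
        j = m - 1
        while j >= 0 and text[start + j] == pattern[j]:
            j -= 1
        if j < 0:
            return start
        # Skip ahead according to the text character under the window's last cell.
        start += shift.get(text[start + m - 1], m)
    return -1
```

Both functions return the index of the first occurrence, or -1 when the pattern is absent. The suffix-based version can skip up to `len(pattern)` characters per step, which is where the reduction in comparisons comes from.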

All algorithms have two stages: pre-processing and searching. The pre-processing stage builds the Yara ruleset, while the searching stage applies the pattern matching techniques (using regular expressions) based on that ruleset. Figure 2 illustrates the flowchart of PHP webshell detection using Yara and the pattern matching approach.


Figure 2: Webshell detection process using Yara.

In the problem of detecting malicious code by pattern matching with a Yara ruleset, the pattern matching technique only determines resource usage and computation speed. Accuracy is determined entirely by the Yara ruleset. In this study, we use the latest Yara ruleset for detecting PHP webshells from GitHub (footnote 5) in conjunction with the one we collected during our research.

An opcode (operation code) is the portion of a machine language instruction that specifies the operation to be performed (footnote 6). For a program written in PHP or any other language, we can extract the list of opcodes used [4]. When compiling statistics over the lists of opcodes generated from benign and malicious files, we can easily see a large difference between them. This can be explained by the fact that malicious files tend to perform data theft, tamper with the system to gain control, or check the system environment to hide their behavior, while benign files rarely do these things. For example, malicious files often use virtualization-related functions to check whether they are being executed in a virtualized environment; if so, they do not execute their malicious behavior, to avoid detection. For this reason, machine learning approaches often use the opcode sequence to predict whether a file is malicious or benign. According to the statistical results of Bragen [5], the 15 opcodes most used by malicious files are shown in Table 1.
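The kind of statistic described above can be sketched with a frequency count. The following Python snippet is a minimal illustration, assuming each file has already been reduced to a list of opcode names; the function names are hypothetical, and any input passed to them in an example is toy data rather than measurements from the paper or from [5].

```python
from collections import Counter


def opcode_frequencies(sequences):
    """Count how often each opcode appears across a list of opcode sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq)
    return counts


def exclusive_opcodes(malicious, benign, top=15):
    """Opcodes seen in malicious sequences but never in benign ones, most frequent first."""
    mal = opcode_frequencies(malicious)
    ben = opcode_frequencies(benign)
    only_mal = Counter({op: n for op, n in mal.items() if op not in ben})
    return [op for op, _ in only_mal.most_common(top)]
```

Calling `exclusive_opcodes(malicious_sequences, benign_sequences)` on two labelled collections yields a ranking of the same shape as Table 1.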

2.2 Webshell Detection by Deep Learning Approaches

Deep learning is the application of deep neural networks to machine learning. Deep learning is capable of approximating complex functions by learning deep nonlinear network structures to solve complex problems. A neural network consists of an input layer, followed by a list of hidden layers, and ends with an output layer. The output of one layer becomes the input of the next layer. Unlike traditional machine learning techniques, deep learning is trained by learning features rather than task-specific algorithms. Different layers of a neural network automatically learn features at different levels. Therefore, it can work on raw data without any need for manual

5 https://github.com/Yara-Rules/rules

6 https://en.wikipedia.org/wiki/Opcode

Table 1: Top 15 opcodes used exclusively by malware

Opcode       Description
stosq        Store String
syscall      Fast System Call
setno        Set Byte on Condition - not overflow (OF=0)
cvtsd2si     Convert Scalar Double-FP Value to DW Integer
movmskpd     Extract Packed Double-FP Sign Mask
prefetcht1   Prefetch Data Into Caches
fprem        Partial Remainder (for compatibility with i8087 and i287)
cmpsq        Compare String Operands
lodsq        Load String
scasq        Scan String
cvtss2si     Convert Scalar Single-FP Value to DW Integer
fnsave       Store x87 FPU State
orpd         Bitwise Logical OR of Double-FP Values
fxsave       Save x87 FPU, MMX, XMM, and MXCSR State
movmskps     Extract Packed Single-FP Sign Mask

feature engineering. One preeminent advantage of deep learning is that more training data lets it learn more robust features. One of the most famous examples of a deep learning technique is the Convolutional Neural Network (CNN), in which the local receptive field from the previous layer is handled in a sliding window. Because of these advantages, more and more research applies deep learning techniques to malware detection [8].
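The sliding-window behaviour of a convolution layer can be illustrated without any deep learning library. The sketch below, in plain Python, applies one filter to a short numeric sequence, then a ReLU and a global max pooling step. The helper names and any kernel values are illustrative assumptions; a real CNN learns its kernel weights during training.

```python
def conv1d(sequence, kernel):
    """Slide `kernel` over `sequence` (no padding, stride 1) and return the dot products."""
    k = len(kernel)
    return [
        sum(w * x for w, x in zip(kernel, sequence[i:i + k]))
        for i in range(len(sequence) - k + 1)
    ]


def relu(values):
    """Replace negative responses with zero."""
    return [max(0.0, v) for v in values]


def max_over_time(values):
    """Global max pooling: keep only the strongest response of the filter."""
    return max(values)
```

Each output value depends only on a local window of the input, which is the local receptive field mentioned above. A filter of width 3 sliding over a sequence of length 6 produces 4 responses.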

2.3 Related Work

In this section, we briefly introduce related research and solutions regarding malware, including some popular Web shell detectors and malware detection based on deep learning.

Web Shell Detector (footnote 7) is a Python tool that helps to detect Web shells. It is quite a good solution, as it is easy to use, develop and customize. However, the Web shell pattern set in the Web Shell Detector database is not up to date and is also very limited.

PHP Malware Finder (footnote 8) is also an effective tool for scanning Web shells with its YARA-based rules. Because the detection mechanism of this product is quite simple, the false positive rate in the final results is somewhat high. Also, PHP Malware Finder can only flag suspicious files; it does not show whether a file is precisely a Web shell or a dangerous file.

VirusTotal (footnote 9) is an online service that analyzes suspicious files, including viruses, worms and Web application malware, by running them through dozens of anti-virus products. However, it accepts only one file of any type per submission. This restriction makes scanning time-consuming, so it is hardly suitable for validating a whole Web project.

Liu and Wang [9] proposed a malware detection system using deep learning on API calls. Using Cuckoo Sandbox (footnote 10), a solution for automated analysis of malicious code, they extracted the API call sequences of malicious programs, then used deep learning techniques such as GRU, BGRU, LSTM, SimpleRNN and BLSTM to train and test on a

7 http://www.shelldetector.com/

8 https://github.com/nbs-system/php-malware-finder

9 https://virustotal.com

10 https://cuckoosandbox.org/


dataset including 21,378 samples. The results show that BLSTM has the best performance for malware detection, reaching an accuracy of 97.85%.

Özkan et al. [18] used image processing techniques to detect malicious code. Noting that some image-based techniques had been developed together with feature extraction and classifiers to discover relations between malware binaries in grayscale representation, they applied CNN features to the malware detection problem. With a dataset of 12,279 malware samples, the classifier reached an accuracy of 85%, which increased to 99% with a dataset of 9,339 samples.

Tian et al. [14] also used a CNN to detect webshells, focusing on the HTTP requests of a web service. They segmented HTTP requests into HTTP symbol words and used the word2vec technique so that each HTTP request could be represented as a matrix. With this matrix representation, they applied a CNN to extract features and train a model for detecting malicious webshells.

Using 35 different features extracted from packet flows, Yeo et al. [17] proposed an automated malware detection method based on a convolutional neural network (CNN), multi-layer perceptron (MLP), support vector machine (SVM), and random forest (RF). With a netflow capture from Stratosphere IPS, in which nine different public malware packet captures and normal-state packets were converted to flow data, they showed >85% accuracy, precision and recall for all classes using CNN and RF.

3 PHP WEBSHELL DETECTION BY DEEP LEARNING METHOD

In this section, we propose a solution that combines pattern matching for malicious code detection, using a Yara ruleset, with a CNN-based approach.

3.1 Approach

Each technique has its own advantages and disadvantages. For the pattern matching method, the true positive rate on known types of malicious code is extremely high, but the method has difficulty predicting unknown types of malicious code. As for the CNN deep learning method, the prediction model only reaches high accuracy if the training dataset is built correctly. While developing the training dataset, we had difficulty finding malicious code samples. For a dataset of benign PHP files, it is not difficult to search within the source code of popular PHP content management systems (CMS) such as WordPress, Joomla or Drupal. As for the malicious code dataset, although we tried to use the most reliable data sources, most of the datasets we found also contained clean files, which led to inaccurate training results. With thousands of files in each dataset, it is difficult to remove clean files manually. Therefore, our idea is to use a malware detection method based on the Yara ruleset to standardize the dataset of malicious code files, which then serves as the training input for the CNN learning model.

Accordingly, our method for detecting PHP webshells is based on three stages:

• Convert the PHP source files to numerical sequences of PHP opcodes. These opcode sequences are used to remove duplicate PHP files from both the benign and webshell datasets.

• Build clean datasets of both benign and webshell samples for both the training and testing sets. For this, pattern matching techniques applying Yara rules are chosen to generate the clean datasets.

• Build the Convolutional Neural Network model by the deep learning approach with the clean datasets. This model is used to predict whether a PHP file embeds malicious code such as a webshell.

We detail the last two stages in the next subsections.

3.2 Building Clean Datasets

Our idea to build clean datasets is shown in Figure 3:

Figure 3: Building Clean Datasets using Yara rulesets

As shown in Figure 3, to eliminate the fake malicious files in the webshell datasets, we first apply Yara-based webshell detection with the Yara rulesets to the raw datasets. After that, the training dataset consisting of benign and malicious PHP files is translated to opcode sequences via an Opcode Converter. This converter also eliminates duplicated opcode sequences during conversion, to avoid affecting accuracy when training the model. Completely different PHP files can produce duplicate opcode sequences because an opcode sequence is a sequence of numbers representing the list of operation codes invoked; if two files happen to invoke the same list of opcodes, their opcode sequences are identical. At the end of the process, we have clean benign and webshell datasets for both the training and testing phases.
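The conversion and de-duplication steps described above can be sketched as follows. This is a minimal Python 3.7+ illustration (it relies on dictionaries preserving insertion order), assuming each file has already been turned into a list of opcode names by some converter. The function names, the choice of reserving id 0 for padding, and any file names used with them are assumptions made for this example, not the authors' implementation.

```python
def build_vocabulary(sequences):
    """Assign each opcode name a positive integer id; 0 is left free for padding."""
    vocab = {}
    for seq in sequences:
        for op in seq:
            if op not in vocab:
                vocab[op] = len(vocab) + 1
    return vocab


def encode(seq, vocab):
    """Turn a list of opcode names into a tuple of integer ids."""
    return tuple(vocab[op] for op in seq)


def deduplicate(named_sequences, vocab):
    """Keep only the first file seen for each distinct encoded opcode sequence."""
    seen = set()
    unique = {}
    for name, seq in named_sequences.items():
        key = encode(seq, vocab)
        if key not in seen:
            seen.add(key)
            unique[name] = key
    return unique
```

Two files whose source text differs but whose opcode lists are identical collapse to a single training sample, which is the duplicate-elimination effect described in the text.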

3.3 Detecting Webshell by CNN Model

We use the CNN model to implement our deep learning approach for detecting webshells in PHP source files. The following figure illustrates our training and testing model:

Figure 4: Webshell Detection Using CNN Model

The training input data consists of the benign and webshell datasets. As mentioned in the previous section, because the webshell dataset is collected from many different sources, it also contains benign files, so we use the pattern matching method with the Yara ruleset to make the webshell data as accurate as possible. The standardized dataset of PHP files is then converted into opcode sequences, which become the training data for the CNN. The trained model is used to predict the test datasets, classifying each file as benign or webshell.

In our research, a Convolutional Neural Network is applied to malware detection using opcodes as its raw input data, as shown in Figure 5. The opcodes go through a sequence of convolution layers at different levels. At the end, an output layer outputs the probabilities of a file being malware or benign. By providing a large amount of training data, we can expect the neural network to learn specific patterns of malware families, as well as powerful features that remain invariant over time, to distinguish malware from benign files.

Figure 5: CNN Architecture for Detecting Webshell

4 EXPERIMENT AND EVALUATION

Based on the proposed method, we built and implemented our solution, named WSDetector, in Python. The experiments were performed on a computer with 2 x Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz (45MB cache, 18 cores per CPU), 128GB of main memory, CentOS Linux release 7.4.1708 and Python release 2.7. For the deep learning platform, we use tensorflow v.1.14.0, scikit-learn v.0.20.4, scipy v.1.2.2, numpy v.1.16.5 and yara-python v.3.10.0.

4.1 Evaluation Metrics

To evaluate the ability of PHP webshell detection tools, we use two different test sets: one contains malicious PHP web shells and the other is a collection of clean, benign PHP code. We observe the True Positive (TP), False Positive (FP), False Negative (FN) and True Negative (TN) samples, then compute the Accuracy, Precision, Recall (sensitivity, or true positive rate, TPR), F1-score and False Positive Rate (FPR) with the following formulas [15]:

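\[
\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}
\]

\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
\text{FPR} = \frac{FP}{FP + TN}
\]

\[
F_1\text{-score} = \frac{2\,TP}{2\,TP + FP + FN}
\]

4.2 Datasets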

To build the webshell dataset, we collected a wide range of webshells from reliable, highly starred sources on GitHub (footnote 11). There are 4,171 PHP webshell files in total. For the benign dataset, different PHP frameworks, forums and content management systems were collected from their official sites. They include Laravel, WordPress, Joomla, phpMyAdmin, phpPgAdmin and phpBB (footnote 12). After removing non-PHP files, the benign set contains 7,400 files in total. In order to

11 /tennc/webshell, /bartblaze/PHP-backdoors, /b374k/b374k, /JohnTroony/php-webshells, /xl7dev/WebShell, /BlackArch/webshells, /fuzzdb-project/fuzzdb, /LuciferoO/webshell-collector, /ysrc/webshell-sample, /webshellpub/awsome-webshell, /PHP-WebShell-Bypass-WAF, /linuxsec/indoxploit-shell

12 Github: https://github.com/laravel/laravel; https://github.com/WordPress/WordPress; https://github.com/joomla/joomla-cms; https://github.com/phpmyadmin/phpmyadmin; https://github.com/phppgadmin/phppgadmin; https://github.com/phpbb/


train and validate our proposed method of detecting PHP webshells, we divided the benign and webshell datasets into two parts with a ratio of 7:3 as a rule of thumb [16]. Based on the distribution of files across the dataset sources, the training/testing split is made by whole sources. The following table shows our final datasets for training and testing.

Table 2: Raw Benign and Webshell Datasets

                   Training Set   Testing Set
Benign Dataset     5,802          1,598
Webshell Dataset   3,684          487

To convert the PHP files into opcodes, we use the vld extension of the PHP engine (footnote 13) to implement the opcode converter. Using this tool, the raw datasets are first cleaned by removing files with duplicate opcode sequences. The resulting non-duplicate datasets are shown in Table 3:

Table 3: Non-duplicate Benign and Webshell Datasets

                   Training Set   Testing Set
Benign Dataset     4,875          1,182
Webshell Dataset   1,049          275

4.3 Experiment Results

a) Pattern Matching based Detection

From the non-duplicate webshell training dataset, we generated 3,242 Yara rules based on our previous research [15]. We used these rules to detect PHP webshells in the non-duplicate testing datasets (both benign and webshell). Table 4 shows the resulting confusion matrix.

Table 4: Confusion matrix of PHP webshell detection by using Yara rules

                     Real Benign   Real Webshell
Predicted Benign     1,180         25
Predicted Webshell   2             250

From that, the performance of our Yara-based PHP webshell detector is given in the following table:

Table 5: Accuracy, Precision, Recall, F1-score and FPR of Yara-based testing (%)

           Accuracy   Precision   Recall   F1-Score   FPR
Benign     98.15      97.93       99.83    98.87      9.09
Webshell   98.15      99.21       90.91    94.88      0.17

These experimental results are clearly better than the results published in [15], which reported a detection F1-score of 92%.

b) CNN based Detection

As in the previous experiment, we used the non-duplicate training datasets to train the CNN model using the tensorflow engine. The maximum opcode sequence length in our datasets is 44,335. We therefore pad all training opcode sequences with the value 0 (meaning no-operation) so that they have the same length.

13 See more about VLD at: https://github.com/derickr/vld

The CNN is configured with a maximum of 100,000 inputs, 128 outputs and three 1D-convolution layers. After several training runs, we finally chose filter sizes of 3, 4 and 5 for the three layers, respectively; a dropout of 0.5; softmax as the activation function; adam as the optimizer; a learning_rate of 0.08; categorical_crossentropy as the loss function; a validation set of 10%; a batch_size of 96; and 32 epochs.
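The zero-padding step that gives every opcode sequence the same length can be sketched in plain Python. This is a minimal illustration under two assumptions that the text does not specify: padding is appended on the right, and sequences longer than the target length are truncated. The function name `pad_sequences` is chosen for this example.

```python
def pad_sequences(sequences, max_len, pad_value=0):
    """Right-pad each sequence with `pad_value`; truncate anything longer than `max_len`."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:max_len]            # assumed truncation policy
        padded.append(seq + [pad_value] * (max_len - len(seq)))
    return padded
```

Because 0 is used only as the padding value, real opcode ids should start at 1 so that padding cannot be confused with an actual operation.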

Using this CNN model, we evaluated the test datasets and obtained the results shown in the confusion matrix in Table 6 and the scores in Table 7.

Table 6: Confusion matrix of PHP webshell detection by using CNN model

                     Real Benign   Real Webshell
Predicted Benign     1,157         10
Predicted Webshell   25            265

Table 7: Accuracy, Precision, Recall, F1-score and FPR of CNN-based testing (%)

           Accuracy   Precision   Recall   F1-Score   FPR
Benign     97.60      99.14       97.88    98.51      3.64
Webshell   97.60      91.38       96.36    93.81      2.12

c) Yara and CNN based Detection

In the above experiment, it is clear that the CNN-based detection model has a lower F1-score and accuracy, and a higher false positive rate for the webshell class (2.12% versus 0.17%), than the Yara-based detection model. However, after reviewing the misdetected samples, we found that these samples merely contain very common functions, such as fread or file_put_contents, which manipulate the contents of a file. We also looked in detail at the raw datasets and found that they contain some mislabeled samples: files in the webshell datasets that are actually benign, and similarly for the benign datasets.

We therefore decided to combine the Yara-based detector with the CNN-based model. First, we clean all non-duplicate datasets using the Yara-based detector in order to remove the fake webshells. The resulting cleaned datasets are then used to train and test the CNN-based webshell detection model. The cleaned datasets are summarized in Table 8:

Table 8: Cleaned Benign and Webshell Datasets

                   Training Set   Testing Set
Benign Dataset     4,871          1,180
Webshell Dataset   618            250

Using these datasets, we trained the CNN model with the same settings as in the previous experiment. After that, the cleaned test datasets were used to evaluate this model. The results are shown in the confusion matrix in Table 9 and the scores in Table 10.

Table 9: Confusion matrix of PHP webshell detection by using Yara+CNN model

                     Real Benign   Real Webshell
Predicted Benign     1,170         4
Predicted Webshell   10            246

Table 10: Accuracy, Precision, Recall, F1-score and FPR of Yara+CNN based Detection (%)

               Accuracy   Precision   Recall   F1-Score   FPR
Benign         99.02      99.66       99.15    99.41      1.60
Webshell       99.02      96.09       98.40    97.23      0.85
Micro Avg      99.02      99.66       99.15    99.41      0.85
Macro Avg      99.02      97.88       98.78    98.32      1.22
Weighted Avg   99.02      99.04       99.02    99.03      1.47
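The per-class scores follow directly from the confusion matrix. As a check, the Python sketch below applies the metric definitions of Section 4.1 to the Yara+CNN confusion matrix, taking webshell as the positive class (TP = 246, FP = 10, FN = 4, TN = 1,170). The function name `binary_metrics` is hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, F1-score and FPR as percentages."""
    values = {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "fpr": fp / (fp + tn),
    }
    return {name: round(100 * value, 2) for name, value in values.items()}


# Webshell as the positive class, counts read from the Yara+CNN confusion matrix.
webshell_scores = binary_metrics(tp=246, fp=10, fn=4, tn=1170)
```

Swapping the roles of the two classes (TP = 1,170, FP = 4, FN = 10, TN = 246) gives the scores of the benign row in the same way.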

We also performed k-fold cross validation for this model. The following table shows the results we obtained with k=5 folds.

Table 11: 5-fold Cross Validation Results (%)

          Accuracy   F1-Score   FPR
Fold 1    99.40      98.20      0.00
Fold 2    99.23      97.67      0.92
Fold 3    98.63      95.98      0.93
Fold 4    98.97      96.88      0.00
Fold 5    98.63      96.02      1.14
Average   98.97      96.95      0.60
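For readers unfamiliar with the procedure, k-fold cross validation partitions the samples into k folds and uses each fold once as the test set while training on the rest. The Python sketch below generates such index splits. It is an unshuffled, unstratified illustration only; the text does not state whether the folds were shuffled or stratified, and the function name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds of near-equal size."""
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size
```

The reported average is then the mean of the per-fold scores, for example the five accuracy values in the table above.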

These results confirm that the CNN model built from the datasets cleaned by the Yara detector is better overall than either the Yara-only or the CNN-only approach.

4.4 Evaluation

To assess the performance of our PHP webshell detection method based on Yara and CNN, we compare our results with other approaches. At this time, we have not run the evaluation on the same machine and the same datasets (moreover, the source code and clean datasets of the other approaches are not published). Thus, we only show the results of each approach as published by its authors. Note that we use only the accuracy, F1-score and FPR metrics in this comparison. The following table compares our Yara+CNN model with the other approaches:

Table 12: Comparison of different webshell detection approaches (%)

                         Accuracy   F1-Score   FPR
php-malware-finder [1]   94.23      96.46      4.49
Word2Vec+CNN [14]        98.6       98.6       -
RF-GBDT [6]              99.16      99.09      0.68
GuruWS [15]              85.56      92.00      0.00
Our Yara+CNN             99.02      99.41      0.85

5 CONCLUSION

More and more unknown malicious code is being developed for installation into the source code of the web applications that dominate cyberspace, which is a huge challenge for cybersecurity researchers today. In this paper, we proposed an efficient model that combines a deep learning approach with pattern matching based on Yara rules and converts PHP source code to a numerical sequence of opcodes, in order to predict whether a PHP file embeds malicious code. Our experimental results show that the proposed method (Yara+CNN) achieves an accuracy of 99.02% with a 0.85% false positive rate.

For future work, we aim to extend our method to other programming languages such as ASP, ASP.NET, Java and Python. In addition, we will study and test other deep learning methods such as LSTM, compare them with the current method, and select the most accurate predictive model.

ACKNOWLEDGMENTS

This work is partially supported by the national research project No. KC.01.19/16-20, granted by the Ministry of Science and Technology of Vietnam (MOST).

REFERENCES

[1] 2019. PHP Malware Finder. https://github.com/nbs-system/php-malware-finder

[2] 2019. Web Technology Surveys. http://w3techs.com/technologies/overview/programming_language/all/

[3] G. P. Bherde and M. A. Pund. 2016. Recent attack prevention techniques in web service applications. In 2016 International Conference on Automatic Control and Dynamic Optimization Techniques (ICACDOT). 1174–1180. https://doi.org/10.1109/ICACDOT.2016.7877771

[4] Daniel Bilar. 2007. Malware detection through opcode sequence analysis using machine learning. Int. J. Electronic Security and Digital Forensics 1 (2007). https://doi.org/10.1504/IJESDF.2007.016865

[5] Simen Rune Bragen. 2015. Opcodes as predictor for malware. VDP::Mathematics and natural science: 400::Information and communication science: 420::Security and vulnerability: 424 1 (01 2015).

[6] H. Cui, D. Huang, Y. Fang, L. Liu, and C. Huang. 2018. Webshell Detection Based on Random Forest–Gradient Boosting Decision Tree Algorithm. In 2018 IEEE Third International Conference on Data Science in Cyberspace (DSC). 153–160. https://doi.org/10.1109/DSC.2018.00030

[7] Z. Cui, F. Xue, X. Cai, Y. Cao, G. Wang, and J. Chen. 2018. Detection of Malicious Code Variants Based on Deep Learning. IEEE Transactions on Industrial Informatics 14, 7 (July 2018), 3187–3196. https://doi.org/10.1109/TII.2018.2822680

[8] Z. Cui, F. Xue, X. Cai, Y. Cao, G. Wang, and J. Chen. 2018. Detection of Malicious Code Variants Based on Deep Learning. IEEE Transactions on Industrial Informatics 14, 7 (July 2018), 3187–3196. https://doi.org/10.1109/TII.2018.2822680

[9] Y. Liu and Y. Wang. 2019. A Robust Malware Detection System Using Deep Learning on API Calls. In 2019 IEEE 3rd Information Technology, Networking, Electronic and Automation Control Conference (ITNEC). 1456–1460. https://doi.org/10.1109/ITNEC.2019.8728992

[10] A. Masood and J. Java. 2015. Static analysis for web service security - Tools & techniques for a secure development life cycle. In 2015 IEEE International Symposium on Technologies for Homeland Security (HST). 1–6. https://doi.org/10.1109/THS.2015.7225337

[11] M. Mazumder and T. Braje. 2016. Safe Client/Server Web Development with Haskell. In 2016 IEEE Cybersecurity Development (SecDev). 150–150. https://doi.org/10.1109/SecDev.2016.040

[12] M. A. E. Mohd Efendi, Z. Ibrahim, M. N. Ahmad Zawawi, F. Abdul Rahim, N. A. Muhamad Pahri, and A. Ismail. 2019. A Survey on Deception Techniques for Securing Web Application. In 2019 IEEE 5th Intl Conference on Big Data Security on Cloud (BigDataSecurity), IEEE Intl Conference on High Performance and Smart Computing, (HPSC) and IEEE Intl Conference on Intelligent Data and Security (IDS). 328–331. https://doi.org/10.1109/BigDataSecurity-HPSC-IDS.2019.00066

[13] Internet Live Stats. 2019. Internet Usage and Social Media Statistics. https://www.internetlivestats.com/

[14] Yifan Tian, Jiabao Wang, Zhenji Zhou, and Shengli Zhou. 2017. CNN-Webshell: Malicious Web Shell Detection with Convolutional Neural Network. In Proceedings of the 2017 VI International Conference on Network, Communication and Computing (ICNCC 2017). ACM, New York, NY, USA, 75–79. https://doi.org/10.1145/3171592.3171593

[15] Le V-G, Nguyen H-T, Pham D-P, Phung V-O, and N-H Nguyen. 2019. GuruWS: A Hybrid Platform for Detecting Malicious Web Shells and Web Application Vulnerabilities. Transactions on Computational Collective Intelligence, Springer, Berlin, Heidelberg 11370, XXXII (01 2019), 184–208.

[16] Le V-G, Nguyen H-T, Lu L-D, and N-H Nguyen. 2016. A solution for automatically malicious Web shell and Web application vulnerability detection. In Computational Collective Intelligence, Volume 9875 of the series Lecture Notes in Computer Science. Springer-Verlag, Berlin, Heidelberg, 367–378.

[17] M. Yeo, Y. Koo, Y. Yoon, T. Hwang, J. Ryu, J. Song, and C. Park. 2018. Flow-based malware detection using convolutional neural network. In 2018 International Conference on Information Networking (ICOIN). 910–913. https://doi.org/10.1109/ICOIN.2018.8343255

[18] K. Özkan, Ş. Işık, and Y. Kartal. 2018. Evaluation of convolutional neural network features for malware detection. In 2018 6th International Symposium on Digital Forensic and Security (ISDFS). 1–5. https://doi.org/10.1109/ISDFS.2018.8355390
