Application of machine learning on automatic program repair of security vulnerabilities

Inspired by the work of Zimi Chen [14], whose research uses transfer learning to leverage knowledge that deep learning models learned from generic code repairing tasks to improve the performance of vu

Motivation

In software testing, security vulnerabilities are challenging to identify and patch because they do not directly impact software functionality. Instead, these vulnerabilities remain hidden until they are intentionally exploited, potentially causing significant damage.

Code patching methods can be categorized into template-based and generative-based approaches. Template-based patching utilizes predefined templates to guide the creation of code modifications, allowing developers to systematically address specific security issues. This method streamlines the patching process by offering a consistent format, making it particularly effective for common vulnerabilities with established solutions. In contrast, generative-based patching employs automated techniques, including machine learning and code analysis, to create patches without relying on templates. This approach analyzes the codebase to identify vulnerabilities and generates tailored code changes, providing greater flexibility and adaptability across various programming languages and structures.

Automated program repair is an emerging field of Software Engineering (SE) research that allows for automated rectification of software errors and vulnerabilities.

Program repair, also known as automated program repair or software patching, involves the automatic identification and resolution of software bugs without human intervention. This process typically includes analyzing the source code to determine the root cause of defects and generating patches or modifications to fix the issues. Recently, there has been a growing interest in leveraging Machine Learning techniques to enhance the automation of code repair tasks.

Building on Zimi Chen's research, which employs transfer learning to enhance vulnerability repair tasks through insights gained from generic code repairing, our primary objective is to explore the automatic generation of vulnerability patches using a generative model. We aim to utilize the knowledge acquired from large programming language models like CodeBERT to boost the generative model's performance by incorporating embeddings extracted from CodeBERT.

Problem Statement

Generating patches for vulnerabilities using data-driven knowledge is a relatively new field facing significant challenges. A major issue is the insufficient volume and diversity of high-quality data, which makes current patch generation systems unreliable. This unreliability persists even in template-based methods, which require less data than generative approaches dominated by deep learning architectures.

The abundance of data related to buggy code, particularly from GitHub repositories, provides a valuable resource for code repair. By analyzing commit histories, we can obtain pairs of faulty code and their corresponding patches. This raises the question of how to effectively leverage this wealth of information to enhance code repair solutions.

In deep learning, limited data availability is a common challenge. Transfer learning is a method that leverages knowledge gained from one task to enhance performance on another. Its application in vulnerability repair tasks has been explored in previous studies, yielding promising results even with restricted datasets.

In our work, we will go even further by combining source code modeling techniques with transfer learning with the hope that this could further improve previously reported results.

Research Questions

Our research will focus on answering the following three research questions:

Research Question 1: What do we know about Deep Learning in vulnerable program repair?

Deep Learning has demonstrated significant promise in addressing various software development challenges. Investigating its role in vulnerable program repair could enhance automated bug fixing methods. Gaining insights into the current understanding of Deep Learning in this area will reveal its effectiveness, limitations, and potential to boost the efficiency and accuracy of program repair techniques.

Research Question 2: How effective are existing generative-based methods for code repair and vulnerability repair?

Investigating the effectiveness of these methods in code and vulnerability repair can provide valuable insights into their strengths, weaknesses, and limitations.

Research Question 3: Can code embedding extend the capabilities of these methods?

Code embedding transforms source code into vectors within a continuous space, facilitating the use of machine learning algorithms for code analysis and comprehension. Investigating code embedding in generative methods for code and vulnerability repair can provide fresh insights into enhancing the effectiveness and generalizability of these approaches.

Thesis Outline

The thesis is organized as follows. Section 2 covers essential background knowledge of deep learning and its applications in code repair, and reviews previous methods addressing vulnerabilities. Sections 3 and 4 provide an in-depth discussion of prominent methods, leading to the presentation of our proposed solutions.

In the background, we will first introduce the fundamentals of Deep Learning, secondly Learning Paradigms, and thirdly Deep Learning in Code Repair.

Background on Neural Network and Deep Learning

Recurrent Neural Network (RNN)

In the realm of data, certain properties govern the information that models aim to learn, with one key property being the order of values in sequence data. Formally, sequence data can be defined as a type of data in which the values at later positions are influenced by those at earlier positions. We use the following notation:

• $T$ denotes the length of the sequence

• $t$ denotes the position of $x$ in the sequence

• $x_t$ is the value at position $t$

• $X_T$ represents sequence data of length $T$

Recurrent neural networks (RNNs) can be categorized into two main variants: those with gated units and those without. This section will explore both types of networks, focusing on the vanilla recurrent neural network as a representative of the non-gated category and the long short-term memory network as an example of the gated category.

Vanilla recurrent neural network

In a sequence data sample $ X_T = \{x_1, x_2, \ldots, x_t, \ldots, x_T\} $, each value $ x_t $ at time step $ t $ serves as input to the functions within a Recurrent Neural Network (RNN). These functions extract information from the current timestep while utilizing knowledge from previous calculations, allowing information to propagate throughout the network. This process ultimately leads to predictions based on the specific task at hand. The mathematical representation of this operation is given by equation 2.1: $ h_t = \tanh(x_t W_{xh} + h_{t-1} W_{hh} + b) $.

Before going into the details of the operations of a typical RNN, we first walk through the notation used in Figure 2.1.

• The tanh function is an activation function that returns values in the range [−1, 1].

• $b$ is the bias term, added to allow better generalization of models.

• $W_{hh}$ is the weight matrix representing the connection between the hidden values of position $t-1$ and position $t$, which allows information learned at the previous position $x_{t-1}$ to be forwarded to the current position $x_t$.

• $W_{xh}$ is the weight matrix that extracts information from the current position $x_t$.

• $h_t$ represents the latent information extracted from the current layer.

Figure 2.1: The basic architecture of recurrent neural network

Figure 2.1 illustrates a three-layer RNN that extracts latent values $ h_t $ from input $ x_t $. The network processes input $ x_t $ alongside previously extracted information $ h_{t-1} $, as defined in equation 2.1. Each layer computes hidden values using the same weight matrices, with only $ h_t $ being passed to the next layer. This information propagation is crucial for RNNs and their variants, such as LSTM, enabling them to effectively manage sequence data by capturing dependencies between values in a sequence.
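The recurrence in equation 2.1 can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the helper names `rnn_step` and `rnn_forward`, the toy dimensions, and the weight values are all assumptions made for the example.

```python
import math

def rnn_step(x_t, h_prev, W_xh, W_hh, b):
    """One vanilla RNN step: h_t = tanh(x_t W_xh + h_prev W_hh + b)."""
    hidden_size = len(b)
    h_t = []
    for j in range(hidden_size):
        total = b[j]
        total += sum(x_t[i] * W_xh[i][j] for i in range(len(x_t)))
        total += sum(h_prev[k] * W_hh[k][j] for k in range(hidden_size))
        h_t.append(math.tanh(total))
    return h_t

def rnn_forward(sequence, W_xh, W_hh, b):
    """Apply the same weights at every timestep, passing only h_t forward."""
    h = [0.0] * len(b)  # initial hidden state (assumed to be zeros)
    states = []
    for x_t in sequence:
        h = rnn_step(x_t, h, W_xh, W_hh, b)
        states.append(h)
    return states

# Toy setup (hypothetical values): 2 input features, 3 hidden units, 3 timesteps.
W_xh = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]]
W_hh = [[0.2, 0.0, 0.0], [0.0, 0.2, 0.0], [0.0, 0.0, 0.2]]
b = [0.0, 0.0, 0.0]
states = rnn_forward([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], W_xh, W_hh, b)
```

Note that `W_xh`, `W_hh`, and `b` are created once and reused at every timestep, which mirrors the weight sharing described above.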

The design of recurrent network architectures can be categorized based on the specific task, leading to patterns such as one-to-one, many-to-one, many-to-many, and one-to-many. The key distinction among these patterns lies in the number of inputs, $x_t$, required by the network to produce either a single prediction, $y_t$, or a sequence of predictions, $y_t, \ldots, y'_T$. The sequence-to-sequence design follows the many-to-many pattern. These patterns are illustrated in Figure 2.2.

Figure 2.2: Recurrent Neural Network design patterns

Long short-term memory network (LSTM)

Recurrent Neural Networks (RNNs) are designed to efficiently process sequence data by allowing information to flow between hidden layers across timesteps. However, they struggle with long sequences where later information relies on distant context. Long Short-Term Memory (LSTM) networks address this limitation by utilizing gated units that control the flow of information. LSTMs incorporate three types of gates: the forget gate, the input gate, and the output gate, which work together to enhance the network's ability to manage long-range dependencies.

The gated units in an LSTM are accompanied by a cell state, which is crucial for retaining information from previous timesteps. At each timestep, the cell state is managed by the units, allowing the model to forget outdated information while incorporating new data from recent timesteps.

Gated units in LSTM networks manage the flow of information stored in cell states, enabling the retention of data from the distant past. To understand LSTM's functionality, we must analyze the mathematical representation of these gates and their operations. Each gate's formal representation is a function that incorporates parameters such as the input $x_t$, the previous hidden state $h_{t-1}$, and the previous cell state $C_{t-1}$. The introduction of new units in LSTM increases the number of learnable weight matrices, including $W_C$ for computing candidate cell state values $\hat{C}_t$, and $W_i$, $W_f$, and $W_o$ for determining the information to retain, discard, and pass to the next time step as hidden state values. The mathematical representation of these computations is given by the following equations, where $\odot$ denotes element-wise multiplication:

\[f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)\]

\[i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)\]

\[\hat{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)\]

\[C_t = f_t \odot C_{t-1} + i_t \odot \hat{C}_t\]

\[o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)\]

\[h_t = o_t \odot \tanh(C_t)\]

In the discussed operations, a bias is incorporated into each linear transformation to enhance the model's generalization capability. The functions σ and tanh serve as non-linear transformations, producing output values within the ranges of [0, 1] and [-1, 1], respectively.
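To make the role of each gate concrete, the following sketch performs a single LSTM step in plain Python using the standard LSTM formulation. The function names, the parameter dictionary layout, and the all-zero toy weights are assumptions chosen for readability, not part of any particular library.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, v, b):
    """Compute W v + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + b_j for row, b_j in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; p holds the weights W_f, W_i, W_C, W_o and their biases."""
    z = h_prev + x_t  # list concatenation, i.e. [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(p["W_f"], z, p["b_f"])]        # forget gate
    i = [sigmoid(a) for a in affine(p["W_i"], z, p["b_i"])]        # input gate
    c_hat = [math.tanh(a) for a in affine(p["W_C"], z, p["b_C"])]  # candidate cell state
    o = [sigmoid(a) for a in affine(p["W_o"], z, p["b_o"])]        # output gate
    # Discard part of the old cell state and add part of the candidate values.
    c_t = [f_j * c_j + i_j * ch_j for f_j, c_j, i_j, ch_j in zip(f, c_prev, i, c_hat)]
    h_t = [o_j * math.tanh(c_j) for o_j, c_j in zip(o, c_t)]
    return h_t, c_t

# Toy parameters (hypothetical): 2 hidden units, 1 input feature, all weights zero.
hidden, inputs = 2, 1
zero = [[0.0] * (hidden + inputs) for _ in range(hidden)]
params = {name: zero for name in ("W_f", "W_i", "W_C", "W_o")}
params.update({name: [0.0] * hidden for name in ("b_f", "b_i", "b_C", "b_o")})
h_t, c_t = lstm_step([1.0], [0.0, 0.0], [1.0, -2.0], params)
```

With all-zero weights every gate outputs 0.5 and the candidate values are 0, so the new cell state is simply half of the previous one; this makes the forgetting behaviour easy to inspect by hand.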

Figure 2.3: LSTM network with three repeating layers

Figure 2.3 illustrates a network featuring three repeating LSTM layers, with each layer representing a timestep. At each timestep, only the cell state $C_t$ and hidden state $h_t$ are utilized and passed to the subsequent layer. Despite this, the LSTM network maintains shared weight matrices across all operations in each layer.

Transformer Neural Network

Transformer-based neural networks are at the forefront of artificial intelligence research, significantly advancing fields such as natural language processing, computer vision, and speech processing. These networks have consistently outperformed their predecessors and continue to push the boundaries of technology. This section will dissect the components of the transformer, focusing on the fundamental concept of attention, which serves as the core building block, and will provide a step-by-step overview of the operations within a transformer module.

The attention mechanism is crucial for understanding the transformer module, as it enables the retrieval of information from previous time steps. Unlike recurrent neural networks and their variants, which struggle with retaining and referencing information, the attention mechanism allows for more effective information processing. Even with LSTM networks, which enlarge the window of reference through cell states, the challenge of retaining older information persists, especially as the number of LSTM layers increases.

The attention mechanism allows a network to reference previously extracted information while processing the current timestep, determining the significance of different parts of the sequence for current calculations. This mimics human information processing, where predictions are made by focusing on specific past data. However, storing results at each timestep for future reference incurs significant computational costs, as the number of matrix multiplication operations grows quadratically with sequence length.

To clarify the attention mechanism, we will explore its integration with a recurrent network using the encoder-decoder architecture. The encoder block captures and retains information from the input sequence, unlike traditional recurrent networks that discard it. The decoder generates predictions using this extracted knowledge, enhanced by the attention mechanism, which combines the encoder's output with a context vector derived from direct references to each encoder layer, forming the input for the decoder layers.

To determine the attention allocated to each encoder layer, the dot product of the hidden states in the decoder is calculated with each layer in the encoder. This result is then passed through the softmax function, producing a numeric value between 0 and 1. This value represents the "amount of attention" or the "contribution to the context" for the prediction.

The output from the softmax function at each layer is multiplied by the corresponding layer's hidden states. These vectors from all encoder layers are then combined using either summation or averaging to create the context used for prediction.
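The two steps above (scoring each encoder state against the decoder state, then forming a weighted combination) can be sketched as follows. This is an illustrative example using summation to combine the weighted states; the function names and the toy vectors are assumptions.

```python
import math

def softmax(scores):
    """Normalize scores into weights in [0, 1] that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(decoder_state, encoder_states):
    """Dot-product attention of one decoder state over all encoder states."""
    # 1. Score every encoder hidden state against the decoder hidden state.
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    # 2. Turn the scores into the "amount of attention" per encoder state.
    weights = softmax(scores)
    # 3. Sum the encoder states, each scaled by its attention weight.
    dim = len(decoder_state)
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context

# Hypothetical hidden states: three encoder positions, hidden size 2.
weights, context = attention_context([1.0, 0.0],
                                     [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
```

In this toy input the first and third encoder states align with the decoder state, so they receive equal weights that are larger than the weight of the second state.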

Figure 2.4: Attention-integrated recurrent network

Since its introduction in 2017, the transformer model has significantly advanced artificial intelligence, surpassing previous performance metrics and shifting the research focus from recurrent networks. At its core, the transformer relies on the attention mechanism, which, combined with parallel training modules, enhances its capabilities. However, these improvements also lead to increased complexity and computational costs during both training and inference. This section will explore the transformer architecture, highlighting its advantages and disadvantages, to provide a foundation for discussing its application in vulnerability repair.

The transformer architecture, as introduced in the original paper, adheres to the traditional encoder-decoder framework. In this structure, the encoder generates a dense vector representation of the input, which serves as a reference for the decoder to produce predictions related to the learning task. The architecture consists of N identical modules stacked vertically, forming the transformer network. Key components of these modules, illustrated in Figure 2.5, include attention mechanisms and fully connected networks, which will be explored in detail in this section.

Figure 2.5: The encoder-decoder architecture of transformer

The attention mechanism, when applied in recurrent networks, faces limitations due to the sequential nature of information processing, which hinders the utilization of modern processors' parallel computing capabilities. This sequential processing leads to longer training times and slower inference speeds. In contrast, fully connected neural networks, an extension of multi-layer perceptrons with additional hidden layers, significantly improve training efficiency and inference speed by enabling independent calculations of hidden states in each layer. The transformer network further enhances this by replacing recurrent architectures with fully connected networks and employing innovative positional encoding techniques, allowing for effective parallel computation.

The authors define sine and cosine functions that map the position of each element in a sequence to a vector with the same dimension as the model's embedding. According to the authors, this mapping makes relationships between positions easy to learn, because the encoding of a position shifted by a fixed offset is a linear function of the original encoding, and it facilitates training without compromising performance.
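As an illustration, the sinusoidal encoding from the original transformer paper assigns sine values to even dimensions and cosine values to odd dimensions, with wavelengths controlled by the constant 10000. The sketch below follows that definition; the function name and the toy sizes are assumptions.

```python
import math

def positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position vectors."""
    table = []
    for pos in range(seq_len):
        row = []
        for k in range(d_model):
            # Dimensions 2i and 2i+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (k // 2)) / d_model))
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

# Hypothetical sizes: a sequence of 5 tokens with embedding dimension 4.
pe = positional_encoding(5, 4)
```

Each row of `pe` has the same dimension as a token embedding, so it can be added to the embedding element-wise before the sequence enters the attention layers.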

After positional encoding is applied to the input sequence, the result is processed by a multi-head attention module, which includes an attention-integrated fully connected layer. Each attention head in this module utilizes three vectors (query, key, and value), generated by passing the positionally encoded input through three linear layers concurrently, as illustrated in Figure 2.6.

In the context of transformers, the concepts of queries and keys originate from information retrieval, where a query is matched against the keys of items in a search repository. Scores indicating the relevance between each query and each key are computed and subsequently normalized to the range [0, 1]. The normalized values reflect the attention each element should allocate to the rest of the input sequence, guiding the extraction of information from the value vectors.

In the original paper, the authors note that large dot-product values push the softmax function into regions where gradients become extremely small during backpropagation. To address this issue, attention scores are scaled by $ \frac{1}{\sqrt{d_k}} $, where $ d_k $ is the dimension of the key vectors, before being normalized using softmax, as shown in the expression below:

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V\]

This scaling helps stabilize gradients during the optimization process with gradient descent.
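A single attention head can be sketched directly from this description: scores are dot products of queries and keys, scaled by the square root of the key dimension, normalized with softmax, and used to weight the value vectors. The code below is a minimal plain-Python illustration; the function names and toy matrices are assumptions.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V are lists of row vectors; returns one output row per query."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Relevance of this query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs

# Hypothetical inputs: one query, two key/value pairs.
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
uniform = scaled_dot_product_attention([[0.0, 0.0]], K, V)
focused = scaled_dot_product_attention([[10.0, 0.0]], K, V)
```

A zero query scores all keys equally, so `uniform` is the plain average of the value vectors, while a query aligned with the first key makes `focused` lean towards the first value vector.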

In the transformer architecture, multiple attention head modules are stacked horizontally, allowing their results to be concatenated and fed into a linear layer, which serves as the final layer in both the encoder and decoder stacks. This linear layer, distinct from the attention heads, aggregates the outputs before making predictions. Additionally, each multi-head attention layer and linear layer is followed by a residual connection and a normalization layer, which help stabilize gradients and facilitate the learning process of the transformer.

Transfer Learning

Deep learning networks have demonstrated significant performance improvements across various tasks and data types, largely due to their reliance on extensive datasets. However, many tasks lack sufficient data to meet the needs of these models, which is where transfer learning becomes essential. This technique uses a model pre-trained on similar tasks as the starting point for a new model, which is then adapted to the target task. Transfer learning is particularly beneficial in scenarios where data availability is limited. Two approaches are common:

• Feature extraction: the output of the source model is used as input for a model of the target task.

• Fine-tuning a pre-trained model: the last few layers of the source model are removed and replaced with layers of the new model. During fine-tuning, either all weights of the new architecture are updated together or only the newly added layers are updated, depending on the specific task.

The choice of transfer learning approach depends largely on the similarity between the tasks and between their datasets, as stated in the guidelines in [17]:

• New dataset is small and similar to original dataset

Due to the limited size of the dataset, fine-tuning the source model may lead to overfitting. Given the similarity of the data to the original dataset, we anticipate that the higher-level features in the source model will remain relevant. Therefore, a more effective approach would be to train a linear classifier on the CNN codes, that is, the features extracted by the source model.

• New dataset is large and similar to the original dataset

Since we have more data, we can have more confidence that we won’t overfit if we were to try to fine-tune through the full network.

• New dataset is small but very different from the original dataset

Since the data is small, it is likely best to only train a linear classifier.

Given the significant differences in the dataset, it may be more effective to train this classifier on activations from an intermediate layer of the source model rather than from the top of the network, which is heavily influenced by dataset-specific features.

• New dataset is large and very different from the original dataset

Although the dataset is large enough to train a model from scratch, it is often advantageous to start with weights from a pre-trained model. In this case there is sufficient data and confidence to fine-tune the entire network effectively.
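The four cases above amount to a small decision table over two properties of the new dataset. The sketch below only encodes the guidelines as summarized in this section; the function name and the returned strategy labels are hypothetical.

```python
def choose_transfer_strategy(dataset_is_large, similar_to_original):
    """Map the two dataset properties to the guideline described above."""
    if similar_to_original:
        if dataset_is_large:
            return "fine-tune the full network"
        return "linear classifier on top-layer features"
    if dataset_is_large:
        return "fine-tune the full network from pre-trained weights"
    return "linear classifier on intermediate-layer features"

# Example: a small vulnerability-fix dataset that resembles the generic
# bug-fix data used for pre-training (an assumed scenario).
strategy = choose_transfer_strategy(dataset_is_large=False, similar_to_original=True)
```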

Learning Paradigm

Sequence to Sequence Learning

Sequence-to-sequence (seq2seq) models, also known as many-to-many or encoder-decoder architectures, are a type of neural network designed to process input sequences and generate corresponding output sequences based on specific training objectives. These models consist of two main components: the encoder and the decoder, which can have identical or different architectures. The encoder generates a context vector that serves as the input for the decoder, facilitating the sequence generation process. In recent years, seq2seq models have predominantly been built from Recurrent Neural Networks (RNN), Long Short-term Memory Networks (LSTM), Transformers, or combinations of these. The details of all these networks are discussed in section 2.1.

Sequence-to-sequence architecture is widely utilized in Natural Language Processing (NLP) for tasks such as sentence translation and summarization. However, its applications extend beyond NLP, encompassing various deep learning fields like time series forecasting, image captioning, text-to-speech, and speech-to-text. These methods share a common pattern of processing input sequences to generate output sequences, which can include visual data, language, or numeric information.

The popularity of this architectural design has led researchers to apply deep learning methods to source code tasks, where models predict sequences that summarize input code blocks or generate patches for erroneous code. In natural language processing (NLP), these sequences consist of tokens representing words, while in source code, tokens include variables, operators, and parameters. A key method for generating these tokens is Byte Pair Encoding, which helps manage the extensive vocabulary inherent in programming, as variable, function, and class names can be virtually limitless. Additionally, previous studies have introduced tokenization methods aimed at reducing vocabulary size, alongside sequence-to-sequence networks for patch generation, which will be explored in detail in subsequent sections.

Graph-based Learning

Using graph representations for code allows for the capture of both syntactic and semantic structures, although this approach is computationally intensive. By leveraging data flow and type hierarchies, graph representations can reduce the model's capacity requirements, training regime, and data needs while effectively capturing the semantic context alongside the syntactic context of programs. Research has explored machine learning applications on graph-represented programs, particularly through Gated Graph Neural Networks. Source code can be represented as a graph using methods such as abstract syntax trees (AST), control flow graphs (CFG), or program dependence graphs (PDG). While ASTs draw inspiration from natural language processing, CFGs and PDGs focus on different aspects of the source code, making them suitable for various optimization purposes.

Tree-to-tree Learning

This learning paradigm is a subcategory of graph-based learning, distinguished by its similarity to sequence-to-sequence learning in mapping input to output using the same representation. It utilizes a tree representation of code, typically an abstract syntax tree, to capture the intricate syntactic structure that token sequences in sequence-to-sequence learning may overlook. Drawing inspiration from natural language processing, tree-based learning employs a neural machine translation model to convert an input tree of buggy code into an output tree of corrected code.

In studies utilizing this source code representation, it is essential to employ a code differencing tool like GumTree to detect the discrepancies between the abstract syntax tree (AST) of the problematic code and that of the patch.

Bug Repairing and Vulnerabilities Repairing

Intuitively, bug repairing is a broader domain compared to vulnerability repairing.

Security-related bugs, such as those classified as [26], pose significant risks to both software users and providers. While efforts are made to automatically detect and repair these vulnerabilities, the process of fixing them is often time-consuming and labor-intensive, as software can continue to operate normally despite the presence of these flaws.

Source code Representation

GumTree

GumTree is an algorithm that extracts edit scripts from abstract syntax trees using a two-phase approach. It first identifies matching nodes between the abstract syntax trees of the original and the fixed code. These mappings serve as input for another algorithm, RTED, to generate the edit scripts. The matching step consists of a top-down phase followed by a bottom-up phase. In the top-down phase, the two trees are compared to identify isomorphic sub-trees; the mappings between the roots of these sub-trees are referred to as anchor mappings and are used in the subsequent bottom-up phase. The identification of anchor mappings utilizes an auxiliary data structure known as a height list, which the algorithm traverses from the root down to nodes with heights exceeding a specified minimum.

To determine if two trees, T1 and T2, are isomorphic, begin by comparing their highest nodes. If these nodes are not isomorphic, proceed to evaluate their respective children for isomorphism.

• Given a node, there can be multiple matches, so all these mappings are first put into a list called the candidate list and processed later, after all unique mappings have been found.

• For each node with multiple matches in the candidate mappings list, we only choose the mapping that gives the highest score under the dice function below, where $s(t)$ denotes the set of descendants of node $t$ and $M$ is the set of mappings found so far:

\[\text{diceFunction}(t_1, t_2, M) = \frac{2 \times |\{t \in s(t_1) \mid (t, t') \in M,\ t' \in s(t_2)\}|}{|s(t_1)| + |s(t_2)|}\]
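The dice score measures the share of descendants that two nodes have in common under the current mappings. The following sketch computes it for a toy tree; the `Node` class, the representation of mappings as a list of node pairs, and the choice to return 0 for two leaf nodes are assumptions made for illustration and are not taken from the GumTree implementation.

```python
class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def descendants(node):
    """Return all nodes strictly below the given node."""
    result = []
    for child in node.children:
        result.append(child)
        result.extend(descendants(child))
    return result

def dice(t1, t2, mappings):
    """Ratio of descendants of t1 that are mapped to descendants of t2."""
    s1, s2 = descendants(t1), descendants(t2)
    if not s1 and not s2:
        return 0.0  # assumption: two leaves share no descendants
    partner = {id(a): id(b) for a, b in mappings}
    targets = {id(n) for n in s2}
    common = sum(1 for n in s1 if partner.get(id(n)) in targets)
    return 2.0 * common / (len(s1) + len(s2))

# Two small trees whose first children have already been matched.
t1 = Node("if", [Node("cond"), Node("call_a")])
t2 = Node("if", [Node("cond"), Node("call_b")])
M = [(t1.children[0], t2.children[0])]
score = dice(t1, t2, M)
```

Here one of the two descendants on each side is matched, so the score is 2 × 1 / (2 + 2) = 0.5.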

In the bottom-up phase, the process involves traversing the two sub-trees from their leaves to their roots to identify the highest matching nodes, known as container mappings. A match between the two parent nodes is established during this phase if certain criteria are met.

1. The two nodes do not appear in M generated from the top-down phase.

2. The dice score of the two nodes has a value larger than minDice, as in the expression below:

\[\text{diceFunction}(t_1, t_2, M) > minDice \quad (2.13)\]

3. For nodes with multiple matches satisfying the above condition, only the mapping with the highest dice score is chosen and added to M.

The container mappings undergo additional processing to identify matching descendants by first eliminating all existing matches. An edit script, excluding the move action, is then generated for sub-trees with a height less than maxSize, along with the corresponding node mappings. New mappings are added to M only when the nodes in these mappings share identical labels.

Edit scripts will be generated from the source tree to the destination tree using mappings created in the previous phases, employing an edit script generation algorithm like RTED, which is also utilized in the original GumTree paper. This edit script serves as a representation of our source code and acts as input for downstream patch generation modules.

Byte Pair Encoding

In natural language processing (NLP) tasks, input text is often represented as vectors of tokens derived from words in the dataset. This approach can also be applied to source code when treated as text. However, a significant issue arises when the dataset's vocabulary fails to encompass all potential words encountered during model deployment. While character-level tokens can be utilized as an alternative, this method risks losing the semantic properties inherent in words. Therefore, the objective of Byte Pair Encoding (BPE) is to generate tokens that balance two goals:

• Retaining the semantic features of the token, that is information per token.

• Tokenizing without demanding a very large vocabulary with a finite set of words.

To illustrate Byte Pair Encoding (BPE), we will use an example from Wikipedia. The original data is "aaabdaaabac". The algorithm identifies the most frequently occurring byte pair, "aa", and replaces it with a byte that does not occur in the data, such as "Z". Below is the data and the corresponding replacement table.

ZabdZabac
Z = aa

Then iterate the above steps and place the most frequently occurring byte pair in the table:

ZYdZYac
Y = ab
Z = aa

XdXac
X = ZY
Y = ab
Z = aa

The algorithm halts when no byte pairs appear more than once. To decompress the data, we reverse the replacements made during compression.
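The compression loop can be sketched in a few lines of Python. This is a simplified illustration of the byte-pair replacement idea on strings, not the tokenizer variant used by language models; the function names and the pool of replacement symbols are assumptions, and the symbols are assumed not to occur in the input. When two pairs are equally frequent, this sketch picks the one seen first, so its intermediate table may differ from a hand-worked example even though the procedure is the same.

```python
def bpe_compress(data, symbols="ZYXWVUTSRQ"):
    """Repeatedly replace the most frequent adjacent pair with a new symbol."""
    table = []  # (symbol, pair) entries in the order they were created
    for symbol in symbols:
        counts = {}
        for a, b in zip(data, data[1:]):
            counts[a + b] = counts.get(a + b, 0) + 1
        if not counts:
            break
        pair = max(counts, key=counts.get)
        if counts[pair] < 2:
            break  # no pair occurs more than once: stop
        data = data.replace(pair, symbol)
        table.append((symbol, pair))
    return data, table

def bpe_decompress(data, table):
    """Undo the replacements in reverse order of creation."""
    for symbol, pair in reversed(table):
        data = data.replace(symbol, pair)
    return data

compressed, table = bpe_compress("aaabdaaabac")
restored = bpe_decompress(compressed, table)
```

Decompression walks the table backwards because later symbols may stand for pairs that contain earlier symbols.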

Source code embeddings

CodeBERT

CodeBERT is a bimodal model designed to generate general-purpose vector representations by training on both code segments and their corresponding documentation. A significant aspect of CodeBERT's setup is its training on a diverse dataset, which includes a mixture of programming languages without any specific indicators to differentiate between them.

CodeBERT processes two types of data through its input model, utilizing a standard tokenization pipeline for both segments. The resulting token sequences are enhanced with special tokens to create a comprehensive input sequence. The model outputs dense vector representations for both code and word tokens, as well as the [CLS] representation. To achieve this, CodeBERT is trained on two key learning objectives: masked language modeling and replaced token detection.

2.6.1.1 Masked Language Modeling

This objective uses the bimodal data of natural language and programming language in order to learn to predict the tokens that are masked out in the input sequence, denoted by the [MASK] token. The loss function for this objective is written as follows:

\[\mathcal{L}_{MLM}(\theta) = \sum_{i \in m_w \cup m_c} -\log p^{D_1}(x_i \mid w^{masked}, c^{masked})\]

The model parameters, denoted as θ, are optimized to minimize this loss, which is equivalent to maximizing the probability $ p^{D_1} $ that the model assigns to the original tokens at the masked positions $ m_w $ and $ m_c $, where $ w $ is the natural language segment and $ c $ is the code segment, both given in masked form.

2.6.1.2 Replaced Token Detection

In this objective, the model learns to detect replaced tokens, as illustrated in Figure 2.8, with the loss function described in equation 2.16, where

\[\delta(i) = \begin{cases} 1 & \text{if } x_i^{corrupt} = x_i \\ 0 & \text{otherwise} \end{cases}\]

In this context, let $p_D$ represent the probability that the discriminator identifies the token at position $i$ as original. The function $\delta$ serves as an indicator, returning 1 when the token at position $i$ is the original token and 0 when it has been replaced. This approach contrasts with Generative Adversarial Networks (GANs), as we assign a "real" value to the token label when the generator happens to produce the correct token.

Figure 2.8: CodeBERT architecture for replaced tokens detection task

UnixCoder

2.6.2.1 Abstraction of UnixCoder

UnixCoder [4] is another large programming language model that leverages the AST representation of code segments, illustrated in Figure 2.9, along with their comments to learn embeddings through two pre-training tasks.

Figure 2.9: A Python code with its comment and AST

The initial task involves contrastive learning, where the model aims to minimize a cosine loss that quantifies the similarity among all vectorized inputs within a training batch. This process utilizes a model that integrates the flattened Abstract Syntax Tree (AST) representation of a code segment along with its corresponding descriptive comment.

Figure 2.10: Input for contrastive learning task of UnixCoder

The second pre-training task is conditional text generation, in which the model learns to generate the respective comment from the flattened AST of the code segment.

3 The state-of-the-art program repair approaches

The automatic generation of patches for vulnerable code has been a long-standing area of research. Traditionally, the dominant method was template-based, utilizing templates extracted from datasets to create patches for specific vulnerabilities. However, with the rise of deep learning models leveraging neural networks, which have shown exceptional performance in fields like computer vision and natural language processing, researchers have increasingly focused on neural network-based methods for automatic bug repair. These methods have demonstrated superior performance compared to traditional template-based approaches. This section will explore recent and significant methods in both categories, starting with the template-based approach and followed by generative methods using neural networks.

Template-based approach

Methods in this category focus on identifying patterns that can be used to generate patches. These patterns can be derived from mining source code datasets and their corresponding patches, or from predefined rules established by engineers. Each method employs a distinct set of fix patterns for patch generation, which complicates the evaluation and comparison of these methods, as the quality of the defined fix patterns directly influences the quality of the generated patches. However, the authors of TBAR have categorized these fix patterns into sixteen groups, detailing their properties across four qualitative dimensions.

• Change Action: What high-level operations are applied on a buggy code entity? The operations are categorized into Update, Delete, Insert, and Move.

Update operations replace faulty code with corrected versions, while delete operations eliminate problematic code from the program. Additionally, insert operations add any missing code entities, and move operations reposition faulty code to more appropriate locations within the program.

• Change Granularity: What kinds of code entities are directly impacted by the change actions? The entity can be an entire Method, a whole Statement, or an Expression within a statement.

• Bug Context: What specific AST nodes of code entities are used to match fix patterns?

• Change Spread: How many statements are impacted by each repairing pattern?

VuRLE is a significant system that employs a template-based approach consisting of two main phases: the Learning Phase and the Repair Phase. During the Learning Phase, the system analyzes training data to develop repair templates, which are then used to generate patches in the subsequent phase. This method represents source code using a tree structure, created with GumTree. An overview of the two-phase workflow and the essential steps for data transformation in each phase is illustrated in Figure 3.1.

The VuRLE workflow begins with the Learning Phase, where it mines data to generate patch templates. These templates undergo further refinement in subsequent phases to create the respective patches.

Extracting edit blocks: Pairs of buggy code and its patch are fed into GumTree to create the edit sequences that fix the buggy code.

Constructing edit groups involves using edit sequences to create graphs by forming edges between pairs of sequences that share the longest overlapping sub-sequence. These graphs are subsequently divided into connected components.

DBSCAN is used to cluster these components into edit groups.

Template generation involves creating a template for each pair of edit sequences within the edit groups by identifying the longest overlapping edit sub-sequence along with its context. The editing context, which indicates the locations of edit operations in the code segments, is also derived from GumTree.
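The graph construction step of the learning phase can be sketched as follows. This is a minimal illustration, not VuRLE's implementation: edit operations are represented as plain strings, each sequence is linked to the partner(s) with which it shares the longest common subsequence, and the names `lcs_length` and `connected_components` are hypothetical. The subsequent DBSCAN clustering is omitted.

```python
from itertools import combinations

def lcs_length(a, b):
    """Length of the longest common subsequence of two edit sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def connected_components(edit_sequences):
    """Link each sequence to its best-overlapping partner(s), then split the graph."""
    n = len(edit_sequences)
    overlap = {(i, j): lcs_length(edit_sequences[i], edit_sequences[j])
               for i, j in combinations(range(n), 2)}
    parent = list(range(n))  # union-find forest over sequence indexes

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        scores = {j: overlap[(min(i, j), max(i, j))] for j in range(n) if j != i}
        best = max(scores.values(), default=0)
        for j, score in scores.items():
            if best > 0 and score == best:  # assumption: no edge without overlap
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

sequences = [
    ["insert:if", "update:call", "delete:return"],
    ["insert:if", "update:call"],
    ["move:loop", "insert:decl"],
    ["move:loop"],
]
print(connected_components(sequences))
```

In this toy input, the first two sequences share two edit operations and the last two share one, so the graph splits into two components.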

During the repair phase, we utilize the templates created in the learning phase to address unseen bad code segments. By assessing the similarity between these segments and the known bad code from our dataset, we identify the most suitable templates. Subsequently, we refine the selected template to ensure it accurately aligns with the specific bad code in question.

Selecting templates: Templates are selected by comparing the input code with the edit groups' templates mined in the learning phase.

Generating patches: The transformative operations specified in the templates' edit patterns are applied to the input code to create code patches, and only patches that do not contain redundant code are kept.

Generative-based approach

SeqTrans

SeqTrans [2] employed transfer learning to adapt a model initially trained for bug repair to the task of vulnerability repair, which requires a distinct dataset for each task. In the experiments conducted, the model was first trained on the Tufano [31] dataset, which focuses on bug repair, and subsequently fine-tuned using the Ponta [32] dataset. Both datasets consist of Java source code.

Before going into the details of this method, let us look at the general design of the architecture in Figure 3.2.

Tokenization and normalization are essential processes in code analysis. While SeqTrans does not utilize a tree representation for source code, it employs the GumTree algorithm to align the Abstract Syntax Tree (AST) nodes of both the source code and patches. This alignment enables the extraction of diff contexts using the commercial tool Understand. Each sample from the bug repair and vulnerability repair datasets is represented as a pair of code segments.

$CP = (st_{src}, st_{dst})_1, \ldots, (st_{src}, st_{dst})_n$

The code pairs are refined to create def-use chains, which represent the assignment of values to variables, encompassing all variable definitions from the vulnerable statement. Figure 3.3 illustrates a sample input of code pairs for the model, where all global variable definitions and statements that have dependencies on the vulnerable statements are retained, while other statements within the same method are excluded.

$CP = ((def_1, \ldots, def_n, st_{src}), (def_1, \ldots, def_n, st_{dst}))_1, \ldots, ((def_1, \ldots, def_n, st_{src}), (def_1, \ldots, def_n, st_{dst}))_n$

After creating the code pairs dataset, each code segment is normalized to minimize the vocabulary size of the dictionary, which influences the output vector size and the probability of each token being predicted. This normalization simplifies the model's training process by converting literals and strings into numerical representations ($num_1, \ldots, num_n$) and variable names into a standardized format ($var_1, \ldots, var_n$).

During the normalization process, "placeholders" will be substituted with their actual values using generated mappings. At this stage, the input is prepared for tokenization using Byte Pair Encoding, in conjunction with the dataset's dictionary.
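To make the normalization and the placeholder mapping concrete, the sketch below replaces literals with `num_i` and identifiers with `var_i` and keeps a mapping so that the original values can be restored. It is only an illustration under simplifying assumptions: the regular expression, the small `KEYWORDS` set, and the function names are hypothetical, and method names are treated like variables, which SeqTrans itself may handle differently.

```python
import re

# Hypothetical, deliberately small keyword set; a real tool would use the full grammar.
KEYWORDS = {"int", "return", "if", "else", "for", "while", "new", "null", "String"}
# String literal | number | identifier | any other single non-space character.
TOKEN = re.compile(r'"[^"\n]*"|\d+(?:\.\d+)?|[A-Za-z_]\w*|\S')

def normalize(code):
    """Replace literals with num_i and identifiers with var_i; return tokens and mapping."""
    mapping, out = {}, []
    counters = {"num": 0, "var": 0}
    for tok in TOKEN.findall(code):
        if tok[0] == '"' or tok[0].isdigit():
            kind = "num"
        elif (tok[0].isalpha() or tok[0] == "_") and tok not in KEYWORDS:
            kind = "var"
        else:
            out.append(tok)  # keywords and punctuation are kept unchanged
            continue
        if tok not in mapping:
            counters[kind] += 1
            mapping[tok] = f"{kind}_{counters[kind]}"
        out.append(mapping[tok])
    return out, mapping

def denormalize(tokens, mapping):
    """Substitute placeholders with their original values using the mapping."""
    inverse = {placeholder: original for original, placeholder in mapping.items()}
    return [inverse.get(tok, tok) for tok in tokens]

code = 'int size = 10 ; buf = alloc ( size , "x" ) ;'
normalized, mapping = normalize(code)
print(" ".join(normalized))
```

Because repeated identifiers reuse the same placeholder, the vocabulary seen by the model shrinks while the mapping preserves enough information to rebuild the concrete patch.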

SeqTrans utilizes transformer modules as its foundational elements, maintaining a consistent architecture throughout both the pre-training and fine-tuning phases.

Figure 3.4 illustrates the differences in the dataset, batch size, and training steps for the normalized code segment. The pre-training model uses a batch size of 4096 over 300,000 steps, while the fine-tuning model continues with an additional 30,000 steps at the same batch size. SeqTrans is implemented with OpenNMT, which provides a low-code solution for configuring the model architecture through configuration files, with key configurations detailed by the authors in reference [2].

• Size of hidden transformer feed-forward: 2048

VRepair

The architecture of VRepair closely resembles that of SeqTrans, as it draws inspiration from the latter and applies its methodology to various datasets. The primary distinctions between the two methods lie in their preprocessing techniques and source code representation, in addition to the selected datasets. For SeqTrans, code pairs are generated using GumTree and are then refined to extract reference chains before tokenization.

In VRepair, source code is handled just like natural language, and the tokenizing process is applied directly to the vulnerable code segments and their respective patches, with additional special tokens in both the buggy code segments and patches.

Figure 3.5: The VRepair pipeline

Figure 3.5 illustrates the pre-training and fine-tuning processes of VRepair, which are similar to those of SeqTrans [2] except for the preprocessing step. The authors justify this design choice with the following arguments.

• Additional tokens help identify problematic areas in the input code, simplifying the generative task by allowing the model to focus solely on generating the modified segments.

• Representing multiple changes to a function allows vulnerability fixes that span multiple lines within a single code block, making the solution more robust than [1] [24] [33].

• Decreasing the length of the generated output sequence also reduces both training cost and inference cost.

Using embeddings generated by CodeBERT can significantly enhance vulnerability repair efforts. Our hypothesis is supported by the success of large-scale embeddings in various low-resource natural language processing tasks. With the advent of extensive code language models trained on vast datasets for tasks like code generation and masked token prediction, we can utilize these embeddings for vulnerability repair models that operate with limited data. This approach is grounded in the premise that upstream and downstream tasks share similar objectives and data characteristics. While vulnerabilities represent a challenging type of bug that is often exploited for security breaches, the tasks of identifying and generating patches exhibit notable similarities.

In our experiments, we treat source code as plain text, which is processed by preprocessing modules before being input into the model. This approach is justified by the similarities in sequential and structural information found in both programming and natural languages. Empirical results demonstrate that this representation is effective for various tasks, including bug detection, code summarization, and code prediction, and that it applies to both discriminative and generative models.

To validate the proposed hypothesis, we perform experiments in two phases using the dataset from VRepair. We then compare the outcomes of the models from both phases, which are based on OpenNMT. The experimental pipeline for these phases is illustrated in Figure 4.1.

We conducted experiments based on the findings of VRepair, utilizing a smaller-scale network due to our limited computational resources. The experiments were performed on a transformer-based neural translation network developed with the OpenNMT-py framework.

We propose a novel pipeline that utilizes embeddings generated from programming language models as input for the VRepair architecture. This new approach is compared to the previous pipeline, with the embeddings being processed through the same neural translation network employed in the initial phase.

Figure 4.1: Design of our pipeline

This section presents the setting of our experiment, including the dataset, performance metrics, data preparation, and results.

Datasets

The dataset provided by VRepair comprises two existing datasets, Big-Vul and CVEfixes.

The Big-Vul dataset, created by crawling CVE databases, contains 3,754 vulnerabilities across 348 projects, categorized into 91 different CWE IDs, covering a time frame from 2002 to 2019. In contrast, the CVEfixes dataset includes 5,365 vulnerabilities across 1,754 projects, categorized into 180 CWE IDs, with data spanning from 1999 to 2021. For this research, we focus exclusively on the Big-Vul dataset to narrow the scope of our experiments, with plans to explore a more diverse dataset in future studies.

To effectively train and validate our experiments, we divided the dataset into three distinct parts: 70% for training, 10% for validation, and 20% for testing. Specifically, in the Big-Vul dataset, this results in 2,228 samples allocated for training, 318 samples for validation, and 636 samples for testing.
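The reported subset sizes can be reproduced with simple integer arithmetic. The sketch below assumes that the validation and test sizes are rounded down and that the training set takes the remainder; this rounding convention and the function name `split_sizes` are assumptions, and 3,182 is simply the sum of the three reported sizes.

```python
def split_sizes(n_samples, val_pct=10, test_pct=20):
    """Return (train, validation, test) sizes; the training set takes the remainder."""
    n_val = n_samples * val_pct // 100    # rounded down
    n_test = n_samples * test_pct // 100  # rounded down
    return n_samples - n_val - n_test, n_val, n_test

# 3,182 = 2,228 + 318 + 636, the total of the reported subsets.
print(split_sizes(3182))  # (2228, 318, 636)
```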

Metrics of performance

Selecting appropriate metrics is crucial for measuring our models and understanding our results. The OpenNMT-py framework automatically reports perplexity (PPL) and accuracy during both training and validation. For translation tasks, PPL is particularly significant, because early-stopping conditions can be based on it. Perplexity quantifies the network's uncertainty regarding the correctness of its predictions, with lower PPL indicating less uncertainty and higher PPL indicating more. This metric is widely used to evaluate language models in natural language processing (NLP). According to Luong et al., translation quality correlates strongly with PPL; a model with low PPL is likely to produce higher-quality translations. PPL is mathematically defined as the inverse of the joint probability of all words in a generated document, normalized by the total number of words in that document.
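As a small worked example of this metric, the sketch below uses the standard formulation $\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_i\right)$, where $p_i$ is the probability the model assigns to the $i$-th reference token. The function name and the input format (a plain list of probabilities) are illustrative assumptions and do not reflect how OpenNMT-py computes the value internally.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to each reference token."""
    n = len(token_probs)
    log_likelihood = sum(math.log(p) for p in token_probs)
    return math.exp(-log_likelihood / n)

# A model that spreads its belief evenly over four candidates at every step
# is as uncertain as a uniform choice among four tokens.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

A perfectly confident and correct model assigns probability 1 to every token and reaches the minimum perplexity of 1.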

Another metric that OpenNMT-py reports automatically is accuracy, which measures the number of correctly predicted tokens.

The accuracy metric, calculated from the number of tokens in the predicted output sequence $\hat{Y}$ that match the target sequence $Y$, often fails to provide meaningful insight into model performance. This is because 100% accuracy can be achieved even when the positions of tokens in the predicted sequence differ from those in the target sequence, which highlights the limitation of using accuracy as the sole measure of effectiveness.

The BLEU score is a key metric for assessing the quality of text in machine translation, focusing on the precision of predicted tokens within n-length subsequences. This precision is determined by counting the number of words in the predicted tokens that match those in the target sequence. The calculation of the BLEU score is represented by a specific equation, where N denotes the length of the subsequence in the predicted sequence.
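The n-gram precision at the core of this metric can be sketched as follows. This is only the clipped precision for a single n and a single target sequence; the full score additionally combines several values of n and applies a brevity penalty, which are left out here. The function name is hypothetical.

```python
from collections import Counter

def ngram_precision(predicted, target, n):
    """Clipped n-gram precision of a predicted token list against one target."""
    pred = Counter(tuple(predicted[i:i + n]) for i in range(len(predicted) - n + 1))
    ref = Counter(tuple(target[i:i + n]) for i in range(len(target) - n + 1))
    total = sum(pred.values())
    if total == 0:
        return 0.0
    # Each predicted n-gram is credited at most as often as it occurs in the target.
    matched = sum(min(count, ref[gram]) for gram, count in pred.items())
    return matched / total

predicted = "memset ( input , 0 )".split()
target = "memset ( input , 0 , size )".split()
print(ngram_precision(predicted, target, 1))  # 1.0
print(ngram_precision(predicted, target, 2))  # 0.8
```

In this example every predicted token occurs in the target, but the bigram `0 )` does not, so the bigram precision drops to 4 out of 5.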

Preprocessing the code as plain text

In our experiments, we treat the source code and its patch as plain text and tokenize them using a byte-pair encoding algorithm. Prior to tokenization, we apply an additional preprocessing step that introduces two special tokens to the original dataset, in both the input and target sequences. This step is essential before extracting embeddings from the data using programming language models.

Figure 5.1: Sample of buggy code and its patch

In the input sequence presented in Figure 5.2, the tags <StartLoc> and <EndLoc> are included to mark the identified vulnerable location, along with an additional indicator "CWE-xxx" that specifies the type of vulnerability.

• For the target sequence shown in 5.3, we use two new unique tokens

To create a patch, the target sequence should only include the necessary modifications, which can be categorized into three types. These modifications correspond to three distinct formats of the target sequence, as outlined in Figure 5.4, each representing a specific type of change made to the input sequence.

Figure 5.4: Syntax of the output sequence
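The input-side tagging described above can be sketched as follows. The placement of the CWE indicator at the start of the sequence, the tag spelling `<StartLoc>` and `<EndLoc>`, and the function name are assumptions made for illustration; the code in the example is a made-up snippet, not a sample from the dataset.

```python
def tag_vulnerable_span(tokens, start, end, cwe_id):
    """Prefix the CWE label and wrap tokens[start:end] in location tags."""
    return ([cwe_id] + tokens[:start] + ["<StartLoc>"] + tokens[start:end]
            + ["<EndLoc>"] + tokens[end:])

tokens = "int n = read ( ) ; buf [ n ] = 0 ;".split()
# Hypothetical example: the second statement (tokens 7..13) is the vulnerable location.
print(" ".join(tag_vulnerable_span(tokens, 7, 14, "CWE-119")))
```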

Extracting embeddings from large language models for code

The embeddings from language models are organized in a look-up table corresponding to the vocabulary of the corpus, which requires extracting the vocabulary, including the newly added tokens. Each word in this vocabulary is represented by a 768-dimensional vector, reflecting the semantic representations acquired through the pretraining tasks of programming language models.

We utilize two distinct large language models, CodeBERT and UnixCoder, to extract embeddings from our vulnerability dataset. CodeBERT is trained on a diverse dataset that includes both natural and programming languages, focusing on tasks like code summarization. In contrast, UnixCoder is exclusively trained on programming languages, optimizing its performance for auto-regressive tasks such as code completion. This fundamental difference in training data reflects their respective strengths in handling various programming-related tasks.

The extracted embeddings for the entire vocabulary are stored in a look-up table, which serves as input for training the downstream vulnerability repair translation model. It is important to note that the programming language models used in our experiments have their own tokenizers and dictionaries. Consequently, the input tokenized by OpenNMT's tokenizer may undergo further tokenization in these language models, resulting in output tensors with a shape of $n \times 768$, where $n$ represents the number of tokens generated from the input. For instance, the token "word1" could be further tokenized into "subword1" and "subword2", leading to an output tensor of size $2 \times 768$. To create a $1 \times 768$ embedding for a single token in our dictionary, we employ two methods to aggregate the output tensors from the language models.

In the first method, we calculate the mean of the output tensor along the second dimension. A tokenizer maps each vocabulary entry from the vulnerability dataset to the corresponding indexes in the programming language model's dictionary. These indexes are fed into the language model to generate the output tensor, from which we take the mean along the specified dimension.

To extract embeddings, first tokenize each vocabulary entry into code tokens. Convert these tokens into their corresponding IDs and obtain the context embeddings by passing the tensor of token IDs through the model. Finally, compute the mean of the context embeddings along the specified dimension and store the result in the vulnerability embeddings array at the given index.

In the second method, when inputting a token from our dictionary into the language models, we concatenate it with a special token called [cls], which intuitively captures the semantic information of all tokens. This approach allows us to use only the first row of the language model's output as the embedding. The code for this method is similar to the first approach, with the key distinction being the inclusion of the additional [cls] token for embedding purposes.

The embeddings are generated by concatenating the tokenized vocabulary, starting with the classifier token. The token IDs are then converted and passed through the model to obtain context embeddings. Finally, the resulting embeddings are stored in the variable `vul_embs` at the specified index.
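The two aggregation methods can be contrasted in a short, dependency-free sketch. It is not the code used in our experiments: the language model is replaced by two toy stand-ins (`toy_tokenize` and `toy_encode`, both hypothetical) that emit 2-dimensional vectors instead of 768-dimensional ones, so that only the aggregation logic is shown. In practice the `tokenize` and `encode` arguments would wrap the tokenizer and model loaded from the Hugging Face hub.

```python
def mean_embedding(subword_vectors):
    """Method 1: average the n x d model output over its n sub-token rows."""
    n = len(subword_vectors)
    return [sum(column) / n for column in zip(*subword_vectors)]

def cls_embedding(output_with_cls):
    """Method 2: keep only the first row, which belongs to the prepended [cls] token."""
    return list(output_with_cls[0])

def build_lookup(vocab, tokenize, encode, use_cls=False):
    """Map every vocabulary entry to a single d-dimensional vector."""
    table = {}
    for word in vocab:
        pieces = tokenize(word)  # the language model may split one entry further
        if use_cls:
            table[word] = cls_embedding(encode(["[cls]"] + pieces))
        else:
            table[word] = mean_embedding(encode(pieces))
    return table

# Toy stand-ins (assumptions): split on "_" and encode each piece by its length.
def toy_tokenize(word):
    return word.split("_")

def toy_encode(pieces):
    return [[float(len(piece)), 1.0] for piece in pieces]

print(build_lookup(["buf_size"], toy_tokenize, toy_encode))
print(build_lookup(["buf_size"], toy_tokenize, toy_encode, use_cls=True))
```

With a real contextual model, the first row of the output depends on all input tokens, which is the motivation for using it as a summary in the second method; the toy encoder above does not model this dependency.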

Environment

Our experiments run on a machine equipped with 32GB of RAM and an NVIDIA Quadro RTX 6000 with 24GB GDDR6. We employ the OpenNMT-py framework for training, prediction, and vocabulary creation, both for embedding extraction and for our translation model. This framework is a neural machine translation solution built on PyTorch. All programming language models used in our experiments are available through the Hugging Face hub.

Results

Table 5.1 presents the results from experiments replicating the VRepair pipeline with a downscaled transformer architecture. A comprehensive set of hyperparameters used in these experiments is detailed in the section below. We conducted multiple trials with various configurations of some hyperparameters, as referenced in [15], to gain insights into the architecture's performance when training without embeddings from CodeBERT [3] and UnixCoder [4]. The model's performance is evaluated using token-level accuracy, perplexity, and training time, with the best configuration achieving a token-level accuracy of 50.229%. However, the high perplexity value suggests that the models exhibit uncertainty in their predictions.

Table 5.1: Experiments replicating the VRepair pipeline

Learning rate | Hidden size | Sequence length | Validation accuracy (%) | Validation perplexity | Training time (s)

The pipeline utilizes embeddings from CodeBERT and UnixCoder for the downstream task, as shown in the results. In these experiments, we selected a single set of hyperparameters: a learning rate of 0.0005, a hidden size of 768, and a sequence length of 2000. This choice allows us to isolate the impact of embeddings on the downstream model's performance. Additionally, we reduced the training iterations from 100,000 in previous experiments to 20,000 for the same reason.

In section 5.4, we utilized a programming language model to obtain word representations, noting that each model's unique input specifications result in varying outputs. The experiments labeled with postfix (1) employed embeddings derived from aggregating the language model's output tensor, while those with postfix (2) utilized the [cls] token as the embedding. Consistent with the initial phase, we reported results based on token-level accuracy, perplexity, and training time in the second phase. Additionally, we assessed the models' ability to generate accurate patches that align with the samples' labels. The findings indicate that using embeddings from CodeBERT via the latter method slightly enhances performance.
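The exact-match count used to assess generated patches can be sketched as below. Comparing whitespace-separated tokens rather than raw strings is an assumption made for this illustration, as is the function name; the example predictions are made up.

```python
def exact_match_count(predictions, targets):
    """Count predictions whose token sequence equals the target exactly."""
    return sum(p.split() == t.split() for p, t in zip(predictions, targets))

predictions = ["stride ) ; memset ( input ,", "x  y", "foo"]
targets = ["stride ) ; memset ( input ,", "x y", "bar"]
print(exact_match_count(predictions, targets))  # 2
```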

Table 5.2: Experiments with embeddings as input

Embedding | Validation accuracy (%) | Validation perplexity | Training time (s) | EM (out of 316) | Bleu-score

The following code snippet illustrates the ideal patches produced by the models, utilizing samples from the validation dataset as outlined in section 5.3. In this instance, the prediction sequence shows that the generated patch will insert "memset" between "stride);" and "(input," at every occurrence in the original code that matches this pattern.

stride ) ; memset ( input ,
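How such a prediction is turned into a concrete patch can be sketched as follows, assuming that the predicted sequence has already been split into a left context, the new tokens, and a right context. The function name and the sample input are hypothetical; the sketch covers only the insertion case.

```python
def apply_insertion(tokens, left_ctx, new_tokens, right_ctx):
    """Insert new_tokens wherever left_ctx is immediately followed by right_ctx."""
    if not left_ctx:
        raise ValueError("left context must not be empty")
    out, i = [], 0
    l, r = len(left_ctx), len(right_ctx)
    while i < len(tokens):
        if tokens[i:i + l] == left_ctx and tokens[i + l:i + l + r] == right_ctx:
            out.extend(left_ctx)
            out.extend(new_tokens)
            i += l  # the right context is copied by the following iterations
        else:
            out.append(tokens[i])
            i += 1
    return out

source = "foo ( stride ) ; ( input , 0 ) ; bar ( stride ) ; ( input , 1 ) ;".split()
patched = apply_insertion(source, ["stride", ")", ";"], ["memset"], ["(", "input", ","])
print(" ".join(patched))
```

Both occurrences of the pattern in the toy input receive the inserted token, which mirrors the "every occurrence" behavior described above.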

The experiments indicate that using pre-trained embeddings does not enhance model performance or reduce training time compared to training from scratch. We suggest that the similarity between code repairing and vulnerability repairing tasks is insufficient for embeddings to effectively transfer information and improve the vulnerability repairing model's training process. In the second phase, the use of embeddings resulted in only a slight increase in BLEU score and exact match compared to the vanilla pipeline, with a BLEU score of 30 in Table 5.2 being considered acceptable. Additionally, the high perplexity observed in both phases suggests that the models exhibit uncertainty in their predictions, indicating that the probability of correctly predicted tokens is not significantly higher than that of others.

Discussions of the results

Code repairing is an emerging research area that employs both template-based and generative methods to automatically generate code patches. Most researchers concentrate on generic bugs due to the availability of large datasets that can be utilized by deep learning networks, yielding significant advancements. In contrast, vulnerabilities in source code, while also classified as bugs, are more challenging to identify and fix, as exploiting these security flaws requires more time and effort. Consequently, the labeled datasets for vulnerable source code are limited compared to those for generic bugs, restricting the application of deep learning in vulnerability repair. This thesis aims to address this issue by leveraging insights gained from a larger dataset of programming languages to enhance downstream vulnerability repair models, thereby mitigating the challenges posed by the small dataset.

To this end, we used embeddings extracted from CodeBERT and UnixCoder as input for a transformer-based network, which serves as the vulnerability repairing model, to generate patches for the vulnerability.

Our experiments suggest that the embeddings utilized do not significantly enhance performance on the task. Additionally, we found that similar experiments have been conducted by other researchers [43] on the task of vulnerability detection, and they reached the same conclusion as our experiments.

Main Contribution

This thesis addresses the challenge of handling the lack of labeled datasets in deep learning for automatic vulnerability patch generation. To tackle this issue, we established specific objectives that guide our research efforts. By achieving these objectives, we aim to provide solutions to the problem statement and contribute valuable insights to the field.

• What do we know about machine learning in vulnerability repair and code repairing?

• How effective are existing generative-based methods for the problems of code repairing and vulnerability repairing?

• Can code embedding extend the capabilities of these methods?

Our research on code repairing, focusing on template-based and generative methods, reveals that recent studies primarily address vulnerability as a type of bug by learning patterns from datasets represented as token lists or abstract syntax trees. We successfully replicated VRepair, training a transformer-based model to generate patches for a small dataset, achieving an average accuracy of 50%, which is promising. To enhance performance, we utilized embeddings from CodeBERT and UnixCoder to transfer knowledge from a larger dataset to the vulnerability-repairing task; however, the results indicated no significant improvements from this approach.

Title	Application Of Machine Learning On Automatic Program Repair Of Security Vulnerabilities
Author	Nguyen Ngoc Hai Dang
Supervisors	Assoc. Prof. Dr. Huynh Tuong Nguyen, Assoc. Prof. Dr. Quan Thanh Tho
University	Ho Chi Minh City University of Technology
Major	Computer Science
Type	Master's thesis
Year of publication	2023
City	Ho Chi Minh City
