Literature review on vulnerability detection using NLP technology

Jiajie Wu
School of Computer
Hangzhou University of Electronic Science and Technology
Hangzhou, Zhejiang, China
wujaijie@hdu.edu.cn
arXiv:2104.11230v1 [cs.CR] 23 Apr 2021
Abstract: Vulnerability detection has always been one of the most important tasks in the field of software security. With the development of technology and in the face of massive amounts of source code, the automated analysis and detection of vulnerabilities has become a research hotspot. For special text files such as source code, using some of the most popular NLP technologies to build models that automatically analyze and detect vulnerabilities has become one of the most anticipated lines of research in vulnerability detection. This article briefly surveys some recent literature and technologies, such as CodeBERT, and summarizes earlier technologies.

Index Terms: vulnerability detection, code intelligence, deep learning, CodeBERT, NLP

I. INTRODUCTION

In recent years, as software technology has matured, more and more software has been developed. While people enjoy the convenience brought by software, they are also threatened by software vulnerabilities. Software vulnerabilities are one of the biggest problems threatening the normal operation of software. For software users, the direct and indirect economic losses caused by software vulnerabilities worldwide have exceeded tens of billions of dollars. It is an indisputable fact that most software contains vulnerabilities of various kinds. There are many types of software vulnerabilities, such as CVE-2015-8558 [1], which are explained in detail on CVE [2]. The longer a vulnerability exists, the easier it is for attackers to exploit it, and the greater the damage to the company or organization [3]. The ability to detect vulnerabilities in software automatically and within a limited time frame has therefore become one of the hottest research topics at the moment.

How can the automatic detection of vulnerabilities be made more accurate? Deep learning technology offers a possibility. With the continuous development of deep learning in recent years, great progress has been made in the field of natural language processing (NLP). In particular, models such as GPT [4] and BERT [5] have taken NLP technology a big step forward. Source code is essentially text in a special format, so it is logically feasible to use NLP technology for code processing. In fact, in modern code intelligence, models such as CodeBERT [6] have already been proposed, some code-level tasks have been solved, and certain results have been achieved. These results show that using NLP technology for automatic vulnerability detection (one of the code intelligence tasks) has a lot of room for development. We give a detailed introduction in Section III.

The remaining sections are arranged as follows: Section II introduces the development of NLP, and Section III introduces the latest developments of NLP technology in vulnerability detection.

II. THE DEVELOPMENT OF NLP TECHNOLOGY

A. Natural Language Processing

Natural language processing (NLP) is the use of computers to model human natural language in order to solve problems that involve natural language. In NLP, the problems to be solved can be divided into two categories:

   • One is natural language understanding (NLU) problems, including text classification [7], named entity recognition [8], [9], relation extraction [10], reading comprehension, etc. [11]–[13];
   • The second is natural language generation (NLG) problems, including machine translation [14]–[16], text summary generation [17], [18], automatic question answering systems [19], [20], image caption generation [21]–[23], etc.

When NLP researchers studied and solved these two types of problems, they found that the underlying problems are basically the same, such as the embedding representation of vocabulary. Researchers are now more inclined to use a unified model for modeling (pre-training stage) and then adjust the model according to the specific problem (fine-tuning stage). Research at this stage has made great progress. It is believed that in the near future, machines can truly understand human language and even understand human thinking.

Since the 1980s, traditional NLP has increasingly relied on statistics, probability and shallow learning (traditional machine learning) [24], such as naive Bayes, hidden Markov models, conditional random fields, support vector machines and the K-nearest neighbor algorithm. These algorithms are still widely used in NLP today. But with the development of deep learning (DL), people are paying more and more attention to how to use DL models to solve problems in NLP [25].
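To make the shallow-learning methods listed above concrete, the following is a minimal sketch of a multinomial naive Bayes classifier with add-one smoothing, applied to tokenized code snippets treated as plain text. The class name `NaiveBayes`, the helper `tokenize`, the labels and the four training snippets are all hypothetical and invented for illustration; they do not come from any data set or method discussed in this review.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split text (or source code) into lowercase identifier-like tokens."""
    return re.findall(r"[A-Za-z_]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)        # documents per class
        self.token_counts = defaultdict(Counter)   # token counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        tokens = tokenize(text)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, hand-made examples (hypothetical; not taken from a real data set).
train_texts = [
    "strcpy(buf, input);",
    "gets(line);",
    "strncpy(buf, input, sizeof(buf) - 1);",
    "fgets(line, sizeof(line), stdin);",
]
train_labels = ["risky", "risky", "safer", "safer"]

model = NaiveBayes().fit(train_texts, train_labels)
prediction = model.predict("strcpy(dest, src);")
print(prediction)
```

Such a model treats every token as independent of its neighbours, which is one reason the context-aware DL models described in the rest of this section are attractive for source code.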
B. DL in NLP

The main goal of DL is to learn a deep neural network model [26]. A neural network model is composed of neurons and the edges connecting them. Each neuron takes inputs and produces an output, and the data inside the neuron can be nonlinearly transformed [27]. Following the timeline of development, we use the point at which the Transformer [28] was proposed as the dividing line. The methods that appeared before it are called basic model methods, and the later ones are called modern model methods (or attention model methods). We introduce them separately below.

Basic model method introduction:

1) Convolutional Neural Network (CNN) [29]: Thanks to the excellent abstract feature extraction ability of the convolution kernel, CNN has achieved great success in the field of computer vision (CV). In the field of NLP, CNN-based algorithms have also appeared one after another, such as [30]–[34]. In research related to vulnerability detection, some scholars have used CNN to mine vulnerabilities [35], as shown in Figure 1.

   Fig. 1. Using CNN to classify source code [35]

   Although these models use CNN as a feature extractor for text data, text data has relatively few feature dimensions, and what matters more is the close connection between contexts, which requires the model to have a "memory" function. CNN therefore does not perform very impressively on NLP tasks. However, recent research shows that, with the development of multimodal technology [36]–[38], CNN-based models have achieved good results in some code generation tasks, such as generating instructions from images [39].

2) Recurrent Neural Network (RNN) [40]: One of the characteristics of RNN is its "memory". An RNN can take serialized data as input or produce serialized data as output, so it has a natural advantage for serialized data such as text. The output of the RNN can include the information of the sequence preceding the current token, which gives the RNN its "memory" function. When processing text data, people often use a bidirectional RNN, which processes the preceding and following context of the current token separately, so that the token representation contains both sides of its context at the same time. This is very important for the model to understand the meaning of the sentence. However, models with an RNN structure cannot be processed in parallel. With today's massive data, this greatly limits the use of RNN in engineering applications. In NLP, CNN and RNN are used to extract the character-level representation of words, as shown in Figure 2.

   Fig. 2. CNN & RNN for extracting character-level representation for a word [9]

3) Long Short-Term Memory Networks (LSTM) [41]: In addition to its structural limitations, RNN cannot capture information from long text sequences because of the vanishing gradient problem [42], so scholars modified the RNN and proposed the LSTM model to alleviate this defect. LSTM is one of the models with the strongest "memory" ability in NLP so far, and it is also one of the most widely used models. However, because LSTM has complex gating logic, it consumes a lot of space and time during training. The Gated Recurrent Unit (GRU) [43] is similar in structure to LSTM but more lightweight, and its performance in training is not worse than that of LSTM. For a comparison of the three basic models CNN, GRU and LSTM in NLP applications, please refer to [44]. Since LSTM is a unidirectional model, people often stack LSTM/GRU models in two directions to obtain the context information of a token, which yields the bidirectional LSTM model (Bi-LSTM) [45]. In practical applications, the Bi-LSTM model is often used to extract the features of the sentence, and the CRF algorithm is then used to handle the downstream task [46].

4) Embedding [47]–[50]: Embedding is a technology that converts tokens into vectors in a vector space. The earliest embedding technology can be traced back to work on the distribution of words [51]; it represents a sequence of tokens in vector form as the input of a deep learning model. With the continuous development of technology, embedding can be divided into two types:
   • One is the classic non-contextual embedding technology, also called context-independent embedding in some literature, which refers to embedding that is independent of the context, such as Word2Vec [52], GloVe [53] and other models. When embedding, the contextual semantic relationship of words in the sentence is not considered. To put it simply, these models only learn the mapping of words in the vector space.
     Each word has a single fixed representation, so these models cannot represent polysemous words differently in different contexts. It is worth mentioning that out-of-vocabulary (OOV) words are often encountered in embedding technology. The common solution is to segment words further into substrings, for example with BPE [54], [55]. For the semantic analysis performance of classic models such as Word2Vec and GloVe, please refer to [56];
   • The other is contextual embedding technology, also called context-dependent embedding, such as the famous ELMo model [57], [58]. The word embeddings learned by later models such as BERT [5] are also contextual embeddings. Contextual embedding technology comprehensively considers the context information in the sentence when learning the vectorized representation of words, and integrates the context information of a single token into the representation of the word. In this way, when dealing with issues such as polysemous words, syntactic structure and semantic roles, a word can be represented differently according to the current semantic environment. For a more detailed analysis and comparison of the two technologies, please refer to [59].

Attention model method introduction:

1) Attention mechanism: The attention mechanism imitates the instinctive way people focus when observing objects. In computing, the attention mechanism essentially calculates a weight for each item and then combines all items by these weights, so that the more important items contribute more information. The attention mechanism was first applied in machine translation [60]. Due to its excellent performance, it was then widely used in other NLP tasks. It has now become very popular, and most NLP models have integrated it. In the encoder-decoder architecture in particular, the attention mechanism can be used in the encoder alone, in the decoder alone, or in both, as shown in Figure 3. In summary, the attention mechanism can be divided into 6 categories, as shown in Figure 4, of which the most common are self-attention and multi-dimensional attention. All models that apply the attention mechanism can be collectively referred to as Attention Models (AM). In addition to its application in NLP, AM has also received extensive attention in the fields of Computer Vision (CV), Multi-Modal Tasks, Graph-based Systems and Recommender Systems (RS) [61].

   Fig. 3. A comparison between the traditional encoder-decoder architecture (left) and the attention-based architecture (right) [61]

   Fig. 4. All attention types [62]

2) Transformer [28]: The Transformer is the architecture with the strongest feature extraction capability so far. In addition to the NLP field, the Transformer has also made great progress in the CV field [63]. The architecture of the Transformer is shown in Figure 5. As can be seen from the figure, the Transformer integrates the attention mechanism. From a structural point of view, the Transformer is a typical encoder-decoder structure, and its general training process is as follows:
   • On the encoder side, after the serialized tokens undergo input embedding and positional embedding, the Q, K and V matrices are generated using three weight matrices. The attention matrix is then obtained using the multi-head attention mechanism and passes through the conventional add & norm layer, fully connected layer, etc. The whole block can be stacked N times; in [28], N is 6;
   • On the decoder side, the process is roughly the same as on the encoder side. The only difference is that one more multi-head attention layer is added, that is, a second attention step is performed. In the input of this attention step, the K and V values use the output of the encoder side.
   For specific details about the Transformer and the attention mechanism, please see [64]. In [65], the authors classify Transformers according to technology and main purpose. Visualization research on the Transformer is explained in [66], and we will not go into it here.

3) GPT [4]: The Generative Pre-trained Transformer (GPT) is a Transformer-based pre-training model developed by OpenAI. Its purpose is to learn the dependencies between sentences and words in long text. Over time, GPT has evolved from GPT-1 to GPT-3 [67]. The biggest difference between GPT-1 and BERT is that the GPT-1 model scans text from left to right, so a token embedding can only consider the information before the current token and not the information following it, while BERT uses bidirectional training. Therefore, GPT-1 only integrates the information preceding the token. As for the information following the token, GPT-1 uses it as a
   new input to the model for training after prediction. GPT supports unsupervised training. In GPT-3, unsupervised training on text data from the web is realized. The number of parameters in the model has reached 175 billion, which is more than 100 times the number of GPT-2 [68] parameters (1.5 billion). GPT-3 can be said to be the largest and most advanced pre-training model so far.

   Fig. 5. Architecture of the Transformer Model [63]

4) BERT [5]: Bidirectional Encoder Representations from Transformers (BERT) is one of the best NLP models so far. BERT uses bidirectional Transformer blocks for training, taking into account the context information on both sides of a word. Although many excellent models (such as XLNet [69]) have been proposed after BERT, its huge influence and excellent performance have not been replaced by other models. The training process of BERT is shown in Figure 6.

   Fig. 6. Overall pre-training and fine-tuning procedures for BERT [5]

   BERT uses the Masked Language Modeling (MLM) [70] technique, which is a fill-in-the-blank technique. During pre-training, the model predicts the hidden information in the original text and thereby obtains the context embedding of the input tokens. The general process is that approximately 15% of the tokens in the BERT input are randomly selected and masked, and BERT is then pre-trained to predict the masked tokens. One disadvantage of this technique is that the information of the masked token is not encoded into the context embedding. In the downstream task, an information deviation problem then occurs because the information of the previously masked words is missing. The solution is to process the tokens selected for masking at a random ratio of 8/1/1: 80% of the selected tokens are actually masked, 10% keep the original token for training, and 10% are randomly replaced with other tokens. For a detailed summary of the application of the BERT model in NLP, please see [71].

5) BART [72]: BART is a denoising seq2seq algorithm, which can be said to be an extension of the BERT model. From an architectural point of view, it can be regarded as a "combination" of the BERT and GPT frameworks using the encoder-decoder architecture, as shown in Figure 7. In the pre-training stage of BART, 5 noisy input transformation methods are used: Token Masking, Token Deletion, Text Infilling, Sentence Permutation and Document Rotation. In the fine-tuning stage, the authors trained four tasks: Sequence Classification, Token Classification, Sequence Generation and Machine Translation. The results are shown in Figure 8. As can be seen from the results in the figure, this extended model performs better than the BERT model.

   Fig. 7. A schematic comparison of BART with BERT and GPT. [72]

   Fig. 8. Comparison of pre-training objectives [72]

C. The pre-training model

Currently, mainstream NLP research tends to solve problems in two stages. The first stage is to build pre-trained models (PTM) based on context embedding. The second stage is to fine-tune the PTM for specific tasks. Following the classification in [73], PTMs can be divided into three categories, namely serialization models, recursive models and self-attention models, according
to the model structure. Using the pre-training mechanism can improve the generalization performance of the model and leave researchers and engineers more energy to deal with specific downstream tasks. It is worth mentioning that the bias problem in NLP becomes more prominent as the model becomes larger. For example, the number of parameters in GPT-3 has reached 175 billion. Although GPT-3 is by far the largest and most advanced NLP pre-training model, it also exhibits the most bias [74]. In addition, most pre-training models have a very large overhead (time, memory) during training. In some simple tasks, context-independent embedding methods perform better than context-dependent embedding methods [arora2020contextual]. This shows that there is no best model, only the most suitable model. Using a pre-trained model usually takes two steps. The first step is to download the pre-trained model, for which the third-party Transformers package [75] can be used. The second step is to fine-tune the model for the specific downstream task. Generally, transfer learning [76] is used to adapt the knowledge in the pre-trained model to downstream tasks. There are many transfer learning methods in NLP, and the most widely used is Domain Adaptation [77]. The article [78] provides a more detailed classification of these methods.

III. VULNERABILITY DETECTION USING NEURAL NETWORKS

Vulnerability detection has always been the top priority in the field of software security. With the development of deep learning technology in CV, NLP and other fields, using deep learning methods to understand and detect vulnerabilities in source code, thereby replacing manual detection methods, has become the focus and hotspot of current research [79]. Although more and more detection methods have been proposed, the number of vulnerabilities reported on CVE [2] and NVD [80] is increasing day by day. Besides the large-scale increase in the amount of software, another important reason is that root vulnerabilities are not easy to detect. If a root vulnerability is not detected, repairing the shallow vulnerabilities caused by it does not help. Conversely, if the root vulnerabilities are detected and fixed, the repetitive vulnerabilities derived from them disappear. This requires vulnerability detection or mining tools to deeply understand the semantic information related to the vulnerability, so that the root vulnerability can be detected. Deep NLP technology offers great potential for this.

A. Vulnerability introduction

Software vulnerabilities are defined as follows [81]: a software vulnerability is an instance of a flaw, caused by a mistake in the design, development, or configuration of software such that it can be exploited to violate some explicit or implicit security policy.

Vulnerability detection and analysis methods are divided into three types according to whether the code under analysis is executed:

   • Static analysis: refers to the use of additional detection programs to examine programs that are suspected of containing vulnerabilities. During the analysis, the program under analysis does not need to be executed; only its source code is required;
   • Dynamic analysis: refers to reproducing the execution environment of the software under test. The test cases required for execution are selected, the program under test is executed, the execution process and the changes of variables are monitored, and vulnerabilities are found during execution;
   • Hybrid dynamic and static analysis: As the name suggests, it refers to using dynamic analysis and static analysis together. However, this does not essentially improve the accuracy of the analysis, because while it combines the strengths of static and dynamic analysis, it also inherits the shortcomings of both.

Dynamic analysis techniques such as fuzz testing or taint analysis [82]–[86] are not within the scope of this article. We only discuss static analysis techniques. According to whether the vulnerability detection technology uses the Transformer architecture, we divide it into two categories: detection models based on traditional DL technology, and NLP pre-training detection models based on the Transformer architecture (III-B). When detection models based on traditional DL technology, such as LSTM/GRU/Bi-LSTM models, perform source code vulnerability detection, the work is generally divided into two stages:

1) The first stage is to segment the source code and extract its features. There are two ways to save the results after segmentation:
   • One is the storage method based on abstract syntax trees (AST). AST analysis tools use code attributes to decompose the source code into the form of an AST, and vulnerability analysis [87] or other tasks are then performed on the AST. For example, Alon uses path-AST (pAST) to represent code and complete the code completion task [88].
   • The other is the graph-based storage method. Most graph segmentation results are saved as Code Property Graphs (CPG) [89]. A CPG integrates the AST, the control flow graph (CFG) and the program dependence graph (PDG), so more code feature information is extracted and the final vulnerability detection result is relatively better, because the vulnerable code in the CPG provides more vulnerability information for the model, as shown in Figure 9. Most of the literature now uses CPG to extract code features. For example, in [90], CPG is called Augmented AST; in Devign [91], the sequential logic relationship between source code tokens (Natural Code Sequence, NCS) is actually another form of the AST in CPG. It is worth mentioning that the data set used in Devign is widely used by many researchers, and
     the data set is open source. There are many tools for generating a CPG; Joern, DG [92]
     and other tools can be used directly. The use of these tools is closely linked to
     LLVM. As for AST generation, the https://github.com/Kolkir/code2seq repository
     provides AST generation tools for programming languages such as Java, C++, C, C#
     and Python.

Fig. 9. By graph splitting, the red-shaded code elements are those contributing most to
vulnerability detection [93]

  2) The second stage is model training. In this stage, the input is the output of the
     previous stage. Depending on whether a graph neural network (GNN) model is used,
     the approaches can be divided into two categories:
     • Non-GNN models: The output of the first stage is generally a graph, so the graph
       needs to be encoded, converted into a vector, and then fed to the model. In a
       series of works such as SySeVR [94] and Vuldeelocator [95]–[98], the code is sliced
       so that the semantic information inside each slice is relatively strong, the
       slices are segmented at the token level, and Word2Vec [52] is then used to convert
       the sliced code into a vector representation. In the modeling phase, these
       approaches train LSTM or GRU models and their variants, such as Bi-LSTM and
       Bi-GRU, and the final results are good on their respective artificially
       synthesized data sets. However, recent research [93] shows that when the
       VulDeePecker [95] model is tested on real data, its accuracy drops to 11.12%. This
       result is both unexpected and reasonable. The LSTM or Bi-LSTM model itself has a
       limited ability to extract features relevant to vulnerabilities, that is, the
       generalization ability of the model is not strong. In addition, the data in real
       data sets is unbalanced (there is much more non-vulnerable data than vulnerable
       data). Together, these cause the accuracy of the VulDeePecker model to drop by
       more than 50%.
     • GNN models: Since the source code is sliced in the first stage, it is generally
       saved as a graph. Continuing this logic, it becomes natural to use a GNN for
       model training. In Devign [91], the gated graph neural network (GGNN) [99], [100]
       model is used for training. The advantage is that the information in the entire
       graph structure can be fully considered and no information about adjacent nodes
       is lost, which makes the model more suitable for representing semantic graph
       structures in vulnerability detection tasks. When dealing with real data sets,
       the performance of existing detection models based on traditional DL technology
       is not very good. This is due to problems such as data imbalance and data
       duplication in real data sets. REVEAL [93] can be used as a configurable
       vulnerability prediction tool. It focuses on solving the problem of data
       imbalance in real data sets, and uses representation learning to address
       problems such as the model's insufficient recognition of the vulnerability
       boundary. Figure 10 shows how well different models separate vulnerable from
       non-vulnerable examples. In addition, Wang et al. [90] use transfer learning in
       the model to deal with the problem of insufficient data.

Fig. 10. t-SNE plots illustrating the separation between vulnerable (denoted by +) and
non-vulnerable (denoted by ◦) examples [93]

B. new era of vulnerability detection
   In the NLP field, the best models so far are models such as BERT [5], GPT [4] and their
extended models. These models all use the Transformer as the feature extractor. Since code
is also a special kind of text data, it is natural to think of using these excellent
models, such as BERT, for vulnerability detection. Listed below are some of the latest
models that apply NLP technology to code intelligence (CI) tasks. These models share a
common characteristic: the training process is divided into two stages. The first stage is
pre-training and the second stage is fine-tuning, and specific vulnerability detection
tasks are generally completed in the fine-tuning stage.
  1) CodeBERT [6]: CodeBERT is a model developed by Microsoft for code intelligence tasks.
     CodeBERT is trained on bimodal data [101]–[103], where bimodal refers to natural
     language (NL) and programming language (PL), and NL refers to the natural language
     comments of the program code. In addition to pre-training on bimodal PL-NL data,
     CodeBERT also uses unimodal, code-only data from 6 programming languages for
     training. To better suit this setting, standard masked language modeling (MLM) and
     replaced token detection (RTD)
     methods are used for training, as shown in Figure 11. It is worth mentioning that no
     model is a panacea. When CodeBERT is used for code generation tasks, it does not
     perform as well as the code2seq [104] model. In code2seq, Alon et al. use the concept
     of path-context to extract more relevant semantic information from the code than
     CodeBERT, which is trained on the source code itself. Later, GraphCodeBERT [105], an
     extended version of CodeBERT, took the internal structure of the code into account.
     In the pre-training stage, the semantic-level structure of data flow was used to make
     the model more effective. In the four downstream tasks of code search, clone
     detection, code translation and code refinement, GraphCodeBERT achieved the best
     performance.

Fig. 11. CodeBERT training model [6]

  2) CodeXGLUE [106]: In code intelligence research, results are more convincing when a
     benchmark data set is provided. CodeXGLUE provides three types of model
     architectures, CodeBERT, CodeGPT, and an encoder-decoder, to help more researchers
     quickly solve problems in code intelligence. The code intelligence problems that
     CodeXGLUE covers are broken down into the following four categories of sub-problems:
     • code-code: clone detection, defect detection, cloze test, code completion, code
       repair, code-to-code translation
     • text-code: natural language code search, text-to-code generation
     • code-text: code summarization
     • text-text: documentation translation
     For detailed descriptions of these sub-problems, see [106]. With the latest
     pre-trained CodeBERT model, the accuracy (ACC) on the downstream task Insecure Code
     Detection has increased to 65.3% (previously 62.08%).
  3) PLBART [107]: PLBART applies the BART [72] framework to code intelligence, where PL
     stands for programming language. PLBART continues the denoising autoencoding
     strategy of BART and adds noise in three ways: token masking, token deletion, and
     token infilling. In the fine-tuning stage, the authors use the four major tasks of
     Code Summarization, Code Generation, Code Translation, and Code Classification as
     the downstream tasks of PLBART.

   Code Intelligence (CI) tasks are tasks on source code that are solved using artificial
intelligence methods. Common code intelligence tasks are divided into four categories of
sub-problems. This classification is the same as the classification of problems in NLP,
except that the problems in NLP are oriented to the macro concept of "text". Since code is
also a special kind of "text", we can regard vulnerability detection as a sub-task of code
intelligence. The advantage of doing so is that more training samples and more generative
pre-trained models can be obtained. Applying advanced NLP models to CI, and using the
powerful feature extraction capabilities of deep learning to extract relevant semantic
information from the code, has gradually become a research hotspot. Research at this stage
is mainly focused on the representation of vulnerability information. In other words, if
deeper vulnerability information can be extracted, the ability to identify, judge and
repair vulnerabilities will be greatly improved.

                                      REFERENCES

 [1] Junaid Akram and Ping Luo. Sqvdt: A scalable quantitative vulnerability detection
     technique for source code security assessment. Software: Practice and Experience,
     51(2):294–318, 2021.
 [2] Common Vulnerabilities Exposures (CVE). Available at http://cve.mitre.org.
 [3] Yonghee Shin, Andrew Meneely, Laurie Williams, and Jason A Osborne. Evaluating
     complexity, code churn, and developer activity metrics as indicators of software
     vulnerabilities. IEEE transactions on software engineering, 37(6):772–787, 2010.
 [4] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving
     language understanding by generative pre-training. 2018.
 [5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training
     of deep bidirectional transformers for language understanding. arXiv preprint
     arXiv:1810.04805, 2018.
 [6] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun
     Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. Codebert: A pre-trained model for
     programming and natural languages. arXiv preprint arXiv:2002.08155, 2020.
 [7] Kamran Kowsari, Kiana Jafari Meimandi, Mojtaba Heidarysafa, Sanjana Mendu, Laura
     Barnes, and Donald Brown. Text classification algorithms: A survey. Information,
     10(4):150, 2019.
 [8] Vikas Yadav and Steven Bethard. A survey on recent advances in named entity
     recognition from deep learning models. arXiv preprint arXiv:1910.11470, 2019.
 [9] Jing Li, Aixin Sun, Jianglei Han, and Chenliang Li. A survey on deep learning for
     named entity recognition. IEEE Transactions on Knowledge and Data Engineering, 2020.
[10] Shantanu Kumar. A survey of deep learning methods for relation extraction. arXiv
     preprint arXiv:1705.03645, 2017.
[11] Daria Dzendzik, Carl Vogel, and Jennifer Foster. English machine reading
     comprehension datasets: A survey, 2021.
[12] Razieh Baradaran, Razieh Ghiasi, and Hossein Amirkhani. A survey on machine reading
     comprehension systems. arXiv preprint arXiv:2001.01582, 2020.
[13] Changchang Zeng, Shaobo Li, Qin Li, Jie Hu, and Jianjun Hu. A survey on machine
     reading comprehension—tasks, evaluation metrics and benchmark datasets. Applied
     Sciences, 10(21):7640, 2020.
[14] Chenhui Chu and Rui Wang. A survey of domain adaptation for neural machine
     translation. arXiv preprint arXiv:1806.00258, 2018.
[15] Shuoheng Yang, Yuxin Wang, and Xiaowen Chu. A survey of deep learning techniques for
     neural machine translation. arXiv preprint arXiv:2002.07526, 2020.
[16] Raj Dabre, Chenhui Chu, and Anoop Kunchukuttan. A comprehensive survey of
     multilingual neural machine translation, 2020.
[17] Danqing Wang, Pengfei Liu, Yining Zheng, Xipeng Qiu, and Xuanjing Huang.
     Heterogeneous graph neural networks for extractive document summarization. arXiv
     preprint arXiv:2004.12393, 2020.
[18] Mudasir Mohd, Rafiya Jan, and Muzaffar Shah. Text document summarization using word
     embedding. Expert Systems with Applications, 143:112958, 2020.
[19] Tahseen Sultana and Srinivasu Badugu. A review on different question answering
     system approaches. pages 579–586, 2020.
[20] Zahra Abbasiyantaeb and Saeedeh Momtazi. Text-based question answering from
     information retrieval and deep neural network perspectives: A survey. arXiv preprint
     arXiv:2002.06612, 2020.
[21] Sulabh Katiyar and Samir Kumar Borgohain. Comparative evaluation of cnn
     architectures for image caption generation. arXiv preprint arXiv:2102.11506, 2021.
[22] Harshit Parikh, Harsh Sawant, Bhautik Parmar, Rahul Shah, Santosh Chapaneri, and
     Deepak Jayaswal. Encoder-decoder architecture for image caption generation. In 2020
     3rd International Conference on Communication System, Computing and IT Applications
     (CSCITA), pages 174–179. IEEE, 2020.
[23] Saloni Kalra and Alka Leekha. Survey of convolutional neural networks for image
     captioning. Journal of Information and Optimization Sciences, 41(1):239–260, 2020.
[24] Gobinda G Chowdhury. Natural language processing. Annual review of information
     science and technology, 37(1):51–89, 2003.
[25] Daniel W Otter, Julian R Medina, and Jugal K Kalita. A survey of the usages of deep
     learning for natural language processing. IEEE Transactions on Neural Networks and
     Learning Systems, 2020.
[26] Md Zahangir Alom, Tarek M Taha, Chris Yakopcic, Stefan Westberg, Paheding Sidike,
     Mst Shamima Nasrin, Mahmudul Hasan, Brian C Van Essen, Abdul AS Awwal, and Vijayan K
     Asari. A state-of-the-art survey on deep learning theory and architectures.
     Electronics, 8(3):292, 2019.
[27] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural networks,
     61:85–117, 2015.
[28] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N
     Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv
     preprint arXiv:1706.03762, 2017.
[29] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning
     applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[30] Yoon Kim. Convolutional neural networks for sentence classification, 2014.
[31] Rie Johnson and Tong Zhang. Effective use of word order for text categorization with
     convolutional neural networks. arXiv preprint arXiv:1412.1058, 2014.
[32] Rie Johnson and Tong Zhang. Semi-supervised convolutional neural networks for text
     categorization via region embedding. Advances in neural information processing
     systems, 28:919, 2015.
[33] Ye Zhang and Byron Wallace. A sensitivity analysis of (and practitioners’ guide to)
     convolutional neural networks for sentence classification. arXiv preprint
     arXiv:1510.03820, 2015.
[34] Thien Huu Nguyen and Ralph Grishman. Relation extraction: Perspective from
     convolutional neural networks. In Proceedings of the 1st Workshop on Vector Space
     Modeling for Natural Language Processing, pages 39–48, 2015.
[35] Rebecca Russell, Louis Kim, Lei Hamilton, Tomo Lazovich, Jacob Harer, Onur Ozdemir,
     Paul Ellingwood, and Marc McConley. Automated vulnerability detection in source code
     using deep representation learning. In 2018 17th IEEE international conference on
     machine learning and applications (ICMLA), pages 757–762. IEEE, 2018.
[36] Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. Multimodal machine
     learning: A survey and taxonomy. IEEE transactions on pattern analysis and machine
     intelligence, 41(2):423–443, 2018.
[37] Umut Sulubacak, Ozan Caglayan, Stig-Arne Grönroos, Aku Rouhe, Desmond Elliott, Lucia
     Specia, and Jörg Tiedemann. Multimodal machine translation through visuals and
     speech, 2019.
[38] Chao Zhang, Zichao Yang, Xiaodong He, and Li Deng. Multimodal intelligence:
     Representation learning, information fusion, and applications. IEEE Journal of
     Selected Topics in Signal Processing, 14(3):478–493, 2020.
[39] Sulabh Katiyar and Samir Kumar. Comparative evaluation of cnn architectures for
     image caption generation. International Journal of Advanced Computer Science and
     Applications, 11(12), 2020.
[40] Mitsuo Kawato, Kazunori Furukawa, and Ryoji Suzuki. A hierarchical neural-network
     model for control and learning of voluntary movement. Biological cybernetics,
     57(3):169–185, 1987.
[41] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation,
     9(8):1735–1780, 1997.
[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural
     nets and problem solutions. International Journal of Uncertainty, Fuzziness and
     Knowledge-Based Systems, 6(02):107–116, 1998.
[43] Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the
     properties of neural machine translation: Encoder-decoder approaches. arXiv preprint
     arXiv:1409.1259, 2014.
[44] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of cnn
     and rnn for natural language processing. arXiv preprint arXiv:1702.01923, 2017.
[45] Zhenjin Dai, Xutao Wang, Pin Ni, Yuming Li, Gangmin Li, and Xuming Bai. Named entity
     recognition using bert bilstm crf for chinese electronic health records. In 2019
     12th international congress on image and signal processing, biomedical engineering
     and informatics (cisp-bmei), pages 1–5. IEEE, 2019.
[46] Rabah Alzaidy, Cornelia Caragea, and C Lee Giles. Bi-lstm-crf sequence labeling for
     keyphrase extraction from scholarly documents. In The world wide web conference,
     pages 2551–2557, 2019.
[47] Amir Bakarov. A survey of word embeddings evaluation methods. arXiv preprint
     arXiv:1801.09536, 2018.
[48] Soubraylu Sivakumar, Lakshmi Sarvani Videla, T Rajesh Kumar, J Nagaraj, Shilpa
     Itnal, and D Haritha. Review on word2vec word embedding neural net. In 2020
     International Conference on Smart Electronics and Communication (ICOSEC), pages
     282–290. IEEE, 2020.
[49] Tomasz Limisiewicz and David Mareček. Syntax representation in word embeddings and
     neural networks–a survey. arXiv preprint arXiv:2010.01063, 2020.
[50] Sebastian Ruder, Ivan Vulić, and Anders Søgaard. A survey of cross-lingual word
     embedding models. Journal of Artificial Intelligence Research, 65:569–631, 2019.
[51] Zellig S Harris. Distributional structure. Word, 10(2-3):146–162, 1954.
[52] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of
     word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[53] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors
     for word representation. In Proceedings of the 2014 conference on empirical methods
     in natural language processing (EMNLP), pages 1532–1543, 2014.
[54] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare
     words with subword units. arXiv preprint arXiv:1508.07909, 2015.
[55] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang
     Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva
     Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato,
     Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang,
     Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado,
     Macduff Hughes, and Jeffrey Dean. Google’s neural machine translation system:
     Bridging the gap between human and machine translation, 2016.
[56] Erion Çano and Maurizio Morisio. Word embeddings for sentiment analysis: a
     comprehensive empirical survey. arXiv preprint arXiv:1902.00753, 2019.
[57] Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton
     Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint
     arXiv:1802.05365, 2018.
[58] Qi Liu, Matt J Kusner, and Phil Blunsom. A survey on contextual embeddings. arXiv
     preprint arXiv:2003.07278, 2020.
[59] Alessio Miaschi and Felice Dell’Orletta. Contextual and non-contextual word
     embeddings: an in-depth linguistic investigation. In Proceedings of the 5th Workshop
     on Representation Learning for NLP, pages 110–119, 2020.
[60] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by
     jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[61] Sneha Chaudhari, Gungor Polatkan, Rohan Ramanath, and Varun Mithal. An attentive
     survey of attention models. arXiv preprint arXiv:1904.02874, 2019.
[62] Dichao Hu. An introductory survey on attention mechanisms in nlp problems. In
     Proceedings of SAI Intelligent Systems Conference, pages 432–448. Springer, 2019.
[63] Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan,
     and Mubarak Shah. Transformers in vision: A survey. arXiv preprint arXiv:2101.01169,
     2021.
[64] Benyamin Ghojogh and Ali Ghodsi. Attention mechanism, transformers, bert, and gpt:
     Tutorial and survey. 2020.
[65] Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A
     survey. arXiv preprint arXiv:2009.06732, 2020.
[66] Adrian MP Braşoveanu and Răzvan Andonie. Visualizing transformers for nlp: A brief
     survey. In 2020 24th International Conference Information Visualisation (IV), pages
     270–279. IEEE, 2020.
[67] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla
     Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al.
     Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
[68] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever.
     Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
[69] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and
     Quoc V Le. Xlnet: Generalized autoregressive pretraining for language understanding.
     arXiv preprint arXiv:1906.08237, 2019.
[70] Wilson L Taylor. “cloze procedure”: A new tool for measuring readability. Journalism
     quarterly, 30(4):415–433, 1953.
[71] MV Koroteev. Bert: A review of applications in natural language processing and
     understanding. arXiv preprint arXiv:2103.11943, 2021.
[72] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer
     Levy, Ves Stoyanov, and Luke Zettlemoyer. Bart: Denoising sequence-to-sequence
     pre-training for natural language generation, translation, and comprehension. arXiv
     preprint arXiv:1910.13461, 2019.
[73] Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang.
     Pre-trained models for natural language processing: A survey. Science China
     Technological Sciences, pages 1–26, 2020.
[74] Ismael Garrido-Muñoz, Arturo Montejo-Ráez, Fernando Martínez-Santiago, and L Alfonso
     Ureña-López. A survey on bias in deep nlp. Applied Sciences, 11(7):3184, 2021.
[75] Thomas Wolf, Julien Chaumond, Lysandre Debut, Victor Sanh, Clement Delangue, Anthony
     Moi, Pierric Cistac, Morgan Funtowicz, Joe Davison, Sam Shleifer, et al.
     Transformers: State-of-the-art natural language processing. In Proceedings of the
     2020 Conference on Empirical Methods in Natural Language Processing: System
     Demonstrations, pages 38–45, 2020.
[76] Fuzhen Zhuang, Zhiyuan Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui
     Xiong, and Qing He. A comprehensive survey on transfer learning. Proceedings of the
     IEEE, 109(1):43–76, 2020.
[77] Alan Ramponi and Barbara Plank. Neural unsupervised domain adaptation in nlp—a
     survey. arXiv preprint arXiv:2006.00632, 2020.
[78] Zaid Alyafeai, Maged Saeed AlShaibani, and Irfan Ahmad. A survey on transfer
     learning in natural language processing. arXiv preprint arXiv:2007.04239, 2020.

     Woo. The art, science, and engineering of fuzzing: A survey. IEEE Transactions on
     Software Engineering, 2019.
[86] Yan Wang, Peng Jia, Luping Liu, Cheng Huang, and Zhonglin Liu. A systematic review
     of fuzzing based on machine learning techniques. PloS one, 15(8):e0237749, 2020.
[87] Fabian Yamaguchi, Markus Lottmann, and Konrad Rieck. Generalized vulnerability
     extrapolation using abstract syntax trees. pages 359–368, 2012.
[88] Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. code2vec: Learning
     distributed representations of code. Proceedings of the ACM on Programming
     Languages, 3(POPL):1–29, 2019.
[89] Fabian Yamaguchi, Nico Golde, Daniel Arp, and Konrad Rieck. Modeling and discovering
     vulnerabilities with code property graphs. In 2014 IEEE Symposium on Security and
     Privacy, pages 590–604. IEEE, 2014.
[90] Huanting Wang, Guixin Ye, Zhanyong Tang, Shin Hwei Tan, Songfang Huang, Dingyi Fang,
     Yansong Feng, Lizhong Bian, and Zheng Wang. Combining graph-based learning with
     automated data collection for code vulnerability detection. IEEE Transactions on
     Information Forensics and Security, 2020.
[91] Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, and Yang Liu. Devign:
     Effective vulnerability identification by learning comprehensive program semantics
     via graph neural networks, 2019.
[92] Marek Chalupa. Dg: Analysis and slicing of llvm bitcode. In International Symposium
     on Automated Technology for Verification and Analysis, pages 557–563. Springer,
     2020.
[93] Saikat Chakraborty, Rahul Krishna, Yangruibo Ding, and Baishakhi Ray. Deep learning
     based vulnerability detection: Are we there yet? arXiv preprint arXiv:2009.07235,
     2020.
[94] Yi Hu. A framework for using deep learning to detect software vulnerabilities, 2019.
[95] Deqing Zou, Sujuan Wang, Shouhuai Xu, Zhen Li, and Hai Jin. Vuldeepecker: A deep
     learning-based system for multiclass vulnerability detection. IEEE Transactions on
     Dependable and Secure Computing, page 1–1, 2019.
[96] Zhen Li, Deqing Zou, Shouhuai Xu, Zhaoxuan Chen, Yawei Zhu, and Hai Jin.
     Vuldeelocator: a deep learning-based fine-grained vulnerability detector. arXiv
     preprint arXiv:2001.02350, 2020.
[97] Deqing Zou, Yawei Zhu, Shouhuai Xu, Zhen Li, Hai Jin, and Hengkai Ye. Interpreting
     deep learning-based vulnerability detector predictions based on heuristic searching.
     ACM Transactions on Software Engineering and Methodology (TOSEM), 30(2):1–31, 2021.
[98] Changming Liu, Deqing Zou, Peng Luo, Bin B. Zhu, and Hai Jin. A heuristic framework
     to detect concurrency vulnerabilities. In Proceedings of the 34th Annual Computer
     Security Applications Conference, ACSAC ’18, page 529–541, New York, NY, USA, 2018.
     Association for Computing Machinery.
[99] Daniel Beck, Gholamreza Haffari, and Trevor Cohn. Graph-to-sequence learning using
     gated graph neural networks. arXiv preprint arXiv:1806.09835, 2018.
[100] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence
     neural networks, 2017.
[79] Guanjun Lin, Sheng Wen, Qing-Long Han, Jun Zhang, and Yang Xiang.          [101]   Dhanesh Ramachandram and Graham W Taylor. Deep multimodal
     Software vulnerability detection using deep neural networks: a survey.             learning: A survey on recent advances and trends. IEEE Signal
     Proceedings of the IEEE, 108(10):1825–1848, 2020.                                  Processing Magazine, 34(6):96–108, 2017.
[80] National     Vulnerability      Database(NVD).          Available     at   [102]   Wei Chen, Weiping Wang, Li Liu, and Michael S. Lew. New ideas
     https://nvd.nist.gov.                                                              and trends in deep multimodal content understanding: A review, 2020.
[81] Seyed Mohammad Ghaffarian and Hamid Reza Shahriari. Software               [103]   Tariq Habib Afridi, Aftab Alam, Muhammad Numan Khan, Jawad
     vulnerability analysis and discovery using machine-learning and data-              Khan, and Young-Koo Lee. A multimodal memes classification: A
     mining techniques: A survey. ACM Comput. Surv., 50(4), 2017.                       survey and open research issues. arXiv preprint arXiv:2009.08395,
[82] Zaoyu Wei, Jiaqi Wang, Xueqi Shen, and Qun Luo. Smart contract                     2020.
     fuzzing based on taint analysis and genetic algorithms. Journal of         [104]   Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. code2seq:
     Quantum Computing, 2(1):11, 2020.                                                  Generating sequences from structured representations of code. arXiv
[83] Heribertus Yulianton, Agung Trisetyarso, Wayan Suparta, Bahtiar Saleh              preprint arXiv:1808.01400, 2018.
     Abbas, and Chul Ho Kang. Web application vulnerability detection           [105]   Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie
     using taint analysis and black-box testing. In IOP Conference Series:              Liu, Long Zhou, Nan Duan, Jian Yin, Daxin Jiang, et al. Graphcode-
     Materials Science and Engineering, volume 879, page 012031. IOP                    bert: Pre-training code representations with data flow. arXiv preprint
     Publishing, 2020.                                                                  arXiv:2009.08366, 2020.
[84] James Fell. A review of fuzzing tools and methods. Technical               [106]   Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy,
     report,     Technical        Report.   https://dl. packetstormsecurity.            Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu
     net/papers/general/a . . . , 2017.                                                 Tang, et al. Codexglue: A machine learning benchmark dataset for
[85] Valentin Jean Marie Manès, HyungSeok Han, Choongwoo Han,                          code understanding and generation. arXiv preprint arXiv:2102.04664,
     Sang Kil Cha, Manuel Egele, Edward J Schwartz, and Maverick                        2021.
[107] Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei
      Chang. Unified pre-training for program understanding and generation.
      arXiv preprint arXiv:2103.06333, 2021.