Neural Language Models in Natural Language Processing
Abstract—People cannot live without natural language in their daily life, and natural language is also an essential part of the heritage of human civilization. With the rapid development of information technology and the explosive growth of various data, natural language processing (NLP) technology based on deep learning and other artificial intelligence techniques has emerged as the times require. With the rapid development of deep learning models in recent years, breakthroughs have been made in the field of natural language processing. Based on recent research, this paper briefly introduces the development of deep learning, the concepts of deep learning and NLP, the deep learning methods used to solve the core problems of NLP, and the application of neural network models to natural language modeling. The paper summarizes current achievements and forecasts future development.

Keywords—Deep learning, natural language processing, neural network

I. INTRODUCTION

Natural language processing is an important step for AI to move from perception to cognition, and the key to cognitive intelligence is natural language understanding. Bill Gates once said: "Language understanding is the pearl in the crown of AI." Once a breakthrough occurs, natural language processing will greatly promote cognitive intelligence, improve AI technology, and drive adoption in many important areas, such as computer and information science, linguistics, mathematics, electrical and electronic engineering, artificial intelligence and robotics, and psychology.

Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language itself, so it is closely related to linguistics. Its foundation is mathematical theory and technique, since mathematics is used to analyze all kinds of data, so it cannot be separated from the support of mathematics.

Natural language processing covers many areas, including vocabulary, syntax, semantics and pragmatics, text classification, sentiment analysis, automatic summarization, machine translation, and social computing. It is divided into two main processes: Natural Language Understanding (NLU) and Natural Language Generation (NLG).

NLU is mainly concerned with understanding the meaning of text; specifically, every word and structure needs to be understood. It studies how computers can simulate the human language communication process, so that computers can understand and use the natural languages of human society, such as Chinese and English, and realize natural language communication between humans and computers, taking over part of human mental work, including querying data, answering questions, excerpting documents, and compiling and processing natural language information.

NLG is a branch of artificial intelligence and computational linguistics. Unlike natural language analysis, its work starts from a relatively abstract conceptual level and then generates text by selecting and executing specific semantic and grammatical rules. In contrast to understanding, NLG proceeds in three phases: it defines goals, plans how to achieve them by assessing the situation and available communication resources, and formats the plans into text.

Since the 1980s, natural language processing has become increasingly dependent on data-driven computing, including statistical probability and machine learning, while deep learning and artificial neural networks allow for billions of parameters. Besides, the continuous development of data collection methods and the use of large datasets make it possible to train deep architectures. This paper surveys the application of deep learning in computational linguistics, as well as its basic theories and traditional NLP tasks.

Since the 1940s, the research progress of artificial intelligence has been spiraling upward, and it can be divided into four stages. In the first stage, from the late 1940s to the late 1960s, the focus of the work was machine translation, mainly dictionary-based, word-by-word translation. The second stage, from the late 1960s to the late 1970s, placed more emphasis on knowledge and its special representation. The third stage, from the late 1970s to the late 1980s, was hampered by the inability to build appropriate systems. The fourth stage runs from the late 1980s to the present: with technological advances in data collection and encoding, people can use corpora to process languages.

Recent improvements in computing power and parallelism, especially graphics processing units (GPUs), have made deeper learning possible. Scaling up deep learning algorithms can improve performance on benchmark tasks and discover more complex features. Now, based on off-the-shelf high-performance computing technology, deep learning systems can be scaled to very large models and very large training sets, and cloud computing infrastructure with thousands of CPU cores can be used to train very large networks whose parameters can exceed one billion. This approach exploits inexpensive computing power in the form of GPUs and uses high-speed communication infrastructure to closely coordinate distributed gradient computations.

The neural network is a basic computational tool for
word error rate. Perplexity is a convenient tool in model development, but it does not take into account acoustic confusability in speech or other search issues. Perplexity is a reasonable measure for comparing models trained on the same dataset, but it is not suitable for comparing models trained with different vocabularies.
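To make the metric concrete, the short sketch below computes perplexity as the exponential of the average per-token cross-entropy; the token probabilities are invented purely for illustration.

    import math

    # Probabilities a hypothetical language model assigns to each token of a
    # held-out sentence, P(w_i | w_1 ... w_{i-1}).
    token_probs = [0.2, 0.05, 0.1, 0.4]

    # Cross-entropy in nats: average negative log-probability per token.
    cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)

    # Perplexity is the exponential of the cross-entropy; lower is better.
    perplexity = math.exp(cross_entropy)
    print(f"cross-entropy: {cross_entropy:.3f} nats, perplexity: {perplexity:.1f}")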
C. Memory Networks in Language Modeling

Neural language models use a latent representation of the immediate token history to predict the next token.

The main task of a language model is to infer a word in a given context. Although the classical n-gram language model can capture dependencies within a short window, the data it produces is sparse. One extension of the neural sequence model is the attention mechanism, which is used in neural language models by attending over encoded representations of the history and outputting a vector that helps predict the next token. If a single output representation is overloaded with too many roles, the model becomes difficult to train, so Daniluk presents a neural language model with a key-value-predict attention mechanism that outputs separate representations for keys and values [4]. In their experiments, the first network used an attention mechanism with a single value to predict the next token. However, overloading the information in the attention vectors can hinder the network. In the setup of the second network, they therefore used two outputs, one to predict the next token and one to decode the information in the attention vectors. In the third network, the output is separated further.
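The general shape of this separation is sketched below: each hidden state is split into a key part that scores the history, a value part that forms the attention context, and a predict part that feeds the next-token distribution. Names and dimensions are illustrative assumptions, not the exact configuration of [4].

    import torch

    def key_value_predict_attention(hidden, split_size):
        """hidden: (seq_len, 3 * split_size) hidden states of a language model.
        Splits each state into key, value, and predict parts, attends over
        the history, and returns a vector for predicting the next token."""
        keys, values, predict = hidden.split(split_size, dim=-1)
        query = keys[-1]                         # key of the current position
        scores = keys[:-1] @ query               # dot-product scores over history
        weights = torch.softmax(scores, dim=0)   # attention weights
        context = (weights.unsqueeze(-1) * values[:-1]).sum(dim=0)
        # Combine the attention context with the predict part of the state.
        return torch.tanh(context + predict[-1])

    h = torch.randn(10, 3 * 64)                  # ten positions, split size 64
    out = key_value_predict_attention(h, 64)     # (64,) vector for the softmax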
Although simpler than other approaches, the model of Daniluk does not perform as well as more complex neural language models. Even so, separating the output vectors into predictive parts and keys works better in their tests than a plain attention mechanism. It is also easy to see that such attention-based neural language models rely on recent memory rather than long-range dependence.
D. Convolutional Neural Networks in Language Modeling

Convolutional neural networks are also widely used in speech and NLP.

The convolutional neural network gets its name from the convolution operation in signal processing and mathematics. Yann LeCun used this method in applications related to document recognition [5]. The richness and variability of natural data have been known since the early days of pattern recognition. For this reason, it is not possible to build a system completely by hand for glyphs, voices, or other types of patterns, so systems are generally created by a combination of hand-crafted algorithms and automatic learning techniques.

LeCun compares the performance of several learning techniques on handwritten digit recognition datasets. He concludes that without a certain amount of prior knowledge of the given task, learning techniques are difficult to make successful, even though automatic learning is effective. If a neural network is used, efficiency is much higher when knowledge is integrated and the structure is formulated according to the task to be accomplished.

Convolutional neural networks restrict weights and use local connection patterns; they are specialized neural network structures. Building on the back-propagation algorithm, the gradient-based graph transformer network solves the credit assignment problem where the functional structure changes with the input. This algorithm may be simpler in a sense: it uses back-propagation to compute gradients and handles unusual gradient descent patterns in complex dynamic structures. Ultimately it is demonstrated that the efficient use of convolutional neural networks can achieve segmentation and recognition, as well as automatic learning.

Recently, Nal Kalchbrenner described a dynamic convolutional neural network composed of convolution and pooling operations and used this structure for semantic modeling of sentences [6]. Global pooling over linear sequences is used to handle sentences of different lengths. Through dynamic pooling, filters can connect phrases that are far apart and generate a feature map that captures both long- and short-range relations.
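A central ingredient of this architecture is k-max pooling, which keeps the k largest activations of each feature channel while preserving their order, so sentences of any length yield a fixed-size representation. The sketch below is a minimal stand-alone version of the operation; the sizes are invented, not taken from [6].

    import torch

    def k_max_pooling(features, k):
        """features: (channels, seq_len) feature map from a 1-D convolution.
        Keeps the k largest values per channel, in their original order,
        producing a fixed (channels, k) output for any sentence length."""
        top = features.topk(k, dim=-1).indices    # positions of the k largest
        ordered = top.sort(dim=-1).values         # restore left-to-right order
        return features.gather(-1, ordered)

    feature_map = torch.randn(32, 17)             # 32 filters, 17-token sentence
    pooled = k_max_pooling(feature_map, k=5)      # always (32, 5)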
The network was trained and tested to handle sentiment and question classification efficiently without requiring a parser or other resources. For the Twitter experiment, the training set consists of 1.6 million tweets labeled automatically by the emoticons they contain. The performance of this model is clearly much higher than that of an n-gram baseline, and training on automatically extracted emoticon labels yields high accuracy.

In some computer vision tasks, convolutional neural networks often achieve good results. Recently they have also been used for tasks such as relation extraction and sentence-level sentiment analysis, but their application to language modeling has received less attention.
Since the n-gram model treats each word independently when processing a sentence, it cannot capture the semantic relationships between words. In contrast, neural language models can embed words into a continuous space given their contexts, representing the association structure between words. The recurrent neural network takes one memory vector and one token at a time as input, producing the next prediction and a new memory vector. As a result, feedforward and recurrent neural networks outperform n-gram models in various settings. Ngoc-Quan Pham's experiments analyze CNN-based language models and compare them with feedforward and recurrent neural networks [7].
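The recurrence just described, one token and one memory vector in, one prediction and one new memory vector out, can be written down directly. A minimal sketch with invented sizes:

    import torch
    import torch.nn as nn

    class RNNLanguageModel(nn.Module):
        """Minimal recurrent language model: one token in, one prediction out."""
        def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.cell = nn.RNNCell(embed_dim, hidden_dim)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def step(self, token_id, memory):
            """Consume one token and the current memory vector; return the
            logits for the next token and the updated memory vector."""
            memory = self.cell(self.embed(token_id), memory)
            return self.out(memory), memory

    model = RNNLanguageModel()
    memory = torch.zeros(1, 128)              # initial memory vector
    for token in [3, 17, 52]:                 # a toy token sequence
        logits, memory = model.step(torch.tensor([token]), memory)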
In his study, Pham applies CNNs to language modeling, a task that requires capturing both long-range and local information continuously and dynamically. By adding CNN layers on top of a feedforward model, the performance of the model is improved with modern techniques. The experimental results show that the CNN reduces perplexity effectively on several corpora of different sizes. Compared with the feedforward neural language model, absolute performance improves by 11-26%, but it remains below that of the LSTM model.

Pham's analysis highlights two key features of CNNs. Firstly, they can obtain information from words up to 16 positions away from the predicted word, so they are able to integrate larger context information. Secondly, analysis of the type of information learned by the CNN shows that it extracts syntactic, semantic, and other information from the input. Unlike the role of convolutions in computer vision, where deeper convolutional layers are very helpful, deeper convolutional layers hurt the performance of language modeling in these experiments. This may be because convolution can preserve the important attributes of an image, but the same is not easy for language data, where important features of the text and the original content are lost.
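To make the architecture concrete, the sketch below places a 1-D convolution over the embeddings of a fixed context window, in the spirit of adding CNN layers on top of a feedforward language model; the layer sizes and kernel width are illustrative assumptions, not the configuration used in [7].

    import torch
    import torch.nn as nn

    class CNNLanguageModel(nn.Module):
        """Feedforward-style language model with a convolutional layer over
        the context embeddings, sketching the CNN variant discussed above."""
        def __init__(self, vocab_size=1000, embed_dim=64, channels=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            # Convolve across context positions (kernel width 3 is assumed).
            self.conv = nn.Conv1d(embed_dim, channels, kernel_size=3, padding=1)
            self.out = nn.Linear(channels, vocab_size)

        def forward(self, context_ids):
            emb = self.embed(context_ids)          # (batch, context, embed_dim)
            x = self.conv(emb.transpose(1, 2))     # (batch, channels, context)
            x = torch.relu(x).max(dim=-1).values   # pool over context positions
            return self.out(x)                     # next-token logits

    model = CNNLanguageModel()
    logits = model(torch.randint(0, 1000, (2, 16)))  # two 16-token contexts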
E. Character Aware Neural Language Models

Although most of the neural networks used in NLP take words as input, recent networks take character-level input; these are called character-aware neural language models.

Such a network has been trained on English, Spanish, German, Czech, French, Russian, and Arabic. Its test results on large and small datasets are generally better than previous models, both in English and in the other languages, although the Russian tests are an exception. The test results on the Penn Treebank are not very different from the current state of the art.
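The defining idea is to build each word representation from its characters, typically with a convolution over character embeddings followed by max-over-time pooling, so that morphologically related words share information. Below is a minimal sketch of such a character-level word encoder; the sizes are invented, and this is the general technique rather than one specific published configuration.

    import torch
    import torch.nn as nn

    class CharAwareWordEncoder(nn.Module):
        """Builds a word representation from its characters with a 1-D CNN,
        the core component of a character-aware language model."""
        def __init__(self, n_chars=100, char_dim=16, n_filters=64, width=3):
            super().__init__()
            self.char_embed = nn.Embedding(n_chars, char_dim)
            self.conv = nn.Conv1d(char_dim, n_filters, kernel_size=width)

        def forward(self, char_ids):
            # char_ids: (batch, word_len) character indices of each word
            emb = self.char_embed(char_ids).transpose(1, 2)
            features = torch.relu(self.conv(emb))    # convolve over characters
            # Max-over-time pooling gives one fixed-size vector per word;
            # the word vectors then feed a recurrent language model.
            return features.max(dim=-1).values       # (batch, n_filters)

    encoder = CharAwareWordEncoder()
    word_vecs = encoder(torch.randint(0, 100, (4, 8)))  # four 8-character words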
Dropout is a more recently introduced regularization method. Although it is very powerful in feedforward neural networks, it cannot be applied well to RNNs directly. Because of this, large RNNs tend to overfit in practice, so smaller models are often used. To address this, Wojciech Zaremba proposed a simple way to apply dropout to LSTMs, regularizing a recurrent neural network with LSTM units by applying dropout only to the non-recurrent connections [8]. With this improvement, Zaremba tested both models on speech recognition.
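The essence of the method in [8] is that dropout touches only the non-recurrent (layer-to-layer) connections, leaving the recurrent state transitions intact so the memory is not corrupted across time steps. A minimal sketch of that wiring, with illustrative sizes:

    import torch
    import torch.nn as nn

    class RegularizedLSTM(nn.Module):
        """Two-layer LSTM language model with dropout applied only to the
        non-recurrent connections, in the spirit of Zaremba et al. [8]."""
        def __init__(self, vocab_size=1000, hidden=200, p_drop=0.5):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, hidden)
            self.drop = nn.Dropout(p_drop)        # non-recurrent connections only
            self.lstm1 = nn.LSTM(hidden, hidden, batch_first=True)
            self.lstm2 = nn.LSTM(hidden, hidden, batch_first=True)
            self.out = nn.Linear(hidden, vocab_size)

        def forward(self, token_ids):
            x = self.drop(self.embed(token_ids))  # dropout entering layer 1
            x, _ = self.lstm1(x)                  # recurrent weights: no dropout
            x, _ = self.lstm2(self.drop(x))       # dropout between the layers
            return self.out(self.drop(x))         # dropout before the softmax

    model = RegularizedLSTM()
    logits = model(torch.randint(0, 1000, (2, 20)))  # (2, 20, vocab_size)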
TABLE I. ICELANDIC SPEECH DATASET (FRAME ACCURACY, %)

Model                    Training set    Validation set
Non-regularized LSTM     71.6            68.9
Regularized LSTM         69.4            70.5
The training set in Table I, with 93K utterances, is relatively small. The table reports performance on the Google Icelandic speech dataset and shows that dropout improves the frame accuracy of the LSTM. The experiment makes it obvious that overfitting is a serious problem, and that dropout addresses it well.
TABLE II. RESULTS OF TRANSLATION BETWEEN ENGLISH AND FRENCH

Model                    Test perplexity    Test BLEU score
Non-regularized LSTM     5.8                25.9
Regularized LSTM         5.0                29.03
Table II shows the results of the English-French translation task. One point needs to be explained here: because of the noise that dropout adds during training, training frame accuracy decreases, but, as is typical for dropout, the model generalizes better to unseen data. As a result, the regularized model performs better on the test set than its training accuracy would suggest.
III. CONCLUSION

The early applications of natural language processing were relatively simple, but now the advanced applications of NLP are everywhere. They include many devices that can process voice commands and act on them, as well as Google's machine translator, which can more or less translate dozens of languages [9]. The emergence of these complex programs demonstrates the achievements of neural networks and learning systems over the past 60 years. Without a doubt, incremental progress has taken place over the last several years. Ten years ago, these machine-learning architectures were still a relatively old technology, but they have since developed at an unprecedented speed, breaking records in many fields in terms of actual performance. In particular, models built on deep neural architectures for natural language tasks are more efficient and introduce the measure of "imperfection".

At present, most NLP studies use English as the input or output language, and some use Mandarin [10]. This ignores the speakers of many other languages. Due to the complexity of many languages, some meanings may not be expressible in other languages, which also makes them impossible for NLP software to capture. Therefore, a suggestion for future work is to study a wider range of languages. Based on current research, it can also be predicted that deep learning models will continue to develop and may become the norm in computational linguistics in the near future, while pre-training may play an even more important role.

REFERENCES

[1] D. W. Otter, J. R. Medina, and J. K. Kalita, "A survey of the usages of deep learning for natural language processing," IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 2, pp. 604–624, 2020.
[2] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, "A neural probabilistic language model," Journal of Machine Learning Research, vol. 3, 2003.
[3] R. Iyer, M. Ostendorf, and M. Meteer, "Analyzing and predicting language model improvements," in IEEE Workshop on Automatic Speech Recognition and Understanding, 1997, pp. 254–261.
[4] M. Daniluk, T. Rocktäschel, J. Welbl, and S. Riedel, "Frustratingly short attention spans in neural language modeling," arXiv preprint arXiv:1702.04521, 2017.
[5] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[6] N. Kalchbrenner, E. Grefenstette, and P. Blunsom, "A convolutional neural network for modelling sentences," arXiv preprint arXiv:1404.2188, 2014.
[7] N.-Q. Pham, G. Kruszewski, and G. Boleda, "Convolutional neural network language models," in EMNLP, 2016, pp. 1153–1162.
[8] W. Zaremba, I. Sutskever, and O. Vinyals, "Recurrent neural network regularization," arXiv preprint arXiv:1409.2329, 2014.
[9] R. Collobert, J. Weston, L. Bottou, et al., "Natural language processing (almost) from scratch," Journal of Machine Learning Research, vol. 12, pp. 2493–2537, 2011.
[10] M. Liu, "The construction and annotation of a semantically enriched database: the Mandarin VerbNet and its NLP applications," in From Minimal Contrast to Meaning Construct: Corpus-Based, Near Synonym Driven Approaches to Chinese Lexical Semantics, 2020, pp. 257–272.