2020 International Conference on Computer, Control, Electrical, and Electronics Engineering (ICCCEEE)
SudaBERT: A Pre-trained Encoder Representation
For Sudanese Arabic Dialect
Mukhtar Elgezouli*, Khalid N. Elmadani*, Muhammed Saeed*
University of Khartoum, Faculty of Engineering
Department of Electrical and Electronic Engineering
Algamaa Street, P.O. Box 321, Khartoum, Sudan
{mukhtaralgezoli, khalidnabigh, mohammed.yahia3}@gmail.com
*All authors have contributed equally
Abstract—Bidirectional Encoder Representations from Transformers (BERT) has proven to be very efficient at Natural Language Understanding (NLU), as it achieves state-of-the-art results on most NLU tasks. In this work we aim to utilize the power of BERT for the Sudanese Arabic dialect and produce a Sudanese word representation. We collected over 7 million sentences in the Sudanese dialect and used them to resume training of the pre-trained Arabic-BERT model, which was originally trained on a large Modern Standard Arabic (MSA) corpus. Our model -SudaBERT- achieves better performance on Sudanese sentiment analysis, which indicates that SudaBERT is better at understanding the Sudanese dialect, the domain we are interested in.

Index Terms—Sudanese Arabic Dialect, BERT, SudaBERT, Natural Language Understanding

I. INTRODUCTION

Early text representation was done using Bag-of-Words concepts such as TF-IDF (term frequency-inverse document frequency) to score token importance. Still, this method lacks semantic understanding of the words, since it is just a statistical representation of the words in a document. After TF-IDF came other statistical representations, such as Latent Semantic Analysis (LSA), but these were also complex statistical representations that still depend on word counts.

In 2013, Mikolov et al. introduced the idea of word embeddings (word vectors) [1]. The word2vec model contains a single hidden layer, which learns the meaning of words merely by processing a large corpus of unlabeled text. This unsupervised nature makes word2vec powerful, but it has some drawbacks. First, word2vec consists of only a single hidden layer, which is not sufficient to capture the rules of a language. Second, word2vec produces non-contextual word embeddings, in which a word has the same meaning regardless of the context that comes before and after it.

These issues were addressed by generating contextualized representations, such as ELMo [2], which uses a deep bidirectional LSTM and is trained in an unsupervised manner. Interestingly, each layer ends up learning a different characteristic of the sentence. Unlike traditional word embeddings such as word2vec and GloVe [3], the ELMo [2] vector assigned to a token or word is a function of the entire sentence containing that word. As a result, the same word can have different word vectors under different contexts. ELMo has some drawbacks, however. First, the complex Bi-LSTM structure makes it very slow to train and to generate embeddings; second, it struggles with long-term context dependencies. One major problem that all the above models suffer from is the high computational cost and the long training time. Also, LSTM layers unroll over the entire sequence, which requires sequential computation, so the model ends up not utilizing the full power of GPUs or TPUs. Those drawbacks restricted the availability of language models for non-English languages. To fill this gap, multilingual models have been trained to learn representations for more than 100 languages. Still, multilingual models fall far behind single-language models due to limited data representation and a small language-specific vocabulary, especially for Arabic, whose morphological and syntactic structure differs from the other languages in the multilingual model. Sudanese data, in turn, has its own differences from standard Arabic.

In this paper, we describe the process of collecting Sudanese dialect data and pre-training a BERT transformer model on it. We evaluate our model on two Arabic Natural Language Understanding downstream tasks: i) Sentiment Analysis and ii) Named Entity Recognition.

The paper is structured as follows: Section II describes previous work. In Section III we discuss the pre-training process used to develop SudaBERT. Section IV describes the datasets we used to evaluate our model. Section V presents the experimental setup. Section VI presents the results. Finally, the conclusion is given in Section VII.

II. RELATED WORK

A. Non-contextual Embedding

The first meaningful word representations appeared with the word2vec model developed by Mikolov et al. [1], followed by GloVe [3] and Facebook's FastText [4], as well as Arabic word2vec models such as AraVec [5]. All of these are non-contextual word representations - a word has the same meaning regardless of its position in the sentence. A significant advance was achieved with ELMo, which produces contextual embeddings.
B. Contextual Embedding

ELMo was tested on six benchmark Natural Language Processing tasks: named entity extraction, question answering, semantic role labeling, sentiment analysis, textual entailment, and coreference resolution. In all cases, the enhanced models achieved state-of-the-art performance. Since ELMo, more language representation models have been developed, such as ULMFiT [6], BERT [7], RoBERTa [8], XLNet [9], ALBERT [10], and T5 [11], which offered improved performance by exploring different pre-training methods, modified model architectures and larger training corpora. There are also Arabic BERT models such as AraBERT [12] and Arabic-BERT [13].
III. METHODOLOGY

Bidirectional Encoder Representations from Transformers (BERT) is a model introduced in the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" [7]. It was a giant leap in NLP, as it allowed a model to be trained on unlabeled data and then -the same model- to be reused, with minimal fine-tuning (both in the amount of data and the training time), on downstream NLP tasks; in short, it allowed transfer learning to be fully exploited in NLP. BERT's architecture consists of multiple layers of a model called the Transformer [14]. The most common BERT model (BERT-base) has 12 layers and a hidden size of 768, accumulating to 110 million parameters. Several models use a similar Transformer architecture, but BERT distinguishes itself by its bidirectional nature and the way it is pre-trained to deliver this bidirectionality.

Although the Sudanese dialect corpus we collected is not small, it is not nearly enough to pre-train a BERT model from scratch. As a solution to this problem, we used a model pre-trained on MSA (Modern Standard Arabic) -Arabic-BERT [13]- and then continued the pre-training on our Sudanese dialect data.
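As a rough sketch of this continued pre-training setup (not the exact TensorFlow/TPU pipeline described in Section V), the Arabic-BERT checkpoint can simply be loaded and training resumed on the new corpus; the checkpoint identifier asafaya/bert-base-arabic is an assumption for the published Arabic-BERT weights.

```python
# Minimal sketch: start from a pre-trained Arabic checkpoint and continue
# pre-training (MLM + NSP heads) on Sudanese dialect text.
from transformers import BertForPreTraining, BertTokenizerFast

CHECKPOINT = "asafaya/bert-base-arabic"   # assumed name for Arabic-BERT [13]

tokenizer = BertTokenizerFast.from_pretrained(CHECKPOINT)
model = BertForPreTraining.from_pretrained(CHECKPOINT)

# The architecture and WordPiece vocabulary stay identical to Arabic-BERT;
# only the weights keep moving as training resumes on the dialect corpus.
```

Because training is resumed rather than restarted, SudaBERT shares Arabic-BERT's architecture and vocabulary.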
A. Pre-training Datasets

We used the Arabic-BERT model [13], which was already trained on the OSCAR Arabic corpus (https://oscar-corpus.com), containing about 8.1 billion cleaned Arabic sentences. We then pre-trained the model further on Sudanese data that we collected ourselves.

The first step in the training process was to collect a large amount of Sudanese dialect text. We collected about 13 million Sudanese sentences -each consisting of at least 20 characters- from Twitter and public Telegram channels. We then cleaned the data of everything that is not Sudanese Arabic: we removed all symbols (#, ?, :), (:, !, etc.) and emojis. After the cleaning step was completed, we ended up with more than seven million clean Sudanese sentences.
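The cleaning step can be sketched as follows. The exact filtering rules are not given in the paper, so the regular expressions below are illustrative assumptions; only the minimum length of 20 characters is taken from the text above.

```python
import re

# Assumption: keep Arabic letters, digits and whitespace; drop symbols,
# emoticons and emojis. The authors' actual rules may differ.
NON_ARABIC = re.compile(r"[^\u0600-\u06FF0-9\s]")

def clean_sentences(raw_sentences):
    cleaned = []
    for s in raw_sentences:
        s = NON_ARABIC.sub(" ", s)            # strip symbols and emojis
        s = re.sub(r"\s+", " ", s).strip()    # normalize whitespace
        if len(s) >= 20:                      # length filter stated in the paper
            cleaned.append(s)
    return cleaned
```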
B. Pre-training tasks

The second step was pre-training. The model is trained as a language model (LM) using relatively generic tasks; the two most commonly used LM tasks for BERT are Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).

In Masked Language Modeling, some of the tokens (up to 15%) in the input sequence are randomly masked (replaced with the [MASK] token), and the model should then predict those masked tokens. This implementation raises a problem: the input sequences seen at fine-tuning or inference time will never contain the [MASK] token. To mitigate this, some of the selected tokens are replaced with a random word instead of [MASK].

In Next Sentence Prediction -which helps the model understand the relationship between two sentences- the model is given two sentences as input (separated by the [SEP] token). Half of the time, the second sentence actually follows the first one in the original text; in the other half, the second sentence is chosen at random. BERT is then required to predict whether the second sentence follows the first in the original text or not.
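To make the masking procedure concrete, here is a toy sketch. The 80/10/10 split between masking, random replacement and keeping the original token follows the recipe of the original BERT paper [7] and is shown only for illustration.

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mlm_prob=0.15):
    """Pick ~15% of positions as prediction targets, then mask 80% of them,
    replace 10% with a random token and leave 10% unchanged."""
    labels = [-100] * len(token_ids)          # -100 marks positions that are not predicted
    for i, tok in enumerate(token_ids):
        if random.random() < mlm_prob:
            labels[i] = tok                   # the model must recover this original token
            r = random.random()
            if r < 0.8:
                token_ids[i] = mask_id        # replace with [MASK]
            elif r < 0.9:
                token_ids[i] = random.randrange(vocab_size)  # replace with a random word
            # else: keep the original token unchanged
    return token_ids, labels
```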
C. Sub-word Segmentation

Tokenization is the process of breaking raw text down into tokens; a vocabulary is then constructed from the most common/frequent words, and this vocabulary contains all the words that the model can understand.

BERT uses a special type of byte pair encoding (BPE) algorithm called the WordPiece tokenizer, in which the vocabulary is initialized with all the individual characters of the language. The most frequent/likely combinations of symbols in the vocabulary are then iteratively added to it, meaning that any new word can be represented using subwords or even individual characters, eliminating the chance of getting an unknown token in our text. The WordPiece tokenizer works extremely well with the Arabic language, removing the need for an Arabic-specific segmenter such as FARASA [15].
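For illustration, the snippet below shows how a WordPiece tokenizer splits words that are absent from the vocabulary into known subword pieces. The checkpoint name is an assumption for Arabic-BERT, the example phrase is arbitrary, and the exact split depends on the learned vocabulary.

```python
from transformers import BertTokenizerFast

# Assumed checkpoint name; SudaBERT uses the same vocabulary because
# pre-training was resumed from this model rather than restarted.
tokenizer = BertTokenizerFast.from_pretrained("asafaya/bert-base-arabic")

# Out-of-vocabulary words are split into subword pieces (continuation pieces
# carry a "##" prefix), so no [UNK] token is produced.
print(tokenizer.tokenize("مشتاقين ليكم شديد"))
```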
D. Fine-tuning

Finally, we did the fine-tuning, which means benefiting from a pre-trained language representation model by training it a little further on application-specific text. This approach has achieved impressive results on many language understanding tasks in different languages [16], [17].

1) Sentence classification: Before feeding the text into BERT, the [CLS] token is prepended to each sentence to act as a sentence representation. In order to fine-tune BERT for sentence classification, we inserted a classifier layer on top of the final hidden state corresponding to the [CLS] token, so the model should learn to encode all the information it needs into that hidden state. Figure 1 illustrates the steps we followed to fine-tune BERT for this task.

2) Named Entity Recognition: The same architecture is used for the named entity recognition (NER) task, where each word is divided into segments using the WordPiece tokenizer, prepended with the [CLS] token and fed into the model. Finally, the classifier layer predicts "Person", "Location", "Organization" or "Miscellaneous" for each word based on the final hidden representation of the [CLS] token.
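A hedged sketch of the sentence-classification head described above: BertForSequenceClassification in the transformers library places a single linear classifier over the pooled [CLS] representation, which matches this setup. The checkpoint name is again an assumption, and two labels stand in for a binary sentiment task.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

CHECKPOINT = "asafaya/bert-base-arabic"   # assumed name; SudaBERT would be loaded the same way
tokenizer = BertTokenizerFast.from_pretrained(CHECKPOINT)
model = BertForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)

# A single sentence is tokenized, [CLS]/[SEP] are added automatically, and the
# classifier returns one logit per label.
inputs = tokenizer("الخدمة دي ما شغالة خالص", return_tensors="pt",
                   truncation=True, max_length=128)
with torch.no_grad():
    logits = model(**inputs).logits           # shape: (1, 2)
```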
Fig. 1. Steps of Fine-tuning BERT for Sentence Classification.
IV. EVALUATION

We evaluated our model on two NLU tasks, Sentiment Analysis (SA) and Named Entity Recognition (NER), and compared the results to the Arabic-BERT model from which we started training.

A. Sentiment Analysis

Due to the scarcity of Sudanese dialect content on the internet, we evaluated our model on only one Sudanese dialect (slang) dataset; the rest of the datasets are in MSA or another Arabic dialect.

1) AJGT: The Arabic Jordanian General Tweets dataset consists of 1,800 tweets annotated as positive (900) and negative (900) [18]. The tweets were written in MSA and the Jordanian dialect.

2) ArSenTD-LEV: The Arabic Sentiment Twitter Dataset for the Levantine dialect contains 4,000 tweets written in Arabic and equally retrieved from Jordan, Lebanon, Palestine and Syria [19]. The tweets are classified into very negative (630), negative (1,253), neutral (885), positive (835) and very positive (397).

3) ASTD: The Arabic Sentiment Tweets Dataset consists of over 10K Arabic sentiment tweets annotated as subjective negative, subjective positive, subjective mixed and objective [20]. We evaluated our model on the balanced version of the dataset (797 tweets in each class).

4) HARD: The Hotel Arabic-Reviews Dataset contains 93,700 hotel reviews in MSA as well as dialectal Arabic [21]. The balanced version of the dataset consists of 46,850 reviews for each of the positive and negative classes.

5) LABR: The Large Scale Arabic Book Reviews dataset contains over 63K book reviews in Arabic. Each book review comes with the review text, a rating (1 to 5) and other metadata [22]. We evaluated our model on the balanced two-class version, where the ratings are converted into positive (ratings 4 & 5) and negative (ratings 1 & 2) and rating 3 is ignored.

6) Sentiment analysis for Sudanese dialect: This dataset consists of the opinions of people on Twitter about the telecommunication services provided in Sudan [23]. It contains 4,712 tweets written in the Sudanese Arabic dialect, classified into negative (3,358), positive (716) and objective (638).

B. Named Entity Recognition

1) ANERCorp: The Arabic Named Entity Recognition corpus contains 150k tokens, 11% of which are named entities distributed among four entity categories: Person (39%), Location (30.4%), Organization (20.6%) and Miscellaneous (10%) [24].
V. EXPERIMENTS

A. Pre-training

Before beginning to pre-train the model, some preparations were done on the data (see the sketch after this section):
• The data was divided into small files, each containing 250K sentences.
• Each file was then tokenized, and sentence segmentation was added along with position segmentation.
• All files were then saved as TFRecord files to ease reading and loading them onto the TPU.

Data preparation and initialization of the training were done on a Google Cloud virtual machine instance; this instance contains an n1-standard vCPU with 3.75 GB of memory running a Linux-based image. The actual pre-training was done on a v3-8 TPU rented from Google Cloud Platform, and the data was stored in a GCP (Google Cloud Platform) bucket.

The pre-training was carried out with a batch size of 32 sentences per input and a learning rate of 1e-5, an order of magnitude less than if the pre-training were from scratch. The model converged after one million steps, equivalent to 14 hours of training time.

After the pre-training, the model achieved a masked language modeling accuracy of 0.53 and a next sentence prediction accuracy of 0.638.
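As a rough sketch of the sharding and serialization steps listed above (not the authors' actual script), the cleaned sentences could be written to TFRecord shards of 250K sentences each; the feature name below is an illustrative assumption, since a full BERT pipeline would store token ids and masks rather than raw text.

```python
import tensorflow as tf

SHARD_SIZE = 250_000  # sentences per file, as stated above

def write_shards(sentences, prefix="sudanese"):
    """Split the corpus into shards of 250K sentences and store each shard
    as a TFRecord file of serialized tf.train.Example records."""
    for start in range(0, len(sentences), SHARD_SIZE):
        path = f"{prefix}-{start // SHARD_SIZE:05d}.tfrecord"
        with tf.io.TFRecordWriter(path) as writer:
            for sent in sentences[start:start + SHARD_SIZE]:
                feature = {"text": tf.train.Feature(
                    bytes_list=tf.train.BytesList(value=[sent.encode("utf-8")]))}
                example = tf.train.Example(features=tf.train.Features(feature=feature))
                writer.write(example.SerializeToString())
```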
B. Fine-tuning

Unlike the pre-training process, we fine-tuned SudaBERT and Arabic-BERT using the GPUs provided by Google Colab. We trained on all sentiment analysis datasets for only 3 epochs -as recommended by [7]- with a learning rate of 2e-5, a batch size of 16 and a maximum sequence length of 128. We did the same when training the models on ANERCorp, but this time with a batch size of 128 and a maximum sequence length of 16.

For all datasets, we used the splits provided by the authors when available. Otherwise, we split the data into 80% for training and 20% for testing.
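The sentiment-analysis fine-tuning configuration above can be sketched with the Hugging Face Trainer as one possible implementation; the checkpoint name and the two-example toy dataset are assumptions standing in for the real data splits.

```python
import torch
from transformers import (BertForSequenceClassification, BertTokenizerFast,
                          Trainer, TrainingArguments)

CHECKPOINT = "asafaya/bert-base-arabic"   # assumed name; swap in SudaBERT's files to compare
tokenizer = BertTokenizerFast.from_pretrained(CHECKPOINT)
model = BertForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)

# Toy stand-in for a tokenized sentiment dataset (real data: AJGT, ASTD, ... splits).
texts, labels = ["الشبكة ممتازة", "الخدمة سيئة"], [1, 0]
enc = tokenizer(texts, truncation=True, max_length=128, padding=True)

class ToyDataset(torch.utils.data.Dataset):
    def __len__(self):
        return len(labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in enc.items()}
        item["labels"] = torch.tensor(labels[i])
        return item

# Hyperparameters from Section V-B: 3 epochs, learning rate 2e-5, batch size 16.
args = TrainingArguments(output_dir="finetune-sa", num_train_epochs=3,
                         learning_rate=2e-5, per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=ToyDataset()).train()
```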
TABLE I
PERFORMANCE OF SUDABERT ON ARABIC DOWNSTREAM TASKS AND ONE SUDANESE DIALECT SENTIMENT ANALYSIS DATASET, COMPARED TO ARABIC-BERT

Dataset                                   Metric     Arabic-BERT   SudaBERT
AJGT (SA)                                 Accuracy   91.7          89.2
ArSenTD-LEV (SA)                          Accuracy   56.1          53.9
ASTD (SA)                                 Accuracy   59            51.6
HARD (SA)                                 Accuracy   95.8          95.5
LABR (SA)                                 Accuracy   83.5          80.8
Sentiment analysis for Sudanese dialect   Accuracy   75.4          76.2
                                          Macro-F1   57.7          60.6
ANERCorp (NER)                            Macro-F1   76.9          73.8

VI. RESULTS

Table I shows the experimental results of applying SudaBERT to the sentiment analysis and named entity recognition tasks, compared to Arabic-BERT [13].

The results in Table I show that SudaBERT achieves better performance on Sudanese dialect sentiment analysis than the state-of-the-art Arabic model Arabic-BERT, which indicates that our model works better on the Sudanese dialect. The results also show that Arabic-BERT achieves better performance than SudaBERT on Modern Standard Arabic (MSA) and the other Arabic dialects. We attribute this performance degradation of SudaBERT relative to Arabic-BERT to the additional epochs of pre-training on Sudanese data, which changed the embedding values of the model.

VII. CONCLUSION

In this study, we collected and cleaned Sudanese dialect data from Twitter and public Telegram channels. Then, we used the Arabic-BERT model as a checkpoint from which to start training SudaBERT on the collected data. Finally, we evaluated SudaBERT against Arabic-BERT on two NLU tasks: sentiment analysis and named entity recognition. The experimental results show higher performance of SudaBERT compared to Arabic-BERT when dealing with the Sudanese dialect, while Arabic-BERT was better at understanding MSA and other Arabic dialects.

REFERENCES

[1] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems, pp. 3111-3119, 2013.
[2] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, "Deep contextualized word representations," arXiv preprint arXiv:1802.05365, 2018.
[3] J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543, 2014.
[4] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," Transactions of the Association for Computational Linguistics, vol. 5, pp. 135-146, 2017.
[5] A. B. Soliman, K. Eissa, and S. R. El-Beltagy, "AraVec: A set of Arabic word embedding models for use in Arabic NLP," Procedia Computer Science, vol. 117, pp. 256-265, 2017.
[6] J. Howard and S. Ruder, "Universal language model fine-tuning for text classification," arXiv preprint arXiv:1801.06146, 2018.
[7] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[8] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "RoBERTa: A robustly optimized BERT pretraining approach," arXiv preprint arXiv:1907.11692, 2019.
[9] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, "XLNet: Generalized autoregressive pretraining for language understanding," in Advances in Neural Information Processing Systems, pp. 5753-5763, 2019.
[10] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, "ALBERT: A lite BERT for self-supervised learning of language representations," arXiv preprint arXiv:1909.11942, 2019.
[11] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, "Exploring the limits of transfer learning with a unified text-to-text transformer," arXiv preprint arXiv:1910.10683, 2019.
[12] W. Antoun, F. Baly, and H. Hajj, "AraBERT: Transformer-based model for Arabic language understanding," 2020.
[13] A. Safaya, M. Abdullatif, and D. Yuret, "KUISAIL at SemEval-2020 Task 12: BERT-CNN for offensive speech identification in social media," 2020.
[14] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," 2017.
[15] A. Abdelali, K. Darwish, N. Durrani, and H. Mubarak, "Farasa: A fast and furious segmenter for Arabic," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, (San Diego, California), pp. 11-16, Association for Computational Linguistics, June 2016.
[16] Y. Liu and M. Lapata, "Text summarization with pretrained encoders," 2019.
[17] K. N. Elmadani, M. Elgezouli, and A. Showk, "BERT fine-tuning for Arabic text summarization," 2020.
[18] K. M. Alomari, H. M. ElSherif, and K. Shaalan, "Arabic tweets sentimental analysis using machine learning," in International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, pp. 602-610, Springer, 2017.
[19] R. Baly, A. Khaddaj, H. Hajj, W. El-Hajj, and K. Bashir Shaban, "ArSenTD-LEV: A multi-topic corpus for target-based sentiment analysis in Arabic Levantine tweets," OSACT3, 2018.
[20] M. Nabil, M. Aly, and A. Atiya, "ASTD: Arabic sentiment tweets dataset," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, (Lisbon, Portugal), pp. 2515-2519, Association for Computational Linguistics, Sept. 2015.
[21] A. Elnagar, Y. S. Khalifa, and A. Einea, "Hotel Arabic-reviews dataset construction for sentiment analysis applications," in Intelligent Natural Language Processing: Trends and Applications, pp. 35-52, Springer, 2018.
[22] M. Aly and A. Atiya, "LABR: A large scale Arabic book reviews dataset," in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), (Sofia, Bulgaria), pp. 494-498, Association for Computational Linguistics, Aug. 2013.
[23] R. Ismail, M. Omer, M. Tabir, N. Mahadi, and I. Amin, "Sentiment analysis for Arabic dialect using supervised learning," in 2018 International Conference on Computer, Control, Electrical, and Electronics Engineering (ICCCEEE), pp. 1-6, 2018.
[24] Y. Benajiba and P. Rosso, "ANERsys 2.0: Conquering the NER task for the Arabic language by combining the maximum entropy with POS-tag information," in IICAI, pp. 1814-1823, 2007.