ABSTRACT
With the massive use of social media today, mixing between languages in social media
text is prevalent. In linguistics, the phenomenon of mixing languages is known as code-
mixing. The prevalence of code-mixing exposes various concerns and challenges in
natural language processing (NLP), including language identification (LID) tasks. This
study presents a word-level language identification model for code-mixed Indonesian,
Javanese, and English tweets. First, we introduce a code-mixed corpus for Indonesian-
Javanese-English language identification (IJELID). To ensure reliable dataset annota-
tion, we provide full details of the data collection and annotation standards construction
procedures. Some challenges encountered during corpus creation are also discussed in
this paper. Then, we investigate several strategies for developing code-mixed language
identification models, namely fine-tuning BERT, BLSTM-based architectures, and CRF. Our results
show that fine-tuned IndoBERTweet models can identify languages better than the
other techniques. This is the result of BERT’s ability to understand each word’s context
from the given text sequence. Finally, we show that sub-word language representation
in BERT models can provide a reliable model for identifying languages in code-mixed
texts.
INTRODUCTION
Today, mixing languages is prevalent in daily communication, especially in informal situations, such as texting posts on social media. In linguistics, combining two or more languages within an utterance of speech or text is called code-mixing (Hoffmann, 2014; Ritchie & Bhatia, 2012). Mixing languages is particularly common in regions where people are natively multilingual.
Indonesia is one of the world's most multilingual countries, with over 700 local spoken languages (Aji et al., 2022). More than 198 million and 84 million people speak Indonesian and Javanese, respectively (Eberhard, Simons & Fennig, 2021). Hence, mixing Indonesian and Javanese in an utterance is common in Indonesia, especially among the Javanese people. Besides, exposure to English from social media and school makes Indonesians mix their
languages with English (Rizal & Stymne, 2020). As a result, mixing Indonesian, Javanese,
and English in daily conversation becomes the most prevalent language combination in
Indonesian society (Yulianti et al., 2021).
The following is an example of a code-mixed sentence containing Indonesian, Javanese,
and English words:
‘‘Aku udah coba ngedownload tapi error, tulung aku diewangi downloadke panduane!’’
(English: I have tried to download but error, please help me download the guideline!).
The sentence contains the following language compositions: Indonesian (aku, udah,
coba, tapi), mixed Indonesian-English (ngedownload), English (error), Javanese (tulung, aku, diewangi), mixed Javanese-English (downloadke), and mixed Indonesian-Javanese (panduane).
In the above sentence, the mixing of languages occurs not only within the sentence
but also within the word. For example, the word ‘ngedownload’ consists of ‘nge-‘
(informal Indonesian prefix) and ‘download’ (English). The word ‘downloadke’ consists of
‘download’ (English) and ‘-ke’ (Javanese suffix). The word ‘panduane’ consists of ‘panduan’
(Indonesian) and ‘-e’ (Javanese suffix).
To analyze code-mixed text, a language identification (LID) task is often used as part
of the pre-processing step (Hidayatullah et al., 2022). LID is critical for some subsequent
natural language processing tasks in code-mixed documents (Gundapu & Mamidi, 2018).
Applying LID to code-mixed text has become foundational for various NLP systems,
including sentiment analysis (Ansari & Govilkar, 2018; Mahata, Das & Bandyopadhyay,
2021), translation (Barik, Mahendra & Adriani, 2019; Mahata et al., 2019), and emotion
classification (Yulianti et al., 2021). The absence of LID in pre-processing tasks can affect
those NLP systems. For example, if the language is not accurately identified, a code-mixed
sentence will produce an inaccurate translation. In another case, an offensive content
identification system may produce incorrect results if the languages of the words in a sentence are not
correctly identified (Singh, Sen & Kumaraguru, 2018).
However, most existing NLP systems are designed to process a single language at once
(Sabty et al., 2021). The number of NLP systems that can process multiple languages per
sentential unit is limited (Nguyen et al., 2021). Traditional language identification systems fail to detect languages correctly in mixed-language texts (Kalita & Saharia, 2018). Processing multiple languages within a sentence also requires more work than processing monolingual text, because mixing can occur at the sentence, clause, word, and sub-word levels (Mave, Maharjan & Solorio, 2018). Detecting languages in code-mixed text with a traditional approach such as dictionary lookup is no longer adequate: the dictionary approach produces poor results due to spelling inconsistencies and the loss of word context (Ansari & Govilkar, 2018).
On the other hand, the availability of annotated code-mixed data, including Indonesian
and Javanese data, remains limited. Even though Indonesian and Javanese have many
speakers, only a few studies have addressed the code-mixing phenomenon in the Indonesian
language (Adilazuarda et al., 2022; Winata et al., 2022). In comparison to the languages
spoken in Europe, the existence of Indonesian and Javanese languages in NLP research is
relatively understudied (Aji et al., 2022).
Considering the problems above, this study makes the following contributions:
RELATED WORK
Code-mixed data availability for language identification
In this study, we survey papers focused on language identification for code-mixed text. As a result, we find 17 related studies published between 2016 and 2022. Within that period, we identify 14 code-mixed datasets, such as Manipuri-
English (Lamabam & Chakma, 2016), Konkani-English (Phadte & Wagh, 2017), Telugu-
English (Gundapu & Mamidi, 2018), Bengali-English (Jamatia, Das & Gambäck, 2018;
Mandal & Singh, 2018), Hindi-English (Ansari et al., 2021; Jamatia, Das & Gambäck, 2018;
Mandal & Singh, 2018; Shekhar, Sharma & Beg, 2020), Bengali-Hindi-English (Jamatia,
Das & Gambäck, 2018), Turkish-English (Yirmibeşoğlu & Eryiğit, 2018), Indonesian-
English (Barik, Mahendra & Adriani, 2019; Yulianti et al., 2021), Sinhala-English (Smith &
Thayasivam, 2019), Arabic-English (Sabty et al., 2021), English-Assamese-Hindi-Bengali
(Sarma, Singh & Goswami, 2022), Telugu-English (Kusampudi, Chaluvadi & Mamidi,
2021), Malayalam-English (Thara & Poornachandran, 2021), and Kannada-English
(Shashirekha et al., 2022; Tonja et al., 2022).
From those 14 datasets, five studies utilized Indonesian-related code-mixed datasets.
Barik, Mahendra & Adriani (2019) introduced code-mixed Indonesian-English data from
Twitter for the text normalization task. Yulianti et al. (2021) used the dataset created by
Barik, Mahendra & Adriani (2019) to see the impact of code-mixed normalization on the
emotion classification task. In that work, they also introduced new feature sets to improve the performance of the language identification task. Suciati & Budi (2019) created
a review dataset containing mixed Indonesian-English. They gathered the review data
from a culinary website for an aspect-based opinion mining task. Arianto & Budi (2020)
also developed a code-mixed Indonesian-English dataset for aspect-based sentiment
analysis. The dataset was collected from Google Maps reviews. Tho et al. (2021) proposed
a code-mixed Indonesian-Javanese corpus using the Twitter dataset for sentiment analysis.
Among the five previous studies, two papers (Barik, Mahendra & Adriani, 2019; Yulianti
et al., 2021) applied language identification in their research. The remaining studies
focused on opinion mining and sentiment analysis. All five studies above provided bilingual code-mixed data, namely Indonesian-English and Indonesian-Javanese. There have been no studies focusing on trilingual code-mixed data, particularly code-mixed Indonesian-Javanese-English. Moreover, most of the existing studies do not concentrate on the language identification task; instead, they aim to solve problems in sentiment analysis, emotion classification, and translation. Since no dataset is available for code-mixed Indonesian-Javanese-English, we construct our own corpus in this study.
CORPUS CREATION
Data collection and pre-processing
We collect 15K tweets in several batches from December 2021 to July 2022. In this work, we compile a list of Indonesian, Javanese, and English keywords for retrieving tweets to ensure that the retrieved tweets are code-mixed. In the pre-processing stage, we first filter
the tweets by removing duplicates. Subsequently, we replace user mentions, URLs, and
hashtags with @user, httpurl, and #hashtag. Finally, we convert all words into lowercase. In
our dataset, we notice some tweets containing languages other than Indonesian, Javanese,
and English, such as Malay, Sundanese, Arabic, and Korean. The occurrence of Malay
and Sundanese in the retrieved tweets is reasonable since both languages have similarities
to Indonesian and Javanese. In addition, Arabic and Korean scripts exist in our dataset
because people sometimes add both scripts to their tweets. In this case, we keep those
languages in our dataset.
Types of non-standard word forms, with examples in Indonesian, Javanese, and English:
1. Mixed letters and numbers
   Indonesian: 'se7' stands for 'setuju' (to agree); 'anak2' stands for 'anak-anak' (kids)
   Javanese: 'siji2' stands for 'siji-siji' (one by one)
   English: 'b4' (before); 'ni8t' (night)
2. Slangs
   Indonesian: 'santuy' stands for 'santai' (relaxed); 'nobar' stands for 'nonton bareng' (watch together)
   Javanese: 'ngunu' stands for 'ngono' (like that); 'bonek' stands for 'bondo nekat' (reckless)
   English: 'epic' (awesome); 'noob' (newbie); 'menfess' (mention confess); 'crashy' (crazy and trashy)
3. Abbreviated words or acronyms
   Indonesian: 'dgn' stands for 'dengan' (with); 'slkh' stands for 'sekolah' (school); 'SD' stands for 'sekolah dasar' (elementary school)
   Javanese: 'kbh' stands for 'kabeh' (all); 'mtrnwn' stands for 'matur nuwun' (thank you); 'lur' stands for 'sedulur' (brother or sister)
   English: 'idk' (I don't know); 'dm' (direct message); 'omg' (oh my god); 'thx' (thanks); 'ur' (your)
4. Expressive lengthening
   Indonesian: 'senaaang' stands for 'senang' (happy)
   Javanese: 'kaaabeeeh' stands for 'kabeh' (all)
   English: 'gooooood' (good)
5. English words written in Indonesian spelling style
   Indonesian: –
   Javanese: –
   English: 'n' (and); 'plis' (please); 'wiken' (weekend); 'donlod' (download); 'gud' (good); 'selow' (slow)
Inter-annotator agreement
This study uses Cohen’s kappa (Cohen, 1960) to quantify the agreement between two
annotators. The following is the equation for calculating Cohen's kappa (K):

$$K = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)}. \quad (1)$$
Pr(a) is the observed agreement, i.e., the frequency with which the two annotators assigned the same label, obtained by dividing the number of agreed labels by the total number of items. Pr(e) is the probability of agreement expected if both annotators labelled the data randomly; it is calculated by summing the probability that both annotators randomly select the first label and the probability that both randomly select the second label. Cohen's kappa value (K) ranges from −1 to 1: a value of −1 indicates that the annotators select different labels for every sample, while a value of 0 indicates that the annotators agree exactly as often as they would by random guessing. Therefore, the closer the K value is to 1, the more reliable the annotation.
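To make Eq. (1) concrete, the following is a minimal Python sketch that computes Pr(a), Pr(e), and K for two annotators' label sequences, generalized to any number of label categories; the toy labels are invented purely for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Compute Cohen's kappa (Eq. 1) for two annotators' label sequences."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)

    # Pr(a): observed agreement, i.e., fraction of items with identical labels.
    pr_a = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Pr(e): expected chance agreement, summed over all label categories.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    pr_e = sum((freq_a[c] / n) * (freq_b[c] / n)
               for c in set(labels_a) | set(labels_b))

    return (pr_a - pr_e) / (1 - pr_e)

# Example: two annotators labelling five tokens with language tags.
ann_1 = ["ID", "JV", "EN", "ID", "JV"]
ann_2 = ["ID", "JV", "EN", "ID", "ID"]
print(round(cohens_kappa(ann_1, ann_2), 4))  # 0.6875
```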
In Eq. (2), w_i is the number of words of language i, max(w_i) represents the number of words of the most prominent language, n is the total number of tokens, and u represents the number of language-independent tokens (such as named entities, abbreviations, mentions, and hashtags). For monolingual utterances, the CMI score equals 0 (zero), since max(w_i) = n − u. A low CMI score implies that the text is largely monolingual, whereas a high CMI score indicates a high degree of code-mixing.
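Because Eq. (2) itself is not reproduced here, the sketch below assumes the commonly used formulation CMI = 100 × (1 − max(w_i)/(n − u)) for n > u (and 0 otherwise), which matches the properties described above; the tag names for language-independent tokens are illustrative only.

```python
from collections import Counter

def cmi(token_langs, independent_tags=("NE", "MENTION", "HASHTAG", "ABBR")):
    """Code-Mixing Index for one utterance, given a language tag per token.

    Assumes the standard formulation CMI = 100 * (1 - max(w_i) / (n - u)),
    where u counts language-independent tokens; returns 0 for monolingual or
    fully language-independent utterances.
    """
    n = len(token_langs)
    u = sum(tag in independent_tags for tag in token_langs)
    if n == u:  # nothing but language-independent tokens
        return 0.0
    lang_counts = Counter(t for t in token_langs if t not in independent_tags)
    max_wi = max(lang_counts.values())
    return 100.0 * (1.0 - max_wi / (n - u))

# 'ID' = Indonesian, 'JV' = Javanese, 'EN' = English (illustrative tag set).
print(cmi(["ID", "ID", "EN", "JV", "MENTION"]))  # 50.0
```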
In Eq. (3), Z(x) is a normalization factor obtained using the following formula:

$$Z(x) = \sum_{y} \prod_{t=1}^{T} \exp\left( \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, x_t) \right). \quad (4)$$
In Eqs. (3) and (4), λ = {λ_k} is a vector of weight parameters estimated from the training set, and F = {f_k(y_t, y_{t-1}, x_t) : k = 1, ..., K} is the set of feature functions. T is the number of time steps, indexed by t; K is the number of features, and k indexes the feature function f_k and its weight λ_k.
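As a sanity check of Eq. (4), the sketch below computes the normalization factor by brute force, enumerating every possible label sequence; the toy feature functions, weights, and lexicon are invented purely for illustration, and a real CRF implementation computes Z(x) with the forward algorithm, since brute force is only feasible for tiny examples.

```python
import math
from itertools import product

def normalization_factor(x, labels, feature_fns, weights):
    """Brute-force Z(x) = sum_y prod_t exp(sum_k lambda_k * f_k(y_t, y_{t-1}, x_t)).

    The product of per-step exponentials is computed as exp of the summed score.
    """
    T = len(x)
    z = 0.0
    for y in product(labels, repeat=T):           # enumerate all label sequences
        score = 0.0
        for t in range(T):
            y_prev = y[t - 1] if t > 0 else None  # no predecessor at t = 0
            score += sum(w * f(y[t], y_prev, x[t])
                         for w, f in zip(weights, feature_fns))
        z += math.exp(score)
    return z

# Toy example: two labels and two hand-made feature functions (illustrative only).
english_lexicon = {"download", "error"}
labels = ["ID", "EN"]
feature_fns = [
    lambda y_t, y_prev, x_t: 1.0 if x_t in english_lexicon and y_t == "EN" else 0.0,
    lambda y_t, y_prev, x_t: 1.0 if y_prev == y_t else 0.0,  # label continuity
]
weights = [0.8, 0.5]
print(normalization_factor(["aku", "download", "error"], labels, feature_fns, weights))
```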
BLSTM-based architecture
Input representation for BLSTM-based architecture
1. Word-level representation
Word-level representation is a real-valued vector representing a word to capture
the semantic relationship between words. Using word embedding, words with a closer
meaning will have a similar vector representation. The embedding layer carries out the
transformation from words to their corresponding vector representations. The embedding
layer receives sentences in the form of a sequence of tokens. Each token is then transformed into a fixed-size word vector by looking up the vector that corresponds to the token's index. Figure 1 illustrates the word-level representation model.
2. Character-level representation
Character-level representation aims to capture morphological features of the words by
processing the character that composes words (Mave, Maharjan & Solorio, 2018). Using
character-level representation can help alleviate the out-of-vocabulary (OOV) problems
in the text data (Joshi & Joshi, 2020). The results of character representation are used to
augment the word vector representation before being processed through the classification
layer.
This study employs CNN and LSTM to train the character-level representation. In
character-based CNN representation, a word is decomposed into a sequence of characters.
The character inputs are passed into the character embedding layer. The embedding
layer outputs are transmitted to the convolutional layer, which produces local features
by applying a convolutional filter across a sliding n-character window. Subsequently, the
max-pooling layer takes the maximum value over each dimension to represent a particular
word. As for the character LSTM representation, we put a single LSTM layer on top of the
embedding layer. The embedding outputs pass the character embedding vector via forward
and backward LSTMs and combine each output to generate the encoding of the associated
word. The character CNN and character LSTM representations are illustrated in Fig. 2.
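To make these two input representations concrete, the following is a minimal Keras sketch that builds a word-embedding branch and a character-CNN branch and concatenates them per token; all vocabulary sizes, sequence lengths, and dimensions are illustrative placeholders rather than the values used in this study, and the character-LSTM variant would simply replace the Conv1D and pooling layers with a bidirectional LSTM over the characters of each word.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Placeholder sizes; the paper does not report these exact values.
VOCAB_SIZE, CHAR_VOCAB = 20_000, 100
MAX_WORDS, MAX_CHARS = 50, 15            # tokens per tweet, characters per token
WORD_DIM, CHAR_DIM, N_FILTERS = 100, 25, 30

# Word-level representation: token indices -> fixed-size word vectors.
word_in = layers.Input(shape=(MAX_WORDS,), name="word_ids")
word_emb = layers.Embedding(VOCAB_SIZE, WORD_DIM)(word_in)

# Character-level (CNN) representation: per-word character indices -> one vector
# per word via a convolution over a sliding character window plus max pooling.
char_in = layers.Input(shape=(MAX_WORDS, MAX_CHARS), name="char_ids")
char_emb = layers.Embedding(CHAR_VOCAB, CHAR_DIM)(char_in)
char_cnn = layers.TimeDistributed(
    layers.Conv1D(filters=N_FILTERS, kernel_size=3, padding="same", activation="relu")
)(char_emb)
char_vec = layers.TimeDistributed(layers.GlobalMaxPooling1D())(char_cnn)

# The word vector is augmented with the character-level vector of each token.
combined = layers.Concatenate()([word_emb, char_vec])
encoder = tf.keras.Model(inputs=[word_in, char_in], outputs=combined)
encoder.summary()
```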
BLSTM
A BLSTM network is able to capture long-distance relations in the sequence in both directions (Mave, Maharjan & Solorio,
2018). We initialize the embedding layer and feed the output sequence to the spatial
dropout layer with a dropout. Subsequently, we add a single BLSTM layer and a recurrent
sigmoid activation. BLSTM applies two LSTM networks by incorporating forward and
backward hidden layers to capture previous and subsequent contexts (Liu & Guo, 2019).
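A minimal Keras sketch of this plain BLSTM tagger (without the character branch) might look as follows; the vocabulary size and number of language labels are placeholders, while the learning rate of 0.01, dropout of 0.5, and 64 LSTM units follow the tuned values reported later in the Experimental setup.

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, MAX_WORDS, EMB_DIM = 20_000, 50, 100   # placeholders
N_LABELS = 7                                       # placeholder label count

inputs = layers.Input(shape=(MAX_WORDS,), name="word_ids")
x = layers.Embedding(VOCAB_SIZE, EMB_DIM)(inputs)
x = layers.SpatialDropout1D(0.5)(x)                # tuned dropout value
x = layers.Bidirectional(
    layers.LSTM(64, return_sequences=True, recurrent_activation="sigmoid")
)(x)
# One language-label prediction per token.
outputs = layers.TimeDistributed(layers.Dense(N_LABELS, activation="softmax"))(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```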
BLSTM-CRF
BLSTM-CRF is a combination of deep learning and traditional machine learning
approaches. The combination between BLSTM and CRF has been proven to produce
good results in sequence tagging tasks (Poostchi, Borzeshi & Piccardi, 2018). A CRF layer
can be added to the top layer of the BLSTM architecture to predict the label of the entire
phrase at the same time (Lafferty, McCallum & Pereira, 2001). In the sequence labeling
task, it is critical to consider the association between neighboring labels. BLSTM, on the other hand, does not model the dependencies between output labels (Wintaka, Bijaksana & Asror, 2019), because its per-token output distributions are predicted independently of one another.
The combination of BLSTM and CRF may efficiently exploit past and future input features
through an LSTM layer and sentence-level tag information through a CRF layer (Huang,
Xu & Yu, 2015). In this BLSTM-CRF architecture, we stack the following layers: input,
embedding, SpatialDropout1D, BLSTM, dense, and CRF.
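The paper does not tie this stack to a specific framework beyond the layer names; as one possible realization, here is a minimal PyTorch sketch using the third-party pytorch-crf package, where an ordinary Dropout stands in for SpatialDropout1D and all sizes are placeholders. The point of the sketch is the CRF head: during training it scores the whole gold label sequence, and at prediction time it Viterbi-decodes the best sequence.

```python
import torch
import torch.nn as nn
from torchcrf import CRF   # third-party package: pip install pytorch-crf

class BLSTMCRFTagger(nn.Module):
    """Embedding -> dropout -> BLSTM -> dense (emissions) -> CRF, mirroring the
    layer stack described above (all sizes are illustrative placeholders)."""

    def __init__(self, vocab_size=20_000, emb_dim=100, lstm_units=32,
                 num_labels=7, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.dropout = nn.Dropout(dropout)
        self.blstm = nn.LSTM(emb_dim, lstm_units, batch_first=True, bidirectional=True)
        self.dense = nn.Linear(2 * lstm_units, num_labels)  # per-token emission scores
        self.crf = CRF(num_labels, batch_first=True)

    def forward(self, token_ids, labels=None, mask=None):
        x = self.dropout(self.embedding(token_ids))
        x, _ = self.blstm(x)
        emissions = self.dense(x)
        if labels is not None:
            # Negative log-likelihood of the gold label sequence under the CRF.
            return -self.crf(emissions, labels, mask=mask, reduction="mean")
        # Viterbi decoding of the most likely label sequence per sentence.
        return self.crf.decode(emissions, mask=mask)

# Toy usage with random data (batch of 2 tweets, 6 tokens each).
model = BLSTMCRFTagger()
tokens = torch.randint(1, 20_000, (2, 6))
labels = torch.randint(0, 7, (2, 6))
loss = model(tokens, labels)
loss.backward()
predicted = model(tokens)   # list of label-id sequences
```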
BERT-based architecture
BERT input representation
BERT has a particular set of rules for representing the input text, namely sub-word
representation. Sub-word representation is an alternative solution between word and
character-based representations. BERT uses the WordPiece tokenization algorithm to
create the sub-word representation (Wu et al., 2016). WordPiece starts by establishing an
initial vocabulary composed of elementary units and then increases this vocabulary to the
desired size. The vocabulary begins with characters from a single language. Then the most
common character combinations in the vocabulary are added iteratively. WordPiece learns
merged rules for the character pairs and finds the pair that maximizes the likelihood of the
training data. Equation (5) is the formula to calculate the score for each pair:
$$\text{Score} = \frac{\text{frequency of pair}}{\text{frequency of first unit} \times \text{frequency of second unit}}. \quad (5)$$
The score is calculated by dividing the frequency of the pair by the product of the frequencies of its two components. The algorithm therefore prioritizes merging pairs whose individual parts occur less frequently in the vocabulary. For example, the pair 'read' and '##ing' will not be merged even though the token 'reading' appears frequently, because 'read' and '##ing' each occur frequently in many other words. A pair such as 're' and '##ad' is more likely to be merged, since 're' and '##ad' appear less frequently on their own. As a result, the token 'read' is kept whole, while the token 'reading' is split into 'read' and '##ing'. This captures the idea that 'reading' is derived from 'read', with a slightly different meaning but the same origin.
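The following is a small illustration of Eq. (5) on made-up unit frequencies (the counts are invented for the example, not drawn from any real WordPiece vocabulary); it shows why a frequent pair of frequent parts can still score lower than a rarer pair of rare parts.

```python
def pair_score(pair_freq, first_freq, second_freq):
    """WordPiece merge score from Eq. (5)."""
    return pair_freq / (first_freq * second_freq)

# Invented corpus counts purely for illustration.
# 'read' + '##ing' occur together often, but each part is also very common.
print(pair_score(pair_freq=500, first_freq=2_000, second_freq=5_000))   # 5e-05
# 're' + '##ad' occur together less often, but each part is rare on its own.
print(pair_score(pair_freq=120, first_freq=300, second_freq=150))       # ~0.00267
```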
As illustrated in Fig. 3, BERT input representation consists of three embeddings: token
embeddings, segment embeddings, and position embeddings. In the token embeddings,
two special tokens are added to each sentence: a [CLS] token at the beginning and a [SEP] token at the end. The [SEP] token separates sentences and works together with a learned segment embedding that marks each token as part of segment A or B. Segment
embeddings are sentence numbers encoded in a vector. The model identifies whether a
specific token belongs to sentence A or B in the segment embeddings. Position embeddings
provide information regarding the word order in the input sequence. Finally, the BERT
representation is obtained by summing those three embeddings.
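As a quick illustration of this input representation, the snippet below runs the code-mixed example sentence from the Introduction through the IndoBERTweet tokenizer; the exact sub-word pieces depend on the tokenizer's vocabulary, so treat the output as indicative rather than exact.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("indolem/indobertweet-base-uncased")

text = "aku udah coba ngedownload tapi error"
encoding = tokenizer(text)

# The tokenizer adds [CLS] at the start and [SEP] at the end, and splits
# out-of-vocabulary words into sub-word pieces marked with '##'.
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
print(encoding["token_type_ids"])   # segment ids (all 0 for a single sentence)
```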
BERT
Bidirectional encoder representations from transformers (BERT) is a language
representation model built using the transformer-based technique developed by Google
(Devlin et al., 2019). BERT is a transformer encoder stack capable of simultaneously reading
a whole sequence of inputs. The BERT architecture is a deep bidirectional model, meaning
that BERT takes information from both the left and right sides of the token's context during training.
Fine-tuning BERT
As illustrated in Fig. 4, fine-tuning is done by leveraging a pre-trained model and then
training it on a particular dataset suited to a specific task. The BERT model is first set with
the pre-trained weight parameters. Next, all parameters are fine-tuned using annotated
data from the downstream tasks. Ultimately, the fine-tuned weights are then used for the
prediction task.
In this study, the fine-tuning tasks are performed by leveraging two existing pre-trained
BERT models, namely multilingual BERT (mBERT) (https://huggingface.co/bert-base-
multilingual-cased) and IndoBERTweet (https://huggingface.co/indolem/indobertweet-
base-uncased). The first pre-trained model is created based on the multilingual BERT.
Multilingual BERT is a masked language modeling (MLM) objective-trained model (Devlin
et al., 2019). It is trained with a large Wikipedia corpus on top of 104 languages, including
English, Indonesian, and Javanese. Another pre-trained model used to build the language
identification model is IndoBERTweet. IndoBERTweet is a pre-trained domain-specific
model using a large set of Indonesian Twitter data (Koto, Lau & Baldwin, 2021).
The illustration of fine-tuning BERT for the language identification task can be seen in
Fig. 5. The [CLS] and [SEP] tokens are added at the beginning and the end of a single text sequence, respectively. Each token of the sequence and its contextual representation are denoted by E and R, respectively. The BERT representation of each token is then fed into a dense layer whose parameters are shared across tokens to predict the label of each token.
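A minimal Hugging Face sketch of this fine-tuning setup for word-level language identification is shown below; the label set, the alignment of word-level labels to the first sub-word of each word, and the learning rate and batch size are illustrative assumptions rather than the paper's exact configuration.

```python
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          TrainingArguments, Trainer)

# Illustrative label set; the paper's tag set may differ.
labels = ["ID", "JV", "EN", "MIX-ID-EN", "MIX-JV-EN", "MIX-ID-JV", "OTHER"]
label2id = {l: i for i, l in enumerate(labels)}

model_name = "indolem/indobertweet-base-uncased"   # or "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(labels),
    id2label={i: l for l, i in label2id.items()}, label2id=label2id,
)

def encode(words, word_labels):
    """Tokenize a pre-split tweet and align word-level labels to sub-words.
    Only the first sub-word of each word keeps its label; the rest get -100 so
    they are ignored by the loss (a common convention, assumed here)."""
    enc = tokenizer(words, is_split_into_words=True, truncation=True)
    aligned, previous = [], None
    for word_id in enc.word_ids():
        if word_id is None or word_id == previous:
            aligned.append(-100)
        else:
            aligned.append(label2id[word_labels[word_id]])
        previous = word_id
    enc["labels"] = aligned
    return enc

args = TrainingArguments(output_dir="ijelid-lid", learning_rate=3e-5,
                         per_device_train_batch_size=16, num_train_epochs=5)
# trainer = Trainer(model=model, args=args, train_dataset=..., eval_dataset=...)
# trainer.train()
```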
Experimental setup
We conduct the experiments by splitting our dataset into training (6,170 tweets), validation
(1,543 tweets), and testing (3,306 tweets). The training is performed on a machine with an 80-core CPU, 250 GB of RAM, and four GPUs (NVIDIA Tesla V100 SXM2). Before training, we apply
hyperparameter tuning to get the optimal hyperparameters for each technique to maximize
model performance.
For CRF training, the L-BFGS algorithm is utilized to optimize the model parameters. We apply randomized search using RandomizedSearchCV (https:
//scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.
html) to find the best parameter values for L1 (Lasso) and L2 (Ridge) regularization
coefficients in the CRF algorithm. The hyperparameter search is done by applying 5-fold
cross-validation. In addition, we set the number of parameter settings to 20.
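The paper does not name the CRF library; assuming the widely used sklearn-crfsuite implementation and exponential sampling distributions for the two regularization coefficients, the search could be set up roughly as follows (feature extraction is omitted, and recent scikit-learn releases may need pinning to stay compatible with sklearn-crfsuite).

```python
import scipy.stats
import sklearn_crfsuite
from sklearn.metrics import make_scorer
from sklearn.model_selection import RandomizedSearchCV
from sklearn_crfsuite import metrics

# X_train: list of sentences, each a list of per-token feature dicts;
# y_train: list of per-token language-label sequences (prepared elsewhere).

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100,
                           all_possible_transitions=True)

search = RandomizedSearchCV(
    estimator=crf,
    param_distributions={
        "c1": scipy.stats.expon(scale=0.5),   # L1 (Lasso) regularization coefficient
        "c2": scipy.stats.expon(scale=0.05),  # L2 (Ridge) regularization coefficient
    },
    n_iter=20,                                # 20 parameter settings, as in the text
    cv=5,                                     # 5-fold cross-validation
    scoring=make_scorer(metrics.flat_f1_score, average="weighted"),
    n_jobs=-1,
)
# search.fit(X_train, y_train)
# best_crf = search.best_estimator_
```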
In the BLSTM-based experiments, we use a grid search algorithm for hyperparameter
tuning. First, we set several values to find the best values for learning rate, batch size,
dropout, and the number of LSTM units. The best learning rate and dropout value for
all models are 0.01 and 0.5, respectively. The batch size and LSTM units for BLSTM and
BLSTM + character CNN architectures are 32 and 64, respectively. As for BLSTM-CRF
and BLSTM + character LSTM, the batch size and LSTM units are 64 and 32, respectively.
Then, the BLSTM-based models are trained using the Adam optimizer over 20 epochs.
Table 2 provides the best hyperparameter values for each BLSTM-based model.
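A plain grid search over these four hyperparameters needs no extra tooling; in the sketch below, the candidate values are illustrative and train_and_score is a stand-in (here returning a random number so the loop runs end to end) for building, training, and validating one BLSTM configuration.

```python
import random
from itertools import product

def train_and_score(learning_rate, batch_size, dropout, lstm_units):
    """Stand-in: build the BLSTM tagger with these hyperparameters, train it on
    the training split, and return its validation score."""
    return random.random()

# Illustrative candidate grids; the paper's exact candidate values are not listed.
grid = {
    "learning_rate": [0.01, 0.001],
    "batch_size": [32, 64],
    "dropout": [0.3, 0.5],
    "lstm_units": [32, 64],
}

best_score, best_config = float("-inf"), None
for values in product(*grid.values()):
    config = dict(zip(grid.keys(), values))
    score = train_and_score(**config)
    if score > best_score:
        best_score, best_config = score, config

print(best_config, best_score)
```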
We also employ a grid search algorithm for fine-tuning BERT models. The
hyperparameter search is conducted by setting categorical values for learning rate, per
device training batch size, and per device evaluation batch size. The learning rate is set to
1e−4, 3e−4, 2e−5, 3e−5, and 5e−5. For the per-device train and evaluation batch size,
we select the categorical values of 8, 16, 32, 64, and 128. The hyperparameter search is then run with these candidate values over four epochs and ten trials. The best
hyperparameter values for each BERT-based model are presented in Table 3. Finally, we
use the best hyperparameters with the Adam optimizer and five epochs for fine-tuning
BERT training.
Table 6 Precision, recall, and F1 score results on the test set for each model.
The inter-annotator agreement, measured with Cohen's kappa, attains a value of 0.9964. This result represents an almost perfect agreement between the two annotators. As for the CMI, we obtain a score of 38.05. It means that 38.05% of the overall
non-neutral language tokens in the dataset are code-mixed.
Our experiments show that the fine-tuned BERT models achieve better results than the BLSTM-based models and perform competitively against the other techniques. The fine-tuned IndoBERTweet model achieves the highest macro F1 score of 93.53% among all models, 0.57 percentage points higher than fine-tuned mBERT, which attains a 92.96% F1 score. Furthermore, the fine-tuned BERT models also perform well at identifying intra-word code-mixing.
Funding
This work is supported by Universiti Brunei Darussalam (Grant no. UBD/RSCH/1.18/FICBF
(a)/2023/007). The funders had no role in study design, data collection and analysis, decision
to publish, or preparation of the manuscript.
Grant Disclosures
The following grant information was disclosed by the authors:
Universiti Brunei Darussalam: UBD/RSCH/1.18/FICBF(a)/2023/007.
Competing Interests
The authors declare there are no competing interests.
Author Contributions
• Ahmad Fathan Hidayatullah conceived and designed the experiments, performed the
experiments, analyzed the data, performed the computation work, prepared figures
and/or tables, authored or reviewed drafts of the article, and approved the final draft.
• Rosyzie Anna Apong conceived and designed the experiments, analyzed the data,
authored or reviewed drafts of the article, and approved the final draft.
• Daphne T.C. Lai conceived and designed the experiments, analyzed the data, authored
or reviewed drafts of the article, and approved the final draft.
• Atika Qazi conceived and designed the experiments, analyzed the data, authored or
reviewed drafts of the article, and approved the final draft.
Data Availability
The following information was supplied regarding data availability:
The data is available at Zenodo: Hidayatullah, Ahmad Fathan. (2022). Code-
mixed Indonesian-Javanese-English Twitter Dataset (Version v1) [Data set]. Zenodo.
https://doi.org/10.5281/zenodo.7567573.
REFERENCES
Adilazuarda MF, Cahyawijaya S, Winata GI, Fung P, Purwarianti A. 2022. IndoRobusta: towards robustness against diverse code-mixed Indonesian local languages. In: Proceedings of the First Workshop on Scaling Up Multilingual Evaluation. 25–34.