Best - Corpus Creation and Language.....
                                      ABSTRACT
                                      With the massive use of social media today, mixing between languages in social media
                                      text is prevalent. In linguistics, the phenomenon of mixing languages is known as code-
                                      mixing. The prevalence of code-mixing exposes various concerns and challenges in
                                      natural language processing (NLP), including language identification (LID) tasks. This
                                      study presents a word-level language identification model for code-mixed Indonesian,
                                      Javanese, and English tweets. First, we introduce a code-mixed corpus for Indonesian-
                                      Javanese-English language identification (IJELID). To ensure reliable dataset annotation,
                                      we provide full details of the data collection and annotation standards construction
                                      procedures. Some challenges encountered during corpus creation are also discussed in
                                      this paper. Then, we investigate several strategies for developing code-mixed language
                                      identification models, such as fine-tuned BERT, BLSTM-based models, and CRF. Our results
                                      show that fine-tuned IndoBERTweet models can identify languages better than the
                                      other techniques. This is the result of BERT’s ability to understand each word’s context
                                      from the given text sequence. Finally, we show that sub-word language representation
                                      in BERT models can provide a reliable model for identifying languages in code-mixed
                                      texts.
Subjects Computational Linguistics, Data Mining and Machine Learning, Natural Language and Speech, Network Science and Online Social Networks, Text Mining
Keywords Code-mixing, Language identification, Indonesian, Javanese, English, Twitter, BERT

Submitted 25 May 2022
Accepted 6 March 2023
Published 22 June 2023
Corresponding author Ahmad Fathan Hidayatullah, fathan@uii.ac.id, 21h2501@ubd.edu.bn
Academic editor Lexing Xie
Additional Information and Declarations can be found on page 20
DOI 10.7717/peerj-cs.1312
Copyright 2023 Hidayatullah et al.
Distributed under Creative Commons CC-BY 4.0
OPEN ACCESS

How to cite this article Hidayatullah AF, Apong RA, Lai DTC, Qazi A. 2023. Corpus creation and language identification for code-mixed Indonesian-Javanese-English Tweets. PeerJ Comput. Sci. 9:e1312 http://doi.org/10.7717/peerj-cs.1312

INTRODUCTION
Today, mixing languages is prevalent in daily communication, especially in informal situations such as posting on social media. In linguistics, combining two or more languages within an utterance of speech or text is called code-mixing (Hoffmann, 2014; Ritchie & Bhatia, 2012). Mixing languages is particularly common in regions where people are natively multilingual.
   Indonesia is one of the world’s most multilingual countries, with over 700 local spoken languages (Aji et al., 2022). More than 198 million and 84 million people speak Indonesian and Javanese, respectively (Eberhard, Simons & Fennig, 2021). Hence, mixing Indonesian and Javanese in an utterance is common in Indonesia, especially among the Javanese people. Besides, exposure to English from social media and school leads Indonesians to mix their languages with English (Rizal & Stymne, 2020). As a result, mixing Indonesian, Javanese, and English in daily conversation has become the most prevalent language combination in Indonesian society (Yulianti et al., 2021).
                                      The following is an example of a code-mixed sentence containing Indonesian, Javanese,
                                   and English words:
                                      ‘‘Aku udah coba ngedownload tapi error, tulung aku diewangi downloadke panduane!’’
                                      (English: I have tried to download but error, please help me download the guideline!).
                                      The sentence contains the following language compositions: Indonesian (aku, udah,
                                   coba, tapi), mix Indonesian-English (ngedownload), English (error), Javanese (tulung, aku,
                                   diewangi), mix Javanese-English (downloadke), and mix Indonesian-Javanese (panduane).
                                   In the above sentence, the mixing of languages occurs not only within the sentence
                                   but also within the word. For example, the word ‘ngedownload’ consists of ‘nge-’
                                   (informal Indonesian prefix) and ‘download’ (English). The word ‘downloadke’ consists of
                                   ‘download’ (English) and ‘-ke’ (Javanese suffix). The word ‘panduane’ consists of ‘panduan’
                                   (Indonesian) and ‘-e’ (Javanese suffix).
                                      To analyze code-mixed text, a language identification (LID) task is often used as part
                                   of the pre-processing step (Hidayatullah et al., 2022). LID is critical for some subsequent
                                   natural language processing tasks in code-mixed documents (Gundapu & Mamidi, 2018).
                                   Applying LID to code-mixed text has become a foundational step in various NLP systems,
                                   including sentiment analysis (Ansari & Govilkar, 2018; Mahata, Das & Bandyopadhyay,
                                   2021), translation (Barik, Mahendra & Adriani, 2019; Mahata et al., 2019), and emotion
                                   classification (Yulianti et al., 2021). The absence of LID in pre-processing tasks can affect
                                   those NLP systems. For example, if the language is not accurately identified, a code-mixed
                                   sentence will produce an inaccurate translation. In another case, an offensive content
                                   identification system may produce incorrect results if the words in a sentence are not
                                   correctly identified (Singh, Sen & Kumaraguru, 2018).
                                      However, most existing NLP systems are designed to process a single language at once
                                   (Sabty et al., 2021). The number of NLP systems that can process multiple languages per
                                   sentential unit is limited (Nguyen et al., 2021). Traditional language identification
                                   systems fail to detect languages correctly in mixed-language texts (Kalita & Saharia,
                                   2018). Processing multiple languages within a sentence requires additional processing tasks
                                   compared to monolingual texts due to various language combinations such as sentence,
                                   clause, word, and sub-word levels (Mave, Maharjan & Solorio, 2018). Detecting language
                                   from code-mixed text using a traditional approach like dictionary lookup is no longer
                                   applicable. The dictionary approach produces poor results due to spelling inconsistencies
                                   and the loss of word context (Ansari & Govilkar, 2018).
                                      On the other hand, the availability of annotated code-mixed data, including Indonesian
                                   and Javanese data, remains limited. Even though Indonesian and Javanese have many
                                   speakers, only a few studies have addressed the code-mixing phenomenon in the Indonesian
                                   language (Adilazuarda et al., 2022; Winata et al., 2022). In comparison to the languages
                                   spoken in Europe, the existence of Indonesian and Javanese languages in NLP research is
                                   relatively understudied (Aji et al., 2022).
                                      Considering the problems above, this study makes the following contributions:
                                   RELATED WORK
                                   Code-mixed data availability for language identification
                                   In this study, we survey papers that focus on language identification for
                                   code-mixed text. We found 17 related studies published between 2016 and
                                   2022. Within that period, we identify 14 code-mixed datasets such as Manipuri-
                                   English (Lamabam & Chakma, 2016), Konkani-English (Phadte & Wagh, 2017), Telugu-
                                   English (Gundapu & Mamidi, 2018), Bengali-English (Jamatia, Das & Gambäck, 2018;
                                   Mandal & Singh, 2018), Hindi-English (Ansari et al., 2021; Jamatia, Das & Gambäck, 2018;
                                   Mandal & Singh, 2018; Shekhar, Sharma & Beg, 2020), Bengali-Hindi-English (Jamatia,
                                   Das & Gambäck, 2018), Turkish-English (Yirmibeşoğlu & Eryiğit, 2018), Indonesian-
                                   English (Barik, Mahendra & Adriani, 2019; Yulianti et al., 2021), Sinhala-English (Smith &
                                   Thayasivam, 2019), Arabic-English (Sabty et al., 2021), English-Assamese-Hindi-Bengali
                                   (Sarma, Singh & Goswami, 2022), Telugu-English (Kusampudi, Chaluvadi & Mamidi,
                                   2021), Malayalam-English (Thara & Poornachandran, 2021), and Kannada-English
                                   (Shashirekha et al., 2022; Tonja et al., 2022).
                                      From those 14 datasets, five studies utilized Indonesian-related code-mixed datasets.
                                   Barik, Mahendra & Adriani (2019) introduced code-mixed Indonesian-English data from
                                   Twitter for the text normalization task. Yulianti et al. (2021) used the dataset created by
                                   Barik, Mahendra & Adriani (2019) to see the impact of code-mixed normalization on the
                                   emotion classification task. The same study introduced new feature sets to
                                   improve the performance of the language identification task. Suciati & Budi (2019) created
                                   a review dataset containing mixed Indonesian-English. They gathered the review data
                                   from a culinary website for an aspect-based opinion mining task. Arianto & Budi (2020)
                                   also developed a code-mixed Indonesian-English dataset for aspect-based sentiment
                                   analysis. The dataset was collected from Google Maps reviews. Tho et al. (2021) proposed
                                   a code-mixed Indonesian-Javanese corpus using the Twitter dataset for sentiment analysis.
                                      Among the five previous studies, two papers (Barik, Mahendra & Adriani, 2019; Yulianti
                                   et al., 2021) applied language identification in their research. The remaining studies
                                   focused on opinion mining and sentiment analysis. All five studies above provided
                                   bilingual code-mixed languages, namely Indonesian-English and Indonesian-Javanese.
                                   There have been no studies focusing on trilingual code-mixed language data, particularly
                                   code-mixed Indonesian-Javanese-English. The existing studies do not concentrate on the
                                   language identification task. Instead, they aimed to solve problems in sentiment analysis,
                                   emotion classification, and translation. Since no dataset is available for code-mixed
                                   CORPUS CREATION
                                   Data collection and pre-processing
                                   We collect 15K tweets in several batches from December 2021 to July 2022. To ensure
                                   that the retrieved tweets are code-mixed, we retrieve them using a list of code-mixed
                                   Indonesian, Javanese, and English keywords. In the pre-processing tasks, we first filter
                                   the tweets by removing duplicates. Subsequently, we replace user mentions, URLs, and
                                   hashtags with @user, httpurl, and #hashtag. Finally, we convert all words into lowercase. In
                                   our dataset, we notice some tweets containing languages other than Indonesian, Javanese,
                                   and English, such as Malay, Sundanese, Arabic, and Korean. The occurrence of Malay
                                   and Sundanese in the retrieved tweets is reasonable since both languages have similarities
                                   to Indonesian and Javanese. In addition, Arabic and Korean scripts exist in our dataset
                                   because people sometimes add both scripts to their tweets. In this case, we keep those
                                   languages in our dataset.
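
The pre-processing steps described above can be sketched as follows. This is a minimal illustration, not the authors' actual script: the regular expressions and the function names `normalize_tweet` and `preprocess` are assumptions introduced here, and only the placeholder strings (@user, httpurl, #hashtag) come from the text.

```python
import re

# Assumed patterns; the paper does not specify the exact expressions used.
URL = re.compile(r"https?://\S+|www\.\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")


def normalize_tweet(text):
    """Replace URLs, user mentions, and hashtags with placeholders, then lowercase."""
    text = URL.sub("httpurl", text)       # URLs first, so '#' or '@' inside a URL is not touched
    text = MENTION.sub("@user", text)
    text = HASHTAG.sub("#hashtag", text)
    return text.lower()


def preprocess(tweets):
    """Remove exact duplicates (keeping first occurrences), then normalize each tweet."""
    seen = set()
    cleaned = []
    for tweet in tweets:
        if tweet in seen:
            continue
        seen.add(tweet)
        cleaned.append(normalize_tweet(tweet))
    return cleaned


if __name__ == "__main__":
    raw = [
        "Aku udah coba @Budi https://t.co/abc #Tugas",
        "Aku udah coba @Budi https://t.co/abc #Tugas",
        "Tulung aku",
    ]
    print(preprocess(raw))
```

Duplicates are removed before normalization here, following the order given in the text; deduplicating after normalization would additionally merge tweets that differ only in their mentions or links.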
| No | Type | Indonesian | Javanese | English |
|----|------|------------|----------|---------|
| 1 | Mixed letters and numbers | ‘se7’ stands for ‘setuju’ (to agree); ‘anak2’ stands for ‘anak-anak’ (kids) | ‘siji2’ stands for ‘siji-siji’ (one by one) | ‘b4’ (before); ‘ni8t’ (night) |
| 2 | Slangs | ‘santuy’ stands for ‘santai’ (relaxed); ‘nobar’ stands for ‘nonton bareng’ (watch together) | ‘ngunu’ stands for ‘ngono’ (like that); ‘bonek’ stands for ‘bondo nekat’ (reckless) | ‘epic’ (awesome); ‘noob’ (newbie); ‘menfess’ (mention confess); ‘crashy’ (crazy and trashy) |
| 3 | Abbreviated words or acronym | ‘dgn’ stands for ‘dengan’ (with); ‘slkh’ stands for ‘sekolah’ (school); ‘SD’ stands for ‘sekolah dasar’ (elementary school) | ‘kbh’ stands for ‘kabeh’ (all); ‘mtrnwn’ stands for ‘matur nuwun’ (thank you); ‘lur’ stands for ‘sedulur’ (brother or sister) | ‘idk’ (I don’t know); ‘dm’ (direct message); ‘omg’ (oh my god); ‘thx’ (thanks); ‘ur’ (your) |
| 4 | Expressive lengthening | ‘senaaang’ stands for ‘senang’ (happy) | ‘kaaabeeeh’ stands for ‘kabeh’ (all) | ‘gooooood’ (good) |
| 5 | English words written in Indonesian spelling style | – | – | ‘n’ (and); ‘plis’ (please); ‘wiken’ (weekend); ‘donlod’ (download); ‘gud’ (good); ‘selow’ (slow) |
                                      Inter-annotator agreement
                                      This study uses Cohen’s kappa (Cohen, 1960) to quantify the agreement between two
                                      annotators. The following is the equation for calculating Cohen’s kappa (K ):
\[
K = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)}. \tag{1}
\]
   Pr(a) is the observed agreement, that is, the proportion of items to which the two annotators assigned the same label. It is obtained by dividing the number of agreed labels by the total number of items. Pr(e) is the probability of agreement expected by chance, as if each annotator assigned labels at random according to their own label frequencies. Pr(e) is calculated by summing, over the labels, the probability that both annotators randomly select that label. Cohen’s kappa (K) ranges from −1 to 1. A value of 1 indicates perfect agreement, and a value of 0 indicates that the annotators agreed exactly as often as they would if they were both guessing randomly. Negative values indicate agreement below chance level, as when the annotators select different labels for each sample. Therefore, the closer K is to 1, the more reliable the annotation.
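
As a worked illustration of Eq. (1), the following sketch computes Cohen’s kappa for two annotators from scratch. The label names (ID, JV, EN) and the toy annotations are hypothetical and are not taken from the corpus; the handling of the degenerate case Pr(e) = 1 is a convention chosen for this sketch.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)

    # Pr(a): proportion of items on which both annotators agree.
    pr_a = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Pr(e): chance agreement, summed over every label either annotator used.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    pr_e = sum(
        (count_a[label] / n) * (count_b[label] / n)
        for label in set(count_a) | set(count_b)
    )

    if pr_e == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # undefined (0/0), so this sketch reports full agreement.
        return 1.0
    return (pr_a - pr_e) / (1 - pr_e)


if __name__ == "__main__":
    annotator_1 = ["ID", "ID", "JV", "EN"]  # hypothetical word-level labels
    annotator_2 = ["ID", "JV", "JV", "EN"]
    print(round(cohens_kappa(annotator_1, annotator_2), 3))
```

For these toy annotations, Pr(a) = 0.75 and Pr(e) = 0.3125, so K = 0.4375 / 0.6875 ≈ 0.636.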
   In Eq. (2), wi is the number of words of language i, max(wi) is the number of words of the most prominent language, n is the total number of tokens, and u is the number of language-independent tokens (such as named entities, abbreviations, mentions, and hashtags). For monolingual utterances, the CMI score equals 0, since max(wi) = n − u. A low CMI score implies monolingualism in the text, whereas a high CMI score indicates code-mixing.
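
The CMI computation can be sketched from the symbol definitions above. This sketch assumes that Eq. (2) has the commonly used form CMI = 100 × (1 − max(wi) / (n − u)) when n > u, and 0 otherwise; that form is an assumption here, chosen because it is consistent with the monolingual case described in the text. The label names and the use of "O" for language-independent tokens are also hypothetical.

```python
from collections import Counter


def code_mixing_index(labels, independent=frozenset({"O"})):
    """CMI of one utterance, given one language label per token.

    Assumed form: 100 * (1 - max(w_i) / (n - u)) if n > u, else 0.
    """
    n = len(labels)                                       # total number of tokens
    u = sum(1 for label in labels if label in independent)  # language-independent tokens
    if n == u:
        return 0.0
    # w_i: number of tokens per language, ignoring language-independent ones.
    counts = Counter(label for label in labels if label not in independent)
    return 100.0 * (1 - max(counts.values()) / (n - u))


if __name__ == "__main__":
    print(code_mixing_index(["ID", "ID", "ID", "O"]))        # monolingual utterance
    print(code_mixing_index(["ID", "ID", "JV", "EN", "O"]))  # three languages mixed
```

The first utterance is monolingual, so max(wi) = n − u = 3 and the score is 0. In the second, max(wi) = 2 and n − u = 4, which gives a score of 50 under the assumed formula.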
In Eq. (3), $Z(x)$ is a normalization factor obtained using the following formula:
\[
Z(x) = \sum_{y} \prod_{t=1}^{T} \exp\left\{ \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, x_t) \right\}. \tag{4}
\]
   In Eqs. (3) and (4), $\lambda = \{\lambda_k\}$ is a parameter weight vector estimated from the training set, and $F = \{f_k(y_t, y_{t-1}, x_t)\}_{k=1}^{K}$ is a set of feature functions. $T$ is the number of time steps, indexed by $t$. $K$ is the number of features, and $k$ indexes the feature function $f_k$ and its weight $\lambda_k$.
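
To make the normalization factor in Eq. (4) concrete, the sketch below computes Z(x) in two ways: by enumerating every label sequence exactly as the formula is written, and by the standard forward recursion, which yields the same value without the exponential enumeration. The two labels, the two feature functions, their weights, and the start symbol are all hypothetical choices made for illustration only.

```python
import itertools
import math

LABELS = ["ID", "JV"]   # hypothetical label set
START = "<s>"           # assumed start symbol standing in for y_0


# Hypothetical feature functions f_k(y_t, y_{t-1}, x_t).
def f_javanese_suffix(y, y_prev, x):
    return 1.0 if y == "JV" and x.endswith("e") else 0.0


def f_same_label(y, y_prev, x):
    return 1.0 if y == y_prev else 0.0


FEATURES = [f_javanese_suffix, f_same_label]
WEIGHTS = [1.2, 0.5]    # hypothetical lambda_k values


def local_score(y, y_prev, x):
    """sum_k lambda_k * f_k(y_t, y_{t-1}, x_t) for one time step."""
    return sum(w * f(y, y_prev, x) for w, f in zip(WEIGHTS, FEATURES))


def partition_brute_force(xs):
    """Z(x) by summing over every possible label sequence y."""
    total = 0.0
    for ys in itertools.product(LABELS, repeat=len(xs)):
        prev, product = START, 1.0
        for y, x in zip(ys, xs):
            product *= math.exp(local_score(y, prev, x))
            prev = y
        total += product
    return total


def partition_forward(xs):
    """Z(x) by the forward recursion, linear in the sequence length."""
    alpha = {START: 1.0}
    for x in xs:
        alpha = {
            y: sum(a * math.exp(local_score(y, y_prev, x)) for y_prev, a in alpha.items())
            for y in LABELS
        }
    return sum(alpha.values())


if __name__ == "__main__":
    words = ["panduane", "aku", "tulung"]
    print(partition_brute_force(words), partition_forward(words))
```

The brute-force version mirrors Eq. (4) term by term and is only feasible for short sequences; practical CRF implementations rely on the forward recursion.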
                                   BLSTM-based architecture
                                   Input representation for BLSTM-based architecture
                                   1. Word-level representation
                                      Word-level representation is a real-valued vector representing a word to capture
                                   the semantic relationship between words. Using word embedding, words with a closer
                                   meaning will have a similar vector representation. The embedding layer carries out the
                                   transformation from words to their corresponding vector representations. The embedding
                                   layer receives sentences in the form of a sequence of tokens. Each token is then transformed
                                   into a word vector with a fixed size by mapping the index of such a token. Figure 1 illustrates
                                   the word-level representation model.
                                      2. Character-level representation
                                      Character-level representation aims to capture morphological features of the words by
                                   processing the character that composes words (Mave, Maharjan & Solorio, 2018). Using
                                   character-level representation can help alleviate the out-of-vocabulary (OOV) problems
                                   in the text data (Joshi & Joshi, 2020). The results of character representation are used to
                                   augment the word vector representation before being processed through the classification
                                   layer.
                                      This study employs CNN and LSTM to train the character-level representation. In
                                   character-based CNN representation, a word is decomposed into a sequence of characters.
                                   The character inputs are passed into the character embedding layer. The embedding
                                   layer outputs are transmitted to the convolutional layer, which produces local features
                                   by applying a convolutional filter across a sliding n-character window. Subsequently, the
                                   max-pooling layer takes the maximum value over each dimension to represent a particular
                                   word. As for the character LSTM representation, we put a single LSTM layer on top of the
                                   embedding layer. The character embedding vectors are passed through forward
                                   and backward LSTMs, and their outputs are combined to generate the encoding of the associated
                                   word. The character CNN and character LSTM representations are illustrated in Fig. 2.
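
The character CNN encoding described above (character embedding, convolution over a sliding n-character window, then max-pooling) can be sketched in plain Python. This is an illustrative toy, not the model used in the study: the embedding values, the filters, the window size, and the zero-padding of short words are assumptions.

```python
def char_cnn_encode(word, char_emb, filters, window=3):
    """Encode a word as one max-pooled activation per convolutional filter.

    char_emb: dict mapping a character to its embedding (list of d floats).
    filters:  list of filters, each a flat list of window * d weights.
    """
    dim = len(next(iter(char_emb.values())))
    zero = [0.0] * dim
    # Look up each character; unknown characters map to a zero vector (assumption).
    vectors = [char_emb.get(ch, zero) for ch in word]
    # Pad short words so that at least one full window exists (assumption).
    while len(vectors) < window:
        vectors.append(zero)

    features = []
    for filt in filters:
        best = float("-inf")
        # Slide the filter across every n-character window.
        for start in range(len(vectors) - window + 1):
            flat = [v for vec in vectors[start:start + window] for v in vec]
            activation = sum(w * v for w, v in zip(filt, flat))
            best = max(best, activation)   # max-pooling over window positions
        features.append(best)
    return features


if __name__ == "__main__":
    toy_emb = {"a": [1.0], "b": [2.0], "c": [3.0]}   # 1-dimensional toy embeddings
    toy_filters = [[1.0, 1.0], [1.0, -1.0]]          # two filters over 2-character windows
    print(char_cnn_encode("abc", toy_emb, toy_filters, window=2))
```

With these toy values, the first filter responds most strongly to the window "bc" (2 + 3 = 5), while the second filter gives −1 for both windows, so the word is encoded as [5.0, -1.0]. The resulting vector has one entry per filter regardless of the word length, which is what allows it to be concatenated with the word vector.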
                                   long-distance relations in the sequence in both directions (Mave, Maharjan & Solorio,
                                   2018). We initialize the embedding layer and feed the output sequence to the spatial
                                   dropout layer with a dropout. Subsequently, we add a single BLSTM layer and a recurrent
                                   sigmoid activation. BLSTM applies two LSTM networks by incorporating forward and
                                   backward hidden layers to capture previous and subsequent contexts (Liu & Guo, 2019).
                                   BLSTM-CRF
                                   BLSTM-CRF is a combination of deep learning and traditional machine learning
                                   approaches. Combining BLSTM and CRF has been shown to produce
                                   good results in sequence tagging tasks (Poostchi, Borzeshi & Piccardi, 2018). A CRF layer
                                   can be added to the top layer of the BLSTM architecture to predict the label of the entire
                                   phrase at the same time (Lafferty, McCallum & Pereira, 2001). In the sequence labeling
                                   task, it is critical to consider the association between neighboring labels. BLSTM, on the
                                   other hand, does not generalize the connection between output labels (Wintaka, Bijaksana
                                   & Asror, 2019). This is because BLSTM predicts the label distribution of each token independently.
                                   The combination of BLSTM and CRF may efficiently exploit past and future input features
                                   through an LSTM layer and sentence-level tag information through a CRF layer (Huang,
                                   Xu & Yu, 2015). In this BLSTM-CRF architecture, we stack the following layers: input,
                                   embedding, SpatialDropout1D, BLSTM, dense, and CRF.
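
The benefit of the CRF layer can be illustrated with Viterbi decoding, the standard way of choosing the best label sequence jointly from per-token scores and label-transition scores. In this sketch the per-token scores stand in for the outputs of the dense layer; all scores, labels, and the transition table are hypothetical.

```python
def independent_decode(emissions, labels):
    """Pick the best label for each token on its own (no CRF layer)."""
    return [max(labels, key=lambda y: em[y]) for em in emissions]


def viterbi_decode(emissions, transitions, labels):
    """Pick the label sequence with the highest total score.

    emissions:   list of dicts, label -> score for that token.
    transitions: dict, (previous_label, label) -> score; missing pairs score 0.
    """
    if not emissions:
        return []
    best = {y: emissions[0][y] for y in labels}
    backpointers = []
    for em in emissions[1:]:
        new_best, pointer = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[p] + transitions.get((p, y), 0.0))
            new_best[y] = best[prev] + transitions.get((prev, y), 0.0) + em[y]
            pointer[y] = prev
        best = new_best
        backpointers.append(pointer)

    # Trace the best path backwards from the best final label.
    path = [max(labels, key=lambda y: best[y])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    return path[::-1]


if __name__ == "__main__":
    labels = ["ID", "JV"]
    emissions = [
        {"ID": 2.0, "JV": 0.0},
        {"ID": 0.4, "JV": 0.6},   # ambiguous token, slightly favouring JV
        {"ID": 2.0, "JV": 0.0},
    ]
    transitions = {("ID", "ID"): 1.0, ("JV", "JV"): 1.0,
                   ("ID", "JV"): -1.0, ("JV", "ID"): -1.0}
    print(independent_decode(emissions, labels))
    print(viterbi_decode(emissions, transitions, labels))
```

Decoded independently, the ambiguous middle token is labelled JV. With transition scores that favour staying in the same language, joint decoding labels all three tokens ID, which is the kind of neighbouring-label information the text attributes to the CRF layer.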
                                   BERT-based architecture
                                   BERT input representation
                                   BERT has a particular set of rules for representing the input text, namely sub-word
                                   representation. Sub-word representation is an alternative solution between word and
                                   character-based representations. BERT uses the WordPiece tokenization algorithm to
                                   create the sub-word representation (Wu et al., 2016). WordPiece starts by establishing an
                                   initial vocabulary composed of elementary units and then increases this vocabulary to the
                                   desired size. The vocabulary begins with characters from a single language. Then the most
                                   common character combinations in the vocabulary are added iteratively. WordPiece learns
                                   merge rules for the character pairs and selects the pair that maximizes the likelihood of the
                                   training data. Equation (5) gives the score for each pair:
\[
\text{Score} = \frac{\text{frequency of pair}}{\text{frequency of first unit} \times \text{frequency of second unit}}. \tag{5}
\]
   The score is calculated by dividing the frequency of the pair by the product of the frequencies of its two components. The algorithm therefore prioritizes merging pairs whose individual parts occur less frequently in the vocabulary. For example, the pair ‘read’ and ‘##ing’ will not be merged even though the token ‘reading’ appears frequently, because ‘read’ and ‘##ing’ each occur in many other words. In contrast, the pair ‘re’ and ‘##ad’ will likely be merged, since ‘re’ and ‘##ad’ appear less frequently on their own. Therefore, the token ‘read’ is not split, while the token ‘reading’ is separated into ‘read’ and ‘##ing’. This captures the idea that ‘reading’ is derived from ‘read’, with a slightly different meaning but the same origin.
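
The scoring rule in Eq. (5) can be checked with a small calculation. The frequencies below are invented for illustration; only the formula and the ‘read’/‘##ing’ versus ‘re’/‘##ad’ contrast come from the text.

```python
def wordpiece_score(pair_freq, first_freq, second_freq):
    """Eq. (5): pair frequency divided by the product of the unit frequencies."""
    return pair_freq / (first_freq * second_freq)


def best_merge(pair_freqs, unit_freqs):
    """Return the highest-scoring pair together with all pair scores."""
    scores = {
        pair: wordpiece_score(freq, unit_freqs[pair[0]], unit_freqs[pair[1]])
        for pair, freq in pair_freqs.items()
    }
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Hypothetical counts: 'read' and '##ing' are frequent on their own,
    # whereas 're' and '##ad' rarely occur outside the pair.
    unit_freqs = {"read": 400, "##ing": 900, "re": 60, "##ad": 50}
    pair_freqs = {("read", "##ing"): 50, ("re", "##ad"): 45}
    pair, scores = best_merge(pair_freqs, unit_freqs)
    print(pair, scores)
```

With these counts, the pair (‘re’, ‘##ad’) scores 45 / (60 × 50) = 0.015, while (‘read’, ‘##ing’) scores 50 / (400 × 900) ≈ 0.00014, so the former is merged first even though the latter pair is slightly more frequent.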
                                      As illustrated in Fig. 3, BERT input representation consists of three embeddings: token
                                   embeddings, segment embeddings, and position embeddings. In the token embeddings,
                                   two special tokens are added to each sentence. At the beginning of each sentence, a
                                   [CLS] token is added. Another special token is a [SEP] token which is located at the
                                   end of each sentence. The [SEP] token is added to separate sentences. Segment
                                   embeddings are learned vectors that encode the sentence number, so that the model can
                                   identify whether a specific token belongs to sentence A or B. Position embeddings
                                   provide information regarding the word order in the input sequence. Finally, the BERT
                                   representation is obtained by summing those three embeddings.
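
The construction of the BERT input can be sketched as an element-wise sum of three lookups. The tiny integer embedding tables below are invented so that the sum is easy to follow; real BERT embeddings are learned, high-dimensional vectors, and the function name `bert_input` is an assumption of this sketch.

```python
def bert_input(tokens_a, token_emb, segment_emb, position_emb, tokens_b=None):
    """Build [CLS] A [SEP] (B [SEP]) and sum token, segment, and position embeddings."""
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segments = [0] * len(tokens)                  # segment A
    if tokens_b:
        tokens += list(tokens_b) + ["[SEP]"]
        segments += [1] * (len(tokens_b) + 1)     # segment B, including its [SEP]

    vectors = []
    for position, (token, segment) in enumerate(zip(tokens, segments)):
        summed = [
            t + s + p
            for t, s, p in zip(token_emb[token], segment_emb[segment], position_emb[position])
        ]
        vectors.append(summed)
    return tokens, segments, vectors


if __name__ == "__main__":
    # Hypothetical 2-dimensional tables with easily recognisable values.
    token_emb = {"[CLS]": [1, 0], "[SEP]": [0, 1], "aku": [2, 2], "coba": [3, 3]}
    segment_emb = [[10, 10], [20, 20]]
    position_emb = [[100 * i, 100 * i] for i in range(8)]
    print(bert_input(["aku"], token_emb, segment_emb, position_emb))
    print(bert_input(["aku"], token_emb, segment_emb, position_emb, tokens_b=["coba"]))
```

For single-sequence tasks such as word-level language identification, only segment A is used, so the segment embedding is the same for every token.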
                                   BERT
                                   Bidirectional encoder representations from transformers (BERT) is a language
                                   representation model built using the transformer-based technique developed by Google
                                   (Devlin et al., 2019). BERT is a transformer encoder stack capable of simultaneously reading
                                   a whole sequence of inputs. The BERT architecture is a deep bidirectional model, meaning
                                   that BERT takes information from both the left and right sides of the token’s context during
                                   Fine-tuning BERT
                                   As illustrated in Fig. 4, fine-tuning is done by leveraging a pre-trained model and then
                                   training it on a particular dataset suited to a specific task. The BERT model is first set with
                                   the pre-trained weight parameters. Next, all parameters are fine-tuned using annotated
                                   data from the downstream tasks. Finally, the fine-tuned weights are used for the
                                   prediction task.
    In this study, fine-tuning is performed with two existing pre-trained BERT models, namely multilingual BERT (mBERT) (https://huggingface.co/bert-base-multilingual-cased) and IndoBERTweet (https://huggingface.co/indolem/indobertweet-base-uncased). Multilingual BERT is trained with a masked language modeling (MLM) objective (Devlin et al., 2019) on a large Wikipedia corpus covering 104 languages, including English, Indonesian, and Javanese. IndoBERTweet is a domain-specific pre-trained model built from a large set of Indonesian Twitter data (Koto, Lau & Baldwin, 2021).
                                       Fig. 5 illustrates fine-tuning BERT for the language identification task. The [CLS] and
                                   [SEP] symbols are added at the beginning and the end of a single text sequence,
                                   respectively. Each token of the sequence and the contextual representation of each token
                                   are denoted by E and R, respectively. The BERT representation of each token is then fed
                                   into dense layers, whose parameters are shared across all token positions, to obtain the
                                   label of each token.
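
The shared dense layer described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the label names, the toy representations R, and the weights W and b are all hypothetical, and a real model would also skip the [CLS] and [SEP] positions when reading off labels.

```python
from typing import List, Sequence

# Hypothetical label subset for illustration; not the corpus's full tag set.
LABELS = ["ID", "JV", "EN"]


def dense(vec: Sequence[float],
          weights: Sequence[Sequence[float]],
          bias: Sequence[float]) -> List[float]:
    """One dense layer: logits[j] = sum_i vec[i] * weights[i][j] + bias[j]."""
    return [
        sum(v * row[j] for v, row in zip(vec, weights)) + bias[j]
        for j in range(len(bias))
    ]


def label_tokens(reps: Sequence[Sequence[float]],
                 weights: Sequence[Sequence[float]],
                 bias: Sequence[float]) -> List[str]:
    """Apply the same dense layer to every token representation and
    pick the label with the highest logit."""
    labels = []
    for rep in reps:
        logits = dense(rep, weights, bias)
        best = max(range(len(logits)), key=logits.__getitem__)
        labels.append(LABELS[best])
    return labels


# Toy contextual representations R for three tokens (hidden size 2).
R = [[2.0, 0.1], [0.2, 1.5], [-1.0, -1.0]]
# Shared weights (hidden size x number of labels) and bias.
W = [[1.0, 0.0, -1.0],
     [0.0, 1.0, -1.0]]
b = [0.0, 0.0, 0.0]

predicted = label_tokens(R, W, b)
print(predicted)
```

Because W and b do not depend on the token position, the number of classifier parameters stays fixed regardless of the sequence length.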
                                   Experimental setup
                                   We conduct the experiments by splitting our dataset into training (6,170 tweets), validation
                                   (1,543 tweets), and test (3,306 tweets) sets. Training is performed on a machine with an
                                   80-core CPU, 250 GB of RAM, and 4 GPUs (NVIDIA Tesla V100 SXM2). Before training, we
                                   tune the hyperparameters of each technique to maximize model performance.
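
As a quick check of the split sizes quoted above, the proportions can be computed directly. The tweet counts come from the text. The remark in the comments about a two-stage split is an inference from the numbers, not something the source states.

```python
# Split sizes as reported in the text.
splits = {"train": 6170, "validation": 1543, "test": 3306}

total = sum(splits.values())
shares = {name: round(100 * n / total, 1) for name, n in splits.items()}

# Share of the validation set within train + validation.
val_within_dev = round(100 * splits["validation"]
                       / (splits["train"] + splits["validation"]), 1)

print(total)           # total number of tweets
print(shares)          # percentage of the corpus in each split
print(val_within_dev)  # consistent with a 70/30 split followed by an 80/20 split
```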
                                       For CRF training, the L-BFGS algorithm is used to optimize the model parameters. We
                                   apply randomized search using RandomizedSearchCV
                                   (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)
                                   to find the best values for the L1 (Lasso) and L2 (Ridge) regularization coefficients of
                                   the CRF. The search uses 5-fold cross-validation, and the number of sampled parameter
                                   settings is set to 20.
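
The search procedure can be sketched without any external library. The following is a simplified stand-in for what RandomizedSearchCV does, not the authors' code: the coefficient names c1 and c2, the exponential sampling distributions, the dataset size of 100, and the placeholder scoring function toy_score are all assumptions made for illustration. In a real run the scoring function would train a CRF on the training fold and return its validation score.

```python
import random
from typing import Callable, Dict, List, Tuple

Params = Dict[str, float]
Fold = Tuple[List[int], List[int]]


def k_fold_indices(n_items: int, k: int) -> List[Fold]:
    """Split indices 0..n_items-1 into k contiguous (train, validation) folds."""
    folds: List[Fold] = []
    start = 0
    for i in range(k):
        size = n_items // k + (1 if i < n_items % k else 0)
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_items))
        folds.append((train, validation))
        start = stop
    return folds


def randomized_search(score_fn: Callable[[Params, List[int], List[int]], float],
                      n_items: int, n_iter: int = 20, k: int = 5,
                      seed: int = 0) -> Tuple[Params, float]:
    """Sample n_iter settings and keep the one with the best mean k-fold score."""
    rng = random.Random(seed)
    folds = k_fold_indices(n_items, k)
    best_params: Params = {}
    best_score = float("-inf")
    for _ in range(n_iter):
        # Assumed sampling distributions: exponential with means 0.5 and 0.05.
        params = {"c1": rng.expovariate(1 / 0.5), "c2": rng.expovariate(1 / 0.05)}
        mean_score = sum(score_fn(params, tr, va) for tr, va in folds) / k
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


calls: List[int] = []


def toy_score(params: Params, train_idx: List[int], val_idx: List[int]) -> float:
    """Placeholder for 'fit a CRF on train_idx and score it on val_idx'."""
    calls.append(len(val_idx))
    return -((params["c1"] - 0.1) ** 2 + (params["c2"] - 0.05) ** 2)


best_params, best_score = randomized_search(toy_score, n_items=100)
print(best_params, round(best_score, 4))
print(len(calls))  # 20 settings x 5 folds = 100 model fits
```

With 20 sampled settings and 5 folds, the search costs 100 model fits, which is why randomized search is attractive when the regularization coefficients are continuous.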
                                       In the BLSTM-based experiments, we use grid search for hyperparameter tuning. We
                                   search over several candidate values for the learning rate, batch size, dropout, and
                                   number of LSTM units. The best learning rate and dropout value for all models are 0.01
                                   and 0.5, respectively. For the BLSTM and BLSTM + character CNN architectures, the best
                                   batch size and number of LSTM units are 32 and 64, respectively. For BLSTM-CRF and
                                   BLSTM + character LSTM, they are 64 and 32, respectively. The BLSTM-based models are
                                   then trained using the Adam optimizer for 20 epochs. Table 2 provides the best
                                   hyperparameter values for each BLSTM-based model.
                                       We also employ grid search for fine-tuning the BERT models. The hyperparameter
                                   search is conducted over categorical values for the learning rate, the per-device training
                                   batch size, and the per-device evaluation batch size. The candidate learning rates are
                                   1e−4, 3e−4, 2e−5, 3e−5, and 5e−5. For the per-device training and evaluation batch
                                   sizes, the candidate values are 8, 16, 32, 64, and 128. The hyperparameter search is run
                                   over these values for four epochs and ten trials. The best hyperparameter values for each
                                   BERT-based model are presented in Table 3. Finally, we use the best hyperparameters
                                   with the Adam optimizer to fine-tune the BERT models for five epochs.
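
To make the size of this search space concrete, the candidate values can be enumerated as follows. This is a sketch under stated assumptions: the dictionary keys are chosen to mirror the names used in the text, and drawing the ten trials at random is an assumption, since the source does not say how the trials are selected.

```python
import itertools
import random

# Candidate values as listed in the text.
search_space = {
    "learning_rate": [1e-4, 3e-4, 2e-5, 3e-5, 5e-5],
    "per_device_train_batch_size": [8, 16, 32, 64, 128],
    "per_device_eval_batch_size": [8, 16, 32, 64, 128],
}

# Full Cartesian grid over the three hyperparameters.
names = list(search_space)
grid = [dict(zip(names, combo))
        for combo in itertools.product(*search_space.values())]
print(len(grid))  # 5 * 5 * 5 combinations

# Assumption: the ten trials are drawn without replacement from the grid.
rng = random.Random(0)
trials = rng.sample(grid, k=10)
for trial in trials:
    print(trial)
```

Ten trials cover only a fraction of the 125 possible combinations, so the reported best values are the best among the settings actually tried.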
Table 6 Precision, recall, and F1 score results on the test set for each model.
                                   attains a value of 0.9964. This result represents almost perfect agreement between the two
                                   annotators. As for the CMI, we obtain a score of 38.05, meaning that 38.05% of the overall
                                   non-neutral language tokens in the dataset are code-mixed.
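
This passage does not restate how the CMI is computed. A minimal sketch, assuming the commonly used definition CMI = 100 × (1 − max_i w_i / (n − u)), where n is the number of tokens, u the number of language-neutral tokens, and w_i the number of tokens in language i. The label names, the neutral tag "O", and the example sequence are hypothetical, and the source may aggregate the index over the corpus differently.

```python
from collections import Counter
from typing import Iterable, Tuple


def cmi(labels: Iterable[str], neutral: Tuple[str, ...] = ("O",)) -> float:
    """Code-mixing index of one labeled token sequence (assumed definition)."""
    counts = Counter(label for label in labels if label not in neutral)
    n_lang = sum(counts.values())  # n - u: tokens carrying a language label
    if n_lang == 0:
        return 0.0  # no language-bearing tokens, so no mixing
    dominant = max(counts.values())  # tokens in the most frequent language
    return 100.0 * (1 - dominant / n_lang)


example = ["ID", "ID", "JV", "EN", "O", "ID"]
print(cmi(example))
print(cmi(["ID", "ID", "O"]))  # monolingual sequence
```

Under this definition, a monolingual sequence scores 0, and the score rises as tokens are spread more evenly across languages.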
                                       Our experiments show that the fine-tuned BERT models yield better results than the
                                   BLSTM-based models and are competitive with the other techniques. The fine-tuned
                                   IndoBERTweet model achieves the highest macro F1 score among all models, 93.53%.
                                   This is 0.57 percentage points higher than the fine-tuned mBERT model, which obtains a
                                   92.96% F1 score. Further, the fine-tuned BERT models perform very well at identifying
                                   intra-word code-mixing.
                                   Funding
                                   This work is supported by Universiti Brunei Darussalam (Grant no. UBD/RSCH/1.18/FICBF
                                   (a)/2023/007). The funders had no role in study design, data collection and analysis, decision
                                   to publish, or preparation of the manuscript.
                                   Grant Disclosures
                                   The following grant information was disclosed by the authors:
                                   Universiti Brunei Darussalam: UBD/RSCH/1.18/FICBF(a)/2023/007.
                                   Competing Interests
                                   The authors declare there are no competing interests.
                                   Author Contributions
                                   • Ahmad Fathan Hidayatullah conceived and designed the experiments, performed the
                                     experiments, analyzed the data, performed the computation work, prepared figures
                                     and/or tables, authored or reviewed drafts of the article, and approved the final draft.
                                   • Rosyzie Anna Apong conceived and designed the experiments, analyzed the data,
                                     authored or reviewed drafts of the article, and approved the final draft.
                                   • Daphne T.C. Lai conceived and designed the experiments, analyzed the data, authored
                                     or reviewed drafts of the article, and approved the final draft.
                                   • Atika Qazi conceived and designed the experiments, analyzed the data, authored or
                                     reviewed drafts of the article, and approved the final draft.
                                   Data Availability
                                   The following information was supplied regarding data availability:
                                      The data is available at Zenodo: Hidayatullah, Ahmad Fathan. (2022). Code-
                                   mixed Indonesian-Javanese-English Twitter Dataset (Version v1) [Data set]. Zenodo.
                                   https://doi.org/10.5281/zenodo.7567573.
                                   REFERENCES
                                    Adilazuarda MF, Cahyawijaya S, Winata GI, Fung P, Purwarianti A. 2022. IndoRobusta:
                                        towards robustness against diverse code-mixed Indonesian local languages.
                                        In: Proceedings of the First Workshop on Scaling Up Multilingual Evaluation. 25–34.