Human accuracy in detecting deception through intuitive judgments has been shown not to exceed the chance level. Therefore, several automated verbal lie-detection techniques employing Machine Learning and Transformer models have been developed to reach higher levels of accuracy. This study
is the first to explore the performance of a Large Language Model, FLAN-T5 (small and base sizes),
in a lie-detection classification task in three English-language datasets encompassing personal
opinions, autobiographical memories, and future intentions. After performing stylometric analysis
to describe linguistic differences in the three datasets, we tested the small- and base-sized FLAN-T5 in three Scenarios using 10-fold cross-validation: one with training and test sets drawn from the same dataset, one with the training set drawn from two datasets and the test set drawn from the third remaining dataset, and one with training and test sets drawn from all three datasets. We reached state-of-the-art results in Scenarios 1 and 3, outperforming previous benchmarks. The results also revealed that model performance depended on model size, with larger models exhibiting higher performance.
Furthermore, a stylometric analysis was performed to carry out an explainability analysis, finding
that linguistic features associated with the Cognitive Load framework may influence the model’s
predictions.
Lie detection involves the process of determining the veracity of a given communication. When producing
deceptive narratives, liars employ verbal strategies to create false beliefs in their interaction partners and are thus involved in a specific and temporary psychological and emotional state1. For this reason, the Undeutsch hypoth-
esis suggests that deceptive narratives differ in form and content from truthful narratives2. This topic has been under constant investigation and development in the field of cognitive psychology, given its significant and
promising applications in the forensic and legal setting3. Its potential pivotal role is in determining the honesty
of witnesses and potential suspects during investigations and legal proceedings, impacting both the investigative
information-gathering process and the final decision-making level4.
Decades of research have focused on identifying verbal cues for deception and developing effective methods
to differentiate between truthful and deceptive narratives, with such verbal cues being, at best, subtle and typically
resulting in both naive and expert individuals performing just above chance levels5,6. A potential explanation
coming from social psychology for this unsatisfactory human performance is the intrinsic human inclination
to the truth bias7, i.e., the cognitive heuristic of presumption of honesty, which makes people assume that an
interaction partner is truthful unless they have reasons to believe otherwise8,9. However, it is worth mentioning
that a more recent study challenged this solid result, finding that instructing participants to rely only on the
best available cue, such as the detailedness of the story, enabled them to consistently discriminate lies from the
truth with accuracy ranging from 59 to 79%10. This finding shifts the debate to (1) the proper number of cues that judges should combine before providing their veracity judgment (with the suggestion that the use-the-best heuristic approach is the most straightforward and accurate) and thus to (2) the diagnosticity level of this cue.
More recently, the issue of verbal lie detection has also been tackled by employing computational techniques,
such as stylometry. Stylometry refers to a set of methodologies and tools from computational linguistics and artificial intelligence that allow researchers to conduct quantitative analyses of linguistic features within written texts to uncover distinctive patterns that can reveal and characterize authorship or other stylistic attributes11–13. Albeit with some limitations, stylometry has proven to be effective in the context of lie detection14,15. The main advantage is the possibility of coding and extracting verbal cues independently of human judgment, hence reducing the problem of inter-coder agreement, as researchers using the same technique on the same data will extract the same indices15.
1Molecular Mind Lab, IMT School for Advanced Studies Lucca, Piazza San Francesco 19, 55100 Lucca, LU,
Italy. 2Department of Mathematics “Tullio Levi‑Civita”, University of Padova, Padova, Italy. 3Department of
General Psychology, University of Padova, Padova, Italy. *email: riccardo.loconte@imtlucca.it
Alongside this trend, several recent studies have explored computational analysis of language in different
domains, such as fake news16,17, transcriptions of court cases18–20, evaluations of deceptive product reviews21–23, investigations into cyber-crimes24, analysis of autobiographical information25, and assessments of deceptive intentions regarding future events26. Taken together, most of those studies focused on the usage of Machine
Learning and Deep Learning algorithms combined with Natural Language Processing (NLP) techniques to detect
deception from verbal cues automatically (see Constâncio et al.27 for a systematic review of the computerized
techniques employed in lie-detection studies).
More recently, a great step forward has been made in the field of AI and NLP with the advent of Large
Language Models (LLMs). LLMs are Transformer-based language models with hundreds of millions of param-
eters trained on a large collection of corpora (i.e., pre-training phase)28. Thanks to this pre-training phase, LLMs
have been shown to capture the intricate patterns and structures of language and to develop a robust understanding of
syntax, semantics, and pragmatics, being able to generate coherent text resembling human natural language. In
addition, once pre-trained, these models can be fine-tuned on specific tasks using smaller task-specific datasets.
Fine-tuning refers to the process of continuing the training of a pre-trained model on a new dataset, allowing it to
adapt its previously learned knowledge to the nuances and specificities of the new data, thereby achieving state-
of-the-art results28. Common tasks for LLMs fine-tuning include NLP tasks, such as language translation, text
classification (e.g., sentiment analysis), question-answering, text summarization, and code generation. Therefore,
LLMs excel at a wide range of NLP tasks, as opposed to models uniquely trained for one specific task28. However,
to the best of our knowledge, despite the extreme flexibility of LLMs, the procedure of fine-tuning an LLM on
small corpora for a lie-detection task has remained unexplored.
The promising results in applying NLP techniques for psychological research suggest the possibility of com-
bining metrics from different psychological frameworks in a new theory-based stylometric analysis, offering the
possibility to investigate verbal lie detection from multiple perspectives in one shot.
• Hypothesis 1a): Fine-tuning an LLM can effectively classify the veracity of short narratives from raw texts, 1b) outperforming classical machine learning and deep learning approaches in verbal lie detection.
• Hypothesis 2): Fine-tuning an LLM on deceptive narratives from two contexts enables the model to detect deception in a new, unseen context.
• Hypothesis 3): Fine-tuning an LLM on truthful and deceptive narratives drawn from multiple contexts enables the model to learn and generalize deception detection across those contexts.
• Hypothesis 4): Model performance depends on model size, with larger models showing higher accuracy.
• Hypothesis 5a): The linguistic style distinguishing truthful from deceptive statements varies across different contexts, 5b) and can be a significant feature for model prediction.
To test Hypothesis 1a, we fine-tuned an open-source LLM, FLAN-T5, using three datasets: personal opinions
(the Deceptive Opinions dataset51), autobiographical experiences (the Hippocorpus dataset52) and future intentions (the Intention dataset49). Given the extreme flexibility of LLMs, this approach is hypothesized to detect
deception from raw texts above the chance level. To test the advantage of our approach compared to classical
machine and deep learning models (Hypothesis 1b), we decided to compare the results with two benchmarks,
further described in the Methods and Materials section.
With regard to Hypotheses 2 and 3, according to empirical evidence, classical machine learning models tend to experience a decline in performance when trained and tested in the aforementioned scenarios53–55. In contrast,
LLMs have acquired a comprehensive understanding of language patterns during the pre-training phase. We
posit that a fine-tuned LLM is capable of generalizing its learning across various contexts. Related to Hypothesis
4, we believe this generalization ability is further enhanced in larger models, as their size is associated with a
more sophisticated representation of language.
Finally, to test Hypothesis 5, we introduced a new theory-based stylometric approach, named DeCLaRatiVE stylometry, to extract linguistic features related to the psychological frameworks of Distancing29, Cognitive Load31, Reality Monitoring32, and the Verifiability Approach40,41, providing a pragmatic set of criteria for extracting features from utterances. We applied DeCLaRatiVE stylometry to compare truthful and deceptive statements in the three aforementioned datasets in order to explore potential differences in linguistic style. Our hypothesis is that the linguistic style distinguishing truthful from deceptive statements may vary across the three datasets, as these types of statements originate from distinct contexts. We also applied the DeCLaRatiVE stylometry technique to provide an explainability analysis of the top-performing model.
In the three selected datasets, participants were required to provide genuine or fabricated statements in three different domains: personal
opinions on five different topics (Opinion dataset), autobiographical experiences (Memory dataset), and future
intentions (Intention dataset). Notably, the specific topic within each domain was counterbalanced among liars
and truth-tellers. A more detailed description of each dataset is available in Supplementary Information as well
as in the method section of each original article.
Table 1 displays an example of truthful and deceptive statements about opinions, memories, and intentions.
Table 2 reports descriptive statistics for each dataset, both overall and when grouped by truthful and deceptive
sets of statements. These statistics include the minimum, maximum, average, and standard deviation of word
counts. Word counts were computed after text tokenization using spaCy, a Python library for text processing.
Additionally, Table 2 provides Jaccard similarity index values between truthful and deceptive vocabulary sets.
Jaccard’s index was derived by calculating the intersection (common words) and union (total words) of these
Opinion (Abortion)
Truthful: "While I am morally torn on the issue, I believe that ultimately it is a woman's body and she should be able to do with it as she pleases. I belive people should not dehumanize the fetus tough, to make themselves feel better. The decision about laws regarding this issue should be left up to the states to decide. To combat this problem, birth control should be easily accessible"
Deceptive: "Abortion is the termination of a life and should not be allowed. If a fetus has made it to the point of being able to survive "on its own" outside its mother's body, what right do we have to cut its life short. If the mother's life is in danger, she already chose that she was willing to sacrifice her life to have a child when she consented to procreating"

Memory (My boyfriend and I went to a concert together and had a great time. We met some of my friends there and really enjoyed ourselves watching the sunset.)
Truthful: "The day started perfectly, with a great drive up to Denver for the show. Me and my boyfriend didn't hit any traffic on the way to Red Rocks, and the weather was beautiful. We met up with my friends at the show, near the top of the theater, and laid down a blanket. The opener came on, and we danced our butts off to the banjoes and mandolins that were playing on-stage. We were so happy to be there. That's when the sunset started. It was so beautiful. The sky was a pastel pink and was beautiful to watch. That's when Phil Lesh came on, and I just about died. It was the happiest moment of my life, seeing him after almost a decade of not seeing him. I was so happy to be there, with my friends and my love. There was nothing that could top that night. We drove home to a sky full of stars and stopped at an overlook to look up at them. I love this place I live. And I love live music. I was so happy"
Deceptive: "Concerts are my most favorite thing, and my boyfriend knew it. That's why, for our anniversary, he got me tickets to see my favorite artist. Not only that, but the tickets were for an outdoor show, which I love much more than being in a crowded stadium. Since he knew I was such a big fan of music, he got tickets for himself, and even a couple of my friends. He is so incredibly nice and considerate to me and what I like to do. I will always remember this event and I will always cherish him. On the day of the concert, I got ready, and he picked me up and we went out to a restaurant beforehand. He is so incredibly romantic. He knew exactly where to take me without asking. We ate, laughed, and had a wonderful dinner date before the big event. We arrived at the concert and the music was so incredibly beautiful. I loved every minute of it. My friends, boyfriend, and I all sat down next to each other. As the music was slowly dying down, I found us all getting lost just staring at the stars. It was such an incredibly unforgettable and beautiful night"

Intention (Going swimming with my daughter)
Truthful: "We go to a Waterbabies class every week, where my 16-month-old is learning to swim. We do lots of activities in the water, such as learning to blow bubbles, using floats to aid swimming, splashing and learning how to save themselves should they ever fall in. I find this activity important as I enjoy spending time with my daughter and swimming is an important life skill"
Deceptive: "I will be taking my 8-year-old daughter swimming this Saturday. We'll be going early in the morning, as it's generally a lot quieter at that time, and my daughter is always up early watching cartoons anyway (5 am!). I'm trying to teach her how to swim in the deep end before she starts her new school in September as they have swimming lessons there twice a week"

Table 1. Truthful and deceptive example statements about opinions, memories, and intentions. In brackets, the topic assigned to the participant in the deceptive condition to fabricate the narrative.
Table 2. Summary statistics of the number of words for each dataset and for the truthful and deceptive sets of statements. The Jaccard Similarity Index and its qualitative interpretation (in brackets) refer to the similarity between the truthful and deceptive vocabulary sets for each dataset.
two sets50,56. The resulting index ranges from 0 to 1, with 0 indicating a completely different vocabulary between
the two sets, and 1 indicating a completely identical vocabulary between the two sets. We reported the Jaccard
similarity index to provide a measure of similarity or overlap between the word choices of truthful and decep-
tive statements within the respective datasets. Supplementary Information offers a detailed methodology for
calculating the Jaccard similarity index.
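As a minimal illustration of how these descriptive statistics can be obtained (assuming spaCy tokenization; the exact token filtering and preprocessing used in this study are detailed in the Supplementary Information), word counts and the Jaccard similarity between truthful and deceptive vocabularies could be computed as follows:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # any English spaCy pipeline is sufficient for tokenization

def word_count(text: str) -> int:
    """Number of word tokens (punctuation and whitespace excluded; filtering is an assumption)."""
    return sum(1 for tok in nlp(text) if not tok.is_punct and not tok.is_space)

def jaccard_similarity(texts_a: list[str], texts_b: list[str]) -> float:
    """Jaccard index between the vocabularies (unique lowercased word types) of two sets of texts."""
    vocab_a = {tok.text.lower() for doc in nlp.pipe(texts_a) for tok in doc if tok.is_alpha}
    vocab_b = {tok.text.lower() for doc in nlp.pipe(texts_b) for tok in doc if tok.is_alpha}
    return len(vocab_a & vocab_b) / len(vocab_a | vocab_b)

truthful = ["We go to a Waterbabies class every week."]
deceptive = ["I will be taking my daughter swimming this Saturday."]
print(word_count(truthful[0]), jaccard_similarity(truthful, deceptive))
```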
FLAN‑T5
We adopted FLAN-T5, an LLM developed by Google researchers and freely available through Hugging Face's Python library Transformers (https://huggingface.co/docs/transformers/model_doc/flan-t5). Hugging Face is a company that provides free access to state-of-the-art LLMs through a Python API. Among the available LLMs, we chose FLAN-T5 because of its valuable trade-off between computational load and the quality of the learned representation. FLAN-T5 is the improved version of T5, a text-to-text general model capable of solving many NLP tasks (e.g., sentiment analysis, question answering, and machine translation), which has been improved through instruction fine-tuning on a large collection of tasks57. The peculiarity of this model is that every task it was trained on is transformed into a text-to-text task. For example, when performing sentiment analysis, the output prediction is the string used in the training set to label the positive or negative sentiment of each phrase rather than a binary integer output (e.g., 0 = positive; 1 = negative). Hence, its power lies both in the generalized representation of natural language learned during the pre-training phase and in the possibility of easily adapting the model to a downstream task with little fine-tuning and without adjusting its architecture.
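As a minimal sketch of this text-to-text formulation (the checkpoint name is a publicly available model; the prompt wording and label strings are illustrative assumptions, not the exact template used in this study), FLAN-T5 can be loaded from the Transformers library and queried so that it emits a label string rather than an integer:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

# Lie detection framed as text-to-text: the expected output is a label string
# ("truthful" / "deceptive") rather than a binary integer. Prompt wording is assumed.
statement = "I will be taking my 8-year-old daughter swimming this Saturday."
prompt = f"Is the following statement truthful or deceptive? {statement}"

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```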
Label Description
num_sentences Total number of sentences
num_words Total number of words
num_syllables Total number of syllables
avg_syllabes_per_word Average number of syllables per word
fk_grade Index of the grade level required to understand the text
fk_read Index of the readability of the text
Analytic LIWC summary statistic analyzing the style of the text in terms of analytical thinking (0–100)
Authentic LIWC summary statistic analyzing the style of the text in terms of authenticity (0–100)
Tone Standardized difference (0–100) of 'tone_pos' minus 'tone_neg'
tone_pos Percentage of words related to a positive sentiment (LIWC dictionary)
tone_neg Percentage of words related to a negative sentiment (LIWC dictionary)
Cognition Percentage of words related to semantic domains of cognitive processes (LIWC dictionary)
memory Percentage of words related to semantic domains of memory/forgetting (LIWC dictionary)
focuspast Percentage of verbs and adverbs related to the past (LIWC dictionary)
focuspresent Percentage of verbs and adverbs related to the present (LIWC dictionary)
focusfuture Percentage of verbs and adverbs related to the future (LIWC dictionary)
Self-reference Sum of LIWC categories ‘i’ + ‘we’
Other-reference Sum of LIWC categories ‘shehe’ + ‘they’ + ‘you’
Perceptual details Sum of LIWC categories ‘attention’ + ‘visual’ + ‘auditory’ + ‘feeling’
Contextual Embedding Sum of LIWC categories ‘space’ + ‘motion’ + ‘time’
Reality Monitoring Sum of Perceptual details + Contextual Embedding + Affect − Cognition
Concreteness score Mean of concreteness score of words
People Unique named-entities related to people: e.g., ‘Mary’, ‘Paul’, ‘Adam’
Temporal details Unique named-entities related to time: e.g., ‘Monday’, ‘2:30 PM’, ‘Christmas’
Spatial details Unique named-entities related to space: e.g., ‘airport’, ‘Tokyo’, ‘Central park’
Quantity details Unique named-entities related to quantities: e.g., ‘20%’, ‘5 $’, ‘first’, ‘ten’, ‘100 m’
Table 3. List and short description of the 26 linguistic features pertaining to the DeCLaRatiVE Stylometry
technique.
Linguistic features related to the Distancing and RM frameworks were computed using LIWC42,43, the gold standard software for analyzing word usage. Using the English LIWC-22 dictionary, we scored each tokenized text on all the available categories. The selection of the LIWC categories related to the
Distancing and RM framework was guided by previous research on computerized verbal lie-detection29,49,50,52,56
and a recent meta-analysis14. RM was also investigated through the linguistic concreteness of words39. To determine
the average level of concreteness for each statement, we utilized the concreteness annotation dataset developed
by Brysbaert et al.61. For the calculation of concreteness scores, a preprocessing pipeline was applied to textual
data using the Python library spaCy: text was converted to lowercase and tokenized; then stop words were
removed, and the remaining content words were lemmatized. These content words were then cross-referenced
with the annotated concreteness dataset to assign the respective concreteness value when a match was found.
The concreteness score for each statement was then computed as the average of the concreteness scores for all
the content words in that statement. As for verifiable details, these were estimated by the frequency of unique named entities. Named entities were extracted via the NER technique using the Python library spaCy
through the Transformer algorithm for English language (en_core_web_trf, https://spacy.io/models/en#en_
core_web_trf).
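A minimal sketch of these two steps is shown below (the norms file name, its column names, and the grouping of entity labels into the 'People', temporal, and spatial categories are assumptions made for illustration; they are not taken from the original code):

```python
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_trf")  # spaCy transformer pipeline, used here for NER

# Brysbaert et al. concreteness norms: word -> mean concreteness rating (file/columns assumed).
norms = pd.read_csv("concreteness_brysbaert.csv")
concreteness = dict(zip(norms["Word"], norms["Conc.M"]))

def concreteness_score(text: str) -> float:
    """Average concreteness of lemmatized content words that match the annotated norms."""
    doc = nlp(text.lower())
    scores = [concreteness[tok.lemma_] for tok in doc
              if tok.is_alpha and not tok.is_stop and tok.lemma_ in concreteness]
    return sum(scores) / len(scores) if scores else float("nan")

def unique_entities(text: str, labels: set[str]) -> int:
    """Count unique named entities whose label falls into a given group (e.g., {'PERSON'})."""
    return len({ent.text for ent in nlp(text).ents if ent.label_ in labels})

example = "We met John at Red Rocks near Denver at 2:30 PM."
print(concreteness_score(example),
      unique_entities(example, {"PERSON"}),             # 'People'
      unique_entities(example, {"TIME", "DATE"}),       # 'Temporal details'
      unique_entities(example, {"GPE", "LOC", "FAC"}))  # 'Spatial details'
```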
Further details on how the 26 linguistic features were computed are provided in the Supplementary
Information.
Experimental set‑up
In this section, we describe the methodology applied in this work. As a first step, we performed a descriptive linguistic analysis of our datasets to provide a response to Hypothesis 5a), i.e., whether the linguistic style distinguishing truthful from deceptive statements varies across different contexts. To achieve this, we employed the DeCLaRatiVE stylometric analysis. As a second step, we tested the capacity of the FLAN-T5 model to be fine-tuned on a lie-detection task. To do so, we designed three Scenarios to verify the following hypotheses:
• Hypothesis 1a): Fine-tuning an LLM can effectively classify the veracity of short narratives from raw texts, 1b) outperforming classical machine learning and deep learning approaches in verbal lie detection;
• Hypothesis 2): Fine-tuning an LLM on deceptive narratives from two contexts enables the model to detect deception in a new, unseen context;
• Hypothesis 3): Fine-tuning an LLM on truthful and deceptive narratives drawn from multiple contexts enables the model to learn and generalize deception detection across those contexts;
• Hypothesis 4): Model performance depends on model size, with larger models showing higher accuracy;
We expected Hypotheses 1a, 1b, 3, and 4 to be verified, while we did not have any a priori expectation for the second hypothesis. The Scenarios are described below (a minimal sketch of how the corresponding dataset combinations can be assembled follows the list):
1. Scenario 1: The model was fine-tuned and tested on a single dataset. This procedure was repeated for each
dataset with a different copy of the same model each time (i.e., the same parameters before the fine-tuning
process) (Fig. 1). This Scenario assesses the model’s capacity to learn how to detect lies related to the same
context and responds to Hypothesis 1a;
2. Scenario 2: The model was fine-tuned on two out of the three datasets and tested on the remaining unseen
dataset. As for the previous Scenario, this procedure was iterated three times, employing separate instances
of the same model, each time with a distinct combination of dataset pairings (Fig. 2). This Scenario assesses
how the model performs on samples from a new context to which it has never been exposed during the
training phase and provides a response for Hypothesis 2;
3. Scenario 3: We first aggregated the three train and test sets from Scenario 1. Then, we fine-tuned the model
on the aggregated datasets and tested the model on the aggregated test sets (Fig. 1). This Scenario assesses
the capacity of the model to learn and generalize from samples of truthful and deceptive narratives from
multiple contexts and provides a response for Hypothesis 3.
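The scenario-specific dataset combinations can be assembled along the following lines (a minimal sketch with toy data frames; the column and variable names are assumptions):

```python
import pandas as pd

# Toy stand-ins for the three datasets; in practice each holds the statements and their
# truthful/deceptive labels (column names are assumed).
opinion = pd.DataFrame({"text": ["op1", "op2"], "label": ["truthful", "deceptive"]})
memory = pd.DataFrame({"text": ["mem1", "mem2"], "label": ["truthful", "deceptive"]})
intention = pd.DataFrame({"text": ["int1", "int2"], "label": ["truthful", "deceptive"]})
datasets = {"opinion": opinion, "memory": memory, "intention": intention}

# Scenario 2: fine-tune on two datasets, test on the held-out third (three rotations in total).
for held_out in datasets:
    train = pd.concat([df for name, df in datasets.items() if name != held_out], ignore_index=True)
    test = datasets[held_out]
    print(f"Scenario 2: train on {len(train)} statements, test on '{held_out}' ({len(test)})")

# Scenario 3: aggregate the per-dataset train splits (and, likewise, the test splits).
scenario3_train = pd.concat(datasets.values(), ignore_index=True)
```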
For the Opinion dataset, in which each participant contributed multiple opinions, participants were split between training and test sets, such that any individual who had their opinions assigned to the training set did not have their opinions assigned to the test set, and vice versa.
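A minimal sketch of such a subject-level split, using scikit-learn's grouped cross-validation as a stand-in for the exact splitting routine (the column names and toy data are assumptions):

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

# One row per statement; 'subject_id' identifies the participant (column names are illustrative).
df = pd.DataFrame({
    "text": ["statement A", "statement B", "statement C", "statement D"],
    "label": ["truthful", "deceptive", "truthful", "deceptive"],
    "subject_id": [1, 1, 2, 2],
})

gkf = GroupKFold(n_splits=2)  # the study used 10 folds; 2 only because the toy data is tiny
for fold, (train_idx, test_idx) in enumerate(gkf.split(df, groups=df["subject_id"])):
    train, test = df.iloc[train_idx], df.iloc[test_idx]
    # No participant contributes statements to both splits.
    assert set(train["subject_id"]).isdisjoint(set(test["subject_id"]))
    print(f"fold {fold}: train subjects {sorted(set(train['subject_id']))}, "
          f"test subjects {sorted(set(test['subject_id']))}")
```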
Together, Scenarios 2 and 3 provide evidence about the generalized capabilities of the fine-tuned FLAN-T5
model in a lie-detection task when tested on unseen data and on a multi-domain dataset. Furthermore, we tested
whether model performance may depend on model size. Therefore, we first fine-tuned the small-sized version
of FLAN-T5 in every scenario, and then we repeated the same experiments in every scenario with the base-sized
version, providing a response for Hypothesis 4.
To test Hypothesis 1b, i.e., to test the advantage of our approach when compared to classical machine learning
models, we decided to compare the results with two benchmarks:
1. A basic approach consisting of a bag-of-words (BoW) encoder plus a logistic regression classifier62 (following the experimental procedure of Scenario 1; a minimal sketch of this baseline follows the list below);
2. A literature baseline based on previous studies providing accuracy metrics on the same datasets using a machine learning or a deep learning approach49–51. For the Opinion dataset (characterized by opinions on five different topics per subject), we compared our results to the performance obtained in51 for their "within-topic" experiments, because our approach is equivalent to theirs, with the only difference that we addressed all the topics in one model.
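A minimal sketch of the first benchmark (bag-of-words encoding plus logistic regression, evaluated with cross-validation as in Scenario 1; the vectorizer settings, label encoding, and toy data are assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy data; in the study each dataset was evaluated separately with 10-fold cross-validation.
texts = ["We go to a Waterbabies class every week.",
         "I will be taking my 8-year-old daughter swimming this Saturday.",
         "The day started perfectly, with a great drive up to Denver.",
         "Concerts are my most favorite thing."]
labels = [1, 0, 1, 0]  # 1 = truthful, 0 = deceptive (encoding is an assumption)

# Bag-of-words encoder followed by a logistic regression classifier.
baseline = make_pipeline(CountVectorizer(lowercase=True), LogisticRegression(max_iter=1000))
scores = cross_val_score(baseline, texts, labels, cv=2, scoring="accuracy")  # cv=10 in the study
print(scores.mean())
```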
As a final step, we conducted an explainability analysis to investigate the differences in linguistic style between
the truthful and deceptive statements that were correctly classified and misclassified by the model. This procedure
aimed to provide a response to Hypothesis 5b, i.e., whether the model takes into account the linguistic style of
statements for its final predictions. To achieve this result, we employed the DeCLaRatiVE stylometric analysis.
In Fig. 3, we provided a flow chart of the whole experimental set-up.
Fine‑tuning strategy
Fine-tuning of LLMs consists of adapting a pre-trained language model to a specific task by further training the
model on task-specific data, thereby enhancing its ability to generate contextually relevant and coherent text in
line with the desired task objectives57. We fine-tuned FLAN-T5 in its small and base sizes using the three datasets
and following the experimental set-up described above. We approached the lie-detection task as a binary clas-
sification problem, given that the three datasets comprised raw texts associated with a binary label, specifically
instances classified as truthful or deceptive.
To the best of our knowledge, no fine-tuning strategy is available in the literature for this novel downstream
NLP task. Therefore, our strategy followed an adaptation of Hugging Face's guidelines on fine-tuning an LLM
for translation. Specifically, we chose the same optimization strategy used to pre-train the original model and
the same loss function.
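A minimal sketch of this strategy, adapted from the Hugging Face sequence-to-sequence (translation) fine-tuning recipe, is shown below. The prompt wording, the toy data, and the placeholder hyperparameters are assumptions; the Trainer defaults simply stand in for the optimization strategy and loss mentioned above (see Table 4 for the values actually used).

```python
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

model_name = "google/flan-t5-small"  # "google/flan-t5-base" for the larger version
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Toy training data; the real datasets hold one statement per row with a truthful/deceptive label.
train = Dataset.from_dict({
    "text": ["We go to a Waterbabies class every week.",
             "I will be taking my 8-year-old daughter swimming this Saturday."],
    "label": ["truthful", "deceptive"],
})

def to_text_to_text(batch):
    # Inputs are the raw statements (here with an assumed instruction prefix); targets are the
    # label strings themselves, exactly as in any other text-to-text task.
    enc = tokenizer(["Is the following statement truthful or deceptive? " + t for t in batch["text"]],
                    max_length=512, truncation=True)
    enc["labels"] = tokenizer(text_target=batch["label"], max_length=4, truncation=True)["input_ids"]
    return enc

tokenized_train = train.map(to_text_to_text, batched=True, remove_columns=train.column_names)

# Placeholder arguments; optimizer and cross-entropy loss are the library defaults here.
args = Seq2SeqTrainingArguments(output_dir="flan-t5-lie-detection",
                                per_device_train_batch_size=2, num_train_epochs=3)
trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=tokenized_train,
                         data_collator=DataCollatorForSeq2Seq(tokenizer, model=model))
trainer.train()
```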
Notably, the classification of deceptive versus truthful statements was never performed during the FLAN-T5 pre-training phase, nor is it included in any of the tasks the model was pre-trained on. Therefore, we performed the same experiments, described in the Experimental set-up section, multiple times with different learning rate values (i.e., 1e−3, 1e−4, 1e−5), and we finally chose the configuration shown in Table 4, which yielded the best performance in terms of accuracy. All experiments and runs of the three Scenarios were conducted on Google Colaboratory Pro+ using an NVIDIA A100 Tensor Core GPU.
Figure 3. Visual illustration of the whole experimental set-up. The Opinion, Memory, and Intention dataset
underwent Descriptive Linguistic Analysis using DeCLaRatiVE stylometry. A baseline model consisting of
Bag of Words (BoW) and Logistic Regression (Scenario 1) was also established for the three datasets. Then, the
FLAN-T5 model in small and base versions was fine-tuned across Scenarios 1, 2, and 3. Finally, an Explainability
Analysis was conducted on the top-performing model using DeCLaRatiVE stylometry to interpret the results.
Table 4. FLAN-T5 hyperparameters configuration for the small- and base-sized version. The initial learning
rate for every scenario was 5e−4 for the small model and 5e−5 for the base model. This choice was motivated
by preliminary experiment results, with the smaller model, but not the base model, generally performing better
with higher learning rates. The weight decay coefficient was set to 0.01 in all models and Scenarios. The batch
size was set to 2 for computational reasons, specifically to avoid running out of available memory, even though
it is known that a larger batch size usually leads to better performance. Finally, the number of epochs was set to
3 after preliminary experiments showing the maximum test accuracy after the third epoch without overfitting.
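For reference, the configuration in Table 4 can be expressed as Hugging Face training arguments as follows (argument names are the library's; values are those reported in the caption; this object would replace the placeholder arguments in the fine-tuning sketch above):

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-lie-detection",
    learning_rate=5e-5,               # 5e-4 for FLAN-T5 small, 5e-5 for FLAN-T5 base
    weight_decay=0.01,                # same weight decay in all models and Scenarios
    per_device_train_batch_size=2,    # batch size 2 to stay within available GPU memory
    per_device_eval_batch_size=2,
    num_train_epochs=3,               # test accuracy peaked after the third epoch
)
```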
For the Memory and Intention datasets, we computed a permutation t-test (n = 10,000) for independent samples on each of the 26 linguistic features to outline significant differences between the truthful and deceptive texts. For the Opinion dataset, our analysis proceeded as follows. First, we applied the DeCLaRatiVE stylometry technique to all the subjects' opinions. This resulted in a 2500 (opinions) × 26 (linguistic features) matrix.
Then, since each subject provided five opinions (half truthful and half deceptive), we averaged the stylistic vec-
tor separately for the truthful and deceptive sets of opinions. This procedure allowed us to obtain two different
averaged stylistic vectors for the same subject, one for the truthful opinions and one for the deceptive opinions.
Importantly, this averaging process enabled us to obtain results that are independent of the topic (e.g., abortion or
cannabis legalization) and the stance taken by the subject (e.g., in favor or against that particular topic). Finally,
we validated the statistical significance of these differences by conducting a paired sample permutation test
(n = 10,000). Results for each dataset were corrected for multiple comparisons with Holm-Bonferroni correction.
The effect size was expressed by the Common Language Effect Size (CLES) with a 95% confidence interval (95% CI), a measure of effect size meant to be more intuitive to interpret: it provides the probability that a specific linguistic feature, in a truthful statement picked at random, will have a higher score than in a deceptive statement picked at random64. The null value for the CLES is the chance level of 0.5 (in a probability range from 0 to 1), indicating that, when sampled, either group is equally likely to be greater than the other. Cohen's d effect size with 95% CI was also computed to aid interpretation.
For the explainability analysis, we compared:
a. Truthful statements misclassified as deceptive (False Negatives) vs. deceptive statements misclassified as truthful (False Positives);
b. Statements correctly classified as deceptive (True Negatives) vs. truthful statements misclassified as deceptive (False Negatives);
c. Statements correctly classified as truthful (True Positives) vs. deceptive statements misclassified as truthful (False Positives);
d. Truthful vs. deceptive statements correctly classified by the model (True Positives vs. True Negatives).
To compute the effect size, we computed the average of the CLES and Cohen’s d effect size scores with their
respective 95% CI obtained from each fold.
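A minimal sketch of these statistics for a single linguistic feature is shown below; SciPy's permutation_test with a mean-difference statistic is used as a stand-in for the exact permutation t-test routine, and the feature values are simulated:

```python
import numpy as np
from scipy.stats import permutation_test

rng = np.random.default_rng(0)
truthful = rng.normal(0.60, 0.10, 50)   # simulated feature values (e.g., concreteness) for truthful texts
deceptive = rng.normal(0.50, 0.10, 50)  # simulated feature values for deceptive texts

def mean_diff(x, y, axis):
    return np.mean(x, axis=axis) - np.mean(y, axis=axis)

# Independent-samples permutation test, n = 10,000 resamples, two-sided.
res = permutation_test((truthful, deceptive), mean_diff, permutation_type="independent",
                       n_resamples=10_000, alternative="two-sided")

# Common Language Effect Size: probability that a randomly drawn truthful value
# exceeds a randomly drawn deceptive value (0.5 = chance level).
cles = np.mean(truthful[:, None] > deceptive[None, :])

# Cohen's d with pooled standard deviation, for comparison.
pooled_sd = np.sqrt((truthful.var(ddof=1) + deceptive.var(ddof=1)) / 2)
d = (truthful.mean() - deceptive.mean()) / pooled_sd

print(f"p = {res.pvalue:.4f}, CLES = {cles:.2f}, d = {d:.2f}")
```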
Results
Descriptive linguistic analysis
This section outlines the results of the DeCLaRatiVE stylometric analysis comparing the three datasets on their linguistic features.
Figure 4. Horizontal stacked bar chart presenting the Common Language Effect Size (CLES) estimates for the significant linguistic features that survived post-hoc corrections in the Opinion dataset. The CLES estimates represent the probability (ranging from 0 to 1) of a specific linguistic feature scoring higher in truthful opinions (sky blue) than in deceptive ones (salmon). The CLES for truthful opinions are sorted in descending order, while the CLES for deceptive opinions are sorted in ascending order.
Figure 5. Horizontal stacked bar chart presenting the Common Language Effect Size (CLES) estimates for the significant linguistic features that survived post-hoc corrections in the Memory dataset. The CLES estimates represent the probability (ranging from 0 to 1) of a specific linguistic feature scoring higher in truthful memories (sky blue) than in deceptive ones (salmon). The CLES for truthful memories are sorted in descending order, while the CLES for deceptive memories are sorted in ascending order.
For the three datasets, Figs. 4, 5, and 6 show the differences in the number, the type, the magnitude of the CLES effect size, and the direction of the effect for the linguistic features that survived post-hoc corrections. As an example of these differences, the concreteness score of words ('concr_score') presented the largest CLES towards the truthful statements within the Intention dataset (Fig. 6), while in the Opinion dataset it showed the largest CLES towards the deceptive statements (Fig. 4). Overall, the Intention dataset displayed fewer significant differences in linguistic features between truthful and deceptive statements than the Opinion and Memory datasets. In Table S5 (Supplementary Information), we report, for all the linguistic features and the three datasets, all the statistics, the corrected p-values, the effect-size scores expressed by CLES and Cohen's d with 95% CI, and the direction of the effect.
Scenario 1
Table 6 reports the test accuracies of the FLAN-T5 model in Scenario 1, categorized by dataset and model size. In each case, the base model, on average, outperformed the small model, with the Memory dataset showing the largest improvement of 4% and the Intention dataset showing just a 0.06% increase in average accuracy. These results indicate that the larger model size generally leads to improved performance across the three datasets, with higher accuracy observed in the base version.
Scenario 2
This scenario aimed to investigate our fine-tuned LLM's generalization capability across different deception domains. As presented in Table 5, the test accuracy for the three experiments in this scenario significantly dropped to the chance level, showing that in no case was the model able to learn a general rule for detecting lies coming from a different, unseen context.
Scenario 3
In Scenario 3, we tested the accuracy of the FLAN-T5 small and base version on the aggregated Opinion,
Memory, and Intention datasets. The small-sized FLAN-T5 achieved an average test accuracy of 75.45% (st.
dev. ± 1.6), while the base-sized FLAN-T5 exhibited a higher average test accuracy of 79.31% (st. dev. ± 1.3).
In other words, the base-sized model outperformed the small model by approximately four percentage points.
Results in Table 6 show the disaggregated performance on individual datasets between the small and base
FLAN-T5 models in Scenario 3, with a comparison to their counterparts in Scenario 1. These comparisons show
that FLAN-T5-small in Scenario 3 exhibited worse performance than in Scenario 1. In contrast, in Scenario 3, the
Figure 6. Horizontal stacked bar chart presenting the Common Language Effect Size (CLES) estimates for the significant linguistic features that survived post-hoc corrections in the Intention dataset. The CLES estimates represent the probability (ranging from 0 to 1) of a specific linguistic feature scoring higher in truthful intentions (sky blue) than in deceptive ones (salmon). The CLES for truthful intentions are sorted in descending order, while the CLES for deceptive intentions are sorted in ascending order.
Table 5. Test accuracy of the FLAN-T5 models in Scenario 2 (three combinations of training sets). The performance comparison is between the small and base versions of the FLAN-T5 model in the three combinations of training sets: opinion + memory, opinion + intention, memory + intention.
Table 6. Test accuracy of the FLAN-T5 models in Scenarios 1 and 3 for the three datasets. Reported values are means ± standard deviation over the 10 folds. Best results per evaluation metric are in bold. The literature baseline for the Opinion dataset refers to the average accuracy and standard deviation over all within-topic accuracies from FastText Embedding + Transformer51. The literature baseline for the Intention dataset refers to the accuracy from a Vanilla Random Forest using LIWC features (confidence interval in square brackets)49, and to the averaged accuracy and standard deviation from the RoBERTa + Transformers + Co-Attention and BERT + Co-Attention models50, respectively.
base model barely outperformed its counterparts of Scenario 1 on the Opinion and Intention datasets by less
than 1% and slightly underperformed its counterpart of Scenario 1 on the Memory dataset.
We identified the top-performing model as the FLAN-T5 base in Scenario 3 because of its higher overall accuracy. The averaged confusion matrix over the 10 folds for this model is depicted in Fig. 7. Notably, in every case, we were able to outperform both the bag-of-words + logistic regression classifier baseline and the performance achieved on the same datasets in previous studies49–51.
Explainability analysis
This section aims to gain a deeper understanding of the top-performing model identified in Scenario 3 (FLAN-
T5 base) through a DeCLaRatiVE stylometric analysis of statements correctly classified and misclassified by the
model. The purpose of this analysis was to examine whether the linguistic style of the input statements exerted an
influence on the resulting output of the model and to provide explanations for the wrong classification outputs.
For this analysis, we compared:
a. Truthful statements misclassified as deceptive (False Negatives) vs. deceptive statements misclassified as truthful (False Positives);
b. Statements correctly classified as deceptive (True Negatives) vs. truthful statements misclassified as deceptive (False Negatives);
c. Statements correctly classified as truthful (True Positives) vs. deceptive statements misclassified as truthful (False Positives);
d. Truthful vs. deceptive statements correctly classified by the model (True Positives vs. True Negatives).
The statistically significant features reported survived post-hoc correction for multiple comparisons in each fold. Overall, for comparisons a), b), and c), we observed no statistically significant differences (at the p < 0.05 level) in any linguistic feature for most of the splits, with the only exceptions being:
1) 'fk_read' in fold 1 (t = 5.30; p = 0.04, CLES = 0.63 [0.55, 0.71], d = 0.46 [0.18, 0.75]) and 'Reality Monitoring' in fold 6 (t = 4.74; p = 0.047, CLES = 0.62 [0.54, 0.70], d = 0.46 [0.17, 0.75]) for the a) comparison;
2) 'Reality Monitoring' in fold 6 (t = −3.39, p = 0.04, CLES = 0.40 [0.34, 0.46], d = −0.34 [−0.55, −0.13]), and 'Reality Monitoring' (t = −3.16, p = 0.04, CLES = 0.41 [0.34, 0.47], d = −0.34 [−0.56, −0.12]) and 'Contextual Embedding' (t = −2.11; p = 0.01, CLES = 0.39 [0.33, 0.45], d = −0.42 [−0.63, −0.2]) in fold 7, for the b) comparison;
Figure 7. Averaged confusion matrix of the top-performing model identified as FLAN-T5 base in Scenario
3. In each square, the results obtained represent the average (and standard deviation) from the test set of each
iteration of the 10-fold cross-validation.
3) 'num_syllables' (t = 76.87, p = 0.01, CLES = 0.64 [0.57, 0.7], d = 0.46 [0.27, 0.7]) and 'word_counts' (t = 59.63, p = 0.01, CLES = 0.64 [0.57, 0.71], d = 0.46 [0.21, 0.7]) in fold 9, for the c) comparison.
Conversely, for the d) comparison, several significant features emerged in all the folds and survived correc-
tions for multiple comparisons. Figure 8 depicts the CLES effect size scores of linguistic features, sorted accord-
ing to the number of times they were found to be significant among the ten folds. The top six features in Fig. 8
represented a cluster of linguistic features related to the Cognitive Load framework.
Discussion
In the present research, we investigated the efficacy of a Large Language Model, specifically FLAN-T5 in its small and base versions, in learning and generalizing the intrinsic linguistic representation of deception across different contexts. To accomplish this, we employed three datasets encompassing genuine or fabricated statements
regarding personal opinions, autobiographical experiences, and future intentions.
Opinions
After analyzing truthful and deceptive opinions using DeCLaRatiVE stylometry, several linguistic features related to the theoretical frameworks of CL, RM, and Distancing were found to be significant.
Figure 8. Linguistic features in Truthful and Deceptive statements that were accurately classified by FLAN-T5
base in Scenario 3. The bar plot shows the averaged Common Language Effect Size among the ten folds of
linguistic features that survived post-hoc corrections. Linguistic features are sorted in descending order
according to the number of times they were found to be significant among the 10 folds (displayed at the side of
each bar). Linguistic features higher on average in truthful texts are shown in sky blue, while those higher on
average in deceptive texts are shown in salmon.
In line with the CL framework, we observed that truthful opinions were characterized by greater complexity, verbosity, and a more authentic linguistic style14,31.
For features related to the RM framework, truthful opinions were characterized by fewer concrete words and a greater number of cognitive words, as also previously shown55; in contrast, deceptive opinions showed higher scores in word concreteness, contextual details, and reality monitoring. These differences may reflect, on the one hand, the reasoning processes that truth-tellers engage in when evaluating the pros and cons of abstract and controversial concepts (e.g., abortion); for deceivers, on the other hand, they may be indicative of difficulty in abstraction, resulting in faked opinions that sound more grounded in reality.
Finally, in line with previous literature on distancing framework29,65 and deceptive opinions20,55, deceivers
utilized more other-related word classes (‘Other-reference’) and fewer self-related words (‘Self-reference’), con-
firming that individuals may tend to avoid personal involvement when expressing deceptive statements.
Memories
Following the analysis of truthful and deceptive narratives of autobiographical memories through DeCLaRatiVE
stylometry, various linguistic features associated with the theoretical frameworks of CL, RM, VA, and Distancing
were found to be significant.
As with opinions, and in accordance with the CL framework, truthful narratives of autobiographical memories exhibited
higher levels of complexity and verbosity and appeared to be more analytical in style14,31.
In accordance with the RM framework32–37, which posits that truthful memory accounts tend to reflect the perceptual processes involved in experiencing the event, whereas fabricated accounts are constructed through cognitive operations, we found genuine memories exhibiting higher scores in memory-related words and in the number of words associated with spatial and temporal information ('Contextual Embedding'), as well as an overall higher RM score. Conversely, we found deceptive memories showing higher scores in words related to cognitive processes (e.g., reasoning, insight, causation). Furthermore, in line with Kleinberg's truthful concreteness hypothesis39, truthful memories were overall characterized by words with higher concreteness scores.
In line with the VA, truthful memories contained more verifiable details, as indicated by the greater number of named entities about times and locations23,48. Notably, we found this effect even though participants lied in a low-stakes scenario. However, deceptive memories were unexpectedly characterized by a higher number of self-references and named entities of 'People'. This result is in contrast with previous literature on the distancing
framework14,29. One possible explanation of this significant but small effect is that liars may try to increase their
credibility by fostering a sense of social connection.
Intentions
Upon examining truthful and deceptive statements of future intentions through DeCLaRatiVE stylometry, sev-
eral linguistic features were found to be significant. Our findings are consistent with previous research claiming
that genuine intentions contain more ‘how-utterances’, i.e., indicators of careful planning and concrete descrip-
tions of activities. In contrast, false intentions are characterized by ‘why-utterances’, i.e., explanations and reasons
for why someone planned an activity or for doing something in a certain way48. Indeed, we found true intentions
were more likely to provide concrete and distinct information about the intended action, grounding their state-
ments in real-world experiences and providing temporal and spatial references. Additionally, true intentions
were characterized by a more analytical style and a greater presence of numerical entities. In contrast, false
intentions exhibited a higher number of cognitive words and expressions and were temporally oriented toward
the present and past.
Furthermore, we found evidence in line with the claim that liars may over-prepare their statements48, as indicated by higher verbosity. Finally, in contrast with the distancing framework14,29, we found a significantly
higher proportion of self-references and mentions of people in deceptive statements. However, the effect size for
this finding was small. As for deceptive memories, one possible interpretation is that liars may attempt to appear
more credible by creating a sense of social connection.
models can already achieve good performance, leaving a smaller margin for improvement. Nonetheless, the
difference between our approach and the baselines is not negligible. In the Opinion dataset, we outperformed the literature baseline of a Transformer model trained from scratch by 17% in accuracy and surpassed our logistic regression baseline by six percentage points. For the Intention dataset, our approach showed a 5-percentage
point improvement over the logistic regression baseline and around 1–2% improvement over the best literature
baseline. Notably, the best literature baseline for the Intention dataset (averaged accuracy: 70.61 ± 2.58%) used a
similar approach to ours in terms of the type of model used, involving a Transformer-based model (BERT + Co-
attention), which may explain the narrower performance gap.
Besides the differences in performance, the main advantage of our approach is its simplicity and flexibility compared to the approaches used in previous studies49–51. Fine-tuning an LLM leverages an existing encoding of language that effortlessly handles any type of statement, unlike logistic regression based on BoW or training a new Transformer-based model from scratch. Taking all these aspects together, fine-tuning LLMs proved more advantageous in terms of feasibility, flexibility, and performance accuracy.
Explainability analysis
To improve the explainability of the performance collected, we investigated whether the linguistic style that
characterizes truthful and deceptive narratives could have a role in the model’s final predictions (Hypothesis
5b). For this aim, we applied a DeCLaRatiVE stylometric analysis on statements that were correctly classified
and misclassified by the top-performing model identified in Scenario 3 (i.e., FLAN-T5 base).
In the misclassified sample, truthful and deceptive statements did not differ significantly on any linguistic feature extracted with the DeCLaRatiVE stylometry technique. The only exceptions were fold 1, which showed significant differences in the text's readability score, and fold 6, which showed significant differences in 'Reality Monitoring' scores. No significant differences in linguistic features were detected in any fold between deceptive statements that were correctly classified as deceptive (True Negatives) and truthful statements that were misclassified as deceptive (False Negatives), with the exception of 'Reality Monitoring' in folds 6 and 7 and the 'Contextual Embedding' score in fold 7. Finally, truthful statements that were correctly classified as truthful (True Positives) and deceptive statements that were misclassified as truthful (False Positives) exhibited no significant differences, except for the number of syllables and the number of words in fold 9. We argue that the observation of significant differences in selected linguistic features only in specific folds suggests that these findings may not be generalizable and are likely influenced by the particular fold under analysis. Taken together, most of the analyzed folds showed a substantial overlap in linguistic style. Consequently, the model might have exhibited poor classification performance for those statements because, while deceptive, they showed a linguistic style resembling truthful statements, and vice versa.
In contrast, correctly classified statements displayed several significant differences between truthful and deceptive statements. Notably, the top six linguistic features in Fig. 8 reached statistical significance in at least 6 out of 10 folds. The fact that we found a consistent pattern of linguistic features in correctly classified statements but not in misclassified statements provides evidence for our hypothesis, suggesting that the linguistic style of statements does play a role in the model's final predictions. More specifically, the top six linguistic features depicted in Fig. 8 represent a cluster of linguistic cues associated with the CL framework31, specifically low-level features related to the length, complexity, and analytical style of the texts that may have enabled the distinction between truthful and deceptive statements. The fact that linguistic cues of CL survived among the several features available (in a mixed dataset of utterances reflecting opinions, memories, and intentions) raises the question of whether CL cues may be more generalizable than other cues that are, in contrast, more specific to a particular type of deception.
Finally, with our experiments, we introduced the DeCLaRatiVE stylometry technique, a new theory-based stylometric approach to investigate deception in texts from four psychological frameworks (Distancing, Cognitive Load, Reality Monitoring, and the Verifiability Approach). We employed the DeCLaRatiVE stylometry technique to compare the three datasets on linguistic features, and we found that fabricated statements from different contexts exhibit different linguistic cues of deception. We also employed the DeCLaRatiVE stylometry technique to conduct an explainability analysis and investigate whether the linguistic style in which truthful or deceptive narratives are delivered is a feature that the model takes into account for its final prediction. To this end, we compared statements correctly classified and misclassified by the top-performing model (FLAN-T5 base in Scenario 3), finding that correctly classified statements share linguistic features related to cognitive load theory. In contrast, misclassified truthful and deceptive statements do not present significant differences in linguistic style.
Given the results achieved, we highlight the importance of a diversified dataset for achieving good, generalized performance. We also consider the balance between the diversity of the dataset and the size of the LLM to be crucial, suggesting that the more diverse the dataset is, the bigger the model required to achieve higher accuracy. The main advantage of our approach consists of its applicability to raw text without the need for extensive training or handcrafted features.
Despite the demonstrated success of our model, three significant limitations impact the ecological validity
of our findings and their practical application in real-life scenarios.
The first notable limitation pertains to the narrow focus of our study, which concentrated solely on lie detection within three specific contexts: personal opinions, autobiographical memories, and future intentions. This restricted scope limits the possibility of accurately classifying deceptive texts within different domains. A second
limitation is that we exclusively considered datasets developed in experimental set-ups designed to collect genu-
ine and completely fabricated narratives. However, individuals frequently employ embedded lies in real-life
scenarios, in which substantial portions of their narratives are true, rather than fabricating an entirely fictitious
story. Finally, the datasets employed in this study were collected in experimental low-stake scenarios where
participants had low incentives to lie and appear credible. Because of all the above issues, the application of our
model in real-life contexts may be limited, and caution is advised when interpreting the results in such situations.
The limitations addressed in this study underscore the need for future research to expand the applicability and
generalizability of lie-detection models for real-life settings. Future works may explore the inclusion of new data-
sets, trying different LLMs (e.g., the most recent GPT-4), different sizes (e.g., FLAN-T5 XXL version), and dif-
ferent fine-tuning strategies to investigate the variance in performance within a lie-detection task. Furthermore,
our fine-tuning approach completely erased the previous capabilities possessed by the model; therefore, future
works should also focus on new fine-tuning strategies that do not compromise the model’s original capabilities.
Data availability
For the Opinion dataset, we obtained full access after contacting the corresponding author. The Memory dataset
is downloadable at the link: https://msropendata.com/datasets/0a83fb6f-a759-4a17-aaa2-fbac84577318. The
intention dataset is publicly available at the link: https://osf.io/45z7e/.
Code availability
All the Colab Notebooks to perform the linguistic analysis on the three datasets, fine-tune the model in the three Scenarios, and conduct the explainability analysis are available at https://github.com/robecoder/VerbalLieDetectionWithLLM.git.
References
1. Walczyk, J. J., Harris, L. L., Duck, T. K. & Mulay, D. A social-cognitive framework for understanding serious lies: Activation-
decision-construction-action theory. New Ideas Psychol. 34, 22–36. https://doi.org/10.1016/j.newideapsych.2014.03.001 (2014).
2. Amado, B. G., Arce, R. & Fariña, F. Undeutsch hypothesis and criteria based content analysis: A meta-analytic review. Eur J Psychol
Appl Legal Context 7, 3–12. https://doi.org/10.1016/j.ejpal.2014.11.002 (2015).
3. Vrij, A. et al. Verbal lie detection: Its past, present and future. Brain Sciences 12, 1644. https://doi.org/10.3390/brainsci12121644
(2022).
4. Vrij, A. & Fisher, R. P. Which lie detection tools are ready for use in the criminal justice system?. J. Appl. Res. Mem. Cognit. 5,
302–307. https://doi.org/10.1016/j.jarmac.2016.06.014 (2016).
5. DePaulo, B. M. et al. Cues to deception. Psychol. Bull. 129, 74–118. https://doi.org/10.1037/0033-2909.129.1.74 (2003).
6. Bond, C. F. Jr. & DePaulo, B. M. Accuracy of deception judgments. Personal. Soc. Psychol. Rev. 10, 214–234. https://doi.org/10.
1207/s15327957pspr1003_2 (2006).
7. Levine, T. R., Park, H. S. & McCornack, S. A. Accuracy in detecting truths and lies: Documenting the “veracity effect”. Commun.
Monogr. 66, 125–144. https://doi.org/10.1080/03637759909376468 (1999).
8. Levine, T. R. Truth-default theory (TDT). J. Lang. Soc. Psychol. 33, 378–392. https://doi.org/10.1177/0261927x14535916 (2014).
9. Street, C. N. H. & Masip, J. The source of the truth bias: Heuristic processing?. Scand. J. Psychol. 56, 254–263. https://doi.org/10.
1111/sjop.12204 (2015).
10. Verschuere, B., et al. The use-the-best heuristic facilitates deception detection. Nat. Hum. Behav. 7, 718–728. https://doi.org/10.
1038/s41562-023-01556-2 (2023)
11. Chen, X., Hao, P., Chandramouli, R., and Subbalakshmi, K. P. Authorship similarity detection from email messages. In International
Workshop On Machine Learning and Data Mining In Pattern Recognition. Editor P. Perner (New York, NY: Springer), 375–386.
https://doi.org/10.1007/978-3-642-23199-5_28 (2011).
12. Chen, H. Dark web: Exploring and mining the dark side of the web. In 2011 European Intelligence and Security Informatics Confer-
ence, 1–2. IEEE (2011).
13. Daelemans, W. Explanation in computational stylometry. In Computational Linguistics and Intelligent Text Processing, 451–462.
Springer, Berlin. https://doi.org/10.1007/978-3-642-37256-8_37 (2013).
14. Hauch, V., Blandón-Gitlin, I., Masip, J. & Sporer, S. L. Are computers effective lie detectors? A meta-analysis of linguistic cues to
deception. Personal. Soc. Psychol. Rev. 19, 307–342. https://doi.org/10.1177/1088868314556539 (2015).
15. Tomas, F., Dodier, O., & Demarchi, S. Computational measures of deceptive language: Prospects and issues. Front. Commun. 7
https://doi.org/10.3389/fcomm.2022.792378 (2022).
16. Conroy, N. K., Rubin, V. L. & Chen, Y. Automatic deception detection: Methods for finding fake news. Proc. Assoc. Inf. Sci. Technol.
52, 1–4. https://doi.org/10.1002/pra2.2015.145052010082 (2015).
17. Pérez-Rosas, V., Kleinberg, B., Lefevre, A., & Mihalcea, R. Automatic detection of fake news. arXiv preprint arXiv:1708.07104
(2017).
18. Fornaciari, T. & Poesio, M. Automatic deception detection in Italian court cases. Artif. Intell. Law 21, 303–340. https://doi.org/10.
1007/s10506-013-9140-4 (2013).
19. Yancheva, M., & Rudzicz, F. Automatic detection of deception in child-produced speech using syntactic complexity features. In
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics 1, 944–953, (2013).
20. Pérez-Rosas, V., & Mihalcea, R. Experiments in open domain deception detection. In Proceedings of the 2015 Conference on Empiri-
cal Methods in Natural Language Processing. https://doi.org/10.18653/v1/d15-1133 (2015).
21. Ott, M., Choi, Y., Cardie, C., & Hancock, J. T. Finding deceptive opinion spam by any stretch of the imagination. arXiv preprint
arXiv:1107.4557 (2011).
22. Fornaciari, T., & Poesio, M. Identifying fake Amazon reviews as learning from crowds. In Proceedings of the 14th Conference of the
European Chapter of the Association for Computational Linguistics. https://doi.org/10.3115/v1/e14-1030n (2014).
23. Kleinberg, B., Mozes, M., Arntz, A. & Verschuere, B. Using named entities for computer-automated verbal deception detection.
J. Forensic Sci. 63, 714–723. https://doi.org/10.1111/1556-4029.13645 (2017).
24. Mbaziira, A. V., & Jones, J. H. Hybrid text-based deception models for native and Non-Native English cybercriminal networks. In
Proceedings of the International Conference on Compute and Data Analysis. https://doi.org/10.1145/3093241.3093280 (2017).
25. Levitan, S. I., Maredia, A., & Hirschberg, J. Linguistic cues to deception and perceived deception in interview dialogues. In Pro-
ceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, 1. https://doi.org/10.18653/v1/n18-1176 (2018).
26. Kleinberg, B., Nahari, G., Arntz, A., & Verschuere, B. An investigation on the detectability of deceptive intent about flying through
verbal deception detection. Collabra: Psychol. 3. https://doi.org/10.1525/collabra.80 (2017).
27. Constâncio, A. S., Tsunoda, D. F., Silva, H. de F. N., Silveira, J. M. da, & Carvalho, D. R. Deception detection with machine learning:
A systematic review and statistical analysis. PLOS ONE, 18, e0281323. https://doi.org/10.1371/journal.pone.0281323 (2023).
28. Zhao, W. X., et al. A survey of large language models. arXiv preprint arXiv:2303.18223. (2023).
29. Newman, M. L., Pennebaker, J. W., Berry, D. S. & Richards, J. M. Lying words: Predicting deception from linguistic styles. Personal.
Soc. Psychol. Bull. 29, 665–675. https://doi.org/10.1177/0146167203029005010 (2003).
30. Monaro, M. et al. Covert lie detection using keyboard dynamics. Sci Rep 8, 1976. https://doi.org/10.1038/s41598-018-20462-6
(2018).
31. Vrij, A., Fisher, R. P. & Blank, H. A cognitive approach to lie detection: A meta-analysis. Legal Criminol. Psychol. 22(1), 1–21.
https://doi.org/10.1111/lcrp.12088 (2015).
32. Johnson, M. K. & Raye, C. L. Reality monitoring. Psychol. Rev. 88, 67–85. https://doi.org/10.1037/0033-295x.88.1.67 (1981).
33. Sporer, S. L. The less travelled road to truth: Verbal cues in deception detection in accounts of fabricated and self-experienced
events. Appl. Cognit. Psychol. 11(5), 373–397. https://doi.org/10.1002/(SICI)1099-0720(199710)11:5%3c373::AID-ACP461%3e3.0.CO;2-0 (1997).
34. Sporer, S. L. Reality monitoring and detection of deception in The Detection of Deception in Forensic Contexts (Cambridge University
Press), 64–102. https://doi.org/10.1017/cbo9780511490071.004 (2004).
35. Masip, J., Sporer, S. L., Garrido, E. & Herrero, C. The detection of deception with the reality monitoring approach: A review of the
empirical evidence. Psychol. Crime Law 11(1), 99–122. https://doi.org/10.1080/10683160410001726356 (2005).
36. Amado, B. G., Arce, R., Fariña, F. & Vilariño, M. Criteria-Based Content Analysis (CBCA) reality criteria in adults: A meta-analytic
review. Int. J. Clin. Health Psychol. 16(2), 201–210. https://doi.org/10.1016/j.ijchp.2016.01.002 (2016).
37. Gancedo, Y., Fariña, F., Seijo, D., Vilariño, M. & Arce, R. Reality monitoring: A meta-analytical review for forensic practice. Eur.
J. Psychol. Appl. Legal Context 13(2), 99–110. https://doi.org/10.5093/ejpalc2021a10 (2021).
38. Vrij, A. et al. Verbal lie detection: its past, present and future. Brain Sci. 12(12), 1644. https://doi.org/10.3390/brainsci12121644
(2022).
39. Kleinberg, B., van der Vegt, I., & Arntz, A. Detecting deceptive communication through linguistic concreteness. Center for Open
Science. https://doi.org/10.31234/osf.io/p3qjh (2019).
40. Nahari, G., Vrij, A. & Fisher, R. P. Exploiting liars’ verbal strategies by examining the verifiability of details. Legal Criminol. Psychol.
19, 227–239. https://doi.org/10.1111/j.2044-8333.2012.02069.x (2012).
41. Vrij, A., & Nahari, G. The verifiability approach. In Evidence-Based Investigative Interviewing (pp. 116–133). Routledge. https://
doi.org/10.4324/9781315160276-7 (2019).
42. Pennebaker, J. W., Francis, M. E., & Booth, R. J. Linguistic inquiry and word count: LIWC 2001. Mahway: Lawrence Erlbaum
Associates, 71, 2001 (2001).
43. Boyd, R. L., Ashokkumar, A., Seraj, S., & Pennebaker, J. W. The development and psychometric properties of LIWC-22. Austin,
TX: University of Texas at Austin, 1–47. (2022).
44. Bond, G. D. & Lee, A. Y. Language of lies in prison: Linguistic classification of prisoners’ truthful and deceptive natural language.
Appl. Cognit. Psychol. 19(3), 313–329. https://doi.org/10.1002/acp.1087 (2005).
45. Bond, G. D. et al. ‘Lyin’ Ted’, ‘Crooked Hillary’, and ‘Deceptive Donald’: Language of lies in the 2016 US presidential debates. Appl.
Cognit. Psychol. 31(6), 668–677. https://doi.org/10.1002/acp.3376 (2017).
46. Bond, G. D., Speller, L. F., Cockrell, L. L., Webb, K. G., & Sievers, J. L. ‘Sleepy Joe’ and ‘Donald, king of whoppers’: Reality monitor-
ing and verbal deception in the 2020 U.S. presidential election debates. Psychol. Rep. 003329412211052. https://doi.org/10.1177/
00332941221105212 (2022).
47. Schutte, M., Bogaard, G., Mac Giolla, E., Warmelink, L., Kleinberg, B., & Verschuere, B. Man versus Machine: Comparing manual
with LIWC coding of perceptual and contextual details for verbal lie detection. Center for Open Science. https://doi.org/10.31234/
osf.io/cth58 (2021).
48. Kleinberg, B., van der Toolen, Y., Vrij, A., Arntz, A. & Verschuere, B. Automated verbal credibility assessment of intentions: The
model statement technique and predictive modeling. Appl. Cognit. Psychol. 32, 354–366. https://doi.org/10.1002/acp.3407 (2018).
49. Kleinberg, B., & Verschuere, B. How humans impair automated deception detection performance. Acta Psychol., 213, https://doi.
org/10.1016/j.actpsy.2020.103250 (2021).
50. Ilias, L., Soldner, F., & Kleinberg, B. Explainable verbal deception detection using transformers. arXiv preprint arXiv:2210.03080
(2022).
51. Capuozzo, P., Lauriola, I., Strapparava, C., Aiolli, F., & Sartori, G. DecOp: A multilingual and multi-domain corpus for detecting
deception in typed text. In Proceedings of the 12th Language Resources and Evaluation Conference, 1423–1430 (2020).
52. Sap, M. et al. Quantifying the narrative flow of imagined versus autobiographical stories. Proc. Natl. Acad. Sci. 119(45),
e2211715119. https://doi.org/10.1073/pnas.2211715119 (2022).
53. Hernández-Castañeda, Á., Calvo, H., Gelbukh, A. & Flores, J. J. G. Cross-domain deception detection using support vector net-
works. Soft Comput. 21, 585–595. https://doi.org/10.1007/s00500-016-2409-2 (2016).
54. Pérez-Rosas, V., & Mihalcea, R. Cross-cultural deception detection. In Proceedings of the 52nd Annual Meeting of the Association
for Computational Linguistics 2. https://doi.org/10.3115/v1/p14-2072 (2014).
55. Mihalcea, R., & Strapparava, C. The lie detector: Explorations in the automatic recognition of deceptive language. In Proceedings
of the ACL-IJCNLP 2009 conference short papers 309–312. https://doi.org/10.3115/1667583.1667679 (2009).
56. Ríssola, E. A., Aliannejadi, M., & Crestani, F. Beyond modelling: Understanding mental disorders in online social media. In
Advances in Information Retrieval: 42nd European Conference on IR Research, ECIR 2020, Lisbon, Portugal, April 14–17, 2020,
Proceedings, Part I 42 (pp. 296–310). Springer (2020).
57. Chung, H. W., et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416. (2022).
58. Zhou, L., Burgoon, J. K., Nunamaker, J. F. & Twitchell, D. Automating linguistics-based cues for detecting deception in text-based
asynchronous computer-mediated communications. Group Decis. Negot. 13, 81–106. https://doi.org/10.1023/b:grup.0000011944.
62889.6f (2004).
59. Solà-Sales, S., Alzetta, C., Moret-Tatay, C. & Dell’Orletta, F. Analysing deception in witness memory through linguistic styles in
spontaneous language. Brain Sci. 13, 317. https://doi.org/10.3390/brainsci13020317 (2023).
60. Sarzynska-Wawer, J., Pawlak, A., Szymanowska, J., Hanusz, K. & Wawer, A. Truth or lie: Exploring the language of deception. PLOS
ONE 18, e0281179. https://doi.org/10.1371/journal.pone.0281179 (2023).
61. Brysbaert, M., Warriner, A. B. & Kuperman, V. Concreteness ratings for 40 thousand generally known English word lemmas. Behav
Res 46, 904–911. https://doi.org/10.3758/s13428-013-0403-5 (2014).
62. Lin, Y. C., Chen, S. A., Liu, J. J., & Lin, C. J. Linear Classifier: An Often-Forgotten Baseline for Text Classification. arXiv preprint
arXiv:2306.07111 (2023).
63. Moore, J. H. Bootstrapping, permutation testing and the method of surrogate data. Phys. Med. Biol. 44(6), L11 (1999).
64. McGraw, K. O. & Wong, S. P. A common language effect size statistic. Psychol. Bull. 111, 361. https://doi.org/10.1037/0033-2909.
111.2.361 (1992).
65. Hancock, J. T., Curry, L. E., Goorha, S. & Woodworth, M. On lying and being lied to: A linguistic analysis of deception in computer-
mediated communication. Discourse Process. 45, 1–23. https://doi.org/10.1080/01638530701739181 (2007).
Acknowledgements
We would like to thank Bruno Verschuere and Bennett Kleinberg for sharing the full version of their Intention
dataset with us.
Author contributions
G.S. conceptualized the research. R.L., R.R., P.C., and G.S. designed the research. P.C. shared the updated version
of the Deceptive Opinion Dataset. R.L. performed the descriptive linguistic analysis and explainability analysis.
R.R. developed and implemented the fine-tuning strategy. R.L. and R.R. wrote the paper. P.P. and G.S. supervised
all aspects of the research and provided critical revisions.
Competing interests
The authors declare no competing interests.
Additional information
Supplementary Information The online version contains supplementary material available at https://doi.org/
10.1038/s41598-023-50214-0.
Correspondence and requests for materials should be addressed to R.L.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International
License, which permits use, sharing, adaptation, distribution and reproduction in any medium or
format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the
Creative Commons licence, and indicate if changes were made. The images or other third party material in this
article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the
material. If material is not included in the article’s Creative Commons licence and your intended use is not
permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.