Intent Detection Report
Keywords: User Intent Detection, Chatbots, Virtual Assistants, Natural Language Processing
(NLP), Machine Learning, Tokenization, Bi-directional LSTM.
Chapter 1
Introduction
It is impossible to overestimate the crucial role query intent detection plays in improving user
interactions with conversational agents, chatbots, and virtual assistants in the constantly changing
field of natural language processing (NLP). In order to uncover new dimensions and improve upon
preexisting paradigms, this thesis sets out on an ambitious and inventive journey into the depths of
query intent detection. The pursuit of accuracy and efficacy in interpreting user intent becomes the
lodestar directing our intellectual journey as we face the complexities of human communication.
The foundation of this study is the development of a dataset tailored to the particular
requirements of query intent identification. In the absence of a common dataset, our method makes
use of prompt engineering strategies developed in partnership with ChatGPT. The outcome is an
extensive and varied set of more than 500,000 carefully classified queries spanning ten categories
of user intent: informational, navigational, transactional, troubleshooting and support,
appointment and reservation, educational, entertainment, health and wellness, personal, and
product or service inquiries. This dataset's depth and diversity, carefully selected from 935
industries, provide a comprehensive picture of the many difficulties that chatbots encounter in
real-world scenarios.
Our approach to creating datasets is motivated by the latest developments in Natural Language
Inference (NLI) problems. Our approach is in line with a paradigm that goes beyond the constraints
of traditional crowdsourced datasets because it places a strong emphasis on Worker-AI
collaboration. By using a collaborative pipeline that combines the generative power of GPT-3 with
the evaluative expertise of human annotators, we actively contribute to the generation of
challenging examples, which promotes model robustness and generalization, while also addressing
the shortcomings of in-domain performance.
As we explore the complex terrain of query intent detection, the research develops into a thorough
examination of several machine learning systems. These algorithms are examined for performance,
accuracy, and overall efficacy. They each have unique capabilities and considerations. The
investigation becomes a symphony of algorithmic complexities, ranging from the dependability of
the Linear Support Vector Classifier (LinearSVC) to the impressive performance of Multinomial
Naive Bayes, the resilience of Random Forest Classifier, the varied advantages of Gradient
Boosting, and the remarkable accuracy of Bi-directional Long Short-Term Memory (Bi-LSTM).
Beyond the mathematical analysis, our thesis emerges as a story that moves through the phases of
dataset generation, algorithm training, and thorough assessment. The complex nature of query intent
detection is highlighted by the dependable performances of LinearSVC and Multinomial Naive
Bayes, the resilience displayed by Random Forest, the subtle strengths displayed by Gradient
Boosting, and the unmatched accuracy of Bi-LSTM. Our contribution goes beyond algorithmic
investigation; it encompasses a comprehensive strategy that combines advanced methods with a
dataset that has been carefully selected.
The path from data collection, which is distinguished by the WANLI dataset construction
methodology, to the final results bears witness to our steadfast dedication to
expanding the boundaries of NLP research. This thesis offers a scientific investigation of algorithms
as well as a lighthouse pointing the way for further study, creativity, and the ongoing development
of technology involved in user-centered, contextually aware, and emotionally intelligent
conversations. The relevance of our thesis, as we set out on this intellectual journey, is not just in
the remarkable performance of the algorithms, but also in the spirit of innovation that drove us to
completely reimagine the field of query intent detection. This thesis adds significantly to the
continuous story of technology advancement and is far more than just an academic project.
1.3 Objectives
The overall objective of this thesis is to advance the field of query intent detection through a
multidimensional journey. The main goals include a thorough review of previous studies, the
production of creative datasets, and a thorough examination of machine learning methods. The aims
are designed to provide useful insights toward boosting the user experience using chatbots and
virtual assistants, ranging from improving model generalization to evaluating real-world
applicability. The work goes beyond algorithmic investigation and paves the way for further
developments in natural language processing in the future.
Review and Synthesize: Using cutting-edge techniques, conduct a thorough review of the body of work
already done on intent identification and natural language processing.
Dataset Collection: Carefully select a large dataset that is suited to research requirements. The
dataset includes more than 500,000 inquiries of ten different kinds, which represent the variety of
user interactions that occur in real-world situations.
New Dataset generation approach: To overcome the drawbacks of conventional crowdsourced
datasets and guarantee strong model generalization, implement a novel dataset generation approach
that is motivated by current developments in Natural Language Inference (NLI) tasks.
Algorithmic Analysis: Examine various machine learning techniques for query intent
identification, such as Bi-LSTM, Random Forest Classifier, Linear Support Vector Classifier
(LinearSVC), Gradient Boosting, and Multinomial Naive Bayes.
Performance Evaluation: Analyze each algorithm's performance using accuracy, F1-score,
confusion matrices, and multiclass ROC curves. This provides important information about the
algorithms' advantages and disadvantages as well as how well suited they are to the task.
Model Robustness and Generalization: In order to create efficient chatbot applications, examine
how well machine learning models can generalize to a variety of query types.
Future Work Exploration: Examine potential paths for further study and development, such as
handling multimodal data, explainability, real-time query intent recognition, domain-specific query
fine-tuning, improved dataset refining, and sophisticated model architectures.
User Feedback Integration: Provide methods for integrating user feedback into the training of the
model to guarantee ongoing enhancement and adjustment to changing language trends and user
preferences.
Chapter 2
Literature Review
Thanks to a variety of approaches, the field of user intent categorization has advanced significantly,
with each approach offering a unique perspective and tackling a different set of difficulties. K-
means clustering, in particular, has proven to be an effective method with remarkable accuracy
(94%) on datasets of different sizes. This method finds eight kinds of user intent, outperforming
binary tree classification, with information seeking being the most prevalent intent. Although it has
the potential to be used for real-time search engine applications, its limitations are highlighted by
concerns about user representativeness and dependence on transaction logs as the data source[6].
Initiatives such as "Open Intent Discovery through Unsupervised Semantic Clustering and
Dependency Parsing"[11] have also investigated unsupervised learning techniques for user intent
discovery. Similar unsupervised learning techniques are explored in this work, which clarifies the
complexities of user intent extraction from dialogues. Moreover, an analysis has been conducted
comparing transformer-based models for intent detection with K-means clustering[12], highlighting
the ongoing effectiveness of K-means clustering when it comes to user intent categorization. This
work adds to the current discussion on intent detection techniques and offers insightful information
to professionals working in the subject.
New methods for intent recognition and slot tagging have been made possible by the development
of neural networks, as the "Multi-stage Bi-LSTM for Career Chatbot"[7] demonstrates. This novel
design leveraged a Bi-LSTM model in a multi-stage process where intent and slot information
mutually inform each other to reach state-of-the-art outcomes (F1-score >77%). This research
presents a viable path for increasing intent recognition in particular domains, addressing issues like
noisy user queries and non-native speakers. "Joint Slot Filling and Intent Classification with Deep
Learning"[13] examines joint learning techniques that use deep learning for both intent detection
and slot filling at the same time. Comparably, "A Neural Multi-stage Architecture for Intent
Detection and Slot Filling"[14] examines the effectiveness of such methods by utilizing LSTMs in a
multi-stage neural network design.
In the context of search queries, convolutional neural networks (CNNs) have been used to
determine user intent[2]. This method reduces the requirement for manual feature engineering by
using CNNs to learn semantic representations while treating queries as vectors. The study indicates
that although CNN features are excellent at capturing semantic similarity, more research should
focus on combining them with other techniques and on recurrent neural networks, like Bi-LSTM, to
achieve higher accuracy. In "Self-Attention Networks for Intent Detection"[16], the integration of
self-attention mechanisms with CNNs has been investigated, offering possible advantages over
CNNs operating independently. The present study underscores the dynamic character of intent
detection techniques, which integrate several neural network topologies to achieve maximum
efficacy.
The use of BERT in building a knowledge base chatbot is described in [9], which offers a
thorough structure for responding to information requests and determining which ones are outside
of its purview. The work addresses issues in knowledge base chatbot development by successfully
generating in-scope (IS) queries and detecting out-of-scope (OOS) ones. Suggestions for future
development highlight the possibility for sophisticated methods of natural language generation
and the incorporation of user
feedback. The use of neural networks in open-domain chatbots is investigated in "A Neural
Conversational Model for Open Domain Dialog"[17], which highlights the potential for knowledge-
based strategies. Furthermore exploring the subtleties of building chatbots for task-oriented
domains, "Building Effective Dialog Systems for Task-Oriented Domains with Multi-Domain
Dialog State Trackers"[18] emphasizes the crucial significance of knowledge integration.
In conclusion, a wide range of approaches are covered in the literature on user intent categorization,
all of which contribute to the changing field of natural language processing. These methods, which
range from advanced neural architectures to conventional clustering techniques, together influence
the direction of intent detection, providing insightful information and opening the door for new
developments.
Chapter 3
Methodology
Ten Distinct Datasets: Different datasets were collected, each focusing on a specific category of
user intent. Examples include informational queries, navigational queries, and transactional queries.
This ensures a diverse representation of user interactions.
Comprehensive DataFrame: The collected datasets were combined into a single comprehensive
DataFrame. This consolidation likely involved merging or concatenating the individual datasets,
creating a unified dataset for analysis and model training.
Text Cleaning Techniques: Text cleaning is crucial for improving the quality of textual data and
enhancing the performance of machine learning models. The following techniques were applied:
Removal of Special Characters: Non-alphanumeric characters, such as punctuation or symbols,
were likely removed to focus on the meaningful words in the text.
Stop Words Removal: Common words (stop words) that don't contribute significantly to the
meaning of the text (e.g., "and," "the," "is") were removed to reduce noise.
Tokenization: The process of breaking down a text into individual words or tokens. This step
facilitates further analysis by treating each word as a separate entity.
Lemmatization: It involves reducing words to their base or root form, considering variations like
plurals or different verb tenses. This helps in standardizing the vocabulary.
Quality Improvement: The overall goal of these text cleaning techniques is to enhance the quality
of the textual data. By removing noise and standardizing the representation of words, the
subsequent analysis and modeling stages can be more effective.
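The cleaning steps above can be sketched with the Python standard library. This is a minimal illustration, not the pipeline used in this work: the stop-word list is a tiny hypothetical one, tokenization is a plain whitespace split, and lemmatization is left out because it needs a lexical resource (for example WordNet) that the standard library does not provide.

```python
import re

# Hypothetical, deliberately short stop-word list (an assumption for illustration).
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "to", "of", "for", "in", "on",
              "i", "my", "me", "can", "you"}

def clean_query(text) -> list:
    """Lowercase, strip special characters, tokenize, and drop stop words."""
    text = text.lower()
    # Removal of special characters: keep only letters, digits, and whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Tokenization: split on runs of whitespace.
    tokens = text.split()
    # Stop-word removal.
    return [token for token in tokens if token not in STOP_WORDS]

cleaned = clean_query("Can you book a table for two, please?!")
print(cleaned)
```

A lemmatizer would be applied to the returned tokens as one further step, mapping for example "tables" to "table".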
Machine learning algorithms typically work with numerical inputs. Label encoding allows the
algorithm to interpret and learn from the categorical labels by representing them as numerical
values. This is crucial for tasks such as classification, where the algorithm needs to predict the
category or intent of a given query.
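As an illustration of label encoding, the sketch below maps intent names to integers and back. The helper name fit_label_encoder and the sample labels are assumptions made for this example; the mapping follows sorted label order, which is also how common library encoders assign codes.

```python
def fit_label_encoder(labels):
    """Map each distinct label to an integer code, in sorted label order."""
    classes = sorted(set(labels))
    return {label: index for index, label in enumerate(classes)}

# Hypothetical intent labels for four queries.
labels = ["Transactional", "Educational", "Navigational", "Educational"]

label_to_id = fit_label_encoder(labels)
encoded = [label_to_id[label] for label in labels]

# The inverse mapping turns predicted integer codes back into intent names.
id_to_label = {index: label for label, index in label_to_id.items()}
print(encoded, id_to_label)
```

Keeping the inverse mapping alongside the model matters in practice, since the classifier only ever sees and predicts the integer codes.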
3.5 Word Frequency
In our thesis report, we conducted a comprehensive analysis of word frequency after thorough text
pre-processing for ten distinct classes representing various query intents. Each class was
meticulously examined to unveil the most frequently occurring words, providing valuable insights
into the underlying themes and user intentions. For the "Appointments and Reservations" class,
prevalent words such as "schedule," "book," and "consultation" underscore the emphasis on
scheduling and booking activities. In contrast, the "Educational" class prominently features words
like "explain," "types," and "concept," emphasizing a focus on educational content, concepts, and
diverse topics. The "Entertainment" class showcases words like "TV," "recommend," and "suggest,"
indicative of user queries related to entertainment preferences, recommendations, and suggestions.
"Health and Wellness" emphasizes words like "help," "therapy," and "cancer," suggesting a focus on
queries related to health assistance, therapies, and concerns. For the "Informational" class, key
words like "information," "conceptual," and "explain" highlight a quest for informative content,
conceptual understanding, and explanatory details. "Navigational" queries, on the other hand,
revolve around words like "center," "directions," and "nearest," indicating a user's need for
navigational assistance, locations, and directions. In the "Personal" class, frequent words like
"check," "time," and "flight" point to queries associated with personal matters, time management,
and travel arrangements. The "Product or Services" class predominantly features words such as "best,"
"recommend," and "smart," reflecting user interest in product recommendations and information on
various services. The "Transactional" class is characterized by words like "book," "private," and
"subscription," suggesting queries related to transactions, bookings, and subscription services.
Lastly, the "Troubleshooting and Support" class includes words like "child," "insurance," and
"troubleshoot," indicative of user queries seeking assistance, troubleshooting guidance, and support.
This detailed analysis of word frequency provides a nuanced understanding of user intent within
each query class, offering valuable information for optimizing and enhancing query intent detection
models.
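A per-class word-frequency count of this kind can be sketched as follows. The sample queries are hypothetical and assumed to be already cleaned and tokenized; only the counting logic is the point of the example.

```python
from collections import Counter, defaultdict

# Hypothetical (tokens, label) pairs after pre-processing.
samples = [
    (["book", "consultation", "tomorrow"], "Appointments and Reservations"),
    (["schedule", "consultation"], "Appointments and Reservations"),
    (["explain", "concept", "gravity"], "Educational"),
    (["explain", "types", "rocks"], "Educational"),
]

# One Counter per intent class, updated with every token of that class.
counts_by_class = defaultdict(Counter)
for tokens, label in samples:
    counts_by_class[label].update(tokens)

# Most frequent word (with its count) for each class.
top_words = {label: counts.most_common(1) for label, counts in counts_by_class.items()}
print(top_words)
```

Raising the argument of most_common gives the top-n list for each class, which is the form in which the words above are reported.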
3.6 Train-Test Split
Purpose
Model Training: The training set is used to teach the machine learning model patterns, relationships,
and trends within the data. During this phase, the model adjusts its parameters to minimize the
difference between its predictions and the actual labels in the training set.
Model Evaluation: The test set is reserved for evaluating the model's performance on new, unseen
data. This helps to estimate how well the model will generalize to real-world scenarios and ensures
that it is not merely memorizing the training data (overfitting).
Splitting Process
Randomization: The dataset is typically randomly shuffled before the split to ensure that both the
training and test sets are representative of the overall data distribution.
Split Ratio: The dataset is divided into two portions based on a predefined ratio, such as 80% for
training and 20% for testing. The exact split ratio can vary based on the size of the dataset and the
specific requirements of the task.
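The shuffle-then-split procedure can be sketched in a few lines. The function name shuffle_split, the seed, and the toy data are assumptions for illustration; the 80/20 ratio is the example ratio mentioned above, not necessarily the one used in the experiments.

```python
import random

def shuffle_split(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of the rows, then cut it into (train, test) parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible
    n_test = int(round(len(rows) * test_ratio))
    return rows[n_test:], rows[:n_test]

# Hypothetical dataset of (query, encoded_label) pairs.
data = [(f"query {i}", i % 10) for i in range(100)]
train, test = shuffle_split(data, test_ratio=0.2)
print(len(train), len(test))
```

With strongly imbalanced classes, a stratified split that preserves the class proportions in both parts is often preferred over a purely random one.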
Training Set
Teaching the Model: The training set is used to train the model by providing input data along with
corresponding labels. The model learns to identify patterns and relationships, adjusting its internal
parameters through optimization algorithms like gradient descent.
Test Set
Model Evaluation: The test set remains unseen by the model during training and is used to assess
how well the model generalizes to new instances. This set helps to estimate the model's
performance on real-world, unseen data.
Overfitting Prevention
Detecting Overfitting: The use of a separate test set helps identify if the model has overfit the
training data by performing well on it but poorly on new data.
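Hyperparameter Tuning: The test set can also be used for hyperparameter tuning, where different
configurations of the model are evaluated to find the optimal set of hyperparameters. However,
scores obtained on data that also guided the tuning tend to be optimistic, so a separate validation
set or cross-validation on the training data is preferable for this purpose.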
Cross-Validation:
K-Fold Cross-Validation: In addition to a simple train-test split, more advanced techniques like k-
fold cross-validation can be employed to further ensure robust model evaluation. This involves
dividing the dataset into k subsets and performing k iterations, using different subsets as the test set
in each iteration.
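The fold construction behind k-fold cross-validation can be sketched as below. The generator name k_fold_indices is hypothetical, and the sketch assumes the data has already been shuffled, so the folds are simply contiguous index ranges.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in folds:
    print(test_idx, len(train_idx))
```

The model is trained and scored once per fold, and the k scores are then averaged to obtain a more stable estimate than a single split provides.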
Precision:
Definition: Precision is the ratio of true positive predictions to the total positive predictions made by
the model. It assesses the accuracy of positive predictions.
Precision
\[ P = \frac{TP}{TP + FP} \]
P = Precision
TP = True Positives
FP = False Positives
High precision means that when the model predicts a positive instance, it is likely to be correct.
Precision is particularly important when the cost of false positives is high.
Recall (Sensitivity or True Positive Rate)
Recall is the ratio of true positive predictions to the total actual positive instances in the dataset. It
assesses the ability of the model to capture all positive instances.
Recall
\[ R = \frac{TP}{TP + FN} \]
R = Recall
TP = True Positives
FN = False Negatives
High recall indicates that the model is effectively identifying most of the positive instances. Recall
is crucial when the cost of false negatives is high.
F1-Score
F1-Score is a metric that combines precision and recall into a single value. It is particularly useful in
binary classification tasks and provides a balance between precision and recall.
F1-Score
\[ F1 = \frac{2 \cdot P \cdot R}{P + R} \]
F1 = F1-Score
P = Precision
R = Recall
F1-Score ranges from 0 to 1, where 1 indicates perfect precision and recall. It is especially valuable
when there is an uneven class distribution.
Support
Support is the count of instances (or samples) for each class in the dataset. It provides context to the
evaluation metrics by showing how many instances belong to each class. Support is not a metric
that is optimized but rather gives a sense of the distribution of classes. It helps to understand the
imbalances in the dataset.
Accuracy
Accuracy is a measure of overall correctness in a classification model. It calculates the ratio of
correctly predicted instances to the total number of instances.
Accuracy
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
TP = True Positives
TN = True Negatives
FP = False Positives
FN = False Negatives
High accuracy indicates that the model is making a high percentage of correct predictions across all
classes. However, it might not be the best metric for imbalanced datasets.
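Macro Average
Macro Average is a method of computing the average performance across multiple classes without
considering class imbalances. It calculates the metric independently for each class and then takes
the unweighted average.
Macro Avg
\[ \text{Macro Avg} = \frac{1}{N} \sum_{i=1}^{N} M_i \]
N = Number of classes
M_i = Value of the metric (precision, recall, or F1-Score) for class i
Weighted Average
Weighted Average also calculates the metric independently for each class, but weights each class
by its support when averaging.
Weighted Avg
\[ \text{Weighted Avg} = \frac{\sum_{i=1}^{N} n_i M_i}{\sum_{i=1}^{N} n_i} \]
n_i = Support (number of instances) of class i
Interpretation: Weighted Average is useful when there is an imbalance in class distribution. It gives
more weight to classes with larger support.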
Confusion Matrix
A confusion matrix is a table used in classification to assess the performance of a machine learning
model. It provides a comprehensive breakdown of the model's predictions, comparing them to the
true labels. The matrix consists of four components:
True Positives (TP): Instances where the model correctly predicted the positive class.
True Negatives (TN): Instances where the model correctly predicted the negative class.
False Positives (FP): Instances where the model predicted the positive class, but the true class is
negative.
False Negatives (FN): Instances where the model predicted the negative class, but the true class is
positive.
The confusion matrix is especially useful in understanding the types and frequencies of errors made
by a classifier. It serves as the foundation for deriving various performance metrics such as accuracy,
precision, recall, and the F1-Score.
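For a multiclass problem, the four components are obtained per class in a one-vs-rest fashion: the class in question is treated as positive and all other classes as negative. The sketch below derives precision, recall, F1-score, support, accuracy, and the macro and weighted F1 averages from a confusion matrix. The function name and the 3x3 matrix are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def per_class_metrics(cm):
    """cm[i][j] = number of samples with true class i predicted as class j."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    report = []
    for c in range(n):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n)) - tp  # column sum minus the diagonal
        fn = sum(cm[c]) - tp                       # row sum minus the diagonal
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report.append({"precision": precision, "recall": recall,
                       "f1": f1, "support": tp + fn})
    accuracy = sum(cm[c][c] for c in range(n)) / total
    macro_f1 = sum(r["f1"] for r in report) / n
    weighted_f1 = sum(r["f1"] * r["support"] for r in report) / total
    return report, accuracy, macro_f1, weighted_f1

# Hypothetical confusion matrix for three classes (rows: true, columns: predicted).
cm = [
    [8, 2, 0],
    [1, 6, 3],
    [0, 0, 10],
]
report, accuracy, macro_f1, weighted_f1 = per_class_metrics(cm)
print(report, accuracy, macro_f1, weighted_f1)
```

Because every class in this toy matrix has the same support, the macro and weighted averages coincide; they diverge as soon as the class sizes differ.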
ROC Curve and AUC (Area Under the ROC Curve)
The Receiver Operating Characteristic (ROC) curve is a graphical representation of a classifier's
performance across different discrimination thresholds. The curve plots the True Positive Rate
(Sensitivity) against the False Positive Rate (1 - Specificity) at various threshold values. The Area
Under the Curve (AUC) quantifies the overall performance of the classifier.
True Positive Rate (Sensitivity): The proportion of actual positive instances correctly identified
by the classifier.
False Positive Rate (1 - Specificity): The proportion of actual negative instances incorrectly
identified as positive by the classifier.
A model with a higher AUC score generally has better discrimination ability, meaning it can
distinguish between positive and negative instances more effectively. A perfect classifier would have
an AUC score of 1, while a random classifier would have an AUC of 0.5.
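The AUC can also be read as the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, which equals the area under the ROC curve. The sketch below computes it directly from that definition. The function name and toy scores are assumptions for illustration, and the pairwise loop is quadratic, so it suits small examples rather than the full dataset.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical binary labels (1 = the class of interest) and classifier scores.
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.2, 0.6, 0.1]
auc = roc_auc(y_true, scores)
print(auc)
```

For a multiclass ROC analysis, the same computation is repeated once per class, with the labels binarized so that the class of interest is 1 and every other class is 0.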
The evaluation of the LinearSVC algorithm is presented in two scenarios: "Without Parameter
Tuning" and "With Parameter Tuning." In both cases, the algorithm exhibits high accuracy and
balanced performance metrics across the categories. Without parameter tuning, LinearSVC
achieves an overall accuracy of 0.95 over a support of 101,985 instances. The macro and weighted
averages for precision, recall, and f1-score all reach 0.95, reflecting the algorithm's
effectiveness across diverse query types.
Upon incorporating parameter tuning, the best parameters for the LinearSVC are identified as
{'C': 1, 'penalty': 'l2'}. These parameters are the regularization parameter (C, where smaller
values mean stronger regularization) and the penalty term (penalty) used in the LinearSVC
algorithm. A value of C = 1 and an 'l2' penalty were identified as the optimal choices by the
tuning process. Parameter tuning aims to optimize the model's performance by adjusting these
hyperparameters, and in the case of LinearSVC, the specified values were found to yield the best
results for the given dataset and query intent classification task. Notably, the performance
metrics remain consistent with the untuned scenario, with an accuracy of 0.95 over the same
101,985 instances. The macro and weighted averages for precision, recall, and f1-score mirror the
untuned results, affirming the stability of the algorithm's performance even after fine-tuning.
Overall, these findings underscore the reliability of the LinearSVC algorithm, as it maintains
high-quality predictions for query intent detection both with and without parameter tuning.
Confusion Matrix (Linear SVC)
The confusion matrix for the LinearSVC algorithm provides a comprehensive breakdown of the
model's performance across ten query types. Notably, for Query Type 1, the model demonstrates
high accuracy, correctly predicting 13,970 instances with minimal false positives and negatives.
Similarly, Query Type 2 and Query Type 3 exhibit strong performance, though with slight confusion
among other query types. However, Query Type 6 shows a lower count of true positives, indicating
challenges in prediction. Overall, the LinearSVC algorithm achieves an impressive accuracy of 95%,
showcasing its effectiveness in classifying diverse query types. The detailed analysis of the
confusion matrix allows for a nuanced understanding of the algorithm's strengths and areas for
improvement, providing valuable insights for further refinement and optimization.
The Multiclass ROC Curve for the MultinomialNB algorithm provides a detailed evaluation of its
performance across different query types. The Area Under the Curve (AUC) values associated with
each query type's curve offer insights into the algorithm's ability to discriminate between classes. A
higher AUC indicates better discriminatory power, and the results reveal notable strengths in certain
classes.
For instance, Class 1 exhibits a perfect AUC of 1.00, indicating that the algorithm achieves optimal
true positive rates while minimizing false positive rates for this specific query type. Similarly,
Classes 6 and 7 also demonstrate perfect AUC scores, signifying the algorithm's exceptional ability
to distinguish between instances of these query types.
While most classes exhibit high AUC values, such as Classes 0, 2, 3, 4, 5, 8, and 9 with AUC scores
ranging from 0.97 to 0.99, it's essential to consider the algorithm's performance in the context of
specific classes. AUC scores approaching 1.00 indicate robust performance, but deviations from
perfection may suggest areas for further exploration and optimization.
The Random Forest algorithm, a powerful ensemble method, was employed for query intent
detection with and without parameter tuning, and the results are highly promising. In both
scenarios, the algorithm exhibited exceptional performance, achieving an accuracy and f1-score of
97% across the dataset.
In the absence of parameter tuning, the Random Forest model demonstrated robustness, with macro
and weighted averages for precision, recall, and f1-score consistently reaching or exceeding 95%.
This signifies the algorithm's ability to effectively identify true positives while minimizing both
false positives and false negatives across diverse query types.
Following parameter tuning, the model's hyperparameters were optimized, enhancing its
performance further. The best parameter configuration, {'max_depth': None, 'min_samples_leaf':
1, 'min_samples_split': 2, 'n_estimators': 150}, reflects the choices that yielded the most
favorable outcomes. These parameters play a crucial role in defining the structure and behavior of
the random forest model. The "max_depth" parameter controls the maximum depth of the trees in
the forest, "min_samples_leaf" sets the minimum number of samples required to be at a leaf node,
"min_samples_split" specifies the minimum number of samples required to split an internal node,
and "n_estimators" determines the number of trees in the forest. Despite achieving the same overall
accuracy and f1-score as the untuned model, the tuned model's parameter configuration might
contribute to improved generalization and stability.
The Random Forest algorithm, whether with or without parameter tuning, emerges as a robust
choice for query intent detection. Its consistent high accuracy, precision, recall, and f1-score across
diverse query types underscore its effectiveness in handling the complexities of intent classification
tasks. The detailed evaluation and optimization of the algorithm contribute valuable insights to our
thesis, highlighting its competence in real-world applications.
Confusion Matrix (Random Forest)
The confusion matrix for the Random Forest Classifier provides a detailed overview of the model's
performance across different query types. Each row corresponds to the true class, while each
column represents the predicted class.
The diagonal elements indicate the true positives for each query type, and off-diagonal elements
represent misclassifications. From the matrix, it is evident that the Random Forest Classifier excels
in correctly identifying query types, as evidenced by the high values along the diagonal.
For instance, in Query Type 1, the model achieved 14,009 true positives, with only a small number
of misclassifications across other categories. Similarly, for Query Type 6, the classifier
demonstrated strong performance, correctly predicting 690 instances while misclassifying only a
minimal number.
However, some challenges are observed in certain query types, such as Query Types 0, 2, and 8,
where misclassifications are slightly more prominent. These discrepancies could be attributed to the
inherent complexity and subtle differences between queries in these categories.
The confusion matrix provides valuable insights into the strengths and areas for improvement of the
Random Forest Classifier in handling diverse query types. Despite some misclassifications, the
model showcases robust performance, reinforcing its effectiveness for query intent detection in real-
world applications.
Area Under the ROC curve (Random Forest)
The Multiclass ROC Curve for the Random Forest Classifier showcases the Area Under the Curve
(AUC) values for each query type, providing a comprehensive assessment of the model's
discriminatory power across different classes. Each curve represents the classifier's ability to
distinguish between a specific query type and the rest.
Remarkably, for the majority of query types, including Classes 0, 1, 2, 3, 4, 5, 7, 8, and 9, the AUC
values are consistently high, reaching a perfect score of 1.00. This indicates the Random Forest
Classifier's exceptional performance in distinguishing these query types, achieving optimal
sensitivity and specificity.
While the AUC for Class 6 is slightly lower at 0.99, it still reflects a high discriminatory capability,
showcasing the model's effectiveness in identifying this specific query type. The overall pattern of
near-perfect AUC values across query types highlights the robustness of the Random Forest
Classifier in handling diverse intents within the dataset.
This result reinforces the Random Forest Classifier as a powerful algorithm for query intent
detection, emphasizing its ability to provide accurate predictions across a broad range of query
types. The high AUC values affirm the model's reliability in real-world scenarios, supporting its
potential for deployment in applications requiring precise intent classification.
The performance metrics for the Gradient Boosting algorithm reveal insights into its effectiveness
for query intent classification. The achieved accuracy, precision, recall, and F1-score values are
critical indicators of the model's capability to correctly classify various query types within the
dataset.
With an accuracy of 0.75, the Gradient Boosting algorithm demonstrates a satisfactory level of
overall correctness in predicting query intents. The F1-score, a harmonic mean of precision and
recall, stands at 0.78, reflecting a balanced trade-off between these two metrics. The Macro average
precision and recall are reported as 0.81 and 0.76, respectively, emphasizing the model's ability to
generalize well across different query types.
In the context of weighted averages, the Gradient Boosting algorithm achieves precision, recall, and
F1-score values of 0.79, 0.75, and 0.76, respectively. These weighted averages provide a
comprehensive evaluation, considering the varying support for each query type within the dataset.
These metrics indicate a reasonable level of performance, but they fall well short of the other
algorithms evaluated in this thesis. Fine-tuning hyperparameters or exploring ensemble methods
could improve the Gradient Boosting model's performance for query intent detection.
Confusion Matrix (Gradient Boosting)
The confusion matrix for the Gradient Boosting algorithm reveals distinct performance patterns
across different query types. Notably, the algorithm excels in accurately classifying "Educational"
queries, achieving 644 correct predictions with minimal misclassifications. In contrast, challenges
are observed in distinguishing between "Navigational" and "Informational" queries, with significant
off-diagonal values in these respective rows, suggesting potential confusion between the two
categories.
In raw counts, the algorithm correctly classifies 11105 "Navigational" queries and 11492
"Informational" queries. Because these counts depend on how many test queries each class
contains, they should be compared as per-class recall (correct predictions divided by class
support) rather than directly. Additionally, the "Troubleshooting" class exhibits notable
misclassifications across various categories, indicating room for improvement.
In summary, while the Gradient Boosting algorithm performs well on certain classes, such as
"Educational," it struggles to distinguish closely related query types, most visibly
"Navigational" and "Informational." Further optimization efforts could enhance its performance,
particularly in classes where misclassifications are more pronounced.
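The kind of reading applied to the confusion matrix above can be sketched as follows. The matrix is a hypothetical three-class toy example, not the thesis matrix, and the sketch assumes rows are true classes and columns are predicted classes. It converts diagonal counts into per-class recall and locates the largest off-diagonal cell, that is, the most frequent confusion.

```python
def per_class_recall(matrix):
    """Diagonal count divided by the row total (the true-class support)."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


def most_confused_pair(matrix):
    """Return (true_index, predicted_index, count) of the largest off-diagonal cell."""
    best = (None, None, -1)
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            if i != j and count > best[2]:
                best = (i, j, count)
    return best


# Hypothetical toy matrix: rows = true class, columns = predicted class.
labels = ["Informational", "Navigational", "Educational"]
toy_matrix = [
    [80, 15, 5],
    [30, 60, 10],
    [2, 3, 45],
]

recalls = per_class_recall(toy_matrix)
true_idx, pred_idx, count = most_confused_pair(toy_matrix)
for name, recall in zip(labels, recalls):
    print(f"{name}: recall = {recall:.2f}")
print(f"Most frequent confusion: {labels[true_idx]} -> {labels[pred_idx]} ({count})")
```

Normalizing by row totals is what makes classes of different sizes comparable. A class with more correct predictions in absolute terms can still have the lower recall.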
Area Under the ROC curve (Gradient Boosting)
The multiclass ROC Curve for the Gradient Boosting algorithm provides an insightful perspective
on the model's discriminatory ability across different query types. Each line on the curve
corresponds to a specific query type, and the Area Under the Curve (AUC) quantifies the model's
performance in distinguishing between classes.
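A per-class AUC of this kind is typically obtained one-vs-rest: each class in turn is treated as positive and all others as negative. The sketch below uses the rank interpretation of AUC (the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one). The scores are hypothetical toy values, the function names are ours, and the thesis may have computed its curves differently.

```python
def binary_auc(scores, is_positive):
    """AUC as P(score of a positive > score of a negative); ties count as one half."""
    pos = [s for s, y in zip(scores, is_positive) if y]
    neg = [s for s, y in zip(scores, is_positive) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(y_true, class_scores):
    """class_scores[k][i] is the model's score for class k on example i."""
    return {
        k: binary_auc(column, [t == k for t in y_true])
        for k, column in class_scores.items()
    }


# Hypothetical toy scores for six examples and three classes.
y_true = [0, 0, 1, 1, 2, 2]
class_scores = {
    0: [0.8, 0.6, 0.3, 0.1, 0.2, 0.1],
    1: [0.1, 0.3, 0.5, 0.4, 0.3, 0.2],
    2: [0.1, 0.1, 0.2, 0.5, 0.5, 0.7],
}

aucs = one_vs_rest_auc(y_true, class_scores)
for k, value in aucs.items():
    print(f"Class {k}: AUC = {value:.4f}")
```

The pairwise comparison is quadratic in the number of examples, which is fine for illustration but too slow for a test set of the size used in this thesis. A sort-based computation would be preferable there.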
Analyzing the AUC values for each class reveals the discriminatory power of the Gradient Boosting
algorithm. Notably, Class 1 has the highest AUC of 0.97, indicating strong performance in
accurately identifying instances of this query type. Similarly, Class 5, Class 7, Class 8, and Class 9
also demonstrate high AUC values, emphasizing the model's effectiveness in distinguishing these
query types.
The AUC values should nevertheless be read class by class. For instance, Class 0 exhibits a lower
AUC of 0.90, which still indicates good discriminatory ability for this query type but leaves more
room for error than Class 1.
Overall, the multiclass ROC Curve shows that Gradient Boosting separates most query types well,
while the spread of AUC values across classes identifies the query types where optimization
would help most.
4.6 Bi-LSTM
The Bi-LSTM algorithm accurately classified user queries across the ten classes. It achieved an
accuracy and F1-score of 0.98, the highest of the algorithms evaluated. The macro and weighted
average precision and recall scores of 0.97 show that this performance holds across all query
classes and is not driven by the largest classes alone.
This performance suggests that Bi-LSTM is well suited to complex and context-dependent queries,
a crucial aspect of query intent detection. The high precision and recall values indicate that
the model produces few false positives and few false negatives, which is vital for reliable
predictions in real-world scenarios. These results position Bi-LSTM as a promising choice for
query intent detection applications, with the potential to improve user experience and
information retrieval systems by categorizing diverse user queries accurately.
4.7 Analysis
Our investigation of several machine learning methods for query intent recognition shows how
their performance differs, and each algorithm has its own advantages and disadvantages. The
analysis covers Linear Support Vector Classifier (LinearSVC), Multinomial Naive Bayes, Random
Forest Classifier, Gradient Boosting, and Bi-directional Long Short-Term Memory (Bi-LSTM).
Linear Support Vector Classifier (LinearSVC): LinearSVC exhibited robust performance with an
accuracy and F1-score of 0.95. Its strengths are high accuracy and reliable discrimination across
diverse query classes. While LinearSVC performs admirably, the choice might hinge on specific
application requirements, computational resources, and the importance of precision and recall.
Multinomial Naive Bayes: Multinomial Naive Bayes achieved an accuracy and F1-score of 0.88, with
a slight improvement after parameter tuning (0.89). It handles the various query types
effectively. While versatile, the algorithm's suitability may depend on specific use cases and
the precision they require.
Random Forest Classifier: The Random Forest Classifier performed exceptionally well even without
parameter tuning, with an accuracy and F1-score of 0.97. Its strengths are robustness, consistent
results, and high AUC values in the multiclass ROC curve. It is a reliable choice, especially in
scenarios where robustness and discriminative capabilities are crucial.
Gradient Boosting: Gradient Boosting achieved an accuracy of 0.75 and a weighted-average F1-score
of 0.76, the lowest scores among the five algorithms. Its per-class results were uneven: strong
on "Educational" queries but weak at separating "Navigational" from "Informational" queries.
Given these results, it is a weaker candidate unless specific application requirements favor it.
Bi-directional Long Short-Term Memory (Bi-LSTM): Bi-LSTM emerged as the standout performer with
an accuracy and F1-score of 0.98, the highest of the five algorithms. Its strengths are accuracy,
robustness, and reliability in classifying queries. This performance makes Bi-LSTM a compelling
choice, especially when precision and accuracy are paramount.
Which is better?
The best method depends on the requirements of the application, the available computing power,
and the relative emphasis placed on accuracy, precision, and recall. On our dataset, Bi-LSTM
achieves the highest scores (0.98), followed by Random Forest Classifier (0.97) and LinearSVC
(0.95); Multinomial Naive Bayes is lower (0.88 to 0.89), and Gradient Boosting is the lowest
(0.75). The choice should still be made with the particular use case in mind, taking into account
interpretability, computational cost, and the balance between recall and precision that the
intended user interactions require.
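The comparison can be made concrete with a small sketch that ranks the algorithms by the accuracy values reported in Section 4.7 and shortlists those above a chosen threshold. The threshold and the function name shortlist are illustrative assumptions. Other criteria named above, such as interpretability and computational cost, are not measured in this section and would have to be weighed separately.

```python
# Accuracy values as reported in Section 4.7 (Multinomial Naive Bayes after tuning).
reported_accuracy = {
    "LinearSVC": 0.95,
    "Multinomial Naive Bayes": 0.89,
    "Random Forest": 0.97,
    "Gradient Boosting": 0.75,
    "Bi-LSTM": 0.98,
}


def shortlist(accuracies, min_accuracy):
    """Names of models at or above min_accuracy, best first."""
    ranked = sorted(accuracies.items(), key=lambda item: item[1], reverse=True)
    return [name for name, accuracy in ranked if accuracy >= min_accuracy]


# Illustrative threshold (an assumption, not a thesis requirement).
candidates = shortlist(reported_accuracy, min_accuracy=0.95)
print(candidates)
```

With a threshold of 0.95 the shortlist contains Bi-LSTM, Random Forest, and LinearSVC, and the final choice among them rests on the application's constraints.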
Chapter 5
5.2 Conclusion
In the journey from meticulously collecting a vast dataset of over 500,000 queries through
innovative prompt engineering from ChatGPT to the culmination of insightful results, our thesis
stands as a testament to the fusion of cutting-edge techniques and rigorous methodology. We
embraced the challenges posed by the absence of a standard dataset for query intent detection and
took a bold step in creating our own, meticulously categorizing queries across ten distinct types and
associating them with 935 diverse sectors.
Our use of prompt engineering was inspired by the Worker-and-AI NLI (WANLI) paradigm for dataset
creation [10], which addresses limitations inherent in traditional large-scale crowdsourced
datasets, such as repetitive patterns. In that work, a collaborative pipeline combines the
generative capabilities of GPT-3 with the evaluative strength of human annotators, and models
trained on the resulting WANLI dataset outperformed models trained on MultiNLI on out-of-domain
test sets.
Moving through the stages of dataset creation, algorithm training, and extensive evaluation, we
navigated the intricacies of machine learning models. LinearSVC, Multinomial Naive Bayes,
Random Forest Classifier, Gradient Boosting, and Bi-LSTM each played a distinctive role,
contributing to a rich tapestry of results. LinearSVC and Multinomial Naive Bayes exhibited
commendable accuracy, Random Forest demonstrated exceptional robustness, Gradient Boosting
showcased diverse strengths, and Bi-LSTM emerged as a standout performer with unparalleled
accuracy.
Our work goes beyond algorithmic exploration; it represents a holistic approach to query intent
detection, combining established classification techniques with a richly curated dataset. The
path from data collection to results underscores our commitment to advancing knowledge in
natural language processing. As we conclude, the significance of our thesis lies not only in the
strong performance of the algorithms but also in the creation of a new dataset for query intent
detection where no standard one existed.
References:
[1] A. Kathuria, B. J. Jansen, C. Hafernik, and A. Spink, "Classifying the user intent of web queries
using k-means clustering," in IEEE Transactions on Information Theory, pp. 563-581, 2010.
[2] H. B. Hashemi, A. Asiaee, and R. Kraft, "Query Intent Detection using Convolutional Neural
Networks," Intelligent Systems Program, University of Pittsburgh; Yahoo! Inc., Sunnyvale, CA.
[3] "Understand Random Forest Algorithms," Analytics Vidhya, [Online]. Available:
https://www.analyticsvidhya.com/blog/2021/06/understanding-random-forest/
[4] "Gradient Boosting Algorithm," Analytics Vidhya, [Online]. Available:
https://www.analyticsvidhya.com/blog/2021/09/gradient-boosting-algorithm-a-complete-guide-for-beginners/
[5] "Bidirectional Long Short-Term Memory Network," ScienceDirect, [Online]. Available:
https://www.sciencedirect.com/topics/computer-science/bidirectional-long-short-term-memory-network
[6] "Naive Bayes Classifier Explained," Analytics Vidhya, [Online]. Available:
https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/
[7] A. Nigam, P. Sahare, and K. Pandya, "Intent Detection and Slots Prompt in a Closed-Domain
Chatbot," in Proceedings of the IEEE International Conference on Natural Language Processing and
Machine Learning. New Delhi, India: kydots.ai.
[8] "Guide on Support Vector Machine (SVM) Algorithm," Analytics Vidhya, [Online]. Available:
https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/
[9] L. P. Manik, D. S. Rini, Z. Akbar, H. F. Mustika, A. Indrawati, A. D. Fefirenta, T.
Djarwaningsih, "Out-of-Scope Intent Detection on A Knowledge-Based Chatbot," Research Center
for Informatics, Indonesian Institute of Sciences, Indonesia.
[10] A. Liu, S. Swayamdipta, N. A. Smith, Y. Choi, "WANLI: Worker and AI Collaboration for
Natural Language Inference Dataset Creation," Paul G. Allen School of Computer Science &
Engineering, University of Washington; Allen Institute for Artificial Intelligence.
[12] Moura, A., Lima, P., Mendonça, F., Mostafa, S. S., & Morgado-Dias, F. (2021). On the Use of
Transformer-Based Models for Intent Detection Using Clustering Algorithms. Sensors, 21(13),
4428. https://doi.org/10.3390/s21134428
[13] Zhang, C., Li, Y., Du, N., Fan, W., & Yu, P. S. (2019, June). Joint slot filling and intent
detection via capsule neural networks. In Proceedings of the 57th Annual Meeting of the
Association for Computational Linguistics (Vol. 1, pp. 5259-5267). Association for Computational
Linguistics. https://aclanthology.org/P19-1519
[14] Dao, M. H., Truong, T. H., & Nguyen, D. Q. (2021). Intent detection and slot filling for
Vietnamese [Abstract]. In Proceedings of the 2021 Conference on Empirical Methods in Natural
Language Processing (EMNLP) (pp. 1276-1287). Association for Computational Linguistics.
https://arxiv.org/abs/2104.02021
[15] Zhang, H., Song, W., Liu, L., Du, C., & Zhao, X. (2016). Query classification using
convolutional neural networks. 2016 IEEE International Conference on Data Mining (ICDM) (pp.
1041-1046). Institute of Electrical and Electronics Engineers (IEEE).
[16] Yolchuyeva, S., Németh, G., & Gyires-Tóth, B. (2020). Self-attention networks for intent
detection. In 2020 43rd International Conference on Telecommunications and Signal Processing
(TSP) (pp. 470-474). Institute of Electrical and Electronics Engineers (IEEE).
https://ieeexplore.ieee.org/document/10052676
[17] Vinyals, O., & Le, Q. V. (2015, June 23). A neural conversational model [ArXiv preprint
arXiv:1506.05869]. https://arxiv.org/abs/1506.05869
[18] Zhu, Q., Zhang, Z., Zhu, X., & Huang, M. (2023). Building multi-domain dialog state trackers
from single-domain dialogs. In Proceedings of the 2023 Conference on Empirical Methods in
Natural Language Processing (EMNLP) (pp. 946-957). Association for Computational Linguistics.
https://aclanthology.org/2023.emnlp-main.946
[19] Towards Data Science. (2022, January 15). Micro, Macro, Weighted Averages of F1 Score:
Clearly Explained. Available:
https://towardsdatascience.com/micro-macro-weighted-averages-of-f1-score-clearly-explained-b603420b292f
[20] Yao, Z., Schloss, B. J., & Selvaraj, S. P. (2023, December). Aligning AI-Generated Text with
Human Preferences. [ArXiv preprint arXiv:2312.15997]. https://arxiv.org/abs/2312.15997
[21] Chen, S., Gao, S., & He, J. (2023, May). Evaluating Factual Consistency of Summaries with
Large Language Models. [ArXiv preprint arXiv:2305.14069]. https://arxiv.org/abs/2305.14069
[22] Yuan, A., Ippolito, D., Nikolaev, V., Callison-Burch, C., Coenen, A., & Gehrmann, S. (2021).
SynthBio: A Case Study in Human-AI Collaborative Curation of Text Datasets. [ArXiv preprint
arXiv:2111.06467]. https://arxiv.org/abs/2111.06467