A Machine Learning based Depression Analysis
and Suicidal Ideation Detection System using
Questionnaires and Twitter
Swati Jain, Suraj Prakash Narayan, Rupesh Kumar Dewang, Utkarsh Bhartiya, Nalini Meena and Varun Kumar
Abstract: Depression as a disorder is a great concern in our society and has long been an active topic for researchers worldwide. Despite the large body of work on understanding individual moods, including depression, anxiety, and stress, based on activity logs collected by pervasive computing devices such as smartphones, predicting depressed moods remains an open question. In this paper, we propose a depression analysis and suicidal ideation detection system for predicting suicidal acts based on the level of depression. We collected real-time data from students and parents by having them fill in questionnaires similar to the PHQ-9 (Patient Health Questionnaire), consisting of questions such as "What's your age?" or "Are you regular in school/college?", and processed the responses into meaningful data with related features such as age, sex, and regularity in school. Classification algorithms are then trained to classify the data into five stages of depression depending on severity: minimal or none, mild, moderate, moderately severe, and severe. The maximum accuracy, 83.87 %, was achieved by the XGBoost classifier on this dataset. In addition, data were collected in the form of tweets and classified, using classification algorithms, according to whether the person who tweeted is depressed or not. The Logistic Regression classifier gave the maximum accuracy, 86.45 %, for this task.

Index Terms: Twitter, Tweets, Reddit, Suicide, Depression, Social Media, Machine Learning, Classification.

Swati Jain, Suraj Prakash Narayan, Rupesh Kumar Dewang, Utkarsh Bhartiya, Nalini Meena and Varun Kumar are with Department of Computer Science and Engineering, Motilal Nehru National Institute of Technology, Allahabad, Prayagraj-211004, India (E-mail: swatijain345@gmail.com, sprksh.narayan@gmail.com, rupeshdewang@mnnit.ac.in, bhartiyautkarsh9695@gmail.com, nalini61097@gmail.com, varunmaurya37@gmail.com)
978-1-7281-0577-2/19/$31.00 2019 IEEE
SCES 2019 5th Students Conference on Engineering and Systems

I. INTRODUCTION
Depression is a disorder of major public health importance in terms of its prevalence and the suffering, dysfunction, morbidity, and economic burden it causes [1]. It is a serious, debilitating disorder that can affect people of all ages. It can lead to low mood, feelings of guilt, and insomnia, can cause problems such as chronic back pain and other medical symptoms, and can be fatal if left untreated. According to the World Health Organization (WHO), roughly 350 million people currently suffer from depression [2]. The WHO ranks depression as one of the most devastating diseases in the world [3]. In addition, two-thirds of depressed people do not seek appropriate treatment, which leads to major consequences [8]. Medical practice relies on asking patients questions about their situation, which does not diagnose depression very precisely [4]. According to the Global Burden of Disease Study, it is estimated that if the current trend in the pattern of mortality and disease continues, by 2020 depression will account for 5.7 % of the total burden of disease and will be the second leading cause of disability worldwide, after heart disease [1]. With the steady rise of depression, there is a need to develop automatic techniques for detecting the presence and extent of depression, and thereby to help prevent further cases. The motivation of this paper is therefore to explore different sources of information, such as social media posts, blogs, language, and action cues, to predict the severity of depression. Data were collected from tweets and from the questionnaires we prepared. In doing so, we also investigated different feature representations, including pie charts and bar charts, and modeling techniques such as supervised classifiers, including support vector machine and random forest classifiers, corresponding to each modality, in order to improve the performance of automatic prediction. This helps those in need by identifying depression at an early stage, which in turn helps to prevent its catastrophic outcomes.
The rest of the paper is structured as follows. Section II introduces the most relevant related work. Section III provides a detailed description of the proposed methodology. Observations and results are described in Section IV. Finally, Section V concludes the paper.

II. RELATED WORK
Studies of depression long predate the Internet and have been a major research focus. Detecting depression from documents in particular has become an increasingly important research area, with interesting methods and results reported for Facebook, Twitter and various other forum posts [5]. Based on questionnaire surveys conducted throughout the world, many widely accepted methods and criteria have been developed. For example, in related work by Lenore Sawyer Radloff, the CES-D Scale contains 20 questions about mental conditions such as users' bad feelings and sleep conditions [6]. The questions either have several options aligned with different scores or require users to report the degree of their situation. The depression level is diagnosed according to the scale of the total score. As another example, Beck's Depression Inventory has 21 categories covering users' mental and physiological state, such as mood, sense of failure, lack of satisfaction, irritability, feeling of guilt, sense of punishment, self-hate, self-accusations and inhibition of work [7].
Another work, by Richardson et al., studied the performance characteristics and validity of the Patient Health Questionnaire 9-item (PHQ-9) as a screening tool for depression among adolescents [8]. Our system uses a questionnaire similar to the PHQ-9, namely an extended version of it that covers the aspects, factors and symptoms leading to depression. Tzirakis et al. [9], inspired by the use of deep learning in detecting emotion, presented a deep-learning approach to analyzing the emotion as well as the depression state of a person. They used a 50-layer deep residual network on visual data and a convolutional neural network (CNN) on audio.
Recently, Islam et al. performed depression analysis on Facebook data collected from an online public source, using machine learning techniques as an efficient and scalable method [10].
III. PROPOSED METHODOLOGY
In this section, we describe the datasets used, the feature extraction, and the proposed algorithms and model.
A. DataSet I from Questionnaire
Real-time data were collected from students and parents by having them fill in questionnaires similar to the PHQ-9. The questionnaires were prepared keeping in mind the symptoms observed in a student suffering from any level of depression, and the extent to which parents are involved. We also consulted several counselors to make our dataset as effective as possible in determining the severity of depression. Missing data were handled by filling the blanks with the most likely answer. The following are some of the features considered while preparing the questionnaire:
1. Age
2. Sex
3. Regularity in school/college
4. Feeling tired/having little energy
5. Feeling down or hopeless
6. Degree of insomnia
7. Poor appetite
8. Trouble concentrating on things
9. Thoughts of getting dead
10. Intentionally overdosed on drugs
11. Suffered from any physical/mental abuse
In total, 18 features were used for this dataset, and 5 further fields were kept for documentation purposes, including the timestamp at which the response was recorded and a comments section in which students could give additional information. After the form was circulated in various schools and colleges, 619 responses were recorded; these were then preprocessed (for example, by handling missing values) and converted into a dataset.
For example, if the question is "Are you regular in school/college?" and it has 4 choices, these choices are encoded by the label encoder as shown in Table I.

TABLE I: Encoded Values of the options
Options                    Value
Not at all                 0
Several Days               1
More than half the days    2
Nearly everyday            3

Fig. 1: Proposed Model of Depression Analysis and Suicidal Ideation Detection System

Algorithm 1 Pre-processing for Dataset I
procedure PREPROCESS(data)
    for x in data.columns do
        if data[x].type == object then
            data[x] = transform(data[x].values)    ▷ Label Encoder is used for transformation
        end if
    end for
    return data

B. DataSet II from Twitter
We used PRAW (Python Reddit API Wrapper), an open-source Python library for accessing Reddit content, to obtain a dataset of user posts related to depression and suicidal ideation from Reddit. The Twitter posts (containing positive and negative sentiments but not related to suicide or depression) were collected from an available dataset. As Reddit has dedicated sections called subreddits for suicide- and depression-related posts, we scraped those posts from the subreddits. We also took a labeled dataset of positive and negative sentiment (which doesn't contain suicide-related vocabulary) from Kaggle. After collecting posts from Reddit and Twitter, we removed excess white space and changed the text to lower case. Data cleaning then proceeds as follows. First, all non-alphabetic characters are removed. Next, we removed the stopwords using the NLTK stopwords corpus. Finally, the dataset was created by stemming the words (carried out primarily for feature reduction), for which we used the Porter stemmer.

Algorithm 2 Pre-processing for Dataset II
procedure PREPROCESSII(Tweets)
    for x in Tweets do
        Remove non-alphabetic characters in x
        Do word tokenization
        Remove stopwords
        Do stemming
    end for
    return Tweets

C. Feature Extraction
1) DataSet I: Before training, the collected data is preprocessed. The prepared questionnaire is based on the PHQ-9 (Patient Health Questionnaire) and is an extended version of it. Questions asked for documentation purposes, such as e-mail address or school/college name, do not contribute to predicting depression stages and are removed. Scores are then allotted to all the options depending on the level to which they contribute to depression, using LabelEncoder, which transforms categorical labels into numerical labels. Then, on the basis of the total score of each data entry, depression stages are labeled to create the dataset, as shown in Table II.
We partitioned the dataset in an 80-20 split, where 80 % of the data is reserved for training and 20 % for testing. The training set is shown in Fig. 2 as a PCA¹ plot, in which a dataset characterized by more than 15 dimensions (features) is plotted as points in a plane. PCA discovers a new coordinate system in which each point has a new (x, y) value. The axes of this system don't actually mean anything physical. They are combinations of features called "principal components", chosen so that each axis captures as much variation as possible. Since there are 5 stages of depression, classification algorithms are applied to train and test the data.

¹ Principal Component Analysis (PCA) is used to reduce a large set of features to a small set that still contains most of the information in the large set; that is, it is a dimension-reduction tool.

TABLE II: Labels on the basis of scores for students dataset
Score    Depression severity    Label/Stage
0-5      Minimal or none        0
5-14     Mild                   1
14-27    Moderate               2
27-39    Moderately severe      3
>30      Severe                 4

Fig. 2: PCA plot for Training DataSet of Questionnaire

Algorithm 3 Feature Labeling
procedure PROCESS(data)
    Divide data into training-data and testing-data
    for x in training-data do
        sum ← 0
        n ← Number-of-features (columns)
        i ← 1
        while i ≤ n do
            sum ← sum + x_i
            i ← i + 1
        end while
        x_label ← Label-according-to-the-sum
    end for
    return training-data

For example, suppose the preprocessed data collected from two students, after encoding the options selected in the questionnaire as numbers, is as shown in Table III. According to the responses, each student's column in the processed data is labeled from 0 to 4 depending on the severity of depression. The total score for each student is calculated by the formula

$$ s_j = \sum_{i=1}^{n} a_i \tag{1} $$

where
$s_j$ = score of the jth student,
$n$ = total number of features,
$a_i$ = value of the ith feature for the jth student in the data.

For labeling, Table II is used. In the above example, the total score for student 1 is 12 and for student 2 it is 26.
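The scoring rule of Eq. (1) and the lookup in Table II can be illustrated with a short sketch. It is not the authors' code: the names `OPTION_VALUES`, `STAGES`, `total_score` and `stage_for` are hypothetical, the printed score ranges share their end points so the bins are read here as inclusive upper limits, and the last stage is assumed to start above 39. Only the Student 1 column of Table III and the total of 26 quoted in the text are used as inputs.

```python
# Hypothetical encoding of the four answer options, mirroring Table I.
OPTION_VALUES = {
    "Not at all": 0,
    "Several Days": 1,
    "More than half the days": 2,
    "Nearly everyday": 3,
}

# Score ranges of Table II, read as inclusive upper limits.
# Assumption: the printed ranges overlap at their end points, and the
# final stage is taken to begin above 39.
STAGES = [
    (5, 0, "Minimal or none"),
    (14, 1, "Mild"),
    (27, 2, "Moderate"),
    (39, 3, "Moderately severe"),
]
SEVERE = (4, "Severe")


def total_score(encoded_answers):
    """Eq. (1): sum the encoded feature values of one respondent."""
    return sum(encoded_answers)


def stage_for(score):
    """Map a total score to (label, severity) using the assumed bins."""
    for upper, label, name in STAGES:
        if score <= upper:
            return label, name
    return SEVERE


# Student 1 column of Table III (already encoded).
student_1 = [2, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 2, 0, 0, 1]
score_1 = total_score(student_1)
print(score_1, stage_for(score_1))  # 12 (1, 'Mild')
print(stage_for(26))                # (2, 'Moderate')
```

Keeping the thresholds in one table makes it easy to swap in a different reading of the score ranges without touching the scoring function.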
TABLE III: Labeling data
SNo  Features                                                                 Student 1  Student 2
1    What is your age?                                                        2          2
2    Sex? (Male/Female)                                                       1          0
3    Are you regular in school/college?                                       0          2
4    Feeling down or hopeless?                                                1          2
5    Insomnia, or sleeping too much?                                          1          3
6    Having little energy or feeling tired?                                   1          2
7    Overeating or poor appetite?                                             0          1
8    Feeling bad about yourself?                                              1          3
9    Trouble focusing on things?                                              1          2
10   Moving or speaking so slowly that other people could have noticed?       0          1
11   Thoughts that you would be better off dead?                              0          2
12   Intentionally overdosed on drugs?                                        0          0
13   Have suffered any form of physical or mental abuse?                      0          1
14   Little interest or pleasure in doing things?                             1          2
15   How much time do you spend on social media?                              2          2
16   Have you seen a student psychologist before?                             0          0
17   Where do you stay?                                                       0          1
18   Do you share everything with your parents?                               1          1

Therefore, we can label these as 1, i.e., mild depression, and 2, i.e., moderate depression, respectively.

Fig. 3: Logistic Regression Decision Boundary. (a) Decision boundary for data set I. (b) Decision boundary for data set II. The two colors yellow and blue correspond to the two classes risky and not-risky respectively.

2) DataSet II: We used TF-IDF (Term Frequency-Inverse Document Frequency), a weighted word-count feature extraction, to form our feature vectors. The test data is fed to the TF-IDF vectorizer to create the feature vectors for the test set, which our classifiers use for prediction. As for data set I, data set II is partitioned in an 80-20 split, where 80 % of the data is reserved for training and the remaining 20 % for testing. We ensured equal representation of sentiments from both classes in the training and testing sets by performing random shuffling before partitioning.
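The following standard-library sketch shows one way the cleaning, TF-IDF weighting and 80-20 partitioning described above fit together. Several points are assumptions rather than the paper's implementation: the stopword list is a tiny stand-in for the NLTK corpus, stemming is omitted, the weight uses the plain tf × ln(N/df) variant (library vectorizers typically add smoothing and normalization), and the split is explicitly stratified per class, which is a stricter technique than shuffling alone. The names `clean`, `tfidf` and `stratified_split` are hypothetical.

```python
import math
import random
import re
from collections import Counter

# Tiny illustrative stopword list; the paper uses the NLTK stopwords corpus.
STOPWORDS = {"i", "a", "the", "to", "is", "and", "so", "my", "of"}


def clean(post):
    """Lower-case, keep alphabetic characters only, drop stopwords (no stemming)."""
    words = re.sub(r"[^a-z\s]", " ", post.lower()).split()
    return [w for w in words if w not in STOPWORDS]


def tfidf(docs):
    """Return one {term: weight} dict per tokenised document, using tf * ln(N / df)."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Shuffle indices within each class, then move test_fraction of each class to the test set."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(len(idx) * test_fraction)
        test.extend(idx[:cut])
        train.extend(idx[cut:])
    return train, test


# Made-up posts for illustration only (1 = risky, 0 = not risky).
posts = [
    "I feel so hopeless today!!",
    "Feeling great, the sun is out :)",
    "No hope, I feel alone",
    "Great day to feel alive",
]
docs = [clean(p) for p in posts]
vectors = tfidf(docs)
train_idx, test_idx = stratified_split([0] * 5 + [1] * 5)
```

In this toy corpus the word "feel" occurs in three of the four posts and therefore receives a lower weight than "hopeless", which occurs in only one; that down-weighting of common words is the point of the inverse document frequency term.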
D. Machine Learning Supervised Algorithms
To determine suicidal ideation from the data, text classification and sentiment analysis algorithms are applied to both datasets. For categorization into five levels of depression severity, supervised classification algorithms, including Logistic Regression, Decision tree classifier and the XGBoost algorithm, are used on dataset I, and similar algorithms are used on dataset II to categorize posts into yes or no [11]. PCA plots have been produced for both datasets for all the applied algorithms. For dataset I, yellow represents people having minimal or no depression (the 1st stage), purple represents mild depression, light green spots on dark green represent moderate depression, dark green represents moderately severe depression, and light green represents severe depression. For dataset II, yellow refers to suicidal posts whereas purple refers to normal posts.

1) Logistic Regression: This is one of the most widely used classifiers in machine learning. Here the variable y that we want to predict is a discrete value, for example spam or not spam, or whether an online transaction is fraudulent or not [12]. It can be used for binary as well as multiclass classification problems. The decision boundaries for both datasets are shown in figure 3.

2) Random Forest Classifier: A random forest classifier is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset. It uses averaging to improve the predictive accuracy and to control over-fitting, which also makes it comparatively robust to outliers [13]. The PCA plot obtained for the classifier is shown in figure 4, in which the 5 stages are visible as 5 different colors for dataset I.

Fig. 4: Random Forest Decision Boundary. (a) Decision boundary for data set I. (b) Decision boundary for data set II.

3) XGBoost Classifier: The XGBoost classifier is used for supervised machine learning problems, where training data with multiple features $x_i$ is used to predict a target variable $y_i$ [14]. It is designed for speed and performance and is an implementation of gradient boosted decision trees. The decision boundaries for both datasets are shown in figure 5.

Fig. 5: XGBoost Decision Boundary. (a) Decision boundary for data set I. (b) Decision boundary for data set II.

4) Support Vector Machine: Support vector machines are supervised machine learning algorithms that can perform non-linear classification by mapping data to higher dimensions through the use of the kernel trick. The PCA plot obtained for the classifier is shown in figure 6 for both datasets.

Fig. 6: Support Vector Machine Decision Boundary. (a) Decision boundary for data set I. (b) Decision boundary for data set II.

Fig. 7: Confusion matrix describing the performance of Logistic Regression on dataset II
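To make the idea of a learned decision boundary concrete, here is a from-scratch sketch of binary logistic regression trained by batch gradient descent, followed by an accuracy computed from confusion-matrix counts as (TN + TP) / (FN + FP + TN + TP), the formula used in Section IV. This is an assumption-laden illustration, not the authors' setup: the paper cites library implementations [13], [14], whereas the functions `train_logistic`, `predict` and `accuracy`, the learning rate, the epoch count and the six two-dimensional toy points are all hypothetical. It covers only the two-class case of dataset II, not the five stages of dataset I.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_features = len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict(w, b, x):
    """Class 1 if x lies on the positive side of the boundary w.x + b = 0."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else 0


def accuracy(y_true, y_pred):
    """(TN + TP) / (FN + FP + TN + TP) from the confusion-matrix counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return (tn + tp) / (fn + fp + tn + tp)


# Hypothetical, linearly separable toy data (two features per point).
X = [[0.0, 0.2], [0.3, 0.1], [0.2, 0.4], [0.9, 0.8], [0.8, 1.0], [1.0, 0.7]]
y = [0, 0, 0, 1, 1, 1]

w, b = train_logistic(X, y)
predictions = [predict(w, b, x) for x in X]
print(accuracy(y, predictions))
```

The straight line w·x + b = 0 is what a two-dimensional decision-boundary plot of logistic regression shows; tree ensembles and kernel SVMs produce non-linear boundaries instead.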
IV. RESULTS
We tested the classifiers using our pre-processed test datasets. To assess the performance of the different classifiers, we computed the accuracy of each on both datasets. A confusion matrix is a table used to describe the performance of a machine learning classifier on a set of test data for which the true values are known, in terms of accuracy, precision, recall and F-measure. Hence, we calculated the accuracy from this matrix by the formula

$$ \text{Accuracy} = \frac{TN + TP}{FN + FP + TN + TP} \tag{2} $$

where
TP = true positive,
TN = true negative,
FP = false positive,
FN = false negative.
For example, figure 7 shows the confusion matrix obtained when applying Logistic Regression to data set II.

TABLE IV: Accuracy against each algorithm applied on the dataset
Classifier                        Accuracy for dataset I (%)    Accuracy for dataset II (%)
Random Forest Tree Classifier     76.34                         82.05
XGBoost Classifier                83.87                         84.02
Logistic Regression Classifier    59.22                         86.45
Support Vector Machine            76.50                         85.44

For dataset I, the XGBoost classifier gave the highest accuracy, 83.87 %, and Logistic Regression gave the lowest. For dataset II, by contrast, Logistic Regression gave the highest accuracy, 86.45 %. We attribute this to dataset I consisting of 18 features, which were more than the words on which the tweets were extracted and the model was trained. Logistic regression starts to falter when there is a large number of features and a good chunk of missing data. Too many categorical variables are also a problem for it.
Graphs help to communicate the survey results. A series of graphs is shown in figures 8 to 10.
Most of the respondents facing depression are in the 19-21 age group, according to the bar graph shown in Fig. 8.

Fig. 8: Number of students in different age groups

Around 33.7 % of respondents feel bad about themselves, that is, they feel that they are a failure or have let their family down, as shown in Fig. 9.

Fig. 9: Number of students feeling bad about themselves.

According to the graph formed from the dataset, 48.2 % of respondents felt hopeless and depressed on several days, as shown in Fig. 10.

Fig. 10: Number of students feeling hopeless

V. CONCLUSION AND FUTURE WORK
Nowadays depression is leading to catastrophic outcomes such as suicide. Studies have shown that life events preceding the onset of depression play a significant role in it. We analyzed social media posts (especially Twitter), conducted a questionnaire survey in which students and parents gave their opinions, and also scraped blogs on the Internet. The major factors of depression in the 15-29 age group that we found during the course of the project are parental pressure, love, failures, bullying, body shaming, inferiority complex, exam pressure, peer pressure, and physical and sexual abuse. Because depression is a recurrent illness, repeated episodes are common. Finally, little is known about the prevention and identification of the disorder at an early stage. As a future direction, we hope to understand how social media behavior analysis can lead to the development of methods for analyzing depression at scale.

REFERENCES
[1] S. Grover, A. Dutt, and A. Avasthi, "An overview of Indian research in depression," Indian Journal of Psychiatry, vol. 52, no. Suppl1, p. S178, 2010.
[2] M. Reddy, "Depression: the disorder and the burden," Indian Journal of Psychological Medicine, vol. 32, no. 1, p. 1, 2010.
[3] "Depression." [Online]. Available: https://www.who.int/news-room/fact-sheets/detail/depression
[4] M. M. Aldarwish and H. F. Ahmad, "Predicting depression levels using social media posts," in 2017 IEEE 13th International Symposium on Autonomous Decentralized System (ISADS). IEEE, 2017, pp. 277-280.
[5] P. Tzirakis, G. Trigeorgis, M. A. Nicolaou, B. W. Schuller, and S. Zafeiriou, "End-to-end multimodal emotion recognition using deep neural networks," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1301-1309, 2017.
[6] "Depression and suicide: Overview, etiology of depression and suicidality, epidemiology of depression and suicide," Feb 2019. [Online]. Available: https://emedicine.medscape.com/article/805459-overview
[7] L. Radloff, "Scale: A self-report depression scale for research in the general population," J Clin Exp Neuropsychol, vol. 19, pp. 340-356, 1997.
[8] L. P. Richardson, E. McCauley, D. C. Grossman, C. A. McCarty, J. Richards, J. E. Russo, C. Rockhill, and W. Katon, "Evaluation of the patient health questionnaire-9 item for detecting major depression among adolescents," Pediatrics, vol. 126, no. 6, pp. 1117-1123, 2010.
[9] O. Whooley, "Diagnostic ambivalence: psychiatric workarounds and the diagnostic and statistical manual of mental disorders," Sociology of Health & Illness, vol. 32, no. 3, pp. 452-469, 2010.
[10] M. R. Islam, M. A. Kabir, A. Ahmed, A. R. M. Kamal, H. Wang, and A. Ulhaq, "Depression detection from social network data using machine learning techniques," Health Information Science and Systems, vol. 6, no. 1, p. 8, 2018.
[11] A. Sabeeh and R. K. Dewang, "Comparison, classification and survey of aspect based sentiment analysis," in International Conference on Advanced Informatics for Computing Research. Springer, 2018, pp. 612-629.
[12] R. K. Dewang and A. K. Singh, "State-of-art approaches for review spammer detection: a survey," Journal of Intelligent Information Systems, vol. 50, no. 2, pp. 231-264, 2018.
[13] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, no. Oct, pp. 2825-2830, 2011.
[14] "A gentle introduction to xgboost for applied machine learning," Sep 2016. [Online]. Available: https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/