Stack Overflow slides Data Analytics
What StackOverflow Tells Us About Programming Languages
Rahul Thankachan, Nada Aldarrab, Prathmesh Gat
University of Southern California
1
2
Agenda
Agenda
1. Introduction
2. Problem Statement
3. Temporal Based Trend Analysis
4. Topic Analysis
5. Predicting Time To Answer
6. Summary and Q&A
Introduction
• Dataset used: Stack Overflow (Internet Archive).
• Why? It is currently one of the largest developer-focused open collaborative platforms.
• Through our study, we intend to answer some interesting questions.
• Study the rise and fall of popular programming languages.
• The findings can be used to predict future enhancements.
• Study the effectiveness of the Stack Overflow model.
Problem Definition
• Basic Analysis: What are the most popular
programming languages?
• What are the trends in programming languages?
• What are the most popular topics discussed in a
programming language?
• Can we accurately predict the time it takes until a
questioner gets an answer?
Related Work
Miltiadis Allamanis and Charles Sutton. 2013. Why, When and What: Analyzing Stack Overflow Questions by Topic, Type & Code. In 10th Working Conference on Mining Software Repositories, Mining Challenge. IEEE, pages 53-56.
• Topic modeling analysis
• Used Latent Dirichlet Allocation (LDA)
• Modeled Java Topics of Questions
• Can evaluate the orthogonality of different languages
• Stack Overflow questions are about the code and are not application domain specific
Related Work
V. Bhat, A. Gokhale, R. Jadhav, J. Pudipeddi, and L. Akoglu. Min(e)d your tags: Analysis of question response time in StackOverflow. In Proceedings of ASONAM 2014, pages 328–335. IEEE, 2014.
• Two linear classifiers: logistic regression and SVM with a linear kernel
• Two nonlinear classifiers: decision tree (DT) and SVM with a radial basis function kernel
Related Work
Prediction Accuracy:
Basic Analysis & Temporal Trends
Stack Overflow Activity
Approx. 45% of all programming languages in the world are discussed on Stack Overflow!
Post Type
Top Ten Languages
Question Count
Answer Count
Answer Fraction
Question Fraction
Questioners/Answerers Distribution
Topic Analysis
Predicting Time until Answer
Approach
The following attributes were selected for study:
1. Tag (only the top 10 programming languages)
2. Creation month
3. Body length
4. Tag length
5. Time_Answer, a newly introduced nominal class with values less6, bet6and20, and 20andmore
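The attribute list above can be sketched in Python as follows. This is a minimal illustration, not the authors' script: the function names are hypothetical, the class boundaries assume that 6 minutes falls into bet6and20 and 20 minutes into 20andmore, and "tag length" is assumed to mean the number of characters in the tag.

```python
from datetime import datetime


def time_answer_class(delta_minutes: float) -> str:
    """Map minutes until the first answer to one of the three nominal labels."""
    if delta_minutes < 6:
        return "less6"
    if delta_minutes < 20:
        return "bet6and20"
    return "20andmore"


def question_features(tag: str, created: datetime, body: str,
                      delta_minutes: float) -> dict:
    """Build one feature record for a question (hypothetical field names)."""
    return {
        "tag": tag,
        "creation_month": created.month,
        "body_length": len(body),
        "tag_length": len(tag),  # assumption: characters in the tag
        "Time_Answer": time_answer_class(delta_minutes),
    }
```

For example, a question tagged java, created in March and answered after 12 minutes, would get the label bet6and20.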
Approach
Tools
Weka: a collection of machine learning algorithms for data mining tasks.
Data preprocessing: challenging!
Approach (Data Preprocessing)
1. Parse all the answers and link the first answer’s creation time to the creation time of its question. We call the resulting field delta-answer.
2. Remove all questions whose delta-answer is negative or zero.
3. We developed a Python script that generates the .arff file on the fly (we wish to contribute this script).
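The preprocessing steps above might look roughly like the sketch below. It is not the authors' script: the function names, the row layout, and the attribute names in the ARFF header are assumptions made for illustration, and nominal values that contain spaces or commas would need quoting, which this sketch does not handle.

```python
from datetime import datetime


def delta_answer_minutes(question_created, answer_created_times):
    """Minutes from question creation to the earliest answer, or None if unanswered."""
    if not answer_created_times:
        return None
    first_answer = min(answer_created_times)
    return (first_answer - question_created).total_seconds() / 60.0


def has_valid_delta(delta):
    """Step 2: keep only questions with a strictly positive delta-answer."""
    return delta is not None and delta > 0


def to_arff(relation, rows):
    """Serialize feature rows (dicts) into ARFF text for Weka."""
    columns = ("tag", "creation_month", "body_length", "tag_length", "Time_Answer")
    tags = sorted({row["tag"] for row in rows})
    header = [
        f"@relation {relation}",
        "",
        "@attribute tag {" + ",".join(tags) + "}",
        "@attribute creation_month numeric",
        "@attribute body_length numeric",
        "@attribute tag_length numeric",
        "@attribute Time_Answer {less6,bet6and20,20andmore}",
        "",
        "@data",
    ]
    data = [",".join(str(row[col]) for col in columns) for row in rows]
    return "\n".join(header + data) + "\n"
```

The text returned by `to_arff` can be written to a file and opened in Weka.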
Evaluation
• Subset size: 4,490,947; further subset: 449,000.
• Response time is classified into 3 types: less than 6 minutes, between 6 and 20 minutes, and 20 minutes or more.
• 10-fold cross-validation.
• Results are obtained using different feature combinations and different classifiers.
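To make the 10-fold cross-validation step concrete, here is a small standard-library sketch. It is not Weka's implementation and not the J48 classifier: it only shows how the data is split into ten folds, and it scores a majority-class baseline (a stand-in chosen here for illustration) that any real classifier should beat. The function names are hypothetical.

```python
import random
from collections import Counter


def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def majority_baseline_accuracy(labels, k=10, seed=0):
    """Cross-validated accuracy of always predicting the training majority class."""
    folds = k_fold_indices(len(labels), k, seed)
    correct = 0
    for test_fold in folds:
        held_out = set(test_fold)
        # Train on every example outside the held-out fold.
        train_labels = [labels[i] for i in range(len(labels)) if i not in held_out]
        majority = Counter(train_labels).most_common(1)[0][0]
        # Score the held-out fold against the majority prediction.
        correct += sum(1 for i in test_fold if labels[i] == majority)
    return correct / len(labels)
```

Comparing a classifier's accuracy against this baseline helps show whether the selected attributes carry real signal or the result only reflects class imbalance.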
Evaluation
Results of classifier J48 (all attributes)
Evaluation
Results of classifier (body_length / tag_length)
Summary
• We found interesting temporal trends for major programming languages.
• Using tag-based topic analysis, we were able to find the major discussion topics and, to some extent, the difficult topics in a programming language.
• Using machine learning techniques, we were able to predict time to answer with good accuracy.
Future Scope of Work
• Contribute the on-the-fly .arff generator script.
• Add parts of speech as an attribute.
• Showcase the results on a website.
Questions?
