
Popularity Prediction on Twitter

EE239AS Project 3
By:
Aditya Rao (404434974)
Vikas Amar Tikoo (204435535)
Saurabh Trikande (604435562)
Behnam Shahbazi (704355606)

Problem 1:
Download the training tweet data and calculate these statistics for each hashtag:
average number of tweets per hour, average number of followers of users posting the
tweets, and average number of retweets. Plot "number of tweets in hour" over time for
#SuperBowl and #NFL (a histogram with 1-hour bins). The tweets are stored in separate
files for different hashtags and files are named as tweet_[#hashtag].txt. The tweet file
contains one tweet in each line and tweets are sorted with respect to their posting time.
Each tweet is a JSON string that you can load in Python as a dictionary.
Starting from the earliest timestamp in each tweet_[#hashtag].txt file to the last one, we
tracked the count of tweets, the followers of the tweeting users, and the retweets. From
these totals we computed the average number of tweets per hour, the average number of
followers of users posting the tweets, and the average number of retweets.
Hashtag        Avg no. of tweets/hour   Avg no. of followers   Avg no. of retweets
#gopatriots    23.0907                  1602.07                1.40014
#gohawks       114.298                  2393.6                 2.01463
#nfl           167.326                  4763.34                1.53854
#patriots      297.697                  3641.7                 1.78282
#sb49          733.102                  10230.1                2.51115
#superbowl     857.992                  9958.12                2.38827
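The statistics above can be computed with a short script. The sketch below assumes a simplified tweet schema: the keys 'firstpost_date' (Unix seconds), 'author'/'followers', and 'retweet_count' are stand-ins for the dataset's actual JSON fields and would need to be mapped onto the real schema.

```python
import json

def hashtag_stats(lines):
    """Compute (avg tweets/hour, avg followers, avg retweets) from a
    tweet_[#hashtag].txt-style file: one JSON tweet per line, sorted by
    posting time.  Field names used here are assumptions, not the
    dataset's real schema."""
    n = 0
    followers = retweets = 0
    first = last = None
    for line in lines:
        tweet = json.loads(line)
        t = tweet['firstpost_date']        # posting time in Unix seconds (assumed key)
        first = t if first is None else first
        last = t                           # lines are sorted, so this ends up at the max
        n += 1
        followers += tweet['author']['followers']
        retweets += tweet['retweet_count']
    hours = max((last - first) / 3600.0, 1.0)  # span of the data in hours
    return n / hours, followers / n, retweets / n
```

For the real files, each line of tweet_[#hashtag].txt would be passed to this function after adapting the key names.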

Problem 2:
Fit a Linear Regression model using 5 features to predict number of tweets in the next
hour, with features extracted from tweet data in the previous hour. The features you
should use are: number of tweets, total number of retweets, sum of the number of
followers posting the hashtag, maximum number of followers in users posting the
hashtag, and time of the day (which could take 24 values that represent hours of the day
with respect to a given time reference). Explain your model's training accuracy and the
significance of each feature using the t-test and P-value results of fitting the model.
For this problem, the independent features are: number of tweets, total number of
retweets, sum of the followers, maximum number of followers, and hour of the day.
The dependent variable to be predicted, i.e., the predictand, is the number of tweets in
the next hour.
We used the statsmodels package as suggested by the professor. A linear regression
model using Ordinary Least Squares (OLS) was fit on this set of features and used to
predict the number of tweets in the next hour.
R-square value: The goodness of fit of the model is given by its R-square value,
R^2 = 1 - SS_res / SS_tot, i.e., the fraction of the variance in the predictand that the
model explains. The higher the R-square value, the better the regression model fits the
data.
P-value: This measures the significance of each feature in the prediction. The null
hypothesis for each coefficient is that it equals zero, i.e., the feature has no effect. A
small p-value (typically <= 0.05) indicates strong evidence against the null hypothesis,
so the feature is considered significant. A large p-value (> 0.05) indicates weak
evidence against the null hypothesis, so the feature is likely not significant.
T-value: The t-value is a coefficient divided by its standard error; it measures how
many standard errors the estimated coefficient lies from zero. Large |t| values
correspond to small p-values and vice versa.
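The quantities described above can be reproduced with plain NumPy. The sketch below mirrors what an OLS fit reports (coefficients, R-square, t-values); statsmodels additionally converts each t-value into a p-value via the Student-t distribution.

```python
import numpy as np

def ols_fit(X, y):
    """Minimal OLS sketch: returns (coefficients, R-squared, t-values).
    The first coefficient is the intercept."""
    X = np.column_stack([np.ones(len(X)), X])   # prepend intercept column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1.0 - ss_res / ss_tot                  # R^2 = 1 - SS_res / SS_tot
    dof = len(y) - X.shape[1]
    sigma2 = ss_res / dof                       # residual variance estimate
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    tvals = beta / se                           # t = coef / std err
    return beta, r2, tvals
```

On synthetic data with a known slope, the recovered coefficient, R-square, and t-value behave as the report describes: a strong feature gets a large |t|.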
OLS Regression Results for #gopatriots

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.608
Model:                            OLS   Adj. R-squared:                  0.605
Method:                 Least Squares   F-statistic:                     210.2
Date:                Fri, 20 Mar 2015   Prob (F-statistic):          4.03e-135
Time:                        16:25:15   Log-Likelihood:                -4533.8
No. Observations:                 683   AIC:                             9080.
Df Residuals:                     677   BIC:                             9107.
Df Model:                           5
==============================================================================
                 coef    std err          t      P>|t|    [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         11.4988     13.878      0.829      0.408     -15.750    38.748
x1             1.4818      0.124     11.945      0.000       1.238     1.725
x2         -7.101e-06      0.000     -0.067      0.947      -0.000     0.000
x3           -32.4000      1.844    -17.572      0.000     -36.020   -28.780
x4          5.407e-06      0.000      0.041      0.967      -0.000     0.000
x5             0.3197      1.030      0.310      0.756      -1.703     2.342
==============================================================================
Omnibus:                     1138.167   Durbin-Watson:                   2.391
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           824584.907
Skew:                          10.007   Prob(JB):                         0.00
Kurtosis:                     172.040   Cond. No.                     9.18e+05
==============================================================================

OLS Regression Results for #gohawks

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.609
Model:                            OLS   Adj. R-squared:                  0.607
Method:                 Least Squares   F-statistic:                     301.0
Date:                Fri, 20 Mar 2015   Prob (F-statistic):          3.63e-194
Time:                        16:25:46   Log-Likelihood:                -7630.2
No. Observations:                 972   AIC:                         1.527e+04
Df Residuals:                     966   BIC:                         1.530e+04
Df Model:                           5
==============================================================================
                 coef    std err          t      P>|t|    [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         66.8464     38.917      1.718      0.086      -9.525   143.217
x1             0.7296      0.090      8.067      0.000       0.552     0.907
x2          5.454e-05   5.01e-05      1.089      0.277   -4.38e-05     0.000
x3             0.0046      0.030      0.151      0.880      -0.055     0.064
x4            -0.0003      0.000     -3.029      0.003      -0.001    -0.000
x5            -0.1487      2.906     -0.051      0.959      -5.851     5.553
==============================================================================
Omnibus:                      935.689   Durbin-Watson:                   2.235
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          2260025.080
Skew:                           3.089   Prob(JB):                         0.00
Kurtosis:                     239.146   Cond. No.                     4.31e+06
==============================================================================

OLS Regression Results for #nfl

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.765
Model:                            OLS   Adj. R-squared:                  0.763
Method:                 Least Squares   F-statistic:                     598.1
Date:                Fri, 20 Mar 2015   Prob (F-statistic):          4.06e-286
Time:                        16:26:35   Log-Likelihood:                -6734.6
No. Observations:                 926   AIC:                         1.348e+04
Df Residuals:                     920   BIC:                         1.351e+04
Df Model:                           5
==============================================================================
                 coef    std err          t      P>|t|    [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         66.0260     22.685      2.911      0.004      21.506   110.546
x1             0.6551      0.062     10.651      0.000       0.534     0.776
x2          8.851e-05    1.4e-05      6.301      0.000    6.09e-05     0.000
x3            -2.1411      0.142    -15.090      0.000      -2.420    -1.863
x4         -8.577e-05   1.98e-05     -4.328      0.000      -0.000 -4.69e-05
x5            -2.0316      1.664     -1.221      0.223      -5.298     1.235
==============================================================================
Omnibus:                     1153.074   Durbin-Watson:                   2.151
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           292772.213
Skew:                           6.056   Prob(JB):                         0.00
Kurtosis:                      89.263   Cond. No.                     7.66e+06
==============================================================================

OLS Regression Results for #patriots

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.714
Model:                            OLS   Adj. R-squared:                  0.712
Method:                 Least Squares   F-statistic:                     485.6
Date:                Fri, 20 Mar 2015   Prob (F-statistic):          1.42e-261
Time:                        16:27:59   Log-Likelihood:                -8754.5
No. Observations:                 980   AIC:                         1.752e+04
Df Residuals:                     974   BIC:                         1.755e+04
Df Model:                           5
==============================================================================
                 coef    std err          t      P>|t|    [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         90.6135    115.171      0.787      0.432    -135.397   316.624
x1             1.0156      0.029     35.254      0.000       0.959     1.072
x2            -0.0001   1.38e-05     -7.940      0.000      -0.000 -8.24e-05
x3            -0.4754      0.196     -2.429      0.015      -0.860    -0.091
x4             0.0004   6.73e-05      6.329      0.000       0.000     0.001
x5            -3.4832      8.492     -0.410      0.682     -20.149    13.182
==============================================================================
Omnibus:                     1765.278   Durbin-Watson:                   1.904
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          1699060.510
Skew:                          12.227   Prob(JB):                         0.00
Kurtosis:                     205.513   Cond. No.                     1.76e+07
==============================================================================

OLS Regression Results for #sb49

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.841
Model:                            OLS   Adj. R-squared:                  0.840
Method:                 Least Squares   F-statistic:                     610.9
Date:                Fri, 20 Mar 2015   Prob (F-statistic):          1.52e-227
Time:                        16:30:23   Log-Likelihood:                -5653.1
No. Observations:                 582   AIC:                         1.132e+04
Df Residuals:                     576   BIC:                         1.134e+04
Df Model:                           5
==============================================================================
                 coef    std err          t      P>|t|    [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         96.3861    327.335      0.294      0.769    -546.530   739.302
x1             0.9662      0.029     32.873      0.000       0.908     1.024
x2         -1.182e-05   3.69e-06     -3.203      0.001   -1.91e-05 -4.57e-06
x3            -0.4478      0.120     -3.739      0.000      -0.683    -0.213
x4             0.0003    4.4e-05      5.805      0.000       0.000     0.000
x5           -24.7930     24.256     -1.022      0.307     -72.434    22.848
==============================================================================
Omnibus:                      971.949   Durbin-Watson:                   1.416
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           756009.544
Skew:                           9.785   Prob(JB):                         0.00
Kurtosis:                     178.478   Cond. No.                     1.70e+08
==============================================================================

OLS Regression Results for #superbowl

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.835
Model:                            OLS   Adj. R-squared:                  0.834
Method:                 Least Squares   F-statistic:                     965.2
Date:                Fri, 20 Mar 2015   Prob (F-statistic):               0.00
Time:                        16:44:26   Log-Likelihood:                -9685.4
No. Observations:                 962   AIC:                         1.938e+04
Df Residuals:                     956   BIC:                         1.941e+04
Df Model:                           5
==============================================================================
                 coef    std err          t      P>|t|    [95.0% Conf. Int.]
------------------------------------------------------------------------------
const       -198.4680    365.687     -0.543      0.587    -916.109   519.173
x1             1.0006      0.148      6.739      0.000       0.709     1.292
x2          4.365e-05   2.12e-05      2.058      0.040    2.03e-06  8.53e-05
x3            -5.3961      0.187    -28.851      0.000      -5.763    -5.029
x4             0.0003   9.05e-05      3.699      0.000       0.000     0.001
x5            10.1494     26.746      0.379      0.704     -42.339    62.638
==============================================================================
Omnibus:                     1403.228   Durbin-Watson:                   1.684
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           657649.721
Skew:                           8.025   Prob(JB):                         0.00
Kurtosis:                     130.081   Cond. No.                          nan
==============================================================================
R-square values and per-feature p-values (as reported by the OLS fits above):

Hashtag       R-square   p(x1)    p(x2)    p(x3)    p(x4)    p(x5)
#gopatriots   0.608      0.000    0.947    0.000    0.967    0.756
#gohawks      0.609      0.000    0.277    0.880    0.003    0.959
#nfl          0.765      0.000    0.000    0.000    0.000    0.223
#patriots     0.714      0.000    0.000    0.015    0.000    0.682
#sb49         0.841      0.000    0.001    0.000    0.000    0.307
#superbowl    0.835      0.000    0.040    0.000    0.000    0.704

t-values for each feature:

Hashtag       t(x1)     t(x2)     t(x3)      t(x4)     t(x5)
#gopatriots   11.945    -0.067    -17.572     0.041     0.310
#gohawks       8.067     1.089      0.151    -3.029    -0.051
#nfl          10.651     6.301    -15.090    -4.328    -1.221
#patriots     35.254    -7.940     -2.429     6.329    -0.410
#sb49         32.873    -3.203     -3.739     5.805    -1.022
#superbowl     6.739     2.058    -28.851     3.699     0.379

Problem 3:
Design a regression model using any features from the paper or other new features you
may find useful for this problem. Fit your model on the data and report fitting accuracy
and significance of variables. For the top 3 features in your measurements, draw a
scatter plot of predictant (number of tweets for next hour) versus feature value, using all
the samples you have extracted.
We used a combination of the following features:
1. Cumulative favourites_count
2. Cumulative friends_count
3. Sum of the no. of followers of the original_author
4. Cumulative followers_count
5. Cumulative url_count
6. retweet_count
7. Cumulative no. of references (@) in each tweet
8. Tweet user followers count
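A sketch of the per-hour aggregation behind this feature set is below. The dictionary keys used here are assumptions standing in for the dataset's actual JSON fields, and the per-hour tweet counts serve as the predictant for the following hour.

```python
from collections import defaultdict

def hourly_features(tweets):
    """Aggregate the 8 features per 1-hour bin.  Each tweet is a flat dict;
    the key names are illustrative, not the dataset's real schema."""
    feats = defaultdict(lambda: [0] * 8)
    counts = defaultdict(int)
    for tw in tweets:
        h = tw['timestamp'] // 3600               # 1-hour bin index
        f = feats[h]
        f[0] += tw['favourites_count']            # cumulative favourites
        f[1] += tw['friends_count']               # cumulative friends
        f[2] += tw['original_author_followers']   # followers of original author
        f[3] += tw['followers_count']             # cumulative followers
        f[4] += tw['url_count']                   # cumulative URLs
        f[5] += tw['retweet_count']               # retweet count
        f[6] += tw['mention_count']               # '@' references per tweet
        f[7] += tw['user_followers']              # tweeting user's followers
        counts[h] += 1
    return feats, counts                          # counts[h+1] is the predictant
```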
Results for each hashtag:
#superbowl:

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.944
Model:                            OLS   Adj. R-squared:                  0.943
Method:                 Least Squares   F-statistic:                     1991.
Date:                Fri, 20 Mar 2015   Prob (F-statistic):               0.00
Time:                        19:01:05   Log-Likelihood:                -9168.4
No. Observations:                 962   AIC:                         1.835e+04
Df Residuals:                     953   BIC:                         1.840e+04
Df Model:                           8
==============================================================================
                 coef    std err          t      P>|t|    [95.0% Conf. Int.]
------------------------------------------------------------------------------
const       -178.9265    120.510     -1.485      0.138    -415.422    57.569
x1            -2.4110      0.393     -6.128      0.000      -3.183    -1.639
x2            -0.0039      0.000    -14.151      0.000      -0.004    -0.003
x3             0.0030      0.000     13.116      0.000       0.003     0.004
x4         -4.269e-05   1.33e-05     -3.216      0.001   -6.87e-05 -1.66e-05
x5             2.1702      0.300      7.237      0.000       1.582     2.759
x6             8.5863      0.240     35.752      0.000       8.115     9.058
x7            -0.0001   5.02e-05     -2.484      0.013      -0.000 -2.62e-05
x8            -2.2761      0.141    -16.198      0.000      -2.552    -2.000
==============================================================================
Omnibus:                     1173.053   Durbin-Watson:                   1.971
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           556969.469
Skew:                           5.575   Prob(JB):                         0.00
Kurtosis:                     120.350   Cond. No.                     1.08e+08
==============================================================================
#sb49:

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.907
Model:                            OLS   Adj. R-squared:                  0.906
Method:                 Least Squares   F-statistic:                     698.2
Date:                Fri, 20 Mar 2015   Prob (F-statistic):          1.00e-289
Time:                        18:56:48   Log-Likelihood:                -5497.8
No. Observations:                 582   AIC:                         1.101e+04
Df Residuals:                     573   BIC:                         1.105e+04
Df Model:                           8
==============================================================================
                 coef    std err          t      P>|t|    [95.0% Conf. Int.]
------------------------------------------------------------------------------
const       -437.4056    144.302     -3.031      0.003    -720.830  -153.981
x1            -6.3339      0.399    -15.857      0.000      -7.118    -5.549
x2             0.0042      0.000      8.476      0.000       0.003     0.005
x3             0.0001    8.2e-05      1.776      0.076   -1.54e-05     0.000
x4             0.0001   1.61e-05      6.524      0.000    7.33e-05     0.000
x5             2.5526      0.307      8.320      0.000       1.950     3.155
x6             0.0579      0.615      0.094      0.925      -1.150     1.266
x7         -6.148e-05   4.18e-05     -1.471      0.142      -0.000  2.06e-05
x8            -0.2170      0.095     -2.288      0.023      -0.403    -0.031
==============================================================================
Omnibus:                      989.474   Durbin-Watson:                   1.687
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           952094.822
Skew:                          10.064   Prob(JB):                         0.00
Kurtosis:                     200.121   Cond. No.                     1.00e+08
==============================================================================

#patriots:

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.762
Model:                            OLS   Adj. R-squared:                  0.760
Method:                 Least Squares   F-statistic:                     388.3
Date:                Fri, 20 Mar 2015   Prob (F-statistic):          2.42e-296
Time:                        18:31:00   Log-Likelihood:                -8664.3
No. Observations:                 980   AIC:                         1.735e+04
Df Residuals:                     971   BIC:                         1.739e+04
Df Model:                           8
==============================================================================
                 coef    std err          t      P>|t|    [95.0% Conf. Int.]
------------------------------------------------------------------------------
const        -41.0854     61.195     -0.671      0.502    -161.175    79.004
x1            -2.3666      0.319     -7.410      0.000      -2.993    -1.740
x2            -0.0003      0.000     -0.938      0.348      -0.001     0.000
x3             0.0006    8.4e-05      7.565      0.000       0.000     0.001
x4             0.0005   5.29e-05      9.111      0.000       0.000     0.001
x5            -1.3322      0.331     -4.023      0.000      -1.982    -0.682
x6             3.9433      0.544      7.250      0.000       2.876     5.011
x7            -0.0006   9.92e-05     -5.695      0.000      -0.001    -0.000
x8            -0.4433      0.179     -2.471      0.014      -0.795    -0.091
==============================================================================
Omnibus:                     1529.777   Durbin-Watson:                   1.878
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           948872.977
Skew:                           9.126   Prob(JB):                         0.00
Kurtosis:                     154.343   Cond. No.                     1.24e+07
==============================================================================

#nfl:

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.797
Model:                            OLS   Adj. R-squared:                  0.795
Method:                 Least Squares   F-statistic:                     449.5
Date:                Fri, 20 Mar 2015   Prob (F-statistic):          3.79e-311
Time:                        18:31:52   Log-Likelihood:                -6666.7
No. Observations:                 926   AIC:                         1.335e+04
Df Residuals:                     917   BIC:                         1.339e+04
Df Model:                           8
==============================================================================
                 coef    std err          t      P>|t|    [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         49.7828     13.374      3.722      0.000      23.536    76.030
x1            -0.6894      0.175     -3.935      0.000      -1.033    -0.346
x2          -9.57e-05      0.000     -0.945      0.345      -0.000     0.000
x3             0.0006      0.000      4.590      0.000       0.000     0.001
x4          3.722e-05   1.16e-05      3.198      0.001    1.44e-05  6.01e-05
x5             3.0882      0.381      8.116      0.000       2.341     3.835
x6             0.5683      0.136      4.183      0.000       0.302     0.835
x7         -4.286e-05   1.71e-05     -2.502      0.013   -7.65e-05 -9.25e-06
x8            -1.5556      0.154    -10.090      0.000      -1.858    -1.253
==============================================================================
Omnibus:                      722.402   Durbin-Watson:                   2.117
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           159610.118
Skew:                           2.580   Prob(JB):                         0.00
Kurtosis:                      67.110   Cond. No.                     4.83e+06
==============================================================================
#gohawks:

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.762
Model:                            OLS   Adj. R-squared:                  0.760
Method:                 Least Squares   F-statistic:                     385.9
Date:                Fri, 20 Mar 2015   Prob (F-statistic):          3.37e-294
Time:                        18:32:29   Log-Likelihood:                -7388.6
No. Observations:                 972   AIC:                         1.480e+04
Df Residuals:                     963   BIC:                         1.484e+04
Df Model:                           8
==============================================================================
                 coef    std err          t      P>|t|    [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         22.5795     17.360      1.301      0.194     -11.489    56.648
x1            -1.7203      0.204     -8.447      0.000      -2.120    -1.321
x2             0.0006      0.000      3.148      0.002       0.000     0.001
x3             0.0010   8.81e-05     11.038      0.000       0.001     0.001
x4         -7.143e-05   3.13e-05     -2.279      0.023      -0.000 -9.94e-06
x5            -1.8436      0.237     -7.780      0.000      -2.309    -1.379
x6             2.8752      0.239     12.031      0.000       2.406     3.344
x7         -2.978e-05   6.07e-05     -0.490      0.624      -0.000  8.94e-05
x8             0.0152      0.024      0.631      0.528      -0.032     0.062
==============================================================================
Omnibus:                     1745.159   Durbin-Watson:                   2.092
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          4133794.940
Skew:                          11.647   Prob(JB):                         0.00
Kurtosis:                     321.632   Cond. No.                     3.69e+06
==============================================================================

#gopatriots:

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.745
Model:                            OLS   Adj. R-squared:                  0.742
Method:                 Least Squares   F-statistic:                     246.2
Date:                Fri, 20 Mar 2015   Prob (F-statistic):          2.62e-194
Time:                        18:32:34   Log-Likelihood:                -4387.1
No. Observations:                 683   AIC:                             8792.
Df Residuals:                     674   BIC:                             8833.
Df Model:                           8
==============================================================================
                 coef    std err          t      P>|t|    [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         -6.5901      6.058     -1.088      0.277     -18.484     5.304
x1             4.3425      0.315     13.784      0.000       3.724     4.961
x2             0.0010      0.000      2.541      0.011       0.000     0.002
x3            -0.0020      0.000    -11.348      0.000      -0.002    -0.002
x4            -0.0012      0.000     -7.701      0.000      -0.001    -0.001
x5             4.9909      0.598      8.339      0.000       3.816     6.166
x6            -1.1720      0.542     -2.164      0.031      -2.235    -0.109
x7             0.0011      0.000      7.013      0.000       0.001     0.001
x8           -11.7934      2.000     -5.897      0.000     -15.720    -7.867
==============================================================================
Omnibus:                      910.949   Durbin-Watson:                   2.242
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           564206.178
Skew:                           6.312   Prob(JB):                         0.00
Kurtosis:                     143.237   Cond. No.                     7.50e+05
==============================================================================

Problem 4:
Split the feature data (your set of (features, predictant) pairs for windows) into 10 parts
to perform cross-validation. Run 10 tests, each time fitting your model on 9 parts and
predicting the number of tweets for the 1 remaining part. Calculate the average
prediction error |N_predicted - N_real| over samples in the remaining part, and then
average these values over the 10 tests. Since we know the Super Bowl's date and time,
we can create different regression models for different periods of time: first, when the
hashtags haven't become very active; second, their active period; and third, after they
pass their high-activity time. Train 3 regression models for these time periods (the
times are all in PST):
1. Before Feb. 1, 8:00 a.m.
2. Between Feb. 1, 8:00 a.m. and 8:00 p.m.
3. After Feb. 1, 8:00 p.m.
Report cross-validation errors for the 3 different models. Note that you should do the
90-10% splitting for each model within its specific time window, i.e., only use data
within one of the 3 periods for training and testing each time, so for each period you
will run 10 tests.
Feature Set                          Average |N_predicted - N_real|
Entire set                           2465.29948811
Before Feb 1, 8 a.m.                 366.07366939
Between Feb 1, 8 a.m. and 8 p.m.     74546.9344882
After Feb 1, 8 p.m.                  903.228146426
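The 10-fold procedure can be sketched as follows, assuming a feature matrix X and a predictant vector y already extracted within one time window:

```python
import numpy as np

def cv_abs_error(X, y, k=10, seed=0):
    """10-fold cross-validation sketch: fit OLS on 9 folds, predict the
    held-out fold, and return the average |N_predicted - N_real| over
    the k tests."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        Xtr = np.column_stack([np.ones(len(train)), X[train]])  # add intercept
        Xte = np.column_stack([np.ones(len(test)), X[test]])
        beta, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)
        errs.append(np.abs(Xte @ beta - y[test]).mean())        # |N_pred - N_real|
    return float(np.mean(errs))
```

Running this once per time window (with only that window's samples) yields the three per-period errors reported above.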

Problem 5:
Download the test data and run your model to make predictions for the next hour in
each case. Each file in the test data contains a hashtag's tweets for a 6-hour window.
The file name shows the sample number followed by the period number the data is
from, e.g., a file named sample5_period2.txt contains tweets for a 6-hour window that
lies in the 2nd time period described in part 4. Report your predicted number of tweets
for the next hour of each sample window.
We ran the model over the given test data, and the prediction results are shown in the
table below.

Predicted number of tweets for the hour following each hour of the 6-hour window
(values rounded to two decimal places):

Test File           Hour 1      Hour 2      Hour 3       Hour 4       Hour 5       Hour 6
sample1_period1     164.24      132.35      43.57        110.02       135.89       182.14
sample2_period2     61360.96    65750.50    72678.46     93980.16     173371.26    201042.83
sample3_period3     450.08      381.69      507.22       712.09       581.91       433.69
sample4_period1     386.48      292.74      105.81       108.48       139.06       142.94
sample5_period1     296.00      187.16      198.00       174.99       188.42       150.95
sample6_period2     39699.27    71172.19    161267.20    153584.19    132269.82    134482.56
sample7_period3     147.80      100.52      804.37       786.48       771.12       796.86
sample8_period1     889.95      890.09      1102.94      1076.41      1010.37      883.50
sample9_period2     51834.52    57096.87    62192.67     53997.62     72576.16     81059.09
sample10_period3    588.95      568.43      553.04       535.32       516.57       499.70