Stock Price Trend Forecasting using Supervised Learning Methods
Sharvil Katariya¹  Saurabh Jain²

*This work was supported by the International Institute of Information Technology.
¹Sharvil Katariya is a student in Computer Science at IIIT Hyderabad, India.
²Saurabh Jain is a student in Computer Science at IIIT Hyderabad, India.
Abstract— The aim of the project is to examine a number of different forecasting techniques to predict future stock returns, based on past returns and numerical news indicators, and to construct a portfolio of multiple stocks in order to diversify the risk. We do this by applying supervised learning methods to stock price forecasting, interpreting the seemingly chaotic market data.
I. INTRODUCTION

The fluctuation of the stock market is violent, and there are many complicated financial indicators. However, advances in technology provide an opportunity to gain a steady fortune from the stock market, and can also help experts find the most informative indicators to make better predictions. The prediction of market value is of paramount importance for maximizing the profit of stock option purchases while keeping the risk low.

The next section of the paper covers the methodology, where we explain each process in detail. After that, we give pictorial representations of the analysis we have made and reason about the results achieved. Finally, we define the scope of the project and discuss how the work could be extended to achieve better results.
II. METHODOLOGY
This section gives a detailed description of each process involved in the project. Each subsection is mapped to one of the stages of the project.
A. Data Pre-Processing
The pre-processing stage involves:
- Data discretization: part of data reduction, of particular importance for numerical data.
- Data transformation: normalization.
- Data cleaning: filling in missing values.
- Data integration: integration of data files.

After the data-set is transformed into a clean data-set, it is divided into training and testing sets so that the models can be evaluated. Here, the training values are taken as the more recent values, and the testing data is kept at 5-10 percent of the total data-set.
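The listing below is a minimal sketch of this stage, not the original pipeline: the column names ("Date", "Close"), the forward-fill cleaning rule, the min-max normalization, and the 10 percent test fraction are all illustrative assumptions.

import pandas as pd

def preprocess_and_split(df: pd.DataFrame, test_frac: float = 0.10):
    df = df.sort_values("Date").reset_index(drop=True)
    df = df.ffill()                               # data cleaning: fill missing values
    close = df["Close"]                           # data transformation: min-max normalization
    df["Close"] = (close - close.min()) / (close.max() - close.min())
    n_test = int(len(df) * test_frac)
    # Chronological split, following Sec. II-A: the more recent rows form
    # the training set; the oldest 5-10 percent are held out for testing.
    return df.iloc[n_test:], df.iloc[:n_test]     # (train, test)

# Synthetic stand-in data, purely to make the sketch runnable.
prices = pd.DataFrame({"Date": pd.date_range("2016-01-01", periods=200),
                       "Close": pd.Series(range(200), dtype=float)})
train_df, test_df = preprocess_and_split(prices)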
B. Feature Selection and Feature Generation

We created new features from the base features which provide better insight into the data, such as the 50-day moving average and the previous-day difference. To prune out less useful features, in feature selection we select the features with the k highest scores, with the help of a linear model that tests the effect of a single regressor, applied sequentially to many regressors. We used the SelectKBest algorithm, with f_regression as the scorer for evaluation; a minimal sketch of both steps is given below.
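In this sketch, the "Close" column, the next-day target, and the choice of k are assumptions made for illustration; only SelectKBest and f_regression are named in the text above.

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

df = pd.DataFrame({"Close": np.random.default_rng(0).normal(100, 5, 300)})
# Feature generation: 50-day moving average and previous-day difference.
df["ma_50"] = df["Close"].rolling(window=50).mean()
df["prev_day_diff"] = df["Close"].diff()
df["target"] = df["Close"].shift(-1)   # next-day close as target (assumption)
df = df.dropna()

# Feature selection: keep the k highest-scoring features, where
# f_regression tests the effect of each single regressor in turn.
X, y = df.drop(columns=["target"]), df["target"]
selector = SelectKBest(score_func=f_regression, k=min(10, X.shape[1]))
X_selected = selector.fit_transform(X, y)
print(list(X.columns[selector.get_support()]))   # the surviving features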
Furthermore, we added Twitter's daily sentiment score as a feature for each company, based on users' tweets about that particular company as well as the tweets on that company's page.
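The paper does not specify the sentiment tool, so the snippet below is only one possible scheme: it averages TextBlob polarity scores over a day's tweets, yielding a single per-company, per-day feature. The example tweets are hypothetical.

from textblob import TextBlob   # choice of sentiment library is an assumption

def daily_sentiment_score(tweets):
    """Average polarity (in [-1, 1]) of one day's tweets about a company."""
    if not tweets:
        return 0.0
    return sum(TextBlob(t).sentiment.polarity for t in tweets) / len(tweets)

print(daily_sentiment_score(["Great earnings call!", "Stock looks weak today."]))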
III. ANALYSIS

To analyze the efficiency of the system, we used the Root Mean Square Error (RMSE) and the r² score.

A. Root Mean Squared Error (RMSE)

RMSE is the square root of the mean of the squared errors. Its use is very common, and it makes an excellent general-purpose error metric for numerical predictions. Compared to the similar Mean Absolute Error, RMSE amplifies and severely punishes large errors.
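For reference, the standard definition, where the y_i are the true values, the ŷ_i the predictions, and n the number of test points:

$$ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} $$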
Fig. 1. RMSE Value calculation

Fig. 2. RMSE Value calculation

B. R-Squared Value (r² value)

The value of R² can range between 0 and 1; the higher its value, the more accurate the regression model, as more of the variability is explained by the linear regression model. The R² value indicates the proportion of the variation in the response variable that is explained by the independent variables. R-squared is thus a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination or, for multiple regression, the coefficient of multiple determination.
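In the standard formulation, with ȳ the mean of the observed values:

$$ R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} $$

Note that while R² lies between 0 and 1 for an in-sample linear fit, it can become negative on held-out data whenever a model predicts worse than simply using the mean; this is why the KNeighbours Regressor in Table I shows an R² of -117.01.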
TABLE I
CLASSIFIER EVALUATION

Algorithm                      RMSE Value       R-squared Value
Random Forest Regressor        1.4325434e-07    0.956669
Bagging Regressor              1.329966e-07     0.959771
Adaboost Regressor             2.9882972e-07    0.909611
KNeighbours Regressor          0.00039015       -117.01176
Gradient Boosting Regressor    1.274547e-07     0.961448
IV. GRAPHS

Figures 3 and 4 compare the models on the two evaluation metrics.

Fig. 3. Comparison of RMSE values across the different models

Fig. 4. Comparison of R-squared values across the different models

V. RESULTS

Based on the results obtained, we find that the Gradient Boosting Regressor consistently performs best, followed by the Bagging Regressor, the Random Forest Regressor, the Adaboost Regressor, and the KNeighbours Regressor. The Bagging Regressor performs well because bagging (bootstrap sampling) relies on the fact that a combination of many independent base learners significantly decreases the error; we therefore want to produce as many independent base learners as possible, each generated by sampling the original data set with replacement. From the results, it is also safe to say that additional hidden layer(s) improve upon the scores of the models. Random Forest is an extension of bagging in which the major difference is the incorporation of randomized feature selection. A minimal sketch of how such a comparison can be run is given below.
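The sketch uses scikit-learn with default hyper-parameters, which are not necessarily the paper's settings, and synthetic stand-in data in place of the features from Sections II-A and II-B.

import numpy as np
from sklearn.ensemble import (AdaBoostRegressor, BaggingRegressor,
                              GradientBoostingRegressor, RandomForestRegressor)
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)                             # synthetic stand-in
X = rng.normal(size=(500, 8))                              # for the real features
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=500)
X_train, X_test, y_train, y_test = X[:450], X[450:], y[:450], y[450:]

models = {
    "Random Forest": RandomForestRegressor(),
    "Bagging": BaggingRegressor(),
    "AdaBoost": AdaBoostRegressor(),
    "KNeighbours": KNeighborsRegressor(),
    "Gradient Boosting": GradientBoostingRegressor(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
    print(f"{name}: RMSE={rmse:.3e}, R^2={r2_score(y_test, pred):.4f}")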
ACKNOWLEDGMENT

We would like to thank Soham Saha for mentoring our project, introducing us to new state-of-the-art technologies, and helping us at every stage of this project. We would also like to thank Dr. Bapi Raju, our course instructor for Statistical Methods in AI, for clearing up the basic concepts required as part of the project.