IBM Data Science Capstone Report
Business Understanding
The government aims to prevent avoidable car accidents by
employing methods that alert drivers, the health system, and the
police so that they can take extra care in critical situations.
In most cases, inattentive driving, drug or alcohol abuse, and
excessive speed are the main causes of accidents, and these can be
reduced by enacting harsher regulations. Beyond these causes,
weather, visibility, and road conditions are major factors outside
a driver's control. Their effect can be mitigated by revealing
hidden patterns in the data and issuing warnings to the local
government, police, and drivers on the targeted roads.
The target audience of the project is the local Seattle government,
police, rescue groups, and, last but not least, car insurance
institutes. The model and its results are intended to give this
audience guidance for making informed decisions that reduce the
number of accidents and injuries in the city.
Data
The data was collected by the Seattle Police Department and
Accident Traffic Records Department from 2004 to present.
The data consists of 37 independent variables and 194,673 rows.
The dependent variable, “SEVERITYCODE”, contains a number that
encodes the severity of an accident, either 1 or 2.
Severity codes are as follows:
1: Property Damage Only Collision
2: Injury Collision
Furthermore, because some records contain null values, the data
needs to be preprocessed before it can be analyzed.
Data Preprocessing
The dataset in its original form is not ready for data analysis. To
prepare the data, we first need to drop the non-relevant columns.
In addition, most of the features are of the object data type and
need to be converted into numerical data types.
After analyzing the data set, I decided to focus on only four
columns: severity (the target), weather conditions, road
conditions, and light conditions.
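The column selection and type conversion described above can be
sketched roughly as follows. This is a minimal illustration, not the
notebook's actual code: the rows are an invented toy sample, dropping
rows with null values is an assumption about how nulls were handled,
and the "_CAT" suffix is a hypothetical naming choice.

```python
import pandas as pd

# Toy stand-in for the collision data; the rows are invented for illustration only.
df = pd.DataFrame({
    "SEVERITYCODE": [1, 2, 1, 1, 2, 1],
    "WEATHER": ["Clear", "Raining", "Overcast", None, "Clear", "Raining"],
    "ROADCOND": ["Dry", "Wet", "Dry", "Wet", None, "Wet"],
    "LIGHTCOND": ["Daylight", "Dark - Street Lights On", "Daylight",
                  "Daylight", "Dusk", "Daylight"],
    "LOCATION": ["a", "b", "c", "d", "e", "f"],  # example of a non-relevant column
})

feature_cols = ["WEATHER", "ROADCOND", "LIGHTCOND"]

# Keep only the target and the three condition columns.
# Assumption: rows with a null in any of these columns are simply dropped.
subset = df[["SEVERITYCODE"] + feature_cols].dropna().copy()

# Convert each object column to integer category codes.
# The "_CAT" suffix is a hypothetical name for the encoded column.
for col in feature_cols:
    subset[col + "_CAT"] = subset[col].astype("category").cat.codes

print(subset.dtypes)
```

Category codes are only one way to make the columns numeric. They
impose an arbitrary ordering on the categories, so one-hot encoding may
be preferable for distance-based models.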
To get a good understanding of the dataset, I checked the distinct
values in each feature. The results show that the target feature is
imbalanced, so a simple statistical technique is used to balance it.
The number of rows in class 1 is almost three times the number of
rows in class 2. The issue can be solved by downsampling class 1.
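Downsampling the majority class can be sketched as below. This is one
plausible way to do it, using sklearn.utils.resample. The toy frame,
its 9-to-3 class split, and the random seed are assumptions made for
the example and are not taken from the report.

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced target; the 9-to-3 split is invented for illustration.
df = pd.DataFrame({
    "SEVERITYCODE": [1] * 9 + [2] * 3,
    "WEATHER_CAT": list(range(12)),  # hypothetical encoded feature
})

majority = df[df["SEVERITYCODE"] == 1]
minority = df[df["SEVERITYCODE"] == 2]

# Randomly draw, without replacement, as many class 1 rows
# as there are class 2 rows.
majority_down = resample(
    majority,
    replace=False,
    n_samples=len(minority),
    random_state=42,  # arbitrary seed, fixed only for reproducibility
)

balanced = pd.concat([majority_down, minority])
print(balanced["SEVERITYCODE"].value_counts())
```

Downsampling discards class 1 rows, which is acceptable when the
majority class is large. Class weights or oversampling are alternatives
when data is scarce.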
Methodology
To implement the solution, I used GitHub as a repository and ran
Jupyter Notebook to preprocess the data and build the machine
learning models. For coding, I used Python and its popular packages
such as Pandas, NumPy, and Sklearn.
Once I had loaded the data into a Pandas DataFrame, I used the
‘dtypes’ attribute to check the feature names and their data types.
Then I selected the most important features for predicting the
severity of accidents in Seattle. Among all the features, the
following have the most influence on the accuracy of the
predictions:
“WEATHER”,
“ROADCOND”,
“LIGHTCOND”
Also, as I mentioned earlier, “SEVERITYCODE” is the target
variable.
I ran a value count on the road condition (‘ROADCOND’) and weather
condition (‘WEATHER’) columns to get an idea of the different road
and weather conditions. I also ran a value count on the light
condition (‘LIGHTCOND’) column to see the breakdown of accidents
occurring under the different light conditions. The results can be
seen below:
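For reference, the inspection described above can be sketched as
follows. The rows are an invented toy sample standing in for the
collision data, so the printed counts are not the report's results. In
practice the frame would come from reading the collisions file with
pd.read_csv.

```python
import pandas as pd

# Toy stand-in for the loaded collision data; values are invented.
df = pd.DataFrame({
    "SEVERITYCODE": [1, 2, 1, 1, 2, 1],
    "WEATHER": ["Clear", "Raining", "Clear", "Overcast", "Clear", "Raining"],
    "ROADCOND": ["Dry", "Wet", "Dry", "Dry", "Dry", "Wet"],
    "LIGHTCOND": ["Daylight", "Dusk", "Daylight", "Daylight", "Dusk", "Daylight"],
})

# Column names and their data types.
print(df.dtypes)

# Frequency of each category in the three condition columns.
counts = {}
for col in ["WEATHER", "ROADCOND", "LIGHTCOND"]:
    counts[col] = df[col].value_counts()
    print(counts[col])
```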
After balancing the SEVERITYCODE feature and standardizing the
input features, the data was ready for building machine learning
models.
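The standardization step can be illustrated with scikit-learn's
StandardScaler. The report does not say which scaler was used, so this
choice is an assumption, and the small matrix below is invented.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented matrix of encoded features (rows = accidents, columns = features).
X = np.array([
    [0, 1, 2],
    [1, 0, 2],
    [2, 1, 0],
    [1, 2, 1],
], dtype=float)

# Rescale every column to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```

Standardization matters most for KNN, because its distance computation
would otherwise be dominated by the features with the largest numeric
range.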
I have employed three machine learning models:
K Nearest Neighbour (KNN)
Decision Tree
Linear Regression
After importing the necessary packages and splitting the
preprocessed data into train and test sets, I built and evaluated
each machine learning model. The results are shown as follows:
KNN
Decision Tree
Linear Regression
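The split, build, and evaluate procedure described above can be
sketched as follows. Everything specific in it is an assumption for
illustration: the data is synthetic, and the hyperparameters, split
ratio, and metrics (accuracy and weighted F1) are not taken from the
report. The report names its third model Linear Regression; because
SEVERITYCODE is a class label, the sketch uses scikit-learn's
LogisticRegression as the linear classifier, which is a guess at what
was intended.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the three encoded condition features and the
# two severity classes (1 and 2); not the real collision data.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 3)).astype(float)
y = np.where(X.sum(axis=1) + rng.normal(0, 1, 200) > 6, 2, 1)

# Hold out 30% of the rows for testing (assumed ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=4
)

# Hyperparameters below are illustrative assumptions.
models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(
        criterion="entropy", max_depth=4, random_state=0
    ),
    "Logistic Regression": LogisticRegression(C=1.0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    scores[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred, average="weighted"),
    }
    print(name, scores[name])
```

Collecting the scores in one dictionary makes it straightforward to
build a comparison table across the three models.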
Results and Evaluations
The final results of the model evaluations are summarized in the
following table:
Based on the above table, KNN is the best model to predict car
accident severity.
Conclusion
Based on the dataset provided for this capstone, in which weather,
road, and light conditions point to certain classes, we can
conclude that particular conditions have some impact on whether
travel results in property damage (class 1) or injury (class 2).