Data Analytics Part 3

The document discusses the complexities of the machine learning application software development lifecycle (MLASDLC) and highlights the differences between traditional software engineering and machine learning methodologies. It emphasizes the importance of data analysis in driving business success and outlines the data analysis process, including identification, collection, cleaning, analysis, and interpretation. Additionally, it covers various machine learning algorithms and models, including supervised, unsupervised, and reinforcement learning, while providing examples of predictive data analysis using linear regression.

Uploaded by

nirthisingh58

THE MACHINE LEARNING LIFECYCLE
Linear Regression
Decision Tree Analysis
Read Article titled: Agile Software Development Lifecycle Model for Machine
Learning (ASDLMML)
Machine learning application software
development life cycle (MLASDLC).
■ The complexity of building and integrating machine learning
applications is challenging to software engineering teams.
■ The inherent differences between software engineering and machine
learning do not allow software engineering methodologies to be
applied uniformly.
■ Whereas software engineering is dependent on software design,
development and testing, machine learning model development is
based on data and model design, training, evaluation, deployment,
and monitoring. Machine learning systems are non-deterministic and
are therefore difficult to build using sequential development methods
■ Data, hidden technical debt and the need for iterative experimentation
are the main technical challenges of machine learning development.
ML & Data Analytics
■ In this data-rich age, understanding how to analyze and extract
true meaning from the business’s digital insights is one of the
primary drivers of success.
■ Despite the colossal volume of data created every day, a mere
0.5% is actually analyzed and used for data discovery,
improvement, and intelligence.
■ First there is data analytics; then there is the challenge of using
the analytics in an automated manner, which results in ML
■ ML has a pivotal dependence on data analytics: because
machine learning models are informed by data analytics
processes and models, the ML lifecycle model is different from
a traditional systems lifecycle model
Traditional vs ML
■ The traditional lifecycle model…

■ The ML lifecycle model…


The Integrated Model
■ Read the article by Ranawana and Karunananda (2021) for a
full explanation of this model
Business meets Science
■ In business, project management takes precedence – activities are
highly manageable
■ In the world of science, project management is the least of concerns:
data analysis uses a more complex approach with advanced
techniques to explore and experiment with data, with little or no timeline
specification
■ On the other hand, in a business context, data is used to create
management protocols so that there is an optimal and efficient use of
resources that will enable the company to improve its overall
performance and profit margin – science improves the quality of
knowledge
■ Ideally – examine an analysis of data from a business point of view
while still going through the scientific and statistical foundations that
are fundamental to understanding the basics of data analysis.
Why Is Data Analysis Important?
■ Informed decision-making: From a management perspective, you can benefit from
analyzing your data as it helps you make decisions based on facts and not simple
intuition. For instance, you can understand where to invest your capital, detect growth
opportunities, predict your incomes, or tackle uncommon situations before they become
problems…explicit vs tacit knowledge?
■ Reduce costs: Another great benefit is to reduce costs. With the help of advanced
technologies such as predictive analytics, businesses can spot improvement
opportunities, trends, and patterns in their data and plan their strategies accordingly. In
time, this will help you save money and resources on implementing the wrong strategies.
And not just that, by predicting different scenarios such as sales and demand you can
also anticipate production and supply.
■ Target customers better: Customers are arguably the most crucial element in any
business. By using analytics to get a 360° vision of all aspects related to your customers,
you can understand which channels they use to communicate with you, their
demographics, interests, habits, purchasing behaviors, and more. In the long run
What Is The Data Analysis Process?
The DA Process
■ Identify: Before you get your hands dirty with data, you first need to identify why you need it in the
first place. Identification is the stage in which you establish the questions you will need to answer.
For example, what is the customer's perception of our brand? Or what type of packaging is more
engaging to our potential customers? Once the questions are outlined you are ready for the next step.
■ Collect: As its name suggests, this is the stage where you start collecting the needed data. Here, you
define which sources of information you will use and how you will use them. The collection of data can
come in different forms such as internal or external sources, surveys, interviews, questionnaires, focus
groups, among others. An important note here is that the way you collect the information will be
different in a quantitative and qualitative scenario.
■ Clean: Once you have the necessary data it is time to clean it and leave it ready for analysis. Not all the
data you collect will be useful. When collecting large amounts of information in different formats, it is very
likely that you will find yourself with duplicate or badly formatted data. To avoid this, before you start
working with your data you need to make sure to remove any stray white space, duplicate records, or
formatting errors. This way you avoid hurting your analysis with incorrect data.
■ Analyze: With the help of various techniques such as statistical analysis, regressions, neural networks,
text analysis, and more, you can start analyzing and manipulating your data to extract relevant
conclusions. At this stage, you find trends, correlations, variations, and patterns that can help you
answer the questions you first thought of in the identify stage. Various technologies on the market assist
researchers and average business users with the management of their data. Some of them include
business intelligence and visualization software, predictive analytics, and data mining, among others.
■ Interpret: Last but not least you have one of the most important steps: it is time to interpret your results.
This stage is where the researcher comes up with courses of action based on the findings. For example,
here you would understand if your clients prefer packaging that is red or green, plastic or paper, etc.
Additionally, at this stage, you can also find some limitations and work on them.
Quantitative Data Analysis
■ In Research
– The 2 main types are descriptive and inferential statistics
– Data analysis entails a population (the entire
group of people/subjects you’re interested in)
and a sample (a subset of the population)
– Descriptive statistics focus on describing the
sample, while inferential statistics aim to make
predictions about the population.
The Essential Types Of Data Analysis
Methods
■ 1) Descriptive analysis - What happened.
– The descriptive analysis method is the starting point to any analytic
reflection, and it aims to answer the question of what happened?
– It does this by ordering, manipulating, and interpreting raw data
from various sources to turn it into valuable insights for your
organization.
– Performing descriptive analysis is essential, as it allows us to
present our insights in a meaningful way.
– Although it is relevant to mention that this analysis on its own will
not allow you to predict future outcomes or tell you the answer to
questions like why something happened, it will leave your data
organized and ready to conduct further investigations.
Descriptive
■ Mean – this is simply the mathematical average of a range of numbers.
■ Median – this is the midpoint in a range of numbers when the numbers are
arranged in numerical order. If the data set makes up an odd number, then the
median is the number right in the middle of the set. If the data set makes up
an even number, then the median is the midpoint between the two middle
numbers.
■ Mode – this is simply the most commonly occurring number in the data set.
■ Standard deviation – this metric indicates how dispersed a range of numbers
is. In other words, how close all the numbers are to the mean (the average).
■ In cases where most of the numbers are quite close to the average, the
standard deviation will be relatively low.
■ Conversely, in cases where the numbers are scattered all over the place, the
standard deviation will be relatively high.
■ Visualisation – Pie, Bar, Histograms
Popular options for analysis
■ MS Excel
■ SPSS
■ Python
■ Power BI
Descriptive Sample
Number  Gender  Age  Weight
1       Male    28   65
2       Male    27   65
3       Female  39   61
4       Female  34   50
5       Female  43   65
6       Male    48   72
7       Female  41   55
8       Female  52   55
9       Male    39   68
10      Female  48   68
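The descriptive measures defined earlier (mean, median, mode, standard deviation) can be computed for the ten-row sample with Python's standard `statistics` module. This is a minimal sketch: the `describe` helper is a hypothetical name introduced here for illustration, and the values are typed in by hand from the sample table rather than loaded from a file.

```python
import statistics

# Age and Weight columns copied by hand from the ten-row descriptive sample.
ages = [28, 27, 39, 34, 43, 48, 41, 52, 39, 48]
weights = [65, 65, 61, 50, 65, 72, 55, 55, 68, 68]


def describe(values):
    """Hypothetical helper: return the descriptive measures for one column."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "modes": statistics.multimode(values),  # all most-common values
        "stdev": statistics.stdev(values),      # sample standard deviation
    }


age_summary = describe(ages)
weight_summary = describe(weights)
print("Age:", age_summary)
print("Weight:", weight_summary)
```

`multimode` is used instead of `mode` because a column can have more than one most-common value, as the Age column does here.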
Machine learning Algorithms
■ Supervised learning: Supervised learning occurs when an algorithm is trained using
“labeled data,” or data that is tagged with a label so that an algorithm can successfully
learn from it. Training labels help the eventual machine learning model know how to
classify data in the manner that the researcher desires.
■ Unsupervised learning: Unsupervised algorithms use unlabeled data to train an algorithm.
In this process, the algorithm finds patterns in the data itself and creates its own data
clusters. Unsupervised learning and pattern recognition are helpful for researchers who
are looking to find patterns in data that are currently unknown to them.
■ Semi-supervised learning: Semi-supervised learning uses a mix of labeled and unlabeled
data to train an algorithm. In this process, the algorithm is first trained with a small
amount of labeled data before being trained with a much larger amount of unlabeled
data.
■ Reinforcement learning: Reinforcement learning is a machine learning technique in which
positive and negative values are assigned to desired and undesired actions. The goal is to
encourage programs to avoid the negative training examples and seek out the positive,
learning how to maximize rewards through trial and error. Reinforcement learning can be
used to direct unsupervised machine learning.
machine learning models

■ There are two types of problems that dominate
machine learning:
– classification and prediction (regression).
■ Occasionally, the same algorithm can be used to
create either classification or regression models,
depending on how it is trained.
Classification & Prediction
Classification models        Regression models
■ Logistic regression        ■ Linear regression
■ Naive Bayes                ■ Ridge regression
■ Decision trees             ■ Decision trees
■ Random forest              ■ Random forest
■ K-nearest neighbor (KNN)   ■ K-nearest neighbor (KNN)
■ Support vector machine     ■ Neural network regression
Pearson Correlation - a simple form of prediction
import pandas as pd
df = pd.read_csv('Heights.csv')
df.plot(kind='scatter', x='Female_Height', y='Male_Height', figsize=(10, 6))
print(df.corr())
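The coefficient that `df.corr()` reports by default is Pearson's r, which can be sketched directly from its definition: the co-variation of the two variables divided by the product of their spreads. The `pearson_r` function and the height values below are assumptions made up for illustration; they are not taken from Heights.csv.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: how the two variables vary together around their means.
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: the product of the two individual spreads.
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(spread_x * spread_y)


# Hypothetical heights in cm, invented for this sketch.
female = [150, 155, 160, 165, 170]
male = [163, 168, 173, 178, 183]  # always female + 13: a perfect linear link

r_perfect = pearson_r(female, male)
r_inverse = pearson_r(female, list(reversed(male)))
print(r_perfect, r_inverse)
```

A value near +1 means the two variables rise together, a value near -1 means one falls as the other rises, and a value near 0 means there is no linear relationship.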
Getting the significance of the correlation
import pandas as pd
from scipy.stats import pearsonr

df = pd.read_csv('Heights.csv')
df.plot(kind='scatter', x='Female_Height', y='Male_Height', figsize=(10, 6))
print(df.corr())

# pearsonr returns the correlation coefficient and its two-sided p-value
corr, sig = pearsonr(df['Male_Height'], df['Female_Height'])

# Only report significance when the p-value is below 0.05
if sig < 0.05:
    print("Male Heights vs Female Heights is SIGNIFICANT at the 95% confidence level")
else:
    print("Male Heights vs Female Heights is NOT significant at the 95% confidence level")
print(round(corr, 3), round(sig, 3))
Linear/multiple regression
■ Simple linear regression is a function that allows an analyst or
statistician to make predictions about one variable based on the
information that is known about another variable.
■ Linear regression can only be used when one has two continuous
variables—an independent variable and a dependent variable.
■ The independent variable is the parameter that is used to calculate
the dependent variable or outcome….
– y = mx + c, e.g. y = 2x + 3, where y (profit) = 2 * x (no. of customers) + 3
■ A multiple regression model extends to several explanatory variables.
– y = 2x + 3k + 4z + c: the dependent variable is determined by more
than one independent variable
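To make y = mx + c concrete, the slope and intercept can be estimated from data with the ordinary least-squares formulas. The sketch below assumes a made-up data set that follows the slide's example y = 2x + 3 exactly, so the fit should recover m = 2 and c = 3; `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least-squares estimates of slope m and intercept c for y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    c = mean_y - m * mean_x
    return m, c


# Hypothetical data: profit generated exactly as 2 * customers + 3.
customers = [1, 2, 3, 4, 5]
profit = [2 * x + 3 for x in customers]

m, c = fit_line(customers, profit)
predicted = m * 10 + c  # predicted profit for 10 customers
print(m, c, predicted)
```

With real data the points will not lie exactly on a line, and the same formulas return the line that minimises the squared vertical distances to the points.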
Linear/multiple regression
■ This is an example of predictive data analysis –
also creates an opportunity for machine learning
(ML)
■ Take the example of vehicles and carbon
emission – let us suppose that we need to find a
model that will predict the level of carbon
emission of a vehicle if we know the weight and
engine capacity
The data as a csv
■ Use pandas to arrange the data into a data frame
import pandas as pd
df = pd.read_csv("cars.csv")
df
Get the main data items
■ To build the model – we need to isolate the
independent and the dependent variables
■ The volume and weight are the independent variables
■ The CO2 is the dependent variable…use X to
represent the independent variable and y to
represent the dependent variable
X = df[['Weight', 'Volume']]
y = df['CO2']
Using a ML Library
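from sklearn import linear_model

# Fit a linear model: CO2 as a function of weight and engine volume
regr = linear_model.LinearRegression()
X = df[['Weight', 'Volume']]
y = df['CO2']
regr.fit(X, y)

# Ask for a new car's details and predict its CO2 emission
weight = input('Enter the car weight: ')
engine = input('Enter the engine capacity in ccm: ')
# Same column names as X, so scikit-learn does not warn about feature names
new_car = pd.DataFrame([[int(weight), int(engine)]], columns=['Weight', 'Volume'])
predictedCO2 = regr.predict(new_car)
print(predictedCO2)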
ML - The main data analysis route
■ Use pandas to import data file
■ Use seaborn and matplotlib to visualise the data
and support the statistical analysis
■ E.g. – multiple regression
– Similar to simple linear regression but with more than
one independent variable, meaning that we try to
predict a value based on two or
more variables.
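A multiple regression of this kind can be sketched with NumPy's least-squares solver. The data below are an assumption made up for illustration: they are generated from y = 2*x1 + 3*x2 + 4 with no noise, so the fit should recover those coefficients. This is not the cars or advertising data.

```python
import numpy as np

# Hypothetical data: two independent variables per row.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 5.0],
              [4.0, 3.0],
              [5.0, 8.0]])
# Dependent variable generated exactly as 2*x1 + 3*x2 + 4.
y = 2 * X[:, 0] + 3 * X[:, 1] + 4

# Append a column of ones so the intercept is estimated along with the slopes.
A = np.column_stack([X, np.ones(len(X))])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
b1, b2, intercept = coeffs

# Predict y for a new observation with x1 = 6 and x2 = 2.
prediction = b1 * 6 + b2 * 2 + intercept
print(b1, b2, intercept, prediction)
```

scikit-learn's `LinearRegression` solves the same least-squares problem behind a `fit`/`predict` interface.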
Getting the data
■ Another example involving more variables – The decision is to determine the most cost-effective method for advertising
■ The independent data is the cost of advertising via:
– TV
– Radio
– Newspaper
■ The dependent variable is an amount that represents sales

import pandas as pd
df=pd.read_csv("Advertising.csv", index_col="No")
df
Checking the head & tail
df.head(10)
df.tail(8)
Scatter plot TV vs Sales
df.plot(kind='scatter',x='TV', y='sales',figsize=(10,6),color='Red');
Matplotlib and seaborn
■ Pandas plotting is quite good, but two dedicated plotting
libraries offer more: matplotlib, and seaborn, which builds
on matplotlib and produces statistical plots with less code
import seaborn as sns
sns.pairplot(df, x_vars=['TV','radio','newspaper'], y_vars='sales')
The color option
import seaborn as sns
sns.pairplot(df, x_vars=['TV','radio','newspaper'], y_vars='sales', kind='reg', plot_kws={'line_kws':{'color':'red'}})
Tutorial Exercise
■ Use the Advertising data (csv download from
Learn) and build the predictive model that will
predict the sale given the cost of TV, Radio and
Newspaper
Another Example of ML
■ The file contains information about passengers who were on board the
Titanic when the collision took place.
■ We will use this data to perform exploratory data analysis
in Python and better understand the factors that contributed to a
passenger’s survival of the incident.
■ The idea here is to use the passenger's details (independent variables)
to predict whether the passenger survived (dependent variable)
■ In this scenario we have 2 datasets:
– The 1st is a training dataset with full data including whether the
passenger has survived
– The 2nd is a test dataset with the dependent variable missing
(i.e. the survival indicator is omitted)
The Titanic
import os
os.chdir(r"C:\SE_2025\Python\Data")  # raw string so the backslashes are not read as escapes
import pandas as pd
df = pd.read_csv('train.csv')
df.head()

■ There is a problem with missing Data!


df.info()
The Aggregates
■ Count: the number of rows in the dataset that are populated with non-null values. There
are 891 unique passenger IDs in this dataset. All the other variables also have 891 rows
of data populated, with the exception of ‘Age’, which only has 714 rows. This means that
there are 177 passengers in the dataset who aren’t tagged with an age value.
■ Mean: the mean value in each column. The mean age of passengers in this dataset,
for example, was about 30.
■ Std: how much deviation each column has from the mean.
■ Min: the minimum value of each variable. For example, the minimum value for ‘SibSp’ is 0,
meaning that there were passengers who traveled without their siblings and spouses.
■ 25%, 50%, and 75%: the 1st quartile, 2nd quartile (median), and 3rd quartile.
■ Max: the highest value for each variable in the dataset. From the data frame above, we
can see that the oldest passenger aboard the Titanic was 80 years old.
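In pandas, aggregates like these are what `DataFrame.describe()` reports for the numeric columns. The sketch below runs it on a tiny frame whose values are made up for illustration (an assumption, not the Titanic data), with one missing age so that the counts differ between columns.

```python
import pandas as pd

# Hypothetical mini-frame, invented for this sketch.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0, 54.0],  # one missing age
    "SibSp": [1, 1, 0, 1, 0],
})

# count, mean, std, min, quartiles and max for each numeric column
summary = toy.describe()
print(summary)
```

The `count` row shows 4 for `Age` and 5 for `SibSp`, which is how a missing value shows up in the summary.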
Cleaning the Data – especially the missing ones!
■ Note: Notice that we are creating a copy of the data
frame before removing missing values. This is done
so that the original frame isn’t tampered with and we
can go back to it anytime without losing valuable
data. It is often a best practice to create a copy
before performing data manipulation.
df2 = df.copy()
df2 = df2.dropna()
df2.info()
Notice that earlier there were 891 rows. By dropping rows with
missing values, we have reduced the size of this
data frame by more than half. This isn’t a good practice: we
lose a lot of valuable data by simply removing rows that contain
missing values.
The missing values
■ Data preprocessing is one of the most important steps
when conducting any kind of data science activity. Earlier,
we noticed that the ‘Age’ column had some missing
values in it. Let’s dive deeper to see if there are any
further inconsistencies in our dataset.
df.isnull().sum()
■ As a result, we see that there are 3 columns with missing
values: Age, Cabin, and Embarked.
■ We can deal with these missing values in a few different
ways. The simplest option is to simply drop all the rows
that contain missing values.
Data Imputation
■ Let’s try a second approach — imputation. In other
words, the process of replacing missing data with
substituted values.
■ First, impute missing values in the ‘Age’ column. We
will use mean imputation in this case — substituting
all the missing age values with the average age in the
dataset.
■ We can do this by running the following lines of code:
df3 = df.copy()
df3['Age'] = df3['Age'].fillna(df3['Age'].mean())
Imputation ….cont’d
■ Now, let’s move on to the ‘Cabin’ column. We will
replace the missing values in this column with
the majority class:
df3['Cabin'] = df3['Cabin'].fillna(df3['Cabin'].value_counts().index[0])
■ We can do the same for ‘Embarked’:
df3['Embarked'] = df3['Embarked'].fillna(df3['Embarked'].value_counts().index[0])
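The effect of mean and majority-class imputation is easier to see on a frame small enough to check by hand. The values below are made up for illustration (an assumption, not the Titanic data); the `fillna` calls are the same ones used on `df3`.

```python
import pandas as pd

# Hypothetical three-row frame with one gap in each column.
toy = pd.DataFrame({
    "Age": [20.0, None, 40.0],
    "Embarked": ["S", "S", None],
})

filled = toy.copy()  # keep the original frame untouched
# Numeric column: replace the gap with the column mean (30.0 here).
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())
# Categorical column: replace the gap with the most frequent value ("S" here).
most_common = filled["Embarked"].value_counts().index[0]
filled["Embarked"] = filled["Embarked"].fillna(most_common)
print(filled)
```

Because the work is done on a copy, `toy` still contains its missing values afterwards and can be used to try a different imputation strategy.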
Univariate Analysis
■ Univariate analysis is the process of performing a
statistical review on a single variable.
■ We will start by creating a simple visualization to
understand the distribution of the ‘Survived’ variable
in the Titanic dataset.
■ Our aim is to answer simple questions with the help
of available data, such as:
– How many passengers survived the Titanic
collision?
– Were there more fatalities than survivors?
Let’s get to know the Data set
train_data = pd.read_csv('train.csv')
# Numeric columns
df_num = train_data[['Age', 'SibSp', 'Parch', 'Fare']]
df_num
# Categorical columns
df_cat = train_data[['Survived', 'Pclass', 'Sex', 'Ticket', 'Cabin', 'Embarked']]
df_cat
Histograms and Correlations
■ It is always a good idea to visualise
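import matplotlib.pyplot as plt
import seaborn as sns

# One histogram per numeric column (missing values dropped before plotting)
for i in df_num.columns:
    plt.hist(df_num[i].dropna())
    plt.title(i)
    plt.show()

print(df_num.corr())

sns.heatmap(df_num.corr(), annot=True)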
Pivot the data
■ The Pivot tables are very good at providing a deeper insight

print(pd.pivot_table(train_data,index='Survived',columns='Pclass',values='Ticket', aggfunc='count'))
print('----------------------------------------------------------')
print(pd.pivot_table(train_data,index='Survived',columns='Sex',values='Ticket', aggfunc='count'))
print('----------------------------------------------------------')
print(pd.pivot_table(train_data,index='Survived',columns='Embarked',values='Ticket', aggfunc='count'))
Histogram of Categorical Variable
■ In the Seaborn library, we can create a count plot
to visualize the distribution of the
‘Survived’ variable.
■ Essentially, a count plot can be thought of as
a histogram across a categorical variable.
■ To do this, run the following code:

import seaborn as sns


sns.countplot(x='Survived',data=df)
Getting the exact values
■ By looking at the results, we can tell that a
majority of the passengers didn’t survive the
Titanic collision.
■ To get the exact breakdown of passengers who
survived and those who didn’t, we can use a
built-in method of the pandas library called
‘value_counts()’:
df['Survived'].value_counts()
Analyze the Relationship Between Variables
■ In this case, we will run an analysis to try and answer the
following questions about Titanic survivors:
– Did a passenger’s age have any impact on what class they
traveled in?
– Did the class that these passengers traveled in have any
correlation with their ticket fares?
– Were passengers who paid higher ticket fares located in
different cabins as compared to passengers who paid lower
fares?
– Did ticket fare have any impact on a passenger’s survival?
■ Using the questions above as a rough guideline, let’s begin the
analysis.
Age vs Class
■ First, let’s create a boxplot to visualize the
relationship between a passenger’s age and the
class they were traveling in:
sns.boxplot(data=df,x='Pclass', y='Age')

Taking a look at the boxplot above, notice that passengers
traveling first class were older than passengers in the second
and third classes. The median age of first-class passengers is
around 35, while it is around 30 for second-class passengers,
and 25 for third-class passengers.
This makes sense since older individuals are likely to have
accumulated a larger amount of wealth and can afford to travel
first class. Of course, there are exceptions, which is why you
can observe passengers above 70 in the second and third
classes – our outliers.
Price vs Life!
■ Moving on, let’s look into the relationship between a
passenger’s ticket fare and survival:

sns.barplot(data=df,x='Survived',y='Fare')

As expected, passengers who survived had paid higher ticket fares on average.
One plausible explanation is that they could afford cabins closer to the lifeboats,
which meant they could make it out in time.
By extension, this should also mean that the first-class
passengers had a higher likelihood of survival. Let’s confirm
this:

sns.barplot(data=df,x='Pclass',y='Survived')
Answering the questions via plots
■ Did a passenger’s age have any impact on what class they
traveled in? Yes, older passengers were more likely to travel
first class.
■ Were passengers who paid higher ticket fares in different
cabins as opposed to passengers who paid lower fares? Yes,
passengers who paid higher ticket fares seemed to mostly
travel in cabin B. However, the relationship between ticket fare
and cabin isn’t too clear because there were many missing
values in the ‘Cabin’ column. This might have compromised the quality
of the analysis.
■ Did ticket fare have any impact on a passenger’s survival? Yes,
passengers who paid higher fares, and first-class passengers in particular,
were more likely to survive the collision.
The challenge

■ Use the Titanic passenger data (name, age, price
of ticket, etc) to try to predict who will survive and
who will die.
■ Remember the goal: we want to find patterns
in train.csv that help us predict whether the
passengers in test.csv survived.
Exploratory Analysis - a possible pattern
(hypothesis)
train_data = pd.read_csv("train.csv")
train_data
women = train_data.loc[train_data.Sex == 'female']["Survived"]
rate_women = sum(women)/len(women)*100
print("% of women who survived:", rate_women)
men = train_data.loc[train_data.Sex == 'male']["Survived"]
rate_men = sum(men)/len(men)*100
print("% of men who survived:", rate_men)
Gender as a possible predictor of Survival
(b.t.w don’t forget to cast your vote!!)
■ From this you can see that almost 75% of the women on board
survived, whereas only 19% of the men lived to tell about it. Since
gender seems to be such a strong indicator of survival, it is a
promising predictor of whether a passenger survived.
■ An ML model applies this kind of logic across many variables at once.
In this case, we use a model known as the random forest.
■ This model is constructed of several "trees" that will individually
consider each passenger's data and vote on whether the
individual survived. Then, the random forest model makes a
democratic decision: the outcome with the most votes wins.
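The voting step can be sketched in a few lines. This is an illustration only: in a real random forest each tree is learned from the training data, whereas the three "trees" below are hand-written rules of thumb, and all function names and thresholds are assumptions invented for this sketch.

```python
from collections import Counter


def majority_vote(votes):
    """Return the outcome that received the most votes."""
    return Counter(votes).most_common(1)[0][0]


# Hypothetical stand-ins for learned trees: 1 = survived, 0 = did not survive.
def tree_sex(passenger):
    return 1 if passenger["Sex"] == "female" else 0


def tree_class(passenger):
    return 1 if passenger["Pclass"] == 1 else 0


def tree_family(passenger):
    return 1 if passenger["SibSp"] + passenger["Parch"] <= 3 else 0


def forest_predict(passenger, trees):
    """Let every tree vote, then take the democratic decision."""
    votes = [tree(passenger) for tree in trees]
    return majority_vote(votes)


trees = [tree_sex, tree_class, tree_family]  # odd count, so no tied votes
p1 = {"Sex": "female", "Pclass": 3, "SibSp": 0, "Parch": 0}
p2 = {"Sex": "male", "Pclass": 3, "SibSp": 4, "Parch": 2}
pred1 = forest_predict(p1, trees)  # votes 1, 0, 1
pred2 = forest_predict(p2, trees)  # votes 0, 0, 0
print(pred1, pred2)
```

Using an odd number of trees avoids tied votes in a two-outcome problem such as this one.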
A 1st Encounter with a machine learning
(ML) model
■ We'll build what's known as a random forest
model.
Using 4 indicators (for now)
■ The code cell below looks for patterns in four
different columns ("Pclass", "Sex", "SibSp",
and "Parch") of the data.
■ It constructs the trees in the random forest
model based on patterns in the train.csv file,
before generating predictions for the passengers
in test.csv.
■ The code also saves these new predictions in a
CSV file submission.csv.
The ML Prediction Model
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Load the labelled training set and the unlabelled test set
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")

y = train_data["Survived"]
features = ["Pclass", "Sex", "SibSp", "Parch"]
# One-hot encode the categorical 'Sex' column
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

# Train the forest, then predict survival for the test passengers
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)

output = pd.DataFrame({'Passenger Name': test_data.Name, 'PassengerId':
test_data.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")

df = pd.read_csv('submission.csv')
df
