Introduction To Data Science UNIT - II

This document discusses the applications of machine learning in data science, highlighting its role in various industries such as healthcare, finance, and marketing. It emphasizes how machine learning enhances data analysis, predictive modeling, and anomaly detection, leading to improved decision-making and operational efficiency. Additionally, it introduces Scikit-learn as a popular library for implementing machine learning algorithms in Python.

Uploaded by

smce.ramu

Introduction to data science

Unit -II
Applications of Machine Learning in Data science
Machine learning is one of the most exciting technologies of recent years. As the name
suggests, it gives the computer an ability that makes it more similar to humans: the
ability to learn.
Machine learning is actively being used today, perhaps in many more places than one would
expect.
Today, companies are using machine learning to improve business decisions, increase
productivity, detect disease, forecast weather, and much more. With the exponential
growth of technology, we not only need better tools to understand the data we currently
have, but we also need to prepare ourselves for the data we will have. To achieve this goal we
need to build intelligent machines. We can write a program to do simple things, but most of
the time hard-wiring intelligence into it is difficult. A better way is to give machines
a mechanism for learning: if a machine can learn from input, then it does the hard work
for us. This is where machine learning comes into action.
Some of the most common examples are:
• Image Recognition
• Speech Recognition
• Recommender Systems
• Fraud Detection
• Self Driving Cars
• Medical Diagnosis
• Stock Market Trading
• Virtual Try On

Image Recognition
Image Recognition is one of the main reasons behind the boom in the field of Deep
Learning. The task, which started with classifying images of cats and dogs, has now
evolved to the level of Face Recognition and real-world use cases built on it, such as
employee attendance tracking.
Image recognition has also helped revolutionize the healthcare industry by employing
smart systems in disease recognition and diagnosis.

Speech Recognition
Most of us have come across smart systems such as Alexa and Siri and have spoken to
them. In the backend, these systems are built on Speech Recognition: they are designed
to convert voice instructions into text.
One more application of Speech Recognition that we encounter in day-to-day life is
performing Google searches just by speaking.

Recommender Systems
As our world has become more and more digital, nearly every tech giant tries to provide
customized services to its users. This is possible because of recommender systems, which
analyze a user’s preferences and search history and recommend content or services based
on them.
A common example is YouTube, which recommends new videos and content based on the user’s
past search patterns. Netflix recommends movies and series based on the interests that
users provide when they create an account for the very first time.

Fraud Detection
In today’s world, most things have been digitalized: from buying a toothbrush to making
transactions worth millions of dollars, everything is accessible and easy to do online.
But with this digitization, cases of fraudulent transactions and fraudulent activities
have increased. Identifying them is not easy, but machine learning systems are very
efficient at these tasks.
With such systems in place, whenever red flags are detected in a user’s activity, a
suitable notification can be sent to the administrator so that these cases can be
monitored properly for spam or fraud.

Self Driving Cars


If we had ever seen a car moving without a driver, we would have assumed that some ghost
was driving it. Thanks to machine learning and deep learning, this is now possible and
not a story from some fictional book. Even though the algorithms and tech stack behind
these technologies are highly advanced, at the core it is machine learning that has made
these applications possible.
The most commonly cited example of this use case is Tesla cars, which ship with
driver-assistance features built on machine learning.

Medical Diagnosis
If you are a machine learning practitioner, or even a student, you have probably heard
of projects such as Breast Cancer Classification, Parkinson’s Disease Classification,
and Pneumonia Detection, along with many other health-related tasks that machine
learning models are reported to perform with more than 90% accuracy.
These models are useful not only for diagnosing disease in human beings but also for
plant-disease tasks, whether predicting the type of disease or detecting whether a
disease is likely to occur in the future.

Stock Market Trading


The stock market has remained a hot topic among working professionals and even students,
because with sufficient knowledge of the markets and the forces that drive them, one can
make a fortune in this domain. Attempts have been made to create intelligent systems
that can predict future price trends and market value.
This can be considered an application of time series forecasting, because stock price
data is sequential data in which the time at which each value was recorded is of utmost
importance.

Virtual Try On
Have you ever purchased your specs or lenses from Lenskart? If so, you may have come
across its feature that lets you try different frames virtually without purchasing them
or visiting an outlet. This is possible because machine learning systems identify
certain landmarks on a person’s face and then place the specs virtually on the face
using those landmarks.

Role of Machine Learning in Data Science

In today’s world, the collaboration between machine learning and data science plays an
important role in maximizing the potential of large datasets. Despite their complexity,
these concepts are integral to unraveling insights from vast data pools. Let’s delve into
the role of machine learning in data science, exploring its functions and significance
across diverse domains.

Understanding Machine Learning and Data Science


Machine learning is like a computer learning from data and making independent
decisions. It’s similar to how we teach kids patterns by showing them several examples.
On the other hand, data science focuses on pulling out useful information from data
using different methods and tools.
The merger of machine learning with data science is indispensable in today’s
data-driven world. Machine learning assists data scientists in efficiently navigating
through extensive datasets, identifying patterns, predicting outcomes, and detecting
anomalies. This collaboration proves vital across industries like business, medicine,
and finance, where data-driven insights drive progress and informed decision-making.
Understanding the interplay between machine learning and data science is crucial for
harnessing the potential of data analysis in various domains. Together, they empower
companies to extract valuable insights from complex datasets, leading to improved work
efficiency, innovative ideas, and a competitive edge in the current landscape.

Data Science Vs Machine Learning

Data Science | Machine Learning
------------ | ----------------
Data scientists use powerful tools from machine learning algorithms. | Machine learning algorithms act as magic keys, unlocking large datasets with ease.
Data science is like a twin to machine learning, enhancing each other’s abilities. | Machine learning is a handy sidekick for data scientists, helping them navigate through complex data mazes.
With machine learning, data scientists can dig deeper into information and uncover concealed patterns. | Machine learning algorithms are wizards at finding patterns, predicting outcomes, and spotting anomalies.
The collaboration between data science and machine learning is crucial across various fields. | This joint effort leads to smarter decisions, better work methods, and success in data-driven environments.

Role of Machine Learning in Data Science

Machine learning significantly boosts data science by improving analysis efficiency,
spotting patterns, predicting outcomes, and identifying anomalies in extensive
datasets, facilitating informed decision-making.
1. Enabling predictive modeling: Machine learning is like having a superpower. Why? It
can look at old data and find patterns. Those patterns help guess what will happen
next. It’s pretty accurate, too. Businesses love this. They can use it to make plans
and good choices, like in finance. Machine learning looks at old stock market info
and guesses what prices will do. It can help investors know when to buy or sell. Or
in healthcare. It can look at patient info and guess if they might get sick. If they
might, doctors can help sooner. That can make patients healthier. Machine learning
has become increasingly important in data science as it can uncover patterns and
correlations in large datasets that would be impossible to detect otherwise. By
training algorithms on vast amounts of real-world data, machine learning tech-
niques are able to identify useful insights and make predictions that guide critical
decisions in many different fields.

2. Facilitating classification: Machine learning algorithms work like tools. They sort
data into set groups. This makes it easier to handle and understand information. By
grouping items based on their qualities, we can make sense of a lot of data. Just pic-
ture an online shop. Machine learning algorithms can sort products into groups like
electronics, clothes, or home stuff. Thus, customers can smoothly uncover what
they want. Because this sorting is automated, machine learning algorithms save
time and energy. This lets businesses focus on studying data and pulling out useful
details. In short, machine learning makes data management and understanding bet-
ter. This leads to swifter decisions and a clearer grasp of complex data sets.

3. Supporting anomaly detection: Machine learning plays a key role in picking out odd
patterns or outliers in datasets. These can point to possible issues or suspicious
activities. Machine learning algorithms look at large volumes of data and find anything
that moves off the beaten path, like odd money transactions or strange user
actions. This ability to spot oddities is key in many areas, including finance,
cybersecurity, and healthcare, where spotting anything unusual early on might stop
big losses or risks. For example, in banks, machine learning algorithms can flag
transactions that stray from the normal. This can stop fraud.
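
The idea of flagging transactions that stray from the normal can be sketched with a simple z-score rule. This is only an illustration using the Python standard library, not a production fraud detector: the transaction amounts, the helper name flag_anomalies, and the threshold of 2.0 are all assumptions made for this example. Real systems typically use richer features and learned models.

```python
from statistics import mean, stdev


def flag_anomalies(amounts, threshold=2.0):
    """Return the amounts whose z-score exceeds the threshold.

    Hypothetical helper for illustration; the threshold is an assumption.
    """
    mu = mean(amounts)
    sigma = stdev(amounts)  # sample standard deviation
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [a for a in amounts if abs(a - mu) / sigma > threshold]


# Made-up transaction amounts: nine ordinary purchases and one very large one.
transactions = [42.0, 38.5, 45.2, 40.1, 39.9, 41.3, 43.7, 5000.0, 37.8, 44.0]
suspicious = flag_anomalies(transactions)
print(suspicious)
```

A single extreme value inflates the mean and standard deviation it is measured against, which is why a fairly low threshold is used on this small sample; median-based rules are a common, more robust alternative.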

Industry | Applications
-------- | ------------
E-Commerce | Recommendations, Demand Forecasting
Healthcare | Disease Diagnosis, Outcome Prediction
Finance | Fraud Detection, Risk Assessment
Marketing | Customer Segmentation, Campaign Optimization
Transportation | Route Optimization, Autonomous Vehicles
Manufacturing | Predictive Maintenance, Quality Control
Education | Personalized Learning, Performance Prediction

Real-world Applications

The influence of machine learning in data science spans industries, facilitating efficient
analysis, predictive modeling, anomaly detection, and decision-making processes,
enhancing overall productivity and effectiveness.
1. Business: Machine learning helps businesses improve service, hone marketing, and
smooth out tasks. It uses client data to tailor suggestions, predict demands, and
automate jobs, which elevates service and ramps up efficiency. Moreover, it allows
firms to gather valuable knowledge from vast data, aiding strategy choices and
powering innovation. As an example, machine learning-based predictive analytics
can predict demand shifts, helping businesses better manage supplies and
resources.

2. Healthcare: Machine learning is changing the game in healthcare! It helps identify
diseases, predicts how patients will do, and matches treatment plans to specific
needs. This makes healthcare better. It reviews medical data and spots trends
linked to different illnesses. This means we can catch and diagnose conditions
early. Machine learning also forecasts how patients will fare based on factors like
past medical issues, genes, and how they respond to treatment. This lets
healthcare providers step in early to boost patient care.

3. Finance: Machine learning is super important in finance. It helps find fraud, check
risks, and manage investments in the best way. It looks at lots of financial data to
find irregular patterns that might mean fraud. This way, crime can be stopped earlier.
Machine learning also helps check how risky different financial dealings or
investments are. This helps organizations make the best choices and reduce the chances
of loss.

4. Marketing: Machine learning enables customer segmentation, campaign optimization,
and personalized marketing strategies, improving targeting and conversion
rates.

5. Education: Machine learning supports personalized learning experiences and
performance prediction, enhancing student engagement and academic outcomes.

6. Manufacturing: Machine learning supports predictive maintenance and quality
control.
Future of Machine Learning in Data Science
The hunger for machine learning in today’s data-packed world isn’t fading. It’s set to
rise. Technological upgrades and booming data make teamwork between machine
learning and data science ever more vital. They help pull sweet wisdom from a treasure
trove of data. That means smarter choices and creative leaps forward for many
organizations.

1. Enhancing Efficiency and Insights: Machine learning algorithms help data scien-
tists. They can look at complex data and find hidden things like patterns, trends,
and connections. Data science and machine learning can change many things. It
can help fields like health care, finance, and retail. It can predict the future, recom-
mend things people might like, and make business processes better. Take
healthcare, for example. Here, machine learning can detect disease early, guess
whether a treatment works, and create unique treatment plans for each patient. In
the same way, finance uses machine learning, too. It helps find fraud, assess risks,
and choose the best investment strategies. Adding machine learning to data sci-
ence helps companies. Smart choices are easier for them. Work gets done quicker.
Market changes don’t throw them off. As tech grows, uniting machine learning and
data science is key. It drives new ideas and shapes industries globally.

2. Revolutionizing Industries: Machine learning is reshaping patient care in
healthcare. It catches diseases early and predicts treatment outcomes. Plus, it
customizes care plans using unique patient data. In the same way, it fights fraud in
finance. It handles risks linked to financial transactions. It even fine-tunes
investment plans using algorithmic trading. These uses make transactions safer and
portfolios better managed. This boosts businesses and helps customers. As machine
learning gets sharper, it’ll play a bigger part in healthcare and finance. It’ll be a
big help in improving results and lowering risks in key areas.

3. Driving Innovation and Competitiveness: The spark that machine learning and data
science create together isn’t going dim anytime soon. It’s a must-have for successful
businesses and efficient operations. Plus, it keeps you ahead in the competitive
jungle out there. Companies that use machine learning to bolster their data science
game get a huge leg up. Better choices, a smooth process, and new paths to
innovation are their prizes. Technology is improving. More data is accumulated. The
link between machine learning and data science fuels growth in numerous areas.

By using machine learning, companies decode crucial knowledge from their data. They
can then adjust to market trends, predict customer wants, and sharpen their products.
Simply, mixing machine learning with data science boosts businesses. It allows them to
thrive in a changing world full of opportunities and obstacles.

Conclusion
Think of machine learning as the spine of data science. It’s super important because it
can dig deep into big, complicated data collections and pull out useful info. Beyond
predicting what might happen in the future, machine learning can spot tricky patterns
and help businesses work smoother and smarter, sparking new ideas in all kinds of fields.

Scikit-learn

Scikit-learn is one of the most popular ML libraries for classical ML algorithms. It is built
on top of two basic Python libraries, viz., NumPy and SciPy. Scikit-learn supports most
of the supervised and unsupervised learning algorithms. Scikit-learn can also be used
for data mining and data analysis, which makes it a great tool for anyone who is starting
out with ML.

Python
# Python script using Scikit-learn
# Sample Decision Tree Classifier

from sklearn import datasets
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier

# load the iris dataset
dataset = datasets.load_iris()

# fit a CART model to the data
model = DecisionTreeClassifier()
model.fit(dataset.data, dataset.target)
print(model)

# make predictions on the same data the model was trained on
expected = dataset.target
predicted = model.predict(dataset.data)

# summarize the fit of the model
# (these are training-set scores, so they are optimistic)
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Output (exact formatting varies with the scikit-learn version):

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        50
           1       1.00      1.00      1.00        50
           2       1.00      1.00      1.00        50

   micro avg       1.00      1.00      1.00       150
   macro avg       1.00      1.00      1.00       150
weighted avg       1.00      1.00      1.00       150

[[50  0  0]
 [ 0 50  0]
 [ 0  0 50]]
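
The perfect scores above come from evaluating the model on the same samples it was trained on. A fairer estimate holds out part of the data for testing. The following is a minimal sketch of that idea, not part of the original script: the 70/30 split, the stratify option, and the random_state values are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out 30% of the samples; stratify keeps the class proportions equal
# in both parts. The split size and seed are illustrative assumptions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Compare accuracy on data the model has seen with data it has not seen.
train_accuracy = accuracy_score(y_train, model.predict(X_train))
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print("train accuracy:", train_accuracy)
print("test accuracy:", test_accuracy)
```

The test accuracy is the number to report, since it reflects how the model behaves on unseen samples.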

Best Python libraries for Machine Learning

• NumPy
• SciPy
• Scikit-learn
• Theano
• TensorFlow
• Keras
• PyTorch
• Pandas
• Matplotlib

The feature engineering process in Python involves several steps, including:
• Data preparation
Cleaning data by handling missing values, outliers, and encoding categorical variables
• Feature scaling
Normalizing or standardizing numerical features so they are on a similar scale
• Model training and evaluation
Building machine learning models using the selected features and evaluating their
performance
• Iterative refinement
Refining the feature engineering and selection process based on the model's
performance metrics
Some feature engineering techniques include:
• One-hot encoding
• Imputation
• Polynomial features
• Principal component analysis (PCA)
• Statistical tests
• Feature importance
There are two main approaches to feature engineering:
1. Checklist approach: Uses tried and tested methods to construct features
2. Domain-based approach: Incorporates domain knowledge of the dataset's subject
matter into constructing new features

What is Feature Engineering?


Feature Engineering is the process of creating new features or transforming existing
features to improve the performance of a machine-learning model. It involves
selecting relevant information from raw data and transforming it into a format that can
be easily understood by a model. The goal is to improve model accuracy by
providing more meaningful and relevant information.

The success of machine learning models heavily depends on the quality of the
features used to train them. Feature engineering involves a set of techniques that
enable us to create new features by combining or transforming the existing ones. These
techniques help to highlight the most important patterns and relationships in the
data, which in turn helps the machine learning model to learn from the data more
effectively.

What is a Feature?
In the context of machine learning, a feature (also known as a variable or attribute)
is an individual measurable property or characteristic of a data point that is used as
input for a machine learning algorithm. Features can be numerical, categorical, or
text-based, and they represent different aspects of the data that are relevant to the
problem at hand.
• For example, in a dataset of housing prices, features could include the number of
bedrooms, the square footage, the location, and the age of the property. In a da-
taset of customer demographics, features could include age, gender, income level,
and occupation.
• The choice and quality of features are critical in machine learning, as they can
greatly impact the accuracy and performance of the model.
Need for Feature Engineering in Machine Learning
Features are engineered for various reasons. Viewed from the side of the product or
service that a model supports, some of the main reasons include:
• Improve User Experience: The primary reason we engineer features is to enhance
the user experience of a product or service. By adding new features, we can make
the product more intuitive, efficient, and user-friendly, which can increase user sat-
isfaction and engagement.
• Competitive Advantage: Another reason we engineer features is to gain a competi-
tive advantage in the marketplace. By offering unique and innovative features, we
can differentiate our product from competitors and attract more customers.
• Meet Customer Needs: We engineer features to meet the evolving needs of custom-
ers. By analyzing user feedback, market trends, and customer behavior, we can
identify areas where new features could enhance the product’s value and meet cus-
tomer needs.
• Increase Revenue: Features can also be engineered to generate more revenue. For
example, a new feature that streamlines the checkout process can increase sales,
or a feature that provides additional functionality could lead to more upsells or
cross-sells.
• Future-Proofing: Engineering features can also be done to future-proof a product or
service. By anticipating future trends and potential customer needs, we can de-
velop features that ensure the product remains relevant and useful in the long term.
Processes Involved in Feature Engineering
Feature engineering in Machine learning consists of mainly 5 processes: Feature
Creation, Feature Transformation, Feature Extraction, Feature Selection, and Fea-
ture Scaling. It is an iterative process that requires experimentation and testing to
find the best combination of features for a given problem. The success of a machine
learning model largely depends on the quality of the features used in the model.
1. Feature Creation
Feature Creation is the process of generating new features based on domain
knowledge or by observing patterns in the data. It is a form of feature engineering
that can significantly improve the performance of a machine-learning model.
Types of Feature Creation:
1. Domain-Specific: Creating new features based on domain knowledge, such as cre-
ating features based on business rules or industry standards.
2. Data-Driven: Creating new features by observing patterns in the data, such as cal-
culating aggregations or creating interaction features.
3. Synthetic: Generating new features by combining existing features or synthesizing
new data points.
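
The three types of feature creation above can be sketched on a toy housing table. Everything in this example is hypothetical: the records, the field names, the reference year, and the helper create_features are assumptions made for illustration only.

```python
# Hypothetical housing records; field names and values are made up.
houses = [
    {"sqft": 1500, "bedrooms": 3, "year_built": 1995},
    {"sqft": 2000, "bedrooms": 4, "year_built": 2010},
    {"sqft": 900, "bedrooms": 2, "year_built": 1980},
]
REFERENCE_YEAR = 2024  # assumed "current" year for computing age

# Data-driven aggregate computed over the whole dataset.
mean_sqft = sum(h["sqft"] for h in houses) / len(houses)


def create_features(house):
    """Return a copy of the record with three engineered features added."""
    enriched = dict(house)
    # Domain-specific: the age of a property is a familiar pricing factor.
    enriched["age"] = REFERENCE_YEAR - house["year_built"]
    # Combination of two existing features.
    enriched["sqft_per_bedroom"] = house["sqft"] / house["bedrooms"]
    # Data-driven: size relative to the dataset average.
    enriched["sqft_vs_mean"] = house["sqft"] / mean_sqft
    return enriched


engineered = [create_features(h) for h in houses]
for row in engineered:
    print(row)
```

Note that none of the new features is derived from the prediction target, which avoids leaking the answer into the inputs.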
Why Feature Creation?
1. Improves Model Performance: By providing additional and more relevant infor-
mation to the model, feature creation can increase the accuracy and precision of
the model.
2. Increases Model Robustness: By adding additional features, the model can become
more robust to outliers and other anomalies.
3. Improves Model Interpretability: By creating new features, it can be easier to under-
stand the model’s predictions.
4. Increases Model Flexibility: By adding new features, the model can be made more
flexible to handle different types of data.
2. Feature Transformation
Feature Transformation is the process of transforming the features into a more suit-
able representation for the machine learning model. This is done to ensure that the
model can effectively learn from the data.
Types of Feature Transformation:
1. Normalization: Rescaling the features to a common range, such as between 0
and 1, to prevent some features from dominating others.
2. Scaling: Rescaling numerical features to a similar scale, such as a standard
deviation of 1, so that they can be compared more easily and the model considers
all features equally.
3. Encoding: Transforming categorical features into a numerical representation.
Examples are one-hot encoding and label encoding.
4. Transformation: Transforming the features using mathematical operations to
change the distribution or scale of the features. Examples are logarithmic, square
root, and reciprocal transformations.
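
Two of the transformations listed above, one-hot encoding and a logarithmic transformation, can be sketched with the standard library alone. The helper one_hot and the sample values are assumptions made for this example; in practice a library encoder would normally be used.

```python
import math


def one_hot(values):
    """One-hot encode a list of category labels.

    Hypothetical helper: columns follow the sorted order of the categories.
    """
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return rows, categories


colors = ["red", "green", "blue", "green"]  # made-up categorical feature
encoded, categories = one_hot(colors)
print(categories)
print(encoded)

# A log transformation compresses a skewed numeric feature.
incomes = [20000, 45000, 1000000]  # made-up values with one extreme entry
log_incomes = [math.log1p(x) for x in incomes]
print(log_incomes)
```

After the log transformation the largest income is less than twice the smallest on the new scale, instead of fifty times larger on the raw scale.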
Why Feature Transformation?
1. Improves Model Performance: By transforming the features into a more suitable
representation, the model can learn more meaningful patterns in the data.
2. Increases Model Robustness: Transforming the features can make the model more
robust to outliers and other anomalies.
3. Improves Computational Efficiency: The transformed features often require fewer
computational resources.
4. Improves Model Interpretability: By transforming the features, it can be easier to un-
derstand the model’s predictions.
3. Feature Extraction
Feature Extraction is the process of creating new features from existing ones to pro-
vide more relevant information to the machine learning model. This is done by
transforming, combining, or aggregating existing features.
Types of Feature Extraction:
1. Dimensionality Reduction: Reducing the number of features by transforming the
data into a lower-dimensional space while retaining important information. Exam-
ples are PCA and t-SNE.
2. Feature Combination: Combining two or more existing features to create a new one.
For example, the interaction between two features.
3. Feature Aggregation: Aggregating features to create a new one. For example, calcu-
lating the mean, sum, or count of a set of features.
4. Feature Transformation: Transforming existing features into a new representation.
For example, log transformation of a feature with a skewed distribution.
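
Dimensionality reduction with PCA, mentioned in the list above, can be sketched with scikit-learn on the iris dataset used earlier. Choosing two components is an assumption made for illustration; the right number depends on the data.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)  # 150 samples, 4 features

# Project the four original features onto two principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Fraction of the total variance kept by the two components.
retained = pca.explained_variance_ratio_.sum()
print(X_reduced.shape)
print(retained)
```

The explained variance ratio is a quick check on how much information the lower-dimensional representation keeps.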
Why Feature Extraction?
1. Improves Model Performance: By creating new and more relevant features, the
model can learn more meaningful patterns in the data.
2. Reduces Overfitting: By reducing the dimensionality of the data, the model is less
likely to overfit the training data.
3. Improves Computational Efficiency: The transformed features often require fewer
computational resources.
4. Improves Model Interpretability: By creating new features, it can be easier to under-
stand the model’s predictions.
4. Feature Selection
Feature Selection is the process of selecting a subset of relevant features from the
dataset to be used in a machine-learning model. It is an important step in the fea-
ture engineering process as it can have a significant impact on the model’s perfor-
mance.
Types of Feature Selection:
1. Filter Method: Based on a statistical measure of the relationship between each
feature and the target variable. Features with a strong relationship, such as a high
correlation, are selected.
2. Wrapper Method: Based on evaluating feature subsets using a specific machine
learning algorithm. The feature subset that results in the best performance is
selected.
3. Embedded Method: Feature selection is performed as part of the training process of
the machine learning algorithm.
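
A filter method can be sketched with scikit-learn's SelectKBest, which scores each feature against the target and keeps the top k. The choice of the ANOVA F-test (f_classif) as the scoring function and of k=2 are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()

# Score every feature against the class labels and keep the two best.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(iris.data, iris.target)

# Map the surviving column indices back to feature names.
selected_names = [iris.feature_names[i]
                  for i in selector.get_support(indices=True)]
print(selected_names)
print(X_selected.shape)
```

Because the scores are computed independently of any model, a filter method is fast, but it can miss features that are only useful in combination.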
Why Feature Selection?
1. Reduces Overfitting: By using only the most relevant features, the model can gener-
alize better to new data.
2. Improves Model Performance: Selecting the right features can improve the accu-
racy, precision, and recall of the model.
3. Decreases Computational Costs: A smaller number of features requires less com-
putation and storage resources.
4. Improves Interpretability: By reducing the number of features, it is easier to under-
stand and interpret the results of the model.
5. Feature Scaling
Feature Scaling is the process of transforming the features so that they have a simi-
lar scale. This is important in machine learning because the scale of the features
can affect the performance of the model.
Types of Feature Scaling:
1. Min-Max Scaling: Rescaling the features to a specific range, such as between 0 and
1, by subtracting the minimum value and dividing by the range.
2. Standard Scaling: Rescaling the features to have a mean of 0 and a standard devia-
tion of 1 by subtracting the mean and dividing by the standard deviation.
3. Robust Scaling: Rescaling the features to be robust to outliers by subtracting the
median and dividing by the interquartile range.
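
The three scaling formulas can be written out directly with NumPy. This is a sketch on a made-up column that contains one outlier. The robust variant here subtracts the median before dividing by the interquartile range, which is the usual definition.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # made-up feature with an outlier

# Min-max scaling: subtract the minimum, divide by the range.
min_max = (values - values.min()) / (values.max() - values.min())

# Standard scaling: subtract the mean, divide by the standard deviation.
standard = (values - values.mean()) / values.std()

# Robust scaling: subtract the median, divide by the interquartile range.
q1, q3 = np.percentile(values, [25, 75])
robust = (values - np.median(values)) / (q3 - q1)

print(min_max)
print(standard)
print(robust)
```

With min-max scaling the outlier squeezes the four ordinary values into a narrow band near 0, while robust scaling keeps them evenly spread around 0.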
Why Feature Scaling?
1. Improves Model Performance: By transforming the features to have a similar scale,
the model can learn from all features equally and avoid being dominated by a few
large features.
2. Increases Model Robustness: By transforming the features to be robust to outliers,
the model can become more robust to anomalies.
3. Improves Computational Efficiency: Gradient-based optimizers typically converge
faster on scaled features, and distance-based algorithms such as k-nearest neighbors,
which are sensitive to the scale of the features, perform better with scaled features.
4. Improves Model Interpretability: By transforming the features to have a similar
scale, it can be easier to understand the model’s predictions.
What are the Steps in Feature Engineering?
The steps for feature engineering vary between ML engineers and data scientists.
Some of the common steps involved in most machine-learning workflows are:
1. Data Cleansing
• Data cleansing (also known as data cleaning or data scrubbing) involves identifying
and removing or correcting any errors or inconsistencies in the dataset.
This step is important to ensure that the data is accurate and reliable.
2. Data Transformation
3. Feature Extraction
4. Feature Selection
• Feature selection involves selecting the most relevant features from the da-
taset for use in machine learning. This can include techniques like correlation
analysis, mutual information, and stepwise regression.
5. Feature Iteration
• Feature iteration involves refining and improving the features based on the per-
formance of the machine learning model. This can include techniques like add-
ing new features, removing redundant features and transforming features in
different ways.

Sklearn Model Selection


Sklearn's model selection module provides various functions to cross-validate our
model, tune the estimator's hyperparameters, or produce validation and learning
curves.
Here is a list of the functions provided in this module. Later we will understand the
theory and use of these functions with code examples.
Splitter Classes
• model_selection.GroupKFold([n_splits]): A variant of K-Fold cross-validation that forms
non-overlapping groups.
• model_selection.GroupShuffleSplit([...]): Performs Shuffle-Group(s)-Out cross-validation.
• model_selection.KFold([n_splits, shuffle, ...]): Performs K-Fold cross-validation.
• model_selection.LeaveOneGroupOut(): Performs Leave One Group Out cross-validation.
• model_selection.LeavePGroupsOut(n_groups): Performs Leave P Group(s) Out cross-validation.
• model_selection.LeaveOneOut(): Performs Leave One Out cross-validation.
• model_selection.LeavePOut(p): A general version of Leave One Out, i.e. Leave P Out
cross-validation.
• model_selection.PredefinedSplit(test_fold): Performs cross-validation on a predefined
split.
• model_selection.RepeatedKFold(*[, n_splits, ...]): Performs repeated K-Fold
cross-validation.
• model_selection.RepeatedStratifiedKFold(*[, ...]): Performs repeated stratified K-Fold
cross-validation.
• model_selection.ShuffleSplit([n_splits, ...]): Performs cross-validation on shuffled
splits of the dataset using random permutations.
• model_selection.StratifiedKFold([n_splits, ...]): Performs stratified K-Fold
cross-validation.
• model_selection.StratifiedShuffleSplit([...]): Performs stratified shuffle split
cross-validation.
• model_selection.StratifiedGroupKFold([...]): Performs stratified K-Fold cross-validation
with non-overlapping groups.
• model_selection.TimeSeriesSplit([n_splits, ...]): Performs cross-validation for time
series data.
Splitter Functions
• model_selection.check_cv([cv, y, classifier]): Input-checking utility for building a
cross-validator.
• model_selection.train_test_split(*arrays[, ...]): Splits arrays or matrices into random
training and testing subsets.

Hyper-parameter optimizers
• model_selection.GridSearchCV(estimator, ...): Executes an exhaustive search over the
specified parameter values for an estimator.
• model_selection.HalvingGridSearchCV(...[, ...]): Executes a search over the given
parameters by using successive halving.
• model_selection.ParameterGrid(param_grid): Represents a grid of parameters, where each
parameter has a discrete set of values.
• model_selection.ParameterSampler(...[, ...]): Acts as a generator of parameter samples
taken from the given distributions.
• model_selection.RandomizedSearchCV(...[, ...]): Performs a randomized search on the
hyper-parameters.
• model_selection.HalvingRandomSearchCV(...[, ...]): Performs a randomized search on the
hyper-parameters by using successive halving.
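As an illustration of the hyper-parameter optimizers listed above, here is a minimal
sketch of GridSearchCV. The choice of the iris dataset, a k-nearest neighbors
classifier, and the candidate values for n_neighbors are assumptions made only for
this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Candidate hyper-parameter values to try (chosen arbitrarily for illustration)
param_grid = {"n_neighbors": [1, 3, 5, 7]}

# Every candidate is scored with 5-fold cross-validation
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, best_params_ holds the winning combination and best_score_ holds its
mean cross-validated score.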

Cross-validation: assessing the performance of the estimator


A fundamental error data scientists make while creating a model is learning the
parameters of a prediction function and evaluating the model on the same dataset. A
model that simply repeats the labels of the data points it has just been trained on
would score well but be unable to make useful predictions about data it has not yet
seen. This situation is called overfitting.
To avoid this problem, it is common to reserve a portion of the given data as a
validation or test set (X_test, y_test) when conducting a machine learning experiment.
Note that the term "experiment" does not refer only to academic use: in business
settings, machine learning projects also often begin as experiments.

Validation and prediction

Cross-Validation in Sklearn
Cross-validation helps data scientists working with machine learning models in two key
ways: it can reduce the amount of data needed, and it helps ensure that the model is
reliable. Cross-validation achieves this at the expense of extra computation, so it is
important to understand how it operates before deciding to use it.
In this section, we'll quickly go over the advantages of cross-validation and then go
through several of its techniques from the well-known Python Scikit-learn package.
What is Cross-validation?
A fundamental error is to train a model and then use the same data to test it and
compute a validation score. A model that simply repeats the labels of the samples it
has just seen would receive a perfect score but be unable to make useful predictions
about unseen data. This situation is called overfitting. To avoid it, it is customary
to reserve a portion of the given data as a test set (X_test, y_test) when conducting
a (supervised) machine learning experiment. Cross-validation extends this idea by
evaluating the model on several different train-test splits of the data rather than a
single one.
How does Cross Validation Solve the Problem of Overfitting?
During cross-validation, we create several small train-test splits from our initial
training dataset and use these splits to tune and evaluate the model. For instance, in
standard k-fold cross-validation we divide the dataset into k subsets (folds). The
model is then trained on k-1 of the folds and tested on the remaining fold, and this is
repeated until every fold has served once as the test set. In this manner, the model
is always tested on data it was not trained on. The most common cross-validation
approaches are described below.

There is still a risk of overfitting to the test dataset when comparing various
settings for estimators, because we can keep adjusting the parameters until the
estimator performs at its best on that test set. In this way, information about the
test dataset can "leak" into the model, and evaluation measures may no longer reflect
generalization performance. We can resolve this issue by holding out another portion
of the dataset as a "validation set": training is conducted on the training dataset,
followed by evaluation on the validation dataset, and when the experiment appears to
have succeeded, we perform a final assessment on the test set.
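The train / validation / test workflow described above can be sketched by calling
train_test_split twice. The 60/20/20 proportions and the synthetic arrays are
assumptions chosen only to make the example concrete.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples with 2 features each and a binary label
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# First split off 20% of the data as the final test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Then split the remaining 80% into training (60%) and validation (20%) sets
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```

The test set is touched only once, for the final assessment; all tuning decisions are
made against the validation set.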
Data Size Reduction
Usually, the data is divided into three sets.
Training: used to fit the parameters of the machine learning model.

Validation: used to tune the hyperparameters and to check that the improved model still
performs well on data it was not trained on.
Testing: used for the last check on completely unseen data because, when optimizing,
some knowledge about the validation dataset seeps into the model through the choice of
parameters.
Because cross-validation trains and validates on different splits of the same dataset,
adding it to the workflow removes the need for a separate validation dataset.
Robust Process
Even when sklearn's train_test_split method is used with a stratified split
(stratify=y), which keeps the target variable's distribution the same in both the train
and test sets, it is still possible to unintentionally train on a subset that doesn't
accurately represent the real world. Cross-validation makes the process more robust by
evaluating the model on several different splits.
Methods of Cross-Validation with Sklearn
HoldOut Cross Validation or Train-Test Split
This cross-validation procedure randomly divides the entire dataset into a training
dataset and a validation dataset. Generally, approximately 70% of the whole da-
taset is utilized as a training set, and the leftover 30% is taken as a validation da-
taset.
The advantage of this method is that we only need to divide the dataset into the
training and validation sets once. The machine learning model will only need to be
trained once based on the training dataset, allowing for quick execution.
This method is not appropriate for an unbalanced dataset. Consider an unbalanced
dataset with classes "0" and "1" in which 80% of the data falls under class "0" and the
remaining 20% under class "1". Suppose we perform a train-test split with the training
dataset making up 80% of the data and the test dataset 20%. In the worst case, the
training dataset may contain only class "0" data and the test dataset only class "1"
data. Since our model has never encountered class "1" data, it will not generalize
well to the test data.
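A hold-out split on the 80/20 class mix from the example can be sketched as follows.
The synthetic data is an assumption, and the stratify=y argument is one common
mitigation for the imbalance problem rather than part of the basic hold-out method.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic unbalanced dataset: 80 samples of class 0 and 20 samples of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

# stratify=y asks for the same class proportions in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print("class 1 in train:", int(y_train.sum()), "of", len(y_train))
print("class 1 in test:", int(y_test.sum()), "of", len(y_test))
```

With stratification, both subsets keep the original 20% share of class "1", so the
worst case described above cannot occur.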

K-Fold Cross Validation


The K-Fold cross-validation procedure divides the entire dataset into K equal-sized
pieces. Each piece is referred to as a "fold", and the method is called K-Fold because
there are K of them. One of the folds is used as the validation dataset, while the
other K-1 folds are used as the training dataset.

The procedure is repeated K times, until each fold has been used once as the
validation dataset with the remaining folds as the training dataset.
The model's final accuracy is the average of the K validation accuracies obtained in
these runs.
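The K-Fold procedure above can be sketched with KFold and cross_val_score. The iris
dataset, the decision tree classifier, and K = 5 are assumptions chosen for
illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# K = 5 folds; shuffling first because the iris samples are ordered by class
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# One accuracy value per fold: train on 4 folds, validate on the remaining one
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=kfold)

# The final estimate is the average over the K validation folds
mean_accuracy = scores.mean()
print(scores, mean_accuracy)
```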

Stratified K-Fold Cross Validation


The improved K-Fold cross-validation method known as stratified K-Fold is typically
applied to unbalanced datasets. The entire dataset is split into K-folds of the same
size, just like K-fold.
However, in this method, each fold contains approximately the same proportion of each
target class as the entire dataset.
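The class-proportion property of stratified K-Fold can be checked on a small synthetic
dataset. The 8-to-4 class mix and K = 4 are assumptions chosen so that the folds divide
evenly.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic unbalanced labels: 8 samples of class 0 and 4 samples of class 1
X = np.zeros((12, 1))
y = np.array([0] * 8 + [1] * 4)

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

# Count how many class-1 samples land in each validation fold
fold_positive_counts = [
    int(y[test_idx].sum()) for _, test_idx in skf.split(X, y)
]
print(fold_positive_counts)
```

Each validation fold holds three samples, one of which belongs to class 1, matching
the one-third share of class 1 in the whole dataset.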

What is Predictive Analytics and How does it Work?


Predictive analytics is the practice of using statistical algorithms and machine
learning techniques to analyze historical data, identify patterns, and predict future
outcomes. This powerful tool has become essential, enabling organizations to predict
trends, reduce risks, and make informed decisions. This section explores the
importance, working, and applications of predictive analytics.

Why is Predictive Analytics Important?


Predictive analytics is important for several reasons:
• Informed Decision-Making: By anticipating future trends and outcomes, businesses
and organizations can make more strategic decisions. Imagine being able to predict
customer churn (when a customer stops using your service) or equipment failure
before it happens. This allows for proactive measures to retain customers or pre-
vent costly downtime.
• Risk Management: Predictive analytics helps identify and mitigate potential risks.
For example, financial institutions can use it to detect fraudulent transactions,
while healthcare providers can predict the spread of diseases.
• Optimization and Efficiency: Predictive models can optimize processes and re-
source allocation. Businesses can forecast demand and optimize inventory levels,
or predict equipment maintenance needs to avoid disruptions.
• Personalized Experiences: Predictive analytics allows for personalization and cus-
tomization. Retailers can use it to recommend products to customers based on
their past purchases and browsing behavior.
• Innovation and Competitive Advantage: Predictive analytics empowers organiza-
tions to identify new opportunities and develop innovative products and services.
By understanding customer needs and market trends, businesses can stay ahead of
the competition.
How Does Predictive Analytics Modeling Work?

Types of Machine Learning


Machine learning is the branch of Artificial Intelligence that focuses on developing
models and algorithms that let computers learn from data and improve from previous
experience without being explicitly programmed for every task. In simple words, ML
teaches systems to recognize patterns, somewhat as humans do, by learning from data.
In this section, we will explore the main types of machine learning algorithms. Machine
learning generally means training a system to learn from past experience and improve
its performance over time. It helps to make predictions from massive amounts of data
and to deliver fast and accurate results that can reveal profitable opportunities.
Types of Machine Learning
There are several types of machine learning, each with special characteristics and
applications. Some of the main types of machine learning algorithms are as follows:
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
4. Reinforcement Learning
Types of Machine Learning

1. Supervised Machine Learning


In supervised learning, a model is trained on a "labelled dataset", that is, a dataset
containing both the input features and the correct outputs. Supervised learning
algorithms learn a mapping from inputs to the correct outputs. Both the training and
the validation datasets are labelled.

Supervised Learning
Let’s understand it with the help of an example.
Example: Consider a scenario where you have to build an image classifier to
differentiate between cats and dogs. If you feed labelled images of dogs and cats to
the algorithm, the machine will learn to tell a dog from a cat using these labelled
images. When we input new dog or cat images that it has never seen before, it will use
the learned model to predict whether each is a dog or a cat. This is how supervised
learning works; this particular task is image classification.
There are two main categories of supervised learning that are mentioned below:
• Classification
• Regression
Classification
Classification deals with predicting categorical target variables, which represent
discrete classes or labels. For instance, classifying emails as spam or not spam, or
predicting whether a patient has a high risk of heart disease. Classification algo-
rithms learn to map the input features to one of the predefined classes.
Here are some classification algorithms:
• Logistic Regression
• Support Vector Machine
• Random Forest
• Decision Tree
• K-Nearest Neighbors (KNN)
• Naive Bayes
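As a minimal sketch of classification, the following trains one of the listed
algorithms, K-Nearest Neighbors, on a tiny made-up dataset with a single feature. The
data and the choice of n_neighbors=3 are assumptions for illustration only.

```python
from sklearn.neighbors import KNeighborsClassifier

# Labelled training data: small feature values are class 0, large ones class 1
X_train = [[0], [1], [2], [8], [9], [10]]
y_train = [0, 0, 0, 1, 1, 1]

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Predict the class of two inputs the model has not seen before
predictions = clf.predict([[1.5], [9.5]])
print(predictions)
```

Each new point receives the majority class of its three nearest training points, so
1.5 is assigned to class 0 and 9.5 to class 1.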
Regression
Regression, on the other hand, deals with predicting continuous target variables,
which represent numerical values. For example, predicting the price of a house
based on its size, location, and amenities, or forecasting the sales of a product. Re-
gression algorithms learn to map the input features to a continuous numerical
value.
Here are some regression algorithms:
• Linear Regression
• Polynomial Regression
• Ridge Regression
• Lasso Regression
• Decision Tree
• Random Forest
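As a minimal sketch of regression, the following fits Linear Regression to made-up
points that lie exactly on the line y = 2x + 1. The data is an assumption chosen so
that the learned coefficients are easy to verify.

```python
from sklearn.linear_model import LinearRegression

# Inputs and continuous targets generated from y = 2x + 1
X = [[1], [2], [3], [4]]
y = [3, 5, 7, 9]

model = LinearRegression().fit(X, y)

# Predict a continuous value for an unseen input
prediction = model.predict([[5]])[0]
print(model.coef_[0], model.intercept_, prediction)
```

The model recovers a slope of about 2 and an intercept of about 1, and therefore
predicts a value of about 11 for x = 5.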
Advantages of Supervised Machine Learning
• Supervised learning models can have high accuracy as they are trained on labelled
data.
• The process of decision-making in supervised learning models is often interpretable.
• Pre-trained supervised models can often be reused, which saves time and resources
compared with developing new models from scratch.
Disadvantages of Supervised Machine Learning
• It is limited to the patterns present in the training data and may struggle with
unseen or unexpected patterns.
• It can be time-consuming and costly, as it relies entirely on labeled data.
• It may generalize poorly to new data.
Applications of Supervised Learning
Supervised learning is used in a wide variety of applications, including:
• Image classification: Identify objects, faces, and other features in images.
• Natural language processing: Extract information from text, such as sentiment, en-
tities, and relationships.
• Speech recognition: Convert spoken language into text.
• Recommendation systems: Make personalized recommendations to users.
• Predictive analytics: Predict outcomes, such as sales, customer churn, and stock
prices.
• Medical diagnosis: Detect diseases and other medical conditions.
• Fraud detection: Identify fraudulent transactions.
• Autonomous vehicles: Recognize and respond to objects in the environment.
• Email spam detection: Classify emails as spam or not spam.
• Quality control in manufacturing: Inspect products for defects.
• Credit scoring: Assess the risk of a borrower defaulting on a loan.
• Gaming: Recognize characters, analyze player behavior, and create NPCs.
• Customer support: Automate customer support tasks.
• Weather forecasting: Make predictions for temperature, precipitation, and other
meteorological parameters.
• Sports analytics: Analyze player performance, make game predictions, and opti-
mize strategies.
2. Unsupervised Machine Learning
Unsupervised learning is a type of machine learning technique in which an algorithm
discovers patterns and relationships using unlabeled data. Unlike supervised learning,
unsupervised learning doesn’t involve providing the algorithm with labeled target
outputs. The primary goal of unsupervised learning is often to discover hidden
patterns, similarities, or clusters within the data, which can then be used for various
purposes, such as data exploration, visualization, dimensionality reduction, and more.

Unsupervised Learning
Let’s understand it with the help of an example.
Example: Consider a dataset that contains information about the purchases customers
made at a shop. Through clustering, the algorithm can group customers with similar
purchasing behavior, revealing customer segments without predefined labels. This type
of information can help businesses target customers as well as identify outliers.
There are two main categories of unsupervised learning that are mentioned below:
• Clustering
• Association
Clustering
Clustering is the process of grouping data points into clusters based on their simi-
larity. This technique is useful for identifying patterns and relationships in data
without the need for labeled examples.
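Here are some clustering algorithms:
• K-Means Clustering algorithm
• Mean-shift algorithm
• DBSCAN Algorithm
Related unsupervised techniques that reduce dimensionality or separate signals rather
than form clusters:
• Principal Component Analysis
• Independent Component Analysis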
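As a minimal sketch of clustering, the following runs K-Means on six made-up 2-D points
that form two well-separated groups. The points and the choice of two clusters are
assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabelled points: three near (1, 1) and three near (8, 9)
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
              [8.0, 8.0], [9.0, 9.0], [8.0, 9.5]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# labels_ holds the cluster index assigned to each point
labels = kmeans.labels_
print(labels)
print(kmeans.cluster_centers_)
```

No labels are supplied; the algorithm groups the points purely by similarity. The
cluster indices themselves (0 or 1) are arbitrary, only the grouping is meaningful.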
Association
Association rule learning is a technique for discovering relationships between items
in a dataset. It identifies rules indicating that the presence of one item implies the
presence of another item with a specific probability.
Here are some association rule learning algorithms:
• Apriori Algorithm
• Eclat
• FP-growth Algorithm
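The two basic measures behind association rules, support and confidence, can be
sketched in plain Python. The transactions below are made up, and the helper functions
are hypothetical names introduced for this example; algorithms such as Apriori use
these measures to search for rules efficiently but are not implemented here.

```python
# Hypothetical shopping baskets
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
    {"bread", "butter"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

# Rule: {bread} -> {butter}
rule_support = support({"bread", "butter"})
rule_confidence = confidence({"bread"}, {"butter"})
print(rule_support, rule_confidence)
```

Here bread and butter appear together in 3 of 5 baskets (support 0.6), and butter
appears in 3 of the 4 baskets that contain bread (confidence of about 0.75).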
Advantages of Unsupervised Machine Learning
• It helps to discover hidden patterns and various relationships in the data.
• It is used for tasks such as customer segmentation, anomaly detection, and data
exploration.
• It does not require labeled data, which reduces the effort of data labeling.
• Techniques such as autoencoders and dimensionality reduction can be used to extract
meaningful features from raw data.
Disadvantages of Unsupervised Machine Learning
• Without labels, it may be difficult to evaluate the quality of the model’s output.
• Clusters may be hard to interpret and may not correspond to meaningful groups.
Applications of Unsupervised Learning
Here are some common applications of unsupervised learning:
• Clustering: Group similar data points into clusters.
• Anomaly detection: Identify outliers or anomalies in data.
• Dimensionality reduction: Reduce the dimensionality of data while preserving its
essential information.
• Recommendation systems: Suggest products, movies, or content to users based on
their historical behavior or preferences.
• Topic modeling: Discover latent topics within a collection of documents.
• Density estimation: Estimate the probability density function of data.
• Image and video compression: Reduce the amount of storage required for multime-
dia content.
• Data preprocessing: Help with data preprocessing tasks such as data cleaning, im-
putation of missing values, and data scaling.
• Market basket analysis: Discover associations between products.
• Genomic data analysis: Identify patterns or group genes with similar expression
profiles.
• Image segmentation: Segment images into meaningful regions.
• Community detection in social networks: Identify communities or groups of individ-
uals with similar interests or connections.
• Customer behavior analysis: Uncover patterns and insights for better marketing
and product recommendations.
• Content recommendation: Classify and tag content to make it easier to recommend
similar items to users.
• Exploratory data analysis (EDA): Explore data and gain insights before defining spe-
cific tasks.
3. Semi-Supervised Learning
Semi-supervised learning is a machine learning approach that sits between supervised
and unsupervised learning: it uses both labelled and unlabelled data. It is
particularly useful when obtaining labeled data is costly, time-consuming, or
resource-intensive, for example when labeling requires specialist skills.
We use these techniques when only a small part of the data is labeled and the large
remainder is unlabeled. We can use unsupervised techniques to predict labels and then
feed these labels to supervised techniques. This approach is especially applicable to
image datasets, where usually not all images are labeled.

Semi-Supervised Learning
Let’s understand it with the help of an example.
Example: Consider that we are building a language translation model. Having labeled
translations for every sentence pair can be resource-intensive. Semi-supervised
learning allows the model to learn from both labeled and unlabeled sentence pairs,
making it more accurate. This technique has led to significant improvements in the
quality of machine translation services.
Types of Semi-Supervised Learning Methods
There are a number of different semi-supervised learning methods each with its
own characteristics. Some of the most common ones include:
• Graph-based semi-supervised learning: This approach uses a graph to represent the
relationships between the data points. The graph is then used to propagate labels
from the labeled data points to the unlabeled data points.
• Label propagation: This approach iteratively propagates labels from the labeled
data points to the unlabeled data points, based on the similarities between the data
points.
• Co-training: This approach trains two different machine learning models on different
views (feature subsets) of the labeled data. Each model then labels unlabeled
examples for the other model to train on.
• Self-training: This approach trains a machine learning model on the labeled data
and then uses the model to predict labels for the unlabeled data. The model is then
retrained on the labeled data and the predicted labels for the unlabeled data.
• Generative adversarial networks (GANs): GANs are a type of deep learning algorithm
that can be used to generate synthetic data. GANs can be used to generate unla-
beled data for semi-supervised learning by training two neural networks, a genera-
tor and a discriminator.
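Of the methods above, self-training is the simplest to sketch. The loop below is a
hand-written illustration, not a library routine: the 1-D data, the logistic
regression base model, the confidence threshold of 0.7, and the limit of five rounds
are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Two labeled points and six unlabeled points (made-up 1-D data)
X_labeled = np.array([[0.0], [10.0]])
y_labeled = np.array([0, 1])
X_unlabeled = np.array([[1.0], [2.0], [3.0], [7.0], [8.0], [9.0]])
THRESHOLD = 0.7  # hypothetical confidence cut-off for accepting a pseudo-label

X_train, y_train = X_labeled.copy(), y_labeled.copy()
remaining = X_unlabeled.copy()
model = LogisticRegression().fit(X_train, y_train)

for _ in range(5):  # a few self-training rounds
    if len(remaining) == 0:
        break
    proba = model.predict_proba(remaining)
    confident = proba.max(axis=1) >= THRESHOLD
    if not confident.any():
        break
    # Treat confident predictions as labels and add them to the training set
    pseudo_labels = model.classes_[proba.argmax(axis=1)]
    X_train = np.vstack([X_train, remaining[confident]])
    y_train = np.concatenate([y_train, pseudo_labels[confident]])
    remaining = remaining[~confident]
    # Retrain on the labeled data plus the pseudo-labeled data
    model = LogisticRegression().fit(X_train, y_train)

final_predictions = model.predict(X_unlabeled)
print(final_predictions)
```

The threshold controls the trade-off: a higher value adds fewer but more reliable
pseudo-labels per round, while a lower value risks reinforcing early mistakes.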
Advantages of Semi-Supervised Machine Learning
• It can lead to better generalization than purely supervised learning when labeled
data is scarce, as it uses both labeled and unlabeled data.
• It can be applied to a wide range of data.
Disadvantages of Semi-Supervised Machine Learning
• Semi-supervised methods can be more complex to implement compared to other
approaches.
• It still requires some labeled data that might not always be available or easy to
obtain.
• Noisy or unrepresentative unlabeled data can degrade the model's performance.
Applications of Semi-Supervised Learning
Here are some common applications of semi-supervised learning:
• Image Classification and Object Recognition: Improve the accuracy of models by
combining a small set of labeled images with a larger set of unlabeled images.
• Natural Language Processing (NLP): Enhance the performance of language models
and classifiers by combining a small set of labeled text data with a vast amount of
unlabeled text.
• Speech Recognition: Improve the accuracy of speech recognition by leveraging a
limited amount of transcribed speech data and a more extensive set of unlabeled
audio.
• Recommendation Systems: Improve the accuracy of personalized recommenda-
tions by supplementing a sparse set of user-item interactions (labeled data) with a
wealth of unlabeled user behavior data.
• Healthcare and Medical Imaging: Enhance medical image analysis by utilizing a
small set of labeled medical images alongside a larger set of unlabeled images.
4. Reinforcement Machine Learning
Reinforcement learning is a learning method in which an agent interacts with the
environment by taking actions and learning from the resulting rewards and errors.
Trial-and-error search and delayed reward are the most relevant characteristics of
reinforcement learning. In this technique, the model keeps improving its performance
by using reward feedback to learn the behavior or pattern. These algorithms are
tailored to a particular problem, e.g. the Google self-driving car, or AlphaGo, where
a bot competes with humans and even with itself to become a better and better player
of the game of Go. Each interaction produces new experience that the agent adds to
its training data, so the more it interacts, the better trained and more experienced
it becomes.
Here are some of most common reinforcement learning algorithms:
• Q-learning: Q-learning is a model-free RL algorithm that learns a Q-function, which
maps state-action pairs to values. The Q-function estimates the expected cumulative
reward of taking a particular action in a given state.
• SARSA (State-Action-Reward-State-Action): SARSA is another model-free RL algorithm
that learns a Q-function. However, unlike Q-learning, SARSA updates the Q-function
using the next action that was actually taken, rather than the optimal next action.
• Deep Q-learning: Deep Q-learning is a combination of Q-learning and deep learning.
Deep Q-learning uses a neural network to represent the Q-function, which allows it
to learn complex relationships between states and actions.
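Tabular Q-learning can be sketched on a toy problem. The environment below is a
made-up corridor of five states in which the agent earns a reward of 1 for reaching
the right-hand end; the learning rate, discount factor, exploration rate, and episode
count are assumptions chosen for illustration.

```python
import random

N_STATES = 5          # states 0..4; state 4 is terminal and gives reward 1
ACTIONS = [-1, +1]    # action 0 moves left, action 1 moves right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.3

def step(state, action):
    """Apply a move, staying inside the corridor; return (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

random.seed(0)
Q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q[state][action index]

for _ in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy action selection, with random tie-breaking
        if random.random() < EPSILON or Q[state][0] == Q[state][1]:
            a = random.randrange(2)
        else:
            a = 0 if Q[state][0] > Q[state][1] else 1
        next_state, reward, done = step(state, ACTIONS[a])
        # Q-learning update: bootstrap from the best action in the next state
        target = reward + (0.0 if done else GAMMA * max(Q[next_state]))
        Q[state][a] += ALPHA * (target - Q[state][a])
        state = next_state

policy = ["left" if q[0] > q[1] else "right" for q in Q[:-1]]
print(policy)
```

The learned greedy policy moves right from every non-terminal state. Replacing
max(Q[next_state]) with the value of the action actually chosen in the next state
would turn this update into SARSA.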

Reinforcement Machine Learning


Let’s understand it with the help of examples.
Example: Consider that you are training an AI agent to play a game like chess. The
agent explores different moves and receives positive or negative feedback based on
the outcome. Reinforcement learning is also applied to other agents, such as robots,
that learn to perform tasks by interacting with their surroundings.
Types of Reinforcement Machine Learning
There are two main types of reinforcement learning:
Positive reinforcement
• Rewards the agent for taking a desired action.
• Encourages the agent to repeat the behavior.
• Examples: Giving a treat to a dog for sitting, providing a point in a game for a correct
answer.
Negative reinforcement
• Removes an undesirable stimulus to encourage a desired behavior.
• Encourages the agent to repeat the behavior that removed the stimulus.
• Examples: Turning off a loud buzzer when a lever is pressed, avoiding a penalty by
completing a task.
Advantages of Reinforcement Machine Learning
• It supports autonomous decision-making and is well suited to tasks that require
learning a sequence of decisions, such as robotics and game-playing.
• This technique is preferred for achieving long-term results that are very difficult
to achieve otherwise.
• It is used to solve complex problems that cannot be solved by conventional
techniques.
Disadvantages of Reinforcement Machine Learning
• Training reinforcement learning agents can be computationally expensive and
time-consuming.
• Reinforcement learning is not preferable for solving simple problems.
• It needs a lot of data and a lot of computation, which can make it impractical and
costly.
Applications of Reinforcement Machine Learning
Here are some applications of reinforcement learning:
• Game Playing: RL can teach agents to play games, even complex ones.
• Robotics: RL can teach robots to perform tasks autonomously.
• Autonomous Vehicles: RL can help self-driving cars navigate and make decisions.
• Recommendation Systems: RL can enhance recommendation algorithms by learn-
ing user preferences.
• Healthcare: RL can be used to optimize treatment plans and drug discovery.
• Natural Language Processing (NLP): RL can be used in dialogue systems and chat-
bots.
• Finance and Trading: RL can be used for algorithmic trading.
• Supply Chain and Inventory Management: RL can be used to optimize supply chain
operations.
• Energy Management: RL can be used to optimize energy consumption.
• Game AI: RL can be used to create more intelligent and adaptive NPCs in video
games.
• Adaptive Personal Assistants: RL can be used to improve personal assistants.
• Virtual Reality (VR) and Augmented Reality (AR): RL can be used to create immersive
and interactive experiences.
• Industrial Control: RL can be used to optimize industrial processes.
• Education: RL can be used to create adaptive learning systems.
• Agriculture: RL can be used to optimize agricultural operations.
Conclusion
In conclusion, each type of machine learning serves its own purpose and contributes to
enhanced data prediction capabilities, and machine learning has the potential to
change various industries and fields such as data science. It helps deal with the
production and management of massive datasets.

supervised learning

Some of the challenges faced in supervised learning include addressing class
imbalances, obtaining high-quality labeled data, and avoiding overfitting, where
models perform badly on new data.

Where can we apply supervised learning?


Supervised learning is commonly used for tasks like analysing spam emails, image
recognition, and sentiment analysis.
