Unit 3 and Unit 4 Notes - Data Science - III BCA 2
Machine Learning:
• Machine learning is a rapidly growing technology that enables computers to learn automatically from
past data.
• Machine learning uses various algorithms for building mathematical models and making predictions
using historical data or information.
• Currently, it is being used for various tasks such as image recognition, speech recognition, email
filtering, Facebook auto-tagging, recommender system, and many more.
Arthur Samuel
• The term machine learning was first introduced by Arthur Samuel in 1959. We can define it in a
summarized way as:
• Machine learning enables a machine to automatically learn from data, improve performance
from experiences, and predict things without being explicitly programmed.
There are so many different types of Machine Learning systems that it is useful to classify them in broad
categories, based on the following criteria:
1. Whether or not they are trained with human supervision (supervised, unsupervised, semi-supervised, and
Reinforcement Learning)
2. Whether or not they can learn incrementally on the fly (online versus batch learning)
3. Whether they work by simply comparing new data points to known data points, or instead by detecting
patterns in the training data and building a predictive model, much like scientists do (instance-based versus
model-based learning).
1. Supervised Machine Learning: As its name suggests, supervised machine learning is based on
supervision.
• It means in the supervised learning technique, we train the machines using the "labelled" dataset,
and based on the training, the machine predicts the output.
• The main goal of the supervised learning technique is to map the input variable(x) with the output
variable(y). Some real-world applications of supervised learning are Risk Assessment, Fraud
Detection, Spam filtering, etc.
Supervised learning is a type of machine learning where a computer learns from labeled data. Think of it as
teaching a child to identify animals using a picture book.
Labeled Data: The book contains pictures of animals with their names written below (e.g., "Dog," "Cat").
Learning Process: The child looks at the pictures and memorizes what each animal looks like.
Making Predictions: When shown a new picture of a dog, the child can identify it because of what they learned.
In machine learning, the "child" is the model, and the labeled data (input and output) trains it to make
predictions on new, unseen data.
How supervised learning works:
1. Training Phase: the model learns the input-output relationship from the labelled examples.
2. Testing/Prediction Phase: the trained model predicts outputs for new, unseen data.
Key components of supervised learning:
1. Data: Inputs (e.g., features like age, height, pixel data for images) together with their labelled outputs.
2. Model: The machine learning algorithm that learns the relationship (e.g., Linear Regression, Decision Trees,
Neural Networks).
3. Loss Function: Measures the error between the model’s predictions and the actual outputs. The goal is to
minimize this error.
4. Optimization Algorithm: Adjusts the model to improve predictions, e.g., Gradient Descent.
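To make these four components concrete, here is a minimal Python/NumPy sketch (not from the notes themselves): a linear-regression model fitted with a mean-squared-error loss and plain gradient descent on invented data.

```python
import numpy as np

# Toy labelled data: inputs x and outputs y (roughly y = 3x + 2 with noise)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3 * x + 2 + rng.normal(0, 1, size=100)

# Model: y_hat = w * x + b (linear regression)
w, b = 0.0, 0.0
lr = 0.01  # learning rate for gradient descent

for epoch in range(1000):
    y_hat = w * x + b
    error = y_hat - y
    loss = np.mean(error ** 2)        # loss function: mean squared error
    grad_w = 2 * np.mean(error * x)   # gradients of the loss
    grad_b = 2 * np.mean(error)
    w -= lr * grad_w                  # optimization step (gradient descent)
    b -= lr * grad_b

print(f"learned w={w:.2f}, b={b:.2f}, final loss={loss:.3f}")
```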
Types of supervised learning:
1. Regression: Predicts continuous numeric values (e.g., house prices, stock prices).
2. Classification: Predicts discrete categories.
Common applications include Spam Detection, Weather Prediction, and Fraud Detection.
Unsupervised Learning can be further classified into two types, which are given below:
• Clustering
• Association
1) Clustering:
• The clustering technique is used when we want to find the inherent groups from the data.
• It is a way to group the objects into a cluster such that the objects with the most similarities remain
in one group and have fewer or no similarities with the objects of other groups.
• An example of the clustering algorithm is grouping the customers by their purchasing behavior.
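As a small illustration of the customer-grouping example, here is a hedged scikit-learn sketch; the two behaviour features (annual spend, visits per month) and the choice of 3 clusters are invented for demonstration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical purchasing behaviour: [annual spend, visits per month]
customers = np.array([
    [200,  2], [250,  3], [220,  2],    # low spenders
    [900, 10], [950, 12], [880,  9],    # frequent big spenders
    [500,  5], [520,  6],               # mid-range customers
])

# Group the customers into 3 clusters by similarity
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)           # cluster assigned to each customer
print(kmeans.cluster_centers_)  # centre of each cluster
```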
2) Association:
• Association rule learning is an unsupervised learning technique, which finds interesting relations
among variables within a large dataset.
• The main aim of this learning algorithm is to find the dependency of one data item on another and to
map those variables accordingly so that the discovered rules can be used to maximize profit.
• Some popular algorithms of Association rule learning are Apriori Algorithm, Eclat, FP-growth
algorithm.
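Below is a small association-rule sketch assuming the third-party mlxtend library is installed; the shopping baskets and thresholds are made up for illustration.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical shopping baskets
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["milk", "butter"],
    ["bread", "butter"],
]

# One-hot encode the baskets, then mine frequent itemsets and rules
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)
frequent = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```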
Disadvantages:
• The output of an unsupervised algorithm can be less accurate because the dataset is not labelled and the
algorithms are not trained with the exact output in advance.
• Working with Unsupervised learning is more difficult as it works with the unlabeled dataset that
does not map with the output.
3. Semi-Supervised Learning:
• Semi-Supervised learning is a type of Machine Learning algorithm that lies between
Supervised and Unsupervised machine learning.
• It represents the intermediate ground between Supervised (With Labelled training data) and
Unsupervised learning (with no labelled training data) algorithms and uses the combination of
labelled and unlabeled datasets during the training period.
To overcome the drawbacks of supervised learning and unsupervised learning algorithms, the
concept of Semi-supervised learning is introduced.
• We can imagine these algorithms with an example. Supervised learning is like a student who learns a
concept under the supervision of an instructor, both at home and at college.
• If the student analyses the same concept on their own, without any help from the instructor, it
comes under unsupervised learning.
• Under semi-supervised learning, the student first learns the concept under the guidance of an
instructor at college and then revises it on their own.
Advantages:
• It is simple and easy to understand the algorithm.
• It is highly efficient.
• It is used to solve drawbacks of Supervised and Unsupervised Learning algorithms.
Disadvantages:
• Iteration results may not be stable.
• We cannot apply these algorithms to network-level data.
• Accuracy is low.
1. Supervised Learning
🔹 Definition: In supervised learning, the algorithm is trained on labeled data, meaning the input data has
corresponding output labels.
🔹 Goal: Learn a mapping from inputs to outputs and make predictions on new data.
🔹 Types: Classification, Regression
🔹 Examples: Spam Detection, Stock Price Prediction
🔹 Algorithms:
Linear Regression
Logistic Regression
Decision Trees
Support Vector Machines (SVM)
Neural Networks
2. Unsupervised Learning
🔹 Definition: The algorithm is trained on unlabeled data, meaning it tries to find patterns, structures, or
relationships without predefined labels.
🔹 Goal: Identify hidden patterns, group similar data points, or reduce data complexity.
🔹 Types: Clustering, Dimensionality Reduction
🔹 Examples:
Customer Segmentation 🛍️
Anomaly Detection 🚨 (Fraud Detection)
Market Basket Analysis 🛒 (Amazon, Netflix Recommendations)
🔹 Algorithms:
K-Means Clustering
Hierarchical Clustering
Principal Component Analysis (PCA)
Autoencoders
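As a quick illustration of dimensionality reduction with PCA from the list above, here is a hedged scikit-learn sketch that compresses scikit-learn's built-in 4-feature iris data down to 2 principal components.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)      # 150 samples, 4 features; labels ignored
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)            # project onto the top 2 components

print(X_2d.shape)                      # (150, 2)
print(pca.explained_variance_ratio_)   # variance retained by each component
```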
Comparison Table
Feature            | Supervised Learning 🏷️                | Unsupervised Learning ❓
Data Type          | Labeled Data ✅                        | Unlabeled Data ❌
Goal               | Predict outcomes                       | Find hidden patterns
Types              | Classification, Regression             | Clustering, Dimensionality Reduction
Common Algorithms  | Decision Trees, SVM, Neural Networks   | K-Means, PCA, DBSCAN
Examples           | Spam Detection, Stock Prediction       | Customer Segmentation, Anomaly Detection
Human Intervention | High (requires labeled data)           | Low (no need for labeled data)
Use Supervised Learning when you have labeled data and need to make predictions.
Use Unsupervised Learning when you have unlabeled data and want to explore hidden structures.
4. Reinforcement Learning:
• Reinforcement learning works on a feedback-based process, in which an AI agent (a software
component) automatically explores its surroundings by hit and trial, takes actions, learns
from experience, and improves its performance.
• The agent gets rewarded for each good action and punished for each bad action; hence the goal of the
reinforcement learning agent is to maximize the rewards.
• In reinforcement learning, there is no labelled data as in supervised learning, and agents learn from
their experience only.
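The reward-maximizing idea can be shown with a toy Python sketch (a simple epsilon-greedy bandit, not a full reinforcement-learning algorithm): the agent has no labels, only reward feedback, and it learns which of three invented slot machines pays best.

```python
import random

# Hypothetical environment: three actions with unknown payout probabilities
payout_prob = [0.2, 0.5, 0.8]
values = [0.0, 0.0, 0.0]   # agent's estimated value of each action
counts = [0, 0, 0]
epsilon = 0.1              # how often the agent explores a random action

random.seed(0)
for step in range(5000):
    if random.random() < epsilon:
        action = random.randrange(3)                     # explore
    else:
        action = max(range(3), key=lambda a: values[a])  # exploit best-so-far
    reward = 1 if random.random() < payout_prob[action] else 0
    counts[action] += 1
    # incremental average: learn purely from reward feedback, no labels
    values[action] += (reward - values[action]) / counts[action]

print([round(v, 2) for v in values])  # estimates approach [0.2, 0.5, 0.8]
```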
MODELLING PROCESS
1. Feature Engineering
Feature engineering is the process of transforming raw data into meaningful features that improve the predictive
power of a machine learning model.
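A small pandas sketch of feature engineering follows; the column names and values are hypothetical and only illustrate deriving more informative features from raw columns.

```python
import numpy as np
import pandas as pd

# Hypothetical raw customer data
raw = pd.DataFrame({
    "signup_date": pd.to_datetime(["2023-01-10", "2023-06-05"]),
    "last_purchase": pd.to_datetime(["2024-01-01", "2024-02-15"]),
    "total_spend": [1200.0, 300.0],
    "num_orders": [24, 3],
    "city": ["Bengaluru", "Chennai"],
})

features = pd.DataFrame()
features["avg_order_value"] = raw["total_spend"] / raw["num_orders"]
features["days_active"] = (raw["last_purchase"] - raw["signup_date"]).dt.days
features["log_spend"] = np.log1p(raw["total_spend"])                   # tame skewed values
features = features.join(pd.get_dummies(raw["city"], prefix="city"))   # encode a category
print(features)
```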
2. Model Selection
Choosing the right model is crucial for achieving good performance.
Problem Type:
o Regression (e.g., Linear Regression, Random Forest Regression)
o Classification (e.g., Logistic Regression, SVM, Decision Trees, Neural Networks)
o Clustering (e.g., K-Means, DBSCAN)
o Time Series Forecasting (e.g., ARIMA, LSTMs)
Dataset Size and Complexity:
o For small datasets: Logistic Regression, Decision Trees, KNN
o For large datasets: Neural Networks, Gradient Boosting, Random Forest
Computational Efficiency: Consider the model's speed and memory usage.
Overfitting Risk: Simpler models (e.g., Logistic Regression) generalize better, while complex models
(e.g., Deep Learning) may need regularization.
Validation Techniques:
Cross-Validation:
o k-Fold Cross-Validation: Splits the data into k subsets, trains on k-1 subsets, and tests on the remaining
one.
o Leave-One-Out Cross-Validation (LOOCV): Uses a single observation as the test set and the rest for
training.
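A hedged scikit-learn sketch of 5-fold cross-validation on a built-in dataset; the scaler-plus-logistic-regression pipeline is just an illustrative model choice.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold CV: train on 4 folds, test on the held-out fold, repeat 5 times
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print(scores, scores.mean())
```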
Evaluation Metrics:
o Regression: RMSE (Root Mean Square Error), MAE (Mean Absolute Error), R² score.
o Classification: Accuracy, Precision, Recall, F1-score, ROC-AUC.
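The metrics above can be computed with scikit-learn as in this short sketch; the true and predicted values are invented.

```python
import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error, r2_score,
                             accuracy_score, precision_score, recall_score, f1_score)

# Regression metrics on hypothetical predictions
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.4, 2.9, 6.5])
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(rmse, mean_absolute_error(y_true, y_pred), r2_score(y_true, y_pred))

# Classification metrics on hypothetical labels (1 = spam, 0 = not spam)
y_true_c = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_c = [1, 0, 0, 1, 0, 1, 1, 0]
print(accuracy_score(y_true_c, y_pred_c),
      precision_score(y_true_c, y_pred_c),
      recall_score(y_true_c, y_pred_c),
      f1_score(y_true_c, y_pred_c))
```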
Hyperparameter Tuning:
o Use Grid Search or Random Search to optimize hyperparameters.
o Implement Bayesian Optimization or AutoML for more advanced tuning.
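A hedged Grid Search sketch with scikit-learn's GridSearchCV; the random-forest model and parameter grid are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Try every combination in the grid, scored with 5-fold cross-validation
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```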
Deployment Process:
1. Save the Model: Use formats like .pkl (Pickle), .h5 (for deep learning), or ONNX for interoperability.
2. Integrate with an Application:
o Deploy via an API (e.g., Flask, FastAPI, TensorFlow Serving).
o Embed it in a mobile or web app.
3. Monitor Performance in Production:
o Track real-world accuracy, latency, and data drift.
o Retrain the model periodically if performance degrades.
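A minimal end-to-end deployment sketch: train a model, save it with Pickle, and serve it through a tiny Flask endpoint. The route name, field names, and port are hypothetical, not a prescribed API.

```python
import pickle
from flask import Flask, jsonify, request
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# 1. Save the model: train once and pickle it to disk
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# 2. Integrate with an application: expose a prediction API
app = Flask(__name__)
with open("model.pkl", "rb") as f:
    loaded = pickle.load(f)

@app.route("/predict", methods=["POST"])       # hypothetical route
def predict():
    features = request.get_json()["features"]  # e.g. [5.1, 3.5, 1.4, 0.2]
    prediction = loaded.predict([features])[0]
    return jsonify({"prediction": int(prediction)})

if __name__ == "__main__":
    app.run(port=5000)  # 3. monitor accuracy, latency, and drift separately in production
```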
UNIT -4
INTRODUCTION TO HADOOP
Hadoop is an open-source framework designed for processing and storing vast amounts of data in a distributed manner. It
is particularly useful for handling big data, which involves datasets that are too large or complex for traditional data-
processing software. Developed by the Apache Software Foundation, Hadoop is widely used in industries that require
large-scale data processing, such as technology, finance, healthcare, and retail.
COMPONENTS OF HADOOP
Hadoop's architecture is built on two primary modules:
1. Hadoop Distributed File System (HDFS):
o HDFS is a distributed file system that allows data to be stored across multiple machines.
o It divides large datasets into smaller blocks, which are replicated across the cluster to ensure fault
tolerance and high availability.
HDFS in Data Science
Hadoop Distributed File System (HDFS) plays a crucial role in data science by providing scalable, fault-
tolerant, and distributed storage for big data analytics. Since data science involves processing massive
datasets, HDFS enables efficient storage and retrieval, making it an essential component in big data-driven
machine learning and AI projects.
Example use case: Healthcare Analytics
Stores and processes medical records, genomic data, and images.
Uses deep learning models for disease prediction and diagnosis.
2. MapReduce:
o A programming model and processing engine for distributed computation.
o It breaks down tasks into smaller chunks (Map phase) and processes them in parallel, combining the results
into a final output (Reduce phase).
1. Map Step: Breaks the big task into smaller pieces and processes them in parallel.
2. Reduce Step: Collects the processed results and combines them to get the final output.
Imagine you are organizing a national election. Millions of people across different cities are voting, and you need to count
the total votes for each candidate.
If one person had to count all the votes manually, it would take weeks or months. Instead, we can use MapReduce to
count votes efficiently.
Map step: every center counts its votes individually and records how many votes each candidate received.
Reduce step: the counts from all centers are sent to one place and added up candidate by candidate.
Now, we have the final vote count for the entire country.
Imagine you have 1 billion sales records from an e-commerce website and you want to find total sales per product.
Step 1: Map
The sales data is split across multiple computers (like vote counting centers).
Each computer processes its own portion of the data and records sales per product.
Step 2: Reduce
A central computer combines all the partial results to get the final sales total for each product.
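The same sales example can be simulated in plain Python without Hadoop, just to show the Map and Reduce roles; the sales records are invented.

```python
from collections import defaultdict

# Hypothetical sales records split across "machines" (here, just two lists)
partitions = [
    [("phone", 300), ("laptop", 900), ("phone", 310)],
    [("laptop", 950), ("headphones", 50), ("phone", 305)],
]

# Map step: each partition emits (product, amount) pairs independently
mapped = [pair for part in partitions for pair in part]

# Shuffle + Reduce step: group by product and sum the amounts
totals = defaultdict(int)
for product, amount in mapped:
    totals[product] += amount

print(dict(totals))  # {'phone': 915, 'laptop': 1850, 'headphones': 50}
```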
Real-world uses of MapReduce:
Amazon & Flipkart: To process large-scale customer purchases and recommend products.
Bank Fraud Detection: To scan millions of transactions and find suspicious patterns.
BENEFITS OF HADOOP
1. Scalability: Can scale horizontally by adding more nodes to the cluster.
2. Cost-effectiveness: Works on commodity hardware, reducing overall costs.
3. Fault tolerance: Automatically replicates data, ensuring redundancy and reliability.
4. Flexibility: Handles structured, semi-structured, and unstructured data.
5. Speed: Processes large datasets in parallel, significantly reducing processing time.
USE CASES
• Data Warehousing and Analytics: For large-scale business intelligence.
• Search Engines: For indexing and querying web pages.
• Social Media Analytics: For analyzing user behavior and trends.
• Fraud Detection: In finance and insurance sectors.
Hadoop has become a foundational technology in the big data domain, empowering businesses to extract insights and
value from their data efficiently.
FRAMEWORK
A framework is a pre-built structure or platform that provides tools, libraries, and guidelines for developing applications
efficiently. In the context of big data, frameworks like Hadoop and Spark help process and analyze large datasets.
Advantages of Apache Spark over MapReduce:
1. Speed:
o Processes data in memory, which makes it much faster than disk-based MapReduce for most workloads.
2. Ease of Use:
o Supports APIs in languages like Python, Java, Scala, and R, simplifying data processing compared to
MapReduce's Java-based paradigm.
3. Rich Ecosystem:
o Spark includes modules like Spark SQL (structured data processing), MLlib (machine learning), GraphX
(graph processing), and Spark Streaming (real-time data).
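A hedged PySpark sketch (assuming pyspark is installed) showing a Spark DataFrame aggregation, total sales per product, over a tiny in-memory dataset.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sales-demo").getOrCreate()

# Tiny in-memory dataset; in practice this would be read from HDFS or cloud storage
df = spark.createDataFrame(
    [("phone", 300), ("laptop", 900), ("phone", 310), ("laptop", 950)],
    ["product", "amount"],
)

# Spark DataFrame / SQL-style API: group, aggregate, show the result
df.groupBy("product").agg(F.sum("amount").alias("total_sales")).show()

spark.stop()
```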
What is Apache Spark?
Apache Spark is a fast and powerful data processing engine used to analyze huge amounts of data quickly. It
helps companies handle big data in real-time or batch mode and is widely used for applications like
recommendation systems, fraud detection, and large-scale data analysis.
Think of Spark as a super-efficient team of workers who can process massive amounts of information at lightning
speed without slowing down.
Imagine you are shopping on Amazon or Flipkart during a sale, and thousands of people are placing orders at the
same time. The system needs to:
1. Process Orders Quickly – Amazon must ensure that your order is placed and confirmed instantly.
2. Recommend Products – It suggests "People who bought this also bought…" based on past customer behavior.
3. Detect Fraud – If a hacker tries to make a suspicious purchase, the system must stop it immediately.
4. Track Deliveries – It manages delivery schedules and optimizes shipping routes for faster delivery.
Apache Spark is used by companies like Amazon, Alibaba, and Netflix to handle such tasks in real-time by
processing massive amounts of data faster than traditional systems like Hadoop.
How Does Apache Spark Help in This Example?
As soon as you click "Buy Now," Spark processes your order instantly, without delay.
It also finds the fastest way to deliver your order by analyzing multiple routes.
Apache Spark is like an intelligent, high-speed brain that helps businesses process and analyze vast amounts of
data quickly and efficiently. Whether it's online shopping, fraud detection, or real-time analytics, Spark plays a
crucial role in improving performance and customer experience.
MapReduce is still used in legacy systems but is less common for new projects due to Spark’s superior performance and
flexibility.
NOSQL DATABASES
NoSQL databases are designed to handle unstructured or semi-structured data and offer high scalability. Unlike
traditional relational databases (SQL-based), NoSQL databases do not rely on a fixed schema.
NoSQL (Not Only SQL) databases are non-relational databases designed to store, retrieve, and manage large volumes of
unstructured, semi-structured, or structured data. They are widely used in data science for their scalability, flexibility, and
ability to handle big data efficiently. Here’s a detailed yet simple explanation with real-life examples.
Why NoSQL in Data Science?
1. Scalability: NoSQL databases can easily scale horizontally, making them suitable for big data applications.
2. Flexibility: They can store unstructured data like JSON, XML, or key-value pairs.
3. High Performance: Fast read and write operations due to simplified data models.
4. Variety of Data Types: Handles diverse data types, including text, images, and videos.
Example: MongoDB
Use Case: E-commerce product catalog, where each product has different attributes.
Real-Life Example: Amazon's product listings, where products have varied specifications like size, color, and
features.
Real-Life Example: Monitoring stock prices or IoT sensor data over time.
4. Graph Database: Stores data as nodes and relationships.
Example: Neo4j
Real-Life Example: Facebook's social graph to find mutual friends or suggest connections.
Scenario: An online learning platform wants to analyze student engagement based on their interactions, including
video views, quiz attempts, and forum posts.
Why MongoDB?
Each student has different activity patterns. Document-oriented storage allows flexibility.
Fast querying of nested JSON data, like retrieving a student's quiz attempts and forum posts.
Use this data for predictive analytics (e.g., identifying at-risk students).
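A hedged pymongo sketch of this scenario; the database, collection, and field names are invented, and a MongoDB server is assumed to be running locally.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["learning_platform"]       # hypothetical database name
students = db["student_activity"]      # hypothetical collection name

# Each student document can have a different shape (flexible schema)
students.insert_one({
    "student_id": "S101",
    "video_views": 42,
    "quiz_attempts": [{"quiz": "unit3", "score": 8}, {"quiz": "unit4", "score": 6}],
    "forum_posts": ["How does MapReduce work?"],
})

# Query nested JSON: students with few video views (possibly at risk)
for doc in students.find({"video_views": {"$lt": 10}}):
    print(doc["student_id"])
```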
Advantage: easily handles unstructured data such as JSON, XML, images, and videos.
Limitations:
Complex Data Relationships: Not ideal for complex joins (relational operations).
Consistency Trade-offs: Some NoSQL databases prioritize availability over consistency (as per the CAP theorem).
ACID PROPERTIES
Atomicity:
"All or nothing" principle. If an operation fails, the entire transaction is rolled back.
Example: If you are transferring money between bank accounts and the power fails, the transaction either fully completes
or doesn’t happen at all, ensuring no partial transfer.
Consistency:
Ensures that the database moves from one valid state to another.
Example: If a student’s grade is updated, it reflects consistently across all systems without any data corruption.
Isolation:
Concurrent transactions do not interfere with each other; the intermediate state of one transaction is not visible to others.
Example: Two people editing a Google Doc simultaneously see each other’s changes instantly (low isolation). But, in a
Word document, only one person can edit at a time (high isolation).
Durability:
Once a transaction is committed, it remains in the database even after a system crash.
Example: After booking a flight ticket, the confirmation remains intact even if the server goes down.
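The "all or nothing" behaviour can be demonstrated with Python's built-in sqlite3 module; the accounts table and amounts are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 1000), ('bob', 500)")
conn.commit()

try:
    with conn:  # one transaction: commits on success, rolls back on any error
        conn.execute("UPDATE accounts SET balance = balance - 200 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 200 WHERE name = 'bob'")
        # if anything above raised an error, neither update would be kept (atomicity)
except sqlite3.Error:
    print("transfer failed, nothing was changed")

print(list(conn.execute("SELECT * FROM accounts")))
```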
CAP theorem states that in a distributed system, you can only achieve two of the following three:
Consistency (C): Every read receives the most recent write, so all nodes see the same data at the same time.
Availability (A): Every request receives a response, even if some nodes are down.
Partition Tolerance (P): The system continues to operate even if network communication is lost between nodes.
It’s impossible to achieve all three (Consistency, Availability, and Partition Tolerance) simultaneously.
Situation: A German customer and an American customer both want to buy the last piece of an item at the same time.
1. Prioritize Availability:
Both orders are accepted immediately, even if the inventory data has not yet been synchronized across servers.
Risk: Both customers might be sold the same item, leading to an inventory mismatch.
Example: Amazon sometimes oversells an item and later informs one buyer about a delayed shipment.
2. Prioritize Consistency:
One of the orders is delayed or rejected until every server agrees on the remaining stock.
Example: A booking website may show “unavailable” until all servers sync data.
CA (Consistency + Availability):
Used when Partition Tolerance is not required, e.g., within a single data center.
Conclusion
ACID ensures data integrity and reliability, making it ideal for traditional relational databases and transactions.
CAP Theorem helps in designing distributed systems, particularly in NoSQL databases, where trade-offs between
Consistency, Availability, and Partition Tolerance are necessary.
• The reinforcement learning process is similar to a human being; for example, a child learns various
things by experiences in his day-to-day life.
• An example of reinforcement learning is to play a game, where the Game is the environment, moves
of an agent at each step define states, and the goal of the agent is to get a high score.
• Agent receives feedback in terms of punishment and rewards.
• Due to the way it works, reinforcement learning is employed in different fields such as Game
theory, Operations Research, Information theory, and multi-agent systems.
Categories of Reinforcement Learning:
• Reinforcement learning is categorized mainly into two types of methods/algorithms:
Positive Reinforcement and Negative Reinforcement.
Although this AI-driven software helps to successfully detect credit card fraud, there are several issues in
Machine Learning that complicate the process.
4) Talent Deficit
Although many individuals are attracted to the ML industry, there are still very few experts who can take
complete ownership of this technology.
5) Implementation
Organizations often already have analytics engines in place when they decide to move to ML.
Integrating newer ML techniques with existing processes is a complicated task.
Most ML models cannot handle datasets containing missing data points. Thus, features that contain a large
proportion of missing data may have to be removed.
7) Deficient Infrastructure
ML requires enormous data-processing capability. Legacy systems cannot handle the workload and buckle
under the pressure.
Another issue in Machine Learning is that deep analytics and ML, in their present form, are still relatively
new technologies.
• Neural Networks
• Naive Bayesian Model
• Classification
• Support Vector Machines
• Regression
• Random Forest Model
11) Complexity
Although Machine Learning and Artificial Intelligence are booming, a majority of these sectors are still in
their experimental phases, actively undergoing a trial and error method.
Another one of the most common issues in Machine Learning is slow results: Machine Learning models can be
highly efficient and accurate, but those results take time to produce.
13) Maintenance
The required results for different actions are bound to change over time, and hence the data needed to
produce them changes as well.
This occurs when the target variable changes, causing the delivered results to become inaccurate. The
models decay because such changes cannot easily be accommodated or upgraded.
This occurs when certain aspects of a dataset are given more importance than others.
Many algorithms contain biased programming, which leads to biased datasets; the model will not deliver the
right output and produces irrelevant information.
Machine Learning is often termed a “black box”, as deciphering the outcomes from an algorithm is often
complex and sometimes of little practical use.