Unit 3 and Unit 4 Notes - Data Science - III BCA 2

The document provides an overview of machine learning, detailing its definition, types, and applications. It categorizes machine learning into supervised, unsupervised, semi-supervised, and reinforcement learning, explaining their methodologies, advantages, and disadvantages. Additionally, it covers key processes in model development, including feature engineering, model selection, training, validation, and deployment.

Machine Learning:

• Machine learning is a growing technology that enables computers to learn automatically from
past data.
• Machine learning uses various algorithms to build mathematical models and make predictions
using historical data or information.
• Currently, it is being used for various tasks such as image recognition, speech recognition, email
filtering, Facebook auto-tagging, recommender systems, and many more.
• The term machine learning was first introduced by Arthur Samuel in 1959. We can define it in a
summarized way as:
• Machine learning enables a machine to automatically learn from data, improve performance
from experience, and predict things without being explicitly programmed.

TOPIC-2: Types of Machine Learning Systems



There are so many different types of Machine Learning systems that it is useful to classify them in broad
categories, based on the following criteria:
1. Whether or not they are trained with human supervision (supervised, unsupervised, semi-supervised, and
reinforcement learning)
2. Whether or not they can learn incrementally on the fly (online versus batch learning)
3. Whether they work by simply comparing new data points to known data points, or instead by detecting
patterns in the training data and building a predictive model, much like scientists do (instance-based versus
model-based learning).

1. Supervised Machine Learning: As its name suggests, supervised machine learning is based on
supervision.
• It means in the supervised learning technique, we train the machines using the "labelled" dataset,
and based on the training, the machine predicts the output.
• The main goal of the supervised learning technique is to map the input variable (x) to the output
variable (y). Some real-world applications of supervised learning are risk assessment, fraud
detection, spam filtering, etc.

Supervised learning is a type of machine learning where a computer learns from labeled data. Think of it as
teaching a child to identify animals using a picture book.

Labeled Data: The book contains pictures of animals with their names written below (e.g., "Dog," "Cat").

Learning Process: The child looks at the pictures and memorizes what each animal looks like.

Making Predictions: When shown a new picture of a dog, the child can identify it because of what they learned.
In machine learning, the "child" is the model, and the labeled data (input and output) trains it to make
predictions on new, unseen data.

HOW DOES SUPERVISED LEARNING WORK?

1. Training Phase:

The algorithm is provided a dataset with inputs and corresponding outputs.

It identifies patterns or relationships between them.

2. Testing/Prediction Phase:

The trained model is tested on new data (inputs without outputs).

It predicts the corresponding output based on the learned patterns.

Key Components of Supervised learning

1. Data: The dataset must contain:

Inputs (e.g., features like age, height, pixel data for images).

Outputs (e.g., labels like "Dog," "Cat," or a numeric value).

2. Model: The machine learning algorithm that learns the relationship (e.g., Linear Regression, Decision Trees,
Neural Networks).

3. Loss Function: Measures the error between the model’s predictions and the actual outputs. The goal is to
minimize this error.

4. Optimization Algorithm: Adjusts the model to improve predictions, e.g., Gradient Descent.
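To tie these components together, here is a minimal, self-contained sketch (using NumPy and invented toy data, not taken from these notes) of a model (a line y = w·x + b), a loss function (mean squared error), and an optimization algorithm (gradient descent):

```python
import numpy as np

# Toy data: inputs x and labelled outputs y (roughly y = 3x + 2 plus noise)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3 * x + 2 + rng.normal(0, 1, size=100)

# Model parameters: slope w and intercept b
w, b = 0.0, 0.0
learning_rate = 0.01

for epoch in range(1000):
    y_pred = w * x + b                 # model prediction
    error = y_pred - y
    loss = np.mean(error ** 2)         # loss function: mean squared error
    # Gradients of the loss with respect to w and b
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    # Optimization step: move the parameters against the gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

# Should approximately recover w close to 3 and b close to 2
print(f"learned w={w:.2f}, b={b:.2f}, final loss={loss:.3f}")
```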

Types of Supervised Learning


1. Regression:

Predicts continuous values.

Example: Predicting house prices based on size and location.

2. Classification:
Predicts discrete categories.

Example: Classifying emails as "spam" or "not spam."

Examples of Supervised Learning

1. Handwritten Digit Recognition:

Input: Pixel values from an image.

Output: The digit (0–9).

2. Spam Detection:

Input: Email text (words or phrases).

Output: Spam or not spam.

3. Weather Prediction:

Input: Temperature, humidity, wind speed.

Output: Will it rain or not?

4. Fraud Detection:

Input: Transaction details.

Output: Fraudulent or legitimate.
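As an illustration of the spam-detection example above, here is a minimal sketch using scikit-learn; the emails and labels below are invented purely for demonstration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny, made-up training set: email text (input) and label (output)
emails = [
    "Win a free prize now", "Lowest price on meds, click here",
    "Meeting agenda for Monday", "Can we reschedule our call?",
]
labels = ["spam", "spam", "not spam", "not spam"]

# Convert text into numeric features (word counts)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

# Train a classifier on the labelled data
model = MultinomialNB()
model.fit(X, labels)

# Predict the category of a new, unseen email
new_email = vectorizer.transform(["Click here to win a prize"])
print(model.predict(new_email))   # likely prints ['spam']
```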

Categories of Supervised Machine Learning:


• Supervised machine learning can be classified into two types of problems, which are given below:
• Classification
• Regression
Classification: Classification algorithms are used to solve classification problems in which the output
variable is categorical, such as "Yes" or "No", "Male" or "Female", "Red" or "Blue", etc.

• The classification algorithms predict the categories present in the dataset.


• Some real-world examples of classification algorithms are Spam Detection, Email filtering, etc.
Some popular classification algorithms are given below:
• Random Forest Algorithm

• Decision Tree Algorithm


• Logistic Regression Algorithm
• Support Vector Machine Algorithm
Regression:
• Regression algorithms are used to solve regression problems in which the output variable is
continuous and depends on the input variables.
• These are used to predict continuous output variables, such as market trends, weather prediction,
etc.
Some popular Regression algorithms are given below:
• Simple Linear Regression Algorithm
• Multivariate Regression Algorithm
• Decision Tree Algorithm
• Lasso Regression
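For example, a simple house-price regression can be sketched with scikit-learn as follows; the sizes and prices below are invented purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data: house size in square feet (input) and price in lakhs (output)
sizes = np.array([[500], [750], [1000], [1250], [1500]])
prices = np.array([25, 37, 50, 62, 75])

model = LinearRegression()
model.fit(sizes, prices)           # learn the relationship size -> price

print(model.predict([[1100]]))     # predict the price of an unseen 1100 sq. ft. house
```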
Advantages and Disadvantages of Supervised Learning:
Advantages:
• Since supervised learning works with a labelled dataset, we can have an exact idea about the
classes of objects.
• These algorithms are helpful in predicting the output on the basis of prior experience.
Disadvantages:
• These algorithms are not able to solve complex tasks.
• It may predict the wrong output if the test data is different from the training data.
• It requires lots of computational time to train the algorithm.

2. Unsupervised Machine Learning:


• Unsupervised learning is different from the supervised learning technique; as its name suggests,
there is no need for supervision.
• It means, in unsupervised machine learning, the machine is trained using the unlabeled dataset, and
the machine predicts the output without any supervision.
• The main aim of the unsupervised learning algorithm is to group or categorize the unsorted
dataset according to similarities, patterns, and differences.
• Machines are instructed to find the hidden patterns in the input dataset.

Categories of Unsupervised Machine Learning:

Unsupervised Learning can be further classified into two types, which are given below:
• Clustering
• Association
1) Clustering:
• The clustering technique is used when we want to find the inherent groups from the data.
• It is a way to group the objects into a cluster such that the objects with the most similarities remain
in one group and have fewer or no similarities with the objects of other groups.
• An example of the clustering algorithm is grouping the customers by their purchasing behavior.

Some popular algorithms used for clustering and related unsupervised tasks (such as dimensionality reduction) are given below:


• K-Means Clustering algorithm
• Mean-shift algorithm
• DBSCAN Algorithm
• Principal Component Analysis
• Independent Component Analysis
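A minimal sketch of the customer-segmentation example using K-Means from scikit-learn; the spending figures below are invented for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented data: [annual spend, number of purchases] per customer
customers = np.array([
    [200, 5], [250, 6], [220, 4],      # low spenders
    [900, 30], [950, 28], [880, 35],   # high spenders
])

# Group customers into 2 clusters based on similarity (no labels needed)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)

print(labels)                   # cluster assignment for each customer
print(kmeans.cluster_centers_)  # the "average" customer of each group
```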

2) Association:
• Association rule learning is an unsupervised learning technique, which finds interesting relations
among variables within a large dataset.
• The main aim of this learning algorithm is to find the dependency of one data item on another data
item and map those variables accordingly, so that maximum value (for example, profit from better product placement) can be generated.
• Some popular algorithms of Association rule learning are Apriori Algorithm, Eclat, FP-growth
algorithm.
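To show the core idea behind association rules without relying on any particular library, here is a small sketch that computes the support and confidence of the rule "bread → butter" over a handful of invented transactions:

```python
# Invented market-basket transactions
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"bread", "butter"} <= t)
bread = sum(1 for t in transactions if "bread" in t)

support = both / n            # how often bread and butter appear together
confidence = both / bread     # how often butter appears when bread does

print(f"support={support:.2f}, confidence={confidence:.2f}")
```

Algorithms such as Apriori and FP-growth essentially search for all item combinations whose support and confidence exceed chosen thresholds, but do so efficiently over very large datasets.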

Advantages and Disadvantages of Unsupervised Learning Algorithm:


Advantages:
• These algorithms can be used for more complicated tasks than supervised ones, because they
work on unlabeled datasets.
• Unsupervised algorithms are preferable for various tasks because obtaining an unlabeled dataset is
easier than obtaining a labelled dataset.

Disadvantages:
• The output of an unsupervised algorithm can be less accurate, because the dataset is not labelled and the
algorithm is not trained with the exact output in advance.
• Working with unsupervised learning is more difficult, because the unlabeled dataset does not map
to a known output.

3. Semi-Supervised Learning:
• Semi-Supervised learning is a type of Machine Learning algorithm that lies between
Supervised and Unsupervised machine learning.
• It represents the intermediate ground between Supervised (With Labelled training data) and
Unsupervised learning (with no labelled training data) algorithms and uses the combination of
labelled and unlabeled datasets during the training period.
To overcome the drawbacks of supervised learning and unsupervised learning algorithms, the
concept of Semi-supervised learning is introduced.
• We can imagine these algorithms with an example. Supervised learning is where a student is under
the supervision of an instructor at home and college.
• Further, if that student analyzes the same concept by themselves without any help from the instructor, it
comes under unsupervised learning.
• Under semi-supervised learning, the student revises the concept on their own after first learning it
under the guidance of an instructor at college.
Advantages:
• The algorithm is simple and easy to understand.
• It is highly efficient.
• It helps overcome the drawbacks of supervised and unsupervised learning algorithms.
Disadvantages:
• Iteration results may not be stable.
• These algorithms cannot be applied to network-level data.
• Accuracy can be low.
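A minimal sketch of the semi-supervised idea using scikit-learn's LabelPropagation, where unlabelled points are marked with -1; the data here is a toy example, not from these notes:

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

# Toy data: 6 points, but only 2 of them are labelled (label -1 means "unknown")
X = np.array([[1.0], [1.2], [1.1], [5.0], [5.2], [5.1]])
y = np.array([0,    -1,    -1,    1,    -1,    -1])

# The model uses the few labelled points plus the structure of the
# unlabelled points to assign labels to everything
model = LabelPropagation()
model.fit(X, y)

print(model.transduction_)   # labels inferred for all points, e.g. [0 0 0 1 1 1]
```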

Difference Between Supervised and Unsupervised Learning


Supervised and Unsupervised Learning are two main types of Machine Learning (ML) techniques. They
differ in how they learn from data and the type of problems they solve.

1. Supervised Learning

🔹 Definition: In supervised learning, the algorithm is trained on labeled data, meaning the input data has
corresponding output labels.

🔹 Goal: Learn a mapping from inputs to outputs and make predictions on new data.

🔹 Types:

 Classification: Predicts discrete labels (e.g., Spam vs. Not Spam).


 Regression: Predicts continuous values (e.g., House Price Prediction).

🔹 Examples:

 Email Spam Detection 📧 (Spam or Not Spam)


 Predicting Stock Prices 📈
 Sentiment Analysis 😊😡

🔹 Algorithms:

 Linear Regression
 Logistic Regression
 Decision Trees
 Support Vector Machines (SVM)
 Neural Networks

2. Unsupervised Learning

🔹 Definition: The algorithm is trained on unlabeled data, meaning it tries to find patterns, structures, or
relationships without predefined labels.

🔹 Goal: Identify hidden patterns, group similar data points, or reduce data complexity.

🔹 Types:

 Clustering: Groups similar data points together (e.g., Customer Segmentation).


 Dimensionality Reduction: Reduces the number of input variables (e.g., PCA for feature selection).

🔹 Examples:

 Customer Segmentation 🛍️
 Anomaly Detection 🚨 (Fraud Detection)
 Market Basket Analysis 🛒 (Amazon, Netflix Recommendations)

🔹 Algorithms:

 K-Means Clustering
 Hierarchical Clustering
 Principal Component Analysis (PCA)
 Autoencoders

Comparison Table
Feature             | Supervised Learning 🏷️                  | Unsupervised Learning ❓
Data Type           | Labeled Data ✅                          | Unlabeled Data ❌
Goal                | Predict outcomes                        | Find hidden patterns
Types               | Classification, Regression              | Clustering, Dimensionality Reduction
Common Algorithms   | Decision Trees, SVM, Neural Networks    | K-Means, PCA, DBSCAN
Examples            | Spam Detection, Stock Prediction        | Customer Segmentation, Anomaly Detection
Human Intervention  | High (requires labeled data)            | Low (no need for labeled data)

 Use Supervised Learning when you have labeled data and need to make predictions.
 Use Unsupervised Learning when you have unlabeled data and want to explore hidden structures.

4. Reinforcement Learning:
• Reinforcement learning works on a feedback-based process, in which an AI agent (a software
component) automatically explores its surroundings by trial and error, taking actions, learning
from experiences, and improving its performance.
• The agent gets rewarded for each good action and punished for each bad action; hence, the goal of the
reinforcement learning agent is to maximize the rewards.
• In reinforcement learning, there is no labelled data as in supervised learning, and agents learn from
their experiences only.
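As a minimal illustration of this reward-feedback loop, here is a tiny tabular Q-learning sketch on a made-up 5-state corridor, where the agent earns a reward for reaching the rightmost state and a small penalty per step (everything here is an invented toy setup, not part of these notes):

```python
import random

n_states, actions = 5, [-1, +1]        # move left or right along a corridor
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.2  # learning rate, discount, exploration rate

for episode in range(200):
    state = 0
    while state != n_states - 1:       # episode ends at the rightmost state
        # Explore sometimes, otherwise exploit the best known action
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: Q[(state, a)])
        next_state = min(max(state + action, 0), n_states - 1)
        # Reward of 1 at the goal, small penalty for every other step
        reward = 1.0 if next_state == n_states - 1 else -0.01
        # Q-learning update: learn from the feedback received
        best_next = max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

# After training, the greedy policy prefers moving right (+1) in every state
print([max(actions, key=lambda a: Q[(s, a)]) for s in range(n_states - 1)])
```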
MODELLING PROCESS
1. Feature Engineering
Feature engineering is the process of transforming raw data into meaningful features that improve the predictive
power of a machine learning model.

Key Steps in Feature Engineering:


 Handling Missing Data: Fill missing values using methods like mean, median, mode, or advanced techniques
like KNN imputation.
 Feature Scaling: Normalize or standardize numerical features (e.g., Min-Max Scaling, Z-score normalization).
 Encoding Categorical Variables: Convert categorical data into numerical form (e.g., One-Hot Encoding, Label
Encoding).
 Feature Creation: Derive new features from existing ones (e.g., creating an "age group" from "age").
 Feature Selection: Identify and retain only the most relevant features to improve model performance (e.g., using
correlation analysis, mutual information, or PCA).
 Handling Outliers: Remove or transform outliers using statistical methods like the Z-score or IQR method.
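A minimal sketch of a few of these steps with pandas and scikit-learn on an invented toy table; the column names and values are assumptions chosen only for illustration:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Invented raw data with a missing value and a categorical column
df = pd.DataFrame({
    "age": [25, 32, None, 51],
    "city": ["Delhi", "Mumbai", "Delhi", "Chennai"],
    "income": [30000, 52000, 41000, 78000],
})

# Handling missing data: fill missing ages with the median
df["age"] = df["age"].fillna(df["age"].median())

# Feature creation: derive an "age group" from "age"
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 45, 100],
                         labels=["young", "mid", "senior"])

# Encoding categorical variables: one-hot encode the categorical columns
df = pd.get_dummies(df, columns=["city", "age_group"])

# Feature scaling: Min-Max scale the numeric columns to [0, 1]
df[["age", "income"]] = MinMaxScaler().fit_transform(df[["age", "income"]])

print(df.head())
```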

2. Model Selection
Choosing the right model is crucial for achieving good performance.

Considerations for Model Selection:

 Problem Type:
o Regression (e.g., Linear Regression, Random Forest Regression)
o Classification (e.g., Logistic Regression, SVM, Decision Trees, Neural Networks)
o Clustering (e.g., K-Means, DBSCAN)
o Time Series Forecasting (e.g., ARIMA, LSTMs)
 Dataset Size and Complexity:
o For small datasets: Logistic Regression, Decision Trees, KNN
o For large datasets: Neural Networks, Gradient Boosting, Random Forest
 Computational Efficiency: Consider the model's speed and memory usage.
 Overfitting Risk: Simpler models (e.g., Logistic Regression) generalize better, while complex models
(e.g., Deep Learning) may need regularization.

3. Training the Model


Training a model involves feeding it data and adjusting its parameters to minimize the error.

Key Steps in Model Training:

1. Splitting the Data: Divide the dataset into:


o Training Set (typically 70-80% of data) for model learning.
o Validation Set (10-15%) for tuning hyperparameters.
o Test Set (10-15%) for final evaluation.
2. Choosing an Optimization Algorithm:
o Gradient Descent for deep learning models.
o SGD, Adam, RMSprop for neural networks.
o Grid Search/Random Search for hyperparameter tuning.
3. Adjusting Hyperparameters: Optimize parameters like learning rate, tree depth (for decision trees),
and number of layers (for deep learning).
4. Regularization Techniques:
o L1/L2 Regularization to prevent overfitting.
o Dropout (for neural networks) to reduce reliance on specific neurons.
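A minimal sketch of the data-splitting and training step with scikit-learn; the feature matrix below is randomly generated just to show the mechanics of a 70/15/15 split:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Randomly generated stand-in data: 200 samples, 4 features, binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Split: 70% train, then half of the remaining 30% each for validation and test
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=0)

model = LogisticRegression()
model.fit(X_train, y_train)                                  # training phase

print("validation accuracy:", model.score(X_val, y_val))     # used for tuning
print("test accuracy:", model.score(X_test, y_test))         # final evaluation
```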

4. Model Validation and Selection


After training, validate the model to ensure it generalizes well to unseen data.

Validation Techniques:

 Cross-Validation:
o k-Fold Cross-Validation: Splits the data into k subsets, trains on k-1 subsets, and tests on the remaining
one.
o Leave-One-Out Cross-Validation (LOOCV): Uses a single observation as the test set and the rest for
training.
 Evaluation Metrics:
o Regression: RMSE (Root Mean Square Error), MAE (Mean Absolute Error), R² score.
o Classification: Accuracy, Precision, Recall, F1-score, ROC-AUC.
 Hyperparameter Tuning:
o Use Grid Search or Random Search to optimize hyperparameters.
o Implement Bayesian Optimization or AutoML for more advanced tuning.
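A short sketch of k-fold cross-validation and grid search with scikit-learn, using synthetic data and a small, purely illustrative hyperparameter grid:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# 5-fold cross-validation: train on 4 folds, test on the remaining one, repeat
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("mean CV accuracy:", scores.mean())

# Grid search over a small hyperparameter grid (tree depth)
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={"max_depth": [2, 4, 8, None]}, cv=5)
grid.fit(X, y)
print("best params:", grid.best_params_, "best score:", grid.best_score_)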

5. Applying the Trained Model to Unseen Data (Deployment)


Once the model is validated, it's ready for deployment.

Deployment Process:

1. Save the Model: Use formats like .pkl (Pickle), .h5 (for deep learning), or ONNX for interoperability.
2. Integrate with an Application:
o Deploy via an API (e.g., Flask, FastAPI, TensorFlow Serving).
o Embed it in a mobile or web app.
3. Monitor Performance in Production:
o Track real-world accuracy, latency, and data drift.
o Retrain the model periodically if performance degrades.
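A minimal sketch of saving and reloading a trained model with pickle; the file name and model object are assumptions, and a real deployment would typically wrap the reloaded model behind a Flask or FastAPI endpoint:

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

# 1. Save the trained model to disk
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# 2. Later (e.g., inside an API endpoint), reload it and serve predictions
with open("model.pkl", "rb") as f:
    loaded = pickle.load(f)

print(loaded.predict(X[:3]))   # predictions on new, incoming data
```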

UNIT -4
INTRODUCTION TO HADOOP
Hadoop is an open-source framework designed for processing and storing vast amounts of data in a distributed manner. It
is particularly useful for handling big data, which involves datasets that are too large or complex for traditional data-
processing software. Developed by the Apache Software Foundation, Hadoop is widely used in industries that require
large-scale data processing, such as technology, finance, healthcare, and retail.
COMPONENTS OF HADOOP
Hadoop's architecture is built on two primary modules:
1. Hadoop Distributed File System (HDFS):
o HDFS is a distributed file system that allows data to be stored across multiple machines.
o It divides large datasets into smaller blocks, which are replicated across the cluster to ensure fault tolerance and high
availability.
HDFS in Data Science

Hadoop Distributed File System (HDFS) plays a crucial role in data science by providing scalable, fault-
tolerant, and distributed storage for big data analytics. Since data science involves processing massive
datasets, HDFS enables efficient storage and retrieval, making it an essential component in big data-driven
machine learning and AI projects.

Why Use HDFS in Data Science?


1. Scalability – Handles petabytes of structured and unstructured data.
2. Fault Tolerance – Replicates data across multiple nodes, ensuring reliability.
3. Cost-Effective – Uses commodity hardware to store large datasets.
4. Distributed Processing – Works seamlessly with Apache Spark, Hive, and MapReduce for parallel data
processing.
5. Supports Various Data Formats – Works with CSV, JSON, Parquet, Avro, and binary files.

How HDFS Fits into the Data Science Workflow


1. Data Ingestion
 HDFS can store raw data from various sources like IoT devices, web logs, sensors, social media, and
databases.
 Tools like Apache Flume, Apache Sqoop, or Kafka can be used to load data into HDFS.

2. Data Processing & Transformation


 Apache Spark: Uses PySpark (Python API for Spark) to process large datasets in-memory.
 Hive: Queries structured data using SQL-like commands.
 MapReduce: Batch processing of large-scale datasets.
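A minimal PySpark sketch of reading a file from HDFS and aggregating it in parallel; the HDFS URL, file name, and column names are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hdfs-demo").getOrCreate()

# Read a (hypothetical) CSV file stored in HDFS into a distributed DataFrame
sales = spark.read.csv("hdfs://namenode:9000/data/sales.csv",
                       header=True, inferSchema=True)

# Transform and aggregate in parallel across the cluster
totals = (sales.groupBy("product")
               .agg(F.sum("amount").alias("total_sales"))
               .orderBy(F.desc("total_sales")))

totals.show()
spark.stop()
```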

3. Feature Engineering & Model Training


 Data stored in HDFS can be processed using Spark MLlib (machine learning library) to extract features for
ML models.
 Data scientists can use Jupyter Notebooks with PySpark for interactive data exploration.

4. Model Deployment & Prediction


 Once trained, models can be deployed using frameworks like TensorFlow, Scikit-learn, or MLflow, with
predictions running on large-scale data stored in HDFS.
 HDFS provides high-throughput access to massive datasets, which suits large-scale batch inference; low-latency
real-time serving is usually handled by other systems on top of it.

Example Use Cases


1. Fraud Detection in Banking
 Stores large volumes of transaction logs in HDFS.
 Uses PySpark to process transactions and detect anomalies.

2. Customer Segmentation in E-commerce


 Analyzes customer behavior from HDFS-stored clickstream data.
 Uses Spark MLlib for clustering customers based on their purchasing patterns.

3. Predictive Maintenance in Manufacturing


 Stores IoT sensor data from machines in HDFS.
 Applies time-series forecasting models to predict failures.

4. Healthcare Analytics
 Stores and processes medical records, genomic data, and images.
 Uses deep learning models for disease prediction and diagnosis.

2. MapReduce:
o A programming model and processing engine for distributed computation.
o It breaks down tasks into smaller chunks (Map phase) and processes them in parallel, combining the results
into a final output (Reduce phase).

It works in two main steps:

1. Map Step: Breaks the big task into smaller pieces and processes them in parallel.

2. Reduce Step: Collects the processed results and combines them to get the final output.

Real-Life Example: Counting Votes in an Election

Imagine you are organizing a national election. Millions of people across different cities are voting, and you need to count
the total votes for each candidate.

If one person had to count all the votes manually, it would take weeks or months. Instead, we can use MapReduce to
count votes efficiently.

Step 1: Map (Divide the Task)

Each city has its own vote counting center.

Every center counts votes individually and records how many votes each candidate received.

Example:

City 1 → Alice: 500, Bob: 300

City 2 → Alice: 600, Bob: 400


City 3 → Alice: 550, Bob: 450

Step 2: Reduce (Combine Results)

A central system collects the results from all cities.

It adds up the votes from each city to get the total:

Alice: 500 + 600 + 550 = 1650 votes

Bob: 300 + 400 + 450 = 1150 votes

Now, we have the final vote count for the entire country.

How MapReduce Works in Big Data


Now, let’s see how this applies to computers and big data:

Imagine you have 1 billion sales records from an e-commerce website and you want to find total sales per product.

Step 1: Map

The sales data is split across multiple computers (like vote counting centers).

Each computer processes its own portion of the data and records sales per product.

Example:

Computer 1: Apple: ₹10,000, Banana: ₹5,000

Computer 2: Apple: ₹15,000, Banana: ₹8,000

Computer 3: Apple: ₹12,000, Banana: ₹7,000

Step 2: Reduce

A central computer combines all results to get the final sales numbers:

Apple: ₹10,000 + ₹15,000 + ₹12,000 = ₹37,000

Banana: ₹5,000 + ₹8,000 + ₹7,000 = ₹20,000


This way, big data can be processed much faster using multiple computers in parallel.
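The same idea can be sketched in plain Python: each "computer" maps over its own partition of the sales records, and a reduce step merges the partial totals. The figures below are the invented ones from the example above:

```python
from collections import Counter
from functools import reduce

# Each partition plays the role of one computer's share of the data
partitions = [
    [("Apple", 10000), ("Banana", 5000)],
    [("Apple", 15000), ("Banana", 8000)],
    [("Apple", 12000), ("Banana", 7000)],
]

def map_partition(records):
    """Map step: compute sales per product within one partition."""
    counts = Counter()
    for product, amount in records:
        counts[product] += amount
    return counts

def combine(total, partial):
    """Reduce step: merge one partition's partial totals into the running total."""
    total.update(partial)
    return total

partial_totals = [map_partition(p) for p in partitions]   # runs in parallel in real systems
final_totals = reduce(combine, partial_totals, Counter())

print(dict(final_totals))   # {'Apple': 37000, 'Banana': 20000}
```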

Where is MapReduce Used?

Google Search: To index and rank billions of web pages.

Amazon & Flipkart: To process large-scale customer purchases and recommend products.

Bank Fraud Detection: To scan millions of transactions and find suspicious patterns.

Weather Forecasting: To analyze vast amounts of climate data.


ADDITIONAL COMPONENTS
Modern Hadoop ecosystems often include additional tools that expand its functionality:
• YARN (Yet Another Resource Negotiator):
o Manages and schedules resources in the cluster.
o Enables multiple data processing engines to work simultaneously.
• Hadoop Common:
o A set of shared utilities and libraries that support other Hadoop modules.
ECOSYSTEM TOOLS
Hadoop integrates with a rich ecosystem of tools, such as:
• Apache Hive: For SQL-like querying.
• Apache Pig: For scripting-based data analysis.
• Apache HBase: A NoSQL database for real-time data access.
• Apache Spark: For fast, in-memory processing.
• Apache Kafka: For data streaming.

BENEFITS OF HADOOP
1. Scalability: Can scale horizontally by adding more nodes to the cluster.
2. Cost-effectiveness: Works on commodity hardware, reducing overall costs.
3. Fault tolerance: Automatically replicates data, ensuring redundancy and reliability.
4. Flexibility: Handles structured, semi-structured, and unstructured data.
5. Speed: Processes large datasets in parallel, significantly reducing processing time.
USE CASES
• Data Warehousing and Analytics: For large-scale business intelligence.
• Search Engines: For indexing and querying web pages.
• Social Media Analytics: For analyzing user behavior and trends.
• Fraud Detection: In finance and insurance sectors.
Hadoop has become a foundational technology in the big data domain, empowering businesses to extract insights and
value from their data efficiently.

FRAMEWORK
A framework is a pre-built structure or platform that provides tools, libraries, and guidelines for developing applications
efficiently. In the context of big data, frameworks like Hadoop and Spark help process and analyze large datasets.

APACHE SPARK: REPLACING MAPREDUCE


Apache Spark is a powerful, open-source big data framework that has largely replaced MapReduce in many
applications. Here's why:
1. In-Memory Processing:
o Spark stores data in memory rather than disk, making it significantly faster than MapReduce, which relies
heavily on disk I/O.

2. Ease of Use:
o Supports APIs in languages like Python, Java, Scala, and R, simplifying data processing compared to
MapReduce's Java-based paradigm.

3. Rich Ecosystem:
o Spark includes modules like Spark SQL (structured data processing), MLlib (machine learning), GraphX
(graph processing), and Spark Streaming (real-time data).
What is Apache Spark?

Apache Spark is a fast and powerful data processing engine used to analyze huge amounts of data quickly. It
helps companies handle big data in real-time or batch mode and is widely used for applications like
recommendation systems, fraud detection, and large-scale data analysis.

Think of Spark as a super-efficient team of workers who can process massive amounts of information at lightning
speed without slowing down.

Simple Real-Life Example: Online Shopping (Amazon/Flipkart)

Imagine you are shopping on Amazon or Flipkart during a sale, and thousands of people are placing orders at the
same time. The system needs to:

1. Process Orders Quickly – Amazon must ensure that your order is placed and confirmed instantly.
2. Recommend Products – It suggests "People who bought this also bought…" based on past customer behavior.
3. Detect Fraud – If a hacker tries to make a suspicious purchase, the system must stop it immediately.
4. Track Deliveries – It manages delivery schedules and optimizes shipping routes for faster delivery.
Apache Spark is used by companies like Amazon, Alibaba, and Netflix to handle such tasks in real time by
processing massive amounts of data faster than traditional systems like Hadoop MapReduce.
How Does Apache Spark Help in This Example?

1. Spark Streaming → Tracks Orders in Real-Time

As soon as you click "Buy Now," Spark processes your order instantly, without delay.

2. Spark SQL → Manages Customer and Product Data


Retrieves customer purchase history and suggests relevant products.

3. MLlib (Machine Learning) → Makes Smart Recommendations


Analyzes your past purchases and suggests "Best Offers for You."
4. GraphX → Optimizes Delivery Routes

Finds the fastest way to deliver your order by analyzing multiple routes.
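As a small illustration of the Spark SQL module mentioned above, here is a hedged sketch that registers an in-memory DataFrame of invented orders and queries it with ordinary SQL:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

# Invented order data; in production this would come from HDFS, Kafka, etc.
orders = spark.createDataFrame(
    [("u1", "laptop", 55000), ("u2", "phone", 20000), ("u1", "mouse", 700)],
    ["user_id", "product", "price"],
)

# Spark SQL: query the distributed data with SQL
orders.createOrReplaceTempView("orders")
spark.sql("""
    SELECT user_id, SUM(price) AS total_spent
    FROM orders
    GROUP BY user_id
    ORDER BY total_spent DESC
""").show()

spark.stop()
```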

Apache Spark is like an intelligent, high-speed brain that helps businesses process and analyze vast amounts of
data quickly and efficiently. Whether it's online shopping, fraud detection, or real-time analytics, Spark plays a
crucial role in improving performance and customer experience.
MapReduce is still used in legacy systems but is less common for new projects due to Spark’s superior performance and
flexibility.

NOSQL DATABASES
NoSQL databases are designed to handle unstructured or semi-structured data and offer high scalability. Unlike
traditional relational databases (SQL-based), NoSQL databases do not rely on a fixed schema.
NoSQL (Not Only SQL) databases are non-relational databases designed to store, retrieve, and manage large volumes of
unstructured, semi-structured, or structured data. They are widely used in data science for their scalability, flexibility, and
ability to handle big data efficiently. Here’s a detailed yet simple explanation with real-life examples.
Why NoSQL in Data Science?

1. Scalability: NoSQL databases can easily scale horizontally, making them suitable for big data applications.
2. Flexibility: They can store unstructured data like JSON, XML, or key-value pairs.
3. High Performance: Fast read and write operations due to simplified data models.
4. Variety of Data Types: Handles diverse data types, including text, images, and videos.

Types of NoSQL Databases

1. Document Store: Stores data in documents (like JSON or BSON).

Example: MongoDB

Use Case: E-commerce product catalog, where each product has different attributes.

Real-Life Example: Amazon's product listings, where products have varied specifications like size, color, and
features.

2. Key-Value Store: Stores data as key-value pairs.

Example: Redis, DynamoDB


Use Case: Caching user session data.
Real-Life Example: Storing user session information for quick retrieval in online shopping carts (see the Redis sketch after this list).
3. Column Store: Stores data in columns instead of rows.

Example: Apache Cassandra, HBase

Use Case: Time-series data analysis.

Real-Life Example: Monitoring stock prices or IoT sensor data over time.
4. Graph Database: Stores data as nodes and relationships.

Example: Neo4j

Use Case: Social network analysis.

Real-Life Example: Facebook's social graph to find mutual friends or suggest connections.
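For the key-value session-caching use case above, here is a minimal sketch with the redis-py client; it assumes a Redis server running locally, and the key and field names are invented:

```python
import json
import redis

# Connect to a locally running Redis server (assumption for this sketch)
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Store a user's shopping-cart session as a simple key-value pair
session = {"user_id": 42, "cart": ["phone", "charger"]}
r.set("session:42", json.dumps(session), ex=3600)   # expire after 1 hour

# Retrieve it instantly by key
cached = json.loads(r.get("session:42"))
print(cached["cart"])
```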

Real-Life Example: MongoDB in Data Science

Scenario: An online learning platform wants to analyze student engagement based on their interactions, including
video views, quiz attempts, and forum posts.

Why MongoDB?

Each student has different activity patterns. Document-oriented storage allows flexibility.

Fast querying of nested JSON data, like retrieving a student's quiz attempts and forum posts.

How it's Used:

Store student interactions as JSON documents.

Perform aggregation to find most-watched videos or frequently asked questions.

Use this data for predictive analytics (e.g., identifying at-risk students).
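A minimal pymongo sketch of this scenario; the connection string, database, collection, and field names are all hypothetical:

```python
from pymongo import MongoClient

# Connect to a (hypothetical) local MongoDB instance
client = MongoClient("mongodb://localhost:27017")
db = client["learning_platform"]

# Store one student interaction as a flexible JSON-like document
db.interactions.insert_one({
    "student_id": "s101",
    "video_views": [{"video": "intro_ml", "seconds_watched": 540}],
    "quiz_attempts": [{"quiz": "unit3", "score": 8}],
    "forum_posts": 2,
})

# Aggregate: total quiz attempts per student
pipeline = [
    {"$unwind": "$quiz_attempts"},
    {"$group": {"_id": "$student_id", "attempts": {"$sum": 1}}},
]
for row in db.interactions.aggregate(pipeline):
    print(row)
```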

Advantages of Using NoSQL in Data Science

Handling Unstructured Data: Easily stores JSON, XML, images, and videos.

High Scalability: Ideal for big data applications.

Fast Query Performance: Due to simplified data models and indexing.

Disadvantages of NOSQL in Data Science

Lack of Standardization: No standard query language like SQL.

Complex Data Relationships: Not ideal for complex joins (relational operations).

Consistency Trade-offs: Some NoSQL databases prioritize availability over consistency (as per the CAP theorem).

FEATURES OF NOSQL DATABASES:


• Horizontal Scalability: Add more servers easily to handle increased data.
• Flexible Data Models: Suitable for document, key-value, column-family, or graph storage.
• High Availability: Focuses on distributing data for better availability.

ACID, CAP, and BASE


1. ACID (for traditional relational databases):
• Atomicity: All operations in a transaction succeed or none do.
• Consistency: Database remains in a valid state after a transaction.
• Isolation: Transactions do not interfere with each other.
• Durability: Data persists even after a crash.
• Ensures data integrity but can affect performance in distributed systems.
The following covers two important concepts relevant to relational and NoSQL databases:

1. ACID (for Relational Databases)

2. CAP Theorem (for Distributed Databases, including NoSQL)



1. ACID: Core Principle of Relational Databases

ACID stands for:

Atomicity:

"All or nothing" principle. If an operation fails, the entire transaction is rolled back.

Example: If you are transferring money between bank accounts and the power fails, the transaction either fully completes
or doesn’t happen at all, ensuring no partial transfer.

Consistency:

Ensures that the database moves from one valid state to another.

Example: If a student’s grade is updated, it reflects consistently across all systems without any data corruption.

Isolation:

Transactions occur independently without interference.

Example: Two people editing a Google Doc simultaneously see each other’s changes instantly (low isolation). But, in a
Word document, only one person can edit at a time (high isolation).

Durability:

Once a transaction is committed, it remains in the database even after a system crash.

Example: After booking a flight ticket, the confirmation remains intact even if the server goes down.

Where is ACID Used?

Mainly in traditional relational databases (e.g., MySQL, PostgreSQL).


Some NoSQL databases like Neo4j (a graph database) also support ACID properties.
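To make atomicity concrete, here is a small sketch using Python's built-in sqlite3 module: the two UPDATE statements of a made-up money transfer either both commit or are both rolled back:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 1000), ("bob", 500)])
conn.commit()

try:
    # Atomic transaction: debit one account and credit the other
    with conn:   # commits on success, rolls back automatically on any exception
        conn.execute("UPDATE accounts SET balance = balance - 200 WHERE name = 'alice'")
        raise RuntimeError("power failure!")   # simulate a crash mid-transfer
        conn.execute("UPDATE accounts SET balance = balance + 200 WHERE name = 'bob'")
except RuntimeError:
    pass

# Because of atomicity, neither update took effect
print(conn.execute("SELECT * FROM accounts").fetchall())  # [('alice', 1000), ('bob', 500)]
```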

2. CAP Theorem: The Challenge of Distributed Databases

CAP theorem states that in a distributed system, you can only achieve two of the following three:

Consistency (C): Every read receives the most recent write.

Availability (A): Every request gets a response (success or failure).

Partition Tolerance (P): The system continues to operate even if network communication is lost between nodes.

According to the CAP theorem:

It’s impossible to achieve all three (Consistency, Availability, and Partition Tolerance) simultaneously.

Example Scenario: Online Shop

Setup: An online shop has servers in Europe and the USA.

Situation: A German customer and an American customer both want to buy the last piece of an item at the same time.

Problem: Network communication between the two servers is temporarily lost.

Two Options to Handle the Situation:

1. Prioritize Availability:

Allow both servers to continue processing requests.

Risk: Both customers might be sold the same item, leading to an inventory mismatch.

Example: Amazon sometimes oversells an item and later informs one buyer about a delayed shipment.

2. Prioritize Consistency:

Put all sales on hold until communication is restored.

Risk: Some customers may face delays or experience temporary unavailability.

Example: A booking website may show “unavailable” until all servers sync data.

Practical Examples of CAP Choices in NoSQL Databases

CA (Consistency + Availability):

Used when Partition Tolerance is not required, e.g., within a single data center.

Example: Relational databases like MySQL (when used in a single node).

CP (Consistency + Partition Tolerance):

Prioritizes accuracy over availability.

Example: HBase and MongoDB in some configurations.

AP (Availability + Partition Tolerance):

Prioritizes system uptime over strict consistency.

Example: Cassandra and DynamoDB (ideal for social media feeds).

Conclusion

ACID ensures data integrity and reliability, making it ideal for traditional relational databases and transactions.

CAP Theorem helps in designing distributed systems, particularly in NoSQL databases, where trade-offs between
Consistency, Availability, and Partition Tolerance are necessary.

2. CAP Theorem (for distributed databases):


Proposed by Eric Brewer, it states that a distributed database can only provide two of the following three guarantees:
• Consistency: All nodes see the same data at the same time.
• Availability: Every request gets a response (though the response might not reflect the most recent changes).
• Partition Tolerance: The system continues to operate even if some nodes lose communication.
Distributed systems must trade off one of these aspects, often prioritizing Availability and Partition Tolerance over
Consistency.

3. BASE (NoSQL databases):


BASE is an alternative to ACID, commonly used in NoSQL databases to optimize scalability and performance:
• Basically Available: System guarantees availability even during failures.
• Soft State: System state might change over time (eventual consistency).
• Eventual Consistency: Data across nodes eventually becomes consistent after updates.
BASE sacrifices strict consistency for higher availability and better partition tolerance, aligning with the CAP
theorem's trade-offs.

TYPES OF NOSQL DATABASES


1. Key-Value Stores:
o Simplest type, storing data as key-value pairs (e.g., Redis, DynamoDB).
2. Document Stores:
o Store semi-structured data like JSON documents (e.g., MongoDB, CouchDB).
3. Column-Family Stores:
o Organize data in columns instead of rows (e.g., Cassandra, HBase).
4. Graph Databases:
o Focus on relationships between entities (e.g., Neo4j, Amazon Neptune).
These databases are chosen based on specific use cases, such as low-latency reads (Key-Value), flexible schemas
(Document), or complex relationships (Graph).

• The reinforcement learning process is similar to how a human being learns; for example, a child learns various
things through experience in day-to-day life.
• An example of reinforcement learning is playing a game, where the game is the environment, the moves
of the agent at each step define the states, and the goal of the agent is to get a high score.
• The agent receives feedback in terms of punishments and rewards.
• Due to its way of working, reinforcement learning is employed in different fields such as game
theory, operations research, information theory, and multi-agent systems.
Categories of Reinforcement Learning:
• Reinforcement learning is categorized mainly into two types of methods/algorithms:

• Positive Reinforcement Learning: Positive reinforcement learning specifies increasing the


tendency that the required behavior would occur again by adding something. It enhances the
strength of the behavior of the agent and positively impacts it.
• Negative Reinforcement Learning: Negative reinforcement learning works exactly opposite to the
positive RL. It increases the tendency that the specific behavior would occur again by avoiding the
negative condition.
Real-world Use cases of Reinforcement Learning
• Video Games
• Robotics
• Text Mining

TOPIC-3: Main Challenges of Machine Learning:


1) Lack Of Quality Data
One of the main issues in Machine Learning is the absence of good-quality data. Developers often end up
spending most of their time cleaning and preparing data rather than building and improving models.

 Data can be noisy, which results in inaccurate predictions.

 Incorrect or incomplete information can also lead to faulty models built through Machine
Learning.
2) Fault In Credit Card Fraud Detection

Although AI-driven software helps to successfully detect credit card fraud, there are issues in Machine
Learning that can make the detection process unreliable.

3) Getting Bad Recommendations


Recommendation engines are quite common today. While some are dependable, others may not provide the
expected results, because Machine Learning algorithms can only recommend based on what they have previously learned.

4) Talent Deficit

Although many people are drawn to the ML industry, there are still few experts who can take
complete ownership of this technology.

5) Implementation

Organizations often already have analytics engines in place when they decide to move up to ML.
Integrating newer ML techniques with existing processes is a complicated task.

6) Making The Wrong Assumptions



Many ML models cannot handle datasets containing missing data points. Thus, features that contain a large
proportion of missing data may need to be dropped or imputed.

7) Deficient Infrastructure

ML requires a tremendous amount of data-processing capability. Legacy systems often cannot handle the
workload and buckle under the pressure.

8) Having Algorithms Become Obsolete When Data Grows


ML algorithms consistently require a lot of data for training. Frequently, these algorithms are trained on a
specific dataset and afterwards used to predict future data; keeping them accurate as the data grows and
changes requires a significant amount of ongoing effort.

9) Absence Of Skilled Resources

Another issue in Machine Learning is that deep analytics and ML, in their present forms, are still relatively
new technologies, so skilled practitioners are scarce.

10) Customer Segmentation


Consider user behaviour data collected during a trial period along with the relevant previous behaviour. An
algorithm is then needed to recognize those customers who will convert to the paid version of a product and
those who will not.

Supervised learning approaches commonly used for such problems include:

 Neural Networks
 Naive Bayesian Model
 Classification
 Support Vector Machines
 Regression
 Random Forest Model
11) Complexity

Although Machine Learning and Artificial Intelligence are booming, a majority of these sectors are still in
their experimental phases, actively undergoing a trial and error method.

12) Slow Results

Another one of the most common issues in Machine Learning is the slow-moving program. The Machine
27
Learning Models are highly efficient bearing accurate results but the said results take time to be produced.

13) Maintenance

The required results for different actions are bound to change over time, and hence the data needed for them
changes as well, so models need ongoing maintenance and retraining.

14) Concept Drift

This occurs when the statistical properties of the target variable change over time, resulting in the delivered
predictions becoming inaccurate. It forces the decay of models, as they cannot easily adapt to such changes without retraining.

15) Data Bias

This occurs when certain aspects of a dataset are given more importance or are over-represented relative to others.

16) High Chances Of Error

Many algorithms are trained on biased data or encode biased assumptions, which leads to biased outputs. The
model then fails to deliver the right output and produces irrelevant information.

17) Lack Of Explainability

Machine Learning is often termed a "black box", as deciphering the outcomes of an algorithm is often
complex and sometimes nearly impossible to interpret.
