Notes Unit1 Unit2
Data science is the study of data to extract meaningful insights for business. It is a multidisciplinary approach that
combines principles and practices from the fields of mathematics, statistics, artificial intelligence, and computer
engineering to analyze large amounts of data. This analysis helps data scientists to ask and answer questions like
what happened, why it happened, what will happen, and what can be done with the results.
OR
Data Science is an interdisciplinary field that combines techniques from statistics, computer science, and domain
expertise to extract meaningful insights from data. It involves processes like data collection, cleaning, analysis, and
visualization, aiming to make data-driven decisions and predictions.
Data science is important because it combines tools, methods, and technology to generate meaning from data.
Modern organizations are inundated with data; there is a proliferation of devices that can automatically collect and
store information. Online systems and payment portals capture more data in the fields of e-commerce, medicine,
finance, and every other aspect of human life. We have text, audio, video, and image data available in vast
quantities.
Key Components of Data Science:
Data Collection & Cleaning: Gathering raw data from various sources (databases, APIs, etc.), then cleaning and
preprocessing it to make it suitable for analysis.
Exploratory Data Analysis (EDA): This is the process of using statistics and visualization tools to understand the
structure and patterns in the data before building models.
Modeling: Applying algorithms (like machine learning or statistical models) to make predictions or classify data
based on the available information.
Visualization: Creating visual representations of the data or results from analysis (e.g., graphs, charts) to help
communicate findings.
Communication: Presenting results in a way that stakeholders or others can understand and use to inform decisions.
R Tool
R is an open-source programming language and environment for statistical computing and data analysis. It is an important tool
for Data Science. It is highly popular and is the first choice of many statisticians and data scientists.
1. Statistical Computing and Data Handling: R covers the full span of data science tasks:
- Includes powerful tools for creating aesthetic and insightful visualizations.
- Facilitates data extraction, transformation, and loading, with interfaces for SQL, spreadsheets, and more.
- Provides essential packages for cleaning and transforming data.
- Enables the application of ML algorithms to predict future events.
- Supports analysis of unstructured data through NoSQL database interfaces.
2. Visualization: With packages like ggplot2, R allows for easy and flexible data visualization,
enabling you to create everything from basic bar charts to sophisticated heat maps and interactive visualizations.
3. Extensive Libraries: R has a rich ecosystem of packages like dplyr, tidyr, caret, and shiny that
simplify tasks like data cleaning, modeling, and building interactive web apps.
4. Community Support: R has a large, active community, making it easier to find resources, tutorials,
and answers to questions.
5. Integration: R can be integrated with other tools, such as Python, databases, and cloud computing
platforms, making it versatile for various data science applications.
Commonly used packages:
1. dplyr: For data manipulation (filtering, selecting, and transforming data).
2. ggplot2: For data visualization (creating plots like histograms, scatter plots, etc.).
Install and load them:
install.packages("ggplot2")
install.packages("dplyr")
library(ggplot2)
library(dplyr)
A typical analysis workflow:
1. Install and Load Packages: As shown above.
2. Data Import and Exploration: You can load data from various formats like CSV, Excel, or databases.
data <- read.csv("mydata.csv") # hypothetical file name
summary(data)
head(data)
3. Data Cleaning: Clean your data by handling missing values, filtering irrelevant rows, or
transforming variables.
4. Modeling: Fit a statistical or machine learning model, then inspect it:
model <- lm(y ~ x, data = data) # hypothetical formula
summary(model)
5. Visualization: Use ggplot2 to plot the data or the model's results.
Overview of Data Science
Data science is the study of data. Just as biology is the study of living things and physics is the study of physical
phenomena, data science studies data itself. Data is real, data has real properties, and we need to study those
properties if we are going to work with data. Data Science combines data with the scientific method.
It is a process, not an event. It is the process of using data to understand many different things, to understand
the world. For example, when you have a model or proposed explanation of a problem, you try to validate that
proposed explanation or model with your data.
USES OF DATA SCIENCE:
Data science is a field that involves using scientific methods, processes, algorithms, and systems to extract
knowledge and insights from structured and unstructured data. It can be used in a variety of industries and
applications such as:
Business: Data science can be used to analyze customer data, predict market trends, and optimize business
operations.
Healthcare: Data science can be used to analyze medical data and identify patterns that can aid in diagnosis,
treatment, and drug discovery.
Finance: Data science can be used to identify fraud, analyze financial markets, and make investment decisions.
Social Media: Data science can be used to understand user behavior, recommend content, and identify
influencers.
Internet of Things (IoT): Data science can be used to analyze sensor data from IoT devices and make predictions about
equipment failures, traffic patterns, and more.
Natural Language Processing: Data science can be used to make computers understand human language,
process large amounts of text or speech data and make predictions.
1. Data Collection: Gathering raw data from sources such as databases, APIs, sensors, or web scraping.
2. Data Cleaning and Preprocessing: Raw data often contains errors, inconsistencies, and missing
values. Cleaning and preprocessing data is one of the most time-consuming tasks in data science.
This step includes handling missing values, removing duplicates, correcting inconsistent entries, and standardizing formats.
Without this step, any analysis or model built on the data may be inaccurate.
3. Exploratory Data Analysis (EDA): This step involves visually and statistically exploring the data to identify
patterns, trends, and potential anomalies. Techniques used include summary statistics, histograms, scatter plots,
box plots, and correlation analysis.
EDA helps the data scientist understand the data's structure and relationships between variables, guiding further
analysis.
4. Feature Engineering: Feature engineering is the process of selecting, modifying, or creating new
variables (features) from raw data to improve model performance. For example, you might transform a date column
into features like day of the week or month, or combine multiple variables into a single one.
5. Model Building: After cleaning and preparing the data, the next step is to apply machine learning
algorithms or statistical models to make predictions or understand relationships. These models can be:
o Supervised learning: When the target variable is known (e.g., predicting house prices based
on features like size, location, etc.).
o Unsupervised learning: When the target variable is unknown and you want to find hidden
patterns (e.g., clustering customers based on purchasing behavior).
o Reinforcement learning: Algorithms that learn by interacting with an environment to
maximize rewards (e.g., self-driving cars, game AI).
6. Model Evaluation and Validation: After building models, it's crucial to evaluate their performance
using various metrics like accuracy, precision, recall, F1 score, and cross-validation techniques to ensure they
generalize well to new, unseen data.
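To make these metrics concrete, here is a minimal sketch in base R that computes them from hypothetical predicted and actual labels (the label vectors are invented for illustration):
# hypothetical binary labels: 1 = positive class
actual    <- c(1, 0, 1, 1, 0, 1, 0, 1)
predicted <- c(1, 0, 0, 1, 0, 1, 1, 1)
cm <- table(Actual = actual, Predicted = predicted)  # confusion matrix
accuracy  <- sum(diag(cm)) / sum(cm)
precision <- cm["1", "1"] / sum(cm[, "1"])  # TP / (TP + FP)
recall    <- cm["1", "1"] / sum(cm["1", ])  # TP / (TP + FN)
f1 <- 2 * precision * recall / (precision + recall)
c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1)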
7. Model Deployment: Once validated, the model is deployed into a production environment where it can generate predictions on new, incoming data.
8. Communication and Visualization: Communicating the results of data analysis is essential for
decision-making. Data scientists create dashboards, visualizations, and reports to make complex insights
understandable for stakeholders or non-technical audiences.
Common Tools in Data Science:
1. Programming Languages:
o Python: Popular for its simplicity and powerful libraries like Pandas, NumPy, Scikit-learn,
and TensorFlow.
o R: Widely used for statistical analysis, data visualization, and reporting.
o SQL: Essential for querying and manipulating data stored in relational databases.
2. Data Visualization Tools:
o Tableau, Power BI, and libraries such as ggplot2 (for turning analysis results into charts and dashboards)
3. Cloud Platforms:
o AWS, Google Cloud, Microsoft Azure (for scalable data storage, processing, and
deployment)
Data Science Roles:
1. Data Scientist: Combines statistics, programming, and domain knowledge to build models and extract insights across the full data pipeline.
2. Data Analyst: Focuses on interpreting data, creating reports, and providing actionable insights
through data visualizations and descriptive statistics.
3. Data Engineer: Focuses on building infrastructure to collect, store, and preprocess large amounts of
data. They ensure the data is clean, available, and accessible for analysis.
4. Machine Learning Engineer: Specializes in designing and deploying machine learning models at
scale. They work closely with data scientists and engineers to implement production-ready models.
5. Business Intelligence (BI) Analyst: Focuses on interpreting data through dashboards and reports to
inform business decisions. They often use tools like Tableau or Power BI.
Applications of Data Science:
Healthcare: Predictive modeling for patient outcomes, drug discovery, and personalized treatment
plans.
Finance: Fraud detection, risk analysis, and algorithmic trading.
Retail: Customer segmentation, recommendation engines, and demand forecasting.
Social Media: Sentiment analysis, trend detection, and personalized content delivery.
Autonomous Vehicles: Self-driving cars use machine learning models to process sensor data and
make real-time decisions.
Sports Analytics: Player performance analysis, injury prediction, and game strategy optimization.
Manufacturing: Data Science helps in predictive maintenance of machines, reducing downtime and
extending equipment life.
Supply Chain Engineering: Machine learning models can forecast demand, optimize inventory
levels, and streamline logistics operations.
2. Predictive Maintenance
In industries like aerospace, manufacturing, and automotive engineering, Data Science is used for predictive
maintenance. By monitoring the health of machinery and infrastructure through sensors and analyzing historical
performance data, engineers can predict potential failures before they occur. This reduces unexpected downtime and
maintenance costs.
Example: In the case of turbines, sensors can collect data on temperature, vibrations, and pressure,
and machine learning algorithms can predict when parts will likely fail, allowing maintenance to be scheduled
proactively.
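As a rough illustration, the sketch below fits a logistic regression in base R on synthetic sensor readings; all variable names, values, and the failure-generating rule are invented for illustration:
set.seed(42)
n <- 200
sensors <- data.frame(
  temperature = rnorm(n, mean = 70, sd = 10),
  vibration   = rnorm(n, mean = 5, sd = 2)
)
# synthetic rule: hotter and more vibration -> higher failure probability
p <- plogis(-20 + 0.2 * sensors$temperature + 0.8 * sensors$vibration)
sensors$failure <- rbinom(n, 1, p)
model <- glm(failure ~ temperature + vibration, data = sensors, family = binomial)
# predicted failure probability for a new sensor reading
predict(model, newdata = data.frame(temperature = 85, vibration = 9), type = "response")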
Finite Element Analysis (FEA) and Computational Fluid Dynamics (CFD) tools benefit from Data
Science by running simulations on large datasets to predict how materials and structures will perform under various
conditions.
Generative Design: Data science techniques, such as optimization algorithms, can automatically
generate the best design solutions based on a set of predefined parameters.
Smart Cities: Using data from traffic sensors, weather stations, and energy grids, engineers can
optimize traffic flow, energy consumption, and waste management in urban settings.
Smart Grids: Data science can predict energy demands, balance supply and consumption, and
improve grid reliability.
Robotics: Machine learning allows robots to improve their performance over time by learning from
past tasks, making them more efficient in performing complex engineering tasks.
Automation of Design and Manufacturing: In industries like automotive manufacturing, data science
can be used to automate assembly lines, quality control, and testing procedures, improving efficiency and precision.
Structural Engineering: Engineers use data from sensors embedded in buildings, bridges, and dams
to monitor the health of these structures. By analyzing real-time data, engineers can make informed decisions about
when maintenance or repairs are needed.
Energy Engineering: Data Science is essential in optimizing energy production and consumption.
Engineers can analyze data from energy grids, weather patterns, and consumption trends to improve the efficiency of
renewable energy sources like wind and solar.
7. Enhancing Safety and Risk Management
Engineering involves assessing and mitigating risks. Data Science provides the tools to identify and quantify
potential risks in systems and structures.
Risk Assessment: Engineers use data to predict potential failures, accidents, or disasters. By
analyzing historical failure data, engineers can model and predict potential risks in infrastructure projects, helping
prevent costly or dangerous incidents.
Safety Monitoring: In areas like civil engineering, oil and gas, and mining, continuous monitoring of
worker safety and environmental conditions is enabled by real-time data analytics, providing early warning systems
for hazardous situations.
Example: In aerospace engineering, Data Science helps in the development of lightweight and
durable materials for aircraft, optimizing both safety and fuel efficiency.
Example: In semiconductor manufacturing, Data Science is used to detect minute defects in the
production process that could lead to a failure, reducing scrap and improving yield.
Smart Buildings and Infrastructure: IoT sensors in buildings can monitor energy usage, air quality,
and structural integrity, while Data Science can help engineers make real-time decisions to improve performance and
sustainability.
The data science process is a systematic sequence of steps to turn raw data into valuable insights, predictions, or
decisions. Below is a breakdown of the typical stages of the data science process:
1. Problem Definition: The process begins by clearly defining the problem or question that the analysis should answer.
o Example: Are you trying to predict customer churn, classify emails as spam, or optimize
manufacturing processes?
2. Data Collection: In this step, data is gathered from various sources, which could include databases,
APIs, sensors, or public data sources.
3. Data Cleaning and Preprocessing: The raw data is cleaned and transformed into a form suitable for analysis.
o Example: Removing NaN values, converting categorical variables into numerical ones, and
scaling data for machine learning models.
4. Exploratory Data Analysis (EDA): In this phase, data scientists perform initial analysis to understand
the structure, patterns, and relationships within the data. Data visualization tools and summary statistics are often
used.
o Example: Creating histograms, scatter plots, and correlation matrices to visualize data
distributions and relationships.
5. Feature Engineering: Data scientists create new features or transform existing ones to improve
model performance. This might involve creating variables that capture important information or aggregating
features.
o Example: Extracting "day of the week" from a timestamp or combining multiple variables
into one.
6. Model Building: After data preprocessing and feature engineering, the next step is to apply machine
learning models or statistical techniques to solve the problem at hand.
7. Model Evaluation: The model's performance is measured on data it has not seen during training.
o Example: Checking the model's accuracy on a test dataset or using a confusion matrix for
classification tasks.
8. Model Deployment: After evaluating the model and confirming its performance, it’s deployed into a
production environment where it can make real-time predictions or decisions.
o Example: Deploying the model in a web app or integrating it into an existing system.
9. Communication and Reporting: In this final step, the results and findings are communicated to
stakeholders in a clear and understandable way, often through dashboards, reports, or visualizations.
Data Types in Data Science
a) Numeric: Whole numbers (integers, e.g., 25) or numbers with decimal points (floats, e.g., 3.14).
Use case: Numeric data types are used for quantitative analysis, calculations, and mathematical modeling.
b) Categorical:
Nominal: Data that represents categories without any order (e.g., color, gender, type of animal).
Ordinal: Data that represents categories with a specific order or rank, but the differences between
categories are not defined (e.g., education level: High School < Bachelor's < Master's).
Use case: Categorical data is often used for classification tasks, grouping, and labeling.
c) Boolean: Logical values that take one of two states, TRUE or FALSE.
Use case: Boolean values are used for binary classification and decision-making logic.
d) Text (String):
String: A sequence of characters used to represent text or alphanumeric data (e.g., "John", "Data
Science", "abc123").
Use case: Strings are used for text analysis, natural language processing (NLP), and categorical data encoding.
e) Date and Time: Values representing dates, times, or timestamps (e.g., 2024-05-01, 10:30:00).
Use case: Date and time data are used for time-series analysis, scheduling, and temporal analysis.
Data Structures in Data Science
a) Lists/Arrays:
List (Python): A collection of items that can be of any type (e.g., [1, 2, 3, 4, 5] or ["apple", "banana",
"cherry"]).
Array (NumPy): An array is similar to a list, but it is optimized for numerical operations and is more
efficient when dealing with large datasets.
Use case: Lists and arrays are used for storing ordered collections of data, such as time-series data or experiment
results.
b) Tuples:
Tuple (Python): An ordered, immutable collection of elements (e.g., (1, 2, 3) or ("apple", 2, 5.5)).
Use case: Tuples are used when the order of elements matters but the data should not be changed, like coordinates or
fixed configurations.
c) Dictionaries:
Dictionary (Python): A collection of key-value pairs, where each key maps to a value (e.g., {"name":
"John", "age": 25}).
Use case: Dictionaries are used when there is a need to store data in a non-sequential, fast-access manner, like user
information or attributes of objects.
d) DataFrames:
Pandas DataFrame: A two-dimensional, tabular data structure with labeled axes (rows and columns).
It is the most commonly used structure in Data Science, especially when dealing with structured datasets.
o Example: pd.DataFrame({'Name': ['John', 'Jane'], 'Age': [28, 22]}).
Use case: DataFrames are used to store and manipulate tabular data (e.g., from CSV files, databases, or Excel). They
provide functions for data cleaning, transformation, and analysis.
e) Matrices:
Matrix: A 2D array or collection of numbers arranged in rows and columns. Matrices are widely
used in linear algebra operations and machine learning models.
o Example: A 3x3 matrix:
[[1, 2, 3],
 [4, 5, 6],
 [7, 8, 9]]
Use case: Matrices are commonly used in machine learning, deep learning, and data transformation tasks such as
principal component analysis (PCA) and matrix factorization.
f) Series:
Pandas Series: A one-dimensional array-like object that can hold data of any type (e.g., integers,
strings, floats). It is similar to a column in a DataFrame.
o Example: pd.Series([1, 2, 3, 4]).
Use case: A Series is often used when handling a single column or feature in a dataset, like a list of ages or names.
g) Graphs: Data structures used to represent networks of objects. They consist of nodes (vertices) and
edges (connections between nodes).
o Example: A social network where people are nodes and their relationships are edges.
Use case: Graphs are used in problems like social network analysis, recommendation systems, and route
optimization.
Introduction to R Programming
R is a powerful and widely used programming language and environment for statistical computing and data analysis.
Developed by statisticians Ross Ihaka and Robert Gentleman in 1993, R has since evolved into one of the most
popular languages for data science, statistical analysis, and machine learning. It is open-source, meaning anyone can
use and modify it, which has led to a large and active community contributing to its development.
R is particularly known for its rich ecosystem of packages, libraries, and tools for data manipulation, statistical
modeling, and graphical visualization. It is widely used in academia, research, and industry, particularly in fields
such as bioinformatics, social sciences, economics, and marketing.
Key Features of R
1. Statistical Computing: R provides a wide range of statistical techniques, including linear and nonlinear modeling, classical statistical tests, and time-series analysis.
2. Data Visualization: One of R's standout features is its ability to create high-quality visualizations.
Packages like ggplot2 allow users to create a variety of graphs, from simple bar charts and histograms to complex
multi-dimensional visualizations.
3. Extensive Package Ecosystem: R has a huge collection of packages (over 15,000 available on
CRAN), which extend its functionality. These packages allow users to perform advanced machine learning, data
mining, and visualization tasks with minimal effort.
4. Cross-Platform Compatibility: R runs on Windows, macOS, and Linux, so the same analysis can be shared across systems.
5. Reproducibility: R supports a reproducible research workflow, where you can create reports that
include code, results, and visualizations, and share them with others. Tools like R Markdown and knitr allow you to
embed R code directly into reports and generate dynamic documents.
6. Integration with Other Languages: R can interact with other programming languages such as Python,
C/C++, and Java, which is useful for projects that require integration with other tools or legacy systems.
7. Community Support: R has a large and active community that regularly contributes to tutorials,
forums, and documentation. Whether you're a beginner or an expert, you can find resources to help with any project.
R Programming Structure
R uses a command-line interface (CLI) and a script-based programming environment. You can use R in various
interfaces:
Basic R Syntax
R uses a syntax that is relatively easy to learn, especially for those with a background in statistics or mathematics.
Below are some common elements of R programming.
1. Variables:
Values are assigned with the <- operator:
x <- 5 # Numeric
name <- "John" # Character (example value assumed)
flag <- TRUE # Logical
2. Vectors:
Vectors are created with the c() function:
v <- c(1, 2, 3)
3. Data Frames:
A data frame is the most commonly used structure in R for handling tabular data (like spreadsheets). It consists of
columns and rows:
df <- data.frame(Name = c("John", "Jane"), Age = c(28, 22))
print(df)
4. Functions:
R allows you to define functions to perform tasks. Functions are defined using the function() keyword:
add <- function(a, b) {
  return(a + b)
}
add(2, 3) # returns 5
5. Control Structures:
R uses standard control structures like if, for, while, and repeat for conditional logic and loops:
# If-Else
x <- 15
if (x > 10) {
  print("x is greater than 10")
} else {
  print("x is 10 or less")
}
# For loop
for (i in 1:5) {
  print(i)
}
6. Plotting:
# Simple plot
x <- 1:10
y <- x^2
plot(x, y)
# ggplot2 example
library(ggplot2)
ggplot(data.frame(x, y), aes(x = x, y = y)) + geom_point()
Setting Up R
1. Install R: Download and install R from CRAN (https://cran.r-project.org).
2. Install RStudio: RStudio is an IDE that makes working with R more efficient. Download it from RStudio's
website.
3. Install Packages: You can install packages in R using the install.packages() function:
install.packages("ggplot2")
4. Load Libraries: After installation, you need to load the package into your R environment using
library():
library(ggplot2)
1. Data Manipulation and Cleaning: Packages like dplyr and tidyr are used for data wrangling tasks,
such as cleaning, reshaping, and transforming data.
2. Statistical Analysis: R provides a wide array of statistical functions for hypothesis testing,
regression, ANOVA, and more.
3. Data Visualization: With ggplot2, you can create beautiful, customizable visualizations like scatter
plots, line charts, and heatmaps.
4. Machine Learning: R supports machine learning algorithms via packages like caret, randomForest,
and xgboost.
5. Text Mining and NLP: R has powerful tools for text mining and natural language processing through
packages like tm and text.
1. Creating Data
Before we start manipulating data, let's create some sample data to work with:
# sample values (assumed for illustration)
data <- data.frame(Name = c("John", "Jane", "Alex", "Emily"),
                   Age = c(28, 22, 30, 25),
                   Height = c(180, 165, 175, 160),
                   Gender = c("Male", "Female", "Male", "Female"))
print(data)
This will create a data frame with columns Name, Age, Height, and Gender.
2. Selecting Columns
Using base R:
You can select columns by name or index.
data$Age
data[c("Name", "Age")]
Using dplyr:
You can use the select() function from dplyr to select columns.
library(dplyr)
select(data, Name, Age)
3. Filtering Rows
Using base R:
You can filter rows based on conditions.
data[data$Age > 25, ] # rows where Age exceeds 25 (condition assumed)
Using dplyr:
filter(data, Age > 25)
4. Adding New Columns
Using base R:
You can add new columns by directly assigning values.
data$AgePlus5 <- data$Age + 5 # hypothetical derived column
Using dplyr:
The mutate() function is used to create or modify columns.
mutate(data, AgePlus5 = Age + 5)
5. Renaming Columns
Using base R:
You can rename columns by modifying the colnames() attribute.
data2 <- data # work on a copy
colnames(data2)[colnames(data2) == "Age"] <- "Age_Years"
Using dplyr:
The rename() function in dplyr allows renaming columns in a simpler way.
rename(data, Age_Years = Age) # returns a renamed copy
6. Sorting Rows
Using base R:
You can sort data using the order() function.
data[order(data$Age), ]
Using dplyr:
The arrange() function from dplyr allows sorting in both ascending and descending order.
arrange(data, Age) # ascending
arrange(data, desc(Age)) # descending
7. Summarizing Data
Using base R:
You can summarize data using functions like mean(), sum(), sd(), etc.
mean(data$Age)
summary(data$Age)
Using dplyr:
The summarize() function (or summarise()) can be used to calculate summary statistics.
summarize(data, mean_age = mean(Age))
8. Grouping Data
Using base R:
You can group data using the aggregate() function.
aggregate(Age ~ Gender, data = data, FUN = mean)
Using dplyr:
The group_by() function groups the data by one or more columns, and then you can apply a summary function like
summarize().
data %>% group_by(Gender) %>% summarize(mean_age = mean(Age))
9. Merging Data Frames
Using base R:
# df1 and df2 are hypothetical data frames that share an "ID" column
merged_data <- merge(df1, df2, by = "ID")
print(merged_data)
Using dplyr:
# Use left_join to join two data frames (shared "ID" column assumed)
merged_data <- left_join(df1, df2, by = "ID")
print(merged_data)
10. Reshaping Data
Using tidyr:
library(tidyr)
# e.g., wide to long format (columns taken from the sample data above)
long_data <- pivot_longer(data, cols = c(Age, Height), names_to = "variable", values_to = "value")
dplyr function reference:
Function Name   Description
filter()        Produces a subset of a Data Frame.
distinct()      Removes duplicate rows in a Data Frame.
arrange()       Reorders the rows of a Data Frame.
select()        Produces data in required columns of a Data Frame.
rename()        Renames the variable names.
mutate()        Creates new variables without dropping old ones.
transmute()     Creates new variables by dropping the old ones.
summarize()     Gives summarized data like Average, Sum, etc.
filter() method
The filter() function is used to produce the subset of the data that satisfies the condition specified in the filter()
method. In the condition, we can use conditional operators, logical operators, NA values, range operators etc. to
filter out data. Syntax of filter() function is given below-
filter(dataframeName, condition)
Example:
In the below code we used filter() function to fetch the data of players who scored more than 100 runs from the
“stats” data frame.
R
library(dplyr)
# sample data frame used throughout these examples
# (values chosen to match the outputs shown)
stats <- data.frame(player = c("A", "B", "C", "D"),
                    runs = c(100, 200, 408, 19),
                    wickets = c(17, 20, NA, 5))
filter(stats, runs > 100)
Output
player runs wickets
1 B 200 20
2 C 408 NA
distinct() method
The distinct() method removes duplicate rows from data frame or based on the specified columns. The syntax of
distinct() method is given below-
distinct(dataframeName, col1, col2,.., .keep_all=TRUE)
Example:
Here in this example, we used distinct() method to remove the duplicate rows from the data frame and also
remove duplicates based on a specified column.
R
library(dplyr)
# this example assumes a version of stats that contains an exact duplicate
# row plus a second entry for player A, matching the output below
distinct(stats)
# remove duplicates based on a column
distinct(stats, player, .keep_all = TRUE)
Output
player runs wickets
1 A 100 17
2 B 200 20
3 C 408 NA
4 D 19 5
5 A 56 2
player runs wickets
1 A 100 17
2 B 200 20
3 C 408 NA
4 D 19 5
arrange() method
In R, the arrange() method is used to order the rows based on a specified column. The syntax of arrange() method
is specified below-
arrange(dataframeName, columnName)
Example:
In the below code we ordered the data based on the runs from low to high using arrange() function.
R
library(dplyr)
arrange(stats, runs)
Output
player runs wickets
1 D 19 5
2 A 100 17
3 B 200 20
4 C 408 NA
select() method
The select() method is used to extract the required columns as a table by specifying the required column names in
select() method. The syntax of select() method is mentioned below-
select(dataframeName, col1,col2,…)
Example:
Here in the below code we fetched the player, wickets column data only using select() method.
R
library(dplyr)
select(stats, player, wickets)
Output
player wickets
1 A 17
2 B 20
3 C NA
4 D 5
rename() method
The rename() function is used to change the column names. This can be done by the below syntax-
rename(dataframeName, newName=oldName)
Example:
In this example, we change the column name “runs” to “runs_scored” in stats data frame.
R
library(dplyr)
rename(stats, runs_scored=runs)
Output
player runs_scored wickets
1 A 100 17
2 B 200 20
3 C 408 NA
4 D 19 5
mutate() & transmute() methods
These methods are used to create new variables. The mutate() function creates new variables without dropping the
old ones but transmute() function drops the old variables and creates new variables. The syntax of both methods is
mentioned below-
mutate(dataframeName, newVariable=formula)
transmute(dataframeName, newVariable=formula)
Example:
In this example, we created a new column avg using mutate() and transmute() methods.
R
library(dplyr)
mutate(stats, avg=runs/4)
transmute(stats, avg=runs/4)
Output
player runs wickets avg
1 A 100 17 25.00
2 B 200 20 50.00
3 C 408 NA 102.00
4 D 19 5 4.75
avg
1 25.00
2 50.00
3 102.00
4 4.75
Here the mutate() function adds a new column to the existing data frame without dropping the old ones, whereas the
transmute() function creates the new variable but drops all the old columns.
summarize() method
Using the summarize method we can summarize the data in the data frame by using aggregate functions like
sum(), mean(), etc. The syntax of summarize() method is specified below-
summarize(dataframeName, aggregate_function(columnName))
Example:
In the below code we presented the summarized data present in the runs column using summarize() method.
R
library(dplyr)
# summarize method
summarize(stats, sum(runs), mean(runs))
Output
sum(runs) mean(runs)
1 727 181.75
SIMPLE PROGRAMS USING R
Here are some simple programs written in R that demonstrate basic programming concepts such as variable
assignment, control structures, functions, loops, and basic data manipulation.
print("Hello, World!")
num1 <- 5
num2 <- 3
num <- 7
if (num %% 2 == 0) {
} else {
4. Factorial of a Number
This program calculates the factorial of a number using a for loop.
factorial_of <- function(n) {
  result <- 1
  for (i in 1:n) {
    result <- result * i
  }
  return(result)
}
num <- 5
print(factorial_of(num)) # 120
5. Fibonacci Sequence
This program prints the first n Fibonacci numbers.
# Fibonacci Sequence
fibonacci <- function(n) {
  fib_seq <- numeric(n)
  fib_seq[1] <- 0
  fib_seq[2] <- 1
  for (i in 3:n) {
    fib_seq[i] <- fib_seq[i - 1] + fib_seq[i - 2]
  }
  return(fib_seq)
}
n <- 10
print(fibonacci(n))
6. Prime Numbers up to n
is_prime <- function(num) {
  if (num <= 1) {
    return(FALSE)
  }
  if (num <= 3) {
    return(TRUE) # 2 and 3 are prime; also avoids the 2:1 loop pitfall below
  }
  for (i in 2:floor(sqrt(num))) {
    if (num %% i == 0) {
      return(FALSE)
    }
  }
  return(TRUE)
}
n <- 20
prime_numbers <- c()
for (i in 1:n) {
  if (is_prime(i)) {
    prime_numbers <- c(prime_numbers, i)
  }
}
print(prime_numbers) # 2 3 5 7 11 13 17 19
7. Simple Calculator
This program performs basic arithmetic operations (addition, subtraction, multiplication, division) on two numbers.
# Simple Calculator
calculate <- function(a, b, operation) {
  if (operation == "add") {
    return(a + b)
  } else if (operation == "subtract") {
    return(a - b)
  } else if (operation == "multiply") {
    return(a * b)
  } else if (operation == "divide") {
    return(a / b)
  } else {
    return("Invalid operation")
  }
}
num1 <- 10
num2 <- 5
print(calculate(num1, num2, "add")) # 15
8. Working with Vectors
numbers <- c(3, 1, 4, 1, 5, 9) # sample values (assumed)
numbers_plus_five <- numbers + 5
print(numbers_plus_five)
sorted_numbers <- sort(numbers)
print("Sorted numbers:")
print(sorted_numbers)
filtered_numbers <- numbers[numbers > 3]
print(filtered_numbers)
9. Simple Scatter Plot
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)
plot(x, y, main = "Scatter Plot", xlab = "X-Axis", ylab = "Y-Axis", col = "blue", pch = 16)
10. Selecting and Filtering a Data Frame
# uses the "data" data frame created earlier
selected_data <- data[c("Name", "Age")]
print("Selected Columns:")
print(selected_data)
filtered_data <- data[data$Age > 25, ]
print(filtered_data)
a <- c("geek","for","geek")
b <- c("hello","coder")
c <- c(a,b)
c
The primary feature of an RDBMS is the use of relations, which are represented as tables. Each table consists of
rows (also known as records or tuples) and columns (also known as attributes or fields). These tables are connected
to each other through relationships.
1. Tables:
o Data is stored in tables, which are similar to spreadsheets where each row represents a
record, and each column represents an attribute of that record.
o Tables are defined by the columns, each of which has a defined data type (such as integers,
strings, etc.).
o Example:
CREATE TABLE Customers (
    CustomerID INT,
    CustomerName VARCHAR(100),
    Age INT,
    Country VARCHAR(50)
);
3. Primary Key:
o A primary key is a unique identifier for each row in a table. It ensures that no two rows in the
table have the same value in the primary key column(s).
o Example:
CREATE TABLE Customers (
    CustomerID INT PRIMARY KEY,
    CustomerName VARCHAR(100)
);
4. Foreign Key:
o A foreign key is a column or set of columns in one table that links to the primary key of
another table. It establishes a relationship between two tables.
o Example:
CREATE TABLE Orders (
    OrderID INT PRIMARY KEY,
    CustomerID INT,
    FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID)
);
5. Relationships:
o One-to-One: A single row in one table is related to a single row in another table.
o One-to-Many: A single row in one table is related to multiple rows in another table. This is
the most common type of relationship in RDBMS.
o Many-to-Many: Multiple rows in one table are related to multiple rows in another table. This
is typically represented by creating a junction table.
6. SQL (Structured Query Language):
o SQL is used to interact with an RDBMS. It allows you to perform a variety of operations,
such as inserting, updating, deleting, and querying data from tables.
o Common SQL operations:
SELECT: Retrieve data from one or more tables.
INSERT: Add new data into a table.
UPDATE: Modify existing data in a table.
DELETE: Remove data from a table.
CREATE: Create new tables, views, or indexes.
ALTER: Modify the structure of an existing table.
7. Normalization:
o Normalization is the process of organizing data in such a way that redundancy is minimized
and dependencies are logical.
o The goal is to divide a large table into smaller, manageable tables and eliminate undesirable
characteristics like redundancy and inconsistency.
o Common normal forms include 1NF (First Normal Form), 2NF (Second Normal Form), and
3NF (Third Normal Form).
8. ACID Properties:
o ACID stands for Atomicity, Consistency, Isolation, and Durability. These properties ensure
that database transactions are processed reliably.
Atomicity: A transaction is an atomic unit of work, meaning it is either fully
completed or fully rolled back.
Consistency: A transaction takes the database from one valid state to another.
Isolation: Transactions do not interfere with each other; changes made by one
transaction are not visible to others until committed.
Durability: Once a transaction is committed, its effects are permanent, even in the
case of a system failure.
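From client code, atomicity can be observed with a transaction. Here is a minimal sketch using R's DBI package with an in-memory SQLite database (the table and amounts are hypothetical):
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "accounts", data.frame(id = c(1, 2), balance = c(100, 50)))
dbBegin(con)  # start a transaction
tryCatch({
  dbExecute(con, "UPDATE accounts SET balance = balance - 30 WHERE id = 1")
  dbExecute(con, "UPDATE accounts SET balance = balance + 30 WHERE id = 2")
  dbCommit(con)                          # both updates become permanent together
}, error = function(e) dbRollback(con))  # on any error, neither update survives
dbReadTable(con, "accounts")
dbDisconnect(con)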
RDBMS Architecture
RDBMS typically follows client-server architecture, where:
The client sends requests to the server (e.g., querying, updating data).
The server processes these requests and manages the database files.
Example Workflow
1. Creating a Database:
CREATE DATABASE Shop;
USE Shop;
2. Creating a Table:
CREATE TABLE Products (
    ProductID INT PRIMARY KEY,
    ProductName VARCHAR(100),
    Price DECIMAL(10, 2)
);
3. Querying Data:
SELECT * FROM Products;
4. Joining Tables (Example of a One-to-Many relationship): Assume there's another table, Orders, that
links customers to products.
SELECT Orders.OrderID, Products.ProductName
FROM Orders
JOIN Products ON Orders.ProductID = Products.ProductID;
5. Updating Data:
UPDATE Products
SET Price = 99.99 -- hypothetical new price
WHERE ProductID = 1;
6. Deleting Data:
DELETE FROM Products
WHERE ProductID = 1;
Advantages of RDBMS
RDBMSs provide structured storage with a well-defined schema, data integrity enforced through keys and constraints, a standard and powerful query language (SQL), ACID-compliant transactions, and the ability to scale to enterprise workloads.
In RDBMS (Relational Database Management System), SQL (Structured Query Language) is used to interact with
the database. Four essential SQL operations that are used to manipulate data within tables are:
1. SELECT
The SELECT statement is used to retrieve data from one or more tables.
Syntax:
SELECT column1, column2, ...
FROM table_name
WHERE condition;
Example:
SELECT * FROM Customers;
This will return all rows and all columns from the Customers table.
Further examples (against the same Customers table):
SELECT CustomerName, Age FROM Customers; -- specific columns only
SELECT * FROM Customers WHERE Country = 'USA'; -- rows matching a condition
SELECT DISTINCT Country FROM Customers; -- unique values only
2. INSERT
The INSERT statement is used to add new rows of data into a table.
Syntax:
INSERT INTO table_name (column1, column2, ...)
VALUES (value1, value2, ...);
Example:
INSERT INTO Customers (CustomerID, CustomerName, Age, Country)
VALUES (1, 'John Doe', 28, 'USA');
This adds a new customer with CustomerID 1, CustomerName "John Doe", Age 28, and Country "USA".
Multiple rows can be inserted at once:
INSERT INTO Customers (CustomerID, CustomerName, Age, Country)
VALUES
(2, 'Jane Smith', 34, 'Canada'), -- sample rows (assumed)
(3, 'Bob Johnson', 45, 'UK');
3. UPDATE
The UPDATE statement is used to modify existing records in a table.
Syntax:
UPDATE table_name
SET column1 = value1, column2 = value2, ...
WHERE condition;
Example:
UPDATE Customers
SET Age = 29
WHERE CustomerID = 1;
This will change the Age of the customer with CustomerID 1 to 29.
UPDATE Customers
SET Age = 25, Country = 'Australia'
WHERE CustomerID = 2;
This will change the Age to 25 and the Country to 'Australia' for the customer with CustomerID 2.
UPDATE Customers
SET Age = Age + 1; -- no WHERE clause
Without a WHERE clause, the UPDATE applies to every row in the table.
4. DELETE
The DELETE statement is used to remove one or more rows from a table.
Syntax:
DELETE FROM table_name
WHERE condition;
Example:
DELETE FROM Customers
WHERE CustomerID = 1;
This will remove the row with CustomerID 1 from the Customers table.
DELETE FROM Customers WHERE Age > 50;
This will delete all customers whose age is greater than 50.
DELETE FROM Customers;
This will remove all rows from the Customers table, but the table structure will remain intact.
Create Table:
CREATE TABLE Customers (
    CustomerID INT PRIMARY KEY,
CustomerName VARCHAR(100),
Age INT,
Country VARCHAR(50)
);
INSERT Data:
INSERT INTO Customers (CustomerID, CustomerName, Age, Country)
VALUES
(1, 'John Doe', 28, 'USA'), -- sample values (assumed)
(2, 'Jane Smith', 34, 'Canada'),
(3, 'Bob Johnson', 45, 'UK');
SELECT Data:
SELECT * FROM Customers;
UPDATE Data:
UPDATE Customers
SET Age = 29
WHERE CustomerID = 1;
DELETE Data:
DELETE FROM Customers
WHERE CustomerID = 2;
After running these commands, the Customers table will look like this (sample values as assumed above):
CustomerID   CustomerName   Age   Country
1            John Doe       29    USA
3            Bob Johnson    45    UK
In data science:
Data scientists can rely on the database to hold consistent and accurate data, without worrying about
corruption or duplication.
Data normalization reduces redundancy and ensures data integrity, leading to more efficient and
accurate data analysis.
Large datasets are often required for building and training machine learning models.
SQL queries allow users to filter, aggregate, and manipulate large datasets without the need to load
them entirely into memory.
RDBMSs can scale to handle enterprise-level data, making them useful in big data scenarios.
Structured data is easier to clean, transform, and preprocess for machine learning algorithms.
Set user permissions to control who can access, modify, or delete data.
Use encryption to protect data at rest and in transit.
Ensure auditability by tracking who accessed or modified the data and when.
These controls ensure that data privacy and compliance (e.g., GDPR, HIPAA) are maintained while working
with sensitive data.
Multiple users (data scientists, analysts, engineers) can work collaboratively on a shared dataset
without compromising data integrity.
A typical ETL (Extract, Transform, Load) workflow involves:
Extracting data from different sources (e.g., external APIs, flat files, logs).
Transforming data by cleaning, filtering, and normalizing it within the RDBMS.
Loading data into a format that is ready for analysis or feeding into machine learning models.
RDBMS supports these steps by providing the infrastructure to efficiently manage and manipulate data before it is
used in data science applications.
RDBMSs also integrate directly with the tools data scientists use to perform advanced analyses:
Python: Libraries like SQLAlchemy or pandas allow easy interaction with RDBMS, including data
retrieval, manipulation, and analysis.
R: R provides functions like dbConnect() and dplyr to query and manipulate data directly from an
RDBMS.
Business Intelligence (BI) tools like Tableau, Power BI, and Looker can directly connect to RDBMS
to visualize and report data in real-time.
These integrations make RDBMS a convenient source for structured data in data science workflows.
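For instance, here is a minimal sketch of querying an RDBMS from R via DBI (an in-memory SQLite database stands in for a real server; the table and values are assumed):
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "customers", data.frame(CustomerID = 1:3,
                                          Age = c(28, 34, 45)))
# filter and aggregate inside the database rather than in R's memory
dbGetQuery(con, "SELECT COUNT(*) AS n, AVG(Age) AS avg_age
                 FROM customers WHERE Age > 30")
dbDisconnect(con)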
UNIT 2
1. Vectors
A vector is an ordered list of numbers, which can be thought of as a point or a direction in space. Vectors
are central to data science, as data is often represented as vectors (e.g., in machine learning, features are
often represented as vectors).
2. Matrices
A matrix is a rectangular array of numbers arranged in rows and columns. In data science:
o Matrices are used to represent datasets where rows are data points and columns are
features.
o Matrix multiplication is fundamental in linear regression and neural networks for
calculating model weights and transformations.
o In machine learning, algorithms like principal component analysis (PCA), linear
regression, and support vector machines (SVMs) rely heavily on matrix operations.
3. Linear Transformation
A linear transformation is a function that maps vectors to other vectors in a linear fashion, which means it
preserves vector addition and scalar multiplication. In the context of data science, linear transformations
are useful for scaling, rotating, and projecting data into new coordinate systems.
4. Eigenvalues and Eigenvectors
Eigenvalue Problem: For a square matrix A, an eigenvector v satisfies the equation Av = λv, where λ is the corresponding eigenvalue.
o Principal Component Analysis (PCA): PCA uses eigenvalues and eigenvectors to identify
the principal components in data, which are the directions of maximum variance in the
data. These components help in reducing the dimensionality of the dataset while retaining
most of the information.
o Singular Value Decomposition (SVD): SVD is used in matrix factorization methods, like
in recommender systems (e.g., Netflix, Amazon), to decompose a matrix into eigenvectors
and eigenvalues.
5. Solving Systems of Linear Equations
Many problems in data science reduce to solving a system of linear equations, for example:
x + y = 10
2x + 3y = 20
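This system can be solved directly in R with the base solve() function:
A <- matrix(c(1, 1,
              2, 3), nrow = 2, byrow = TRUE)  # coefficient matrix
b <- c(10, 20)
solve(A, b)  # returns x = 10, y = 0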
Linear Regression: The algorithm finds the best-fitting line through data points using matrix
operations to minimize error.
Support Vector Machines (SVM): SVMs use linear algebra to compute the optimal hyperplane
separating data points.
Neural Networks: Neural networks perform matrix multiplications and linear transformations to
compute the output layer from the input layer.
Dimensionality Reduction: Algorithms like PCA, t-SNE, and LDA use linear algebra to reduce
the dimensionality of data while preserving its structure.
The algebraic view of linear algebra for data science focuses on the mathematical tools and techniques
needed to manipulate and analyze large datasets using vectors, matrices, and linear transformations.
Linear algebra is fundamental to many data science techniques, especially in machine learning, data
manipulation, and optimization. Let's explore the key algebraic concepts that are most relevant in this
context.
Vectors:
A vector is an ordered list of numbers that represents a point or data element in a multi-dimensional space.
Vector Space:
A vector space is a set of vectors that are closed under addition and scalar multiplication. The
ability to perform operations on vectors is a fundamental algebraic concept in linear algebra and
data science.
In data science, vectors can represent data points in high-dimensional spaces, where each component
corresponds to a different feature or characteristic of the data (e.g., pixel values in an image, or
measurements in a dataset).
2. Matrix Operations
In data science, matrices are used to represent datasets where each row is a vector (data point) and each
column represents a feature.
Matrix Representation:
A matrix can represent a collection of data points. For example, a dataset with 3 samples and 2 features can be written as
$A = \begin{bmatrix} x_1 & x_2 \\ x_3 & x_4 \\ x_5 & x_6 \end{bmatrix}$
where each row is a data point.
For example, the coefficients of a linear regression are obtained from the normal equation $\beta = (X^T X)^{-1} X^T y$.
Operations on Matrices:
Common operations include addition, scalar multiplication, matrix multiplication, transposition, and inversion.
Example:
In data projection, you might use a transformation matrix to project data into a lower-dimensional space,
as is done in Principal Component Analysis (PCA) for dimensionality reduction.
In PCA, the goal is to find the principal components of the data, which are the eigenvectors of the
covariance matrix of the data. These eigenvectors represent the directions of maximum variance
in the data.
Solving Systems of Linear Equations:
A system of linear equations can be written as $Ax = b$, where A is a matrix of coefficients, x is a vector of unknowns, and b is the result vector.
Example:
In machine learning, linear regression can be framed as solving a system of linear equations. The goal is
to find the best-fitting line (or hyperplane in higher dimensions) that minimizes the error (difference
between predicted and actual values). This is typically solved using matrix operations.
Singular Value Decomposition (SVD):
SVD decomposes a matrix A into three matrices: $A = U \Sigma V^T$, where:
o U is an orthogonal matrix whose columns are the eigenvectors of $AA^T$,
o Σ is a diagonal matrix of singular values,
o $V^T$ is an orthogonal matrix whose rows are the eigenvectors of $A^T A$.
SVD allows for dimensionality reduction and noise reduction by keeping only the largest singular values,
which correspond to the most important features in the data.
7. Determinants and Inverses
In data science, the determinant and inverse of a matrix have important properties. The determinant
provides information about the matrix, such as whether it is invertible (non-zero determinant means the
matrix is invertible). The inverse of a matrix is used to solve linear systems and is central to methods like
least squares regression.
Example:
Least Squares Regression: In linear regression, the optimal solution is found using the formula
$\beta = (X^T X)^{-1} X^T y$, where X is the matrix of feature vectors, y is the output vector, and β is the
vector of model parameters. This involves calculating the inverse of $X^T X$, which is a
matrix operation.
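A minimal sketch of the normal equation in R, on a small synthetic dataset (the true coefficients 2 and 3 are chosen for illustration):
set.seed(1)
x <- runif(20)
X <- cbind(1, x)                      # design matrix with an intercept column
y <- 2 + 3 * x + rnorm(20, sd = 0.1)  # synthetic response
beta <- solve(t(X) %*% X) %*% t(X) %*% y
beta                                  # close to (2, 3)
coef(lm(y ~ x))                       # cross-check with R's built-in fit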
Data Preprocessing: Scaling, normalization, and transforming data using matrix operations.
Optimization: Minimizing error functions in machine learning algorithms using matrix-based
methods like gradient descent.
Dimensionality Reduction: Techniques like PCA use linear algebra to reduce the number of
features while preserving the variance in the data.
Clustering: Methods like K-means clustering rely on matrix operations to calculate distances
between points and centroids.
The algebraic view of linear algebra for data science involves understanding the mathematical tools and
concepts that are fundamental to many algorithms and models used in data science. These concepts are
vital in areas such as machine learning, data analysis, dimensionality reduction, and optimization. Here’s
a breakdown of key algebraic concepts used in data science:
Vectors:
In data science, vectors represent individual data points in a high-dimensional space, where each
dimension corresponds to a feature or attribute of the data. A vector is an ordered list of numbers, often
representing a data point in feature space. For example, a dataset where each data point is described by
multiple features (e.g., age, income, and education level) can be represented as a vector.
Matrix multiplication is used for transforming data, such as applying a linear model in linear
regression.
Matrix inversion is used in solving systems of linear equations. For example, in linear regression,
matrix inversion is used to find optimal coefficients through the normal equation $\beta = (X^T X)^{-1} X^T y$.
Linear Transformations:
A linear transformation is a function that preserves vector addition and scalar multiplication. Linear transformations are represented as matrix-vector products.
In data science, linear transformations are used to project or transform data, making it easier to model or
analyze. For example, in Principal Component Analysis (PCA), data is projected onto a new subspace
that captures the maximum variance.
If A is a matrix representing a linear transformation and x is a vector, then the transformation of x is $y = Ax$.
PCA uses these eigenvectors to reduce the dimensionality of the data, retaining the most significant
features.
Recommender systems: Used to factorize matrices representing user-item interactions and predict
missing values.
7. Solving Systems of Linear Equations
Many data science problems, especially in regression analysis, involve solving systems of linear
equations. For example, linear regression involves solving the normal equations $X^T X \beta = X^T y$.
8. Norms and Distance Metrics
Distance metrics based on norms (such as Euclidean distance or cosine similarity) are used in clustering,
classification, and similarity-based algorithms.
9. Rank of a Matrix
The rank of a matrix indicates the number of linearly independent rows or columns. Understanding the
rank is useful in data science when dealing with issues like multicollinearity, where features in a dataset
are highly correlated.
A matrix with full rank has no redundant features, which is desirable for machine learning models like
linear regression and support vector machines (SVMs).
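In R, the rank of a matrix can be checked via its QR decomposition (qr() is base R; the matrix below is invented to show rank deficiency):
A <- matrix(c(1, 2, 3,
              2, 4, 6,   # second row is 2 x the first: linearly dependent
              1, 0, 1), nrow = 3, byrow = TRUE)
qr(A)$rank  # 2, not 3: the matrix is rank-deficient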
Vectors are built from components, which are ordinary numbers. We can think of a vector as a list of
numbers, and vector algebra as operations performed on the numbers in the list. In other words, a vector
is a NumPy 1-D array.
In order to create a vector, we use np.array method.
Syntax : np.array(list)
Argument : It takes a 1-D list; it can be 1 row and n columns or n rows and 1 column
Return : It returns vector which is numpy.ndarray
Note: We can create a vector with other methods as well that return a 1-D NumPy array, for example
np.arange(10) or np.zeros(4), but the most appropriate way is using np.array with a 1-D list.
Creating a Vector
In this example we will create a horizontal vector and a vertical vector
Python3
# importing numpy
import numpy as np
list1 = [1, 2, 3]
list2 = [[10],
[20],
[30]]
# creating a vector1
# vector as row
vector1 = np.array(list1)
# creating a vector 2
# vector as column
vector2 = np.array(list2)
print("Horizontal Vector")
print(vector1)
print("----------------")
print("Vertical Vector")
print(vector2)
Output :
Horizontal Vector
[1 2 3]
----------------
Vertical Vector
[[10]
[20]
[30]]
Basic Arithmetic operation:
In this example we will see do arithmetic operations which are element-wise between two vectors of
equal length to result in a new vector with the same length
Python3
# importing numpy
import numpy as np
# creating a 1-D list (Horizontal)
list1 = [5, 6, 9]
list2 = [1, 2, 3]
vector1 = np.array(list1)
# printing vector1
print("First Vector: " + str(vector1))
vector2 = np.array(list2)
# printing vector2
print("Second Vector: " + str(vector2))
# element-wise operations between the two vectors
print("Vector Addition: " + str(vector1 + vector2))
print("Vector Subtraction: " + str(vector1 - vector2))
print("Vector Multiplication: " + str(vector1 * vector2))
print("Vector Division: " + str(vector1 // vector2))  # integer division, matching the output below
Output :
First Vector: [5 6 9]
Second Vector: [1 2 3]
Vector Addition: [ 6 8 12]
Vector Subtraction: [4 4 6]
Vector Multiplication: [ 5 12 27]
Vector Division: [5 3 3]
Vector Dot Product
In mathematics, the dot product or scalar product is an algebraic operation that takes two equal-length
sequences of numbers and returns a single number.
For this we will use dot method.
Python3
# importing numpy
import numpy as np
list1 = [5, 6, 9]
list2 = [1, 2, 3]
vector1 = np.array(list1)
# printing vector1
print("First Vector : " + str(vector1))
vector2 = np.array(list2)
# printing vector2
print("Second Vector : " + str(vector2))
# a . b = (a1 * b1 + a2 * b2 + a3 * b3)
dot_product = vector1.dot(vector2)
print("Dot Product : " + str(dot_product))
Output:
First Vector : [5 6 9]
Second Vector : [1 2 3]
Dot Product : 44
Vector Operations
Vectors are fundamental quantities in physics and mathematics that have both magnitude and direction,
so ordinary arithmetic cannot be applied to them directly. Instead, we have special operations
that work only with vector quantities, hence the name vector operations.
Thus, it is essential to know what kinds of operations can be performed on vector quantities, and vector
operations describe exactly that.
Key Vector Operations:
Addition of Two Vectors
Subtraction of Two Vectors
Multiplication of Vector with Scalar
Product of Two Vectors
o Dot Product
o Cross-Product
Addition of Vectors
Vectors cannot be added by the usual algebraic rules. While adding two vectors, the magnitude and the
direction of the vectors must be taken into account. The triangle law is used to add two vectors "a" and "b":
placing the tail of "b" at the head of "a", the resultant runs from the tail of "a" to the head of "b". Vector addition
follows the commutative property, meaning the resultant is independent of the order of addition: a + b = b + a.
Rank and Null Space in Matrices for Data Science
In data science, understanding the rank and null space of a matrix is crucial for dimensionality
reduction, feature selection, and solving linear systems. Let's break down their significance and
applications.
An overdetermined system is a system of linear equations where the number of equations exceeds the
number of unknowns. In general, such systems have no exact solution unless the equations are consistent
and dependent. However, solutions can still be approximated using various methods.
The least-squares approach solves the normal equations $A^T A x = A^T b$; solving for x gives the best approximate solution.
Solving Least Squares Problems: If $Ax = b$ has no exact solution, the least-squares solution is
given by $x = A^+ b$, where $A^+$ is the Moore-Penrose pseudo-inverse.
Machine Learning & Data Science: Used in regression models and dimensionality reduction.
Control Systems: Used in state-space models and signal processing.
The geometric interpretation of the pseudo-inverse ($A^+$) provides intuition about how it generalizes the
standard inverse for non-square and singular matrices: applied to b, it first projects b onto the column space of A and then maps that projection back to the smallest vector x that achieves it.
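A small sketch in R using ginv() from the MASS package, which computes the Moore-Penrose pseudo-inverse (the overdetermined system here is invented for illustration):
library(MASS)
A <- matrix(c(1, 1,
              1, 2,
              1, 3), nrow = 3, byrow = TRUE)  # 3 equations, 2 unknowns
b <- c(2, 3, 5)
x <- ginv(A) %*% b  # least-squares solution; no exact solution exists
x
# same answer as solving the normal equations directly:
solve(t(A) %*% A, t(A) %*% b)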
Projections
In data science, projections refer to transforming high-dimensional data into a lower-dimensional space
while preserving key information. This technique is widely used in dimensionality reduction,
visualization, and machine learning.
Types of Projections
1. Linear Projections
Linear projections use matrix operations to transform data while maintaining a linear relationship.
Examples include:
Principal Component Analysis (PCA) – Projects data onto the directions of maximum variance.
Linear Discriminant Analysis (LDA) – Projects data to maximize class separability.
Singular Value Decomposition (SVD) – Decomposes a matrix into principal components for
dimensionality reduction.
2. Non-Linear Projections
Non-linear projections capture structure that linear methods miss; examples include t-SNE, UMAP, and kernel PCA.
Applications of Projections
Data Visualization – Reducing dimensions (e.g., PCA, t-SNE) helps visualize high-dimensional datasets.
Feature Selection & Engineering – Projection methods help identify the most important features.
Compression – Dimensionality reduction via projections helps reduce storage and computation costs.
Noise Reduction – By projecting data onto a subspace, noise can be minimized (e.g., PCA for denoising
images).
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
Next, load your dataset and preprocess it as needed. For this example, let's assume we
have a dataset called data with n samples and m features. Make sure to scale the data
appropriately for PCA.
# hypothetical data for illustration: 100 samples, 5 features
from sklearn.preprocessing import StandardScaler
data = np.random.rand(100, 5)
data = StandardScaler().fit_transform(data)
Now, we can apply PCA to reduce the dimensionality of our dataset. Specify the desired
number of components (k) to retain after dimensionality reduction.
# Apply PCA
k = 2 # Number of components to retain
pca = PCA(n_components=k)
transformed_data = pca.fit_transform(data)
We can examine the explained variance ratio to understand how much information is
retained by each principal component. This ratio indicates the proportion of the
dataset's variance explained by each principal component.
print(pca.explained_variance_ratio_)
Finally, interpret the results obtained from PCA. Analyze the explained variance ratio to
determine the importance of each principal component. Additionally, you can perform
downstream analysis such as clustering or classification on the transformed data.
SVD (Singular Value Decomposition) is a method used in linear algebra to decompose a matrix into
three simpler matrices, making it easier to analyze and manipulate.
User     Movie 1   Movie 2
Amit        5         3
Sanket      4         2
Harsh       2         5
SVD breaks this table into three smaller parts: one that shows people’s preferences, one that shows
the importance of each movie, and one that shows how similar the movies are to each other
Mathematically, the SVD of a matrix A (of size m×n) is represented as $A = U \Sigma V^T$.
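A quick sketch of this decomposition in R with the base svd() function, using the ratings table above (movie names are placeholders):
ratings <- matrix(c(5, 3,
                    4, 2,
                    2, 5), nrow = 3, byrow = TRUE,
                  dimnames = list(c("Amit", "Sanket", "Harsh"),
                                  c("Movie1", "Movie2")))
s <- svd(ratings)
s$u  # user-preference factors
s$d  # singular values (importance of each component)
s$v  # movie factors
s$u %*% diag(s$d) %*% t(s$v)  # reconstructs the original ratings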
Fundamental Theory of Eigen Decomposition
Eigen decomposition separates a matrix into its eigenvalues and eigenvectors.
Mathematically, for a square matrix A, if there exists a scalar λ (eigenvalue) and a
non-zero vector v (eigenvector) such that:
Av = λv
Where:
A is the matrix.
λ is the eigenvalue.
v is the eigenvector.
Then, the matrix A can then be represented as:
A = VΛV⁻¹
Where:
V is the matrix of eigenvectors.
Λ is the diagonal matrix of eigenvalues.
V⁻¹ is the inverse of the matrix of eigenvectors.
This decomposition is significant because it transforms matrix operations into
simpler, scalar operations involving eigenvalues, making computations easier.
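A minimal sketch in R with the base eigen() function, verifying the decomposition on a small matrix (the matrix is chosen arbitrarily for illustration):
A <- matrix(c(2, 1,
              1, 2), nrow = 2)
e <- eigen(A)
V <- e$vectors            # matrix of eigenvectors
Lambda <- diag(e$values)  # diagonal matrix of eigenvalues
V %*% Lambda %*% solve(V) # reproduces A, confirming A = V Lambda V^-1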