Module-1 Notes Basics 09.07.25
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and
systems to extract knowledge and insights from structured and unstructured data. In simpler terms,
data science is about obtaining, processing, and analyzing data to gain insights for many purposes.
The data science lifecycle refers to the various stages a data science project generally undergoes,
from initial conception and data collection to communicating results and insights.
Despite every data science project being unique—depending on the problem, the industry it's applied
in, and the data involved—most projects follow a similar lifecycle.
This lifecycle provides a structured approach for handling complex data, drawing accurate
conclusions, and making data-driven decisions.
Here are the five main phases that structure the data science lifecycle:
Data collection and storage
This initial phase involves collecting data from various sources, such as databases, Excel files, text
files, APIs, web scraping, or even real-time data streams. The type and volume of data collected
largely depend on the problem you’re addressing.
Once collected, this data is stored in an appropriate format ready for further processing. Storing the
data securely and efficiently is important to allow quick retrieval and processing.
Data preparation
Often considered the most time-consuming phase, data preparation involves cleaning and
transforming raw data into a suitable format for analysis. This phase includes handling missing or
inconsistent data, removing duplicates, normalization, and data type conversions. The objective is to
create a clean, high-quality dataset that can yield accurate and reliable analytical results.
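As a rough illustration of these steps, here is a minimal pandas sketch, assuming a small hypothetical
dataset (the column names and values are made up), that removes duplicates, converts a data type,
handles missing values, and normalizes a numeric column:

import pandas as pd

# Hypothetical raw dataset; column names and values are illustrative only
df = pd.DataFrame({
    "age": ["25", "32", None, "32", "41"],
    "income": [48000, 54000, 61000, 54000, None],
})

df = df.drop_duplicates()                         # remove duplicate rows
df["age"] = pd.to_numeric(df["age"])              # data type conversion (string -> number)
df["age"] = df["age"].fillna(df["age"].median())  # handle missing values
df["income"] = df["income"].fillna(df["income"].mean())

# Min-max normalization so the income feature lies between 0 and 1
df["income"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())
print(df)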
Exploration and visualization
During this phase, data scientists explore the prepared data to understand its patterns,
characteristics, and potential anomalies. Techniques like statistical analysis and data visualization
summarize the data's main characteristics, often with visual methods.
Visualization tools, such as charts and graphs, make the data more understandable, enabling
stakeholders to comprehend the data trends and patterns better.
Experimentation and prediction
Data scientists use machine learning algorithms and statistical models to identify patterns, make
predictions, or discover insights in this phase. The goal here is to derive something significant from the
data that aligns with the project's objectives, whether predicting future outcomes, classifying data, or
uncovering hidden patterns.
Data storytelling and communication
The final phase involves interpreting and communicating the results derived from the data analysis.
It's not enough to have insights; you must communicate them effectively, using clear, concise
language and compelling visuals. The goal is to convey these findings to non-technical stakeholders in
a way that influences decision-making or drives strategic initiatives.
Understanding and implementing this lifecycle allows for a more systematic and successful approach
to data science projects. Let's now delve into why data science is so important.
Data science has emerged as a revolutionary field that is crucial in generating insights from data and
transforming businesses. It's not an overstatement to say that data science is the backbone of
modern industries. But why has it gained so much significance?
Data volume. Firstly, the rise of digital technologies has led to an explosion of data. Every
online transaction, social media interaction, and digital process generates data. However, this
data is valuable only if we can extract meaningful insights from it. And that's precisely where
data science comes in.
Value-creation. Secondly, data science is not just about analyzing data; it's about interpreting
and using this data to make informed business decisions, predict future trends, understand
customer behavior, and drive operational efficiency. This ability to drive decision-making based
on data is what makes data science so essential to organizations.
Career options. Lastly, the field of data science offers lucrative career opportunities. With the
increasing demand for professionals who can work with data, jobs in data science are among
the highest paying in the industry. As per Glassdoor, the average salary for a data scientist in
the United States is $116,000 base pay, making it a rewarding career choice.
Data science is used for an array of applications, from predicting customer behavior to optimizing
business processes. The scope of data science is vast and encompasses various types of analytics.
Descriptive analytics. Analyzes past data to understand current state and trend identification.
For instance, a retail store might use it to analyze last quarter's sales or identify best-selling
products.
Diagnostic analytics. Explores data to understand why certain events occurred, identifying
patterns and anomalies. If a company's sales fall, it would identify whether poor product
quality, increased competition, or other factors caused it.
Predictive analytics. Uses statistical models to forecast future outcomes based on past data,
used widely in finance, healthcare, and marketing. A credit card company may employ it to
predict customer default risks.
Prescriptive analytics. Suggests actions based on results from other types of analytics to
mitigate future problems or leverage promising trends. For example, a navigation app advising
the fastest route based on current traffic conditions.
The increasing sophistication from descriptive to diagnostic to predictive to prescriptive analytics can
provide companies with valuable insights to guide decision-making and strategic planning.
Data science can add value to any business that uses its data. From statistics to predictions, effective
data-driven practices can put a company on the fast track to success. Here are some ways in which
data science is used:
Data Science can uncover hidden patterns and insights that might not be evident at first glance. These
insights can provide companies with a competitive edge and help them understand their business
better. For instance, a company can use customer data to identify trends and preferences, enabling
them to tailor their products or services accordingly.
Companies can use data science to innovate and create new products or services based on customer
needs and preferences. It also allows businesses to predict market trends and stay ahead of the
competition. For example, streaming services like Netflix use data science to understand viewer
preferences and create personalized recommendations, enhancing user experience.
The implications of data science span across all industries, fundamentally changing how
organizations operate and make decisions. While every industry stands to gain from implementing
data science, it's especially influential in data-rich sectors.
Let's delve deeper into how data science is revolutionizing these key industries:
The finance sector has been quick to harness the power of data science. From fraud detection and
algorithmic trading to portfolio management and risk assessment, data science has made complex
financial operations more efficient and precise. For instance, credit card companies utilize data
science techniques to detect and prevent fraudulent transactions, saving billions of dollars annually.
Healthcare is another industry where data science has a profound impact. Applications range from
predicting disease outbreaks and improving patient care quality to enhancing hospital management
and drug discovery. Predictive models help doctors diagnose diseases early, and treatment plans can
be customized according to the patient's specific needs, leading to improved patient outcomes.
Marketing is a field that has been significantly transformed by the advent of data science. The
applications in this industry are diverse, ranging from customer segmentation and targeted advertising
to sales forecasting and sentiment analysis. Data science allows marketers to understand consumer
behavior in unprecedented detail, enabling them to create more effective campaigns. Predictive
analytics can also help businesses identify potential market trends, giving them a competitive edge.
Personalization algorithms can tailor product recommendations to individual customers, thereby
increasing sales and customer satisfaction.
While data science overlaps with many fields that also work with data, it carries a unique blend of
principles, tools, and techniques designed to extract insightful patterns from data.
Distinguishing between data science and these related fields can give a better understanding of the
landscape and help in setting the right career path. Let's demystify these differences.
Data science and data analytics both serve crucial roles in extracting value from data, but their
focuses differ. Data science is an overarching field that uses methods, including machine learning and
predictive analytics, to draw insights from data. In contrast, data analytics concentrates on
processing and performing statistical analysis on existing datasets to answer specific questions.
While business analytics also deals with data analysis, it is more centered on leveraging data for
strategic business decisions. It is generally less technical and more business-focused than data
science. Data science, though it can inform business strategies, often dives deeper into the technical
aspects, like programming and machine learning.
Data engineering focuses on building and maintaining the infrastructure for data collection, storage,
and processing, ensuring data is clean and accessible. Data science, on the other hand, analyzes this
data, using statistical and machine learning models to extract valuable insights that influence
business decisions. In essence, data engineers create the data 'roads', while data scientists 'drive' on
them to derive meaningful insights. Both roles are vital in a data-driven organization.
Machine learning is a subset of data science, concentrating on creating and implementing algorithms
that let machines learn from and make decisions based on data. Data science, however, is broader
and incorporates many techniques, including machine learning, to extract meaningful information
from data.
Statistics, a mathematical discipline dealing with data collection, analysis, interpretation, and
organization, is a key component of data science. However, data science integrates statistics with
other methods to extract insights from data, making it a more multidisciplinary field.
Having understood these distinctions, we can now delve into the key concepts every data scientist
needs to master.
A successful data scientist doesn't just need technical skills but also an understanding of core
concepts that form the foundation of the field. Here are some key concepts to grasp:
Statistics and probability
These are the bedrock of data science. Statistics is used to derive meaningful insights from data,
while probability allows us to make predictions about future events based on available data.
Understanding distributions, statistical tests, and probability theories is essential for any data
scientist.
Programming
Programming is the tool that allows data scientists to work with data. Languages like Python and R are
particularly popular due to their ease of use and powerful data handling libraries. Familiarity with
these languages allows a data scientist to clean, process, and analyze data effectively.
Data visualization
Data visualization is the art of representing complex data in a visual and easily comprehensible
format. It helps to communicate findings and makes it easier to understand complex data sets. Tools
like Tableau, Matplotlib, and Seaborn are commonly used in this field.
Machine learning
Machine Learning, a subset of artificial intelligence, involves training a model on data to make
predictions or decisions without being explicitly programmed. It is at the heart of many modern data
science applications, from recommendation systems to predictive analytics.
Data engineering
Data engineering is concerned with the design and construction of systems for collecting, storing, and
processing data. It forms the basis on which data analysis and machine learning models are built.
Data scientists need a set of tools to carry out their tasks effectively. These tools can range from
programming languages to software for data visualization. Here are some essential data science
tools.
Programming languages
In the realm of data science, programming languages are the tools of the trade. They provide a
framework for instructing a computer to perform specific tasks, such as data manipulation, statistical
analysis, and machine learning. Here are some key languages that every data scientist should
consider mastering:
Python. Known for its simplicity and powerful libraries like pandas and NumPy.
Julia. Recognized for its high performance and speed, ideal for numerical and scientific
computing.
Business Intelligence (BI) tools are software applications used to analyze an organization's raw data.
They aid in the visualization, reporting, and sharing of data insights, allowing companies to make data-
driven decisions. Here are some essential BI tools for data science:
Machine learning libraries are a collection of pre-written code that data scientists can use to save
time. They provide pre-packaged algorithms and learning routines that can be integrated into
programs. Here are some key libraries that streamline machine learning tasks:
Database Management Systems (DBMS) are software applications that interact with the user, other
applications, and the database to capture and analyze data. A DBMS allows for a systematic way to
create, retrieve, update, and manage data. Here are some popular DBMS used in data science:
Data Science is a vast field with many specialized roles, each carrying its unique responsibilities, skill
requirements, and salary expectations. Here are some of the most sought-after job titles in the realm
of data science:
Data analyst
Data analysts play a crucial role in interpreting an organization's data. They possess expertise in
mathematical and statistical analysis, enabling them to transform complex datasets into actionable
insights that drive business decisions. Employing data visualization tools, they effectively
communicate their findings to both technical and non-technical stakeholders.
Data analysts dive into data, providing reports and visualizations to reveal hidden insights. While not
necessarily involved in developing advanced algorithms, they utilize a range of tools to make sense of
data. Their responsibilities may also encompass SQL queries, data cleaning, and data management.
Key skills:
Data scientist
Data scientists delve into an organization's data to extract and communicate meaningful insights.
They possess a deep understanding of machine learning workflows and how to apply them to real-
world business applications. Data Scientists predominantly work with coding tools, conducting
thorough analysis and frequently engaging with big data tools.
Data scientists are akin to detectives within the data realm. They are responsible for unearthing and
interpreting rich data sources, managing large datasets, and identifying trends by merging data points.
Leveraging analytical, statistical, and programming skills, they collect, analyze, and interpret
extensive datasets. These insights drive the development of data-driven solutions to complex
business problems, often involving the creation of machine learning algorithms to generate new
insights, automate processes, or deliver enhanced value to customers.
Key skills:
Essential tools:
Data engineer
Data engineers are the architects of the data science realm. They design, construct, and manage data
infrastructure, enabling Data Scientists to analyze data efficiently. Data Engineers focus on data
collection, storage, and processing, establishing data pipelines that streamline the analytical
process.
Data engineers often tackle algorithm design for information extraction and create database systems.
They ensure optimal performance by managing data architecture, databases, and processing
systems. This role requires a comprehensive understanding of programming languages and
experience with relational and non-relational databases.
Key Skills:
Tools:
Machine learning engineer
Machine Learning Engineers are the architects of the AI world. They design and implement machine
learning systems that leverage organizational data to make predictions. Their responsibilities also
include addressing challenges like customer churn prediction and lifetime value estimation, and
deploying models for organizational use. Machine Learning Engineers primarily work with coding-
based tools.
Key Skills:
AI research scientist
AI Research Scientists focus on advancing artificial intelligence by creating and improving algorithms
and models. They tackle complex challenges like teaching computers to understand language,
recognize images, or learn from experience. Their work often bridges academic research and practical
applications, driving innovations in AI.
For example, they might develop smarter chatbots, improve self-driving car systems, or design AI
models for healthcare. This role requires both technical expertise and creative problem-solving.
Key Skills:
Essential Tools:
AI Research Scientists play a key role in shaping the future of AI, creating technologies that power
smarter applications, enhance automation, and solve real-world problems.
In summary:

Machine Learning Engineer
Responsibilities: Design and deploy machine learning systems, solve complex problems using ML,
collaborate with teams.
Key skills: Machine learning frameworks, data structures, software architecture, mathematics,
teamwork, problem-solving skills.
Essential tools: Scikit-learn, Python, Java, Scala, TensorFlow, pandas, NumPy, cloud platforms (e.g.,
AWS, Google Cloud Platform), version control systems (e.g., Git).

AI Research Scientist
Responsibilities: Develop advanced AI models, improve algorithms, and solve complex problems like
NLP or computer vision.
Key skills: Machine learning, neural networks, natural language processing, computer vision, Python,
TensorFlow, PyTorch, math skills.
Essential tools: TensorFlow, PyTorch, Jupyter Notebooks, cloud platforms (e.g., AWS, Google Cloud),
GPUs for high-performance computing.
Big Data and Data Science
What is Big Data?
Big Data refers to datasets that are too large, complex, or rapidly changing for traditional data
processing methods.
Challenges
Scalability requirements
Datafication
Definition
Datafication is the process of converting various aspects of life, business, and society into digital data
that can be tracked, analyzed, and utilized.
Examples of Datafication
Impact on Society
Emerging Trends
Analytical Skills
Soft Skills
Linear algebra
Linear algebra is essential for many machine learning algorithms and techniques. It helps in
manipulating and processing data, which is often represented as vectors and matrices. These
mathematical tools make computations faster and reveal patterns within the data.
It simplifies complex tasks like data transformation, dimensionality reduction (e.g., PCA), and
optimization. Key concepts like matrix multiplication, eigenvalues, and linear transformations help in
training models and improving predictions efficiently.
In machine learning, vectors, matrices, and scalars play key roles in handling and processing data.
Vectors are used to represent individual data points, where each number in the vector
corresponds to a specific feature of the dataset (like age, income, or hours).
Matrices act as data storage units used to store large datasets, with rows
representing different data points and columns representing features.
Scalars are single numbers that scale vectors or matrices, often used in algorithms
like gradient descent to adjust the weights or learning rate, helping the model improve over
time.
Together, these mathematical tools enable efficient computation, pattern recognition, and model
training in machine learning.
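As a small illustration of these three objects, the NumPy sketch below (the feature values, the stand-in
gradient, and the learning rate are all illustrative assumptions) represents a data point as a vector, a
dataset as a matrix, and the learning rate as a scalar in a gradient-descent-style update:

import numpy as np

# A vector: one data point with features (age, income, hours) -- illustrative values
x = np.array([25.0, 48000.0, 38.0])

# A matrix: a small dataset, rows = data points, columns = features
X = np.array([
    [25.0, 48000.0, 38.0],
    [32.0, 54000.0, 40.0],
    [41.0, 61000.0, 35.0],
])

# A scalar: the learning rate that scales a gradient-descent update
learning_rate = 0.01
weights = np.zeros(3)
gradient = X.mean(axis=0)              # stand-in gradient, for illustration only
weights = weights - learning_rate * gradient
print(weights)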
Linear Transformations
Linear transformations are basic operations in linear algebra that change vectors and matrices while
keeping important properties like straight lines and proportionality. In machine learning, they are key
for tasks like preparing data, creating features, and training models. This section covers the definition,
types, and uses of linear transformations.
Linear transformations are functions that map vectors from one vector space to another in a linear
manner. Formally, a transformation T is considered linear if it satisfies two properties: additivity,
T(u + v) = T(u) + T(v), and homogeneity, T(cu) = cT(u) for any scalar c.
Common linear transformations in machine learning are operations that help manipulate data in
useful ways, making it easier for models to learn patterns and make predictions. Some common
linear transformations are:
1. Translation: Translation means moving data points around without changing their shape or
size (strictly speaking, translation is an affine rather than a purely linear operation). In machine
learning, this is often used to center data by subtracting the average value from each data point.
2. Scaling: Scaling involves stretching or compressing vectors along each dimension. It is used
in feature scaling to make sure all features are on a similar scale, so one feature doesn’t
dominate the model.
3. Rotation: Rotation involves turning data around a point or axis. It’s not used much in basic
machine learning but can be helpful in fields like computer vision and robotics.
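A minimal NumPy sketch of these three operations, assuming a tiny two-feature dataset chosen purely
for illustration:

import numpy as np

# Small illustrative dataset: rows = points, columns = features
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Translation: center each feature by subtracting its mean
X_centered = X - X.mean(axis=0)

# Scaling: divide each feature by its standard deviation (feature scaling)
X_scaled = X_centered / X_centered.std(axis=0)

# Rotation: rotate the 2-D points by 45 degrees using a rotation matrix
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X_rotated = X_scaled @ R.T
print(X_rotated)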
Matrix Operations
Matrix operations are a key part of linear algebra and are vital for handling and analyzing data in
machine learning. This section covers important operations like multiplication, transpose, inverse,
and determinant, and explains their importance and how they are used.
1. Transpose:
The transpose of a matrix involves flipping its rows and columns, resulting in a new
matrix where the rows become columns and vice versa.
It is denoted by Aᵀ, and its dimensions are the reverse of the original matrix.
2. Inverse:
Not all matrices have inverses; only square matrices with a determinant not equal to
zero are invertible.
Inverse matrices are used in solving systems of linear equations, computing solutions
to optimization problems, and performing transformations.
3. Identity matrix:
The identity matrix I has ones on its diagonal and zeros elsewhere. Multiplying any
matrix by I leaves it unchanged (AI = IA = A), so it acts as the neutral element of
matrix multiplication.
C. Determinants
A determinant is a single number computed from a square matrix. It tells us whether the matrix
can be inverted: if the determinant is zero, the matrix has no inverse; if it is non-zero, the
matrix is invertible.
Significance: The determinant indicates whether a matrix can be inverted and describes how the
transformation it represents scales space.
Properties: The determinant satisfies several properties, including linearity, multiplicativity,
and the property that a matrix is invertible if and only if its determinant is non-zero.
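The NumPy sketch below illustrates these operations on a small example matrix (the matrix itself is
arbitrary): transpose, determinant, inverse, and the identity matrix.

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

A_T = A.T                       # transpose: rows become columns
det_A = np.linalg.det(A)        # determinant: non-zero, so A is invertible
A_inv = np.linalg.inv(A)        # inverse: A @ A_inv gives the identity matrix
I = np.eye(2)                   # 2x2 identity matrix

print(det_A)                        # 5.0
print(np.allclose(A @ A_inv, I))    # True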
Eigenvalues and Eigenvectors
Eigenvalues and eigenvectors are fundamental concepts in linear algebra that play a significant role in
machine learning algorithms and applications. In this section, we explore the definition, significance,
and applications of eigenvalues and eigenvectors.
Eigenvalues of a square matrix A are scalar values that represent how a transformation
represented by A stretches or compresses vectors in certain directions.
Eigenvalues quantify the scale of transformation along the corresponding eigenvectors and are crucial
for understanding the behavior of linear transformations.
Eigenvectors are non-zero vectors that are transformed by a matrix only by a scalar factor,
known as the eigenvalue. They represent the directions in which a linear transformation
represented by a matrix stretches or compresses space.
Eigenvectors corresponding to distinct eigenvalues are linearly independent and, when the matrix has
a full set of them, form a basis for the vector space.
Eigen Decomposition
Eigen decomposition is the process of decomposing a square matrix into its eigenvalues and
eigenvectors.
It is expressed as A = QΛQ⁻¹, where Q is a matrix whose columns are the eigenvectors of A, and Λ is a
diagonal matrix containing the corresponding eigenvalues.
Eigen decomposition provides insights into the structure and behavior of linear transformations,
facilitating various matrix operations and applications in machine learning.
Matrix Factorization: Methods like Singular Value Decomposition (SVD) and Non-negative
Matrix Factorization (NMF) use eigenvalue decomposition to break down large matrices into
smaller, more manageable parts. This helps us extract important features from complex data,
making analysis more efficient.
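As a quick illustration, the NumPy sketch below (the 2×2 matrix is an arbitrary example) computes an
eigen decomposition and verifies that A = QΛQ⁻¹ reconstructs the original matrix:

import numpy as np

A = np.array([[4.0, 2.0],
              [1.0, 3.0]])

# np.linalg.eig returns the eigenvalues and a matrix whose columns are eigenvectors
eigenvalues, Q = np.linalg.eig(A)
Lambda = np.diag(eigenvalues)            # diagonal matrix of eigenvalues

# Reconstruct A from its eigen decomposition: A = Q Λ Q^-1
A_reconstructed = Q @ Lambda @ np.linalg.inv(Q)
print(np.allclose(A, A_reconstructed))   # True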
Applications of Linear Algebra in Machine Learning
Linear algebra serves as the backbone of many machine learning algorithms, providing powerful tools
for data manipulation, model representation, and optimization. In this section, we explore some of
the key applications of linear algebra in machine learning, including principal component analysis
(PCA), singular value decomposition (SVD), linear regression, support vector machines (SVM), and
neural networks.
Principal Component Analysis (PCA) is a dimensionality reduction technique that utilizes linear
algebra to identify the principal components in high-dimensional data. The main steps of PCA involve:
1. Covariance Matrix Calculation: Compute the covariance matrix of the data to understand the
relationships between different features.
2. Eigenvalue Decomposition: Decompose the covariance matrix into its eigenvalues and
eigenvectors to identify the principal components.
3. Projection onto Principal Components: Project the original data onto the principal
components to reduce the dimensionality while preserving the maximum variance.
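A minimal NumPy sketch of these three PCA steps, assuming a toy two-feature dataset and keeping a
single principal component purely for illustration:

import numpy as np

# Toy dataset: rows = samples, columns = features
X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])

# 1. Covariance matrix of the centered data
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# 2. Eigenvalue decomposition of the covariance matrix (eigh: for symmetric matrices)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]        # sort components by variance explained
components = eigenvectors[:, order]

# 3. Project onto the top principal component (dimensionality 2 -> 1)
X_reduced = X_centered @ components[:, :1]
print(X_reduced.shape)    # (5, 1)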
Singular Value Decomposition (SVD) is a matrix factorization technique widely used in machine
learning for dimensionality reduction, data compression, and noise reduction. The key steps of SVD
include:
1. Decomposition: Decompose the original matrix into the product of three matrices: A = UΣVᵀ,
where U and V are orthogonal matrices, and Σ is a diagonal matrix of singular values.
2. Dimensionality Reduction: Retain only the most significant singular values and their
corresponding columns of U and V to reduce the dimensionality of the data.
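A short NumPy sketch of these two SVD steps, using a random matrix and an arbitrary choice of k = 2
retained singular values:

import numpy as np

# Toy matrix: rows = samples, columns = features
A = np.random.default_rng(0).normal(size=(6, 4))

# 1. Decomposition: A = U Σ V^T
U, singular_values, Vt = np.linalg.svd(A, full_matrices=False)

# 2. Dimensionality reduction: keep only the k largest singular values
k = 2
A_approx = U[:, :k] @ np.diag(singular_values[:k]) @ Vt[:k, :]
print(A_approx.shape)     # (6, 4) -- a rank-2 approximation of A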
C. Linear Regression
Linear regression is a supervised learning algorithm used for modeling the relationship between a
dependent variable and one or more independent variables. Linear algebra plays a crucial role in
solving the linear regression problem efficiently through techniques such as:
1. Matrix Formulation: Representing the linear regression problem in matrix form Y = Xβ + ϵ, where Y
is the dependent variable, X is the matrix of independent variables, β is the vector of
coefficients, and ϵ is the error term.
2. Normal Equation: Solving the normal equation XᵀXβ = XᵀY using linear algebra to obtain the
optimal coefficients β.
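The NumPy sketch below illustrates the normal-equation approach on synthetic data (the true
coefficients 3 and -2 are chosen only so the result is easy to check); np.linalg.solve is used instead of
an explicit matrix inverse for numerical stability:

import numpy as np

# Toy regression data: y depends roughly linearly on two features
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Add a column of ones so the model also learns an intercept
X_design = np.column_stack([np.ones(len(X)), X])

# Normal equation: solve (X^T X) beta = X^T y
beta = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)
print(beta)    # approximately [0, 3, -2]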
D. Support Vector Machines (SVM)
Support Vector Machines (SVM) are powerful supervised learning models used for classification and
regression tasks. Linear algebra plays a crucial role in SVMs through:
1. Kernel Trick: The kernel trick uses linear algebra to map data into higher dimensions, allowing
SVMs to handle complex, non-linear problems like classification.
2. Optimization: In SVM, optimization involves finding the best decision boundary. This is done by
formulating it as a mathematical optimization problem and solving it using linear algebra
methods, making the process faster and more efficient.
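As a rough illustration, the sketch below (assuming scikit-learn is installed; the synthetic circular data
is made up) fits an SVM with an RBF kernel, which applies the kernel trick to separate data that is not
linearly separable:

import numpy as np
from sklearn.svm import SVC

# Synthetic, non-linearly separable data: class depends on distance from the origin
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)

# An RBF kernel implicitly maps the data into a higher-dimensional space (the kernel trick)
model = SVC(kernel="rbf", C=1.0)
model.fit(X, y)
print(model.score(X, y))    # training accuracy on the toy data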
E. Neural Networks
Neural networks, especially deep learning models, heavily rely on linear algebra for model
representation, parameter optimization, and forward/backward propagation. Key linear algebraic
operations in neural networks include:
1. Matrix Multiplication: Performing matrix multiplication operations between input features and
weight matrices in different layers of the neural network during the forward pass.
3. Weight Initialization: Initializing network weights using techniques such as Xavier initialization
and He initialization, which rely on linear algebraic properties for proper scaling of weight
matrices.
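A minimal NumPy sketch of a forward pass through one hidden layer, using a He-style weight
initialization; the layer sizes and the batch of inputs are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(3)

# One hidden layer: 4 input features -> 8 hidden units -> 1 output
n_in, n_hidden, n_out = 4, 8, 1

# He-style initialization: scale weights by sqrt(2 / fan_in) (an illustrative choice)
W1 = rng.normal(size=(n_in, n_hidden)) * np.sqrt(2.0 / n_in)
W2 = rng.normal(size=(n_hidden, n_out)) * np.sqrt(2.0 / n_hidden)

# Forward pass on a small batch: each layer is a matrix multiplication plus a nonlinearity
X = rng.normal(size=(5, n_in))      # batch of 5 examples
hidden = np.maximum(0, X @ W1)      # ReLU activation
output = hidden @ W2
print(output.shape)                 # (5, 1)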
Key Terms of Hypothesis Testing
To understand hypothesis testing, we first need to understand the key terms given below:
Significance Level (α): How sure we want to be before saying the claim is false. Usually, we
choose 0.05 (5%).
p-value: The chance of seeing the data if the null hypothesis is true. If this is less than α, we
say the claim is probably false.
Test Statistic: A number that helps us decide if the data supports or rejects the claim.
Critical Value: The cutoff point to compare with the test statistic.
Degrees of freedom: A number that depends on the data size and helps find the critical value.
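To see how these terms fit together, here is a minimal sketch using scipy.stats (the sample values and
the 0.05 significance level are illustrative): a one-sample t-test produces a test statistic and a p-value,
which is then compared with α:

import numpy as np
from scipy import stats

# Claim (null hypothesis): the population mean is 50
sample = np.array([52.1, 48.9, 53.4, 51.2, 49.8, 54.0, 50.6, 52.7])
alpha = 0.05                       # significance level

# One-sample t-test: returns the test statistic and the p-value
t_statistic, p_value = stats.ttest_1samp(sample, popmean=50)

print(t_statistic, p_value)
if p_value < alpha:
    print("Reject the null hypothesis")       # the data is unlikely if the claim were true
else:
    print("Fail to reject the null hypothesis")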
Methods
Model Evaluation
Binomial: Models successes in a fixed number of independent trials with constant probability.
Poisson: Counts events occurring independently in a fixed interval at a constant rate.
Geometric: Measures trials until the first success in independent Bernoulli trials.
Normal (Gaussian): Describes data with a symmetric, bell-shaped distribution around a
mean.
Exponential: Models time between events in a Poisson process with a constant rate.
Uniform: Assigns equal probability to all outcomes within a specified interval.
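The scipy.stats sketch below evaluates each of these distributions at an arbitrary point, with
parameter values chosen purely for illustration:

from scipy import stats

# Binomial: probability of exactly 3 successes in 10 trials with success probability 0.5
print(stats.binom.pmf(3, n=10, p=0.5))

# Poisson: probability of observing 2 events when the average rate is 4 per interval
print(stats.poisson.pmf(2, mu=4))

# Geometric: probability that the first success occurs on trial 3 (success probability 0.2)
print(stats.geom.pmf(3, p=0.2))

# Normal: density at x = 0 for mean 0 and standard deviation 1
print(stats.norm.pdf(0, loc=0, scale=1))

# Exponential: probability the waiting time is under 1 when events occur at rate 2
print(stats.expon.cdf(1, scale=1/2))

# Uniform: density at x = 5 for a uniform distribution on [0, 10]
print(stats.uniform.pdf(5, loc=0, scale=10))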
Introduction to R
Essential R Libraries
Basic Operations
Data Structures
Data Import/Export
Basic Analysis