
B.Sc Computer Science (Major) - V Semester

FOUNDATIONS OF DATA SCIENCE

P.Y. KUMAR
ABHYUDAYA MAHILA DEGREE COLLEGE


UNIT-I

Introduction to Data Science: Need for Data Science, What is Data Science, Evolution of Data Science, Data Science Process, Business Intelligence and Data Science, Prerequisites for a Data Scientist, Tools and Skills required, Applications of Data Science in various fields, Data Security Issues.

Data Collection Strategies, Data Pre-Processing Overview, Data Cleaning, Data Integration and Transformation, Data Reduction, Data Discretization, Data Munging, Filtering.

UNIT-II

Descriptive Statistics: Mean, Standard Deviation, Skewness and Kurtosis; Box Plots - Pivot Table - Heat Map - Correlation Statistics - ANOVA.

No-SQL: Document Databases, Wide-column Databases and Graphical Databases.

UNIT-III

Python for Data Science: Python Libraries, Python Integrated Development Environments (IDEs) for Data Science.

NumPy Basics: Arrays and Vectorized Computation - The NumPy ndarray - Creating ndarrays - Data Types for ndarrays - Arithmetic with NumPy Arrays - Basic Indexing and Slicing - Boolean Indexing - Transposing Arrays and Swapping Axes.

Universal Functions: Fast Element-Wise Array Functions - Mathematical and Statistical Methods - Sorting - Unique and Other Set Logic.

UNIT-IV

Introduction to pandas Data Structures: Series, DataFrame and Essential Functionality: Dropping Entries - Indexing, Selection, and Filtering - Function Application and Mapping - Sorting and Ranking. Summarizing and Computing Descriptive Statistics - Unique Values, Value Counts, and Membership. Reading and Writing Data in Text Format.

UNIT-V

Data Cleaning and Preparation: Handling Missing Data. Data Transformation: Removing Duplicates, Transforming Data Using a Function or Mapping, Replacing Values, Detecting and Filtering Outliers. Plotting with pandas: Line Plots, Bar Plots, Histograms and Density Plots, Scatter or Point Plots.

UNIT-I

INTRODUCTION TO DATA SCIENCE

Q. What is Data Science

Data Science is a field that extracts insights from structured and unstructured data, using different scientific methods and algorithms, and consequently helps in generating insights, making predictions, and devising data-driven solutions. It uses a large amount of data to get meaningful insights using statistics and computation for decision making.

The data used in Data Science is usually collected from different sources, such as e-commerce sites, surveys, social media, and internet searches. All this access to data has become possible due to the advanced technologies for data collection. This data helps in making predictions and providing profits to the businesses accordingly. Data Science is one of the most discussed topics today and is a hot career option due to the great opportunities it has to offer.

It is widely used for a variety of purposes across different industries and sectors; some of the key applications are discussed later in this unit.

Data Science Skills



Statistical Analysis: Understanding statistical methods and techniques is crucial for interpreting data
and making inferences.

Programming: Proficiency in programming languages such as Python or R is vital for data manipulation, analysis, and machine learning.

Data Manipulation and Analysis: Skills in handling, cleaning, and analysing large datasets are
fundamental. This often involves using Python libraries like Pandas.

Machine Learning: Understanding various machine learning algorithms and their applications is a key
part of data science.

Data Visualization: The ability to present data visually using tools like Matplotlib, Seaborn, or
Tableau helps in making data understandable to non-technical stakeholders.

Big Data Technologies: Knowledge of big data platforms like Hadoop or Spark can be important,
especially when dealing with very large datasets.

Data Science Tools:

Common data science programming languages include:

Python: Python is an object-oriented, general-purpose programming language known for having simple syntax and being easy to use. It's often used for executing data analysis, building websites and software, and automating various tasks.

R: R is a programming language that caters to statistical computing and graphics. It's ideal for creating data visualizations and building statistical software.

SQL: Structured Query Language (SQL) is essential for data manipulation and retrieval from relational
databases. It's widely used for data extraction, transformation, and loading (ETL) processes.

Ex: Agricultural Optimization: In Kenya, a small-scale farming community faced challenges in crop yields due to unpredictable weather patterns and soil quality. By implementing data science, they were able to analyse satellite imagery and soil data to make informed decisions about crop rotation, irrigation schedules, and fertiliser application. This led to a significant increase in crop yield and sustainability, showcasing how data science can revolutionise traditional farming methods.

Q. Need for Data Science

Data Science is used in many industries in the world today, e.g. banking, consultancy, healthcare, and
manufacturing.

Examples of where Data Science is needed:

● For route planning: To discover the best routes to ship



● To foresee delays for flight/ship/train etc. (through predictive analysis)

● To create promotional offers

● To find the best suited time to deliver goods

● To forecast the next year's revenue for a company

● To analyze the health benefits of training

● To predict who will win elections

Q. Evolution of Data Science

1) 1962: American mathematician John W. Tukey first articulated the data science dream. In his now-
famous article "The Future of Data Analysis," he foresaw the inevitable emergence of a new field
nearly two decades before the first personal computers. While Tukey was ahead of his time, he was
not alone in his early appreciation of what would come to be known as "data science."

2) 1977: The theories and predictions of "pre" data scientists like Tukey and Naur became more
concrete with the establishment of The International Association for Statistical Computing (IASC),
whose mission was "to link traditional statistical methodology, modern computer technology, and
the knowledge of domain experts in order to convert data into information and knowledge."

3) 1980s and 1990s: Data science began taking more significant strides with the emergence of the
first Knowledge Discovery in Databases (KDD) workshop and the founding of the International
Federation of Classification Societies (IFCS).

4) 1994: Business Week published a story on the new phenomenon of "Database Marketing." It
described the process by which businesses were collecting and leveraging enormous amounts of
data to learn more about their customers, competition, or advertising techniques.

5) 1990s and early 2000s: We can clearly see that data science has emerged as a recognized and
specialized field. Several data science academic journals began to circulate, and data science
proponents like Jeff Wu and William S. Cleveland continued to help develop and expound upon the
necessity and potential of data science.

6) 2000s: Technology made enormous leaps by providing nearly universal access to internet
connectivity, communication, and (of course) data collection.

7) 2005: Big data enters the scene. With tech giants such as Google and Facebook uncovering large
amounts of data, new technologies capable of processing them became necessary. Hadoop rose to
the challenge, and later on Spark and Cassandra made their debuts.

8) 2014: Due to the increasing importance of data, and organizations' interest in finding patterns and making better business decisions, demand for data scientists began to see dramatic growth in different parts of the world.

9) 2015: Machine learning, deep learning, and Artificial Intelligence (AI) officially enter the realm of data science.

10) 2018: New regulations in the field are perhaps one of the biggest aspects in the evolution of data science.

11) 2020s: We are seeing additional breakthroughs in AI and machine learning.



Q. Data Science Process



The data science process gives a clear step-by-step framework to solve problems using data. It maps
out how to go from a business issue to answers and insights using data. Key steps include defining
the problem, collecting data, cleaning data, exploring, building models, testing, and putting solutions
to work.

Process of Data Science


Steps for Data Science Processes

1. Problem Definition
The first step in any data science project involves clearly defining the problem at hand. This includes understanding the business objectives and identifying how data can contribute to solving the problem. The definition should be specific, measurable, and aligned with organizational goals. It sets the foundation for the entire data science process and ensures that efforts are focused on addressing the most relevant challenges.



Understanding the Problem

● Defining the objectives


● Identifying constraints
● Gathering domain knowledge

2. Data Collection

Once you have a clear understanding of the problem, the next step in the data science process is to
collect the relevant data. This can be done through various means such as:

Web Scraping: This method is useful for gathering data from public websites. It requires knowledge of web technologies and legal considerations to avoid violating terms of service.

APIs: Many online platforms, such as social media sites and financial data providers, offer APIs to access their data. Using APIs ensures that you get structured data, often in real-time, which is crucial for time-sensitive analyses.

Databases: Internal databases are gold mines of historical data. SQL is the go-to language for querying relational databases, while NoSQL databases like MongoDB are used for unstructured data.

Surveys and Sensors: Surveys are effective for collecting user opinions and feedback, while sensors are invaluable in IoT applications for gathering real-time data from physical devices.
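As a minimal sketch of API-based collection (the endpoint and field names here are hypothetical, not taken from the text), the requests library can pull structured JSON and pandas can turn it into a table:

```python
import requests
import pandas as pd

# Hypothetical public API endpoint; substitute a real data source you are allowed to use.
url = "https://api.example.com/v1/sales"

response = requests.get(url, params={"limit": 100}, timeout=10)
response.raise_for_status()          # fail fast if the request was not successful

records = response.json()            # assuming the API returns a JSON list of records
df = pd.DataFrame(records)           # structured, analysis-ready table
print(df.head())
```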


3. Data Exploration

EDA is a crucial step in the data science process where you explore the data to uncover patterns,
anomalies, and relationships. This involves:

● Descriptive statistics: Calculating mean, median, standard deviation, etc.



● Visualization: Creating plots such as histograms, scatter plots, and box plots to visualize data
distributions and relationships.
● Correlation analysis: Identifying relationships between different variables.
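A minimal pandas sketch of these exploration steps, using a small made-up dataset (the column names and values are illustrative only):

```python
import pandas as pd

# Small illustrative dataset
df = pd.DataFrame({
    "age": [23, 31, 45, 29, 51, 38],
    "income": [28000, 42000, 61000, 39000, 72000, 50000],
})

print(df.describe())   # descriptive statistics: count, mean, std, quartiles
print(df.median())     # median of each numeric column
print(df.corr())       # correlation matrix between the variables
```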

4. Data Modeling

Model Planning:

● In this stage, you need to determine the method and technique to draw the relationship between the input variables.
● Planning for a model is performed by using different statistical formulas and visualization tools. SQL Analysis Services, R, and SAS/ACCESS are some of the tools used for this purpose.

Model Building:

In this step, the actual model building process starts. Here, the data scientist distributes the dataset into training and testing sets.

Techniques like association, classification, and clustering are applied to the training dataset. The model, once prepared, is tested against the "testing" dataset.
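A minimal scikit-learn sketch of splitting a dataset into training and testing sets and fitting a classifier (the synthetic data is only a stand-in for real project data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset standing in for real project data
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# Distribute the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Apply a classification technique to the training set
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# The prepared model is then tested against the "testing" dataset
print("Test accuracy:", model.score(X_test, y_test))
```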

5. Evaluation:

● After creating models, it's essential to evaluate their performance. This involves assessing
how well the models generalize to new, unseen data.
● Evaluation metrics, such as accuracy, precision, recall, or others depending on the problem,
are used to quantify the model's effectiveness. Rigorous evaluation ensures that the chosen
models provide meaningful and reliable results.
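A small sketch of computing the evaluation metrics mentioned above with scikit-learn, using illustrative true and predicted labels:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Illustrative ground-truth labels and model predictions
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
```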

6. Deployment:

● Once a model has proven effective, it is integrated into operational systems for practical use.
Deployment involves making the model available to end-users or incorporating it into
business processes.
● This step requires collaboration between data scientists and IT professionals to ensure a
seamless transition from development to implementation.
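One common first step toward deployment is persisting the evaluated model so an operational system can load and reuse it. A minimal sketch with joblib (the tiny model and file name are placeholders, not the text's prescribed method):

```python
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder model standing in for the real, evaluated model
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

joblib.dump(model, "model.joblib")        # save the trained model to disk

# Later, inside the deployed service or batch job
restored = joblib.load("model.joblib")
print(restored.predict([[1.5]]))
```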

7. Monitoring and Maintenance:



● Data science projects don't end with deployment, they require ongoing monitoring. Models
can drift in performance over time due to changes in the underlying data distribution or
other factors.
● Regular monitoring helps detect and address these issues, ensuring that models continue to
provide accurate and relevant insights.
● Maintenance activities may include updating models, retraining them with new data, or
adapting to changes in the business environment. This iterative process ensures the
sustainability and longevity of the data science solution.

Q. Business Intelligence and Data Science

Business intelligence (BI):

● Business intelligence (BI) is a process that involves gathering, storing, analysing, and presenting data to help organisations make informed decisions. BI tools are designed to collect and analyze data from various sources and present it in a way that is easy to understand and use.

● Business intelligence is focused on using historical data to identify trends and patterns and providing actionable insights to business users. BI tools typically use a combination of charts, graphs, and reports to visualize data and make it easy to interpret.


● BI tools can be used for a variety of purposes, such as financial reporting, customer analytics, and supply chain management. They can also be used to monitor key performance indicators (KPIs) and track progress towards organisational goals.

Business intelligence process:

1. Data gathering involves collecting information from a variety of sources, either external (e.g. market data providers, industry analytics, etc.) or internal (Google Analytics, CRM, ERP, etc.).

2. Data cleaning/standardization means preparing collected data for analysis by validating data quality, ensuring its consistency, and so on.

3. Data storage refers to loading data into the data warehouse and storing it for further usage.

4. Data analysis is the automated process of turning raw data into valuable, actionable information by applying various quantitative and qualitative analytical techniques.

5. Reporting involves generating dashboards, graphical imagery, or other forms of readable visual representation of analytics results that users can interact with or extract actionable insights from.

Examples of BI in Action

1. Finance: Financial institutions use BI for risk management, fraud detection, and

performance analysis.

2. Manufacturing: BI helps manufacturers optimize production processes, manage supply chains, and ensure product quality.

3. Education: Educational institutions use BI to track student performance, manage resources, and improve educational outcomes.

Data Science:

Data science involves using advanced analytical techniques to extract insights from data. Data scientists use statistical models, machine learning algorithms, and other data mining techniques to identify patterns and relationships in large datasets.

Data science is focused on understanding the underlying data and using it to make predictions about future outcomes. Data scientists use a variety of tools, such as programming languages like Python and R, to manipulate and analyse this data.

Data science can be used for a variety of purposes, such as predictive modelling, fraud detection, and natural language processing. Data scientists are responsible for designing and implementing data-driven solutions that can help organisations improve their business processes and make informed decisions.

Differences between Business Intelligence and Data Science

Aspect: Focus and purpose
Business Intelligence (BI): Analyzing historical data to provide descriptive insights and support immediate decision making.
Data Science: Using historical and current data to make predictions, identify patterns, and drive strategic decisions.

Aspect: Methodologies
Business Intelligence (BI): Relies on structured data, reporting tools and dashboards to summarize past performance.
Data Science: Employs advanced algorithms, data mining, statistical modeling and machine learning for deeper analysis.

Aspect: Tools and Technologies
Business Intelligence (BI): Tools like Power BI, Tableau, QlikView and SQL-based reporting platforms for creating visual insights.
Data Science: Technologies like Python, R, TensorFlow, and Jupyter Notebooks for predictive modeling and analysis.

Aspect: End-User
Business Intelligence (BI): Primarily business teams, managers, and decision-makers needing quick insights.
Data Science: Data scientists, analysts, and technical teams exploring complex problems and trends.

Aspect: Outcomes
Business Intelligence (BI): Provides actionable dashboards, KPIs, and visual reports to enhance operational efficiency.
Data Science: Delivers forecasts, prescriptive analytics and innovative insights for long-term strategy.

Aspect: Data Types
Business Intelligence (BI): Handles structured data stored in databases, spreadsheets, or data warehouses.
Data Science: Works with structured, semi-structured, and unstructured data from diverse sources.

Aspect: Complexity
Business Intelligence (BI): Easier to implement and interpret with minimal technical expertise.
Data Science: Requires technical expertise in programming, statistics and machine learning.

Aspect: Speed
Business Intelligence (BI): Faster deployment for generating insights due to straightforward workflows and tools.
Data Science: Slower due to iterative model training, testing, and optimization.

Aspect: Decision-making
Business Intelligence (BI): Supports operational and tactical decisions by presenting clear, real-time metrics.
Data Science: Drives strategic decisions by predicting trends and solving complex problems.

Aspect: Scalability
Business Intelligence (BI): Designed for small to medium-scale datasets with limited variability.
Data Science: Scales to handle large, complex, and high-dimensional datasets efficiently.

Q. Prerequisites for a Data Scientist

Data scientist:

Role: A Data Scientist is a professional who manages enormous amounts of data to come up with
compelling business visions by using various tools, techniques, methodologies, algorithms, etc.

Languages: R, SAS, Python, SQL, Hive, Matlab, Pig, Spark

Prerequisites for becoming a Data Scientist



To become a data scientist, you should:

1. Build a strong educational foundation: A bachelor's degree in data science, computer science, statistics, or applied math is a solid starting point. Alternatively, online courses and bootcamps can help introduce you to the field.

2. Master core data science skills: Including, but not limited to, programming, statistics, machine learning, skills in data wrangling and visualization, as well as soft skills like communication, collaboration, and critical thinking.



3. Gain practical experience: Use internships, personal projects, or competitions to gather examples that showcase your ability to apply your skills effectively.



4. Keep learning: Stay up-to-date with new advancements, engage with data science communities, and keep pushing yourself to learn and grow.

5. Communication skills: Good communication skills are essential for effectively communicating with stakeholders and other team members.

Some skills you will require to become a data scientist.

Technical skills:

1) Python: Python is central to the growth of AI and data science and arguably the most important skill to learn for data science exploration. The programming language is a favorite for data scientists, web developers, and AI experts thanks to the simplicity of its syntax, the wealth of open-source libraries created for the language that drive efficiencies in building algorithms and apps, and its diverse applications across data analysis and AI.

2) R: The R programming language is most often used for the statistical analysis of large datasets and was the preferred tool in the data science community for many years, as an ideal resource for plotting and building data visualizations. R sees continued strong use in academic circles. However, the fast performance speeds of Python have made it the optimal choice for the convergence of data science and AI applications.

3) Machine Learning

Machine learning is a process used to develop algorithms that perform tasks without explicitly being programmed by the user. Various big companies like Netflix and Instagram use it. They embed algorithms using machine learning and generate excellent features for their customers. Machine learning is a skill set that allows you to build a predictive model and algorithm framework that uncovers patterns and predicts outcomes, improving data-driven business strategies.

4) Statistical Analysis:
Statistical analysis is a fundamental skill for interpreting data and validating findings. This includes various tests, distributions, and regression models to improve your experience as a data scientist. Proficiency in such analysis allows you to make informed, data-driven decisions. You can assess the reliability of your models and derive meaningful conclusions from the data.

5) SQL Skills:
SQL (Structured Query Language) is one of the critical data scientist skills. It is a standard tool in the industry that allows you to manage and communicate with relational databases. Relational databases allow you to store structured data in tables using columns and rows. A significant amount of data is stored in such databases.
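A minimal sketch of querying a relational database from Python with the built-in sqlite3 module (the table and columns are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")          # throwaway in-memory database
cur = conn.cursor()

# Create and populate a small illustrative table
cur.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
cur.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Asha", "Sales", 42000), ("Ravi", "IT", 55000), ("Meena", "IT", 61000)],
)

# A typical aggregate query a data scientist might run
cur.execute("SELECT department, AVG(salary) FROM employees GROUP BY department")
print(cur.fetchall())

conn.close()
```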

Non-Technical Skills

1. Critical thinking: The ability to objectively analyze questions, hypotheses, and results, understand which resources are necessary to solve a problem, and consider different perspectives on a problem.

2. Effective communication: Effective communication is how a data scientist builds collaboration across teams, advocates for data-driven decisions, and makes the work valuable to the business.

3. Proactive problem solving: The ability to identify opportunities, approach problems by identifying existing assumptions and resources, and use the most effective methods to find solutions.

4. Intellectual curiosity: The drive to find answers, dive deeper than surface results and initial assumptions, think creatively, and constantly ask "why" to gain a deeper understanding of the data.

5. Teamwork: The ability to work effectively with others, including cross-functional teams, to achieve
common goals. This includes strong collaboration, communication, and negotiation skills

Q. Tools and Skills required

Data science tools are application software or frameworks that help data science professionals perform various data science tasks like analysis, cleansing, visualization, mining, reporting, and filtering of data. A Data Scientist needs to master the fields of Machine Learning, statistics and probability, analytics, and programming. They need to perform many tasks such as preparing the data, analyzing the data, building models, drawing meaningful conclusions, predicting the results, and optimizing the results.

For performing these various tasks efficiently and quickly, Data Scientists use different Data Science tools.

Top Data Science Tools



a) Tableau:

Tableau is an interactive visualization software packaged with strong graphics. The company focuses on the business intelligence sector. Tableau's most significant element is its capacity to interface with databases, spreadsheets, OLAP cubes, etc. Along with these characteristics, Tableau can also visualize geographic data and plot the longitudes and latitudes of maps. You can also use its analytics tool to evaluate the information together with visualizations. You can share your results on the internet platform with Tableau's active community. While Tableau is commercial software, Tableau Public comes with a free version.

b) MS Excel:

It is the most fundamental & essential tool that everyone should know. For freshers, this tool helps in
easy analysis and understanding of data. MS Excel comes as a part of the MS Office suite. Freshers
and even seasoned professionals can get a basic idea of what the data wants to say before getting
into high-end analytics. It can help in quickly understanding the data, comes with built-in formulae,
and provides various types of data visualization elements like charts and graphs. Through MS Excel,
data science professionals can represent the data simply through rows and columns. Even a non-
technical user can understand this representation.

c) Python & R Programming



Proficiency in programming languages is crucial for data manipulation, analysis, and machine
learning. Important languages include:

● Python: The most popular language for data science, Python's libraries like NumPy, Pandas,
and Scikit-learn make it perfect for data manipulation, analysis, and machine learning.
● R: Specializes in statistical analysis and has extensive libraries for data science, such as ggplot2 for visualization and caret for machine learning.
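A tiny sketch of the kind of vectorized work NumPy and pandas make easy (the values are illustrative):

```python
import numpy as np
import pandas as pd

# Vectorized arithmetic with a NumPy ndarray: no explicit Python loop needed
marks = np.array([68, 74, 91, 55, 82])
scaled = (marks - marks.mean()) / marks.std()   # standardize in one expression
print(scaled)

# The same data as a labelled pandas Series, ready for analysis
s = pd.Series(marks, index=["A", "B", "C", "D", "E"])
print(s.describe())
```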

Data Scientist:

A Data Scientist is an expert who examines data to identify patterns, trends, and insights that aid in
problem-solving and decision-making. They analyze and forecast data using tools like machine
learning, statistics, and programming. Data scientists transform unstructured data into
understandable, useful information that companies can utilize to enhance operations and make
future plans.

Skills Required for Data Scientists

Technical Skills Required for a Data Scientist

1. Programming: One of the most important skills needed for data scientists is proficiency in programming languages like Python and R. It is the very foundation of data science and offers versatile tools like Pandas and TensorFlow for data analysis, manipulation, and interpretation. Machine Learning algorithms are a huge part of the programming process.

2. Mathematics and Statistics: Data scientist skills involve a deep understanding of statistics and mathematical formulation. This includes knowledge of probability, hypothesis testing theories, and linear algebraic equations. Data scientists use a lot of calculations to interpret data accurately. That's the best way to make predictions.

For example, in a hospital, the skills needed for a data scientist, like statistics, can be applied to understanding patient information, trends in illnesses, and the effectiveness of each treatment method, to make diagnosis and treatment more targeted and successful.



3. Machine Learning Algorithms: This is a key data scientist skill. Machine Learning is an integral part of teaching AI and computer applications how to read data, and skilled professionals play a huge part in its accuracy. Algorithms are used by data scientists for creating analytics models. Moreover, data scientist skills go further than ML with deep learning frameworks like PyTorch. This adds significant value to your resume because you can deal with more complex data.

4. Data Visualization and Handling: One of the top data scientist skills involves the process of data wrangling. This is the process where you understand raw data, clean it, and structure it, ready for analysis. Data visualization comes after: you use these data analyses to communicate meaning, trends, and patterns accurately.

5. Database Management Systems: Companies use large datasets for their organizational decisions. SQL, or Structured Query Language, is essential for managing these datasets. Data scientist skills such as expertise in relational and non-relational databases can open doors for you to handle huge amounts of data.

Soft Skills Required for Data Science

6. Problem-Solving: One must have the capability to identify and develop both creative and effective solutions as and when required. Problem-solving is a critical skill in data science. Data scientists must:

● Break Down Complex Problems: Define the problem, find patterns in the data, and devise data-driven solutions.
● Creativity: Think outside the box to address unique challenges, often involving new approaches to modeling or data wrangling.

7. Communication Skills: Strong communication skills are necessary for conveying findings and
insights effectively. Key areas include:

● Translate Technical Insights: Communicate complex analyses in a way that non-technical stakeholders understand.
● Storytelling with Data: Convincing stakeholders to make data-driven decisions by creating compelling narratives.

Analytical Skills Required for Data Science



1. Exploratory Data Analysis (EDA)



EDA is an integral part of the data analysis process that focuses on summarizing the characteristics of
a dataset. Key areas include:

● Outlier Detection: Finding unusual data points that can impact model accuracy.
● Statistical Analysis: Includes measures such as mean, median, mode, standard deviation, and
correlation.
2. Data Visualization:

Data visualization helps communicate insights clearly. Tools like Matplotlib, Seaborn, and Tableau are important for:

● Creating Visual Representations: Bar charts, histograms, line plots, and heatmaps are
examples that help to interpret data.

● Storytelling with Data: Visuals allow non-technical stakeholders to understand the insights
generated by data scientists.
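A minimal Matplotlib sketch of the kinds of charts mentioned above (the numbers are illustrative only):

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 150, 160, 190, 210]          # illustrative monthly sales

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

ax1.bar(months, sales)                           # bar chart for comparisons
ax1.set_title("Monthly sales")

ax2.hist(sales, bins=4)                          # histogram for the distribution
ax2.set_title("Sales distribution")

plt.tight_layout()
plt.show()
```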
3. Data Wrangling and Preprocessing

This refers to the transformation and mapping of raw data into a more usable format. Raw data
needs to be cleaned and preprocessed before analysis. Key techniques include:

● Handling Missing Data: Removing, filling, or imputing missing values.


● Feature Engineering: Creating new variables that make the data more informative for model
building.
● Normalization & Scaling: Ensuring that data variables are in the same range for machine
learning models to process effectively.

Q. Applications of Data Science in various fields

Applications of Data Science in:


Finance:

● Fraud Detection and Prevention: Machine learning models can detect unusual
patterns and anomalies in financial transactions, helping prevent fraud.
● Credit Risk Assessment: By analyzing historical data, machine learning models can
assess credit risk more accurately than traditional methods.
● Investment Recommendations: Data science can recommend investment portfolios
based on customer risk profiles and predicted market trends.

Healthcare:

● Medical Diagnosis: Machine learning can assist in diagnosing medical conditions


based on X-rays, CT scans and historical patient data.
● Personalized Medicine: By analyzing historical data and genomics, machine learning
can recommend personalized treatment plans.

Retail:

● Inventory Management: Data science can forecast demand for various products,
optimizing inventory levels.
● Product Recommendations: Recommending similar products based on customer
search queries and interests can increase sales.
● Customer Reviews Analysis: Analyzing customer reviews helps identify common issues with products, informing improvements.
● Chatbots: AI-powered chatbots can handle customer queries efficiently, improving customer service.

Automotive:

● Predictive Maintenance: Data science models can predict part failures, reducing downtime and maintenance costs.
● Autonomous Vehicles: Machine learning enables the development of self-driving

cars that make real-time decisions like human drivers.

Manufacturing:
● Predictive Maintenance: Similar to automotive, predictive maintenance models in manufacturing can reduce production line downtime.
● Supply Chain Optimization: Data science optimizes supply chain processes and inventory management, enhancing efficiency.



Education:

● Personalized Learning: Data science can create personalized learning experiences by analyzing students' progress and adapting content accordingly.


● Performance Prediction: Analyzing student data helps predict interest and performance across different subjects, enabling targeted interventions.



Social Media:

● Content Recommendation: AI algorithms show relevant content to users, increasing engagement.
● Fraud Detection: Data science helps identify and remove fraudulent accounts.
● Targeted Marketing: Machine learning enables targeted marketing campaigns on social media platforms.

E-commerce:

● Recommendation Systems: Suggesting products or services to customers based on their past purchases, browsing history, and preferences.
● Price Optimization: Dynamically adjusting product prices based on demand, competition, and other factors to maximize revenue.

● Supply Chain Management: Optimizing inventory levels, transportation routes, and logistics to improve efficiency and reduce costs.
● Customer Churn Prediction: Identifying customers at risk of leaving and taking proactive steps to retain them.

Government:

● Public Safety: Using data analytics to predict crime hotspots, optimize emergency response, and improve public safety.
● Urban Planning: Analyzing population data, traffic patterns, and resource usage to inform urban development and infrastructure planning.
● Policy Analysis: Evaluating the effectiveness of public policies using data-driven insights.
● Resource Allocation: Optimizing the allocation of public resources based on need and demand.

Education:

● Personalized Learning: Tailoring educational content and approaches to individual students' needs and learning styles based on data analysis.
● Student Success Prediction: Identifying students at risk of academic difficulties or dropout using predictive models.

Agriculture:

● Precision Agriculture: Optimizing crop yields and resource usage through data-driven techniques like remote sensing, soil analysis, and weather forecasting.
● Yield Prediction: Predicting crop yields based on various factors such as weather conditions, soil quality, and pest infestations.
● Disease Detection: Early detection of plant diseases and pests using image analysis and sensor data.
● Supply Chain Optimization: Improving the efficiency of agricultural supply chains through demand forecasting and inventory management.

Q. Data Security Issues

Data security is the process of protecting digital information from unauthorized access, corruption,
and theft throughout its lifecycle. It encompasses the entire spectrum of information security,
including physical protection for hardware and secure storage devices, administrative controls to
manage access, and logical security measures for software applications

✓ Policies and procedures strengthen these defenses, ensuring data stays secure
across all environments.

✓ By implementing encryption, immutable backups, antivirus, or real-time monitoring, organizations can prevent different types of breaches and stay compliant with regulations.

✓ Its measures include encryption, firewalls, access controls, intrusion detection systems, and data backup.

✓ The main goal of data security is to prevent data breaches and unauthorized access and to protect data from external threats.

Types of Data Security

1. Encryption: Encryption is a process that converts data into an unreadable format using algorithms. This protects data both:

● When it is stored (data at rest)
● When it is being transmitted (data in transit)

It prevents unauthorized access even if the data is intercepted.
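A minimal sketch of symmetric encryption in Python, assuming the third-party cryptography package is available (the key handling is simplified for illustration and is not the text's prescribed setup):

```python
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # in practice, keep the key in a secure key store
cipher = Fernet(key)

plaintext = b"patient_id=1024, diagnosis=confidential"
token = cipher.encrypt(plaintext)    # unreadable without the key (at rest or in transit)
print(token)

print(cipher.decrypt(token))         # only a key holder can recover the original data
```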



2. Data Erasure

✓ Data erasure is a method of permanently deleting data from storage devices. It ensures that it cannot be recovered or reconstructed.

✓ This is particularly important when organizations dispose of old hardware or when sensitive data is no longer needed. Unlike basic file deletion, data erasure uses software to overwrite data, making it unrecoverable.

3. Data Masking:

✓ Data masking involves altering data so that it is no longer sensitive, making it accessible to employees or systems without exposing confidential information.

✓ For example, in a customer database, personal information like names or credit card
numbers may be masked with random characters. It allows access for testing or
analytics purposes without revealing the actual data.
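A simple illustrative sketch of masking a credit card number before sharing a dataset (the masking rule is just an example, not a standard prescribed by the text):

```python
def mask_card(number: str) -> str:
    """Replace all but the last four digits with '*' so the value is no longer sensitive."""
    return "*" * (len(number) - 4) + number[-4:]

customers = [
    {"name": "A. Kumar", "card": "4111111111111111"},
    {"name": "B. Rao",   "card": "5500005555555559"},
]

masked = [{**c, "card": mask_card(c["card"])} for c in customers]
print(masked)   # safe to hand to testers or analysts
```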

4. Data Resiliency

Data resiliency refers to the ability of an organization's data systems to recover quickly from
disruptions, such as:

● Cyberattacks
● Hardware failures
● Natural disasters

Data Security Risks:

a) Accidental data exposure:

✓ Many data breaches are not a result of hacking but through employees accidentally
or negligently exposing sensitive information.

✓ Employees can easily lose, share, or grant access to data with the wrong person, or
mishandle or lose information because they are not aware of their company's security
policies.

b) Phishing attacks:

✓ In a phishing attack, a cyber criminal sends messages, typically via email, short message service (SMS), or instant messaging services, that appear to be from a trusted sender.

✓ Messages include malicious links or attachments that lead recipients to either download malware or visit a spoofed website that enables the attacker to steal their login credentials or financial information.

✓ These attacks can also help an attacker compromise user devices or gain access to corporate networks. Phishing attacks are often paired with social engineering, which hackers use to manipulate victims into giving up sensitive information or login credentials to privileged accounts.

c) Insider threats

One of the biggest data security threats to any organization is its own employees. Insider threats are individuals who intentionally or inadvertently put their own organization's data at risk. They come in three types:

1. Compromised insider: The employee does not realize their account or credentials have been
compromised. An attacker can perform malicious activity posing as the user.
2. Malicious insider: The employee actively attempts to steal data from their organization or
cause harm for their own personal gain.

3. Nonmalicious insider: The employee causes harm accidentally, through negligent behavior,
by not following security policies or procedures, or being unaware of them.

d) Malware:

Malicious software is typically spread through email- and web-based attacks. Attackers use malware to infect computers and corporate networks by exploiting vulnerabilities in software such as web browsers or web applications. Malware can lead to serious data security events like data theft, extortion, and network damage.

Q. Data Collection Strategies

A data collection strategy is the collection of methods that will be utilized to get accurate and reliable data from different data sources.


Data Collection Techniques and Methods



1. Feedback Forms and Surveys



✓ Feedback forms and surveys are systematic tools for collecting participants' responses in a quantifiable format.

✓ You can distribute forms via email or host them online using Google Forms or SurveyMonkey and gather data in real-time. This technique is beneficial for gauging customer satisfaction and assessing employee engagement.

2. Interviews and Focus Groups

✓ Interviews and focus groups involve in-depth discussions, and data is collected by
observing and understanding people's thoughts and behaviors.

✓ While this method provides the flexibility to modify questions based on the
participant's answers, it can be time-consuming

✓ Additionally, the interviewer's questioning style, body language, and perception can
influence the responses.

3. Web Scraping

✓ Web scraping is a data collection technique where data is automatically extracted


from dynamic web pages

✓ You can use tools like Scrapy or BeautifulSoup to collect real-time information on competitors and market trends. However, you should be mindful of terms-of-service violations and data privacy laws.
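A minimal BeautifulSoup sketch of scraping headings from a page (the URL and tag choice are hypothetical; always check the site's terms of service and robots.txt first):

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical page; replace with a site you are permitted to scrape
url = "https://example.com/market-news"

html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Collect the text of every <h2> heading as a crude list of article titles
titles = [h.get_text(strip=True) for h in soup.find_all("h2")]
print(titles)
```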

4. Log Files
✓ Log files are detailed records generated by servers, applications, or devices. They capture a timeline of events, transactions, and interactions within a system.

✓ You can use Splunk or ELK Stack to analyze and visualize log files and gain insights into web traffic, system performance, or security incidents.



5. API Integration

✓ APIs (application programming interfaces) automate data collection by connecting two systems or software and allowing them to exchange information.

✓ It is a scalable method and provides a consolidated data view. Extracting datasets from social media platforms or cloud services is an example of this technique.

6. Transactional Tracking

✓ Transactional tracking involves collecting your customers' purchase data. You can
capture information on purchased product combinations, delivery locations, and more by
monitoring transactions made through websites, third-party services, or in-store point-
of-sale systems.
✓ Analyzing this data lets you optimize your marketing strategies and target ideal customer segments.

7. Document Review

✓ Document reviewing is a data collection technique in which relevant data is extracted


by examining existing records, reports, and contracts.

✓ It is often used in legal, academic, or historical research to analyze trends over time.
You can use advanced document scanning and optical character recognition (OCR) tools
to digitize and store required information.

8. Mobile Data Collection

✓ In this technique, mobile devices are used to collect real-time data directly from the user through apps, surveys, and GPS tracking.

✓ The widespread use of mobiles and tablets makes it ideal for on-the-go data gathering.

9. Social Media Monitoring



✓ Many social media platforms have data analytics features that help you track your target audience's demographic information, engagement metrics, and more.

✓ Tools like Hootsuite or Brandwatch can provide information about customer


sentiment and emerging trends.

10. Data Warehousing

✓ Data warehousing allows you to collect large volumes of data and store them in a
centralized repository.

✓ You can use cloud-based solutions like Snowflake or Amazon Redshift for this. The
data collection method is highly scalable and allows you to consolidate data for better
insights.

Q. Data Pre-Processing Overview

Real-world datasets are generally messy, raw, incomplete, inconsistent, and unusable. They can contain manual entry errors, missing values, inconsistent schema, etc. Data Preprocessing is the process of converting raw data into a format that is understandable and usable. It is a crucial step in any Data Science project to carry out an efficient and accurate analysis. It ensures that data quality is consistent before applying any Machine Learning or Data Mining techniques.


Steps in Data preprocessing:



a) Data Collection and Import:

The first step in any data preprocessing pipeline is gathering the necessary data. This may involve querying databases, accessing APIs, scraping websites, or importing data from various file formats like CSV, JSON, or Excel. It's important to ensure that you have the right permissions and comply with relevant data protection regulations.

b) Data Exploration and Profiling:

Before getting into cleaning and transformation, it's essential to understand your data. This involves
examining the structure of your dataset, checking data types, looking for patterns, and identifying
potential issues.

c) Data Cleaning

This step involves handling missing data, removing duplicates, correcting errors, and dealing with
outliers.

● Handling missing data: You might choose to drop rows with missing values, fill them with a
specific value (like the mean or median), or use more advanced imputation techniques.
● Removing duplicates: Duplicate records can skew your analysis and should be removed.
● Correcting errors: This might involve fixing typos, standardizing formats (e.g., date formats), or correcting impossible values.
● Dealing with outliers: Outliers can be legitimate extreme values or errors. You need to
investigate them and decide whether to keep, modify, or remove them.

d) Data Transformation

This step involves modifying the data to make it more suitable for analysis or modeling. Common transformations include:

● Normalization or standardization: Scaling numerical features to a common range.
● Encoding categorical variables: Converting categorical data into numerical format.
● Feature engineering: Creating new features from existing ones.
● Handling skewed data: Applying transformations like log or square root to make the distribution more normal.
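A small pandas/scikit-learn sketch of two of these transformations, scaling a numeric column and encoding a categorical one (the column names and values are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "salary": [30000, 45000, 60000, 52000],
    "city": ["Guntur", "Vijayawada", "Guntur", "Hyderabad"],
})

# Normalization/standardization: scale the numeric feature to zero mean, unit variance
df["salary_scaled"] = StandardScaler().fit_transform(df[["salary"]]).ravel()

# Encoding a categorical variable as numeric indicator (dummy) columns
df = pd.get_dummies(df, columns=["city"])
print(df)
```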

e) Data Reduction
For large datasets, it might be necessary to reduce the volume of data while preserving as much information as possible. This can involve:

● Feature selection: Choosing the most relevant features for your analysis.
● Dimensionality reduction: Using techniques like Principal Component Analysis (PCA) to reduce the number of features.
● Sampling: Working with a representative subset of your data.
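A minimal sketch of dimensionality reduction with PCA in scikit-learn (the random data is only a stand-in for a real high-dimensional dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))        # 100 records with 10 features

pca = PCA(n_components=3)             # keep only 3 principal components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                        # (100, 3): fewer columns, same rows
print(pca.explained_variance_ratio_.sum())    # how much variance was preserved
```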

f) Data Validation

The final step is to validate your preprocessed data to ensure it meets the requirements for your analysis or modeling task. This might involve:

● Checking data types


● Verifying value ranges
● Ensuring all necessary features are present
● Checking for any remaining missing values or inconsistencies

These steps form the core of the data preprocessing pipeline. However, the specific techniques and
their order may vary depending on the nature of your data and the requirements of your data
science project.

g) Applications of Data Preprocessing

Data Preprocessing is important in the early stages of a Machine Learning and AI application development lifecycle. A few of the most common usages or applications include:

● Improved Accuracy of ML Models - Various techniques used to preprocess data, such as Data Cleaning and Transformation, ensure that data is complete, accurate, and understandable, resulting in efficient and accurate ML models.
● Reduced Costs - Data Reduction techniques can help companies save storage and compute costs by reducing the volume of the data.
● Visualization - Preprocessed data is easily consumable and understandable and can be further used to build dashboards to gain valuable insights.

Q. Data Cleaning

Data cleaning is the process of detecting and correcting errors or inconsistencies in your data to improve its quality and reliability. Raw data, which is data in its unprocessed form, is often riddled
with issues that can negatively impact the results of analysis. These issues can include:

● Missing values: When data points are absent from a dataset.


● Inconsistent formatting: Inconsistency in how data is presented, like dates written in
different formats (e.g., MM/DD/YYYY, YYYY-MM-DD).



● Duplicates: When the same data point appears multiple times in a dataset
● Errors: This can include typos, spelling mistakes, or even data entry errors.

Data cleaning helps ensure that the data you're analyzing is accurate and reliable, which is crucial for getting meaningful insights from your data.
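A minimal pandas sketch of two routine cleaning operations, dropping duplicates and filling missing values (the tiny table is illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Ravi", "Meena"],
    "age": [23, np.nan, np.nan, 31],
})

df = df.drop_duplicates()                          # remove repeated records
df["age"] = df["age"].fillna(df["age"].median())   # fill missing ages with the median
print(df)
```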



The Steps Involved in the Data Cleaning Process



There are several steps involved in the data cleaning process, each of which addresses a different kind of discrepancy in the dataset. To achieve high-quality data, you can perform the following data-cleaning steps:

1. Define Data Cleaning Objectives

✓ Before beginning the data cleaning process, it is crucial to assess the raw data and identify your requirements or desired output from the dataset.

✓ This helps you focus on the specific parts of the data, thus saving your time and
resources.

2. Eliminate Duplicate or Irrelevant Values

✓ You can generally observe repetitive data while extracting it from multiple sources into a centralized repository.

✓ Such values take up unnecessary space in your dataset and often result in flawed analysis. Using data cleaning tools or techniques, you can easily locate duplicate or irrelevant values and remove them to achieve a more optimized dataset.



3. Correct Structural Flaws



✓ Structural errors include misspellings, incorrect word usage, improper naming conventions, capitalization mistakes, and many others.

✓ They mainly occur while migrating or transferring data from one place to another. So, applying a quick data check in such a scenario ensures the credibility of your dataset.

4. Remove Data Outliers

✓ Outliers are unusual values in a dataset that differ greatly from the existing values in the dataset. Although the presence of such values can be fruitful for research purposes, they can also impact your data analysis process.

✓ Therefore, it is always advisable to employ data-cleaning methods to remove any inconsistent values, thus maintaining data accuracy.
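A small sketch of flagging outliers with the interquartile-range (IQR) rule in pandas (the 1.5×IQR cut-off is the conventional choice, not something prescribed by the text):

```python
import pandas as pd

values = pd.Series([12, 14, 15, 13, 16, 14, 95])   # 95 is an obvious outlier

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
cleaned = values[(values >= lower) & (values <= upper)]

print("Outliers:", outliers.tolist())
print("Cleaned :", cleaned.tolist())
```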

5. Restore Any Missing Data

✓ Data values can be lost or removed during the extraction process, leading to
inefficient data analytics.

✓ Therefore, before using your data for business operations, you must scan the dataset
thoroughly and look for any missing values, blank spaces, or empty cells in it.

6. Ensure Data Validity

✓ Once the above steps are completed, you must perform a final quality check on your
dataset to ensure its authenticity, consistency, and structure.

✓ To facilitate this process quickly, you can also leverage AI or machine learning capabilities to verify data. This helps your organization work with reliable data and use it for seamless analysis and visualization.

Some Important data-cleaning techniques:
✓ Remove duplicates
✓ Detect and remove outliers
✓ Remove irrelevant data
✓ Standardize capitalization
✓ Convert data types
✓ Clear formatting
✓ Fix errors
✓ Language translation
✓ Handle missing values

The importance of data cleaning

1. Enables reliable analytics: Dodgy data can distort analytics. Cleaning ensures quality data that
leads to accurate insights.

2. Improves decision making: You can make better-informed strategic and operational decisions with clean, trustworthy data.

3. Increases efficiency: You waste less time gathering data that could be faulty and useless. Cleaning
weeds these issues out.

4. Saves money: Bad data can lead to costly errors, while good data cleaning practices can help you
save money in the long run.

5. Builds trust: Dependable data that tells the truth about your business performance helps build
stakeholder confidence.

6. Supports automation: Artificial intelligence (AI) and machine learning (ML) driven automation
need clean data. Otherwise, they may amplify existing data problems.

7. Ensures compliance: In regulated industries, meticulous data quality controls help you support
compliance.

g
or
Q. Data Integration and Transformation

Data integration:
Data integration is the process of combining data from different sources into a single, unified view.
This process includes activities such as data cleaning, transformation, and loading (ETL), as well as
real-time streaming and batch processing.

✓ By aligning data across systems, organisations can eliminate data silos and uncover
hidden relationships, leading to enhanced analytics and more accurate predictive
models.

✓ Without data integration, data would remain fragmented and inconsistent, making it difficult to achieve a comprehensive, reliable view needed for discovering insights and informed decision-making.

Data integration is particularly important in the healthcare industry. Integrated data from various
patient records and clinics assist clinicians in identifying medical disorders and diseases by
integrating data from many systems into a single perspective of beneficial information from which
useful insights can be derived. Effective data collection and integration also improve medical
insurance claims processing accuracy and ensure that patient names and contact information are
recorded consistently and accurately. Interoperability refers to the sharing of information across
different systems.

Data Integration Approaches

There are mainly two kinds of approaches to data integration in data mining, as mentioned below -

Tight Coupling
✓ This approach involves the creation of a centralized database that integrates data from different sources. The data is loaded into the centralized database using extract, transform, and load (ETL) processes.

✓ In this approach, the integration is tightly coupled, meaning that the data is physically stored in a central database, and any updates or changes made to the data sources are immediately reflected in the central database.

✓ Tight coupling is suitable for situations where real-time access to the data is required, and consistency is critical. However, this approach can be costly and complex, especially when dealing with large volumes of data.

Loose Coupling

✓ This approach involves the integration of data from different sources without physically storing it in a centralized database.

✓ In this approach, data is accessed from the source systems as needed and combined in real-time to provide a unified view. This approach uses middleware, such as application programming interfaces (APIs) and web services, to connect the source systems and access the data.

✓ Loose coupling is suitable for situations where real-time access to the data is not critical, and the data sources are highly distributed. This approach is more cost-effective and flexible than tight coupling but can be more complex to set up and maintain.

Integration tools

There are various integration tools in data mining. Some of them are as follows:

● On-premise data integration tool


An on-premise data integration tool integrates data from local sources and connects legacy
databases using middleware software.

● Open-source data integration tool

If you want to avoid pricey enterprise solutions, an open-source data integration tool is the ideal alternative. However, you will be responsible for the security and privacy of the data if you're using the tool.
● Cloud-based data integration tool
A cloud-based data integration tool may provide an 'integration platform as a service'.
p
uu

Data Transformation:
n
.a

a) Data transformation refers to the process of converting, cleaning, and manipulating raw data into a structured format that is suitable for analysis or other data processing tasks.

b) Raw data can be challenging to work with and difficult to filter. Often, the problem isn't how to collect more data, but which data to store and analyze.

c) To curate appropriate, meaningful data and make it usable across multiple systems, businesses must leverage data transformation.

How Data Transformation Works:

The transformation process generally follows 6 stages:

1. Data Discovery: During the first stage, data teams work to understand and identify applicable raw
data. By profiling data, analysts/engineers can better understand the transformations that need to
occur.

2. Data Mapping: During this phase, analysts determine how individual fields are modified, matched,
filtered, joined, and aggregated

3. Data Extraction: During this phase, data is moved from a source system to a target system
Extraction may include structured data (databases) or unstructured data (event streaming, log files)
sources.

4. Code Generation and Execution: Once extracted and loaded, transformation needs to occur on the raw data to store it in a format appropriate for BI and analytic use. This is frequently accomplished by analytics engineers, who write SQL/Python to programmatically transform data. This code is executed daily/hourly to provide timely and appropriate analytic data.
da

5. Review: Once implemented, code needs to be reviewed and checked to ensure a correct and
p

appropriate implementation.
n uu

6. Sending: The final step involves sending data to its target destination. The target might be a data
.a

warehouse or other database in a structured format
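As a minimal sketch of the transformation stage described in step 4, the following pandas snippet cleans a small made-up orders table and writes it to a structured file; the column names (order_id, amount, order_date) and the output file name are hypothetical, not part of any specific tool:

import pandas as pd

# Raw, extracted data (in practice this would come from a source system)
raw = pd.DataFrame({
    "order_id": [101, 102, 102, 103],
    "amount": ["250", "125", "125", None],
    "order_date": ["2024-01-05", "2024-01-06", "2024-01-06", "2024-01-07"],
})

# Transform: remove duplicates, fix data types, drop incomplete rows
clean = (raw.drop_duplicates()
            .assign(amount=lambda d: pd.to_numeric(d["amount"]),
                    order_date=lambda d: pd.to_datetime(d["order_date"]))
            .dropna(subset=["amount"]))

# Load: write the structured result to its target (a CSV file here)
clean.to_csv("orders_clean.csv", index=False)
print(clean.dtypes)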



Data Transformation Techniques



● Smoothing: This is the data transformation process of removing distorted or meaningless data from the dataset. It also detects minor modifications to the data to identify specific patterns or trends.
● Aggregation: Data aggregation collects raw data from multiple sources and stores it in a single format for accurate analysis and reports. This technique is necessary when your business collects high volumes of data.

● Discretization: This data transformation technique creates interval labels in continuous data
to improve efficiency and easier analysis. The process utilizes decision tree algorithms to
transform a large dataset into compact categorical data.
● Generalization: Utilizing concept hierarchies, generalization converts low-level attributes to high-level ones, creating a clear data snapshot.
● Attribute Construction: This technique allows a dataset to be organized by creating new attributes from an existing set.
● Normalization: Normalization transforms the data so that the attributes stay within a specified range for more efficient extraction and data mining applications.
● Manipulation: Manipulation is the process of changing or altering data to make it more readable and organized. Data manipulation tools help identify patterns in the data and transform it into a usable form to generate insight.
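A small illustration of two of the techniques above, normalization and aggregation, using pandas; the region/sales figures are invented for the example:

import pandas as pd

df = pd.DataFrame({"region": ["East", "East", "West", "West"],
                   "sales": [200, 400, 100, 300]})

# Normalization: rescale 'sales' to the 0-1 range (min-max normalization)
df["sales_norm"] = (df["sales"] - df["sales"].min()) / (df["sales"].max() - df["sales"].min())

# Aggregation: total sales per region
totals = df.groupby("region")["sales"].sum()
print(df)
print(totals)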



Q. Data Reduction

When you collect data from different data warehouses for analysis, it results in a huge amount of data. It is difficult for a data analyst to deal with this large volume of data. It is even difficult to run complex queries on the huge amount of data as it takes a long time. This is why reducing data becomes important.

Data reduction techniques ensure the integrity of data while reducing the data. Data reduction is a process that reduces the volume of original data and represents it in a much smaller volume.

● Data reduction techniques are used to obtain a reduced representation of the dataset that is smaller in volume while maintaining the integrity of the original data.
● By reducing the data, the efficiency of the data mining process is improved, while producing the same analytical results.
● Data reduction does not affect the result obtained from data mining; that means the result obtained from data mining before and after data reduction is the same or almost the same.
● Data reduction aims to represent the data more compactly. When the data size is smaller, it is simpler to apply sophisticated and computationally expensive algorithms.

● The reduction of the data may be in terms of the number of rows (records) or in terms of the number of columns (dimensions).
● The only difference occurs in the efficiency of data mining. Data reduction increases the efficiency of data mining.

Types of data reduction are:

● Deduplication: Removing duplicate data. This can range from simply removing duplicated records to deleting records that, while not strictly identical, represent the same information or event.
● Compression: Compression processes apply algorithms to transform information to take up less storage space. Compression algorithms can be (and often are) applied to data as it is moved into storage, but some can be applied to data-at-rest to improve space gains even more.
● Thin Provisioning: Thin provisioning is an approach to storage where space is partitioned and used as needed rather than pre-allocating storage to users or processes. While more computationally intensive, this approach can significantly reduce inefficiencies like disk fragmentation.
These technologies include:
p da

● Dimensionality Reduction: This approach attempts to reduce the number of "dimensions," i.e., aspects or variables, of a data set. For example, a spreadsheet with 10,000 rows but only one column is much simpler to process than one with an additional 500 columns of attributes included. The approach can include compression transformations or even the removal of irrelevant attributes for a specific data mining application.


● Data Cube Aggregation:
This technique is used to aggregate data in a simpler form. Data Cube Aggregation is a multidimensional aggregation that uses aggregation at various levels of a data cube to represent the original data set, thus achieving data reduction.

✓ For example, suppose you have the data of All Electronics sales per quarter for the year 2018 to the year 2022. If you want to get the annual sale per year, you just have to aggregate the sales per quarter for each year. In this way, aggregation provides you with the required data, which is much smaller in size, and thereby we achieve data reduction even without losing any data.
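A minimal sketch of the quarterly-to-annual aggregation described above, using pandas; the sales figures are made up for illustration:

import pandas as pd

quarterly = pd.DataFrame({
    "year":    [2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "sales":   [120, 150, 130, 170, 140, 160, 155, 180],
})

# Aggregate quarters up to the year level: 8 rows reduce to 2
annual = quarterly.groupby("year", as_index=False)["sales"].sum()
print(annual)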
Data Compression:

Data compression employs modification, encoding, or converting the structure of data in a way that consumes less space. Data compression involves building a compact representation of information by removing redundancy and representing data in binary form. Data that can be restored successfully from its compressed form is called lossless compression. In contrast, the opposite case, where it is not possible to restore the original form from the compressed form, is lossy compression. Dimensionality and numerosity reduction methods are also used for data compression.

This technique reduces the size of the files using different encoding mechanisms, such as Huffman
Encoding and run-length Encoding. We can divide it into two types based on their compression
techniques.

i. Lossless Compression: Encoding techniques (such as Run-Length Encoding) allow a simple and minimal data size reduction. Lossless data compression uses algorithms to restore the precise original data from the compressed data.

ii. Lossy Compression: In lossy data compression, the decompressed data may differ from the original data but is useful enough to retrieve information from it.
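A quick demonstration of lossless compression using Python's built-in zlib module (one of many possible compression libraries); the byte string is a made-up, highly redundant sample:

import zlib

original = b"AAAAABBBBBCCCCC" * 100          # highly redundant data
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), len(compressed))         # compressed form is much smaller
print(restored == original)                   # True -> no information lost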

Benefits of Data Reduction

✓ Data reduction can save energy.

✓ Data reduction can reduce your physical storage costs.

✓ And data reduction can decrease your data center footprint.

Q. Data Discretization

Discretization, also known as binning, is the process of transforming continuous numerical variables into discrete categorical features. Data discretization refers to a method of converting a huge number of data values into smaller ones so that the evaluation and management of data become easy.

✓ In other words, data discretization is a method of converting attribute values of continuous data into a finite set of intervals with minimum data loss.
✓ There are two forms of data discretization: the first is supervised discretization, and the second is unsupervised discretization. Supervised discretization refers to a method in which the class data is used.

✓ Unsupervised discretization refers to a method that depends upon the way in which the operation proceeds. It means it works on a top-down splitting strategy and a bottom-up merging strategy.

Steps of Discretization

1. Understand the Data: Identify continuous variables and analyze their distribution, range, and role
in the problem.

2. Choose a Discretization Technique:

● Equal-width binning: Divide the range into intervals of equal size.
● Equal-frequency binning: Divide data into bins with an equal number of observations.
● Clustering-based discretization: Define bins based on similarity (e.g., age, spend).

3. Set the Number of Bins: Decide the number of intervals or categories based on the data and the problem's requirements.

4. Apply Discretization: Map continuous values to the chosen bins, replacing them with their respective bin identifiers.

5. Evaluate the Transformation: Assess the impact of discretization on data distribution and model performance. Ensure that patterns or important relationships are not lost.

6. Validate the Results: Cross-check to ensure discretization aligns with the problem goals.

Ex: Suppose we have an attribute Age with the given values:

Age: 1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19, 31, 33, 36, 42, 44, 46, 70, 74, 78, 77

Table before and after Discretization:

Attribute                Age               Age                      Age                      Age
Before Discretization    1,4,5,7,9         11,13,14,17,18,19        31,33,36,42,44,46        70,74,77,78
After Discretization     Child             Young                    Mature                   Old
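A sketch of the Age discretization above using pandas.cut; the bin edges (0, 10, 30, 60, 100) are chosen by assumption to reproduce the Child/Young/Mature/Old grouping:

import pandas as pd

ages = pd.Series([1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19,
                  31, 33, 36, 42, 44, 46, 70, 74, 78, 77])

labels = pd.cut(ages,
                bins=[0, 10, 30, 60, 100],
                labels=["Child", "Young", "Mature", "Old"])
print(labels.value_counts())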

Some techniques of data discretization

● Histogram analysis: A histogram refers to a plot used to represent the underlying frequency distribution of a continuous data set. A histogram assists in data inspection for data distribution, for example, outliers, skewness representation, normal distribution representation, etc.
● Binning: Binning is a data smoothing technique and it helps to group a huge number of continuous values into a smaller number of bins. For example, if we have data about a group of students, we may want to arrange their marks into a smaller number of marks intervals by making bins of grades: one bin for grade A, one for grade B, one for C, one for D, and one for grade E.
● Cluster analysis: Cluster analysis is commonly known as clustering. Clustering is the task of grouping similar objects in one group, commonly called a cluster. All different objects are placed in different clusters.
● Equal-Width Intervals: This technique divides the range of attribute values into intervals of equal size. The simplicity of this method makes it popular, especially for initial data analysis. For example, if you're dealing with an attribute like height, you might divide it into intervals of 10 cm each.
● Equal-Frequency Intervals: Unlike equal-width intervals, this method divides the data so that each interval contains approximately the same number of data points. It's particularly useful when the data is unevenly distributed, as it ensures that each category has a representative sample.

Applications of Discretization

1. Improved Model Performance: Decision trees, Naive Bayes, and rule-based algorithms often perform better with discrete data because they naturally handle categorical features more effectively.
2. Handling Non-linear Relationships: Data scientists can discover non-linear patterns between features and the target variable by discretising continuous variables into bins.
3. Outlier Management: Discretization, which groups data into bins, can help reduce the influence of extreme values, helping models focus on trends rather than outliers.
4. Feature Reduction: Discretization can group values into intervals, reducing the dimensionality of continuous features while retaining their core information.
5. Visualization and interpretability: Discretized data makes it easier to create visualizations for exploratory data analysis and to interpret the data, which helps in the decision-making process.

Q. Data Munging

Data munging, also known as data wrangling, is a data transformation process that converts raw data
into a more usable format. In other words, it involves cleaning, normalizing, and enriching raw data
so that it can be used to produce meaningful insights needed to make strategic decisions.

Some examples of data munging include

● Handling missing or inconsistent data


● Eliminating irrelevant or unnecessary data
● Formatting data types
● Merging multiple data sources into one comprehensive dataset for analysis

Data munging can be done automatically or manually. Large-scale organizations with massive datasets generally have a dedicated data team responsible for data munging. Their task is to transform raw data and pass it to business leaders for informed decision making.

Data Munging Works

Data munging is a fairly simple process. It involves a series of steps taken to ensure your data is clean, enriched, and reliable for various uses. Let's look at these steps in detail here:

s.
1. Collection: The first step in data wrangling is collecting raw data from various sources. These
sources can include databases, files, external APIs, web scraping, and many other data streams. The
te
data collected can be structured (e.g., SQL databases), semi-structured (e.g., JSON, XML files), or
da
unstructured (e.g.. text documents, images).

2. Cleaning: Once data is collected, the cleaning process begins. This step removes errors, inconsistencies, and duplicates that can skew analysis results. Cleaning might involve:

● Removing irrelevant data that doesn't contribute to the analysis.
● Correcting errors in data, such as misspellings or incorrect values.
● Dealing with missing values by removing them, attributing them to other data points, or estimating them through statistical methods.
● Identifying and resolving inconsistencies, such as different formats for dates or currency.
3. Structuring: After cleaning, data needs to be structured or restructured into a more analysis-friendly format. This often means converting unstructured or semi-structured data into a structured form, like a table in a database or a CSV file. This step may involve:

● Parsing data into structured fields.
● Normalizing data to ensure consistent formats and units.
● Transforming data, such as converting text to lowercase, to prepare for analysis.
4. Enriching: Data enrichment involves adding context or new information to the dataset to make it more valuable for analysis. This can include:

● Merging data from multiple sources to develop a more comprehensive dataset.
● Creating new variables or features that can provide additional insights when analyzed.

5. Validating: Validation ensures the data's accuracy and quality after it has been cleaned, structured, and enriched. This step may involve:

● Data integrity checks, such as ensuring foreign keys in a database match.
● Quality assurance testing to ensure the data meets predefined standards and rules.

6. Storing: The final wrangled data is then stored in a data repository, such as a database or warehouse, making it accessible for analysis and reporting. This storage not only secures the data but also organizes it in a way that is efficient for querying and analysis.

7. Documentation: Documentation is critical throughout the data wrangling process. It records what was done to the data, including the transformations and decisions. This documentation is invaluable for reproducibility, auditing, and understanding the data analysis process.
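A compact pandas sketch of the cleaning, structuring, and enriching steps above; the customer table and the region lookup table are invented purely for illustration:

import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "name": ["Asha", "Ravi", "Ravi", None],
    "city": ["Guntur", "Vijayawada", "Vijayawada", "Guntur"],
})
regions = pd.DataFrame({"city": ["Guntur", "Vijayawada"],
                        "region": ["Coastal", "Coastal"]})

munged = (customers.drop_duplicates()                              # cleaning: remove duplicate rows
                   .dropna(subset=["name"])                        # cleaning: drop incomplete records
                   .assign(name=lambda d: d["name"].str.lower())   # structuring: consistent text format
                   .merge(regions, on="city", how="left"))         # enriching: add region information
print(munged)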


Benefits of Data Munging


Automated data solutions are used by enterprises to seamlessly perform data munging activities, to cleanse and transform source data into standardized information for cross-data set analytics. There are numerous benefits of data munging. It helps businesses:

● Eliminate data silos and integrate various sources (like relational databases, web servers, etc.)
● Improve data usability by transforming raw data into compatible, machine-readable information for business systems.
● Process large volumes of data to get valuable insights for business analytics.
● Ensure high data quality to make strategic decisions with greater confidence.

Q. Filtering

Filtering is a fundamental technique used in data processing and analysis to refine and extract useful information from raw data. It aids in removing unnecessary or irrelevant data, thereby improving the efficiency and accuracy of data analysis. Filtering is widely used across multiple fields including business intelligence, data science, and machine learning.

✓ From a business perspective, data filtering can help organizations make informed decisions by identifying trends, patterns, and outliers in their data.

✓ For example, a retail company might use data filtering to identify its top-selling products and adjust its inventory accordingly.

✓ From a scientific perspective, data filtering can help researchers identify patterns in experimental data that support or refute hypotheses.

✓ For instance, a biologist might use data filtering to identify genes that are differentially expressed between healthy and diseased cells.

There are several ways to filter data, depending on the format of the data and the desired outcome. Some common methods include:

1. Selection: This method involves selecting a subset of data based on specific criteria, such as a date range, geographic location, or demographic characteristics.

For example, a marketing team might select data on customers who have purchased a particular product within the last quarter to target them with a new promotion.
2. Sorting: This method involves organizing data in ascending or descending order based on one or more variables.

For instance, a financial analyst might sort stock prices by date to identify trends over time.

3. Aggregation: This method involves summarizing data by grouping it into categories or aggregating it to a higher level.

For example, a sales manager might aggregate sales data by region to identify which regions are performing best.

4. Filtering by conditions: This method involves applying specific conditions to the data to exclude
certain values or rows.

For example, a quality control engineer might filter out data points that exceed a certain threshold
for a machine's temperature reading to detect potential issues.

5. Statistical methods: This method uses statistical techniques such as regression, correlation, and clustering to identify patterns and relationships in the data.

For instance, a data scientist might use clustering algorithms to group customers based on their
purchasing behavior to identify customer segments.

6. Machine learning methods: This method uses machine learning algorithms such as decision trees, random forests, and neural networks to identify patterns and relationships in the data. For instance, a credit risk assessment model might use decision trees to predict which borrowers are likely to default on their loans based on their credit history and demographic factors.

7. Text filtering: This method is used to extract specific information from text data.

For example, a sentiment analysis tool might filter out words or phrases that indicate positive or negative sentiment from social media posts to analyze public opinion on a brand.

8. Image filtering: This method is used to manipulate image data, such as removing noise or enhancing features.

For example, a medical imaging algorithm might filter out background noise from an MRI scan to better visualize tumors.

9. Audio filtering: This method is used to manipulate audio data, such as removing noise or isolating specific sounds.

For example, a speech recognition system might filter out background noise from a voice recording to improve accuracy.

10. Real-time filtering: This method involves filtering data in real-time, as it is generated.

For example, a monitoring system might filter out anomalous sensor readings in real-time to quickly
detect equipment failures.
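A small pandas sketch of selection, sorting, and aggregation style filtering; the sales table below is made up for the example:

import pandas as pd

sales = pd.DataFrame({"region": ["East", "West", "East", "South"],
                      "product": ["A", "A", "B", "B"],
                      "amount": [250, 90, 400, 310]})

high_value = sales[sales["amount"] > 200]                    # filtering by condition
sorted_sales = sales.sort_values("amount", ascending=False)  # sorting
by_region = sales.groupby("region")["amount"].sum()          # aggregation
print(high_value, sorted_sales, by_region, sep="\n\n")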

Uses of Data Filtering

● Excel and Spreadsheet Operations


It is commonly employed in spreadsheet software like Microsoft Excel. Users can filter data rows based on specific conditions, allowing them to view and manipulate only the data that meets certain criteria. This is particularly useful when dealing with large datasets, streamlining the analysis process.

● E-commerce and Marketing



For businesses engaged in e-commerce, data filtering aids in targeting specific customer segments. Marketers can leverage this process to tailor campaigns, promotions, and product recommendations based on customer preferences and behaviors.

● Network Security
Filtering is a crucial component of network security and data security, where it is employed to
identify and block potentially harmful data or traffic. This helps prevent cyber threats and ensures
the integrity of a network

Benefits:

✓ Efficiency: Reduces processing time and resources by focusing on relevant data.

✓ Focus: Enables analysts to pinpoint specific data subsets for in-depth analysis.

✓ Accuracy: Improves data quality by removing irrelevant or erroneous information.

Techniques:

✓ Rule-based filtering: Applying predefined rules or conditions to select data.

✓ Range filtering: Selecting data within specific numerical or date ranges.

✓ Conditional filtering: Applying conditions based on data values or characteristics.

✓ Tools and Libraries: Many tools and libraries, such as Pandas in Python, facilitate data filtering.

UNIT-II
Descriptive Statistics:

Descriptive statistics, such as mean, median, and range, help characterize a particular dataset by summarizing it. They also organize and present that data in a way that allows it to be interpreted. Descriptive statistics techniques can help describe a data set to an individual or organization and include measures related to the data's frequency, positioning, variation, and central tendency, among other things.

● Descriptive statistics can help businesses decide where to focus further research. The main purpose of descriptive statistics is to provide information about a data set.
● Descriptive statistics summarizes large amounts of data into useful bits of information.
Ex1: For example, suppose a brand ran descriptive statistics on the customers buying a specific product and saw that 90 percent were female. In that case, it may focus its marketing efforts on better reaching female demographics.
Ex2: In recapping a Major League Baseball season, for example, descriptive statistics might include team batting averages, the number of runs allowed per team, and the average wins per division.

Q. Mean

1. Mean:

In descriptive statistics, the mean, also known as the average, is a measure of central tendency that represents the typical value of a dataset. It's calculated by summing all the values in the dataset and then dividing by the total number of values.

Calculation:
● Sum the values: Add up all the numbers in your dataset.
● Count the values: Determine how many numbers are in the dataset.
● Divide the sum by the count: The result is the mean.

Formula to calculate mean:

x̄ = (x₁ + x₂ + ... + xₙ) / n

Ex: dataset: 2, 4, 6, 8, 10.

● Sum: 2 + 4 + 6 + 8 + 10 = 30.
● Count: There are 5 values in the dataset.
● Mean: 30 / 5 = 6.
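The same result can be checked with Python's built-in statistics module, as a quick sanity check on the worked example:

import statistics

data = [2, 4, 6, 8, 10]
print(statistics.mean(data))   # 6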
Importance of Mean:

● Central Tendency: The mean provides a single value that represents the "average" or
"typical" value of a dataset.
● Data Summary: It's a valuable way to summarize a large dataset into a single,
representative value.
● Comparison: The mean can be used to compare different datasets or groups.

Q. Standard Deviation
1) The standard deviation is a measure of spread or variability in descriptive statistics. It is used for calculating the variance or spread by which individual data points differ from the mean.

2) A low deviation implies that the data points are extremely close to the mean, whereas a high deviation suggests that the data is spread out over a wider range of values.

3) In marketing, variance can assist in accounting for big variations in expenses or revenues. It also helps identify the dispersion of asset prices in relation to their average price and market volatility.

4) In the image, the curve on top is more spread out and therefore has a higher standard deviation, while the curve in the right-side image is more clustered around the mean and therefore has a lower standard deviation.

Standard Deviation Formula:

Standard Deviation (S) = √[ Σ(X − x̄)² / (n − 1) ]

where:

Σ = sum over all values
X = each value
x̄ = sample mean
n = number of values in the sample
Step-by-step to Calculate Standard Deviation:
Follow these steps to calculate a sample's standard deviation:
Step 01: Collect your data
Collect the dataset for which the standard deviation is to be calculated. Assume you have a data set (45, 67, 30, 58, 50) and a sample of size n = 5.
Step 02: Find the mean
Calculate the sample mean (average) by adding all of the data points and dividing by the sample size n.
Sample Mean x̄ = (45 + 67 + 30 + 58 + 50) / n = 250 / 5 = 50
Step 03: Calculate the differences from the mean

g
or
Subtract the sample mean (𝑥̄ ) from each data point (X).

s.
Difference = X-𝑥̄ te
● 45, difference = X-𝑥̄ = 45-50 =-
da
4
● 67, difference = X-𝑥̄ = 67-50 =
p

17
uu

● 30, difference = X-𝑥̄ = 30-50 = -20


● 58, difference = X-𝑥̄ = 58-50 = 8
n

● 50, difference = X-𝑥̄ = 50-50 = 0


Step 04: Square the differences

Square each difference acquired in the previous step.

Squared difference = (X − x̄)²
● 45, squared difference = (X − x̄)² = (45 − 50)² = (−5)² = 25
● 67, squared difference = (X − x̄)² = (67 − 50)² = (17)² = 289
● 30, squared difference = (X − x̄)² = (30 − 50)² = (−20)² = 400
● 58, squared difference = (X − x̄)² = (58 − 50)² = (8)² = 64
● 50, squared difference = (X − x̄)² = (50 − 50)² = (0)² = 0
Step 05: Sum the squared differences
Add all the squared differences together.
Σ(squared difference) = Σ[(X − x̄)²] = 25 + 289 + 400 + 64 + 0 = 778
Step 06: Calculate the variance
To get the variance, divide the sum of squared differences by (n − 1).

Variance (S²) = Σ(squared difference) / (n − 1) = 778 / 4 = 194.5
Step 07: Calculate the standard deviation
Finally, calculate the standard deviation by taking the square root of the variance.

Standard Deviation (S) = √variance = √194.5 ≈ 13.95
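The worked example can be verified with the statistics module, which uses the same sample (n − 1) formula:

import statistics

data = [45, 67, 30, 58, 50]
print(statistics.mean(data))    # 50
print(statistics.stdev(data))   # about 13.95 (sample standard deviation)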

Q. Skewness and Kurtosis


A probability distribution with a mean of 0 (zero) and a standard deviation of 1 (one) is known as the standard normal distribution. A normal distribution is symmetric about the mean and follows a bell curve. Skewness and Kurtosis are two important distribution factors studied by descriptive statistics.

* Skewness and kurtosis are important statistical measures that help in understanding the shape and distribution of a dataset.

* They are widely used in various fields, including finance, economics, quality control, and management.
Skewness:
Skewness of a distribution is defined as the lack of symmetry. In a symmetrical distribution, the Mean, Median and Mode are equal. The normal distribution has a skewness of 0. Skewness tells us about the shape of the distribution of our data. A skewness of zero indicates a distribution with no skew, negative skewness means the longer tail falls on the left side of the graph, and positive skewness means the longer tail falls on the right side of the distribution.

Measure of Skewness:

Skewness (Sk) = (Mean − Mode) / Standard Deviation

Types of Skewness:

● Symmetric Skewness: A perfectly symmetric distribution is one in which the frequency distribution is the same on both sides of the center point of the frequency curve. In this case, Mean = Median = Mode. There is no skewness in a perfectly symmetrical distribution.

● Asymmetric Skewness: An asymmetrical or skewed distribution is one in which the spread of the frequencies is different on the two sides of the center point, or the frequency curve is stretched more towards one side. The values of Mean, Median and Mode fall at different points.
● Positive skewness: When the tail on the right side of the distribution is longer or fatter, we say the data is positively skewed. For a positive skewness, mean > median > mode.

In a distribution with positive skewness (right-skewed):

✓ The right tail of the distribution is longer or fatter than the left.

✓ The mean is greater than the median, and the mode is less than both the mean and the median.

✓ Lower values are clustered in the "hill" of the distribution, while extreme values are in the long right tail.

✓ It is also known as a right-skewed distribution.

● Negative skewness: When the tail on the left side of the distribution is longer or fatter, we say that the distribution is negatively skewed. For a negative skewness, mean < median < mode.

In a distribution with negative skewness (left-skewed):

✓ The left tail of the distribution is longer or fatter than the right.

✓ The mean is less than the median, and the mode is greater than both the mean and the median.

✓ Higher values are clustered in the "hill" of the distribution, while extreme values are in the long left tail.

✓ It is also known as a left-skewed distribution.


Kurtosis:
Kurtosis is a statistical measure that defines how heavily the tails of a distribution differ from the tails of a normal distribution. In other words, kurtosis identifies whether the tails of a given distribution contain extreme values.
Types of Kurtosis

1. Leptokurtic: Leptokurtic is a curve having a higher peak than the normal distribution. In this curve, there is too much concentration of items near the central value.

2. Mesokurtic: Mesokurtic is a curve having a peak similar to that of the normal curve. In this curve, there is an equal distribution of items around the central value.

3. Platykurtic: Platykurtic is a curve having a lower peak than the normal curve. In this curve, there is less concentration of items around the central value.

Application of Skewness
(1) Finance & Investment: Helps in analyzing the return distributions of stocks, bonds, and other assets.
(2) Quality Control & Manufacturing: In process control, skewness helps in detecting systematic bias in production quality.
(3) Medical & Biological Research: Used in analyzing patient recovery times, response times to treatment, or the spread of diseases.

Application of Kurtosis
(1) Finance & Risk Management: High kurtosis (leptokurtic) suggests extreme values (outliers) are more common, which is crucial in risk assessment for stock market investments. Low kurtosis (platykurtic) suggests fewer outliers, indicating stable returns.
(2) Econometrics & Business Forecasting: Helps in identifying anomalies in economic trends, crashes, or extreme demand-supply shocks.
(3) Industrial Quality Control: Identifies production defects by highlighting rare but severe deviations in product quality.
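A quick sketch of computing skewness and kurtosis with pandas; the small sample below is invented and deliberately right-skewed:

import pandas as pd

s = pd.Series([2, 3, 3, 4, 4, 5, 5, 6, 20])
print(s.skew())   # positive value -> right-skewed
print(s.kurt())   # excess kurtosis (0 corresponds to a normal distribution)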

Q. Box Plots
uu

The box and whisker plot, sometimes simply called the box plot. Box plot
n

is a graphical represent of the distribution of a dataset. It displays key summary


.a

statistics such as the median. quartiles potential outliers in a concise and visual manner. By
w

using Box plot you can provide a summarydistribution, identify potential and compare
w

different datasets in a compact and visual manner.


w

✓A box plot-displays the five-number summary of a set of data. The five-


number summarythe minimum, first quartile, median, third quartile, and
maximum.

✓In a box plot, we draw a box from the first quartile to the third quartile. A
vertical line through the box at the median. The whiskers go from each quartile to the
minimum and maximum.

✓ A box plot displays a ton of information in a simplified format.

Elements of Box Plot:

A box plot gives a five-number summary of a set of data which is-



● Minimum - It is the minimum value in the dataset excluding the outliers.
● First Quartile (Q1) - 25% of the data lies below the First (lower) Quartile.
● Median (Q2) - It is the mid-point of the dataset. Half of the values lie below it and half above.
● Third Quartile (Q3) - 75% of the data lies below the Third (Upper) Quartile.
● Maximum - It is the maximum value in the dataset excluding the outliers.

To construct a box plot:

To construct a box plot, use a horizontal or vertical number line and a rectangular box. The smallest and largest data values label the endpoints of the axis. The first quartile marks one end of the box and the third quartile marks the other end of the box. Approximately the middle 50 percent of the data fall inside the box. The "whiskers" extend from the ends of the box to the smallest and largest data values. The median or second quartile can be between the first and third quartiles, or it can be one, the other, or both. The box plot gives a good, quick picture of the data.

Example: Finding the five-number summary


A sample of 10 boxes of raisins has these weights (in grams):

25, 28, 29, 29, 30, 34, 35, 35, 37, 38

Make a box plot of the data.

Step 1: Order the data from smallest to largest.
Our data is already in order:
25, 28, 29, 29, 30, 34, 35, 35, 37, 38
Step 2: Find the median.
The median is the mean of the middle two numbers:
25, 28, 29, 29, 30, 34, 35, 35, 37, 38
(30 + 34) / 2 = 32

The median is 32.

Step 3: Find the quartiles.
The first quartile is the median of the data points to the left of the median.
25, 28, 29, 29, 30
Q1 = 29

The third quartile is the median of the data points to the right of the median.
34, 35, 35, 37, 38
Q3 = 35

Step 4: Complete the five-number summary by finding the min and the max.
The min is the smallest data point, which is 25.

The max is the largest data point, which is 38.
The five-number summary is 25, 29, 32, 35, 38.
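The same summary can be reproduced with NumPy, and the plot drawn with matplotlib; note that NumPy's default quartile method can differ slightly from the hand method in general, although here it gives the same values:

import numpy as np
import matplotlib.pyplot as plt

weights = [25, 28, 29, 29, 30, 34, 35, 35, 37, 38]
q1, median, q3 = np.percentile(weights, [25, 50, 75])
print(min(weights), q1, median, q3, max(weights))

plt.boxplot(weights)
plt.title("Raisin box weights (grams)")
plt.show()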
s.
te
Features of Box Plot:

✓ It exhibits data from a five-number summary, which is also inclusive of one of the measures of central tendency. This implies that it has five pieces of information.

✓ It is particularly used to reflect whether the dataset given is a skewed distribution or not.

✓ It also provides an insight into the data set, showing whether there are potential unusual observations. These are called outliers.

✓ It reflects information about how the data is spread out.

✓ Herein, the arrangements can be matched with each other. This is because the center, spread, and overall range are instantly apparent in the case of a box plot.

✓ It is particularly useful for descriptive data interpretation.

✓ It is also used where huge numbers of data collections are involved or compared.

Q. Pivot Table
56 www.anuupdates.org

A pivot table is a statistics tool that summarizes and reorganizes selected columns and rows of data
in a spreadsheet or database table to obtain a desired report. The tool does not actually change the
spreadsheet or database itself, it simply "pivots" or turns the data to view it from different
perspectives.

✓ Pivot tables are especially useful with large amounts of data that would be time-consuming
to
calculate by hand.

✓ A few data processing functions a pivot table can perform include identifying
sums, averages, ranges or outliers. The table then arranges this information in a
simple, meaningful layout that draws attention to key values.

✓ Pivot table is a generic term, but is sometimes confused with the Microsoft trademarked term,

PivotTable. This refers to a tool specific to Excel for creating pivot tables.

g
or
The Pivot Table helps us to view our data effectively and saves crucial time by just
summarizing out the data into essential categories. It is basically a kind of reporting tool and
contains mainly the following four fields which are as follows:
s.
te
● Rows: This refers to data which are taken as a specifier.
● Values: This represents the count of the data in the given Excel sheet.
● Filters: This would help us to hide or highlight specific parts of the data.
● Columns: This refers to values under the various conditions respectively.
n
.a

Pivot tables work

When users create a pivot table, there are four main


w

components:
w
w

1. Columns-When a field is chosen for the column area, only the unique values of the field are listed
across the top.

2. Rows- When a field is chosen for the row area, it populates as the first column.
Similar to the columns, all row labels are the unique values and duplicates are removed.
3. Values- Each value is kept in a pivot table cell and display the
summarized information. The most common values are sum, average, minimum
and maximum.
4. Filters- Filters apply a calculation or restriction to the entire table.

Uses of a pivot table

A pivot table helps users answer business questions with minimal effort.
Common pivot table uses include:
• To calculate sums or averages in business situations. For example, counting
sales by department or region.

• To show totals as a percentage of a whole. For example, comparing sales for a specific product to total sales.
• To generate a list of unique values. For example, showing which states or countries
have ordered a product.
• To create a 2x2 table summary of a complex report.

• To identify the maximum and minimum values of a dataset.

• To query information directly from an online analytical processing (OLAP) server.
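A minimal pandas pivot table, summing made-up sales by region and product; the column names are only illustrative:

import pandas as pd

df = pd.DataFrame({"region": ["East", "East", "West", "West"],
                   "product": ["A", "B", "A", "B"],
                   "sales": [100, 150, 200, 50]})

pivot = pd.pivot_table(df, values="sales", index="region",
                       columns="product", aggfunc="sum")
print(pivot)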

Q. Heat Map
s.
te
Heatmap data visualization is a powerful tool used to represent numerical data
da
graphically, when values are depicted using colors. This method is particularly effective for
identifying patterns, trend and anomalies within large datasets.
p

✓ Heat maps are graphic representations of information that use color-coding and relative
uu

size
n

display data values.


.a

✓ They are commonly used to visualize and analyzelarge data sets but can be an
w

indispensable part of the change manager's arsenal too.


w
w

✓ Heatmaps are a useful tool for analyzingand communicating complex information


clearly intuitively.

✓ Heatmap analytics are represented graphically, making the data easy to digest and
understand

✓ Heatmaps help businesses understand user behavior and how people interact with content.
58 www.anuupdates.org

Types of Heatmaps
1. Interaction Heatmaps
Interaction heatmaps measure active engagement on a webpage, allowing you to see the
type interaction users have with your website. They can measure mouse movements, clicks, and
scrolgiving you an in-depth understanding of how consumers use your website.
● Click Maps
Click maps provide a graphical representation of where users click. This includes mouse clicks
desktop devices and finger taps on mobile devices. Click maps allow you to see which elements of you
webpageare being clicked on and which are being ignored.
● Mouse Move Maps
Research shows a strong correlation between where a user moves their mouse and

g
where their attention lies on a webpage. Mouse move maps track where users move

or
their mouse as they navigate a webpage.This gives you a clear indication of where users
are looking as they interact with your webpage.

s.
● Scroll Maps
te
Scroll maps help you visualize how visitors to your website scroll through
da

your web pages. They do this by visually representing how many visitors scroll
down to any point on the page.
p
uu

2. Attention
Heatmaps
n

Attention heatmaps can be used on websites .They allow you to visualize a


.a

consumer's attention as they look at your content, discovering where their eyes
w

move and which aspects of your content grasp the consumer's attention. There are two types
of attention heatmaps:
w
w

a) Eye Tracking Heatmaps


Eye-tracking heatmaps collect primary data to visualize how a sample audience
views your content. Eye movements and fixation durations are measured to
understand how consumers see your content accurately. Here's how they work:
● Special devices track eye movements as people look at content.
● The system records where the eyes focus and for how long.
● Thisdata is turned into a colorfulmap.
● Red areas show where people looked the most.
● Blue or green areas show less viewed spots.
For example, Netflix used eye-tracking heatmaps to improve its user interface. This
led to a significant increase in viewer engagement with its content recommendations.
b) Predictive Attention Heatmaps
59 www.anuupdates.org

Predictive attention heatmaps use artificial intelligence to predict where a typical


audience would likely look when viewing your content. The data is displayed as a heatmap,
providing a graphical representation of consumer attention.
● AI and Data: The system uses AI and past data to predict eye movements. It learns
from how people have looked at similar images or designs before.
● Simulation: The AI creates a heatmap without needing real users. It shows which areas
will likely grab attention, using warm colors for high focus and cool colors for low
focus.
● Testing Designs: Designers use these heatmaps to test layouts, ads, or products
before they go live. This helps them make changes early to improve engagement.
3. Clustered
Heatmap :
A clustered heatmap groups similar data sets together to show patterns and relationships. A
clustered heatmap is useful when working with large amounts of raw data, as it helps group

g
related data points for better insights. Here are a few common uses of clustered heatmaps:

or
● Business Intelligence: Helps companies analyze sales, revenue, and

s.
customer trends. te
● Healthcare: Used for studying gene expression and patient data.
● Marketing: Identifies customer segments with similar behaviors.
da

4. Correlogram:
p

A correlogram is a heatmap showing the relationship between different variables. It uses


uu

a colorscheme to represent positive or negative correlations, helping users understand


how strongly two factors are related. Common uses of correlogram include:
n
.a

● Finance and Stock Markets: Helps traders analyzerelationships between stock


w

prices.
w

● Website Heatmaps: Measures the correlation between site traffic and user actions.
Scientific Research: Used by data scientists to study connections between
w


variables in complex studies.
5. Grayscale
Heatmap

A grayscale heatmap represents data using black, white, and gray shades. Darker shades indicate higher data points, while lighter shades represent lower values. Since it avoids multiple colors, this type provides a clear, distraction-free view of the information. Grayscale heatmaps can be used in:
● Medical imaging: Doctors use grayscale heatmaps to analyze X-rays and
MRI scans.
● Data visualization: Researchers use them for reports where coloris not needed.
6. Rainbow Heatmap
A rainbow heatmap uses multiple colors, such as blue, green, yellow, orange, and red, to indicate variations in heat values. Cooler colors like blue represent lower values, while warmer colors like red indicate higher intensity. This type of color map is visually striking and can make it easier to see differences in large datasets. They are used for:
● Weather maps: Meteorologists use rainbow heatmaps to show temperature or
rainfall.
● Website heatmaps: Businesses track user behavior using rainbow-colored
click maps.
● Sports analytics: Football analysts use spatial heatmaps to study player movement.
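As a small sketch of how a heatmap can be drawn in code, the snippet below plots the correlation matrix of a randomly generated dataset with matplotlib; the data is purely illustrative (seaborn's heatmap function is another common choice):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 4))
corr = np.corrcoef(data, rowvar=False)      # 4 x 4 correlation matrix

plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.colorbar(label="correlation")
plt.title("Correlation heatmap")
plt.show()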
Q. Correlation Statistics

g
Correlation analysis refers to the statistical technique that quantifies the relationship

or
between two or more variables. It reveals whether an increase or decrease in one variable leads to

s.
an increase or decrease in another variable. This is useful in identifying trends, making
te
predictions, and testing hypotheses.
da
The most commonly used correlation coefficient is Pearson's correlation coefficient,
which measure the strength and direction of a linear relationship between variables.
p

Types of Correlation
uu

Correlation can be classified into several types based on the relationship between the variables.
n

The most common types include:


.a

1. Positive Correlation: In a positive correlation, as one variable increases, the other variable
w

also increases. For example, as the temperature increases, ice cream sales tend to rise. The correlation
coefficient (r) for a positive correlation lies between 0 and +1.
w

Example: As the hours spent studying increase, exam scores tend to increase as
w

well.
2. Negative Correlation: In a negative correlation, as one variable increases, the other
variable decreases. A perfect negative correlation would have a correlation coefficient of -1.
Ex: As the amount of time spent watching television increases, the time spent on
physical activity decreases.
3. NoCorrelation:In this case, there is no discernible relationship between the two variables.
Change in one variable do not affect the other.
Ex: The correlation between a person's shoe size and their salary would likely have no
correlation.
4. ZeroCorrelation: Zero correlation refers to the absence of a relationship between two variables.
The correlation coefficient in this case is 0.
61 www.anuupdates.org

Ex: The correlation between the day of the week and a person's height would be zero, as there is no
relationship.

To Perform Correlation Analysis:

Performing correlation analysis


involves the following steps:
1: Collect the Data: Gather the data for the variables you wish to
analyze. The data should be numeric and can either be from

g
observational studies, experiments, or surveys.

or
2:Visualizethe Data: Before performing correlation analysis, it's helpful to

s.
create scatter plots or other visual representations to identify any apparent
te
relationships between the variables.
da
3: Calculate the Correlation Coefficient: Use a statistical tool or software
like Excel, Python (with libraries like NumPy or Pandas), or R to calculate the
p

correlation coefficient (Pearson's, Spearman's, or Kendall's) between the variables.


uu

4: Interpret the Results: The calculated correlation coefficient will give you
an idea of the strength and direction of the relationship. Values closer to +1 or -1
n
.a

indicate stronger relationships, while values near 0 suggest weak or no correlation.


w

5: Check forSignificance: It is essential to check if the correlation result is statistically


significant. This can be done through hypothesis testing to determine whether the
w

observed correlation is due to chance or if it reflects a true relationship.


w

Measuring the
Correlation

For n pairs of sample observations (x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ), the correlation coefficient r can be defined as:

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[ Σ(xᵢ − x̄)² × Σ(yᵢ − ȳ)² ]

Correlation coefficient r is a statistical measure that quantifies the linear relationship


between a pair of variables.
The value of correlation coefficient (r) lies between -1 to +1. When the value
of-
● r=0; there is no relation between the
variable.
● r=+1; perfectly positively correlated.
● r=-1; perfectly negatively correlated.
● r= 0 to 0.30; negligible correlation.
● r= 0.30 to 0.50; moderate correlation.
● r= 0.50 to 1 highly correlated

Properties of Correlation Coefficient

g
● The correlation coefficient is a symmetric

or
measure.
● The value of correlation coefficient lies between -1 to

s.
+1.
●It is dimensionless quantity.
te
●It is independent of origin and scale of measurement.
da
●The correlation coefficient will be positive or negative depending on
whether the sign of numerator of the formula is negative or positive.
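A quick sketch of computing Pearson's r with NumPy; the hours-studied and exam-score values below are invented, and the same result can be obtained with pandas' DataFrame.corr():

import numpy as np

hours_studied = [1, 2, 3, 4, 5]
exam_score    = [52, 58, 65, 70, 78]

r = np.corrcoef(hours_studied, exam_score)[0, 1]
print(r)   # close to +1 -> strong positive correlation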
Q. ANOVA
p
uu

ANOVA stands for Analysis of Variance, a statistical test used to compare the means of three or
more groups. It analyzes the variance within the group and between groups. The primary
n

objective is to assess whether the observed variance between group means is more
.a

significant than within the groups. If the observed variance between group means is significant,
it suggests that the differences are meaningful.
w
w

Mathematically, ANOVA breaks down the total variability in the data into two
components:
w

● Within-Group Variability: Variability caused by differences within


individual groups, reflecting random fluctuations.
● Between-Group Variability: Variability caused by differences between
the means of the different groups.
The test produces an F-statistic, which shows the ratio between between-
group and within-groupvariability. If the F-statistic is sufficiently
large, it indicates that at least one of the group mean
significantly different from the others.
To understand this better, consider a scenario where you are asked to assess a student's performance (exam scores) based on three teaching methods: lecture, interactive workshop, and online learning. ANOVA can help us assess whether the teaching method statistically impacts the student's performance.
Types of
ANOVA

There are two types of ANOVA: one-way and two-way. Depending on the
number of independent variables and how they interact with each other, both are used
in different scenarios.

1. One-way ANOVA
A one-way ANOVA test is used when there is one independent variable with two
or more groups. The objective is to determine whether a significant difference exists between

g
the means of different In our example, we can use one-way ANOVA to compare the

or
effectiveness of the three different teaching methods (lecture, workshop, and online
learning) on student exam scores. The teaching method is the independent

s.
variable with three groups, and the exam score is the dependent variable.
te
● Null Hypothesis (Ho): The mean exam scores of students across the three teaching
da

methods are equal (no difference in means).


● Alternative Hypothesis (H): At least one group's mean significantly differs.
p
n uu
.a
w
w
w

The one-way ANOVA test will tell us if the variation in student exam scores can be attributed to the
differences in teaching methods or if it's likely due to random chance.
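A sketch of how this one-way ANOVA could be run with SciPy's f_oneway function; the exam scores for the three teaching methods below are invented for illustration:

from scipy import stats

lecture  = [72, 75, 70, 68, 74]
workshop = [80, 85, 78, 82, 84]
online   = [74, 71, 77, 73, 75]

f_stat, p_value = stats.f_oneway(lecture, workshop, online)
print(f_stat, p_value)   # small p-value -> at least one group mean differs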

One-way ANOVA is effective when analyzingthe impact of a single factor across multiple
groups, making it simpler to interpret. However, it does not account for the possibility of
interaction between multiple independent variables, where two-way ANOVA becomes
necessary.
64 www.anuupdates.org

2. Two-way
ANOVA
Two-way ANOVA is used when there are two independent variables, each with two
or more groups. The objective is to analyze how both independent variables influence
the dependent variable.
Let's assume you are interested in the relationship between teaching
methods and study techniques and how they jointly affect student
performance. The two-way ANOVA is suitable for this scenario. Herewe test
three hypotheses:

● The main effect of factor 1 (teaching method): Does the


teaching method influence student exam scores?
● The main effect of factor 2 (study technique): Does the study

g
technique affect exam scores?

or
● Interaction effect: Does the effectiveness of the teaching method
depend on the study technique used?
s.
te
For example, two-way ANOVA could reveal that students
using the lecture method perform better in group study, and
da

those using online learning might perform better in


p

individual study. Understanding these interactions gives a


uu

deeper insight into how different factors together impact outcomes.


n
.a

Q.No-SQL
w

SQL and NoSQL databases differ in how they store and query data. SQL
w

databases rely on tables with columns and rows to retrieve and write structured
w

data, while NoSQL, or "Not Only SQL," is a database management system


(DBMS) designed to handle large volumes of unstructured data. NoSQL
databases use flexible data models better suited for unstructured and semi-structured data.
There are four popular NoSQL database types:
1. Key-Value Databases
2.Document Databases
3. Wide-Column Databases
4.Graph Databases
65 www.anuupdates.org

Q. Document Database
A document database is a type of NoSQL database which stores data as JSON documents instead
of columns and rows. JSON is a native language used to both store and query data. These documents
can be grouped together into collections to form database systems. Developers can use JSON
documents in their code and save them directly into the document database.

✓ The flexible, semi-structured, and hierarchical nature of documents and document databases

g
or
allows them to evolve with applications' needs.

s.
✓ Document databases enable flexible indexing, powerful ad hoc queries, and analytics over
te
collections of documents.
p da
nuu
.a
w

Being a NoSQL database, you can easily store data without implementing a schema. You can
w

transfer the object model directly into a document using several different formats. The
most commonly used are JSON, BSON, and XML.
w

Here is an example of a simple document in JSON format that consists of


three key-value pairs:
{
"ID": "001",
"Name": "John",
"Grade": "Senior",
}
66 www.anuupdates.org

What's more, you can also use nested queries in such formats, providing
easier data distribution
For instance, we can add a nested value to the document above:

{
  "ID": "001",
  "Name": "John",
  "Grade": "Senior",
  "Classes": {
    "Class1": "English",
    "Class2": "Geometry",
    "Class3": "History"
  }
}

s.
*Due to their structure, document databases are optimal for use cases that
te
require flexibility and continual development. For example, you can use them
da
for managing user profiles, which die according to the information provided.
It's Schema-lessstructure allows you to have different attributesand values.
p

*Examples of NoSQL document databases include MongoDB, CouchDB,


uu

Elasticsearch, and others Working of Document Data Model:


n

This is a data model which works as a semi-structured data model in which the records and data
.a

associated with them are stored in a single document which means this data model is
not completelyunstructured. The main thing is that data here is stored in a document.
w
w

Document database
operations
w

You can create, read, update, and delete entire documents stored in the database. Document databases provide a query language or API that allows developers to run the following operations:

● Create: You can create documents in the database.
● Read: You can use the API or query language to read document data.
● Update: You can update existing documents flexibly.
● Delete: You can delete documents from the database.
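A rough sketch of these operations using MongoDB's Python driver (pymongo); it assumes pymongo is installed and a MongoDB server is running locally on the default port, and the "school" database and "students" collection names are only examples:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
students = client["school"]["students"]

students.insert_one({"ID": "001", "Name": "John", "Grade": "Senior"})   # Create
doc = students.find_one({"ID": "001"})                                  # Read
students.update_one({"ID": "001"}, {"$set": {"Grade": "Graduate"}})     # Update
students.delete_one({"ID": "001"})                                      # Delete
print(doc)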

The use cases for document databases


● Single view or data hub
● Customer data management and personalization
● Internet of Things (IoT) and time-series data
● Payment processing
67 www.anuupdates.org

● Mobile apps
● Real-time analytics
Relational Vs Document Database :

Difference between Relational and Document databases:

RDBMS Document Database System

Structured around the concept of Focused on data rather than


relationships. relationships.

Organizes data into tuples (or rows). Documents have properties without
theoretical definitions, instead of rows.

g
or
Defines data (forms relationships) via No DDL language for defining schemas.
constraints and foreign keys (e.g., a child

s.
table references to the master table via its
ID).
te
da
Uses DDL (Data Definition Language) Relationships represented via nested
data, not
to create relationships.
p

foreign keys (any document may contain


uu

others
nested inside of it, leading to an N:1 or
n

1:N relationship between the two


document entities).
.a
w
w

Features of Document Databases


w

Document Type Model: Documents in a document database provide a natural


representation of data ie. data is in the document and not in tables or
graphs, allowing for easy mapping in different programming languages.

FlexibleSchema: Document databases, in comparison to relational


databases, have a flexible schema. This makes data evolution and schema
changes easier by removing the need for all documents in a collection to have the
same fields.

Distributed and Resilient: Document databases are designed for horizontal


scaling and distribution of data. This makes them highly scalable and able to handle
large amounts of data and high?trafficloads.
68 www.anuupdates.org

Manageable Query Language: Document databases come with a query language that lets
users work with the data model to conduct CRUD (Create, Read, Update, and
Destroy) activities. As a result, accessing the database and getting the needed data is made
simpler.

Q. Wide-column Databases and Graphical Databases


Wide-column Databases:
Wide-column stores are another type of NoSQL database. In them, data is stored and
grouped into separately stored columns instead of rows. Such databases organize
information into columns that function similarly to tables in relational databases.

✓ However, unlike traditional databases, wide-column databases are highly flexible. They have no predefined keys nor column names.
✓ Their schema-free characteristic allows variation of column names even within the same table, as well as adding columns in real-time.
✓ The most significant benefit of column-oriented databases is that you can store large amounts of data within a single column.
✓ This feature allows you to reduce disk resources and the time it takes to retrieve information. They are also excellent in situations where you have to spread data across multiple servers.
✓ Examples of popular wide-column databases include Apache Cassandra, HBase, and Cosmos DB.
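As a purely conceptual sketch (not tied to any particular wide-column product), a row can be pictured in Python as a row key that maps to its own set of column/value pairs, where different rows may carry different columns:

# Conceptual wide-column layout: row key -> {column name -> value}
users = {
    "user:1": {"name": "Asha", "city": "Guntur", "email": "asha@example.com"},
    "user:2": {"name": "Ravi", "last_login": "2024-01-05"},   # different columns per row
}
print(users["user:1"]["city"])   # fetch a single column value for one row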


Wide Column Database Use Cases

1) Data Warehousing: Wide-column databases are optimized for data warehousing and business intelligence applications, where large amounts of data need to be analyzed and aggregated. They are often used for analytical queries, such as aggregation and data mining.

2) Big Data: Wide-column databases can handle large datasets and provide efficient storage and retrieval of data; therefore, they can be used for big data applications.

3) Cloud-based analytics: Wide-column databases can be easily scaled to handle large amounts of data and are designed for high availability, which makes them suitable for cloud-based analytics applications.

4) IoT: Wide-column databases can handle a high number of writes and reads and can be used for storing and processing IoT data.

Advantages of wide-column databases include:

1. High performance: Wide-column databases are optimized for analytical queries and are designed for fast query performance, which can be especially important in data warehousing and business intelligence applications.

2. Flexible and efficient data model: Wide-column databases store data in a column-family format, which allows for a more flexible and efficient data model, as each column family can have its own set of columns and can be optimized for different types of queries.

3. Scalability: Wide-column databases are often horizontally scalable, which means that they can handle large amounts of data and a high number of concurrent users.

4. Distributed systems: Wide-column databases can be distributed across multiple machines, which allows for high availability and scalability.


Graph database:

A graph database is a type of database that uses a graph model to represent and store data. The data is represented as a collection of nodes and edges. Nodes represent entities or objects, and edges represent connections or relationships between them. Nodes and edges each have attributes or properties that give additional details about the data. Representing complex relationships between data in traditional databases can be challenging because they are built to manage and store data in tables and columns. Contrarily, graph databases represent data as a network of nodes and edges, making it simple to model intricate relationships between data.
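To make the node-and-edge idea concrete, here is a small illustrative sketch using the third-party NetworkX library (used here only as an in-memory illustration of the graph model, not as an actual graph database):

# Nodes carry properties; edges carry a relationship label (illustration only)
import networkx as nx

g = nx.Graph()
g.add_node("Alice", age=30)
g.add_node("Bob", age=28)
g.add_edge("Alice", "Bob", relationship="friend")

print(g.nodes["Alice"])          # {'age': 30}
print(g.edges["Alice", "Bob"])   # {'relationship': 'friend'}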

● Graph databases are also NoSQL systems designed to investigate correlations among complicated, interconnected entities.
● Graph databases store, manage, and query complex data networks known as graphs.
● The structure of this database addresses the limitations of relational databases by emphasizing the relationships of data.
● This includes businesses in various industries such as social media, e-commerce, finance, and healthcare.

Types of graph databases:

Property graph database: Used to store data as nodes and edges, with metadata attached to each node and edge. Hence, property graph databases are well-suited for applications like fraud detection, recommendation engines, and social network analysis.

Hypergraph database: A subset of graph databases with edges connecting more than two nodes. Consequently, hypergraph databases are best for simulating complex data relationships, such as those present in chemical compounds.

Object-oriented databases: Used to store and manage relationships between objects. Therefore, object-oriented databases are good for use cases like managing intricate data relationships in applications and modeling complex business logic.

Resource Description Framework (RDF) databases: Made to manage and store metadata about resources, including web pages and scholarly articles, and their connections to one another. As a result, RDF databases are frequently suitable for applications utilizing knowledge graphs and the semantic web.

Mixed-model databases: Combine various data models, including document and graph models. Thus, mixed-model databases are well-suited for content management systems or e-commerce platforms that need the flexibility to handle various data types.

Graph databases use cases

● Social media networks: Social media networks are one of the most popular and natural use cases of graph databases because they involve complex relationships between people and their activities. For example, graph databases can store and retrieve information about friends, followers, likes, and shares, which can help social media companies like Facebook and Instagram tailor their content and recommendations for each user.

● Recommendation systems: Recommendation systems can provide users with tailored recommendations by modeling relationships between goods, clients, and purchases. A movie streaming service, like Netflix, might use a graph database to suggest movies and TV shows according to a user's viewing habits and preferences.

● Fraud detection: Graph databases allow for modeling relationships between different entities, including customers, transactions, and devices, which can be used in fraud detection and prevention. For example, a bank could use a graph database to detect fraudulent transactions by analyzing activity patterns across multiple accounts.


Graph database advantages

1. Flexibility: Graph databases can easily adapt to new data models and schemas due to their high level of flexibility.

2. Data integration: Graph databases can be used to combine structured and unstructured data from various sources. This can make drawing conclusions from various data sources simpler.

UNIT-III

Python for Data Science



Data Science has become one of the fastest-growing fields in recent years, helping organizations make informed decisions, solve problems, and understand human behavior. As the volume of data grows, so does the demand for skilled data scientists. The most common languages used for data science are Python and R.

Q. Python Libraries

A Python library is a collection of modules and packages that offer a wide range of functionalities. They contain pre-written code, classes, functions, and routines that can be used to develop applications, automate tasks, manipulate data, perform mathematical computations, and more. Some of the popular libraries offered by Python for supporting different data science activities are:


1. NumPy
NumPy is a scientific computing package for producing and computing multidimensional arrays and matrices, Fourier transformations, statistics, linear algebra, and more. NumPy's tools allow you to manipulate and compute large data sets efficiently and at a high level.

Key Features:

● N-dimensional array objects


● Broadcasting functions
● Linear algebra, Fourier transforms, and random number capabilities

2. Pandas

Pandas is one of the best libraries for Python; it is a free software library for data analysis and data handling. In short, Pandas is perfect for quick and easy data manipulation, data aggregation, reading and writing data, and data visualization (a short example follows the feature list below).

Key Features:

● DataFrame manipulation
● Grouping, joining, and merging datasets
● Time series data handling
● Data cleaning and wrangling
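A minimal sketch of typical Pandas usage; the column names and values here are made up purely for illustration:

import pandas as pd

# Hypothetical sales data used only for illustration
df = pd.DataFrame({"city": ["Guntur", "Guntur", "Vijayawada"],
                   "sales": [100, 150, 200]})

print(df.groupby("city")["sales"].sum())   # aggregate sales per city
print(df.describe())                       # quick summary statistics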

3. Dask

Dask is an open-source Python library designed to scale up computations for handling large datasets. It provides dynamic parallelism, enabling computations to be distributed across multiple cores or machines. This is where Dask, a parallel computing library in Python, shines: it provides scalable solutions for big data processing.

Key Features:

● Scalable parallel collections (DataFrame, Array)

● Works with Pandas and NumPy for distributed processing
● Built for multi-core machines and cloud computing
4. Vaex:
te
Vaex is a Python library designed for fast and efficient data manipulation, especially when dealing
da

with massive datasets. Unlike traditional libraries like pandas, Vaex focuses on out-of-core data
processing, allowing users to handle billions of rows of data with minimal memory consumption.
Key Features:

● Handles billions of rows with minimal memory


● Lazy loading for fast computations

● Built-in visualization tools



5. Scrapy:

Scrapy is a web scraping and extraction tool for data mining. Its use extends beyond just scraping websites; you can also use it as a web crawler and to extract data from APIs, HTML, and XML sources. Scraped data can be turned into JSON, CSV, or XML files and stored on a local disk or transferred through the file transfer protocol (FTP).

6. Seaborn:

Seaborn is a data visualization library built on top of Matplotlib. It simplifies the creation of
informative and aesthetically pleasing statistical graphics. Seaborn is particularly helpful for exploring
relationships in data and presenting complex data in a visually appealing manner.

7. TensorFlow:
74 www.anuupdates.org

TensorFlow is another popular library for deep learning and machine learning. Developed by Google, it provides a comprehensive ecosystem for building and deploying machine learning models. TensorFlow's computational graph allows for distributed training and deployment on various platforms, making it suitable for production-level applications.

8. Statsmodels:

Statsmodels is a library focused on statistical modeling and hypothesis testing. It provides tools for estimating and interpreting models for various statistical methods, including linear regression, time series analysis, and more.

9. SciPy:

SciPy is a scientific computing package with high-level algorithms for optimisation, integration,
differential equations, eigenvectors, algebra, and statistics. It enhances the usage of NumPy-like

arrays by using other matrix data structures as its main objects for data. This gives you an even wider range of ways to analyse and compute data.

Features:

● Collection of algorithms and functions built on the NumPy extension of Python
● High-level commands for data manipulation and visualization
● Multidimensional image processing with the SciPy ndimage submodule
● Includes built-in functions for solving differential equations

Applications:

● Multidimensional image operations
● Solving differential equations and the Fourier transform



Q. Python Integrated Development Environments (IDE) for Data Science

An Integrated Development Environment (IDE) is a software application that provides various tools
and features for writing, editing, and debugging code in a programming language. IDEs are designed
to be a one-stop shop solution for software development and generally consist of a code editor,
compiler or interpreter, and a debugger.

An IDE is a multifaceted software suite that combines a wide range of tools within a singular
interface It caters to the diverse needs of developers and data scientists and the programming
languages they support. Some of the main features that usually come with an IDE include

● Code Editor: A code editor is a text editor with features that help in writing and editing
source code. These features include syntax highlighting, code completion, and error
detection

● Compiler/Interpreter: IDEs often include tools that convert source code written in a
programming language into a form that can be executed by a computer.

● Debugger: A debugger is a tool for identifying and fixing errors or bugs in source code. It allows the programmer to run the code step-by-step, inspect variables, and control the execution flow.

● Build Automation Tools: Build automation tools are utilities that automate the process of
compiling code, running tests, and packaging the software into a distributable format

● Version Control Integration: IDEs offer support for version control systems like Git, enabling
programmers to manage changes to their source code over time and maintain version
histories.
Importance of IDEs for Python Development

1. Syntax highlighting - Different parts of the code are highlighted with different colors to make it
easier to read and understand.
2. Auto-completion - The IDEs can automatically suggest code snippets and complete statements based on what you have typed so far.

3. Debugging - The IDEs include tools for setting breakpoints, stepping through code, and inspecting variables, which can help you detect and fix bugs in your code.

4. Collaboration - IDEs can be integrated with version control systems like Git, allowing you to track
code changes and collaborate with other developers.

5. Project management - IDEs can help you manage your projects by allowing you to organize your
code into different files and directories.

6. Language support - IDEs are typically designed to support a specific programming language or group of languages. This means they can provide language-specific features and integrations that can make developing software in that language easier.

7. Community - Many IDEs have a large and active community of users, which can be a valuable
resource for getting help and learning new techniques

Top IDEs for Python

Jupyter Notebook

a) Jupyter Notebook is the most commonly used and popular Python IDE used by data scientists. It is

g
a web-based computation environment to create Jupyter notebooks, which are documents that

or
contain code, equations, visualizations, and narrative text.

s.
te
b) Jupyter notebooks are a useful tool for data scientists because they allow them to combine code
da
visualization, and narrative in a single document, which makes it easy to share your work and
reproduce your results. It also provides support for markdown language and equations. c) The
Jupyter notebook can support almost all popular programming languages used in data science, such
p

as Python, R, Scala, Julia, etc.


n uu

Spyder
.a
w

1) Spyder is an open-source Python IDE created and developed by Pierre Raybaut in 2009. The name
w

Spyder stands for Scientific Python Development Environment.


w

2) It is designed specifically for scientific and data analysis applications and includes various toob and
features that make it well-suited for these tasks. Some of the features of Spyder include-code editor,
interactive console, variable explorer, debugging, visualization, etc.

Sublime Text

a) Sublime Text is a proprietary Python IDE known for its speed, flexibility, and powerful features,
making it a popular choice for a wide range of programming tasks.

b) Sublime text features include a customizable interface, syntax highlighting, auto-suggest, multiple
selections, various plugins, etc.

Atom

a) Atom is a popular and powerful text and source code editor among developers It is a free, open-
source editor that is available for Windows, macOS, and Linux.

b) While Atom is not specifically designed as an Integrated Development Environment (IDE) for
Python, it can be used as one with the help of plugins and packages.

Geany

a) Geany is a free text editor that supports Python and contains some IDE features. It was developed
by Enrico Tröger in C and C++.

b) A few of the features of Geany include - Symbol lists, Auto-completion, Syntax highlighting. Code
navigation, Multiple document support, etc.

Q. Arrays and Vectorized Computation

NumPy stands for Numerical Python. It is a Python library used for working with arrays. In Python, we use lists for arrays, but they are slow to process. The NumPy array is a powerful N-dimensional array object used in linear algebra, Fourier transforms, and random number capabilities. It provides an array object that is much faster than traditional Python lists.

Types of Array:

1. One Dimensional Array
2. Two-dimensional arrays
3. Multi-Dimensional Array

One Dimensional Array:

A one-dimensional array is a type of linear array.

Ex:
# importing numpy module
import numpy as np

# creating a list
list = [1, 2, 3, 4]

# creating a numpy array
sample_array = np.array(list)

print("List in python:", list)
print("Numpy Array in python:", sample_array)

Output:

List in python: [1, 2, 3, 4]
Numpy Array in python: [1 2 3 4]

Two-Dimensional Array:

In this type of array, elements are stored in rows and columns which represent a matrix.

Three-dimensional array: This type of array comprises 2-D matrices as its elements.

Vectorization:
w

Vectorization is a technique used to improve the performance of Python code by eliminating the use
of loops. This feature can significantly reduce the execution time of code

There are various operations that can be performed on vectors, such as the dot product of vectors
(also known as scalar product), which results in a single output, outer products that produce a
square matrix of dimension equal to the length of the vectors, and element-wise multiplication,
which multiplies the elements of the same index and preserves the dimension of the matrix.

Vectorization is the use of array operations from NumPy to perform computations on a dataset.

● Vectorization in NumPy is a method of performing operations on entire arrays without


explicit loops

● This approach leverages NumPy's underlying C implementation for faster and more efficient
computations
● Replacing iterative processes with vectorized functions, you can significantly optimize
performance in data analysis, machine learning, and scientific computing tasks.

For example:

import numpy as np

a1 = np.array([2, 4, 6, 8, 10])

number = 2

result = a1 * number

print(result)

Output: [4 6 8 10 12]

Vectorization is significant because it:

Improves Performance: Operations are faster due to pre-compiled C-based implementations.

Simplifies Code: Eliminates explicit loops, making code cleaner and easier to read.

Supports Scalability: Efficiently handles large datasets.

Different types of operations that can be vectorized in NumPy:



Ex1. Adding two arrays together with vectorization


w

import numpy as np

a1 = np.array([1, 2, 3])

a2 = np.array([4, 5, 6])

result = a1+a2

print(result)

Output: [5 7 9]

Ex2. Element-Wise Multiplication with array



import numpy as np

a1 = np.array([1,2,3,4])

result = a1 * 2

print(result)

Output: [2 4 6 8]

Advantages:

● Vectorization can drastically increase the speed of execution versus looping over arrays
● Vectorization keeps code simpler and more readable so it's easier to understand and build on
later
● Much of the math of data science is similar to vectorized implementations, making it easier
to translate into vectorized code

Q. The NumPy ndarray

NumPy ndarray:
da

NumPy is used to work with arrays. The array object in NumPy is called ndarray. Anndarray is a multi-
dimensional array of items of the same type and size. The number of dimensions and items
p

contained in the array is defined with a tuple of N non-negative integers that specify cach
uu

dimension's size
n
.a

● The most important object defined in NumPy is an N-dimensional array type called ndarray.
w

It describes a collection of items of the same type, which can be accessed using a zero-based
index.
w
w

● Each item in anndarray takes the same size of block in the memory and is represented by a
data-type object called dtype.

● Any item extracted from anndarray object (by slicing) is represented by a Python object of
one of the array scalar types
A multidimensional array looks something like this:

In Numpy, the number of dimensions of the array is given by Rank. Thus, in the above example, the
ranks of the array of ID, 2D, and 3D arrays are 1, 2 and 3 respectively.

Ex: import numpy as np

# 1D array
arr1 = np.array([1, 2, 3, 4, 5])

# 2D array
arr2 = np.array([[1, 2, 3], [4, 5, 6]])

# 3D array
arr3 = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

print(arr1)
print(arr2)
print(arr3)

Output:

[1 2 3 4 5]
[[1 2 3]
 [4 5 6]]
[[[1 2]
  [3 4]]

 [[5 6]
  [7 8]]]

Attributes of ndarray:

Understanding the attributes of an ndarray is essential to working with NumPy effectively. Here are the key attributes (see the short example after this list):

● ndarray.shape: Returns a tuple representing the shape (dimensions) of the array.

● ndarray.ndim: Returns the number of dimensions (axes) of the array

● ndarray.size: Returns the total number of elements in the array.

● ndarray.dtype: Provides the data type of the array elements.

● ndarray.itemsize: Returns the size (in bytes) of each element.
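A small sketch showing these attributes on a 2-D array; the printed values follow from the array used:

import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

print(arr.shape)     # (2, 3)
print(arr.ndim)      # 2
print(arr.size)      # 6
print(arr.dtype)     # int64 (platform dependent; may be int32 on Windows)
print(arr.itemsize)  # 8 bytes per element for int64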

Q. Creating ndarrays
da
An instance of the ndarray class can be constructed by different array creation routines. The basic ndarray is created using the array() function in NumPy.

The numpy.array() function creates an ndarray from any object exposing the array interface, or from a method that returns an array.

Syntax: numpy.array(object, dtype=None, copy=True, order=None, subok=False, ndmin=0)

The above constructor takes the following parameters -


w

Parameter & Description

1. Object: Any object exposing the array interface, any method that returns an array, or any (nested) sequence
2. Dtype Desired data type of array, optional
3. Copy Optional. By default (true), the object is copied
4. Order: C (row major) or F (column major) or A (any) (default)
5. Subok: By default, returned array forced to be a base class array. If true, sub-classes passed
through
6. Ndmin: Specifies minimum dimensions of resultant array
Example: Create a One-dimensional Array

import numpy as np

a = np.array([1, 2, 3])

print(a)

Output: [1, 2, 3]

Example: Create a Two-dimensional Array

import numpy as np

a = np.array([[1, 2], [3, 4]])

print(a)

Output:

[[1 2]

[3 4]]

g
Example: Create a Multi-dimensional Array

or
import numpy as np

s.
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]]) te
print(arr)
da
Output:

[[[1 2 3]
p
uu

[4 5 6]]

[[1 2 3]
n
.a

[4 5 6]]]
w
w

Advantages of Ndarrays
w

● One of the main advantages of using Numpy Ndarrays is that they take less memory space
and provide better runtime speed when compared with similar data structures in
python(lists and tuples).

● Numpy Ndarrays support some specific scientific functions such as linear algebra. They help
us in solving linear equations.

● Ndarrays support vectorized operations, like elementwise addition and multiplication,


computing Kronecker product, etc. Python lists fail to support these features.

Q. Data Types for ndarrays



NumPy, the fundamental library for scientific computing in Python, supports a wide range of data
types to accommodate various numerical and non-numerical data. These data types are essential for
efficient storage, manipulation, and analysis of data in scientific and numerical computing
applications. The common data types supported by NumPy can be broadly classified into the
following categories:

● Numeric Data Types:


NumPy provides a variety of numeric data types, which are optimized for different use cases and
memory requirements.

These include:

1. Integer Data Types:

● int8: 8-bit signed integer (-128 to 127)


● int16: 16-bit signed integer (-32,768 to 32,767)

g
● int32: 32-bit signed integer (-2,147,483,648 to 2,147,483,647)

or
● int64: 64-bit signed integer (-9,223,372,036,854,775,808 to 9,223,372,036,854,775,807)

2. Unsigned Integer Data Types:


s.
te
● uint8: 8-bit unsigned integer (0 to 255)
da

● uint16: 16-bit unsigned integer (0 to 65,535)


● uint32: 32-bit unsigned integer (0 to 4,294,967,295)
p

● uint64: 64-bit unsigned integer (0 to 18,446,744,073,709,551,615)


n uu

3.Floating-Point Data Types:


.a

● float16: 16-bit floating-point number (half precision)


w

● float32: 32-bit floating-point number (single precision)


● float64: 64-bit floating-point number (double precision)
w
w

4.Complex Data Types:

● complex64: 64-bit complex number (two 32-bit floating-point numbers)


● complex128: 128-bit complex number (two 64-bit floating-point numbers)
● Non-Numeric Data Types:
NumPy also supports non-numeric data types, which are useful for storing and manipulating textual
and categorical data. These include:

1.Boolean Data Type:

● bool: Stores True or False values


2.String Data Types:

● str Variable-length string


● bytes Variable-length bytes
3.Object Data Type: object: Stores Python objects of any type

Ex:

import numpy as np

#Create a NumPy array with different data types

arr = np.array([1, 2.5, 'hello', True], dtype=object)

print(arr)

print(arr.dtype)

Output:

[1 2.5 'hello' True]

object

g
● NumPy Datetime Data Type

or
In NumPy, we have a standard, and fast datatype system called the datetime64. We cannot use
datetime, as it's already taken by the standard Python library.

s.
te
Dates are represented using the current Gregorian Calendar and are infinitely extended in the past
da
future, just like the Python date class.

When we call the datetime64 function, it returns a new date with the specified format in the
p

parameter
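A small illustration of datetime64; the dates below are chosen arbitrarily for the example:

import numpy as np

d = np.datetime64("2024-01-15")          # a calendar date
print(d + np.timedelta64(10, "D"))       # 2024-01-25: add 10 days
print(np.datetime64("2024-02-01") - d)   # 17 days: difference as a timedelta64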

Q. Arithmetic with NumPy Arrays



NumPy Arithmetic Operations:


w

NumPy makes performing arithmetic operations on arrays simple and easy. With NumPy, you can add, subtract, multiply, and divide entire arrays element-wise, meaning that each element in one array is operated on by the corresponding element in another array.

When performing arithmetic operations with arrays of different shapes, NumPy uses a feature called
broadcasting. It automatically adjusts the shapes of the arrays so that the operation can be
performed extending the smaller array across the larger one as needed.

Basic NumPy Arithmetic Operations:

NumPy provides several arithmetic operations that are performed element-wise on arrays. These
include addition, subtraction, multiplication, division, and power

Addition: Addition is the most basic operation in mathematics. We can add NumPy arrays element-
wise using the "+" operator.

Ex: import numpy as np

array1 = np.array([1, 2, 3, 4])

array2 = np.array([5,6,7,8])

result_addition = array1 + array2

print("Addition Result", result_addition)

Output: Addition Result: [6 8 10 12]

Subtraction: Subtraction works similarly to addition. It subtracts each element in the first array from

g
the corresponding element in the second array.

or
Ex: import numpy as np

array1 = np.array([1,2,3,4])
array2 = np.array([5,6,7,8])
da

result_subtraction = array1 - array2
p

print("Subtraction Result", result subtraction)


uu

Output: Subtraction Result: [-4 -4 -4 -4]



Multiplication: Similar to multiplication operation on python integers we can multiply arrays


element-wise using the "*" operator.
w

Ex: import numpy as np



array1 = np.array([1,2,3,4])

array2 = np.array([5,6,7,8])

result_multiplication= array1 * array2

print("Multiplication Result:", result_multiplication)

Output: Multiplication Result: [5 12 21 32]

Division: To divide one array by another array (element-wise), we can use the "/" operator

Ex: import numpy as np

array1 = np.array([1,2,3,4])

array2 = np.array([5,6,7,8])

result_division = array1 / array2

print("Division Result:", result_division)

Output: Division Result: [0.2 0.33333333 0.42857143 0.5]

NumPy Modulo Operation: The modulo operation is performed using the % operator in NumPy. When applied to arrays, it operates element-wise, meaning each element of the first array is divided by the corresponding element in the second array, and the remainder is calculated.
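A short sketch of the modulo operation; the values are chosen only for illustration:

import numpy as np

a = np.array([10, 20, 30, 40])
b = np.array([3, 7, 9, 6])

print(np.mod(a, b))   # [1 6 3 4]
print(a % b)          # same result using the % operator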

Q. Basic Indexing and Slicing

Indexing and slicing are fundamental concepts in programming and data manipulation that allow
access and manipulation of individual elements or subsets of elements in a sequence, such as strings,

lists, or arrays.

NumPy Indexing:

NumPy Indexing is used to access or modify elements in an array. Three types of indexing methods
are available field access, basic slicing and advanced indexing.

There are three types of Indexing methods that are available in Numpy library and these are given
below:

● Field access - This is direct field access using the index of the value, for example, [0] index is

for 1st value. [1] index is for the 2nd value, and so on.

● Basic Slicing - Basic slicing is simply an extension of Python's basic concept of slicing to n dimensions. In order to construct a Python slice object you just need to pass the start, stop, and step parameters to the built-in slice function. Further, this slice object is passed to the array to extract the part of the array.

● Advanced Indexing
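The notes do not expand on advanced indexing; as a brief illustrative sketch, advanced (or "fancy") indexing selects several elements at once using an integer array or a boolean array instead of a single index:

import numpy as np

arr = np.array([10, 21, 33, 54, 45, 67])
idx = np.array([0, 2, 5])     # integer-array (fancy) indexing
print(arr[idx])               # [10 33 67]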

NumPy 1-D array indexing: You need to pass the index of that element as shown below, to access
the 1-D array.

import numpy as np

arr1 = np.array([10, 21, 33, 54, 45, 67])

print(arr1[0])

Output:10

NumPy 2-D array indexing: To access the 2-D array, you need to use commas to separate the integers
which represent the dimension and the index of the element. The first integer represents the row
and the other represents column

import numpy as np

z = np.array([[61,27,13,14,54], [46,37,38,19,10]])

print("2nd element on 1st row: “z[0, 1])

Output: 2nd element on 1st Row: 27

g
Slicing in NumPy:

or
NumPy Slicing is an extension of Python's basic concept of slicing to n dimensions. A Python object is

s.
constructed by giving start, stop, and step parameters to the built-in slice function. This slice object is
passed to the array to extract a part of array.
te
● Slicing in the array is performed in the same way as it is performed in the python list.
da

● If an array has 100 elements and you want to pick only a section of the values, you can perform slicing and extract the required set of values from the complete ndarray.
p

● Learn Python List Slicing and you can apply the same on Numpyndarrays
n uu

Slicing 1-D arrays: When slicing an array, you pass the starting index and the ending index, which are separated by a full colon. You use the square brackets as shown below.
.a

arr[start:end]
w
w

arr is the variable name of the array.


w

import numpy as np

x = np.array([10, 20, 30, 40, 50, 60, 70])

print(x[1:5])

Output: [20 30 40 50 ]

The starting index is 1 and the ending index is 5. Therefore, you will slice from the second element
since indexing in the array starts from 0 up to the fourth element.

Slicing 2-D arrays: To slice a 2-D array in NumPy, you have to specify the row index and the column index, which are separated using a comma as shown below.

arr[1, 1:4]

The part before the comma represents the row while the part after the comma represents the
column.

Example: Slice elements from index 1 to 4 from the second row.

import numpy as np

y = np.array([[10, 5, 3, 12, 5], [6, 17, 18, 29, 10]])

print(y[1, 1:4])

Output: [17 18 29]

Q. Boolean Indexing

s.
In NumPy, boolean indexing allows us to filter elements from an array based on a specific condition,
te
We use boolean masks to specify the condition
p da

Boolean Masks in NumPy: Boolean mask is a numpy array containing truth values (True/False) that
uu

correspond to each element in the array. Suppose we have an array named array1.
n

array1 = np.array([12, 24, 16, 21, 32, 29, 7, 15])


.a

Now let's create a mask that selects all elements of array1 that are greater than 20
w
w
w

boolean_mask = array1 > 20

Here, array1 > 20 creates a boolean mask that evaluates to True for elements that are greater than 20.
and False for elements that are less than or equal to 20. The resulting mask is an array stored in the
boolean mask variable as:

[False, True, False, True, True, True, False, False]

Boolean Indexing allows us to create a filtered subset of an array by passing a boolean mask as an
index. The boolean mask selects only those elements in the array that have a True value at the
corresponding index position. Let's create a boolean indexing of the boolean mask in the above
example.

array1[boolean_mask]

This results in [24, 21, 32, 29]

1D Boolean Indexing in NumPy

Ex: We'll use the boolean indexing to select only the odd numbers from an array.

import numpy as np

#create an array of numbers

array1 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

#create a boolean mask

boolean_mask= array1%2 != 0

g
#boolean indexing to filter the odd numbers

or
result = array1[boolean_mask]

s.
print(result)
Output: [1 3 5 7 9]
da

2D Boolean Indexing in NumPy: Boolean indexing can also be applied to multi-dimensional arrays in
p

NumPy.
uu

Ex:
n

import numpy as np
.a

#create a 2D array
w

array1 = np.array([[1, 7, 9],


w

[14, 19, 21],


w

[25, 29, 35]])

#create a boolean mask based on the condition

#that elements are greater than 9

boolean_mask = array1 > 9

#select only the elements that satisfy the condition

result = array1[boolean_mask]

print(result)

Output: [14 19 21 25 29 35]



In this example, we have applied boolean indexing to the 2D array named array1. We then created
boolean mask based on the condition that elements are greater than 9. The resulting mask is,

[[False, False, False],

[True, True, True].

[True, True, True]]

Q. Transposing Arrays and Swapping Axes.

Swapping Axes of Arrays in NumPy: Swapping axes in NumPy allows you to change the order of
dimensions in an array. You can swap axes of an array in NumPy using the swapaxes() function and
the transpose() function.

In NumPy, an array can have multiple dimensions, and each dimension is referred as an axis For
example, a 2D array (matrix) has two axes, the rows and the columns. In a 3D array (tensor), there ar

s.
three axes: depth, height, and width. te
● Axis 0 refers to the first dimension (often rows).
da
● Axis 1 refers to the second dimension (often columns).
● Axis 2 refers to the third dimension, and so on
p
uu

Using swapaxes() Function: The np.swapaxes() function in NumPy allows you to swap two specified
axes of an array. This function is particularly useful when you need to reorganize the structure of an
n

array, such as switching rows and columns in a 2D array or reordering the dimensions in a multi
.a

dimensional array.
w
w

This function does not create a copy of the data but rather returns a new view of the array with the
specified axes swapped. It does not involve duplicating the array's data in memory.
w

Syntax: numpy.swapaxes(arr, axis1, axis2)



Where,

● arr is the input array.


● axis1 is the first axis to be swapped.
● axis2 is the second axis to be swapped.

Ex: In the following example, we are swapping the rows and columns in a 2D array using the

swapaxes() function in NumPy–

import numpy as np

#Creating a 2D array

arr = np.array([[1, 2, 3],

[4, 5, 6]])

g
#Swapping axes 0 and 1 (rows and columns)

or
swapped = np.swapaxes(arr, 0, 1)

s.
print("Original Array:") te
print(arr)
da
print("\nArray After Swapping Axes:")

print(swapped)
p
uu

Output: Original Array:

[[1 2 3]
n
.a

[4 5 6]]
w

Array After Swapping Axes:


w

[[1 4]
w

[2 5]

[3 6]]

Using the transpose() Function: We can also use the transpose() function to swap axes of arrays in
NumPy. Unlike the swapaxes() function, which swaps two specific axes, the transpose() function is
used to reorder all axes of an array according to a specified pattern.

Syntax:numpy.transpose(a, axes=None)

Where,

● a is the input array whose axes you want to reorder.


● axes is a tuple or list specifying the desired order of axes. If axes is None, it reverses the order
of the axes.
Return Value: The transpose() function returns the transposed array with the same data type as the
input array.

Ex: Following is the example of the numpytranspose() function in which transpose operation
switches the rows and columns of the 2D array-

import numpy as np

#Original 2D array

array_2d = np.array([[1, 2, 3], [4, 5, 6]])

#Transposing the 2D array

transposed_2d = np.transpose(array_2d)

print("Original array:\n", array_2d)

print("Transposed array:\n", transposed_2d)

Output:

g
or
Original array:

[[1 2 3]

[4 5 6]]
s.
te
Transposed array:
da

[[1 4]
p

[2 5]
uu

[3 6]]
n
.a

Q. Fast Element-Wise Array Functions


w
w

Universal Functions (referred to as "Ufuncs' hereon) in NumPy are highly efficient functions that
w

perform element-wise operations on arrays. They allow mathematical and logical operations to be
applied seamlessly across large datasets.

1) Arithmetic operations: Ufuncs enable the execution of element-wise arithmetic operations such
as addition, subtraction, multiplication, and division across entire arrays without needing explicit
loops.

Here's how to use Ufuncs for these common operations:



Ex:

import numpy as np

a = np.array ([10, 20, 30])

b = np.array ([1, 2, 3])

print("Addition:", np.add(a, b)) # Element-wise addition

print("Subtraction:", np.subtract(a, b)) # Element-wise subtraction

g
or
print("Multiplication:", np.multiply(a, b)) # Element-wise multiplication

s.
print("Division:", np.divide(a, b)) # Element-wise division
te
Output:
da
Addition: [11 22 33]

Subtraction: [9 1827]
p

Multiplication: [10 40 90]


uu

Division: [10. 10. 10.]


n
.a

2) Trigonometric functions
w

● NumPy's Ufuncs extend their efficiency to trigonometric operations, enabling seamless and
w

fast computation of trigonometric functions on arrays.


w

● These Ufuncs apply trigonometric operations like sine, cosine, and tangent element wise to
arrays, allowing us to perform complex mathematical transformations effortlessly.

● The trigonometric functions available in NumPy include numpy.sin(x),


numpy.cos(x), numpy.tan(x), numpy.arcsin(x), numpy.arccos(x), and numpy.arctan(x).
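A short sketch of these trigonometric Ufuncs; the angle values are chosen only for illustration:

import numpy as np

angles = np.array([0, np.pi / 2, np.pi])
print(np.sin(angles))   # approximately [0. 1. 0.] (pi is not represented exactly)
print(np.cos(angles))   # approximately [ 1. 0. -1.]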

3) Exponential and Logarithmic Functions

NumPy's Ufuncs offer powerful and efficient ways to perform exponential and logarithmic
calculations on arrays. These functions operate element-wise, allowing you to compute exponential
and logarithmic transformations swiftly across large datasets.

Here are some frequently used exponential and logarithmic functions in NumPy:
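Commonly used functions here include np.exp, np.log (natural logarithm), np.log2, np.log10, and np.sqrt. A quick sketch with made-up values:

import numpy as np

x = np.array([1.0, np.e, 10.0])
print(np.exp(np.array([0.0, 1.0])))   # [1.         2.71828183]
print(np.log(x))                      # natural log: [0. 1. 2.30258509]
print(np.log10(x))                    # base-10 log: [0.         0.43429448 1.        ]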

4) Logical Ufuncs:

1. np.logical_and, np.logical_or, np.logical_xor: Element-wise logical operations.

2. np.logical_not: Element-wise logical NOT operation.
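A small example of the logical Ufuncs; the boolean arrays are made up for illustration:

import numpy as np

a = np.array([True, True, False, False])
b = np.array([True, False, True, False])

print(np.logical_and(a, b))   # [ True False False False]
print(np.logical_or(a, b))    # [ True  True  True False]
print(np.logical_not(a))      # [False False  True  True]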

5) Rounding and Truncation Ufuncs:


1. np.round: Round elements to the nearest integer or specified number of decimals.
da

2. np.floor: Round elements down to the nearest integer


p

3. np.ceil: Round elements up to the nearest integer.


uu

4. np.trunc: Truncate decimal values, keeping only the integer part.


n
.a

6) Statistical Ufuncs:
w

1. np.mean, np.median: Compute the mean and median of array elements.


w

2. np.min, np.max: Find the minimum and maximum values present in an array.
w

3. np.std, np.var: Calculate the standard deviation and variance of elements.

7) Bitwise Ufuncs:

1. np.bitwise_and, np.bitwise_or, np.bitwise_xor: Element-wise bitwise operations.

2. np.invert: Element-wise bitwise NOT operation (bitwise inversion)

8)Comparison Functions:

In NumPy, comparison functions are integral to element-wise operations that allow us to evaluate relationships between arrays. These Ufuncs perform element-wise comparisons and return Boolean arrays, where each element represents the result of the comparison operation.

1. np.greater, np.greater_equal: Element-wise comparison for greater than and greater than or equal to.

2. np.less, np.less_equal: Element-wise comparison for less than and less than or equal to.

3. np.equal, np.not_equal: Element-wise comparison for equality and inequality.

Ex:

import numpy as np

#Define two arrays

array1 = np.array([10, 20, 30, 40])

array2 = пр.array([15, 20, 25, 35])

g
or
#Perform comparison operations

print("Equal:", np.equal(array1, array2))

print("Less Than:", np.less(array1, array2))


s.
te
print("Greater Than", np. greater(array1, array2))
p da

Output:
n uu

Equal: [False True False False]


.a

Less Than: [True False False False]


w

Greater Than: [False False True True)


w
w

Q. Mathematical and Statistical Methods

Mathematical Functions: NumPy provides a wide range of mathematical functions that are essential
for performing numerical operations on arrays. These functions include basic arithmetic,
trigonometric, exponential, logarithmic, and statistical operations, among others.

we will explore the most commonly used mathematical functions in NumPy, with examples to help
you understand their application.

Arithmetic operations: Ufuncs enable the execution of element-wise arithmetic operations such as
addition, subtraction, multiplication, and division across entire arrays without needing explicit loops.

Here's how to use Ufunes for these common operations:



Ex: import numpy as np

a = np.array([10, 20, 30])

b = np.array([1, 2, 3])

print("Addition:", np.add(a, b)) # Element-wise addition

g
print("Subtraction:", np subtract(a, b)) # Element-wise subtraction

or
print("Multiplication:", np.multiply(a, b)) # Element-wise multiplication

s.
print("Division:", np. divide(a, b)) # Element-wise division
te
Output:
da
Addition: [11 22 33]

Subtraction: [9 18 27]
p
uu

Multiplication: [10 40 90]

Division: [10. 10. 10.]


n
.a

Trigonometric functions
w
w

● NumPy's Ufuncs extend their efficiency to trigonometric operations, enabling seamless and
fau computation of trigonometric functions on arrays.
w

● These Ufuncs apply trigonometric operations like sine, cosine, and tangent element wise
arrays, allowing us to perform complex mathematical transformations effortlessly.
● The trigonometric functions available in NumPy include numpy.sin(x), numpy.cos(x)
numpy.tan(x), numpyarcsin(x), numpyarccos(x), and numpy arctan(x).

Statistical Methods:

Mean: The mean is a measure of central tendency. It is the total of all values divided by how many
values there are. We use the mean() function to calculate the mean.

Syntax: np.mean(data)

Ex:

# Sample data

data = np.array([1, 2, 3, 4, 5])

#Calculate the mean

mean = np.mean(data)

#Print the result

print(f"Mean: {mean}")

Output: Mean: 3.0

Average: The average is often used interchangeably with the mean. It is the total of all values divided
by how many values there are. We use average() function to calculate the average. This function s
useful because it allows for the inclusion of weights to compute a weighted average.

g
Syntax: np.average(data), np.average(data, weights weights)

or
Ex:

#Sample data
s.
te
data = np.array([1, 2, 3, 4, 5])
da

weights = np.array([1, 2, 3, 4, 5])


p

#Calculate the average


uu

average = np.average(data)

#Calculate the weighted average


n
.a

weighted_average = np.average(data, weights=weights)


w

#Print the results


w

print(f"Average: {average}")
w

print(f"Weighted Average: {weighted_average}")

Output:

Average: 3.0

Weighted Average: 3.6666666666666665

Median: The median is the middle value in an ordered dataset. The median is the middle value when
the dataset has an odd number of values. The median is the average of the two middle values when
the dataset has an even number of values. We use the median() function to calculate the median.

Syntax:np.median(data)

Ex: #Sample data



data = np.array([1, 2, 3, 4, 5])

#Calculate the median

median = np.median(data)

#Print the result

print("Median: (median)")

Output: Median: 3.0

Variance: Variance measures how spread out the numbers are from the mean. It shows how much
the values in a dataset differ from the average. A higher variance means more spread. We use the
var() function to calculate the variance.

Syntax: np.var(data)

g
Ex:

or
# Sample data

data = np.array([1, 2, 3, 4, 5])


s.
te
#Calculate the variance
da

variance = np.var(data)
p

#Print the result


uu

print(f”Variance: {variance}")

Output: Variance: 2.0


n
.a
w

Minimum and Maximum: The minimum and maximum functions help identify the smallest and
largest values in a dataset, respectively. We use the min() and max() functions to calculate these
w

values.
w

Syntax: np.min(data), np.max(data)

Ex:

#Sample data

data = np.array([1, 2, 3, 4, 5])

#Calculate the minimum and maximum

minimum = np.min(data)

maximum = np.max(data)

#Print the results

print(f"Minimum: {minimum}")

print(f"Maximum: {maximum}")

Output:

Minimum: 1

Maximum: 5

Q. Sorting

Sorting is basically a process where elements are arranged in an ordered sequence

● The ordered sequence basically is any sequence that has an order in corresponding to the
elements. It can be numeric or alphabetical, ascending, or descending, anything

● There are many functions for performing sorting, available in the NumPy library. We have

g
various sorting algorithms like quicksort, merge sort and heapsort, and all these are

or
implemented using the numpy.sort() function.

s.
te
p da
uu

Numpysort() function: You can use the numpyndarray function sort() to sort a numpy array. It sorts
n

the array in-place. You can also use the global numpysort() function which returns a copy of the
.a

sorted array.

syntax:numpy.sort(a, axis, kind=None, order None)


w
w

Parameters:
w

Now we will discuss the parameters of this function:

● a: This parameter will indicate the input array to be sorted


● axis: This parameter is used to indicate the axis along which the array needs to be sorted. If
the value of this parameter is None, then the array is flattened before sorting. The default
value of this parameter is -1, which sorts along the last axis.
● Kind: This parameter will specify the sorting algorithm. The kind argument can take several
values, including,
1. quicksort (default): This is a fast algorithm that works well for most cases i.e. small and
medium-sized arrays with random or uniformly distributed elements.
2. mergesort: This is a stable, recursive algorithm that works well for larger arrays with
repeated elements.
3. heapsort: This is a slower, but guaranteed O(n log n) sorting algorithm that works well for
smaller arrays with random or uniformly distributed elements

● Order: This parameter will represent the fields according to which the array is to be sorted in
the case if the array contains the fields.
● Returned Values: This function will return the sorted array of the same type and having the
same shape as the input array.
Ex:

import numpy as np

#sorting along the first axis

a = np.array([[17, 15], [10, 25]])

arr1 = np.sort(a, axis = 0)

print ("Sorting Along first axis: \n")

print(arr1)

#sorting along the last axis

g
or
b = np.array([[1, 15], [20, 18]])

arr2 = np.sort(b, axis-1)

s.
print ("\nSorting along last axis: \n")
te
print(arr2)
da

c = np.array([[12, 15], [10, 1]])


p

arr3 = np.sort(c, axis = None)


uu

print ("\nSorting Along none axis: \n")


n

print(arr3)
.a

Output:
w

Sorting Along first axis :


w
w

[[10 15]

[17 25]]

Sorting along last axis :

[[ 1 15]

[18 20]]

Sorting Along axis :

[ 1 10 12 15]

Ex2: Sort a String Array

import numpy as np

array = np.array(['Apple', 'apple', 'Ball', 'Cat'])

#sort a string array based on their ASCII values.

array2 = np.sort(array)

print(array2)

Output: ['Apple' 'Ball' 'Cat' 'apple']

Q Unique and Other Set Logic

g
This numpy set operation helps us find unique values from the set of array elements in Python. The

or
numpyunique() function skips all the duplicate values and represents only the unique elements from
the Array

Syntax: np.unique(Array)
s.
te
Example: In this example, we have used unique() function to select and display the unique elements
da
from the set of array. Thus, it skips the duplicate value 30 and selects it only once. import numpy as
np
p

arr = np.array([30,60,90,30,100,60,30])
uu

data = np.unique(arr)
n

print(data)
.a

Output: [30 60 90 100]


w
w

Set Operations:
w

A set is a collection of unique data. That is, elements of a set cannot be repeated NumPy set
operations perform mathematical set operations on arrays like union, intersection, difference, and
symmetric difference.

Set Union Operation: The union of two sets A and B include all the elements of set A and B. In
NumPy, we use the np.unionld() function to perform the set union operation in an array.

Ex: import numpy as np

A = np.array([1, 3, 5])

B = np.array([0, 2, 3])

#union of two arrays

result = np.union1d(A, B)

g
print(result)

or
Output: [0 1 2 3 5]

s.
In this example, we have used the np.unionld(A, B) function to compute the union of two arrays:
te
A and B.
p da

Intersection Operation: The intersection of two sets A and B include the common elements between
uu

set A and B. We use the np.intersectId() function to perform the set intersection operation in an
array.
n
.a
w
w
w

Ex: import numpy as np

A = np.array([1, 3, 5])

B = np.array([0, 2, 3])

#intersection of two arrays

result = np. intersect1d(A, B)

print(result)

Output: [3]

Difference Operation: The difference between two sets A and B include elements of set A that are
not present on set B. We use the np.setdiffld() function to perform the difference between two
arrays.

Ex: import numpy as np

A = np.array([1, 3, 5])

B = np.array([0, 2, 3])

#difference of two arrays

result = np.setdiff1d(A, B)

g
or
print(result)

s.
Output: [1 5]
te
Symmetric Difference Operation: The symmetric difference between two sets A and B includes all
elements of A and B without the common elements. In NumPy, we use the np.setxorld() function to
da
perform symmetric differences between two arrays.
p
nuu
.a

Ex: import numpy as np


w

A = np.array([1, 3, 5])
w
w

B = np.array([0, 2, 3])

#symmetric difference of two arrays

result = np.setxor1d(A, B)

print(result)

Output: [0 1 2 5]

UNIT-IV

Introduction to pandas Data Structures

Pandas

Pandas is a Python library used for data manipulation and analysis. It provides data structures like DataFrames and Series, which are built on top of NumPy, making it efficient for working with structured data. DataFrames are similar to tables in SQL or spreadsheets, allowing for easy organization and manipulation of data, while Series are one-dimensional arrays with labeled indices.

Pandas is widely used for tasks such as data cleaning, transformation, and analysis. It offers functionalities for handling missing data, reshaping data, merging and joining datasets, and calculating summary statistics. Its integration with other Python libraries like NumPy and Matplotlib makes it a powerful tool for data science workflows.

g
or
Q. Series

s.
Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string.
float, etc.). It's similar to a one-dimensional array or a list in Python, but with additional
te
functionalities Each element in a Pandas Series has a label associated with in, called an indes. This
da
index allows for fast and efficient data access and manipulation. Pandas Series can be created from
various data structures like lists, dictionaries, NumPy arrays, etc
p

It consists of two main components


uu

Data: The actual values stored in the Series.


n

Index: The labels or indices that correspond to each data value.


.a
w
w
w

A Series is similar to a one-dimensional ndarray (NumPy array) but with labels, which are also known
as indices. These labels can be used to access the data within the Series. By default, the index values
are integers starting from 0 to the length of the Series minus one, but you can also manually set the
index labels.

Syntax:Pandas.series(data, index, dtype, copy)

Data: The data to be stored in the Series. It can be a list, ndarray, dictionary, scalar value (like an
integer or string), etc.

Index: Optional It allows you to specify the index labels for the Series. If not provided, default
integer index labels (0, 1, 2, ...) will be used.

dtype-Optional. The data type of the Series. If not specified, it will be inferred from the data.

Copy: Optional. If True, it makes a copy of the data. Default is False.

Create pandas Series:

pandas Series can be created in multiple ways, From array, list, dict, and from existing DataFrame.

Create Series using array:

Before creating & Series, first, we have to import the NumPy module and use array() function in the
program. If the data is ndarrays, then the paints should be in the same length, if the index is not
passed the default value is range(n).

Ex:

import pandas as pd

g
or
import numpy as np

s.
data = np.array([‘python’, ‘php’,’ Java’]) te
series = pd.Series(data)
da
print (series)

Output:
p
uu

0 python
1 php
2 java
n
.a

dtype: object
w

Create Series from List:


w

If you have a Python list it can be easily converted into Pandas Scries by passing the list object a an
w

argument to series() function. In case if you wanted to convert the Pandas Series to alist use
Series.tolist().

Ex:

# Create Pandas series from list

data =[‘python’, ‘php’’, java’]

series = pd.Series(data, index=[‘r1’,r2’,’r3’])

print(series)

Output:

r1 python

r2 php

r3 java

dtype: object

Create a Series using a Dictionary:

A Python dictionary can be utilized to create a Pandas Series. In this process, the dictionary keys are
used as the index labels, and the dictionary values are used as the data for the Series. If you want to
convert a Pandas Series to a dictionary, use the Series to_dict() method.

Ex:

#Create a Dict from a input

data = {‘Courses’: "pandas", ‘Fees’: 20000, ‘Duration': "30days"}

g
series=pd .Series(data)

or
print (series)

Output:
s.
te
Courses pandas
da

Fee 20000
p

Duration 30days
uu

dtype: object

Pandas series method:


n
.a

• df.count(): Counts the number of non-null values in each column.


w

• Series() : A pandas series can be created with the Series() constructor method.This
constructor method accepts a variety of inputs.
w


w

size() :Returns the no.of elements in the underlying data.

• name(): Method allows to give a name to a series object, i.e, to the column.

• is_unique(): Returns True if the values in the series are unique (contain no duplicates).

• idxmax(): Method to extract the index positions of the highest values in a series.

Q. Data Frame and Essential Functionality

A DataFrame in Python's pandas library is a two-dimensional labeled data structure that is used for data manipulation and analysis. It can handle different data types such as integers, floats, and strings. Each column has a unique label, and each row is labeled with a unique index value, which helps in accessing specific rows.

DataFrame is used in machine learning tasks, which allows users to manipulate and analyze data sets of large size. It supports operations such as filtering, sorting, merging, grouping, and transforming data.

DataFrame Structure:

Creating a pandas DataFrame:

Syntax: pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)
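For instance, a DataFrame can be built from a dictionary of lists; the column names and values below are made up only for illustration:

import pandas as pd

df = pd.DataFrame({"Name": ["Asha", "Ravi"],
                   "Marks": [85, 92]},
                  index=["s1", "s2"])
print(df)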

Creating a DataFrame from Different Inputs:

g
A pandas DataFrame can be created using various inputs like-

or
1) Lists

s.
2) Dictionary te
3) Series
4) Numpyndarrays
da
5) Another Dataframe
6) External input iles like CSV, JSON, HTML, Excel sheet, and more.
p

Advantages of using paradasDataFrames:


uu

1) Can easily load data from different databases and data formats:
n

2) Can be used with lots of different data types


.a

3) Have intuitive merging and joining data sets that une a common key in order to get a
complete view
w

4) Segment records within a DataFrame.


5) Allow smart label-based slicing, creative indexing and subsetting of large data sets
w

6) Aggregate and summarize quickly in order to get eloquent stats from your data by accessing
w

in-buit functions within pandas DataFrames


7) Define your own Python functions featuring certain computational tasks and apply them on
your DataFrame records.

Q. Dropping Entries

The Pandas drop() function in Python drops specified labels from rows and columns. Drop is a major
function used in data science & Machine Learning to clean the dataset. Pandas Drop() function
removes specified labels from rows or columns. When using a multi-index, labels on differen levels
can be removed by specifying the level. The drop() function removes specified rows or column from
a Pandas DataFrame or Series.

syntax:DataFrame.drop(labels =None, axis=0, index =None, columns =None, level=None,

inplace=False, errors=‘raise’)

Options and their explanation:

labels : Single label or list-like. Index or column labels to drop.

axis : The drop will remove the provided axis; the axis can be 0 or 1. axis=0 refers to rows/index (vertical), axis=1 refers to columns (horizontal). By default, axis=0.

index : Single label or list-like. The index refers to the rows and is equivalent to axis=0.

columns : Single label or list-like. The columns are the horizontals in the tabular view and are denoted with axis=1.

inplace : Accepts bool (True or False); the default is False. inplace=True makes the changes then and there, so you don't need to assign the result to a variable.

level : int or level name, optional. For a MultiIndex, the level from which the labels will be removed.

errors : Can be 'ignore' or 'raise'; the default is 'raise'. If 'ignore', the error is suppressed and only existing labels are dropped; if 'raise', an error message is shown and the data is not dropped.

The drop() function in the Python pandas library is useful for removing specified rows or columns from a DataFrame or Series. The function takes in several parameters, including the labels to drop, the axis (i.e., rows or columns), and whether or not to modify the original DataFrame in place.

With the pandas DataFrame drop() method, we can easily manipulate the structure of our data by removing unnecessary rows or columns. We can also chain multiple drop() calls in Python to remove multiple rows or columns simultaneously.

It's important to note that the Python drop() function in pandas with inplace=True modifies the original DataFrame in place and does not return a new DataFrame object. This can be useful to save memory or avoid creating unnecessary copies of our data.

Ex: Drop rows using the DataFrame.drop() Method

import pandas as pd

df = pd.DataFrame([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]], columns=['a', 'b', 'c', 'd'])

print("-----DataFrame-----")

print(df)

print("---After dropping a specific label from the row of the DataFrame---")

print(df.drop(1))

Output:

-----DataFrame-----

a b c d

0 0 1 2 3

1 4 5 6 7

2 8 9 10 11

---After dropping a specific label from the row of the DataFrame---

a b c d

0 0 1 2 3

2 8 9 10 11
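Columns can be removed in the same way by passing axis=1 or the columns parameter; a small sketch using the same DataFrame:

print(df.drop('b', axis=1))         # drop column 'b'
print(df.drop(columns=['b', 'd']))  # equivalent form for one or more columns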

Q. Indexing

Indexing is the process of accessing an element in a sequence using its position in the sequence (its index). In Python, indexing starts from 0, which means the first element in a sequence is at position 0, the second element is at position 1, and so on. Indexes can be numeric, string, or even datetime values. They can also be unique or non-unique. By default, pandas assigns a numeric, auto-incrementing index to each DataFrame you create.

Indexing is important for two main reasons:

1. Identification: Unique identifiers help in identifying rows with specific characteristics.

2. Selection: Indexes make data selection and manipulation faster and easier.

To access an element in a sequence, you can use square brackets [] with the index of the element you want to access.

Ex:

my_list = ['apple', 'banana', 'cherry', 'date']

print(my_list[0])   # output: "apple"

print(my_list[1])   # output: "banana"

In the above code, we have created a list called my_list and then used indexing to access the first and second elements in the list using their respective indices.

To Set an Index in a Pandas DataFrame: Setting an index in a DataFrame is straightforward. You can use the set_index() function, which takes a column name (or a list of column names) as an argument.

Ex:

import pandas as pd

# Create a simple dataframe

df = pd.DataFrame({
    'A': ['foo', 'bar', 'baz', 'qux'],
    'B': ['one', 'one', 'two', 'three'],
    'C': [1, 2, 3, 4],
    'D': [10, 20, 30, 40]
})

# Set 'A' as the index

df.set_index('A', inplace=True)

Output:

       B      C   D
A
foo    one    1   10
bar    one    2   20
baz    two    3   30
qux    three  4   40
The inplace=True argument modifies the original DataFrame. If you don't include this argument, the
function will return a new DataFrame.
te
Multi-Indexing in Pandas:

Pandas also supports multiple indexes, which can be useful for higher dimensional data. You can create a multi-index DataFrame by passing a list of column names to the set_index() function.

Ex:

# Set 'A' and 'B' as the index

df.set_index(['A', 'B'], inplace=True)

Output:

            C   D
A    B
foo  one    1   10
bar  one    2   20
baz  two    3   30
qux  three  4   40

Resetting the Index: If you want to revert your DataFrame to the default integer index, you can use the reset_index() function.

Ex: # Reset the index

df.reset_index(inplace=True)

Output:

     A    B      C   D

0 foo one 1 10

1 bar one 2 20

2 baz two 3 30

3 qux three 4 40

Indexing for Performance: Indexes are not just for identification and selection. They can also significantly improve performance. When you perform an operation that uses the index, such as a data lookup or a merge, pandas uses a hash-based lookup, which is extremely fast.

Q. Selection

Pandas selection refers to the process of extracting specific portions of data from a DataFrame. Data selection involves choosing specific rows and columns based on labels, positions, or conditions. Pandas provides various methods, such as basic indexing, slicing, boolean indexing, and querying, to efficiently extract, filter, and transform data, enabling users to focus on relevant information for analysis and decision-making. Two pandas accessors - loc and iloc - allow you to select rows and columns either by their labels (names) or their integer positions (indexes).
Ex:

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35], 'City': ['New York', 'Los Angeles', 'Chicago']}

df = pd.DataFrame(data)

# Using loc (label-based)

result_loc = df.loc[0, 'Name']   # Select value at row 0 and column 'Name'

# Using iloc (position-based)

result_iloc = df.iloc[1, 2]      # Select value at row 1 and column 2

print("Using loc:", result_loc)

print("Using iloc:", result_iloc)

Output:

Using loc: Alice

Using iloc: Los Angeles

Selecting Rows and Columns Using loc[] (Label-Based Indexing)

The .loc[]method selects data based on labels (names of rows or columns). It is flexible and supports
various operations like selecting single rows/columns, multiple rows/columns, or specific subsets.

Key Features of loc[]:

1) Label-based indexing.
2) Can select both rows and columns simultaneously.
3) Supports slicing and filtering.

Select a Single Row by Label:

Ex:

row = df.loc[0] #Select the first row

print(row)

Output:

Name Alice

Age 25

City New York

Select Multiple Rows by Labels:

Ex:

rows = df.loc[[0, 2]]   # Select rows with index labels 0 and 2

print(rows)

Output:

      Name  Age      City
0    Alice   25  New York
2  Charlie   35   Chicago

Select Specific Rows and Columns:

subset = df.loc[0:1, ['Name', 'City']]   # Select first two rows and specific columns

print(subset)

Output:

    Name         City
0  Alice     New York
1    Bob  Los Angeles

Filter Rows Based on Conditions:

Ex: filtered= df.loc[df['Age']>25] #Select rows where Age > 25

print(filtered)

Output:

Name Age City



1 Bob 30 Los Angeles

2 Charlie 35 Chicago

Selecting Rows and Columns Using iloc[] (Integer-Position Based Indexing):

The .iloc[] method selects data based on integer positions (index numbers). It is particularly useful when you don't know the labels but know the positions.

Key Features of .iloc[ ]:

1) Uses integer positions (0, 1, 2,….) to index rows and columns.


2) Just like .loc[], you can pass a range or a list of indices.
3) Supports slicing, similar to Python lists.
4) Unlike .loc[ ], it is exclusive when indexing ranges, meaning that the end index is excluded.

Select a Single Row by Position:

Ex:

row = df.iloc[1]   # Select the second row (index position = 1)

print(row)

Output:

Name            Bob
Age              30
City    Los Angeles
Name: 1, dtype: object

Select Multiple Rows by Positions:

Ex:

rows = df.iloc[[0, 2]]   # Select first and third rows by position

print(rows)

Output:

      Name  Age      City
0    Alice   25  New York
2  Charlie   35   Chicago
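Rows and columns can also be sliced together with .iloc[]; a small sketch on the same DataFrame (remember that the end positions are excluded):

subset = df.iloc[0:2, 0:2]   # first two rows, first two columns
print(subset)

Output:

    Name  Age
0  Alice   25
1    Bob   30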

Q. Filtering

One of the most common data manipulation operations is DataFrame filtering. A DataFrame is filtered when its data are analyzed and only the rows that satisfy specific requirements are returned.

Pandas, a great data manipulation tool, is the best fit for DataFrame filtering. The core data structure of pandas is the DataFrame, which stores data in tabular form with labeled rows and columns.

Create a sample DataFrame for our examples:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['Jane', 'John', 'Ashley', 'Mike', 'Emily', 'Jack', 'Catlin'],
    'ctg': ['A', 'A', 'C', 'B', 'B', 'C', 'B'],
    'val': np.random.random(7).round(2),
    'val2': np.random.randint(1, 10, size=7)
})

Output:

     name ctg   val  val2
0    Jane   A  0.43     1
1    John   A  0.67     1
2  Ashley   C  0.40     7
3    Mike   B  0.91     5
4   Emily   B  0.99     8
5    Jack   C  0.02     7
6  Catlin   B  1.00     3
n

Pandas Filter Methods:

1) Logical Operators: We can use logical operators on column values to filter rows. Here we select the rows in which the value in the "val" column is greater than 0.9.

Ex: df[df.val > 0.9]

Output:

     name ctg   val  val2
3    Mike   B  0.91     5
4   Emily   B  0.99     8
6  Catlin   B  1.00     3
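Multiple conditions can be combined with & (and) and | (or), wrapping each condition in parentheses; a small sketch on the same DataFrame:

df[(df.val > 0.5) & (df.ctg == 'B')]   # value above 0.5 AND category B
df[(df.val > 0.9) | (df.val2 > 7)]     # either condition is enough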

2) Isin: The isin method is another way of applying multiple conditions for filtering. For instance, we
can filter the names that exist in a given list.

Ex:

names = ['John', 'Catlin', 'Mike']

df[df.name.isin(names)]

Output:

     name ctg   val  val2

1 John A 0.67 1

3 Mike B 0.91 5

6 Catlin B 1.00 3

3) Str accessor:

Pandas is a highly efficient library on textual data as well. The functions and methods under the str
accessor provide flexible ways to filter rows based on strings. For instance. we can select the names
that start with the letter "J”.

Ex:

df[df.name.str.startswith('J')]

   name ctg   val  val2
0  Jane   A  0.43     1
1  John   A  0.67     1
5  Jack   C  0.02     7

Tilde (~):

The tilde operator is used for "not" logic in filtering. If we add the tilde operator before the filter expression, the rows that do not fit the condition are returned.

Ex: df[~df.name.str.contains('J')]
p
uu

     name ctg   val  val2
2  Ashley   C  0.40     7
3    Mike   B  0.91     5
4   Emily   B  0.99     8
6  Catlin   B  1.00     3

4) Query:

The query function offers a little more flexibility in writing the conditions for filtering. We can pass the conditions as a string.

For instance, the following code returns the rows that belong to the B category and have a value higher than 0.5 in the val column.

Ex: df.query('ctg == "B" and val > 0.5')

Output:

     name ctg   val  val2
3    Mike   B  0.91     5
4   Emily   B  0.99     8
6  Catlin   B  1.00     3

5) Nlargest or Nsmallest: In some cases, we do not have a specific range for filtering but just need the largest or smallest values. The nlargest and nsmallest functions allow you to select rows that have the largest or smallest values in a column, respectively.

Ex: df.nlargest(2, 'val')

Output:

     name ctg   val  val2
6  Catlin   B  1.00     3
4   Emily   B  0.99     8
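Similarly, nsmallest returns the rows with the smallest values in a column; a quick sketch:

df.nsmallest(2, 'val')   # the two rows with the lowest 'val'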

Q. Function Application and Mapping

Pandas provides powerful methods to apply custom or library functions to DataFrame and Series objects. Depending on whether you want to apply a function to the entire DataFrame, row- or column-wise, or element-wise, pandas offers several methods to achieve these tasks. There are three essential methods for function application in pandas -

a) Table-wise Function Application: pipe()

b) Row or Column Wise Function Application: apply()

c) Element-wise Function Application: map()

Table-wise Function Application:

Custom operations are performed by passing a function and an appropriate number of parameters as pipe arguments. The operation is then performed on the entire DataFrame or Series. When we want to apply one function to a Series or DataFrame, then apply another, then another, and so on, the notation can become messy. It can also make the program more prone to error. Here, pipe() becomes useful.

Example: Applying a Custom Function to an Entire Series

Here is an example that demonstrates how you can add a value to all elements using the pipe() function.

Ex:

import pandas as pd

dataflair_s1 = pd.Series([11, 21, 31, 41, 51])

# pipe() passes the whole Series to the function (here, add 10 to every element)
result = dataflair_s1.pipe(lambda s, v: s + v, 10)
print(result)

Output:

0    21
1    31
2    41
3    51
4    61
dtype: int64

Row or Column Wise Function Application:

The apply() function is versatile and allows you to apply a function along the axes of a DataFrame. By
default, it applies the function column-wise, but you can specify row-wise application using the axis
parameter.

✓ Applying a Function Column-wise: This example applies a function to the DataFrame columns. Here the np.mean() function calculates the mean of each column.
Ex:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [3, 4, 5], 'col2': [7, 1, 2], 'col3': [6, 4, 8]})
print('Original DataFrame:\n', df)
result = df.apply(np.mean)
print('Result:\n', result)

Output:

Original DataFrame:
    col1  col2  col3
0     3     7     6
1     4     1     4
2     5     2     8
Result:
col1    4.000000
col2    3.333333
col3    6.000000
dtype: float64

Applying a Function Row-wise:

This example applies the np.mean() function to the rows of the pandas DataFrame by passing axis=1 -

Ex:

df = pd.DataFrame({'col1': [3, 4, 5], 'col2': [7, 1, 2], 'col3': [6, 4, 8]})
print('Original DataFrame:\n', df)
result = df.apply(np.mean, axis=1)
print('Result:\n', result)

Output:

Original DataFrame:
    col1  col2  col3
0     3     7     6
1     4     1     4
2     5     2     8
Result:
0    5.333333
1    3.000000
2    5.000000
dtype: float64
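apply() also accepts user-defined or lambda functions; a small sketch that computes the range (max minus min) of each column of the same DataFrame:

print(df.apply(lambda col: col.max() - col.min()))

col1    2
col2    6
col3    4
dtype: int64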

Element-wise Function Application:

When you need to apply a function to each element individually, you can use the map() function. This is particularly useful when the function cannot be vectorized.

Using the map() Function: The following example demonstrates how to use the map() function for applying a custom function to the elements of a Series.

Ex:

gfg_string = 'hello'
p

gfg_series = pd.Series(list(gfg_string))
uu

print("Original series\n" +
n

gfg_series.to_string(index = False,
.a

header=False), end='\n\n')
w

new_gfg_series = gfg_series.map(str.upper)
w

print("Transformed series:\n" +
w

new_gfg_series.to_string(index=False,

header=False), end='\n\n')

Output:

Original series Transformed series

h H

e E

l L

l L

o O
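For element-wise operations on every cell of a DataFrame (rather than a single Series), the analogous method is applymap() (renamed DataFrame.map() in recent pandas versions); a small sketch with made-up values:

df_num = pd.DataFrame({'a': [1.234, 2.345], 'b': [3.456, 4.567]})
print(df_num.applymap(lambda x: round(x, 1)))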

Q. Sorting and Ranking

Sorting is a fundamental operation when working with data in Pandas, whether you're organizing
rows, columns, or specific values. Sorting can help you to arrange your data in a meaningful way for
better understanding and easy analysis. Pandas provides powerful tools for sorting your data
efficiently, which can be done by labels or actual values.

Types of Sorting in Pandas:

1) Sorting by Label: This involves sorting the data based on the index labels.
2) Sorting by Value: This involves sorting data based on the actual values in the DataFrame or
Series.

Sorting by Label:

To sort by the index labels, you can use the sort_index() method. By passing the axis argument and the order of sorting, the data structure object can be sorted. By default, this method sorts the DataFrame in ascending order based on the row labels.

Ex:

import numpy as np
import pandas as pd

unsorted_df = pd.DataFrame(np.random.randn(3, 2), index=[1, 2, 0],
                           columns=['col2', 'col1'])

print("Original DataFrame:\n", unsorted_df)

# Sort the DataFrame by labels

sorted_df = unsorted_df.sort_index()

print("\nSorted DataFrame:\n", sorted_df)

Output:

Original DataFrame:
     col2    col1
1   1.116   1.631
2  -2.070   0.148
0   0.922  -0.429

Sorted DataFrame:
     col2    col1
0   0.922  -0.429
1   1.116   1.631
2  -2.070   0.148
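sort_index() can also sort in descending order, or sort the column labels instead of the row labels; a quick sketch on the same DataFrame:

unsorted_df.sort_index(ascending=False)   # descending order of the row labels
unsorted_df.sort_index(axis=1)            # sort the column labels ('col1', 'col2')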

b) Sorting by Actual Values: Like index sorting, sorting by actual values can be done using the
sort_values() method. This method allows sorting by one or more columns. It accepts a 'by'
argument which will use the column name of the DataFrame with which the values are to be sorted.

Ex: Sorting by Actual Values

import pandas as pd

panda_series = pd.Series([18, 12, 55, 0])

print("Unsorted Pandas Series: \n", panda_series)

panda_series_sorted = panda_series.sort_values(ascending=True)

print("\nSorted Pandas Series: \n", panda_series_sorted)

Output:

Unsorted Pandas Series:
0    18
1    12
2    55
3     0

Sorted Pandas Series:
3     0
1    12
0    18
2    55
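For a DataFrame, sort_values() takes the column name(s) through the by argument; a minimal sketch with made-up values:

scores = pd.DataFrame({'name': ['A', 'B', 'C'], 'marks': [55, 80, 65]})
print(scores.sort_values(by='marks', ascending=False))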
n
.a

Ranking: Ranking in pandas is achieved using the rank() method, which assigns ranks to elements within a Series or DataFrame. This method provides flexibility in how ranks are computed and displayed, with several parameters to control its behavior.

The rank() method is applied to a Series or a column of a DataFrame. By default, it assigns ranks in ascending order, with the smallest value receiving rank 1. In cases of ties, the average rank is assigned to the tied values.

Syntax: df.rank(axis=0, method='average', numeric_only=None, na_option='keep', ascending=True, pct=False)

rank() Arguments:

1) axis: specifies whether to rank rows or columns


2) method: specifies how to handle equal values
3) numeric_only: rank only numeric data if True
4) na_option: specifies how to handle NaN
5) ascending: specifies whether to rank in ascending order
6) pct: specifies whether to display the rank as a percentage.

Ex: Basic Ranking



import pandas as pd

data = {'Score': [250, 400, 300, 300, 150]}

df = pd.DataFrame(data)

df['Rank'] = df['Score'].rank()

print(df)

Output:

Score Rank

0 250 2.0

1 400 5.0

2    300   3.5
3    300   3.5
4    150   1.0

Rank a Column with Different Methods:

Method    Description
average   Default: assigns the average rank to each entry in the same group
min       Uses the minimum rank for the whole group
max       Uses the maximum rank for the whole group
first     Assigns ranks in the order in which the values appear in the data
dense     Like method='min', but the ranks always increase by 1 between groups and not according to the number of same items in a group

Ranking with Method:


w

import pandas as pd

data= {'Score': [78, 85, 85, 90]}

df= pd.DataFrame(data)

# rank using the 'max' method for ties

df['Rank'] = df['Score'].rank(method='max')

print(df)

Output:

Score Rank

0 78 1.0

1 85 3.0

2 85 3.0

3 90 4.0
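For comparison, the same column can be ranked with other options; a quick sketch:

df['DenseRank'] = df['Score'].rank(method='dense')   # ties share a rank, next rank is +1
df['PctRank'] = df['Score'].rank(pct=True)           # ranks expressed as percentages
print(df)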

Q. Summarizing and Computing Descriptive Statistics:

Statistics is a branch of mathematics that deals with the collection, organization and interpretation of data. Descriptive statistics involves summarizing and organizing the data so that it can be easily understood.

✓ Descriptive statistics are a set of tools in data analysis that help us understand and
summarize the main features of a dataset.
✓ They provide simple ways to describe essential aspects like central tendency (mean, median)
variability (range, standard deviation), and distribution shape.
✓ By using descriptive statistics, we can quickly grasp the overall picture of data, spot patterns,
and identify potential outliers.

g
✓ These techniques make complex data more manageable and accessible, aiding in making informed decisions and drawing meaningful insights from the information.

Measures of Central Tendency:

A measure of central tendency is used to describe the middle/center value of the data. Mean, median and mode are measures of central tendency.

1. Mean

✓ Mean is the average value of the dataset.
✓ Mean is calculated by adding all values in the dataset and dividing by the number of values in the dataset.
✓ We can calculate the mean only for numerical variables.

Formula to calculate the mean: Mean = (sum of all values) / (number of values)

2. Median:

✓ The median is the middle number in the dataset.
✓ Median is the best measure when we have outliers.

Ex: Find the median of the "Age" column in our dataset.

Age -> 25,27,30,26,28,29,31,32,27
Sort in ascending order -> 25,26,27,27,28,29,30,31,32
Pick the middle one -> 28

Mathematical Calculation:

If we have an even number of data points, find the average of the middle two items.

Ex:

Age -> 4,12,24,8,16,20
Sort -> 4,8,12,16,20,24
Pick the middle ones -> 12,16
Find the average -> 28/2 = 14

3) Mode: The mode is the value that occurs most often in the dataset. Finding the mode for a particular variable ("Age") using a mathematical calculation:

Age -> 25,27,30,26,28,29,31,32,27
Age repeated the most times -> 27
Mode -> 27
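The same three measures can be computed directly with pandas; a small sketch on the Age values used above:

import pandas as pd

age = pd.Series([25, 27, 30, 26, 28, 29, 31, 32, 27])
print(age.mean())      # 28.33...
print(age.median())    # 28.0
print(age.mode()[0])   # 27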

Measure of Variability:

1) Range: describes the difference between the largest and smallest data point in our data set.
Syntax: Range = Largest data value - Smallest data value
Ex:
# Sample data
arr = [1, 2, 3, 4, 5]
# Finding max
Maximum = max(arr)
# Finding min
Minimum = min(arr)
# Difference of max and min
Range = Maximum - Minimum
print("Maximum = {}, Minimum = {} and Range = {}".format(
    Maximum, Minimum, Range))

Output: Maximum = 5, Minimum = 1 and Range = 4

2) Variance: is defined as the average squared deviation from the mean. It is calculated by finding the difference between every data point and the average (the mean), squaring those differences, adding all of them up and then dividing by the number of data points present in our data set.
Ex:
import statistics
# sample data
arr = [1, 2, 3, 4, 5]
# variance
print("Var = ", (statistics.variance(arr)))

Output:
Var = 2.5
n

3) Standard deviation: Standard deviation is widely used to measure the extent of variation or dispersion in data. It's especially important when assessing model performance (e.g., residuals) or comparing datasets with different means.
Ex:
import statistics
arr = [1, 2, 3, 4, 5]
print("Std = ", (statistics.stdev(arr)))

Output:
Std = 1.5811388300841898

Q. Unique Values

Pandas is a powerful tool for manipulating data once you know the core operations and how to use
them. one such operation to get unique values in a column of pandas dataframe.

1) Using unique() method:

The unique() method is used when we deal with a single column (Series) of a DataFrame and returns all unique elements of that column. The method returns a NumPy array containing the unique values in the order in which they appear.

Syntax: Series.unique()

Ex: import pandas as pd

data = {
    "Students": ["Ray", "John", "Mole", "Smith", "Jay", "Milli", "Tom", "Rick"],
    "Subjects": ["Maths", "Economics", "Science", "Maths", "Statistics", "Statistics", "Statistics", "Computers"]
}

# load data into a DataFrame object:

df = pd.DataFrame(data)

print(df["Subjects"].unique())

print(type(df["Subjects"].unique()))

Output: ['Maths' 'Economics' 'Science' 'Statistics' 'Computers']
<class 'numpy.ndarray'>

g
or
2) Using the drop_duplicates method:

drop_duplicates() is an in-built function in the panda's library that helps to remove the duplicates

s.
from the dataframe. It helps to preserve the type of the dataframe object or its subset and removes
te
the rows with duplicate values. When it comes to dealing with the large set of dataframe, using the
drop_duplicate() method is considered to be the faster option to remove the duplicate values.
da

Ex:
p

import pandas as pd

data = {
    "Students": ["Ray", "John", "Mole", "Smith", "Jay", "Milli", "Tom", "Rick"],
    "Subjects": ["Maths", "Economics", "Science", "Maths", "Statistics", "Statistics", "Statistics", "Computers"]
}
w

#load data into a DataFrame object:


w

df = pd. DataFrame(data)

print(df.drop_duplicates(subset = "Subjects"))

print(type(df.drop_duplicates(subset = "Subjects")))

Output:

  Students   Subjects
0      Ray      Maths

1 John Economics

2 Mole Science

4 Jay Statistics

7 Rick Computers

3) Get unique values in multiple columns:

Till now we have understood how you can get the set of unique values from a single dataframe. But
what if you wish to identify unique values from more than one column. In such cases, you can merge
the content of those columns for which the unique values are to be found, and later, use the unique()
method on that series(column) object.

Ex:

import pandas as pd

data = {

"Students": ["Ray", "John", "Mole", "Smith", "Jay", "Smith", "Tom", "John"],

"Subjects": ["Maths", "Economics", "Science", "Maths", "Statistics", "Statistics", "Statistics",


"Computers"]

g
}

or
#load data into a DataFrame object:

df= pd. DataFrame(data)


s.
te
uniqueValues = pd.concat([df['Students'], df['Subjects']]).unique()   # combine both columns, then take unique values

print(uniqueValues)
p

Output: ['Ray' 'John' 'Mole' 'Smith' 'Jay' 'Tom' 'Maths' 'Economics' 'Science' 'Statistics' 'Computers']
uu

4) Count unique values in a single column:


n

Suppose instead of finding the names of unique values in the columns of the dataframe, you wish to
.a

count the total number of unique elements. In such a case, you can make use of the nunique()
method instead of the unique() method as shown in the below example:
w

import pandas as pd
w
w

data {

"Students": ["Ray", "John", "Mole", "Smith", "Jay", "Milli", "Tom", "Rick"], "Subjects": ["Maths",
"Economics", "Science", "Maths", "Statistics", "Statistics", "Statistics", "Computers"]

#load data into a DataFrame object:

df= pd.DataFrame(data)

unique Values = df['Subjects'].nunique()

print(unique Values)

Output: 5

5) Count unique values in each columns:



In the above method, we count the number of unique values for the given column of the dataframe.

However, if you wish to find a total number of the unique elements from each column of the
dataframe, you can pass the dataframe object using the nunique() method.

Ex:

import pandas as pd

data = {

"Students": ["Ray", "John", "Mole", "Smith", "Jay", "Milli", "Tom", "Rick"], "Subjects": ["Maths",
"Economics", "Science", "Maths", "Statistics", "Statistics", "Statistics", "Computers"] }

unique Values = df.nunique()

print(unique Values)

Output:

g
Students 8

or
Subjects 6

Q. Value Counts
s.
te
The value_counts() method in Pandas is used to count the number of occurrences of each unique
da
value in a Series.

✓ value counts is a function in the Pandas library that counts the frequency of unique
p

values in a Series or DataFrame column.


uu

✓ It returns a Series containing counts of unique values. This function is beneficial for
data analysis and preprocessing tasks, providing insights into the distribution of
n

categorical data.
.a

Syntax: Series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)


w

value_counts() Arguments:
w

The value counts() method takes following arguments:


w

1) normalize (optional) if set to True, returns the relative frequencies (proportions) of unique values
instead of their counts

2)sort (optional) - determines whether to sort the unique values by their counted frequencies

3) ascending (optional) - determines whether to sort the counts in ascending or descending

4) bins (optional) - groups numeric data into equal-width bins if specified

5) dropna (optional) - exclude null values if True.

Ex:

import pandas as pd

# create a Series

data = pd.Series(['apple', 'banana', 'apple', 'orange', 'banana', 'apple'])



#use value_counts() to count the occurrences of each unique value

counts = data.value_counts()

print(counts)

Output:

apple 3

banana 2

orange 1
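Passing normalize=True returns relative frequencies instead of raw counts; a quick sketch on the same Series:

print(data.value_counts(normalize=True))

apple     0.500000
banana    0.333333
orange    0.166667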

Ex 2:Sort Unique Value Counts in Pandas

import pandas as pd

#create a Series

g
data = pd.Series(['apple', 'banana', 'apple', 'orange', 'banana', 'apple', 'banana', 'kiwi', 'kiwi', 'kiwi'])

or
# use value_counts() with sorting

s.
counts_sort=data.value_counts(sort=True) te
print("Counts with sorting:")
da
print(counts_sort)

Output:
p
uu

Counts with sorting:

apple 3
n
.a

banana 3

kiwi 3
w
w

orange 1
w

Q. Membership

Membership operators in Python are operators used to test whether a value exists in a sequence,
such as list, tuple, or string. The membership operators available in Python are,

• in: The in operator returns True if the value is found in the sequence.
• not in: The not in operator returns True if the value is not found in the sequence.

Membership operators are commonly used with sequences like lists, tuples, and sets. They can also
be used with strings, dictionaries, and other iterable objects.

Membership operators in Python are commonly used with sequences like lists, tuples, and sets. They
can also be used with strings, dictionaries, and other iterable objects.

Types of Membership Operators:

1) The in operator

Th in operator is used to check whether a value is present in a sequence or not. If the value is
present, then true is returned. Otherwise, false is returned.

Ex:

# Define a list
fruits = ['apple', 'banana', 'cherry']
# Check if 'apple' is in the list

if 'apple' in fruits:

print('Yes, apple is in the list')

Output:

Yes, apple is in the list

2) The not in operator: The not in operator is opposite to the in operator. If a value is not present in a
sequence, it returns true. Otherwise, it returns false.

Ex:

g
# Define a tuple

or
numbers = (1, 2, 3, 4, 5)

# Check if 6 is not in the tuple
if 6 not in numbers:


s.
te
print('Yes, 6 is not in the tuple')
da

Output: Yes, 6 is not in the tuple
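With a dictionary, the membership operators test the keys, not the values; a small sketch:

prices = {'apple': 40, 'banana': 10}
print('apple' in prices)          # True  - 'apple' is a key
print(40 in prices)               # False - 40 is a value, not a key
print(40 in prices.values())      # True  - test the values explicitly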


p

Application of Membership Operators in Python:


uu

• Programming languages: The membership operator is used in programming languages to


test whether a value is a member of a list or a tuple. For example, in Python, the 'in' keyword
n

is used to test the membership of an element in a list or a tuple.


.a

• Database management: The membership operator is used in database management to


perform operations on sets of data. It is used to filter data and retrieve records that meet a
w

specific condition.
w

• Artificial intelligence: The membership operator is used in fuzzy logic to represent the degree
w

of membership of an element in a set. Fuzzy logic is used to model systems that have
uncertain or imprecise information.
• Natural language processing: The membership operator is used in natural language
processing to determine the similarity between words and phrases. It is used to identify
synonyms and related words in a corpus of text.

Q. Reading and Writing Data in Text Format.

Python can read/write both ordinary text files as well as binary files.

Read Text Files with Pandas

Below are the methods by which we can read text files with Pandas:

• Using read csv()


• Using read_table()

• Using read fwf()

1) Read Text Files with Pandas Using read_csv():


We will read the text file with pandas using the read csv() function. Along with the text file,
we also pass separator as a single space (') for the space character because, for text files, the
space character will separate each field. There are three parameters we can pass to the
read_csv() function.

Syntax: data = pandas.read_csv('filename.txt', sep=' ', header=None, names=["Column1", "Column2"])
2) Read Text Files with Pandas Using read_table()
We can read data from a text file using read table() in pandas. This function reads a general
delimited file to a DataFrame object. This function is essentially the same as the read_csv()
function but with the delimiter '\t', instead of a comma by default. We will read data with the
read_table function making separator equal to a single space(' ').

g
Syntax: data = pandas.read_table('filename.txt', delimiter = " ")

or
3) Read Text Files with Pandas Using read_fwf()
The fwf in the read_fwf() function stands for fixed-width lines. We can use this function to load

s.
DataFrames from files. This function also supports text files. We will read data from the text files
te
using the read_fwf() function with pandas. It also supports optionally iterating or breaking the
file into chunks. Since the columns in the text file were separated with a fixed width, this
da

read_fwf() read the contents effectively into separate columns.


Syntax: data = pandas.read_fwf('filename.txt')
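A minimal sketch combining these ideas (assuming a hypothetical space-separated file students.txt with two fields per line):

import pandas as pd

data = pd.read_csv('students.txt', sep=' ', header=None, names=['Name', 'Marks'])
print(data.head())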
p
uu

Opening and Closing a File: All files must be explicitly opened and closed when working with them in
the Python environment. This can be performed using the open() and close() functions.
n

Ex:
.a

f = open(filepath,'r’)
w

#perform file handling operations


w

f.close()
w

The most common way to do this is to use the with open statement which automatically closes the
file once the file handling is complete.mode is an optional parameter which tells python what
permissions to give to the script opening the file. It defaults to read-only 'r'.

Mode Description
‘r’ Open the file for read only
‘w’ Open the file for writing (will overwrite existing content)

Extracting (Reading) Data from the File:

There are three methods used to read the contents of a text file.

• read() this will return the data in the text file as a single string of characters .
• readline() this will read a line in the file and return it as a string. It will only read a single line.
• readlines() reads all the lines in the text file and will return each line as a string within a list.

This method looks for EOL characters to differentiate items in the list, readlines() is most
often implicitly used in a for loop to extract all the data from the text file.

Example using readline():Readline() will return the first line in the text file. The first line is defined as

the first character in the file up to the first EOL character.

Ex:

with open("sample-textfile.txt", 'r') as f:
    string = f.readline()

string
"Beautiful is better than ugly\n"

Writing to a Text File:

Writing to a text file follows a very similar methodology to that of reading files. Use a with open
statement but now change the mode to 'w' to write a new file or 'a' to append onto an existing file.

g
There are two write functions that you can use:

or
• write() takes a string as an input and will write out a single line.

s.
• writelines() takes a list of strings as an input will write out each string in the list.
te
Example using write()
da
import os

outfilename = "sample-output.txt"
p
uu

outputfolder = "output"

filepath = os.path.join(outputfolder, outfilename)
n

with open(filepath, 'w') as f:


.a

f.write("Hello World\n")
w
w

f.write("How are you doing?\n")


w

Output:

sample-output.txt:
Hello World
How are you doing?

UNIT-V

Data Cleaning and Preparation

Data Cleaning: Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. When combining multiple data sources, there are many opportunities for data to be duplicated or mislabeled. If data is incorrect, outcomes and algorithms are unreliable, even though they may look correct. There is no one absolute way to prescribe the exact steps in the data cleaning process because the processes will vary from dataset to dataset. But it is crucial to establish a template for your data cleaning process so you know you are doing it the right way every time.

The following are essential steps to perform data cleaning.

(Figure: the essential steps of the data cleaning process.)
Q.HandlingMissingData
n

Missing values in a dataset can occur due to several reasons, such as breakdown of measuring equipment, accidental removal of observations, lack of response by respondents, error on the part of the researcher, etc.
Row no   State   Salary          Yrs of Experience
1        NY      57400           MID
2        TX      Missing value   ENTRY
3        NJ      90000           HIGH

✓ Missing data can compromise the quality of your analysis. You can address these gaps by either deleting or imputing data.
✓ Rows or columns with minimal missing data can be deleted without heavily impacting the overall analysis. However, if the missing data is significant, it's advisable to impute values using statistical measures such as mean, median, or mode.
✓ This approach helps maintain the integrity of your analysis while addressing the issue of missing data.
✓ Handling missing values effectively helps complete your dataset, leading to more accurate and reliable analysis results.

Let us read the dataset GDP_missing_data.csv, in which we have randomly removed some values, i.e., put missing values in some of the columns.
Ex:

import pandas as pd
import numpy as np
gdp_missing_values_data = pd.read_csv('./Datasets/GDP_missing_data.csv')
gdp_complete_data = pd.read_csv('./Datasets/GDP_complete_data.csv')

Country        GDP Per Capita   Crops   Transport   Education
Afghanistan    2474.0           87.5    NaN         NaN
Algeria        11433.0          76.4    76.1        NaN
Argentina      NaN              76.2    83.8        NaN
Armenia        13638.0          65.0    NaN         NaN
Australia      54891.0          NaN     91.0        89

Observe that the gdp_missing_values_data dataset consists of some missing values shown as NaN (Not a Number).
.a
w

Types of missing values

Now that we know how to identify missing values in the dataset, let us learn about the types of missing values that can be there. Rubin (1976) classified missing values into three categories.

● Missing completely at random (MCAR): In this case, there may be no pattern as to why a column's data is missing. For example, survey data is missing because someone could not make it to an appointment, or an administrator misplaces the test results he is supposed to enter into the computer. The reason for the missing values is unrelated to the data in the dataset.

● Missing at random (MAR): In this scenario, the reason the data is missing in a column can be explained by the data in other columns. For example, a school student who scores above the cutoff is typically given a grade. So, a missing grade for a student can be explained by the column that has scores below the cutoff. The reason for these missing values can be described by data in another column.

● Missing not at random (MNAR): Sometimes, the missing value is related to the value itself. For example, higher income people may not disclose their incomes. Here, there is a correlation between the missing values and the actual income. The missing values are not dependent on other variables in the dataset.
Methods for Identifying and Handling Missing Data

Function            Description
.isnull()           Identifies missing values in a Series or DataFrame.
.notnull()          Checks for non-missing values in a pandas Series or DataFrame. It returns a boolean Series or DataFrame, where True indicates non-missing values and False indicates missing values.
.info()             Displays information about the DataFrame, including data types, memory usage, and the presence of missing values.
.isna()             Similar to .isnull(); returns True for missing values and False for non-missing values.
.dropna()           Drops rows or columns containing missing values based on custom criteria.
.fillna()           Fills missing values with specific values, means, medians, or other calculated values.
.replace()          Replaces specific values with other values, facilitating data correction and standardization.
.drop_duplicates()  Removes duplicate rows based on specified columns.
.unique()           Finds unique values in a Series or DataFrame column.
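A rough sketch of these functions on the GDP data read above (the fill below uses the column means; this is one possible choice, not the only one):

print(gdp_missing_values_data.isnull().sum())     # count of missing values per column

filled = gdp_missing_values_data.fillna(
    gdp_missing_values_data.mean(numeric_only=True))   # impute numeric columns with the mean

dropped = gdp_missing_values_data.dropna()              # drop any row that still has a NaN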

Q. Removing Duplicates

Duplicates can arise from various sources, such as human errors, data entry mistakes, data integration issues, web scraping errors, or data collection methods. For example, a customer may fill out a form twice with slightly different information, a data entry operator may copy and paste the same record multiple times, a data integration process may merge data from different sources without checking for uniqueness, a web scraper may extract the same page more than once, or a survey may collect responses from the same respondent using different identifiers. Some of these causes are easy to avoid or fix, while others may require more complex solutions.

Detection of duplicates:

✓ When dealing with duplicates, the first step is to detect them in your data set. Depending on the type, size, and level of similarity you wish to consider, there are various tools and techniques that can be used.
up

✓ Built-in functions or libraries in your programming language or software can be used to check for exact duplicates. For instance, in Python, you can use the pandas library's drop_duplicates() method. Hashing or checksum algorithms can generate unique identifiers for each record based on their values, allowing you to compare them and find exact or near duplicates.
w

Measurement of Duplicates: To measure duplicates in your data set, you can use different metrics and indicators. This can help you understand the extent and impact of duplicates on your data quality and analysis results. For example, you can use the pandas library in Python to count the number of duplicate rows or columns using the duplicated() method, and then calculate the percentage of duplicates using the sum() and len() functions.

Handling of duplicates: The final step in handling duplicates is to decide what to do with them in your data set. Depending on your data and objectives, you can delete or drop duplicate records or values if they are redundant, irrelevant, or erroneous, and won't affect the representativeness or completeness of your data. For example, you can use the pandas library in Python to drop duplicate rows or columns using the drop_duplicates() method.

Built-in Pandas Function: The promised built-in function, drop_duplicates(), deletes rows based on duplicates in a list of column name(s) that you specify in the subset parameter.

Ex:

Code:

df1 = pd.DataFrame({'k1': ['A'] * 3 + ['B'] * 4, 'k2': [1, 1, 2, 2, 3, 3, 4]})
df1

Output:

   k1  k2
0   A   1
1   A   1
2   A   2
3   B   2
4   B   3
5   B   3
6   B   4

Code:

df_dedup = df1.drop_duplicates()
df_dedup

Output:

   k1  k2
0   A   1
2   A   2
3   B   2
4   B   3
6   B   4
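Before dropping anything, duplicated() can be used to see how many duplicate rows exist; a quick sketch on the same DataFrame:

print(df1.duplicated())         # boolean Series, True for repeated rows
print(df1.duplicated().sum())   # 2 duplicate rows in this example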
.a

Q. Transforming Data Using a Function or Mapping

For many datasets, you may need to perform some transformations based on the values in a pandas object. Consider this hypothetical scenario:

You have a DataFrame consisting of customer names and addresses. The addresses include a postal code and city, but country information is missing. You need to add the country information to this DataFrame. Fortunately, you have a city-to-country mapping maintained in a dictionary. You want to create an additional column in the DataFrame that contains the country values.

Let's implement this solution using the map method. This method is called on a Series and you pass a function to it as an argument. The function is applied to each value in the Series specified in the map method.

Use these code snippets for a demonstration:

Code:

df_person = pd.DataFrame([
    ['Person1', 'Melbourne', '3024'],
    ['Person2', 'Sydney', '3003'],
    ['Person3', 'Delhi', '100001'],
    ['Person4', 'Kolkata', '700007'],
    ['Person5', 'London', 'QA3023']
],
columns=['Name', 'City', 'Pin'])
df_person

Output:

      Name       City     Pin
0  Person1  Melbourne    3024
1  Person2     Sydney    3003
2  Person3      Delhi  100001
3  Person4    Kolkata  700007
4  Person5     London  QA3023


nu

Next, let us create a dictionary mapping each city to its country.

Code:

dict_mapping = {"Melbourne": "Australia", "Sydney": "Australia", "Delhi": "India",
                "Kolkata": "India", "London": "United Kingdom"}
dict_mapping

Output:

{"Melbourne": "Australia",
 "Sydney": "Australia",
 "Delhi": "India",
 "Kolkata": "India",
 "London": "United Kingdom"}

Code:

df_person['Country'] = df_person['City'].map(lambda x: dict_mapping[x])
df_person

Output:

      Name       City     Pin         Country
0  Person1  Melbourne    3024       Australia
1  Person2     Sydney    3003       Australia
2  Person3      Delhi  100001           India
3  Person4    Kolkata  700007           India
4  Person5     London  QA3023  United Kingdom
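Series.map() also accepts the dictionary directly, so the lambda is optional; a quick sketch:

df_person['Country'] = df_person['City'].map(dict_mapping)   # same result, no lambda needed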

We can observe in the result:

● The map method is called on the series df_person["City"].
● There is an inline function using lambda notation (covered in Course 1).
● This inline function takes a key (x) as an input, and returns the value corresponding to this key (x) from the dictionary object dict_mapping.
● The resulting value is stored in a new column, "Country", in the original DataFrame df_person.
te
da
Q. Replacing Values

The replace() method replaces a specified value with another specified value. The pandas DataFrame.replace() function is used to replace a string, regex, list, dictionary, series, number, etc. in a pandas DataFrame in Python.

Syntax: DataFrame.replace(to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad')

Parameters:

● to_replace: [str, regex, list, dict, Series, numeric, or None] pattern that we are trying to replace in the DataFrame.
● value: Value to use to fill holes (e.g. 0); alternately a dict of values specifying which value to use for each column (columns not in the dict will not be filled). Regular expressions, strings and lists or dicts of such objects are also allowed.
● inplace: If True, performs the operation in place. Note: this will modify any other views on this object (e.g. a column from a DataFrame). Returns the caller if this is True.
● limit: Maximum size gap to forward or backward fill.
● regex: Whether to interpret to_replace and/or value as regular expressions. If this is True then to_replace must be a string. Otherwise, to_replace must be None because this parameter will be interpreted as a regular expression or a list, dict, or array of regular expressions.
● method: Method to use for replacement, when to_replace is a list.

Ex 2: We are going to replace the team "Boston Celtics" with "Omega Warrior" in the 'df' DataFrame.

# this will replace "Boston Celtics" with "Omega Warrior"
df.replace(to_replace="Boston Celtics", value="Omega Warrior")
Example: Here, we are replacing 49.50 with 60.

import pandas as pd

df = {
    "Array1": [49.50, 70],
    "Array2": [65.1, 49.50]
}

data = pd.DataFrame(df)

print(data.replace(49.50, 60))

Output:

   Array1  Array2
0    60.0    65.1
1    70.0    60.0
Output (Ex 2):

            Name           Team  Number Position  Age Height  Weight
1  Avery Bradley  Omega Warrior      11       PG   25    6.2   102.0
2    Jae Crowder  Omega Warrior      99       SF   25    5.6   135.0


w
w

Q. Detecting and Filtering Outliers

Outliers are data points that are significantly different from other observations in a dataset. They lie at an abnormal distance from other values in a random sample from a population. In simpler terms, outliers are the odd ones out - the data points that don't seem to fit the pattern established by the majority of the data.

(Figure: a box plot of goals scored per player.)

In the above example, the box plot illustrates the distribution of goals scored per player. The majority of players scored between 2 and 8 goals, as indicated by the interquartile range (the box). The whiskers extend to the minimum and maximum values within 1.5 times the interquartile range from the lower and upper quartiles, respectively.

However, there is an outlier at 20 goals, which is significantly higher than the rest of the data. This outlier represents a player who scored an exceptionally high number of goals compared to their peers.
Types of Outliers:

a) Univariate outliers: These are outliers that occur in a single variable or feature. For example, in a dataset of human heights, a recorded height of 3 meters would likely be a univariate outlier.
b) Multivariate outliers: These outliers only appear abnormal when considering the relationship between two or more variables. For example, a person's weight might not be an outlier by itself, but when considered in relation to their height, it might be identified as an outlier.
c) Global outliers: These are data points that are exceptional with respect to all other points in the dataset.
d) Local outliers: These are data points that are outliers with respect to their local neighborhood in the dataset, but may not be outliers in the global context.

Causes of outliers:
● Human errors, e.g. data entry errors
● Instrument errors, e.g. measurement errors
● Data processing errors, e.g. data manipulation
● Sampling errors, e.g. extracting data from wrong sources
Outlier Detection Methods

Outlier detection plays a crucial role in ensuring the quality and accuracy of machine learning models. By identifying and removing or handling outliers effectively, we can prevent them from biasing the model, reducing its performance, and hindering its interpretability. Here's an overview of common detection methods.

1. Statistical Methods:

● Z-Score: This method standardizes each data point using the mean and standard deviation of the data, and identifies as outliers those with Z-scores exceeding a certain threshold (typically 3 or -3).

Detecting Outliers with z-Scores:
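A rough z-score sketch on a numeric Series s (a threshold of 3 is assumed):

z = (s - s.mean()) / s.std()     # standardize the values
outliers = s[z.abs() > 3]        # points more than 3 standard deviations from the mean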

• Interquartile Range (IQR): IQR identifies outliers as data points falling outside the range defined by Q1 - k(Q3 - Q1) and Q3 + k(Q3 - Q1), where Q1 and Q3 are the first and third quartiles, and k is a factor (typically 1.5).
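A rough IQR sketch on the same numeric Series s, with k = 1.5:

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]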

2. Distance-Based Methods:
● K-Nearest Neighbors (KNN): KNN identifies outliers as data points whose K nearest neighbors are far away from them.
● Local Outlier Factor (LOF): This method calculates the local density of data points and identifies outliers as those with significantly lower density compared to their neighbors.

3. Clustering-Based Methods:
● Density-Based Spatial Clustering of Applications with Noise (DBSCAN): DBSCAN clusters data points based on their density and identifies outliers as points not belonging to any cluster.
● Hierarchical clustering: Hierarchical clustering involves building a hierarchy of clusters by iteratively merging or splitting clusters based on their similarity. Outliers can be identified as clusters containing only a single data point or clusters significantly smaller than others.
up

Detecting and filtering outliers:

Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data:

Ex:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 4))
df.describe()

              0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean     -0.034508     0.011824     -0.24911    -0.048428
std       1.023096     1.069927      1.03749    -0.37292
min      -2.99198     -2.69065       0.602409    0.048421
25%       0.735124     0.739318     -0.699221   -0.06537
50%       0.620213     0.009185      4.641272   -0.04647
75%       0.661472     0.728629      0.675814    0.568814
max       3.387850     3.691235      3.950033    3.085534

Suppose you want to find values in one of the columns whose absolute value is greater than 3:

col = df[1]
col[col.abs() > 3]

Output:

435    3.691235
Name: 1, dtype: float64

g
Plotting with pandas:

Q. Line Plots

A line plot is suitable for identifying trends and patterns over a continuous variable, which is usually time or a similar scale.

For example, suppose we want to see how sales vary by month. We create a small DataFrame and then draw a line plot with plot(kind="line"):

Ex:

import pandas as pd

# Create a sample dataset

data = {
    "Month": ["Jan", "Feb", "Mar", "Apr"],
    "Sales": [2500, 3000, 4000, 3500],
}

df = pd.DataFrame(data)

# Plot a line graph

df.plot(x="Month", y="Sales", kind="line", title="Monthly Sales",
        legend=True, color="red")


Q. Bar Plots

A bar plot is a plot that presents categorical data with rectangular bars. The lengths of the bars are proportional to the values that they represent.

We can create a bar plot by passing the categorical variable on the X-axis and numerical values on the Y-axis. Here we have a DataFrame; we will create a vertical bar plot and a horizontal bar plot using kind="bar" and kind="barh", and also use subplots.

Ex:

import pandas as pd

# Sample data
data = {
    "Category": ["A", "B", "C", "D"],
    "Values": [4, 7, 1, 8]
}

df = pd.DataFrame(data)

# Vertical bar plot
df.plot(x="Category", y="Values", kind="bar", subplots=True, layout=(2, 1), figsize=(12, 6))

# Horizontal bar plot using kind="barh"
df.plot(x="Category", y="Values", kind="barh", subplots=True)

Q. Histograms and Density Plots

Histograms are basically used to visualize one-dimensional data. Using the plot method we can create histograms as well. Let us illustrate the use of a histogram using the pandas plot method.

import pandas as pd
import numpy as np

# Sample data
data = {
    "Scores": [65, 70, 85, 90, 95, 80]
}

df = pd.DataFrame(data)

# Plot a histogram
df.plot(y="Scores", kind="hist", bins=5, title="Histogram of Scores",
        legend=False, figsize=(8, 6))
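The same column can be drawn as a density (KDE) plot, which the section title refers to; a quick sketch (this requires scipy to be installed):

# Plot a density curve of the same data
df.plot(y="Scores", kind="density", title="Density Plot of Scores")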

or
s.
te
da
Q. Scatter or Point Plots

To get the scatter plot of a DataFrame, all we have to do is call the plot() method with some parameters:

kind='scatter', x='some_column', y='some_column', color='some_color'

Ex: In this example the code creates a scatter plot using a DataFrame df with 'math_marks' on the x-axis and 'physics_marks' on the y-axis, plotted in red.

import pandas as pd
import matplotlib.pyplot as plt

data_dict = {
    'name': ['p1', 'p2', 'p3', 'p4', 'p5', 'p6'],
    'age': [20, 20, 21, 20, 21, 20],
    'math_marks': [100, 90, 91, 98, 92, 95],
    'physics_marks': [90, 100, 91, 92, 98, 95],
    'chem_marks': [93, 89, 99, 92, 94, 92]
}

df = pd.DataFrame(data_dict)

# Scatter plot using pandas
ax = df.plot(kind='scatter', x='math_marks', y='physics_marks',
             color='red', title='Scatter Plot')

# Customizing plot elements
ax.set_xlabel("Math Marks")
ax.set_ylabel("Physics Marks")
plt.show()

g
Explanation: This code creates a pandas DataFrame df with student data, including their names, ages and marks in Math, Physics and Chemistry. It then creates a scatter plot using the plot() method, where the math_marks are plotted on the x-axis and physics_marks on the y-axis.