Foundations of Data Science
P.Y. KUMAR
ABHYUDAYA MAHILA DEGREE COLLEGE
B.Sc Computer Science (Major), V Semester
UNIT-I
Introduction to Data Science: Need for Data Science, What is Data Science, Evolution of Data Science, Data Science Process, Business Intelligence and Data Science, Prerequisites for a Data Scientist, Tools and Skills required, Applications of Data Science in various fields, Data Security Issues.
Data Collection Strategies, Data Pre-Processing Overview, Data Cleaning, Data Integration and Transformation, Data Reduction, Data Discretization, Data Munging, Filtering.
UNIT-II
Descriptive Statistics: Mean, Standard Deviation, Skewness and Kurtosis; Box Plots, Pivot Table, Heat Map, Correlation Statistics, ANOVA.
No-SQL: Document Databases, Wide-column Databases and Graphical Databases.
UNIT-III
Python for Data Science: Python Libraries, Python Integrated Development Environments (IDEs) for Data Science.
NumPy Basics: Arrays and Vectorized Computation - The NumPy ndarray - Creating ndarrays - Data Types for ndarrays - Arithmetic with NumPy Arrays - Basic Indexing and Slicing - Boolean Indexing.
UNIT-IV
Introduction to pandas Data Structures: Series, Data Frame and Essential Functionality: Dropping
Entries- Indexing, Selection, and Filtering- Function Application and Mapping- Sorting and Ranking.
Summarizing and Computing Descriptive Statistics- Unique Values, Value Counts, and Membership.
Reading and Writing Data in Text Format.
UNIT-V
Data Cleaning and Preparation: Handling Missing Data; Data Transformation: Removing Duplicates, Transforming Data Using a Function or Mapping, Replacing Values, Detecting and Filtering Outliers. Plotting with pandas: Line Plots, Bar Plots, Histograms and Density Plots, Scatter or Point Plots.
UNIT-I

Q. What is Data Science?
Data Science is a field that gives insights from structured and unstructured data, using different scientific methods and algorithms, and consequently helps in generating insights, making predictions and devising data-driven solutions. It uses a large amount of data to get meaningful insights using statistics and computation for decision making.
The data used in Data Science is usually collected from different sources, such as e-commerce sites, surveys, social media, and internet searches. All this access to data has become possible due to the advanced technologies for data collection. This data helps in making predictions and providing profits to businesses accordingly. Data Science is one of the most discussed topics today and is a hot career option due to the great opportunities it has to offer.
It is widely used for a variety of purposes across different industries and sectors. Some of its key components and required skill areas include:
Statistical Analysis: Understanding statistical methods and techniques is crucial for interpreting data
and making inferences.
Data Manipulation and Analysis: Skills in handling, cleaning, and analysing large datasets are
fundamental. This often involves using Python libraries like Pandas.
Machine Learning: Understanding various machine learning algorithms and their applications is a key
part of data science.
Data Visualization: The ability to present data visually using tools like Matplotlib, Seaborn, or
Tableau helps in making data understandable to non-technical stakeholders.
Big Data Technologies: Knowledge of big data platforms like Hadoop or Spark can be important,
especially when dealing with very large datasets.
Python: Python is an object-oriented, general-purpose programming language known for its simple syntax and ease of use. It is often used for data analysis, building websites and software, and automating various tasks.
R: R is a programming language that caters to statistical computing and graphics. It is ideal for creating data visualizations and building statistical software.
SQL: Structured Query Language (SQL) is essential for data manipulation and retrieval from relational
databases. It's widely used for data extraction, transformation, and loading (ETL) processes.
Ex: Agricultural Optimization: In Kenya, a small-scale farming community faced challenges in crop yields due to unpredictable weather patterns and soil quality. By implementing data science, they were able to analyse satellite imagery and soil data to make informed decisions about crop rotation, irrigation schedules, and fertiliser application. This led to a significant increase in crop yield and sustainability, showcasing how data science can revolutionise traditional farming methods.
Data Science is used in many industries in the world today, e.g. banking, consultancy, healthcare, and
manufacturing.
Q. Evolution of Data Science
1) 1962: American mathematician John W. Tukey first articulated the data science dream. In his now-famous article "The Future of Data Analysis," he foresaw the inevitable emergence of a new field nearly two decades before the first personal computers. While Tukey was ahead of his time, he was not alone in his early appreciation of what would come to be known as "data science."
2) 1977: The theories and predictions of "pre" data scientists like Tukey and Naur became more concrete with the establishment of The International Association for Statistical Computing (IASC), whose mission was "to link traditional statistical methodology, modern computer technology, and the knowledge of domain experts in order to convert data into information and knowledge."
3) 1980s and 1990s: Data science began taking more significant strides with the emergence of the
first Knowledge Discovery in Databases (KDD) workshop and the founding of the International
Federation of Classification Societies (IFCS).
4) 1994: Business Week published a story on the new phenomenon of "Database Marketing." It
described the process by which businesses were collecting and leveraging enormous amounts of
data to learn more about their customers, competition, or advertising techniques.
5) 1990s and early 2000s: We can clearly see that data science has emerged as a recognized and
specialized field. Several data science academic journals began to circulate, and data science
proponents like Jeff Wu and William S. Cleveland continued to help develop and expound upon the
necessity and potential of data science.
6) 2000s: Technology made enormous leaps by providing nearly universal access to internet
connectivity, communication, and (of course) data collection.
7) 2005: Big data enters the scene. With tech giants such as Google and Facebook uncovering large
amounts of data, new technologies capable of processing them became necessary. Hadoop rose to
the challenge, and later on Spark and Cassandra made their debuts.
8) 2014: Due to the increasing importance of data and organizations' interest in finding patterns and making better business decisions, demand for data scientists began to see dramatic growth in different parts of the world.
9) 2015: Machine learning, deep learning, and artificial intelligence (AI) officially enter the realm of data science.
10) 2018: New regulations in the field are perhaps one of the biggest aspects of the evolution of data science.
Q. Data Science Process

The data science process gives a clear step-by-step framework to solve problems using data. It maps out how to go from a business issue to answers and insights using data. Key steps include defining the problem, collecting data, cleaning data, exploring, building models, testing, and putting solutions to work.
Steps in the Data Science Process
1. Problem Definition
The first step in any data science project involves clearly defining the problem at hand. This includes understanding the business objectives and identifying how data can contribute to solving the problem. The definition should be specific, measurable, and aligned with organizational goals. It sets the foundation for the entire data science process and ensures that efforts are focused on addressing the right questions.
2. Data Collection
Once you have a clear understanding of the problem, the next step in the data science process is to
collect the relevant data. This can be done through various means such as:
Web Scraping: This method is useful for gathering data from public websites. It requires knowledge of web technologies and legal considerations to avoid violating terms of service.
APIs: Many online platforms, such as social media sites and financial data providers, offer APIs to access their data. Using APIs ensures that you get structured data, often in real time, which is crucial for time-sensitive analyses.
Databases: Internal databases are gold mines of historical data. SQL is the go-to language for querying relational databases, while NoSQL databases like MongoDB are used for unstructured data.
Surveys and Sensors: Surveys are effective for collecting user opinions and feedback, while sensors are invaluable in IoT applications for gathering real-time data from physical devices.
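Ex (Python): a minimal sketch of pulling structured data from an API with the requests library and loading it into pandas for the later steps; the URL, query parameter, and returned fields are placeholders, not a real service.

import requests
import pandas as pd

# Hypothetical REST endpoint that returns a JSON list of records (placeholder URL).
API_URL = "https://api.example.com/v1/sales"

response = requests.get(API_URL, params={"from": "2024-01-01"}, timeout=10)
response.raise_for_status()      # stop early on HTTP errors

records = response.json()        # assume the API returns a list of dictionaries
df = pd.DataFrame(records)       # structured, tabular data ready for cleaning and EDA
print(df.head())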
3. Data Exploration
Exploratory Data Analysis (EDA) is a crucial step in the data science process where you explore the data to uncover patterns,
anomalies, and relationships. This involves:
● Visualization: Creating plots such as histograms, scatter plots, and box plots to visualize data
distributions and relationships.
● Correlation analysis: Identifying relationships between different variables.
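Ex (Python): a minimal EDA sketch with pandas and Matplotlib; the file name and the columns revenue and ad_spend are assumptions for illustration.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sales.csv")                    # hypothetical dataset

# Histogram: distribution of a numeric column
df["revenue"].plot(kind="hist", bins=30, title="Revenue distribution")
plt.show()

# Scatter plot: relationship between two variables
df.plot(kind="scatter", x="ad_spend", y="revenue", title="Ad spend vs revenue")
plt.show()

# Correlation analysis between numeric columns (numeric_only needs pandas 1.5+)
print(df.corr(numeric_only=True))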
4. Data Modeling
Model Planning:
● In this stage, you need to determine the method and technique to draw the relation
between input variables.
● Planning for a model is performed by using different statistical formulas and visualization tools. SQL Analysis Services, R, and SAS/ACCESS are some of the tools used for this purpose.
Model Building:
In this step, the actual model building process starts. Here, the data scientist splits the dataset into training and testing sets.
Techniques like association, classification, and clustering are applied to the training data set. The
model, once prepared, is tested against the "testing" dataset.
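Ex (Python): a minimal sketch of splitting a dataset into training and testing sets and fitting a classification model with scikit-learn; the built-in Iris dataset stands in for real project data.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                 # toy dataset used as a stand-in

# Distribute the data into training and testing sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = DecisionTreeClassifier(max_depth=5)       # a simple classification technique
model.fit(X_train, y_train)                       # learn from the training set
predictions = model.predict(X_test)               # predict on the held-out test set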
5. Evaluation:
● After creating models, it's essential to evaluate their performance. This involves assessing
how well the models generalize to new, unseen data.
● Evaluation metrics, such as accuracy, precision, recall, or others depending on the problem, are used to quantify the model's effectiveness. Rigorous evaluation ensures that the chosen models provide meaningful and reliable results.
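Ex (Python): continuing the model-building sketch above, scikit-learn's metrics quantify how well the model generalizes to the unseen test set; macro averaging is used because the toy dataset has more than two classes.

from sklearn.metrics import accuracy_score, precision_score, recall_score

# y_test and predictions come from the model-building sketch above
print("Accuracy :", accuracy_score(y_test, predictions))
print("Precision:", precision_score(y_test, predictions, average="macro"))
print("Recall   :", recall_score(y_test, predictions, average="macro"))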
6. Deployment:
● Once a model has proven effective, it is integrated into operational systems for practical use.
Deployment involves making the model available to end-users or incorporating it into
business processes.
● This step requires collaboration between data scientists and IT professionals to ensure a
seamless transition from development to implementation.
● Data science projects don't end with deployment; they require ongoing monitoring. Models
can drift in performance over time due to changes in the underlying data distribution or
other factors.
● Regular monitoring helps detect and address these issues, ensuring that models continue to
provide accurate and relevant insights.
● Maintenance activities may include updating models, retraining them with new data, or
adapting to changes in the business environment. This iterative process ensures the
sustainability and longevity of the data science solution.
Business intelligence (BI):
● Business intelligence (BI) is a process that involves gathering, storing, analysing, and presenting data to help organisations make informed decisions. BI tools are designed to collect and analyze data from various sources and present it in a way that is easy to understand and use.
● Business intelligence is focused on using historical data to identify trends and patterns and providing actionable insights to business users. BI tools typically use a combination of charts, graphs, and dashboards to present this information.
● BI tools can be used for a variety of purposes, such as financial reporting, customer analytics, and supply chain management. They can also be used to monitor key performance indicators (KPIs) and track progress towards organisational goals.
Business intelligence process:
1. Data gathering involves collecting information from a variety of sources, either external (e.g., market data providers, industry analytics, etc.) or internal (Google Analytics, CRM, ERP, etc.).
2. Data cleaning/standardization means preparing collected data for analysis by validating data quality, ensuring its consistency, and so on.
3. Data storage refers to loading data in the data warehouse and storing it for further usage.
4. Data analysis is the automated process of turning raw data into valuable, actionable insights.
5. Reporting involves generating dashboards, graphical imagery, or other forms of readable visual representations of analytics results that users can interact with or extract actionable insights from.
Examples of BI in Action
1. Finance: Financial institutions use BI for risk management, fraud detection, and performance analysis.
2. Manufacturing: BI helps manufacturers optimize production processes, manage supply chains, and ensure product quality.
3. Education: Educational institutions use BI to track student performance, manage resources, and improve educational outcomes.
Data Science:
Data science involves using advanced analytical techniques to extract insights from data. Data scientists use statistical models, machine learning algorithms, and other data mining techniques to identify patterns and relationships in large datasets.
Data science is focused on understanding the underlying data and using it to make predictions about future outcomes. Data scientists use a variety of tools, such as programming languages like Python and R, to manipulate and analyse this data.
Data science can be used for a variety of purposes, such as predictive modelling, fraud detection, and natural language processing. Data scientists are responsible for designing and implementing data-driven solutions that can help organisations improve their business processes and make informed decisions.
Business Intelligence vs. Data Science:
● Focus and purpose: BI analyzes historical data to describe what has happened and report on business performance; Data Science uses historical and current data to predict future outcomes.
● Tools and Technologies: BI uses tools like Power BI, Tableau, QlikView, and SQL-based reporting platforms for creating visual insights; Data Science uses technologies like Python, R, TensorFlow, and Jupyter Notebooks for predictive modeling and analysis.
● Data Types: BI handles structured data stored in databases, spreadsheets, or data warehouses; Data Science works with structured, semi-structured, and unstructured data from diverse sources.
● Speed: BI offers faster deployment for generating insights due to straightforward workflows and tools; Data Science is slower due to iterative model training, testing, and optimization.
● Decision-making: BI supports operational and tactical decisions by presenting clear, real-time metrics; Data Science drives strategic decisions by predicting trends and solving complex problems.
Data scientist:
Role: A Data Scientist is a professional who manages enormous amounts of data to come up with
compelling business visions by using various tools, techniques, methodologies, algorithms, etc.
1. Build a strong educational foundation: A bachelor's degree in data science, computer science, statistics, or applied math is a solid starting point. Alternatively, online courses and bootcamps can help introduce you to the field.
2. Master core data science skills: Including, but not limited to, programming, statistics, machine learning, data wrangling and visualization, as well as soft skills like communication.
3. Gain practical experience: Use internships, personal projects, or competitions to gather examples of hands-on work.
4. Keep learning: Stay up-to-date with new advancements, engage with data science communities, and continue sharpening your skills.
Technical skills:
1) Python: Python is central to the growth of AI and data science and arguably the most important skill to learn for data science exploration. The programming language is a favorite for data scientists, developers, and AI experts thanks to the simplicity of its syntax, the wealth of open-source libraries created for the language that drive efficiencies in building algorithms and apps, and its diverse applications across data analysis and AI.
2) R: The R programming language is most often used for the statistical analysis of large datasets and was the preferred tool in the data science community for many years as an ideal resource for plotting and building data visualizations. R sees continued strong use in academic circles. However, the fast performance of Python has made it the optimal choice for the convergence of data science and AI applications.
3) Machine Learning:
Machine learning is a process used to develop algorithms that perform tasks without being explicitly programmed by the user. Various big companies like Netflix and Instagram use it; they build algorithms using machine learning to generate excellent features for their customers. Machine learning is a skill set that allows you to build a predictive model and algorithm framework that uncovers patterns and predicts outcomes, improving data-driven business strategies.
4) Statistical Analysis:
Statistical analysis provides the fundamental concepts for interpreting data and validating findings. This includes various tests, distributions, and regression models. Proficiency in such analysis allows you to make informed, data-driven decisions, assess the reliability of your models, and derive meaningful conclusions from the data.
5) SQL Skills:
SQL (Structured Query Language) is one of the critical data scientist skills. It is a standard tool in the industry that allows you to manage and communicate with relational databases. Relational databases allow you to store structured data in tables using columns and rows, and a significant amount of organizational data is stored in such databases.
Non-Technical Skills
1. Critical thinking: The ability to objectively analyze questions, hypotheses, and results, understand which resources are necessary to solve a problem, and consider different perspectives on a problem.
3. Proactive problem solving: The ability to identify opportunities, approach problems by identifying existing assumptions and resources, and use the most effective methods to find solutions.
4. Intellectual curiosity: The drive to find answers, dive deeper than surface results and initial assumptions, think creatively, and constantly ask "why" to gain a deeper understanding of the data.
5. Teamwork: The ability to work effectively with others, including cross-functional teams, to achieve common goals. This includes strong collaboration, communication, and negotiation skills.
Data science tools are application software or frameworks that help data science professionals perform various data science tasks like analysis, cleansing, visualization, mining, reporting, and filtering of data. A Data Scientist needs to master the fields of machine learning, statistics and probability, analytics, and programming. He needs to perform many tasks such as preparing the data, analyzing the data, building models, drawing meaningful conclusions, predicting the results, and optimizing the results.
For performing these various tasks efficiently and quickly, Data Scientists use different Data Science tools.
a) Tableau:
Tableau is an interactive visualization software packaged with strong graphics. The company focuses on the business intelligence sector. Tableau's most significant element is its capacity to interface with databases, spreadsheets, OLAP cubes, etc. Tableau can also visualize geographic data, plotting longitudes and latitudes on maps. You can also use its analytics tool to evaluate the information together with visualizations, and you can share your results on the online platform with Tableau's active community. While Tableau is commercial software, Tableau Public is available as a free version.
b) MS Excel:
It is the most fundamental & essential tool that everyone should know. For freshers, this tool helps in
easy analysis and understanding of data. MS Excel comes as a part of the MS Office suite. Freshers
and even seasoned professionals can get a basic idea of what the data wants to say before getting
into high-end analytics. It can help in quickly understanding the data, comes with built-in formulae,
and provides various types of data visualization elements like charts and graphs. Through MS Excel,
data science professionals can represent the data simply through rows and columns. Even a non-
technical user can understand this representation.
Proficiency in programming languages is crucial for data manipulation, analysis, and machine
learning. Important languages include:
● Python: The most popular language for data science, Python's libraries like NumPy, Pandas,
and Scikit-learn make it perfect for data manipulation, analysis, and machine learning.
● R: Specializes in statistical analysis and has extensive libraries for data science, such as ggplot2 for visualization and caret for machine learning.
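Ex (Python): a minimal sketch of the NumPy and pandas basics referred to above; the values are illustrative only.

import numpy as np
import pandas as pd

# NumPy: vectorized computation on an ndarray (no explicit loop)
arr = np.array([1.0, 2.0, 3.0, 4.0])
print(arr * 10)

# pandas: the same values with labels, plus quick summary statistics
s = pd.Series(arr, index=["a", "b", "c", "d"])
print(s.mean(), s.std())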
Data Scientist:
A Data Scientist is an expert who examines data to identify patterns, trends, and insights that aid in
problem-solving and decision-making. They analyze and forecast data using tools like machine
learning, statistics, and programming. Data scientists transform unstructured data into
understandable, useful information that companies can utilize to enhance operations and make
future plans.
Skills Required for Data Scientists
Technical Skills Required for a Data Scientist:
1. Programming: One of the most important skills needed for data scientists is proficiency in programming languages like Python and R. It is the very foundation of data science and offers versatile tools like Pandas and TensorFlow for data analysis, manipulation, and interpretation. Machine learning algorithms are a huge part of the programming process.
2. Mathematics and Statistics: Data scientist skills involve a deep understanding of statistics and mathematical formulations. This includes knowledge of probability, hypothesis testing theories, and linear algebraic equations. Data scientists use a lot of calculations to interpret data accurately; that's the best way to make predictions.
For example, in a hospital, data scientist skills like statistics can be applied to understanding patient information, trends in illnesses, and the effectiveness of each treatment method.
3. Machine Learning Algorithms: This is a key data scientist skill. Machine learning is an integral part of teaching AI and computer applications how to read data, and skilled professionals play a huge part in its accuracy. Algorithms are used by data scientists for creating analytics models. Moreover, data scientist skills go further than ML, with deep learning frameworks like PyTorch. This adds significant value to your resume because you can deal with more complex data.
4. Data Visualization and Handling: One of the top data scientist skills involves the process of data wrangling. This is the process where you understand raw data, clean it, and structure it, ready for analysis. Data visualization comes afterwards; you use these data analyses to communicate meaning, trends, and patterns accurately.
5. Database Management Systems: Companies use large datasets for their organizational decisions. SQL, or Structured Query Language, is essential for managing these datasets. Data scientist skills such as expertise in relational and non-relational databases can open doors for you to handle huge amounts of data.
6. Problem-Solving: One must ensure the capability to identify and develop both creative and effective solutions as and when required. Problem-solving is a critical skill in data science. Data scientists must:
● Break Down Complex Problems: Define the problem, find patterns in the data, and devise data-driven solutions.
● Creativity: Think outside the box to address unique challenges, often involving new approaches to modeling or data wrangling.
7. Communication Skills: Strong communication skills are necessary for conveying findings and insights effectively. Key areas include:
● Presenting results in terms that non-technical stakeholders understand.
1. Exploratory Data Analysis (EDA):
EDA is an integral part of the data analysis process that focuses on summarizing the characteristics of a dataset. Key areas include:
● Outlier Detection: Finding unusual data points that can impact model accuracy.
● Statistical Analysis: Includes measures such as mean, median, mode, standard deviation, and
correlation.
2. Data Visualization:
Data visualization helps communicate insights clearly. Tools like Tableau are important for:
● Creating Visual Representations: Bar charts, histograms, line plots, and heatmaps are
examples that help to interpret data.
● Storytelling with Data: Visuals allow non-technical stakeholders to understand the insights
generated by data scientists.
3. Data Wrangling and Preprocessing:
This refers to the transformation and mapping of raw data into a more usable format. Raw data needs to be cleaned and preprocessed before analysis. Key techniques include handling missing values, removing duplicates, and correcting errors.
Q. Applications of Data Science in Various Fields

Finance:
● Fraud Detection and Prevention: Machine learning models can detect unusual
patterns and anomalies in financial transactions, helping prevent fraud.
● Credit Risk Assessment: By analyzing historical data, machine learning models can
assess credit risk more accurately than traditional methods.
● Investment Recommendations: Data science can recommend investment portfolios
based on customer risk profiles and predicted market trends.
Healthcare:
Retail:
● Inventory Management: Data science can forecast demand for various products,
optimizing inventory levels.
● Product Recommendations: Recommending similar products based on customer
search queries and interests can increase sales.
● Customer Reviews Analysis: Analyzing customer reviews helps identify common issues with products, informing improvements.
● Chatbots: AI-powered chatbots can handle customer queries efficiently, improving customer service.
Automotive:
● Predictive Maintenance: Data science models can predict part failures, reducing downtime and maintenance costs.
● Autonomous Vehicles: Machine learning enables the development of self-driving cars that make real-time decisions like human drivers.
Manufacturing:
● Predictive Maintenance: Similar to automotive, predictive maintenance models in manufacturing can reduce production line downtime.
● Supply Chain Optimization: Data science optimizes supply chain processes and improves efficiency.
Education:
Social Media:
E-commerce:
Government:
● Public Safety: Using data analytics to predict crime hotspots, optimize emergency response, and improve public safety.
● Urban Planning: Analyzing population data, traffic patterns, and resource usage to inform urban development and infrastructure planning.
● Policy Analysis: Evaluating the effectiveness of public policies using data-driven
insights.
● Resource Allocation: Optimizing the allocation of public resources based on need
and demand.
Education:
● Personalized Learning: Tailoring educational content and approaches to individual students' needs and learning styles based on data analysis.
● Student Success Prediction: Identifying students at risk of academic difficulties or dropout so that timely support can be provided.
Agriculture:
● Precision Agriculture: Optimizing crop yields and resource usage through data-driven decisions.
● Yield Prediction: Predicting crop yields based on various factors such as weather, soil, and historical data.
Q. Data Security Issues

Data security is the process of protecting digital information from unauthorized access, corruption,
and theft throughout its lifecycle. It encompasses the entire spectrum of information security,
including physical protection for hardware and secure storage devices, administrative controls to
manage access, and logical security measures for software applications.
✓ Policies and procedures strengthen these defenses, ensuring data stays secure
across all environments.
✓ Data security measures include encryption, firewalls, access controls, intrusion detection systems, and data backup.
✓ The main goal of data security is to prevent data breaches and unauthorized access and to protect data from external threats.
Types of Data Security
1. Encryption: Encryption is a process that converts data into an unreadable format using algorithms. This protects data both:
● When it is stored (data at rest)
● When it is being transmitted (data in transit)
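Ex (Python): a minimal sketch of symmetric encryption using the third-party cryptography package (an assumption, not named in the syllabus); key storage and management are deliberately left out.

from cryptography.fernet import Fernet

key = Fernet.generate_key()          # secret key; must be stored securely in practice
cipher = Fernet(key)

plaintext = b"card_number=4111111111111111"
token = cipher.encrypt(plaintext)    # unreadable ciphertext (protects data at rest)
print(token)

restored = cipher.decrypt(token)     # only holders of the key can recover the data
assert restored == plaintext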
2. Data Erasure:
Data erasure securely removes data when sensitive data is no longer needed. Unlike basic file deletion, data erasure uses software to completely overwrite the data so that it cannot be recovered.
3. Data Masking:
✓ Data masking involves altering data so that it is no longer sensitive, making it accessible to employees or systems without exposing confidential information.
✓ For example, in a customer database, personal information like names or credit card
numbers may be masked with random characters. It allows access for testing or
analytics purposes without revealing the actual data.
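Ex (Python): a minimal sketch of masking a credit card number so that only the last four digits remain visible; the function name and format are illustrative.

def mask_card(card_number: str, visible: int = 4) -> str:
    """Replace all but the last `visible` digits with '*' characters."""
    return "*" * (len(card_number) - visible) + card_number[-visible:]

print(mask_card("4111111111111111"))   # ************1111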
4. Data Resiliency
Data resiliency refers to the ability of an organization's data systems to recover quickly from
disruptions, such as:
● Cyberattacks
● Hardware failures
● Natural disasters
Data Security Threats:
a) Accidental exposure:
✓ Many data breaches are not a result of hacking but of employees accidentally or negligently exposing sensitive information.
✓ Employees can easily lose, share, or grant access to data to the wrong person, or mishandle or lose information because they are not aware of their company's security policies.
b) Phishing attacks:
✓ In a phishing attack, a cyber criminal sends messages, typically via email, short message service (SMS), or instant messaging services, that appear to be from a trusted sender.
✓ These messages trick users into clicking malicious links that cause them to download malware or visit a spoofed website, enabling the attacker to steal their login credentials or financial information.
✓ These attacks can also help an attacker compromise user devices or gain access to corporate networks. Phishing attacks are often paired with social engineering, which manipulates victims into handing over sensitive information.
c) Insider threats:
One of the biggest data security threats to any organization is its own employees. Insider threats are individuals who intentionally or inadvertently put their own organization's data at risk. They come in three types:
1. Compromised insider: The employee does not realize their account or credentials have been
compromised. An attacker can perform malicious activity posing as the user.
2. Malicious insider: The employee actively attempts to steal data from their organization or
cause harm for their own personal gain.
3. Nonmalicious insider: The employee causes harm accidentally, through negligent behavior, by not following security policies or procedures, or by being unaware of them.
d) Malware:
Malicious software is typically spread through email- and web-based attacks. Attackers use malware to infect computers and corporate networks by exploiting vulnerabilities in software such as web browsers or web applications. Malware can lead to serious data security events like data theft, extortion, and network damage.
Q. Data Collection Strategies

A data collection strategy is the collection of methods that will be utilized to get accurate and reliable data from different data sources.

1. Surveys and Feedback Forms
✓ Feedback forms and surveys are systematic tools for collecting participants' responses in a quantifiable format.
✓ You can distribute forms via email or host them online using Google Forms or SurveyMonkey and gather data in real time. This technique is beneficial for gauging customer satisfaction and assessing employee engagement.

2. Interviews and Focus Groups
✓ Interviews and focus groups involve in-depth discussions, and data is collected by
observing and understanding people's thoughts and behaviors.
✓ While this method provides the flexibility to modify questions based on the participant's answers, it can be time-consuming.
✓ Additionally, the interviewer's questioning style, body language, and perception can
influence the responses.
3. Web Scraping
✓ You can use tools like Scrapy or BeautifulSoup to collect real-time information on competitors and market trends. However, you should be mindful of terms-of-service violations and data privacy laws.
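Ex (Python): a minimal BeautifulSoup sketch, assuming a page you are permitted to scrape; the URL and the tag/class being searched for are placeholders.

import requests
from bs4 import BeautifulSoup

# Placeholder URL; always check robots.txt and the site's terms of service first.
URL = "https://example.com/products"

html = requests.get(URL, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Collect the text of every <h2 class="product-name"> element (assumed markup)
names = [tag.get_text(strip=True) for tag in soup.find_all("h2", class_="product-name")]
print(names)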
4. Log Files
✓ Log files are detailed records generated by servers, applications, or devices. They capture events such as errors, user activity, and system performance.
✓ You can use Splunk or the ELK Stack to analyze and visualize log files and gain insights into system behaviour and usage patterns.
5. API Integration
6. Transactional Tracking
✓ Transactional tracking involves collecting your customers' purchase data. You can
capture information on purchased product combinations, delivery locations, and more by
monitoring transactions made through websites, third-party services, or in-store point-
of-sale systems.
✓ Analyzing this data lets you optimize your marketing strategies and target ideal customer segments.
7. Document Review
✓ It is often used in legal, academic, or historical research to analyze trends over time.
You can use advanced document scanning and optical character recognition (OCR) tools
to digitize and store required information.
8. Mobile Data Collection
✓ In this technique, mobile devices are used to collect real-time data directly from the user through apps, surveys, and GPS tracking.
✓ The widespread use of mobiles and tablets makes it ideal for on-the-go data gathering.
9. Social Media Monitoring
✓ Many social media platforms have data analytics features that help you track your audience engagement and campaign performance.

10. Data Warehousing
✓ Data warehousing allows you to collect large volumes of data and store them in a centralized repository.
✓ You can use cloud-based solutions like Snowflake or Amazon Redshift for this. The
data collection method is highly scalable and allows you to consolidate data for better
insights.
Q. Data Pre-Processing Overview

Real-world datasets are generally messy, raw, incomplete, inconsistent, and unusable. They can contain manual entry errors, missing values, inconsistent schema, etc. Data Preprocessing is the process of converting raw data into a format that is understandable and usable. It is a crucial step in any Data Science project to carry out an efficient and accurate analysis. It ensures that data quality is consistent before applying any Machine Learning or Data Mining techniques.
a) Data Collection
The first step in any data preprocessing pipeline is gathering the necessary data. This may involve querying databases, accessing APIs, scraping websites, or importing data from various file formats like CSV, JSON, or Excel. It's important to ensure that you have the right permissions and comply with relevant data protection regulations.
b) Data Understanding
Before getting into cleaning and transformation, it's essential to understand your data. This involves examining the structure of your dataset, checking data types, looking for patterns, and identifying potential issues.
c) Data Cleaning
This step involves handling missing data, removing duplicates, correcting errors, and dealing with
outliers.
● Handling missing data: You might choose to drop rows with missing values, fill them with a
specific value (like the mean or median), or use more advanced imputation techniques.
● Removing duplicates: Duplicate records can skew your analysis and should be removed.
● Correcting errors: This might involve fixing typos, standardizing formats (e.g., date formats)or
correcting impossible values.
● Dealing with outliers: Outliers can be legitimate extreme values or errors. You need to
investigate them and decide whether to keep, modify, or remove them.
d) Data Transformation
This step involves modifying the data to make it more suitable for analysis or modeling. Common transformations include:
● Normalization or standardization: Scaling numerical features to a common range.
● Encoding categorical variables: Converting categorical data into numerical format.
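Ex (Python): a minimal sketch of the two transformations listed above, using scikit-learn's MinMaxScaler for normalization and pandas get_dummies for categorical encoding; the column values are illustrative.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "height_cm": [150, 165, 180, 172],
    "city": ["Guntur", "Vijayawada", "Guntur", "Nellore"],
})

# Normalization: scale the numeric column into the range [0, 1]
df["height_scaled"] = MinMaxScaler().fit_transform(df[["height_cm"]]).ravel()

# Encoding categorical variables: convert the city column into 0/1 indicator columns
df = pd.get_dummies(df, columns=["city"])
print(df)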
e) Data Reduction
For large datasets, it might be necessary to reduce the volume of data while preserving as much information as possible.
● Feature selection: Choosing the most relevant features for your analysis.
● Dimensionality reduction: Using techniques like Principal Component Analysis (PCA) to reduce the number of features while retaining most of the information.
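Ex (Python): a minimal dimensionality-reduction sketch with scikit-learn's PCA; the built-in Iris dataset (4 features) stands in for a wider real dataset.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)          # 150 rows x 4 numeric features

pca = PCA(n_components=2)                  # keep only 2 principal components
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)          # (150, 4) -> (150, 2)
print(pca.explained_variance_ratio_.sum())     # share of variance retained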
f) Data Validation
The final step is to validate your preprocessed data to ensure it meets the requirements for your analysis or modeling task. This might involve checking data types, value ranges, and consistency constraints.
These steps form the core of the data preprocessing pipeline. However, the specific techniques and
their order may vary depending on the nature of your data and the requirements of your data
science project.
Data Preprocessing is important in the early stages of a Machine Learning and AI application development lifecycle. A few of the most common applications include:
● Improved Accuracy of ML Models - Various techniques used to preprocess data, such as data cleaning and transformation, ensure that data is complete, accurate, and understandable, resulting in efficient and accurate ML models.
● Reduced Costs - Data reduction techniques can help companies save storage and compute costs by reducing the volume of the data.
● Visualization - Preprocessed data is easily consumable and understandable and can be further used to build dashboards to gain valuable insights.
Q. Data Cleaning
Data cleaning is the process of detecting and correcting errors or inconsistencies in your data to improve its quality and reliability. Raw data, which is data in its unprocessed form, is often riddled with issues that can negatively impact the results of analysis. These issues can include:
● Duplicates: When the same data point appears multiple times in a dataset.
● Errors: This can include typos, spelling mistakes, or even data entry errors.
Data cleaning helps ensure that the data you're analyzing is accurate and reliable, which is crucial for making sound decisions.
The steps involved in the data cleaning process each address a different kind of discrepancy in the dataset. To achieve high-quality data, you can perform the following data-cleaning steps:
✓ Before beginning the data cleaning process, it is crucial to assess the raw data and identify your requirements or desired output from the dataset.
✓ This helps you focus on the specific parts of the data, thus saving your time and
resources.
✓ You can generally observe repetitive data while extracting it from multiple sources into a centralized repository.
✓ Such values take up unnecessary space in your dataset and often result in flawed analysis. Using data cleaning tools or techniques, you can easily locate and remove duplicate or irrelevant values.
✓ Errors mainly occur while migrating or transferring data from one place to another, so applying a quick data check in such a scenario ensures the credibility of your dataset.
✓ Outliers are unusual values in a dataset that differ greatly from the existing values. Although the presence of such values can be fruitful for research purposes, they can also impact your data analysis process.
✓ Data values can be lost or removed during the extraction process, leading to
inefficient data analytics.
✓ Therefore, before using your data for business operations, you must scan the dataset
thoroughly and look for any missing values, blank spaces, or empty cells in it.
✓ Once the above steps are completed, you must perform a final quality check on your dataset to ensure its authenticity, consistency, and structure.
✓ To facilitate this process quickly, you can also leverage AI or machine learning capabilities to verify data. This helps your organization work with reliable data and use it for seamless analysis and visualization.
Some important data-cleaning techniques:
✓ Remove duplicates
✓ Standardize capitalization
✓ Clear formatting
✓ Fix errors
✓ Language translation
Benefits of data cleaning:
1. Enables reliable analytics: Dodgy data can distort analytics. Cleaning ensures quality data that leads to accurate insights.
2. Improves decision making: You can make better-informed strategic and operational decisions with clean, trustworthy data.
3. Increases efficiency: You waste less time gathering data that could be faulty and useless. Cleaning
weeds these issues out.
4. Saves money: Bad data can lead to costly errors, while good data cleaning practices can help you
save money in the long run.
5. Builds trust: Dependable data that tells the truth about your business performance helps build
stakeholder confidence.
6. Supports automation: Artificial intelligence (AI) and machine learning (ML) driven automation
need clean data. Otherwise, they may amplify existing data problems.
7. Ensures compliance: In regulated industries, meticulous data quality controls help you support
compliance.
Q. Data Integration and Transformation
Data integration:
Data integration is the process of combining data from different sources into a single, unified view. This process includes activities such as data cleaning, transformation, and loading (ETL), as well as real-time streaming and batch processing.
✓ By aligning data across systems, organisations can eliminate data silos and uncover hidden relationships, leading to enhanced analytics and more accurate predictive models.
✓ Without data integration, data would remain fragmented and inconsistent, making it difficult to achieve the comprehensive, reliable view needed for discovering insights and informed decision-making.
Data integration is particularly important in the healthcare industry. Integrated data from various
patient records and clinics assist clinicians in identifying medical disorders and diseases by
integrating data from many systems into a single perspective of beneficial information from which
useful insights can be derived. Effective data collection and integration also improve medical
insurance claims processing accuracy and ensure that patient names and contact information are
recorded consistently and accurately. Interoperability refers to the sharing of information across
different systems.
Data Integration Approaches
There are mainly two kinds of approaches to data integration in data mining, as mentioned below.
Tight Coupling
✓ This approach involves the creation of a centralized database that integrates data from different sources. The data is loaded into the centralized database using extract, transform, and load (ETL) processes.
✓ In this approach, the integration is tightly coupled, meaning that the data is physically stored in a central database, and any updates or changes made to the data sources are reflected in the central database.
✓ Tight coupling is suitable for situations where real-time access to the data is required and consistency is critical. However, this approach can be costly and complex, especially when dealing with large volumes of data.
Loose Coupling
✓ This approach involves the integration of data from different sources without physically storing the data in a centralized database.
✓ In this approach, data is accessed from the source systems as needed and combined in real time to provide a unified view. This approach uses middleware, such as application programming interfaces (APIs) and web services, to connect the source systems and access the data.
✓ Loose coupling is suitable for situations where real-time access to the data is not critical and the data sources are highly distributed. This approach is more cost-effective and flexible than tight coupling but can be more complex to set up and maintain.
Integration tools
There are various integration tools in data mining. Some of them are as follows:
● Open-source data integration tool: If you want to avoid pricey enterprise solutions, an open-source data integration tool is the ideal alternative. However, you will be responsible for the security and privacy of the data if you're using the tool.
● Cloud-based data integration tool: A cloud-based data integration tool may provide an 'integration platform as a service'.
Data Transformation:
a) Data transformation refers to the process of converting, cleaning, and manipulating raw data into a structured format that is suitable for analysis or other data processing tasks.
b) Raw data can be challenging to work with and difficult to filter. Often, the problem isn't how to collect data; it is how to make the collected data usable.
c) To curate appropriate, meaningful data and make it usable across multiple systems, businesses must leverage data transformation.
1. Data Discovery: During the first stage, data teams work to understand and identify applicable raw data. By profiling data, analysts/engineers can better understand the transformations that need to occur.
2. Data Mapping: During this phase, analysts determine how individual fields are modified, matched, filtered, joined, and aggregated.
3. Data Extraction: During this phase, data is moved from a source system to a target system. Extraction may include structured data (databases) or unstructured data (event streaming, log files) sources.
4. Code Generation and Execution: Once extracted and loaded, transformation needs to occur on the raw data to store it in a format appropriate for BI and analytics use. This is frequently accomplished by analytics engineers, who write SQL/Python to programmatically transform data. This code is executed daily/hourly to provide timely and appropriate analytic data.
5. Review: Once implemented, code needs to be reviewed and checked to ensure a correct and appropriate implementation.
6. Sending: The final step involves sending data to its target destination. The target might be a data warehouse, database, or analytics application.
Common data transformation techniques include:
● Discretization: This data transformation technique creates interval labels in continuous data to improve efficiency and enable easier analysis. The process utilizes decision tree algorithms to transform a large dataset into compact categorical data.
● Generalization: Utilizing concept hierarchies, generalization converts low-level attributes to high-level ones, creating a clear data snapshot.
● Attribute Construction: This technique allows a dataset to be organized by creating new attributes from an existing set.
● Normalization: Normalization transforms the data so that the attributes stay within a specified range for more efficient extraction and data mining applications.
● Manipulation: Manipulation is the process of changing or altering data to make it more readable and organized. Data manipulation tools help identify patterns in the data and transform it into a usable form.
Q. Data Reduction
When you collect data from different data warehouses for analysis, it results in a huge amount of data. It is difficult for a data analyst to deal with this large volume of data. It is even difficult to run complex queries on a huge amount of data as it takes a long time. This is why reducing data becomes important.
Data reduction techniques ensure the integrity of data while reducing the data. Data reduction is a
process that reduces the volume of original data and represents it in a much smaller volume.
● Data reduction techniques are used to obtain a reduced representation of the dataset that is smaller in volume while maintaining the integrity of the original data.
● By reducing the data, the efficiency of the data mining process is improved while producing the same analytical results.
● Data reduction does not affect the result obtained from data mining; the result obtained from data mining before and after data reduction is the same or almost the same.
● Data reduction aims to represent the data more compactly. When the data size is smaller, it is simpler to apply sophisticated and computationally expensive algorithms.
● The reduction of the data may be in terms of the number of rows (records) or the number of columns (dimensions).
● The only difference occurs in the efficiency of data mining: data reduction increases the efficiency of data mining.
● Deduplication: Removing duplicate data. This can range from simply removing duplicated records to deleting records that, while not strictly identical, represent the same information or event.
● Compression: Compression processes apply algorithms to transform information so that it takes up less storage space. Compression algorithms can be (and often are) applied to data as it is moved into storage, but some can be applied to data at rest to improve space gains even more.
● Thin Provisioning: Thin provisioning is an approach to storage where space is partitioned and used as needed rather than pre-allocated to users or processes. While more computationally intensive, this approach can significantly reduce inefficiencies like disk fragmentation.
Other data reduction techniques include:
● Dimensionality reduction: The removal of unneeded attributes (aspects/variables) from a data set. For example, a spreadsheet with 10,000 rows but only one column is much simpler to process than one with an additional 500 columns of attributes included. The approach can include compression transformations or even the removal of redundant attributes.
● Aggregation: For example, suppose you have the data of All Electronics sales per quarter for the year 2018 to the year 2022. If you want to get the annual sales per year, you just have to aggregate the sales per quarter for each year. In this way, aggregation provides you with the required data, which is much smaller in size, and thereby we achieve data reduction even without losing any data.
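Ex (Python): a minimal pandas sketch of the quarterly-to-annual aggregation idea above; the figures are made up purely for illustration.

import pandas as pd

sales = pd.DataFrame({
    "year":    [2021, 2021, 2021, 2021, 2022, 2022, 2022, 2022],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [200, 250, 300, 350, 220, 270, 310, 400],
})

# Aggregate quarterly figures into annual totals: 8 rows reduce to 2
annual = sales.groupby("year", as_index=False)["amount"].sum()
print(annual)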
Data Compression:
Data compression employs modification, encoding, or converting the structure of data in a way that consumes less space. It involves building a compact representation of information by removing redundancy and representing data in binary form. Data that can be restored successfully from its compressed form is called lossless compression. In contrast, the opposite, where it is not possible to restore the original form from the compressed form, is lossy compression. Dimensionality and numerosity reduction methods are also used for data compression.
This technique reduces the size of files using different encoding mechanisms, such as Huffman Encoding and Run-Length Encoding. We can divide it into two types based on the compression technique:
i. Lossless Compression: Encoding techniques (such as Run-Length Encoding) allow a simple and minimal data size reduction. Lossless data compression uses algorithms to restore the precise original data from the compressed data.
ii. Lossy Compression: In lossy data compression, the decompressed data may differ from the original data but is useful enough to retrieve information from it.
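Ex (Python): a minimal sketch of lossless run-length encoding; decoding rebuilds the exact original string, which is what distinguishes lossless from lossy compression.

def run_length_encode(text):
    """Encode a string as a list of (character, run length) pairs."""
    encoded = []
    for ch in text:
        if encoded and encoded[-1][0] == ch:
            encoded[-1] = (ch, encoded[-1][1] + 1)   # extend the current run
        else:
            encoded.append((ch, 1))                  # start a new run
    return encoded

def run_length_decode(pairs):
    """Rebuild the exact original string from the encoded pairs."""
    return "".join(ch * count for ch, count in pairs)

pairs = run_length_encode("AAAABBBCCD")
print(pairs)                          # [('A', 4), ('B', 3), ('C', 2), ('D', 1)]
assert run_length_decode(pairs) == "AAAABBBCCD"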
Q. Data Discretization
Discretization, also known as binning, is the process of transforming continuous numerical variables into discrete categorical features. Data discretization refers to a method of converting a huge number of data values into smaller ones so that the evaluation and management of data become easy.
✓ In other words, data discretization is a method of converting attribute values of continuous data into a finite set of intervals with minimum data loss.
✓ There are two forms of data discretization: supervised discretization, which uses class information, and unsupervised discretization, which depends on how the operation proceeds. It may work on a top-down splitting strategy or a bottom-up merging strategy.
Steps of Discretization
1. Understand the Data: Identify continuous variables and analyze their distribution, range, and role
in the problem.
4. Apply Discretization: Map continuous values to the chosen bins, replacing them with their respective bin identifiers.
5. Evaluate the Transformation: Assess the impact of discretization on data distribution and model performance. Ensure that patterns or important relationships are not lost.
6. Validate the Results: Cross-check to ensure discretization aligns with the problem goals.
Ex: Suppose we have an attribute Age with the given values:
Age: 1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19, 31, 33, 36, 42, 44, 46, 70, 74, 78, 77
Table before Discretization

Data discretization techniques:
● Histogram analysis: A histogram is a plot used to represent the underlying frequency distribution of a continuous data set. Histograms assist in data inspection for data distribution, for example, outliers, skewness representation, normal distribution representation, etc.
● Binning: Binning is a data smoothing technique that helps to group a huge number of continuous values into a smaller number of bins. For example, if we have data about a group of students and we want to arrange their marks into a smaller number of intervals, we can make bins of grades: one bin for grade A, one for grade B, one for C, one for D, and so on.
● Cluster analysis: Cluster analysis is commonly known as clustering. Clustering is the task of grouping similar objects into one group, commonly called a cluster, while different objects are placed in different clusters.
● Equal-Width Intervals: This technique divides the range of attribute values into intervals of equal size. The simplicity of this method makes it popular, especially for initial data analysis. For example, if you're dealing with an attribute like height, you might divide it into intervals of 10 cm each.
● Equal-Frequency Intervals: Unlike equal-width intervals, this method divides the data so that each interval contains approximately the same number of data points. It's particularly useful when the data is unevenly distributed, as it ensures that each category has a representative sample.
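Ex (Python): a minimal pandas sketch of equal-width and equal-frequency binning applied to the Age values from the example above.

import pandas as pd

ages = pd.Series([1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19,
                  31, 33, 36, 42, 44, 46, 70, 74, 78, 77])

# Equal-width intervals: 4 bins of equal size across the age range
equal_width = pd.cut(ages, bins=4)

# Equal-frequency intervals: 4 bins, each with roughly the same number of values
equal_freq = pd.qcut(ages, q=4)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())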
Applications of Discretization
1. Improved Model Performance: Decision trees, Naive Bayes, and rule-based algorithms often perform better with discrete data because they naturally handle categorical features more effectively.
2. Handling Non-linear Relationships: Data scientists can discover non-linear patterns between features and the target variable by discretising continuous variables into bins.
3. Outlier Management: Discretization, which groups data into bins, can help reduce the influence of extreme values, helping models focus on trends rather than outliers.
4. Feature Reduction: Discretization can group values into intervals, reducing the number of distinct values a model has to handle.
5. Visualization and Interpretability: Discretized data makes it easier to create visualizations for exploratory data analysis and to interpret the data, which helps in the decision-making process.
Q. Data Munging
Data munging, also known as data wrangling, is a data transformation process that converts raw data into a more usable format. In other words, it involves cleaning, normalizing, and enriching raw data so that it can be used to produce meaningful insights needed to make strategic decisions.
Data munging can be done automatically or manually. Large-scale organizations with massive datasets generally have a dedicated data team responsible for data munging. Their task is to transform raw data and pass it to business leaders for informed decision making.
Data munging is a fairly simple process. It involves a series of steps taken to ensure your data is clean, enriched, and reliable for various uses. Let's look at these steps in detail here:
1. Collection: The first step in data wrangling is collecting raw data from various sources. These sources can include databases, files, external APIs, web scraping, and many other data streams. The data collected can be structured (e.g., SQL databases), semi-structured (e.g., JSON, XML files), or unstructured (e.g., text documents, images).
2. Cleaning: Once data is collected, the cleaning process begins. This step removes errors, inconsistencies, and duplicates that can skew analysis results. Cleaning might involve:
● Dealing with missing values by removing them, attributing them to other data points, or imputing them with estimated values.
● Identifying and resolving inconsistencies, such as different formats for dates or currency.
3. Structuring: After cleaning, data needs to be structured or restructured into a more analysis-friendly format. This often means converting unstructured or semi-structured data into a structured form, like a table in a database or a CSV file.
4. Enriching: Data enrichment involves adding context or new information to the dataset to make it more valuable for analysis.
6. Storing: The final wrangled data is then stored in a data repository, such as a database or warehouse, making it accessible for analysis and reporting. This storage not only secures the data but also organizes it in a way that is efficient for querying and analysis.
7. Documentation: Documentation is critical throughout the data wrangling process. It records what was done to the data, including the transformations and decisions. This documentation is invaluable for reproducibility, auditing, and understanding the data analysis process.
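A minimal sketch of these steps in pandas is shown below; the DataFrame, column names, and values are invented for illustration only.
import pandas as pd
# Collection: hypothetical raw records gathered from different sources
raw = pd.DataFrame({
    "customer": ["Anu", "Ravi", "Anu", None],
    "amount": ["1200", "950", "1200", "700"],
    "date": ["2024-01-05", "2024-01-06", "2024-01-05", "2024-01-07"],
})
# Cleaning: remove duplicates and rows with missing values
clean = raw.drop_duplicates().dropna()
# Structuring: enforce consistent data types
clean["amount"] = clean["amount"].astype(float)
clean["date"] = pd.to_datetime(clean["date"])
# Enriching: add a derived column
clean["amount_with_tax"] = clean["amount"] * 1.18
# Storing: write the wrangled data to a file
clean.to_csv("clean_sales.csv", index=False)
print(clean)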
Automated data solutions are used by enterprises to seamlessly perform data munging activities, to cleanse and transform source data into standardized information for cross-data set analytics. There are numerous benefits of data munging. It helps businesses:
● Eliminate data silos and integrate various sources (like relational databases, web servers, etc.)
● Improve data usability by transforming raw data into compatible, machine-readable information for business systems.
● Process large volumes of data to get valuable insights for business analytics.
● Ensure high data quality to make strategic decisions with greater confidence.
Q. Filtering
Filtering is a fundamental technique used in data processing and analysis to refine and extract useful information from raw data. It aids in removing unnecessary or irrelevant data, thereby improving the efficiency and accuracy of data analysis. Filtering is widely used across multiple fields including business intelligence, data science, and machine learning.
✓ From a business perspective, data filtering can help organizations make informed decisions by identifying trends, patterns, and outliers in their data.
✓ For example, a retail company might use data filtering to identify its top-selling products and adjust its inventory accordingly.
✓ From a scientific perspective, data filtering can help researchers identify patterns in experimental data that support or refute hypotheses.
✓ For instance, a biologist might use data filtering to identify genes that are differentially expressed between healthy and diseased cells.
There are several ways to filter data, depending on the format of the data and the desired outcome. Some common methods include:
1. Selection: This method involves selecting a subset of data based on specific criteria, such as a date range, geographic location, or demographic characteristics.
For example, a marketing team might select data on customers who have purchased a particular product within the last quarter to target them with a new promotion.
2. Sorting: This method involves organizing data in ascending or descending order based on one or more variables.
For instance, a financial analyst might sort stock prices by date to identify trends over time.
3. Aggregation: This method involves summarizing data by grouping it into categories and aggregating it to a higher level.
For example, a sales manager might aggregate sales data by region to identify which regions are performing best.
4. Filtering by conditions: This method involves applying specific conditions to the data to exclude
certain values or rows.
For example, a quality control engineer might filter out data points that exceed a certain threshold
for a machine's temperature reading to detect potential issues.
5. Statistical methods: This method uses statistical techniques such as regression, correlation, and clustering to identify patterns and relationships in the data.
For instance, a data scientist might use clustering algorithms to group customers based on their
purchasing behavior to identify customer segments.
6. Machine learning methods: This method uses machine learning algorithms such as decision trees, random forests, and neural networks to identify patterns and relationships in the data. For instance, a credit risk assessment model might use decision trees to predict which borrowers are likely to default on their loans based on their credit history and demographic factors.
7. Text filtering: This method is used to extract specific information from text data.
For example, a sentiment analysis tool might filter out words or phrases that indicate positive or negative sentiment from social media posts to analyze public opinion on a brand.
8. Image filtering: This method is used to manipulate image data, such as removing noise or
enhancing features.
For example, a medical imaging algorithm might filter out background noise from an MRI scan to
better visualize tumors.
9. Audio filtering: This method is used to manipulate audio data, such as removing noise or isolating
specific sounds.
For example, a speech recognition system might filter out background noise from a voice recording
to improve accuracy.
10. Real-time filtering: This method involves filtering data in real-time, as it is generated.
For example, a monitoring system might filter out anomalous sensor readings in real-time to quickly
detect equipment failures.
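The sketch below illustrates a few of these filtering methods (selection by condition, sorting, and aggregation) with pandas; the sales table and its values are made up.
import pandas as pd
# Hypothetical sales records
sales = pd.DataFrame({
    "region": ["North", "South", "North", "East"],
    "product": ["A", "B", "A", "C"],
    "units": [120, 80, 200, 45],
})
# Filtering by condition: keep rows with more than 100 units
big_orders = sales[sales["units"] > 100]
# Sorting: order the data by units sold, highest first
ranked = sales.sort_values("units", ascending=False)
# Aggregation: total units per region
per_region = sales.groupby("region")["units"].sum()
print(big_orders)
print(ranked)
print(per_region)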
For businesses engaged in e-commerce, data filtering aids in targeting specific customer segments.
Marketers can leverage this process to tailor campaigns, promotions, and product recommendations
based on customer preferences and behaviors.
● Network Security
Filtering is a crucial component of network security and data security, where it is employed to
identify and block potentially harmful data or traffic. This helps prevent cyber threats and ensures
the integrity of a network.
Benefits:
✓ Focus: Enables analysts to pinpoint specific data subsets for in-depth analysis.
✓ Accuracy: Improves data quality by removing irrelevant or erroneous information.
Techniques:
✓ Tools and Libraries: Many tools and libraries, such as Pandas in Python, facilitate data filtering.
UNIT-II
Descriptive Statistics:
Descriptive statistics, such as mean, median, and range, help characterize a particular data set by summarizing it. They also organize and present that data in a way that allows it to be interpreted. Descriptive statistics techniques can help describe a data set to an individual or organization and include measures related to the data's frequency, positioning, variation, and central tendency.
● Descriptive statistics can help businesses decide where to focus further research. The main purpose of descriptive statistics is to provide information about a data set.
● Descriptive statistics summarizes large amounts of data into useful bits of information.
Ex1: For example, suppose a brand ran descriptive statistics on the customers buying a specific product and saw that 90 percent were female. In that case, it may focus its marketing efforts on better reaching female demographics.
Ex2: In recapping a Major League Baseball season, for example, descriptive statistics might include team batting averages, the number of runs allowed per team, and the average wins per division.
Q. Mean
1. Mean:
In descriptive statistics, the mean, also known as the average, is a measure of central tendency that
represents the typical value of a dataset. It's calculated by summing all the values in the dataset
and then dividing by the total number of values.
Calculation:
● Sum the values: Add up all the numbers in your dataset.
● Count the values: Determine how many numbers are in the dataset.
● Divide the sum by the count: The result is the mean.
Formula to calculate the mean:
x̄ = (x₁ + x₂ + ... + xₙ) / n
● Central Tendency: The mean provides a single value that represents the "average" or
"typical" value of a dataset.
● Data Summary: It's a valuable way to summarize a large dataset into a single,
representative value.
● Comparison: The mean can be used to compare different datasets or groups.
Q. Standard Deviation
1) The standard deviation is a measure of spread or variability in descriptive statistics. It is used for calculating the variance or spread by which individual data points differ from the mean.
2) A low deviation implies that the data points are extremely close to the mean, whereas a high deviation suggests that the data is spread out over a wider range of values.
3) In marketing, variance can assist in accounting for big variations in expenses or revenues. It also helps identify the dispersion of asset prices in relation to their average price and market volatility.
4) In the image, the curve on top is more spread out and therefore has a higher standard deviation, while the curve in the right-side image is more clustered around the mean.
Standard Deviation = √( Σ(X − x̄)² / (n − 1) )
where:
Σ = sum of
X = each value
x̄ = sample mean
n = number of values in the sample
Step-by-step to Calculate Standard Deviation:
Follow these steps to calculate a sample's standard deviation:
Step 01: Collect your data. Collect the dataset for which the standard deviation is to be calculated. Assume you have a data set (45, 67, 30, 58, 50) and a sample of size n = 5.
Step 02: Find the mean. Calculate the sample mean (average) by adding all of the data points and dividing by the sample size n.
Sample Mean x̄ = (45 + 67 + 30 + 58 + 50) / 5 = 250 / 5 = 50
Step 03: Calculate the differences from the mean
Subtract the sample mean (x̄) from each data point (X).
Difference = X − x̄
● For 45: difference = X − x̄ = 45 − 50 = −5
● For 67: difference = X − x̄ = 67 − 50 = 17
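The remaining steps (squaring the differences, averaging them over n − 1, and taking the square root) follow the formula above; a short NumPy sketch carrying the same sample through to the answer is given here for checking the arithmetic.
import numpy as np
data = np.array([45, 67, 30, 58, 50])
mean = data.mean()                               # 50.0
diffs = data - mean                              # [-5. 17. -20. 8. 0.]
sample_var = (diffs ** 2).sum() / (len(data) - 1)
sample_std = np.sqrt(sample_var)
print(sample_std)                                # same result as np.std(data, ddof=1)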
* Skewness and kurtosis are important statistical measures that help in understanding the shape and distribution of a dataset.
* They are widely used in various fields, including finance, economics, quality control, and management.
Skewness:
Skewness measures the asymmetry of a distribution. In a symmetrical distribution, the Mean, Median, and Mode are equal. The normal distribution has a skewness of 0.
Measure of Skewness:
Types of Skewness:
In a distribution with positive skewness (right-skewed):
✓ The right tail of the distribution is longer or fatter than the left.
✓ The mean is greater than the median, and the mode is less than both the mean and the median.
✓ Lower values are clustered in the "hill" of the distribution, while extreme values are in the long right tail.
In a distribution with negative skewness (left-skewed):
✓ mean < median < mode.
✓ The mean is less than the median, and the mode is greater than both the mean and the median.
✓ Higher values are clustered in the "hill" of the distribution, while extreme values are in the long left tail.
✓ It is also known as a left-skewed distribution.
Kurtosis:
Kurtosis is a statistical measure that defines how heavily the tails of a distribution differ from the tails of a normal distribution. In other words, kurtosis identifies whether the tails of a given distribution contain extreme values.
Types of Kurtosis
1. Leptokurtic: Leptokurtic is a curve having a higher peak than the normal distribution. In this curve, there is a greater concentration of items around the central value.
2. Mesokurtic: Mesokurtic is a curve having a peak similar to that of the normal curve. In this curve, the concentration of items around the central value is moderate.
3. Platykurtic: Platykurtic is a curve having a lower peak than the normal curve. In this curve, there is less concentration of items around the central value.
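Assuming SciPy is available, skewness and kurtosis can be computed directly; the sample values below are made up, with one large value to produce a right-skewed shape.
import numpy as np
from scipy.stats import skew, kurtosis
data = np.array([2, 3, 3, 4, 4, 4, 5, 5, 6, 15])   # made-up sample with a long right tail
print("Skewness:", skew(data))                      # positive value: right-skewed
print("Kurtosis:", kurtosis(data, fisher=False))    # Pearson definition; normal curve = 3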
Application of Skewness
(1) Finance & Investment: Helps in analyzing the return distributions of stocks, bonds, and other assets.
(2) Quality Control & Manufacturing: In process control, skewness helps in detecting systematic bias in production quality.
(3) Medical & Biological Research: Used in analyzing patient recovery times, response times to treatment, or the spread of diseases.
Application of Kurtosis
(1) Finance & Risk Management: High kurtosis (leptokurtic) suggests extreme values (outliers) are more common, which is crucial in risk assessment for stock market investments. Low kurtosis (platykurtic) suggests fewer outliers, indicating stable returns.
(2) Econometrics & Business Forecasting: Helps in identifying anomalies in economic trends, crashes, or extreme demand-supply shocks.
(3) Industrial Quality Control: Identifies production defects by highlighting rare but severe deviations in product quality.
Q. Box Plots
The box and whisker plot, sometimes simply called the box plot, displays summary statistics such as the median, quartiles, and potential outliers in a concise and visual manner. By using a box plot you can provide a summary of the distribution, identify potential outliers, and compare groups.
✓ In a box plot, we draw a box from the first quartile to the third quartile. A vertical line is drawn through the box at the median. The whiskers go from each quartile to the minimum and maximum.
To construct a box plot:
To construct a box plot, use a horizontal or vertical number line and a rectangular box. The smallest and largest data values label the endpoints of the axis. The first quartile marks one end of the box and the third quartile marks the other end of the box. Approximately the middle 50 percent of the data fall inside the box. The "whiskers" extend from the ends of the box to the smallest and largest data values. The median or second quartile can be between the first and third quartiles, or it can be equal to one or the other, or both. The box plot gives a good, quick picture of the data.
Example data (in grams):
25, 28, 29, 29, 30, 34, 35, 35, 37, 38
The third quartile is the median of the data points to the right of the median:
34, 35, 35, 37, 38
Q3 = 35
Step 4: Complete the five-number summary by finding the min and the
max.
The min is the smallest data point, which is 25.
The max is the largest data point, which is 38.
The five-number summary is 25, 29, 32, 35, 38.
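The five-number summary and the plot itself can be reproduced with NumPy and Matplotlib (assumed to be installed); a minimal sketch using the same data follows.
import numpy as np
import matplotlib.pyplot as plt
data = [25, 28, 29, 29, 30, 34, 35, 35, 37, 38]
# Five-number summary: min, Q1, median, Q3, max
print(np.min(data), np.percentile(data, 25), np.median(data),
      np.percentile(data, 75), np.max(data))
# Draw the box plot
plt.boxplot(data, vert=False)
plt.title("Box plot of the sample data")
plt.show()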
Features of Box Plot:
✓ It displays the five-number summary of a data set, with the median as one of the measures of central tendency. This implies that it has five pieces of information.
✓ It shows whether the distribution of the data is symmetric or not.
✓ It also provides insight into whether the data set contains potentially unusual observations. These are called outliers.
✓ Herein, the arrangements can be matched with each other. This is because the center, spread, and overall range are instantly apparent in the case of a box plot.
✓ It is also used where huge numbers of data collections are involved or compared.
Q. Pivot Table
A pivot table is a statistics tool that summarizes and reorganizes selected columns and rows of data
in a spreadsheet or database table to obtain a desired report. The tool does not actually change the
spreadsheet or database itself, it simply "pivots" or turns the data to view it from different
perspectives.
✓ Pivot tables are especially useful with large amounts of data that would be time-consuming to calculate by hand.
✓ A few data processing functions a pivot table can perform include identifying
sums, averages, ranges or outliers. The table then arranges this information in a
simple, meaningful layout that draws attention to key values.
✓ Pivot table is a generic term, but is sometimes confused with the Microsoft trademarked term,
PivotTable. This refers to a tool specific to Excel for creating pivot tables.
The Pivot Table helps us to view our data effectively and saves crucial time by summarizing the data into essential categories. It is basically a kind of reporting tool and contains mainly the following four fields, which are as follows:
● Rows: This refers to the data which are taken as a specifier.
● Columns: This refers to the unique values of a field that are listed across the top of the table.
● Values: This represents the count of the data in a given Excel sheet.
● Filters: This would help us to hide or highlight a specific part of the data.
A pivot table consists of the following four components:
1. Columns-When a field is chosen for the column area, only the unique values of the field are listed
across the top.
2. Rows- When a field is chosen for the row area, it populates as the first column.
Similar to the columns, all row labels are the unique values and duplicates are removed.
3. Values- Each value is kept in a pivot table cell and displays the summarized information. The most common values are sum, average, minimum and maximum.
4. Filters- Filters apply a calculation or restriction to the entire table.
Uses of a pivot table
A pivot table helps users answer business questions with minimal effort.
Common pivot table uses include:
• To calculate sums or averages in business situations. For example, counting
sales by department or region.
• To query information directly from an online analytical processing (OLAP) server.
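In Python, a comparable report can be built with pandas' pivot_table() function; the sales figures below are invented for illustration.
import pandas as pd
df = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "department": ["Toys", "Books", "Toys", "Books"],
    "sales": [250, 120, 300, 180],
})
# Rows = region, Columns = department, Values = sum of sales
report = pd.pivot_table(df, index="region", columns="department",
                        values="sales", aggfunc="sum")
print(report)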
Q. Heat Map
Heatmap data visualization is a powerful tool used to represent numerical data graphically, where values are depicted using colors. This method is particularly effective for identifying patterns, trends, and anomalies within large datasets.
✓ Heat maps are graphic representations of information that use color-coding and relative size to convey values.
✓ They are commonly used to visualize and analyze large data sets, but can be an effective tool for smaller data sets as well.
✓ Heatmap analytics are represented graphically, making the data easy to digest and understand.
✓ Heatmaps help businesses understand user behavior and how people interact with content.
Types of Heatmaps
1. Interaction Heatmaps
Interaction heatmaps measure active engagement on a webpage, allowing you to see the type of interaction users have with your website. They can measure mouse movements, clicks, and scrolling, giving you an in-depth understanding of how consumers use your website.
● Click Maps
Click maps provide a graphical representation of where users click. This includes mouse clicks on desktop devices and finger taps on mobile devices. Click maps allow you to see which elements of your webpage are being clicked on and which are being ignored.
● Mouse Move Maps
Research shows a strong correlation between where a user moves their mouse and where their attention lies on a webpage. Mouse move maps track where users move their mouse as they navigate a webpage. This gives you a clear indication of where users are looking as they interact with your webpage.
● Scroll Maps
Scroll maps help you visualize how visitors to your website scroll through your web pages. They do this by visually representing how many visitors scroll down to any point on the page.
2. Attention Heatmaps
Attention heatmaps track a consumer's attention as they look at your content, discovering where their eyes move and which aspects of your content grasp the consumer's attention. There are two types of attention heatmaps.
3. Clustered Heatmaps
Clustered heatmaps group related data points for better insights. Here are a few common uses of clustered heatmaps:
● Business Intelligence: Helps companies analyze sales, revenue, and customer trends.
● Healthcare: Used for studying gene expression and patient data.
● Marketing: Identifies customer segments with similar behaviors.
4. Correlogram
A correlogram heatmap shows the correlation between variables, such as stock prices. Common uses include:
● Website Heatmaps: Measures the correlation between site traffic and user actions.
● Scientific Research: Used by data scientists to study connections between variables in complex studies.
5. Grayscale Heatmap
A grayscale heatmap represents data using black, white, and gray shades. Darker shades indicate higher data points, while lighter shades represent lower values. Since it avoids multiple colors, this type provides a clear, distraction-free view of the information. Grayscale heatmaps can be used in:
● Medical imaging: Doctors use grayscale heatmaps to analyze X-rays and MRI scans.
● Data visualization: Researchers use them for reports where color is not needed.
6. Rainbow Heatmap
A rainbow heatmap uses multiple colors, such as blue, green, yellow, orange, and red, to show variations in heat values. Cooler colors like blue represent lower values, while warmer colors like red indicate higher intensity.
This type of color map is visually striking and can make it easier to see differences in large data sets. They are used for:
● Weather maps: Meteorologists use rainbow heatmaps to show temperature or
rainfall.
● Website heatmaps: Businesses track user behavior using rainbow-colored
click maps.
● Sports analytics: Football analysts use spatial heatmaps to study player movement.
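A simple way to draw a heat map in Python is with seaborn's heatmap() function (assumed installed); the sketch below colour-codes the correlation matrix of a small, randomly generated dataset.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
rng = np.random.default_rng(0)
df = pd.DataFrame({"temperature": rng.normal(30, 5, 100)})
df["ice_cream_sales"] = df["temperature"] * 3 + rng.normal(0, 5, 100)
df["umbrella_sales"] = -df["temperature"] * 2 + rng.normal(0, 5, 100)
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")   # warmer cells = stronger positive correlation
plt.title("Correlation heatmap")
plt.show()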
Q. Correlation Statistics
Correlation analysis refers to the statistical technique that quantifies the relationship between two or more variables. It reveals whether an increase or decrease in one variable leads to an increase or decrease in another variable. This is useful in identifying trends, making predictions, and testing hypotheses.
The most commonly used correlation coefficient is Pearson's correlation coefficient, which measures the strength and direction of a linear relationship between variables.
Types of Correlation
Correlation can be classified into several types based on the relationship between the variables.
1. Positive Correlation: In a positive correlation, as one variable increases, the other variable also increases. For example, as the temperature increases, ice cream sales tend to rise. The correlation coefficient (r) for a positive correlation lies between 0 and +1.
Example: As the hours spent studying increase, exam scores tend to increase as well.
2. Negative Correlation: In a negative correlation, as one variable increases, the other
variable decreases. A perfect negative correlation would have a correlation coefficient of -1.
Ex: As the amount of time spent watching television increases, the time spent on
physical activity decreases.
3. No Correlation: In this case, there is no discernible relationship between the two variables. Changes in one variable do not affect the other.
Ex: The correlation between a person's shoe size and their salary would likely have no
correlation.
4. Zero Correlation: Zero correlation refers to the absence of a relationship between two variables.
The correlation coefficient in this case is 0.
Ex: The correlation between the day of the week and a person's height would be zero, as there is no
relationship.
Steps to Perform Correlation Analysis:
1: Collect the Data: Gather data on the variables of interest from observational studies, experiments, or surveys.
2: Visualize the Data: Before performing correlation analysis, it's helpful to create scatter plots or other visual representations to identify any apparent relationships between the variables.
3: Calculate the Correlation Coefficient: Use a statistical tool or software like Excel, Python (with libraries like NumPy or Pandas), or R to calculate the correlation coefficient.
4: Interpret the Results: The calculated correlation coefficient will give you an idea of the strength and direction of the relationship. Values closer to +1 or -1 indicate a stronger relationship, while values near 0 indicate a weak or no linear relationship.
Measuring the Correlation
● The correlation coefficient is a symmetric measure.
● The value of the correlation coefficient lies between -1 and +1.
● It is a dimensionless quantity.
● It is independent of the origin and scale of measurement.
● The correlation coefficient will be positive or negative depending on whether the sign of the numerator of the formula is positive or negative.
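Pearson's correlation coefficient can be computed with NumPy's corrcoef() function; the study-hours and exam-score values below are made up.
import numpy as np
hours_studied = np.array([1, 2, 3, 4, 5, 6])
exam_score = np.array([40, 50, 55, 65, 70, 80])
# corrcoef returns a symmetric 2x2 matrix; the off-diagonal entry is r
r = np.corrcoef(hours_studied, exam_score)[0, 1]
print(f"Pearson r = {r:.3f}")   # close to +1, indicating a strong positive correlation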
Q. ANOVA
ANOVA stands for Analysis of Variance, a statistical test used to compare the means of three or more groups. It analyzes the variance within the groups and between groups. The primary objective is to assess whether the observed variance between group means is more significant than the variance within the groups. If the observed variance between group means is significant, it suggests that the differences are meaningful.
Mathematically, ANOVA breaks down the total variability in the data into two components: the variability between groups and the variability within groups.
There are two types of ANOVA: one-way and two-way. Depending on the
number of independent variables and how they interact with each other, both are used
in different scenarios.
1. One-way ANOVA
A one-way ANOVA test is used when there is one independent variable with two or more groups. The objective is to determine whether a significant difference exists between the means of the different groups. In our example, we can use one-way ANOVA to compare the effectiveness of the three different teaching methods (lecture, workshop, and online learning) on student exam scores. The teaching method is the independent variable with three groups, and the exam score is the dependent variable.
● Null Hypothesis (H0): The mean exam scores of students across the three teaching methods are equal.
The one-way ANOVA test will tell us if the variation in student exam scores can be attributed to the
differences in teaching methods or if it's likely due to random chance.
One-way ANOVA is effective when analyzing the impact of a single factor across multiple
groups, making it simpler to interpret. However, it does not account for the possibility of
interaction between multiple independent variables, where two-way ANOVA becomes
necessary.
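Assuming SciPy is available, a one-way ANOVA can be run with stats.f_oneway(); the exam scores for the three teaching methods below are invented for illustration.
from scipy import stats
lecture  = [72, 75, 78, 70, 74]
workshop = [80, 85, 83, 79, 82]
online   = [68, 70, 73, 71, 69]
f_stat, p_value = stats.f_oneway(lecture, workshop, online)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A small p-value (for example, below 0.05) suggests at least one group mean differs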
2. Two-way ANOVA
Two-way ANOVA is used when there are two independent variables, each with two
or more groups. The objective is to analyze how both independent variables influence
the dependent variable.
Let's assume you are interested in the relationship between teaching methods and study techniques and how they jointly affect student performance. The two-way ANOVA is suitable for this scenario. Here we test three hypotheses:
● Does the teaching method affect exam scores?
● Does the study technique affect exam scores?
● Interaction effect: Does the effectiveness of the teaching method depend on the study technique used?
For example, two-way ANOVA could reveal that students using the lecture method perform better when combined with group study.
Q. No-SQL
SQL and NoSQL databases differ in how they store and query data. SQL databases rely on tables with columns and rows to retrieve and write structured data, whereas NoSQL databases use more flexible models such as documents, wide columns, and graphs.
Q. Document Database
A document database is a type of NoSQL database which stores data as JSON documents instead
of columns and rows. JSON is a native language used to both store and query data. These documents
can be grouped together into collections to form database systems. Developers can use JSON
documents in their code and save them directly into the document database.
✓ The flexible, semi-structured, and hierarchical nature of documents and document databases allows them to evolve with applications' needs.
✓ Document databases enable flexible indexing, powerful ad hoc queries, and analytics over collections of documents.
Being a NoSQL database, you can easily store data without implementing a schema. You can transfer the object model directly into a document using several different formats. The most commonly used are JSON, BSON, and XML.
What's more, you can also use nested queries in such formats, providing easier data distribution.
For instance, we can add a nested value string to the document above:
{
  "ID": "001",
  "Name": "John",
  "Grade": "Senior",
  "Classes": {
    "Class1": "English",
    "Class2": "Geometry",
    "Class3": "History"
  }
}
* Due to their structure, document databases are optimal for use cases that require flexibility and continual development. For example, you can use them for managing user profiles, which differ according to the information provided. Its schema-less structure allows you to have different attributes and values.
This is a data model which works as a semi-structured data model in which the records and the data associated with them are stored in a single document, which means this data model is not completely unstructured. The main thing is that the data here is stored in a document.
Document database operations
You can create, read, update, and delete entire documents stored in the database. Document databases provide a query language or API that allows developers to run the following operations:
● Create: You can create documents in the database.
● Read: You can use the API or query language to read document data.
● Update: You can update existing documents flexibly.
● Delete: You can delete documents from the database.
Typical use cases of document databases include:
● Mobile apps
● Real-time analytics
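A rough sketch of these operations using the pymongo driver is given below; it assumes pymongo is installed and a MongoDB server is running locally, and the database and collection names are made up.
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017/")
db = client["school"]            # hypothetical database
students = db["students"]        # hypothetical collection
# Create: insert a JSON-like document (a Python dict)
students.insert_one({
    "ID": "001",
    "Name": "John",
    "Grade": "Senior",
    "Classes": {"Class1": "English", "Class2": "Geometry", "Class3": "History"},
})
# Read: query by a nested field
print(students.find_one({"Classes.Class1": "English"}))
# Update and Delete
students.update_one({"ID": "001"}, {"$set": {"Grade": "Graduate"}})
students.delete_one({"ID": "001"})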
Relational Vs Document Database:
● Relational: Organizes data into tuples (or rows). Document: Documents have properties, without theoretical definitions, instead of rows.
● Relational: Defines data (forms relationships) via constraints and foreign keys (e.g., a child table references the master table via its ID). Document: No DDL language for defining schemas.
● Relational: Uses DDL (Data Definition Language) to create relationships. Document: Relationships are represented via nested data; a document can have other documents nested inside of it, leading to an N:1 or 1:N relationship.
Manageable Query Language: Document databases come with a query language that lets
users work with the data model to conduct CRUD (Create, Read, Update, and
Destroy) activities. As a result, accessing the database and getting the needed data is made
simpler.
Q. Wide-column Databases
✓ Unlike traditional relational databases, wide-column databases are highly flexible.
✓ Their schema-free characteristic allows variation of column names even within the same table, and new columns can be added at any time.
Wide-column databases are commonly used for:
1) Data Warehousing: Wide-column databases are optimized for data warehousing and business intelligence applications, where large amounts of data need to be analyzed and aggregated. They are often used for analytical queries, such as aggregation and data mining.
2) Big data: Wide-column databases can handle large datasets and provide efficient storage and retrieval of data and, therefore, can be used for big data applications.
3) Cloud-based analytics: Wide-column databases can be easily scaled to handle large amounts of data and are designed for high availability; this makes them suitable for cloud-based analytics applications.
4) IoT: Wide-column databases can handle a high number of writes and reads and can be used for storing and processing IoT data.
Advantages of wide-column databases:
1. High performance: Wide-column databases are optimized for analytical queries and are designed for fast query performance, which can be especially important in data warehousing and business intelligence applications.
2. Flexible and efficient data model: Wide-column databases store data in a column-family format, which allows for a more flexible and efficient data model, as each column family can have its own set of columns and can be optimized for different types of queries.
3. Scalability: Wide-column databases are highly scalable, which means that they can handle large amounts of data and a high number of concurrent users.
Graph database:
A graph database is a type of database that uses a graph model to represent and store data. The data is represented as a collection of nodes and edges. Nodes represent entities or objects, and edges represent connections or relationships between them. Nodes and edges each have attributes or properties that give additional details about the data. Representing complex relationships between data in traditional databases can be challenging because they are built to manage and store data in tables and columns. Contrarily, graph databases represent data as a network of nodes and edges, making it simple to model intricate relationships between data.
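The node-and-edge model can be illustrated in Python with the networkx library (this is an in-memory graph, not a full graph database; the entities and relationships are made up).
import networkx as nx
g = nx.Graph()
# Nodes represent entities, with properties stored as attributes
g.add_node("Alice", role="customer")
g.add_node("Bob", role="customer")
g.add_node("Laptop", category="electronics")
# Edges represent relationships, also with attributes
g.add_edge("Alice", "Bob", relation="friend")
g.add_edge("Alice", "Laptop", relation="purchased")
# Traverse the relationships of one node
for neighbour in g.neighbors("Alice"):
    print("Alice ->", neighbour, g.edges["Alice", neighbour]["relation"])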
Types of graph databases:
● Object-oriented databases: Store data as objects. Therefore, object-oriented databases are good for use cases like managing intricate data relationships in applications and modeling complex business logic.
● Resource Description Framework (RDF) databases: Made to manage and store metadata about resources, including web pages and scholarly articles, and their connections to one another. As a result, RDF databases are frequently suitable for applications utilizing knowledge graphs and the semantic web.
● Social media networks: Social media networks are one of the most popular and
natural use cases of graph databases because they involve complex relationships
between people and their activities. For example, graph databases can store and
retrieve information about friends, followers, likes, and shares, which can help social
media companies like Facebook and Instagram tailor their content and recommendations for
each user.
g
● Recommendation systems: Recommendation systems can provide users with tailored recommendations based on relationships between goods, clients, and purchases. A movie streaming service, like Netflix, might use a graph database to suggest movies and TV shows according to a user's viewing habits and preferences.
● Fraud detection: Graph databases allow for modeling relationships between different entities, including customers, transactions, and devices, which can be used in fraud detection and prevention. For example, a bank could use a graph database to detect fraudulent transactions.
1. Flexibility: Graph databases can easily adapt to new data models and schemas due to their high level of flexibility.
2. Data integration: Graph databases can combine structured and unstructured data from various sources. This can make drawing conclusions from various data sources simpler.
UNIT-III
Data Science has become one of the fastest-growing fields in recent years, helping organizations make informed decisions, solve problems, and understand human behavior. As the volume of data grows, so does the demand for skilled data scientists. The most common languages used for data science are Python and R.
Q. Python Libraries
A Python library is a collection of modules and packages that offer a wide range of functionality. Libraries contain pre-written code, classes, functions, and routines that can be used to develop applications, automate tasks, manipulate data, perform mathematical computations, and more. Some of the popular libraries offered by Python for supporting different Data Science activities are:
1. NumPy
NumPy is a scientific computing package for producing and computing multidimensional arrays, matrices, Fourier transformations, statistics, linear algebra, and more. NumPy's tools allow you to manipulate and compute large data sets efficiently and at a high level.
Key Features:
2. Pandas
Pandas is one of the best libraries for Python; it is a free software library for data analysis and data handling. In short, Pandas is perfect for quick and easy data manipulation, data aggregation, reading and writing data, and data visualization.
Key Features:
● DataFrame manipulation
● Grouping, joining, and merging datasets
● Time series data handling
● Data cleaning and wrangling
3. Dask
Dask is an open-source Python library designed to scale up computations for handling large datasets. It provides dynamic parallelism, enabling computations to be distributed across multiple cores or machines. This is where Dask, a parallel computing library in Python, shines by providing scalable solutions for big data processing.
Key Features:
● Works with Pandas and NumPy for distributed processing
● Built for multi-core machines and cloud computing
4. Vaex:
Vaex is a Python library designed for fast and efficient data manipulation, especially when dealing with massive datasets. Unlike traditional libraries like pandas, Vaex focuses on out-of-core data processing, allowing users to handle billions of rows of data with minimal memory consumption.
Key Features:
5. Scrapy:
Scrapy is a web scraping and extraction tool for data mining. Its use extends beyond just scraping websites; you can also use it as a web crawler and to extract data from APIs, HTML, and XML sources. Scraped data turns into JSON, CSV, or XML files to store on a local disk or through file transfer protocol (FTP).
6. Seaborn:
Seaborn is a data visualization library built on top of Matplotlib. It simplifies the creation of
informative and aesthetically pleasing statistical graphics. Seaborn is particularly helpful for exploring
relationships in data and presenting complex data in a visually appealing manner.
7. TensorFlow:
TensorFlow is another popular library for deep learning and machine learning. Developed by Google, it provides a comprehensive ecosystem for building and deploying machine learning models. TensorFlow's computational graph allows for distributed training and deployment on various platforms, making it suitable for production-level applications.
8. Statsmodels:
Statsmodels is a library focused on statistical modeling and hypothesis testing. It provides tools for estimating and interpreting models for various statistical methods, including linear regression, time series analysis, and more.
9. SciPy:
SciPy is a scientific computing package with high-level algorithms for optimisation, integration, differential equations, eigenvectors, algebra, and statistics. It enhances the usage of NumPy-like arrays by using other matrix data structures as its main objects for data. This gives you an even wider range of ways to analyse and compute data.
Features:
● Collection of algorithms and functions built on the NumPy extension of Python
● High-level commands for data manipulation and visualization
Applications:
Q. Python Integrated Development Environments (IDEs) for Data Science
An Integrated Development Environment (IDE) is a software application that provides various tools
and features for writing, editing, and debugging code in a programming language. IDEs are designed
to be a one-stop shop solution for software development and generally consist of a code editor,
compiler or interpreter, and a debugger.
An IDE is a multifaceted software suite that combines a wide range of tools within a singular interface. It caters to the diverse needs of developers and data scientists and the programming languages they support. Some of the main features that usually come with an IDE include:
● Code Editor: A code editor is a text editor with features that help in writing and editing source code. These features include syntax highlighting, code completion, and error detection.
● Compiler/Interpreter: IDEs often include tools that convert source code written in a programming language into a form that can be executed by a computer.
● Debugger: A debugger is a tool for identifying and fixing errors or bugs in source code. It allows the programmer to run the code step-by-step, inspect variables, and control the execution flow.
● Build Automation Tools: Build automation tools are utilities that automate the process of compiling code, running tests, and packaging the software into a distributable format.
● Version Control Integration: IDEs offer support for version control systems like Git, enabling
programmers to manage changes to their source code over time and maintain version
histories.
Importance of IDEs for Python Development
1. Syntax highlighting - Different parts of the code are highlighted with different colors to make it
easier to read and understand.
2. Auto-completion - The IDEs can automatically suggest code snippets and complete statements.
3. Debugging - The IDEs include tools for setting breakpoints, stepping through code, and inspecting variables, which can help you detect and fix bugs in your code.
4. Collaboration - IDEs can be integrated with version control systems like Git, allowing you to track
code changes and collaborate with other developers.
5. Project management - IDEs can help you manage your projects by allowing you to organize your
code into different files and directories.
6. Language support - IDEs are typically designed to support a specific programming language or
group of languages. This means they can provide language-specific features and integrations that can
make developing software in that language easier.
7. Community - Many IDEs have a large and active community of users, which can be a valuable
resource for getting help and learning new techniques.
Jupyter Notebook
a) Jupyter notebook is the most commonly used and popular Python IDE used by data scientists. It is a web-based computation environment to create Jupyter notebooks, which are documents that contain code, equations, visualizations, and narrative text.
b) Jupyter notebooks are a useful tool for data scientists because they allow them to combine code, visualization, and narrative in a single document, which makes it easy to share your work and reproduce your results. It also provides support for markdown language and equations.
c) The Jupyter notebook can support almost all popular programming languages used in data science, such as Python, R, and Julia.
Spyder
1) Spyder is an open-source Python IDE created and developed by Pierre Raybaut in 2009. The name stands for Scientific Python Development Environment.
2) It is designed specifically for scientific and data analysis applications and includes various tools and features that make it well-suited for these tasks. Some of the features of Spyder include a code editor, interactive console, variable explorer, debugging, visualization, etc.
Sublime Text
a) Sublime Text is a proprietary Python IDE known for its speed, flexibility, and powerful features,
making it a popular choice for a wide range of programming tasks.
b) Sublime text features include a customizable interface, syntax highlighting, auto-suggest, multiple
selections, various plugins, etc.
Atom
a) Atom is a popular and powerful text and source code editor among developers. It is a free, open-source editor that is available for Windows, macOS, and Linux.
b) While Atom is not specifically designed as an Integrated Development Environment (IDE) for
Python, it can be used as one with the help of plugins and packages.
Geany
a) Geany is a free text editor that supports Python and contains some IDE features. It was developed
by Enrico Tröger in C and C++.
b) A few of the features of Geany include - Symbol lists, Auto-completion, Syntax highlighting, Code navigation, Multiple document support, etc.
Q. Arrays and Vectorized Computation
NumPy stands for Numerical Python. It is a Python library used for working with arrays. In Python, we use the list for the array, but it's slow to process. The NumPy array is a powerful N-dimensional array object and is used in linear algebra, Fourier transform, and random number capabilities. It provides an array object much faster than traditional Python lists.
Types of Array:
1. One-dimensional arrays
2. Two-dimensional arrays
3. Multi-dimensional arrays
One-Dimensional Array: In this type of array, elements are stored in a single row. For example:
import numpy as np
list1 = [1, 2, 3, 4, 5]
sample_array = np.array(list1)
print(sample_array)
Output: [1 2 3 4 5]
Two-Dimensional Array:
In this type of array, elements are stored in rows and columns which represent a matrix.
Three-dimensional array: This type of array comprises 2-D matrices as its elements.
Vectorization:
Vectorization is a technique used to improve the performance of Python code by eliminating the use
of loops. This feature can significantly reduce the execution time of code
There are various operations that can be performed on vectors, such as the dot product of vectors
(also known as scalar product), which results in a single output, outer products that produce a
square matrix of dimension equal to the length of the vectors, and element-wise multiplication,
which multiplies the elements of the same index and preserves the dimension of the matrix.
Vectorization is the use of array operations from NumPy to perform computations on a dataset.
● This approach leverages NumPy's underlying C implementation for faster and more efficient computations.
● By replacing iterative processes with vectorized functions, you can significantly optimize performance in data analysis, machine learning, and scientific computing tasks.
For example:
import numpy as np
a1 = np.array([2, 4, 6, 8, 10])
number = 2
result = a1 + number  # the scalar is broadcast and added element-wise
print(result)
Output: [4 6 8 10 12]
Vectorization is significant because it:
● Simplifies Code: Eliminates explicit loops, making code cleaner and easier to read.
import numpy as np
a1 = np.array([1, 2, 3])
a2 = np.array([4, 5, 6])
result = a1+a2
print(result)
Output: [5 7 9]
import numpy as np
a1 = np.array([1,2,3,4])
result = a1 * 2
print(result)
Output: [2 4 6 8]
Advantages:
● Vectorization can drastically increase the speed of execution versus looping over arrays
● Vectorization keeps code simpler and more readable so it's easier to understand and build on
later
● Much of the math of data science is similar to vectorized implementations, making it easier
to translate into vectorized code
Q. The NumPy ndarray
NumPy ndarray:
NumPy is used to work with arrays. The array object in NumPy is called ndarray. An ndarray is a multi-dimensional array of items of the same type and size. The number of dimensions and items contained in the array is defined with a tuple of N non-negative integers that specify each dimension's size.
● The most important object defined in NumPy is an N-dimensional array type called ndarray. It describes a collection of items of the same type, which can be accessed using a zero-based index.
● Each item in an ndarray takes the same size of block in the memory and is represented by a data-type object called dtype.
● Any item extracted from an ndarray object (by slicing) is represented by a Python object of one of the array scalar types.
A multidimensional array looks something like this:
In NumPy, the number of dimensions of the array is given by its rank. Thus, in the above example, the ranks of the 1D, 2D, and 3D arrays are 1, 2 and 3 respectively.
import numpy as np
# 1D array
arr1 = np.array([1, 2, 3, 4, 5])
# 2D array
arr2 = np.array([[1, 2, 3], [4, 5, 6]])
# 3D array
arr3 = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
print(arr1)
print(arr2)
print(arr3)
Output:
[1 2 3 4 5]
[[1 2 3]
 [4 5 6]]
[[[1 2]
  [3 4]]
 [[5 6]
  [7 8]]]
Attributes of ndarray:
Understanding the attributes of an ndarray is essential to working with NumPy effectively. Here are the key attributes:
● ndarray.shape: Returns a tuple representing the shape (dimensions) of the array.
● ndarray.itemsize: Returns the size (in bytes) of each element.
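A short sketch showing these attributes (plus the closely related ndim, dtype, and size attributes) on a small array:
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr.shape)     # (2, 3) - dimensions of the array
print(arr.ndim)      # 2 - number of dimensions (rank)
print(arr.dtype)     # integer dtype (int64 on most platforms, int32 on some)
print(arr.itemsize)  # size in bytes of each element
print(arr.size)      # 6 - total number of elements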
Q. Creating ndarrays
An instance of the ndarray class can be constructed by different array creation routines. The basic ndarray is created using the array() function in NumPy.
The numpy.array() function creates an ndarray from any object exposing the array interface or from any (nested) sequence.
Syntax: numpy.array(object, dtype=None, copy=True, order=None, subok=False, ndmin=0)
The above parameters are:
1. Object: Any object exposing the array interface method returns an array, or any (nested) sequence.
2. Dtype: Desired data type of array, optional.
3. Copy: Optional. By default (true), the object is copied.
4. Order: C (row major) or F (column major) or A (any) (default).
5. Subok: By default, the returned array is forced to be a base class array. If true, sub-classes are passed through.
6. Ndmin: Specifies the minimum dimensions of the resultant array.
Example: Create a One-dimensional Array
import numpy as np
a = np.array([1, 2, 3])
print(a)
Output: [1 2 3]
Example: Create a Two-dimensional Array
import numpy as np
a = np.array([[1, 2], [3, 4]])
print(a)
Output:
[[1 2]
 [3 4]]
Example: Create a Multi-dimensional Array
import numpy as np
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(arr)
Output:
[[[1 2 3]
  [4 5 6]]
 [[1 2 3]
  [4 5 6]]]
Advantages of Ndarrays
● One of the main advantages of using NumPy ndarrays is that they take less memory space and provide better runtime speed when compared with similar data structures in Python (lists and tuples).
● Numpy Ndarrays support some specific scientific functions such as linear algebra. They help
us in solving linear equations.
Q. Data Types for ndarrays
NumPy, the fundamental library for scientific computing in Python, supports a wide range of data types to accommodate various numerical and non-numerical data. These data types are essential for efficient storage, manipulation, and analysis of data in scientific and numerical computing applications. The common data types supported by NumPy can be broadly classified into the following categories:
These include:
● int32: 32-bit signed integer (-2,147,483,648 to 2,147,483,647)
● int64: 64-bit signed integer (-9,223,372,036,854,775,808 to 9,223,372,036,854,775,807)
Ex:
import numpy as np
arr = np.array([1, "two", 3.5], dtype=object)
print(arr)
print(arr.dtype)
Output:
[1 'two' 3.5]
object
● NumPy Datetime Data Type
In NumPy, we have a standard and fast datetime system called datetime64. We cannot use the name datetime, as it's already taken by the standard Python library.
Dates are represented using the current Gregorian Calendar and are infinitely extended in the past and future, just like the Python date class.
When we call the datetime64 function, it returns a new date with the specified format in the parameter.
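For example (a small illustrative sketch):
import numpy as np
d = np.datetime64("2024-06-15")          # a calendar date
t = np.datetime64("2024-06-15T10:30")    # date and time, minute precision
print(d, t)
print(t - d)                             # timedelta64 result: 630 minutes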
Q. Arithmetic with NumPy Arrays
NumPy makes performing arithmetic operations on arrays simple and easy. With NumPy, you can add, subtract, multiply, and divide entire arrays element-wise, meaning that each element in one array is operated on by the corresponding element in another array.
When performing arithmetic operations with arrays of different shapes, NumPy uses a feature called broadcasting. It automatically adjusts the shapes of the arrays so that the operation can be performed, extending the smaller array across the larger one as needed.
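A brief sketch of broadcasting, where a 1-D array is combined with a 2-D array:
import numpy as np
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
row = np.array([10, 20, 30])
# The 1-D array is stretched across each row of the 2-D array
print(matrix + row)
# [[11 22 33]
#  [14 25 36]]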
NumPy provides several arithmetic operations that are performed element-wise on arrays. These include addition, subtraction, multiplication, division, and power.
Addition: Addition is the most basic operation in mathematics. We can add NumPy arrays element-
wise using the "+" operator.
import numpy as np
array1 = np.array([1, 2, 3, 4])
array2 = np.array([5, 6, 7, 8])
result_addition = array1 + array2
Subtraction: Subtraction works similar to addition. It subtracts each element of the second array from the corresponding element in the first array.
Ex: import numpy as np
array1 = np.array([1, 2, 3, 4])
array2 = np.array([5, 6, 7, 8])
result_subtraction = array1 - array2
Multiplication: To multiply one array by another (element-wise), we can use the "*" operator.
array1 = np.array([1, 2, 3, 4])
array2 = np.array([5, 6, 7, 8])
result_multiplication = array1 * array2
Division: To divide one array by another array (element-wise), we can use the "/" operator.
array1 = np.array([1, 2, 3, 4])
array2 = np.array([5, 6, 7, 8])
result_division = array1 / array2
NumPy Modulo Operation: The modulo operation is performed using the % operator in NumPy. When applied to arrays, it operates element-wise, meaning each element of the first array is divided by the corresponding element in the second array, and the remainder is calculated.
Q. Basic Indexing and Slicing
Indexing and slicing are fundamental concepts in programming and data manipulation that allow access and manipulation of individual elements or subsets of elements in a sequence, such as strings, lists, or arrays.
NumPy Indexing:
NumPy indexing is used to access or modify elements in an array. Three types of indexing methods are available: field access, basic slicing, and advanced indexing.
The three types of indexing methods that are available in the NumPy library are given below:
● Field access - This is direct field access using the index of the value; for example, the [0] index is for the 1st value, the [1] index is for the 2nd value, and so on.
● Basic Slicing - Basic slicing is simply an extension of Python's basic concept of slicing to n dimensions. In order to construct a Python slice object you just need to pass the start, stop, and step parameters to the built-in slice function. Further, this slice object is passed to the array to extract the part of the array.
● Advanced Indexing
NumPy 1-D array indexing: You need to pass the index of that element as shown below, to access the 1-D array.
import numpy as np
arr1 = np.array([10, 20, 30, 40, 50])
print(arr1[0])
Output: 10
NumPy 2-D array indexing: To access the 2-D array, you need to use commas to separate the integers
which represent the dimension and the index of the element. The first integer represents the row
and the other represents column
import numpy as np
z = np.array([[61,27,13,14,54], [46,37,38,19,10]])
Slicing in NumPy:
NumPy Slicing is an extension of Python's basic concept of slicing to n dimensions. A Python slice object is constructed by giving start, stop, and step parameters to the built-in slice function. This slice object is passed to the array to extract a part of the array.
● Slicing in the array is performed in the same way as it is performed in the Python list.
● If an array has 100 elements and you want to pick only a section of the values, you can perform slicing and extract the required set of values from the complete ndarray.
● Learn Python list slicing and you can apply the same on NumPy ndarrays.
Slicing 1-D arrays: When slicing an array, you pass the starting index and the ending index, which are separated by a colon. You use the square brackets as shown below.
arr[start:end]
import numpy as np
x = np.array([10, 20, 30, 40, 50])
print(x[1:5])
Output: [20 30 40 50]
The starting index is 1 and the ending index is 5. Therefore, you will slice from the second element
since indexing in the array starts from 0 up to the fourth element.
Slicing 2-D arrays: To slice a 2-D array in NumPy, you have to specify the row index and the column index, which are separated using a comma as shown below.
arr[1, 1:4]
The part before the comma represents the row while the part after the comma represents the column.
import numpy as np
y = np.array([[11, 12, 13, 14, 15], [16, 17, 18, 19, 20]])
print(y[1, 1:4])
Output: [17 18 19]
Q. Boolean Indexing
In NumPy, boolean indexing allows us to filter elements from an array based on a specific condition. We use boolean masks to specify the condition.
Boolean Masks in NumPy: A boolean mask is a NumPy array containing truth values (True/False) that correspond to each element in the array. Suppose we have an array named array1.
Now let's create a mask that selects all elements of array1 that are greater than 20:
boolean_mask = array1 > 20
Here, array1 > 20 creates a boolean mask that evaluates to True for elements that are greater than 20, and False for elements that are less than or equal to 20. The resulting mask is an array stored in the boolean_mask variable.
Boolean Indexing allows us to create a filtered subset of an array by passing a boolean mask as an
index. The boolean mask selects only those elements in the array that have a True value at the
corresponding index position. Let's create a boolean indexing of the boolean mask in the above
example.
array1[boolean_mask]
Ex: We'll use boolean indexing to select only the odd numbers from an array.
import numpy as np
array1 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
boolean_mask = array1 % 2 != 0
# boolean indexing to filter the odd numbers
result = array1[boolean_mask]
print(result)
Output: [1 3 5 7 9]
2D Boolean Indexing in NumPy: Boolean indexing can also be applied to multi-dimensional arrays in NumPy.
Ex:
import numpy as np
# create a 2D array
array1 = np.array([[1, 5, 9], [10, 15, 20]])
boolean_mask = array1 > 9
result = array1[boolean_mask]
print(result)
Output: [10 15 20]
In this example, we have applied boolean indexing to the 2D array named array1. We then created a boolean mask based on the condition that elements are greater than 9.
Swapping Axes of Arrays in NumPy: Swapping axes in NumPy allows you to change the order of
dimensions in an array. You can swap axes of an array in NumPy using the swapaxes() function and
the transpose() function.
In NumPy, an array can have multiple dimensions, and each dimension is referred to as an axis. For example, a 2D array (matrix) has two axes: the rows and the columns. In a 3D array (tensor), there are three axes: depth, height, and width.
● Axis 0 refers to the first dimension (often rows).
● Axis 1 refers to the second dimension (often columns).
● Axis 2 refers to the third dimension, and so on.
Using the swapaxes() Function: The np.swapaxes() function in NumPy allows you to swap two specified axes of an array. This function is particularly useful when you need to reorganize the structure of an array, such as switching rows and columns in a 2D array or reordering the dimensions in a multi-dimensional array.
This function does not create a copy of the data but rather returns a new view of the array with the
specified axes swapped. It does not involve duplicating the array's data in memory.
Syntax: numpy.swapaxes(a, axis1, axis2)
Where, a is the input array, and axis1 and axis2 are the two axes to be swapped.
Ex: In the following example, we are swapping the rows and columns in a 2D array using the swapaxes() function.
import numpy as np
# Creating a 2D array
arr = np.array([[1, 2, 3],
                [4, 5, 6]])
# Swapping axes 0 and 1 (rows and columns)
swapped = np.swapaxes(arr, 0, 1)
print("Original Array:")
print(arr)
print("\nArray After Swapping Axes:")
print(swapped)
Output:
Original Array:
[[1 2 3]
 [4 5 6]]
Array After Swapping Axes:
[[1 4]
 [2 5]
 [3 6]]
Using the transpose() Function: We can also use the transpose() function to swap axes of arrays in
NumPy. Unlike the swapaxes() function, which swaps two specific axes, the transpose() function is
used to reorder all axes of an array according to a specified pattern.
Syntax:numpy.transpose(a, axes=None)
Where, a is the input array and axes is an optional tuple or list of integers specifying the new order of the axes.
Ex: Following is an example of the numpy.transpose() function, in which the transpose operation switches the rows and columns of a 2D array:
import numpy as np
# Original 2D array
array_2d = np.array([[1, 2, 3],
                     [4, 5, 6]])
transposed_2d = np.transpose(array_2d)
print("Original array:")
print(array_2d)
print("Transposed array:")
print(transposed_2d)
Output:
Original array:
[[1 2 3]
 [4 5 6]]
Transposed array:
[[1 4]
 [2 5]
 [3 6]]
Q. Universal Functions (Ufuncs)
Universal Functions (referred to as "Ufuncs" hereon) in NumPy are highly efficient functions that perform element-wise operations on arrays. They allow mathematical and logical operations to be applied seamlessly across large datasets.
1) Arithmetic operations: Ufuncs enable the execution of element-wise arithmetic operations such
as addition, subtraction, multiplication, and division across entire arrays without needing explicit
loops.
Ex:
import numpy as np
a = np.array([10, 20, 30])
b = np.array([1, 2, 3])
print("Addition:", np.add(a, b))             # Element-wise addition
print("Subtraction:", np.subtract(a, b))     # Element-wise subtraction
print("Multiplication:", np.multiply(a, b))  # Element-wise multiplication
print("Division:", np.divide(a, b))          # Element-wise division
Output:
Addition: [11 22 33]
Subtraction: [ 9 18 27]
Multiplication: [10 40 90]
Division: [10. 10. 10.]
2) Trigonometric functions:
● NumPy's Ufuncs extend their efficiency to trigonometric operations, enabling seamless and fast computation of trigonometric functions on arrays.
● These Ufuncs apply trigonometric operations like sine, cosine, and tangent element-wise to arrays, allowing us to perform complex mathematical transformations effortlessly.
3) Exponential and logarithmic functions:
NumPy's Ufuncs offer powerful and efficient ways to perform exponential and logarithmic calculations on arrays. These functions operate element-wise, allowing you to compute exponential and logarithmic transformations swiftly across large datasets.
Here are some frequently used exponential and logarithmic functions in NumPy:
4) Logical Ufuncs:
2. np.logical_not: Element-wise logical NOT operation.
6) Statistical Ufuncs:
2. np.min, np.max: Finds the minimum and maximum values present in an array.
7) Bitwise Ufuncs:
8) Comparison Functions:
In NumPy, comparison functions are integral to element-wise operations that allow us to evaluate relationships between arrays. These Ufuncs perform element-wise comparisons and return Boolean arrays, where each element represents the result of the comparison operation.
1. np.greater, np.greater_equal: Element-wise comparison for greater than and greater than or equal to.
2. np.less, np.less_equal: Element-wise comparison for less than and less than or equal to.
Ex: (a minimal sketch; the sample arrays are chosen for illustration)
import numpy as np
a = np.array([1, 2, 3])
b = np.array([3, 2, 1])
# Perform comparison operations
print("Greater:", np.greater(a, b))
print("Less or equal:", np.less_equal(a, b))
Output:
Greater: [False False  True]
Less or equal: [ True  True False]
Mathematical Functions: NumPy provides a wide range of mathematical functions that are essential for performing numerical operations on arrays. These functions include basic arithmetic, trigonometric, exponential, logarithmic, and statistical operations, among others. We will explore the most commonly used mathematical functions in NumPy, with examples to help you understand their application.
Arithmetic operations: Ufuncs enable the execution of element-wise arithmetic operations such as addition, subtraction, multiplication, and division across entire arrays without needing explicit loops.
Ex:
import numpy as np
a = np.array([10, 20, 30])
b = np.array([1, 2, 3])
print("Addition:", np.add(a, b))             # Element-wise addition
print("Subtraction:", np.subtract(a, b))     # Element-wise subtraction
print("Multiplication:", np.multiply(a, b))  # Element-wise multiplication
print("Division:", np.divide(a, b))          # Element-wise division
Output:
Addition: [11 22 33]
Subtraction: [ 9 18 27]
Multiplication: [10 40 90]
Division: [10. 10. 10.]
Trigonometric functions:
● NumPy's Ufuncs extend their efficiency to trigonometric operations, enabling seamless and fast computation of trigonometric functions on arrays.
● These Ufuncs apply trigonometric operations like sine, cosine, and tangent element-wise to arrays, allowing us to perform complex mathematical transformations effortlessly.
● The trigonometric functions available in NumPy include numpy.sin(x), numpy.cos(x), numpy.tan(x), numpy.arcsin(x), numpy.arccos(x), and numpy.arctan(x), as shown in the short sketch below.
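A short sketch (the sample angles are chosen for illustration):
import numpy as np
angles = np.array([0, np.pi / 2, np.pi])
print(np.sin(angles))   # approximately [0. 1. 0.]
print(np.cos(angles))   # approximately [ 1.  0. -1.]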
Statistical Methods:
Mean: The mean is a measure of central tendency. It is the total of all values divided by how many values there are. We use the mean() function to calculate the mean.
Syntax: np.mean(data)
Ex:
import numpy as np
# Sample data
data = np.array([1, 2, 3, 4, 5])
mean = np.mean(data)
print(f"Mean: {mean}")
Output:
Mean: 3.0
Average: The average is often used interchangeably with the mean. It is the total of all values divided by how many values there are. We use the average() function to calculate the average. This function is useful because it allows for the inclusion of weights to compute a weighted average.
Syntax: np.average(data), np.average(data, weights=weights)
Ex:
# Sample data
data = np.array([1, 2, 3, 4, 5])
average = np.average(data)
print(f"Average: {average}")
Output:
Average: 3.0
Median: The median is the middle value in an ordered dataset. The median is the middle value when the dataset has an odd number of values, and the average of the two middle values when the dataset has an even number of values. We use the median() function to calculate the median.
Syntax: np.median(data)
Ex:
# Sample data
data = np.array([1, 2, 3, 4, 5])
median = np.median(data)
print(f"Median: {median}")
Output:
Median: 3.0
Variance: Variance measures how spread out the numbers are from the mean. It shows how much the values in a dataset differ from the average. A higher variance means more spread. We use the var() function to calculate the variance.
Syntax: np.var(data)
Ex:
# Sample data
data = np.array([1, 2, 3, 4, 5])
variance = np.var(data)
print(f"Variance: {variance}")
Output:
Variance: 2.0
Minimum and Maximum: The minimum and maximum functions help identify the smallest and largest values in a dataset, respectively. We use the min() and max() functions to calculate these values.
Ex:
# Sample data
data = np.array([1, 2, 3, 4, 5])
minimum = np.min(data)
maximum = np.max(data)
print(f"Minimum: {minimum}")
print(f"Maximum: {maximum}")
Output:
Minimum: 1
Maximum: 5
Q. Sorting
● An ordered sequence is basically any sequence that has an order corresponding to its elements. It can be numeric or alphabetical, ascending or descending.
● There are many functions for performing sorting available in the NumPy library. We have various sorting algorithms like quicksort, merge sort and heapsort, and all of these are implemented using the numpy.sort() function.
numpy.sort() function: You can use the numpy ndarray method sort() to sort a NumPy array; it sorts the array in place. You can also use the global numpy.sort() function, which returns a sorted copy of the array.
Syntax: numpy.sort(a, axis=-1, kind=None, order=None)
Parameters:
● order: This parameter represents the field(s) according to which the array is to be sorted, in case the array contains fields.
● Returned value: This function returns a sorted array of the same type and shape as the input array.
Ex:
import numpy as np
# Note: the original input arrays were not fully shown; the values below are
# chosen so that the printed results match the output given in the notes.
a = np.array([[10, 15],
              [25, 17]])
arr1 = np.sort(a)                 # sort each row (along the last axis)
print(arr1)
b = np.array([[1, 15], [20, 18]])
print("\nSorting along last axis: \n")
arr2 = np.sort(b)
print(arr2)
c = np.array([12, 15, 10, 1])
arr3 = np.sort(c)                 # sort a 1D array
print(arr3)
Output:
[[10 15]
 [17 25]]

Sorting along last axis:

[[ 1 15]
 [18 20]]
[ 1 10 12 15]
Ex:
import numpy as np
array = np.array([3, 1, 2])   # sample values (the original array is not shown)
array2 = np.sort(array)
print(array2)
Output:
[1 2 3]
This NumPy set operation helps us find unique values from the set of array elements in Python. The numpy.unique() function skips all the duplicate values and returns only the unique elements from the array.
Syntax: np.unique(Array)
Example: In this example, we have used the unique() function to select and display the unique elements from the array. Thus, it skips the duplicate value 30 and selects it only once.
import numpy as np
arr = np.array([30, 60, 90, 30, 100, 60, 30])
data = np.unique(arr)
print(data)
Output:
[ 30  60  90 100]
Set Operations:
A set is a collection of unique data; that is, elements of a set cannot be repeated. NumPy set operations perform mathematical set operations on arrays like union, intersection, difference, and symmetric difference.
Set Union Operation: The union of two sets A and B includes all the elements of set A and B. In NumPy, we use the np.union1d() function to perform the set union operation on arrays.
Ex:
import numpy as np
A = np.array([1, 3, 5])
B = np.array([0, 2, 3])
result = np.union1d(A, B)
print(result)
Output: [0 1 2 3 5]
In this example, we have used the np.union1d(A, B) function to compute the union of the two arrays A and B.
Intersection Operation: The intersection of two sets A and B includes the common elements between set A and B. We use the np.intersect1d() function to perform the set intersection operation on arrays.
Ex:
A = np.array([1, 3, 5])
B = np.array([0, 2, 3])
result = np.intersect1d(A, B)
print(result)
Output: [3]
Difference Operation: The difference between two sets A and B includes the elements of set A that are not present in set B. We use the np.setdiff1d() function to find the difference between two arrays.
Ex:
A = np.array([1, 3, 5])
B = np.array([0, 2, 3])
result = np.setdiff1d(A, B)
print(result)
Output: [1 5]
Symmetric Difference Operation: The symmetric difference between two sets A and B includes all elements of A and B without the common elements. In NumPy, we use the np.setxor1d() function to perform the symmetric difference between two arrays.
Ex:
A = np.array([1, 3, 5])
B = np.array([0, 2, 3])
result = np.setxor1d(A, B)
print(result)
Output: [0 1 2 5]
UNIT-IV
Pandas
Pandas is a Python library used for data manipulation and analysis. It provides data structures like DataFrames and Series, which are built on top of NumPy, making it efficient for working with structured data. DataFrames are similar to tables in SQL or spreadsheets, allowing for easy organization and manipulation of data, while Series are one-dimensional arrays with labeled indices. Pandas is widely used for tasks such as data cleaning, transformation, and analysis. It offers functionalities for handling missing data, reshaping data, merging and joining datasets, and calculating summary statistics. Its integration with other Python libraries like NumPy and Matplotlib makes it a powerful tool for data science workflows.
Q. Series
A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, etc.). It is similar to a one-dimensional array or a list in Python, but with additional functionalities. Each element in a Pandas Series has a label associated with it, called an index. This index allows for fast and efficient data access and manipulation. Pandas Series can be created from various data structures like lists, dictionaries, NumPy arrays, etc.
A Series is similar to a one-dimensional ndarray (NumPy array) but with labels, which are also known as indices. These labels can be used to access the data within the Series. By default, the index values are integers starting from 0 to the length of the Series minus one, but you can also manually set the index labels.
Syntax: pandas.Series(data, index, dtype, copy)
Where,
Data: The data to be stored in the Series. It can be a list, ndarray, dictionary, scalar value (like an integer or string), etc.
Index: Optional. It allows you to specify the index labels for the Series. If not provided, default integer index labels (0, 1, 2, ...) will be used.
dtype: Optional. The data type of the Series. If not specified, it will be inferred from the data.
A pandas Series can be created in multiple ways: from an array, a list, a dictionary, or an existing DataFrame. Before creating a Series from an array, we first have to import the NumPy module and use the array() function in the program. If the data is an ndarray and an index is passed, the index should be of the same length as the data; if the index is not passed, the default value is range(n).
Ex:
import pandas as pd
import numpy as np
data = np.array(['python', 'php', 'java'])
series = pd.Series(data)
print(series)
Output:
0    python
1    php
2    java
dtype: object
If you have a Python list, it can be easily converted into a Pandas Series by passing the list object as an argument to the Series() function. In case you want to convert the Pandas Series to a list, use Series.tolist().
Ex:
data = ['python', 'php', 'java']
# index labels inferred from the output shown below
series = pd.Series(data, index=['r1', 'r2', 'r3'])
print(series)
Output:
r1    python
r2    php
r3    java
dtype: object
A Python dictionary can be utilized to create a Pandas Series. In this process, the dictionary keys are used as the index labels, and the dictionary values are used as the data for the Series. If you want to convert a Pandas Series to a dictionary, use the Series.to_dict() method.
Ex:
# dictionary values inferred from the output shown below
data = {'Courses': 'pandas', 'Fee': 20000, 'Duration': '30days'}
series = pd.Series(data)
print(series)
Output:
Courses     pandas
Fee          20000
Duration    30days
dtype: object
• Series(): A pandas Series can be created with the Series() constructor method. This constructor method accepts a variety of inputs.
• name: Allows you to give a name to a Series object, i.e., to the column.
• is_unique: Returns True if all the values in the Series are unique.
• idxmax(): Method to extract the index label of the highest value in a Series.
A brief sketch of these is shown below.
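(A minimal sketch; the sample Series values are illustrative, not from the original notes.)
import pandas as pd
s = pd.Series([10, 30, 20], index=['a', 'b', 'c'], name='marks')
print(s.name)        # 'marks' - the name given to the Series (column)
print(s.is_unique)   # True - every value occurs only once
print(s.idxmax())    # 'b' - index label of the highest value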
A DataFrame in Python's pandas library is a two-dimensional labeled data structure that is used for data manipulation and analysis. It can handle different data types such as integers, floats, and strings. Each column has a unique label, and each row is labeled with a unique index value, which helps in accessing specific rows.
DataFrames are used in machine learning tasks, allowing users to manipulate and analyze large data sets. They support operations such as filtering, sorting, merging, grouping and transforming data.
DataFrame Structure:
Syntax: pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)
A pandas DataFrame can be created using various inputs like -
1) Lists
2) Dictionary
3) Series
4) NumPy ndarrays
5) Another DataFrame
6) External input files like CSV, JSON, HTML, Excel sheets, and more.
Some advantages of DataFrames (a DataFrame creation sketch follows this list):
1) Can easily load data from different databases and data formats.
3) Have intuitive merging and joining of data sets that use a common key in order to get a complete view.
6) Aggregate and summarize quickly in order to get eloquent stats from your data by accessing built-in summary functions.
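For instance, a DataFrame can be created from a dictionary of lists (a minimal sketch; the sample data is illustrative only):
import pandas as pd
data = {'Name': ['Asha', 'Ravi'], 'Marks': [85, 90]}
df = pd.DataFrame(data)
print(df)
Output:
   Name  Marks
0  Asha     85
1  Ravi     90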
Q. Dropping Entries
The Pandas drop() function in Python drops specified labels from rows and columns. Drop is a major function used in data science and machine learning to clean a dataset. The Pandas drop() function removes specified labels from rows or columns. When using a multi-index, labels on different levels can be removed by specifying the level. The drop() function removes specified rows or columns from a Pandas DataFrame or Series.
Syntax: DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
Options     Explanation
labels      Single label or list-like. Index or column labels to drop.
axis        The drop will remove the provided axis; the axis can be 0 or 1.
            axis 0 refers to rows or indexes (vertical);
            axis 1 refers to columns (horizontal);
            by default, axis = 0.
index       Single label or list-like. The index is the row (vertical) and is equivalent to axis=0.
columns     Single label or list-like. The columns are horizontal in the tabular view and are denoted with axis=1.
inplace     Accepts bool (True or False); default is False. inplace makes the changes then and there, so you don't need to assign the result to a variable.
level       int or level name, optional. For MultiIndex, the level from which the labels will be removed.
errors      Can be 'ignore' or 'raise'; default is 'raise'. If 'ignore', the error is suppressed and only existing labels are dropped; if 'raise', an error message is shown and dropping the data is not allowed.
The drop() function in the Python pandas library is useful for removing specified rows or columns from a DataFrame or Series. The function takes in several parameters, including the labels to drop, the axis (i.e., rows or columns), and whether or not to modify the original DataFrame in place.
With the Pandas DataFrame.drop() method, we can easily manipulate the structure of our data by removing unnecessary rows or columns. We can also chain multiple drop() calls in Python to remove multiple rows or columns.
It's important to note that the Python drop() function in Pandas with inplace=True modifies the original DataFrame in place and does not return a new DataFrame object. This can be useful to save memory when working with large datasets.
Ex:
import pandas as pd
import numpy as np
# DataFrame values inferred from the output shown below
df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=['a', 'b', 'c', 'd'])
print("-----DataFrame-----")
print(df)
print(df.drop(1))
Output:
-----DataFrame-----
   a  b   c   d
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
   a  b   c   d
0  0  1   2   3
2  8  9  10  11
Q. Indexing
Indexing is the process of accessing an element in a sequence using its position in the sequence (its index). In Python, indexing starts from 0, which means the first element in a sequence is at position 0, the second element is at position 1, and so on. Indexes can be numeric, string, or even datetime. They can also be unique or non-unique. By default, Pandas assigns a numeric, auto-incrementing index to each DataFrame you create.
2. Selection: Indexes make data selection and manipulation faster and easier.
To access an element in a sequence, you can use square brackets [] with the index of the element.
Ex:
# list values chosen for illustration (the original code is not shown)
my_list = [10, 20, 30]
print(my_list[0])   # first element
print(my_list[1])   # second element
In the above code, we have created a list called my_list and then used indexing to access the first and second elements in the list using their respective indices.
To set an index in a Pandas DataFrame: Setting an index in a DataFrame is straightforward. You can use the set_index() function, which takes a column name (or a list of column names) as an argument.
Ex:
import pandas as pd
# column values inferred from the output shown below
df = pd.DataFrame({
    'A': ['foo', 'bar', 'baz', 'qux'],
    'B': ['one', 'one', 'two', 'three'],
    'C': [1, 2, 3, 4],
    'D': [10, 20, 30, 40]
})
df.set_index('A', inplace=True)
print(df)
Output:
         B  C   D
A
foo    one  1  10
bar    one  2  20
baz    two  3  30
qux  three  4  40
The inplace=True argument modifies the original DataFrame. If you don't include this argument, the function will return a new DataFrame.
Multi-Indexing in Pandas:
Pandas also supports multiple indexes, which can be useful for higher dimensional data. You can create a multi-index DataFrame by passing a list of column names to the set_index() function.
Ex:
# first restore 'A' as a column, then set a two-level index (illustrative call)
df = df.reset_index().set_index(['A', 'B'])
print(df)
Output:
           C   D
A   B
foo one    1  10
bar one    2  20
baz two    3  30
qux three  4  40
Resetting the Index: If you want to revert your DataFrame to the default integer index, you can use the reset_index() function.
df.reset_index(inplace=True)
Output:
     A      B  C   D
0  foo    one  1  10
1  bar    one  2  20
2  baz    two  3  30
3  qux  three  4  40
Indexing for Performance: Indexes are not just for identification and selection; they can also significantly improve performance. When you perform an operation that uses the index, like a data lookup or a merge, Pandas uses a hash-based algorithm, which is extremely fast.
Q. Selection
Pandas selection refers to the process of extracting specific portions of data from a DataFrame. Data selection involves choosing specific rows and columns based on labels, positions, or conditions. Pandas provides various methods, such as basic indexing, slicing, boolean indexing, and querying, to efficiently extract, filter, and transform data, enabling users to focus on relevant information for analysis and decision-making. Two pandas functions - loc and iloc - allow you to select rows and columns either by their labels (names) or their integer positions (indexes).
Ex:
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35], 'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)
print(df)
Output:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
The .loc[] method selects data based on labels (names of rows or columns). It is flexible and supports various operations like selecting single rows/columns, multiple rows/columns, or specific subsets.
1) Label-based indexing.
2) Can select both rows and columns simultaneously.
3) Supports slicing and filtering.
Ex:
row = df.loc[0]   # select the row with index label 0
print(row)
Output:
Name       Alice
Age           25
City    New York
Name: 0, dtype: object
Ex:
rows = df.loc[[0, 2]]   # Select rows with index labels 0 and 2
print(rows)
Output:
      Name  Age      City
0    Alice   25  New York
2  Charlie   35   Chicago
subset = df.loc[0:1, ['Name', 'City']]   # Select first two rows and specific columns
print(subset)
Output:
    Name         City
0  Alice     New York
1    Bob  Los Angeles
filtered = df.loc[df['Age'] > 30]   # filter condition inferred from the output shown below
print(filtered)
Output:
      Name  Age     City
2  Charlie   35  Chicago
The .iloc[] method selects data based on integer positions (index numbers). It is particularly useful when you don't know the labels but know the positions.
Ex:
row = df.iloc[1]   # Select the second row (index position = 1)
print(row)
Output:
Name            Bob
Age              30
City    Los Angeles
Name: 1, dtype: object
Ex:
rows = df.iloc[[0, 2]]   # Select first and third rows by position
print(rows)
Output:
      Name  Age      City
0    Alice   25  New York
2  Charlie   35   Chicago
Q. Filtering
One of the most common data manipulation operations is DataFrame filtering. A DataFrame is filtered when its data are analyzed and only the rows that satisfy specific requirements are returned.
Pandas, a great data manipulation tool, is the best fit for DataFrame filtering. The core data structure of Pandas is the DataFrame, which stores data in tabular form with labeled rows and columns.
import pandas as pd
import numpy as np
# the 'name' column and the 'ctg' values are inferred from the output shown below;
# 'val' and 'val2' are random, so exact values will differ on each run
df = pd.DataFrame({
    'name': ['Jane', 'John', 'Ashley', 'Mike', 'Emily', 'Jack', 'Catlin'],
    'ctg': ['A', 'A', 'C', 'B', 'B', 'C', 'B'],
    'val': np.random.random(7).round(2),
    'val2': np.random.randint(1, 10, size=7)
})
df
Output:
     name ctg   val  val2
0    Jane   A  0.43     1
1    John   A  0.67     1
2  Ashley   C  0.40     7
3    Mike   B  0.91     5
4   Emily   B  0.99     8
5    Jack   C  0.02     7
6  Catlin   B  1.00     3
1) Logical Operators: We can use the logical operators on column values to filter rows. For example, df[df.val > 0.9] selects the rows in which the value in the "val" column is greater than 0.9.
Output:
     name ctg   val  val2
6  Catlin   B  1.00     3
2) Isin: The isin() method is another way of applying multiple conditions for filtering. For instance, we can filter the names that exist in a given list.
Ex:
names = ['John', 'Catlin', 'Mike']
df[df.name.isin(names)]
Output:
     name ctg   val  val2
1    John   A  0.67     1
3    Mike   B  0.91     5
6  Catlin   B  1.00     3
3) Str accessor: Pandas is a highly efficient library for textual data as well. The functions and methods under the str accessor provide flexible ways to filter rows based on strings. For instance, we can select the names that start with the letter "J".
Ex:
df[df.name.str.startswith('J')]
Output:
   name ctg   val  val2
0  Jane   A  0.43     1
1  John   A  0.67     1
5  Jack   C  0.02     7
Tilde (~): The tilde operator is used for "not" logic in filtering. If we add the tilde operator before the filter expression, the rows that do not fit the condition are returned.
Ex: df[~df.name.str.contains('J')]
Output:
     name ctg   val  val2
2  Ashley   C  0.40     7
3    Mike   B  0.91     5
4   Emily   B  0.99     8
6  Catlin   B  1.00     3
7) Query: The query function offers a little more flexibility when writing the conditions for filtering; we can pass the conditions as a string. For instance, a query such as df.query("ctg == 'B' and val > 0.9") returns the rows that belong to the B category and have a value greater than 0.9 in the 'val' column.
Output:
     name ctg   val  val2
3    Mike   B  0.91     5
4   Emily   B  0.99     8
6  Catlin   B  1.00     3
8) Nlargest or Nsmallest: In some cases, we do not have a specific range for filtering but just need the largest or smallest values. The nlargest and nsmallest functions allow you to select rows that have the largest or smallest values in a column, respectively. For instance, df.nlargest(2, 'val') returns the two rows with the largest 'val' values.
Output:
     name ctg   val  val2
6  Catlin   B  1.00     3
4   Emily   B  0.99     8
Q. Function Application and Mapping
Pandas provides powerful methods to apply custom or library functions to DataFrame and Series objects. Depending on whether you want to apply a function to the entire DataFrame, row- or column-wise, or element-wise, Pandas offers several methods to achieve these tasks. There are three essential methods for function application in Pandas -
a) Table-wise Function Application: pipe()
Custom operations are performed by passing a function and an appropriate number of parameters; these are known as pipe arguments. The operation is then performed on the entire DataFrame or Series. When we want to apply one function to a Series or DataFrame, then apply another, then another, and so on, the notation can become messy and the program more prone to error. Here, pipe() becomes useful.
Here is an example that demonstrates how you can add a value to all elements using the pipe() function (see the sketch after the output below).
Ex:
dataflair_s1 = pd.Series([11, 21, 31, 41, 51])
dataflair_s1
Output:
0    11
1    21
2    31
3    41
4    51
dtype: int64
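A minimal sketch of the pipe() call itself (the add_value helper and the constant 10 are illustrative assumptions):
import pandas as pd

def add_value(obj, n):
    # hypothetical helper: adds n to every element of a Series/DataFrame
    return obj + n

dataflair_s1 = pd.Series([11, 21, 31, 41, 51])
print(dataflair_s1.pipe(add_value, 10))   # 21, 31, 41, 51, 61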
b) Row- or Column-wise Function Application: apply()
The apply() function is versatile and allows you to apply a function along the axes of a DataFrame. By default, it applies the function column-wise, but you can specify row-wise application using the axis parameter.
Ex:
import numpy as np
import pandas as pd
# DataFrame values taken from the output shown below
df = pd.DataFrame({'col1': [3, 4, 5], 'col2': [7, 1, 2], 'col3': [6, 4, 8]})
print('Original DataFrame:\n', df)
result = df.apply(np.mean)
print('Result:\n', result)
Output:
Original DataFrame:
    col1  col2  col3
0     3     7     6
1     4     1     4
2     5     2     8
Result:
col1    4.000000
col2    3.333333
col3    6.000000
dtype: float64
To apply the np.mean() function to the rows of the DataFrame instead, pass axis=1, as follows -
Ex:
print('Original DataFrame:\n', df)
result = df.apply(np.mean, axis=1)
print('Result:\n', result)
Output:
Original DataFrame:
    col1  col2  col3
0     3     7     6
1     4     1     4
2     5     2     8
Result:
0    5.333333
1    3.000000
2    5.000000
dtype: float64
c) Element-wise Function Application: map()
When you need to apply a function to each element individually, you can use the map() function. This method is particularly useful when the function cannot be vectorized.
Using the map() Function: The following example demonstrates how to use the map() function for applying a custom function to the elements of a Series.
Ex:
import pandas as pd
gfg_string = 'hello'
gfg_series = pd.Series(list(gfg_string))
print("Original series:\n" +
      gfg_series.to_string(index=False, header=False), end='\n\n')
new_gfg_series = gfg_series.map(str.upper)
print("Transformed series:\n" +
      new_gfg_series.to_string(index=False, header=False), end='\n\n')
Output:
Original series:
h
e
l
l
o

Transformed series:
H
E
L
L
O
Sorting is a fundamental operation when working with data in Pandas, whether you're organizing rows, columns, or specific values. Sorting can help you arrange your data in a meaningful way for better understanding and easy analysis. Pandas provides powerful tools for sorting your data efficiently, which can be done by labels or by actual values.
1) Sorting by Label: This involves sorting the data based on the index labels.
2) Sorting by Value: This involves sorting data based on the actual values in the DataFrame or Series.
Sorting by Label:
To sort by the index labels, you can use the sort_index() method. By passing the axis argument and the order of sorting, the data structure object can be sorted. By default, this method sorts the DataFrame in ascending order based on the row labels.
Ex:
import pandas as pd
import numpy as np
# the shape (4, 2) and the column names are assumed to match the index list and
# the output layout; the values are random and will differ on each run
unsorted_df = pd.DataFrame(np.random.randn(4, 2), index=[1, 0, 2, 3],
                           columns=['col2', 'col1'])
sorted_df = unsorted_df.sort_index()
print("Original DataFrame:")
print(unsorted_df)
print("Sorted DataFrame:")
print(sorted_df)
Output:
Original DataFrame:
    col2   col1
1  1.116  1.631
2 -2.070  0.148
0  0.922 -0.429
Sorted DataFrame:
    col2   col1
0  0.922 -0.429
1  1.116  1.631
2 -2.070  0.148
b) Sorting by Actual Values: Like index sorting, sorting by actual values can be done using the sort_values() method. This method allows sorting by one or more columns. It accepts a 'by' argument which takes the column name of the DataFrame with which the values are to be sorted.
Ex:
import pandas as pd
# series values inferred from the (partially shown) output below
panda_series = pd.Series([18, 12, 55, 0])
panda_series_sorted = panda_series.sort_values(ascending=True)
print("Original Pandas Series:")
print(panda_series)
print("Sorted Pandas Series:")
print(panda_series_sorted)
Output:
Original Pandas Series:
0    18
1    12
2    55
3     0
dtype: int64
Sorted Pandas Series:
3     0
1    12
0    18
2    55
dtype: int64
Ranking: Ranking in pandas is achieved using the rank() method, which assigns ranks to elements within a Series or DataFrame. This method provides flexibility in how ranks are computed and how ties are handled.
The rank() method is applied to a Series or a column of a DataFrame. By default, it assigns ranks in ascending order, with the smallest value receiving rank 1. In cases of ties, the average rank is assigned to the tied values.
rank() Arguments: The rank() method accepts arguments such as method (how ties are ranked) and ascending (the sort order); the common tie-breaking methods are listed further below.
Ex:
import pandas as pd
# score values inferred from the output shown below
data = {'Score': [250, 400, 300, 300, 150]}
df = pd.DataFrame(data)
df['Rank'] = df['Score'].rank()
print(df)
Output:
   Score  Rank
0    250   2.0
1    400   5.0
2    300   3.5
3    300   3.5
4    150   1.0
Rank a Column with Different Methods:
Method     Description
average    Default: assigns the average rank to each entry in the tied group.
Ex:
import pandas as pd
# score values inferred from the output shown below
data = {'Score': [78, 85, 85, 90]}
df = pd.DataFrame(data)
df['Rank'] = df['Score'].rank(method='max')
print(df)
Output:
   Score  Rank
0     78   1.0
1     85   3.0
2     85   3.0
3     90   4.0
Statistics is a branch of mathematics that deals with the collection, organization, analysis and interpretation of data. Descriptive statistics involves summarizing and organizing the data so that it can be easily understood.
✓ Descriptive statistics are a set of tools in data analysis that help us understand and summarize the main features of a dataset.
✓ They provide simple ways to describe essential aspects like central tendency (mean, median), variability (range, standard deviation), and distribution shape.
✓ By using descriptive statistics, we can quickly grasp the overall picture of data, spot patterns, and identify potential outliers.
✓ These techniques make complex data more manageable and accessible, aiding in making informed decisions and drawing meaningful insights from the information.
Measure of Central Tendency: A measure of central tendency is used to describe the middle/center value of the data. Mean, Median and Mode are measures of central tendency.
1. Mean: The mean is the total of all values divided by how many values there are.
2. Median:
✓ The median is the middle number in the dataset.
✓ The median is the best measure when we have outliers.
Mathematical Calculation: If we have an even number of data points, find the average of the middle two items.
Ex: For the data [1, 3, 5, 7], the two middle values are 3 and 5, so the median is (3 + 5) / 2 = 4.
3) Mode: The mode is the most frequently occurring value in the dataset; for example, the mode of a variable such as "Age" is the age value that appears most often.
Measure of Variability:
1) Range: The range describes the difference between the largest and smallest data points in our data set.
Syntax: Range = Largest data value - Smallest data value
Ex:
# Sample data
arr = [1, 2, 3, 4, 5]
# Finding max
Maximum = max(arr)
# Finding min
Minimum = min(arr)
# Difference of max and min
Range = Maximum - Minimum
print("Maximum = {}, Minimum = {} and Range = {}".format(Maximum, Minimum, Range))
Output: Maximum = 5, Minimum = 1 and Range = 4
2) Variance: Variance is defined as the average squared deviation from the mean. It is calculated by finding the difference between every data point and the average (the mean), squaring them, adding all of them and then dividing by the number of data points present in our data set.
Ex:
import statistics
# sample data
arr = [1, 2, 3, 4, 5]
# variance (the statistics.variance() call is inferred from the output shown)
print("Var = ", statistics.variance(arr))
Output:
Var = 2.5
n
3) Standard deviation: Standard deviation is widely used to measure the extent of variation or
.a
dispersion in data. It's especially important when assessing model performance (e.g.,
w
import statistics
w
arr = [1,2,3,4,5]
print(“Std = “,(statistics.stdev(arr)))
output:
std = 1.5811388300841898
Q. Unique Values
Pandas is a powerful tool for manipulating data once you know the core operations and how to use them. One such operation is getting the unique values in a column of a pandas DataFrame.
1) Using the unique() method: The Series.unique() method is used when we deal with a single column of a DataFrame, and it returns all the unique elements of the column as a NumPy array.
Syntax: Series.unique()
Ex:
import pandas as pd
data = {
    "Students": ["Ray", "John", "Mole", "Smith", "Jay", "Milli", "Tom", "Rick"],
    "Subjects": ["Maths", "Economics", "Science", "Maths", "Statistics", "Statistics", "Statistics", "Computers"]
}
df = pd.DataFrame(data)
print(df["Subjects"].unique())
print(type(df["Subjects"].unique()))
Output:
['Maths' 'Economics' 'Science' 'Statistics' 'Computers']
<class 'numpy.ndarray'>
2) Using the drop_duplicates() method:
drop_duplicates() is an in-built function in the pandas library that helps to remove duplicates from the DataFrame. It helps to preserve the type of the DataFrame object or its subset and removes the rows with duplicate values. When dealing with large DataFrames, using the drop_duplicates() method is considered the faster option for removing duplicate values.
Ex:
import pandas as pd
data = {
    "Students": ["Ray", "John", "Mole", "Smith", "Jay", "Milli", "Tom", "Rick"],
    "Subjects": ["Maths", "Economics", "Science", "Maths", "Statistics", "Statistics", "Statistics", "Computers"]
}
df = pd.DataFrame(data)
print(df.drop_duplicates(subset="Subjects"))
print(type(df.drop_duplicates(subset="Subjects")))
Output:
  Students    Subjects
0      Ray       Maths
1     John   Economics
2     Mole     Science
4      Jay  Statistics
7     Rick   Computers
<class 'pandas.core.frame.DataFrame'>
Till now we have understood how you can get the set of unique values from a single column of a DataFrame. But what if you wish to identify unique values from more than one column? In such cases, you can merge the content of those columns for which the unique values are to be found and, later, use the unique() method on that series (column) object.
Ex:
import pandas as pd
data = {
    "Students": ["Ray", "John", "Mole", "Smith", "Jay", "Milli", "Tom", "Rick"],
    "Subjects": ["Maths", "Economics", "Science", "Maths", "Statistics", "Statistics", "Statistics", "Computers"]
}
# load data into a DataFrame object:
df = pd.DataFrame(data)
# one way to merge the two columns into a single series, as described above
uniqueValues = pd.concat([df["Students"], df["Subjects"]]).unique()
print(uniqueValues)
Output: ['Ray' 'John' 'Mole' 'Smith' 'Jay' 'Milli' 'Tom' 'Rick' 'Maths' 'Economics' 'Science' 'Statistics' 'Computers']
Suppose that, instead of finding the names of the unique values in the columns of the DataFrame, you wish to count the total number of unique elements. In such a case, you can make use of the nunique() method instead of the unique() method, as shown in the below example:
import pandas as pd
data = {
    "Students": ["Ray", "John", "Mole", "Smith", "Jay", "Milli", "Tom", "Rick"],
    "Subjects": ["Maths", "Economics", "Science", "Maths", "Statistics", "Statistics", "Statistics", "Computers"]
}
df = pd.DataFrame(data)
uniqueValues = df["Subjects"].nunique()
print(uniqueValues)
Output: 5
In the above method, we counted the number of unique values for a given column of the DataFrame. However, if you wish to find the total number of unique elements in each column of the DataFrame, you can call the nunique() method on the DataFrame object itself.
Ex:
import pandas as pd
data = {
    "Students": ["Ray", "John", "Mole", "Smith", "Jay", "Milli", "Tom", "Rick"],
    "Subjects": ["Maths", "Economics", "Science", "Maths", "Statistics", "Statistics", "Statistics", "Computers"]
}
df = pd.DataFrame(data)
uniqueValues = df.nunique()
print(uniqueValues)
Output:
Students    8
Subjects    5
dtype: int64
Q. Value Counts
The value_counts() method in Pandas is used to count the number of occurrences of each unique value in a Series.
✓ value_counts() is a function in the Pandas library that counts the frequency of unique values in a Series.
✓ It returns a Series containing counts of unique values. This function is beneficial for data analysis and preprocessing tasks, providing insights into the distribution of categorical data.
value_counts() Arguments:
1) normalize (optional) - if set to True, returns the relative frequencies (proportions) of unique values instead of their counts.
2) sort (optional) - determines whether to sort the unique values by their counted frequencies.
Ex:
import pandas as pd
# create a Series (values chosen to match the output shown below)
data = pd.Series(['apple', 'banana', 'apple', 'orange', 'banana', 'apple'])
counts = data.value_counts()
print(counts)
Output:
apple     3
banana    2
orange    1
dtype: int64
Ex:
import pandas as pd
# create a Series
data = pd.Series(['apple', 'banana', 'apple', 'orange', 'banana', 'apple', 'banana', 'kiwi', 'kiwi', 'kiwi'])
# use value_counts() with sorting
counts_sort = data.value_counts(sort=True)
print("Counts with sorting:")
print(counts_sort)
Output:
Counts with sorting:
apple     3
banana    3
kiwi      3
orange    1
dtype: int64
Q. Membership
Membership operators in Python are operators used to test whether a value exists in a sequence, such as a list, tuple, or string. The membership operators available in Python are:
• in: The in operator returns True if the value is found in the sequence.
• not in: The not in operator returns True if the value is not found in the sequence.
Membership operators are commonly used with sequences like lists, tuples, and sets. They can also be used with strings, dictionaries, and other iterable objects.
1) The in operator: The in operator is used to check whether a value is present in a sequence or not. If the value is present, then True is returned; otherwise, False is returned.
Ex:
# Define a list fruits = ['apple', 'banana', 'cherry'] #Check if 'apple' is in the list
if 'apple' in fruits:
Output:
2) The not in operator: The not in operator is opposite to the in operator. If a value is not present in a
sequence, it returns true. Otherwise, it returns false.
Ex:
g
# Define a tuple
or
numbers = (1, 2, 3, 4, 5)
specific condition.
w
• Artificial intelligence: The membership operator is used in fuzzy logic to represent the degree
w
of membership of an element in a set. Fuzzy logic is used to model systems that have
uncertain or imprecise information.
• Natural language processing: The membership operator is used in natural language
processing to determine the similarity between words and phrases. It is used to identify
synonyms and related words in a corpus of text.
Q. Reading and Writing Data in Text Format
Python can read and write both ordinary text files as well as binary files. Below are the methods by which we can read text files with Pandas:
2) Read Text Files with Pandas Using read_table():
Syntax: data = pandas.read_table('filename.txt', delimiter=" ")
3) Read Text Files with Pandas Using read_fwf():
The fwf in the read_fwf() function stands for fixed-width lines. We can use this function to load DataFrames from files. This function also supports text files. We will read data from text files using the read_fwf() function with pandas. It also supports optionally iterating or breaking the file into chunks. Since the columns in the text file are separated with a fixed width, no delimiter needs to be specified; a sketch of both calls follows.
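A brief sketch of both calls (the file names are hypothetical placeholders):
import pandas as pd

# read a delimited text file (space-separated columns assumed)
df1 = pd.read_table('data.txt', delimiter=' ')

# read a fixed-width text file
df2 = pd.read_fwf('fixed_width.txt')

print(df1.head())
print(df2.head())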
Opening and Closing a File: All files must be explicitly opened and closed when working with them in the Python environment. This can be performed using the open() and close() functions.
Ex:
f = open(filepath, 'r')
f.close()
The most common way to do this is to use the with open statement, which automatically closes the file once the file handling is complete. mode is an optional parameter which tells Python what permissions to give to the script opening the file. It defaults to read-only 'r'.
Mode    Description
'r'     Open the file for reading only
'w'     Open the file for writing (will overwrite existing content)
'a'     Open the file for appending to existing content
There are three methods used to read the contents of a text file:
• read() - returns the data in the text file as a single string of characters.
• readline() - reads a line in the file and returns it as a string. It will only read a single line.
• readlines() - reads all the lines in the text file and returns each line as a string within a list. This method looks for EOL characters to differentiate items in the list. readlines() is most often used implicitly in a for loop to extract all the data from the text file.
Example using readline(): readline() will return the first line in the text file, that is, all the text up to the first end-of-line (EOL) character.
Ex:
with open("sample-textfile.txt", "r") as f:
    string = f.readline()
Writing to a text file follows a very similar methodology to that of reading files. Use a with open statement, but now change the mode to 'w' to write a new file or 'a' to append onto an existing file.
There are two write functions that you can use:
• write() takes a string as an input and will write out a single line.
• writelines() takes a list of strings as an input and will write out each string in the list.
Example using write():
import os
outfilename = "sample-output.txt"
outputfolder = "output"
filepath = os.path.join(outputfolder, outfilename)
# the with-open call is implied by the pattern described above
with open(filepath, 'w') as f:
    f.write("Hello World\n")
    f.write("How are you doing?\n")
Output (contents of sample-output.txt):
Hello World
How are you doing?
UNIT-V
Data Cleaning and Preparation
Data Cleaning: Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. When combining multiple data sources, there are many opportunities for data to be duplicated or mislabeled. If data is incorrect, outcomes and algorithms are unreliable, even though they may look correct.
There is no one absolute way to prescribe the exact steps in the data cleaning process because the processes will vary from dataset to dataset. But it is crucial to establish a template for your data cleaning process so you know you are doing it the right way every time.
The following are essential steps to perform data cleaning.
Q. Handling Missing Data
Missing values in a dataset can occur due to several reasons, such as a breakdown in data collection or data entry. For example (sample rows, where the value for the second row is missing):
1   NY   57400           MID
2   TX   Missing value   ENTRY
3   NJ   90000           HIGH
✓ Missing data can compromise the quality of your analysis. You can address these gaps by either deleting or imputing data.
✓ Rows or columns with minimal missing data can be deleted without heavily impacting the overall analysis. However, if the missing data is significant, it's advisable to impute values using statistical measures such as mean, median, or mode.
✓ This approach helps maintain the integrity of your analysis while addressing the issue of missing data. Handling missing values appropriately is essential for reliable analysis results.
Let us read the dataset GDP_missing_data.csv, in which we have randomly removed some values, i.e., put missing values in some of the columns.
Ex:
import pandas as pd
import numpy as np
gdp_missing_values_data = pd.read_csv('./Datasets/GDP_missing_data.csv')
gdp_complete_data = pd.read_csv('./Datasets/GDP_complete_data.csv')
Output (first few rows of the dataset with missing values):
Afghanistan    2474.0   87.5    NaN   NaN
Algeria       11433.0   76.4   76.1   NaN
Observe that the gdp_missing_values_data dataset contains some missing values, shown as NaN (Not a Number).
Types of missing values
Now that we know how to identify missing values in the dataset, let us learn about the types of missing values that can be there. Rubin (1976) classified missing values into three categories.
● Missing completely at random (MCAR): In this case, there may be no pattern as to why a column's data is missing. For example, survey data is missing because someone could not make it to an appointment, or an administrator misplaces the test results he is supposed to enter into the computer. The reason for the missing values is unrelated to the data in the dataset.
● Missing at random (MAR): In this scenario, the reason the data is missing in a column can be explained by the data in other columns. For example, a school student who scores above the cutoff is typically given a grade. So, a missing grade for a student can be explained by the column that has scores below the cutoff. The reason for these missing values can be described by data in another column.
● Missing not at random (MNAR): Sometimes, the missing value is related to the value itself. For example, higher-income people may not disclose their incomes. Here, there is a correlation between the missing values and the actual income. The missing values are not dependent on other variables in the dataset.
Methods for Identifying Missing Data
Functions    Description
.notnull()   Returns a Boolean DataFrame, where True indicates non-missing values and False indicates missing values.
.info()      Displays information about the DataFrame, including data types, memory usage, and presence of missing values.
.isna()      Similar to notnull() but returns True for missing values and False for non-missing values.
.dropna()    Drops rows or columns containing missing values based on custom criteria.
fillna()     Fills missing values with specific values, means, medians, or other calculated values.
unique()     Finds unique values in a Series or DataFrame.
A small sketch of the most common of these follows.
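(A minimal sketch; the sample DataFrame is illustrative.)
import pandas as pd
import numpy as np

df = pd.DataFrame({'State': ['NY', 'TX', 'NJ'],
                   'Salary': [57400, np.nan, 90000]})

print(df.isna())                          # True where a value is missing
print(df.dropna())                        # drop rows that contain missing values
print(df.fillna(df['Salary'].mean()))     # impute the missing salary with the column mean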
Q. Removing Duplicates
Duplicates can arise from various sources, such as human errors, data entry mistakes, data integration issues, web scraping errors, or data collection methods. For example, a customer may fill out a form twice with slightly different information, a data entry operator may copy and paste the same record multiple times, a data integration process may merge data from different sources without checking for uniqueness, a web scraper may extract the same page more than once, or a survey may collect responses from the same respondent using different identifiers. Some of these causes are easy to avoid or fix, while others may require more complex solutions.
Detection of duplicates:
✓ When dealing with duplicates, the first step is to detect them in your data set. Depending on the type, size, and level of similarity you wish to consider, there are various tools and techniques that can be used.
✓ One approach is to generate unique identifiers for each record based on their values, allowing you to compare them and find exact or near duplicates.
Measurement of duplicates: To measure duplicates in your data set, you can use different metrics and indicators. This can help you understand the extent and impact of duplicates on your analysis.
Code:
df1 = pd.DataFrame({'k1': ['A'] * 3 + ['B'] * 4, 'k2': [1, 1, 2, 2, 3, 3, 4]})
df1
Output:
  k1  k2
0  A   1
1  A   1
2  A   2
3  B   2
4  B   3
5  B   3
6  B   4
Code:
df_dedup = df1.drop_duplicates()
df_dedup
Output:
  k1  k2
0  A   1
2  A   2
3  B   2
4  B   3
6  B   4
Q. Transforming Data Using a Function or Mapping
For many datasets, you may need to perform some transformations based on the values in a column. Consider the following scenario: you have a DataFrame consisting of customer names and addresses. The addresses include a pin code and city, but country information is missing. You need to add the country information to this DataFrame. Fortunately, you have a city-to-country mapping maintained in a dictionary, and you want to create an additional column in the DataFrame that contains the country values.
Let's implement this solution using the map method. This method is called on a series and you pass a function to it as an argument. The function is applied to each value in the series specified in the map method.
Use these code snippets for a demonstration:
Code:
df_person = pd.DataFrame([
    ['Person1', 'Melbourne', '3024'],
    ['Person2', 'Sydney', '3003'],
    ['Person3', 'Delhi', '100001'],
    ['Person4', 'Kolkata', '700007'],
    ['Person5', 'London', 'QA3023']
], columns=['Name', 'City', 'Pin'])
df_person
Output:
      Name       City     Pin
0  Person1  Melbourne    3024
1  Person2     Sydney    3003
2  Person3      Delhi  100001
3  Person4    Kolkata  700007
4  Person5     London  QA3023
Next, let us create a dictionary for the city and the country.
Code:
dict_mapping = {"Melbourne": "Australia",
                "Sydney": "Australia",
                "Delhi": "India",
                "Kolkata": "India",
                "London": "United Kingdom"}
Code:
df_person['Country'] = df_person['City'].map(lambda x: dict_mapping[x])
df_person
We can observe in the result:
● the map method applies an inline (lambda) function to each value of the 'City' series;
● this inline function takes the key (x) as an input and returns the value corresponding to this key (x) from the dictionary object dict_mapping;
● the resulting value is stored in a new column, "Country", in the original DataFrame df_person.
Q. Replacing Values
The replace() method replaces a specified value with another specified value. The Pandas DataFrame.replace() function is used to replace a string, regex, list, dictionary, Series or number in a DataFrame.
Syntax: DataFrame.replace(to_replace=None, value=None, inplace=False, limit=None, regex=False, method="pad", axis=None)
Parameters:
● to_replace: [str, regex, list, dict, Series, numeric, or None] pattern that we are trying to replace in the DataFrame.
● value: Value to use to fill holes (e.g. 0); alternately, a dict of values specifying which value to use for each column (columns not in the dict will not be filled). Regular expressions, strings and lists or dicts of such objects are also allowed.
● inplace: If True, performs the operation in place. Note: this will modify any other views on this object (e.g. a column from a DataFrame). Returns the caller if this is True.
● limit: Maximum size gap to forward or backward fill.
● regex: Whether to interpret to_replace and/or value as regular expressions. If this is True then to_replace must be a string. Otherwise, to_replace must be None because this parameter will be interpreted as a regular expression or a list, dict, or array of regular expressions.
● method: Method to use for replacement, when to_replace is a list.
Ex 2: We are going to replace the team "Boston Celtics" with "Omega Warrior" in the 'df' DataFrame.
# this will replace "Boston Celtics" with "Omega Warrior"
df.replace(to_replace="Boston Celtics", value="Omega Warrior")
Example: Here, we are replacing 49.50 with 60.
import pandas as pd
df = {
    "Array1": [49.50, 70],
    "Array2": [65.1, 49.50]
}
data = pd.DataFrame(df)
print(data.replace(49.50, 60))
Output:
   Array1  Array2
0    60.0    65.1
1    70.0    60.0
Q. Detecting and Filtering Outliers
Outliers are data points that are significantly different from other observations in a dataset. They lie at an abnormal distance from other values in a random sample from a population. In simpler terms, outliers are the odd ones out - the data points that don't seem to fit the pattern of the rest of the data.
For example, consider a box plot illustrating the distribution of goals scored per player. The majority of players scored between 2 and 8 goals, as indicated by the interquartile range (the box). The whiskers extend to the minimum and maximum values within 1.5 times the interquartile range from the lower and upper quartiles, respectively. However, there is an outlier at 20 goals, which is significantly higher than the rest of the data. This outlier represents a player who scored an exceptionally high number of goals compared to their peers.
Types of Outliers:
c) Global outliers: These are data points that are exceptional with respect to all other points in the dataset.
d) Local outliers: These are data points that are outliers with respect to their local neighborhood in the dataset, but may not be outliers in the global context.
Causes of outliers:
● Human errors, e.g. data entry errors
● Instrument errors, e.g. measurement errors
● Data processing errors, e.g. data manipulation
● Sampling errors, e.g. extracting data from wrong sources
Outlier Detection Methods
Outlier detection plays a crucial role in ensuring the quality and accuracy of machine learning models. By identifying and removing or handling outliers effectively, we can prevent them from biasing the model, reducing its performance, and hindering its interpretability. Here's an overview of common outlier detection methods.
1. Statistical Methods:
● Z-Score: This method calculates the standard deviation of the data points and identifies outliers as those with Z-scores exceeding a certain threshold (typically 3 or -3).
Detecting Outliers with Z-scores:
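A minimal sketch of the idea (the sample data and the threshold of 3 are illustrative assumptions):
import numpy as np

# hypothetical sample data with one unusually large value
data = np.array([10, 12, 11, 13, 12, 11, 10, 13, 12, 11, 10, 12, 95])
z_scores = (data - data.mean()) / data.std()
outliers = data[np.abs(z_scores) > 3]
print(outliers)   # the value 95 is flagged as an outlier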
● Interquartile Range (IQR): IQR identifies outliers as data points falling outside the range defined by Q1 - k(Q3 - Q1) and Q3 + k(Q3 - Q1), where Q1 and Q3 are the first and third quartiles, and k is a factor (typically 1.5).
2. Distance-Based Methods:
● K-Nearest Neighbors (KNN): KNN identifies outliers as data points whose K nearest neighbors are far away from them.
● Local Outlier Factor (LOF): This method calculates the local density of data points and identifies outliers as those with significantly lower density compared to their neighbors.
3. Clustering-Based Methods:
● Density-Based Spatial Clustering of Applications with Noise (DBSCAN): DBSCAN clusters data points based on their density and identifies outliers as points not belonging to any cluster.
● Hierarchical clustering: Hierarchical clustering involves building a hierarchy of clusters by iteratively merging or splitting clusters based on their similarity. Outliers can be identified as clusters containing only a single data point or clusters significantly smaller than others.
Detecting and filtering outliers:
Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(1000, 4))
df.describe()
Output (abridged; the values come from random data and will differ on each run):
                 0            1           2            3
count  1000.000000  1000.000000  1000.00000  1000.000000
mean      -0.034508     0.011824    -0.24911    -0.048428
...
Suppose you want to find values in one of the columns whose absolute value is greater than 3:
col = df[1]
col[col.abs() > 3]
Output:
435    3.693235
Name: 1, dtype: float64
Plotting with pandas:
Q. Line Plots
A line plot is suitable for identifying trends and patterns over a continuous variable, which is usually time or a similar scale.
When creating the df_employees dataframe, we had defined a linear relationship between the number of years an employee has worked with the company and their salary. So let's look at the line plot showing how the average salaries vary with the number of years. We find the average salary grouped by the years with the company, and then create a line plot with plot.line().
Ex:
import pandas as pd
# Create a sample dataset (the Month/Sales values below are illustrative; the original data is not shown)
data = {
    "Month": ["Jan", "Feb", "Mar", "Apr"],
    "Sales": [250, 300, 280, 350]
}
df = pd.DataFrame(data)
# Plot a line graph
df.plot(x="Month", y="Sales", kind="line", title="Monthly Sales", legend=True, color="red")
Q. Bar Plots
A bar plot is a plot that presents categorical data with rectangular bars. The lengths of the bars are proportional to the values that they represent.
We can also create a bar plot by just passing the categorical variable on the X-axis and the numerical values on the Y-axis. Here we have a dataframe. Now we will create a vertical bar plot and a horizontal bar plot using kind as 'bar' and 'barh', and also use the subplots argument.
Ex:
import pandas as pd
# Sample data
data = {
    "Category": ["A", "B", "C", "D"],
    "Values": [4, 7, 1, 8]
}
df = pd.DataFrame(data)
# Create subplots for vertical and horizontal bar plots
df.plot(x="Category", y="Values", kind="bar", subplots=True, layout=(2, 1), figsize=(12, 6))
df.plot(x="Category", y="Values", kind="barh", subplots=True)
Q. Histograms and Density Plots
Histograms are basically used to visualize one-dimensional data. Using the plot method we can create histograms as well. Let us illustrate the use of histograms using the pandas plot method.
Ex:
import pandas as pd
import numpy as np
# Sample data
data = {
    "Scores": [65, 70, 85, 90, 95, 80]
}
df = pd.DataFrame(data)
# Plot a histogram
df.plot(y="Scores", kind="hist", bins=5, title="Histogram of Scores", legend=False, figsize=(8, 6))
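The same data can also be shown as a density plot; a minimal sketch (pandas relies on SciPy for the kernel density estimate, so SciPy must be installed):
import pandas as pd

df = pd.DataFrame({"Scores": [65, 70, 85, 90, 95, 80]})
# Density (KDE) plot of the same scores
df.plot(y="Scores", kind="density", title="Density Plot of Scores")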
Q. Scatter or Point Plots
To get the scatter plot of a dataframe, all we have to do is just call the plot() method, specifying some parameters:
kind='scatter', x='some_column', y='some_column', color='some_color'
Ex: In this example, the code creates a scatter plot using a DataFrame 'df' with math marks on the x-axis and physics marks on the y-axis.
import pandas as pd
import matplotlib.pyplot as plt

# the marks values below are illustrative; the original data is not shown
data_dict = {
    'name': ['p1', 'p2', 'p3', 'p4', 'p5', 'p6'],
    'age': [20, 20, 21, 20, 21, 20],
    'math_marks': [62, 74, 81, 55, 90, 68],
    'physics_marks': [60, 70, 85, 58, 88, 72],
    'chemistry_marks': [64, 71, 79, 61, 92, 66]
}
df = pd.DataFrame(data_dict)
# Scatter plot using pandas
ax = df.plot(kind='scatter', x='math_marks', y='physics_marks',
             color='red', title='Scatter Plot')
# Customizing plot elements
ax.set_xlabel("Math Marks")
ax.set_ylabel("Physics Marks")
plt.show()
Explanation: This code creates a Pandas DataFrame df with student data, including their names, ages and marks in Math, Physics and Chemistry. It then creates a scatter plot using the plot() method, where the math_marks are plotted on the x-axis and physics_marks on the y-axis.