Ocs353 Data Science Fundamentals Notes

The document provides comprehensive notes on Data Science Fundamentals, covering key concepts such as the data science process, types of data, and the importance of data preparation. It outlines the steps involved in data science, including defining research goals, retrieving data, and cleansing data to ensure accuracy. Additionally, it discusses the various facets of data and their applications in different sectors, emphasizing the significance of data science in enhancing decision-making and user experiences.

Uploaded by

viji
Copyright
© All Rights Reserved

OCS353 DATA Science Fundamentals Notes

Fundamentals of data science (Anna University)

Downloaded by Vijaya K
GNANAMANI COLLEGE OF TECHNOLOGY
(An Autonomous Institution)
|Affiliated to Anna University - Chennai, Approved by AICTE - New Delhi|
(Accredited by NBA & NAAC with "A" Grade)
|NH-7, A.K.SAMUTHIRAM, PACHAL (PO), NAMAKKAL – 637018|

(Regulation 2021)

OCS353 – DATA SCIENCE FUNDAMENTALS

ALL UNIT NOTES

Prepared By
Mr.K.VIJAYPRABAKARAN AP/CSE

OCS353 - DATA SCIENCE FUNDAMENTALS

UNIT I
INTRODUCTION
Data Science: Benefits and uses – facets of data - Data Science Process: Overview – Defining research goals –
Retrieving data – Data preparation - Exploratory Data analysis – Build the model– presenting findings and
building applications - Data Mining - Data Warehousing – Basic Statistical descriptions of Data.

Data
In computing, data is information that has been translated into a form that is efficient for movement or
processing.
Data Science
Data science is an evolutionary extension of statistics capable of dealing with the massive amounts of data
produced today. It adds methods from computer science to the repertoire of statistics.

1. BENEFITS AND USES OF DATA SCIENCE


Data science and big data are used almost everywhere in both commercial and noncommercial settings.

 Commercial companies in almost every industry use data science and big data to gain insights into
their customers, processes, staff, competition, and products.
 Many companies use data science to offer customers a better user experience, as well as to cross-sell,
up-sell, and personalize their offerings.
 Governmental organizations are also aware of data’s value. Many governmental organizations not only
rely on internal data scientists to discover valuable information, but also share their data with the
public.
 Nongovernmental organizations (NGOs) use it to raise money and defend their causes.
 Universities use data science in their research but also to enhance the study experience of their
students. The rise of massive open online courses (MOOCs) produces a lot of data, which allows
universities to study how this type of learning can complement traditional classes.

2. FACETS OF DATA
In data science and big data you’ll come across many different types of data, and each of them tends to
require different tools and techniques. The main categories of data are these:
 Structured
 Unstructured
 Natural language
 Machine-generated
 Graph-based
 Audio, video, and images
 Streaming
Let’s explore all these interesting data types.

2.1 Structured data

 Structured data is data that depends on a data model and resides in a fixed field within a record. As such,
it’s often easy to store structured data in tables within databases or Excel files.
 SQL, or Structured Query Language, is the preferred way to manage and query data that resides in
databases.
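
As a minimal sketch of what "fixed fields within a record" means in practice, the example below builds a small table with Python's built-in sqlite3 module and queries it with SQL. The table name, column names, and rows are hypothetical and only serve as illustration.

```python
import sqlite3

# Hypothetical in-memory table; the table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO customers (id, name, country) VALUES (?, ?, ?)",
    [(1, "Asha", "IN"), (2, "Bruno", "BR"), (3, "Chen", "IN")],
)

# Every record has the same fixed fields, so a declarative query can filter on them.
rows = conn.execute(
    "SELECT name FROM customers WHERE country = ? ORDER BY name", ("IN",)
).fetchall()
names = [name for (name,) in rows]
print(names)
conn.close()
```

Because every row follows the same data model, the query needs no custom parsing logic.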

2.2 Unstructured data

Unstructured data is data that isn’t easy to fit into a data model because the content is context-specific or
varying. One example of unstructured data is your regular email.

2.2.1 Natural language


 Natural language is a special type of unstructured data; it’s challenging to process because it requires
knowledge of specific data science techniques and linguistics.
 The natural language processing community has had success in entity recognition, topic recognition,
summarization, text completion, and sentiment analysis, but models trained in one domain don’t
generalize well to other domains.
 Even state-of-the-art techniques aren’t able to decipher the meaning of every piece of text.

2.3 Machine-generated data


 Machine-generated data is information that’s automatically created by a computer, process,
application, or other machine without human intervention.
 Machine-generated data is becoming a major data resource and will continue to do so.
 The analysis of machine data relies on highly scalable tools, due to its high volume and speed.
Examples of machine data are web server logs, call detail records, network event logs, and telemetry.


2.4 Graph-based or network data

 “Graph data” can be a confusing term because any data can be shown in a graph.
 Graph or network data is, in short, data that focuses on the relationship or adjacency of objects.
 The graph structures use nodes, edges, and properties to represent and store graphical data.
 Graph-based data is a natural way to represent social networks, and its structure allows you to
calculate specific metrics such as the influence of a person and the shortest path between two
people.
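
The two metrics mentioned above can be sketched on a tiny network. This is an illustration only: the names and friendships are made up, breadth-first search is one standard way to find a shortest path in an unweighted graph, and the number of direct connections is used as a very rough stand-in for influence.

```python
from collections import deque

# Hypothetical friendship network stored as an adjacency list (undirected).
friends = {
    "Ann": ["Bob", "Cara"],
    "Bob": ["Ann", "Dev"],
    "Cara": ["Ann", "Dev"],
    "Dev": ["Bob", "Cara", "Eli"],
    "Eli": ["Dev"],
}

def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path as a list of nodes, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph[node]:
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None

# Degree (number of direct connections) is one very simple proxy for influence.
degree = {person: len(neighbours) for person, neighbours in friends.items()}
most_connected = max(degree, key=degree.get)

path = shortest_path(friends, "Ann", "Eli")
print(path, most_connected)
```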

2.5 Audio, image, and video

 Audio, image, and video are data types that pose specific challenges to a data scientist.
 Tasks that are trivial for humans, such as recognizing objects in pictures, turn out to be challenging
for computers.
 MLBAM (Major League Baseball Advanced Media) announced in 2014 that it would increase video
capture to approximately 7 TB per game for the purpose of live, in-game analytics.
 A company called DeepMind succeeded at creating an algorithm that’s capable of learning
how to play video games.
 This algorithm takes the video screen as input and learns to interpret everything via a complex
process of deep learning.

2.6 Streaming data

 Streaming data flows into the system when an event happens instead of being loaded into a data store in
a batch.
 Examples are the “What’s trending” list on Twitter, live sporting or music events, and the stock market.
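
The difference from batch loading can be sketched with a Python generator that hands over one event at a time. The hashtags and the `event_stream` function are hypothetical stand-ins for a real live feed.

```python
from collections import Counter

def event_stream():
    """Stand-in for a live feed: yields one hypothetical hashtag event at a time."""
    for tag in ["#cricket", "#music", "#cricket", "#stocks", "#cricket", "#music"]:
        yield tag

trending = Counter()
for tag in event_stream():
    # Update the running tally as soon as each event arrives,
    # instead of waiting for a complete batch to be loaded.
    trending[tag] += 1

top_tag, top_count = trending.most_common(1)[0]
print(top_tag, top_count)
```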


3. DATA SCIENCE PROCESS

3.1 Overview of the data science process


The typical data science process consists of six steps through which you’ll iterate, as shown in figure

1. The first step of this process is setting a research goal. The main purpose here is making sure
all the stakeholders understand the what, how, and why of the project. In every serious project
this will result in a project charter.
2. The second phase is data retrieval. You want to have data available for analysis, so this step
includes finding suitable data and getting access to the data from the data owner. The result is
data in its raw form, which probably needs polishing and transformation before it becomes
usable.
3. Now that you have the raw data, it’s time to prepare it. This includes transforming the data from
a raw form into data that’s directly usable in your models. To achieve this, you’ll detect and
correct different kinds of errors in the data, combine data from different data sources, and
transform it. If you have successfully completed this step, you can progress to data
visualization and modeling.
4. The fourth step is data exploration. The goal of this step is to gain a deep understanding of the
data. You’ll look for patterns, correlations, and deviations based on visual and descriptive
techniques. The insights you gain from this phase will enable you to start modeling.
5. Finally, we get to model building (often referred to as “data modeling” throughout this book). It
is now that you attempt to gain the insights or make the predictions stated in your project
charter. Now is the time to bring out the heavy guns, but remember research has taught us that
often (but not always) a combination of simple models tends to outperform one complicated
model. If you’ve done this phase right, you’re almost done.
6. The last step of the data science process is presenting your results and automating the analysis,
if needed. One goal of a project is to change a process and/or make better decisions. You may
still need to convince the business that your findings will indeed change the business process
as expected. This is where you can shine in your influencer role. The importance of this step is
more apparent in projects on a strategic and tactical level. Certain projects require you to
perform the business process over and over again, so automating the project will save time.


4. DEFINING RESEARCH GOALS


A project starts by understanding the what, the why, and the how of your project. The outcome should be a
clear research goal, a good understanding of the context, well-defined deliverables, and a plan of action with
a timetable. This information is then best placed in a project charter.

Spend time understanding the goals and context of your research


 An essential outcome is the research goal that states the purpose of your assignment in a clear
and focused manner.
 Understanding the business goals and context is critical for project success.
 Continue asking questions and devising examples until you grasp the exact business
expectations, identify how your project fits in the bigger picture, appreciate how your research
is going to change the business, and understand how they’ll use your results.

Create a project charter


A project charter requires teamwork, and your input covers at least the following:
 A clear research goal
 The project mission and context
 How you’re going to perform your analysis
 What resources you expect to use
 Proof that it’s an achievable project, or proof of concepts
 Deliverables and a measure of success
 A timeline

5. RETRIEVING DATA
 The next step in data science is to retrieve the required data. Sometimes you need to go into the field and
design a data collection process yourself, but most of the time you won’t be involved in this step.
 Many companies will have already collected and stored the data for you, and what they don’t have can
often be bought from third parties.
 More and more organizations are making even high-quality data freely available for public and
commercial use.
 Data can be stored in many forms, ranging from simple text files to tables in a database. The objective
now is acquiring all the data you need.

5.1 Start with data stored within the company (Internal data)
 Most companies have a program for maintaining key data, so much of the cleaning work may
already be done. This data can be stored in official data repositories such as databases, data
marts, data warehouses, and data lakes maintained by a team of IT professionals.
 Data warehouses and data marts are home to preprocessed data, whereas data lakes contain data in its
natural or raw format.
 Finding data even within your own company can sometimes be a challenge. As companies
grow, their data becomes scattered around many places. Knowledge of the data may also be dispersed as people
change positions and leave the company.
 Getting access to data is another difficult task. Organizations understand the value and
sensitivity of data and often have policies in place so everyone has access to what they need
and nothing more.
 These policies translate into physical and digital barriers called Chinese walls. These “walls” are
mandatory and well-regulated for customer data in most countries.

5.2 External Data


 If data isn’t available inside your organization, look outside it. Companies
provide data so that you, in turn, can enrich their services and ecosystem. Such is the case with
Twitter, LinkedIn, and Facebook.
 More and more governments and organizations share their data for free with the world.
 A list of open data providers that should get you started.


6. DATA PREPARATION (CLEANSING, INTEGRATING, TRANSFORMING DATA)

Your model needs the data in a specific format, so data transformation will always come into play. It’s a good
habit to correct data errors as early on in the process as possible. However, this isn’t always possible in a
realistic setting, so you’ll need to take corrective actions in your program.
6.1 CLEANSING DATA
Data cleansing is a subprocess of the data science process that focuses on removing errors in your
data so your data becomes a true and consistent representation of the processes it originates from.
At least two types of errors exist:
 The first type is the interpretation error, such as when you take the value in your data for
granted, like saying that a person’s age is greater than 300 years.
 The second type of error points to inconsistencies between data sources or against your
company’s standardized values. An example of this class of errors is putting “Female” in one table
and “F” in another when they represent the same thing: that the person is female.
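
One way to remove this kind of inconsistency is to map every known spelling to a single standard code before the tables are compared. The sketch below assumes a hypothetical lookup table and the standard codes "F" and "M"; neither comes from the notes.

```python
# Hypothetical lookup that maps every known spelling to one standard code.
GENDER_CODES = {"female": "F", "f": "F", "male": "M", "m": "M"}

def standardize_gender(value):
    """Return the standard code, or None when the value is not recognised."""
    return GENDER_CODES.get(value.strip().lower())

table_a = ["Female", "Male", "Female"]
table_b = ["F", "M", "F"]

codes_a = [standardize_gender(v) for v in table_a]
codes_b = [standardize_gender(v) for v in table_b]
print(codes_a == codes_b)
```

Returning None for unknown spellings keeps unexpected values visible instead of silently guessing a code.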
Overview of common

Sometimes you’ll use more advanced methods, such as simple modeling, to find and identify data errors;
diagnostic plots can be especially insightful. For example, in figure we use a measure to identify data points
that seem out of place. We do a regression to get acquainted with the data and detect the influence of
individual observations on the regression line.


6.2 Data Entry Errors


 Data collection and data entry are error-prone processes. They often require human intervention,
and humans can make typos or lose concentration and so introduce errors into the chain.
 Data collected by machines or computers isn’t free from errors either. Some errors arise from human
sloppiness, whereas others are due to machine or hardware failure.
 When the variables you study don’t have many classes, you can detect data errors by
tabulating the data with counts.
 When you have a variable that can take only two values, “Good” and “Bad”, you can create a
frequency table and see if those are truly the only two values present. In the table, the values “Godo” and
“Bade” point out that something went wrong in at least 16 cases.

Most errors of this type are easy to fix with simple assignment statements and if-then-else
rules:

    x = "Godo"      # a mistyped entry
    if x == "Godo":
        x = "Good"
    elif x == "Bade":
        x = "Bad"
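
The frequency-table check described above can be sketched with collections.Counter. The column values below are hypothetical and much smaller than the table the notes refer to; the correction lookup is an assumption about which typos are already known.

```python
from collections import Counter

# Hypothetical column that should only contain "Good" or "Bad".
values = ["Good", "Bad", "Godo", "Good", "Bade", "Bad", "Good"]

# Frequency table: any unexpected key points at a data entry error.
counts = Counter(values)
unexpected = {v: n for v, n in counts.items() if v not in {"Good", "Bad"}}

# Correct the known typos with a lookup, leaving valid values untouched.
corrections = {"Godo": "Good", "Bade": "Bad"}
cleaned = [corrections.get(v, v) for v in values]
print(unexpected, Counter(cleaned))
```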
6.3 Redundant Whitespace
 Whitespaces tend to be hard to detect but cause errors like other redundant characters would.
 Redundant whitespace causes mismatches between strings such as “FR ” and “FR”, so that
observations that can’t be matched are dropped.
 If you know to watch out for them, fixing redundant whitespaces is luckily easy enough in most
programming languages. Most provide string functions that remove leading and
trailing whitespace. For instance, in Python you can use the strip() method to remove leading
and trailing spaces.
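
A small sketch of the mismatch and its fix, using hypothetical country codes and sales figures:

```python
# Hypothetical country codes; one key carries a trailing space.
sales = {"FR ": 120, "DE": 95}
country_names = {"FR": "France", "DE": "Germany"}

# Without cleaning, "FR " finds no match and the observation would be dropped.
unmatched = [code for code in sales if code not in country_names]

# str.strip() removes leading and trailing whitespace before matching.
cleaned_sales = {code.strip(): amount for code, amount in sales.items()}
still_unmatched = [code for code in cleaned_sales if code not in country_names]
print(unmatched, still_unmatched)
```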
6.4 Fixing Capital Letter Mismatches
Capital letter mismatches are common. Most programming languages make a distinction between
“Brazil” and “brazil”.
In this case you can solve the problem by applying a function that returns both strings in lowercase,
such as .lower() in Python. "Brazil".lower() == "brazil".lower() should result in True.

6.5 Impossible Values and Sanity Checks


Here you check the value against physically or theoretically impossible values such as people taller
than 3 meters or someone with an age of 299 years. Sanity checks can be directly expressed with
rules:
check = 0 <= age <= 120
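
Applied to a whole column, the same rule splits the values into plausible and suspect ones. The ages below are hypothetical, and `is_plausible_age` is a helper name introduced only for this sketch.

```python
# Hypothetical ages; the 0-120 range mirrors the rule above.
ages = [34, 299, -2, 58, 120]

def is_plausible_age(age):
    """Chained comparison: True only when 0 <= age <= 120."""
    return 0 <= age <= 120

valid = [a for a in ages if is_plausible_age(a)]
suspect = [a for a in ages if not is_plausible_age(a)]
print(valid, suspect)
```

Flagging suspect values rather than deleting them leaves room to check whether the value or the rule is wrong.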

6.6 Outliers
An outlier is an observation that seems to be distant from other observations or, more specifically, one
observation that follows a different logic or generative process than the other observations. The easiest
way to find outliers is to use a plot or a table with the minimum and maximum values.
The plot on the top shows no outliers, whereas the plot on the bottom shows possible outliers on the
upper side when a normal distribution is expected.
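
A minimum and maximum table is easy to produce in code. The sketch below also applies the common 1.5 × IQR fence rule, which is a technique swapped in here for illustration and is not described in the notes; the data values are hypothetical.

```python
from statistics import quantiles

# Hypothetical measurements with one suspiciously large value.
data = [12, 14, 15, 15, 16, 17, 18, 19, 95]

# The simplest check: look at the extremes.
lowest, highest = min(data), max(data)

# Assumed rule of thumb: flag points more than 1.5 * IQR outside the quartiles.
q1, _median, q3 = quantiles(data, n=4)
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in data if v < lower_fence or v > upper_fence]
print(lowest, highest, outliers)
```

A flagged point is only a candidate: whether it is an error or a genuine extreme observation still needs judgment.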

6.7 Dealing with Missing Values


Missing values aren’t necessarily wrong, but you still need to handle them separately; certain modeling
techniques can’t handle missing values. They might be an indicator that something went wrong in your
data collection or that an error happened in the ETL process. Common techniques data scientists use
are listed in table
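
Two widely used options, omitting the observation and imputing a static value such as the mean, can be sketched as follows. The readings are hypothetical, and None is assumed to mark a missing value.

```python
from statistics import mean

# Hypothetical measurements; None marks a missing value.
readings = [4.0, None, 5.0, 6.0, None]

# Option 1: omit the observations with missing values.
observed = [r for r in readings if r is not None]

# Option 2: impute a static value, here the mean of the observed values.
fill = mean(observed)
imputed = [fill if r is None else r for r in readings]
print(observed, imputed)
```

Omitting loses information from the other columns of the dropped rows, while imputing can distort the distribution, so the choice depends on the model and on how much data is missing.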

6.2 INTEGRATING DATA

Your data comes from several different places, and in this substep we focus on integrating these different
sources. Data varies in size, type, and structure, ranging from databases and Excel files to text documents.

7.1 The Different Ways of Combining Data


You can perform two operations to combine information from different data sets.
 Joining
 Appending or stacking

7.1.1 Joining Tables


 Joining tables allows you to combine the information of one observation found in one table with
the information that you find in another table. The focus is on enriching a single observation.
 Let’s say that the first table contains information about the purchases of a customer and the
other table contains information about the region where your customer lives.
 Joining the tables allows you to combine the information so that you can use it for your model,
as shown in figure


Figure. Joining two tables on the item and region keys


To join tables, you use variables that represent the same object in both tables, such as a date, a country name,
or a Social Security number. These common fields are known as keys. When these keys also uniquely define
the records in the table they are called primary keys.
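
A join on a shared key can be sketched with plain Python dictionaries. The tables, the column names, and the choice of "client" as the key are hypothetical; in practice a database or a data frame library would usually do this work.

```python
# Hypothetical tables as lists of dicts; "client" is the shared key.
purchases = [
    {"client": "John Doe", "item": "Coca-Cola", "month": "January"},
    {"client": "Jackie Qi", "item": "Pepsi-Cola", "month": "January"},
]
regions = [
    {"client": "John Doe", "region": "NY"},
    {"client": "Jackie Qi", "region": "NC"},
]

# Index the lookup table by its key, then enrich each purchase (an inner join).
region_by_client = {row["client"]: row for row in regions}
joined = [
    {**purchase, **region_by_client[purchase["client"]]}
    for purchase in purchases
    if purchase["client"] in region_by_client
]
print(joined)
```

Each purchase keeps its own columns and gains the region column, which is the "enriching a single observation" idea described above.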

7.1.2 Appending Tables


 Appending or stacking tables is effectively adding observations from one table to another table.
 One table contains the observations from the month January and the second table contains
observations from the month February. The result of appending these tables is a larger one with
the observations from January as well as February.

Figure. Appending data from tables is a common operation but requires an equal
structure in the tables being appended.
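
Appending can be sketched as list concatenation guarded by a structure check. The rows and the `columns` helper are hypothetical.

```python
# Hypothetical monthly tables with identical columns.
january = [{"client": "John Doe", "item": "Coca-Cola", "month": "January"}]
february = [{"client": "Jackie Qi", "item": "Pepsi-Cola", "month": "February"}]

def columns(table):
    """Set of all column names used in a table."""
    return {key for row in table for key in row}

# Appending only makes sense when both tables share the same structure.
if columns(january) != columns(february):
    raise ValueError("tables must have the same structure to be appended")

combined = january + february
print(len(combined))
```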

7.2 TRANSFORMING DATA

Certain models require their data to be in a certain shape, so you transform your data into a form that is suitable
for data modeling.

Relationships between an input variable and an output variable aren’t always linear. Take, for instance, a
relationship of the form $y = a e^{bx}$. Taking the log of the dependent variable $y$ turns this into a relationship
that is linear in $x$, which simplifies the estimation problem dramatically. Other times
you might want to combine two variables into a new variable.
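
A short derivation makes the transformation concrete. Taking natural logs of both sides of the exponential form gives a straight line in x. The symbols \beta_0 and \beta_1 for the fitted intercept and slope are notation assumed here, not taken from the notes.

```latex
\begin{aligned}
y &= a\,e^{bx} \\
\ln y &= \ln a + b\,x \\
\ln y &= \beta_0 + \beta_1 x
  \quad\Longrightarrow\quad b = \beta_1,\; a = e^{\beta_0}
\end{aligned}
```

Fitting an ordinary straight line to the pairs (x, ln y) therefore yields estimates of both parameters, provided every y is positive.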


Reducing the Number of Variables


 Having too many variables in your model makes the model difficult to handle, and certain techniques
don’t perform well when you overload them with too many input variables. For instance, all the
techniques based on a Euclidean distance perform well only up to 10 variables.
 Data scientists use special methods to reduce the number of variables but retain the maximum
amount of data.

The figure shows how reducing the number of variables makes it easier to understand the key values. It also
shows how two variables account for 50.6% of the variation within the data set (component1 = 27.8% +
component2 = 22.8%). These variables, called “component1” and “component2,” are both combinations of
the original variables. They’re the principal components of the underlying data structure.
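
The idea of "share of variation per component" can be sketched for the smallest possible case, two variables, where the eigenvalues of the covariance matrix have a closed form. The data are hypothetical and unrelated to the percentages quoted above; a real analysis with many variables would normally rely on a numerical library.

```python
from math import sqrt

# Hypothetical, positively correlated pair of variables.
x = [2.0, 4.0, 6.0, 8.0, 10.0]
y = [1.0, 3.0, 2.0, 5.0, 4.0]
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Sample covariance matrix [[var_x, cov_xy], [cov_xy, var_y]].
var_x = sum((v - mean_x) ** 2 for v in x) / (n - 1)
var_y = sum((v - mean_y) ** 2 for v in y) / (n - 1)
cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

# Eigenvalues of a symmetric 2x2 matrix in closed form; each eigenvalue
# is the variance captured by one principal component.
half_trace = (var_x + var_y) / 2
spread = sqrt(((var_x - var_y) / 2) ** 2 + cov_xy ** 2)
eigenvalues = [half_trace + spread, half_trace - spread]

# Share of the total variation explained by each component.
explained = [ev / sum(eigenvalues) for ev in eigenvalues]
print(explained)
```

Here the first component alone carries most of the variation, which is the situation in which dropping the second component loses little information.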

Turning Variables into Dummies


 Dummy variables can only take two values: true (1) or false (0). They’re used to indicate
the presence or absence of a categorical effect that may explain the observation.
 In this case you’ll make separate columns for the classes stored in one variable and indicate it
with 1 if the class is present and 0 otherwise.
 An example is turning one column named Weekdays into the columns Monday through Sunday.
You use an indicator to show if the observation was on a Monday; you put 1 on Monday and 0
elsewhere.
 Turning variables into dummies is a technique that’s used in modeling and is popular with,
but not exclusive to, economists.

Figure. Turning variables into dummies is a data transformation that breaks a variable that has
multiple classes into multiple variables, each having only two possible values: 0 or 1
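
The weekday example can be sketched in a few lines. The observations and the `to_dummies` helper are hypothetical.

```python
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Hypothetical observations with a single categorical column.
observations = [{"weekday": "Monday"}, {"weekday": "Sunday"}, {"weekday": "Monday"}]

def to_dummies(row):
    """One 0/1 column per weekday; exactly one of them is 1."""
    return {day: int(row["weekday"] == day) for day in WEEKDAYS}

dummies = [to_dummies(row) for row in observations]
print(dummies[0])
```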

8. EXPLORATORY DATA ANALYSIS

During exploratory data analysis you take a deep dive into the data (see figure below). Information
becomes much easier to grasp when shown in a picture, therefore you mainly use graphical techniques
to gain an understanding of your data and the interactions between variables.

The goal isn’t to cleanse the data, but it’s common that you’ll still discover anomalies you missed before,
forcing you to take a step back and fix them.


The visualization techniques you use in this phase range from simple line graphs or histograms, as
shown in the figure below, to more complex diagrams such as Sankey and network graphs.
 Sometimes it’s useful to compose a composite graph from simple graphs to get even more insight
into the data. Other times the graphs can be animated or made interactive to make it easier and,
let’s admit it, way more fun.

 The techniques we described in this phase are mainly visual, but in practice they’re certainly not
limited to visualization techniques. Tabulation, clustering, and other modeling techniques can
also be a part of exploratory analysis. Even building simple models can be a part of this step.

9. BUILD THE MODELS


With clean data in place and a good understanding of the content, you’re ready to build models with the
goal of making better predictions, classifying objects, or gaining an understanding of the system that
you’re modeling.

This phase is much more focused than the exploratory analysis step, because you know what you’re looking
for and what you want the outcome to be.

Building a model is an iterative process. The way you build your model depends on whether you go with
classic statistics or the somewhat more recent machine learning school, and the type of technique you
want to use. Either way, most models consist of the following main steps:

 Selection of a modeling technique and variables to enter in the model


 Execution of the model
 Diagnosis and model comparison


Model and variable selection


You’ll need to select the variables you want to include in your model and a modeling technique. You’ll
need to consider model performance and whether your project meets all the requirements to use your
model, as well as other factors:
 Must the model be moved to a production environment and, if so, would it be easy to implement?
 How difficult is the maintenance on the model: how long will it remain relevant if left untouched?
 Does the model need to be easy to explain?

Model execution
 Once you’ve chosen a model you’ll need to implement it in code.
 Most programming languages, such as Python, already have libraries such as StatsModels or
Scikit-learn. These packages implement several of the most popular techniques.
 Coding a model is a nontrivial task in most cases, so having these libraries available can speed up
the process. As you can see in the following code, it’s fairly easy to use linear regression with
StatsModels or Scikit-learn.
 Doing this yourself would require much more effort even for the simple techniques. The
following listing shows the execution of a linear prediction model.
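
The original listing is not reproduced in these notes. As a stand-in, here is a minimal sketch of a linear prediction model with one predictor, written in plain Python so that it runs without extra packages. It is not the StatsModels or Scikit-learn code the text refers to, and the data points are hypothetical.

```python
# Hypothetical data: y depends roughly linearly on x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Ordinary least squares for one predictor: slope = S_xy / S_xx.
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

def predict(value):
    """Prediction from the fitted line."""
    return intercept + slope * value

print(round(slope, 2), round(intercept, 2), round(predict(6.0), 2))
```

With several predictors the same idea requires matrix algebra, which is where the libraries mentioned above save most of the effort.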

Model diagnostics and model comparison


 You’ll be building multiple models from which you then choose the best one based on multiple
criteria. Working with a holdout sample helps you pick the best-performing model.
 A holdout sample is a part of the data you leave out of the model building so it can be used to
evaluate the model afterward.
 The principle here is simple: the model should work on unseen data. You use only a fraction of
your data to estimate the model and the other part, the holdout sample, is kept out of the
equation.
 The model is then unleashed on the unseen data and error measures are calculated to evaluate it.
 Multiple error measures are available, and in figure we show the general idea on comparing
models. The error measure used in the example is the mean square error.

Formula for mean square error: $\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{Y}_i - Y_i)^2$

Mean square error is a simple measure: check for every prediction how far it was from the truth, square
this error, and average the squared errors over all predictions.


The figure above compares the performance of two models to predict the order size from the price. The
first model is size = 3 * price and the second model is size = 10.

 To estimate the models, we use 800 randomly chosen observations out of 1,000 (or 80%),
without showing the other 20% of data to the model.
 Once the model is trained, we predict the values for the other 20% of the observations, for
which we already know the true value, and calculate the model error with an error
measure.
 Then we choose the model with the lowest error. In this example we chose model 1 because it
has the lowest total error.
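
The comparison above can be sketched on synthetic data. Everything about the data set is an assumption made for illustration: the price range, the noise level, and the fact that the true relationship is close to size = 3 * price. The two candidate models are taken as given, so the 800 training rows are set aside but not used for fitting here.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical data set of 1,000 orders: size is about 3 * price plus noise.
prices = [random.uniform(1, 10) for _ in range(1000)]
data = [(p, 3 * p + random.gauss(0, 1)) for p in prices]

# Keep 800 randomly chosen observations for estimation, hold out the other 200.
random.shuffle(data)
train, holdout = data[:800], data[800:]

def mean_square_error(model, rows):
    """Average of the squared differences between prediction and truth."""
    return sum((model(price) - size) ** 2 for price, size in rows) / len(rows)

def model_1(price):
    return 3 * price   # size = 3 * price

def model_2(price):
    return 10          # size = 10, whatever the price

# Both models are scored only on data they have never seen.
errors = {
    "model 1": mean_square_error(model_1, holdout),
    "model 2": mean_square_error(model_2, holdout),
}
best = min(errors, key=errors.get)
print(errors, best)
```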

Many models make strong assumptions, such as independence of the inputs, and you have to verify that
these assumptions are indeed met. This is called model diagnostics.

10. PRESENTING FINDINGS AND BUILDING APPLICATIONS

 Sometimes people get so excited about your work that you’ll need to repeat it over and over again
because they value the predictions of your models or the insights that you produced.

 This doesn’t always mean that you have to redo all of your analysis all the time. Sometimes it’s
sufficient that you implement only the model scoring; other times you might build an application
that automatically updates reports, Excel spreadsheets, or PowerPoint presentations. The last
stage of the data science process is where your soft skills will be most useful, and yes, they’re
extremely important.

11. DATA MINING

Data mining is the process of discovering actionable information from large sets of data. Data mining
uses mathematical analysis to derive patterns and trends that exist in data. Typically, these patterns
cannot be discovered by traditional data exploration because the relationships are too complex or
because there is too much data.
These patterns and trends can be collected and defined as a data mining model. Mining models can be
applied to specific scenarios, such as:

 Forecasting: Estimating sales, predicting server loads or server downtime
 Risk and probability: Choosing the best customers for targeted mailings, determining the
probable break-even point for risk scenarios, assigning probabilities to diagnoses or other
outcomes
 Recommendations: Determining which products are likely to be sold together, generating
recommendations
 Finding sequences: Analyzing customer selections in a shopping cart, predicting next likely events
 Grouping: Separating customers or events into clusters of related items, analyzing and
predicting affinities

Building a mining model is part of a larger process that includes everything from asking questions about
the data and creating a model to answer those questions, to deploying the model into a working
environment. This process can be defined by using the following six basic steps:

1. Defining the Problem


2. Preparing Data
3. Exploring Data
4. Building Models
5. Exploring and Validating Models
6. Deploying and Updating Models

The following diagram describes the relationships between each step in the process, and the
technologies in Microsoft SQL Server that you can use to complete each step.

Defining the Problem

The first step in the data mining process is to clearly define the problem, and consider ways that data
can be utilized to provide an answer to the problem.

This step includes analyzing business requirements, defining the scope of the problem, defining the
metrics by which the model will be evaluated, and defining specific objectives for the data mining
project. These tasks translate into questions such as the following:

 What are you looking for? What types of relationships are you trying to find?
 Does the problem you are trying to solve reflect the policies or processes of the business?
 Do you want to make predictions from the data mining model, or just look for interesting
patterns and associations?
 Which outcome or attribute do you want to try to predict?
 What kind of data do you have and what kind of information is in each column? If there are
multiple tables, how are the tables related? Do you need to perform any cleansing, aggregation, or
processing to make the data usable?
 How is the data distributed? Is the data seasonal? Does the data accurately represent the
processes of the business?


Preparing Data

 The second step in the data mining process is to consolidate and clean the data that was
identified in the Defining the Problem step.
 Data can be scattered across a company and stored in different formats, or may contain
inconsistencies such as incorrect or missing entries.
 Data cleaning is not just about removing bad data or interpolating missing values, but about
finding hidden correlations in the data, identifying sources of data that are the most accurate, and
determining which columns are the most appropriate for use in analysis.

Exploring Data

Exploration techniques include calculating the minimum and maximum values, calculating mean and
standard deviations, and looking at the distribution of the data. For example, you might determine by
reviewing the maximum, minimum, and mean values that the data is not representative of your
customers or business processes, and that you therefore must obtain more balanced data or review the
assumptions that are the basis for your expectations. Standard deviations and other distribution values
can provide useful information about the stability and accuracy of the results.
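
The summary statistics named above are available in Python's standard library. A minimal sketch on hypothetical order values:

```python
from statistics import mean, stdev

# Hypothetical order values.
order_values = [20, 22, 19, 21, 23, 20, 22]

summary = {
    "min": min(order_values),
    "max": max(order_values),
    "mean": mean(order_values),
    "stdev": stdev(order_values),  # sample standard deviation
}
print(summary)
```

Comparing such a summary with what you expect from the business is a quick way to spot data that is not representative.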

Building Models

The mining structure is linked to the source of data, but does not actually contain any data until you
process it. When you process the mining structure, SQL Server Analysis Services generates aggregates
and other statistical information that can be used for analysis. This information can be used by any
mining model that is based on the structure.

Exploring and Validating Models

Before you deploy a model into a production environment, you will want to test how well the model
performs. Also, when you build a model, you typically create multiple models with different
configurations and test all models to see which yields the best results for your problem and your data.
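The idea of building several configurations and comparing them on held-out data can be sketched with NumPy alone. This is not the SQL Server Analysis Services workflow described in these notes; it is a generic illustration in which the synthetic data, the candidate polynomial degrees, and the even/odd split are all assumptions.

```python
import numpy as np

# Synthetic data: a straight line plus a little noise (seeded for repeatability).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.size)

# Hold out every second point for testing.
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

# Fit one model per configuration and record its error on the held-out data.
candidate_degrees = [1, 3, 5]
test_errors = {}
for degree in candidate_degrees:
    coeffs = np.polyfit(x_train, y_train, degree)
    predictions = np.polyval(coeffs, x_test)
    test_errors[degree] = float(np.mean((predictions - y_test) ** 2))

# Keep the configuration with the lowest held-out error.
best_degree = min(test_errors, key=test_errors.get)
print(test_errors)
print("best degree:", best_degree)
```

Which degree wins depends on the data and the noise, which is exactly why the comparison is done on data the models did not see during fitting.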

Deploying and Updating Models

After the mining models exist in a production environment, you can perform many tasks, depending on
your needs. The following are some of the tasks you can perform:

 Use the models to create predictions, which you can then use to make business decisions.
 Create content queries to retrieve statistics, rules, or formulas from the model.
 Embed data mining functionality directly into an application. You can include Analysis
Management Objects (AMO), which contains a set of objects that your application can use to
create, alter, process, and delete mining structures and mining models.
 Use Integration Services to create a package in which a mining model is used to intelligently
separate incoming data into multiple tables.
 Create a report that lets users directly query against an existing mining model.
 Update the models after review and analysis. Any update requires that you reprocess the models.
 Update the models dynamically as more data comes into the organization. Making constant
changes to improve the effectiveness of the solution should be part of the deployment strategy.


12. DATA WAREHOUSING


Data warehousing is the process of constructing and using a data warehouse. A data warehouse is
constructed by integrating data from multiple heterogeneous sources that support analytical reporting,
structured and/or ad hoc queries, and decision making. Data warehousing involves data cleaning, data
integration, and data consolidation.

Characteristics of data warehouse


The main characteristics of a data warehouse are as follows:

 Subject-Oriented
A data warehouse is subject-oriented since it provides topic-wise information rather than the
overall processes of a business. Such subjects may be sales, promotion, inventory, etc
 Integrated
A data warehouse is developed by integrating data from varied sources into a consistent
format. The data must be stored in the warehouse in a consistent and universally acceptable
manner in terms of naming, format, and coding. This facilitates effective data analysis.
 Non-Volatile
Data once entered into a data warehouse must remain unchanged. All data is read-only.
Previous data is not erased when current data is entered. This helps you to analyze what has
happened and when.
 Time-Variant
The data stored in a data warehouse is documented with an element of time, either explicitly
or implicitly. An example of time variance in Data Warehouse is exhibited in the Primary Key,
which must have an element of time like the day, week, or month.

Database vs. Data Warehouse

Although a data warehouse and a traditional database share some similarities, they are not the same
thing. The main difference is that in a database, data is collected for multiple transactional purposes.
However, in a data warehouse, data is collected on an extensive scale to perform analytics. Databases
provide real-time data, while warehouses store data to be accessed for big analytical queries.
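The contrast between the two access patterns can be illustrated with Python's built-in sqlite3 module. This is only a sketch: a single in-memory SQLite table stands in for both systems, and the table name, columns, and rows are made up for the example. A real data warehouse would be a separate system loaded from many sources.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL, sale_date TEXT)")

# Transactional pattern: many small writes, one per business event.
conn.execute("INSERT INTO sales VALUES ('North', 250.0, '2024-01-05')")
conn.execute("INSERT INTO sales VALUES ('South', 150.0, '2024-01-06')")
conn.execute("INSERT INTO sales VALUES ('North', 400.0, '2024-02-10')")
conn.execute("INSERT INTO sales VALUES ('South', 100.0, '2024-02-11')")

# Analytical pattern: one large read that summarizes accumulated history.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)

conn.close()
```

The inserts resemble day-to-day database traffic, while the grouped query is the kind of question a warehouse is built to answer over much larger volumes.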

Data Warehouse Architecture


Usually, data warehouse architecture comprises a three-tier structure.

Bottom Tier
The bottom tier or data warehouse server usually represents a relational database system. Back-end
tools are used to cleanse, transform and feed data into this layer.

Middle Tier
 The middle tier represents an OLAP server that can be implemented in two ways.
 The ROLAP or Relational OLAP model is an extended relational database management
system that maps operations on multidimensional data to standard relational operations.
 The MOLAP or multidimensional OLAP directly acts on multidimensional data and
operations.

Top Tier
This is the front-end client interface that gets data out from the data warehouse. It holds various tools
like query tools, analysis tools, reporting tools, and data mining tools.

How Data Warehouse Works


Data Warehousing integrates data and information collected from various sources into one
comprehensive database. For example, a data warehouse might combine customer information from an
organization’s point-of-sale systems, its mailing lists, website, and comment cards. It might also
incorporate confidential information about employees, salary information, etc. Businesses use such
components of data warehouse to analyze customers.

Data mining is one of the features of a data warehouse that involves looking for meaningful data patterns
in vast volumes of data and devising innovative strategies for increased sales and profits.

13. TYPES OF DATA WAREHOUSE


There are three main types of data warehouse.

1. Enterprise Data Warehouse (EDW)


This type of warehouse serves as a key or central database that facilitates decision-support services
throughout the enterprise. The advantage to this type of warehouse is that it provides access to cross-
organizational information, offers a unified approach to data representation, and allows running complex
queries.

2. Operational Data Store (ODS)


This type of data warehouse refreshes in real-time. It is often preferred for routine activities like storing
employee records. It is required when data warehouse systems do not support reporting needs of the
business.

3. Data Mart
A data mart is a subset of a data warehouse built to maintain a particular department, region, or
business unit. Every department of a business has a central repository or data mart to store data. The
data from the data mart is stored in the ODS periodically. The ODS then sends the data to the EDW,
where it is stored and used.

Summary
In this chapter you learned the data science process consists of six steps:

 Setting the research goal - Defining the what, the why, and the how of your project in a
project charter.

 Retrieving data - Finding and getting access to data needed in your project. This data is either
found within the company or retrieved from a third party.

 Data preparation - Checking and remediating data errors, enriching the data with data from
other data sources, and transforming it into a suitable format for your models.

 Data exploration - Diving deeper into your data using descriptive statistics and visual techniques.

 Data modeling - Using machine learning and statistical techniques to achieve your project goal.

 Presentation and automation - Presenting your results to the stakeholders and industrializing
your analysis process for repetitive reuse and integration with other tools.

*************************************************************************************************************


UNIT II
DATA MANIPULATION
Python Shell - Jupyter Notebook - IPython Magic Commands - NumPy Arrays - Universal Functions - Aggregations
- Computation on Arrays - Fancy Indexing - Sorting arrays - Structured data - Data manipulation with
Pandas - Data Indexing and Selection - Handling missing data - Hierarchical indexing - Combining datasets
- Aggregation and Grouping - String operations - Working with time series - High performance.

1. PYTHON SHELL

A Python shell, also known as an interactive Python interpreter or REPL (Read-Eval-Print Loop), is a
command-line interface where you can interactively execute Python code, experiment with ideas, and
see immediate results.
In a Python shell:
1. You can enter Python statements or expressions.
2. The shell evaluates the code and displays the results.
3. You can access and manipulate variables, functions, and modules.
4. The shell provides features like auto-completion, syntax highlighting, and error handling.
Some popular Python shells include:
1. IDLE: A basic shell that comes bundled with Python.
2. IPython: An enhanced shell with features like syntax highlighting, auto-completion, and
visualization tools.
3. Jupyter Notebook: A web-based interactive environment that combines a shell with notebook-
style documentation and visualization.
4. PyCharm: An integrated development environment (IDE) that includes a Python shell.
5. Python Interpreter: The default shell that comes with Python, accessible from the command line
or terminal.
Python shells are useful for:
 Quick experimentation and prototyping
 Learning and exploring Python syntax and libraries
 Debugging and testing code snippets
 Interactive data analysis and visualization

To access a Python shell, you can:


 Open a terminal or command prompt and type python or python3
 Launch IDLE, IPython, or Jupyter Notebook from your application menu
 Use an IDE like PyCharm or Visual Studio Code with a built-in Python shell


2. JUPYTER NOTEBOOK

The Jupyter Notebook is an open-source web application that allows you to create and share
documents that contain live code, equations, visualizations, and narrative text. Uses include data
cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine
learning, and much more.
Jupyter has support for over 40 different programming languages and Python is one of them. Python
is a requirement (Python 3.3 or greater, or Python 2.7) for installing the Jupyter Notebook itself.
2.1 Installation
Install Python and Jupyter using the Anaconda Distribution, which includes Python, the Jupyter
Notebook, and other commonly used packages for scientific computing and data science. You can
download Anaconda’s latest Python3 version. Now, install the downloaded version of Anaconda.
Installing Jupyter Notebook using pip:
python3 -m pip install --upgrade pip
python3 -m pip install jupyter

2.2 Starting Jupyter Notebook


To start the Jupyter Notebook, type the following command in the terminal or command prompt:
jupyter notebook

After the notebook is opened, you’ll see the Notebook Dashboard, which will show a list of the
notebooks, files, and subdirectories in the directory where the notebook server was started. Most of the
time, you will wish to start a notebook server in the highest level directory containing notebooks.
Often this will be your home directory.

[Figure: the Notebook Dashboard web page]

2.3 Hello World in Jupyter Notebook


After successfully installing and creating a notebook in Jupyter Notebook, let’s see how to write code
in it. Jupyter notebook provides a cell for writing code in it. The type of code depends on the type of
notebook you created. For example, if you created a Python3 notebook then you can write Python3
code in the cell. Now, let’s add the following code –
print("Hello World")

2.4 Advantages of Jupyter Notebook


1. All in one place: Jupyter Notebook is an open-source, web-based interactive environment
that combines code, text, images, videos, mathematical equations, plots, maps, graphical
user interfaces, and widgets in a single document.
2. Easy to convert: Jupyter Notebook allows users to convert notebooks into other formats
such as HTML and PDF. Online tools such as nbviewer can also render a publicly available
notebook directly in the browser.
3. Easy to share: Jupyter Notebooks are saved as structured text files (JSON format), which
makes them easily shareable.
4. Language independent: Jupyter Notebook is language- and platform-independent because a
notebook is stored in JSON (JavaScript Object Notation), a language-independent, text-based
file format.
5. Interactive code: Jupyter Notebook can use the ipywidgets package, which provides many common
user-interface controls for exploring code and data interactively.
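To make the "structured text file" point concrete, the sketch below builds a minimal notebook-like structure and round-trips it through JSON with the standard json module. The layout follows the general shape of the version 4 notebook format, but it is an illustrative assumption: a notebook saved by Jupyter carries additional metadata.

```python
import json

# A minimal notebook-like structure (illustrative; real notebooks carry more metadata).
notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Demo"]},
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ['print("Hello World")'],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 4,
}

# Serialize to text, as it would be stored in an .ipynb file, then read it back.
text = json.dumps(notebook, indent=1)
restored = json.loads(text)

# Because the file is plain JSON, any language with a JSON parser can inspect it.
code_cells = [cell for cell in restored["cells"] if cell["cell_type"] == "code"]
print(code_cells[0]["source"][0])
```

Since the file is ordinary text, it can be shared by email, stored in version control, or processed by other tools without Jupyter being installed.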


2.5 Disadvantages of Jupyter Notebook


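Jupyter Notebook has the following disadvantages:
 It is very hard to test long asynchronous tasks.
 It offers less security, since a notebook server runs whatever code it is given.
 Cells can be run out of order, which can leave hidden state and make results hard to reproduce.
 The classic Jupyter Notebook offers little IDE integration, linting, or code-style correction by default.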
3. IPYTHON MAGIC COMMANDS

IPython provides several "magic" commands that make it easier to perform common tasks in a
Python notebook. Line magics are prefixed with % and apply to a single line; cell magics are
prefixed with %% and apply to a whole cell. The table below lists some commonly used magic
commands by category:

S.NO  CATEGORY  COMMANDS

1. Shell Commands
• %ls: List files in the current directory
• %cd: Change directory
• %pwd: Print working directory
• %mkdir: Make a directory
• %rm: Remove a file or directory
(%cd and %pwd are built-in magics; %ls, %mkdir, and %rm are aliases for system commands and are
typically available on Unix-like systems.)

2. System Commands
• !command: Execute a shell command (a shell escape rather than a % magic)
• %sx: Execute a shell command and capture its output as a list of lines
• %sc: Execute a shell command and capture its output in a variable (deprecated in favor of
var = !command)

3. Input/Output
• %%capture: Capture the output of a cell (cell magic)
• display(): Display an object (a function from IPython.display, not a magic)
• %%html: Render the cell as HTML content
• %%latex: Render the cell as LaTeX content
• %%markdown: Render the cell as Markdown content
• %%svg: Render the cell as SVG content

4. Code Execution
• %run: Run a Python script
• %load: Load the code of a Python script into the current cell
• %paste: Paste and execute code from the clipboard (terminal IPython)
• %cpaste: Paste and execute a multi-line block of code entered at a prompt (terminal IPython)

5. Debugging
• %debug: Enter the debugger after an exception occurs
• %pdb: Turn automatic entry into the debugger on exceptions on or off
• Breakpoints and stepping are debugger commands (for example b and s at the debugger prompt),
not magics

6. Timing and Profiling
• %time: Time a single execution of a statement or expression
• %timeit: Time the execution of a statement or expression over multiple runs
• %prun: Profile the execution of a statement or expression
• %lprun: Profile the execution of a statement or expression line by line (requires the
line_profiler extension)

7. Output Formatting
• %precision: Set the precision for floating-point numbers
• %pprint: Toggle pretty-printing on or off

8. Environment Management
• %env: List, get, or set environment variables
• %store: Store a variable in the IPython database

9. Extension Management
• %load_ext: Load an IPython extension
• %unload_ext: Unload an IPython extension

10. Help and Documentation
• name?: Show help for a magic command or function (for example %timeit?)
• %quickref: Show a quick reference guide for IPython

11. Memory Usage
• %memit: Measure memory usage of a statement or expression (requires the memory_profiler extension)
• %mprun: Profile memory usage line by line (requires the memory_profiler extension)

12. Parallel Computing
• %px: Run a command in parallel on ipyparallel engines (requires the ipyparallel package)
• %pxresult: Show the result of the last parallel command

13. Other
• Installing and uninstalling IPython or Jupyter notebook extensions is done from the command
line (for example with pip) rather than with magics; the older %install_ext magic has been
removed from IPython. Once installed, an extension is loaded with %load_ext.
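Magic commands only work inside IPython or Jupyter, so they cannot be shown as a plain Python script. As a rough point of comparison for the timing magics, the standard-library timeit module offers a similar kind of measurement in ordinary Python. The function being timed here is a made-up example.

```python
import timeit

def square_all(n):
    """Return the squares of 0..n-1 (a small workload to time)."""
    return [i * i for i in range(n)]

# Run the call 1000 times and measure the total elapsed time in seconds.
runs = 1000
elapsed = timeit.timeit(lambda: square_all(100), number=runs)
per_call = elapsed / runs
print(f"average time per call: {per_call:.2e} s")
```

In a notebook, the comparable one-liner would be `%timeit square_all(100)`, which also chooses the number of repetitions automatically.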

4. NUMPY ARRAYS

NumPy is a powerful library for numerical computing in Python. It provides support for arrays,
which are more efficient than Python lists for numerical operations. Here are some basic and
advanced operations you can perform with NumPy arrays.
Creating NumPy Arrays
 numpy.array(): Create an array from a Python list or tuple
 numpy.zeros(): Create an array filled with zeros
 numpy.ones(): Create an array filled with ones
 numpy.random.rand(): Create an array with random values
NumPy Array Properties
 shape: The number of dimensions and size of each dimension
 dtype: The data type of the array elements
 size: The total number of elements in the array
Indexing and Slicing
 arr[index]: Access a single element
 arr[start:stop:step]: Access a slice of elements
 arr[start:stop]: Access a slice of elements with default step size 1
Basic Operations
 arr + arr: Element-wise addition
 arr - arr: Element-wise subtraction
 arr * arr: Element-wise multiplication
 arr / arr: Element-wise division
Advanced Operations
 numpy.dot(): Compute the dot product of two arrays
 numpy.cross(): Compute the cross product of two arrays


 numpy.inner(): Compute the inner product of two arrays


 numpy.outer(): Compute the outer product of two arrays
Array Functions
 numpy.sum(): Compute the sum of all elements in an array
 numpy.mean(): Compute the mean of all elements in an array
 numpy.median(): Compute the median of all elements in an array
 numpy.std(): Compute the standard deviation of all elements in an array
 numpy.var(): Compute the variance of all elements in an array
Array Comparison
 numpy.equal(): Test element-wise whether two arrays are equal
 numpy.not_equal(): Test element-wise whether two arrays differ
 numpy.greater(): Test element-wise whether the first array is greater than the second
 numpy.less(): Test element-wise whether the first array is less than the second
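The product functions listed under Advanced Operations are easy to confuse, so here is a small sketch using two made-up three-element vectors. For one-dimensional inputs, the dot and inner products give the same scalar.

```python
import numpy as np

u = np.array([1, 2, 3])
v = np.array([4, 5, 6])

dot = np.dot(u, v)      # 1*4 + 2*5 + 3*6 = 32
inner = np.inner(u, v)  # same as dot for 1-D arrays: 32
cross = np.cross(u, v)  # [2*6 - 3*5, 3*4 - 1*6, 1*5 - 2*4] = [-3, 6, -3]
outer = np.outer(u, v)  # 3x3 matrix with outer[i, j] = u[i] * v[j]

print("dot:", dot)
print("inner:", inner)
print("cross:", cross)
print("outer:\n", outer)
```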

Creating a NumPy array
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)
Output:
[1 2 3 4 5]

Basic operations
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
print(arr1 + arr2)
print(arr1 * arr2)
Output:
[5 7 9]
[ 4 10 18]

Indexing and slicing
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(arr[3])
print(arr[2:5])
Output:
4
[3 4 5]

Array functions
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(np.sum(arr))
print(np.mean(arr))
Output:
15
3.0

Reshaping an array
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6])
arr = arr.reshape(2, 3)
print(arr)
Output:
[[1 2 3]
 [4 5 6]]

Array Comparison
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([1, 2, 4])
print(np.equal(arr1, arr2))
Output:
[ True  True False]

Concatenating arrays
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.concatenate((arr1, arr2))
print(arr)
Output:
[1 2 3 4 5 6]

Splitting an array
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6])
arr1, arr2 = np.split(arr, 2)
print(arr1)
print(arr2)
Output:
[1 2 3]
[4 5 6]


Example Program: Overall Operations using Numpy Array

import numpy as np

# Creating arrays
a = np.array([1, 2, 3])
b = np.array([(1, 2, 3), (4, 5, 6)])
c = np.arange(0, 10, 2)
d = np.linspace(0, 1, 5)
e = np.zeros((2, 3))
f = np.ones((2, 3))
g = np.eye(3)
h = np.random.random((2, 3))

# Displaying arrays
print("Array a:\n", a)
print("Array b:\n", b)
print("Array c (arange):\n", c)
print("Array d (linspace):\n", d)
print("Array e (zeros):\n", e)
print("Array f (ones):\n", f)
print("Array g (identity matrix):\n", g)
print("Array h (random values):\n", h)

# Array properties
print("Shape of array b:", b.shape)
print("Size of array b:", b.size)
print("Data type of array a:", a.dtype)

# Array operations
i = np.array([1, 2, 3])
j = np.array([4, 5, 6])
print("i + j:\n", i + j)
print("i * j:\n", i * j)

# Matrix operations
k = np.array([[1, 2], [3, 4]])
l = np.array([[5, 6], [7, 8]])
print("Matrix product of k and l:\n", np.dot(k, l))

# Aggregate functions
m = np.array([1, 2, 3, 4, 5])
print("Sum of array m:", np.sum(m))
print("Mean of array m:", np.mean(m))
print("Standard deviation of array m:", np.std(m))

# Indexing and slicing


print("First element of array a:", a[0])
n = np.array([1, 2, 3, 4, 5])
print("Elements from index 1 to 3 of array n:", n[1:4])


5. UNIVERSAL FUNCTIONS

Universal functions (ufuncs) in NumPy are functions that operate element-wise on arrays, supporting
broadcasting, type casting, and other standard features. They are essential for performing vectorized
operations, which are both more concise and more efficient than using Python loops.
Key Characteristics of Ufuncs
1. Element-wise Operations: Ufuncs apply operations element-wise, which means they operate
on each element of the input arrays independently.
2. Broadcasting: Ufuncs support broadcasting, which allows them to work with arrays of
different shapes in a flexible manner.
3. Performance: Ufuncs are implemented in C and are optimized for performance, making them
much faster than equivalent Python loops.
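The three characteristics above can be seen together in a short sketch. The arrays are made-up examples; the point is that the ufunc version produces the same values as the explicit loop without writing the loop.

```python
import numpy as np

values = np.arange(1, 6)  # [1 2 3 4 5]

# Explicit Python loop: one element at a time.
looped = np.array([v ** 2 + 1 for v in values])

# Vectorized with ufuncs: the element-wise loop runs inside NumPy.
vectorized = np.add(np.power(values, 2), 1)

# Broadcasting: a (2, 1) column combined with a (5,) row gives a (2, 5) table.
column = np.array([[0], [10]])
table = np.add(column, values)

print(looped)
print(vectorized)
print(table)
```

On small arrays the speed difference is negligible; it becomes noticeable as the arrays grow, because the loop overhead stays in compiled code.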
Common Ufuncs:

The examples in this table assume the following setup:
import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

CATEGORY  FUNCTION  EXAMPLE

Arithmetic Operations
• np.add: result = np.add(a, b); print("Addition:", result)
• np.subtract: result = np.subtract(a, b); print("Subtraction:", result)
• np.multiply: result = np.multiply(a, b); print("Multiplication:", result)
• np.divide: result = np.divide(a, b); print("Division:", result)
• np.power: result = np.power(a, 2); print("Power:", result)

Mathematical Functions
• Square Root, np.sqrt: result = np.sqrt(a); print("Square Root:", result)
• Exponential, np.exp: result = np.exp(a); print("Exponential:", result)
• Logarithm, np.log: result = np.log(a); print("Logarithm:", result)
• Trigonometric Functions, np.sin, np.cos, np.tan:
angle = np.array([0, np.pi/2, np.pi])
print("Sine:", np.sin(angle))
print("Cosine:", np.cos(angle))
print("Tangent:", np.tan(angle))

Statistical Functions
• Mean, np.mean: result = np.mean(a); print("Mean:", result)
• Standard Deviation, np.std: result = np.std(a); print("Standard Deviation:", result)
• Sum, np.sum: result = np.sum(a); print("Sum:", result)
(np.mean, np.std, and np.sum reduce an array to a single value; strictly speaking they are
aggregation functions rather than ufuncs.)

Comparison Operators
• Greater Than, np.greater: result = np.greater(a, b); print("Greater Than:", result)
• Less Than, np.less: result = np.less(a, b); print("Less Than:", result)
• Equal, np.equal: result = np.equal(a, b); print("Equal:", result)


Example Code:
import numpy as np
# Arrays
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
# Arithmetic Operations
print("Addition:", np.add(a, b))
print("Subtraction:", np.subtract(a, b))
print("Multiplication:", np.multiply(a, b))
print("Division:", np.divide(a, b))
# Mathematical Functions
print("Square Root:", np.sqrt(a))
print("Exponential:", np.exp(a))
print("Logarithm:", np.log(a))
# Trigonometric Functions
angle = np.array([0, np.pi/2, np.pi])
print("Sine:", np.sin(angle))
print("Cosine:", np.cos(angle))
print("Tangent:", np.tan(angle))
# Statistical Functions
print("Mean:", np.mean(a))
print("Standard Deviation:", np.std(a))
print("Sum:", np.sum(a))
# Comparison Operators
print("Greater Than:", np.greater(a, b))
print("Less Than:", np.less(a, b))
print("Equal:", np.equal(a, b))
6. AGGREGATIONS

Aggregation in data science refers to the process of summarizing or combining multiple data points to produce
a single result or a smaller set of results. This is a fundamental concept used to simplify and analyze large
datasets, making it easier to draw insights and make decisions. Aggregation can be performed in various ways,
depending on the type of data and the analysis being conducted.

Definition: Aggregation is the process of combining multiple pieces of data to produce a summary
result.


Purpose: The primary purpose of aggregation is to simplify and summarize data, making it easier to
analyze and interpret. This helps in identifying trends, patterns, and anomalies.

6.1 AGGREGATION TECHNIQUES

Group By: Grouping data based on one or more columns and then applying an aggregation function.
For example, grouping sales data by region and then calculating the total sales per region.

Pivot Tables: Reshaping data by turning unique values from one column into multiple columns,
providing a summarized dataset.
Rolling Aggregation: Calculating aggregates over a rolling window, such as a moving average.
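The three techniques above can be sketched with pandas on a tiny made-up dataset. The column names, values, and the window size of 3 are assumptions chosen only to keep the results easy to check by hand.

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "sales": [250, 400, 150, 100],
})

# Group By: total sales per region.
totals = df.groupby("region")["sales"].sum()

# Pivot Table: regions as rows, quarters as columns, summed sales as values.
pivot = df.pivot_table(index="region", columns="quarter", values="sales", aggfunc="sum")

# Rolling Aggregation: 3-point moving average over a short series.
daily = pd.Series([10, 20, 30, 40, 50])
moving_avg = daily.rolling(window=3).mean()

print(totals)
print(pivot)
print(moving_avg)
```

The first two entries of the moving average are NaN because a full window of three values is not yet available at those positions.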

Common aggregation Techniques:


1. Sum:
 Adds up all the values in a dataset. Commonly used to calculate total sales, total expenses, etc.
 Calculate the total value of a column or group of data points.

Function: SUM()

total_sales = df['sales'].sum()

2. Mean (Average):
 Calculates the average value of a dataset.
 Calculate the mean value of a column or group of data points.

Function: AVG()

average_age = df['age'].mean()
3. Median:
Finds the middle value in a dataset, which is less affected by outliers than the mean.
median_income = df['income'].median()
4. Mode:
Identifies the most frequently occurring value in a dataset.
most_common_category = df['category'].mode()
5. Count:
Counts the number of entries in a dataset, often used to determine the number of occurrences of a
specific value.
count_of_sales = df['sales'].count()
6. Min and Max:
 Finds the minimum and maximum values in a dataset.
 Find the maximum or minimum value in a column or group of data points.
Function: MAX(), MIN()


min_salary = df['salary'].min()
max_salary = df['salary'].max()
7. Standard Deviation and Variance:
 Measures the spread or dispersion of the data around the mean.
 Calculate the spread or dispersion of a column or group of data points.
Function: STDDEV(), VAR()

std_dev = df['scores'].std()
variance = df['scores'].var()
8. Group By:
 Aggregates data based on one or more categories. This is often used in conjunction with other
aggregation functions.
 Group data by one or more columns and apply aggregations.
sales_by_region = df.groupby('region')['sales'].sum()
6.2 APPLICATIONS OF AGGREGATION
Descriptive Statistics:
Aggregation is used to describe the main features of a dataset quantitatively. For example,
summarizing the central tendency and dispersion of data.
Data Cleaning:
Aggregation can help in identifying and handling missing values, outliers, and inconsistencies in the
data.
Data Visualization:
Aggregated data is often used to create plots and charts, making it easier to visualize trends and
patterns.
Feature Engineering:
Aggregation can be used to create new features from existing data, improving the performance of
machine learning models.
Reporting:
Aggregated data is commonly used in business reports and dashboards to provide a high-level
overview of key metrics.
Example Code: Using Pandas
import pandas as pd
# Sample data
data = {
'region': ['North', 'South', 'East', 'West', 'North', 'South'],
'sales': [250, 150, 200, 300, 400, 100],
'expenses': [100, 50, 80, 120, 150, 60]

}
df = pd.DataFrame(data)
# Sum of sales
total_sales = df['sales'].sum()
# Average expenses
average_expenses = df['expenses'].mean()
# Sales by region
sales_by_region = df.groupby('region')['sales'].sum()
print(f"Total Sales: {total_sales}")
print(f"Average Expenses: {average_expenses}")
print("Sales by Region:")
print(sales_by_region)
7. COMPUTATION ON ARRAYS
Computation on arrays is a fundamental aspect of data science, enabling efficient data manipulation,
analysis, and machine learning. Arrays, especially as implemented in libraries like NumPy, provide a
powerful way to handle large datasets and perform a wide range of mathematical operations. Here,
we'll explore the essential aspects of array computations in data science.
Key Concepts
1. Array Creation and Initialization
Creating and initializing arrays is the first step in performing any computation. Arrays can be created
from lists, using functions like np.array, or from scratch using functions like np.zeros, np.ones, and
np.full.
import numpy as np
# From a list
arr = np.array([1, 2, 3, 4])
# From scratch
zeros = np.zeros((3, 3))
ones = np.ones((2, 2))
full = np.full((2, 3), 7)

2. Array Operations
NumPy supports a variety of element-wise operations, such as addition, subtraction, multiplication,
and division, as well as more complex mathematical functions like exponentiation, logarithms, and
trigonometric functions.
# Element-wise operations
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

sum_arr = arr1 + arr2
diff_arr = arr1 - arr2
prod_arr = arr1 * arr2
quot_arr = arr1 / arr2
3. Broadcasting
Broadcasting (described in Section 7.1) allows operations on arrays of different shapes, making it
easier to perform operations without explicitly reshaping arrays.
arr = np.array([[1, 2, 3], [4, 5, 6]])
scalar = 3
result = arr + scalar # Broadcasting scalar to the shape of arr

4. Indexing and Slicing


Efficiently accessing and manipulating array elements is crucial. NumPy provides powerful indexing
and slicing capabilities.
arr = np.array([[1, 2, 3], [4, 5, 6]])
# Indexing
element = arr[1, 2] # Accessing the element at row 1, column 2
# Slicing
slice_arr = arr[:, 1:3] # Slicing columns 1 to 2 for all rows

5. Aggregation
Aggregation functions like sum, mean, median, min, and max help summarize data.
arr = np.array([[1, 2, 3], [4, 5, 6]])
total_sum = np.sum(arr)
column_mean = np.mean(arr, axis=0)
row_max = np.max(arr, axis=1)

6. Linear Algebra
NumPy provides support for linear algebra operations, including dot products, matrix multiplication,
determinants, and inverses.
# Dot product
vec1 = np.array([1, 2])
vec2 = np.array([3, 4])
dot_product = np.dot(vec1, vec2)
# Matrix multiplication
mat1 = np.array([[1, 2], [3, 4]])
mat2 = np.array([[5, 6], [7, 8]])
mat_mult = np.matmul(mat1, mat2)
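The text above also mentions determinants and inverses, which the example does not show. A minimal sketch using NumPy's linalg module follows; the 2x2 matrix is a made-up example chosen so the determinant is easy to verify by hand.

```python
import numpy as np

mat = np.array([[1.0, 2.0], [3.0, 4.0]])

# Determinant: 1*4 - 2*3 = -2 (subject to floating-point rounding).
det = np.linalg.det(mat)

# Inverse: exists because the determinant is non-zero.
inv = np.linalg.inv(mat)

# Multiplying a matrix by its inverse should give the identity matrix.
identity_check = mat @ inv

print("determinant:", det)
print("inverse:\n", inv)
print("mat @ inv:\n", identity_check)
```

Because of floating-point rounding, results like these are best compared with np.allclose rather than exact equality.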
7.1 BROADCASTING
Broadcasting is a powerful mechanism in NumPy (a popular library for numerical computations in
Python) that allows for element-wise operations on arrays of different shapes. When performing
arithmetic operations, NumPy automatically stretches the smaller array along the dimension with
size 1 to match the shape of the larger array. This allows for efficient computation without the need
for explicitly replicating the data.
Broadcasting Rules:
To understand how broadcasting works, it's important to know the rules that govern it:


 If the arrays differ in their number of dimensions, the shape of the smaller array is padded
with ones on its left side.
 If the shape of the two arrays does not match in any dimension, the array with shape equal to
1 in that dimension is stretched to match the other shape.
 If in any dimension the sizes are different and neither is equal to 1, an error is raised.
Broadcasting follows a set of rules to make arrays compatible for element-wise operations:
 Align Shapes: If the arrays have different numbers of dimensions, the shape of the array with
fewer dimensions is padded with ones on its left side.
 Shape Compatibility: Arrays are compatible for broadcasting if, in every dimension, the
dimension sizes are equal or one of them is 1.
 Result Shape: The resulting shape is the maximum size along each dimension from the input
arrays.
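The rules above can be checked directly by asking NumPy for the broadcast shape of two arrays, without performing any arithmetic. The shapes used here are made-up examples; the last case shows the error raised when the sizes differ and neither is 1.

```python
import numpy as np

a = np.ones((2, 3))
b = np.arange(3)                # shape (3,): padded to (1, 3), then stretched to (2, 3)
c = np.arange(2).reshape(2, 1)  # shape (2, 1)

# Result shape is the maximum size along each dimension.
shape_ab = np.broadcast(a, b).shape  # (2, 3)
shape_bc = np.broadcast(b, c).shape  # (3,) with (2, 1) gives (2, 3)

# Incompatible case: (2,) is padded to (1, 2); 2 and 3 differ and neither is 1.
d = np.arange(2)
try:
    a + d
    failed = False
except ValueError as err:
    failed = True
    print("broadcast error:", err)

print(shape_ab, shape_bc, failed)
```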
Examples of Broadcasting
Example 1: Adding a Scalar to an Array
import numpy as np
arr = np.array([1, 2, 3])
scalar = 5
result = arr + scalar
print(result)

Output: [6 7 8]

Example 2: Adding Arrays of Different Shapes


arr1 = np.array([[1, 2, 3], [4, 5, 6]])
arr2 = np.array([1, 2, 3])
result = arr1 + arr2
print(result)

Output: [[2 4 6]
[5 7 9]]
Example 3: More Complex Broadcasting
arr1 = np.array([[1, 2, 3], [4, 5, 6]])
arr2 = np.array([[1], [2]])
result = arr1 + arr2
print(result)

Output: [[2 3 4]
[6 7 8]]
Practical Applications
Normalizing Data
Broadcasting is useful for normalizing data: subtracting the mean and dividing by the standard
deviation for each feature.
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
mean = np.mean(data, axis=0)
std = np.std(data, axis=0)
normalized_data = (data - mean) / std
print(normalized_data)

Output:
[[-1.22474487 -1.22474487 -1.22474487]
[0. 0. 0. ]
[ 1.22474487 1.22474487 1.22474487]]

Element-wise Operations
Broadcasting simplifies scaling each column of a matrix by a different factor.
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
scaling_factors = np.array([0.1, 0.2, 0.3])
scaled_matrix = matrix * scaling_factors
print(scaled_matrix)

Output:
[[0.1 0.4 0.9]
[0.4 1. 1.8]
[0.7 1.6 2.7]]
8. FANCY INDEXING
Fancy indexing, also known as advanced indexing, is a technique in data science and programming
(particularly in Python with NumPy and pandas) that allows for more flexible and powerful ways to
access and manipulate data arrays or dataframes. It involves using arrays or sequences of indices to
select specific elements or slices from an array or dataframe.
Fancy indexing refers to using arrays of indices to access multiple elements of an array
simultaneously. Instead of accessing elements one by one, you can pass a list or array of indices to
obtain a subset of elements. This technique can be used for both reading from and writing to arrays.
NumPy Fancy Indexing
 NumPy is a fundamental package for scientific computing with Python, providing support
for arrays and matrices.
In NumPy, fancy indexing is done by passing arrays of indices inside square brackets. Here’s an
example:
import numpy as np
# Create a NumPy array
arr = np.array([10, 20, 30, 40, 50])
# Fancy indexing with a list of indices
indices = [0, 2, 4]
subset = arr[indices]
print(subset)

Output: [10 30 50]

Boolean Indexing
Another form of fancy indexing is boolean indexing, where you use boolean arrays to select elements:
mask = arr > 30
subset = arr[mask]
print(subset)

Output: [40 50]


Fancy Indexing in 2D Arrays
Fancy indexing can also be applied to multi-dimensional arrays.
# Create a 2D array
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# Fancy indexing with row and column indices
row_indices = [0, 1, 2]
col_indices = [2, 1, 0]
subset = arr2d[row_indices, col_indices]
print(subset)
Output: [3 5 7]
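As noted at the start of this section, fancy indexing can be used for writing as well as reading. The following is a small sketch of that idea; the values and the cap of 25 are chosen only for illustration.

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

# Writing: assign to the positions listed in the index list
arr[[0, 2, 4]] = 0
print(arr)

# Writing with a boolean mask: cap every value above 25 at 25
arr[arr > 25] = 25
print(arr)

# Reading with fancy indexing returns a copy, so changing the subset
# leaves the original array untouched
subset = arr[[1, 3]]
subset[0] = -1
print(arr)
print(subset)
```

Because reading returns a copy, in-place changes have to be written through the index expression itself, as in the first two assignments.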
Fancy Indexing in pandas
Using loc and iloc:
loc is used for label-based indexing, while iloc is used for integer-based indexing.
import pandas as pd
# Create a DataFrame
df = pd.DataFrame({
'A': [10, 20, 30, 40, 50],
'B': [5, 10, 15, 20, 25]
})
# Fancy indexing with .iloc (integer-location based indexing)
subset = df.iloc[[0, 2, 4]]
print(subset)
Combined Indexing Techniques
Fancy indexing can be combined with other indexing techniques to achieve complex selections.
# Combined indexing
subset = df.iloc[[0, 2, 4], [0, 1]]
print(subset)

Applications in Data Science


Fancy indexing is particularly useful in various data science tasks, including:
 Data Cleaning: Selecting and modifying subsets of data based on certain conditions.
 Data Analysis: Efficiently extracting and analyzing specific parts of a dataset.
 Machine Learning: Preprocessing data by selecting specific features or samples.
 Visualization: Selecting specific data points to visualize.
 Data Selection: Extract specific elements, rows, or columns from large datasets.
 Data Filtering: Filter data based on conditions or criteria.
 Data Transformation: Apply operations to specific subsets of data.
 Efficient Computations: Perform efficient computations on selected data without looping.

9. SORTING ARRAYS
Sorting means putting elements in an ordered sequence.
An ordered sequence is any sequence whose elements follow a defined order, such as numeric or
alphabetical, ascending or descending.
NumPy provides the np.sort() function and the ndarray.sort() method for sorting an array.
Sorting in NumPy
1. Simple Sorting
 numpy.sort() returns a sorted copy of the array.
 numpy.ndarray.sort() sorts the array in-place.
import numpy as np
arr = np.array([3, 1, 2, 5, 4])
sorted_arr = np.sort(arr)
print(sorted_arr)

Output: [1 2 3 4 5]
arr.sort()
print(arr)

Output: [1 2 3 4 5]
2. Sorting Multi-dimensional Arrays
 You can sort along a specified axis using the axis parameter.
arr_2d = np.array([[3, 1, 2], [5, 4, 6]])
sorted_arr_2d = np.sort(arr_2d, axis=0) # axis=0: sort the values within each column
print(sorted_arr_2d)

Output: [[3 1 2]
[5 4 6]]
sorted_arr_2d = np.sort(arr_2d, axis=1) # axis=1: sort the values within each row
print(sorted_arr_2d)

Output: [[1 2 3]
[4 5 6]]
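np.sort() always sorts in ascending order. A descending order, as mentioned at the start of this section, can be obtained by reversing the sorted result. The following is one possible sketch; the variable names are illustrative.

```python
import numpy as np

arr = np.array([3, 1, 2, 5, 4])

# Sort ascending, then reverse with a slice to get descending order
descending = np.sort(arr)[::-1]
print(descending)

# For 2D arrays, reverse along the same axis that was sorted
arr_2d = np.array([[3, 1, 2], [5, 4, 6]])
rows_descending = np.sort(arr_2d, axis=1)[:, ::-1]
print(rows_descending)
```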
3. Argsort for Indirect Sorting
 numpy.argsort() returns the indices that would sort an array.
arr = np.array([3, 1, 2, 5, 4])
indices = np.argsort(arr)
print(indices)

Output: [1 2 0 4 3]
sorted_arr = arr[indices]
print(sorted_arr)

Output: [1 2 3 4 5]

Sorting by Multiple Keys


 You can sort a structured array by multiple fields.
data = np.array([(1, 'first', 200),
(2, 'second', 100),
(3, 'third', 150)],
dtype=[('id', 'i4'), ('name', 'U10'), ('score', 'i4')])
sorted_data = np.sort(data, order=['score', 'id'])
print(sorted_data)

Output: [(2, 'second', 100) (3, 'third', 150) (1, 'first', 200)]
Custom Sorting
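 You can use numpy.lexsort() to sort by several keys held in separate arrays. The last key in the
tuple is the primary sort key, so here the data is sorted by name first and then by age.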
names = np.array(['Betty', 'John', 'Alice', 'Alice'])
ages = np.array([25, 34, 30, 22])
indices = np.lexsort((ages, names))
sorted_data = list(zip(names[indices], ages[indices]))
print(sorted_data)

Output: [('Alice', 22), ('Alice', 30), ('Betty', 25), ('John', 34)]

10. STRUCTURED DATA


NumPy's Structured Arrays:
NumPy's structured arrays (closely related to record arrays, np.recarray) are a powerful feature for
handling heterogeneous data, where each element can have multiple fields of different data types.
Structured arrays allow you to define complex data structures and perform efficient operations on them.
Creating Structured Arrays
1. Defining Data Types
 You can define a structured array by specifying a list of tuples, where each tuple represents
a field's name and its data type.
import numpy as np
dtype = [('name', 'U10'), ('age', 'i4'), ('weight', 'f4')]
data = np.array([('Alice', 25, 55.5), ('Bob', 30, 75.2)], dtype=dtype)
print(data)

Output: [('Alice', 25, 55.5) ('Bob', 30, 75.2)]


2. Accessing Fields
 You can access individual fields of the structured array using the field names.
names = data['name']
ages = data['age']
weights = data['weight']
print(names)

Output: ['Alice' 'Bob']


print(ages)

Output: [25 30]


print(weights)

Output: [55.5 75.2]

Adding and Removing Fields


1. Adding Fields
 You can use the np.lib.recfunctions.append_fields() function to add new fields to a
structured array.
from numpy.lib import recfunctions as rfn
heights = np.array([165.0, 180.0])
data = rfn.append_fields(data, 'height', heights, usemask=False)
print(data)

Output: [('Alice', 25, 55.5, 165.0) ('Bob', 30, 75.2, 180.0)]


2. Removing Fields
 Use the np.lib.recfunctions.drop_fields() function to remove fields from a structured array.
data = rfn.drop_fields(data, 'weight')
print(data)

Output: [('Alice', 25, 165.0) ('Bob', 30, 180.0)]

Sorting and Filtering


1. Sorting
 You can sort structured arrays by any field using np.sort() or np.argsort().
sorted_data = np.sort(data, order='age')
print(sorted_data)

Output: [('Alice', 25, 165.0) ('Bob', 30, 180.0)]
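The sorting example above uses np.sort(); np.argsort() accepts the same order argument and returns the row order instead of a sorted copy. The following is a minimal sketch on a small, made-up array (people is a hypothetical name, not the data array used above).

```python
import numpy as np

dtype = [('name', 'U10'), ('age', 'i4'), ('height', 'f4')]
people = np.array([('Bob', 30, 180.0), ('Alice', 25, 165.0), ('Cathy', 25, 170.0)],
                  dtype=dtype)

# argsort returns the indices that would sort the records;
# ties on 'age' are broken by 'name'
order = np.argsort(people, order=['age', 'name'])
print(order)

# Use the indices to reorder the array (or any array aligned with it)
sorted_names = people[order]['name']
print(sorted_names)
```

Keeping the index order is useful when several aligned arrays have to be rearranged in the same way.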


2. Filtering
 Use boolean indexing to filter structured arrays based on field values.
filtered_data = data[data['age'] > 25]
print(filtered_data)

Output: [('Bob', 30, 180.0)]

Advanced Usage
1. Nested Structures
address_dtype = [('street', 'U20'), ('city', 'U20')]
person_dtype = [('name', 'U10'), ('age', 'i4'), ('address', address_dtype)]
data = np.array([('Alice', 25, ('123 Main St', 'Springfield')), ('Bob', 30, ('456 Elm St', 'Shelbyville'))],
dtype=person_dtype)
print(data)

Output: [('Alice', 25, ('123 Main St', 'Springfield')) ('Bob', 30, ('456 Elm St', 'Shelbyville'))]

2. Accessing Nested Fields


streets = data['address']['street']
print(streets)

Output: ['123 Main St' '456 Elm St']

Example 1: Creating Structured Arrays


import numpy as np
# Define the data type
dtype = np.dtype([('name', 'S10'), ('age', 'i4'), ('weight', 'f4')])
# Create a structured array
data = np.array([('Alice', 25, 55.5), ('Bob', 30, 75.0), ('Cathy', 22, 60.0)], dtype=dtype)
print(data)

Output: [(b'Alice', 25, 55.5) (b'Bob', 30, 75. ) (b'Cathy', 22, 60. )]

Example 2: Accessing and Modifying Data


# Accessing a single field
names = data['name']
print(names) # Output: [b'Alice' b'Bob' b'Cathy']
# Accessing a single record
alice = data[0]
print(alice) # Output: (b'Alice', 25, 55.5)
# Modifying a field
data['age'] = [26, 31, 23]
print(data)

Output: [(b'Alice', 26, 55.5) (b'Bob', 31, 75. ) (b'Cathy', 23, 60. )]
# Modifying a record
data[1] = ('Bob', 32, 77.5)
print(data)

Output: [(b'Alice', 26, 55.5) (b'Bob', 32, 77.5) (b'Cathy', 23, 60. )]

Example 3: Operations on Structured Arrays


# Filtering
young_people = data[data['age'] < 30]
print(young_people)

Output: [(b'Alice', 26, 55.5) (b'Cathy', 23, 60. )]


# Sorting
sorted_data = np.sort(data, order='age')
print(sorted_data)

Output: [(b'Cathy', 23, 60. ) (b'Alice', 26, 55.5) (b'Bob', 32, 77.5)]

Example 4: Advanced Usage


# Nested structured array
nested_dtype = np.dtype([('person', [('name', 'S10'), ('age', 'i4')]), ('weight', 'f4')])
nested_data = np.array([(('Alice', 25), 55.5), (('Bob', 30), 75.0)], dtype=nested_dtype)
print(nested_data)

Output: [((b'Alice', 25), 55.5) ((b'Bob', 30), 75. )]


# Accessing nested fields
print(nested_data['person'])

Output: [(b'Alice', 25) (b'Bob', 30)]


print(nested_data['person']['name'])
Output: [b'Alice' b'Bob']
11. DATA MANIPULATION WITH PANDAS
Pandas is a powerful and flexible data manipulation library in Python, built on top of NumPy. It
provides data structures like Series (one-dimensional) and DataFrame (two-dimensional) that are
designed to make data analysis and manipulation straightforward and efficient.
Getting Started with Pandas
First, you need to import the Pandas library:
import pandas as pd
Creating DataFrames
DataFrames can be created from dictionaries, lists, or other data structures.
import pandas as pd
# From a dictionary
data = {
'Name': ['Alice', 'Bob', 'Cathy'],
'Age': [25, 30, 22],
'Score': [85.5, 92.3, 78.9]
}
df = pd.DataFrame(data)
print(df)

Output:
    Name  Age  Score
0  Alice   25   85.5
1    Bob   30   92.3
2  Cathy   22   78.9

# From a list of dictionaries


data = [
{'Name': 'Alice', 'Age': 25, 'Score': 85.5},
{'Name': 'Bob', 'Age': 30, 'Score': 92.3},
{'Name': 'Cathy', 'Age': 22, 'Score': 78.9}
]
df = pd.DataFrame(data)
print(df)

Viewing and Inspecting Data


print(df.head()) # View the first 5 rows
print(df.tail()) # View the last 5 rows
df.info() # Print a summary of the DataFrame (info() prints directly and returns None)
print(df.describe()) # Get descriptive statistics
print(df.shape) # Get the dimensions of the DataFrame
print(df.columns) # Get the column names
print(df.index) # Get the row indices

Data Selection and Filtering


 You can select and filter data using conditions and indexing.
# Selecting rows based on conditions
adults = df[df['Age'] > 21]
print(adults)
# Using multiple conditions
high_scorers = df[(df['Age'] > 21) & (df['Score'] > 80)]
print(high_scorers)
# Selecting specific rows and columns
subset = df.loc[df['Age'] > 21, ['Name', 'Score']]
print(subset)

Adding and Modifying Columns


# Adding a new column
df['Passed'] = df['Score'] > 80
print(df)
# Modifying an existing column
df['Score'] = df['Score'] + 5
print(df)

Handling Missing Data


Handling missing data is a critical aspect of data preprocessing in Pandas. Missing data can occur for
various reasons, such as errors during data collection or the merging of datasets. Pandas provides a
range of functions to detect, drop, and fill in missing values, which helps keep your analysis accurate
and reliable.
Detecting Missing Data
Before handling missing data, you need to identify where the missing values are.
import pandas as pd
import numpy as np
# Sample DataFrame with missing values
df = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, np.nan, 8],


'C': [9, 10, 11, 12]
})
# Detect missing values
print(df.isnull()) # Boolean DataFrame indicating missing values
print(df.isna()) # Same as isnull(), used interchangeably
# Summarize missing values
print(df.isnull().sum()) # Number of missing values per column

Depending on the context and the extent of the missing data, you can handle missing values using
various strategies:
 Dropping Missing Values
 Filling Missing Values
 Interpolate Missing Values
 Replace Missing Values Using Custom Functions
Example
import pandas as pd
import numpy as np
# Create a DataFrame with missing values
df = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, np.nan, 8],
'C': [9, 10, 11, 12]
})
# Detect missing values
print("Missing values:")
print(df.isnull())
# Drop rows with any missing values
print("\nDataFrame after dropping rows with missing values:")
print(df.dropna())
# Fill missing values with the mean of each column
print("\nDataFrame after filling missing values with column means:")
print(df.fillna(df.mean()))
# Forward fill missing values (fillna(method='ffill') is deprecated in recent pandas versions)
print("\nDataFrame after forward filling missing values:")
print(df.ffill())

# Interpolate missing values


print("\nDataFrame after interpolating missing values:")
print(df.interpolate())
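The last strategy in the list, replacing missing values with a custom function, is not covered by the example above. One possible sketch uses DataFrame.apply with a hypothetical helper, fill_with_median, which is an illustrative name and not a pandas function.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
})

def fill_with_median(column):
    # Hypothetical helper: replace NaN in one column with that column's median
    return column.fillna(column.median())

# apply() calls the helper once per column
filled = df.apply(fill_with_median)
print(filled)
```

The median is less sensitive to extreme values than the mean, which is one reason it might be preferred here; any other per-column rule could be substituted in the helper.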

Grouping and Aggregation


Grouping and aggregation operations are crucial for summarizing data.
# Recreate the student DataFrame from the earlier examples, including the 'Passed' column
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Cathy'],
'Age': [25, 30, 22],
'Score': [85.5, 92.3, 78.9]
})
df['Passed'] = df['Score'] > 80
# Grouping data
grouped = df.groupby('Passed')

# Aggregating data
mean_scores = grouped['Score'].mean()
print(mean_scores)

# Multiple aggregations
aggregated = grouped.agg({
'Age': 'mean',
'Score': ['mean', 'max']
})
print(aggregated)

Merging and Joining DataFrames


You can merge and join DataFrames to combine data from multiple sources.
df1 = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Cathy'],
'Age': [25, 30, 22]
})
df2 = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Dave'],
'Score': [85.5, 92.3, 88.9]
})
# Merging DataFrames
merged = pd.merge(df1, df2, on='Name', how='inner')
print(merged)
# Concatenating DataFrames
concatenated = pd.concat([df1, df2], axis=0, ignore_index=True)
print(concatenated)
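The how argument of pd.merge controls which rows survive the merge. The sketch below reuses the same two small frames to compare inner, left, and outer joins; the indicator=True option is used only to show where each row came from.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Alice', 'Bob', 'Cathy'], 'Age': [25, 30, 22]})
df2 = pd.DataFrame({'Name': ['Alice', 'Bob', 'Dave'], 'Score': [85.5, 92.3, 88.9]})

# Inner join keeps only the names present in both frames
inner = pd.merge(df1, df2, on='Name', how='inner')
# Left join keeps every row of df1; Cathy gets NaN for Score
left = pd.merge(df1, df2, on='Name', how='left')
# Outer join keeps names from either frame; indicator=True adds a
# '_merge' column recording the origin of each row
outer = pd.merge(df1, df2, on='Name', how='outer', indicator=True)

print(inner)
print(left)
print(outer)
```

In this sketch the inner join returns two rows, the left join three, and the outer join four.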
Pivot Tables
Pivot tables are used to summarize data.
data = {
'Name': ['Alice', 'Bob', 'Cathy', 'Dave'],
'Age': [25, 30, 22, 30],
'Score': [85.5, 92.3, 78.9, 88.9]
}
df = pd.DataFrame(data)
pivot = df.pivot_table(values='Score', index='Age', aggfunc='mean')
print(pivot)

12. DATA INDEXING AND SELECTION


Data indexing and selection are fundamental operations in Pandas, allowing you to access and
manipulate data efficiently. Pandas provides various methods for indexing, selecting, and slicing
data in Series and DataFrames.
Indexing and Selection in Pandas
1. Selecting Data in Series
 By Label: Use the index label to access elements.
import pandas as pd
# Create a Series
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s['a'])
Output: 10
 By Position: Use .iloc to access elements by integer position. (Plain s[0] on a Series with a
label index is deprecated in recent pandas versions.)
print(s.iloc[0])
Output: 10
 Slicing: Use slicing to get subsets of the Series. Label-based slices include both endpoints.
print(s['a':'b'])
Output:
a    10
b    20
dtype: int64
2. Selecting Data in DataFrames
 By Column Label: Access columns by their labels.
df = pd.DataFrame({
'A': [1, 2, 3],
'B': [4, 5, 6],
'C': [7, 8, 9]
})
print(df['A'])
Output: 0 1
1 2
2 3
Name: A, dtype: int64
 By Row Label: Use .loc to access rows by index labels.
df = pd.DataFrame({
'A': [1, 2, 3],
'B': [4, 5, 6]
}, index=['x', 'y', 'z'])
print(df.loc['x'])

Output: A 1
B 4
Name: x, dtype: int64
 By Position: Use .iloc for integer-based indexing.
print(df.iloc[0])
Output: A 1
B 4
Name: x, dtype: int64
 Selecting Specific Rows and Columns
# Select specific rows and columns (passing lists returns a DataFrame)
print(df.loc[['x', 'y'], ['A']])
Output:
   A
x  1
y  2
Boolean Indexing
 Filtering with Conditions: Use boolean conditions to filter rows.
df = pd.DataFrame({
'A': [10, 20, 30],
'B': [40, 50, 60]
})
# Filter rows where column 'A' is greater than 15
filtered_df = df[df['A'] > 15]
print(filtered_df)

Output: A B
1 20 50
2 30 60
MultiIndex DataFrames
 Creating and Selecting with MultiIndex: Use MultiIndex for hierarchical indexing.
arrays = [
['A', 'A', 'B', 'B'],
[1, 2, 1, 2]
]
index = pd.MultiIndex.from_arrays(arrays, names=('letter', 'number'))
df = pd.DataFrame({'value': [10, 20, 30, 40]}, index=index)


print(df)

Output:
               value
letter number
A      1          10
       2          20
B      1          30
       2          40

# Access data with MultiIndex
print(df.loc['A'])

Output:
        value
number
1          10
2          20
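Selecting on the outer level with .loc['A'] is only one way to query a MultiIndex. The following sketch, which rebuilds the same small DataFrame, shows two further options: a tuple key for a single row and DataFrame.xs for selecting on the inner level. The variable names single and number_one are illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [['A', 'A', 'B', 'B'], [1, 2, 1, 2]], names=('letter', 'number'))
df = pd.DataFrame({'value': [10, 20, 30, 40]}, index=index)

# A tuple selects one row by its full (letter, number) key
single = df.loc[('B', 2), 'value']
print(single)

# xs selects on an inner level, here every row with number == 1
number_one = df.xs(1, level='number')
print(number_one)
```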
Advanced Indexing Techniques
Using .query(): Filter rows with a query string.
df = pd.DataFrame({
'A': [10, 20, 30],
'B': [40, 50, 60]
})
result = df.query('A > 15')
print(result)
Output:
A B
1 20 50
2 30 60
Using .apply() with Indexing: Apply functions to DataFrame columns
df = pd.DataFrame({
'A': [10, 20, 30],
'B': [40, 50, 60]
})
def add_ten(x):
    return x + 10
df['A_plus_10'] = df['A'].apply(add_ten)
print(df)
Output:
A B A_plus_10
0 10 40 20
1 20 50 30
2 30 60 40
13. STRING OPERATIONS


String operations are fundamental in data science for cleaning, transforming, and analyzing text data.
Python's Pandas library offers powerful tools for string manipulation, which are essential for tasks
such as data cleaning, preprocessing, and feature extraction. Here’s a guide to common string
operations using Pandas and Python.
String Operations with Pandas
Pandas provides vectorized string operations through the .str accessor, which allows you to perform
various string manipulations on Series objects.
Basic String Operations
 Uppercase and Lowercase
import pandas as pd
df = pd.DataFrame({'text': ['apple', 'BANANA', 'Cherry']})
# Convert to uppercase
df['text_upper'] = df['text'].str.upper()
print(df)

Output:
text text_upper
0 apple APPLE
1 BANANA BANANA
2 Cherry CHERRY
# Convert to lowercase
df['text_lower'] = df['text'].str.lower()
print(df)

Output:
text text_upper text_lower
0 apple APPLE apple
1 BANANA BANANA banana
2 Cherry CHERRY cherry

 Title Case
df['text_title'] = df['text'].str.title()
print(df)
Output:
text text_upper text_lower text_title
0 apple APPLE apple Apple
1 BANANA BANANA banana Banana
2 Cherry CHERRY cherry Cherry

2. String Methods
 Length of Strings
df['text_length'] = df['text'].str.len()
print(df)

Output:
text text_upper text_lower text_title text_length
0 apple APPLE apple Apple 5
1 BANANA BANANA banana Banana 6
2 Cherry CHERRY cherry Cherry 6

 String Containment
df['contains_e'] = df['text'].str.contains('e')
print(df)

Output:
text text_upper text_lower text_title text_length contains_e
0 apple APPLE apple Apple 5 True
1 BANANA BANANA banana Banana 6 False
2 Cherry CHERRY cherry Cherry 6 True

3. String Replacement

 Replace Substrings
df['text_replaced'] = df['text'].str.replace('e', 'x', regex=False)
print(df)
Output:
text text_upper text_lower text_title text_length contains_e text_replaced
0 apple APPLE apple Apple 5 True applx
1 BANANA BANANA banana Banana 6 False BANANA
2 Cherry CHERRY cherry Cherry 6 True Chxrry

4. Splitting and Joining


 Split Strings
df['text_split'] = df['text'].str.split('e')
print(df)
Output:
text text_upper text_lower text_title text_length contains_e text_split
0 apple APPLE apple Apple 5 True [appl, ]
1 BANANA BANANA banana Banana 6 False [BANANA]
2 Cherry CHERRY cherry Cherry 6 True [Ch, rry]

 Join Strings
Called without other arguments, str.cat() concatenates every value in the Series into one string, so
the same joined string is assigned to every row.
df['text_joined'] = df['text'].str.cat(sep='-')
print(df)
Output:
text text_upper text_lower text_title text_length contains_e text_split text_joined
0 apple APPLE apple Apple 5 True [appl, ] apple-BANANA-Cherry
1 BANANA BANANA banana Banana 6 False [BANANA] apple-BANANA-Cherry
2 Cherry CHERRY cherry Cherry 6 True [Ch, rry] apple-BANANA-Cherry

5. String Matching and Regular Expressions


 Extracting with Regex
df['extracted'] = df['text'].str.extract(r'([a-zA-Z]+)', expand=False)
print(df)
Output:
text text_upper text_lower text_title text_length contains_e text_split text_joined extracted
0 apple APPLE apple Apple 5 True [appl, ] apple-BANANA-Cherry apple
1 BANANA BANANA banana Banana 6 False [BANANA] apple-BANANA-Cherry BANANA
2 Cherry CHERRY cherry Cherry 6 True [Ch, rry] apple-BANANA-Cherry Cherry

 String Matching
df['match'] = df['text'].str.match(r'[A-Za-z]+')
print(df)
Output:
text text_upper text_lower text_title text_length contains_e text_split text_joined extracted match
0 apple APPLE apple Apple 5 True [appl, ] apple-BANANA-Cherry apple True
1 BANANA BANANA banana Banana 6 False [BANANA] apple-BANANA-Cherry BANANA True
2 Cherry CHERRY cherry Cherry 6 True [Ch, rry] apple-BANANA-Cherry Cherry True
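The operations in this section are often chained when cleaning text. The following is a small sketch on made-up data that also contains a missing value; the Series name raw and its contents are assumptions for illustration.

```python
import pandas as pd

raw = pd.Series(['  Apple ', 'BANANA', 'cherry pie', None])

# Chain .str methods: trim whitespace, then normalize case
cleaned = raw.str.strip().str.lower()

# Missing values stay missing; na=False treats them as "no match"
has_space = cleaned.str.contains(' ', na=False)

# Regex capture group: keep the first run of lowercase letters
first_word = cleaned.str.extract(r'([a-z]+)', expand=False)

print(cleaned.tolist())
print(has_space.tolist())
print(first_word.tolist())
```

The .str methods skip missing entries instead of raising an error, which is why the last element remains missing throughout.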

******************************************************************************************************
UNIT III
MACHINE LEARNING
The modeling process - Types of machine learning - Supervised learning - Unsupervised learning -
Semi-supervised learning - Classification, regression - Clustering - Outliers and Outlier Analysis.

1. MODELING PROCESS
Each step in the modeling process is crucial for building an effective and reliable machine learning
model. Ensuring attention to detail at each stage can lead to better performance and more accurate
predictions.
The following 10 steps are involved in building a good machine learning model:
1. Problem Definition
2. Data Collection
3. Data Exploration and Preprocessing
4. Feature Selection
5. Model Selection
6. Model Training
7. Model Evaluation
8. Model Tuning
9. Model Deployment
10. Model Maintenance

1. Problem Definition
Objective: Clearly define the problem you are trying to solve. This includes understanding the
business or scientific objectives.
Output : Decide whether it is a classification, regression, clustering, or another type of problem.
2. Data Collection
Data collection is a crucial step in the creation of a machine learning model, as it lays the foundation
for building accurate models. In this phase of machine learning model development, relevant data is
gathered from various sources to train the machine learning model and enable it to make accurate
predictions.
Sources: Gather data from various sources such as databases, online repositories, sensors, etc.
Quality : Ensure data quality by addressing issues like missing values, inconsistencies, and errors.
3. Data Exploration and Preprocessing
Exploration: Analyze the data to understand its structure, patterns, and anomalies.
 Visualization: Use plots and graphs to visualize data distributions and relationships.
 Statistics: Calculate summary statistics to get a sense of the data.
Preprocessing: Prepare the data for modeling.
 Cleaning: Handle missing values, outliers, and duplicates.
 Transformation: Normalize or standardize data, handle categorical variables, and create
new features if necessary.
 Feature Engineering: Create new features from existing ones to improve model performance.
4. Feature Selection
Relevance: Identify and select the features that are most relevant to the problem.
Techniques: Use methods like correlation analysis, mutual information, and feature importance
scores.
5. Model Selection
Choose Algorithms: Choose appropriate machine learning algorithms based on the problem type
(classification, regression, clustering, etc.).
Baseline Model: Develop a simple model to establish a baseline performance.
Comparison: Compare multiple algorithms using cross-validation to find the best performing one.
6. Model Training
In this phase of building a machine learning model, we have all the necessary ingredients to train
our model effectively. This involves utilizing our prepared data to teach the model to recognize
patterns and make predictions based on the input features. During the training process, we begin
by feeding the preprocessed data into the selected machine-learning algorithm.
Training Data: Split the data into training and testing sets (and sometimes validation sets).
Training Process: Fit the chosen model to the training data, optimizing its parameters.
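The data split described above can be sketched in plain Python. The helper below is hypothetical and written only for illustration; in practice a library routine such as scikit-learn's train_test_split is commonly used for this step.

```python
import random

def split_rows(rows, test_fraction=0.25, seed=0):
    # Hypothetical helper: shuffle a copy of the rows, then cut off a test set
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Toy data: (feature, label) pairs
rows = [(x, 2 * x + 1) for x in range(20)]
train, test = split_rows(rows)
print(len(train), len(test))
```

Shuffling before splitting guards against any ordering in the raw data, and the fixed seed makes the split reproducible.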
7. Model Evaluation
Once you have trained your model, it’s time to assess its performance. There are various metrics
used to evaluate model performance, categorized based on the type of task: regression/numerical
or classification.
1. For regression tasks, common evaluation metrics are:
Mean Absolute Error (MAE): MAE is the average of the absolute differences between predicted
and actual values.
Mean Squared Error (MSE): MSE is the average of the squared differences between predicted
and actual values.
Root Mean Squared Error (RMSE): It is a square root of the MSE, providing a measure of the
average magnitude of error.
R-squared (R2): It is the proportion of the variance in the dependent variable that is
predictable from the independent variables.
2. For classification tasks, common evaluation metrics are:
Accuracy: Proportion of correctly classified instances out of the total instances.
Precision: Proportion of true positive predictions among all positive predictions.
Recall: Proportion of true positive predictions among all actual positive instances.
F1-score: Harmonic mean of precision and recall, providing a balanced measure of model
performance.
Area Under the Receiver Operating Characteristic curve (AUC-ROC): Measure of the model’s
ability to distinguish between classes.
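Most of the metrics listed above can be computed in a few lines. The following sketch implements them in plain Python on made-up values; the function names are illustrative, and the classification part assumes binary labels with at least one positive label and one positive prediction (otherwise precision or recall would divide by zero).

```python
import math

def regression_metrics(y_true, y_pred):
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    # R-squared: 1 minus residual sum of squares over total sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - sum(e * e for e in errors) / ss_tot
    return mae, mse, rmse, r2

def classification_metrics(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

mae, mse, rmse, r2 = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
print(mae, mse, rmse, r2)

acc, prec, rec, f1 = classification_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
print(acc, prec, rec, f1)
```

AUC-ROC is left out of the sketch because it needs predicted scores rather than hard class labels.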

8. Model Tuning
Tuning and optimization help the model maximize its performance and generalization ability.
This process involves fine-tuning hyperparameters, selecting the best algorithm, and improving
features through feature engineering techniques.
Hyperparameters are parameters that are set before the training process begins and control the
behavior of the machine learning model. Examples include the learning rate and the regularization
strength; they should be adjusted carefully.
Techniques: Use grid search, random search, or Bayesian optimization for hyperparameter tuning.
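Grid search, the first technique named above, simply tries every combination of candidate hyperparameter values and keeps the one with the lowest validation error. The sketch below tunes a toy gradient-descent model y = w * x; the model, the data, and the candidate values are all assumptions chosen for illustration.

```python
from itertools import product

def fit_slope(xs, ys, learning_rate, n_steps):
    # Gradient descent on mean squared error for the model y = w * x
    w = 0.0
    n = len(xs)
    for _ in range(n_steps):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad
    return w

def mse(w, xs, ys):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy data generated from y = 3x, split into training and validation parts
train_x, train_y = [1.0, 2.0, 3.0], [3.0, 6.0, 9.0]
val_x, val_y = [4.0, 5.0], [12.0, 15.0]

# Grid search: try every combination and keep the lowest validation error
grid = {'learning_rate': [0.001, 0.01, 0.1], 'n_steps': [10, 100]}
best_params, best_error = None, float('inf')
for lr, steps in product(grid['learning_rate'], grid['n_steps']):
    w = fit_slope(train_x, train_y, lr, steps)
    error = mse(w, val_x, val_y)
    if error < best_error:
        best_params, best_error = (lr, steps), error

print(best_params, best_error)
```

The hyperparameters are scored on the validation part rather than the training part, so the chosen setting is the one that generalizes best to data the model was not fitted on.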
9. Model Deployment
Deploying the model and making predictions is one of the last stages in the journey of creating an
ML model. Once a model has been trained and optimized, it is time to integrate it into a production
environment where it can provide real-time predictions on new data.
During model deployment, it is essential to ensure that the system can handle high user loads, operate
smoothly without crashes, and be easily updated.
Integration: Deploy the model into a production environment where it can make real-time
predictions.
Monitoring: Continuously monitor the model's performance to ensure it remains accurate and
reliable.
10. Model Maintenance
Updates: Periodically update the model with new data to maintain its performance.
Retraining: Retrain the model if there are significant changes in the data patterns or if the
model's performance degrades.
2. TYPES OF MACHINE LEARNING
Machine learning is a subset of AI, which enables the machine to automatically learn from data,
improve performance from past experiences, and make predictions. Machine learning contains a set
of algorithms that work on a huge amount of data. Data is fed to these algorithms to train them, and
on the basis of training, they build the model & perform a specific task.
These ML algorithms help to solve different business problems like Regression, Classification,
Forecasting, Clustering, and Associations, etc.
Based on the methods and way of learning, machine learning is divided into the following three main
types:
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
2.1 SUPERVISED LEARNING
Supervised machine learning is based on supervision. In the supervised learning technique, we train
the machine using a "labelled" dataset, and based on that training, the machine predicts the output.
Here, labelled data means that the training inputs are already mapped to their correct outputs.
More precisely, we first train the machine with inputs and their corresponding outputs, and then we
ask the machine to predict the output for a test dataset.
The main goal of the supervised learning technique is to map the input variable (x) to the
output variable (y). Some real-world applications of supervised learning are Risk Assessment, Fraud
Detection, Spam filtering, etc.
Example:
Let's understand supervised learning with an example. Suppose we have an input dataset of cat and
dog images. First, we train the machine to understand the images using features such as the shape
and size of the tail, the shape of the eyes, colour, and height (dogs are usually taller, cats are
smaller). After training is complete, we input the picture of a cat and ask the machine to
identify the object and predict the output. Since the machine is now well trained, it checks all the
features of the object, such as height, shape, colour, eyes, ears, and tail, and finds that it is a cat. So, it
puts it in the Cat category. This is how the machine identifies objects in
Supervised Learning.

Steps Involved in Supervised Learning:


1. First, determine the type of training dataset.
2. Collect/gather the labelled training data.
3. Split the dataset into a training set, a validation set, and a test set.
4. Determine the input features of the training dataset, which should carry enough information
for the model to predict the output accurately.
5. Determine a suitable algorithm for the model, such as a support vector machine or a decision
tree.
6. Execute the algorithm on the training dataset. Sometimes validation sets, which are subsets of
the training data, are needed to tune the control parameters.
7. Evaluate the accuracy of the model on the test set. If the model predicts the correct
outputs, the model is accurate.
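The steps above can be sketched end to end on a tiny, made-up dataset. The example uses a 1-nearest-neighbour classifier written in plain Python; the feature values, the split, and the choice of algorithm are assumptions for illustration only.

```python
# Steps 1-2: a small labelled dataset of (height_cm, weight_kg) -> label
# (values are made up for illustration)
labelled = [
    ((25, 4), 'cat'), ((23, 5), 'cat'), ((28, 6), 'cat'), ((24, 3), 'cat'),
    ((60, 25), 'dog'), ((55, 20), 'dog'), ((70, 30), 'dog'), ((65, 28), 'dog'),
]

# Step 3: hold out part of the data for testing
train = labelled[:3] + labelled[4:7]
test = [labelled[3], labelled[7]]

# Steps 5-6: a 1-nearest-neighbour classifier "trains" by storing the data
def predict(features):
    def squared_distance(example):
        return sum((a - b) ** 2 for a, b in zip(example[0], features))
    nearest = min(train, key=squared_distance)
    return nearest[1]

# Step 7: accuracy on the held-out test set
correct = sum(predict(x) == label for x, label in test)
accuracy = correct / len(test)
print(accuracy)
```

With real data the test set would be chosen at random and would be much larger; two held-out examples are used here only to keep the sketch short.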
2.1.1 TYPES OF SUPERVISED MACHINE LEARNING
Supervised machine learning can be classified into two types of problems, which are given below:
1. Classification
2. Regression
2.1.1.1 REGRESSION:
Regression algorithms are used when there is a relationship between the input variable and the output
variable. They are used for the prediction of continuous variables, such as in Weather forecasting and
Market Trends. Below are some popular Regression algorithms which come under supervised learning:
1. Linear Regression
2. Regression Trees
3. Non-Linear Regression
4. Bayesian Linear Regression
5. Polynomial Regression
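As an illustration of the first algorithm in the list, simple linear regression with one input variable can be fitted with the ordinary least squares formulas. The sketch below uses made-up data lying exactly on y = 2x + 1; the function name fit_line is illustrative.

```python
def fit_line(xs, ys):
    # Ordinary least squares for y = slope * x + intercept
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy data lying exactly on y = 2x + 1
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)

# Predict a continuous output for a new input
prediction = slope * 4 + intercept
print(prediction)
```

The fitted line recovers the slope and intercept used to generate the data, and the prediction for a new input is a continuous value rather than a category.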
2.1.1.2 CLASSIFICATION:
Classification algorithms are used to solve the classification problems in which the output variable is
categorical, such as "Yes" or "No", "Male" or "Female", "Red" or "Blue", etc. The classification algorithms
predict the categories present in the dataset. Some real-world examples of classification algorithms
are Spam Detection, Email filtering, etc.
Some popular classification algorithms are given below:
1. Random Forest Algorithm
2. Decision Tree Algorithm
3. Logistic Regression Algorithm
4. Support Vector Machine Algorithm
Advantages of Supervised learning Algorithm:
1. Since supervised learning works with a labelled dataset, we can have an exact idea about
the classes of objects.
2. These algorithms are helpful in predicting the output on the basis of prior experience.
Disadvantages of Supervised learning Algorithm:
1. These algorithms are not able to solve complex tasks.
2. It may predict the wrong output if the test data is different from the training data.
3. It requires lots of computational time to train the algorithm.


2.1.2 APPLICATIONS OF SUPERVISED LEARNING:


1. Logistic Regression
2. Support Vector Machine
3. Random Forest
4. Decision Tree
5. K-Nearest Neighbors (KNN)
6. Naive Bayes

2.2 UNSUPERVISED LEARNING
Unsupervised learning is different from the Supervised learning technique; as its name suggests,
there is no need for supervision. It means, in unsupervised machine learning, the machine is trained
using the unlabeled dataset, and the machine predicts the output without any supervision.
In unsupervised learning, the models are trained with the data that is neither classified nor labelled,
and the model acts on that data without any supervision.
Unsupervised machine learning analyzes and clusters unlabeled datasets using machine learning
algorithms. These algorithms find hidden patterns in the data without any human intervention, i.e., we
don’t give output to our model. The training model has only input parameter values and discovers the
groups or patterns on its own.

The main aim of the unsupervised learning algorithm is to group or categorize the unsorted
dataset according to the similarities, patterns, and differences. Machines are instructed to find
the hidden patterns from the input dataset.
Example:
Working of Unsupervised Learning:

Here, we have taken unlabeled input data, which means it is not categorized and the corresponding
outputs are not given. This unlabeled input data is fed to the machine learning model in order
to train it. First, the model interprets the raw data to find the hidden patterns in it, and then
it applies a suitable algorithm such as k-means clustering or hierarchical clustering.
Once the suitable algorithm is applied, it divides the data objects into groups according to
the similarities and differences between the objects.
2.2.1 TYPES OF UNSUPERVISED MACHINE LEARNING
Unsupervised Learning can be further classified into two types, which are given below:
1. Clustering
2. Association

2.2.1.1 CLUSTERING

Clustering in unsupervised machine learning is the process of grouping unlabeled data into clusters
based on their similarities. The goal of clustering is to identify patterns and relationships in the data
without any prior knowledge of the data’s meaning.
Cluster analysis finds the commonalities between the data objects and categorizes them as per the
presence and absence of those commonalities.

Some common clustering algorithms

1) K-means Clustering: Partitioning Data into K Clusters


2) Hierarchical Clustering: Building a Hierarchical Structure of Clusters
3) Density-Based Clustering (DBSCAN): Identifying Clusters Based on Density
4) Mean-Shift Clustering: Finding Clusters Based on Mode Seeking
5) Spectral Clustering: Utilizing Spectral Graph Theory for Clustering
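To make the first algorithm in the list concrete, here is a minimal k-means sketch in plain Python. The points, the choice of two clusters, the fixed initial centroids, and the fixed number of iterations are all assumptions made for the illustration; library implementations choose initial centroids and stopping rules more carefully.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its cluster."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty
        centroids = [mean_point(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical unlabeled data with two visible groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```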

2.2.1.2 ASSOCIATION
Association rule learning is an unsupervised learning technique, which finds interesting relations
among variables within a large dataset. The main aim of this learning algorithm is to find the
dependency of one data item on another data item and map those variables accordingly so that it can
generate maximum profit. This algorithm is mainly applied in Market Basket analysis, Web usage
mining, continuous production, etc.
For example, shopping stores use algorithms based on this technique to find the relationship between
the sales of one product and the sales of another, based on customer behavior. If a customer buys
milk, then he may also buy bread, eggs, or butter. Once trained well, such models can be used to
increase sales by planning different offers.
Some common association rule algorithms

1) Apriori Algorithm: A Classic Method for Rule Induction


2) FP-Growth Algorithm: An Efficient Alternative to Apriori
3) Eclat Algorithm: Exploiting a Vertical Data Format for Efficient Itemset Mining
4) Efficient Tree-based Algorithms: Handling Large Datasets with Scalability
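The milk and bread example can be made concrete with the two basic measures used in association rule mining, support and confidence. The sketch below computes them by simple counting; the five transactions are invented for the illustration, and this is not an implementation of Apriori or any other algorithm listed above.

```python
# Hypothetical market-basket transactions
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"milk", "bread", "eggs"},
]

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent,
    the fraction that also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)

print("support({milk, bread}) =", support({"milk", "bread"}))
print("confidence(milk -> bread) =", confidence({"milk"}, {"bread"}))
```

A rule such as milk -> bread is usually kept only if both values exceed thresholds chosen by the analyst.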
Advantages of Unsupervised Learning:
 These algorithms can be used for complicated tasks compared to the supervised ones because
these algorithms work on the unlabeled dataset.


 Unsupervised algorithms are preferable for various tasks as getting the unlabeled dataset is
easier as compared to the labelled dataset.

Disadvantages of Unsupervised Learning:


 The output of an unsupervised algorithm can be less accurate as the dataset is not labelled,
and the algorithms are not trained with the exact output in advance.
 Working with unsupervised learning is more difficult as it works with an unlabelled dataset
that has no corresponding outputs.

2.2.2 APPLICATIONS OF UNSUPERVISED LEARNING


1. Network Analysis: Unsupervised learning is used in document network analysis of text data,
for example to identify plagiarism and copyright violations in scholarly articles.
2. Recommendation Systems: Recommendation systems widely use unsupervised learning
techniques for building recommendation applications for different web applications and e-commerce
websites.
3. Anomaly Detection: Anomaly detection is a popular application of unsupervised learning, which
can identify unusual data points within the dataset. It is used to discover fraudulent transactions.
4. Singular Value Decomposition: Singular Value Decomposition or SVD is used to extract
particular information from the database. For example, extracting information of each user located at
a particular location.

2.3 SEMI-SUPERVISED LEARNING


Semi-Supervised learning is a type of Machine Learning algorithm that represents the intermediate
ground between Supervised and Unsupervised learning algorithms. It uses the combination of
labeled and unlabeled datasets during the training period.
To overcome the drawbacks of supervised learning and unsupervised learning algorithms, the
concept of Semi-supervised learning is introduced. The main aim of semi-supervised learning is
to effectively use all the available data, rather than only labelled data like in supervised learning.
Initially, similar data is clustered with an unsupervised learning algorithm, and the clusters then help
to assign labels to the unlabeled data. This matters because labelled data is comparatively more
expensive to acquire than unlabeled data.
2.3.1 Assumptions followed by Semi-Supervised Learning:
To work with the unlabeled dataset, there must be a relationship between the objects. To understand
this, semi-supervised learning uses any of the following assumptions:
Continuity Assumption:
As per the continuity assumption, the objects near each other tend to share the same group or label.
This assumption is also used in supervised learning, where the datasets are separated by decision
boundaries. In semi-supervised learning, the smoothness assumption is combined with a preference for
decision boundaries that lie in low-density regions.
Cluster assumptions
In this assumption, data are divided into different discrete clusters. Further, the points in the same
cluster share the output label.


Manifold assumptions
Under this assumption, the data lie approximately on a manifold of fewer dimensions than the input
space, which makes it possible to use distances and densities defined on that manifold.
High-dimensional data are often generated by a process that has fewer degrees of freedom and may be
hard to model directly. (This assumption becomes practical when the input dimension is high.)
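One simple way to see how a few labels can be spread over unlabeled data is nearest-neighbour pseudo-labelling, which relies on the continuity assumption: each unlabeled point receives the label of the closest labelled point. The sketch below is an assumption-laden illustration with invented points and labels, not the specific procedure of any particular semi-supervised algorithm.

```python
# Hypothetical data: two labelled points and four unlabeled ones
labelled = [((1.0, 1.0), "A"), ((9.0, 9.0), "B")]
unlabelled = [(1.5, 0.5), (2.0, 1.5), (8.0, 9.5), (9.5, 8.5)]

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

# Continuity assumption: nearby points tend to share a label,
# so copy the label of the nearest labelled point.
pseudo_labelled = []
for point in unlabelled:
    nearest = min(labelled, key=lambda item: sq_dist(item[0], point))
    pseudo_labelled.append((point, nearest[1]))

# The enlarged set can then be used to train a supervised model
training_set = labelled + pseudo_labelled
print(pseudo_labelled)
```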
2.3.2 Applications of Semi-supervised Learning:
1. Speech Analysis
It is the most classic example of semi-supervised learning applications. Since labeling audio data
is a laborious task that requires many human resources, this problem can be naturally
eased by applying a semi-supervised learning (SSL) model.
2. Web content classification
It is impractical to label each page on the internet because doing so needs too much
human intervention. Still, this problem can be reduced through semi-supervised learning algorithms.
Further, Google also uses semi-supervised learning algorithms to rank a webpage for a given query.
3. Protein sequence classification
DNA strands are large, so labeling them requires active human intervention. As a result,
semi-supervised models have become prominent in this field.
4. Text document classifier
It is often infeasible to obtain a large amount of labeled text data, so
semi-supervised learning is a good fit for this task.
3. OUTLIER
Outliers in machine learning refer to data points that are significantly different from the majority of
the data. These data points can be anomalous, noisy, or errors in measurement.
An outlier is a data point that significantly deviates from the rest of the data. It can be either much
higher or much lower than the other data points, and its presence can have a significant impact on
the results of machine learning algorithms. They can be caused by measurement or execution errors.
The analysis of outlier data is referred to as outlier analysis or outlier mining.
3.1 TYPES OF OUTLIERS
There are two main types of outliers:
1. Global outliers:
Global outliers are isolated data points that are far away from the main body of the data. They are
often easy to identify and remove.
2. Contextual outliers:
Contextual outliers are data points that are unusual in a specific context but may not be outliers in a
different context. They are often more difficult to identify and may require additional information or
domain knowledge to determine their significance.


3.2 OUTLIER DETECTION METHODS IN MACHINE LEARNING


Outlier detection plays a crucial role in ensuring the quality and accuracy of machine learning
models. By identifying and removing or handling outliers effectively, we can prevent them from
biasing the model, reducing its performance, and hindering its interpretability. Here’s an overview of
various outlier detection methods:
1. Statistical Methods:
 Z-Score
 Interquartile Range (IQR)
2. Distance-Based Methods:
 K-Nearest Neighbors (KNN)
 Local Outlier Factor (LOF)
3. Clustering-Based Methods:
 Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
 Hierarchical clustering
4. Other Methods:
 Isolation Forest
 One-class Support Vector Machines (OCSVM)
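The two statistical methods in the list can be sketched with Python's standard statistics module. The data, the z-score threshold of 3, and the 1.5 x IQR fences are assumptions for this illustration (they are common conventions, not fixed rules). Note that in very small samples a z-score threshold of 3 may never be reached, which is why the example uses 21 values.

```python
import statistics

# Hypothetical measurements with one obviously extreme value
data = [10, 12, 11, 13, 12, 11, 10, 12, 11, 13,
        12, 11, 10, 12, 11, 13, 12, 11, 10, 12, 95]

# Z-score: distance from the mean in units of standard deviation
mean = statistics.mean(data)
std = statistics.pstdev(data)
z_outliers = [v for v in data if abs(v - mean) / std > 3]

# IQR: points far outside the middle 50% of the data
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = [v for v in data if v < low or v > high]

print("z-score outliers:", z_outliers)
print("IQR outliers:", iqr_outliers)
```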
3.3 TECHNIQUES FOR HANDLING OUTLIERS IN MACHINE LEARNING
Outliers, data points that significantly deviate from the majority, can have detrimental effects on
machine learning models. To address this, several techniques can be employed to handle outliers
effectively:
1. Removal:
This involves identifying and removing outliers from the dataset before training the model. Common
methods include:
 Thresholding: Outliers are identified as data points exceeding a certain threshold (e.g., Z-score
> 3).
 Distance-based methods: Outliers are identified based on their distance from their nearest
neighbors.
 Clustering: Outliers are identified as points not belonging to any cluster or belonging to very
small clusters.


2. Transformation:
This involves transforming the data to reduce the influence of outliers. Common methods include:
 Scaling: Standardizing or normalizing the data to have a mean of zero and a standard
deviation of one.
 Winsorization: Replacing outlier values with the nearest non-outlier value.
 Log transformation: Applying a logarithmic transformation to compress the data and reduce
the impact of extreme values.
3. Robust Estimation:
This involves using algorithms that are less sensitive to outliers. Some examples include:
 Robust regression: Algorithms like least absolute deviations (L1-loss) regression or Huber regression are less
influenced by outliers than least squares regression.
 M-estimators: These algorithms estimate the model parameters based on a robust objective
function that down-weights the influence of outliers.
 Outlier-insensitive clustering algorithms: Algorithms like DBSCAN are less susceptible to
the presence of outliers than K-means clustering.
4. Modeling Outliers:
This involves explicitly modeling the outliers as a separate group. This can be done by:
 Adding a separate feature: Create a new feature indicating whether a data point is an outlier
or not.
 Using a mixture model: Train a model that assumes the data comes from a mixture of
multiple distributions, where one distribution represents the outliers.
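Two of the transformation techniques above, winsorization and the log transformation, can be sketched as follows. The data are invented, and using the 1.5 x IQR fences to decide which values count as outliers is an assumption of this example; percentile limits are another common choice.

```python
import math
import statistics

# Hypothetical data with one extreme value
data = [10, 12, 11, 13, 12, 11, 10, 12, 11, 13, 95]

# Decide which values are outliers (assumed rule: 1.5 * IQR fences)
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
inliers = [v for v in data if low_fence <= v <= high_fence]

# Winsorization: replace each outlier with the nearest non-outlier value
lowest, highest = min(inliers), max(inliers)
winsorized = [min(max(v, lowest), highest) for v in data]

# Log transformation: compress large values (requires positive data)
logged = [math.log(v) for v in data]

print(winsorized)
print([round(v, 2) for v in logged])
```

Winsorization changes only the extreme value, while the log transformation rescales every value and keeps their order.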
3.4 IMPORTANCE OF OUTLIER DETECTION IN MACHINE LEARNING
Outlier detection is important in machine learning for several reasons:
Biased models: Outliers can bias a machine learning model towards the outlier values, leading to poor
performance on the rest of the data. This can be particularly problematic for algorithms that are
sensitive to outliers, such as linear regression.
Reduced accuracy: Outliers can introduce noise into the data, making it difficult for a machine learning
model to learn the true underlying patterns. This can lead to reduced accuracy and performance.
Increased variance: Outliers can increase the variance of a machine learning model, making it more
sensitive to small changes in the data. This can make it difficult to train a stable and reliable model.
Reduced interpretability: Outliers can make it difficult to understand what a machine learning model
has learned from the data. This can make it difficult to trust the model’s predictions and can hamper
efforts to improve its performance.
3.5 TECHNIQUES FOR OUTLIER ANALYSIS:
1. Visual inspection: using plots to identify outliers
2. Statistical methods: using metrics like mean, median, and standard deviation to detect outliers
3. Machine learning algorithms: using algorithms like One-Class SVM, Local Outlier Factor (LOF),
and Isolation Forest to detect outliers


UNIT IV
DATA VISUALIZATION

Importing Matplotlib – Line plots – Scatter plots – visualizing errors – density and contour plots – Histograms –
legends – colors – subplots – text and annotation – customization – three dimensional plotting - Geographic Data
with Basemap - Visualization with Seaborn.

1. SIMPLE LINE PLOTS


The simplest of all plots is the visualization of a single function y = f(x). Here we will take a first look at creating
a simple plot of this type.
The figure (an instance of the class plt.Figure) can be thought of as a single container that contains all the
objects representing axes, graphics, text, and labels.
The axes (an instance of the class plt.Axes) is what we see above: a bounding box with ticks and labels, which
will eventually contain the plot elements that make up our visualization.
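A minimal sketch of creating a figure and axes and drawing a first line plot is shown here. The choice of the sine function and of 1000 sample points is an assumption for the illustration; the Agg backend line is included only so the script also runs on machines without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no display needed
import matplotlib.pyplot as plt
import numpy as np

fig = plt.figure()   # the container for everything in the plot
ax = plt.axes()      # the bounding box with ticks and labels

x = np.linspace(0, 10, 1000)   # 1000 evenly spaced x values
lines = ax.plot(x, np.sin(x))  # draw y = sin(x) on the axes

print(type(fig).__name__, type(ax).__name__, len(lines))
```

In an interactive session or notebook, plt.show() would display the figure.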

Line Colors and Styles


 The first adjustment you might wish to make to a plot is to control the line colors and styles.
 To adjust the color, you can use the color keyword, which accepts a string argument representing
virtually any imaginable color. The color can be specified in a variety of ways
 If no color is specified, Matplotlib will automatically cycle through a set of default colors for multiple lines

Different forms of color representation.


specify color by name - color='blue'
short color code (rgbcmyk) - color='g'
Grayscale between 0 and 1 - color='0.75'
Hex code (RRGGBB from 00 to FF) - color='#FFDD44'
RGB tuple, values 0 to 1 - color=(1.0,0.2,0.3)
all HTML color names supported - color='chartreuse'

 We can adjust the line style using the linestyle keyword.


Different line styles
linestyle='solid'
linestyle='dashed'
linestyle='dashdot'
linestyle='dotted'
Short assignment
linestyle='-' # solid
linestyle='--' # dashed
linestyle='-.' # dashdot
linestyle=':' # dotted
 linestyle and color codes can be combined into a single nonkeyword argument to the plt.plot()
function
plt.plot(x, x + 0, '-g') # solid green
plt.plot(x, x + 1, '--c') # dashed cyan
plt.plot(x, x + 2, '-.k') # dashdot black
plt.plot(x, x + 3, ':r'); # dotted red
Axes Limits
 The most basic way to adjust axis limits is to use the plt.xlim() and plt.ylim()
methods
Example
plt.xlim(10, 0) # giving the limits in reverse order flips the axis
plt.ylim(1.2, -1.2);
 The plt.axis() method allows you to set the x and y limits with a single call, by passing a list
that specifies [xmin, xmax, ymin, ymax]
plt.axis([-1, 11, -1.5, 1.5]);
 Aspect ratio equal is used so that one unit in x is equal to one unit in y.
plt.axis('equal')

Labeling Plots
The labeling of plots includes titles, axis labels, and simple legends.
Title - plt.title()
Label - plt.xlabel(), plt.ylabel()
Legend - plt.legend()

Example programs
Line color:

import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure()
ax = plt.axes()
x = np.linspace(0, 10, 1000)
ax.plot(x, np.sin(x));
plt.plot(x, np.sin(x - 0), color='blue') # specify color by name
plt.plot(x, np.sin(x - 1), color='g') # short color code (rgbcmyk)
plt.plot(x, np.sin(x - 2), color='0.75') # Grayscale between 0 and 1
plt.plot(x, np.sin(x - 3), color='#FFDD44') # Hex code (RRGGBB from 00 to FF)
plt.plot(x, np.sin(x - 4), color=(1.0,0.2,0.3)) # RGB tuple, values 0 to 1
plt.plot(x, np.sin(x - 5), color='chartreuse'); # all HTML color names supported

OUTPUT:

Line style:
import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure()
ax = plt.axes()
x = np.linspace(0, 10, 1000)
plt.plot(x, x + 0, linestyle='solid')
plt.plot(x, x + 1, linestyle='dashed')
plt.plot(x, x + 2, linestyle='dashdot')
plt.plot(x, x + 3, linestyle='dotted');
# For short, you can use the following codes:
plt.plot(x, x + 4, linestyle='-') # solid
plt.plot(x, x + 5, linestyle='--') # dashed
plt.plot(x, x + 6, linestyle='-.') # dashdot
plt.plot(x, x + 7, linestyle=':'); # dotted

OUTPUT:

Axis limit with label and legend:

import matplotlib.pyplot as plt
import numpy as np
fig = plt.figure()
ax = plt.axes()
x = np.linspace(0, 10, 1000)
plt.xlim(-1, 11)
plt.ylim(-1.5, 1.5);
plt.plot(x, np.sin(x), '-g', label='sin(x)')
plt.plot(x, np.cos(x), ':b', label='cos(x)')
plt.title("A Sine Curve")
plt.xlabel("x")
plt.ylabel("sin(x)");
plt.legend();

OUTPUT:


2. SIMPLE SCATTER PLOTS


Another commonly used plot type is the simple scatter plot, a close cousin of the line plot. Instead of points
being joined by line segments, here the points are represented individually with a dot, circle, or other shape.
Syntax
plt.plot(x, y, 'type of symbol ', color);
Example
plt.plot(x, y, 'o', color='black');
The third argument in the function call is a character that represents the type of symbol used for the
plotting. Just as you can specify options such as '-' and '--' to control the line style, the marker style has its
own set of short string codes.

Example
 Various symbols used to specify ['o', '.', ',', 'x', '+', 'v', '^', '<', '>', 's', 'd']
 Shorthand assignment of line, symbol and color is also allowed.
plt.plot(x, y, '-ok');
 Additional arguments in plt.plot()
We can specify some other parameters related with scatter plot which makes it more attractive. They are
color, marker size, linewidth, marker face color, marker edge color, marker edge width, etc

Example
plt.plot(x, y, '-p', color='gray',
         markersize=15, linewidth=4,
         markerfacecolor='white',
         markeredgecolor='gray',
         markeredgewidth=2)
plt.ylim(-1.2, 1.2);

Scatter Plots with plt.scatter


 A second, more powerful method of creating scatter plots is the plt.scatter function, which can be
used very similarly to the plt.plot function
plt.scatter(x, y, marker='o');
 The primary difference of plt.scatter from plt.plot is that it can be used to create scatter plots
where the properties of each individual point (size, face color, edge color, etc.) can be individually
controlled or mapped to data.
 Notice that the color argument is automatically mapped to a color scale (shown here by the
colorbar() command), and the size argument is given in points squared (the marker area).
 Cmap – color map used in scatter plot gives different color combinations.

Perceptually Uniform Sequential

['viridis', 'plasma', 'inferno', 'magma']


Sequential
['Greys','Purples','Blues','Greens','Oranges','Reds','YlOrBr','YlOrRd',
'OrRd','PuRd','RdPu','BuPu','GnBu','PuBu','YlGnBu','PuBuGn','BuGn', 'YlGn']
Sequential (2)
['binary', 'gist_yarg', 'gist_gray', 'gray', 'bone', 'pink', 'spring', 'summer',
'autumn','winter','cool','Wistia','hot','afmhot','gist_heat','copper']
Diverging
['PiYG','PRGn','BrBG','PuOr','RdGy','RdBu','RdYlBu','RdYlGn','Spectral', 'coolwarm', 'bwr', 'seismic']
Qualitative
['Pastel1', 'Pastel2', 'Paired', 'Accent', 'Dark2', 'Set1', 'Set2', 'Set3', 'tab10', 'tab20', 'tab20b', 'tab20c']
Miscellaneous
['flag', 'prism', 'ocean', 'gist_earth', 'terrain', 'gist_stern', 'gnuplot',
'gnuplot2', 'CMRmap', 'cubehelix', 'brg', 'hsv', 'gist_rainbow', 'rainbow', 'jet', 'nipy_spectral', 'gist_ncar']


Example programs. Simple scatter plot.


import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 10, 30)
y = np.sin(x)
plt.plot(x, y, 'o', color='black');

Scatter plot with edge color, face color, size, and width of marker. (Scatter plot with line)

import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 10, 20)
y = np.sin(x)
plt.plot(x, y, '-o', color='gray', markersize=15, linewidth=4,
         markerfacecolor='yellow', markeredgecolor='red',
         markeredgewidth=4)
plt.ylim(-1.5, 1.5);

Scatter plot with random colors, size and transparency


import numpy as np
import matplotlib.pyplot as plt
rng = np.random.RandomState(0)
x = rng.randn(100)
y = rng.randn(100)
colors = rng.rand(100)
sizes = 1000 * rng.rand(100)
plt.scatter(x, y, c=colors, s=sizes, alpha=0.3,
            cmap='viridis') # the keyword is cmap, not map
plt.colorbar()

3. VISUALIZING ERRORS
For any scientific measurement, accurate accounting for errors is nearly as important, if not more
important, than accurate reporting of the number itself. For example, imagine that I am using some
astrophysical observations to estimate the Hubble Constant, the local measurement of the expansion rate
of the Universe. In visualization of data and results, showing these errors effectively can make a plot
convey much more complete information.

Types of errors
 Basic Errorbars
 Continuous Errors

Basic Errorbars
A basic errorbar can be created with a single Matplotlib function call.
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid') # named 'seaborn-v0_8-whitegrid' in Matplotlib 3.6 and later
import numpy as np
x = np.linspace(0, 10, 50)
dy = 0.8
y = np.sin(x) + dy * np.random.randn(50)
plt.errorbar(x, y, yerr=dy, fmt='.k');

 Here the fmt is a format code controlling the appearance of lines and points, and has the same
syntax as the shorthand used in plt.plot()
 In addition to these basic options, the errorbar function has many options to fine tune the
outputs. Using these additional options you can easily customize the aesthetics of your errorbar
plot.

plt.errorbar(x, y, yerr=dy, fmt='o', color='black',ecolor='lightgray', elinewidth=3, capsize=0);

Continuous Errors

 In some situations it is desirable to show errorbars on continuous quantities. Though Matplotlib does
not have a built-in convenience routine for this type of application, it’s relatively easy to combine
primitives like plt.plot and plt.fill_between for a useful result.
 Here we’ll perform a simple Gaussian process regression (GPR), using the Scikit-Learn API. This is a
method of fitting a very flexible nonparametric function to data with a continuous measure of the
uncertainty.
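The combination of plt.plot and plt.fill_between mentioned above can be sketched without the Gaussian process step. In this simplified example the curve is a plain sine function and the uncertainty is an assumed constant band of width dy; a real GPR fit would instead supply a prediction and an error estimate that vary with x.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no display needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 200)
y = np.sin(x)   # stand-in for a fitted curve
dy = 0.3        # assumed constant uncertainty around the curve

fig, ax = plt.subplots()
ax.plot(x, y, '-', color='gray')   # the central estimate
# Shade the region between the lower and upper error bounds
band = ax.fill_between(x, y - dy, y + dy, color='gray', alpha=0.2)
```

The shaded band plays the same role as error bars, but for a continuous quantity.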
4. DENSITY AND CONTOUR PLOTS

Sometimes it is useful to display three-dimensional data in two dimensions using contours or color-coded regions. There are three
Matplotlib functions that can be helpful for this task:
 plt.contour for contour plots,
 plt.contourf for filled contour plots, and
 plt.imshow for showing images.


Visualizing a Three-Dimensional Function


A contour plot can be created with the plt.contour function.
It takes three arguments:
 a grid of x values,
 a grid of y values, and
 a grid of z values.
The x and y values represent positions on the plot,
and the z values will be represented by the contour levels.
The way to prepare such data is to use the np.meshgrid function,
which builds two-dimensional grids from one- dimensional arrays:
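A short check of what np.meshgrid returns may help here. With 50 x values and 40 y values (the sizes are taken from the example that follows), both grids have 40 rows and 50 columns: every row of X repeats the x values, and every column of Y repeats the y values.

```python
import numpy as np

x = np.linspace(0, 5, 50)   # 50 positions along x
y = np.linspace(0, 5, 40)   # 40 positions along y

X, Y = np.meshgrid(x, y)    # two 2-D grids of shape (len(y), len(x))
print(X.shape, Y.shape)

# Any function of the grids is evaluated at every (x, y) pair at once
Z = np.sin(X) ** 10 + np.cos(10 + Y * X) * np.cos(X)
print(Z.shape)
```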

Example
def f(x, y):
    return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)
x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
plt.contour(X, Y, Z, colors='black');

 Notice that by default when a single color is used, negative values are represented by dashed lines, and
positive values by solid lines.
 Alternatively, you can color-code the lines by specifying a colormap with the cmap argument.
 We’ll also specify that we want more lines to be drawn—20 equally spaced intervals within the data range.
plt.contour(X, Y, Z, 20, cmap='RdGy');
 One potential issue with this plot is that it is a bit “splotchy.” That is, the color steps are discrete rather
than continuous, which is not always what is desired.
 You could remedy this by setting the number of contours to a very high number, but this results in a
rather inefficient plot: Matplotlib must render a new polygon for each step in the level.
 A better way to handle this is to use the plt.imshow() function, which interprets a two-dimensional
grid of data as an image.

There are a few potential gotchas with imshow().


 plt.imshow() doesn’t accept an x and y grid, so you must manually specify the extent [xmin,
xmax, ymin, ymax] of the image on the plot.
 plt.imshow() by default follows the standard image array definition where the origin is in the upper
left, not in the lower left as in most contour plots. This must be changed when showing gridded data.
 plt.imshow() will automatically adjust the axis aspect ratio to match the input data; you can change
this by setting, for example, plt.axis('image') to make x and y units match.

Finally, it can sometimes be useful to combine contour plots and image plots. We’ll use a partially
transparent background image (with transparency set via the alpha parameter) and over-plot contours
with labels on the contours themselves (using the plt.clabel() function):

contours = plt.contour(X, Y, Z, 3, colors='black')


plt.clabel(contours, inline=True, fontsize=8)
plt.imshow(Z, extent=[0, 5, 0, 5], origin='lower',
cmap='RdGy', alpha=0.5)
plt.colorbar();

Example Program

import numpy as np
import matplotlib.pyplot as plt
def f(x, y):
    return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)
x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
# extent matches the data range of x and y (0 to 5)
plt.imshow(Z, extent=[0, 5, 0, 5], origin='lower', cmap='RdGy')
plt.colorbar()

5. HISTOGRAMS
A histogram is a simple plot for summarizing a large data set. It is a graph showing a
frequency distribution: the number of observations that fall within each given interval.

5.1 Parameters
 plt.hist( ) is used to plot histogram. The hist() function will use an array of numbers to create a
histogram, the array is sent into the function as an argument.
 bins - A histogram displays numerical data by grouping data into "bins" of equal width. Each bin is
plotted as a bar whose height corresponds to how many data points are in that bin. Bins are also
sometimes called "intervals", "classes", or "buckets".
 normed - If set, the histogram is normalized so that the bar areas sum to 1, which turns the counts
into a probability density. Newer Matplotlib versions use the density parameter instead of normed.
 x - (n,) array or sequence of (n,) arrays Input values, this takes either a single array or a sequence
of arrays which are not required to be of the same length.
 histtype - {'bar', 'barstacked', 'step', 'stepfilled'}, optional The type of histogram to draw.

 'bar' is a traditional bar-type histogram. If multiple data are given the bars are arranged
side by side.
 'barstacked' is a bar-type histogram where multiple data are stacked on top of each other.
 'step' generates a lineplot that is by default unfilled.
 'stepfilled' generates a lineplot that is by default filled. Default is 'bar'

 align - {'left', 'mid', 'right'}, optional Controls how the histogram is plotted.
 'left': bars are centered on the left bin edges.
 'mid': bars are centered between the bin edges.
 'right': bars are centered on the right bin edges. Default is 'mid'
 orientation - {'horizontal', 'vertical'}, optional
If 'horizontal', barh will be used for bar-type histograms and the bottom kwarg will be the left
edges.
 color - color or array_like of colors or None, optional
Color spec or sequence of color specs, one per dataset. Default (None) uses the standard line color
sequence. Default is None
 label - str or None, optional. Default is None
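The effect of the bins and normalization parameters can be checked numerically with np.histogram, which performs the same binning that plt.hist draws. The sample of 1000 normally distributed values and the choice of 30 bins are assumptions for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=1000)   # hypothetical sample

# Raw counts: how many observations fall in each of the 30 bins
counts, edges = np.histogram(data, bins=30)

# Normalized version: bar heights are scaled so the total area is 1
density, _ = np.histogram(data, bins=30, density=True)
widths = np.diff(edges)
area = np.sum(density * widths)

print(counts.sum(), round(float(area), 6))
```

The counts add up to the number of observations, while the normalized heights multiplied by the bin widths add up to 1.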

5.2 Other parameter


**kwargs - Patch properties. The ** syntax allows us to
pass a variable number of keyword arguments
to a Python function; here they set properties of the
patches that make up the histogram.

Example
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-white') # named 'seaborn-v0_8-white' in Matplotlib 3.6 and later
data = np.random.randn(1000)
plt.hist(data);

The hist() function has many options to tune both the calculation and the display; here’s an example of
a more customized histogram.

plt.hist(data, bins=30, alpha=0.5,histtype='stepfilled', color='steelblue',edgecolor='none');

The plt.hist docstring has more information on other customization options available. I find this
combination of histtype='stepfilled' along with some transparency alpha to be very useful when comparing
histograms of several distributions

x1 = np.random.normal(0, 0.8, 1000)


x2 = np.random.normal(-2, 1, 1000)
x3 = np.random.normal(3, 2, 1000)
kwargs = dict(histtype='stepfilled', alpha=0.3, bins=40)
plt.hist(x1, **kwargs)
plt.hist(x2, **kwargs)
plt.hist(x3, **kwargs);

OUTPUT:

5.3 Two-Dimensional Histograms and Binnings


 We can create histograms in two dimensions by dividing points among two dimensional bins.
 We first define x and y values. For example, we’ll start by defining some data: an x and y
array drawn from a multivariate Gaussian distribution.
 Simple way to plot a two-dimensional histogram is to use Matplotlib’s plt.hist2d() function

Example
mean = [0, 0]
cov = [[1, 1], [1, 2]]
x, y = np.random.multivariate_normal(mean, cov, 1000).T
plt.hist2d(x, y, bins=30, cmap='Blues')
cb = plt.colorbar()
cb.set_label('counts in bin')

OUTPUT:


6. LEGENDS

Plot legends give meaning to a visualization, assigning labels to the various plot elements. We previously
saw how to create a simple legend; here we’ll take a look at customizing the placement and aesthetics of the
legend in Matplotlib.

plt.plot(x, np.sin(x), '-b', label='Sine')
plt.plot(x, np.cos(x), '--r', label='Cosine')
plt.legend();

6.1 Customizing Plot Legends

Location and turn off the frame


We can specify the location and turn off the frame with the loc and frameon parameters.
ax.legend(loc='upper left', frameon=False)
fig

Number of columns

We can use the ncol parameter to specify the number of columns in the legend.
ax.legend(frameon=False, loc='lower center', ncol=2)
fig

Rounded box, shadow and frame transparency


We can use a rounded box (fancybox) or add a shadow, change the transparency (alpha value) of the frame,
or change the padding around the text.
ax.legend(fancybox=True, framealpha=1, shadow=True, borderpad=1)
fig

6.2 Choosing Elements for the Legend


 The legend includes all labeled elements by default. We can change which elements and labels
appear in the legend by using the objects returned by plot commands.
 The plt.plot() command is able to create multiple lines at once, and returns a list of created line
instances.
 Passing any of these to plt.legend() will tell it which to identify, along with the labels we’d like to
specify:

y = np.sin(x[:, np.newaxis] + np.pi * np.arange(0, 2, 0.5))
lines = plt.plot(x, y)
# lines is a list of plt.Line2D instances; label only the first two
plt.legend(lines[:2], ['first', 'second']);

# Applying labels individually
plt.plot(x, y[:, 0], label='first')
plt.plot(x, y[:, 1], label='second')
plt.plot(x, y[:, 2:])
plt.legend(framealpha=1, frameon=True);

OUTPUT:

6.3 Multiple legends


With the standard legend interface it is only possible to create a single legend for the entire plot. If you
try to create a second legend using plt.legend() or ax.legend(), it will simply override the first one. We can
work around this by creating a new legend artist from scratch, and then using the lower-level ax.add_artist()
method to manually add the second artist to the plot.

Example
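import matplotlib.pyplot as plt
plt.style.use('classic')
import numpy as np
x = np.linspace(0, 10, 1000)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), '-b', label='Sine')
ax.plot(x, np.cos(x), '--r', label='Cosine')
ax.legend(loc='lower center', frameon=True, shadow=True, borderpad=1, fancybox=True)
fig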
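The workaround described in this section (a second legend added with ax.add_artist()) might be sketched as follows. The line styles, labels, and the Agg backend line are assumptions made for this illustration.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.legend import Legend
import numpy as np

fig, ax = plt.subplots()
x = np.linspace(0, 10, 1000)

# draw four lines and keep the Line2D objects that ax.plot returns
lines = []
for i, style in enumerate(['-', '--', '-.', ':']):
    lines += ax.plot(x, np.sin(x - i * np.pi / 2), style, color='black')

# first legend: the usual interface
ax.legend(lines[:2], ['line A', 'line B'], loc='upper right', frameon=False)

# second legend: build the Legend artist by hand and add it to the axes
leg = Legend(ax, lines[2:], ['line C', 'line D'], loc='lower right', frameon=False)
ax.add_artist(leg)
```

Because the second legend is added as a plain artist, it does not replace the one created by ax.legend().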

7. COLOR BARS

In Matplotlib, a color bar is a separate axes that can provide a key for the meaning of colors in a plot.
For continuous labels based on the color of points, lines, or regions, a labeled color bar can be a great tool.

The simplest colorbar can be created with the plt.colorbar() function.

Customizing Colorbars

Choosing the colormap

We can specify the colormap using the cmap argument to the plotting function that is creating the
visualization. Broadly, there are three categories of colormaps:
 Sequential colormaps - These consist of one continuous sequence of colors (e.g., binary or viridis).
 Divergent colormaps - These usually contain two distinct colors, which show positive and
negative deviations from a mean (e.g., RdBu or PuOr).
 Qualitative colormaps - These mix colors with no particular sequence (e.g., rainbow or jet).
Color limits and extensions
 Matplotlib allows for a large range of colorbar customization. The colorbar itself is simply an instance
of plt.Axes, so all of the axes and tick formatting tricks we’ve learned are applicable.
 We can narrow the color limits and indicate the out-of-bounds values with a triangular arrow at the
top and bottom by setting the extend property.
plt.subplot(1, 2, 2)
plt.imshow(I, cmap='RdBu')  # I is a 2D array of image data defined beforehand
plt.colorbar(extend='both')
plt.clim(-1, 1);

OUTPUT:

Discrete colorbars
Colormaps are by default continuous, but sometimes you’d like to represent discrete values. The easiest way
to do this is to pass the name of a suitable colormap, along with the number of desired bins, to the
plt.get_cmap() function (older code uses plt.cm.get_cmap(), which recent Matplotlib releases have removed).
plt.imshow(I, cmap=plt.get_cmap('Blues', 6))
plt.colorbar()
plt.clim(-1, 1);
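A self-contained sketch of a discrete colorbar is given below. The image array I is hypothetical data invented for the example, and the Agg backend line is only there so that it runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 200)
I = np.sin(x) * np.cos(x[:, np.newaxis])  # hypothetical image data in [-1, 1]

cmap = plt.get_cmap('Blues', 6)           # 'Blues' resampled to 6 discrete colors
im = plt.imshow(I, cmap=cmap)
cb = plt.colorbar(extend='both')          # arrows mark out-of-range values
plt.clim(-1, 1)                           # color limits of the current image
```

The number passed to plt.get_cmap() controls how many distinct color bands appear in the colorbar.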

8. SUBPLOTS

 Matplotlib has the concept of subplots: groups of smaller axes that can exist together within a single
figure.
 These subplots might be insets, grids of plots, or other more complicated layouts.
 We’ll explore four routines for creating subplots in Matplotlib.
 plt.axes: Subplots by Hand
 plt.subplot: Simple Grids of Subplots
 plt.subplots: The Whole Grid in One Go
 plt.GridSpec: More Complicated Arrangements

8.1 plt.axes: Subplots by Hand

 The most basic method of creating an axes is to use the plt.axes function. As we’ve seen previously,
by default this creates a standard axes object that fills the entire figure.
 plt.axes also takes an optional argument that is a list of four numbers in the figure coordinate system.
 These numbers represent [left, bottom, width, height] in the figure coordinate system, which ranges
from 0 at the bottom left of the figure to 1 at the top right of the figure.


For example, we might create an inset axes at the top-right corner of another axes by setting the x and y
position to 0.65 (that is, starting at 65% of the width and 65% of the height of the figure) and the x and y
extents to 0.2 (that is, the size of the axes is 20% of the width and 20% of the height of the figure).

import matplotlib.pyplot as plt
import numpy as np
ax1 = plt.axes()  # standard axes
ax2 = plt.axes([0.65, 0.65, 0.2, 0.2])

OUTPUT:

8.2 Vertical Subplots


The equivalent of the plt.axes() command within the object-oriented interface is fig.add_axes(). Let’s use
this to create two vertically stacked axes.
fig = plt.figure()
ax1 = fig.add_axes([0.1, 0.5, 0.8, 0.4], xticklabels=[], ylim=(-1.2, 1.2))
ax2 = fig.add_axes([0.1, 0.1, 0.8, 0.4], ylim=(-1.2, 1.2))
x = np.linspace(0, 10)
ax1.plot(np.sin(x))
ax2.plot(np.cos(x));

OUTPUT:

 We now have two axes (the top with no tick labels) that are just touching: the bottom of the upper
panel (at position 0.5) matches the top of the lower panel (at position 0.1+ 0.4).
 If the position of the second axes is changed, the two plots become separated from each other. For
example, lowering its bottom edge leaves a gap between the panels:
ax2 = fig.add_axes([0.1, 0.01, 0.8, 0.4])


8.3 plt.subplot: Simple Grids of Subplots


 Matplotlib has several convenience routines to align columns or rows of subplots.
 The lowest level of these is plt.subplot(), which creates a single subplot within a grid.
 This command takes three integer arguments—the number of rows, the number of columns, and
the index of the plot to be created in this scheme, which runs from the upper left to the bottom right

for i in range(1, 7):
    plt.subplot(2, 3, i)
    plt.text(0.5, 0.5, str((2, 3, i)), fontsize=18, ha='center')

OUTPUT:

8.4 plt.subplots: The Whole Grid in One Go

 The approach just described can become quite tedious when you’re creating a large grid of subplots,
especially if you’d like to hide the x- and y-axis labels on the inner plots.
 For this purpose, plt.subplots() is the easier tool to use (note the s at the end of subplots).
 Rather than creating a single subplot, this function creates a full grid of subplots in a single line,
returning them in a NumPy array.
 The arguments are the number of rows and number of columns, along with optional keywords
sharex and sharey, which allow you to specify the relationships between different axes.
 Here we’ll create a 2×3 grid of subplots, where all axes in the same row share their y-axis scale,
and all axes in the same column share their x-axis scale:

fig, ax = plt.subplots(2, 3, sharex='col', sharey='row')

Note that by specifying sharex and sharey, we’ve automatically removed inner labels on the grid to
make the plot cleaner.
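Since the axes come back in a two-dimensional NumPy array, an individual subplot can be reached with ordinary [row, col] indexing. A short sketch (the Agg backend line is an assumption so that it runs without a display):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots(2, 3, sharex='col', sharey='row')

# ax is a 2 x 3 array of axes, indexed by [row, col]
for i in range(2):
    for j in range(3):
        ax[i, j].text(0.5, 0.5, str((i, j)), fontsize=18, ha='center')
```

Note that, unlike plt.subplot(), this indexing is zero-based, following Python conventions.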


8.5 plt.GridSpec: More Complicated Arrangements

To go beyond a regular grid to subplots that span multiple rows and columns, plt.GridSpec() is the best tool.
The plt.GridSpec() object does not create a plot by itself; it is simply a convenient interface that is
recognized by the plt.subplot() command.

For example, a gridspec for a grid of two rows and three columns with some specified width and
height space looks like this:

grid = plt.GridSpec(2, 3, wspace=0.4, hspace=0.3)


From this we can specify subplot locations and extents:
plt.subplot(grid[0, 0])
plt.subplot(grid[0, 1:])
plt.subplot(grid[1, :2])
plt.subplot(grid[1, 2]);

OUTPUT:

9. TEXT AND ANNOTATION


The most basic types of annotation we will use are axes labels and titles; here we look at some further
options for adding text and annotations to a visualization.
 Text annotation can be done manually with the plt.text/ax.text command, which will place text at a
particular x/y value.
 The ax.text method takes an x position, a y position, a string, and then optional keywords
specifying the color, size, style, alignment, and other properties of the text. For example, ha='right'
and ha='center' set the alignment, where ha is short for horizontal alignment.

Transforms and Text Position


 We anchored our text annotations to data locations. Sometimes it’s preferable to anchor the text to
a position on the axes or figure, independent of the data. In Matplotlib, we do this by modifying the
transform.
 Any graphics display framework needs some scheme for translating between coordinate systems.
 Mathematically, such coordinate transformations are relatively straightforward, and Matplotlib has
a well-developed set of tools that it uses internally to perform them (the tools can be explored in
the matplotlib.transforms submodule).
 There are three predefined transforms that can be useful in this situation.

o ax.transData - Transform associated with data coordinates
o ax.transAxes - Transform associated with the axes (in units of axes dimensions)
o fig.transFigure - Transform associated with the figure (in units of figure dimensions)


Example
import matplotlib.pyplot as plt
import matplotlib as mpl
plt.style.use('seaborn-whitegrid')  # named 'seaborn-v0_8-whitegrid' in Matplotlib 3.6 and later
import numpy as np
import pandas as pd
fig, ax = plt.subplots(facecolor='lightgray')
ax.axis([0, 10, 0, 10])
# transform=ax.transData is the default, but we'll specify it anyway
ax.text(1, 5, ". Data: (1, 5)", transform=ax.transData)
ax.text(0.5, 0.1, ". Axes: (0.5, 0.1)", transform=ax.transAxes)
ax.text(0.2, 0.2, ". Figure: (0.2, 0.2)", transform=fig.transFigure);

OUTPUT:

Note that by default the text is left-aligned and sits just above the specified coordinates; here the “.” at
the beginning of each string will approximately mark the given coordinate location.

The transData coordinates give the usual data coordinates associated with the x- and y-axis labels. The
transAxes coordinates give the location from the bottom-left corner of the axes (here the white box) as a
fraction of the axes size.

The transFigure coordinates are similar, but specify the position from the bottom left of the figure (here
the gray box) as a fraction of the figure size.

Notice now that if we change the axes limits, it is only the transData coordinates that will be affected,
while the others remain stationary.
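This behaviour can be checked numerically by asking each transform where a point lands in display (pixel) coordinates before and after the limits change. The new limits used here are arbitrary values chosen for the illustration.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
ax.axis([0, 10, 0, 10])

# display positions with the original limits
data_before = ax.transData.transform((1, 5))
axes_before = ax.transAxes.transform((0.5, 0.1))

# change the axes limits (arbitrary new values)
ax.set_xlim(0, 2)
ax.set_ylim(-6, 6)

data_after = ax.transData.transform((1, 5))
axes_after = ax.transAxes.transform((0.5, 0.1))

print(np.allclose(axes_before, axes_after), np.allclose(data_before, data_after))
```

Only the point expressed in data coordinates moves on screen; the point expressed in axes coordinates stays where it was.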

Arrows and Annotation


 Along with tick marks and text, another useful annotation mark is the simple arrow.
 Drawing arrows in Matplotlib is often harder than expected. There is a plt.arrow() function available,
but the arrows it creates are SVG (scalable vector graphics) objects that are subject to the varying
aspect ratio of your plots, so the result is rarely what the user intended.
 The plt.annotate() function is usually the better choice: it creates some text together with an arrow,
and the arrow style is controlled through the arrowprops dictionary, which has numerous options
available.
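A minimal sketch of an arrow annotation is shown below. The curve, the label text, and the coordinates are made up for the example; only annotate() and its xy, xytext, and arrowprops arguments are the point.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
x = np.linspace(0, 20, 1000)
ax.plot(x, np.cos(x))
ax.axis('equal')

# xy is the point being annotated, xytext is where the label is placed
ann = ax.annotate('local maximum', xy=(6.28, 1), xytext=(10, 4),
                  arrowprops=dict(facecolor='black', shrink=0.05))
```

The arrow is drawn from the text position to the annotated point, so it keeps pointing at the right place when the aspect ratio of the plot changes.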


10. THREE-DIMENSIONAL PLOTTING IN MATPLOTLIB

We enable three-dimensional plots by importing the mplot3d toolkit, included with the main Matplotlib
installation.

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
fig = plt.figure()
ax = plt.axes(projection='3d')

With this 3D axes enabled, we can now plot a variety of three-dimensional plot types.

Three-Dimensional Points and Lines


The most basic three-dimensional plot is a line or scatter plot created from sets of (x, y, z) triples.
In analogy with the more common two-dimensional plots discussed earlier, we can create these using the
ax.plot3D and ax.scatter3D functions

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
ax = plt.axes(projection='3d')
# Data for a three-dimensional line
zline = np.linspace(0, 15, 1000)
xline = np.sin(zline)
yline = np.cos(zline)
ax.plot3D(xline, yline, zline, 'gray')
# Data for three-dimensional scattered points
zdata = 15 * np.random.random(100)
xdata = np.sin(zdata) + 0.1 * np.random.randn(100)
ydata = np.cos(zdata) + 0.1 * np.random.randn(100)
ax.scatter3D(xdata, ydata, zdata, c=zdata, cmap='Greens');
plt.show()

Notice that by default, the scatter points have their transparency adjusted to give a sense of depth on the
page.

Three-Dimensional Contour Plots


 mplot3d contains tools to create three-dimensional relief plots using the same inputs as
two-dimensional contour plots.
 Like two-dimensional ax.contour plots, ax.contour3D requires all the input data to be in the form
of two-dimensional regular grids, with the Z data evaluated at each point.
 Here we’ll show a three-dimensional contour diagram of a three-dimensional sinusoidal function:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
def f(x, y):
    return np.sin(np.sqrt(x ** 2 + y ** 2))
x = np.linspace(-6, 6, 30)
y = np.linspace(-6, 6, 30)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.contour3D(X, Y, Z, 50, cmap='binary')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')
plt.show()

Sometimes the default viewing angle is not optimal, in which case we can use the view_init method to set
the elevation and azimuthal angles.

ax.view_init(60, 35)
fig

Wire frames and Surface Plots

 Two other types of three-dimensional plots that work on gridded data are wireframes and surface plots.
 These take a grid of values and project it onto the specified three-dimensional surface, and can
make the resulting three-dimensional forms quite easy to visualize.

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.plot_wireframe(X, Y, Z, color='black')
ax.set_title('wireframe');
plt.show()

 A surface plot is like a wireframe plot, but each face of the wireframe is a filled polygon.
 Adding a colormap to the filled polygons can aid perception of the topology of the surface
being visualized

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
ax = plt.axes(projection='3d')
ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap='viridis', edgecolor='none')
ax.set_title('surface')
plt.show()

Surface Triangulations
 For some applications, the evenly sampled grids required by the preceding routines are overly
restrictive and inconvenient.
 In these situations, the triangulation-based plots can be very useful.

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
theta = 2 * np.pi * np.random.random(1000)
r = 6 * np.random.random(1000)
x = np.ravel(r * np.sin(theta))
y = np.ravel(r * np.cos(theta))
z = f(x, y)  # f is the sinusoidal function defined in the contour example
ax = plt.axes(projection='3d')
ax.scatter(x, y, z, c=z, cmap='viridis', linewidth=0.5)
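A scatter plot of unevenly sampled points shows the shape only loosely. One triangulation-based option is ax.plot_trisurf(), which first finds a set of triangles between neighbouring points and then fills them. A self-contained sketch under the same assumptions as the scatter example (the seeded generator and the Agg backend are additions for reproducibility):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

def f(x, y):
    return np.sin(np.sqrt(x ** 2 + y ** 2))

rng = np.random.default_rng(0)  # seeded generator (assumption, for reproducibility)
theta = 2 * np.pi * rng.random(1000)
r = 6 * rng.random(1000)
x = r * np.sin(theta)
y = r * np.cos(theta)
z = f(x, y)

ax = plt.axes(projection='3d')
# the points do not lie on a regular grid; plot_trisurf triangulates them first
surf = ax.plot_trisurf(x, y, z, cmap='viridis', edgecolor='none')
```

The result is not as clean as a surface drawn from a regular grid, but it accepts the irregular sampling directly.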


Geographic Data with Basemap


 One common type of visualization in data science is that of geographic data.
 Matplotlib’s main tool for this type of visualization is the Basemap toolkit, which is one of
several Matplotlib toolkits that live under the mpl_toolkits namespace.
 Basemap is a useful tool for Python users to have in their virtual toolbelts.
 Basemap must be installed separately from Matplotlib. Once you have the Basemap toolkit installed and
imported, geographic plots also require the PIL package in Python 2, or the pillow package in Python 3.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None, lat_0=50, lon_0=-100)
m.bluemarble(scale=0.5);

 The globe drawn by the code above is not a mere image; it is a fully functioning Matplotlib axes that
understands spherical coordinates and allows us to easily over-plot data on the map.
 We’ll use an etopo image (which shows topographical features both on land and under the ocean)
as the map background.

Program to display a particular area of the map, with a helper that draws latitude and longitude lines:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
from itertools import chain

fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution=None, width=8E6, height=8E6, lat_0=45, lon_0=-100)
m.etopo(scale=0.5, alpha=0.5)

def draw_map(m, scale=0.2):
    # draw a shaded-relief image
    m.shadedrelief(scale=scale)
    # lats and longs are returned as a dictionary
    lats = m.drawparallels(np.linspace(-90, 90, 13))
    lons = m.drawmeridians(np.linspace(-180, 180, 13))
    # values contain the plt.Line2D instances
    lat_lines = chain(*(tup[1][0] for tup in lats.items()))
    lon_lines = chain(*(tup[1][0] for tup in lons.items()))
    all_lines = chain(lat_lines, lon_lines)
    # cycle through these lines and set the desired style
    for line in all_lines:
        line.set(linestyle='-', alpha=0.3, color='r')

10.1 Map Projections

The Basemap package implements several dozen map projections, all referenced by a short format code. Here
we’ll briefly demonstrate some of the more common ones.
 Cylindrical projections
 Pseudo-cylindrical projections
 Perspective projections
 Conic projections

10.2 Cylindrical projection


 The simplest of map projections are cylindrical projections, in which lines of constant latitude
and longitude are mapped to horizontal and vertical lines, respectively.
 This type of mapping represents equatorial regions quite well, but results in extreme distortions
near the poles.
 The spacing of latitude lines varies between different cylindrical projections, leading to
different conservation properties, and different distortion near the poles.

 The example below uses the equidistant cylindrical projection (projection='cyl'). Other cylindrical
projections are the Mercator (projection='merc') and the cylindrical equal-area (projection='cea')
projections.
 The additional arguments to Basemap for this view specify the latitude (lat) and longitude
(lon) of the lower-left corner (llcrnr) and upper-right corner (urcrnr) for the desired map, in
units of degrees.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize=(8, 6), edgecolor='w')
m = Basemap(projection='cyl', resolution=None, llcrnrlat=-90, urcrnrlat=90, llcrnrlon=-180, urcrnrlon=180, )
draw_map(m)
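How the spacing of latitude lines differs between cylindrical projections can be illustrated with the textbook formulas for a unit sphere. This is a sketch of the mathematics only, not of Basemap's internals, and the equal-area variant assumes the standard parallel lies on the equator.

```python
import math

def equidistant_y(lat_deg):
    """Equidistant cylindrical: y is proportional to latitude."""
    return math.radians(lat_deg)

def mercator_y(lat_deg):
    """Mercator: y = ln(tan(pi/4 + lat/2))."""
    lat = math.radians(lat_deg)
    return math.log(math.tan(math.pi / 4 + lat / 2))

def equal_area_y(lat_deg):
    """Cylindrical equal-area: y = sin(lat)."""
    return math.sin(math.radians(lat_deg))

# height on the map of a 10-degree band near the equator and near the pole
for south, north in [(0, 10), (70, 80)]:
    print(south, north,
          round(equidistant_y(north) - equidistant_y(south), 3),
          round(mercator_y(north) - mercator_y(south), 3),
          round(equal_area_y(north) - equal_area_y(south), 3))
```

The equidistant projection gives both bands the same height, Mercator stretches the polar band, and the equal-area projection compresses it.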

10.3 Pseudo-cylindrical projections


 Pseudo-cylindrical projections relax the requirement that meridians (lines of constant longitude)
remain vertical; this can give better properties near the poles of the projection.
 The Mollweide projection (projection='moll') is one common example of this, in which all meridians
are elliptical arcs
 It is constructed so as to preserve area across the map: though there are distortions near the poles,
the area of small patches reflects the true area.
 Other pseudo-cylindrical projections are the sinusoidal (projection='sinu') and Robinson
(projection='robin') projections.
 The extra arguments to Basemap here refer to the central latitude (lat_0) and longitude (lon_0) for
the desired map.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize=(8, 6), edgecolor='w')
m = Basemap(projection='moll', resolution=None, lat_0=0, lon_0=0)
draw_map(m)

10.4 Perspective projections

 Perspective projections are constructed using a particular choice of perspective point, similar to if
you photographed the Earth from a particular point in space (a point which, for some projections,
technically lies within the Earth!).
 One common example is the orthographic projection (projection='ortho'), which shows one side of
the globe as seen from a viewer at a very long distance.
 Thus, it can show only half the globe at a time.
 Other perspective-based projections include the gnomonic projection (projection='gnom') and
stereographic projection (projection='stere').
 These are often the most useful for showing small portions of the map.


import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None, lat_0=50, lon_0=0)
draw_map(m);

10.5 Conic projections


 A conic projection projects the map onto a single cone, which is then unrolled.
 This can lead to very good local properties, but regions far from the focus point of the cone may
become very distorted.
 One example of this is the Lambert conformal conic projection (projection='lcc').
 It projects the map onto a cone arranged in such a way that two standard parallels (specified in
Basemap by lat_1 and lat_2) have well-represented distances, with scale decreasing between them
and increasing outside of them.
 Other useful conic projections are the equidistant conic (projection='eqdc') and the Albers equal-area
(projection='aea') projection
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution=None, lon_0=0, lat_0=50, lat_1=45, lat_2=55,
            width=1.6E7, height=1.2E7)
draw_map(m)

10.6 Drawing a Map Background


The Basemap package contains a range of useful functions for drawing borders of physical features like
continents, oceans, lakes, and rivers, as well as political boundaries such as countries and US states and
counties.
The following are some of the available drawing functions that you may wish to explore using IPython’s
help features:

Physical boundaries and bodies of water


drawcoastlines() - Draw continental coast lines
drawlsmask() - Draw a mask between the land and sea, for use with projecting images on one or the other
drawmapboundary() - Draw the map boundary, including the fill color for oceans
drawrivers() - Draw rivers on the map
fillcontinents() - Fill the continents with a given color; optionally fill lakes with another color

 Political boundaries
drawcountries() - Draw country boundaries
drawstates() - Draw US state boundaries
drawcounties() - Draw US county boundaries

 Map features
drawgreatcircle() - Draw a great circle between two points
drawparallels() - Draw lines of constant latitude
drawmeridians() - Draw lines of constant longitude
drawmapscale() - Draw a linear scale on the map

 Whole-globe images
bluemarble() - Project NASA’s blue marble image onto the map
shadedrelief() - Project a shaded relief image onto the map
etopo() - Draw an etopo relief image onto the map
warpimage() - Project a user-provided image onto the map

Plotting Data on Maps


 One of the most useful pieces of the Basemap toolkit is the ability to over-plot a variety of data onto a
map background.
 There are many map-specific functions available as methods of the Basemap instance. Some
of these map-specific methods are:
contour()/contourf() - Draw contour lines or filled contours
imshow() - Draw an image
pcolor()/pcolormesh() - Draw a pseudocolor plot for irregular/regular meshes
plot() - Draw lines and/or markers
scatter() - Draw points with markers
quiver() - Draw vectors
barbs() - Draw wind barbs
drawgreatcircle() - Draw a great circle

11. VISUALIZATION WITH SEABORN


The main idea of Seaborn is that it provides high-level commands to create a variety of plot types
useful for statistical data exploration, and even some statistical model fitting.

Histograms, KDE, and densities


 Often in statistical data visualization, all you want is to plot histograms and joint distributions of
variables. We have seen that this is relatively straightforward in Matplotlib.
 Rather than a histogram, we can get a smooth estimate of the distribution using a kernel density
estimation, which Seaborn does with sns.kdeplot:
import numpy as np
import pandas as pd
import seaborn as sns
data = np.random.multivariate_normal([0, 0], [[5, 2], [2, 2]], size=2000)
data = pd.DataFrame(data, columns=['x', 'y'])
for col in 'xy':
    sns.kdeplot(data[col], shade=True)  # newer Seaborn versions use fill=True instead of shade=True

 Histograms and KDE can be combined using distplot (deprecated in newer Seaborn releases in favor of
sns.histplot with kde=True):
sns.distplot(data['x'])
sns.distplot(data['y']);
 If we pass the full two-dimensional dataset to kdeplot, we will get a two-
dimensional visualization of the data.
 We can see the joint distribution and the marginal distributions together using sns.jointplot.
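As a rough sketch of what kernel density estimation computes, the estimate is an average of small Gaussian bumps, one centred on each sample. The helper gaussian_kde_1d and the hand-picked bandwidth are assumptions for this illustration; Seaborn chooses the bandwidth automatically and this is not its implementation.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps of width `bandwidth` centred on each sample."""
    z = (grid[:, np.newaxis] - samples[np.newaxis, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

rng = np.random.default_rng(0)           # seeded generator (assumption)
samples = rng.normal(0, 1, 500)
grid = np.linspace(-6, 6, 601)
density = gaussian_kde_1d(samples, grid, bandwidth=0.3)  # bandwidth chosen by hand

# a density estimate should integrate to approximately 1
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

A smaller bandwidth follows the samples more closely and gives a bumpier curve; a larger one gives a smoother curve that can hide detail.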


Pair plots
When you generalize joint plots to datasets of larger dimensions, you end up with pair plots. This is very
useful for exploring correlations between multidimensional data, when you’d like to plot all pairs of values
against each other.

We’ll demo this with the Iris dataset, which lists measurements of petals and sepals of three iris species:
import seaborn as sns
iris = sns.load_dataset("iris")
sns.pairplot(iris, hue='species', height=2.5);  # height was called size before Seaborn 0.9

Faceted histograms
 Sometimes the best way to view data is via histograms of subsets. Seaborn’s FacetGrid
makes this extremely simple.
 We’ll take a look at some data that shows the amount that restaurant staff receive in tips based
on various indicator data


Factor plots

Factor plots (sns.factorplot, renamed sns.catplot in Seaborn 0.9) can be useful for this kind of
visualization as well. They allow you to view the distribution of a parameter within bins defined by any
other parameter.

Joint distributions
Similar to the pair plot we saw earlier, we can use sns.jointplot to show the joint distribution between
different datasets, along with the associated marginal distributions.

Bar plots
Time series can be plotted as bar plots with sns.factorplot (sns.catplot in newer Seaborn versions).


UNIT V
HANDLING LARGE DATA

Problems - techniques for handling large volumes of data - programming tips for dealing with large data sets-
Case studies: Predicting malicious URLs, Building a recommender system - Tools and techniques needed -
Research question - Data preparation - Model building – Presentation and automation.

1. TECHNIQUES FOR HANDLING LARGE VOLUMES OF DATA

Handling large volumes of data requires a combination of techniques to efficiently process, store, and
analyze the data.
Some common techniques include:

1. Distributed computing:
Using frameworks like Apache Hadoop and Apache Spark to distribute data processing tasks across
multiple nodes in a cluster, allowing for parallel processing of large datasets.

2. Data compression:
Compressing data before storage or transmission to reduce the amount of space required and
improve processing speed.

3. Data partitioning:
Dividing large datasets into smaller, more manageable partitions based on certain criteria (e.g.,
range, hash value) to improve processing efficiency.

4. Data deduplication:
Identifying and eliminating duplicate data to reduce storage requirements and improve data
processing efficiency.

5. Database sharding:
Partitioning a database into smaller, more manageable parts called shards, which can be distributed
across multiple servers for improved scalability and performance.

6. Stream processing:
Processing data in real time as it is generated, allowing for immediate analysis and decision-making.

7. In-memory computing:
Storing data in memory instead of on disk to improve processing speed, particularly for frequently
accessed data.

8. Parallel processing:
Using multiple processors or cores to simultaneously execute data processing tasks, improving
processing speed for large datasets.

9. Data indexing:
Creating indexes on data fields to enable faster data retrieval, especially for queries involving large
datasets.

10. Data aggregation:
Combining multiple data points into a single, summarized value to reduce the overall volume of data
while retaining important information.

These techniques can be used individually or in combination to handle large volumes of data
effectively and efficiently.
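Two of the techniques above, hash-based data partitioning and data aggregation, can be sketched on a toy scale with the standard library. The records, the number of partitions, and the helper partition_for are hypothetical; in a real system each partition would live on a different disk or server.

```python
import hashlib
from collections import defaultdict

def partition_for(key, num_partitions):
    """Map a key to a partition index using a stable hash of the key's text."""
    digest = hashlib.sha256(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

# hypothetical records: (user_id, amount)
records = [("alice", 10), ("bob", 5), ("alice", 7), ("carol", 3), ("bob", 1)]

# partitioning: all records with the same key end up in the same partition
partitions = defaultdict(list)
for user_id, amount in records:
    partitions[partition_for(user_id, 3)].append((user_id, amount))

# aggregation: reduce each partition independently, then merge the results
totals = {}
for part in partitions.values():
    for user_id, amount in part:
        totals[user_id] = totals.get(user_id, 0) + amount

print(totals)
```

Because a key always hashes to the same partition, each partition can be aggregated without looking at the others, which is what makes the work easy to distribute.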


2. PROGRAMMING TIPS FOR DEALING WITH LARGE DATA SETS

When dealing with large datasets in programming, it's important to use efficient techniques to
manage memory, optimize processing speed, and avoid common pitfalls. Here are some
programming tips for dealing with large datasets:

1. Use efficient data structures:


Choose data structures that are optimized for the operations you need to perform. For example, use
hash maps for fast lookups, arrays for sequential access, and trees for hierarchical data.

2. Lazy loading:
Use lazy loading techniques to load data into memory only when it is needed, rather than loading the
entire dataset at once. This can help reduce memory usage and improve performance.

3. Batch processing:
Process data in batches rather than all at once, especially for operations like data transformation or
analysis. This can help avoid memory issues and improve processing speed.

4. Use streaming APIs:


Use streaming APIs and libraries to process data in a streaming fashion, which can be more
memory-efficient than loading the entire dataset into memory.

5. Optimize data access:


Use indexes and caching to optimize data access, especially for large datasets. This can help reduce
the time it takes to access and retrieve data.

6. Parallel processing:
Use parallel processing techniques, such as multithreading or multiprocessing, to process the data
concurrently and take advantage of multi-core processors.

7. Use efficient algorithms:


Choose algorithms that are optimized for large datasets, such as sorting algorithms that use divide
and conquer techniques or algorithms that can be parallelized.

8. Optimize I/O operations:


Minimize I/O operations and use buffered I/O where possible to reduce the overhead of reading and
writing data to disk.

9. Monitor memory usage:


Keep an eye on memory usage and optimize your code to minimize memory leaks and excessive
memory consumption.

10. Use external storage solutions:


For extremely large datasets that cannot fit into memory, consider using external storage solutions
such as databases or distributed file systems.
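Tips 2 and 3 (lazy loading and batch processing) can be combined with Python generators, as in the sketch below. The in-memory CSV stands in for a large file on disk, and the helper names read_rows and batches are hypothetical.

```python
import csv
import io
from itertools import islice

def read_rows(file_obj):
    """Lazily yield one parsed row at a time instead of loading the whole file."""
    for row in csv.DictReader(file_obj):
        yield row

def batches(iterable, batch_size):
    """Group a lazy iterable into lists of at most batch_size items."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# hypothetical in-memory CSV standing in for a large file on disk
fake_file = io.StringIO("value\n" + "\n".join(str(i) for i in range(10)))

total = 0
batch_sizes = []
for batch in batches(read_rows(fake_file), batch_size=4):
    batch_sizes.append(len(batch))
    total += sum(int(row["value"]) for row in batch)

print(total, batch_sizes)
```

At any moment only one batch of rows is held in memory, so the same loop works whether the file has ten rows or ten million.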

3. CASE STUDIES

3.1 PREDICTING MALICIOUS URLS


Predicting malicious URLs is a critical task in cybersecurity to protect users from phishing attacks,
malware distribution, and other malicious activities. Machine learning models can be used to classify
URLs as either benign or malicious based on features such as URL length, domain age, presence of
certain keywords, and historical data. Here are two case studies that demonstrate how machine
learning can be used to predict malicious URLs:


1. Google Safe Browsing:


 Google Safe Browsing is a service that helps protect users from malicious websites by
identifying and flagging unsafe URLs.
 The service uses machine learning models to analyze URLs and classify them as safe or unsafe.
 Features used in the model include URL length, domain reputation, presence of suspicious
keywords, and similarity to known malicious URLs.
 The model is continuously trained on new data to improve its accuracy and effectiveness.

2. Microsoft SmartScreen:

 Microsoft SmartScreen is a feature in Microsoft Edge and Internet Explorer browsers that
helps protect users from phishing attacks and malware.
 SmartScreen uses machine learning models to analyze URLs and determine their safety.
 The model looks at features such as domain reputation, presence of phishing keywords, and
similarity to known malicious URLs.
 SmartScreen also leverages data from the Microsoft Defender SmartScreen service to improve
its accuracy and coverage.

In both cases, machine learning is used to predict the likelihood that a given URL is malicious based
on various features and historical data. These models help protect users from online threats and
improve the overall security of the web browsing experience.
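
As a rough illustration of the kind of features mentioned above (URL length, suspicious keywords, and so on), the sketch below extracts a few lexical features from a URL. The function name `url_features`, the keyword list, and the chosen features are assumptions for the example; they are not the features used by Google Safe Browsing or Microsoft SmartScreen.

```python
from urllib.parse import urlparse

# Illustrative keyword list only; a real system would learn or curate this.
SUSPICIOUS_KEYWORDS = ("login", "verify", "update", "secure")

def url_features(url):
    """Turn a URL string into a small dictionary of numeric/boolean features."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "num_dots_in_host": host.count("."),
        # True when the host is a dotted IPv4-style address instead of a name.
        "has_ip_host": host.replace(".", "").isdigit(),
        "uses_https": parsed.scheme == "https",
        "num_suspicious_keywords": sum(k in url.lower() for k in SUSPICIOUS_KEYWORDS),
    }

example_url = "http://192.168.0.1/secure-login/verify.php"
features = url_features(example_url)
print(features)
```

A feature dictionary like this could then be fed to any classifier; the classifier itself is out of scope for this sketch.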

3.2 BUILDING A RECOMMENDER SYSTEM

Building a recommender system involves predicting the "rating" or "preference" that a user would
give to an item. These systems are widely used in e-commerce, social media, and content streaming
platforms to personalize recommendations for users. Here are two case studies that demonstrate
how recommender systems can be built:

1. Netflix Recommendation System:


 Netflix uses a recommendation system to suggest movies and TV shows to its users.
 The system uses collaborative filtering, which involves analyzing user behavior (e.g., viewing
history, ratings) to identify patterns and make recommendations.
 Netflix also incorporates content-based filtering, which considers the characteristics of the
items (e.g., genre, cast, director) to make recommendations.
 The system uses machine learning algorithms such as matrix factorization and deep learning
to improve the accuracy of its recommendations.
 Netflix continuously collects data on user interactions and feedback to refine its
recommendation algorithms.

2. Amazon Product Recommendation System:


 Amazon uses a recommendation system to suggest products to its customers based on their
browsing and purchase history.
 The system uses collaborative filtering to identify products that are popular among similar users.
 Amazon also uses item-to-item collaborative filtering, which recommends products that are
similar to those that a user has previously viewed or purchased.
 The system incorporates user feedback and ratings to improve the relevance of its
recommendations.
 Amazon's recommendation system is powered by machine learning algorithms that analyze
large amounts of data to make personalized recommendations.

In both cases, the recommendation systems use machine learning and data analysis techniques to
analyze user behavior and make personalized recommendations. These systems help improve user
engagement, increase sales, and enhance the overall user experience.
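
The item-to-item collaborative filtering idea described above can be sketched on a toy rating matrix. This is only an illustration under simple assumptions (cosine similarity between item columns, a made-up `ratings` matrix, hypothetical helper names); it is not how Netflix or Amazon implement their systems.

```python
import numpy as np

# Toy user-item rating matrix (rows: users, columns: items); 0 means "not rated".
ratings = np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 1],
    [0, 0, 5, 4],
    [1, 0, 4, 5],
], dtype=float)

def item_similarity(matrix):
    """Cosine similarity between item columns."""
    norms = np.linalg.norm(matrix, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for items nobody rated
    normalized = matrix / norms
    return normalized.T @ normalized

def recommend(user_index, matrix, top_n=1):
    """Score unrated items by their similarity to the items the user has rated."""
    sim = item_similarity(matrix)
    user_ratings = matrix[user_index]
    scores = sim @ user_ratings
    scores[user_ratings > 0] = -np.inf  # never recommend already-rated items
    return np.argsort(scores)[::-1][:top_n].tolist()

top_items = recommend(0, ratings)
print(top_items)
```

For user 0, the unrated item whose rating pattern most resembles the items that user already rated is ranked first.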


4. TOOLS AND TECHNIQUES NEEDED FOR DEALING WITH LARGE DATA

Dealing with large datasets requires a combination of tools and techniques to manage, process, and
analyze the data efficiently. Here are some key tools and techniques:

1. Big Data Frameworks:


Frameworks such as Apache Hadoop, Apache Spark, and Apache Flink provide tools for distributed
storage and processing of large datasets.

2. Data Storage:
Use of distributed file systems like Hadoop Distributed File System (HDFS), cloud storage services
like Amazon S3, or NoSQL databases like Apache Cassandra or MongoDB for storing large volumes of
data.

3. Data Processing:
Techniques such as MapReduce, Spark RDDs, and Spark DataFrames for parallel processing of data
across distributed computing clusters.

4. Data Streaming:
Tools like Apache Kafka or Apache Flink for processing real-time streaming data.

5. Data Compression:
Compression codecs like gzip or Snappy, and compressed columnar file formats like Parquet, reduce
storage requirements and can improve processing speed.

6. Data Partitioning:
Divide large datasets into smaller, more manageable partitions based on certain criteria to improve
processing efficiency.

7. Distributed Computing:
Use of cloud computing platforms like Amazon Web Services (AWS), Google Cloud Platform (GCP), or
Microsoft Azure for scalable and cost-effective processing of large datasets.

8. Data Indexing:
Create indexes on data fields to enable faster data retrieval, especially for queries involving large
datasets.

9. Machine Learning:
Use of machine learning algorithms and libraries (e.g., scikit-learn, TensorFlow) for analyzing and
deriving insights from large datasets.

10. Data Visualization:


Tools like Matplotlib, Seaborn, or Tableau for visualizing large datasets to gain insights and make
data-driven decisions.

By leveraging these tools and techniques, organizations can effectively manage and analyze large
volumes of data to extract valuable insights and drive informed decision-making.

5. DATA PREPARATION FOR DEALING WITH LARGE DATA


Data preparation is a crucial step in dealing with large datasets, as it ensures that the data is clean,
consistent, and ready for analysis. Here are some key steps involved in data preparation for large
datasets:

1. Data Cleaning:
Remove or correct any errors or inconsistencies in the data, such as missing values, duplicate
records, or outliers.
2. Data Integration:
Combine data from multiple sources into a single dataset, ensuring that the data is consistent and can
be analyzed together.


3. Data Transformation:
Convert the data into a format that is suitable for analysis, such as converting categorical variables
into numerical ones or normalizing numerical variables.

4. Data Reduction:
Reduce the size of the dataset by removing unnecessary features or aggregating data to a higher level
of granularity.

5. Data Sampling:
If the dataset is too large to analyze in its entirety, use sampling techniques to extract a
representative subset of the data for analysis.

6. Feature Engineering:
Create new features from existing ones to improve the performance of machine learning models or
better capture the underlying patterns in the data.

7. Data Splitting:
Split the dataset into training, validation, and test sets to evaluate the performance of machine
learning models and avoid overfitting.

8. Data Visualization:
Visualize the data to explore its characteristics and identify any patterns or trends that may be present.

9. Data Security:
Ensure that the data is secure and protected from unauthorized access or loss, especially when
dealing with sensitive information.
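
Several of the preparation steps above (cleaning, transformation, and splitting) can be sketched with Pandas. The column names and values below are made up for illustration, and the choices shown (median imputation, min-max scaling, an 80/20 split) are assumptions rather than recommendations; a validation set can be split off the training part in the same way.

```python
import pandas as pd

# Hypothetical raw data with one duplicate row and one missing value.
raw = pd.DataFrame({
    "age": [25, 32, None, 32, 41, 58],
    "city": ["Chennai", "Salem", "Salem", "Salem", "Erode", "Chennai"],
    "spend": [120.0, 80.0, 95.0, 80.0, 300.0, 150.0],
})

# Data cleaning: drop duplicate records and fill the missing age with the median.
clean = raw.drop_duplicates()
clean = clean.fillna({"age": clean["age"].median()})

# Data transformation: one-hot encode the categorical column and
# rescale the numerical column to the range [0, 1].
prepared = pd.get_dummies(clean, columns=["city"])
spend = prepared["spend"]
prepared["spend"] = (spend - spend.min()) / (spend.max() - spend.min())

# Data splitting: shuffle, then keep 80% for training and 20% for testing.
shuffled = prepared.sample(frac=1.0, random_state=0)
n_train = (len(shuffled) * 4) // 5
train, test = shuffled.iloc[:n_train], shuffled.iloc[n_train:]

print(len(train), len(test))
```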

6. MODEL BUILDING FOR DEALING WITH LARGE DATA

When building models for large datasets, it's important to consider scalability, efficiency, and
performance. Here are some key techniques and considerations for model building with large data:

1. Use Distributed Computing:


Utilize frameworks like Apache Spark or TensorFlow with distributed computing capabilities to
process large datasets in parallel across multiple nodes.

2. Feature Selection:
Choose relevant features and reduce the dimensionality of the dataset to improve model
performance and reduce computation time.

3. Model Selection:
Use models that are scalable and efficient for large datasets, such as gradient boosting machines,
random forests, or deep learning models.

4. Batch Processing:
If real-time processing is not necessary, consider batch processing techniques to handle large
volumes of data in scheduled intervals.

5. Sampling:
Use sampling techniques to create smaller subsets of the data for model building and validation,
especially if the entire dataset cannot fit into memory.

6. Incremental Learning:
Implement models that can be updated incrementally as new data becomes available, instead of
retraining the entire model from scratch.


7. Feature Engineering:
Create new features or transform existing features to better represent the underlying patterns in the
data and improve model performance.

8. Model Evaluation:
Use appropriate metrics to evaluate model performance, considering the trade-offs between
accuracy, scalability, and computational resources.

9. Parallelization:
Use parallel processing techniques within the model training process to speed up computations, such
as parallelizing gradient computations in deep learning models.

10. Data Partitioning:
Partition the data into smaller subsets for training and validation to improve efficiency and reduce
memory requirements.

By employing these techniques, data scientists and machine learning engineers can build models that
are scalable, efficient, and capable of handling large datasets effectively.
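
The incremental learning idea from the list above can be illustrated with a small linear model that is updated one mini-batch at a time. This is a sketch under stated assumptions: the class name `OnlineLinearRegression`, the learning rate, and the synthetic noise-free data are all invented for the example.

```python
import numpy as np

class OnlineLinearRegression:
    """Linear model updated batch by batch, so the full data never sits in memory."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = np.zeros(n_features)
        self.bias = 0.0
        self.learning_rate = learning_rate

    def partial_fit(self, X, y):
        """One gradient step (up to a constant factor) on the batch squared error."""
        error = X @ self.weights + self.bias - y
        self.weights -= self.learning_rate * (X.T @ error) / len(y)
        self.bias -= self.learning_rate * error.mean()
        return self

    def predict(self, X):
        return X @ self.weights + self.bias

rng = np.random.default_rng(0)
model = OnlineLinearRegression(n_features=2)

# Each iteration stands in for a new batch of data arriving over time.
for _ in range(2000):
    X_batch = rng.uniform(-1, 1, size=(32, 2))
    y_batch = 3.0 * X_batch[:, 0] - 2.0 * X_batch[:, 1] + 1.0
    model.partial_fit(X_batch, y_batch)

print(model.weights, model.bias)
```

The model is never retrained from scratch; each new batch only nudges the existing weights.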

7. PRESENTATION AND AUTOMATION FOR DEALING WITH LARGE DATA

Presentation and automation are key aspects of dealing with large datasets to effectively
communicate insights and streamline data processing tasks. Here are some strategies for
presentation and automation:

1. Visualization:
Use data visualization tools like Matplotlib, Seaborn, or Tableau to create visualizations that help
stakeholders understand complex patterns and trends in the data.

2. Dashboarding:
Build interactive dashboards using tools like Power BI or Tableau that allow users to explore the
data and gain insights in real-time.

3. Automated Reporting:
Use tools like Jupyter Notebooks or R Markdown to create automated reports that can be generated
regularly with updated data.

4. Data Pipelines:
Implement data pipelines using tools like Apache Airflow or Luigi to automate data ingestion,
processing, and analysis tasks.

5. Model Deployment:
Use containerization technologies like Docker to deploy machine learning models as scalable and
reusable components.

6. Monitoring and Alerting:


Set up monitoring and alerting systems to track the performance of data pipelines and models, and to
be notified of any issues or anomalies.

7. Version Control:
Use version control systems like Git to track changes to your data processing scripts and models,
enabling collaboration and reproducibility.

8. Cloud Services:
Leverage cloud services like AWS, Google Cloud Platform, or Azure for scalable storage, processing,
and deployment of large datasets and models.

By incorporating these strategies, organizations can streamline their data processes, improve
decision-making, and derive more value from their large datasets.

OCS353 – DATA SCIENCE FUNDAMENTALS

QUESTION BANK

(Regulation 2021)


UNIT I
INTRODUCTION
PART A QUESTIONS AND ANSWERS
1. What is Data Science?
Data science is the study of data to extract meaningful insights for business. It is a multidisciplinary
approach that combines principles and practices from the fields of mathematics, statistics, artificial
intelligence, and computer engineering to analyze large amounts of data. This analysis helps data
scientists to ask and answer questions like what happened, why it happened, what will happen, and
what can be done with the results.
2. Differentiate between Data Science and Big Data.
Data Science | Big Data
Data Science is an area of study. | Big Data is a technique to collect, maintain, and process huge volumes of information.
It is about the collection, processing, analyzing, and utilizing of data in various operations. It is more conceptual. | It is about extracting vital and valuable information from a huge amount of data.
It is a superset of Big Data, as data science consists of data scraping, cleaning, visualization, statistics, and many more techniques. | It is a subset of Data Science, as mining activities are part of the data science pipeline.
It is a field of study, just like Computer Science, Applied Statistics, or Applied Mathematics. | It is a technique for tracking and discovering trends in complex data sets. The goal is to make data more vital and usable, i.e. by extracting only important information from the huge data within existing traditional aspects.
It broadly focuses on the science of the data. | It is more involved with the processes of handling voluminous data.

3. Mention the characteristics of Big Data.


The characteristics of big data are often referred to as the three Vs:
 Volume - How much data is there?
 Variety - How diverse are different types of data?
 Velocity - At what speed is new data generated?
 Often these characteristics are complemented with a fourth V, Veracity - How accurate is the
data?
4. What are the benefits and uses of Data Science?
 Commercial companies in almost every industry use data science to gain insights into
their customers, processes, staff, competition, and products.


 Governmental organizations are also aware of data's value.


 Nongovernmental organizations (NGOs) are also no strangers to using data. They use it to
raise money and defend their causes.
 Universities use data science in their research but also to enhance the study experience of
their students.
5. What are the main categories of Data?
 Structured
 Unstructured
 Natural language
 Machine-generated
 Graph-based
 Audio, video, and images
 Streaming
6. Define Structured Data.
Structured data is data that depends on a data model and resides in a fixed field within a record.
SQL, or Structured Query Language, is the preferred way to manage and query data that resides in
databases.

7. Define Unstructured Data with an example.


Unstructured data is data that isn't easy to fit into a data model because the content is context-
specific or varying. One example of unstructured data is regular email. Although email contains
structured elements such as the sender, title, and body text, it's a challenge to find the number
of people who have written an email complaint about a specific employee because so many ways exist
to refer to a person.
8. What is machine-generated data?
Machine-generated data is information that's automatically created by a computer, process,
application, or other machine without human intervention.
9. What is graph-based data?
In graph theory, a graph is a mathematical structure to model pair-wise relationships between
objects. Graph or network data is, in short, data that focuses on the relationship or adjacency of
objects. Graph structures use nodes, edges, and properties to represent and store graphical data.
Graph-based data is a natural way to represent social networks, and its structure allows you to
calculate specific metrics such as the influence of a person and the shortest path between two people.
Examples of graph-based data can be found on many social media websites.
10. Mention the steps of the Data Science Process.
1. Setting the research goal
2. Retrieving data
3. Data preparation
4. Data exploration
5. Data modeling or model building
6. Presentation and automation
11. What are the steps needed in creating a project charter?
 A clear research goal
 The project mission and context
 How you're going to perform the analysis
 What resources are expected to be used
 Proof that it's an achievable project, or proof of concepts
 Deliverables and a measure of success
 A timeline

12. Define Data Cleansing.


Data cleansing is a subprocess of the data science process that focuses on removing errors in data so
that it becomes a true and consistent representation of the processes it originates from.
13. What are the different ways of combining Data?
Two operations can be performed to combine information from different data sets.
1. The first operation is joining: enriching an observation from one table with information
from another table.
2. The second operation is appending or stacking: adding the observations of one table to
those of another table.
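
Both operations can be sketched with Pandas. The tables and column names below are invented for the example: `pd.merge` performs the join, and `pd.concat` performs the appending.

```python
import pandas as pd

clients = pd.DataFrame({"client": ["John", "Mary"], "region": ["NY", "NC"]})
sales_jan = pd.DataFrame({"client": ["John", "Mary"], "amount": [100, 250]})
sales_feb = pd.DataFrame({"client": ["John", "Mary"], "amount": [120, 90]})

# Joining: enrich each sales observation with the client's region.
joined = pd.merge(sales_jan, clients, on="client")

# Appending (stacking): add February's observations below January's.
stacked = pd.concat([sales_jan, sales_feb], ignore_index=True)

print(joined)
print(stacked)
```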
14. Define Pareto Diagram.
A Pareto diagram, also called an 80-20 diagram, is a combination of the values and a cumulative
distribution. It is an example of combining simple graphs into a composite graph.
15. What is the purpose of brushing and linking?
Brushing and linking combines and links different graphs and tables (or views) so that changes in
one graph are automatically transferred to the other graphs.
16. Define Dummy Variables.
Variables can be turned into dummy variables. Dummy variables can only take two values: true(1) or
false(0). They're used to indicate the presence or absence of a categorical effect that may explain
the observation.
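
A small sketch of turning a categorical variable into dummy variables with `pd.get_dummies`; the `customer` and `gender` columns are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"customer": [1, 2, 3], "gender": ["F", "M", "F"]})

# Each category becomes its own 0/1 column.
dummies = pd.get_dummies(df, columns=["gender"], dtype=int)
print(dummies)
```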
17. How to fix capital letter mismatches?
Capital letter mismatches are common. Most programming languages make a distinction between
"Brazil" and "brazil". The problem is solved by applying a function that returns both strings in
lowercase, such as lower() in Python. "Brazil".lower() == "brazil".lower() should result in True.

PART B QUESTIONS
1. Explain in detail the facets of data.
2. State the differences between big data and data science. Mention the benefits and uses of
data science and big data.
3. Explain the data science process and its steps in detail.
4. Discuss defining research goals and the steps involved in creating a project charter.
5. Explain in detail retrieving data in data science.
6. Explain cleansing, integrating, and transforming data in detail.
7. Explain data exploration, including the graphs used.
8. Explain the basic statistical description of data.
9. Explain standard deviation and its formula with an example.
10. Explain data mining.
11. Explain text mining in detail, with the steps involved.


UNIT II
DATA MANIPULATION
PART A QUESTIONS AND ANSWERS
1. What does a Python integer contain?
A single integer in Python actually contains four pieces:
 ob_refcnt, a reference count that helps Python silently handle memory allocation and
deallocation
 ob_type, which encodes the type of the variable
 ob_size, which specifies the size of the following data members
 ob_digit, which contains the actual integer value that we expect the Python variable to represent

2. Mention Fixed-Type Arrays in Python with an example.


Python offers several different options for storing data in efficient, fixed-type data buffers. The built-
in array module can be used to create dense arrays of a uniform type:
import array
L = list(range(10))
A = array.array('i', L)
A
Out[]: array('i', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Here 'i' is a type code indicating the contents are integers.
3. How to Create Arrays from Python Lists?
First, we can use np.array to create arrays from Python lists.
Ex: np.array([1, 4, 2, 5, 3])
4. How does array indexing in NumPy take place?
In a one-dimensional array, accessing the ith value (counting from zero) is done by specifying the
desired index in square brackets, just as with Python lists.
In []: x1
Out []: array([5, 0, 3, 3, 7, 9])
In []: x1[0]
Out []: 5
In []: x1[4]
Out []: 7
In a multidimensional array, access the items using a comma-separated tuple of indices:
In []: x2
Out []: array([[3, 5, 2, 4],
               [7, 6, 8, 8],
               [1, 6, 7, 7]])
In []: x2[0, 0]
Out []: 3
5. Mention the syntax of Array Slicing in NumPy.
The NumPy slicing syntax follows that of the standard Python list. To access a slice of an array x, use
the following syntax:

x[start:stop:step]
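
A few illustrative uses of the `x[start:stop:step]` syntax on a small array (the variable names are chosen only for this example):

```python
import numpy as np

x = np.arange(10)        # array([0, 1, 2, ..., 9])

first_five = x[:5]       # start defaults to 0
middle = x[4:7]          # stop index is exclusive
every_other = x[::2]     # step of 2
reversed_x = x[::-1]     # negative step walks backwards

print(first_five, middle, every_other, reversed_x)
```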


6. How is reshaping of arrays done?
The most flexible way of reshaping arrays is with the reshape() method. For example, to put
the numbers 1 through 9 in a 3 x 3 grid, the following can be done:
In [38]: grid = np.arange(1, 10).reshape((3, 3))
print(grid)
[[1 2 3]
 [4 5 6]
 [7 8 9]]
7. How is concatenation and splitting of arrays in NumPy done?
Concatenation, or joining of two arrays in NumPy, is primarily accomplished through the routines
np.concatenate, np.vstack, and np.hstack.
The opposite of concatenation is splitting, which is implemented by the functions np.split,
np.hsplit, and np.vsplit.
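
A short sketch of these routines on small example arrays (the array values are arbitrary):

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
grid = np.array([[9, 8, 7],
                 [6, 5, 4]])

joined = np.concatenate([x, y])                      # one longer 1-D array
stacked = np.vstack([x, grid])                       # x added as a new first row
widened = np.hstack([grid, np.array([[99], [99]])])  # extra column on the right

# Splitting at indices 2 and 4 yields three sub-arrays.
left, middle, right = np.split(joined, [2, 4])

print(joined, stacked.shape, widened.shape, left, middle, right)
```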

8. Mention the rules of Broadcasting.


Broadcasting in NumPy follows a strict set of rules to determine the interaction between the two
arrays:
Rule 1: If the two arrays differ in their number of dimensions, the shape of the one with fewer
dimensions is padded with ones on its leading (left) side.
Rule 2: If the shape of the two arrays does not match in any dimension, the array with shape equal to
1 in that dimension is stretched to match the other shape.
Rule 3: If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
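
The three rules can be seen in a small sketch; the array shapes are chosen only to trigger each rule.

```python
import numpy as np

M = np.ones((2, 3))
a = np.arange(3)              # shape (3,)

# Rule 1 pads a to shape (1, 3); Rule 2 stretches it to (2, 3).
result = M + a

# Shapes (3, 1) and (3,) are both stretched to (3, 3).
col = np.arange(3).reshape((3, 1))
outer = col + a

# Rule 3: shapes (3, 2) and (3,) disagree in the last dimension (2 vs 3),
# and neither size is 1, so NumPy raises an error.
try:
    np.ones((3, 2)) + a
    failed = False
except ValueError:
    failed = True

print(result, outer, failed)
```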
9. Define Fancy Indexing.
Fancy indexing is conceptually simple: it means passing an array of indices to access multiple array
elements at once.
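
A minimal sketch of fancy indexing; the array values are arbitrary. Note that the result takes the shape of the index array, not of the array being indexed.

```python
import numpy as np

x = np.array([51, 92, 14, 71, 60, 20, 82, 86, 74, 74])

# Pass a list of indices to pick several elements at once.
picked = x[[3, 7, 4]]

# A 2-D index array gives a 2-D result.
ind = np.array([[3, 7],
                [4, 5]])
picked_2d = x[ind]

print(picked, picked_2d)
```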
10. Differentiate between Pandas and NumPy.
Pandas | NumPy
The pandas module is preferred for working on tabular data. | The numpy module is preferred for working on numerical data.
The powerful tools of pandas are DataFrame and Series. | The powerful tool of numpy is the array.
Pandas consumes more memory. | NumPy is memory efficient.
Indexing of a pandas Series is very slow compared to NumPy arrays. | Indexing of NumPy arrays is very fast.
Pandas offers a 2D table object called DataFrame. | NumPy is capable of providing multi-dimensional arrays.
11. Define Pandas Objects.
Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows
and columns are identified with labels rather than simple integer indices.


12. Explain Pandas data structures.

The three fundamental Pandas data structures are the Series, DataFrame, and Index.
A Pandas Series is a one-dimensional array of indexed data. Its contents can be accessed with the
values and index attributes.
A DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible
column names.
The Index object is an interesting structure in itself, and it can be thought of either as an immutable
array or as an ordered set (technically a multiset, as Index objects may contain repeated values).
13. Name the ways in which DataFrame objects can be constructed.
A Pandas DataFrame can be constructed in a variety of ways.
 From a single Series object.
 From a list of dicts.
 From a dictionary of Series objects.
 From a two-dimensional NumPy array.
 From a NumPy structured array.

14. Define the indexers loc, iloc, and ix.

 The loc attribute allows indexing and slicing that always references the explicit index.
 The iloc attribute allows indexing and slicing that always references the implicit Python-style
index.
 A third indexing attribute, ix, is a hybrid of the two, and for Series objects is equivalent to
standard []-based indexing. Note that ix is deprecated and has been removed from recent Pandas
versions.
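
The difference between the explicit and the implicit index is easiest to see on a Series whose labels are themselves integers. The data below is invented for the example.

```python
import pandas as pd

data = pd.Series(["a", "b", "c"], index=[1, 3, 5])

by_label = data.loc[1]            # explicit index: the label 1
by_position = data.iloc[1]        # implicit index: the second element

label_slice = data.loc[1:3]       # label slices include the end label
position_slice = data.iloc[1:3]   # position slices exclude the end position

print(by_label, by_position)
print(label_slice.tolist(), position_slice.tolist())
```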

15. How is missing data handled in a table or DataFrame?

A number of schemes have been developed to indicate the presence of missing data in a table or
DataFrame. Generally, they revolve around one of two strategies:
1) Using a mask that globally indicates missing values
2) Choosing a sentinel value that indicates a missing entry.
16. State NaN and None in Pandas.
NaN and None both have their place, and Pandas is built to handle the two of them nearly
interchangeably, converting between them where appropriate:
In [10]: pd.Series([1, np.nan, 2, None])
Out[10]: 0    1.0
         1    NaN
         2    2.0
         3    NaN
         dtype: float64
17. Define the methods used in null values.
There are several useful methods for detecting, removing, and replacing null values in Pandas data
structures. They are:
isnull() - Generate a Boolean mask indicating missing values
notnull() - Opposite of isnull()
dropna() - Return a filtered version of the data
fillna() - Return a copy of the data with missing values filled or imputed
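
The four methods can be tried on a small Series with two missing entries (the values are arbitrary):

```python
import numpy as np
import pandas as pd

data = pd.Series([1.0, np.nan, 2.0, None])

mask = data.isnull()       # True where a value is missing
present = data.notnull()   # the opposite mask
kept = data.dropna()       # only the non-missing values
filled = data.fillna(0)    # missing values replaced by 0

print(mask.tolist(), kept.tolist(), filled.tolist())
```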
18. Name the categories of Join.
The pd.merge() function implements a number of types of joins: the one-to-one, many-to-one,
and many-to-many joins.
19. Mention the term GroupBy and the steps involved in it.
The name "group by" comes from a command in the SQL database language, but it is perhaps more
illuminative to think of it in the terms first coined by Hadley Wickham of Rstats fame: split, apply,
combine.
The split step involves breaking up and grouping a DataFrame depending on the value of the
specified key.
The apply step involves computing some function, usually an aggregate, transformation, or filtering,
within the individual groups.
The combine step merges the results of these operations into an output array.
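
A minimal sketch of split, apply, combine with Pandas; the `key` and `data` columns are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({"key": list("ABCABC"), "data": range(6)})

# split on "key", apply sum() to each group, combine into one Series.
sums = df.groupby("key")["data"].sum()
print(sums)
```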
20. Define Pivot Table.
A pivot table is a similar operation that is commonly seen in spreadsheets and other programs that
operate on tabular data. The pivot table takes simple column-wise data as input, and groups the
entries into a two-dimensional table that provides a multidimensional summarization of the data.
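
As an illustration, the sketch below summarizes a made-up column-wise sales table into a two-dimensional pivot table with `DataFrame.pivot_table`.

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "product": ["pen", "book", "pen", "book", "pen"],
    "units": [10, 4, 7, 3, 5],
})

# Rows are regions, columns are products, cells hold the summed units.
table = sales.pivot_table(values="units", index="region",
                          columns="product", aggfunc="sum")
print(table)
```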
21. Define Hierarchical Indexing.
Hierarchical indexing, also known as multi-indexing, means setting more than one column as the
index.

PART B QUESTIONS
1. Explain the data types in Python.
2. Explain the basics of NumPy arrays with code.
3. Describe the syntax for accessing subarrays using array slicing.
4. Explain in detail the reshaping of arrays.
5. Explain the aggregation operations in Python.
6. Explain broadcasting and its rules with examples.
7. Explain Boolean masking in detail.
8. Explain fancy indexing in detail.
9. Explain in detail why there is a need for NumPy's structured arrays.
10. Describe the installation procedure and use of Pandas.
11. Explain the different ways of constructing a Pandas DataFrame.
12. Explain data selection in Series and DataFrames.
13. Explain the different operations on data in Pandas.
14. Explain how missing data is handled.


15. Explain about the methods for detecting, removing and replacing Null values in Pandas data
structures.
16. Explain in detail about hierarchical indexing?
17. Describe data aggregations on Multiindices?
18. Explain in detail about combining datasets using Concat and Append?
19. Explain in detail about categories of Joins.
20. Explain about the steps involved in GroupBy with suitable diagram and coding?

UNIT III
MACHINE LEARNING
PART A QUESTIONS AND ANSWERS
1. What is Machine Learning?
Machine learning is a subfield of artificial intelligence that involves the development of algorithms
and statistical models that enable computers to improve their performance in tasks through
experience. A computer program is said to learn from experience E with respect to a task T and a
performance measure P if its performance on T, as measured by P, improves with experience E.
2. What are the steps defined in the modelling phase?
The modeling phase consists of four steps:
 Feature engineering and model selection
 Training the model
 Model validation and selection
 Applying the trained model to unseen data

3. How machine learning is different from general programming?


In general programming, we have the data and the logic, and by using these two we create the
answers. In machine learning, we have the data and the answers, and we let the machine learn the
logic from them so that the same logic can be used to answer the questions which will be faced in
the future. Also, there are times when writing logic in code is not possible; at those times machine
learning becomes a saviour and learns the logic itself.
4. How to train a model in machine learning?
Three steps to training a machine learning model:
Step 1: Begin with existing data.
Step 2: Analyze data to identify patterns.
Step 3: Make predictions.
5. What are the error measures in machine learning?
Two common error measures in machine learning are
(i) the classification error rate for classification problems and
(ii) the mean squared error for regression problems.
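
Both error measures are simple to compute by hand. The sketch below uses plain Python; the function names and the sample labels and values are chosen only for illustration.

```python
def classification_error_rate(y_true, y_pred):
    """Fraction of observations whose predicted class is wrong."""
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

error_rate = classification_error_rate(
    ["spam", "ham", "spam", "ham"],
    ["spam", "spam", "spam", "ham"],
)
mse = mean_squared_error([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])

print(error_rate, mse)
```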
6. Mention the validation strategies available in Data Science.
Holdout: Divide your data into a training set with X% of the observations and keep the rest as a
holdout data set.
K-folds cross validation: This strategy divides the data set into k parts and uses each part one time as a
test data set while using the others as a training data set.
Leave-1 out: This approach is the same as k-folds but with k equal to the number of observations.
Always leave one observation out and train on the rest of the data.
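
The k-folds strategy can be sketched as an index generator in plain Python. The function name `k_fold_indices` is hypothetical, and shuffling is left out to keep the example short; calling it with k equal to the number of samples leaves exactly one observation out per fold.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) so each sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(6, 3))          # 3-fold split of 6 observations
leave_one_out = list(k_fold_indices(4, 4))  # one observation per test fold

for train, test in folds:
    print("train:", train, "test:", test)
```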
7. What are the types of machine learning?
The two big types of machine learning techniques are supervised and unsupervised learning, with
semi-supervised learning in between:
1. Supervised: Learning that requires labeled data
2. Unsupervised: Learning that doesn't require labeled data but is usually less accurate or reliable
than supervised learning.
3. Semi-supervised learning is in between those techniques and is used when only a small portion
of the data is labeled.
8. What is the Classification Algorithm?
The Classification algorithm is a Supervised Learning technique that is used to identify the category of
new observations on the basis of training data. In Classification, a program learns from the given
dataset or observations and then classifies new observations into a number of classes or groups such
as, Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can be called as targets/labels or
categories.
9. Define Regression.
Regression is a measure of the relation between the mean value of one variable and the
corresponding values of other variables.
10. Define regression line.
The regression line is a straight line, rather than a curved line, that denotes the linear relationship
between the two variables.
11. What is a predictive error?
Predictive error refers to the difference between the predicted values made by some model and the
actual values. For example, predictive error denotes the difference between the predicted number of
cards received and the actual number of cards received.
12. What is meant by Captcha Check?
A Captcha check shows hard-to-read numbers that the human user must decipher and enter into a
form field before sending the form back to the web server.
13. Define Clustering.
Clustering is the task of dividing unlabeled data points into different clusters such that data points
in the same cluster are more similar to each other than to those in other clusters.
14. Define Principal Component Analysis.
Principal Component Analysis is a technique to find the latent variables in the data set while
retaining as much information as possible.
15. Define Semi Supervised Learning


Semi-supervised learning is a type of machine learning that falls in between supervised and
unsupervised learning. It is a method that uses a small amount of labeled data and a large amount of
unlabeled data to train a model. The goal of semi-supervised learning is to learn a function that can
accurately predict the output variable based on the input variables, similar to supervised learning.
16. State the learners in classification problems. Mention about Lazy Learners.
Classification learners are of two kinds: lazy learners and eager learners.
Lazy Learners: A lazy learner first stores the training dataset and waits until it receives the test
dataset. In the lazy learner case, classification is done on the basis of the most related data stored in
the training dataset. It takes less time in training but more time for predictions.
17. Mention the types of ML classification algorithms.
Linear Models
 Logistic Regression
 Support Vector Machines
Non-linear Models
 K-Nearest Neighbours
 Kernel SVM
 Naïve Bayes
 Decision Tree Classification
 Random Forest Classification
18. State Outlier analysis.
An outlier is a data object that deviates significantly from the rest of the data objects and behaves in a
different manner. Outliers can be caused by measurement or execution errors. The analysis of outlier data is referred to as
outlier analysis or outlier mining.
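One simple way to flag outliers is the interquartile range (IQR) rule. This is a sketch of one common rule of thumb, not the only method of outlier analysis; it assumes NumPy, and the measurements are invented for illustration.

```python
import numpy as np

# Hypothetical measurements; 95.0 plays the role of a measurement error.
values = np.array([10.2, 9.8, 10.5, 10.1, 9.9, 10.3, 95.0, 10.0])

# Quartiles and the interquartile range.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Points more than 1.5 * IQR outside the quartiles are flagged.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print("outliers:", outliers)
```

The factor 1.5 is a convention; a larger factor flags fewer points.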
19. State the difference between Supervised and Unsupervised Learning.
Parameter | Supervised Learning | Unsupervised Learning
Input Data | Uses known and labeled data as input | Uses unknown (unlabeled) data as input
Computational Complexity | Less computational complexity | More computational complexity
Real Time | Uses off-line analysis | Uses real-time analysis of data
Number of Classes | Number of classes is known | Number of classes is not known
Accuracy of Results | Accurate and reliable results | Moderately accurate and reliable results
Output Data | Desired output is given | Desired output is not given
Model | It is not possible to learn larger and more complex models than with unsupervised learning | It is possible to learn larger and more complex models than with supervised learning
Training Data | Training data is used to infer the model | Labeled training data is not used
20. What are the steps involved in the Data Science Process?
1. Setting the research goal
2. Retrieving Data
3. Data Preparation
4. Data Exploration
5. Data Modelling
6. Presentation and Automation
21. What is a Multi-class Classifier?
 Multi-class Classifier: If a classification problem has more than two outcomes, then it is called a
multi-class classifier.
Example: classification of types of crops, classification of types of music.
22. Mention the clustering techniques used in performing tasks.
The clustering technique can be widely used in various tasks. Some most common uses of this technique
are:
 Market Segmentation
 Statistical data analysis
 Social network analysis
 Image segmentation
 Anomaly detection
23. What is meant by K-means algorithm?
K-Means algorithm: The k-means algorithm is one of the most popular clustering algorithms. It classifies
the dataset by dividing the samples into different clusters of equal variances. The number of clusters
must be specified in this algorithm. It is fast, with fewer computations required, and has linear
complexity of O(n) in the number of samples.
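The two alternating steps of k-means (assign each point to the nearest centroid, then move each centroid to the mean of its points) can be sketched as follows. This is a minimal illustration that assumes NumPy; the function name kmeans, the fixed number of iterations, and the two synthetic blobs are choices made for this example, not part of the notes.

```python
import numpy as np

def kmeans(points, k, n_iter=20, seed=0):
    """Minimal k-means (Lloyd's algorithm); returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids, labels

# Two well-separated synthetic blobs of 50 points each.
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5, 5], scale=0.5, size=(50, 2))
data = np.vstack([blob_a, blob_b])

centroids, labels = kmeans(data, k=2)
print(centroids)
```

Each iteration touches every point once per centroid, which is why the cost grows linearly with the number of samples for a fixed number of clusters.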
24. State DBSCAN algorithm


DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of Applications with Noise. It is an
example of a density-based model in which areas of high density are separated by areas of low
density. Because of this, the high-density clusters can be found in any arbitrary shape.

PART B QUESTIONS
1. Explain the steps involved in modeling process.
2. Explain the types of Machine learning.
3. Explain about Supervised learning with a suitable example.
4. Explain in detail about unsupervised learning with an example.
5. Explain the steps involved in discerning digits from images under supervised learning.
6. Explain in detail about unsupervised learning for finding latent variables in a wine quality data set.
7. Explain about Semi Supervised Learning in detail.
8. Explain Classification Algorithm in Machine Learning
9. Explain about main clustering methods used in Machine Learning.
10. Define regression. Explain about prediction of values using the regression line
11. Write short notes on (a) Clustering Algorithms (b) Applications of Clustering techniques
12. Explain in detail about Outlier analysis.

_____________________________________________________________________________________________________________________
UNIT IV
DATA VISUALIZATION
PART A QUESTIONS AND ANSWERS
1. Define Matplotlib.
Matplotlib is a cross-platform, data visualization and graphical plotting library for Python and its
numerical extension NumPy.
2. How to import Matplotlib?
matplotlib.pyplot is a collection of command-style functions that make matplotlib work like
MATLAB. Each pyplot function makes some change to a figure: e.g., creates a figure, creates a plotting
area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc. In
matplotlib.pyplot various states are preserved across function calls, so that it keeps track of things
like the current figure and plotting area, and the plotting functions are directed to the current axes.
import matplotlib.pyplot as plt
plt.plot([1,2,3,4])
plt.ylabel('some numbers')
plt.show()
3. How to specify line colors?
The plt.plot() function can be used to control line colors and styles through additional arguments.
To adjust the color, the color keyword is used, which accepts a
string argument representing virtually any imaginable color. The color can be specified in a variety of
ways:
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 10, 1000)  # sample x values (assumed; not defined in the original snippet)
plt.plot(x, np.sin(x - 0), color='blue')         # specify color by name
plt.plot(x, np.sin(x - 1), color='g')            # short color code (rgbcmyk)
plt.plot(x, np.sin(x - 2), color='0.75')         # grayscale between 0 and 1
plt.plot(x, np.sin(x - 3), color='#FFDD44')      # hex code (RRGGBB from 00 to FF)
plt.plot(x, np.sin(x - 4), color=(1.0,0.2,0.3))  # RGB tuple, values 0 to 1
plt.plot(x, np.sin(x - 5), color='chartreuse')   # all HTML color names supported
4. Define Scatter plots.
In scatter plots, instead of points being joined by line segments, the points are represented
individually with a dot, circle, or other shape.
5. Mention the difference between plt.scatter and plt.plot.
The primary difference of plt.scatter from plt.plot is that it can be used to create scatter plots where
the properties of each individual point (size, face color, edge color, etc.) can be individually
controlled or mapped to data, whereas plt.plot draws all points with the same style and by default joins them with a line.
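The per-point control offered by scatter can be illustrated with a short sketch. It assumes Matplotlib and NumPy are installed; the random data, the Agg backend (chosen so the code also runs without a display), and the output file name scatter.png are assumptions made for this example. It uses the object-oriented form ax.scatter, which takes the same arguments as plt.scatter.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = rng.normal(size=100)
colors = rng.random(100)         # one color value per point
sizes = 1000 * rng.random(100)   # one marker area per point

fig, ax = plt.subplots()
# c and s accept arrays, so every point gets its own color and size.
points = ax.scatter(x, y, c=colors, s=sizes, alpha=0.3, cmap="viridis")
fig.colorbar(points, ax=ax)      # shows how the color values map to colors
fig.savefig("scatter.png")       # or plt.show() in an interactive session
```

For large data sets where all points share one style, plt.plot with a marker format string is usually the cheaper choice.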
6. Define Contour Plot.
A contour plot is a graphical technique for representing a 3-dimensional surface by plotting constant
z slices, called contours, on a 2-dimensional format. A contour plot can be created with the
plt.contour function.
It takes three arguments:
 a grid of x values,
 a grid of y values, and
 a grid of z values.
The x and y values represent positions on the plot, and the z values will be represented by the
contour levels.
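The three grids can be built with np.meshgrid. The following is a sketch only: the function used for z, the grid sizes, the Agg backend, and the file name contour.png are arbitrary choices for illustration, and ax.contour is the object-oriented equivalent of plt.contour.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)   # grids of x and y values, each of shape (40, 50)
Z = np.sin(X) ** 10 + np.cos(10 + Y * X) * np.cos(X)   # grid of z values

fig, ax = plt.subplots()
# Each contour line joins positions that share the same z value.
contours = ax.contour(X, Y, Z, levels=10, colors="black")
fig.savefig("contour.png")
```

Passing an integer for levels asks Matplotlib to choose about that many contour levels automatically; a list of explicit z values can be passed instead.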
7. Define Kernel Density Estimation.
One of the common methods of evaluating densities in multiple dimensions is Kernel Density
Estimation (KDE). KDE can be thought of as a way to "smear out" the points in space and add up the
result to obtain a smooth function.
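The "smear out and add up" idea can be shown in one dimension with a Gaussian kernel. This is a hand-written sketch using only NumPy; the function name gaussian_kde_1d, the bandwidth value, and the synthetic samples are assumptions for illustration, and library routines would normally be used for real work, especially in several dimensions.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Estimate a density on `grid` as the average of Gaussian bumps on the samples."""
    # Scaled distance from every grid position to every sample.
    diffs = (grid[:, None] - samples[None, :]) / bandwidth
    # One Gaussian bump per sample, evaluated at every grid position.
    kernels = np.exp(-0.5 * diffs ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    # Adding up the bumps and dividing by the sample count gives the density.
    return kernels.mean(axis=1)

rng = np.random.default_rng(1)
samples = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(3, 1.0, 300)])
grid = np.linspace(-8, 12, 500)
density = gaussian_kde_1d(samples, grid, bandwidth=0.4)

# A density should integrate to roughly 1 over the grid.
area = np.sum(density) * (grid[1] - grid[0])
print("approximate area under the curve:", area)
```

A small bandwidth gives a spiky estimate and a large one over-smooths, so the bandwidth is the main parameter to tune.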
8. Define Plot Legends.
Plot legends give meaning to a visualization, assigning labels to the various plot elements. The
simplest legend can be created with the plt.legend() command, which automatically creates a legend
for any labeled plot elements.
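A labeled plot and its legend can be sketched as below. The example assumes Matplotlib and NumPy; the label texts, the legend options, the Agg backend, and the file name legend.png are illustrative choices, and ax.legend() is the object-oriented equivalent of plt.legend().

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 200)
fig, ax = plt.subplots()
# The label keyword marks an element for inclusion in the legend.
ax.plot(x, np.sin(x), "-b", label="Sine")
ax.plot(x, np.cos(x), "--r", label="Cosine")

# The legend picks up every labeled element automatically.
legend = ax.legend(loc="upper left", frameon=False)
fig.savefig("legend.png")
```

Elements plotted without a label are left out of the legend.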
9. State the categories of colormaps
There are three different categories of colormaps:
 Sequential colormaps: These consist of one continuous sequence of colors (e.g., binary or viridis).
 Divergent colormaps: These usually contain two distinct colors, which show positive and negative
deviations from a mean (e.g., RdBu or PuOr).
 Qualitative colormaps: These mix colors with no particular sequence (e.g., rainbow or jet).


10.What are Subplots?


Matplotlib has the concept of subplots: groups of smaller axes that can exist together within a single
figure. These subplots might be insets, grids of plots, or other more complicated layouts. The most
basic method of creating an axes is to use the plt.axes function; aligned grids of subplots can be built
with plt.subplot(), which creates a single subplot within a grid.
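A grid of subplots can be filled one panel at a time with plt.subplot(rows, columns, index). In this sketch the 2 x 3 grid, the Agg backend, and the file name subplots.png are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig = plt.figure()
for i in range(1, 7):
    # 2 rows, 3 columns; the panel index i counts from 1, left to right, top to bottom.
    ax = plt.subplot(2, 3, i)
    ax.text(0.5, 0.5, str((2, 3, i)), fontsize=14, ha="center")

fig.savefig("subplots.png")
```

When the whole grid is needed at once, plt.subplots(2, 3) returns the figure and an array of axes in a single call.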
11. What are major and minor ticks?
Within each axis, there is the concept of a major tick mark and a minor tick mark. As the names
would imply, major ticks are usually bigger or more pronounced, while minor ticks are usually
smaller.
12.What is the need for visualizing errors?
For any scientific measurement, accurate accounting for errors is nearly as important, if not more
important, than accurate reporting of the number itself.
In visualization of data and results, showing the errors effectively can make a plot convey much more
complete information.
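Error bars are one basic way to show measurement uncertainty on a plot. The sketch below assumes Matplotlib and NumPy; the noisy sine data, the constant error dy, the Agg backend, and the file name errorbars.png are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
dy = 0.8                                  # assumed uncertainty of each measurement
y = np.sin(x) + dy * rng.normal(size=50)  # noisy measurements

fig, ax = plt.subplots()
# yerr sets the half-length of each vertical error bar; fmt controls the point style.
container = ax.errorbar(x, y, yerr=dy, fmt=".k", ecolor="lightgray", capsize=0)
fig.savefig("errorbars.png")
```

Drawing the bars in a lighter color than the points keeps a crowded plot readable.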
13.What are the types of transforms present?
There are three predefined transforms that can be useful when positioning text on a plot:
 ax.transData: transform associated with data coordinates
 ax.transAxes: transform associated with the axes (in units of axes dimensions)
 fig.transFigure: transform associated with the figure (in units of figure dimensions)
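The effect of the three transforms can be seen by placing the same kind of text with each of them. This sketch assumes Matplotlib; the coordinates, the text strings, the Agg backend, and the file name transforms.png are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.axis([0, 10, 0, 10])

# Position given in data coordinates: moves if the axis limits change.
ax.text(1, 5, ". Data: (1, 5)", transform=ax.transData)
# Position given as a fraction of the axes: (0, 0) is bottom left, (1, 1) top right.
ax.text(0.5, 0.1, ". Axes: (0.5, 0.1)", transform=ax.transAxes)
# Position given as a fraction of the whole figure.
ax.text(0.2, 0.2, ". Figure: (0.2, 0.2)", transform=fig.transFigure)

fig.savefig("transforms.png")
```

ax.transData is the default, so it only needs to be written out when the intent should be explicit.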
14. How is hiding ticks or labels done?
The most common tick/label formatting operation is the act of hiding ticks or labels. This is done by
using plt.NullLocator() and plt.NullFormatter().
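Both classes are applied to an axis through its locator and formatter. In this sketch (which assumes Matplotlib and NumPy) the y axis loses its ticks entirely while the x axis keeps its ticks but loses the labels; the random data, the Agg backend, and the file name hidden_ticks.png are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
ax.plot(np.random.default_rng(0).random(50))

# A null locator places no ticks at all, so the labels disappear too.
ax.yaxis.set_major_locator(plt.NullLocator())
# A null formatter keeps the tick marks but prints empty labels.
ax.xaxis.set_major_formatter(plt.NullFormatter())

fig.savefig("hidden_ticks.png")
```

The same two methods with set_minor_locator and set_minor_formatter act on the minor ticks.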
15.What is a surface plot?
A surface plot is like a wireframe plot, but each face of the wireframe is a filled polygon. Adding a
colormap to the filled polygons can aid perception of the topology of the surface being visualized.
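A colormapped surface can be drawn with plot_surface on a 3D axes. The sketch assumes Matplotlib and NumPy; the ripple function used for z, the Agg backend, and the file name surface.png are chosen only for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits import mplot3d  # noqa: F401  (registers the 3D projection on older versions)

x = np.linspace(-6, 6, 30)
y = np.linspace(-6, 6, 30)
X, Y = np.meshgrid(x, y)
Z = np.sin(np.sqrt(X ** 2 + Y ** 2))   # height of the surface at each grid position

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
# Each grid cell becomes a filled polygon colored by its height.
surf = ax.plot_surface(X, Y, Z, cmap="viridis", edgecolor="none")
fig.savefig("surface.png")
```

Replacing plot_surface with plot_wireframe (without the cmap argument) draws only the edges of the same grid.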
16.What is meant by Geographic Data with Basemap?
One common type of visualization in data science is that of geographic data. Matplotlib's main tool
for this type of visualization is the Basemap toolkit, which is one of several Matplotlib toolkits that
live under the mpl_toolkits namespace. Admittedly, Basemap feels a bit clunky to use, and often even
simple visualizations take much longer to render. More modern solutions, such as leaflet or the
Google Maps API, may be a better choice for more intensive map visualizations. Still, Basemap is a
useful tool for Python users to have in their virtual toolbelts.
17.Define cylindrical projections.
The simplest of map projections are cylindrical projections, in which lines of constant latitude and
longitude are mapped to horizontal and vertical lines, respectively. This type of mapping represents
equatorial regions quite well, but results in extreme distortions near the poles.
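The distortion near the poles can be checked numerically. The sketch below uses the Mercator projection, one well-known cylindrical projection, on a unit sphere; it needs only NumPy, and the function name mercator and the sample latitudes are assumptions for illustration (Basemap itself is not used here).

```python
import numpy as np

def mercator(lon_deg, lat_deg):
    """Map longitude/latitude in degrees to Mercator x, y on a unit sphere."""
    lon = np.radians(lon_deg)
    lat = np.radians(lat_deg)
    # x depends only on longitude and y only on latitude, so meridians
    # become vertical lines and parallels become horizontal lines.
    return lon, np.log(np.tan(np.pi / 4 + lat / 2))

lats = np.array([0.0, 30.0, 60.0, 80.0])
_, y = mercator(np.zeros_like(lats), lats)

# Local stretch factor of the map at each latitude.
scale = 1 / np.cos(np.radians(lats))

for lat, yy, s in zip(lats, y, scale):
    print(f"lat={lat:5.1f}  y={yy:.3f}  stretch={s:.2f}")
```

The stretch factor is 1 at the equator and grows without bound toward the poles, which matches the statement above.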
PART B QUESTIONS
1. Explain about Matplotlib with its import, setting styles and displaying the plots.
2. Explain about the dual interface of Matplotlib.


3. Explain in detail about simple Line plots with line colors, styles and axes limits.
4. Explain in detail about simple scatter plots.
5. Explain how Visualizing Errors is done.
6. Explain in detail about Density and Contour Plots.
7. Explain in detail about histogram functions with Binnings and Density.
8. Explain in detail about Customizing Plot Legends.
9. Explain in detail about Customizing Colorbars.
10. Explain in detail about Multiple Subplots.
11. Describe the example Effect of Holidays on US Births with Text and Annotation.
12. Explain in detail about Customizing Ticks.
13. Explain in detail the various built in styles in matplotlib stylesheets.
14. Explain in detail about Three-Dimensional Plotting in Matplotlib.
15. Explain in detail about Visualizing a Mobius Strip.
16. Explain about Geographic Data with Basemap with different Map Projections, Map
Background and Plotting Data on Maps.
17. Explain the example California Cities.
18. Explain the example Surface Temperature Data.
19. Explain in detail about Visualization with Seaborn.

UNIT V
HANDLING LARGE DATA
PART A QUESTIONS AND ANSWERS
1. What are the general problems you face while handling large data?
 Managing massive amounts of data.
 Ensuring data quality.
 Keeping data secure.
 Selecting the right big data tools.
 Scaling systems and costs efficiently.
 Lack of skilled data professionals.
 Organizational resistance.

2. What are the problems encountered when working with more data than can fit in memory?
 Not enough memory
 Processes that never end
 Some components form a bottleneck while others remain idle
 Not enough speed

3. Mention the solutions for handling large data sets.


 Choose the right algorithms
 Choose the right data structures
 Choose the right tools


4. What are the algorithms involved in handling large data sets?


 Online algorithms
 Block Matrices
 MapReduce

5. Define Perceptron
A perceptron is one of the least complex machine learning algorithms; it is used for binary classification (0 or 1).
6. State how the train_observation() function works.
This function has two large parts.
 The first is to calculate the prediction of an observation and compare it to the actual value.
 The second part is to change the weights if the prediction seems to be wrong.
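The two parts can be sketched as a small function. This is a hypothetical implementation written for these notes, not the original train_observation() code: the signature, the learning rate, the bias term, and the logical-AND training data are all assumptions. It needs only NumPy.

```python
import numpy as np

def train_observation(weights, bias, x, y_true, learning_rate=0.1):
    """Process one observation; return the updated (weights, bias, error)."""
    # Part 1: calculate the prediction and compare it to the actual value.
    prediction = 1 if np.dot(weights, x) + bias > 0 else 0
    error = y_true - prediction
    # Part 2: change the weights only if the prediction was wrong.
    if error != 0:
        weights = weights + learning_rate * error * x
        bias = bias + learning_rate * error
    return weights, bias, error

# Hypothetical training data: the logical AND of two inputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])

weights, bias = np.zeros(2), 0.0
for epoch in range(100):
    for xi, yi in zip(X, y):
        weights, bias, err = train_observation(weights, bias, xi, yi)

predictions = [1 if np.dot(weights, xi) + bias > 0 else 0 for xi in X]
print(predictions)
```

Because the function sees one observation at a time, the full data set never has to be held in memory at once.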

7. What are the options available for feeding data to a learning algorithm?
There are three options:
1. Full batch learning (also called statistical learning) - Feed the algorithm all the data at once.
2. Mini-batch learning - Feed the algorithm a spoonful (100, 1000, depending on what your
hardware can handle) of observations at a time.
3. Online learning - Feed the algorithm one observation at a time.
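The three options differ only in how many observations are handed over at a time, which a single generator can show. The helper name iter_batches and the ten-element data set are assumptions for illustration; NumPy is used only to create the data.

```python
import numpy as np

def iter_batches(data, batch_size):
    """Yield consecutive slices of at most batch_size observations."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

observations = np.arange(10)

full_batch = list(iter_batches(observations, len(observations)))  # 1 batch of 10
mini_batches = list(iter_batches(observations, 4))                # batches of 4, 4 and 2
online = list(iter_batches(observations, 1))                      # 10 batches of 1

print(len(full_batch), len(mini_batches), len(online))
```

With a generator, only the current batch needs to be in memory, which is the point of the mini-batch and online options.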

8. What is meant by MapReduce algorithms?
MapReduce implements various mathematical algorithms to divide a task into small parts and assign
them to multiple systems.
Example: To count all the votes for the national elections, where the country has 25 parties, 1,500 voting offices,
and 2 million people, you could either gather all the voting tickets from every office and count
them centrally, or ask the local offices to count the votes for the 25 parties and hand over the results,
which can then be aggregated by party.
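The vote-counting example can be mimicked on one machine with the standard library. This sketch only imitates the map and reduce phases in a single process; the three offices and their ballots are invented, and a real MapReduce system would run the map step on separate machines.

```python
from collections import Counter
from functools import reduce

# Hypothetical ballots collected at three voting offices.
offices = [
    ["A", "B", "A", "C"],
    ["B", "B", "A"],
    ["C", "A", "A", "B"],
]

# Map phase: each office counts its own votes independently.
local_counts = [Counter(ballots) for ballots in offices]

# Reduce phase: the local results are aggregated by party.
total = reduce(lambda left, right: left + right, local_counts, Counter())

print(total)
```

Only the small per-office counts travel to the central step, not the individual ballots.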
9. What do you mean by CRUD?
CRUD is the acronym for CREATE, READ, UPDATE and DELETE. These terms describe the four
essential operations for creating and managing persistent data elements, mainly in relational and
NoSQL databases
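The four operations can be tried against an in-memory SQLite database using Python's built-in sqlite3 module. The table name customers and the sample names are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
cur = conn.cursor()
# Schema setup (not one of the four CRUD operations on data).
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

# Create: insert a new row.
cur.execute("INSERT INTO customers (name) VALUES (?)", ("Asha",))

# Read: query the stored rows.
cur.execute("SELECT id, name FROM customers")
rows_after_create = cur.fetchall()

# Update: change an existing row.
cur.execute("UPDATE customers SET name = ? WHERE id = ?", ("Asha K", 1))
cur.execute("SELECT name FROM customers WHERE id = 1")
name_after_update = cur.fetchone()[0]

# Delete: remove the row.
cur.execute("DELETE FROM customers WHERE id = ?", (1,))
cur.execute("SELECT COUNT(*) FROM customers")
remaining = cur.fetchone()[0]

conn.close()
print(rows_after_create, name_after_update, remaining)
```

In SQL the four operations correspond to INSERT, SELECT, UPDATE and DELETE.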
10.What is Sparse Data?
Sparse data is a variable in which most cells do not contain actual data. Sparse
data is empty or has a zero value. Sparse data is different from missing data because sparse data
shows up as empty or zero, while missing data doesn't show what some or any of the values are.
11. What are the types of sparsity that occur when sparse data is present in your datasets?
There are two types of sparsity:
1. Controlled sparsity happens when there is a range of values for multiple dimensions that have
no value
2. Random sparsity occurs when there is sparse data scattered randomly throughout the datasets
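A sparse table can be stored compactly by keeping only the cells that hold a value. The sketch below uses a simple coordinate layout with NumPy; the 4 x 5 matrix and its three values are made up, and dedicated sparse-matrix libraries would normally be used in practice.

```python
import numpy as np

# Hypothetical 4 x 5 table in which only three cells hold a value.
dense = np.zeros((4, 5))
dense[0, 1] = 3.0
dense[2, 4] = 7.0
dense[3, 0] = 1.5

# Coordinate-style representation: row index, column index and value
# of every non-zero cell.
rows, cols = np.nonzero(dense)
values = dense[rows, cols]

# Fraction of cells that are zero.
sparsity = 1 - len(values) / dense.size

# The dense table can be rebuilt from the three short arrays.
rebuilt = np.zeros_like(dense)
rebuilt[rows, cols] = values

print("sparsity:", sparsity)
```

The sparser the data, the more memory this representation saves compared with storing every cell.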

12.State the different Python Tools


Python Tools
Python has a number of libraries that can help to deal with large data. They range from smarter data
structures over code optimizers to just-in-time compilers.
The following is a list of libraries we like to use when confronted with large data:
 Cython
 Numexpr
 Numba
 Bcolz
 Blaze
 Theano
 Dask

13.Mention the three general programming tips for dealing with large data sets.
1. Don't reinvent the wheel.
2. Get the most out of your hardware.
3. Reduce the computing need.

14.What is meant by Predicting Malicious URLs?


Predicting malicious URLs is essential for protecting users and organizations from a wide range of
cyber threats. It is a dynamic field that requires continuous adaptation to evolving threats, making
use of a combination of methods and technologies to ensure robust cybersecurity.
15. How is building a recommender system done?
Building a recommender system inside a database leverages the strengths of a database to provide
recommendation capabilities, including real-time recommendations, scalability,
personalization, and efficient data management. It can be particularly useful in scenarios where
low-latency and high-performance recommendations are critical, such as e-commerce platforms,
streaming services, and content recommendation systems.
16. What is the main purpose of Locality-Sensitive Hashing?
A technique is needed to find similar customers (local optima) without guaranteeing that it finds the
best customer (global optimum). A common technique used to solve this is called Locality-Sensitive
Hashing. The idea behind Locality-Sensitive Hashing is simple: construct functions that map similar
customers close together (they are put in a bucket with the same label) and make sure that objects
that are different are put in different buckets.
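One simple family of such functions samples a few bit positions from each customer's bit string and uses them as the bucket label. The sketch below is an illustration under stated assumptions: the customer names, the bit strings, and the sampled positions are invented (the positions are fixed here for reproducibility, whereas they would normally be drawn at random), and only the standard library is used.

```python
from collections import defaultdict

# Hypothetical customers described by 10 yes/no attributes.
customers = {
    "ann":  "1100110010",
    "bob":  "1100110011",   # differs from ann in one position
    "cara": "0011001101",   # differs from ann in every position
}

# Three hash functions, each keeping three bit positions (assumed values).
sampled_positions = [(0, 3, 6), (1, 4, 9), (2, 5, 8)]

def bucket_label(bits, positions):
    """Label of the bucket a bit string falls into for one hash function."""
    return "".join(bits[p] for p in positions)

# Put every customer into one bucket per hash function.
buckets = defaultdict(set)
for name, bits in customers.items():
    for i, positions in enumerate(sampled_positions):
        buckets[(i, bucket_label(bits, positions))].add(name)

def candidates(name):
    """Customers that share at least one bucket with `name`."""
    found = set()
    for members in buckets.values():
        if name in members:
            found |= members
    return found - {name}

print(candidates("ann"))
```

Only the candidates returned for a customer need to be compared in detail, which avoids comparing every customer with every other customer.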
17.What is meant by hamming distance?
The distance used to compare customers is called the hamming distance. The hamming
distance is used to calculate how much two strings differ. The distance is defined as the number of
positions at which the characters of two equal-length strings differ.
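The definition translates directly into a few lines of Python. The function name hamming_distance and the sample strings are chosen for illustration.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    # Compare the strings position by position and count the mismatches.
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))

print(hamming_distance("1100110010", "1100110011"))  # → 1
print(hamming_distance("karolin", "kathrin"))        # → 3
```

A distance of 0 means the two strings are identical; the maximum possible distance is the length of the strings.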
18. How does a MySQL database connect to a Python library?
To connect to the MySQL server from Python, install SQLAlchemy or another library capable of
communicating with MySQL, such as MySQLdb. On Windows, use Conda to install it.
19.Define SVMLight.


SVMLight is a machine learning software package developed by Thorsten Joachims. It is designed for
solving classification and regression problems using Support Vector Machines (SVMs), which are a
type of supervised learning algorithm.
20.What is a Tree Structure?
Trees are a class of data structure that allows you to retrieve information much faster than scanning
through a table. A tree always has a root value and subtrees of children, each with its children, and so
on. Simple examples would be a family tree or a biological tree and the way it splits into branches,
twigs, and leaves.
PART B QUESTIONS
1. Explain in detail about the problems that you face when handling large data.
2. Explain in detail about the general techniques for handling large volumes of data.
3. Explain the steps involved in training the perceptron by observation.
4. Explain the steps involved in train functions
5. Explain in detail about block matrix calculations with bcolz and Dask libraries.
6. Explain in detail about general programming tips for dealing with large data sets.
7. Explain the case study Predicting Malicious URLs.
8. Explain the steps for creating a model to distinguish the malicious from the normal URLs.
9. Explain in detail about Building a recommender system inside a database.
10. Explain in detail the steps involved in creating the hamming distance.
11. Explain about the customers in the database who watched the movie or not using a recommender
system.
******************************************************************************************************

*||| ALL THE BEST |||*
