DATA SCIENCE
TY BSc
Introduction
What is Data?
• Data is a collection of facts, such as
numbers, words, measurements,
observations or just descriptions of things.
• Data is information of different types that is usually formatted in a particular manner.
What is data science?
• Data science is the deep study of massive amounts of data, which involves extracting meaningful insights from raw, structured, and unstructured data.
• Data science uses the most powerful hardware, programming systems, and the most efficient algorithms to solve data-related problems.
Data science is all about:
✓ Asking the correct questions and analyzing the raw data.
✓ Modeling the data using various complex and efficient
algorithms.
✓ Visualizing the data to get a better perspective.
✓ Understanding the data to make better decisions and
finding the final result.
Example:
• Suppose we want to travel from station A to station B by car. Now, we need to make some decisions, such as which route will get us to the destination fastest, which route will have no traffic jam, and which will be cost-effective. All these decision factors act as input data, and from these decisions we get an appropriate answer; this analysis of data is called data analysis, which is a part of data science.
Exploratory Data Analysis (EDA)
+ Data Visualization
• Exploratory data analysis is an approach to analysing data, usually carried out with visual methods.
• Exploratory data analysis (EDA) is the task of analysing data using simple tools from statistics and simple plotting tools.
What is the need of EDA?
• As the market grows, the size of data also grows.
• It becomes harder for companies to make decisions without properly analysing this data.
• Every machine learning problem-solving effort starts with EDA.
• With the use of charts and graphs, one can make sense of the data and check whether there is any relationship between variables or not.
• Once exploratory data analysis is complete and insights are drawn, its features can be used for supervised and unsupervised machine learning modelling.
• Various plots are used to draw conclusions. This helps the company make firm and profitable decisions.
How can we perform EDA?
• There are many tools with which one can perform EDA.
• Programming languages such as Python and R are commonly used to perform EDA.
• EDA is often considered the "art" part of a data scientist's work.
• The more creative we become with the data, the more insights we can visualize.
Here are the common graphs
used while performing EDA:
• Scatter plot
• Pair plots
• Histogram
• Box plots
• Scatter plot: plots the values of two features against each other as scattered points; it is mainly used to examine the relationship between 2 features.
• Pair plots: used to see the pairwise behaviour of all the features present in the dataset; we also get to see the PDF (probability density) representation of each feature.
• Box plots: show a percentile-based summary of a variable (median, quartiles and outliers).
• Histogram: histogram plots are used to depict the distribution of any continuous variable.
For performing EDA, we need to
import some libraries.
➢ NumPy, Pandas, Matplotlib and seaborn.
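• As a rough, illustrative sketch (not from the syllabus text), the code below builds a small made-up DataFrame and draws a histogram, a box plot and a pair plot with the libraries named above; the column names and values are invented for demonstration.

# Minimal EDA sketch; the DataFrame and its columns are hypothetical.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "total_bill": rng.normal(20, 8, 200).round(2),          # a continuous variable
    "tip": rng.normal(3, 1, 200).round(2),                   # another continuous variable
    "day": rng.choice(["Thur", "Fri", "Sat", "Sun"], 200),   # a categorical variable
})

print(df.head())       # first look at the rows
print(df.describe())   # summary statistics

sns.histplot(df["total_bill"]); plt.show()                  # distribution of a continuous variable
sns.boxplot(x="day", y="total_bill", data=df); plt.show()   # percentile view per category
sns.pairplot(df); plt.show()                                # pairwise scatter plots + distributions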
Principal Components Analysis (PCA)
• PCA stands for Principal Component Analysis.
• It is a dimensionality reduction method which is used to reduce the dimension of large datasets.
• In PCA we transform a large set of variables into a smaller one which still contains most of the information of the large set.
• This method combines highly correlated variables to form a smaller number of artificial variables, called “principal components”, which account for most of the variance in the data.
Applications of PCA
• Image processing.
• Movie recommendation system.
• Optimizing the power allocation
in various communication
channels.
The PCA algorithm is based on some
mathematical concepts such as:
• Dimensionality: It is the number of features or variables present in the
given dataset. More easily, it is the number of columns present in the
dataset.
• Correlation: it signifies how strongly two variables are related to each other; if one changes, the other variable also gets changed.
• Orthogonal: it means that variables are not correlated to each other, and hence the correlation between the pair of variables is zero.
• Eigenvectors: if M is a square matrix and v is a non-zero vector such that Mv = λv for some scalar λ (the eigenvalue), then v is an eigenvector of M.
• Covariance Matrix:A matrix containing the covariance between the pair
of variables is called the Covariance Matrix.
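• The hedged sketch below shows PCA with scikit-learn on a small synthetic dataset (the data and the choice of 2 components are assumptions for illustration): five correlated features are compressed into two principal components that retain most of the variance.

# Minimal PCA sketch on synthetic data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                              # 100 samples, 5 features
X[:, 1] = 2 * X[:, 0] + rng.normal(scale=0.1, size=100)    # make two features highly correlated

X_scaled = StandardScaler().fit_transform(X)               # PCA is sensitive to feature scale

pca = PCA(n_components=2)                                  # keep 2 principal components
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                   # (100, 2): dimension reduced from 5 to 2
print(pca.explained_variance_ratio_)     # share of variance captured by each component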
Data Management
• Data management is the process of storing,
organizing and maintaining the data created and
collected by an organization.
Data Management
• Data Collection :
Data collection is a process of gathering information from all
the relevant sources to find a solution to the research problem.
Data cleaning/extraction
• Data cleaning is the most important part of the Machine Learning process
and Data Scientists spend a lot of their time going through all of the data
within a database.
• Data Cleaning means the process of identifying the incorrect, incomplete,
inaccurate, irrelevant or missing part of the data and then modifying,
replacing or deleting them according to the necessity.
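• A minimal sketch of typical cleaning steps with pandas is shown below; the table and its columns are purely hypothetical.

# Data-cleaning sketch: duplicates, inconsistent formatting, invalid and missing values.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 40, 40, -3],
    "salary": [50000, 60000, None, None, 70000],
    "city":   ["Pune", "pune ", "Mumbai", "Mumbai", "Delhi"],
})

df = df.drop_duplicates()                                   # remove exact duplicate rows
df["city"] = df["city"].str.strip().str.title()             # fix inconsistent text formatting
df = df[df["age"] > 0]                                      # drop rows with invalid/missing age
df["salary"] = df["salary"].fillna(df["salary"].median())   # impute missing salaries

print(df)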
Data analysis & Modelling
• Data analysis is a technique to gain insight into an
organization's data.
• A data analyst might have the following responsibilities:
• To create and analyse important reports (possibly using a
third-party reporting, data warehousing, or business
intelligence system) to help the business make better
decisions.
• To merge data from multiple data sources together, as
part of data mining, so it can be analyzed and reported on.
• To run queries on existing data sources to evaluate
analytics and analyse trends.
• Data analysts will have hands-on access to the organization's data repositories and use their technical skills to query and manipulate the data. They may also be skilled in statistical analysis, with a high level of mathematical experience.
• The common thread among this diverse set of
job titles is that each role is responsible for
analyzing a specific type of data or using a
specific type of tool to analyse data.
What is Data Modelling?
• Data modeling is a set of tools and
techniques used to understand and
analyse how an organization should
collect, update, and store data.
• It is a critical skill for the business
analyst who is involved with discovering,
analyzing, and specifying changes to how
software systems create and maintain
information.
Introduction to high level programming language
• A high-level programming language is portable but requires interpretation or compilation to convert it into a machine language that the computer understands.
Assembly Language:
• Assembly language is a low-level programming language. It provides a human-readable representation of machine code.
• In computers, an assembler converts assembly code into executable machine code.
• Assembly language is designed so that its instructions correspond closely to machine-language instructions, which the machine then processes further.
Why is Assembly Language Useful?
• Assembly language helps programmers to
write human-readable code that is almost
similar to machine language.
• Machine language is difficult to understand
and read as it is just a series of numbers.
Assembly language helps in providing full
control of what tasks a computer is
performing.
Integrated Development Environment
(IDE)
• Integrated Development Environment (IDE) can be
defined as software that gives its users an environment
for performing programming, along with development
as well as testing and debugging the application.
• Integrated Development Environment comes as a
package with all tools required.
• Integrated Development Environment software is very user-friendly, with an easy-to-use interface that provides syntax suggestions for programmers.
Use of IDE
• The IDE editor usually provides syntax
highlighting .
• IDEs are also used for debugging, using
an integrated debugger.
DATA SCIENCE
UNIT III
Statistical Modelling and Machine Learning
Statistical Modelling:
• Statistical Modelling is the formalization of relationships between variables in the form of mathematical equations.
• A statistical model is the use of statistics to build a representation of the data and then conduct analysis to infer any relationships between variables or discover insights.
Machine Learning:
• Machine Learning is an algorithm that can learn from data without relying on rules-based programming.
• Machine learning, on the other hand, is the use of mathematical or statistical models to obtain a general understanding of the data to make predictions.
Introduction to model selection
• Here are two different types of modelling techniques:
➢ SVM
The objective of SVM algorithm is to find a hyperplane
in an N-dimensional space that distinctly classifies the
data points.
➢ K-NN
K Nearest Neighbour is a simple algorithm that stores
all the available cases and classifies the new data or
case based on a similarity measure.
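• As a hedged illustration of model selection, the sketch below compares SVM and K-NN on the same dataset using 5-fold cross-validation; the iris dataset and the hyperparameters are just example choices.

# Comparing two candidate models with cross-validation (scikit-learn).
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

for name, model in [("SVM", SVC(kernel="rbf")),
                    ("K-NN", KNeighborsClassifier(n_neighbors=5))]:
    scores = cross_val_score(model, X, y, cv=5)       # 5-fold cross-validation accuracy
    print(f"{name}: mean accuracy = {scores.mean():.3f}")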
Regularization
• This technique prevents the model from overfitting by
adding extra information to it.
• It is used to reduce the error by fitting a function appropriately on the given training set and avoiding overfitting.
Underfitting: (High bias and low variance )
• A statistical model or a machine learning algorithm
is said to have underfitting when it cannot capture
the underlying trend of the data.
• Underfitting destroys the accuracy of our machine
learning model. Its occurrence simply means that
our model or the algorithm does not fit the data well
enough.
• It usually happens when we have too little data to build an accurate model.
• Techniques to reduce underfitting:
• Increase model complexity
• Increase the number of features, performing
feature engineering
• Remove noise from the data.
• Increase the number of epochs or increase
the duration of training to get better results.
Overfitting: (High variance and low bias )
• A statistical model is said to be overfitted when a model gets
trained with so much data, it starts learning from the noise and
inaccurate data entries in our data set.
• Then the model does not categorize the data correctly, because
of too many details and noise.
• A solution to avoid overfitting is using a linear algorithm if we have linear data, or using parameters like the maximum depth if we are using decision trees.
Techniques to reduce overfitting:
1. Increase training data.
2. Reduce model complexity.
3. Early stopping during the training phase (keep an eye on the loss over the training period; as soon as the validation loss begins to increase, stop training).
• Bias: the simplifying assumptions made by a model to make the target function easier to learn.
• Variance: if you train a model on the training data and obtain a very low error, but on changing the data and training the same model you experience a high error, that is variance.
Regularization Techniques
Ridge Regressions
• The Ridge regression is a
regularization technique that uses L2
regularization.
• Ridge regression is a technique which is specialized to analyze multiple regression data that suffer from multicollinearity.
LASSO
• LASSO stands for Least Absolute Shrinkage and Selection
Operator.
• Lasso regression is one of the regularization methods that
create parsimonious models in the presence of a large number
of features, where large means either of the below two things:
1. Large enough to enhance the tendency of the model to over-
fit. Minimum ten variables can cause overfitting.
2. Large enough to cause computational challenges. This situation
can arise in case of millions or billions of features.
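• A minimal sketch of both techniques with scikit-learn is given below; the synthetic data and the alpha values are assumptions chosen only to show the effect of the L2 (Ridge) and L1 (LASSO) penalties.

# Ridge (L2) vs Lasso (L1) regularization on synthetic data.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))                 # 50 samples, 10 features
y = 3.0 * X[:, 0] + rng.normal(size=50)       # only the first feature actually matters

ridge = Ridge(alpha=1.0).fit(X, y)            # shrinks all coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)            # can set irrelevant coefficients exactly to zero

print("Ridge coefficients:", np.round(ridge.coef_, 2))
print("Lasso coefficients:", np.round(lasso.coef_, 2))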
AIC
• AIC stands for Akaike information criterion.
• AIC is a single number score that can be
used to determine which of multiple
models is most likely to be the best model
for a given dataset.
• It estimates models relatively, meaning
that AIC scores are only useful in
comparison with other AIC scores for the
same dataset. A lower AIC score is better.
• AIC is most frequently used in situations where one is
not able to easily test the model’s performance on a test
set in standard machine learning practice (small data,
or time series).
• AIC is particularly valuable for time series, because
time series analysis’ most valuable data is often the
most recent, which is stuck in the validation and test
sets.
• AIC works by evaluating the model’s fit on the training
data , The desired result is to find the lowest possible
AIC, which indicates the best balance of model fit with
generalizability.
• AIC equation: AIC = 2k − 2 ln(L), where L = maximum value of the likelihood and k = number of parameters.
• BIC stands for Bayesian Information Criterion.
• The BIC balances the number of model
parameters k and number of data
points n against the maximum likelihood
function, L.
• We seek to find the number of model
parameters k that minimizes the BIC.
• This form of the BIC derives from a paper by
Gideon Schwarz [1] from 1978.
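• As a hedged worked example, the sketch below computes AIC = 2k − 2 ln(L) and BIC = k ln(n) − 2 ln(L) for two hypothetical models fitted on the same dataset; the log-likelihood values, parameter counts and sample size are invented for illustration.

# Comparing two candidate models by AIC and BIC (lower is better).
import numpy as np

def aic(log_likelihood, k):
    return 2 * k - 2 * log_likelihood          # AIC = 2k - 2*ln(L)

def bic(log_likelihood, k, n):
    return k * np.log(n) - 2 * log_likelihood  # BIC = k*ln(n) - 2*ln(L)

n = 100  # hypothetical number of data points
print("Model A:", aic(-120.0, k=3), bic(-120.0, k=3, n=n))
print("Model B:", aic(-118.0, k=6), bic(-118.0, k=6, n=n))
# BIC penalizes extra parameters more heavily than AIC when n is large.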
Data Transformations:
• Data transformation is the process of
changing the format, structure, or
values of data.
• Organizations that use on-premises
data warehouses generally use an ETL
(extract, transform, load) process, in
which data transformation is the
middle step.
Dimension Reduction:
• Dimension reduction follows the same principle as zipping data.
• Dimension reduction compresses a large set of features onto a new feature subspace of lower dimension without losing the important information.
• It is hard to visualize a large set of dimensions. Dimension reduction techniques can be employed to reduce a 20+ dimension feature space to a 2- or 3-dimension subspace.
Feature Extraction
• Feature extraction is a dimensionality reduction
mechanism by which an initial collection of raw data is
reduced for processing to more manageable classes.
• Example:
A spam-detection program is one example of feature extraction. If we had a large set of emails and the keywords contained in these emails, then similarities between the different keywords could be identified through a feature extraction process.
Regularization
• What is Regularization?
Regularization is one of the most important concepts of machine
learning.
It is a technique to prevent the model from overfitting by adding
extra information to it.
Sometimes the machine learning model performs well with the training data but does not perform well with the test data. This means the model cannot predict the output when it deals with unseen data, because noise has been introduced into the output; such a model is called overfitted. This problem can be dealt with using a regularization technique.
Supervised Learning:
• Regression:
Regression is a method to determine the statistical
relationship between a dependent variable and one or
more independent variables.
This can be broadly classified into two major types.
• Linear Regression
• Logistic Regression
Linear Regression model :
• The simplest case of linear regression is to find a
relationship using a linear model (i.e line) between
an input independent variable (input single feature)
and an output dependent variable.
• We find the relationship between them with the
help of the best fit line which is also known as the
Regression line.
• y = m * x + b
• x: Independent Variable
• y: Dependent Variable
• m: Slope of Line
• b: y Intercept
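• A minimal sketch of fitting y = m * x + b with scikit-learn is shown below; the five data points are made up for illustration.

# Simple linear regression: estimate slope m and intercept b.
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # independent variable
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])            # dependent variable (roughly 2*x)

model = LinearRegression().fit(x, y)
print("slope m:", model.coef_[0])          # estimated slope of the regression line
print("intercept b:", model.intercept_)    # estimated y-intercept
print("prediction at x = 6:", model.predict([[6.0]])[0])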
• Logistic Regression models:
• Logistic regression is used for classification-like problems. The output can be Success/Failure, Yes/No, True/False or 0/1.
• If the output has only two possibilities, it is called binary logistic regression.
• If the dependent output has more than two possibilities and there is no ordering among them, it is called multinomial logistic regression.
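• A minimal binary logistic regression sketch is given below; the "hours studied vs pass/fail" data is a made-up example of a Yes/No (0/1) output.

# Binary logistic regression: two possible outputs, 0 or 1.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])   # hours studied (hypothetical)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])                   # fail (0) / pass (1)

clf = LogisticRegression().fit(X, y)
print(clf.predict([[2.5], [6.5]]))         # predicted classes for new students
print(clf.predict_proba([[2.5], [6.5]]))   # probability of each class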
Decision Trees
• Decision tree algorithms are nothing but if-else
statements that can be used to predict a result
based on data.
• For instance, a simple decision tree can predict whether a passenger on the Titanic survived.
Regression trees & Classification
• "Classification and regression trees" is a term used to describe decision tree algorithms.
• The Classification and Regression Tree methodology, also known as CART, was introduced in 1984 by Leo Breiman.
Regression Trees
• A regression tree refers to an algorithm where the target variable is continuous and the algorithm is used to predict its value.
• As an example of a regression type problem, you
may want to predict the selling prices of a
residential house, which is a continuous dependent
variable.
• This will depend on both continuous factors like
square footage as well as categorical factors like the
style of home, area in which the property is located,
and so on.
When to use Regression Trees
• Regression trees are used when the response variable is continuous.
• If the response variable is something like the price of a property or the temperature of the day, a regression tree is used.
• Regression trees are used for prediction-
type problems.
Classification Trees
• A classification tree is an algorithm where the
target variable is fixed or categorical.
• The algorithm is then used to identify the
“class” within which a target variable would
most likely fall.
• Example
There are two variables; income and age; which
determine whether or not a consumer will buy a
particular kind of phone.
When to use classification Trees
• Classification trees are used when the dataset needs to be split into classes that belong to the response variable, for example the classes Yes or No.
• In some cases, there may be more than
two classes in which case a variant of
the classification tree algorithm is
used.
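• The hedged sketch below fits one classification tree (will a consumer buy the phone, given income and age?) and one regression tree (house price from square footage) with scikit-learn; all numbers are invented for illustration.

# CART-style trees: classification (categorical target) and regression (continuous target).
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: buy the phone (1) or not (0), from income and age.
X_cls = np.array([[30000, 22], [80000, 35], [25000, 45], [90000, 28], [40000, 50]])
y_cls = np.array([0, 1, 0, 1, 0])
clf = DecisionTreeClassifier(max_depth=2).fit(X_cls, y_cls)
print(clf.predict([[70000, 30]]))          # predicted class for a new consumer

# Regression tree: house price (continuous) from square footage.
X_reg = np.array([[800], [1000], [1200], [1500], [2000]])
y_reg = np.array([95.0, 120.0, 150.0, 185.0, 250.0])   # illustrative prices
reg = DecisionTreeRegressor(max_depth=2).fit(X_reg, y_reg)
print(reg.predict([[1300]]))               # predicted price for a 1300 sq. ft. house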
Time-series Analysis
• Time series analysis is a specific way of analyzing a
sequence of data points collected over an interval of
time.
• Time Series used to see how an object behaves over a
period of time.
• A time series takes the data vector, and each data point is associated with a timestamp value as given by the user.
• This function is mostly used to learn and forecast the
behavior of an asset in business for a period of time.
Example
• sales analysis of a company
• inventory analysis
• price analysis of a particular stock
or market
• population analysis
Components of time series forecasting
Syntax:
objectName <- ts(data, start, end, frequency)
where,
• data represents the data vector
• start represents the first observation in time series
• end represents the last observation in time series
• frequency represents the number of observations per unit of time. For example, frequency = 12 for monthly data and frequency = 1 for annual data.
Forecasting
• Forecasting is to predict or
estimate (a future event or
trend). For businesses and
analysts forecasting is determining
what is going to happen in the
future by analyzing what
happened in the past and what is
going on now.
k-means clustering
• k-means clustering is one of the simplest algorithms that uses an unsupervised learning method to solve well-known clustering problems.
• k-means clustering requires the following two inputs (see the sketch below):
1. k = number of clusters
2. Training set (m) = {x1, x2, x3, ..., xm}
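• A minimal k-means sketch with scikit-learn; the six 2-D training points and k = 2 are just example choices.

# k-means: k clusters and a training set of m points.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1.5, 2], [1, 0.5],      # training set, m = 6 points
              [8, 8], [8.5, 9], [9, 8]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)   # k = 2
print(kmeans.labels_)            # cluster assigned to each training point
print(kmeans.cluster_centers_)   # coordinates of the two cluster centres
print(kmeans.predict([[0, 0]]))  # cluster for a new point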
K-means clustering Application
• Market segmentation
• Document Clustering
• Image segmentation
• Image compression
• Customer segmentation
Hierarchical clustering
• The hierarchical clustering algorithm aims to find
nested groups of the data by building the hierarchy.
• Hierarchical clusters are generally represented using
the hierarchical tree known as a dendrogram.
Types of Hierarchical Clustering
• Divisive
This is a top-down approach, where it initially considers the entire data as
one group, and then iteratively splits the data into subgroups.
• Agglomerative
It is a bottom-up approach that relies on the merging of clusters. Two
clusters are merged into one iteratively thus reducing the number of
clusters in every iteration.
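• A hedged sketch of the agglomerative (bottom-up) approach with SciPy is shown below; the six 2-D points are purely illustrative.

# Agglomerative hierarchical clustering with a dendrogram.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[1, 1], [1.5, 2], [1, 0.5],
              [8, 8], [8.5, 9], [9, 8]])

Z = linkage(X, method="ward")                    # iteratively merge the closest clusters
print(fcluster(Z, t=2, criterion="maxclust"))    # cut the hierarchy into 2 flat clusters

dendrogram(Z)                                    # visualize the hierarchy as a dendrogram
plt.show()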
K-Nearest Neighbour
• K-nearest neighbors (KNN) algorithm is a type of
supervised ML algorithm which can be used for
classification.
Working of KNN Algorithm
• K-nearest neighbors (KNN)
algorithm uses ‘feature similarity’
to predict the values of new
datapoints which further means
that the new data point will be
assigned a value based on how
closely it matches the points in the
training set.
• For example, consider the three nearest neighbours of a new data point (shown as a black dot). Two of those three lie in the Red class, so the black dot will also be assigned to the Red class.
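• A minimal KNN sketch with scikit-learn, mirroring the example above with k = 3; the training points and class labels are invented.

# K-nearest neighbours classification (k = 3).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[1, 1], [2, 1], [1, 2],      # class 0 ("Red")
                    [6, 6], [7, 6], [6, 7]])     # class 1 ("Blue")
y_train = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(knn.predict([[2, 2]]))   # the majority of the 3 nearest neighbours decides the class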
Pros of KNN
• It is a very simple algorithm to understand and interpret.
• It has relatively high accuracy, but there are much better supervised learning models than KNN.
Cons :
• It is a computationally somewhat expensive algorithm because it stores all of the training data.
• Prediction is slow in case of big N.
• High memory storage required as
compared to other supervised learning
algorithms.
Applications of KNN
• Banking System
KNN can be used in the banking system to predict whether an individual is fit for loan approval, and whether that individual has characteristics similar to those of defaulters.
• Calculating Credit Ratings
KNN algorithms can be used to find an individual’s
credit rating by comparing with the persons having
similar traits.
k-means clustering
• It is an iterative algorithm that divides the unlabelled dataset into k different clusters in such a way that each data point belongs to only one group, whose members share similar properties.
Ensemble Methods
Types of Ensemble Methods
• Sequential Methods :
In this kind of ensemble method, the base learners are generated sequentially, so data dependency exists between them: each base learner depends on the data and results of the previous one. The previously mislabelled data are then reweighted so that the performance of the overall system improves.
• Parallel Method :
In this kind of ensemble method, the base learners are generated in parallel, so there is no data dependency between them: every base learner is generated independently.
Classification of Ensemble Methods
• Bagging
• This ensemble method combines two techniques, bootstrapping and aggregation, into a single ensemble model.
• Boosting
• Boosting is one of the sequential ensemble methods, in which each model or classifier is trained in sequence and its results are utilized by the next model.
Stacking
• This method also combines multiple classifications or
regression techniques using a meta-classifier or meta-model.
• The lower-level models are trained with the complete training dataset, and then the combined (meta) model is trained on the outputs of the lower-level models.
Random Forest:
• Random forest means random selection of subsets of the samples (and features), which reduces the chance of obtaining highly correlated predictions.
• Random forest slightly increases the bias of the forest, but because the predictions of many weakly correlated trees are averaged, the variance decreases and overall performance usually improves.
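• The hedged sketch below compares the ensemble methods above (bagging, boosting, stacking and a random forest) with scikit-learn; the iris dataset, the estimator counts and the choice of base learners are only illustrative.

# Comparing ensemble methods with 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

models = {
    "Bagging":       BaggingClassifier(n_estimators=50, random_state=0),
    "Boosting":      AdaBoostClassifier(n_estimators=50, random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Stacking":      StackingClassifier(
        estimators=[("knn", KNeighborsClassifier()),
                    ("lr", LogisticRegression(max_iter=1000))],
        final_estimator=LogisticRegression()),              # meta-model on top
}

for name, model in models.items():
    print(name, round(cross_val_score(model, X, y, cv=5).mean(), 3))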
DATA SCIENCE
UNIT II
Data curation
• Data Curation is a means of managing data that
makes it more useful for users engaging in data
discovery and analysis.
• Data curators collect data from diverse sources,
integrating it into repositories that are many times
more valuable than the independent parts.
• Data Curation includes data authentication, archiving, management, preservation, retrieval, and representation.
Data Curation:
• Data curation is the organization and integration of data collected
from various sources.
• It is the process of collecting data from diverse sources and integrating it into repositories that are many times more valuable than the independent parts.
• Data curation activities:
• Preserving: Collecting and taking care of research data.
• Sharing: Revealing data’s potential across domains
• Discovering: Promoting the re-use and new combinations of data.
Query languages :
• Query language is a language which is
used to retrieve information from a
database.
• Query language is divided into two
types as follows −
• Procedural language
• Non-procedural language
• Procedural language:
• In this, the program code is written as a sequence of instructions. The user has to specify both “what to do” and “how to do it”.
• Eg. C, PASCAL
➢ Information is retrieved from the database by specifying the sequence of operations to be performed.
• Non-Procedural language :
• In this the user has to specify only “what to do” and not “how
to do it”.
➢ Information is retrieved from the database without specifying
the sequence of operation to be performed. Users only specify
what information is to be retrieved.
➢ Eg. SQL, PROLOG
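• As a hedged illustration of the non-procedural style, the sketch below runs an SQL query through Python's built-in sqlite3 module; the table "students" and its columns are hypothetical.

# Non-procedural query: SQL states WHAT to retrieve, not HOW to retrieve it.
import sqlite3

conn = sqlite3.connect(":memory:")                      # throwaway in-memory database
conn.execute("CREATE TABLE students (name TEXT, marks INTEGER)")
conn.executemany("INSERT INTO students VALUES (?, ?)",
                 [("Asha", 82), ("Ravi", 58), ("Meera", 91)])

# The query below only describes the desired result; no sequence of
# operations (loops, index lookups, etc.) is specified by the user.
for row in conn.execute("SELECT name, marks FROM students WHERE marks > 60"):
    print(row)

conn.close()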
Structured data
• Structured data is information that has been
formatted and transformed into a well-defined data
model.
• SQL relational databases, consisting of tables with
rows and columns, are the perfect example of
structured data.
Semi-Structured Data
• Your data may not always be structured or
unstructured; semi-structured data or partially
structured data is another category between
structured and unstructured data.
• Organizational properties like metadata or
semantics tags are used with semi-structured data to
make it more manageable; however, it still contains
some variability and inconsistency.
Unstructured Data
• Unstructured data is defined as data present in absolute raw
form.
• This data is difficult to process due to its complex
arrangement and formatting.
• Unstructured data management may take data from many
forms, including social media posts, chats, satellite imagery,
IoT sensor data, emails, and presentations, to organize it in a
logical, predefined manner.
Security and ethical considerations in relation to
authenticating and authorizing access to data on
remote systems.
• Authentication and authorization both play
important roles in online security systems.
• They confirm the identity of the user and
grant access to your website or application.
• It’s vital that you make note of their differences so you can determine which combination of web tools best suits your security needs.
Authentication
• Authentication is the process that confirms a user’s
identity and provides access to sensitive information.
• Traditionally, this is done through a username and
password.
• The user enters their username, which allows the system
to confirm their identity. This system relies on the fact
that (hopefully) only the user and the server know the
password.
• The website authentication process works by comparing
the user’s credentials with the ones on file. If a match is
found, the authentication process is complete, and the
individual can be pushed through to the authorization
process.
• While password authentication is the most
common way to confirm a user’s identity, it
isn’t even close to the most effective or secure
method.
• Think about it: anyone with your credentials
could access your account without your
permission, and the system wouldn’t stop them.
Most passwords are weak, and hacking
techniques can break them in less and less time.
• Luckily, passwords aren’t the only way to
authenticate your users. That’s why we’ll cover
two alternative methods that sites can use to
verify a user’s identity.
Types of authentication
• Biometric Authentication
Biometric authentication includes any method that requires a
user’s biological characteristics to verify their identity.
While this may seem like new-age technology, you've probably been using it to unlock the screen on your smartphone for
years. Fingerprint scanning is the most well-known form of
biometric authentication, but facial recognition tools are an
increasingly popular choice for developers and users alike.
However, it’s important to note that these processes are often
less secure than you might initially assume.
• For example, small fingerprint scanners on
smartphones only record portions of your
fingerprint. Multiple images of part of a fingerprint
are much less secure than a single, clear image.
• Some hackers have created a “master fingerprint” that contains characteristics of the most common prints, allowing them to trick the scanners.
• Remember, too, that biometric authentication can’t
be changed or altered if a user’s fingerprints have
been compromised. While biometric authentication
holds a lot of promise, it’s really most useful as a
second factor in a multi-factor authentication
strategy.
• Email Authentication
Email authentication is a passwordless option that allows users to securely log in to any account using just an email address.
The process is very similar to signing in with a
Facebook or Twitter account, but this method
offers a more universal approach.
After all, the vast majority of individuals in the
U.S. have at least one email address.
Here's how your site can authenticate users using Swoop's Magic Message passwordless authentication:
1. The user is redirected to the Swoop service via the OAuth 2.0 protocol
for authentication.
2. From a browser window, the user pushes the “Send Magic
Message” button: The button activates a mailto link, which generates
a pre-written email for the user to send.
3. The user sends the email: This is where the magic happens. Once the
email is sent, the outgoing email server generates and embeds a
1024/2048 bit, fully encrypted digital key into the header of the email.
Swoop’s authentication server follows the public key cryptographic
procedure to decrypt this key. Each email sent receives a unique key
for that message. The level of security for these encrypted keys is
beyond comparison to traditional passwords.
4. The user is logged into their account: When the key decrypts and
passes all layers of verification, the Swoop authentication server directs
the website to open the user’s account and begin a session. This all
takes place in a matter of seconds and makes for an extremely
streamlined user experience.
• Aside from being inherently more secure than a
password, email authentication tools like Swoop will
also provide a more efficient and elegant user
experience which has helped increase new signups
by up to 49%.
• That way, users know their data is protected, and you can avoid any potential breaches in security. It's a win-win for users and developers alike!
Authorization
• Authorization is the next step in the login process,
which determines what a user is able to do and see on
your website.
• Once a user’s identity has been verified through the
authentication process, authorization determines what
permissions they have.
• Permissions are what the user is able to do and see on
your website or server.
• Without them, every user would have the same
abilities and access to the same information (including
the sensitive data that belongs to another user).
Permissions are crucial for a few reasons:
• They prevent a user from accessing an account
that isn’t theirs.
Imagine if your online banking application didn’t
have permissions. When you logged in, you’d not
only have access to your account but also every other
user’s account on the application! Permissions ensure
users can access and modify only what they need to.
• They restrict free accounts from getting premium
features.
To restrict free accounts from gaining access to your
premium features, you need to implement specific
permissions so that each account only has access to
the capabilities they paid for.
Security authentication vs. authorization
• The easiest way to understand this relationship is by
asking yourself the questions, “Who are you?” and
“What are you allowed to do?”
• Software development tools
• A software development tool or a
software programming tool is a
computer program utilized by
software developers to create,
maintain, edit, support, and debug
other programs, frameworks, or
applications.
• Dataiku DSS
• One of the far-reaching data science software
platforms is Dataiku DSS. Data Analysts,
Engineers, and Data Scientists use this tool to
develop and deliver data products.
• Dataiku DSS has more than 80 built-in functions to
prepare, enrich, and clean data.
• With this tool, you can develop, deploy, and optimise R and Python models. Moreover, it allows you to use code APIs to integrate with any ML library.
• Tableau
• Tableau is a powerful and fast-growing data visualization tool used in the Business Intelligence and data science industry.
• It helps in simplifying raw data in a very
easily understandable format.
• Data analysis is very fast with Tableau tool
and the visualizations created are in the form
of dashboards and worksheets.
• The best features of Tableau software are
• Data Blending
• Real time analysis
• Collaboration of data
• The great thing about Tableau software is that it
doesn’t require any technical or any kind of
programming skills to operate. The tool has
garnered interest among the people from all sectors
such as business, researchers, different industries,
etc.
• TensorFlow tool
• TensorFlow is a free and open-source
software library for machine learning ,
artificial intelligence and data science.
It can be used across a range of tasks
but has a particular focus on training
and inference of deep neural
networks.
• playground.tensorflow.org
• DataRobot
• DataRobot is a fast way to prepare data and make predictions. Data is the fuel that drives AI innovation. Unleashing its benefits across the enterprise depends on immediate self-service access to any data, as well as empowering not just highly skilled data scientists and data engineers but every analyst and citizen data scientist.
Amazon Web Services (AWS)
• Amazon Web Service is a subsidiary of Amazon.com.
• This company was started in the year 2006.
• This company brings the best example of web-service achieved
through the service oriented architecture.
• Amazon has made it possible to develop private virtual servers that
can run worldwide via 'Hardware Virtualization' on Xen
hypervisor.
• These servers can be provisioned with different types of application software that the user might need, along with a range of support services that not only make cloud-computing applications possible but also make them robust enough to withstand heavy computation.
• Infrastructure As A Service (IAAS) is a means of delivering computing infrastructure as on-demand services.
• It is one of the three fundamental cloud service models, providing servers, storage, networking and operating systems.
• Platform As A Service (PAAS) is a cloud delivery model for applications composed of services managed by a third party.
• It provides elastic scaling of your application, allows developers to build applications and services over the internet, and supports public, private and hybrid deployments.
• Software As A Service (SAAS) allows users to run existing online applications. It is a model in which software is deployed as a hosted service and accessed over the internet; in this software delivery model, the software and its associated data are hosted centrally and accessed by clients, usually through a web browser over the web.
• SAAS services are used for the development and deployment of modern applications.
Difference between IAAS, PAAS and SAAS :
• Amazon Web Service runs on the SOA standard and the SOAP, REST and HTTP transfer protocols, together with open-source and commercial operating systems, browser-based software and application servers.
• AWS offers various suites of Cloud computing
technology that makes up an on-demand
computational platform.
• These services are operated from twelve different geographical locations, and among them the best-known are Amazon's Elastic Compute Cloud (EC2) and Amazon's Simple Storage Service (S3).
Components and services of AWS
• Amazon Elastic Compute Cloud: EC2 is the centralized
application of AWS which facilitates the management and usage of
virtual private servers that can run on Windows and Linux-based
platforms over Xen Hypervisor.
• A number of tools are used to support Amazon's web services.
These are:
• Amazon Simple Queue Service is a message queue and transaction
system for distributed Internet-based applications.
• Amazon Simple Notification Service is used to publish messages from an application.
• Amazon CloudWatch is used for monitoring the EC2 cloud; it provides a console or command-line view of resource utilization.
• Elastic Load Balancing is used to detect whether an instance is failing
or check whether the traffic is healthy or not.
• Amazon's Simple Storage Service (S3) is an online storage and backup system with a high-speed data transfer technique called AWS Import/Export.
• Another web - services components are:
• Amazon's Elastic Block Store
• Amazon's Simple Database (DB)
• Amazon's Relational Database Service
• Amazon Cloudfront