DATA SCIENCE
TY BSc
Introduction
What is Data?
• Data is a collection of facts, such as
numbers, words, measurements,
observations or just descriptions of things.
• Data is information of different types that is usually formatted in a particular manner.
What is data science?
• Data science is the deep study of massive amounts of data, which involves extracting meaningful insights from raw, structured, and unstructured data.
• Data science uses the most powerful hardware, programming systems, and the most efficient algorithms to solve data-related problems.
Data science is all about:
✓ Asking the correct questions and analyzing the raw data.
✓ Modeling the data using various complex and efficient
algorithms.
✓ Visualizing the data to get a better perspective.
✓ Understanding the data to make better decisions and
finding the final result.
Example:
• Suppose we want to travel from station A to station B by car. Now, we need to make some decisions, such as which route will get us to the destination fastest, which route will have no traffic jam, and which will be cost-effective. All these decision factors act as input data, and from these decisions we get an appropriate answer; this analysis of data is called data analysis, which is a part of data science.
Exploratory Data Analysis (EDA)
+ Data Visualization
• Exploratory data analysis is an approach to analysing data, usually carried out with visual methods.
• Exploratory data analysis (EDA) is the task of analysing data using simple tools from statistics and simple plotting tools.
What is the need of EDA?
• As the market grows, the size of data also grows.
• It becomes harder for companies to make decisions without properly analysing this data.
• Every machine learning problem-solving effort starts with EDA.
• With the use of charts and graphs, one can make sense of the data and check whether there is any relationship between variables or not.
• Once exploratory data analysis is complete and insights are drawn, its features can be used for supervised and unsupervised machine learning modelling.
• Various plots are used to draw conclusions. This helps the company make firm and profitable decisions.
How can we perform EDA?
• There are many tools with which one can perform EDA.
• Programming languages such as Python and R are commonly used to perform EDA.
• EDA is often considered the "art" part of a data scientist's work.
• The more creative we become with the data, the more insights we can visualize.
Here are the common graphs
used while performing EDA:
• Scatter plot
• Pair plots
• Histogram
• Box plots
• Scatter plot: plots the values of two features against each other as scattered points; it is mainly used to examine the relationship between 2 features.
• Pair plots: used to see the pairwise behaviour of all the features present in the dataset; we also get to see the PDF (probability density) representation of each feature.
• Box plots: show a percentile-based summary of a variable (median, quartiles and outliers).
• Histogram: histogram plots are used to depict the distribution of any continuous variable.
For performing EDA, we need to
import some libraries.
➢ NumPy, Pandas, Matplotlib and seaborn.
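• As a rough, illustrative sketch (not from the syllabus text), the code below builds a small made-up DataFrame and draws a histogram, a box plot and a pair plot with the libraries named above; the column names and values are invented for demonstration.

# Minimal EDA sketch; the DataFrame and its columns are hypothetical.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "total_bill": rng.normal(20, 8, 200).round(2),          # a continuous variable
    "tip": rng.normal(3, 1, 200).round(2),                   # another continuous variable
    "day": rng.choice(["Thur", "Fri", "Sat", "Sun"], 200),   # a categorical variable
})

print(df.head())       # first look at the rows
print(df.describe())   # summary statistics

sns.histplot(df["total_bill"]); plt.show()                  # distribution of a continuous variable
sns.boxplot(x="day", y="total_bill", data=df); plt.show()   # percentile view per category
sns.pairplot(df); plt.show()                                # pairwise scatter plots + distributions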
Principal Components Analysis (PCA)
• PCA stands for Principal Component Analysis.
• It is a dimensionality reduction method which is used to reduce the dimension of large datasets.
• In PCA we transform a large set of variables into a smaller one which still contains most of the information of the large set.
• This method combines highly correlated variables to form a smaller number of artificial variables, called “principal components”, which account for most of the variance in the data.
Applications of PCA
• Image processing.
• Movie recommendation system.
• Optimizing the power allocation
in various communication
channels.
The PCA algorithm is based on some
mathematical concepts such as:
• Dimensionality: It is the number of features or variables present in the
given dataset. More easily, it is the number of columns present in the
dataset.
• Correlation: it signifies how strongly two variables are related to each other; if one changes, the other variable also gets changed.
• Orthogonal: it means that variables are not correlated to each other, and hence the correlation between the pair of variables is zero.
• Eigenvectors: if M is a square matrix and v is a non-zero vector such that Mv = λv for some scalar λ (the eigenvalue), then v is an eigenvector of M.
• Covariance Matrix:A matrix containing the covariance between the pair
of variables is called the Covariance Matrix.
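• The hedged sketch below shows PCA with scikit-learn on a small synthetic dataset (the data and the choice of 2 components are assumptions for illustration): five correlated features are compressed into two principal components that retain most of the variance.

# Minimal PCA sketch on synthetic data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                              # 100 samples, 5 features
X[:, 1] = 2 * X[:, 0] + rng.normal(scale=0.1, size=100)    # make two features highly correlated

X_scaled = StandardScaler().fit_transform(X)               # PCA is sensitive to feature scale

pca = PCA(n_components=2)                                  # keep 2 principal components
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                   # (100, 2): dimension reduced from 5 to 2
print(pca.explained_variance_ratio_)     # share of variance captured by each component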
Data Management
• Data management is the process of storing,
organizing and maintaining the data created and
collected by an organization.
Data Management
• Data Collection :
Data collection is a process of gathering information from all
the relevant sources to find a solution to the research problem.
Data cleaning/extraction
• Data cleaning is the most important part of the Machine Learning process
and Data Scientists spend a lot of their time going through all of the data
within a database.
• Data Cleaning means the process of identifying the incorrect, incomplete,
inaccurate, irrelevant or missing part of the data and then modifying,
replacing or deleting them according to the necessity.
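• A minimal sketch of typical cleaning steps with pandas is shown below; the table and its columns are purely hypothetical.

# Data-cleaning sketch: duplicates, inconsistent formatting, invalid and missing values.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 40, 40, -3],
    "salary": [50000, 60000, None, None, 70000],
    "city":   ["Pune", "pune ", "Mumbai", "Mumbai", "Delhi"],
})

df = df.drop_duplicates()                                   # remove exact duplicate rows
df["city"] = df["city"].str.strip().str.title()             # fix inconsistent text formatting
df = df[df["age"] > 0]                                      # drop rows with invalid/missing age
df["salary"] = df["salary"].fillna(df["salary"].median())   # impute missing salaries

print(df)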
Data analysis & Modelling
• Data analysis is a technique to gain insight into an
organization's data.
• A data analyst might have the following responsibilities:
• To create and analyse important reports (possibly using a
third-party reporting, data warehousing, or business
intelligence system) to help the business make better
decisions.
• To merge data from multiple data sources together, as
part of data mining, so it can be analyzed and reported on.
• To run queries on existing data sources to evaluate
analytics and analyse trends.
• Data analysts will have hands-on access to the organization's data repositories and use their technical skills to query and manipulate the data. They may also be skilled in statistical analysis, with a high level of mathematical experience.
• The common thread among this diverse set of
job titles is that each role is responsible for
analyzing a specific type of data or using a
specific type of tool to analyse data.
What is Data Modelling?
• Data modeling is a set of tools and
techniques used to understand and
analyse how an organization should
collect, update, and store data.
• It is a critical skill for the business
analyst who is involved with discovering,
analyzing, and specifying changes to how
software systems create and maintain
information.
Introduction to high level programming language
• A high-level programming language is portable but requires interpretation or compilation to convert it into a machine language that the computer understands.
Assembly Language:
• Assembly language is a low-level programming language. It provides a human-readable representation of machine code.
• In computers, an assembler converts assembly code into executable machine code.
• Assembly language is designed so that its instructions correspond closely to machine-language instructions, which the machine then processes further.
Why is Assembly Language Useful?
• Assembly language helps programmers to
write human-readable code that is almost
similar to machine language.
• Machine language is difficult to understand
and read as it is just a series of numbers.
Assembly language helps in providing full
control of what tasks a computer is
performing.
Integrated Development Environment
(IDE)
• Integrated Development Environment (IDE) can be
defined as software that gives its users an environment
for performing programming, along with development
as well as testing and debugging the application.
• Integrated Development Environment comes as a
package with all tools required.
• Integrated Development Environment software is very user-friendly, with an easy-to-use interface that provides syntax suggestions for programmers.
Use of IDE
• The IDE editor usually provides syntax
highlighting .
• IDEs are also used for debugging, using
an integrated debugger.
DATA SCIENCE
UNIT III
Statistical Modelling and Machine Learning
Statistical Modelling:
• Statistical Modelling is the formalization of relationships between variables in the form of mathematical equations.
• A statistical model is the use of statistics to build a representation of the data and then conduct analysis to infer any relationships between variables or discover insights.
Machine Learning:
• Machine Learning is an algorithm that can learn from data without relying on rules-based programming.
• Machine learning, on the other hand, is the use of mathematical or statistical models to obtain a general understanding of the data to make predictions.
Introduction to model selection
• Here are two different types of modelling techniques:
➢ SVM
The objective of SVM algorithm is to find a hyperplane
in an N-dimensional space that distinctly classifies the
data points.
➢ K-NN
K Nearest Neighbour is a simple algorithm that stores
all the available cases and classifies the new data or
case based on a similarity measure.
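• As a hedged illustration of model selection, the sketch below compares SVM and K-NN on the same dataset using 5-fold cross-validation; the iris dataset and the hyperparameters are just example choices.

# Comparing two candidate models with cross-validation (scikit-learn).
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

for name, model in [("SVM", SVC(kernel="rbf")),
                    ("K-NN", KNeighborsClassifier(n_neighbors=5))]:
    scores = cross_val_score(model, X, y, cv=5)       # 5-fold cross-validation accuracy
    print(f"{name}: mean accuracy = {scores.mean():.3f}")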
Regularization
• This technique prevents the model from overfitting by
adding extra information to it.
• It is used to reduce the error by fitting a function appropriately on the given training set and avoiding overfitting.
Underfitting: (High bias and low variance )
• A statistical model or a machine learning algorithm
is said to have underfitting when it cannot capture
the underlying trend of the data.
• Underfitting destroys the accuracy of our machine
learning model. Its occurrence simply means that
our model or the algorithm does not fit the data well
enough.
• It usually happens when we have too little data to build an accurate model.
• Techniques to reduce underfitting:
• Increase model complexity
• Increase the number of features, performing
feature engineering
• Remove noise from the data.
• Increase the number of epochs or increase
the duration of training to get better results.
Overfitting: (High variance and low bias )
• A statistical model is said to be overfitted when a model gets
trained with so much data, it starts learning from the noise and
inaccurate data entries in our data set.
• Then the model does not categorize the data correctly, because
of too many details and noise.
• A solution to avoid overfitting is using a linear algorithm if we have linear data, or using parameters like the maximum depth if we are using decision trees.
Techniques to reduce overfitting:
1. Increase training data.
2. Reduce model complexity.
3. Early stopping during the training phase (keep an eye on the loss over the training period; as soon as the validation loss begins to increase, stop training).
• Bias: the simplifying assumptions made by a model to make the target function easier to learn.
• Variance: if you train a model on the training data and obtain a very low error, but on changing the data and training the same model you experience a high error, that is variance.
Regularization Techniques
Ridge Regressions
• The Ridge regression is a
regularization technique that uses L2
regularization.
• Ridge regression is a technique which is specialized to analyze multiple regression data that suffer from multicollinearity.
LASSO
• LASSO stands for Least Absolute Shrinkage and Selection
Operator.
• Lasso regression is one of the regularization methods that
create parsimonious models in the presence of a large number
of features, where large means either of the below two things:
1. Large enough to enhance the tendency of the model to over-
fit. Minimum ten variables can cause overfitting.
2. Large enough to cause computational challenges. This situation
can arise in case of millions or billions of features.
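• A minimal sketch of both techniques with scikit-learn is given below; the synthetic data and the alpha values are assumptions chosen only to show the effect of the L2 (Ridge) and L1 (LASSO) penalties.

# Ridge (L2) vs Lasso (L1) regularization on synthetic data.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))                 # 50 samples, 10 features
y = 3.0 * X[:, 0] + rng.normal(size=50)       # only the first feature actually matters

ridge = Ridge(alpha=1.0).fit(X, y)            # shrinks all coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)            # can set irrelevant coefficients exactly to zero

print("Ridge coefficients:", np.round(ridge.coef_, 2))
print("Lasso coefficients:", np.round(lasso.coef_, 2))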
AIC
• AIC stands for Akaike information criterion.
• AIC is a single number score that can be
used to determine which of multiple
models is most likely to be the best model
for a given dataset.
• It estimates models relatively, meaning
that AIC scores are only useful in
comparison with other AIC scores for the
same dataset. A lower AIC score is better.
• AIC is most frequently used in situations where one is
not able to easily test the model’s performance on a test
set in standard machine learning practice (small data,
or time series).
• AIC is particularly valuable for time series, because
time series analysis’ most valuable data is often the
most recent, which is stuck in the validation and test
sets.
• AIC works by evaluating the model’s fit on the training
data , The desired result is to find the lowest possible
AIC, which indicates the best balance of model fit with
generalizability.
• AIC equation: AIC = 2k − 2 ln(L), where L = maximum value of the likelihood and k = number of parameters.
• BIC stands for Bayesian Information Criterion.
• The BIC balances the number of model
parameters k and number of data
points n against the maximum likelihood
function, L.
• We seek to find the number of model
parameters k that minimizes the BIC.
• This form of the BIC derives from a paper by
Gideon Schwarz [1] from 1978.
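• As a hedged worked example, the sketch below computes AIC = 2k − 2 ln(L) and BIC = k ln(n) − 2 ln(L) for two hypothetical models fitted on the same dataset; the log-likelihood values, parameter counts and sample size are invented for illustration.

# Comparing two candidate models by AIC and BIC (lower is better).
import numpy as np

def aic(log_likelihood, k):
    return 2 * k - 2 * log_likelihood          # AIC = 2k - 2*ln(L)

def bic(log_likelihood, k, n):
    return k * np.log(n) - 2 * log_likelihood  # BIC = k*ln(n) - 2*ln(L)

n = 100  # hypothetical number of data points
print("Model A:", aic(-120.0, k=3), bic(-120.0, k=3, n=n))
print("Model B:", aic(-118.0, k=6), bic(-118.0, k=6, n=n))
# BIC penalizes extra parameters more heavily than AIC when n is large.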
Data Transformations:
• Data transformation is the process of
changing the format, structure, or
values of data.
• Organizations that use on-premises
data warehouses generally use an ETL
(extract, transform, load) process, in
which data transformation is the
middle step.
Dimension Reduction:
• Dimension reduction follows the same principle as zipping data.
• Dimension reduction compresses a large set of features onto a new feature subspace of lower dimension without losing the important information.
• It is hard to visualize a large set of dimensions. Dimension reduction techniques can be employed to reduce a 20+ dimension feature space to a 2- or 3-dimension subspace.
Feature Extraction
• Feature extraction is a dimensionality reduction
mechanism by which an initial collection of raw data is
reduced for processing to more manageable classes.
• Example:
A spam-detection program is one example of feature extraction. If we had a large set of emails and the keywords contained in these emails, then similarities between the different keywords could be identified through a feature extraction process.
Regularization
• What is Regularization?
Regularization is one of the most important concepts of machine
learning.
It is a technique to prevent the model from overfitting by adding
extra information to it.
Sometimes the machine learning model performs well with the training data but does not perform well with the test data. This means the model cannot predict the output when it deals with unseen data, because noise has been introduced into the output; such a model is called overfitted. This problem can be dealt with using a regularization technique.
Supervised Learning:
• Regression:
Regression is a method to determine the statistical
relationship between a dependent variable and one or
more independent variables.
This can be broadly classified into two major types.
• Linear Regression
• Logistic Regression
Linear Regression model :
• The simplest case of linear regression is to find a
relationship using a linear model (i.e line) between
an input independent variable (input single feature)
and an output dependent variable.
• We find the relationship between them with the
help of the best fit line which is also known as the
Regression line.
• y = m * x + b
• x: Independent Variable
• y: Dependent Variable
• m: Slope of Line
• b: y Intercept
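• A minimal sketch of fitting y = m * x + b with scikit-learn is shown below; the five data points are made up for illustration.

# Simple linear regression: estimate slope m and intercept b.
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # independent variable
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])            # dependent variable (roughly 2*x)

model = LinearRegression().fit(x, y)
print("slope m:", model.coef_[0])          # estimated slope of the regression line
print("intercept b:", model.intercept_)    # estimated y-intercept
print("prediction at x = 6:", model.predict([[6.0]])[0])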
• Logistic Regression models:
• Logistic regression is used for classification-like problems. The output can be Success/Failure, Yes/No, True/False or 0/1.
• If the output has only two possibilities, it is called binary logistic regression.
• If the dependent output has more than two possibilities and there is no ordering among them, it is called multinomial logistic regression.
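• A minimal binary logistic regression sketch is given below; the "hours studied vs pass/fail" data is a made-up example of a Yes/No (0/1) output.

# Binary logistic regression: two possible outputs, 0 or 1.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])   # hours studied (hypothetical)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])                   # fail (0) / pass (1)

clf = LogisticRegression().fit(X, y)
print(clf.predict([[2.5], [6.5]]))         # predicted classes for new students
print(clf.predict_proba([[2.5], [6.5]]))   # probability of each class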
Decision Trees
• Decision tree algorithms are nothing but if-else
statements that can be used to predict a result
based on data.
• For instance, a simple decision tree can predict whether a passenger on the Titanic survived.
Regression trees & Classification
• "Classification and regression trees" is a term used to describe decision tree algorithms.
• The Classification and Regression Tree methodology, also known as CART, was introduced in 1984 by Leo Breiman.
Regression Trees
• A regression tree refers to an algorithm where the target variable is continuous and the algorithm is used to predict its value.
• As an example of a regression type problem, you
may want to predict the selling prices of a
residential house, which is a continuous dependent
variable.
• This will depend on both continuous factors like
square footage as well as categorical factors like the
style of home, area in which the property is located,
and so on.
When to use Regression Trees
• Regression trees are used when the response variable is continuous.
• If the response variable is something like the price of a property or the temperature of the day, a regression tree is used.
• Regression trees are used for prediction-
type problems.
Classification Trees
• A classification tree is an algorithm where the
target variable is fixed or categorical.
• The algorithm is then used to identify the
“class” within which a target variable would
most likely fall.
• Example
There are two variables; income and age; which
determine whether or not a consumer will buy a
particular kind of phone.
When to use classification Trees
• Classification trees are used when the dataset needs to be split into classes that belong to the response variable, for example the classes Yes or No.
• In some cases, there may be more than
two classes in which case a variant of
the classification tree algorithm is
used.
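• The hedged sketch below fits one classification tree (will a consumer buy the phone, given income and age?) and one regression tree (house price from square footage) with scikit-learn; all numbers are invented for illustration.

# CART-style trees: classification (categorical target) and regression (continuous target).
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: buy the phone (1) or not (0), from income and age.
X_cls = np.array([[30000, 22], [80000, 35], [25000, 45], [90000, 28], [40000, 50]])
y_cls = np.array([0, 1, 0, 1, 0])
clf = DecisionTreeClassifier(max_depth=2).fit(X_cls, y_cls)
print(clf.predict([[70000, 30]]))          # predicted class for a new consumer

# Regression tree: house price (continuous) from square footage.
X_reg = np.array([[800], [1000], [1200], [1500], [2000]])
y_reg = np.array([95.0, 120.0, 150.0, 185.0, 250.0])   # illustrative prices
reg = DecisionTreeRegressor(max_depth=2).fit(X_reg, y_reg)
print(reg.predict([[1300]]))               # predicted price for a 1300 sq. ft. house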
Time-series Analysis
• Time series analysis is a specific way of analyzing a
sequence of data points collected over an interval of
time.
• Time Series used to see how an object behaves over a
period of time.
• A time series takes the data vector, and each data point is associated with a timestamp value as given by the user.
• This function is mostly used to learn and forecast the
behavior of an asset in business for a period of time.
Example
• sales analysis of a company
• inventory analysis
• price analysis of a particular stock
or market
• population analysis
Components of time series forecasting
Syntax:
objectName <- ts(data, start, end, frequency)
where,
• data represents the data vector
• start represents the first observation in time series
• end represents the last observation in time series
• frequency represents the number of observations per unit of time. For example, frequency = 12 for monthly data and frequency = 1 for annual data.
Forecasting
• Forecasting is to predict or
estimate (a future event or
trend). For businesses and
analysts forecasting is determining
what is going to happen in the
future by analyzing what
happened in the past and what is
going on now.
k-means clustering
• k-means clustering is one of the simplest algorithms that uses an unsupervised learning method to solve well-known clustering problems.
• k-means clustering requires the following two inputs (see the sketch below):
1. k = number of clusters
2. Training set (m) = {x1, x2, x3, ..., xm}
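• A minimal k-means sketch with scikit-learn; the six 2-D training points and k = 2 are just example choices.

# k-means: k clusters and a training set of m points.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1.5, 2], [1, 0.5],      # training set, m = 6 points
              [8, 8], [8.5, 9], [9, 8]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)   # k = 2
print(kmeans.labels_)            # cluster assigned to each training point
print(kmeans.cluster_centers_)   # coordinates of the two cluster centres
print(kmeans.predict([[0, 0]]))  # cluster for a new point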
K-means clustering Application
• Market segmentation
• Document Clustering
• Image segmentation
• Image compression
• Customer segmentation
Hierarchical clustering
• The hierarchical clustering algorithm aims to find
nested groups of the data by building the hierarchy.
• Hierarchical clusters are generally represented using
the hierarchical tree known as a dendrogram.
Types of Hierarchical Clustering
• Divisive
This is a top-down approach, where it initially considers the entire data as
one group, and then iteratively splits the data into subgroups.
• Agglomerative
It is a bottom-up approach that relies on the merging of clusters. Two
clusters are merged into one iteratively thus reducing the number of
clusters in every iteration.
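• A hedged sketch of the agglomerative (bottom-up) approach with SciPy is shown below; the six 2-D points are purely illustrative.

# Agglomerative hierarchical clustering with a dendrogram.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[1, 1], [1.5, 2], [1, 0.5],
              [8, 8], [8.5, 9], [9, 8]])

Z = linkage(X, method="ward")                    # iteratively merge the closest clusters
print(fcluster(Z, t=2, criterion="maxclust"))    # cut the hierarchy into 2 flat clusters

dendrogram(Z)                                    # visualize the hierarchy as a dendrogram
plt.show()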
K-Nearest Neighbour
• K-nearest neighbors (KNN) algorithm is a type of
supervised ML algorithm which can be used for
classification.
Working of KNN Algorithm
• K-nearest neighbors (KNN)
algorithm uses ‘feature similarity’
to predict the values of new
datapoints which further means
that the new data point will be
assigned a value based on how
closely it matches the points in the
training set.
• For example, consider the three nearest neighbours of a new data point (shown as a black dot). Two of those three lie in the Red class, so the black dot will also be assigned to the Red class.
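• A minimal KNN sketch with scikit-learn, mirroring the example above with k = 3; the training points and class labels are invented.

# K-nearest neighbours classification (k = 3).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[1, 1], [2, 1], [1, 2],      # class 0 ("Red")
                    [6, 6], [7, 6], [6, 7]])     # class 1 ("Blue")
y_train = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(knn.predict([[2, 2]]))   # the majority of the 3 nearest neighbours decides the class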
Pros of KNN
• It is a very simple algorithm to understand and interpret.
• It has relatively high accuracy, but there are much better supervised learning models than KNN.
Cons :
• It is a computationally somewhat expensive algorithm because it stores all of the training data.
• Prediction is slow in case of big N.
• High memory storage required as
compared to other supervised learning
algorithms.
Applications of KNN
• Banking System
KNN can be used in the banking system to predict whether an individual is fit for loan approval, and whether that individual has characteristics similar to those of defaulters.
• Calculating Credit Ratings
KNN algorithms can be used to find an individual’s
credit rating by comparing with the persons having
similar traits.
k-means clustering
• It is an iterative algorithm that divides the unlabelled dataset into k different clusters in such a way that each data point belongs to only one group, whose members share similar properties.
Ensemble Methods
Types of Ensemble Methods
• Sequential Methods :
In this kind of ensemble method, the base learners are generated sequentially, so data dependency exists between them: each base learner depends on the data and results of the previous one. The previously mislabelled data are then reweighted so that the performance of the overall system improves.
• Parallel Method :
In this kind of ensemble method, the base learners are generated in parallel, so there is no data dependency between them: every base learner is generated independently.
Classification of Ensemble Methods
• Bagging
• This ensemble method combines two techniques, bootstrapping and aggregation, into a single ensemble model.
• Boosting
• Boosting is one of the sequential ensemble methods, in which each model or classifier is trained in sequence and its results are utilized by the next model.
Stacking
• This method also combines multiple classifications or
regression techniques using a meta-classifier or meta-model.
• The lower-level models are trained with the complete training dataset, and then the combined (meta) model is trained on the outputs of the lower-level models.
Random Forest:
• Random forest means random selection of subsets of the samples (and features), which reduces the chance of obtaining highly correlated predictions.
• Random forest slightly increases the bias of the forest, but because the predictions of many weakly correlated trees are averaged, the variance decreases and overall performance usually improves.
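• The hedged sketch below compares the ensemble methods above (bagging, boosting, stacking and a random forest) with scikit-learn; the iris dataset, the estimator counts and the choice of base learners are only illustrative.

# Comparing ensemble methods with 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

models = {
    "Bagging":       BaggingClassifier(n_estimators=50, random_state=0),
    "Boosting":      AdaBoostClassifier(n_estimators=50, random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Stacking":      StackingClassifier(
        estimators=[("knn", KNeighborsClassifier()),
                    ("lr", LogisticRegression(max_iter=1000))],
        final_estimator=LogisticRegression()),              # meta-model on top
}

for name, model in models.items():
    print(name, round(cross_val_score(model, X, y, cv=5).mean(), 3))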
DATA SCIENCE
UNIT II
Data curation
• Data Curation is a means of managing data that
makes it more useful for users engaging in data
discovery and analysis.
• Data curators collect data from diverse sources,
integrating it into repositories that are many times
more valuable than the independent parts.
• Data Curation includes data authentication, archiving, management, preservation, retrieval, and representation.
Data Curation:
• Data curation is the organization and integration of data collected
from various sources.
• It is the process of collecting data from diverse sources and integrating it into repositories that are many times more valuable than the independent parts.
• Data curation activities:
• Preserving: Collecting and taking care of research data.
• Sharing: Revealing data’s potential across domains
• Discovering: Promoting the re-use and new combinations of data.
Query languages :
• Query language is a language which is
used to retrieve information from a
database.
• Query language is divided into two
types as follows −
• Procedural language
• Non-procedural language
• Procedural language:
• In this, the program code is written as a sequence of instructions. The user has to specify both “what to do” and “how to do it”.
• Eg. C, PASCAL
➢ Information is retrieved from the database by specifying the sequence of operations to be performed.
• Non-Procedural language :
• In this the user has to specify only “what to do” and not “how
to do it”.
➢ Information is retrieved from the database without specifying
the sequence of operation to be performed. Users only specify
what information is to be retrieved.
➢ Eg. SQL, PROLOG
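• As a hedged illustration of the non-procedural style, the sketch below runs an SQL query through Python's built-in sqlite3 module; the table "students" and its columns are hypothetical.

# Non-procedural query: SQL states WHAT to retrieve, not HOW to retrieve it.
import sqlite3

conn = sqlite3.connect(":memory:")                      # throwaway in-memory database
conn.execute("CREATE TABLE students (name TEXT, marks INTEGER)")
conn.executemany("INSERT INTO students VALUES (?, ?)",
                 [("Asha", 82), ("Ravi", 58), ("Meera", 91)])

# The query below only describes the desired result; no sequence of
# operations (loops, index lookups, etc.) is specified by the user.
for row in conn.execute("SELECT name, marks FROM students WHERE marks > 60"):
    print(row)

conn.close()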
Structured data
• Structured data is information that has been
formatted and transformed into a well-defined data
model.
• SQL relational databases, consisting of tables with
rows and columns, are the perfect example of
structured data.
Semi-Structured Data
• Your data may not always be structured or
unstructured; semi-structured data or partially
structured data is another category between
structured and unstructured data.
• Organizational properties like metadata or
semantics tags are used with semi-structured data to
make it more manageable; however, it still contains
some variability and inconsistency.
Unstructured Data
• Unstructured data is defined as data present in absolute raw
form.
• This data is difficult to process due to its complex
arrangement and formatting.
• Unstructured data management may take data from many
forms, including social media posts, chats, satellite imagery,
IoT sensor data, emails, and presentations, to organize it in a
logical, predefined manner.
Security and ethical considerations in relation to
authenticating and authorizing access to data on
remote systems.
• Authentication and authorization both play
important roles in online security systems.
• They confirm the identity of the user and
grant access to your website or application.
• It’s vital that you make note of their differences so you can determine which combination of web tools best suits your security needs.
Authentication
• Authentication is the process that confirms a user’s
identity and provides access to sensitive information.
• Traditionally, this is done through a username and
password.
• The user enters their username, which allows the system
to confirm their identity. This system relies on the fact
that (hopefully) only the user and the server know the
password.
• The website authentication process works by comparing
the user’s credentials with the ones on file. If a match is
found, the authentication process is complete, and the
individual can be pushed through to the authorization
process.
• While password authentication is the most
common way to confirm a user’s identity, it
isn’t even close to the most effective or secure
method.
• Think about it: anyone with your credentials
could access your account without your
permission, and the system wouldn’t stop them.
Most passwords are weak, and hacking
techniques can break them in less and less time.
• Luckily, passwords aren’t the only way to
authenticate your users. That’s why we’ll cover
two alternative methods that sites can use to
verify a user’s identity.
Types of authentication
• Biometric Authentication
Biometric authentication includes any method that requires a
user’s biological characteristics to verify their identity.
While this may seem like new-age technology, you've probably been using it to unlock the screen on your smartphone for
years. Fingerprint scanning is the most well-known form of
biometric authentication, but facial recognition tools are an
increasingly popular choice for developers and users alike.
However, it’s important to note that these processes are often
less secure than you might initially assume.
• For example, small fingerprint scanners on
smartphones only record portions of your
fingerprint. Multiple images of part of a fingerprint
are much less secure than a single, clear image.
• Some hackers have created a “master fingerprint” that contains characteristics of the most common prints, allowing them to trick the scanners.
• Remember, too, that biometric authentication can’t
be changed or altered if a user’s fingerprints have
been compromised. While biometric authentication
holds a lot of promise, it’s really most useful as a
second factor in a multi-factor authentication
strategy.
• Email Authentication
Email authentication is a passwordless option that allows users to securely log in to any account using just an email address.
The process is very similar to signing in with a
Facebook or Twitter account, but this method
offers a more universal approach.
After all, the vast majority of individuals in the
U.S. have at least one email address.
Here's how your site can authenticate users using Swoop's Magic Message passwordless authentication:
1. The user is redirected to the Swoop service via the OAuth 2.0 protocol
for authentication.
2. From a browser window, the user pushes the “Send Magic
Message” button: The button activates a mailto link, which generates
a pre-written email for the user to send.
3. The user sends the email: This is where the magic happens. Once the
email is sent, the outgoing email server generates and embeds a
1024/2048 bit, fully encrypted digital key into the header of the email.
Swoop’s authentication server follows the public key cryptographic
procedure to decrypt this key. Each email sent receives a unique key
for that message. The level of security for these encrypted keys is
beyond comparison to traditional passwords.
4. The user is logged into their account: When the key decrypts and
passes all layers of verification, the Swoop authentication server directs
the website to open the user’s account and begin a session. This all
takes place in a matter of seconds and makes for an extremely
streamlined user experience.
• Aside from being inherently more secure than a
password, email authentication tools like Swoop will
also provide a more efficient and elegant user
experience which has helped increase new signups
by up to 49%.
• That way, users know their data is protected, and you can avoid any potential breaches in security. It's a win-win for users and developers alike!
Authorization
• Authorization is the next step in the login process,
which determines what a user is able to do and see on
your website.
• Once a user’s identity has been verified through the
authentication process, authorization determines what
permissions they have.
• Permissions are what the user is able to do and see on
your website or server.
• Without them, every user would have the same
abilities and access to the same information (including
the sensitive data that belongs to another user).
Permissions are crucial for a few reasons:
• They prevent a user from accessing an account
that isn’t theirs.
Imagine if your online banking application didn’t
have permissions. When you logged in, you’d not
only have access to your account but also every other
user’s account on the application! Permissions ensure
users can access and modify only what they need to.
• They restrict free accounts from getting premium
features.
To restrict free accounts from gaining access to your
premium features, you need to implement specific
permissions so that each account only has access to
the capabilities they paid for.
Security authentication vs. authorization
• The easiest way to understand this relationship is by
asking yourself the questions, “Who are you?” and
“What are you allowed to do?”
• Software development tools
• A software development tool or a
software programming tool is a
computer program utilized by
software developers to create,
maintain, edit, support, and debug
other programs, frameworks, or
applications.
• Dataiku DSS
• One of the far-reaching data science software
platforms is Dataiku DSS. Data Analysts,
Engineers, and Data Scientists use this tool to
develop and deliver data products.
• Dataiku DSS has more than 80 built-in functions to
prepare, enrich, and clean data.
• With this tool, you can develop, deploy, and optimise R and Python models. Moreover, it allows you to use code APIs to integrate with any ML library.
• Tableau
• Tableau is a powerful and fast-growing data visualization tool used in the Business Intelligence and data science industry.
• It helps in simplifying raw data in a very
easily understandable format.
• Data analysis is very fast with Tableau tool
and the visualizations created are in the form
of dashboards and worksheets.
• The best features of Tableau software are
• Data Blending
• Real time analysis
• Collaboration of data
• The great thing about Tableau software is that it
doesn’t require any technical or any kind of
programming skills to operate. The tool has
garnered interest among the people from all sectors
such as business, researchers, different industries,
etc.
• TensorFlow tool
• TensorFlow is a free and open-source
software library for machine learning ,
artificial intelligence and data science.
It can be used across a range of tasks
but has a particular focus on training
and inference of deep neural
networks.
• playground.tensorflow.org
• DataRobot
• DataRobot is a fast way to prepare data and make predictions. Data is the fuel that drives AI innovation. Unleashing its benefits across the enterprise depends on immediate self-service access to any data, as well as empowering not just highly skilled data scientists and data engineers but every analyst and citizen data scientist.
Amazon Web Services (AWS)
• Amazon Web Service is a subsidiary of Amazon.com.
• This company was started in the year 2006.
• This company brings the best example of web-service achieved
through the service oriented architecture.
• Amazon has made it possible to develop private virtual servers that
can run worldwide via 'Hardware Virtualization' on Xen
hypervisor.
• These servers can be provisioned with different types of application software that the user might need, along with a range of support services that not only make cloud-computing applications possible but also make them robust enough to withstand heavy computation.
• Infrastructure As A Service (IAAS) is a means of delivering computing infrastructure as on-demand services.
• It is one of the three fundamental cloud service models, providing servers, storage, networking and operating systems.
• Platform As A Service (PAAS) is a cloud delivery model for applications composed of services managed by a third party.
• It provides elastic scaling of your application, allows developers to build applications and services over the internet, and supports public, private and hybrid deployments.
• Software As A Service (SAAS) allows users to run existing online applications. It is a model in which software is deployed as a hosted service and accessed over the internet; in this software delivery model, the software and its associated data are hosted centrally and accessed by clients, usually through a web browser over the web.
• SAAS services are used for the development and deployment of modern applications.
Difference between IAAS, PAAS and SAAS :
• Amazon Web Service runs on the SOA standard and the SOAP, REST and HTTP transfer protocols, together with open-source and commercial operating systems, browser-based software and application servers.
• AWS offers various suites of Cloud computing
technology that makes up an on-demand
computational platform.
• These services are operated from twelve different geographical locations, and among them the best-known are Amazon's Elastic Compute Cloud (EC2) and Amazon's Simple Storage Service (S3).
Components and services of AWS
• Amazon Elastic Compute Cloud: EC2 is the centralized
application of AWS which facilitates the management and usage of
virtual private servers that can run on Windows and Linux-based
platforms over Xen Hypervisor.
• A number of tools are used to support Amazon's web services.
These are:
• Amazon Simple Queue Service is a message queue and transaction
system for distributed Internet-based applications.
• Amazon Simple Notification Service is used to publish messages from an application.
• Amazon CloudWatch is used for monitoring the EC2 cloud; it provides a console or command-line view of resource utilization.
• Elastic Load Balancing is used to detect whether an instance is failing
or check whether the traffic is healthy or not.
• Amazon's Simple Storage Service (S3) is an online storage and backup system with a high-speed data transfer technique called AWS Import/Export.
• Another web - services components are:
• Amazon's Elastic Block Store
• Amazon's Simple Database (DB)
• Amazon's Relational Database Service
• Amazon Cloudfront