Foundations of Data Science
 FOUNDATIONS OF DATA
      SCIENCE
                                            BY
                   K.Issack Babu MCA
               Assistant Professor
science project, setting expectations, loading data into R – working with data from
files, working with relational databases. Exploring data – Using summary statistics to
spot problems, spotting problems using graphics and visualization. Managing data –
cleaning and sampling for modelling and validation.
                                               UNIT II
objects, Types of Data items, the structure of data items, examining data structure,
working with history commands, saving your work in R.
PROBABILITY DISTRIBUTIONS in R - Binomial, Poisson, Normal distributions.
UNIT IV
UNIT I
  Customers:- Customers are the people who use the product. Their interest
  determines the success of the project, and their feedback is very valuable in data
  science.
  Data scientists:- Data scientists explore and transform the data in new ways to
  create and publish new features. They also combine data from diverse
  sources to create new value. They play an important role in creating
  visualizations with researchers, engineers and web developers.
  Adapting to Change:- All members of a data science team are required to
  adapt to change and to work on the basis of current requirements. Adopting an
  agile methodology for data science calls for several changes, which are
  listed as follows −
         Choosing generalists over specialists.
         Preference of small teams over large teams.
         Using high-level tools and platforms.
         Continuous and iterative sharing of intermediate work.
Roles:-
  The role of a data scientist is normally associated with tasks such as predictive
  modeling, developing segmentation algorithms, recommender systems, A/B
  testing frameworks and often working with raw unstructured data.
  The nature of their work demands a deep understanding of mathematics,
  applied statistics and programming. There are a few skills common between a
    data analyst and a data scientist, for example, the ability to query databases.
    Both analyze data, but the decision of a data scientist can have a greater impact
    in an organization.
    Here is a set of skills a data scientist normally needs to have −
    The life cycle of data science is iterative. After completing all the phases, the data
    scientist can go back to the top and start again. The data science life cycle resembles the
    cross-industry process for data mining, because data science is an interdisciplinary field
    that covers data collection, data analysis, feature engineering, data prediction and data
    visualization, and involves both structured and unstructured data.
   Business Understanding
   Data Mining
   Data Cleaning
   Exploration
   Feature Engineering
   Prediction Modeling
   Data Visualization
Business Understanding:-
    First, the data scientist identifies the problem; a group of people then analyzes the
    problem and discusses possible solutions. They also study previous records to
identify whether such a problem has happened before. Every decision has to be
in favour of the organization.
Data Mining:-
Data Cleaning:-
The data collected in the data mining process may contain a lot of unnecessary
data or may be inconsistent. Some pieces of the data may also come from
different sources, and the date format may be inconsistent or incomplete. So, the next
task of the data scientist is to remove all the unwanted data and consolidate
the rest. This process may be time-consuming, as everything depends on the
quality of the gathered data. At the end of the process the data scientist has cleaned
and manipulated data.
Exploration:-
Data exploration is actually the starting stage of data analysis. In this process,
the data scientist summarizes the main characteristics of the data and analyzes and
explores each data set very carefully. They can use different graphical
representation techniques such as histograms, scatter plots and so on.
Feature Engineering:-
This process is essentially applied machine learning. Domain
knowledge and a deep understanding of the data are required to make the machine learning
algorithm work. This is difficult and expensive, and it requires
brainstorming to improve the features. The features in your data are important for
the data prediction.
Prediction Modeling:-
Here, the data scientist builds the predictions for the project. The finished data
science project is expected to answer many predictive analytics questions, including
predictions of future events and actions.
Data Visualization:-
                                               *****
Setting expectations:
Having the title of Data Scientist can come with a lot of assumptions and expectations.
Different companies have their definitions for what it means to be a data scientist there.
And each company comes with its expectations and assumptions about what they want
you to do for them. With that, I picked out three of the most tiring assumptions and
expectations I often face while working or interviewing as a data scientist.
  We can do EVERYTHING:-
  Often, data scientists are expected to look at the data and make an analysis work
  for the person requesting it. Some companies will hire and assume that the data
  scientist can perform multiple roles: data scientist, front-end developer, backend
  developer, data analyst, data engineer, ML Ops, and more. Unfortunately, this is
  not the case. A data scientist shouldn’t be expected to perform every aspect of
  the pipeline.
  Each of the roles in the data space has a subset of skills that they do and do well.
  Each one is a cog in the machine that keeps the process moving end to end. Data
  scientists shouldn’t be expected to know and understand every aspect of the
  pipeline. Instead, we should be working with a team of software developers,
  engineers, SMEs, and more to build a long-lasting product.
  One role I held expected me to recreate all datasets in a separate location. While
  at the same time, we had a data engineering team who housed our data in the
  cloud in an easy-to-access format. We were not a small team or a startup that
  needed people to generalize their skills to get things working.
  I wrote a post about this a while back because this request bothered me. If we
  have a data engineering team that will work closely with us to provide the data in
the format we need, why are we recreating all of that? The reason — my
manager wanted a self-sufficient team that did not rely on others. This is not fair.
Teams need to work effectively together, not battle against unreasonable
requests.
When you are evaluating your business objective, you need to determine the best
solution. When I started as a data scientist, the best advice I was given was to
find the most straightforward solution first. Focus on what is simple, and then
build your solution out from there. Don’t over-complicate a problem to flex your
skills. You may be able to solve someone’s problem without the need for
extensive ML/AI. Make sure you are evaluating the use case first and using
ML/AI when appropriate.
Final Thoughts:-
The title of data scientist can come with a lot of assumptions and expectations.
Don’t let this overwhelm you. Instead, focus on what skills and knowledge are
essential to know for your role and objectives.
   You are not required to use ML/AI for every project. Understand your
    business objectives, and learn your use cases. ML/AI should be applied
    where applicable.
   Along with ML/AI, tools do not work right out of the box. You will need to
    evaluate if a tool is the right one for the job or not. The newest tool or the
    next trend is not always the right fit for your use case.
*****
  R contains a set of functions that can be used to load data sets into memory.
  You can also load data into memory using R Studio - via the menu items and
  toolbars. In this tutorial I will cover both methods.
  Which method of loading data in R you should use depends on what you are
  doing. If you are just playing around with some data, using the R Studio menu
  items might be fine. But if you are writing an R program that needs to be repeated
  for many different data sets, it might be better to write the loading of data as R
  program statements.
R has three different functions which can import data. These are:
        read.table()
        read.csv()
        read.delim()
  These functions are very similar to each other, so if you master one of them you
  will soon master the others. In fact, you can probably just use
  the read.table() function for all of your data imports. These 3 functions will be
  covered in the following sections.
read.table()
  The R function read.table() loads data from a file into a tabular data set
  (table) in memory. A tabular data set consists of rows and columns, just like a
  spreadsheet. Sometimes rows are also referred to as "records" and columns
  referred to as "fields" or "properties".
  The parameters to read.table() are listed between the parentheses, separated with
  commas. Here is an example of loading a CSV file using read.table() in R:
The first parameter is the path to the file to read. In the example above that is
the "data.csv" part. This parameter should contain a path to the file to read. In
the above example only the file name itself is shown. Then R expects to find the
file in the same directory R is running from. If you want to specify the full path
to the file, you can do so too. Here is an example of how that looks on
Windows:
"d:\\data\\projects\\tutorial-projects\\r-programming\\data.csv"
The same file path on a Mac or Linux machine could look like this:
"/data/projects/tutorial-projects/r-programming/data.csv"
Notice the use of / between directories instead of \, and notice that you only
need a single / between the directories, because / is not an escape character.
The second parameter specifies whether the file has a header line, that is, whether the
first line contains the column names or already contains data. Look at this CSV file:
name;id;salary
John Doe;1;99999
Joe Blocks;2;120000
Cindy Loo;3;150000
Notice how the first row contains the column names for the data on the
following rows.
The third parameter specifies which character is used inside the data file to
separate the different column values on each row. If you look at the CSV file
contents above you can see that a semicolon (;) is used as separator. That is why
the third parameter to the read.table() function call is sep=";" meaning that the
separator character used in the data file is a semicolon.
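For comparison, the same three ideas (file path, header line, separator) can be sketched in Python using only the standard library. This is an illustrative equivalent rather than R code; to keep the snippet runnable on its own, the sample rows shown above are read from an in-memory string standing in for data.csv.

```python
import csv
import io

# Same content as the example file above; an in-memory stand-in for data.csv.
raw = """name;id;salary
John Doe;1;99999
Joe Blocks;2;120000
Cindy Loo;3;150000
"""

# DictReader treats the first line as the header line (like header=T in R);
# delimiter=";" plays the role of sep=";".
reader = csv.DictReader(io.StringIO(raw), delimiter=";")
rows = list(reader)

# Every field is read as text, so numeric columns are converted explicitly.
for row in rows:
    row["id"] = int(row["id"])
    row["salary"] = int(row["salary"])

print(reader.fieldnames)
print(rows[0])
```

To read a real file instead, open it with open("data.csv", newline="") and pass the file object to csv.DictReader in place of the io.StringIO object.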
To execute read.table() you type the commands shown in this section into the
console part of R Studio and press the "Enter" key.
read.csv()
This example loads the CSV file located at D:\\data\\data.csv and assigns it to the
variable named data. The first line is a header line containing the names of the
columns in the CSV file. This is specified by the second parameter header=T.
The third parameter specifies that the separator character used inside the CSV
file is ; (a semicolon).
read.delim()
The read.delim() function reads a CSV file into memory, just like
the read.csv() function. The read.delim() function takes 3 parameters, just like
the read.table() function. Here is an example call to the read.delim() function:
This example loads the CSV file located at D:\\data\\data.csv and assigns it to the
variable named data. The first line is a header line containing the names of the
columns in the CSV file. This is specified by the second parameter header=T.
The third parameter specifies that the separator character used inside the CSV
file is ; (a semicolon).
CSV: CSV stands for comma-separated values. As the name suggests, a CSV
file uses commas to separate values. In a CSV file each line is a data record, and
each record consists of one or more data fields separated
by commas.
import pandas as pd
# Replace the placeholder with the real path to your file
df = pd.read_csv("file_path/file_name.csv")
print(df)
XLSX: An XLSX file is a Microsoft Excel Open XML Format spreadsheet file.
It can store any type of data, but it is mainly used to store financial data
and to create mathematical models.
import pandas as pd
df = pd.read_excel(r'file_path\name.xlsx')
print(df)
Note:
Install openpyxl before reading an .xlsx file in Python to avoid an error; recent
versions of xlrd read only the older .xls format. You can install openpyxl using the
following command.
pip install openpyxl
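ZIP: ZIP files are used as data containers; they store one or more files
in compressed form and are widely used on the internet. After you download a ZIP
file, you normally need to unpack its contents in order to use it.
import pandas as pd
# pandas can read a ZIP archive directly when it contains a single CSV file
df = pd.read_csv('file_path/file_name.zip')
print(df)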
TXT: TXT files are useful for storing information in plain text with no special
formatting. They are recognized by any text
editor and by most other software programs.
import pandas as pd
# Assumes a comma-delimited text file; pass sep=... for other delimiters
df = pd.read_csv('file_path/file_name.txt')
print(df)
JSON: JSON stands for JavaScript Object Notation. It is a standard text-based
format for representing structured data, based on JavaScript object syntax.
import pandas as pd
df = pd.read_json('file_path/file_name.json')
print(df)
HTML: HTML stands for HyperText Markup Language and is used for
creating web pages. We can read HTML tables in Python with the pandas read_html()
function, which returns a list of DataFrames, one per table found.
import pandas as pd
tables = pd.read_html('file_path/file_name.html')
df = tables[0]  # first table in the file
print(df)
   PDF: PDF stands for Portable Document Format. This file format is used
   when we need to save files that should not be modified but still need to be easily
   available.
   pip install tabula-py
   pip install pandas
   import tabula
   # read_pdf returns a list of DataFrames, one per table detected
   df = tabula.read_pdf('file_path/file_name.pdf')
   print(df)
   Working with relational databases:
   In the age of big data, data scientists should leverage relational databases in
their workflow. By doing so, analysts can combine the power of a database engine
for munging and calculating with a streamlined workflow for data summarization,
visualization, modeling, and high-end computing, all within a
reproducible, efficient process.
However, many data analysts today continue to work with small to large flat
files including delimited text (.csv, .tab, .txt) files; Excel (.xls, .xlsx, .xlsb,
.xlsm) files; other software binary types (.sas7bdat, .dta, .sav); nested
XML/JSON files; and other formats that can easily be changed, moved, deleted,
and corrupted to interrupt workflows and version controls set in place.
Additionally, these formats often store redundant, repetitive indicator
information, which uses disk storage inefficiently.
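As a minimal sketch of letting the database engine do the munging and calculating, the snippet below uses Python's built-in sqlite3 module with an in-memory database. The table name, the dept column and the rows are hypothetical, chosen only for illustration; a real project would connect to an existing database instead.

```python
import sqlite3

# In-memory database so the sketch is self-contained (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("John Doe", "sales", 99999),
        ("Joe Blocks", "sales", 120000),
        ("Cindy Loo", "research", 150000),
    ],
)

# The grouping and averaging run inside the database engine;
# only the small summary table is transferred to Python.
query = """
    SELECT dept, COUNT(*) AS n, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY dept
    ORDER BY dept
"""
summary = conn.execute(query).fetchall()
conn.close()

for dept, n, avg_salary in summary:
    print(dept, n, avg_salary)
```

Because only the aggregated result leaves the database, the same pattern keeps working when the underlying table is far too large to load as a flat file.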
                                              *****
Exploring data – Using summary statistics to spot problems:
# Take a look at the help for ?geom_point and ?geom_line to find similar examples
# Here we take the carrier code as the x axis
# the value from the dt data.table goes in the y axis
  The "figure margins too large" error indicates that the margins of the particular plot are very large while
  the region allocated for the plot is too small. You can solve this problem by
  increasing the size of the plots pane.
When legends, lines, text, or points are missing or "incorrectly" placed, this is
often the result of R condensing the plot to fit the region. You can generally
solve this by increasing or decreasing the plotting region.
Resetting your graphics device will remove any leftover options or settings
from previous plots. These might be causing undesired behavior or errors with
your current plotting environment. See ?par and ?options for more details. For
example:
> plot(cars)
> par(mfrow=c(2,2))
> plot(cars)
To fix this behavior, sometimes it is best to reset your graphics device and then
try your plot again. Subsequent plots will use the default graphics settings. To
reset your graphics device, call the following code from the console:
> dev.off()
                                                  UNIT II
                                     MODELING METHODS
  As a data scientist, your ultimate goal is to solve a concrete business problem: increase look-
  to-buy ratio, identify fraudulent transactions, predict and manage the losses of a loan
  portfolio, and so on. Many different statistical modelling methods can be used to solve any
  given problem. Each statistical method will have its advantages and disadvantages for a given
  business goal and business constraints. This chapter presents an outline of the most common
  machine learning and statistical methods used in data science.
  To make progress, you must be able to measure model quality during training and also ensure
  that your model will work as well in the production environment as it did on your
  training data. In general, we’ll call these two tasks model evaluation and model validation.
  To prepare for these statistical tests, we always split our data into training data and test data.
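The split into training and test data can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the function name, the 20 percent test fraction and the fixed seed are assumptions made for the example.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    # A seeded generator makes the split reproducible from run to run.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(100))  # stand-in for 100 data rows
train, test = train_test_split(data, test_fraction=0.2)
print(len(train), len(test))
```

The model is fitted on train only; test is held back so that evaluation uses data the model has never seen.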
We define model evaluation as quantifying the performance of a model. To do this we must find a
measure of model performance that’s appropriate to both the original business goal and the chosen
modeling technique. For example, if we’re predicting who would default on loans, we have a
classification task, and measures like precision and recall are appropriate. If we instead are predicting
revenue lost to defaulting loans, we have a scoring task, and measures like root mean square
error (RMSE) are appropriate. The point is this: there are a number of measures the data scientist
should be familiar with.
   Your task is to map a business problem to a good machine learning method. To use a real-
   world situation, let’s suppose that you’re a data scientist at an online retail company. There
   are a number of business problems that your team might be called on to address:
   Your intended uses of the model have a big influence on what methods you should use. If you
   want to know how small variations in input variables affect outcome, then you likely want to
   use a regression method. If you want to know what single variable drives most of
   a categorization, then decision trees might be a good choice. Also, each business problem
   suggests a statistical approach to try. If you’re trying to predict scores, some sort of
   regression is likely a good choice; if you’re trying to predict categories, then something
   like random forests is probably a good choice.
Suppose your task is to automate the assignment of new products to your company’s product
categories, as shown in the figure. This can be more complicated than it sounds. Products that
come from different sources may have their own product classification that doesn’t coincide
with the one that you use on your retail site, or they may come without any classification at
all. Many large online retailers use teams of human taggers to hand-categorize their products.
This is not only labor-intensive, but inconsistent and error-prone. Automation is an attractive
option; it’s labor-saving, and can improve the quality of the retail site.
Evaluating Models :
After training a model, AutoML Tables uses the test dataset to evaluate the quality and
accuracy of the new model, and provides an aggregate set of evaluation metrics indicating
how well the model performed on the test dataset.
How you use the evaluation metrics to determine the quality of your model depends on your
business need and the problem your model is trained to solve. For example, there might be a
higher cost to false positives than for false negatives, or vice versa. For regression models,
does the delta between the prediction and the correct answer matter or not? These kinds of
questions affect how you will look at your model evaluation metrics.
If you included a weight column in your training data, it does not affect evaluation metrics.
Weights are considered only during the training phase.
   AUC PR: The area under the precision-recall (PR) curve. This value ranges from zero to one,
    where a higher value indicates a higher-quality model.
   AUC ROC: The area under the receiver operating characteristic (ROC) curve. This ranges
    from zero to one, where a higher value indicates a higher-quality model.
   Accuracy: The fraction of classification predictions produced by the model that were correct.
   Log loss: The cross-entropy between the model predictions and the target values. This ranges
    from zero to infinity, where a lower value indicates a higher-quality model.
   F1 score: The harmonic mean of precision and recall. F1 is a useful metric if you're looking
    for a balance between precision and recall and there's an uneven class distribution.
   Precision: The fraction of positive predictions produced by the model that were correct.
    (Positive predictions are the false positives and the true positives combined.)
   Recall: The fraction of rows with this label that the model correctly predicted. Also called
    "True positive rate".
   False positive rate: The fraction of rows that do not have the target label but that the
    model predicted to have it (false positives).
    These metrics are returned for every distinct value of the target column. For multi-class
    classification models, these metrics are micro-averaged and returned as the summary metrics.
    For binary classification models, the metrics for the minority class are used as the summary
    metrics. The micro-averaged metrics are the expected value of each metric on a random
    sample from your dataset.
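To make the definitions above concrete, here is a small sketch that computes accuracy, precision, recall, false positive rate and F1 score for one label from lists of true and predicted values. It is plain Python written for illustration, not the AutoML Tables implementation, and the ten example rows are made up.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute basic classification metrics for the `positive` label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Guard against division by zero when a denominator is empty.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "f1": f1,
    }

# Made-up example: 4 positive rows and 6 negative rows.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(name, round(value, 3))
```

Here the model finds 3 of the 4 positive rows (recall 0.75) and 3 of its 4 positive predictions are correct (precision 0.75), so the F1 score is also 0.75.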
    In addition to the above metrics, AutoML Tables provides two other ways to understand your
    classification model, the confusion matrix and a feature importance graph.
   Confusion matrix: The confusion matrix helps you understand where misclassifications
    occur (which classes get "confused" with each other). Each row represents ground truth for a
    specific label, and each column shows the labels predicted by the model.
    Confusion matrices are provided only for classification models with 10 or fewer values for
    the target column.
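A confusion matrix with this layout (rows are ground truth, columns are predictions) can be built with a simple counter. The sketch below is illustrative only; the three labels and six rows are made up.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows are ground-truth labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

labels = ["cat", "dog", "bird"]
y_true = ["cat", "cat", "dog", "dog", "bird", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "bird", "cat"]

matrix = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(label, row)
```

The off-diagonal entries show which classes get confused: one cat was predicted as dog, and one bird was predicted as cat.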
    Feature importance: AutoML Tables tells you how much each feature impacts this model. It
    is shown in the Feature importance graph. The values are provided as a percentage for each
    feature: the higher the percentage, the more strongly that feature impacted model training.
    You should review this information to ensure that all of the most important features make
    sense for your data and business problem.
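    The micro-averaged precision is calculated by adding together the number of true positives
    (TP) for each potential value of the target column and dividing it by the total number of true
    positives (TP) and false positives (FP) over all potential values:
$$\mathrm{precision}_{micro} = \frac{TP_1 + \dots + TP_n}{TP_1 + \dots + TP_n + FP_1 + \dots + FP_n}$$
where $TP_i$ and $FP_i$ are the numbers of true positives and false positives for the $i$-th of the $n$ potential values of the target column.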
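The micro-averaged precision can be sketched directly from its definition: count the true positives and false positives of every class, sum them, and divide. The labels and rows below are made up for illustration, and the function name is an assumption.

```python
def micro_precision(y_true, y_pred, labels):
    """Sum TP and FP over all classes, then compute one overall precision."""
    total_tp = 0
    total_fp = 0
    for label in labels:
        # True positive: predicted `label` and the row really is `label`.
        total_tp += sum(1 for t, p in zip(y_true, y_pred)
                        if p == label and t == label)
        # False positive: predicted `label` but the row is something else.
        total_fp += sum(1 for t, p in zip(y_true, y_pred)
                        if p == label and t != label)
    return total_tp / (total_tp + total_fp)

labels = ["cat", "dog", "bird"]
y_true = ["cat", "cat", "dog", "dog", "bird", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "bird", "cat"]

print(micro_precision(y_true, y_pred, labels))
```

In this example the per-class true positives are 1, 2 and 1 and the false positives are 1, 1 and 0, giving 4 / (4 + 2). When every row gets exactly one predicted label, the micro-averaged precision equals the overall accuracy.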
Score threshold
    The score threshold is a number that ranges from 0 to 1. It provides a way to specify the
    minimum confidence level where a given prediction value should be taken as true. For
    example, if you have a class that is quite unlikely to be the actual value, then you would want
    to lower the threshold for that class; using a threshold of .5 or higher would result in that
    class being predicted extremely rarely (or never).
    A higher threshold decreases false positives, at the expense of more false negatives. A lower
    threshold decreases false negatives at the expense of more false positives.
    Put another way, the score threshold affects precision and recall. A higher threshold results in
    an increase in precision (because the model never makes a prediction unless it is extremely
    sure) but the recall (the percentage of positive examples that the model gets right) decreases.
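The trade-off can be seen by evaluating the same predictions at several thresholds. The scores and labels below are made up for illustration; the sketch treats a row as a positive prediction when its score is at least the threshold.

```python
def precision_recall_at(threshold, scores, labels):
    """Precision and recall when scores >= threshold count as positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for pred, y in zip(predicted, labels) if pred and y == 1)
    fp = sum(1 for pred, y in zip(predicted, labels) if pred and y == 0)
    fn = sum(1 for pred, y in zip(predicted, labels) if not pred and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up model scores and true labels (1 = positive, 0 = negative).
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

for threshold in (0.8, 0.5, 0.25):
    precision, recall = precision_recall_at(threshold, scores, labels)
    print(threshold, round(precision, 3), round(recall, 3))
```

With these numbers, lowering the threshold from 0.8 to 0.25 raises recall from 0.5 to 1.0 while precision falls from 1.0 to about 0.667.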
   MAE: The mean absolute error (MAE) is the average absolute difference between the target
    values and the predicted values. This metric ranges from zero to infinity; a lower value
    indicates a higher quality model.
   RMSE: The root-mean-square error metric is a frequently used measure of the differences
    between the values predicted by a model or an estimator and the values observed. This metric
    ranges from zero to infinity; a lower value indicates a higher quality model.
   RMSLE: The root-mean-squared logarithmic error metric is similar to RMSE, except that it
    uses the natural logarithm of the predicted and actual values plus 1. RMSLE penalizes under-
    prediction more heavily than over-prediction. It can also be a good metric when you don't
    want to penalize differences for large prediction values more heavily than for small
    prediction values. This metric ranges from zero to infinity; a lower value indicates a higher
    quality model. The RMSLE evaluation metric is returned only if all label and predicted
    values are non-negative.
   r^2: r squared (r^2) is the square of the Pearson correlation coefficient between the labels
    and predicted values. This metric ranges between zero and one; a higher value indicates a
    higher quality model.
   MAPE: Mean absolute percentage error (MAPE) is the average absolute percentage
    difference between the labels and the predicted values. This metric ranges between zero and
    infinity; a lower value indicates a higher quality model.
    MAPE is not shown if the target column contains any 0 values. In this case, MAPE is
    undefined.
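The regression metrics above can be computed with the standard library alone. This is a sketch under the stated definitions, not the AutoML Tables code: MAPE is returned as a percentage, r^2 is computed as the squared Pearson correlation, and the four example values are made up.

```python
import math

def regression_metrics(y_true, y_pred):
    """MAE, RMSE, RMSLE, MAPE (in percent) and r^2 for paired values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # log1p(x) = ln(x + 1); requires non-negative labels and predictions.
    rmsle = math.sqrt(
        sum((math.log1p(p) - math.log1p(t)) ** 2
            for t, p in zip(y_true, y_pred)) / n
    )
    # Undefined if any target value is 0.
    mape = 100 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n

    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    r2 = cov ** 2 / (var_t * var_p)  # squared Pearson correlation
    return {"mae": mae, "rmse": rmse, "rmsle": rmsle, "mape": mape, "r2": r2}

y_true = [100, 200, 300, 400]
y_pred = [110, 190, 330, 380]
results = regression_metrics(y_true, y_pred)
for name, value in results.items():
    print(name, round(value, 4))
```

For these values the absolute errors are 10, 10, 30 and 20, so the MAE is 17.5, while the RMSE (the square root of 375) is larger because squaring gives the error of 30 more weight.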
   Feature importance: AutoML Tables tells you how much each feature impacts this model. It
    is shown in the Feature importance graph. The values are provided as a percentage for each
    feature: the higher the percentage, the more strongly that feature impacted model training.
   You should review this information to ensure that all of the most important features make
   sense for your data and business problem.
   Getting the evaluation metrics for your model
   To evaluate how well your model did on the test dataset, you inspect the evaluation metrics
   for your model.
To see your model's evaluation metrics using the Google Cloud Console:
There are two straightforward ways to statistically validate a model: one can evaluate the
model on the data the model was trained on, or one can evaluate it on an external test set. The
first method introduces the problem of overfitting: one can fit any dataset arbitrarily well at
the cost of creating a model brittle to extra data. If the model is perfected to one dataset, the
model may not be able to use and identify correct outputs with new data and thus will not
validate. One could take the case of trying to fit a curve through (x,y) pairs when given 100 such pairs.
A high-degree polynomial could fit the data in the training set exactly while being very brittle
to data outside the training set. It is common practice instead to validate models using a test
set. When originally given a data set, one can construct the test set by randomly extracting 10
to 20 percent of the data. In the case of the 100 (x,y) pairs, one could discover that the high-
degree polynomial was overfitting easily by separating out a test set and evaluating the model
against it. One might then choose to use a simpler model, such as a linear regression.
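The overfitting argument can be reproduced on a small scale. The sketch below uses 8 training points instead of 100 to stay short: the underlying relationship is the line y = 2x + 1, and a fixed alternating perturbation of 0.5 stands in for noise (both are assumptions chosen so the result is deterministic). A degree-7 polynomial that passes through every training point is compared with a straight-line fit on held-out test points.

```python
def lagrange_predict(xs, ys, x):
    """Evaluate at x the polynomial passing through every (xs[i], ys[i])."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Training set: y = 2x + 1 plus an alternating +0.5 / -0.5 perturbation.
x_train = [float(k) for k in range(8)]
y_train = [2 * x + 1 + (0.5 if k % 2 == 0 else -0.5)
           for k, x in enumerate(x_train)]
# Test set: points between the training points, on the underlying line.
x_test = [k + 0.5 for k in range(7)]
y_test = [2 * x + 1 for x in x_test]

slope, intercept = fit_line(x_train, y_train)

poly_train_mse = mse(y_train, [lagrange_predict(x_train, y_train, x) for x in x_train])
poly_test_mse = mse(y_test, [lagrange_predict(x_train, y_train, x) for x in x_test])
line_train_mse = mse(y_train, [slope * x + intercept for x in x_train])
line_test_mse = mse(y_test, [slope * x + intercept for x in x_test])

print("polynomial: train", poly_train_mse, "test", poly_test_mse)
print("line:       train", line_train_mse, "test", line_test_mse)
```

The polynomial has zero error on the training set but a much larger error on the test set than the straight line, which is exactly the pattern a held-out test set is meant to expose.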
Many different statistical evaluation metrics can be used for model validation in general,
including mean absolute error, mean squared error, and the ROC curve.
For instance, a key part of model validation is ensuring that you have picked the right high-
level statistical model. One could consider the example of training a system to predict the
price of an item given an image of it. One could obtain reasonably good results by simply
applying a logistic regression to the set of images. But this would ignore much better results
that could potentially be obtained by applying a multiple layer convolutional neural network
to the images.
It is thus important to perform thorough research of the machine learning literature as a part
of model validation. The results of endless hours of work on a model that is a poor or
mediocre choice for a given dataset can be surpassed by a simple glance at the right areas of
the arXiv. On the other hand, a model that is not exactly the right choice for a given data set,
but still close to the optimum, can still be considered to pass model validation.
It is generally mistaken to take the perspective prevalent in Kaggle competitions that the goal
is to squeeze every last drop of performance out of your model. Redoing a machine learning
problem with a different model carries the problems of being expensive, time-consuming, and
error-prone. Often either there is one model that is “right” for the dataset, as is
the case with large image datasets and neural networks, or there is no one “right” model
and several will be close to the optimum, as is the case in most non-image-based Kaggle
competitions.
Data validation is another key component of model validation. Data values can be corrupted
or contain errors in ways that impair the results of model training. The integrity of data values
can be verified by manually delving into sections of the data, programmatically searching
through it, or by creating graphs. The integrity can also be checked qualitatively by ensuring
that the data was drawn from a reliable, trustworthy, well-maintained and up-to-date source.
Data for the training and test sets should be drawn from the same probability distribution or
as close as possible to achieve adequate results. In addition, there is a risk that models may be
vulnerable to errors on specific input data values because they are poorly represented in the
training set. If it happens to be the case that such a class of errors is possible, it is important to
verify that the training set adequately covers all data inputs on which the model will need to
be evaluated, or the model may fail validation. There are many methods of guaranteeing that
the training set is adequate, including manual searching through the data or creating visual
plots of it.
It is in general critical to have made correct assumptions about the similarities between the
training set and the data the model will ultimately be evaluated on, which is again a
qualitative process.
cluster analysis:
Cluster analysis is a technique whose purpose is to divide into groups (clusters) a collection
of objects in such a way that:
1. The objects of the same group are as similar as possible (internal cohesion of the group).
2. The objects of different groups are as dissimilar as possible (external separation between groups).
CLUSTER ANALYSIS
Cluster analysis is the grouping of objects based on their characteristics such that there is
high intra-cluster similarity and low inter-cluster similarity.
WHAT IS CLUSTERING?
Cluster analysis is the grouping of objects such that objects in the same cluster are more
similar to each other than they are to objects in another cluster. The classification into clusters
is done using criteria such as smallest distances, density of data points, graphs, or various
statistical distributions. Cluster analysis has wide applicability, including in unsupervised
machine learning, data mining, statistics, Graph Analytics, image processing, and numerous
physical and social science applications.
      K-Means finds clusters by minimizing the mean distance between geometric points.
      DBSCAN uses density-based spatial clustering.
Cluster analysis is a problem with significant parallelism and can be accelerated by using
GPUs. The NVIDIA Graph Analytics library (nvGRAPH) will provide both spectral and
hierarchical clustering/partitioning techniques based on the minimum balanced cut metric in
the future. The nvGRAPH library is freely available as part of the
NVIDIA® CUDA® Toolkit.
are supported for both single-GPU and large data center deployments. For large datasets,
these GPU-based implementations can complete 10-50X faster than their CPU equivalents.
K-means algorithm:
K-means clustering is an unsupervised learning algorithm which groups the unlabeled dataset
into different clusters. Here K defines the number of pre-defined clusters that need to be
created in the process: if K=2, there will be two clusters, for K=3 there will be three
clusters, and so on.
It is an iterative algorithm that divides the unlabeled dataset into K different clusters in such a
way that each data point belongs to only one group of points with similar properties.
It allows us to cluster the data into different groups and is a convenient way to discover the
categories of groups in the unlabeled dataset on its own, without the need for any labeled
training data.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main
aim of this algorithm is to minimize the sum of distances between the data points and the
centroids of their corresponding clusters.
The algorithm takes the unlabeled dataset as input, divides the dataset into K
clusters, and repeats the process until the cluster assignments stop improving.
The value of K should be predetermined in this algorithm.
The k-means clustering algorithm mainly performs two tasks:
   o   Determines the best value for K center points or centroids by an iterative process.
   o   Assigns each data point to its closest k-center. The data points which are near
       a particular k-center form a cluster.
Hence each cluster has data points with some commonalities, and it is away from
other clusters.
The working of the K-means clustering algorithm is explained in the steps below:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as centroids. (They need not come from the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the
predefined K clusters.
Step-4: Calculate the variance and place a new centroid in each cluster.
Step-5: Repeat the third step, which means reassigning each data point to the new
closest centroid.
Step-6: If any reassignment occurs, go to Step-4; otherwise the model is ready.
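The procedure can be tried out in R with the built-in kmeans() function from the stats package. The sketch below is only an illustration: the two variables M1 and M2 are simulated (the values are assumptions, not data from these notes), and K = 2 is chosen in advance.

```r
# Simulated two-variable data set (hypothetical values, for illustration only).
set.seed(42)
M1 <- c(rnorm(20, mean = 2, sd = 0.5), rnorm(20, mean = 8, sd = 0.5))
M2 <- c(rnorm(20, mean = 3, sd = 0.5), rnorm(20, mean = 9, sd = 0.5))
points <- data.frame(M1, M2)

# K is fixed beforehand. The random start, the assignment of points and the
# centroid updates are all carried out inside kmeans(); nstart = 10 repeats
# the random start ten times and keeps the best result.
fit <- kmeans(points, centers = 2, nstart = 10)

print(fit$centers)         # final centroids, one row per cluster
print(table(fit$cluster))  # number of points assigned to each cluster
```

Because the two simulated groups are far apart, each cluster should recover one group; with real data the result depends on the choice of K and on the starting points.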
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two
variables is given below:
   o   Let's take the number of clusters K=2, which means we will try to group the
       dataset into two different clusters.
   o   We need to choose K random points or centroids to form the clusters.
       These points can be either points from the dataset or any other points. Here
       we select two points as k-points which are not part of our dataset.
       Consider the below image:
   o   Now we will assign each data point of the scatter plot to its closest k-point or
       centroid. We compute this with the usual formula for the distance between
       two points. To do so, we draw a median line between the two centroids.
       Consider the below image:
From the above image, it is clear that the points to the left of the line are nearer to the K1 or
blue centroid, and the points to the right of the line are closer to the yellow centroid. Let's
color them blue and yellow for clear visualization.
   o   Next, a new centroid is computed for each cluster (Step-4), and we reassign each
       data point to the new centroid. For this, we repeat the same process of finding
       a median line. The median will be as in the below image:
From the above image, we can see that one yellow point is on the left side of the line,
and two blue points are to the right of the line. So, these three points will be assigned to
new centroids.
As reassignment has taken place, we go back to Step-4, which is finding
new centroids or k-points.
   o   We repeat the process by finding the center of gravity of each cluster, so the
       new centroids will be as shown in the below image:
   o   As we have new centroids, we again draw the median line and reassign
       the data points. So, the image will be:
   o   We can see in the above image that there are no dissimilar data points on either
       side of the line, which means our model is formed. Consider the below image:
As our model is ready, we can now remove the assumed centroids, and the two
final clusters will be as shown in the below image:
Introduction to Naive Bayes
The Naive Bayes model is easy to build and is mostly used for large datasets. It is a
probabilistic machine learning model that is used for classification
problems. The core of the classifier depends on Bayes’ theorem with an
assumption of independence among predictors. That means changing the
value of one feature doesn’t change the value of another feature.
Naive Bayes does seem to be a simple yet powerful algorithm. But why is it
so popular?
Suppose I ask you to pick a card from the deck and find the probability of
getting a king given the card is clubs.
Observe carefully that here I have mentioned a condition that the card is
clubs.
Since we have only one king in clubs the probability of getting a KING
given the card is clubs will be 1/13 = 0.077.
If a person is asked to find the probability of getting a tail his answer would
be 3/4 = 0.75
From the above examples, we observe that the probability may change if
some additional information is given to us. This is exactly the case while
building any machine learning model, we need to find the output given
some features.
Bayes’ Rule
Now we are prepared to state one of the most useful results in conditional
probability: Bayes’ Rule.
Bayes’ rule provides us with the formula for the probability of Y given some
feature X. In real-world problems, we hardly find any case where there is
only one feature.
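For reference, Bayes' rule states that P(Y|X) = P(X|Y) * P(Y) / P(X). As a quick check, the sketch below applies it to the card example above. The way the 52-card deck is built in R is an illustrative assumption; the result is compared with a direct count.

```r
# Build a standard 52-card deck (illustrative construction).
ranks <- c("A", 2:10, "J", "Q", "K")
suits <- c("clubs", "diamonds", "hearts", "spades")
deck <- expand.grid(rank = ranks, suit = suits, stringsAsFactors = FALSE)

p_king             <- mean(deck$rank == "K")                         # P(Y)
p_clubs            <- mean(deck$suit == "clubs")                     # P(X)
p_clubs_given_king <- mean(deck$suit[deck$rank == "K"] == "clubs")   # P(X|Y)

# Bayes' rule: P(King | Clubs) = P(Clubs | King) * P(King) / P(Clubs)
p_king_given_clubs <- p_clubs_given_king * p_king / p_clubs
print(p_king_given_clubs)   # equals 1/13, about 0.077

# Direct count for comparison: share of kings among the clubs.
print(mean(deck$rank[deck$suit == "clubs"] == "K"))
```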
When the features are independent, we can extend Bayes’ rule to what is
called Naive Bayes. It assumes that the features are independent, which
means changing the value of one feature doesn’t influence the values of
the other features; this is why we call the algorithm “NAIVE”.
Naive Bayes can be used for various things like face recognition, weather
prediction, Medical Diagnosis, News classification, Sentiment Analysis, and
a lot more.
When there are multiple X variables, we simplify it by assuming that the X’s are
independent, so
P(Y | X1, ..., Xn) = P(X1 | Y) * ... * P(Xn | Y) * P(Y) / (P(X1) * ... * P(Xn))
Since the denominator is the same for every class, we can remove it when
comparing classes. Whether you remove it is your choice; removing the
denominator saves time and calculations.
There are a whole lot of formulas mentioned here but worry not we will try
to understand all this with the help of an example.
Assumptions of Naive Bayes
· All the variables are independent. That is, if the animal is Dog, that doesn’t
mean that Size will be Medium.
· All the predictors have an equal effect on the outcome. That is, the animal
being a dog does not have more importance in deciding whether we can pet it or
not. All the features have equal importance.
We would like to apply the Naive Bayes formula to the above dataset;
before that, however, we need to do some precomputations on our dataset.
We also need the probabilities (P(y)), which are calculated in the table
below. For example, P(Pet Animal = NO) = 6/14.
Now if we send our test data, suppose test = (Cow, Medium, Black)
We know P(Yes|Test)+P(No|test) = 1
We see here that P(Yes|Test) > P(No|Test), so the prediction that we can
pet this animal is “Yes”.
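The same calculation can be sketched in base R. The training table below is hypothetical (it is NOT the 14-row dataset referred to in the text, which is not reproduced here), and the column names are assumptions chosen to match the test case (Cow, Medium, Black).

```r
# Hypothetical training data, for illustration only.
train <- data.frame(
  animal = c("Dog", "Cow", "Cow", "Rat", "Dog", "Rat", "Cow", "Dog"),
  size   = c("Medium", "Medium", "Medium", "Small", "Big", "Small", "Medium", "Medium"),
  color  = c("Black", "White", "Black", "White", "Brown", "Black", "White", "Brown"),
  pet    = c("Yes", "No", "Yes", "No", "Yes", "No", "Yes", "Yes"),
  stringsAsFactors = FALSE
)
test <- c(animal = "Cow", size = "Medium", color = "Black")

# Unnormalised score for one class: P(y) times the product of P(x_i | y).
score <- function(cls) {
  rows  <- train[train$pet == cls, ]
  prior <- nrow(rows) / nrow(train)
  lik   <- sapply(names(test), function(f) mean(rows[[f]] == test[[f]]))
  prior * prod(lik)
}

scores    <- c(Yes = score("Yes"), No = score("No"))
posterior <- scores / sum(scores)   # normalise so that the two values sum to 1
print(posterior)
print(names(which.max(posterior)))  # predicted class
```

Dividing by sum(scores) plays the role of the denominator; skipping it would not change which class has the larger score.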
We can use this formula to compute the probability of likelihoods if our data
is continuous.
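The formula itself is not reproduced here. Assuming it is the Gaussian (normal) density, which is the usual choice for continuous features in Naive Bayes, the likelihood of a value can be sketched as follows. The feature values are hypothetical.

```r
# Hypothetical continuous feature values observed for one class.
x_class <- c(4.9, 5.1, 5.4, 5.0, 5.6)
mu    <- mean(x_class)
sigma <- sd(x_class)

# Likelihood P(x | class) under a normal distribution,
# written out by hand and with the built-in dnorm().
x_new  <- 5.2
manual <- exp(-(x_new - mu)^2 / (2 * sigma^2)) / sqrt(2 * pi * sigma^2)
print(manual)
print(dnorm(x_new, mean = mu, sd = sigma))
```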
The term Knowledge Discovery in Databases, or KDD for short, refers to the broad
process of finding knowledge in data, and emphasizes the "high-level" application
of particular data mining methods. It is of interest to researchers in machine
learning, pattern recognition, databases, statistics, artificial intelligence, knowledge
acquisition for expert systems, and data visualization.
The unifying goal of the KDD process is to extract knowledge from data in the
context of large databases.
It does this by using data mining methods (algorithms) to extract (identify) what is
deemed knowledge, according to the specifications of measures and thresholds,
using a database along with any required preprocessing, subsampling, and
transformations of that database.
The overall process of finding and interpreting patterns from data involves the
repeated application of the following steps:
   1. Developing an understanding of
         o the application domain
         o the relevant prior knowledge
         o the goals of the end-user
   2. Creating a target data set: selecting a data set, or focusing on a subset of
      variables, or data samples, on which discovery is to be performed.
   3. Data cleaning and preprocessing.
         o Removal of noise or outliers.
         o Collecting necessary information to model or account for noise.
         o Strategies for handling missing data fields.
         o Accounting for time sequence information and known changes.
   4. Data reduction and projection.
         o Finding useful features to represent the data depending on the goal
             of the task.
         o Using dimensionality reduction or transformation methods to reduce
             the effective number of variables under consideration or to find
             invariant representations for the data.
   5. Choosing the data mining task.
         o Deciding whether the goal of the KDD process is classification,
             regression, clustering, etc.
   6. Choosing the data mining algorithm(s).
         o Selecting method(s) to be used for searching for patterns in the data.
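Steps 3 and 4 can be illustrated with a few lines of base R. This is only a sketch under simple assumptions: the data frame is made up, missing values are filled with the column mean, and principal component analysis (prcomp()) stands in for "dimensionality reduction"; other cleaning and reduction methods are equally valid.

```r
# Hypothetical raw data with one missing field (illustration only).
raw <- data.frame(
  x1 = c(1.0, 2.1, NA, 4.2, 5.1, 6.0),
  x2 = c(2.0, 4.1, 6.2, 8.1, 10.3, 12.1),
  x3 = c(5.0, 3.9, 3.1, 2.2, 0.9, 0.1)
)

# Step 3 (cleaning): fill the missing field with the column mean.
clean <- raw
clean$x1[is.na(clean$x1)] <- mean(clean$x1, na.rm = TRUE)

# Step 4 (reduction): project the variables onto principal components.
pca <- prcomp(clean, center = TRUE, scale. = TRUE)
print(summary(pca))

# Keep only the first component as a one-variable representation.
reduced <- pca$x[, 1, drop = FALSE]
print(reduced)
```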
KDD refers to the overall process of discovering useful knowledge from data. It
involves the evaluation and possibly interpretation of the patterns to make the
decision of what qualifies as knowledge. It also includes the choice of encoding
schemes, preprocessing, sampling, and projections of the data prior to the data
mining step.
Data mining refers to the application of algorithms for extracting patterns from
data without the additional steps of the KDD process.
Single-variable models are simply models built using only one variable at a time. Single-
variable models can be powerful tools, so it’s worth learning how to work well with them
before jumping into general modeling (which almost always means multiple variable
models). We’ll show how to build single-variable models from both categorical and numeric
variables. By the end of this section, you should be able to build, evaluate, and cross-validate
single-variable models with confidence.
A single-variable model built from a categorical variable partitions the training data into
subsets indexed by the predictive variable and then stores a summary of the distribution of
outcome as its future prediction. These models are atoms or sub-assemblies that we sum in
different ways to get the rest of the models of this chapter.
From this, we see variable 218 takes on two values plus NA, and we see the joint
distribution of these values against the churn outcome. At this point it’s easy to write down a
single-variable model based on variable 218.
Function to build single-variable models for categorical variables
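The listing this caption refers to is not reproduced here. The function below is a minimal sketch of the idea, not the original listing: the name make_pred_cat and its arguments are hypothetical. For each level of the categorical variable (with NA treated as its own level) it stores the observed rate of the positive outcome and uses that rate as the prediction.

```r
# Sketch of a single-variable model for a categorical predictor.
#   outcome : outcome vector for the training rows
#   x_train : categorical predictor for the training rows
#   x_new   : predictor values to score
#   pos     : label of the positive outcome
make_pred_cat <- function(outcome, x_train, x_new, pos) {
  p_pos <- mean(outcome == pos)   # overall positive rate, used as fallback
  # Treat NA as a level of its own.
  x_train <- ifelse(is.na(x_train), "NA_level", as.character(x_train))
  x_new   <- ifelse(is.na(x_new),   "NA_level", as.character(x_new))
  # Positive rate within each level of the training variable.
  rate <- vapply(split(outcome == pos, x_train), mean, numeric(1))
  pred <- unname(rate[x_new])
  pred[is.na(pred)] <- p_pos      # levels never seen in training
  pred
}

# Usage with made-up data: 1 marks the positive outcome.
outcome <- c(1, 0, 1, 1, 0, 0)
x       <- c("a", "a", "b", NA, "b", "a")
pred    <- make_pred_cat(outcome, x, c("a", "b", NA, "z"), pos = 1)
print(pred)
```

A production version would usually smooth the per-level rates so that rare levels do not produce extreme predictions.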
For most model evaluations, we just want to compute one or two summary scores that tell us if the
model is effective. To decide if a given score is high or low, we have to appeal to a few ideal models:
a null model (which tells us what low performance looks like), a Bayes rate model (which tells us
what high performance looks like), and the best single-variable model (which tells us what a simple
model can achieve). We outline the concepts in the table “Ideal models to calibrate against”.
In this section, we’ll present the standard measures of model quality, which are useful in model
construction. In all cases, we suggest that in addition to the standard model quality assessments you
try to design your own custom “business-oriented loss function” with your project sponsor or client.
Usually this is as simple as assigning a notional dollar value to each outcome and then seeing how
your model performs under that criterion. Let’s start with how to evaluate classification models and
then continue from there.
Ideal models to calibrate against
Ideal model Purpose
Null model A null model is the best model of a very simple form you’re trying to outperform. The
           two most typical null model choices are a model that is a single constant (returns the
           same answer for all situations) or a model that is independent (doesn’t record any
           important relation or interaction between inputs and outputs). We use null models to
           lower-bound desired performance, so we usually compare to a best null model. For
           example, in a categorical problem, the null model would always return the most popular
           category (as this is the easy guess that is least often wrong); for a score model, the null
           model is often the average of all the outcomes (as this has the least square deviation
           from all of the outcomes); and so on. The idea is this: if you’re not out-performing the
           null model, you’re not delivering value. Note that it can be hard to do as well as the
           best null model, because even though the null model is simple, it’s privileged to know
           the overall distribution of the items it will be quizzed on. We always assume the null
           model we’re comparing to is the best of all possible null models.
Bayes rate A Bayes rate model (also sometimes called a saturated model) is a best possible model
model      given the data at hand. The Bayes rate model is the perfect model and it only makes
           mistakes when there are multiple examples with the exact same set of known facts
           (same xs) but different outcomes (different ys). It isn’t always practical to construct the
           Bayes rate model, but we invoke it as an upper bound on a model evaluation score. If we
           feel our model is performing significantly above the null model rate and is approaching
           the Bayes rate, then we can stop tuning. When we have a lot of data and very few
           modeling features, we can estimate the Bayes error rate. Another way to estimate the
           Bayes rate is to ask several different people to score the same small sample of your data;
           the found inconsistency rate can be an estimate of the Bayes rate. [a]
Single-     We also suggest comparing any complicated model against the best single-variable
variable    model you have available. A complicated model can’t be justified if it doesn’t
models      outperform the best single-variable model available from your training data.
            Also, business analysts have many tools
            for building effective single-variable models (such as pivot tables), so if your client is an
            analyst, they’re likely looking for performance above this level.
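As a small illustration of the null model row, the sketch below computes the two typical null models in R. The outcome vectors are made up for the example.

```r
# Hypothetical categorical outcomes (illustration only).
churn <- c("no", "no", "yes", "no", "no", "yes", "no", "no")

# Null model for classification: always predict the most popular category.
majority      <- names(which.max(table(churn)))
null_accuracy <- mean(churn == majority)
print(majority)
print(null_accuracy)

# Null model for a numeric score: always predict the mean of the outcomes.
y         <- c(10, 12, 9, 15, 14)
null_rmse <- sqrt(mean((y - mean(y))^2))
print(null_rmse)
```

Any candidate model should beat null_accuracy (or come in under null_rmse) before it is worth considering further.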
The first part of the summary() is how the lm() model was constructed:
Call:
lm(formula = log(PINCP, base = 10) ~ AGEP + SEX + COW + SCHL,
    data = dtrain)
We looked at how to use linear regression to model and predict quantitative output, and how
to use logistic regression to predict class probabilities. Linear and logistic regression models
are powerful tools, especially when you want to understand the relationship between the input
variables and the output. They’re robust to correlated variables (when regularized), and
logistic regression preserves the marginal probabilities of the data. The primary shortcoming
of both these models is that they assume that the relationship between the inputs and the
output is monotone. That is, if more is good, then much more is always better.
You want R-squared to be fairly large (1.0 is the largest you can achieve) and the R-squared
values on test and training data to be similar. A significantly lower R-squared on test data is
a symptom of an overfit model that looks good in training and won’t work in production. In our
case, the R-squared values were 0.338 on training and 0.261 on test. We’d like to see R-squared
values higher than this (say, 0.7–1.0). So the model is of low quality, but not substantially overfit.
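R-squared on training and test data can be computed directly from predictions. The sketch below uses simulated data, not the income data set discussed above, and the helper name rsq is hypothetical.

```r
# R-squared: 1 - (residual sum of squares / total sum of squares).
rsq <- function(y, pred) {
  1 - sum((y - pred)^2) / sum((y - mean(y))^2)
}

# Simulated data (hypothetical), split into training and test halves.
set.seed(1)
d <- data.frame(x = runif(200))
d$y <- 2 * d$x + rnorm(200, sd = 0.5)
train <- d[1:100, ]
test  <- d[101:200, ]

model <- lm(y ~ x, data = train)
rsq_train <- rsq(train$y, predict(model, newdata = train))
rsq_test  <- rsq(test$y,  predict(model, newdata = test))
print(rsq_train)
print(rsq_test)
```

On the training data this value matches the R-squared reported by summary(model); a much lower value on the test data would point to overfitting.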
Once the model is fit, scoring is fast.
While the sponsor is the role that represents the business interest, the client is the role that
represents the model’s end users’ interests. Sometimes the sponsor and client roles may be
filled by the same person. Again, the data scientist may fill the client role if they can weigh
business trade-offs, but this isn’t ideal.
As with the sponsor, you should keep the client informed and involved. Ideally you’d like to
have regular meetings with them to keep your efforts aligned with the needs of the end users.
Generally the client belongs to a different group in the organization and has other
responsibilities beyond your project. Keep meetings focused, present results and progress in
terms they can understand, and take their critiques to heart. If the end users can’t or won’t use
your model, then the project isn’t a success, in the long run.
Linear and logistic regression:
Linear Regression and Logistic Regression are two well-known Machine Learning algorithms
that come under the supervised learning technique. Since both algorithms are supervised in
nature, they use labeled datasets to make predictions. The main difference between them is
how they are used: Linear Regression is used for solving regression problems, whereas
Logistic Regression is used for solving classification problems. Both algorithms are
described below, followed by a table of differences.
Linear Regression:
   o   Linear Regression is one of the simplest Machine Learning algorithms; it comes
       under the Supervised Learning technique and is used for solving regression problems.
   o   It is used for predicting a continuous dependent variable with the help of
       independent variables.
   o   The goal of Linear Regression is to find the best-fit line that can accurately predict
       the output for the continuous dependent variable.
   o   If a single independent variable is used for prediction, it is called Simple Linear
       Regression; if there is more than one independent variable, it is called
       Multiple Linear Regression.
   o   By finding the best-fit line, the algorithm establishes the relationship between the
       dependent variable and the independent variable. This relationship should be linear.
   o   The output of Linear Regression should only be continuous values such as price,
       age, salary, etc. The relationship between the dependent variable and independent
       variable can be shown in the below image:
In the above image the dependent variable (salary) is on the y-axis and the independent
variable (experience) is on the x-axis. The regression line can be written as:
y = a0 + a1x + ε
where a0 and a1 are the coefficients and ε is the error term.
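A minimal sketch of fitting such a line in R with lm() follows. The experience and salary values are simulated for illustration and are not data from these notes.

```r
# Simulated experience (years) and salary; the values are hypothetical.
set.seed(7)
experience <- 1:20
salary <- 30000 + 2500 * experience + rnorm(20, sd = 3000)

# Fit y = a0 + a1 * x + error by least squares.
fit <- lm(salary ~ experience)
print(coef(fit))   # "(Intercept)" estimates a0, "experience" estimates a1

# Predict the salary for 10.5 years of experience.
print(predict(fit, newdata = data.frame(experience = 10.5)))
```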
Logistic Regression:
   o   Logistic Regression is one of the most popular Machine Learning algorithms that
       come under the Supervised Learning technique.
   o   It can be used for Classification as well as for Regression problems, but is mainly
       used for Classification problems.
   o   Logistic Regression is used to predict a categorical dependent variable with the help
       of independent variables.
   o   The output of a Logistic Regression model is a probability, so it lies between 0 and 1.
   o   Logistic Regression can be used where the probability of one of two classes is
       required, such as whether it will rain today or not, 0 or 1, true or false, etc.
   o   Logistic Regression is based on the concept of Maximum Likelihood estimation:
       the parameters are chosen so that the observed data are most probable.
   o   In Logistic Regression, we pass the weighted sum of inputs through an activation
       function that maps values to between 0 and 1. This activation function is known
       as the sigmoid function, and the curve obtained is called the sigmoid curve or S-curve.
       Consider the below image:
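The sigmoid function and a logistic regression fit can be sketched in R as follows. The humidity and rain data are simulated, and the variable names are assumptions made for the example.

```r
# Sigmoid function: maps any real number into the interval (0, 1).
sigmoid <- function(z) 1 / (1 + exp(-z))
print(sigmoid(c(-5, 0, 5)))

# Simulated binary outcome (hypothetical data): rain (1) or no rain (0),
# with the chance of rain rising with humidity.
set.seed(3)
humidity <- runif(200, min = 20, max = 100)
rain <- rbinom(200, size = 1, prob = sigmoid(-6 + 0.1 * humidity))

# glm() with family = binomial fits logistic regression by maximum likelihood.
fit <- glm(rain ~ humidity, family = binomial)
p <- predict(fit, newdata = data.frame(humidity = c(30, 90)), type = "response")
print(p)                       # predicted probabilities between 0 and 1
print(ifelse(p > 0.5, 1, 0))   # class labels using a 0.5 threshold
```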
Linear Regression                              Logistic Regression
Linear regression is used to predict the       Logistic Regression is used to predict the
continuous dependent variable using a          categorical dependent variable using a
given set of independent variables.            given set of independent variables.
Linear Regression is used for solving          Logistic regression is used for solving
Regression problems.                           Classification problems.
In linear regression, we find the best-fit     In Logistic Regression, we find the S-curve
line, by which we can easily predict the       by which we can classify the
output.                                        samples.
The output for Linear Regression must be       The output of Logistic Regression must
a continuous value, such as price, age,        be a categorical value such as 0 or 1, Yes
etc.                                           or No, etc.
       Clustering is an unsupervised data science technique where the records in a dataset are
       organized into different logical groupings. The data are grouped in such a way that records
       inside the same group are more similar than records outside the group. Clustering has a wide
       variety of applications ranging from market segmentation to customer segmentation, electoral
       grouping, web analytics, and outlier detection. Clustering is also used as a data compression
       technique and data preprocessing technique for supervised tasks. Many different data science
       approaches are available to cluster the data and are developed based on proximity between
       the records, density in the dataset, or novel application of neural networks. k-Means
       clustering, density clustering, and self-organizing map techniques are reviewed in the chapter
       along with implementations using RapidMiner.
       A wide array of data based process monitoring techniques have been developed for the online
       classification of process data into normal and faulty classes (Ge 2013), however many of
       these methods are “supervised”, or require that the training data for the models be organized
into labelled groups. In real plants this is rarely true, and unsupervised data mining
algorithms are needed to find meaningful clusters corresponding to fault data.
With any fault diagnosis system, a major obstacle to implementation is that process data are
often uncategorized. Algorithms need to (1) separate fault data from normal data, (2) train a
model based on statistics or a supervised learning technique for fault detection, and (3) assist
with the identification and management of new faults. It is important for the larger
acceptance of these methods that those tasks are all performed in a way that is simple to
understand for non-experts in data science and easy to deploy on multiple units around a
plant with low overhead.
This research studies the potential for data clustering and unsupervised learning to
automatically separate data into groups significant to abnormal event detection.
Venkatasubramanian (2009) calls for a “tool box” based approach in which a data modeller is
comfortable with using a diverse array of modelling techniques to solve a given problem. In
that spirit, this research evaluates a set of knowledge discovery techniques for mining
databases to solve process monitoring problems. Sensor data from an industrial separations
tower, reactor, and the Tennessee Eastman simulation are studied and used to compare
different dimensionality reduction and clustering techniques in terms of their effectiveness in
extracting knowledge from process databases.
Association rules:
Association rule learning is a type of unsupervised learning technique that checks for the
dependency of one data item on another data item and maps them accordingly so that the
relationship can be used more profitably. It tries to find interesting relations or associations
among the variables of a dataset, using different rules to discover these relations in the
database.
Association rule learning is one of the very important concepts of machine learning, and
it is employed in market basket analysis, web usage mining, continuous production,
etc. Market basket analysis is a technique used by big retailers to discover the
associations between items. We can understand it by taking the example of a supermarket,
where products that are frequently purchased together are placed together.
For example, if a customer buys bread, he is also likely to buy butter, eggs, or milk, so
these products are stored on the same shelf or nearby.
Two commonly used association rule learning algorithms are:
   1. Apriori
   2. Eclat
Association rules are evaluated using the following metrics:
   o   Support
   o   Confidence
   o   Lift
Support
Support is the frequency of an itemset, that is, how frequently it appears in the dataset. It is
defined as the fraction of the transactions that contain the itemset X. If T is the total
number of transactions, it can be written as:
Support(X) = Freq(X) / T
Confidence
Confidence indicates how often the rule has been found to be true, that is, how often the
items X and Y occur together in the dataset given that X occurs. It is the ratio of the
number of transactions that contain both X and Y to the number of transactions
that contain X:
Confidence(X -> Y) = Freq(X, Y) / Freq(X)
Lift
Lift measures the strength of a rule and is defined as:
Lift(X -> Y) = Support(X, Y) / (Support(X) * Support(Y))
It is the ratio of the observed support to the support expected if X and Y were
independent of each other. There are three possible cases:
   o   Lift = 1: The occurrences of the antecedent and the consequent are
       independent of each other.
   o   Lift > 1: The two itemsets are positively dependent on each other; the larger the
       value, the stronger the dependence.
   o   Lift < 1: One item is a substitute for the other, which means one item
       has a negative effect on the other.
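The three metrics can be computed by hand in base R. The transactions below are hypothetical, and the helper name support is an assumption made for this sketch.

```r
# Hypothetical market-basket transactions (illustration only).
transactions <- list(
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("bread", "eggs"),
  c("milk", "eggs"),
  c("bread", "butter", "eggs")
)

# Support: fraction of transactions that contain every item in `items`.
support <- function(items) {
  mean(sapply(transactions, function(t) all(items %in% t)))
}

X <- "bread"
Y <- "butter"
supp_xy    <- support(c(X, Y))                      # Support(X, Y)
confidence <- supp_xy / support(X)                  # Confidence(X -> Y)
lift       <- supp_xy / (support(X) * support(Y))   # Lift(X -> Y)
print(c(support = supp_xy, confidence = confidence, lift = lift))
```

Here the lift is above 1, so in this made-up data bread and butter are bought together more often than independence would suggest.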
Apriori Algorithm
This algorithm uses frequent itemsets to generate association rules. It is designed to
work on databases that contain transactions. The algorithm uses a breadth-first
search and a hash tree to count itemsets efficiently.
It is mainly used for market basket analysis and helps to understand which products
can be bought together. It can also be used in the healthcare field to find drug
reactions for patients.
Eclat Algorithm
Eclat stands for Equivalence Class Transformation. This algorithm uses a
depth-first search technique to find frequent itemsets in a transaction database. It
generally executes faster than the Apriori algorithm.
UNIT III
objects, Types of Data items, the structure of data items, examining data structure,
working with history commands, saving your work in R.
PROBABILITY DISTRIBUTIONS in R - Binomial, Poisson, Normal distributions.
INTRODUCTION TO R LANGUAGE:
To read an entire data frame directly, the external file will normally have a special form:
   o   The first line of the file should have a name for each variable in the data frame.
   o   Each additional line of the file has as its first item a row label, followed by the
       values for each variable.
By default numeric items (except row labels) are read as numeric variables. This can be
changed if necessary.
The function read.table() can then be used to read the data frame directly.
Similarly, the read.csv() function can be used to read a .csv file directly into a data
frame.
[Note: on Windows you may need to write a double backslash \\ (or a single forward slash /) in
your path, because R treats a single backslash inside a string as an escape character.]
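As an illustration of the file layout described above, the sketch below writes a small file to a temporary location and reads it back. The file contents and variable names are hypothetical.

```r
# Write a small example file (hypothetical data) to a temporary location.
path <- tempfile(fileext = ".txt")
writeLines(c("Price Floor Area",
             "A 52.00 111 830",
             "B 54.75 128 710",
             "C 57.50 101 1000"), path)

# The header line has one field fewer than the data lines, so read.table()
# uses it as the variable names and the first column as row labels.
houses <- read.table(path)
print(houses)

# The same data stored as a .csv file and read back with read.csv().
csv_path <- tempfile(fileext = ".csv")
write.csv(houses, csv_path)
houses2 <- read.csv(csv_path, row.names = 1)
print(houses2)
```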
Occasionally, you will need to read in data that does not already have column name
information. For example, the dataset BOD.txt looks like this:
1 8.3
2 10.3
3 19.0
4 16.0
5 15.6
7 19.8
Initially, there are no column names associated with the dataset. We can use
the colnames() command to assign column names to the dataset. Suppose that we want to
assign the column names "Time" and "demand" to the BOD.txt dataset. To do so we do the following:
> bod <- read.table("BOD.txt", header=F)
> colnames(bod) <- c("Time", "demand")
The first command reads in the dataset; the option "header=F" specifies that there are
no column names associated with the dataset. The second command assigns the column names.
Read in the cars.txt dataset and call it car1. Make sure you use
the "header=F" option to specify that there are no column names associated with
the dataset. Next, assign "speed" and "dist" to be the first and second column
names to the car1 dataset.
install.packages("haven")
library(haven)
dat = read_sas("path to file", "path to formats catalog")
The returned object will be a data frame where SAS variable labels are attached as an attribute
to each variable. When a variable is attached to a format in SAS and the formats are stored in a
library, its path also needs to be supplied. Missing values in numeric variables should be
seamlessly converted. Missing values in character variables are converted to the empty string.
To convert empty strings to missing values, use zap_empty(), for example,
dat$x1 = zap_empty(dat$x1)
SAS, Stata and SPSS all have the notion of a “labelled” variable. These are similar to
categorical factor variables in R, but integer, numeric and character vectors can be labelled and
not every value must be associated with a label. To turn a labelled variable into a standard
factor R variable use the as_factor() function,
dat$facvar = as_factor(dat$facvar)
The haven package is under active development and becoming increasingly robust. If you have
difficulties loading a file, try using the development version on GitHub:
devtools::install_github("hadley/haven")
For example, consider the National Youth Tobacco Survey (NYTS) from the CDC website. After
downloading the files (and installing the development version from GitHub) the data can be
imported into R using
x = haven::read_sas("nyts2014_dataset.sas7bdat", "nyts2014_formats.sas7bcat")
# convert qn1 to a factor:
x$qn1 = as_factor(x$qn1)
With a general-purpose import() function (such as the one in the rio package), one function
can be used to import standard text files, RData, JSON, Stata, SPSS, Excel, SAS,
XML, Minitab and many more. There is an analogous export() function that allows users to
export data to various file types just as easily.
Syntax: names(x) <- value
Parameters:
x: represents an R object
value: represents the names that have to be given to the elements of the x object
Creating a Named List
A Named list can be created by two methods. The first one is by allocating
the names to the elements while defining the list and another method is by
using names() function.
Example 1:
In this example, we are going to create a named list without
using names() function.
$lt
[1] "a" "b" "c" "d" "e" "f" "g" "h"
$n
[1] 1 2 3 4 5 6 7 8 9 10
Example 2:
In this example, we are going to define the names of elements of the list
using names() function after defining the list.
# Defining list
x <- list(matrix(1:6, nrow = 2),
letters[1:8],
c(1:10))
cat("Whole list:\n")
print(x)
Output:
Whole list:
[[1]]
   [,1] [,2] [,3]
[1,]    1   3   5
[2,]    2   4   6
[[2]]
[1] "a" "b" "c" "d" "e" "f" "g" "h"
[[3]]
[1] 1 2 3 4 5 6 7 8 9 10
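The list printed above has no names yet. A minimal sketch of the naming step with names() follows; the names "mt", "lt" and "n" are chosen here to match the other examples and are otherwise an assumption.

```r
# Define an unnamed list, then name its elements with names().
x <- list(matrix(1:6, nrow = 2),
          letters[1:8],
          1:10)
names(x) <- c("mt", "lt", "n")

print(names(x))   # the element names just assigned
print(x$lt)       # elements can now be accessed by name
```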
Accessing components of Named List
Components of a named list can be easily accessed by $ operator.
Example:
# Creating a named list
x <- list(mt = matrix(1:6, nrow = 2),
lt = letters[1:8],
n = c(1:10))
cat("Element named 'mt':\n")
print(x$mt)
cat("\n")
print(x$n)
Output:
Element named 'mt':
   [,1] [,2] [,3]
[1,]   1    3   5
[2,]   2    4   6
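Modifying components of Named List
Example:
# Creating a named list
lt <- list(a = 1,
let = letters[1:8],
mt = matrix(1:6, nrow = 2))
cat("List before modifying:\n")
print(lt)
# Modifying element 'a'
lt$a <- 5
cat("List after modifying:\n")
print(lt)
Output:
List before modifying:
$a
[1] 1
$let
[1] "a" "b" "c" "d" "e" "f" "g" "h"
$mt
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
List after modifying:
$a
[1] 5
$let
[1] "a" "b" "c" "d" "e" "f" "g" "h"
$mt
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6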
Output:
List before deleting:
$a
[1] 1
$let
[1] "a" "b" "c" "d" "e" "f" "g" "h"
$mt
     [,1] [,2] [,3]
[1,]    1   3    5
[2,]    2   4    6
$mt
     [,1] [,2] [,3]
[1,]    1   3    5
[2,]    2   4    6
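The code that produced the output above is not shown. A minimal sketch, assuming components are deleted by assigning NULL to them (the standard way to drop list elements in R), could look like this:

```r
# Named list with the same components as in the output above.
lt <- list(a = 1,
           let = letters[1:8],
           mt = matrix(1:6, nrow = 2))

# Assigning NULL to a component removes it from the list.
lt$a <- NULL
lt$let <- NULL

print(names(lt))   # only the component "mt" remains
print(lt)
```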
         Vectors
         Lists
         Matrices
         Arrays
         Factors
         Data Frames
The simplest of these objects is the vector object and there are six data types of
these atomic vectors, also termed as six classes of vectors. The other R-Objects
are built upon the atomic vectors.
print(class(v))
it produces the following result −
[1] "complex"
In R programming, the very basic data types are the R-objects called vectors which
hold elements of different classes as shown above. Please note in R the number of
classes is not confined to only the above six types. For example, we can use many
atomic vectors and create an array whose class will become array.
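As a short illustration of the six classes of atomic vectors, the sketch below stores one value of each class and prints the class names. The example values are chosen only for illustration.

```r
# One example value per atomic vector type.
examples <- list(
  logical   = TRUE,
  numeric   = 23.5,
  integer   = 2L,
  complex   = 2 + 5i,
  character = "TRUE",
  raw       = charToRaw("Hello")
)
classes <- sapply(examples, class)
print(classes)

# Combining an atomic vector with a dim attribute gives an object of class "array".
a <- array(1:24, dim = c(2, 3, 4))
print(class(a))
```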
Vectors
When you want to create a vector with more than one element, you should
use the c() function, which combines the elements into a vector.
# Create a vector.
apple <- c('red','green',"yellow")
print(apple)
Lists
A list is an R-object which can contain many different types of elements inside it like
vectors, functions and even another list inside it.
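# Create a list.
list1 <- list(c(2,5,3),21.3,sin)
# Print the list.
print(list1)
When we execute the above code, it produces the following result −
[[1]]
[1] 2 5 3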
[[2]]
[1] 21.3
[[3]]
function (x)      .Primitive("sin")
Matrices
A matrix is a two-dimensional rectangular data set. It can be created using a vector
input to the matrix function.
# Create a matrix.
M = matrix( c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)
print(M)
When we execute the above code, it produces the following result −
     [,1] [,2] [,3]
[1,] "a" "a" "b"
[2,] "c" "b" "a"
Arrays
While matrices are confined to two dimensions, arrays can have any number of
dimensions. The array function takes a dim attribute which creates the required
number of dimensions. In the example below we create an array with two elements,
each of which is a 3x3 matrix.
# Create an array.
a <- array(c('green','yellow'),dim = c(3,3,2))
print(a)
When we execute the above code, it produces the following result −
, , 1
Factors
Factors are R objects which are created using a vector. A factor stores the vector along
with the distinct values of the elements in the vector as labels. The labels are
always character, irrespective of whether the input vector is numeric, character,
Boolean, etc. Factors are useful in statistical modeling.
Factors are created using the factor() function. The nlevels() function gives the
count of levels.
# Create a vector.
apple_colors <-
c('green','green','yellow','red','red','red','green')
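The rest of this example is missing from the notes. A minimal sketch of the two functions named above, factor() and nlevels(), applied to the same vector (the variable name factor_apple is an assumption):

```r
apple_colors <- c('green','green','yellow','red','red','red','green')

# Create a factor object; the levels are the distinct values of the vector.
factor_apple <- factor(apple_colors)
print(factor_apple)
print(levels(factor_apple))   # the labels are stored as character strings
print(nlevels(factor_apple))  # number of distinct levels
```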
Data Frames
Data frames are tabular data objects. Unlike in a matrix, each column of a data frame
can contain a different mode of data. The first column can be numeric while the
second column can be character and the third column can be logical. A data frame is a
list of vectors of equal length.
Data Frames are created using the data.frame() function.
# Create the data frame.
BMI <- data.frame(
   gender = c("Male", "Male","Female"),
   height = c(152, 171.5, 165),
   weight = c(81,93, 78),
   Age = c(42,38,26)
)
print(BMI)
When we execute the above code, it produces the following result −
  gender height weight Age
1   Male 152.0      81 42
2   Male 171.5      93 38
3 Female 165.0      78 26
conversions is one of the most common sources of despair for beginners. We can say that
everything in R is an object.
   1. Atomic vector
   2. List
   3. Array
   4. Matrices
   5. Data Frame
   6. Factors
Vectors
A vector is the basic data structure in R, or we can say vectors are the most basic R
data objects. There are six types of atomic vectors: logical, integer, double,
complex, character, and raw. "A vector is a collection of elements which is most commonly
of mode character, integer, logical or numeric." A vector can be one of the
following two types:
   1. Atomic vector
   2. Lists
   List
   In R, the list is the container. Unlike an atomic vector, the list is not restricted to be a
   single mode. A list contains a mixture of data types. The list is also known as generic
   vectors because the element of the list can be of any type of R object. "A list is a
   special type of vector in which each element can be a different type."
   We can create a list with the help of list() or as.list(). We can use vector() to create a
   required length empty list.
   Arrays
   There is another type of data objects which can store data in more than two
   dimensions known as arrays. "An array is a collection of a similar data type with
   contiguous memory allocation." Suppose, if we create an array of dimension (2, 3,
   4) then it creates four rectangular matrices of two rows and three columns.
   In R, an array is created with the help of array() function. This function takes a vector
   as an input and uses the value in the dim parameter to create an array.
   Matrices
   A matrix is an R object in which the elements are arranged in a two-dimensional
   rectangular layout. A matrix contains elements of the same atomic type.
   For mathematical calculations, a matrix containing numeric elements is used. A
   matrix is created with the help of the matrix() function in R.
Syntax
   matrix(data, nrow, ncol, byrow, dimnames)
   Data Frames
   A data frame is a two-dimensional array-like structure, or we can say it is a table in
   which each column contains the values of one variable, and each row contains one set of
   values from each column.
Factors
Factors are also data objects that are used to categorize the data and store it as
levels. Factors can store both strings and integers. They are most useful for columns
that have a limited number of unique values, and they are widely used in data
analysis for statistical modeling.
Factors are created with the help of factor() function by taking a vector as an input
parameter.
The RStudio IDE maintains a database of all commands which you have ever entered
into the Console. You can browse and search this database using the History pane.
Browsing History
Commands you have previously entered in the RStudio console can be browsed from
the History tab. The commands are displayed in order (most recent at the bottom)
and grouped by block of time:
Searching History
Executing a Search
You can use the search box at the top right of the history tab to search for all
instances of a previous command (e.g. plot). The search can be further refined by
adding additional words separated by spaces (e.g. the name of particular dataset):
After searching for a command within your history you may wish to view the other
commands that were executed in proximity to it. By clicking the arrow in the right
margin of the search results you can view the command within its context:
Using Commands
Commands selected within the History pane can be used in two fashions
(corresponding to the two buttons on the left side of the History toolbar):
     Send to Console— Sends the selected command(s) to the Console. Note that
      the commands are inserted into the Console however they are not executed
      until you press Enter.
     Insert into Source— Inserts the selected command(s) into the currently
      active Source document. If there isn't currently a Source document available
      then a new untitled one will be created.
Within the history list you can select a single command or multiple commands:
Before learning how to save a dataset in R, it is a good idea to create an example dataset. The
following R script creates an R data frame [explained in another topic of this learning
infrastructure] for you to practice saving.
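The script itself did not survive in these notes. The sketch below is reconstructed from the CSV listings shown later in this section, so the original script may have built the columns differently.

```r
# Reconstructed from the CSV listings shown later in this section;
# the original script may have built the columns differently.
df <- data.frame(
  x = 1:10,
  y = 11:20,
  z = 21:30,
  t = c("red", "blue", "red", "white", "blue",
        "white", "red", "blue", "white", "white")
)
print(df)
```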
Your R session now has a data frame object named df that you can use for
the exercises below.
R dataset files
One of the simplest ways to save your data is by saving it into an RData file
with the function save( ). R saves your data to the working folder on your
computer disk in a binary file. This storage method is efficient and the only
You can save the data frame df [from the above example] using this
command:
save(df, file = "df.RData")
While the save( ) command can have several arguments, this example
uses only two. The first argument is the name of your R data object, df in
this example. The second argument assigns a name to the RData
file, df.RData in this example. You can use any text as your file name, although
names without embedded spaces are easier to work with. While you do not have
to use the .RData extension, this is a recommended practice because
the .RData extension will help RStudio to identify your R datasets. Notice
that the file name is enclosed in quotation marks.
Try to save your data frame using the save( ) command. Another topic in
this learning infrastructure addresses how to load an R dataset into R, so that
will not be covered here.
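For a quick check that the file was written correctly, the round trip can be sketched as follows. The small stand-in data frame and the temporary folder are assumptions made to keep the example self-contained; load() is the base R counterpart of save().

```r
df <- data.frame(x = 1:3, t = c("red", "blue", "white"))  # small stand-in for df

path <- file.path(tempdir(), "df.RData")  # temporary location, an assumption
save(df, file = path)

rm(df)                # remove the object from the session
loaded <- load(path)  # load() restores the object under its original name
print(loaded)         # names of the restored objects
print(df)
```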
Text files
There are other options for saving your data from your R session. You can
save your data as text file. One advantage of saving your data into a text
file is that you can open it in another application, such as a text editor or
Excel, and work with it there.
The simplest way to save your data into a text file is by using the write.csv( )
command. You may recall from the learning infrastructure topic about
reading data files that a csv file is a text file that uses commas to separate
each item of data from the other items of data. You can experiment with saving
the data frame df using the command:
write.csv(df, file = "df.csv")
While the write.csv( ) command can have several arguments, this example
uses only two. The first argument is the name of your R data object, df in
this example. The second argument assigns a name to the csv
file, df.csv in this example. You can use any text as your file name, although
names without embedded spaces are easier to work with. While you do not have to use
the .csv extension, this is a recommended practice. Notice that the file
name is enclosed in quotation marks.
"","x","y","z","t"
"1",1,11,21,"red"
"2",2,12,22,"blue"
"3",3,13,23,"red"
"4",4,14,24,"white"
"5",5,15,25,"blue"
"6",6,16,26,"white"
"7",7,17,27,"red"
"8",8,18,28,"blue"
"9",9,19,29,"white"
"10",10,20,30,"white"
Notice that each item of data is separated from the other items of data with
a comma and the header row of column titles is included. Another thing you
may notice are the numbers enclosed in quotes in front of every line. This
will be discussed below.
In both cases, your data is available for you to work with as text. The one
issue is the fact that your export of df included the line numbers. This can
be corrected by adding a third argument to your write.csv( ) command. If
you save your data object using this command
write.csv(df, file = "df2.csv", row.names = FALSE)
It will save df without the line numbers. Notice that the data object is saved
as df2.csv this time. A different name was used so you can compare the
two csv files later.
"x","y","z","t"
1,11,21,"red"
2,12,22,"blue"
3,13,23,"red"
4,14,24,"white"
5,15,25,"blue"
6,16,26,"white"
7,17,27,"red"
8,18,28,"blue"
9,19,29,"white"
10,20,30,"white"
The first column of line numbers is not in df2.csv. Everything else looks
like df.csv.
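The effect of row.names = FALSE can also be checked from within R. The sketch below uses a small stand-in data frame and writes to a temporary folder (both assumptions) and compares the header lines of the two files.

```r
df <- data.frame(x = 1:3, t = c("red", "blue", "white"))  # small stand-in for df

with_names    <- file.path(tempdir(), "df.csv")
without_names <- file.path(tempdir(), "df2.csv")
write.csv(df, file = with_names)
write.csv(df, file = without_names, row.names = FALSE)

# The first file gains an unnamed leading column holding the row names.
header_with    <- readLines(with_names, n = 1)
header_without <- readLines(without_names, n = 1)
print(header_with)
print(header_without)

# Reading the second file back gives the original two columns.
back <- read.csv(without_names)
print(names(back))
```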
Again, this looks like the df2.csv Excel worksheet without the line numbers.
You can export your R data object using other R functions. One example of
this is the function write.table( ). These functions will not be discussed
here, but references to them are easily found on the Internet.
You can export your R data object as an Excel spreadsheet using functions
in the xlsx R package. You will need to manually install this package
because the RStudio package manager will not do it. To install the
package, enter this command in the command console
install.packages("xlsx")
This will install the package and its dependencies. You will find the package in the Packages
panel of RStudio. Check the box next to the package to load it for use in your R session. This
package will enable you to read and write directly into and out of Excel files from your R
session. A good reference for this package can be found at
If you work with your text data file in Excel, you can export it as a csv file and easily import
it into your R session as discussed in another learning infrastructure topic.
You can easily save an Excel worksheet as a csv file. In Excel, open the File menu
and click Save As. [note: this example uses Mac Excel screen shots, Windows
Excel will act similarly]
Enter the name that you wish to use for your file in the file name box at the
top of the dialog. Next, go to the File Format box below the folder directory
and open the list. You can now choose the MS-DOS Comma Separated
(.csv) format.
Click the Save button. If you are exporting an Excel spreadsheet, you will
encounter two warning dialogs. They will look like this
In the first warning dialog, click Save Active Sheet. In the second warning
dialog click Continue. Excel will now save your data into a csv file.
PROBABILITY DISTRIBUTIONS in R:
The binomial distribution model deals with finding the probability of success of an
event which has only two possible outcomes in a series of experiments. For
example, tossing a coin always gives a head or a tail. The probability of getting
exactly 3 heads when a coin is tossed 10 times is given by the
binomial distribution.
R has four in-built functions to generate binomial distribution. They are described
below.
dbinom(x,    size,   prob)
pbinom(x,    size,   prob)
qbinom(p,    size,   prob)
rbinom(n,    size,   prob)
Following is the description of the parameters used −
      x is a vector of numbers.
      p is a vector of probabilities.
      n is number of observations.
      size is the number of trials.
      prob is the probability of success of each trial.
dbinom()
This function gives the probability density distribution at each point.
# Create a sequence of 51 numbers from 0 to 50, incremented by 1.
x <- seq(0,50,by = 1)
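The remainder of this example is missing from the notes. The sketch below first computes the probability mentioned in the introduction (exactly 3 heads in 10 tosses of a fair coin) and then evaluates dbinom() over the sequence; the values size = 50 and prob = 0.5 are assumptions.

```r
# Probability of exactly 3 heads in 10 tosses of a fair coin.
p3 <- dbinom(3, size = 10, prob = 0.5)
print(p3)                  # equals choose(10, 3) / 2^10

# Probability of each possible count of heads in 50 tosses (assumed values).
x <- seq(0, 50, by = 1)
y <- dbinom(x, size = 50, prob = 0.5)
print(sum(y))              # the probabilities over 0..50 sum to 1
print(x[which.max(y)])     # the most likely count of heads
```

Calling plot(x, y) on these two vectors draws the familiar bell-like shape of the distribution.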
pbinom()
This function gives the cumulative probability of an event. It is a single value
representing the probability.
print(x)
When we execute the above code, it produces the following result −
[1] 0.610116
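The code that produced the value above is missing from these notes. The call below is an assumption: the probability of 26 or fewer heads in 51 tosses of a fair coin, which should give a value close to the one shown.

```r
# Probability of getting 26 or fewer heads from 51 tosses of a fair coin.
# The arguments are an assumption chosen to match the output above.
x <- pbinom(26, 51, 0.5)
print(x)

# The same value obtained by summing the individual probabilities.
check <- sum(dbinom(0:26, 51, 0.5))
print(check)
```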
qbinom()
This function takes the probability value and gives a number whose cumulative
value matches the probability value.
# How many heads have a cumulative probability of 0.25 when a coin
# is tossed 51 times?
x <- qbinom(0.25,51,1/2)
print(x)
When we execute the above code, it produces the following result −
[1] 23
rbinom()
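This function generates the required number of random values, each being the number of
successes in a given number of trials with a given probability of success.
# Generate 8 random values, each the number of successes in 150 trials
# with a success probability of 0.4.
x <- rbinom(8,150,.4)
print(x)
When we execute the above code, it produces a result such as the following (the values differ from run to run) −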
[1] 58 61 59 66 55 60 61 67
Poisson Distribution
Problem
If there are twelve cars crossing a bridge per minute on average, find the probability of having
seventeen or more cars crossing the bridge in a particular minute.
Solution
The probability of having sixteen or fewer cars crossing the bridge in a particular minute is given
by the function ppois.
> ppois(16, lambda=12)         # lower tail
[1] 0.89871
Hence the probability of having seventeen or more cars crossing the bridge in a minute is in
the upper tail of the probability density function.
> ppois(16, lambda=12, lower=FALSE)             # upper tail
[1] 0.10129
Answer
If there are twelve cars crossing a bridge per minute on average, the probability of having
seventeen or more cars crossing the bridge in a particular minute is 10.1%.
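The answer can be cross-checked by summing the individual Poisson probabilities with dpois(). This is a minimal sketch; the variable names are chosen only for illustration.

```r
lambda <- 12

# Upper tail: P(X >= 17) = 1 - P(X <= 16).
p_upper <- ppois(16, lambda = lambda, lower.tail = FALSE)

# The same quantity from the individual probabilities.
p_check <- 1 - sum(dpois(0:16, lambda = lambda))

print(round(p_upper, 5))
print(isTRUE(all.equal(p_upper, p_check)))
```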
R - Normal Distribution:
In a random collection of data from independent sources, it is generally observed
that the distribution of the data is normal. This means that on plotting a graph with the
value of the variable on the horizontal axis and the count of the values on the vertical
axis, we get a bell-shaped curve. The center of the curve represents the mean of the
data set. In the graph, fifty percent of the values lie to the left of the mean and the other
fifty percent lie to the right of the mean. This is referred to as the normal distribution in
statistics.
R has four in built functions to generate normal distribution. They are described
below.
dnorm(x,      mean,    sd)
pnorm(x,      mean,    sd)
qnorm(p,      mean,    sd)
rnorm(n,      mean,    sd)
Following is the description of the parameters used in above functions −
       x is a vector of numbers.
       p is a vector of probabilities.
       n is number of observations(sample size).
       mean is the mean value of the sample data. Its default value is zero.
       sd is the standard deviation. Its default value is 1.
dnorm()
This function gives height of the probability distribution at each point for a given
mean and standard deviation.
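# Create a sequence of numbers between -10 and 10 incrementing by 0.1.
x <- seq(-10, 10, by = .1)
# Density at each point. The mean and sd are missing from these notes, so the
# defaults (mean = 0, sd = 1) are assumed here.
y <- dnorm(x)
png(file = "dnorm.png")
plot(x,y)
# Close the device so that the file is written.
dev.off()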
pnorm()
This function gives the probability that a normally distributed random number is
less than a given value. It is also called the "Cumulative Distribution
Function".
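The example code for pnorm() is missing from the notes. A minimal sketch with values chosen only for illustration:

```r
# P(X < 0) for a standard normal variable is one half.
p0 <- pnorm(0)
print(p0)

# About 97.5% of values lie below 1.96 standard deviations above the mean.
p196 <- pnorm(1.96)
print(p196)

# Non-default mean and sd (values chosen only for illustration).
p <- pnorm(3, mean = 2.5, sd = 2)
print(p)
```

The last call is equivalent to pnorm(0.25), because (3 - 2.5) / 2 = 0.25 standard deviations above the mean.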
qnorm()
This function takes the probability value and gives a number whose cumulative
value matches the probability value.
# Create a sequence of probability values incrementing by 0.02.
x <- seq(0, 1, by = 0.02)
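The rest of this example is missing. The sketch below assumes the default mean and standard deviation and shows that qnorm() is the inverse of pnorm().

```r
x <- seq(0, 1, by = 0.02)

# Quantiles of the standard normal distribution
# (mean and sd are left at their defaults, an assumption).
y <- qnorm(x)

print(y[c(1, 26, 51)])   # quantiles for the probabilities 0, 0.5 and 1

# qnorm() is the inverse of pnorm(): converting back recovers x.
round_trip <- all.equal(pnorm(y), x)
print(round_trip)
```

Note that the probabilities 0 and 1 map to -Inf and Inf, so these two points cannot be drawn on a plot of x against y.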
rnorm()
This function is used to generate random numbers whose distribution is normal. It
takes the sample size as input and generates that many random numbers. We draw
a histogram to show the distribution of the generated numbers.
# Create a sample of 50 numbers which are normally distributed.
y <- rnorm(50)
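The histogram step is missing from the notes. A minimal sketch is given below; the fixed seed, the plot title and the temporary PDF file are assumptions added to make the example reproducible.

```r
set.seed(42)              # fixed seed so the sample is reproducible (an addition)
y <- rnorm(50)

# Draw the histogram into a temporary PDF file; the file name is an assumption.
out <- tempfile(fileext = ".pdf")
pdf(out)
h <- hist(y, main = "Normal distribution")
dev.off()

print(sum(h$counts))      # every one of the 50 values falls in some bin
```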
we will process data using R, which is a very powerful tool, designed by statisticians for
data analysis. Described on its website as “free software environment for statistical
computing and graphics,” R is a programming language that opens a world of
possibilities for making graphics and analyzing and processing data. Indeed, just about
anything you may want to do with data can be done with R, from web scraping to making
interactive graphics.
Next week we will make static graphics with R. We will explore its potential for making
interactive charts and maps in week 13, and use it to make animations in week 14. Our
goal for this week’s class is to get used to working with data in R.
The main panel to the left is the R Console. Type valid R code into here, hit return, and it
will be run. See what happens if you run:
print("Hello World!")
      pfizer.csv  Payments made by Pfizer to doctors across the United States in the
       second half of 2009. Contains the following variables:
           o org_indiv Full name of the doctor, or their organization.
           o first_plus Doctor’s first and middle names.
           o first_name last_name First and last names.
           o city state City and state.
           o category Type of payment, which includes Expert-Led Forums, in
               which doctors lecture their peers on using Pfizer’s drugs, and
               Professional Advising.
           o cash Value of payments made in cash.
           o other Value of payments made in-kind, for example purchase of meals.
           o total Value of payment, whether cash or in-kind.
      fda.csv Data on warning letters sent to doctors by the U.S. Food and Drug
       Administration, because of problems in the way in which they ran clinical trials
       testing experimental treatments. Contains the following variables:
           o name_last name_first name_middle Doctor’s last, first, and middle names.
Any code we type in here can be run in the console. Hitting Run will run the line of code
on which the cursor is sitting. To run multiple lines of code, highlight them and click Run.
Click on the save/disk icon in the script panel and save the blank script to the file on your
desktop with the data for this week, calling it week7.R.
setwd("~/Desktop/week7")
Copy this code into your script, placing it at the end, with a comment, explaining what it
does:
            o    + - add, subtract.
            o    * / multiply, divide.
            o    > < greater than, less than.
            o    >= <= greater than or equal to, less than or equal to.
            o    != not equal to.
      Equals signs can be a little confusing, but see how they are used in the code we
       use today:
In this class, we will work with two incredibly useful packages developed by Hadley
Wickham, chief scientist at RStudio:
      readr For reading and writing CSV and other text files.
      dplyr For processing and manipulating data.
These and several other useful packages have been combined into a super-package
called tidyverse.
To install a package, click on the Install icon in the Packages tab, type its name into the
dialog box, and make sure that Install dependencies is checked, as some packages will
only run correctly if other packages are also installed. Click Install and all of the
required packages should install:
install.packages("tidyverse")
So you can also install packages with code in this format, without using the point-and-click
interface.
Each time you start R, it’s a good idea to click on Update in the Packages panel to update
all your installed packages to the latest versions.
Installing a package makes it available to you, but to use it in any R session you need to
load it. You can do this by checking its box in the Packages tab. However, we will enter
the following code into our script, then highlight these lines of code and run them:
# load packages to read, write and manipulate data
library(readr)
library(dplyr)
At this point, and at regular intervals, save your script, by clicking the save/disk icon in
the script panel, or using the ⌘-S keyboard shortcut.
The Value for each data frame details the number of columns, and the number of rows,
or observations, in the data.
You can remove any object from your environment by checking it in the Grid view and
clicking the broom icon.
The str function will tell you more about the columns in your data, including their data
type. Copy this code into your script and Run:
# view structure of data
str(pfizer)
If you run into any trouble importing data with readr, you may need to specify the data
types for some columns — in particular for date and time. This link explains how to set
data types for individual variables when importing data with readr.
To specify an individual column use the name of the data frame and the column name,
separated by $. Type this into your script and run:
# print values for total in pfizer data
pfizer$total
The output will be the first 10,000 values for that column.
If you need to change the data type for any column, use the following functions:
(Conversions to full dates and times can get complicated, because of timezones.
Contact me for advice if you need to work with full dates and times for your project!)
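The list of functions did not survive in these notes. Base R's standard conversion functions are sketched below; the variable names and values are invented for illustration.

```r
values <- c("10", "250", "3.5")            # character data, as often read from a file

as_num  <- as.numeric(values)              # to numeric (decimal values allowed)
as_int  <- as.integer(as_num)              # to integer (fractional part is dropped)
as_chr  <- as.character(as_num)            # back to character
as_fac  <- as.factor(c("CA", "NY", "CA"))  # to a categorical variable
as_date <- as.Date("2009-12-31")           # to a date (year-month-day text)

str(list(as_num, as_int, as_chr, as_fac, as_date))
```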
Now add the following code to your script to convert total in the pfizer data
to a numeric variable (which would allow it to hold decimal values, if we had any).
# convert total to numeric variable
pfizer$total <- as.numeric(pfizer$total)
str(pfizer)
Notice that the data type for total has now changed:
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    10087 obs. of 10 variables:
 $ org_indiv : chr "3-D MEDICAL SERVICES LLC" "AA DOCTORS, INC." "ABBO,
LILIAN MARGARITA" "ABBO, LILIAN MARGARITA" ...
 $ first_plus: chr "STEVEN BRUCE" "AAKASH MOHAN" "LILIAN MARGARITA" "LILIAN
MARGARITA" ...
 $ first_name: chr "STEVEN" "AAKASH" "LILIAN" "LILIAN" ...
 $ last_name : chr "DEITELZWEIG" "AHUJA" "ABBO" "ABBO" ...
 $ city      : chr "NEW ORLEANS" "PASO ROBLES" "MIAMI" "MIAMI" ...
 $ state     : chr "LA" "CA" "FL" "FL" ...
     total
 Min.   :      0
 1st Qu.:    191
 Median :    750
 Mean   :   3507
 3rd Qu.:   2000
 Max.   :1185466
      Join: Merging entries from two or more datasets based on common field(s), e.g.
       unique ID number, last name and first name.
There are also various functions to join data, which we will explore below.
These functions can be chained together using the operator %>% which makes the output
of one line of code the input for the next. This allows you to run through a series of
operations in logical order. I find it helpful to think of %>% as “then.”
Now add a sort to the end of the code to list the doctors in descending order by the
payments received:
  arrange(desc(total))
Notice the use of the | Boolean operator, and the brackets around that part of the query.
This ensures that this part of the query is run first. See what happens if you exclude
them.
Find the 20 doctors across the four largest states (CA, TX, FL, NY) who were paid the
most for professional advice.
ca_ny_tx_fl_prof_top20 <- pfizer %>%
  filter((state=="CA" | state == "NY" | state == "TX" | state == "FL") &
category == "Professional Advising") %>%
  arrange(desc(total)) %>%
  head(20)
Notice the use of head, which grabs a defined number of rows from the start of a data
frame. Here, it is crucial to run the sort first! See what happens if you change the order
of the last two lines.
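Why the order matters can be seen on a small example. The data frame below is a toy stand-in for the pfizer data with invented values; the sketch assumes the dplyr package is installed.

```r
library(dplyr)

# Toy stand-in for the pfizer data; all values are invented for illustration.
payments <- data.frame(
  last_name = c("A", "B", "C", "D", "E"),
  state     = c("CA", "NY", "TX", "CA", "WA"),
  total     = c(500, 1500, 200, 3000, 9000)
)

# Filter, THEN sort, THEN keep the first two rows.
top2 <- payments %>%
  filter(state == "CA" | state == "NY") %>%
  arrange(desc(total)) %>%
  head(2)
print(top2)

# Taking head() before sorting keeps whichever rows happen to come first.
wrong <- payments %>%
  filter(state == "CA" | state == "NY") %>%
  head(2) %>%
  arrange(desc(total))
print(wrong)
```

The first version returns the two largest payments; the second returns the first two filtered rows, merely sorted among themselves.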
Filter the data for all payments for running Expert-Led Forums or for Professional
Advising, and arrange alphabetically by doctor (last name, then first name)
# Filter the data for all payments for running Expert-Led Forums or for
# Professional Advising, and arrange alphabetically by doctor (last name, then
# first name)
expert_advice <- pfizer %>%
  filter(category == "Expert-Led Forums" | category == "Professional Advising") %>%
  arrange(last_name, first_name)
  arrange(last_name, first_name)
This code differs only by the ! Boolean operator. Notice that it has split the data into two,
based on categories of payment.
Filter the data for letters sent from the start of 2005 onwards
# FDA warning letters sent from the start of 2005 onwards
post2005 <- fda %>%
  filter(issued >= "2005-01-01") %>%
  arrange(issued)
Notice that operators like >= can be used for dates, as well as for numbers.
      inner_join()  returns values from both tables only where there is a match.
      left_join() returns all the values from the first-mentioned table, plus those from
       the second table that match.
      semi_join() filters the first-mentioned table to include only values that have
       matches in the second table.
      anti_join() filters the first-mentioned table to include only values that have no
       matches in the second table.
To illustrate, these joins will find doctors paid by Pfizer to run expert led forums who had
also received a warning letter from the FDA:
# join to identify doctors paid to run Expert-led forums who also received a
# warning letter
expert_warned_inner <- inner_join(pfizer, fda, by=c("first_name" =
"name_first", "last_name" = "name_last")) %>%
  filter(category=="Expert-Led Forums")
The code in by=c() defines how the join should be made. If instructions on how to join
the tables are not supplied, dplyr will look for columns with matching names, and
perform the join based on those.
The difference between the two joins above is that the first contains all of the columns
from both data frames, while the second gives only columns from the pfizer data frame.
In practice, you may wish to inner_join and then use dplyr’s select function to select the
columns that you want to retain, for example:
# as above, but select desired columns from data
expert_warned <- inner_join(pfizer, fda, by=c("first_name" = "name_first",
"last_name" = "name_last")) %>%
  filter(category=="Expert-Led Forums") %>%
  select(first_plus, last_name, city, state, total, issued)
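The four join types described in this section can be compared side by side on two small tables. The tables below are invented for illustration, and the sketch assumes the dplyr package is installed.

```r
library(dplyr)

# Toy tables; names and values are invented for illustration.
paid   <- data.frame(last_name = c("ABBO", "AHUJA", "SMITH"),
                     total     = c(750, 2000, 100))
warned <- data.frame(name_last = c("SMITH", "JONES"),
                     issued    = as.Date(c("2006-03-01", "2010-07-15")))

inner <- inner_join(paid, warned, by = c("last_name" = "name_last"))
left  <- left_join(paid, warned, by = c("last_name" = "name_last"))
semi  <- semi_join(paid, warned, by = c("last_name" = "name_last"))
anti  <- anti_join(paid, warned, by = c("last_name" = "name_last"))

print(inner)  # matched rows, columns from both tables
print(left)   # all rows of paid; issued is NA where there is no match
print(semi)   # matched rows, columns of paid only
print(anti)   # rows of paid with no match
```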
UNIT-IV
DELIVERING RESULTS:
Goal Description
This chapter explains how to share your work—even sharing it with your future self. We’ll
discuss how to use R markdown to create substantial project milestone documentation and
automate reproduction of graphs and other results. You’ll learn about using effective
comments in code, and using Git for version management and for collaboration. We’ll also
discuss deploying models as HTTP services and applications.
For some of the examples, we will use RStudio, which is an integrated development
environment (IDE) that is a product of RStudio, Inc. (and not part of R/CRAN itself).
Everything we show can be done without RStudio, but RStudio supplies a basic editor and
some single-button-press alternatives to some scripting tasks.
For our example scenario, we want to use metrics collected about the first few days of article
views to predict the long-term popularity of an article. This can be important for selling
advertising and predicting and managing revenue. To be specific: we will use measurements
taken during the first eight days of an article’s publication to predict if the article will remain
popular in the long term.
Our tasks for this chapter are to save and share our Buzz model, document the model, test the
model, and deploy the model into production.
To simulate our example scenario of predicting long term article popularity or buzz we will
use the Buzz dataset from http://ama.liglab.fr/datasets/buzz/. We’ll work with the data found
in the file TomsHardware-Relative-Sigma-500.data.txt.[1] The original supplied
documentation (TomsHardware-Relative-Sigma-500.names.txt and BuzzDataSetDoc.pdf)
tells us the Buzz data is structured as shown in
Attribute        Description
Topics           Topics include technical issues about personal computers such as brand names,
                 memory, overclocking, and so on.
Measurement      For each topic, measurement types are quantities such as the number of
types            discussions started, number of posts, number of authors, number of readers,
Times            The eight relative times are named 0 through 7 and are likely days (the original
                 variable documentation is not completely clear and the matching paper has
                 not yet been released). For each measurement type, all eight relative times are
Buzz             the ongoing rate of additional discussion activity is at least 500 events per day
                 averaged over a number of days after the observed days. Likely buzz is a future
                 unclear on this).
The first audience you’ll have to prepare documentation for is yourself and your peers. You
may need to return to previous work months later, and it may be in an urgent situation like an
important bug fix, presentation, or feature improvement. For self/peer documentation, you
want to concentrate on facts: what the stated goals were, where the data came from, and what
techniques were tried. You assume that as long as you use standard terminology or
references, the reader can figure out anything else they need to know. You want to emphasize
any surprises or exceptional issues, as they’re exactly what’s expensive to relearn. You can’t
expect to share this sort of documentation with clients, but you can later use it as a basis for
building wider documentation and presentations.
Documentation scenario: Share the ROC curve for the Buzz model
Our first task is to build a document that contains the ROC curve for the example model. We
want to be able to rebuild this document automatically if we change model or evaluation
data, so we will use R markdown to produce the document.
What is R markdown?
We’ll continue with the example from last chapter: our company (let’s call it WVCorp)
makes and sells home electronic devices and associated software and apps. WVCorp wants to
monitor topics on the company’s product forums and discussion board to identify
“about-to-buzz” issues: topics that are poised to generate a lot of interest and active
discussion. This information can be used by product and marketing teams to proactively
identify desired product features for future releases, and to quickly discover issues with
existing product features. Once we’ve successfully built a model for identifying
about-to-buzz topics on the forum, we’ll want to explain the work to the project sponsor,
and also to the product managers, marketing managers, and support engineering managers
who will be using the results of our model.
Entity        Description
WVCorp        The company you work for
eRead         WVCorp’s e-book reader
TimeWrangler  WVCorp’s time-management app
BookBits      A competitor’s e-book reader
GCal          A third-party cloud-based calendar service that TimeWrangler can integrate with
The equation of a straight line is y = mx + b, where “m” is the slope of the line and “b” is
the y-intercept.
Slope is a measure of how much a line rises or falls from one point on the line to another
point on that line. The slope has units of the y-axis divided by the units of the x-axis. To
find the slope of a line using the math method, divide the change in y by the change in x
between two points (x1, y1) and (x2, y2) on the line: m = (y2 - y1) / (x2 - x1).
The y-intercept is where the line crosses the y-axis. The y-intercept has units of the
y-axis.
Graphing y vs. x
When graphing a y vs. x graph, put the y values on the y-axis and the x values on the
x-axis. For example, when making a “distance vs. time” graph, distance (m) goes on the
y-axis and time (s) goes on the x-axis.
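As a worked illustration of the slope and y-intercept definitions above, the following R sketch computes both from two points on a line. The two points and the variable names are made up for the example.

```r
# Two hypothetical points on a "distance vs. time" graph
x1 <- 1; y1 <- 5     # time (s), distance (m)
x2 <- 4; y2 <- 11    # time (s), distance (m)

# Slope: change in y divided by change in x (units: m/s)
m <- (y2 - y1) / (x2 - x1)

# y-intercept: rearrange y = m*x + b using one of the points (units: m)
b <- y1 - m * x1

m   # 2
b   # 3
```

Here the line rises 6 m over 3 s, so the slope is 2 m/s, and it crosses the y-axis at 3 m.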
   2. Click the Variables button and select the column name which needs to be squared. The
      column name selected will then appear in the equation box. In the equation box, enter
      ^2; this will square the data in that column. Click Done.
Plot
At its simplest, you can use the plot() function to plot two numbers against
each other:
Example
Draw one point in the diagram, at position (1) and position (3):
plot(1, 3)
This booklet tells you how to use the R statistical software to carry out some simple
multivariate analyses, with a focus on principal components analysis (PCA) and linear
discriminant analysis (LDA).
This booklet assumes that the reader has some basic knowledge of multivariate analyses,
and the principal focus of the booklet is not to explain multivariate analyses, but rather to
explain how to carry out these analyses using R.
If you are new to multivariate analysis, and want to learn more about any of the concepts
presented here, I would highly recommend the Open University book “Multivariate
Analysis” (product code M249/03), available from the Open University Shop.
In the examples in this booklet, I will be using data sets from the UCI Machine Learning
Repository, http://archive.ics.uci.edu/ml.
The first thing that you will want to do to analyse your multivariate data will be to read it
into R, and to plot the data. You can read data into R using the read.table() function.
1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065
1,13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050
1,13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185
1,14.37,1.95,2.5,16.8,113,3.85,3.49,.24,2.18,7.8,.86,3.45,1480
1,13.24,2.59,2.87,21,118,2.8,2.69,.39,1.82,4.32,1.04,2.93,735
...
There is one row per wine sample. The first column contains the cultivar of a wine sample
(labelled 1, 2 or 3), and the following thirteen columns contain the concentrations of the
13 different chemicals in that sample. The columns are separated by commas.
When we read the file into R using the read.table() function, we need to use the “sep=”
argument in read.table() to tell it that the columns are separated by commas. That is, we
can read in the file using the read.table() function as follows:
In this case the data on 178 samples of wine has been read into the variable ‘wine’.
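The read.table() call with the “sep=” argument can be sketched as follows. This is only a minimal illustration: instead of the full data file, it reads the five sample rows shown above, passed inline through the text argument. The names wine_text and wine_sample are made up for the sketch.

```r
# The five sample rows shown above, in the same comma-separated layout
wine_text <- "1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065
1,13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050
1,13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185
1,14.37,1.95,2.5,16.8,113,3.85,3.49,.24,2.18,7.8,.86,3.45,1480
1,13.24,2.59,2.87,21,118,2.8,2.69,.39,1.82,4.32,1.04,2.93,735"

# sep = "," tells read.table() that the columns are separated by commas
wine_sample <- read.table(text = wine_text, sep = ",")

dim(wine_sample)     # 5 rows, 14 columns
names(wine_sample)   # no header was given, so R names the columns V1, V2, ..., V14
```

With a real file you would pass its path or URL as the first argument instead of text.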
Once you have read a multivariate data set into R, the next step is usually to make a plot
of the data.
A Matrix Scatterplot
One common way of plotting multivariate data is to make a “matrix scatterplot”, showing each
pair of variables plotted against each other. We can use the “scatterplotMatrix()” function from
the “car” R package to do this. To use this function, we first need to install the “car” R package
(for instructions on how to install an R package, see How to install an R package).
Once you have installed the “car” R package, you can load the “car” R package by typing:
> library("car")
You can then use the “scatterplotMatrix()” function to plot the multivariate data.
To use the scatterplotMatrix() function, you need to give it as its input the variables that
you want included in the plot. Say for example, that we just want to include the variables
corresponding to the concentrations of the first five chemicals. These are stored in
columns 2-6 of the variable “wine”. We can extract just these columns from the variable
“wine” by typing:
> wine[2:6]
         V2      V3     V4     V5    V6
  1   14.23    1.71   2.43   15.6   127
  2   13.20    1.78   2.14   11.2   100
  3   13.16    2.36   2.67   18.6   101
  4   14.37    1.95   2.50   16.8   113
  5   13.24    2.59   2.87   21.0   118
  ...
> scatterplotMatrix(wine[2:6])
In this matrix scatterplot, the diagonal cells show histograms of each of the variables, in
this case the concentrations of the first five chemicals (variables V2, V3, V4, V5, V6).
Each of the off-diagonal cells is a scatterplot of two of the five chemicals, for example,
the second cell in the first row is a scatterplot of V2 (y-axis) against V3 (x-axis).
If you see an interesting scatterplot for two variables in the matrix scatterplot, you may want to
plot that scatterplot in more detail, with the data points labelled by their group (their cultivar in
this case).
For example, in the matrix scatterplot above, the cell in the third column of the fourth row
down is a scatterplot of V4 (x-axis) against V5 (y-axis). If you look at this scatterplot, it
appears that there may be a positive relationship between V5 and V4.
We may therefore decide to examine the relationship between V5 and V4 more closely, by
plotting a scatterplot of these two variables, with the data points labelled by their group (their
cultivar). To plot a scatterplot of two variables, we can use the “plot” R function. The V4 and
V5 variables are stored in the columns V4 and V5 of the variable “wine”, so can be accessed by
typing wine$V4 or wine$V5. Therefore, to plot the scatterplot, we type:
If we want to label the data points by their group (the cultivar of wine here), we can use
the “text” function in R to plot some text beside every data point. In this case, the cultivar
of wine is stored in the column V1 of the variable “wine”, so we type:
If you look at the help page for the “text” function, you will see that “pos=4” will plot the
text just to the right of the symbol for a data point. The “cex=0.5” option will plot the
text at half the default size, and the “col=red” option will plot the text in red. This gives
us the following plot:
We can see from the scatterplot of V4 versus V5 that the wines from cultivar 2 seem to
have lower values of V4 compared to the wines of cultivar 1.
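The plot() and text() steps described above can be sketched as follows, using the pos, cex and col options mentioned in the text. This is a minimal illustration only: it uses the V1, V4 and V5 values of the five sample rows printed earlier (all from cultivar 1), not the full data set, and the name wine_sample is made up for the sketch.

```r
# V1 (cultivar), V4 and V5 for the five sample rows shown earlier
wine_sample <- data.frame(
  V1 = c(1, 1, 1, 1, 1),
  V4 = c(2.43, 2.14, 2.67, 2.50, 2.87),
  V5 = c(15.6, 11.2, 18.6, 16.8, 21.0)
)

# Scatterplot of V4 (x-axis) against V5 (y-axis)
plot(wine_sample$V4, wine_sample$V5)

# Label each point with its cultivar: to the right of the symbol (pos = 4),
# at half the default size (cex = 0.5), in red
text(wine_sample$V4, wine_sample$V5, wine_sample$V1,
     cex = 0.5, pos = 4, col = "red")
```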
 Matrix plots:
The aim of the package plot.matrix is to visualize a matrix as is with a heatmap. Automatic
reordering of rows and columns is only done if necessary. This differs from similar
functions such as heatmap. Additionally, it should be user-friendly and give access to a lot
of options if necessary.
Currently the package implements the S3 functions below such that you can use the
generic plot function to plot matrices as heatmaps:
The plot itself is composed of a heatmap (usually left) where colors represent matrix entries
and a key (usually right) which links the colors to the values.
First examples
library('plot.matrix')
# numeric matrix
x <- matrix(runif(35), ncol=5) # create a numeric matrix object
class(x)
#> [1] "matrix" "array"
par(mar=c(5.1, 4.1, 4.1, 4.1)) # adapt margins
plot(x)
# logical matrix
m <- matrix(runif(35)<0.5, ncol=7)
plot(m)
# text matrix
s <- matrix(sample(letters[1:10], 35, replace=TRUE), ncol=5)
plot(s)
library('plot.matrix')
library('psych')
data <- na.omit(bfi[,1:25])
fa <- fa(data, 5, rotate="varimax")
par(mar=c(5.1, 4.1, 4.1, 4.1)) # adapt margins
plot(loadings(fa), cex=0.5)
In case of a non-numeric vector, breaks must contain all values which will get a color.
If breaks is not given, then a sensible default is chosen: in case of a numeric vector it is
derived from pretty, and otherwise all unique values/levels are used.
col can be either a vector of colors or a function which generates via col(n) a set
of n colors. The default is to use heat.colors.
 ggplot2,
 ggsci,
 wesanderson,
 cetcolor,
 colormap,
 ColorPalette,
 colorr,
 colorRamps,
 dichromat,
 jcolors,
 morgenstemning,
 painter,
 paletteer,
 pals,
 Polychrome,
 qualpalr,
 randomcoloR, or
 Redmonder.
Formal parameters
y = NULL,
breaks = NULL,
col = heat.colors,
na.col = "white",
na.cell = TRUE,
na.print = TRUE,
digits = NA,
fmt.cell = NULL,
fmt.key = NULL,
polygon.cell = NULL,
polygon.key = NULL,
text.cell = NULL,
axis.key = NULL,
max.col = 70,
…)
   1. ... all parameters given here will be passed to the plot command, e.g. xlab, ylab, ….
   2. polygon.cell list of parameters for drawing polygons for matrix entries
   3. text.cell list of parameters for drawing matrix entries as text
   4. axis.col and axis.row lists of parameters for drawing the column and row axes
   5. key, axis.key, spacing.key and polygon.key to draw the key
   6. max.col to determine when text color and background color are too near
function   parameter(s)
axis       cex.axis, col.axis, col.ticks, font, font.axis, hadj, las, lwd.ticks, line, outer,
           padj, tck, tcl, tick
To get the position of the cell text, use the invisible return value of plot.matrix. The
cell.text element of that return value contains the parameters used to draw the text.
Note the double brackets: [[i,j]]
Or alternatively
Modifying a plot
Defaults
The default plot always draws a heatmap and a key where the colors and breaks are
determined by the entries of x. In case of a numeric matrix, ten colors from heat.colors are
chosen, together with eleven breaks which cover the range of entries with an equidistant grid.
In case of a non-numeric matrix, each unique element gets a color determined from heat.colors.
plot(x, breaks=range(x))
>par()
$xlog
[1] FALSE
...
$yaxt
[1] "s"
$ylbias
[1] 0.2
You will see a long list of parameters and to know what each does you can
check the help section ?par. Here we will focus on those which help us in
creating subplots.
The graphical parameter mfrow can be used to specify the number of subplots we
need.
It takes in a vector of form c(m, n) which divides the given plot into m*n array
of subplots. For example, if we need to plot two graphs side by side, we would
have m=1 and n=2. Following example illustrates this.
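max.temp <- c(22, 27, 26, 24, 23, 26, 28)   # the values printed in the original output
par(mfrow=c(1,2))                           # split the plot area into a 1*2 array
barplot(max.temp, main="Barplot")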
The same layout can be achieved with the graphical parameter mfcol.
The only difference between the two is that mfrow fills in the subplot region
row-wise while mfcol fills it column-wise.
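# Assumption: Temperature and Ozone are columns of R's built-in airquality data set
Temperature <- airquality$Temp
Ozone <- airquality$Ozone
par(mfrow=c(2,2))
hist(Temperature)
boxplot(Temperature, horizontal=TRUE)
hist(Ozone)
boxplot(Ozone, horizontal=TRUE)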
The same plot with par(mfcol = c(2, 2)) instead would look as follows. Note
that only the ordering of the subplots is different.
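To see the fill order without relying on a figure, the sketch below draws two placeholder plots under each setting and then asks par("mfg") which cell of the 2 x 2 array was drawn last. The placeholder plots and the names pos_mfrow and pos_mfcol are made up for the illustration.

```r
# mfrow: subplots are filled row by row
par(mfrow = c(2, 2))
plot(1:3, main = "first")
plot(3:1, main = "second")
pos_mfrow <- par("mfg")[1:2]   # (row, column) of the plot just drawn

# mfcol: subplots are filled column by column
par(mfcol = c(2, 2))
plot(1:3, main = "first")
plot(3:1, main = "second")
pos_mfcol <- par("mfg")[1:2]   # (row, column) of the plot just drawn

pos_mfrow   # second plot sits in row 1, column 2
pos_mfcol   # second plot sits in row 2, column 1
```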
The graphical parameter fig lets us control the location of a figure precisely in
a plot.
We need to provide the coordinates in a normalized form as c(x1, x2, y1,
y2). For example, the whole plot area would be c(0, 1, 0, 1) with (x1, y1) =
(0, 0) being the lower-left corner and (x2, y2) = (1, 1) being the upper-right
corner.
Note: we have used parameters cex to decrease the size of labels and mai to
define margins.
par(cex=0.7, mai=c(0.1,0.1,0.2,0.1))
par(fig=c(0.1,0.7,0.3,0.9))
hist(Temperature)
par(fig=c(0.8,1,0,1), new=TRUE)
boxplot(Temperature)
par(fig=c(0.1,0.67,0.1,0.25), new=TRUE)
stripchart(Temperature, method="jitter")
 Exporting graph:
Creating plots in R is all well and good but what if you want to use these plots in your
thesis, report or publication? One option is to click on the ‘Export’ button in the ‘Plots’ tab
in RStudio as we described previously. You can also export your plots from R to an
external file by writing some code in your R script. The advantage of this approach is that
you have a little more control over the output format and it also allows you to generate (or
update) plots automatically whenever you run your script. You can export your plots in
many different formats, but the most common are pdf, png, jpeg and tiff.
By default, R (and therefore RStudio) will direct any plot you create to the plot window.
To save your plot to an external file you first need to redirect your plot to a different
graphics device. You do this by using one of the many graphics device functions to start a
new graphic device. For example, to save a plot in pdf format we will use
the pdf() function. The first argument in the pdf() function is the filepath and filename of
the file we want to save (don’t forget to include the .pdf extension). Once we’ve used
the pdf() function we can then write all of the code we used to create our plot including
any graphical parameters such as setting the margins and splitting up the plotting device.
Once the code has run we need to close the pdf plotting device using
the dev.off() function.
pdf(file = 'output/my_plot.pdf')
par(mar = c(4.1, 4.4, 4.1, 1.9), xaxs="i", yaxs="i")
plot(flowers$weight, flowers$shootarea,
    xlab = "weight (g)",
    ylab = expression(paste("shoot area (cm"^"2",")")),
    xlim = c(0, 30), ylim = c(0, 200), bty = "l",
    las = 1, cex.axis = 0.8, tcl = -0.2)
dev.off()   # close the pdf device so that the file is written
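The pdf() and dev.off() workflow described above, as a minimal self-contained sketch. It assumes R's built-in cars data set in place of the flowers data, and writes to a temporary file rather than a fixed path; the name out_file is made up for the illustration.

```r
# Hypothetical output location; tempfile() avoids overwriting an existing file
out_file <- tempfile(fileext = ".pdf")

pdf(file = out_file, width = 6, height = 4)   # open the pdf device (size in inches)
plot(cars$speed, cars$dist,
     xlab = "speed (mph)", ylab = "stopping distance (ft)")
dev.off()                                     # close the device so the file is written

file.exists(out_file)   # check that the plot was saved
```

The other graphics device functions, such as png(), jpeg() and tiff(), follow the same pattern: open the device, draw the plot, then call dev.off().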
You can customize many features of your graphs (fonts, colors, axes, titles) through graphic
options.
One way is to specify these options through the par() function. If you set parameter values
here, the changes will be in effect for the rest of the session or until you change them again.
The format is par(optionname=value, optionname=value, ...)
# Set a graphical parameter using par()
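Because settings made with par() stay in effect until they are changed again, it is often convenient to save the old values and restore them afterwards. A minimal sketch, assuming the built-in mtcars data set; the parameter values and the names old, during and after are chosen only for illustration.

```r
# par() returns the previous values of the parameters it changes
old <- par(col.lab = "red", cex.axis = 0.8)

hist(mtcars$mpg)           # drawn with red axis labels and smaller axis annotation
during <- par("col.lab")   # the setting currently in effect

par(old)                   # restore the previous settings
after <- par("col.lab")    # back to the value saved in old

during
after
```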
See the help for a specific high level plotting function (e.g. plot, hist, boxplot) to
determine which graphical parameters can be set this way.
The remainder of this section describes some of the more important graphical
parameters that you can set.
      option      description
      cex         number indicating the amount by which plotting text and
                  symbols should be scaled relative to the default. 1=default,
                  1.5 is 50% larger, 0.5 is 50% smaller, etc.
      cex.axis    magnification of axis annotation relative to cex
      cex.lab     magnification of x and y labels relative to cex
      cex.main    magnification of titles relative to cex
      cex.sub     magnification of subtitles relative to cex
Plotting Symbols
Use the pch= option to specify symbols to use when plotting points. For symbols 21
through 25, specify border color (col=) and fill color (bg=).
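A short sketch of the fillable symbols 21 through 25, with a border color and a fill color. The colors, the layout and the name fill_symbols are arbitrary choices for the illustration.

```r
fill_symbols <- 21:25   # the symbols that take both a border and a fill color

plot(seq_along(fill_symbols), rep(1, length(fill_symbols)),
     pch = fill_symbols,    # one symbol per point
     col = "black",         # border color
     bg  = "orange",        # fill color
     cex = 2, xlab = "pch value", ylab = "", xaxt = "n", yaxt = "n")

# Label each point with its pch number on the x-axis
axis(1, at = seq_along(fill_symbols), labels = fill_symbols)
```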
Lines
You can change lines using the following options. This is particularly useful for
reference lines, axes, and fit lines.
option description
Colors
Options that specify colors include the following.
option description
     col         Default plotting color. Some functions (e.g. lines) accept a vector of
                 values that are recycled.
Fonts
You can easily set font size and style, but font family is a bit more complicated.
option description
      family       font family for drawing text. Standard values are "serif", "sans",
                   "mono", "symbol". Mapping is device dependent.
In Windows, mono is mapped to "TT Courier New", serif is mapped to "TT Times New
Roman", sans is mapped to "TT Arial", and symbol is mapped to "TT Symbol"
(TT = TrueType). You can add your own mappings.
plot(1:10,1:10,type="n")
windowsFonts(
     A=windowsFont("Arial Black"),
     B=windowsFont("Bookman Old Style"),
     D=windowsFont("Symbol")
)
      option  description
      mar     numerical vector indicating margin size c(bottom, left, top, right) in lines.
              default = c(5, 4, 4, 2) + 0.1
      mai     numerical vector indicating margin size c(bottom, left, top, right) in inches
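The two margin options describe the same margins in different units. As a rough check, assuming the default setting mex = 1, the sketch below sets mar in lines and compares the resulting mai with the line height par("csi") in inches. The margin values and variable names are arbitrary.

```r
margins_in_lines <- c(4, 4, 2, 1)   # bottom, left, top, right
par(mar = margins_in_lines)         # set the margins in lines of text

line_height <- par("csi")           # height of one line of text, in inches
margins_in_inches <- par("mai")     # the same margins, reported in inches

# TRUE when mai equals mar times the line height (assumes mex = 1)
same_margins <- isTRUE(all.equal(margins_in_inches,
                                 margins_in_lines * line_height))
same_margins
```

Setting either option updates the other, so you only need to specify the margins in whichever unit is more convenient.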
Going Further
See help(par) for more information on graphical parameters. The customization of plotting
axes and text annotations is covered in the next section.
***************