Computational Statistic Using R Language

The document introduces the importance of statistics across various fields, defining it as a branch of applied mathematics focused on data collection, analysis, and inference. It highlights the applications of statistics in healthcare, business, environmental science, government, sports, and research, emphasizing its role in informed decision-making. Additionally, it provides a guide on installing R and R Studio for statistical computing, detailing steps for both Windows and Ubuntu systems.

Uploaded by

Vinaya Rajput
Computational statistic using R language
(Unit 1) Introduction to statistics and R language

(Importance of statistics in various fields)

Definition and purpose of statistics

Statistics is a branch of applied mathematics that involves the collection, description,
analysis, and inference of conclusions from quantitative data. The mathematical
theories behind statistics rely heavily on differential and integral calculus, linear
algebra, and probability theory.

People who do statistics are referred to as statisticians. They’re particularly
concerned with determining how to draw reliable conclusions about large groups and
general events from the behavior and other observable characteristics of small
samples. These small samples represent a portion of the large group or a limited
number of instances of a general phenomenon.

Key Takeaways
Statistics is the study and manipulation of data, including ways to gather, review,
analyze, and draw conclusions from data.

The two major areas of statistics are descriptive and inferential statistics.

Statistics can be communicated at different levels ranging from non-numerical
descriptors (nominal-level) to numerical in reference to a zero-point (ratio-level).

Several sampling techniques can be used to compile statistical data, including
simple random, systematic, stratified, or cluster sampling.

Statistics are present in almost every department of every company and are an
integral part of investing.

Why Is Statistics Important?


Statistics is used to conduct research, evaluate outcomes, develop critical thinking,
and make informed decisions about a set of data. Statistics can be used to inquire
about almost any field of study to investigate why things happen, when they occur,
and whether reoccurrence is predictable.

Application of statistics

Healthcare: In the medical field, statistics are used for designing clinical trials,
understanding disease prevalence, and evaluating the effectiveness of
treatments. They help in making informed decisions regarding patient care and
public health policies.

Business and Economics: Companies rely on statistical analysis for market
research, quality control, and financial forecasting. It aids in understanding
consumer behavior, optimizing operations, and assessing economic trends.

Environmental Science: Statistics help in monitoring environmental changes,
assessing pollution levels, and studying the impact of human activities on the
environment. They are vital for developing sustainable practices and policies.

Government and Public Policy: Statistical data is essential for governments in
planning, resource allocation, and policy formulation. It supports decision-making
in areas like education, transportation, and social services.

Sports: In sports, statistics are used to analyze player performance, strategize
game plans, and predict outcomes. They enhance the understanding of game
dynamics and help improve team performance.

Research and Development: Statistics are fundamental in scientific research,
enabling the testing of hypotheses and interpretation of experimental data. They
support innovation and discovery across all scientific disciplines.

The application of statistics is integral to solving complex problems and making
informed decisions in a data-driven world.

Introduction to R programming language

Installing R and RStudio

How to Download R and RStudio?

To install R and RStudio on Windows, first download both installers using the
following steps.

Step 1: Set up an R environment on your local machine. You can download it
from r-project.org.

Download both applications: base R first, and then RStudio. After you click the
link to install R, a new page opens.

On that page, select Linux, macOS, or Windows according to your system, and
click the link for the platform you want to install on.

Click the download link shown on that page to start downloading base R. Then go
back to the main page and click Install RStudio to download RStudio.

Steps to Install R and RStudio

Step 1: After downloading R for the Windows platform, install it by
double-clicking the installer.

Step 2: Download RStudio from its official page. Note: It is free
of cost (under AGPL licensing).

Step 3: After downloading, you will get a file named “RStudio-1.x.xxxx.exe”
in your Downloads folder.

Step 4: Double-click the installer, and install the software.

Step 5: Test the RStudio installation.

Search for RStudio in the Windows search bar on the Taskbar.

Start the application.

Insert the following code in the console.

Input: print('Hello world!')
Output: [1] "Hello world!"

Step 6: If you see this output, your installation is successful.

Steps to Install R and RStudio on Ubuntu

Installing R and RStudio on Ubuntu follows steps similar to Windows: install R
first, then RStudio.

Using Terminal

Step 1: Open a terminal (Ctrl+Alt+T) in Ubuntu.

Step 2: Update the package cache.

sudo apt-get update

Step 3: Install the R environment.

sudo apt -y install r-base

Check the installed version of R using

R --version

Step 4: Check the R installation by starting R with the following command.

user@Ubuntu:~$ R

(Note that the R version should be 3.6+ to be able to install all packages like tm, e1071,
etc.) If there is an issue with the R version, see the end of this section.

Step 5: Exit the terminal.
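Once R starts, the version requirement from the note above can also be checked from inside the session. This is a minimal sketch, assuming the 3.6 threshold mentioned in the note; getRversion() and R.version.string are part of base R, while min_version and version_ok are illustrative names only.

```r
# Print the full version string of the running R session
print(R.version.string)

# Compare the running version against the minimum suggested in the note above
# (min_version is an illustrative name, not something R defines)
min_version <- "3.6.0"
version_ok <- getRversion() >= min_version
print(version_ok)
```

If version_ok is FALSE, the repository steps at the end of this section describe one way to get a newer R.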

Using Ubuntu software Center

Step 1: Open Ubuntu Software Center.

Step 2: Search for r-base.

Step 3: Click install.

Steps to Install RStudio on Ubuntu

Step 1: Install the gdebi package, which makes it easy to install .deb packages.

sudo add-apt-repository universe
sudo apt-get install gdebi-core

Step 2: Go to the RStudio downloads page and select the latest *.deb
package available under Ubuntu 18/Debian 10.

Step 3: Navigate to the Downloads folder on the local machine.

$ cd Downloads/
$ ls
rstudio-1.2.5042-amd64.deb

Step 4: Install using gdebi.

sudo gdebi rstudio-1.2.5042-amd64.deb

Step 5: Run RStudio from the terminal.

user@Ubuntu:~/Downloads/ $ rstudio

Alternatively, use the menu to search for RStudio.

Step 6: Test RStudio using the basic “Hello world!” command
and exit.

Input: print('Hello world!')
Output: [1] "Hello world!"

Alternatively, RStudio can be installed through Ubuntu Software as well, but the
approach above usually installs a more recent version.
If the R version that gets downloaded is not the one you need, or the previously
installed version is older, check the R version with

R --version

Then run the following commands in a terminal (Ctrl + Alt + T).

Add the key to secure APT from the CRAN package list:

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9

Add the latest CRAN repository to the repository list. (This is for Ubuntu 18.04
specifically):

sudo add-apt-repository 'deb https://cloud.r-project.org/bin/linux/ubuntu bionic-cran35/'

Update the package cache:

sudo apt update

Install the r-base package:

sudo apt install r-base

Conclusion

The R programming language, coupled with RStudio, offers a robust environment for
statistical computing and data analysis. Installing R and RStudio on platforms
like Windows and Ubuntu involves downloading the necessary files and
following a few simple steps.

Navigating the R studio interface

3.1 The console


The first pane we are going to talk about is the Console/Terminal/Jobs pane (Figure 3.2).

Figure 3.2: The R console.

It’s called the Console/Terminal/Jobs pane because it has three tabs you can click on:
Console, Terminal, and Jobs. However, we will mostly refer to it as the Console pane
and we will mostly ignore the Terminal and Jobs tabs. We aren’t ignoring them
because they aren’t useful; rather, we are ignoring them because using them isn’t
essential for anything we discuss anytime soon, and I want to keep things as simple
as possible.

The console is the most basic way to interact with R. You can type a command to R
into the console prompt (the prompt looks like “>”) and R will respond to what you
type. For example, below I’ve typed 1 + 1, hit enter, and the R console returned
the sum of the numbers 1 and 1 (Figure 3.3).

Figure 3.3: Doing some addition in the R console.

The number 1 you see in brackets before the 2 (i.e., [1]) is telling you that this line of
results starts with the first result. That fact is obvious here because there is only one
result. To make this idea clearer, let’s look at a result that spans multiple lines.

Figure 3.4: Demonstrating a function that returns multiple results.

In the screenshot above (Figure 3.4) we see a couple of new things demonstrated.

First, as promised, we have more than one line of results (or output). The first line of
results starts with a 1 in brackets (i.e., [1]), which indicates that this line of results
starts with the first result. In this case the first result is the number 2. The second line
of results starts with a 29 in brackets (i.e., [29]), which indicates that this line of
results starts with the twenty-ninth result. In this case the twenty-ninth result is the
number 58. If you count the numbers in the first line, there should be 28 – results 1
through 28. I also want to make it clear that “1” and “29” are NOT results themselves.
They are just helping us count the number of results per line.

The second new thing here that you may have noticed is our use of a function.
Functions are a BIG DEAL in R. So much so that R is often described as a functional
language. You don’t really need to know all the details of what that means; however,
you should know that, in general, everything you do in R you will do with a function.
By contrast, everything you create in R will be an object. If you wanted to make an
analogy between the R language and the English language, functions are verbs –
they do things – and objects are nouns – they are things. This may be confusing right
now. Don’t worry. It will become clearer soon.

Most functions in R are written as the function name followed by parentheses. For
example, seq(), sum(), and mean().

Question: What is the name of the function we used in the example above?
It’s the seq() function – short for sequence. Inside the function, you may notice that
there are three pairs of words, equal symbols, and numbers that are separated by
commas. They are from = 2, to = 100, and by = 2. In this case, from, to, and by are
all arguments to the seq() function. I don’t know why they are called arguments, but
as far as we are concerned, they just are. We will learn more about functions and
arguments later, but for now just know that arguments give functions the information
they need to give us the result we want.

In this case, the seq() function gives us a sequence of numbers, but we have to give
it information about where that sequence should start, where it should end, and how
large each step in between should be. Here the sequence begins with the value we
gave to the from argument (i.e., 2), ends with the value we gave to the to argument
(i.e., 100), and increases at each step by the number we gave to the by argument
(i.e., 2). So, 2, 4, 6, 8 … 100.
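Since the screenshots are not reproduced here, the two console examples described above can be typed directly. This is a small sketch using the seq() call quoted in the text; the variable name evens is only illustrative.

```r
# Simple addition at the console: R prints [1] 2
1 + 1

# Build the even numbers from 2 to 100, as in the seq() example
evens <- seq(from = 2, to = 100, by = 2)

# Printing a long vector wraps across lines; each line starts with the
# bracketed index of its first element (where it wraps depends on console width)
print(evens)

length(evens)   # number of results: 50
evens[29]       # the twenty-ninth result: 58
```

The bracketed numbers at the start of each printed line are positions, not values, which is why evens[29] returns 58 rather than 29.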
While it’s convenient, let’s also learn some programming terminology:

Arguments: Arguments always go inside the parentheses of a function and give
the function the information it needs to give us the result we want.

Pass: In programming lingo, you pass a value to a function argument. For
example, in the function call seq(from = 2, to = 100, by = 2) we could say that we
passed a value of 2 to the from argument, we passed a value of 100 to
the to argument, and we passed a value of 2 to the by argument.

Returns: Instead of saying, “the seq() function gives us a sequence of
numbers…” we could say, “the seq() function returns a sequence of numbers…”
In programming lingo, functions return one or more results.

🗒Side Note: The seq() function isn’t particularly important or noteworthy. I
essentially chose it at random to illustrate some key points. However, arguments,
passing values, and return values are extremely important concepts and we will
return to them many times.
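To make the terminology concrete, the sketch below passes the same three values to seq() in different ways and stores what the function returns. The variable names are illustrative only.

```r
# Passing values by argument name
by_name <- seq(from = 2, to = 100, by = 2)

# Passing the same values by position (from, to, by, in that order)
by_position <- seq(2, 100, 2)

# Named arguments can be given in any order
reordered <- seq(by = 2, to = 100, from = 2)

# All three calls return the same sequence
identical(by_name, by_position)
identical(by_name, reordered)
```

Naming the arguments is more typing, but it makes the call easier to read and does not depend on remembering the argument order.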

3.2 The environment pane


The second pane we are going to talk about is the Environment/History/Connections
pane (Figure 3.5). However, we will mostly refer to it as the Environment pane and we
will mostly ignore the History and Connections tabs. We aren’t ignoring them because
they aren’t useful; rather, we are ignoring them because using them isn’t essential for
anything we will discuss anytime soon, and I want to keep things as simple as
possible.

Figure 3.5: The environment pane.

The Environment pane shows you all the objects that R can currently use for data
management or analysis. In Figure 3.5, our environment is empty. Let’s create an
object and add it to our Environment.

Figure 3.6: The vector x in the global environment.

Here we see that we created a new object called x, which now appears in our Global
Environment (Figure 3.6). This gives us another great opportunity to discuss some new
concepts.

First, we created the x object in the Console by assigning the value 2 to the letter x.
We did this by typing “x” followed by a less than symbol (<), a dash symbol (-), and
the number 2. R is kind of unique in this way. I have never seen another programming
language (although I’m sure they are out there) that uses <- to assign values to
variables. By the way, <- is called the assignment operator (or assignment arrow),
and “assign” here means “make x contain 2” or “put 2 inside x.”

In many other languages you would write that as x = 2. But, for whatever reason, in R
it is <-. Unfortunately, <- is more awkward to type than =. Fortunately, RStudio
gives us a keyboard shortcut to make it easier. To type the assignment operator in
RStudio, just press Option + - (dash key) on a Mac or Alt + - (dash key) on a PC
and RStudio will insert <- complete with spaces on either side of the arrow. This may
still seem awkward at first, but you will get used to it.

🗒Side Note: A note about using the letter “x”: By convention, the letter “x” is a widely
used variable name. You will see it used a lot in example documents and online.
However, there is nothing special about the letter x. We could have just as easily used
any other letter ( a <- 2 ), word ( variable <- 2 ), or descriptive name
( my_favorite_number <- 2 ) that is allowed by R.

Second, you can see that our Global Environment now includes the object x, which
has a value of 2. In this case, we would say that x is a numeric vector of length 1
(i.e., it has one value stored in it). We will talk more about vectors and vector types
soon. For now, just notice that objects that you can manipulate or analyze in R will
appear in your Global Environment.

⚠️Warning: R is a case sensitive language. That means that uppercase x (X) and
lowercase x (x) are different things to R. So, if you assign 2 to lowercase x ( x <- 2 )
and then later ask R to tell you what number you stored in uppercase X, you will get an
error ( Error: object 'X' not found ).
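The assignment and the case-sensitivity warning can be tried together in the console. This is a minimal sketch, assuming a fresh session in which nothing named X has been created; has_lower and has_upper are illustrative names.

```r
# Assign 2 to lowercase x with the assignment operator
x <- 2
x            # prints the stored value: [1] 2

# x is a numeric vector of length 1
class(x)     # "numeric"
length(x)    # 1

# ls() lists the objects in the current environment, much like the
# Environment pane does. Only lowercase x is there.
has_lower <- "x" %in% ls()
has_upper <- "X" %in% ls()
print(has_lower)   # TRUE
print(has_upper)   # FALSE
```

Typing X at this point would produce the "object 'X' not found" error described in the warning.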

3.3 The files pane


Next, let’s talk about the Files/Plots/Packages/Help/Viewer pane (that’s a
mouthful) (Figure 3.7).

Figure 3.7: The Files/Plots/Packages/Help/Viewer pane.

Again, some of these tabs are more applicable for us than others. For us, the files tab
and the help tab will probably be the most useful. You can think of the files tab as a
mini Finder window (for Mac) or a mini File Explorer window (for PC). The help tab is
also extremely useful once you get acclimated to it.

Figure 3.8: The help tab.

For example, in the screenshot above (Figure 3.8) we typed seq into the search bar. The
help pane then shows us a page of documentation for the seq() function. The
documentation includes a brief description of what the function does, outlines all the
arguments the seq() function recognizes, and, if you scroll down, gives examples of
using the seq() function. Admittedly, this help documentation can seem a little like
reading Greek (assuming you don’t speak Greek) at first. But, you will get more
comfortable using it with practice. I hated the help documentation when I was
learning R. Now, I use it all the time.
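The same documentation can be reached from the console instead of the search bar. This is a short sketch using base R functions; arg_names is an illustrative variable name, and the help call is wrapped in interactive() so it only opens a page in a live session.

```r
# Open the documentation page for seq(), the same page the Help tab shows
if (interactive()) {
  help("seq")
}

# Shorthand for the same thing when typing at the console:
# ?seq

# List the argument names that the default seq() method recognizes
arg_names <- names(formals(seq.default))
print(arg_names)
```

The printed names include from, to, and by, the three arguments used in the earlier example.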

3.4 The source pane


There is actually a fourth pane available in RStudio. If you click on the icon shown
below (Figure 3.9) you will get a dropdown box with a list of files you can create.

Figure 3.9: Click the new source file icon.

If you click any of these options, a new pane will appear. I will arbitrarily pick the first
option – R Script.

Figure 3.10: New source file options.

When I do, a new pane appears. It’s called the source pane. In this case, the source
pane contains an untitled R Script. We won’t get into the details now because I don’t
want to overwhelm you, but soon you will do the majority of your R programming in
the source pane.

Figure 3.11: A blank R script in the source pane.

3.5 RStudio preferences


Finally, we’re going to recommend that you change a few settings in RStudio before
we move on. Start by clicking Tools, and then Global Options in RStudio’s menu bar,
which probably runs horizontally across the top of your computer’s screen.

Figure 3.12: Select the preferences menu on Mac.

In the General tab, we recommend turning off the Restore .RData into workspace at
startup option. We also recommend setting the Save workspace to .RData on exit
dropdown to Never. Finally, we recommend turning off the Always save history (even
when not saving .RData) option.

Figure 3.13: General options tab.

We changed our editor theme to Twilight in the Appearance tab. We aren’t necessarily
recommending that you change your theme – this is entirely personal preference –
we’re just letting you know why our screenshots will look different from here on out.

Figure 3.14: Appearance tab.

R packages and the CRAN repository

CRAN (Comprehensive R Archive Network) is the primary repository for R packages.
It hosts thousands of packages that users can download and install to extend the
functionality of the R programming language. These packages are created by R
users and developers from around the world and cover a wide range of topics and
applications.

CRAN hosts a diverse collection of R packages and related software, making it a
cornerstone for statisticians, data scientists, and researchers worldwide. This
section looks at the significance of CRAN and its role in the growth of the R
programming language.

Understanding CRAN in Simple Terms and its Purpose

CRAN is a network of servers storing R packages.

R is an open-source programming language for statistical computing.

The packages on CRAN enhance data analysis capabilities.

CRAN serves as the primary platform for sharing packages with the R community.

The Importance of CRAN


1. Central Hub: CRAN acts as the central hub for R packages, a place where users
can access, download, and install packages without searching across various
websites or sources. This streamlines the process of extending R's capabilities.

2. Quality Assurance: Packages submitted to CRAN go through a review process
that includes automated checks. This examination is intended to ensure that
packages meet CRAN's standards, including documentation and adherence to
CRAN's guidelines. As a result, users can have reasonable confidence in the
quality and dependability of packages on CRAN.

3. Version Management: CRAN maintains a history of package versions, allowing
users to access and install specific versions of packages. This feature supports
the reproducibility of data analysis and research, helping code keep performing
as intended even as packages evolve over time.

4. Diverse Selection of Packages: CRAN hosts a vast array of packages covering a
wide range of domains. From statistical modeling and machine learning to data
visualization and manipulation, CRAN's repository caters to the needs of
beginners and experienced users alike. Whatever your data analysis
requirements, you're likely to find a package on CRAN that streamlines your
workflow.

5. Community Collaboration: Beyond being a distribution platform for packages,
CRAN supports a community of R developers, maintainers, and users.
Developers can collaborate on packages, share their expertise, and contribute to
the ongoing enrichment of R's ecosystem. Users can seek help, report issues,
and engage in discussions, which supports the entire community.

Install Packages from CRAN

To access CRAN and install packages from it, you can use
the install.packages() function in R.

Syntax to install a package from CRAN:

install.packages("package_name")

For example, to install the ggplot2 package from CRAN, you would run:

R
# Installing a package with the help of CRAN
install.packages("ggplot2")

This will download and install the ggplot2 package from CRAN, along with any
dependencies that it requires. Once the package is installed, you can load it into your
R session using the library() function:

R
# Load the installed package into the current session
library(ggplot2)

You can also browse the CRAN website (https://cran.r-project.org/) to search for
packages and read their documentation. The website provides information on how to
install packages, as well as news and updates about the R community.

Contributing to CRAN involves submitting new R packages or updates for review.
Developers must adhere to guidelines ensuring proper documentation and
functionality testing. For example, a developer creating a data visualization
package can share it with the R community through CRAN after meeting
the submission requirements.
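In scripts it is common to install a package only when it is missing, so the download is not repeated on every run. The sketch below is one way to do that; ensure_package is a hypothetical helper written for this example, not a function provided by R or CRAN, while requireNamespace(), install.packages(), and packageVersion() are standard R functions.

```r
# Hypothetical helper: install a package from CRAN only if it is missing
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # downloads from the configured CRAN mirror
  }
  # Report whether the package is available after the attempt
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so this call returns TRUE without downloading anything
ensure_package("stats")

# For a CRAN package such as ggplot2 (needs an internet connection):
# ensure_package("ggplot2")
# library(ggplot2)

# Check which version of an installed package you have
packageVersion("stats")
```

Recording the output of packageVersion() alongside an analysis is a simple way to note which package versions produced a result.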

CRAN Guidelines and Package Maintenance

CRAN maintains strict policies and guidelines for package submissions. These
guidelines cover aspects like package structure, documentation standards, and
code quality. By adhering to these policies, developers ensure the quality and
consistency of packages available on CRAN.

Package maintainers play a crucial role in updating and maintaining packages on
CRAN. They need to follow guidelines for version numbering, changelog
documentation, and responding to user feedback promptly. For example, a
maintainer releasing updates to fix bugs ensures that the package remains
reliable and functional for users.

Conclusion

CRAN serves as a vital hub for the R programming ecosystem, facilitating the
distribution and maintenance of R packages. By following CRAN's guidelines,
developers contribute to a repository of high-quality packages. Task Views simplify
package discovery, while maintaining and updating packages ensures users have
access to reliable resources. CRAN's collaborative approach continues to drive
innovation and growth within the R community.

Basic Syntax in R Programming


R is one of the most popular languages for statistical computing and data
analysis, supported by over 10,000 free packages in the CRAN repository. Like
any other programming language, R has a specific syntax which is important to
understand if you want to make use of its powerful features. This article assumes R is
already installed on your machine. We will be using RStudio, but we can also use the R
command prompt by typing the following command in the command line.

$ R

This will launch the interpreter. Now let’s write a basic Hello World program to get
started.

We can see that “Hello, World!” is being printed on the console. Now we can do the
same thing using print(), which prints to the console. Usually, we will write our code
inside scripts, which are called R scripts. To create one, write the code in a file, save
it as myFile.R, and then run it from the command line by writing:

Rscript myFile.R

Output:

[1] "Hello, World!"
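A minimal sketch of a Hello World program consistent with the output shown above follows. The exact original listing is an assumption; only the printed result is taken from the text.

```r
# Typing a string at the prompt auto-prints it
"Hello, World!"

# print() displays the same value explicitly; it also works in places
# where auto-printing does not apply, such as inside a loop
print("Hello, World!")
```

Saving the print() line in a file named myFile.R and running Rscript myFile.R from a terminal should give the output shown above.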

Syntax of R program
A program in R is made up of three things: Variables, Comments, and Keywords.
Variables are used to store the data, Comments are used to improve code readability,
and Keywords are reserved words that hold a specific meaning to the interpreter.

Variables in R
Previously, we wrote all our code inside print(), but that gives us no way to refer to
the values later in order to perform further operations. This problem can be solved by
using variables, which, as in any other programming language, are names given to
reserved memory locations that can store any type of data. In R, assignment can
be written in three ways:

1. = (Simple Assignment)

2. <- (Leftward Assignment)

3. -> (Rightward Assignment)

Example:

Output:

"Simple Assignment"
"Leftward Assignment!"
"Rightward Assignment"

The rightward assignment is less common and can be confusing for some
programmers, so it is generally recommended to use the <- or = operator for
assigning values in R.
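The three assignment forms can be sketched as follows. The variable names var1, var2, and var3 are assumptions chosen for illustration; the strings match the output listed above.

```r
# Simple assignment with =
var1 = "Simple Assignment"

# Leftward assignment with <-
var2 <- "Leftward Assignment!"

# Rightward assignment with -> (the value is on the left, the name on the right)
"Rightward Assignment" -> var3

print(var1)
print(var2)
print(var3)
```

All three forms create an ordinary variable; they differ only in how the statement is written.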

Comments in R
Comments are a way to improve your code’s readability. They are meant only for the
reader, so the interpreter ignores them. R only has single-line comments, but
multi-line comments can be imitated with a simple trick. Single-line comments are
written by placing # at the beginning of the comment.

Example:

Output:

[1] "This is fun!"

From the above output, we can see that both comments were ignored by the
interpreter.
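A sketch of both kinds of comment follows. The multi-line trick shown here, wrapping text in an if (FALSE) block, is one common approach and is an assumption about what the original example used.

```r
# This is a single-line comment; R ignores everything after the # symbol

# One way to comment out several lines is to start each of them with #.
# Another trick is to wrap text in a block that never runs:
if (FALSE) {
  "This text sits inside a block that is never executed,
   so it behaves like a multi-line comment."
}

print("This is fun!")  # a comment can also follow code on the same line
```

Note that the contents of the if (FALSE) block are still parsed, so they must be valid R (here, a single string).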

Keywords in R
Keywords are words reserved by the language because they have a special meaning,
so a keyword can’t be used as a variable name, function name, etc. We can view these
keywords by using either help(reserved) or ?reserved.

if, else, repeat, while, function, for, in, next and break are used for control-flow
statements and declaring user-defined functions.

The remaining ones are used as constants; for example, TRUE and FALSE are the
boolean constants.

NaN represents a Not a Number value, and NULL represents an undefined (empty)
value.

Inf is used for infinity values.
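The special constants and the "cannot be used as a variable name" rule can be checked directly. This is a small sketch in which the variable names are illustrative; the assignment to TRUE is parsed from text so that the resulting error can be caught instead of stopping the script.

```r
# Special constants mentioned above
not_a_number <- 0 / 0      # NaN: result of an undefined numeric operation
infinite     <- 1 / 0      # Inf: positive infinity
nothing      <- NULL       # NULL: the empty / undefined object

is.nan(not_a_number)
is.infinite(infinite)
is.null(nothing)

# Reserved words cannot be used as variable names: assigning to one fails
result <- tryCatch(
  {
    eval(parse(text = "TRUE <- 5"))
    "assignment worked"
  },
  error = function(e) "assignment failed"
)
print(result)
```

Here result ends up as "assignment failed", because R refuses to overwrite the reserved constant TRUE.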

Data Types in R Programming Language

Each variable in R has an associated data type. Each R data type requires a different
amount of memory and supports specific operations that can be performed on it.

Data Types in R are:

1. numeric – (3, 6.7, 121)

2. integer – (2L, 42L; where ‘L’ declares this as an integer)

3. logical – (TRUE, FALSE)

4. complex – (7 + 5i; where ‘i’ is the imaginary unit)

5. character – (“a”, “B”, “c is third”, “69”)

6. raw – (as.raw(55); as.raw() converts a value to raw bytes)

R has the following basic data types. The following table shows each data type
and the values it can take.

Basic Data Type | Values | Example
Numeric | Set of all real numbers | numeric_value <- 3.14
Integer | Set of all integers, Z | integer_value <- 42L
Logical | TRUE and FALSE | logical_value <- TRUE
Complex | Set of complex numbers | complex_value <- 1 + 2i
Character | “a”, “b”, “c”, …, “@”, “#”, “$”, …, “1”, “2”, … etc. | character_value <- "Hello Geeks"
Raw | as.raw() | single_raw <- as.raw(255)
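The table can be checked in one go by storing one value of each type and asking R for its class. This is a short sketch that reuses the example values from the table; the list name values and the result name types are illustrative.

```r
# One example value per basic data type
values <- list(
  numeric   = 3.14,
  integer   = 42L,
  logical   = TRUE,
  complex   = 1 + 2i,
  character = "Hello Geeks",
  raw       = as.raw(255)
)

# class() reports the data type of each value
types <- sapply(values, class)
print(types)
```

Each entry in types matches the name it was stored under, for example "integer" for 42L and "raw" for as.raw(255).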

1. Numeric Data type in R


Decimal values are called numeric in R. Numeric is the default data type for numbers.
If you assign a decimal value to a variable x as follows, x will be of numeric type.

Real numbers with a decimal point are represented using this data type. R stores
numeric values as double-precision floating-point numbers.

R
# A simple R program

# to illustrate Numeric data type

# Assign a decimal value to x

x = 5.6

# print the class name of variable

print(class(x))

# print the type of variable

print(typeof(x))

Output

[1] "numeric"
[1] "double"

Even if an integer is assigned to a variable y, it is still saved as a numeric value.

R
# A simple R program

# to illustrate Numeric data type

# Assign an integer value to y

y = 5

# print the class name of variable

print(class(y))

# print the type of variable

print(typeof(y))

Output

[1] "numeric"
[1] "double"

When R stores a number written without the L suffix in a variable, it stores it as a
“double”, that is, a double-precision floating-point value.

This means that a value such as 5 here is stored as the double 5 rather than as an
integer, with a type of double and a class of numeric. That y is not an integer can be
confirmed with the is.integer() function.

R
# A simple R program

# to illustrate Numeric data type

# Assign a whole number to y

y = 5

# is y an integer?

print(is.integer(y))

Output

[1] FALSE

2. Integer Data type in R


R supports an integer data type for whole numbers.

You can create an integer, or convert a value into one, using
the as.integer() function.
You can also append a capital ‘L’ as a suffix to denote that a particular value is
of the integer data type.

R
# A simple R program

# to illustrate integer data type

# Create an integer value

x = as.integer(5)

# print the class name of x

print(class(x))

# print the type of x

print(typeof(x))

# Declare an integer by appending an L suffix.

y = 5L

# print the class name of y

print(class(y))

# print the type of y

print(typeof(y))

Output

[1] "integer"
[1] "integer"
[1] "integer"
[1] "integer"
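
The fixed range of the integer type can be made concrete with a short sketch. The variable names below are illustrative and not part of the original example; .Machine$integer.max and as.integer() are standard base R.

```r
# Largest value the integer type can hold (32-bit signed)
max_int <- .Machine$integer.max
print(max_int)

# as.integer() drops the fractional part instead of rounding
truncated <- as.integer(5.9)
print(truncated)

# A value outside the integer range cannot be converted and becomes NA
# (R also emits a warning, suppressed here to keep the output clean)
too_big <- suppressWarnings(as.integer(3e9))
print(too_big)
```

When a whole number might exceed this range, keeping it as a numeric (double) value is usually the safer choice.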

3. Logical Data type in R


R has logical data types that take either a value of true or false.
A logical value is often created via a comparison between variables.

Boolean values, which have two possible values, are represented by this R data type:
FALSE or TRUE

R
# A simple R program

# to illustrate logical data type

# Sample values

x = 4

y = 3

# Comparing two values

z = x > y

# print the logical value

print(z)

# print the class name of z

print(class(z))

# print the type of z

print(typeof(z))

Output

[1] TRUE
[1] "logical"
[1] "logical"
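
Because comparisons work element by element, logical vectors are a convenient way to count matches. The sketch below uses a made-up vector of ages (an assumption for illustration only) and relies on TRUE being treated as 1 and FALSE as 0 in arithmetic.

```r
# Hypothetical ages, used only for illustration
ages <- c(18, 25, 31, 40, 22)

# A comparison on a vector yields one logical value per element
over_24 <- ages > 24
print(over_24)

# TRUE counts as 1 and FALSE as 0 in arithmetic
n_over <- sum(over_24)       # how many ages exceed 24
share_over <- mean(over_24)  # proportion of ages that exceed 24
print(n_over)
print(share_over)
```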

4. Complex Data type in R

R supports a complex data type for complex numbers, that is, numbers with an
imaginary component.

R
# A simple R program

# to illustrate complex data type

# Assign a complex value to x

x = 4 + 3i

# print the class name of x

print(class(x))

# print the type of x

print(typeof(x))

Output

[1] "complex"
[1] "complex"

5. Character Data type in R


R supports a character data type, which stores character values or strings. Strings in
R can contain letters, numbers, and symbols.
The easiest way to mark a value as character type is to wrap the value inside single or
double quotes.

R
# A simple R program

# to illustrate character data type

# Assign a character value to char

char = "Geeksforgeeks"

# print the class name of char

print(class(char))

# print the type of char

print(typeof(char))

Output

[1] "character"
[1] "character"

There are several tasks that can be done using R data types. Let’s understand each
task with its action and the syntax for doing the task along with an R code to illustrate
the task.

6. Raw data type in R


To store and work with data at the byte level in R, use the raw data type. It holds a
sequence of unprocessed bytes and enables low-level operations on binary data. Here
is an example of R’s raw data type:

R
# Create a raw vector

x <- as.raw(c(0x1, 0x2, 0x3, 0x4, 0x5))

print(x)

Output

[1] 01 02 03 04 05

Five elements make up this raw vector x, each of which represents a raw byte value.
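
Raw vectors also appear when text is converted to bytes. As a minimal sketch (the string "Hi" and the variable names are chosen only for illustration), the base functions charToRaw() and rawToChar() convert between a string and its bytes:

```r
# Convert a short string to its underlying bytes
bytes <- charToRaw("Hi")
print(bytes)   # printed as hexadecimal byte values

# View the same bytes as ordinary integers
byte_codes <- as.integer(bytes)
print(byte_codes)

# Convert the bytes back to a character string
restored <- rawToChar(bytes)
print(restored)
```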

Find Data Type of an Object in R


To find the data type of an object, pass the object as an argument to the class()
function.

Syntax

class(object)

Example

R
# A simple R program

# to find data type of an object

# Logical

print(class(TRUE))

# Integer

print(class(3L))

# Numeric

print(class(10.5))

# Complex

print(class(1+2i))

# Character

print(class("12-04-2020"))

Output

[1] "logical"
[1] "integer"
[1] "numeric"
[1] "complex"
[1] "character"

Type verification
You can verify the data type of an object if you are in doubt about its data type.
To do that, call the function formed by adding the prefix “is.” to the data type name.

Syntax:

is.data_type(object)

Example

R
# A simple R program

# Verify if an object is of a certain datatype

# Logical

print(is.logical(TRUE))

# Integer

print(is.integer(3L))

# Numeric

print(is.numeric(10.5))

# Complex

print(is.complex(1+2i))

# Character

print(is.character("12-04-2020"))

print(is.integer("a"))

print(is.numeric(2+3i))

Output

[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
[1] FALSE
[1] FALSE

Coerce or Convert the Data Type of an Object to Another

The process of altering the data type of an object to another type is referred to as
coercion or data type conversion. This is a common operation in many programming
languages that is used to alter data and perform various computations.

When coercion is required, the language normally performs it automatically, whereas
conversion is performed explicitly by the programmer.
Coercion can manifest itself in a variety of ways, depending on the context in which it
is employed.
In some circumstances, the coercion is implicit, which means that R will change one
type to another without the programmer having to expressly request it.

Syntax

as.data_type(object)

Note: Not all coercions are possible; an attempt at an impossible coercion returns NA,
usually with a warning.

Example

R
# A simple R program

# convert data type of an object to another

# Logical

print(as.numeric(TRUE))

# Integer

print(as.complex(3L))

# Numeric

print(as.logical(10.5))

# Complex

print(as.character(1+2i))

# Not possible: returns NA with a warning

print(as.numeric("12-04-2020"))

Output

[1] 1
[1] 3+0i
[1] TRUE
[1] "1+2i"
[1] NA
Warning message:
In print(as.numeric("12-04-2020")) : NAs introduced by coercion
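
Implicit coercion is easiest to see when values of different types are combined with c(). The sketch below uses illustrative variable names; it shows that R promotes every element to the most general type present (logical, then integer, then double, then character).

```r
# integer + double -> double
mixed_num <- c(1L, 2.5)
print(typeof(mixed_num))

# logical + integer -> integer (TRUE becomes 1)
mixed_lgl <- c(TRUE, 3L)
print(typeof(mixed_lgl))

# anything + character -> character
mixed_chr <- c(1, "a", TRUE)
print(mixed_chr)
```

No warning is raised in these cases, so it is worth checking typeof() when a vector is built from mixed inputs.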

Date and time


date() function in R Language is used to return the current date and time.

Syntax: date()
Parameters: Does not accept any parameters

Example:
# R program to illustrate

# date function

# Calling date() function to

# return current date and time

date()

Output:

[1] "Thu Jun 11 04:29:39 2020"

Sys.Date() Function
Sys.Date() function is used to return the system’s date.

Syntax: Sys.Date()
Parameters: Does not accept any parameters

Example:
# R program to illustrate

# Sys.Date function

# Calling Sys.Date() function to

# return the system's date

Sys.Date()

Output:

[1] "2020-06-11"

Sys.time()
Sys.time() function is used to return the system’s date and time.

Syntax: Sys.time()
Parameters: Does not accept any parameters

Example:
# R program to illustrate

# Sys.time function

# Calling Sys.time() function to

# return the system's date and time

Sys.time()

Output:

[1] "2020-06-11 05:35:49 UTC"

Sys.timezone()
Sys.timezone() function is used to return the current time zone.

Syntax: Sys.timezone()
Parameters: Does not accept any parameters

Example:
# R program to illustrate

# Sys.timezone function

# Calling Sys.timezone() function to

# return the current time zone

Sys.timezone()

Output:

[1] "Etc/UTC"
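
Besides reading the current date, it is often necessary to turn text into a date. The following is a minimal sketch using base R's as.Date() and format(); it assumes the string "12-04-2020" is in day-month-year order, which is an assumption made for illustration.

```r
# Parse a day-month-year string (the format is an assumption about the text)
d <- as.Date("12-04-2020", format = "%d-%m-%Y")
print(d)

# Format the date in a different layout
iso_slash <- format(d, "%Y/%m/%d")
print(iso_slash)

# Subtracting two dates gives the number of days between them
days_between <- as.numeric(as.Date("2020-06-11") - d)
print(days_between)
```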

Data Structures in R Programming

A data structure is a particular way of organizing data in a computer so that it can be
used effectively. The idea is to reduce the space and time complexities of different
tasks. Data structures in R programming are tools for holding multiple values.

R’s base data structures are often organized by their dimensionality (1D, 2D, or nD)
and whether they’re homogeneous (all elements must be of the identical type) or
heterogeneous (the elements may be of different types). This gives rise to the data
structures most frequently used in data analysis; tibbles, listed last below, come from
the tidyverse rather than base R.

The most essential data structures used in R include:

Vectors

Lists

Dataframes

Matrices

Arrays

Factors

Tibbles

Vectors
A vector is an ordered collection of basic data types of a given length. The key
requirement is that all the elements of a vector must be of the identical data type, i.e.,
vectors are homogeneous data structures. Vectors are one-dimensional data structures.

Example:
R
# R program to illustrate Vector
# Vectors (ordered collection of same data type)
X = c(1, 3, 5, 7, 8)

# Printing those elements in console
print(X)

Output:

[1] 1 3 5 7 8
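
As a brief sketch of how such a vector is typically used (the variable names other than X are illustrative), elements can be selected by position or by condition, and arithmetic is applied to every element at once:

```r
X <- c(1, 3, 5, 7, 8)

# Indexing is 1-based: this selects the second element
second <- X[2]
print(second)

# A logical condition selects the matching elements
large <- X[X > 4]
print(large)

# Arithmetic is vectorised: every element is doubled
doubled <- X * 2
print(doubled)

# Number of elements in the vector
n <- length(X)
print(n)
```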

Lists
A list is a generic object consisting of an ordered collection of objects. Lists are
heterogeneous data structures. These are also one-dimensional data structures. A list
can be a list of vectors, list of matrices, a list of characters and a list of functions and
so on.
Example:

R
# R program to illustrate a List
# The first attribute is a numeric vector containing the employee IDs,
# created using the c() command
empId = c(1, 2, 3, 4)

# The second attribute is the employee name, a character vector
empName = c("Debi", "Sandeep", "Subham", "Shiba")

# The third attribute is the number of employees, a single numeric value
numberOfEmp = 4

# Combine these three different data types into a list
# containing the details of employees, using the list() command
empList = list(empId, empName, numberOfEmp)

print(empList)

Output:

[[1]]
[1] 1 2 3 4

[[2]]
[1] "Debi" "Sandeep" "Subham" "Shiba"

[[3]]
[1] 4
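
List elements can also be given names, which makes them easier to retrieve. The sketch below rebuilds the employee list with names (id, name and count are names chosen here for illustration) and contrasts the $, [[ ]] and [ ] accessors.

```r
# The same employee details, this time with named elements
empList <- list(
  id = c(1, 2, 3, 4),
  name = c("Debi", "Sandeep", "Subham", "Shiba"),
  count = 4
)

# $ and [[ ]] return the element itself
second_name <- empList$name[2]
ids <- empList[["id"]]
print(second_name)
print(class(ids))

# Single brackets return a shorter list, not the element
sub_list <- empList["id"]
print(class(sub_list))
```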

Dataframes
Dataframes are generic data objects of R which are used to store tabular data.
Dataframes are the most popular data objects in R programming because tabular data
is a familiar form to work with. They are two-dimensional, heterogeneous data
structures. Internally, they are lists of vectors of equal length.

Data frames have the following constraints placed upon them:

A data-frame must have column names and every row should have a unique
name.

Each column must have the identical number of items.

Each item in a single column must be of the same data type.

Different columns may have different data types.

To create a data frame we use the data.frame() function.

Example:
R
# R program to illustrate dataframe
# A character vector
Name = c("Amiya", "Raj", "Asish")

# A character vector
Language = c("R", "Python", "Java")

# A numeric vector
Age = c(22, 25, 45)

# To create a dataframe, pass each of the vectors we have created
# as arguments to the function data.frame()
df = data.frame(Name, Language, Age)

print(df)

Output:

   Name Language Age
1 Amiya        R  22
2   Raj   Python  25
3 Asish     Java  45
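
Once a data frame exists, its structure can be inspected and its rows filtered. This is a small sketch using the same three people; the variable older and the age threshold of 23 are illustrative choices.

```r
df <- data.frame(
  Name = c("Amiya", "Raj", "Asish"),
  Language = c("R", "Python", "Java"),
  Age = c(22, 25, 45)
)

# Show the type of every column
str(df)

# Keep only the rows where Age is above 23 (all columns retained)
older <- df[df$Age > 23, ]
print(older)

# A single column is an ordinary vector
print(mean(df$Age))
```

The row filter goes before the comma and the column selection after it; leaving the column part empty keeps every column.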

Matrices
A matrix is a rectangular arrangement of numbers in rows and columns. In a matrix,
as we know rows are the ones that run horizontally and columns are the ones that run
vertically. Matrices are two-dimensional, homogeneous data structures.

Now, let’s see how to create a matrix in R. To create a matrix in R you need to use the
function called matrix(). The arguments to matrix() are the vector of elements, the
number of rows (nrow), and the number of columns (ncol). An important point to
remember is that, by default, matrices are filled in column-wise order; pass
byrow = TRUE to fill them row by row.

Example:

R
# R program to illustrate a matrix
A = matrix(

  # Taking sequence of elements
  c(1, 2, 3, 4, 5, 6, 7, 8, 9),

  # No of rows and columns
  nrow = 3, ncol = 3,

  # By default matrices are in column-wise order,
  # so this parameter decides how to arrange the matrix
  byrow = TRUE
)

print(A)

Output:

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
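
To make the effect of the fill order concrete, the sketch below builds the same nine values both ways (by_col, by_row and same are illustrative names) and compares one cell of each.

```r
# Default: values fill the matrix column by column
by_col <- matrix(1:9, nrow = 3, ncol = 3)
print(by_col)

# byrow = TRUE: values fill the matrix row by row
by_row <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)

# Row 1, column 2 differs between the two layouts
print(by_col[1, 2])
print(by_row[1, 2])

# Transposing the column-wise matrix gives the row-wise one
same <- identical(t(by_col), by_row)
print(same)
```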

Arrays
Arrays are the R data objects which store the data in more than two dimensions.
Arrays are n-dimensional data structures. For example, if we create an array of
dimensions (2, 3, 3) then it creates 3 rectangular matrices each with 2 rows and 3
columns. They are homogeneous data structures.
Now, let’s see how to create arrays in R. To create an array in R you need to use the
function called array(). Its arguments are a vector of elements and a vector (dim)
containing the dimensions of the array.

Example:
R
# R program to illustrate an array
A = array(

  # Taking sequence of elements
  c(1, 2, 3, 4, 5, 6, 7, 8),

  # Creating two rectangular matrices,
  # each with two rows and two columns
  dim = c(2, 2, 2)
)

print(A)

Output:

, , 1

     [,1] [,2]
[1,]    1    3
[2,]    2    4

, , 2

     [,1] [,2]
[1,]    5    7
[2,]    6    8
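
Elements of an array are addressed with one index per dimension. As a minimal sketch on the same 2 x 2 x 2 array (elem and first_slice are illustrative names):

```r
A <- array(1:8, dim = c(2, 2, 2))

# Dimensions: rows, columns, slices
print(dim(A))

# Row 1, column 2 of the second slice
elem <- A[1, 2, 2]
print(elem)

# Leaving an index empty keeps that whole dimension:
# this extracts the first 2 x 2 slice as a matrix
first_slice <- A[, , 1]
print(first_slice)
```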

Factors
Factors are data objects used to categorize data and store it as levels. They are useful
for storing categorical data, and can be built from both strings and integers. They suit
columns with a small set of unique values, such as “TRUE” or “FALSE”, or “MALE” or
“FEMALE”. They are useful in data analysis for statistical modeling.

Now, let’s see how to create factors in R. To create a factor in R you need to use the
function called factor(). The argument to this factor() is the vector.
Example:

R
# R program to illustrate factors
# Creating factor using factor()
fac = factor(c("Male", "Female", "Male",
               "Male", "Female", "Male", "Female"))

print(fac)

Output:

[1] Male   Female Male   Male   Female Male   Female
Levels: Female Male
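
A factor stores each value as an integer code that points into its set of levels. The sketch below, using the same data with illustrative variable names, shows the levels, the frequency of each level, and the underlying codes.

```r
fac <- factor(c("Male", "Female", "Male",
                "Male", "Female", "Male", "Female"))

# Levels are the distinct categories, sorted alphabetically by default
lv <- levels(fac)
print(lv)

# table() counts how often each level occurs
counts <- table(fac)
print(counts)

# Internally each value is an integer index into the levels
codes <- as.integer(fac)
print(codes)
```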

Tibbles
Tibbles are an enhanced version of data frames in R, part of the tidyverse. They offer
improved printing, stricter column types, consistent subsetting behavior, and allow
variables to be referred to as objects. Tibbles provide a modern, user-friendly
approach to tabular data in R.

Now, let’s see how we can create a tibble in R. To create tibbles in R we can use
the tibble function from the tibble package, which is part of the tidyverse.
Example:

R
# Load the tibble package
library(tibble)

# Create a tibble with three columns: name, age, and city
my_data <- tibble(
  name = c("Sandeep", "Amit", "Aman"),
  age = c(25, 30, 35),
  city = c("Pune", "Jaipur", "Delhi")
)

# Print the tibble
print(my_data)

Output:

  name      age city
  <chr>   <dbl> <chr>
1 Sandeep    25 Pune
2 Amit       30 Jaipur
3 Aman       35 Delhi

Control Statements in R Programming

Control statements are expressions used to control the execution and flow of the
program based on the conditions provided in the statements. These structures are
used to make a decision after assessing a variable. This section discusses all the
control statements with examples.

In R programming, there are 8 types of control statements as follows:

if condition

if-else condition

for loop

nested loops

while loop

repeat and break statement

return statement

next statement

if condition
This control structure checks whether the expression provided in parentheses is true. If
it is, the statements in braces {} are executed.
Syntax:

if(expression){
statements
....
....
}

Example:

x <- 100

if(x > 10){
  print(paste(x, "is greater than 10"))
}

Output:

[1] "100 is greater than 10"

if-else condition
It is similar to if condition but when the test expression in if condition fails, then
statements in else condition are executed.

Syntax:

if(expression){
statements
....
....
}
else{
statements
....
....
}

Example:
x <- 5

# Check whether value is greater than 10
if(x > 10){
  print(paste(x, "is greater than 10"))
}else{
  print(paste(x, "is less than 10"))
}

Output:

[1] "5 is less than 10"
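
The if-else statement tests a single condition. When the same decision has to be made for every element of a vector, base R's ifelse() function is a common alternative. The sketch below uses made-up values and labels for illustration.

```r
# Hypothetical values to classify
values <- c(5, 10, 15)

# ifelse() evaluates the condition for each element and
# picks the first or second label accordingly
labels <- ifelse(values > 10, "greater than 10", "not greater than 10")
print(labels)
```

Note that the value 10 itself falls into the second group, because the condition is a strict "greater than".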

for loop
It is a type of loop in which a sequence of statements is executed once for each
element of a vector.
Syntax:

for(value in vector){
statements
....
....
}

Example:
x <- letters[4:10]

for(i in x){
  print(i)
}

Output:

[1] "d"
[1] "e"
[1] "f"
[1] "g"
[1] "h"
[1] "i"
[1] "j"

Nested loops
Nested loops are loops placed inside another loop. They are commonly used to work
through the rows and columns of a matrix.

Example:
# Defining matrix
m <- matrix(2:15, 2)

for (r in seq(nrow(m))) {
  for (c in seq(ncol(m))) {
    print(m[r, c])
  }
}

Output:

[1] 2
[1] 4
[1] 6
[1] 8
[1] 10
[1] 12
[1] 14
[1] 3
[1] 5
[1] 7
[1] 9
[1] 11
[1] 13
[1] 15

while loop
while loop is another kind of loop, iterated as long as a condition remains true. The
test expression is checked before each execution of the loop body.
Syntax:

while(expression){
statement
....
....
}

Example:
x = 1

# Print 1 to 5
while(x <= 5){
  print(x)
  x = x + 1
}

Output:

[1] 1
[1] 2
[1] 3
[1] 4
[1] 5

repeat loop and break statement


repeat is a loop that iterates indefinitely because it has no exit condition of its own.
So, the break statement is used to exit from the loop. The break statement can be used
in any type of loop to exit from it.

Syntax:

repeat {
statements
....
....
if(expression) {
break
}
}

Example:
x = 1

# Print 1 to 5
repeat{
  print(x)
  x = x + 1
  if(x > 5){
    break
  }
}

Output:

[1] 1
[1] 2
[1] 3
[1] 4
[1] 5

return statement
return statement is used to return the result of an executed function and returns
control to the calling function.

Syntax:

return(expression)

Example:
# Checks whether value is positive, negative or zero
func <- function(x){
  if(x > 0){
    return("Positive")
  }else if(x < 0){
    return("Negative")
  }else{
    return("Zero")
  }
}

func(1)

func(0)

func(-1)

Output:

[1] "Positive"
[1] "Zero"
[1] "Negative"

next statement
next statement is used to skip the current iteration without executing the further
statements and continues the next iteration cycle without terminating the loop.
Example:
# Defining vector
x <- 1:10

# Print even numbers
for(i in x){
  if(i%%2 != 0){
    next # Jumps to next iteration
  }
  print(i)
}

Output:

[1] 2
[1] 4
[1] 6
[1] 8
[1] 10
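
For a simple filter like this one, R code often avoids the loop altogether. As a sketch of that alternative (evens is an illustrative name), a logical condition inside the brackets selects the even numbers in one step:

```r
x <- 1:10

# x %% 2 == 0 is TRUE for even numbers; indexing keeps those elements
evens <- x[x %% 2 == 0]
print(evens)
```

The loop with next remains useful when each iteration does more work than a single selection.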

Measures of central tendency

Statistical measures like mean, median, and mode are essential for summarizing and
understanding the central tendency of a dataset. In R, the mean and median can be
calculated directly with built-in functions, while the mode needs a short user-defined
function or a package. This section provides a guide on how to calculate mean, median,
and mode in the R programming language.

Mean, Median and Mode in R Programming

Dataset used for Calculating the Mean, Median, and Mode in R Programming

Before doing any computation, we first need to prepare our data and save it in an
external .txt or .csv file; it is best practice to save the file in the current working
directory. After that, import your data into R as follows:
Dataset: CardioGoodFitness

R
# R program to import data into R
# Import the data using read.csv()
myData = read.csv("CardioGoodFitness.csv",
                  stringsAsFactors = FALSE)

# Print the first 6 rows
print(head(myData))

Output:

  Product Age Gender Education MaritalStatus Usage Fitness Income Miles
1   TM195  18   Male        14        Single     3       4  29562   112
2   TM195  19   Male        15        Single     2       3  31836    75
3   TM195  19 Female        14     Partnered     4       3  30699    66
4   TM195  19   Male        12        Single     3       3  32973    85
5   TM195  20   Male        13     Partnered     4       2  35247    47
6   TM195  20 Female        14     Partnered     3       3  32973    66

Mean in R Programming Language


It is the sum of observations divided by the total number of observations. It is also
defined as average which is the sum divided by count.
\[ \text{Mean}(\mu) = \frac{1}{N} \sum_{i=1}^{N} x_i \]

R
# R program to illustrate
# Descriptive Analysis
# Import the data using read.csv()
myData = read.csv("CardioGoodFitness.csv",
                  stringsAsFactors = FALSE)

# Compute the mean value
# (stored as mean_age to avoid reusing the name of the mean() function)
mean_age = mean(myData$Age)

print(mean_age)

Output:

[1] 28.78889

Median in R Programming Language


It is the middle value of the data set. It splits the data into two halves. If the number of
elements in the data set is odd then the center element is median and if it is even then the
median would be the average of two central elements.

\[ \text{Median} = \begin{cases} x_{\frac{N+1}{2}} & \text{if } N \text{ is odd} \\ \dfrac{x_{\frac{N}{2}} + x_{\frac{N}{2}+1}}{2} & \text{if } N \text{ is even} \end{cases} \]
R
# R program to illustrate
# Descriptive Analysis
# Import the data using read.csv()
myData = read.csv("CardioGoodFitness.csv",
                  stringsAsFactors = FALSE)

# Compute the median value
# (stored as median_age to avoid reusing the name of the median() function)
median_age = median(myData$Age)

print(median_age)

Output:

[1] 26
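
The odd and even cases of the median definition can be checked by hand on small vectors. The function manual_median() below is a hypothetical helper written for illustration (it is not part of base R); it sorts the data and applies the two cases directly, and the result is compared with the built-in median().

```r
# Illustrative helper: median computed directly from the definition
manual_median <- function(x) {
  x <- sort(x)
  n <- length(x)
  if (n %% 2 == 1) {
    # Odd count: the single middle element
    x[(n + 1) / 2]
  } else {
    # Even count: average of the two central elements
    (x[n / 2] + x[n / 2 + 1]) / 2
  }
}

odd_values <- c(7, 1, 5)       # sorted: 1 5 7
even_values <- c(7, 1, 5, 3)   # sorted: 1 3 5 7

print(manual_median(odd_values))
print(manual_median(even_values))

# Both agree with the built-in function
print(manual_median(even_values) == median(even_values))
```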

Mode in R Programming Language


It is the value that has the highest frequency in the given data set. The data set has no
mode if all data points occur with the same frequency. Also, we can have more than one
mode if two or more values share the highest frequency. There is no built-in function for
finding the statistical mode in R (the base function mode() reports an object's storage
mode instead), so we can create our own function for finding the mode or we can use
the package called modeest.
\[ \text{Mode} = \text{the value that appears most frequently in the dataset} \]

Creating a user-defined function for finding Mode


There is no built-in function for finding the mode in R. So let’s create a user-defined
function that will return the mode of the data passed. We will use the table() function for
this, as it tabulates each distinct value together with its frequency. We then sort the
frequencies of the Age column in descending order and return the value in first place.
R
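# Import the data using read.csv()
myData = read.csv("CardioGoodFitness.csv",
                  stringsAsFactors = FALSE)

# Return the most frequent value of x
get_mode = function(x){
  counts = sort(table(x), decreasing = TRUE)
  return(as.numeric(names(counts)[1]))
}

get_mode(myData$Age)

Output:

[1] 25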

Using Modeest Package
We can use the modeest package for R. This package provides methods to estimate the
mode of univariate data and the modes of the usual probability distributions.

R
# R program to illustrate
# Descriptive Analysis
# Import the library
library(modeest)

# Import the data using read.csv()
myData = read.csv("CardioGoodFitness.csv",
                  stringsAsFactors = FALSE)

# Compute the mode value (mfv = most frequent value)
mode_age = mfv(myData$Age)

print(mode_age)

Output:

[1] 25
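
A data set can have several modes, so a function that returns only one value may hide ties. The helper all_modes() below is a hypothetical sketch in base R (not taken from the text or from modeest) that returns every value sharing the highest frequency; it assumes numeric input.

```r
# Illustrative helper: return every value that shares the top frequency
all_modes <- function(x) {
  counts <- table(x)
  modes <- names(counts)[counts == max(counts)]
  as.numeric(modes)
}

# 2 and 3 both occur twice, so both are returned
tied <- all_modes(c(1, 2, 2, 3, 3))
print(tied)

# 4 is the only most frequent value
single <- all_modes(c(4, 4, 5))
print(single)
```

If every value occurs equally often, this helper returns all of them, which corresponds to the "no mode" case described above.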

Calculate Mean and Median in R with Missing (NA) Values

When the data contains missing values (NA), you can still calculate the mean and
median in R by passing the na.rm = TRUE argument to the functions. Without it, both
functions return NA.
Let’s start by creating a vector with missing values:

R
x <- c(1, 2, NA, 4, 5, NA, 7, 8, NA, 9, 10)

mean(x, na.rm = TRUE)

median(x, na.rm = TRUE)

Output:

[1] 5.75
[1] 6

The arithmetic mean (average) of the non-missing numbers in x is determined by the
function mean(x, na.rm = TRUE). The na.rm = TRUE option ensures that any NA values
in x are omitted from the calculation.

Adding up all the non-missing data is the first step in calculating the mean of x; the sum
is then divided by the number of non-missing values (46 / 8 = 5.75).

The function median(x, na.rm = TRUE) finds the median of the non-missing values in x,
again omitting any NA values from the calculation.

Conclusion
In R calculating the mean and median is straightforward using the built-in
functions mean() and median() . Calculating the mode requires a custom function since R
does not have a built-in mode function. These measures provide valuable insights into
the central tendency of your data, helping to summarize and understand your dataset
effectively.

Calculation of range, variance and standard deviation

var() function in R computes the sample variance of a vector. It measures how far the
values are spread around the mean.

Syntax: var(x)
Parameters: x: numeric vector

Example 1: Computing variance of a vector


# R program to illustrate
# variance of vector

# Create example vector
x <- c(1, 2, 3, 4, 5, 6, 7)

# Apply var function in R and print the result
print(var(x))

Output:

[1] 4.666667

Here in the above code, we took an example vector “x” and calculated its variance.

sd() Function
sd() function is used to compute the standard deviation of given values in R. It is the
square root of its variance.

Syntax: sd(x)

Parameters:x: numeric vector

Example 1: Computing standard deviation of a vector


# R program to illustrate
# standard deviation of vector

# Create example vector
x2 <- c(1, 2, 3, 4, 5, 6, 7)

# Apply sd function in R and print the result
print(sd(x2))

Output:

[1] 2.160247

Here in the above code, we took an example vector “x2” and calculated its standard
deviation.
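
To see what var() and sd() compute, the sample variance can be worked out step by step: subtract the mean from each value, square the differences, add them up, and divide by n - 1 (not n). The sketch below does this for the same seven numbers; the names deviations, manual_var and manual_sd are illustrative.

```r
x <- c(1, 2, 3, 4, 5, 6, 7)
n <- length(x)

# Distance of each value from the mean
deviations <- x - mean(x)

# Sample variance: sum of squared deviations divided by (n - 1)
manual_var <- sum(deviations^2) / (n - 1)

# Standard deviation: square root of the variance
manual_sd <- sqrt(manual_var)

print(manual_var)
print(manual_sd)

# The manual results match the built-in functions
print(all.equal(manual_var, var(x)))
print(all.equal(manual_sd, sd(x)))
```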

The range can be defined as the difference between the maximum and minimum
elements in the given data; the data can be a vector or a dataframe. That is,
range = maximum_value - minimum_value.

Method 1: Find range in a vector using min and max functions
We can find the range by subtracting the minimum value in the vector from the maximum
value. We can find the maximum value using the max() function and the minimum value
by using the min() function.

Syntax:

max(vector)-min(vector)

If a vector contains NA values, we should pass the na.rm = TRUE argument to exclude
them.

Example:

R
# create vector

data = c(12, 45, NA, NA, 67, 23, 45, 78, NA, 89)

# display

print(data)

# find range

print(max(data, na.rm=TRUE)-min(data, na.rm=TRUE))

Output:

[1] 12 45 NA NA 67 23 45 78 NA 89
[1] 77

Method 2: Get range in the dataframe column
We can get the range of a particular column in the dataframe. As with a vector, we get
the maximum value of the column using the max function and the minimum value using
the min function, excluding NA values in both, and then take the difference.

Syntax:

max(dataframe$column_name,na.rm=TRUE)-
min(dataframe$column_name,na.rm=TRUE)

where

dataframe is the input dataframe

column_name is the column in the dataframe

Example:

R
# create dataframe

data = data.frame(column1=c(12, 45, NA, NA, 67, 23, 45, 78, NA, 89),

column2=c(34, 41, NA, NA, 27, 23, 55, 78, NA, 73))

# display

print(data)

# find range in column1

print(max(data$column1, na.rm=TRUE)-min(data$column1, na.rm=TRUE))

# find range in column2

print(max(data$column2, na.rm=TRUE)-min(data$column2, na.rm=TRUE))

Output:

column1 column2
1 12 34
2 45 41
3 NA NA
4 NA NA
5 67 27
6 23 23
7 45 55
8 78 78
9 NA NA
10 89 73

[1] 77
[1] 55

Method 3: Get range from entire dataframe


We can also get the range of an entire dataframe, provided all of its columns are
numeric. Here we apply the max and min functions directly to the dataframe, then
subtract the minimum value from the maximum value.
Syntax:

max(dataframe,na.rm=TRUE)-min(dataframe,na.rm=TRUE)

Example:

R
# create dataframe

data = data.frame(column1=c(12, 45, NA, NA, 67, 23, 45, 78, NA, 89),

column2=c(34, 41, NA, NA, 27, 23, 55, 78, NA, 73))

# display

print(data)

# find range in entire dataframe

print(max(data, na.rm=TRUE)-min(data, na.rm=TRUE))

Output:

column1 column2
1 12 34
2 45 41
3 NA NA
4 NA NA
5 67 27
6 23 23
7 45 55
8 78 78
9 NA NA
10 89 73

[1] 77

Method 4: Using range() function
We can use the range() function to get the minimum and maximum values. Note that
range() returns these two values as a vector rather than their difference.

Syntax:

range(vector/dataframe)

Example:

R
# create vector

data = c(12, 45, NA, NA, 67, 23, 45, 78, NA, 89)

# display

print(data)

# find range in vector

print(range(data, na.rm=TRUE))

Output:

[1] 12 45 NA NA 67 23 45 78 NA 89
[1] 12 89
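
Since range() returns the two end points, the range as a single number can be obtained by taking their difference. A minimal sketch with the same vector uses the base function diff() (limits and spread are illustrative names):

```r
data <- c(12, 45, NA, NA, 67, 23, 45, 78, NA, 89)

# Minimum and maximum, ignoring missing values
limits <- range(data, na.rm = TRUE)
print(limits)

# Difference between maximum and minimum
spread <- diff(limits)
print(spread)
```

This gives the same result as max(data, na.rm = TRUE) - min(data, na.rm = TRUE) in Method 1.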

Introduction to data visualization

Importance of data visualization

1. Data Visualization Discovers the Trends in Data


The most important thing that data visualization does is discover the trends in data.
After all, it is much easier to observe data trends when all the data is laid out in front
of you in a visual form as compared to data in a table. For example, the screenshot
below on visualization on Tableau demonstrates the sum of sales made by each
customer in descending order. However, the color red denotes loss while grey
denotes profits. So it is very easy to observe from this visualization that even though
some customers may have huge sales, they are still at a loss. This would be very
difficult to observe from a table.

2. Data Visualization Provides a Perspective on the Data
Visualizing Data provides a perspective on data by showing its meaning in the larger
scheme of things. It demonstrates how particular data references stand concerning
the overall data picture. In the data visualization below, the data between sales and
profit provides a data perspective concerning these two measures. It also
demonstrates that there are very few sales above 12K and higher sales do not
necessarily mean a higher profit.

3. Data Visualization Puts the Data into the Correct Context
It isn’t easy to understand the context of the data without data visualization. Since
context provides the whole circumstances of the data, it is very difficult to grasp by
just reading numbers in a table. In the below data visualization on Tableau,
a TreeMap is used to demonstrate the number of sales in each region of the United
States. It is very easy to understand from this data visualization that California has the
largest number of sales out of the total number since the rectangle for California is
the largest. But this information is not easy to understand outside of context without
visualizing data.

4. Data Visualization Saves Time


It is definitely faster to gather some insights from the data using data visualization
rather than just studying a table. In the screenshot below on Tableau, it is very easy
to identify the states that have suffered a net loss rather than a profit. This is because
all the cells with a loss are coloured red using a heat map, so it is obvious which states
have suffered a loss. Compare this to a normal table where you would need to check
each cell to see if it has a negative value to determine a loss. Visualizing Data can save
a lot of time in this situation!

5. Data Visualization Tells a Data Story
Data visualization is also a medium to tell a data story to the viewers. The
visualization can be used to present the data facts in an easy-to-understand form
while telling a story and leading the viewers to an inevitable conclusion. This data
story, like any other type of story, should have a good beginning, a basic plot, and an
ending that it is leading towards. For example, if a data analyst has to craft a data
visualization for company executives detailing the profits of various products, then
the data story can start with the profits and losses of multiple products and move on
to recommendations on how to tackle the losses.

Types of data visualization

1. Bar Charts
Bar charts are one of the most common visualization tools, used to represent and
compare categorical data by displaying rectangular bars. A bar chart has an X and a Y
axis, where the X axis represents the categories and the Y axis represents the values.
The height of each bar represents the value for that category on the Y axis. Longer bars
indicate higher values.

There are various types of bar charts, such as the horizontal bar chart, stacked bar
chart, grouped bar chart and diverging bar chart.

When to Use Bar Chart:

Comparing Categories: Showcasing contrast among distinct categories to
evaluate, summarize or discover relationships in the data.

Ranking: When we have data with categories that need to be ranked from
highest to lowest.

Relationship between categories: When you have a dataset with multiple
categorical variables, a bar chart can help to display the relationship between
them and to discover patterns and trends.
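
As a minimal sketch of the ranking use case, base R's barplot() function can draw a bar chart from a named vector. The sales figures, region names and colour below are invented purely for illustration.

```r
# Hypothetical sales per region (made-up numbers)
sales <- c(North = 120, South = 95, East = 150, West = 80)

# Rank the categories from highest to lowest before plotting
ranked <- sort(sales, decreasing = TRUE)

# Draw the bar chart; barplot() returns the x positions of the bars
bar_mids <- barplot(ranked,
                    main = "Sales by region",
                    xlab = "Region",
                    ylab = "Sales",
                    col = "steelblue")
```

Sorting the values first means the tallest bar appears on the left, which makes the ranking readable at a glance.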

2. Line Charts
A line chart or line graph is used to represent data over a time series. It displays data as
a series of data points called markers, connected by line segments that show how
values change over time. This chart is normally used to evaluate trends, view patterns
or examine price movements.

When to Use Line Chart:

Line charts can be used to analyse trends over individual values.

Line charts are also used to compare trends among more than one data series.

A line chart is best used for time series data.

3. Pie Charts
A pie chart is a circular data visualization tool that is divided into slices to represent
numerical proportions or percentages of a whole. Each slice in a pie chart corresponds
to a category in the dataset, and the angle of the slice is proportional to the share it
represents. Pie charts are only effective with a small number of categories. Simple pie
charts and exploded pie charts are different varieties of pie chart.

When to Use Pie Chart:

Pie charts are used to show specific facts to expose the proportion of elements to
the whole. It is used to depict how exclusive classes make up a total pleasant.

Useful in eventualities where statistics has small range of classes.

Useful in emphasizing a particular category by way of highlighting a dominant


slice.

4. Scatter Chart (Plots)


A scatter chart, or scatter plot, is an effective data visualization tool that uses dots to
represent data points. It is used to display and compare two variables, which helps reveal
the relationship between them. A scatter chart uses two axes, X and Y: the X-axis represents
one numerical variable and the Y-axis represents another. The variable on the X-axis is the
independent variable and is plotted against the dependent variable on the Y-axis. Types of
scatter chart include the simple scatter chart, the scatter chart with a trendline and the
scatter chart with color coding.

When to Use Scatter Chart:

Scatter charts are well suited to exploring relationships between numerical variables and
to identifying trends, outliers and subgroup differences.

They are used when two sets of numerical data have to be plotted as one collection of X
and Y coordinates.

Scatter charts are best used for identifying outliers or unusual observations in the data.
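
A scatter chart with a trendline can be sketched in base R as follows. The choice of the built-in mtcars dataset and of a straight-line fit is an assumption made for illustration; any pair of numerical variables would do.

```r
# mtcars ships with base R: car weight (wt, in 1000 lbs) and fuel economy (mpg).
x <- mtcars$wt
y <- mtcars$mpg

# Simple scatter chart: X is the independent variable, Y the dependent one.
plot(x, y,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "Fuel economy vs. weight")

# Add a trendline from a simple linear regression of y on x.
fit <- lm(y ~ x)
abline(fit, col = "red")

# Correlation coefficient summarising the strength of the linear relationship.
r <- cor(x, y)
r
```

A negative value of r here indicates that heavier cars tend to have lower fuel economy.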

5. Histogram
A histogram represents the distribution of numerical data by dividing it into intervals
(bins) and displaying the frequency of the data in each bin as a bar. It is commonly used to
visualize the underlying distribution of a dataset and to discover patterns such as
skewness, central tendency and variability. Histograms are valuable tools for exploring data
distributions, detecting outliers and assessing data quality.

When to Use Histogram:

Distribution Visualization: Histograms are ideal for visualizing the distribution of
numerical data, allowing users to understand the spread and shape of the data.

Data Exploration: They facilitate data exploration by revealing patterns, trends and
outliers within datasets, aiding hypothesis generation and data-driven decision-making.

Quality Control: Histograms help assess data quality by identifying anomalies, errors or
inconsistencies in the data distribution, supporting data validation and cleaning.
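
The binning idea can be illustrated with the base R function hist(). This is only a sketch: the built-in faithful dataset and the suggested number of breaks are assumptions chosen for the example.

```r
# faithful ships with base R: waiting time (minutes) between Old Faithful eruptions.
waiting <- faithful$waiting

# hist() bins the data and draws one bar per bin. It also returns the bin
# boundaries and counts, which can be inspected directly.
# breaks = 10 is only a suggestion; R may choose a nearby "pretty" number of bins.
h <- hist(waiting, breaks = 10,
          main = "Waiting time between eruptions",
          xlab = "Minutes")

h$breaks   # bin boundaries
h$counts   # number of observations in each bin
```

Every observation falls into exactly one bin, so the counts add up to the number of data points.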

6. Box Plot (Box-and-Whisker Plot)


A box plot provides a concise summary of the distribution of numerical data, including the
quartiles, the median and any outliers. It is useful for identifying variability,
skewness and potential outliers in datasets. Box plots are commonly used in
statistical analysis, quality control and data exploration.

When to Use Box Plots:

Identify Outliers: Box plots help detect outliers and extreme values within
datasets, supporting data cleaning and anomaly detection.

Compare Distributions: They allow distributions to be compared between different
groups or categories, facilitating statistical analysis.

Visualize Spread: Box plots visualize the spread and variability of the data,
providing insight into the shape and characteristics of the distribution.
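
A short sketch of comparing distributions across groups with the base R function boxplot() follows. The built-in mtcars dataset and the grouping by number of cylinders are assumptions made for the example.

```r
# Compare fuel economy (mpg) across cars with 4, 6 and 8 cylinders.
b <- boxplot(mpg ~ cyl, data = mtcars,
             xlab = "Number of cylinders", ylab = "Miles per gallon",
             main = "Fuel economy by cylinder count")

# boxplot() also returns the numbers behind the picture.
b$stats   # one column per group: lower whisker, Q1, median, Q3, upper whisker
b$out     # values drawn as individual outlier points (may be empty)
```

Reading the medians (third row of b$stats) side by side gives a quick numerical comparison of the groups.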

Principles of good data visualization

A well-designed visualization can effectively communicate complex information,
engage the audience, support decision-making, and provide an excellent user
experience, thereby maximising the impact of the data being presented.

Design principles are crucial in any context because they provide a foundational
framework for creating effective journeys and helping to make choices that enhance
the overall user experience and communication.

Effective data visualization relies on 12 key design principles that help convey
information accurately and efficiently:

1. Clarity

The visualization should be clear and easily understood by the intended audience.

2. Simplicity

Keep the visualization simple and avoid unnecessary complexity.


3. Purposeful

Understand what message or insight you want to communicate and design for that
purpose.

4. Consistency

Maintain consistency in the design elements throughout the visualization.

5. Contextualization
Provide context for the data being presented.

6. Accuracy

Ensure the visualization accurately represents the underlying data.

7. Visual Encoding
Choose appropriate visual encodings for the data types you are visualizing.

8. Intuitiveness

Design the visualization to be intuitive and easy to comprehend.

9. Interactivity

Consider adding interactive elements to the visualization, such as tooltips, zooming,
filtering, or highlighting.

10. Aesthetics

Although aesthetics are subjective, a visually appealing design can engage viewers
and increase their interest in the data.



11. Accessibility

Accessibility is key; if users can’t read the data, it’s useless.

12. Hierarchy
Work out hierarchy of information early on and always remind yourself of what the
purpose of representing the data is.

Ultimately, design principles play a pivotal role in streamlining the design process, a
facet of their significance that extends far beyond the realms of aesthetics. By
adhering to these principles, designers and creators can ensure that their work is not
only visually pleasing but also thoughtful, impactful, and harmonious for the end user.

(Unit 2) Probability and data distribution

Basics of probability theory

Probability
Probability is the branch of mathematics concerned with the chance that events occur. It
provides a way to measure uncertainty and to predict future events using the available
information.
Probability is a measure of how likely an event is to occur. It ranges from 0 (impossible
event) to 1 (certain event). The probability of an event A is often written as P(A).

Basic Concepts

Experiment: An action or process that leads to one or more outcomes. For example,
tossing a coin.

Sample Space (S): The set of all possible outcomes of an experiment. For a coin toss,
S = {Heads, Tails}.

Event: A subset of the sample space. For instance, getting a head when tossing a
coin.

Formula for Probability

The probability of an event A is given by:

P(A) = (Number of favorable outcomes) / (Total number of possible outcomes)
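
As a minimal sketch, the formula can be evaluated in R by listing the sample space and counting. The example assumes a fair six-sided die, so that all outcomes are equally likely (the formula applies only in that case); the variable names are illustrative.

```r
# Sample space for one roll of a fair six-sided die.
S <- 1:6

# Event A: the roll is even. An event is a subset of the sample space.
A <- S[S %% 2 == 0]

# Classical probability: favorable outcomes / total possible outcomes.
p_A <- length(A) / length(S)
p_A
```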


Axioms of Probability
There are three axioms that form the basis of probability:

1. Non-Negativity: For any event A, the probability of A is always non-negative:

P(A) ≥ 0

2. Normalization: The total probability of all possible outcomes in the sample space S is 1:

P(S) = 1

3. Additivity: For any two mutually exclusive events A and B (i.e., events that
cannot occur simultaneously), the probability of their union is the sum of their
individual probabilities:

P(A ∪ B) = P(A) + P(B) if A ∩ B = ∅

Properties of Probability
Probability specifies how likely an event is to occur. Its value lies between 0 and 1:
zero (0) indicates an impossible event and one (1) indicates an event that is certain to
happen. A few properties of probability are described below.

Key Properties of Probability:

Non-Negativity: The probability of any event is always non-negative. For any event A,

P(A) ≥ 0.

Normalization: The probability of the sure event (the sample space) is 1. If S is the
sample space, then

P(S) = 1.

Additivity (Sum Rule): For any two mutually exclusive (disjoint) events A and B, the
probability of their union is the sum of their individual probabilities:

P(A ∪ B) = P(A) + P(B)

Complementary Rule: The probability of the complement of an event A (i.e., the
event not occurring) is

P(A') = 1 − P(A)

Conditional Probability: The probability of A given that B has occurred is

P(A | B) = P(A∩B) / P(B), provided P(B) > 0.

Multiplication Rule: For any two events A and B with P(B) > 0, the probability of both
occurring (i.e., the intersection) is:

P(A∩B) = P(A | B) ⋅ P(B)

1. The probability of an event can be defined as the number of favorable outcomes of the
event divided by the total number of possible outcomes.

Probability(Event) = (Number of favorable outcomes of an event) /
(Total number of possible outcomes).

Example: What is the probability of getting a Tail when a coin is tossed?

Solution:

Number of favorable outcomes: {Tail} = 1
Total number of possible outcomes: {Head, Tail} = 2
Probability of getting Tail = 1/2 = 0.5



2. Probability of a sure/certain event is 1.
Example: What is the probability of getting a number from 1 to 6 when a die is
rolled?

Solution:

Number of favorable outcomes: {1,2,3,4,5,6} = 6
Total possible outcomes: {1,2,3,4,5,6} = 6
Probability of getting a number from 1 to 6 = 6/6 = 1
A probability of 1 indicates a certain event.

3. The probability of an impossible event is zero (0).

Example: What is the probability of getting a number greater than 6 when a die is rolled?

Solution:

Number of favorable outcomes: {} = 0
Total possible outcomes: {1,2,3,4,5,6} = 6
Probability(Number > 6) = 0/6 = 0
A probability of zero indicates an impossible event.

4. The probability of an event always lies between 0 and 1. It is never negative.

0 <= Probability(Event) <= 1

Example: In all the examples above, the probability is always between 0 and 1.

5. If A and B are two mutually exclusive events, then P(A U B) = P(A) + P(B).
Note: Two events are mutually exclusive if they cannot occur simultaneously.

Example: Getting a head and getting a tail when a coin is tossed are mutually exclusive
events.

Solution:

Find the probability of each possibility separately, i.e., the probability of getting a
head and the probability of getting a tail, and add them to get P(Head U Tail).
P(Head U Tail) = P(Head) + P(Tail) = (1/2) + (1/2) = 1



6. An elementary event is an event that has only one outcome. These events are also called
atomic events or sample points. The sum of the probabilities of all elementary events of an
experiment is always 1.
Example: When a coin is tossed, the possible outcomes are head or tail. These individual
events, i.e., only head or only tail, are the elementary events of the sample space.

Solution:

Probability of getting only head = 1/2
Probability of getting only tail = 1/2
So, sum = 1.

7. Sum of probabilities of complementary events is 1.

P(A)+P(A’)=1

Example: When a coin is tossed, the probability of getting a head is 1/2. The
complementary event of getting a head is getting a tail, so the probability of getting a
tail is 1/2.

Solution:

Summing the two gives P(Head) + P(Head') = (1/2) + (1/2) = 1, where Head' = getting a tail.

8. If A and B are two events that are not mutually exclusive, then
P(A U B) = P(A) + P(B) - P(A∩B)

P(A∩B) = P(A) + P(B) - P(A U B)

Note: Two events are not mutually exclusive when they have at least one common outcome.
Example: What is the probability of getting an even number or a number less than 4 when a
die is rolled?

Solution:

Favorable outcomes of getting an even number = {2,4,6}
Favorable outcomes of getting a number < 4 = {1,2,3}
The two events share one common outcome, {2}, so they are not mutually exclusive.

So, P(Even U Number<4) = P(Even) + P(Number<4) – P(Even ∩ Number<4)

P(Even) = 3/6 = 1/2

P(Number<4) = 3/6 = 1/2

P(Even ∩ Number<4) = 1/6 (common element)

P(Even U Number<4) = (1/2) + (1/2) - (1/6) = 1 - (1/6) = 5/6 ≈ 0.83
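
The die example can be checked by enumeration with R's set functions union() and intersect(). This is a sketch only; the helper p() is a hypothetical name introduced for the example and assumes equally likely outcomes.

```r
# Sample space of a fair die and the two events from the example.
S     <- 1:6
even  <- S[S %% 2 == 0]   # {2, 4, 6}
less4 <- S[S < 4]         # {1, 2, 3}

# Hypothetical helper: classical probability of an event within S.
p <- function(event) length(event) / length(S)

# Left side: probability of the union, counted directly.
lhs <- p(union(even, less4))

# Right side: P(A) + P(B) - P(A and B).
rhs <- p(even) + p(less4) - p(intersect(even, less4))

c(lhs = lhs, rhs = rhs)
```

Both sides give 5/6, which confirms that subtracting the overlap avoids counting the common outcome twice.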

9. If E1, E2, E3, …, EN are mutually exclusive events, then
P(E1 U E2 U E3 U … U EN) = P(E1) + P(E2) + P(E3) + … + P(EN).
Example: What is the probability of getting 1, 2 or 3 when a die is rolled?

Solution:

Let A be the event of getting 1 when a die is rolled. Favorable outcome: {1}

Let B be the event of getting 2 when a die is rolled. Favorable outcome: {2}

Let C be the event of getting 3 when a die is rolled. Favorable outcome: {3}

There are no common favorable outcomes, so A, B and C are mutually exclusive events.

According to the rule above, P(A U B U C) = P(A) + P(B) + P(C)

P(A) = 1/6

P(B) = 1/6

P(C) = 1/6

P(A U B U C) = (1/6) + (1/6) + (1/6) = 3/6 = 1/2

These are the basic properties of probability.

Conditional Probability
Conditional probability quantifies the probability of an event A given that another event B
has occurred. It is defined as:

P(A | B) = P(A∩B) / P(B), provided P(B) > 0
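
The definition can be tried out by enumerating a standard 52-card deck in R. The sketch below assumes every card is equally likely to be drawn; the variable names are illustrative.

```r
# Build a 52-card deck: every combination of 13 ranks and 4 suits.
deck <- expand.grid(rank = c("A", 2:10, "J", "Q", "K"),
                    suit = c("clubs", "diamonds", "hearts", "spades"),
                    stringsAsFactors = FALSE)

is_face <- deck$rank %in% c("J", "Q", "K")   # event B: the card is a face card
is_king <- deck$rank == "K"                  # event A: the card is a king

p_face          <- mean(is_face)             # P(B) = 12/52
p_king_and_face <- mean(is_king & is_face)   # P(A and B) = 4/52

# Conditional probability P(A | B) = P(A and B) / P(B).
p_king_given_face <- p_king_and_face / p_face
p_king_given_face
```

Conditioning on "face card" shrinks the sample space from 52 cards to 12, of which 4 are kings, giving 1/3.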


Law of Total Probability
The law of total probability expresses the probability of an event A as a weighted sum of
conditional probabilities:

$P(A) = \sum_{i=1}^{n} P(A \mid B_i)\, P(B_i)$

where $B_1, B_2, \dots, B_n$ is a partition of the sample space S.

Bayes’ Theorem
Bayes’ Theorem provides a way to update the probability of a hypothesis H based on new
evidence E:

P(H | E) = P(E | H) P(H) / P(E)

where P(H) is the prior probability of the hypothesis, P(E | H) is the likelihood of the
evidence given the hypothesis, and P(E) is the marginal likelihood of the evidence.

Independence of Events
Two events A and B are said to be independent if the occurrence of one does not affect
the probability of the occurrence of the other:
P(A∩B)=P(A)⋅P(B)

If P(A∩B)≠P(A)P(B) the events are dependent.
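
The product rule gives a practical test for independence. The sketch below enumerates two fair dice in R; the three events are chosen only for illustration.

```r
# All 36 equally likely outcomes of rolling two fair dice.
rolls <- expand.grid(first = 1:6, second = 1:6)

A <- rolls$first %% 2 == 0               # first die is even
B <- rolls$first + rolls$second == 7     # the sum is 7
C <- rolls$first + rolls$second >= 10    # the sum is at least 10

# Independent: P(A and B) equals P(A) * P(B).
indep_AB <- isTRUE(all.equal(mean(A & B), mean(A) * mean(B)))

# Dependent: P(A and C) differs from P(A) * P(C).
indep_AC <- isTRUE(all.equal(mean(A & C), mean(A) * mean(C)))

c(A_and_B_independent = indep_AB, A_and_C_independent = indep_AC)
```

Knowing that the first die is even says nothing about whether the sum is 7, but it does make a large sum more likely, which is why the second pair fails the test.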

Random Variables and Expectation


A random variable is a variable whose value is a number determined by the outcome of a
random experiment; formally, it is a function that assigns a numerical value to each outcome
in the sample space. There are two main types of random variables:

Discrete Random Variables: Take on one of a finite or countably infinite number of values.

Continuous Random Variables: Take on values from an uncountably infinite set, such as an
interval of real numbers.

Expectation of a Random Variable


The expectation (or expected value) of a random variable X, denoted by E(X), is a
measure of the central tendency of its distribution.

For a discrete random variable X with possible values $x_1, x_2, \dots$ and corresponding
probabilities $p_1, p_2, \dots$:

$E(X) = \sum_i x_i \cdot P(X = x_i)$

For a continuous random variable X with probability density function $f(x)$:

$E(X) = \int_{-\infty}^{\infty} x \cdot f(x)\, dx$
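
Both formulas can be evaluated numerically in R. The sketch below assumes a fair die for the discrete case and, for the continuous case, an exponential density with rate 2 (an arbitrary choice), whose density is zero for negative x so the integral starts at 0.

```r
# Discrete case: fair six-sided die.
x <- 1:6
p <- rep(1 / 6, 6)
e_discrete <- sum(x * p)          # sum of x_i * P(X = x_i)

# Continuous case: exponential distribution with rate 2 (assumed for illustration).
# integrate() performs numerical integration of x * f(x).
rate <- 2
e_continuous <- integrate(function(x) x * dexp(x, rate = rate),
                          lower = 0, upper = Inf)$value

c(discrete = e_discrete, continuous = e_continuous)
```

The die gives 3.5, and the numerical integral agrees with the known mean 1/rate = 0.5 of the exponential distribution.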

Variance and Standard Deviation


Variance measures the spread of a random variable around its expectation and is
denoted by Var(X). The standard deviation is the square root of the variance, denoted
by $\sigma_X$.

Variance of a random variable X:

$\mathrm{Var}(X) = E[(X - E(X))^2] = E(X^2) - [E(X)]^2$

Standard deviation of X:

$\sigma_X = \sqrt{\mathrm{Var}(X)}$

Relationship with Expectation


The variance can also be understood as the expected value of the squared deviations from
the mean, which describes how dispersed the values of the random variable are.

Probability Distributions
A probability distribution describes how probabilities are spread over the values of the
random variable involved. Some common distributions include:

Discrete Distributions

Binomial Distribution

Poisson Distribution

Continuous Distributions

Normal Distribution

Exponential Distribution

Properties of Probability – Sample Problems


Example 1: A fair die is rolled. What is the probability of rolling a number greater than
6?

Solution: Since there are no numbers greater than 6 on a standard die, P(rolling > 6) = 0

Example 2: A coin is tossed. Verify that the sum of probabilities of all outcomes is 1.

Solution: P(Heads) = 1/2, P(Tails) = 1/2
1/2 + 1/2 = 1, so the property holds.

Example 3: In a deck of 52 cards, what is the probability of drawing either a king or a
queen?

Solution: P(King) = 4/52 = 1/13, P(Queen) = 4/52 = 1/13
P(King or Queen) = 1/13 + 1/13 = 2/13

Example 4: The probability of rain tomorrow is 0.3. What is the probability it won’t
rain?

Solution: P(no rain) = 1 – P(rain) = 1 – 0.3 = 0.7

Example 5: In a standard deck, compare P(drawing a king) and P(drawing a face card).

Solution: P(King) = 4/52 = 1/13
P(Face card) = 12/52 = 3/13
Since all kings are face cards, P(King) ≤ P(Face card)

Example 6: In a class, 60% of students play soccer, 30% play basketball, and 20%
play both. What percentage plays either soccer or basketball?

Solution: P(Soccer or Basketball) = P(Soccer) + P(Basketball) – P(Soccer and Basketball)
= 0.60 + 0.30 – 0.20 = 0.70 or 70%

Example 7: A fair coin is tossed twice. What’s the probability of getting heads both
times?

Solution: P(H on first toss) = 1/2, P(H on second toss) = 1/2
Since the tosses are independent, P(H and H) = 1/2 × 1/2 = 1/4

Example 8: In a deck of 52 cards, what’s the probability of drawing a king, given that
it’s a face card?

Solution: P(King | Face card) = P(King and Face card) / P(Face card)
= (4/52) / (12/52) = 1/3

Example 9: 30% of students are in Science. 80% of Science students and 60% of non-Science
students wear glasses. What percentage of all students wear glasses?

Solution: P(Glasses) = P(Glasses|Science) × P(Science) + P(Glasses|not Science) × P(not Science)
= 0.80 × 0.30 + 0.60 × 0.70
= 0.24 + 0.42 = 0.66 or 66%

Example 10: 1% of people have a certain disease. The test for this disease is 95%
accurate (both for positive and negative results). If a person tests positive, what’s the
probability they have the disease?

Solution: Let D = disease, T = positive test.
P(D|T) = [P(T|D) × P(D)] / [P(T|D) × P(D) + P(T|not D) × P(not D)]
= (0.95 × 0.01) / (0.95 × 0.01 + 0.05 × 0.99)
= 0.0095 / 0.0590 ≈ 0.161 or about 16.1%
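
Examples 9 and 10 can be reproduced in R with plain arithmetic. This is a sketch using the figures given in the examples; the variable names are illustrative.

```r
# Example 9: law of total probability over the partition {Science, not Science}.
p_group         <- c(science = 0.30, not_science = 0.70)   # P(B_i)
p_glasses_given <- c(science = 0.80, not_science = 0.60)   # P(Glasses | B_i)
p_glasses <- sum(p_glasses_given * p_group)
p_glasses

# Example 10: Bayes' theorem for a positive test result.
prior       <- 0.01    # P(D): prevalence of the disease
sensitivity <- 0.95    # P(T | D)
false_pos   <- 0.05    # P(T | not D)

# Denominator P(T) comes from the law of total probability.
p_positive <- sensitivity * prior + false_pos * (1 - prior)
posterior  <- sensitivity * prior / p_positive
posterior
```

The posterior is only about 0.16 despite the 95% accuracy, because the disease is rare and most positive results come from the much larger healthy group.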

Discrete probability distribution

What is Discrete Probability Distribution?


A discrete probability distribution gives the probability of each possible value of a
discrete random variable. More generally, a probability distribution lists the different
values of a random variable along with their probabilities. The two types of probability
distribution are the discrete probability distribution and the continuous probability
distribution.

Discrete Probability Distribution Definition

A discrete probability distribution specifies the probability of each specific value of a
discrete random variable. Discrete probability distributions have a finite or countably
infinite number of outcomes.

Conditions for Discrete Probability Distribution


Conditions for the discrete probability distribution are:

Probability of a discrete random variable lies between 0 and 1: 0 ≤ P (X = x) ≤ 1

Sum of Probabilities is always equal to 1: ∑ P (X =x) = 1

Discrete Probability Distribution Example


Suppose two coins are tossed and X is the number of tails. The distribution of X is an
example of a discrete probability distribution. The sample space is {HH, HT, TH, TT}, and
the discrete probability distribution table is given by:

x            0       1          2
Outcomes     {HH}    {HT, TH}   {TT}
P (X = x)    1/4     1/2        1/4
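
The same table can be built in R by enumerating the sample space, assuming two fair coins so that the four outcomes are equally likely. The variable names are illustrative.

```r
# Sample space of two coin tosses: HH, TH, HT, TT.
tosses <- expand.grid(first  = c("H", "T"),
                      second = c("H", "T"),
                      stringsAsFactors = FALSE)

# Random variable X: number of tails in each outcome.
tails <- rowSums(tosses == "T")

# Relative frequency of each value of X across the equally likely outcomes.
pmf <- table(tails) / nrow(tosses)
pmf
```

The two conditions for a discrete distribution can be checked on the result: every probability lies between 0 and 1, and sum(pmf) equals 1.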

Discrete Probability Distribution Formulas


The different formulas for the discrete probability distribution like probability mass
function, cumulative distribution function, mean and variance are given below.

PMF of Discrete Probability Distribution


The PMF of a discrete random variable X gives the probability that X is exactly equal to
x. The PMF, i.e., probability mass function, of a discrete probability distribution is
given by:

f(x) = P (X = x)

CDF of Discrete Probability Distribution

The CDF of a discrete random variable X gives the probability that X is less than or equal
to x. The CDF, i.e., cumulative distribution function, of a discrete probability
distribution is given by:

F(x) = P (X ≤ x)

Discrete Probability Distribution Mean


The mean of a discrete probability distribution is the probability-weighted average of all
the values that the discrete variable can take. It is also called the expected value of the
discrete probability distribution. The mean of a discrete probability distribution is given by:

$E[X] = \sum_x x\, P(X = x)$

Discrete Probability Distribution Variance

The variance of a discrete probability distribution is the sum, over all values x, of the
squared difference between x and the mean μ, weighted by the PMF. The variance of the
discrete probability distribution is given by:

$\mathrm{Var}[X] = \sum_x (x - \mu)^2\, P(X = x)$
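
As a sketch, the PMF, CDF, mean and variance formulas can be applied to the number of tails in two fair coin tosses, whose probabilities are 1/4, 1/2 and 1/4. The variable names are illustrative.

```r
# Values of X (number of tails in two tosses) and their probabilities (the PMF).
x   <- 0:2
pmf <- c(1 / 4, 1 / 2, 1 / 4)

# CDF: running total of the PMF, i.e. P(X <= x) for each x.
cdf <- cumsum(pmf)

# Mean: sum of x * P(X = x).
mu <- sum(x * pmf)

# Variance: sum of (x - mu)^2 * P(X = x).
variance <- sum((x - mu)^2 * pmf)

data.frame(x, pmf, cdf)
c(mean = mu, variance = variance)
```

The mean is 1 tail and the variance is 0.5; the last CDF value is 1 because the probabilities sum to 1.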

How to Find Discrete Probability Function


Steps to find the discrete probability function are given below:



Step 1: First determine the sample space of the given event.

Step 2: Define random variable X as the event for which the probability has to be
found.

Step 3: Consider the possible values of x and find the probabilities for each value.

Step 4: Write all the values of x and their respective probabilities in tabular form to
get the discrete probability distribution.

Types of Discrete Probability Distribution


The different types of discrete probability distribution are listed below.

Bernoulli Distribution

Binomial Distribution

Poisson Distribution

Geometric Distribution

Binomial Distribution
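A discrete probability distribution that describes the number of successes in n trials,
given the probability of success and the probability of failure in each trial, is called the
Binomial distribution. The probability mass function of the Binomial distribution is given by:

$P(X = x) = \binom{n}{x}\, p^x (1 - p)^{n - x}$

where n is the number of trials, x is the number of successes and p is the probability of
success in each trial.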
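
The formula can be compared against R's built-in dbinom() as a quick check. The values n = 13, x = 3 and p = 1/6 below are chosen only for illustration.

```r
n <- 13      # number of trials
x <- 3       # number of successes
p <- 1 / 6   # probability of success in each trial

# PMF written out by hand: nCx * p^x * (1 - p)^(n - x).
manual <- choose(n, x) * p^x * (1 - p)^(n - x)

# The same probability from R's binomial density function.
builtin <- dbinom(x, size = n, prob = p)

c(manual = manual, builtin = builtin)

# The probabilities over all possible values 0..n add up to 1.
total <- sum(dbinom(0:n, size = n, prob = p))
total
```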

Binomial Distribution Definition


The binomial probability distribution describes the probability of success or failure of an
outcome in a series of trials. It maps each outcome to one of two values: success or
failure, yes or no, true or false, etc. Each trial performed to obtain a success or a
failure is called a Bernoulli trial, and the probability distribution of a single Bernoulli
trial is called the Bernoulli distribution.

The binomial distribution for a random variable X = 0, 1, 2, …, n is defined as the
probability distribution of the number of successes in a series of n trials, where each
trial has two possible outcomes (success or failure) and the outcome of each trial is
independent of the outcomes of the other trials.



Binomial Distribution Properties
Properties of Binomial Distribution are mentioned below:

There are only two possible outcomes: success or failure, yes or no, true or false.

There is a finite number of trials given as ‘n’.

Probability of success and failure in each trial is the same.

Only the number of successes is counted out of all trials.

Each trial is independent of any other trial.

Binomial Distribution Applications


Binomial Distribution is used where we have only two possible outcomes. Let’s see some
of the areas where Binomial Distribution can be used

To find the number of male and female students in an institute.

To find the likeability of something in Yes or No.

To find defective or good products manufactured in a factory.

To find positive and negative reviews on a product.

Votes collected in the form of 0 or 1.

Functions for Binomial Distribution


We have four functions for handling the binomial distribution in R, namely:

dbinom()

dbinom(k, n, p)

pbinom()

pbinom(k, n, p)

where n is the total number of trials, p is the probability of success, and k is the value
at which the probability has to be found.

qbinom()

qbinom(P, n, p)

where P is the probability, n is the total number of trials and p is the probability of
success.

rbinom()

rbinom(n, N, p)

where n is the number of observations, N is the total number of trials, and p is the
probability of success.

dbinom() Function
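This function is used to find the probability at a particular value for data that follows a
binomial distribution, i.e., it finds:

P(X = k)

Syntax:

dbinom(k, n, p)

Example:

# Probability of exactly 3 successes in 13 trials with success probability 1/6
dbinom(3, size = 13, prob = 1 / 6)

# Probability distribution for 10 trials (k = 0, 1, ..., 10)
probabilities <- dbinom(x = c(0:10), size = 10, prob = 1 / 6)

# Row i of the data frame holds P(X = i - 1)
data.frame(probabilities)

plot(0:10, probabilities, type = "l")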

Output :

> dbinom(3, size = 13, prob = 1/6)
[1] 0.2138454
> probabilities = dbinom(x = c(0:10), size = 10, prob = 1/6)
> data.frame(probabilities)
probabilities
1 1.615056e-01
2 3.230112e-01
3 2.907100e-01
4 1.550454e-01
5 5.426588e-02
6 1.302381e-02
7 2.170635e-03
8 2.480726e-04
9 1.860544e-05
10 8.269086e-07
11 1.653817e-08

The above piece of code first finds the probability at k=3, then it displays a data frame
containing the probability distribution for k from 0 to 10 which in this case is 0 to n.

pbinom() Function
The function pbinom() is used to find the cumulative probability of data following a
binomial distribution up to a given value, i.e., it finds

P(X <= k)

Syntax:

pbinom(k, n, p)

Example:

pbinom(3, size = 13, prob = 1 / 6)

plot(0:10, pbinom(0:10, size = 10, prob = 1 / 6), type = "l")

Output :

> pbinom(3, size = 13, prob = 1/6)


[1] 0.8419226



qbinom() Function
This function is used to find a quantile: given a cumulative probability P, it finds the smallest k such that P(X <= k) >= P.

Syntax:

qbinom(P, n, p)

Example:

qbinom(0.8419226, size = 13, prob = 1 / 6)

x <- seq(0, 1, by = 0.1)

y <- qbinom(x, size = 13, prob = 1 / 6)

plot(x, y, type = 'l')

Output :

> qbinom(0.8419226, size = 13, prob = 1/6)


[1] 3



rbinom() Function
This function generates n random values from a binomial distribution with the given number of trials and probability of success.

Syntax:

rbinom(n, N, p)

Example:

rbinom(8, size = 13, prob = 1 / 6)

hist(rbinom(8, size = 13, prob = 1 / 6))

Output:

> rbinom(8, size = 13, prob = 1/6)


[1] 1 1 2 1 4 0 2 3
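
The four functions are related to one another, which the following sketch illustrates. The seed value 42 and the probability 0.8 are arbitrary choices; because rbinom() is random, its output differs between runs unless a seed is set with set.seed().

```r
size <- 13
prob <- 1 / 6

# pbinom() is the running total of dbinom(): P(X <= 3) = P(0) + P(1) + P(2) + P(3).
cdf_3 <- sum(dbinom(0:3, size = size, prob = prob))
all.equal(cdf_3, pbinom(3, size = size, prob = prob))

# qbinom() inverts pbinom(): it returns the smallest k with P(X <= k) >= 0.8.
k <- qbinom(0.8, size = size, prob = prob)
k

# rbinom() draws random values; fixing the seed makes the draw reproducible.
set.seed(42)
first_draw <- rbinom(8, size = size, prob = prob)
set.seed(42)
second_draw <- rbinom(8, size = size, prob = prob)
identical(first_draw, second_draw)
```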



What is Poisson Distribution?
Poisson distribution is a probability distribution which is used to model the number of
events that occur in a fixed interval of time or space, given the average rate of
occurrence, assuming that the events happen independently and at a constant rate.
It deals with discrete random variables, meaning the number of events can only take on
non-negative integer values (0, 1, 2, 3,…). Each event is considered to be independent of
others and they are assumed to occur at a constant average rate (λ) over the given
interval.

Poisson Distribution Definition

Poisson distribution is a mathematical concept used to model the


probability of a given number of events occurring within a fixed
interval of time or space, provided that these events happen at a
constant average rate and are independent of the time since the last
event.

The Poisson distribution is a probability distribution that can be used to model events that
are rare and independent, and occur over a given time or space. Some properties and
applications of the Poisson distribution include:

Properties

The Poisson distribution has the following properties:

Events are independent

No two events can occur at the same time

The standard deviation is equal to the square root of the mean

When the mean is large, the Poisson distribution is approximately a normal distribution

Applications

The Poisson distribution can be used to model a variety of events, including:

The number of meteorites that strike Earth in a year

The number of laser photons hitting a detector in a given time interval

The number of students who achieve a low or high mark on an exam

The number of phone calls at a call center in a given time period

The number of jobs arriving at a print queue

The frequency of earthquakes in a specific region

The number of car accidents at a location
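
The property that the standard deviation equals the square root of the mean can be checked by simulation. This is only a sketch: the rate of 4 events per interval, the sample size and the seed are arbitrary assumptions.

```r
set.seed(1)                 # fixed seed so the simulation is reproducible
lambda <- 4                 # assumed average number of events per interval

# Simulate the event count in 100,000 independent intervals.
counts <- rpois(100000, lambda = lambda)

# For a Poisson distribution both the mean and the variance equal lambda,
# so the sample values should be close to 4.
sample_mean <- mean(counts)
sample_var  <- var(counts)
c(mean = sample_mean, variance = sample_var)

# Equivalently, the standard deviation is close to sqrt(mean).
c(sd = sd(counts), sqrt_mean = sqrt(sample_mean))
```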

Poisson Probability Mass Function – dpois()


This function gives the Poisson probability mass function and can be used to illustrate the
Poisson density in an R plot. The function dpois() calculates the probability that a random
variable takes exactly a given value.

Syntax: dpois(k, lambda, log)

where,
k: number of events that happened in an interval
lambda: mean number of events per interval
log: if TRUE, the function returns the probability in the form of a logarithm

dpois(2, 3)
dpois(6, 6)

Output:

[1] 0.2240418

[1] 0.1606231

Poisson Distribution – ppois()


This function gives the cumulative probability function and can be used to illustrate it in
an R plot. The function ppois() calculates the probability that a random variable will be
less than or equal to a given number.

Syntax: ppois(q, lambda, lower.tail, log.p)

where,
q: number of events that happened in an interval
lambda: mean number of events per interval
lower.tail: if TRUE, the left tail P(X <= q) is returned; if FALSE, the right tail P(X > q) is returned
log.p: if TRUE, the function returns the probability in the form of a logarithm

ppois(2, 3)
ppois(6, 6)

Output:

[1] 0.4231901
[1] 0.6063028

Poisson pseudorandom – rpois()


The function rpois() is used for generating random numbers from a given Poisson
distribution.

Syntax: rpois(n, lambda)

where,
n: number of random numbers needed
lambda: mean number of events per interval

rpois(2, 3)
rpois(6, 6)

Output:

[1] 2 3
[1] 6 7 6 10 9 4

Poisson Quantile Function – qpois()


The function qpois() is used for generating the quantiles of a given Poisson
distribution. In probability, quantiles are marked points that divide the graph of a
probability distribution into intervals which have equal probabilities.

Syntax: qpois(p, lambda, lower.tail, log.p)

where,
p: probability (or vector of probabilities) for which the quantile is required
lambda: mean number of events per interval
lower.tail: if TRUE, p is treated as a left-tail probability P(X <= k); if FALSE, as a right-tail probability P(X > k)
log.p: if TRUE, the probabilities p are given in the form of logarithms

y <- c(.01, .05, .1, .2)
qpois(y, 2)
qpois(y, 6)

Output:

[1] 0 0 0 1
[1] 1 2 3 4
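
The tail and logarithm options of the Poisson functions can be checked directly. This is a small sketch reusing the values 2 and 3 from the examples above; note that in R the logarithm argument is named log for dpois() and log.p for ppois() and qpois().

```r
# Left tail P(X <= 2) and right tail P(X > 2) for lambda = 3.
left  <- ppois(2, lambda = 3)
right <- ppois(2, lambda = 3, lower.tail = FALSE)

# The two tails together cover every possible count, so they add up to 1.
left + right

# With log = TRUE, dpois() returns the natural logarithm of the probability.
log_prob <- dpois(2, lambda = 3, log = TRUE)
all.equal(exp(log_prob), dpois(2, lambda = 3))

# ppois() uses the argument name log.p for the same purpose.
log_left <- ppois(2, lambda = 3, log.p = TRUE)
all.equal(exp(log_left), left)
```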

Continuous probability distribution


A Continuous Probability Distribution is a statistical concept that describes the probability
distribution of a continuous random variable. It specifies the probabilities associated with
various outcomes or values that the random variable can take within a specified range.

This section covers the definition, key formulas, and real-life applications of continuous
probability distributions.

What is Continuous Probability Distribution?

A continuous distribution is a statistical distribution wherein the possible values of the
random variable constitute a continuous range. This implies that the variable can be any
value within the specified range and not necessarily restricted to the discrete individual
values.

Continuous distributions are commonly defined by probability density functions (PDFs),
which describe the relative likelihood of the variable taking values near a given point
within its range. Probabilities are obtained as areas under the PDF.

Key Points
Continuous Random Variable: In a continuous probability distribution, the random
variable can take on any value within a specified interval or range. This means that
the variable can theoretically assume an infinite number of values within that range.

Probability Density Function (PDF): Continuous distributions are usually modeled
using a probability density function (PDF), which is a function that shows the relative
likelihood of the random variable taking different values across the domain of
interest. The area under the PDF curve over an interval gives the probability that the
random variable falls within that interval.

No Individual Probabilities: In contrast with a discrete probability distribution, which
assigns a probability to each individual value, a continuous probability distribution
assigns probabilities to intervals of values. Because a continuous random variable can
take infinitely many values, the probability of any single exact value is zero.

Examples: The normal, uniform, exponential, and beta distributions are well-known
examples of continuous probability distributions; this is not an exhaustive list. Each
distribution has its own PDF, which shows how the probability density is spread over the
values of the random variable.

Continuous Probability Distribution Formulas


Some fundamental formulas used in probability theory and statistics to characterize
continuous probability distributions and analyze their properties depending on the
distribution being considered are:

1. Probability Density Function (PDF)

The probability density function f(x) describes the probability distribution of a
continuous random variable X. It satisfies the following properties:

$f(x) \ge 0$ for all x in the range of X.

$\int_{-\infty}^{\infty} f(x)\,dx = 1$ (total area under the curve equals 1).

Depending on the specific distribution, the PDF formula varies.

For example:

Normal Distribution: $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

Uniform Distribution: $f(x) = \frac{1}{b-a}$ for $a \le x \le b$

Exponential Distribution: $f(x) = \lambda e^{-\lambda x}$ for $x \ge 0$

Beta Distribution: $f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}$ for $0 \le x \le 1$

2. Cumulative Distribution Function (CDF)

Cumulative distribution function F(x) gives the probability that a random variable X
takes on a value less than or equal to x.

It is the integral of the PDF up to a certain value of x: $F(x) = \int_{-\infty}^{x} f(t)\,dt$

CDF is often provided as part of the definition of specific distributions.

3. Mean (Expected Value)

Mean μ of a continuous random variable X is its average value, and it is given by:
$\mu = \int_{-\infty}^{\infty} x\,f(x)\,dx$

More generally, for a function g(X), the expected value is $E[g(X)] = \int_{-\infty}^{\infty} g(x)\,f(x)\,dx$.

4. Variance

Variance σ² of a continuous random variable X measures the spread of its distribution,
and it is given by:

$\sigma^2 = \int_{-\infty}^{\infty} (x-\mu)^2 f(x)\,dx$

Standard deviation σ is the square root of the variance.
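
These integrals can be evaluated numerically in R with integrate(). The sketch below uses an exponential distribution with rate 2 purely as an example (the choice of distribution and the variable names are assumptions made for illustration); its theoretical mean is 1/2 and its variance is 1/4.

```r
rate <- 2
f <- function(x) dexp(x, rate = rate)   # PDF: rate * exp(-rate * x) for x >= 0

# Total area under the PDF (should be close to 1)
total_area <- integrate(f, lower = 0, upper = Inf)$value

# Mean: integral of x * f(x)
mu <- integrate(function(x) x * f(x), lower = 0, upper = Inf)$value

# Variance: integral of (x - mu)^2 * f(x)
sigma2 <- integrate(function(x) (x - mu)^2 * f(x), lower = 0, upper = Inf)$value

c(area = total_area, mean = mu, variance = sigma2)
```

The integration limits start at 0 because the exponential PDF is zero for negative x.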

Normal Distribution: A Bell Curve of Probability

Definition:

The normal distribution, also known as the Gaussian distribution, is a probability
distribution that is symmetrical about the mean. It's often represented graphically as a
bell-shaped curve. This curve indicates that data points are more likely to be near the
mean and less likely to be far from it.

Properties:

1. Symmetry: The curve is symmetrical about the mean, meaning half the data points
lie to the left of the mean and half to the right.

2. Mean, Median, and Mode: In a normal distribution, the mean, median, and mode are
all equal.

3. Standard Deviation: The standard deviation determines the spread of the data. A
larger standard deviation indicates a wider spread, while a smaller standard deviation
indicates a narrower spread.

4. Empirical Rule: Approximately 68% of the data falls within one standard deviation of
the mean, 95% within two standard deviations, and 99.7% within three standard
deviations.
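
The empirical rule can be checked directly with pnorm(), the normal CDF available in base R. A minimal sketch for the standard normal distribution:

```r
k <- 1:3                              # number of standard deviations from the mean

# P(-k < Z < k) = F(k) - F(-k) for a standard normal Z
coverage <- pnorm(k) - pnorm(-k)

round(coverage, 4)
# → 0.6827 0.9545 0.9973
```

The same proportions hold for any mean and standard deviation, because the rule is stated in units of standard deviations.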

Applications:

The normal distribution is widely used in various fields due to its versatility and the
frequency with which it appears in natural phenomena. Here are some key applications:

Statistics:

Hypothesis testing

Confidence intervals

Regression analysis

Finance:

Stock price modeling

Risk assessment

Portfolio management

Engineering:

Quality control

Reliability analysis

Signal processing

Natural Sciences:

Physics

Chemistry

Biology

Social Sciences:

Psychology

Sociology

Economics

Why is it so widely used?

Central Limit Theorem: This theorem states that the distribution of sample means
approaches a normal distribution as the sample size increases, regardless of the
underlying population distribution. This makes the normal distribution a powerful tool
for statistical inference.

Real-world phenomena: Many natural measurements, such as human height, weight,
and IQ scores, tend to follow an approximately normal distribution.

R provides several functions to work with the normal distribution. Here are the key ones:
1. dnorm()

Calculates the probability density function (PDF) of the normal distribution at a
specific point.

Syntax: dnorm(x, mean = 0, sd = 1)

x : The value at which to evaluate the PDF.

mean : The mean of the distribution (default is 0).

sd : The standard deviation of the distribution (default is 1).

Example:

Code snippet
# Probability density at x = 1 for a standard normal distribution
dnorm(1)

2. pnorm()

Calculates the cumulative distribution function (CDF) of the normal distribution.

Syntax: pnorm(q, mean = 0, sd = 1, lower.tail = TRUE)

q : The value at which to evaluate the CDF.

mean : The mean of the distribution (default is 0).

sd : The standard deviation of the distribution (default is 1).

lower.tail : Logical argument indicating whether to calculate the probability below
q (default is TRUE).

Example:

Code snippet
# Probability of a value less than 1.96 in a standard normal distribution
pnorm(1.96)

3. qnorm()

Calculates the quantile function (inverse CDF) of the normal distribution.

Syntax: qnorm(p, mean = 0, sd = 1, lower.tail = TRUE)

p : The probability.

mean : The mean of the distribution (default is 0).

sd : The standard deviation of the distribution (default is 1).

lower.tail : Logical argument indicating whether p is the probability below the
quantile (default is TRUE).

Example:

Code snippet
# 95th percentile of a normal distribution with mean 100 and SD 15
qnorm(0.95, mean = 100, sd = 15)

4. rnorm()

Generates random numbers from a normal distribution.

Syntax: rnorm(n, mean = 0, sd = 1)

n : The number of random numbers to generate.

mean : The mean of the distribution (default is 0).

sd : The standard deviation of the distribution (default is 1).

Example:

Code snippet
# Generate 10 random numbers from a normal distribution with mean 50 and SD 10
rnorm(10, mean = 50, sd = 10)

By understanding and utilizing these functions, you can effectively work with normal
distributions in R for various statistical analyses and data visualizations.
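
The way pnorm() and qnorm() relate can be sketched as follows: qnorm() undoes pnorm(), and supplying mean and sd is equivalent to standardizing the value by hand. The numbers used are arbitrary illustrations.

```r
x <- c(-1.5, 0, 1.96)

p      <- pnorm(x)     # cumulative probabilities for each value
x_back <- qnorm(p)     # quantile function recovers the original values
all.equal(x, x_back)

# Using mean/sd arguments versus standardizing manually
direct       <- pnorm(115, mean = 100, sd = 15)
standardized <- pnorm((115 - 100) / 15)
all.equal(direct, standardized)
```
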
Exponential Distribution

Definition:

The exponential distribution is a continuous probability distribution that models the time
elapsed between events in a Poisson process. This means it's used to describe the time
between occurrences of events that happen at a constant average rate.

Properties:

1. Memorylessness: One of the key properties of the exponential distribution is the
memoryless property. This means that the probability of an event occurring in the
future is independent of how much time has already passed. In other words, the
distribution "forgets" its history.

2. Positive Support: The exponential distribution is defined only for positive values, as it
models time intervals.

3. Shape: The probability density function (PDF) of the exponential distribution is a
decreasing exponential curve.

4. Mean and Variance: The mean and variance of an exponential distribution with rate
parameter λ are:

Mean: 1/λ

Variance: 1/λ²
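
The memoryless property says that P(X > s + t | X > s) = P(X > t). A small sketch with pexp() illustrates this; the rate and the two time values are arbitrary choices, and the helper name surv is made up for this example.

```r
rate    <- 0.5
s       <- 2      # time already waited
t_extra <- 3      # additional waiting time

# Hypothetical helper: survival function P(X > x)
surv <- function(x) pexp(x, rate = rate, lower.tail = FALSE)

conditional   <- surv(s + t_extra) / surv(s)   # P(X > s + t | X > s)
unconditional <- surv(t_extra)                 # P(X > t)

all.equal(conditional, unconditional)

# Theoretical mean and variance for this rate
c(mean = 1 / rate, variance = 1 / rate^2)
```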

Applications:

The exponential distribution has a wide range of applications in various fields:

1. Reliability Engineering:

Modeling the time to failure of components or systems.

Analyzing the lifespan of products.

2. Queueing Theory:

Modeling the waiting time between arrivals of customers or jobs.

Analyzing the service time of customers or jobs.

3. Telecommunications:

Modeling the duration of phone calls.

Analyzing the time between network failures.

4. Physics:

Modeling the decay of radioactive particles.

Analyzing the arrival times of photons in a detector.

5. Finance:

Modeling the time between defaults of bonds or loans.

Analyzing the inter-arrival times of trades in a financial market.

6. Biology:

Modeling the time between cell divisions.

Analyzing the time to extinction of species.

In R:

R provides functions to work with the exponential distribution:

dexp(x, rate = 1): Probability density function.

pexp(q, rate = 1, lower.tail = TRUE): Cumulative distribution function.

qexp(p, rate = 1, lower.tail = TRUE): Quantile function.

rexp(n, rate = 1): Random number generation.

By understanding the properties and applications of the exponential distribution, you can
effectively model and analyze a variety of real-world phenomena.

Each of these functions is described in more detail below:

1. dexp(x, rate = 1)

Calculates the probability density function (PDF) of the exponential distribution at a
specific point x.

rate is the rate parameter of the distribution.

Example:

Code snippet
# Probability density at x = 2 for a rate parameter of 0.5
dexp(2, rate = 0.5)

2. pexp(q, rate = 1, lower.tail = TRUE)

Calculates the cumulative distribution function (CDF) of the exponential distribution.

q is the value at which to evaluate the CDF.

lower.tail is a logical argument indicating whether to calculate the probability below
q (default is TRUE).

Example:

Code snippet
# Probability of a value less than 3 for a rate parameter of 1
pexp(3, rate = 1)

3. qexp(p, rate = 1, lower.tail = TRUE)

Calculates the quantile function (inverse CDF) of the exponential distribution.

p is the probability.

lower.tail is a logical argument indicating whether p is the probability below the
quantile (default is TRUE).

Example:

Code snippet
# 90th percentile of an exponential distribution with rate parameter 2
qexp(0.9, rate = 2)

4. rexp(n, rate = 1)

Generates n random numbers from an exponential distribution with the specified
rate parameter.

Example:
Code snippet
# Generate 10 random numbers from an exponential distribution with rate parameter 1.5
rexp(10, rate = 1.5)

By understanding and utilizing these functions, you can effectively work with exponential
distributions in R for various statistical analyses and simulations.
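
The three deterministic examples above can be cross-checked against the closed-form expressions for the exponential distribution: f(x) = λe^(-λx), F(q) = 1 - e^(-λq), and the quantile -ln(1 - p)/λ. This is only a sketch; the variable names are illustrative.

```r
# Density at x = 2 with rate 0.5
d_builtin <- dexp(2, rate = 0.5)
d_formula <- 0.5 * exp(-0.5 * 2)

# CDF at q = 3 with rate 1
p_builtin <- pexp(3, rate = 1)
p_formula <- 1 - exp(-1 * 3)

# 90th percentile with rate 2
q_builtin <- qexp(0.9, rate = 2)
q_formula <- -log(1 - 0.9) / 2

all.equal(d_builtin, d_formula)
all.equal(p_builtin, p_formula)
all.equal(q_builtin, q_formula)
```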

Working with probability distributions in R

Generating Random Samples from Discrete and Continuous Distributions in R


R provides a rich set of functions to generate random samples from various probability
distributions. Here's a breakdown for discrete and continuous distributions:
Discrete Distributions:

Binomial Distribution:
Code snippet
# Generate 10 random numbers from a binomial distribution with 10 trials and
# probability of success 0.5
rbinom(n = 10, size = 10, prob = 0.5)

Poisson Distribution:
Code snippet
# Generate 5 random numbers from a Poisson distribution with a rate parameter of 2
rpois(n = 5, lambda = 2)

Geometric Distribution:

Code snippet
# Generate 20 random numbers from a geometric distribution with probability of success 0.3
rgeom(n = 20, prob = 0.3)

Negative Binomial Distribution:

Code snippet
# Generate 15 random numbers from a negative binomial distribution with 5 successes and
# probability of success 0.2
rnbinom(n = 15, size = 5, prob = 0.2)

Hypergeometric Distribution:

Code snippet
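# Generate 10 random numbers from a hypergeometric distribution with a population size of 20,
# number of successes in the population of 10, and sample size of 5.
# rhyper() calls its first argument nn (how many numbers to generate); m and n are the
# counts of success and failure items in the population, and k is the sample size.
rhyper(nn = 10, m = 10, n = 10, k = 5)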

Continuous Distributions:

Normal Distribution:

Code snippet
# Generate 20 random numbers from a normal distribution with mean 10 and standard deviation 2
rnorm(n = 20, mean = 10, sd = 2)

Uniform Distribution:

Code snippet
# Generate 15 random numbers from a uniform distribution between 0 and 1
runif(n = 15, min = 0, max = 1)

Exponential Distribution:
Code snippet
# Generate 10 random numbers from an exponential distribution with rate parameter 0.5
rexp(n = 10, rate = 0.5)

Gamma Distribution:

Code snippet
# Generate 8 random numbers from a gamma distribution with shape parameter 2 and rate parameter 1
rgamma(n = 8, shape = 2, rate = 1)

Beta Distribution:
Code snippet
# Generate 12 random numbers from a beta distribution with shape parameters 2 and 3
rbeta(n = 12, shape1 = 2, shape2 = 3)

Key Points:

The r prefix in the function names indicates "random."

The arguments n , mean , sd , lambda , prob , etc., specify the parameters of the
distribution.

You can adjust the parameters to generate random samples with different
characteristics.

To visualize the generated samples, you can use histograms, density plots, or other
visualization techniques.

By understanding these functions and their parameters, you can effectively generate
random samples from various probability distributions in R to conduct simulations,
statistical analysis, and data modeling.
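
Because these functions return different values on every call, simulations are usually made reproducible with set.seed(). The following is a minimal sketch; the seed value 42 and the sample sizes are arbitrary choices.

```r
# Same seed, same call: identical random numbers
set.seed(42)
a <- rnorm(5, mean = 10, sd = 2)
set.seed(42)
b <- rnorm(5, mean = 10, sd = 2)
identical(a, b)

# With a large sample, the sample mean should be close to the
# theoretical mean of the binomial distribution, size * prob = 5
set.seed(42)
big_sample <- rbinom(n = 10000, size = 10, prob = 0.5)
mean(big_sample)
```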

Estimate distribution parameters from data


Description
This function takes a numeric vector as input and attempts to find the best-fitting
parameters for the distribution of choice. Right now only the norm (normal) and mnorm
(mixture normal) distributions are supported. Note that estimate_distr() is not part of
base R; it comes from an add-on package.

Usage

estimate_distr(data, distr, init = NULL, args_list = NULL)

Arguments
data (Numeric Vector). A vector of numbers used to estimate the parameters of the
distributional forms.
distr (String). The distribution to be fitted. Right now only norm or mnorm is supported.
init (List). Initialization parameters for each distribution. For mixtures, each named
element in the list should be a vector with length equal to the number of components.
args_list (List). Named list of additional arguments passed on to fitdist and normalmixEM.
... Other parameters passed to fitdistrplus or normalmixEM.

Details
The package fitdistrplus is used to estimate parameters of the normal distribution, while
the function normalmixEM (from the mixtools package) is used to estimate parameters of the
mixture normal distribution. So far we suggest only estimating two components for the
mixture normal distribution. For default options, we mostly use the defaults of the
packages themselves. The only difference is the mixture normal distribution, where the
convergence criteria were loosened and more iterations are allowed before stopping.

Value
A named list with all the parameter names and values.

Estimating distribution parameters:

Method of moments:

Calculate sample moments (mean, variance, etc.) and equate them to theoretical
moments of the distribution.

Solve the resulting equations for the parameters.

Maximum likelihood estimation (MLE):

Find the parameters that maximize the likelihood function of the data.

Use optimization functions like optim() or nlm() to find the MLEs.

Bayesian estimation:

Combine prior information about the parameters with the data to obtain posterior
distribution.

Use MCMC methods like Gibbs sampling or Metropolis-Hastings to sample from
the posterior.
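
The first two approaches can be sketched on simulated exponential data. This is an illustration under stated assumptions, not a general recipe: the data are simulated with a known rate of 2, and because there is only one parameter the sketch uses base R's optimize() rather than the optim() or nlm() functions mentioned above.

```r
set.seed(1)
x <- rexp(500, rate = 2)    # simulated data; the true rate is 2

# Method of moments: the theoretical mean is 1 / rate,
# so equate it to the sample mean and solve for rate
rate_mom <- 1 / mean(x)

# Maximum likelihood: minimize the negative log-likelihood
neg_loglik <- function(rate) -sum(dexp(x, rate = rate, log = TRUE))
fit <- optimize(neg_loglik, interval = c(0.01, 20))
rate_mle <- fit$minimum     # location of the minimum

c(mom = rate_mom, mle = rate_mle)
```

For the exponential distribution the two estimates coincide (both equal 1 / mean(x)), so the small difference seen here reflects only the numerical tolerance of the optimizer.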

How to Calculate Percentiles in R?


This section shows how to calculate percentiles in the R programming language.

A percentile is a measure of position: the p-th percentile is the value below which
about p percent of the data lies. In R, we can use the quantile() function to get the job
done.

Syntax: quantile(data, probs)

Parameters:
data: data whose percentiles are to be calculated
probs: percentile value(s), given as probabilities between 0 and 1

Example 1: Calculate percentile


To calculate the percentile we simply pass the data and the value of the required
percentile.

R
x<-c(2,13,5,36,12,50)

res<-quantile(x,probs=0.5)

res

Output:

50%
12.5

Example 2: Calculate percentiles of vector


We can calculate multiple percentiles at once. For that, we have to pass the vector of
percentiles instead of a single value to probs parameter.

R
x<-c(2,13,5,36,12,50)

res<-quantile(x,probs=c(0.5,0.75))

res

Output:

50% 75%
12.50 30.25
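
The value 30.25 is not one of the data points because quantile() interpolates between neighbouring sorted values. A sketch of the default method (type 7) worked by hand follows; the variable names are illustrative, and the manual formula as written assumes the position does not land on the last element.

```r
x <- c(2, 13, 5, 36, 12, 50)
p <- 0.75

xs <- sort(x)                 # 2 5 12 13 36 50
n  <- length(xs)

h  <- (n - 1) * p + 1         # fractional position in the sorted data: 4.75
lo <- floor(h)                # index of the value just below that position

# Linear interpolation between the 4th and 5th sorted values
manual <- xs[lo] + (h - lo) * (xs[lo + 1] - xs[lo])

manual
unname(quantile(x, probs = p))
```

Both lines give 30.25, that is, 13 + 0.75 × (36 − 13).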

Example 4: Calculate percentile in dataframe


Sometimes percentiles are needed for a dataframe column. In that case the process
remains the same: pass the column (for example df$x) in place of data, along with the
percentile values to be calculated.

R
df<-data.frame(x=c(2,13,5,36,12,50),

y=c('a','b','c','c','c','b'))

res<-quantile(df$x,probs=c(0.35,0.7))

res

Output:

35% 70%
10.25 24.50

Example 5: Quantiles of several and all columns


We can also find percentiles of several dataframe columns at once. This can also be
applied to find the percentiles of all numeric columns of a dataframe. For this we use the
apply() function, passing it the dataframe with just the numeric columns, the margin 2
(meaning the function is applied to each column), and the quantile function that has to
be applied on all columns.

Syntax: apply(dataframe, MARGIN, function)

R
df<-data.frame(x=c(2,13,5,36,12,50),

y=c('a','b','c','c','c','b'),

z=c(2.1,6,3.8,4.8,2.2,1.1))

sub_df<-df[,c('x','z')]

res<-apply(sub_df, 2, function(x) quantile(x,probs=0.5))

res

Output:

x z
12.5 3.0

Example 6: Calculate Quantiles by group


We can also group values together and find the percentile with respect to each group.
For this, we use the group_by() function from dplyr, and then within summarize() we apply
the quantile function.

R
library(dplyr)

df<-data.frame(x=c(2,13,5,36,12,50),

y=c('a','b','c','c','c','b'))

df %>% group_by(y) %>%

summarize(res=quantile(x,probs=0.5))

Output:

A tibble: 3 x 2
y res
<chr> <dbl>
a 2
b 31.5
c 12

Example 7: Visualizing percentiles


Visualizing percentiles can make them easier to understand.

R
df<-data.frame(x=c(2,13,5,36,12,50),

y=c('a','b','c','c','c','b'),

z=c(2.1,6,3.8,4.8,2.2,1.1))

n<-length(df$x)

# x-axis: percentile rank (from 0 to 1) of each sorted value of df$x
plot((0:(n-1))/(n-1), sort(df$x), type='h',

xlab = "Percentile",

ylab = "Value")

Output:

Probabilities using R

Probability theory is a fundamental part of mathematics and statistics that plays a
crucial role in various fields such as finance, engineering, medicine, and more.
Understanding probabilities allows us to make informed decisions in uncertain situations.
This section covers the basics of probabilities using the R programming language.

Basic Concepts of Probability in R

Probability is the measure of the likelihood that an event will occur. The probability of
an event A, denoted as P(A), lies between 0 and 1, where 0 indicates impossibility and 1
indicates certainty. Some key concepts include:

Sample Space (S): The set of all possible outcomes of a random experiment.

Event: Any subset of the sample space.

Probability of an Event: The likelihood of occurrence of an event. When all outcomes
are equally likely, it is calculated as the ratio of favorable outcomes to the total
number of outcomes.

Calculating Probabilities in R
R offers various functions and packages for calculating probabilities and performing
statistical analyses. Some commonly used functions include:

dbinom(): Computes the probability mass function (PMF) for the binomial distribution.

pnorm(): Calculates the cumulative distribution function (CDF) for the normal
distribution.

dpois(): Computes the PMF for the Poisson distribution.

punif(): Calculates the CDF for the uniform distribution.

Here is a basic example of calculating a probability in R:

R
# Define the sample space
sample_space <- c(1, 2, 3, 4, 5, 6)

# Define an event, for example, rolling an even number
event <- c(2, 4, 6)

# Calculate the probability of the event
probability <- length(event) / length(sample_space)

print(probability)

Output:

[1] 0.5

Probability Distributions in R
R provides extensive support for probability distributions, which are mathematical
functions that describe the likelihood of different outcomes in a random experiment.
Common probability distributions include:

Uniform Distribution: All outcomes are equally likely.

Normal Distribution: Symmetric bell-shaped curve, characterized by mean (μ) and
standard deviation (σ).

Binomial Distribution: Describes the number of successes in a fixed number of
independent Bernoulli trials.

Poisson Distribution: Models the number of events occurring in a fixed interval of
time or space.

Let’s visualize the normal distribution with a mean of 0 and standard deviation of 1.
R
library(ggplot2)

# Generate a sequence of x values
x <- seq(-4, 4, length.out = 100)

# Calculate the corresponding densities for normal distribution
y <- dnorm(x, mean = 0, sd = 1)

# Create a data frame
df <- data.frame(x, y)

# Plot the normal distribution
ggplot(df, aes(x = x, y = y)) +
  geom_line(color = "blue") +
  labs(title = "Normal Distribution", x = "x", y = "Density") +
  theme_minimal()

Output:

[Figure: line plot titled "Normal Distribution", with x on the horizontal axis and Density on the vertical axis]

Simulating Probabilistic Experiments in R


Simulation is a powerful tool for understanding probabilities through empirical
experiments. R facilitates simulation by allowing the generation of random numbers from
different probability distributions. Key functions for simulation include:

runif(): Generates random numbers from a uniform distribution.

rnorm(): Generates random numbers from a normal distribution.

rbinom(): Generates random numbers from a binomial distribution.

rpois(): Generates random numbers from a Poisson distribution.

R
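# Simulating coin flips with a binomial distribution
num_flips <- 1000
# Each draw with size = 1 is a single flip: 1 = heads, 0 = tails
num_heads <- sum(rbinom(num_flips, size = 1, prob = 0.5))
probability_heads <- num_heads / num_flips
# The printed value changes from run to run because the flips are random
print(probability_heads)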

Output:

[1] 0.494
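
The same idea applies to the die example from earlier: the probability of rolling an even number can be estimated by simulation and compared with the theoretical value of 0.5. This is a sketch; the seed and the number of rolls are arbitrary choices.

```r
set.seed(123)                                     # for reproducibility

# Simulate 10,000 rolls of a fair six-sided die
rolls <- sample(1:6, size = 10000, replace = TRUE)

# Proportion of rolls that are even
estimated   <- mean(rolls %% 2 == 0)
theoretical <- 3 / 6

c(estimated = estimated, theoretical = theoretical)
```

With more rolls the estimate tends to move closer to the theoretical probability.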

Visualizing Probabilities in R
Visualization is essential for gaining insight into probability distributions. R offers
base graphics as well as packages such as ggplot2 and lattice for creating visualizations.
Common plots include histograms, density plots, boxplots, and scatter plots, which help in
understanding the shape and characteristics of probability distributions.

R
# Visualizing the binomial distribution of coin flips
# Each of the 1000 values is the number of heads in 10 flips
flips <- rbinom(1000, size = 10, prob = 0.5)
hist(flips, breaks = seq(-0.5, 10.5, by = 1), col = "lightgreen",
     main = "Binomial Distribution of Coin Flips", xlab = "Number of Heads",
     ylab = "Frequency")

Output:

Conclusion
Calculating probabilities is a core part of statistics and data analysis, enabling us to
quantify uncertainty and make informed decisions. By mastering probabilities using R,
you gain powerful tools for analyzing data, conducting simulations, and drawing
meaningful insights from problems involving uncertainty and randomness.

Visualizing Probability Distributions in R


Visualizing probability distributions is a powerful tool for understanding the underlying
patterns and characteristics of data. R provides a variety of functions and packages to
create informative and visually appealing plots.

Basic Plots
1. Histogram:

Displays the distribution of a numerical variable by grouping values into bins.

Useful for understanding the shape, center, and spread of the data.

Code snippet
hist(x, breaks = "Sturges", main = "Histogram of X", xlab = "X")

2. Density Plot:

Smooths the histogram to estimate the probability density function.

Provides a more continuous representation of the data.

Code snippet
plot(density(x), main = "Density Plot of X", xlab = "X")

Advanced Plots
1. QQ-Plot (Quantile-Quantile Plot):

Compares the quantiles of the data to the quantiles of a theoretical distribution.

Used to assess whether the data follows a specific distribution (e.g., normal,
exponential).

Code snippet
qqnorm(x)
qqline(x)

2. Box Plot:

Visualizes the distribution of data through quartiles, median, and outliers.

Provides a summary of the data's central tendency and variability.

Code snippet
boxplot(x, main = "Boxplot of X", ylab = "X")

3. Cumulative Distribution Function (CDF) Plot:

Shows the cumulative probability of a random variable being less than or equal to a
certain value.

Code snippet
plot(ecdf(x), main = "ECDF of X", xlab = "X", ylab = "Cumulative Probability")
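
The snippets in this section assume that a numeric vector x already exists. A minimal self-contained sketch is given here, using simulated normal data (the mean of 50, standard deviation of 10, and sample size are arbitrary assumptions) and comparing the empirical results with the theoretical distribution.

```r
set.seed(10)
x <- rnorm(200, mean = 50, sd = 10)    # simulated data for illustration

# Histogram on the density scale with the theoretical PDF drawn on top
hist(x, freq = FALSE, main = "Histogram of X with theoretical density", xlab = "X")
curve(dnorm(x, mean = 50, sd = 10), add = TRUE, col = "red")

# Empirical CDF compared with the theoretical CDF at the observed points
Fn <- ecdf(x)
max_gap <- max(abs(Fn(x) - pnorm(x, mean = 50, sd = 10)))
max_gap    # a small value suggests the sample follows the assumed distribution
```
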

Using ggplot2 for Enhanced Visualization


The ggplot2 package offers a more flexible and customizable approach to creating
visualizations:

Code snippet
library(ggplot2)

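# Histogram with density plot
# (after_stat() and linewidth assume a recent ggplot2 release, 3.4.0 or later;
# older releases used ..density.. and size instead)
ggplot(data.frame(x), aes(x = x)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "lightblue", color = "black") +
  geom_density(color = "red", linewidth = 1) +
  labs(title = "Histogram and Density Plot of X", x = "X", y = "Density")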

# QQ-Plot
ggplot(data.frame(x), aes(sample = x)) +
stat_qq() +
stat_qq_line() +
labs(title = "QQ-Plot of X")

# Box Plot
ggplot(data.frame(x), aes(y = x)) +
geom_boxplot() +
labs(title = "Boxplot of X", y = "X")

Key Considerations:

Data Cleaning and Preparation: Check the data for missing values and outliers, and
handle them deliberately before plotting.

Choice of Plot: Select the appropriate plot type based on the data type (continuous
or discrete) and the desired insights.

Customization: Use R's plotting functions and ggplot2 to customize the appearance
of plots (colors, labels, themes).

Interpretation: Carefully interpret the visualizations to draw meaningful conclusions.

By effectively visualizing probability distributions, you can gain a deeper understanding
of data, identify patterns, and make informed decisions.

(Unit 3) Basic inferential statistics

Sampling and sampling distributions

What is sampling?
A sample is a subset of individuals from a larger population. Sampling means selecting
the group that you will actually collect data from in your research. For example, if you are
researching the opinions of students in your university, you could survey a sample of 100
students.
In statistics, sampling allows you to test a hypothesis about the characteristics of a
population.

Sampling Distribution
Sampling distributions are essential for inferential statistics. A sampling distribution
represents the distribution of a statistic, like the mean or standard deviation, which is
calculated from multiple samples of a population. It shows how these statistics vary
across different samples drawn from the same population.

This section discusses the sampling distribution in detail, along with related
terminology and examples.

What is Sampling Distribution?


Sampling distribution is also known as a finite-sample distribution. Sampling
distribution is the probability distribution of a statistic based on random samples of a
given population. It describes how spread apart the values of the statistic will be
across samples from a specific population.
Since a population is usually too large to analyze in full, you can select smaller groups
and repeatedly sample and analyze them. The gathered data, or statistic, is used to
calculate the likely occurrence, or probability, of an event.

Important Terminologies in Sampling Distribution


Some important terminologies related to sampling distribution are given below:

Statistic: A numerical summary of a sample, such as mean, median, standard
deviation, etc.

Parameter: A numerical summary of a population, often estimated using sample
statistics.

Sample: A subset of individuals or observations selected from a population.

Population: Entire group of individuals or observations that a study aims to describe
or draw conclusions about.

Sampling Distribution: Distribution of a statistic (e.g., mean, standard deviation)
across multiple samples taken from the same population.

Central Limit Theorem (CLT): A fundamental theorem in statistics stating that the
sampling distribution of the sample mean tends to be approximately normal as the
sample size increases, regardless of the shape of the population distribution.

Standard Error: Standard deviation of a sampling distribution, representing the
variability of sample statistics around the population parameter.

Bias: Systematic error in estimation or inference, leading to a deviation of the
estimated statistic from the true population parameter.

Confidence Interval: A range of values calculated from sample data that is likely to
contain the population parameter with a certain level of confidence.

Sampling Method: Technique used to select a sample from a population, such as
simple random sampling, stratified sampling, cluster sampling, etc.

Inferential Statistics: Statistical methods and techniques used to draw conclusions
or make inferences about a population based on sample data.

Hypothesis Testing: A statistical method for making decisions or drawing
conclusions about a population parameter based on sample data and assumptions
about the population.

Understanding Sampling Distribution


Sampling distribution of a statistic is the distribution of all possible values taken by the
statistic when all possible samples of a fixed size n are taken from the population. Each
random sample that is selected may give a different value of the statistic being studied.
In other words, the sampling distribution of a statistic is the probability distribution of
that statistic.
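
A sampling distribution can be approximated by simulation: draw many samples, compute the statistic for each, and look at how those values are spread. The sketch below does this for the sample mean. The population (10,000 uniform values), the sample size, and the number of repetitions are all arbitrary assumptions chosen for illustration.

```r
set.seed(2024)

# Hypothetical population of 10,000 values
population <- runif(10000, min = 0, max = 100)

n    <- 25      # size of each sample
reps <- 2000    # number of samples drawn

# One sample mean per repetition
sample_means <- replicate(reps, mean(sample(population, n)))

# Centre and spread of the simulated sampling distribution,
# next to the population mean and sd(population) / sqrt(n)
c(mean_of_means = mean(sample_means), population_mean = mean(population))
c(sd_of_means = sd(sample_means), sigma_over_sqrt_n = sd(population) / sqrt(n))
```

A histogram of sample_means, for example hist(sample_means), shows the shape of the simulated sampling distribution.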

Factors Influencing Sampling Distribution


A sampling distribution's variability can be measured either by its standard deviation
(called the standard error; for the sample mean, the standard error of the mean) or by
its variance. Which one to choose depends on the context and the inferences you want
to draw. Both measure the spread of the statistic in relation to its mean.
The three main factors influencing the variability of a sampling distribution are:

1. Number Observed in a Population: The symbol for this variable is "N." It is the
number of observations in the whole population.

2. Number Observed in Sample: The symbol for this variable is "n." It is the number of
observations in a random sample drawn from the larger population.

3. Method of Choosing Sample: How you choose the samples can account for variability
in some cases.

Central Limit Theorem[CLT]


The Central Limit Theorem is one of the most important theorems in statistics.

According to the central limit theorem, if X1, X2, ..., Xn is a random sample of size n taken
from a population with mean µ and finite variance σ², then the sampling distribution of the
sample mean tends to a normal distribution with mean µ and variance σ²/n as the sample
size becomes large.

This result indicates that as the sample size increases, the spread of the sample means
around the population mean decreases: the standard deviation of the sample means, σ/√n,
shrinks in inverse proportion to the square root of the sample size. The standardized
variate Z is

Z = (x̄ - μ)/(σ/√n)

where,

Z is the z-score

x̄ is the sample mean being standardized (for an individual data point x, use Z = (x - μ)/σ)

μ is the population mean

σ is the population standard deviation

n is the sample size

This formula quantifies how many standard errors the sample mean is away from the
population mean. Positive z-scores indicate values above the mean, while negative
z-scores indicate values below the mean. For large n, Z approximately follows the normal
distribution with mean 0 and variance 1, that is, the standard normal distribution.

As a common rule of thumb, the normal approximation for the sample mean is treated as
adequate when the sample size is large (n > 30), although strongly skewed populations
may need larger samples.
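The theorem can be checked informally by simulation. The sketch below is a minimal illustration under assumed settings: the exponential population, the sample size of 40, the number of repetitions, and the seed are all arbitrary choices, not values from this document.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

n <- 40       # size of each sample
reps <- 5000  # number of repeated samples

# Exponential(rate = 1) population: mean 1, standard deviation 1, right-skewed
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

# Compare simulated values with the CLT predictions mu and sigma / sqrt(n)
clt_check <- c(mean_simulated = mean(sample_means),
               mean_theory    = 1,
               sd_simulated   = sd(sample_means),
               sd_theory      = 1 / sqrt(n))
print(round(clt_check, 3))

# Histogram of the sample means with the limiting normal curve overlaid
hist(sample_means, breaks = 40, freq = FALSE,
     main = "Sampling distribution of the mean (n = 40)",
     xlab = "Sample mean")
curve(dnorm(x, mean = 1, sd = 1 / sqrt(n)), add = TRUE, lwd = 2)
```

Even though the population is skewed, the histogram of sample means should look roughly bell-shaped and centered near the population mean.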

Examples on Sampling Distribution


Example 1: The mean and standard deviation of the tax value of all vehicles
registered in a certain state are μ = $13,525 and σ = $4,180. Suppose
random samples of size 100 are drawn from the population of vehicles.
What are the mean μx̄ and standard deviation σx̄ of the sample mean x̄?
Solution:

Since n = 100, the formulas yield
μx̄ = μ = $13,525
σx̄ = σ/√n = $4,180/√100 = $418

Example 2: A prototype automotive tire has a design life of 38,500
miles with a standard deviation of 2,500 miles. Five such tires are
manufactured and tested. On the assumption that the actual population
mean is 38,500 miles and the actual population standard deviation is
2,500 miles, find the probability that the sample mean will be less than
36,000 miles. Assume that the distribution of lifetimes of such tires is
normal.
Solution:

Here, we work in units of thousands of miles. The sample mean x̄ has
Mean: μx̄ = μ = 38.5
Standard Deviation: σx̄ = σ/√n = 2.5/√5 = 1.11803
Since the population is normally distributed, so is x̄, hence
P(x̄ < 36) = P(Z < (36 - μx̄)/σx̄)
P(x̄ < 36) = P(Z < (36 - 38.5)/1.11803)
P(x̄ < 36) = P(Z < -2.24)
P(x̄ < 36) = 0.0125
Therefore, if the tires perform as designed then there is only about a 1.25%
chance that the average of a sample of this size would be so low.
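Both examples can be reproduced in R. The sketch below uses only the numbers given in the examples; the variable names are illustrative. Because pnorm() works with the unrounded z-value, its result may differ slightly in the later decimals from the table lookup at z = -2.24.

```r
# Example 1: mean and standard error of the sample mean
mu1 <- 13525
sigma1 <- 4180
n1 <- 100
se1 <- sigma1 / sqrt(n1)
print(c(mean = mu1, se = se1))

# Example 2: P(sample mean < 36), in thousands of miles, normal population
mu2 <- 38.5
sigma2 <- 2.5
n2 <- 5
se2 <- sigma2 / sqrt(n2)
p_low <- pnorm(36, mean = mu2, sd = se2)
print(round(p_low, 4))
```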

Calculate Standard Error in R

Using sd() function with length() function

Here we use the sd() function, which calculates the sample standard deviation, and
the length() function, which gives the total number of observations.

Syntax: sd(data)/sqrt(length(data))

Example: R program to calculate a standard error from a set of 10 values in a vector
R
# consider a vector with 10 elements
a <- c(179, 160, 136, 227, 123, 23,
       45, 67, 1, 234)

# calculate standard error
print(sd(a)/sqrt(length(a)))

Output:

[1] 26.20274

Confidence Intervals

1. Definition and Purpose of Confidence Intervals
A confidence interval (CI) is a range of values derived from sample data that estimates
an unknown population parameter with a given level of confidence.



Purpose: Confidence intervals provide a range that likely includes the true population
parameter (like a mean or proportion) rather than a single point estimate. This
approach accounts for sample variability and helps quantify the uncertainty in an
estimate.

Common Confidence Levels: Common confidence levels include 90%, 95%, and
99%. For example, a 95% CI means that if we were to take multiple samples and
calculate the CI for each, about 95% of those intervals would contain the true
population parameter.

Interpretation: A 95% confidence interval of (a, b) for a population mean implies that we
are 95% confident the true mean lies between a and b. Note that this does not mean
there's a 95% probability that the true mean is within the interval—it means that if
repeated samples were taken, 95% of those intervals would capture the true mean.

2. Calculation of Confidence Intervals for Population Mean

For calculating confidence intervals for the population mean, we use the t-distribution
when the sample size is small (n < 30) or when the population standard deviation is
unknown.
Formula:
\[
\text{CI} = \bar{x} \pm t_{\alpha/2, df} \times \frac{s}{\sqrt{n}}
\]
Where:

\(\bar{x}\): Sample mean

\(t_{\alpha/2, df}\): t-value for the desired confidence level and degrees of freedom
(df = n - 1)

\(s\): Sample standard deviation

\(n\): Sample size

Example Calculation
Suppose we have a sample of 20 observations with a sample mean (\(\bar{x}\)) of 50
and a standard deviation (s) of 10. For a 95% CI:

1. Find \(t\)-value: Use the t-distribution table or R’s qt() function.

2. Calculate the Margin of Error: \(ME = t_{\alpha/2} \times \frac{s}{\sqrt{n}}\)

3. Determine the Interval: \((\bar{x} - ME, \bar{x} + ME)\)
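The three steps can be carried out in R from the summary statistics of the example (n = 20, sample mean 50, standard deviation 10). This is a minimal sketch; the variable names are illustrative.

```r
n <- 20
x_bar <- 50
s <- 10
conf_level <- 0.95

# Step 1: critical t-value for the chosen confidence level
alpha <- 1 - conf_level
t_crit <- qt(1 - alpha / 2, df = n - 1)

# Step 2: margin of error
me <- t_crit * s / sqrt(n)

# Step 3: confidence interval
ci <- c(lower = x_bar - me, upper = x_bar + me)

print(round(c(t_crit = t_crit, margin = me), 3))
print(round(ci, 2))
```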

3. R Functions for Calculating Confidence Intervals for Population Mean

t.test() : Computes a CI directly from sample data.



# Example
data <- c(48, 52, 47, 53, 49, 51) # Sample data
t.test(data, conf.level = 0.95)

qt() : Gets the critical t-value based on confidence level and degrees of freedom.

# Example for 95% CI with df = 19


qt(0.975, df = 19)

Manual Calculation: You can also calculate CIs manually in R using the formula.

4. Calculation of Confidence Intervals for Population Proportion


To estimate the CI for a population proportion, we use the Z-distribution, especially
when the sample size is large (np and n(1 - p) ≥ 10).
Formula:
\[
\text{CI} = \hat{p} \pm Z_{\alpha/2} \times \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}
\]
Where:

\(\hat{p}\): Sample proportion

\(Z_{\alpha/2}\): Z-score for the desired confidence level

\(n\): Sample size

Example Calculation
Suppose 60 out of 100 people in a survey favor a policy. Here, \(\hat{p} = 0.60\) and \(n =
100\). For a 95% CI:

1. Find Z-score: For 95% confidence, \(Z = 1.96\).

2. Calculate Margin of Error: \(ME = Z_{\alpha/2} \times \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}\)

3. Determine the Interval: \((\hat{p} - ME, \hat{p} + ME)\)
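The same steps in R, using the survey numbers from the example (60 successes out of 100). This is a minimal sketch of the formula-based interval; the variable names are illustrative.

```r
successes <- 60
n <- 100
p_hat <- successes / n

# Step 1: critical z-value for 95% confidence
z_crit <- qnorm(0.975)

# Step 2: margin of error
me <- z_crit * sqrt(p_hat * (1 - p_hat) / n)

# Step 3: confidence interval
ci <- c(lower = p_hat - me, upper = p_hat + me)
print(round(ci, 3))
```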

5. R Functions for Calculating Confidence Intervals for Population Proportion

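prop.test() : Computes a CI for a proportion based on sample data. By default it uses a score-based (Wilson) interval with continuity correction, so its limits differ slightly from the formula above.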

# Example with 60 successes out of 100 trials


prop.test(60, 100, conf.level = 0.95)



Manual Calculation: You can calculate CIs for proportions manually in R using the
formula.

Visualizing Confidence Intervals


Visual diagrams can help make confidence intervals more intuitive. Here are a few ideas
for visualization:

1. Bell Curve with Shaded Confidence Region: A normal distribution curve showing the
confidence interval range.

2. Confidence Interval Around Sample Mean: A plot with a series of confidence
intervals from repeated sampling to illustrate that not every interval contains the true
population mean.
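The repeated-sampling idea can be sketched with a short simulation. The population mean and standard deviation, the sample size, the number of repetitions, and the seed below are all assumed values chosen for illustration.

```r
set.seed(1)  # arbitrary seed for reproducibility

true_mean <- 50  # assumed population mean
true_sd <- 10    # assumed population standard deviation
n <- 30
reps <- 1000

# For each repeated sample, compute the 95% t-interval and check coverage
covers <- replicate(reps, {
  x <- rnorm(n, mean = true_mean, sd = true_sd)
  ci <- t.test(x, conf.level = 0.95)$conf.int
  ci[1] <= true_mean && true_mean <= ci[2]
})

# Proportion of intervals that contain the true mean
coverage <- mean(covers)
print(coverage)
```

The proportion printed should be close to 0.95, which is what the confidence level describes: the long-run share of intervals that capture the true mean.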

Here is a visualization of a confidence interval for a population mean on a normal
distribution curve:

The blue shaded area represents the 95% confidence interval. Under repeated
sampling, about 95% of intervals constructed this way would contain the true mean.

The red dashed lines mark the lower and upper bounds of this confidence interval.

The blue dashed line at the center represents the sample mean.


Hypothesis Testing



What is Hypothesis Testing?
A hypothesis is an assumption or idea, specifically a statistical claim about an unknown
population parameter. For example, a judge assumes a person is innocent and tests
this assumption by reviewing evidence and hearing testimony before reaching a verdict.
Hypothesis testing is a statistical method that is used to make a statistical decision
using experimental data. It evaluates two mutually exclusive statements about a
population to determine which statement is better supported by the sample data.
To test the validity of the claim or assumption about the population parameter:

A sample is drawn from the population and analyzed.

The results of the analysis are used to decide whether the evidence is strong enough to reject the claim.

Example: You claim that the average height in the class is 30, or that boys are
taller than girls. Each of these is an assumption, and we need a statistical
procedure to check whether the data support it.

This structured approach to hypothesis testing in data science, machine learning, and
statistics is crucial for making informed decisions based on data.

By employing hypothesis testing in data analytics and other fields, practitioners can
rigorously evaluate their assumptions and derive meaningful insights from their
analyses.

Understanding hypothesis generation and testing is also essential for effectively
implementing statistical hypothesis testing in various applications.

Defining Hypotheses
Null hypothesis (H0): In statistics, the null hypothesis is a general statement or
default position that there is no relationship between two measured cases or no
difference among groups. In other words, it is the baseline assumption made from
knowledge of the problem. Example: A company's mean production is 50 units per day,
i.e. H0: μ = 50.

Alternative hypothesis (H1): The alternative hypothesis is the hypothesis used in
hypothesis testing that is contrary to the null hypothesis. Example: A company's
mean production is not equal to 50 units per day, i.e. H1: μ ≠ 50.



Why do we use Hypothesis Testing?
Hypothesis testing is an important procedure in statistics. Hypothesis testing evaluates
two mutually exclusive population statements to determine which statement is most
supported by sample data. When we say that findings are statistically significant,
we are reporting the outcome of a hypothesis test.
Understanding hypothesis testing in statistics is essential for data scientists and
machine learning practitioners, as it provides a structured framework for statistical
hypothesis generation and testing. This methodology can also be applied in hypothesis
testing in Python, enabling data analysts to perform robust statistical analyses
efficiently. By employing techniques such as multiple hypothesis testing in machine
learning, researchers can ensure more reliable results and avoid potential pitfalls
associated with drawing

One-Tailed and Two-Tailed Test


A one-tailed test focuses on one direction, either greater than or less than a specified
value. We use a one-tailed test when there is a clear directional expectation based on
prior knowledge or theory. The critical region is located on only one side of the
distribution curve. If the test statistic falls into this critical region, the null
hypothesis is rejected in favor of the alternative hypothesis.

One-Tailed Test
There are two types of one-tailed test:

Left-Tailed (Left-Sided) Test: The alternative hypothesis asserts that the true
parameter value is less than the value stated in the null hypothesis. Example: H0: μ ≥ 50 and H1: μ < 50

Right-Tailed (Right-Sided) Test: The alternative hypothesis asserts that the true
parameter value is greater than the value stated in the null hypothesis. Example: H0: μ ≤ 50 and H1: μ > 50

Two-Tailed Test
A two-tailed test considers both directions, greater than and less than a specified
value. We use a two-tailed test when there is no specific directional expectation and we
want to detect any significant difference.
Example: H0: μ = 50 and H1: μ ≠ 50
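In R, the direction of a test is set with the `alternative` argument of t.test(). The sketch below tests H0: μ = 50 against each alternative; the production figures are hypothetical values made up for illustration.

```r
# Hypothetical daily production figures (made up for illustration)
production <- c(47, 49, 52, 46, 48, 51, 45, 50, 47, 49)

# H0: mu = 50 against each form of the alternative hypothesis
p_left  <- t.test(production, mu = 50, alternative = "less")$p.value
p_right <- t.test(production, mu = 50, alternative = "greater")$p.value
p_two   <- t.test(production, mu = 50, alternative = "two.sided")$p.value

print(c(left = p_left, right = p_right, two_sided = p_two))
```

For the t-test, the two one-tailed p-values add up to 1, and the two-tailed p-value is twice the smaller of them, which is why a directional test should be chosen before looking at the data.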



What are Type 1 and Type 2 errors in Hypothesis Testing?
In hypothesis testing, Type I and Type II errors are two possible errors that researchers
can make when drawing conclusions about a population based on a sample of data.
These errors are associated with the decisions made regarding the null hypothesis and
the alternative hypothesis.

Type I error: Rejecting the null hypothesis when it is actually true. The probability
of a Type I error is denoted by alpha (α).

Type II error: Failing to reject the null hypothesis when it is actually false. The
probability of a Type II error is denoted by beta (β).

Decision                                   Null Hypothesis is True         Null Hypothesis is False

Null Hypothesis is True (Accept)           Correct Decision                Type II Error (False Negative)

Alternative Hypothesis is True (Reject)    Type I Error (False Positive)   Correct Decision
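The meaning of α as the Type I error rate can be illustrated by simulation: when the null hypothesis is true, a test at the 5% level should reject in roughly 5% of repeated experiments. The population values, sample size, number of repetitions, and seed below are assumptions chosen for illustration.

```r
set.seed(7)  # arbitrary seed for reproducibility
alpha <- 0.05
reps <- 2000

# H0 is true in every replication: data come from a population with mean 50
p_values <- replicate(reps, t.test(rnorm(25, mean = 50, sd = 5), mu = 50)$p.value)

# Proportion of false rejections, which should be close to alpha
type1_rate <- mean(p_values < alpha)
print(type1_rate)
```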

Chi-Square Test
The Chi-Square Test for Independence is used for categorical (non-normally distributed) data:

\[
\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
\]
where,

\(O_{ij}\) is the observed frequency in cell ij

i, j are the row and column indices respectively.

\(E_{ij}\) is the expected frequency in cell ij, calculated as:

\[
E_{ij} = \frac{\text{Row total}_i \times \text{Column total}_j}{\text{Total observations}}
\]

T-Statistics
The t-test is used when the population standard deviation is unknown, typically with
small samples (n < 30). The t-statistic is given by:

\[
t = \frac{\bar{x} - \mu}{s/\sqrt{n}}
\]

where,

t = t-score,

x̄ = sample mean

μ = population mean,

s = standard deviation of the sample,

n = sample size
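The formula can be checked against R's built-in t.test(). The sketch below uses a hypothetical sample and a hypothesized mean of 50, both made up for illustration.

```r
x <- c(47, 49, 52, 46, 48, 51, 45, 50, 47, 49)  # hypothetical sample
mu0 <- 50                                        # hypothesized population mean

# t-statistic from the formula: (sample mean - mu) / (s / sqrt(n))
t_manual <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))

# Two-sided p-value from the t-distribution with n - 1 degrees of freedom
p_manual <- 2 * pt(-abs(t_manual), df = length(x) - 1)

# The same test using the built-in function
t_builtin <- t.test(x, mu = mu0)

print(c(t_manual = t_manual, t_builtin = unname(t_builtin$statistic)))
print(c(p_manual = p_manual, p_builtin = t_builtin$p.value))
```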

How to Perform a Chi-Square Goodness of Fit Test in R


The Chi-Square Goodness of Fit Test is a statistical test used to analyze the difference
between the observed and expected frequency distributions of categorical data.
This test is widely used in domains such as science, biology, and business. In
this section, we will see how to perform the chi-square test in the R Programming
Language.

What is the Chi-Square Goodness of Fit Test?

The chi-square goodness of fit test measures whether the observed frequencies differ
significantly from the expected frequencies, under the null hypothesis that there is no
difference between them. The test statistic is calculated as:
\[
\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}
\]
where,

\(\chi^2\) is the Chi-Square statistic

\(O_i\) is the observed frequency for each category

\(E_i\) is the expected frequency for each category

\(\sum\) denotes the sum over all categories.

Calculating the Chi-square goodness of fit test manually in R


We can calculate the chi-square test since we know the mathematical formula for it. In
this example, we will create a fictional dataset comparing the frequencies of
transportation modes of cities.
R
# Create a fictional dataset
city <- c("City A", "City A", "City A", "City B", "City B", "City B")
transport_mode <- c("Car", "Public Transit", "Bicycle", "Car", "Public Transit",
                    "Bicycle")
observed <- c(40, 30, 20, 35, 25, 15)  # Observed frequencies
expected <- c(35, 30, 20, 40, 25, 15)  # Expected frequencies

# Calculate Chi-Square statistic manually
chi_sq_statistic <- sum((observed - expected)^2 / expected)
df <- length(observed) - 1
p_value <- 1 - pchisq(chi_sq_statistic, df)

# Print results
print(paste("Chi-Square Statistic:", chi_sq_statistic))
print(paste("Degrees of Freedom:", df))
print(paste("P-value:", p_value))

Output:

[1] "Chi-Square Statistic: 1.33928571428571"

[1] "Degrees of Freedom: 5"

[1] "P-value: 0.930837766731732"

Chi-Square Statistic: The value here is 1.34, which measures the discrepancy between
the observed frequencies and the frequencies expected under the null hypothesis. The
value is small, so the observed and expected counts are close.

Degrees of Freedom: This is the number of independent pieces of information available
for estimation. For a goodness of fit test it equals the number of categories minus 1.
Here, the 6 city and transport-mode combinations are treated as 6 categories, so df = 5.

P-value: A high p-value (0.93 here) suggests that the observed frequencies are
consistent with the expected frequencies, and we fail to reject the null hypothesis.
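The manual result can be cross-checked with the built-in chisq.test() function. Note that its `p` argument expects expected proportions that sum to 1, so the expected counts are divided by their total. This is a sketch using the same fictional counts as above.

```r
observed <- c(40, 30, 20, 35, 25, 15)
expected <- c(35, 30, 20, 40, 25, 15)

# chisq.test() takes expected proportions (summing to 1), not expected counts
gof <- chisq.test(x = observed, p = expected / sum(expected))
print(gof)
```

The statistic, degrees of freedom, and p-value reported by chisq.test() should agree with the manual calculation.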

We can also plot the values to see the difference between them. The plot uses the
"dplyr" and "ggplot2" packages in R.
R
# Install the packages once if needed
# install.packages(c("dplyr", "ggplot2"))
library(dplyr)
library(ggplot2)

# Create a data frame for plotting (city and mode together identify a category)
data_plot <- data.frame(Transportation_Mode = paste(city, transport_mode, sep = ": "),
                        Observed = observed,
                        Expected = expected)

# Calculate deviations between observed and expected frequencies
data_plot <- data_plot %>%
  mutate(deviation = Observed - Expected)

# Plot observed and expected frequencies with deviations
ggplot(data_plot, aes(x = Transportation_Mode, y = Observed, fill = "Observed")) +
  geom_bar(stat = "identity", position = "dodge", width = 0.5) +
  geom_bar(aes(y = Expected, fill = "Expected"), stat = "identity", position = "dodge",
           width = 0.5, alpha = 0.5) +
  geom_errorbar(aes(ymin = pmin(Observed, Expected), ymax = pmax(Observed, Expected),
                    color = "Deviation"),
                width = 0.2, position = position_dodge(width = 0.5)) +
  labs(title = "Observed vs. Expected Frequencies of Transportation Modes",
       y = "Frequency",
       fill = "") +
  scale_fill_manual(values = c("Observed" = "blue", "Expected" = "green"),
                    name = "Category") +
  scale_color_manual(values = "red",
                     name = "Deviation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Output:

Chi-Square Goodness of Fit Test in R

Calculating Chi-square test on an airline dataset using chisq.test() function
In this example, we will use another method to calculate the chi-square test in R. For this
example, we will use an external dataset from the Kaggle website.
Dataset: Flight Price Prediction
Make sure you replace the path below with the actual path of the file on your system.

R
# Load dataset (use forward slashes, or doubled backslashes, in the path)
data <- read.csv('path/to/your/file.csv')

# Create a contingency table
cont_table <- table(data$airline, data$class)

# Perform Chi-Square test
chi_sq_result <- chisq.test(cont_table)

# Print the results
print(chi_sq_result)

Output:



Pearson's Chi-squared test

data: cont_table
X-squared = 60493, df = 5, p-value < 2.2e-16

We can also plot these values with the help of the ggplot2 library in R
R
library(ggplot2)

# Extract observed and expected frequencies cell by cell (column-major order)
observed <- as.vector(cont_table)
expected <- as.vector(chi_sq_result$expected)

# Create a data frame for plotting, one row per cell and type
plot_data <- data.frame(
  Category = rep(rownames(cont_table), times = 2 * ncol(cont_table)),
  Class = rep(rep(colnames(cont_table), each = nrow(cont_table)), times = 2),
  Frequency = c(observed, expected),
  Type = rep(c("Observed", "Expected"), each = length(observed))
)

# Plot the frequencies, one panel per class
ggplot(plot_data, aes(x = Category, y = Frequency, fill = Type)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.9), width = 0.7) +
  facet_wrap(~ Class) +
  labs(title = "Observed vs. Expected Frequencies",
       y = "Frequency",
       fill = "Type") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Output:



Chi-Square Goodness of Fit Test in R

As we saw, the chi-square test shows a large discrepancy, and the graph also shows a wide
difference between the expected and observed frequencies.

Calculating Chi-square test using vcd package


We can use 'vcd' package available in R to calculate the chi-square test and other
statistical values that can help us understanding the dataset better. Here, we are creating
a fictional dataset on different age groups of people using different brands of smart
phones.
R
# Load necessary packages
library(vcd)

# Set the seed for reproducibility
set.seed(123)

# Define age groups and smartphone brands
age_groups <- c("Teenager", "Adult", "Senior")
smartphone_brands <- c("Samsung", "Apple", "Xiaomi", "Huawei", "Google")

# Generate a fictional dataset
n <- 1000  # Number of observations
age_sample <- sample(age_groups, n, replace = TRUE)
smartphone_sample <- sample(smartphone_brands, n, replace = TRUE)

# Convert to factors
age_sample <- factor(age_sample, levels = age_groups)
smartphone_sample <- factor(smartphone_sample, levels = smartphone_brands)

# Create a contingency table
cont_table <- table(age_sample, smartphone_sample)

# Perform Chi-Square test using assocstats()
chi_sq_result <- assocstats(cont_table)

# Print the result
print(chi_sq_result)

Output:

                    X^2 df P(> X^2)
Likelihood Ratio 10.856  8  0.20998
Pearson          10.961  8  0.20394

Phi-Coefficient   : NA
Contingency Coeff.: 0.104
Cramer's V        : 0.074

Calculating Chi-Square test using the prop.test() function



In this example we will create a fictional dataset of a drug test and use prop.test()
function to get our values.
R
# Treatment outcomes data
success_new_drug <- 45
failure_new_drug <- 15
success_standard_drug <- 30
failure_standard_drug <- 30

# Create the 2x2 table: one row per group, columns = successes and failures
cont_table_2x2 <- matrix(c(success_new_drug, failure_new_drug,
                           success_standard_drug, failure_standard_drug),
                         nrow = 2, byrow = TRUE)

# Perform Chi-Square test using prop.test() for proportions
chi_sq_result_2x2 <- prop.test(cont_table_2x2)

# Print the result
print(chi_sq_result_2x2)

Output:

2-sample test for equality of proportions with continuity correction

data: cont_table_2x2
X-squared = 6.9689, df = 1, p-value = 0.008294
alternative hypothesis: two.sided
95 percent confidence interval:
0.06596955 0.43403045
sample estimates:
prop 1 prop 2
0.75 0.50

We created a 2x2 table where the rows represent the two groups (new drug treatment vs.
standard drug treatment) and the columns represent the treatment outcomes (success or
failure).
We used the prop.test() function to perform a Chi-Square test for equality of proportions
on this 2x2 table. The small p-value (0.008) indicates that the success rates of the two
treatments differ significantly at the 5% level.
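For a 2x2 table, prop.test() and chisq.test() compute the same chi-square statistic, since both apply a continuity correction by default. The sketch below shows this with the drug data from the example; the dimension names are added only for readability.

```r
# Rows are groups, columns are successes and failures
tab <- matrix(c(45, 15,
                30, 30),
              nrow = 2, byrow = TRUE,
              dimnames = list(Group = c("New drug", "Standard drug"),
                              Outcome = c("Success", "Failure")))

prop_result <- prop.test(tab)   # continuity correction is on by default
chi_result  <- chisq.test(tab)  # Yates correction is on by default for 2x2 tables

print(c(prop_test  = unname(prop_result$statistic),
        chisq_test = unname(chi_result$statistic)))
```

The difference between the two functions lies in what they report: prop.test() adds the estimated proportions and a confidence interval for their difference.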

Chi-Square Test in R

The chi-square test of independence evaluates whether there is an association between
the categories of two variables. There are basically two types of random variables,
and they yield two types of data: numerical and categorical. In R, the chi-square
statistic is used to investigate whether distributions of categorical variables differ
from one another. The chi-square test is also useful for comparing the tallies or counts
of categorical responses between two (or more) independent groups.

In R Programming Language, the function used for performing a chi-square test
is chisq.test().

Syntax: chisq.test(data)
Parameters: data is a table containing the count values of the variables.

We will take the survey data in the MASS library which represents the data from a survey
conducted on students.

R
# load the MASS package

library(MASS)

print(str(survey))

Output:

'data.frame': 237 obs. of 12 variables:


$ Sex : Factor w/ 2 levels "Female","Male": 1 2 2 2 2 1 2 1 2 2
...
$ Wr.Hnd: num 18.5 19.5 18 18.8 20 18 17.7 17 20 18.5 ...
$ NW.Hnd: num 18 20.5 13.3 18.9 20 17.7 17.7 17.3 19.5 18.5 ...
$ W.Hnd : Factor w/ 2 levels "Left","Right": 2 1 2 2 2 2 2 2 2 2
...
$ Fold : Factor w/ 3 levels "L on R","Neither",..: 3 3 1 3 2 1 1
3 3 3 ...
$ Pulse : int 92 104 87 NA 35 64 83 74 72 90 ...
$ Clap : Factor w/ 3 levels "Left","Neither",..: 1 1 2 2 3 3 3 3
3 3 ...
$ Exer : Factor w/ 3 levels "Freq","None",..: 3 2 2 2 3 3 1 1 3 3
...
$ Smoke : Factor w/ 4 levels "Heavy","Never",..: 2 4 3 2 2 2 2 2 2
2 ...
$ Height: num 173 178 NA 160 165 ...
$ M.I : Factor w/ 2 levels "Imperial","Metric": 2 1 NA 2 2 1 1 2
2 2 ...
$ Age : num 18.2 17.6 16.9 20.3 23.7 ...
NULL



The above result shows that the dataset has many Factor variables, which can be treated
as categorical variables. For our test, we will consider the variables "Exer" and
"Smoke". The Smoke column records the students' smoking habits, while the Exer column
records their exercise level. Our aim is to test the hypothesis that the students'
smoking habit is independent of their exercise level at the .05 significance level.

R
# Create a contingency table with the needed variables.

stu_data = table(survey$Smoke,survey$Exer)

print(stu_data)

Output:

Freq None Some


Heavy 7 1 3
Never 87 18 84
Occas 12 3 4
Regul 9 1 7

And finally we apply the chisq.test() function to the contingency table stu_data.

R
# applying chisq.test() function

print(chisq.test(stu_data))

Output:

Pearson's Chi-squared test

data: stu_data
X-squared = 5.4885, df = 6, p-value = 0.4828

As the p-value 0.4828 is greater than .05, we fail to reject the null hypothesis: the
data provide no evidence that smoking habit is associated with the exercise level of the
students. Note that several cells of the table have small counts, so the chi-square
approximation may be unreliable here. The complete R code is given below.
In summary, a Chi-square test is easy to perform in R.
One can perform this task using chisq.test() function in R.
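The statistic reported above can be reproduced directly from the formula for expected frequencies. The sketch below retypes the contingency table from the printed output, so it runs without the MASS package; the variable names are illustrative.

```r
# Contingency table copied from the printed output (Smoke by Exer)
counts <- matrix(c( 7,  1,  3,
                   87, 18, 84,
                   12,  3,  4,
                    9,  1,  7),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(Smoke = c("Heavy", "Never", "Occas", "Regul"),
                                 Exer  = c("Freq", "None", "Some")))

# Expected count for each cell: row total * column total / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Chi-square statistic, degrees of freedom, and p-value
chi_sq <- sum((counts - expected)^2 / expected)
df <- (nrow(counts) - 1) * (ncol(counts) - 1)
p_value <- pchisq(chi_sq, df, lower.tail = FALSE)

print(round(expected, 2))
print(c(chi_sq = chi_sq, df = df, p_value = p_value))
```

Printing the expected counts also shows which cells fall below 5, which is the usual reason to treat the chi-square approximation with caution.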

Visualize the Chi-Square Test data


R
# Load required library



library(MASS)

# Print structure of the survey dataset

print(str(survey))

# Create a contingency table for the smoking and exercise columns

stu_data <- table(survey$Smoke, survey$Exer)

# Print the table

print(stu_data)

# Perform the Chi-Square Test

chi_result <- chisq.test(stu_data)

print(chi_result)

# Visualize the data with a bar plot (one color per smoking category)

bar_colors <- c("lightblue", "lightgreen", "lightpink", "lightyellow")

barplot(stu_data, beside = TRUE, col = bar_colors,
        main = "Smoking Habits vs Exercise Levels",
        xlab = "Exercise Level", ylab = "Number of Students")

# Add legend separately

legend("center", legend = rownames(stu_data), fill = bar_colors)

Output:

Chi-Square Test in R
In this code we use the MASS library to conduct a Chi-Square Test on the 'survey' dataset,
focusing on the relationship between smoking habits and exercise levels.
It creates a contingency table, performs the statistical test, and visualizes the data using
a bar plot. The legend is added separately at the center of the plot, distinguishing the
smoking habits by color.
The code aims to explore and communicate the association between smoking behavior
and exercise practices within the dataset.

Introduction to Linear Regression



Linear regression is a statistical method used to model the relationship between a
dependent (response) variable and one or more independent (predictor) variables by
fitting a linear equation to the observed data.

1. Definition and Purpose of Linear Regression


Linear Regression aims to find a linear relationship between variables. For example,
predicting a person’s height based on their age.

Purpose: Helps to predict outcomes, understand relationships, and evaluate the
strength and nature of associations between variables.

2. Simple Linear Regression Model: Assumptions and Parameters


In Simple Linear Regression, the model has only one predictor variable and one
response variable, expressed as:
\[
y = \beta_0 + \beta_1 x + \epsilon
\]
Where:

\(y\): Response variable

\(x\): Predictor variable

\(\beta_0\): Intercept (value of \(y\) when \(x = 0\))

\(\beta_1\): Slope (change in \(y\) for a one-unit increase in \(x\))

\(\epsilon\): Error term, representing random noise

Assumptions of Simple Linear Regression

1. Linearity: The relationship between \(x\) and \(y\) is linear.

2. Independence: Observations are independent of each other.

3. Homoscedasticity: Constant variance of residuals (errors) across values of \(x\).

4. Normality: Residuals should be normally distributed.

3. Estimation of Parameters Using the Least-Squares Method


The Least-Squares Method finds the line that minimizes the sum of squared differences
(errors) between observed values and predicted values. This method calculates the
values of \(\beta_0\) and \(\beta_1\) that minimize:

\[
\text{Sum of Squares Error (SSE)} = \sum (y_i - \hat{y}_i)^2
\]

Where \(y_i\) is the observed value, and \(\hat{y}_i\) is the predicted value from the
regression line.
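For simple linear regression, the minimizing values have a well-known closed form: the slope is the sum of cross-deviations divided by the sum of squared deviations of x, and the intercept follows from the means. The sketch below applies it to the height and weight sample that appears later in this document and compares the result with lm(); the names b0 and b1 are illustrative.

```r
# Height (x) and weight (y) values from the sample data used in this document
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Closed-form least-squares estimates for simple linear regression
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
print(c(intercept = b0, slope = b1))

# lm() should return the same coefficients
fit <- lm(y ~ x)
print(coef(fit))
```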

4. Interpretation of Regression Coefficients and R-squared


Regression Coefficient \(\beta_1\) (Slope): Indicates the average change in \(y\) for
each one-unit increase in \(x\).

Intercept \(\beta_0\): Represents the predicted value of \(y\) when \(x = 0\) (often not
meaningful unless \(x = 0\) is within the data range).

Coefficient of Determination (R-squared): Measures the proportion of variance in
the response variable that is predictable from the predictor variable. R-squared
ranges from 0 to 1, where:

0 indicates that the model does not explain any variability in \(y\).

1 indicates that the model explains all the variability in \(y\).
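R-squared can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below does this for the height and weight sample used later in this document and compares it with the value reported by summary(); the variable names are illustrative.

```r
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

fit <- lm(y ~ x)

sse <- sum(residuals(fit)^2)   # variation left unexplained by the model
sst <- sum((y - mean(y))^2)    # total variation in the response
r_squared <- 1 - sse / sst

print(c(manual = r_squared, from_summary = summary(fit)$r.squared))
```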

Regression analysis is a very widely used statistical tool to establish a relationship
model between two variables. One of these variables is called the predictor variable,
whose value is gathered through experiments. The other variable is called the response
variable, whose value is predicted from the predictor variable.
In Linear Regression these two variables are related through an equation in which the
exponent (power) of both variables is 1. Mathematically, a linear relationship
represents a straight line when plotted as a graph. A non-linear relationship, where the
exponent of any variable is not equal to 1, creates a curve.
The general mathematical equation for a linear regression is −

y = ax + b

Following is the description of the parameters used −

y is the response variable.

x is the predictor variable.

a and b are constants which are called the coefficients: a is the slope and b is the intercept.

Steps to Establish a Regression


A simple example of regression is predicting weight of a person when his height is
known. To do this we need to have the relationship between height and weight of a
person.

The steps to create the relationship are −

Carry out the experiment of gathering a sample of observed values of height and
corresponding weight.

Create a relationship model using the lm() function in R.

Find the coefficients from the model created and write the mathematical
equation using them.

Get a summary of the relationship model to see the typical error in prediction,
measured through the residuals.

To predict the weight of new persons, use the predict() function in R.

Input Data
Below is the sample data representing the observations −

# Values of height
151, 174, 138, 186, 128, 136, 179, 163, 152, 131

# Values of weight.
63, 81, 56, 91, 47, 57, 76, 72, 62, 48

lm() Function
This function creates the relationship model between the predictor and the response
variable.

Syntax
The basic syntax for lm() function in linear regression is −

lm(formula,data)

Following is the description of the parameters used −

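formula is a symbolic description of the relation between x and y, written as y ~ x.

data is the data frame containing the variables named in the formula. It can be omitted when the variables already exist in the workspace.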

Create Relationship Model & get the Coefficients



x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.


relation <- lm(y~x)

print(relation)

When we execute the above code, it produces the following result −

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept) x
-38.4551 0.6746

Get the Summary of the Relationship


x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.


relation <- lm(y~x)

print(summary(relation))

When we execute the above code, it produces the following result −

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-6.3002 -1.6629  0.0412  1.8944  3.9775

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -38.45509    8.04901  -4.778  0.00139 **
x             0.67461    0.05191  12.997 1.16e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.253 on 8 degrees of freedom
Multiple R-squared: 0.9548, Adjusted R-squared: 0.9491
F-statistic: 168.9 on 1 and 8 DF, p-value: 1.164e-06
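
The numbers in this summary can also be pulled out of the summary object for further use. The following is a small sketch using the same data; the variable names s and res are illustrative.

```r
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y ~ x)

s <- summary(relation)

# Residuals: observed weight minus fitted weight for each person.
res <- residuals(relation)
print(round(res, 2))

# Residual standard error and R-squared, as shown in the printed summary.
print(s$sigma)
print(s$r.squared)

# Coefficient table (estimate, standard error, t value, p-value) as a matrix.
print(s$coefficients)
```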


predict() Function
Syntax
The basic syntax for predict() in linear regression is −

predict(object, newdata)

Following is the description of the parameters used −

object is the model which is already created using the lm() function.

newdata is a data frame containing the new values of the predictor variable. Its column name must match the predictor name used in the formula.

Predict the weight of new persons


# The predictor vector.


x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)

# The response vector.


y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.


relation <- lm(y~x)

# Find weight of a person with height 170.


a <- data.frame(x = 170)
result <- predict(relation,a)
print(result)

When we execute the above code, it produces the following result −

1
76.22869
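
A point prediction says nothing about its uncertainty. As a hedged sketch, predict() can also return a prediction interval for several new heights at once; the heights 160, 170 and 180 and the name new_heights are illustrative choices.

```r
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y ~ x)

# The column must be named x, matching the predictor in the formula.
new_heights <- data.frame(x = c(160, 170, 180))

# interval = "prediction" adds lower and upper bounds for a new individual.
pred <- predict(relation, newdata = new_heights,
                interval = "prediction", level = 0.95)
print(pred)
```

The result is a matrix with columns fit, lwr and upr; the fit for a height of 170 matches the single prediction above.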

Visualize the Regression Graphically


# Create the predictor and response variable.


x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y~x)

# Give the chart file a name.


png(file = "linearregression.png")

# Plot the chart.


plot(y, x, col = "blue", main = "Height & Weight Regression",
     cex = 1.3, pch = 16, xlab = "Weight in Kg", ylab = "Height in cm")

# Add the fitted line. Weight is on the horizontal axis in this plot,
# so the line comes from regressing height (x) on weight (y).
abline(lm(x ~ y))

# Save the file.
dev.off()

When we execute the above code, it saves a scatter plot of the observations with the fitted regression line to linearregression.png.

(Unit 4) Data visualization in R

Data visualization techniques in R

Histograms in R language
A histogram displays the distribution of a numerical variable. The data points are
grouped into successive numerical intervals (bins), and each bin is drawn as a rectangle
whose width spans the interval and whose area is proportional to the frequency of the
values in it. It resembles a vertical bar graph, except that there are no gaps between
the bars.

R – Histograms
We can create histograms in R Programming Language using the hist() function.
Syntax: hist(v, main, xlab, xlim, ylim, breaks, col, border)

Parameters:

v: This parameter contains numerical values used in histogram.

main: This parameter main is the title of the chart.

col: This parameter is used to set color of the bars.

xlab: This parameter is the label for horizontal axis.

border: This parameter is used to set border color of each bar.

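xlim: This parameter sets the range of values shown on the x-axis.

ylim: This parameter sets the range of values shown on the y-axis.

breaks: This parameter controls how the data are split into bins. A single number is treated as a suggested number of bins, and a vector gives the exact bin boundaries.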

Creating a simple Histogram in R


The following example creates a simple histogram using the parameters above. The vector v
is plotted using hist().
Example:

R
# Create data for the graph.
v <- c(19, 23, 11, 5, 16, 21, 32, 14, 19, 27, 39)

# Create the histogram.
hist(v, xlab = "No.of Articles ",
     col = "green", border = "black")

Output:

Histograms in R language

Range of X and Y values


To set the range of values shown on each axis:

1. Take all the parameters required to make a histogram chart.

2. Add the xlim and ylim parameters to set the ranges of the X-axis and Y-axis.

Example

R
# Create data for the graph.
v <- c(19, 23, 11, 5, 16, 21, 32, 14, 19, 27, 39)

# Create the histogram.
hist(v, xlab = "No.of Articles", col = "green",
     border = "black", xlim = c(0, 50),
     ylim = c(0, 5), breaks = 5)

Output:

Histograms in R language

Using histogram return values for labels using text()


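hist() returns an object describing the histogram, including the bin midpoints (mids) and the counts. These return values can be passed to text() to label each bar with its count.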
R
# Creating data for the graph.
v <- c(19, 23, 11, 5, 16, 21, 32, 14, 19, 27, 39, 120, 40, 70, 90)

# Creating the histogram.
m <- hist(v, xlab = "Weight", ylab = "Frequency",
          col = "darkmagenta", border = "pink",
          breaks = 5)

# Setting labels
text(m$mids, m$counts, labels = m$counts,
     adj = c(0.5, -0.5))

Output:

Histograms in R language
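
To see what hist() returns without drawing anything, the object can be inspected directly. This is a small sketch on the same vector; plot = FALSE suppresses the chart, and the name h is illustrative.

```r
v <- c(19, 23, 11, 5, 16, 21, 32, 14, 19, 27, 39, 120, 40, 70, 90)

# breaks = 5 is only a suggestion; R picks "pretty" boundaries near it.
h <- hist(v, breaks = 5, plot = FALSE)

print(h$breaks)  # bin boundaries actually used
print(h$counts)  # number of values falling in each bin
print(h$mids)    # midpoint of each bin, used as the label x-coordinate
```

Because the midpoints and counts line up one to one, text(h$mids, h$counts, ...) places each label above its own bar.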

Histogram using non-uniform width


Passing a vector of unequally spaced boundaries to breaks creates a histogram with
non-uniform bin widths. In this case hist() draws densities rather than counts by default, so that bar areas remain comparable.

Example
R
# Creating data for the graph.
v <- c(19, 23, 11, 5, 16, 21, 32, 14, 19, 27, 39, 120, 40, 70, 90)

# Creating the histogram. With unequal breaks, hist() draws densities
# rather than counts, so the y-axis is labelled accordingly.
hist(v, xlab = "Weight", ylab = "Density",
     xlim = c(50, 100),
     col = "darkmagenta", border = "pink",
     breaks = c(5, 55, 60, 70, 75, 80, 100, 140))

Output:
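
With unequal bin widths, bar height stops meaning "count". As a sketch, the returned object shows that hist() switches to densities, where each bar's area (height times width) is the share of observations in that bin. The names brks and h are illustrative.

```r
v <- c(19, 23, 11, 5, 16, 21, 32, 14, 19, 27, 39, 120, 40, 70, 90)
brks <- c(5, 55, 60, 70, 75, 80, 100, 140)

h <- hist(v, breaks = brks, plot = FALSE)

print(h$equidist)  # FALSE: the bins do not all have the same width
print(h$counts)    # raw counts per bin are still available
print(h$density)   # heights that hist() would draw by default here

# Bar areas (density * width) add up to 1 across all bins.
print(sum(h$density * diff(h$breaks)))
```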

Bar charts are a popular and effective way to visually represent categorical data in a
structured manner. R stands out as a powerful programming language for data analysis
and visualization. This section looks at how to make visually appealing bar charts in
R.

Bar Charts using R


A bar chart, also known as a bar graph, presents categorical data with rectangular bars
whose heights or lengths are proportional to the values they represent. The dataset
supplies one numerical value per category, and that value sets the height or length of
the bar.

R uses the barplot() function to create bar charts. Both vertical and horizontal bars
can be drawn.

Syntax:

barplot(H, xlab, ylab, main, names.arg, col)

Parameters:

H: This parameter is a vector or matrix containing numeric values which are used in
bar chart.

xlab: This parameter is the label for x axis in bar chart.

ylab: This parameter is the label for y axis in bar chart.

main: This parameter is the title of the bar chart.

names.arg: This parameter is a vector of names appearing under each bar in bar
chart.

col: This parameter is used to give colors to the bars in the graph.

Creating a Simple Bar Chart in R


In order to create a Bar Chart:

1. A vector (H <- c(Values…)) is taken which contains the numeric values to be used.

2. This vector H is plotted using barplot().

R
# Create the data for the chart

A <- c(17, 32, 8, 53, 1)

# Plot the bar chart

barplot(A, xlab = "X-axis", ylab = "Y-axis", main ="Bar-Chart")

Output:

R – Bar Charts
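
In practice the bar heights are often counts of a categorical variable rather than hand-typed numbers. As a sketch under that assumption, table() can tally the categories and its result can be passed straight to barplot(). The built-in mtcars dataset and the labels used here are illustrative choices.

```r
# Count how many cars have 4, 6 and 8 cylinders.
cyl_counts <- table(mtcars$cyl)
print(cyl_counts)

# The table's names become the labels under the bars.
# barplot() returns the x-coordinates of the bar midpoints.
bar_mid <- barplot(cyl_counts, xlab = "Cylinders", ylab = "Number of cars",
                   main = "Cars by cylinder count", col = "steelblue")
print(bar_mid)
```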

Creating a Horizontal Bar Chart in R


To create a horizontal bar chart:

1. Take all parameters which are required to make a simple bar chart.

2. Now, to make it horizontal, add the horiz parameter.

barplot(A, horiz=TRUE )

Creating a horizontal bar chart

R
# Create the data for the chart

A <- c(17, 32, 8, 53, 1)

# Plot the bar chart

barplot(A, horiz = TRUE, xlab = "X-axis",
        ylab = "Y-axis", main = "Horizontal Bar Chart")

Output:

Horizontal Bar Chart

Adding Label, Title and Color in the BarChart


Labels, a title, and colors are properties of the bar chart that can be set by passing
the corresponding arguments.

1. To add the title in bar chart.

barplot( A, main = title_name )

2. X-axis and Y-axis can be labeled in bar chart. To add the label in bar chart.

barplot( A, xlab= x_label_name, ylab= y_label_name)

3. To add the color in bar chart.

barplot( A, col=color_name)

Implementations

R
# Create the data for the chart

A <- c(17, 2, 8, 13, 1, 22)

B <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun")

# Plot the bar chart

barplot(A, names.arg = B, xlab ="Month",

ylab ="Articles", col ="green",

main ="GeeksforGeeks-Article chart")

Output:

R – GeeksforGeeks-Article chart

Add Data Values on the Bar


R
# Create the data for the chart

A <- c(17, 2, 8, 13, 1, 22)

B <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun")

# Plot the bar chart with text features and keep the bar midpoints
# that barplot() returns, so the chart is drawn only once

bar_pos <- barplot(A, names.arg = B, xlab = "Month",
                   ylab = "Articles", col = "steelblue",
                   main = "GeeksforGeeks - Article Chart",
                   ylim = c(0, max(A) * 1.2),
                   cex.main = 1.5, cex.lab = 1.2, cex.axis = 1.1)

# Add data labels on top of each bar

text(x = bar_pos, y = A + 1, labels = A, pos = 3, cex = 1.2, col = "black")

Output:

GeeksforGeeks – Article Chart

cex.main, cex.lab, and cex.axis: These arguments control the font size of the chart
title, the axis labels, and the axis tick annotations, respectively. They are set to 1.5, 1.2, and 1.1 to
increase the font size for better readability.

text(): We use the text() function to add data labels on top of each bar.
The x argument gives the x-coordinates of the labels (the bar midpoints returned by barplot()), and the y argument adds 1 to the corresponding bar heights (A + 1) to position the labels just above the bars.

Creating Stacked and Grouped Bar Chart in R


A bar chart of several series can be drawn in two forms: grouped bars or stacked bars.

1. Take a vector of values and arrange it as a matrix M whose rows are to be grouped or stacked. The matrix can be created with:

M <- matrix(c(values...), nrow = no_of_rows, ncol = no_of_columns, byrow = TRUE)

2. To draw the bars side by side (grouped) instead of stacked, use the beside parameter.

barplot(M, beside = TRUE)

Grouped Bar Chart:


R
colors = c("green", "orange", "brown")

months <- c("Mar", "Apr", "May", "Jun", "Jul")

regions <- c("East", "West", "North")

# Create the matrix of the values.

Values <- matrix(c(2, 9, 3, 11, 9, 4, 8, 7, 3, 12, 5, 2, 8, 10, 11),

nrow = 3, ncol = 5, byrow = TRUE)

# Create the bar chart

barplot(Values, main = "Total Revenue", names.arg = months,

xlab = "Month", ylab = "Revenue",

col = colors, beside = TRUE)

# Add the legend to the chart

legend("topleft", regions, cex = 0.7, fill = colors)

Output:

R – Total Revenue

Stacked Bar Chart:

R
colors = c("green", "orange", "brown")

months <- c("Mar", "Apr", "May", "Jun", "Jul")

regions <- c("East", "West", "North")

# Create the matrix of the values.

Values <- matrix(c(2, 9, 3, 11, 9, 4, 8, 7, 3, 12, 5, 2, 8, 10, 11),

nrow = 3, ncol = 5, byrow = TRUE)

# Create the bar chart

barplot(Values, main = "Total Revenue", names.arg = months,

xlab = "Month", ylab = "Revenue", col = colors)

# Add the legend to the chart

legend("topleft", regions, cex = 0.7, fill = colors)
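
When the monthly totals differ, a stacked chart makes it hard to compare each region's share. One possible variant, sketched here as an assumption rather than part of the original example, converts each column to proportions with prop.table() before plotting. The name shares and the chart labels are illustrative.

```r
colors <- c("green", "orange", "brown")
months <- c("Mar", "Apr", "May", "Jun", "Jul")
regions <- c("East", "West", "North")

Values <- matrix(c(2, 9, 3, 11, 9, 4, 8, 7, 3, 12, 5, 2, 8, 10, 11),
                 nrow = 3, ncol = 5, byrow = TRUE)

# margin = 2 divides every entry by its column (month) total.
shares <- prop.table(Values, margin = 2)

# Each stacked bar now has total height 1.
barplot(shares, main = "Revenue Share by Region", names.arg = months,
        xlab = "Month", ylab = "Share of revenue", col = colors)
legend("topleft", regions, cex = 0.7, fill = colors)
```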

Boxplots in R Language
A boxplot (box-and-whisker plot) is a chart that displays the distribution of a variable,
drawing one box for each group being compared. The distribution is summarized by five
numbers: the minimum, first quartile, median, third quartile, and maximum.
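
The five numbers can be computed directly. The sketch below uses the built-in mtcars dataset as an illustrative choice and compares fivenum() with boxplot.stats(), which reports the values a boxplot actually draws.

```r
mpg <- mtcars$mpg

# Minimum, lower hinge, median, upper hinge, maximum.
print(fivenum(mpg))

# What boxplot() draws: whisker ends, hinges and median. The whiskers stop
# at the most extreme points within 1.5 * IQR of the box, so they can differ
# from the true minimum and maximum when outliers are present.
stats <- boxplot.stats(mpg)
print(stats$stats)

# Points beyond the whiskers, which a boxplot would draw individually.
print(stats$out)
```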

Boxplots in R Programming Language


Boxplots are created in R by using the boxplot() function.

x: This parameter is a vector or a formula.

data: This parameter sets the data frame.

notch: This parameter is a logical value. Set it to TRUE to draw a notch on each side of the boxes.

varwidth: This parameter is a logical value. Set it to TRUE to draw the width of each box
in proportion to the square root of the group's sample size.

main: This parameter is the title of the chart.

names: These are the group labels shown under each boxplot.

Creating a Dataset
We use the data set “mtcars”.

R
input <- mtcars[, c('mpg', 'cyl')]

print(head(input))

Creating the Boxplot


Creating the Boxplot graph.

R
data(mtcars)

boxplot(disp ~ gear, data = mtcars,

ylab = "Displacement")

Boxplot using notch
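
A minimal sketch of a notched boxplot, assuming the mtcars data used above. The choice of variables (mpg by cyl), the labels and the colours are illustrative. When the notches of two boxes do not overlap, that suggests their medians differ. With small groups, R may warn that some notches extend outside the hinges.

```r
# notch = TRUE draws a notch around each median;
# varwidth = TRUE makes wider boxes for larger groups.
b <- boxplot(mpg ~ cyl, data = mtcars,
             notch = TRUE, varwidth = TRUE,
             xlab = "Number of Cylinders", ylab = "Miles Per Gallon",
             main = "Mileage Data",
             col = c("green", "orange", "brown"),
             names = c("4 cyl", "6 cyl", "8 cyl"))

# The returned object holds the group sizes and the notch limits.
print(b$n)
print(b$conf)
```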


Analyzing real world dataset and case studies

Analyzing Real-World Datasets and Case Studies


Working with real-world datasets allows you to develop hands-on skills in data analysis
and apply statistical methods to practical problems. Here’s a comprehensive guide to
approaching real-world data analysis, including how to choose datasets, analytical steps,
and methods for interpreting results.

1. Choosing Appropriate Datasets for Practice and Analysis


Selecting the right dataset is crucial for meaningful analysis and skill development. There
are various sources available:

Kaggle: Offers diverse datasets across domains like finance, healthcare, sports, and
marketing, often with competitions for added motivation.

UCI Machine Learning Repository: Known for datasets like the Iris, Wine, and Heart
Disease datasets, which are commonly used in academic research and training.

Government Websites: Datasets from sources like data.gov (U.S.), data.gov.uk (UK),
and World Bank provide reliable and extensive datasets on topics like public health,
economics, and social statistics.

Choosing Criteria:

Domain Relevance: Pick datasets related to your field of interest (e.g., finance,
healthcare, social sciences).

Size and Complexity: Start with smaller datasets and move to larger, more complex
datasets as you advance.

Structure and Documentation: Ensure the dataset has clear documentation,
including data types, column descriptions, and potential variables of interest.

2. Steps for Analyzing Real-World Datasets


Once you have chosen a dataset, the following steps can guide your analysis.

A. Data Exploration and Pre-Processing


Exploratory data analysis (EDA) and data pre-processing are essential first steps in
handling any dataset.

1. Data Exploration: Start by inspecting the dataset for basic characteristics.

Shape and Structure: Use functions like head(), tail(), str(), and dim() in R (or head(), tail(), and describe() in Python) to
understand dataset dimensions, column names, and data types.

Summary Statistics: Check summary statistics to get a sense of the central
values, range, and variability.

2. Handling Missing Values:

Identify Missing Data: Determine which columns have missing values and how
many. Missing data can lead to biased results if not addressed.

Strategies:

Imputation: Replace missing values with the mean, median, or mode (for
numeric variables), or with the most common category (for categorical
variables).

Deletion: Remove rows or columns with a high percentage of missing data if it
doesn’t bias results.

3. Handling Outliers:

Outliers can skew analysis. Identify them through methods like IQR (interquartile
range) or z-score analysis.

Handling Strategies:

Transformation: Apply log or square root transformations to mitigate extreme
values.

Capping: Set a reasonable threshold and cap values that exceed it.

4. Data Transformation:

Ensure that all variables are on a comparable scale if they’ll be used together.
Standardization (z-scores) or normalization (scaling between 0 and 1) are
common transformations.

Encoding Categorical Data: Convert categorical data into numerical format if
required (e.g., one-hot encoding).
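
The pre-processing steps above can be sketched in R on a small made-up data frame. The data, the column names income and region, and the choice of median imputation and 1.5 * IQR fences are assumptions for illustration, not a prescribed recipe.

```r
# Hypothetical toy data with missing values and one extreme income.
df <- data.frame(
  income = c(32, 35, NA, 31, 36, 34, 150, 33),
  region = c("East", "West", "East", NA, "North", "West", "East", "East"),
  stringsAsFactors = FALSE
)

# Identify missing data: number of NA values per column.
print(colSums(is.na(df)))

# Imputation: median for the numeric column, most common category otherwise.
df$income[is.na(df$income)] <- median(df$income, na.rm = TRUE)
top_region <- names(which.max(table(df$region)))
df$region[is.na(df$region)] <- top_region

# Outliers: flag values beyond 1.5 * IQR from the quartiles, then cap them.
q <- quantile(df$income, c(0.25, 0.75))
fence <- 1.5 * IQR(df$income)
lower <- q[[1]] - fence
upper <- q[[2]] + fence
print(df$income[df$income < lower | df$income > upper])
df$income_capped <- pmin(pmax(df$income, lower), upper)

# Transformation: z-scores, and min-max scaling to the range 0 to 1.
df$income_z <- as.numeric(scale(df$income_capped))
rng <- range(df$income_capped)
df$income_01 <- (df$income_capped - rng[1]) / (rng[2] - rng[1])

# One-hot encoding: one indicator column per region.
one_hot <- model.matrix(~ region - 1, data = df)
print(head(one_hot))
```

Capping is done before scaling here so that the single extreme value does not dominate the standardized scale; other orders may suit other data.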

B. Descriptive Statistics
Descriptive statistics help summarize data, providing a clear snapshot of key features.

1. Measures of Central Tendency: Mean, median, and mode provide insights into the
central values in the data.

2. Measures of Dispersion: Variance, standard deviation, range, and interquartile range
give information about data spread.

3. Visualizing Data:

Histograms: For understanding the distribution of a single variable.

Box Plots: Useful for spotting outliers and comparing distributions across
categories.

Scatter Plots: To explore relationships between two continuous variables.
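
These measures map onto base R functions, with one exception: R has no built-in function for the statistical mode. The sketch below uses the built-in mtcars dataset as an illustrative choice, and stat_mode is a hypothetical helper written for this example.

```r
x <- mtcars$mpg

# Hypothetical helper: the most frequent value(s) of a numeric vector.
stat_mode <- function(v) {
  tab <- table(v)
  as.numeric(names(tab)[tab == max(tab)])
}

# Central tendency.
print(c(mean = mean(x), median = median(x)))
print(stat_mode(mtcars$cyl))

# Dispersion.
print(c(variance = var(x), sd = sd(x),
        range = diff(range(x)), IQR = IQR(x)))
```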

C. Inferential Statistics
Once you have a solid understanding of the data, inferential statistics can help you draw
conclusions and make predictions.

1. Hypothesis Testing: Useful for determining if observed differences are statistically
significant.

T-tests: Compare means across two groups (e.g., male vs. female test scores).

Chi-Square Tests: Assess relationships between categorical variables.

2. Regression Analysis: Helps quantify relationships and predict outcomes.

Simple Linear Regression: Analyze the relationship between two continuous
variables.

Multiple Regression: Examine how multiple predictors impact the response
variable.

Logistic Regression: Used when the outcome variable is binary (e.g., pass/fail).
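
Each of these methods has a base R counterpart. The sketch below runs them on the built-in mtcars dataset; the choice of variables (mpg, am, cyl, wt, hp) is illustrative only.

```r
# T-test: compare mean mpg of automatic (am = 0) and manual (am = 1) cars.
tt <- t.test(mpg ~ am, data = mtcars)
print(tt$p.value)

# Chi-square test: association between cylinder count and transmission.
# Small expected counts trigger an approximation warning, suppressed here.
tab <- table(cyl = mtcars$cyl, am = mtcars$am)
chi <- suppressWarnings(chisq.test(tab))
print(chi$p.value)

# Multiple regression: mpg explained by weight and horsepower.
multi <- lm(mpg ~ wt + hp, data = mtcars)
print(coef(multi))

# Logistic regression: binary outcome (manual or not) explained by weight.
logit <- glm(am ~ wt, data = mtcars, family = binomial)

# type = "response" returns probabilities rather than log-odds.
prob <- predict(logit, newdata = data.frame(wt = c(2.5, 3.5)),
                type = "response")
print(prob)
```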

3. Interpretation and Communication of Results


The final step in data analysis is interpreting and communicating the results in a clear and
actionable way. This involves summarizing findings and addressing the research
questions or hypotheses set at the start.

1. Interpretation of Key Findings:

Statistical Significance: Look at p-values and confidence intervals to determine
if results are statistically significant.

Effect Size: Consider the strength of relationships and effect sizes, not just
statistical significance.

Practical Significance: Ask whether the results are meaningful and actionable in
real-world terms.

2. Visual and Written Communication:

Use graphs, tables, and summaries to make complex results accessible. Bar
charts, line graphs, and heatmaps are popular choices depending on the data.

In reports, summarize findings succinctly, highlight key statistics, and discuss any
limitations (like sample size or biases).

3. Actionable Insights and Recommendations:

Provide recommendations based on the data analysis. For example, if a variable
significantly impacts sales, suggest strategies to optimize it.
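
To make the distinction between statistical significance and effect size concrete, the following sketch reports a confidence interval for a regression slope and a standardized mean difference. The mtcars variables are illustrative, and Cohen's d with a pooled standard deviation is one common effect-size measure among several.

```r
# Confidence interval for the effect of weight on mpg.
model <- lm(mpg ~ wt, data = mtcars)
ci <- confint(model, level = 0.95)
print(ci)

# R-squared: share of the variation in mpg explained by the model.
print(summary(model)$r.squared)

# Cohen's d: difference in mean mpg between manual and automatic cars,
# expressed in units of the pooled standard deviation.
g0 <- mtcars$mpg[mtcars$am == 0]
g1 <- mtcars$mpg[mtcars$am == 1]
pooled_sd <- sqrt(((length(g0) - 1) * var(g0) + (length(g1) - 1) * var(g1)) /
                  (length(g0) + length(g1) - 2))
cohens_d <- (mean(g1) - mean(g0)) / pooled_sd
print(cohens_d)
```

An interval that excludes zero indicates statistical significance at the chosen level, while the size of d or R-squared indicates how large the effect is.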

Example: Analysis Workflow with R


Using R, you can follow these steps for a full data analysis workflow. Here’s an example:

# Load dataset
data <- read.csv("your_dataset.csv")

# Exploratory Data Analysis


summary(data) # Summary statistics
str(data) # Data structure

# Handling Missing Values
data$column <- ifelse(is.na(data$column), mean(data$column, na.rm =
TRUE), data$column)

# Detecting and Handling Outliers


boxplot(data$column) # Boxplot for outliers

# Data Transformation
data$scaled_column <- scale(data$column) # Standardization

# Descriptive Statistics
mean(data$column)
sd(data$column)

# Visualizations
hist(data$column, main = "Histogram of Column", xlab = "Column Values")
plot(data$x, data$y, main = "Scatter Plot", xlab = "X", ylab = "Y")

# Inferential Statistics
# T-test
t.test(data$group1, data$group2)

# Linear Regression
model <- lm(y ~ x, data = data)
summary(model) # Model summary

# Prediction
predict(model, newdata = data.frame(x = c(5, 10)))

Using this framework, you can approach any dataset with confidence, knowing each step
builds toward a comprehensive understanding and actionable insights.
