IDS Notes Unit 1

The document serves as an introduction to Data Science, covering its definition, significance, and applications across various industries. It outlines key concepts such as data types, statistical inference, modeling, and machine learning, along with practical programming in R. Additionally, it discusses the advantages and challenges of Big Data and Data Science, emphasizing the importance of datafication in transforming business operations.

Uploaded by

bijjavinodkumar

INTRODUCTION TO DATA SCIENCE

B.Tech. III Year I Sem.          L T P C
                                 3 0 0 3

UNIT - I
Introduction: Definition of Data Science- Big Data and Data Science hype – and getting past
the hype - Datafication - Current landscape of perspectives - Statistical Inference -
Populations and samples - Statistical modeling, probability distributions, fitting a model –
Overfitting. Basics of R: Introduction, R Environment Setup, Programming with R, Basic
Data Types.
UNIT - II
Data Types & Statistical Description
Types of Data: Attributes and Measurement, What is an Attribute? The Type of an Attribute,
The Different Types of Attributes, Describing Attributes by the Number of Values,
Asymmetric Attributes, Binary Attribute, Nominal Attributes, Ordinal Attributes, Numeric
Attributes, Discrete versus Continuous Attributes. Basic Statistical Descriptions of Data:
Measuring the Central Tendency: Mean, Median, and Mode, Measuring the Dispersion of
Data: Range, Quartiles, Variance, Standard Deviation, and Interquartile Range, Graphic
Displays of Basic Statistical Descriptions of Data.
UNIT - III
Vectors: Creating and Naming Vectors, Vector Arithmetic, Vector Subsetting, Matrices:
Creating and Naming Matrices, Matrix Subsetting, Arrays, Class. Factors and Data
Frames: Introduction to Factors: Factor Levels, Summarizing a Factor, Ordered Factors,
Comparing Ordered Factors, Introduction to Data Frames, Subsetting of Data Frames,
Extending Data Frames, Sorting Data Frames.
Lists: Introduction, creating a List: Creating a Named List, Accessing List Elements,
Manipulating List Elements, Merging Lists, Converting Lists to Vectors
UNIT - IV
Conditionals and Control Flow: Relational Operators, Relational Operators and Vectors,
Logical Operators, Logical Operators and Vectors, Conditional Statements. Iterative
Programming in R:
Introduction, While Loop, For Loop, Looping Over List. Functions in R: Introduction,
writing a Function in R, Nested Functions, Function Scoping, Recursion, Loading an R
Package, Mathematical Functions in R.
UNIT - V
Data Reduction: Overview of Data Reduction Strategies, Wavelet Transforms, Principal
Components Analysis, Attribute Subset Selection, Regression and Log-Linear Models:
Parametric Data Reduction, Histograms, Clustering, Sampling, Data Cube Aggregation. Data
Visualization: Pixel-Oriented Visualization Techniques, Geometric Projection Visualization
Techniques, Icon-Based Visualization Techniques, Hierarchical Visualization Techniques,
Visualizing Complex Data and Relations.
Introduction

What is Data Science?


Data Science is about data gathering, analysis, and decision-making.
Data Science is about finding patterns in data through analysis, and making future predictions.
By using Data Science, companies are able to make:
 Better decisions (should we choose A or B?)
 Predictive analyses (what will happen next?)
 Pattern discoveries (find patterns, or maybe hidden information, in the data)
Where is Data Science Needed?
Data Science is used in many industries in the world today, e.g. banking, consultancy,
healthcare, and manufacturing.
Examples of where Data Science is needed:
For route planning: To discover the best routes to ship
To foresee delays for flight/ship/train etc. (through predictive analysis)
To create promotional offers
To find the best suited time to deliver goods
To forecast the next year's revenue for a company
To analyze the health benefits of training
To predict who will win elections
What is Data?
Data is a collection of information.
One purpose of Data Science is to structure data, making it interpretable and easy to work
with.
Data can be categorized into two groups:
 Structured data
 Unstructured data

Unstructured Data
Unstructured data is not organized. We must organize the data for analysis purposes.
Structured Data
Structured data is organized and easier to work with.
How to Structure Data?
We can use an array or a database table to structure or present data.
Example of an array: [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
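As a brief sketch in R (the language used later in these notes), the example array can be held in a vector, and a database-table-like structure in a data frame; the `id` column below is added purely for illustration:

```r
# The array from the text, stored as an R vector (one-dimensional structured data)
scores <- c(80, 85, 90, 95, 100, 105, 110, 115, 120, 125)

# The same values in a table-like structure: a data frame
df <- data.frame(id = 1:10, score = scores)
head(df, 3)  # show the first three rows
```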
Big Data and Data Science hype
Big Data:
Big Data refers to huge, voluminous data, information, or relevant statistics acquired by large
organizations and ventures. Much specialized software and many data stores have been created
to prepare and process it, because big data is too large to compute manually. It is used to
discover patterns and trends and to make decisions related to human behavior and interaction
with technology.
Advantages of Big Data:
 Able to handle and process large and complex data sets that cannot be easily managed
with traditional database systems
 Provides a platform for advanced analytics and machine learning applications
 Enables organizations to gain insights and make data-driven decisions based on large
amounts of data
 Offers potential for significant cost savings through efficient data management and
analysis
Disadvantages of Big Data:
 Requires specialized skills and expertise in data engineering, data management, and
big data tools and technologies
 Can be expensive to implement and maintain due to the need for specialized
infrastructure and software
 May face privacy and security concerns when handling sensitive data
 Can be challenging to integrate with existing systems and processes
Data Science:
 Data Science is a field or domain which involves working with a huge
amount of data and using it for building descriptive, predictive, and prescriptive
analytical models. It is about digging, capturing (building the model),
analyzing (validating the model), and utilizing the data (deploying the best model). It is
an intersection of data and computing, and a blend of the fields of Computer Science,
Business, and Statistics.
Advantages of Data Science:
 Provides a framework for extracting insights and knowledge from data through
statistical analysis, machine learning, and
 data visualization techniques
 Offers a wide range of applications in various fields such as finance, healthcare, and
marketing
 Helps organizations make informed decisions by extracting meaningful insights from
data
 Offers potential for significant cost savings through efficient data management and
analysis
Disadvantages of Data Science:
 Requires specialized skills and expertise in statistical analysis, machine learning, and
data visualization
 Can be time-consuming and resource-intensive due to the need for data cleaning and
preprocessing
 May face ethical concerns when dealing with sensitive data
 Can be challenging to integrate with existing systems and processes

Getting Past the Hype


Rachel’s experience going from getting a PhD in statistics to working at Google is a great
example to illustrate why we thought, in spite of the aforementioned reasons to be dubious,
there might be some meat in the data science sandwich. In her words:
It was clear to me pretty quickly that the stuff I was working on at Google was different than
anything I had learned at school when I got my PhD in statistics. This is not to say that my
degree was useless; far from it—what I’d learned in school provided a framework and way of
thinking that I relied on daily, and much of the actual content provided a solid theoretical and
practical foundation necessary to do my work.
But there were also many skills I had to acquire on the job at Google that I hadn’t learned in
school. Of course, my experience is specific to me in the sense that I had a statistics
background and picked up more computation, coding, and visualization skills, as well as
domain expertise while at Google. Another person coming in as a computer scientist or a
social scientist or a physicist would have different gaps and would fill them in accordingly.
But what is important here is that, as individuals, we each had different strengths and gaps,
yet we were able to solve problems by putting ourselves together into a data team well-suited
to solve the data problems that came our way.
Here’s a reasonable response you might have to this story. It’s a general truism that,
whenever you go from school to a real job, you realize there’s a gap between what you
learned in school and what you do on the job. In other words, you were simply facing the
difference between academic statistics and industry statistics.
Datafication
Datafication refers to the collective tools, technologies, and processes used to transform an
organization into a data-driven enterprise. An organizational trend of defining the key to core
business operations through a global reliance on data and its related infrastructure.

The crux is, “Datafication” is the process of turning everything into data. It is the act of
taking something that was once unquantifiable and turning it into quantitative data.

Datafication enables the transformation of business operations, behaviors, and actions, in


addition to those of its clients and consumers, into quantifiable, usable, and actionable data.
This information can then be tracked, processed, monitored, analyzed, and utilized to
improve an organization and the products and services it offers to customers. To put this
into perspective:

Google transforms our searches into data


Facebook transforms our friendships into data
LinkedIn transforms our professional life into data
Netflix or Amazon Prime transforms our watched TV shows and films into data
Tinder transforms our dating activities into data
Amazon transforms our shopping into data
Data, whether personal or commercial, is used to monitor every activity within its reach.
The tech giants above store massive datasets that are updated daily for datafication.
The collected data is then used for personalization in the form of ads, push notifications,
consumable content, and more within each tech app or platform. This level of interference is
usually regulated by law.
Current landscape of perspectives in Data Science

Data science is part of the computer sciences. It comprises the disciplines of (i) analytics,
(ii) statistics, and (iii) machine learning.
The Data Science Landscape
Analytics
Data analytics focuses on processing and performing statistical analysis of existing datasets.
Analysts concentrate on creating methods to capture, process, and organize data to uncover
actionable insights for current problems, and establishing the best way to present this data.
More simply, the field of data and analytics is directed toward solving problems for questions
we know we don’t know the answers to. More importantly, it’s based on producing results
that can lead to immediate improvements.

Data analytics also encompasses a few different branches of broader statistics and analysis
which help combine diverse sources of data and locate connections while simplifying the
results.
Statistics
In many instances, analytics may be sufficient to address a given problem. In other instances,
the issue is more complex and requires a more sophisticated approach to provide an answer,
especially if there is a high-stakes decision to be made under uncertainty. This is when
statistics comes into play. Statistics provides a methodological approach to answer questions
raised by the analysts with a certain level of confidence.
Sometimes simple descriptive statistics are sufficient to provide the necessary insight. Yet, on
other occasions, more sophisticated inferential statistics such as regression analysis are
required to reveal relationships between cause and effect for a certain phenomenon. The
limitation of statistics is that it is traditionally conducted with software packages, such as
SPSS and SAS, which require a distinct calculation for a specific problem by a statistician or
trained professional. The degree of automation is rather limited.
Machine Learning
Artificial intelligence refers to the broad idea that machines can perform tasks normally
requiring human intelligence, such as visual perception, speech recognition, decision-making
and translation between languages. In the context of data science, machine learning can be
considered a sub-field of artificial intelligence that is concerned with decision making. In
fact, in its most essential form, machine learning is decision making at scale. Machine
learning is the field of study of computer algorithms that allow computer programs to identify
and extract patterns from data. A common purpose of machine learning algorithms is
therefore to generalize and learn from data in order to perform certain tasks.

In traditional programming, input data is applied to a model and a computer in order to


achieve a desired output. In machine learning, an algorithm is applied to input and output
data in order to identify the most suitable model. Machine Learning can thus be
complementary to traditional programming as it can provide a useful model to explain a
phenomenon.

Statistical Inference
Statistical Inference
Using data analysis and statistics to make conclusions about a population is called statistical
inference.
The main types of statistical inference are:

Estimation
Hypothesis testing
Estimation
Statistics from a sample are used to estimate population parameters. The most likely value is
called a point estimate. There is always uncertainty when estimating. The uncertainty is often
expressed as confidence intervals defined by a likely lowest and highest value for the
parameter.
An example could be a confidence interval for the number of bicycles a Dutch person owns:
"The average number of bikes a Dutch person owns is between 3.5 and 6."
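As a hedged sketch in R, `t.test()` on a small sample returns both a point estimate (the sample mean) and a 95% confidence interval for the population mean. The bike counts below are invented for illustration:

```r
# Hypothetical survey: number of bikes owned by 10 people
bikes <- c(3, 5, 4, 6, 2, 5, 4, 3, 6, 5)

mean(bikes)            # point estimate of the population mean
t.test(bikes)$conf.int # 95% confidence interval for that mean
```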
Hypothesis Testing
Hypothesis testing is a method to check if a claim about a population is true. More precisely,
it checks how likely it is that a hypothesis is true, based on the sample data.

There are different types of hypothesis testing.

The steps of the test depend on:

Type of data (categorical or numerical)


If you are looking at:
A single group
Comparing one group to another
Comparing the same group before and after a change
Some examples of claims or questions that can be checked with hypothesis testing:

90% of Australians are left handed


Is the average weight of dogs more than 40kg?
Do doctors make more money than lawyers?
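A claim like "the average weight of dogs is more than 40 kg" can be tested in R with a one-sample t-test; the weights below are made-up data used only to illustrate the mechanics:

```r
# Hypothetical dog weights (kg); claim to check: "the average weight is more than 40 kg"
weights <- c(38, 42, 45, 39, 41, 44, 40, 43)

# One-sided one-sample t-test: H0: mean <= 40 vs H1: mean > 40
result <- t.test(weights, mu = 40, alternative = "greater")
result$p.value  # a small p-value would support the claim
```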
Population and Samples
Population: Everything in the group that we want to learn about.
Sample: A part of the population.
Examples of populations and a sample from those populations:

Population Sample

All of the people in Germany 500 Germans

All of the customers of Netflix 300 Netflix customers

Every car manufacturer Tesla, Toyota, BMW, Ford

For good statistical analysis, the sample needs to be as "similar" as possible to the population.
If they are similar enough, we say that the sample is representative of the population.
The sample is used to make conclusions about the whole population. If the sample is not
similar enough to the whole population, the conclusions could be useless.
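A minimal sketch of drawing a sample from a population in R, using a toy population of the numbers 1 to 1000:

```r
# A toy population: the numbers 1 to 1000
population <- 1:1000

set.seed(42)                        # for reproducibility
s <- sample(population, size = 50)  # simple random sample, without replacement

mean(population)  # the true population mean (500.5)
mean(s)           # the sample mean approximates it
```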
Statistical Modeling
What is statistical modeling?
 The statistical modeling process is a way of applying statistical analysis to datasets in
data science. The statistical model involves a mathematical relationship between
random and non-random variables.
 A statistical model can provide intuitive visualizations that aid data scientists in
identifying relationships between variables and making predictions by applying
statistical models to raw data.
 Examples of common data sets for statistical analysis include census data, public
health data, and social media data.
Statistical modeling techniques
 Data gathering is the foundation of statistical modeling. The data may come from the
cloud, spreadsheets, databases, or other sources. There are two categories of statistical
modeling methods used in data analysis. These are:
Supervised learning
 In the supervised learning model, the algorithm uses a labeled data set for learning,
with an answer key the algorithm uses to determine accuracy as it trains on the
data. Supervised learning techniques in statistical modeling include:
 Regression model: A predictive model designed to analyze the relationship between
independent and dependent variables. The most common regression models are
logistical, polynomial, and linear. These models determine the relationship between
variables, forecasting, and modeling.
 Classification model: An algorithm analyzes and classifies a large and complex set
of data points. Common models include decision trees, Naive Bayes, the nearest
neighbor, random forests, and neural networking models.
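As a minimal sketch of a supervised model in base R, a logistic regression (one form of classification model) can be fitted with `glm()`; the built-in `mtcars` data set and the choice of predictor here are illustrative only:

```r
# Predict whether a car has a manual transmission (am = 1) from its weight (wt)
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Predicted probability of a manual transmission for a car with wt = 2.5
# (weight is recorded in units of 1000 lbs in mtcars)
predict(fit, newdata = data.frame(wt = 2.5), type = "response")
```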
Unsupervised learning
 In the unsupervised learning model, the algorithm is given unlabeled data and
attempts to extract features and determine patterns independently. Clustering
algorithms and association rules are examples of unsupervised learning. Here are two
examples:
 K-means clustering: The algorithm combines a specified number of data points into
specific groupings based on similarities.
 Reinforcement learning: This technique involves training the algorithm to iterate
over many attempts using deep learning, rewarding moves that result in favorable
outcomes, and penalizing activities that produce undesired effects.
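K-means clustering is available directly in base R via `kmeans()`. A sketch on the built-in `iris` measurements, ignoring the species labels so the algorithm works unsupervised (the seed and the choice of 3 clusters are arbitrary):

```r
# Cluster the iris measurements into 3 groups, ignoring the species labels
set.seed(123)
km <- kmeans(iris[, 1:4], centers = 3)

km$size                          # number of points in each cluster
table(km$cluster, iris$Species)  # compare found clusters to the known species
```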
Probability distributions
What is Probability Distribution?
A Probability Distribution of a random variable is a list of all possible outcomes with
corresponding probability values.
Note : The value of the probability always lies between 0 to 1.
What is an example of Probability Distribution?
Let’s understand the probability distribution by an example:
When two six-sided dice are rolled, let a possible outcome be
denoted by (a, b), where
a : the number on top of the first die
b : the number on top of the second die
Then the possible values of the sum a + b are:
Sum of a + b (a, b)
2 (1,1)
3 (1,2), (2,1)
4 (1,3), (2,2), (3,1)
5 (1,4), (2,3), (3,2), (4,1)
6 (1,5), (2,4), (3,3), (4,2), (5,1)
7 (1,6), (2,5), (3,4),(4,3), (5,2), (6,1)
8 (2,6), (3,5), (4,4), (5,3), (6,2)
9 (3,6), (4,5), (5,4), (6,3)
10 (4,6), (5,5), (6,4)
11 (5,6), (6,5)
12 (6,6)
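The distribution in the table above can be reproduced in R by enumerating all 36 equally likely outcomes with `expand.grid()`:

```r
# Enumerate all 36 equally likely outcomes of rolling two six-sided dice
rolls <- expand.grid(a = 1:6, b = 1:6)
sums  <- rolls$a + rolls$b

table(sums) / 36  # probability of each possible sum (2 through 12)
```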

 If a random variable is a discrete variable, its probability distribution is called a
discrete probability distribution.
 Example: flipping two coins
 A function that represents a discrete probability distribution is known
as a Probability Mass Function.
 If a random variable is a continuous variable, its probability distribution is called a
continuous probability distribution.
 Example: measuring temperature over a period of time
 A function that represents a continuous probability distribution is known
as a Probability Density Function.
Fitting a model
Fitting a model means that you estimate the parameters of the model using the observed
data. You are using your data as evidence to help approximate the real-world
mathematical process that generated the data. Fitting the model often involves
optimization methods and algorithms, such as maximum likelihood estimation, to help get
the parameters.
In fact, when you estimate the parameters, they are actually estimators, meaning they
themselves are functions of the data. Once you fit the model, you can actually write it as y
= 7.2 + 4.5x, for example, which means that your best guess is that this equation or
functional form expresses the relationship between your two variables, based on
your assumption that the data followed a linear pattern.
Fitting the model is when you start actually coding: your code will read in the data, and
you’ll specify the functional form that you wrote down on the piece of paper. Then R or
Python will use built-in optimization methods to give you the most likely values of the
parameters given the data.
As you gain sophistication, or if this is one of your areas of expertise, you’ll dig around in
the optimization methods yourself. Initially you should have an understanding that
optimization is taking place and how it works, but you don’t have to code this part
yourself—it underlies the R or Python functions.
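A minimal sketch of this workflow in R, using simulated data so the true parameters are known (the intercept 7.2 and slope 4.5 echo the example equation above):

```r
# Simulated data that truly follows y = 7.2 + 4.5x, plus noise
set.seed(1)
x <- runif(100, 0, 10)
y <- 7.2 + 4.5 * x + rnorm(100, sd = 1)

fit <- lm(y ~ x)  # least-squares fit (maximum likelihood under normal errors)
coef(fit)         # estimated intercept and slope, close to 7.2 and 4.5
```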
Overfitting
Overfitting is the term used to mean that you used a dataset to estimate the parameters of
your model, but your model isn’t that good at capturing reality beyond your sampled data.
You might know this because you have tried to use it to predict labels for another set of
data that you didn’t use to fit the model, and it doesn’t do a good job, as measured by an
evaluation metric such as accuracy.
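A small sketch of overfitting in R: a degree-9 polynomial has as many parameters as there are training points, so it fits the training data essentially perfectly while memorizing the noise. The data here is simulated for illustration:

```r
# 10 noisy training points from a simple linear process
set.seed(7)
train_x <- 1:10
train_y <- 2 * train_x + rnorm(10)

fit_simple  <- lm(train_y ~ train_x)           # 2 parameters
fit_complex <- lm(train_y ~ poly(train_x, 9))  # 10 parameters: one per point

# The complex model's training error is essentially zero -- it has memorized
# the noise, so it will generalize poorly to new data from the same process
sum(residuals(fit_simple)^2)
sum(residuals(fit_complex)^2)
```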

Basics of R
Introduction:
R is a popular programming language used for statistical computing and graphical
presentation.
Its most common use is to analyze and visualize data.
Why Use R?
 It is a great resource for data analysis, data visualization, data science and machine
learning
 It provides many statistical techniques (such as statistical tests, classification,
clustering and data reduction)
 It is easy to draw graphs in R, like pie charts, histograms, box plots, scatter plots, etc.
 It works on different platforms (Windows, Mac, Linux)
 It is open-source and free
 It has a large community support
 It has many packages (libraries of functions) that can be used to solve different
problems
REnvironment Setup
Downloading and Installing R
• R is freely available from the Comprehensive R Archive Network (CRAN) at
http://cran.r-project.org
• Precompiled binaries are available for Linux, Mac OS X, and Windows.
• The latest release at the time of writing was R-3.4.0.
• Installing R on Windows and Mac is just like installing any other program.
• Install R Studio: a free IDE for R at http://www.rstudio.com/
• If we install R and R Studio, then we need to run R Studio only.
• R is case-sensitive.
• R scripts are simply text files with a .R extension.

Programming with R
Once we are inside the R session, we can directly execute R language commands by
typing them line by line. Pressing the Enter key terminates the command and brings the
> prompt back. In the example session below, we declare two variables 'a' and 'b' with
values 5 and 6 respectively, and assign their sum to another variable called 'c':
> a = 5
> b = 6
> c = a + b
> c
[1] 11
In an R session, typing a variable name prints its value on the screen.
Get help inside R session
To get help on any function of R, type help(function-name) at the R prompt. For example, if we
need help on the "if" statement, type:
> help("if")
Then the help text for the "if" statement is printed.
Exit the R session
To exit the R session, type quit() at the R prompt, and say 'n' (no) to saving the workspace
image. This means we do not want to save the memory of all the commands we typed in the
current session:
> quit()
Save workspace image? [y/n/c]: n
Saving the R session
Note that by not saving the current session, we lose all the memory of the current session's
commands, and the variables and objects created, when we exit the R prompt. When we work in
R, the objects we create and load are stored in a memory area called the workspace.
When we say 'no' to saving the workspace, all these objects are wiped out from the
workspace memory. If we say 'yes', they are saved into a file called ".RData" in the
present working directory.
Listing the objects in the current R session
We can list the names of the objects in the current R session with the ls() command. For example,
start a fresh R session and proceed as follows:
> a = 5
> b = 6
> c = 8
> sum = a + b + c
> sum
[1] 19
> ls()
[1] "a" "b" "c" "sum"
Here, the objects we created have been listed.
Removing objects from the current R session
Specific objects created in the current session can be removed using the rm() command. If we
specify the name of an object, it will be removed. If we say rm(list = ls()), all objects
created so far will be removed. See below:
> a = 5
> b = 6
> c = 8
> sum = a + b + c
> sum
[1] 19
> ls()
[1] "a" "b" "c" "sum"
> rm(list = c("sum"))
> ls()
[1] "a" "b" "c"
> rm(list = ls())
> ls()
character(0)
Getting and setting the current working directories
From the R prompt, we can get the current working directory using the getwd() command:
> getwd()
[1] "/home/user"
Similarly, we can set the current working directory by calling the setwd() function:
> setwd("/home/user/prog")
After this, "/home/user/prog" will be the working directory.
Comments
Comments are like helping text in your R program; they are ignored by the
interpreter while executing your actual program. A single-line comment is written using # at the
beginning of the statement, as follows:
# My first program in R Programming
R does not support multi-line comments.
R Reserved Words
Reserved words in R programming are a set of words that have special
meaning and cannot be used as identifiers (variable names, function names,
etc.). Here is the list of reserved words in R's parser:
if else repeat while function
for in next break TRUE
FALSE NULL Inf NaN NA
NA_integer_ NA_real_ NA_complex_ NA_character_ ...

Variables in R
Variables are used to store data whose value can be changed according to our need. The unique
name given to a variable (and to functions and objects) is called an identifier.
Rules for writing Identifiers in R
1. Identifiers can be a combination of letters, digits, period (.) and underscore (_).
2. It must start with a letter or a period. If it starts with a period, it cannot be followed by a
digit.
3. Reserved words in R cannot be used as identifiers.
Valid identifiers in R
total, Sum, .fine.with.dot, this_is_acceptable, Number5
Invalid identifiers in R
tot@l, 5um, _fine, TRUE, .0ne
Constants in R
Constants, as the name suggests, are entities whose value cannot be altered. Basic types of
constant are numeric constants and character constants.
Numeric Constants
All numbers fall under this category. They can be of type integer, double, or complex, which
can be checked with the typeof() function. Numeric constants followed by L are regarded as
integers, and those followed by i are regarded as complex.
> typeof(5)
[1] "double"
> typeof(5L)
[1] "integer"
> typeof(5i)
[1] "complex"
Numeric constants preceded by 0x or 0X are interpreted as hexadecimal numbers.
> 0xff
[1] 255
> 0XF + 1
[1] 16
Character Constants
Character constants can be represented using either single quotes (') or double quotes (") as
delimiters.
> 'example'
[1] "example"
> typeof("5")
[1] "character"
Built-in Constants
Some of the built-in constants defined in R, along with their values, are shown below.
> LETTERS
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z"
> letters
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
> pi
[1] 3.141593
> month.name
[1] "January" "February" "March" "April" "May" "June"
[7] "July" "August" "September" "October" "November" "December"
> month.abb
[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
But it is not good to rely on these, as they are implemented as variables whose values can be
changed.
> pi
[1] 3.141593
> pi = 56
> pi
[1] 56
Example: Hello World Program
> # We can use the print() function
> print("Hello World!")
[1] "Hello World!"

Basic Data Types


Basic data types in R can be divided into the following types:
 numeric - (10.5, 55, 787)
 integer - (1L, 55L, 100L, where the letter "L" declares this as an integer)
 complex - (9 + 3i, where "i" is the imaginary part)
 character (a.k.a. string) - ("k", "R is exciting", "FALSE", "11.5")
 logical (a.k.a. boolean) - (TRUE or FALSE)
We can use the class() function to check the data type of a variable:
Example
# numeric
x = 10.5
class(x)

# integer
x = 1000L
class(x)

# complex
x = 9 + 3i
class(x)

# character/string
x = "R is exciting"
class(x)
# logical/boolean
x = TRUE
class(x)

Numbers
There are three number types in R:

numeric
integer
complex
Variables of number types are created when you assign a value to them:
Example
x = 10.5 # numeric
y = 10L # integer
z = 1i # complex

Numeric
A numeric data type is the most common type in R, and contains any number with or without
a decimal, like: 10.5, 55, 787:

Example
x = 10.5
y = 55

# Print values of x and y


x
y

# Print the class name of x and y


class(x)
class(y)

Integer
Integers are numeric data without decimals. This is used when you are certain that you will
never create a variable that should contain decimals. To create an integer variable, you must
use the letter L after the integer value:

Example
x = 1000L
y = 55L
# Print values of x and y
x
y
# Print the class name of x and y
class(x)
class(y)

Complex
A complex number is written with an "i" as the imaginary part:

Example
x = 3+5i
y = 5i

# Print values of x and y


x
y

# Print the class name of x and y


class(x)
class(y)
Type Conversion
You can convert from one type to another with the following functions:

as.numeric()
as.integer()
as.complex()
Example
x = 1L # integer
y = 2 # numeric

# convert from integer to numeric:


a = as.numeric(x)

# convert from numeric to integer:


b = as.integer(y)

# print values of x and y


x
y

# print the class name of a and b


class(a)
class(b)

Simple Math
In R, you can use operators to perform common mathematical operations on numbers.

The + operator is used to add together two values:


Example
10 + 5
And the - operator is used for subtraction:

Example
10 - 5
Built-in Math Functions
R also has many built-in math functions that allow you to perform mathematical tasks on
numbers.

For example, the min() and max() functions can be used to find the lowest or highest number
in a set:

Example
max(5, 10, 15)

min(5, 10, 15)


sqrt()
The sqrt() function returns the square root of a number:

Example
sqrt(16)
abs()
The abs() function returns the absolute (positive) value of a number:

Example
abs(-4.7)
ceiling() and floor()
The ceiling() function rounds a number up to its nearest integer, and the floor() function
rounds a number down to its nearest integer:
Example
ceiling(1.4)

floor(1.4)

String Literals
Strings are used for storing text.

A string is surrounded by either single quotation marks, or double quotation marks:

"hello" is the same as 'hello':

Example
"hello"
'hello'
Assign a String to a Variable
Assigning a string to a variable is done with the variable followed by the <- operator and the
string:

Example
str <- "Hello"
str # print the value of str

String Length
There are many useful string functions in R.

For example, to find the number of characters in a string, use the nchar() function:

Example
str = "Hello World!"

nchar(str)
Combine Two Strings
Use the paste() function to merge/concatenate two strings:

Example
str1 = "Hello"
str2 = "World"

paste(str1, str2)

Check a String
Use the grepl() function to check if a character or a sequence of characters is present in a
string:

Example
str = "Hello World!"

grepl("H", str)
grepl("Hello", str)
grepl("X", str)

Booleans (Logical Values)


In programming, you often need to know if an expression is true or false.

You can evaluate any expression in R, and get one of two answers, TRUE or FALSE.

When you compare two values, the expression is evaluated and R returns the logical answer:
Example
a = 10
b = 9

a > b

You can also run a condition in an if statement, which you will learn much more about in the
if..else chapter.

Example
a = 200
b = 33

if (b > a) {
print ("b is greater than a")
} else {
print("b is not greater than a")
}
