Data Science & R for Professionals

This document provides an overview of the key concepts in data science including data science, data scientist, data types and variables in R, operators in R, conditional expressions in R, loops in R, R script files, and functions in R. It discusses data science as an interdisciplinary field that uses scientific methods and algorithms to extract knowledge from structured and unstructured data. It also defines a data scientist as someone who has skills in extracting insights from data using various processes, methods, algorithms, and who can analyze, interpret and visualize data to make informed decisions. The document then covers important concepts in R programming like the basic data types in R, arithmetic, assignment, comparison and logical operators, conditional expressions using if/else statements

Data Science

TCS 733
Compiled by Dr. Vijay Singh
Associate Professor,
Department of Computer Science and Engineering
Graphic Era Deemed to be University, Dehradun
+91-9760322316
Vijaysingh.cse@geu.ac.in
Syllabus
UNIT-I

Data science is an interdisciplinary field that uses
scientific methods, processes, algorithms and systems
to extract or extrapolate knowledge and insights from
noisy, structured and unstructured data, and apply
knowledge from data across a broad range of
application domains. Data science is related to data
mining, machine learning and big data.
Data science

Data science is a "concept to unify
statistics, data analysis, informatics, and
their related methods" in order to
"understand and analyze actual
phenomena" with data. It uses techniques
and theories drawn from many fields
within the context of mathematics,
statistics, computer science, information
science, and domain knowledge.
Data Scientist

A data scientist is a person who knows how to extract insights from
data by using various processes, methods, systems, and algorithms.
A data scientist requires a range of skills to analyze, interpret, and
visualize data to make informed decisions.

A data scientist is a high-ranking professional in the big data world
who possesses mathematical, statistical, scientific, analytical,
and technical skills to clean, prepare, and validate structured and
unstructured data in order to make better decisions for
businesses.
What do data science people do?
They collect data, conduct a wide range
of experiments using different models
and methods, understand the result,
predict impact, and communicate the
same to their peers in the organization.
Their skills must not be confined to the
analytical, statistical, or managerial alone.
They need special skills that set them
apart from data engineers, analysts,
and other data-centric roles.
The roles and responsibilities of data scientists vary greatly
with the organization's needs. Overall, they are required to
fulfill several or all of the responsibilities listed below:
• Gather data and identify sources of data
• Analyze a large amount of structured and unstructured data
• Create solutions and strategies that address business challenges
• Collaborate with team members and leaders to design data strategies
• Combine different algorithms and modules to discover trends and patterns
• Present information through various data visualization techniques and
tools
• Explore more technologies and tools to create innovative data strategies
• Create end-to-end analytical solutions from data collection to presentation
• Help build data engineering pipelines
• Assist teams of data scientists, BI developers, and analysts in their
projects whenever required
• Work with the sales and pre-sales teams on cost reduction,
effort estimation, and cost optimization
• Stay updated with the latest tools, trends, and technologies to
improve overall efficiency and performance
• Work closely with the product team and partners to present
data-driven solutions built with innovative ideas
• Design analytics solutions for businesses using various tools,
applied statistics, and ML
• Lead discussions and check implementation feasibility of the AI/ML
solutions concerning business processes and outcomes
• Architect, implement, and monitor data pipelines, and conduct
knowledge-sharing sessions for peers on the effective use of data
Data Scientist Skills:
• Mathematics And Statistics
• Machine Learning
• Programming Skills
• Analysis And Visualization
• Database Management
• Software Engineering Skills
Several other skills that are important for a data scientist:
• Years of experience as a data scientist, data analyst or data
engineer
• Experience in data mining, data modeling, and reporting
• Familiarity with machine learning and operation-research models
• Experience using various data visualization and data management
tools
• Problem-solving attitude and analytical mind
• Excellent verbal, written, and presentation skills
• Understanding of various business domains and business acumen
• Storytelling skills and ability to effectively communicate results to the
team
Data Science in Business

Google, Facebook, Twitter, Netflix, and Amazon have many
things in common, and the use of data science is one of them. They
have understood the importance of data science and use
defined processes to collect and analyze information about their
users, offering them a more personalized experience and remaining
at the top of their respective businesses.
Data science supports a business by:
• Empowering management and officers to make better decisions
• Directing actions based on trends—which in turn help to define goals
• Challenging the staff to adopt best practices and focus on issues that
matter
• Identifying opportunities
• Decision making with quantifiable, data-driven evidence
• Testing these decisions
• Identification and refining of target audiences
• Recruiting the right talent for the organization
What is a Data Science Use Case?

A data science use case is a concrete real-world task to be solved
using the available data. In the framework of a particular company,
many variables are analyzed using data science techniques in the
context of the company’s specific industry. Data science use cases
can be a problem to be resolved, a hypothesis to be checked, or a
question to be answered. Essentially, doing data science means
solving real-world use cases.
Data Science Use Cases by Industry
• Data Science in Healthcare
Medicine and healthcare providers gather a huge amount of
data from numerous sources: electronic health record (EHR)
systems, data from wearable devices, medical research studies,
and billing documents. Taking advantage of innovative data
science and data analysis techniques is gradually revolutionizing
the whole industry of healthcare, offering the most promising and
impactful solutions for its future development.
• Data Science in Transport and Logistics
Logistics is the organization of the process of product delivery from
one point to another, while transport implies the act of transporting
clients or goods. Depending on the profile of the company's activity
(e.g., lunch delivery, mail service, international shipment, airline
companies, and taxi or bus services), the data sources can be
represented by schedules, timesheets, route details, warehouse stock
itemization, reports, contracts, agreements, legal papers, and
customer feedback.
Data Science in Telecom
Using data science methods to work through the accumulated
data can help the telecom industry in many ways:
• streamlining the operations,
• optimizing the network,
• filtering out spam,
• improving data transmissions,
• performing real-time analytics,
• developing efficient business strategies,
• creating successful marketing campaigns,
• increasing revenue.
Introduction to R
R is an open-source programming language
that is widely used as statistical software
and as a data analysis tool. R ships
with a command-line interface and is
available on widely used platforms such as
Windows, Linux, and macOS.
Important links:
• R-intro.pdf (r-project.org)
• Download the RStudio IDE – RStudio
• Download R-4.2.1 for Windows. The R-project for statistical computing.
Importance of R Programming Language
• R is a well-developed, simple and effective programming
language that includes conditionals, loops, user-defined
recursive functions, and input and output facilities.
• R provides graphical facilities for data analysis and
display.
• R is a very flexible language. It does not necessitate that
everything should be done in R itself. It allows the use of
other tools, like C and C++ if required.
• R has an effective data handling and storage facility.
• R provides an extensive, coherent and integrated
collection of tools for data analysis.
• R also includes a package system that allows the users to add their
individual functionality in a manner that is indistinguishable from the
core of R.
• R is actively used for statistical computing and design, and it has
brought about major improvements in big data and data analytics. It is
one of the most widely used languages in data science. Large
companies such as Google, LinkedIn, and Facebook rely
on R for many of their operations.
The TIOBE Programming Community index is an indicator of the popularity of
programming languages. The index is updated once a month.
Data types and variables in R

In R there are 6 basic data types:
• Logical
• Numeric
• Integer
• Complex
• Character
• Raw
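The six basic types listed above can be checked directly in the R console. The snippet below is a minimal sketch: the variable names and sample values are chosen only for illustration, and class() is used to report the type of each value.

```r
# One sample value of each basic type; the values are illustrative only.
values <- list(
  logical   = TRUE,
  numeric   = 3.14,
  integer   = 42L,            # the L suffix requests an integer, not a double
  complex   = 2+3i,
  character = "data science",
  raw       = charToRaw("A")  # a single byte
)

# class() reports the type name of each value.
types <- sapply(values, class)
print(types)

# Variables are created by assignment and keep the type of their value.
x <- 10
print(is.numeric(x))
```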
Operators in R

• Arithmetic operators
• Assignment operators
• Comparison operators
• Logical operators
• Miscellaneous operators
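A short sketch of each operator family listed above may help. The variable names and numbers are arbitrary, and only a few operators per family are shown.

```r
a <- 7
b <- 2

# Arithmetic operators
sum_ab  <- a + b      # addition
quot    <- a / b      # division
int_div <- a %/% b    # integer division
rem     <- a %% b     # remainder (modulo)
power   <- a ^ b      # exponentiation

# Assignment operators: `<-` is the conventional form; `=` and `->` also assign
c1 = 5
10 -> c2

# Comparison operators return logical values
is_greater <- a > b
is_equal   <- a == b

# Logical operators
both   <- (a > 5) & (b > 5)   # AND
either <- (a > 5) | (b > 5)   # OR
negate <- !is_equal           # NOT

# Miscellaneous operators: `:` builds a sequence, `%in%` tests membership
s     <- 1:5
has_b <- b %in% s

print(c(sum_ab, quot, int_div, rem, power))
print(c(is_greater, is_equal, both, either, negate, has_b))
```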
Conditional expression in R
A conditional expression or a conditional statement is a programming
construct where a decision is made to execute some code based on a
Boolean (true or false) condition. A more commonly used term for a
conditional expression in programming is an 'if-else' condition. In plain
English, this is stated as 'if this test is true then do this operation;
otherwise do this different operation'.
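The 'if-else' pattern described above can be sketched in R as follows. The function name classify() and the sample marks are hypothetical and used only for illustration.

```r
# classify() is a hypothetical helper for this illustration.
classify <- function(x) {
  if (x > 0) {
    "positive"        # runs when the test is TRUE
  } else if (x < 0) {
    "negative"        # runs when the first test is FALSE and this one is TRUE
  } else {
    "zero"            # runs when every test above is FALSE
  }
}

print(classify(5))
print(classify(-2))
print(classify(0))

# ifelse() is the vectorised form: it applies the test to every element.
marks  <- c(35, 72, 50)
result <- ifelse(marks >= 40, "pass", "fail")
print(result)
```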
Loops in R
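R provides three loop constructs: for, while, and repeat. The following is a minimal sketch of each; the variable names and the values being accumulated are arbitrary.

```r
# for: iterate over the elements of a vector
total <- 0
for (i in 1:5) {
  total <- total + i
}

# while: keep looping as long as the condition is TRUE
n <- 1
while (n < 100) {
  n <- n * 2
}

# repeat: loop with no condition; break is needed to leave the loop
count <- 0
repeat {
  count <- count + 1
  if (count >= 3) break
}

print(c(total = total, n = n, count = count))
```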
R script files
An R script file is a file with the extension ".R" that
contains a program (a set of commands).
Rscript is a command-line front end to the R interpreter that executes
the R commands present in a script file.
Functions in R
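A function in R is created with the function keyword and assigned to a name. The sketch below shows a function with default arguments and a recursive function; the names rescale() and factorial_rec() are hypothetical and chosen only for this example.

```r
# rescale() maps a numeric vector onto the interval [lower, upper].
# The defaults are used when the caller does not supply the arguments.
rescale <- function(x, lower = 0, upper = 1) {
  rng <- range(x)
  lower + (x - rng[1]) / (rng[2] - rng[1]) * (upper - lower)
}

scaled     <- rescale(c(10, 20, 30))
scaled_pct <- rescale(c(10, 20, 30), upper = 100)

# A user-defined recursive function: it calls itself on a smaller input.
factorial_rec <- function(n) {
  if (n <= 1) {
    return(1)
  }
  n * factorial_rec(n - 1)
}

print(scaled)
print(scaled_pct)
print(factorial_rec(5))
```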
UNIT-II (Data Preprocessing)
A random variable, usually written X, is a variable
whose possible values are numerical outcomes of a
random phenomenon. There are two types of random
variables, discrete and continuous.
Discrete Random Variables
A discrete random variable is one which may take on only a countable number of
distinct values such as 0, 1, 2, 3, 4, .... Discrete random variables are usually (but not
necessarily) counts. If a random variable can take only a finite number of distinct
values, then it must be discrete. Examples of discrete random variables include the
number of children in a family, the Friday night attendance at a cinema, the number
of patients in a doctor's surgery, the number of defective light bulbs in a box of ten.

The probability distribution of a discrete random variable is a list of probabilities
associated with each of its possible values. It is also sometimes called the
probability function or the probability mass function.
(Definitions taken from Valerie J. Easton and John H.
McColl's Statistics Glossary v1.1)

Suppose a random variable X may take k different values, with the
probability that $X = x_i$ defined to be $P(X = x_i) = p_i$. The
probabilities $p_i$ must satisfy the following:

1: $0 \le p_i \le 1$ for each $i$
2: $p_1 + p_2 + \dots + p_k = 1$.
Example
Suppose a variable X can take the values 1, 2, 3, or 4.
The probabilities associated with each outcome are described by the following table:

Outcome       1     2     3     4
Probability   0.1   0.3   0.4   0.2

The probability that X is equal to 2 or 3 is the sum of the two probabilities: P(X = 2 or X = 3) = P(X
= 2) + P(X = 3) = 0.3 + 0.4 = 0.7. Similarly, the probability that X is greater than 1 is equal to 1 - P(X
= 1) = 1 - 0.1 = 0.9, by the complement rule. This distribution may also be described by
a probability histogram, with one bar per outcome whose height is that outcome's probability.
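The arithmetic in this example can be reproduced in R. The sketch below uses the probabilities stated in the text, P(X = 1) = 0.1, P(X = 2) = 0.3 and P(X = 3) = 0.4; the value P(X = 4) = 0.2 is not quoted in the text but follows from the requirement that the probabilities sum to 1. The variable names are illustrative.

```r
x <- 1:4
p <- c(0.1, 0.3, 0.4, 0.2)   # P(X = 4) = 0.2 follows from the sum-to-one rule

# Check the two conditions on a discrete probability distribution.
valid <- all(p >= 0 & p <= 1) && isTRUE(all.equal(sum(p), 1))

# P(X = 2 or X = 3): add the probabilities of the two outcomes.
p_2_or_3 <- sum(p[x %in% c(2, 3)])

# P(X > 1) by the complement rule: 1 - P(X = 1).
p_gt_1 <- 1 - p[x == 1]

print(valid)
print(p_2_or_3)
print(p_gt_1)
```

A probability histogram of this distribution could be drawn with barplot(p, names.arg = x).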
Continuous Random Variables
A continuous random variable is one which takes an infinite number of
possible values. Continuous random variables are usually
measurements. Examples include height, weight, the amount of sugar
in an orange, the time required to run a mile.
(Definition taken from Valerie J. Easton and John H. McColl's Statistics
Glossary v1.1)
A continuous random variable is not defined at specific values. Instead,
it is defined over an interval of values, and is represented by the area
under a curve (in advanced mathematics, this is known as an integral).
The probability of observing any single value is equal to 0, since the
number of values which may be assumed by the random variable is
infinite.
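Suppose a random variable X may take all values over an interval of real
numbers. Then the probability that X is in the set of outcomes A, P(A),
is defined to be the area above A and under a curve. The curve, which
represents a function p(x), must satisfy the following:
1: $p(x) \ge 0$ for all $x$ (the curve has no negative values)
2: $\int_{-\infty}^{\infty} p(x)\,dx = 1$ (the total area under the curve is equal to 1)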
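The idea of probability as area under a curve can be checked numerically. The sketch below assumes, purely for illustration, that p(x) is the standard normal density (dnorm in R); the source does not name a particular curve. It uses integrate() to compute areas and compares the result with the built-in cumulative distribution function pnorm().

```r
# Assumption for illustration: p(x) is the standard normal density, dnorm().

# The total area under the curve should be 1.
total_area <- integrate(dnorm, lower = -Inf, upper = Inf)$value

# P(-1 < X < 1) is the area under the curve above the interval (-1, 1).
p_interval <- integrate(dnorm, lower = -1, upper = 1)$value

# The same probability from the cumulative distribution function.
p_check <- pnorm(1) - pnorm(-1)

print(total_area)
print(p_interval)   # approximately 0.683
print(p_check)
```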
Missing Values
In statistics, missing data, or missing values, occur when no data value
is stored for the variable in an observation. Missing data are a common
occurrence and can have a significant effect on the conclusions that can
be drawn from the data.
• Missing values are common in many real-life
datasets. They can bias the results of machine learning
models and/or reduce their accuracy.
Why Is Data Missing From The Dataset?
• There can be multiple reasons why certain values are missing from the data.
• The reason the data is missing affects how the missing data should be
handled.
• So it is necessary to understand why the data could be missing.

Some of the reasons are listed below:
• Past data might get corrupted due to improper maintenance.
• Observations are not recorded for certain fields. There might be a
failure in recording the values due to human error.
• The user has intentionally not provided the values.
Types Of Missing Value

Missing Completely At Random (MCAR)
• In MCAR, the probability of data being missing is the same for all the
observations.
• In this case, there is no relationship between the missing data and any other
values observed or unobserved (the data which is not recorded) within the given
dataset.
• That is, missing values are completely independent of other data. There is no
pattern.

In the case of MCAR, the data could be missing due to human error,
some system/equipment failure, loss of sample, or some unsatisfactory
technicalities while recording the values.
For example, suppose a library has some overdue books, and some
values for overdue books are missing in the computer system. The
reason might be human error, such as the librarian forgetting to type in the
values. So, the missing values of overdue books are not related to any
other variable/data in the system.
MCAR should not be assumed by default, as it is a rare case. The advantage of such data
is that the statistical analysis remains unbiased.
Missing At Random (MAR)
• Missing at random (MAR) means that the reason for missing values can be
explained by variables on which you have complete information as there is
some relationship between the missing data and other values/data.
• In this case, the data is not missing for all the observations. It is missing
only within sub-samples of the data and there is some pattern in the
missing values.
• For example, if you check the survey data, you may find that all the people
have answered their ‘Gender’, but ‘Age’ values are mostly missing for
people who have answered their ‘Gender’ as ‘female’ (for instance, because
these respondents were less willing to reveal their age).
• So, the probability of data being missing depends only on the observed
data.

• In this case, the variables ‘Gender’ and ‘Age’ are related: the
reason for the missing values of the ‘Age’ variable can be explained by the
‘Gender’ variable, but you cannot predict the missing value itself.
• Suppose a poll is taken about the overdue books of a library. Gender and the
number of overdue books are asked in the poll. Assume that most
women answer the poll and men are less likely to answer. So why
the data is missing can be explained by another factor, namely gender.
• In this case, the statistical analysis might result in bias.
Missing Not At Random (MNAR)
• Missing values depend on the unobserved data.
• If there is some structure/pattern in missing data and other observed
data cannot explain it, then it is Missing Not At Random (MNAR).
• If the missing data does not fall under MCAR or MAR, then it can
be categorized as MNAR.
• It can happen due to the reluctance of people to provide the
required information. A specific group of people may not answer
some questions in a survey.
• For example, suppose the name and the number of overdue books
are asked in a poll for a library. People with no
overdue books are likely to answer the poll, while people with more
overdue books are less likely to answer.
• So in this case, whether the number of overdue books is missing
depends on the unobserved value itself: people with more overdue books
are the ones who do not report it.
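The three mechanisms can be compared with a small simulation built on the library example. Everything in the sketch below is an assumption made for illustration: the sample size, the Poisson model for the number of overdue books, and the missingness probabilities. It deletes values under each mechanism and then compares the mean of the remaining values with the true mean.

```r
set.seed(1)
n <- 10000

# Simulated poll data (all parameters are illustrative assumptions).
gender  <- sample(c("female", "male"), n, replace = TRUE)
overdue <- rpois(n, lambda = ifelse(gender == "male", 3, 2))

# MCAR: every value has the same 30% chance of being missing.
mcar <- overdue
mcar[runif(n) < 0.3] <- NA

# MAR: the chance of being missing depends only on the observed gender.
mar <- overdue
mar[runif(n) < ifelse(gender == "male", 0.5, 0.1)] <- NA

# MNAR: the chance of being missing depends on the unobserved value itself.
mnar <- overdue
mnar[runif(n) < pmin(0.15 * overdue, 0.9)] <- NA

# Compare the mean of the observed values with the true mean.
means <- c(
  true = mean(overdue),
  mcar = mean(mcar, na.rm = TRUE),
  mar  = mean(mar,  na.rm = TRUE),
  mnar = mean(mnar, na.rm = TRUE)
)
print(round(means, 2))
```

Under these assumptions the MCAR mean stays close to the true mean, while the MAR and MNAR means are pulled downwards. In the MAR case the bias can be corrected by analysing each gender separately, because gender is observed; in the MNAR case the observed data alone cannot correct it.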
How to deal with missing values using R
R has several packages for dealing with missing data.

List of R Packages
• mice
• Amelia
• missForest
• Hmisc
• mi
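Before reaching for one of these packages, the basic steps can be tried with base R alone. The sketch below does not use any of the listed packages, so that it runs without extra installation; the data frame survey and its values are made up for illustration. It counts missing values, removes incomplete rows, and applies mean imputation, which is only a simple baseline.

```r
# A small made-up survey with missing values coded as NA.
survey <- data.frame(
  gender = c("female", "male", "female", "male", "female"),
  age    = c(34, NA, 29, 41, NA),
  books  = c(2, 0, NA, 1, 3),
  stringsAsFactors = FALSE
)

# Count the missing values in each column.
na_counts <- colSums(is.na(survey))
print(na_counts)

# Option 1: listwise deletion, keeping only rows with no missing value.
complete_rows <- na.omit(survey)
print(nrow(complete_rows))

# Option 2: mean imputation, replacing each missing age with the observed mean.
imputed <- survey
imputed$age[is.na(imputed$age)] <- mean(survey$age, na.rm = TRUE)
print(imputed$age)
```

Listwise deletion discards information, and mean imputation understates variability; which approach is acceptable depends on why the data is missing.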
Decoding the job description

The data analyst role is one of many job titles that contain the word “analyst.” To name a few
others that sound similar but may not be the same role:

• Business analyst — analyzes data to help businesses improve processes, products, or services
• Data analytics consultant — analyzes the systems and models for using data
• Data engineer — prepares and integrates data from different sources for analytical use
• Data scientist — uses expert skills in technology and social science to find trends through data
analysis
• Data specialist — organizes or converts data for use in databases or software systems
• Operations analyst — analyzes data to assess the performance of business operations and
workflows
The six data analysis phases

There are six data analysis phases that will help you make seamless decisions: ask,
prepare, process, analyze, share, and act. Keep in mind, these are different from
the data life cycle, which describes the changes data goes through over its lifetime.
Let’s walk through the steps to see how they can help you solve problems you
might face on the job.
Step 1: Ask
It’s impossible to solve a problem if you don’t know what it is. These are some things to consider:
• Define the problem you’re trying to solve
• Make sure you fully understand the stakeholder’s expectations
• Focus on the actual problem and avoid any distractions
• Collaborate with stakeholders and keep an open line of communication
• Take a step back and see the whole situation in context

Questions to ask yourself in this step:


• What are my stakeholders saying their problems are?
• Now that I’ve identified the issues, how can I help the stakeholders resolve their questions?
Step 2: Prepare
You will decide what data you need to collect in order to answer your questions and how to organize
it so that it is useful. You might use your business task to decide:
• What metrics to measure
• Where to locate data in your database
• What security measures to create to protect that data

Questions to ask yourself in this step:
• What do I need to figure out to solve this problem?
• What research do I need to do?
Step 3: Process

Clean data is the best data and you will need to clean up your data to get rid of any possible errors,
inaccuracies, or inconsistencies. This might mean:
• Using spreadsheet functions to find incorrectly entered data
• Using SQL functions to check for extra spaces
• Removing repeated entries
• Checking as much as possible for bias in the data

Questions to ask yourself in this step:


• What data errors or inaccuracies might get in my way of getting the best possible answer to the
problem I am trying to solve?
• How can I clean my data so the information I have is more consistent?
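The cleaning tasks in this step, such as finding extra spaces and removing repeated entries, can also be done in R. The sketch below is one possible approach using base R; the data frame raw and its contents are invented for illustration.

```r
# Made-up records with stray spaces, inconsistent case, and repeats.
raw <- data.frame(
  name = c("Asha ", " Ravi", "Asha", "Meena", "meena"),
  city = c("Dehradun", "Delhi", "Dehradun", "Pune", "Pune"),
  stringsAsFactors = FALSE
)

# Count the entries that carry leading or trailing spaces.
n_spaces <- sum(raw$name != trimws(raw$name))

clean <- raw
clean$name <- trimws(clean$name)       # remove leading and trailing spaces
clean$name <- tolower(clean$name)      # make the case consistent
clean <- clean[!duplicated(clean), ]   # drop repeated entries

print(n_spaces)
print(clean)
```

Normalising the names before calling duplicated() matters here: "Asha " and "Asha" only count as the same entry once the stray space is gone.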
Step 4: Analyze

You will want to think analytically about your data. At this stage, you might sort and format your
data to make it easier to:
• Perform calculations
• Combine data from multiple sources
• Create tables with your results

Questions to ask yourself in this step:


• What story is my data telling me?
• How will my data help me solve this problem?
• Who needs my company’s product or service? What type of person is most likely to use it?
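The analysis tasks named in this step (sorting, performing calculations, combining sources, and creating result tables) can be sketched in base R as follows. The data frames sales and targets and all their values are hypothetical.

```r
# Two made-up data sources.
sales <- data.frame(
  region = c("North", "South", "North", "East", "South"),
  amount = c(120, 80, 200, 50, 70),
  stringsAsFactors = FALSE
)
targets <- data.frame(
  region = c("East", "North", "South"),
  target = c(100, 300, 100),
  stringsAsFactors = FALSE
)

# Sort: largest sale first.
sorted <- sales[order(-sales$amount), ]

# Perform calculations: total sales per region.
totals <- aggregate(amount ~ region, data = sales, FUN = sum)

# Combine data from multiple sources: join the totals with the targets.
report <- merge(totals, targets, by = "region")

# Create a table with the results.
report$pct_of_target <- round(100 * report$amount / report$target)
print(report)
```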
Step 5: Share
Everyone shares their results differently, so be sure to summarize your results with clear
and enticing visuals of your analysis, using tools like graphs or dashboards. This is
your chance to show the stakeholders you have solved their problem and how you got
there. Sharing will certainly help your team:
• Make better decisions
• Make more informed decisions
• Reach stronger outcomes
• Successfully communicate your findings

Questions to ask yourself in this step:


• How can I make what I present to the stakeholders engaging and easy to understand?
• What would help me understand this if I were the listener?
Step 6: Act
Now it’s time to act on your data. You will take everything you have learned from your data
analysis and put it to use. This could mean providing your stakeholders with
recommendations based on your findings so they can make data-driven decisions.
Questions to ask yourself in this step:
• How can I use the feedback I received during the share phase (step 5) to actually meet
the stakeholder’s needs and expectations?
These six steps can help you break the data analysis process into smaller, manageable
parts, which is called structured thinking. This process involves four basic activities:
• Recognizing the current problem or situation
• Organizing available information
• Revealing gaps and opportunities
• Identifying your options
Data life cycle
• Plan for the users who will work within a spreadsheet by developing
organizational standards. This can mean formatting your cells, the headings you
choose to highlight, the color scheme, and the way you order your data points.
When you take the time to set these standards, you will improve communication,
ensure consistency, and help people be more efficient with their time.

• Capture data by the source by connecting spreadsheets to other data sources,
such as an online survey application or a database. This data will automatically be
updated in the spreadsheet. That way, the information is always as current and
accurate as possible.
• Manage different kinds of data with a spreadsheet. This can involve
storing, organizing, filtering, and updating information. Spreadsheets also
let you decide who can access the data, how the information is shared, and
how to keep your data safe and secure.
• Analyze data in a spreadsheet to help make better decisions. Some of the
most common spreadsheet analysis tools include formulas to aggregate
data or create reports, and pivot tables for clear, easy-to-understand
visuals.
• Archive any spreadsheet that you don’t use often, but might need to
reference later with built-in tools. This is especially useful if you want to
store historical data before it gets updated.
• Destroy your spreadsheet when you are certain that you will never need it
again, if you have better backup copies, or for legal or security reasons.
Keep in mind, lots of businesses are required to follow certain rules or have
measures in place to make sure data is destroyed properly.
Data anonymization
Data anonymization is the process of protecting people's private or sensitive data
by eliminating that kind of information. Typically, data anonymization involves
blanking, hashing, or masking personal information, often by using fixed-length
codes to represent data columns, or hiding data with altered values.
Organizations have a responsibility to protect their data and the personal
information that data might contain. As a data analyst, you might be expected to
understand what data needs to be anonymized, but you generally wouldn't be
responsible for the data anonymization itself. A rare exception might be if you work
with a copy of the data for testing or development purposes. In this case, you could
be required to anonymize the data before you work with it.
What types of data should be anonymized?
Healthcare and financial data are two of the most sensitive types of data. These industries
rely a lot on data anonymization techniques. After all, the stakes are very high. That’s why
data in these two industries usually goes through de-identification, which is a process
used to wipe data clean of all personally identifying information.
Data anonymization is used in just about every industry. That is why it is so important for
data analysts to understand the basics. Here is a list of data that is often anonymized:
• Telephone numbers
• Names
• License plates and license numbers
• Social security numbers
• IP addresses
• Medical records
• Email addresses
• Photographs
• Account numbers
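Blanking, masking, and fixed-length codes, as described above, can be sketched in base R. The data frame customers and every value in it are fictitious, and the code format "P0001" is an arbitrary choice for this example. Hashing is omitted from this sketch.

```r
# Fictitious customer records.
customers <- data.frame(
  name  = c("Asha Rao", "Ravi Kumar", "Asha Rao"),
  phone = c("9876543210", "9123456780", "9876543210"),
  email = c("asha@example.com", "ravi@example.com", "asha@example.com"),
  spend = c(250, 400, 125),
  stringsAsFactors = FALSE
)

anon <- customers

# Fixed-length codes: the same person always receives the same code.
anon$name <- sprintf("P%04d", match(customers$name, unique(customers$name)))

# Masking: hide every digit of the phone number except the last two.
n_digits   <- nchar(customers$phone)
anon$phone <- paste0(strrep("*", n_digits - 2),
                     substring(customers$phone, n_digits - 1))

# Blanking: remove the email addresses entirely.
anon$email <- NA_character_

print(anon)
```

The non-identifying column spend is left untouched, so the anonymized copy can still be used for analysis.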
Bias
Bias: A preference in favor of or against a person, group of people, or
thing.
Data Bias: A type of error that systematically skews results in a certain
direction.
Sampling Bias: When a sample is not representative of the population
as a whole.
Unbiased sampling: When a sample is representative of the population
being measured.
More types of data bias:
• Observer bias
• Interpretation bias
• Confirmation bias
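Sampling bias, as defined above, can be demonstrated with a small simulation. The population below is invented for illustration: library members split into two groups with different borrowing habits. The sketch compares a random sample, which is representative, with a sample drawn from only one group.

```r
set.seed(42)

# Invented population: 8000 students and 2000 staff with different habits.
population <- data.frame(
  group = rep(c("student", "staff"), times = c(8000, 2000)),
  books = c(rpois(8000, lambda = 1), rpois(2000, lambda = 4)),
  stringsAsFactors = FALSE
)
true_mean <- mean(population$books)

# Unbiased sampling: every member has the same chance of selection.
random_idx  <- sample(nrow(population), 500)
random_mean <- mean(population$books[random_idx])

# Sampling bias: only staff are surveyed, so the sample is not representative.
staff_idx   <- which(population$group == "staff")
biased_mean <- mean(population$books[sample(staff_idx, 500)])

print(round(c(true = true_mean, random = random_mean, biased = biased_mean), 2))
```

Under these assumptions the random sample lands near the population mean, while the staff-only sample overstates it by a wide margin.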
