
Data Analytics - Advanced

Foundation Skills Academy


Index - Advanced

1. Quick Recap of Basics & Intermediate

2. SQL for Data Querying

3. Data Analysis with R Programming

4. Power of Statistics

5. Nuts and Bolts of Machine Learning

6. Logistic Regression
Index - Basics

1. Introduction to Data Analytics


2. Overview of various stages
• Data Collection and Storage
• Data Organization and Cleaning
• Data Analysis
• Data Sharing - Insights & Recommendations

Appendix
1. New age terminologies - Data Science vs. AI vs. ML
2. Functional analytics - Marketing Analytics
Index - Intermediate

1. Data Extraction & Storage - Methods & Technologies

2. Overview of Languages for Data Professionals - SQL, Python, R, Shell script

3. Data Analysis – Linear Regression Analysis

4. Use Case of Regression Analysis - Marketing Mix Modeling


SQL for Data Querying
Introduction to SQL

• SQL, or Structured Query Language, is a querying language designed for accessing and manipulating information from,
mostly, though not exclusively, relational databases.
• Using SQL, we can write a set of instructions to perform operations such as inserting, updating, and deleting records in a
database; creating new databases, tables, and views; and writing stored procedures - which means you can write a set of
instructions and call them for later use.
• Here are some advantages of using SQL:
• SQL is portable and can be used independently of the platform. It can be used for querying data in a wide variety of
databases and data repositories, although each vendor may have some variations and special extensions.
• It has a simple syntax that is similar to the English language. Its syntax allows developers to write programs with
fewer lines than some other programming languages, using basic keywords such as SELECT, INSERT INTO, and
UPDATE.
• It can retrieve large amounts of data quickly and efficiently.
• It runs on an interpreter system, which means code can be executed as soon as it is written, making prototyping quick.
• Due to its large user community and the sheer volume of documentation accumulated over the years, it continues to
provide a uniform platform, worldwide, to all its users.
Categories of SQL statements

• There are two main categories of SQL statements which are essential for interacting with relational databases,
• Data Definition Language (DDL)
• Data Manipulation Language (DML)

Data Definition Language (DDL)

• DDL statements are used to define, change, or drop database objects such as tables.

• Common DDL statements include CREATE (to create tables), ALTER (to modify tables), TRUNCATE (to delete data without
removing the table), and DROP (to delete tables).

Data Manipulation Language (DML)

• DML statements are used to read and modify data within tables, often referred to as CRUD operations (Create, Read,
Update, Delete).

• Common DML statements include INSERT (to add data), SELECT (to retrieve data), UPDATE (to modify existing data), and
DELETE (to remove data).
Basic SQL Queries - Select

• The SELECT statement is a fundamental command used to retrieve data from a database table, with the simplest form
being SELECT * FROM table_name

• You can specify particular columns to retrieve, such as SELECT book_id, title FROM book, allowing for more targeted
data retrieval.

• Using the WHERE Clause: The WHERE clause is used to filter results based on specific conditions, known as predicates,
which evaluate to true, false, or unknown.

• Comparison Operators: Various comparison operators can be used in the WHERE clause, including equal to, greater
than, less than, greater than or equal to, less than or equal to, and not equal to. These operators help refine queries to
retrieve only the relevant data from the database.
Basic SQL Queries - Select

• Using String Patterns: The WHERE clause can utilize the LIKE predicate to search for patterns in a column, using the
percent sign (%) as a wildcard character to represent missing letters.

• For example, SELECT first_name FROM author WHERE first_name LIKE 'R%' retrieves all authors whose first names start
with the letter R.

• Utilizing Ranges: Instead of using multiple comparison operators, the BETWEEN operator can simplify queries by specifying
a range of values. For instance, WHERE pages BETWEEN 290 AND 300 retrieves books with a page count within that range,
making the query easier to write.

• Working with Sets of Values: The IN operator allows for specifying multiple values in a WHERE clause without lengthy
repetition. For example, WHERE country IN ('Australia', 'Brazil', 'Canada') efficiently retrieves authors from those countries.

• Scalar functions operate on individual values, such as rounding numbers with SELECT ROUND(COST) FROM
PETRESCUE.

• String functions manipulate text values, for instance, SELECT UPPER(ANIMAL) FROM PETRESCUE converts animal
names to uppercase.
Basic SQL Queries - Select

COUNT function
The COUNT function retrieves the number of rows that match specific query criteria. For example, to get the total number of rows
in a table, you can use: SELECT COUNT(*) FROM tablename
To count the number of medal recipients from Canada in a MEDALS table, the query would be: SELECT COUNT(COUNTRY)
FROM MEDALS WHERE COUNTRY='CANADA'

DISTINCT expression
The DISTINCT expression is used to eliminate duplicate values from a result set. For instance, to retrieve unique values in a
column, you can use: SELECT DISTINCT columnname FROM tablename
In the MEDALS table, to get a list of unique countries that received gold medals, the query would be: SELECT DISTINCT
COUNTRY FROM MEDALS WHERE MEDALTYPE = 'GOLD'

LIMIT expression
The LIMIT expression restricts the number of rows retrieved from the database. For example, to get just the first 10 rows, you can
use: SELECT * FROM tablename LIMIT 10
Basic SQL Queries - Select

Sorting Data: The ORDER BY clause is used to sort the result set by a specified column, with the default sorting being in
ascending order. For descending order, you can add the keyword DESC, such as ORDER BY title DESC, which will sort the titles
from Z to A.
SELECT * FROM Products ORDER BY Price;

Aggregate functions, such as SUM, MIN, MAX, and AVG, take a collection of values and return a single value.

Grouping Data: The GROUP BY clause allows you to group results into subsets based on matching values in one or more
columns, for example to add up the salary for each employee name.
SELECT name, SUM(sal) FROM emp GROUP BY name
You can use the COUNT function in conjunction with GROUP BY
SELECT SUBJECT, YEAR, Count(*) FROM Student GROUP BY SUBJECT, YEAR
Restricting Results with Conditions
To further filter grouped results, the HAVING clause is used, which applies conditions specifically to grouped data.
SELECT NAME, SUM(sal) FROM Emp GROUP BY name HAVING SUM(sal)>50000
Basic SQL Queries - Select

DATE FUNCTIONS
• The DAY function can extract the day from a date, for example:
SELECT DAY(RESCUEDATE) FROM PETRESCUE WHERE ANIMAL = 'cat';

• You can count records based on specific months using the MONTH function, like so:
SELECT COUNT(*) FROM PETRESCUE WHERE MONTH(RESCUEDATE) = 5;

• You can also determine the number of days since a rescue date using the CURRENT_DATE function:
SELECT (CURRENT_DATE - RESCUEDATE) FROM PETRESCUE;

• Date arithmetic allows you to calculate future dates, such as finding a date three days after a rescue date:
SELECT (RESCUEDATE + 3 DAYS) FROM PETRESCUE;
Basic SQL Queries - Select

Sub-queries in SQL
Sub-queries are regular queries enclosed in parentheses and can be used in various parts of a main query, such as the WHERE
clause, SELECT list, or FROM clause. They help overcome limitations of aggregate functions, like using the AVG function in the
WHERE clause.
To find employees earning more than the average salary, you can use a sub-query:
SELECT EMP_ID, F_NAME, L_NAME, SALARY FROM employees WHERE SALARY > (SELECT AVG(SALARY) FROM
employees)

Sub-queries can also be used in the SELECT list to compare individual salaries with the average salary:
SELECT EMP_ID, SALARY, (SELECT AVG(SALARY) FROM employees) AS AVG_SALARY FROM employees;

Sub-queries can act as derived tables, providing a data source for the outer query, which can simplify complex queries involving
multiple tables.
SELECT * FROM (SELECT EMP_ID, F_NAME, L_NAME, DEP_ID FROM employees) AS EMP4ALL;
Sub-queries allow you to filter results from one table based on data from another table. For example, you can retrieve employee
records where the department ID exists in the departments table.
Basic SQL Queries – Insert, Update, Delete

The INSERT statement is used to add new rows to a table and is part of the data manipulation language (DML) statements
The syntax of the INSERT statement is structured as: INSERT INTO table_name (column_name1, column_name2, ...) VALUES
(value1, value2, ...).
It's crucial that the number of values matches the number of columns specified to ensure each column receives a value.
You can insert multiple rows in a single statement by separating each row's values with commas.
INSERT INTO author (author_id, last_name, first_name, email, city, country) VALUES ('A1', 'Chong', 'Raul',
'RFC@IBM.com', 'Toronto', 'CA'), ('A2', 'Ahuja', 'Rav', 'Rav@IBM.com', 'Toronto', 'CA');

The UPDATE statement is used to modify existing data in a table, following the syntax: UPDATE [TableName] SET
[ColumnName] = [Value] WHERE [Condition].
The WHERE clause is crucial as it specifies which rows to update; omitting it will result in all rows being modified.

The DELETE statement is used to remove rows from a table, with the syntax: DELETE FROM [TableName] WHERE [Condition]
Similar to the UPDATE statement, the WHERE clause determines which rows to delete; failing to include it will remove all rows
from the table.
Basic SQL Queries – Create, Alter, Drop, Truncate

The syntax begins with "CREATE TABLE" followed by the table name, with the rest of the statement enclosed in parentheses.
Each column is defined by its name, datatype, and optional constraints, separated by commas.
CREATE TABLE provinces (id CHAR(2) PRIMARY KEY NOT NULL, name VARCHAR(24));

The ALTER TABLE statement allows you to add or remove columns, modify data types, and manage keys and constraints
For example, to add a telephone number column to the AUTHOR table, you would use:
ALTER TABLE author ADD COLUMN telephone_number BIGINT;

The DROP TABLE statement is used to delete an entire table from a database, along with its data.
The syntax for this statement is straightforward: DROP TABLE table_name;

The TRUNCATE TABLE statement deletes all rows in a table without removing the table itself
The syntax for truncating a table is: TRUNCATE TABLE table_name;
SQL Queries - Join

The JOIN operator is used to combine rows from two or more tables based on a relationship between certain columns, typically
involving primary and foreign keys.
A primary key uniquely identifies each row in a table, while a foreign key refers to a primary key in another table, establishing a
link between the two.
Types of JOINs
The most common type of join is the Inner join, which returns only the rows with matching values in the common columns.

SELECT column_name(s)
FROM table1
INNER JOIN table2
ON table1.column_name = table2.column_name;
SQL Queries - Join

Outer joins return matching rows as well as rows from one or both tables that do not match, allowing for a broader data set.
Types of Outer Joins
A left outer join includes all rows from the left table and only the matching rows from the right table, displaying null values for
non-matching rows.
SELECT column_name(s) FROM table1 LEFT JOIN table2 ON table1.column_name = table2.column_name;

A right outer join includes all rows from the right table and only the matching rows from the left table
SELECT column_name(s) FROM table1 RIGHT JOIN table2 ON table1.column_name = table2.column_name;

A full outer join returns all rows from both tables, including matching rows and non-matching rows from both sides. This type of
join is useful when you want to see all data, regardless of whether there are matches in the other table.
SELECT column_name(s) FROM table1 FULL OUTER JOIN table2 ON table1.column_name = table2.column_name
SQL Queries - View

• A view is a virtual table based on the result-set of an SQL statement. A view contains rows and columns, just like a real table.
The fields in a view are fields from one or more real tables in the database.
• You can add SQL statements and functions to a view and present the data as if the data were coming from one single table.
• A view is created with the CREATE VIEW statement.
CREATE VIEW view_name AS
SELECT column1, column2, ...
FROM table_name
WHERE condition;
• A view always shows up-to-date data! The database engine recreates the view, every time a user queries it.

When to use a view:


• Views can be used to display selected data while omitting sensitive information, such as salaries or birth dates.
• They simplify data access by allowing users to query a view without needing direct access to the underlying tables.
SQL Queries – Stored Procedure

• A stored procedure is a prepared SQL code that you can save, so the code can be reused over and over again. If you have an
SQL query that you write over and over again, save it as a stored procedure, and then just call it to execute it.
• You can also pass parameters to a stored procedure, so that the stored procedure can act based on the parameter value(s)
that is passed.

CREATE PROCEDURE procedure_name
AS
sql_statement
GO;


EXEC procedure_name;

• Stored Procedure With Multiple Parameters


CREATE PROCEDURE SelectAllCustomers @City nvarchar(30), @PostalCode nvarchar(10)
AS
SELECT * FROM Customers WHERE City = @City AND PostalCode = @PostalCode
GO;

EXEC SelectAllCustomers @City = 'London', @PostalCode = 'WA1 1DP';

Tip: Views are primarily used for data presentation, while stored procedures are used for performing actions on data.
Data Analysis with R Programming
Programming Language for Data Professionals – R

• R is a free, open-source programming language primarily used for statistical computing, data analysis, and powerful data
visualization

• R is freely accessible to anyone, allowing for wide adoption and community development

• It is known for its extensive collection of packages that provide a wide range of statistical functions and tools

• R excels in creating high-quality, publication-ready visualizations with packages like ggplot2, allowing for customization
and exploration of data through various charts and graphs

• RStudio: A popular integrated development environment (IDE) that provides a user-friendly interface for writing, executing, and
visualizing R code
Basic building blocks

• R is like the engine of a car, while RStudio provides the tools to control and interact with it, such as the accelerator, steering
wheel, and dashboard.

• R packages are collections of reusable functions and tools, offering a wide range of functionalities.

• The Tidyverse is a popular collection of R packages designed for data manipulation, visualization, and modeling.

• Functions in R are reusable blocks of code designed to perform specific tasks. They streamline code and make it efficient.

• Variables act as containers for storing data values, allowing you to easily manage and manipulate data. age <- 25

• Vectors are collections of data elements of the same type. Vector of strings: fruits <- c("banana", "apple", "orange")

• Operators are symbols that represent actions to be performed, such as calculations in a formula. They are used similarly to
how they are used in spreadsheets and SQL - Arithmetic, Relational (<, >, ==), Logical (&&, ||, !), Assignment

• Assignment operators, like <-, assign values to variables. For example, sales <- c(100, 150, 120) would store those
sales figures in the variable "sales."
RStudio
Working with Scripts in RStudio
Scripts in RStudio help you save your code and provide a record of your work. You can run your code within a script, and the
output will appear in the console.

x <- 7
if (x > 0) {
  print("x is a positive number")
} else {
  print("x is either a negative number or zero")
}
Since 7 is greater than 0, the condition of the if statement is true. So, when you run the code, R prints out “x is a positive
number”.

x <- -1
# run the code
if (x < 0) {
  print("x is a negative number")
} else if (x == 0) {
  print("x is zero")
} else {
  print("x is a positive number")
}
Pipes

• Pipes in R help to make your code more efficient and easier to read by representing a sequence of multiple operations.
• They simplify complex nested functions, where one function is embedded within another, by using the pipe operator (%>%) to
chain operations together.
Benefits of Using Pipes
• Pipes enhance code readability, making it easier to understand the flow of operations.
• They promote code efficiency by reducing the need for intermediate variables and nested function calls.
• Pipes make it simple to add or modify steps in your data analysis workflow.
Example
# Without pipe
res1 <- add(2, 4)
res2 <- mul(res1, 8)
res3 <- div(res2, 2)
print(res3)
Output: # [1] 24

# Using pipe
res <- add(2, 4) %>% mul(8) %>% div(2)
print(res)
Output: # [1] 24
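
Note that add, mul, and div are not base R functions; the example assumes simple user-defined helpers. A minimal sketch that
makes the example above runnable (the %>% operator comes from the magrittr package, which the tidyverse loads):

library(magrittr)               # provides the %>% pipe operator

# Hypothetical helper functions assumed by the example above
add <- function(x, y) x + y
mul <- function(x, y) x * y
div <- function(x, y) x / y

res <- add(2, 4) %>% mul(8) %>% div(2)
print(res)                      # [1] 24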
Exploring Tidyverse

• The tidyverse is a collection of R packages designed for data science, making data analysis and manipulation more efficient.
• Eight core tidyverse packages: ggplot2, tidyr, readr, dplyr, tibble, purrr, stringr, and forcats
Top 4 packages
• ggplot2 is used for creating various data visualizations, allowing you to apply different visual properties to data variables.
• dplyr provides functions for common data manipulation tasks like selecting variables and filtering data based on conditions.
• tidyr helps clean data, ensuring every data point is in the correct format and location within a data table or data frame.
• readr imports data from external sources, with its read_csv function commonly used for importing CSV files into R.
Next 4 packages
• tibble works with data frames. A data frame is like a spreadsheet with columns for variables and rows for observations,
making data easy to read and use. Tibbles are similar to data frames but are designed to be easier to work with.
• purrr makes working with functions and vectors easier. The most prominent function in purrr is map(), which iterates through
elements of a vector and applies a specified function to each element, returning a list of results
• stringr includes functions that make it easier to work with strings.
• forcats provides tools that solve common problems with factors. As a quick reminder, factors store categorical data in R where
the data values are limited and usually based on a finite group like country or year.
Working with Data Frame

Data frames are fundamental in R for data manipulation and analysis, mirroring the structure of spreadsheets with columns and
rows.
• The head() function displays the first few rows of a data frame, providing a sneak peek without printing the entire dataset.
• str() reveals the data frame's structure, including column names and data types, while colnames() lists the column names.

Tibbles are a little different from standard data frames. A data frame is a collection of columns, like a spreadsheet or a SQL
table. Tibbles are like streamlined data frames that are automatically set to pull up only the first 10 rows of a dataset, and only as
many columns as can fit on the screen. This is really useful when you’re working with large sets of data
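
A minimal sketch of these inspection functions, using the built-in mtcars data frame (as_tibble() from the tibble package
converts it to a tibble):

library(tibble)

head(mtcars)        # first six rows of the built-in mtcars data frame
str(mtcars)         # structure: column names and data types
colnames(mtcars)    # just the column names

as_tibble(mtcars)   # the same data as a tibble: prints only the first 10 rows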
Exploring Functions in R

Sorting and Saving Data


• The arrange function in R allows you to sort your data based on a specific column, either in ascending or descending order.
• To save the sorted data as a new data frame, you can assign the arranged output to a new data frame name.
Grouping and Summarizing Data
• The group_by function lets you group data based on a specific column, and it's often used with other functions like summarize.
• You can calculate summary statistics (like mean or max) for each group using the summarize function after grouping the data.
Filtering Data
• The filter function allows you to extract specific rows from your data frame based on a condition.
Splitting and Combining Columns
• The separate function is used to split a column into multiple columns. For instance, you can split a "full name" column into "first
name" and "last name" columns.
• Conversely, the unite function merges multiple columns into one. You can combine "first name" and "last name" to "full name"
Creating New Variables
The mutate function lets you create new variables or modify existing ones in your data frame. This function can perform
calculations, such as converting grams to kilograms, and add the results as a new column, as shown in the sketch below.
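
A minimal sketch that strings several of these functions together on a small, made-up data frame (the pets data and its
columns are hypothetical):

library(dplyr)
library(tidyr)

# Hypothetical data for illustration
pets <- data.frame(
  full_name = c("Rex Brown", "Milo Smith", "Bella Jones"),
  species   = c("dog", "cat", "dog"),
  weight_g  = c(12000, 4500, 9800)
)

pets %>%
  separate(full_name, into = c("first_name", "last_name"), sep = " ") %>%  # split one column into two
  mutate(weight_kg = weight_g / 1000) %>%                                  # create a new variable
  filter(weight_kg > 4) %>%                                                # keep only rows matching a condition
  arrange(desc(weight_kg)) %>%                                             # sort in descending order
  group_by(species) %>%
  summarize(avg_kg = mean(weight_kg))                                      # one summary statistic per group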
Visualization in R

• ggplot2 is one of the core packages of the tidyverse used for creating diverse and sophisticated data visualizations, from simple
charts to intricate interactive graphs and maps.
• ggplot2 provides a structured approach to building visualizations, allowing you to create various plots using a consistent set of
building blocks.

• Out of these components, ggplot2 needs at least the following three to produce a chart: data, a mapping, and a layer.

• The scales, facets, coordinates, and themes have sensible defaults that take away a lot of finicky work.
Visualization in R
Data
• As the first step in many plots, you would pass the data to the ggplot() function, which stores the data to be used later by other
parts of the plotting system. For example, if we intend to make a graphic about the mpg dataset, we would start as follows:
ggplot(data = mpg)

Mapping
• The mapping of a plot is a set of instructions on how parts of the data are mapped onto aesthetic attributes of geometric
objects. It is the ‘dictionary’ to translate data to the graphics system. A mapping can be made by using the aes() function to
make pairs of graphical attributes and parts of the data. If we want the cty and hwy columns to map to the x- and y-
coordinates in the plot, we can do that as follows: ggplot(mpg, mapping = aes(x = cty, y = hwy))

Layers
• The heart of any graphic is the layers. They take the mapped data and display it in something humans can understand as a
representation of the data. Every layer consists of three important parts:
• The geometry that determines how data are displayed, such as points, lines, or rectangles.
• The statistical transformation that may compute new variables from the data and affect which parts of the data are displayed.
• The position adjustment that primarily determines where a piece of data is being displayed.

A layer can be constructed using the geom_*() and stat_*() functions. These functions often determine one of the three parts of a
layer, while the other two can still be specified.
Visualization in R
Here is how we can use two layers to display the cty and hwy columns of the mpg dataset as points and stack a trend line on top.

ggplot(mpg, aes(cty, hwy)) +
  # to create a scatterplot
  geom_point() +
  # to fit and overlay a trendline
  geom_smooth(formula = y ~ x, method = "lm")
Visualization in R

Scales
• Scales are important for translating what is shown on
the graph back to an understanding of the data.
• The scales typically form pairs with aesthetic attributes
of the plots, and are represented in plots by guides, like
axes or legends.
• Scales are responsible for updating the limits of a plot,
setting the breaks, formatting the labels, and possibly
applying a transformation.
• To use scales, one can use one of the scale functions
that are patterned as scale_{aesthetic}_{type}()
functions, where {aesthetic} is one of the pairings made
in the mapping part of a plot.
• To map the class column in the mpg dataset to the
viridis colour palette, we can write the following:

ggplot(mpg, aes(cty, hwy, colour = class)) +
  geom_point() +
  scale_colour_viridis_d()
Visualization in R

Facets
• Facets can be used to separate small
multiples, or different subsets of the data.
• It is a powerful tool to quickly split up the data
into smaller panels, based on one or more
variables, to display patterns or trends (or the
lack thereof) within the subsets.
• The facets have their own mapping that can
be given as a formula.
• To plot subsets of the mpg dataset based on
levels of the drv and year variables, we can
use facet_grid() as follows:

ggplot(mpg, aes(cty, hwy)) +
  geom_point() +
  facet_grid(year ~ drv)
Visualization in R

Theme
• The theme system controls almost any visuals of the plot
that are not controlled by the data and is therefore
important for the look and feel of the plot.
• You can use the theme for customizations ranging from
changing the location of the legends to setting the
background color of the plot.
• To tweak the look of the plot, one can use many of the
built-in theme_*() functions and/or detail specific aspects
with the theme() function. The element_*() functions
control the graphical attributes of theme components.

ggplot(mpg, aes(cty, hwy, colour = class)) +
  geom_point() +
  theme_minimal() +
  theme(
    legend.position = "top",
    axis.line = element_line(linewidth = 0.75),
    axis.line.x.bottom = element_line(colour = "blue")
  )
Visualization in R - Annotation
Power of Statistics
Basics of Statistics - Measures of Central Tendency

• Measures of central tendency, like the mean, median, and mode, help us determine the center of a dataset, similar to
finding a city's center for orientation.

• These measures provide insights into the typical or central values within the data, aiding in understanding its overall
distribution.

Calculating Central Tendency

• The mean is calculated by summing all values and dividing by the total number of values, representing the average.

• To find the median, values are ordered from least to greatest, and the middle value is selected. If the dataset has an even
number of values, the median is the average of the two middle values.

• The mode is the most frequent value in a dataset. Datasets can have no mode, one mode, or multiple modes.

Choosing the Right Measure

• The mean is susceptible to outliers, which are extreme values that can skew the average.

• The median is a better measure of central tendency when outliers are present, as it is less affected by extreme values.

• The mode is particularly useful for categorical data, revealing the most frequently occurring category.
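
A minimal sketch in R, using a small hypothetical dataset with one outlier (base R has no built-in mode function for data, so
the table() trick below is one common workaround):

scores <- c(70, 85, 85, 90, 100, 300)       # hypothetical values; 300 is an outlier

mean(scores)                                # 121.67 - pulled upward by the outlier
median(scores)                              # 87.5 - less affected by the outlier
names(which.max(table(scores)))             # "85" - the most frequent value (mode)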
Basics of Statistics - Measures of Dispersion

• The range, calculated as the difference between the highest and lowest values, provides a quick snapshot of data spread
but might be sensitive to outliers.

• Standard deviation, on the other hand, offers a more nuanced understanding of data spread by measuring the typical
distance of data points from the mean.

Understanding Standard Deviation

• A higher standard deviation indicates greater variability in the data, meaning the values are more spread out from the mean.
Conversely, a lower standard deviation suggests that data points cluster closer to the mean, indicating less variability.

Application of Standard Deviation

• Meteorologists use standard deviation in weather forecasting to understand temperature fluctuations

• A higher standard deviation in temperature for a location suggests more significant day-to-day variations, while a lower
standard deviation implies more consistent weather patterns.
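
A minimal sketch comparing the range and standard deviation of two hypothetical temperature series:

temps_a <- c(20, 21, 19, 20, 22)    # hypothetical daily temperatures, fairly consistent
temps_b <- c(10, 30, 15, 28, 18)    # hypothetical daily temperatures with larger swings

max(temps_a) - min(temps_a)         # range = 3
max(temps_b) - min(temps_b)         # range = 20

sd(temps_a)                         # small standard deviation: values cluster near the mean
sd(temps_b)                         # larger standard deviation: values are more spread out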
Basics of Statistics - Measures of Position

Understanding Percentiles
• A percentile indicates the percentage of data that falls below a specific value. For instance, if you score in the 90th
percentile on a test, it means you scored better than 90% of test-takers. Percentiles are particularly useful for comparing
values from different datasets, even if those datasets have different scoring systems.
Exploring Quartiles and the Interquartile Range
• Quartiles divide a dataset into four equal parts. The first quartile (Q1) represents the 25th percentile, the second quartile
(Q2) is the median, and the third quartile (Q3) is the 75th percentile.
• The interquartile range (IQR) is the range between first and third quartiles (Q3 - Q1), representing middle 50% of the data.
Summarizing Data with the Five-Number Summary
• The five-number summary provides a concise overview of a dataset's distribution. It includes the minimum value, Q1, the
median (Q2), Q3, and the maximum value. This summary can be visualized using a box plot.
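
A minimal sketch of these measures in R, on hypothetical page counts:

pages <- c(150, 180, 200, 210, 230, 250, 290, 300, 320, 400)   # hypothetical values

quantile(pages, probs = c(0.25, 0.50, 0.75))   # Q1, median (Q2), Q3
IQR(pages)                                     # interquartile range: Q3 - Q1
fivenum(pages)                                 # five-number summary: min, Q1, median, Q3, max
boxplot(pages)                                 # box plot visualizing the five-number summary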
Basics of Statistics - Understanding Probability

• Probability, expressed as a number between 0 and 1, signifies the likelihood of an event occurring. A probability of 0
means the event won't happen, while 1 indicates certainty.

Random Experiments and Calculations

• Random experiments, like flipping a coin or rolling a die, have unpredictable outcomes.

• To calculate the probability of a specific outcome in a random experiment, divide the number of desired outcomes by the
total number of possible outcomes.

Example: Probability of Picking a Green Marble

Imagine a jar with 10 marbles: 2 red, 3 green, and 5 blue.

The probability of picking a green marble is calculated by dividing the number of green marbles (3) by the total number
of marbles (10), resulting in a probability of 0.3 or 30%.

• Data professionals use probability to help stakeholders make informed decisions about uncertain events.
Basics of Statistics – Normal Distribution
• The normal distribution, also known as the Bell curve or Gaussian distribution, is symmetrical with the mean at its center
and tails extending infinitely in both directions.
• The empirical rule helps us understand the spread of data in a normal distribution. It states that 68% of data falls within
one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations.
• This rule is useful for estimating data distribution, identifying potential outliers (data points far from the mean), and
understanding data spread.
• Many naturally occurring phenomena appear to be normally distributed. For example, the average height of a human is
roughly 175 cm (5' 9"), counting both males and females. Taller and shorter people exist with decreasing frequency in the
population. According to the empirical rule, 99.7% of all people will fall within +/- three standard deviations of the mean, or
between 154 cm (5' 0") and 196 cm (6' 5").
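
The empirical rule can be checked directly in R with pnorm(), which gives the area under the normal curve to the left of a
value. The height example below assumes a standard deviation of 7 cm, which is what the 154-196 cm range implies:

pnorm(1) - pnorm(-1)    # ~0.68: share within one standard deviation of the mean
pnorm(2) - pnorm(-2)    # ~0.95: within two standard deviations
pnorm(3) - pnorm(-3)    # ~0.997: within three standard deviations

# Height example: mean 175 cm, assumed sd of 7 cm (175 +/- 3*7 gives 154-196 cm)
pnorm(196, mean = 175, sd = 7) - pnorm(154, mean = 175, sd = 7)   # ~0.997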

Basics of Statistics – Z score
A z-score, or standard score, tells you how many standard deviations a data point is away from the mean of a dataset. Z-
scores enable the comparison of data from different datasets that might have different units, means, and standard deviations.
The formula for calculating a z-score is: z = (x - μ) / σ, where x is the data point, μ is the population mean, and σ is the
population standard deviation.
A z-score of 0 means the data point is equal to the mean. A positive z-score indicates the data point is above the mean, and a
negative z-score indicates it's below the mean.
Data professionals use z-scores for anomaly detection, which helps identify outliers in datasets. This has applications in
detecting fraud, manufacturing flaws, and network intrusions.
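
A minimal sketch on hypothetical test scores (note that sd() and scale() use the sample standard deviation):

x <- c(62, 70, 74, 78, 86)          # hypothetical test scores

z <- (x - mean(x)) / sd(x)          # z = (x - mean) / standard deviation
z                                   # 0 = at the mean, positive = above, negative = below
scale(x)                            # built-in equivalent: centers and scales the data

abs(z) > 2                          # simple anomaly check: flag points more than 2 sd from the mean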

Hypothesis Testing

• Data professionals use hypothesis testing to determine whether a relationship between variables is statistically
significant. As a data professional, it’s important to understand the concept of statistical significance

• Insights based on statistically significant results can help stakeholders make more informed business decisions.

Steps for conducting a hypothesis test:

1. State the null hypothesis and the alternative hypothesis.

2. Choose a significance level.

3. Find the p-value.

4. Reject or fail to reject the null hypothesis.

Example:

Imagine you’re a data professional working for a computer company. The company claims the mean battery life for their
best-selling laptop is 8.5 hours with a standard deviation of 0.5 hours. Recently, the engineering team redesigned the laptop
to increase the battery life. The team takes a random sample of 40 redesigned laptops. The sample mean is 8.7 hours.

The team asks you to determine if the increase in mean battery life is statistically significant, or if it’s due to random chance.
You decide to conduct a hypothesis test to find out.
Hypothesis Testing

Step 1: State the null hypothesis and alternative hypothesis

The null hypothesis typically assumes that your observed data occurs by chance, and it is not statistically significant. In this
case, your null hypothesis says that there is no actual effect on mean battery life in the population of laptops.

The alternative hypothesis typically assumes that your observed data does not occur by chance and is statistically significant. In
this case, your alternative hypothesis says that there is an effect on mean battery life in the population of laptops.

In this example, you formulate the following hypotheses:

H0: μ = 8.5 (the mean battery life of all redesigned laptops is equal to 8.5 hours)

Ha: μ > 8.5 (the mean battery life of all redesigned laptops is greater than 8.5 hours)

Step 2: Choose a significance level

The significance level is the threshold at which you will consider a result statistically significant. When a statistician declares a
result “highly significant”, it does not mean the effect is large or important; it means the result is very unlikely to be due to
chance alone.

A result that is significant at the 5% level means the p-value is less than 0.05 (p < 0.05). Similarly, significant at the 1% level
means that the p-value is less than 0.01.
Hypothesis Testing

Step 3: Find the p-value

p-value refers to the probability of observing results as or more extreme than those observed when the null hypothesis is
true. Your p-value helps you determine whether a result is statistically significant. A low p-value indicates high statistical
significance, while a high p-value indicates low or no statistical significance.

As a data professional, you’ll almost always calculate p-value on your computer, using a programming language like R or
other statistical software. In this example, you calculate a p-value of 0.0057, or 0.57%.

Step 4: Reject or fail to reject the null hypothesis

In a hypothesis test, you compare your p-value to your significance level to decide whether your results are statistically
significant. There are two main rules for drawing a conclusion about a hypothesis test:

• If your p-value is less than your significance level, you reject the null hypothesis.

• If your p-value is greater than your significance level, you fail to reject the null hypothesis.

In this example, your p-value of 0.57% is less than your significance level of 5%. Your test provides sufficient evidence to
conclude that the mean battery life of all redesigned laptops has increased from 8.5 hours. You reject the null hypothesis.
You determine that your results are statistically significant.
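
A minimal sketch of this calculation in R, treating it as a one-sided z-test with the values given in the example:

mu0   <- 8.5                     # mean battery life under the null hypothesis
sigma <- 0.5                     # standard deviation
n     <- 40                      # sample size
xbar  <- 8.7                     # sample mean

z <- (xbar - mu0) / (sigma / sqrt(n))   # test statistic, ~2.53
p_value <- 1 - pnorm(z)                 # right-tailed p-value, ~0.0057
p_value < 0.05                          # TRUE: reject the null hypothesis at the 5% level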
Nuts & Bolts of Machine Learning
Types of Machine Learning
There are two main types: Supervised and Unsupervised learning.
Machine Learning: Supervised
• Supervised learning utilizes labeled datasets, which are tagged with specific metrics or classifications, to train
algorithms for predicting outcomes or classifying data. For instance, training a model on emails labeled as "spam" or
"not spam" to identify new, unlabeled emails as spam.
• "Labeled data" refers to a dataset where each data point has been assigned a specific category or label, providing clear
information about the desired output for a given input, which allows a machine learning model to learn patterns and
make accurate predictions based on these labeled examples.
Machine Learning: Unsupervised
• Unsupervised learning employs algorithms to analyze and group unlabeled datasets without specific instructions on
desired outcomes. The system attempts to uncover patterns from the data. There is no label or target given. A prime
example is identifying unusual or unexpected patterns in data, such as fraudulent transactions
Other Machine Learning Types:
Reinforcement learning, often used in robotics, involves training a computer to make decisions based on rewards or
penalties for its actions.
Deep learning, on the other hand, utilizes interconnected layers of nodes to process and transform signals, ultimately
generating an output.
Common Machine Learning Algorithms

Linear Regression & Logistic Regression

Clustering

• Using unsupervised learning, clustering algorithms can identify patterns in data so that it can be grouped. Computers
can help data scientists by identifying differences between data items that humans have overlooked.

Decision trees

• Decision trees can be used for both predicting numerical values and classifying data into categories. Decision trees
use a branching sequence of linked decisions that can be represented with a tree diagram. One of the advantages of
decision trees is that they are easy to validate and audit, unlike the black box of the neural network.

Random forests

• In a random forest, the machine learning algorithm predicts a value or category by combining the results from a
number of decision trees.

Neural networks

Neural networks simulate the way the human brain works, with a huge number of linked processing nodes. Neural
networks are good at recognizing patterns and play an important role in applications including natural language
translation, image recognition, speech recognition, and image creation.
K-means Clustering

• K-means clustering is a way of grouping data based on how similar or close the data points are to each other.
• The algorithm works by first randomly picking some central points (called centroids) and then assigning every data
point to the nearest centroid. Once that’s done, it recalculates the centroids based on the new groupings and repeats
the process until the clusters make sense.
• It’s a pretty fast and efficient method, but it works best when the clusters are distinct and not too mixed up.
Here’s a look at its key objectives:
Grouping Similar Data Points
• K-Means is designed to cluster data points that share common traits, allowing patterns or trends to emerge. Whether
analyzing customer behavior or images, the method helps reveal hidden relationships within your dataset.
Minimizing Within-Cluster Distance
• Another objective is to keep data points in each group as close to the cluster's centroid as possible. Reducing this
internal distance results in compact, cohesive clusters, enhancing the accuracy of your results.
Maximizing Between-Cluster Distance
• K-Means also aims to maintain clear separation between different clusters. By maximizing the distance between
groups, the algorithm ensures that each cluster remains distinct, providing a better understanding of data categories
without overlap.
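
A minimal sketch using base R's kmeans() on the built-in iris measurements:

set.seed(42)                              # k-means starts from random centroids, so fix the seed

fit <- kmeans(iris[, 1:4], centers = 3, nstart = 25)   # group the data into 3 clusters
fit$centers                               # final centroid of each cluster
table(fit$cluster, iris$Species)          # compare the clusters with the known species labels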
Decision Tree

• Decision trees are, in layman's terms, a bunch of if-else statements.
• The tree checks whether a condition is true and, if it is, moves to the next node attached to that decision.
• The goal of a decision tree is to decrease uncertainty or disorder in the dataset.
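
A minimal sketch, assuming the rpart package (a common choice for decision trees in R) is installed:

library(rpart)

fit <- rpart(Species ~ ., data = iris, method = "class")   # classify species from the measurements
print(fit)                                                  # the branching if-else rules of the tree
predict(fit, iris[1:3, ], type = "class")                   # classify a few observations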
Random Forest
A "random forest" is a machine learning technique that combines the predictions of multiple decision trees to make a final
prediction, essentially acting like a "forest" where each tree is a separate model, and the final decision is made by taking a
majority vote from all the trees, making it more accurate and robust than relying on just one decision tree alone; it can be
used for both classification (categorizing data) and regression (predicting numerical values) problems.

Key points about random forests:


Ensemble learning: It's an "ensemble learning" method, which means it
combines multiple decision trees to improve prediction.
Bagging: To create diversity among the trees, random forests use a technique
called "bagging" where each tree is trained on a slightly different random sample.
Random feature selection: The features considered at each split in a decision
tree are randomly selected, further increasing diversity.
How it works:
Create multiple decision trees: The algorithm generates many decision trees,
each trained on a different random subset
Make predictions: Each tree makes a prediction on new data points.
Aggregate results: The final prediction is determined by taking the average (for
regression) or majority vote (for classification)
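
A minimal sketch, assuming the randomForest package is installed:

library(randomForest)

set.seed(1)
fit <- randomForest(Species ~ ., data = iris, ntree = 500)   # an ensemble of 500 bagged trees
fit                                                          # includes the out-of-bag error estimate
predict(fit, iris[1:3, ])                                    # final class = majority vote across trees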
Regression Analysis - Logistic
Regression Analysis

Regression analysis is a statistical method for understanding and quantifying the relationship between two or more
variables. It helps a business estimate one dependent variable based on the values of one or more independent variables.
Dependent Variable: The dependent variable is essentially the "outcome" you’re trying to understand or predict.
Independent Variable: Independent variables are the "factors" that might influence or cause changes in the dependent
variable. These are the variables you manipulate or observe to see their impact on your outcome of interest. For example, if
you increase the price of a product, will the customer churn? In this case, price is the independent variable.

Logistic Regression: Logistic regression is used when the dependent variable is binary (two possible outcomes). It models
the probability of a particular outcome occurring.
Business Application: In business, logistic regression is employed for tasks like predicting customer churn (yes/no), whether
a customer will purchase a product (yes/no), or whether a loan applicant will default on a loan (yes/no).
Logistic Regression
Types of Logistic Regression
Binary Logistic Regression: Binary logistic regression is used to predict the probability of a binary outcome, such as yes or
no, true or false, or 0 or 1. For example, it could be used to predict whether a customer will churn or not, whether a patient
has a disease or not, or whether a loan will be repaid or not.
Multinomial Logistic Regression: Multinomial logistic regression is used to predict the probability of one of three or more
possible outcomes, such as the type of product a customer will buy, the rating a customer will give a product, or the political
party a person will vote for.
Ordinal Logistic Regression: It is used to predict the probability of an outcome that falls into a predetermined order, such as
the level of customer satisfaction, the severity of a disease, or the stage of cancer.

How to Perform Logistic Regression Analysis?


Prepare the data: The data should be in a format where each row represents a single observation and each column
represents a different variable. The target variable (the variable you want to predict) should be binary (yes/no, true/false, 0/1).
Train the model: We teach the model by showing it the training data. This involves finding the values of the model
parameters that minimize the error in the training data.
Evaluate the model: The model is evaluated on the test data to assess its performance on unseen data.
Use the model to make predictions: After the model has been trained and assessed, it can be used to forecast outcomes
on new data.
Logistic Regression

• In medicine, a frequent application is to find out which variables have an influence on a disease. In this case, 0 could
stand for not diseased and 1 for diseased. Subsequently, the influence of age, gender and smoking status (smoker or not)
on this particular disease could be examined.

• In linear regression, the independent variables (e.g., age and gender) are used to estimate the specific value of the
dependent variable (e.g., body weight).

• In logistic regression, on the other hand, the dependent variable is dichotomous (0 or 1) and the probability that
expression 1 occurs is estimated. Returning to the example above, this means: how likely is it that the disease is present
if the person under consideration has a certain age, sex, and smoking status?

To build a logistic regression model, the linear regression equation is used as the starting point.
Logistic Regression

However, if a linear regression were simply calculated for solving a logistic regression, the following result would appear
graphically:

As can be seen in the graph, values between plus and minus infinity can now occur. The goal of logistic regression,
however, is to estimate the probability of occurrence and not the value of the variable itself. Therefore, the equation must be
transformed.

To do this, it is necessary to restrict the value range for the prediction to the range between 0 and 1. To ensure that only
values between 0 and 1 are possible, the logistic function is used.
Logistic Regression

Logistic Function

The logistic model is based on the logistic function. The special thing about the logistic function is that for values between
minus and plus infinity, it always assumes only values between 0 and 1.

To calculate the probability of a person being sick or not using the logistic regression for the example above, the model
parameters b1, b2, b3 and a must first be determined. Once these have been determined, the equation will be:

P(sick) = 1 / (1 + e^-(a + b1*age + b2*gender + b3*smoking status))
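
A quick illustration of the logistic (sigmoid) function in R:

sigmoid <- function(t) 1 / (1 + exp(-t))   # the logistic function
sigmoid(c(-10, -1, 0, 1, 10))              # outputs always lie between 0 and 1
curve(sigmoid, from = -6, to = 6)          # the characteristic S-shaped curve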
Key properties of the Logistic Regression equation

Sigmoid Function: The logistic regression model, when explained, uses a special “S” shaped curve to predict
probabilities. It ensures that the predicted probabilities stay between 0 and 1, which makes sense for probabilities.

Coefficients: These are just numbers that tell us how much each input affects the outcome in the logistic regression
model. For example, if age is a predictor, the coefficient tells us how much the outcome changes for every one-year
increase in age.

Best Guess: We figure out the best coefficients for the logistic regression model by looking at the data we have and
tweaking them until our predictions match the real outcomes as closely as possible.

Basic Assumptions: We assume that our observations are independent, meaning one doesn’t affect the other. We also
assume that there’s not too much overlap between our predictors (like age and height)

Linearity in the Logit: The relationship between the independent variables and the logit of the dependent variable (ln(p /
(1-p))) is assumed to be linear. This doesn’t necessarily mean the outcome itself has a linear relationship with the
independent variables, but the log-odds do

Probabilities, Not Certainties: Instead of saying “yes” or “no” directly, logistic regression gives us probabilities, like
saying there’s a 70% chance it’s a “yes” in the logistic regression model. We can then decide on a cutoff point to make our
final decision.

Checking Our Work: We have some tools to make sure our predictions are good, like accuracy, precision, recall, and a
curve called the ROC curve.
Logistic Regression Example

You have a dataset, and you need to predict whether a candidate will get admission in the desired college or not, based on
the person’s GRE score, GPA and College Rank.

Steps:

1. In the dataset, we are given the GRE scores, GPAs, and college ranks for several students, along with a column that indicates
whether those students were admitted or not.

2. Based on this labeled data, you can train the model, validate it, and then use it to predict the admission for any GPA and
college rank.

3. Once you split the data into training and test sets, you will apply the regression on the three independent variables (GRE,
GPA and Rank), generate the model, and then run the test set through the model.

4. Once that is complete, you will validate the model to see how well it performed.

Data Set
Logistic Regression Example

Model & Results Interpretation

1- Each one-unit increase in gre increases the log odds of admission by 0.002, and its p-value indicates that it is somewhat
significant in determining admission.

2- Each one-unit increase in gpa increases the log odds of admission by 0.80, and its p-value indicates that it is somewhat
significant in determining admission.

3- The interpretation of rank is different from the others: its coefficients are relative to the reference level, rank-1. Attending a
rank-2 college instead of a rank-1 college decreases the log odds of admission by 0.67, and attending a rank-3 college instead
of a rank-1 college decreases it by 1.340.

4- The difference between the null deviance and the residual deviance tells us whether the model is a good fit; the greater the
difference, the better the model. The null deviance is the value when the equation contains only an intercept and no variables,
and it tells us how well the response variable can be predicted by a model with only an intercept term. The residual deviance
is the value when all the variables are taken into account.
*When using logistic regression, you should convert a rank from an integer to a factor to indicate that the rank is a categorical variable.
Logistic Regression Example

Prediction

Let’s say a student has a profile with a GRE score of 790, a GPA of 3.8, and a rank-1 college. Now you want to predict the
chances of this student being admitted in the future.

We see that there is an 85% chance that this student will get the admit.
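
A minimal sketch of how this example could be run in R with glm(), assuming a data frame named admissions with columns
admit (0/1), gre, gpa, and rank (the exact predicted probability depends on the fitted coefficients):

# Assumed: admissions is a data frame with columns admit (0/1), gre, gpa, rank
admissions$rank <- factor(admissions$rank)    # treat rank as a categorical variable, as noted above

model <- glm(admit ~ gre + gpa + rank, data = admissions, family = binomial)
summary(model)                                # coefficients (log odds), p-values, null and residual deviance

# Predict the admission probability for the profile described above
new_student <- data.frame(gre = 790, gpa = 3.8,
                          rank = factor(1, levels = levels(admissions$rank)))
predict(model, newdata = new_student, type = "response")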
Foundation Skills Academy

Thank You
