Data Analytics - Advanced
• SQL, or Structured Query Language, is a querying language designed for accessing and manipulating information from,
mostly, though not exclusively, relational databases.
• Using SQL, we can write a set of instructions to perform operations such as inserting, updating, and deleting records in a
database; creating new databases, tables, and views; and writing stored procedures - reusable sets of instructions that you
can save and call later.
• Here are some advantages of using SQL:
• SQL is portable and can be used independent of the platform. It can be used for querying data in a wide variety of
databases and data repositories, although each vendor may have some variations and special extensions.
• It has a simple syntax that is similar to the English language. Its syntax allows developers to write programs with
fewer lines than some of the other programming languages using basic keywords such as select, insert, into, and
update.
• It can retrieve large amounts of data quickly and efficiently.
• It runs on an interpreter system, which means code can be executed as soon as it is written, making prototyping quick
• Due to its large user community and the sheer volume of documentation accumulated over the years, it continues to
provide a uniform platform, worldwide, to all its users.
Categories of SQL statements
• There are two main categories of SQL statements which are essential for interacting with relational databases:
• Data Definition Language (DDL)
• Data Manipulation Language (DML)
• DDL statements are used to define, change, or drop database objects such as tables.
• Common DDL statements include CREATE (to create tables), ALTER (to modify tables), TRUNCATE (to delete data without
removing the table), and DROP (to delete tables).
• DML statements are used to read and modify data within tables, often referred to as CRUD operations (Create, Read,
Update, Delete).
• Common DML statements include INSERT (to add data), SELECT (to retrieve data), UPDATE (to modify existing data), and
DELETE (to remove data).
Basic SQL Queries - Select
• Using String Patterns: The WHERE clause can use the LIKE predicate to search for patterns in a column, with the percent
sign (%) as a wildcard character that matches zero or more characters.
• For example, SELECT first_name FROM author WHERE first_name LIKE 'R%' retrieves all authors whose first names start
with the letter R.
• Utilizing Ranges: Instead of using multiple comparison operators, the BETWEEN operator can simplify queries by specifying
a range of values. For instance, WHERE pages BETWEEN 290 AND 300 retrieves books with a page count within that range,
making the query easier to write.
• Working with Sets of Values: The IN operator allows for specifying multiple values in a WHERE clause without lengthy
repetition. For example, WHERE country IN ('Australia', 'Brazil', 'Canada') efficiently retrieves authors from those countries.
• Scalar functions operate on individual values, such as rounding numbers with SELECT ROUND(COST) FROM
PETRESCUE.
• String functions manipulate text values, for instance, SELECT UPPER(ANIMAL) FROM PETRESCUE converts animal
names to uppercase.
Basic SQL Queries - Select
Count function
The COUNT function retrieves the number of rows that match specific query criteria. For example, to get the total number of rows
in a table, you can use: SELECT COUNT(*) FROM tablename
To count the number of medal recipients from Canada in a MEDALS table, the query would be: SELECT COUNT(COUNTRY)
FROM MEDALS WHERE COUNTRY='CANADA'
Distinct function
The DISTINCT expression is used to eliminate duplicate values from a result set. For instance, to retrieve unique values in a
column, you can use: SELECT DISTINCT columnname FROM tablename
In the MEDALS table, to get a list of unique countries that received gold medals, the query would be: SELECT DISTINCT
COUNTRY FROM MEDALS WHERE MEDALTYPE = 'GOLD'
Limit function
The LIMIT expression restricts the number of rows retrieved from the database. For example, to get just the first 10 rows, you can
use: SELECT * FROM tablename LIMIT 10
Basic SQL Queries - Select
Sorting Data: The ORDER BY clause is used to sort the result set by a specified column, with the default sorting being in
ascending order. For descending order, you can add the keyword DESC, such as ORDER BY title DESC, which will sort the titles
from Z to A.
SELECT * FROM Products ORDER BY Price;
Aggregate functions, such as SUM, MIN, MAX, and AVG, take a collection of values and return a single value.
Grouping Data: The GROUP BY clause allows you to group results into subsets based on matching values in one or more
columns, such as adding up salary for each employee.
SELECT name, SUM(sal) FROM emp GROUP BY name
You can use the COUNT function in conjunction with GROUP BY
SELECT SUBJECT, YEAR, Count(*) FROM Student GROUP BY SUBJECT, YEAR
Restricting Results with Conditions
To further filter grouped results, the HAVING clause is used, which applies conditions specifically to grouped data.
SELECT NAME, SUM(sal) FROM Emp GROUP BY name HAVING SUM(sal)>50000
Basic SQL Queries - Select
DATE FUNCTIONS
• The DAY function can extract the day from a date, for example:
SELECT DAY(RESCUEDATE) FROM PETRESCUE WHERE ANIMAL = 'cat';
• You can count records based on specific months using the MONTH function, like so:
SELECT COUNT(*) FROM PETRESCUE WHERE MONTH(RESCUEDATE) = 5;
• You can also determine the number of days since a rescue date using the CURRENT_DATE function:
SELECT (CURRENT_DATE - RESCUEDATE) FROM PETRESCUE;
• Date arithmetic allows you to calculate future dates, such as finding a date three days after a rescue date:
SELECT (RESCUEDATE + 3 DAYS) FROM PETRESCUE;
Basic SQL Queries - Select
Sub-queries in SQL
Sub-queries are regular queries enclosed in parentheses and can be used in various parts of a main query, such as the WHERE
clause, SELECT list, or FROM clause. They help overcome limitations of aggregate functions, like using the AVG function in the
WHERE clause.
To find employees earning more than the average salary, you can use a sub-query:
SELECT EMP_ID, F_NAME, L_NAME, SALARY FROM employees WHERE SALARY > (SELECT AVG(SALARY) FROM
employees)
Sub-queries can also be used in the SELECT list to compare individual salaries with the average salary:
SELECT EMP_ID, SALARY, (SELECT AVG(SALARY) FROM employees) AS AVG_SALARY FROM employees;
Sub-queries can act as derived tables, providing a data source for the outer query, which can simplify complex queries involving
multiple tables.
SELECT * FROM (SELECT EMP_ID, F_NAME, L_NAME, DEP_ID FROM employees) AS EMP4ALL;
Sub-queries allow you to filter results from one table based on data from another table. For example, you can retrieve employee
records where the department ID exists in the departments table.
Basic SQL Queries – Insert, Update, Delete
The INSERT statement is used to add new rows to a table and is part of the data manipulation language (DML) statements
The syntax of the INSERT statement is structured as: INSERT INTO table_name (column_name1, column_name2, ...) VALUES
(value1, value2, ...).
It's crucial that the number of values matches the number of columns specified to ensure each column receives a value.
You can insert multiple rows in a single statement by separating each row's values with commas.
INSERT INTO author (author_id, last_name, first_name, email, city, country) VALUES ('A1', 'Chong', 'Raul',
'RFC@IBM.com', 'Toronto', 'CA'), ('A2', 'Ahuja', 'Rav', 'Rav@IBM.com', 'Toronto', 'CA');
The UPDATE statement is used to modify existing data in a table, following the syntax: UPDATE [TableName] SET
[ColumnName] = [Value] WHERE [Condition].
The WHERE clause is crucial as it specifies which rows to update; omitting it will result in all rows being modified.
The DELETE statement is used to remove rows from a table, with the syntax: DELETE FROM [TableName] WHERE [Condition]
Similar to the UPDATE statement, the WHERE clause determines which rows to delete; failing to include it will remove all rows
from the table.
Basic SQL Queries – Create, Alter, Drop, Truncate
The syntax begins with "CREATE TABLE" followed by the table name, with the rest of the statement enclosed in parentheses.
Each column is defined by its name, datatype, and optional constraints, separated by commas.
CREATE TABLE provinces (id CHAR(2) PRIMARY KEY NOT NULL, name VARCHAR(24));
The ALTER TABLE statement allows you to add or remove columns, modify data types, and manage keys and constraints
For example, to add a telephone number column to the AUTHOR table, you would use:
ALTER TABLE author ADD COLUMN telephone_number BIGINT;
The DROP TABLE statement is used to delete an entire table from a database, along with its data.
The syntax for this statement is straightforward: DROP TABLE table_name;
The TRUNCATE TABLE statement deletes all rows in a table without removing the table itself.
The syntax for truncating a table is: TRUNCATE TABLE table_name;
SQL Queries - Join
The JOIN operator is used to combine rows from two or more tables based on a relationship between certain columns, typically
involving primary and foreign keys.
A primary key uniquely identifies each row in a table, while a foreign key refers to a primary key in another table, establishing a
link between the two.
Types of JOINs
The most common type of join is the Inner join, which returns only the rows with matching values in the common columns.
SELECT column_name(s)
FROM table1
INNER JOIN table2
ON table1.column_name = table2.column_name;
SQL Queries - Join
Outer joins return matching rows as well as rows from one or both tables that do not match, allowing for a broader data set.
Types of Outer Joins
A left outer join includes all rows from the left table and only the matching rows from the right table, displaying null values for
non-matching rows.
SELECT column_name(s) FROM table1 LEFT JOIN table2 ON table1.column_name = table2.column_name;
A right outer join includes all rows from the right table and only the matching rows from the left table
SELECT column_name(s) FROM table1 RIGHT JOIN table2 ON table1.column_name = table2.column_name;
A full outer join returns all rows from both tables, including matching rows and non-matching rows from both sides. This type of
join is useful when you want to see all data, regardless of whether there are matches in the other table.
SELECT column_name(s) FROM table1 FULL OUTER JOIN table2 ON table1.column_name = table2.column_name
SQL Queries - View
• A view is a virtual table based on the result-set of an SQL statement. A view contains rows and columns, just like a real table.
The fields in a view are fields from one or more real tables in the database.
• You can add SQL statements and functions to a view and present the data as if the data were coming from one single table.
• A view is created with the CREATE VIEW statement.
CREATE VIEW view_name AS
SELECT column1, column2, ...
FROM table_name
WHERE condition;
• A view always shows up-to-date data! The database engine recreates the view, every time a user queries it.
• A stored procedure is a prepared SQL code that you can save, so the code can be reused over and over again. If you have an
SQL query that you write over and over again, save it as a stored procedure, and then just call it to execute it.
• You can also pass parameters to a stored procedure, so that the stored procedure can act based on the parameter value(s)
that is passed.
Tip: Views are primarily used for data presentation, while stored procedures are used for performing actions on data.
Data Analysis with R Programming
Programming Language for Data Professionals – R
• R is a free, open-source programming language primarily used for statistical computing, data analysis, and powerful data
visualization
• R is freely accessible to anyone, allowing for wide adoption and community development
• It is known for its extensive collection of packages that provide a wide range of statistical functions and tools
• R excels in creating high-quality, publication-ready visualizations with packages like ggplot2, allowing for customization
and exploration of data through various charts and graphs
• RStudio: A popular integrated development environment (IDE) that provides a user-friendly interface for writing, executing, and
visualizing R code
Basic building blocks
• R is like the engine of a car, while RStudio provides the tools to control and interact with it, such as the accelerator, steering
wheel, and dashboard.
• R packages are collections of reusable functions and tools, offering a wide range of functionalities.
• The Tidyverse is a popular collection of R packages designed for data manipulation, visualization, and modeling.
• Functions in R are reusable blocks of code designed to perform specific tasks. They streamline code and make it efficient.
• Variables act as containers for storing data values, allowing you to easily manage and manipulate data. age <- 25
• Vectors are collections of data elements of the same type. Vector of strings: fruits <- c("banana", "apple", "orange")
• Operators are symbols that represent actions to be performed, such as calculations in a formula. They are used similarly to
how they are used in spreadsheets and SQL - Arithmetic, Relational (<, >, ==), Logical (&&, ||, !), Assignment
• Assignment operators, like <-, assign values to variables. For example, sales <- c(100, 150, 120) would store those
sales figures in the variable "sales."
RStudio
Working with Scripts in RStudio
Scripts in RStudio help you save your code and provide a record of your work. You can run your code within a script, and the
output will appear in the console.
x <- 7
if (x > 0) {
  print("x is a positive number")
} else {
  print("x is either a negative number or zero")
}
Since 7 is greater than 0, the condition of the if statement is true. So, when you run the code, R prints out "x is a positive number".

# run the code
x <- -1
if (x < 0) {
  print("x is a negative number")
} else if (x == 0) {
  print("x is zero")
} else {
  print("x is a positive number")
}
Pipes
• Pipes in R help to make your code more efficient and easier to read by representing a sequence of multiple operations.
• They simplify complex nested functions, where one function is embedded within another, by using the pipe operator (%>%) to
chain operations together.
Benefits of Using Pipes
• Pipes enhance code readability, making it easier to understand the flow of operations.
• They promote code efficiency by reducing the need for intermediate variables and nested function calls.
• Pipes make it simple to add or modify steps in your data analysis workflow.
Example (add(), mul(), and div() are assumed to be user-defined helper functions)
# Without pipe
res1 <- add(2, 4)
res2 <- mul(res1, 8)
res3 <- div(res2, 2)

# Using pipe
res <- add(2, 4) %>% mul(8) %>% div(2)
Output: # [1] 24
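For a runnable version of this example, here is a minimal sketch that defines the three helper functions and uses the %>% operator from the magrittr package (also loaded by the tidyverse):
library(magrittr)   # provides the %>% pipe operator

add <- function(x, y) x + y
mul <- function(x, y) x * y
div <- function(x, y) x / y

res <- add(2, 4) %>% mul(8) %>% div(2)
res   # [1] 24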
Exploring Tidyverse
• The tidyverse is a collection of R packages designed for data science, making data analysis and manipulation more efficient.
• Eight core tidyverse packages: ggplot2, tidyr, readr, dplyr, tibble, purrr, stringr, and forcats
Top 4 packages
• ggplot2 is used for creating various data visualizations, allowing you to apply different visual properties to data variables.
• dplyr provides functions for common data manipulation tasks like selecting variables and filtering data based on conditions.
• tidyr helps clean data, ensuring every data point is in the correct format and location within a data table or data frame.
• readr imports data from external sources, with its read_csv function commonly used for importing CSV files into R.
Next 4 packages
• tibble works with data frames. A data frame is like a spreadsheet with columns for variables and rows for observations,
making data easy to read and use. Tibbles are similar to data frames but are designed to be easier to work with.
• purrr makes working with functions and vectors easier. The most prominent function in purrr is map(), which iterates through
elements of a vector and applies a specified function to each element, returning a list of results
• stringr includes functions that make it easier to work with strings.
• forcats provides tools that solve common problems with factors. As a quick reminder, factors store categorical data in R where
the data values are limited and usually based on a finite group like country or year.
Working with Data Frame
Data frames are fundamental in R for data manipulation and analysis, mirroring the structure of spreadsheets with columns and
rows.
• The head() function displays the first few rows of a data frame, providing a sneak peek without printing the entire dataset.
• str() reveals the data frame's structure, including column names and data types, while colnames() lists the column names.
Tibbles are a little different from standard data frames. A data frame is a collection of columns, like a spreadsheet or a SQL
table. Tibbles are like streamlined data frames that are automatically set to pull up only the first 10 rows of a dataset, and only as
many columns as can fit on the screen. This is really useful when you’re working with large sets of data
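A minimal sketch of these inspection functions, using the built-in mtcars data frame as an arbitrary example:
head(mtcars)        # first few rows of the data frame
str(mtcars)         # structure: column names, data types, and sample values
colnames(mtcars)    # just the column names

library(tibble)
as_tibble(mtcars)   # tibble printing: only the first 10 rows and the columns that fit on screen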
Exploring Functions in R
• ggplot2 is one of the core packages of the tidyverse used for creating diverse and sophisticated data visualizations, from simple
charts to intricate interactive graphs and maps.
• ggplot2 provides a structured approach to building visualizations, allowing you to create various plots using a consistent set of
building blocks.
Mapping
• The mapping of a plot is a set of instructions on how parts of the data are mapped onto aesthetic attributes of geometric
objects. It is the ‘dictionary’ to translate data to the graphics system. A mapping can be made by using the aes() function to
make pairs of graphical attributes and parts of the data. If we want the cty and hwy columns to map to the x- and y-
coordinates in the plot, we can do that as follows: ggplot(mpg, mapping = aes(x = cty, y = hwy))
Layers
• The heart of any graphic is the layers. They take the mapped data and display it in something humans can understand as a
representation of the data. Every layer consists of three important parts:
• The geometry that determines how data are displayed, such as points, lines, or rectangles.
• The statistical transformation that may compute new variables from the data and affect which part of the data is displayed.
• The position adjustment that primarily determines where a piece of data is being displayed.
A layer can be constructed using the geom_*() and stat_*() functions. These functions often determine one of the three parts of a
layer, while the other two can still be specified.
Visualization in R
Here is how we can use two layers to display the cty and hwy columns of the mpg dataset as points and stack a trend line on top.
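The code for that plot might look like the following sketch (geom_smooth() is used here for the trend line; the original material may use a different smoother):
library(ggplot2)

# Layer 1: the raw data as points; Layer 2: a smoothed trend line stacked on top
ggplot(mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_smooth()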
Scales
• Scales are important for translating what is shown on the graph back to an understanding of the data.
• The scales typically form pairs with aesthetic attributes of the plots, and are represented in plots by guides, like axes or legends.
• Scales are responsible for updating the limits of a plot, setting the breaks, formatting the labels, and possibly applying a transformation.
• To use scales, one can use one of the scale functions that are patterned as scale_{aesthetic}_{type}() functions, where {aesthetic} is one of the pairings made in the mapping part of a plot.
• To map the class column in the mpg dataset to the viridis colour palette, we can write the following:
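A sketch of what that code could look like, assuming the discrete viridis colour scale bundled with ggplot2:
library(ggplot2)

# Map class to colour, then replace the default palette with the viridis palette
ggplot(mpg, aes(x = cty, y = hwy, colour = class)) +
  geom_point() +
  scale_colour_viridis_d()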
Facets
• Facets can be used to separate small multiples, or different subsets of the data.
• It is a powerful tool to quickly split up the data into smaller panels, based on one or more variables, to display patterns or trends (or the lack thereof) within the subsets.
• The facets have their own mapping that can be given as a formula.
• To plot subsets of the mpg dataset based on levels of the drv and year variables, we can use facet_grid() as follows:
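A sketch of that call (placing year on the rows and drv on the columns of the formula is an assumption; the reverse works equally well):
library(ggplot2)

# One panel per combination of model year and drive train
ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  facet_grid(year ~ drv)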
Theme
• The theme system controls almost any visuals of the plot that are not controlled by the data and is therefore important for the look and feel of the plot.
• You can use the theme for customizations ranging from changing the location of the legends to setting the background color of the plot.
• To tweak the look of the plot, one can use many of the built-in theme_*() functions and/or detail specific aspects with the theme() function. The element_*() functions control the graphical attributes of theme components.
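A small sketch combining a built-in theme with theme() and element_*() tweaks:
library(ggplot2)

# Start from a built-in theme, then adjust individual components
ggplot(mpg, aes(x = cty, y = hwy, colour = class)) +
  geom_point() +
  theme_minimal() +
  theme(legend.position = "bottom",
        plot.background = element_rect(fill = "grey95"))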
Basics of Statistics - Measures of Central Tendency
• Measures of central tendency, like the mean, median, and mode, help us determine the center of a dataset, similar to finding a city's center for orientation.
• These measures provide insights into the typical or central values within the data, aiding in understanding its overall
distribution.
• The mean is calculated by summing all values and dividing by the total number of values, representing the average.
• To find the median, values are ordered from least to greatest, and the middle value is selected. If the dataset has an even
number of values, the median is the average of the two middle values.
• The mode is the most frequent value in a dataset. Datasets can have no mode, one mode, or multiple modes.
• The mean is susceptible to outliers, which are extreme values that can skew the average.
• The median is a better measure of central tendency when outliers are present, as it is less affected by extreme values.
• The mode is particularly useful for categorical data, revealing the most frequently occurring category.
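A small sketch in R, using a made-up vector with one extreme value to show how the mean and median react differently (base R has no built-in function for the statistical mode, so a tiny helper is defined):
values <- c(2, 3, 3, 5, 7, 10, 48)   # made-up data; 48 is an outlier

mean(values)     # ~11.1, pulled upward by the outlier
median(values)   # 5, barely affected by the outlier

# Helper that returns the most frequent value
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}
stat_mode(values)   # 3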
Basics of Statistics - Measures of Dispersion
• The range, calculated as the difference between the highest and lowest values, provides a quick snapshot of data spread
but might be sensitive to outliers.
• Standard deviation, on the other hand, offers a more nuanced understanding of data spread by measuring the typical
distance of data points from the mean.
• A higher standard deviation indicates greater variability in the data, meaning the values are more spread out from the mean.
Conversely, a lower standard deviation suggests that data points cluster closer to the mean, indicating less variability.
• A higher standard deviation in temperature for a location suggests more significant day-to-day variations, while a lower
standard deviation implies more consistent weather patterns.
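A quick sketch with two made-up temperature series, echoing the weather example above:
temps_consistent <- c(20, 21, 19, 20, 22)   # small day-to-day variations
temps_variable   <- c(10, 25, 15, 30, 20)   # large day-to-day swings

max(temps_consistent) - min(temps_consistent)   # range of the first series
sd(temps_consistent)   # low standard deviation: values cluster near the mean
sd(temps_variable)     # higher standard deviation: values are more spread out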
Basics of Statistics - Measures of Position
Understanding Percentiles
• A percentile indicates the percentage of data that falls below a specific value. For instance, if you score in the 90th
percentile on a test, it means you scored better than 90% of test-takers. Percentiles are particularly useful for comparing
values from different datasets, even if those datasets have different scoring systems.
Exploring Quartiles and the Interquartile Range
• Quartiles divide a dataset into four equal parts. The first quartile (Q1) represents the 25th percentile, the second quartile
(Q2) is the median, and the third quartile (Q3) is the 75th percentile.
• The interquartile range (IQR) is the range between the first and third quartiles (Q3 - Q1), representing the middle 50% of the data.
Summarizing Data with the Five-Number Summary
• The five-number summary provides a concise overview of a dataset's distribution. It includes the minimum value, Q1, the
median (Q2), Q3, and the maximum value. This summary can be visualized using a box plot.
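In R, quantile(), IQR(), and summary() cover these measures directly; a minimal sketch with made-up scores:
scores <- c(55, 61, 67, 70, 72, 75, 78, 81, 85, 90, 94)   # made-up test scores

quantile(scores, probs = c(0.25, 0.50, 0.75))   # Q1, median (Q2), Q3
IQR(scores)                                     # interquartile range: Q3 - Q1
summary(scores)                                 # five-number summary (plus the mean)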
Basics of Statistics - Understanding Probability
• Probability, expressed as a number between 0 and 1, signifies the likelihood of an event occurring. A probability of 0
means the event won't happen, while 1 indicates certainty.
• Random experiments, like flipping a coin or rolling a die, have unpredictable outcomes.
• To calculate the probability of a specific outcome in a random experiment, divide the number of desired outcomes by the
total number of possible outcomes.
For example, imagine a bag containing 10 marbles, 3 of which are green. The probability of picking a green marble is
calculated by dividing the number of green marbles (3) by the total number of marbles (10), resulting in a probability of 0.3 or 30%.
• Data professionals use probability to help stakeholders make informed decisions about uncertain events.
Basics of Statistics – Normal Distribution
• The normal distribution, also known as the Bell curve or Gaussian distribution, is symmetrical with the mean at its center
and tails extending infinitely in both directions.
• The empirical rule helps us understand the spread of data in a normal distribution. It states that 68% of data falls within
one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations.
• This rule is useful for estimating data distribution, identifying potential outliers (data points far from the mean), and
understanding data spread.
• Many naturally occurring phenomena appear to be normally distributed. For example, the average height of a human is
roughly 175 cm (5' 9"), counting both males and females. Taller and shorter people exist with decreasing frequency in the
population. According to the empirical rule, 99.7% of all people will fall within +/- three standard deviations of the mean, or
between 154 cm (5' 0") and 196 cm (6' 5").
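The empirical rule can be checked with pnorm(); the sketch below reuses the height example, assuming a mean of 175 cm and a standard deviation of 7 cm (implied by the 154-196 cm range):
mu <- 175   # mean height in cm
s  <- 7     # standard deviation in cm, implied by the +/- 3 SD range of 154-196 cm

pnorm(mu + 1*s, mu, s) - pnorm(mu - 1*s, mu, s)   # ~0.68
pnorm(mu + 2*s, mu, s) - pnorm(mu - 2*s, mu, s)   # ~0.95
pnorm(mu + 3*s, mu, s) - pnorm(mu - 3*s, mu, s)   # ~0.997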
Basics of Statistics – Z score
A z-score, or standard score, tells you how many standard deviations a data point is away from the mean of a dataset. Z-
scores enable the comparison of data from different datasets that might have different units, means, and standard deviations.
The formula for calculating a z-score is: z = (x - μ) / σ, where x is the data point, μ is the population mean, and σ is the
population standard deviation.
A z-score of 0 means the data point is equal to the mean. A positive z-score indicates the data point is above the mean, and a
negative z-score indicates it's below the mean.
Data professionals use z-scores for anomaly detection, which helps identify outliers in datasets. This has applications in
detecting fraud, manufacturing flaws, and network intrusions.
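A minimal z-score sketch, reusing the height figures from the previous section as assumed population parameters:
x     <- 190   # a data point, e.g. a height of 190 cm
mu    <- 175   # population mean
sigma <- 7     # population standard deviation

z <- (x - mu) / sigma
z   # ~2.14: this height is a little over two standard deviations above the mean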
Hypothesis Testing
• Data professionals use hypothesis testing to determine whether a relationship between variables is statistically
significant. As a data professional, it’s important to understand the concept of statistical significance
• Insights based on statistically significant results can help stakeholders make more informed business decisions.
Example:
Imagine you’re a data professional working for a computer company. The company claims the mean battery life for their
best-selling laptop is 8.5 hours with a standard deviation of 0.5 hours. Recently, the engineering team redesigned the laptop
to increase the battery life. The team takes a random sample of 40 redesigned laptops. The sample mean is 8.7 hours.
The team asks you to determine if the increase in mean battery life is statistically significant, or if it’s due to random chance.
You decide to conduct a hypothesis test to find out.
Hypothesis Testing
The null hypothesis typically assumes that your observed data occurs by chance, and it is not statistically significant. In this
case, your null hypothesis says that there is no actual effect on mean battery life in the population of laptops.
The alternative hypothesis typically assumes that your observed data does not occur by chance and is statistically significant. In
this case, your alternative hypothesis says that there is an effect on mean battery life in the population of laptops.
H0: μ = 8.5 (the mean battery life of all redesigned laptops is equal to 8.5 hours)
Ha: μ > 8.5 (the mean battery life of all redesigned laptops is greater than 8.5 hours)
The significance level is the threshold at which you will consider a result statistically significant. When a statistician declares a
result "significant", it does not mean the effect is large or important; it means the observed result is very unlikely to have
occurred by chance alone.
A result that is significant at the 5% level means the p-value is less than 0.05 (p < 0.05). Similarly, significant at the 1% level
means that the p-value is less than 0.01.
Hypothesis Testing
p-value refers to the probability of observing results as or more extreme than those observed when the null hypothesis is
true. Your p-value helps you determine whether a result is statistically significant. A low p-value indicates high statistical
significance, while a high p-value indicates low or no statistical significance.
As a data professional, you’ll almost always calculate p-value on your computer, using a programming language like R or
other statistical software. In this example, you calculate a p-value of 0.0057, or 0.57%.
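Here is a sketch of that calculation in R as a one-sample, one-sided z-test, assuming the claimed standard deviation of 0.5 hours applies to the population:
mu0   <- 8.5    # mean battery life under the null hypothesis
sigma <- 0.5    # claimed population standard deviation (assumed known)
n     <- 40     # sample size
xbar  <- 8.7    # sample mean

z <- (xbar - mu0) / (sigma / sqrt(n))
p_value <- 1 - pnorm(z)   # one-sided test: Ha says the mean is greater than 8.5
p_value                   # ~0.0057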
In a hypothesis test, you compare your p-value to your significance level to decide whether your results are statistically
significant. There are two main rules for drawing a conclusion about a hypothesis test:
• If your p-value is less than your significance level, you reject the null hypothesis.
• If your p-value is greater than your significance level, you fail to reject the null hypothesis.
In this example, your p-value of 0.57% is less than your significance level of 5%. Your test provides sufficient evidence to
conclude that the mean battery life of all redesigned laptops has increased from 8.5 hours. You reject the null hypothesis.
You determine that your results are statistically significant.
Nuts & Bolts of Machine Learning
Types of Machine Learning
There are two main types: Supervised and Unsupervised learning.
Machine Learning: Supervised
• Supervised learning utilizes labeled datasets, which are tagged with specific metrics or classifications, to train
algorithms for predicting outcomes or classifying data. For instance, training a model on emails labeled as "spam" or
"not spam" so it can classify new, unlabeled emails.
• Labeled data refers to a dataset where each data point has been assigned a specific category or label, providing clear
information about the desired output for a given input. This allows a machine learning model to learn patterns and
make accurate predictions based on these labeled examples.
Machine Learning: Unsupervised
• Unsupervised learning employs algorithms to analyze and group unlabeled datasets without specific instructions on
desired outcomes. The system attempts to uncover patterns from the data. There is no label or target given. A prime
example is identifying unusual or unexpected patterns in data, such as fraudulent transactions
Other Machine Learning Types:
Reinforcement learning, often used in robotics, involves training a computer to make decisions based on rewards or
penalties for its actions.
Deep learning, on the other hand, utilizes interconnected layers of nodes to process and transform signals, ultimately
generating an output.
Common Machine Learning Algorithms
Clustering
• Using unsupervised learning, clustering algorithms can identify patterns in data so that it can be grouped. Computers
can help data scientists by identifying differences between data items that humans have overlooked.
Decision trees
• Decision trees can be used for both predicting numerical values and classifying data into categories. Decision trees
use a branching sequence of linked decisions that can be represented with a tree diagram. One of the advantages of
decision trees is that they are easy to validate and audit, unlike the black box of the neural network.
Random forests
• In a random forest, the machine learning algorithm predicts a value or category by combining the results from a
number of decision trees.
Neural networks
Neural networks simulate the way the human brain works, with a huge number of linked processing nodes. Neural
networks are good at recognizing patterns and play an important role in applications including natural language
translation, image recognition, speech recognition, and image creation.
K-means Clustering
• K-means clustering is a way of grouping data based on how similar or close the data points are to each other.
• The algorithm works by first randomly picking some central points (called centroids) and then assigning every data
point to the nearest centroid. Once that’s done, it recalculates the centroids based on the new groupings and repeats
the process until the clusters make sense.
• It’s a pretty fast and efficient method, but it works best when the clusters are distinct and not too mixed up.
Here’s a look at its key objectives:
Grouping Similar Data Points
• K-Means is designed to cluster data points that share common traits, allowing patterns or trends to emerge. Whether
analyzing customer behavior or images, the method helps reveal hidden relationships within your dataset.
Minimizing Within-Cluster Distance
• Another objective is to keep data points in each group as close to the cluster's centroid as possible. Reducing this
internal distance results in compact, cohesive clusters, enhancing the accuracy of your results.
Maximizing Between-Cluster Distance
• K-Means also aims to maintain clear separation between different clusters. By maximizing the distance between
groups, the algorithm ensures that each cluster remains distinct, providing a better understanding of data categories
without overlap.
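A minimal sketch of K-means in base R, grouping two numeric columns of the built-in iris dataset into an assumed k = 3 clusters:
set.seed(42)   # centroids are chosen randomly, so set a seed for reproducibility

km <- kmeans(iris[, c("Petal.Length", "Petal.Width")], centers = 3)

km$centers                        # final centroid coordinates for each cluster
table(km$cluster, iris$Species)   # how the clusters line up with the known species labels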
Logistic Regression
Regression analysis is a statistical method for understanding and quantifying the relationship between two or more
variables. It helps a business estimate one dependent variable based on the values of one or more independent variables.
Dependent Variable: The dependent variable is essentially the "outcome" you’re trying to understand or predict.
Independent Variable: Independent variables are the "factors" that might influence or cause changes in the dependent
variable. These are the variables you manipulate or observe to see their impact on your outcome of interest. For example, if
you increase the price of a product, will the customer churn? In this case, price is the independent variable.
Logistic Regression: Logistic regression is used when the dependent variable is binary (two possible outcomes). It models
the probability of a particular outcome occurring.
Business Application: In business, logistic regression is employed for tasks like predicting customer churn (yes/no), whether
a customer will purchase a product (yes/no), or whether a loan applicant will default on a loan (yes/no).
Logistic Regression
Types of Logistic Regression
Binary Logistic Regression: Binary logistic regression is used to predict the probability of a binary outcome, such as yes or
no, true or false, or 0 or 1. For example, it could be used to predict whether a customer will churn or not, whether a patient
has a disease or not, or whether a loan will be repaid or not.
Multinomial Logistic Regression: Multinomial logistic regression is used to predict the probability of one of three or more
possible outcomes, such as the type of product a customer will buy, the rating a customer will give a product, or the political
party a person will vote for.
Ordinal Logistic Regression: It is used to predict the probability of an outcome that falls into a predetermined order, such as
the level of customer satisfaction, the severity of a disease, or the stage of cancer.
• In medicine, a frequent application is to find out which variables have an influence on a disease. In this case, 0 could
stand for not diseased and 1 for diseased. Subsequently, the influence of age, gender and smoking status (smoker or not)
on this particular disease could be examined.
• In linear regression, the independent variables (e.g., age and gender) are used to estimate the specific value of the
dependent variable (e.g., body weight).
• In logistic regression, on the other hand, the dependent variable is dichotomous (0 or 1) and the probability that
expression 1 occurs is estimated. Returning to the example above, this means: How likely is it that the disease is present
if the person under consideration has a certain age, sex and smoking status?
To build a logistic regression model, the linear regression equation is used as the starting point.
Logistic Regression
However, if a linear regression were simply calculated to solve a logistic regression problem, the fitted line would produce
predicted values anywhere between plus and minus infinity. The goal of logistic regression, however, is to estimate the
probability of occurrence and not the value of the variable itself. Therefore, the equation must be transformed.
To do this, it is necessary to restrict the value range for the prediction to the range between 0 and 1. To ensure that only
values between 0 and 1 are possible, the logistic function is used.
Logistic Regression
Logistic Function
The logistic model is based on the logistic function. The special thing about the logistic function is that for values between
minus and plus infinity, it always assumes only values between 0 and 1.
To calculate the probability of a person being sick or not using logistic regression for the example above, the model
parameters b1, b2, b3 and a must first be determined. Once these have been determined, the equation becomes:
P(sick) = 1 / (1 + e^-(a + b1*age + b2*gender + b3*smoking status))
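A tiny sketch of the logistic (sigmoid) function itself, showing how any input from minus to plus infinity is squeezed into the 0-1 range:
logistic <- function(x) 1 / (1 + exp(-x))

logistic(-10)   # close to 0
logistic(0)     # exactly 0.5
logistic(10)    # close to 1

curve(logistic, from = -6, to = 6)   # the characteristic S-shaped curve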
Key properties of the Logistic Regression equation
Sigmoid Function: The logistic regression model, when explained, uses a special “S” shaped curve to predict
probabilities. It ensures that the predicted probabilities stay between 0 and 1, which makes sense for probabilities.
Coefficients: These are just numbers that tell us how much each input affects the outcome in the logistic regression
model. For example, if age is a predictor, the coefficient tells us how much the outcome changes for every one-year
increase in age.
Best Guess: We figure out the best coefficients for the logistic regression model by looking at the data we have and
tweaking them until our predictions match the real outcomes as closely as possible.
Basic Assumptions: We assume that our observations are independent, meaning one doesn’t affect the other. We also
assume that there’s not too much overlap between our predictors (like age and height)
Linearity in the Logit: The relationship between the independent variables and the logit of the dependent variable (ln(p /
(1-p))) is assumed to be linear. This doesn’t necessarily mean the outcome itself has a linear relationship with the
independent variables, but the log-odds do
Probabilities, Not Certainties: Instead of saying “yes” or “no” directly, logistic regression gives us probabilities, like
saying there’s a 70% chance it’s a “yes” in the logistic regression model. We can then decide on a cutoff point to make our
final decision.
Checking Our Work: We have some tools to make sure our predictions are good, like accuracy, precision, recall, and a
curve called the ROC curve.
Logistic Regression Example
You have a dataset, and you need to predict whether a candidate will be admitted to the desired college or not, based on
the person's GRE score, GPA, and college rank.
Steps:
1. The dataset contains the GRE scores, GPAs, and college ranks for several students, along with a column that indicates
whether those students were admitted or not.
2. Based on this labeled data, you can train the model, validate it, and then use it to predict the admission for any GPA and
college rank.
3. Once you split the data into training and test sets, you will apply the regression on the three independent variables (GRE,
GPA and Rank), generate the model, and then run the test set through the model.
4. Once that is complete, you will validate the model to see how well it performed.
Data Set: admission records containing each student's GRE score, GPA, college rank, and whether they were admitted.
Logistic Regression Example
Prediction
Let's say a student has a profile with a GRE score of 790, a GPA of 3.8, and a degree from a rank-1 college. Now you want to
predict that student's chances of being admitted.
The model shows there is an 85% chance that this student will be admitted.
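A hedged sketch of this workflow with glm(); the data frame and column names (admissions, admit, gre, gpa, rank) are assumptions for illustration, and the 85% figure comes from the course's fitted model rather than from this sketch:
# Assumed data frame `admissions` with columns admit (0/1), gre, gpa, rank
admissions$rank <- factor(admissions$rank)   # treat college rank as categorical

# Split the data into training and test sets
set.seed(1)
train_idx <- sample(nrow(admissions), 0.8 * nrow(admissions))
train <- admissions[train_idx, ]
test  <- admissions[-train_idx, ]

# Fit a binary logistic regression on the three independent variables
model <- glm(admit ~ gre + gpa + rank, data = train, family = binomial)
summary(model)   # coefficients and their significance

# Validate: simple accuracy of the model on the held-out test set
pred_test <- predict(model, newdata = test, type = "response")
mean((pred_test > 0.5) == test$admit)

# Predict the admission probability for the new student profile
new_student <- data.frame(gre = 790, gpa = 3.8,
                          rank = factor(1, levels = levels(train$rank)))
predict(model, newdata = new_student, type = "response")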
Foundation Skills Academy
Thank You