Google Data Analytics Professional Certificate course
Week - 1
Data integrity and analytics objectives
Why data integrity is important
Data integrity is the accuracy, completeness, consistency, and
trustworthiness of data throughout its lifecycle.
Data can be compromised through human error, viruses, malware, hacking,
and system failures, as well as through the replication, transfer, and
manipulation issues described below.
More about data integrity and compliance
Scenario: calendar dates for a global company
Calendar dates are represented in a lot of different short forms. Depending
on where you live, a different format might be used.
● In some countries, 12/10/20 (DD/MM/YY) stands for October 12, 2020.
● In other countries, the national standard is YYYY-MM-DD so October
12, 2020 becomes 2020-10-12.
● In the United States, (MM/DD/YY) is the accepted format so
October 12, 2020 is going to be 10/12/20.
Data replication compromising data integrity: Continuing with the example,
imagine you ask your international counterparts to verify dates and stick to
one format. One analyst copies a large dataset to check the dates. But
because of memory issues, only part of the dataset is actually copied. The
analyst would be verifying and standardizing incomplete data. That partial
dataset would be certified as compliant but the full dataset would still
contain dates that weren't verified. Two versions of a dataset can introduce
inconsistent results. A final audit of results would be essential to reveal
what happened and correct all dates.
Data transfer compromising data integrity: Another analyst checks the
dates in a spreadsheet and chooses to import the validated and standardized
data back to the database. But suppose the date field from the spreadsheet
was incorrectly classified as a text field during the data import (transfer)
process. Now some of the dates in the database are stored as text strings.
At this point, the data needs to be cleaned to restore its integrity.
Data manipulation compromising data integrity: When checking dates,
another analyst notices what appears to be a duplicate record in the
database and removes it. But it turns out that the analyst removed a unique
record for a company’s subsidiary and not a duplicate record for the
company. Your dataset is now missing data and the data must be restored
for completeness.
Well-aligned objectives and data
Clean data + alignment to business objective = accurate conclusions
Alignment to business objective + additional data cleaning = accurate conclusions
Overcoming the challenges of insufficient data
Ways to address insufficient data:
● Identify trends with the available data
● Wait for more data if time allows
● Talk with stakeholders and adjust your objective
● Look for a new dataset
What to do when you find an issue with your data
Data issues and workarounds
No data:
● Gather the data on a small scale to perform a preliminary analysis and
then request additional time to complete the analysis after you have
collected more data.
● If there isn’t time to collect data, perform the analysis using proxy
data from other datasets. This is the most common workaround.
Too little data:
● Do the analysis using proxy data along with actual data.
● Adjust your analysis to align with the data you already have.
Wrong data, including data with errors:
● If you have the wrong data because requirements were misunderstood,
communicate the requirements again.
● Identify errors in the data and, if possible, correct them at the
source by looking for a pattern in the errors.
● If you can’t correct data errors yourself, you can ignore the wrong
data and go ahead with the analysis if your sample size is still large
enough and ignoring the data won’t cause systematic bias.
The importance of sample size
Population is all possible data values in a certain dataset.
A sample is a part of a population that is representative of the population
as a whole; the sample size is the number of data points in the sample.
Sampling bias is when a sample isn't representative of the population as a
whole. This means some members of the population are being
overrepresented or underrepresented.
Random sampling is a way of selecting a sample from a population so that
every possible sample has an equal chance of being chosen.
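A minimal BigQuery sketch of random sampling (the project, dataset, and table
names are hypothetical, not from the course): ORDER BY RAND() shuffles the
rows and LIMIT keeps a fixed-size sample.
-- Pull a simple random sample of 100 rows from a hypothetical table
SELECT
*
FROM
`my-project.my_dataset.customer_survey`
ORDER BY
RAND() -- a random value per row, so the sort order is random
LIMIT
100; -- keep a sample of 100 rows
For very large tables this full shuffle can be expensive; it is only meant to
illustrate giving every row an equal chance of being selected.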
Things to remember when determining the size of your sample
● Don’t use a sample size less than 30.
● The confidence level most commonly used is 95%, but 90% can work in
some cases.
Increase the sample size to meet specific needs of your project:
● For a higher confidence level, use a larger sample size
● To decrease the margin of error, use a larger sample size
● For greater statistical significance, use a larger sample size
Why a minimum sample of 30?
Central Limit Theorem (CLT)
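In plain terms (standard statistics, not a quote from the course): the CLT says
that as the sample size n grows, the distribution of the sample mean approaches
a normal distribution with standard error sigma / sqrt(n), regardless of the
shape of the underlying population. A sample of about 30 is the conventional
rule-of-thumb point at which that approximation is usually considered good
enough, which is why the course advises not to go below 30.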
Testing your data
Using statistical power
Statistical power is the probability that a test will produce meaningful
results; a power of 0.8 (80%) is the commonly accepted minimum.
Hypothesis testing
If a test is statistically significant, it means the results of the test are
real and not an error caused by random chance.
What to do when there is no data
Open data is the information that has been published on
government-sanctioned portals. In the best case, this data is structured,
machine-readable, open-licensed, and well maintained.
Public data is the data that exists everywhere else. This is information
that’s freely available (but not really accessible) on the web. It is frequently
unstructured and unruly, and its usage requirements are often vague.
Different types of datasets on Kaggle
● https://www.kaggle.com/datasnaek/youtube-new
● https://www.kaggle.com/sakshigoyal7/credit-card-customers
● https://www.kaggle.com/rtatman/188-million-us-wildfires
● https://www.kaggle.com/bigquery/google-analytics-sample
Sample size calculator
● https://www.surveymonkey.com/mp/sample-size-calculator/
● http://www.raosoft.com/samplesize.html
Consider the margin of error
Margin of error is the maximum amount that the sample results are
expected to differ from those of the actual population.
Eg: Imagine you are playing baseball and that you are up at bat. The crowd is
roaring, and you are getting ready to try to hit the ball. The pitcher delivers
a fastball traveling about 90-95mph, which takes about 400 milliseconds
(ms) to reach the catcher’s glove. You swing and miss the first pitch because
your timing was a little off. You wonder if you should have swung slightly
earlier or slightly later to hit a home run. That time difference can be
considered the margin of error, and it tells us how close or far your timing
was from the average home run swing.
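For a survey proportion, the textbook margin-of-error formula (standard
statistics, not from the course notes) is: margin of error = z * sqrt( p * (1 - p) / n ),
where z is the z-score for the chosen confidence level (about 1.96 for 95%),
p is the observed proportion, and n is the sample size, so increasing n
shrinks the margin of error.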
Week - 2
Data cleaning is a must
Clean it up!
Dirty data is data that's incomplete, incorrect, or irrelevant to the problem
you're trying to solve.
Clean data is data that's complete, correct, and relevant to the problem
you're trying to solve.
What is dirty data?
● Types of dirty data you may encounter
● What may have caused the data to become dirty
● How dirty data is harmful to businesses
Types of dirty data
Inconsistent data
Any data that uses different formats to represent the same thing
A field is a single piece of information from a row or column of a spreadsheet.
Data validation is a tool for checking the accuracy and quality of data
before adding or importing it.
Begin cleaning data
Common data-cleaning pitfalls
Top ten ways to clean your data
● https://support.microsoft.com/en-us/office/top-ten-ways-to-clean-yo
ur-data-2844b620-677c-47a7-ac3e-c2e157d1db19
● https://support.google.com/a/users/answer/9604139?hl=en#zippy=
Hands-On Activity: Cleaning data with spreadsheets
● Filter
● Transpose (while pasting)
● Data cleanup (option: cleanup suggestion)
● Change text case (using an add-on, e.g., all caps to lowercase)
Cleaning data in spreadsheets
Data-cleaning features in spreadsheets
Conditional formatting
● Conditional formatting (to find empty cell)
● Remove duplicates
● Date formatting (format->number->date)
● Split text at a specified separator character, also called the delimiter
● Data validation
Optimize the data-cleaning process
A function is a set of instructions that performs a specific calculation using
the data in a spreadsheet.
Some basic functions for cleaning data in a spreadsheet:
● COUNTIF: counts the cells in a range that match a given condition
● LEN: returns the number of characters in a text string
● LEFT: returns a set number of characters from the start of a string
● RIGHT: returns a set number of characters from the end of a string
● CONCATENATE: joins two or more text strings together
● TRIM: removes leading, trailing, and repeated spaces
Workflow automation
● https://towardsdatascience.com/automating-scientific-data-analysis-p
art-1-c9979cd0817e
● https://news.mit.edu/2016/automating-big-data-analysis-1021
● https://technologyadvice.com/blog/information-technology/top-10-wo
rkflow-automation-software/
Different data perspectives
● Pivot table
● VLOOKUP - vertical lookup
● Find
● Graph plotting
Even more data-cleaning techniques
Data mapping is the process of matching fields from one database to
another.
Compatibility describes how well two or more data sets are able to work
together.
● CONCATENATE can be used to combine mapped fields into a single field
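A rough BigQuery sketch of data mapping (the source table and its columns are
hypothetical): fields from the source are renamed, combined, and cast so they
line up with the target schema.
-- Map hypothetical source fields onto a target layout
SELECT
cust_id AS customer_id, -- rename to match the target field name
CONCAT(first_name, ' ', last_name) AS customer_name, -- combine two source fields into one
CAST(order_total AS FLOAT64) AS revenue -- align the data type with the target
FROM
`my-project.my_dataset.source_orders`;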
Hands-On Activity: Clean data with spreadsheet
functions
● SPLIT
● COUNTIF
● Sort
Learning Log: Develop your approach to cleaning data
Step 1: Create your checklist
Some things you might include in your checklist:
● Size of the data set
● Number of categories or labels
● Missing data
● Unformatted data
● The different data types
Step 2: List your preferred cleaning methods
After you have compiled your personal checklist, you can create a list of
activities you like to perform when cleaning data. This list is a collection of
procedures that you will implement when you encounter specific issues
present in the data related to your checklist or every time you clean a new
dataset.
For example, suppose that you have a dataset with missing data: how would
you handle it? Moreover, if the dataset is very large, what would you do to
check for missing data? Outlining some of your preferred methods for
cleaning data can help save you time and energy.
Step 3: Choose a data cleaning motto
Now that you have a personal checklist and your preferred data cleaning
methods, you can create a data cleaning motto to help guide and explain your
process. The motto is a short one or two sentence summary of your
philosophy towards cleaning data. For example, here are a few data cleaning
mottos from other data analysts:
1. "Not all data is the same, so don't treat it all the same."
2. "Be prepared for things to not go as planned. Have a backup plan.”
3. "Avoid applying complicated solutions to simple problems."
My list
● Find empty cells
● Remove duplicates
● Fix date formats
● Split out the wanted information
● Check conditions
Week - 3
Using SQL to clean data
Understanding SQL capabilities
Relational databases
A relational database is a database that contains a series of tables that can
be connected to form relationships.
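For example, a minimal sketch of two related tables (both table names and
columns are hypothetical, not from the course), connected through a shared
customer_id key:
-- orders and customers are related through customer_id
SELECT
customers.customer_name,
orders.order_date,
orders.order_total
FROM
`my-project.my_dataset.orders` AS orders
JOIN
`my-project.my_dataset.customers` AS customers
ON orders.customer_id = customers.customer_id;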
Using SQL as a junior data analyst
SQL dialects and their uses
Links for learning SQL and its dialects
● https://learnsql.com/blog/what-sql-dialect-to-learn/
● https://www.softwaretestinghelp.com/sql-vs-mysql-vs-sql-server/
● https://www.datacamp.com/community/blog/sql-differences
● https://sqlite.org/windowfunctions.html
● https://www.sqltutorial.org/what-is-sql/
Hands-On Activity: Processing time with SQL
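-- Totals the view counts for every Wikipedia title containing "Google",
-- grouped by language and title, sorted by most-viewed first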
SELECT
language,
title,
SUM(views) AS views
FROM
`bigquery-samples.wikipedia_benchmark.Wiki10B`
WHERE
title LIKE '%Google%'
GROUP BY
language,
title
ORDER BY
views DESC;
Learn basic SQL queries
Widely used SQL queries (combined in the sketch after the lists below)
➢ INSERT INTO
➢ VALUES
➢ UPDATE
➢ SET
➢ SELECT (with COUNT, SUM, *, DISTINCT)
➢ FROM
➢ WHERE
➢ ORDER BY
➢ GROUP BY
➢ LIMIT
SELECT
● COUNT
● SUM
● *
● DISTINCT
● LENGTH()
WHERE
● SUBSTR()
● TRIM()
● LENGTH()
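Putting several of the keywords above together, here is a minimal sketch (the
table, columns, and values are hypothetical, not from the course):
-- Add a row
INSERT INTO `my-project.my_dataset.products` (product_id, product_name, price)
VALUES (101, 'lamp', 19.99);
-- Correct a value
UPDATE `my-project.my_dataset.products`
SET price = 24.99
WHERE product_id = 101;
-- Summarize: count distinct product names and total price, ignoring stray spaces
SELECT
COUNT(DISTINCT product_name) AS product_count,
SUM(price) AS total_price
FROM
`my-project.my_dataset.products`
WHERE
LENGTH(TRIM(product_name)) > 3;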
Hands-On Activity: Clean data using SQL
● MIN
● MAX
● UPDATE
● SET
● DISTINCT
Step 1:
SELECT
DISTINCT(fuel_type)
FROM
`dulcet-velocity-294320.From_course.automobile_data`
-- The remaining steps are wrapped in /* ... */ (multi-line comments) so
-- each one can be uncommented and run on its own.
--STEP 2
/*SELECT
MIN(length) as min_length,
MAX(length) as max_length
FROM
`dulcet-velocity-294320.From_course.automobile_data`*/
--STEP 3
/*SELECT
*
FROM
`dulcet-velocity-294320.From_course.automobile_data`
WHERE
num_of_doors is NULL*/
--STEP 4
/*UPDATE
`dulcet-velocity-294320.From_course.automobile_data`
SET
num_of_doors = "four"
WHERE
make = "dodge"
AND fuel_type = "gas"
AND body_style = "sedan";*/ --Works only if you pay, i.e., with billing enabled (the free sandbox blocks UPDATE)
--MY CODE
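-- Note: AND binds tighter than OR, so the filter below means
-- make = 'dodge' OR (fuel_type = 'gas' AND body_style = 'sedan')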
/*SELECT
*
FROM
`dulcet-velocity-294320.From_course.automobile_data`
WHERE
make = "dodge"
OR fuel_type = "gas"
AND body_style = "sedan"*/
--STEP 5
/*SELECT
DISTINCT(num_of_cylinders)
FROM
`dulcet-velocity-294320.From_course.automobile_data`*/
--STEP 6
/*UPDATE
cars.car_info
SET
num_of_cylinders = "two"
WHERE
num_of_cylinders = "tow";*/
--STEP 7
/*SELECT
MIN(compression_ratio) AS min_compression_ratio,
MAX(compression_ratio) AS max_compression_ratio
FROM
`dulcet-velocity-294320.From_course.automobile_data`
WHERE
compression_ratio <> 70;*/ --omit 70
--STEP 8
/*SELECT
COUNT(*) AS num_of_rows_to_delete
FROM
`dulcet-velocity-294320.From_course.automobile_data`
WHERE
compression_ratio = 70;*/
--STEP 9
/*DELETE FROM
`dulcet-velocity-294320.From_course.automobile_data`
WHERE
compression_ratio = 70;*/
--STEP 9
/*SELECT
DISTINCT drive_wheels,
LENGTH(drive_wheels) AS string_length
FROM
`dulcet-velocity-294320.From_course.automobile_data`*/
--STEP 10
/*UPDATE
cars.car_info
SET
drive_wheels = TRIM(drive_wheels)
WHERE
TRUE;*/
--STEP 10
/*SELECT
TRIM(drive_wheels),
LENGTH(drive_wheels) AS string_length
FROM
`dulcet-velocity-294320.From_course.automobile_data`*/
--TEST
/*SELECT
MAX(price) as MAX_PRICE
FROM
`dulcet-velocity-294320.From_course.automobile_data`*/
Transforming data
Upload the store transactions dataset to BigQuery
[
{
"description": "date",
"mode": "NULLABLE",
"name": "date",
"type": "DATETIME"
},
{
"description": "transaction id",
"mode": "NULLABLE",
"name": "transaction_id",
"type": "INTEGER"
},
{
"description": "customer id",
"mode": "NULLABLE",
"name": "customer_id",
"type": "INTEGER"
},
{
"description": "product name",
"mode": "NULLABLE",
"name": "product",
"type": "STRING"
},
{
"description": "product_code",
"mode": "NULLABLE",
"name": "product_code",
"type": "STRING"
},
{
"description": "product color",
"mode": "NULLABLE",
"name": "product_color",
"type": "STRING"
},
{
"description": "product price",
"mode": "NULLABLE",
"name": "product_price",
"type": "FLOAT"
},
{
"description": "quantity purchased",
"mode": "NULLABLE",
"name": "purchase_size",
"type": "INTEGER"
},
{
"description": "purchase price",
"mode": "NULLABLE",
"name": "purchase_price",
"type": "STRING"
},
{
"description": "revenue",
"mode": "NULLABLE",
"name": "revenue",
"type": "FLOAT"
}
]
Three ways of uploading files:
● Direct CSV upload when the file has a header row
● TXT upload where the headers and data types are supplied manually (the
same applies to a CSV without a header row)
● Supplying the schema to change data types while uploading a file that has
headers
Type conversion
PART-1
SELECT
*
FROM
`dulcet-velocity-294320.From_course.customer_purchase`
ORDER BY
CAST(purchase_price AS FLOAT64 ) DESC
PART-2
--FILTERING BY DATE RANGE
/*
SELECT
date,
purchase_price
FROM
`dulcet-velocity-294320.From_course.customer_purchase`
WHERE
date BETWEEN '2020-12-01' and '2020-12-31' */
--CAST Change data types
/*
SELECT
CAST(date as date) as DATE,
purchase_price
FROM
`dulcet-velocity-294320.From_course.customer_purchase`
ORDER BY
CAST(date as date) */
--CONCAT joins strings into a single string
/*
SELECT
CONCAT(product_code,product_color) as unique_color_id
FROM
`dulcet-velocity-294320.From_course.customer_purchase`
WHERE
product = 'couch'*/
--COALESCE() returns the first non-null value
SELECT
COALESCE(product,product_code) as product_info
FROM
`dulcet-velocity-294320.From_course.customer_purchase`
Part 1 & 2
● CAST() change data type
● CONCAT() join 2 string
● COALESCE() from this or that
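A short sketch combining all three on the same customer_purchase table queried
above (column names taken from the schema used for the upload):
SELECT
CAST(date AS DATE) AS purchase_date, -- DATETIME to DATE
CONCAT(product_code, product_color) AS product_variant, -- combine two strings
COALESCE(product, product_code) AS product_info, -- fall back to the code if the name is NULL
CAST(purchase_price AS FLOAT64) AS purchase_price_num -- STRING to FLOAT64
FROM
`dulcet-velocity-294320.From_course.customer_purchase`
ORDER BY
purchase_date;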
Week - 4
Manually cleaning data
Verifying and reporting results
Verification is a process to confirm that a data cleaning effort was well-
executed and the resulting data is accurate and reliable.
A changelog is a file containing a chronologically ordered list of
modifications made to a project.
Cleaning and your data expectations
● Using spreadsheets
● Using SQL
● Big-picture verification (including graphs)
The final step in data cleaning
● Spell check
● Spreadsheet => find and replace
● SQL => CASE
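-- CASE returns the value for the first condition that matches; here it
-- standardizes 'fan' and 'lamps' and leaves other product names unchanged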
SELECT
customer_id,
CASE
WHEN product = 'fan' THEN 'FAN'
WHEN product = 'lamps' THEN 'LAMP'
ELSE product
END AS Dhamu
FROM `dulcet-velocity-294320.From_course.customer_purchase`
Data-cleaning verification: A checklist (a few of these checks are sketched in SQL after the list)
● Sources of errors: Did you use the right tools and functions to find
the source of the errors in your dataset?
● Null data: Did you search for NULLs using conditional formatting and
filters?
● Misspelled words: Did you locate all misspellings?
● Mistyped numbers: Did you double-check that your numeric data has
been entered correctly?
● Extra spaces and characters: Did you remove any extra spaces or
characters using the TRIM function?
● Duplicates: Did you remove duplicates in spreadsheets using the
Remove Duplicates function or DISTINCT in SQL?
● Mismatched data types: Did you check that numeric, date, and string
data are typecast correctly?
● Messy (inconsistent) strings: Did you make sure that all of your
strings are consistent and meaningful?
● Messy (inconsistent) date formats: Did you format the dates
consistently throughout your dataset?
● Misleading variable labels (columns): Did you name your columns
meaningfully?
● Truncated data: Did you check for truncated or missing data that
needs correction?
● Business Logic: Did you check that the data makes sense given your
knowledge of the business?
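A few of the checklist items above expressed as SQL against the same
customer_purchase table (a sketch, not a complete verification):
-- Null data: rows with a missing product name
SELECT COUNT(*) AS null_products
FROM `dulcet-velocity-294320.From_course.customer_purchase`
WHERE product IS NULL;
-- Duplicates: transaction ids that appear more than once
SELECT transaction_id, COUNT(*) AS occurrences
FROM `dulcet-velocity-294320.From_course.customer_purchase`
GROUP BY transaction_id
HAVING COUNT(*) > 1;
-- Extra spaces: values that change after TRIM
SELECT DISTINCT product
FROM `dulcet-velocity-294320.From_course.customer_purchase`
WHERE product != TRIM(product);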
The goal of your project
● Confirm the business problem
● Confirm the goal of the project
● Verify that data can solve the problem and is aligned to the goal
Documenting results and the cleaning process
Capturing cleaning changes
Documentation
Documentation is the process of tracking changes, additions, deletions, and
errors involved in your data-cleaning effort.
Data errors are the crime, data cleaning is gathering evidence, and
documentation is detailing exactly what happened for peer review or
court.
Here is how a version control system affects a change to a query:
1. A company has official versions of important queries in their version
control system.
2. An analyst makes sure the most up-to-date version of the query is the
one they will change. This is called syncing.
3. The analyst makes a change to the query.
4. The analyst might ask someone to review this change. This is called a
code review and can be informally or formally done. An informal review
could be as simple as asking a senior analyst to take a look at the
change.
5. After a reviewer approves the change, the analyst submits the
updated version of the query to a repository in the company's version
control system. This is called a code commit. A best practice is to
document exactly what the change was and why it was made in a
comments area. Going back to our example of a query that pulls daily
revenue, a comment might be: Updated revenue to include revenue
coming from the new product, Calypso.
6. After the change is submitted, everyone else in the company will be
able to access and use this new query when they sync to the most
up-to-date queries stored in the version control system.
7. If the query has a problem or business needs change, the analyst can
undo the change to the query using the version control system. The
analyst can look at a chronological list of all changes made to the query
and who made each change. Then, after finding their own change, the
analyst can revert to the previous version.
8. The query is back to what it was before the analyst made the change.
And everyone at the company sees this reverted, original query, too.
Changelogs are for humans, not machines, so write legibly.
Embrace changelogs
You can keep changelogs in:
● Google Sheets
● Excel
● BigQuery
Typically, a changelog records this type of information:
● Data, file, formula, query, or any other component that changed
● Description of what changed
● Date of the change
● Person who made the change
● Person who approved the change
● Version number
● Reason for the change
Changelog documentation
# Changelog
This file contains the notable changes to the project
Version 1.0.0 (02-23-2019)
## New
- Added column classifiers (Date, Time, PerUnitCost, TotalCost, etc. )
- Added Column “AveCost” to track average item cost
## Changes
- Changed date format to MM-DD-YYYY
- Removal of whitespace (cosmetic)
## Fixes
- Fixed misalignment in Column "TotalCost" where some rows did not
match with correct dates
- Fixed SUM to run over entire column instead of partial
Some of the most common errors involve
● human mistakes like mistyping or misspelling,
● flawed processes like poor design of a survey form, and
● system issues where older systems integrate data incorrectly.
Advanced functions for speedy data cleaning
● QUERY: runs a Google Visualization API query over a range of data
● IMPORTRANGE: imports a range of cells from another spreadsheet
● FILTER: returns a filtered version of a range, keeping only the rows that
meet the given conditions
QUERY
● https://support.google.com/docs/answer/3093343?hl=en
FILTER
● https://support.google.com/docs/answer/3093197?hl=en
IMPORTRANGE
● https://support.google.com/docs/answer/3093340?hl=en#
Week - 5
Understand the elements of a data analyst
resume
CareerCon resources on YouTube
YouTube links
● https://www.youtube.com/playlist?list=PLqFaTIg4myu-npFrYu6cO7h7
AI6bkcOlL
● https://www.youtube.com/watch?v=cBbYhhH399c&list=PLqFaTIg4myu
-npFrYu6cO7h7AI6bkcOlL&index=9
Adding professional skills to your resume
Highlighting experiences on resumes
Adding soft skills to your resume
Quick Review
Week - 1
● Data integrity
● Manage insufficient data
● Statistics
Week - 2
● Spreadsheet
Week - 3
● SQL
Week - 4
● Verification and Cleaning
● Changelog and documentation
● Checklist
Week - 5
● Hiring process
● Resume building
Dhamodharan
14/10/2021