Google Data Analytics Professional Certificate course
Week - 1
Data integrity and analytics objectives
Why data integrity is important
Data integrity is the accuracy, completeness, consistency, and
trustworthiness of data throughout its lifecycle.
Data can be compromised through human error, viruses, malware, hacking,
and system failures, as well as through the replication, transfer, and
manipulation issues described below.
More about data integrity and compliance
Scenario: calendar dates for a global company
Calendar dates are represented in a lot of different short forms. Depending
on where you live, a different format might be used.
● In some countries, 12/10/20 (DD/MM/YY) stands for October 12, 2020.
● In other countries, the national standard is YYYY-MM-DD so October
12, 2020 becomes 2020-10-12.
● In the United States, (MM/DD/YY) is the accepted format so
October 12, 2020 is going to be 10/12/20.
Data replication compromising data integrity: Continuing with the example,
imagine you ask your international counterparts to verify dates and stick to
one format. One analyst copies a large dataset to check the dates. But
because of memory issues, only part of the dataset is actually copied. The
analyst would be verifying and standardizing incomplete data. That partial
dataset would be certified as compliant but the full dataset would still
contain dates that weren't verified. Two versions of a dataset can introduce
inconsistent results. A final audit of results would be essential to reveal
what happened and correct all dates.
Data transfer compromising data integrity: Another analyst checks the
dates in a spreadsheet and chooses to import the validated and standardized
data back to the database. But suppose the date field from the spreadsheet
was incorrectly classified as a text field during the data import (transfer)
process. Now some of the dates in the database are stored as text strings.
At this point, the data needs to be cleaned to restore its integrity.
Data manipulation compromising data integrity: When checking dates,
another analyst notices what appears to be a duplicate record in the
database and removes it. But it turns out that the analyst removed a unique
record for a company’s subsidiary and not a duplicate record for the
company. Your dataset is now missing data and the data must be restored
for completeness.
Well-aligned objectives and data
Clean data + alignment to business objective = accurate conclusions
Alignment to business objective + additional data cleaning = accurate conclusions
Overcoming the challenges of insufficient data
Ways to address insufficient data:
● Identify trends with the available data
● Wait for more data if time allows
● Talk with stakeholders and adjust your objective
● Look for a new dataset
What to do when you find an issue with your data
Data issues and workarounds
No data:
● Gather the data on a small scale to perform a preliminary analysis and
then request additional time to complete the analysis after you have
collected more data.
● If there isn’t time to collect data, perform the analysis using proxy
data from other datasets. This is the most common workaround.
Too little data:
● Do the analysis using proxy data along with actual data.
● Adjust your analysis to align with the data you already have.
Wrong data, including data with errors:
● If you have the wrong data because requirements were misunderstood,
communicate the requirements again.
● Identify errors in the data and, if possible, correct them at the
source by looking for a pattern in the errors.
● If you can’t correct data errors yourself, you can ignore the wrong
data and go ahead with the analysis if your sample size is still large
enough and ignoring the data won’t cause systematic bias.
The importance of sample size
Population is all possible data values in a certain dataset.
A sample is a part of a population that is representative of the population
as a whole; the sample size is the number of data points in the sample.
Sampling bias is when a sample isn't representative of the population as a
whole. This means some members of the population are being
overrepresented or underrepresented.
Random sampling is a way of selecting a sample from a population so that
every possible sample has an equal chance of being chosen.
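A minimal BigQuery sketch of random sampling (the project, dataset, and table
names are hypothetical, not from the course): ORDER BY RAND() shuffles the
rows and LIMIT keeps a fixed-size sample.
-- Pull a simple random sample of 100 rows from a hypothetical table
SELECT
*
FROM
`my-project.my_dataset.customer_survey`
ORDER BY
RAND() -- a random value per row, so the sort order is random
LIMIT
100; -- keep a sample of 100 rows
For very large tables this full shuffle can be expensive; it is only meant to
illustrate giving every row an equal chance of being selected.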
Things to remember when determining the size of your sample
● Don’t use a sample size less than 30.
● The confidence level most commonly used is 95%, but 90% can work in
some cases.
Increase the sample size to meet specific needs of your project:
● For a higher confidence level, use a larger sample size
● To decrease the margin of error, use a larger sample size
● For greater statistical significance, use a larger sample size
Why a minimum sample of 30?
Central Limit Theorem (CLT)
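In plain terms (standard statistics, not a quote from the course): the CLT says
that as the sample size n grows, the distribution of the sample mean approaches
a normal distribution with standard error sigma / sqrt(n), regardless of the
shape of the underlying population. A sample of about 30 is the conventional
rule-of-thumb point at which that approximation is usually considered good
enough, which is why the course advises not to go below 30.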
Testing your data
Using statistical power
Statistical power is the probability that a test will produce meaningful
results; a power of 0.8 (80%) is the commonly accepted minimum.
Hypothesis testing
If a test is statistically significant, it means the results of the test are
real and not an error caused by random chance.
What to do when there is no data
Open data is the information that has been published on
government-sanctioned portals. In the best case, this data is structured,
machine-readable, open-licensed, and well maintained.
Public data is the data that exists everywhere else. This is information
that’s freely available (but not really accessible) on the web. It is frequently
unstructured and unruly, and its usage requirements are often vague.
Different types of datasets on Kaggle
● https://www.kaggle.com/datasnaek/youtube-new
● https://www.kaggle.com/sakshigoyal7/credit-card-customers
● https://www.kaggle.com/rtatman/188-million-us-wildfires
● https://www.kaggle.com/bigquery/google-analytics-sample
Sample size calculator
● https://www.surveymonkey.com/mp/sample-size-calculator/
● http://www.raosoft.com/samplesize.html
Consider the margin of error
Margin of error is the maximum amount that the sample results are
expected to differ from those of the actual population.
Eg: Imagine you are playing baseball and that you are up at bat. The crowd is
roaring, and you are getting ready to try to hit the ball. The pitcher delivers
a fastball traveling about 90-95mph, which takes about 400 milliseconds
(ms) to reach the catcher’s glove. You swing and miss the first pitch because
your timing was a little off. You wonder if you should have swung slightly
earlier or slightly later to hit a home run. That time difference can be
considered the margin of error, and it tells us how close or far your timing
was from the average home run swing.
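For a survey proportion, the textbook margin-of-error formula (standard
statistics, not from the course notes) is: margin of error = z * sqrt( p * (1 - p) / n ),
where z is the z-score for the chosen confidence level (about 1.96 for 95%),
p is the observed proportion, and n is the sample size, so increasing n
shrinks the margin of error.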
Week - 2
Data cleaning is a must
Clean it up!
Dirty data is data that's incomplete, incorrect, or irrelevant to the problem
you're trying to solve.
Clean data is data that's complete, correct, and relevant to the problem
you're trying to solve.
What is dirty data?
● Types of dirty data you may encounter
● What may have caused the data to become dirty
● How dirty data is harmful to businesses
Types of dirty data
Inconsistent data
Any data that uses different formats to represent the same thing
A field is a single piece of information from a row or column of a spreadsheet.
Data validation is a tool for checking the accuracy and quality of data
before adding or importing it.
Begin cleaning data
Common data-cleaning pitfalls
Top ten ways to clean your data
● https://support.microsoft.com/en-us/office/top-ten-ways-to-clean-yo
ur-data-2844b620-677c-47a7-ac3e-c2e157d1db19
● https://support.google.com/a/users/answer/9604139?hl=en#zippy=
Hands-On Activity: Cleaning data with spreadsheets
● Filter
● Transpose (while pasting)
● Data cleanup (option: cleanup suggestion)
● Change text case (using an add-on, e.g., all caps to lowercase)
Cleaning data in spreadsheets
Data-cleaning features in spreadsheets
Conditional formatting
● Conditional formatting (to find empty cell)
● Remove duplicates
● Date formatting (format->number->date)
● Split text at a specified separator character, also called the delimiter
● Data validation
Optimize the data-cleaning process
A function is a set of instructions that performs a specific calculation using
the data in a spreadsheet.
Some basic functions for cleaning data in a spreadsheet:
● COUNTIF: counts the cells in a range that match a given condition
● LEN: returns the number of characters in a text string
● LEFT: returns a set number of characters from the start of a string
● RIGHT: returns a set number of characters from the end of a string
● CONCATENATE: joins two or more text strings together
● TRIM: removes leading, trailing, and repeated spaces
Workflow automation
● https://towardsdatascience.com/automating-scientific-data-analysis-p
art-1-c9979cd0817e
● https://news.mit.edu/2016/automating-big-data-analysis-1021
● https://technologyadvice.com/blog/information-technology/top-10-wo
rkflow-automation-software/
Different data perspectives
● Pivot table
● VLOOKUP - vertical lookup
● Find
● Graph plotting
Even more data-cleaning techniques
Data mapping is the process of matching fields from one database to
another.
Compatibility describes how well two or more data sets are able to work
together.
● CONCATENATE can be used to combine mapped fields into a single field
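A rough BigQuery sketch of data mapping (the source table and its columns are
hypothetical): fields from the source are renamed, combined, and cast so they
line up with the target schema.
-- Map hypothetical source fields onto a target layout
SELECT
cust_id AS customer_id, -- rename to match the target field name
CONCAT(first_name, ' ', last_name) AS customer_name, -- combine two source fields into one
CAST(order_total AS FLOAT64) AS revenue -- align the data type with the target
FROM
`my-project.my_dataset.source_orders`;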
Hands-On Activity: Clean data with spreadsheet
functions
● SPLIT
● COUNTIF
● Sort
Learning Log: Develop your approach to cleaning data
Step 1: Create your checklist
Some things you might include in your checklist:
● Size of the data set
● Number of categories or labels
● Missing data
● Unformatted data
● The different data types
Step 2: List your preferred cleaning methods
After you have compiled your personal checklist, you can create a list of
activities you like to perform when cleaning data. This list is a collection of
procedures that you will implement when you encounter specific issues
present in the data related to your checklist or every time you clean a new
dataset.
For example, suppose that you have a dataset with missing data: how would
you handle it? Moreover, if the dataset is very large, what would you do to
check for missing data? Outlining some of your preferred methods for
cleaning data can help save you time and energy.
Step 3: Choose a data cleaning motto
Now that you have a personal checklist and your preferred data cleaning
methods, you can create a data cleaning motto to help guide and explain your
process. The motto is a short one or two sentence summary of your
philosophy towards cleaning data. For example, here are a few data cleaning
mottos from other data analysts:
1. "Not all data is the same, so don't treat it all the same."
2. "Be prepared for things to not go as planned. Have a backup plan.”
3. "Avoid applying complicated solutions to simple problems."
My list
● Find empty cells
● Remove duplicates
● Fix date formats
● Split out the wanted information
● Check conditions
Week - 3
Using SQL to clean data
Understanding SQL capabilities
Relational databases
A relational database is a database that contains a series of tables that can
be connected to form relationships.
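For example, a minimal sketch of two related tables (both table names and
columns are hypothetical, not from the course), connected through a shared
customer_id key:
-- orders and customers are related through customer_id
SELECT
customers.customer_name,
orders.order_date,
orders.order_total
FROM
`my-project.my_dataset.orders` AS orders
JOIN
`my-project.my_dataset.customers` AS customers
ON orders.customer_id = customers.customer_id;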
Using SQL as a junior data analyst
SQL dialects and their uses
Links for learning SQL and its dialects
● https://learnsql.com/blog/what-sql-dialect-to-learn/
● https://www.softwaretestinghelp.com/sql-vs-mysql-vs-sql-server/
● https://www.datacamp.com/community/blog/sql-differences
● https://sqlite.org/windowfunctions.html
● https://www.sqltutorial.org/what-is-sql/
Hands-On Activity: Processing time with SQL
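-- Totals the view counts for every Wikipedia title containing "Google",
-- grouped by language and title, sorted by most-viewed first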
SELECT
language,
title,
SUM(views) AS views
FROM
`bigquery-samples.wikipedia_benchmark.Wiki10B`
WHERE
title LIKE '%Google%'
GROUP BY
language,
title
ORDER BY
views DESC;
Learn basic SQL queries
Widely used SQL queries (combined in the sketch after the lists below)
➢ INSERT INTO
➢ VALUES
➢ UPDATE
➢ SET
➢ SELECT (with COUNT, SUM, *, DISTINCT)
➢ FROM
➢ WHERE
➢ ORDER BY
➢ GROUP BY
➢ LIMIT
SELECT
● COUNT
● SUM
● *
● DISTINCT
● LENGTH()
WHERE
● SUBSTR()
● TRIM()
● LENGTH()
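Putting several of the keywords above together, here is a minimal sketch (the
table, columns, and values are hypothetical, not from the course):
-- Add a row
INSERT INTO `my-project.my_dataset.products` (product_id, product_name, price)
VALUES (101, 'lamp', 19.99);
-- Correct a value
UPDATE `my-project.my_dataset.products`
SET price = 24.99
WHERE product_id = 101;
-- Summarize: count distinct product names and total price, ignoring stray spaces
SELECT
COUNT(DISTINCT product_name) AS product_count,
SUM(price) AS total_price
FROM
`my-project.my_dataset.products`
WHERE
LENGTH(TRIM(product_name)) > 3;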
Hands-On Activity: Clean data using SQL
● MIN
● MAX
● UPDATE
● SET
● DISTINCT
Step 1:
SELECT
DISTINCT(fuel_type)
FROM
`dulcet-velocity-294320.From_course.automobile_data`
-- The remaining steps are wrapped in /* ... */ (multi-line comments) so
-- each one can be uncommented and run on its own.
--STEP 2
/*SELECT
MIN(length) as min_length,
MAX(length) as max_length
FROM
`dulcet-velocity-294320.From_course.automobile_data`*/
--STEP 3
/*SELECT
*
FROM
`dulcet-velocity-294320.From_course.automobile_data`
WHERE
num_of_doors is NULL*/
--STEP 4
/*UPDATE
`dulcet-velocity-294320.From_course.automobile_data`
SET
num_of_doors = "four"
WHERE
make = "dodge"
AND fuel_type = "gas"
AND body_style = "sedan";*/ --Works only if you pay, i.e., with billing enabled (the free sandbox blocks UPDATE)
--MY CODE
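-- Note: AND binds tighter than OR, so the filter below means
-- make = 'dodge' OR (fuel_type = 'gas' AND body_style = 'sedan')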
/*SELECT
*
FROM
`dulcet-velocity-294320.From_course.automobile_data`
WHERE
make = "dodge"
OR fuel_type = "gas"
AND body_style = "sedan"*/
--STEP 5
/*SELECT
DISTINCT(num_of_cylinders)
FROM
`dulcet-velocity-294320.From_course.automobile_data`*/
--STEP 6
/*UPDATE
cars.car_info
SET
num_of_cylinders = "two"
WHERE
num_of_cylinders = "tow";*/
--STEP 7
/*SELECT
MIN(compression_ratio) AS min_compression_ratio,
MAX(compression_ratio) AS max_compression_ratio
FROM
`dulcet-velocity-294320.From_course.automobile_data`
WHERE
compression_ratio <> 70;*/ --omit 70
--STEP 8
/*SELECT
COUNT(*) AS num_of_rows_to_delete
FROM
`dulcet-velocity-294320.From_course.automobile_data`
WHERE
compression_ratio = 70;*/
--STEP 9
/*DELETE FROM
`dulcet-velocity-294320.From_course.automobile_data`
WHERE
compression_ratio = 70;*/
--STEP 9
/*SELECT
DISTINCT drive_wheels,
LENGTH(drive_wheels) AS string_length
FROM
`dulcet-velocity-294320.From_course.automobile_data`*/
--STEP 10
/*UPDATE
cars.car_info
SET
drive_wheels = TRIM(drive_wheels)
WHERE
TRUE;*/
--STEP 10
/*SELECT
TRIM(drive_wheels),
LENGTH(drive_wheels) AS string_length
FROM
`dulcet-velocity-294320.From_course.automobile_data`*/
--TEST
/*SELECT
MAX(price) as MAX_PRICE
FROM
`dulcet-velocity-294320.From_course.automobile_data`*/
Transforming data
Upload the store transactions dataset to BigQuery
[
{
"description": "date",
"mode": "NULLABLE",
"name": "date",
"type": "DATETIME"
},
{
"description": "transaction id",
"mode": "NULLABLE",
"name": "transaction_id",
"type": "INTEGER"
},
{
"description": "customer id",
"mode": "NULLABLE",
"name": "customer_id",
"type": "INTEGER"
},
{
"description": "product name",
"mode": "NULLABLE",
"name": "product",
"type": "STRING"
},
{
"description": "product_code",
"mode": "NULLABLE",
"name": "product_code",
"type": "STRING"
},
{
"description": "product color",
"mode": "NULLABLE",
"name": "product_color",
"type": "STRING"
},
{
"description": "product price",
"mode": "NULLABLE",
"name": "product_price",
"type": "FLOAT"
},
{
"description": "quantity purchased",
"mode": "NULLABLE",
"name": "purchase_size",
"type": "INTEGER"
},
{
"description": "purchase price",
"mode": "NULLABLE",
"name": "purchase_price",
"type": "STRING"
},
{
"description": "revenue",
"mode": "NULLABLE",
"name": "revenue",
"type": "FLOAT"
}
]
Three ways of uploading files:
● Direct CSV upload when the file has a header row
● TXT upload where the headers and data types are supplied manually (the
same applies to a CSV without a header row)
● Supplying the schema to change data types while uploading a file that has
headers
Type conversion
PART-1
SELECT
*
FROM
`dulcet-velocity-294320.From_course.customer_purchase`
ORDER BY
CAST(purchase_price AS FLOAT64 ) DESC
PART-2
--FILTERING BY DATE RANGE
/*
SELECT
date,
purchase_price
FROM
`dulcet-velocity-294320.From_course.customer_purchase`
WHERE
date BETWEEN '2020-12-01' and '2020-12-31' */
--CAST Change data types
/*
SELECT
CAST(date as date) as DATE,
purchase_price
FROM
`dulcet-velocity-294320.From_course.customer_purchase`
ORDER BY
CAST(date as date) */
--CONCAT joins strings into a single string
/*
SELECT
CONCAT(product_code,product_color) as unique_color_id
FROM
`dulcet-velocity-294320.From_course.customer_purchase`
WHERE
product = 'couch'*/
--COALESCE() returns the first non-null value
SELECT
COALESCE(product,product_code) as product_info
FROM
`dulcet-velocity-294320.From_course.customer_purchase`
Part 1 & 2
● CAST() change data type
● CONCAT() join 2 string
● COALESCE() from this or that
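A short sketch combining all three on the same customer_purchase table queried
above (column names taken from the schema used for the upload):
SELECT
CAST(date AS DATE) AS purchase_date, -- DATETIME to DATE
CONCAT(product_code, product_color) AS product_variant, -- combine two strings
COALESCE(product, product_code) AS product_info, -- fall back to the code if the name is NULL
CAST(purchase_price AS FLOAT64) AS purchase_price_num -- STRING to FLOAT64
FROM
`dulcet-velocity-294320.From_course.customer_purchase`
ORDER BY
purchase_date;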
Week - 4
Manually cleaning data
Verifying and reporting results
Verification is a process to confirm that a data cleaning effort was well-
executed and the resulting data is accurate and reliable.
A changelog is a file containing a chronologically ordered list of
modifications made to a project.
Cleaning and your data expectations
● Using spreadsheets
● Using SQL
● Big-picture verification (including graphs)
The final step in data cleaning
● Spell check
● Spreadsheet => find and replace
● SQL => CASE
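-- CASE returns the value for the first condition that matches; here it
-- standardizes 'fan' and 'lamps' and leaves other product names unchanged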
SELECT
customer_id,
CASE
WHEN product = 'fan' THEN 'FAN'
WHEN product = 'lamps' THEN 'LAMP'
ELSE product
END AS Dhamu
FROM `dulcet-velocity-294320.From_course.customer_purchase`
Data-cleaning verification: A checklist (a few of these checks are sketched in SQL after the list)
● Sources of errors: Did you use the right tools and functions to find
the source of the errors in your dataset?
● Null data: Did you search for NULLs using conditional formatting and
filters?
● Misspelled words: Did you locate all misspellings?
● Mistyped numbers: Did you double-check that your numeric data has
been entered correctly?
● Extra spaces and characters: Did you remove any extra spaces or
characters using the TRIM function?
● Duplicates: Did you remove duplicates in spreadsheets using the
Remove Duplicates function or DISTINCT in SQL?
● Mismatched data types: Did you check that numeric, date, and string
data are typecast correctly?
● Messy (inconsistent) strings: Did you make sure that all of your
strings are consistent and meaningful?
● Messy (inconsistent) date formats: Did you format the dates
consistently throughout your dataset?
● Misleading variable labels (columns): Did you name your columns
meaningfully?
● Truncated data: Did you check for truncated or missing data that
needs correction?
● Business Logic: Did you check that the data makes sense given your
knowledge of the business?
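A few of the checklist items above expressed as SQL against the same
customer_purchase table (a sketch, not a complete verification):
-- Null data: rows with a missing product name
SELECT COUNT(*) AS null_products
FROM `dulcet-velocity-294320.From_course.customer_purchase`
WHERE product IS NULL;
-- Duplicates: transaction ids that appear more than once
SELECT transaction_id, COUNT(*) AS occurrences
FROM `dulcet-velocity-294320.From_course.customer_purchase`
GROUP BY transaction_id
HAVING COUNT(*) > 1;
-- Extra spaces: values that change after TRIM
SELECT DISTINCT product
FROM `dulcet-velocity-294320.From_course.customer_purchase`
WHERE product != TRIM(product);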
The goal of your project
● Confirm the business problem
● Confirm the goal of the project
● Verify that data can solve the problem and is aligned to the goal
Documenting results and the cleaning process
Capturing cleaning changes
Documentation
Documentation is the process of tracking changes, additions, deletions, and
errors involved in your data-cleaning effort.
Data errors are the crime, data cleaning is gathering evidence, and
documentation is detailing exactly what happened for peer review or
court.
Here is how a version control system affects a change to a query:
1. A company has official versions of important queries in their version
control system.
2. An analyst makes sure the most up-to-date version of the query is the
one they will change. This is called syncing.
3. The analyst makes a change to the query.
4. The analyst might ask someone to review this change. This is called a
code review and can be informally or formally done. An informal review
could be as simple as asking a senior analyst to take a look at the
change.
5. After a reviewer approves the change, the analyst submits the
updated version of the query to a repository in the company's version
control system. This is called a code commit. A best practice is to
document exactly what the change was and why it was made in a
comments area. Going back to our example of a query that pulls daily
revenue, a comment might be: Updated revenue to include revenue
coming from the new product, Calypso.
6. After the change is submitted, everyone else in the company will be
able to access and use this new query when they sync to the most
up-to-date queries stored in the version control system.
7. If the query has a problem or business needs change, the analyst can
undo the change to the query using the version control system. The
analyst can look at a chronological list of all changes made to the query
and who made each change. Then, after finding their own change, the
analyst can revert to the previous version.
8. The query is back to what it was before the analyst made the change.
And everyone at the company sees this reverted, original query, too.
Changelogs are for humans, not machines, so write legibly.
Embrace changelogs
You can keep changelogs in:
● Google Sheets
● Excel
● BigQuery
Typically, a changelog records this type of information:
● Data, file, formula, query, or any other component that changed
● Description of what changed
● Date of the change
● Person who made the change
● Person who approved the change
● Version number
● Reason for the change
Changelog documentation
# Changelog
This file contains the notable changes to the project
Version 1.0.0 (02-23-2019)
## New
- Added column classifiers (Date, Time, PerUnitCost, TotalCost, etc. )
- Added Column “AveCost” to track average item cost
## Changes
- Changed date format to MM-DD-YYYY
- Removal of whitespace (cosmetic)
## Fixes
- Fixed misalignment in Column "TotalCost" where some rows did not
match with correct dates
- Fixed SUM to run over entire column instead of partial
Some of the most common errors involve
● human mistakes like mistyping or misspelling,
● flawed processes like poor design of a survey form, and
● system issues where older systems integrate data incorrectly.
Advanced functions for speedy data cleaning
● QUERY: runs a Google Visualization API query over a range of data
● IMPORTRANGE: imports a range of cells from another spreadsheet
● FILTER: returns a filtered version of a range, keeping only the rows that
meet the given conditions
QUERY
● https://support.google.com/docs/answer/3093343?hl=en
FILTER
● https://support.google.com/docs/answer/3093197?hl=en
IMPORTRANGE
● https://support.google.com/docs/answer/3093340?hl=en#
Week - 5
Understand the elements of a data analyst
resume
CareerCon resources on YouTube
YouTube links
● https://www.youtube.com/playlist?list=PLqFaTIg4myu-npFrYu6cO7h7
AI6bkcOlL
● https://www.youtube.com/watch?v=cBbYhhH399c&list=PLqFaTIg4myu
-npFrYu6cO7h7AI6bkcOlL&index=9
Adding professional skills to your resume
Highlighting experiences on resumes
Adding soft skills to your resume
Quick Review
Week - 1
● Data integrity
● Manage insufficient data
● Statistics
Week - 2
● Spreadsheet
Week - 3
● SQL
Week - 4
● Verification and Cleaning
● Changelog and documentation
● Checklist
Week - 5
● Hiring process
● Resume building
Dhamodharan
14/10/2021