0% found this document useful (0 votes)

85 views20 pages

Session 28 Data Accessing and Cleaning

This dataset contains information on 500 patients with diabetes who participated in a clinical trial comparing an oral insulin drug (Auralin) to injected insulin (Novodra). It includes demographic data on patients, their treatment regimens and outcomes, and any adverse reactions experienced. 350 patients were initially treated with Novodra to establish baseline levels, then 175 were switched to Auralin while 175 continued Novodra for 24 weeks, with HbA1c and insulin dose tracked over time.

Uploaded by

Avantika Saxena

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

85 views20 pages

Session 28 Data Accessing and Cleaning

Uploaded by

Avantika Saxena

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 20

Data Analysis Process

1. Asking Questions
2. Data Wrangling
a. Gathering Data - i. CSV files - ii. APIs - iii. Web Scraping - iv. Databases
b. Assessing Data
c. Cleaning Data
3. Exploratory Data Analysis
4. Drawing Conclusion
5. Comunicating Results

Summary of today's session

• You have to become Sherlock
• Will try to create a framework
• Frameworks may vary
• The goal is to give you idea

1. Data Wrangling
• Data Gathering
• Data Accessing
• Data Cleaning

1b. Data Accessing

In this step, the data is to be understood more deeply. Before implementing methods to clean it,
you will definitely need to have a better idea about what the data is about.

Types of Unclean Data

There are 2 kinds of unclean data
• Dirty Data (Data with Quality issues): Dirty data, also known as low
quality data. Low quality data has content issues.
– Duplicated data
– Missing Data
– Corrupt Data
– Inaccurate Data
• Messy Data (Data with tidiness issues): Messy data, also known as untidy
data. Untidy data has structural issues.Tidy data has the following properties:
– Each variable forms a column
– Each observation forms a row
– Each observational unit forms a table
import pandas as pd
import numpy as np

patients = pd.read_csv('patients.csv')
treatments = pd.read_csv('treatments.csv')
adverse_reactions = pd.read_csv('adverse_reactions.csv')
treatments_cut = pd.read_csv('treatments_cut.csv')

# view datasets
patients.head()

patient_id assigned_sex given_name surname

address \
0 1 female Zoe Wellish 576 Brown Bear
Drive
1 2 female Pamela Hill 2370 University Hill
Road
2 3 male Jae Debord 1493 Poling Farm
Road
3 4 male Liêm Phan 2335 Webster
Street
4 5 male Tim Neudorf 1428 Turkey Pen
Lane

city state zip_code country \

0 Rancho California California 92390.0 United States
1 Armstrong Illinois 61812.0 United States
2 York Nebraska 68467.0 United States
3 Woodbridge NJ 7095.0 United States
4 Dothan AL 36303.0 United States

contact birthdate weight

height bmi
0 951-719-9170ZoeWellish@superrito.com 7/10/1976 121.7
66 19.6
1 PamelaSHill@cuvox.de+1 (217) 569-3204 4/3/1967 118.8
66 19.2
2 402-363-6804JaeMDebord@gustr.com 2/19/1980 177.8
71 24.8
3 PhanBaLiem@jourrapide.com+1 (732) 636-8246 7/26/1951 220.9
70 31.7
4 334-515-7487TimNeudorf@cuvox.de 2/18/1928 192.3
27 26.1

treatments.head()

given_name surname auralin novodra hba1c_start hba1c_end

\
0 veronika jindrová 41u - 48u - 7.63 7.20

1 elliot richardson - 40u - 45u 7.56 7.09

2 yukitaka takenaka - 39u - 36u 7.68 7.25

3 skye gormanston 33u - 36u - 7.97 7.62

4 alissa montez - 33u - 29u 7.78 7.46

hba1c_change
0 NaN
1 0.97
2 NaN
3 0.35
4 0.32

treatments_cut.shape

(70, 7)

adverse_reactions

given_name surname adverse_reaction

0 berta napolitani injection site discomfort
1 lena baer hypoglycemia
2 joseph day hypoglycemia
3 flavia fiorentino cough
4 manouck wubbels throat irritation
5 jasmine sykes hypoglycemia
6 louise johnson hypoglycemia
7 albinca komavec hypoglycemia
8 noe aranda hypoglycemia
9 sofia hermansen injection site discomfort
10 tegan johnson headache
11 abel yonatan cough
12 abdul-nur isa hypoglycemia
13 leon scholz injection site discomfort
14 gabriele saenger hypoglycemia
15 jia li teng nausea
16 jakob jakobsen hypoglycemia
17 christopher woodward nausea
18 ole petersen hypoglycemia
19 finley chandler headache
20 anenechi chidi hypoglycemia
21 miłosław wiśniewski injection site discomfort
22 lixue hsueh injection site discomfort
23 merci leroux hypoglycemia
24 kang mai injection site discomfort
25 elliot richardson hypoglycemia
26 clinton miller throat irritation
27 idalia moore hypoglycemia
28 xiuxiu chang hypoglycemia
29 alex crawford hypoglycemia
30 monika lončar hypoglycemia
31 steven roy headache
32 cecilie nilsen hypoglycemia
33 krisztina magyar hypoglycemia

1. Write a summary for your data

This is a dataset about 500 patients of which 350 patients participated in a clinical trial. None of
the patients were using Novodra (a popular injectable insulin) or Auralin (the oral insulin being
researched) as their primary source of insulin before. All were experiencing elevated HbA1c
levels.

All 350 patients were treated with Novodra to establish a baseline HbA1c level and insulin dose.
After 4 weeks, which isn’t enough time to capture all the change in HbA1c that can be attributed
by the switch to Auralin or Novodra:

• 175 patients switched to Auralin for 24 weeks

• 175 patients continued using Novodra for 24 weeks

Data about patients feeling some adverse effects is also recorded.

2. Write Column descriptions

Table -> patients:
• patient_id: the unique identifier for each patient in the Master Patient Index (i.e.
patient database) of the pharmaceutical company that is producing Auralin
• assigned_sex: the assigned sex of each patient at birth (male or female)
• given_name: the given name (i.e. first name) of each patient
• surname: the surname (i.e. last name) of each patient
• address: the main address for each patient
• city: the corresponding city for the main address of each patient
• state: the corresponding state for the main address of each patient
• zip_code: the corresponding zip code for the main address of each patient
• country: the corresponding country for the main address of each patient (all United
states for this clinical trial)
• contact: phone number and email information for each patient
• birthdate: the date of birth of each patient (month/day/year). The inclusion criteria for
this clinical trial is age >= 18 (there is no maximum age because diabetes is a growing
problem among the elderly population)
• weight: the weight of each patient in pounds (lbs)
• height: the height of each patient in inches (in)
• bmi: the Body Mass Index (BMI) of each patient. BMI is a simple calculation using a
person's height and weight. The formula is BMI = kg/m2 where kg is a person's weight in
kilograms and m2 is their height in metres squared. A BMI of 25.0 or more is overweight,
while the healthy range is 18.5 to 24.9. The inclusion criteria for this clinical trial is 16 >=
BMI >= 38.

Table -> treatments and treatment_cut:

• given_name: the given name of each patient in the Master Patient Index that took part
in the clinical trial
• surname: the surname of each patient in the Master Patient Index that took part in the
clinical trial
• auralin: the baseline median daily dose of insulin from the week prior to switching to
Auralin (the number before the dash) and the ending median daily dose of insulin at the
end of the 24 weeks of treatment measured over the 24th week of treatment (the
number after the dash). Both are measured in units (shortform 'u'), which is the
international unit of measurement and the standard measurement for insulin.
• novodra: same as above, except for patients that continued treatment with Novodra
• hba1c_start: the patient's HbA1c level at the beginning of the first week of treatment.
HbA1c stands for Hemoglobin A1c. The HbA1c test measures what the average blood
sugar has been over the past three months. It is thus a powerful way to get an overall
sense of how well diabetes has been controlled. Everyone with diabetes should have this
test 2 to 4 times per year. Measured in %.
• hba1c_end: the patient's HbA1c level at the end of the last week of treatment
• hba1c_change: the change in the patient's HbA1c level from the start of treatment to
the end, i.e., hba1c_start - hba1c_end. For Auralin to be deemed effective, it must be
"noninferior" to Novodra, the current standard for insulin. This "noninferiority" is
statistically defined as the upper bound of the 95% confidence interval being less than
0.4% for the difference between the mean HbA1c changes for Novodra and Auralin (i.e.
Novodra minus Auralin).

Table -> adverse_reactions

• given_name: the given name of each patient in the Master Patient Index that took part
in the clinical trial and had an adverse reaction (includes both patients treated Auralin
and Novodra)
• surname: the surname of each patient in the Master Patient Index that took part in the
clinical trial and had an adverse reaction (includes both patients treated Auralin and
Novodra)
• adverse_reaction: the adverse reaction reported by the patient

3. Add any additional information

Additional useful information:

• Insulin resistance varies person to person, which is why both starting median daily dose
and ending median daily dose are required, i.e., to calculate change in dose.
• It is important to test drugs and medical products in the people they are meant to help.
People of different age, race, sex, and ethnic group must be included in clinical trials. This
diversity is reflected in the patients table.

Types of Assessment
There are 2 types of assessment styles

• Manual - Looking through the data manually in google sheets

• Programmatic - By using pandas functions such as info(), describe() or sample()

Steps in Assessment
There are 2 steps involved in Assessment

• Discover
• Document
# export data for manual assessment

with pd.ExcelWriter('clinical_trials.xlsx') as writer:

patients.to_excel(writer,sheet_name='patients')
treatments.to_excel(writer,sheet_name='treatments')
treatment_cut.to_excel(writer,sheet_name='treatment_cut')
adverse_reactions.to_excel(writer,sheet_name='adverse_reactions')

Issues with the dataset

1. Dirty Data

Table - Patients

- patient_id = 9 has misspelled name 'Dsvid' instead of David

àccuracy`
- state col sometimes contain full name and some times abbrivietation
`consistency`
- zip code col has entries with 4 digit `validity`
- data missing for 12 patients in address,city, state,zip_code
,country, contact `completion`
- incorrect data type assigned to sex, zip code, birthdate `validity`
- duplicate entries by the name of John Doe àccuracy`
- one patient has weight = 48 pounds àccuracy`
- one patient has height = 27 inches àccuracy`

Table - Treatments & Treatments_cut

- given_name and surname col is is all lower case `consistency`

- remove u from Auralin and Novadra cols `validity`
- '-' in novadra and Auralin col treated as nan `validity`
- missing values in hba1c_change col `completion`
- 1 duplicate entry by the name Joseph day `accuracy`
- in hba1c_change 9 instead of 4 `accuracy`

Table - Adverse_reactions

- given_name and surname are all in lower case `consistency`

1. Messy Data

Table - Patients

- contact col contains both phone and email

Table - Treatments & Treatments_cut

- Auralin and Novadra col should be split into 2 cols start and end
dose
- merge both the tables

Table - Adverse_reactions

- This table should not exist independently

Automatic Assessment
• head and tail
• sample
• info
• isnull
• duplicated
• describe
adverse_reactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34 entries, 0 to 33
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 given_name 34 non-null object
1 surname 34 non-null object
2 adverse_reaction 34 non-null object
dtypes: object(3)
memory usage: 944.0+ bytes

patients[patients['address'].isnull()]

patient_id assigned_sex given_name surname address city

state \
209 210 female Lalita Eldarkhanov NaN NaN
NaN
219 220 male Mỹ Quynh NaN NaN
NaN
230 231 female Elisabeth Knudsen NaN NaN
NaN
234 235 female Martina Tománková NaN NaN
NaN
242 243 male John O'Brian NaN NaN
NaN
249 250 male Benjamin Mehler NaN NaN
NaN
257 258 male Jin Kung NaN NaN
NaN
264 265 female Wafiyyah Asfour NaN NaN
NaN
269 270 female Flavia Fiorentino NaN NaN
NaN
278 279 female Generosa Cabán NaN NaN
NaN
286 287 male Lewis Webb NaN NaN
NaN
296 297 female Chỉ Lâm NaN NaN
NaN

zip_code country contact birthdate weight height bmi

209 NaN NaN NaN 8/14/1950 143.4 62 26.2
219 NaN NaN NaN 4/9/1978 237.8 69 35.1
230 NaN NaN NaN 9/23/1976 165.9 63 29.4
234 NaN NaN NaN 4/7/1936 199.5 65 33.2
242 NaN NaN NaN 2/25/1957 205.3 74 26.4
249 NaN NaN NaN 10/30/1951 146.5 69 21.6
257 NaN NaN NaN 5/17/1995 231.7 69 34.2
264 NaN NaN NaN 11/3/1989 158.6 63 28.1
269 NaN NaN NaN 10/9/1937 175.2 61 33.1
278 NaN NaN NaN 12/16/1962 124.3 69 18.4
286 NaN NaN NaN 4/1/1979 155.3 68 23.6
296 NaN NaN NaN 5/14/1990 181.1 63 32.1
treatments[treatments.duplicated()]

given_name surname auralin novodra hba1c_start hba1c_end \

136 joseph day 29u - 36u - 7.7 7.19

hba1c_change
136 NaN

treatments[treatments.duplicated(subset=['given_name' ,'surname'])]

given_name surname auralin novodra hba1c_start hba1c_end \

136 joseph day 29u - 36u - 7.7 7.19

hba1c_change
136 NaN

treatments_cut[treatments_cut.duplicated(subset=['given_name' ,'sur
name'])]

Empty DataFrame
Columns: [given_name, surname, auralin, novodra, hba1c_start,
hba1c_end, hba1c_change]
Index: []

adverse_reactions.duplicated().sum()

patients.describe()

patient_id zip_code weight height bmi

count 503.000000 491.000000 503.000000 503.000000 503.000000
mean 252.000000 49084.118126 173.434990 66.634195 27.483897
std 145.347859 30265.807442 33.916741 4.411297 5.276438
min 1.000000 1002.000000 48.800000 27.000000 17.100000
25% 126.500000 21920.500000 149.300000 63.000000 23.300000
50% 252.000000 48057.000000 175.300000 67.000000 27.200000
75% 377.500000 75679.000000 199.500000 70.000000 31.750000
max 503.000000 99701.000000 255.900000 79.000000 37.700000

patients[patients['height'] == 27]

patient_id assigned_sex given_name surname address

city \
4 5 male Tim Neudorf 1428 Turkey Pen Lane
Dothan

state zip_code country contact

birthdate \
4 AL 36303.0 United States 334-515-7487TimNeudorf@cuvox.de
2/18/1928
weight height bmi
4 192.3 27 26.1

treatments_cut.describe()

hba1c_start hba1c_end hba1c_change

count 70.000000 70.000000 42.000000
mean 7.838000 7.443143 0.518810
std 0.423007 0.418706 0.270719
min 7.510000 7.020000 0.280000
25% 7.640000 7.232500 0.340000
50% 7.730000 7.345000 0.370000
75% 7.860000 7.467500 0.907500
max 9.910000 9.460000 0.970000

treatments.sort_values('hba1c_change',na_position='first')

given_name surname auralin novodra hba1c_start

hba1c_end \
0 veronika jindrová 41u - 48u - 7.63
7.20
2 yukitaka takenaka - 39u - 36u 7.68
7.25
8 saber ménard - 54u - 54u 8.08
7.70
9 asia woźniak 30u - 36u - 7.76
7.37
10 joseph day 29u - 36u - 7.70
7.19
.. ... ... ... ... ...
...
49 jackson addison - 42u - 42u 7.99
7.51
17 gina cain - 36u - 36u 7.88
7.40
32 laura ehrlichmann - 43u - 40u 7.95
7.46
245 wu sung - 47u - 48u 7.61
7.12
138 giovana rocha - 23u - 21u 7.87
7.38

hba1c_change
0 NaN
2 NaN
8 NaN
9 NaN
10 NaN
.. ...
49 0.98
17 0.98
32 0.99
245 0.99
138 0.99

[280 rows x 7 columns]

Note - Assessing Data is an Iterative Process

Data Quality Dimensions
• Completeness -> is data missing?
• Validity -> is data invalid -> negative height -> duplicate patient id
• Accuracy -> data is valid but not accurate -> weight -> 1kg
• Consistency -> both valid and accurate but written differently -> New Youk and NY

Order of severity
Completeness <- Validity <- Accuracy <- Consistency

Data Cleaning Order

1. Quality -> Completeness
2. Tidiness
3. Quality -> Validity
4. Quality -> Accuracy
5. Quality -> Consistency

Steps involved in Data cleaning

• Define
• Code
• Test
Always make sure to create a copy of your pandas dataframe before you
start the cleaning process

patients_df = patients.copy()
treatments_df = treatments.copy()
treatments_cut_df = treatments_cut.copy()
adverse_reactions_df = adverse_reactions.copy()

Define
• replace all missing values of patients df with no data
• sub hba1c_start from hba1c_end to get all the change values
• in patients table we will use regex to separate email and phone
# code
patients_df.fillna('No data',inplace=True)
# test
patients_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 503 entries, 0 to 502
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 patient_id 503 non-null int64
1 assigned_sex 503 non-null object
2 given_name 503 non-null object
3 surname 503 non-null object
4 address 503 non-null object
5 city 503 non-null object
6 state 503 non-null object
7 zip_code 503 non-null object
8 country 503 non-null object
9 contact 503 non-null object
10 birthdate 503 non-null object
11 weight 503 non-null float64
12 height 503 non-null int64
13 bmi 503 non-null float64
dtypes: float64(2), int64(2), object(10)
memory usage: 55.1+ KB

treatments.head()

given_name surname auralin novodra hba1c_start hba1c_end

\
0 veronika jindrová 41u - 48u - 7.63 7.20

1 elliot richardson - 40u - 45u 7.56 7.09

2 yukitaka takenaka - 39u - 36u 7.68 7.25

3 skye gormanston 33u - 36u - 7.97 7.62

4 alissa montez - 33u - 29u 7.78 7.46

hba1c_change
0 NaN
1 0.97
2 NaN
3 0.35
4 0.32

# code
treatments_df['hba1c_change'] = treatments_df['hba1c_start'] -
treatments_df['hba1c_end']
treatments_cut_df['hba1c_change'] = treatments_cut_df['hba1c_start'] -
treatments_cut_df['hba1c_end']

# test
treatments_cut_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70 entries, 0 to 69
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 given_name 70 non-null object
1 surname 70 non-null object
2 auralin 70 non-null object
3 novodra 70 non-null object
4 hba1c_start 70 non-null float64
5 hba1c_end 70 non-null float64
6 hba1c_change 70 non-null float64
dtypes: float64(3), object(4)
memory usage: 4.0+ KB

patients.head()

patient_id assigned_sex given_name surname

city state zip_code country \

contact birthdate weight

def find_contact_details(text: str) -> tuple:

# it the value is NaN, then return it
if pd.isna(text):
return np.nan

# create the phone number pattern

phone_number_pattern = re.compile(r"(\+[\d]{1,3}\s)?(\(?[\d]
{3}\)?\s?-?[\d]{3}\s?-?[\d]{4})")
# find the phone number from the value/text, as a result we will
get a list
phone_number = re.findall(phone_number_pattern, text)

# if length is 0, then the regex can't find any ph number, then

define with NaN
if len(phone_number) <= 0:
phone_number = np.nan
# if the country code is attached with the ph number, for that
case, the first
# element will be the country code and the 2nd element will be the
actual ph
# number. So, get that ph number
elif len(phone_number) >= 2:
phone_number = phone_number[1]
# else, we will get the ph number. Grab it.
else:
phone_number = phone_number[0]

# if we found the ph number (with/without country code), then

remove that part from the actual value.
# after removing the ph number, the remaining string might be the
email address.
possible_email_add = re.sub(phone_number_pattern, "",
text).strip()

# then return the ph number and the email address

return phone_number, possible_email_add

patients_df['phone'] = patients_df["contact"].apply(lambda x:
find_contact_details(x)).apply(lambda x:x[0])
patients_df['email'] = patients_df["contact"].apply(lambda x:
find_contact_details(x)).apply(lambda x:x[1])

patients_df.drop(columns='contact',inplace=True)
import re

df = pd.DataFrame(columns=['phone', 'email'])

# Iterate through the data

for item in patients['contact']:
# Use regular expressions to find the phone numbers and email
addresses
phone_match = re.search(r'(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\
d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})', item)
email_match = re.search(r'([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-
Z0-9-.]+)', item)
phone = phone_match.group(0) if phone_match else None
email = email_match.group(1) if email_match else None
df = df.append({'phone': phone, 'email': email},
ignore_index=True)

print(df)

phone email
0 951-719-9170 951-719-9170ZoeWellish@superrito.com
1 (217) 569-3204 PamelaSHill@cuvox.de
2 402-363-6804 402-363-6804JaeMDebord@gustr.com
3 (732) 636-8246 PhanBaLiem@jourrapide.com
4 334-515-7487 334-515-7487TimNeudorf@cuvox.de
.. ... ...
498 207-477-0579 207-477-0579MustafaLindstrom@jourrapide.com
499 928-284-4492 928-284-4492RumanBisliev@gustr.com
500 816-223-6007 816-223-6007JinkedeKeizer@teleworm.us
501 360 443 2060 ChidaluOnyekaozulu@jourrapide.com1
502 402-848-4923 PatrickGersten@rhyta.com402-848-4923

[503 rows x 2 columns]

treatments_df = pd.concat([treatments_df,treatments_cut_df])

treatments_df = treatments_df.melt(id_vars=['given_name',
'surname' ,'hba1c_start',
'hba1c_end','hba1c_change'],var_name='type',value_name='dosage_range')

treatments_df = treatments_df[treatments_df['dosage_range'] != '-']

treatments_df['dosage_start'] =
treatments_df['dosage_range'].str.split('-').str.get(0)
treatments_df['dosage_end'] =
treatments_df['dosage_range'].str.split('-').str.get(1)
<ipython-input-171-19cff1a047dc>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation:

https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#
returning-a-view-versus-a-copy
treatments_df['dosage_start'] =
treatments_df['dosage_range'].str.split('-').str.get(0)
<ipython-input-171-19cff1a047dc>:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation:

https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#
returning-a-view-versus-a-copy
treatments_df['dosage_end'] =
treatments_df['dosage_range'].str.split('-').str.get(1)

treatments_df.drop(columns='dosage_range',inplace=True)

/usr/local/lib/python3.8/dist-packages/pandas/core/frame.py:4906:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation:

https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#
returning-a-view-versus-a-copy
return super().drop(

treatments_df['dosage_start'] =
treatments_df['dosage_start'].str.replace('u','')
treatments_df['dosage_end'] =
treatments_df['dosage_end'].str.replace('u','')

<ipython-input-176-71418162ebe6>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation:

https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#
returning-a-view-versus-a-copy
treatments_df['dosage_start'] =
treatments_df['dosage_start'].str.replace('u','')
<ipython-input-176-71418162ebe6>:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation:

https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#
returning-a-view-versus-a-copy
treatments_df['dosage_end'] =
treatments_df['dosage_end'].str.replace('u','')

treatments_df['dosage_start'] =
treatments_df['dosage_start'].astype('int')
treatments_df['dosage_end'] =
treatments_df['dosage_end'].astype('int')

<ipython-input-179-c144854612b4>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation:

https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#
returning-a-view-versus-a-copy
treatments_df['dosage_start'] =
treatments_df['dosage_start'].astype('int')
<ipython-input-179-c144854612b4>:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation:

https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#
returning-a-view-versus-a-copy
treatments_df['dosage_end'] =
treatments_df['dosage_end'].astype('int')

treatments_df

given_name surname hba1c_start hba1c_end hba1c_change

type \
0 veronika jindrová 7.63 7.20 0.43
auralin
3 skye gormanston 7.97 7.62 0.35
auralin
6 sophia haugen 7.65 7.27 0.38
auralin
7 eddie archer 7.89 7.55 0.34
auralin
9 asia woźniak 7.76 7.37 0.39
auralin
.. ... ... ... ... ...
...
688 christopher woodward 7.51 7.06 0.45
novodra
690 maret sultygov 7.67 7.30 0.37
novodra
694 lixue hsueh 9.21 8.80 0.41
novodra
696 jakob jakobsen 7.96 7.51 0.45
novodra
698 berta napolitani 7.68 7.21 0.47
novodra

dosage_start dosage_end
0 41 48
3 33 36
6 37 42
7 31 38
9 30 36
.. ... ...
688 55 51
690 26 23
694 22 23
696 28 26
698 42 44

[350 rows x 8 columns]

treatments_df = treatments_df.merge(adverse_reactions_df, how ='left',

on=['given_name','surname'])

treatments_df

given_name surname hba1c_start hba1c_end hba1c_change

type \
0 veronika jindrová 7.63 7.20 0.43
auralin
1 skye gormanston 7.97 7.62 0.35
auralin
2 sophia haugen 7.65 7.27 0.38
auralin
3 eddie archer 7.89 7.55 0.34
auralin
4 asia woźniak 7.76 7.37 0.39
auralin
.. ... ... ... ... ...
...
345 christopher woodward 7.51 7.06 0.45
novodra
346 maret sultygov 7.67 7.30 0.37
novodra
347 lixue hsueh 9.21 8.80 0.41
novodra
348 jakob jakobsen 7.96 7.51 0.45
novodra
349 berta napolitani 7.68 7.21 0.47
novodra

dosage_start dosage_end adverse_reaction

0 41 48 NaN
1 33 36 NaN
2 37 42 NaN
3 31 38 NaN
4 30 36 NaN
.. ... ... ...
345 55 51 nausea
346 26 23 NaN
347 22 23 injection site discomfort
348 28 26 hypoglycemia
349 42 44 injection site discomfort

[350 rows x 9 columns]

Diabetes Management Survey in Namakkal
No ratings yet
Diabetes Management Survey in Namakkal
2 pages
Gargi Das, Community Engagement
No ratings yet
Gargi Das, Community Engagement
12 pages
Diabetes Super Final Raw Data
No ratings yet
Diabetes Super Final Raw Data
11 pages
NCDControlStatus 302 2025
No ratings yet
NCDControlStatus 302 2025
34 pages
Case Presentation On Poor Glycemic Control
No ratings yet
Case Presentation On Poor Glycemic Control
24 pages
Case Presentation On Diabetes Mellitus
No ratings yet
Case Presentation On Diabetes Mellitus
25 pages
Research Paper Covid 19 Part 2
No ratings yet
Research Paper Covid 19 Part 2
9 pages
Nursing Case Study: Type II Diabetes
No ratings yet
Nursing Case Study: Type II Diabetes
37 pages
Direct Study (SHAH, Noor Ul Ain's Copy)
No ratings yet
Direct Study (SHAH, Noor Ul Ain's Copy)
36 pages
Uncontrolled Type 2 DM
No ratings yet
Uncontrolled Type 2 DM
29 pages
The Diabetic Patient
No ratings yet
The Diabetic Patient
36 pages
Case Study
No ratings yet
Case Study
35 pages
Diabetes Mellitus C
No ratings yet
Diabetes Mellitus C
48 pages
The Diabetic Patient
No ratings yet
The Diabetic Patient
34 pages
Empagliflozin Between Evidence & Practice
No ratings yet
Empagliflozin Between Evidence & Practice
87 pages
Kasus 2 Dan 3
No ratings yet
Kasus 2 Dan 3
3 pages
FNCP Sample
No ratings yet
FNCP Sample
19 pages
Diabetes
No ratings yet
Diabetes
30 pages
Therapeutics Diabetes Day 1 TBL Cases Fall 2024 - Therapeutics Pharmacology
No ratings yet
Therapeutics Diabetes Day 1 TBL Cases Fall 2024 - Therapeutics Pharmacology
7 pages
Diabetes Hospitalization Causes Study
No ratings yet
Diabetes Hospitalization Causes Study
81 pages
Defense Miller
No ratings yet
Defense Miller
35 pages
Adime PT 2
No ratings yet
Adime PT 2
8 pages
Type 2 Diabetes and Asthma Case Study
No ratings yet
Type 2 Diabetes and Asthma Case Study
15 pages
Department of Pharmacy Practice
No ratings yet
Department of Pharmacy Practice
2 pages
Chapter-5 CVDSP2020
No ratings yet
Chapter-5 CVDSP2020
36 pages
Basic Concept of Diabetes Mellitus - Kuliah Mhswa 2013
100% (1)
Basic Concept of Diabetes Mellitus - Kuliah Mhswa 2013
33 pages
Dmtac Case Presentation Azian
No ratings yet
Dmtac Case Presentation Azian
26 pages
QR Management of Type 2 Diabetes Mellitus (5th Edition) PDF
100% (1)
QR Management of Type 2 Diabetes Mellitus (5th Edition) PDF
8 pages
QR Management of Type 2 Diabetes Mellitus (5th Edition)
No ratings yet
QR Management of Type 2 Diabetes Mellitus (5th Edition)
8 pages
6.01 - Diabetes Mellitus EDITED 4.22.2018
No ratings yet
6.01 - Diabetes Mellitus EDITED 4.22.2018
30 pages
OUTPUT
No ratings yet
OUTPUT
375 pages
Nutrition Case Analysis: Adult Obesity
No ratings yet
Nutrition Case Analysis: Adult Obesity
25 pages
Diabetes-Obesity Management - Recording - ESOedits - FINAL
No ratings yet
Diabetes-Obesity Management - Recording - ESOedits - FINAL
57 pages
Managing Diabetes: Case Studies
No ratings yet
Managing Diabetes: Case Studies
7 pages
Cole Kelby Nutrition Database Form 042018 4
100% (1)
Cole Kelby Nutrition Database Form 042018 4
11 pages
Present Therapies of Type 2 Diabetes Mellitus: ACP Annual Session MTP 057&058 San Francisco, CA April 15-16, 2005
No ratings yet
Present Therapies of Type 2 Diabetes Mellitus: ACP Annual Session MTP 057&058 San Francisco, CA April 15-16, 2005
127 pages
T2 DM Updates - 2025
No ratings yet
T2 DM Updates - 2025
35 pages
4 6030846665903053164.sav
No ratings yet
4 6030846665903053164.sav
54 pages
8 Diabetes
No ratings yet
8 Diabetes
77 pages
Process Paper
No ratings yet
Process Paper
18 pages
DM Medical Nutritional Therapy
No ratings yet
DM Medical Nutritional Therapy
5 pages
Chapter 91 - Diabetes Mellitus
No ratings yet
Chapter 91 - Diabetes Mellitus
75 pages
Glyxambi Endo
No ratings yet
Glyxambi Endo
33 pages
Table 1 & 2
No ratings yet
Table 1 & 2
7 pages
Center For Diabetes Care
No ratings yet
Center For Diabetes Care
8 pages
Diabetes Mellitus
No ratings yet
Diabetes Mellitus
80 pages
Diabetes Mellitus: Salient Features of Type 1 Am D Type 2 DM
100% (1)
Diabetes Mellitus: Salient Features of Type 1 Am D Type 2 DM
20 pages
Cleeeean
No ratings yet
Cleeeean
33 pages
History of Present Illness
No ratings yet
History of Present Illness
4 pages
ADA Standards of Care For Diabetes 2013
No ratings yet
ADA Standards of Care For Diabetes 2013
4 pages
DM - Case Pres 1a
No ratings yet
DM - Case Pres 1a
59 pages
t2dm Case Study
No ratings yet
t2dm Case Study
17 pages
Case Studies Diabetes and Complications 003 002
No ratings yet
Case Studies Diabetes and Complications 003 002
57 pages
Uncontrolled Blood Sugar NCP
No ratings yet
Uncontrolled Blood Sugar NCP
4 pages
Diabetes 1
No ratings yet
Diabetes 1
56 pages
RMUPPC Module 1 Diabetes, Nutrition, Lab Results
No ratings yet
RMUPPC Module 1 Diabetes, Nutrition, Lab Results
6 pages
Chronic Diabetic Complications Study
No ratings yet
Chronic Diabetic Complications Study
11 pages
MS Case
No ratings yet
MS Case
5 pages
CV - Avantika Saxena - Data Privacy..
No ratings yet
CV - Avantika Saxena - Data Privacy..
2 pages
Jaipuria Institute of Management, Vineet Khand, Gomti Nagar Lucknow - 226 010
No ratings yet
Jaipuria Institute of Management, Vineet Khand, Gomti Nagar Lucknow - 226 010
6 pages
DPDP
No ratings yet
DPDP
27 pages
Background of The Study: People Will Prefer in New Product Line
No ratings yet
Background of The Study: People Will Prefer in New Product Line
3 pages
Webchutney: Digital Marketing Agencies in India
No ratings yet
Webchutney: Digital Marketing Agencies in India
3 pages
Overview of CRM Practices in Banking Industry
No ratings yet
Overview of CRM Practices in Banking Industry
8 pages
Accounts Group Project 2
No ratings yet
Accounts Group Project 2
4 pages
Jaipuria Institute of Management, Vineetkhand, Gomti Nagar Lucknow - 226 010
No ratings yet
Jaipuria Institute of Management, Vineetkhand, Gomti Nagar Lucknow - 226 010
10 pages
Section 80C Tax Deduction
No ratings yet
Section 80C Tax Deduction
1 page
Noise Report by Sanyam
No ratings yet
Noise Report by Sanyam
16 pages
Microfinance Competition in India
No ratings yet
Microfinance Competition in India
4 pages
Jaipuria Institute of Management, Vineet Khand, Gomti Nagar Lucknow - 226 010
No ratings yet
Jaipuria Institute of Management, Vineet Khand, Gomti Nagar Lucknow - 226 010
1 page
Vayner X
100% (1)
Vayner X
6 pages
Indian Banking: Tech & CRM Trends
No ratings yet
Indian Banking: Tech & CRM Trends
4 pages
As at 30st June 2016: Equity and Lailblity
No ratings yet
As at 30st June 2016: Equity and Lailblity
15 pages
Viva Schedule
No ratings yet
Viva Schedule
2 pages
SCM Group File
No ratings yet
SCM Group File
21 pages
Vivek Rai JL19PG181: Research Paper 1
No ratings yet
Vivek Rai JL19PG181: Research Paper 1
2 pages
Sonal Tiwari - (NTCC REPORT) - 1
No ratings yet
Sonal Tiwari - (NTCC REPORT) - 1
31 pages
Internship Offer With Wrappers India Online: Date: - 15/06/2020
No ratings yet
Internship Offer With Wrappers India Online: Date: - 15/06/2020
4 pages
Africa Crystal Lighting Market - Avantika
No ratings yet
Africa Crystal Lighting Market - Avantika
1 page
Avantika Saxena: Emai:avantika - Saxena.21l@jaipuria - Ac.in Mobile:8924870569 Date of Birth: 2nd March.1997
No ratings yet
Avantika Saxena: Emai:avantika - Saxena.21l@jaipuria - Ac.in Mobile:8924870569 Date of Birth: 2nd March.1997
1 page
Principles of Management: Rational Decision-Maker
No ratings yet
Principles of Management: Rational Decision-Maker
2 pages
Hul & P&G
No ratings yet
Hul & P&G
16 pages
Contingency Plan: TITLE-To Study The Challenges and Changing Business Models in CPG Industry
No ratings yet
Contingency Plan: TITLE-To Study The Challenges and Changing Business Models in CPG Industry
2 pages
Asia Pacific Crystal Lighting Market-Avantika
No ratings yet
Asia Pacific Crystal Lighting Market-Avantika
1 page
AMC Social Media
No ratings yet
AMC Social Media
17 pages
A Level Mathematics Paper 1 Equations Involving Indices Logarithms & Others
No ratings yet
A Level Mathematics Paper 1 Equations Involving Indices Logarithms & Others
10 pages
WWG - Maiden of The High Seas - Props - Instructions
No ratings yet
WWG - Maiden of The High Seas - Props - Instructions
22 pages
Cambridge English Advanced Result Workbook - 2014 - 120p No Key
No ratings yet
Cambridge English Advanced Result Workbook - 2014 - 120p No Key
96 pages
Marketing Students' Food Delivery Satisfaction
No ratings yet
Marketing Students' Food Delivery Satisfaction
40 pages
Yuken India Limited: 2024 Research Report
No ratings yet
Yuken India Limited: 2024 Research Report
9 pages
Video Teaching Boosts Patient Safety
No ratings yet
Video Teaching Boosts Patient Safety
7 pages
JMP Amoroto G1
No ratings yet
JMP Amoroto G1
30 pages
Medical Biller Practice Test, Medical Billing Practice Test
No ratings yet
Medical Biller Practice Test, Medical Billing Practice Test
7 pages
Princeton Chromatography SFC & HPLC Solutions
No ratings yet
Princeton Chromatography SFC & HPLC Solutions
20 pages
BoardingCard 345631232 VNO BVA
No ratings yet
BoardingCard 345631232 VNO BVA
1 page
CROUZET - M3 - FBD Connection Guide
No ratings yet
CROUZET - M3 - FBD Connection Guide
4 pages
Reid Hoffman On Musk, AI and The Future of Humanity
No ratings yet
Reid Hoffman On Musk, AI and The Future of Humanity
16 pages
Connect Diag SUR-RON - 105V - Brake - v08
No ratings yet
Connect Diag SUR-RON - 105V - Brake - v08
1 page
Led (0603) R
No ratings yet
Led (0603) R
11 pages
ImagePROGRAF TM Series Brochure 200
No ratings yet
ImagePROGRAF TM Series Brochure 200
4 pages
Patrick Nguyen - Key Selection Criteria
No ratings yet
Patrick Nguyen - Key Selection Criteria
2 pages
Medical Speech Conversion Guide
No ratings yet
Medical Speech Conversion Guide
3 pages
Underwater Data Logger
No ratings yet
Underwater Data Logger
11 pages
Metal Fracture Mechanisms
No ratings yet
Metal Fracture Mechanisms
10 pages
Slide of Chapter 1
No ratings yet
Slide of Chapter 1
29 pages
Ultrasonic Test Set Manual 42A12D
No ratings yet
Ultrasonic Test Set Manual 42A12D
7 pages
Future Literacy 30-1 Vocab Practice
No ratings yet
Future Literacy 30-1 Vocab Practice
128 pages
Paul Arden
100% (9)
Paul Arden
7 pages
02 Second Law
No ratings yet
02 Second Law
61 pages
IS208 PROFESSIONAL ISSUES IN INFORMATION SYSTEMS Revised
67% (3)
IS208 PROFESSIONAL ISSUES IN INFORMATION SYSTEMS Revised
2 pages
Belimo NMV-D3-MFT
No ratings yet
Belimo NMV-D3-MFT
10 pages
History 2ND Year
100% (1)
History 2ND Year
4 pages
CV
No ratings yet
CV
5 pages
Tests ICT 2024-2025 1 Семестр (Common)
No ratings yet
Tests ICT 2024-2025 1 Семестр (Common)
307 pages
IJMSE
No ratings yet
IJMSE
13 pages

Session 28 Data Accessing and Cleaning

Uploaded by

Session 28 Data Accessing and Cleaning

Uploaded by

Data Analysis Process

Summary of today's session

1b. Data Accessing

Types of Unclean Data

patient_id assigned_sex given_name surname

city state zip_code country \

contact birthdate weight

given_name surname auralin novodra hba1c_start hba1c_end

1 elliot richardson - 40u - 45u 7.56 7.09

2 yukitaka takenaka - 39u - 36u 7.68 7.25

3 skye gormanston 33u - 36u - 7.97 7.62

4 alissa montez - 33u - 29u 7.78 7.46

given_name surname adverse_reaction

1. Write a summary for your data

• 175 patients switched to Auralin for 24 weeks

Data about patients feeling some adverse effects is also recorded.

2. Write Column descriptions

Table -> treatments and treatment_cut:

Table -> adverse_reactions

3. Add any additional information

• Manual - Looking through the data manually in google sheets

with pd.ExcelWriter('clinical_trials.xlsx') as writer:

Issues with the dataset

- patient_id = 9 has misspelled name 'Dsvid' instead of David

Table - Treatments & Treatments_cut

- given_name and surname col is is all lower case `consistency`

- given_name and surname are all in lower case `consistency`

- contact col contains both phone and email

Table - Treatments & Treatments_cut

- This table should not exist independently

patient_id assigned_sex given_name surname address city

zip_code country contact birthdate weight height bmi

given_name surname auralin novodra hba1c_start hba1c_end \

given_name surname auralin novodra hba1c_start hba1c_end \

patient_id zip_code weight height bmi

patient_id assigned_sex given_name surname address

state zip_code country contact

hba1c_start hba1c_end hba1c_change

given_name surname auralin novodra hba1c_start

[280 rows x 7 columns]

Note - Assessing Data is an Iterative Process

Data Cleaning Order

Steps involved in Data cleaning

given_name surname auralin novodra hba1c_start hba1c_end

1 elliot richardson - 40u - 45u 7.56 7.09

2 yukitaka takenaka - 39u - 36u 7.68 7.25

3 skye gormanston 33u - 36u - 7.97 7.62

4 alissa montez - 33u - 29u 7.78 7.46

patient_id assigned_sex given_name surname

city state zip_code country \

contact birthdate weight

def find_contact_details(text: str) -> tuple:

# create the phone number pattern

# if length is 0, then the regex can't find any ph number, then

# if we found the ph number (with/without country code), then

# then return the ph number and the email address

# Iterate through the data

[503 rows x 2 columns]

treatments_df = treatments_df[treatments_df['dosage_range'] != '-']

See the caveats in the documentation:

See the caveats in the documentation:

See the caveats in the documentation:

See the caveats in the documentation:

See the caveats in the documentation:

See the caveats in the documentation:

See the caveats in the documentation:

given_name surname hba1c_start hba1c_end hba1c_change

[350 rows x 8 columns]

treatments_df = treatments_df.merge(adverse_reactions_df, how ='left',

given_name surname hba1c_start hba1c_end hba1c_change

dosage_start dosage_end adverse_reaction

[350 rows x 9 columns]

You might also like