NATIONAL BENEFICIARY DATA TRAINING FOR 2016
April 28-May 7, 2016
The Cloud 9 Hotel, Antipolo, Rizal
At the end of the module, participants
will:
Appreciate the power of STATA as
another method to undertake data
management
Be able to perform data profiling and
data analysis in STATA
Be able to produce graphs and tables
for reporting using STATA
Be able to use STATA for data
management
Getting started with STATA
Data profiling using STATA
Data analysis using STATA
Data reporting using STATA
Workshop on data profiling,
analysis and reporting using
STATA and P1 2016 approved
updates
STATA
A statistical software used in
data management and
analysis What is STATA?
The name STATA is a syllabic
abbreviation of the words
“statistics” and “data”.
STATA
Initial release: 1985 by
StataCorp
What is STATA?
Command-line interface and
graphical user interface (menus
and dialog boxes)
Case sensitive (all commands
are in lower cases)
Results window: The big window. Results of
all Stata commands appear here (except
graphs which are shown in their own
windows).
Command window: Below the results window. The STATA
Commands are entered here. Windows
Variables window: Shows a record of all
variables in the dataset that is currently
being used
Review window: Records all Stata commands
that have been entered. A previous command
can be repeated by double-clicking the
command in the Review window (or by using
Page Up).
Properties Window
Browser Window
Other STATA
Windows
Editor Window
Do-file Window
The three most important menus
Data - for organizing and managing the data
Graphics - for visual exploration &
presentation
Stata Menus and
Statistics - for analysis
Toolbar
Data, Graphics, Statistics
By default:
c:\Program Files\stata 12
Change working directory:
cd c:\ foldername Define Working
Directory
Example:
cd "C:\Users\HP-
PC\Documents\Module 2 DM"
Open STATA file (.dta)
−by selecting File Open or by typing:
use <filename>
Example:
Opening Data
use “sample roster”
If the file name contains blanks, the
address must be enclosed in quotation
marks.
Save STATA file (.dta)
save <filename>
Example :
save “sample roster1” (1st time save)
Saving Data
save “sample roster1”, replace
(2nd time – be careful with overwriting)
Close file (save first!)
clear
Import excel (.xls)
- by selecting File Import Excel
spreadsheet or by typing
import excel “<filename>”, firstrow Importing Data
- Example:
import excel "update type 5 sample.xls“,
firstrow
Import text file (.csv, .txt)
- by selecting File Import Text data
created by a spreadsheet or by typing
insheet using <filename> Importing Data
- Example:
insheet using "update type 5 sample.txt",
clear
Export excel (.xls)
- Store excel data format (.xls)
- by selecting File Export Excel
spreadsheet or by typing
Exporting Data
export excel “<filename>”, firstrow(variable)
- Example:
export excel "update type 5 sample1.xls",
firstrow(variable)
Import text file (.csv, .txt)
- by selecting File Export Comma-
or tab-separated data or by typing
outsheet using “<filename>” Exporting Data
- Store data created by a spreadsheet (.csv, .txt)
- Example:
outsheet using "update type 5 sample1.txt“
• Utilizes different kinds of
descriptive statistics such as:
• Frequency
• Minimum It is used as a
starting point for
• Maximum the data quality
assessment which
• Mean/Median/Mode will determine the
• Total/Sum future strategy with
regards to data
• Missing validation, error
correction
• Describes the data type of the
variables (whether string or
numerical)
count
– counts the number of observation
distinct [varlist]
- report number(s) of distinct observations or
values
- Example: distinct household_id
describe or describe [varlist]
displays the number of observations, variable
names, types and labels
Example: describe household_id
Dataset: sample roster
codebook [varlist]
displays the variable type, number of
missing values and sample values
Example: codebook
codebook member_status
summarize
– useful for numerical variables
- summary of # observation, average, min, max
Example: summarize age
browse or browse [varlist]
displays the dataset
Example: browse
browse gender preg_status
edit or edit [varlist]
displays the dataset and at the same time, the user may
edit the data
Example: edit
edit gender preg_status
sort [varlist]
to arrange the data in ascending/alphabetical order using a
specific variable
Example: sort last_name first_name mid_name
rename [var1] [var2]
to change the name of a particular variable
Example: rename household_id hhid
generate newvar = exp
Create new variable
Example:
generate attending_school = "NO" if
schoolname=="NO SCHOOL" & attend_school !=1
replace oldvar = exp1 [if exp2]
may be used to change the contents of an existing
variable
Example:
replace attending_school = “YES" if
attending_school==“”
drop [varlist]
to remove/delete a specific variable/data
Example: drop hh_set set_group
keep [varlist]
to retain specific variables
Example: keep hhid entry_id
One-way tabulation
tab [varname]
produces one-way table of
frequency counts and percentages
Example: tab region, m
tab1 [varlist]
produces results of the indicated
variables
Example: tab1 fieldupdated
newvalue, m
Dataset: Update 5 sample
append using filename [, options]
to consolidate different datasets
A
+ B
=
B
append
Example:
- to consolidate approved update type 5 in
Antipolo City and Morong
insheet using “update type 5 antipolo.txt“,
clear
save “update type 5 antipolo.dta”
insheet using "update type 5 morong.txt“,
clear
save “update type 5 morong.dta”
append using “update type 5 antipolo”
Merge
- combine datasets horizontally
- to match/merge datasets
+ =
A B A B
• Type:
merge 1:1 varlist using filename
merge m:1 varlist using filename
merge 1:m varlist using filename
•Dataset in memory is called “master dataset”.
•Dataset filename is called “using dataset”.
Example:
- to check if there are newly selected child benes with
newly approved updates on education
insheet using "update type 11 sample.txt", clear
save "update type 11 sample"
merge m:m entryid using "update type 5 sample“
keep if _merge==3
save “update type 11 with type 5.dta”
“There are 159 children selected as
beneficiaries for education last P1
2016 who also have approved
updates on education info in the
same period.
Comparing and contrasting two or more
variables/data
Duplicity checking
where STATA commands are saved in order to
keep a record of the commands used to
produce the result;
allows the user to run a long series of
commands several times;
One-way tabulation
tab [varname] if [exp]
produces one-way table of
frequency counts and percentages
according to the condition specified
Example:
- to show the grade level of the
child bene for education
tab ed_attnmnt if childbene
!="", m
Dataset: sample roster
Two-way tabulation
tab [varname1] [varname2]
produces two-way table of frequency counts
Example:
- to identify the child beneficiaries who are target for updating on school
facility
-tab attending_school childbene, m
Dataset: sample roster
Two-way tabulation
tab [varname1] [varname2] if [exp]
produces two-way table of frequency counts according to the
condition specified
Example:
tab attending_school childbene if member_status=="1 -
Active", m
Dataset: sample roster
Import and save the sample grantee list
insheet using “sample grantee list.txt”, clear
save “sample grantee list.dta"
Example 1: To identify possible duplicate households in terms of
HHID
Example 1: To identify possible duplicate households in terms of
HHID
Example 1: To identify possible duplicate households in terms of
HHID
Example 1: To identify possible duplicate households in terms of
HHID
Type:
duplicates tag [varlist], generate(newvar)
duplicates tag household_id, generate(dup_ID)
Example 1: To identify possible duplicate households in
terms of HHID
tab dup_ID
Since the value of
dup_ID is only zero
(0), then we are sure
that there are no
duplicates in terms of
the household ID.
Example 2: To identify possible duplicate households in
terms of grantee names
duplicates tag lastname first_name middle_name, generate(dup_name)
Example 2: To identify possible duplicate households in
terms of grantee names
tab dup_name
keep if dup_name==1
save “possible duplicate HHs.dta”
Since the value of
dup_NAME consists
of one (1), then we
are sure that there
are indeed possible
duplicates in terms
of names.
Example 2: To identify possible duplicate households in terms of grantee
names
browse if dup_name==1
1. Merge list of identified possible duplicates
(dup_name = 1) with sample roster
use “possible duplicate HHs.dta”, clear
merge 1:m household_id using “sample roster”
keep if _merge==3 (to retain only the roster of the possible
duplicate households)
sort lastname first_name middle_name
3. Export the roster of possible duplicate HHs
3. Export the roster of possible duplicate HHs
export excel using "roster possible duplicate hhs.xls",
firstrow(variables)
4. Check the exported file in your folder and
conduct (manually) the roster analysis by analyzing
the composition of the possible duplicate
households
Textual Form
Can be presented using sentences and paragraphs
Involves enumerating important characteristics of the
data (count, minimum, maximum, average, mode
etc.)
Tabular Form
Clear organization of data into rows and columns
1) Table Number
Table 1. Number of Child Beneficiaries per Age and Grade Level and Table Title
Age Category 2) Column
Level of Education Grand Total
3-5 YO 6-14 YO 15 -18 YO Header
No Grade Reported 36,448 77,969 27,099 141,516
Day Care 103,861 50,582 949 155,392
Kinder 197,272 160,854 1,791 359,917
Kinder / Day Care 1,109 62,520 1,613 65,242
Grade 1 121,643 314,199 1,502 437,344
Grade 2 6,692 683,029 4,194 693,915
Grade 3 1,028 833,623 8,495 843,146
Grade 4 589 1,063,993 20,544 1,085,126
3) Row
Grade 5 308 1,029,544 38,114 1,067,966 4) Body
Classifier
Grade 6 450 1,105,209 195,563 1,301,222
Grade 7 97 613,181 77,412 690,690
Grade 8 69 522,538 181,210 703,817
Grade 9 79 302,485 300,276 602,840
Grade 10 / 4th Year HS 62 42,561 433,370 475,993
Grade 11 - 79 423 502
Grade 12 2 32 90 124
Grand Total 469,709 6,862,398 1,292,645 8,624,752
Source: Pantawawid Pamilya Information System as of March 31, 2016 5) Source Note
Example:
- to tabulate the grade level of the child bene for
education
tab ed_attnmnt if childbene !="", m
Tabular Form
To copy the table from Stata to another application, say Excel,
highlight the table in the Stata Results window, and go to
the Edit menu, and select Copy, Copy Table, or Copy Table as
HTML. After you have copied the table, you can paste the table into
another program.
Tabular Form
Bar graph
- used to compare things between different
groups or to track changes over time.
- graph bar (count) entry_id , over(region_nick)
Source: Update Type 11 Graphical Form
To give a visual
effect
Line graph
- used to compare changes over the same
period of time for more than one group.
Graphical Form
To give a visual
effect
Pie Chart
- best to use when you are trying to compare
parts of a whole. It doesn’t show changes over time.
#delimit ;
graph pie entryid, over(relationshiptohead)
plabel(_all name, size(*1.5) color(white))
legend(off)
Graphical Form
plotregion(lstyle(none))
title("Relationship to HH Head of Child-beneficiaries P1 2016"); To give a visual
#delimit cr
effect
1. Using the list of approved updates
Import and save all types of updates into STATA
Tabulate
Count of data set records with values or null (blank)
- E.g. – distinct count of households, count per field updated, count of new values with blanks, etc.
Find 3 data inconsistencies
- E.g. – old value equals new value, age versus grade level, child bene but not child/grandchild, etc.
2. Using the grantee list and HH roster
Graph the number of households by client status and province
Summary of households with count of eligible by province.
Compare the grade level and age of the child beneficiaries for education using a two way table.
3. Using the list of Code 21 households
Check possible duplicates using the duplicates command in STATA and conduct sample of roster analysis if possible (minimum of 1
pair)