1. Open Stata
2. Click on File / Change working directory
3. Search for the folder with all the files required for your analysis
4. Select the data subfolder and click Open
Summary: File / Change working directory / stataclassfolder / datasubfolder / Open
1. Type dir in the Command window to list the files in the data subfolder
2. Type pwd (print working directory) to confirm which folder Stata is using
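The same steps can also be typed as commands instead of using the menus. This is a minimal sketch; the folder path is hypothetical and should be replaced with the location of your own data subfolder.

```stata
* Hypothetical path: replace with the location of your own data subfolder
cd "C:/stataclass/data"
pwd     // print the working directory to confirm the change
dir     // list the files in the working directory
```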
• The purpose of a log file is to save a record of everything you do in Stata: the
commands you type and the output they produce.
• It creates a file that stores all the information needed to retrace the work
done, which is useful for:-
Double checking your work later on
Reviewing the output of statistical procedures
Copying and pasting results
General way of creating it:-
log using "filename.log"
When a log file with that name already exists:
log using "filename.log", replace (overwrites the existing log file)
log using "filename.log", append (adds to the end of the existing log file)
log close closes the log file
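A short sketch of a complete logging session is shown here. The log file name practice.log is an assumption made for illustration, and the auto dataset is the example dataset shipped with Stata rather than the class data.

```stata
capture log close                  // close any log left open; no error if none is open
log using "practice.log", replace  // hypothetical file name; replace overwrites an old copy
sysuse auto, clear                 // example dataset shipped with Stata
summarize mpg                      // this command and its output are written to the log
log close                          // stop logging
```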
Type use "path and filename.dta"
use "path and filename.dta", clear clears the dataset currently in memory before opening the file
save "path and filename.dta", replace is used when you want the existing .dta file to
be replaced by the newly formed dataset
append using "path and filename.dta" is used when one wants to add more
observations from another .dta file to the dataset in memory
e.g. save zsbs2009.dta, replace
Note: replace in the command overwrites the contents of the old file, so make sure
that no wanted information is deleted by resaving it.
use zsbs2009.dta, clear
Type clear and then exit if you want to close Stata. Remember your log file is not
affected because all the changes have already been saved to it.
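The difference between use, save and append can be sketched with the auto example dataset that ships with Stata. The file name domestic.dta is hypothetical, chosen only for this illustration.

```stata
sysuse auto, clear                 // load the example data, clearing memory first
keep if foreign == 0               // keep the domestic cars only
save "domestic.dta", replace       // hypothetical file name; replace overwrites it
sysuse auto, clear                 // reload the full example data
keep if foreign == 1               // keep the foreign cars only
append using "domestic.dta"        // add the saved domestic cars back
count                              // the full set of observations is in memory again
```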
• Type describe to check all the variables available in the file
• This shows the number of observations, the number of variables, and each variable's
name, storage type, label and order
• describe
• Type describe [varlist] to check specific variables of interest.
• describe q02-q08
LOOKFOR:- used to search for variables in Stata by name or label.
e.g. lookfor age or lookfor "marital status"
Useful when working with huge datasets.
BROWSE:- used to open a window showing the dataset currently in memory, laid out
like an Excel spreadsheet.
COUNT:- used to count how many observations satisfy a particular condition
e.g. count (counts all observations)
count if q103>20 counts the respondents aged over 20 years
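The commands above can be tried on the auto example dataset that ships with Stata. This is only a sketch: the variable names (price, mpg, rep78, weight) belong to that dataset, not to zsbs2009.

```stata
sysuse auto, clear
describe                 // every variable in the dataset
describe price-rep78     // a range of variables: price, mpg and rep78
lookfor weight           // search variable names and labels for "weight"
count                    // number of observations in memory
count if mpg > 20        // number of observations satisfying a condition
```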
codebook [varlist]
Shows the contents of the data in more detail, such as:-
• Variable name/labels/order, and value labels
• Range of values, frequency counts, number of missing values
• If you use the notes option, the notes attached to the data are displayed
• e.g. codebook q101
• codebook q101 q501
• codebook q101-q103
list [varlist] [if exp] [in range] [, nolabel]
Example
• list q103
• list q101 q501
• list q103 in 1/20
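A brief sketch of codebook and list, again using the auto example dataset so that it can be run without the class data. The cut-off of 10000 is an arbitrary value chosen for illustration.

```stata
sysuse auto, clear
codebook rep78                       // range, unique values and number of missing values
list make mpg in 1/5                 // first five observations
list make price if price > 10000     // only observations meeting the condition
list make foreign in 1/5, nolabel    // show numeric codes instead of value labels
```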
TABULATE ONE WAY CMD
tabulate varname [if]
The nolabel option suppresses value labels in the display, e.g.:-
tab q101
tab q101, nol
TABULATE: used to create a frequency distribution of an individual variable (frequency is
the number or count of observations in a category of a variable)
LIST: used to display the values of variables for observations in the data set
• Double check variables
a) Frequency- the frequency count of the number of observations that fall into the various
categories defined for the variables
b) Percent- the % of the total number of observations for which each category accounts
c) Cum.- the cumulative percentage
TABULATE ONE-WAY: MISSING VALUES
Stata ignores missing values when running tabulations. To view them you must use the
missing option.
This is good for knowing what is missing in the data, e.g.
tab q101, missing
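The effect of the nolabel and missing options can be seen on the auto example dataset, where the variable rep78 has some missing values. This is a sketch on example data, not on zsbs2009.

```stata
sysuse auto, clear
tabulate foreign              // frequencies shown with value labels
tabulate foreign, nolabel     // the same table with the underlying numeric codes
tabulate rep78                // missing values are left out of the table
tabulate rep78, missing       // missing values appear as their own row
```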
OPERATIONS, EXPRESSIONS, AND OPERATORS
Y=X+Z
• An expression is a statement that combines operands and operators
• Operands can be numbers, variables, the result of a function, or some combination of these
• Operators can be arithmetic, logical or relational
• Expressions are used with commands such as keep or drop (after if)
Arithmetic          Logical            Relational
+  addition         &  (and)           >   (greater than)
-  subtraction      |  (or)            <   (less than)
*  multiplication   ! or ~ (not)       >=  (greater than or equal to)
/  division                            <=  (less than or equal to)
^  power                               ==  (equal to)
-  negation                            != or ~= (not equal to)
=  assignment
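The three kinds of operators can be combined in one expression. The sketch below uses the auto example dataset; the new variable weight_kg and the cut-off values are assumptions made only for illustration.

```stata
sysuse auto, clear
generate weight_kg = weight * 0.4536     // arithmetic: weight is recorded in pounds
count if mpg >= 25 & foreign == 1        // relational operators joined by & (and)
count if price < 4000 | price > 12000    // either condition may hold with | (or)
count if !missing(rep78)                 // ! (not) reverses a condition
keep if rep78 != . & mpg > 15            // keep only observations satisfying both conditions
```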
Stata runs in two ways
• Interactive mode: where CMDs are typed directly and executed in the CMD window
• Batch mode: where CMDs are written in a text file and executed
together in one step
• A DO-FILE is a text file ending with the extension .do
• It contains Stata CMDs exactly the way you'd type them into the CMD window
• In Windows, the text of the do-file is written in the Do-file Editor, opened by either:
1. Typing doedit in the CMD window
2. Using the menu bar: click on Window, select Do-file Editor (or press Ctrl+9)
To save the do file you
Adding comments on a do file
• Name of project
• Purpose of analysis
• Your name
• Date of creation
• Institution
• Any modification
Different ways to write comments
• * text of comment: the asterisk is used at the beginning of the line
• Double forward slash:
// text of comment, used to comment out a single CMD or line of text
Can be placed at the beginning or end of a line
*name of project: DEM 3310
*purpose of analysis: sexual behaviour among teenagers
*name of Authors: Vincelaama
*Institution: university of Zambia
*Date: 31/07/2017
*stata commands
clear // this closes any dataset that is open in stata
capture log close // this will suppress any error message
cd "C:\users\Chilax\Desktop\stata class\Data" // this makes Stata work in the preferred working folder
log using dem33110.log, append
use zsbs2009.dta
* commands for subsetting datasets
keep q101 q103 q107 q701-q711
*drop command
drop q107 // q107 is kept above so that it can be dropped here; q101 stays because later commands use it
keep if q101==1
keep if q103<=19
//data transformation in stata
tab q101, nol
*creating new variables
**generate command
tab q103, missing
gen age=q103
recode q103 (15/24=1 "15-24") (25/49=2 "25-49") (50/60=3 "50-60"), gen(agegroup) // this generates the new grouped variable called agegroup
tabulate agegroup, missing
tab agegroup, nol // nol means no labels
rename q101 sex // to rename, write the old variable name, a space, then the new name
Kay Vincelaama 8/26/2017
recode q103 (15/24=1 "15-24") (25/49=2 "25-49") (50/60=3 "50-60") if sex==1, gen(mage) // q101 was renamed to sex above
• Open Stata, type the doedit CMD or just use the Ctrl+9 function
• Click on the open icon and select your do-file by locating it
• To run the do-file, click execute once and the commands and their
results will appear in the Results window
• Use the keep CMD
e.g. keep q101 q701-q711
• Highlight the command and click on execute to keep only the selected
variables
To execute all the commands, highlight the whole do-file and click execute
Using the zsbs2009 individual dataset and the age variable q103, what command
could you use to identify the number of respondents who are missing a value for
age? tab q103, m
How many are there? 482
Create a new variable called oldfolks to identify those respondents older than age
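45 using the generate and replace commands. How many are there?
gen oldfolks=q103
keep if oldfolks>45
Note: Stata treats a missing value as larger than any number, so oldfolks>45 also keeps
respondents whose age is missing; add & oldfolks<. to the condition to exclude them.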
574 people older than 45 yrs
EXAMPLES OF CMDS TO USE
tab q107, m
drop if q107==9
tab q107
tab q107, m
tab q108
tab q108, m
drop if q108==.
tab q108
tab q101 q108
tab q101 q108, row
set more off
tab q101 q105
tab q101 q105, row
tab q101 q501, row
tab q501, m
drop if q501==9
tab q501
tab q101 q108, col
tab q101 q108, exp
tab q101 q108, row chi2
UNIVARIATE ANALYSIS
This is the simplest form of quantitative analysis. The analysis is
carried out with the description of a single variable in terms of the
applicable unit of analysis.
It is performed when we want to explore each variable in a data set,
separately.
It looks at the range of values, as well as the central tendency of the
values.
It describes the pattern of response to the variable.
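A minimal sketch of describing one variable at a time, using the auto example dataset shipped with Stata. summarize suits continuous variables and tabulate suits categorical ones.

```stata
sysuse auto, clear
summarize mpg              // mean, standard deviation, minimum and maximum
summarize mpg, detail      // adds percentiles, including the median
tabulate rep78, missing    // pattern of responses for a categorical variable
```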
BIVARIATE ANALYSIS
Helps to relate one variable to another. The aim could be to investigate whether a
relationship exists and, if it exists, to determine whether or not it is
significant
To perform a bivariate analysis, you have to cross-tabulate the variables of interest
Cross tabulation general rules
Decide which is the independent variable and which is the dependent
The independent variable is written first in the CMD structure
In case you get lost, the percentages within each category of the independent variable
should always add up to 100%
E.g. gender with the completion of school
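The rule can be sketched on the auto example dataset. Treating foreign (car origin) as the independent variable and rep78 (repair record) as the dependent one is an assumption made only for illustration.

```stata
sysuse auto, clear
* The independent variable is written first, so it forms the rows of the table
tabulate foreign rep78, row    // row percentages add up to 100% within each row
```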
X2 (chi-squared) analysis:
A chi-square test measures the association between two categorical variables
It calculates the probability that the relationship observed between two categorical
variables is due to chance (a.k.a. random sampling error)
It requires that you compare what you observe to what you would expect to observe if
there were no pattern
E.g. sex and education
tab q101 q108
tab q101 q108, expected
tab q101 q108, row chi2
Pr = 0.000 is below 0.05, so at the 95% confidence level the relationship is unlikely
to be due to chance, i.e. it is statistically significant
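The comparison of observed and expected counts can be sketched on the auto example dataset. The choice of variables is an illustration only; r(chi2) and r(p) are the results that tabulate stores when the chi2 option is used.

```stata
sysuse auto, clear
tabulate foreign rep78, expected          // observed and expected frequencies side by side
tabulate foreign rep78, row chi2          // row percentages plus the Pearson chi-squared test
display "chi2 = " r(chi2) "   p = " r(p)  // stored statistic and p-value
```

A p-value below 0.05 would be read as a significant association at the 95% confidence level.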
DESCRIPTION
The t-test performs tests of the equality of means. It has the following forms
Tests that varname has a mean of #
E.g. the average age of males is 21
Tests that varname has the same mean within the two groups defined by groupvar
E.g. check if the scores of girls and boys are the same.
Tests that two varnames have the same mean, assuming paired (or unpaired) data
E.g. test whether women aged 15-24 and 25-29 have the same number of children
The dependent variable should be measured at the interval or ratio level (i.e. continuous);
e.g. weight, height, income
The independent variable should consist of two categorical, independent (unrelated)
groups; e.g. marital status
Independence of observations; i.e. there must be different participants in each group with no
participant being in more than one group;
E.g. women using traditional and modern contraceptives, with CEB as the continuous variable
No significant outliers, i.e. values that stick out from the ordinary; e.g. when the
average in the data was 23 years and the results show someone above 35 years
The dependent variable should be approximately normally distributed for each category (of
the independent variable); and
E.g. when we take a population sample we assume it is normally distributed
Homogeneity (sameness) of variances (spread); checked using Levene's test for homogeneity of
variances. The variation should be similar in each group
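Two of these assumptions can be checked directly in Stata. The sketch below uses the auto example dataset with mpg as the continuous variable and foreign as the two-group variable, both chosen only for illustration.

```stata
sysuse auto, clear
robvar mpg, by(foreign)        // the W0 statistic reported here is Levene's test
swilk mpg if foreign == 0      // Shapiro-Wilk normality test for the first group
swilk mpg if foreign == 1      // and for the second group
```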
Test whether the mean of the sample is equal to a known constant under the
assumption of unknown variance;
-use http://www.stata-press.com/data/r13/auto
-save ttest_de3310, replace
Example: test whether the overall average fuel economy for the sample is 20
(the variable mpg is measured in miles per gallon)
-ttest mpg==20
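A sketch of the same one-sample test using the copy of the auto dataset that ships with Stata, so no download is needed. r(t) and r(p) are the results that ttest stores after it runs.

```stata
sysuse auto, clear
ttest mpg == 20                                  // H0: the mean of mpg equals 20
display "t = " r(t) "   two-sided p = " r(p)     // stored test statistic and p-value
```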
Determine whether the mean of a dependent variable (e.g. weight) is the same in two unrelated,
independent groups (e.g. males and females)
Example: how can we test the effectiveness of a new treated fuel?
Such as leaded petrol, low sulphur diesel
Performance of the banks by measuring the customers' satisfaction with services
We can run an experiment in which 12 cars are given the new treated fuel and another 12 cars are not;
we then measure how far they travel per unit of fuel using the "mpg" variable. We also have a variable
called "treated" coded 1 if the fuel is new (treated) and 0 if not.
This process will test the equality of means for the treated and untreated groups:
use twosample_ttest_de3310.dta
ttest mpg, by(treated) // compares how far a car can go with treated and with untreated fuel
In demography, e.g. comparing those using contraceptives with those not using them
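Since twosample_ttest_de3310.dta is a class file, the sketch below uses the auto example dataset instead, with foreign standing in for the grouping variable. That substitution is an assumption made only so the example can be run anywhere.

```stata
sysuse auto, clear
ttest mpg, by(foreign)            // two-sample test assuming equal variances
ttest mpg, by(foreign) unequal    // the same test without assuming equal variances
```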
Used to determine whether the mean of a dependent variable is the same in two
related groups (e.g. the same group measured at two different "time points" or
under two different "conditions")
Example: compare the distance travelled, measured as "mpg1" and "mpg2", by a car run on
an additive fuel and also run on an ordinary fuel. In this case, we
conduct a paired t-test
ttest mpg1==mpg2
ttest mpg1==mpg2, unpaired // the same comparison treating the two samples as independent
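A self-contained sketch of the paired test follows. The six pairs of values are made up purely for illustration and are not results from any real experiment.

```stata
clear
* Illustrative values only: each row is one car measured under both conditions
input mpg1 mpg2
20 24
23 25
21 21
25 22
18 23
17 18
end
ttest mpg1 == mpg2              // paired test: uses the difference within each car
ttest mpg1 == mpg2, unpaired    // treats the two columns as independent samples
```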
Analysis of variance (ANOVA): this compares the means of three or more independent groups in a sample
CORRELATION AND REGRESSION
The relationship is developed through answering questions about the core problem
E.g. whether there is a relationship or a distinction between rural and urban areas.
These are mostly questions that require more research to answer, such as defining the
culture of Zambia, which is very complex.
Another example is "does breastfeeding duration have an impact on IQ?" Such questions do not
require yes or no answers but analysis.
The analysis can be built on an assumption such as the one below
Breastfeeding impacts on disease burden
Breastfeeding (independent), diarrhea (dependent): in this case we cannot use a t-test because the
dependent variable is a categorical variable. Instead we use univariate and bivariate analysis
with chi-square, then move on to correlation
Bivariate analysis shows whether there is a relationship by producing the percentages
Correlation shows whether the relationship is strong or weak, and its direction
Regression estimates how much the dependent variable changes as the independent variable changes
WHAT IS OUR WORKING OR SOUND THEORY
We hypothesize (or ask a research question) that a baby’s birth weight (m19_1) is a
function of or related to the mothers’…
1. Area of residence (v025)
2. Sex of the baby (b4_01)
3. Education status of the mother (v106)
4. Age of respondent (v012)
5. We shall use the 2007 zdhs dataset
6. A scatter plot is used only when the independent and dependent variables
are continuous variables.
7. twoway (scatter v438 m19_1) (lfit v438 m19_1)
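The same kind of graph can be sketched with the auto example dataset, where mpg and weight are both continuous. These variables are stand-ins for illustration, not the ZDHS variables.

```stata
sysuse auto, clear
* The first variable is plotted on the vertical axis, the second on the horizontal axis
twoway (scatter mpg weight) (lfit mpg weight)
```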
SCATTER PLOT GRAPH
[Figure: scatter plot with fitted values. Axes: birth weight in kilograms (3 decimals) and respondent's height in centimeters (1 decimal); both axes run from 0 to 10000.]
Most of the points lie between 0 and 6kg. Values beyond 6kg are unusual outcomes and
are therefore called outliers; they can be dropped to give a clearer view of the data
1. drop if v438>2000
2. drop if m19_1>6000
[Figure: scatter plot with fitted values after dropping the outliers. Axes: birth weight in kilograms (3 decimals), from 1000 to 6000, and respondent's height in centimeters (1 decimal), from 1000 to 1800.]
This is what the final outcomes should show.
CORRELATION
Run a correlation using either corr or pwcorr
In both CMDs, the correlation shows how strong or weak the relationship is between the
observed variables of interest. It can be either positive or negative
A positive value close to 1 shows a strong positive relationship, and
a negative value close to -1 shows a strong negative relationship; values near 0 show a weak one
E.g. 0.002 is a very weak positive relationship, while 0.4 is a much stronger one,
being closer to 1.
corr uses the listwise deletion method
pwcorr uses the pairwise deletion method; the two give the same results only when no values are missing
corr m19_1 v438 b4_01 v106 v025 v012
pwcorr m19_1 v438 b4_01 v106 v025 v012, star(0.05)
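The difference between listwise and pairwise deletion can be seen on the auto example dataset, where only rep78 has missing values. This is a sketch on example data; the obs option of pwcorr prints the number of observations used for each pair.

```stata
sysuse auto, clear
corr price mpg rep78 weight                     // listwise: drops any car with a missing value
pwcorr price mpg rep78 weight, obs star(0.05)   // pairwise: each pair uses all available cars
```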
. corr m19_1 v438 b4_01 v106 v025 v012
(obs=6605)
| m19_1 v438 b4_01 v106 v025 v012
-------------+------------------------------------------------------
m19_1 | 1.0000
v438 | 0.1056 1.0000
b4_01 | -0.0808 -0.0002 1.0000
v106 | -0.0268 0.1444 -0.0053 1.0000
v025 | 0.0756 -0.1229 -0.0031 -0.3265 1.0000
v012 | 0.1177 0.1147 0.0067 -0.1717 0.0472 1.0000
. pwcorr m19_1 v438 b4_01 v106 v025 v012, star(0.05)
| m19_1 v438 b4_01 v106 v025 v012
-------------+------------------------------------------------------
m19_1| 1.0000
v438 | 0.0277* 1.0000
b4_01 | 0.0016 0.0037 1.0000
v106 | -0.2841* -0.0120 0.0006 1.0000
v025 | 0.3028* 0.0054 -0.0115 -0.3652* 1.0000
v012 | 0.0953* 0.0302* 0.0047 -0.2173* 0.0606* 1.0000
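The coefficients marked with a star are statistically significant at the 5% level, whether
the relationship is positive or negative; a star does not by itself mean the relationship is strong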
POPULATION PROJECTION: this is a best guess of the number of people expected to be alive at
a future date, based on assumptions about population size, births, deaths, and
migration.
A set of calculations which show the future course of the population based on
the assumptions used about fertility, mortality and migration.
BASE POPULATION: this is the population classified by age and sex at the start of
the projection period.
SOURCES OF PROJECTION DATA
1. National statistics office.
2. National university (Academics)
3. Global population bureau.
Government, e.g. for planning purposes
Academia
NGOs
Donor community
LEVELS OF PROJECTIONS
Projections are generated at three levels
1. Short term (about 5 years), considered the best for reviewing population changes
2. Medium term (5-15 years)
3. Long term (over 15 years)
There are basically two main methods for making projections
1. Mathematical methods
Based on growth rates
2. Cohort component methods
Taking into account the contributions of fertility, mortality and migration
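A mathematical projection from a growth rate can be sketched with Stata's display command. The geometric and exponential formulas are two common choices and are an assumption about what the notes intend; the base population, growth rate and period are hypothetical figures.

```stata
* All figures below are hypothetical, chosen only to illustrate the arithmetic
scalar P0    = 13000000    // base population at the start of the projection period
scalar rate  = 0.028       // assumed annual growth rate (2.8%)
scalar years = 10          // length of the projection period
display "Geometric:   " %12.0fc (P0 * (1 + rate)^years)
display "Exponential: " %12.0fc (P0 * exp(rate * years))
```

The cohort component method is not shown here because it needs age-specific fertility, mortality and migration inputs.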
PURPOSE OF POPULATION PROJECTIONS
Why do we bother to generate population projections?
1. Planning for developmental sectors at national and subnational levels
2. Provide information for further research
Education sector: planning to build primary and secondary schools
and to provide teaching staff in 2020 will depend on the projected
proportion of children of primary and secondary entry ages (age 7
and age 15 respectively)
Health sector
Labour
Energy
Agriculture
Housing
Transport
Water
1. Select geographic area
2. Determine data usage
3. Determine the period for the projection
4. Gather input data
5. Make assumptions
6. Enter data into software
7. Examine projection output
8. Make alternative projections
9. Publish the results
Requirements and outputs
1. DemProj [Futures Group International]
2. PAS (population analysis software) [US Census Bureau]
Demographic indicators
Total population size
Population aged 0-4
Population aged 5-14
Population aged 15-64
Population aged 65+
Total net international migration
Mortality indicators
CDR
Annual deaths
IMR
Life expectancy
Child mortality rate
Fertility indicators
CBR
NRR
GRR
TFR