Introduction to Stata
What is Stata?
• Stata is a comprehensive statistical software used for
data analysis, data management, and graphics
• Widely used in Economics, Social Sciences, Biostatistics,
and other fields for rigorous data analysis and research
Why Use Stata?
• User-friendly interface.
• Comprehensive documentation and strong community support.
• Versatility in handling large datasets.
• Wide range of statistical tools.
• Stata continues to evolve, with a focus on expanding its
statistical methods, improving the user interface and experience, and
integrating with other programming languages such as Python and R.
• It remains widely used in academia, government, and industry, in
fields including economics, biostatistics, political science, and
sociology.
Notes:
• Why Use Stata?
• User-Friendly Interface: Stata offers an intuitive and easy-to-navigate interface,
making it accessible for beginners while still powerful enough for advanced users.
• Comprehensive Toolset: It provides a wide range of statistical, graphical, and
data management tools, all in one package, supporting both simple and complex
analyses.
• Consistency Across Platforms: Whether you're using Windows, Mac, or Linux,
Stata delivers the same functionality, ensuring a consistent experience across
different operating systems.
• Strong Community Support: Stata has a large and active user community,
offering plenty of online resources, tutorials, and forums where users can seek
help and share knowledge.
• Regular Updates: The software is regularly updated to include the latest
statistical methods, ensuring that users have access to cutting-edge tools for their
research.
• Widely Accepted in Academia and Industry: Stata is widely used in academia
for teaching and research, as well as in various industries for data analysis,
making it a valuable skill in both academic and professional settings.
2. Getting Started with Stata
• Stata Interface Overview
• Menu Bar: Provides access to most of Stata’s features.
• Command Window: Where commands are typed and
executed.
• Review Window: Shows the history of commands you
have entered.
• Variables Window: Displays the list of variables in the
dataset.
• Results Window: Displays the output from commands.
• Do-file Editor: Used for writing and saving sequences of
commands (scripts).
Basic Commands and Syntax
•Loading Data:
•use command to load a dataset saved on disk; sysuse to load an example dataset shipped with Stata.
•Example: sysuse auto
•Viewing Data:
•browse to view the dataset in a spreadsheet format.
•list to display data in the Results window.
•Describing Data:
•describe for an overview of the dataset.
•summarize to get summary statistics like mean, standard
deviation, etc.
•Clearing Data:
•clear command to clear the dataset.
•Example: clear
•Importing and Exporting Data
•Importing from Excel, CSV, etc. (import excel, import
delimited)
•Exporting to different formats (export excel, export
delimited)
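The commands above can be tried together in a short session. This is a minimal sketch, assuming the auto example dataset that ships with Stata; the file name auto_copy.csv is a hypothetical name chosen for illustration and is written to the current working directory.

```stata
sysuse auto, clear                // load the example dataset shipped with Stata
describe                          // overview of variables and storage types
summarize price mpg               // mean, standard deviation, min, max
list make price mpg in 1/5        // show the first five observations

// round trip through a CSV file (auto_copy.csv is a hypothetical name)
export delimited using "auto_copy.csv", replace
import delimited using "auto_copy.csv", clear
describe
```

The replace option overwrites an existing file of the same name, and the clear option discards the data currently in memory before the import.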
3. Data Management in Stata
•Creating and Modifying Variables
•generate to create new variables.
•replace to modify existing variables.
•Example: generate age_sq = age^2
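A short sketch of generate and replace working together, assuming the auto example dataset in place of a dataset with an age variable. The variable names mpg_sq and high_mpg and the cutoff of 25 are hypothetical and chosen only for illustration.

```stata
sysuse auto, clear

* generate creates a new variable from an expression
generate mpg_sq = mpg^2

* create an indicator set to 0 everywhere, then use replace to modify it
generate byte high_mpg = 0
replace high_mpg = 1 if mpg > 25 & !missing(mpg)

list make mpg mpg_sq high_mpg in 1/5
```

The !missing(mpg) condition matters because Stata treats missing values as larger than any number, so mpg > 25 alone would also flag observations where mpg is missing.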
4. Basic Data Analysis
•Descriptive Statistics
•tabulate for frequency tables.
•summarize with detailed options.
•tabstat for customized summary statistics.
•Graphs and Plots
•histogram, scatter, and graph box for visual data exploration.
•Example: scatter yvar xvar
•The Graph Editor can be used to customize plots.
•Basic Regression Analysis
•regress command for running linear regression.
•Interpreting the output (coefficients, R-squared, etc.).
•Example: regress yvar xvar1 xvar2
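The analysis commands in this section can be sketched in one pass. The example assumes the auto example dataset, with its variables standing in for yvar and xvar; the choice of variables is illustrative only.

```stata
sysuse auto, clear

* descriptive statistics
tabulate foreign                               // frequency table
summarize price, detail                        // percentiles, skewness, kurtosis
tabstat price mpg, statistics(mean sd min max) by(foreign)

* graphs
histogram mpg
scatter mpg weight
graph box mpg, over(foreign)

* linear regression of mpg on weight and foreign
regress mpg weight foreign
```

In the regress output, the Coef. column holds the estimated coefficients and the header reports R-squared and the number of observations.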
5. Introduction to Do-Files
• What is a Do-File?
• A Do-file is a script containing a sequence of Stata
commands.
• Creating and Running Do-Files
• How to write, save, and run a Do-file.
• Benefits of using Do-files (reproducibility, efficiency).
6. Tips and Best Practices
•Commenting Code
•Using * or // to add comments in Do-files.
•Organizing Work
•Importance of clear file structures and naming
conventions.
•Using log files to save session outputs.
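A commented Do-file that keeps a log might look like the sketch below. The log name session_log is hypothetical, and the auto example dataset is assumed only so that the script runs on its own.

```stata
* Purpose: summarize the auto data and run one regression
* (a line starting with * is a comment)

capture log close                          // close any log left open earlier
log using "session_log", text replace      // start a plain-text log file

sysuse auto, clear
summarize price mpg                        // text after // is also a comment
regress price mpg

log close                                  // stop logging; output is saved
```

Saved under a name such as analysis.do (again hypothetical), the script can be run with do analysis.do or from the Do-file Editor. Re-running it reproduces both the results and the log.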
Hands-on Activity
Task 1 in Stata
Agenda
Introduction to Stata
Introduction to the assignment
Simple stepwise guidelines for carrying out the assigned tasks
Input of data in Stata
Creating a self-explanatory Do-file
Carrying out analysis
Interpretation of results
Stata Basic Commands:
Loading a data set:
sysuse auto
Browsing the data set in the Data Editor:
browse (abbreviation: br)
Clearing the data set from memory:
clear
Other commands used in this activity:
codebook
summarize (abbreviation: sum)
input y x
The four important windows of Stata's basic interface
Managing a Do-file
Keeping a log file
The first assignment is related to household income and consumption,
using a dataset. You can either use a real dataset or create a simple
hypothetical one.
Step-by-Step Assignment Outline:
Objective: Analyze the relationship between household income and
consumption, run a regression model, interpret the coefficients, and
test for heteroskedasticity and multicollinearity using Stata.
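The two diagnostic tests named in the objective are available as postestimation commands after regress. The sketch below assumes the auto example dataset with two regressors, because variance inflation factors only become informative when there is more than one explanatory variable; the scalar name bp_p is hypothetical.

```stata
sysuse auto, clear
regress price mpg weight

* Breusch-Pagan / Cook-Weisberg test; H0: constant error variance
estat hettest
scalar bp_p = r(p)               // keep the p-value for later reference
display "Breusch-Pagan p-value: " bp_p

* variance inflation factors as a multicollinearity check
estat vif
```

A small p-value from estat hettest is evidence against constant variance. For estat vif, a commonly quoted rule of thumb treats values above 10 as a sign of problematic collinearity.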
1. Dataset: You can use real data from publicly available sources
(e.g., World Bank, UCI Machine Learning Repository).
2. For simplicity, let’s use a small hypothetical dataset
for this example.
Data input in Stata
input y x
1
2
3
end
Hypothetical Data:
Assignment Tasks:
Task 1: Run the Regression in Stata
Model: Y = β0 + β1X + u, where Y is household consumption and X is
household income.
Command: regress consumption income
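A minimal sketch of Task 1 from data entry to regression. The five rows are the first five rows of the Y X listing that appears later in these slides, with Y treated as consumption and X as income; that mapping is an assumption made for illustration.

```stata
clear
input consumption income
15000 25000
18000 30000
30000 50000
35000 60000
40000 70000
end

list                               // check that the data were entered correctly
regress consumption income         // estimates b0 (_cons) and b1 (income)
```

With only five observations the estimates are imprecise, so the sketch shows the mechanics rather than a meaningful result.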
Task 2: Interpret the Coefficients
•Explain what the slope (β1) means.
•For instance, β1 = 0.5 means that a 1-unit increase in household
income is associated with a 0.5-unit increase in consumption, on
average.
•Interpret the intercept (β0) and the significance levels
(p-values).
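After regress, the estimates can be read back for interpretation. The sketch below reuses five illustrative rows (taken from the Y X listing later in these slides, with Y assumed to be consumption and X income) and recomputes the two-sided p-value for the slope by hand; the scalar name p_slope is hypothetical.

```stata
clear
input consumption income
15000 25000
18000 30000
30000 50000
35000 60000
40000 70000
end

regress consumption income

display "slope (b1):     " _b[income]
display "intercept (b0): " _b[_cons]

* two-sided p-value for H0: b1 = 0, from the t statistic and residual df
scalar p_slope = 2*ttail(e(df_r), abs(_b[income]/_se[income]))
display "p-value for b1: " p_slope
```

The hand-computed p-value should agree with the P>|t| column that regress prints for income.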
Outcome of the activity:
Conclusion:
In this assignment, students will:
•Learn to run a simple linear regression.
•Understand the interpretation of
regression coefficients.
Thank You
Regards,
Fatima.
Practice 2: Hands-on Activity
Two Small Data Sets
Agenda:
• Entering 2 small data sets using the following commands:
• clear
• input Y X
• end
The 2 data sets are given on page 65.
Entering Data Set 1:
• . clear
• . input Y X
• Y X
• 1. 70 80
• 2. 65 100
• 3. 90 120
• 4. 95 140
• 5. 110 160
• 6. 120 180
• 7. 130 200
• 8. 140 220
• 9. 155 240
• 10. 150 260
• 11. end
• .
• . gen sample = 1
• . save temp_sample1, replace
• file temp_sample1.dta saved
Using (gen, save) commands to generate
and save sample 1
• gen sample = 1
• save temp_sample1, replace
• br
Browsing 1st sample Data Set
Now run the clear command and enter
sample 2
• clear
• input Y X
• 55 80
• 60 88
• 70 100
• 80 120
• 95 140
• 110 160
• 118 180
• 145 220
• 150 240
• 175 260
• end
Generating sample 2 and saving it:
• gen sample = 2
• . save temp_sample2, replace
Browse for sample 2:
• Y X sample
• 55 80 2
• 60 88 2
• 70 100 2
• 80 120 2
• 95 140 2
• 110 160 2
• 118 180 2
• 145 220 2
• 150 240 2
• 175 260 2
Now Append both samples
• use temp_sample1, clear
• append using temp_sample2
• list
• br
Y X sample
70 80 1
65 100 1
90 120 1
95 140 1
110 160 1
120 180 1
130 200 1
140 220 1
155 240 1
150 260 1
55 80 2
60 88 2
70 100 2
80 120 2
95 140 2
110 160 2
118 180 2
145 220 2
150 240 2
175 260 2
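The whole sequence of entering, labelling, saving, and appending the two samples can be collected into one Do-file. The data and file names are those used in the slides; the sepby() option and the tabulate check are additions for readability.

```stata
* sample 1
clear
input Y X
70 80
65 100
90 120
95 140
110 160
120 180
130 200
140 220
155 240
150 260
end
gen sample = 1
save temp_sample1, replace

* sample 2
clear
input Y X
55 80
60 88
70 100
80 120
95 140
110 160
118 180
145 220
150 240
175 260
end
gen sample = 2
save temp_sample2, replace

* stack sample 2 underneath sample 1
use temp_sample1, clear
append using temp_sample2
list, sepby(sample)        // separator line between the two samples
tabulate sample            // should show 10 observations in each sample
```

Stata is case-sensitive, so the variables must be referred to as Y and X exactly as they were entered.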
Using the list Command
• Command syntax:
• list
• | Y X |
• |-----------------|
• 1. | 15000 25000 |
• 2. | 18000 30000 |
• 3. | 30000 50000 |
• 4. | 35000 60000 |
• 5. | 40000 70000 |
• |-----------------|
• 6. | 50000 80000 |
• 7. | 55000 100000 |
• 8. | 600000 110000 |
• 9. | 70000 115000 |
• 10. | 80000 125000 |
• |-----------------|
• 11. | . . |
• 12. | . . |
Next set of commands to be executed:
•Run a separate regression on each sample to obtain the estimates, the fitted values (yhat), and the residuals:
•use temp_sample1, clear
•reg Y X
•predict yhat1
•predict res, residuals
•gen residuals_square = res^2
•scatter Y X
•twoway (scatter Y X) (lfit Y X)
•twoway (scatter Y X) (lfit yhat1 X)
•use temp_sample2, clear
•reg Y X
•predict yhat2
•predict res, residuals
•list yhat2 res
•gen res_squares = res^2
•list res res_squares
•scatter Y X
•twoway (scatter Y X) (lfit Y X)
•twoway (scatter Y X) (lfit yhat2 X)
Command for making a histogram of the residuals
• . histogram res, normal
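The regression steps for sample 1 can be run as one self-contained script. The sketch below re-enters the sample 1 data rather than loading temp_sample1, so that it does not depend on a saved file; the variable name res_sq is hypothetical.

```stata
clear
input Y X
70 80
65 100
90 120
95 140
110 160
120 180
130 200
140 220
155 240
150 260
end

regress Y X
predict yhat1                      // fitted values (the default for predict)
predict res, residuals             // residuals: Y minus yhat1
generate res_sq = res^2            // squared residuals

summarize res                      // OLS residuals average to zero
list Y X yhat1 res res_sq

twoway (scatter Y X) (lfit Y X)    // data points with the fitted line
histogram res, normal              // residuals with a normal curve overlaid
```

Variable names must match the case used at input (Y and X), and histogram must be given the residual variable created by predict (res here).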
Results of sample1:
Results of sample 2:
Entering 2 blank values (a period denotes a missing value)
• input Y X
• . .
• . .
• end
Using the list if missing(X) Command:
. list if missing(X)
• +-------+
• | Y X |
• |-------|
• 11. | . . |
• 12. | . . |
• +-------+
Drop if X is missing:
• drop if missing(X)
Using the edit Command:
• edit
• The Data Editor opens. Enter the missing values manually, confirming each
value before entering the next, or use the drop command to drop the
observations with missing values:
• list if missing(X)
• drop if missing(X)
Replacing missing values by mean
values
• Use the following commands to get the means of X and Y:
• summarize X
• summarize Y
• Then replace the missing values with those means (here entered as 70000 and 50000):
• replace X = 70000 if missing(X)
• replace Y = 50000 if missing(Y)
• br
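Instead of typing the mean by hand, the value that summarize stores in r(mean) can be used directly. The sketch below assumes the sample 1 data as a stand-in and uses set obs to add two all-missing observations, which is a different route from the input command shown in the slides.

```stata
clear
input Y X
70 80
65 100
90 120
95 140
110 160
120 180
130 200
140 220
155 240
150 260
end

set obs 12                          // adds observations 11 and 12, all missing
list if missing(X)

summarize X                         // missing values are ignored in the mean
replace X = r(mean) if missing(X)

summarize Y
replace Y = r(mean) if missing(Y)

list in 11/12                       // both rows now hold the sample means
```

Using r(mean) avoids transcription errors and keeps the Do-file correct if the data change.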