Stata Tutorial

(Windows version SE 10.1)
Spring 2009
Data and Statistical Services
Social Science Reference Center
Firestone Library
Princeton University
data@princeton.edu
http://dss.princeton.edu
This guide gives you a basic overview of the commands for reading in, examining, analyzing, and graphing data using Stata. The sections on reading in data use, as examples, typical data formats you may encounter when downloading data files. The sections on examining, analyzing, and graphing list some of the most commonly used commands. In this new edition, I added an equivalent way to submit commands by using menus. Menus may make it easier for you to explore the commands; feel free to play around beyond what is described here. Once you gain familiarity, you can learn more of Stata’s operations by using Stata’s help and search functions, trying on-line tutorials, or searching through Stata’s listserv archive, available at http://www.stata.com/statalist/archive. It will be easier to follow the whole document in sequence, but you may skip ahead and try only the sections that interest you.
If you have any questions about or suggestions for this guide, please email furuichi@princeton.edu.
Table of Contents
1. Introduction
1.1. Getting help in Stata
1.2. Interpreting Stata’s help pages
1.3. About data for Stata
2. Read in data
2.1. Reading in an ASCII data file with a Stata set up file
2.2. Creating a Stata set up file
2.3. Reading in an Excel file
2.4. Reading in Stata data
3. Data management
3.1. Append
3.2. Merge
3.3. Transpose
3.4. Reshape
3.5. Organizing variables
4. Explore data
5. Obtain descriptive statistics
6. Transform variables and records
7. Graph
8. Obtain difference of means statistics
9. Obtain linear regression estimates
10. Do files
11. Shortcut menus
12. Exporting results
13. Other helpful commands
14. On-line tutorials
15. References
* The period (.) in front of Stata commands indicates the Stata prompt. Do not type the period as a part of the command.
* In the text, Stata commands are set off by dashes, as in –help-.
* The words to be replaced are written in italics.
* The variable names are in capital letters (unless they are in the command lines). Stata is case sensitive (upper case and lower case are seen as different characters).
1. Introduction
Stata is a statistical analysis package used for manipulating, examining, summarizing, and graphing data. Stata contains built-in statistical commands that allow users to do statistical analyses such as cross-tabulations and regression analyses. Stata stores data in its own format. Once a data set is in its memory, Stata outputs results in response to your commands. Commands can be executed one at a time interactively, or in groups in a command file.
Princeton University has a Unix version of Stata on the server tombstone, and Windows and Macintosh versions on the Office of Information Technology (OIT) cluster computers. If you have a Princeton computer account, you have access to either version. To use the Unix version, your tombstone account has to be activated. For more information about activating your account, check OIT’s help sites on the web (see http://helpdesk.princeton.edu/kb/display.plx?id=9682). In this guide, you will learn how to submit individual commands using a Windows version of Stata.
A Windows version of Stata shows four windows, as in the picture below*. The rectangular window at the bottom is the Command window, where you enter your commands. The largest, dark window is the Results window. The commands you enter in the Command window and Stata’s responses to them appear in the Results window. The window in the top left corner, the Review window, is where Stata lists the commands you have submitted. If you want to reuse a command, you can click on it in the Review window to put it into the Command window. The Variables window, shown below the Review window in the picture, lists the variables in the data set in memory. You may click on variable names in the Variables window to insert them into your commands in the Command window. You can resize the windows by dragging the borders. If you widen the Variables window, you can also see the variable labels (if labels exist in your data).
You can also submit commands using the menus on the top bar. In this guide, I will show how to do the same thing using either commands or menus, where applicable. In the text, CMD marks what you enter in the Command window, and MNU marks what you click in the menu bar.
* The position of the windows may not be exactly the same as in this picture.
1.1. Getting help in Stata
If you are not sure what the command is, you can search for the command in Stata using the search function.
CMD: Search commands by typing –search- and a keyword in the Command window. Stata will display a list of commands and other resources associated with that keyword, if there are any. For example, type in the Command window,
. search regression
MNU: Alternatively, you can click Help => Search from the menu on the top, and enter keywords in the dialog box. Click on the name of one of the commands or resources to display the help screen.
If you know what the command is and want to know the details of the command, use Stata’s help function.
CMD: Type –help- and a command name.
. help regress
MNU: You may click Help => Stata command, then enter the command name in the dialog box. Stata will then display a help window.
1.2. Interpreting Stata’s Help Pages
The help page shows the command you just typed, -help regress- in this example, at the top. Next to it are blue letters next to “dialogs:” and “also see:”. Any blue letters in help windows are clickable. The commands next to “dialogs:” open the dialog box associated with the command, and the commands next to “also see:” open help pages that are related to the command.
Title names the command the help page is about, and shows a letter in brackets. In addition to the on-line help, Stata has printed manuals. The letter in brackets indicates the volume of the printed manual (here R) that contains the information on regress.
Syntax is Stata’s command language structure. The underlined letters are the minimum abbreviation Stata recognizes as the command. So, if you type reg in the Command window, Stata understands it as regress. After the word regress, Stata expects depvar, a dependent variable. The depvar is required. indepvars, if, in, weight, and , options are all in brackets. The bracketed words are optional, and the command works without them. What can go in , options is explained underneath.
Description describes what the command is for. More details of the options follow. If you scroll further down, there are examples. Examples are helpful in seeing how to type the commands. Often, examples contain a series of commands you can try out with Stata’s example data sets, which come with the installation.
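As a rough illustration of how the syntax diagram for regress translates into typed commands, here is a minimal sketch using the auto example data that ships with Stata. The particular variables (price, mpg, weight, foreign) and the level(90) option are choices made only for this illustration; they are not taken from the help page discussion above.

```stata
* Load the example data set that is installed with Stata.
sysuse auto, clear

* Full command name: depvar (price) followed by two indepvars.
regress price mpg weight

* Minimum abbreviation: Stata understands reg as regress.
reg price mpg weight

* Optional pieces shown in brackets in the syntax diagram:
* an if qualifier, an in range, and an option after the comma.
reg price mpg weight if foreign == 0
reg price mpg weight in 1/50
reg price mpg weight, level(90)

* Only depvar is required, so this also runs (a constant-only model).
reg price
```

Each line can also be typed one at a time in the Command window; comparing the output of the first two commands shows that the abbreviation changes nothing.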
1.3. About Data for Stata
To put a data set into Stata’s memory, the data set has to be in a format Stata understands. The following is a list of the
extensions of files Stata can read directly.
Data Format                             File Extension    Command to read the data
Stata                                   .dta              . use
Text (ASCII), free or fixed columns     .raw, .txt        . infile using
Text (ASCII), comma separated values    .csv              . insheet using
Text (ASCII), fixed columns             .dat              . infix using
SAS export                              .xport, .xpt      . fdause
MS Access                               .mdb              . odbc
Many data download sites provide you with data already formatted for a common statistical program such as Stata, SPSS, or SAS. Formatted data often contain variable labels and value labels that make it easier for you to understand the contents of the data.
If Stata data are not available, and you can choose between SPSS and SAS data formats, then I would recommend selecting SPSS. You can use SPSS to open SPSS data, then save the data as Stata data. SPSS versions 12 and up can save the data as Stata 8 data [1]. A Windows version of SPSS is available in the McCosh 59 cluster or the DSS computer lab. If data are only available in SAS format, you may use SAS to open the SAS data, then create a SAS export file, as Stata can read a SAS export file. A Windows version of SAS is available in the DSS computer lab. Unix versions of SPSS and SAS are available on tombstone.
Also, if you acquire data that are in a format other than Stata, you may use DBMS/Copy to convert them into Stata format. A Windows version of DBMS/Copy is available in the DSS computer lab. A Unix version of DBMS/Copy is available on tombstone. If you have SAS data, we recommend converting them into a SAS transport file in SAS instead of using DBMS/Copy. DBMS/Copy has a known issue in converting value labels from SAS to Stata.
If formatted data are not available, data distributors may provide set up files in Stata, SPSS, or SAS along with ASCII data. An ASCII data set is a text file with rows (or columns) of numbers. If a set up file is available in Stata, you can attach the variable information using Stata. If a set up file is available in SPSS, it will be easier to use SPSS to attach the definition, then save the data as Stata data. If a set up file is available in SAS, you may use SAS to attach the file definition, then create a SAS export file in SAS. You may also modify the set up files in a text editor to use them in Stata; note that the commands to define data are different in all three programs. If no set up files are available and only PDF codebooks are available, you will need to select the variables you want to use and create your own set up file for Stata.
If you need help in defining or converting data, please come by the Data and Statistical Services computer lab at A-16-H-3 in Firestone Library during walk-in hours or email data@princeton.edu. The hours and directions are available at http://dss.princeton.edu. If you are emailing questions, please use your Princeton email. Our resources and assistance are available to Princeton University community members.
The Stata data format changed from version 9 to version 10. Stata 10 can read data saved for Stata 9, but Stata 9 can not read data saved for Stata 10, even though both have the same extension, .dta. If you plan to use Stata 9 after using Stata 10, you may save the data as Stata 9 data in Stata 10. The following allow you to save data as Stata 9 data in Stata 10.
CMD: . saveold filename
MNU: File => Save As. Then select “Stata 9 Data” from the drop down list for the box “Save As Type:”
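A minimal sketch of the –saveold- step, using the auto example data so that it can be run as is. The file name auto_old is made up for this illustration. In Stata 10, –saveold- writes the older Stata 9 format described above; the exact older format written by other Stata releases may differ.

```stata
* Load an example data set and save a copy in the older format.
sysuse auto, clear
saveold auto_old, replace

* The older-format file opens like any other .dta file.
use auto_old, clear
describe, short
```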
[1] Stata 8 and Stata 9 data are interchangeable.
2. Read in data
2.1. Reading in an ASCII data file using a Stata set up file.
Often, you may obtain a command file and a dictionary file as a set of Stata set up files along with a data file. I suggest that you save all three files in the same directory. The command file has the extension .do, the dictionary file .dct, and the data file .txt (or .dat). The command files in Stata are also called do files. Sometimes the do file contains the dictionary, and you have two files, a do file and a data file. The procedure is similar to having three files.
As an example, I downloaded a Stata set up file and data file for the National Health Interview Survey from the Inter-university Consortium for Political and Social Research (ICPSR) web site, http://www.icpsr.umich.edu. The files usually are zipped when you download them. I extracted the zipped files using WinZip, and put them in the C:\StataHandsOn\SampleData directory. WinZip is available on DSS lab computers. OIT computers do not have WinZip, but the extraction software that comes with Windows can unzip files.
Then I opened the Stata command file using NotePad (any text editor will do, but not a word processor like MS Word). Instructions are given at the beginning of the command file, sandwiched between lines of asterisks as in the picture below. Text placed between a forward slash and asterisk and an asterisk and forward slash (/* text */) is a comment. Follow the instructions and specify the name and the path of the data, dictionary, and output data files in the do file. Here is a copy of the beginning of the do file for the National Health Interview Survey data.
/**************************************************************************
| STATA SETUP FILE FOR ICPSR 04349
| NATIONAL HEALTH INTERVIEW SURVEY, 2004
| (DATASET 0004: SAMPLE ADULT)
|
| Please edit this file as instructed below.
| To execute, start Stata, change to the directory containing:
|   - this do file
|   - the ASCII data file
|   - the dictionary file
|
| Then execute the do file (e.g., do 04349-0004-statasetup.do)
**************************************************************************/
set mem 40m /* Allocating 40 megabyte(s) of RAM for Stata SE to read the
data file into memory. */
set more off /* This prevents the Stata output viewer from pausing the
process*/
/****************************************************
Section 1: File Specifications
This section assigns local macros to the necessary files.
Please edit:
"data-filename" ==> The name of data file downloaded from ICPSR
"dictionary-filename" ==> The name of the dictionary file downloaded.
"stata-datafile" ==> The name you wish to call your Stata data file.
Note: We assume that the raw data, dictionary, and setup (this do file) all
reside in the same directory (or folder). If that is not the case
you will need to include paths as well as filenames in the macros.
********************************************************/
/* Replace the file path and file name in each of the three lines below. */
local raw_data "C:\StataHandsOn\SampleData\04349-0004-Data.txt"
local dict "C:\StataHandsOn\SampleData\04349-0004-Stata_dictionary.dct"
local outfile "C:\StataHandsOn\SampleData\health.dta"
/********************************************************
Section 2: Infile Command
This section reads the raw data into Stata format. If Section 1 was defined
properly, there should be no reason to modify this section. These macros
should inflate automatically.
**********************************************************/
infile using `dict', using (`raw_data') clear
Once you have the file paths and names inserted into the do file, execute the do file (in this example named 04349-0004-Setup.do) in Stata by giving a command:
. do 04349-0004-Setup
In this case, you do not need to modify the dictionary file. In some cases, you may need to specify the data file path and name in the dictionary file.
I specified in the do file that the output Stata data be named health.dta (see the third line, which starts with “local outfile”), and you see the file listed in the directory in the picture on the previous page.
You may obtain a data definition file for SAS or SPSS instead. The idea of attaching the data definition in SAS or in SPSS is the same as in Stata, except that their data definition is in only one file, and it needs to be executed in the respective program. Please refer to the separate handouts for details on running data definition files using SAS or SPSS.
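The set up file above relies on local macros: a name is given to a piece of text with –local-, and Stata substitutes the text wherever the name appears between a left quote (`) and an apostrophe ('). The short sketch below only displays the macros rather than reading any data; the file names are shortened copies of the ones in the example, without paths. Run it as a do file, because local macros exist only while the do file that defines them is running.

```stata
* Assign file names to local macros (paths left out for illustration).
local raw_data "04349-0004-Data.txt"
local outfile  "health.dta"

* Stata replaces `raw_data' and `outfile' with their contents
* before each command is carried out.
display "raw data file: `raw_data'"
display "output file:   `outfile'"
```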
2.2. Creating a Stata set up file.
When you have an ASCII data file but not a set up file, you will need to create one to define variables. An ASCII data file
contains many rows of numbers and Stata will not know which numbers belong to which variables. You also need to define the
type of variables, whether they are numeric (numbers) or string (texts or characters). ASCII data may be in free format, comma
separated, or fixed columns. Example files used in this exercise are 2008 Democratic and Republican Presidential
Primaries/Iraq (United States), downloaded from the Roper Center, http://www.ropercenter.uconn.edu.
Here is a portion of the fixed column ASCII data file, called lat544.dat:
    .25      1    2391 1 1 1 4 2 1 2 3 3 2 & 338 & & & & & & & 131 5 6 7 7 3 1 4 4 4 4 4 4 & & & & 0 & & & 2 0 2 2 1 2 2 0 1 2 1 1 2 2 2 1 4 2 & 5 1 2 3 2 65 & 6 112 5 1 2
   1.66      2    9041 1 1 1 4 1 2 1 2 3 2 & 8 13& & & & & & & 5 2 1 3 3 6 3 1 4 & & & & & 4 4 4 4 4 3 5 & 1 2 1 2 4 2 2 0 1 2 1 4 1 1 1 1 5 2 & 1 3 4 1 1 55 & 6 6 1 5 1 2
    .47      3    2122 4 4 4 2 3 2 4 2 1 1 & 1 2 112 5 3 6 6 6 & & & & & & & & 1 & & & & & 1 1 1 1 0 & & & 2 2 2 1 1 1 1 4 2 1 2 1 2 1 1 1 1 1 & 1 7 & & & 65 & 7 8 1 2 1 1
    .41      4    4122 4 4 4 1 2 4 1 1 1 1 & 1 8 7 2 3 1 2 5 6 & & & & & & & & 1 & & & & & 1 1 1 1 0 & & & 3 3 3 3 1 1 1 4 0 1 2 0 2 1 2 1 2 1 & 1 2 3 1 1 & 5 5 5 2 2 1 1
A portion of the corresponding codebook says ...
Data Locations
Variable    Rec   Start   End   Format
WTVAR         1       1     7   F7.2
CASE          1       8    14   F7.0
AREACODE      1      15    21   F7.0
TRACK         1      22    23   A2
GWBUSHJO      1      24    25   A2
GWBECON       1      26    27   A2
GWBIRAQ       1      28    29   A2
This means that the variable WTVAR is in the first record, starts in column 1, and ends in column 7. The data format is F7.2, meaning that it is a numeric variable with width 7 that includes two decimal places. In the data above, .25, 1.66, .47, and .41 correspond to this variable. A format such as A2 marks a string (alphanumeric) variable of width 2.
To define the variables in Stata, you can create a “dictionary” file that contains the variable information, as shown below. You can use any text editor, but here we use Stata’s “do-file editor.” Open a do-file editor:
CMD: . doedit
MNU: Window => Do File Editor => New Do File.
There, type
infix dictionary using H:\lat544.dat {
WTVAR 1-7
CASE 8-14
AREACODE 15-21
str TRACK 22-23
str GWBUSHJO 24-25
str GWBECON 26-27
str GWBIRAQ 28-29
}
and save the file as a dictionary (.dct) file in the same directory as the data file. For example, I saved the file as H:\lat544.dct, as I have the data file, lat544.dat, at the root of the H drive. The str in front of a variable name indicates that it is a string variable.
A carriage return is Stata’s default signal to end a command, so it is important to type the file exactly as it appears here: the first line ends after the opening squiggly brace ({), each variable name and its column locations are on one line, and the closing squiggly brace (}) is on its own line.
I have omitted the record number, as there is only one record per case in this data file. If your data file has more than one record per case, you need to define which record you are referring to for each of the variables. Please see help infix for the syntax for multiple records.
Once you save the dictionary file,
CMD: . infix using H:\lat544.dct
MNU: File => Import => ASCII data in fixed format. Then Browse to find the dictionary file name and path.
Stata will show the following in the Results window.
. infix dictionary using H:\lat544.dat {
WTVAR 1-7
CASE 8-14
AREACODE 15-21
str TRACK 22-23
str GWBUSHJO 24-25
str GWBECON 26-27
str GWBIRAQ 28-29
}
(1373 observations read)
Check with the codebook and see if the total number of records is 1373.
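To try the dictionary approach without the Roper Center file, the following sketch writes a tiny two-record fixed-column file and a matching dictionary, then reads them with –infix-. The file names toy.dat and toy.dct are hypothetical, and the two records contain only the first few fields patterned on the column layout described in the codebook above; they are not the real data.

```stata
* Write a small fixed-column data file into the working directory.
tempname fh
file open `fh' using "toy.dat", write text replace
file write `fh' "    .25      1    2391 1 1 1" _n
file write `fh' "   1.66      2    9041 1 1 1" _n
file close `fh'

* Write the dictionary: one variable and its columns per line.
file open `fh' using "toy.dct", write text replace
file write `fh' "infix dictionary using toy.dat {" _n
file write `fh' "    WTVAR 1-7" _n
file write `fh' "    CASE 8-14" _n
file write `fh' "    AREACODE 15-21" _n
file write `fh' "    str TRACK 22-23" _n
file write `fh' "}" _n
file close `fh'

* Read the data through the dictionary and inspect the result.
infix using toy.dct, clear
describe
list
```

The output of –describe- should show TRACK with a storage type that starts with str, and the other three variables as numeric.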
2.3. Reading in an Excel file
If the data file is “clean,” all you need to do is save the file as a .csv file in Excel and import it into Stata. However, if the data file is “not clean,” editing it first will make it easier to import into Stata. Here is an example of a “not clean” Excel file. The callouts in the picture mark the following problems:
•
Header lines. Stata reads the values in the first line as the variable names. Header lines prevent the program from reading the variable names. Also, the program expects data from the second line, so in this example, Stata will convert all the variables to string.
•
Variable names that include special characters, start with a number, or have spaces between words. Variable names in Stata can not contain special symbols or start with a number, and should not start with an underscore (_), because Stata uses such names for its own variables (for example, _merge).
•
A blank line and a blank column.
•
Values that each include a special character and a comma.
The following is an example of a “clean” Excel sheet. It has the following characteristics:
•
The first line has Stata variable names: 32 characters or less, no special characters, and not starting with an underscore or number. Data begin from the second line.
•
No blank rows or columns. (Blank cells are ok. Stata automatically adds a period (.) if numeric. Do not manually add . in blank cells.)
•
Missing numeric data should be an empty cell or values defined as missing, such as 0, 9, or 99. A space (a stored space, which is different from an empty cell), a dot, or any other non-numeric character such as n/a will cause the variable to become string.
•
Commas in numbers or text are particularly problematic because Stata may see them as a delimiter and will not read the data properly. You should remove the commas from numeric values before saving the file.
Once you have examined the file and made sure that it is clean, follow these step-by-step instructions for saving a worksheet as a comma separated values file in Excel. As practice, let’s read in a sample Excel data file.
1. Open Internet Explorer, download auto.xls from https://webshare.princeton.edu/users/furuichi/auto.xls, and select Save to Disk.
2. Save the file in your H:\ directory.
3. Start Excel and read the file by selecting File => Open.
4. Under the File menu, select Save As, then Save as type 'CSV' (comma separated values).
5. Open Stata.
6. Change the directory in Stata.
Note: Renaming the file with a .csv extension in Windows Explorer is not the same as saving the file as a .csv file.
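Once the .csv file is in Stata’s working directory, it can be read with –insheet-, the command listed for comma separated values in section 1.3. The sketch below is self-contained and therefore does not use the auto.xls download: as a stand-in for the Excel “Save As CSV” step, it writes a small CSV file from Stata’s own auto data and then reads it back. The file name auto_sample.csv and the choice of variables are made up for this illustration.

```stata
* Stand-in for the Excel step: write a small comma separated file.
sysuse auto, clear
keep make mpg length
outsheet using auto_sample.csv, comma replace

* Read the CSV file; the first line supplies the variable names.
insheet using auto_sample.csv, comma clear

* Check the storage types: make should be str, the others numeric.
describe
```

With a file saved from Excel, only the –insheet- and –describe- lines are needed, with your own file name in place of auto_sample.csv.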
If the spreadsheet is small, you may copy the data and paste them into
Stata’s data editor. Highlight all data in Excel, and select Edit =>
Copy. Open Stata, then select Data Editor. Right click and select
Paste, or press Control and v keys at the same time to paste the Excel
data contents into Stata’s Data editor.
Stata may mistakenly read numeric variables as strings. Check that
original numeric values are numeric in converted Stata data by issuing
the command –describe- in Stata and examining the storage type. If the
variable has storage type that starts with "str," then Stata has made it a string variable.
If you see that a numeric variable in the original data file has
string storage type in Stata, go back to Excel, and change the
variable’s format into numeric, and re-save the file as .csv file.
Here is how:
1. Highlight the column with the numeric variable name.
2. Click Format => Cells.
3. In the Format Cells window, select the Number tab.
4. Under the Category drop down list, select Number.
5. Click OK.
6. Under the File menu, select Save As, then Save as type 'CSV' (comma separated values).
You can also change the variable type from string to numeric
in Stata. At the command prompt, type
. destring stringvariablename, replace
For this command to work, the stringvariablename can not have any non-numeric characters as its value. If it fails, check the
values of the variable to find non-numeric characters.
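A small sketch of what happens when a string variable holds non-numeric characters, and one way to find and handle them. The variable name amount and its three values are invented for the illustration. The ignore() and force options of –destring- are used here as one possible remedy; whether dropping characters or forcing values to missing is appropriate depends on your data.

```stata
* Build a three-observation string variable by hand.
clear
input str8 amount
"1200"
"3,450"
"n/a"
end

* Plain destring reports nonnumeric characters and changes nothing.
destring amount, replace

* List the values that can not be read as numbers.
list amount if missing(real(amount))

* ignore() removes the comma; force turns "n/a" into missing (.).
destring amount, replace ignore(",") force
list
```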
2.4. Reading in Stata data
Now, let’s start using Stata. From an OIT computer, a link to Stata may be found in Start => All Programs => Stata10 =>
StataSE10. A shortcut to Stata may be available from the Special Applications folder on the desktop. Double click the Stata
icon.
Each step below gives the command to type in the Command window (CMD), the equivalent menu selection (MNU), and notes and tips where there are any.

Stata starts in its default working folder, typically C:\Program Files\Stata\Stata10. Let’s change the directory to H:\StataHandsOn. First move to H:\, then create a StataHandsOn directory and move into it.
CMD: . cd H:\
     . mkdir StataHandsOn
     . cd StataHandsOn
MNU: File => Change Working Directory... (navigate to H:), then select Make New Folder.
Note: If you don't change the directory, Stata will assume that the file name you type is in the default directory. The log and data files you save will also be in this directory unless you change it.

Before reading in a data file, let’s open a log file. A log file stores the output that appears in the Results window.
CMD: . log using auto1.log
MNU: File => Log => Begin...
Note: If you do not open a log file at the beginning of the session, the output will only be available in temporary memory. Once you exit from the program, the output will be lost. A log file with the extension ".log" is a plain text file. This means you can open and read it in almost any text editor or word processor. If you issued the –log- command without the file extension, as in
. log using stata1
Stata would create "stata1.smcl." SMCL is a log file type specific to Stata. Notice that “log on (text)” appears in the rectangular space between the Results window and the Command window once you begin a log.

Now let’s read in the 1978 Auto data. It is a data file that comes with the Stata installation, and is available in Stata format.
CMD: . sysuse auto
MNU: File => Example datasets => Example datasets installed with Stata => (auto.dta) use

Suppose you want to add a label to the data, so that you can remember what the dataset is about. This is convenient if you make many subsets of data files from the original file. As an exercise, let’s label the data to say that it is the 1978 auto data for hands on training.
CMD: . label data "1978 auto data for hands on training"
MNU: Data => Labels => Label dataset

We will work more on this data, but let’s save the data at this time. We’ll give it a name, testauto.
CMD: . save testauto
MNU: File => Save as
Note: If you issue the –save, replace- command without specifying a file name, what is currently in memory will overwrite the original input file. To avoid losing the original data file by mistake, always remember to make a master copy before starting to work on data.

Let’s close the log file at this time and look at the file.
CMD: . log close
MNU: File => Log => Close
Note: Notice that “log on (text)” disappeared from the rectangular space after closing the log. Use a text editor or MS Word to open the log file.

You can also view a log file in Stata. Remember to include the extension with the file name when typing the –view- command.
CMD: . view auto1.log
MNU: File => Log => View

Let’s clear the data in memory at this time and exit from Stata.
CMD: . clear
     . exit
MNU: File => Exit
Note: The Results window can not be cleared while in session.

Restart Stata, and check whether we are in the directory we first specified, then list the files.
CMD: . pwd
     . ls
MNU: File => Change working directory...
Note: pwd stands for Present Working Directory. You can also see the directory Stata is pointing at by looking at the bottom bar of Stata’s window. If you are not at H:\StataHandsOn, change the directory by typing in the Command window:
. cd H:\StataHandsOn

It should show that we are in the H:\StataHandsOn directory. Can you find testauto.dta and auto1.log? What is the size of the data file? It is a small data file, of only 5.4 kilobytes. As you may have seen on the very first screen, Stata’s memory may be initially set to 10 megabytes. Because this data file is smaller than what Stata allows in memory at this time, you will not have problems reading in the data.

Let’s reopen the log file to continue saving the output in the same log file.
CMD: . log using auto1.log, append
MNU: File => Log => Begin, then select auto1.log, and Append to existing file
Note: You can, of course, start a new log file instead of appending the new results to the existing log file. To start a new log file, give a new file name as in:
. log using stata2.log

To read the testauto data:
CMD: . use testauto
MNU: File => Open
Note: You could clear the data in memory and read in a new data file in one step, by issuing a command:
. use testauto, clear
Only one data file can be read into Stata’s memory at a time. You need to clear the memory before reading in another set of data. (You can, however, open many instances of Stata on one computer.)

If the data file is larger than Stata’s current memory, Stata will issue an error message. Check the file size and set memory to give Stata more space. For example, if the data file is 36 megabytes, type
CMD: . set memory 40m
MNU: (no equivalent menu)
It gives Stata 40 megabytes of data memory to read the data.
Note: To see the maximum limits in Stata, type in the Command window:
. help limits
Review Questions:
1. How can I start Stata?
2. Which directory is this program pointing at?
3. How can I change the directory to H:\?
4. How large is the Auto data?
5. How do I read the data into Stata?
6. How do I label the data?
7. How do I save the data?
8. I don’t know what commands to use. How do I get help in Stata?
9. How do I record the output?
Hints:
. pwd
. cd C:\mydata
. dir (or . ls)
. set memory 20m
. use filename
. label data "descriptions"
. save filename, replace
. help commandname
. search keyword
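The steps in this section can also be collected into one short do file. The sketch below follows the same sequence (open a log, read the auto data, label it, save it, close the log, read it back) but stays in whatever the current working directory is instead of changing to H:\, and the log file name session_sketch.log is made up for this illustration. The replace options are added so the sketch can be run more than once.

```stata
* Close any log that is still open, then start a plain text log.
capture log close
log using session_sketch.log, text replace

* Read the example data, label it, and save it under a new name.
sysuse auto, clear
label data "1978 auto data for hands on training"
save testauto, replace

* Close the log; everything above is now recorded in the log file.
log close

* Clear memory and read the saved file back in one step.
use testauto, clear
describe, short
```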
3. Data management
Goal: combine, transpose, and reshape data files. Search, order, and rename variables.
You may have multiple files you want to combine. Some files are so large that they are split into multiple files for downloading, each having the same variables but different cases. If you want to combine datasets that have the same variables but different cases, the command you use is –append-. Some longitudinal studies follow the same individuals across time and ask the same or different questions. If you want to combine datasets that have the same cases or people but different variables, the command you use is –merge-.
3.1. Append
Let’s create sample data first, then we will try appending.
. sysuse auto, clear
. keep make price mpg
. keep in 1/5
. save temp
. keep in 4/5
. save data1
data1 now contains the fourth and fifth observations. Next, create the second dataset from the first three observations:
. use temp, clear
. keep in 1/3
Let’s combine the two datasets. What I want to do is to stack data1 under the second dataset. While the second dataset is still in memory,
. append using data1
The resulting dataset contains all five observations.
3.2. Merge
This time, we will try merging. We will do a match merge, meaning we want to merge two files matched by the key variable, in this case MAKE. Because we are adding a variable, we want to make sure that the variable values are assigned to the right observations. For a match merge, it’s very important that BOTH files are sorted by the key variable.
. use temp, clear
. keep make price
. sort make /* IMPORTANT! */
. save data2
data2 now contains MAKE and PRICE for the five cars.
. use temp, clear
. keep make mpg
. sort make /* IMPORTANT! */
. merge make using data2
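The two ways of combining files can be compared in one self-contained sketch. The file names part2 and prices are hypothetical, and the replace option is added so that the sketch can be rerun. The merge line uses the Stata 10 form with the key variable named before using.

```stata
* Sketch only: part2 and prices are hypothetical file names.
* Append: same variables, different cases.
sysuse auto, clear
keep make price mpg
keep in 4/5
save part2, replace
sysuse auto, clear
keep make price mpg
keep in 1/3
append using part2
* The stacked data hold observations 1-3 followed by 4-5.
list

* Match merge: same cases, different variables.
sysuse auto, clear
keep in 1/5
keep make price
sort make
save prices, replace
sysuse auto, clear
keep in 1/5
keep make mpg
sort make
* Stata 10 syntax; Stata 11 and later prefer: merge 1:1 make using prices
merge make using prices
tabulate _merge
```

Tabulating _merge after every merge is a quick way to confirm that the records matched as expected.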
Notice there is a new variable called _merge. Stata automatically creates this variable whenever files are merged. _merge can have the values 1, 2, or 3. 1 means that the record is in the data that were in memory before merging, but not in data2. 2 means that the record is in data2 but not in the data that were in memory at the time of the merge. 3 means that the record was in both files. In this example, all records matched up in the two files.
3.3. Transpose
Transposing is switching observations and variables. In Stata, string values cannot be transposed, except for variable names. If you have string values, you will need to encode them and make them numeric before transposing. The command to transpose is –xpose-.
This time, we will use another of Stata’s example files. To use this file, your computer needs to be connected to the internet.
. xpose, clear varname
varname is an option, but clear is required in the xpose command. With the option varname, the transposed dataset will contain the original variable names as a variable.
3.4. Reshape
Reshaping a dataset is useful when you have, for example, time series data in which the same question asked across time is stored as separate variables.
Here again, we will use one of Stata’s example files. Your computer needs to be connected to the internet to use this file.
. webuse reshape1
. drop ue*
I dropped the ue variables to save space. If you keep them, you need to include ue in the reshape command after inc.
In this file, you see three persons’ incomes for 1980, 1981, and 1982. Suppose you want to have a variable called year, and have the income values listed for each year for each person. Then what you want to do is to reshape the data into a long format.
. reshape long inc, i(id) j(year)
Here, the command is telling Stata that inc is the stem of the variables whose values need to be repeated for a person for different years, by id. The variable year does not exist in the pre-reshaped data; it is the name of the new variable that will hold the year suffixes attached to inc. It may be easier to understand this by comparing the post-reshaped data with the pre-reshaped data.
Now you see that there is a new variable called year, and that id and sex are repeated for the different years.
To go from the long format back to the wide format (from the post-reshaped file to the pre-reshaped file in the above example), the command is –reshape wide-.
. reshape wide inc, i(id) j(year)
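The round trip between the wide and long formats can be sketched as follows. It uses the same example file as the text, so it needs an internet connection; the list commands are only there so that the two layouts can be compared.

```stata
* Requires an internet connection, because webuse downloads the example file.
webuse reshape1, clear
* Keep only the income variables, as in the text.
drop ue*
list
* Wide to long: one row per person per year.
reshape long inc, i(id) j(year)
list
* Long back to wide: one row per person again.
reshape wide inc, i(id) j(year)
list
```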
3.5. Organizing variables
Typing commands in the Command window
If you have a large data file with many variables, it may be difficult
to find variables by using –describe- or –codebook-. Here are some
commands that may be helpful if you have a large data file. The
command –lookfor- searches for variables whose name or
label contains the keyword. Let’s use another example dataset to
try the commands.
. sysuse nlsw88
Let’s see if there are variables that contain the word married, age,
and education.
. lookfor married
. lookfor age
. lookfor education
You can order the variables alphabetically by the variable names.
. aorder
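If you want to move variables to the front of the dataset, use –order-.
The command -order- var1 var2 moves var1 and var2 to the beginning of
the variable list, with var1 before var2. For example,
. order smsa age
makes smsa the first variable and age the second; the remaining variables keep their order.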
If you want to rename variables, the command is –rename-.
. rename idcode id
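The commands in this section can be tried together in one short sketch on the same example data. The describe, simple and confirm variable lines are additions for checking the result; they are not part of the steps in the text.

```stata
sysuse nlsw88, clear
* Search variable names and labels for a keyword.
lookfor married
lookfor education
* Sort the variables alphabetically, then move two of them to the front.
aorder
order smsa age
describe, simple
* Rename a variable and confirm that the new name exists.
rename idcode id
confirm variable id
```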
using Menus
File=> Example datasets...=> Example datasets installed with Stata, click use next to nlsw88.dta
(no equivalent in menu)
Data=> Variable utilities=> Alphabetize variables
Data=> Variable utilities=> Change order of variables in dataset
Data=> Variable utilities=> Rename variable
Notes and Tips
If Stata returns blank, it means there is no variable whose name or label contains the word. In this example, it does not mean there are no education variables. Because it is a small dataset, you can see that there are variables called grade and collgrad. –lookfor- did not find those variables because the keyword education was not a part of the variable name or label. So, -lookfor- helps you find a variable if you know what to look for. You would still need to read the codebook to know the appropriate keywords you can use to search for variables.
The command . rename idcode id renames the variable idcode to the new name, id.
If you have many variables with the same stem, such as education1, education2, ..., you can rename the stem education to edu by using the command renpfix.
. renpfix education edu
This command will rename education1, education2, ... to edu1, edu2, ....
Review Questions:
1. How do you search for a variable in a dataset?
2. How do you order variables in an alphabetical order?
3. How can I move a variable next to another one?
4. How can I rename variables?
Hints:
. lookfor
. aorder
. order
. rename
4. Explore data
Goal: find out what information is in the data – how many variables are in the data, what variables are in
the data, and what they mean.
Typing commands in the Command window
Now let’s see what this data file contains.
. describe
The -describe- command shows you the path, label, date, and the
size of the data file, the number of observations and variables, and
the name, type, format, and label of the variables in the dataset.
You will also notice that it says “_dta has notes.” Let’s see what the
notes say.
. notes
You can also add your own notes to the data.
. note: I used this data set in a hands-on training course during the fall of 2008.
See it by typing –notes-.
We also see that the variable FOREIGN has a label called origin.
You can see the details about the label by typing:
. labelbook origin
Suppose we want to know what the REP78 is about. The
–codebook- command gives you detail of the variable. Type:
. codebook rep78
If you want to get a quick summary of a numeric variable, -inspect- reports the number of negative, zero, and positive values; the
number of integers and non-integers; the number of unique values;
and the number of missing values; and it produces a small histogram. Try:
. inspect mpg
The –list- command lists values of the variables in your
dataset in the Results window. Similarly, -browse- opens the data
browser. You can have the browser open only for the variables you
want to see. For example,
. list make
. browse make
In using –list-, you may see –more- on the bottom of the screen.
To scroll down the screens, hit space bar or click –more-.
If you only want to see first five observations of the variable
MAKE, type
. list make in 1/5
. browse make in 1/5
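The exploration commands above can be run in sequence on the installed auto data. This is a sketch rather than a required order; the text of the added note is a made-up example.

```stata
sysuse auto, clear
* Overview of the file: observations, variables, types, and labels.
describe
* Notes stored with the data, plus one of our own.
notes
note: example note added while exploring the data
notes
* Details for a value label and for single variables.
labelbook origin
codebook rep78
inspect mpg
* First five values of one variable.
list make in 1/5
```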
using Menus
Data=> Describe data=> Describe data in memory
Data=> Notes=> List notes
Data=> Notes=> Add notes
Data=> Notes=> List notes
Data=> Labels=> Label values=> Produce codebook of value labels
Data=> Describe data=> Describe data contents (codebook)
Data=> Describe data=> Inspect variables
Data=> Describe data=> List data
Data=> Data Browser
click by/if/in tab in list dialog box, and select Use a range of observations
Notes and Tips
There are two types of variables: numeric and string. Numeric variables are numbers. String variables contain text, which can include any characters on the keyboard: letters, numbers, and special characters.
The storage type refers to the size used in storing the variables. Numeric variables’ storage types include byte, int, long, float, and double. String variables have storage types that begin with "str", followed by a number indicating the maximum length of the string: e.g., str18.
We can do numeric calculations and statistical analysis on numeric variables, but not on string variables. A variable that looks like a number, for example, “20025”, could be either a string (a set of five characters that happen to be numbers, like a zip code) or a numeric value (the integer that’s after 20024). It's important to check the variable types to know how you can analyze those variables.
You can also click on Data Browser to see the data file. While you have the browser open, you cannot enter commands. Closing the browser does not delete the data file.
Typing commands in the Command window
The –list- command is particularly helpful to use after sorting data,
or combining with if. For example, you can obtain five minimum
values of MPG by listing the first five records after sorting.
. sort mpg
. list mpg in 1/5
Suppose you want to see the make of the cars whose price is less
than $5000. Try:
. list make price if price<5000
The –if- qualifier
The –if- qualifier is used to isolate a set of observations with variables meeting some particular criteria. Values on variables in a dataset are compared to values on other variables or to numbers or strings using logical comparison operators.
Operator     Meaning
==           equal to
>            greater than
>=           greater than or equal to
<            less than
<=           less than or equal to
!= or ~=     not equal to
You can put spaces around these operators (e.g., either a >= b or a>=b), but you cannot put spaces within them (e.g., it must be ‘>=’, not ‘> =’).
using Menus
Data=> Sort=> Ascending sort
Data=> Describe data=> List data, click by/if/in tab in list dialog box, and select Create, type in price<5000
Notes and Tips
To list the last five records (maximum values):
. list mpg in -5/-1
To sort in reverse order, use –gsort-.
. gsort -mpg sorts MPG in reverse.
. gsort +mpg is the same as . sort mpg.
Pay special attention to that double equals sign! If you are evaluating for equality, use a double equals sign (==). A single equals sign (=) is used for assignments, to set something equal to something else.
For example, if you want to list all information in the dataset about cars whose MAKE is “Subaru”, you would type:
. list if make=="Subaru"
String values need to be put in quotes, and the comparison is case-sensitive.
Whereas if you want to create a new variable called POWERSTEER for cars whose make is Subaru, you would type:
. generate powersteer=1 if make=="Subaru"
Refer to the section on “Transform variables and records” for more information on creating new variables.
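A short sketch of the –if- qualifier with a numeric and a string comparison, using the installed auto data. It assumes, as in that file, that the make is stored as Subaru with a capital S; the count commands are additions that show how many observations each condition selects.

```stata
sysuse auto, clear
* Numeric comparison.
list make price if price < 5000
* String comparison: the value goes in straight double quotes and must
* match the stored capitalization exactly.
count if make == "Subaru"
count if make == "subaru"
* == tests for equality; = assigns a value.
generate powersteer = 1 if make == "Subaru"
list make powersteer if powersteer == 1
```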
Combining tests: -and- and –or-
-if- can be combined with and (&) to evaluate more than one condition. Let's say you want to find out the MAKE of the cars whose MPG is greater than 30 and PRICE is less than $5000.
. list make if mpg>30 & price<5000
-if- can also be combined with or (|) to look at cases where at least one of two or more conditions is met. For example:
. list make if mpg>30 | price<5000
Note that the –if- qualifier is included only once.
. list make if mpg>=30 & mpg<=40 (OK)
. list make if mpg>=30 & if mpg<=40 (won’t work)
. list make if 30<= mpg <=40 (won’t work as intended: Stata evaluates 30<=mpg first, which is 0 or 1, and then compares that result with 40, so every car is listed)
“|” can be obtained by pressing Shift and \. The \ key is between the Backspace and Enter keys on most keyboards.
It is possible to combine the & and | operators. If you have both in one command, & takes precedence over |. Use parentheses to help you organize them and avoid errors, as combining & and | can make the conditions complicated.
. list make if (30<= mpg | 2000<price ) & rep78<4
returns different results from:
. list make if 30<=mpg | 2000<price & rep78<4
Among missing values, after a period, the values increase by a combination of a period and a letter. So, .a is larger than ., .b is larger than .a, and .z is the largest missing value.
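The effect of precedence and of a chained comparison can be checked with –count-, which reports how many observations satisfy a condition. This is a sketch on the installed auto data; the comments state what each form asks for.

```stata
sysuse auto, clear
* Range test written correctly: both conditions are joined with &.
count if mpg >= 30 & mpg <= 40
* Chained comparison: Stata evaluates (30 <= mpg) first, which is 0 or 1,
* and then compares that result with 40, so every observation qualifies.
count if 30 <= mpg <= 40
* & binds more tightly than |, so these two commands are equivalent ...
count if 30 <= mpg | 2000 < price & rep78 < 4
count if 30 <= mpg | (2000 < price & rep78 < 4)
* ... while moving the parentheses changes the condition.
count if (30 <= mpg | 2000 < price) & rep78 < 4
```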
using Menus
Typing commands in the Command window
Notes and Tips
About Missing Values
Stata indicates a missing numerical value as a period (.), and a missing string value as an empty string, "". Missing numerical values are larger than any number.
We know from the previous examination (. codebook rep78) that five out of 74 records of REP78 are missing. We can use the period to indicate missing records in the command and see for which MAKE of cars REP78 is missing.
. list make if rep78 >= .
Check that a period is the largest value in rep78 by sorting by rep78 and listing the last six values.
. sort rep78
. list make rep78 in -6/-1
Data=> Describe data=> List data, click by/if/in tab in list dialog box
Data=> Sort=> Ascending sort
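Because missing values sort above every number, a plain "greater than" condition silently includes them. The sketch below, on the installed auto data, shows two ways of selecting missing values and one way of excluding them; missing() is a built-in function that is not used in the text.

```stata
sysuse auto, clear
* Numeric missing values sort after every number, so >= . selects them.
count if rep78 >= .
list make rep78 if rep78 >= .
* missing() is an equivalent, more readable test.
count if missing(rep78)
* Without a guard, a "greater than" condition also picks up missing values.
count if rep78 > 3
count if rep78 > 3 & rep78 < .
```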
Review Questions:
5. How many variables and records are in the data?
6. What does the note say?
7. How can I add notes or comments to the data?
8. What variables are in the data?
9. How do I sort?
10. Which variables have missing values?
11. List the cars for which data is missing.
12. List the cars whose repair record is less than 3 and the
price is less than $5,000
Hints:
. describe
. codebook . labelbook
. inspect
. summarize
. sort
. gsort
. list [if] [in]
5. Obtain descriptive statistics
Goal: find out number of missing records, minimum and maximum values, means, and medians, view
frequency tables, and cross tabulations.
The commands that are useful for getting basic descriptive statistics include tabulate, summarize, tabstat, and table.
Typing commands in the Command window
using Menus
The –tabulate- command gives you a frequency distribution
if only one variable is specified, and a cross-tabulation if two
variables are specified. If two variables are specified, the first
variable will be shown in rows, and the second in columns.
Statistics=> Summaries, tables, and tests=> Tables=> One-way tables, or All possible two-way tabulations
. tabulate rep78
. tabulate rep78 foreign
-summarize- gives the number of valid observations, mean,
standard deviation, minimum, and maximum values.
. summarize price mpg
What if you wanted to see the average MPG for foreign and
domestic cars? The –tabulate- command can be combined
with –summarize- to produce a summary of one variable for
the variable specified in –tabulate-. For example, if you want
to see the average MPG by car type, type:
. tabulate foreign, sum(mpg)
If you want to see more statistics such as total, range, or
median, you may use tabstat.
. tabstat price mpg, stat(sum range median)
There are more statistics you can see using tabstat. See . help
tabstat for a list of statistics.
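The descriptive commands introduced so far can be run one after another on the installed auto data. This is a sketch for comparison; the detail option of –summarize- is an addition that also reports percentiles, including the median.

```stata
sysuse auto, clear
* Frequency table and cross-tabulation.
tabulate rep78
tabulate rep78 foreign
* Mean, standard deviation, minimum, and maximum.
summarize price mpg
* Add the detail option for percentiles, including the median.
summarize mpg, detail
* Mean MPG within each category of foreign.
tabulate foreign, summarize(mpg)
* Other statistics on request.
tabstat price mpg, statistics(sum range median)
```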
Statistics=> Summaries, tables, and tests=> Summary and descriptive statistics=> Summary statistics
Statistics=> Summaries, tables, and tests=> Tables=> One/two way table of summary statistics
Notes and Tips
–tabulate- cannot cross-tabulate more than two variables. If you have more than two categorical variables to cross-tabulate, use –table- (see below).
Because –tabulate- gives you frequency counts, it makes more sense to use it for categorical variables than for continuous variables.
It makes more sense to summarize continuous variables than categorical variables.
You can also see the average MPG by FOREIGN by using –by- and –summarize-.
. bysort foreign: summarize mpg
To use –by-, the data have to be sorted by FOREIGN. You could do . sort foreign, then . by foreign: summarize mpg. . bysort foreign: does the sorting and the by in one step.
Stata allows shorthand in some commands. –sum- is the shorthand for –summarize-. The shorthand is shown as underlined letters in the help page.
The –table- command lets you create three-way (or four-way
if combined with –by-) cross-tabulations. We can try that
after we create more categorical variables in the next section.
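Group-wise summaries with –by- follow the pattern prefix, colon, command. A minimal sketch on the installed auto data; the tabstat line with its by() option is an additional way to get the same comparison in one table.

```stata
sysuse auto, clear
* Two steps: sort, then repeat the command for each group.
sort foreign
by foreign: summarize mpg
* One step: bysort sorts and then runs the by-group command.
bysort foreign: summarize price mpg
* tabstat can show the same comparison in one compact table.
tabstat price mpg, by(foreign) statistics(mean median)
```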
Review Questions:
1. Which five cars yield the lowest gas mileage?
Which five cars yield the highest gas mileage?
2. What is the average price and average miles per
gallon (MPG) of a car in the 1978 auto data?
3. What is the average price of cars that are below and
above the mean MPG?
4. What is the median MPG?
5. How are price and MPG different for domestic and
foreign cars?
6. How can I see the number of cars by the car type?
7. How are the cars distributed by the repair records?
8. Compare frequency-of-repair records for domestic
and foreign cars.
Hints:
. sort
. list
. summarize
. tabulate
. table
. by groupingvarname: summarize varnames
6. Transform variables and records
Goal: create and label new variables, modify existing variables, keep or delete variables and records from the file,
recode values, create dummy variables from existing variables.
The basic commands for creating new variables and modifying old ones are –generate- and –replace-.
Typing commands in the Command window
The command
. generate newvar = something
creates a new variable named newvar and sets it equal to
something. Something can be a number, a string, a mathematical
expression, or a function of other variables. You can combine
–if-, -&-, and -|- in generating new variables.
. generate two = 1+1
. generate mycars = 1 if (rep78==1 & price<5000) | (rep78==2 & price<5000)
The –replace- command is used to make changes to existing
variables:
. generate domestic=0
. replace domestic = 1 if foreign==0
Remember that missing values are larger than numbers. So, if
you use –if- qualifier to indicate values larger than a specified
value, it could include missing values. For example, say, you
want to group cars into two categories by the repair rating,
high-repair-rating cars and low-repair-rating cars.
. generate hirep =1 if rep78>=3
. replace hirep = 0 if rep78<3
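The pitfall can be made visible by building the variable both ways and counting the affected cars. In this sketch hirep_bad is a hypothetical name for the unguarded version, and missing() is used as a readable test for missing values.

```stata
sysuse auto, clear
* Unguarded version: rep78 >= 3 is also true when rep78 is missing.
generate hirep_bad = 1 if rep78 >= 3
replace hirep_bad = 0 if rep78 < 3
count if hirep_bad == 1 & missing(rep78)
* Guarded version: cars with a missing rating stay missing in the new variable.
generate hirep = 1 if rep78 >= 3 & rep78 < .
replace hirep = 0 if rep78 < 3
count if missing(hirep)
tabulate rep78 hirep, missing
```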
HIREP also includes the cars whose repair records are missing. Check it by listing the value of HIREP when REP78 is missing.
. list hirep if rep78==.
Now that we know HIREP wrongly codes the cars with missing repair records as 1, let’s delete the variable.
. drop hirep
To exclude the missing values, you need to specify:
. generate hirep =1 if rep78>=3 & rep78!=.
. replace hirep = 0 if rep78<3
Say, you want to group repair ratings into 3 groups. The easiest way to re-group existing variables is to use –recode-. With the option , gen(newrep78), the command -recode- recodes REP78 into a new variable, NEWREP78, instead of overwriting the existing variable, REP78.
. recode rep78 (1/2=1) (3/4=2) (5=3), gen(newrep78)
using Menus
Data=> Create or change variables=> Create new variable, then click if/in tab, select Create..., type in criteria in the window, click OK
Data=> Create or change variables=> Change contents of variable
Data=> Variable utilities=> Keep or drop variables=> select Drop variables
Data=> Create or change variables=> Other variable transformation commands=> Recode categorical variable
Notes and Tips
You normally want to use replace for the second and later steps in multi-step variable creations. When you modify existing variables, make sure you will still have a way to recreate the original variable or have a back-up copy of the variable. Once you write over an existing variable, there is no way to get the original data back.
Notice that hirep disappears from the Variables window. Once you delete a variable, you cannot undo the deletion.
If you issue the command –preserve- before removing a variable, you can restore the deleted variable by issuing the command –restore-. This is a temporary measure and the two commands work only as a pair: once you issue –restore-, you need to issue another –preserve- before you can restore again.
If you do not specify a new variable name with the generate option, you will overwrite the original variable. Let’s try that with the –preserve- and –restore- commands.
. tab rep78
. preserve
. recode rep78 (1/2=1) (3/4=2) (5=3)
. tab rep78
. restore
. tab rep78
gen is short for generate.
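A recode is easy to verify by cross-tabulating the old and new variables. The sketch below does that on the installed auto data and then repeats the overwrite inside a preserve/restore pair; the missing option of –tabulate- is an addition that also shows the cars with no repair record.

```stata
sysuse auto, clear
* Recode into a new variable and compare it with the original.
recode rep78 (1/2=1) (3/4=2) (5=3), generate(newrep78)
tabulate rep78 newrep78, missing
* Overwrite the original inside a preserve/restore pair.
preserve
recode rep78 (1/2=1) (3/4=2) (5=3)
tabulate rep78
restore
* The original five categories are back.
tabulate rep78
```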
Typing commands in the Command window
We have already seen how to create a dummy variable (whose outcome is either 0 or 1) using –generate- and –replace-. Another easy way to create dummy variables is to use the –tabulate- command. The –tabulate- command, when used with the generate option, produces a dummy variable for each value. For example, suppose we want to create a dummy variable for each of the outcomes of the categorical variable REP78.
. tabulate rep78, gen(dumrep78)
Suppose you want to group a continuous variable, PRICE, into five equal ranges. First find the minimum and maximum values that you want to use to group PRICE by using –summarize-. Then,
. generate ivprice = autocode(price,5,3291,15906)
using Menus
Data=> Create or change variables=> Other variable creation commands=> Create indicator variables
Data=> Create or change variables=> Create new variable, then enter autocode function in the box
If you want to group PRICE into five groups of equal
frequencies, first sort PRICE, then issue the following
command:
. sort price
. generate fqprice = group(5)
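Both kinds of grouping can be sketched without typing the minimum and maximum by hand, because –summarize- leaves them behind in r(min) and r(max). For the equal-frequency groups, the sketch uses –xtile-, a documented command chosen here as an alternative; it is not the group() function shown in the text.

```stata
sysuse auto, clear
* Five ranges of equal width between the observed minimum and maximum.
* Each car receives the upper bound of the range that its price falls in.
summarize price
generate ivprice = autocode(price, 5, r(min), r(max))
tabulate ivprice
* Five groups of roughly equal size. xtile is used here as a documented
* alternative to group(); it is not the command shown in the text.
xtile fqprice = price, nquantiles(5)
tabulate fqprice
```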
Now, we have several more categorical variables to make a four-way table. Let’s create a table of repair records by HIREP by IVPRICE by FOREIGN. Here is how:
. table rep78 hirep ivprice, by(foreign)
You can label the variables so that you know what they are later on. Let’s add a label to HIREP as an example.
. label variable hirep "repair record is 3 or higher"
. label define yesno 1 "yes" 0 "no"
. label values hirep yesno
Review Questions:
What is the command to
1. create new variables?
2. delete variables?
3. regroup variables?
4. group continuous variables?
5. create dummy variables?
Data=> Sort=> Ascending sort
Statistics=> Summaries, tables, and tests=> Tables=> Table of summary statistics (table)
Data=> Labels=> Label variable
Data=> Labels=> Label values=> Define or modify value labels
Data=> Labels=> Label variable
Data=> Labels=> Label values=> Assign value labels to variable
Notes and Tips
Scroll down the Variables window to see what Stata created. Alternatively, view the list of variables by:
. describe
You can also add notes to the variables.
. note hirep: “temporary variable created on October 1, 2006”
When you describe data (-describe-), you will see an asterisk (*) by the variable label indicating that the variable hirep has notes.
See the notes by typing
. notes
The maximum number of variables you can list in –table- is three.
-label variable- adds a label to the variable.
-label define- defines the values of a label. The label name can be different from the variable name, and can be used for other variables.
-label values- attaches the label to the variable.
Hints:
. generate newvar =
. drop varnames
. recode oldvar (1/2=1) (3/4=2) (5=3), gen(newvar)
. generate varname = group(5)
. generate newvar = autocode(oldvar,5,min,max)
. generate newvar = 0
. replace newvar = 1 if oldvar > 6165
7. Graph
Goal: view the relationships of the variables by graphing and save graphs.
Stata has several graphs for graphing distributions of individual variables, the relationship of the variables, as well as many more
specialized graphs. Shown here are commands for some basic graphs. You may explore graphs using the menus as well. In
Stata, graphs appear in separate windows that pop up. The graphs do not appear on the Results window, and will not be stored
in the log file. If you want to save the graphs, you will need to save each graph as a file.
Typing commands in the Command window
Here's a simple histogram of PRICE.
. histogram price
You can see the histogram separately for different groups.
For example, you can see a histogram of price for foreign
and domestic cars separately and have Y values in
frequency.
. histogram price, by(foreign) freq
Another popular graph is box plot. Let’s see box plots of
price by foreign.
. graph box price, by(foreign)
The basic command for drawing a bivariate graph is twoway.
The command twoway is followed by a keyword indicating
the type of graph. To obtain a scatter plot showing the
relationship between MPG and WEIGHT, type
. graph twoway scatter mpg weight
We can obtain the scatter plot by the car type, FOREIGN .
. graph twoway scatter mpg weight, by(foreign)
Twoway graphs can be overlaid: you can draw two twoway
graphs on the same set of axes. A common use of this is to
draw a scatterplot with a regression line laid overtop of it to
show how the regression line fits the data.
We will overlay a scatter plot with a fitted regression line for
MPG and WEIGHT.
. graph twoway (scatter mpg weight) (lfit mpg weight)
Let’s save the graph. On the Command Window, type:
. graph save OverlaidMpgWeight
Once it’s saved, close the graph window, and bring it up
again.
. graph use OverlaidMpgWeight
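The graph commands in this section can be collected into one sketch on the installed auto data. The replace option on –graph save- is an addition so that the sketch can be rerun without an error about an existing file.

```stata
sysuse auto, clear
* Distribution of one variable, overall and by group.
histogram price, frequency
histogram price, by(foreign) frequency
graph box price, by(foreign)
* Scatter plot with a fitted regression line on the same axes.
graph twoway (scatter mpg weight) (lfit mpg weight)
* Save the graph (replace overwrites an earlier copy), then reopen it.
graph save OverlaidMpgWeight, replace
graph use OverlaidMpgWeight
```

Each new graph command replaces the one shown in the Graph window, so save a graph before drawing the next one if you want to keep it.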
Review Questions:
How can I …
1. make a histogram of MPG?
2. see a scatter plot of MPG against WEIGHT?
3. fit a regression line over the previous scatter plot?
4. bring the graph up again after I close the graph
window?
using Menus
Graphics=> Histogram, insert variable name PRICE in the Variable: box and check the box next to Bins, change the number to 5
Graphics=> Box plot
Graphics=> Twoway Graph, click Create, select Scatter in the Basic plots: box, Y variable: mpg, X variable: weight, click Accept, then in the “By” tab, select Draw subgraphs..., input foreign in Variables: box
Graphics=> Twoway Graph, click Create, select Fit plots under plot category, and Linear prediction under Fit plots:, Y variable: mpg, X variable: weight
File=> Save Graph... or, in the Stata Graph window, File=> Save
File=> Open Graph...
Notes and Tips
For an introduction to Stata graphs,
type
. help graph intro
Default Y value of histogram is density.
To see the histogram in frequency or
percentage, type freq or percent after
a comma:
. histogram price, freq
To see more options, see
.help histogram
Typing scatter y x draws a graph of y
against x.
Here, scatter and lfit are plot types
within the twoway family. Alternatively,
you can use || to separate the plot
types.
. graph twoway || scatter mpg weight || lfit mpg weight
You do need to separate the plot types
by the parentheses or the pipes.
Hints:
. histogram
. graph twoway scatter
. graph twoway (scatter y x) (lfit y x)
. graph save
. graph use
8. Obtain difference of means statistics
Goal: obtain Pearson’s chi-square, t-test, and analysis of variance statistics.
Once we reviewed the variables in the dataset, we may want to see the relationship among the variables. In the cross-tabulation
of repair records obtained above, domestic cars appeared to have poorer frequency-of-repair records. Is the difference
statistically significant? Let’s obtain a chi-square statistic to test the hypothesis that the frequency-of-repair records are different
by the car type.
Typing commands in the Command window
using Menus
Notes and Tips
. tabulate rep78 foreign, chi2
Suppose we reviewed literature on the automobiles made
in 1978, and hypothesize that the average MPG of 1978
cars is 20. To test this hypothesis, do a one sample t-test.
. ttest mpg==20
Comparing domestic and foreign cars, it appears that the
average MPG differs by the car type. To test a hypothesis
that the MPG is the same for foreign and domestic cars,
let’s do a two-sample t-test.
. ttest mpg, by(foreign)
We suspect that MPG is really influenced by the cars’
repair records. I want to examine if the mean MPG is
significantly different among cars that have different repair
records.
. oneway mpg rep78
Suppose that we then decided to keep the impact of
foreign in the model in addition to the repair-record in
examining miles per gallon. To run two-way analysis of
variance,
. anova mpg rep78 foreign
What if I also wanted to see the impact of weight, which is
a continuous variable. Analysis of covariance can be done
in Stata using anova command, with continuous option.
. anova mpg rep78 foreign weight, continuous(weight)
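The tests in this section can be run back to back on the installed auto data. This sketch leaves out the anova commands, whose syntax depends on the Stata version; the tabulate option of –oneway- is an addition that prints the group means next to the test.

```stata
sysuse auto, clear
* Pearson chi-squared test of independence in a two-way table.
tabulate rep78 foreign, chi2
* One-sample t test: is the mean MPG equal to 20?
ttest mpg == 20
* Two-sample t test: do domestic and foreign cars have the same mean MPG?
ttest mpg, by(foreign)
* One-way analysis of variance, with group means displayed.
oneway mpg rep78, tabulate
```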
Review Questions:
How do I obtain…
1. a chi-square statistic?
2. t-test statistics?
3. one-way ANOVA statistics?
4. two-way ANOVA statistics?
Statistics=> Exact statistics=> Two-way tables with measures of association=> select Pearson’s chi-squared
Statistics=> Summaries, tables, and tests=> Classical tests of hypotheses => One-sample mean-comparison test
Statistics=> Summaries, tables, and tests=> Classical tests of hypotheses => Two-sample mean-comparison test, in “by/if/in” tab, select Repeat command by groups, then input foreign in Variables that define groups:
Statistics=> Linear models and related=> ANOVA/MANOVA=> One-way ANOVA
Statistics=> Linear models and related=> ANOVA/MANOVA=> Analysis of variance and covariance
One-way analysis of variance tests whether the means of mpg differ across categories of repair record. If instead I wanted to see the mean difference by foreign, the one-way result would be the same as the ttest result, as the variable foreign has only two categories.
To learn more about ttest, oneway, or anova, use help.
Hints:
. tabulate varnames, chi2
. ttest
. oneway
. anova
9. Obtain linear regression estimates
Goal: run a multiple linear regression model.
Typing commands in the Command window
In estimating relationships among variables, you may
first want to examine how the variables are correlated.
We suspect that MPG and WEIGHT are correlated.
Let’s see the correlation:
. correlate mpg weight
In addition, we suspect that the correlation may be
different between foreign and domestic cars. We can
combine the –correlate- command with a by statement.
Before using a by statement, the data need to be sorted
by the by-variable.
. sort foreign
. by foreign: correlate mpg weight
It seems that mpg and weight have a relatively high
correlation. The correlation is different for foreign and
domestic cars, so foreign may also be related to MPG.
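The correlation steps can be sketched in a few lines on the installed auto data. The sig option of –pwcorr- is an addition that prints the significance level under each coefficient.

```stata
sysuse auto, clear
* Overall correlation.
correlate mpg weight
* The same correlation within each car type, sorting in the same step.
bysort foreign: correlate mpg weight
* Pairwise correlations with significance levels and a star for p < .05.
pwcorr mpg weight, sig star(.05)
```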
using Menus
Statistics=> Summaries, tables, and tests=> Summary and descriptive statistics=> Correlations and covariances
Data=> Sort=> Ascending sort
Statistics=> Summaries, tables, and tests=> Summary and descriptive statistics=> Correlations and covariances, in “by/if/in” tab click Repeat command by groups, insert foreign in Variables that define groups:
. pwcorr mpg weight, star(.05)
adds an asterisk (*) next to the correlation coefficients that are statistically significant at the 5% level.
You can also sort and use the “by statement” in one step:
. bysort foreign: correlate mpg weight
From the scatterplots we saw earlier, we also discovered that the relationship between WEIGHT and MPG is not exactly linear. We’ll include the square of WEIGHT, WEIGHT2, to improve the model; it has to be created before running the regression (see Notes and Tips for the –generate- command). Let’s run a regression estimating MPG by WEIGHT, WEIGHT2 and FOREIGN.
. regress mpg weight weight2 foreign
Statistics=> Linear models and
related=> Linear regression
After estimating a regression model, we can use the
values estimated by the model, called post-estimation
values. Using estimated MPG, we can see how the
estimated line fit the original distribution by viewing
overlaid graph. To do so, we first need to create a
variable for the predicted MPG. We’ll call this
MPGHAT.
. predict mpghat
. graph twoway (scatter mpg weight) (line mpghat weight, sort), by(foreign)
Statistics=> Postestimation=> Predictions, residuals, etc.
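The whole regression sequence, from creating the squared term to plotting observed against predicted values, can be sketched as follows on the installed auto data. The sort option on the line plot is an addition: it connects the predicted values in order of weight, which keeps the line from zigzagging.

```stata
sysuse auto, clear
* Squared term for the curved relationship between weight and MPG.
generate weight2 = weight^2
regress mpg weight weight2 foreign
* Fitted values (xb is the default), then observed against fitted.
predict mpghat
graph twoway (scatter mpg weight) (line mpghat weight, sort), by(foreign)
```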
Review Questions:
1. What is the correlation between MPG and
WEIGHT?
2. Is the correlation different between domestic
and foreign cars?
3. How do I obtain regression estimates?
4. How can I compare observed and predicted
values on a graph?
Notes and Tips
Graphics=> Two-way graph
(if there are already defined
plots in “Plot definitions:”, either
Disable or Edit them to create
new combinations)
There are series of regression
diagnostics you can do using graphs.
See UCLA’s Stata tutorial site for
more information.
To compute a square of WEIGHT,
WEIGHT2, you can multiply WEIGHT
by itself, or raise it to the power of 2.
. generate weight2 = weight*weight
. generate weight2 = weight^2
do the same thing. Run only one of
them: the second would fail because
weight2 would already exist.
Stata has a series of “post estimation
commands.” After running a
regression, for example,
you can test whether a coefficient is
statistically significantly different from
0, or from the coefficient on another
independent variable (Wald test), or test for
heteroscedasticity. For details, see
. help regress postestimation
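A few of these checks might look like the sketch below. It assumes the auto data shipped with Stata and re-estimates the model from this section so that it can be run on its own. Which tests are appropriate depends on your analysis.

```stata
* Re-create the model from this section (assumes the shipped auto data)
sysuse auto, clear
generate weight2 = weight^2
regress mpg weight weight2 foreign

* Wald test that a single coefficient equals 0
test weight

* Joint Wald test that both weight terms equal 0
test weight weight2

* Wald test that two coefficients are equal to each other
test weight = foreign

* Breusch-Pagan / Cook-Weisberg test for heteroscedasticity
estat hettest
```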
The option , xb that appears when the
menu is used is the default for predict.
You do not need to type it in the
Command window, so it does not
appear in the Results window when
the command is typed there.
Hints:
. correlate
. by varname: correlate
. regress
. predict yhat
. graph twoway (scatter y x) (line
yhat x)
10. Do files
When you have long computations, or need to repeat or modify existing computations, it may be helpful to
create a file that contains a set of Stata commands. Such files are called “do files” in Stata. Do files can be created
by manually entering commands in any text editor, or by using Stata’s do-file editor. In Stata, the do-file editor can be
opened by:
CMD: . doedit
MNU: Window=> Do-file editor=> New do-file
You may also create do-files by saving the commands you submit interactively. When you start a Stata session, start a
“command log,” which is a log file containing only the commands. By default, Stata attaches the .txt file extension if you do not
specify one. If that happens, you can change the extension in Windows Explorer. I have not
found a menu version of this command.
CMD: . cmdlog using filename.do
If you forget to start a command log, you may save the commands in the Review window. First, right click in the
Review window, then select “Select All”. Right click in the Review window again, then select “Send to Do-file
Editor”. You can eliminate commands that produced errors by clicking _rc at the top of the Review window, which sorts the
commands by their error codes; then select the error commands, right click, and choose “Delete”. You can re-sort the commands
into the original order by clicking the top of the numbered column on the far left. By the same token, you can sort
the commands by clicking the top bar where it says “Command” and delete commands like –browse- and –help-.
By the way, if you use the menus for help and search, they do not appear in the Review or Results window.
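A do-file is simply a plain text file of commands. The sketch below shows what a small one might contain. The file names example.do and example.log are hypothetical, and the analysis commands are only placeholders using the auto data shipped with Stata.

```stata
* example.do (hypothetical file name)

* Close any log left open by an earlier run, without stopping on error
capture log close

* Record the output in a plain-text log
log using example.log, replace text

* Load the auto data shipped with Stata and run a few commands
sysuse auto, clear
summarize mpg weight
regress mpg weight foreign

* Close the log at the end of the run
log close
```

To run the file, type do example.do in the Command window while Stata's working directory is the folder that contains it.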
11. Shortcut menus
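[Figure: the Stata toolbar, with its buttons labeled as listed here. Numbers in parentheses refer to the notes below.]
Open data
Save data
Print results
Log (1)
Open/close viewers
Graph window (2)
Open do-file editor
Open data editor
Open data browser
Scroll Results window (3)
Quit (4)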
1. Begins a log if no log file is open. If a log file is open, it lets you view, close, or suspend the log. You may append
to a previous log by selecting an existing log file. The dialog box menu changes accordingly.
2. Brings the graph window to the front. It only becomes active when a graph window is open.
3. Scrolls the Results window one screen at a time when you have –more- at the bottom of the Results window. It
is equivalent to hitting the space bar or clicking –more-.
4. Quits processing. Useful when a process is taking a long time and you want to stop it, or when you have
–more- but do not want to see more. It is equivalent to hitting q in the Command window, or pressing Ctrl and c at the same time.
12. Exporting results
You can copy what appears in Results window by highlighting and right clicking. There are several options: Copy Text, Copy
Table, Copy Table as HTML, and Copy as Picture. Here are pasted tables for each.
If you are pasting tables into Excel, copying either as a table or as HTML will work well. If you are pasting tables
into Word, copying as a picture seems to produce the best appearance. If you copy them as a picture, though,
the contents can only be modified using graphics software.

Copy Text
     Repair |
Record 1978 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2        2.90        2.90
          2 |          8       11.59       14.49
          3 |         30       43.48       57.97
          4 |         18       26.09       84.06
          5 |         11       15.94      100.00
------------+-----------------------------------
      Total |         69      100.00

Copy Table, Copy Table as HTML, Copy as Picture
[Pasted examples: the same frequency table of Repair Record 1978 (Freq., Percent, Cum.) as it appears when pasted with each of these three options.]
Log files with the extension .log can be opened in Word. Log files with the extension .smcl will show Stata’s formatting tags when opened there. See the
translate command in the next section to convert .smcl files into .log files.
Graphs saved as a picture (see section 7. Graph) can be imported into a document. There are several options for the format.
Use the drop-down list in the Save As box to select one. Graphs can also be copied and pasted into another application such as MS
Word. Right click the graph you want to copy, then select Copy Graph. Paste the graph in Word using Edit=> Paste, right
click and Paste, or press Control and v at the same time. Graphs pasted into a Word 2003 file may not appear
correctly after the file is converted into Word 2007 format.
There are also user created commands to output results. You may check out commands such as outreg, outreg2, estout, tabout,
est2tex, mktab, and xml_tab. To read about the commands, use search. For example, type in Stata’s command window,
. search outreg, all
Note about user created commands: Stata, being a programmer friendly program, makes it easy to install and use user made
commands. If you see a user made command that you want to use, you can install it by first searching
for it (you can also type findit commandname in Stata’s Command window) and then clicking the blue letters “click here to install.”
The help pages for the command become available after installing the program.
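Many user-written commands are also distributed through the SSC archive and can be installed with Stata's ssc command instead of clicking through the search results. The sketch below assumes an internet connection and uses estout only as an example. Check the help page of whichever command you install for its actual syntax.

```stata
* Look for the command and related materials (opens the Viewer)
findit estout

* Install from the SSC archive (requires an internet connection)
ssc install estout

* Read the help page that becomes available after installation
help estout
```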
13. Other helpful commands
If working with a large file:
You can describe data without loading the data by specifying the location and the name of data file.
. describe using datafilename
You can load only the variables you need by specifying the variable names.
. use var1 var2 var3 using datafilename
Some commands produce a log that is more than a page long (-compress-, for example). To save yourself from
pressing a key to scroll each page, you may use
. set more off
If you are seeing –more- at the bottom of the screen after typing search, and want to stop seeing more screens, press
q, or press the Control and c keys at the same time. Clicking the red X button does the same thing.
You can save some memory by compressing the data.
. compress
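The commands in this part can be tried together on a small file. In the sketch below, a copy of the auto data shipped with Stata stands in for a large file, and myauto is a hypothetical file name.

```stata
* Make a copy of the auto data to stand in for a large file
* (myauto is a hypothetical file name)
sysuse auto, clear
save myauto, replace

* Turn off the -more- pauses for the rest of the session
set more off

* Look at the contents of the file without loading the data
describe using myauto

* Load only three of the variables
use make mpg weight using myauto, clear

* Store each variable in the smallest type that holds its values
compress
```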
Shortcuts
Stata can fill in a variable name when you press the tab key after entering enough characters to identify the name. For
example, while you have the auto data open, try:
. describe h [hit tab key]
Stata fills in the rest of the variable name as headroom
You can bring up previously used commands in the Command window by hitting Page Up key.
You can refer to a set of variables with the same stem using an asterisk (*), as in:
. describe weight*
If you have created weight2, it will show both weight and weight2.
Miscellaneous
If you forget to start a log file at the beginning of a Stata session, but want to save what you have in the output
window, use
. translate @Results outputfilename.txt
The file can be viewed using a text editor or a word processor.
Note: -translate- only saves what is in the buffer (what you see in the Results window). Depending on the length of
the output you had produced, earlier results may have been lost. It is a good habit to start a log file each time you
start a Stata session.
If you created a Stata log file that has the file extension .smcl, you can reformat it into a text file by giving the command:
. translate filename.smcl filename.log
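Both uses of translate can be tried in one short session. This is a sketch with hypothetical file names (mysession.smcl, mysession.log, mysession.txt). The replace option overwrites an existing output file of the same name.

```stata
* Start a SMCL log so there is a file to convert (hypothetical names)
capture log close
log using mysession.smcl, replace

* Produce some output using the auto data shipped with Stata
sysuse auto, clear
summarize mpg

log close

* Convert the SMCL log to a plain-text log
translate mysession.smcl mysession.log, replace

* Save whatever is still in the Results window buffer as text
translate @Results mysession.txt, replace
```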
If you want to perform a mathematical operation on the spot, you can use the –display- command.
. display 1+1 => will return 2
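display also accepts functions, parentheses, and display formats, so it can serve as a small calculator. A few illustrative expressions, with the value each one returns noted in a comment:

```stata
* Powers: returns 1024
display 2^10

* Functions: returns 4
display sqrt(16)

* Parentheses control the order of operations: returns 35
display (3+4)*5

* A display format rounds the printed result: returns 33.33
display %5.2f 100/3
```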
14. On-line tutorials
UCLA
http://www.ats.ucla.edu/stat/stata/
UNC
http://www.cpc.unc.edu/services/computer/presentations/statatutorial/
Princeton
http://data.princeton.edu/stata/
http://www.princeton.edu/~eszter/stata.html
http://www.princeton.edu/~otorres/Stata/
http://opr.princeton.edu/computing/software/stata/intro/default.asp