No part of this book may be reproduced, stored in a retrieval system, or transcribed, in any form or by any means (electronic, mechanical, photocopy, recording, or otherwise) without the prior written permission of StataCorp LLC.
Stata, Stata Press, Mata, and NetCourse are registered trademarks of StataCorp LLC.
Stata and Stata Press are registered trademarks with the World Intellectual Property Organization of the United Nations.
NetCourseNow is a trademark of StataCorp LLC.
LaTeX 2ε is a trademark of the American Mathematical Society.
observations, and useful perspectives to consider. I am very grateful to Kristin MacDonald and Adam Crawley for their careful review and well-crafted editing. I thank them for fine-tuning my words to better express what I was trying to say in the first place. I want to thank Lisa Gilmore for finessing and polishing the typesetting and for being such a key player in transforming this text from a manuscript into a book. I am so delighted with the cover, designed and created by Eric Hubbard, which conveys the metaphor that data management is often like constructing a building. I want to extend my appreciation and thanks to the entire team at StataCorp and Stata Press, who have always been so friendly, encouraging, and supportive, including Patricia Branton, Vince Wiggins, Annette Fett, and Deirdre Skaggs. Finally, I want to thank Frauke Kreuter for her very kind assistance in translating labels into German in chapter 5.
1.1 Using this book
1.2 Overview of this book
1.3 Listing observations in this book
1.4 More online resources
2 Reading and importing data files
2.1 Introduction
2.2 Reading Stata datasets
2.3 Importing Excel spreadsheets
2.4 Importing SAS files
2.4.1 Importing SAS sas7bdat files
2.4.2 Importing SAS XPORT Version 5 files
2.4.3 Importing SAS XPORT Version 8 files
2.5 Importing SPSS files
2.6 Importing dBase files
2.7 Importing raw data files
2.7.1 Importing comma-separated and tab-separated files
2.7.2 Importing space-separated files
2.7.3 Importing fixed-column files
2.7.4 Importing fixed-column files with multiple lines of raw data per observation
2.8 Common errors when reading and importing files
2.9 Entering data directly into the Stata Data Editor
3 Saving and exporting data files
3.1 Introduction
3.2 Saving Stata datasets
3.3 Exporting Excel files
3.4 Exporting SAS XPORT Version 8 files
3.5 Exporting SAS XPORT Version 5 files
3.6 Exporting dBase files
3.7 Exporting comma-separated and tab-separated files
3.8 Exporting space-separated files
3.9 Exporting Excel files revisited: Creating reports
4 Data cleaning
4.1 Introduction
4.2 Double data entry
4.3 Checking individual variables
4.4 Checking categorical by categorical variables
4.5 Checking categorical by continuous variables
4.6 Checking continuous by continuous variables
4.7 Correcting errors in data
5.6 Labeling variables and values in different languages
5.7 Adding comments to your dataset using notes
5.8 Formatting the display of variables
5.9 Changing the order of variables in a dataset
6 Creating variables
6.1 Introduction
6.2 Creating and changing variables
6.3 Numeric expressions and functions
6.4 String expressions and functions
6.5 Recoding
6.6 Coding missing values
6.11 Computations across observations
6.12 More examples using the egen command
6.13 Converting string variables to numeric variables
6.14 Converting numeric variables to string variables
6.15 Renaming and ordering variables
7 Combining datasets
7.1 Introduction
7.2 Appending: Appending datasets
7.3 Appending: Problems
7.4 Merging: One-to-one match merging
7.5 Merging: One-to-many match merging
7.6 Merging: Merging multiple datasets
7.7 Merging: Update merges
7.8 Merging: Additional options when merging datasets
7.9 Merging: Problems merging datasets
7.10 Joining datasets
7.11 Crossing datasets
8 Processing observations across subgroups
8.1 Introduction
8.2 Obtaining separate results for subgroups
8.3 Computing values separately by subgroups
8.4 Computing values within subgroups: Subscripting observations
8.5 Computing values within subgroups: Computations across observations
8.6 Computing values within subgroups: Running sums
8.7 Computing values within subgroups: More examples
8.8 Comparing the by and tsset commands
9 Changing the shape of your data
9.1 Introduction
9.2 Wide and long datasets
9.3 Introduction to reshaping long to wide
9.4 Reshaping long to wide: Problems
9.5 Introduction to reshaping wide to long
9.6 Reshaping wide to long: Problems
9.7 Multilevel datasets
9.8 Collapsing datasets
10 Programming for data management: Part I
10.1 Introduction
10.2 Tips on long-term goals in data management
10.3 Executing do-files and making log files
10.4 Automating data checking
10.5 Combining do-files
10.6 Introducing Stata macros
10.7 Manipulating Stata macros
10.8 Repeating commands by looping over variables
10.9 Repeating commands by looping over numbers
10.10 Repeating commands by looping over anything
10.11 Accessing results stored from Stata commands
11 Programming for data management: Part II
11.1 Writing Stata programs for data management
11.2 Program 1: hello
11.3 Where to save your Stata programs
11.4 Program 2: Multilevel counting
11.5 Program 3: Tabulations in list format
11.6 Program 4: Scoring the simple depression scale
11.7 Program 5: Standardizing variables
11.8 Program 6: Checking variable labels
11.9 Program 7: Checking value labels
11.10 Program 8: Customized describe command
11.11 Program 9: Customized summarize command
11.12 Program 10: Checking for unlabeled values
11.13 Tips on debugging Stata programs
11.14 Final thoughts: Writing Stata programs for data management
A Common elements
A.1 Introduction
A.2 Overview of Stata syntax
A.3 Working across groups of observations with by
A.4 Comments
A.5 Data types
A.10 Missing values
A.11 Referring to variable lists
A.12 Frames
A.12.1 Frames example 1: Can I interrupt you for a quick question?
A.12.2 Frames example 2: Juggling related tasks
A.12.3 Frames example 3: Checking double data entry
Subject index
8.1 Meanings of newvar depending on the value inserted for
8.2 Values assigned to newvar based on the value inserted for
8.3 Expressions to replace X and the meaning that newvar would have
8.4 Expressions to replace X and the meaning that newvar would have
2.1 Contents of Excel spreadsheet named dentists.xls
2.2 Contents of Excel spreadsheet named dentists2.xls
2.3 Contents of Excel spreadsheet named dentists3.xls
2.4 Stata Data Editor after step 1, entering data for the first observation
2.5 Variables Manager after labeling the first variable
2.6 Create label dialog box showing value labels for racelab
2.7 Manage value labels dialog box showing value labels for racelab and yesnolab
2.8 Variables Manager and Data Editor after step 2, labeling the variables
2.9 Data Editor after step 3, fixing the date variables
3.1 Contents of Excel spreadsheet named dentlab.xlsx
3.2 Contents of second try making Excel spreadsheet named
3.7 First goal: Create Report #1
3.8 First try to create Report #1
3.9 Contents of dentrpt1-skeleton.xlsx: Skeleton Excel file for Report #1
3.10 Contents of dentrpt1.xlsx: Excel file showing Report #1
3.11 Second goal: Create Report #2
3.12 Contents of dentrpt2-skeleton.xlsx: Skeleton Excel file for Report #2
3.13 Contents of dentrpt2.xlsx: Excel file showing Report #2
3.14 Third goal: Create Report #3
3.15 Contents of dentrpt3-skeleton.xlsx: Skeleton Excel file for Report #3
3.16 First stage of creating dentrpt3.xlsx for Report #3
3.17 Final version of dentrpt3.xlsx for Report #3
10.1 Flow diagram for the wwsmini project
11.1 My first version of hello.ado shown in the Do-file Editor
11.2 My second version of hello.ado, with a more friendly greeting
11.3 My third version of hello.ado
Preface to the Second Edition
It was nearly 10 years ago that I wrote the preface for the first edition of this book. The goals and scope of this book are still the same, but in this second edition you will find new data management features that have been added over the last 10 years. Such features include the ability to read and write a wide variety of file formats, the ability to write highly customized Excel files, the ability to have multiple Stata datasets open at once, and the ability to store and manipulate string variables stored as Unicode.
As mentioned above, Stata now reads many file formats. Stata can now read Excel files (see section 2.3), SAS files (see section 2.4), SPSS files (see section 2.5), and even dBase files (see section 2.6). Further, Stata has added the import delimited command, which reads a wide variety of delimited files and supports many options for customizing the importing of such data (see section 2.7.1).
Stata can now export files into many file formats. Stata can now export Excel files (see section 3.3), SAS XPORT 8 and SAS XPORT 5 files (see sections 3.4 and 3.5), and dBase files (see section 3.6). Additionally, the export delimited command exports delimited files and supports many options for customizing the export of such data (see section 3.7). Also, section 3.9 will illustrate some of the enhanced capabilities Stata now has for exporting Excel files, showing how you can generate custom formatted reports.
The biggest change you will find in this new edition is the addition of chapter 11, titled “Programming for data management: Part II”. Chapter 11 builds upon chapter 10, illustrating how Stata programs can be used to solve common data management tasks. I describe four strategies that I commonly use when creating a program to solve a data management task and illustrate how to
programming tools involved in solving the example. I chose the 10 examples in this chapter not only because the problems are common and easy to grasp but also because these programs illustrate frequently used tools for writing Stata programs. After you explore these examples and see these programming tools applied to data management problems, I hope you will have insight into how you can apply these tools to build programs for your own data management tasks.
Writing this book has been both a challenge and a pleasure. I hope that you like it!
There is a gap between raw data and statistical analysis. That gap, called data management, is often filled with a mix of pesky and strenuous tasks that stand between you and your data analysis. I find that data management usually involves some of the most challenging aspects of a data analysis project. I wanted to write a book showing how to use Stata to tackle these pesky and challenging data management tasks.
One of the reasons I wanted to write such a book was to be able to show how useful Stata is for data management. Sometimes, people think that Stata’s strengths lie solely in its statistical capabilities. I have been using Stata and teaching it to others for over 10 years, and I continue to be impressed with the way that it combines power with ease of use for data management. For example, take the reshape command. This simple command makes it a snap to convert a wide file to a long file and vice versa (for examples, see section 9.3). Furthermore, reshape is partly based on the work of a Stata user, illustrating that Stata’s power for data management is augmented by community-contributed programs that you can easily download.
Each section of this book generally stands on its own, showing you how you can do a particular data management task in Stata. Take, for example, section 2.7.1, which shows how you can read a comma-delimited file into Stata. This is not a book you need to read cover to cover, and I would encourage you to jump around to the topics that are most relevant for you.
Data management is a big (and sometimes daunting) task. I have written this book in an informal fashion, as if we were sitting down together at the computer and I was showing you some tips about data management. My aim with this book is to help you easily and quickly learn what you need to know to skillfully use Stata for your data management tasks. But if you need further assistance solving a problem, section 1.4 describes the rich array of online Stata resources available to you. I would especially recommend the Statalist listserver, which allows you to tap into the knowledge of Stata users around the world.
If you would like to contact me with comments or suggestions, I would love to hear from you. You can write me at MichaelNormanMitchell@gmail.com, or visit me on the web at http://www.MichaelNormanMitchell.com. Writing this book has been both a challenge and a pleasure. I hope that you like it!
Chapter 1
Introduction
It has been said that data collection is like garbage collection: before you collect it you should have in mind what you are going to do with it.
—Russell Fox, Max Gorbuny, and Robert Hooke
1.1 Using this book
As stated in the title, this is a practical handbook for data management using Stata. As a practical handbook, there is no need to read the chapters in any particular order. Each chapter, as well as most sections within each chapter, stands alone. Each section focuses on a particular data management task and provides examples of how to perform that particular data management task. I imagine at least two ways this book could be used.
You can pick a chapter, say, chapter 4 on data cleaning, and read the chapter to pick up some new tips and tricks about how to clean and prepare your data. Then, the next time you need to clean data, you can use some of the tips you learned and grab the book for a quick refresher as needed.
Or you may wish for quick help on a task you have never performed (or have not performed in a long time). For example, you want to import an Excel .xlsx file. You can grab the book and flip to chapter 2 on reading and importing datasets, in which section 2.3 illustrates importing Excel .xlsx files. Based on those examples, you can read the Excel .xlsx file and then get back to your work.
However you read this book, each section is designed to provide you with information to solve the task at hand without getting lost in ancillary or esoteric details. If you find yourself craving more details, each section concludes with suggested references to the Stata help files for additional information.
Because this book is organized by task, whereas the reference manuals are organized by command, I hope this book helps you connect data management tasks to the corresponding reference manual entries associated with those tasks. Viewed this way, this book is not a competitor to the reference manuals but is instead a companion to them.
I encourage you to run the examples from this book for yourself. This engages you in active learning, as compared with passive learning (such as just reading the book). When you are actively engaged in typing in commands, seeing the results, and trying variations on the commands for yourself, I believe you will gain a better and deeper understanding than you would obtain from just passively reading.
To allow you to replicate the examples in this book, the datasets are available for download. You can download all the datasets used in this book into your current working directory from within Stata by typing the following commands:
You can also download any commands used in this book by
further typing
After issuing these commands, you could then use a dataset, for example, wws.dta, just by typing the following command:
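As a minimal sketch of that command, the line below assumes wws.dta has already been downloaded into the current working directory. The clear option is an addition for this sketch (not something the text above requires); it discards any dataset already in memory so that the command does not stop with an error.

```stata
* Load wws.dta from the current working directory.
* The clear option is an assumption added here so the example
* also runs when another dataset is already in memory.
use wws, clear
```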
Tip! Online resources for this book
You can find all the online resources for this book at the
book’s website:
https://www.stata-press.com/books/data-management-using-stata/
This site describes how to access the data online and will include an errata webpage (which will hopefully be short or empty).
of a section, but that strategy might not always work. Then, you will need to start from the beginning of the section to work your way through the examples. Although most sections are independent, some build on prior sections. Even in such cases, the datasets will be available so that you can execute the examples starting from the beginning of any given section.
Although the tasks illustrated in this book could be performed using the Stata point-and-click interface, this book concentrates on the use of Stata commands. However, there are two interactive or point-and-click features that are so handy that I believe even command-oriented users (including myself) would find them useful. The Data Editor (as illustrated in section 2.9) is a useful interactive interface for entering data into Stata. That same section illustrates the use of the Variables Manager. Although the Variables Manager is illustrated in the context of labeling variables for a newly created dataset, it is equally useful for modifying (or adding) labels for an existing dataset.
1.2 Overview of this book
Each chapter of this book covers a different data management topic, and each chapter pretty much stands alone. The ordering of the chapters is not like that in a traditional book, where you should read from the beginning to the end. You might get the most out of this book by reading the chapters in a different order than that in which they are presented. I would like to give you a quick overview of the book to help you get the most out of the order in which you read the chapters.
This book is composed of 12 chapters, comprising this introductory chapter (chapter 1), informational chapters 2–11, and an appendix.
About the cover
I frequently think of constructing a building as a useful metaphor for data management. A hastily constructed building will be weak and need lots of maintenance simply to keep it from falling over. A sturdy building arises from good design, quality materials, careful assembly, and attention to detail. In this book, I aim to illustrate best practices that will help you design and create sturdy datasets that will be strong and require the least amount of maintenance over time.
The following five chapters, chapters 2–6, cover nuts-and-bolts topics that are common to every data management project: reading and importing data, saving and exporting data, data cleaning, labeling datasets, and creating variables. These topics are placed at the front because I think they are the most common topics in data management; they are also placed in the front
The next three chapters, chapters 7–9, cover tasks that occur in many (but not all) data management projects: combining datasets, processing observations across subgroups, and changing the shape of your data.
Chapters 10 and 11 cover programming for data management. Although the topics in these chapters are common to many (if not all) data management projects, they are a little more advanced than the topics discussed in chapters 2–6. Chapter 10 describes how to structure your data analysis to be reproducible and describes a variety of programming shortcuts for performing repetitive tasks. Chapter 11 builds upon chapter 10 and illustrates how Stata programs can be used to solve common data management tasks. I describe four strategies that I commonly use when creating a program to solve a data management task and illustrate how to solve 10 data management problems, drawing upon these strategies as part of solving each problem.
Appendix A describes common elements regarding the workings of Stata. Unlike the previous chapters, these are fragments that do not pertain to a particular data management task yet are pervasive and hence are frequently referenced throughout the book. The earlier chapters will frequently refer to the sections in the appendix, providing one explanation of these elements rather than repeating explanations each time they arise. The appendix covers topics such as comments, logical expressions, functions, if and in, missing values, and variable lists. I placed this appendix at the back to help you quickly flip to it when it is referenced. You may find it easier to read over the appendix to familiarize yourself with these elements rather than repeatedly flipping back to it.
The next section describes and explains some of the options that are used with the list command throughout this book.
1.3 Listing observations in this book
This book relies heavily on examples to show you how data management commands work in Stata. I would rather show you how a command works with a simple example than explain it with lots of words. To that end, I frequently use the list command to illustrate the effect of commands. The default output from the list command is not always as clear as I might hope. Sometimes, I add options to the list command to maximize the clarity of the output. Rather than explain the workings of these options each time they arise, I use this section to illustrate these options and explain why you might see them used throughout the book.
For the first set of examples, let’s use wws.dta, which contains 2,246 hypothetical observations about women and their work.
For files with many observations, it can be useful to list a subset of observations. I frequently use the in specification to show selected observations from a dataset. In the example below, we list observations 1–5 and see the variables idcode, age, hours, and wage.
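The in specification described above can be sketched as follows. This assumes wws.dta is in the current working directory; the variable names are the ones named in the text, and the clear option is an addition for this sketch.

```stata
* Load the example dataset (assumed to be in the working directory).
use wws, clear

* List four variables, restricted to observations 1 through 5.
list idcode age hours wage in 1/5
```

The range after in is written first/last, so in 1/5 selects the first five observations in their current sort order.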
Sometimes, variable names are so long that they get abbreviated by the list command. This can make the listings more compact but also make the abbreviated headings harder to
The abbreviate() option can be used to indicate the minimum number of characters the list command will use when abbreviating variables. For example, specifying abbreviate(20) means that none of the variables will be abbreviated to a length any shorter than 20 characters. In the book, I abbreviate this option to abb() (for example, abb(20), as shown below). Here this option causes all the variables to be fully spelled out.
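A minimal sketch of the abb() option follows. It reuses the four variables named earlier in this section, which is an assumption made for illustration; their names are short, so the option only changes the output for datasets whose variable names are long enough to be abbreviated.

```stata
* Load the example dataset (assumed to be in the working directory).
use wws, clear

* abb(20) is short for abbreviate(20): variable names in the header
* are not shortened to fewer than 20 characters.
list idcode age hours wage in 1/5, abb(20)
```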
When the variable listing is too wide for the page, the listing will wrap on the page. As shown below, this listing is hard to follow, and so I avoid it in this book.
Sometimes, I add the noobs option to avoid such wrapping. The noobs option suppresses the display of the observation numbers, which occasionally saves just enough room to keep the listing from wrapping on the page.
The example from above is repeated below with the noobs option, and enough space is saved to permit the variables to be listed without wrapping.
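The noobs option can be sketched as shown here. The variable list is an assumption for illustration (the four variables named earlier in this section), so this sketch shows only the syntax, not the exact wide listing discussed above.

```stata
* Load the example dataset (assumed to be in the working directory).
use wws, clear

* noobs drops the column of observation numbers, which frees
* a few characters of width in the listing.
list idcode age hours wage in 1/5, abb(20) noobs
```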
For the remaining examples, let’s use tv1.dta, which contains 10 observations about the TV-watching habits of four kids.
Note how a separator line is displayed after every five observations. This helps make the output easier to read. Sometimes, though, I am pinched for space and suppress that separator to keep the listing on one page. The separator(0) option (which I abbreviate to sep(0)) omits the display of these separators.
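The two listings just described might be produced as sketched below, assuming tv1.dta is in the current working directory (the clear option is an addition for this sketch).

```stata
* Load the example dataset (assumed to be in the working directory).
use tv1, clear

* Default listing: a separator line appears after every five observations.
list

* sep(0) is short for separator(0) and suppresses those separator lines.
list, sep(0)
```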
In other cases, the separators can be especially helpful in clarifying the grouping of observations. In this dataset, there are multiple observations per kid, and we can add the sepby(kidid) option to request that a separator be included between each level of kidid. This helps us clearly see the groupings of observations by kid.
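A minimal sketch of the sepby() option follows, again assuming tv1.dta is in the current working directory. The variable kidid is the identifier named in the text.

```stata
* Load the example dataset (assumed to be in the working directory).
use tv1, clear

* Draw a separator line each time the value of kidid changes,
* so that each kid's observations appear as one block.
list, sepby(kidid)
```

Because sepby() draws a line wherever the value changes from one row to the next, it is most useful when the data are already sorted by the grouping variable.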
This concludes this section describing options this book uses with the list command. I hope that this section helps you avoid confusion that could arise by having these options appear without any explanation of what they are or why they are being used.
Shout-out! Stata on the Internet
Did you know that Stata is on Facebook? On Twitter? That Stata has a YouTube Channel filled with Stata video tutorials? Find out more by typing help internet.
1.4 More online resources
There are many online resources to help you learn and use Stata. Here are some resources I would particularly recommend:
The “Stata resources and support” webpage provides a comprehensive list of online resources that are available for Stata. It lists official resources that are available from StataCorp as well as resources from the Stata community. See https://www.stata.com/support/
The “Resources for learning Stata” webpage provides a list of resources created by the Stata community to help you learn and use Stata; see https://www.stata.com/links/resources-for-learning-stata/. Among the links included there, I would highly recommend the UCLA IDRE Stata web resources at https://stats.idre.ucla.edu/stata/, which include FAQs, annotated Stata output, and textbook examples solved in Stata.
Stata video tutorials help you get started quickly on specific topics. Stata has recorded over 250 short video tutorials demonstrating how to use Stata and solve specific problems. The videos cover topics like simple linear regression, time series, descriptive statistics, importing Excel data, Bayesian analysis, tests, instrumental variables, and even more! You can access these videos and learn more about them by typing help videos or by searching for “Stata YouTube” with your favorite web browser and search engine.
The “Frequently asked questions on using Stata” webpage is special because it not only contains many frequently asked questions but also includes answers! The FAQs cover common questions (for example, how do I export tables from Stata?) as well as esoteric ones (for example, how are estimates of rho outside the bounds handled in the two-step Heckman estimator?). You can search the FAQs using keywords, or you can browse the FAQs by topic. See https://www.stata.com/support/faqs/
“Statalist” is an independently operated listserver that connects over 3,000 Stata users from all over the world. I can say from personal experience that the community is both extremely knowledgeable and friendly, welcoming questions from newbies and experts alike. Even if you never post a question of your own, you can learn quite a bit from searching the vast archive of answers. See https://www.statalist.org/
The Stata Journal is published quarterly with articles that integrate various aspects of statistical practice with Stata. Although current issues and articles are available by subscription, articles over three years old are available for free online as PDF files. See https://www.stata-journal.com/
The Stata Blog: Not Elsewhere Classified includes entries that explore many Stata topics, including data management, programming, simulation, random numbers, and more! The Stata Blog postings can be detailed, topical, and even whimsical. See https://blog.stata.com/
Chapter 2
Reading and importing data files
Stata rhymes with data
—An old Stata FAQ
2.1 Introduction
You have some data that you are eager to analyze using Stata. Before you can analyze the data in Stata, you must first read the data into Stata. This chapter describes how you can read several common types of data files into Stata. This section gives you an overview of some of the common issues you want to think about when reading and importing data files in Stata.
Changing directories
To read a data file, you first need to know the directory or folder in which it is located and how to get there.
Say that you are using Windows and you have a folder named mydata that is located in your Documents folder. Using the cd command shown below changes the current working directory to the mydata folder within your Documents folder.1
Say that you are using Unix (for example, Linux or macOS) and your data files are stored in a directory named ~/statadata. You could go to that directory by typing
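A minimal sketch of changing directories on Unix is shown here. The directory ~/statadata is the one named in the text and is assumed to exist; the pwd and dir lines are additions for this sketch that confirm where you ended up and which datasets are there.

```stata
* Change the current working directory to ~/statadata
* (the directory is assumed to exist already).
cd ~/statadata

* Display the current working directory to confirm the change.
pwd

* Show the Stata datasets stored in that directory.
dir *.dta
```

On Windows the idea is the same, except that the argument to cd is the full path to the mydata folder; if a path contains spaces, enclose it in double quotes.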
Consider the partially complete cd command shown below. After typing the forward slash, we can press the Tab key to activate tab completion, showing a list of possible folders that can be chosen via keystrokes or mouse clicks.
For further information on these navigational issues, see the Getting Started with Stata manual. From this point forward, I will assume that the data files of interest are in your current directory.2
Tip! Using the main menu to change directories
In the previous examples, the directory or folder names were short and simple, but in real life, such names are often long and typing them can be prone to error. It can be easier to point to a directory or folder than it is to type it. If you go to the File menu and then select Change working directory, you can change the working directory by pointing to the directory or folder rather than having to type the full name.
What kind of file are you reading?
There are several data files that you can read and import into Stata. This chapter begins by illustrating how you can read Stata datasets into Stata. As you would expect, it is simple to read Stata datasets into Stata. Section 2.2 describes how to read Stata
illustrate how you can import dBase .dbf files, including dBase III files and dBase IV files.
Note! Reading versus importing
In this chapter, I will sometimes talk about “reading” data and sometimes talk about “importing” data. In both instances, you are retrieving an external file and placing it into memory. In general, I will talk about reading a