Data Management Using Stata: A Practical Handbook (StataCorp LLC, 2020)

No part of this book may be reproduced, stored in a retrieval system, or transcribed, in any form or by any means (electronic, mechanical, photocopy, recording, or otherwise) without the prior written permission of StataCorp LLC.

Stata, Stata Press, Mata, and NetCourse are registered trademarks of StataCorp LLC.

Stata and Stata Press are registered trademarks with the World Intellectual Property Organization of the United Nations.

NetCourseNow is a trademark of StataCorp LLC.

LaTeX 2ε is a trademark of the American Mathematical Society.

observations, and useful perspectives to consider. I am very grateful to Kristin MacDonald and Adam Crawley for their careful review and well-crafted editing. I thank them for fine-tuning my words to better express what I was trying to say in the first place. I want to thank Lisa Gilmore for finessing and polishing the typesetting and for being such a key player in transforming this text from a manuscript into a book. I am so delighted with the cover, designed and created by Eric Hubbard, which conveys the metaphor that data management is often like constructing a building. I want to extend my appreciation and thanks to the entire team at StataCorp and Stata Press, who have always been so friendly, encouraging, and supportive, including Patricia Branton, Vince Wiggins, Annette Fett, and Deirdre Skaggs. Finally, I want to thank Frauke Kreuter for her very kind assistance in translating labels into German in chapter 5.

1.1 Using this book

1.2 Overview of this book

1.3 Listing observations in this book

1.4 More online resources

2 Reading and importing data files

2.1 Introduction

2.2 Reading Stata datasets

2.3 Importing Excel spreadsheets

2.4 Importing SAS files

2.4.1 Importing SAS sas7bdat files

2.4.2 Importing SAS XPORT Version 5 files

2.4.3 Importing SAS XPORT Version 8 files

2.5 Importing SPSS files

2.6 Importing dBase files

2.7 Importing raw data files

2.7.1 Importing comma-separated and tab-separated files

2.7.2 Importing space-separated files

2.7.3 Importing fixed-column files

2.7.4 Importing fixed-column files with multiple lines of raw data per observation

2.8 Common errors when reading and importing files

2.9 Entering data directly into the Stata Data Editor

3 Saving and exporting data files

3.1 Introduction

3.2 Saving Stata datasets

3.3 Exporting Excel files

3.4 Exporting SAS XPORT Version 8 files

3.5 Exporting SAS XPORT Version 5 files

3.6 Exporting dBase files

3.7 Exporting comma-separated and tab-separated files

3.8 Exporting space-separated files

3.9 Exporting Excel files revisited: Creating reports

4 Data cleaning

4.1 Introduction

4.2 Double data entry

4.3 Checking individual variables

4.4 Checking categorical by categorical variables

4.5 Checking categorical by continuous variables

4.6 Checking continuous by continuous variables

4.7 Correcting errors in data

5.6 Labeling variables and values in different languages

5.7 Adding comments to your dataset using notes

5.8 Formatting the display of variables

5.9 Changing the order of variables in a dataset

6 Creating variables

6.1 Introduction

6.2 Creating and changing variables

6.3 Numeric expressions and functions

6.4 String expressions and functions

6.5 Recoding

6.6 Coding missing values

6.11 Computations across observations

6.12 More examples using the egen command

6.13 Converting string variables to numeric variables

6.14 Converting numeric variables to string variables

6.15 Renaming and ordering variables

7 Combining datasets

7.1 Introduction

7.2 Appending: Appending datasets

7.3 Appending: Problems

7.4 Merging: One-to-one match merging

7.5 Merging: One-to-many match merging

7.6 Merging: Merging multiple datasets

7.7 Merging: Update merges

7.8 Merging: Additional options when merging datasets

7.9 Merging: Problems merging datasets

7.10 Joining datasets

7.11 Crossing datasets

8 Processing observations across subgroups

8.1 Introduction

8.2 Obtaining separate results for subgroups

8.3 Computing values separately by subgroups

8.4 Computing values within subgroups: Subscripting observations

8.5 Computing values within subgroups: Computations across observations

8.6 Computing values within subgroups: Running sums

8.7 Computing values within subgroups: More examples

8.8 Comparing the by and tsset commands

9 Changing the shape of your data

9.1 Introduction

9.2 Wide and long datasets

9.3 Introduction to reshaping long to wide

9.4 Reshaping long to wide: Problems

9.5 Introduction to reshaping wide to long

9.6 Reshaping wide to long: Problems

9.7 Multilevel datasets

9.8 Collapsing datasets

10 Programming for data management: Part I

10.1 Introduction

10.2 Tips on long-term goals in data management

10.3 Executing do-files and making log files

10.4 Automating data checking

10.5 Combining do-files

10.6 Introducing Stata macros

10.7 Manipulating Stata macros

10.8 Repeating commands by looping over variables

10.9 Repeating commands by looping over numbers

10.10 Repeating commands by looping over anything

10.11 Accessing results stored from Stata commands

11 Programming for data management: Part II

11.1 Writing Stata programs for data management

11.2 Program 1: hello

11.3 Where to save your Stata programs

11.4 Program 2: Multilevel counting

11.5 Program 3: Tabulations in list format

11.6 Program 4: Scoring the simple depression scale

11.7 Program 5: Standardizing variables

11.8 Program 6: Checking variable labels

11.9 Program 7: Checking value labels

11.10 Program 8: Customized describe command

11.11 Program 9: Customized summarize command

11.12 Program 10: Checking for unlabeled values

11.13 Tips on debugging Stata programs

11.14 Final thoughts: Writing Stata programs for data management

A Common elements

A.1 Introduction

A.2 Overview of Stata syntax

A.3 Working across groups of observations with by

A.4 Comments

A.5 Data types

A.10 Missing values

A.11 Referring to variable lists

A.12 Frames

A.12.1 Frames example 1: Can I interrupt you for a quick question?

A.12.2 Frames example 2: Juggling related tasks

A.12.3 Frames example 3: Checking double data entry

Subject index

8.1 Meanings of newvar depending on the value inserted for

8.2 Values assigned to newvar based on the value inserted for

8.3 Expressions to replace X and the meaning that newvar would have

8.4 Expressions to replace X and the meaning that newvar would have

2.1 Contents of Excel spreadsheet named dentists.xls

2.2 Contents of Excel spreadsheet named dentists2.xls

2.3 Contents of Excel spreadsheet named dentists3.xls

2.4 Stata Data Editor after step 1, entering data for the first observation

2.5 Variables Manager after labeling the first variable

2.6 Create label dialog box showing value labels for racelab

2.7 Manage value labels dialog box showing value labels for racelab and yesnolab

2.8 Variables Manager and Data Editor after step 2, labeling the variables

2.9 Data Editor after step 3, fixing the date variables

3.1 Contents of Excel spreadsheet named dentlab.xlsx

3.2 Contents of second try making Excel spreadsheet named

3.7 First goal: Create Report #1

3.8 First try to create Report #1

3.9 Contents of dentrpt1-skeleton.xlsx: Skeleton Excel file for Report #1

3.10 Contents of dentrpt1.xlsx: Excel file showing Report #1

3.11 Second goal: Create Report #2

3.13 Contents of dentrpt2.xlsx: Excel file showing Report #2

3.14 Third goal: Create Report #3

3.16 First stage of creating dentrpt3.xlsx for Report #3

3.17 Final version of dentrpt3.xlsx for Report #3

10.1 Flow diagram for the wwsmini project

11.1 My first version of hello.ado shown in the Do-file Editor

11.2 My second version of hello.ado, with a more friendly greeting

11.3 My third version of hello.ado

Preface to the Second Edition

It was nearly 10 years ago that I wrote the preface for the first edition of this book. The goals and scope of this book are still the same, but in this second edition you will find new data management features that have been added over the last 10 years. Such features include the ability to read and write a wide variety of file formats, the ability to write highly customized Excel files, the ability to have multiple Stata datasets open at once, and the ability to store and manipulate string variables stored as Unicode.

As mentioned above, Stata now reads many file formats. Stata can now read Excel files (see section 2.3), SAS files (see section 2.4), SPSS files (see section 2.5), and even dBase files (see section 2.6). Further, Stata has added the import delimited command, which reads a wide variety of delimited files and supports many options for customizing the importing of such data (see section 2.7.1).

Stata can now export files into many file formats. Stata can now export Excel files (see section 3.3), SAS XPORT 8 and SAS XPORT 5 files (see sections 3.4 and 3.5), and dBase files (see section 3.6). Additionally, the export delimited command exports delimited files and supports many options for customizing the export of such data (see section 3.7). Also, section 3.9 will illustrate some of the enhanced capabilities Stata now has for exporting Excel files, showing how you can generate custom formatted reports.

The biggest change you will find in this new edition is the addition of chapter 11, titled “Programming for data management: Part II”. Chapter 11 builds upon chapter 10, illustrating how Stata programs can be used to solve common data management tasks. I describe four strategies that I commonly use when creating a program to solve a data management task and illustrate how to

programming tools involved in solving the example. I chose the 10 examples in this chapter not only because the problems are common and easy to grasp but also because these programs illustrate frequently used tools for writing Stata programs. After you explore these examples and see these programming tools applied to data management problems, I hope you will have insight into how you can apply these tools to build programs for your own data management tasks.

Writing this book has been both a challenge and a pleasure. I hope that you like it!

There is a gap between raw data and statistical analysis. That gap, called data management, is often filled with a mix of pesky and strenuous tasks that stand between you and your data analysis. I find that data management usually involves some of the most challenging aspects of a data analysis project. I wanted to write a book showing how to use Stata to tackle these pesky and challenging data management tasks.

One of the reasons I wanted to write such a book was to be able to show how useful Stata is for data management. Sometimes, people think that Stata’s strengths lie solely in its statistical capabilities. I have been using Stata and teaching it to others for over 10 years, and I continue to be impressed with the way that it combines power with ease of use for data management. For example, take the reshape command. This simple command makes it a snap to convert a wide file to a long file and vice versa (for examples, see section 9.3). Furthermore, reshape is partly based on the work of a Stata user, illustrating that Stata’s power for data management is augmented by community-contributed programs that you can easily download.

Each section of this book generally stands on its own, showing you how you can do a particular data management task in Stata. Take, for example, section 2.7.1, which shows how you can read a comma-delimited file into Stata. This is not a book you need to read cover to cover, and I would encourage you to jump around to the topics that are most relevant for you.

Data management is a big (and sometimes daunting) task. I have written this book in an informal fashion, like we were sitting down together at the computer and I was showing you some tips about data management. My aim with this book is to help you easily and quickly learn what you need to know to skillfully use Stata for your data management tasks. But if you need further assistance solving a problem, section 1.4 describes the rich array of online Stata resources available to you. I would especially recommend the Statalist listserver, which allows you to tap into the knowledge of Stata users around the world.

If you would like to contact me with comments or suggestions, I would love to hear from you. You can write me at MichaelNormanMitchell@gmail.com, or visit me on the web at http://www.MichaelNormanMitchell.com. Writing this book has been both a challenge and a pleasure. I hope that you like it!

Chapter 1

Introduction

It has been said that data collection is like garbage collection: before you collect it you should have in mind what you are going to do with it.

—Russell Fox, Max Gorbuny, and Robert Hooke

1.1 Using this book

As stated in the title, this is a practical handbook for data management using Stata. As a practical handbook, there is no need to read the chapters in any particular order. Each chapter, as well as most sections within each chapter, stands alone. Each section focuses on a particular data management task and provides examples of how to perform that particular data management task. I imagine at least two ways this book could be used.

You can pick a chapter, say, chapter 4 on data cleaning, and read the chapter to pick up some new tips and tricks about how to clean and prepare your data. Then, the next time you need to clean data, you can use some of the tips you learned and grab the book for a quick refresher as needed.

Or you may wish for quick help on a task you have never performed (or have not performed in a long time). For example, you want to import an Excel .xlsx file. You can grab the book and flip to chapter 2 on reading and importing datasets in which section 2.3 illustrates importing Excel .xlsx files. Based on those examples, you can read the Excel .xlsx file and then get back to your work.

However you read this book, each section is designed to provide you with information to solve the task at hand without getting lost in ancillary or esoteric details. If you find yourself craving more details, each section concludes with suggested references to the Stata help files for additional information. Because this book is organized by task, whereas the reference manuals are organized by command, I hope this book helps you connect data management tasks to the corresponding reference manual entries associated with those tasks. Viewed this way, this book is not a competitor to the reference manuals but is instead a companion to them.

I encourage you to run the examples from this book for yourself. This engages you in active learning, as compared with passive learning (such as just reading the book). When you are actively engaged in typing in commands, seeing the results, and trying variations on the commands for yourself, I believe you will gain a better and deeper understanding than you would obtain from just passively reading.

To allow you to replicate the examples in this book, the datasets are available for download. You can download all the datasets used in this book into your current working directory from within Stata by typing the following commands:

You can also download any commands used in this book by further typing

After issuing these commands, you could then use a dataset, for example, wws.dta, just by typing the following command:
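The command listing itself did not survive in this copy of the text. As a minimal sketch, assuming wws.dta has already been downloaded into the current working directory, loading it might look like this (the clear option and the describe check are additions for illustration, not part of the original passage):

```stata
* Load wws.dta from the current working directory.
* Stata appends the .dta extension automatically.
* clear (an addition here) discards any dataset already in memory.
use wws, clear

* Optional check: show the number of observations and variables.
describe, short
```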

Tip! Online resources for this book

You can find all the online resources for this book at the book’s website:

https://www.stata-press.com/books/data-management-using-stata/

This site describes how to access the data online and will include an errata webpage (which will hopefully be short or empty).

of a section, but that strategy might not always work. Then, you will need to start from the beginning of the section to work your way through the examples. Although most sections are independent, some build on prior sections. Even in such cases, the datasets will be available so that you can execute the examples starting from the beginning of any given section.

Although the tasks illustrated in this book could be performed using the Stata point-and-click interface, this book concentrates on the use of Stata commands. However, there are two interactive or point-and-click features that are so handy that I believe even command-oriented users (including myself) would find them useful. The Data Editor (as illustrated in section 2.9) is a useful interactive interface for entering data into Stata. That same section illustrates the use of the Variables Manager. Although the Variables Manager is illustrated in the context of labeling variables for a newly created dataset, it is equally useful for modifying (or adding) labels for an existing dataset.

1.2 Overview of this book

Each chapter of this book covers a different data management topic, and each chapter pretty much stands alone. The ordering of the chapters is not like that in a traditional book, where you should read from the beginning to the end. You might get the most out of this book by reading the chapters in a different order than that in which they are presented. I would like to give you a quick overview of the book to help you get the most out of the order in which you read the chapters.

This book is composed of 12 chapters, comprising this introductory chapter (chapter 1), informational chapters 2–11, and an appendix.

About The Cover

I frequently think of constructing a building as a useful metaphor for data management. A hastily constructed building will be weak and need lots of maintenance simply to keep it from falling over. A sturdy building arises from good design, quality materials, careful assembly, and attention to detail. In this book, I aim to illustrate best practices that will help you design and create sturdy datasets that will be strong and require the least amount of maintenance over time.

The following five chapters, chapters 2–6, cover nuts-and-bolts topics that are common to every data management project: reading and importing data, saving and exporting data, data cleaning, labeling datasets, and creating variables. These topics are placed at the front because I think they are the most common topics in data management; they are also placed in the front

The next three chapters, chapters 7–9, cover tasks that occur in many (but not all) data management projects: combining datasets, processing observations across subgroups, and changing the shape of your data.

Chapters 10 and 11 cover programming for data management. Although the topics in these chapters are common to many (if not all) data management projects, they are a little more advanced than the topics discussed in chapters 2–6. Chapter 10 describes how to structure your data analysis to be reproducible and describes a variety of programming shortcuts for performing repetitive tasks. Chapter 11 builds upon chapter 10 and illustrates how Stata programs can be used to solve common data management tasks. I describe four strategies that I commonly use when creating a program to solve a data management task and illustrate how to solve 10 data management problems, drawing upon these strategies as part of solving each problem.

Appendix A describes common elements regarding the workings of Stata. Unlike the previous chapters, these are fragments that do not pertain to a particular data management task yet are pervasive and hence are frequently referenced throughout the book. The earlier chapters will frequently refer to the sections in the appendix, providing one explanation of these elements rather than repeating explanations each time they arise. The appendix covers topics such as comments, logical expressions, functions, if and in, missing values, and variable lists. I placed the appendix at the back to help you quickly flip to it when it is referenced. You may find it easier to read over the appendix to familiarize yourself with these elements rather than repeatedly flipping back to it.

The next section describes and explains some of the options that are used with the list command throughout this book.

1.3 Listing observations in this book

This book relies heavily on examples to show you how data management commands work in Stata. I would rather show you how a command works with a simple example than explain it with lots of words. To that end, I frequently use the list command to illustrate the effect of commands. The default output from the list command is not always as clear as I might hope. Sometimes, I add options to the list command to maximize the clarity of the output. Rather than explain the workings of these options each time they arise, I use this section to illustrate these options and explain why you might see them used throughout the book.

For the first set of examples, let’s use wws.dta, which contains 2,246 hypothetical observations about women and their work.

For files with many observations, it can be useful to list a subset of observations. I frequently use the in specification to show selected observations from a dataset. In the example below, we list observations 1–5 and see the variables idcode, age, hours, and wage.
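The original listing is missing from this copy. A minimal sketch of the commands described, assuming wws.dta is in the current working directory (the clear option is an addition for illustration):

```stata
* Load the dataset, discarding any data already in memory.
use wws, clear

* List four variables, restricted to observations 1 through 5.
list idcode age hours wage in 1/5
```

The in 1/5 qualifier limits the command to the first five observations; a range such as in 10/20 works the same way.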

Sometimes, variable names are so long that they get abbreviated by the list command. This can make the listings more compact but also make the abbreviated headings harder to read.

The abbreviate() option can be used to indicate the minimum number of characters the list command will use when abbreviating variables. For example, specifying abbreviate(20) means that none of the variables will be abbreviated to a length any shorter than 20 characters. In the book, I abbreviate this option to abb() (for example, abb(20), as shown below). Here this option causes all the variables to be fully spelled out.
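The long-named variables used in the book's own listing cannot be recovered from this copy, so the sketch below reuses the four variables named earlier, purely to show the syntax. Those names are short enough to display in full either way; the option's effect only becomes visible with longer variable names.

```stata
* Assumes wws.dta is in the current working directory.
use wws, clear

* Full option name: allow up to 20 characters before abbreviating.
list idcode age hours wage in 1/5, abbreviate(20)

* Equivalent short form of the same option.
list idcode age hours wage in 1/5, abb(20)
```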

When the variable listing is too wide for the page, the listing will wrap on the page. As shown below, this listing is hard to follow, and so I avoid it in this book.

Sometimes, I add the noobs option to avoid such wrapping. The noobs option suppresses the display of the observation numbers, which occasionally saves just enough room to keep the listing from wrapping on the page.

The example from above is repeated below with the noobs option, and enough space is saved to permit the variables to be listed without wrapping.
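The wide listing from the book is not recoverable here, so this sketch only illustrates the noobs syntax, again reusing the variables named earlier and assuming wws.dta is in the current working directory:

```stata
use wws, clear

* noobs drops the leading column of observation numbers,
* which frees a few characters of width in the listing.
list idcode age hours wage in 1/5, noobs
```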

For the remaining examples, let’s use tv1.dta, which contains 10 observations about the TV-watching habits of four kids.

Note how a separator line is displayed after every five observations. This helps make the output easier to read. Sometimes, though, I am pinched for space and suppress that separator to keep the listing on one page. The separator(0) option (which I abbreviate to sep(0)) omits the display of these separators.
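A minimal sketch of the two listings being contrasted, assuming tv1.dta is in the current working directory:

```stata
use tv1, clear

* Default listing: a separator line appears after every five observations.
list

* sep(0), short for separator(0), suppresses the separator lines.
list, sep(0)
```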

In other cases, the separators can be especially helpful in clarifying the grouping of observations. In this dataset, there are multiple observations per kid, and we can add the sepby(kidid) option to request that a separator be included between each level of kidid. This helps us clearly see the groupings of observations by kid.
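This grouping can be sketched as follows, assuming tv1.dta is in the current working directory and that its observations are stored in kidid order (otherwise the separators would not fall neatly between kids):

```stata
use tv1, clear

* Draw a separator line whenever the value of kidid changes,
* so each kid's observations appear as one visual block.
list, sepby(kidid)
```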

This concludes this section describing options this book uses with the list command. I hope that this section helps you avoid confusion that could arise by having these options appear without any explanation of what they are or why they are being used.

Shout-out! Stata on the Internet

Did you know that Stata is on Facebook? On Twitter? That Stata has a YouTube Channel filled with Stata video tutorials? Find out more by typing help internet.

1.4 More online resources

There are many online resources to help you learn and use Stata. Here are some resources I would particularly recommend:

The “Stata resources and support” webpage provides a comprehensive list of online resources that are available for Stata. It lists official resources that are available from StataCorp as well as resources from the Stata community. See https://www.stata.com/support/.

The “Resources for learning Stata” webpage provides a list of resources created by the Stata community to help you learn and use Stata; see https://www.stata.com/links/resources-for-learning-stata/. Among the links included there, I would highly recommend the UCLA IDRE Stata web resources at https://stats.idre.ucla.edu/stata/, which include FAQs, annotated Stata output, and textbook examples solved in Stata.

Stata video tutorials help you get started quickly on specific topics. Stata has recorded over 250 short video tutorials demonstrating how to use Stata and solve specific problems. The videos cover topics like simple linear regression, time series, descriptive statistics, importing Excel data, Bayesian analysis, t tests, instrumental variables, and even more! You can access these videos and learn more about them by typing help videos or by searching for “Stata YouTube” with your favorite web browser and search engine.

The “Frequently asked questions on using Stata” webpage is special because it not only contains many frequently asked questions but also includes answers! The FAQs cover common questions (for example, how do I export tables from Stata?) as well as esoteric ones (for example, how are estimates of rho outside the bounds handled in the two-step Heckman estimator?). You can search the FAQs using keywords, or you can browse the FAQs by topic. See https://www.stata.com/support/faqs/.

“Statalist” is an independently operated listserver that connects over 3,000 Stata users from all over the world. I can say from personal experience that the community is both extremely knowledgeable and friendly, welcoming questions from newbies and experts alike. Even if you never post a question of your own, you can learn quite a bit from searching the vast archive of answers. See https://www.statalist.org/.

The Stata Journal is published quarterly with articles that integrate various aspects of statistical practice with Stata. Although current issues and articles are available by subscription, articles over three years old are available for free online as PDF files. See https://www.stata-journal.com/.

The Stata Blog: Not Elsewhere Classified includes entries that explore many Stata topics, including data management, programming, simulation, random numbers, and more! The Stata Blog postings can be detailed, topical, and even whimsical. See https://blog.stata.com/.

Chapter 2

Reading and importing data files

Stata rhymes with data.

—An old Stata FAQ

2.1 Introduction

You have some data that you are eager to analyze using Stata. Before you can analyze the data in Stata, you must first read the data into Stata. This chapter describes how you can read several common types of data files into Stata. This section gives you an overview of some of the common issues you want to think about when reading and importing data files in Stata.

Changing directories

To read a data file, you first need to know the directory or folder in which it is located and how to get there.

Say that you are using Windows and you have a folder named mydata that is located in your Documents folder. Using the cd command shown below changes the current working directory to the mydata folder within your Documents folder.

Say that you are using Unix (for example, Linux or macOS) and your data files are stored in a directory named ~/statadata. You could go to that directory by typing
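The command listings are missing from this copy. The Unix case can be sketched as follows, assuming a directory named statadata exists in your home directory (the pwd and dir lines are additions for illustration, to confirm where you ended up):

```stata
* Change the current working directory to ~/statadata.
cd ~/statadata

* Confirm the change by displaying the current working directory.
pwd

* List the Stata datasets found there.
dir *.dta
```

The Windows command is not reconstructed here because the exact path used in the book is not recoverable from this text.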

Consider the partially complete cd command shown below.

After typing the forward slash, we can press the Tab key to activate tab completion, showing a list of possible folders that can be chosen via keystrokes or mouse clicks.

For further information on these navigational issues, see the Getting Started with Stata manual. From this point forward, I will assume that the data files of interest are in your current directory.

Tip! Using the main menu to change directories

In the previous examples, the directory or folder names were short and simple, but in real life, such names are often long and typing them can be prone to error. It can be easier to point to a directory or folder than it is to type it. If you go to the File menu and then select Change working directory..., you can change the working directory by pointing to the directory or folder rather than having to type the full name.

What kind of file are you reading?

There are several types of data files that you can read and import into Stata. This chapter begins by illustrating how you can read Stata datasets into Stata. As you would expect, it is simple to read Stata datasets into Stata. Section 2.2 describes how to read Stata

illustrate how you can import dBase .dbf files, including dBase III files and dBase IV files.

Note! Reading versus importing

In this chapter, I will sometimes talk about “reading” data and sometimes talk about “importing” data. In both instances, you are retrieving an external file and placing it into memory. In general, I will talk about reading a

Title	Data Management Using Stata: A Practical Handbook
Author	Michael N. Mitchell
Advisors	Bill Rising, Kristin MacDonald, Adam Crawley
Institution	StataCorp LLC
Type	Book
Year of publication	2020
City	College Station
