Epidemiology in Practice
Data Management Using EpiData and Stata
Course Manual
This document contains the material written to support teaching
sessions in data management, which are part of the Epidemiology in Practice module for London-based MSc students. The material is designed to introduce students to the practicalities of managing data in epidemiological studies, including field procedures, questionnaire design, the cleaning and checking of data, and organising a dataset in preparation for analysis.
IMPORTANT NOTE:
This copy is for DL MSc Epidemiology students. Please note that the datasets used are part of the EPM103 materials, and DL students will need to access these from the core module CD-ROM (rather than from the location described in the document).
LONDON SCHOOL OF HYGIENE AND TROPICAL MEDICINE
October 2011
No part of this teaching material may be reproduced by any
means without the written authority of the School given by the
Secretary and Registrar
Contents

Introduction to the Unit

Session 1: Introduction to data management and questionnaire design
1.1 Introduction to data management
1.2 Designing a data management strategy
1.2.1 What do we mean by data?
1.2.2 Data management considerations at the preparation phase
1.2.3 Data management considerations in the fieldwork
1.2.4 Data management strategy and manual
1.2.5 Database construction
1.3 Questionnaire design
1.3.1 General guidelines for writing questions
1.3.2 Other considerations
1.3.3 Coding responses
1.3.4 Identifying respondents and data collection forms
1.3.5 General layout - introduction
1.3.5.1 Ordering of questions
1.3.5.2 Instructions
1.4 Pilot testing
1.5 Quality control
1.6 Information and consent

Practical 1: Questionnaire design

Practical 2: Introduction to EpiData
2.1 Introduction to EpiData
2.1.1 Creating a data file
2.1.2 Datasets, cases and files
2.1.3 Defining database file structure
2.1.4 Variable names
2.1.5 Variable types
2.1.6 Null values – missing and not applicable data
2.1.7 Identifying (ID) numbers
2.2 Data quality and checking
2.2.1 Errors in data entry
2.2.2 Quality and validity
2.2.3 Data checking functions in EpiData
2.3 The onchocerciasis dataset
2.4 Using EpiData
2.4.1 Starting EpiData
2.4.2 Creating a QES file using the EpiData editor
2.4.3 Saving the QES file
2.4.4 Creating a data (.REC) file
2.4.5 Entering data
2.4.6 Variable names
2.4.7 Finding records
2.4.8 Summary so far
2.5 Adding checks in EpiData
2.5.1 Creating specified ranges
2.5.2 Creating labels for variables
2.5.3 Setting up checks
2.5.4 Specifying a KEY UNIQUE variable
2.5.5 Limitations of interactive data checking
2.6 Double entry and validation
2.7 Exporting data from EpiData
2.7.1 The onchocerciasis dataset
2.7.2 Exporting to Stata
2.7.3 Folder management in Stata

Practical 3: Data management using Stata 12 – basics
3.1 Introduction
3.2 When to save and when to clear?
3.3 Do files
3.3.1 What is a do file?
3.3.2 Comments in do files
3.3.3 The “set more off” command
3.3.4 A do-file template
3.4 The display command
3.5 Log files
3.6 Appending data
3.7 Naming variables, labeling variables and labeling values
3.7.1 Variable names
3.7.2 Variable labels
3.7.3 Value labels
3.8 Describing data
3.8.1 Equal signs in Stata commands – one or two?
3.9 Creating and transforming variables – gen, replace, recode
3.9.1 Recategorising data
3.10 The egen command
3.10.1 What’s the difference between gen and egen?
3.11 Exercise

Practical 4: Data management using Stata 12 – essentials
4.1 Types of variable – string and numeric
4.1.1 Converting string variables to numeric (destring, encode)
4.1.2 Converting numeric variables to string (decode)
4.2 Collapsing datasets
4.3 Reshaping data
4.4 Merging files

Practical 5: Data management using Stata 12 – advanced
5.1 Dates
5.2 Identifying duplicate observations
5.3 Shortcuts for repeating commands: using foreach loops
5.4 Working with sub-groups – use of Stata system variables
5.5 Final integrating exercise

Appendix 1: Example data processing manual
Appendix 2: The progress of a questionnaire
Appendix 3: Questionnaires to enter
Appendix 4: Common Stata commands
Data Management using EpiData and Stata

Introduction

The aim of these sessions is to equip you with the practical and essential epidemiological skills needed to design a questionnaire, enter data using EpiData, and prepare data for analysis in Stata 12.
By the end of this Teaching Unit, you should:
• appreciate the need for a coherent data management strategy for an epidemiological study
• understand the principles of good questionnaire design
• be able to create an EpiData database to translate data from the questionnaire to the computer
• be able to enter and verify data in EpiData
• know how to transfer data from EpiData to Stata and other statistical packages
• be able to create and use Stata do- and log-files
• know how to undertake common management tasks in Stata, including merging, appending, and collapsing files
• be able to use common Stata commands to generate, recode and replace variables
• be familiar with more advanced topics such as dates and substrings
The total class contact time for Data Management is 15 hours, consisting of five three-hour sessions. You are advised to read the notes for each practical session before the class and to work through the material for each session. Each week’s exercises build on work completed in earlier sessions. If you cannot complete the work within the three face-to-face hours, you are strongly advised to complete the practical in your own time.
Recommended texts
An excellent book which covers essential issues of data management and questionnaire design is:
Field Trials of Health Interventions in Developing Countries: A Toolbox
Edited by Smith & Morrow; WHO
A useful book on Stata is:
A Short Introduction to Stata for Biostatistics
Hills & De Stavola
Timberlake Consultants Ltd
There are several copies of both books in the library
Dataset
During this course we will use data from a randomised controlled trial of an intervention against river blindness (onchocerciasis) in Sierra Leone, West Africa.
Onchocerciasis is a common and debilitating disease of the tropics. It is chronic, and affects the skin and eyes. Its pathology is thought to result from the cumulative effects of inflammatory responses to immobile and dead microfilariae in the skin and eyes. Microfilariae are tiny worm-like parasites deposited in the skin by the blackfly (Simulium), which breeds in fast-flowing tropical rivers. The fly bites and injects parasite larvae under the skin. These mature into adult worms, which produce microfilariae that may migrate to the eye, where they may cause lesions leading to blindness. The worms are detectable by microscopic examination of skin samples, usually snipped from around the hips; severity of infection is measured by counting the average number of microfilariae per milligram of skin examined.
A double-blind, placebo-controlled trial was designed to study the clinical and parasitological effects of repeated treatment with a newly developed drug called ivermectin. Subjects were enrolled from six villages in Sierra Leone, and initial demographic and parasitological surveys were conducted between June and October 1987. Subjects were randomly allocated to either the ivermectin treatment group or the placebo control group. Randomisation was done in London.

The questionnaire in section 2.3 is similar to that used to collect baseline data for the study. It contains questions on background demographic and socio-economic factors, and on subjects' previous experience of onchocerciasis.

Follow-up parasitology and repeated treatment were carried out in five further surveys at six-monthly intervals. The principal outcome of interest was the comparison of microfilarial counts before and after treatment, and between the two treatment groups.
Reference
Whitworth JAG, Morgan D, Maude GH, Downham MD and Taylor DW (1991). A community trial of ivermectin for onchocerciasis in Sierra Leone: clinical and parasitological responses to the initial dose. Transactions of the Royal Society of Tropical Medicine and Hygiene, 85, 92–96.
Files and Variables

You will need the following files during the course. They are currently in the folder u:\download\teach\dataepi. Create a new folder on your drive called h:\dataepi, and use Windows Explorer to copy all the files from u:\download\teach\dataepi into h:\dataepi.
File: baseline (demography)

  Variable   Contents               Values / codes                Missing value code
  VILL       Village number         1–6
  DATE       Date of interview      dd/mm/yyyy
  AGE        Age in years           Positive integers             99
  …          …                      2 = Other
  HHNO       Household number       Positive integers             99

File: microfilariae counts

  Variable   Contents               Values / codes                Missing value code
  SR         Survey round           1–5
  MFRIC      MF count (right)       Positive integers             999
  MFLIC      MF count (left)        Positive integers             999
  SSRIC      Skin diameter (right)  Positive real numbers
  SSLIC      Skin diameter (left)   Positive real numbers
  EOSIN      Eosinophil count       Positive integers             9999
  PCV        Packed cell volume     Positive integers             99
  MPS        Malaria parasites      0 = Negative, 1 = Positive    9

File: TMTCODES.REC (treatment codes)

  Variable   Contents               Values / codes
  DRUG       Drug batch             IVER = drug, PLAC = placebo
  IDNO       Subject ID number
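In EpiData (introduced in Practical 2), each of these variables becomes a field in a questionnaire (.qes) file. As a taste of what is to come, here is a minimal sketch of how the baseline demography fields might be laid out; the layout of the actual course files may differ. In a .qes file, each # stands for one numeric digit and <dd/mm/yyyy> defines a date field.

```
VILL   Village number      #
DATE   Date of interview   <dd/mm/yyyy>
AGE    Age in years        ##
HHNO   Household number    ##
```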
Data Management using EpiData and Stata

Session 1: Introduction to Data Management and Questionnaire Design
In this session we cover the essentials of data management, and give an introduction to questionnaire design
Objectives of this session
After this session, you should understand how to:
1. start planning data sources and computer files for your own studies
2. use a data management strategy as a way of ensuring the quality of data in preparation for analysis
3. outline tasks to be included in the procedures guide for fieldwork and data management
4. start designing a precise and informative questionnaire
1.1 Introduction to data management
An essential part of any epidemiological study is the collection of relevant data on participants. Data will include identification information such as name, age, sex and place of residence; information on the main outcome and exposure; and information on other factors that may be potential confounders of the association under study. This will usually include clinical and laboratory data, which need to be linked to the socio-demographic and behavioural data. Often different sources of data are used - for example, some data may be collected at community level, or samples may be collected from the same individual at different timepoints.

In an epidemiological study, we need to transfer information from the study population to the computer, ready for statistical analysis. There are, however, many things that can go wrong in this process, so that the final dataset may not represent the true situation.
Data management is the process of transferring information from the study population to a dataset ready for analysis
The main aim of data management is to ensure that sufficient and necessary information is collected and processed to answer the research question
[Figure: from reality to the final dataset. Reality is the information about the target population (e.g. height, weight, income, concentration of malaria parasites in blood, number of sexual partners in the last year). From the study population, data pass through data collection, data entry and preparation of the dataset for analysis. The final dataset on the computer should represent reality; analysis of these data is used to draw conclusions about the target population.]
Example: The multi-centre study of sexual behaviour in Africa
As an example, we will consider a cross-sectional study of sexual behaviour in Africa. This study was a population-based survey of around 2000 adults and 300 sex workers in each of four cities in Africa. The aim of the study was to explore the factors associated with the differential spread of HIV in these cities (Buvé et al, AIDS 2001 Aug 15 Suppl 4).

Households in each city were selected randomly, and all adults living in the selected households were eligible for the study.
The following information was collected:
- household information through interview
- individual information (socio-demographic, behavioural) through interviewer-administered questionnaires
- clinical examination for men
- specimen form detailing which biological samples were taken
- laboratory form with test results
Like many epidemiological studies, this was a large, complex and expensive project involving collaborators in several African and European countries. The success of the study depended on two main features of the data processing and management:

1. Standardised data management across all sites
2. Accurate data entry, with all queries being meticulously recorded and processed
A data management strategy had to be designed to facilitate these two objectives
1.2 Designing a data management strategy

A data management strategy sets out the data processing needs of each stage of the research study. The strategy outlines how the work should be tackled, the problems that may be encountered, and how to overcome them.
1.2.1 What do we mean by data?
‘Data’ are all values needed in the analysis of the research question. Results from lab tests and clinical measurements may be data, as can observations and measurements of a community or household.
We must plan how these data are to be used in the study, and how they are collected, stored, processed and made suitable for analysis
Some data may not be used in the analysis, but may feed into the study prior to other activities. For example, qualitative data from focus group discussions may be needed to establish the nature of community norms, which can then be used to design the questionnaire; or community characteristics may be needed in order to stratify or randomise the sample.
1.2.2 Data management considerations at the preparation phase
When writing your study proposal, you must start planning how to manage the data. If you don’t, problems with data processing are likely to delay the study considerably or, even worse, result in poor-quality data and meaningless results.
To ensure that the data collected do represent reality you need to consider the following:
• Study design – is this to be a cross-sectional survey, a case-control study, a cohort study, etc.?
• Sampling – how will you identify eligible respondents in your study population? For example, where will the list of the study sample come from, and what criteria will you use for inclusion in the sample?
• What data are to be collected? What are the outcomes, exposures and potential confounders?
• What questionnaires and forms will be needed? Will data be collected from different sources – questionnaires, clinical records, lab tests, etc.? What hardware and software needs are there?
• What sample size is needed to meet the objectives of the study? What percentage will you add to take account of refusals, loss to follow-up, etc.? What methods can be used to minimise the refusal rate?
• Who will be responsible for the data at each stage? Data will pass through field workers, lab technicians, data entry personnel, data managers and statisticians, all of whom should report to the principal investigator. What role do people external to the study team have (consultants, regional health officers, community leaders, study subjects)?
• At this stage the strategy should include an overview of the budget needed for all aspects of data management:
− printing or copying questionnaires
− transport costs: do you need to buy vehicles, or can you use those belonging to your institution?
− consumables: clipboards; paper; pencils; printer paper; sticky labels; printer ink; computer disks; etc.
− measuring equipment: scales; tape measures; etc.
− laboratory costs: consumables; freezer storage space; cost of sample analysis
− computing costs: will computers be bought, or will use be made of the institution’s computers? Printers, UPS, voltage regulators, etc.
− software for data management and data analysis: this may be covered by the institution, or may need to be purchased separately
1.2.3 Data management considerations in the fieldwork
Before the fieldwork starts, field management of data will have to be considered in more detail. A plan must be drawn up in advance, often without the detailed knowledge that will come later. All procedures must be clearly outlined in a field manual to ensure that all research and field staff have a common understanding of procedures.
• Recruitment of field staff
− how many staff are needed, and what is their primary role (e.g. data entry managers, interviewers)?
− which qualities and skills must the staff have?
− how will staff be selected?
− how will staff be trained?
• Logistics
− how is transport to be organised?
− what are the accommodation needs for field staff?
− what demands will there be for efficient administrative management (computers, photocopying, printing, etc.)?
• Designing the questionnaire/forms
− identify what questionnaires and forms are needed
− compose the questions
− assemble the questionnaire
− pilot test the questionnaire, making changes as needed
− code the responses
− design computer data entry processes
• Field manual
− write a field manual. It should identify all the data that are to be collected and outline how they will be obtained. In addition, protocols should be clearly laid out for dealing with problems that may arise - for example, how the following situations are to be resolved, and by whom:
   A field supervisor finds that the wrong person has been interviewed.
   A supervisor finds that an answer box has not been filled in correctly.
− the field manual should also lay out procedures for supervision and quality control: by whom, where and when
Example continued : multi-centre study
For the multi-centre study, there were five questionnaires or forms for each individual, and the data management manual detailed how the data were to be entered and coded. In each city there was a team of data entry staff, local field staff and an experienced local supervisor. There was also a local data manager who, together with the research statistician, recruited and trained the data entry staff.

The manual detailed the process for supervision and quality control of interviewers, clinical staff, data entry staff and lab staff (see Appendix 1).
1.2.4 Data management strategy and manual
The data management strategy should describe:
• the databases to be created, and outline the variables in each database
• the purpose of each database and into what format the data will be translated
• how the databases will be linked
• the programs for data entry, data cleaning and data checking
The data management strategy should include an outline plan for data analysis. The main outcome and exposure variables and risk factors of interest should be identified, along with potential confounders and ways of dealing with them (stratification or multivariate analysis).
The methods of analysis should be outlined - for example, tabulation of variables followed by logistic regression (see journal guidelines).
A data management manual must be produced. It is good practice to have written and explicit instructions on all aspects of data management filed with the data managers.
The data management manual should explain the following:
• The flow of the questionnaires and forms from the field to final storage
• The data entry system. It should include the following:
− a list of filenames;
− which computers will be used for first and second data entry;
− which data entry staff will input the data;
− who will compare first and second entry;
− how data entry errors will be resolved
• Who will merge entered data, and who will keep the master dataset
• Procedures for entering data which have been corrected following discussions with the field manager. This may happen when data are updated from later rounds in a study. It is critical that the procedures describe when, where and how such data are updated.
• Procedures for backing up data (if you don’t do this you can lose the data!)
• Procedures for running virus check programs
• How to produce sticky labels and lists of subjects. These may be used, for example, to identify subjects with missing data.
• How to produce lists or summary statistics to facilitate fieldwork (e.g. response rates, respondents needing follow-up, etc.)
• Who has responsibility for maintaining hardware and software and how s/he should undertake the task
The data management manual for the multi-centre study is given in Appendix 1
Further details on preparing a data management manual are given in Appendix 2
1.2.5 Database construction
Each questionnaire or source of data is associated with a database. Each of these databases must be created, and the relationships between the data in the different databases defined. Each database has one variable for each question in the questionnaire, and one record for each subject responding to that questionnaire. Each subject has a unique identifier, which allows data in one database to be linked with data in another database.

Linking questionnaires may be through a simple one-to-one relationship, by which every database has one record (observation) for each subject. However, some studies may have multiple records for each subject. For example, in the multi-centre study there were usually several eligible individuals within one household, so information on the household form had to be matched to more than one individual.
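In Stata (merging files is covered in Practical 4), the household-to-individual linkage described above corresponds to a many-to-one merge on the shared identifier. The sketch below assumes hypothetical files individual.dta and household.dta, each containing the household number hhno:

```stata
* Sketch only: file and variable names are hypothetical.
use individual, clear            // one record per subject, carries hhno
merge m:1 hhno using household   // attach household data to each member
tabulate _merge                  // check how many records matched
```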
1.3 Questionnaire design
Writing good questions is a lot harder than it first appears!
1.3.1 General guidelines for writing questions
• questions should relate directly to the objectives of the study
• keep the questions short
• questions must be simple, clear and unambiguous - don't use technical jargon
• ask only one thing at a time (e.g. avoid questions like ‘do the cost and times of the clinic prevent you attending?’)
• avoid open questions beginning with ‘Why’
• avoid negative questions (e.g. ‘Do you not think…?’)
• avoid hypothetical questions (e.g. ‘If the bus fare was cheaper, would you come to the clinic more often?’)
• pay attention to sensitive questions, for example about sexual behaviour
• check the adequacy of the lists of responses to closed questions (for example, ensure a food list covers most things normally eaten in the community concerned)
• avoid a large proportion of responses being in the ‘other (specify) ’ category
• translate questions into local languages and then have them back-translated into the original language
1.3.2 Other considerations
• Cultural relevance: Are all your concepts meaningful to your respondents? Might you need
to do qualitative research first to clarify this?
• Recall problems: How far back are you expecting your respondents to recall accurately? A
suitable period will depend on the event you are talking about and who you are talking to about it
• Bias due to wording of questions: Try to keep the wording of the questions "neutral" so
that respondents don't just give the reply that they think is expected of them - especially for attitudinal questions
• Sensitive questions:
− is your question acceptable?
− is it practical to expect to be able to get this information?
− might an indirect approach using several questions help? There are special techniques available for ensuring confidentiality, but they may not be feasible in all situations
− put sensitive questions towards the end of the questionnaire (see ordering)
• Questions in a verbatim questionnaire: These questions are written exactly as they are to be read out. This carries the advantages of standardisation and reduces the amount of interviewer training needed. The technique can detract from the personal element of an interview, but it should be departed from only by more skilled interviewers, with advance attention to standardisation.
• Questions, scales and indices from previous research: In practice, you would carry out
a literature search to study the methodologies used in similar research. You may be able to adapt or adopt questions, scales or indices used by other researchers.
1.3.3 Coding responses
Coding of the data is necessary for computer data entry and statistical analysis
Coding involves classifying responses into a defined set of categories and attaching numerical codes. You should devise your coding scheme so that the whole range of responses is covered (exhaustive) and any response can be allocated only one code (exclusive).
How do we code the categories?
Sometimes the categories are represented by numbers, and the respondent or interviewer does the coding by ringing or ticking the appropriate number.
Example: What type of place do you live in?

A relative’s or friend’s home 3
Other (please specify) _______ 5
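Once numerical codes have been chosen, statistical packages can carry the category descriptions alongside the codes. In Stata (value labels are covered in section 3.7.3) this might look like the following sketch, using a hypothetical variable place and only the two codes visible in the example above:

```stata
* Sketch only: the variable name and label name are hypothetical.
label define placelbl 3 "Relative's or friend's home" 5 "Other"
label values place placelbl   // attach the labels to the coded variable
tabulate place                // tabulations now show labels, not raw codes
```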
Alternatively, questionnaires may have boxes for the codes down the right-hand side. Office personnel then enter codes in these boxes according to the response ticked or ringed by the respondent. This type of coding is sometimes used for self-administered questionnaires, where the use of numbers to represent categories might be off-putting to respondents.
Example: What type of place do you live in? Please ring
When deciding about categorizing and coding responses it is necessary to consider:
• How it will affect the smooth running of the interview, and how easy it will be to complete
• How it will affect accuracy (e.g. the more data are transcribed, the more chances for error)
• How many questionnaires are involved
For some questions, more than one response is possible To deal with this, have one coding box for each possible response
Example: for the question "Where do you obtain your water?" you might have boxes for "Well", "River" and "Taps", and codes 1 = Yes, 2 = No. For an individual answering "I get most of my water from the well, but also some from the river", you would put "1" in the well and river boxes, and "2" in the box for taps. This method is recommended unless the number of responses is large.
Coding can be done at various stages:
• By the respondent: the interviewer/questionnaire asks him/her to select from a list of
responses
• By the interviewer: the respondent answers freely, and the interviewer ticks or circles the appropriate response on the questionnaire
• After the interview (either by the interviewer or by other personnel in the office): the response is recorded verbatim and categorised later
When the forms are ready for coding, examine a sample of the questionnaires and decide on appropriate categories. Do this before starting the actual coding process.

Avoid changing the codes after coding has started. (Sometimes it may be all right to add an extra code if a completely new response turns up.)
If the number of forms completed is large, coding after the study is complete is undesirable and may be impractical; in this case careful pilot testing and pre-coding (which are always strongly desirable) are essential.
1.3.4 Identifying respondents and data collection forms
A key aspect of any questionnaire is identification of respondents
• Each respondent should be clearly and uniquely identified. Name and identification (ID) number are the minimum.
• If information is being collected on the same person or household in repeated surveys, identification information can be pre-printed on the forms (e.g. by printing ID details on sticky labels and attaching a label to each form). This reduces copying errors, saves time in the field and guards against individuals being omitted from the survey.
• If more than one questionnaire is completed for each respondent, each questionnaire must be uniquely identified as well as each respondent. For instance, numbers can be stamped on the top right-hand corner of the form using a hand-held numbering stamp.
• For reasons of confidentiality, names are not entered on the computer. Hence an ID number is essential.
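EpiData can enforce ID uniqueness at data entry time (see section 2.5.4 on KEY UNIQUE variables). In a check (.chk) file, a minimal sketch might read as follows, assuming the ID field is called idno:

```
idno
  KEY UNIQUE
END
```

With this check in place, EpiData refuses a second record with an ID number that has already been entered.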
1.3.5 General layout - Introduction
The questionnaire should begin with a brief introduction explaining the purpose of the survey
• The questionnaire must be easy to read and to use
• Printing (or photocopying) gives better-quality copies, but stencil duplicating is usually cheaper
• Use good quality paper - it may have to stand up to rough handling
• Choose paper size carefully; A4 is often a good choice. Small pieces of paper are much harder to sort through, and are not so convenient on a clipboard
• Layout of the typing is extremely important. Leave plenty of space for the answers to each of the questions. Use bold for emphasis, and vary the point size to distinguish text for different purposes (instructions, questions, etc.). Pay attention to the general look of the questionnaire; make it pleasing and easy to read
• Different questionnaires can be easily distinguished from each other if different coloured paper or ink is used for them
1.3.5.2 Instructions
• It is sometimes necessary to print instructions on the questionnaire to guide the respondent/interviewer. This is especially important where there are branches or jumps, e.g. ‘if answer is "No", go to question 9’
• Transitional statements: e.g. to explain that you are moving on to a new subject, or to define a term you are using. In an interview using a verbatim questionnaire, they should be written exactly as they are to be read
• Distinguish instructions from the things which are to be read out by using a different point size, or by printing in italics or capitals
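Branching instructions of this kind can also be enforced automatically at data entry. In an EpiData check (.chk) file (checks are introduced in section 2.5), the ‘if answer is "No", go to question 9’ example might be sketched as follows, assuming hypothetical field names q8 and q9 and the coding 2 = No:

```
q8
  JUMPS
    2 q9
  END
END
```

When the value 2 is entered in q8, the cursor skips directly to q9, so the paper skip pattern and the computer skip pattern stay in step.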
1.4 Pilot testing
Pilot testing means trying out the draft questionnaire in a population similar to the target population. However much effort you put into designing your questionnaire, it is absolutely essential to pilot test it. Look carefully at the responses received during the pilot test to ensure that the data collected are as anticipated and are meaningful.

At this stage you should also test your coding scheme and your data entry system, and even try out some analysis. Test the data collection, coding and entry system on at least 10 respondents. Do another pilot after each substantial modification before proceeding to the main data collection.
The pilot may also be used to validate the questions. It may be appropriate to ask two different questions to see which provides more useful information. The translation of the questions may be validated in the pilot. Alternatively, a pilot population may be used where previous information already exists, so that the new questionnaire and its delivery can be validated.
In some studies a pilot study of the questionnaire is incorporated into the training process for the interviewers. This enables supervisors to assess the quality of the completed questionnaires across the interviewers.
1.5 Quality control
Quality control is an essential aspect of data management, and is needed in all aspects of the study:
• for questionnaire response,
• for laboratory testing,
• for clinical examination
Some studies randomly select 10% of subjects to go through quality control. A separate questionnaire is used for this, and it is administered by a person other than the initial interviewer. All staff should know that their performance is monitored. The results of the quality control should be conveyed to the interviewers. Interviewers performing badly should be replaced.
1.6 Information and consent
Full information about the study must be given to informants before the questionnaire is administered. The Declaration of Helsinki governs the ethical principles around medical research. Full informed consent is an established principle that cannot be compromised. Subjects can be informed of the benefits of the study to individuals and to the community, but must also be made aware of any possible harm they may suffer.

Subjects should be informed of every activity they are expected to be involved in, and when these are likely to take place. They should be aware that participation is voluntary, and that the health care they receive will not be changed if they refuse. Subjects should be told how confidentiality will be maintained, and of their rights under national and international law.
Coercion and deception in order to encourage participation in the study or intervention are unacceptable. However, it is difficult to define indirect coercion and to safeguard subjects from it. Incentives are often given to recompense subjects for their time and expenses, and there is a fine line between incentives and coercion.
It is normal to obtain the consent of the subject prior to the questionnaire, but this may happen after the questions have been asked. Witnessed, signed consent after full information has been given to the subject is the 'gold standard'. However, it may be impractical to get impartial witnesses, and some subjects may not be able to read and write. Parents may be able to give consent for a child (under the age of consent), but the child himself/herself retains the right to refuse if they do not want to participate.
Some studies may obtain community consent for study procedures. It is important to inform community leaders of the study, and to get consent for the activities, especially where these affect the community at large but do not affect individuals. However, where individuals are involved in study activities they retain their right to consent or refuse. For example, community education programmes to raise awareness of STD and HIV may not require the consent of every individual in the community to be started and delivered. However, individuals may want to opt out of some lessons, or refuse to participate in evaluation exercises.
Practical 2: Introduction to EpiData
Objectives of this session
By the end of this section you should:
• understand what is meant by a database, a file, a case and a variable
• understand the types of variables used in EpiData
• be able to create a questionnaire (.qes) file in EpiData
• be able to create a database (.rec) file in EpiData
• be able to enter data into the file
• understand the principles of quality control, including data checking and double entry of data
• be able to export data from EpiData to other programmes
Please read sections 2.1 to 2 before the session
2.1 Introduction to EpiData
In this session, we will use the software package EpiData for data entry. There are several reasons for this:-
• it has been specifically written for use in research studies and designed to simplify each stage of the data management process
• it is easy to use
• it is distributed free of charge
• it does not require a powerful computer
• it demonstrates the principles of data checking
• it can export data in formats that can be read by virtually every statistical, database, or spreadsheet package
You can learn the principles of data management with EpiData, and these concepts and techniques are valid whichever package you may use in the future
We use EpiData to undertake the following functions:
• write and edit text - to create questionnaire files and to edit the files that contain checking and data-coding rules
• enter data - to make data files from questionnaire files and to enter, edit, and query data
• carry out interactive checking while data are being entered - to make sure that you have entered legal values in an appropriate place, e.g. using skip patterns and coding rules
• carry out batch checking after data have been entered
We recommend that data are exported to a statistical package for further cleaning and analysis.
2.1.1 Creating a data file
Creating a database file in EpiData is a two stage process:
Stage 1: First you click on Define Data to create a data entry form. This file is called a questionnaire file and must have the extension ".QES". It includes information on variable names, types, and lengths, and defines the layout of the data entry form and the structure of the data file.
Stage 2: EpiData uses this QES file to create a record file. This file has the extension .REC, and data is entered into this file.
Text editor → .QES file → Make Data File function → .REC file

The .QES file defines the structure of the data file and the layout of the data entry form; the .REC file holds the data.
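As a sketch of stage 1, a minimal QES file might contain just a few lines like these. The variable names, prompts, and field widths here are illustrative only, and the sketch assumes the convention that the first word of each line serves as the variable name:

```
IDNO  ####
DOV   <dd/mm/yyyy>
AGE   ##
```

Running the Make Data File function on such a QES file would then create a matching .REC file with three variables, ready for data entry.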
2.1.2 Datasets, cases, and files
A data set is stored in the computer as a file. A file consists of a collection of cases or records. Each case contains data in a series of variables.
Data are usually represented in a table in which each row represents an individual case (record), and each column represents a variable (field). For example, a survey file might have one row per subject, with columns (variables) such as Survey, Household number, and Relation to head of household.
2.1.3 Defining database file structure
To define the structure of a database file we need to specify, for each variable, a name, a type, and a length. The variable type chosen will depend on the type of data that the variable is to contain.
An example of the structure is:-

Name   Type               Length
IDNO   numeric            4
SEX    text               1
LIV    logical (yes/no)   1
Each variable has a name which allows us to refer to it for data checking and analysis.
Each variable is of a certain type. The type you choose to assign to a variable depends upon the type of data it will contain. The most common data types are text, numeric, logical, and date.
The length of a variable defines how much information it can hold. A text variable with length ten will be able to hold up to ten letters or numbers. A numeric variable with length three will be able to hold numbers between -99 and 999. The length of a variable must correspond to the highest anticipated number of letters and/or numbers.
2.1.4 Variable names
In EpiData variable names:
− must not exceed eight characters
− must begin with a letter, not a number
− must not contain any spaces or punctuation marks
− can otherwise contain any sequence of letters and digits
Names can describe the variable they refer to (e.g. OCCUP is probably more informative than VAR17), but with large questionnaires it may be easier to use question numbers (e.g. Q17) as variable names.
Examples of illegal variable names are:-
LAST NAME (contains a space)
COUNTRYOFORIGIN (longer than eight characters)
2.1.5 Variable types
Each variable must be of a certain type. The type you choose to assign a variable will depend on the type of data you wish it to contain. EpiData provides the following variable types:-
1 TEXT variables are used for holding information consisting of text or numbers. Text variables are useful for holding information such as names and addresses.
2 NUMERIC variables are used for holding numerical information. They can be used for holding categorical or continuous data. Numeric variables can be defined to hold either integers (whole numbers) or real numbers (numbers with a fractional part).
3 LOGICAL (YES/NO) variables are used for holding data that can have only two possible states, such as whether a respondent has been ill or not. Logical variables can hold either the character 'Y' or the character 'N' (which may also be entered as '1' and '0'). LOGICAL variable types are appropriate for binary categorical data.
4 DATE variables are used to hold dates. DATE variables can be used to hold data in the American (mm/dd/yyyy) and European (dd/mm/yyyy) formats. The advantage of using date type variables is that EpiData will only allow you to enter valid dates. DATE type variables also simplify calculations, as factors such as variable month length and leap years are accounted for.
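As a sketch, the four types correspond to QES field definitions like these (the prompts and field widths are illustrative examples, not taken from any particular study):

```
Name of respondent    ____________________
Weight in kg          ###.#
Ill in last month?    <Y>
Date of interview     <dd/mm/yyyy>
```

Here underscores define a text field, # characters a numeric field (with an optional decimal part), <Y> a yes/no field, and <dd/mm/yyyy> a European-format date field.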
2.1.6 Null values – missing and not applicable data
Sometimes data items will not be available, or are not appropriate to collect for some respondents (e.g. age at menarche for male respondents). It is important that you take this into account when designing questionnaires, coding schemes, and data files.
Data that is missing or not applicable is called null data. There are two types of null data:-
• When data are not available, they are defined as missing data. It is generally considered bad practice to leave data entry spaces empty on the questionnaire or data entry screen because it can lead to confusion later. Always consider the codes to use when a value is missing. It is common practice to use 9, 99, 999 etc. to denote missing data.
• When data are not available because they are not applicable to collect, they are defined as not-applicable. For example, if someone is not married, then the age of their spouse is not applicable. It is common practice to use 8, 88, 888 etc. to denote not-applicable data. Coding not-applicable data in this way allows you to recode it back to missing or null before analysis.
The coding scheme you decide to use for missing and not-applicable data must be defined in advance and be consistent across variables.
Any field that is left empty at data entry, or any calculation that uses missing data, receives the special missing data code (.). However, it is better to explicitly code missing data as 9, 99, etc.
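In EpiData these conventions can be enforced at entry time. A sketch of a CHECK rule for an age variable that accepts plausible values plus a missing-value code (the range and code here are illustrative assumptions, not from any study protocol):

```
AGE
  RANGE 0 95
  LEGAL
    99
  END
END
```

Values 0 to 95 are accepted, and 99 is additionally legal as the missing-data code, mixing a range check with a legal value.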
2.1.7 Identifying (ID) numbers
When designing a questionnaire or a database file it is important to include a variable that holds a unique value for each case. This makes finding both paper forms and individual cases in a database file easier should you need to query or edit a data item. This variable is called the key.
A database that consists of more than one linked data file is called a relational database. We will cover these further in Practical 3.
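In EpiData the ID variable can be declared as a unique key in the CHECK file, so that duplicate ID numbers are rejected at entry. A sketch, assuming the ID variable is called IDNO:

```
IDNO
  KEY UNIQUE
END
```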
2.2 Data Quality and Data Checking
A fundamental function of data management is to minimise error at all stages of a survey, and not just at the computing stage. To do so effectively, it is necessary to carry out careful checks at all stages.
2.2.1 Errors in data entry
a Minimising errors
Preventing errors relies on an understanding of the types of errors that can occur. They can be broadly categorised as response errors and processing errors.
1 A response error is the difference between the 'true' answer and the answer given by the respondent. The best way to minimise response errors is to have a well designed and piloted questionnaire administered by well trained and motivated staff.
2 A processing error is an error that creeps in during the coding, data entry, and data analysis stages of a survey. The main way of preventing processing errors is extensive checking of data, on the paper forms and on the computer. A well designed data entry system and well trained and motivated data entry staff are essential. Processing errors also occur at the data analysis stage, so it is important to check the results of any recodes, transformations, and calculations. It is a good idea to perform procedures on a small sample of data first, so that the results can be analysed in detail for any mistakes.
b Detecting errors in data
There are four methods of checking data. More than one method may be applied to the same data, as each method has different strengths and weaknesses.
1 Manual Checking. The researcher should check all completed questionnaires at the start of a survey, and continue to check samples of the questionnaires throughout the survey. Manual checking is very important because it takes place very early in the data collection - entry - analysis process. Problems can be detected and corrected in the field, sometimes without the need to re-interview respondents.
2 Checking during data entry (Interactive Checking). EpiData has functions that detect and correct problems with data as they are entered. These functions pick up errors in range, copying, consistency, and routing. The errors may arise from an invalid value written on the paper copy of the questionnaire or from a mistake on the part of the data entry clerk.
3 Checking after data entry (Batch Checking). EpiData can also check data after they have been entered. This is an important facility, since you may wish to examine the coherence of your dataset, e.g. identify unlikely but legal values or identify missing values.
4 Validation (or Verification). This involves double entry of data into different files by different operators. The resulting files are then compared to assess whether they are the same. Validation is useful in picking up transposition and copying errors. EpiData provides functions for double entry and validation of data.
The type of checking you do will depend on the way your study is organised. Interactive checking may slow the data entry process, as each problem is referred to the data supervisor. A compromise approach is sometimes used: data entry staff are warned of an error and asked to confirm that the data entered are the same as those on the data collection form. The operator can then correct the error, or leave it (so that it corresponds to what is on the data collection form). Errors are then detected later with batch checking. This approach minimises the call on the supervisor's time.
c Correcting errors
A strategy is necessary to deal with errors at data entry. If the error is detected in the field, or soon after collection, then it should be possible to re-interview and collect the correct data. Often this will involve asking only a few questions of the respondent.
If the error is detected later it may not be possible to re-interview. There are three options available for dealing with cases that have some errors in them:
1 Ignore the errors and use all the data. This approach is fine only if you are absolutely certain that the errors are not systematic. Even if errors are not systematic, they will add noise to the data and may create a bias towards the null in subsequent statistical analysis. This approach is very rarely used.
2 Mark erroneous data as missing and use the rest of the data. This approach ensures that no obviously erroneous data are included in subsequent analysis. There is a risk that data that are not obviously erroneous may still be of poor quality. This is the most commonly used approach.
3 Throw out all erroneous cases. This is the strongest position and ensures that no erroneous data are included in subsequent analysis. Potentially erroneous data are filtered out on a 'guilt by association' basis. This approach may bias results, depending on who decides which records are to be excluded from subsequent analysis, and is rarely used in practice.
2.2.2 Quality and validity
Most of the checks you can perform by examining questionnaires or data records relate to the validity of the data, but not necessarily to its quality. It is, however, possible to check the quality of the data:
1 Question Redundancy: Ask important questions in different ways at different points in the interview. Responses can then be compared. This is useful at the pilot stage and can help you decide which question elicits the most reliable response. You may decide later that other questions are redundant.
2 Confirmatory Checks: Incorporate checks within the questionnaire (e.g. ask about vaccination history, check against the health card, ask to see the vaccination scar).
3 External Records: Check a sample of questionnaires against external records (e.g. if asking about clinic visits, then check against clinic records).
4 Expected Distributions: It is likely that you will already know something (e.g. from the census or previous studies) about the population under study. You should check that the distributions (e.g. age distribution, sex ratios, etc.) in your data are similar to those you expected.
5 Compare results from different interviewers: Use a variable on the data collection form that identifies the interviewer. You can then compare responses to a particular question among the interviewers.
2.2.3 Data checking functions in EpiData
Even without specifying data checks, EpiData provides a limited form of data checking. It will only allow you to enter numbers into NUMERIC variables, valid dates into DATE variables, and Y, N, 1, or 0 into YES/NO variables. It is much better, however, to specify more specific checks, and EpiData provides functions that allow you to do this.
Interactive Checking: The Checks function allows you to check data as they are being entered (interactive checking). These functions include:-
1 Must-enter variables. You can specify that certain variables must be filled with a value other than missing.
2 Legal values. The input must match one of a specified list of values. The variable can be left blank unless it is designated as a must-enter variable.
3 Range checks. The contents of a variable must lie between two bounding values. You may mix range checks and legal values (e.g. for missing value codes).
4 Repeat variables. The variable on a new record will automatically hold the value from the previous case. This is useful for data that seldom change (e.g. location codes).
5 Conditional jumps. Normally the cursor moves from one variable to the next. However, sometimes you may wish to skip a variable (e.g. Ever pregnant? Y/N may be followed by Date of last pregnancy dd/mm/yyyy). These conditional jumps are used for questionnaire routing during data entry.
6 Programmed checks. EpiData also provides an easy-to-use block-structured programming language that allows you to program more complex checking procedures (e.g. consistency checks).
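These checks are written in a CHECK (.chk) file that sits alongside the .REC file. The sketch below uses variable names from the onchocerciasis questionnaire in section 2.3; the range, the jump rule (skip the years-in-village question when the subject has lived there all their life), and the consistency check are illustrative assumptions, not part of the study protocol:

```
SEX
  MUSTENTER
  LEGAL
    "M" "F"
  END
END

AGE
  RANGE 0 95
END

LIV
  JUMPS
    Y OCC
  END
END

STAY
  AFTER ENTRY
    IF STAY > AGE THEN
      HELP "Years in village cannot exceed age - please re-check"
    ENDIF
  END
END
```

The first block combines a must-enter rule with legal values, the LIV block shows a conditional jump, and the STAY block is a programmed consistency check that warns the operator after entry.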
Batch Checking: Under Document you will find the Validate Duplicate Files function, which allows for batch checking (i.e. all cases at once) of data that have been double-entered.
Batch checks for range and consistency may also be performed using the CHECK programming language which is built into EpiData.
2.3 The onchocerciasis dataset
Now you will use EpiData to enter baseline data for a trial of an intervention against river blindness (onchocerciasis) in Sierra Leone in West Africa.
The trial rationale and design are described in the introduction to this unit. Briefly, we will be looking at a placebo-controlled randomised trial of the effect of the drug ivermectin on infection with microfilariae. Subjects were enrolled from six villages in Sierra Leone (West Africa), and initial demographic and parasitological surveys were conducted between June and October 1987.
A questionnaire was used to collect baseline data on the individuals, including demographic and socio-economic factors, and on subjects' previous experience of onchocerciasis.
Follow-up parasitology and repeated treatment were performed for five further surveys at six-monthly intervals. The principal outcome of interest was the comparison of microfilarial counts, both before and after treatment and between the two treatment groups. The baseline questionnaire is shown here. During this practical you will create a data entry form to enter data from this questionnaire.
Onchocerciasis Baseline Survey - Personal Data
IDNO | | | | |
Name
Village
Date of interview DATE | | | | | | | | |
Age in years AGE | | |
Sex (M or F) SEX | |
Tribe (Mende 1 Other 2) TRIBE | |
Household number HHNO | | |
Relationship to head of household? REL | |
1 Self 5 Other Blood Relative
2 Parent 6 Spouse
3 Child 7 Other Non-Blood Relative
4 Sibling 8 Friend
Have you lived in this village all your life? (Y/N) LIV | |
How long have you lived in this village? STAY | | | (years)
What is your main occupation? OCC | |
0 None / Missing 5 Office Work
1 At Home 6 Trading
2 At School 7 Housework
3 Farming 8 Mining
4 Fishing
2.4 Using EpiData
2.4.1 Starting the EpiData programme
Start EpiData by clicking on the icon in the Statistical Applications window.
The first time you open this, a 'Welcome' window will appear. Click on the bottom right box, and then on 'Close', to prevent this window showing again.
EpiData runs in a single window. Functions are selected from menus or from toolbars, as in any other Windows application.
At the top of the screen there are two toolbars. The work process toolbar takes you through the main stages in order: create QES files, create REC files, add or revise checks, data entry, lists and reports, and export and backup of data.
2.4.2 Creating a QES file using the EpiData editor
The QES file is a text file which tells EpiData which type of data to expect, and what the variables are called.
EpiData variables are defined using special characters entered in the QES files:

Type               Example       EpiData definition
Numeric            1987          ####
Text               Freetown      ________
Upper-case text    M             <A>
Yes/No (logical)   Y             <Y>
Date (European)    12/06/1987    <dd/mm/yyyy>

For example, in the onchocerciasis dataset, IDNO is a 4-digit numeric variable. The line in the QES file would therefore be:-

IDNO ####
We will now begin to create a questionnaire (.QES) file using the EpiData editor. For each variable you must state the variable name, variable type, and variable length. You can choose any sensible variable name, but the variable type must be one of the set of types defined by EpiData. You can define the variable type and length in three ways in EpiData.
1 Write everything manually.
2 Use the code writer (icon on the far right of the icon toolbar). The code writer watches what you type. If you type some field definition characters (e.g. #, _, <A, <d, <m, <Y, <I, <S), EpiData will prompt you for the information required (e.g. length, number of decimal places) to complete the variable definition.
3 Use the field pick list (icon next to the code writer icon on the icon toolbar). The field pick list displays a dialogue box that allows you to specify variable definitions and insert them into the questionnaire.
WORKED EXAMPLE: CREATING A NEW QES FILE
- Open EpiData
- Click the Define Data button on the work process toolbar and select the New QES file option. This will open a new editor window.
FIRST we will define variables manually.
In the editor window type the following:

IDNO ####
Now click on the third icon from the right, Preview Data Form, on the editor toolbar to see what your form looks like for the data entry clerks.
Go back to the editor window by clicking on 'Untitled 1' in the bottom left-hand corner of the screen.
SECOND, use the code writer.
Turn the code writer on by clicking the code writer icon on the editor toolbar (the icon on the far right of the toolbar).
Type the entry line for age. When you type # you will see a box asking you to 'enter length of field'. Age has a maximum of 2 digits, so you enter '2' in the box and click on OK.
You should then see the line

Age ##

in the editor window.
Now type the line entry for Region. You want to record this in capital letters, so you type

Region <A

and the code writer will prompt you for the length of the field.
THIRD use the field pick list
Type the entry line for sex
Sex
Now turn the field pick list on by clicking the icon on the editor toolbar.
You will see a box titled 'Field pick list'. Select 'Text' and then 'Upper case text' and Length '1', and then click on the 'Insert' box. You will now see that EpiData has inserted the variable type <A> - this asks for an upper-case text entry of length 1. Click the x at the top of the box to close the field pick list box.
You should then see the line
Sex <A>
Experiment with the range of options available using the field pick list
This worked example has been for illustration only. There is no need to save it for future exercises, but if you wish to save your work please use File->Save and allocate an appropriate file name.
It is good practice to create variables with unique, short, easily remembered and meaningful names. For long questionnaires it may be easier to use question numbers (e.g. Q01) as variable names. Complex coding information, while needed on paper collection forms, would tend to clutter up the data entry screen and should not be included.
Exercise 1 – creating a QES file
Create a QES file for the onchocerciasis questionnaire (see Section 2.3). You can shorten the questions to save space.
You should now be familiar with three ways of defining variables. During the exercise, please experiment so that you find the way that suits you best.
Note: The characters <, >, _, and # are used by EpiData to define variables. Avoid using them anywhere else in the questionnaire (.QES) file. You should also avoid using the @ character.
2.4.3 Saving the questionnaire (.QES) file
The finished QES file should look similar to the one below (you may have chosen different question names).
Relation to head of household? #
Lived in village all your life? <Y>
Years living in this village? ##
What is your main occupation? #
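For reference, a complete QES file for this questionnaire might look as follows. This is one possible sketch: the wording and field widths are illustrative choices (EpiData derives the variable names from the question text), not the only correct answer.

```
Onchocerciasis baseline survey

Id number                         ####
Name                              ____________________
Village                           ______________
Date of interview                 <dd/mm/yyyy>
Age in years                      ##
Sex (M or F)                      <A>
Tribe (Mende 1, Other 2)          #
Household number                  ###
Relation to head of household?    #
Lived in village all your life?   <Y>
Years living in this village?     ##
What is your main occupation?     #
```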
Exercise 2 – creating a QES file
Check your work carefully against the example questionnaire. The variable types should be identical to those shown, but the questions themselves may be slightly different, depending on the wording you have chosen.
Save the questionnaire file by clicking the save icon on the editor toolbar or by selecting the File->Save menu option.
When prompted, give the filename oncho.qes and click Save. (You must use the extension .QES for questionnaire files.)
Make sure you save to the dataepi folder which you should have created! Please ask if you don't know how to do this.
You can see what the data entry form will look like while you're creating it by clicking the Preview Data Form button.
2.4.4 Creating a data (.REC) file
We will now make a data file so that data can be entered onto the computer.
Exercise 3 – creating a data (.rec) file for the oncho data
Close all the open windows by selecting 'File', 'Close All', so that no EpiData files remain open.
Click the Make Data File button on the work process toolbar.
You will see a box titled 'Create data file from QES file'.
In response to 'Enter name of QES file', select oncho.qes.
In response to 'Enter name of data file', specify oncho.rec (this will be the default).
Click the OK button.
In response to the request for a data file label, type Oncho baseline survey.
You will see a box titled 'Information' which tells you that the file oncho.rec has been created.
Click OK.
EpiData has now created a data file, oncho.rec, using the information contained in the QES file (oncho.qes).