Epidemiology in Practice
Data Management Using EpiData and Stata
Course Manual
This document contains the material written to support teaching
sessions in data management, which are part of the Epidemiology in Practice module for London-based MSc students. The material is designed to introduce students to the practicalities of managing data in epidemiological studies, including field procedures, questionnaire design, the cleaning and checking of data, and organising a dataset in preparation for analysis.
IMPORTANT NOTE:
This copy is for DL MSc Epidemiology students. Please note that the datasets used are part of the EPM103 materials, and DL students will need to access these from the core module CD-ROM (rather than from the location described in the document).
LONDON SCHOOL OF HYGIENE AND TROPICAL MEDICINE
October 2011
No part of this teaching material may be reproduced by any
means without the written authority of the School given by the
Secretary and Registrar
Contents

Introduction to the Unit

Session 1: Introduction to data management and questionnaire design
1.1 Introduction to data management
1.2 Designing a data management strategy
1.2.1 What do we mean by data?
1.2.2 Data management considerations at the preparation phase
1.2.3 Data management considerations in the fieldwork
1.2.4 Data management strategy and manual
1.2.5 Database construction
1.3 Questionnaire design
1.3.1 General guidelines for writing questions
1.3.2 Other considerations
1.3.3 Coding responses
1.3.4 Identifying respondents and data collection forms
1.3.5 General layout - introduction
1.3.5.1 Ordering of questions
1.3.5.2 Instructions
1.4 Pilot testing
1.5 Quality control
1.6 Information and consent

Practical 1: Questionnaire design

Practical 2: Introduction to EpiData
2.1 Introduction to EpiData
2.1.1 Creating a data file
2.1.2 Datasets, cases and files
2.1.3 Defining database file structure
2.1.4 Variable names
2.1.5 Variable types
2.1.6 Null values – missing and not applicable data
2.1.7 Identifying (ID) numbers
2.2 Data quality and checking
2.2.1 Errors in data entry
2.2.2 Quality and validity
2.2.3 Data checking functions in EpiData
2.3 The onchocerciasis dataset
2.4 Using EpiData
2.4.1 Starting EpiData
2.4.2 Creating a QES file using the EpiData editor
2.4.3 Saving the QES file
2.4.4 Creating a data (.REC) file
2.4.5 Entering data
2.4.6 Variable names
2.4.7 Finding records
2.4.8 Summary so far
2.5 Adding checks in EpiData
2.5.1 Creating specified ranges
2.5.2 Creating labels for variables
2.5.3 Setting up checks
2.5.4 Specifying a KEY UNIQUE variable
2.5.5 Limitations of interactive data checking
2.6 Double entry and validation
2.7 Exporting data from EpiData
2.7.1 The onchocerciasis dataset
2.7.2 Exporting to Stata
2.7.3 Folder management in Stata

Practical 3: Data management using Stata 12 – basics
3.1 Introduction
3.2 When to save and when to clear?
3.3 Do files
3.3.1 What is a do file?
3.3.2 Comments in do files
3.3.3 The “set more off” command
3.3.4 A do-file template
3.4 The display command
3.5 Log files
3.6 Appending data
3.7 Naming variables, labeling variables and labeling values
3.7.1 Variable names
3.7.2 Variable labels
3.7.3 Value labels
3.8 Describing data
3.8.1 Equal signs in Stata commands – one or two?
3.9 Creating and transforming variables – gen, replace, recode
3.9.1 Recategorising data
3.10 The egen command
3.10.1 What’s the difference between gen and egen?
3.11 Exercise

Practical 4: Data management using Stata 12 – essentials
4.1 Types of variable – string and numeric
4.1.1 Converting string variables to numeric (destring, encode)
4.1.2 Converting numeric variables to string (decode)
4.2 Collapsing datasets
4.3 Reshaping data
4.4 Merging files

Practical 5: Data management using Stata 12 – advanced
5.1 Dates
5.2 Identifying duplicate observations
5.3 Shortcuts for repeating commands: using foreach loops
5.4 Working with sub-groups – use of Stata system variables
5.5 Final integrating exercise

Appendix 1: Example data processing manual
Appendix 2: The progress of a questionnaire
Appendix 3: Questionnaires to enter
Appendix 4: Common Stata commands
Data Management using EpiData and Stata

Introduction

The aim of these sessions is to equip you with the practical and essential epidemiological skills needed to design a questionnaire, enter data using EpiData, and prepare data for analysis in Stata 12.
By the end of this Teaching Unit, you should:
• appreciate the need for a coherent data management strategy for an epidemiological study
• understand the principles of good questionnaire design
• be able to create an EpiData database to translate data from the questionnaire to the computer
• be able to enter and verify data in EpiData
• know how to transfer data from EpiData to Stata and other statistical packages
• be able to create and use Stata do- and log-files
• know how to undertake common management tasks in Stata, including merging, appending, and collapsing files
• be able to use common Stata commands to generate, recode and replace variables
• be familiar with more advanced topics such as dates and substrings
The total class contact time for Data Management is 15 hours, consisting of five three-hour sessions. You are advised to read the notes for each practical session before the class and to work through the material for each session. Each week’s exercises build on work completed in earlier sessions. If you cannot complete the work within the three face-to-face hours, you are strongly advised to complete the practical in your own time.
Recommended texts
An excellent book which covers essential issues of data management and questionnaire design is:
Field Trials of Health Interventions in Developing Countries: A Toolbox
Edited by Smith & Morrow; WHO
A useful book on Stata is:
A Short Introduction to Stata for Biostatistics
Hills & De Stavola
Timberlake Consultants Ltd
There are several copies of both books in the library
Dataset
During this course we will use data from a randomised controlled trial of an intervention against river blindness (onchocerciasis) in Sierra Leone, West Africa.
Onchocerciasis is a common and debilitating disease of the tropics. It is chronic, and affects the skin and eyes. Its pathology is thought to result from the cumulative effects of inflammatory responses to immobile and dead microfilariae in the skin and eyes. Microfilariae are tiny worm-like parasites deposited in the skin by the blackfly (Simulium), which breeds in fast-flowing tropical rivers. The fly bites and injects parasite larvae under the skin. These mature into adult worms, which produce microfilariae that may migrate to the eye, where they may cause lesions leading to blindness. The worms are detectable by microscopic examination of skin samples, usually snipped from around the hips; severity of infection is measured by counting the average number of microfilariae per milligram of skin examined.
A double-blind, placebo-controlled trial was designed to study the clinical and parasitological effects of repeated treatment with a newly developed drug called ivermectin. Subjects were enrolled from six villages in Sierra Leone, and initial demographic and parasitological surveys were conducted between June and October 1987. Subjects were randomly allocated to either the ivermectin treatment group or the placebo control group. Randomisation was done in London.

The questionnaire in section 2.3 is similar to that used to collect baseline data for the study. It contains questions on background demographic and socio-economic factors, and on subjects' previous experience of onchocerciasis.

Follow-up parasitology and repeated treatment were carried out in five further surveys at six-monthly intervals. The principal outcome of interest was the comparison of microfilarial counts before and after treatment, and between the two treatment groups.
Reference
Whitworth JAG, Morgan D, Maude GH, Downham MD and Taylor DW (1991). A community trial of ivermectin for onchocerciasis in Sierra Leone: clinical and parasitological responses to the initial dose. Transactions of the Royal Society of Tropical Medicine and Hygiene, 85, 92–96.
Files and Variables

You will need the following files during the course. They are currently in the folder u:\download\teach\dataepi. Create a new folder on your drive called h:\dataepi, and use Windows Explorer to copy all the files from u:\download\teach\dataepi into h:\dataepi.
File: baseline (demography)

  Variable   Contents               Values / codes                Missing value code
  VILL       Village number         1–6
  DATE       Date of interview      dd/mm/yyyy
  AGE        Age in years           Positive integers             99
  …          …                      2 = Other
  HHNO       Household number       Positive integers             99

File: microfilariae counts

  Variable   Contents               Values / codes                Missing value code
  SR         Survey round           1–5
  MFRIC      MF count (right)       Positive integers             999
  MFLIC      MF count (left)        Positive integers             999
  SSRIC      Skin diameter (right)  Positive real numbers
  SSLIC      Skin diameter (left)   Positive real numbers
  EOSIN      Eosinophil count       Positive integers             9999
  PCV        Packed cell volume     Positive integers             99
  MPS        Malaria parasites      0 = Negative, 1 = Positive    9

File: TMTCODES.REC (treatment codes)

  Variable   Contents               Values / codes
  DRUG       Drug batch             IVER = drug, PLAC = placebo
  IDNO       Subject ID number
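In EpiData (introduced in Practical 2), each of these variables becomes a field in a questionnaire (.qes) file. As a taste of what is to come, here is a minimal sketch of how the baseline demography fields might be laid out; the layout of the actual course files may differ. In a .qes file, each # stands for one numeric digit and <dd/mm/yyyy> defines a date field.

```
VILL   Village number      #
DATE   Date of interview   <dd/mm/yyyy>
AGE    Age in years        ##
HHNO   Household number    ##
```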
Data Management using EpiData and Stata

Session 1: Introduction to Data Management and Questionnaire Design
In this session we cover the essentials of data management, and give an introduction to questionnaire design
Objectives of this session
After this session, you should understand how to:
1. start planning data sources and computer files for your own studies
2. use a data management strategy as a way of ensuring the quality of data in preparation for analysis
3. outline tasks to be included in the procedures guide for fieldwork and data management
4. start designing a precise and informative questionnaire
1.1 Introduction to data management
An essential part of any epidemiological study is the collection of relevant data on participants. Data will include identification information such as name, age, sex and place of residence; information on the main outcome and exposure; and information on other factors that may be potential confounders of the association under study. This will usually include clinical and laboratory data, which need to be linked to the socio-demographic and behavioural data. Often different sources of data are used - for example, some data may be collected at community level, or samples may be collected from the same individual at different timepoints.

In an epidemiological study, we need to transfer information from the study population to the computer, ready for statistical analysis. There are, however, many things that can go wrong in this process, so that the final dataset may not represent the true situation.
Data management is the process of transferring information from the study population to a dataset ready for analysis
The main aim of data management is to ensure that sufficient and necessary information is collected and processed to answer the research question
[Figure: from reality to the final dataset. Reality is the information about the target population (e.g. height, weight, income, concentration of malaria parasites in blood, number of sexual partners in the last year). From the study population, data pass through data collection, data entry and preparation of the dataset for analysis. The final dataset on the computer should represent reality; analysis of these data is used to draw conclusions about the target population.]
Example: The multi-centre study of sexual behaviour in Africa
As an example, we will consider a cross-sectional study of sexual behaviour in Africa. This study was a population-based survey of around 2000 adults and 300 sex workers in each of four cities in Africa. The aim of the study was to explore the factors associated with the differential spread of HIV in these cities (Buvé et al, AIDS 2001 Aug 15 Suppl 4).

Households in each city were selected randomly, and all adults living in the selected households were eligible for the study.
The following information was collected:
- household information through interview
- individual information (socio-demographic, behavioural) through interviewer-administered questionnaires
- clinical examination for men
- specimen form detailing which biological samples were taken
- laboratory form with test results
Like many epidemiological studies, this was a large, complex and expensive project involving collaborators in several African and European countries. The success of the study depended on two main features of the data processing and management:

1. Standardised data management across all sites
2. Accurate data entry, with all queries being meticulously recorded and processed
A data management strategy had to be designed to facilitate these two objectives
1.2 Designing a data management strategy

A data management strategy sets out the data processing needs of each stage of the research study. The strategy outlines how the work should be tackled, the problems that may be encountered, and how to overcome them.
1.2.1 What do we mean by data?
‘Data’ are all values needed in the analysis of the research question. Results from lab tests and clinical measurements may be data, as can observations and measurements of a community or household.
We must plan how these data are to be used in the study, and how they are collected, stored, processed and made suitable for analysis
Some data may not be used in the analysis, but may feed into the study prior to other activities. For example, qualitative data from focus group discussions may be needed to establish the nature of community norms, which can then be used to design the questionnaire; or community characteristics may be needed in order to stratify or randomise the sample.
1.2.2 Data management considerations at the preparation phase
When writing your study proposal, you must start planning how to manage the data. If you don’t, problems with data processing are likely to delay the study considerably or, even worse, result in poor-quality data and meaningless results.
To ensure that the data collected do represent reality you need to consider the following:
• Study design – is this to be a cross-sectional survey, a case-control study, a cohort study, etc.?
• Sampling – how will you identify eligible respondents in your study population? For example, where will the list of the study sample come from, and what criteria will you use for inclusion in the sample?
• What data are to be collected? What are the outcomes, exposures and potential confounders?
• What questionnaires and forms will be needed? Will data be collected from different sources – questionnaires, clinical records, lab tests, etc.? What hardware and software needs are there?
• What sample size is needed to meet the objectives of the study? What percentage will you add to take account of refusals, loss to follow-up, etc.? What methods can be used to minimise the refusal rate?
• Who will be responsible for the data at each stage? Data will pass through field workers, lab technicians, data entry personnel, data managers and statisticians, all of whom should report to the principal investigator. What role do people external to the study team have (consultants, regional health officers, community leaders, study subjects)?
• At this stage the strategy should include an overview of the budget needed for all aspects of data management:
− printing or copying questionnaires
− transport costs: do you need to buy vehicles, or can you use those belonging to your institution?
− consumables: clipboards; paper; pencils; printer paper; sticky labels; printer ink; computer disks; etc.
− measuring equipment: scales; tape measures; etc.
− laboratory costs: consumables; freezer storage space; cost of sample analysis
− computing costs: will computers be bought, or will use be made of the institution’s computers? Printers, UPS, voltage regulators, etc.
− software for data management and data analysis: this may be covered by the institution, or may need to be purchased separately
1.2.3 Data management considerations in the fieldwork
Before the fieldwork starts, field management of data will have to be considered in more detail. A plan must be drawn up in advance, often without the detailed knowledge that will come later. All procedures must be clearly outlined in a field manual to ensure that all research and field staff have a common understanding of procedures.
• Recruitment of field staff
− how many staff are needed, and what is their primary role (e.g. data entry managers, interviewers)?
− which qualities and skills must the staff have?
− how will staff be selected?
− how will staff be trained?
• Logistics
− how is transport to be organised?
− what are the accommodation needs for field staff?
− what demands will there be for efficient administrative management (computers, photocopying, printing, etc.)?
• Designing the questionnaire/forms
− identify what questionnaires and forms are needed
− compose the questions
− assemble the questionnaire
− pilot test the questionnaire, making changes as needed
− code the responses
− design computer data entry processes
• Field manual
− write a field manual. It should identify all the data that are to be collected and outline how they will be obtained. In addition, protocols should be clearly laid out for dealing with problems that may arise - for example, how the following situations are to be resolved, and by whom:
   A field supervisor finds that the wrong person has been interviewed.
   A supervisor finds that an answer box has not been filled in correctly.
− the field manual should also lay out procedures for supervision and quality control: by whom, where and when
Example continued : multi-centre study
For the multi-centre study, there were five questionnaires or forms for each individual, and the data management manual detailed how the data were to be entered and coded. In each city there was a team of data entry staff, local field staff and an experienced local supervisor. There was also a local data manager who, together with the research statistician, recruited and trained the data entry staff.

The manual detailed the process for supervision and quality control of interviewers, clinical staff, data entry staff and lab staff (see Appendix 1).
1.2.4 Data management strategy and manual
The data management strategy should describe:
• the databases to be created, and outline the variables in each database
• the purpose of each database and into what format the data will be translated
• how the databases will be linked
• the programs for data entry, data cleaning and data checking
The data management strategy should include an outline plan for data analysis. The main outcome and exposure variables and risk factors of interest should be identified, along with potential confounders and ways of dealing with them (stratification or multivariate analysis).
The methods of analysis should be outlined - for example, tabulation of variables followed by logistic regression (see journal guidelines).
A data management manual must be produced. It is good practice to have written and explicit instructions on all aspects of data management filed with the data managers.
The data management manual should explain the following:
• The flow of the questionnaires and forms from the field to final storage
• The data entry system. It should include the following:
− a list of filenames;
− which computers will be used for first and second data entry;
− which data entry staff will input the data;
− who will compare first and second entry;
− how data entry errors will be resolved
• Who will merge entered data, and who will keep the master dataset
• Procedures for entering data which have been corrected following discussions with the field manager. This may happen when data are updated from later rounds in a study. It is critical that the procedures describe when, where and how such data are updated.
• Procedures for backing up data (if you don’t do this you can lose the data!)
• Procedures for running virus check programs
• How to produce sticky labels and lists of subjects. These may be used, for example, to identify subjects with missing data.
• How to produce lists or summary statistics to facilitate fieldwork (e.g. response rates, respondents needing follow-up, etc.)
• Who has responsibility for maintaining hardware and software and how s/he should undertake the task
The data management manual for the multi-centre study is given in Appendix 1
Further details on preparing a data management manual are given in Appendix 2
1.2.5 Database construction
Each questionnaire or source of data is associated with a database. Each of these databases must be created, and the relationships between the data in the different databases defined. Each database has one variable for each question in the questionnaire, and one record for each subject responding to that questionnaire. Each subject has a unique identifier, which allows data in one database to be linked with data in another database.

Linking questionnaires may be through a simple one-to-one relationship, by which every database has one record (observation) for each subject. However, some studies may have multiple records for each subject. For example, in the multi-centre study there were usually several eligible individuals within one household, so information on the household form had to be matched to more than one individual.
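In Stata (merging files is covered in Practical 4), the household-to-individual linkage described above corresponds to a many-to-one merge on the shared identifier. The sketch below assumes hypothetical files individual.dta and household.dta, each containing the household number hhno:

```stata
* Sketch only: file and variable names are hypothetical.
use individual, clear            // one record per subject, carries hhno
merge m:1 hhno using household   // attach household data to each member
tabulate _merge                  // check how many records matched
```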
1.3 Questionnaire design
Writing good questions is a lot harder than it first appears!
1.3.1 General guidelines for writing questions
• questions should relate directly to the objectives of the study
• keep the questions short
• questions must be simple, clear and unambiguous - don't use technical jargon
• ask only one thing at a time (e.g. avoid questions like ‘do the cost and times of the clinic prevent you attending?’)
• avoid open questions beginning with ‘Why’
• avoid negative questions (e.g. ‘Do you not think…?’)
• avoid hypothetical questions (e.g. ‘If the bus fare was cheaper, would you come to the clinic more often?’)
• pay attention to sensitive questions, for example about sexual behaviour
• check the adequacy of the lists of responses to closed questions (for example, ensure a food list covers most things normally eaten in the community concerned)
• avoid a large proportion of responses being in the ‘other (specify) ’ category
• translate questions into local languages and then have them back-translated into the original language
1.3.2 Other considerations
• Cultural relevance: Are all your concepts meaningful to your respondents? Might you need
to do qualitative research first to clarify this?
• Recall problems: How far back are you expecting your respondents to recall accurately? A
suitable period will depend on the event you are talking about and who you are talking to about it
• Bias due to wording of questions: Try to keep the wording of the questions "neutral" so
that respondents don't just give the reply that they think is expected of them - especially for attitudinal questions
• Sensitive questions:
− is your question acceptable?
− is it practical to expect to be able to get this information?
− might an indirect approach using several questions help? There are special techniques available for ensuring confidentiality, but they may not be feasible in all situations
− put sensitive questions towards the end of the questionnaire (see ordering)
• Questions in a verbatim questionnaire: These questions are written exactly as they are to be read out. This carries the advantages of standardisation and reduces the amount of interviewer training needed. The technique can detract from the personal element of an interview, but it should be departed from only by more skilled interviewers, with advance attention to standardisation.
• Questions, scales and indices from previous research: In practice, you would carry out
a literature search to study the methodologies used in similar research. You may be able to adapt or adopt questions, scales or indices used by other researchers.
1.3.3 Coding responses
Coding of the data is necessary for computer data entry and statistical analysis
Coding involves classifying responses into a defined set of categories and attaching numerical codes. You should devise your coding scheme so that the whole range of responses is covered (exhaustive) and any response can be allocated only one code (exclusive).
How do we code the categories?
Sometimes the categories are represented by numbers, and the respondent or interviewer does the coding by ringing or ticking the appropriate number.
Example: What type of place do you live in?

A relative’s or friend’s home 3
Other (please specify) _______ 5
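Once numerical codes have been chosen, statistical packages can carry the category descriptions alongside the codes. In Stata (value labels are covered in section 3.7.3) this might look like the following sketch, using a hypothetical variable place and only the two codes visible in the example above:

```stata
* Sketch only: the variable name and label name are hypothetical.
label define placelbl 3 "Relative's or friend's home" 5 "Other"
label values place placelbl   // attach the labels to the coded variable
tabulate place                // tabulations now show labels, not raw codes
```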
Alternatively, questionnaires may have boxes for the codes down the right-hand side. Office personnel then enter codes in these boxes according to the response ticked or ringed by the respondent. This type of coding is sometimes used for self-administered questionnaires, where the use of numbers to represent categories might be off-putting to respondents.
Example: What type of place do you live in? Please ring
When deciding about categorizing and coding responses it is necessary to consider:
• How it will affect the smooth running of the interview, and how easy it will be to complete
• How it will affect accuracy (e.g. the more data are transcribed, the more chances for error)
• How many questionnaires are involved
For some questions, more than one response is possible To deal with this, have one coding box for each possible response
Example: for the question "Where do you obtain your water?" you might have boxes for "Well", "River" and "Taps", and codes 1 = Yes, 2 = No. For an individual answering "I get most of my water from the well, but also some from the river", you would put "1" in the well and river boxes, and "2" in the box for taps. This method is recommended unless the number of responses is large.
Coding can be done at various stages:
• By the respondent: the interviewer/questionnaire asks him/her to select from a list of
responses
• By the interviewer: the respondent answers freely, and the interviewer ticks or circles the appropriate response on the questionnaire
• After the interview (either by the interviewer or by other personnel in the office): the response is recorded verbatim and categorised later
When the forms are ready for coding, examine a sample of the questionnaires and decide on appropriate categories. Do this before starting the actual coding process.

Avoid changing the codes after coding has started. (Sometimes it may be all right to add an extra code if a completely new response turns up.)
If the number of forms completed is large, coding after the study is complete is undesirable and may be impractical; in this case careful pilot testing and pre-coding (which are always strongly desirable) are essential.
1.3.4 Identifying respondents and data collection forms
A key aspect of any questionnaire is identification of respondents
• Each respondent should be clearly and uniquely identified. Name and identification (ID) number are the minimum.
• If information is being collected on the same person or household in repeated surveys, identification information can be pre-printed on the forms (e.g. by printing ID details on sticky labels and attaching a label to each form). This reduces copying errors, saves time in the field and guards against individuals being omitted from the survey.
• If more than one questionnaire is completed for each respondent, each questionnaire must be uniquely identified as well as each respondent. For instance, numbers can be stamped on the top right-hand corner of the form using a hand-held numbering stamp.
• For reasons of confidentiality, names are not entered on the computer. Hence an ID number is essential.
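EpiData can enforce ID uniqueness at data entry time (see section 2.5.4 on KEY UNIQUE variables). In a check (.chk) file, a minimal sketch might read as follows, assuming the ID field is called idno:

```
idno
  KEY UNIQUE
END
```

With this check in place, EpiData refuses a second record with an ID number that has already been entered.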
1.3.5 General layout - Introduction
The questionnaire should begin with a brief introduction explaining the purpose of the survey
• The questionnaire must be easy to read and to use
• Printing (or photocopying) gives better-quality copies, but stencil duplicating is usually cheaper
• Use good quality paper - it may have to stand up to rough handling
• Choose paper size carefully; A4 is often a good choice. Small pieces of paper are much harder to sort through, and are not so convenient on a clipboard
• Layout of the typing is extremely important. Leave plenty of space for the answers to each of the questions. Use bold for emphasis, and vary the point size to distinguish text for different purposes (instructions, questions, etc.). Pay attention to the general look of the questionnaire; make it pleasing and easy to read
• Different questionnaires can be easily distinguished from each other if different coloured paper or ink is used for them
1.3.5.2 Instructions
• It is sometimes necessary to print instructions on the questionnaire to guide the respondent/interviewer. This is especially important where there are branches or jumps, e.g. ‘if answer is "No", go to question 9’
• Transitional statements: e.g. to explain that you are moving on to a new subject, or to define a term you are using. In an interview using a verbatim questionnaire, they should be written exactly as they are to be read
• Distinguish instructions from the things which are to be read out by using a different point size, or by printing in italics or capitals
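Branching instructions of this kind can also be enforced automatically at data entry. In an EpiData check (.chk) file (checks are introduced in section 2.5), the ‘if answer is "No", go to question 9’ example might be sketched as follows, assuming hypothetical field names q8 and q9 and the coding 2 = No:

```
q8
  JUMPS
    2 q9
  END
END
```

When the value 2 is entered in q8, the cursor skips directly to q9, so the paper skip pattern and the computer skip pattern stay in step.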
1.4 Pilot testing
Pilot testing means trying out the draft questionnaire in a population similar to the target population. However much effort you put into designing your questionnaire, it is absolutely essential to pilot test it. Look carefully at the responses received during the pilot test to ensure that the data collected are as anticipated and are meaningful.

At this stage you should also test your coding scheme and your data entry system, and even try out some analysis. Test the data collection, coding and entry system on at least 10 respondents. Do another pilot after each substantial modification before proceeding to the main data collection.
The pilot may also be used to validate the questions. It may be appropriate to ask two different questions to see which provides more useful information. The translation of the questions may be validated in the pilot. Alternatively, a pilot population may be used where previous information already exists, so that the new questionnaire and its delivery can be validated.
In some studies a pilot study of the questionnaire is incorporated into the training process for the interviewers. This enables supervisors to assess the quality of the completed questionnaires across the interviewers.
1.5 Quality control
Quality control is an essential aspect of data management, and is needed in all aspects of the study:
• for questionnaire response,
• for laboratory testing,
• for clinical examination
Some studies randomly select 10% of subjects to go through quality control. A separate questionnaire is used for this, and it is administered by a person other than the initial interviewer. All staff should know that their performance is monitored. The results of the quality control should be conveyed to the interviewers. Interviewers performing badly should be replaced.
1.6 Information and consent
Full information about the study must be given to informants before the questionnaire is administered. The Declaration of Helsinki governs the ethical principles around medical research. Full informed consent is an established principle that cannot be compromised. Subjects can be informed of the benefits of the study to individuals and to the community, but must also be made aware of any possible harm they may suffer.

Subjects should be informed of every activity they are expected to be involved in, and when these are likely to take place. They should be aware that participation is voluntary, and that the health care they receive will not be changed if they refuse. Subjects should be told how confidentiality will be maintained, and of their rights under national and international law.
Coercion and deception in order to encourage participation in the study or intervention are unacceptable. However, it is difficult to define indirect coercion and to safeguard subjects from it. Incentives are often given to recompense subjects for their time and expenses, and there is a fine line between incentives and coercion.
It is normal to obtain the consent of the subject prior to the questionnaire, but this may happen after the questions have been asked. Witnessed, signed consent after full information has been given to the subject is the 'gold standard'. However, it may be impractical to get impartial witnesses, and some subjects may not be able to read and write. Parents may be able to give consent for a child (under the age of consent), but the child himself/herself retains the right to refuse if they do not want to participate.
Some studies may obtain community consent for study procedures. It is important to inform community leaders of the study, and to get consent for the activities, especially where these affect the community at large but do not affect individuals. However, where individuals are involved in study activities they retain their right to consent or refuse. For example, community education programmes to raise awareness of STD and HIV may not require the consent of every individual in the community to be started and delivered. However, individuals may want to opt out of some lessons, or refuse to participate in evaluation exercises.
Practical 2: Introduction to EpiData
Objectives of this session
By the end of this section you should:
• understand what is meant by a database, a file, a case and a variable
• understand the types of variables used in EpiData
• be able to create a questionnaire (.qes) file in EpiData
• be able to create a database (.rec) file in EpiData
• be able to enter data into the file
• understand the principles of quality control, including data checking and double entry of data
• be able to export data from EpiData to other programmes
Please read sections 2.1 to 2 before the session
2.1 Introduction to EpiData
In this session, we will use the software package EpiData for data entry. There are several reasons for this:-
• it has been specifically written for use in research studies and designed to simplify each stage of the data management process
• it is easy to use
• it is distributed free of charge
• it does not require a powerful computer
• it demonstrates the principles of data checking
• it can export data in formats that can be read by virtually every statistical, database, or spreadsheet package
You can learn the principles of data management with EpiData, and these concepts and techniques are valid whichever package you may use in the future
We use EpiData to undertake the following functions:
• write and edit text - to create questionnaire files and to edit the files that contain checking and data-coding rules
• enter data - to make data files from questionnaire files and to enter, edit, and query data
• carry out interactive checking while data are being entered - to make sure that you have entered legal values in an appropriate place, e.g. using skip patterns and coding rules
• carry out batch checking after data have been entered
We recommend that data are exported to a statistical package for further cleaning and analysis.
2.1.1 Creating a data file
Creating a database file in EpiData is a two stage process:
Stage 1: First you click on Define Data to create a data entry form. This file is called a questionnaire file and must have the extension ".QES". It includes information on variable names, types, and lengths, and defines the layout of the data entry form and the structure of the data file.
Stage 2: EpiData uses this QES file to create a record file. This file has the extension .REC, and data is entered into this file.
Text editor → .QES file → Make Data File function → .REC file

The .QES file defines the structure of the data file and the layout of the data entry form; the .REC file holds the data.
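As a sketch of stage 1, a minimal QES file might contain just a few lines like these. The variable names, prompts, and field widths here are illustrative only, and the sketch assumes the convention that the first word of each line serves as the variable name:

```
IDNO  ####
DOV   <dd/mm/yyyy>
AGE   ##
```

Running the Make Data File function on such a QES file would then create a matching .REC file with three variables, ready for data entry.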
2.1.2 Datasets, cases, and files
A data set is stored in the computer as a file. A file consists of a collection of cases or records. Each case contains data in a series of variables.
Data are usually represented in a table in which each row represents an individual case (record), and each column represents a variable (field). For example, a survey file might have one row per subject, with columns (variables) such as Survey, Household number, and Relation to head of household.
2.1.3 Defining database file structure
To define the structure of a database file we need to specify, for each variable, a name, a type, and a length. The variable type chosen will depend on the type of data that the variable is to contain.
An example of the structure is:-

Name   Type               Length
IDNO   numeric            4
SEX    text               1
LIV    logical (yes/no)   1
Each variable has a name which allows us to refer to it for data checking and analysis.
Each variable is of a certain type. The type you choose to assign to a variable depends upon the type of data it will contain. The most common data types are text, numeric, logical, and date.
The length of a variable defines how much information it can hold. A text variable with length ten will be able to hold up to ten letters or numbers. A numeric variable with length three will be able to hold numbers between -99 and 999. The length of a variable must correspond to the highest anticipated number of letters and/or numbers.
2.1.4 Variable names
In EpiData variable names:
− must not exceed eight characters
− must begin with a letter, not a number
− must not contain any spaces or punctuation marks
− can otherwise contain any sequence of letters and digits
Names can describe the variable they refer to (e.g. OCCUP is probably more informative than VAR17), but with large questionnaires it may be easier to use question numbers (e.g. Q17) as variable names.
Examples of illegal variable names are:-
LAST NAME (contains a space)
COUNTRYOFORIGIN (longer than eight characters)
2.1.5 Variable types
Each variable must be of a certain type. The type you choose to assign a variable will depend on the type of data you wish it to contain. EpiData provides the following variable types:-
1 TEXT variables are used for holding information consisting of text or numbers. Text variables are useful for holding information such as names and addresses.
2 NUMERIC variables are used for holding numerical information. They can be used for holding categorical or continuous data. Numeric variables can be defined to hold either integers (whole numbers) or real numbers (numbers with a fractional part).
3 LOGICAL (YES/NO) variables are used for holding data that can have only two possible states, such as whether a respondent has been ill or not. Logical variables can hold either the character 'Y' or the character 'N' (which may also be entered as '1' and '0'). LOGICAL variable types are appropriate for binary categorical data.
4 DATE variables are used to hold dates. DATE variables can be used to hold data in the American (mm/dd/yyyy) and European (dd/mm/yyyy) formats. The advantage of using date type variables is that EpiData will only allow you to enter valid dates. DATE type variables also simplify calculations, as factors such as variable month length and leap years are accounted for.
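As a sketch, the four types correspond to QES field definitions like these (the prompts and field widths are illustrative examples, not taken from any particular study):

```
Name of respondent    ____________________
Weight in kg          ###.#
Ill in last month?    <Y>
Date of interview     <dd/mm/yyyy>
```

Here underscores define a text field, # characters a numeric field (with an optional decimal part), <Y> a yes/no field, and <dd/mm/yyyy> a European-format date field.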
2.1.6 Null values – missing and not applicable data
Sometimes data items will not be available, or are not appropriate to collect for some respondents (e.g. age at menarche for male respondents). It is important that you take this into account when designing questionnaires, coding schemes, and data files.
Data that is missing or not applicable is called null data. There are two types of null data:-
• When data are not available, they are defined as missing data. It is generally considered bad practice to leave data entry spaces empty on the questionnaire or data entry screen because it can lead to confusion later. Always consider the codes to use when a value is missing. It is common practice to use 9, 99, 999 etc. to denote missing data.
• When data are not available because they are not applicable to collect, they are defined as not-applicable. For example, if someone is not married, then the age of their spouse is not applicable. It is common practice to use 8, 88, 888 etc. to denote not-applicable data. Coding not-applicable data in this way allows you to recode it back to missing or null before analysis.
The coding scheme you decide to use for missing and not-applicable data must be defined in advance and be consistent across variables.
Any field that is left empty at data entry, or any calculation that uses missing data, receives the special missing data code (.). However, it is better to explicitly code missing data as 9, 99, etc.
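In EpiData these conventions can be enforced at entry time. A sketch of a CHECK rule for an age variable that accepts plausible values plus a missing-value code (the range and code here are illustrative assumptions, not from any study protocol):

```
AGE
  RANGE 0 95
  LEGAL
    99
  END
END
```

Values 0 to 95 are accepted, and 99 is additionally legal as the missing-data code, mixing a range check with a legal value.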
2.1.7 Identifying (ID) numbers
When designing a questionnaire or a database file it is important to include a variable that holds a unique value for each case. This makes finding both paper forms and individual cases in a database file easier should you need to query or edit a data item. This variable is called the key.
A database that consists of more than one linked data file is called a relational database. We will cover these further in Practical 3.
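In EpiData the ID variable can be declared as a unique key in the CHECK file, so that duplicate ID numbers are rejected at entry. A sketch, assuming the ID variable is called IDNO:

```
IDNO
  KEY UNIQUE
END
```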
2.2 Data Quality and Data Checking
A fundamental function of data management is to minimise error at all stages of a survey, and not just at the computing stage. To do so effectively, it is necessary to carry out careful checks at all stages.
2.2.1 Errors in data entry
a Minimising errors
Preventing errors relies on an understanding of the types of errors that can occur. They can be broadly categorised as response errors and processing errors.
1 A response error is the difference between the 'true' answer and the answer given by the respondent. The best way to minimise response errors is to have a well designed and piloted questionnaire administered by well trained and motivated staff.
2 A processing error is an error that creeps in during the coding, data entry, and data analysis stages of a survey. The main way of preventing processing errors is extensive checking of data, on the paper forms and on the computer. A well designed data entry system and well trained and motivated data entry staff are essential. Processing errors also occur at the data analysis stage, so it is important to check the results of any recodes, transformations, and calculations. It is a good idea to perform procedures on a small sample of data first, so that the results can be analysed in detail for any mistakes.
b Detecting errors in data
There are four methods of checking data. More than one method may be applied to the same data, as each method has different strengths and weaknesses.
1 Manual Checking. The researcher should check all completed questionnaires at the start of a survey, and continue to check samples of the questionnaires throughout the survey. Manual checking is very important because it takes place very early in the data collection - entry - analysis process. Problems can be detected and corrected in the field, sometimes without the need to re-interview respondents.
2 Checking during data entry (Interactive Checking). EpiData has functions that detect and correct problems with data as they are entered. These functions pick up errors in range, copying, consistency, and routing. The errors may arise from an invalid value written on the paper copy of the questionnaire or from a mistake on the part of the data entry clerk.
3 Checking after data entry (Batch Checking). EpiData can also check data after they have been entered. This is an important facility, since you may wish to examine the coherence of your dataset, e.g. identify unlikely but legal values or identify missing values.
4 Validation (or Verification). This involves double entry of data into different files by different operators. The resulting files are then compared to assess whether they are the same. Validation is useful in picking up transposition and copying errors. EpiData provides functions for double entry and validation of data.
The type of checking you do will depend on the way your study is organised. Interactive checking may slow the data entry process, as each problem is referred to the data supervisor. A compromise approach is sometimes used: data entry staff are warned of an error and asked to confirm that the data entered are the same as those on the data collection form. The operator can then correct the error, or leave it (so that it corresponds to what is on the data collection form). Errors are then detected later with batch checking. This approach minimises the call on the supervisor's time.
c Correcting errors
A strategy is necessary to deal with errors at data entry. If the error is detected in the field, or soon after collection, then it should be possible to re-interview and collect the correct data. Often this will involve asking only a few questions of the respondent.
If the error is detected later it may not be possible to re-interview. There are three options available for dealing with cases that have some errors in them:
1 Ignore the errors and use all the data. This approach is fine only if you are absolutely certain that the errors are not systematic. Even if errors are not systematic, they will add noise to the data and may create a bias towards the null in subsequent statistical analysis. This approach is very rarely used.
2 Mark erroneous data as missing and use the rest of the data. This approach ensures that no obviously erroneous data are included in subsequent analysis. There is a risk that data that are not obviously erroneous may still be of poor quality. This is the most commonly used approach.
3 Throw out all erroneous cases. This is the strongest position and ensures that no erroneous data are included in subsequent analysis. Potentially erroneous data are filtered out on a 'guilt by association' basis. This approach may bias results, depending on who decides which records are to be excluded from subsequent analysis, and is rarely used in practice.
2.2.2 Quality and validity
Most of the checks you can perform by examining questionnaires or data records relate to the validity of the data, but not necessarily to its quality. It is, however, possible to check the quality of the data:
1 Question Redundancy: Ask important questions in different ways at different points in the interview. Responses can then be compared. This is useful at the pilot stage and can help you decide which question elicits the most reliable response. You may decide later that other questions are redundant.
2 Confirmatory Checks: Incorporate checks within the questionnaire (e.g. ask about vaccination history, check against the health card, ask to see the vaccination scar).
3 External Records: Check a sample of questionnaires against external records (e.g. if asking about clinic visits, then check against clinic records).
4 Expected Distributions: It is likely that you will already know something (e.g. from the census or previous studies) about the population under study. You should check that the distributions (e.g. age distribution, sex ratios, etc.) in your data are similar to those you expected.
5 Compare results from different interviewers: Use a variable on the data collection form that identifies the interviewer. You can then compare responses to a particular question among the interviewers.
2.2.3 Data checking functions in EpiData
Even without specifying data checks, EpiData provides a limited form of data checking. It will only allow you to enter numbers into NUMERIC variables, valid dates into DATE variables, and Y, N, 1, or 0 into YES/NO variables. It is much better, however, to specify more specific checks, and EpiData provides functions that allow you to do this.
Interactive Checking: The Checks function allows you to check data as they are being entered (interactive checking). These functions include:-
1 Must-enter variables. You can specify that certain variables must be filled with a value other than missing.
2 Legal values. The input must match one of a specified list of values. The variable can be left blank unless it is designated as a must-enter variable.
3 Range checks. The contents of a variable must lie between two bounding values. You may mix range checks and legal values (e.g. for missing value codes).
4 Repeat variables. The variable on a new record will automatically hold the value from the previous case. This is useful for data that seldom change (e.g. location codes).
5 Conditional jumps. Normally the cursor moves from one variable to the next. However, sometimes you may wish to skip a variable (e.g. Ever pregnant? Y/N may be followed by Date of last pregnancy dd/mm/yyyy). These conditional jumps are used for questionnaire routing during data entry.
6 Programmed checks. EpiData also provides an easy-to-use block-structured programming language that allows you to program more complex checking procedures (e.g. consistency checks).
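These checks are written in a CHECK (.chk) file that sits alongside the .REC file. The sketch below uses variable names from the onchocerciasis questionnaire in section 2.3; the range, the jump rule (skip the years-in-village question when the subject has lived there all their life), and the consistency check are illustrative assumptions, not part of the study protocol:

```
SEX
  MUSTENTER
  LEGAL
    "M" "F"
  END
END

AGE
  RANGE 0 95
END

LIV
  JUMPS
    Y OCC
  END
END

STAY
  AFTER ENTRY
    IF STAY > AGE THEN
      HELP "Years in village cannot exceed age - please re-check"
    ENDIF
  END
END
```

The first block combines a must-enter rule with legal values, the LIV block shows a conditional jump, and the STAY block is a programmed consistency check that warns the operator after entry.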
Batch Checking: Under Document you will find the Validate Duplicate Files function, which allows for batch checking (i.e. all cases at once) of data that have been double-entered.
Batch checks for range and consistency may also be performed using the CHECK programming language which is built into EpiData.
2.3 The onchocerciasis dataset
Now you will use EpiData to enter baseline data for a trial of an intervention against river blindness (onchocerciasis) in Sierra Leone in West Africa.
The trial rationale and design are described in the introduction to this unit. Briefly, we will be looking at a placebo-controlled randomised trial of the effect of the drug ivermectin on infection with microfilariae. Subjects were enrolled from six villages in Sierra Leone (West Africa), and initial demographic and parasitological surveys were conducted between June and October 1987.
A questionnaire was used to collect baseline data on the individuals, including demographic and socio-economic factors, and on subjects' previous experience of onchocerciasis.
Follow-up parasitology and repeated treatment were performed for five further surveys at six-monthly intervals. The principal outcome of interest was the comparison of microfilarial counts, both before and after treatment and between the two treatment groups. The baseline questionnaire is shown here. During this practical you will create a data entry form to enter data from this questionnaire.
Onchocerciasis Baseline Survey - Personal Data
IDNO | | | | |
Name
Village
Date of interview DATE | | | | | | | | |
Age in years AGE | | |
Sex (M or F) SEX | |
Tribe (Mende 1 Other 2) TRIBE | |
Household number HHNO | | |
Relationship to head of household? REL | |
1 Self 5 Other Blood Relative
2 Parent 6 Spouse
3 Child 7 Other Non-Blood Relative
4 Sibling 8 Friend
Have you lived in this village all your life? (Y/N) LIV | |
How long have you lived in this village? STAY | | | (years)
What is your main occupation? OCC | |
0 None / Missing 5 Office Work
1 At Home 6 Trading
2 At School 7 Housework
3 Farming 8 Mining
4 Fishing
2.4 Using EpiData
2.4.1 Starting the EpiData programme
Start EpiData by clicking on the icon in the Statistical Applications window.
The first time you open this, a 'Welcome' window will appear. Click on the bottom right box, and then on 'Close', to prevent this window showing again.
EpiData runs in a single window. Functions are selected from menus or from toolbars, as in any other Windows application.
At the top of the screen there are two toolbars. The work process toolbar takes you through the main stages in order: create QES files, create REC files, add or revise checks, data entry, lists and reports, and export and backup of data.
2.4.2 Creating a QES file using the EpiData editor
The QES file is a text file which tells EpiData which type of data to expect, and what the variables are called.
EpiData variables are defined using special characters entered in the QES files:

Type               Example       EpiData definition
Numeric            1987          ####
Text               Freetown      ________
Upper-case text    M             <A>
Yes/No (logical)   Y             <Y>
Date (European)    12/06/1987    <dd/mm/yyyy>

For example, in the onchocerciasis dataset, IDNO is a 4-digit numeric variable. The line in the QES file would therefore be:-

IDNO ####
We will now begin to create a questionnaire (.QES) file using the EpiData editor. For each variable you must state the variable name, variable type, and variable length. You can choose any sensible variable name, but the variable type must be one of the set of types defined by EpiData. You can define the variable type and length in three ways in EpiData.
1 Write everything manually.
2 Use the code writer (icon on the far right of the icon toolbar). The code writer watches what you type. If you type some field definition characters (e.g. #, _, <A, <d, <m, <Y, <I, <S), EpiData will prompt you for the information required (e.g. length, number of decimal places) to complete the variable definition.
3 Use the field pick list (icon next to the code writer icon on the icon toolbar). The field pick list displays a dialogue box that allows you to specify variable definitions and insert them into the questionnaire.
WORKED EXAMPLE: CREATING A NEW QES FILE
- Open EpiData
- Click the Define Data button on the work process toolbar and select the New QES file option. This will open a new editor window.
FIRST we will define variables manually.
In the editor window type the following:

IDNO ####
Now click on the third icon from the right, Preview Data Form, on the editor toolbar to see what your form looks like for the data entry clerks.
Go back to the editor window by clicking on 'Untitled 1' in the bottom left-hand corner of the screen.
SECOND, use the code writer.
Turn the code writer on by clicking the code writer icon on the editor toolbar (the icon on the far right of the toolbar).
Type the entry line for age. When you type # you will see a box asking you to 'enter length of field'. Age has a maximum of 2 digits, so you enter '2' in the box and click on OK.
You should then see the line

Age ##

in the editor window.
Now type the line entry for Region. You want to record this in capital letters, so you type

Region <A

and the code writer will prompt you for the length of the field.
THIRD use the field pick list
Type the entry line for sex
Sex
Now turn the field pick list on by clicking the icon on the editor toolbar.
You will see a box titled 'Field pick list'. Select 'Text' and then 'Upper case text' and Length '1', and then click on the 'Insert' box. You will now see that EpiData has inserted the variable type <A> - this asks for an upper-case text entry of length 1. Click the x at the top of the box to close the field pick list box.
You should then see the line
Sex <A>
Experiment with the range of options available using the field pick list
This worked example has been for illustration only. There is no need to save it for future exercises, but if you wish to save your work please use File->Save and allocate an appropriate file name.
It is good practice to create variables with unique, short, easily remembered and meaningful names. For long questionnaires it may be easier to use question numbers (e.g. Q01) as variable names. Complex coding information, while needed on paper collection forms, would tend to clutter up the data entry screen and should not be included.
Exercise 1 – creating a QES file
Create a QES file for the onchocerciasis questionnaire (see Section 2.3). You can shorten the questions to save space.
You should now be familiar with three ways of defining variables. During the exercise, please experiment so that you find the way that suits you best.
Note: The characters <, >, _, and # are used by EpiData to define variables. Avoid using them anywhere else in the questionnaire (.QES) file. You should also avoid using the @ character.
2.4.3 Saving the questionnaire (.QES) file
The finished QES file should look similar to the one below (you may have chosen different question names).
Relation to head of household? #
Lived in village all your life? <Y>
Years living in this village? ##
What is your main occupation? #
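For reference, a complete QES file for this questionnaire might look as follows. This is one possible sketch: the wording and field widths are illustrative choices (EpiData derives the variable names from the question text), not the only correct answer.

```
Onchocerciasis baseline survey

Id number                         ####
Name                              ____________________
Village                           ______________
Date of interview                 <dd/mm/yyyy>
Age in years                      ##
Sex (M or F)                      <A>
Tribe (Mende 1, Other 2)          #
Household number                  ###
Relation to head of household?    #
Lived in village all your life?   <Y>
Years living in this village?     ##
What is your main occupation?     #
```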
Exercise 2 – creating a QES file
Check your work carefully against the example questionnaire. The variable types should be identical to those shown, but the questions themselves may be slightly different, depending on the wording you have chosen.
Save the questionnaire file by clicking the save icon on the editor toolbar or by selecting the File->Save menu option.
When prompted, give the filename oncho.qes and click Save. (You must use the extension .QES for questionnaire files.)
Make sure you save to the dataepi folder which you should have created! Please ask if you don't know how to do this.
You can see what the data entry form will look like while you're creating it by clicking the Preview Data Form button.
2.4.4 Creating a data (.REC) file
We will now make a data file so that data can be entered onto the computer.
Exercise 3 – creating a data (.rec) file for the oncho data
Close all the open windows by selecting 'File', 'Close All', so that no EpiData files remain open.
Click the Make Data File button on the work process toolbar.
You will see a box titled 'Create data file from QES file'.
In response to 'Enter name of QES file', select oncho.qes.
In response to 'Enter name of data file', specify oncho.rec (this will be the default).
Click the OK button.
In response to the request for a data file label, type Oncho baseline survey.
You will see a box titled 'Information' which tells you that the file oncho.rec has been created.
Click OK.
EpiData has now created a data file, oncho.rec, using the information contained in the QES file (oncho.qes).