INSTITUTE OF POLICY AND STRATEGY FOR AGRICULTURE AND RURAL DEVELOPMENT
CENTER FOR AGRICULTURAL POLICY
CARD Project 030/06 VIE: Developing a strategy for enhancing the
competitiveness of rural small and medium enterprises in the
agro-food chain: the case of animal feed
Training Manual for CARD project 030/06 VIE
Donna Brennan and Sally Marsh
School of Agricultural and Resource Economics, University of Western Australia
July 2010
Table of Contents
1 Purpose of the training manual
2 Problem identification
2.1 Identification of key issues
2.1.1 Methodology/activities
2.1.2 Use of reviews and secondary data
2.2 Formulating researchable questions or hypotheses
2.3 Focusing data collection
2.4 References in this section
3 Survey design and sampling techniques
3.1 Introduction
3.2 Why is it so difficult to conduct a good survey?
3.2.1 Issues with translation
3.3 Steps in the process of doing a survey
3.3.1 Is a survey really needed?
3.3.2 Statement of information goals and uses
3.3.3 Collect background information
3.3.4 Focus groups
3.3.5 Select survey method (personal interview, phone, letter, web-based)
3.3.6 Determine sampling method and select sample
3.3.7 Draft questions
3.3.8 Pilot test the questionnaire
3.3.9 Redraft the survey
3.3.10 Train interviewers/enumerators
3.3.11 Collect the data
3.4 Sampling
3.4.1 Accuracy, bias and precision
3.4.2 Types of sample design
3.4.3 Sampling strategies
3.4.4 Proportional stratification by size
3.5 Question design
3.5.1 Designing good survey questions
3.5.2 Should you use open or closed questions?
3.5.3 If closed questions, which type of closed question format?
3.5.4 Using Likert Scales
3.6 References in this chapter
4 Data entry
4.1 Principles of database design
4.2 Designing tables from survey questionnaires
4.2.1 Example of table design in IFPRI feedmill database
4.3 Practicing using queries
4.3.1 Types of queries
4.4 Designing a database for the CARD Livestock questionnaire
5 Data cleaning and analysis – techniques using Stata
5.1 Data cleaning
5.2 Creating Output Templates
5.3 Stata dofiles for feed use as an example
5.3.1 Objectives
5.3.2 Exercises
6 Analysis of Survey Data
6.1 Treatment of variables in survey analysis
6.1.1 The number of variables
6.1.2 Levels of measurement
6.1.3 Method of analysis
6.1.4 Descriptive and inferential statistics
6.2 A Quick Overview of Descriptive Statistics
6.2.1 Measures of location
6.2.2 Measures of spread
6.2.3 Measures of shape
6.2.4 Techniques for displaying and examining distributions
6.3 Data management in Excel
6.3.1 Notation for basic functions in Excel
6.3.2 Using more complex functions in Excel - SUMIF
6.3.3 Using more complex functions in Excel - COUNTIF
6.3.4 Using more complex functions in Excel - TRANSPOSE
6.3.5 Pivot Tables in Excel
6.3.6 Using MACROs in Excel
6.4 References in this section
7 Assessing competitiveness – principles and exercises
7.1 Types of market structure
7.1.1 Perfect competition
7.1.2 Monopoly
7.1.3 Monopolistic competition
7.1.4 Oligopoly
7.2 Analyzing competitiveness
7.3 Product differentiation in the feedmill industry
7.4 Competitiveness in the livestock feed production sector
7.4.1 Evidence of returns to scale
7.4.2 Supply chain differences
7.4.3 Competitive strategies
7.5 Production economics for feed operations – least-cost feed rations
7.5.1 Some basic animal nutrition
7.5.2 The pig diet used in this training course
7.5.3 Linear programming
7.5.4 Mathematical specification of the linear programming problem
7.5.5 Least cost feed analysis using linear programming
7.6 References for this chapter
8 Reporting and communication
8.1 Writing the research report
8.1.1 Working in Outline
8.1.2 Labelling and cross referencing tables and figures
8.1.3 Tables and figures in a Research Report
8.1.4 Other conventions for Report writing in English
8.2 Some common errors in English writing
8.2.1 Language used in reports
8.2.2 Correct use of some English words in Reports
8.3 Writing policy briefs
8.3.1 Preparation of a Policy Brief
1 Purpose of the training manual
The purpose of this manual is to document theoretical issues, methodology and analytical techniques that were used in the process of conducting CARD Project 030/06 VIE “Developing a strategy for enhancing the competitiveness of rural small and medium enterprises in the agro-food chain: the case of animal feed”. Work for this project was conducted from mid-2007 to early 2010. It is hoped that the experiences gained from the project work and documented in this training manual will be useful for future work undertaken by IPSARD/CAP.
The chapters include:
• 2 Problem identification. In this chapter, techniques to identify key issues, formulate researchable hypotheses and focus data collection are discussed using examples from the project.
• 3 Survey design and sampling techniques. This chapter focuses on aspects of socio-economic surveying, including: reasons why surveys can be difficult to conduct; steps in conducting a survey; sampling techniques used in surveys; and question design.
• 4 Data entry. This chapter contains the material from a course on database design presented by Donna Brennan in July 2008. It should be read in conjunction with electronic course materials in the zip file “Course database and access forms.zip”. The chapter includes sections on principles of database design, designing tables from survey questionnaires and using queries in Microsoft Access.
• 5 Data cleaning and analysis – techniques using Stata. This chapter contains tips and techniques for data cleaning, building data output templates, and data analysis. Training notes recorded by members of the CAP team (Pham Thi Lien Phuong and Nguyen Thi Thinh), in the form of annotated Stata do files, are provided in this section. Data needed for these analyses will be in the CARD project database kept at CAP.
• 6 Analysis of survey data. This chapter includes a discussion of treatment of variables in analysis of survey data and an overview of descriptive statistics. Additionally, it includes material from a training course in data management in Excel provided to team members when they visited Perth in August 2009. The course covered special functions for managing and querying large data tables, including conditional sums, transposing data, and extracting subsets using pivot tables. The course also covered the basics of building macros. Training notes recorded by members of the CAP team (Pham Thi Lien Phuong and Nguyen Thi Thinh) are provided in this chapter.
• 7 Assessing competitiveness - principles and exercises. This chapter briefly outlines types of market structure. Issues to consider when analyzing competitiveness, and in particular issues when assessing competitiveness of firms producing a heterogeneous product, are discussed. Aspects of competitiveness investigated in the project are outlined, and material from a training course on least-cost feed rations is included.
• 8 Reporting and communication. This final chapter focuses on providing tips for producing a well-structured and well-written Research Report, including techniques for handling large documents in Microsoft Word and a discussion of common errors made in English writing. Finally we outline the preparation of a Policy Brief.
The report was mainly written by Dr Donna Brennan and Sally Marsh, but also contains contributions from Vietnamese CARD project team members Pham Thi Lien Phuong and Nguyen Thi Thinh in Chapters 5 and 6.
A number of electronic files are provided as part of and to be used in conjunction with this report:
For Chapter 4: Course database and access forms.zip
For Chapter 6: macro_practice.xls
For Chapter 7: Cong Nhan May Mac.xls
Least cost feed ration exercise.xls
2 Problem identification
2.1 Identification of key issues
A key task at the beginning of a research project is to scope key issues and existing information and data relevant to the planned research. There are a number of standard ways in which this can be done, including:
• Literature reviews;
• Collection of secondary data;
• Identification of and engagement with key stakeholders, e.g. interviews, field visits, and workshops designed to seek stakeholder/expert ideas and opinions;
• Consultations with known experts;
• Overseas study tours; and
• Participatory appraisals, a technique for consultation with local people that is often used in rural development projects (see http://en.wikipedia.org/wiki/Participatory_rural_appraisal).
In this project, methods used to identify key issues involved consultations with stakeholders and known experts, a study tour to Thailand, collection of secondary data and a literature review.
2.1.1 Methodology/activities
Early engagement with stakeholders and experts
Early in the project, time was spent identifying key stakeholders and experts (e.g. feedmills, staff of MARD, Vietnam Animal Feed Association) and discussing the planned project with them. For example, a meeting in 2007 with Mr Le Ba Lich, Chairman of the Vietnam Animal Feed Association (VAFA), elicited the following information, issues and opinions (each question is followed by a summary of the reply):
• What is the benefit for a feedmill to join VAFA? They get technical support, recipes (Lich and other scientists are involved in formulation) for all feeds for pigs and chickens, and training. Some companies come when prices change to get advice on how to change feeds (the ability to change recipes depends on storage, inventory and knowledge of market prices).
• What are the characteristics of small feedmill enterprises? Generally producing <3,000 T/yr (there are 145 businesses with <5,000 T/yr – 10% of total production), often don’t have an office or own equipment (rent), sell animal feed concentrates (premix), sell directly to farmers, located in rural areas.
• Are small mills inefficient? Small mills still have their market share – they sell to very small land holders (who are interested in low prices). Smallholder animal production is 90% of production – small mill production is 10% of total.
• Why are small mills going bankrupt? They don’t have sufficient capital to sustain/invest in their business, and material costs are increasing.
• Why does the GoV want to encourage them to continue? GoV has slogans/policy to support SMEs, but in his opinion the GoV should only support medium enterprises. Support might include land, capital, interest rates.
• What is the cutoff between medium and large enterprises? Discussion about this; Mr Lich considered >10-20,000 T/yr to be large.
• What is a low quality feed? Protein content too low, inaccurate labeling, high mycotoxins, feed stored in areas with high contamination risk.
• How many feedmills employ a nutritionist? Large mills yes, some medium mills; others get recipes from others.
• Does VAFA provide specific or generic recipes? Specific – depending on what raw materials are available.
• Regulation: This is a difficult area.
o No laboratory in the livestock dept and no experience. If they sample and send to another laboratory (in the north or south) it is costly and the Dept of Livestock doesn’t have budget for this.
o MARD has funds but they are insufficient.
o Corruption is an issue.
o MARD not authorised to take food in the market as this is linked to the Ministry of Trade (Dept of Marketing and Management).
• Can the VAFA guarantee feed quality for small mills? No.
• What is a small holder farm? Uses traditional methods and has <100 chickens and <5 pigs.
• Do smallholder farms have a seasonal demand for feed? After summer they buy a pig and raise it for Tet. Small and medium mills have a cycle of increased production after August and up until Tet (main pig raising season in the north).
• Do any medium mills have breeding operations? Yes – Dong Nai, CP, Dabaco.
• Are there any independent breeding companies in Vietnam? One poultry research centre, some others I think.
• Do the large feed companies have a monopoly over animal breeding in
Table 1.1 Examples of issues and opinions obtained from CARD project field visits
Field visit: Field trip to DABACO
Activities and issues identified:
• Investment in technology and human development
• Management structure – SOE to equitised company
• Storage capacity
• Buying and importing strategies
• Quality control capability – use of laboratory
• Batch size and mill operations generally: throughput capacity (tonnes/hr, tonnes/day – the smaller the batch size the more the energy/unit cost), repairs and maintenance scheduling (cleaning of equipment, safety), some feeds harder to produce (chicken feed and small pigs which need smaller die)
• Do price and quality equate? Yes, but not perfectly as price can include services
• Pricing arrangements within and outside contracts
Field visit: Visited small domestic feedmill in Gia Lam
Activities and issues identified:
• Established 2002, 25-27 employees, produce 100T concentrate/mth, rent land for the mill (not really a mill – just a mixing facility)
• Biggest issues – capital, land, increasing production costs, cost of credit from VBARD (1.03% mth) – mortgages private assets
• Customers – agents at provincial level, markets to the mountainous areas as a priority as this is a good market for concentrates
• How does he compete? He has difficulties – especially in import procurement, also bigger companies give agents a bigger bonus. Only competition from large companies is an issue – other SMEs not a problem
• Marketing policy? His strategy is to have good quality by buying good raw materials, and to focus on mountain areas
• Quality control? Done in two stages: checks maize quality when he buys, expert from provincial dept level checks the product. Every 3 months he sends his product for testing to National Husbandry Institute. Fishmeal and soybean he tests more often (110,000 VND for one protein test). No laboratory – 100% of small mills don’t have a laboratory. Dept of Ag at provincial level comes in once per year to check the output – he has to pay for testing (100,000 to 200,000 VND/yr). Fined once when content didn’t match label (then changed his components)
• Recipes? He has one nutrition expert – also the German company he buys the premix from helps with recipe formulation, also VAFA
• Avian flu reduced sales by 30-40%.
Literature review
A review of the literature is usually essential, to see what is already known about the subject area, and what field research has already been done. Often, work done in Vietnam will be found in technical reports for MARD and donor projects, but other sources may be theses (both local and international), web-based publications, and scientific journals. In the case of the CARD project, we were interested in the lessons from international experience, and one of the components of the project was an international literature review which was conducted by Dr Johanna Pluske (Pluske, 2007). This review provided a desktop overview of the feed industry from a global perspective generally and with specific focus on three countries: Vietnam, China and Thailand. These countries were selected for review to identify similarities and lessons that may be useful in understanding the feed sector in Vietnam.
Collection of secondary data
Basic information about the nature of the industry, including recent trends in production, and differences in characteristics of production in different parts of the country, should be assembled. Aside from statistics reported by others in the technical reports mentioned above, there is a lot of detailed information at the regional and province level available from the GSO.
2.1.2 Use of reviews and secondary data
Information from collection of secondary data forms the basis of the background chapter presented in the livestock feedmill survey report (Phuong et al. 2010). The secondary data demonstrated the rapid rate of growth in livestock feed production since 2000, and highlighted the role of domestic and imported ingredients in feedmill production.
Specific input into the planned research from the secondary data collection included:
• An examination of the spatial pattern of production showed that the Red River Delta and South East region (and to a lesser extent the Mekong Delta) were the most important livestock feed production areas, and that is why we chose to conduct the survey in those regions.
• The evidence on price trends for feed inputs and feedmill outputs highlighted the problem of rapidly rising feed input prices which have been encountered by the feedmill industry in recent years, and helped us to form some basic survey questions about the setting/revision of feedmill output prices.
Information from the literature review, workshop, interviews and field visits was used to help develop possible research questions through team discussions and meetings. A team meeting at CAP in 2007 identified a range of research questions that could be asked and further secondary data that would be needed to help answer these questions. How these possible research questions were then considered is discussed in Section 2.2.
2.2 Formulating researchable questions or hypotheses
It is unlikely that all relevant research questions can be answered by any individual research project. Any project is limited by the resources and time available to conduct the research. Some information that might be needed to answer a question may be unavailable or particularly difficult to obtain. It is important to carefully consider possible research questions arising from initial observations/data to see if it is possible to answer the question with the planned research.
The formulation of research questions/hypotheses is an application of scientific method, i.e.:
• Collection of facts by observation or experimentation,
• Formulation of a research question or hypothesis to explain facts in terms of cause and effect relationships,
• Deductions from a question or hypothesis that can be tested, and
• Verification of deductions by new observation or experimentation.
The scientific method attempts to systematize the process of generating scientific knowledge. However, it is a general approach or a general way of thinking, not a specific recipe for any given research project. The key to success in research is in being able to ask an important question in such a way that the question can be answered. There are an infinite number of important questions to ask, and for many of them there are no practical methods of providing answers. Likewise, there are an infinite number of questions with reasonable methods of providing the answers, but the questions themselves are unimportant. Useful research questions must aim to have answers which are important, and have hypotheses that can be tested and confirmed or refuted.
The project team discussed a wide range of possible research questions arising from the scoping studies and then focused these into a much smaller number of research questions that were considered to be important, and could be investigated by the planned research These were:
• Are economies of scale evident in the livestock feed sector in Vietnam?
• How different are production and trading between large feed mills and SMEs in terms of material input use, storage, product types, quality control, types of customers and services offered to customers?
• Are the raw material procurement and output distribution channels used by SMEs and larger feed mills different?
• How do domestic SMEs compete in the sector against larger foreign-owned mills?
• Is there any evidence of prices for raw material imports being higher than domestic prices for raw material inputs?
• Is there an opportunity for Vietnamese SMEs to compete in niche markets (e.g. smaller mills targeting more remote areas)?
• What are the constraints facing SMEs operating in the livestock feed sector in Vietnam?
2.3 Focusing data collection
Agricultural economists often use information from agricultural scientists when seeking to understand production issues, and in focusing data collection in agricultural surveys. In the case of the feed industry, we can use information from scientists about animal nutrition, and the quality and composition of different feed inputs, to focus our questions. We can also use the measures adopted by scientists and by the industry to assess the technical efficiency of production. The indicator of technical efficiency most commonly used by scientists is the Feed Conversion Ratio. This is a measure of the quantity (kg) of feed fed per kg of liveweight produced. A higher feed conversion ratio means that more feed is required to produce a unit of output, thus indicating a less efficient system. In our analysis of animal producers, we collected data on feed input use and liveweight production, so that we could calculate and compare the feed conversion ratio achieved on different farms.
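As an illustration, the feed conversion ratio comparison described above can be sketched in a few lines of Python. This is a minimal sketch only: the farm names and quantities are hypothetical and are not taken from the CARD survey data.

```python
def feed_conversion_ratio(feed_kg: float, liveweight_kg: float) -> float:
    """Return kg of feed fed per kg of liveweight produced."""
    if liveweight_kg <= 0:
        raise ValueError("liveweight produced must be positive")
    return feed_kg / liveweight_kg


# Hypothetical farms: (feed fed in kg, liveweight produced in kg).
farms = {"farm_a": (300.0, 100.0), "farm_b": (280.0, 80.0)}

fcr_by_farm = {
    name: feed_conversion_ratio(feed, liveweight)
    for name, (feed, liveweight) in farms.items()
}

# A lower ratio means less feed per kg of output, i.e. a more efficient farm.
most_efficient = min(fcr_by_farm, key=fcr_by_farm.get)
print(fcr_by_farm, most_efficient)
```

Here farm_b uses less feed in total but has the higher ratio, which is why the comparison is made per kg of liveweight rather than on total feed use.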
There are two main types of ingredients used in producing animal feed: energy and protein. Energy-rich ingredients include maize, rice and cassava. Protein-rich ingredients include soybean cake and fishmeal. Animal feeds normally must meet certain criteria such as energy content (calories per kg) and protein content (%). With information on the energy and protein content of feed ingredients, and the nutritional requirements of certain animal feeds, we can examine the effect of feed input prices on cost of production. Basically we ask the question: What is the least cost combination of feed ingredients that can be used in making animal feed, given that we know what nutrient composition the feed must have?
By setting up a model to assess this question (see Chapter 7), we can also examine the impact of policies affecting feed ingredient prices (such as import taxes, price seasonality and access to stored maize). We can also forecast the likely demand for feed ingredients as feedmill demand grows, and as relative prices of feed ingredients change.
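The least-cost question above can be illustrated with a small brute-force search in Python, ahead of the linear programming treatment in Chapter 7. This is a sketch under stated assumptions: the ingredient compositions, prices and nutrient requirements below are hypothetical round numbers, and the search simply tries every mix in whole-percent steps rather than solving a linear program.

```python
from itertools import product

# Hypothetical ingredient data for illustration only (not project data):
# energy in kcal/kg, protein in g/kg, price in VND/kg.
INGREDIENTS = {
    "maize": {"energy": 3300, "protein": 85, "price": 5000},
    "soybean_cake": {"energy": 2500, "protein": 440, "price": 9000},
    "cassava": {"energy": 3000, "protein": 25, "price": 3500},
}
MIN_ENERGY = 3000  # assumed requirement, kcal per kg of feed
MIN_PROTEIN = 160  # assumed requirement, g per kg of feed (16%)


def least_cost_ration(ingredients, min_energy, min_protein):
    """Search whole-percent mixes; return (cost per kg, shares) of the cheapest feasible one."""
    names = list(ingredients)
    best_total, best_mix = None, None
    for head in product(range(101), repeat=len(names) - 1):
        last = 100 - sum(head)
        if last < 0:
            continue  # shares must add up to 100%
        mix = dict(zip(names, head + (last,)))
        # Totals are per 100 kg of feed, so integer arithmetic stays exact.
        energy = sum(mix[n] * ingredients[n]["energy"] for n in names)
        protein = sum(mix[n] * ingredients[n]["protein"] for n in names)
        if energy < 100 * min_energy or protein < 100 * min_protein:
            continue  # nutrient requirement not met
        total = sum(mix[n] * ingredients[n]["price"] for n in names)
        if best_total is None or total < best_total:
            best_total, best_mix = total, mix
    if best_mix is None:
        return None  # no feasible mix on the grid
    return best_total / 100, best_mix


cost_per_kg, shares = least_cost_ration(INGREDIENTS, MIN_ENERGY, MIN_PROTEIN)
print(f"Least-cost mix: {shares} at {cost_per_kg:.0f} VND/kg")
```

Changing a price in INGREDIENTS (for example, to mimic an import tax on soybean cake) and re-running the search shows how the cheapest mix responds.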
2.4 References in this section
Pluske, J. 2007. A Desktop Review of the Animal Feed Sector at a Global Scale. Report for CARD Project 030/06 VIE, Center for Agricultural Policy, Hanoi.
Phuong, P.T.L., Thinh, N.T., Brennan, D., Marsh, S. and Nguyen, B.H. 2010. Small-Medium Enterprises in the Livestock Feed Sector in Vietnam. Vol. I: Livestock feed production. Report for CARD Project 030/06 VIE, Center for Agricultural Policy, Hanoi.
3 Survey design and sampling techniques
3.1 Introduction
Surveys are used to ask a consistent set of questions to a sample of people, so that responses can be recorded and analysed. They are the standard tool for professionals who are interested in people’s and firms’ activities, attitudes, beliefs, intentions and preferences. As a tool, surveys can be very difficult to design and implement. Surveys are conducted for two main reasons:
• to get otherwise unavailable information, and
• to allow researchers to generalise about a large population by studying only a small proportion of the population.
Good policy analysis is critically dependent on good quality data. There are many factors that influence the quality of survey data. It will be dependent on the quality of:
• the survey design;
• the implementation of the survey, and
• the treatment of the raw data.
The main problems associated with surveys are people-related, not statistical, and they include issues such as the ambiguity of communication by language, the attitudes of respondents to their participation in the survey, and the limits to human memory. In this chapter, common problems encountered in the design and conduct of surveys that lead to poor validity of results are outlined, following Pannell and Pannell (1999).
3.2 Why is it so difficult to conduct a good survey?
For a variety of reasons, getting accurate information from surveys can be very difficult. Foddy (1993) outlines a set of reasons why this is so.
• Even simple factual questions are often answered incorrectly. This is especially the case if people are being asked about activities that happened in the past.
• The relationship between what people say they do and what they actually do is sometimes poor.
• People’s attitudes, beliefs, opinions, habits and interests often seem to be very unstable. The instability may be due to actual instability of attitudes, but it may reflect other things, such as the way the question is asked.
• Small changes in wording can sometimes produce changes in responses.
• Respondents frequently misinterpret questions. This can easily be seen if respondents are asked to repeat questions in their own words. Often this will show that people have misunderstood what is being asked.
• Answers to earlier questions can affect answers to later questions.
• Changing the order in which response options are presented sometimes affects respondents’ answers. If people are asked to read the options for themselves, they tend to go for the first option. This is called a “primacy” effect. If the options are presented verbally, they tend to go for the last one: a “recency” effect.
• Answers are sometimes affected by the question format. This is most easily seen when comparing answers from “open” and “closed” format questions. For example, if people are asked an open question about their information sources, they are less likely to nominate sources than if a closed question is asked with a list of possible information sources that can be ticked if used.
• Cultural or ethnic differences can affect not only the interpretation of a question, but also people’s willingness to give accurate answers. For example, in a culture where governments and/or businesses are perceived as being corrupt or exploitive, responses to questions from outsiders are likely to be affected by the risk that responses may be obtained and abused by government officials or others.
The first three dot points above are unavoidable to some extent. The other factors all have implications for the design and conduct of a survey. For all of these reasons it is essential to invest a lot of care and effort into developing, testing, improving and re-testing your survey before you actually conduct it.
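One common way to limit the order effects listed above, which the manual itself does not prescribe, is to vary the order of response options between respondents. The sketch below is a minimal illustration in Python; the option labels and the respondent identifier are hypothetical.

```python
import random


def shuffled_options(options, respondent_id, keep_last=("Other",)):
    """Return the response options in a reproducible random order for one respondent."""
    rng = random.Random(respondent_id)  # seeding by respondent makes the order repeatable
    movable = [o for o in options if o not in keep_last]
    fixed = [o for o in options if o in keep_last]
    rng.shuffle(movable)
    return movable + fixed  # catch-all options such as "Other" stay at the end


# Hypothetical closed question on information sources.
OPTIONS = [
    "Extension officer",
    "Feed company agent",
    "Other farmers",
    "Television or radio",
    "Other",
]
order_for_17 = shuffled_options(OPTIONS, respondent_id=17)
print(order_for_17)
```

Recording the order shown to each respondent allows any remaining primacy or recency effect to be checked during analysis.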
3.2.1 Issues with translation
There is an additional problem when developing surveys within multi-lingual teams. If the survey is developed in English and then translated into Vietnamese (or vice versa), extra care must be taken to ensure that the translation is correct. It is very easy for small translation errors to make a big difference to the data collected. For example, despite a great deal of care and checking, at least one small translation error occurred in the CARD project feedmill survey. The English version of the survey asked about storage and one of the options was “silo”. The Vietnamese translation in the survey for “silo” was a word that meant “underground bunker” – an old meaning of the word “silo” within a military context. The more usual English dictionary definition of silo is “a tower-like structure for storing grain”.
This error was not noticed until after the survey had been completed, when it became apparent that very few firms said that they had “silos” for storage, despite the obvious fact that silos were often clearly visible. Care should be taken when using Vietnamese-English dictionaries and translation software, which are often not correct for modern English use. If the Vietnamese collaborators are uncertain about the meaning of English words it is better to ask the English-speaking collaborators. The translation of the survey needs to pass tests of “common-sense”. If a question seems silly or not relevant, then it could be that the translation is inaccurate.
3.3 Steps in the process of doing a survey
3.3.1 Is a survey really needed?
It is important to ask if a survey to collect the information is really needed. It is possible that the information may be already available from other sources such as:
• a previous survey (much information is collected routinely and regularly but not used);
• published data; and
• reliable interpersonal feedback from contact with farmers and growers.
3.3.2 Statement of information goals and uses
If the information is not available from other sources, the next step is to write a statement of information goals and uses. That is, what information do you want to know and what will you do with this information when you have collected it? The goal for the feedmill survey was articulated as wanting to answer a series of research questions, as below:
• Are economies of scale evident in the livestock feed sector in Vietnam?
• How different are production and trading between large feed mills and SMEs in terms of material input use, storage, product types, quality control, types of customers and services offered to customers?
• Are the raw material procurement and output distribution channels used by SMEs and larger feed mills different?
• How do domestic SMEs compete in the sector against larger foreign-owned mills?
• Is there any evidence of prices for raw material imports being higher than domestic prices for raw material inputs?
• Is there an opportunity for Vietnamese SMEs to compete in niche markets (e.g. smaller mills targeting more remote areas)?
• What are the constraints facing SMEs operating in the livestock feed sector in Vietnam?
3.3.3 Collect background information
The next step is to collect background information to familiarise yourself with the issues you have decided to conduct the survey on, so that you have an understanding grounded in reality and a “feel” for the issues. This can involve reading, talking to relevant people and experts, and running focus groups. This step is often called “scoping” the issues, and the procedures carried out for the CARD project are outlined in Chapter 2. More information on running focus groups is given in the next section.
3.3.4 Focus groups
A focus group is a small group of people (say six to eight) drawn from the population you will survey. You ask these people open-ended questions about the issue you are interested in and record their responses. Focus groups are good for: i) helping ensure that you ask about aspects of the issue which are most important to the relevant population, ii) helping to word survey questions using language which is appropriate to the likely survey respondents, and iii) alerting you to issues and problems which you weren’t aware of.
The procedure for focus groups is to:
• Select a sub-sample of your population (ideally a minimum of three different, but similar, groups of eight to ten people) You should think about the
Trang 15important characteristics of your target group and make sure they are represented in this small sample Sometimes, due to a lack of time or resources, researchers use a convenient sample for the focus group; for instance, a group of farmers that they already have links with, but this reduces the representativeness of the data collected Sometimes it is advisable to hold separate focus groups for different types of people that you might want to include in the same survey, especially if their responses to questions are likely
to be affected by the presence of the other type of people
• Create a prompt list of questions You will have already formed a range of ideas which you think need to be included in the survey from your review of the literature and talking to stakeholders and experts Write down the key issues to use as discussion starters with the focus group Use the prompt list as
a check list to make sure that these issues are covered in the discussion The order of the issues in the discussion in not important Be prepared to give attention to new issues and ideas which are not part of your prompt list
• Facilitate the group during the discussion. Your role is to get a discussion going in the general topic area and then observe and record the discussion. Occasionally you will prompt the discussion by asking an open-ended question to address the issues on your list. Some points to remember when facilitating are:
o Give a brief introduction about the purpose of the discussion and then invite people to speak about the topic
o Use prompt questions to keep the discussion on the topic, but allow some digressions
o Use probing questions to encourage detail and to ask for elaboration and clarification. They should be offered in a conversational style: “So what do you think of…?”; “So can you tell me more about that?”
o Listen to the language being used to describe the issues and adapt your own to it
o Remain neutral: be interested but do not show surprise, anger or embarrassment at any of the comments
o Raise issues in an open-ended manner, e.g. “How do you feel about…?” rather than “Are you satisfied with…?”
o Be aware of body language: avoid giving the impression that the answers you are getting are the correct ones; instead look interested and encourage people to keep talking
o Allow silences as signals that you’d like them to keep talking
• Tape observations and/or write them down
• Analyse the tape/written record to provide major issues for questions and suitable wording
2.3.5 Select survey method (personal interview, phone, letter, web-based)
Survey data can be collected in a number of ways: by post, web-based, by phone or in a face-to-face interview. There are a number of factors to consider when determining which is the most suitable method. These include:
• Cost. Phone, web-based or letter surveys are cheaper; face-to-face is the most expensive and time consuming. There are a number of websites which offer free (or relatively inexpensive) use of web-based survey tools. One such site is www.surveymonkey.com (“an online survey tool that enables people of all experience levels to create their own on-line survey quickly and easily”).
• Size and location of your sample. If your sample is large it takes significant resources to conduct face-to-face surveys. Location of respondents in remote areas also creates difficulties for face-to-face surveys.
• Response rate. It is common to have a response rate of 30% or less in postal and web-based surveys. Response rates for phone or personal interviews are higher: around 70%. Research indicates that response rates in postal and web-based surveys are similar, but of course web-based surveys assume that respondents have access to a computer and are computer literate. Response rates to mail and web-based surveys can be improved by following a standardised procedure developed by Dillman (2000). For mail surveys this “tailored design” approach includes four contacts: a preliminary postcard, a hard copy survey with a cover letter explaining the purpose of the study, a follow-up/reminder postcard, and a replacement hard copy survey with cover letter to non-respondents. Response rates for mail and web-based surveys can also be increased by offering incentives to respondents, e.g. entry into a prize draw if they respond.
• Complexity of information being collected. If the survey requires complex information or large amounts of information, a personal interview may be the only feasible method. In all surveys the questionnaire should be as short as possible, but in postal and web-based surveys brevity is especially important so as not to reduce the response rate.
• Time available. Telephone and web-based surveys are favoured for their speed compared to mail surveys and face-to-face interviews.
• Literacy levels. Low levels of literacy and low literacy competency can be serious issues for mail and web-based surveys.
• Validity, or the risk of introducing bias into the survey results. In face-to-face and telephone interviews, the interviewer is a threat to validity. Inappropriate non-verbal behaviour, failure to clarify vague replies, failure to use the exact question wording, and failure to accurately record the respondent’s reply are all common problems. Biased samples through low response rates are the most worrisome aspect of postal and web-based surveys. Even if a relatively high rate of, say, 60% were obtained, there is still the question of what the distribution of replies would have looked like if everyone had responded. Those who do not respond may well be self-selecting on the basis of a particular characteristic, e.g. education level. If education is also likely to be associated with the issue you are investigating in the survey, then you have a biased sample.
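The effect of self-selection on the results can be illustrated with a small calculation. The sketch below is hypothetical: the two education groups, their population shares, response rates and mean answers are invented numbers, not CARD project data. It compares the mean answer in the whole population with the mean expected among those who respond.

```python
# Hypothetical population split by education level (all numbers are invented).
groups = {
    # name: (share of population, response rate, mean answer on a 1-5 scale)
    "higher education": (0.30, 0.90, 4.0),
    "lower education": (0.70, 0.50, 2.0),
}


def overall_response_rate(groups):
    """Share of the whole population that returns the questionnaire."""
    return sum(share * rate for share, rate, _ in groups.values())


def population_mean(groups):
    """Mean answer if every member of the population responded."""
    return sum(share * mean for share, _, mean in groups.values())


def respondent_mean(groups):
    """Expected mean answer among those who actually respond."""
    weighted = sum(share * rate * mean for share, rate, mean in groups.values())
    return weighted / overall_response_rate(groups)


print(round(overall_response_rate(groups), 2))
print(round(population_mean(groups), 2))
print(round(respondent_mean(groups), 2))
```

With these invented numbers the overall response rate is 62%, yet the mean among respondents (about 2.87) overstates the population mean (2.6), because the group that responds more often also answers differently.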
2.3.6 Determine sampling method and select sample
The aim is to get a sample which is as representative and unbiased as possible. This is addressed in more detail in Section 3.4.
2.3.7 Draft questions
There are also many issues associated with designing survey questions, and these are addressed in detail in Section 3.5.
2.3.8 Pilot test the questionnaire
One important type of pilot testing is to trial the draft survey with a small number of people/firms from the target population. This is useful for uncovering aspects of questions that will cause interviewers and respondents to have difficulty. Two interviewers should conduct each pilot interview: one should conduct the interview, the other should record impressions. When pilot-testing, consider the following questions:
• Were any of the questions difficult for the respondent to answer?
• Did any of the questions seem to make the respondent uncomfortable?
• Did you have to repeat any of the questions?
• Did the respondent misinterpret any of the questions?
Mail, phone and web-based surveys should also be pilot-tested. In this case the test respondent should complete the survey, and then be asked what they thought about the structure and content of the survey, along the lines of the questions above. Another type of pilot-testing which can be valuable is to attempt to analyse a set of fictional results to your survey. Often people don’t adequately consider which statistical method or what type of summaries they are going to use until after the data have been collected, by which time it is too late to realise that you didn’t ask for the right information to do the planned analysis. Attempting to do an analysis with artificial data (e.g. made up out of your head) will reduce this problem substantially. In the CARD project the exercise on least cost feed rations (see Chapter 7) was conducted to help clarify what data we would need from the survey.
2.3.9 Redraft the survey
Reconstruct the questions based on your experience with the pilot interviews and pilot analysis.
2.3.10 Train interviewers/enumerators
It is important that the interviewers are familiar with the survey topic and the questionnaire. In the training, emphasise the things they should not do because they would reduce the validity of the results. These things include using inconsistent wording to ask the questions, and providing guidance or reinforcement for particular types of responses.
2.3.11 Collect the data
If care has been taken with the survey design this should now be relatively straightforward. In Chapters 4, 5 and 6 of this Manual we discuss aspects of data entry, data cleaning and analysis.
2.4 Sampling
Most samples taken from a population are designed to reduce the cost of collecting data. There are generally two purposes: obtaining an estimate of a population parameter, such as a mean value, or testing a statistical hypothesis such as: “large-scale feedmills have lower costs of production than small-scale feedmills”. Estimation of population parameters is the most common purpose.
Note that a sample can only give results in terms of probability statements. Suppose the whole population of 3,000 workers in a factory were surveyed and you found that 1,500 smoked. The proportion of 50 per cent would be a population parameter. Suppose now that you sample 2,998 workers and find that 1,499 smoke. The proportion is still 50 per cent, but now you have an estimate, or statistic, rather than a parameter.
Also, note that drawing a sample implies that a random method has been applied to choose the sample such that each member of the population has a known chance of being selected. Sampling theory does not apply if this is not the case.
2.4.1 Accuracy, bias and precision
The accuracy of a sample estimate refers to its closeness to the population value. Consider a small population of 4 numbers, 15, 17, 18 and 22, with a mean of 18. We could select a sample of 2, say 15 and 18, to give a mean of 16.5. With a larger set of numbers we could select many samples, so that we have many mean values. These mean values have a distribution (the sampling distribution). The rule used to calculate them, here the sample mean, is referred to as an estimator, whereas the value obtained from any one sample is an estimate. If the expected value of the estimator (the mean of its sampling distribution) is equal to the population parameter then it is an unbiased estimator. Otherwise it is a biased estimator. The bias is the difference between the expected value of the estimator and the population value. Note that bias depends on the method of sampling as well as the method of estimation.
It is important also to recognise that any one sample may give an inaccurate estimate even if the estimator is unbiased. An estimator which is biased can also produce an accurate estimate for an individual sample.
Next, it is important to know what the sampling fluctuations might be on average. This can be obtained through a measure of the spread of the sampling distribution. The standard deviation or variance is the means to measure this property. The standard deviation of the means obtained from numerous samples is the standard error of the mean. This provides an estimate of the probable accuracy or precision of any one estimate.
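The four-number population above is small enough to list every possible sample, which makes the ideas of bias and standard error concrete. The following is a minimal sketch, assuming samples of 2 are drawn without replacement; the variable names are illustrative only.

```python
from itertools import combinations
from statistics import mean, pstdev

population = [15, 17, 18, 22]
sample_size = 2

# Mean of every possible sample of 2 drawn without replacement (6 samples).
sample_means = [mean(s) for s in combinations(population, sample_size)]

# Expected value of the estimator: the mean of its sampling distribution.
expected_value = mean(sample_means)

# Bias: expected value of the estimator minus the population value.
bias = expected_value - mean(population)

# Standard error: standard deviation of the sampling distribution.
standard_error = pstdev(sample_means)

print(sample_means)
print(expected_value, bias, round(standard_error, 3))
```

The six sample means range from 16 to 20, so an individual sample can be well off the true value of 18, yet their average is exactly 18: the sample mean is unbiased here, and the standard error (about 1.47) measures how far a single estimate typically falls from the population value.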
2.4.2 Types of sample design
The aim of sampling for a survey is to get a sample which is as representative and unbiased as possible. Firstly, the sampling frame should be determined: that is, what is the “working” population from which the sample will be drawn? The sample size will be dependent on a lot of factors (particularly the resources available to conduct the survey), but generally the larger the sample size the better. However, a large sample size is not a guarantee of the accuracy of the results, since it will not eliminate bias in the selection of a sample. Thus, size of sample alone is not enough. Seek the advice of a statistician to help determine the sample size (in relation to the “working population”) that will give an acceptable standard error.
The sample selection procedures/strategies should then be determined. Sampling can be random or “purposeful”.
• Random sampling avoids systematic bias in the sample and can be simple, stratified, or cluster random sampling
• “Purposeful” sampling increases the utility of information from small samples, by deliberately selecting a sample from a specific group of respondents
Simple random sampling is used when we believe that the population is homogeneous, and that a random set of individuals selected from it will represent the population responses.
Stratified random sampling is used when there are distinct subgroups of the population that we are interested in and we believe that these subgroups may respond differently to our survey. In the survey for the CARD project we wanted to make sure that we selected a set of respondents that represented the range of size categories that we were interested in. We wanted to analyse the impact of size class on many of our responses in the survey.
There are two ways of going about the selection of units for a stratified sample, once the strata have been decided upon. One is called proportionate allocation, and organizes the sample so that the share of surveys in each stratum is in the same proportion as the share of these groups in the overall population. For example, if the total population had 20% large firms we would have 20% large firms in our sample.
The other method of sampling is to emphasize the collection of data from a group where we think the group might have a higher variance. For example, if we thought that the large firms might have greater variance in their responses we would select more than 20% of the survey from this group. However, in order to work out how many to select we would need to have a reasonable estimate of the standard deviation of each population sub-group. If we have no idea as to whether variance would be different for different sub-groups (as in our case) we use the proportionate method (see Section 3.4.4 for the detail of the proportionate method used in the CARD project).
A formal classification and description of sampling strategies is given in the following section.
2.4.3 Sampling strategies
Random sampling
Here each of the N units has a calculable (non-zero) probability of being selected. Unrestricted random sampling means that each possible sample of n units from the population of N has an equal chance of being selected. Unrestricted random sampling is usually conducted “with replacement”, i.e. a unit drawn is returned to the population with the possibility of being drawn again. If there is no replacement it is usually referred to as simple random sampling. For example, simple random sampling can be done by numbering the N members of the population from 1 to N and then using uniform random numbers in the range 1 to N to choose n of them, ignoring any number that repeats.
Randomness is vital if the parameter estimates are to be unbiased. Random selection is not haphazard selection: it needs to be independent of human judgment. Use a random number generator (e.g. the RAND() function in Excel) or a lottery method (e.g. drawing from a hat).
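As an alternative to a spreadsheet, the two kinds of draw described above can be sketched in a few lines of Python using the standard library. The function names and the list of 107 firm labels are hypothetical and used only for illustration.

```python
import random


def simple_random_sample(units, n, seed=None):
    """Draw n distinct units (sampling without replacement)."""
    rng = random.Random(seed)
    return rng.sample(units, n)


def unrestricted_random_sample(units, n, seed=None):
    """Draw n units with replacement, so a unit can be drawn more than once."""
    rng = random.Random(seed)
    return [rng.choice(units) for _ in range(n)]


# Hypothetical sampling frame of 107 numbered firms.
firms = [f"firm_{i}" for i in range(1, 108)]

without_replacement = simple_random_sample(firms, 10, seed=1)
with_replacement = unrestricted_random_sample(firms, 10, seed=1)
print(without_replacement)
print(with_replacement)
```

Fixing the seed makes the draw reproducible, which is useful if the selection needs to be documented or checked later.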
Systematic Sampling
Divide the population size by the sample size to obtain the sampling interval, say 5000/100 = 50. Randomly select a number between 1 and 50 and then take every 50th unit from that starting point, e.g. 10, 60, 110, etc. In this case only 50 different samples can be chosen, one for each possible starting point. Note that the list should be in random order for the sample to have the precision of a simple random sample.
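The procedure can be sketched as follows. This is a minimal illustration, assuming the population is held as an ordered list; the function name is hypothetical.

```python
import random


def systematic_sample(units, n, seed=None):
    """Take every k-th unit after a random start, where k = len(units) // n."""
    interval = len(units) // n
    rng = random.Random(seed)
    start = rng.randint(1, interval)  # random start between 1 and k inclusive
    # Positions start, start + k, start + 2k, ... (converted to 0-based indexes).
    return [units[start - 1 + i * interval] for i in range(n)]


population = list(range(1, 5001))  # units numbered 1 to 5000
sample = systematic_sample(population, 100, seed=3)
print(sample[:3])
```

Only the starting point is random, which is why the number of distinct samples equals the sampling interval.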
Stratification
Other than by increasing the size of a sample, its precision can be increased by stratification. Before the sample is selected, information is used to divide the population into a number of strata; then a random sample is selected from each stratum. If the sampling fraction is the same for each stratum then there will be greater precision than a simple random sample because the different strata (e.g. sexes, ages, regions, towns, etc.) will be properly represented, i.e. the standard error will be reduced. It is not necessary that the sampling fraction be constant across the strata: there can be a proportionate stratified sample (uniform sampling fraction) or a disproportionate stratified sample (variable sampling fraction).
Proportionate Random Sampling
In this case, information on the population is known. For example, stratify a student population by type of degree and then sample according to the proportion in each degree so the sample has the same proportions. The reason it works is that the variation between the strata does not enter into the standard error, because it is reflected exactly in the sample. There is no sampling of the strata, only sampling within the strata. The greater the variation accounted for by the strata, the greater will be the gain from stratification. Thus the strata should be as distinct from each other as possible, and units within each stratum should be homogeneous. Select stratification factors related to the subject of the survey. The aim should be to stratify using a classification related to the key variables or attributes in the survey. However, you must know the value of the classifying variables for every member of the population.
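Proportionate allocation amounts to giving each stratum the same share of the sample as it has of the population. The sketch below shows one way to do this; the stratum counts are invented for illustration, and the rounding rule (largest remainders first) is an assumption, since the text does not say how fractions of a unit are handled.

```python
def proportionate_allocation(stratum_sizes, n):
    """Split a sample of n across strata in proportion to their population size."""
    total = sum(stratum_sizes.values())
    exact = {name: n * size / total for name, size in stratum_sizes.items()}
    allocation = {name: int(value) for name, value in exact.items()}
    # Hand out any units lost to rounding down, largest remainder first.
    shortfall = n - sum(allocation.values())
    by_remainder = sorted(exact, key=lambda k: exact[k] - allocation[k], reverse=True)
    for name in by_remainder[:shortfall]:
        allocation[name] += 1
    return allocation


# Hypothetical population counts by size class.
strata = {"small": 50, "medium": 30, "large": 20}
allocation = proportionate_allocation(strata, 20)
print(allocation)
```

With these invented counts, half the population is small, so half of the 20 sampled units are allocated to the small stratum.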
Disproportionate Stratified Sampling
Disproportionate stratified sampling allows for the possibility of using variable sampling fractions. This is useful where the populations in some strata are more variable than others. Where a stratum is more variable it is better to have a larger proportion representing it to gain greater precision. It can be shown in this case that the optimum precision is obtained for a given cost if the sampling fractions in the different strata are made proportional to the standard deviation in those strata and inversely proportional to the square root of the costs per unit in the strata.
Normally the standard deviations and costs per unit are not known. However, a pilot survey, previous surveys or expert judgment might be used, and some judgment may be necessary in choosing the sampling fractions.
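The allocation rule stated above can be written as a short calculation. In the sketch below each stratum's sample size is made proportional to its population size times its standard deviation, divided by the square root of its cost per unit, which is equivalent to the rule on sampling fractions. The strata, standard deviations and costs are invented guesses of the kind that might come from a pilot survey.

```python
from math import sqrt


def optimum_allocation(strata, n):
    """Allocate n across strata in proportion to N_h * sd_h / sqrt(cost_h).

    strata maps a stratum name to (population size, standard deviation,
    cost per unit). The results are left unrounded.
    """
    weights = {name: size * sd / sqrt(cost) for name, (size, sd, cost) in strata.items()}
    total = sum(weights.values())
    return {name: n * w / total for name, w in weights.items()}


# Hypothetical strata: (population size, standard deviation, cost per interview).
strata = {"small": (80, 10.0, 1.0), "large": (20, 40.0, 4.0)}
allocation = optimum_allocation(strata, 30)

# Sampling fraction in each stratum: sample size divided by population size.
fractions = {name: allocation[name] / strata[name][0] for name in strata}
print(allocation)
print(fractions)
```

In this invented case the large firms are four times as variable but twice as costly per square-root unit, so their sampling fraction (0.5) is double that of the small firms (0.25).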
Cluster and Multi-stage Sampling
A population can be thought of as made up of a hierarchy of sampling units of different sizes and types. It is possible to randomly select a sample of, say, student classes; one may then include all the students in that particular set of classes or randomly choose individuals from each class. To work, each student should only be a member of one class.
That is, randomly select a set of higher-level units and use each one to form a cluster, then select the students from the clusters. When sampling is done so that all of a cluster is used it is known as cluster sampling. When the cluster is itself randomly sampled it is multi-stage sampling. For example: randomly select some suburban areas, each with 50 houses, and interview all of the houses in each group of 50. Or, randomly select some suburbs, then randomly select houses from within those suburbs. The advantage of this method is the lower cost of travel and collection of the information.
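The difference between the two designs can be sketched with the suburb example. The frame below (8 suburbs of 50 houses each) and the function names are hypothetical.

```python
import random


def cluster_sample(clusters, n_clusters, rng):
    """Randomly choose clusters and take every unit in each chosen cluster."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]


def two_stage_sample(clusters, n_clusters, n_per_cluster, rng):
    """Randomly choose clusters, then randomly sample units within each one."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for name in chosen:
        sample.extend(rng.sample(clusters[name], n_per_cluster))
    return sample


# Hypothetical frame: 8 suburbs, each with 50 houses.
suburbs = {
    f"suburb_{s}": [f"suburb_{s}/house_{h}" for h in range(1, 51)]
    for s in range(1, 9)
}

rng = random.Random(7)
all_houses = cluster_sample(suburbs, 2, rng)        # 2 suburbs, every house
some_houses = two_stage_sample(suburbs, 2, 10, rng)  # 2 suburbs, 10 houses each
print(len(all_houses), len(some_houses))
```

Both designs confine fieldwork to two suburbs, which is where the saving in travel comes from; the second simply interviews fewer houses in each.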
Sampling with Varying Probabilities
Previously we have assumed the clusters were of near equal size. If they are not, then complications arise, since choosing a large unit (or primary sampling unit) first changes the probability of selection of a particular individual. Compare cluster sizes of 2000 and 20: a second-stage sampling fraction of 1/10 yields 200 people in the first case and 2 in the second. This can be handled by stratifying the primary sampling units or the clusters by size and selecting a sample of them in each size group, probably with a varying sampling fraction. Another approach is to select the primary sampling units with a probability proportional to size. This gives greater precision than would a simple random sample of primary sampling units, but an appropriate measure of size is needed.
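Selection with probability proportional to size can be sketched as below. The cluster sizes are invented, and the draw is made with replacement for simplicity (a practical design would usually avoid selecting the same cluster twice). The second function shows why the approach is attractive: if a fixed number of individuals is then taken from the selected cluster, every individual has the same overall chance of selection.

```python
import random

# Hypothetical clusters and their sizes (number of people in each).
cluster_sizes = {"A": 2000, "B": 20, "C": 500, "D": 480}


def pps_draw(sizes, n_draws, rng):
    """Draw clusters with probability proportional to size (with replacement)."""
    names = list(sizes)
    weights = [sizes[name] for name in names]
    return rng.choices(names, weights=weights, k=n_draws)


def individual_probability(sizes, cluster, take):
    """Chance that one person in `cluster` is selected in a single PPS draw
    followed by a fixed take of `take` people from the selected cluster."""
    total = sum(sizes.values())
    return (sizes[cluster] / total) * (take / sizes[cluster])


rng = random.Random(11)
print(pps_draw(cluster_sizes, 2, rng))
print({c: individual_probability(cluster_sizes, c, 10) for c in cluster_sizes})
```

The large cluster is more likely to be chosen, but each person within it is less likely to be picked at the second stage, and the two effects cancel.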
Multi-phase Sampling
In this case some limited information is collected from the whole sample and additional information is collected from sub-samples. With only one sub-sample it is known as two-phase sampling.
This is an efficient way of getting information, some of which may be time consuming and expensive to obtain. Also, there may be areas of questions where less precision is required. Also, the data in the first phase can be used to select the sample by stratification in the second phase. The use of two-phase sampling is only effective if the cost of data collection in the first phase is much lower than in the second phase, by about a factor of 10.
Replicated Sampling
With complex sample designs such as multi-stage sampling the calculation of standard errors is difficult. The paired selection design yields a simple formula. This involves the selection of two units per stratum in single-stage sampling or two primary sampling units per stratum in multi-stage sampling.
Another flexible approach is through replicated or interpenetrating sampling. In this case a number of sub-samples, rather than one full sample, are selected from the population. All the sub-samples have exactly the same design and each is a self-contained and adequate sample of the population. Each of the sub-samples has to be independent of the others.
The sample estimates can be calculated for each of the sub-samples, and the variation between these estimates provides a means of assessing the precision of the overall estimate. The advantages of replicated sampling are:
• Easy generation of preliminary results using one of the sub-samples
• Can obtain an estimate of some of the non-sampling errors such as variation between interviewers
The number of replications must be chosen. Some have used between 4 and 10. However, the larger the number the more limited the possible stratification of each sub-sample.
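The way variation between sub-samples indicates precision can be sketched as follows. This assumes the overall estimate is the simple average of the sub-sample estimates and that the sub-samples are independent and of the same design; the four estimates are invented.

```python
from math import sqrt
from statistics import mean, stdev


def replicated_estimate(replicate_estimates):
    """Overall estimate and its standard error from independent sub-samples."""
    k = len(replicate_estimates)
    overall = mean(replicate_estimates)
    # Standard error of an average of k independent estimates.
    standard_error = stdev(replicate_estimates) / sqrt(k)
    return overall, standard_error


# Hypothetical estimates of the same quantity from 4 replicated sub-samples.
estimates = [10.0, 12.0, 11.0, 13.0]
overall, standard_error = replicated_estimate(estimates)
print(overall, round(standard_error, 3))
```

No formula specific to the sample design is needed: however complex the design, the spread of the sub-sample estimates carries the information about precision.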
Quota Sampling
Quota sampling is different from probability sampling. Quota sampling is a method of stratified sampling in which the selection within strata is non-random. It is this non-random selection that constitutes its greatest weakness, and it has caused great debate about the value of quota sampling. Statisticians think it is theoretically weak, while market researchers defend its cheapness.
Various stratification schemes might be used, e.g. rural/urban, sex, age, etc. In quota sampling the interviewers are given the numbers to select from rural or urban areas, the numbers of males and females, the numbers in age-groups, etc. The strata chosen should be important in determining the variation in the variables of interest.
Arguments against:
• Not possible to calculate appropriate standard errors. It is sometimes argued, however, that these are small problems compared to other biases
• Interviewers may select in a biased way, choosing the people or firms that are easy to interview
• Control of interviewers in placing respondents into the right groups is difficult
Arguments for:
• Less costly
• Administratively easy
• Independent of the existence of sampling frames and may be the only method if there are no suitable sampling frames
Panel and Longitudinal Studies
This is collecting data from the same sample on more than one occasion. There are special problems in maintaining the representativeness of the sample. Samples such as this allow trends to be studied. Also, it is possible to study the nature of the change and the people who have changed and possibly why they changed, as well as the causes of the change.
A panel nearly always has greater precision than a set of independent random samples through time. Panels can also be used to measure the impact of experiments such as advertising. This design is known as the before-after design without control group. The problems with panel studies are the recruitment of willing respondents, sample mortality or loss, and conditioning of responses. Panel member replacement systems have been worked out.
Master Samples
If repeated samples are to be taken of the same area or population then the preparation of a master sample from which sub-samples can be taken is often efficient. Master samples simplify and speed up the selection process. Often the primary sampling units, such as regions or provinces, are selected once and sub-samples are then drawn from these. A master sample needs to be reasonably stable.
2.4.4 Proportional stratification by size
In the CARD project we were interested in the impact of size on competitiveness, and therefore it was important to draw a sample that represented the range of feedmill scales in operation. We classified large as being more than 80,000 tonnes per year, and medium as 20,000 to 80,000 tonnes per year. Because of the range of sizes within the small category, and the dominance of very small firms in our total population, we selected three different size categories for small-scale firms to ensure that we selected the representative range. Our sampling frame was a list provided by the Department of Livestock Production which contained the name, address, and production capacity of 241 feedmills operating in Vietnam in 2006. The total population of feedmills in the six provinces in which we were working (Ha Noi, Ha Tay, Binh Duong, Dong Nai, Long An, Tien Giang) was 107 firms. Around half these firms were of a scale less than 5,000 tonnes per year. Using the proportionate sampling approach, around half the firms in the sample should be from this smallest category. Similarly, around 10% of firms in the population were in the next size group, so our sample was selected so that 10% of the sample was from this size group, and so on (Table 3.1).
Table 3.1 Sampling strategy to represent scale of operation
Once we determined the desired number of firms in each province, using the same proportionate sampling approach, we used a spreadsheet function to randomly choose which firms were selected. This was done by assigning a random number to each firm (using the RAND() function in Excel), then copying and pasting the values into a new column (because the RAND() function otherwise recalculates and keeps producing new random numbers). We then sorted the firms according to the random number assigned to them, from highest to lowest. We then used this sorted set of firms as our priority list for sampling, going down the list until we had enough firms to meet our required sample size. A reserve list of firms was drawn up in the same way in case we needed to replace firms in the sample that did not wish to participate, or that were no longer operating in the sector.
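The same random-number-and-sort procedure can be sketched outside a spreadsheet. The firm labels, sample size and seed below are hypothetical, and taking the reserve list as the next firms down the sorted list is an assumption made for this illustration rather than a description of the CARD procedure.

```python
import random


def priority_list(firms, seed=None):
    """Attach a random number to each firm and sort from highest to lowest."""
    rng = random.Random(seed)  # a fixed seed plays the role of "paste values"
    keyed = [(rng.random(), firm) for firm in firms]
    keyed.sort(reverse=True)
    return [firm for _, firm in keyed]


# Hypothetical list of 20 firms in one province and size class.
firms = [f"firm_{i:03d}" for i in range(1, 21)]
ordered = priority_list(firms, seed=2006)

required = 5
sample = ordered[:required]                 # firms to approach first
reserve = ordered[required:required + 3]    # replacements, in priority order
print(sample)
print(reserve)
```

Sorting on a random key is equivalent to a simple random sample without replacement, and the sorted list gives a ready-made order in which to approach replacements.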
2.5 Question design
The aim is to design valid and reliable questions. Reliability means that a person would give consistent answers to your survey questions in different times, places and contexts. Validity means that the survey questions actually measure what they set out to measure. Reliability and validity can be affected by language, context and question type.
2.5.1 Designing good survey questions
The aim is to have questions which are as clear and straightforward as possible. Some tips for this include:
• Use simple language
• Keep concepts simple, or be prepared to explain a concept in simple language
• Make sure the task (i.e answering the questionnaire) is manageable
• Questions should relate to issues which are common knowledge within the target population
• Be aware of wording effects - even small changes in wording can shift answers (e.g not allow vs forbid)
• Question order - the general rule is to move from more general to more specific questions
Foddy (1993) and Pannell and Pannell (1999) give a more comprehensive discussion about designing good survey questions
2.5.2 Should you use open or closed questions?
An example of an “open-ended” (qualitative) question is:
“What assistance should the government of Vietnam provide for small-medium domestic livestock feed enterprises to help their competitiveness in the animal feed sector?”
A “closed question” that explores a part of this same issue might ask:
“The Government of Vietnam should provide subsidised credit to small-medium domestic livestock feed enterprises. Do you:
strongly agree 1 2 3 4 5 strongly disagree”
There has always been controversy over the value of collecting qualitative information from open-ended questions. Researchers disagree over whether this kind of information is useful. It has been argued that the open type of question fails to control what the respondent is supposed to be answering, that respondents wander from the topic, and that answers from different respondents cannot be meaningfully compared.
On the other hand, closed questions are said to impose a framework on respondents which may not be relevant to them, and fixed response options are said to force the respondent to adopt the researcher’s frame of reference even when it is not meaningful to them.
In general:
• The open-ended approach is good for exploratory work where you are trying to discover the range of ideas, feelings and reactions. You might then use this data to create a structured, quantifiable set of questions
• Sometimes you only have time to collect qualitative data (e.g. through a group discussion) and it might be all that is needed for the decision making you have to do on that issue
• The two types of information can be used as a validity check on each other
• It is possible to roughly classify responses to open-ended questions into themes where the sample is not too large and then do a rough quantification of them
• You can use specific quotes from open questions to complement your closed data in reports
2.5.3 If closed questions, which type of closed question format?
There are various types of scales and formats that can be used in closed questions, and there are advantages and disadvantages associated with each type. The main types are:
• Agree/disagree or yes/no. This is the simplest kind of question. It tends to be lower in validity than questions with a scale of responses because it forces an extreme or cut-and-dried response when in fact most of us are not clearly polarised on many issues. You should normally avoid agree/disagree or yes/no questions. If used, a “not sure/don’t know” option should be provided.
• Standard scales. In this approach, an example is given for each level in the scale. This can be useful for frequency of behaviours where each point is numerically defined. For example:
Normally, I eat pork (please tick one choice only):
every day
several times a week
about once a week
once or twice a month
several times a year
never
However, it is often difficult to develop scales with behavioural or attitudinal statements which accurately represent something.
• Behavioural checklists. These are quick and easy to administer and are understood by most people. A long list of items (on a particular topic) is gathered which is meant to distinguish between individuals on the behaviour/activity in question. The respondent ticks those items on the list that apply to him/her. It is useful to have a “not applicable” or “don’t know” option, so that every item on the list is answered. For example:
Please tick all the items that apply to you:
I sell livestock to local traders
I sell livestock to regional traders
I sell livestock on contract
I sell livestock direct to processors
I sell livestock direct to farm households
• Ranking methods. Ranking requires respondents to put items in order. It becomes too complex when the number of items is large. For example:
Rank the following 1 to 4 in order of your preference to eat:
2.5.4 Using Likert Scales
An example of a question/statement asked in the format of a Likert Scale is given below
The training workshop on least cost feed rations introduced ideas that will be useful in
• Two to four categories are not enough: responses to the four point scale (e.g strongly agree, agree, disagree, strongly disagree) have been found not to collapse down into a two point scale: almost one in five respondents who answered on the positive scale of the four point scale answered on the negative scale of the two point scale
• Seven to nine point scales can be used more reliably than scales with fewer points (i.e. with seven or nine points respondents give the same rating to the same or similar questions)
• Some multivariate statistical procedures only work properly when data has been generated using rating scales with six or more categories
Overall, seven to nine categories produce the most reliable and valid ratings
An important thing to remember is that it is possible for two different people to agree with a statement just as strongly as each other, but to give different ratings, such as one saying “agree” while the other says “strongly agree”. Objective empirical values are not attached to these scales. Even within one person’s rating of different questions, “agree” may mean something different in terms of intensity in different questions. The statistical treatment of this ordinal (as opposed to cardinal or interval-ratio) data requires special care.
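One cautious way to summarise ordinal ratings is to report the frequency of each category and the median category, rather than an arithmetic mean that treats the gaps between categories as equal. The sketch below assumes a five-point scale with hypothetical labels and invented responses.

```python
from collections import Counter
from statistics import median_low

# Hypothetical five-point scale, ordered from lowest to highest agreement.
LEVELS = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]


def summarise(responses):
    """Return the count in each category and the median category."""
    counts = Counter(responses)
    frequencies = {level: counts.get(level, 0) for level in LEVELS}
    # Use the position in the scale only for ordering, not as a measured value.
    ranks = [LEVELS.index(r) for r in responses]
    return frequencies, LEVELS[median_low(ranks)]


# Invented responses from five respondents.
responses = ["agree", "agree", "strongly agree", "disagree", "neutral"]
frequencies, median_category = summarise(responses)
print(frequencies)
print(median_category)
```

Both summaries depend only on the order of the categories, which is all that an ordinal scale provides.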
2.6 References in this chapter
Dillman, D.A. (2000). Mail and Internet Surveys: The Tailored Design Method. Wiley, New York.
Foddy, W. (1993). Constructing Questions for Interviews and Questionnaires. Gower, Cambridge.
Pannell, P.B. and Pannell, D.J. (1999). Introduction to Social Surveying: Pitfalls, Potential Problems and Preferred Practices. SEA Working Paper 99/04. URL: http://www.general.uwa.edu.au/u/dpannell/seameth3.htm
3 Data entry
This section contains the material from a course on database design presented by Donna Brennan in July 2008. The electronic course materials are found in the zip file “Course database and access forms.zip”.
3.1 Principles of database design
There are a number of benefits of using a database for data entry. These include:
• Automatic data checking can be embedded in the design to reduce entry error
• The entry forms can be set out to look like the questionnaire, making it easier for the people doing the data entry
• Clearly defined Microsoft Access tables will provide a clean set of data and avoid duplication of entry (or missing some entry) that can occur in spreadsheets
• Database designs make it easy to deal with data that has a different number of responses for each survey, for example details on crops where each household has a different number of crops. Instead of a whole lot of blank data, the database will only add as many records as are required
• Database designs make it easy to aggregate data into a form that can be used by statistical software
A glossary of terms used in database design is given in Table 4.1.
Table 4.1 Glossary of terms
Term          Definition
Primary key   A field in a table that uniquely defines one record
Record        A row in a database table (an observation)
Field         A column in a database table (a variable)
Table         Tables contain the data in the database
Query         Queries are used to perform calculations on tables in the database, to create new tables, and to organize the data in various ways
Form          Forms provide a user interface for entering data. They are built using reference to the underlying tables
3.2 Designing tables from survey questionnaires
If you are used to entering data in spreadsheets or in Stata you may be tempted to create a very wide table with rows representing observations (respondents) and columns representing variables. But surveys often involve filling out tables, which have their own rows and columns for each respondent. When designing tables in Access you can design ‘multi-dimensional tables’ that record data in tables for each respondent. These tables may not be the same size for each respondent. For example, the responses on household members or details about crops might vary according to the number of people or crops. Often when you are doing statistical analysis you want a summary of this data, rather than all of the detail. You can use queries to aggregate the information for later use.
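The idea of storing one record per respondent per item, and summarising it later, can be sketched in Stata as well. The example below is only an illustration: the variable names (hhid, cropcode, area_ha) and the values are invented, and collapse stands in for the Access aggregation query.

```stata
* Hypothetical crop table: one record per household per crop
clear
input hhid cropcode area_ha
1 1 0.5
1 2 0.3
2 1 1.2
3 1 0.4
3 3 0.2
3 4 0.1
end

* The pair (hhid, cropcode) acts as the key: check that it is unique
isid hhid cropcode

* Aggregate to one row per household for later statistical analysis
collapse (count) ncrops=cropcode (sum) total_area=area_ha, by(hhid)
list
```

Households with three crops contribute three records and households with one crop contribute one, so no blank columns are needed.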
3.2.1 Example of table design in IFPRI feedmill database
(see IFPRI Survey 200v.mdb)
Table fproc01 has the basic details about the firms and corresponds to page 1 of the questionnaire. There are 35 firms (records).
Table fproca1 has details from page 2, with all details entered as columns and 1 row per firm. Thus there are 35 records.
Table fproce1 has details from page 5, which is a table with many rows and columns. The data from this table can therefore be thought of as having 3 dimensions: for each firm, we need rows and columns.
Table fprocg1 also has multiple rows per firm, one row for each row of Table 8.
3.3 Practicing using queries
The following exercises demonstrate some of the querying features in Access. Some of the examples shown (such as reporting averages) are best done in Stata, because more detailed statistical analysis, such as ANOVA tests on the reported means, can be done there. But Access queries can be very efficient at performing intermediate calculations on the data, such as adding up across observations, grouping observations, etc. Once the calculations are done, the crosstab query can be used to put the data into the “row = observation, column = variable” format required by Stata.
3.3.1 Types of queries:
1 Joining two tables with matching key (like merge in Stata)
Using IFPRI livestock feed database
Create a query which combines location data in fproc01 and feed processing capacity in table fproca1
Save query
2 Aggregation (like collapse in Stata)
a Using the query made in step 1, answer the following questions
• What is the total processing capacity of all the firms surveyed in the sample, by province?
• What is the average processing capacity of all the firms surveyed in the sample (national)?
• How many firms were surveyed, by region?
3 Joining multiple tables with several matching keys (more powerful than Stata)
e.g. Combining information from 3 tables using 2 keys
We want to look at the raw material input data and compare the amount spent on different types of inputs. Open table Raw Material Inputs and Prices. See that the data is coded with a number representing the type of feed. We need to create a table that has the information about these feed codes.
See spreadsheet ‘some codes’. The data in sheet raw input codes can be selected and pasted into the table page of Access.
Now create a primary key (input code) by opening the table in design view.
Then make a query that combines the following information:
region qtyproc procpri inputexp Input Name Type Idcode
• Save the query
• Then make a new query that sums by type
• Then make a new query that sums total expenditure for the firm
• Then combine these two queries to work out the protein share of total expenditure
• Query this query to find the simple average of protein share for firms in each region
4 Obtaining a subset of the data (like keep or drop in Stata)
• Using Main products production and sales
• Make a query which only contains information about pig grower ration (code =14), and combine the regional data info, and calculate total sales revenue
• Save the query
• Now query the query to find the average revenue and production by region
5 Update queries
Create a field called animal in Main products production and sales.
Use an update query to put ‘pig’ if code<20, ‘poultry’ if code>20 and <30, ‘cattle’ if code>30.
6 Crosstab query
• Make a table that calculates the value of production and sales by type of product, summing over animal type. You must end up with unique rows per firm per animal
• Save it
• Now use a crosstab query to get sales revenue by firm, by type
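For readers who work mainly in Stata, the first two query types above (joining on a key and aggregating) can be sketched with merge and collapse. This is a toy stand-in rather than the course database: the table names fproc01 and fproca1 come from the exercises, but the variable names (idcode, region, capacity) and all values are assumptions made for the example.

```stata
* Toy stand-in for table fproc01 (firm location); values are invented
clear
input idcode str10 region
1 "North"
2 "North"
3 "South"
end
tempfile location
save `location'

* Toy stand-in for table fproca1 (processing capacity); values are invented
clear
input idcode capacity
1 5000
2 12000
3 80000
end

* 1. Join on the matching key (compare the Access join in query type 1)
merge 1:1 idcode using `location'
assert _merge == 3
drop _merge

* 2. Aggregate by region (compare the Access totals query in query type 2)
collapse (sum) total_capacity=capacity (mean) mean_capacity=capacity ///
    (count) nfirms=capacity, by(region)
list
```

The assert line stops the do-file if any firm fails to match, which is a cheap check that the key is complete in both tables.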
3.4 Designing a database for the CARD Livestock questionnaire
In this database we treat each major section as a separate table in the database, but where there are questions that involve filling out a separate table (e.g. Storage equipment B9) we establish a separate database table for that question.
The tables we develop will be:
Table                Questionnaire section        Pages
Batch information    Section D2,4,5,6,8,9,10      Page 9, and page 13
**** not yet done    Section E
Quality control      Section G except G5          P20, p22-23
1 Examine tables that are currently there
2 Create tables for section E
a How many tables?
b Do you need a lookup table?
c Create tables (format is very important: Text, Integer or Double?)
3 Do the same for sections H, I, J
Now you are ready to use the forms to create the survey input.
4 Data cleaning and analysis – techniques using Stata
4.1 Data cleaning
Once all the data has been entered into the database, it can be sent to Stata for statistical analysis. However, the first step is to examine the data carefully to check that there were no errors with data entry, such as misinterpreting the handwriting of the enumerator, mistaking the units of measurement, or simply typing in the wrong number. There may also be problems with the responses provided by the survey respondent, because of misunderstanding of the question, or a refusal to provide certain information, which will lead to missing observations. Thus, even after careful data entry, a data cleaning process is required.
Because of the commercial sensitivity of certain questions (especially cost of production and profit) the survey questionnaire was designed to provide several checks from which cost of production could be derived. For example, we asked detailed questions about the cost of feed material inputs so that in most cases we had good data on expenditure on feed materials. In the cost of production section we gave respondents the option of providing costs in terms of share of total cost, rather than actual costs per tonne, but we could infer total costs from the good data we had on feed costs. There were some parts of the survey where cross checking could be performed to examine the consistency in the responses between different sections. All of these calculations were done in hands-on training conducted at CAP over the period August to October 2008. Training was also provided in conducting and interpreting analysis of variance in Stata. Additional training was provided in Perth in August 2009. Training notes recorded by members of the CAP team (Pham Thi Lien Phuong and Nguyen Thi Thinh), in the form of annotated Stata do files, are provided in this section. Data needed for these analyses will be in the project database kept at CAP.
4.2 Creating Output Templates
Before commencing data analysis in Stata it is worthwhile to think about what data you wish to extract, and how you will present it. The data you need to extract will depend on the project goals, research questions, and also the questionnaire itself. A useful technique is to develop data output templates for each section of the questionnaire. Extracting useful data from survey data can be quite complex, as new variables often need to be created to provide data in a way that is relevant to the project research questions. Output templates were created for the CARD project, and some of these are shown in this section to give examples of how to think about extracting survey data.
One of the first issues we faced was deciding how we would define small, medium and large firms. We had sampled in five categories (<5,000 tonnes, 5,000 to 10,000 tonnes, 10,000 to 20,000 tonnes, 20,000 to 80,000 tonnes, and >80,000 tonnes). But these categories had been based on the supplied list of firms and their output. We needed to base our final production scale classification (small, medium, large) for analysis on the actual output of the surveyed firms. The following process was followed:
• Make one size class variable that corresponds to the sampling classification (based on reported capacity from the survey)
• Make a second size class variable based on <20,000, 20,000-80,000 and >80,000
• Make a third based on <80,000 and >80,000
• Look at some key indicators:
o Cost of production per tonne of factory output
o Average revenue per tonne of factory output
o Capital investment per tonne of factory output
• Then do an analysis of variance for different size classes to see how many statistically significant groups there are
After looking at the data in this way, the size classification to be used for analysis (based on the data) was decided upon. For various reasons (e.g. differences seen between various size classifications in key indicators, and maintaining sample sizes in the scale categories) the classification finally used was: small = <10,000 tonnes; medium = 10,000 to 60,000 tonnes; and large = >60,000 tonnes. Output templates were then designed.
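The classification and comparison steps described above might look roughly like this in Stata. It is a sketch on invented data: the variable names (output_t, cost_per_t, sizeclass) are assumptions, and only the cut-offs (10,000 and 60,000 tonnes) come from the text.

```stata
* Hypothetical firm data: annual output (tonnes) and cost per tonne
clear
input idcode output_t cost_per_t
1 3000 5200
2 8000 5100
3 9500 5150
4 15000 4900
5 40000 4800
6 55000 4850
7 70000 4600
8 90000 4500
9 120000 4550
end

* Final classification: small <10,000; medium 10,000 to 60,000; large >60,000
gen sizeclass = 1 if output_t < 10000
replace sizeclass = 2 if output_t >= 10000 & output_t <= 60000
replace sizeclass = 3 if output_t > 60000 & output_t < .
label define sizelbl 1 "Small" 2 "Medium" 3 "Large"
label values sizeclass sizelbl

* One-way ANOVA of a key indicator across size classes, with pairwise tests
oneway cost_per_t sizeclass, tabulate bonferroni
```

Repeating the last command with an alternative size class variable is one way to compare candidate classifications on the same indicator.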
A straightforward template was created for data related to quality control testing (e.g. Table 5.1). These data could be extracted directly from the questionnaire data, but this is not always the case.
Table 5.1 Output template for Section G of the CARD project questionnaire – Product quality control including raw materials (small, medium and large firms and overall)
Formal certification
    HACCP               % of firms saying yes
                        % of firms saying yes
Place for testing
    Own laboratory      % of firms saying yes
    Others list**       % of firms saying yes
* If most ‘other’ are the same, list the type given
** We need to classify the other places into simpler categories: e.g. MARD institutes, University, SOE, private enterprise, etc.
Another way to think about the data needed (rather than drawing up a table) is just to provide a list, as in the following example of data needed from the survey questions on assets.
By region (north, south); and by production size, extract the following data on assets:
• Land area, total: mean, sd
• Land area, used: mean, sd
• Land ownership status: number in each group
• Land value per m2, owned: mean, sd if >0
• Land value per m2, rented: mean, sd if >0
• Number of firms having enough area to expand, or not: number yes, no
• Production capacity, tonnes per hr: mean, sd
• Production 2007, tonnes: mean, sd
• Number of production chains: mean, sd
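A list like this maps fairly directly onto tabstat. The sketch below uses invented data and assumed variable names (region, sizeclass, land_total, land_value_owned, capacity_tph) to show means and standard deviations by region and size, including the “if >0” restriction on land value.

```stata
* Hypothetical asset data for eight firms
clear
input idcode str5 region sizeclass land_total land_value_owned capacity_tph
1 "North" 1 2000 0 5
2 "North" 1 3500 1.2 8
3 "North" 2 10000 1.5 20
4 "North" 2 8000 0 15
5 "South" 1 1500 2.0 4
6 "South" 1 2500 1.8 6
7 "South" 2 12000 0 25
8 "South" 2 9000 2.2 18
end

* Mean and sd by production size, separately for each region
bysort region: tabstat land_total capacity_tph, by(sizeclass) statistics(mean sd n)

* Land value of owned land: mean and sd only where a value above 0 was reported
tabstat land_value_owned if land_value_owned > 0, by(region) statistics(mean sd n)
```

Including n in the statistics makes it easy to see when a cell is based on very few firms.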
In some cases quite a lot of data manipulation needs to be carried out to obtain useful data from the survey answers. This is when output templates are most useful, to help clarify exactly what data is needed and how it can be extracted from the survey data. For example, the following template (Table 5.2) and instructions were created to get cost of production data from Section C.1 of the survey. Another example is supplied for survey data related to the products produced by firms (Table 5.3). The object of the templates is always to provide guidance on how to extract and present data from the survey questions in a way that will be useful for analysis addressing the research questions.
Table 5.2 Main elements of costs of production (in VND/tonne output) for small, medium and large firms and overall
Instructions:
1 Examine the data and see what is missing
2 Create variables corresponding to each of the rows in the table below, and calculate cost, and then cost per tonne of output (using output data)
3 Note that if we don’t have a value for total cost and only have the % in each category, we can use the total raw material costs reported in section E to scale the costs of production
4 Summarise these as mean percentages/values for small, medium and large firms and overall (as shown below for small and medium firms)
Costs                     Small firms                              Medium firms
                          Percent   Mean cost (VND/tonne output)   Percent   Mean cost (VND/tonne output)
Raw material purchases
Labor
Electricity and Fuel
Repair and Maintenance
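The scaling in instruction 3 is simple arithmetic: if raw materials cost 4,000,000 VND per tonne and the respondent says they are 80% of total cost, the implied total is 4,000,000 / 0.80 = 5,000,000 VND per tonne, and a 5% labour share is then 250,000 VND per tonne. A minimal Stata sketch follows; all variable names and values are invented for illustration.

```stata
* Hypothetical firms reporting cost shares (%) but not total cost
clear
input idcode rawmat_cost_pt pct_rawmat pct_labor pct_energy pct_repair
1 4000000 80 5 10 5
2 3600000 90 4 4 2
end

* Total cost per tonne implied by the raw material cost and its reported share
gen double total_cost_pt = rawmat_cost_pt / (pct_rawmat/100)

* Scale each remaining share into a cost per tonne
foreach item in labor energy repair {
    gen double `item'_cost_pt = total_cost_pt * pct_`item'/100
}
list
```

This only works where the raw material share is reported and is greater than zero, so those cases should be checked before the division.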
Table 5.3 Output template for products produced by firms (from Section D of the questionnaire on factory output)
Instructions: Create variables for each company as in the table below, and then summarise mean values in the table, grouping firms by size class and ownership class
- Practice commands in Stata in a more flexible way
- Commands themselves are not so complicated!
- But more importantly: when to use them, and what to do with them
+ Old commands: gen, egen, recode, collapse, etc.
+ New ones: global … foreach { } → create a loop over all the variables listed (requires the same head/foot) → no need to repeat the command for each variable generated
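As a small illustration of the global and foreach combination, the sketch below stores a list of variables in a global macro and loops over it. The data are invented; the variable names follow the var_d1_ pattern used in the exercises, and the new used_ variables are hypothetical.

```stata
* Hypothetical data with three ingredient columns
clear
input idcode var_d1_e var_d1_f var_d1_g
1 10 . 5
2 . 2 .
3 4 1 6
end

* Global macro holding the variable list
global mixvars var_d1_e var_d1_f var_d1_g

* One loop replaces three sets of near-identical commands
foreach v of global mixvars {
    replace `v' = 0 if missing(`v')
    gen used_`v' = `v' > 0
}
summarize used_*
```

Because the variables share a prefix, foreach v of varlist var_d1_* would give the same loop without defining the global first.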
4.3.2 Exercises
A Feed fed for sow and slaughtering pigs by types: see Table 5.4
- By feed types: % households use
- By stages of production: sow, piglet, porker
Table 5.4 Diet fed to the pigs
Diet fed for: Type of pigs | a. Number of days | b. End weight (kg/hd)
Feed per pig: c. Complete (kg/day) | d. HM mix* (kg/day)
HM mix - ingredients used per batch (kg/ratio/%): e. Conc | f. Premix | g. Corn | h. Rice | i. Other | j. Other | k. Code*
* Code: 1 = kg, 2 = ratio, 3 = %
Exercise A1: % of households (hh) using complete feed, complete and mix, and mix only, for different production stages:
use Sec_13_D1_check.dta, clear
*** Use * instead of repeating variables (in this case those starting with var_d1)
recode var_d1* (.=0)
*** -> variables from var_d1_a to var_d1_l will be changed
*** In case there is no info in column d (mix), generate totalmix as the sum of columns e to l
egen totalmix=rowtotal(var_d1_e-var_d1_l)
** Now flag use of complete feed, complete and mix, and only mix:
gen complete=1 if var_d1_c!=. & var_d1_c!=0
recode complete .=0
gen mix=1 if var_d1_d!=. & var_d1_d!=0
recode mix .=0
** in some cases where total mix is greater than 0, fix mix again:
recode mix 0=1 if totalmix>0
*** TO SEE HOW MANY USE MIXED FEED BUT NOT HAVE INFO ON TOTAL MIX (DETAILS)
** > ADD THEM INTO THOSE WITH MIXED FEED AS WELL
br idcode if var_d1_d >0 & totalmix ==0
gen mix_missing = 1 if var_d1_d >0 & totalmix ==0
recode mix_missing .=1 if var_d1_d ==0 & totalmix >0
count if var_d1_d >0 & totalmix >0
*** > now fix already!!
save temp.dta, replace
*** collapse data by household (overall)
collapse (sum) complete mix, by(idcode)
*** Create variables for households using complete only, combine complete and mix and finally those using mixed feed only:
gen completeonly =1 if complete>0 & mix==0
sort idcode
save feedtype_use_intotal.dta, replace
*** Collapse data by household for each type of stage
*** By sows:
use temp, clear
collapse (sum) complete mix if rownum>=1 & rownum<=4, by(idcode)
gen sow_completeonly =1 if complete>0 & mix==0
save feedtype_use_sow.dta, replace
**** Exercise: Do the same for piglet and porker stages
*** Merging all files together
save in_total_feed_use.dta, replace
*** Result 1: % of households using feed by complete & mixed feed at different stages by region and production scale: