SCHOOL OF ADVANCED EDUCATIONAL PROGRAMS
NATIONAL ECONOMICS UNIVERSITY
SUBJECT: BUSINESS STATISTICS
GROUP MID-TERM ASSIGNMENT
TOPIC: SAMPLING DISTRIBUTION AND ESTIMATION
Group 3:
Class: Advanced Finance 63D
Assoc. Prof. Tran Thi Bich
HANOI, 2023
TABLE OF CONTENTS
PART 1: INTRODUCTION
1.1 Definition of Sampling distribution and The Central Limit Theorem
1.1.1 Sampling Distribution
1.1.2 The Central Limit Theorem
1.2 Definition of Estimation
1.3 Applications of Sampling Distribution and Estimation
1.3.1 Sampling Distribution
1.3.2 Estimation
PART 2: ARTICLE AND SOURCES ANALYSIS
2.1 Article Summary
2.1.1 General Info
2.1.2 Objective & Purpose of the article
2.1.3 Methodology
2.1.4 Main Results
2.2 Techniques used in the article
2.2.1 Sample size
2.2.2 Sampling error
2.2.3 Simple random sampling
2.2.4 Confidence level
2.2.5 The Adjusted Wald, Jeffreys and Wilson Interval Methods
2.2.6 Finite Population Correction (FPC)
2.2.7 The Maximum likelihood estimate, Laplace, Jeffreys and Wilson Point Estimators
2.3 Additional Source
2.3.1 Methods of sample size calculation in descriptive retrospective burden of illness studies
2.3.2 COVID-19 prevalence estimation by random sampling in population - optimal sample pooling under varying assumptions about true prevalence
2.4 Conclusions
PART 3: DATA ANALYSIS
3.1 General Info
3.2 Survey Questions
3.2.1 Level of trust in COVID-19 vaccination (TR)
3.2.2 Level of Perceiving COVID-19 Risk (PRC)
3.2.3 Level of COVID-19 Vaccine Perception (PV)
3.2.4 Level of COVID-19 Vaccination Intention (INT)
3.3 Data Analysis
3.3.1 Sampling distribution
3.3.2 Estimation
3.3.3 Correlation
3.4 Conclusion
3.5 Recommendation
3.5.1 The Government
3.5.2 Vaccine import Firms
3.5.3 The Media & Telecommunication
3.5.4 Citizens
REFERENCES
PART 1: INTRODUCTION
The probability distribution of a given statistic is estimated based on a random sample. The estimator is the general mathematical rule used to calculate sample statistics. It is used to calculate statistics for a provided sample and helps remove unpredictability when conducting research or collecting statistical data.
1.1 Definition of Sampling distribution and The Central Limit Theorem
1.1.1 Sampling Distribution
Sampling distribution is a probability distribution of a statistic based on data from multiple samples within a specific population. Since analyzing the entire population is impractical, its main purpose is to provide representative results from small samples of a larger population. The sampling distribution of a population refers to the frequency distribution of the range of potential outcomes that could occur for a statistic of that population.
There are primarily three types of sampling distribution:
- Sampling distribution of mean
The mean of the sampling distribution of the mean is the mean of the population from which the scores were sampled. The graph will show a normal distribution, and its center will be the mean of the sampling distribution, which is the mean of the entire population.
- Sampling distribution of proportion
The sampling distribution of proportion measures the proportion of success, i.e. the chance of occurrence of a certain event, by dividing the number of successes by the sample size n. The mean of all the sample proportions calculated from each sample group becomes the proportion of the entire population.
- T - distribution
The t-distribution is used for estimating population parameters when sample sizes are small or variances are unknown. It is used to estimate the mean of the population, confidence intervals, statistical differences, and linear regression.
The sampling distribution is influenced by several factors, such as the statistic, sample size, sampling process, and the overall population. It is used to calculate statistics like means, ranges, variances, and standard deviations for the sample at hand.
Sample size and normality: If X has a distribution that is:
+ Normal, then the sample mean has a normal distribution for all sample sizes
+ Close to normal, the approximation is good for small sample sizes
+ Far from normal, the approximation requires larger sample sizes
For a sample size of more than 30, the sampling distribution formulas are given by:
μx̄ = μ and σx̄ = σ / √n
+ The mean of the sampling distribution and of the population are represented by μx̄ and μ
+ The standard deviation of the sampling distribution and of the population are represented by σx̄ and σ
+ The sample size of more than 30 is represented by n
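As a quick illustration (not from the article; the population values below are hypothetical), the following Python sketch draws repeated samples of size n = 36 and compares the mean and standard deviation of the sample means with μ and σ/√n:

# A minimal simulation sketch with hypothetical population parameters.
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=50, scale=12, size=100_000)  # hypothetical population

n = 36                      # sample size (> 30)
num_samples = 5_000         # number of repeated samples
sample_means = np.array([
    rng.choice(population, size=n, replace=False).mean()
    for _ in range(num_samples)
])

print("Population mean:", population.mean())
print("Mean of sample means:", sample_means.mean())            # close to mu
print("Theoretical standard error:", population.std() / np.sqrt(n))
print("Std dev of sample means:", sample_means.std())          # close to sigma / sqrt(n)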
1.1.2 The Central Limit Theorem
The Central Limit Theorem (CLT) states that the distribution of sample means approaches a normal distribution as the sample size increases, regardless of the population's distribution.
In practice, sample sizes equal to or greater than 30 are commonly considered sufficient for the CLT to be applicable. A fundamental feature of the CLT is that the average of the sample means and standard deviations will equal the population mean and standard deviation.
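The short sketch below (hypothetical values, not from the article) illustrates this: sample means drawn from a strongly skewed exponential population become approximately normal, with skewness shrinking toward zero, once n reaches about 30.

# A minimal CLT sketch with a hypothetical, heavily right-skewed population.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
population = rng.exponential(scale=2.0, size=200_000)   # strongly right-skewed

for n in (5, 30, 100):
    means = rng.choice(population, size=(3_000, n)).mean(axis=1)
    print(f"n = {n:>3}: skewness of sample means = {stats.skew(means):.3f}")
# Skewness shrinks toward 0 (a normal shape) as the sample size grows.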
1.2 Definition of Estimation
Estimation in statistics refers to any procedure used to calculate the value of a population parameter from observations in a sample drawn from that population. There are two types of estimation: point estimation and interval estimation. The purpose of estimation is to find the approximate value of a parameter of the population by calculating statistics based on samples taken from that population.
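As a small illustration (the measurements are hypothetical), the Python sketch below computes both a point estimate and a 95% interval estimate of a population mean from one sample:

# A minimal sketch contrasting a point estimate and an interval estimate.
import math

sample = [9.8, 10.1, 10.4, 9.9, 10.6, 10.2, 9.7, 10.3]   # hypothetical measurements
n = len(sample)
mean = sum(sample) / n                                     # point estimate of mu
s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))

t_crit = 2.365          # t value for 95% confidence, df = 7 (from a t table)
margin = t_crit * s / math.sqrt(n)
print(f"Point estimate: {mean:.2f}")
print(f"95% interval estimate: ({mean - margin:.2f}, {mean + margin:.2f})")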
1.3 Applications of Sampling Distribution and Estimation
1.3.1 Sampling Distribution
- Sampling distribution may be used to represent the size of particles produced by grinding, milling, and crushing.
- In Economics, one may collect a simple random sample of 50 individuals in a town and use the average annual income of the individuals in the sample to estimate the average annual income of individuals in the entire town. If the average annual income of the individuals in the sample is $58,000, then the best guess for the actual average annual income of the town will be $58,000.
- In Biology, biologists may measure the height of 30 randomly selected plants and then use the sample mean height to estimate the population mean height. If the biologist finds that the sample mean height of the 30 plants is 10.3 inches, then her best guess for the population mean height will also be 10.3 inches.
- In Surveys, the HR department of a company may randomly select 50 employees to take a survey that assesses their overall satisfaction on a scale of 1 to 10. If the average satisfaction among employees in the survey is 8.5, then the best guess for the average satisfaction rating of all employees at the company is also 8.5.
1.3.2 Estimation
- In engineering, for any specified rock slope failure mode (such as plane shear, step path, or wedge), the safety factor (SF) equation can be applied to the point estimation method to provide accurate estimates of the mean and standard deviation of the SF probability distribution.
- In calculating labor costs, an estimation model can help to count hours on the right day to compute labor costs, which can be particularly challenging for night shift workers.
- In the construction and contracting industry, people have either designed estimation apps for clients or programmed software to approximately calculate the volume and cost of flooring (concrete, wood, carpet) or drywall.
- In practical mathematics, estimation can be applied to time management, future budgeting, and algebra.
- The CLT is helpful in finance when examining a large portfolio of securities to estimate returns, risks, and correlations.
PART 2: ARTICLE AND SOURCES ANALYSIS
2.1 Article Summary
2.1.1 General Info
The article "The use of random sampling in investigations involving child abuse material" by Brian Jones, Syd Pleno, and Michael Wilkinson (2012) addresses two ubiquitous problems of practicing digital forensics in law enforcement: the ever-increasing data volume and staff exposure to disturbing material. It discusses how the New South Wales Police Force, State Electronic Evidence Branch (SEEB) has implemented a "Discovery Process". By randomly sampling files and applying statistical estimation to the results, the branch has been able to reduce backlogs from three months to 24 hours. The process has the added advantage of reducing staff exposure to child abuse material and providing the courts with an easily interpreted report.
Figure 1 SEEB Investigation July 2010 to June 2011
2.1.2 Objective & Purpose of the article
In order to maximize its capacity to offer timely support to significant criminal investigations, SEEB was forced to develop and execute a variety of systems in response to the growing demand for digital forensic support. Investigations into the possession of child abuse material (CAM) are one area with high demand. In the past, it was the job of SEEB analysts to detect any images, writings, or films that included CAM.
A crucial limitation of chart review is the requirement to inexpensively obtain a sufficiently large sample. Researchers need to consider how valuable the data is expected to be. Given the absence of validated techniques for predetermined sample size calculation, the goal of this study is to offer a satisfactory methodology for sample size computation.
2.1.3 Methodology
A sample is an accurate representation of the population. In this situation, quantitative information about that population is expected to be gained from analyzing the sample. "A sample is representative if the statistics computed from it accurately reflect the corresponding population parameters" (De Veaux et al., 2009). The population must be randomly sorted in order to allow for the above-mentioned estimation, and a minimum required number of items must be presented: the sample set. This can be modeled as "simple random sampling" in statistics.
By assigning a random number to each file, we can sort the files and choose a sample from the first n files to calculate the ratio of files on a digital storage device that include CAM to those that do not.
2.1.4 Main Results
Table 1. Test results. Constraints: CL 99%, CI < 5%; population 52,061; sample 8,388; actual items of interest 1,527.
Table 2. Average test results.
2.2 Techniques used in the article
2.2.1 Sample size
Due to the law of large numbers and the Central Limit Theorem, in order to achieve a confidence level of 99%, a maximum of approximately 10,000 files (when using the Discovery statistical constraints) have to be viewed, irrespective of the population size (Yamane, 1967).
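One way to see why the required sample plateaus is Yamane's (1967) formula n = N / (1 + N·e²); the sketch below assumes a precision of e = 0.01 for illustration, which caps the sample at 1/e² = 10,000 files regardless of population size (the article's exact Discovery constraints are not restated here).

# A minimal sketch of Yamane's formula; e = 0.01 is an assumed illustrative precision.
def yamane_sample_size(population_size: int, e: float = 0.01) -> int:
    return round(population_size / (1 + population_size * e ** 2))

for N in (50_000, 500_000, 5_000_000, 50_000_000):
    print(f"population {N:>10,} -> sample {yamane_sample_size(N):,}")
# The required sample approaches 1 / e**2 = 10,000 as the population grows.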
2.2.2 Sampling error
In this case we are only concerned with sampling errors, as non-sampling errors are minimal due to the controlled environment and the uniformity of the population.
2.2.3 Simple random sampling
Sampling is defined in ISO/IEC 17025 as:
"A defined procedure whereby a part of a substance, material or product is taken to provide for testing or calibration of a representative sample of the whole."
To allow for the estimation, the population must be randomly sorted and a minimum recommended number of items are to be presented: the sample set. In statistics this can be modeled as "simple random sampling".
"With simple random sampling, each member of the sampling frame has an equal chance of selection and each possible sample of a given size has an equal chance of being selected. Every member of the sampling frame is numbered sequentially and a random selection process is applied to the numbers." (McLennan, 1999)
For the purposes of determining the ratio of files containing CAM to files not containing CAM on a digital storage device, simple random sampling is relatively straightforward. By assigning a random number to each file, the files can then be sorted and the sample selected from the first n files.
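The sketch below (the file names are hypothetical; the population and sample sizes are taken from Table 1) shows this selection step in Python:

# A minimal sketch: assign a random number to each file, sort, take the first n.
import random

files = [f"file_{i:05d}.jpg" for i in range(52_061)]   # hypothetical file population
n = 8_388                                              # sample size from Table 1

random.seed(42)
keyed = [(random.random(), f) for f in files]          # random number per file
keyed.sort()                                           # randomly sorted population
sample = [f for _, f in keyed[:n]]                     # first n files form the sample
print(len(sample), "files selected for review")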
2.2.4 Confidence level
This value indicates the reliability of the estimate. If the confidence level is 95%, this implies that if 100 samples were taken, approximately 95 of the resulting confidence intervals would contain the true population value (Stewart, 2011).
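A small simulation (the proportion and sample size are hypothetical) makes this interpretation concrete: roughly 95 out of 100 intervals built this way contain the true value.

# A minimal coverage sketch for a 95% confidence level, with hypothetical numbers.
import numpy as np

rng = np.random.default_rng(7)
true_p, n, z = 0.30, 400, 1.96
covered = 0
for _ in range(100):
    s = rng.binomial(n, true_p)          # simulated number of "successes" in a sample
    p_hat = s / n
    half = z * np.sqrt(p_hat * (1 - p_hat) / n)
    covered += (p_hat - half) <= true_p <= (p_hat + half)
print(f"{covered} of 100 intervals contained the true proportion")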
2.2.5 The Adjusted Wald, Jeffreys and Wilson Interval Methods
To calculate the estimate, the first thing required is the interval estimate of the proportion, or the confidence interval. The standard formula for calculating the confidence interval is known in most introductory statistics textbooks as the Wald interval.
As the standard interval has been shown to be inconsistent and unreliable in many circumstances, several sources recommend the Adjusted Wald as a more reliable interval calculation. The Adjusted Wald interval is defined as:
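In its commonly cited (Agresti-Coull) form, the Adjusted Wald interval replaces the sample proportion with an adjusted proportion p̃; the exact variant used in the article may differ:

p̃ = (s + z²/2) / (n + z²), with interval p̃ ± z·√( p̃(1 − p̃) / (n + z²) )

where s is the number of items of interest in the sample, n is the sample size, and z is the critical value for the chosen confidence level.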
The Wilson and Jeffreys Bayesian interval methods were also used in conducting simulations and testing, in order to compare the methods and find the most reliable one giving acceptably conservative estimates; these were selected for use in the SEEB sampling process.
2.2.6 Finite Population Correction (FPC)
If the population is known and the sample is greater than 5% of the population, the finite population correction factor can be used. The FPC is included in the estimate calculation and usually results in a narrower margin for the estimate, without affecting the reliability. The formula for the FPC is:
FPC = √((N − n) / (N − 1))
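A brief numeric sketch (using the population and sample figures from Table 1; the margin calculation itself is an illustrative assumption, not reproduced from the article) shows the narrowing effect of the FPC:

# A minimal sketch of the FPC applied to the margin of a proportion estimate.
import math

N, n, s, z = 52_061, 8_388, 1_527, 2.576      # population, sample, items of interest, 99% CL
p_hat = s / n
se = math.sqrt(p_hat * (1 - p_hat) / n)
fpc = math.sqrt((N - n) / (N - 1))            # finite population correction factor

print(f"Margin without FPC: {z * se:.4%}")
print(f"Margin with FPC:    {z * se * fpc:.4%}")   # narrower, as noted above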
2.2.7 The Maximum likelihood estimate, Laplace, Jeffreys and Wilson Point Estimators
The point estimate is the ratio of the selection in relation to the sample, or s/n. It is the value that is used in the final estimate calculation. There are four main binomial point estimators: the Maximum Likelihood Estimate (MLE), Laplace, Wilson, and Jeffreys (Böhning and Viwatwongkasem, 2005).
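The sketch below evaluates the four estimators in their standard textbook forms (the article's exact variants may differ), using the sample figures from Table 1:

# A minimal sketch of the four binomial point estimators in their standard forms.
s, n, z = 1_527, 8_388, 2.576                 # items of interest, sample size, 99% CL

estimators = {
    "MLE":      s / n,
    "Laplace":  (s + 1) / (n + 2),
    "Jeffreys": (s + 0.5) / (n + 1),
    "Wilson":   (s + z ** 2 / 2) / (n + z ** 2),
}
for name, p in estimators.items():
    print(f"{name:<8} point estimate: {p:.4f}")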
2.3 Additional Source
2.3.1 Methods of sample size calculation in descriptive retrospective burden of illness studies
An objective of this study was to propose rigorous methodologies for sample size computation in light of the lack of verified methods for a priori sample size estimation.
This study offers formal guidelines for calculating sample sizes for studies of the retrospective burden of disease. Pharmacoepidemiology uses observational burden of illness research to achieve a number of goals, including contextualizing the present treatment environment, highlighting significant treatment gaps, and offering estimates to parameterize economic models. The goal of this study was to create suggested sample size formulas for use in such studies.
- Technique
+ Cost estimate – Bottom-up sampling
Estimating work effort in agile projects is fundamentally different from traditional methods of estimation. The traditional approach is to estimate using a "bottom-up" technique: detail out all requirements, estimate each task needed to complete those requirements in hours/days, then use this data to develop the project schedule.
This technique is used when the requirements are known at a discrete level, where the smaller work pieces are then aggregated to estimate the entire project. It is usually used when the information is only known in smaller pieces.
+ Retrospective chart review
The retrospective chart review (RCR), also known as a medical record review, is a type of research design in which pre-recorded, patient-centered data are used to answer one or more research questions. The data used in such reviews exist in many forms: electronic databases, results from diagnostic tests, and notes from health service providers, to mention a few. RCR is a popular methodology widely applied in many healthcare-based disciplines such as epidemiology, quality assessment, professional education and residency training, inpatient care, and clinical research (cf. Gearing et al.), and valuable information may be gathered from study results to direct subsequent prospective studies.
2.3.2 COVID-19 prevalence estimation by random sampling in population - optimal sample pooling under varying assumptions about true prevalence
A rough estimate of the illness burden in a population is the sum of confirmed COVID-19 cases divided by the size of the population. However, a number of studies indicate that a sizable portion of cases typically go unreported and that this percentage varies greatly with the level of sampling and the various test criteria employed in different jurisdictions.
This research aims to examine how the number of samples used in the experiment and the degree of sample pooling affect the accuracy of prevalence estimates, and the possibility of reducing the number of tests necessary to obtain individual-level diagnostic results.
- Technique
Estimates of the true prevalence of COVID-19 in a population can be made by random sampling and pooling of RT-PCR tests. These estimates do not take into account follow-up testing on sub-pools that enables patient-level diagnosis; they are based only on the initial pooled tests. The precision of the pooled test estimates would be closer to that of testing individually if the results from these samples were taken into account.
This article started by generating a population of 500,000 individuals and then letting each individual have a probability of being infected at sampling time. The number of patient samples collected from the population is denoted by n, and the number of patient samples that are pooled into a single well is denoted by k. The total number of pools is thus n/k, hereby called m. The number of positive pools in an experiment is termed x.
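The following Python sketch mimics this setup using the article's notation; the estimator p̂ = 1 − (1 − x/m)^(1/k) is a standard pooled-prevalence formula assumed here for illustration and may differ from the article's exact procedure.

# A minimal pooled-sampling sketch: n samples, pools of size k, m = n/k pools, x positive pools.
import numpy as np

rng = np.random.default_rng(3)
true_prevalence = 0.01                                # assumed for illustration
population = rng.random(500_000) < true_prevalence    # infected = True

n, k = 5_000, 10                                      # samples collected, pool size
m = n // k                                            # number of pools
samples = rng.choice(population, size=n, replace=False)
pools = samples.reshape(m, k)
x = int(pools.any(axis=1).sum())                      # positive pools

p_hat = 1 - (1 - x / m) ** (1 / k)                    # standard pooled-prevalence estimate
print(f"positive pools x = {x} of m = {m}, estimated prevalence = {p_hat:.4f}")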
2.4 Conclusions
In conclusion, this study represents a formal guide to sample size calculations for the search for CAM files. SEEB has been able to significantly reduce the exposure of its staff and police investigators to disturbing child abuse material. It has also significantly reduced backlogs, enabled investigators to establish the extent of their investigation in a short timeframe, and provided the courts with a clear record of the quantity and severity of CAM on a device.