SCHOOL OF ADVANCED EDUCATIONAL PROGRAMS
NATIONAL ECONOMICS UNIVERSITY

SUBJECT: BUSINESS STATISTICS
GROUP MID-TERM ASSIGNMENT
TOPIC: SAMPLING DISTRIBUTION AND ESTIMATION

Group 3:
Phạm Thị Hương Trà : 11203958
Assoc. Prof. Tran Thi Bich

HANOI, 2023
TABLE OF CONTENTS
PART 1: INTRODUCTION
1.1 Definition of Sampling Distribution and The Central Limit Theorem
1.1.1 Sampling Distribution
1.1.2 The Central Limit Theorem
1.2 Definition of Estimation
1.3 Applications of Sampling Distribution and Estimation
1.3.1 Sampling Distribution
1.3.2 Estimation
PART 2: ARTICLE AND SOURCES ANALYSIS
2.1 Article Summary
2.1.1 General Info
2.1.2 Objective & Purpose of the article
2.1.3 Methodology
2.1.4 Main Results
2.2 Techniques used in the article
2.2.1 Sample size
2.2.2 Sampling error
2.2.3 Simple random sampling
2.2.4 Confidence level
2.2.5 The Adjusted Wald, Jeffreys and Wilson Interval Methods
2.2.6 Finite Population Correction (FPC)
2.2.7 The Maximum likelihood estimate, Laplace, Jeffreys and Wilson Point Estimators
2.3 Additional Source
2.3.1 Methods of sample size calculation in descriptive retrospective burden of illness studies
2.3.2 COVID-19 prevalence estimation by random sampling in population - optimal sample pooling under varying assumptions about true prevalence
2.4 Conclusions
PART 3: DATA ANALYSIS
3.1 General Info
3.2 Survey Questions
3.2.1 Level of trust in COVID-19 vaccination (TR)
3.2.2 Level of Perceiving COVID-19 Risk (PRC)
3.2.3 Level of COVID-19 Vaccine Perception (PV)
3.2.4 Level of COVID-19 Vaccination Intention (INT)
3.3 Data Analysis
3.3.1 Sampling distribution
3.3.2 Estimation
3.3.3 Correlation
3.4 Conclusion
3.5 Recommendation
3.5.1 The Government
3.5.2 Vaccine import Firms
3.5.3 The Media & Telecommunication
3.5.4 Citizen
REFERENCES
PART 1: INTRODUCTION
The probability distribution of a given statistic is estimated based on a random sample. The estimator is the generalized mathematical rule used to calculate sample statistics. It is used to compute statistics for a given sample and helps remove unpredictability when conducting research or collecting statistical data.
1.1 Definition of Sampling distribution and The Central Limit Theorem
1.1.1 Sampling Distribution
Sampling distribution is a probability distribution of a statistic based on data from multiple samples drawn from a specific population. Since analyzing the entire population is impractical, its main purpose is to provide representative results from small samples of a larger population. The sampling distribution of a population refers to the frequency distribution of the range of potential outcomes that could occur for a statistic of that population.
There are primarily three types of sampling distribution:
- Sampling distribution of mean
The mean of the sampling distribution of the mean is the mean of the population from which the scores were sampled. The graph shows a normal distribution, and its center is the mean of the sampling distribution, which is the mean of the entire population.
- Sampling distribution of proportion
The sampling distribution of proportion measures the proportion of successes, i.e. the chance of occurrence of certain events, obtained by dividing the number of successes by the sample size n. The mean of all the sample proportions calculated from each sample group is the proportion of the entire population.
- T-distribution
The t-distribution is used for estimating population parameters when sample sizes are small or variances are unknown. It is used to estimate the population mean, confidence intervals, statistical differences, and linear regression coefficients.
The sampling distribution is influenced by several factors, such as the statistic, the sample size, the sampling process, and the overall population. It is used to calculate statistics like means, ranges, variances, and standard deviations for the sample at hand.
Sample size and normality: if X has a distribution that is:
+ Normal, then the sample mean has a normal distribution for all sample sizes
+ Close to normal, then the approximation is good even for small sample sizes
+ Far from normal, then the approximation requires larger sample sizes
For a sample size of more than 30, the sampling distribution of the mean is given by:

μx̄ = μ and σx̄ = σ/√n

where:
+ μx̄ and μ are the means of the sampling distribution and of the population
+ σx̄ and σ are the standard deviations of the sampling distribution and of the population
+ n is the sample size (n > 30)
1.1.2 The Central Limit Theorem
The Central Limit Theorem (CLT) states that the distribution of sample means approaches a normal distribution as the sample size increases, regardless of the population's distribution.
In practice, sample sizes equal to or greater than 30 are commonly considered sufficient for the CLT to apply. A fundamental feature of the CLT is that the mean of the sample means equals the population mean, while the standard deviation of the sample means (the standard error) equals the population standard deviation divided by the square root of the sample size.
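A short simulation illustrates this behavior (a sketch using Python's standard library; the skewed exponential population and the sample size of 40 are arbitrary illustrative choices):

```python
import random
import statistics

random.seed(42)

# Population: exponential with mean 1 and standard deviation 1,
# deliberately far from normal (strongly right-skewed).
n = 40          # size of each sample (>= 30, so the CLT should apply)
trials = 2000   # number of samples drawn

# Draw many samples and record each sample's mean.
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(trials)
]

# The sample means center on the population mean (1.0)...
print(round(statistics.fmean(sample_means), 2))
# ...and their spread is close to sigma / sqrt(n) = 1 / sqrt(40), about 0.16.
print(round(statistics.stdev(sample_means), 2))
```

A histogram of `sample_means` would look approximately bell-shaped even though the underlying population is heavily skewed.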
1.2 Definition of Estimation
Estimation in statistics refers to any procedure used to calculate the value of a population parameter from observations in a sample drawn from that population. There are two types of estimation: point estimation and interval estimation. The purpose of estimation is to find the approximate value of a parameter for the population by calculating statistics based on samples taken from that population.
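As a minimal sketch of the two types (the data below are hypothetical, and the normal critical value 1.96 is used for simplicity, even though a t-value would be more accurate for only ten observations):

```python
import statistics

# Hypothetical sample of 10 measurements drawn from a larger population.
sample = [10.1, 9.8, 10.4, 10.0, 10.6, 9.7, 10.2, 10.3, 9.9, 10.5]
n = len(sample)

# Point estimate: a single value proposed for the population mean.
point_estimate = statistics.fmean(sample)

# Interval estimate: a range likely to contain the population mean,
# here mean +/- z * s / sqrt(n) with z = 1.96 for 95% confidence.
margin = 1.96 * statistics.stdev(sample) / n ** 0.5
interval = (point_estimate - margin, point_estimate + margin)

print(round(point_estimate, 2))  # 10.15
print(tuple(round(v, 2) for v in interval))
```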
1.3 Applications of Sampling Distribution and Estimation
1.3.1 Sampling Distribution
- In Biology, a biologist may measure the height of 30 randomly selected plants and then use the sample mean height to estimate the population mean height. If the biologist finds that the sample mean height of the 30 plants is 10.3 inches, then her best guess for the population mean height will also be 10.3 inches.
- In Surveys, the HR department of a company may randomly select 50 employees to take a survey that assesses their overall satisfaction on a scale of 1 to 10. If the average satisfaction among surveyed employees is 8.5, then the best guess for the average satisfaction rating of all employees at the company is also 8.5.
1.3.2 Estimation
- In engineering, for any specified rock slope failure mode (such as plane shear, step path, or wedge), the safety factor (SF) equation can be applied to the point estimation method to provide accurate estimates of the mean and standard deviation of the SF probability distribution
- In calculating labor costs, an estimation model can help to count hours on the right day to compute labor costs, which can be particularly challenging for night shift workers
- In the construction and contracting industry, people have either designed estimation apps for clients or programmed software to approximately calculate the volume and cost of flooring (concrete, wood, carpet) or drywall
- In practical mathematics, estimation can be applied to time management, future budgeting, and algebra.
- CLT is helpful in finance when examining a large portfolio of securities to estimate returns, risks, and correlations
PART 2: ARTICLE AND SOURCES ANALYSIS
2.1 Article Summary
2.1.1 General Info
The use of random sampling in investigations involving child abuse material
This article, by Brian Jones, Syd Pleno, and Michael Wilkinson (2012), addresses two ubiquitous problems in practicing digital forensics in law enforcement: ever-increasing data volumes and staff exposure to disturbing material. It discusses how the New South Wales Police Force, State Electronic Evidence Branch (SEEB) has implemented a "Discovery Process". By randomly sampling files and applying statistical estimation to the results, the branch has been able to reduce backlogs from three months to 24 hours. The process has the added advantages of reducing staff exposure to child abuse material and providing the courts with an easily interpreted report.
Figure 1 SEEB Investigation July 2010 to June 2011
2.1.2 Objective & Purpose of the article
In order to maximize its capacity to offer timely support to significant criminal investigations, SEEB was forced to develop and execute a variety of systems in response to the growing demand for digital forensic support. Investigations into the possession of child abuse material (CAM) are one area of high demand. In the past, it was the job of SEEB analysts to detect any images, writings, or films that contained CAM.
A crucial limitation of chart review is the requirement to inexpensively obtain a sufficiently large sample. Researchers need to consider how valuable the data is expected to be. Given the absence of validated techniques for a priori sample size calculation, the goal of this study is to offer a satisfactory methodology for sample size computation.
2.1.3 Methodology
A sample is an accurate representation of the population. In this situation, quantitative information about that population is expected to be gained from analyzing the sample. "A sample is representative if the statistics computed from it accurately reflect the corresponding population parameters." (De-Veaux et al. 2009)
The population must be randomly sorted in order to allow for the above-mentioned estimation, and a minimum required number of items must be presented: the sample set. This can be modeled in statistics as "simple random sampling". By assigning a random number to each file, we can sort the files and choose a sample from the first n files to calculate the ratio of files on a digital storage device that include CAM to those that do not.
2.1.4 Main Results
Table 1. Test results. Constraints: CL = 99%, CI < 5%; population = 52,061; sample = 8,388; actual items of interest = 1,527
Table 2 Average test results
2.2 Techniques used in the article
2.2.1 Sample size
Due to the law of large numbers and the Central Limit Theorem, in order to achieve a confidence level of 99%, a maximum of approximately 10,000 files (when using the Discovery statistical constraints) have to be viewed, irrespective of the population size (Yamane, 1967).
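The plateau effect can be sketched with the common Cochran sample-size formula combined with a finite population correction (an illustration only: the 99% confidence level matches the text, but the 1% margin of error is an assumed value, not the article's exact Discovery constraints):

```python
import math

def cochran_n0(z, p, e):
    """Infinite-population sample size: n0 = z^2 * p * (1 - p) / e^2."""
    return z * z * p * (1.0 - p) / (e * e)

def corrected_n(n0, N):
    """Finite population correction: n = n0 / (1 + (n0 - 1) / N)."""
    return math.ceil(n0 / (1.0 + (n0 - 1.0) / N))

z = 2.576   # critical value for a 99% confidence level
p = 0.5     # worst-case proportion (maximizes the required sample size)
e = 0.01    # assumed 1% margin of error (illustrative)

n0 = cochran_n0(z, p, e)
# The required sample size barely grows once the population is large:
for N in (50_000, 500_000, 5_000_000):
    print(N, corrected_n(n0, N))
```

However large N becomes, the corrected sample size never exceeds n0, which is why the number of files to view is effectively capped regardless of how much data is seized.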
2.2.2 Sampling error
In this case we are only concerned with sampling errors, as non-sampling errors are minimal due to the controlled environment and the uniformity of the population.
2.2.3 Simple random sampling
Sampling is defined in ISO/IEC 17025 as:
"A defined procedure whereby a part of a substance, material or product is taken to provide for testing or calibration of a representative sample of the whole."
To allow for the estimation, the population must be randomly sorted and a minimum recommended number of items presented: the sample set. In statistics this can be modeled as "simple random sampling".
"With simple random sampling, each member of the sampling frame has an equal chance of selection, and each possible sample of a given size has an equal chance of being selected. Every member of the sampling frame is numbered sequentially and a random selection process is applied to the numbers." (McLennan, 1999)
For the purposes of determining the ratio of files containing CAM to files not containing CAM on a digital storage device, simple random sampling is relatively straightforward: by assigning a random number to each file, the files can be sorted and the sample selected from the first n files.
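The selection procedure can be sketched directly (the file names are hypothetical; the population and sample sizes echo the figures reported in Table 1):

```python
import random

def simple_random_sample(files, n, seed=None):
    """Assign a random number to each file, sort by it, keep the first n."""
    rng = random.Random(seed)
    keyed = [(rng.random(), name) for name in files]  # (random key, file)
    keyed.sort()                                      # random order
    return [name for _, name in keyed[:n]]

# Hypothetical file listing sized like the case in Table 1.
population = [f"file_{i:05d}" for i in range(52_061)]
sample = simple_random_sample(population, 8_388, seed=1)
print(len(sample))  # 8388
```

Because every file receives an independent random key, every subset of 8,388 files is equally likely to be chosen, which is exactly the simple random sampling condition quoted above.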
2.2.4 Confidence level
This value indicates the reliability of the estimate. A confidence level of 95% implies that if the sampling procedure were repeated 100 times, about 95 of the resulting confidence intervals would contain the true population parameter (Stewart, 2011).
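This interpretation can be checked by simulation: repeatedly sample from a population with a known mean and count how many 95% intervals capture it (a sketch with assumed population parameters and a known sigma):

```python
import random
import statistics

random.seed(7)
MU, SIGMA = 50.0, 10.0   # assumed (known) population parameters
n, trials = 40, 1000

covered = 0
for _ in range(trials):
    sample = [random.gauss(MU, SIGMA) for _ in range(n)]
    mean = statistics.fmean(sample)
    margin = 1.96 * SIGMA / n ** 0.5   # 95% interval with known sigma
    if mean - margin <= MU <= mean + margin:
        covered += 1

print(covered / trials)  # close to 0.95
```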
2.2.5 The Adjusted Wald, Jeffreys and Wilson Interval Methods
To calculate the estimate, the first thing required is the interval estimation of the proportion, or the confidence interval. The standard formula for the confidence interval, known in most introductory statistics textbooks as the Wald interval, is:

p̂ ± z √( p̂(1 − p̂) / n )

where p̂ = s/n is the sample proportion and z is the critical value for the chosen confidence level.
As the standard interval has been shown to be inconsistent and unreliable in many circumstances, several sources recommend the Adjusted Wald as a more reliable interval calculation. The Adjusted Wald interval replaces p̂ with p̃ = (s + z²/2) / (n + z²) and is defined as:

p̃ ± z √( p̃(1 − p̃) / (n + z²) )
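As an illustration, the Wald interval and the Adjusted Wald interval (in its common Agresti-Coull form, which the calculation below assumes) can be computed side by side, reusing the counts from Table 1 as inputs:

```python
import math

def wald_interval(s, n, z=2.576):
    """Standard Wald interval: p +/- z * sqrt(p * (1 - p) / n)."""
    p = s / n
    margin = z * math.sqrt(p * (1.0 - p) / n)
    return p - margin, p + margin

def adjusted_wald_interval(s, n, z=2.576):
    """Adjusted Wald: add z^2/2 successes and z^2 trials before computing."""
    n_adj = n + z * z
    p_adj = (s + z * z / 2.0) / n_adj
    margin = z * math.sqrt(p_adj * (1.0 - p_adj) / n_adj)
    return p_adj - margin, p_adj + margin

# Illustrative inputs: 1,527 items of interest in a sample of 8,388 (Table 1).
print(tuple(round(v, 4) for v in wald_interval(1527, 8388)))
print(tuple(round(v, 4) for v in adjusted_wald_interval(1527, 8388)))
```

With a sample this large the two intervals nearly coincide; the adjustment matters most for small samples or proportions near 0 or 1.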
The Wilson and Jeffreys Bayesian interval methods were also used in simulations and testing to compare reliability; the most reliable method giving acceptably conservative estimates was selected for use in the SEEB sampling process.
2.2.6 Finite Population Correction (FPC)
If the population is known and the sample is greater than 5% of the population, the finite population correction factor can be used. The FPC is included in the estimate calculation and usually results in a narrower margin for the estimate, without affecting the reliability. The formula for the FPC is:
FPC = √( (N − n) / (N − 1) )
2.2.7 The Maximum likelihood estimate, Laplace, Jeffreys and Wilson Point Estimators
The point estimate is the ratio of the selection in relation to the sample, or s/n. It is a value used in the final estimate calculation. There are four main binomial point estimators: the maximum likelihood estimate (MLE), Laplace, Wilson, and Jeffreys (Böhning and Viwatwongkasem, 2005).
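These four estimators differ only in how far they shrink the raw ratio s/n toward 1/2; a sketch using their standard textbook definitions (the counts from Table 1 are reused as illustrative inputs):

```python
def mle(s, n):
    return s / n                          # raw ratio (maximum likelihood)

def laplace(s, n):
    return (s + 1) / (n + 2)              # add one success and one failure

def jeffreys(s, n):
    return (s + 0.5) / (n + 1)            # Jeffreys prior Beta(1/2, 1/2)

def wilson(s, n, z=1.96):
    return (s + z * z / 2) / (n + z * z)  # center of the Wilson interval

s, n = 1527, 8388   # illustrative counts from Table 1
for estimator in (mle, laplace, jeffreys, wilson):
    print(estimator.__name__, round(estimator(s, n), 4))
```

For a sample this large all four agree to about three decimal places; the differences only matter for small samples.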
2.3 Additional Source
2.3.1 Methods of sample size calculation in descriptive retrospective burden of illness studies
- Issue
An objective of this study was to propose rigorous methodologies for sample size computation in light of the lack of verified methods for a priori sample size estimation.
- Purpose
This study offers formal guidelines for calculating sample sizes for studies of the retrospective burden of disease. Pharmacoepidemiology uses observational burden of illness research to achieve a number of goals, including contextualizing the present treatment environment, highlighting significant treatment gaps, and offering estimates to parameterize economic models. The goal of this study was to create suggested sample size formulas for use in such studies.
- Technique
+ Cost estimate – Bottom-up sampling
Estimating work effort in agile projects is fundamentally different from traditional methods of estimation. The traditional approach is to estimate using a "bottom-up" technique: detail out all requirements, estimate each task required to complete those requirements in hours or days, and then use this data to develop the project schedule. This technique is used when the requirements are known at a discrete level; the smaller work pieces are then aggregated to estimate the entire project. It is usually applied when the information is only known in smaller pieces.
+ Retrospective chart review
The retrospective chart review (RCR), also known as a medical record review, is a type of research design in which pre-recorded, patient-centered data are used to answer one or more research questions.
The data used in such reviews exist in many forms: electronic databases, results from diagnostic tests, and notes from health service providers, to mention a few. RCR is a popular methodology widely applied in many healthcare-based disciplines such as epidemiology, quality assessment, professional education and residency training, inpatient care, and clinical research (cf. Gearing et al.), and valuable information may be gathered from study results to direct subsequent prospective studies.
2.3.2 COVID-19 prevalence estimation by random sampling in population - optimal sample pooling under varying assumptions about true prevalence
- Issue
A rough estimate of the illness burden in a population is the number of confirmed COVID-19 cases divided by the size of the population. However, a number of studies indicate that a sizable portion of cases typically goes unreported, and that this percentage varies greatly with the level of sampling and the test criteria employed in different jurisdictions.
- Purpose
This research aims to examine how the number of samples used in the experiment and the degree of sample pooling affect the accuracy of prevalence estimates, as well as the possibility of reducing the number of tests necessary to obtain individual-level diagnostic results.
- Technique
Estimates of the true prevalence of COVID-19 in a population can be made by random sampling and pooling of RT-PCR tests. These estimations do not take into account follow-up testing on sub-pools that enables patient-level diagnosis; they are based only on the initial pooled tests. The precision of the pooled test estimates would be closer to that of individual testing if the results from these samples were taken into account.
The article starts by generating a population of 500,000 individuals, each with a given probability of being infected at sampling time. The number of patient samples collected from the population is denoted by n, and the number of patient samples pooled into a single well is denoted by k. The total number of pools is thus n/k, hereafter called m. The number of positive pools in an experiment is termed x.
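With this notation, a standard estimator of prevalence from pooled results is p = 1 − (1 − x/m)^(1/k), assuming a pool tests positive exactly when it contains at least one infected sample (a perfectly accurate test, a simplification of the article's setting). A sketch:

```python
import random

def pooled_prevalence(x, m, k):
    """Prevalence estimate when x of m pools (k samples each) test positive,
    assuming a pool is positive iff it holds at least one infected sample."""
    return 1.0 - (1.0 - x / m) ** (1.0 / k)

# Sanity check against a simulated population, echoing the article's setup.
random.seed(3)
true_p, n, k = 0.02, 5_000, 10
m = n // k
infected = [random.random() < true_p for _ in range(n)]
pools = [any(infected[i * k:(i + 1) * k]) for i in range(m)]
x = sum(pools)
print(round(pooled_prevalence(x, m, k), 3))  # near the true 0.02
```

Only m = n/k tests are needed to produce this estimate, rather than n individual tests, which is the saving the article quantifies.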
2.4 Conclusions
In conclusion, this study represents a formal guide to sample size calculation for searches for CAM files. SEEB has been able to significantly reduce the exposure of its staff and police investigators to disturbing child abuse material. It has also significantly reduced backlogs, enabled investigators to establish the extent of their investigation in a short timeframe, and provided the courts with a clear record of the quantity and severity of CAM on a device.