


DOCUMENT INFORMATION

Basic information

Title: Measures of Central Tendency
Instructor: Dr Greeni Maheshwari
School: University of Economics and Law
Major: Business Statistics
Type: lecture notes
Year of publication: 2020
City: Ho Chi Minh City
Pages: 18
File size: 371.87 KB


ASSIGNMENT COVER PAGE

Pham Huynh Ngoc Anh – s3836285

Chau Hai Hoang – s3836304

Pham Phuong Thao – s3817955


TABLE OF CONTENTS

PART 1: DATA COLLECTION

PART 2: DESCRIPTIVE STATISTICS

PART 3: MULTIPLE REGRESSION

PART 4: TEAM REGRESSION CONCLUSION

PART 5: TIME SERIES

PART 7: OVERALL TEAM CONCLUSION

APPENDICES

CONTRIBUTION


PART 1: DATA COLLECTION

Our team collected data on six variables: the total number of deaths due to COVID 19 between January 22 and April 23, 2020; average rainfall (in mm) and average temperature (in Celsius), based on available data from 1991 to 2016; hospital beds (per 10,000 people, latest available); population of the country (in 1000s) in 2018; and medical doctors (per 10,000 people, latest available). The data cover 35 countries in Region A: Asia and 27 countries in Region B: European Union (EU). The links to the sources are provided in the Excel file, sheet "RAW DATA – Part 1". After cleaning the data, 30 countries remain in Region A: Asia and 27 countries in Region B: EU; these are presented in the Excel file, sheets "CLEAN – Part 1 Asia" and "CLEAN – Part 1 EU".

PART 2: DESCRIPTIVE STATISTICS

1 Measures of Central Tendency

To analyse the three measures of central tendency properly, we first checked whether the data contain outliers. Any value higher than Q3 + (1.5 × IQR) or lower than Q1 – (1.5 × IQR) is considered an outlier. Using this rule, we identified 5 outliers for Region A: Asia (China, India, Indonesia, Japan, Philippines) and 6 outliers for Region B: EU (Belgium, France, Germany, Italy, Netherlands, Spain).
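The 1.5 × IQR rule can be sketched in Python as follows. This is only an illustration, not the team's Excel workbook: the `deaths` list is made-up sample data, the function name `iqr_outliers` is hypothetical, and the "inclusive" quartile method is an assumption. Quartile conventions differ between tools, so the fences may differ slightly from those obtained in Excel.

```python
import statistics

def iqr_outliers(values):
    """Return (lower_fence, upper_fence, outliers) using the 1.5 * IQR rule."""
    # Assumption: "inclusive" quartiles; other tools may use another convention.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return lower_fence, upper_fence, outliers

# Hypothetical death counts, NOT the report's data.
deaths = [0, 0, 2, 5, 10, 12, 20, 45, 130, 4000]
lower, upper, flagged = iqr_outliers(deaths)
print("fences:", lower, upper)
print("outliers:", flagged)
```

In this toy sample only the two largest values fall above the upper fence, which mirrors how a few very large countries can be flagged in a right-skewed dataset.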

Figure 1 Measures of Central Tendency of total number of deaths due to COVID-19 in Asia and EU (Unit: Number of deaths)

In this situation, the mean is not a suitable measure for comparison because outliers are present in both the Asia and EU datasets. Moreover, the EU dataset does not have a mode, so the mode cannot be used for comparison. Hence, the median is the best of the three measures of central tendency.

The median of the EU dataset is 239, nearly 24 times that of the Asia dataset (10). This indicates that 50% of EU countries recorded at least 239 deaths due to COVID-19 during the period, far more than in Asian countries. Therefore, we can conclude that EU nations tend to have a higher number of deaths due to COVID-19 than Asian nations.
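To illustrate why the median is preferred here, the sketch below computes the three measures for two small samples. The values are hypothetical (not the report's data), and the helper `central_tendency` is an illustrative name; it treats a dataset in which every distinct value occurs equally often as having no mode, which is an assumption about how "no mode" is defined.

```python
import statistics

def central_tendency(values):
    """Return mean, median and mode(s); mode is None when no value stands out."""
    modes = statistics.multimode(values)
    # If all distinct values are equally frequent, multimode returns all of them.
    has_mode = len(modes) < len(set(values))
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": modes if has_mode else None,
    }

# Hypothetical samples: one extreme value in the first, no repeats in the second.
sample_a = central_tendency([3, 7, 10, 10, 15, 2500])
sample_b = central_tendency([40, 90, 200, 800, 1500])
print(sample_a)
print(sample_b)
```

In the first sample the single extreme value pulls the mean far above the median, while the second sample has no mode at all, so only the median can be compared across both.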

2 Box-and-whisker Plots Analysis


Figure 2 Box-and-whisker plots of total number of deaths due to COVID 19 in Asia and EU countries

Because the ranges of the Asia and EU data differ so greatly, we drew the two box-and-whisker plots on different scales to improve the data visualisation. As can be seen from Figure 2, the data distributions of both regions are clearly right-skewed. Specifically, the right whiskers of both plots are much longer than the left whiskers, which reflects the extreme high values in the datasets. Figure 2 shows that 25% of countries in Asia have no deaths, while 75% of EU countries have more than 51 deaths. Likewise, 25% of EU countries have more than 1420 deaths, while the top 25% of Asian countries have more than only 119 deaths. Based on that, we can conclude that EU countries have a higher number of deaths than Asian countries.
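The quartile statements read off a box-and-whisker plot correspond to a five-number summary. The following sketch uses hypothetical values and an assumed "inclusive" quartile convention; `five_number_summary` is an illustrative helper, not part of the report. The skew check simply compares the two halves of the box.

```python
import statistics

def five_number_summary(values):
    """Return the min, quartiles and max that a box-and-whisker plot displays."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values)}

# Hypothetical data, NOT the report's figures.
summary = five_number_summary([0, 0, 1, 4, 10, 30, 120, 500])

# A right-skewed box has a longer upper half than lower half.
is_right_skewed = (summary["q3"] - summary["median"]) > (summary["median"] - summary["q1"])
print(summary, is_right_skewed)
```

Reading the output the same way as Figure 2: 25% of the observations lie at or below `q1`, and 25% lie at or above `q3`.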

3 Measures of Variation

Figure 3 Measures of Variation of total deaths in Asia and EU (Unit: number of deaths except for the Coefficient of Variation)

The interquartile range (IQR) is the best measure of variation in this case because outliers are present in the datasets of both regions. The standard deviation is not the best measure because it is sensitive to extreme values, and the coefficient of variation (CV) is also unsuitable because the skewness of both datasets is too high (highly right-skewed), as can be clearly seen from Figure 2.

The EU region has a higher IQR (1369.5) than the Asia region (119), which indicates that the number of COVID-19 deaths is more dispersed around the median in the EU. In other words, the number of deaths in EU nations spreads farther from the middle value (3425.59) and is less consistent than in Asian nations, which suggests that COVID-19 has had a greater impact on EU countries than on Asian ones.
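The sensitivity argument can be checked with a small experiment: replace the largest value of a sample with an extreme one and compare how each measure reacts. The data and the helper name `variation_measures` are hypothetical, and the "inclusive" quartile method is an assumption.

```python
import statistics

def variation_measures(values):
    """Return IQR, sample standard deviation and coefficient of variation (%)."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)  # sample standard deviation (n - 1)
    return {"iqr": q3 - q1, "stdev": stdev, "cv_percent": stdev / mean * 100}

# Hypothetical data: the second list differs only in its largest value.
base = [0, 2, 5, 10, 12, 20, 45, 120]
with_extreme = base[:-1] + [4000]

before = variation_measures(base)
after = variation_measures(with_extreme)
print(before)
print(after)
```

In this toy case the IQR is unchanged by the extreme value, whereas the standard deviation grows many times over, which is the reason given above for preferring the IQR.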


PART 3: MULTIPLE REGRESSION

1 Region A: Asian Countries

After applying backward elimination, we find that all five variables (average rainfall, average temperature, hospital beds (per 10,000 people), population (in 1000s), and medical doctors (per 10,000 people)) are insignificant at the 5% level of significance. Therefore, no regression model can be built for the Asia region. The backward elimination process can be seen in Appendix A.1.

2 Region B: EU Countries (FINAL)

After applying backward elimination, we find that two variables, namely hospital beds (per 10,000 people) and population (in 1000s), are significant at the 5% level of significance. The regression building process can be seen in Appendix A.2.

a Regression Output

Figure 4 FINAL regression model of Region B: EU

b Regression Equation: $\hat{y} = b_0 + b_1 X_1 + b_2 X_2$

$\widehat{\text{Total number of deaths}} = 6838.4 - 157.83\,(\text{Hospital beds}) + 0.28\,(\text{Population})$

c Regression coefficient of the significant independent variables

The slope b1 = –157.83 indicates that the total number of deaths due to COVID 19 between January 22 and April 23, 2020 decreases by 157.83 deaths for every increase of one hospital bed per 10,000 people, holding population (in 1000s) constant.

The slope b2 = 0.28 indicates that the total number of deaths due to COVID 19 between January 22 and April 23, 2020 increases by 0.28 deaths for every increase of 1,000 people in the population of the country, holding hospital beds (per 10,000 people) constant.

The intercept b0 = 6838.4 corresponds to a country with no hospital beds and no population, which has no practical meaning. It simply represents the portion of the total number of deaths due to COVID 19 between January 22 and April 23, 2020 that is not explained by the number of hospital beds and the population of the country within the selected sample. Also note that x1 = 0 and x2 = 0 lie outside the range of observed values.
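The interpretation of the coefficients can be checked numerically. The sketch below plugs the report's rounded coefficients into a small function; the function name `predict_deaths` and the example country (50 beds per 10,000 people, population of 10 million) are hypothetical and chosen only for illustration.

```python
B0 = 6838.4    # intercept from the report
B1 = -157.83   # slope for hospital beds (per 10,000 people)
B2 = 0.28      # slope for population (in 1000s)

def predict_deaths(beds_per_10k, population_thousands):
    """Predicted total deaths from the EU regression equation."""
    return B0 + B1 * beds_per_10k + B2 * population_thousands

# Hypothetical country: 50 beds per 10,000 people, 10,000 thousand inhabitants.
baseline = predict_deaths(50, 10_000)
one_more_bed = predict_deaths(51, 10_000)           # population held constant
thousand_more_people = predict_deaths(50, 10_001)   # beds held constant

print(round(baseline, 1))
print(round(one_more_bed - baseline, 2))
print(round(thousand_more_people - baseline, 2))
```

Changing one input by one unit while holding the other constant changes the prediction by exactly the corresponding slope, which is what "holding constant" means in the interpretation.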


d The coefficient of determination

The coefficient of determination (R Square = 67.47%) shows that 67.47% of the total variation in the total number of deaths due to COVID 19 between January 22 and April 23, 2020 can be explained by the variation in the number of hospital beds and the population of the country, while the remaining 32.53% is due to variation in other factors that were not included in our study.
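As a brief reminder of where these two percentages come from, the coefficient of determination can be written in terms of sums of squares. The symbols SSR (regression sum of squares), SSE (error sum of squares) and SST (total sum of squares) are standard textbook notation assumed here; they do not appear in the report itself.

```latex
R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST},
\qquad
\underbrace{0.6747}_{\text{explained}} + \underbrace{0.3253}_{\text{unexplained}} = 1
```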

PART 4: TEAM REGRESSION CONCLUSION

According to Part 3, the two regions have different significant independent variables. For the Asia dataset, none of the independent variables (average rainfall, average temperature, hospital beds, population, and medical doctors) is significant. For the EU dataset, the final regression model contains two significant independent variables: hospital beds (per 10,000 people) and population (in 1000s).

The EU region appears to have been more heavily affected by this pandemic. According to Part 2, the total number of deaths due to COVID 19 in the EU region is much higher than in the Asia region. With the variables given in this study, the higher number of deaths in the EU region can be attributed to two factors, the number of hospital beds (per 10,000 people) and the population of the country (in 1000s), which together explain 67.47% of the total variation in the total number of deaths due to COVID 19 in the EU. However, as none of the five given variables is significant for the Asia region, other variables should be considered when estimating the total number of deaths in this region.

In conclusion, after building regression models for the two regions, this report found that the number of hospital beds and the population of the country can be used to estimate the total number of deaths due to COVID 19 in the EU region. Meanwhile, none of the five given variables has a significant relationship with the total number of deaths due to COVID 19 in the Asia region. Therefore, to estimate the total number of deaths due to COVID 19 in Asia more effectively, variables other than the five given ones should be taken into consideration.

PART 5: TIME SERIES

Because there were no deaths due to COVID 19 in Region A: Asia from January 1 to January 21, and none in Region B: EU from January 1 to February 14, we chose to collect data from February 15 to April 30 to build trend models for the two regions. In these two datasets, any day with no deaths has its 0 converted to 0.00000005 for calculation purposes, because Log(0) cannot be calculated. The link for data collection is provided in the Excel file, sheet "Part 5 Final".
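The zero-replacement step can be sketched as follows. The replacement value 0.00000005 is the one stated in the report; the base-10 logarithm is an assumption (the report only writes "Log"), and the helper name `safe_log10` and the sample series are hypothetical.

```python
import math

ZERO_REPLACEMENT = 0.00000005  # value stated in the report

def safe_log10(deaths):
    """Base-10 log of a daily count, substituting a tiny value for zero."""
    value = deaths if deaths > 0 else ZERO_REPLACEMENT
    return math.log10(value)

# math.log10(0) raises ValueError, which is why the substitution is needed.
try:
    math.log10(0)
    log_of_zero_fails = False
except ValueError:
    log_of_zero_fails = True

daily_deaths = [0, 3, 0, 12]  # hypothetical series
logs = [safe_log10(d) for d in daily_deaths]
print(log_of_zero_fails, [round(x, 3) for x in logs])
```

Note that the substituted days become large negative values on the log scale (about -7.3), so they can noticeably influence a trend fitted to the logged data.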

1 Build Linear (LIN), Quadratic (QUA), Exponential (EXP) trend models

1.1 Region A: Asia


After hypothesis testing for the Asia region (Appendix B.1), we found that the linear, quadratic and exponential trends are all significant trend models for this region.

LIN

a Regression output

Figure 5 Time Series outputs for Region A: Asia linear trend

b Formula: $\hat{y} = 22.3 + 1.873T$

QUA

a Regression output

Figure 6 Time Series outputs for Region A: Asia quadratic trend

b Formula: $\hat{y} = 114.142 - 5.192T + 0.092T^2$

EXP

a Regression output


Figure 7 Time Series outputs for Region A: Asia exponential trend

1.2 Region B: EU countries

After hypothesis testing for the EU region (Appendix B.2), we found that the linear, quadratic and exponential trends are all significant trend models for this region.

LIN

a Regression output

Figure 8 Time Series outputs for Region B: EU linear trend

b Formula: $\hat{y} = -363.053 + 45.125T$

QUA

a Regression output


Figure 9 Time Series outputs for Region B: EU quadratic trend

b Formula: $\hat{y} = -1048.775 + 97.873T - 0.685T^2$

EXP

a Regression output

Figure 10 Time Series outputs for Region B: EU exponential trend

2 Recommended Trend Models

To determine the best trend model for predicting the number of deaths due to COVID-19, we compare the coefficient of determination (R Square) in the regression outputs of the three models. A trend model with a higher coefficient of determination explains more of the total variation in the total number of deaths due to COVID 19 and produces smaller errors in prediction.
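For readers who want to see how an R Square value for a trend model arises, here is a minimal least-squares sketch for the linear case. It is not the Excel regression used in the report: the series `y` is made up, the time index is assumed to start at T = 1, and `fit_linear_trend` is a hypothetical helper name.

```python
def fit_linear_trend(y):
    """Fit y = b0 + b1*T by ordinary least squares with T = 1, 2, ..., n.

    Returns (b0, b1, r_squared).
    """
    n = len(y)
    t = list(range(1, n + 1))
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    sxy = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    b1 = sxy / sxx
    b0 = y_mean - b1 * t_mean
    fitted = [b0 + b1 * ti for ti in t]
    sst = sum((yi - y_mean) ** 2 for yi in y)            # total variation
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # unexplained variation
    return b0, b1, 1 - sse / sst

# Hypothetical daily series, NOT the report's data.
b0, b1, r2 = fit_linear_trend([2, 4, 5, 4, 5])
print(round(b0, 3), round(b1, 3), round(r2, 3))
```

The same R Square definition applies to the quadratic and exponential fits; the model whose fitted values leave the least unexplained variation has the highest value.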

a Region A: Asia

Figure 11 Coefficient of determination of three trend models of Asia (%)

For Region A, Figure 11 clearly shows that the exponential trend model has the highest coefficient of determination (18.75%). As a result, the exponential trend model is the most suitable for predicting the total number of deaths due to COVID-19 in Region A, as it produces the smallest error.

b Region B: EU

Figure 12 Coefficient of determination of three trend models of EU (%)

For Region B, Figure 12 clearly shows that the quadratic trend model has the highest coefficient of determination (70%). As a result, the quadratic trend model is the most suitable for predicting the total number of deaths due to COVID-19 in Region B, as it produces the smallest error.

3 Predict the number of deaths on May 29, May 30, May 31

a Region A: Asia

As concluded in the section above, the exponential trend is the best model for predicting the number of deaths due to COVID 19 in Asia, with the formula: $\hat{y} = 34.988 \times 1.016^T$

Figure 13 Predicted number of deaths on May 29, 30, 31 in Asia
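The prediction step can be reproduced with a short script. It assumes that the time index starts at T = 1 on February 15, 2020, which is consistent with the report using T = 137 for June 30, and that the exponential formula reads 34.988 × 1.016 raised to the power T. The helper names are hypothetical, and because the coefficients are rounded, the results may differ slightly from Figure 13.

```python
from datetime import date

START = date(2020, 2, 15)  # first day of the series, assumed to be T = 1

def time_index(day):
    """Convert a calendar date to the trend model's time index T."""
    return (day - START).days + 1

def asia_exponential(t):
    """Exponential trend for Region A: Asia, using the report's rounded coefficients."""
    return 34.988 * 1.016 ** t

for day in (date(2020, 5, 29), date(2020, 5, 30), date(2020, 5, 31)):
    t = time_index(day)
    print(day.isoformat(), t, round(asia_exponential(t), 1))
```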

b Region B: EU

As concluded above, the quadratic trend is the best model for predicting the number of deaths due to COVID 19 in the EU, with the formula: $\hat{y} = -1048.775 + 97.873T - 0.685T^2$

Figure 14 Predicted number of deaths on May 29, 30, 31 in EU
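A corresponding sketch for the EU model follows. The time indices 105, 106 and 107 for May 29, 30 and 31 assume that T = 1 falls on February 15; the function name is hypothetical, and rounded coefficients may give values slightly different from Figure 14. The script also locates the peak of the fitted parabola, since a quadratic with a negative T-squared coefficient declines after its vertex.

```python
def eu_quadratic(t):
    """Quadratic trend for Region B: EU, using the report's rounded coefficients."""
    return -1048.775 + 97.873 * t - 0.685 * t ** 2

# May 29, 30, 31 under the assumption that February 15 is T = 1.
predictions = {t: eu_quadratic(t) for t in (105, 106, 107)}
for t, value in predictions.items():
    print(t, round(value, 1))

# Vertex of the parabola: T = -b / (2a) with a = -0.685 and b = 97.873.
peak_t = 97.873 / (2 * 0.685)
print(round(peak_t, 1))
```

Because all three prediction dates lie after the peak, each successive prediction is lower than the previous one.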

PART 6: TIME SERIES CONCLUSION

a Line chart


Figure 15 Line graph of Daily total number of deaths due to COVID 19 in Asia and EU from February 15 to April 30

b Explanation

The line graph above shows the change in the daily total number of deaths from February 15 to April 30, 2020 in the Asia and EU regions. It is evident from the chart that the daily numbers of deaths in the two regions follow different trends. Although the pandemic started in Asia, the daily total number of deaths in Asia was fairly stable with minor fluctuations, except for the irregular component on April 17 caused by a sudden spike in the total number of deaths in China. Meanwhile, in the EU, the total number of daily deaths fluctuated sharply. The daily number of deaths followed an upward trend from March 7 to April 1. It then rose to a peak and decreased with a seasonal component of three days until April 15, and subsequently dropped, with some fluctuations, over the rest of the period.

As mentioned in Part 5.3, the exponential trend model and the quadratic trend model were chosen as the best predictors of the number of deaths due to COVID 19 in Asia and the EU, respectively. To determine the best trend model for predicting the number of deaths in the world, our team again compares the coefficient of determination (R Square) of the two trend models, as a higher R Square means smaller errors. Our team prefers the quadratic trend model of the EU region for predicting the number of deaths in the world, as its R Square is much larger than that of the exponential trend model of the Asia region (70% versus 18.75%). This indicates that 70% of the variation in the dependent variable (total number of deaths due to COVID-19) can be explained by the model's inputs, which in turn produces smaller errors in the prediction process.


PART 7: OVERALL TEAM CONCLUSION

1 Main factors impacting the total number of deaths

Based on the Part 3 multiple regression analysis of the EU region, two significant independent variables may affect the total number of deaths due to COVID-19 at the 95% level of confidence: hospital beds (per 10,000 people) and population (in 1000s). From the regression equation for the EU region in Part 3, $\widehat{\text{Total number of deaths}} = 6838.4 - 157.83\,(\text{Hospital beds}) + 0.28\,(\text{Population})$, it can be clearly observed that the number of hospital beds (per 10,000 people) has an inverse relationship with the total number of deaths due to COVID-19, while population (in 1000s) has a direct positive relationship with it. Moreover, Part 3 shows that each additional hospital bed per 10,000 people is associated with 157.83 fewer deaths, while each additional 1,000 people in the population of a country is associated with only 0.28 more deaths. Hence, per unit of each variable, the number of hospital beds has a greater impact on the total number of deaths due to COVID-19 than the population of a country, although the two variables are measured in different units. However, these two variables are not the only factors that could influence the total number of deaths, as our study covers only two specific regions, Asia and the EU. Further study of the pandemic may show that other factors, such as lack of social protection, access to healthcare or meteorological factors, affect the total number of deaths as well (United Nations 2020; Liu et al. 2020).

2 Predicted number of deaths due to COVID-19 in the world on June 30

As mentioned in Part 6, the best trend model for predicting the number of deaths due to COVID-19 is the quadratic trend model of the EU region. Therefore, to predict the number of deaths due to COVID-19 in the world on June 30, we use the formula of the quadratic trend model of the EU region with T equal to 137.

Formula: $\hat{y} = -1048.775 + 97.873T - 0.685T^2$

$= -1048.775 + 97.873 \times 137 - 0.685 \times 137^2 \approx -497$

The calculation gives a negative number of deaths due to COVID-19 on June 30, which cannot happen in the real world. This result indicates that the COVID-19 pandemic may end before June 30.
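The June 30 figure, and the day on which the fitted curve first drops below zero, can be checked with the sketch below. It uses the report's rounded coefficients and assumes T = 1 on February 15; the function and variable names are hypothetical. The search starts after the curve's peak (near T ≈ 71) because a downward-opening parabola is also negative for very small T.

```python
def eu_quadratic(t):
    """Quadratic trend for Region B: EU, using the report's rounded coefficients."""
    return -1048.775 + 97.873 * t - 0.685 * t ** 2

# Prediction for June 30 (T = 137 in the report).
june_30 = eu_quadratic(137)

# First time index after the peak at which the fitted value is negative.
first_negative_t = next(t for t in range(72, 200) if eu_quadratic(t) < 0)

print(round(june_30, 2))
print(first_negative_t)
```

Under these assumptions the fitted value turns negative a few days before T = 137. This is a property of extrapolating the quadratic curve beyond the fitted period, so it is best read as an indication of a declining trend rather than a precise end date.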

3 The number of deaths due to COVID 19 will reduce by the end of the year 2020

Based on the calculations above of the number of deaths due to COVID-19 in the world on June 30 and the number of deaths in Asia and the EU on May 29, 30 and 31 in Part 5, the number of deaths due to COVID-19 appears to be gradually decreasing. Moreover, the predicted number of deaths due to COVID-19 on June 30 is negative (-497 deaths), indicating that the epidemic may end before June 30. According to research by Kraemer et al. (2020), the implementation of travel restrictions has proved to be useful

Date posted: 10/05/2022, 08:48
