
(ESSAY) RMIT International University Vietnam, Assignment 3, Part A: Data Collection. The data for the total number of deaths due to COVID-19 between April 01


DOCUMENT INFORMATION

Basic information

Title: Data Collection for COVID-19 Deaths and Variables (April to July 2020)
Supervisor: Greeni Maheshwari, PTS.
School: RMIT International University Vietnam
Major: Business Statistics
Type: assignment
Year of publication: 2020
City: Ho Chi Minh City
Format:
Pages: 21
File size: 1.61 MB

Content

RMIT International University Vietnam

Saigon

TABLE OF CONTENTS

PART 4: TEAM REGRESSION CONCLUSION

PART 6: TIME SERIES CONCLUSION

PART 1: DATA COLLECTION

The data collected cover the total number of deaths due to COVID-19 between April 01 and July 31, 2020, together with five other variables: average temperature (in Celsius) and average rainfall (in mm), both based on available data from 1991 to 2016; medical doctors (per 10,000 people, latest available); hospital beds (per 10,000 people, latest available); and population of the country (in millions, latest available). Data were collected for 50 countries in Region A: Asia and 23 countries in Region B: North America. After the cleaning process, 46 countries remain in Region A: Asia and 21 countries remain in Region B: North America. The datasets are presented in the attached Excel file.

PART 2: DESCRIPTIVE STATISTICS

• Central Tendency Measurements

(Table columns: Central Tendency, Asia, North America)

Figure 1 Measures of Central Tendency of total number of deaths due to COVID-19 between April 01 and July 31, 2020, in Asia and North America.

In comparing total deaths in Asia and North America using the measures of central tendency, the mode shows nothing noteworthy and will not be considered. The mean will also not be used for interpretation because outliers exist, based on the calculations in Appendix 1.1 and Appendix 1.2. Consequently, the median is the most suitable measure for the comparison: 50 percent of the values are greater than the median and the remaining 50 percent are lower. There is a clear difference between the median number of total deaths related to COVID-19 in Asia and in North America. North America's median of 27.09 is roughly 2.7 times Asia's median of 10.031. Therefore, it can be concluded that North American countries typically have more deaths related to COVID-19 than Asian countries.
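The reasoning for preferring the median when outliers are present can be illustrated with a minimal sketch. The values below are hypothetical and are not the team's dataset; they only show how a single extreme value pulls the mean while leaving the median unchanged.

```python
from statistics import mean, median

# Hypothetical deaths values for seven countries (NOT the team's data).
# The last value is deliberately extreme to mimic an outlier.
deaths = [1.2, 3.5, 8.0, 10.0, 12.5, 20.0, 450.0]

avg = mean(deaths)    # pulled upward by the single extreme value
mid = median(deaths)  # middle value of the sorted list, unaffected by the extreme

print(f"mean   = {avg:.2f}")
print(f"median = {mid:.2f}")
```

In this made-up sample the mean is several times larger than the median, which is the pattern that motivates reporting the median for skewed data.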

• Box and whisker plot

Figure 2 Box-and-whisker plots of total number of deaths due to COVID-19 in Asia and North America countries.

As can be seen from the box-and-whisker plots above, the distributions for both Asia and North America are right-skewed. The right whiskers of both regions are longer than the left whiskers, which is consistent with this skew and with the outliers present in the datasets. Because the medians are about 27 and 10, the plots show that 50% of countries in North America have more than 27 deaths per million population, while 50% of countries in Asia have more than 10 deaths. In addition, 25% of countries in Asia fall between roughly 1 and 10 deaths, compared with roughly 2 to 27 deaths in North America. This demonstrates that North American countries have a higher death rate than Asian countries.

Figure 3 Measures of Variation of total deaths in Asia and North America (Unit: number of deaths, except for the Coefficient of Variation).

In this scenario, the best measure of variation is the Interquartile Range (IQR) because outliers exist. The standard deviation is not suitable because it can be heavily influenced by outliers, and the coefficient of variation is also a poor choice because it is built from the mean and standard deviation, and the distributions above are highly right-skewed. The IQR of the Asia region (50.501) is smaller than the IQR of North America (99.125), indicating that the middle 50% of the Asian data are less dispersed around the median. In other words, the total number of deaths from COVID-19 is more consistent across Asian countries than across North American countries, which, together with the lower median, suggests that the pandemic has had less impact on Asia than on North America.
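The robustness argument above can be sketched as follows. The two lists are hypothetical (not the team's data) and differ only in their last value. The `"inclusive"` quartile method is an assumption, since the report does not state which quartile convention was used.

```python
import statistics

def iqr(values):
    """Interquartile range Q3 - Q1, using the 'inclusive' quartile method (an assumption)."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Hypothetical values (NOT the team's data); only the last entry differs.
without_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 100]

for label, data in (("without outlier", without_outlier), ("with outlier", with_outlier)):
    sd = statistics.stdev(data)             # sample standard deviation
    cv = sd / statistics.mean(data) * 100   # coefficient of variation, in percent
    print(f"{label}: IQR = {iqr(data):.1f}, SD = {sd:.2f}, CV = {cv:.1f}%")
```

The IQR is identical for both lists, while the standard deviation and the coefficient of variation change sharply once the extreme value is introduced.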

PART 3: MULTIPLE REGRESSION

1 Region A: Asian countries (FINAL)

After applying backward elimination, we find that one variable, average rainfall, is significant at the 5% level of significance. The FINAL regression model for Asian countries is given below.

a Regression output

Figure 4: FINAL regression model of Region A: Asia

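b Regression Equation: $\hat{Y} = b_0 + b_1 X_1$

$\widehat{\text{Total deaths}} = 61.01 - 0.286 \times \text{Average rainfall (mm)}$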

c Regression coefficient of the significant independent variable

The slope $b_1 = -0.286$ indicates that the total number of deaths due to COVID-19 between April 01 and July 31, 2020, decreases by an estimated 0.286 deaths for every 1 mm increase in average rainfall.

The intercept is $b_0 = 61.01$. A value of zero rainfall is plausible, and deaths can occur whether or not there is rain, so the intercept has a meaningful interpretation here: the estimated total number of deaths is 61.01 when average rainfall is 0 mm. Equivalently, over the sample selected, 61.01 deaths is the portion of the total number of deaths due to COVID-19 between April 01 and July 31, 2020, that is not explained by the average rainfall (in mm) of a country.

d The coefficient of determination

The coefficient of determination (R Square = 16.3%) shows that 16.3% of the total variation in the total number of deaths due to COVID-19 between April 01 and July 31, 2020, can be explained by the variation in average rainfall, while the remaining 83.7% is due to factors not included in the model.
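To make the meaning of the slope, intercept and R Square concrete, here is a minimal sketch of a one-variable least-squares fit written from scratch. The function name `simple_ols` and the sample values are hypothetical; the team's own output came from their regression tool, not from this code.

```python
def simple_ols(x, y):
    """Fit y = b0 + b1 * x by least squares and return (b0, b1, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    # R Square = 1 - SSE / SST: the share of total variation explained by x
    sst = sum((yi - mean_y) ** 2 for yi in y)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return b0, b1, 1 - sse / sst

# Hypothetical rainfall (mm) and deaths values, NOT the team's data.
rainfall = [20, 50, 80, 120, 200]
deaths = [70, 30, 55, 10, 25]

b0, b1, r2 = simple_ols(rainfall, deaths)
print(f"b0 = {b0:.2f}, b1 = {b1:.3f}, R Square = {r2:.1%}")
print(f"unexplained share = {1 - r2:.1%}")
```

The explained and unexplained shares always add up to 100%, which is how the 16.3% and 83.7% figures relate to each other.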

2 Region B: North American Countries (FINAL)

After applying backward elimination, we find that only one variable, Population (in millions), is significant at the 5% level of significance. The FINAL regression model for North American countries is given below.

a Regression Output

Figure 5: FINAL regression model of Region B: North America.

b Regression Equation: $\hat{Y} = b_0 + b_1 X_1$

$\widehat{\text{Total deaths}} = 52.98 + 1.399 \times \text{Population (millions)}$

c The regression coefficient of the significant independent variables

The slope $b_1 = 1.399$ indicates that the total number of deaths due to COVID-19 between April 01 and July 31, 2020, increases by an estimated 1.399 deaths for every additional one million people in the population of the country.

The intercept is $b_0 = 52.98$. Taken literally, it is the estimated number of deaths when $X_1 = 0$, that is, for a country with no population, which makes no sense because it is impossible to have deaths when there is no population. The intercept simply indicates that, over the sample selected, 52.98 deaths is the portion of the total number of deaths due to COVID-19 between April 01 and July 31, 2020, that is not explained by the population of the country.

d The coefficient of determination

The coefficient of determination (R Square = 61.1%) shows that 61.1% of the total variation in the total number of deaths due to COVID-19 between April 01 and July 31, 2020, can be explained by the variation in the population of the country, while the remaining 38.9% is due to factors not included in the model.
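As a quick illustration of how the two final equations are used, the sketch below evaluates them for made-up inputs. The coefficients are the ones reported above; the function names and the input values (100 mm of rainfall, 10 million people) are hypothetical.

```python
def predict_asia(avg_rainfall_mm: float) -> float:
    """Region A equation from the report: deaths = 61.01 - 0.286 * rainfall (mm)."""
    return 61.01 - 0.286 * avg_rainfall_mm

def predict_north_america(population_millions: float) -> float:
    """Region B equation from the report: deaths = 52.98 + 1.399 * population (millions)."""
    return 52.98 + 1.399 * population_millions

# Hypothetical inputs, chosen only to show the arithmetic.
print(f"Asia, 100 mm rainfall:        {predict_asia(100):.2f}")
print(f"North America, 10 million:    {predict_north_america(10):.2f}")
```

One caution follows directly from the arithmetic: the Region A equation returns a negative number of deaths once average rainfall exceeds about 213 mm, so it should not be extrapolated far beyond the rainfall values in the sample.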

PART 4: TEAM REGRESSION CONCLUSION

According to the study in Part 3, the two regions have the same number of significant independent variables (one each), but the variables differ. Five candidate variables were considered: average rainfall (in mm), hospital beds (per 10,000 population), medical doctors (per 10,000 population), average temperature (in Celsius) and population (in millions). In the final regression model for Asia, the significant independent variable is average rainfall (in mm); in the final regression model for North America, it is population (in millions). In comparison, North America has markedly more total deaths according to the findings in Part 2, which means the region has been affected by the pandemic more than Asia. Moreover, from the study in Part 3, 61.1% of the total variation in total deaths due to COVID-19 in North America can be explained by the population of the country (in millions), which shows that variation in population accounts for most of the variation in the total number of deaths in the region. Meanwhile, in Asia, only 16.3% of the variation in the total number of deaths can be explained by the variation in average rainfall (in mm). The influence of average rainfall on total deaths is therefore modest, and many other relevant factors are not included in the study, which makes the Asian result less reliable than that of the North America region.

To conclude, by building the regression models and comparing the descriptive statistics of the two regions, this study indicates that average rainfall can be used to forecast the total number of deaths due to COVID-19 in Asia, while in North America the population of the country is the independent variable that can be used to predict the total number of deaths. Also, North American countries have suffered a greater impact, with a higher number of deaths due to the pandemic than Asian countries.

PART 5: TIME SERIES

In Part 5, our group collected data on the total number of deaths per day in the two regions, Asia and North America, from April 01 to July 31, 2020. If there are no deaths on a particular day, we use 0.00005 instead of 0 when building the exponential trend model, because log(0) cannot be calculated. The datasets are presented in the attached Excel file.
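The zero-replacement step described above can be sketched as follows. The replacement value 0.00005 is taken from the report; the base-10 logarithm is an assumption (it is consistent with the report's Asia model, where 10^2.383 is approximately 241.5), and the function name and sample counts are hypothetical.

```python
import math

ZERO_REPLACEMENT = 0.00005  # value the report substitutes for zero-death days

def log10_deaths(daily_deaths):
    """Return log10 of each daily count, substituting ZERO_REPLACEMENT for zeros."""
    return [math.log10(d if d > 0 else ZERO_REPLACEMENT) for d in daily_deaths]

# Hypothetical daily counts, NOT the team's data.
sample = [0, 10, 100]
print(log10_deaths(sample))
```

On the log scale the substituted value sits at about -4.3, far below ordinary daily counts, so days with zero deaths can have a noticeable influence on the fitted exponential trend.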

1 Build Linear, Quadratic and Exponential trend models.

• The slope $b_1 = 10.366$ indicates that the total number of deaths due to COVID-19 between April 01 and July 31, 2020, increased by an estimated 10.366 deaths every day.

• $b_0 = 88.199$ when T = 0, which indicates an estimated 88.199 deaths on March 31, 2020.

➢ Quadratic Trend Model

Figure 7 Time Series outputs for Region A: Asia quadratic trend.

b Formula:

$\hat{Y} = 338.607 - 1.75\,T + 0.0985\,T^2$

• The coefficient $b_2 = 0.0985$ is positive, which indicates that the daily change in the total number of deaths due to COVID-19 between April 01 and July 31, 2020, itself grows over time, so the trend curves upward.

• $b_0 = 338.607$ when T = 0, which indicates an estimated 338.607 deaths on March 31, 2020.

➢ Exponential Trend Model

a Regression output

Figure 8 Time Series outputs for Region A: Asia exponential trend.

b Formula:

In linear format: $\log(\hat{Y}) = 2.383 + 0.00653\,T$

In non-linear format: $\hat{Y} = 241.546 \times 1.015^{T}$

Interpretation: $(b_1 - 1) \times 100\% = 1.5\%$ is the estimated daily compound growth rate of the total number of deaths due to COVID-19 from April 01 to July 31, 2020, in Asia.

1.2 Region B: North America

After testing the hypotheses for the trend models in the North America region (Appendix 3.2), the findings indicate that the linear and exponential trend models are significant.

➢ Linear Trend Model

a Regression output

Figure 9 Time Series outputs for Region B: North America linear trend.

b Formula: $\hat{Y} = 2056.42 - 5.37\,T$

• The slope $b_1 = -5.37$ indicates that the total number of deaths due to COVID-19 between April 01 and July 31, 2020, decreased by an estimated 5.37 deaths every day.

• $b_0 = 2056.42$ when T = 0, which indicates an estimated 2056.42 deaths on March 31, 2020.

➢ Exponential Trend Model

a Regression output

Figure 10 Time Series outputs for Region B: North America exponential trend.

b Formula:

In linear format: $\log(\hat{Y}) = 3.2840 - 0.00127\,T$

In non-linear format: $\hat{Y} = 1923.43 \times 0.997^{T}$

Interpretation: $(b_1 - 1) \times 100\% = -0.3\%$, so 0.3% is the estimated daily compound rate of decrease in the total number of deaths due to COVID-19 from April 01 to July 31, 2020, in North America.
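The step from the linear (log) format to the non-linear format, and from there to a daily compound rate, can be checked with a short sketch. It assumes the logarithm is base 10, which is consistent with the reported intercepts; the function name is hypothetical, and small differences from the report's figures are expected because the published coefficients are rounded.

```python
def exponential_form(log_b0: float, log_b1: float):
    """Convert log10(Y) = log_b0 + log_b1 * T into Y = b0 * b1**T.

    Returns (b0, b1, daily_rate_percent), where the rate is (b1 - 1) * 100.
    """
    b0 = 10 ** log_b0
    b1 = 10 ** log_b1
    return b0, b1, (b1 - 1) * 100

# Coefficients as reported for the two regions' exponential trend models.
asia = exponential_form(2.383, 0.00653)
north_america = exponential_form(3.2840, -0.00127)

for name, (b0, b1, rate) in (("Asia", asia), ("North America", north_america)):
    print(f"{name}: b0 = {b0:.2f}, b1 = {b1:.5f}, daily rate = {rate:+.2f}%")
```

A positive rate corresponds to compound growth and a negative rate to compound decrease, which matches the sign of the slope in the linear format.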

2 Recommended Trend Models

The coefficient of determination (R Square) will be used to determine the most suitable trend model from the regression outputs. The higher the coefficient of determination, the more of the total variation in the number of deaths can be explained, which makes the model better for estimating the number of deaths due to COVID-19.

For Region A, it can be seen in the figure that the exponential trend model has the highest coefficient of determination, which means it is the most suitable model for predicting the total number of deaths due to COVID-19 in Region A, as it will produce fewer errors.

b Region B: North America

Figure 12 Coefficient of determination of linear and exponential trend models of NA (%).

For Region B, the linear trend model has a slightly higher coefficient of determination than the exponential trend model; hence, it is the most suitable model for predicting the total number of deaths due to COVID-19 in Region B, as it will produce fewer errors.

3 Predict the number of deaths on September 28, September 29, and September 30.

Figure 13 Forecasted number of deaths on September 28, 29, 30 in Asia.

b Region B: North America

As concluded above, the linear trend is the best model to predict the number of deaths due to COVID-19 in North America, with the formula:

$\hat{Y} = 2056.42 - 5.37\,T$

Figure 14 Forecasted number of deaths on September 28, 29, 30 in North America.
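The forecasting step can be sketched as follows. The time coding follows the report's intercept interpretation (T = 0 on March 31, 2020, so April 01 is T = 1), and the coefficients are the rounded ones reported for the Asia exponential model and the North America linear model. The function names are hypothetical, and the results may differ slightly from the values in Figures 13 and 14 because of rounding.

```python
from datetime import date

T0 = date(2020, 3, 31)  # T = 0, per the report's interpretation of the intercept

def time_index(day: date) -> int:
    """Number of days since March 31, 2020 (April 01 gives T = 1)."""
    return (day - T0).days

def forecast_asia(t: int) -> float:
    """Exponential trend for Asia: log10(Y) = 2.383 + 0.00653 * T."""
    return 10 ** (2.383 + 0.00653 * t)

def forecast_north_america(t: int) -> float:
    """Linear trend for North America: Y = 2056.42 - 5.37 * T."""
    return 2056.42 - 5.37 * t

for d in (28, 29, 30):
    day = date(2020, 9, d)
    t = time_index(day)
    print(f"{day} (T = {t}): Asia = {forecast_asia(t):.0f}, "
          f"North America = {forecast_north_america(t):.2f}")
```

These dates lie about two months beyond the last observation (July 31, T = 122), so the forecasts are extrapolations of the fitted trends rather than estimates supported by data from that period.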

PART 6: TIME SERIES CONCLUSION

a Line chart

Figure 15 Line graph of daily total number of deaths due to COVID-19 in Asia and North America from April 01 to July 31, 2020.

b Explanation

The line graph above presents the daily total number of deaths in Asia and North America due to COVID-19 from April 01 to July 31, 2020. It can be observed that the number of deaths in Asia is more stable and significantly lower than in North America, although Asia is the region where the pandemic first spread. The Asian series shows irregular components on two occasions, on 15 April and on 17 June, and increases steadily from 24 June to 29 June. The North American series, on the other hand, fluctuates heavily: the number of deaths peaks on 15 April and then moves downward, with a cyclical component of 7 days, until the end of the observation period. The region also shows an irregular component on 24 June, when the number of deaths is higher than on any nearby day.

Relating to Part 5.3, Asia and North America do not follow the same trend model for predicting the number of deaths due to COVID-19: the exponential trend model is used for Asia and the linear trend model for North America.

To reach this conclusion, our team compared the coefficients of determination (R Square), because the higher the coefficient of determination, the smaller the error and the more of the total variation in the number of deaths can be explained. The R Square of the exponential trend model for Asia is the highest among the Asian models (80.6%); similarly, the R Square of the linear trend model for North America (7.9%) is higher than that of the other North American model. In conclusion, we would use the exponential trend model to predict the total number of deaths in the world, since its R Square is larger than that of the linear trend model for North America (80.6% > 7.9%), meaning that 80.6% of the variation in the dependent variable (number of deaths from COVID-19) can be explained by the exponential trend model.

PART 7: OVERALL TEAM CONCLUSION

7.1 Main factors impacting the total number of deaths

Based on Part 3, the multiple regression analysis of the Asia region indicates that there is only one significant independent variable that may affect the total number of deaths due to COVID-19, namely average rainfall (in mm), at the 95% level of confidence. Based on the regression

Date posted: 05/12/2022, 06:30
