RMIT International University Vietnam
ASSIGNMENT COVER PAGE
Title of Assignment Team Assignment Report 3A
Student name - Student number Ho Trong Dat - S3804678
Do Hoai Viet - S3750310
Phan Minh Dang Khoa - S3818139
Tu Huu Phuc - S3812120
Part 6 (Half); Part 7 (2 questions)
Viet Do S3750310: Part 1 (Find 1 dataset); Part 5 (All); Assignment 3B (Powerpoint + Edit)
Phuc Tu S3812120: Part 1 (Find 3 datasets + content); Part 4 (All); Part 7 (2 questions); Assignment 3B (Question 1, 3 + Presentation)
Dat Ho S3804678: Part 1 (Find 1 dataset); Part 6 (Half); Assignment 3B (Question 2 + Presentation)
PART 1: DATA COLLECTION:
During data collection, we consulted several reliable sources, such as the WHO and the World Bank, and gathered secondary data for most countries in two regions, Asia and Europe & European Union, on six variables:
- Number of COVID-19 deaths (between January 22 and April 23, 2020) (Our World In Data 2020)
- Average temperature (in Celsius), calculated from data for 1991 to 2016 (World Bank Group 2020)
- Average rainfall (in mm), calculated from data for 1991 to 2016 (World Bank Group 2020)
- Population (in 1,000s), using 2018 data (The World Bank 2019)
- Hospital beds (per 10,000 people), using the latest available data (WHO 2020)
- Medical doctors (per 10,000 people), using the latest available data (WHO 2020)
However, because of various national issues, mostly relating to the sovereignty recognition of a few countries, data are still missing for those nations. To solve this problem, we applied data cleansing, which adjusts or rejects missing and poor-quality data and therefore enhances the reliability of the final test results (Gschwandtner et al. 2014), especially when building a regression model as in this research.
As a result of this cleansing process, we obtained new, well-qualified datasets without any missing data, which ensures more reliable output for the final regression model:
- Asia: 32 countries (after removing 3: Hong Kong, Macao, Taiwan).
- Europe & European Union: 42 countries (after removing 11: Faroe Islands, Gibraltar, Guernsey and Alderney, Jersey, Kosovo, Liechtenstein, Vatican City, Svalbard and Jan Mayen Islands, San Marino, the Isle of Man, Moldova).
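The cleansing rule described above (reject any country that lacks a value for one of the six variables) can be sketched as follows. This is only an illustration under our own assumptions: the field names, the use of None to mark a missing value and the two sample rows are hypothetical and are not taken from our datasets.

```python
# Hypothetical field names for the six variables; None marks a missing value.
REQUIRED = ("deaths", "temperature", "rainfall", "population", "beds", "doctors")

def drop_incomplete(records):
    """Keep only the countries that have a value for every required variable."""
    kept, removed = [], []
    for row in records:
        if all(row.get(key) is not None for key in REQUIRED):
            kept.append(row)
        else:
            removed.append(row["country"])
    return kept, removed

# Two made-up rows: country "B" has no temperature value and is rejected.
sample = [
    {"country": "A", "deaths": 10, "temperature": 25.1, "rainfall": 150.0,
     "population": 5000, "beds": 20.0, "doctors": 8.0},
    {"country": "B", "deaths": 3, "temperature": None, "rainfall": 90.0,
     "population": 700, "beds": 35.0, "doctors": 30.0},
]
clean, dropped = drop_incomplete(sample)
print(len(clean), dropped)
```

Returning the list of removed countries alongside the cleaned data makes it easy to report which countries were excluded, as in the list above.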
PART 2: DESCRIPTIVE MEASURES:
Using the cleaned data on COVID-19 deaths from Part 1, we can analyze the descriptive measures for the two regions. In general, the number of deaths due to COVID-19 is higher in the European region than in Asia, but the relative variation in deaths between countries is greater in Asia than in Europe & European Union.
a. Measure of Central Tendency:
The mode cannot be used here because the death counts vary too much across countries, but the other two statistics can both represent central tendency. Despite the impact of outliers (7 in Asia and 9 in Europe & European Union), the mean still seems to be the better statistic, because the median is affected more strongly by the unusual distribution, especially as 12 Asian countries recorded no deaths from COVID-19 (over one-third of the data set). With this choice, the average number of deaths in Europe & European Union countries is considerably greater than that in Asia (2329.08 deaths vs 215.37 deaths). In other words, the average number of COVID-19 deaths in Asian countries is more than 10 times lower than in Europe & European Union countries.
Table 1: Measures of Central Tendency of COVID-19 deaths in Asia and Europe & European Union
b. Measure of Variance:
IQR (cases of death): 71.5 (Asia), 491.5 (Europe & European Union)
Statistically, the IQR and the range are not ideal for reflecting variance because they do not describe the whole distribution. Although the standard deviation is usually used to represent variance because it takes every value in the set into account, it is not suitable here: it is an absolute measure, and the means of Asia and Europe & European Union are vastly different (more than 10 times apart). Consequently, the coefficient of variation is the best choice, since it is a relative measure that allows an accurate comparison no matter how different the means are. With this choice, we conclude that the variability in the number of deaths between Asian countries is much greater than that between European nations (364.82% vs 263.31%). Specifically, deaths in Asian nations are more widely dispersed around their average than those in Europe & European Union.
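To illustrate why a relative measure is preferred when the means differ widely, the sketch below computes the coefficient of variation for two made-up data sets with the same relative spread but means 100 times apart. The data are hypothetical, and the use of the sample standard deviation is our assumption, since the report does not state which form was used. The last lines back out the standard deviations implied by the reported means and coefficients of variation.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation expressed as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100

# Hypothetical data sets: same relative spread, means 100 times apart.
small_scale = [2, 4, 6]
large_scale = [200, 400, 600]

# The standard deviations differ by a factor of 100 ...
print(statistics.stdev(small_scale), statistics.stdev(large_scale))
# ... but the coefficients of variation are identical.
print(coefficient_of_variation(small_scale), coefficient_of_variation(large_scale))

# Standard deviations implied by the reported figures (SD = CV x mean / 100).
implied_sd_asia = 364.82 * 215.37 / 100
implied_sd_europe = 263.31 * 2329.08 / 100
print(round(implied_sd_asia, 2), round(implied_sd_europe, 2))
```

The implied standard deviation is far larger for Europe & European Union even though its coefficient of variation is smaller, which is exactly the situation in which comparing absolute spreads would be misleading.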
c. Measure of Shape:
Even though the box-and-whisker plot and the mean-and-median comparison give the same result for skewness, the graphical approach is still better for analysis, as it not only shows the skewness in detail but also reveals the distribution of data across the four quarters, which gives viewers a deeper understanding of the features of each set. For example, although both regions are right-skewed, the box for Europe & European Union is much longer, which indicates a wider spread of the middle 50% of data in this region than in Asia. From this plot, we infer that both regions are right-skewed, which means that more than 50% of Asian countries have fewer COVID-19 deaths than the mean of 215.37, while fewer than 50% of countries in Europe & European Union have more deaths than the mean of 2329.08.
Table 2: Measures of Variance of COVID-19 deaths in Asia and Europe & European Union
Graph 1: Box-and-Whisker plot of Asia and Europe & European Union
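The two ways of judging shape discussed in this section, the mean-and-median comparison and the five-number summary behind a box-and-whisker plot, can be sketched as below. The death counts are made up for illustration, and the quartile rule is Python's default ("exclusive") method, which is an assumption on our part: other software may use a slightly different rule and give slightly different quartiles.

```python
import statistics

def skew_direction(values):
    """Classify skewness by comparing the mean with the median."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "right-skewed"
    if mean < median:
        return "left-skewed"
    return "symmetric"

def five_number_summary(values):
    """Minimum, Q1, median, Q3 and maximum, as drawn in a box-and-whisker plot."""
    q1, q2, q3 = statistics.quantiles(values, n=4)  # default 'exclusive' method
    return min(values), q1, q2, q3, max(values)

# Hypothetical death counts: many zeros and one very large value.
deaths = [0, 0, 0, 3, 5, 9, 40, 900]
print(skew_direction(deaths))
print(five_number_summary(deaths))
```

A few very large values pull the mean far above the median, which is the pattern described above for both regions.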
PART 3: MULTIPLE REGRESSION:
As mentioned in Part 1, the collecting and cleansing steps gave us two sets of data, from 32 Asian and 41 European countries, for building the regression models. With these models, we can estimate the change in the number of COVID-19 deaths when the tested predictors change. Specifically, our purpose in the Multiple Regression part is to find out:
- Whether the 5 independent variables (average temperature, average rainfall, population, hospital beds and medical doctors) have a significant influence on the dependent variable (COVID-19 deaths)
- How those independent variables impact the dependent variable (negative/positive, strong/weak)
To fulfil those purposes, the backward elimination procedure is used to remove all insignificant variables. The reason for using this method is that the variability of the error is affected by the number of predictors, which can be explained by the mutual interactions between independent variables that make the regression model inaccurate (Cai & Hayes 2007). Consequently, by eliminating variables one by one, backward elimination can effectively remove those interactions, which improves the accuracy of the final regression model.
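The one-by-one elimination described above can be sketched in plain Python as follows. This is a minimal illustration and not the software procedure used for our regression output: the function names and the tiny data set are hypothetical, and the p-values use a normal approximation to the t distribution, which is an assumption that is only rough for samples as small as ours.

```python
from statistics import NormalDist

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def ols_p_values(columns, y):
    """Fit y on an intercept plus the given columns; return one p-value per slope."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    xtx = [[sum(X[r][a] * X[r][b] for r in range(n)) for b in range(k)] for a in range(k)]
    xty = [sum(X[r][a] * y[r] for r in range(n)) for a in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[a][b] * xty[b] for b in range(k)) for a in range(k)]
    resid = [y[r] - sum(X[r][a] * beta[a] for a in range(k)) for r in range(n)]
    sigma2 = sum(e * e for e in resid) / (n - k)  # residual variance
    p_values = []
    for a in range(1, k):  # skip the intercept
        t = beta[a] / (sigma2 * inv[a][a]) ** 0.5
        # Assumption: normal approximation instead of the exact t distribution.
        p_values.append(2 * (1 - NormalDist().cdf(abs(t))))
    return p_values

def backward_elimination(data, y, alpha=0.05):
    """Drop the least significant predictor until all remaining ones pass alpha."""
    names = list(data)
    while names:
        p = ols_p_values([data[name] for name in names], y)
        worst = max(range(len(names)), key=lambda i: p[i])
        if p[worst] <= alpha:
            break
        names.pop(worst)
    return names

# Hypothetical data: y depends on "population" only; "rainfall" is unrelated.
data = {"population": [1, 2, 3, 4, 5, 6, 7, 8],
        "rainfall": [6, 4, 6, 4, 4, 6, 4, 6]}
y = [5.5, 6.5, 8.5, 11.5, 13.5, 14.5, 16.5, 19.5]
print(backward_elimination(data, y))
```

Each pass refits the model without the dropped variable, so the p-values of the remaining predictors are always judged in the reduced model rather than in the full one.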
After applying this method, our team eliminated the insignificant predictors and reached final models for the two regions that contain only variables significant at the 5% level of significance:
1 Asia:
a. Regression output:
b. Equation:
COVID19 deaths (y-hat) = -19.551 + 0.002(Population)
- In which Units are:
Estimated COVID19 deaths (cases)
Population (1000s)
c. Regression coefficients:
b1 = 0.002 indicates that the estimated number of deaths increases by 0.002 cases for every increase of 1,000 people in population.
b0 = -19.551 shows that when the population is zero, the estimated number of deaths due to COVID-19 is -19.551 cases. However, this interpretation makes no sense in this case because the number of deaths cannot be negative and there cannot be deaths in a country with no people.
* From this equation, we infer that:
- Population has a significant influence on the COVID-19 deaths in each country (p-value = 0.000 < 0.05 = level of significance)
- There is a positive relation (0.002 is positive) between COVID-19 deaths and population
d. Coefficient of determination:
R square = 0.631 indicates that about 63.1% of the variation in COVID-19 deaths can be explained by the variation in the population of a country; the remaining 36.9% of the variation is influenced by other factors.
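As a small worked example, the Asian equation can be turned into a function. The coefficients come from the equation above; the function name and the population value are hypothetical and chosen only for illustration.

```python
def predict_asia_deaths(population_thousands):
    """Estimated COVID-19 deaths from the Asian equation (population in 1,000s)."""
    return -19.551 + 0.002 * population_thousands

# Hypothetical country with 50 million people, i.e. 50,000 in units of 1,000.
hypothetical_population = 50_000
print(predict_asia_deaths(hypothetical_population))

# Population (in 1,000s) at which the estimate crosses zero.
break_even = 19.551 / 0.002
print(break_even)
```

For this hypothetical country the estimate is about 80 deaths. The estimate only becomes positive above roughly 9,775.5 thousand people, which illustrates why the negative intercept has no practical interpretation and why the equation should not be applied to very small populations.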
2 Europe & European Union:
a. Regression output:
b. Equation:
COVID19 deaths (y-hat) = -11669.702 + 0.142(Population) + 90.87(Average rainfall) + 579.428(Average temperature)
- In which Units are:
Estimated COVID19 deaths (cases)
Population (1000s)
Average rainfall (mm)
Average temperature (Celsius)
c. Regression coefficients:
b1 = 0.142 shows that COVID-19 deaths will increase, on average, by 0.142 deaths for every increase of 1,000 people in population, holding average rainfall and average temperature constant.
b2 = 90.87 shows that COVID-19 deaths will increase, on average, by 90.87 deaths for every 1 mm increase in average rainfall, holding average temperature and population constant.
b3 = 579.428 shows that COVID-19 deaths will increase, on average, by 579.428 deaths for every 1 degree Celsius increase in average temperature, holding average rainfall and population constant.
b0 = -11669.702 shows that when average rainfall, average temperature and population are all zero, the estimated number of deaths due to COVID-19 is -11669.702. However, this is meaningless, because a country cannot have zero population and the number of deaths cannot be negative; hence there is no meaningful interpretation of this intercept.
* From this equation, we infer that:
- Population, average rainfall and average temperature have significant influences on the COVID-19 deaths in each country (p-value (population) = 0.000 < 0.05; p-value (average rainfall) = 0.028 < 0.05; p-value (average temperature) = 0.005 < 0.05)
- There are positive relations (0.142, 90.87 and 579.428 are all positive) between COVID-19 deaths and population, average rainfall and average temperature
d. Coefficient of determination:
R square = 0.417 indicates that about 41.7% of the variation in COVID-19 deaths can be explained by the variation in the average rainfall, the average temperature and the population of a country; the remaining 58.3% of the variation is influenced by other factors.
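The European equation can be checked in the same way. The coefficients come from the equation above, while the function name and the input values for the example country are hypothetical.

```python
def predict_europe_deaths(population_thousands, rainfall_mm, temperature_c):
    """Estimated COVID-19 deaths from the European equation."""
    return (-11669.702
            + 0.142 * population_thousands
            + 90.87 * rainfall_mm
            + 579.428 * temperature_c)

# Hypothetical country: 10 million people, 60 mm average rainfall, 10 Celsius.
base = predict_europe_deaths(10_000, 60, 10)
# Same country with the average temperature one degree higher.
warmer = predict_europe_deaths(10_000, 60, 11)
print(base, warmer - base)
```

With the other two variables held constant, the difference between the two estimates equals the temperature coefficient (579.428), which is the "holding constant" interpretation of b3 given above.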
PART 4: TEAM REGRESSION CONCLUSION:
1 Do both the models have the same significant independent variable/s?
Based on the final regression models for the two regions, there is a clear difference in the significant variables. In particular, by applying the backward elimination method (see the 5 models and hypothesis tests in the Appendix), we eliminated 4 insignificant variables for Asia and 2 insignificant variables for Europe & European Union. Consequently, in the final models Asia has only one significant variable, population, while Europe & European Union has 3 significant independent variables: average temperature, average rainfall and population. Thus, the two models have different significant variables.
In terms of scientific evidence, population appears in both models, showing a close positive relation between population and number of deaths, which can be explained by many intermediate factors, especially the number of cases. Specifically, a crowded population encourages the invasion of infectious diseases as pathogens increase (Dobson & Carper 1996). As a result, as Donaldson and his colleagues (2009) showed, a more crowded area is likely to have a higher number of infectious cases and hence possibly more deaths, if the death rate is the same internationally. Another explanation is that a larger population may result in less individual care and overwhelmed health services. Taking Wuhan three months ago as a typical example, all hospitals there were overcrowded and the number of deaths accelerated exponentially (Li et al. 2020). So, most of the scientific evidence supports the population effect in our final regression models.
Regarding the remaining variables, the European model implies a positive correlation between the number of deaths and average temperature. However, it is widely acknowledged that the viability of the coronavirus is lower at higher temperatures (Chan et al. 2011). In other words, this finding suggests a negative relation between average temperature and the number of COVID-19 deaths, since higher temperatures discourage the development of this virus. Similarly, in the same study, Chan and his colleagues (2011) reported a negative relationship between the stability of the coronavirus and humidity. Their results therefore also contradict the positive correlation between the number of deaths and average rainfall found in our final model. Therefore, the positive relations of these two variables with deaths are not supported by scientific evidence.
2 Which region is more impacted due to this pandemic?
Based on the equations of our final regression models, we conclude that COVID-19 has more impact on Europe & European Union than on Asia, judging by the slopes, which summarize the change in deaths resulting from a change in each variable. For illustration, for the 'population' variable, the b1 value in Asia is 0.002, which is extremely small compared with the slope of 0.142 in Europe & European Union, about 70 times larger. As a result of this large difference, even though the population of Asia is 5 times greater than that of Europe & European Union (The World Bank 2019), the European nations are more impacted by population because of their much larger slope (1).
In addition, Asia is not significantly influenced by average rainfall and average temperature, since these two variables do not appear in its equation, whereas they strongly affect the number of deaths in European countries (b2 = 90.87, b3 = 579.428). Once again, Europe & European Union is more impacted by average rainfall and average temperature (2).
From (1) and (2), we infer that the pandemic has more influence on Europe & European Union than on Asia. This finding is supported by the descriptive measures, in which the average number of deaths in Europe is more than 10 times higher than in Asia (Central Tendency).
* Non-technical conclusion: To sum up, from the regression output, we infer that the number of COVID-19 deaths in European nations is affected by average temperature, average rainfall and population, while deaths in Asia are influenced only by population. Moreover, from the regression equations and descriptive measures, we conclude that European countries are more impacted by the pandemic than their Asian counterparts.
PART 5: TIME SERIES:
In this part, we collect data on COVID-19 deaths in Asia and Europe & European Union between February 15, 2020 and April 30, 2020. Based on this dataset, we build trend models and choose the best one for predicting the number of COVID-19 deaths in the future using time series:
1 Asia:
After running the hypothesis tests (see the Appendix), we infer that the Quadratic (QUA) trend model is not significant, so only two significant models remain for Asia, with the regression outputs and formulas below:
a. Regression output:
- Linear (LIN) trend model:
- Exponential (EXP) trend model:
b. Formula:
EXP (in linear format): Log(Ŷ) = 1.761 + 0.012T
EXP (in non-linear format): Ŷ = 57.677 × 1.028^T
Table 3 Formula of significant models in Asia
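The relation between the two forms of the exponential model can be sketched as follows: a straight line is fitted to the base-10 logarithm of the deaths, and the intercept and slope are then converted back with powers of 10. The function names and the short data series are hypothetical, and the base-10 logarithm is our reading of "Log", which is consistent with the coefficients in Table 3.

```python
import math

def fit_line(t, y):
    """Least-squares intercept and slope of y on t."""
    n = len(t)
    t_bar, y_bar = sum(t) / n, sum(y) / n
    sxy = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    slope = sxy / sxx
    return y_bar - slope * t_bar, slope

def fit_exponential(t, y):
    """Fit log10(y) = a + b*t and return (a, b, 10**a, 10**b)."""
    a, b = fit_line(t, [math.log10(v) for v in y])  # requires every y > 0
    return a, b, 10 ** a, 10 ** b

# Hypothetical series that doubles every day: deaths = 5 * 2**T.
days = [1, 2, 3, 4, 5]
deaths = [10, 20, 40, 80, 160]
print(fit_line(days, deaths))         # linear trend: intercept, slope
print(fit_exponential(days, deaths))  # exponential trend in both forms

# Converting the coefficients reported for Asia from the log form.
print(10 ** 1.761, 10 ** 0.012)
```

The last line gives values close to 57.677 and 1.028, which is how the two rows of Table 3 relate to each other. Because the logarithm is undefined for zero, days with no deaths would have to be handled separately before fitting this form.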
Based on the regression output, we can compare R-square values to choose the best model for predicting the number of COVID-19 deaths in Asia. Specifically, the R-square of the Exponential trend model is 67.3%, which is higher than that of the other significant trend model (38.6%). Thus, we recommend the Exponential (EXP) trend model for estimating future deaths in Asia, because it has the smallest error among the models. We therefore also use this model to forecast the number of deaths due to COVID-19 in Asia on May 29, May 30 and May 31, as in the table below:
Table 4 Predicted deaths on May 29, May 30, May 31 in Asia
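The forecast for a given date can be sketched as below. The coefficients are those of the Asian exponential model; the function names are hypothetical, and the time index is assumed to start at T = 1 on February 15, 2020 (the first day of the series), which makes May 29 correspond to T = 105.

```python
from datetime import date

START = date(2020, 2, 15)  # assumption: first day of the series is T = 1

def time_index(day):
    """Number of the day in the series, counting February 15, 2020 as 1."""
    return (day - START).days + 1

def asia_exp_forecast(day):
    """Predicted deaths from the Asian exponential trend model."""
    return 57.677 * 1.028 ** time_index(day)

for day in (date(2020, 5, 29), date(2020, 5, 30), date(2020, 5, 31)):
    print(day.isoformat(), time_index(day), round(asia_exp_forecast(day)))
```

Because the model is exponential, each additional day multiplies the forecast by 1.028, so forecasts far beyond the observed period grow very quickly and should be read with caution.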
2 Europe & European Union:
After running the hypothesis tests (see the Appendix), we infer that the Quadratic (QUA) trend model is not significant, so only two significant models remain for Europe & European Union, with the regression outputs and formulas below:
a. Regression output:
- Linear (LIN) trend model:
- Exponential (EXP) trend model:
b. Formula:
EXP (in linear format): Log(Ŷ) = -2.717 + 0.112T
EXP (in non-linear format): Ŷ = 0.002 × 1.294^T
Table 5 Formula of significant models in Europe & European Union
With this regression output, we again use R-square to select the best model. This time, the Linear (LIN) trend model has the highest R-square at 71.5% (compared with 50.1% for the Exponential trend model). Consequently, we recommend using the Linear trend model for predicting the number of deaths due to COVID-19 in Europe & European Union. Based on this model, we also estimate the number of COVID-19 deaths in Europe & European Union on May 29, May 30 and May 31, as in the table below:
Model: Ŷ = -730.218 + 64.340T
May 29: ≈ 6025; May 30: ≈ 6089; May 31: ≈ 6154
Table 6 Predicted deaths on May 29, May 30, May 31 in Europe & European Union
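The values in Table 6 can be reproduced with a short sketch. The coefficients are those of the European linear model; the function names are hypothetical, and the time index is assumed to start at T = 1 on February 15, 2020, an assumption that matches the tabulated figures.

```python
from datetime import date

START = date(2020, 2, 15)  # assumption: first day of the series is T = 1

def time_index(day):
    """Number of the day in the series, counting February 15, 2020 as 1."""
    return (day - START).days + 1

def europe_linear_forecast(day):
    """Predicted deaths from the European linear trend model."""
    return -730.218 + 64.340 * time_index(day)

for day in (date(2020, 5, 29), date(2020, 5, 30), date(2020, 5, 31)):
    print(day.isoformat(), time_index(day), round(europe_linear_forecast(day), 3))
```

Under this assumption the three dates correspond to T = 105, 106 and 107 and give about 6025.5, 6089.8 and 6154.2 deaths, so the whole numbers in Table 6 appear to be truncated rather than rounded.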
PART 6: TEAM TIME SERIES CONCLUSION:
1 Line charts of number of deaths in two regions:
Graph 2 A line chart of number of deaths in Europe & European Union
Graph 3 A line chart of number of deaths in Asia
2 Comment on trend models and line charts:
Based on our analysis in Part 5, we conclude that both regions have the same significant trend models: the Linear (LIN) trend model and the Exponential (EXP) trend model. However, the best model for predicting deaths in Asia is the Exponential trend model, while for Europe & European Union it is the Linear trend model. Both chosen models show an increasing trend, since b1 (= 1.028) in the Asian Exponential trend model is greater than 1 and b1 (= 64.340) in the European Linear trend model is positive.
Moreover, the line graphs above show complicated fluctuations. For illustration, the European chart shows a steady upward trend until April 4, after which it changes unpredictably (intermittent increases and decreases) for the rest of the period. On the other hand, the chart for Asia shows stable growth over time, except on April 17, when an irregular spike appears due to an unexpected event. According to Worldometers, this 'unexpected' event derived from a change in the counting method, which made the number of deaths rise dramatically. These deaths did not occur on a single day in China but were reported over a long period, so the spike is not remarkable.
3 Best trend model:
From the two best trend models for Asia and Europe & European Union, we choose the better one for forecasting worldwide COVID-19 deaths. Specifically, the R-square of the Asian model is 67.3%, which is smaller than the 71.5% of the European model. Moreover, the p-value of the Linear trend model in Europe & European Union is much greater than that of the Exponential trend model. As a result, the Linear trend model of Europe & European Union has less error than its counterpart, so it is chosen for estimating global deaths due to the pandemic, which we discuss further in Part 7.
*Non-technical conclusion: Generally, COVID-19 deaths in both regions show an increasing trend, but the trend in Europe & European Union seems to be more unpredictable. In addition, based on the time series analysis, the European trend model is the more suitable one for predicting COVID-19 deaths worldwide.
PART 7: TEAM OVERALL CONCLUSION:
To recapitulate, from Part 4, we infer that the number of deaths due to the COVID-19 pandemic and population have a strong positive relationship according to the final regression model for Asia, which can be explained by many intermediate factors in the scientific literature. In addition, the average rainfall, average temperature and population of each country in Europe & European Union are also positively related to the number of deaths in this region, although scientific evidence does not support the effects of rainfall and temperature. Specifically:
+ Region A: Asia:
COVID19 deaths (y-hat) = -19.551 + 0.002(Population)
+ Region B: Europe & European Union:
COVID19 deaths (y-hat) = -11669.702 + 0.142(Population) + 90.87(Average rainfall) + 579.428(Average temperature)
From Part 6, we have already chosen the best model for predicting COVID-19 deaths. To be more specific, the Linear (LIN) trend model of Europe & European Union is the most suitable because it has the highest R-square, implying the least error among the trend models. Based on this model, we can predict the number of deaths in the world on June 30, 2020:
Table 7 Predicted deaths on June 30 from best model
With this calculation, we predict that world deaths will be around 8084 cases on June 30. In addition, Murray (2020) also predicted that the number of deaths would accelerate rapidly in May, June and July, which concurs with our findings.
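The figure above can be traced by hand, assuming the same daily index as in Part 5 with T = 1 on February 15, 2020, so that June 30 corresponds to T = 137 (15 days in February, 31 in March, 30 in April, 31 in May and 30 in June):

```latex
\hat{Y}_{\text{June 30}} = -730.218 + 64.340 \times 137 = -730.218 + 8814.580 = 8084.362 \approx 8084
```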
Likewise, we also choose the Linear (LIN) trend model for predicting the number of deaths at the end of 2020 using the daily time series. The slope b1 is positive, which implies a stable upward trend over the time series. For this reason, we forecast a constant upward trend in COVID-19 deaths, meaning that they will continue to increase until the end of 2020. However, in major recent studies, deaths due to the pandemic were estimated to peak in July before dropping significantly by the end of the year (Murray 2020), which contradicts the result of our final model.
For further discussion, it is notable that recent research has given inaccurate estimates of worldwide COVID-19 deaths (Appolonia & Barranco 2020). This discrepancy comes from the complicated situation, especially the distinctive policies in each nation (Dowd et al. 2020). For example, after social distancing policies, which prevented the spread of the coronavirus, had been imposed, deaths suddenly fell and the earlier estimates became incorrect. However, the positive result of this policy made governments complacent and led them to relax their pandemic policies, which once again created an ideal environment for the coronavirus to spread, so the calculations became inaccurate again as deaths accelerated. For this reason, our predictions may also prove incorrect in the future, as other professional research has. Additionally, given the dependence of COVID-19 deaths on government intervention, we strongly recommend that governments maintain these policies to prevent deaths from increasing again.
Regarding the variables, further investigation is needed to find reliable significant factors, like population, that truly affect the number of deaths, which would improve the accuracy of predictions of COVID-19 deaths. For example, the number of people over 65 in the population structure and the numbers of males and females are notable variables that may affect COVID-19 deaths. In particular, according to researchers (Sharon 2020), the coronavirus is known as an unequal-opportunity killer, which means that the older people are, the more likely they are to die if they catch it. By way of explanation, being elderly, having a weaker immune system and worse overall health, or possibly already having other chronic illnesses, leads to a high risk of mortality from the disease. On the other hand, data from the China CDC showed that 106 men had the disease for every 100 women. Furthermore, the WHO mission (2020) reported that 51% of cases were male, while a study in Wuhan found that about 58% of the patients were male. In addition, an update written by researchers in JAMA revealed a slight predominance of male deaths in this pandemic. On the basis of those figures, men have a higher probability of mortality than women because of the higher number of cases. Therefore, the numbers of male and female deaths from COVID-19 should be part of the discussion. To sum up, with the various data sources available on the Internet, further research should collect such data and build regression models as in our research to obtain better estimates of COVID-19 deaths.
Reference:
Appolonia, A & Barranco, V 2020, ‘Why COVID-19 predictions will always be wrong’, Business Insider, 30 April, viewed 22 May 2020, <https://www.businessinsider.com/covid-19-death-predictions-analysis-modeling-pandemic-2020-4>.
Chan, KH, Peiris, JSM, Lam, SY, Poon, LLM, Yuen, KY & Seto, WH 2011, ‘The Effects of Temperature and Relative Humidity on the Viability of the SARS Coronavirus’, Advances in Virology, vol. 2011, pp. 1-7.
Donaldson, LJ, Rutter, PD, Ellis, BM, Greaves, FE, Mytton, OT, Pebody, RG & Yeardley, E 2009, ‘Mortality from pandemic A/H1N1 2009 influenza in England: public health surveillance study’, BMJ, vol. 339