Group 6
Contents
Group Members and Contribution
Abbreviation
Part 1: Data Collection
Part 2: Descriptive Analysis
a Test of Outliers
b Measure of Central Tendency
c Measure of Variation
d Box-and-Whisker Plot Analysis
Part 3: Multiple Regression
a LI countries regression model
b LMI countries regression model
c UMI countries regression model
d HI countries regression model
Part 4: Team Regression Conclusion
a Conclusion for Part 2
b Conclusion for Part 3 and Part 4
Part 5: Time Series
1 Regression Output for Liberia, Lao, Guyana, and Netherlands
a Liberia
b Lao
c Guyana
d Netherlands
2 Trend Model and Formula for all 4 countries
3 Recommend Trend Model
a Liberia
b Lao
c Guyana
d Netherlands
4 Predict crude death rate for Liberia, Lao, Guyana, and Netherlands year 2018-2020
a Liberia 2018 to 2020
b Lao 2018 to 2020
c Guyana 2018 to 2020
d Netherlands 2018 to 2020
Part 6: Time Series Conclusion
a Line chart
b Best trend model anticipating the crude death rate all over the world
Part 7: Overall Team Conclusion
a Main factors that impact the crude death rate
b Predicted crude death rate in year 2030
c Recommendations
References
Appendices
Appendix A: 143 Countries Data sort by GNI (Low to High)
Appendix A.1: Low Income
Appendix A.2: Low-Middle Income
Appendix A.3: Middle-Upper Income
Appendix A.4: High Income
Appendix B: Backward Elimination process in the regression model
Appendix B.1: LI countries
Appendix B.2: LMI countries
Appendix B.3: UMI countries
Appendix B.4: HI countries
Appendix C: Time Series
Appendix C.1: Significant Trend Models validation process for Liberia, Laos, Guyana, Netherlands
Appendix C.2: Crude death rate prediction for Liberia, Lao, Guyana, and Netherlands for year 2018 to 2020 calculations
Group Members and Contribution
Thai
Abbreviation
Low Income (LI)
Lower Middle Income (LMI)
Upper Middle Income (UMI)
High Income (HI)
Death rate (DR)
Gross national income (GNI)
Domestic general government health expenditure (GGHE-D)
Immunization, measles (IM)
Prevalence of current tobacco use (PCTU)
Part 1: Data Collection
The countries are divided into four categories based on income level (Appendix A). The data set is for the year 2014 and includes 4 variables (Appendix A). The data were collected from the World Bank (n.d.a). Initially, the data set contained 217 countries; however, because information was missing for some of them, we eliminated those and kept only the 143 countries with sufficient information on all 4 variables. We kept all 143 countries instead of narrowing the set further because we want to preserve the original data and prevent bias in the data cleaning process, which yields a more transparent and accurate data set (Šimundić 2013).
Part 2: Descriptive Analysis
a Test of Outliers
Figure 2.a: Test of outliers of the crude death rate in the 4 studied income groups
To evaluate the reliability of the descriptive measures, the table shows the test of outliers. An outlier is an observation that lies far away from the rest of the data (Lumenlearning n.d.). Based on figure 2.a, comparing each minimum with the lower bound and each maximum with the upper bound shows that outliers occur in this data.
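The comparison of the minimum and maximum against the lower and upper bounds can be sketched as follows. This is a minimal illustration, assuming the bounds come from the common 1.5 × IQR rule (the report does not state which rule it used); the function names and the sample values are hypothetical and are not taken from Appendix A.

```python
from statistics import quantiles


def outlier_bounds(values, k=1.5):
    """Return (lower_bound, upper_bound) using the k * IQR rule (assumed rule)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def has_outliers(values, k=1.5):
    """True if the minimum or the maximum lies outside the bounds."""
    lower, upper = outlier_bounds(values, k)
    return min(values) < lower or max(values) > upper


# Hypothetical crude death rates (per 1,000 people), for illustration only.
sample = [5.1, 6.0, 6.4, 6.7, 7.0, 7.3, 7.9, 15.2]
lower, upper = outlier_bounds(sample)
print(f"bounds: [{lower:.3f}, {upper:.3f}], outliers present: {has_outliers(sample)}")
```

Checking only the minimum and maximum is enough to decide whether any outlier exists, which mirrors the comparison described for figure 2.a.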
b Measure of Central Tendency
Figure 2.b: Central tendency of the crude death rate in the 4 studied income groups
Outliers exist in this data, and some of them lie far beyond the upper and lower bounds. Thus, the mean is not a reliable indicator of the center. The median is more useful for measuring the central tendency of the 4 income groups because it gives the approximate middle value of each group (Investopedia 2020). The median is frequently used in place of the mean when outliers skew the data, because it is far less affected by extreme values than the mean. For this reason, the median is applied to measure the central tendency of each income group in this case.
Based on figure 2.b, the medians range from 6.7415 (LMI) to 7.162 (UMI). This means that 50% of UMI countries recorded a crude death rate greater than 7.162 deaths per 1,000 people and the other 50% recorded fewer than 7.162. The same picture holds for the LMI data set: 50% of the countries in this income group recorded more than 6.7415 deaths per 1,000 people and 50% recorded fewer than 6.7415. Overall, the median crude death rates of the upper and lower middle income groups are approximately equal.
c Measure of Variation
Figure 2.c: Measures of variation of the crude death rate in the 4 studied income groups
The range shows the distance from the lowest to the highest value in the data and gives a quick, rough estimate of the spread of values in the data set (ChiliMath n.d.). Based on figure 2.c, the range of the HI countries is the largest, which indicates that the distribution of death rates in the HI group was the most scattered in 2014.
The interquartile range (IQR) is one of the indicators commonly used to measure how widely the data points spread within the data set. The larger the IQR, the more the data points are spread out. Conversely, the smaller the IQR, the more closely the data points gather around the median (Stephanie n.d.).
According to figure 2.c, the IQR of the LI countries is the smallest among the income groups, which suggests that the death rates in the LI group gather around the median, while those of the other income groups are far more spread out. In other words, the death rates of the LI countries lie close to the middle value and are more stable than those of the other groups.
The coefficient of variation is a suitable statistic for measuring the scatter of the data points around the mean (Adam 2020). As figure 2.c shows, the LI group has the smallest coefficient of variation of the income groups (23.64% < 35.93% < 38.06% < 38.10%). This shows that, relative to the mean, the values of the LI countries are less dispersed than those of the other income groups. This agrees with the standard deviation, which shows that the recorded values in the LI group cluster closer to the mean than those of the other income groups. The standard deviation is usually used to examine a single data series; when more than one data set is compared, the coefficient of variation should be considered. For this reason, the coefficient of variation is used to compare the scatter of the 4 income groups. In a nutshell, with the smallest standard deviation and variance, the LI data are more consistent than those of the other income groups.
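The reason for preferring the coefficient of variation when comparing several data sets can be illustrated with a short sketch. It assumes the sample standard deviation is used (the report does not say whether the sample or population formula was applied), and the two groups below are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation expressed as a percentage of the mean."""
    return stdev(values) / mean(values) * 100


# Hypothetical groups: group_b is group_a multiplied by 10.
group_a = [6.0, 7.0, 8.0]
group_b = [60.0, 70.0, 80.0]

for name, group in (("A", group_a), ("B", group_b)):
    print(name, round(stdev(group), 2), round(coefficient_of_variation(group), 2))
```

The standard deviations differ by a factor of ten while the coefficients of variation are equal, which is why the coefficient of variation is the fairer measure when the groups have different means.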
d Box-and-Whisker Plot Analysis
Figure 2.d: Box-and-whisker plots of the crude death rate for the 4 income levels
Based on figure 2.d, only the data distribution of the HI countries is slightly left skewed. For the other 3 income levels, the right whisker is much longer than the left whisker, reflecting the presence of outliers in the data distribution. Furthermore, the minimum and maximum death rates in the LI group were higher than those in the other income groups. For this reason, we could assume that the LI countries suffered more deaths than the other income groups. In addition, the median of LMI was lower than the median of UMI (6.7415 < 7.162), which means that 50% of LMI countries recorded a death rate of no more than 6.7415, while half of the UMI countries recorded fewer than 7.162. Using the box-and-whisker plots, we can draw the general picture that the LI group had a higher death rate than the other income groups.
Part 3: Multiple Regression
Regardless of the income level of the selected country category, every model initially comprises the same variables: the dependent variable DR and the four independent variables GNI, GGHE-D, IM, and PCTU.
The backward elimination method is employed to remove variables that are irrelevant, redundant, or not statistically significant at the 5% level of significance, which enhances the accuracy and quality of the regression models (Ruan et al. 2020).
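The backward elimination loop can be sketched as below. This is only an outline of the procedure under stated assumptions: `fit_p_values` is a hypothetical callable standing in for whatever tool refits the regression and returns one p-value per remaining variable, and the p-values in the toy example are invented for illustration, not taken from Appendix B.

```python
from typing import Callable, Dict, List


def backward_eliminate(
    variables: List[str],
    fit_p_values: Callable[[List[str]], Dict[str, float]],
    alpha: float = 0.05,
) -> List[str]:
    """Drop the least significant variable until every p-value is below alpha."""
    remaining = list(variables)
    while remaining:
        p_values = fit_p_values(remaining)  # refit with the remaining variables
        worst = max(remaining, key=lambda name: p_values[name])
        if p_values[worst] < alpha:
            break  # every remaining variable is significant
        remaining.remove(worst)
    return remaining


# Toy stand-in for a regression fit: fixed, hypothetical p-values.
# A real fit would return new p-values after each variable is removed.
TOY_P = {"GNI": 0.03, "GGHE-D": 0.01, "IM": 0.0004, "PCTU": 0.40}


def toy_fit(names: List[str]) -> Dict[str, float]:
    return {name: TOY_P[name] for name in names}


print(backward_eliminate(["GNI", "GGHE-D", "IM", "PCTU"], toy_fit))
```

If every variable is insignificant, the function returns an empty list, which corresponds to a category with no final regression model.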
a LI countries regression model
Based on Figure B.1.4 of Appendix B.1, the p-values of all variables are much higher than the given level of significance, so we must eliminate all of them (Narin, Isler & Ozer 2013). Hence, no final regression output is constructed for the LI countries data set, which means that no significant relationship was found between DR and the four variables, so DR in LI countries might depend on different factors (Hannerz et al. 2019). In other words, the predictor variables are not statistically significant, we do not reject H0, and there is insufficient evidence in our sample to conclude that a non-zero correlation exists. As a result, there is no scatter plot for LI countries (Hannerz et al. 2019).
b LMI countries regression model
Regression output:
Figure 3.b.1: Final model of LMI countries
Based on Figure B.2.2 of Appendix B.2 and figure 3.b.1, after eliminating the least significant variable, the remaining variables IM, GGHE-D, and GNI are all significant because their p-values (0.0003, 0.006, and 0.044) are below α = 0.05. In other words, the remaining predictor variables are statistically significant and our sample data provide enough evidence to reject H0, so changes in the independent variables are associated with changes in DR (Fauzi 2017).
Scatter plot:
Figure 3.b.2: Scatter plots of final output of LMI countries
Regression Equation: Y = 18.8623 - 0.0011X1 + 0.0183X2 - 0.1264X3
(where Y is the estimated DR and X1, X2, X3 are the independent variables GNI, GGHE-D, and IM)
Interpret the regression coefficients of the significant independent variables:
b1 = -0.0011 shows that for every increase of 1 current US$ in GNI per capita (Atlas method), the DR decreases by 0.0011 deaths per 1,000 people, holding the two remaining factors constant.
b2 = 0.0183 shows that for every increase of 1 international US$ in domestic general government expenditure on health per capita, the DR increases by 0.0183 deaths per 1,000 people, holding the two remaining factors constant.
b3 = -0.1264 shows that for every increase of 1 percentage point in the share of children ages 12-23 months who received the measles vaccination before 12 months or at any time before the survey, the DR decreases by 0.1264 deaths per 1,000 people, holding the two remaining factors constant.
Interpret the coefficient of determination: R2 = 0.381 means that only 38.1% of the variation in DR is explained by the variation in GNI, GGHE-D, and IM; the remaining 61.9% is explained by other factors (Glen n.d.).
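As a worked illustration of how the LMI equation is used, the sketch below evaluates it for one set of inputs. The coefficients are the ones reported above; the function name and the input values (GNI of 2,000, GGHE-D of 100, IM of 85%) are hypothetical and do not describe any country in Appendix A.

```python
def predict_dr_lmi(gni, gghe_d, im):
    """Estimated crude death rate from the LMI regression equation."""
    return 18.8623 - 0.0011 * gni + 0.0183 * gghe_d - 0.1264 * im


# Hypothetical inputs, for illustration only.
estimate = predict_dr_lmi(gni=2000, gghe_d=100, im=85)
print(round(estimate, 4))

# Raising IM by one percentage point changes the estimate by b3 = -0.1264.
change = predict_dr_lmi(2000, 100, 86) - estimate
print(round(change, 4))
```

Because R2 is only 0.381, such an estimate should be read as a rough indication rather than a precise prediction.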
c UMI countries regression model
Regression output:
Figure 3.c.1: Final model of UMI countries
Based on Figure B.3.4 of Appendix B.3 and figure 3.c.1, PCTU is the only significant variable because its p-value = 0.001 < α = 0.05. In other words, the predictor variable is statistically significant and we reject H0, so the sample evidence supports our prediction that PCTU is a potential risk factor for DR in UMI countries (Fauzi 2017).
Scatter plot:
Figure 3.c.2: Scatter plots of final output of UMI countries
Regression Equation: Y = 4.016 + 0.157X1
(where Y is the estimated DR and X1 is the independent variable, the prevalence of current tobacco use)
Interpret the regression coefficient of the significant independent variable:
b1 = 0.157 means that for every increase of 1 percentage point in the share of the population ages 15 years and over who currently use any tobacco product on a daily or non-daily basis, the DR increases by 0.157 deaths per 1,000 people.
Interpret the coefficient of determination: R2 = 0.253 means that only 25.3% of the variation in DR is explained by the variation in PCTU; the remaining 74.7% is explained by other factors (Glen n.d.).
d HI countries regression model
Regression output:
Figure 3.d.1: Final model of HI countries
Based on Figure B.4.2 of Appendix B.4 and figure 3.d.1, after eliminating the least significant variable, the remaining variables PCTU, GNI, and GGHE-D are all significant because their p-values (0.004, 0.010, and 0.018) are below α = 0.05. In other words, the remaining predictor variables are statistically significant and there is enough evidence to reject the null hypothesis, so changes in the independent variables are associated with changes in DR (Fauzi 2017).
Scatter plot:
Figure 3.d.2: Scatter plots of final output of HI countries
Regression Equation: Y = 4.44805 - 0.000076X1 + 0.00116X2 + 0.14636X3
(where Y is the estimated DR and X1, X2, X3 are the independent variables GNI, GGHE-D, and PCTU)
Interpret the regression coefficients of the significant independent variables:
b1 = -0.000076 shows that for every increase of 1 current US$ in GNI per capita (Atlas method), the DR decreases by 0.000076 deaths per 1,000 people, holding the two remaining factors constant.
b2 = 0.00116 shows that for every increase of 1 current international US$ in domestic general government expenditure on health per capita, the DR increases by 0.00116 deaths per 1,000 people, holding the two remaining factors constant.
b3 = 0.14636 shows that for every increase of 1 percentage point in the share of the population ages 15 years and over who currently use any tobacco product on a daily or non-daily basis, the DR increases by 0.14636 deaths per 1,000 people, holding the two remaining factors constant.
Interpret the coefficient of determination: R2 = 0.343 means that only 34.3% of the variation in DR is explained by the variation in GNI, GGHE-D, and PCTU; the remaining 65.7% can be attributed to unknown variables (Glen n.d.).
Part 4: Team Regression Conclusion
1 Do all models have the same significant independent variable/s?
Figure 4.1.1: Significant independent variable(s) of the 4 income levels
According to figure 4.1.1, after applying the backward elimination process, the 4 models do not have exactly the same significant independent variables. However, there are some similarities among three of the income levels. In LMI and HI countries, GGHE-D and GNI are associated with DR (as GGHE-D increases, DR typically increases, and as GNI increases, DR decreases), whereas in UMI and HI countries DR is affected by PCTU (as PCTU increases, DR also increases). Regarding LI countries, it is possible that DR is affected by factors other than the 4 given variables.
2 Which variables have the highest impact on the crude death rate in each country category?
LI countries are affected by unknown variables because, after the backward elimination process, none of the four given variables shows a significant relationship with DR.
Based on figure 4.1.1, UMI countries have only 1 significant independent variable; hence, PCTU is their greatest risk factor for DR.
Figure 4.2.1: Summary information when excluding 1 variable out of 3 variables for LMI countries
LMI countries are affected by 3 different variables (figure 4.1.1); hence, to determine which variable has the highest impact on the crude death rate, we compute the reduction in R2 when 1 variable is excluded. According to figure 4.2.1, the R2 without IM is the lowest (6%), so subtracting it from the R2 of the full LMI model gives the highest reduction in R2 (32.04%) compared to the two remaining variables. Hence, IM has the greatest impact on the crude death rate in LMI countries.
Figure 4.2.2: Summary information when excluding 1 variable out of 3 variables for HI countries
For HI countries, the same process applies because they are also affected by 3 different variables (figure 4.1.1). According to figure 4.2.2, the R2 without PCTU is the lowest (22.67%), which results in the highest reduction in R2 (11.74%) compared to the two remaining variables. Hence, PCTU has the greatest impact on the crude death rate in HI countries.
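The reduction-in-R2 comparison can be sketched as follows. The full LMI R2 (38.1%) and the R2 without IM (6%) are the rounded figures quoted in this report; the R2 values without GNI and without GGHE-D are hypothetical placeholders, since figure 4.2.1 is not reproduced here, and the function names are illustrative.

```python
def r2_reductions(r2_full, r2_without):
    """Map each variable to the drop in R2 when that variable is excluded."""
    return {name: r2_full - r2 for name, r2 in r2_without.items()}


def most_influential(r2_full, r2_without):
    """Variable whose exclusion causes the largest drop in R2."""
    reductions = r2_reductions(r2_full, r2_without)
    return max(reductions, key=reductions.get)


lmi_r2_full = 0.381
lmi_r2_without = {
    "IM": 0.06,       # rounded value from the text
    "GNI": 0.30,      # hypothetical placeholder
    "GGHE-D": 0.28,   # hypothetical placeholder
}

print(most_influential(lmi_r2_full, lmi_r2_without))
print({k: round(v, 3) for k, v in r2_reductions(lmi_r2_full, lmi_r2_without).items()})
```

With the rounded inputs the reduction for IM comes out at 32.1 percentage points; the 32.04% reported above presumably reflects unrounded R2 values.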
3 Conclusion for part 2 and part 4
a Conclusion for Part 2
Because 4 outliers were detected in the four data sets, the Range, Mean, Standard Deviation, and Variance (figure 4) were not suitable for examining the death rates, as they are easily influenced by extreme values and could lead to incorrect conclusions in the data analysis. The Mode was not useful in this case because it does not indicate the center of the distribution well. Additionally, because of the high skewness in all 4 data series (right skewed), the Coefficient of Variation is not well suited to these data sets. Thus, the Interquartile Range and the Median (Central Tendency) are the 2 most suitable measures for these data series, because they are resistant to extreme values and focus on the center of the distribution, affording a detailed insight into the data sets.
To sum up, the box-and-whisker plots give the general picture that the data distributions are right skewed, and they also show the existence of outliers even without a separate test. The minimums and maximums likewise show that, in the same period, the death rate of the Low Income group was higher than that of the other income groups. In the central tendency part, the medians show that 50% of Lower Middle Income countries recorded a death rate of more than 6.7415, while half of the Upper Middle Income countries recorded a death rate of less than 7.162. Moreover, the Interquartile Range of the Low Income group is the smallest among the income groups, which suggests that the death rates in the Low Income group gather around the median, while those of the other income groups are far more spread out.
b Conclusion for Part 3 and Part 4
After computing the reduction in R2 when excluding 1 variable, IM has the greatest impact on the crude death rate for LMI countries, whereas PCTU has the greatest impact for HI countries. Besides, GNI and GGHE-D are the two remaining risk factors for the crude death rate in both of these country categories.
The regression model of LMI countries provides a better crude death rate estimation because it has the highest R2 (38.1%), which shows a stronger relationship between the dependent variable and the independent variables: 38.1% of the variation in the crude death rate of LMI countries is explained by the variation in IM, GNI, and GGHE-D. UMI and HI countries have a lower R2 (25.3% and 34.3%), which means they have weaker relationships. The higher the R2, the greater the capability to forecast or determine the likelihood of future events falling within the predicted outcomes (Zhang 2017), because the data points fall closer to the regression line and hence the predictive ability of the model for the given dependent variable is stronger (Hamilton, Ghert & Simpson 2015). However, the R2 values of the three income levels are still relatively low, which shows that many unknown risk factors influence the crude death rate; hence, the models lack the ability to make reliable predictions (Hamilton, Ghert & Simpson 2015). In other words, we should identify other, more effective variables, such as the prevalence of overweight or underweight and factors related to diet and activity lifestyle (Ritche & Roser 2018).
Part 5: Time Series
1 Regression Output for Liberia, Lao, Guyana, and Netherlands
The process of eliminating invalid trend models is in Appendix C.1
a Liberia
Linear Trend Model
Exponential Trend Model
b Lao
Linear Trend Model
Quadratic Trend Model
Exponential Trend Model
c Guyana
Linear Trend Model
Quadratic Trend Model
Exponential Trend Model
d Netherlands
Linear Trend Model
Exponential Trend Model
2 Trend Model and Formula for all 4 countries
Where:
Ŷ: The crude death rate (per 1,000 people) for the country
T: The time period of the year, with 1995 as the first period
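Since the fitted formulas themselves are not reproduced here, the sketch below shows how a linear and an exponential trend could be estimated by least squares. It assumes the common textbook forms Ŷ = b0 + b1·T and Ŷ = b0·b1^T (the latter fitted by regressing log10(Ŷ) on T) and assumes T = 1 for 1995; the series is hypothetical, not World Bank data, and the function names are illustrative.

```python
import math


def fit_linear(t, y):
    """Least-squares intercept b0 and slope b1 for y = b0 + b1 * t."""
    n = len(t)
    t_bar = sum(t) / n
    y_bar = sum(y) / n
    sxy = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    b1 = sxy / sxx
    return y_bar - b1 * t_bar, b1


def fit_exponential(t, y):
    """Fit y = b0 * b1 ** t by regressing log10(y) on t (assumed model form)."""
    log_b0, log_b1 = fit_linear(t, [math.log10(yi) for yi in y])
    return 10 ** log_b0, 10 ** log_b1


# Hypothetical series that falls by exactly 2% per period.
periods = list(range(1, 11))  # T = 1 is assumed to correspond to 1995
rates = [10.0 * 0.98 ** t for t in periods]

b0, b1 = fit_exponential(periods, rates)
print(round(b0, 4), round(b1, 4))
```

A quadratic trend would add T squared as a second predictor and therefore needs a multiple regression, which is left out of this sketch.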
3 Recommend Trend Model
a Liberia
MAD and SSE for Liberia:
The Exponential model is recommended because its MAD and SSE are much lower than those of the Linear model.
b Lao
MAD and SSE for Lao:
The Exponential model is recommended because its MAD and SSE are much lower than those of the Linear and Quadratic models.
c Guyana
MAD and SSE for Guyana:
The Exponential model is recommended because its MAD and SSE are much lower than those of the Linear and Quadratic models.
d Netherlands
MAD and SSE for Netherlands:
The Exponential model is recommended because its MAD and SSE are much lower than those of the Linear model.
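The model comparison above rests on two error measures, which can be computed as in this sketch. MAD is taken here as the mean absolute deviation between actual and fitted values and SSE as the sum of squared errors; the actual and fitted series are hypothetical, and ranking by SSE first and MAD second is an assumption made for the example.

```python
def sse(actual, fitted):
    """Sum of squared errors between actual and fitted values."""
    return sum((a - f) ** 2 for a, f in zip(actual, fitted))


def mad(actual, fitted):
    """Mean absolute deviation between actual and fitted values."""
    return sum(abs(a - f) for a, f in zip(actual, fitted)) / len(actual)


def best_model(actual, fitted_by_model):
    """Name of the model with the lowest SSE (ties broken by MAD)."""
    return min(
        fitted_by_model,
        key=lambda name: (sse(actual, fitted_by_model[name]),
                          mad(actual, fitted_by_model[name])),
    )


# Hypothetical values, for illustration only.
actual = [10.0, 9.0, 8.1, 7.3]
fitted = {
    "Linear": [9.95, 9.05, 8.15, 7.25],
    "Exponential": [10.0, 9.0, 8.1, 7.29],
}

for name, values in fitted.items():
    print(name, round(mad(actual, values), 4), round(sse(actual, values), 4))
print("recommended:", best_model(actual, fitted))
```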
4 Predict crude death rate for Liberia, Lao, Guyana, and Netherlands year 2018-2020
Because the World Bank (n.d.) does not have sufficient crude death rate data for 2019 and 2020, we use Knoema (2020a, 2020b, 2020c, 2020d) data for the actual crude death rate in our calculations.
The predictions use the Exponential Trend Model. The calculation processes are in Appendix C.2.
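The forecasting step can be sketched as follows, assuming the exponential form Ŷ = b0·b1^T and assuming T = 1 for 1995, so that 2018, 2019, and 2020 correspond to periods 24, 25, and 26. The coefficients below are hypothetical placeholders; the fitted values for each country are in Appendix C.2.

```python
def period_for_year(year, first_year=1995):
    """Period index T, assuming T = 1 for first_year."""
    return year - first_year + 1


def exponential_forecast(b0, b1, year):
    """Forecast from the assumed exponential trend form b0 * b1 ** T."""
    return b0 * b1 ** period_for_year(year)


# Hypothetical coefficients, not the fitted values of any country.
b0_example, b1_example = 10.0, 0.98

for year in (2018, 2019, 2020):
    t = period_for_year(year)
    print(year, t, round(exponential_forecast(b0_example, b1_example, year), 3))
```

Comparing each forecast with the corresponding Knoema figure then shows how far the trend model is from the actual crude death rate.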