Measure of Central Tendency, Measure of Variation, Box-and-Whisker Plot Analysis


Group 6


Contents

Group Members and Contribution
Abbreviation
Part 1: Data Collection
Part 2: Descriptive Analysis
a Test of Outliers
b Measure of Central Tendency
c Measure of Variation
d Box-and-Whisker Plot Analysis
Part 3: Multiple Regression
a LI countries regression model
b LMI countries regression model
c UMI countries regression model
d HI countries regression model
Part 4: Team Regression Conclusion
a Conclusion for Part 2
b Conclusion for Part 3 and Part 4
Part 5: Time Series
1 Regression Output for Liberia, Lao, Guyana, and Netherlands
a Liberia
b Lao
c Guyana
d Netherlands
2 Trend Model and Formula for all 4 countries
3 Recommend Trend Model
a Liberia
b Lao
c Guyana
d Netherlands
4 Predict crude death rate for Liberia, Lao, Guyana, and Netherlands year 2018-2020
a Liberia 2018 to 2020
b Lao 2018 to 2020
c Guyana 2018 to 2020
d Netherlands 2018 to 2020
Part 6: Time Series Conclusion
a Line chart
b Best trend model anticipating the crude death rate all over the world
Part 7: Overall Team Conclusion
a Main factors that impact the crude death rate
b Predicted crude death rate in year 2030
c Recommendations
References
Appendices
Appendix A: 143 Countries Data sorted by GNI (Low to High)
Appendix A.1: Low Income
Appendix A.2: Lower Middle Income
Appendix A.3: Upper Middle Income
Appendix A.4: High Income
Appendix B: Backward Elimination process in the regression model
Appendix B.1: LI countries
Appendix B.2: LMI countries
Appendix B.3: UMI countries
Appendix B.4: HI countries
Appendix C: Time Series
Appendix C.1: Significant Trend Models validation process for Liberia, Lao, Guyana, Netherlands
Appendix C.2: Crude death rate prediction calculations for Liberia, Lao, Guyana, and Netherlands for year 2018 to 2020

Group Members and Contribution

Nguyen, Huynh

Thai

My, Huynh Thi

Abbreviation

• Low Income (LI)

• Lower Middle Income (LMI)

• Upper Middle Income (UMI)

• High Income (HI)

• Death rate (DR)

• Gross national income (GNI)

• Domestic general government health expenditure (GGHE-D)

• Immunization, measles (IM)

• Prevalence of current tobacco use (PCTU)

Part 1: Data Collection

The countries are divided into four categories based on income level (Appendix A). The data set is for the year 2014 and includes 4 variables (Appendix A). The data were collected from the World Bank (n.d.a). Initially, the data set contained 217 countries; however, some countries lacked information, so we eliminated them and kept only the 143 countries with sufficient data on all 4 variables. We kept all 143 countries instead of narrowing the sample further because we wanted to preserve the original data and prevent bias in the data cleaning process, which makes the dataset more transparent and accurate (Šimundić 2013).

Part 2: Descriptive Analysis

a Test of Outliers


Figure 2.a: Test of outliers for the crude death rate in the 4 income groups

To evaluate the reliability of the descriptive measures, the table displays the test of outliers. An outlier is an observation that lies far from the rest of the data (Lumenlearning n.d.). Based on Figure 2.a, comparing each minimum with its lower bound and each maximum with its upper bound shows that outliers occur in this data.
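The comparison of minimum and maximum against the bounds can be sketched in a few lines of Python. This is only an illustration: it assumes the common 1.5 × IQR rule for the lower and upper bounds (the report does not state which rule the figure uses), and the values in `sample_rates` are hypothetical, not taken from Appendix A.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Return (lower_bound, upper_bound) using the k * IQR rule (assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, k=1.5):
    """Return the values that fall outside the lower and upper bounds."""
    lower, upper = iqr_fences(values, k)
    return [v for v in values if v < lower or v > upper]

# Hypothetical crude death rates (per 1,000 people), not the report's data.
sample_rates = [5.1, 6.0, 6.4, 6.7, 7.0, 7.2, 7.9, 8.3, 15.6]

lower_bound, upper_bound = iqr_fences(sample_rates)
outliers = find_outliers(sample_rates)

# An outlier exists when min < lower bound or max > upper bound.
print(min(sample_rates) < lower_bound, max(sample_rates) > upper_bound)
print(outliers)
```

In this hypothetical sample only the maximum exceeds its bound, which is the same kind of check the figure performs for each income group.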

b Measure of Central Tendency

Figure 2.b: Central tendency of the crude death rate in the 4 income groups

Considering central tendency, outliers exist in this data set and lie far beyond the upper and lower bounds. Thus, the mean cannot be used to represent the data. The median is more useful than the other measures for describing the central tendency of the 4 income levels.

Moreover, the median is the most suitable measure because it gives an approximate average for each of the 4 income groups (Investopedia 2020). The median is frequently used in place of the mean when outliers skew the data. Furthermore, the median is less affected by outliers than the mean, so when outliers exist the median is the better choice. For this reason, the median should be applied to measure the central tendency in this case.

Based on Figure 2.b, the median ranges from 6.7415 (LMI) to 7.162 (UMI). This means that 50% of UMI countries recorded a crude death rate greater than 7.162 deaths per 1,000 people and the other 50% recorded fewer than 7.162. The same reading applies to the LMI data set: 50% of the countries in this income group recorded more than 6.7415 deaths per 1,000 people and 50% recorded fewer than 6.7415. Overall, the medians of the crude death rate for the upper and lower middle income groups are approximately equal.
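A small sketch shows why the median is preferred when outliers are present. The rates below are hypothetical and chosen only to show the effect of one extreme value; they are not the report's data.

```python
import statistics

# Hypothetical crude death rates (per 1,000 people).
rates = [6.1, 6.5, 6.8, 7.0, 7.3]
rates_with_outlier = rates + [16.0]  # add one extreme value

# How far each measure moves once the outlier is added.
mean_shift = statistics.mean(rates_with_outlier) - statistics.mean(rates)
median_shift = statistics.median(rates_with_outlier) - statistics.median(rates)

print(f"mean shift:   {mean_shift:.2f}")
print(f"median shift: {median_shift:.2f}")
```

The mean moves by far more than the median, which is the behaviour the paragraph above relies on.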

c Measure of Variation

Figure 2.c: Measure of variation of the crude death rate in the 4 income groups

The range is the distance from the lowest to the highest value in the data and gives a quick, rough estimate of the spread of values in the dataset (ChiliMath n.d.). Based on Figure 2.c, the range of HI countries is the largest, which indicates that death rates in HI countries were the most scattered in 2014.

The interquartile range (IQR) measures how widely the middle 50% of the data points are spread. The larger the IQR, the more spread out the data points are. Conversely, the smaller the IQR, the more closely the data points gather around the median (Stephanie n.d.). According to Figure 2.c, the IQR of LI countries is the smallest of the income groups, which suggests that death rates in LI countries gather around the median, while those of the other income groups are spread out more widely. In other words, death rates in LI countries are closer to the middle value and more stable than those in the other income levels. The IQR describes spread rather than level, so the observation that LI countries recorded more deaths than the other income levels rests on the minimum and maximum values discussed in the box-and-whisker plot analysis, not on the IQR.

The variance measures how far the data values are scattered from the mean. A small variance shows that the data points are close to the mean and to each other, and a high variance shows the opposite. The standard deviation likewise shows how closely the values lie around the mean. It is one of the most prominent measures of variability because the result is given in the original units (Roberts n.d.). In Figure 2.c, the variance and standard deviation of LI countries are smaller than those of the other income groups, which means that the data points of LI countries cluster around the mean while those of the other income groups lie farther from it.

The coefficient of variation (CV) is a suitable statistic for measuring the scatter of the data points around the mean relative to the size of the mean (Adam 2020). As Figure 2.c shows, Low Income has the smallest CV of the income groups (23.64% < 35.93% < 38.06% < 38.10%). This shows that, relative to their mean, LI values are less dispersed than those of the other income groups, which agrees with the standard deviation. The standard deviation is usually used to examine a single data series; when more than one dataset is compared, the CV should be considered because it removes the effect of different means. For this reason, the CV should be used to compare the scatter of the 4 income groups. In a nutshell, with the smallest standard deviation, variance, and CV, the LI data are more consistent than those of the other income groups.
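The measures discussed in this part can be computed together, as in the sketch below. It assumes sample (n − 1) variance and standard deviation, which may differ from the formulas behind Figure 2.c, and the two groups are hypothetical. They are built to have the same standard deviation but different means, which shows why the coefficient of variation is the measure to use when comparing groups.

```python
import statistics

def variation_summary(values):
    """Range, IQR, sample variance, sample SD and CV (%) of a list of values."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (assumption)
    return {
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "variance": statistics.variance(values),
        "sd": sd,
        "cv_percent": sd / mean * 100,
    }

# Hypothetical groups: identical spread, different means.
group_a = [8.0, 9.0, 10.0, 11.0, 12.0]
group_b = [3.0, 4.0, 5.0, 6.0, 7.0]

summary_a = variation_summary(group_a)
summary_b = variation_summary(group_b)

for name, summary in (("A", summary_a), ("B", summary_b)):
    print(name, {key: round(value, 2) for key, value in summary.items()})
```

Both groups report the same standard deviation, yet group B has the larger CV because its mean is smaller.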

d Box-and-Whisker Plot Analysis

Figure 2.d: Box-and-whisker plots of the crude death rate in the 4 income groups

Based on Figure 2.d, only the data distribution of HI countries is slightly left skewed. For the other 3 income levels, the right whisker is much longer than the left whisker, reflecting right skew and the presence of outliers in the data distribution. Furthermore, the minimum and maximum death rates in LI countries were higher than those in the other income groups. For this reason, we could assume that LI countries suffered more deaths than the other income groups. In addition, the median of LMI was lower than the median of UMI (6.7415 < 7.162), which means that 50% of LMI countries recorded a death rate of no more than 6.7415, while half of UMI countries recorded fewer than 7.162. From the box-and-whisker plots, the general picture is that LI countries recorded more deaths than the other income groups.
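Reading skewness from whisker lengths can be sketched as follows. This is a simplified illustration: the whiskers here run to the minimum and maximum (many plotting tools instead stop at the outlier bounds), and both the function names and the sample values are hypothetical.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the values a box plot is drawn from."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

def skew_direction(values):
    """Compare whisker lengths: a longer right whisker suggests right skew."""
    low, q1, _, q3, high = five_number_summary(values)
    left_whisker = q1 - low
    right_whisker = high - q3
    if right_whisker > left_whisker:
        return "right"
    if left_whisker > right_whisker:
        return "left"
    return "symmetric"

# Hypothetical crude death rates (per 1,000 people), not the report's data.
sample = [5.1, 6.0, 6.4, 6.7, 7.0, 7.2, 7.9, 8.3, 15.6]

print(five_number_summary(sample))
print(skew_direction(sample))
```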

Part 3: Multiple Regression

Regardless of the income level of the selected country category, every model initially comprises the same dependent and independent variables: the dependent variable DR and the independent variables GNI, GGHE-D, IM, and PCTU.

The backward elimination method is employed to remove variables that are irrelevant, redundant, or not statistically significant at the 5% level of significance, which can enhance the accuracy and quality of the regression models (Ruan et al. 2020).

a LI countries regression model

Based on Figure B.1.4 of Appendix B.1, all variables' p-values are much higher than the given level of significance, so we must eliminate all of them (Narin, Isler & Ozer 2013). Hence, no final regression output is constructed for the LI countries dataset, which means there is no evidence of a relationship between DR and the four variables, so DR might depend on different factors (Hannerz et al. 2019). In other words, those predictor variables are not statistically significant: we do not reject H0, and there is insufficient evidence in our sample to conclude that a non-zero correlation exists. As a result, there is no scatter plot to illustrate for LI countries (Hannerz et al. 2019).

b LMI countries regression model

Regression output:

Figure 3.b.1: Final model of LMI countries

Based on Figure B.2.2 of Appendix B.2 and Figure 3.b.1, after eliminating the least significant variable, the remaining variables (IM, GGHE-D, and GNI) are significant because their p-values (0.0003, 0.006, and 0.044) are below α = 0.05. In other words, the remaining predictor variables are statistically significant: our sample data provide enough evidence to reject H0, so changes in the independent variables are associated with changes in the DR (Fauzi 2017).

Scatter plot:


Figure 3.b.2: Scatter plots of final output of LMI countries

Regression Equation: Y = 18.8623 -0.0011X + 0.0183X - 0.1264X 1 2 3

(where is the estimated DR and X X X are the independent variables - the Y 1 2 3 GNI, GGHE-D, and IM)

Interpret the regression coefficients of the significant independent variables:

$b_1 = -0.0011$ shows that for every increase of 1 current US dollar in GNI per capita (Atlas method), the DR per 1,000 people will decrease by 0.0011 deaths, holding the two remaining factors constant.

$b_2 = 0.0183$ shows that for every increase of 1 current international dollar in domestic general government expenditure on health per capita, the DR per 1,000 people will increase by 0.0183 deaths, holding the two remaining factors constant.

$b_3 = -0.1264$ shows that for every increase of 1 percentage point in the share of children aged 12-23 months who received the measles vaccination before 12 months or at any time before the survey, the DR per 1,000 people will decrease by 0.1264 deaths, holding the two remaining factors constant.

Interpret the coefficient of determination: $R^2 = 0.381$ means that only 38.1% of the variation of the DR is explained by the variation of GNI, GGHE-D, and IM; the remaining 61.9% is explained by other factors (Glen n.d.).
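To make the interpretation of the coefficients concrete, the LMI equation can be evaluated directly. The coefficients below come from the regression equation above; the function name and the input values (GNI of 2,000, GGHE-D of 100, IM of 85%) are hypothetical and chosen only for illustration.

```python
def predict_lmi_death_rate(gni, gghe_d, im):
    """Estimated DR for an LMI country from the fitted equation in this part."""
    return 18.8623 - 0.0011 * gni + 0.0183 * gghe_d - 0.1264 * im

# Hypothetical country profile (not a row from Appendix A.2).
baseline = predict_lmi_death_rate(gni=2000, gghe_d=100, im=85)

# Raise measles immunization by 1 percentage point, other factors unchanged.
higher_im = predict_lmi_death_rate(gni=2000, gghe_d=100, im=86)

print(f"baseline DR:        {baseline:.4f}")
print(f"DR with IM + 1 pt:  {higher_im:.4f}")
print(f"change:             {higher_im - baseline:.4f}")
```

The change between the two predictions equals the IM coefficient, which is what "holding the other factors constant" means in practice.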

c UMI countries regression model

Figure 3.c.1: Final model of UMI countries

Based on Figure B.3.4 of Appendix B.3 and Figure 3.c.1, PCTU is the only significant variable because its p-value = 0.001 < α = 0.05. In other words, the predictor variable is statistically significant and we reject H0, so the sample evidence supports our prediction that PCTU is a potential risk factor of the DR in UMI countries (Fauzi 2017).

Scatter plot:

Figure 3.c.2: Scatter plots of final output of UMI countries

Regression Equation: $\hat{Y} = 4.016 + 0.157X_1$

(where $\hat{Y}$ is the estimated DR and $X_1$ is the independent variable, the prevalence of current tobacco use)


d HI countries regression model

Figure 3.d.1: Final model of HI countries

Based on Figure B.4.2 of Appendix B.4 and Figure 3.d.1, after eliminating the least significant variable, the remaining variables (PCTU, GNI, and GGHE-D) are significant because their p-values (0.004, 0.010, and 0.018) are below α = 0.05. In other words, the remaining predictor variables are statistically significant, and there is enough evidence to reject the null hypothesis, so changes in the independent variables are associated with changes in the DR (Fauzi 2017).

Scatter plot:


Figure 3.d.2: Scatter plots of final output of HI countries

Regression Equation: $\hat{Y} = 4.44805 - 0.000076X_1 + 0.00116X_2 + 0.14636X_3$

(where $\hat{Y}$ is the estimated DR and $X_1$, $X_2$, $X_3$ are the independent variables GNI, GGHE-D, and PCTU)

$b_1 = -0.000076$ shows that for every increase of 1 current US dollar in GNI per capita (Atlas method), the DR per 1,000 people will decrease by 0.000076 deaths, holding the two remaining factors constant.

$b_2 = 0.00116$ shows that for every increase of 1 current international dollar in domestic general government expenditure on health per capita, the DR per 1,000 people will increase by 0.00116 deaths, holding the two remaining factors constant.

$b_3 = 0.14636$ shows that for every increase of 1 percentage point in the share of the population aged 15 years and over who currently use any tobacco product on a daily or non-daily basis, the DR per 1,000 people will increase by 0.14636 deaths, holding the two remaining factors constant.

Interpret the coefficient of determination: $R^2 = 0.343$ means that only 34.3% of the variation of the DR is explained by the variation of GNI, GGHE-D, and PCTU, and the remaining 65.7% can be attributed to unknown variables (Glen n.d.).

Part 4: Team Regression Conclusion

1 Do all models have the same significant independent variable/s?

Figure 4.1.1: Significant independent variable(s) of 4 income level

According to Figure 4.1.1, after applying the backward elimination process, the 4 models do not have exactly the same significant independent variables. However, there are some similarities among three of the income levels: in LMI and HI countries, GGHE-D and GNI are correlated with DR (as GGHE-D increases, DR typically increases, and as GNI increases, DR decreases), whereas UMI and HI countries are affected by PCTU (as PCTU increases, DR also increases). Regarding LI countries, it is possible that the DR is affected by factors other than the 4 given variables.

2 Which variables have the higher impact on the crude death rate in each countries category?

LI countries are affected by unknown variables because none of the four given variables remained significant after the backward elimination process.

Based on Figure 4.1.1, UMI countries have only 1 significant independent variable; hence, PCTU is the greatest risk factor of the DR in UMI countries.

Figure 4.2.1: Summary information when excluding 1 variable out of 3 variables for LMI countries

LMI countries are affected by 3 different variables (Figure 4.1.1); hence, to determine which variable has the highest impact on the crude death rate, we compute the reduction in $R^2$ when 1 variable is excluded.

According to Figure 4.2.1, $R^2$ is lowest (6%) without IM, so subtracting it from the $R^2$ of the full LMI model gives the highest reduction in $R^2$ (32.04%) compared with the two remaining variables. Hence, IM has the greatest impact on the crude death rate in LMI countries.

Figure 4.2.2: Summary information when excluding 1 variable out of 3 variables for HI countries

For HI countries, the process is the same as for LMI countries because they are also affected by 3 different variables (Figure 4.1.1). According to Figure 4.2.2, $R^2$ is lowest (22.67%) without PCTU, which gives the highest reduction in $R^2$ (11.74%) compared with the two remaining variables. Hence, PCTU has the greatest impact on the crude death rate in HI countries.
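The "reduction in $R^2$" comparison used for LMI and HI countries can be sketched as below. The function names are our own, and the $R^2$ values are hypothetical fractions chosen for illustration; they are not the figures in Figure 4.2.1 or Figure 4.2.2.

```python
def r2_reductions(full_r2, r2_without):
    """Reduction in R^2 when each variable is dropped from the full model."""
    return {name: full_r2 - reduced for name, reduced in r2_without.items()}

def most_influential(full_r2, r2_without):
    """Variable whose removal lowers R^2 the most."""
    reductions = r2_reductions(full_r2, r2_without)
    return max(reductions, key=reductions.get)

# Hypothetical values: R^2 of the full 3-variable model, and R^2 of each
# 2-variable model obtained by leaving one variable out.
full_r2 = 0.40
r2_without = {"X1": 0.35, "X2": 0.30, "X3": 0.10}

reductions = r2_reductions(full_r2, r2_without)
top_variable = most_influential(full_r2, r2_without)

print({name: round(value, 2) for name, value in reductions.items()})
print(top_variable)
```

The variable with the lowest remaining $R^2$ is always the one with the largest reduction, so either number identifies the same variable.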

3 Conclusion for part 2 and part 4

a Conclusion for Part 2

Because 4 outliers were detected in the four datasets, the range, mean, standard deviation, and variance (figure 4) were not suitable for examining the death rates, as they are easily influenced by extreme values and could lead to incorrect conclusions in the data analysis. The mode was not useful in this case because it does not indicate the center of the distribution well. Additionally, because of the high skewness in all 4 data series (right skewed), the coefficient of variation is not suited to these datasets. Thus, the interquartile range and the median are recommended as the 2 most suitable measures for these data series, because they are resistant to extreme values and focus on the center of the distribution, giving a detailed insight into the datasets.

To sum up, the box-and-whisker plots give the general picture that the data distributions are right skewed, and the graph also shows the existence of outliers, although they were left in the data during the analysis. At the same time, the minimum and maximum values show that death rates in Low Income countries were higher than in the other income groups. More specifically, in the central tendency part, the medians show that 50% of Lower Middle Income countries recorded a death rate above 6.7415, while half of Upper Middle Income countries recorded a death rate below 7.162. Moreover, the IQR of Low Income is the smallest of the income groups, which suggests that death rates in Low Income countries gather around the median, while those of the other income groups are spread out more widely.

b Conclusion for Part 3 and Part 4

After computing the reduction in $R^2$ when 1 variable is excluded, IM has the greatest impact on the crude death rate for LMI countries, whereas PCTU has the greatest impact for HI countries. GNI and GGHE-D are the two remaining risk factors of the crude death rate in both country categories.

The regression model of LMI countries will provide a better crude death rate estimation because it has the highest $R^2$ (38.1%), which shows a stronger relationship between the dependent variable and the independent variables: 38.1% of the variation in the crude death rate of LMI countries is explained by the variation in IM, GNI, and GGHE-D. UMI and HI countries have lower $R^2$ values (25.3% and 34.3%), which means their relationships are weaker. The higher the $R^2$, the greater the model's capability to forecast or determine the likelihood of future events falling within the predicted outcomes (Zhang 2017), because the data points fall closer to the fitted regression line and the model's predictive ability for the dependent variable is stronger (Hamilton, Ghert & Simpson 2015). However, the $R^2$ values of the three income levels are still relatively low, which shows that many unknown risk factors influence the crude death rate; hence, the models lack the ability to make reliable predictions (Hamilton, Ghert & Simpson 2015). In other words, we should identify other, more effective variables, such as the prevalence of overweight or underweight and factors related to diet and physical activity (Ritche & Roser 2018).

Part 5: Time Series

1 Regression Output for Liberia, Lao, Guyana, and Netherlands

Process of eliminating invalid trend models is in Appendix C.1

a Liberia

Linear Trend Model

Exponential Trend Model

b Lao

Linear Trend Model

Quadratic Trend Model

Exponential Trend Model


c Guyana

Linear Trend Model

Quadratic Trend Model


d Netherlands

Linear Trend Model


2 Trend Model and Formula for all 4 countries

Where:

Ŷ: The crude death rate (per 1,000 people) for the country

T: The period for the year starting from 1995 as the first period
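For reference, the three trend models named in this part are commonly written in the general forms below. The symbols $b_0$, $b_1$, $b_2$ are placeholder coefficients in our own notation, not the fitted values for any of the 4 countries, and the logarithmic form of the exponential model is the usual way it is estimated by linear regression.

```latex
\begin{align*}
\text{Linear:}      \quad & \hat{Y} = b_0 + b_1 T \\
\text{Quadratic:}   \quad & \hat{Y} = b_0 + b_1 T + b_2 T^2 \\
\text{Exponential:} \quad & \hat{Y} = b_0 \, b_1^{T}
  \quad\Longleftrightarrow\quad \log \hat{Y} = \log b_0 + T \log b_1
\end{align*}
```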

3 Recommend Trend Model

a Liberia

MAD and SSE for Liberia:

The Exponential model is recommended because its MAD and SSE are much lower than those of the Linear model.

b Lao

MAD and SSE for Lao:

The Exponential model is recommended because its MAD and SSE are much lower than those of the Linear and Quadratic models.

c Guyana

MAD and SSE for Guyana:


lower than the Linear and Quadratic model

d Netherlands

MAD and SSE for Netherlands:

lower than the Linear model
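The MAD and SSE comparison used to recommend a trend model can be sketched as follows. The series is hypothetical (it falls by 5% per period, so it is exactly exponential), the helper names are our own, and the exponential model is fitted by regressing the natural log of the rate on the period, which is one common approach and may differ from the spreadsheet procedure used for the figures.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

def mad_and_sse(actual, fitted):
    """Mean absolute deviation and sum of squared errors of a fitted model."""
    errors = [a - f for a, f in zip(actual, fitted)]
    mad = sum(abs(e) for e in errors) / len(errors)
    sse = sum(e ** 2 for e in errors)
    return mad, sse

# Hypothetical crude death rates for periods T = 1..10 (not the report's data).
periods = list(range(1, 11))
rates = [20.0 * 0.95 ** t for t in periods]

# Linear trend: rate = b0 + b1 * T
b0, b1 = fit_line(periods, rates)
linear_fit = [b0 + b1 * t for t in periods]

# Exponential trend: fit a line to log(rate), then transform back.
log_b0, log_b1 = fit_line(periods, [math.log(r) for r in rates])
exp_fit = [math.exp(log_b0 + log_b1 * t) for t in periods]

linear_mad, linear_sse = mad_and_sse(rates, linear_fit)
exp_mad, exp_sse = mad_and_sse(rates, exp_fit)

print(f"linear:      MAD={linear_mad:.4f}  SSE={linear_sse:.4f}")
print(f"exponential: MAD={exp_mad:.4f}  SSE={exp_sse:.4f}")
```

The model with the lower MAD and SSE is the one recommended, which for this constructed series is the exponential model.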

4 Predict crude death rate for Liberia, Lao, Guyana, and Netherlands year 2018-2020

Because the World Bank (n.d.) does not have sufficient crude death rate data for 2019 and 2020, we use Knoema (2020a, 2020b, 2020c, 2020d) data for the actual crude death rate in our calculations. The predictions use the Exponential Trend Model. The calculation processes are in Appendix C.2.
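The prediction step can be sketched as below. The period numbering follows the definition of T (1995 is the first period, so 2018 is period 24). The exponential form $\hat{Y} = b_0 b_1^T$ is the general form of the model, and the coefficients `b0` and `b1` are hypothetical placeholders, not the fitted values for any of the 4 countries, which are in Appendix C.2.

```python
def year_to_period(year, first_year=1995):
    """Convert a calendar year to the period T, with 1995 as period 1."""
    return year - first_year + 1

def exponential_forecast(b0, b1, period):
    """Predicted crude death rate from an exponential trend model."""
    return b0 * b1 ** period

# Hypothetical coefficients for illustration only.
b0, b1 = 10.0, 0.98

forecasts = {
    year: exponential_forecast(b0, b1, year_to_period(year))
    for year in (2018, 2019, 2020)
}

for year, rate in forecasts.items():
    print(year, year_to_period(year), round(rate, 3))
```

The predicted values can then be compared with the actual crude death rates to judge how well the model forecasts.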
