
Measure of central tendency measure of variation box and whisker plot analysis


Document information

Title: Measure Of Central Tendency Measure Of Variation Box And Whisker Plot Analysis
Author: Group 6
Type: Report
Pages: 40
File size: 2.49 MB


Group 6


Contents

Group Members and Contribution
Abbreviation
Part 1: Data Collection
Part 2: Descriptive Analysis
  a Test of Outliers
  b Measure of Central Tendency
  c Measure of Variation
  d Box-and-Whisker Plot Analysis
Part 3: Multiple Regression
  a LI countries regression model
  b LMI countries regression model
  c UMI countries regression model
  d HI countries regression model
Part 4: Team Regression Conclusion
  a Conclusion for Part 2
  b Conclusion for Part 3 and Part 4
Part 5: Time Series
  1 Regression Output for Liberia, Lao, Guyana, and Netherlands
    a Liberia
    b Lao
    c Guyana
    d Netherlands
  2 Trend Model and Formula for all 4 countries
  3 Recommend Trend Model
    a Liberia
    b Lao
    c Guyana
    d Netherlands
  4 Predict crude death rate for Liberia, Lao, Guyana, and Netherlands year 2018-2020
    a Liberia 2018 to 2020
    b Lao 2018 to 2020
    c Guyana 2018 to 2020
    d Netherlands 2018 to 2020
Part 6: Time Series Conclusion
  a Line chart
  b Best trend model anticipating the crude death rate all over the world
Part 7: Overall Team Conclusion
  a Main factors that impact the crude death rate
  b Predicted crude death rate in year 2030
  c Recommendations
References
Appendices
  Appendix A: 143 Countries Data sort by GNI (Low to High)
    Appendix A.1: Low Income
    Appendix A.2: Low-Middle Income
    Appendix A.3: Middle-Upper Income
    Appendix A.4: High Income
  Appendix B: Backward Elimination process in the regression model
    Appendix B.1: LI countries
    Appendix B.2: LMI countries
    Appendix B.3: UMI countries
    Appendix B.4: HI countries
  Appendix C: Time Series
    Appendix C.1: Significant Trend Models validation process for Liberia, Laos, Guyana, Netherlands
    Appendix C.2: Crude death rate prediction for Liberia, Lao, Guyana, and Netherlands for year 2018 to 2020 calculations

Group Members and Contribution

Thai

Abbreviation

Low Income (LI)

Lower Middle Income (LMI)

Upper Middle Income (UMI)

High Income (HI)

Death rate (DR)

Gross national income (GNI)

Domestic general government health expenditure (GGHE-D)

Immunization, measles (IM)

Prevalence of current tobacco use (PCTU)

Part 1: Data Collection

The countries are divided into four categories based on income level (Appendix A). The data set covers the year 2014 and includes 4 variables (Appendix A). The data were collected from the World Bank (n.d.a). Initially, the data set contained 217 countries. However, some countries lacked information, so we eliminated them and kept only the 143 countries with sufficient data for all 4 variables. We used all 143 countries rather than narrowing the set further because we wanted to preserve the original data and avoid introducing bias during data cleaning, which yields a more transparent and accurate dataset (Šimundić 2013).

Part 2: Descriptive Analysis

a Test of Outliers

Figure 2.a: Test of outliers for the crude death rate in the 4 income groups studied

To evaluate the reliability of the descriptive measures, the table shows the test of outliers. An outlier is an observation that differs markedly from the rest of the data (Lumenlearning n.d.). Based on figure 2.a, comparing each group's minimum with its lower bound and its maximum with its upper bound shows that outliers occur in this data.
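The comparison of minimum and maximum against the lower and upper bounds can be sketched as follows. This is only an illustration: the report does not state how its bounds were computed, so the common 1.5 × IQR rule is assumed here, and the sample values are hypothetical rather than taken from Appendix A.

```python
from statistics import quantiles


def outlier_bounds(values, k=1.5):
    """Return (lower_bound, upper_bound) using the k * IQR rule (k = 1.5 is an assumption)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, k=1.5):
    """Return the observations that fall outside the bounds."""
    lower, upper = outlier_bounds(values, k)
    return [v for v in values if v < lower or v > upper]


# Hypothetical crude death rates (per 1,000 people), not the report's data.
rates = [5.0, 6.0, 6.5, 7.0, 7.5, 8.0, 20.0]
lower, upper = outlier_bounds(rates)

# An outlier exists if the minimum is below the lower bound
# or the maximum is above the upper bound.
print("min below lower bound:", min(rates) < lower)
print("max above upper bound:", max(rates) > upper)
print("outliers:", find_outliers(rates))
```

In this hypothetical sample only the maximum exceeds its bound, which is the same kind of check described for figure 2.a.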

b Measure of Central Tendency

Figure 2.b: Central tendency of the crude death rate for the 4 income groups studied

Outliers exist in this data, and they lie far beyond the upper and lower bounds. Thus, the mean is not a reliable indicator of central tendency here, and the median is more useful for describing the center of the 4 income groups.

The median is the most suitable measure because it gives the typical value for each of the 4 income groups (Investopedia 2020). The median is often used in place of the mean when outliers skew the data, because the median is far less affected by outliers than the mean. For this reason, the median should be applied to measure the central tendency in this case.

Based on figure 2.b, the median of the data ranges from 6.746 (LMI) to 7.162 (UMI). This means that 50% of UMI countries had a crude death rate above 7.162 deaths per 1,000 people and the other 50% had a rate below 7.162. The same picture applies to the LMI dataset: 50% of the countries in this income group recorded a death rate above 6.7415 and 50% recorded a rate below 6.7415. Overall, the median crude death rates of the upper and lower middle income groups are approximately equal.
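A small sketch of why the median is preferred when outliers are present. The values are hypothetical and chosen only to show that one extreme observation pulls the mean much further than the median.

```python
from statistics import mean, median

# Hypothetical crude death rates (per 1,000 people), not the report's data.
rates = [6.0, 6.5, 7.0, 7.5, 8.0]
with_outlier = rates + [20.0]  # add a single extreme value

# How far each measure moves once the outlier is added.
mean_shift = mean(with_outlier) - mean(rates)
median_shift = median(with_outlier) - median(rates)

print("mean without / with outlier:  ", mean(rates), round(mean(with_outlier), 3))
print("median without / with outlier:", median(rates), median(with_outlier))
print("shift in mean:", round(mean_shift, 3), "shift in median:", median_shift)
```

The mean moves by more than two units while the median moves by a quarter of a unit, which is the behavior the paragraph above relies on.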

c Measure of Variation

Figure 2.c: Measure of variation of the crude death rate for the 4 income groups studied

The range is the distance from the lowest to the highest value in the data, and it gives a quick, rough estimate of the spread of values within the dataset (ChiliMath n.d.). Based on figure 2.c, the range of HI countries is the largest, which indicates that the distribution of death rates in HI countries was the most scattered in 2014.

The interquartile range (IQR) is an indicator of how widely the middle 50% of the data points spread around the median. The larger the IQR, the more spread out the data points are. Conversely, the smaller the IQR, the more closely the data points gather around the median (Stephanie n.d.).

According to figure 2.c, the IQR of LI countries is the smallest among the income groups, which suggests that death rates in LI countries gather around the median, while the death rates of the other income groups are considerably more spread out. In other words, death rates in LI countries stay close to the middle value and are more stable than those of the other groups.

The coefficient of variation is the appropriate statistic for measuring the dispersion of data points around the mean relative to the size of the mean (Adam 2020). As figure 2.c shows, Low Income has the smallest coefficient of variation of the four income groups (23.64% < 35.93% < 38.06% < 38.10%). This shows that, relative to their mean, the values of LI countries are less dispersed than those of the other income groups. The standard deviation and the coefficient of variation need not rank groups in the same way: the standard deviation measures absolute spread and is usually used to examine a single data series, whereas the coefficient of variation expresses spread relative to the mean and should be considered when more than one dataset is compared. For this reason, the coefficient of variation should be used to compare the dispersion of the 4 income groups. In a nutshell, with the smallest standard deviation and variance, the LI data are more consistent than those of the other income groups, indicating that the reported values are relatively stable.
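The measures discussed in this section can be computed as in the sketch below. The two groups are hypothetical (the second is the first multiplied by 10), and the sample standard deviation and inclusive quartile method are assumptions, since the report does not say which variants it used.

```python
from statistics import mean, quantiles, stdev


def spread_summary(values):
    """Return range, IQR, sample standard deviation, and coefficient of variation (%)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    sd = stdev(values)
    return {
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "stdev": sd,
        "cv_percent": sd / mean(values) * 100,
    }


# Hypothetical groups, not the report's data: group_b is group_a scaled by 10.
group_a = [4.0, 5.0, 6.0, 7.0, 8.0]
group_b = [40.0, 50.0, 60.0, 70.0, 80.0]

summary_a = spread_summary(group_a)
summary_b = spread_summary(group_b)

for name, summary in (("A", summary_a), ("B", summary_b)):
    print(name, {key: round(value, 2) for key, value in summary.items()})
```

Group B has a standard deviation ten times larger than group A, yet both have the same coefficient of variation. This is why the coefficient of variation is the usual choice when groups with different means are compared.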

d Box-and-Whisker Plot Analysis

Figure 2.d: Box-and-whisker plots of the crude death rate for the 4 income levels

Based on figure 2.d, only the data distribution of HI countries is slightly left-skewed. For the other 3 income levels, the right whisker is much longer than the left whisker, reflecting the presence of outliers in the data distribution. Furthermore, the minimum and maximum death rates in LI countries are higher than those in the other income groups. For this reason, we can assume that LI countries suffered higher mortality than the other income groups. In addition, the median of LMI is lower than the median of UMI (6.7415 < 7.162), which means that 50% of LMI countries recorded a death rate of no more than 6.7415, while half of UMI countries recorded a death rate below 7.162. Using the box-and-whisker plots, we can draw a general picture in which LI countries have higher death rates than the other income groups.
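The reading of skewness from whisker lengths can be sketched as below. This is a simplified illustration with hypothetical data: it assumes the whiskers extend to the minimum and maximum, which may differ from how the report's plots were drawn.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the values a box-and-whisker plot displays."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)


def skew_direction(values):
    """Compare whisker lengths; whiskers are assumed to reach the min and max."""
    lowest, q1, _, q3, highest = five_number_summary(values)
    left_whisker = q1 - lowest
    right_whisker = highest - q3
    if right_whisker > left_whisker:
        return "right-skewed"
    if left_whisker > right_whisker:
        return "left-skewed"
    return "symmetric"


# Hypothetical samples, not the report's data.
long_right_tail = [5.0, 6.0, 6.5, 7.0, 7.5, 8.0, 20.0]
long_left_tail = [1.0, 13.0, 13.5, 14.0, 14.5, 15.0, 16.0]

print(five_number_summary(long_right_tail), skew_direction(long_right_tail))
print(five_number_summary(long_left_tail), skew_direction(long_left_tail))
```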

Part 3: Multiple Regression

Regardless of income level, the model for each country category initially comprises the same variables: DR as the dependent variable, and GNI, GGHE-D, IM, and PCTU as the independent variables.

The backward elimination method is used to remove variables that are irrelevant, redundant, or not statistically significant at the 5% level of significance, which can enhance the accuracy and quality of the regression models (Ruan et al. 2020).

a LI countries regression model

Based on Figure B.1.4 of Appendix B.1, the p-values of all variables are much higher than the given level of significance, so all of them must be eliminated (Narin, Isler & Ozer 2013). Hence, no final regression output is constructed for the LI countries dataset. This means that no statistically significant relationship was found between DR and the four variables, so DR in LI countries may depend on other factors (Hannerz et al. 2019). In other words, the predictor variables are not statistically significant: we do not reject H0, and there is insufficient evidence in our sample to conclude that a non-zero correlation exists. As a result, there is no scatter plot for LI countries (Hannerz et al. 2019).

b LMI countries regression model

Regression output:

Figure 3.b.1: Final model of LMI countries

Based on Figure B.2.2 of Appendix B.2 and figure 3.b.1, after eliminating the least significant variable, the remaining variables (Immunization, measles; GGHE-D; and GNI) are statistically significant because their p-values (0.0003, 0.006, and 0.044) are below α = 0.05. In other words, our sample data provide enough evidence to reject H0, so changes in the independent variables are associated with changes in DR (Fauzi 2017).

Scatter plot:

Figure 3.b.2: Scatter plots of final output of LMI countries

Regression Equation: Y = 18.8623 - 0.0011X1 + 0.0183X2 - 0.1264X3

(where Y is the estimated DR and X1, X2, X3 are the independent variables: the GNI, GGHE-D, and IM)

Interpret the regression coefficient of the significant independent variable:

b1 = -0.0011 shows that for every increase of 1 current US$ in GNI per capita (Atlas method), the DR per 1,000 people will decrease by 0.0011 deaths, holding the two remaining factors constant.

b2 = 0.0183 shows that for every increase of 1 international US$ in domestic general government expenditure on health per capita, the DR per 1,000 people will increase by 0.0183 deaths, holding the two remaining factors constant.

b3 = -0.1264 shows that for every increase of 1 percentage point in the share of children ages 12-23 months who received the measles vaccination before 12 months or at any time before the survey, the DR per 1,000 people will decrease by 0.1264 deaths, holding the two remaining factors constant.

Interpret the coefficient of determination: R2 = 0.381 means that only 38.1% of the variation in DR is explained by the variation in GNI, GGHE-D, and Immunization, measles. The remaining 61.9% of the variation in DR is explained by other factors (Glen n.d.).
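To make the coefficient interpretation concrete, the sketch below evaluates the LMI regression equation reported above. The coefficients come from the report; the input values for a country are hypothetical and chosen only for illustration.

```python
def predict_lmi_dr(gni, gghe_d, im):
    """Estimated DR for an LMI country using the report's fitted equation.

    gni:    GNI per capita, Atlas method (current US$)
    gghe_d: domestic general government health expenditure per capita
    im:     measles immunization (% of children ages 12-23 months)
    """
    return 18.8623 - 0.0011 * gni + 0.0183 * gghe_d - 0.1264 * im


# Hypothetical country profile, not taken from Appendix A.
base = predict_lmi_dr(gni=2000, gghe_d=100, im=80)

# Raising immunization by 1 percentage point, other inputs held constant,
# changes the estimate by the coefficient b3 = -0.1264.
one_more_point = predict_lmi_dr(gni=2000, gghe_d=100, im=81)

print("estimated DR:", round(base, 4))
print("change for +1 point of IM:", round(one_more_point - base, 4))
```

The difference between the two estimates equals b3, which is exactly what "holding the two remaining factors constant" means in the interpretation.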

c UMI countries regression model

Regression output:

Figure 3.c.1: Final model of UMI countries

Based on Figure B.3.4 of Appendix B.3 and figure 3.c.1, PCTU is the only significant variable, with p-value = 0.001 < α = 0.05. In other words, the predictor variable is statistically significant and we reject H0, so the sample evidence supports our prediction that PCTU is a potential risk factor for DR in UMI countries (Fauzi 2017).

Scatter plot:

Figure 3.c.2: Scatter plots of final output of UMI countries

Regression Equation: Y = 4.016 + 0.157X1

(where Y is the estimated DR and X1 is the independent variable: the prevalence of current tobacco use)

Interpret the regression coefficient of the significant independent variable:

b1 = 0.157 means that for every increase of 1 percentage point in the share of the population ages 15 years and over who currently use any tobacco product on a daily or non-daily basis, the DR per 1,000 people will increase by 0.157 deaths.

Interpret the coefficient of determination: R2 = 0.253 means that only 25.3% of the variation in DR is explained by the variation in PCTU. The remaining 74.7% of the variation in DR is explained by other factors (Glen n.d.).

d HI countries regression model

Regression output:

Figure 3.d.1: Final model of HI countries

Based on Figure B.4.2 of Appendix B.4 and figure 3.d.1, after eliminating the least significant variable, the remaining variables (PCTU, GNI, and GGHE-D) are statistically significant because their p-values (0.004, 0.010, and 0.018) are below α = 0.05. In other words, there is enough evidence to reject the null hypothesis, so changes in the independent variables are associated with changes in DR (Fauzi 2017).

Scatter plot:


Figure 3.d.2: Scatter plots of final output of HI countries

Regression Equation: Y = 4.44805 - 0.000076X1 + 0.00116X2 + 0.14636X3

(where Y is the estimated DR; X1 X2 X3 are the independent variables - the GNI, GGHE-D, and PCTU)

Interpret the regression coefficient of the significant independent variable:

b1 = -0.000076 shows that for every increase of 1 current US$ in GNI per capita (Atlas method), the DR per 1,000 people will decrease by 0.000076 deaths, holding the two remaining factors constant.

b2 = 0.00116 shows that for every increase of 1 current international US$ in domestic general government expenditure on health per capita, the DR per 1,000 people will increase by 0.00116 deaths, holding the two remaining factors constant.

b3 = 0.14636 shows that for every increase of 1 percentage point in the share of the population ages 15 years and over who currently use any tobacco product on a daily or non-daily basis, the DR per 1,000 people will increase by 0.14636 deaths, holding the two remaining factors constant.

Interpret the coefficient of determination: R2 = 0.343 means that only 34.3% of the variation in DR is explained by the variation in GNI, GGHE-D, and PCTU. The remaining 65.7% can be attributed to unknown variables (Glen n.d.).

Part 4: Team Regression Conclusion

1 Do all models have the same significant independent variable/s?

Figure 4.1.1: Significant independent variable(s) of the 4 income levels

According to figure 4.1.1, after applying the backward elimination process, the 4 models do not have exactly the same significant independent variables. However, there are some similarities among three of the income levels. In LMI and HI countries, GGHE-D and GNI are associated with DR (as GGHE-D increases, DR typically increases, and as GNI increases, DR decreases). UMI and HI countries are both affected by PCTU (as PCTU increases, DR also increases). For LI countries, it is possible that DR is affected by factors other than the 4 given variables.

2 Which variables have the higher impact on the crude death rate in each country category?

LI countries are affected by unknown variables, because the backward elimination process found no significant relationship between the four given variables and DR.

Based on figure 4.1.1, UMI countries have only 1 significant independent variable; hence, PCTU is the greatest risk factor for DR in UMI countries.

Figure 4.2.1: Summary information when excluding 1 variable out of 3 variables for LMI countries

LMI countries are affected by 3 different variables (figure 4.1.1). To determine which variable has the highest impact on the crude death rate, we compute the reduction in R2 when each variable is excluded in turn. According to figure 4.2.1, R2 is lowest without IM (6%), so subtracting it from the R2 of the full LMI model gives the largest reduction in R2 (32.04%) compared with the two remaining variables. Hence, IM has the greatest impact on the crude death rate in LMI countries.

Figure 4.2.2: Summary information when excluding 1 variable out of 3 variables for HI countries

For HI countries, the same process applies because they are also affected by 3 different variables (figure 4.1.1). According to figure 4.2.2, R2 is lowest without PCTU (22.67%), which gives the largest reduction in R2 (11.74%) compared with the two remaining variables. Hence, PCTU has the greatest impact on the crude death rate in HI countries.
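The "reduction in R2" comparison can be sketched as follows. The full-model R2 of 38.10% and the 32.04% reduction for IM are the report's LMI figures, from which an R2 without IM of 6.06% is implied. The values for GNI and GGHE-D are hypothetical placeholders, because figure 4.2.1 is not reproduced in this text.

```python
def r2_reductions(full_r2, r2_without):
    """Reduction in R2 (percentage points) when each variable is left out."""
    return {name: full_r2 - reduced for name, reduced in r2_without.items()}


full_r2 = 38.10  # R2 of the final LMI model (%), from the report

r2_without = {
    "IM": 6.06,       # implied by the reported 32.04-point reduction
    "GNI": 30.00,     # hypothetical placeholder
    "GGHE-D": 33.00,  # hypothetical placeholder
}

reductions = r2_reductions(full_r2, r2_without)

# The variable whose removal costs the most R2 has the greatest impact.
most_influential = max(reductions, key=reductions.get)

for name, drop in sorted(reductions.items(), key=lambda item: -item[1]):
    print(f"{name}: reduction in R2 = {drop:.2f} points")
print("greatest impact:", most_influential)
```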

3 Conclusion for part 2 and part 4

a Conclusion for Part 2

Because outliers were found in the four datasets, the Range, Mean, Standard Deviation, and Variance (figure 4) were not suitable for examining the death rates, since they are easily influenced by extreme values and could lead to incorrect conclusions in the data analysis. The Mode was not useful in this case because it could not indicate the center of the distribution well. Additionally, because of the high skewness in all 4 data series (right-skewed), the Coefficient of Variation is not well suited to these datasets. Thus, the Interquartile Range and the Median are the 2 most suitable measures for these data series: they are resistant to extreme values and focus on the center of the distribution, giving a detailed insight into the datasets.

To sum up, the box-and-whisker plots give the general picture that the data distributions are right-skewed, and they also reveal the existence of outliers even without a separate outlier test. The minimum and maximum values also show that, in the same period, the death rate of Low Income countries is higher than that of the other income groups. More specifically, in the central tendency analysis, the medians show that 50% of Lower Middle Income countries had a death rate above 6.7415, while half of Upper Middle Income countries had a death rate below 7.162. Moreover, the Interquartile Range of Low Income is the smallest among the income groups, which suggests that death rates in Low Income countries gather around the median, while the death rates of the other income groups are considerably more spread out.

b Conclusion for Part 3 and Part 4

After computing the reduction in R2 when each variable is excluded, IM has the greatest impact on the crude death rate for LMI countries, whereas PCTU has the greatest impact for HI countries. GNI and GGHE-D are the two remaining risk factors for the crude death rate in both of these country categories.

The regression model of LMI countries provides the better crude death rate estimation because it has the highest R2 (38.1%), which shows a stronger relationship between the dependent variable and the independent variables: 38.1% of the variation in the crude death rate of LMI countries is explained by the variation in IM, GNI, and GGHE-D. UMI and HI countries have lower R2 values (25.3% and 34.3%), which means their relationships are weaker. The higher the R2, the greater the model's ability to forecast or to determine the likelihood of future events falling within the predicted outcomes (Zhang 2017), because more data points fall close to the fitted regression line and the model's predictive ability for the given dependent variable is stronger (Hamilton, Ghert & Simpson 2015). However, the R2 values of all three income levels are still relatively low, which shows that many unknown risk factors influence the crude death rate and that the models lack the ability to make reliable predictions (Hamilton, Ghert & Simpson 2015). In other words, we should identify additional, more effective variables, such as the prevalence of overweight or underweight and factors related to diet and physical activity (Ritche & Roser 2018).

Part 5: Time Series

1 Regression Output for Liberia, Lao, Guyana, and Netherlands

The process of eliminating invalid trend models is in Appendix C.1.

a Liberia

Linear Trend Model

Exponential Trend Model

b Lao

Linear Trend Model

Quadratic Trend Model

Exponential Trend Model

c Guyana

Linear Trend Model

Quadratic Trend Model

Exponential Trend Model

d Netherlands

Linear Trend Model

Exponential Trend Model

2 Trend Model and Formula for all 4 countries

Where:

Ŷ: The crude death rate (per 1,000 people) for the country

T: The time period, with the year 1995 as the first period
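The fitted formulas for each country are not reproduced in this text. For orientation, the general forms of the three trend models named in this part are conventionally written as below. The coefficient symbols b0, b1, b2 and the base-10 logarithm in the exponential model are assumptions following a common textbook convention, not something the report states.

```latex
\begin{align*}
\text{Linear:}      \quad & \hat{Y} = b_0 + b_1 T \\
\text{Quadratic:}   \quad & \hat{Y} = b_0 + b_1 T + b_2 T^2 \\
\text{Exponential:} \quad & \log_{10}\hat{Y} = b_0 + b_1 T
  \;\Longleftrightarrow\; \hat{Y} = 10^{\,b_0}\,\bigl(10^{\,b_1}\bigr)^{T}
\end{align*}
```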

3 Recommend Trend Model

a Liberia

MAD and SSE for Liberia:

The Exponential model is recommended because its MAD and SSE are much lower than those of the Linear model.

b Lao

MAD and SSE for Lao:

The Exponential model is recommended because its MAD and SSE are much lower than those of the Linear and Quadratic models.

c Guyana

MAD and SSE for Guyana:

The Exponential model is recommended because its MAD and SSE are much lower than those of the Linear and Quadratic models.

d Netherlands

MAD and SSE for Netherlands:

The Exponential model is recommended because its MAD and SSE are much lower than those of the Linear model.
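The model selection by MAD and SSE can be sketched as follows. The series is hypothetical (it halves each period so the exponential trend fits well), the exponential model is fitted by regressing log10 of the rate on the period (an assumed convention), and MAD is taken as the mean absolute deviation between actual and fitted values.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope


def mad_sse(actual, fitted):
    """Return (mean absolute deviation, sum of squared errors)."""
    errors = [a - f for a, f in zip(actual, fitted)]
    mad = sum(abs(e) for e in errors) / len(errors)
    sse = sum(e * e for e in errors)
    return mad, sse


# Hypothetical series, not the report's data.
periods = [1, 2, 3, 4, 5, 6]
rates = [16.0, 8.0, 4.0, 2.0, 1.0, 0.5]

# Linear trend: Y = b0 + b1 * T
b0, b1 = fit_line(periods, rates)
linear_fit = [b0 + b1 * t for t in periods]

# Exponential trend: log10(Y) = c0 + c1 * T
c0, c1 = fit_line(periods, [math.log10(y) for y in rates])
exponential_fit = [10 ** (c0 + c1 * t) for t in periods]

linear_mad, linear_sse = mad_sse(rates, linear_fit)
exp_mad, exp_sse = mad_sse(rates, exponential_fit)

print(f"Linear:      MAD = {linear_mad:.4f}, SSE = {linear_sse:.4f}")
print(f"Exponential: MAD = {exp_mad:.4f}, SSE = {exp_sse:.4f}")
print("recommended:", "Exponential" if exp_sse < linear_sse else "Linear")
```

The model with the lower MAD and SSE is recommended, which mirrors the reasoning given for each of the four countries.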

4 Predict crude death rate for Liberia, Lao, Guyana, and Netherlands year 2018-2020

Because the World Bank (n.d.) does not have sufficient crude death rate data for 2019 and 2020, we use Knoema (2020a, 2020b, 2020c, 2020d) data as the actual crude death rate in our calculations.

The predictions use the Exponential Trend Model. The calculation processes are in Appendix C.2.
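A minimal sketch of how a prediction for 2018 to 2020 could be produced from an exponential trend model. Two things are assumptions here: that 1995 is coded as period T = 1 (the report only says 1995 is the first period), and that the model has the form log10(Y) = b0 + b1 * T. The coefficients are hypothetical, since the fitted values appear only in the report's figures.

```python
def period_for_year(year, first_year=1995):
    """Convert a calendar year to a period index (assumes 1995 -> T = 1)."""
    return year - first_year + 1


def exponential_forecast(b0, b1, t):
    """Forecast from log10(Y) = b0 + b1 * T (assumed model form)."""
    return 10 ** (b0 + b1 * t)


# Hypothetical coefficients, not taken from the report's regression output.
b0, b1 = 1.0, -0.01

forecasts = {}
for year in (2018, 2019, 2020):
    t = period_for_year(year)
    forecasts[year] = exponential_forecast(b0, b1, t)
    print(year, "T =", t, "predicted DR =", round(forecasts[year], 3))
```

With a negative b1 the predicted rate falls slightly each year. If the report instead codes 1995 as T = 0, every period index shifts down by one.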

Date posted: 10/05/2022, 08:49
