Final examination leadership and team building


VIETNAM NATIONAL UNIVERSITY, HANOI

VNU-INTERNATIONAL SCHOOL

FINAL EXAMINATION LEADERSHIP AND TEAM BUILDING

Group 2:

Nguyễn Ngoc Diing - 21070750

Vũ Mỹ Hoa - 21070129

Nguyễn Thị Nhung - 21070230

Vũ Xuân Bách - 21070793

Lecturer: Dr. Phạm Thị Việt Hương

Ha Noi, 3/1/2024

TABLE OF CONTENTS

I INTRODUCTION

1 Data sources

2 Information of dataset

II

1 Discard outliers

2 Summary of the data

3 Perform the multiple linear regression

4 Check the collinearity in the dataset and remove it

5 Interaction model

6 Check assumptions of the multiple regression model

7 Going back to the original data: choose AIC, BIC, or cross-validated RMSE to build the best possible model

I INTRODUCTION

1 Data sources

resource=download&select=insurance.csv

2 Information of dataset

Medical Cost Personal Datasets: the aim is to predict the medical costs billed by health insurance for an individual, given some or all of the independent variables of the dataset.

Content of each column:

- age: age of the primary beneficiary

- sex: gender of the insurance contractor (female, male)

- bmi: body mass index, an objective index of body weight relative to height (kg/m^2) that shows whether weight is relatively high or low for a given height; ideally 18.5 to 24.9

- children: number of children covered by health insurance / number of dependents

- smoker: smoking status (yes, no)

- region: the beneficiary's residential area in the US (northeast, southeast, southwest, northwest)

- charges: individual medical costs billed by health insurance

> setwd("C:/Users/asus/Downloads")
> library(readxl)
> data <- read_excel("insurance.xlsx", sheet = "sheet1")
> data[1:5, ]
  age sex     bmi children smoker region    charges
   19 female 27.9        0 yes    southwest   16885
   18 male   33.8        1 no     southeast    1726
   28 male   33          3 no     southeast    4449
   33 male   22.7        0 no     northwest   21984
   32 male   28.9        0 no     northwest    3867
> dim(data)
[1] 1338    7

- This dataset has 1338 entries and 7 columns.


1 Discard outliers

> #reject outliers
> columns=c(1,3,4)
> data1=data[,columns]
> boxplot(data1)

[Figure: boxplots of age, bmi, and children]

Based on the boxplot, we can detect the presence of outliers in bmi. We use the IQR function to compute the interquartile range of bmi, then calculate Q1 - 1.5*IQR as the lower limit and Q3 + 1.5*IQR as the upper limit; values outside these limits are treated as outliers and discarded. The result is shown in a boxplot of the filtered bmi values:

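# first and third quartiles of bmi
quartiles=quantile(data$bmi,probs=c(.25,.75),na.rm=FALSE)
IQR=IQR(data$bmi)
# lower and upper limits for non-outlying values
lower=quartiles[1]-1.5*IQR
upper=quartiles[2]+1.5*IQR
# keep only the rows whose bmi lies strictly between the limits
data=subset(data,data$bmi<upper & data$bmi>lower)
boxplot(data$bmi)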

After discarding the outliers, the dataset has 1329 entries and 7 columns:

> dim(data)
[1] 1329    7
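The IQR rule described above can be sketched on toy data. This is only an illustration: the helper name iqr_bounds and the toy bmi values are assumptions made for the example, not part of the original analysis.

```r
# Hypothetical helper (not from the report): Tukey fences
# Q1 - k*IQR and Q3 + k*IQR for a numeric vector.
iqr_bounds <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]                      # interquartile range
  c(lower = q[1] - k * spread, upper = q[2] + k * spread)
}

# Toy data (assumed values): eight ordinary bmi values and one extreme value.
toy <- data.frame(bmi = c(22.1, 24.3, 25.0, 27.9, 28.9, 30.2, 33.0, 33.8, 53.1))

b <- iqr_bounds(toy$bmi)
# Keep rows strictly inside the fences, as the subset() call in the report does.
kept <- toy[toy$bmi > b[["lower"]] & toy$bmi < b[["upper"]], , drop = FALSE]

print(b)
print(nrow(kept))
```

With these toy values the quartiles are 25.0 and 33.0, so the limits are 13 and 45 and only the value 53.1 is dropped.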

2 Summary of the data

> #summary of data
> str(data)
tibble [1,329 x 7] (S3: tbl_df/tbl/data.frame)
 $ age     : num [1:1329] 19 18 28 33 32 31 46 37 37 60 ...
 $ sex     : chr [1:1329] "female" "male" "male" "male" ...
 $ bmi     : num [1:1329] 27.9 33.8 33 22.7 28.9 ...
 $ children: num [1:1329] 0 1 3 0 0 0 1 3 2 0 ...
 $ smoker  : chr [1:1329] "yes" "no" "no" "no" ...
 $ region  : chr [1:1329] "southwest" "southeast" "southeast" "northwest" ...
 $ charges : num [1:1329] 16885 1726 4449 21984 3867 ...
> summary(data)
      age           sex                 bmi           children
 Min.   :18.0   Length:1329        Min.   :15.96   Min.   :0.000
 1st Qu.:27.0   Class :character   1st Qu.:26.22   1st Qu.:0.000
 Median :39.0   Mode  :character   Median :30.30   Median :1.000
 Mean   :39.2                      Mean   :30.54   Mean   :1.096
 3rd Qu.:51.0                      3rd Qu.:34.48   3rd Qu.:2.000
 Max.   :64.0                      Max.   :46.75   Max.   :5.000
    smoker             region             charges
 Length:1329        Length:1329        Min.   : 1122
 Class :character   Class :character   1st Qu.: 4738
 Mode  :character   Mode  :character   Median : 9361
                                       Mean   :13212
                                       3rd Qu.:16587
                                       Max.   :62593

- str(data) displays the structure of the data object, including variable names, data types, and sample values, giving an overview of how the data are organized.

- summary(data) gives a concise statistical summary of each variable: for numeric variables the minimum, quartiles, median, mean, and maximum; for character variables the length and class.

- The variables "sex", "smoker", and "region" are categorical, so we convert them into factor variables:

> #categorical variables
> data$sex=as.factor(data$sex)
> data$smoker=as.factor(data$smoker)
> data$region=as.factor(data$region)
> levels(data$sex)
[1] "female" "male"
> levels(data$smoker)
[1] "no"  "yes"
> levels(data$region)
[1] "northeast" "northwest" "southeast" "southwest"
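To see why the factor conversion matters for the regression that follows, the sketch below shows how R turns factor levels into dummy columns. The four-row toy data frame is an assumption made for illustration; only the level names come from the report.

```r
# Toy data (assumed rows); the level names match the dataset in the report.
toy <- data.frame(
  smoker = c("yes", "no", "no", "yes"),
  region = c("southwest", "southeast", "northwest", "northeast"),
  stringsAsFactors = FALSE
)
toy$smoker <- as.factor(toy$smoker)
toy$region <- as.factor(toy$region)

# Levels are sorted alphabetically; the first level is the reference category.
levels(toy$smoker)
levels(toy$region)

# model.matrix() builds the design matrix that lm() uses internally:
# one 0/1 column for every non-reference level.
X <- model.matrix(~ smoker + region, data = toy)
colnames(X)
```

The column names (smokeryes, regionnorthwest, regionsoutheast, regionsouthwest) are the same labels that appear in the coefficient tables, with "no" and "northeast" absorbed into the intercept as reference levels.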


3 Perform the multiple linear regression

> #run regression
> lm1=lm(charges~age+bmi+children+sex+smoker+region, data=data)
> summary(lm1)

Call:
lm(formula = charges ~ age + bmi + children + sex + smoker + region, data = data)

Residuals:
   Min     1Q Median     3Q    Max
-11124  -2861  -1029   1346  30104

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     -11958.25     998.16 -11.980  < 2e-16 ***
age                256.20      11.86  21.606  < 2e-16 ***
bmi                341.37      29.24  11.675  < 2e-16 ***
children           479.28     136.96   3.499 0.000482 ***
sexmale            -41.48     331.57  -0.125 0.900452
smokeryes        23640.55     412.27  57.343  < 2e-16 ***
regionnorthwest   -382.61     473.09  -0.809 0.418807
regionsoutheast  -1044.90     476.52  -2.193 0.028497 *
regionsouthwest  -1020.06     475.21  -2.147 0.032011 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6017 on 1320 degrees of freedom
Multiple R-squared:  0.75,  Adjusted R-squared:  0.7485
F-statistic: 495.1 on 8 and 1320 DF,  p-value: < 2.2e-16

- The model has one dependent variable, charges, and six independent variables: age, bmi, children, sex, smoker, and region (age, bmi, and children are numerical; sex, smoker, and region are categorical).

- Because the p-value of the sex variable (0.90) is greater than 0.05, we consider whether to exclude sex from the model:

> lm2=lm(charges~age+bmi+children+smoker+region,data=data)
> summary(lm2)

Call:
lm(formula = charges ~ age + bmi + children + smoker + region,
    data = data)

Residuals:
   Min     1Q Median     3Q    Max
-11143  -2854  -1026   1351  30086

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     -11974.83     988.96 -12.108  < 2e-16 ***
age                256.23      11.85  21.623  < 2e-16 ***
bmi                341.21      29.20  11.684  < 2e-16 ***
children           478.96     136.88   3.499 0.000482 ***
smokeryes        23636.45     410.81  57.536  < 2e-16 ***
regionnorthwest   -382.29     472.91  -0.808 0.419009
regionsoutheast  -1044.44     476.33  -2.193 0.028504 *
regionsouthwest  -1019.88     475.03  -2.147 0.031977 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6014 on 1321 degrees of freedom
Multiple R-squared:  0.75,  Adjusted R-squared:  0.7487
F-statistic: 566.2 on 7 and 1321 DF,  p-value: < 2.2e-16

> anova(lm2, lm1)
Analysis of Variance Table

Model 1: charges ~ age + bmi + children + smoker + region
Model 2: charges ~ age + bmi + children + sex + smoker + region
  Res.Df        RSS Df Sum of Sq      F Pr(>F)
1   1321 4.7783e+10
2   1320 4.7782e+10  1    566641 0.0157 0.9005

According to the ANOVA test, the p-value (0.9005) is greater than 0.05, so we select "lm2" and exclude the variable "sex" from the model.

- The multiple linear regression equation of lm2:

charges = -11974.83 + 256.23 * age + 341.21 * bmi + 478.96 * children
          + 23636.45 * smokeryes - 382.29 * regionnorthwest
          - 1044.44 * regionsoutheast - 1019.88 * regionsouthwest

- The model lm2 has adjusted R^2 = 0.7487 (multiple R^2 = 0.75), which means that about 75% of the variation in medical charges can be explained by age, smoker, bmi, children, and region.
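As a worked example of how the fitted equation is used, the sketch below evaluates it for one hypothetical person. The coefficients are copied from the summary output of the model without sex; the helper name predict_charges and the example inputs are assumptions made for illustration.

```r
# Coefficients copied from the reported summary of the model without sex.
lm2_coef <- c(
  intercept = -11974.83, age = 256.23, bmi = 341.21, children = 478.96,
  smokeryes = 23636.45, regionnorthwest = -382.29,
  regionsoutheast = -1044.44, regionsouthwest = -1019.88
)

# Hypothetical helper: evaluate the fitted equation for one person.
# "no" and "northeast" are the reference levels, so they add nothing.
predict_charges <- function(age, bmi, children, smoker = "no", region = "northeast") {
  b <- lm2_coef
  out <- b[["intercept"]] + b[["age"]] * age + b[["bmi"]] * bmi +
    b[["children"]] * children
  if (smoker == "yes") out <- out + b[["smokeryes"]]
  if (region != "northeast") out <- out + b[[paste0("region", region)]]
  out
}

# Assumed example: age 40, bmi 30, two children, living in the northeast.
predict_charges(40, 30, 2)                    # 9468.59
predict_charges(40, 30, 2, smoker = "yes")    # 33105.04
```

The two calls differ only in smoking status, so their difference equals the smokeryes coefficient, 23636.45.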

4 Check the collinearity in the dataset and remove it

- Do some quick checks of correlation between the predictors:

> library(faraway)
> #do some quick checks of correlation between the predictors
> pairs(data, col = "dodgerblue")

[Figure: scatterplot matrix of all variables in the dataset]

The pairs(data, col = "dodgerblue") command creates a scatterplot matrix that shows the pairwise relationships between the variables in the dataset.

Based on the pairwise relationships between the independent variables shown above, there is no evidence of collinearity.

- Check collinearity by VIF:

> library(car)
> vif(lm2)
             GVIF Df GVIF^(1/(2*Df))
age      1.017358  1        1.008642
bmi      1.097225  1        1.047485
children 1.003709  1        1.001853
smoker   1.006581  1        1.003285
region   1.090488  3        1.014542

All values in the last column of the above output are less than 5 (a common rule of thumb), and the GVIF values themselves are all close to 1, hence there is no evidence of multicollinearity.
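For a numeric predictor, the variance inflation factor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the others. The sketch below computes it by hand on simulated data; the variables x1, x2, and x3 are assumptions made for illustration, and x3 is deliberately built to be collinear with x1 so that the contrast with the values above is visible.

```r
set.seed(42)
n  <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)                          # independent of the others
x3 <- 0.9 * x1 + rnorm(n, sd = 0.3)     # deliberately collinear with x1
X  <- data.frame(x1, x2, x3)

# VIF of each predictor: regress it on the remaining predictors,
# then apply 1 / (1 - R^2).
manual_vif <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

round(manual_vif, 2)
```

In this simulation x1 and x3 get large values while x2 stays near 1. The predictors in the report all have values near 1, which is the pattern of x2 here.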

5 Interaction model

a) The interactions of "smoker" with the numerical variables in the model:

> lm3=lm(charges~age+bmi+children+smoker+age:smoker+bmi:smoker+children:smoker+region, data=data)
> summary(lm3)

Call:
lm(formula = charges ~ age + bmi + children + smoker + age:smoker +
    bmi:smoker + children:smoker + region, data = data)

Residuals:
     Min       1Q   Median       3Q      Max
-10035.3  -1930.6  -1325.1   -422.6  30021.6

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)         -2682.637    882.544  -3.040  0.00242 **
age                   264.203     10.680  24.738  < 2e-16 ***
bmi                    27.856     26.271   1.060  0.28919
children              583.764    122.087   4.782 1.94e-06 ***
smokeryes          -20757.873   1897.213 -10.941  < 2e-16 ***
regionsoutheast     -1224.550    383.492  -3.193  0.00144 **
regionsouthwest     -1225.977    382.385  -3.206  0.00138 **
age:smokeryes          -7.705     23.832  -0.323  0.74653
bmi:smokeryes        1477.213     55.139  26.791  < 2e-16 ***
children:smokeryes   -342.005    282.615  -1.210  0.22644
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

F-statistic: 684.6 on 10 and 1318 DF,  p-value: < 2.2e-16

The p-values of age:smokeryes (0.74653) and children:smokeryes (0.22644) are both greater than 0.05

=> we consider whether to drop these two interaction terms from the model

=> we perform an ANOVA test to examine this:

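> lm4=lm(charges~age+bmi+children+smoker+bmi:smoker+region, data=data)
> anova(lm4, lm3)
Analysis of Variance Table

Model 1: charges ~ age + bmi + children + smoker + bmi:smoker + region
Model 2: charges ~ age + bmi + children + smoker + age:smoker + bmi:smoker +
    children:smoker + region
  Res.Df        RSS Df Sum of Sq      F Pr(>F)
1   1320 3.0895e+10
2   1318 3.0857e+10  2  38358287 0.8192  0.441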

- Based on the results of the ANOVA test, the p-value (0.441) is greater than 0.05, so we select "lm4" (the model that excludes the two interaction terms).

- We continue with an ANOVA test to examine whether the interaction between the variables "bmi" and "smoker" is significant:

> anova(lm2, lm4)
Analysis of Variance Table

Model 1: charges ~ age + bmi + children + smoker + region
Model 2: charges ~ age + bmi + children + smoker + bmi:smoker + region
  Res.Df        RSS Df  Sum of Sq      F    Pr(>F)
1   1321 4.7783e+10
2   1320 3.0895e+10  1 1.6888e+10 721.53 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

- p-value < 2.2e-16 < 0.05 => reject H0 and choose H1

=> lm4 is more suitable
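The F statistic in these ANOVA tables can be reproduced from the residual sums of squares alone. The sketch below does so for the comparison of the models with and without the bmi:smoker interaction, using the rounded RSS values printed in the table; the helper name nested_f is an assumption made for illustration.

```r
# Hypothetical helper: F test for two nested linear models, computed from
# their residual sums of squares (RSS) and residual degrees of freedom.
nested_f <- function(rss_small, df_small, rss_big, df_big) {
  df_diff <- df_small - df_big                        # number of added terms
  f <- ((rss_small - rss_big) / df_diff) / (rss_big / df_big)
  p <- pf(f, df_diff, df_big, lower.tail = FALSE)     # upper-tail p-value
  c(F = f, p = p)
}

# Values from the reported table: without the interaction (RSS 4.7783e+10,
# 1321 df) versus with bmi:smoker (RSS 3.0895e+10, 1320 df).
res <- nested_f(4.7783e10, 1321, 3.0895e10, 1320)
print(res)
```

Because the RSS values in the table are rounded to five significant digits, the result agrees with the reported F = 721.53 only approximately, and the p-value falls far below the 2.2e-16 printing threshold.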

- The plot without the interaction variable:

> #no interaction
lmhinh1=lm(charges~bmi+smoker,data=data)
int.no=coef(lmhinh1)[1]
int.yes=coef(lmhinh1)[1]+coef(lmhinh1)[3]
slope.all.levels=coef(lmhinh1)[2]
plot_colors=c("darkorange","darkgrey")
plot(charges~bmi,data=data,col=plot_colors[smoker],pch=as.numeric(smoker))
abline(int.no,slope.all.levels,col="darkorange",lty=1,lwd=2)
abline(int.yes,slope.all.levels,col="darkgrey",lty=2,lwd=2)
legend("topright",c("no","yes"),col=plot_colors,lty=c(1,2),pch=c(1,2))

[Figure: charges versus bmi by smoker status, with two parallel fitted lines]

- This plot suggests that, at the same BMI, average charges differ between smokers and non-smokers. However, this model forces the average change in charges per unit increase in BMI to be the same for both groups, and the two parallel lines do not follow the data well. Overall, the fit of this model is poor.

- The plot with the interaction variable:

#with interaction
lmhinh2=lm(charges~bmi*smoker,data=data)
int.no=coef(lmhinh2)[1]
int.yes=coef(lmhinh2)[1]+coef(lmhinh2)[3]
slope.no=coef(lmhinh2)[2]
slope.yes=coef(lmhinh2)[2]+coef(lmhinh2)[4]
plot_colors=c("darkorange","darkgrey")
plot(charges~bmi,data=data,col=plot_colors[smoker],pch=as.numeric(smoker))
abline(int.no,slope.no,col="darkorange",lty=1,lwd=2)
abline(int.yes,slope.yes,col="darkgrey",lty=2,lwd=2)
legend("topright",c("no","yes"),col=plot_colors,lty=c(1,2),pch=c(1,2))

[Figure: charges versus bmi by smoker status, with a separate fitted line for each group]

- This plot illustrates that, with the interaction variable, the fitted lines follow the data much better.
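The way the interaction coefficients turn into one line per group can be checked on simulated data. In the sketch below the data frame sim and its generating values are assumptions made for illustration; only the model form charges ~ bmi * smoker follows the report.

```r
set.seed(1)
n <- 200
bmi <- runif(n, 18, 45)
smoker <- factor(sample(c("no", "yes"), n, replace = TRUE))
# Assumed data-generating process: smokers have a much steeper bmi slope.
charges <- 5000 + 80 * bmi +
  ifelse(smoker == "yes", -20000 + 1400 * bmi, 0) + rnorm(n, sd = 3000)
sim <- data.frame(bmi, smoker, charges)

fit <- lm(charges ~ bmi * smoker, data = sim)
b <- coef(fit)

# Intercept and slope of the fitted line in each group.
lines_by_group <- data.frame(
  smoker    = c("no", "yes"),
  intercept = c(b[["(Intercept)"]], b[["(Intercept)"]] + b[["smokeryes"]]),
  slope     = c(b[["bmi"]], b[["bmi"]] + b[["bmi:smokeryes"]])
)
print(lines_by_group)

# A model with a full interaction reproduces the lines obtained by
# fitting each group separately.
sep_yes <- coef(lm(charges ~ bmi, data = sim[sim$smoker == "yes", ]))
print(sep_yes)
```

The smoker line adds the smokeryes coefficient to the intercept and the bmi:smokeryes coefficient to the slope, which is the same arithmetic used for int.yes and slope.yes in the plotting code.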

b) The interactions of "region" with the numerical variables in the model:

> lm5=lm(charges~age+bmi+children+smoker+bmi:smoker+region+age:region+bmi:region+children:region, data=data)
> summary(lm5)

Call:
lm(formula = charges ~ age + bmi + children + smoker + bmi:smoker +
    region + age:region + bmi:region + children:region, data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-9403.6 -2004.5 -1253.2  -251.6 29776.2

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)
(Intercept)               -3558.92    1554.97  -2.289  0.02225 *
age                         232.36      19.31  12.032  < 2e-16 ***
bmi                         101.36      48.31   2.098  0.03610 *
children                    584.36     224.09   2.608  0.00922 **
smokeryes                -21906.25    1723.54 -12.710  < 2e-16 ***
regionsoutheast            2646.23    2163.56   1.223  0.22152
regionsouthwest            -457.44    2167.7   -0.211  0.83290
bmi:smokeryes              1492.10      55.40  26.931  < 2e-16 ***
age:regionnorthwest          17.28      27.23   0.635  0.52577
age:regionsoutheast          51.38      26.57   1.934  0.05336 .
age:regionsouthwest          42.19      27.57   1.530  0.12621
bmi:regionnorthwest         -11.22      70.34  -0.160  0.87326
bmi:regionsoutheast        -181.49      62.60  -2.899  0.00380 **
bmi:regionsouthwest         -70.26      67.7   -1.037  0.29996
children:regionnorthwest    281.01     320.98   0.875  0.38147
children:regionsoutheast   -175.77     312.10  -0.563  0.57341
children:regionsouthwest   -350.74     307.67  -1.140  0.25450
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

F-statistic: 406.8 on 17 and 1311 DF,  p-value: < 2.2e-16

- The p-values of all age:region and children:region terms are greater than 0.05 (none of them carries a significance star)

=> we consider whether to exclude them from the model lm5

- We conduct an ANOVA test to investigate:

> lm6=lm(charges~age+bmi+children+smoker+bmi:smoker+region+bmi:region, data=data)
> anova(lm6, lm5)
Analysis of Variance Table

Model 1: charges ~ age + bmi + children + smoker + bmi:smoker + region +
    bmi:region
Model 2: charges ~ age + bmi + children + smoker + bmi:smoker + region +
    age:region + bmi:region + children:region
  Res.Df        RSS Df Sum of Sq      F Pr(>F)
1   1317 3.0668e+10
2   1311 3.0462e+10  6 205854820 1.4766 0.1826

- Based on the ANOVA test results, the p-value (0.1826) is greater than 0.05, hence we choose "lm6" as the preferred model, which excludes the age:region and children:region interaction terms.

- We continue with an ANOVA test to evaluate whether there is a statistically significant interaction between the variables "bmi" and "region".

Title	Final examination leadership and team building
Authors	Nguyễn Ngoc Diing - 21070750, Vũ Mỹ Hoa - 21070129, Nguyễn Thị Nhung - 21070230, Vũ Xuân Bách - 21070793
Supervisor	Dr. Phạm Thị Việt Hương
University	Vietnam National University, Hanoi
Major	International Studies
Type	Final examination
Year of publication	2024
City	Hanoi
