VIETNAM NATIONAL UNIVERSITY, HANOI
VNU-INTERNATIONAL SCHOOL
FINAL EXAMINATION LEADERSHIP AND TEAM
BUILDING
Group 2:
Nguyén Ngoc Diing - 21070750
Vũ Mỹ Hoa - 21070129
Nguyễn Thị Nhung - 21070230
Vũ Xuân Bách - 21070793
Lecturer: Dr. Phạm Thị Việt Hương
Ha Noi, 3/1/2024
TABLE OF CONTENTS
1 INTRODUCTION
1 Data sources
2 Information of dataset
1 Discard Outlier
2 Summary of the data
3 Perform the multiple linear regression
4 Check the collinearity in the dataset and remove it
5 Interaction model
6 Check assumptions of the multiple regression model
7 Going back to the original data. Choose to use AIC, BIC, or Cross-Validated RMSE to build your best possible model
1 INTRODUCTION
1 Data sources
resource=download&select=insurance.csv
2 Information of dataset
Medical Cost Personal Datasets: The aim is to predict the medical costs billed by health insurance to an individual, given some or all of the independent variables in the dataset.
Content of each column:
- age: age of the primary beneficiary
- sex: gender of the insurance contractor (female, male)
- bmi: body mass index, an objective index of body weight relative to height, computed as weight divided by height squared (kg/m^2); ideally 18.5 to 24.9
- children: number of children covered by health insurance / number of dependents
- smoker: smoking status (yes, no)
- region: the beneficiary's residential area in the US (northeast, southeast, southwest, northwest)
- charges: individual medical costs billed by health insurance
> setwd("C:/Users/asus/Downloads")
> library(readxl)
> data <- read_excel("insurance.xlsx", sheet = "sheet1")
> data[1:5, ]
  age sex      bmi children smoker region    charges
   19 female  27.9        0 yes    southwest   16885
   18 male    33.8        1 no     southeast    1726
   28 male    33          3 no     southeast    4449
   33 male    22.7        0 no     northwest   21984
   32 male    28.9        0 no     northwest    3867
> dim(data)
[1] 1338    7
- This dataset has 1338 entries and 7 columns.
1 Discard Outlier
> #reject outliers
> columns=c(1,3,4)
> data1=data[,columns]
> boxplot(data1)
[Figure: boxplots of age, bmi, and children]
Based on the boxplot, we can detect outliers in bmi. We use the IQR() function to identify and discard them: Q1 - 1.5*IQR gives the lower limit and Q3 + 1.5*IQR gives the upper limit, and any bmi value outside these limits is treated as an outlier. The result is shown in the boxplot below:
quartiles=quantile(data$bmi,probs=c(.25,.75),na.rm=FALSE)
IQR=IQR(data$bmi)
lower=quartiles[1]-1.5*IQR
upper=quartiles[2]+1.5*IQR  # upper limit, needed by subset() below
data=subset(data,data$bmi<upper & data$bmi>lower)
boxplot(data$bmi)
After discarding the outliers, the dataset has 1329 entries and 7 columns:
> dim(data)
[1] 1329    7
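The IQR rule described above can be sketched on simulated values. This is only an illustration: the vector bmi_sim and its two extreme values are made up and are not taken from the insurance data.

```r
# Simulated BMI-like values (hypothetical, NOT the insurance data)
set.seed(1)
bmi_sim <- c(rnorm(200, mean = 30, sd = 6), 55, 60)  # two deliberately extreme values

q <- quantile(bmi_sim, probs = c(0.25, 0.75))  # Q1 and Q3
iqr <- IQR(bmi_sim)                            # Q3 - Q1
lower <- q[[1]] - 1.5 * iqr                    # lower limit for outliers
upper <- q[[2]] + 1.5 * iqr                    # upper limit for outliers

# Keep only the values strictly inside the two limits
kept <- bmi_sim[bmi_sim > lower & bmi_sim < upper]
n_removed <- length(bmi_sim) - length(kept)
n_removed
```

Both limits are required: with only the lower limit defined, the filtering step would fail because the upper bound does not exist.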
2 Summary of the data
> #summary of data
> str(data)
tibble [1,329 x 7] (S3: tbl_df/tbl/data.frame)
 $ age     : num [1:1329] 19 18 28 33 32 31 46 37 37 60 ...
 $ sex     : chr [1:1329] "female" "male" "male" "male" ...
 $ bmi     : num [1:1329] 27.9 33.8 33 22.7 28.9 ...
 $ children: num [1:1329] 0 1 3 0 0 0 1 3 2 0 ...
 $ smoker  : chr [1:1329] "yes" "no" "no" "no" ...
 $ region  : chr [1:1329] "southwest" "southeast" "southeast" "northwest" ...
 $ charges : num [1:1329] 16885 1726 4449 21984 3867 ...
> summary(data)
      age            sex                 bmi           children
 Min.   :18.0   Length:1329        Min.   :15.96   Min.   :0.000
 1st Qu.:27.0   Class :character   1st Qu.:26.22   1st Qu.:0.000
 Median :39.0   Mode  :character   Median :30.30   Median :1.000
 Mean   :39.2                      Mean   :30.54   Mean   :1.096
 3rd Qu.:51.0                      3rd Qu.:34.48   3rd Qu.:2.000
 Max.   :64.0                      Max.   :46.75   Max.   :5.000
    smoker             region             charges
 Length:1329        Length:1329        Min.   : 1122
 Class :character   Class :character   1st Qu.: 4738
 Mode  :character   Mode  :character   Median : 9361
                                       Mean   :13212
                                       3rd Qu.:16587
                                       Max.   :62593
- str(data) displays the structure of the data object (variable names, data types, and the first few values), giving an overview of how the data is organized.
- summary(data) gives a concise statistical summary of each variable: for numeric variables, the minimum, quartiles, median, mean, and maximum; for character variables, the length and class. This gives insight into the distribution and characteristics of the variables.
- The variables "sex", "smoker", and "region" are categorical variables in the dataset. Therefore, we convert them into factor variables:
> #categorical variables
> data$sex=as.factor(data$sex)
> data$smoker=as.factor(data$smoker)
> data$region=as.factor(data$region)
> levels(data$sex)
[1] "female" "male"
> levels(data$smoker)
[1] "no"  "yes"
> levels(data$region)
[1] "northeast" "northwest" "southeast" "southwest"
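To see why the regression output later contains terms such as smokeryes and regionnorthwest, the sketch below shows how R expands factors into dummy (indicator) columns. The data frame toy is a made-up four-row example that only reuses the level names shown above.

```r
# Hypothetical toy data reusing the factor levels shown above
toy <- data.frame(
  smoker = factor(c("yes", "no", "no", "yes")),
  region = factor(c("southwest", "southeast", "northeast", "northwest"))
)

# The first level of each factor is the reference category
ref_smoker <- levels(toy$smoker)[1]   # "no"
ref_region <- levels(toy$region)[1]   # "northeast"

# model.matrix() builds the design matrix that lm() uses internally
X <- model.matrix(~ smoker + region, data = toy)
colnames(X)
```

The reference levels ("no" and "northeast") get no column of their own; their effect is absorbed into the intercept, and every other coefficient is a difference from that baseline.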
3 Perform the multiple linear regression
> #run regression
> lm1=lm(charges~age+bmi+children+sex+smoker+region, data=data)
> summary(lm1)
Call:
lm(formula = charges ~ age + bmi + children + sex + smoker + region, data = data)
Residuals:
   Min     1Q Median     3Q    Max
-11124  -2861  -1029   1346  30104
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     -11958.25     998.16 -11.980  < 2e-16 ***
age                256.20      11.86  21.606  < 2e-16 ***
bmi                341.37      29.24  11.675  < 2e-16 ***
children           479.28     136.96   3.499 0.000482 ***
sexmale            -41.48     331.57  -0.125 0.900452
smokeryes        23640.55     412.27  57.343  < 2e-16 ***
regionnorthwest   -382.61     473.09  -0.809 0.418807
regionsoutheast  -1044.90     476.52  -2.193 0.028497 *
regionsouthwest  -1020.06     475.21  -2.147 0.032011 *
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6017 on 1320 degrees of freedom
Multiple R-squared: 0.75, Adjusted R-squared: 0.7485
F-statistic: 495.1 on 8 and 1320 DF, p-value: < 2.2e-16
- The model has one dependent variable, charges, and six independent variables: age, bmi, children, sex, smoker, and region (age, bmi, and children are numerical; sex, smoker, and region are categorical).
- Because the p-value of the sex variable is 0.9 > 0.05, it is not statistically significant, so we consider whether to exclude sex from the model:
> lm2=lm(charges~age+bmi+children+smoker+region,data=data)
> summary(lm2)
Call:
lm(formula = charges ~ age + bmi + children + smoker + region,
    data = data)
Residuals:
   Min     1Q Median     3Q    Max
-11143  -2854  -1026   1351  30086
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     -11974.83     988.96 -12.108  < 2e-16 ***
age                256.23      11.85  21.623  < 2e-16 ***
bmi                341.21      29.20  11.684  < 2e-16 ***
children           478.96     136.88   3.499 0.000482 ***
smokeryes        23636.45     410.81  57.536  < 2e-16 ***
regionnorthwest   -382.29     472.91  -0.808 0.419009
regionsoutheast  -1044.44     476.33  -2.193 0.028504 *
regionsouthwest  -1019.88     475.03  -2.147 0.031977 *
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6014 on 1321 degrees of freedom
Multiple R-squared: 0.75, Adjusted R-squared: 0.7487
F-statistic: 566.2 on 7 and 1321 DF, p-value: < 2.2e-16
> anova(lm2, lm1)
Analysis of Variance Table
Model 1: charges ~ age + bmi + children + smoker + region
Model 2: charges ~ age + bmi + children + sex + smoker + region
  Res.Df        RSS Df Sum of Sq      F Pr(>F)
1   1321 4.7783e+10
2   1320 4.7782e+10  1    566641 0.0157 0.9005
According to the ANOVA test, the p-value (0.9005) is greater than 0.05, so we fail to reject the null hypothesis that the coefficient of "sex" is zero. We therefore select "lm2" and exclude the variable "sex" from the model.
- The multiple linear regression equation of lm2:
charges = -11974.83 + 256.23 * age + 341.21 * bmi + 478.96 * children +
23636.45 * smokeryes - 382.29 * regionnorthwest - 1044.44 * regionsoutheast -
1019.88 * regionsouthwest
- The model lm2 has adjusted R^2 = 0.7487 (multiple R^2 = 0.75), which means that about 75% of the variation in the medical charges can be explained by age, bmi, children, smoker, and region.
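As a worked example of the lm2 equation, the sketch below plugs one individual into it. The coefficients are copied from the summary output above; the function name predict_charges and the example individual (age 30, bmi 25, one child) are hypothetical.

```r
# Coefficients copied from the lm2 summary output above
b <- c(intercept = -11974.83, age = 256.23, bmi = 341.21, children = 478.96,
       smokeryes = 23636.45, regionnorthwest = -382.29,
       regionsoutheast = -1044.44, regionsouthwest = -1019.88)

# Hypothetical helper: evaluates the fitted equation by hand.
# "no" and "northeast" are the reference levels, so they add nothing.
predict_charges <- function(age, bmi, children, smoker = "no", region = "northeast") {
  out <- b[["intercept"]] + b[["age"]] * age + b[["bmi"]] * bmi +
    b[["children"]] * children
  if (smoker == "yes") out <- out + b[["smokeryes"]]
  if (region != "northeast") out <- out + b[[paste0("region", region)]]
  out
}

non_smoker <- predict_charges(30, 25, 1)                  # 4721.28
smoker     <- predict_charges(30, 25, 1, smoker = "yes")  # 28357.73
smoker - non_smoker                                       # 23636.45, the smokeryes coefficient
```

Holding everything else fixed, the model predicts that smoking raises charges by the smokeryes coefficient, regardless of age or bmi. This is the restriction that the interaction models later relax.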
4 Check the collinearity in the dataset and remove it
- Do some quick checks of correlation between the predictors:
> library(faraway)
> #do some quick checks of correlation between the predictors
> pairs(data, col = "dodgerblue")
[Figure: scatterplot matrix of all variables in the dataset]
The pairs(data, col = "dodgerblue") command creates a scatterplot matrix that shows the pairwise relationships between the variables in the dataset.
Based on the scatterplot matrix of the independent variables above, there is no visible evidence of collinearity.
- Check collinearity by VIF:
> library(car)
> vif(lm2)
             GVIF Df GVIF^(1/(2*Df))
age      1.017358  1        1.008642
bmi      1.097225  1        1.047485
children 1.003709  1        1.001853
smoker   1.006581  1        1.003285
region   1.090488  3        1.014542
All values in the last column of the above output are less than 5 (a common rule of thumb), hence there is no evidence of problematic multicollinearity.
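To make the VIF check less of a black box, the sketch below computes a VIF by hand in base R as 1 / (1 - R^2), where R^2 comes from regressing one predictor on the others. The simulated predictors and the helper vif_manual are assumptions for illustration; for a numeric predictor with Df = 1, this plain VIF is the quantity reported in the GVIF column.

```r
# Simulated predictors (hypothetical, NOT the insurance data)
set.seed(7)
n <- 300
age <- rnorm(n, mean = 40, sd = 12)
bmi <- 25 + 0.1 * age + rnorm(n, sd = 5)  # mildly correlated with age
children <- rpois(n, lambda = 1)

# Hypothetical helper: VIF of `target` given the other predictors
vif_manual <- function(target, others) {
  df <- data.frame(target = target, others)
  r2 <- summary(lm(target ~ ., data = df))$r.squared
  1 / (1 - r2)
}

vifs <- c(
  age      = vif_manual(age, data.frame(bmi, children)),
  bmi      = vif_manual(bmi, data.frame(age, children)),
  children = vif_manual(children, data.frame(age, bmi))
)
vifs
```

A VIF close to 1 means the predictor is almost unexplained by the others; values far above the rule-of-thumb threshold would indicate that its coefficient is estimated imprecisely.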
5 Interaction model
a) The interactions of "smoker" with the numerical variables in the model:
> lm3=lm(charges~age+bmi+children+smoker+age:smoker+bmi:smoker+children:smoker+region, data=data)
> summary(lm3)
Call:
lm(formula = charges ~ age + bmi + children + smoker + age:smoker +
    bmi:smoker + children:smoker + region, data = data)
Residuals:
     Min       1Q   Median       3Q      Max
-10035.3  -1930.6  -1325.1   -422.6  30021.6
Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)         -2682.637    882.544  -3.040  0.00242 **
age                   264.203     10.680  24.738  < 2e-16 ***
bmi                    27.856     26.271   1.060  0.28919
children              583.764    122.087   4.782 1.94e-06 ***
smokeryes          -20757.873   1897.213 -10.941  < 2e-16 ***
regionnorthwest      -584.838    380.619  -1.537  0.12465
regionsoutheast     -1224.550    383.492  -3.193  0.00144 **
regionsouthwest     -1225.977    382.385  -3.206  0.00138 **
age:smokeryes          -7.705     23.832  -0.323  0.74653
bmi:smokeryes        1477.213     55.139  26.791  < 2e-16 ***
children:smokeryes   -342.005    282.615  -1.210  0.22644
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4839 on 1318 degrees of freedom
Multiple R-squared: 0.8386, Adjusted R-squared: 0.8373
F-statistic: 684.6 on 10 and 1318 DF, p-value: < 2.2e-16
The p-value of age:smokeryes (0.74653) and the p-value of children:smokeryes (0.22644) are both greater than 0.05.
=> We consider whether to drop these two interaction terms from the model.
=> We perform an ANOVA test to check:
> lm4=lm(charges~age+bmi+children+smoker+bmi:smoker+region, data=data)
> anova(lm4, lm3)
Analysis of Variance Table
Model 1: charges ~ age + bmi + children + smoker + bmi:smoker + region
Model 2: charges ~ age + bmi + children + smoker + age:smoker + bmi:smoker +
    children:smoker + region
  Res.Df        RSS Df Sum of Sq      F Pr(>F)
1   1320 3.0895e+10
2   1318 3.0857e+10  2  38358287 0.8192  0.441
- Based on the results of the ANOVA test, the p-value (0.441) is greater than 0.05, so we select "lm4" (the model that excludes the two interaction terms).
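The F statistic in this ANOVA table can be reproduced by hand, which shows what the test actually compares. The sketch below first recomputes F and the p-value from the rounded numbers printed above, then checks the same formula against anova() on simulated data; the simulated variables x1, x2, and y are made up for illustration.

```r
# 1) Recompute F and p from the table above (inputs are rounded as printed)
f_reported <- (38358287 / 2) / (3.0857e10 / 1318)  # (drop in RSS / extra df) / (RSS_big / residual df)
p_reported <- pf(f_reported, 2, 1318, lower.tail = FALSE)

# 2) Same formula versus anova() on simulated data (hypothetical, NOT the insurance data)
set.seed(42)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 + rnorm(n)       # x2 has no true effect

small <- lm(y ~ x1)              # reduced model
big   <- lm(y ~ x1 + x2)         # full model, nests the reduced one

rss_small <- sum(resid(small)^2)
rss_big   <- sum(resid(big)^2)
df_diff   <- df.residual(small) - df.residual(big)

f_manual <- ((rss_small - rss_big) / df_diff) / (rss_big / df.residual(big))
p_manual <- pf(f_manual, df_diff, df.residual(big), lower.tail = FALSE)

tab <- anova(small, big)
c(manual = f_manual, anova = tab$F[2])
```

A large p-value means the extra terms reduce the residual sum of squares by no more than chance would explain, so the smaller model is preferred.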
- We continue with the ANOVA test to examine whether the interaction between the variables "bmi" and "smoker" is significant:
> anova(lm2, lm4)
Analysis of Variance Table
Model 1: charges ~ age + bmi + children + smoker + region
Model 2: charges ~ age + bmi + children + smoker + bmi:smoker + region
  Res.Df        RSS Df  Sum of Sq      F    Pr(>F)
1   1321 4.7783e+10
2   1320 3.0895e+10  1 1.6888e+10 721.53 < 2.2e-16 ***
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- p-value < 2.2e-16 < 0.05 => reject H0 (no bmi:smoker interaction) in favor of H1
=> lm4 is more suitable
- The plot without the interaction variable:
> #no interaction
lmhinh1=lm(charges~bmi+smoker,data=data)
int.no=coef(lmhinh1)[1]
int.yes=coef(lmhinh1)[1]+coef(lmhinh1)[3]
slope.all.levels=coef(lmhinh1)[2]
plot_colors=c("darkorange","darkgrey")
plot(charges~bmi,data=data,col=plot_colors[smoker],pch=as.numeric(smoker))
abline(int.no,slope.all.levels,col="darkorange",lty=1,lwd=2)
abline(int.yes,slope.all.levels,col="darkgrey",lty=2,lwd=2)
legend("topright",c("no","yes"),col=plot_colors,lty=c(1,2),pch=c(1,2))
[Figure: charges versus bmi with parallel fitted lines for non-smokers and smokers]
- This plot suggests a difference in average charges between smokers and non-smokers with the same BMI. However, in this additive model the average change in charges for an increase in BMI is the same for both groups by construction. Overall, these parallel lines do not fit the data well.
- The plot with the interaction variable:
#with interaction
lmhinh2=lm(charges~bmi*smoker,data=data)
int.no=coef(lmhinh2)[1]
int.yes=coef(lmhinh2)[1]+coef(lmhinh2)[3]
slope.no=coef(lmhinh2)[2]
slope.yes=coef(lmhinh2)[2]+coef(lmhinh2)[4]
plot_colors=c("darkorange","darkgrey")
plot(charges~bmi,data=data,col=plot_colors[smoker],pch=as.numeric(smoker))
abline(int.no,slope.no,col="darkorange",lty=1,lwd=2)
abline(int.yes,slope.yes,col="darkgrey",lty=2,lwd=2)
legend("topright",c("no","yes"),col=plot_colors,lty=c(1,2),pch=c(1,2))
[Figure: charges versus bmi with separate fitted lines for non-smokers and smokers]
- This plot illustrates that, with the interaction variable, the lines fit the data much better.
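The way the interaction term turns one common slope into two group-specific slopes can be sketched on simulated data. Everything below is hypothetical: the variables bmi_sim, smoker_sim, and charges_sim, and the true slopes of 50 and 1500, are made up and only mimic the pattern seen in the plots.

```r
# Simulated data (hypothetical, NOT the insurance data)
set.seed(3)
n <- 400
bmi_sim <- runif(n, min = 18, max = 45)
smoker_sim <- factor(sample(c("no", "yes"), n, replace = TRUE))
# Assumed truth: bmi slope of 50 for non-smokers and 1500 for smokers
charges_sim <- 5000 + ifelse(smoker_sim == "yes", 1500, 50) * bmi_sim +
  rnorm(n, sd = 2000)

fit <- lm(charges_sim ~ bmi_sim * smoker_sim)  # main effects plus interaction
cf <- coef(fit)

slope_no  <- cf[["bmi_sim"]]                                   # slope for the reference level "no"
slope_yes <- cf[["bmi_sim"]] + cf[["bmi_sim:smoker_simyes"]]   # slope for smokers

# With a full interaction, the smoker slope equals that of a separate fit
fit_yes <- lm(charges_sim ~ bmi_sim, subset = smoker_sim == "yes")
c(slope_no = slope_no, slope_yes = slope_yes, separate_fit = coef(fit_yes)[["bmi_sim"]])
```

The bmi coefficient alone is the slope for the reference group, and the interaction coefficient is the difference between the two slopes, which is why the bmi main effect can look small while the interaction is large.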
b) The interactions of "region" with the numerical variables in the model:
> lm5=lm(charges~age+bmi+children+smoker+bmi:smoker+region+age:region+bmi:region+children:region, data=data)
> summary(lm5)
Call:
lm(formula = charges ~ age + bmi + children + smoker + bmi:smoker +
    region + age:region + bmi:region + children:region, data = data)
Residuals:
    Min      1Q  Median      3Q     Max
-9403.6 -2004.5 -1253.2  -251.6 29776.2
Coefficients:
                          Estimate Std. Error t value Pr(>|t|)
(Intercept)               -3558.92    1554.97  -2.289  0.02225 *
age                         232.36      19.31  12.032  < 2e-16 ***
bmi                         101.36      48.31   2.098  0.03610 *
children                    584.36     224.09   2.608  0.00922 **
smokeryes                -21906.25    1723.54 -12.710  < 2e-16 ***
regionnorthwest           -1284.01    2239.17  -0.573  0.56645
regionsoutheast            2646.23    2163.56   1.223  0.22152
regionsouthwest            -457.44     2167.7  -0.211  0.83290
bmi:smokeryes              1492.10      55.40  26.931  < 2e-16 ***
age:regionnorthwest          17.28      27.23   0.635  0.52577
age:regionsoutheast          51.38      26.57   1.934  0.05336 .
age:regionsouthwest          42.19      27.57   1.530  0.12621
bmi:regionnorthwest         -11.22      70.34  -0.160  0.87326
bmi:regionsoutheast        -181.49      62.60  -2.899  0.00380 **
bmi:regionsouthwest         -70.26       67.7  -1.037  0.29996
children:regionnorthwest    281.01     320.98   0.875  0.38147
children:regionsoutheast   -175.77     312.10  -0.563  0.57341
children:regionsouthwest   -350.74     307.67  -1.140  0.25450
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4820 on 1311 degrees of freedom
Multiple R-squared: 0.8406, Adjusted R-squared: 0.8386
F-statistic: 406.8 on 17 and 1311 DF, p-value: < 2.2e-16
- The p-values of all age:region and children:region terms are greater than 0.05 (none of them is marked with *).
=> We consider whether to exclude them from the model lm5.
- We conduct an ANOVA test to investigate:
> lm6=lm(charges~age+bmi+children+smoker+bmi:smoker+region+bmi:region, data=data)
> anova(lm6, lm5)
Analysis of Variance Table
Model 1: charges ~ age + bmi + children + smoker + bmi:smoker + region +
    bmi:region
Model 2: charges ~ age + bmi + children + smoker + bmi:smoker + region +
    age:region + bmi:region + children:region
  Res.Df        RSS Df Sum of Sq      F Pr(>F)
1   1317 3.0668e+10
2   1311 3.0462e+10  6 205854820 1.4766 0.1826
- Based on the ANOVA test results, the p-value (0.1826) is greater than 0.05, hence we choose "lm6" as the preferred model, which excludes the two sets of interaction terms (age:region and children:region).
- Continuing with the ANOVA test to evaluate the presence of a statistically
significant interaction between the variables "bmi" and "smoker."