HANOI UNIVERSITY FACULTY OF MANAGEMENT AND TOURISM
Business and Economics Statistics
CASE STUDY ACADEMIC PERFORMANCE OF UNIVERSITY STUDENTS
Tutor: Mr. Nguyen Hoang Viet
Tutorial class: Tut 4
Group members:
TABLE OF CONTENTS
TABLE OF FIGURES
A. Scenario
B. Questions and Answers
1. Inference technique
2. Descriptive statistics for the dataset
3. Checking assumption
4. Two-way ANOVA test
5. Interaction plot and interpretations
6. Credibility of the interpretations and conclusions
TABLE OF FIGURES
Figure 1 The structure of this data frame
Figure 2 Frequency table (sample sizes)
Figure 3 Mean of Edu Spend according to Edu Levels and Provinces
Figure 4 The standard deviation of Edu Spend according to Edu levels and Provinces
Figure 5 Boxplot for distribution of groups
Figure 6 Gplot of group means
Figure 7 Levene’s Test result
Figure 8 Q-Q plot of residual
Figure 9 Test statistic output
Figure 10 Interaction plot between Spending for Edulevel and Province
A. Scenario
The Vietnam Household Living Standards Survey (VHLSS) was conducted nationwide every two years to systematically monitor the living standards of Vietnamese society. In 2018, the survey was carried out with a sample size of 46,995 households in 3,133 communes/wards, which were representative at national, regional, urban, rural, and provincial levels. The household questionnaire contained many sections, each of which covered a separate aspect of household activities, and education was one important indicator. In the survey, household heads were asked to specify their place of residence (province), the schooling level of their children (edulevel), and expenditure on education per child for the past 12 months in thousands of VND (eduspend). The objective of our study is to test for any significant interaction between the place of residence and schooling level, and to test for any significant differences in education expenditure due to these two variables, at the 0.05 level of significance.
A portion of the obtained data is presented below. The complete dataset, consisting of 90 observations and 4 variables (obs, province, edulevel, eduspend), is provided in the accompanying file named Case5.csv.
B. Questions and Answers
1. Inference technique
The study based on the Vietnam Household Living Standards Survey aims to test for any significant interaction between the place of residence (province) and schooling level (edulevel), and to test for any significant differences in education expenditure (eduspend) due to these two variables.
In this study, a two-way ANOVA (two-way analysis of variance) is applied to assess the effects of 2 independent variables on 1 dependent variable at the same time, including whether the two variables interact. Firstly, province and edulevel are the two factors, that is, the independent variables, in this case study. Secondly, eduspend is the variable that depends on these two factors (province and edulevel).
The purpose of this study is therefore to examine the effect of place of residence and schooling level on education expenditure, and the interaction between the two factors (province and edulevel).
2. Descriptive statistics for the dataset
Firstly, we use RStudio to compute descriptive statistics for this question. To start with, we import the CSV file “Case5.csv” into R for further calculation:
> setwd("C:/Users/Admin/Documents/bes research")
> getwd()
> case5 <- read.table("Case5.csv", header=TRUE, sep=",", quote="\"", stringsAsFactors=FALSE)
The structure of this data frame can be checked using the str() function:
> str(case5)
Figure 1 The structure of this data frame
From the above R output, we can see that there are 90 observations and 4 variables: obs, province, edulevel, eduspend. The obs and eduspend variables are numeric data, while the province and edulevel variables are character data. To apply some graphical or statistical methods, we should convert province and edulevel into factors, using the following code:
> case5$province <- factor(case5$province, levels = c("HungYen","ThaiBinh"))
> case5$edulevel <- factor(case5$edulevel, levels = c("Primary School","Secondary School","Nursery School"))
A frequency table can be created to see the sample size of each treatment group with the following R code:
> table(case5$province, case5$edulevel)
Figure 2 Frequency table (sample sizes)
It can be seen that all 6 treatment groups have the same sample size of 15. This balanced design makes the data well suited to a two-way ANOVA test.
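The balance check described above can be sketched on a stand-in data set. The data frame `demo` and its eduspend values are hypothetical (generated at random, not the Case5.csv data); only the layout of 2 provinces, 3 schooling levels and 15 observations per cell follows the text.

```r
# Synthetic stand-in for Case5.csv (hypothetical values, NOT the survey data)
set.seed(1)
demo <- expand.grid(
  rep      = 1:15,
  province = c("HungYen", "ThaiBinh"),
  edulevel = c("Primary School", "Secondary School", "Nursery School"),
  stringsAsFactors = FALSE
)
demo$eduspend <- round(runif(nrow(demo), min = 500, max = 5000))

# Convert the two grouping columns to factors, as done for case5
demo$province <- factor(demo$province, levels = c("HungYen", "ThaiBinh"))
demo$edulevel <- factor(demo$edulevel,
                        levels = c("Primary School", "Secondary School", "Nursery School"))

# Cross-tabulate the two factors to get the sample size of each cell
cell_counts <- table(demo$province, demo$edulevel)
print(cell_counts)

# The design is balanced when every cell holds the same number of observations
is_balanced <- length(unique(as.vector(cell_counts))) == 1
print(is_balanced)
```

With equal cell sizes, the sums of squares for the two main effects and the interaction do not depend on the order in which the terms enter the model, which is one reason a balanced layout is convenient.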
Next, we use the by() function in R to find descriptive statistics such as the mean and standard deviation for each treatment group defined by the two factors:
> by(case5$eduspend, list(case5$province, case5$edulevel), mean)
Figure 3 Mean of Edu Spend according to Edu Levels and Provinces
> by(case5$eduspend, list(case5$province, case5$edulevel), sd)
Figure 4 The standard deviation of Edu Spend according to Edu levels and Provinces
Each call gives the chosen descriptive statistic of the outcome variable (eduspend) for each treatment group, that is, for each combination of province and education level.
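As a cross-check on the by() output, the same cell statistics can be arranged as a matrix with tapply(). The sketch below uses a tiny made-up data frame (the object `toy` and all of its values are hypothetical, not the Case5 data), chosen so that the means can be verified by hand.

```r
# Tiny hypothetical data set (values invented for illustration only)
toy <- data.frame(
  province = rep(c("HungYen", "ThaiBinh"), each = 4),
  edulevel = rep(rep(c("Primary School", "Secondary School"), each = 2), times = 2),
  eduspend = c(100, 300, 400, 600, 200, 200, 500, 900)
)

# tapply() computes one statistic per cell and returns a province x edulevel matrix
cell_mean <- tapply(toy$eduspend, list(toy$province, toy$edulevel), mean)
cell_sd   <- tapply(toy$eduspend, list(toy$province, toy$edulevel), sd)

print(cell_mean)
print(cell_sd)
```

Rows correspond to provinces and columns to schooling levels, which makes it easy to pick out the largest and smallest standard deviations later on.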
To get further information, we draw a boxplot and a mean plot:
> boxplot(eduspend ~ interaction(province, edulevel), data = case5, xlab = "Place of residents", ylab = "Education expenditure", col = c("pink", "light blue", "yellow", "white", "orange", "gray"))
Figure 5 Boxplot for distribution of groups
The box plot shows several descriptive statistics for each group: the median, the quartiles, and the maximum and minimum values. Each cell has its own characteristics. Based on the R output, the Hung Yen – Secondary School group has the highest median, although it differs little from that of the Thai Binh – Secondary School group. Moreover, the Hung Yen – Nursery School group has the lowest values on almost every measure, including the median and the minimum.
The skewness of each group can also be read from the boxplot. Whether a group is roughly symmetric, positively skewed or negatively skewed is judged from the distances between the median and the two ends of the plot. Thai Binh – Secondary School and Hung Yen – Nursery School appear roughly symmetric, which is consistent with a normal distribution. Hung Yen – Secondary School and Thai Binh – Nursery School are examples of positively skewed distributions, while the other groups are negatively skewed. The boxplot also marks outliers as white dots in Hung Yen – Secondary School (1 outlier), Thai Binh – Secondary School (1 outlier), Hung Yen – Nursery School (2 outliers), and Thai Binh – Nursery School (3 outliers), which is a small share of the 90 observations.
We also use a mean plot to identify the mean value of each group and to compare means between groups, with the following code and output:
> install.packages("gplots")
> library("gplots")
> plotmeans(eduspend ~ interaction(province, edulevel), data = case5, xlab = "Province and edulevel", ylab = "Eduspend", main = "Mean Plot with 95% CI")
Figure 6 Gplot of group means
Figure 6 helps us better understand the structure of the Case 5 data and summarizes the variability between the group means, each drawn with its 95% confidence interval. It displays the sample size of each group, which equals 15, and shows large variability between the group means of the variable eduspend.
3. Checking assumption
The two-way ANOVA test has three assumptions:
1 Assumption 1: Samples are independent and selected by simple random sampling
2 Assumption 2: All population variances are identical
3 Assumption 3: All population distributions are normal
3.1 Samples are independent and selected by simple random sampling
As the spending on education of one household is not determined by that of another, the samples are independent. The scenario states: “In 2018, the survey was carried out with a sample size of 46,995 households in 3,133 communes/wards which were representative at national, regional, urban, rural and provincial levels”. Therefore, it can be assumed that this sample was selected randomly.
3.2 All population variances are identical
From Figure 4 (the standard deviation of Edu Spend according to Edu levels and Provinces), it can be seen that the largest standard deviation equals 4938.879 and the smallest one equals 356.0621. The ratio is 4938.879/356.0621 = 13.87084, which is larger than 2 (and also larger than the looser threshold of 3). By this rule of thumb the second assumption is not satisfied, so we check it formally with Levene’s test.
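The ratio quoted above can be reproduced in a couple of lines. This is a minimal sketch that uses only the two standard deviations reported in the text; the variable names are illustrative.

```r
# Largest and smallest group standard deviations quoted from Figure 4
sd_max <- 4938.879
sd_min <- 356.0621

# Ratio of the largest to the smallest standard deviation
sd_ratio <- sd_max / sd_min
print(round(sd_ratio, 5))

# Rule of thumb used in the text: equal variances are plausible if the ratio is below 2
print(sd_ratio < 2)
```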
Ho: All population variances are equal
Ha: At least one population variance is different
c. Test statistic:
F = 2.0727
p-value = 0.07684
d. Rejection rule:
We reject Ho if p-value < α. Here p-value = 0.07684 > 0.05, so we do not reject Ho.
e. Conclusion:
There is not enough evidence to conclude that at least one population variance is different.
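To make the mechanics of the test more concrete: Levene's test centred at the median amounts to a one-way ANOVA on the absolute deviations of each observation from its own group median. The sketch below illustrates this in base R on synthetic data (three hypothetical groups of 15 with made-up values, not the survey data), so the F and p-value it prints are unrelated to the figures above.

```r
set.seed(42)
# Synthetic data: 3 groups of 15 observations (hypothetical values)
g <- factor(rep(c("A", "B", "C"), each = 15))
y <- c(rnorm(15, mean = 10, sd = 1),
       rnorm(15, mean = 12, sd = 2),
       rnorm(15, mean = 11, sd = 3))

# Absolute deviation of each observation from its own group median
abs_dev <- abs(y - ave(y, g, FUN = median))

# One-way ANOVA on those deviations gives the median-centred Levene F and p-value
lev <- anova(lm(abs_dev ~ g))
print(lev)

f_stat <- lev[["F value"]][1]
p_val  <- lev[["Pr(>F)"]][1]
```

A large F here means the groups differ in their typical spread around the median, which is evidence against equal variances.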
The result of Levene’s test was obtained by the following code:
> install.packages("car")
> library(car)
> leveneTest(case5$eduspend, interaction(case5$province, case5$edulevel), center = median)
Figure 7 Levene’s Test result
3.3 All population distributions are normal
By using the following code, we can check whether all population distributions are normal:
> install.packages("car")
> library(car)
> qqPlot(lm(eduspend ~ province + edulevel + province*edulevel, data=case5), simulate=T, main="Q-Q Plot", labels=F)
Figure 8 Q-Q plot of residual
Looking at Figure 8, numerous points fall outside the blue area (the confidence envelope), so the plot does not support the normality of all populations. However, due to the scope of the course, we assume that the last 2 assumptions are satisfied. To sum up, we carry out the two-way ANOVA test treating all three assumptions as satisfied.
4. Two-way ANOVA test
As mentioned in question 1, we use a two-way ANOVA to test for the significance of the interaction between Province and Edulevel (interaction effect) as well as that of the differences in education expenditure due to Province and Edulevel (2 main effects) with a 0.05 level of significance.
Step 1: Form hypotheses for the three tests
The three null hypotheses and alternative hypotheses for the test are stated below:
The hypotheses to test the interaction effect:
Ho1: There is no interaction between Province and Edulevel.
Ha1: There is a significant interaction between Province and Edulevel.
The hypotheses to test the main effects:
Ho2: There are no differences in education expenditure due to Province.
Ha2: There are differences in education expenditure due to Province.
Ho3: There are no differences in education expenditure due to Edulevel.
Ha3: There are differences in education expenditure due to Edulevel.
Step 2: Check assumptions
The assumptions of the test have been checked in the answer to question 3.
Step 3: Test statistic
We run a two-way ANOVA in RStudio with Eduspend as the outcome variable and Province and Edulevel as the two factors, using the following commands:
> case5.result<-aov(eduspend~ province*edulevel, data = case5)
> summary(case5.result )
Figure 9 Test statistic output
From the R output above, we have:
To test for the interaction of Province and Edulevel:
Fpe = 2347318 / 6309566 = 0.372
To test the main effect of province:
Fp = 23969232 / 6309566 = 3.799
To test the main effect of edulevel:
Fe = 67582359 / 6309566 = 10.711
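The three F ratios above, and the p-values that go with them, can be recomputed from the mean squares. The sketch below takes the four mean squares as quoted in the text; the degrees of freedom are an assumption derived from the design described earlier (2 provinces, 3 schooling levels, 90 observations) rather than read from Figure 9.

```r
# Mean squares as quoted in the text (taken from the ANOVA output)
ms_province    <- 23969232
ms_edulevel    <- 67582359
ms_interaction <- 2347318
ms_error       <- 6309566

# Degrees of freedom implied by the design (assumed: 2 provinces, 3 levels, n = 90)
df_province    <- 2 - 1
df_edulevel    <- 3 - 1
df_interaction <- df_province * df_edulevel
df_error       <- 90 - 2 * 3

# Each F statistic is the effect mean square divided by the error mean square
f_province    <- ms_province / ms_error
f_edulevel    <- ms_edulevel / ms_error
f_interaction <- ms_interaction / ms_error

# Upper-tail probabilities of the F distribution give the p-values
p_province    <- pf(f_province, df_province, df_error, lower.tail = FALSE)
p_edulevel    <- pf(f_edulevel, df_edulevel, df_error, lower.tail = FALSE)
p_interaction <- pf(f_interaction, df_interaction, df_error, lower.tail = FALSE)

print(round(c(province = f_province, edulevel = f_edulevel,
              interaction = f_interaction), 3))
print(c(province = p_province, edulevel = p_edulevel, interaction = p_interaction))
```

Comparing each p-value with α = 0.05 then gives the decisions for the three hypothesis tests.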
Step 4: Level of significance
The level of significance is α = 0.05
Step 5: Decision rule
Reject Ho if p-value < α
To test for interaction effect:
p-value = 0.6905 > α = 0.05
Step 6: Conclusion
Since the p-value exceeds α, we do not reject Ho1. At the 5% level of significance, there is insufficient evidence to conclude that the interaction between the place of residence (Thai Binh and Hung Yen) and schooling level is significant.
Because the interaction effect is not significant, we examine the 2 main effects: the effect of province on education expenditure and the effect of edulevel on education expenditure.
As regards the effect of province, we have:
p-value = 0.0546 > α = 0.05
=> The null hypothesis of no differences in education expenditure due to province is not rejected.
In terms of the effect of edulevel, we have:
p-value = 7.19e-5 < α = 0.05
=> The null hypothesis of no differences in education expenditure due to edulevel is rejected.
For these reasons, at the 5% significance level, there is sufficient evidence to conclude that the differences in education expenditure due to schooling level are significant, and there is not sufficient evidence to conclude that the differences in education expenditure due to place of residence are significant.
5. Interaction plot and interpretations
To visualize the possible interaction between the two factors graphically, we use the interaction.plot function as follows:
> interaction.plot(case5$province, case5$edulevel, case5$eduspend, type = "b", col = c("red", "blue", "black"), pch = c(16, 18), main = "Interaction between Province and Eduspend")
Figure 10 Interaction plot between Spending for Edulevel and Province
The plot shows the mean value of the dependent variable (spending on education) on the y-axis, while the x-axis shows the two levels of the first independent variable (Province, namely Hung Yen and Thai Binh). The three lines indicate the three levels of the second independent variable (Edulevel).
It can be seen that Eduspend is strongly affected by Edulevel. In detail, the expenditure for secondary students in both Hung Yen and Thai Binh is higher than that for the other levels (Nursery and Primary School). This implies that more money is invested in the study of secondary students than in that of students at other levels, whatever the place of residence. This is shown most clearly by the black line. Besides, the lines for Nursery School and Primary School intersect, which hints at some interaction between Province and Edulevel: the expenses for education in Nursery School are higher in Hung Yen than in Thai Binh, whereas the opposite holds for Primary School. On the other hand, the lines representing Secondary School and Primary School seem moderately parallel. So, there is only a slight interaction between schooling level and province in education expenses.
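The idea of parallel lines can be made concrete by comparing the gap between the two provinces at each schooling level: equal gaps mean parallel lines (no interaction), while gaps that differ in size or sign mean the lines are not parallel. The sketch below uses hypothetical cell means (illustrative numbers only, not the values behind Figure 10).

```r
# Hypothetical cell means in thousands of VND (NOT the survey results)
cell_mean <- matrix(
  c(1000, 1200,   # Primary School:   HungYen, ThaiBinh
    3000, 3200,   # Secondary School: HungYen, ThaiBinh
     900,  700),  # Nursery School:   HungYen, ThaiBinh
  nrow = 2,
  dimnames = list(province = c("HungYen", "ThaiBinh"),
                  edulevel = c("Primary School", "Secondary School", "Nursery School"))
)

# Province gap at each schooling level, i.e. the rise of each line in the plot
gap <- cell_mean["ThaiBinh", ] - cell_mean["HungYen", ]
print(gap)

# Identical gaps correspond to parallel lines; differing gaps suggest interaction
print(length(unique(gap)) == 1)
```

In this made-up example the Primary and Secondary lines are exactly parallel, while the Nursery line slopes the other way. Whether such a departure from parallel lines is statistically significant is what the interaction F test decides.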
Therefore, the plot mainly points to a main effect of Edulevel, while the small departures from parallel lines suggest only a weak interaction, which agrees with the results of question 4. It is reassuring that the two answers are consistent.
6. Credibility of the interpretations and conclusions
After reviewing the answers to questions 1 to 5 and carefully examining the assumptions, we believe we have sufficient theoretical basis to be confident about the credibility of the interpretations and conclusions of question 4. Firstly, we follow the test procedure strictly and base our conclusions, at a significance level of 5%, on the output produced by RStudio, a reliable and powerful program. Secondly, a two-way ANOVA test is effective for studying two factors simultaneously rather than individually. Moreover, in comparison with multiple t-tests, using a two-way ANOVA test keeps the overall probability of a Type I error from inflating. Therefore, our choice of inference technique was reasonable and appropriate within the scope of the course. Finally, the interpretations and conclusions in question 4 are in agreement with the interaction plot drawn later in question 5, which makes this report more reliable.
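The point about multiple t-tests can be illustrated with a short calculation. The sketch assumes, as a simplification, that the pairwise tests are independent (pairwise comparisons that share a group are not strictly independent, so the figure is only a rough guide).

```r
alpha    <- 0.05  # significance level of each individual t-test
n_groups <- 6     # 2 provinces x 3 schooling levels

# Number of pairwise comparisons among the 6 treatment groups
n_pairs <- choose(n_groups, 2)

# Chance of at least one false rejection across all tests (assuming independence)
fwer <- 1 - (1 - alpha)^n_pairs

print(n_pairs)
print(round(fwer, 4))
```

Under this assumption, running all 15 pairwise t-tests at the 5% level would give roughly a 54% chance of at least one false rejection, whereas each F test in the two-way ANOVA is carried out at the 5% level.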
Despite the above advantages, there are some problems that can distort the R output, thereby affecting the credibility of our conclusion. Firstly, how the dataset was generated is not explicitly mentioned. We do not know whether the data set was selected using a simple random sampling method or on purpose. Due to this uncertainty in how the dataset was created, the samples may be biased.