ACADEMIC PERFORMANCE OF UNIVERSITY STUDENTS


HANOI UNIVERSITY FACULTY OF MANAGEMENT AND TOURISM

Business and Economics Statistics

CASE STUDY ACADEMIC PERFORMANCE OF UNIVERSITY STUDENTS

Tutor: Mr. Nguyen Hoang Viet

Tutorial class: Tut 4

Group members:


TABLE OF CONTENTS

TABLE OF FIGURES

A. Scenario

B. Questions and Answers

1. Inference technique

2. Descriptive statistics for the dataset

3. Checking assumptions

4. Two-way ANOVA test

5. Interaction plot and interpretations

6. Credibility of the interpretations and conclusions


TABLE OF FIGURES

Figure 1 The structure of this data frame

Figure 2 Frequency table (sample sizes)

Figure 3 Mean of Edu Spend according to Edu Levels and Provinces

Figure 4 The standard deviation of Edu Spend according to Edu levels and Provinces

Figure 5 Boxplot for distribution of groups

Figure 6 Gplot of group means

Figure 7 Levene’s Test result

Figure 8 Q-Q plot of residual

Figure 9 Test statistic output

Figure 10 Interaction plot between Spending for Edulevel and Province


A. Scenario

The Vietnam Household Living Standards Survey (VHLSS) was conducted nationwide every two years to systematically monitor living standards in Vietnam. In 2018, the survey was carried out with a sample size of 46,995 households in 3,133 communes/wards, which were representative at national, regional, urban, rural, and provincial levels. The household questionnaire contained many sections, each covering a separate aspect of household activities, and education was one important indicator. In the survey, household heads were asked to specify their place of residence (province), the schooling level of their children (edulevel), and their expenditure on education per child for the past 12 months in thousands of VND (eduspend). The objective of our study is to test for any significant interaction between place of residence and schooling level, and to test for any significant differences in education expenditure due to these two variables, using a 0.05 level of significance.

A portion of the obtained data is presented below. The complete dataset, consisting of 90 observations and 4 variables (obs, province, edulevel, eduspend), is provided in the accompanying file named Case5.csv.


B. Questions and Answers

1. Inference technique

The VHLSS data are used to test for any significant interaction between the place of residence (province) and the schooling level (edulevel), and to test for any significant differences in education expenditure (eduspend) due to these two variables.

In this study, a two-way ANOVA (two-way analysis of variance) is applied to assess the effect of two independent variables, and of their interaction, on one dependent variable. Firstly, province and edulevel are the two factors, that is, the independent variables in this case study. Secondly, eduspend is the dependent variable, which depends on the two factors (province and edulevel).

The purpose of this study is to examine the effect of place of residence and schooling level on education expenditure, and the interaction between the two factors (province and edulevel).
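For reference, the model behind a two-way ANOVA with interaction can be written as below. The symbols are the usual textbook notation and are our own assumption, since the report does not state the model explicitly: i indexes the 2 provinces, j the 3 schooling levels, and k the 15 households in each group.

```latex
y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk},
\qquad i = 1, 2; \quad j = 1, 2, 3; \quad k = 1, \dots, 15
```

Here y is eduspend, \alpha_i is the main effect of province, \beta_j is the main effect of edulevel, (\alpha\beta)_{ij} is the interaction effect, and \varepsilon_{ijk} is the random error. The three tests in question 4 ask whether all \alpha_i, all \beta_j, or all (\alpha\beta)_{ij} are zero.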

2. Descriptive statistics for the dataset

Firstly, we use RStudio to compute descriptive statistics for this question. To start with, we import the CSV file “Case5.csv” into R for further calculation:

> setwd("C:/Users/Admin/Documents/bes research")

> getwd()

> case5 <- read.table("Case5.csv", header=TRUE, sep=",", quote="\"", stringsAsFactors=FALSE)

The structure of this data frame can be checked using str() function:

> str(case5)

Figure 1 The structure of this data frame

From the above R output, we can see that there are 90 observations and 4 variables: obs, province, edulevel, and eduspend. The obs and eduspend variables are numeric data, while the province and edulevel variables are character data. To apply some graphical or statistical methods, we should convert province and edulevel into factors, using the following code:

> case5$province <- factor(case5$province, levels = c("HungYen","ThaiBinh"))

> case5$edulevel <- factor(case5$edulevel, levels = c("Primary School","Secondary School","Nursery School"))

A frequency table can be created to see the sample size of each treatment group with the following R code:

> table(case5$province, case5$edulevel)

Figure 2 Frequency table (sample sizes)

It can be seen that all 6 treatment groups have the same sample size of 15. This balanced design is well suited to a two-way ANOVA test.
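The balance check can also be done programmatically rather than by eye. The sketch below is only an illustration: the data frame `toy` is a hypothetical stand-in with the same layout as case5 (it is not the real dataset), and the names `cell_sizes` and `is_balanced` are our own.

```r
# Hypothetical stand-in for case5: 2 provinces x 3 schooling levels x 15 households.
toy <- expand.grid(
  rep      = 1:15,
  province = c("HungYen", "ThaiBinh"),
  edulevel = c("Primary School", "Secondary School", "Nursery School")
)

# Cross-tabulate the two factors to get the sample size of each treatment group.
cell_sizes <- table(toy$province, toy$edulevel)

# The design is balanced when every cell holds the same number of observations.
is_balanced <- length(unique(as.vector(cell_sizes))) == 1

print(cell_sizes)
print(is_balanced)
```

With the real data, replacing `toy` by `case5` should give the same kind of check on the table shown in Figure 2.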

Next, we use the by() function in R to find descriptive statistics such as the mean and the standard deviation for each treatment group defined by the two factors. The code and its output are listed below:

> by(case5$eduspend, list(case5$province, case5$edulevel), mean)

Figure 3 Mean of Edu Spend according to Edu Levels and Provinces

> by(case5$eduspend, list(case5$province, case5$edulevel), sd)

Figure 4 The standard deviation of Edu Spend according to Edu levels and Provinces

Each call returns the requested descriptive statistic of the outcome variable (eduspend) for every treatment group, that is, for every combination of province and edulevel.
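As an alternative to by(), the same group statistics can be arranged as a compact province-by-edulevel table with tapply(). This is a minimal sketch on made-up numbers: `toy` and its eduspend values are hypothetical and are not the values shown in Figures 3 and 4.

```r
set.seed(1)  # only to make the made-up numbers reproducible

# Hypothetical stand-in for case5 with random expenditure values.
toy <- expand.grid(
  rep      = 1:15,
  province = c("HungYen", "ThaiBinh"),
  edulevel = c("Primary School", "Secondary School", "Nursery School")
)
toy$eduspend <- round(runif(nrow(toy), min = 500, max = 10000))

# One row per province, one column per schooling level.
cell_means <- tapply(toy$eduspend, list(toy$province, toy$edulevel), mean)
cell_sds   <- tapply(toy$eduspend, list(toy$province, toy$edulevel), sd)

print(round(cell_means, 1))
print(round(cell_sds, 1))
```

The matrix layout makes it easier to compare a row (one province across levels) or a column (one level across provinces) at a glance.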

To get further information, we draw a boxplot and a mean plot:

> boxplot(eduspend ~ interaction(province, edulevel), data = case5, xlab = "Place of residents", ylab = "Education expenditure", col = c("pink", "light blue", "yellow", "white", "orange", "gray"))


Figure 5 Boxplot for distribution of groups

The box plot clearly shows several descriptive statistics for each group: the median, the quartiles, and the maximum and minimum values. Each cell has its own characteristics. Based on the R output, we can see that the Hung Yen – Secondary School group has the highest median, but there is no visible difference between the Hung Yen – School group and the Thai Binh – Secondary School group. Moreover, the Hung Yen – Nursery School group has the lowest median and the lowest minimum value.

The skewness of each group can also be read from the boxplot: whether a group is roughly symmetric, positively skewed, or negatively skewed is judged from the distance between the median and the two ends of the plot. Thai Binh – Secondary School and Hung Yen – Nursery School appear roughly symmetric. Hung Yen – Secondary School and Thai Binh – Nursery School are examples of positively skewed distributions, while the others are negatively skewed. The boxplot also flags outliers as white dots: Hung Yen – Secondary School (1 outlier), Thai Binh – Secondary School (1 outlier), Hung Yen – Nursery School (2 outliers), and Thai Binh – Nursery School (3 outliers). They make up only a small share of the 90 observations.

We also use a mean plot to identify the mean value of each group and to compare the means between groups, with the following code and output:

> install.packages("gplots")

> library("gplots")

> plotmeans(eduspend ~ interaction(province, edulevel), data = case5, xlab = "Province and edulevel", ylab = "Eduspend", main = "Mean Plot with 95% CI")

Figure 6 Gplot of group means

Figure 6 helps us to better understand the structure of the Case 5 data and summarizes the variability between the group means, each shown with its 95% confidence interval. It displays the sample size of each group, which equals 15, and shows large variability between the group means of eduspend.

3. Checking assumptions

The two-way ANOVA test has three assumptions:

1 Assumption 1: Samples are independent and randomly selected

2 Assumption 2: All population variances are identical

3 Assumption 3: All population distributions are normal

3.1 Samples are independent and randomly selected

As the spending on education of one household is not determined by that of another, the samples are independent. The scenario states: “In 2018, the survey was carried out with a sample size of 46,995 households in 3,133 communes/wards which were representative at national, regional, urban, rural and provincial levels”. Therefore, it can be assumed that this sample was selected randomly.

3.2 All population variances are identical

From Figure 4 (the standard deviation of Edu Spend according to Edu levels and Provinces), it can be seen that the largest standard deviation equals 4938.879 and the smallest one equals 356.0621. Their ratio is 4938.879/356.0621 = 13.87084, which is larger than 2. By this rule of thumb, the second assumption does not appear to be satisfied; the ratio is in fact larger than 3 as well. We therefore use Levene’s test to check the assumption formally:
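The rule-of-thumb arithmetic above can be reproduced in a few lines of R. The two standard deviations are the values quoted from Figure 4; the variable names and the extra variance ratio are our own additions for illustration.

```r
sd_max <- 4938.879   # largest group standard deviation quoted from Figure 4
sd_min <- 356.0621   # smallest group standard deviation quoted from Figure 4

# Ratio of the largest to the smallest standard deviation.
sd_ratio <- sd_max / sd_min

# The same comparison expressed on the variance scale.
var_ratio <- sd_ratio^2

# Rule of thumb used in the text: a ratio above 2 is a warning sign.
exceeds_rule_of_thumb <- sd_ratio > 2

print(round(sd_ratio, 5))
print(exceeds_rule_of_thumb)
```

Because variances are squared standard deviations, the gap between the groups looks even larger on the variance scale.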

Ho: All population variances are equal

Ha: At least one population variance is different

c. Test statistic:

F = 2.0727

p-value = 0.07684

d. Rejection rule:

We reject Ho if p-value < α. Here p-value = 0.07684 > 0.05, so we do not reject Ho.

e. Conclusion:

There is not enough evidence to conclude that at least one population variance is different, so we treat the equal-variance assumption as satisfied.
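To show what a median-centred Levene's test does, the sketch below rebuilds it in base R as a one-way ANOVA on the absolute deviations of each observation from its group median. It illustrates the idea only and is not the code of any package; `toy` is hypothetical random data, so its F and p-value will not match the 2.0727 and 0.07684 reported above.

```r
set.seed(42)  # only to make the made-up numbers reproducible

# Hypothetical stand-in for case5 with normally distributed expenditure.
toy <- expand.grid(
  rep      = 1:15,
  province = c("HungYen", "ThaiBinh"),
  edulevel = c("Primary School", "Secondary School", "Nursery School")
)
toy$eduspend <- rnorm(nrow(toy), mean = 5000, sd = 1500)

# One grouping factor with 6 levels, one per treatment group.
cell <- interaction(toy$province, toy$edulevel)

# Absolute deviation of each observation from the median of its own group.
abs_dev <- abs(toy$eduspend - ave(toy$eduspend, cell, FUN = median))

# One-way ANOVA of the absolute deviations on the group factor.
levene_table <- anova(lm(abs_dev ~ cell))
levene_F <- levene_table[["F value"]][1]
levene_p <- levene_table[["Pr(>F)"]][1]

# Same rejection rule as in the text.
reject_H0 <- levene_p < 0.05

print(levene_table)
```

The degrees of freedom are 5 (6 groups minus 1) and 84 (90 observations minus 6 groups), which is what a test on the real data should also show.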

The result of Levene’s test was obtained with the following code:

> install.packages("car")

> library(car)

> leveneTest(case5$eduspend, interaction(case5$province, case5$edulevel), center = median)

Figure 7 Levene’s Test result

3.3 All population distributions are normal

Using the following code, we can check the normality of the population distributions:

> install.packages("car")

> library(car)

> qqPlot(lm(eduspend ~ province + edulevel + province*edulevel, data=case5), simulate=TRUE, main="Q-Q Plot", labels=FALSE)

Figure 8 Q-Q plot of residual

Looking at Figure 8, many points fall outside the blue band, so the plot does not confirm that all populations are normally distributed. However, within the scope of the course, we assume that the last two assumptions are satisfied. On that basis, we proceed with the two-way ANOVA test.

4. Two-way ANOVA test

As mentioned in question 1, we use two-way ANOVA to test for the significance of the interaction between Province and Edulevel (interaction effect) as well as the differences in education expenditure due to Province and Edulevel (2 main effects) at the 0.05 level of significance.

Step 1: Form hypotheses for the three tests

The three null hypotheses and alternative hypotheses for the test are stated below:

The hypothesis to test interaction effect:

Ho1: There is no interaction between Province and Edulevel.

Ha1: There is a significant interaction between Province and Edulevel.

The hypothesis to test main effects:

Ho2: There are no differences in education expenditure due to Province.

Ha2: There are differences in education expenditure due to Province.

Ho3: There are no differences in education expenditure due to Edulevel.

Ha3: There are differences in education expenditure due to Edulevel.

Step 2: Check assumptions

The assumptions of the test were checked in the answer to question 3.

Step 3: Test statistic

We run a two-way ANOVA in RStudio with Eduspend as the outcome variable and Province and Edulevel as the two factors, using the following commands:

> case5.result <- aov(eduspend ~ province*edulevel, data = case5)

> summary(case5.result)

Figure 9 Test statistic output

From the R output above, we have:

To test for the interaction of Province and Edulevel:

Fpe = 2347318 / 6309566 = 0.372

To test the main effect of province:

Fp = 23969232 / 6309566 = 3.799

To test the main effect of edulevel:

Fe = 67582359 / 6309566 = 10.711
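The three F ratios, and the p-values that go with them, can be recomputed from the mean squares quoted above. In this sketch the mean squares come from the report, while the degrees of freedom are our own assumption, derived from the design (2 provinces, 3 schooling levels, 90 observations) rather than read from Figure 9.

```r
# Mean squares quoted in the text.
ms_province    <- 23969232
ms_edulevel    <- 67582359
ms_interaction <- 2347318
ms_residual    <- 6309566

# Degrees of freedom assumed from the design, not read from Figure 9.
df_province    <- 2 - 1
df_edulevel    <- 3 - 1
df_interaction <- df_province * df_edulevel
df_residual    <- 90 - 2 * 3

# Each F statistic is the effect mean square divided by the residual mean square.
f_stats <- c(province    = ms_province,
             edulevel    = ms_edulevel,
             interaction = ms_interaction) / ms_residual

# Upper-tail probabilities of the F distribution give the p-values.
p_values <- pf(f_stats,
               df1 = c(df_province, df_edulevel, df_interaction),
               df2 = df_residual,
               lower.tail = FALSE)

print(round(f_stats, 3))
print(signif(p_values, 3))
```

Comparing each p-value with α = 0.05 then gives the decisions discussed in the following steps.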

Step 4: Level of significance

The level of significance is α = 0.05

Step 5: Decision rule

Reject Ho if p-value < α

To test for the interaction effect:

p-value = 0.6905 > α = 0.05

Step 6: Conclusion

Since the p-value is larger than α, we do not reject Ho1. At the 5% level of significance, there is insufficient evidence to conclude that the interaction between the place of residence and the schooling level is significant for the education expenditure of households in the two provinces (Thai Binh and Hung Yen).

Because the interaction effect is not significant, we examine the 2 main effects: the effect of province on education expenditure and the effect of edulevel on education expenditure.

As regards the effect of the province, we have:

p-value = 0.0546 > α = 0.05

=> The null hypothesis of no differences in education expenditure due to the province is not rejected.

In terms of the effect of edulevel, we have:

p-value = 7.19e-5 < α = 0.05

=> The null hypothesis of no differences in education expenditure due to edulevel is rejected.

For these reasons, at the 5% significance level, there is sufficient evidence to conclude that the differences in education expenditure due to schooling levels are significant, and there is not sufficient evidence to conclude that the differences in education expenditure due to place of residence are significant.

5. Interaction plot and interpretations

To visualize the possible interaction between the two factors graphically, we use the interaction.plot function as follows:

> interaction.plot(case5$province, case5$edulevel, case5$eduspend, type = "b", col = c("red", "blue", "black"), pch = c(16, 18), main = "Interaction between Province and Eduspend")

Figure 10 Interaction plot between Spending for Edulevel and Province

The plot shows the mean value of the dependent variable (spending on education) on the y-axis, while the x-axis shows the two levels of the first independent variable (Province, namely Hung Yen and Thai Binh). The three lines indicate the three levels of the second independent variable (Edulevel).

It can be seen that Eduspend is strongly affected by Edulevel. In detail, the expenditure for secondary students in both Hung Yen and Thai Binh is always higher than that for the other levels (Nursery and Primary schools). This implies that more money is invested in the studies of secondary students than in those of students at other levels, whatever the place of residence. This is shown most clearly by the black line. Besides, the lines for Nursery School and Primary School intersect, which hints at some interaction between Province and Edulevel: the expenses for education in the Nursery School of Hung Yen are higher than those of Thai Binh, whereas the opposite holds for Primary School. On the other hand, the lines representing Secondary School and Primary School seem moderately parallel. So, there is only a slight interaction between schooling level and province in the expenses.

Therefore, we can only conclude that a main effect is present. The plot still shows some non-parallel patterns, but overall it tells us the same result as question 4, and it is reassuring that the two answers are consistent.
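Reading an interaction plot comes down to comparing the slopes of the lines: parallel lines mean no interaction, while crossing lines mean that the ordering of two levels flips between provinces. The sketch below puts numbers on that reading. The cell means are hypothetical values chosen only to mimic the pattern described above; they are not the means from Figure 3, and the helper `crosses` is our own.

```r
# Hypothetical cell means in thousands of VND; NOT the values from Figure 3.
cell_means <- matrix(
  c(2400, 3000,    # Primary School:   HungYen, ThaiBinh
    6000, 6500,    # Secondary School: HungYen, ThaiBinh
    2800, 2500),   # Nursery School:   HungYen, ThaiBinh
  nrow = 2,
  dimnames = list(
    province = c("HungYen", "ThaiBinh"),
    edulevel = c("Primary School", "Secondary School", "Nursery School")
  )
)

# Slope of each line in the plot: ThaiBinh mean minus HungYen mean.
slopes <- cell_means["ThaiBinh", ] - cell_means["HungYen", ]

# Parallel lines have equal slopes, so the spread of the slopes shows
# how far the picture is from "no interaction".
slope_spread <- max(slopes) - min(slopes)

# Two lines cross when the sign of their difference flips between provinces.
crosses <- function(a, b) {
  d <- cell_means[, a] - cell_means[, b]
  unname(d[1] * d[2] < 0)
}

print(slopes)
print(crosses("Nursery School", "Primary School"))
print(crosses("Secondary School", "Primary School"))
```

A visible crossing does not by itself make the interaction statistically significant; that judgement still rests on the F test in question 4.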

6. Credibility of the interpretations and conclusions

After reviewing the answers to questions 1 to 5 and carefully examining the assumptions, we believe we have a sufficient theoretical basis to be confident about the credibility of the interpretations and conclusions of question 4. Firstly, we follow the test procedure strictly and base our conclusions on the output produced by RStudio, a reliable and powerful program, at a significance level of 5%. Secondly, the two-way ANOVA test is effective for studying two factors simultaneously rather than individually. Moreover, in comparison with multiple t-tests, using a two-way ANOVA test avoids inflating the probability of a Type I error. Therefore, our choice of inference technique was reasonable and appropriate within the scope of the course. Finally, the interpretations and conclusions in question 4 are in agreement with the interaction plot drawn later in question 5, which makes this report more reliable.
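The point about multiple t-tests can be made concrete with a short calculation. The sketch assumes that every pair of the 6 treatment groups would be compared with its own t-test at α = 0.05 and that the tests are independent; the independence is a simplification we introduce for illustration, so the result is a rough guide rather than an exact figure.

```r
alpha    <- 0.05
n_groups <- 6                      # 2 provinces x 3 schooling levels

# Number of pairwise comparisons among the groups.
n_pairs <- choose(n_groups, 2)

# Chance of at least one false rejection across all comparisons,
# assuming independent tests (a simplification).
familywise_error <- 1 - (1 - alpha)^n_pairs

print(n_pairs)
print(round(familywise_error, 4))
```

Under these assumptions the chance of at least one false rejection is far above the nominal 5%, which is why a single ANOVA is preferred to many separate t-tests.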

Despite the above advantages, there are some problems that can distort the R output, thereby affecting the credibility of our conclusion. Firstly, how the dataset was generated is not explicitly mentioned. We do not know whether the dataset was selected using a simple random sampling method or on purpose. Due to this uncertainty about how the dataset was created, the samples may be biased.

Date posted: 29/05/2022, 11:44
