Probability and Statistics Project: ANOVA

Data cleaning

We first count the missing values of each variable in the data set “data1”.

The data set contains missing values that need to be addressed for a cleaner version of "data1." To eliminate these values, we utilize the "na.omit" function, resulting in a new, refined data set.

After removing the NA values from “data_1”, we check whether any NA values are left.

The result from the console is “0”, which means that the data no longer contains missing values.
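The cleaning steps above can be sketched as follows. This is only an illustration on a small invented data frame; the column values and the name "data_clean" are assumptions, since the original data and commands are not shown.

```r
# Hypothetical stand-in for the flights data; the real "data1" is not shown in the text.
data1 <- data.frame(
  carrier   = c("AA", "AS", "WN", "UA", "DL"),
  dep_delay = c(5, NA, 12, -3, NA),
  arr_delay = c(7, 2, NA, -5, 0)
)

# Count the missing values of each variable.
na_per_column <- colSums(is.na(data1))
print(na_per_column)

# Drop every row that contains at least one NA ("data_clean" is a made-up name).
data_clean <- na.omit(data1)

# Confirm that no missing values remain.
remaining_na <- sum(is.na(data_clean))
print(remaining_na)
```

Note that "na.omit" removes whole rows, so a row is lost even when only one of its variables is missing.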

Data visualization

Descriptive statistics for each of the variables

Graph: Boxplot of dep_delay for each carrier

Remove outliers

We plot the boxplot using the “boxplot” function.
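A minimal sketch of such a plot, assuming a data frame with the columns "carrier" and "dep_delay". The data frame "flights_demo" and its values are invented for illustration.

```r
# Hypothetical toy data; carrier codes and delays are made up.
set.seed(1)
flights_demo <- data.frame(
  carrier   = rep(c("AA", "AS", "WN"), each = 20),
  dep_delay = c(rnorm(20, 5, 10), rnorm(20, 0, 8), rnorm(20, 12, 15))
)

# One box per carrier; boxplot() also returns the statistics it draws.
bp <- boxplot(dep_delay ~ carrier, data = flights_demo,
              xlab = "carrier", ylab = "dep_delay",
              main = "Departure delay by carrier")

# bp$names holds the group labels, bp$out the points drawn as outliers.
print(bp$names)
print(bp$out)
```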

The presence of outliers in the data can lead to complications in statistical analyses. It is essential to eliminate these outliers from the "dep_delay" variable to ensure accurate results.

The function "remove_outl" is designed to identify and manage outliers in a data set. It calculates the first quartile, the third quartile, and the interquartile range using the "quantile" and "IQR" functions. The function then evaluates each input value to determine whether it qualifies as an outlier. If a value meets the outlier criterion, it is replaced with NA; otherwise, it remains unchanged.
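One possible shape of such a function is sketched below. The 1.5 * IQR fences are an assumption (the common boxplot rule); the text does not state which cutoff the original "remove_outl" uses.

```r
# Sketch of an outlier filter. The 1.5 * IQR cutoff is an assumption,
# not taken from the original project.
remove_outl <- function(x) {
  q1  <- quantile(x, 0.25, na.rm = TRUE, names = FALSE)  # first quartile
  q3  <- quantile(x, 0.75, na.rm = TRUE, names = FALSE)  # third quartile
  iqr <- IQR(x, na.rm = TRUE)                            # interquartile range
  lower <- q1 - 1.5 * iqr
  upper <- q3 + 1.5 * iqr
  # Values outside the fences become NA; everything else is unchanged.
  ifelse(x < lower | x > upper, NA, x)
}

# Made-up delays with one extreme value.
delays  <- c(-2, 0, 1, 3, 4, 5, 6, 8, 120)
cleaned <- remove_outl(delays)
print(cleaned)
```

Replacing outliers with NA rather than deleting them keeps the vector length unchanged, so the result can be written back into the same column of a data frame.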

We create a subset of the dep_delay values for each carrier and apply the function to each subset.

We then combine the separate subsets, store the result in a new data set named “data3”, and check that the data set is intact.

To identify the number of outliers, we use the “is.na” function to count the NA values. Next, we eliminate these outliers with the “na.omit” function. Finally, we re-evaluate “data3” to ensure that no outliers remain.
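The split, clean, combine, and count steps could look roughly like this. The input name "data2", the toy values, and the 1.5 * IQR cutoff inside "remove_outl" are assumptions; only "data3" and the function names come from the text.

```r
# Hypothetical toy data with one extreme delay per carrier.
data2 <- data.frame(
  carrier   = rep(c("AA", "WN"), each = 6),
  dep_delay = c(1, 2, 3, 4, 5, 90, 10, 11, 12, 13, 14, -80)
)

# Assumed version of "remove_outl": values beyond 1.5 * IQR become NA.
remove_outl <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  ifelse(x < q[1] - fence | x > q[2] + fence, NA, x)
}

# Split by carrier, clean each subset, then bind the pieces back together.
subsets <- split(data2, data2$carrier)
cleaned <- lapply(subsets, function(d) {
  d$dep_delay <- remove_outl(d$dep_delay)
  d
})
data3 <- do.call(rbind, cleaned)

# Outliers are now NA: count them, then drop the affected rows.
n_outliers <- sum(is.na(data3$dep_delay))
data3 <- na.omit(data3)
print(n_outliers)
print(nrow(data3))
```

Cleaning per carrier means each airline is judged against its own quartiles, so a delay that is typical for one carrier is not flagged merely because it is unusual for another.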

Finally, we plot the boxplot again with the “boxplot” function, this time using the data set “data3”, from which the outliers have been removed.

One-way ANOVA

We first test the assumption of normality. To do so, we create a new data frame containing only the departure delays of the airlines:

We use the Shapiro-Wilk test to check whether the departure delay is normally distributed, with the following command:

Since the p-value (< 2.2e-16) is below 0.05, we reject the hypothesis that the departure delay of flights from Portland is normally distributed at the 5% significance level.
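A sketch of this test on invented, right-skewed data is given below; the vector "delay" is a stand-in, not the project's data. Note that R's "shapiro.test" only accepts between 3 and 5000 observations, so a larger data set would first have to be subsampled (how the original project handled this is not stated).

```r
# Hypothetical right-skewed sample standing in for the departure delays.
set.seed(42)
delay <- rexp(500, rate = 1 / 10) - 5

# H0: the sample comes from a normal distribution.
sw <- shapiro.test(delay)
print(sw)

# Reject normality when the p-value is below the 5% level.
alpha <- 0.05
reject_normality <- sw$p.value < alpha
print(reject_normality)
```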

We double-check with the following Q-Q plot:

Thus, the departure delay of flights departing Portland in 2014 is not normally distributed.
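A Q-Q plot of this kind can be drawn with "qqnorm" and "qqline". The sketch below uses invented skewed data in place of the real delays.

```r
# Hypothetical skewed delays; not the project's data.
set.seed(42)
delay <- rexp(500, rate = 1 / 10) - 5

# Plot sample quantiles against theoretical normal quantiles.
qq <- qqnorm(delay, main = "Normal Q-Q plot of dep_delay")
qqline(delay, col = "red")  # reference line through the first and third quartiles

# For normal data the points follow the line closely; the correlation
# between the two sets of quantiles gives a rough numeric summary.
print(cor(qq$x, qq$y))
```

Points that bend away from the reference line in the tails, as skewed data typically do, indicate a departure from normality.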

Check for homogeneity of variance

We use Bartlett's test to check the homogeneity of the variances with the command:

Since the p-value (< 2.2e-16) is below 0.05, we reject the hypothesis of equal variances across carriers at the 5% significance level.
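A minimal sketch of Bartlett's test with "bartlett.test", using invented groups whose spreads differ on purpose. The data frame "demo" is an assumption for illustration only.

```r
# Hypothetical groups with clearly different spreads.
set.seed(7)
demo <- data.frame(
  carrier   = factor(rep(c("AA", "AS", "WN"), each = 50)),
  dep_delay = c(rnorm(50, 0, 2), rnorm(50, 0, 10), rnorm(50, 0, 30))
)

# H0: all carriers share the same variance of dep_delay.
bt <- bartlett.test(dep_delay ~ carrier, data = demo)
print(bt)

# Reject equal variances when the p-value is below the 5% level.
unequal_variances <- bt$p.value < 0.05
print(unequal_variances)
```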

Presenting a one-way ANOVA table:
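A one-way ANOVA table of this form can be produced with "aov" and "summary". The sketch below runs on invented data, so its numbers are unrelated to the project's results.

```r
# Hypothetical data: three carriers, one with a clearly higher mean delay.
set.seed(3)
demo <- data.frame(
  carrier   = factor(rep(c("AA", "AS", "WN"), each = 40)),
  dep_delay = c(rnorm(40, 2, 5), rnorm(40, 3, 5), rnorm(40, 10, 5))
)

# Fit the one-way model and print the ANOVA table.
fit <- aov(dep_delay ~ carrier, data = demo)
tab <- summary(fit)
print(tab)

# Extract the F statistic and p-value of the carrier effect (first row).
f_value <- tab[[1]][["F value"]][1]
p_value <- tab[[1]][["Pr(>F)"]][1]
print(c(F = f_value, p = p_value))
```

The table lists the degrees of freedom, sums of squares, mean squares, F statistic, and p-value for the factor, followed by the residual row.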

Multiple comparisons after analysis of variance

We perform the Tukey test with the command:

The analysis reveals two distinct groups of airlines with similar average delay times: the first group includes WN and AA, while the second group comprises HA, AS, OS, DL, US, F9, B6, OO, UA, and VX. Notably, WN stands out as having the highest average departure delay among the 11 airlines evaluated.
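A sketch of the Tukey procedure with "TukeyHSD", applied to an invented three-carrier example. The data and the resulting groupings are illustrative and do not reproduce the project's output.

```r
# Hypothetical data: WN has a clearly higher mean delay than AA and AS.
set.seed(3)
demo <- data.frame(
  carrier   = factor(rep(c("AA", "AS", "WN"), each = 40)),
  dep_delay = c(rnorm(40, 2, 5), rnorm(40, 3, 5), rnorm(40, 10, 5))
)

fit <- aov(dep_delay ~ carrier, data = demo)

# All pairwise differences in mean delay with family-wise 95% intervals.
tk <- TukeyHSD(fit, conf.level = 0.95)
print(tk)

# Pairs with an adjusted p-value below 0.05 differ in mean delay.
pairs <- tk$carrier
significant <- rownames(pairs)[pairs[, "p adj"] < 0.05]
print(significant)
```

Carriers whose pairwise difference is not significant can be read as belonging to the same group of similar average delays.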

Generalized linear model

In the model, the dependent variable is “arr_delay”, while the independent variables are “carrier”, “origin”, “dest”, “dep_delay” and “distance”. These are the factors that affect the arrival delay. The regression model:

From the above analysis, we obtain the regression model:
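A minimal sketch of fitting a model with these predictors using "lm". The data frame "demo", the airport and carrier codes, and the coefficients used to simulate "arr_delay" are all invented; only the variable names come from the text.

```r
# Hypothetical data; codes and simulation coefficients are made up.
set.seed(10)
n <- 200
demo <- data.frame(
  carrier   = factor(sample(c("AA", "AS", "WN"), n, replace = TRUE)),
  origin    = factor(sample(c("PDX", "SEA"), n, replace = TRUE)),
  dest      = factor(sample(c("DEN", "LAX", "SFO"), n, replace = TRUE)),
  dep_delay = rnorm(n, 8, 20),
  distance  = runif(n, 200, 2500)
)
demo$arr_delay <- -3 + 0.95 * demo$dep_delay - 0.002 * demo$distance +
  rnorm(n, 0, 5)

# Linear regression of arrival delay on all five predictors.
model1 <- lm(arr_delay ~ carrier + origin + dest + dep_delay + distance,
             data = demo)

# Coefficient table: estimate, standard error, t value and p-value per term.
coefs <- summary(model1)$coefficients
print(coefs)
print(coefs["dep_delay", "Pr(>|t|)"])
```

Each factor contributes one coefficient per level beyond its reference level, which is why names such as "Carrier_B6" appear as separate rows in a coefficient table.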

According to the result of the linear regression model above, we state the hypotheses:

H₀: The coefficient of the variable is not statistically significant.

H₁: The coefficient of the variable is statistically significant.

The p-values for "Carrier_B6," "Carrier_OO," "Carrier_US," and most levels of the "dest" variable exceed the 5% significance level, so there is insufficient evidence to reject the null hypothesis H₀. Consequently, these variables are not statistically significant and will be excluded from the model, while the remaining variables have p-values below the threshold.

At the 5% significance level, we reject the null hypothesis H₀ for the remaining variables, which indicates that their coefficients are statistically significant. Consequently, there is no need to exclude these variables from the model. However, we will proceed to construct a second model that excludes the variables "carrier" and "dest."
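A sketch of building the reduced model and comparing it with the full one through "anova", which performs an F test on two nested linear models. The data are invented (with a carrier effect added on purpose), and the names "model1" and "model2" are assumptions.

```r
# Hypothetical data with a built-in carrier effect; all values are made up.
set.seed(10)
n <- 200
demo <- data.frame(
  carrier   = factor(sample(c("AA", "AS", "WN"), n, replace = TRUE)),
  origin    = factor(sample(c("PDX", "SEA"), n, replace = TRUE)),
  dest      = factor(sample(c("DEN", "LAX", "SFO"), n, replace = TRUE)),
  dep_delay = rnorm(n, 8, 20),
  distance  = runif(n, 200, 2500)
)
demo$arr_delay <- -3 + 0.95 * demo$dep_delay - 0.002 * demo$distance +
  ifelse(demo$carrier == "WN", 6, 0) + rnorm(n, 0, 5)

# Full model and reduced model without "carrier" and "dest".
model1 <- lm(arr_delay ~ carrier + origin + dest + dep_delay + distance,
             data = demo)
model2 <- lm(arr_delay ~ origin + dep_delay + distance, data = demo)

# F test for nested models: do the dropped terms explain extra variance?
cmp <- anova(model2, model1)
print(cmp)
p_value <- cmp[["Pr(>F)"]][2]
print(p_value)
```

In this comparison a small value of Pr(>F) means the terms dropped from the reduced model explain a significant share of the variance, which favours the fuller model.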

According to the result above, we state the hypotheses:

H₀: The first model is more effective than the second one.

H₁: The second model is more effective than the first one.

When we compare the two models, the observed probability [Pr(>F)] is 2e-16 (

Title	Probability and Statistics Project of ANOVA
Authors	Đinh Lê Minh, Bin Công Khanh, Nguyn Bo Duy, Ngô Minh Đc, Nguyn Trng Thân
Supervisor	PhD. Phan Th Hưng
University	Ho Chi Minh City University of Technology
Major	Probability and Statistics
Type	Project
Year of publication	2022
City	Ho Chi Minh City
