Data cleaning
Checking the number of missing values for each variable in the data set “data1”
The data set contains missing values that need to be addressed for a cleaner version of "data1." To eliminate these values, we utilize the "na.omit" function, resulting in a new, refined data set.
After removing the NA values from “data_1”, we check whether any NA values remain.
The result from the console is “0”, which means that the data no longer contains missing values.
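The cleaning steps above can be sketched as follows. The small data frame is made up for illustration, since the real “data1” is not shown here, and the names na_per_column and remaining_na are hypothetical.

```r
# Toy stand-in for "data1" (assumed values, not the real flight data)
data1 <- data.frame(
  carrier   = c("AA", "AS", "DL", "UA", "WN"),
  dep_delay = c(5, NA, -2, 10, NA),
  arr_delay = c(3, 7, NA, 12, 1)
)

# Number of missing values for each variable
na_per_column <- colSums(is.na(data1))
print(na_per_column)

# Drop every row that contains at least one NA
data_1 <- na.omit(data1)

# Total number of NA values left after cleaning
remaining_na <- sum(is.na(data_1))
print(remaining_na)
```

Note that “na.omit” drops a whole row when any of its variables is missing, so the cleaned data set can be noticeably smaller than the original.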
Data visualization
1 Descriptive statistics for each of the variables
Graph: Boxplot of dep_delay for each carrier
Remove outliers
We plot the boxplot using the “boxplot” function.
The presence of outliers in the data can lead to complications in statistical analyses. It is essential to eliminate these outliers from the "dep_delay" variable to ensure accurate results.
The function "remove_outl" is designed to identify and manage outliers in a dataset. It calculates the first quartile, third quartile, and interquartile range using the "quantile" and "IQR" functions. The function then evaluates the input parameter to determine if it qualifies as an outlier. If the parameter meets the outlier criteria, it is replaced with NA values; otherwise, it remains unchanged.
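A minimal sketch of such a function is given below. The exact outlier criterion is not stated in the text, so the common rule of 1.5 times the interquartile range beyond the quartiles is an assumption here, and the sample values are made up.

```r
# Sketch of "remove_outl"; the 1.5 * IQR fences are an assumed criterion
remove_outl <- function(x) {
  q1  <- quantile(x, 0.25, na.rm = TRUE, names = FALSE)  # first quartile
  q3  <- quantile(x, 0.75, na.rm = TRUE, names = FALSE)  # third quartile
  iqr <- IQR(x, na.rm = TRUE)                            # interquartile range
  lower <- q1 - 1.5 * iqr
  upper <- q3 + 1.5 * iqr
  # Values outside the fences become NA, all others are kept unchanged
  ifelse(x < lower | x > upper, NA, x)
}

# Made-up delays with one obvious extreme value
delays  <- c(-3, 0, 2, 4, 5, 7, 9, 250)
cleaned <- remove_outl(delays)
print(cleaned)
```

Replacing outliers with NA instead of deleting them keeps the vector the same length, which makes it easy to write the result back into the original data frame.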
Creating subsets of the dep_delay values for each carrier and applying the function to each subset
Combining the separate subsets, storing them in a new data set named “data3”, and checking that the data set is intact
To identify the number of outliers, we will use the “is.na” function to count the NA values. Next, we will eliminate these outliers with the “na.omit” function. Finally, we will re-evaluate “data3” to ensure that no outliers remain.
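The per-carrier workflow described above might look like the sketch below. The input name data2, the toy values, and the 1.5 * IQR criterion inside the helper are assumptions made for illustration.

```r
# Assumed helper: values beyond 1.5 * IQR from the quartiles become NA
remove_outl <- function(x) {
  q1  <- quantile(x, 0.25, na.rm = TRUE, names = FALSE)
  q3  <- quantile(x, 0.75, na.rm = TRUE, names = FALSE)
  iqr <- IQR(x, na.rm = TRUE)
  ifelse(x < q1 - 1.5 * iqr | x > q3 + 1.5 * iqr, NA, x)
}

# Toy data: two carriers, each with one extreme delay
data2 <- data.frame(
  carrier   = rep(c("AA", "WN"), each = 6),
  dep_delay = c(0, 1, 2, 3, 4, 100, 10, 11, 12, 13, 14, -90)
)

# One subset per carrier, with the function applied to each subset
subsets <- split(data2, data2$carrier)
subsets <- lapply(subsets, function(d) {
  d$dep_delay <- remove_outl(d$dep_delay)
  d
})

# Combine the subsets into "data3" and count the flagged outliers
data3 <- do.call(rbind, subsets)
n_outliers <- sum(is.na(data3$dep_delay))
print(n_outliers)

# Remove the flagged rows and confirm that none remain
data3 <- na.omit(data3)
print(sum(is.na(data3)))
```

Working per carrier means each airline gets its own quartiles, so a delay that is typical for one carrier is not flagged merely because it is unusual for another.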
Finally, the boxplot is plotted with the “boxplot” function using the data set “data3”, from which the outliers have been removed.
One way ANOVA
We first test the assumption of normality. To do so, we create a new data frame containing only the departure delay values of the airlines:
We use the Shapiro-Wilk test to check whether the departure delay is normally distributed, using the following command:
Since the p-value < 2.2e-16 < 0.05, we reject the hypothesis of normality at the 5% significance level: the departure delay from Portland is not normally distributed.
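As an illustration, a Shapiro-Wilk test of this kind can be run as sketched below. The delays are simulated from a skewed distribution and are not the Portland data. Note that “shapiro.test” only accepts between 3 and 5000 observations, so a larger data set would have to be sampled first.

```r
set.seed(1)

# Simulated, right-skewed delays (assumed stand-in for the real values)
dep_delay <- rexp(500, rate = 1 / 10) - 5

# Shapiro-Wilk test; the null hypothesis is that the data are normal
sw <- shapiro.test(dep_delay)
print(sw)

# Decision at the 5% significance level
not_normal <- sw$p.value < 0.05
print(not_normal)
```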
Double check with the following QQ-Plot chart:
Thus, the departure delay of flights departing Portland in 2014 is not normally distributed.
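A QQ-plot like the one referred to above can be drawn with base R as sketched here. The data are simulated, so the resulting picture only illustrates the technique.

```r
set.seed(1)

# Simulated, right-skewed delays (assumed stand-in for the real values)
dep_delay <- rexp(500, rate = 1 / 10) - 5

# Sample quantiles against theoretical normal quantiles
qq <- qqnorm(dep_delay, main = "Normal Q-Q plot of dep_delay")

# Reference line through the first and third quartiles
qqline(dep_delay, col = "red")
```

If the points bend away from the reference line, especially in the tails, this supports the conclusion that the variable is not normally distributed.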
Check for homogeneity of variance
We use the Bartlett test to check the homogeneity of the variances with the command:
Since the p-value < 2.2e-16 < 0.05, we reject the hypothesis of equal variances at the 5% significance level: the variances are not homogeneous.
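A Bartlett test of this form can be sketched as follows. The data frame name flights and its simulated values are assumptions. Only the variable names carrier and dep_delay come from the text.

```r
set.seed(2)

# Simulated delays for three carriers with clearly different spreads
flights <- data.frame(
  carrier   = factor(rep(c("AA", "AS", "WN"), each = 100)),
  dep_delay = c(rnorm(100, 5, 2), rnorm(100, 6, 10), rnorm(100, 12, 30))
)

# Bartlett test; the null hypothesis is equal variance in every group
bt <- bartlett.test(dep_delay ~ carrier, data = flights)
print(bt)

# Decision at the 5% significance level
unequal_variances <- bt$p.value < 0.05
print(unequal_variances)
```

The Bartlett test is itself sensitive to departures from normality, which is worth keeping in mind when the normality check has already failed.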
Presenting a one-way ANOVA table:
Multiple comparisons after analysis of variance
We perform the Tukey test with the command:
The analysis reveals two distinct groups of airlines with similar average delay times: the first group includes WN and AA, while the second group comprises HA, AS, OS, DL, US, F9, B6, OO, UA, and VX. Notably, WN has the highest average departure delay among the 11 airlines evaluated.
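The ANOVA table and the Tukey comparison described above could be produced along the following lines. The three-carrier data set is simulated and the object names fit and tukey are hypothetical, so the numbers will not match the results reported here.

```r
set.seed(3)

# Simulated delays for three carriers (assumed values)
flights <- data.frame(
  carrier   = factor(rep(c("AA", "AS", "WN"), each = 50)),
  dep_delay = c(rnorm(50, 8, 5), rnorm(50, 2, 5), rnorm(50, 9, 5))
)

# One-way ANOVA of departure delay by carrier
fit <- aov(dep_delay ~ carrier, data = flights)
print(summary(fit))

# Tukey honest significant differences for every pair of carriers
tukey <- TukeyHSD(fit)
print(tukey)
```

In the Tukey output, pairs whose adjusted p-value ("p adj") is above 0.05 can be placed in the same group, which is how groupings such as the one above are read off.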
Generalized linear model
In the model, the dependent variable is “arr_delay”, while the independent variables are “carrier”, “origin”, “dest”, “dep_delay”, and “distance”. These are the factors that affect the arrival delay. The regression model:
From the above analysis, we obtain the regression model:
Based on the result of the linear regression model above, we state the hypotheses:
H₀: The coefficients of the variables are not statistically significant
H₁: The coefficients of the variables are statistically significant
The p-values for "Carrier_B6," "Carrier_OO," "Carrier_US," and most levels of the "dest" variable exceed the 5% significance level, indicating insufficient evidence to reject the null hypothesis H₀. Consequently, these variables lack statistical significance and will be excluded from the model, while the remaining variables have p-values below the threshold.
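To show how such p-values are read from a fitted model, here is a sketch using “lm” on simulated data. The data values, the airport codes, and the names model_1, coefs, and significant are assumptions. Only the variable names are taken from the text.

```r
set.seed(4)
n <- 300

# Simulated flights (assumed values, not the real data set)
flights <- data.frame(
  carrier   = factor(sample(c("AA", "AS", "WN"), n, replace = TRUE)),
  origin    = factor(sample(c("PDX", "SEA"), n, replace = TRUE)),
  dest      = factor(sample(c("DEN", "LAX", "SFO"), n, replace = TRUE)),
  dep_delay = rnorm(n, 10, 20),
  distance  = runif(n, 200, 2500)
)
flights$arr_delay <- 0.95 * flights$dep_delay -
  0.002 * flights$distance + rnorm(n, 0, 5)

# Linear model with all five predictors
model_1 <- lm(arr_delay ~ carrier + origin + dest + dep_delay + distance,
              data = flights)

# Coefficient table: estimate, standard error, t value, p-value
coefs <- summary(model_1)$coefficients
print(round(coefs, 4))

# Terms whose p-value is below the 5% significance level
significant <- rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05]
print(significant)
```

Each level of a factor such as carrier or dest gets its own coefficient and p-value, measured against a reference level, which is why individual carriers can be non-significant while others are significant.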
At a 5% significance level, we can confidently reject the null hypothesis (H₀), indicating that the coefficients for the included variables are statistically significant. Consequently, there is no need to exclude these variables from the model. However, we will proceed to construct a second model that excludes the variables "carrier" and "dest."
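Fitting the reduced model and comparing it with the full one might be sketched as follows. The data are simulated and the names model_1, model_2, and cmp are hypothetical. The comparison uses the F-test that “anova” performs for nested linear models.

```r
set.seed(5)
n <- 300

# Simulated flights (assumed values, not the real data set)
flights <- data.frame(
  carrier   = factor(sample(c("AA", "AS", "WN"), n, replace = TRUE)),
  origin    = factor(sample(c("PDX", "SEA"), n, replace = TRUE)),
  dest      = factor(sample(c("DEN", "LAX", "SFO"), n, replace = TRUE)),
  dep_delay = rnorm(n, 10, 20),
  distance  = runif(n, 200, 2500)
)
flights$arr_delay <- 0.95 * flights$dep_delay -
  0.002 * flights$distance + rnorm(n, 0, 5)

# First model: all predictors
model_1 <- lm(arr_delay ~ carrier + origin + dest + dep_delay + distance,
              data = flights)

# Second model: "carrier" and "dest" excluded
model_2 <- lm(arr_delay ~ origin + dep_delay + distance, data = flights)

# F-test for nested models (reduced model first, full model second)
cmp <- anova(model_2, model_1)
print(cmp)

# p-value of the comparison, from the "Pr(>F)" column
p_value <- cmp[["Pr(>F)"]][2]
print(p_value)
```

A small value in the "Pr(>F)" column indicates that the dropped terms explain a meaningful share of the variation, so the larger model fits significantly better.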
Based on the result above, we state the hypotheses:
H₀: The first model is more effective than the second one.
H₁: The second model is more effective than the first one.
When we compare the two models, the observed probability Pr(>F) is 2e-16 (