Workload

Name            ID        Work                                                                  Percentage
Đinh Lê Minh    2153568   Activity 1: Questions 1, 2, 3, 5; summarise the report; code
                          Activity 1: Question 4; write the report; R code; summarise theory
                          Activity 2: find data set; write the report; R code; summarise theory
                          Activity 2: R code; summarise the report; summarise theory
                          Activity 2: find data set; write the report; R code; summarise theory
Theory summary
Analysis of Variance (ANOVA) is a parametric statistical method used to compare multiple data sets. Essentially, ANOVA assesses potential differences in a scale-level dependent variable across two or more categorical groups.
This report focuses on the two most common types of analysis of variance: one-factor and two-factor analysis of variance. One-factor ANOVA examines the impact of a single qualitative factor on a quantitative outcome, while two-factor ANOVA extends this concept by incorporating two independent variables that influence the dependent variable.
The Shapiro-Wilk test is a statistical method used to determine whether a random sample originates from a normal distribution. It produces a W statistic, where a small value suggests that the sample is not normally distributed; the null hypothesis that the population follows a normal distribution is rejected when W falls below a specific threshold.
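As a minimal illustration in R (toy data, not the report's code), the test is available in base R as shapiro.test():

set.seed(1)
x <- rnorm(100, mean = 5, sd = 2)   # a sample drawn from a normal distribution
shapiro.test(x)                     # large p-value: no evidence against normality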
Levene Test for equality of variances
Levene's test is employed to determine whether k samples exhibit equal variances, a condition known as homogeneity of variance. This equality of variances is a crucial assumption for certain statistical tests, such as analysis of variance (ANOVA). By using the Levene test, researchers can check the validity of this assumption across different groups or samples.
The Kruskal-Wallis test, similar to ANOVA, assesses whether there are statistically significant differences among at least three groups of an independent variable with respect to a continuous or ordinal dependent variable. The key distinction is that the Kruskal-Wallis test is applicable to non-normally distributed data, whereas ANOVA requires normally distributed data.
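A minimal sketch in base R (toy data assumed):

y <- c(rnorm(30, 0), rnorm(30, 0.5), rnorm(30, 1))  # three groups, shifted means
g <- factor(rep(c("A", "B", "C"), each = 30))
kruskal.test(y ~ g)   # small p-value suggests at least one group differs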
The Brown–Forsythe test is a statistical method used to assess the equality of variances across groups by applying an analysis of variance (ANOVA) to a transformed response variable. This test modifies the mean square by incorporating the observed variances of each group rather than simply dividing by the mean square of the error. The interpretation of the p-value remains consistent with that of the traditional ANOVA table.
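A sketch of both variance tests, assuming the car package is available; its leveneTest() uses the group means for the classic Levene test and the group medians for the Brown–Forsythe variant:

set.seed(2)
y <- c(rnorm(30, sd = 1), rnorm(30, sd = 2), rnorm(30, sd = 3))  # unequal spreads
g <- factor(rep(c("A", "B", "C"), each = 30))
library(car)                        # provides leveneTest()
leveneTest(y ~ g, center = mean)    # classic Levene test of equal variances
leveneTest(y ~ g, center = median)  # Brown-Forsythe (median-centered) variant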
Definition of multiple linear regression
Multiple linear regression is a statistical method that uses multiple explanatory variables to predict the outcome of a response variable. Its primary objective is to model the linear relationship between the explanatory (independent) variables and the response (dependent) variable:

yᵢ = β₀ + β₁xᵢ₁ + β₂xᵢ₂ + … + βₚxᵢₚ + ε

where:
β₀: y-intercept (constant term)
β₁, …, βₚ: slope coefficients for each explanatory variable
ε: the model's error term (also known as the residuals)

The model rests on the following assumptions (a short R sketch is given after the list):
• There is a linear relationship between the dependent variable and the independent variables.
• The independent variables are not too highly correlated with each other.
• The yᵢ observations are selected independently and randomly from the population.
• Residuals should be normally distributed with a mean of 0 and constant variance σ².
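As a minimal sketch with simulated data (all names illustrative), such a model can be fitted in R with lm():

set.seed(3)
x1 <- rnorm(100)
x2 <- rnorm(100)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(100)  # true coefficients plus noise
fit <- lm(y ~ x1 + x2)                    # estimates beta0, beta1, beta2
summary(fit)                              # coefficients, p-values, R-squared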
Ridge regression, also known as Tikhonov regularization, is a technique for estimating coefficients in multiple-regression models when independent variables are highly correlated. This method addresses the issue of multicollinearity, which often arises in models with numerous parameters. By applying ridge regression, one can achieve improved efficiency in parameter estimation, albeit with a tolerable increase in bias, highlighting the bias–variance tradeoff.
Formula of ridge regression (the coefficients minimize the penalized residual sum of squares):

Σᵢ₌₁ⁿ (yᵢ − β₀ − Σⱼ₌₁ᵖ βⱼxᵢⱼ)² + λ Σⱼ₌₁ᵖ βⱼ²

with:
n: the number of rows (observations)
p: the number of columns (explanatory variables)
λ: the regularization (penalty) parameter
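A hedged sketch using the glmnet package (an assumption; the report's tooling is not shown), where alpha = 0 selects the ridge penalty:

set.seed(4)
x1 <- rnorm(100)
x2 <- 0.9 * x1 + rnorm(100, sd = 0.1)           # deliberately collinear predictor
y  <- 1 + 2 * x1 + 2 * x2 + rnorm(100)
library(glmnet)
X <- cbind(x1, x2)                              # n x p predictor matrix
ridge <- glmnet(X, y, alpha = 0, lambda = 0.1)  # alpha = 0 => ridge penalty
coef(ridge)                                     # shrunken coefficient estimates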
Activity 1
3.1 Import data
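The import code itself is not reproduced here; a sketch, assuming the flight data are stored in a CSV file named "flights.csv" (the actual file name may differ):

data1 <- read.csv("flights.csv")   # hypothetical file name; adjust to the real source
head(data1)                        # inspect the first rows
str(data1)                         # check variable types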
3.2 Data cleaning
Checking the number of missing values in each variable of the data set “data1”:
The data set contains missing values that need to be addressed for a cleaner version of "data1." To achieve this, we will utilize the "na.omit" function to remove the missing values, resulting in a new, clean data set.
After removing the NA values and storing the result in “data_1”, we check whether any NA values are left.
The result printed to the console is “0”, which means the data no longer contain missing values.
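A sketch of these cleaning steps, using the function names mentioned above:

colSums(is.na(data1))     # count missing values per variable
data_1 <- na.omit(data1)  # drop every row containing an NA
sum(is.na(data_1))        # prints 0: no missing values remain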
3.3 Data visualization
3.3.1 Descriptive statistics for each of the variables
To obtain descriptive statistics for our variables, we created a data set named “data3” containing variables with varying values. We then calculated key statistical measures, including the mean, minimum, maximum, Q1, Q2, Q3, and standard deviation, using the appropriate functions. Finally, we displayed these results in transposed form using the “t()” function for easier reading.
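A sketch of how such a summary table can be built (the helper code is an assumption, not the report's exact commands):

num_vars <- data_1[sapply(data_1, is.numeric)]   # keep the numeric columns
stats <- sapply(num_vars, function(v) c(
  mean = mean(v), min = min(v),
  Q1 = unname(quantile(v, 0.25)), Q2 = median(v), Q3 = unname(quantile(v, 0.75)),
  max = max(v), sd = sd(v)))
t(stats)   # transpose: one row per variable, one column per statistic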
3.3.2 Graph: boxplot of dep_delay for each carrier – Remove outliers
We will plot the boxplot graph using the “boxplot” function.
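For example (formula notation assumed):

boxplot(dep_delay ~ carrier, data = data_1,
        xlab = "Carrier", ylab = "Departure delay (minutes)")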
Outliers are present in the data and must be removed to avoid issues in the statistical analyses. To eliminate them from the “dep_delay” variable, a specific function can be used.
The function “remove_outl” is designed to identify and handle outliers in a data set. It begins by calculating the first quartile, third quartile, and interquartile range using the “quantile” and “IQR” functions. The function then checks each input value against the outlier conditions; if a value is deemed an outlier, it is replaced with NA, while non-outlier values remain unchanged.
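A reconstruction of the function from this description; the conventional 1.5 × IQR fences are assumed:

remove_outl <- function(x) {
  q1  <- quantile(x, 0.25)   # first quartile
  q3  <- quantile(x, 0.75)   # third quartile
  iqr <- IQR(x)              # interquartile range
  # values outside the fences are treated as outliers and replaced with NA
  ifelse(x < q1 - 1.5 * iqr | x > q3 + 1.5 * iqr, NA, x)
}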
Creating subsets of the dep_delay values for each carrier and applying the function to each subset.
Combining the separated subsets, storing the result in a new data set named “data3”, and checking that the data set is intact.
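One way to express this subset-apply-recombine step in base R (a sketch; the report may have handled each carrier explicitly):

groups  <- split(data_1$dep_delay, data_1$carrier)  # one subset per carrier
cleaned <- lapply(groups, remove_outl)              # flag outliers within each subset
data3   <- data_1
data3$dep_delay <- unsplit(cleaned, data_1$carrier) # recombine in the original order
nrow(data3) == nrow(data_1)                         # sanity check: data set intact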
We will count the outliers, which are now represented as NA values, using the “is.na” function. Next, we will remove these outliers with the “na.omit” function. After this removal, we will check again for any remaining outliers in “data3”.
Finally, the boxplot is plotted with the “boxplot” function on the data set “data3” with the outliers removed.
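In code (a sketch following the steps just described):

sum(is.na(data3$dep_delay))                 # number of outliers, now marked as NA
data3 <- na.omit(data3)                     # remove the flagged rows
sum(is.na(data3$dep_delay))                 # re-check: should be 0
boxplot(dep_delay ~ carrier, data = data3)  # boxplot with outliers removed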
3.4 One-way ANOVA
We will test the assumption of normal distribution. First, we create a new data frame containing only the departure delay values of the airlines:
We use the Shapiro–Wilk test to check whether the departure delay time is normally distributed with the following command:
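A sketch of the command; note that shapiro.test() accepts at most 5,000 observations, so a random sample may be needed for a large data set:

df <- data3[, c("carrier", "dep_delay")]                 # delays per airline
shapiro.test(sample(df$dep_delay, min(5000, nrow(df))))  # normality check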
Since the p-value < 2.2e-16 < 0.05, we conclude with 95% confidence that the departure delay of flights from Portland is not normally distributed.
We double-check with the following QQ plot:
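For example:

qqnorm(df$dep_delay)              # sample quantiles vs. theoretical normal quantiles
qqline(df$dep_delay, col = "red") # reference line; strong curvature => non-normal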
Thus, the departure delays of flights of airlines departing Portland in 2014 are not normally distributed.
Checking for homogeneity of variance
We use the Bartlett test to check the homogeneity of the variances with the command:
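In R (variable names as used earlier):

bartlett.test(dep_delay ~ carrier, data = data3)  # H0: equal variances across carriers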
Since the p-value < 2.2e-16 < 0.05, the variances are not homogeneous with 95% confidence.
Presenting a one-way ANOVA table:
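A sketch of how the table is produced:

anova_model <- aov(dep_delay ~ carrier, data = data3)  # one-way ANOVA
summary(anova_model)   # the table: Df, Sum Sq, Mean Sq, F value, Pr(>F)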
Multiple comparisons after the analysis of variance
We perform the Tukey test with the command:
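For example, on the ANOVA model fitted above:

TukeyHSD(anova_model)   # pairwise differences in mean delay between carriers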
The analysis reveals two groups of airlines with similar average delay times: the first group includes WN and AA, while the second group consists of HA, AS, OS, DL, US, F9, B6, OO, UA, and VX. Notably, WN stands out as having the highest average departure delay among the 11 airlines examined.
3.5 Generalized linear model
In the model, the dependent variable is “arr_delay”, while the independent variables are “carrier”, “origin”, “dest”, “dep_delay”, and “distance”. These are the factors that affect the arrival delay. The regression model:
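A sketch of the model call (the data set name is assumed from the earlier steps):

model1 <- lm(arr_delay ~ carrier + origin + dest + dep_delay + distance, data = data3)
summary(model1)   # coefficient estimates and their p-values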
From the above analysis, we obtain the regression model:
According to the result of the linear regression model above, we assume:
H₀: The coefficients of the variables are not statistically significant.
H₁: The coefficients of the variables are statistically significant.
The p-values for “Carrier_B6”, “Carrier_OO”, “Carrier_US”, and most of the “dest” variables exceed the 5% significance level, indicating insufficient evidence to reject the null hypothesis H₀. Consequently, these variables lack statistical significance and will be excluded from the model. In contrast, the remaining variables have p-values below the 5% significance level, allowing us to reject H₀ and confirming their statistical significance. Therefore, we retain these variables while constructing the second model, which excludes “carrier” and “dest”.
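A sketch of the second model under this exclusion:

model2 <- lm(arr_delay ~ origin + dep_delay + distance, data = data3)  # without carrier, dest
summary(model2)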
According to the result above, we assume:
H₀: The first model is more effective than the second one.
H₁: The second model is more effective than the first one.
When we compare the two models, the observed probability Pr(>F) is 2e-16 (< 0.05), so we reject H₀ with 95% confidence and conclude that the second model is more effective than the first.
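The comparison itself can be run as a partial F-test with anova() (a sketch using the model objects assumed above):

anova(model2, model1)   # compares the reduced and full models; Pr(>F) as reported above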