Workload

Name            ID        Work                                                                  Percentage
Đinh Lê Minh    2153568   Activity 1: Questions 1, 2, 3, 5; summarise the report; code
                          Activity 1: Question 4; write the report; R code; summarise theory
                          Activity 2: find data set; write the report; R code; summarise theory
                          Activity 2: R code; summarise the report; summarise theory
                          Activity 2: find data set; write the report; R code; summarise theory
Theory summary
Analysis of Variance (ANOVA) is a parametric statistical method used to compare multiple data sets. Essentially, ANOVA assesses potential differences in a scale-level dependent variable across two or more categorical groups.
This report focuses on the two most common types of analysis of variance: one-factor and two-factor analysis of variance. One-factor ANOVA examines the impact of a single qualitative factor on a quantitative outcome, while two-factor ANOVA extends this concept by incorporating two independent variables that influence the dependent variable.
The Shapiro-Wilk test is a statistical method used to determine whether a random sample originates from a normal distribution. It produces a W statistic, where a small value suggests that the sample is not normally distributed; the null hypothesis that the population follows a normal distribution is rejected when W falls below a specific threshold.
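As a minimal illustration in R (toy data, not the report's code), the test is available in base R as shapiro.test():

set.seed(1)
x <- rnorm(100, mean = 5, sd = 2)   # a sample drawn from a normal distribution
shapiro.test(x)                     # large p-value: no evidence against normality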
Levene Test for equality of variances
Levene's test is employed to determine whether k samples exhibit equal variances, a condition known as homogeneity of variance. This equality of variances is a crucial assumption for certain statistical tests, such as analysis of variance (ANOVA). By using the Levene test, researchers can check the validity of this assumption across different groups or samples.
The Kruskal-Wallis test, similar to ANOVA, assesses whether there are statistically significant differences among at least three groups of an independent variable with respect to a continuous or ordinal dependent variable. The key distinction is that the Kruskal-Wallis test is applicable to non-normally distributed data, whereas ANOVA requires normally distributed data.
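A minimal sketch in base R (toy data assumed):

y <- c(rnorm(30, 0), rnorm(30, 0.5), rnorm(30, 1))  # three groups, shifted means
g <- factor(rep(c("A", "B", "C"), each = 30))
kruskal.test(y ~ g)   # small p-value suggests at least one group differs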
The Brown–Forsythe test is a statistical method used to assess the equality of variances across groups by applying an analysis of variance (ANOVA) to a transformed response variable. This test modifies the mean square by incorporating the observed variances of each group rather than simply dividing by the mean square of the error. The interpretation of the p-value remains consistent with that of the traditional ANOVA table.
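A sketch of both variance tests, assuming the car package is available; its leveneTest() uses the group means for the classic Levene test and the group medians for the Brown–Forsythe variant:

set.seed(2)
y <- c(rnorm(30, sd = 1), rnorm(30, sd = 2), rnorm(30, sd = 3))  # unequal spreads
g <- factor(rep(c("A", "B", "C"), each = 30))
library(car)                        # provides leveneTest()
leveneTest(y ~ g, center = mean)    # classic Levene test of equal variances
leveneTest(y ~ g, center = median)  # Brown-Forsythe (median-centered) variant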
Definition of multiple linear regression
Multiple linear regression is a statistical method that uses multiple explanatory variables to predict the outcome of a response variable. Its primary objective is to model the linear relationship between the explanatory (independent) variables and the response (dependent) variable:

yᵢ = β₀ + β₁xᵢ₁ + β₂xᵢ₂ + … + βₚxᵢₚ + ε

where:
β₀: y-intercept (constant term)
β₁, …, βₚ: slope coefficients for each explanatory variable
ε: the model's error term (also known as the residuals)

The model rests on the following assumptions (a short R sketch is given after the list):
• There is a linear relationship between the dependent variable and the independent variables.
• The independent variables are not too highly correlated with each other.
• The yᵢ observations are selected independently and randomly from the population.
• Residuals should be normally distributed with a mean of 0 and constant variance σ².
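As a minimal sketch with simulated data (all names illustrative), such a model can be fitted in R with lm():

set.seed(3)
x1 <- rnorm(100)
x2 <- rnorm(100)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(100)  # true coefficients plus noise
fit <- lm(y ~ x1 + x2)                    # estimates beta0, beta1, beta2
summary(fit)                              # coefficients, p-values, R-squared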
Ridge regression, also known as Tikhonov regularization, is a technique for estimating coefficients in multiple-regression models when independent variables are highly correlated. This method addresses the issue of multicollinearity, which often arises in models with numerous parameters. By applying ridge regression, one can achieve improved efficiency in parameter estimation, albeit with a tolerable increase in bias, highlighting the bias–variance tradeoff.
Formula of ridge regression (the coefficients minimize the penalized residual sum of squares):

Σᵢ₌₁ⁿ (yᵢ − β₀ − Σⱼ₌₁ᵖ βⱼxᵢⱼ)² + λ Σⱼ₌₁ᵖ βⱼ²

with:
n: the number of rows (observations)
p: the number of columns (explanatory variables)
λ: the regularization (penalty) parameter
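A hedged sketch using the glmnet package (an assumption; the report's tooling is not shown), where alpha = 0 selects the ridge penalty:

set.seed(4)
x1 <- rnorm(100)
x2 <- 0.9 * x1 + rnorm(100, sd = 0.1)           # deliberately collinear predictor
y  <- 1 + 2 * x1 + 2 * x2 + rnorm(100)
library(glmnet)
X <- cbind(x1, x2)                              # n x p predictor matrix
ridge <- glmnet(X, y, alpha = 0, lambda = 0.1)  # alpha = 0 => ridge penalty
coef(ridge)                                     # shrunken coefficient estimates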
Activity 1
3.1 Import data
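The import code itself is not reproduced here; a sketch, assuming the flight data are stored in a CSV file named "flights.csv" (the actual file name may differ):

data1 <- read.csv("flights.csv")   # hypothetical file name; adjust to the real source
head(data1)                        # inspect the first rows
str(data1)                         # check variable types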
3.2 Data cleaning
Checking the number of missing values in each variable of the data set “data1”:
The data set contains missing values that need to be addressed for a cleaner version of "data1." To achieve this, we will utilize the "na.omit" function to remove the missing values, resulting in a new, clean data set.
After removing the NA values and storing the result in “data_1”, we check whether any NA values are left.
The result printed to the console is “0”, which means the data no longer contain missing values.
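A sketch of these cleaning steps, using the function names mentioned above:

colSums(is.na(data1))     # count missing values per variable
data_1 <- na.omit(data1)  # drop every row containing an NA
sum(is.na(data_1))        # prints 0: no missing values remain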
3.3 Data visualization
3.3.1 Descriptive statistics for each of the variables
To obtain descriptive statistics for our variables, we created a data set named “data3” containing variables with varying values. We then calculated key statistical measures, including the mean, minimum, maximum, Q1, Q2, Q3, and standard deviation, using the appropriate functions. Finally, we displayed these results in transposed form using the “t()” function for easier reading.
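A sketch of how such a summary table can be built (the helper code is an assumption, not the report's exact commands):

num_vars <- data_1[sapply(data_1, is.numeric)]   # keep the numeric columns
stats <- sapply(num_vars, function(v) c(
  mean = mean(v), min = min(v),
  Q1 = unname(quantile(v, 0.25)), Q2 = median(v), Q3 = unname(quantile(v, 0.75)),
  max = max(v), sd = sd(v)))
t(stats)   # transpose: one row per variable, one column per statistic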
3.3.2 Graph: boxplot of dep_delay for each carrier – Remove outliers
We will plot the boxplot graph using the “boxplot” function.
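For example (formula notation assumed):

boxplot(dep_delay ~ carrier, data = data_1,
        xlab = "Carrier", ylab = "Departure delay (minutes)")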
Outliers are present in the data and must be removed to avoid issues in the statistical analyses. To eliminate them from the “dep_delay” variable, a specific function can be used.
The function “remove_outl” is designed to identify and handle outliers in a data set. It begins by calculating the first quartile, third quartile, and interquartile range using the “quantile” and “IQR” functions. The function then checks each input value against the outlier conditions; if a value is deemed an outlier, it is replaced with NA, while non-outlier values remain unchanged.
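A reconstruction of the function from this description; the conventional 1.5 × IQR fences are assumed:

remove_outl <- function(x) {
  q1  <- quantile(x, 0.25)   # first quartile
  q3  <- quantile(x, 0.75)   # third quartile
  iqr <- IQR(x)              # interquartile range
  # values outside the fences are treated as outliers and replaced with NA
  ifelse(x < q1 - 1.5 * iqr | x > q3 + 1.5 * iqr, NA, x)
}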
Creating subsets of the dep_delay values for each carrier and applying the function to each subset.
Combining the separated subsets, storing the result in a new data set named “data3”, and checking that the data set is intact.
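One way to express this subset-apply-recombine step in base R (a sketch; the report may have handled each carrier explicitly):

groups  <- split(data_1$dep_delay, data_1$carrier)  # one subset per carrier
cleaned <- lapply(groups, remove_outl)              # flag outliers within each subset
data3   <- data_1
data3$dep_delay <- unsplit(cleaned, data_1$carrier) # recombine in the original order
nrow(data3) == nrow(data_1)                         # sanity check: data set intact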
We will count the outliers, which are now represented as NA values, using the “is.na” function. Next, we will remove these outliers with the “na.omit” function. After this removal, we will check again for any remaining outliers in “data3”.
Finally, the boxplot is plotted with the “boxplot” function on the data set “data3” with the outliers removed.
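In code (a sketch following the steps just described):

sum(is.na(data3$dep_delay))                 # number of outliers, now marked as NA
data3 <- na.omit(data3)                     # remove the flagged rows
sum(is.na(data3$dep_delay))                 # re-check: should be 0
boxplot(dep_delay ~ carrier, data = data3)  # boxplot with outliers removed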
3.4 One-way ANOVA
We will test the assumption of normal distribution. First, we create a new data frame containing only the departure delay values of the airlines:
We use the Shapiro–Wilk test to check whether the departure delay time is normally distributed with the following command:
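A sketch of the command; note that shapiro.test() accepts at most 5,000 observations, so a random sample may be needed for a large data set:

df <- data3[, c("carrier", "dep_delay")]                 # delays per airline
shapiro.test(sample(df$dep_delay, min(5000, nrow(df))))  # normality check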
Since the p-value < 2.2e-16 < 0.05, we conclude with 95% confidence that the departure delay of flights from Portland is not normally distributed.
We double-check with the following QQ plot:
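For example:

qqnorm(df$dep_delay)              # sample quantiles vs. theoretical normal quantiles
qqline(df$dep_delay, col = "red") # reference line; strong curvature => non-normal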
Thus, the departure delays of flights of airlines departing Portland in 2014 are not normally distributed.
Checking for homogeneity of variance
We use the Bartlett test to check the homogeneity of the variances with the command:
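In R (variable names as used earlier):

bartlett.test(dep_delay ~ carrier, data = data3)  # H0: equal variances across carriers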
Since the p-value < 2.2e-16 < 0.05, the variances are not homogeneous with 95% confidence.
Presenting a one-way ANOVA table:
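A sketch of how the table is produced:

anova_model <- aov(dep_delay ~ carrier, data = data3)  # one-way ANOVA
summary(anova_model)   # the table: Df, Sum Sq, Mean Sq, F value, Pr(>F)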
Multiple comparisons after the analysis of variance
We perform the Tukey test with the command:
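For example, on the ANOVA model fitted above:

TukeyHSD(anova_model)   # pairwise differences in mean delay between carriers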
The analysis reveals two groups of airlines with similar average delay times: the first group includes WN and AA, while the second group consists of HA, AS, OS, DL, US, F9, B6, OO, UA, and VX. Notably, WN stands out as having the highest average departure delay among the 11 airlines examined.
3.5 Generalized linear model
In the model, the dependent variable is “arr_delay”, while the independent variables are “carrier”, “origin”, “dest”, “dep_delay”, and “distance”. These are the factors that affect the arrival delay. The regression model:
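A sketch of the model call (the data set name is assumed from the earlier steps):

model1 <- lm(arr_delay ~ carrier + origin + dest + dep_delay + distance, data = data3)
summary(model1)   # coefficient estimates and their p-values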
From the above analysis, we obtain the regression model:
According to the result of the linear regression model above, we assume:
H₀: The coefficients of the variables are not statistically significant.
H₁: The coefficients of the variables are statistically significant.
The p-values for “Carrier_B6”, “Carrier_OO”, “Carrier_US”, and most of the “dest” variables exceed the 5% significance level, indicating insufficient evidence to reject the null hypothesis H₀. Consequently, these variables lack statistical significance and will be excluded from the model. In contrast, the remaining variables have p-values below the 5% significance level, allowing us to reject H₀ and confirming their statistical significance. Therefore, we retain these variables while constructing the second model, which excludes “carrier” and “dest”.
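A sketch of the second model under this exclusion:

model2 <- lm(arr_delay ~ origin + dep_delay + distance, data = data3)  # without carrier, dest
summary(model2)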
According to the result above, we assume:
H₀: The first model is more effective than the second one.
H₁: The second model is more effective than the first one.
When we compare the two models, the observed probability Pr(>F) is 2e-16 (< 0.05), so we reject H₀ with 95% confidence and conclude that the second model is more effective than the first.
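The comparison itself can be run as a partial F-test with anova() (a sketch using the model objects assumed above):

anova(model2, model1)   # compares the reduced and full models; Pr(>F) as reported above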