
Probability and Statistics Project: Definition of ANOVA


DOCUMENT INFORMATION

Basic information

Title: Probability and Statistics Project of ANOVA
Authors: Đinh Lê Minh, Bin Công Khanh, Nguyn Bo Duy, Ngô Minh Đc, Nguyn Trng Thân
Supervisor: PhD. Phan Th Hưng
University: Ho Chi Minh City University of Technology
Major: Probability and Statistics
Type: Project
Year of publication: 2022
City: Ho Chi Minh City
Number of pages: 30


Structure

  • 3.2. Data cleaning
  • 3.3. Data visualization
    • 3.3.1. Descriptive statistics for each of the variables
    • 3.3.2. Graph: Boxplot dep_delay for each carrier. Remove outliers
  • 3.4. One way ANOVA
  • 3.5. Generalized linear model
  • 4. Activity 2
    • 4.1. Introduction
    • 4.2. Descriptive statistics
    • 4.3. Hypothesis testing
      • 4.3.1. The dataset is assumed to fully satisfy the conditions of ANOVA
      • 4.3.2. The dataset doesn’t fully satisfy the conditions of ANOVA
        • 4.3.2.1. Kruskal Wallis test
        • 4.3.2.2. Brown - Forsythe test
    • 4.4. Ridge regression model
      • 4.4.1. Checking correlation
      • 4.4.2. Build Ridge regression model
        • 4.4.2.1. Define response (y) and predictor (x) variables
        • 4.4.2.2. Fit Ridge regression model
        • 4.4.2.3. Choose an optimal value for 𝜆 (Lambda)
        • 4.4.2.4. Analyze final model
  • 5. Reference

Content

Data cleaning

We first count the missing values of each variable in the data set “data1”.

The data set contains missing values that must be handled before analysis. To eliminate them, we use the "na.omit" function, which drops every row containing a missing value and returns a new, cleaned data set.

After removing the NA values from “data1”, we check whether any NA values are left.

The result from the console is “0”, which means that the data no longer contains missing values.
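The cleaning steps above can be sketched in base R as follows. This is a minimal illustration on toy data: the contents of "data1" are not shown in this preview, so the columns and values below, as well as the name "data2" for the cleaned data set, are assumptions.

```r
# Toy stand-in for "data1"; the real data set is not shown in this preview.
data1 <- data.frame(
  carrier   = c("AA", "AS", "WN", "AS", "UA"),
  dep_delay = c(5, NA, 12, -3, NA),
  arr_delay = c(7, 2, NA, -5, 10)
)

# Number of missing values in each variable
na_per_column <- colSums(is.na(data1))
print(na_per_column)

# Drop every row that contains at least one NA ("data2" is a hypothetical name)
data2 <- na.omit(data1)

# Total number of NA values left; 0 means the data set is clean
remaining_na <- sum(is.na(data2))
print(remaining_na)
```

Note that "na.omit" removes whole rows, so a row with a single missing field is lost entirely.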

Data visualization

Descriptive statistics for each of the variables

Graph: Boxplot dep_delay for each carrier. Remove outliers

We plot the boxplot using the “boxplot” function.

Outliers in the data can distort statistical analyses, so we remove them from the "dep_delay" variable before testing.

The function "remove_outl" identifies and handles outliers in a vector of values. It calculates the first quartile, the third quartile, and the interquartile range using the "quantile" and "IQR" functions. It then checks each input value against the outlier criteria. A value that qualifies as an outlier is replaced with NA; otherwise, it remains unchanged.
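A function of this kind might look like the sketch below. The report's own definition is not shown in this preview, so the body here is an assumption: it uses the common rule that flags values more than 1.5 times the IQR below the first quartile or above the third quartile, and the parameter "k" is introduced only for illustration.

```r
# Hypothetical reconstruction of "remove_outl": replaces outliers with NA.
# The 1.5 * IQR fences are an assumption; the report's criteria are not shown.
remove_outl <- function(x, k = 1.5) {
  q1  <- quantile(x, 0.25, na.rm = TRUE)  # first quartile
  q3  <- quantile(x, 0.75, na.rm = TRUE)  # third quartile
  iqr <- IQR(x, na.rm = TRUE)             # interquartile range
  is_outlier <- x < (q1 - k * iqr) | x > (q3 + k * iqr)
  x[which(is_outlier)] <- NA              # which() skips values that are already NA
  x
}

# Toy delays in minutes; 250 lies far above the upper fence
delays  <- c(-2, 0, 1, 3, 4, 5, 250)
cleaned <- remove_outl(delays)
print(cleaned)
```

Replacing outliers with NA rather than deleting them keeps the vector the same length, so it can be written back into the data frame column it came from.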

We create one subset of dep_delay values per carrier and apply the function to each subset.

We then combine the separate subsets, store them in a new data set named “data3”, and check that the data set is intact.

To count the outliers, we use the “is.na” function to count the NA values. Next, we remove these outliers with the “na.omit” function. Finally, we re-check “data3” to ensure that no outliers remain.
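The per-carrier workflow described above can be sketched as follows. The data are toy values, "data2" is a hypothetical name for the cleaned input, and "remove_outl" is the same assumed 1.5 * IQR version, repeated so the block runs on its own. The report may have built its subsets differently; "split" and "lapply" are simply one compact way to do it.

```r
# Assumed outlier helper (1.5 * IQR rule); the report's version is not shown.
remove_outl <- function(x, k = 1.5) {
  q1  <- quantile(x, 0.25, na.rm = TRUE)
  q3  <- quantile(x, 0.75, na.rm = TRUE)
  iqr <- IQR(x, na.rm = TRUE)
  x[which(x < (q1 - k * iqr) | x > (q3 + k * iqr))] <- NA
  x
}

# Toy data: three carriers, six flights each
data2 <- data.frame(
  carrier   = rep(c("AA", "AS", "WN"), each = 6),
  dep_delay = c(1, 2, 3, 4, 5, 90,
                -4, -2, 0, 1, 2, 3,
                10, 11, 12, 13, 14, -60)
)

# One subset per carrier, each cleaned against its own quartiles
subsets <- split(data2, data2$carrier)
cleaned <- lapply(subsets, function(d) {
  d$dep_delay <- remove_outl(d$dep_delay)
  d
})

# Combine the subsets and check that no rows were lost
data3 <- do.call(rbind, cleaned)
n_rows_before <- nrow(data3)

# Count the outliers (now NA), then drop them
n_outliers <- sum(is.na(data3$dep_delay))
data3 <- na.omit(data3)

# Boxplot statistics per carrier; remove plot = FALSE to draw the chart
bp <- boxplot(dep_delay ~ carrier, data = data3, plot = FALSE)
print(bp$n)
```

Cleaning each carrier separately means a delay is judged against that carrier's own distribution, not against the pooled data.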

Finally, we plot the boxplot again with the “boxplot” function, this time using the data set “data3”, from which the outliers have been removed.

One way ANOVA

We first test the assumption of normality. To do so, we create a new data frame containing only the departure delays of the airlines:

We use the Shapiro-Wilk test to check whether the departure delay is normally distributed, with the following command:

Since the p-value (< 2.2e-16) is below 0.05, we reject the null hypothesis of normality at the 5% significance level: the departure delay of flights from Portland is not normally distributed.

We double-check with the following Q-Q plot:

Thus, the departure delay of airline flights departing Portland in 2014 is not normally distributed.
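The report's commands are not shown in this preview; a normality check of this kind might be sketched as below. The delays are simulated, right-skewed toy values, not the Portland data. Note that base R's "shapiro.test" accepts between 3 and 5000 observations, so a larger data set would have to be sampled first.

```r
set.seed(1)

# Simulated right-skewed delays in minutes (toy data, not the report's)
dep_delay <- rexp(500, rate = 1 / 15) - 5

# Shapiro-Wilk test; the null hypothesis is that the sample is normal
sw <- shapiro.test(dep_delay)
print(sw)

# Q-Q plot coordinates; with plot.it = TRUE (the default) the chart is drawn,
# and qqline(dep_delay) adds the reference line
qq <- qqnorm(dep_delay, plot.it = FALSE)
```

A p-value below 0.05 leads to rejecting normality; on a Q-Q plot the same departure shows up as points curving away from the reference line.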

Checking homogeneity of variance

We use the Bartlett test to check the homogeneity of the variances across carriers, with the command:

Since the p-value (< 2.2e-16) is below 0.05, we reject the null hypothesis of equal variances at the 5% significance level: the variances are not homogeneous.
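A Bartlett test of this form might be run as in the following sketch. The data frame name "flights" and its simulated values are assumptions for illustration; the groups are deliberately given different spreads.

```r
set.seed(2)

# Toy data: same mean delay, very different spread per carrier
flights <- data.frame(
  carrier   = factor(rep(c("AA", "AS", "WN"), each = 100)),
  dep_delay = c(rnorm(100, mean = 5, sd = 1),
                rnorm(100, mean = 5, sd = 5),
                rnorm(100, mean = 5, sd = 15))
)

# Bartlett test; the null hypothesis is that all group variances are equal
bt <- bartlett.test(dep_delay ~ carrier, data = flights)
print(bt)
```

Bartlett's test is itself sensitive to non-normality, which is worth keeping in mind when the normality check has already failed.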

We present the one-way ANOVA table:

Multiple comparisons after analysis of variance

We perform the Tukey test with the command:

The analysis reveals two distinct groups of airlines with similar average delay times: the first group includes WN and AA, while the second group comprises HA, AS, OS, DL, US, F9, B6, OO, UA, and VX. Notably, WN stands out as having the highest average departure delay among the 11 airlines evaluated.
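The ANOVA table and the Tukey comparison can be sketched as below. The data are simulated toy values in which WN is given a higher mean delay; the data frame name "flights" and all numbers are assumptions and do not reproduce the report's results.

```r
set.seed(42)

# Toy data: AA and AS share a mean delay, WN is about 10 minutes higher
flights <- data.frame(
  carrier   = factor(rep(c("AA", "AS", "WN"), each = 30)),
  dep_delay = c(rnorm(30, mean = 0,  sd = 2),
                rnorm(30, mean = 0,  sd = 2),
                rnorm(30, mean = 10, sd = 2))
)

# One-way ANOVA: does the mean delay differ between carriers?
fit <- aov(dep_delay ~ carrier, data = flights)
anova_table <- summary(fit)[[1]]
print(anova_table)

# Tukey's honest significant difference test for all pairs of carriers
tukey <- TukeyHSD(fit, conf.level = 0.95)
print(tukey)

# Adjusted p-value for one pair; rows are named "<level>-<level>"
p_wn_aa <- tukey$carrier["WN-AA", "p adj"]
```

Pairs whose adjusted p-value is above 0.05 cannot be distinguished and fall into the same group, which is how groupings like the one described above are read off the Tukey output.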

Generalized linear model

In the model, the dependent variable is “arr_delay”, while the independent variables are “carrier”, “origin”, “dest”, “dep_delay” and “distance”. These are the factors that affect the arrival delay. The regression model:
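A model with these variables might be fitted with "lm" as in the sketch below. The data are simulated, and the carrier, airport codes and effect sizes are assumptions chosen only so the example runs; the report's fitted coefficients are not reproduced here.

```r
set.seed(3)
n <- 200

# Toy flight data (all values simulated for illustration)
flights <- data.frame(
  carrier   = factor(sample(c("AA", "AS", "WN"), n, replace = TRUE)),
  origin    = factor(sample(c("PDX", "SEA"), n, replace = TRUE)),
  dest      = factor(sample(c("DEN", "LAX", "SFO"), n, replace = TRUE)),
  dep_delay = rnorm(n, mean = 10, sd = 20),
  distance  = runif(n, min = 300, max = 2500)
)
flights$arr_delay <- 2 + 1.0 * flights$dep_delay -
  0.002 * flights$distance + rnorm(n, sd = 5)

# Linear regression of arrival delay on all five predictors
model1 <- lm(arr_delay ~ carrier + origin + dest + dep_delay + distance,
             data = flights)

# Coefficient table: estimate, standard error, t value and p-value per term
coef_table <- coef(summary(model1))
print(coef_table)
```

Factor predictors such as "carrier" are expanded into one dummy coefficient per level except the reference level, which is why a coefficient table lists individual carriers and destinations.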

From the above analysis, we obtain the regression model:

Based on the result of the linear regression model above, we state the hypotheses:

H₀: The coefficients of the variables are not statistically significant.

H₁: The coefficients of the variables are statistically significant.

The p-values for "Carrier_B6", "Carrier_OO", "Carrier_US" and most levels of the "dest" variable exceed the 5% significance level, so there is insufficient evidence to reject the null hypothesis H₀ for them. These coefficients are not statistically significant and will be excluded from the model. The remaining variables have p-values below the threshold.

At the 5% significance level, we reject the null hypothesis H₀ for the remaining variables: their coefficients are statistically significant, so there is no need to exclude them from the model. However, we will construct a second model that excludes the variables "carrier" and "dest".

Based on the result above, we state the hypotheses:

H₀: The 1st model is more effective than the 2nd one.

H₁: The 2nd model is more effective than the 1st one.

When we compare the two models, the observed probability [Pr(>F)] is 2e-16 (
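A comparison of two nested linear models can be sketched with "anova" as below. The data are simulated with a deliberate carrier effect, and all names and values are assumptions. In R's "anova" for nested "lm" fits, the F-test's null hypothesis is that the extra coefficients of the larger model are all zero, so a small Pr(>F) favours the larger model.

```r
set.seed(4)
n <- 200

# Toy flight data with a built-in carrier effect (WN arrives later)
flights <- data.frame(
  carrier   = factor(sample(c("AA", "AS", "WN"), n, replace = TRUE)),
  origin    = factor(sample(c("PDX", "SEA"), n, replace = TRUE)),
  dest      = factor(sample(c("DEN", "LAX", "SFO"), n, replace = TRUE)),
  dep_delay = rnorm(n, mean = 10, sd = 20),
  distance  = runif(n, min = 300, max = 2500)
)
flights$arr_delay <- 2 + flights$dep_delay - 0.002 * flights$distance +
  8 * (flights$carrier == "WN") + rnorm(n, sd = 5)

# Model 1: all predictors; model 2: "carrier" and "dest" dropped
model1 <- lm(arr_delay ~ carrier + origin + dest + dep_delay + distance,
             data = flights)
model2 <- lm(arr_delay ~ origin + dep_delay + distance, data = flights)

# F-test for nested models, smaller model first
cmp <- anova(model2, model1)
print(cmp)

# p-value of the test, stored in the second row of the table
p_value <- cmp[["Pr(>F)"]][2]
```

The "Df" column of the comparison shows how many coefficients the larger model adds, here two for "carrier" and two for "dest".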

Date posted: 07/09/2023, 23:07