VIET NAM NATIONAL UNIVERSITY – HCMC
HCMC UNIVERSITY OF TECHNOLOGY
DEPARTMENT OF CHEMICAL ENGINEERING

Report Assignment
PROBABILITY AND STATISTICS

Lecturer: PhD Nguyễn Tiến Dũng
CC02 – Group 09 – Team members
Ho Chi Minh City, Sunday, 04 December 2022
TABLE OF CONTENTS
I. Topic
II. Theoretical basis
2.1 One-way ANOVA
2.2 Two-way ANOVA
2.3 Prediction model - Multiple Linear Regression
III. Data processing
1. Data import
2. Checking statistical values
3. Data visualization
4. Building a linear regression model
5. Making forecasts for the compressive strength of concrete
REFERENCES
I. Topic
Concrete is the most important material in civil engineering. The concrete compressive strength is a highly nonlinear function of age and ingredients.
The file “concrete.csv” contains information about the compressive strength of concrete as affected by these variables. The data set was taken from the UCI Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength
The data set contains 1030 instances of the compressive strength of concrete and 9 attributes.
Main variables in the dataset:
Cement – quantitative – kg in a m3 mixture – Input Variable
Blast Furnace Slag – quantitative – kg in a m3 mixture – Input Variable
Fly Ash – quantitative – kg in a m3 mixture – Input Variable
Water – quantitative – kg in a m3 mixture – Input Variable
Superplasticizer – quantitative – kg in a m3 mixture – Input Variable
Coarse Aggregate – quantitative – kg in a m3 mixture – Input Variable
Fine Aggregate – quantitative – kg in a m3 mixture – Input Variable
Age – quantitative – days – Input Variable
Concrete compressive strength – quantitative – MPa – Output Variable.
The purpose of our team is to test whether a linear regression model between the concrete compressive strength and the other variables really exists; if it does, to make forecasts based on the data in the file “concrete.csv”, and to use ANOVA to analyze the influence of each variable.
II. Theoretical basis
2.1 One-way ANOVA
One-way ANOVA is a hypothesis test used to test the equality of three or more population means simultaneously by analyzing variances.
For example:
In a laboratory, a team studied whether changes in CO2 concentration affected the germination rate of soybean seeds by gradually increasing the CO2 concentration and recording the height of the bean sprouts after 1 day.
• Statistical problem: comparing the mean heights between the groups of CO2 concentration.
Assumptions for using one-way ANOVA:
• The populations are normally distributed. To test normality, we use the normal probability plot of the residuals (mentioned in the prediction model section).
• The samples are random and independent.
• The populations have equal variances.
An observed dataset can be generalized as a table with a treatments and n observations per treatment.
Model considered: $Y_{ij} = \mu + \tau_i + \epsilon_{ij}$ (i = 1, 2, ..., a; j = 1, 2, ..., n)
• Where: $\mu$ is the overall mean, $\tau_i$ is the i-th treatment effect, and $\epsilon_{ij}$ is the random error component.
Null and alternative hypotheses:
$H_0$: $\tau_1 = \tau_2 = \dots = \tau_a = 0$
$H_1$: $\tau_i \neq 0$ for at least one i
ANOVA table:
Source of variation | Sum of squares (SS) | Degrees of freedom | Mean square (MS)
Treatments | $SS_{treatment}$ | $a-1$ | $MS_{treatment} = SS_{treatment}/(a-1)$
Error | $SS_E$ | $a(n-1)$ | $MS_E = SS_E/(a(n-1))$
Total | $SS_T$ | $an-1$ |
Test statistic: $F_0 = \dfrac{MS_{treatment}}{MS_E} = \dfrac{SS_{treatment}/(a-1)}{SS_E/(a(n-1))}$
• $F_0$ follows a Fisher (F) distribution with $a-1$ and $a(n-1)$ degrees of freedom.
• Given α, $H_0$ is rejected if $f_0 > f_{\alpha,\,a-1,\,a(n-1)}$.
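As an illustration, a minimal R sketch of such a one-way test on the CO2 example above (the heights and group labels here are made-up, not data from the report):

```r
# Hypothetical example: sprout heights (cm) for three CO2 concentration groups
height    <- c(5.1, 4.8, 5.3, 6.0, 6.2, 5.9, 7.1, 6.8, 7.0)
co2_group <- factor(rep(c("low", "medium", "high"), each = 3))

# One-way ANOVA: do the mean heights differ between the CO2 groups?
one_way <- aov(height ~ co2_group)
summary(one_way)   # reports F0 and its p-value
```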
2.2 Two-way ANOVA
Two-way ANOVA is a statistical technique used to examine the effect of two factors on a continuous dependent variable. It also studies the interrelationship between the two independent variables as it influences the values of the dependent one.
For example: In an arithmetic test, several male and female students of different ages participated, and the exam results were recorded. In this case, two-way ANOVA could be used to determine whether gender and age affected the scores.
• Statistical problem: comparing the mean scores according to gender and age.
The assumptions for using two-way ANOVA are similar to those for one-way ANOVA (section 2.1). The table of the dataset for two-way ANOVA, with K levels of factor 1 (rows) and H levels of factor 2 (columns), can be generalized as follows:
Row and column means:
$\bar{X}_{i\cdot} = \dfrac{1}{H}\sum_{j=1}^{H} X_{ij}$, i = 1, 2, ..., K
$\bar{X}_{\cdot j} = \dfrac{1}{K}\sum_{i=1}^{K} X_{ij}$, j = 1, 2, ..., H
Variance analysis factors:
Factor 1 (rows):
$H_0$: no difference in the means of groups i
$H_1$: at least one difference in the means of groups i
Factor 2 (columns):
$H_0$: no difference in the means of groups j
$H_1$: at least one difference in the means of groups j
Reject $H_0$ for factor 1 if $f_1 > f_{\alpha,\,k-1,\,(k-1)(h-1)}$.
Reject $H_0$ for factor 2 if $f_2 > f_{\alpha,\,h-1,\,(k-1)(h-1)}$.
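As with the one-way case, a hedged R sketch of such a two-way analysis (made-up arithmetic scores; the factor levels are illustrative only):

```r
# Hypothetical example: arithmetic scores by gender and age group
score  <- c(72, 80, 65, 78, 85, 70, 75, 82, 68, 74, 88, 66)
gender <- factor(rep(c("male", "female"), each = 6))
age    <- factor(rep(c("10", "11", "12"), times = 4))

# Two-way ANOVA: main effects of gender and age on the score
two_way <- aov(score ~ gender + age)
summary(two_way)   # one F test per factor, as in the decision rules above
```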
2.3 Prediction model - Multiple Linear Regression
Regression analysis is the collection of statistical tools used to model and explore relationships between variables that are related in a non-deterministic manner. Multiple linear regression is a critical technique deployed to study the linearity and dependency between a group of independent variables and a dependent one. The general formula for multiple linear regression can be expressed as:
$Y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \epsilon$
• $\beta_0, \beta_1, \dots, \beta_k$ are regression coefficients. Each parameter represents the change in the mean response, E(y), per unit increase in the associated predictor variable when all the other predictors are held constant.
• $\epsilon$ is called the random error and follows $N(0, \sigma^2)$.
Assumptions of the multiple linear regression model:
• A linear relationship between the dependent and independent variables (can be tested using a scatter diagram). Note that, in some cases, the independent variables are not in compatible formats or do not have a linear relationship; we can use data transformations to make them fit and be better organized.
• The independent variables are not highly correlated with each other.
• The variance of the residuals is constant.
• Independence of the observations.
• Multivariate normality (the residuals are normally distributed).
Predicted values and residuals:
• A predicted value is calculated as $\hat{y}_i = b_0 + b_1 x_1 + \dots + b_k x_k$, where the b values come from statistical software and the x values are specified by us.
• A residual (error) term is calculated as $e_i = y_i - \hat{y}_i$, the difference between an actual and a predicted value of y.
Analysis of variance for testing the significance of regression in multiple regression:
$H_0$: $\beta_1 = \beta_2 = \dots = \beta_k = 0$
$H_1$: $\beta_i \neq 0$ for at least one i
We may also use the coefficient of multiple determination $R^2$ or the adjusted $R^2$ as a global statistic to assess the fit of the model. Computationally, $R^2 = 1 - SS_E/SS_T$ and $R^2_{adj} = 1 - \dfrac{SS_E/(n-p)}{SS_T/(n-1)}$, where n is the number of observations and p is the number of model parameters.
III. Data processing
1. Data import
Import the data from “concrete.csv”.
Figure 1: R code and result showing the first six lines of the data
To facilitate the calculations, as well as to detect unknown values in the Excel file, we convert all the variables to numeric format; any unknown values are then converted to NA.
Figure 2: R code used to convert variables to numeric format
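The R code itself appears only as figures in the original report; a minimal sketch of what these two steps could look like (assuming “concrete.csv” is in the working directory and uses the column names listed in section 3):

```r
# Read the data set and inspect the first six rows
concrete <- read.csv("concrete.csv")
head(concrete)

# Convert every column to numeric; entries that cannot be parsed become NA
concrete[] <- lapply(concrete, function(col) as.numeric(as.character(col)))
str(concrete)
```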
2. Checking statistical values
Check the statistical values for all variables in concrete.
Figure 3: R code and statistical values of all variables
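A sketch of the corresponding R call (using the concrete data frame from the previous step):

```r
# Descriptive statistics (min, quartiles, median, mean, max) for every variable
summary(concrete)

# Count any missing values introduced by the numeric conversion
colSums(is.na(concrete))
```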
As the summary above shows, the data cover the compressive strength of concrete at ages of up to about one year. By varying the amounts of the components that make up the concrete, we aim to find the specific mass of each component that creates a block with both economic value and long-term value for users and producers.
3. Data visualization
Create a new data frame named data (containing the same variables as concrete) and convert the variables cement, slag, flyash, water, superplasticizer, coarseaggregate, fineaggregate, age, and csMPa to log(cement+1), log(slag+1), log(flyash+1), log(water+1), log(superplasticizer+1), log(coarseaggregate+1), log(fineaggregate+1), log(age+1), and log(csMPa+1), respectively.
Figure 4: R code and results when converting the variables to log(x+1)
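One possible R sketch of this transformation (log1p(x) computes log(x+1); the column names are those listed above):

```r
# Copy the original data and replace each variable by log(x + 1)
data <- concrete
vars <- c("cement", "slag", "flyash", "water", "superplasticizer",
          "coarseaggregate", "fineaggregate", "age", "csMPa")
data[vars] <- lapply(concrete[vars], log1p)   # log1p(x) = log(x + 1)
head(data)
```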
Explanation of the reasons for converting to log(x+1):
● Improving the fit of the model: when we build the regression model, we assume that the regression errors (residuals) have a normal distribution. When the residuals are not normally distributed, taking the log of a variable helps to rescale it and bring its distribution closer to normal. In addition, when non-constant residual variance (heteroscedasticity) is caused by the independent variables, we can also convert those variables to logs.
● Interpretation: the log form lets us interpret the relationship between two variables more conveniently. If we take the log of the dependent variable Y and of an independent variable X, then the regression coefficient β becomes an elasticity coefficient, and the interpretation is as follows: a 1% increase in X is expected to lead to an approximate β% increase in Y (in terms of Y’s mean).
● Also, we convert to the form log(x+1) instead of log(x) because some of the variables contain zero values.
Calculate descriptive statistics for the variables cement, slag, flyash, water, superplasticizer, coarseaggregate, fineaggregate, age, and csMPa after conversion to log(x+1).
Figure 6: R code and statistical values of all variables in data
In order to build a linear regression model between the variables, we need to draw scatter plots of each variable and examine them, to test whether our assumption that such a model exists is reasonable.
First, draw a histogram showing the distribution of the csMPa variable before and after converting to log(x+1).
Figure 7: R code and result when plotting the histogram showing the distribution of the csMPa variable
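A hedged sketch of the plotting code in base R graphics (drawing the two histograms side by side):

```r
# Histograms of csMPa before and after the log(x + 1) transformation
par(mfrow = c(1, 2))
hist(concrete$csMPa, main = "csMPa", xlab = "csMPa (MPa)", col = "lightblue")
hist(data$csMPa, main = "log(csMPa + 1)", xlab = "log(csMPa + 1)", col = "lightgreen")
par(mfrow = c(1, 1))
```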
In this figure, we can see that the distributions of the csMPa variable before and after converting to log(x+1) form are both relatively similar to a normal distribution. We will continue by drawing scatter plots of each variable to further examine our linear regression model.
Draw a scatter plot to display how the csMPa variable is distributed in relation to the cement variable, both before and after the log(x+1) transformation.
Figure 8: R code and results when plotting the scatter plot showing the distribution of the csMPa variable according to cement, before and after the transformation to log(x+1) form
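A minimal sketch of how such a pair of scatter plots could be produced in base R:

```r
# Scatter plots of csMPa against cement, before and after the log(x + 1) transformation
par(mfrow = c(1, 2))
plot(concrete$cement, concrete$csMPa,
     xlab = "cement (kg/m3)", ylab = "csMPa (MPa)", main = "Original scale")
plot(data$cement, data$csMPa,
     xlab = "log(cement + 1)", ylab = "log(csMPa + 1)", main = "log(x + 1) scale")
par(mfrow = c(1, 1))
```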
We see that in the original form it is very difficult to see the linearity (specifically, the covariance) of the two variables cement and csMPa, while after converting to log(x+1) form it is quite easy to see the linearity between the two variables, although it is still somewhat uncertain. We will check the remaining variables.
Draw scatter plots to display how the csMPa variable is distributed in relation to the other variables, both before and after the log(x+1) transformation.
Figure 9: R code and results when plotting the scatter plot showing the distribution of the csMPa variable according to slag, before and after the transformation to log(x+1) form
Figure 10: R code and results when plotting the scatter plot showing the distribution of the csMPa variable according to flyash, before and after the transformation to log(x+1) form
Figure 11: R code and results when plotting the scatter plot showing the distribution of the csMPa variable according to water, before and after the transformation to log(x+1) form
Figure 12: R code and results when plotting the scatter plot showing the distribution of the csMPa variable according to superplasticizer, before and after the transformation to log(x+1) form
Figure 13: R code and results when plotting the scatter plot showing the distribution of the csMPa variable according to coarseaggregate, before and after the transformation to log(x+1) form
Figure 14: R code and results when plotting the scatter plot showing the distribution of the csMPa variable according to fineaggregate, before and after the transformation to log(x+1) form
Figure 15: R code and results when plotting the scatter plot and the boxplot showing the distribution of the csMPa variable according to age, before and after the transformation to log(x+1) form
In summary, the graphs above suggest that a linear regression model likely exists, although the linearity does not seem to hold for all variables. We are going to build a linear regression model with all the variables and will check whether our model is really appropriate. In addition, it is clear that the log(x+1) conversion has given us a clearer view of the graphs as well as of the linearity of the variables.
4. Building a linear regression model
Consider a linear regression model (lrm1) including:
Dependent variable: csMPa
Independent variables: cement, slag, flyash, water, superplasticizer, coarseaggregate, fineaggregate, age
Figure 16: R code and results when building the linear regression model lrm1
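A sketch of how lrm1 could be fitted on the log-transformed data frame data (the exact call used in the report’s figure may differ):

```r
# Full model: all eight transformed predictors of log(csMPa + 1)
lrm1 <- lm(csMPa ~ cement + slag + flyash + water + superplasticizer +
             coarseaggregate + fineaggregate + age, data = data)
summary(lrm1)   # coefficients, p-values, R-squared
```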
It is clear that the R-squared is rather high (0.7961), demonstrating a fairly large possibility that a linear regression model exists between the variables. The coarseaggregate variable has a large p-value (0.62385), so its regression coefficient may be zero. We therefore build a second linear regression model consisting of the same variables as lrm1 but excluding coarseaggregate from the independent variables.
Figure 17: R code and results when building the linear regression model lrm2
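A sketch of the reduced model, dropping coarseaggregate from lrm1:

```r
# Reduced model: same predictors as lrm1 without coarseaggregate
lrm2 <- update(lrm1, . ~ . - coarseaggregate)
summary(lrm2)
```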
Check the regression coefficients:
The null hypothesis H0: the regression coefficient is not statistically significant (βi = 0).
The alternative hypothesis H1: the regression coefficient is statistically significant (βi ≠ 0).
We can see that the p-values corresponding to all variables are less than the significance level 0.05, which indicates that the effects of these variables are significant. In other words, we can reject H0.
It is obvious that when we remove the coarseaggregate variable, the p-values of all the regression coefficients are small, below the significance level α = 0.05. In addition, four variables have p-values smaller than 2e−16, which shows that these variables have a great influence on the regression model we have built.
To compare the two regression models that we have built, we use the ANOVA method.
Figure 18: R code and results when comparing the two linear regression models lrm1 and lrm2 with ANOVA
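A sketch of that comparison, applying R’s anova() to the two nested models (a partial F-test):

```r
# Does dropping coarseaggregate significantly worsen the fit?
anova(lrm2, lrm1)
```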
Assumptions of linear regression analysis:
All of the above analysis is based on several important assumptions, as follows:
● All independent variables are fixed variables.
● εi is distributed according to the normal distribution.
● εi has a mean value of 0.
● εi has a constant variance for all independent variables.
● Linearity of the data.
● The errors ε1, ..., εn are independent of each other.
Check the model’s assumptions:
We perform a residual analysis to test the model’s assumptions:
Figure 19: R code and results when plotting the residual analysis to test the model’s assumptions
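In base R the four standard diagnostic plots are obtained by plotting the fitted model, for example:

```r
# Residuals vs Fitted, Normal Q-Q, Scale-Location and Residuals vs Leverage
par(mfrow = c(2, 2))
plot(lrm2)
par(mfrow = c(1, 1))
```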
● The Normal Q-Q plot shows that the standardized residuals lie close to the values on the reference (calibration) line, and thus we can accept that εi is distributed according to the normal distribution.
● The Scale-Location diagram plots the square root of the standardized residuals against the fitted values. This graph shows that the assumption of homogeneity of variance is relatively well satisfied.
● The Residuals vs Leverage plot allows us to identify points of high influence. This graph indicates that observations 689, 226, and 225 could be points of high influence in the data set.
5. Making forecasts for the compressive strength of concrete
We use the regression model lrm2 to make forecasts for the dataset data.
Figure 20: R code and results when making forecasts for the compressive strength of concrete
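A sketch of the forecasting step, predicting on the same log-transformed data and converting the predictions back to MPa (expm1() undoes log(x+1)):

```r
# Predicted log(csMPa + 1) from the reduced model
pred_log <- predict(lrm2, newdata = data)

# Back-transform to MPa and compare with the observed values
comparison <- data.frame(observed  = concrete$csMPa,
                         predicted = expm1(pred_log))
head(comparison)
```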
Comment: Based on the prediction results, we find that the predicted values of the compressive strength csMPa do not deviate too much from the observed values. The linear regression model is relatively good.