VIET NAM NATIONAL UNIVERSITY – HCMC
HCMC UNIVERSITY OF TECHNOLOGY
DEPARTMENT OF CHEMICAL ENGINEERING

Report Assignment
PROBABILITY AND STATISTICS

Lecturer: PhD Nguyễn Tiến Dũng
CC02 – Group 09 – Team members
Ho Chi Minh City, Sunday, 04 December 2022
TABLE OF CONTENTS
I. Topic
II. Theoretical basis
2.1 One-way ANOVA
2.2 Two-way ANOVA
2.3 Prediction model - Multiple Linear Regression
III. Data processing
1. Data import
2. Checking statistical values
3. Data visualization
4. Building a linear regression model
5. Making forecasts for the compressive strength of concrete
REFERENCES
I. Topic
Concrete is the most important material in civil engineering. The concrete compressive strength is a highly nonlinear function of age and ingredients.
The file “concrete.csv” contains information about the compressive strength of concrete as affected by these variables. The data set was taken from the UCI Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength
The data set contains 1030 instances of the compressive strength of concrete and 9 attributes.
Main variables in the dataset:
Cement – quantitative – kg in a m3 mixture – Input Variable
Blast Furnace Slag – quantitative – kg in a m3 mixture – Input Variable
Fly Ash – quantitative – kg in a m3 mixture – Input Variable
Water – quantitative – kg in a m3 mixture – Input Variable
Superplasticizer – quantitative – kg in a m3 mixture – Input Variable
Coarse Aggregate – quantitative – kg in a m3 mixture – Input Variable
Fine Aggregate – quantitative – kg in a m3 mixture – Input Variable
Age – quantitative – days – Input Variable
Concrete compressive strength – quantitative – MPa – Output Variable.
The purpose of our team is to test whether a linear regression model between the concrete compressive strength and the other variables really exists; if it does, to make forecasts based on the data in the file “concrete.csv”, and to use ANOVA to analyze the influence of each variable.
II. Theoretical basis
2.1 One-way ANOVA
One-way ANOVA is a hypothesis test used to test the equality of three or more population means simultaneously by analyzing variances.
For example:
In a laboratory, a team studied whether changes in CO2 concentration affected the germination rate of soybean seeds by gradually increasing the CO2 concentration and recording the height of the bean sprouts after 1 day.
• Statistical problem: comparing the mean heights between the groups of CO2 concentration.
Assumptions for using one-way ANOVA:
• The populations are normally distributed. To test normality, we use the normal probability plot of the residuals (mentioned in the prediction model section).
• The samples are random and independent.
• The populations have equal variances.
An observed dataset can be generalized as a table with a treatments and n observations per treatment.
Model considered: $Y_{ij} = \mu + \tau_i + \epsilon_{ij}$ (i = 1, 2, ..., a; j = 1, 2, ..., n)
• Where: $\mu$ is the overall mean, $\tau_i$ is the i-th treatment effect, and $\epsilon_{ij}$ is the random error component.
Null and alternative hypotheses:
$H_0$: $\tau_1 = \tau_2 = \dots = \tau_a = 0$
$H_1$: $\tau_i \neq 0$ for at least one i
ANOVA table:
Source of variation | Sum of squares (SS) | Degrees of freedom | Mean square (MS)
Treatments | $SS_{treatment}$ | $a-1$ | $MS_{treatment} = SS_{treatment}/(a-1)$
Error | $SS_E$ | $a(n-1)$ | $MS_E = SS_E/(a(n-1))$
Total | $SS_T$ | $an-1$ |
Test statistic: $F_0 = \dfrac{MS_{treatment}}{MS_E} = \dfrac{SS_{treatment}/(a-1)}{SS_E/(a(n-1))}$
• $F_0$ follows a Fisher (F) distribution with $a-1$ and $a(n-1)$ degrees of freedom.
• Given α, $H_0$ is rejected if $f_0 > f_{\alpha,\,a-1,\,a(n-1)}$.
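As an illustration, a minimal R sketch of such a one-way test on the CO2 example above (the heights and group labels here are made-up, not data from the report):

```r
# Hypothetical example: sprout heights (cm) for three CO2 concentration groups
height    <- c(5.1, 4.8, 5.3, 6.0, 6.2, 5.9, 7.1, 6.8, 7.0)
co2_group <- factor(rep(c("low", "medium", "high"), each = 3))

# One-way ANOVA: do the mean heights differ between the CO2 groups?
one_way <- aov(height ~ co2_group)
summary(one_way)   # reports F0 and its p-value
```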
2.2 Two-way ANOVA
Two-way ANOVA is a statistical technique used to examine the effect of two factors on a continuous dependent variable. It also studies the interrelationship between the two independent variables as it influences the values of the dependent one.
For example: In an arithmetic test, several male and female students of different ages participated, and the exam results were recorded. In this case, two-way ANOVA could be used to determine whether gender and age affected the scores.
• Statistical problem: comparing the mean scores according to gender and age.
The assumptions for using two-way ANOVA are similar to those for one-way ANOVA (section 2.1). The table of the dataset for two-way ANOVA, with K levels of factor 1 (rows) and H levels of factor 2 (columns), can be generalized as follows:
Row and column means:
$\bar{X}_{i\cdot} = \dfrac{1}{H}\sum_{j=1}^{H} X_{ij}$, i = 1, 2, ..., K
$\bar{X}_{\cdot j} = \dfrac{1}{K}\sum_{i=1}^{K} X_{ij}$, j = 1, 2, ..., H
Variance analysis factors:
Factor 1 (rows):
$H_0$: no difference in the means of groups i
$H_1$: at least one difference in the means of groups i
Factor 2 (columns):
$H_0$: no difference in the means of groups j
$H_1$: at least one difference in the means of groups j
Reject $H_0$ for factor 1 if $f_1 > f_{\alpha,\,k-1,\,(k-1)(h-1)}$.
Reject $H_0$ for factor 2 if $f_2 > f_{\alpha,\,h-1,\,(k-1)(h-1)}$.
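As with the one-way case, a hedged R sketch of such a two-way analysis (made-up arithmetic scores; the factor levels are illustrative only):

```r
# Hypothetical example: arithmetic scores by gender and age group
score  <- c(72, 80, 65, 78, 85, 70, 75, 82, 68, 74, 88, 66)
gender <- factor(rep(c("male", "female"), each = 6))
age    <- factor(rep(c("10", "11", "12"), times = 4))

# Two-way ANOVA: main effects of gender and age on the score
two_way <- aov(score ~ gender + age)
summary(two_way)   # one F test per factor, as in the decision rules above
```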
2.3 Prediction model - Multiple Linear Regression
Regression analysis is the collection of statistical tools used to model and explore relationships between variables that are related in a non-deterministic manner. Multiple linear regression is a critical technique deployed to study the linearity and dependency between a group of independent variables and a dependent one. The general formula for multiple linear regression can be expressed as:
$Y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \epsilon$
• $\beta_0, \beta_1, \dots, \beta_k$ are regression coefficients. Each parameter represents the change in the mean response, E(y), per unit increase in the associated predictor variable when all the other predictors are held constant.
• $\epsilon$ is called the random error and follows $N(0, \sigma^2)$.
Assumptions of the multiple linear regression model:
• A linear relationship between the dependent and independent variables (can be tested using a scatter diagram). Note that, in some cases, the independent variables are not in compatible formats or do not have a linear relationship; we can use data transformations to make them fit and be better organized.
• The independent variables are not highly correlated with each other.
• The variance of the residuals is constant.
• Independence of the observations.
• Multivariate normality (the residuals are normally distributed).
Predicted values and residuals:
• A predicted value is calculated as $\hat{y}_i = b_0 + b_1 x_1 + \dots + b_k x_k$, where the b values come from statistical software and the x values are specified by us.
• A residual (error) term is calculated as $e_i = y_i - \hat{y}_i$, the difference between an actual and a predicted value of y.
Analysis of variance for testing the significance of regression in multiple regression:
$H_0$: $\beta_1 = \beta_2 = \dots = \beta_k = 0$
$H_1$: $\beta_i \neq 0$ for at least one i
We may also use the coefficient of multiple determination $R^2$ or the adjusted $R^2$ as a global statistic to assess the fit of the model. Computationally, $R^2 = 1 - SS_E/SS_T$ and $R^2_{adj} = 1 - \dfrac{SS_E/(n-p)}{SS_T/(n-1)}$, where n is the number of observations and p is the number of model parameters.
III. Data processing
1. Data import
Import the data from “concrete.csv”.
Figure 1: R code and result showing the first six lines of the data
To facilitate the calculations, as well as to detect unknown values in the Excel file, we convert all the variables to numeric format; any unknown values are then converted to NA.
Figure 2: R code used to convert variables to numeric format
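The R code itself appears only as figures in the original report; a minimal sketch of what these two steps could look like (assuming “concrete.csv” is in the working directory and uses the column names listed in section 3):

```r
# Read the data set and inspect the first six rows
concrete <- read.csv("concrete.csv")
head(concrete)

# Convert every column to numeric; entries that cannot be parsed become NA
concrete[] <- lapply(concrete, function(col) as.numeric(as.character(col)))
str(concrete)
```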
2. Checking statistical values
Check the statistical values for all variables in concrete.
Figure 3: R code and statistical values of all variables
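A sketch of the corresponding R call (using the concrete data frame from the previous step):

```r
# Descriptive statistics (min, quartiles, median, mean, max) for every variable
summary(concrete)

# Count any missing values introduced by the numeric conversion
colSums(is.na(concrete))
```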
As the summary above shows, the data cover the compressive strength of concrete at ages of up to about one year. By varying the amounts of the components that make up the concrete, we aim to find the specific mass of each component that creates a block with both economic value and long-term value for users and producers.
3. Data visualization
Create a new data frame named data (containing the same variables as concrete) and convert the variables cement, slag, flyash, water, superplasticizer, coarseaggregate, fineaggregate, age, and csMPa to log(cement+1), log(slag+1), log(flyash+1), log(water+1), log(superplasticizer+1), log(coarseaggregate+1), log(fineaggregate+1), log(age+1), and log(csMPa+1), respectively.
Figure 4: R code and results when converting the variables to log(x+1)
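One possible R sketch of this transformation (log1p(x) computes log(x+1); the column names are those listed above):

```r
# Copy the original data and replace each variable by log(x + 1)
data <- concrete
vars <- c("cement", "slag", "flyash", "water", "superplasticizer",
          "coarseaggregate", "fineaggregate", "age", "csMPa")
data[vars] <- lapply(concrete[vars], log1p)   # log1p(x) = log(x + 1)
head(data)
```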
Explanation of the reasons for converting to log(x+1):
● Improving the fit of the model: when we build the regression model, we assume that the regression errors (residuals) have a normal distribution. When the residuals are not normally distributed, taking the log of a variable helps to rescale it and bring its distribution closer to normal. In addition, when non-constant residual variance (heteroscedasticity) is caused by the independent variables, we can also convert those variables to logs.
● Interpretation: the log form lets us interpret the relationship between two variables more conveniently. If we take the log of the dependent variable Y and of an independent variable X, then the regression coefficient β becomes an elasticity coefficient, and the interpretation is as follows: a 1% increase in X is expected to lead to an approximate β% increase in Y (in terms of Y’s mean).
● Also, we convert to the form log(x+1) instead of log(x) because some of the variables contain zero values.
Calculate descriptive statistics for the variables cement, slag, flyash, water, superplasticizer, coarseaggregate, fineaggregate, age, and csMPa after conversion to log(x+1).
Figure 6: R code and statistical values of all variables in data
In order to build a linear regression model between the variables, we need to draw scatter plots of each variable and examine them, to test whether our assumption that such a model exists is reasonable.
First, draw a histogram showing the distribution of the csMPa variable before and after converting to log(x+1).
Figure 7: R code and result when plotting the histogram showing the distribution of the csMPa variable
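A hedged sketch of the plotting code in base R graphics (drawing the two histograms side by side):

```r
# Histograms of csMPa before and after the log(x + 1) transformation
par(mfrow = c(1, 2))
hist(concrete$csMPa, main = "csMPa", xlab = "csMPa (MPa)", col = "lightblue")
hist(data$csMPa, main = "log(csMPa + 1)", xlab = "log(csMPa + 1)", col = "lightgreen")
par(mfrow = c(1, 1))
```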
In this figure, we can see that the distributions of the csMPa variable before and after converting to log(x+1) form are both relatively similar to a normal distribution. We will continue by drawing scatter plots of each variable to further examine our linear regression model.
Draw a scatter plot to display how the csMPa variable is distributed in relation to the cement variable, both before and after the log(x+1) transformation.
Figure 8: R code and results when plotting the scatter plot showing the distribution of the csMPa variable according to cement, before and after the transformation to log(x+1) form
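A minimal sketch of how such a pair of scatter plots could be produced in base R:

```r
# Scatter plots of csMPa against cement, before and after the log(x + 1) transformation
par(mfrow = c(1, 2))
plot(concrete$cement, concrete$csMPa,
     xlab = "cement (kg/m3)", ylab = "csMPa (MPa)", main = "Original scale")
plot(data$cement, data$csMPa,
     xlab = "log(cement + 1)", ylab = "log(csMPa + 1)", main = "log(x + 1) scale")
par(mfrow = c(1, 1))
```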
We see that in the original form it is very difficult to see the linearity (specifically, the covariance) of the two variables cement and csMPa, while after converting to log(x+1) form it is quite easy to see the linearity between the two variables, although it is still somewhat uncertain. We will check the remaining variables.
Draw scatter plots to display how the csMPa variable is distributed in relation to the other variables, both before and after the log(x+1) transformation.
Figure 9: R code and results when plotting the scatter plot showing the distribution of the csMPa variable according to slag, before and after the transformation to log(x+1) form
Figure 10: R code and results when plotting the scatter plot showing the distribution of the csMPa variable according to flyash, before and after the transformation to log(x+1) form
Figure 11: R code and results when plotting the scatter plot showing the distribution of the csMPa variable according to water, before and after the transformation to log(x+1) form
Figure 12: R code and results when plotting the scatter plot showing the distribution of the csMPa variable according to superplasticizer, before and after the transformation to log(x+1) form
Figure 13: R code and results when plotting the scatter plot showing the distribution of the csMPa variable according to coarseaggregate, before and after the transformation to log(x+1) form
Figure 14: R code and results when plotting the scatter plot showing the distribution of the csMPa variable according to fineaggregate, before and after the transformation to log(x+1) form
Figure 15: R code and results when plotting the scatter plot and the boxplot showing the distribution of the csMPa variable according to age, before and after the transformation to log(x+1) form
In summary, the graphs above suggest that a linear regression model likely exists, although the linearity does not seem to hold for all variables. We are going to build a linear regression model with all the variables and will check whether our model is really appropriate. In addition, it is clear that the log(x+1) conversion has given us a clearer view of the graphs as well as of the linearity of the variables.
4. Building a linear regression model
Consider a linear regression model (lrm1) including:
Dependent variable: csMPa
Independent variables: cement, slag, flyash, water, superplasticizer, coarseaggregate, fineaggregate, age
Figure 16: R code and results when building the linear regression model lrm1
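A sketch of how lrm1 could be fitted on the log-transformed data frame data (the exact call used in the report’s figure may differ):

```r
# Full model: all eight transformed predictors of log(csMPa + 1)
lrm1 <- lm(csMPa ~ cement + slag + flyash + water + superplasticizer +
             coarseaggregate + fineaggregate + age, data = data)
summary(lrm1)   # coefficients, p-values, R-squared
```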
It is clear that the R-squared is rather high (0.7961), demonstrating a fairly large possibility that a linear regression model exists between the variables. The coarseaggregate variable has a large p-value (0.62385), so its regression coefficient may be zero. We therefore build a second linear regression model consisting of the same variables as lrm1 but excluding coarseaggregate from the independent variables.
Figure 17: R code and results when building the linear regression model lrm2
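A sketch of the reduced model, dropping coarseaggregate from lrm1:

```r
# Reduced model: same predictors as lrm1 without coarseaggregate
lrm2 <- update(lrm1, . ~ . - coarseaggregate)
summary(lrm2)
```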
Check the regression coefficients:
The null hypothesis H0: the regression coefficient is not statistically significant (βi = 0).
The alternative hypothesis H1: the regression coefficient is statistically significant (βi ≠ 0).
We can see that the p-values corresponding to all variables are less than the significance level 0.05, which indicates that the effects of these variables are significant. In other words, we can reject H0.
It is obvious that when we remove the coarseaggregate variable, the p-values of all the regression coefficients are small, below the significance level α = 0.05. In addition, four variables have p-values smaller than 2e−16, which shows that these variables have a great influence on the regression model we have built.
To compare the two regression models that we have built, we use the ANOVA method.
Figure 18: R code and results when comparing the two linear regression models lrm1 and lrm2 with ANOVA
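A sketch of that comparison, applying R’s anova() to the two nested models (a partial F-test):

```r
# Does dropping coarseaggregate significantly worsen the fit?
anova(lrm2, lrm1)
```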
Assumptions of linear regression analysis:
All of the above analysis is based on several important assumptions, as follows:
● All independent variables are fixed variables.
● εi is distributed according to the normal distribution.
● εi has a mean value of 0.
● εi has a constant variance for all independent variables.
● Linearity of the data.
● The errors ε1, ..., εn are independent of each other.
Check the model’s assumptions:
We perform a residual analysis to test the model’s assumptions:
Figure 19: R code and results when plotting the residual analysis to test the model’s assumptions
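In base R the four standard diagnostic plots are obtained by plotting the fitted model, for example:

```r
# Residuals vs Fitted, Normal Q-Q, Scale-Location and Residuals vs Leverage
par(mfrow = c(2, 2))
plot(lrm2)
par(mfrow = c(1, 1))
```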
● The Normal Q-Q plot shows that the standardized residuals lie close to the values on the reference (calibration) line, and thus we can accept that εi is distributed according to the normal distribution.
● The Scale-Location diagram plots the square root of the standardized residuals against the fitted values. This graph shows that the assumption of homogeneity of variance is relatively well satisfied.
● The Residuals vs Leverage plot allows us to identify points of high influence. This graph indicates that observations 689, 226, and 225 could be points of high influence in the data set.
5. Making forecasts for the compressive strength of concrete
We use the regression model lrm2 to make forecasts for the dataset data.
Figure 20: R code and results when making forecasts for the compressive strength of concrete
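A sketch of the forecasting step, predicting on the same log-transformed data and converting the predictions back to MPa (expm1() undoes log(x+1)):

```r
# Predicted log(csMPa + 1) from the reduced model
pred_log <- predict(lrm2, newdata = data)

# Back-transform to MPa and compare with the observed values
comparison <- data.frame(observed  = concrete$csMPa,
                         predicted = expm1(pred_log))
head(comparison)
```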
Comment: Based on the prediction results, we find that the predicted values of the compressive strength csMPa do not deviate too much from the observed values. The linear regression model is relatively good.