HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY Faculty of Computer Science and Engineering
BK TP.HCM
PROBABILITY AND STATISTICS (MT 2013)
Le Duong Khanh Huy - 2153380 Phan Le Khanh Trinh - 2151268 Class: CC01 - Group: 7
HO CHI MINH CITY, April 2023
Contents
2.2 Multivariate Linear Regression (MLR)
4.2.2 Fitting the multivariate linear regression model
4.2.3 Stepwise Regression
4.2.4 Linear equation
Assignment for PROBABILITY AND STATISTICS - Semester 222 - 2023 Page 1/21
This assignment presents an analysis of computer system performance, focusing on the relationship between machine characteristics and relative performance metrics. The dataset includes the following attributes:
1 Vendor Name: many unique symbols
2 Model Name: many unique symbols
3 MYCT: machine cycle time in nanoseconds (integer)
4 MMIN: minimum main memory in kilobytes (integer)
5 MMAX: maximum main memory in kilobytes (integer)
6 CACH: cache memory in kilobytes (integer)
7 CHMIN: minimum channels in units (integer)
8 CHMAX: maximum channels in units (integer)
9 PRP: published relative performance (integer)
10 ERP: estimated relative performance from the original article (integer)
Listing 1: Data set factors
In this assignment, we use R and RStudio exclusively as our tools and working environment for analyzing the data.
* vendor_name model_name MYCT MMIN MMAX CACH CHMIN CHMAX PRP ERP
1 adviser 32/60 125 256 6000 256 16 128 198 199
2 amdahl 470v/7 29 8000 32000 32 8 32 269 253
3 amdahl 470v/7a 29 8000 32000 32 8 32 220 253
4 amdahl 470v/7b 29 8000 32000 32 8 32 172 253
Figure 1: Piece of Labeled Data Frame
Key facts about this data set:
population : relative performance of CPUs
sample : relative performance of 209 CPU models, as measured by the dataset creator
numerical variables : MYCT, MMIN, MMAX, CACH, CHMIN, CHMAX, PRP, ERP
categorical variables : vendor_name, model_name
In this data set, we focus on two variables, PRP and ERP. At first glance, many of the variables recorded for each model seem to affect ERP and PRP (the performance measures), so regression models are our chosen approach to the data set in this assignment.
2 Background
2.1 Exploratory data analysis (EDA)
Exploratory data analysis (EDA) is a method for drawing insights from data, often utilizing data visualization and statistical graphics to reveal relationships between variables, identify patterns and trends, and detect outliers. EDA is crucial for extracting important features for predictive models. By plotting the raw data, we can gain an understanding of the general behavior and distribution of the variables:
• Histograms are used to visualize the distribution of a numerical variable.
• Box plots are used to display the distribution of numerical data values, particularly when comparing them across several groups.
• Pair plots are used to find the features that best describe the relationship between two variables or that form the most distinct clusters.
• Correlation matrix: a table that displays the correlation coefficients between variables. Each cell of the table shows the association between two variables.
2.2 Multivariate Linear Regression (MLR)
A statistical analysis does not always distinguish a response variable from explanatory variables. When it does, the multivariate regression method measures the association between a continuous dependent variable and two or more independent variables, under the assumption that this relationship is linear. After fitting the model to a data set, we use it to predict the behavior of the response variable from its predictors.
There are 3 assumptions that need to be met when performing MLR:
1 Normality: The residuals (the differences between the observed values and the predicted values) follow a normal distribution.
2 No multicollinearity: The independent variables in the regression model are not highly correlated with each other. Multicollinearity occurs when two or more independent variables are strongly correlated, making it difficult to determine their individual effects on the dependent variable.
3 Linearity: The relationship between the independent variables and the dependent variable should be linear. This means that the expected value of the dependent variable changes in a straight line as one independent variable changes, holding the other variables constant.
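The three assumptions above can be checked with base R alone. The following is only a minimal sketch on synthetic data: the variables `x1`, `x2`, `y` and their coefficients are made up for illustration, the Shapiro-Wilk test is one common choice for checking normality, and the variance inflation factor (VIF) is computed by hand as 1 / (1 - R_j^2) rather than with an add-on package.

```r
# Synthetic data (hypothetical, not the CPU data set)
set.seed(7)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)          # mildly correlated with x1
y  <- 1 + 2 * x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

# 1. Normality: Shapiro-Wilk test on the residuals
#    (a large p-value gives no evidence against normality)
sw <- shapiro.test(residuals(fit))
print(sw$p.value)

# 2. No multicollinearity: VIF of x1, from regressing x1 on the other predictor
#    (values close to 1 indicate little collinearity)
r2_x1  <- summary(lm(x1 ~ x2))$r.squared
vif_x1 <- 1 / (1 - r2_x1)
print(vif_x1)

# 3. Linearity: residuals vs fitted values should show no systematic pattern
pdf(NULL)                          # null device; drop this line to see the plot
plot(fitted(fit), residuals(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
invisible(dev.off())
```

With more than two predictors, the same VIF formula applies by regressing each predictor on all the others.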
The general equation of MLR:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \epsilon$$
where:
- $Y$ is the dependent variable
- $X_i$ is the $i$-th independent variable
- $\beta_0$ is the intercept, i.e. the value of $Y$ when all $X_i$ are zero
- $\beta_i$ is the coefficient of $X_i$
- $\epsilon$ is the independent error term of the model
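To make the equation concrete, here is a small sketch that fits an MLR model with `lm()` on synthetic data and then evaluates the equation by hand for one new observation. The true coefficients (2, 3, -1.5), the variable names and the new observation are assumptions chosen for illustration only.

```r
# Synthetic data generated from Y = 2 + 3*X1 - 1.5*X2 + error (hypothetical)
set.seed(123)
n  <- 500
x1 <- runif(n, 0, 10)
x2 <- runif(n, 0, 5)
y  <- 2 + 3 * x1 - 1.5 * x2 + rnorm(n, sd = 0.5)
toy <- data.frame(y, x1, x2)

# Estimate beta_0, beta_1, beta_2 by least squares
fit <- lm(y ~ x1 + x2, data = toy)
b   <- coef(fit)
print(b)

# Predict for a new observation: by hand from the equation, and with predict()
new_obs <- data.frame(x1 = 4, x2 = 2)
manual  <- unname(b["(Intercept)"] + b["x1"] * new_obs$x1 + b["x2"] * new_obs$x2)
auto    <- unname(predict(fit, newdata = new_obs))
print(c(manual = manual, predict = auto))
```

The two predictions agree because `predict()` evaluates exactly the fitted linear equation (without the error term).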
Model performance metrics:
- R-squared (R²): the squared correlation between the observed outcome values and the values predicted by the model. It measures the proportion of the variance in the outcome that is explained by the predictors. The higher the value, the better the model.
- Root mean square error (RMSE): the square root of the mean squared difference between observed and predicted values, i.e. the typical error the model makes when predicting an observation. The lower the RMSE, the better the model.
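Both metrics are short to compute by hand. As a sketch, the snippet below treats the four PRP values from Figure 1 as observed values and the four ERP values as predictions; four rows are far too few for a meaningful assessment, so this only illustrates the formulas.

```r
# PRP (observed) and ERP (predicted) for the four rows shown in Figure 1
actual    <- c(198, 269, 220, 172)
predicted <- c(199, 253, 253, 253)

# R-squared as defined above: squared correlation of observed vs predicted
r_squared <- cor(actual, predicted)^2

# RMSE: square root of the mean squared prediction error
rmse <- sqrt(mean((actual - predicted)^2))

print(c(r_squared = r_squared, rmse = rmse))
```

For a model fitted with `lm()`, the same R-squared is reported by `summary(fit)$r.squared`.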
3 amdahl 470v/7a 29 8000 32000 32 8 32
4 amdahl 470v/7b 29 8000 32000 32 8 32
Figure 2: Initial Data Frame
• Naming data features
As we can see, the columns in this data frame are not labeled. Let's name them according to the list above (Listing 1).
* vendor_name model_name MYCT MMIN MMAX CACH CHMIN CHMAX PRP ERP
1 adviser 32/60 125 256 6000 256 16 128 198 199
2 amdahl 470v/7 29 8000 32000 32 8 32 269 253
3 amdahl 470v/7a 29 8000 32000 32 8 32 220 253
4 amdahl 470v/7b 29 8000 32000 32 8 32 172 253
Figure 3: Labeled Data Frame
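The naming step can be sketched as follows. To stay self-contained, the snippet reads the four rows of Figure 1 from an inline string instead of a file; the comma-separated layout and the object name `cpu` are assumptions, since the report does not show how the file was read.

```r
# The four rows of Figure 1 as headerless, comma-separated text (assumed layout)
raw_text <- "adviser,32/60,125,256,6000,256,16,128,198,199
amdahl,470v/7,29,8000,32000,32,8,32,269,253
amdahl,470v/7a,29,8000,32000,32,8,32,220,253
amdahl,470v/7b,29,8000,32000,32,8,32,172,253"

# header = FALSE: R assigns the default column names V1 ... V10
cpu <- read.csv(text = raw_text, header = FALSE, stringsAsFactors = TRUE)

# Label the columns following Listing 1
names(cpu) <- c("vendor_name", "model_name", "MYCT", "MMIN", "MMAX",
                "CACH", "CHMIN", "CHMAX", "PRP", "ERP")
print(cpu)
```

With the real data, `text = raw_text` would be replaced by the path to the data file.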
• Checking missing values
The data frame now looks fine. However, to make sure there is no problem in this data set, we apply a function to check whether there are any missing values or related problems.
Result
vendor_name model_name MYCT MMIN MMAX CACH
          0          0    0    0    0    0
CHMIN CHMAX PRP ERP
    0     0   0   0
Figure 4: Checking missing value
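A per-column count like the one in Figure 4 can be produced with `colSums(is.na(...))`. This is a sketch on a small stand-in data frame (three rows taken from Figure 1); the report does not state which function was actually used.

```r
# Small stand-in for the full data frame (values from Figure 1)
cpu <- data.frame(
  vendor_name = c("adviser", "amdahl", "amdahl"),
  MYCT        = c(125, 29, 29),
  PRP         = c(198, 269, 220)
)

# Number of missing values in each column
na_counts <- colSums(is.na(cpu))
print(na_counts)

# Single TRUE/FALSE answer for the whole data frame
print(anyNA(cpu))

# A related check: number of fully duplicated rows
print(sum(duplicated(cpu)))
```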
    vendor_name       model_name        MYCT             MMIN
 ibm      : 32   100       :  1   Min.   :  17.0   Min.   :   64
 nas      : 19   1100/61-h1:  1   1st Qu.:  50.0   1st Qu.:  768
 honeywell: 13   1100/81   :  1   Median : 110.0   Median : 2000
 ncr      : 13   1100/82   :  1   Mean   : 203.8   Mean   : 2868
 sperry   : 13   1100/83   :  1   3rd Qu.: 225.0   3rd Qu.: 4000
 siemens  : 12   1100/84   :  1   Max.   :1500.0   Max.   :32000
 (Other)  :107   (Other)   :203
      MMAX            CACH             CHMIN            CHMAX
 Min.   :   64   Min.   :  0.00   Min.   : 0.000   Min.   :  0.00
 1st Qu.: 4000   1st Qu.:  0.00   1st Qu.: 1.000   1st Qu.:  5.00
 Median : 8000   Median :  8.00   Median : 2.000   Median :  8.00
 Mean   :11796   Mean   : 25.21   Mean   : 4.699   Mean   : 18.27
 3rd Qu.:16000   3rd Qu.: 32.00   3rd Qu.: 6.000   3rd Qu.: 24.00
 Max.   :64000   Max.   :256.00   Max.   :52.000   Max.   :176.00
      PRP              ERP
 Min.   :   6.0   Min.   :  15.00
 1st Qu.:  27.0   1st Qu.:  28.00
 Median :  50.0   Median :  45.00
 Mean   : 105.6   Mean   :  99.33
 3rd Qu.: 113.0   3rd Qu.: 101.00
 Max.   :1150.0   Max.   :1238.00
Figure 5: Data overview
Because vendor_name and model_name are categorical variables, the summary above gives only a limited overview of them, so we summarize them separately.
Result
Number of vendors : 30
Most number of models ibm : 32
Least number of model(s): adviser, four-phase, microdata, stratus : 1
Figure 6: Factorial summary
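A summary of this kind can be built from a frequency table of the factor. The sketch below uses a tiny hypothetical vendor vector, not the real 209 rows, so the counts differ from Figure 6.

```r
# Hypothetical toy vendor column (the real data has 30 vendors)
vendor <- factor(c("ibm", "ibm", "ibm", "nas", "nas", "adviser"))

counts    <- table(vendor)                 # number of models per vendor
n_vendors <- nlevels(vendor)               # number of distinct vendors
most      <- counts[counts == max(counts)] # vendor(s) with the most models
least     <- counts[counts == min(counts)] # vendor(s) with the fewest models

cat("Number of vendors :", n_vendors, "\n")
cat("Most number of models :", names(most), ":", max(counts), "\n")
cat("Least number of model(s):", paste(names(least), collapse = ", "),
    ":", min(counts), "\n")
```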
We take a closer look at the distribution of the 8 numerical variables by plotting a histogram and a boxplot for each of them.
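One way to produce a histogram and a boxplot for every numerical column is a simple loop. This sketch uses two synthetic right-skewed columns as stand-ins for the real variables; the distributions and the object name `cpu_num` are assumptions for illustration.

```r
# Synthetic right-skewed stand-ins for two of the numerical variables
set.seed(1)
cpu_num <- data.frame(MYCT = rexp(50, rate = 1 / 200),
                      CACH = rexp(50, rate = 1 / 25))

pdf(NULL)                          # null device; drop this line to see the plots
old_par <- par(mfrow = c(1, 2))    # histogram and boxplot side by side
for (v in names(cpu_num)) {
  hist(cpu_num[[v]], main = paste("Histogram of", v), xlab = v)
  boxplot(cpu_num[[v]], main = paste("Boxplot of", v),
          horizontal = TRUE, xlab = v)
}
par(old_par)
invisible(dev.off())
```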
[Figure: histogram and boxplot of MYCT]
Figure 8: MMIN (histogram and boxplot)
[Figure: histogram and boxplot of CACH]
[Figure: histogram and boxplot of CHMIN]
According to the summary and the plots above:
• vendor_name: The dataset contains several different vendors. The most common vendor is IBM, with 32 machines, followed by NAS with 19 machines. The 'Other' category contains 107 machines from various other vendors.
• model_name: The dataset contains a wide variety of model names. Each of the six names listed in the summary appears exactly once, and the remaining 203 models are grouped under the 'Other' category. This indicates a diverse range of models in the dataset.
• MYCT: Machine cycle time (MYCT) ranges from 17 to 1500 with a median of 110. As the mean (203.8) is larger than the median, this distribution is right-skewed, suggesting there are some machines with particularly high cycle times.
• MMAX: Maximum main memory (MMAX) ranges from 64 to 64000. Like MMIN, the mean (11796) is greater than the median (8000), indicating a right-skewed distribution. Some machines have exceptionally high maximum memory.
• CACH: Cache memory size (CACH) ranges from 0 to 256. The mean (25.21) is larger than the median (8), so this distribution is right-skewed. Some machines have very large cache memory sizes.
• CHMIN and CHMAX: The minimum and maximum numbers of channels (CHMIN, CHMAX) also have right-skewed distributions, with means larger than their medians. This suggests that some machines have an unusually high number of channels.
• PRP and ERP: The published and estimated relative performance measures (PRP, ERP) both have right-skewed distributions, with means larger than the medians. This suggests some machines have particularly high performance.
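The mean-versus-median comparison used in the bullets above can be tabulated directly from the values reported in Figure 5. Note that a mean larger than the median is only a rule-of-thumb indicator of right skew, not a formal test; the column names of `stats` are chosen here for illustration.

```r
# Means and medians as reported in Figure 5
stats <- data.frame(
  variable = c("MYCT", "MMIN", "MMAX", "CACH", "CHMIN", "CHMAX", "PRP", "ERP"),
  mean     = c(203.8, 2868, 11796, 25.21, 4.699, 18.27, 105.6, 99.33),
  median   = c(110, 2000, 8000, 8, 2, 8, 50, 45)
)

# Ratio of mean to median, and the rule-of-thumb skewness indicator
stats$mean_over_median <- round(stats$mean / stats$median, 2)
stats$right_skew_hint  <- stats$mean > stats$median
print(stats)
```

Every variable has a mean above its median, which matches the right-skewed shapes described above.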
Next, we use a correlation plot to give an overview of the relationship between each pair of variables.
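The numbers behind a correlation plot come from a correlation matrix, which base R computes with `cor()`. The sketch below uses synthetic columns that merely imitate the direction of the relationships (the generating coefficients are invented); the plotting package used for the report's figure is not stated, so only the matrix is shown.

```r
# Synthetic stand-ins: MYCT falls and PRP rises as MMAX grows (hypothetical)
set.seed(42)
n    <- 100
MMAX <- runif(n, 1000, 64000)
toy  <- data.frame(
  MMAX = MMAX,
  MYCT = 1500 - 0.02 * MMAX + rnorm(n, sd = 100),
  PRP  = 0.01 * MMAX + rnorm(n, sd = 50)
)

# Pearson correlation between every pair of columns
cor_mat <- round(cor(toy), 2)
print(cor_mat)
```

The matrix is symmetric with ones on the diagonal; a correlation plot is a color-coded rendering of these cells.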
According to the correlation plot above, we can see:
• MYCT: This variable is negatively correlated with all other variables, with the strongest negative correlations with MMIN, MMAX, and CACH. This suggests that as the machine cycle time (MYCT) increases, these other measures tend to decrease.
• MMIN: There is a very strong positive correlation between MMIN (minimum main memory) and ERP (Estimated Relative Performance), MMAX (maximum main memory), and PRP (Published Relative Performance), which suggests that machines with more minimum main memory tend to have higher performance and more maximum memory.
• MMAX: This variable has a strong positive correlation with ERP and PRP, indicating that machines with more maximum memory tend to have better performance. There is also a strong positive correlation with MMIN, suggesting that machines with more minimum memory also tend to have more maximum memory.
• CACH: Cache size (CACH) has strong positive correlations with PRP, ERP, MMIN, and MMAX, suggesting that machines with more cache tend to have more memory and better performance.
• CHMIN: The minimum channels (CHMIN) variable is moderately positively correlated with all other variables except MYCT. The strongest correlations are with ERP, PRP, and MMAX, indicating that machines with more minimum channels tend to have more maximum memory and better performance.
• CHMAX: Maximum channels (CHMAX) is positively correlated with all other variables except MYCT. The strongest correlations are with PRP, ERP, and MMIN, suggesting that machines with more maximum channels tend to have more minimum memory and better performance.
• PRP and ERP: These two measures of performance are extremely strongly correlated with each other, suggesting they are measuring largely the same construct. They are also both strongly positively correlated with MMIN and MMAX, indicating that machines with more memory tend to have better performance.
Many of the variables appear to have strong associations with both PRP and ERP. Going forward, we will investigate techniques for analyzing the connections between these variables and PRP.
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.589e+01  8.045e+00  -6.948 5.00e-11 ***
MYCT         4.885e-02  1.752e-02   2.789   0.0058 **
MMIN         1.529e-02  1.827e-03   8.371 9.42e-15 ***
MMAX         5.571e-03  6.418e-04   8.681 1.32e-15 ***
CACH         6.414e-01  1.396e-01   4.596 7.59e-06 ***
CHMIN       -2.704e-01  8.557e-01  -0.316   0.7524
CHMAX        1.482e+00  2.200e-01   6.737 1.65e-10 ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 59.99 on 202 degrees of freedom
Multiple R-squared: 0.8649, Adjusted R-squared: 0.8609
F-statistic: 215.5 on 6 and 202 DF, p-value: < 2.2e-16
Figure 17: Check for Linearity
In linear regression, the coefficient of determination, often referred to as R-squared (R²), is a statistical measure that represents the proportion of the variance in the dependent variable that can be explained by the independent variables in the model. R-squared ranges from 0 to 1.
In this project, R-squared = 0.8649 (close to 1), which means that a large proportion of the variability in the dependent variable is explained by the independent variables included in the model. We therefore conclude that a linear model describes the relationship between PRP and the other variables well.
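As a consistency check, the adjusted R-squared and the F-statistic in the model summary can be recomputed from the reported R-squared, the sample size (209 models) and the number of predictors (6), using the standard formulas. This is a small worked sketch; only the three input values come from the report, and the variable names are chosen here for illustration.

```r
r2 <- 0.8649   # Multiple R-squared from the model summary
n  <- 209      # number of models in the sample
p  <- 6        # predictors: MYCT, MMIN, MMAX, CACH, CHMIN, CHMAX

# Residual degrees of freedom: n - p - 1
df_resid <- n - p - 1

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# F-statistic for the overall significance of the regression
f_stat <- (r2 / p) / ((1 - r2) / df_resid)

print(c(df_resid = df_resid, adj_r2 = adj_r2, f_stat = f_stat))
```

Up to rounding, these reproduce the 202 degrees of freedom, the adjusted R-squared of 0.8609 and the F-statistic of 215.5 shown in the summary.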