TABLE OF CONTENTS
LIST OF FIGURES
ACKNOWLEDGMENT
1 INTRODUCTION
1.1 Topic introduction and requirements
1.1.1 Subject
1.1.2 R-studio
1.1.3 Our problems
1.2 Theoretical basis
2 DATA CORRECTION
2.1 Import data
2.2 Data cleaning
2.3 Data clarification
2.4 Logistic Regression
2.5 Prediction
3 CODE R
REFERENCES
LIST OF FIGURES
Figure 1: R code and results after reading the data
Figure 2: R code and results when checking for missing data in the file "diabetes"
Figure 3: R code and results when performing descriptive statistics
Figure 4: R code and results when performing frequency statistics for the variable "Outcome"
Figure 5: The result when plotting the histograms of the variables "Pregnancies" and "Glucose"
Figure 6: The result when plotting the histograms of the variables "Blood Pressure" and "Skin Thickness"
Figure 7: The result when plotting the histograms of the variables "Insulin" and "BMI"
Figure 8: The result when plotting the histograms of the variables "Diabetes Pedigree Function" and "Age"
Figure 9: R code
Figure 10: Histogram showing the distribution of the number of pregnancies for people with and without diabetes
Figure 11: R code
Figure 12: Histogram showing the distribution of skin thickness for people with and without diabetes
Figure 13: R code
Figure 14: Histogram showing the distribution of glucose level for people with and without diabetes
Figure 15: R code
Figure 16: Histogram showing the distribution of blood pressure for people with and without diabetes
Figure 17: R code
Figure 18: Histogram showing the distribution of insulin level for people with and without diabetes
Figure 19: R code
Figure 20: Histogram showing the distribution of BMI (body mass index) for people with and without diabetes
Figure 21: R code
Figure 22: Histogram showing the distribution of the diabetes pedigree function for people with and without diabetes
Figure 23: R code
Figure 24: Histogram showing the distribution of age for people with and without diabetes
Figure 25: R code and results
Figure 26: R code and results when removing the SkinThickness variable from model 1
Figure 27: R code and results when removing the Insulin variable from model 2
Figure 28: R code and results when removing the Age variable from model 3
Figure 29: R code and results when comparing the efficiency of model 1 and model 2
Figure 30: R code and results when comparing the efficiency of model 2 and model 3
Figure 31: R code and results when comparing the efficiency of model 3 and model 4
Figure 32: R code
Figure 33: Results when building an equation with all 8 variables
Figure 35: R code and summary of model results
Figure 36: R code and results
Figure 37: R code and results
Figure 38: R code and the results of forecasting based on the original data set, saved in the file diabetes
Figure 39: R code and statistical results
Figure 40: R code and statistical results
Figure 41: R code and comparison results
Figure 42: R code and test results
Figure 43: R code and test results
Figure 44: R code and evaluation results
ACKNOWLEDGMENT
First of all, we would like to express our gratitude to Professor Nguyen Tien Dung for enabling our group to work with the RStudio software. We are also grateful for the abundant knowledge about Probability and Statistics he has shared with us. This is an opportunity for us to operate RStudio, and we understand that RStudio is an important tool in the world of mathematics nowadays. The software increases not only our knowledge but also our ideas for future projects.
PROJECT OF PROBABILITY AND STATISTICS FOR CHEMICAL ENGINEERING (MT2013)
1 INTRODUCTION
1.1 Topic introduction and requirements
1.1.1 Subject
Probability is a branch of mathematics that deals with numerical descriptions of how likely an event is to occur, or how likely a proposition is to be true. The probability of an event is a number between 0 and 1, where 0 denotes the impossibility of the event and 1 denotes certainty. It is applied in fields such as mathematics, statistics, economics, gambling, science (particularly physics), artificial intelligence, machine learning, computer science, philosophy, and so on.
Statistics is the study of several disciplines, including data analysis, interpretation, presentation, and organization. It plays a critical part in the research process by providing analytically significant results that help statistical analysts obtain the most accurate conclusions to the problems associated with social activities.
To sum up, Probability and Statistics is becoming significant in modern life, especially for students majoring in natural science, technology, and economics.
1.1.2 R-studio
R is a programming language and environment that is widely used in statistical computing, data analysis, and scientific research. It is a popular language for data collection, cleaning, analysis, graphing, and visualization. R is, in effect, the successor of the S language, which allows users, including engineering and technology students, to compute and manipulate data. As a language, R can be used to develop specialized software for a particular computational problem.
1.1.3 Our problems
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.
Information about the dataset attributes:
• Pregnancies: the number of pregnancies
• Glucose: the glucose level in blood
• Blood Pressure: the blood pressure measurement
• Skin Thickness: the thickness of the skin
• Insulin: the insulin level in blood
• BMI: the body mass index
• Diabetes Pedigree Function: a function that scores the likelihood of diabetes based on family history
• Age: the age in years
• Outcome: the final result, where 1 is Yes (diabetic) and 0 is No
Implementation steps:
• Import the data: diabetes.csv
• Data cleaning: handle NA (missing data)
• Data visualization
o Convert the variables (if necessary)
o Descriptive statistics: using sample statistics and graphs
• Logistic regression model: use a suitable logistic regression model to evaluate the factors affecting diabetes
1.2 Theoretical basis
Logistic regression (often referred to simply as binomial logistic regression) is used to predict the probability that an observation falls into one of two categories of the dependent variable, based on one or more independent variables that can be continuous or categorical. If, on the other hand, your dependent variable is a count, the statistical method to consider is Poisson regression. And if your dependent variable has more than two categories, multinomial logistic regression should be used. For example, you can use binomial logistic regression to understand whether exam performance can be predicted from revision time and test anxiety (i.e., the dependent variable is "exam performance", measured on a dichotomous scale, "pass" or "fail", and the two independent variables are "revision time" and "test anxiety").
Logistic regression model
Logistic regression models are used to predict a categorical variable from one or more continuous or categorical independent variables. The dependent variable can be binary, ordinal, or multicategorical; the independent variables can be interval/scale, dichotomous, discrete, or a mixture of all of these. The logistic regression equation (in the case where the dependent variable is binary) is:
P(Y_i = 1) = e^(β0 + β1*x1i + β2*x2i + ... + βk*xki) / (1 + e^(β0 + β1*x1i + β2*x2i + ... + βk*xki))
Where:
- P(Y_i = 1) is the probability of observing the value 1 of the outcome variable Y for case i;
- e is Euler's constant, with a value close to 2.71828;
- the β are the regression coefficients corresponding to the observed variables.
We often use regression models to estimate the effect of the X variables on the odds that Y = 1.
Effects in logistic regression
For estimation and prediction purposes, probabilities are severely limited. First, they are bounded to the range 0 to 1: if the implied effect of a variable X on the outcome Y pushed a prediction above 1, interpretation would be problematic. Second, a probability cannot be negative: if the effect of an independent variable on Y is negative, interpreting the coefficient directly on the probability scale is meaningless, since the model would be forced to produce only non-negative values.
To solve these two problems, we take a two-step approach through two changes of variable. First, we convert the probability P into the odds O:

O = P / (1 − P) = (probability of the event happening) / (probability of the event not happening)

and, conversely, P = O / (1 + O), where O is the odds and P is the probability. That is, the odds that an event will occur is the ratio of the number of times the event is expected to happen to the number of times it is expected not to happen. This gives a direct relationship between the odds that Y = 1 and the probability that Y = 1. Since the odds can be arbitrarily large, working with odds instead of probabilities allows the regression coefficient to take any positive value.
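The probability-to-odds conversion and its inverse can be sketched numerically (Python for illustration):

```python
def odds(p):
    # O = P / (1 - P)
    return p / (1.0 - p)

def prob(o):
    # P = O / (1 + O)
    return o / (1.0 + o)

p = 0.8
o = odds(p)        # about 4: the event is expected roughly 4 times as often as not
print(o, prob(o))  # converting back recovers the original probability
```

Note that as p approaches 1, the odds grow without bound, which is the property exploited in the text.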
The next step solves the second problem by slightly extending the relationship between odds and probability.
Algebraically, we can restate the odds formula above in terms of the logarithm of the odds that Y = 1:

ln(O) = ln(P / (1 − P))

The natural logarithm (log base e, written ln) of the odds can be negative (e.g., ln(0.279) = −1.276). In the source example, the log-odds of the probability of voting for Obama is −1.276. So, if we stopped at probability predictions, we could get misleading results (a positive number).
Second, the true effect of the covariates involved would be underestimated. The main advantage of the log-odds is that the coefficients are unconstrained: they can be negative as well as positive, ranging from negative infinity to positive infinity. Stated this way, logistic regression looks exactly like multiple regression on the right-hand side of the log-odds equation. The left-hand side of the equation, however, is not the score of Y; it is the logarithm of the odds that Y = 1. This means that each unit increase in X changes the log-odds of Y by β.
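The worked log-odds value from the text can be checked directly (Python for illustration):

```python
import math

o = 0.279              # the example odds value from the text
print(math.log(o))     # about -1.276: log-odds can be negative
print(o / (1 + o))     # the corresponding probability, about 0.218, is still positive
```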
Estimation of a logistic regression model with maximum likelihood
Because logistic regression operates on a categorical dependent variable, the ordinary least squares (OLS) method is unusable (it assumes a normally distributed dependent variable). Therefore, a more general estimator is used to find a good fit of the parameters, called maximum likelihood estimation. Maximum likelihood is an iterative estimation technique that selects the parameter estimates that maximize the likelihood of the sample dataset being observed. In logistic regression, maximum likelihood selects the coefficient estimates that maximize the logarithm of the probability of observing the particular set of values of the dependent variable in the sample, for a given set of X values.
Because logistic regression uses the method of maximum likelihood, the coefficient of determination (R²) cannot be estimated directly. Thus, we face two questions in the interpretation of logistic regression. First, how can we measure goodness of fit, i.e., test a general null hypothesis? Second, how do we estimate the partial effect of each variable X?
Statistical inference and the null hypothesis
First question: how can we measure goodness of fit, i.e., test a general null hypothesis? The statistical inference, together with the null hypothesis, proceeds in the following steps:
• The first step in the regression interpretation is to evaluate the global null hypothesis that the independent variables have no relationship with Y. In OLS regression, this is equivalent to testing whether R² is 0 in the population, using an F-test. Logistic regression instead uses maximum likelihood (not OLS): the null hypothesis H0 is β1 = β2 = ... = βk = 0. We measure the size of the residuals from this null model with a log-likelihood statistic.
• We then estimate the model again, assuming the null hypothesis is false; that is, we find the maximum-likelihood values of the coefficients β in the sample. Again, we measure the size of the residuals from this model with a log-likelihood statistic.
• Finally, we compare the two statistics by computing the test statistic: −2[ln(L_null) − ln(L_model)].
This statistic tells us how much residual (or prediction error) is reduced by using the X variables. The null hypothesis says that the reduction is 0; if the statistic is large enough (in a chi-squared test with df = number of independent variables), we reject the null hypothesis and conclude that at least one independent variable has an effect on the log-odds.
SPSS also reports R² statistics to help evaluate the strength of association, but these are pseudo-R² values and should be interpreted with caution, because logistic regression does not use R² the way linear regression does. Second question: how do we estimate the partial effect of each variable X? When the general null hypothesis is rejected, we evaluate the partial effects of the predictors. As in multiple linear regression, this implies a null hypothesis for each independent variable included in the equation: the null hypothesis is that the regression coefficient is zero, i.e., the variable has no effect on the log-odds.
Each coefficient estimate B has a standard error: the extent to which, on average, we would expect B to vary from one sample to another by chance. To check the significance of B, a test statistic (not a t-test, but a Wald chi-squared statistic) is calculated, with 1 degree of freedom. It should be remembered that the coefficient B expresses the effect of a one-unit change in X on the log-odds. For education, for example, the effect is positive: as education increases, the log-odds also increase. The Exp(B) value of an independent variable X describes how the odds change for a one-unit change in that variable when all other independent variables are held constant: when X increases by one, the odds of the "yes" event are multiplied by Exp(B) (this is e raised to the power B; for example, Exp(B) = 1.05 means an increase of 5% in the odds).
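The Exp(B) interpretation can be sketched numerically (Python for illustration; the coefficient and starting odds are hypothetical):

```python
import math

B = 0.0488                        # hypothetical log-odds effect of a one-unit change in X
exp_B = math.exp(B)               # the odds ratio, about 1.05
odds_before = 0.40
odds_after = odds_before * exp_B  # odds after X increases by one unit: about a 5% increase
print(round(exp_B, 2), round(odds_after, 3))
```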
Optimal model selection
One of the difficult, and sometimes contentious, problems in multivariable logistic regression analysis is choosing a model that adequately describes the data. In a study with a dependent variable y and 3 independent variables x1, x2 and x3, we can build the following models to predict y: y = f(x1), y = f(x2), y = f(x3), y = f(x1, x2), y = f(x1, x3), y = f(x2, x3), and y = f(x1, x2, x3), where f is a function. In general, with k independent variables x1, x2, x3, ..., xk, there are 2^k − 1 possible models to predict y.
An optimal model must meet the following three criteria:
• Simplicity
• Adequacy
• Practical significance
The simplicity criterion requires a model with few independent variables, because too many variables make interpretation difficult and sometimes impractical. By way of analogy, if we can spend 50,000 VND to buy a 500-page book, that is better than spending 60,000 VND for the same number of pages. Similarly, if a model with 3 independent variables describes the data as well as a model with 5 independent variables, the first model is chosen. A simple model is an economical one (in English, a parsimonious model).
The adequacy criterion means that the model must describe the data satisfactorily, i.e., it must predict values close (or as close as possible) to the actually observed values of the dependent variable y. If the observed value of y is 10, a model that predicts 9 must be considered more adequate than a model that predicts 6.
The criterion of practical significance means that the model has to be supported by theory or have biological significance (for biological research), clinical significance (for clinical studies), and so on. It is possible that phone numbers are somehow correlated with fracture rates, but of course such a model makes no sense. This is an important criterion, because if a statistical analysis produces a model that is mathematically meaningful but has no practical significance, the model is just a numbers game with no real scientific value. This third criterion belongs to the theoretical realm, and we will not discuss it further here.
We will discuss the simplicity and adequacy criteria. An important and useful metric for deciding on a simple and adequate model is the Akaike Information Criterion (AIC). The formula for the AIC value is:

AIC = −2 × log(likelihood) + 2 × k = 2[k − log(likelihood)]

where k is the number of parameters in the model. A simple and adequate model is one with an AIC value as low as possible whose independent variables are all statistically significant. So the problem of finding a simple and adequate model is really the problem of finding the model (or models) with the lowest, or near-lowest, AIC value.
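The AIC trade-off between fit and model size can be sketched numerically (Python for illustration; the log-likelihoods and variable counts are hypothetical, not taken from the report's models):

```python
def aic(log_likelihood, k):
    # AIC = -2 * log(likelihood) + 2 * k; lower is better
    return -2 * log_likelihood + 2 * k

# Hypothetical fits, for illustration only
full = aic(-480.0, 8)     # model with all 8 predictors
reduced = aic(-481.0, 5)  # simpler model with 5 predictors, slightly worse fit
print(full, reduced)      # 976.0 972.0: the simpler model wins on AIC
```

Even though the reduced model fits slightly worse, its smaller penalty term 2k gives it the lower AIC, which is the parsimony principle described above.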
2 DATA CORRECTION
2.1 Import data
Read the file "diabetes.csv" and assign it to the name diabetes.
Figure 1: R code and results after reading data
2.2 Data cleaning
Check for missing data in file
Figure 2: R code and results when checking missing data in file "diabetes"
Comment: We see that in the file "diabetes" there is no missing data to be processed.
diabetes <- read.csv("~/Desktop/diabetes.csv")
head(diabetes)
apply(is.na(diabetes), 2, which)
2.3 Data clarification
Calculate descriptive statistics for the variables.
For the continuous variables "Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", and "Age", descriptive statistics are performed and the results are output in tabular form.
Figure 3: R code and results when performing descriptive statistics
Make a statistical table for each categorical variable:
For the categorical variable "Outcome", make a frequency table.
Figure 4: R code and results when performing frequency statistics for the variable "Outcome"
Comment:
• There are 500 survey participants who do not have diabetes
• There are 268 survey participants who have diabetes
Draw a histogram showing the distribution of quantitative variables
• Pregnancies and Glucose:
Figure 5: The result when plotting the histograms of the variables "Pregnancies" and "Glucose"
Comment:
• From the graph of the variable "Pregnancies", we can see that the number of pregnancies is concentrated mostly in the range of 0-5, highest at 0-2 (349 people) and lowest in the range of 10-15. The graph is skewed to the right.
• The graph does not follow a normal distribution; the values from 0-2 are so concentrated that they may adversely affect the logistic regression model. From the graph of the variable "Glucose", we can see that the glucose level is highly concentrated from 80 to 160 mg/dL, the highest
par(mfrow = c(1,2))
hist(diabetes$Pregnancies, xlab="Pregnancies", main="Histogram of Pregnancies", col="pink", label=T, ylim=c(0,400))
hist(diabetes$Glucose, xlab="Glucose", main="Histogram of Glucose", col="pink", label=T, ylim=c(0,250))
at 100-120 mg/dL, and the lowest in the range of 0-60 mg/dL. There is an anomaly (possibly erroneous, because it is physiologically implausible) at 0-20 mg/dL. Besides, the graph has the rough shape of a normal distribution.
• Blood Pressure and Skin Thickness:
Figure 6: The result when plotting the histograms of the variables "Blood Pressure" and "Skin Thickness"
Comment:
• Based on the graph of the variable "Blood Pressure", we find that the blood pressure values are mostly concentrated from 50-90 mmHg, highest at 70-80 mmHg and lowest at 10-40 and 110-130 mmHg. The graph has the rough shape of a normal distribution. However, there is an abnormality: the number of people with blood pressure in the range of 0-10 mmHg is quite high (35 people).
• Based on the graph of the variable "Skin Thickness", we find that the skin thickness values are highly concentrated at 0-50 mm, highest at 0-10 mm and lowest at 50-100 mm. The graph does not follow a normal distribution.
par(mfrow = c(1,2))
hist(diabetes$BloodPressure, xlab="BloodPressure", main="Histogram of BloodPressure", col="pink", label=T, ylim=c(0,250))
hist(diabetes$SkinThickness, xlab="SkinThickness", main="Histogram of SkinThickness", col="pink", label=T, ylim=c(0,250))
• Insulin and BMI:
Figure 7: The result when plotting the histograms of the variables "Insulin" and "BMI"
Comment:
• Based on the graph of the variable "Insulin", we find that the insulin values are concentrated mainly at 0-200 mu U/ml, highest at 0-100 mu U/ml and lowest at 300-900 mu U/ml. The graph is skewed to the right.
• Based on the graph of the variable "BMI", we can see that the BMI (body mass index) values are strongly concentrated at 20-40 kg/m2, highest at 30-35 kg/m2 and lowest at 5-15 and 55-70 kg/m2. The graph has the rough shape of a normal distribution. Besides, there is an anomaly (possibly erroneous, because it is physiologically implausible) at 0-10 kg/m2.
par(mfrow = c(1,2))
hist(diabetes$Insulin, xlab="Insulin", main="Histogram of Insulin", col="pink", label=T, ylim=c(0,600))
hist(diabetes$BMI, xlab="BMI", main="Histogram of BMI", col="pink", label=T, ylim=c(0,250))
• Diabetes Pedigree Function and Age:
Figure 8: The result when plotting the histograms of the variables "Diabetes Pedigree Function" and "Age"
Comment:
• From the graph of the variable "Diabetes Pedigree Function", we can see that the values of the diabetes pedigree function are concentrated mainly at 0-1, highest at 0.2-0.4 and lowest in the range of 1.5-2.5. The graph does not follow a normal distribution; the values from 0.2-0.4 are very concentrated.
• From the graph of the variable "Age", we can see that the ages are highly concentrated from 20-45, highest at 20-30, and lowest in the range of 70-80. Besides, the graph does not follow a normal distribution; the values from 20-30 are very concentrated.
par(mfrow = c(1,2))
hist(diabetes$DiabetesPedigreeFunction, xlab="DiabetesPedigreeFunction", main="Histogram of DiabetesPedigreeFunction", col="pink", label=T, ylim=c(0,300))
hist(diabetes$Age, xlab="Age", main="Histogram of Age", col="pink", label=T, ylim=c(0,300))
Plot a histogram showing the distribution of the number of pregnancies of people with/without diabetes:
library(ggplot2)
library(plyr)
mu_Pregnancies <- ddply(diabetes, "Outcome", summarise, grp.mean=mean(Pregnancies))
ggplot(diabetes, aes(x=Pregnancies, color=as.factor(Outcome), fill=as.factor(Outcome))) +
  geom_histogram(position="identity", alpha=0.5) +
  geom_vline(data=mu_Pregnancies, aes(xintercept=grp.mean, color=as.factor(Outcome)), linetype="dashed") +
  scale_color_manual(values=c("blue", "red", "#56B4E9")) +
  scale_fill_manual(values=c("steelblue1", "brown1", "#56B4E9")) +
  labs(title="Histogram of Pregnancies for diabetes", x="Pregnancies", y="Frequency") +
  theme_classic()
Comment: The average number of pregnancies of people with diabetes is higher than that of people without diabetes, i.e., people with more pregnancies have a higher risk of diabetes. Besides, because the two lines are different, this factor is able to help identify diabetes.
Plot a histogram showing the distribution of skin thickness of people with/without diabetes:
Figure 11: R code
Figure 12: Histogram showing the distribution of skin thickness for people with and without diabetes
Comment: The average skin thickness of people with diabetes is higher than that of people without diabetes. In general, however, the frequency distributions of people with and without the disease are
library(ggplot2)
library(plyr)
mu_SkinThickness <- ddply(diabetes, "Outcome", summarise, grp.mean=mean(SkinThickness))
ggplot(diabetes, aes(x=SkinThickness, color=as.factor(Outcome), fill=as.factor(Outcome))) +
  geom_histogram(position="identity", alpha=0.5) +
  geom_vline(data=mu_SkinThickness, aes(xintercept=grp.mean, color=as.factor(Outcome)), linetype="dashed") +
  scale_color_manual(values=c("blue", "red", "#56B4E9")) +
  scale_fill_manual(values=c("steelblue1", "brown1", "#56B4E9")) +
  labs(title="Histogram of SkinThickness for diabetes", x="SkinThickness", y="Frequency") +
  theme_classic()
comparable. Therefore, measuring skin thickness does not predict well the probability of a person having diabetes.
Plot a histogram showing the distribution of glucose level of people with/without diabetes:
Figure 13: R code
Figure 14: Histogram showing the distribution of glucose level for people with and without diabetes
Comment: The average glucose level of people with diabetes is higher than that of people without diabetes. Because the two lines are clearly different, this factor can help determine diabetes.
library(ggplot2)
library(plyr)
mu_Glucose <- ddply(diabetes, "Outcome", summarise, grp.mean=mean(Glucose))
ggplot(diabetes, aes(x=Glucose, color=as.factor(Outcome), fill=as.factor(Outcome))) +
  geom_histogram(position="identity", alpha=0.5) +
  geom_vline(data=mu_Glucose, aes(xintercept=grp.mean, color=as.factor(Outcome)), linetype="dashed") +
  scale_color_manual(values=c("blue", "red", "#56B4E9")) +
  scale_fill_manual(values=c("steelblue1", "brown1", "#56B4E9")) +
  labs(title="Histogram of Glucose for diabetes", x="Glucose", y="Frequency") +
  theme_classic()
Plot a histogram showing the distribution of blood pressure of people with/without diabetes:
Figure 15: R code
Figure 16: Histogram showing the distribution of blood pressure for people with and without diabetes
Comment: The average blood pressure of people with diabetes is slightly higher than that of people without diabetes. Because the two distributions are almost the same, this factor is not able to determine diabetes.
library(ggplot2)
library(plyr)
mu_BloodPressure <- ddply(diabetes, "Outcome", summarise, grp.mean=mean(BloodPressure))
ggplot(diabetes, aes(x=BloodPressure, color=as.factor(Outcome), fill=as.factor(Outcome))) +
  geom_histogram(position="identity", alpha=0.5) +
  geom_vline(data=mu_BloodPressure, aes(xintercept=grp.mean, color=as.factor(Outcome)), linetype="dashed") +
  scale_color_manual(values=c("blue", "red", "#56B4E9")) +
  scale_fill_manual(values=c("steelblue1", "brown1", "#56B4E9")) +
  labs(title="Histogram of BloodPressure for diabetes", x="BloodPressure", y="Frequency") +
  theme_classic()
Plot a histogram showing the distribution of insulin level of people with/without diabetes:
Figure 17: R code
Figure 18: Histogram showing the distribution of insulin level for people with and without diabetes
Comment: The average insulin level of people with diabetes is higher than that of people without diabetes. Because the two lines are different, this factor can help determine diabetes.
library(ggplot2)
library(plyr)
mu_Insulin <- ddply(diabetes, "Outcome", summarise, grp.mean=mean(Insulin))
ggplot(diabetes, aes(x=Insulin, color=as.factor(Outcome), fill=as.factor(Outcome))) +
  geom_histogram(position="identity", alpha=0.5) +
  geom_vline(data=mu_Insulin, aes(xintercept=grp.mean, color=as.factor(Outcome)), linetype="dashed") +
  scale_color_manual(values=c("blue", "red", "#56B4E9")) +
  scale_fill_manual(values=c("steelblue1", "brown1", "#56B4E9")) +
  labs(title="Histogram of Insulin for diabetes", x="Insulin", y="Frequency") +
  theme_classic()