
TABLE OF CONTENTS

LIST OF FIGURES

ACKNOWLEDGMENT

1 INTRODUCTION

1.1 Topic introduction and requirements

1.1.1 Subject

1.1.2 RStudio

1.1.3 Our problems

1.2 Theoretical basis

2 DATA CORRECTION

2.1 Import data

2.2 Data cleaning

2.3 Data clarification

2.4 Logistic Regression

2.5 Prediction

3 CODE R

REFERENCES

LIST OF FIGURES

Figure 1: R code and results after reading data

Figure 2: R code and results when checking missing data in file "diabetes"

Figure 3: R code and results when performing descriptive statistics

Figure 4: R code and results when performing quantitative statistics for the variable "Outcome"

Figure 5: Histograms of the variables "Pregnancies" and "Glucose"

Figure 6: Histograms of the variables "Blood Pressure" and "Skin Thickness"

Figure 7: Histograms of the variables "Insulin" and "BMI"

Figure 8: Histograms of the variables "Diabetes Pedigree Function" and "Age"

Figure 9: R code

Figure 10: Histogram showing the distribution of the number of pregnancies for people with and without diabetes

Figure 11: R code

Figure 12: Histogram showing the distribution of skin thickness for people with and without diabetes

Figure 13: R code

Figure 14: Histogram showing the distribution of glucose level for people with and without diabetes

Figure 15: R code

Figure 16: Histogram showing the distribution of blood pressure for people with and without diabetes

Figure 17: R code

Figure 18: Histogram showing the distribution of insulin level for people with and without diabetes

Figure 19: R code

Figure 20: Histogram showing the distribution of BMI (body mass index) for people with and without diabetes

Figure 21: R code

Figure 22: Histogram showing the distribution of diabetes pedigree function for people with and without diabetes

Figure 23: R code

Figure 24: Histogram showing the distribution of age for people with and without diabetes

Figure 25: R code and results

Figure 26: R code and results when removing the SkinThickness variable from model 1

Figure 27: R code and results when removing the Insulin variable from model 2

Figure 28: R code and results when removing the Age variable from model 3

Figure 29: R code and results when comparing the efficiency of model 1 and model 2

Figure 30: R code and results when comparing the efficiency of model 2 and model 3

Figure 31: R code and results when comparing the efficiency of model 3 and model 4

Figure 32: R code

Figure 33: Results when building an equation with all 8 variables

Figure 35: R code and summary of model results

Figure 36: R code and results

Figure 37: R code and results

Figure 38: R code and the results of forecasting based on the original dataset, saving the results in the file diabetes

Figure 39: R code and statistical results

Figure 40: R code and statistical results

Figure 41: R code and comparison results

Figure 42: R code and test results

Figure 43: R code and test results

Figure 44: R code and evaluation results

ACKNOWLEDGMENT

First of all, we would like to express our gratitude to Professor Nguyen Tien Dung for giving our group the chance to work with the RStudio software. We are also grateful for the abundant knowledge about Probability and Statistics that he has shared with us. This project is an opportunity for us to operate RStudio, and we understand that RStudio is an important tool in the world of mathematics today. The software broadens not only our knowledge but also our ideas for future projects.

PROJECT OF PROBABILITY AND STATISTICS FOR CHEMICAL ENGINEERING (MT2013)

1 INTRODUCTION

1.1 Topic introduction and requirements

1.1.1 Subject

Probability is the branch of mathematics that deals with numerical descriptions of how likely an event is to occur, or how likely it is that a proposition is true. The probability of an event is a number between 0 and 1, where 0 denotes the impossibility of the event and 1 represents certainty. It is widely applied in fields such as mathematics, statistics, economics, gambling, science (particularly physics), artificial intelligence, machine learning, computer science, and philosophy.

Statistics is the study of several disciplines, including data analysis, interpretation, presentation, and organization. It plays a critical part in the research process by providing analytically significant figures that help statistical analysts obtain the most accurate results when addressing problems associated with social activities.

In summary, Probability and Statistics is becoming increasingly significant in modern life, especially for students majoring in natural science, technology, and economics.

1.1.2 RStudio

R is a programming language and environment that is widely used in statistical computing, data analysis, and scientific research. It is a popular language for data collection, cleaning, analysis, graphing, and visualization.

In practice, R is the next-generation implementation of the S language. The S programming language allows users, including engineering and technology students, to calculate and manipulate data. As a language, R can be used to develop specialized software for a particular computational problem.

1.1.3 Our problems

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to diagnostically predict whether a patient has diabetes based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database; in particular, all patients here are females of Pima Indian heritage who are at least 21 years old.

Information about dataset attributes:

• Pregnancies: the number of pregnancies

• Glucose: the glucose level in the blood

• Blood Pressure: the blood pressure measurement

• Skin Thickness: the thickness of the skin

• Insulin: the insulin level in the blood

• BMI: the body mass index

• Diabetes Pedigree Function: a score of the likelihood of diabetes based on family history

• Age: the age

• Outcome: the final result, where 1 means Yes and 0 means No

Implementation steps:

• Import data: diabetes.csv

• Data cleaning: handle NA (missing data)

• Data visualization:

o Convert the variables (if necessary)

o Descriptive statistics: using sample statistics and graphs

• Logistic regression model: use a suitable logistic regression model to evaluate the factors affecting diabetes

1.2 Theoretical basis

Logistic regression (often referred to simply as binomial logistic regression) is used to predict the probability that an observation falls into one of two categories of a dependent variable, based on one or more independent variables that may be continuous or categorical. If, on the other hand, your dependent variable is a count, the statistical method to consider is Poisson regression; and if your dependent variable has more than two categories, multinomial logistic regression should be used. For example, you can use binomial logistic regression to understand whether test performance can be predicted from review time and test anxiety (i.e., the dependent variable is "test performance", measured on a dichotomous scale – "pass" or "fail" – and there are two independent variables: "review time" and "test anxiety").
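In R, these three cases map onto different model families. The following sketch is purely illustrative; the data frames and variables (exams, clinic, d) are hypothetical:

# binomial logistic regression: two-category outcome (hypothetical data)
fit_binary <- glm(passed ~ review_time + anxiety, family = binomial, data = exams)
# Poisson regression: count outcome (hypothetical data)
fit_count <- glm(visits ~ age + income, family = poisson, data = clinic)
# multinomial logistic regression: more than two categories (requires the nnet package)
fit_multi <- nnet::multinom(category ~ x1 + x2, data = d)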

Logistic regression model

Logistic regression models are used to predict a categorical variable from one or more continuous or categorical independent variables. The dependent variable can be binary, ordinal, or multi-categorical; the independent variables can be interval/scale, dichotomous, discrete, or a mixture of all of these. The logistic regression equation (in the case where the dependent variable is binary) is:

$$P(Y_i = 1) = \frac{e^{\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki}}}{1 + e^{\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki}}}$$

Where:

- P(Yi = 1) is the probability of observing case i with the outcome variable Y equal to 1;

- e is Euler's constant, with a value close to 2.71828;

- the regression coefficients β0, β1, …, βk correspond to the intercept and the observed variables x1, …, xk.


We often use such regression models to estimate the effect of the X variables on the odds that Y = 1.
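As a small illustration of the equation above, such a probability can be computed in R for assumed (not fitted) coefficient values:

# P(Y = 1) for hypothetical coefficients; plogis(z) = 1 / (1 + exp(-z))
beta0 <- -5.9; beta1 <- 0.035   # assumed intercept and slope, not values from this report
x <- 120                        # a hypothetical glucose level
plogis(beta0 + beta1 * x)       # about 0.15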

Effects in logistic regression

For estimation and prediction purposes, probabilities are severely limited. First, they are bound to the range 0 to 1, so if the modeled effect of a variable X would push the predicted outcome above 1, interpretation becomes problematic. Second, a probability cannot be negative: if the effect of an independent variable on Y is negative, the interpretation of the coefficient becomes meaningless, because the predicted probability must not fall below 0 – in effect the coefficient would be forced to be positive.

To solve these two problems, we take a two-step approach involving two changes of variable. First, we convert the probability P into odds O:

$$O = \frac{P}{1 - P} = \frac{\text{probability of the event happening}}{\text{probability of the event not happening}}, \qquad P = \frac{O}{1 + O}$$

That is, the odds that an event will occur is the ratio of the number of times the event is expected to happen to the number of times it is expected not to happen. This gives a direct relationship between the odds that Y = 1 and the probability that Y = 1.

Thus, since odds can be infinitely large, modeling the odds instead of the probability removes the upper bound and allows the effect of a regression coefficient to be arbitrarily large.
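A quick numerical check of this relationship, for an arbitrary probability:

p <- 0.75
o <- p / (1 - p)   # odds = 3: the event is three times as likely to happen as not
o / (1 + o)        # recovers the probability, 0.75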

The next step solves the second problem by slightly expanding the relationship between odds and probability. Algebraically, we can restate the odds formula above in terms of the logarithm of the odds that Y = 1 (the logit):

$$\ln(O) = \ln\frac{P}{1 - P} = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki}$$

Here ln denotes the natural logarithm (log base e). Note that the logarithm of an odds below 1 is negative, e.g. ln 0.279 = −1.276; in the textbook example, the log-odds of voting for Obama is −1.276, so stopping at probability-scale predictions could give a misleading result (a positive number).


Second, without this transformation the true effects of the covariates involved are underestimated. The main advantage of the logarithm of the odds is that the coefficients are unconstrained: they can be negative as well as positive, ranging from negative infinity to positive infinity. Stated this way, the right-hand side of the log-odds equation looks exactly like multiple regression. The left-hand side, however, is not the score of Y; it is the logarithm of the odds that Y = 1. This means that each unit increase in X has an effect of β on the log-odds of Y.

Estimation of a logistic regression model with maximum likelihood

Because logistic regression operates on a categorical dependent variable, the ordinary least squares (OLS) method is unusable (it assumes a normally distributed dependent variable). Therefore, a more general estimator is used to find a good fit for the parameters: maximum likelihood estimation. Maximum likelihood is an iterative estimation technique that selects the parameter estimates under which the observed sample dataset is most likely. In logistic regression, maximum likelihood selects the coefficient estimates that maximize the logarithm of the probability of observing the particular set of values of the dependent variable in the sample for the given set of X values.
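Concretely, for a binary outcome the quantity being maximized is the standard binomial log-likelihood

$$\ell(\beta) = \sum_{i=1}^{n}\Big[y_i \ln P(Y_i = 1) + (1 - y_i)\ln\big(1 - P(Y_i = 1)\big)\Big]$$

where P(Y_i = 1) is given by the logistic equation above.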

Because logistic regression uses the method of maximum likelihood, the coefficient of determination (R²) cannot be estimated directly. This leaves two questions for the interpretation of logistic regression: First, how do we measure the goodness of fit – a general null hypothesis? Second, how do we estimate the partial effect of each variable X?

Statistical inference and null hypothesis

First question: how do we measure the goodness of fit – a general null hypothesis? The statistical inference, together with the null hypothesis, is carried out in the following steps:

• The first step in interpreting the regression is to evaluate the global null hypothesis that the independent variables have no relationship with Y. In OLS regression, this is equivalent to testing whether R² is 0 in the population using an F-test. Logistic regression uses maximum likelihood instead (not OLS): the null hypothesis H0 is β1 = β2 = ⋯ = βk = 0. We measure the size of the residuals from this null model with a log-likelihood statistic.

• We then estimate the model again, assuming the null hypothesis is false; that is, we find the maximum-likelihood estimates of the coefficients β from the sample. Again, we measure the size of the residuals from this model with a log-likelihood statistic.

• Finally, we compare the two statistics by computing the test statistic

$$-2\left[\ln(L_{\mathrm{null}}) - \ln(L_{\mathrm{model}})\right]$$

This statistic tells us how much of the residual (prediction error) can be reduced by using the X variables. The null hypothesis states that the reduction is 0; if the statistic is large enough (in a chi-square test with df = number of independent variables), we reject the null hypothesis and conclude that at least one independent variable has an effect on the log-odds.
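In R, this likelihood-ratio test can be sketched as follows, assuming the diabetes data frame used in Section 2 (the model formula here is illustrative only):

null_model <- glm(Outcome ~ 1, family = binomial, data = diabetes)  # intercept only
full_model <- glm(Outcome ~ ., family = binomial, data = diabetes)  # all predictors
anova(null_model, full_model, test = "Chisq")  # computes -2[ln(L_null) - ln(L_model)]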

SPSS also reports R² statistics to help evaluate the strength of association, but these are pseudo-R² values and should not be interpreted like the R² of linear regression, because logistic regression does not use R² in the way linear regression does.

Second question: how do we estimate the partial effect of each variable X? When the general null hypothesis is rejected, we evaluate the partial effects of the predictors.

As in multiple linear regression, this implies a null hypothesis for each independent variable included in the equation: each regression coefficient is zero, i.e., the variable has no effect on the log-odds.

Each coefficient estimate B has a standard error – the extent to which, on average, we would expect B to vary from one sample to another by chance. To test the significance of B, a test statistic is calculated (not a t-test, but a Wald chi-squared statistic) with 1 degree of freedom.
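With a fitted glm object, these per-coefficient tests appear in the coefficient table (continuing the hypothetical full_model sketched above):

summary(full_model)$coefficients  # Estimate, Std. Error, z value, Pr(>|z|)
# the Wald chi-squared statistic with 1 df is the square of the z value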

It should be remembered that the coefficient B expresses the effect of a one-unit change in X on the log-odds.

In the education example, the effect is positive: as education increases, the log-odds also increases. The Exp(B) value of an independent variable X predicts the change in the odds of the event for a one-unit change in X, holding all other independent variables constant. It indicates that when X increases by one unit, the odds of the "yes" event are multiplied by Exp(B) (this is e raised to the power B; e.g., Exp(B) = 1.05 corresponds to a 5% increase in the odds).
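The odds ratios Exp(B) are obtained by exponentiating the coefficients (again using the hypothetical full_model from the sketch above):

exp(coef(full_model))               # multiplicative effect on the odds per unit of X
(exp(coef(full_model)) - 1) * 100   # percentage change in the odds, e.g. 1.05 -> +5%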

Optimal model selections

One of the difficult and sometimes difficult problems in multivariable logistic regression analysis

is choosing a model that can adequately describe the data A study with a dependent variable y and

3 independent variables x1, x2 and x3, we can have the following models to predict y : y = f(x1), y

= f(x2), y = f(x3), y = f(x1, x2), y = f(x1, x3), y = f(x2, x3), and y = f(x1, x2, x3), where f is a function number In general with k independent variables x1, x2, x3, , xk, we have many models (2k ) to predict y

An optimal model must meet the following three criteria:

• Simple

• Adequate

• Practically meaningful

The simplicity criterion calls for a model with few independent variables, because too many variables make interpretation difficult and sometimes impractical. As an analogy, spending 50,000 VND to buy a 500-page book is better than spending 60,000 VND to buy a book with the same number of pages. Similarly, if a model with 3 independent variables describes the data as well as a model with 5 independent variables, the first model is chosen. A simple model is an economical one (in English, a parsimonious model).

The adequacy criterion means that the model must describe the data satisfactorily, i.e., it must predict values close (or as close as possible) to the actually observed values of the dependent variable y. If the observed value of y is 10, a model that predicts 9 must be considered more adequate than a model that predicts 6.

The criterion of "practical significance" means that the model must be supported by theory or have biological significance (if it is biological research), clinical significance (if it is a clinical study), and so on. It is possible that phone numbers are somehow correlated with fracture rates, but of course such a model makes no sense. This is an important criterion, because if a statistical analysis results in a model that is mathematically meaningful but has no practical significance, then the model is just a numbers game with no real scientific value.

The third criterion (practical significance) belongs to the theoretical realm, and we will not discuss it here. We will discuss the criteria of simplicity and adequacy. An important and useful metric for deciding on a simple and adequate model is the Akaike Information Criterion (AIC).

The formula for the AIC value is:

$$AIC = -2\ln(\text{likelihood}) + 2k = 2\left[k - \ln(\text{likelihood})\right]$$

where k is the number of parameters in the model.

A simple and adequate model should be one with an AIC value as low as possible, whose independent variables are all statistically significant. So, the problem of finding a simple and adequate model is really a search for the model (or models) with the lowest, or nearly the lowest, AIC value.
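A sketch of an AIC comparison in R, using the diabetes variables from Section 2 (the smaller model here is chosen only for illustration, not taken from the report's results):

m_full <- glm(Outcome ~ ., family = binomial, data = diabetes)
m_small <- glm(Outcome ~ Glucose + BMI + Age, family = binomial, data = diabetes)
AIC(m_full, m_small)                  # the lower AIC is preferred
step(m_full, direction = "backward")  # automated backward search for a low-AIC model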


2 DATA CORRECTION

2.1 Import data

Read the file "diabetes.csv" and assign it the name diabetes, as shown in the code below.

Figure 1: R code and results after reading data

2.2 Data cleaning

Check for missing data in the file.

Figure 2: R code and results when checking missing data in file "diabetes"

Comment: We see that the file "diabetes" contains no missing data that needs processing.

diabetes <- read.csv("~/Desktop/diabetes.csv")  # read the data and name it 'diabetes'
head(diabetes)                                  # preview the first rows
apply(is.na(diabetes), 2, which)                # locate missing values in each column
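An equivalent check that counts, rather than locates, the missing values per column:

colSums(is.na(diabetes))  # 0 for every column if there is no missing data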


2.3 Data clarification

Calculate descriptive statistics for the variables.

For the continuous variables "Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", and "Age", descriptive statistics are computed and the results are output in tabular form.

Figure 3: R code and results when performing descriptive statistics

Make a statistical table for each categorical variable:

For the categorical variable "Outcome", make a frequency table.

Figure 4: R code and results when performing quantitative statistics for the variable "Outcome"
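A likely form of the Figure 4 code is a simple frequency table:

table(diabetes$Outcome)  # counts of 0 (no diabetes) and 1 (diabetes)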


Comment:

• There are 500 survey participants who do not have diabetes

• There are 268 survey participants who have diabetes

Draw a histogram showing the distribution of quantitative variables

• Pregnancies and Glucose:

Figure 5: Histograms of the variables "Pregnancies" and "Glucose"

Comment:

• From the graph of the variable "Pregnancies", we can see that the number of pregnancies is concentrated mostly in the range of 0-5, with the highest count at 0-2 pregnancies (349 people) and the lowest in the range of 10-15. The graph is skewed to the right. The distribution is not normal: the values from 0-2 are so heavily concentrated that they can have a bad influence on the logistic regression model.

• From the graph of the variable "Glucose", we can see that the glucose level is highly concentrated between 80 and 160 mg/dL, with the highest count at 100-120 mg/dL and the lowest in the range of 0-60 mg/dL. There is an anomaly (probably erroneous data, since such values are implausible) at 0-20 mg/dL. Apart from that, the graph has roughly the shape of a normal distribution.

par(mfrow = c(1,2))  # draw the two histograms side by side
hist(diabetes$Pregnancies, xlab="Pregnancies", main="Histogram of Pregnancies", col="pink", label=T, ylim=c(0,400))
hist(diabetes$Glucose, xlab="Glucose", main="Histogram of Glucose", col="pink", label=T, ylim=c(0,250))

• Blood Pressure and Skin Thickness:

Figure 6: Histograms of the variables "Blood Pressure" and "Skin Thickness"

Comment:

• Based on the graph of the variable "Blood Pressure", we find that blood pressure values are mostly concentrated between 50 and 90 mmHg, with the highest count at 70-80 mmHg and the lowest at 10-40 and 110-130 mmHg. The graph has roughly the shape of a normal distribution. However, there is an abnormality: the number of people with blood pressure in the range of 0-10 mmHg is quite high (35 people).

• Based on the graph of the variable "Skin Thickness", we find that skin thickness values are highly concentrated at 0-50 mm, with the highest count at 0-10 mm and the lowest at 50-100 mm. The graph does not have a normal distribution.

par(mfrow = c(1,2))  # draw the two histograms side by side
hist(diabetes$BloodPressure, xlab="BloodPressure", main="Histogram of BloodPressure", col="pink", label=T, ylim=c(0,250))
hist(diabetes$SkinThickness, xlab="SkinThickness", main="Histogram of SkinThickness", col="pink", label=T, ylim=c(0,250))

• Insulin and BMI:

Figure 7: Histograms of the variables "Insulin" and "BMI"

Comment:

• Based on the graph of the variable "Insulin", we find that insulin values are concentrated mainly at 0-200 mu U/ml, with the highest count at 0-100 mu U/ml and the lowest at 300-900 mu U/ml. The graph is skewed to the right.

• Based on the graph of the variable "BMI", we can see that BMI (body mass index) values are strongly concentrated at 20-40 kg/m², with the highest count at 30-35 kg/m² and the lowest at 5-15 and 55-70 kg/m². The graph has roughly the shape of a normal distribution. Besides, there is an anomaly (probably erroneous data, since such values are implausible) at 0-10 kg/m².

par(mfrow = c(1,2))  # draw the two histograms side by side
hist(diabetes$Insulin, xlab="Insulin", main="Histogram of Insulin", col="pink", label=T, ylim=c(0,600))
hist(diabetes$BMI, xlab="BMI", main="Histogram of BMI", col="pink", label=T, ylim=c(0,250))

• Diabetes Pedigree Function and Age:

Figure 8: Histograms of the variables "Diabetes Pedigree Function" and "Age"

Comment:

• From the graph of the variable "Diabetes Pedigree Function", we can see that the values are concentrated mainly between 0 and 1, with the highest count at 0.2-0.4 and the lowest in the range of 1.5-2.5. The graph does not have a normal distribution: the values from 0.2-0.4 are too concentrated.

• From the graph of the variable "Age", we can see that age values are highly concentrated between 20 and 45, with the highest count at 20-30 and the lowest in the range of 70-80. The graph does not have a normal distribution: the values from 20-30 are too concentrated.

par(mfrow = c(1,2))  # draw the two histograms side by side
hist(diabetes$DiabetesPedigreeFunction, xlab="DiabetesPedigreeFunction", main="Histogram of DiabetesPedigreeFunction", col="pink", label=T, ylim=c(0,300))
hist(diabetes$Age, xlab="Age", main="Histogram of Age", col="pink", label=T, ylim=c(0,300))

Plot a histogram showing the distribution of the number of pregnancies of people with/without diabetes:

library(ggplot2)  # plotting
library(plyr)     # ddply for per-group means
mu_Pregnancies <- ddply(diabetes, "Outcome", summarise, grp.mean=mean(Pregnancies))
ggplot(diabetes, aes(x=Pregnancies, color=as.factor(Outcome), fill=as.factor(Outcome))) +
  geom_histogram(position="identity", alpha=0.5) +
  geom_vline(data=mu_Pregnancies, aes(xintercept=grp.mean, color=as.factor(Outcome)), linetype="dashed") +
  scale_color_manual(values=c("blue", "red", "#56B4E9")) +
  scale_fill_manual(values=c("steelblue1", "brown1", "#56B4E9")) +
  labs(title="Histogram of Pregnancies for diabetes", x="Pregnancies", y="Frequency") +
  theme_classic()

Figure 9: R code

Figure 10: Histogram showing the distribution of the number of pregnancies for people with and without diabetes

Comment: The average number of pregnancies of people with diabetes is higher than that of people without diabetes, so women with more pregnancies appear to have a higher risk of diabetes. Besides, because the two mean lines are clearly different, this factor is able to help identify diabetes.

Plot a histogram showing the distribution of skin thickness of people with/without diabetes:

Figure 11: R code

Figure 12: Histogram showing the distribution of skin thickness for people with and without diabetes

Comment: The average skin thickness of people with diabetes is higher than that of people without diabetes. In general, however, the frequency distributions of people with and without the disease are comparable. Therefore, measuring skin thickness does not predict the probability of a person having diabetes.

library(ggplot2)
library(plyr)
mu_SkinThickness <- ddply(diabetes, "Outcome", summarise, grp.mean=mean(SkinThickness))
ggplot(diabetes, aes(x=SkinThickness, color=as.factor(Outcome), fill=as.factor(Outcome))) +
  geom_histogram(position="identity", alpha=0.5) +
  geom_vline(data=mu_SkinThickness, aes(xintercept=grp.mean, color=as.factor(Outcome)), linetype="dashed") +
  scale_color_manual(values=c("blue", "red", "#56B4E9")) +
  scale_fill_manual(values=c("steelblue1", "brown1", "#56B4E9")) +
  labs(title="Histogram of SkinThickness for diabetes", x="SkinThickness", y="Frequency") +
  theme_classic()

Plot a histogram showing the distribution of glucose level of people with/without diabetes:

Figure 13: R code

Figure 14: Histogram showing the distribution of glucose level for people with and without diabetes

Comment: The average glucose level of people with diabetes is higher than that of people without diabetes. Because the two mean lines are clearly different, this factor can help determine diabetes.

library(ggplot2)
library(plyr)
mu_Glucose <- ddply(diabetes, "Outcome", summarise, grp.mean=mean(Glucose))
ggplot(diabetes, aes(x=Glucose, color=as.factor(Outcome), fill=as.factor(Outcome))) +
  geom_histogram(position="identity", alpha=0.5) +
  geom_vline(data=mu_Glucose, aes(xintercept=grp.mean, color=as.factor(Outcome)), linetype="dashed") +
  scale_color_manual(values=c("blue", "red", "#56B4E9")) +
  scale_fill_manual(values=c("steelblue1", "brown1", "#56B4E9")) +
  labs(title="Histogram of Glucose for diabetes", x="Glucose", y="Frequency") +
  theme_classic()


Plot a histogram showing the distribution of blood pressure of people with/without diabetes:

Figure 15: R code

Figure 16: Histogram showing the distribution of blood pressure for people with and without diabetes

Comment: The average blood pressure of people with diabetes is slightly higher than that of people without diabetes. However, because the two mean lines are almost the same, this factor is not able to determine diabetes.

library(ggplot2)
library(plyr)
mu_BloodPressure <- ddply(diabetes, "Outcome", summarise, grp.mean=mean(BloodPressure))
ggplot(diabetes, aes(x=BloodPressure, color=as.factor(Outcome), fill=as.factor(Outcome))) +
  geom_histogram(position="identity", alpha=0.5) +
  geom_vline(data=mu_BloodPressure, aes(xintercept=grp.mean, color=as.factor(Outcome)), linetype="dashed") +
  scale_color_manual(values=c("blue", "red", "#56B4E9")) +
  scale_fill_manual(values=c("steelblue1", "brown1", "#56B4E9")) +
  labs(title="Histogram of BloodPressure for diabetes", x="BloodPressure", y="Frequency") +
  theme_classic()


Plot a histogram showing the distribution of insulin level of people with/without diabetes:

Figure 17: R code

Figure 18: Histogram showing the distribution of insulin level for people with and without diabetes

Comment: The average insulin level of people with diabetes is higher than that of people without diabetes. Because the two mean lines are clearly different, this factor can help determine diabetes.

library(ggplot2)
library(plyr)
mu_Insulin <- ddply(diabetes, "Outcome", summarise, grp.mean=mean(Insulin))
ggplot(diabetes, aes(x=Insulin, color=as.factor(Outcome), fill=as.factor(Outcome))) +
  geom_histogram(position="identity", alpha=0.5) +
  geom_vline(data=mu_Insulin, aes(xintercept=grp.mean, color=as.factor(Outcome)), linetype="dashed") +
  scale_color_manual(values=c("blue", "red", "#56B4E9")) +
  scale_fill_manual(values=c("steelblue1", "brown1", "#56B4E9")) +
  labs(title="Histogram of Insulin for diabetes", x="Insulin", y="Frequency") +
  theme_classic()
