Topic
Context
Cardiovascular diseases (CVDs) are the leading cause of death worldwide, responsible for approximately 17.9 million fatalities annually, which represents 32% of all global deaths. Notably, four out of five CVD-related deaths result from heart attacks and strokes, with one-third of these occurring prematurely in individuals under 70. Heart failure is a frequent consequence of CVDs, and this dataset includes 11 features that can aid in predicting potential heart disease.
Individuals with cardiovascular disease or those at elevated risk due to factors like hypertension, diabetes, or hyperlipidaemia require prompt detection and management, and a machine learning model can significantly assist in this process.
Attribute Information
- Age: age of the patient [years]
- Sex: sex of the patient [M: Male, F: Female]
- ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
- RestingBP: resting blood pressure [mm Hg]
- Cholesterol: serum cholesterol [mg/dl]
- FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
- RestingECG: resting electrocardiogram results [Normal: normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]
- MaxHR: maximum heart rate achieved [Numeric value between 60 and
- ExerciseAngina: exercise-induced angina [Y: Yes, N: No]
- Oldpeak: oldpeak = ST [Numeric value measured in depression]
- ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
- HeartDisease: output class [1: Heart disease, 0: Normal]
Theory
Logistic Regression Analysis
Regression methods are essential in data analysis for understanding the relationship between a response variable and one or more explanatory variables. When the outcome variable is discrete, often taking on two or more values, logistic regression has emerged as the standard analytical method across various fields. While linear regression and analysis of variance focus on continuous dependent variables, many scenarios involve binary dependent variables, such as yes/no or ill/healthy. In these cases, independent variables can be either continuous or categorical, and the goal is to uncover the relationship between independent and dependent variables.
Instructor: Prof Nguyễn Tiến Dũng
This article examines binary response variables represented by 0 and 1, which correspond to two distinct classes such as yes/no, win/lose, and ill/healthy.
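\[ Y = \begin{cases} 1, & \text{yes} \\ 0, & \text{no} \end{cases} \]
Given an event frequency \( x \) recorded from \( n \) subjects, we can calculate the probability of that event as \( p = x / n \).
First, we define some notation that we will use throughout: \( p(x) = P[Y = 1 \mid X = x] \).
With a binary (Bernoulli) response, we'll mostly focus on the case when \( Y = 1 \), since with only two possibilities it is trivial to obtain the probability when \( Y = 0 \): \( P[Y = 0 \mid X = x] = 1 - p(x) \).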
Risk can also be expressed in terms of odds. The odds of an event are defined as the ratio of the probability that the event occurs to the probability that it does not occur: \( \text{odds} = p / (1 - p) \).
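We first define the logit of a probability \( p \):
\[ \text{logit}(p) = \log\left( \frac{p}{1 - p} \right) \]
The relationship between \( p \) and \( \text{logit}(p) \) is continuous: as \( p \) increases from 0 to 1, \( \text{logit}(p) \) increases from \( -\infty \) to \( +\infty \).
We now define the logistic regression model:
\[ \log\left( \frac{p(x)}{1 - p(x)} \right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_{p-1} x_{p-1} \]
Immediately we notice some similarities to ordinary linear regression, in particular the right-hand side, which is our usual linear combination of the predictors.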
The left-hand side represents the log(odds), which is the logarithm of the odds. Odds are calculated as the probability of a positive event (𝑌 = 1) divided by the probability of a negative event (𝑌 = 0). When the odds equal 1, both events have equal probabilities, while odds greater than 1 indicate that the positive event is more likely, and odds less than 1 indicate the opposite.
With logistic regression, which uses the Bernoulli distribution, we only need to estimate the Bernoulli distribution's single parameter 𝑝(x), which happens to be its mean.
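While we initially presented ordinary linear regression, logistic regression can be considered simpler in certain aspects. The inverse logit transformation gives an expression for \( p(x) \):
\[ p(x) = \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_{p-1} x_{p-1}}}{1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_{p-1} x_{p-1}}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_{p-1} x_{p-1})}} \]
With \( n \) observations, we write the model indexed with \( i \) to note that it is being applied to each observation:
\[ \log\left( \frac{p(x_i)}{1 - p(x_i)} \right) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i(p-1)} \]
We can apply the inverse logit transformation to obtain \( P[Y_i = 1 \mid X_i = x_i] \) for each observation. Since these are probabilities, it's good that we used a function that returns values between 0 and 1.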
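The two transformations can be illustrated with a minimal Python sketch. The function names and the coefficient values are hypothetical and chosen only for illustration; they are not estimates from this dataset.

```python
import math

def logit(p):
    """Log-odds of a probability p, valid for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Inverse logit: maps any real number to a value in (0, 1)."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)  # avoids overflow for large negative eta
    return e / (1.0 + e)

def predict_probability(betas, x):
    """P[Y = 1 | X = x], where betas[0] is the intercept."""
    eta = betas[0] + sum(b * xi for b, xi in zip(betas[1:], x))
    return inv_logit(eta)

# Hypothetical coefficients, for illustration only
betas = [-1.0, 0.5]
p = predict_probability(betas, [2.0])  # linear predictor is 0, so p = 0.5
odds = p / (1.0 - p)                   # 1.0: both outcomes equally likely
```

Because `inv_logit` undoes `logit`, a fitted linear predictor can always be converted back into a probability between 0 and 1.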
The logistic regression model describes the relationship between an independent variable \( x \) and the log-odds of the event probability \( p \) through the equation \( \text{logit}(p) = \alpha + \beta x \). In this model, \( \alpha \) and \( \beta \) are linear parameters that must be estimated from the data. However, the interpretation of these parameters, particularly \( \beta \), differs significantly from that in linear regression models: \( \beta \) is the change in the log-odds, not in \( p \) itself, per unit increase in \( x \).
Method of Maximum Likelihood
The model parameters can be estimated using the maximum likelihood method, a general technique akin to the least-squares method, which optimizes a goodness-of-fit criterion. Notably, the least-squares method is a special case of the maximum-likelihood procedure, obtained when the errors are assumed to be normally distributed. The likelihood function \( L(\beta) \) represents the probability of the entire observed data set across different parameter values.
To “fit” this model, that is, to estimate the 𝛽 parameters, we will use maximum likelihood.
We first write the likelihood given the observed data.
This is already technically a function of the 𝛽 parameters, but we’ll do some rearrangement to make this more explicit.
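For reference, under the standard assumption of independent Bernoulli responses, the likelihood and its rearranged logarithm can be written as follows. The symbols \( \ell \) (log-likelihood) and \( \eta_i \) (linear predictor of observation \( i \)) are notation introduced here for the sketch.

```latex
\begin{aligned}
L(\beta) &= \prod_{i=1}^{n} p(x_i)^{y_i} \bigl(1 - p(x_i)\bigr)^{1 - y_i} \\
\ell(\beta) = \log L(\beta)
  &= \sum_{i=1}^{n} \Bigl[ y_i \log p(x_i) + (1 - y_i) \log\bigl(1 - p(x_i)\bigr) \Bigr] \\
  &= \sum_{i=1}^{n} \Bigl[ y_i \, \eta_i - \log\bigl(1 + e^{\eta_i}\bigr) \Bigr],
  \qquad \eta_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i(p-1)}
\end{aligned}
```

The last line follows from \( \log\frac{p(x_i)}{1 - p(x_i)} = \eta_i \) and \( 1 - p(x_i) = 1 / (1 + e^{\eta_i}) \), and it makes the dependence on the \( \beta \) parameters explicit.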
Unlike ordinary linear regression, this maximization problem lacks an analytical solution and must be solved numerically, necessitating the use of computer software. In our case, R can efficiently handle this using an iteratively reweighted least squares algorithm. For a deeper understanding, we recommend exploring a machine learning or optimization course that covers various optimization strategies.
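To give a feel for the numerical procedure, here is a minimal Python sketch of Newton-Raphson for a logistic model with a single predictor. For logistic regression the Newton update coincides with the iteratively reweighted least squares update. This is not the R routine used in the report; the function names and the toy data are hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable inverse logit."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic_1d(xs, ys, iters=25, tol=1e-10):
    """Estimate (b0, b1) in logit(p) = b0 + b1 * x by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = 0.0             # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0     # entries of X^T W X (negative Hessian)
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)     # IRLS weight of this observation
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system (X^T W X) d = g in closed form
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Hypothetical, non-separable toy data
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic_1d(xs, ys)
```

At convergence the gradient of the log-likelihood is numerically zero, which is the condition that defines the maximum likelihood estimate.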
Logistic regression analysis is a type of generalized linear model that utilizes a binomial distribution for its response variable. It features a link function that transforms the mean value, allowing for a linear and additive relationship with background variables. In this analysis, the link function is the logit: \( \text{logit}(p) = \log(p / (1 - p)) \).
Optimal model selections
Choosing an appropriate model for multivariable logistic regression analysis can be challenging. In a study with a dependent variable \( y \) and three independent variables \( x_1, x_2, \) and \( x_3 \), various models can be formulated to predict \( y \). These include \( y = f(x_1) \), \( y = f(x_2) \), \( y = f(x_3) \), \( y = f(x_1, x_2) \), \( y = f(x_1, x_3) \), \( y = f(x_2, x_3) \), and \( y = f(x_1, x_2, x_3) \), where \( f \) represents a function. Generally, with \( k \) independent variables, there are \( 2^k - 1 \) different models to choose from.
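The count can be checked by enumerating every non-empty subset of the variables. The following Python sketch does this for the three-variable example; the helper name `candidate_models` is hypothetical.

```python
from itertools import combinations

def candidate_models(variables):
    """Return every non-empty subset of the given variables."""
    models = []
    for size in range(1, len(variables) + 1):
        models.extend(combinations(variables, size))
    return models

models = candidate_models(["x1", "x2", "x3"])
print(len(models))  # 7, i.e. 2**3 - 1
```

With the 11 predictors of this dataset the same enumeration yields 2047 candidate models, which is why an automated selection criterion is useful.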
When a model includes too many independent variables, some may lack correlation with the dependent variable, leading to unnecessary complexity. It is advisable to identify and eliminate these irrelevant variables to avoid skewed results and enhance data interpretation. For instance, a model with three independent variables that predicts data as accurately as or better than a model with five should be preferred for its simplicity and precision. However, it is crucial to ensure that the removed variables are not statistically significant predictors of the dependent variable, as sacrificing accuracy for clarity can be counterproductive.
An adequate predictive model must accurately reflect the observed values of the dependent variable, y. For instance, if the actual value of y is 10, a model predicting 9 is preferable to one predicting 6. Additionally, the model must possess practical significance, meaning it should be grounded in theory or have relevance in biological or clinical contexts. While correlations, such as between phone numbers and fracture rates, may exist, they do not imply causation and thus lack practical significance. A model that is mathematically sound but devoid of real-world relevance is merely a numbers game without scientific value. To evaluate the simplicity and completeness of a model, the Akaike Information Criterion (AIC) serves as a valuable metric: \( \text{AIC} = 2k - 2\ln(\hat{L}) \), where \( k \) is the number of estimated parameters and \( \hat{L} \) is the maximized value of the likelihood.
To identify an effective model, it is essential to aim for the lowest possible AIC value while ensuring that the independent variables are statistically significant. Therefore, the quest for a simple and complete model involves locating the model(s) that exhibit the lowest or nearly lowest AIC values.
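As a small illustration, the AIC can be computed from a model's maximized log-likelihood and its number of estimated parameters using the standard definition AIC = 2k - 2 ln(L). The log-likelihood values and parameter counts below are hypothetical, not results from this report.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 * ln(L)."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical fits: (label, maximized log-likelihood, estimated parameters)
fits = [
    ("full model", -280.0, 12),
    ("reduced model", -280.5, 10),
]
scores = {label: aic(ll, k) for label, ll, k in fits}
best = min(scores, key=scores.get)  # label with the lowest AIC
```

In this toy comparison the reduced model fits slightly worse but saves two parameters, so its AIC (581.0) is lower than that of the full model (584.0) and it would be preferred.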
Information about data
a Chest pain type (ChestPainType)
+ Typical angina (TA): Typical angina is characterized by a feeling of strangulation, tightness, or pressure in the left chest or behind the breastbone, and the pain can radiate to the chin and left hand. It typically occurs regularly, intensifies with physical exertion, strong emotions, or exposure to cold, and usually lasts 3 to 5 minutes.
+ Atypical angina (ATA): Atypical angina differs from classic angina in that it does not present signs of an ischemic heart attack. Symptoms vary widely and include dull or sharp pain, tearing sensations, shortness of breath, and back pain.
+ Non-anginal pain (NAP): This refers to unstable chest pain resulting from a blockage in a coronary artery, occurring without a heart attack. Common symptoms include chest discomfort, which may be accompanied by shortness of breath, nausea, and excessive sweating. Diagnosis is typically made through an electrocardiogram and the evaluation of serologic findings.
+ Asymptomatic (ASY)
b Serum cholesterol concentration
Higher levels of HDL cholesterol in the blood are associated with a reduced risk of cardiovascular disease. Conversely, when HDL cholesterol levels drop below 40 mg/dL, the risk of heart disease increases significantly.
c Resting ECG results
+ ST: The ST segment indicates that the depolarization of the ventricular myocardium has completed. Usually, the ST segment is level with the isoelectric line, like the PR (or TP) interval. Sometimes it lies slightly above the isoelectric line.
d Slope of the peak exercise ST segment (ST_Slope)
+ Up: the ST segment slopes upward during peak exercise
+ Flat: the ST segment stays flat during peak exercise
+ Down: the ST segment slopes downward during peak exercise
And some other medical information.
Code
Clean the data
Comments: There are no missing values to process in df.
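A check of this kind can be sketched in Python with the standard library. The three-row extract, the `count_missing` helper, and the choice of empty strings and "NA" as missing markers are assumptions made for illustration; the report itself performed the check in R on the full data.

```python
import csv
import io

# Hypothetical three-row extract; column names follow the attribute list
raw = """Age,Sex,RestingBP,Cholesterol
40,M,140,289
49,F,,180
37,M,130,NA
"""

def count_missing(text, missing_tokens=("", "NA")):
    """Return the number of missing entries in each column of CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            if value is None or value.strip() in missing_tokens:
                counts[name] += 1
    return counts

missing = count_missing(raw)
```

If every count is zero, as reported for df, no imputation or row removal is needed before modelling.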
A table of quantity statistics
a Sex
There are 178 female patients participating in the test.
There are 655 male patients participating in the test.
b ChestPainType
There are 450 asymptomatic patients.
There are 160 patients with atypical angina.
There are 185 patients with non-anginal pain.
There are 38 patients with typical angina.
c FastingBS
There are 647 patients with fasting blood sugar not exceeding 120 mg/dl.
d RestingECG
There are 171 patients whose resting ECG results show probable or definite left ventricular hypertrophy by Estes’ criteria.
There are 499 patients whose resting ECG results are normal.
There are 163 patients whose resting ECG results show ST-T wave abnormality.
e ExerciseAngina
There are 498 patients without exercise-induced angina.
There are 335 patients with exercise-induced angina.
f ST_Slope
There are 62 patients whose peak exercise ST segment slope is down.
There are 413 patients whose peak exercise ST segment slope is flat.
There are 358 patients whose peak exercise ST segment slope is up.
g HeartDisease
There are 378 patients without heart disease.
There are 455 patients with heart disease.
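Counts like these come from a frequency table of each categorical column. A minimal Python sketch follows; the four records and the `frequency_table` helper are hypothetical stand-ins for the real data frame.

```python
from collections import Counter

# Hypothetical records; the real data has one row per patient
records = [
    {"Sex": "M", "ChestPainType": "ASY"},
    {"Sex": "F", "ChestPainType": "ATA"},
    {"Sex": "M", "ChestPainType": "NAP"},
    {"Sex": "M", "ChestPainType": "ASY"},
]

def frequency_table(rows, column):
    """Count how many rows fall into each level of a categorical column."""
    return Counter(row[column] for row in rows)

sex_counts = frequency_table(records, "Sex")
cpt_counts = frequency_table(records, "ChestPainType")
```

A useful sanity check is that the counts of every column add up to the same number of patients, as they do for each variable listed above.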
Plot the Histogram and Plot the Barplot
a Age and RestingBP histogram
- The age histogram indicates that the majority of patients are between 50 and 60 years old, with the highest concentration in the 55-60 age group, totaling 184 individuals. Conversely, the lowest number of patients, only 12, falls within the 30-year age range. The distribution has the symmetrical bell shape characteristic of a normal distribution.
- The resting BP histogram shows that the majority of patients have a resting BP from 125 to 150 mmHg. The most and least common resting BP ranges among patients are 110 – 125 mmHg (344 people) and 75 – 90 mmHg (1 person), respectively. Unusually, 1 patient has a resting BP of 0 – 10 mmHg.
b Cholesterol and MaxHR histogram
- The cholesterol histogram reveals that the majority of patients have serum cholesterol levels ranging from 200 to 300 mg/dl, with the highest concentration observed in the 250 to 275 mg/dl range, accounting for 238 patients. In contrast, the lowest number of patients (2 patients) falls within the 500 to 575 mg/dl category. Unusually, 148 people have a serum cholesterol of 0 – 25 mg/dl.
- The max HR histogram shows that the majority of patients have a max HR from 125 to 175. The most common max HR range is 125 – 150 (224 people), and the least common ranges are 50 – 75 and around 200 (4 people). It has the symmetrical bell shape of a normal distribution.
c Oldpeak histogram
- The oldpeak histogram shows that most patients have an oldpeak between 0 and 3.
- The most common oldpeak range is -1 to 0, with a total of 404 individuals, while only 1 patient has an oldpeak below -2. This data reflects a symmetrical bell-shaped normal distribution.
d Sex and CPT barplots
e FastingBloodSugar and Resting ECG barplots
f ExerciseAngina, ST_Slope and HeartDisease barplots
g Sex and CPT barplots for HeartDisease
- The percentage of patients with heart disease is higher in the male group than in the female group.
- The proportion of patients with heart disease is higher in the group of patients with chest pain than among patients with other symptoms.
h FastingBS and RestingECG barplots for HeartDisease
- Patients with blood sugar levels exceeding 120 mg/dl exhibited a higher incidence of heart disease than those with blood sugar levels of 120 mg/dl or below.
- Patients with abnormal resting ECG results show a higher prevalence of heart disease than those with other ECG findings. However, since all three resting ECG groups contain a significant number of patients with heart disease, these results do not effectively aid in predicting the likelihood of disease.
i ExerciseAngina and ST_Slope barplots for HeartDisease
- Patients with angina during exercise exhibit a higher percentage of heart disease than those without exercise-induced angina.
- The proportion of patients with heart disease is higher in the group with a flat ST slope than in the other groups.
Based on the plots, the variables ST_Slope, ExerciseAngina, FastingBS, ChestPainType, Oldpeak, MaxHR, Cholesterol and Age have an effect on predicting disease probability.
j Age vs HeartDisease histogram
Older individuals are at a greater risk of developing heart disease, as evidenced by the fact that the average age of patients with the condition is higher than that of those without it.
- The highest percentage of heart disease is in the 56-57 age group. In contrast, the lowest is in the under-30 age group.
- This is not a normal distribution because there is growth on the right side of the mean.
k RestingBP vs HeartDisease histogram
- Patients with disease exhibited a higher mean resting blood pressure than those without disease. However, the frequency distributions of both groups are largely comparable. Consequently, resting blood pressure is not an effective predictor of an individual's likelihood of developing cardiovascular disease.
- This is not a normal distribution because the values on the right side of the mean are higher and fluctuate.
l Cholesterol vs HeartDisease histogram
- The average cholesterol level in patients with the disease is lower than in those without the disease. Nevertheless, more patients with the disease than without it have cholesterol levels exceeding 300.
- Therefore, cholesterol is a relatively accurate predictor of the probability that a person has heart disease.
- This is not a normal distribution because the values on the left side of the mean are higher and change unpredictably.
m MaxHR vs HeartDisease histogram
- Patients with the disease exhibit a mean maximum heart rate that is significantly lower than that of patients without the disease. This lower distribution of maximum heart rates in individuals with the disease aids in predicting the likelihood of heart disease.
- This is not a normal distribution because it does not peak in the middle and fall off gradually on both sides.
n Oldpeak vs HeartDisease histogram
- Patients with the disease exhibit a higher mean Oldpeak (ST depression) than those without the disease, indicating a significant difference in the distribution of Oldpeak values. This suggests that Oldpeak may serve as a predictive factor for the likelihood of heart disease.
- This is not a normal distribution because it does not peak in the middle and fall off gradually on both sides.
Model
The result of the initial model with 11 variables
c The result of the model after eliminating RestingECG from the initial model
d The result of the model after eliminating RestingBP from the previous model
e The result of the model after eliminating MaxHR from the previous model
The optimization process of our model involved several steps, beginning with an initial model that included 11 variables, resulting in an AIC of 577.83. We then removed one variable, RestingECG, which led to a reduced AIC of 574.82. This process continued for three iterations, ultimately concluding with the model that achieved the lowest AIC value.
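The elimination procedure can be sketched in Python as a generic backward search: at each step every single-variable removal is scored, and the one with the lowest AIC is accepted if it improves on the current model. This is an assumed reading of the procedure, not the exact routine used in the report. The `toy_aic` function and its `gain` values are hypothetical stand-ins for refitting a logistic model.

```python
def backward_eliminate(variables, aic_of):
    """Drop one variable at a time for as long as doing so lowers the AIC."""
    current = list(variables)
    current_aic = aic_of(current)
    history = [(tuple(current), current_aic)]
    while len(current) > 1:
        # Score every model obtained by removing exactly one variable
        scored = []
        for drop in current:
            kept = [v for v in current if v != drop]
            scored.append((aic_of(kept), kept))
        best_aic, best_kept = min(scored, key=lambda pair: pair[0])
        if best_aic >= current_aic:
            break  # no single removal improves the AIC
        current, current_aic = best_kept, best_aic
        history.append((tuple(current), current_aic))
    return current, history

# Toy stand-in for refitting: each variable has a hypothetical contribution
# to the fit, and every variable kept costs a penalty of 2.
gain = {"A": 30.0, "B": 0.5, "C": 12.0, "D": 1.0}

def toy_aic(kept):
    return 100.0 - sum(gain[v] for v in kept) + 2 * len(kept)

selected, steps = backward_eliminate(["A", "B", "C", "D"], toy_aic)
```

In the toy run the AIC falls from 64.5 to 63.0 to 62.0 as the weak variables B and D are removed, and the search stops because dropping A or C would raise it.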
Summary of the optimized model
So, the optimal logistic regression model has the form:
So, the model has the form:
f The prediction for training set
- There are 61 patients who do not have heart disease that are mislabelled as having heart disease (type I error)
- There are 48 patients who have heart disease that are mislabelled as not having heart disease (type II error)
As shown above, the number of false predictions is relatively small compared to the number of correct predictions, which suggests that our model is quite capable and accurate.
- The accuracy on the training set is 86.80%, which means these variables can predict heart disease fairly accurately.
g The prediction for testing set
- There are 4 patients who do not have heart disease that are mislabelled as having heart disease (type I error)
- There are 4 patients who have heart disease that are mislabelled as not having heart disease (type II error)
As shown above, the number of false predictions is relatively small compared to the number of correct predictions, which suggests that our model is quite capable and accurate.
Our model achieves an impressive accuracy of 90.59% in predicting the testing set, indicating its reliability Given that the model has not encountered this data before, it demonstrates the capability to effectively predict heart disease conditions in new patients, provided they supply the necessary variables.
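The accuracy figure can be related to the error counts with a short Python sketch. The 4 false positives and 4 false negatives are the testing-set errors reported above. The testing-set size of 85 is an assumption (it is consistent with the reported 90.59%), and the split of the correct predictions into true positives and true negatives is not given in the text, so the values 40 and 37 are placeholders.

```python
def confusion_summary(tp, tn, fp, fn):
    """Accuracy and error rates from the four confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "false_positive_rate": fp / (fp + tn),  # type I error rate
        "false_negative_rate": fn / (fn + tp),  # type II error rate
    }

# fp and fn come from the report; tp and tn are placeholders that sum to
# 77 correct predictions under the assumed total of 85 patients.
summary = confusion_summary(tp=40, tn=37, fp=4, fn=4)
print(round(100 * summary["accuracy"], 2))  # 90.59
```

Only the accuracy is independent of the placeholder split; the two error rates would change with the actual numbers of true positives and true negatives.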