
Major Assignment in Probability and Statistics, Ho Chi Minh City University of Technology


DOCUMENT INFORMATION

Basic information

Title: A Stroke Happens When There Is a Disruption or Reduction in the Blood Flow to Different Parts of the Brain
Author: Group
Supervisor: Professor Nguyen Tien Dung
University: Ho Chi Minh City University of Technology
Major: Probability and Statistics
Type: Assignment Report
Year of publication: 2022
City: Ho Chi Minh City
Number of pages: 33

Structure

  • I. Introduction
    • 1. Topic
    • 2. Problem
  • II. Problem solving
    • 1. Data analysis and methods theory
      • 1.1. Data analysis
        • 1.1.1. Import the data
        • 1.1.2. Clean the data
        • 1.1.3. Plot graphs for each factor
        • 1.1.4. Graphs of each factor in correlation to strokes
      • 1.2. Theoretical basis of logistic regression model
    • 2. Apply logistic model and obtain the prediction
      • 2.1. Logistic regression model
      • 2.2. Prediction

Content

A stroke happens when there is a disruption or reduction in the blood flow to different parts of the brain, which causes the cells there to stop receiving the nutrients and oxygen they need and to die. A stroke is a medical emergency that has to be treated right away. To stop additional damage to the affected area of the brain and potential consequences in other body parts, early detection and adequate therapy are necessary. According to the World Health Organization (WHO), fifteen million people suffer strokes annually, with one victim passing away every four to five minutes. According to the Centers for Disease Control and Prevention (CDC), strokes are the sixth most common cause of death in the United States. About 11% of people die from noncommunicable diseases like stroke each year. Approximately 795,000 Americans experience the incapacitating symptoms of strokes each year. It is the fourth most common cause of death in India. There are two types of strokes: ischemic and hemorrhagic. In a hemorrhagic stroke, a weak blood vessel bursts and bleeds into the brain; in an ischemic stroke, clots block the blood flow. Stroke can be prevented by living a healthy, balanced lifestyle that excludes harmful habits like smoking and drinking and maintains a healthy body mass index (BMI), a healthy average blood glucose level, and good heart and kidney function. Predicting a stroke is crucial, and it needs to be treated right away to prevent irreparable harm or death. With the advancement of medical technology, it is now possible to use... methods to predict the onset of a stroke.

Introduction

Topic

A stroke occurs when blood flow to the brain is disrupted, leading to cell death due to lack of oxygen and nutrients. It is a medical emergency requiring immediate treatment to prevent further brain damage and complications. The World Health Organization reports that 15 million people suffer strokes annually, with a death occurring every four to five minutes. In the United States, strokes rank as the sixth leading cause of death and affect approximately 795,000 individuals each year. There are two main types of strokes: ischemic, caused by clots, and hemorrhagic, resulting from a burst blood vessel. Preventive measures include maintaining a healthy lifestyle, avoiding smoking and excessive alcohol, and managing body weight and blood glucose levels. Early detection and treatment are vital to minimize irreversible damage. This paper explores the use of logistic regression models for predicting brain strokes, demonstrating that these models outperform other classification algorithms in accuracy, despite being trained on textual data rather than real-time brain images.

This research contributes by applying various machine learning models to a publicly available dataset. Unlike previous studies that primarily focused on a single model for stroke disease prediction, our approach offers a comprehensive analysis. The results and comparisons of these models are discussed in the following section.

Problem

The original dataset is provided at: kaggle.com

• age: age of the patient

• hypertension: 0 if the patient does not have hypertension, 1 if the patient has hypertension

• heart_disease: 0 if the patient does not have any heart disease, 1 if the patient has a heart disease

• ever_married: "No" or "Yes"

• work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"

• Residence_type: "Rural" or "Urban"

• avg_glucose_level: average glucose level in blood

• smoking_status: "formerly smoked", "never smoked", "smokes"

• stroke: 1 if the patient had a stroke or 0 if not

Our aim: use RStudio to predict whether a patient has a stroke or not based on the other attributes.

- First, we analyze the dataset and from that choose suitable methods

- Second, we get the prediction from the methods

Problem solving

Data analysis and methods theory

Firstly, we import the dataset healthcare-dataset-stroke-data.csv into RStudio with read.csv() and then view it in a table with View().

# Import the data from the csv file
data <- read.csv("healthcare-dataset-stroke-data.csv")
View(data)
summary(data)

The command `sum(is.na())` counts the total number of N/A values in the dataset, which gives 31. Subsequently, the function `na.omit()` is executed to remove all rows that contain these N/A values. Finally, the `summary()` function is used to display key statistics for each column, including the minimum, first quartile, median, mean, third quartile, and maximum values.
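The cleaning step can be tried on a small example. This is a minimal sketch, assuming a toy data frame named `raw` with made-up values in place of the csv file; only `is.na()`, `na.omit()` and `summary()` come from the text above.

```r
# Toy stand-in for the csv file (hypothetical values);
# the column names follow the dataset description.
raw <- data.frame(
  age = c(67, 61, 80, 49, 79),
  avg_glucose_level = c(228.69, 202.21, 105.92, 171.23, 174.12),
  bmi = c(36.6, NA, 32.5, 34.4, 24.0),
  stroke = c(1, 1, 1, 1, 0)
)

n_missing <- sum(is.na(raw))   # total number of NA cells in the data frame
clean <- na.omit(raw)          # drop every row that contains at least one NA

print(n_missing)
print(nrow(clean))
summary(clean)                 # min, quartiles, median, mean, max per column
```

Note that `read.csv()` only treats the text NA as missing by default. If a file marks missing values with the text N/A, passing `na.strings = "N/A"` to `read.csv()` is one way to make `is.na()` detect them.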

1.1.3 Plot graphs for each factor

For continuous variables we use histograms, and for categorical variables we use barplots.

The code produces a series of plots of patient demographics and health factors. It starts with a barplot of the number of patients by gender, with the y-axis limited to 500, and an age histogram limited to 80 years. Next come barplots for hypertension and heart disease, with y-axis limits of 600 and 750 respectively, and barplots for marital status and work type, with limits of 600 and 400 respectively.

The remaining plots are a barplot of residence type, a histogram of the average glucose level with labeled axes, a histogram of the BMI distribution, and a barplot of the frequency of each smoking status.

400), main="Smoking status barplot")
barplot(table(data$stroke), xlab="Stroke", ylab=y, ylim=c(0, 600), main="Stroke barplot")

In the code, we assign the string "No patients" to the variable y for convenient use throughout the script. The command `par(mfrow=c(1,2))` divides the plotting area into two columns, so that two graphs are displayed side by side. The `barplot()` function takes the data along with axis labels (xlab, ylab), y-axis limits (ylim), and a title (main). Similarly, the `hist()` function creates a histogram; it also takes axis labels and limits, and additionally prints a count above each column when the parameter `label=T` is set.
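The plotting pattern described here can be sketched as follows. The data frame `toy` and its simulated values are assumptions made for illustration, not the report's data; the axis limits are chosen to fit the toy data.

```r
set.seed(1)
# Hypothetical toy data standing in for the cleaned dataset
toy <- data.frame(
  gender = sample(c("Female", "Male"), 100, replace = TRUE),
  age = round(runif(100, 1, 80))
)

y <- "No patients"                  # shared y-axis label
old_par <- par(mfrow = c(1, 2))     # 1 row, 2 columns of plots

gender_counts <- table(toy$gender)
barplot(gender_counts, xlab = "Gender", ylab = y,
        ylim = c(0, 100), main = "Gender barplot")

# labels = TRUE prints the count above each histogram column
h <- hist(toy$age, xlab = "Age", ylab = y, xlim = c(0, 80),
          main = "Age histogram", labels = TRUE)

par(old_par)                        # restore the previous layout
```

`hist()` returns the bin counts invisibly, which is convenient for checking that every observation landed in a bin.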

Result and comments:

a. Gender barplot and age histogram

+ For gender, there are more female patients than male patients.

+ The patients have various ages; however, it seems that more older people participate.

b. Hypertension barplot and heart disease barplot

+ Most patients do not have hypertension or heart disease.

c. Marital status barplot and work type barplot

+ Most patients have been married at least once.

+ Most patients work in private companies.

d. Residence type barplot and average glucose level histogram

+ The numbers of patients from rural and urban areas are almost equal.

+ The largest group of patients has an average glucose level of 60-100.

e. BMI histogram and smoking status barplot

+ The largest group of patients has a BMI of 20-35.

+ Most patients have never smoked.

f. Stroke barplot

+ More than 400 patients did not have a stroke and fewer than 200 did.

1.1.4 Graphs of each factor in correlation to strokes

Now, to determine which factors affect a patient's chance of suffering a stroke, we plot diagrams showing the proportion of patients who did and did not suffer a stroke under each category.

The code draws barplots and histograms of the number of patients with and without strokes for each factor. The first barplot shows gender against stroke: it is titled "Barplot gender/stroke", takes its y-axis label from the variable y, and limits the y-axis to 0-400. It uses dark grey and dark blue, with a legend that identifies the two groups. The second barplot shows hypertension against stroke: it is titled "Barplot hypertension/stroke" and uses the same y-axis label, limits, and colors.

"darkblue"), legend.text = rownames(table(data$stroke, data$hypertension)), beside = TRUE)
barplot(table(data$stroke, data$heart_disease), main = "Barplot heart disease/stroke", ylab = y, ylim = c(0, 500), col = c("darkgrey", "darkblue"), legend.text = rownames(table(data$stroke, data$heart_disease)), beside = TRUE)
barplot(table(data$stroke, data$ever_married), main = "Barplot marital status/stroke", ylab = y, ylim = c(0, 400), col = c("darkgrey",

The code continues with further barplots relating stroke occurrence to demographic factors: one for marital status, one for work type, and one for residence type. Each plot has a title and uses the same color coding.

For age, average glucose level, and body mass index (BMI), histograms show the distribution for patients with and without a stroke. Each histogram includes vertical lines marking the mean value of each stroke group, which makes the two groups easier to compare.

Now we want to separate the patients for each factor into those who have and those who do not have a stroke. To do so, we add the following arguments to barplot(): “beside=TRUE” displays the columns as side-by-side bars, while “legend.text=rownames()” labels the legend with the corresponding row names of the table. To differentiate the groups, we assign dark grey to "no" and dark blue to "yes" using the col argument.
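A grouped barplot of this kind can be reproduced on a small example. This is a sketch under the assumption of a made-up ten-row data frame `toy`; the arguments follow the calls quoted in the report.

```r
# Hypothetical toy data: 1 = yes, 0 = no for both columns
toy <- data.frame(
  stroke = c(0, 0, 0, 1, 0, 1, 1, 0, 0, 1),
  hypertension = c(0, 0, 1, 1, 0, 1, 0, 0, 0, 1)
)

# Two-way table: rows are stroke groups, columns are hypertension groups
counts <- table(toy$stroke, toy$hypertension)

barplot(counts, main = "Barplot hypertension/stroke", ylab = "No patients",
        ylim = c(0, 6), col = c("darkgrey", "darkblue"),
        legend.text = rownames(counts), beside = TRUE)
```

Because the rows of the table are the stroke groups, each pair of adjacent bars compares patients without and with a stroke inside one hypertension category.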

Similarly, we want to split the data for the histograms, but this time we use the command “ggplot()” instead. For this graph, we want to display the columns, a legend, and a vertical line representing the mean value of each group.
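One possible way to build such a histogram is sketched below. The simulated data frame `toy`, the use of `aggregate()` to compute the group means, and the chosen bin width are assumptions for illustration; only the name mu_age and the use of ggplot() come from the report. The plot is drawn only if the ggplot2 package is installed.

```r
set.seed(2)
# Hypothetical simulated ages for patients without (0) and with (1) a stroke
toy <- data.frame(
  stroke = factor(rep(c(0, 1), each = 50)),
  age = c(rnorm(50, mean = 40, sd = 15), rnorm(50, mean = 65, sd = 10))
)

# Mean age per stroke group; mu_age holds one row per group
mu_age <- aggregate(age ~ stroke, data = toy, FUN = mean)
print(mu_age)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(toy, aes(x = age, fill = stroke)) +
    geom_histogram(binwidth = 5, position = "identity", alpha = 0.5) +
    # one dashed vertical line per group at the group mean
    geom_vline(data = mu_age, aes(xintercept = age, color = stroke),
               linetype = "dashed") +
    labs(title = "Histogram age/stroke", x = "Age", y = "No patients")
  print(p)
}
```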

+ First of all, we need to assign a variable for the mean line, e.g. mu_age.

+ We suspect that patients who have hypertension may suffer strokes.

b. Heart disease and marital status

+ It is surprising that, contrary to our initial expectation, the percentage of married patients who have suffered strokes is much higher than that of unmarried patients.

The work type barplot suggests that the type of employment, whether private or governmental, does not significantly influence stroke occurrence. However, a noticeably higher stroke rate is observed among self-employed individuals, which may be due to the characteristics of the sample dataset. Additionally, the impact of residence type on stroke rates warrants further investigation.

+ Residence appears to have no effect on stroke, since the percentage of strokes is similar for rural and urban areas.

d. Age

Patients who have experienced a stroke tend to be older on average than those who have not, supporting the prediction that advanced age, along with diminished health and bodily function, increases the risk of strokes. Additionally, monitoring average glucose levels is crucial in understanding stroke risk factors.

Patients who have experienced strokes tend to have a higher mean average glucose level than those who have not. A large number of stroke patients have glucose levels of 150 mg/dL or higher. This agrees with information from VinMec, which indicates that individuals with diabetes have average glucose levels exceeding 126 mg/dL, and elevated glucose levels are a known risk factor for strokes.

+ From the histogram, the mean BMI of those who have had strokes is close to the mean BMI of those who have not. We suspect that BMI does not affect the chance of stroke.

Age, smoking status, and a medical history of diabetes and hypertension are suspected to influence the occurrence of strokes, while factors such as BMI, residence, and type of work appear to have no significant impact. The effects of gender and marital status on stroke incidence remain unclear. Because the stroke outcome is binary, taking the values 1 and 0, which mean “yes” and “no”, we decide that the logistic regression model is suitable for our problem.

1.2 Theoretical basis of logistic regression model

Logistic regression is a statistical method that predicts a binary outcome, like yes or no, by analyzing prior observations in a dataset. It models the relationship between one or more independent variables and a dependent variable. When the dependent variable is a count, Poisson regression is the appropriate statistical approach. Multinomial logistic regression is recommended when the dependent variable has more than two categories.
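For reference, the standard form of the binary logistic regression model can be written as follows. The symbols are notation introduced here for illustration: p is the probability that the outcome equals 1, x_1, ..., x_k are the independent variables, and β_0, ..., β_k are the coefficients to be estimated.

```latex
\[
P(Y = 1 \mid x_1, \dots, x_k) = p
  = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}
\]
\[
\ln\frac{p}{1 - p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
\]
```

The second line is the log-odds (logit) form: the model is linear in the predictors on the log-odds scale, and the first line maps that linear value back to a probability between 0 and 1.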

Apply logistic model and obtain the prediction

To verify our initial assumption, we construct a logistic regression model for stroke prediction using the “glm()” command, which fits generalized linear models. A generalized linear model extends multiple regression to responses such as binary outcomes by means of a link function, and it can handle both qualitative and quantitative predictors.

The initial model for predicting stroke is a generalized linear model (GLM) with the following predictors: gender, age, hypertension, heart disease, marital status, work type, residence type, average glucose level, body mass index (BMI), and smoking status. The model is then refined through a stepwise selection process, resulting in an optimized model. The summary of the optimized model shows which factors are significant for stroke risk.

First of all, we create a generalized linear model using the “glm()” command with data taken from our dataset (“dataa”), fitting it to binomial data (family=binomial).

After creating the initial model, we utilize the "step()" command to eliminate factors that do not influence stroke probability, followed by summarizing the final model with the "summary()" function.
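The workflow of glm(), step() and summary() can be sketched on simulated data. Everything about the data frame `toy` below is an assumption made for illustration (the outcome is simulated to depend on age and hypertension but not on bmi); it is not the report's dataset or its fitted model.

```r
set.seed(3)
n <- 500
# Hypothetical simulated predictors
toy <- data.frame(
  age = runif(n, 20, 80),
  hypertension = rbinom(n, 1, 0.2),
  bmi = rnorm(n, 28, 5)          # simulated as unrelated to the outcome
)
# Simulate a binary outcome that depends on age and hypertension only
p <- plogis(-6 + 0.08 * toy$age + 0.8 * toy$hypertension)
toy$stroke <- rbinom(n, 1, p)

# Initial model with all predictors, fitted as a binomial GLM (logit link)
initial_model <- glm(stroke ~ age + hypertension + bmi,
                     data = toy, family = binomial)

# Stepwise elimination guided by AIC; trace = 0 hides the step-by-step log
final_model <- step(initial_model, trace = 0)
summary(final_model)
```

Leaving `trace` at its default prints each elimination step with its AIC, which is the gradual process the report describes.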

Starting from the initial model, step() evaluates and eliminates factors one step at a time, ultimately minimizing the AIC value for this dataset.

The final model is consistent with our initial analysis and suggests that gender and marital status are irrelevant factors. The key traits of the model are summarized below.

The p-values for avg_glucose_level, smoking_statusnever smoked and smoking_statussmokes are still not statistically significant (> 5%); however, this is the lowest-AIC model that we could find.

From the theory, we will have the below equation from the model:

For our final part of this assignment, we check the efficiency of our model.

#predict stroke_predict
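One common way to check a fitted logistic model is to compare its predicted labels with the observed outcomes. The sketch below is an illustration on simulated data, not the report's own code: the data frame `toy`, the single-predictor model, and the 0.5 cutoff are all assumptions; only the name stroke_predict appears in the report.

```r
set.seed(4)
n <- 400
# Hypothetical simulated data with one predictor
toy <- data.frame(age = runif(n, 20, 80))
toy$stroke <- rbinom(n, 1, plogis(-6 + 0.1 * toy$age))

model <- glm(stroke ~ age, data = toy, family = binomial)

# Predicted probabilities, then 0/1 labels using an assumed 0.5 cutoff
stroke_prob <- predict(model, newdata = toy, type = "response")
stroke_predict <- ifelse(stroke_prob > 0.5, 1, 0)

# Confusion table and share of correctly classified patients
confusion <- table(actual = toy$stroke, predicted = stroke_predict)
accuracy <- mean(stroke_predict == toy$stroke)
print(confusion)
print(accuracy)
```

Accuracy measured on the same data used for fitting tends to be optimistic; holding out part of the data for testing gives a fairer estimate.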

Date posted: 28/06/2023, 01:12

References

[1] “Concept of stroke by healthline,” [Online]. Available: https://www.cdc.gov/stroke/index.htm
[2] Stroke Prediction Dataset, Kaggle. Available: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset?fbclid=IwAR29WusgiHPUFjUF_apHO1M0ymLH2HtAPzNwt6rcP2b8Ph1pIO4jepDkChU
[3] J Jambers, D.Hand, W.Hardle, Introductory Statistic with R.
[4] Nguyễn Tiến Dũng (chief editor), Nguyễn Đình Huy (2019), Xác suất - Thống kê & Phân tích số liệu.
[5] “Statistics of stroke by Centers for disease control and prevention,” [Online]. Available: https://www.cdc.gov/stroke/facts.htm
