Topic: Comparison of logistic regression and random forest algorithms in the task of classifying people with heart attacks




Subject: Machine Learning

Instructor: Huynh Ngoc Tin

Members: Tran Van Luong - 31012202541

Nguyen Duc Phat - 31012202537

HO CHI MINH CITY


Table of Contents

CHAPTER 1: INTRODUCTION

1 Introduction

2 Status of People with Heart Disease in the Modern Era

3 Context of Science, Technology and Advances in Medicine

4 Reason for Choosing Topic and Algorithm

5 Inspiration and Future Development Directions

CHAPTER 2: THEORETICAL BASIS

1 Introduction to Python

1.1 Overview of Python

1.2 Popularity and Optimization

1.3 Workability and Flexibility

2 Introduction to Colab Environment

2.1 Overview of Colab

2.2 Convenience and Resources

3 Introduction to Machine Learning

3.1 Overview of Machine Learning

3.2 Industry Base and Value

4 Introduction to Logistic Regression Algorithm

4.1 Highlights and Overview

4.2 Characteristics and Applications

5 Introduction to Random Forest Algorithm

5.1 Outstanding Features and Optimization

5.2 Work Ability and Tasks

5.3 Accuracy and the Future of Random Forest

6 Extension: Combining Python, Colab and Machine Learning

CHAPTER 3: DATA VISUALIZATION

CHAPTER 4: TRAINING AND TEST MODEL

1 Prepare the data

2 Model Training and Accuracy

2.1 Logistic Regression Model

2.2 Random Forest Model

1 Conclusions


CHAPTER 1: INTRODUCTION

1 Introduction

In the modern era of industrial progress and Industry 4.0 technology, human life has been enhanced, but new challenges and risks have also emerged. In this context, the development of medicine plays an important role in improving quality of life and confronting increasingly complex diseases. Among serious medical problems, cardiovascular disease occupies an important position and calls for a close combination of medicine and machine learning technology to improve prediction and care.

2 Status of People with Heart Disease in the Modern Era

The rate of people with heart disease is increasing, which not only affects personal health but also puts serious pressure on medical and social resources. With millions of people dying every year from complications of cardiovascular disease and the number of people left disabled increasing, heart attacks are not only a medical problem but also a major threat to human life. Researching and applying accurate prediction methods for heart disease is therefore important to strengthen prevention and treatment while minimizing the burden on the health system.

3 Context of Science, Technology and Advances in Medicine

Scientific and technological development plays a large role in the progress of medicine. Information technology and big data open the door to research on and application of machine learning in medicine. Technology has become an important tool for processing and analyzing large volumes of data from medical research, creating new insights and opportunities to improve the quality of healthcare. Modern computing resources and big data processing capabilities have created new opportunities in disease classification and prediction, enhancing diagnostic and treatment performance.

4 Reason for Choosing Topic and Algorithm

The choice of the topic "Comparing Logistic Regression and Random Forest Algorithms in Classifying People with Heart Attack" comes from a clear awareness of the importance of prediction in the prevention and care of heart disease. Faced with the growing challenge of cardiovascular disease, this research is not only a new step forward but also a response to an urgent medical need.


Choosing between Logistic Regression and Random Forest raises an important question and, at the same time, highlights the diversity of the machine learning toolkit. Logistic regression, with its simplicity and popularity, is often used in medical applications. In contrast, Random Forest represents diversity and flexibility and works well with large datasets. A comparison between them provides a clear understanding of the advantages and limitations of each algorithm and offers insight into the most effective approach to heart disease classification.

5 Inspiration and Future Development Directions

The general inspiration for the project lies in the desire to contribute to the modern medical dialogue by providing new knowledge and practical applications that improve the prediction and prevention of heart disease. In the future, the research can be expanded by integrating other machine learning methods and using multi-source data to improve the applicability and accuracy of the model. Continued development in this field is not only an opportunity but also a responsibility, contributing to the common mission of modern medicine: comprehensive and effective health care for the community.


CHAPTER 2: THEORETICAL BASIS

1 Introduction to Python

1.2 Popularity and Optimization

Python has achieved a high level of popularity thanks to its active developer community and strong integration with many libraries and frameworks. Combined with optimized libraries such as NumPy, pandas and scikit-learn, Python provides a powerful programming environment for data science and machine learning tasks.

1.3 Workability and Flexibility

Python's capabilities are diverse, ranging from simple scripts to complex applications. With its easy-to-read syntax, Python is suitable for both beginners and experienced professionals. Its flexibility is also demonstrated by its ability to integrate easily with other languages and systems.

2 Introduction to Colab Environment

2.1 Overview of Colab

Google Colaboratory, commonly known as Colab, is a free online programming environment provided by Google. Built on the Jupyter Notebook platform, Colab allows users to write and share Python source code and to perform data analysis and machine learning without installing an environment on their personal machine.

2.2 Convenience and Resources

Colab has a simple interface, direct integration with Google Drive and free access to GPUs for heavy tasks. This creates a flexible, convenient and resource-saving working environment. Colab's online sharing and collaboration features also make teamwork more convenient.

3 Introduction to Machine Learning

3.1 Overview of Machine Learning

Machine learning is a field of artificial intelligence in which computers improve their performance on a task over time by learning from data rather than by following explicitly programmed rules. This allows them to build predictive or classification models without direct human intervention.

3.2 Industry Base and Value

Machine learning plays an important role in many fields, such as medicine, finance and marketing. The ability to learn from large datasets and apply that knowledge enhances prediction and decision-making and provides great value in optimizing processes.


4 Introduction to Logistic Regression Algorithm

4.1 Highlights and Overview

Logistic regression is a widely used classification algorithm in machine learning. Designed for two-class classification tasks, it produces a linear decision boundary based on the input features. This makes it simple and easy to interpret, and well suited to prediction problems.

Although simple, logistic regression solves many classification problems effectively. By passing a linear combination of the input variables through the sigmoid function, it models the relationship between the inputs and the probability of belonging to a class. This makes it a popular choice in medical applications.
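The link between the linear boundary and the sigmoid function can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the weights, bias and feature values are hypothetical and are not fitted to the heart dataset, and the function names are chosen for this example.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real-valued score to a probability between 0 and 1."""
    # Two branches avoid overflow in math.exp for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(features, weights, bias):
    """Probability of the positive class for one sample."""
    # Linear part: a weighted sum of the input features plus a bias.
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)


def predict_class(features, weights, bias, threshold=0.5):
    """Turn the probability into a 0/1 decision."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0


# Hypothetical weights for two standardized features (assumption, not fitted).
weights = [0.8, -1.2]
bias = 0.1
p = predict_proba([1.0, 0.5], weights, bias)
print(round(p, 3))
```

A probability of exactly 0.5 corresponds to a score of 0, which is why the decision boundary is linear in the input features.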

4.2 Characteristics and Applications

This algorithm is suitable for classification problems, especially when the log-odds of belonging to a group depend approximately linearly on the input variables. These properties make it easy to experiment with and apply in real-life situations, such as heart disease classification research.

5 Introduction to Random Forest Algorithm

5.1 Outstanding Features and Optimization

Random Forest is a powerful ensemble algorithm built on decision trees. Combining the predictions of many decision trees helps increase accuracy and reduce overfitting. It is especially effective when dealing with large datasets and high-dimensional features.
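The idea of combining many trees can be illustrated with a deliberately simplified sketch. It is not a full Random Forest: each "tree" here is a one-split decision stump trained on a bootstrap sample and a single randomly chosen feature, whereas real implementations grow deeper trees and sample a feature subset at every split. The toy data and all function names are invented for illustration.

```python
import random
from collections import Counter


def bootstrap_sample(rows, labels, rng):
    """Draw len(rows) training examples with replacement."""
    picks = [rng.randrange(len(rows)) for _ in rows]
    return [rows[i] for i in picks], [labels[i] for i in picks]


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def train_stump(rows, labels, rng):
    """Fit a one-split tree on one randomly chosen feature."""
    feature = rng.randrange(len(rows[0]))
    best = None
    # Try every observed value as a threshold; keep the one with fewest errors.
    for threshold in sorted({row[feature] for row in rows}):
        left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
        right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
        left_label = majority(left)
        right_label = majority(right) if right else left_label
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, feature, threshold, left_label, right_label)
    return best[1:]


def train_forest(rows, labels, n_trees=25, seed=0):
    """Train n_trees stumps, each on its own bootstrap sample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample_rows, sample_labels = bootstrap_sample(rows, labels, rng)
        forest.append(train_stump(sample_rows, sample_labels, rng))
    return forest


def forest_predict(forest, row):
    """Each stump votes; the majority class wins."""
    votes = [left if row[feature] <= threshold else right
             for feature, threshold, left, right in forest]
    return majority(votes)


# Invented toy data: two features, two well-separated classes.
rows = [[1, 1], [2, 1], [3, 2], [7, 8], [8, 9], [9, 8]]
labels = [0, 0, 0, 1, 1, 1]
forest = train_forest(rows, labels)
print(forest_predict(forest, [0.5, 0.5]), forest_predict(forest, [9.5, 9.5]))
```

Because every stump sees a different resample of the data, individual stumps can disagree, and the vote smooths out their individual mistakes. That averaging effect is what reduces overfitting.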

5.2 Work Ability and Tasks

Random Forests are suitable for a wide range of tasks, from classification to regression and, through feature importance scores, feature selection. Their ability to work with diverse features and their robustness to noise make them a strong choice for building heart disease risk prediction models.

5.3 Accuracy and the Future of Random Forest

Random Forests often provide high accuracy, including in medical applications. With their flexibility and performance, they have the potential to grow strongly in the future, especially as data volumes and computing resources increase. This opens up opportunities for their application in many different fields.

6 Extension: Combining Python, Colab and Machine Learning

Combining Python, Colab and machine learning brings power and flexibility to the research and model deployment process. Python is not only a powerful programming language but also a "glue language" for machine learning, connecting important libraries and frameworks. Colab, as an online programming environment, simplifies the process of running and sharing source code, especially when working with large datasets and complex models.


This theoretical foundation chapter has provided the background and basic theory for the research topic "Comparing Logistic Regression and Random Forest algorithms in classifying people with heart disease."

The chapter begins with an introduction to Python, a popular and powerful programming language, especially in the field of machine learning. The Colab programming environment is then introduced as an effective and flexible tool that helps streamline the process of researching and deploying machine learning models.

The chapter then delves into the theory of machine learning, with an emphasis on its value in the Industry 4.0 era. Information about the rate of heart disease, the number of deaths and the alarming prevalence of heart disease is mentioned, demonstrating the importance of research on predicting and classifying the risk of heart disease.

As an important part of the chapter, Logistic Regression and Random Forest are introduced in detail. Each algorithm is described in terms of its outstanding features, characteristics, performance and accuracy. This creates a solid theoretical basis for understanding and applying them to heart disease classification research.

Finally, the chapter closes by presenting the combination of Python, Colab and machine learning. This is not only a technological combination but also a close connection between theory and application in the research process. The theoretical foundation chapter is an important step towards the goal of comparing the performance of Logistic Regression and Random Forest in classifying people with heart disease.


CHAPTER 3: DATA VISUALIZATION

1 Dataset

We use a dataset called 'heart' from Kaggle. It contains fields that may or may not be related to one another and that together indicate whether a person has a heart attack or not. The dataset contains 14 attributes and 303 samples, which makes it a fairly small data file on which classification algorithms are easy to run.
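A quick way to confirm the number of samples and attributes is to count the rows and columns of the CSV file. The sketch below uses only Python's standard library and an inline sample; the three rows are invented for illustration and are not taken from the dataset, and only a subset of the column names described in this chapter is used.

```python
import csv
import io


def summarize(handle):
    """Count the samples (rows) and attributes (columns) of a CSV file."""
    reader = csv.DictReader(handle)
    rows = list(reader)
    columns = list(reader.fieldnames or [])
    return {
        "n_samples": len(rows),
        "n_attributes": len(columns),
        "columns": columns,
    }


# Invented rows for illustration only; they are not taken from the dataset.
SAMPLE = """age,sex,cp,thalachh,exng,oldpeak,slp,output
54,1,0,150,0,1.2,1,1
61,0,2,132,1,0.4,2,0
47,1,1,168,0,0.0,2,1
"""

summary = summarize(io.StringIO(SAMPLE))
print(summary["n_samples"], summary["n_attributes"])
```

On the real file, the same function can be given an open file handle, for example `open("heart.csv", newline="")`, where the file name is an assumption. If the file matches the description above, it should report 303 samples and 14 attributes.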

Below is a description of the basic information in the heart attack patient dataset:

1 Age: A continuous variable representing the patient's age.

2 Gender (sex): A categorical variable describing the patient's gender. There are two values: 0 (male) and 1 (female).

3 Type of Chest Pain (cp): A categorical variable describing the type of chest pain the patient experiences. There are four values:

• Probable or definite left ventricular hypertrophy according to Estes's criteria

8 Maximum Heart Rate Achieved (thalachh): A continuous variable describing the maximum heart rate achieved when the patient exercises.

9 Exercise Induced Angina (exng): A categorical variable describing whether the patient experiences exercise-induced chest pain. There are two values: 0 (no) and 1 (yes).

10 Oldpeak: A continuous variable describing the ST segment depression induced by exercise relative to the resting state.

11 Slope (slp): A categorical variable describing the slope of the ST segment during peak exercise. There are three values:

• Not determined

• Ascending


The data contains categorical variables such as "Sex", "Chest Pain Type" (cp), "Exercise Induced Angina" (exng) and "Resting Electrocardiographic Results" (restecg). These variables take a small set of discrete values encoded as integers.

2 Continuity

Variables such as "Age", "Resting Blood Pressure" (trtbps), "Cholesterol" (chol) and "Maximum Heart Rate Achieved" (thalachh) are all continuous variables, whose values can vary across a range rather than falling into fixed categories.

3 Complexity

The relationships between the variables, and their potential influence on the level of heart disease risk ("output"), are complex. Specifically, variables such as "Chest Pain Type" and "Exercise Induced Angina" may reflect the complexity of the symptoms and the patient's health status.

4 Linearity

Continuous variables such as "Age", "Resting Blood Pressure", "Cholesterol" and "Maximum Heart Rate Achieved" may have a linear relationship with the likelihood of heart disease. However, the relationship may not be linear for categorical variables such as "Chest Pain Type" and "Exercise Induced Angina".

5 Cause-Specific Variables

Variables such as "Number of Major Vessels" (caa) and "Thalassemia" (thall) can be examined to assess their influence on heart disease risk, and they can also provide information about the severity of the disease.

6 Objective

The "output" variable is the target to be predicted and represents the possibility of heart disease. It is a binary variable (0 or 1), which makes it possible to examine the relationship between health factors and heart disease risk.
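Because the target is binary, it is worth checking how the two classes are balanced before comparing classifiers, since the share of the larger class is the accuracy a trivial "always predict the majority" model would reach. The sketch below shows one way to compute this; the label list is invented for illustration and does not reflect the actual class counts in the dataset.

```python
from collections import Counter


def class_balance(labels):
    """Return the fraction of samples in each class."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: counts[cls] / total for cls in sorted(counts)}


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent class."""
    return max(class_balance(labels).values())


# Invented labels for illustration (1 = heart disease, 0 = no heart disease).
toy_output = [1, 1, 0, 1, 0, 1, 0, 1]

balance = class_balance(toy_output)
baseline = majority_baseline_accuracy(toy_output)
print(balance, baseline)
```

Any model evaluated on this target should be judged against that baseline rather than against zero.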

In summary, this dataset combines continuous and discrete variables, describes a patient's health, and provides the opportunity to analyze the complex relationship between these factors and susceptibility to heart disease.
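The distinction between discrete and continuous variables can also be checked programmatically. The sketch below applies a simple heuristic, which is an assumption of this example rather than a method described in the report: a column with only a few distinct values is treated as categorical, and anything else as continuous. The table values are invented, and the `max_categories` cutoff is an arbitrary choice.

```python
def split_columns(table, target="output", max_categories=4):
    """Group feature columns by how many distinct values they take.

    table maps a column name to its list of values. Columns with at most
    max_categories distinct values are treated as categorical.
    """
    categorical, continuous = [], []
    for name, values in table.items():
        if name == target:
            continue  # the target is handled separately
        if len(set(values)) <= max_categories:
            categorical.append(name)
        else:
            continuous.append(name)
    return categorical, continuous


# Invented values for illustration only; not taken from the dataset.
toy_table = {
    "age": [63, 37, 41, 56, 57, 44],
    "sex": [1, 1, 0, 1, 0, 1],
    "cp": [3, 2, 1, 1, 0, 0],
    "chol": [233, 250, 204, 236, 354, 192],
    "output": [1, 1, 1, 0, 0, 1],
}

categorical, continuous = split_columns(toy_table)
print(categorical, continuous)
```

A heuristic like this is only a first pass. The variable descriptions remain the authority on which columns are categorical, because an integer-coded category and a genuinely numeric measurement can look alike in the raw file.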

2 Visualization

[Below are images and graphs showing typical values of the data]

Title: Comparison of logistic regression and random forest algorithm in the tasks of classifying people with heart attacks
Authors: Huynh Ngoc Tin, Tran Van Luong, Nguyen Duc Phat
School: Saigon International University
Major: Machine Learning
Type: Final essay
City: Ho Chi Minh City
