Procedia Computer Science 98 (2016) 368–373
1877-0509 © 2016 Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license
( http://creativecommons.org/licenses/by-nc-nd/4.0/ ).
Peer-review under responsibility of the Program Chairs
doi: 10.1016/j.procs.2016.09.056
ScienceDirect
* Corresponding author. Tel.: +92-51-222-9561; fax: +92-51-927-8257.
E-mail address: muzamal.liaqat14@ce.ceme.edu.pk
The 6th International Conference on Current and Future Trends of Information and
Communication Technologies in Healthcare (ICTH 2016)
A Framework for Clustering Cardiac Patient’s Records Using
Unsupervised Learning Techniques
Rao Muzamal Liaqat*, Bilal Mehboob, Nazar Abbas Saqib, Muazzam A. Khan
{muzamal.liaqat14, bilal.mehboob14, nazar.abbas, muazzamak}@ce.ceme.edu.pk
National University of Sciences and Technology (NUST), H-12, Islamabad, Pakistan
Abstract
Large volumes of data from patients' health reports are now available. In this paper we introduce a methodology that extracts useful information (patterns) from raw data using different unsupervised learning techniques. These hidden patterns help the practitioner understand the hidden relations (dependencies) among the data. With useful clustering we can predict hidden trends in patients. We use the correlation matrix followed by K-means (fast) to extract interesting patterns as well as the patient state, which helps the practitioner treat the patient wisely. According to the nature of the data, we categorize heart patients as normal, moderate, risk or critical. We apply different clustering algorithms and analyze the performance of each on the cardiac dataset. For this research we used a real dataset provided by AFIC (Armed Force Institute of Cardiology). The dataset consists of 1500 records with 36 attributes.
Keywords: Clustering; data mining; Unsupervised Learning; K-Mean (fast)
1 Introduction
In common practice, a patient comes to the doctor and, after routine procedures and tests, the doctor examines the patient and makes a diagnosis. As a result, a large amount of data remains unexplored in hospitals, which raises a significant problem in the healthcare domain. Certain questions then arise, e.g. “How can we get useful information from the data? Is there any hidden relation in the data that reveals a specific pattern to practitioners so that they can make wise decisions?” All of these can be answered by using data mining and machine learning algorithms to indicate the
unseen or hidden patterns [1]. Nowadays we are surrounded by large datasets related to patient history [2]. However, current patient databases are not informative enough to extract useful information or to track a patient's disease [3]. It is believed that, by using data mining techniques, a lot of hidden information can be extracted by discovering hidden patterns and correlations among attributes. Statistics is currently a very popular and commonly used technique for analyzing medical data. Researchers use different statistical tools and software to analyze the data and extract useful information [4]. In our work we use data mining algorithms, which we consider more reliable than statistical models; we also compute the performance of the different algorithms. Basically, two types of algorithms are used in data mining. The first is supervised learning (in which we have a labeled training dataset, e.g. SVM, Naive Bayes). The second is unsupervised learning (in which we have no training dataset or label attribute, e.g. K-means, DBSCAN). The main focus of this paper is to extract hidden patterns and correlations among different attributes that will assist the practitioner in writing a better prescription for heart patients. In this paper we use unsupervised techniques such as K-means, K-means (fast), DBSCAN and K-medoids to find hidden clusters and patterns for heart patients. The remainder of the paper is organized as follows. Section 2 describes the literature review. Section 3 describes the methodology. The detailed analysis of the clusters and the performance results are presented in Section 4. Conclusions and future work are detailed in Section 5.
2 Literature Review
In the literature, a lot of work has been carried out on medical data analysis to discover hidden patterns and extract useful information from large data by applying data mining techniques [5]. Conventional information extraction relied on manual methods applied by professionals, which are of little use when the dataset grows in volume as well as in dimension. To deal with such data we need computing technologies [6]. In the medical domain most of the work has been carried out on cardiac image segmentation, feature extraction, pattern recognition and correlation [7, 8]. The decision tree is a widely used algorithm for mining hidden information and backtracking the root cause in medical data. A decision tree has a root node and leaf nodes; leaf nodes represent concrete knowledge according to the label attribute. Commonly used decision tree algorithms for mining useful information are ID3, CHAID, Random Forest and Decision Stump [9]. Many intelligent systems have been developed to assist the practitioner in cardiac disease [10]. Researchers have used Naive Bayes, ANN and decision trees to extract hidden patterns and correlations among attributes [11]. Our main focus is to process the data to obtain useful information and explore the hidden patterns. In this paper we use the dataset provided by AFIC (Armed Force Institute of Cardiology). The preprocessing steps and the performance of the different unsupervised learning algorithms are described in the methodology section.
3 Proposed Methodology
Our methodology for extracting hidden patterns and correlations among the attributes in the context of cardiac data is shown in Fig 1.
Fig 1: Knowledge Discovery Process Model
The model is divided into 6 phases; each phase may involve certain inputs, outputs and operations. We explain each phase in detail.
3.1 Data Acquisition
Medical data mostly come in the form of medical reports, lab reports and doctor reviews; all of these kinds of data can be categorized as unstructured data [12]. We obtained the data in report form from the Armed Force Institute of Cardiology (AFIC). The raw data consist of 1500 records with 50 attributes. We then derive the target data from the raw data by applying feature selection on the basis of attribute weights and expert opinion.
3.2 Target Data (Attribute Selection)
Target data is the data of interest, mined from the raw data. We select the target attributes from the raw data by assigning weights to attributes using the correlation matrix and the consensus of experts. The correlation operator applied to the cardiac patient data is shown in Fig 2.
The correlation matrix assigns a different weight to each attribute. The weight of each attribute is shown in Fig 3. Using the weights assigned by the correlation matrix together with expert opinion, we selected 16 attributes. We then extract the hidden patterns among these attributes using different data mining algorithms.
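The weighting step can be illustrated with a minimal sketch. It assumes that an attribute's weight is the absolute Pearson correlation between that attribute and the report category, which is only one plausible reading of "weights assigned by the correlation matrix"; the paper does not state the exact formula. The function names, column names and values below are hypothetical and are not AFIC data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)

def attribute_weights(columns, target):
    """Assumed weighting: |correlation| of each attribute with the target column."""
    return {name: abs(pearson(values, columns[target]))
            for name, values in columns.items() if name != target}

def select_top(weights, k):
    """Names of the k highest-weighted attributes."""
    return sorted(weights, key=weights.get, reverse=True)[:k]

# Toy columns (hypothetical values).
columns = {
    "lvef": [60, 55, 45, 35, 25],
    "bmi": [22, 27, 24, 31, 26],
    "report_category": [0, 0, 1, 2, 3],
}
weights = attribute_weights(columns, "report_category")
selected = select_top(weights, 1)
```

In the paper the final choice also depends on expert opinion, so an automatic ranking like this would serve only as a shortlist.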
3.3 Preprocessed Data
In this step we make our data compatible with machine learning algorithms by applying some preprocessing steps. The data usually contain missing values; we apply filtering to remove them so that more reliable results can be extracted. Our work in this paper relates to clustering (K-means, DBSCAN, K-means (fast), K-medoids). For this we have to convert nominal and polynominal (multi-valued nominal) data into numeric form, because K-means does not work on such data types. The “Report Category” attribute has the labels Normal, Moderate, Risk and Critical; these labels are replaced by the numeric values 0, 1, 2 and 3, respectively.
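A minimal sketch of the two preprocessing steps described above follows. The label-to-number mapping is the one stated in the text; the record layout, the field names and the rule that a record with any missing field is dropped are assumptions made for illustration.

```python
# Mapping stated in the text: Normal, Moderate, Risk, Critical -> 0, 1, 2, 3.
REPORT_CATEGORY_CODES = {"Normal": 0, "Moderate": 1, "Risk": 2, "Critical": 3}

def drop_incomplete(records):
    """Keep only records in which no field is missing (None or empty string)."""
    return [r for r in records
            if all(v is not None and v != "" for v in r.values())]

def encode_report_category(records, field="report_category"):
    """Return copies of the records with the nominal label replaced by its code."""
    return [{**r, field: REPORT_CATEGORY_CODES[r[field]]} for r in records]

# Hypothetical records, not AFIC data.
raw = [
    {"bmi": 23.0, "lvef": 55.0, "report_category": "Normal"},
    {"bmi": None, "lvef": 42.0, "report_category": "Risk"},
    {"bmi": 31.5, "lvef": 28.0, "report_category": "Critical"},
]
clean = encode_report_category(drop_incomplete(raw))
```

The second record is filtered out because its BMI is missing, and the remaining labels become the integers 0 and 3.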
3.4 Transformed Data
Data transformation is carried out by running certain scripts on the data. It is closely related to the data preprocessing steps, such as data cleansing (in which we smooth the data by applying filtering to mitigate abrupt changes). Data reduction is also an important step in data transformation; it is used to remove or exclude columns that are redundant or have zero effect on the overall results, as shown in Fig 4.
3.5 Patterns/Models
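The data reduction step can be sketched as follows. The paper does not define "redundant" or "zero effect" precisely, so this sketch assumes two simple rules: a column with a single distinct value is dropped, and a column that exactly duplicates an earlier one is dropped. The column names are hypothetical.

```python
def reduce_columns(columns):
    """Drop constant columns and exact duplicates of an earlier column."""
    kept = {}
    seen = set()
    for name, values in columns.items():
        key = tuple(values)
        if len(set(key)) <= 1:
            continue  # constant column: cannot help separate clusters
        if key in seen:
            continue  # identical to a column that was already kept
        seen.add(key)
        kept[name] = list(values)
    return kept

# Hypothetical columns, not AFIC data.
columns = {
    "lvef": [60, 45, 25],
    "lvef_copy": [60, 45, 25],
    "hospital_code": [1, 1, 1],
    "bmi": [22, 27, 31],
}
reduced = reduce_columns(columns)
```

Here only `lvef` and `bmi` survive; near-duplicates (highly correlated but not identical columns) would need a correlation threshold instead.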
This phase describes the hidden patterns extracted from the data. We explain the hidden patterns briefly in the Result and Discussion section; before that, we make some assumptions for better understanding and visualization of the results. These assumptions follow universal standards and expert recommendations. Our data contain a wide range of values in the BMI column. According to the standard we categorize BMI into four groups: below 18 (Underweight), 18 to 24 (Normal Weight), 25 to 30 (Overweight), and 31 onward (Obesity). Following expert recommendations, we also divide the LVEF values into four groups for better understanding and visualization: below 30% is “Very Critical”, below 40% is “Critical”, below 50% is “Risky”, and above 50% is categorized as “Normal”.
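The grouping rules above translate into two small functions. The thresholds come from the text; how fractional BMI values between the stated integer ranges are handled, and the treatment of an LVEF of exactly 50% as “Normal”, are assumptions because the text leaves those boundaries open.

```python
def bmi_group(bmi):
    """Map a BMI value to one of the four groups used in the paper."""
    if bmi < 18:
        return "Underweight"
    if bmi < 25:   # assumption: fractional values up to 25 count as normal
        return "Normal Weight"
    if bmi < 31:   # assumption: fractional values up to 31 count as overweight
        return "Overweight"
    return "Obesity"

def lvef_group(lvef_percent):
    """Map an LVEF percentage to one of the four severity groups."""
    if lvef_percent < 30:
        return "Very Critical"
    if lvef_percent < 40:
        return "Critical"
    if lvef_percent < 50:
        return "Risky"
    return "Normal"  # assumption: exactly 50% is treated as normal

groups = [(bmi_group(b), lvef_group(l)) for b, l in [(17.5, 55), (27, 35)]]
```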
4 Result and Discussion
To extract the hidden information we apply K-means (fast) clustering and then connect it to the correlation matrix, followed by the data-to-similarity module, to understand the internal dependency among different attributes, as shown in Fig 5.
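For readers unfamiliar with the clustering step, the following is a minimal sketch of plain K-means (Lloyd's algorithm). It is not the accelerated "fast" variant used in the paper and not the tool's implementation; the function names, the random initialization and the toy (BMI, LVEF) points are assumptions for illustration, and the example uses k = 2 rather than the K = 5 chosen in the paper.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def kmeans(points, k, max_iter=100, seed=0):
    """Plain Lloyd's K-means; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k data points
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_point(members)
    return labels, centroids

# Hypothetical (BMI, LVEF) pairs forming two well-separated groups.
points = [(22, 60), (23, 58), (21, 62), (32, 28), (33, 25), (31, 30)]
labels, centroids = kmeans(points, k=2)
```

In practice the attributes would be scaled to comparable ranges first, since Euclidean distance is otherwise dominated by the attribute with the largest spread.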
4.1 Hidden pattern BMI VS Report Category
In this cluster we extract the hidden relation between two important attributes, BMI and Report Category. We assigned the four labels Overweight, Normal Weight, Underweight and Obesity for better understanding and visualization, as discussed in the Patterns/Models part. All persons with an underweight BMI value are categorized as “Normal Patient”. The graph also shows that age is an important factor: all patients above 80 years are at risk, as shown in Fig 6.
Fig 4: Transform Data to Exclude Column
Fig 5: K-Mean (Fast) Implementation
Fig 6: BMI VS Report Category
4.2 Performance Measurement of Different Algorithms
Table 1: Comparative Analysis of Algorithms
Criteria          K-Means      K-Means (fast)   K-Medoids     DBSCAN
Cluster Density   -6673.259    -6673.259        -46937.563    -91490.939
Cluster Distance  -2554.952    -2554.652        -78871.650    N/A
Davies Bouldin    -0.968       -2554.952        -6.810        N/A
From the results of the different algorithms shown in Table 1, we selected the K-means (fast) algorithm. K-means and K-means (fast) show similar behavior on the cluster density and cluster distance criteria. DBSCAN performs very poorly on the cluster distance and Davies-Bouldin criteria. Overall, K-means (fast) gives better results than the other three algorithms on the basis of the selection criteria.
5 Conclusion
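To make the Davies-Bouldin criterion concrete, the sketch below computes the conventional form of the index: for each cluster, take the worst ratio of summed within-cluster scatter to centroid separation against any other cluster, then average over clusters. Lower values indicate tighter, better separated clusters. Table 1 lists negative values, which may mean the tool reports the index negated; that reading is an assumption, and the sketch returns the conventional non-negative value. The function names and points are hypothetical.

```python
from math import dist  # Euclidean distance, available from Python 3.8

def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (each a list of points).

    Assumes at least two clusters with distinct centroids.
    """
    centers = [centroid(c) for c in clusters]
    # Scatter: mean distance of a cluster's points to its centroid.
    scatter = [sum(dist(p, ctr) for p in c) / len(c)
               for c, ctr in zip(clusters, centers)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max((scatter[i] + scatter[j]) / dist(centers[i], centers[j])
                     for j in range(k) if j != i)
    return total / k

# Two toy clusterings with the same centroid separation but different spread.
tight = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
loose = [[(0, 0), (0, 6)], [(10, 0), (10, 6)]]
db_tight = davies_bouldin(tight)
db_loose = davies_bouldin(loose)
```

The tighter clustering yields the smaller index, which is the sense in which one algorithm is preferred over another under this criterion.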
In this paper we applied the K-means (fast) algorithm (with K = 5, decided in consultation with a practitioner) along with the correlation and data-to-similarity modules to extract hidden patterns among different attributes. With the help of the correlation matrix and expert opinion we chose four attributes (LVEF, gender, LV_Myocardium and report category) from the list of attributes. We then plotted graphs to understand the hidden relation of each selected attribute with the cardiac patient report category. Fig 6 reveals that patients above 80 years, regardless of their BMI value, are mostly at “Risk Level” for heart failure. Figs 7 and 8 show that the critical state in cardiac patients is more dominant in males than in females. Fig 9 shows that, among moderate and critical cardiac patients, males are more affected than females. LV-Myocardium describes the state of the heart with respect to ischemic disease (a disease that occurs due to inadequate blood supply to an organ of the body): when the value of LV-Myocardium is low, patients are categorized as normal, and a higher value indicates risk and critical behavior in cardiac patients, as shown in Fig 9. LVEF in a cardiac patient indicates how much blood the left ventricle pumps out with each contraction. If the value of LVEF is greater than 50, the patient is normal; otherwise we categorize the patient as abnormal or affected, as shown in Fig 10.
Acknowledgement
I am grateful to AFIC, Pakistan for providing the dataset for this research study. I am thankful to my HOD, Dr Shoab A Khan, for helping and guiding me during this work. I am also thankful to Dr Aqib Malik, RMO, EME College, for assisting me in this research.
References
1. K. Aziz, S. Aziz, Evaluation and Comparison of Coronary Heart Disease Risk Factor Profiles of Children in a Country with Developing Economy.
2. Abu Khousa, E.; Campbell, P., "Predictive data mining to support clinical decisions: An overview of heart disease prediction systems," Innovations in Information Technology (IIT), 2012 International Conference on, pp. 267-272, 2012.
3. Rao, R. B., Krishnan, S., & Niculescu, R. S. (2006). Data mining for improved cardiac care. ACM SIGKDD Explorations Newsletter, 8(1), 3-10.
4. Kajabadi, A., Saraee, M. H., & Asgari, S. (2009, October). Data mining cardiovascular risk factors. In Application of Information and Communication Technologies, 2009. AICT 2009. International Conference on (pp. 1-5). IEEE.
5. Giudici, P.: “Applied Data Mining: Statistical Methods for Business and Industry”, New York: John Wiley, 2003.
6. Wamiq M. Ahmed (2008), Knowledge representation and data mining for biological imaging, Purdue University Cytometry Laboratories, Bindley Bioscience Center, 1203 W State Street, West Lafayette, IN 47907, USA.
7. J.J. Sychra, D.G. Pavel, E. Olea (1988), Classification Images Of Cardiac Wall Motion Abnormalities.
8. R. Bharat Rao, Glenn Fung, Balaji Krishnapuram (2010), Mining Medical Images.
9. 2011. http://docs.rapidi.com/files/rapidminer/RapidMiner_OperatorReference_en.pdf
10. Palaniappan, S. & Awang, R., “Intelligent heart disease prediction system using data mining technique”. IJCSNS International Journal of Computer Science and Network Security, Vol. 8, No. 8, 2008.
11. Ms. Ishtake S.H., Prof. Sanap S.A., Intelligent Heart Disease Prediction System Using Data Mining Techniques, International J. of Healthcare & Biomedical Research, Volume: 1, pp. 94-101, 2013.
12. Unstructured Data Mining: The Tools You Need to Dig the Deep Web, posted February 13, 2013 @ 3:41 pm by Scott Raspa, http://www.ikanow.com/blog/02/13/unstructured-data-mining-digthe-deep-web