

Imputation Methods to Deal with Missing Values when Data Mining Trauma Injury Data

Kay I Penny
Centre for Mathematics and Statistics, Napier University, Craiglockhart Campus, Edinburgh, EH14 1DJ
k.penny@napier.ac.uk

Thomas Chesney
Nottingham University Business School, Jubilee Campus, Wollaton Road, Nottingham, NG8 1BB
Thomas.Chesney@nottingham.ac.uk

Abstract. Methods for analysing trauma injury data with missing values, collected at a UK hospital, are reported. One measure of injury severity, the Glasgow coma score, which is known to be associated with patient death, is missing for 12% of patients in the dataset. In order to include these 12% of patients in the analysis, three different data imputation techniques are used to estimate the missing values. The imputed data sets are analysed by an artificial neural network and logistic regression, and their results compared in terms of sensitivity, specificity, positive predictive value and negative predictive value.

Keywords. Data mining, missing data imputation, trauma injury

1 Introduction

Trauma injury is the most common cause of loss of life among those under forty [1]. In 1991 a trauma system was put in place at the North Staffordshire Hospital (NSH) in Stoke-on-Trent in the U.K. It records injury details including Injury Severity Score (ISS) [2], Abbreviated Injury Scores (AIS) [3], the Glasgow Coma Score (GCS) [4], the patient's sex and age, management and interventions, and the outcome of the treatment, including whether the patient lived or died during their hospital stay.

North Staffordshire Hospital is a major trauma centre in the area and receives patient referrals from surrounding hospitals. Oakley et al. [5] analysed data for only the most severely injured patients admitted between 1992 and 1998, and found that determinants of mortality for this subset of patients included age, head AIS, chest AIS, abdominal AIS, external injury AIS, mechanism of injury, primary receiving hospital and calendar year of admission. Further analysis included a comparison of several artificial neural network (ANN) models and logistic regression (LR) to predict death during hospital stay [6]. Factors found to be important in the modelling were age, mechanism of injury, whether the patient was referred from another hospital, and several injury severity scores including GCS motor and GCS verbal scores.

Missing data do not always cause concern when using data mining techniques; however, these data have 12% of GCS scores missing. Applying the standard practice of complete-case analysis therefore means that 12% of the dataset is excluded from the modelling, since these patients do not have recorded values for the three GCS scores. Exclusion of this subset of patients may lead to bias in the results, as patients who have not had their GCS scores recorded may not be a representative sample of the population of trauma injury patients. For example, it may be that these patients tend to be more seriously injured than the typical patient, so that the scores were not recorded due to lack of time, or that they presented with a different type or combination of injuries. The aim of this research is to investigate the accuracy of modelling patient death following trauma injury in conjunction with missing value imputation.

2 Methods

The study involves trauma audit data from patients treated at the North Staffordshire Hospital from 1993 to 1999 and from 2001 to 2004. The gap was due to a lack of resources, which affected data collection during this period. Only the most severely injured patients, i.e. patients with an ISS greater than 15, are included in this study, resulting in a total of 1658 patients in the dataset. Hence these results are generalisable to severely injured patients only.

[...]th Int. Conf. Information Technology Interfaces ITI, June, Cavtat, Croatia

Table 1. Factors considered for inclusion in the analyses

Sex (Male or Female)
Age group (years): 0-15; 16-25; 26-35; 36-50; 51-70; over 70
Year of admission (1992-8, 2001-5)
Month of admission (Jan-Dec)
Day of admission (Mon-Sun)
Time of admission (0000-0359; 0400-0759; 0800-1159; 1200-1559; 1600-1959; 2000-2359)
Referred from another hospital (yes or no)
Mechanism of injury group: Motor vehicle crash; Fall greater than 2m; Fall less than 2m; Assault; Other
Type of trauma: blunt (yes or no); penetrating (yes or no)
Abbreviated injury scores (AIS): Abdomen; Cervical-spine; Upper limb; Thoracic-spine
Glasgow coma scores (GCS): Eye response; Motor response; Verbal response

Factors considered for inclusion in the analysis are summarised in Table 1. Two different approaches to the statistical analysis of these data were carried out: data mining using an artificial neural network (ANN) and logistic regression (LR) modelling. All analysis was carried out using the statistical packages SPSS 12, Clementine 7.0, and Solas 3.0.

2.1 Data Mining Methods

ANNs attempt to mimic the biological structure and connectivity of a natural neural network, using the human brain as an analogy. Inputs are fed through the neurons in the network, which transform them to output a probability; in this case, the probability that a patient will die.

An exhaustive prune was used to create the ANN. All the neurons are fully connected, and the network is a feed-forward multi-layer perceptron which uses the sigmoid transfer function [7]. The learning technique used is back propagation. Exhaustive pruning means that, starting with the given topology, the network is trained, then a sensitivity analysis is performed on the hidden units and the weakest are removed. This cycle of training and removal is repeated for a set length of time. The ANN used in this study has 3 hidden layers with 30, 20 and 10 neurons respectively and the following learning rates: alpha=0.9, eta=0.3, as previous analysis found that this architecture works well for trauma injury data [6].
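To make the starting topology concrete, the following is a minimal sketch of a forward pass through a fully connected feed-forward network with sigmoid units and hidden layers of 30, 20 and 10 neurons. It is not the Clementine implementation: the number of inputs, the random weights and the helper names (`init_layer`, `forward`) are assumptions made purely for illustration, and neither training nor pruning is shown.

```python
import math
import random

def sigmoid(z):
    """Logistic (sigmoid) transfer function."""
    return 1.0 / (1.0 + math.exp(-z))

def init_layer(n_in, n_out, rng):
    """Random weights for one fully connected layer; the last entry of each row is the bias."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]

def forward(network, inputs):
    """Feed one input vector through every layer and return the output activations."""
    activations = list(inputs)
    for layer in network:
        activations = [
            sigmoid(sum(w * a for w, a in zip(neuron[:-1], activations)) + neuron[-1])
            for neuron in layer
        ]
    return activations

rng = random.Random(0)
n_inputs = 12  # hypothetical number of input fields, not taken from the paper
sizes = [n_inputs, 30, 20, 10, 1]  # three hidden layers, one output neuron
network = [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]

# With untrained weights the output is meaningless; it only shows the data flow.
p_death = forward(network, [rng.random() for _ in range(n_inputs)])[0]
```

Because the output unit is also a sigmoid, the result always lies strictly between 0 and 1 and can be read as a probability once the network has been trained.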

As well as data mining using an ANN, LR modelling is included for comparison. The LR models were developed to be parsimonious, that is, as simple as possible while retaining good predictive ability. Hence this approach is more subjective than the ANN.

In medical applications it is often the case that a logistic regression model is developed using the complete data set, and the model is then tested on the same set of data used to build it. However, it is not ideal to test a model with the data used to build it. For this reason, and to allow comparison with the data mining methods presented in this paper, a k-fold cross-validation technique was used to test all of the models, with k set to five. This technique is good practice when building neural networks with medical data [8]. Using this technique, the data were split into five subsets. Four data subsets are used to train each model, and the fifth is used to test it. This is then repeated another four times so that each data subset is used to test the models once.

When splitting the dataset, those patients who lived were selected independently of those patients who died, in order to keep the same proportion of patients who died in each of the k data subsets. This is necessary since the outcome variable, patient death, is very imbalanced: 79% of patients lived and 21% died during their hospital stay.
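The stratified split described above can be sketched as follows. This is an illustrative Python version, not the procedure run in the original software; the function name `stratified_folds` and the toy outcome vector are assumptions.

```python
import random

def stratified_folds(outcomes, k=5, seed=0):
    """Assign each record index to one of k folds, splitting each outcome class separately."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(set(outcomes)):
        members = [i for i, y in enumerate(outcomes) if y == label]
        rng.shuffle(members)
        # Deal the shuffled members of this class out to the folds in turn,
        # so every fold receives nearly the same share of the class.
        for position, index in enumerate(members):
            folds[position % k].append(index)
    return folds

# Toy outcome vector with roughly the class balance reported above (1 = died).
outcomes = [1] * 21 + [0] * 79
folds = stratified_folds(outcomes, k=5)

for test_fold in folds:
    # Each fold is the test set once; the other four folds form the training set.
    train = [i for fold in folds if fold is not test_fold for i in fold]
```

Splitting each class on its own is what keeps the proportion of deaths close to 21% in every fold; a single shuffle of all patients would not guarantee this.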


2.2 Missing value imputations

Previous work [6] compared the results of four different ANN models as well as LR to predict death during hospital stay following injury. Both GCS motor and GCS verbal were found to have high importance in two of the ANNs, and GCS motor was statistically significant in the LR model. In order for these variables to be included in the models, 12% of the sample, i.e. patients whose GCS scores were not recorded, were excluded from the analysis. Hence missing value imputation is considered here so that all patients can be included in the modelling process. The GCS is a measurement of severity of head injury and comprises three components, each measured on an ordinal scale: eye response (1-4), verbal response (1-5) and motor response (1-6).

Three methods of data imputation are considered in this study:

1. Hot-deck imputation
2. Predictive model-based imputation
3. Propensity score imputation

Hot-deck imputation [9] involves substituting individual values drawn from patients with observed data who are “similar” to the patient with the missing value. In terms of the GCS scores, this involves imputing a GCS score drawn from a subset of patients who are “similar” to the patient with the missing GCS score. In order to impute a particular GCS score, this method sorts patients, both those with observed values and those with missing values for this score, into a number of subsets according to a set of covariates which are associated with the GCS scores. In this application, the imputation subsets comprise patients with the same values of the injury severity scores: AIS head, AIS chest, AIS lumbar spine and AIS cervical spine. Patients with missing GCS scores then have their missing values replaced with observed values selected at random, with replacement, from patients in the same subset, i.e. patients who are similar with respect to these covariates. If there are no observed values in the corresponding subset of patients, then the subset is collapsed by one level, and this process is repeated until an observed value can be found.
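A minimal sketch of the hot-deck procedure described above is given below. It is not the Solas implementation: the function name, the record layout and the field names are hypothetical, and the rule for collapsing a subset (dropping the last covariate in the list) is an assumption, since the order in which levels are collapsed is not stated here.

```python
import random

def hot_deck_impute(records, target, covariates, seed=0):
    """Fill missing `target` values (None) from donors that match on `covariates`.

    If no donor matches on all covariates, the last covariate is dropped
    (the imputation subset is collapsed by one level) and the search repeats.
    """
    rng = random.Random(seed)
    donors = [r for r in records if r[target] is not None]
    imputed = []
    for record in records:
        new_record = dict(record)  # leave the input records untouched
        if record[target] is None:
            keys = list(covariates)
            while True:
                pool = [d[target] for d in donors
                        if all(d[k] == record[k] for k in keys)]
                if pool or not keys:
                    break
                keys.pop()  # collapse the subset by one level
            if pool:
                new_record[target] = rng.choice(pool)  # random draw, with replacement
        imputed.append(new_record)
    return imputed

# Toy records; the field names are illustrative only.
records = [
    {"ais_head": 5, "ais_chest": 0, "gcs_motor": 2},
    {"ais_head": 5, "ais_chest": 0, "gcs_motor": None},
    {"ais_head": 3, "ais_chest": 4, "gcs_motor": 6},
    {"ais_head": 3, "ais_chest": 2, "gcs_motor": None},
]
completed = hot_deck_impute(records, "gcs_motor", ["ais_head", "ais_chest"])
```

In the toy data the second patient finds a donor with identical covariates, while the fourth only finds one after the subset is collapsed to match on `ais_head` alone.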

Predictive model-based imputation involves imputing a missing value by using an ordinary least-squares regression method to estimate a missing GCS score. Firstly, a predictive model is estimated from the observed data, which contain no missing values for the GCS score of interest. Let $Y$ be the GCS variable to be imputed, and let $X$ be the same set of covariates used in the hot-deck imputation listed above. Let $Y_{obs}$ be the observed values in $Y$, $Y_{mis}$ be the missing values in $Y$, and let $X_{obs}$ and $X_{mis}$ be the covariates corresponding to $Y_{obs}$ and $Y_{mis}$ respectively. By regressing $Y_{obs}$ on $X_{obs}$, predictions for the missing values are obtained from the equation:

$$\hat{Y}_{mis} = a + b X_{mis}$$

Let $a$ represent the constant in the model, and $b$ represent the vector of regression coefficients. Using this estimated model, a random element is incorporated in the estimate of the missing values. Parameter values from the regression model are drawn from their posterior distribution given the data, using non-informative priors [10] [11]. In this way, the extra uncertainty due to the fact that the regression parameters can be estimated, but not determined, from the observed data is reflected.
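One common way of drawing regression parameters from their posterior under a non-informative prior is sketched below, using the standard results for the normal linear model. It is an assumption that the software follows exactly these steps. The notation is introduced only for this sketch: $Z = [1, X]$ is the covariate matrix with a leading column of ones, $\beta = (a, b)$ stacks the constant and the regression coefficients, $n_{obs}$ is the number of patients with an observed score, and $p$ is the number of parameters in $\beta$.

```latex
\begin{align*}
\hat{\beta} &= (Z_{obs}^{\top} Z_{obs})^{-1} Z_{obs}^{\top} Y_{obs}, \qquad
  \mathrm{RSS} = (Y_{obs} - Z_{obs}\hat{\beta})^{\top} (Y_{obs} - Z_{obs}\hat{\beta}) \\
\sigma_{*}^{2} &= \mathrm{RSS} / g, \qquad g \sim \chi^{2}_{n_{obs} - p} \\
\beta_{*} &\sim N\!\left(\hat{\beta},\; \sigma_{*}^{2}\, (Z_{obs}^{\top} Z_{obs})^{-1}\right) \\
Y_{mis}^{*} &= Z_{mis}\,\beta_{*} + \sigma_{*}\,\varepsilon, \qquad \varepsilon \sim N(0, I)
\end{align*}
```

The first line is the ordinary least-squares fit; the remaining lines replace the fitted parameters with random draws, so that the imputed values $Y_{mis}^{*}$ vary with the uncertainty in $a$, $b$ and the residual variance rather than sitting exactly on the fitted line.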

Propensity score imputation [12] is based on the underlying assumption that the “missingness” of an imputation variable can be explained by a set of covariates using a logistic regression model. A binary indicator variable is created to represent whether the variable to be imputed is missing or observed for each individual. This indicator variable is the dependent variable in the logistic regression model, and the independent variables are a set of covariates which are thought to be related to the variable to be imputed. Using the regression coefficients from the logistic regression model, the propensity that a patient would have a missing value can be calculated. The propensity score for a patient is the conditional probability of “missingness”, given the observed covariates. Missing values of the imputation variable y are imputed by values randomly drawn from a subset of observed values of y, that is, its donor pool. In this study, five donor pool subgroups have been created. The patients in the dataset are sorted in ascending order of their assigned propensity scores, and then divided into five equal-sized subgroups. For each missing value, an observed value is selected for imputation, at random with replacement, from the corresponding donor pool.
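The donor-pool step can be sketched as below, assuming the propensity scores have already been estimated (for example as fitted probabilities from the logistic regression of the missingness indicator). The function name and the toy values are hypothetical, and this is not the Solas implementation.

```python
import random

def propensity_impute(values, propensity, n_groups=5, seed=0):
    """Impute missing entries of `values` (None) within propensity-score groups.

    `propensity` holds one already-estimated score per record.
    """
    rng = random.Random(seed)
    # Sort record indices by ascending propensity score.
    order = sorted(range(len(values)), key=lambda i: propensity[i])
    completed = list(values)
    for g in range(n_groups):
        # Slice the sorted records into (near-)equal-sized groups.
        start = g * len(order) // n_groups
        stop = (g + 1) * len(order) // n_groups
        group = order[start:stop]
        donor_pool = [values[i] for i in group if values[i] is not None]
        for i in group:
            if values[i] is None and donor_pool:
                completed[i] = rng.choice(donor_pool)  # random draw, with replacement
    return completed

# Toy GCS-like scores with three missing values and made-up propensity scores.
values = [4, None, 5, 5, None, 1, 2, 1, None, 1]
propensity = [0.05, 0.10, 0.12, 0.20, 0.22, 0.60, 0.65, 0.70, 0.80, 0.90]
completed = propensity_impute(values, propensity, n_groups=5)
```

A group that contains no observed value leaves its missing entries unfilled in this sketch; how such a case is handled in practice is not described in the text.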

2.3 Evaluation methods

The five-fold cross-validation design results in five training datasets and five corresponding validation datasets. Each of the three imputation methods described above is applied to each of these ten datasets, and results are compared for the ANN and the LR models. The overall performance of a model under a particular imputation method is then the mean performance over the five validation datasets. In many data mining efforts the evaluation criterion is the overall accuracy, i.e. the percentage of correct classifications made by an algorithm; in medical data mining, however, consideration must be given to the percentage of false positives and false negatives made. The evaluation criteria included for testing the classification algorithms are sensitivity (sens), specificity (spec), positive predictive value (PPV) and negative predictive value (NPV).
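The four criteria follow directly from the counts of true and false positives and negatives. The sketch below uses the standard definitions, with death coded as 1; the function name and the toy labels are illustrative, and the function assumes that each denominator is non-zero.

```python
def classification_metrics(actual, predicted):
    """Return sensitivity, specificity, PPV and NPV for binary labels (1 = died)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # deaths predicted as deaths
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # survivors predicted to survive
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # survivors predicted to die
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # deaths predicted to survive
    return {
        "sens": tp / (tp + fn),  # share of actual deaths that were predicted
        "spec": tn / (tn + fp),  # share of actual survivors that were predicted
        "ppv": tp / (tp + fp),   # share of predicted deaths that were correct
        "npv": tn / (tn + fn),   # share of predicted survivals that were correct
    }

# Toy example: 4 deaths and 6 survivors.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(actual, predicted)
```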

A cut-point of 0.5 is used in the logistic regression modelling to allow comparability between the three imputation methods. A receiver operating characteristic (ROC) curve analysis is carried out to compare the logistic regression results.
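The area under the ROC curve does not depend on any single cut-point. One standard way to compute it, sketched below, is as the proportion of (died, survived) pairs in which the patient who died received the higher predicted probability, with ties counted as one half. This is a generic illustration, not the SPSS routine, and the data are made up.

```python
def roc_auc(actual, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative cases."""
    positives = [s for a, s in zip(actual, scores) if a == 1]
    negatives = [s for a, s in zip(actual, scores) if a == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # the positive case is ranked above the negative case
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))

# Toy example: two deaths and two survivors with predicted probabilities of death.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

An area of 0.5 corresponds to ranking patients no better than chance, and 1.0 to ranking every death above every survivor.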

3 Results

The results of the k-fold cross-validations for each data-mining method applied to each of the three sets of imputed data subsets are presented in Table 2, along with the results when no imputation was performed (complete-case analysis), which are included for comparison. The mean accuracy measures of the five validation datasets are given along with the between-validation standard errors.
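As an illustration of how a mean and a between-validation standard error can be obtained from five fold-level results, the sketch below takes the standard error to be the sample standard deviation of the five values divided by the square root of five. That definition is an assumption, as the exact formula is not stated here, and the fold values are made up.

```python
import math
import statistics

def mean_and_se(fold_values):
    """Mean of the fold-level results and the standard error of that mean."""
    mean = statistics.mean(fold_values)
    se = statistics.stdev(fold_values) / math.sqrt(len(fold_values))
    return mean, se

# Made-up sensitivities from five validation datasets (not values from Table 2).
fold_sensitivity = [0.44, 0.47, 0.41, 0.50, 0.43]
mean_sens, se_sens = mean_and_se(fold_sensitivity)
```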

For the LR modelling, there is very little difference in performance between the three missing data imputation methods, and all three perform almost as well as the complete-case model. Although the specificity for all three LR results is high, the sensitivity measures are all fairly low, with just over half of those who die predicted correctly. However, the cut-point of 0.5 could be lowered to increase the sensitivity of the models, at the cost of decreased specificity. The ROC analysis gave areas under the curve (with between-validation standard errors) of 0.86 (0.012) for both the hot-deck and the model-based results, and 0.85 (0.013) for the propensity scoring method, whereas the area under the ROC curve for the complete-case analysis was 0.89.
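The trade-off involved in lowering the cut-point can be seen in a small example. The predicted probabilities below are made up for illustration and are not taken from the study; the function name is hypothetical.

```python
def sens_spec_at(actual, probabilities, cut_point):
    """Sensitivity and specificity when death is predicted for p >= cut_point."""
    predicted = [1 if p >= cut_point else 0 for p in probabilities]
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

# Toy data: 4 deaths (1) and 6 survivors (0) with predicted probabilities of death.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
probabilities = [0.9, 0.6, 0.4, 0.3, 0.7, 0.35, 0.2, 0.1, 0.1, 0.05]

at_half = sens_spec_at(actual, probabilities, 0.5)   # the cut-point used in the study
at_lower = sens_spec_at(actual, probabilities, 0.3)  # a lower, hypothetical cut-point
```

In this toy example, moving the cut-point from 0.5 to 0.3 raises sensitivity from 0.50 to 1.00 while specificity falls from about 0.83 to about 0.67.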

Similarly, there is little difference between the three imputation methods when modelling the data with an ANN. However, all imputation methods slightly improve the positive predictive value of the ANN models compared with complete-case analysis.

Table 2. Evaluations of methods

Data mining /                      Evaluation criteria
imputation method    Sens (SE)      Spec (SE)     PPV (SE)        NPV (SE)

ANN:
  [...]              [...] (1.8)    92% (0.7)     0.61 (0.017)    0.86 (0.003)
  model-based        45% (2.2)      92% (0.5)     0.62 (0.014)    0.86 (0.004)
  [...]              [...] (5.4)    93% (0.9)     0.61 (0.026)    0.85 (0.011)
  complete-case

LR:
  [...]              [...] (1.8)    93% (0.7)     0.66 (0.017)    0.88 (0.003)
  model-based        51% (2.2)      93% (0.4)     0.67 (0.007)    0.88 (0.004)
  [...]              [...] (1.1)    94% (0.6)     0.69 (0.020)    0.88 (0.002)
  complete-case

Table 3 contains a listing of the factors included in the training models. Many of the factors considered for inclusion in the models (Table 1) are correlated with each other; hence the models do not all find the same subsets of factors to have high importance (ANNs) or statistical significance (LRs). A typical LR model shows increased odds of death for patients who were involved in a motor vehicle crash, had a blunt or penetrating injury, were older, were not referred from another hospital, and had a more severe injury according to several AIS scores and the three GCS scores. The three GCS scores were often found to be statistically significant in the training models, and all training models included at least two of the GCS scores.

Ten factors included in a typical ANN training model are listed in order of importance (Table 3). Two GCS scores are important in this model.

Table 3. Factors included in the training models

Age group              AIS cervical spine
Patient referred       AIS thoracic spine
Mechanism of injury    AIS external
Blunt injury           GCS eye
Penetrating injury     GCS motor
GCS motor              AIS spine
GCS verbal             AIS legs
AIS abdomen            Year of admission
AIS external

4 Conclusions

There is little distinction between the three imputation methods in terms of the results observed, for both the LR and the ANN models. According to the sensitivity and specificity measures, the results from the imputations are almost as good as the complete-case results, for both the LR and ANN models. This is also confirmed by the ROC analysis, which shows that the model from the complete-case analysis (area under the curve 0.89) is slightly more accurate than those based on the imputed data (0.86, 0.86 and 0.85).

In this study, single imputation is used, i.e. each missing value is replaced with a single imputed value, and then the data are analysed as for a complete-case analysis. The authors did consider using multiple imputation techniques [9], where each missing value is replaced with M ≥ 2 imputed values, resulting in M completed datasets. The M complete-data inferences can be combined to form one inference that reflects the uncertainty due to “missingness” under that model. Although multiple imputation has not been used in this application, the same missing values are effectively estimated five times under the k-fold cross-validation design, since a patient is included in a validation dataset once and in a training dataset four times. Since different imputations are created for a particular missing value for each of the different data subsets, an element of between-imputation variability has been incorporated into the results.
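For reference, the usual way of combining M complete-data inferences (commonly known as Rubin's rules) is sketched below. It was not applied in this study; the function name and the numbers are made up for illustration.

```python
import statistics

def combine_rubin(estimates, variances):
    """Pool M completed-data estimates and their variances using Rubin's rules."""
    m = len(estimates)
    pooled = statistics.mean(estimates)        # combined point estimate
    within = statistics.mean(variances)        # average within-imputation variance
    between = statistics.variance(estimates)   # between-imputation variance (divisor M - 1)
    total = within + (1 + 1 / m) * between     # total variance of the pooled estimate
    return pooled, total

# Made-up estimates of one regression coefficient from M = 3 completed datasets.
pooled, total_variance = combine_rubin([1.0, 1.2, 0.8], [0.04, 0.05, 0.03])
```

The between-imputation term is what single imputation leaves out: it inflates the variance to reflect that the imputed values are themselves uncertain.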

Although these results do not lead to more accurate classification of patient death or survival following trauma injury than the complete-case analysis, they do allow classification of patients whose Glasgow coma scores are missing. These patients would not have been included in either building or testing the models in the complete-case analysis. In other words, it would not have been possible to make a prediction for a patient with missing GCS values, whereas using imputation allows a prediction to be made.

Further work to investigate how well the different imputation methods estimate the missing GCS scores would be useful. One approach would be to carry out a simulation study using the complete-case data only, where a subset of GCS scores is deleted to mimic the pattern of missingness in the observed data. This would allow assessment of how accurately the different imputation techniques estimate the deleted GCS scores. Similar techniques could then be applied to the whole trauma injury dataset, which includes patients with all levels of injury severity, not only those most severely injured with ISS > 15.
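The outline of such a simulation study might look like the sketch below. It is a simplification under stated assumptions: values are deleted completely at random rather than following the observed missingness pattern, a most-frequent-value imputer stands in for the three methods under study, and all names and data are hypothetical.

```python
import random
import statistics

def mask_values(values, fraction, seed=0):
    """Return a copy with `fraction` of the entries set to None, plus the masked indices."""
    rng = random.Random(seed)
    n_mask = round(fraction * len(values))
    masked = sorted(rng.sample(range(len(values)), n_mask))
    masked_set = set(masked)
    incomplete = [None if i in masked_set else v for i, v in enumerate(values)]
    return incomplete, masked

def recovery_rate(truth, completed, masked):
    """Share of deleted entries whose imputed value equals the true value."""
    return sum(1 for i in masked if completed[i] == truth[i]) / len(masked)

# Toy complete-case scores on a 1-6 scale (made up, not study data).
truth = [6, 6, 5, 6, 4, 6, 1, 6, 5, 6] * 10
incomplete, masked = mask_values(truth, fraction=0.12)

# Placeholder imputer: fill every gap with the most frequent observed value.
most_common = statistics.mode([v for v in incomplete if v is not None])
completed = [most_common if v is None else v for v in incomplete]

rate = recovery_rate(truth, completed, masked)
```

Replacing the placeholder with each imputation method in turn, and repeating over many random deletions, would give a direct comparison of how often each method recovers the deleted score.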

5 References

[1] The Trauma Audit and Research Network; 2006. https://www.tarn.ac.uk/content/downloads/36/FirstDecade.pdf [23/01/06]

[2] Baker SP, O'Neill B, Haddon Jr W, Long WB. The injury severity score: a method for describing patients with multiple injuries and evaluating emergency care. Journal of Trauma 1974; 14: 187-96.

[3] Association for the Advancement of Automotive Medicine. The abbreviated injury scale, 1990 revision. Des Plaines, IL: Association for the Advancement of Automotive Medicine; 1990.

[4] Teasdale G, Jennett B. Assessment of coma and impaired consciousness. A practical scale. Lancet 1974; (ii): 81-3.

[5] Oakley PA, Mackenzie G, Templeton J, Cook AL, Kirby RM. Longitudinal trends in trauma mortality and survival in Stoke-on-Trent 1992-1998. Injury 2004; 35: 379-85.

[6] Chesney T, Penny K, Oakley P, Davies S, Chesney D, Maffulli N, Templeton J. Data mining medical information: Should artificial neural networks be used to analyse trauma audit data? Int J of Healthcare Information Systems and Informatics 2006; 1(2): 51-64.

[7] Watkins D. Clementine's Neural Networks Technical Overview; 1997. http://www.cs.bris.ac.uk/~cgc/METAL/Consortium/secure/neural_overview.doc [12/01/06]

[8] Cunningham P, Carney J, Jacob S. Stability problems with artificial neural networks and the ensemble solution. Artificial Intelligence in Medicine 2000; 20(3): 217-25.

[9] Little RJA, Rubin DB. Statistical Analysis with Missing Data. New Jersey: John Wiley & Sons; 2002.

[10] Rubin DB. Multiple Imputation for Nonresponse in Surveys. New York: John Wiley; 1987.

[11] Gelman A, Carlin J, Stern H, Rubin DB. Bayesian Data Analysis. New York: Chapman and Hall; 1995.

[12] Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika 1983; 70: 41-55.
