Handling Missing Values: Application to University Data Set

Dinesh J Prajapati 1, Jagruti H Prajapati 2

1 Department of Information Technology, A D Patel Institute of Technology, New V V Nagar-388121, India; Gujarat Technological University (GTU); dinesh249@yahoo.com

2 Department of Information Technology, Charotar Institute of Technology, Changa-388421, India; Charotar University of Science and Technology (CHARUSAT); jagruti_eyetea@yahoo.com

Abstract

Data warehouses usually have some missing values due to unavailable data, and these affect the number and the quality of the generated rules. Missing values can affect the coverage percentage and the number of reducts generated from a specific data set, and they make it difficult to extract useful information from the data set. Association rule algorithms typically identify only patterns that occur in their original form throughout the database. Handling missing values for association rule mining allows data that approximately matches a pattern to contribute toward the overall support of that pattern. This approach is also useful in processing missing data, which probabilistically contributes to the support of possibly matching patterns. The actual data mining process deals significantly with prediction, estimation, classification, pattern recognition and the development of association rules. Therefore, the significance of the analysis depends heavily on the accuracy of the database and on the sample data chosen for model training and testing.

Keywords: Data cleansing, Missing values, Knowledge discovery, Preprocessing

1 Introduction

Missing data are the absence of data items for a subject; they hide some information that may be important. In practice, missing data have been one major factor affecting data quality, and their presence is a general and challenging problem in the data analysis field. Fortunately, missing data imputation techniques can be used to improve data quality. Missing data imputation refers to any strategy that fills in the missing values of a data set so that standard data analysis methods can be applied to the completed data set [9].

Generally, there are two types of techniques to impute missing data. Single imputation techniques substitute a single value for each missing data item; mean imputation is one example. Multiple imputation techniques impute the missing data m times, so that m complete data sets are formed. For each of the m complete data sets, standard complete-data analysis methods are used to generate an analysis result [4]. The m analysis results are then integrated into a final result for inference. Propensity Score and Markov Chain Monte Carlo are both widely used multiple imputation techniques. Compared with single imputation techniques, multiple imputation techniques are more complex and cost more to apply [3].
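The multiple imputation workflow (impute m times, analyze each completed data set, integrate the m results) can be sketched as follows. This is only an illustration under stated assumptions: each missing entry is filled by a random draw from the observed entries, the "analysis" is simply the mean, and the results are integrated by averaging. It is not an implementation of Propensity Score or Markov Chain Monte Carlo imputation, and the function name and sample data are hypothetical.

```python
import random
from statistics import mean

def multiple_imputation_mean(values, m=5, seed=0):
    """Estimate the mean of `values` (None = missing) by multiple imputation.

    Assumption: each missing entry is filled by a random draw from the
    observed entries, a simple stand-in for a real imputation model.
    """
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    estimates = []
    for _ in range(m):
        # Build one completed data set.
        completed = [v if v is not None else rng.choice(observed) for v in values]
        # Run the standard complete-data analysis (here: the mean).
        estimates.append(mean(completed))
    # Integrate the m analysis results into one final estimate.
    return mean(estimates), estimates

# Hypothetical income column with two missing entries.
incomes = [18000, None, 22000, 20000, None, 21000]
pooled, per_run = multiple_imputation_mean(incomes, m=5)
```

The spread of the values in `per_run` gives a rough sense of how much uncertainty the missing entries introduce, which a single imputation cannot show.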

One limitation of many association rule mining algorithms, such as the Apriori algorithm, is that only database entries which exactly match a candidate pattern may contribute to the support of that pattern. This creates a problem for databases containing missing values [6]. The purpose of the work described in this paper is to compare the different methods for handling missing values. By doing so, the approximated values for missing data items can be incorporated in the ordinal association rules. The rest of this paper is organized as follows: Section 2 presents the methods for detecting errors and handling missing values, Section 3 shows the results, and Section 4 draws the conclusion.

2 Handling Missing Values

There are two forms of noise in the data, as described below [10-11].

1. Corrupted values: sometimes some of the values in the training set are altered from what they should have been. This may result in one or more tuples in the data set conflicting with the rules already established. The system may then regard these extreme values as noise and ignore them. The problem is that one never knows whether the extreme values are correct or not, and the challenge is how to handle “weird” values in the best manner.

2. Missing attribute values: one or more of the attribute values may be missing, both for examples in the training set and for objects which are to be classified. Missing data might occur because the value is not relevant to a particular case, could not be recorded when the data was collected, or is ignored by users because of privacy concerns. If attributes are missing in a training set, the system may ignore the object totally, try to take it into account by, for instance, finding the missing attribute's most probable value, or use the value “missing”, “unknown” or “NULL” as a separate value for the attribute.

Cleansing data of errors is an important processing step, particularly when integrating heterogeneous data sources. Dirty data files are prevalent in data warehouses because of incorrect or missing data values, inconsistent attribute naming conventions or incomplete information [12]. One important step in any data processing task is to verify the correctness of data values. Data cleaning, also called data cleansing or scrubbing, detects and removes errors and inconsistencies in data in order to improve the quality of the data. Causes of data quality problems include misspellings during data entry, missing data, invalid or incomplete information, and other reasons such as inconsistent attribute naming conventions [8]. The method chosen for dealing with missing features at prediction time also affects prediction accuracy. The most common approaches for dealing with missing features involve imputation. The main idea of imputation is that if an important feature is missing for a particular instance, it can be estimated from the data that are present [5].

The imputation model should be rich enough to preserve the associations or relationships among variables that will be the focus of later investigation. For example, suppose that a variable Y is imputed under a normal model that includes the variable X1. After imputation, the analyst then uses linear regression to predict Y from X1 and another variable X2 which was not in the imputation model. The estimated coefficient for X2 from this regression would tend to be biased toward zero, because Y has been imputed without regard for its possible relationship with X2 [7]. A missing value can be filled using any of the following methods [1].

2.1 Handling Missing values by ignoring the tuple

This method is usually applied when the class label is missing. It is not very effective unless the tuple contains several attributes with missing values, and it is especially poor when the percentage of missing values per attribute varies considerably. For example, consider a database that contains the following transactions, where “?” represents a missing value:

i) A, B, C

ii) E, F, G

iii) ?, B, E

iv) A, B, F

Suppose the minimum support is 3. If we ignore the third transaction, then the association rule containing item A will be missed completely, because A then occurs in only two transactions (i and iv).
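The effect of ignoring the tuple on this example can be checked with a short sketch. The `support` helper is a hypothetical name and counts only exact matches, as Apriori-style algorithms do; `None` stands for the missing value “?”.

```python
# The four example transactions; None stands for the missing value "?".
transactions = [
    ["A", "B", "C"],
    ["E", "F", "G"],
    [None, "B", "E"],
    ["A", "B", "F"],
]

def support(item, rows):
    """Count the transactions in `rows` that contain `item` exactly."""
    return sum(1 for row in rows if item in row)

min_support = 3

# Ignoring the tuple: drop every transaction that has a missing value.
complete_rows = [row for row in transactions if None not in row]
support_a_ignored = support("A", complete_rows)

# Best case for A: the missing item in transaction iii was actually A.
best_case_rows = [["A" if v is None else v for v in row] for row in transactions]
support_a_best_case = support("A", best_case_rows)

# Item B is known in transaction iii, yet it also loses support when
# the whole tuple is discarded.
support_b_all = support("B", transactions)
support_b_ignored = support("B", complete_rows)
```

With the tuple dropped, A can never reach the minimum support, and B, which met the threshold on the full data, falls below it as well.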

2.2 Handling Missing values by filling the missing value manually

This approach is time-consuming and may not be feasible given a large data set with many missing values.

2.3 Handling Missing values by using global constant

Replace all the missing attribute values by the same constant, such as a label like “Unknown”. If missing values are replaced by, say, “Unknown”, then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common. Hence, although this method is simple, it is not recommended.
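A small sketch illustrates why a global constant can mislead a mining program. The records and the helper name are hypothetical; the point is only that the constant itself ends up as the most frequent item.

```python
from collections import Counter

def fill_with_constant(rows, constant="Unknown"):
    """Replace every missing value (None) with the same constant."""
    return [[constant if v is None else v for v in row] for row in rows]

# Hypothetical (job, credit risk) records with missing values.
rows = [
    ["Engineer", None],
    [None, "High"],
    ["Clerk", None],
    [None, None],
]

filled = fill_with_constant(rows)

# Transaction-level support: count each item once per record.
item_support = Counter(item for row in filled for item in set(row))
most_frequent_item, most_frequent_count = item_support.most_common(1)[0]
```

Here “Unknown” appears in every record, so a frequent-itemset miner would report it as the strongest pattern even though it carries no information.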

2.4 Handling Missing values by using attribute mean

This method fills a missing value with the mean (average) of the attribute. For example, suppose that the average income of employees is 20,000; this value is then used to replace a missing income. If the missing value belongs to an attribute whose values are not numeric (neither integer nor float), this method cannot be used: there is, for example, no such thing as the average name of an employee.
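Mean imputation, including its restriction to numeric attributes, can be sketched as below. The function name and the sample values are hypothetical.

```python
from statistics import mean

def impute_mean(values):
    """Fill missing entries (None) with the mean of the observed entries.

    Raises TypeError for non-numeric attributes, where a mean is undefined.
    """
    observed = [v for v in values if v is not None]
    if not observed or not all(isinstance(v, (int, float)) for v in observed):
        raise TypeError("mean imputation needs numeric observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical income column: the observed values average 20000.
incomes = [15000, 25000, None, 20000]
imputed_incomes = impute_mean(incomes)

# A name column cannot be averaged, so the method is rejected.
try:
    impute_mean(["Ann", None, "Raj"])
    names_supported = True
except TypeError:
    names_supported = False
```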

2.5 Handling Missing values by using attribute mean for all samples belonging to the same class

This method uses the mean (average) of the attribute over all samples of the same class. For example, if classifying employees according to credit_risk, replace the missing value with the average income of the employees in the same credit risk class as the given tuple. If the missing value belongs to an attribute whose values are not numeric (neither integer nor float), this method cannot be used.
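The class-conditional variant can be sketched as follows. The field names `income` and `credit_risk` follow the example above, while the function name and the records are hypothetical. The sketch assumes every class has at least one observed value; otherwise the lookup fails with a KeyError.

```python
from collections import defaultdict
from statistics import mean

def impute_class_mean(records, attr, class_attr):
    """Fill missing `attr` values with the mean of `attr` within the same class.

    Assumption: every class has at least one observed value of `attr`.
    """
    by_class = defaultdict(list)
    for r in records:
        if r[attr] is not None:
            by_class[r[class_attr]].append(r[attr])
    class_means = {c: mean(vs) for c, vs in by_class.items()}
    return [
        {**r, attr: class_means[r[class_attr]]} if r[attr] is None else dict(r)
        for r in records
    ]

# Hypothetical employee records.
employees = [
    {"income": 10000, "credit_risk": "high"},
    {"income": 14000, "credit_risk": "high"},
    {"income": None, "credit_risk": "high"},
    {"income": 30000, "credit_risk": "low"},
    {"income": None, "credit_risk": "low"},
]
completed = impute_class_mean(employees, "income", "credit_risk")
```

The overall mean of the observed incomes is 18000, but the two missing entries receive 12000 and 30000, the means of their own classes.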

2.6 Handling Missing values by using the most probable value

This method is widely used. The limitations of all the above methods can be overcome by replacing each missing value with a probability distribution. This distribution represents the likelihood of possible values for the missing data, calculated using frequency counts from the entries that do contain data for the corresponding field.
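A minimal sketch of this idea, assuming the distribution is simply the relative frequency of each observed value in the field, is given below. The function names are hypothetical, and the column reuses the first items of the four example transactions from Section 2.1.

```python
from collections import Counter

def value_distribution(column):
    """Relative frequency of each observed (non-missing) value in a field."""
    observed = [v for v in column if v is not None]
    counts = Counter(observed)
    return {value: n / len(observed) for value, n in counts.items()}

def most_probable(column):
    """The single most likely replacement for a missing value."""
    dist = value_distribution(column)
    return max(dist, key=dist.get)

def probabilistic_support(item, column):
    """Support of `item` when missing entries contribute partial support."""
    dist = value_distribution(column)
    total = 0.0
    for v in column:
        if v == item:
            total += 1.0                  # exact match: full support
        elif v is None:
            total += dist.get(item, 0.0)  # missing: partial support
    return total

# First item of each example transaction; None stands for "?".
first_items = ["A", "E", None, "A"]
dist = value_distribution(first_items)
best = most_probable(first_items)
support_a = probabilistic_support("A", first_items)
```

In this sketch the missing entry adds two thirds of a unit to the support of A, instead of nothing (exact matching) or a full unit (hard replacement by the most probable value).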

2.7 Handling Missing values by using the most probable value for all the samples belonging to the same class

This method uses the probability of the attribute values over all samples of the same class. For example, if classifying employees according to credit risk, replace the missing value with the most probable income for employees in the same credit risk class as the given tuple. If the missing value belongs to an attribute whose values are not numeric (neither integer nor float), this method cannot be used.
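The hybrid of the class and probability techniques can be sketched by restricting the frequency counts to the samples of one class. The function name, the income bands and the records are hypothetical, and the sketch assumes each class has at least one observed value.

```python
from collections import Counter, defaultdict

def impute_class_most_probable(records, attr, class_attr):
    """Fill missing `attr` values with the most frequent value in the same class.

    Assumption: every class has at least one observed value of `attr`.
    """
    counts = defaultdict(Counter)
    for r in records:
        if r[attr] is not None:
            counts[r[class_attr]][r[attr]] += 1
    return [
        {**r, attr: counts[r[class_attr]].most_common(1)[0][0]}
        if r[attr] is None else dict(r)
        for r in records
    ]

# Hypothetical employee records with income recorded as a band.
employees = [
    {"income": "low", "credit_risk": "high"},
    {"income": "low", "credit_risk": "high"},
    {"income": "medium", "credit_risk": "high"},
    {"income": None, "credit_risk": "high"},
    {"income": "high", "credit_risk": "low"},
    {"income": "high", "credit_risk": "low"},
    {"income": None, "credit_risk": "low"},
]
completed = impute_class_most_probable(employees, "income", "credit_risk")
```

Without the class restriction, both missing entries would receive the same value; with it, each entry follows the distribution of its own class.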

3 Results

For this experiment, we took a database with no missing values as a reference database and randomly introduced missing values for each attribute at a rate of 10%. We used the University database from the UCI Data Repository [13]. The university database has 165 records and 13 attributes, and each observation concerns one university. The database is in its original (LISP-readable) form, with a few relevant functions at the end of the data file. In some cases, more information is provided about an attribute (e.g., units or domain). Some duplicates may exist, and a single observation may have more than one value for a given attribute (especially academic emphasis). It appears that several attributes could serve as a distinguished class attribute for this database. For the University data set, various algorithms were implemented for handling missing values, and Figure 1 presents their accuracy. The results show that missing values are successfully filled with a low noise rate.

Fig. 1. Accuracy versus Tolerance with 10% missing values filled (axis label: Tolerance; series: Class, Probability, Probability + Class)
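The experimental protocol of this section (start from a complete reference column, remove 10% of the values at random, impute them, and compare against the reference) can be sketched as below. The paper does not define tolerance, so the sketch assumes it is the allowed relative deviation, in percent, between an imputed value and the reference value. The data are synthetic rather than taken from the University data set, mean imputation stands in for the compared methods, and all names are hypothetical.

```python
import random
from statistics import mean

def inject_missing(values, rate=0.10, seed=0):
    """Replace a random `rate` fraction of `values` with None."""
    rng = random.Random(seed)
    n_missing = max(1, round(rate * len(values)))
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    masked = [None if i in missing_idx else v for i, v in enumerate(values)]
    return masked, sorted(missing_idx)

def accuracy_within_tolerance(reference, imputed, missing_idx, tolerance_pct):
    """Fraction of imputed values within `tolerance_pct` percent of the reference.

    Assumption: tolerance is a relative deviation; the paper does not say.
    """
    hits = sum(
        1 for i in missing_idx
        if abs(imputed[i] - reference[i]) <= tolerance_pct / 100 * abs(reference[i])
    )
    return hits / len(missing_idx)

# Synthetic reference column with 20 values and no missing entries.
reference = [float(v) for v in range(100, 300, 10)]
masked, missing_idx = inject_missing(reference, rate=0.10)

# Impute with the attribute mean, then score at two illustrative tolerances.
fill = mean(v for v in masked if v is not None)
imputed = [fill if v is None else v for v in masked]
acc_20 = accuracy_within_tolerance(reference, imputed, missing_idx, 20)
acc_50 = accuracy_within_tolerance(reference, imputed, missing_idx, 50)
```

Under this definition accuracy can only grow as the tolerance widens, so any method's curve in such a plot is non-decreasing.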

4 Conclusion

This paper discusses different methods to impute missing values. Missing values are replaced by probability distributions over the possible values of the missing feature, which allows the corresponding transaction to support all itemsets that could possibly match the data. Transactions that do not exactly match the candidate itemset may also contribute a partial amount of support; this behavior is beneficial for databases with many missing values or containing numeric data. Handling missing values using the most probable value for all the samples belonging to the same class gives better results than the other techniques, because it is a hybrid of the class technique and the probability technique. This hybridization eliminates the disadvantages of each of the basic techniques. We also observe that filling missing values with better accuracy leads to better results.

References

[1] Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers

[2] J. Nayak, D. Cook, “Approximate association rule mining”, In Florida Artificial Intelligence Research Symposium, 2001

[3] Azzam Sleit, Mousa Al-Akhras, Inas Juma, Marwah Alian, “Applying Ordinal Association Rules for Cleansing Data With Missing Values”, Marsland Press Journal of American Science, 2009, 5(3), 52-62

[4] Jianhua Wu, Qinbao Song, Junyi Shen, “A Novel Association Rule Mining Based Missing Nominal Data Imputation Method”, Eighth ACIS International Conference

[5] Chih-Hung Wu, Chian-Huei Wun, Hung-Ju Chou, “Using Association Rules for Completing Missing Data”, Fourth International Conference on Hybrid Intelligent Systems (HIS'04), 2004, pp. 236-241

[6] A. Ragel, “Preprocessing of Missing Values Using Robust Association Rules”, In Proceedings of the Second Pacific-Asia Conference, 1998

[7] K. Lakshminarayan, S. Harp, R. Goldman, T. Samad, “Imputation of missing data using machine learning techniques”, In Proceedings of the Second International Conference on Knowledge Discovery in Databases and Data Mining, 1996

[8] A. Ragel, B. Cremilleux, “MVC - a preprocessing method to deal with missing values”, In Proceedings of Knowl.-Based Syst., 1999, 285-291

[9] Arnaud Ragel, Bruno Cremilleux, “Treatment of Missing Values for Association Rules”, In Proceedings of PAKDD, 1998

[10] Arnaud Ragel, Bruno Cremilleux, J. L. Bosson, “An Interactive and Understandable Method to Treat Missing Values: Application to a Medical Data Set”, In ACM Comput. Surv., 1985

[11] Luai Al Shalabi, “A comparative study of techniques to deal with missing data in data sets”, In Proceedings of the 4th International Multiconference on Computer Science and Information Technology (CSIT 2006)

[12] A. Pujari, Data Mining Techniques, Universities Press, India, 2001

[13] UCI Data Repository
