Data Mining and Knowledge Discovery Handbook, 2nd Edition


44.4 Open Issues

Spatio-temporal properties of the data introduce additional complexity to the data mining process and to clustering in particular. We can differentiate between two types of issues that the analyst should deal with or take into consideration during analysis: general and application-dependent. The general issues involve such aspects as data quality, precision and uncertainty (Miller and Han, 2009). Scalability, spatial resolution and time granularity are application-dependent issues.

Data quality (spatial and temporal) and precision depend on the way the data is generated. Movement data is usually collected using GPS-enabled devices attached to an object. For example, when a person enters a building, the GPS signal can be lost or the positioning may be inaccurate due to a weak connection to satellites. As in the general data preprocessing step, the analyst should decide how to handle missing or inaccurate parts of the data: should they be ignored, tolerated or interpolated?
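As one illustration of the interpolation option, the sketch below fills a gap in a movement track by linear interpolation between the two nearest recorded fixes. It is a minimal example under our own assumptions: the function name `interpolate_fix` and the `(time, x, y)` tuple layout are hypothetical, coordinates are treated as planar, and the object is assumed to move in a straight line at constant speed during the gap, which may well not hold inside a building.

```python
from bisect import bisect_left


def interpolate_fix(track, t):
    """Linearly interpolate an (x, y) position at time t.

    track: list of (time, x, y) tuples sorted by time, possibly with gaps
    where the GPS signal was lost. Returns None if t is outside the track.
    """
    if not track or t < track[0][0] or t > track[-1][0]:
        return None
    times = [fix[0] for fix in track]
    i = bisect_left(times, t)
    if times[i] == t:  # an actual recorded fix, nothing to interpolate
        return track[i][1], track[i][2]
    (t0, x0, y0), (t1, x1, y1) = track[i - 1], track[i]
    w = (t - t0) / (t1 - t0)  # fraction of the gap elapsed at time t
    return x0 + w * (x1 - x0), y0 + w * (y1 - y0)


# Hypothetical track with a signal gap between t = 10 and t = 40.
track = [(0, 0.0, 0.0), (10, 10.0, 0.0), (40, 10.0, 30.0)]
print(interpolate_fix(track, 25))  # (10.0, 15.0)
```

Whether such a filled-in position should be trusted is a domain decision; ignoring or merely flagging the gap may be the safer choice when the gap is long.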

Computational power does not keep pace with the rate at which large amounts of data are being generated and stored. Thus, scalability becomes a significant issue for the analysis and demands new algorithmic solutions or approaches to handle the data.

Spatial resolution and time granularity can be regarded as the most crucial issues in spatio-temporal clustering, since a change in the size of the area over which the attribute is distributed or a change in the time interval can lead to the discovery of completely different clusters and can therefore lead to an improper explanation of the phenomena under investigation. There are still no general guidelines for the proper selection of spatial and temporal resolution, and it is rather unlikely that such guidelines will be proposed. Instead, ad hoc approaches are proposed to handle the problem in specific domains (see, for example, Nanni and Pedreschi, 2006). For this reason, the involvement of the domain expert in every step of spatio-temporal clustering becomes essential. The geospatial visual analytics field has recently emerged as the discipline that combines automatic data mining approaches, including spatio-temporal clustering, with visual reasoning supported by the knowledge of domain experts, and it has been successfully applied to different geographical spatio-temporal phenomena (Andrienko and Andrienko, 2006; Andrienko et al., 2007; Andrienko and Andrienko, 2010).
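To make the sensitivity to spatial resolution concrete, the following sketch aggregates the same points on two grids and reports which cells are dense. It is only an illustration under our own assumptions: the helper `dense_cells`, the toy coordinates and the density threshold `min_pts` are hypothetical, and a simple grid count stands in for a real clustering algorithm.

```python
from collections import Counter


def dense_cells(points, cell_size, min_pts):
    """Return the grid cells that contain at least min_pts points."""
    counts = Counter(
        (int(x // cell_size), int(y // cell_size)) for x, y in points
    )
    return {cell for cell, n in counts.items() if n >= min_pts}


# Toy data: four loosely spread points and two tightly packed ones.
points = [(0.5, 0.5), (1.5, 0.5), (0.5, 1.5), (1.5, 1.5), (7.2, 7.1), (7.8, 7.9)]

fine = dense_cells(points, cell_size=1.0, min_pts=2)
coarse = dense_cells(points, cell_size=2.0, min_pts=2)
print(fine)    # {(7, 7)}
print(coarse)  # two dense cells: (0, 0) and (3, 3)
```

With the finer grid only the tight pair forms a dense cell, while the coarser grid also reports the spread-out group. The same effect can be expected when the time interval used for aggregation changes.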

A class of application-dependent issues that is quickly emerging in the spatio-temporal clustering field is related to the exploitation of available background knowledge. Indeed, most of the methods and solutions surveyed in this chapter work on an abstract space where locations have no specific meaning and the analysis process extracts information from scratch, instead of starting from (and integrating with) possible a priori knowledge of the phenomena under consideration. In contrast, a priori knowledge about such phenomena and about the context they take place in is commonly available in real applications, and integrating it in the mining process might improve the output quality (Alvares et al., 2007; Baglioni et al., 2009; Kisilevich et al., 2010). Examples include very basic knowledge of the street network and land usage, which can help in understanding which aspects of the behavior of our objects (e.g., which parts of the trajectory of a moving object) are most discriminant and better suited to form homogeneous clusters; or the existence of recurring events, such as rush hours and planned road maintenance in an urban mobility setting, that are known to interfere with our phenomena in predictable ways.

Recently, the spatio-temporal data mining literature has also pointed out that the relevant context for the analysis of mobile objects includes not only geographic features and other physical constraints, but also the population of objects themselves, since in most application scenarios objects can interact and mutually interfere with each other’s activity. Classical examples include traffic jams, an entity that emerges from the interaction of vehicles and, in turn, dominates their behavior. Considering interactions in the clustering process is expected to improve the reliability of clusters. Yet a systematic taxonomy of relevant interaction types is still not available (neither a general one nor any application-specific one), it is still not known how to detect such interactions automatically, and understanding the most suitable way to integrate them in a clustering process is still an open problem.

44.5 Conclusions

In this chapter we focused on geographical spatio-temporal clustering. We presented a classification of the main spatio-temporal types of data: ST events, Geo-referenced variables, Moving objects and Trajectories. We described in detail how spatio-temporal clustering is applied to trajectories, provided an overview of recent research developments and presented possible scenarios in several application domains such as movement, cellular networks and environmental studies.

References

Agrawal R, Faloutsos C, Swami AN (1993) Efficient Similarity Search In Sequence Databases. In: Lomet D (ed) Proceedings of the 4th International Conference of Foundations of Data Organization and Algorithms (FODO), Springer Verlag, Chicago, Illinois, pp 69–84

Alon J, Sclaroff S, Kollios G, Pavlovic V (2003) Discovering clusters in motion time-series data. In: CVPR (1), pp 375–381

Alvares LO, Bogorny V, Kuijpers B, de Macedo JAF, Moelans B, Vaisman A (2007) A model for enriching trajectories with semantic geographical information. In: GIS ’07: Proceedings of the 15th annual ACM international symposium on Advances in geographic information systems, pp 1–8

Andrienko G, Andrienko N (2008) Spatio-temporal aggregation for visual analysis of movements. In: Proceedings of IEEE Symposium on Visual Analytics Science and Technology (VAST 2008), IEEE Computer Society Press, pp 51–58

Andrienko G, Andrienko N (2009) Interactive cluster analysis of diverse types of spatiotemporal data. ACM SIGKDD Explorations

Andrienko G, Andrienko N (2010) Spatial generalization and aggregation of massive movement data. IEEE Transactions on Visualization and Computer Graphics (TVCG), accepted

Andrienko G, Andrienko N, Wrobel S (2007) Visual analytics tools for analysis of movement data. SIGKDD Explorations Newsletter 9(2):38–46

Andrienko G, Andrienko N, Rinzivillo S, Nanni M, Pedreschi D, Giannotti F (2009) Interactive Visual Clustering of Large Collections of Trajectories. VAST 2009

Andrienko N, Andrienko G (2006) Exploratory analysis of spatial and temporal data: a systematic approach. Springer Verlag

Ankerst M, Breunig MM, Kriegel HP, Sander J (1999) Optics: ordering points to identify the clustering structure. SIGMOD Rec 28(2):49–60


Baglioni M, Antonio Fernandes de Macedo J, Renso C, Trasarti R, Wachowicz M (2009) Towards semantic interpretation of movement behavior. Advances in GIScience pp 271–288

Berndt DJ, Clifford J (1996) Finding patterns in time series: a dynamic programming approach. Advances in knowledge discovery and data mining pp 229–248

Birant D, Kut A (2006) An algorithm to discover spatial-temporal distributions of physical seawater characteristics and a case study in Turkish seas. Journal of Marine Science and Technology pp 183–192

Birant D, Kut A (2007) St-dbscan: An algorithm for clustering spatial-temporal data. Data Knowl Eng 60(1):208–221

Chan KP, Fu AWC (1999) Efficient time series matching by wavelets. In: ICDE, pp 126–133

Chen L, Özsu MT, Oria V (2005) Robust and fast similarity search for moving object trajectories. In: SIGMOD ’05: Proceedings of the 2005 ACM SIGMOD international conference on Management of data, ACM, New York, NY, USA, pp 491–502

Chudova D, Gaffney S, Mjolsness E, Smyth P (2003) Translation-invariant mixture models for curve clustering. In: KDD ’03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, New York, NY, USA, pp 79–88

Ciaccia P, Patella M, Zezula P (1997) M-tree: An efficient access method for similarity search in metric spaces. In: Jarke M, Carey M, Dittrich KR, Lochovsky F, Loucopoulos P, Jeusfeld MA (eds) Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB’97), Morgan Kaufmann Publishers, Inc., Athens, Greece, pp 426–435

Cohen S, Rokach L, Maimon O (2007) Decision Tree Instance Space Decomposition with Grouped Gain-Ratio. Information Science 177(17):3592–3612

Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. Data Mining and Knowledge Discovery pp 226–231

Fosca G, Dino P (2008) Mobility, Data Mining and Privacy: Geographic Knowledge Discovery. Springer

Frentzos E, Gratsias K, Theodoridis Y (2007) Index-based most similar trajectory search. In: ICDE, pp 816–825

Gaffney S, Smyth P (1999) Trajectory clustering with mixtures of regression models. In: KDD ’99: Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, New York, NY, USA, pp 63–72

Giannotti F, Nanni M, Pinelli F, Pedreschi D (2007) Trajectory pattern mining. In: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, p 339

Grinstein G, Plaisant C, Laskowski S, O’Connell T, Scholtz J, Whiting M (2008) VAST 2008 Challenge: Introducing mini-challenges. In: Proceedings of IEEE Symposium, vol 1, pp 195–196

Gudmundsson J, van Kreveld M (2006) Computing longest duration flocks in trajectory data. In: GIS ’06: Proceedings of the 14th annual ACM international symposium on Advances in geographic information systems, ACM, New York, NY, USA, pp 35–42

Hwang SY, Liu YH, Chiu JK, Lim EP (2005) Mining mobile group patterns: A trajectory-based approach. In: PAKDD, pp 713–718

Iyengar VS (2004) On detecting space-time clusters. In: Proceedings of the 10th International Conference on Knowledge Discovery and Data Mining (KDD’04), ACM, pp 587–592


Jeung H, Yiu ML, Zhou X, Jensen CS, Shen HT (2008) Discovery of convoys in trajectory databases. Proc VLDB Endow 1(1):1068–1080

Kalnis P, Mamoulis N, Bakiras S (2005) On discovering moving clusters in spatio-temporal data. Advances in Spatial and Temporal Databases pp 364–381

Kang J, Yong HS (2009) Mining Trajectory Patterns by Incorporating Temporal Properties. Proceedings of the 1st International Conference on Emerging Databases

Kang JH, Welbourne W, Stewart B, Borriello G (2004) Extracting places from traces of locations. In: WMASH ’04: Proceedings of the 2nd ACM international workshop on Wireless mobile applications and services on WLAN hotspots, ACM, New York, NY, USA, pp 110–118

Kisilevich S, Keim D, Rokach L (2010) A novel approach to mining travel sequences using collections of geo-tagged photos. In: The 13th AGILE International Conference on Geographic Information Science

Kulldorff M (1997) A spatial scan statistic. Communications in Statistics: Theory and Methods 26(6):1481–1496

Lee JG, Han J, Whang KY (2007) Trajectory clustering: a partition-and-group framework. In: SIGMOD Conference, pp 593–604

Li Y, Han J, Yang J (2004a) Clustering moving objects. In: Proceedings of the 10th International Conference on Knowledge Discovery and Data Mining (KDD’04), ACM, pp 617–622

Li Y, Han J, Yang J (2004b) Clustering moving objects. In: KDD, pp 617–622

Maimon O, Rokach L (2001) Data Mining by Attribute Decomposition with semiconductors manufacturing case study. In: Braha D (ed) Data Mining for Design and Manufacturing: Methods and Applications, Kluwer Academic Publishers, pp 311–336

Miller HJ, Han J (2009) Geographic data mining and knowledge discovery. Chapman & Hall/CRC

Nanni M, Pedreschi D (2006) Time-focused clustering of trajectories of moving objects. Journal of Intelligent Information Systems 27(3):267–289

Palma AT, Bogorny V, Kuijpers B, Alvares LO (2008) A clustering-based approach for discovering interesting places in trajectories. In: SAC ’08: Proceedings of the 2008 ACM symposium on Applied computing, pp 863–868

Pelekis N, Kopanakis I, Marketos G, Ntoutsi I, Andrienko G, Theodoridis Y (2007) Similarity search in trajectory databases. In: TIME ’07: Proceedings of the 14th International Symposium on Temporal Representation and Reasoning, IEEE Computer Society, Washington, DC, USA, pp 129–140

Reades J, Calabrese F, Sevtsuk A, Ratti C (2007) Cellular census: Explorations in urban data collection. IEEE Pervasive Computing 6(3):30–38

Rinzivillo S, Pedreschi D, Nanni M, Giannotti F, Andrienko N, Andrienko G (2008) Visually driven analysis of movement data by progressive clustering. Information Visualization 7(3):225–239

Rokach L, Maimon O (2005b) Feature Set Decomposition for Decision Trees. Journal of Intelligent Data Analysis 9(2):131–158

Rokach L (2008) Genetic algorithm-based feature set partitioning for classification problems. Pattern Recognition 41(5):1676–1700

Rokach L, Maimon O, Lavi I (2003) Space Decomposition In Data Mining: A Clustering Approach. In: Proceedings of the 14th International Symposium On Methodologies For Intelligent Systems, Maebashi, Japan, Lecture Notes in Computer Science, Springer-Verlag, pp 24–31


Schilit BN, LaMarca A, Borriello G, Griswold WG, McDonald D, Lazowska E, Balachandran A, Hong J, Iverson V (2003) Challenge: ubiquitous location-aware computing and the “place lab” initiative. In: WMASH ’03: Proceedings of the 1st ACM international workshop on Wireless mobile applications and services on WLAN hotspots, ACM, New York, NY, USA, pp 29–35

Stolorz P, Nakamura H, Mesrobian E, Muntz RR, Santos JR, Yi J, Ng K (1995) Fast spatio-temporal data mining of large geophysical datasets. In: Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD’95), AAAI Press, pp 300–305

Theodoridis Y (2003) Ten benchmark database queries for location-based services. The Computer Journal 46(6):713–725

Vieira MR, Bakalov P, Tsotras VJ (2009) On-line discovery of flock patterns in spatio-temporal data. In: GIS ’09: Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, New York, NY, USA, pp 286–295

Vlachos M, Kollios G, Gunopulos D (2002) Discovering similar multidimensional trajectories. In: Proceedings of the International Conference on Data Engineering, pp 673–684

Vlachos M, Hadjieleftheriou M, Gunopulos D, Keogh E (2003) Indexing multi-dimensional time-series with support for multiple distance measures. In: KDD ’03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, New York, NY, USA, pp 216–225

Wang M, Wang A, Li A (2006) Mining Spatial-temporal Clusters from Geo-databases. Lecture Notes in Computer Science 4093:263

Zhang P, Huang Y, Shekhar S, Kumar V (2003) Correlation analysis of spatial time series datasets: A filter-and-refine approach. In: Proc of the 7th PAKDD

Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very large databases. ACM SIGMOD Record 25(2):103–114

Zheng Y, Zhang L, Xie X, Ma WY (2009) Mining interesting locations and travel sequences from GPS trajectories. In: WWW ’09: Proceedings of the 18th international conference on World wide web, pp 791–800


Data Mining for Imbalanced Datasets: An Overview

Nitesh V Chawla

Department of Computer Science and Engineering

University of Notre Dame

IN 46530, USA

nchawla@cse.nd.edu

Summary. A dataset is imbalanced if the classification categories are not approximately equally represented. Recent years brought increased interest in applying machine learning techniques to difficult “real-world” problems, many of which are characterized by imbalanced data. Additionally, the distribution of the testing data may differ from that of the training data, and the true misclassification costs may be unknown at learning time. Predictive accuracy, a popular choice for evaluating the performance of a classifier, might not be appropriate when the data is imbalanced and/or the costs of different errors vary markedly. In this chapter, we discuss some of the sampling techniques used for balancing the datasets, and the performance measures more appropriate for mining imbalanced datasets.

Key words: imbalanced datasets, classification, sampling, ROC, cost-sensitive measures, precision and recall

45.1 Introduction

The issue of imbalance in the class distribution became more pronounced with the application of machine learning algorithms to the real world. These applications range from telecommunications management (Ezawa et al., 1996), bioinformatics (Radivojac et al., 2004), text classification (Lewis and Catlett, 1994, Dumais et al., 1998, Mladenić and Grobelnik, 1999, Cohen, 1995b) and speech recognition (Liu et al., 2004) to the detection of oil spills in satellite images (Kubat et al., 1998). The imbalance can be an artifact of class distribution and/or different costs of errors or examples. It has received attention from the machine learning and Data Mining community in the form of Workshops (Japkowicz, 2000b, Chawla et al., 2003a, Dietterich et al., 2003, Ferri et al., 2004) and Special Issues (Chawla et al., 2004a). The range of papers in these venues exhibited the pervasive and ubiquitous nature of the class imbalance issues faced by the Data Mining community. Sampling methodologies continue to be popular in the research work. However, the research continues to evolve with different applications, as each application provides a compelling problem. One focus of the initial workshops was primarily the performance evaluation criteria for mining imbalanced datasets. The limitation of accuracy as the performance measure was quickly established. ROC curves soon emerged as a popular choice (Ferri et al., 2004).

O. Maimon, L. Rokach (eds.), Data Mining and Knowledge Discovery Handbook, 2nd ed., DOI 10.1007/978-0-387-09823-4_45, © Springer Science+Business Media, LLC 2010

The compelling question, given the different class distributions, is: What is the correct distribution for a learning algorithm? Weiss and Provost presented a detailed analysis of the effect of class distribution on classifier learning (Weiss and Provost, 2003). Our observations agree with their work that the natural distribution is often not the best distribution for learning a classifier (Chawla, 2003). Also, the imbalance in the data can be more characteristic of “sparseness” in feature space than of the class imbalance. Various re-sampling strategies have been used, such as random oversampling with replacement, random undersampling, focused oversampling, focused undersampling, oversampling with synthetic generation of new samples based on the known information, and combinations of the above techniques (Chawla et al., 2004b).
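The two simplest strategies in this list, random undersampling and random oversampling with replacement, can be sketched as follows. This is an illustrative sketch, not the procedure of any cited paper: the function names are hypothetical, both functions balance the classes fully (other target ratios are equally possible), and the majority class is assumed to be the larger one.

```python
import random


def random_undersample(majority, minority, seed=0):
    """Keep a random subset of the majority class, the size of the minority."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), list(minority)


def random_oversample(majority, minority, seed=0):
    """Duplicate random minority examples (with replacement) until balanced."""
    rng = random.Random(seed)
    n_extra = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return list(majority), list(minority) + extra


# Toy data: 98 majority examples and 2 minority examples.
majority = [("neg", i) for i in range(98)]
minority = [("pos", i) for i in range(2)]

under_maj, under_min = random_undersample(majority, minority)
over_maj, over_min = random_oversample(majority, minority)
print(len(under_maj), len(under_min))  # 2 2
print(len(over_maj), len(over_min))    # 98 98
```

The sketch makes the trade-off visible: undersampling discards 96 of the 98 majority examples, while oversampling adds no new information and only repeats the two minority examples.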

In addition to the issue of inter-class distribution, another important problem arising from the sparsity in data is the distribution of data within each class (Japkowicz, 2001a). This problem was also linked to the issue of small disjuncts in decision tree learning. Yet another school of thought is a recognition-based approach in the form of a one-class learner. One-class learners provide an interesting alternative to the traditional discriminative approach, wherein the classifier is learned on the target class alone (Japkowicz, 2001b, Juszczak and Duin, 2003, Raskutti and Kowalczyk, 2004, Tax, 2001).

In this chapter¹, we present a liberal overview of the problem of mining imbalanced datasets, with particular focus on performance measures and sampling methodologies. We will present our novel oversampling technique, SMOTE, and its extension in the boosting procedure, SMOTEBoost.

45.2 Performance Measure

A classifier is typically evaluated by a confusion matrix, as illustrated in Figure 45.1 (Chawla et al., 2002). The columns are the Predicted class and the rows are the Actual class. In the confusion matrix, TN is the number of negative examples correctly classified (True Negatives), FP is the number of negative examples incorrectly classified as positive (False Positives), FN is the number of positive examples incorrectly classified as negative (False Negatives) and TP is the number of positive examples correctly classified (True Positives). Predictive accuracy is defined as Accuracy = (TP + TN)/(TP + FP + TN + FN).

However, predictive accuracy might not be appropriate when the data is imbalanced and/or the costs of different errors vary markedly. As an example, consider the classification of pixels in mammogram images as possibly cancerous (Woods et al., 1993). A typical mammography dataset might contain 98% normal pixels and 2% abnormal pixels. A simple default strategy of guessing the majority class would give a predictive accuracy of 98%. The nature of the application requires a fairly high rate of correct detection in the minority class and allows for a small error rate in the majority class in order to achieve this (Chawla et al., 2002). Simple predictive accuracy is clearly not appropriate in such situations.
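The mammography example can be checked with a few lines of code. The sketch below builds a toy dataset with the 98%/2% split from the text and evaluates the default strategy of always guessing the majority class; the helper name `confusion_counts` and the 0/1 label encoding are our own assumptions.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (TP, FP, TN, FN) for a binary problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    return tp, fp, tn, fn


# 98 normal pixels (label 0) and 2 abnormal pixels (label 1).
actual = [0] * 98 + [1] * 2
predicted = [0] * 100  # always guess the majority class

tp, fp, tn, fn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / (tp + fp + tn + fn)
minority_detection_rate = tp / (tp + fn)
print(accuracy, minority_detection_rate)  # 0.98 0.0
```

The classifier reaches 98% accuracy while detecting none of the abnormal pixels, which is exactly the failure that accuracy hides.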

¹ The chapter will utilize excerpts from our published work in various Journals and Conferences. Please see the references for the original publications.


                 Predicted Negative   Predicted Positive
Actual Negative  TN                   FP
Actual Positive  FN                   TP

Fig. 45.1 Confusion Matrix

45.2.1 ROC Curves

The Receiver Operating Characteristic (ROC) curve is a standard technique for summarizing classifier performance over a range of tradeoffs between true positive and false positive error rates (Swets, 1988). The Area Under the Curve (AUC) is an accepted performance metric for a ROC curve (Bradley, 1997).

[Figure: ROC plot with Percent False Positive on the x-axis and Percent True Positive on the y-axis, showing the line y = x, the ideal point, the operating point for the original data set, and the note that increased undersampling of the majority class moves the operating point to the upper right.]

Fig. 45.2 Illustration of sweeping out an ROC curve through under-sampling. Increased under-sampling of the majority (negative) class will move the performance from the lower left point to the upper right.

ROC curves can be thought of as representing the family of best decision boundaries for relative costs of TP and FP. On an ROC curve, the X-axis represents %FP = FP/(TN + FP) and the Y-axis represents %TP = TP/(TP + FN). The ideal point on the ROC curve would be (0,100), that is, all positive examples are classified correctly and no negative examples are misclassified as positive. One way an ROC curve can be swept out is by manipulating the balance of training samples for each class in the training set. Figure 45.2 shows an illustration (Chawla et al., 2002). The line y = x represents the scenario of randomly guessing the class. A single operating point of a classifier can be chosen from the trade-off between the %TP and %FP, that is, one can choose the classifier giving the best %TP for an acceptable %FP (Neyman-Pearson method) (Egan, 1975). Area Under the ROC Curve (AUC) is a useful metric for classifier performance as it is independent of the decision criterion selected and prior probabilities. The AUC comparison can establish a dominance relationship between classifiers. If the ROC curves are intersecting, the total AUC is an average comparison between models (Lee, 2000).
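The following sketch shows one common way to obtain ROC points and the AUC from classifier scores. Note the assumption: instead of sweeping the curve by re-sampling the training set as in Figure 45.2, it sweeps a decision threshold over the scores of a single classifier, which is a different but widely used construction. The function names, scores and labels are hypothetical, and both classes are assumed to be present.

```python
def roc_points(scores, labels):
    """Sweep a decision threshold over the scores, highest first.

    labels use 1 for the positive (minority) class and 0 for the negative
    class. Returns a list of (%FP, %TP) points starting at (0, 0).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for k, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only once all examples sharing this score are counted.
        last_of_tie = k + 1 == len(ranked) or ranked[k + 1][0] != score
        if last_of_tie:
            points.append((100.0 * fp / n_neg, 100.0 * tp / n_pos))
    return points


def auc(points):
    """Trapezoidal area under the ROC points, rescaled from percent to [0, 1]."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / (100.0 * 100.0)


scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
print(curve[-1], round(auc(curve), 4))  # (100.0, 100.0) 0.8889
```

A Neyman-Pearson style choice of operating point then amounts to picking, among the points whose %FP is acceptable, the one with the highest %TP.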

The ROC convex hull can also be used as a robust method of identifying potentially optimal classifiers (Provost and Fawcett, 2001). Given a family of ROC curves, the ROC convex hull can include points that are more towards the north-west frontier of the ROC space. If a line passes through a point on the convex hull, then there is no other line with the same slope passing through another point with a larger true positive (TP) intercept. Thus, the classifier at that point is optimal under any distribution assumptions in tandem with that slope (Provost and Fawcett, 2001).
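A minimal sketch of computing such a hull is given below. It uses a standard monotone-chain upper-hull scan rather than any procedure prescribed by Provost and Fawcett; the function names are hypothetical, points are (%FP, %TP) pairs in percent, and the trivial classifiers (0, 0) and (100, 100) are always added as end points.

```python
def cross(o, a, b):
    """Cross product of vectors o->a and o->b (positive for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def roc_convex_hull(points):
    """Upper convex hull of ROC points given as (%FP, %TP) pairs."""
    pts = sorted(set(points) | {(0.0, 0.0), (100.0, 100.0)})
    hull = []
    for p in pts:
        # Drop the last hull point while it lies on or below the segment
        # joining its predecessor to the new point p.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


# Three hypothetical classifiers; the middle one is dominated.
classifiers = [(10.0, 60.0), (30.0, 50.0), (40.0, 90.0)]
print(roc_convex_hull(classifiers))
# [(0.0, 0.0), (10.0, 60.0), (40.0, 90.0), (100.0, 100.0)]
```

The classifier at (30, 50) does not appear on the hull, so under these assumptions it would not be optimal for any slope, that is, for any combination of class distribution and error costs.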

Moreover, distribution/cost-sensitive applications can require a ranking or a probabilistic estimate of the instances. For instance, revisiting our mammography data example, a probabilistic estimate or ranking of cancerous cases can be decisive for the practitioner (Chawla, 2003, Maloof, 2003). The cost of further tests can be decreased by thresholding the patients at a particular rank. Secondly, probabilistic estimates can allow one to threshold ranking for class membership at values < 0.5. The ROC methodology by Hand (1997) allows for ranking of examples based on their class memberships: whether a randomly chosen majority class example has a higher majority class membership than a randomly chosen minority class example. It is equivalent to the Wilcoxon test statistic.
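The pairwise comparison described above can be written down directly. The sketch below is a simple illustration, not Hand's own formulation: it counts, over all pairs of one positive and one negative example, how often the positive example receives the higher score, with ties counted as one half. The function name and the data are hypothetical, and the class roles can be swapped without changing the result.

```python
def auc_by_ranking(scores, labels):
    """Fraction of (positive, negative) pairs ranked in the right order."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ordering
    return wins / (len(pos) * len(neg))


scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
print(round(auc_by_ranking(scores, labels), 4))  # 0.8889
```

Here 8 of the 9 positive-negative pairs are ordered correctly, which gives 8/9. The quadratic loop is fine for a sketch; a rank-based computation would be preferable for large datasets.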

45.2.2 Precision and Recall

From the confusion matrix in Figure 45.1, we can derive the expressions for precision and recall (Buckland and Gey, 1994).

$$\text{precision} = \frac{TP}{TP + FP}$$

$$\text{recall} = \frac{TP}{TP + FN}$$

The main goal for learning from imbalanced datasets is to improve the recall without hurting the precision. However, recall and precision goals can often be conflicting, since when increasing the true positives for the minority class, the number of false positives can also be increased; this will reduce the precision. The F-value metric is one measure that combines the trade-offs of precision and recall, and outputs a single number reflecting the “goodness” of a classifier in the presence of rare classes. While ROC curves represent the trade-off between values of TP and FP, the F-value represents the trade-off among different values of TP, FP, and FN (Buckland and Gey, 1994). The expression for the F-value is as follows:

$$F\text{-value} = \frac{(1 + \beta^2) \cdot \text{recall} \cdot \text{precision}}{\beta^2 \cdot \text{recall} + \text{precision}}$$

where β corresponds to the relative importance of precision vs recall. It is usually set to 1.
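As a small worked example, the function below evaluates precision, recall and the F-value from confusion-matrix counts, following the F-value expression as written in this section. The function name and the counts are hypothetical, and the denominators are assumed to be non-zero.

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """Precision, recall and F-value from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    f_value = (1 + b2) * recall * precision / (b2 * recall + precision)
    return precision, recall, f_value


# Hypothetical minority-class results: 40 hits, 10 false alarms, 40 misses.
precision, recall, f_value = precision_recall_f(tp=40, fp=10, fn=40)
print(precision, recall, round(f_value, 4))  # 0.8 0.5 0.6154
```

With β = 1 the F-value is the harmonic mean of the two measures, so the low recall of 0.5 pulls it well below the precision of 0.8.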


45.2.3 Cost-sensitive Measures

Cost Matrix

Cost-sensitive measures usually assume that the costs of making an error are known (Turney, 2000, Domingos, 1999, Elkan, 2001). That is, one has a cost matrix, which defines the costs incurred in false positives and false negatives. Each example, x, can be associated with a cost C(i, j, x), which defines the cost of predicting class i for x when the “true” class is j. The goal is to take a decision to minimize the expected cost. The optimal prediction for x can be defined as the class i that minimizes

$$\sum_{j} P(j \mid x)\, C(i, j, x)$$

The aforementioned equation requires a computation of conditional probabilities of class j given feature vector or example x. While the cost equation is straightforward, we don’t always have a cost attached to making an error. The costs can be different for every example and not only for every type of error. Thus, C(i, j) is not always equivalent to C(i, j, x).
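The decision rule can be sketched in a few lines, assuming the usual formulation in which the expected cost of predicting class i is the sum over true classes j of P(j | x) times the cost of predicting i when the truth is j. The function name, the probabilities and the cost values are hypothetical, and the cost is taken to be the same for every example, i.e. C(i, j) rather than C(i, j, x).

```python
def optimal_prediction(class_probs, cost):
    """Pick the class with the lowest expected cost.

    class_probs: dict mapping each true class j to P(j | x).
    cost: dict mapping (i, j) to the cost of predicting i when the truth is j.
    """
    classes = list(class_probs)

    def expected_cost(i):
        return sum(class_probs[j] * cost[(i, j)] for j in classes)

    return min(classes, key=expected_cost)


# Hypothetical example: the positive class is unlikely but costly to miss.
probs = {"neg": 0.9, "pos": 0.1}
cost = {
    ("neg", "neg"): 0.0, ("pos", "pos"): 0.0,
    ("pos", "neg"): 1.0,   # false positive
    ("neg", "pos"): 20.0,  # false negative
}
print(optimal_prediction(probs, cost))  # pos
```

Predicting the negative class has an expected cost of 0.1 × 20 = 2.0, against 0.9 × 1 = 0.9 for the positive class, so the rule predicts positive even though P(pos | x) is only 0.1.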

Cost Curves

Drummond and Holte (2000) propose cost curves, where the x-axis represents the fraction of the positive class in the training set, and the y-axis represents the expected error rate of the classifier grown on each of the training sets. The training sets for a data set are generated by under- (or over-) sampling. The error rates for class distributions not represented are obtained by interpolation. They define two cost-sensitive components for a machine learning algorithm: 1) producing a variety of classifiers applicable for different distributions and 2) selecting the appropriate classifier for the right distribution. However, when the misclassification costs are known, the x-axis can represent the “probability cost function”, which is the normalized product of C(− | +) ∗ P(+); the y-axis represents the expected cost.
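The sketch below shows how the two axes might be computed when costs are known. The text only states that the probability cost function is the normalized product of C(− | +) and P(+); the normalizer used here, the sum of that product and the corresponding product C(+ | −) ∗ P(−) for the negative class, and the form of the normalized expected cost are our assumptions, based on our reading of Drummond and Holte. All names and numbers are hypothetical.

```python
def probability_cost(p_pos, cost_fn, cost_fp):
    """Assumed PCF(+): P(+) C(-|+) over the sum of both weighted costs."""
    weighted_fn = p_pos * cost_fn          # C(-|+) * P(+)
    weighted_fp = (1.0 - p_pos) * cost_fp  # C(+|-) * P(-)
    return weighted_fn / (weighted_fn + weighted_fp)


def normalized_expected_cost(fn_rate, fp_rate, pcf):
    """Assumed y-axis value for a classifier with the given error rates."""
    return fn_rate * pcf + fp_rate * (1.0 - pcf)


# 2% positives, and a missed positive costs ten times a false alarm.
pcf = probability_cost(p_pos=0.02, cost_fn=10.0, cost_fp=1.0)
print(round(pcf, 4))  # 0.1695
print(round(normalized_expected_cost(fn_rate=0.3, fp_rate=0.1, pcf=pcf), 4))
```

Under these assumptions a classifier with fixed error rates traces a straight line across the plot as the probability cost function varies, and with equal costs the x-axis reduces to P(+).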

45.3 Sampling Strategies

Over- and under-sampling methodologies have received significant attention as a way to counter the effect of imbalanced data sets (Solberg and Solberg, 1996, Japkowicz, 2000a, Chawla et al., 2002, Weiss and Provost, 2003, Kubat and Matwin, 1997, Jo and Japkowicz, 2004, Batista et al., 2004, Phua and Alahakoon, 2004, Laurikkala, 2001, Ling and Li, 1998). Various studies in imbalanced datasets have used different variants of over- and under-sampling, and have presented (sometimes conflicting) viewpoints on the usefulness of oversampling versus undersampling (Chawla, 2003, Maloof, 2003, Drummond and Holte, 2003, Batista et al., 2004). The random under- and over-sampling methods have their various shortcomings. The random undersampling method can potentially remove certain important examples, and random oversampling can lead to overfitting. However, there has been progression in both the under- and over-sampling methods. Kubat and Matwin (1997) used one-sided selection to selectively undersample the original population. They used Tomek Links (Tomek, 1976) to identify the noisy and borderline examples. They also used the Condensed Nearest Neighbor (CNN) rule (Hart, 1968) to remove examples from the majority class that are far away from the decision border. Laurikkala (2001) proposed Neighborhood
