Outlier Detection and Class Imbalance Based on Mass Estimation Integrated with Evidential Reasoning
HOANG, Anh
Japan Advanced Institute of Science and Technology
Doctoral Dissertation
OUTLIER DETECTION AND CLASS IMBALANCE BASED ON MASS ESTIMATION INTEGRATED WITH EVIDENTIAL REASONING
HOANG, Anh
Supervisor: Professor HUYNH, Van-Nam
Graduate School of Advanced Science and Technology
Japan Advanced Institute of Science and Technology
(Knowledge Science)
September 2021
Abstract

Outlier detection and class imbalance modeling play significant roles in enabling effective and efficient algorithms for statistical analysis, data mining, machine learning, and knowledge discovery frameworks that work on imbalanced datasets. Although there is a vast literature on imbalanced datasets, the shortcomings of distance-based functions in response to varying densities of data points have not yet been resolved.
The primary aim of this dissertation was to pursue an alternative approach to local outlier detection by fundamentally changing the way the outlier degree of each data point is measured. To achieve this goal, we developed a mass-based approach to measure the dissimilarity between data points. We then introduced a new outlier scoring method that employs mass-based dissimilarity and probability modeling to detect the local outliers in a given dataset. Experiments on artificial datasets and real application datasets show that our proposed MLOS approach is competitive with state-of-the-art approaches.
In the same manner, to exploit mass-based measurement for learning from imbalanced datasets, we introduce two further methods for the class imbalance task. The first model is a simple application of a weighted sum. The second model integrates mass estimation with the Dempster-Shafer theory of evidence. The proposed models were assessed using evaluation metrics such as the F1 score, Brier score, ROC-AUC, and PR-AUC on a wide range of benchmark datasets. In addition, all experimental results were validated using the non-parametric Wilcoxon signed-rank test.
To our knowledge, this dissertation is the first study to investigate the local outlier detection problem using mass-based dissimilarity measurement; the key finding is that the proposed MLOS approach offers an alternative way to score the outlierness of each data point in a given dataset. Secondly, the simulation results showed that our proposed models for the class imbalance task outperformed the other 11 competing methods. The experiments were conducted across a wide variety of application domains, imbalance ratios, and numbers of instances.
Keywords: Imbalanced data, outlier detection, outlier modeling, mass-based dissimilarity, weighted sum, Dempster-Shafer theory
Acknowledgments

Writing the acknowledgments is always the nicest part! First, I would like to mention the International Cooperation Department, Ministry of Education and Training, Viet Nam, for providing me with a scholarship as part of Project 911. I could not have started my Doctor of Philosophy program at Japan Advanced Institute of Science and Technology (JAIST) without this financial support.

Special thanks to my supervisor, Professor HUYNH Van-Nam, for all he has done, which I will never forget. I truly appreciate the time he spent helping me on many occasions with exceptional support. Thanks for sharing knowledge not only about doing scientific research but also about living a happy life. In addition, I received generous encouragement and assistance from the members of HUYNH’s Lab in this work, especially Mr. Toan and Mr. Vinh.
I would like to express my gratitude to Professor HASHIMOTO Takashi for his wonderful course, Introduction to Knowledge Science (K218). I enjoyed every minute of the lectures as well as the discussions during office hours.
Professor DAM Hieu-Chi, thank you so much for caring about both what and how you have been teaching us. I have learned many fundamental concepts and methodologies of data science from your courses. In addition, I would particularly like to mention the enjoyable time we spent playing soccer together.
My sincere thanks go to the JAIST Supercomputer Unit for keeping software and services running smoothly for the experimental studies. Thanks to the Student Welfare Section for supporting my life at JAIST. Thanks to the Educational Service Section, the Secretarial Service Section, and other sections at JAIST for their unconditional help.
Most of all, I am grateful to the committee members and the audience, who may offer questions and comments. These will help a lot to improve my work.
Finally, many people have supported me, and I relish this opportunity to thank them. Thanks to the members of JAIST’s Football Club, with whom I could leave everything behind and enjoy playing sports together. Thanks to my colleagues and friends, who often ask about my health and my progress. In particular, my parents have always believed in me. My spouse and my son, thank you so much for being part of my life.
Contents

1.1 Research motivations
1.1.1 Outlier detection
1.1.2 Class imbalance
1.2 Research questions and contributions
1.2.1 Research questions
1.2.2 Main contributions
1.2.3 Future directions
1.3 Dissertation organization
Chapter 2 Research background
2.1 Hierarchical partitioning method
2.2 Mass-based dissimilarity measurement
2.2.1 Definition 1
2.2.2 Definition 2
2.2.3 k-lowest mass-based dissimilarity neighbors
2.3 Dempster-Shafer theory
2.4 Evaluation metrics
2.5 Non-parametric statistical analysis
Chapter 3 Outlier detection
3.1 Introduction
3.2 Problem formulation
3.3 Literature review
3.3.1 Geometric outlier modeling
3.3.2 Semi-supervised outlier modeling
3.4 Proposed MLOS approach
3.4.1 Notations
3.4.2 Stage 1: Data preparation
3.4.3 Stage 2: Data partitioning technique
3.4.4 Stage 3: Outlier scoring
3.5 Experimental result
3.5.1 Experimental results on synthetic datasets
3.5.2 Experimental results on benchmark datasets
3.5.3 Non-parametric statistic test
3.6 Chapter conclusions
Chapter 4 Class imbalance
4.1 Introduction
4.2 Class imbalance statement
4.3 Methodology
4.3.1 Confidence estimation
4.3.2 Mass-based similarity measurement
4.3.3 Mass-based similarity weighted k-neighbor Sk-LMN approach
4.3.4 Mass-based similarity integrated with evidential reasoning: EMass approach
4.4 Experimental studies
4.4.1 Dataset description
4.4.2 Implementation details and evaluation metrics
4.4.3 Results and discussions
4.5 Chapter conclusions
Chapter 5 Summary and future works