Outlier Detection And Class Imbalance Based On Mass Estimation Integrated With Evidential Reasoning



HOANG, Anh

Japan Advanced Institute of Science and Technology


Doctoral Dissertation


Supervisor: Professor HUYNH, Van-Nam

Graduate School of Advanced Science and Technology


(Knowledge Science)

September 2021

Abstract

Outlier detection and class imbalance modeling play significant roles in enabling effective and efficient algorithms for statistical analysis, data mining, machine learning, and knowledge discovery frameworks that work on imbalanced datasets. Although there is a vast literature on imbalanced datasets, the shortcomings of distance-based functions in response to varying densities of data points have not yet been resolved.

The primary aim of this dissertation was to explore a new alternative approach to local outlier detection by fundamentally changing the way the outlier degree of each data point is measured. To achieve this goal, we developed a mass-based approach to measuring the dissimilarity between data points. Then, we introduced a new outlier scoring method that employs mass-based dissimilarity and probability modeling to detect the local outliers in a given dataset. Experiments on artificial datasets and real application datasets show that our proposed MLOS approach is competitive with state-of-the-art approaches.
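To make the idea of mass-based dissimilarity more concrete, the following Python sketch estimates it with randomly built partitioning trees, in the spirit of the mass-based dissimilarity known from the literature: two points count as dissimilar when the smallest region covering both of them contains a large share of the data. This is an illustrative sketch only and not the MLOS implementation of the dissertation; the function names (`build_tree`, `region_mass`, `mass_dissimilarity`), the random axis-parallel cuts, the toy data, and the parameters (`max_depth`, 50 trees) are all assumptions made for the example.

```python
import random


def build_tree(points, rng, depth=0, max_depth=8):
    """Recursively partition `points` with random axis-parallel cuts.

    Each node records its mass, i.e. how many points fall in its region.
    """
    node = {"mass": len(points)}
    if len(points) <= 1 or depth >= max_depth:
        return node
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # cannot split on a constant attribute
        return node
    cut = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < cut]
    right = [p for p in points if p[dim] >= cut]
    node["dim"], node["cut"] = dim, cut
    node["left"] = build_tree(left, rng, depth + 1, max_depth)
    node["right"] = build_tree(right, rng, depth + 1, max_depth)
    return node


def region_mass(node, x, y):
    """Mass of the deepest node whose region contains both x and y."""
    while "cut" in node:
        x_left = x[node["dim"]] < node["cut"]
        y_left = y[node["dim"]] < node["cut"]
        if x_left != y_left:  # this cut separates x from y
            break
        node = node["left"] if x_left else node["right"]
    return node["mass"]


def mass_dissimilarity(trees, x, y, n_points):
    """Average fraction of the data in the smallest region covering x and y."""
    total = sum(region_mass(tree, x, y) for tree in trees)
    return total / (len(trees) * n_points)


rng = random.Random(0)
# Hypothetical data: a dense cluster near the origin plus one distant point.
data = [(rng.gauss(0, 0.1), rng.gauss(0, 0.1)) for _ in range(200)]
data.append((3.0, 3.0))
trees = [build_tree(data, rng) for _ in range(50)]

near = mass_dissimilarity(trees, data[0], data[1], len(data))
far = mass_dissimilarity(trees, data[0], data[-1], len(data))
print(f"two cluster points: {near:.3f}; cluster point vs distant point: {far:.3f}")
```

In this toy setup one would expect the pair that involves the distant point to receive the larger value, because the first cuts tend to separate that point from the cluster. Unlike a geometric distance, the value depends on how the data are distributed around the two points, which is the property that motivates mass-based measures for data of varied density.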

In the same manner, to exploit the mass-based measurement for learning from imbalanced datasets, we introduce two further methods for the class imbalance task. The first model is a simple application of a weighted sum. The second model integrates mass estimation with the Dempster-Shafer theory of evidence. The proposed models were assessed using established evaluation metrics such as the F1 score, Brier score, ROC-AUC, and PR-AUC on a wide range of benchmark datasets.
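As background on the Dempster-Shafer theory of evidence mentioned above, the sketch below shows Dempster's rule of combination, which fuses two mass functions by multiplying the masses of intersecting focal elements and renormalising by one minus the total conflict. It illustrates the general rule only and does not reproduce the EMass model; the function name `combine`, the class labels, and the numeric masses are hypothetical values chosen for the example.

```python
from itertools import product


def combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps a frozenset of hypotheses (a focal element)
    to its mass; the masses of one function sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            combined[common] = combined.get(common, 0.0) + mass_a * mass_b
        else:
            conflict += mass_a * mass_b  # product mass that falls on the empty set
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; the rule is undefined")
    # Renormalise so that the remaining masses sum to 1.
    return {focal: mass / (1.0 - conflict) for focal, mass in combined.items()}


# Hypothetical evidence about one instance in a binary problem.
POS, NEG = frozenset({"minority"}), frozenset({"majority"})
EITHER = POS | NEG  # ignorance: mass not committed to a single class

source_1 = {POS: 0.6, EITHER: 0.4}
source_2 = {POS: 0.5, NEG: 0.3, EITHER: 0.2}

fused = combine(source_1, source_2)
for focal, mass in fused.items():
    print(sorted(focal), round(mass, 4))
```

Keeping some mass on the whole frame (`EITHER`) lets a source express partial ignorance, which is one reason evidential reasoning is attractive when evidence for a minority class is scarce.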

In addition, all experimental results were validated using the non-parametric Wilcoxon signed-rank test.
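For readers unfamiliar with the Wilcoxon signed-rank test, the following sketch computes its rank sums for two methods evaluated on the same datasets. It is a minimal illustration under stated assumptions: the function name `signed_rank_sums` and the scores are hypothetical, zero differences are simply dropped, tied absolute differences receive average ranks, and no p-value is computed (in practice a statistics library would normally be used for that).

```python
def signed_rank_sums(scores_a, scores_b):
    """Return (W_plus, W_minus) for paired scores of two methods.

    Zero differences are discarded; tied absolute differences share
    the average of the ranks they occupy.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of equal absolute differences.
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


# Hypothetical per-dataset scores (in percent) of two methods.
method_a = [81, 75, 90, 66, 72]
method_b = [78, 70, 91, 60, 65]
w_plus, w_minus = signed_rank_sums(method_a, method_b)
print("W+ =", w_plus, "W- =", w_minus, "statistic =", min(w_plus, w_minus))
```

The smaller of the two rank sums is the usual test statistic; a small value relative to the number of datasets suggests that one method is consistently ahead of the other.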

To our knowledge, this dissertation is the first study to investigate the local outlier detection problem using a mass-based dissimilarity measurement; the key finding is that the proposed MLOS approach offers an alternative way to score the outlierness of each data point in a given dataset. Secondly, the simulation results showed that our proposed models for the class imbalance task outperformed 11 competing methods. The experiments were conducted across a wide variety of application domains, imbalance ratios, and numbers of instances.

Keywords: Imbalanced data, outlier detection, outlier modeling, mass-based dissimilarity, weighted sum, Dempster-Shafer theory

Acknowledgments

Writing the acknowledgment is always the nicest part! First, I would like to thank the International Cooperation Department, Ministry of Education and Training, Viet Nam, for providing me with a scholarship as part of Project 911. I could not have started my Doctor of Philosophy program at the Japan Advanced Institute of Science and Technology (JAIST) without this financial support.

Special thanks to my supervisor, Professor HUYNH Van-Nam, for all he has done, which I will never forget. I truly appreciate the time he spent helping me on many occasions with exceptional support. Thank you for sharing knowledge not only about doing scientific research but also about living a happy life. Besides, I received generous encouragement and assistance from the members of HUYNH's Lab in this work, especially Mr. Toan and Mr. Vinh.

I would like to express my thanks to Professor HASHIMOTO Takashi for his wonderful course Introduction to Knowledge Science (K218). I enjoyed every minute of the lectures as well as the discussions in the office hours.

Professor DAM Hieu-Chi, thank you so much for caring about both what and how you taught us. I have learned many fundamental concepts and methodologies of data science from your courses. In addition, I would particularly like to mention the enjoyable time we spent playing soccer together.

My sincere thanks go to the JAIST Supercomputer Unit for keeping software and services running smoothly so that I could conduct the experimental studies. Thanks to the Student Welfare Section for supporting my life at JAIST. Thanks to the Educational Service Section, the Secretarial Service Section, and the other sections at JAIST for their unconditional help.

Last and most of all, I am grateful to the committee members and the audience for their questions and comments, which will help a lot to improve my work.

Finally, a lot of people have supported me, and I relish this opportunity to thank them. Thanks to the members of JAIST's Football Club, with whom I could leave everything behind and enjoy playing sports together. Thanks to my colleagues and friends, who often ask about my health and my progress. In particular, my parents have always believed in me. My spouse and my son, thank you so much for being part of my life.

Contents

  1.1 Research motivations
    1.1.1 Outlier detection
    1.1.2 Class imbalance
  1.2 Research questions and contributions
    1.2.1 Research questions
    1.2.2 Main contributions
    1.2.3 Future directions
  1.3 Dissertation organization

Chapter 2 Research background
  2.1 Hierarchical partitioning method
  2.2 Mass-based dissimilarity measurement
    2.2.1 Definition 1
    2.2.2 Definition 2
    2.2.3 k-lowest mass-based dissimilarity neighbors
  2.3 Dempster-Shafer theory
  2.4 Evaluation metrics
  2.5 Non-parametric statistical analysis

Chapter 3 Outlier detection
  3.1 Introduction
  3.2 Problem formulation
  3.3 Literature review
    3.3.1 Geometric outlier modeling
    3.3.2 Semi-supervised outlier modeling
  3.4 Proposed MLOS approach
    3.4.1 Notations
    3.4.2 Stage 1: Data preparation
    3.4.3 Stage 2: Data partitioning technique
    3.4.4 Stage 3: Outlier scoring
  3.5 Experimental result
    3.5.1 Experimental results on synthetic datasets
    3.5.2 Experimental results on benchmark datasets
    3.5.3 Non-parametric statistic test
  3.6 Chapter conclusions

Chapter 4 Class imbalance
  4.1 Introduction
  4.2 Class imbalance statement
  4.3 Methodology
    4.3.1 Confidence estimation
    4.3.2 Mass-based similarity measurement
    4.3.3 Mass-based similarity weighted k-neighbor: Sk-LMN approach
    4.3.4 Mass-based similarity integrated with evidential reasoning: EMass approach
  4.4 Experimental studies
    4.4.1 Dataset description
    4.4.2 Implementation details and evaluation metrics
    4.4.3 Results and discussions
  4.5 Chapter conclusions

Chapter 5 Summary and future works
