This study seeks to develop an effective fraud detection model tailored to the characteristics of real-world data using machine learning algorithms. Results indicate that employing an E
Introduction
Problem definition
Automobile insurance offers financial protection against losses from incidents such as accidents, theft, and damage, covering liability, vehicle damage, and medical expenses, with specifics differing by policy and location. As vehicle numbers rise, the demand for automobile insurance increases to mitigate driving-related risks. However, the expanding insurance sector faces challenges from fraudulent activities, including false or exaggerated claims, which impact both insurers and honest policyholders.
Fraudsters employ tactics such as inflating service costs, staging accidents, and filing false claims, posing significant challenges for the industry. This necessitates the development of effective fraud detection and prevention strategies to combat these deceptive practices.
Fraud detection is essential for identifying and preventing fraudulent activities, leveraging data mining, machine learning, and deep learning algorithms to enhance accuracy and automate the detection process. By analyzing historical data to uncover patterns, these algorithms can identify fraudulent behavior in real time, particularly in automobile fraud detection, where methodologies such as machine learning and LSTM recurrent neural networks are employed (Kabir, 2022). Advanced models, such as the "FraudBuster" framework, are designed to detect potential fraud during the underwriting process by examining policy characteristics and loss ratios (Nagrecha et al., 2018). These efforts underscore the importance of advanced analytics in fighting insurance fraud and maintaining the integrity of the insurance industry. Furthermore, data technologies are increasingly used to uncover instances of fraud, while machine learning and AI play crucial roles in risk management and fraud detection (Bart, 2017b; Aziz).
Problem statement
Insurance fraud can manifest in various forms, including the fabrication of non-existent insurance subjects to deceive insurers and the submission of misleading or exaggerated claims. Such dishonest practices not only impose financial burdens on insurance companies but also lead to increased premiums for legitimate policyholders. To address this issue, insurers implement deterrence and detection strategies, which involve specialized investigation units, statistical analysis of claims data, and advanced fraud detection techniques.
Figure 1: Illustration of the measurement of process times and quantities along the core stages of claims management (Notification, Registration, Audit, Settlement, Closing), from the occurrence and registration of a claim to the customer's payment and closing note. Source: (Mahlow & Wagner, 2016)
In automotive insurance, various techniques are employed to detect fraudulent activities, which can manifest in both skilled and unskilled forms. Detecting multiple frauds related to behavioral changes poses a significant challenge for traditional machine learning methods. Current approaches primarily focus on financial evaluations; however, these methods often suffer performance issues due to imbalanced data and do not adequately address the low-rankness of intrinsic samples when outliers arise.
Traditional models often face challenges with imbalanced datasets, especially in machine learning applications such as fraud detection, where the minority class is significantly outnumbered by the majority class. This imbalance can lead to suboptimal outcomes for the minority class. To mitigate this issue, various techniques have been explored, including resampling methods. Notably, oversampling techniques have proven effective in enhancing model performance for the minority class, leading to increased accuracy and reduced false negatives (De Zarza et al., 2023).
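As a hedged illustration of the resampling idea described above, the sketch below applies SMOTE oversampling from the imbalanced-learn library to a synthetic binary-classification dataset; the class ratio and data are illustrative assumptions and do not come from the thesis dataset.

```python
# Minimal sketch: oversampling the minority (fraud) class with SMOTE.
# Assumes scikit-learn and imbalanced-learn are installed; the data here
# is synthetic and only stands in for the real claims dataset.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic imbalanced data: ~5% "fraud" (class 1) vs. ~95% "legitimate" (class 0).
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.95, 0.05], random_state=42
)
print("Before resampling:", Counter(y))

# SMOTE creates synthetic minority samples by interpolating between neighbours.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("After resampling:", Counter(y_res))
```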
Figure 2: Illustration of how the ML-driven approach uses AI and replaces rule-based methods that require human intervention to detect fraud. Source: (Sinha, 2023)
According to research by Benedek et al. (2023), traditional statistics-based methods are more cost-effective than machine-learning approaches for detecting automobile insurance fraud. Furthermore, AI-based fraud detection methods tested on real databases also demonstrate lower cost-effectiveness compared to traditional statistical-econometric techniques.
Traditional methods for detecting insurance fraud face significant drawbacks, including performance degradation caused by imbalanced data and a failure to account for the low-rankness prior; they are also less cost-effective than both traditional statistics-based approaches and advanced AI-based techniques (Tongesai et al., 2022).
Deep learning models offer superior speed in data processing compared to traditional model-based methods, which depend on mathematical formulations and are prone to inaccuracies, particularly in complex or dynamic systems (Shen et al., 2022). In contrast, deep learning techniques learn mappings directly from data, enabling them to function effectively in intricate environments. Nonetheless, the requirement for extensive training datasets and significant computational resources can restrict the use of deep learning models in specific situations (Kabir, Nagrecha, 2021). Therefore, while deep learning models are faster, they may not always be the most efficient choice for all data processing tasks.
To overcome the limitations of traditional models, we propose a strategic plan that integrates them with deep learning techniques. Our first step involves selecting the conventional models that demonstrate optimal performance, specifically Random Forest and Support Vector Machine, which have shown superior results in our research analysis. Next, we address the challenges faced by these models by incorporating modern deep learning approaches through the Stacking technique, ultimately enhancing the accuracy of result predictions.
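To make the Stacking idea concrete, the sketch below combines Random Forest and SVM base learners with a simple meta-learner using scikit-learn's StackingClassifier. It is an illustrative outline only: the feature matrix, labels, and hyperparameters are placeholder assumptions, and the meta-learner here is logistic regression rather than the Keras meta-learner built in this thesis.

```python
# Minimal sketch of a stacking ensemble with Random Forest and SVM base learners.
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Placeholder data standing in for the insurance-claims feature matrix.
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

base_learners = [
    ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
    ("svm", SVC(kernel="rbf", probability=True, random_state=0)),
]
# The meta-learner combines the base learners' predictions into the final decision.
stack = StackingClassifier(estimators=base_learners, final_estimator=LogisticRegression())
stack.fit(X_train, y_train)
print("Stacking accuracy:", stack.score(X_test, y_test))
```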
Figure 3: Comparison of deep learning and traditional machine learning methods.
Related works
This paper investigates two primary methodologies for detecting insurance fraud: traditional models and contemporary deep learning approaches. Traditional models, including Logistic Regression, Support Vector Machine (SVM), Naive Bayes, and Random Forest, have been widely used to identify fraudulent activities in the insurance sector. We focus on enhancing the SVM model using the kernel trick and the radial basis function, while also analyzing the Random Forest model through K-Nearest Neighbor classification. Additionally, we explore advanced deep learning techniques, such as ReLU, Batch Normalization, and Dropout, to improve detection accuracy. Our research aims to establish a theoretical foundation for data processing techniques and feature selection, ultimately seeking to address the limitations of conventional models and uncover innovative solutions in fraud detection.
Thesis Approach and Contribution
The insurance industry is grappling with significant challenges, particularly fraudulent claims that lead to substantial financial losses, and traditional financial evaluation methods are primarily used to assess these fraud cases. In response to this pressing issue, our academic research focuses on improving fraud detection in the property insurance sector by developing a Stacking model. We created and evaluated both conventional models and a custom deep learning model using the Keras framework, utilizing a dataset from Kaggle comprising 1,000 insurance claims related to car collisions across seven U.S. states in 2015, featuring 40 variables. Our key objective was to differentiate between legitimate and fraudulent insurance contracts. We identified two baseline models, Random Forest and Decision Tree, which showed strong performance, and subsequently constructed a Stacking model through a meta-learning approach within the Keras deep learning framework. By harnessing crucial data and insights from our research, insurance companies can significantly mitigate losses from fraudulent activities, thus enhancing the reliability and effectiveness of their claims forecasting. Future research will focus on further refining and improving the model's performance.
Our innovative framework for detecting fraud in automobile insurance addresses sustainable development challenges by enhancing fraud detection accuracy and efficiency. This improvement can significantly reduce financial losses for insurance companies, fostering responsible practices that promote economic stability and ensure the long-term viability of insurance services for individuals and communities. By stimulating industry innovation and technological advancement, our model supports the achievement of Sustainable Development Goal 9 (SDG 9) while aligning with SDG 16 by preventing fraud, strengthening institutions, and promoting transparency and accountability within the insurance sector.
Thesis outline
The research is organized into several key sections. Part two introduces and analyzes traditional and deep learning models from prior studies, leading to the formulation of a strategy for future research directions. Part three offers a detailed overview of the dataset, addressing its limitations and outlining the testing and implementation methods used in our model; this section thoroughly examines our methodologies and experiments. In part four, the results derived from our model are presented, while part five discusses these findings and highlights the research's significance. The concluding section encapsulates the study and suggests potential avenues for future research.
Previous analysis
Evaluation Metrics
Accuracy is a widely used metric for evaluating models, as it indicates the proportion of correct predictions relative to actual outcomes. However, to achieve a more comprehensive and accurate assessment, it is essential to consider additional metrics. In our study, we utilized Precision, Recall, Accuracy, and F1 Score for classification evaluation, each calculated to provide a nuanced understanding of model performance.
$$\text{Precision} = \frac{\text{true positives}}{\text{total predicted positive}} \qquad \text{Recall} = \frac{\text{true positives}}{\text{total actual positive}}$$

$$\text{Accuracy} = \frac{\text{correct predictions}}{\text{total sample}} \qquad \text{F1 score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
In our ongoing project, we have encountered a significant issue during experimentation: although we achieved over 73% accuracy, some models demonstrate a bias towards the majority class. This bias leads to misclassification of numerous observations and a failure to detect fraud, highlighting that accuracy alone is not sufficient for evaluating the effectiveness of our models.
In the context of imbalanced data, accuracy is no longer the sole metric for evaluating model performance; instead, AUC has gained popularity as a key performance metric (Seng et al., 2021). Relying solely on accuracy has been criticized, particularly when dealing with imbalanced datasets (Malhotra & Lata, 2021). Therefore, the evaluation process will incorporate additional metrics such as recall, precision, F1-score, MCC, and AUC-ROC to provide a more comprehensive assessment of model performance.
To effectively visualize research outcomes, confusion matrices are employed to illustrate the model's actual performance. This approach enhances our understanding of experimental models, enabling necessary adjustments for improved results. Confusion matrices serve as a crucial tool for assessing the practical efficacy of models, particularly in binary classification scenarios such as insurance fraud detection, which is the focus of our study.
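As a hedged sketch of how these metrics and the confusion matrix can be computed in practice, the snippet below uses scikit-learn on placeholder label vectors; it only illustrates the metric calls and is not the evaluation code of this thesis.

```python
# Minimal sketch: computing the evaluation metrics discussed above with scikit-learn.
# y_true and y_score are placeholder arrays standing in for real model outputs.
import numpy as np
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    matthews_corrcoef, roc_auc_score, confusion_matrix,
)

y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1, 0, 0])      # ground-truth labels (1 = fraud)
y_score = np.array([0.1, 0.4, 0.2, 0.3, 0.8, 0.3, 0.2, 0.9, 0.1, 0.6])  # predicted probabilities
y_pred = (y_score >= 0.5).astype(int)                   # hard predictions at a 0.5 threshold

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("MCC      :", matthews_corrcoef(y_true, y_pred))
print("AUC-ROC  :", roc_auc_score(y_true, y_score))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
```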
Traditional Methods
Recent advancements in insurance fraud detection highlight the ongoing efforts of researchers to enhance fraud prevention systems. A new study contextualizes its findings by comparing them with existing literature, recognizing the importance of traditional methods and prior research that utilized machine learning and data mining techniques (Gomes et al., 2021). This study introduces a model incorporating Logistic Regression, Support Vector Machine (SVM), and Naive Bayes, utilizing the Boruta algorithm for feature selection. The comparative evaluation reveals that SVM achieves high accuracy, while Logistic Regression demonstrates remarkable recall and sensitivity (Kumar & Gopal, 2010). These results resonate with Kumar et al.'s findings but underscore the continued debate over the most effective fraud detection model, indicating a need for further exploration.
Gomes et al.'s research, focused solely on U.S. auto insurance data from 2020, raises concerns about its relevance to current market conditions and highlights the need to explore alternative prediction models for improved fraud detection. The study's emphasis on the auto insurance company's perspective suggests a gap in understanding, inviting future research to include insights from policyholders and auto repair workshops. While their methodologies are noteworthy, the limitations underscore the importance of incorporating diverse viewpoints and investigating different models to enhance the field of insurance fraud detection.
Support Vector Machines (SVM) are powerful machine learning algorithms utilized for both classification and regression tasks. In traditional regression analysis, the primary aim is to minimize the least squared errors by determining the regression coefficients, denoted as w. The core principle of Support Vector Classification (SVC) is to establish a tolerance band around the regression line, within which all data points are expected to fall. The SVC's objective is to minimize the coefficients using the l2-norm while managing errors through constraints with a maximum allowable error of ε. To accommodate instances where errors exceed this threshold, a slack parameter is introduced, enabling data points to lie outside the tolerance band while still striving to remain as close to it as possible. This SVC paradigm, including the role of slack variables, is visually represented in Figure 4.
Figure 4: The Support Vector Paradigm. Source: (Poufinas et al., 2023c)
The gray area surrounding the regression line indicates the ε tolerance band, where data points within this zone do not affect the objective function. In contrast, points outside this band contribute to the minimization objective function. The marginal black points within the gray zone are identified as support vectors, which are crucial in determining the position of the regression line.
The kernel trick is utilized when linear classifiers are insufficient, transforming the optimization space's dimensionality during training by projecting data points into a higher-dimensional feature space and allowing for acceptable error in regression hyperplanes. In the context of auto insurance fraud detection, Nian et al. introduced a novel method called "Unsupervised Spectral Ranking for Anomaly" (SRA), which integrates support vector machines (SVM) with spectral optimization techniques and classification strength measurement, eliminating the need for labeled data. This approach leverages the Laplacian matrix to detect anomalies in auto insurance data, optimizing the identification process through SVM while using ranking references to assess model performance. The findings demonstrate that the SRA model significantly outperforms traditional outlier-based fraud detection methods, highlighting its potential for effective fraud diagnosis in the auto insurance sector.
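As a hedged illustration of the kernel trick in a classification setting, the sketch below trains an SVM with a radial basis function kernel on placeholder data using scikit-learn; the parameters are illustrative defaults, not the settings used in this study or in the SRA method.

```python
# Minimal sketch: SVM with an RBF kernel for binary fraud / non-fraud classification.
# The RBF kernel implicitly maps inputs into a higher-dimensional feature space,
# which is the "kernel trick" described above. Data and parameters are placeholders.
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=15, weights=[0.9, 0.1], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)

# Scaling matters for RBF kernels because the kernel depends on Euclidean distances.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```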
Decision Trees are a supervised machine learning algorithm utilized for classification and regression tasks, functioning by recursively partitioning data according to the most informative features. These algorithms are structured like flowcharts, consisting of nodes and branches. In regression tasks, Decision Trees predict the target variable's value by averaging the values of training data points that reside within the same leaf node.
Figure 5: A graphical representation of Decision Trees Source: (De Sá et al., 2016)
In a decision tree, each decision node functions as an if-statement regarding a specific variable, while the leaf nodes represent the final regression values. The root node signifies the complete dataset, with decision nodes dividing the data into two subsets based on the truth of the statement. Leaf nodes, which do not split further, indicate the final outcomes of the decision-making process. Research has investigated the application of Naive Bayesian classification networks and Decision Tree-based algorithms to classify auto fraud claims as either fraudulent or honest, assessing model performance through various parameters and employing rule-based classification for visualization (Bhowmik, 2011).
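For illustration only, the sketch below fits a small decision tree classifier on placeholder data and prints the learned rules, mirroring the flowchart structure described above; the depth limit and data are assumptions, not choices made in this thesis.

```python
# Minimal sketch: a shallow decision tree whose learned rules can be printed as
# nested threshold tests, matching the flowchart intuition in the text.
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=6, random_state=2)
tree = DecisionTreeClassifier(max_depth=3, random_state=2).fit(X, y)

# Each internal node is a threshold test on one feature; leaves hold the predicted class.
print(export_text(tree, feature_names=[f"feature_{i}" for i in range(6)]))
```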
The Random Forest model utilizes a bootstrap-aggregating algorithm known as bagging, which combines multiple decision trees. Each tree is trained on a different randomly selected subsample of the original dataset of the same size, along with a random subset of independent variables. The model's capacity to generalize to unseen data is assessed using out-of-bag data, which consists of observations not included in a given tree's training set. This stochastic process for selecting both training observations and independent variables enhances the model's robustness and predictive accuracy.
In regression tasks, the Random Forest model predicts target variable values by averaging data points within each leaf node. The model can be improved by integrating K-Nearest Neighbor classification, replacing majority voting to prevent information loss from out-of-bag (OOB) samples, and by utilizing Principal Component Analysis (PCA) to transform data into PCA space. These enhancements contribute to the development of an efficient insurance fraud prediction model (Li et al., 2018).
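The sketch below is a hedged outline of a plain Random Forest with an out-of-bag estimate, preceded by a PCA transform; it illustrates the bagging and OOB ideas from the text but does not reproduce the KNN-based voting enhancement of Li et al. (2018). Data and hyperparameters are placeholder assumptions.

```python
# Minimal sketch: PCA transform followed by a Random Forest with out-of-bag scoring.
# The OOB score estimates generalization from observations each tree never saw.
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=3)

# Project the features into PCA space before fitting the forest.
X_pca = PCA(n_components=10, random_state=3).fit_transform(X)

forest = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=3)
forest.fit(X_pca, y)
print("Out-of-bag score:", forest.oob_score_)
```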
Figure 6: Depiction of a 600-tree Random Forest; the average of the 600 predictions is the final prediction of the Random Forest. Source: (Team, 2023)
In this study, we utilized two traditional models, Random Forest and Support Vector Machine (SVM), due to their strong performance with small datasets. While they can handle large amounts of data, they face challenges when scaling up. A notable drawback is the need for careful feature selection, which can be time-consuming. Furthermore, both models may struggle with datasets that have excessive observations, risking overfitting or underfitting issues.
Deep learning model
2.3.1 Formulas of ReLU, Batch Normalization, and Dropout
To improve the accuracy of deep learning models in outcome prediction, techniques such as ReLU (Rectified Linear Unit), Batch Normalization, and Dropout are utilized to enhance performance and stabilize the training process.
Definition and formula of ReLU
This section focuses on the ReLU activation function, which is the default and most widely used choice in deep learning tasks (Krishnamurthy, 2022). ReLU effectively addresses the vanishing gradient problem, making it a preferred option for hidden layers in neural networks. The ReLU function is mathematically defined as f(x) = max(0, x).
Here x is the input to the function, and max returns the maximum of 0 and x: ReLU returns the value of x if it is positive and returns 0 if x is negative. This function preserves positive values and eliminates negative values, stimulating neuron activation and enhancing the non-linearity of the model. ReLU involves simple calculations, addresses the issue of vanishing gradients, and can improve the performance of neural networks in various scenarios. However, it also has the drawback of potential "dead neurons" when the input value is negative, impacting the learning capability of the model in certain cases.
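As a small illustrative check of the definition f(x) = max(0, x), the snippet below implements ReLU with NumPy and applies it to a few sample values; it is a sketch, not code from this thesis.

```python
# Minimal sketch: the ReLU activation f(x) = max(0, x) applied element-wise.
import numpy as np

def relu(x):
    # Keeps positive inputs unchanged and maps negative inputs to 0.
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))  # -> [0.  0.  0.  1.5 3. ]
```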
Definition and formula of Batch Normalization
Batch Normalization is a crucial technique used in neural networks to normalize the output values of hidden layers prior to activation. By addressing the distributional changes in outputs between layers during training, it enhances network stability and performance. Typically, Batch Normalization is implemented after the linear layer and before the activation function, making it an essential component of neural network architecture.
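To illustrate the placement described above (linear layer, then Batch Normalization, then activation), the sketch below builds a small Keras block that also includes Dropout; the layer sizes, dropout rate, and input shape are illustrative assumptions and not the architecture used in this thesis.

```python
# Minimal sketch: Dense -> BatchNormalization -> ReLU -> Dropout ordering in Keras.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(40,)),              # e.g. one input per feature (placeholder)
    layers.Dense(64),                      # linear layer (no activation yet)
    layers.BatchNormalization(),           # normalize outputs before activation
    layers.Activation("relu"),             # ReLU applied after normalization
    layers.Dropout(0.3),                   # randomly drop units during training
    layers.Dense(1, activation="sigmoid")  # binary fraud / non-fraud output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```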
Input: values of $x$ over a mini-batch $\mathcal{B} = \{x_1, \ldots, x_m\}$; parameters to be learned: $\gamma, \beta$. Output: $\{y_i = \mathrm{BN}_{\gamma,\beta}(x_i)\}$.

$$\mu_{\mathcal{B}} \leftarrow \frac{1}{m}\sum_{i=1}^{m} x_i \qquad \text{(mini-batch mean)}$$