INTRODUCTION
RESEARCH BACKGROUND
The COVID-19 pandemic, which emerged in 2019, has had severe and complex repercussions on the global economy. As the world gradually reopens and vaccination efforts continue, many businesses remain significantly impacted, with some facing collapse. The emergence of the Delta and Omicron variants has intensified the challenge, leading to a more difficult phase in the global fight against COVID-19 and contributing to a wave of bankruptcies among major companies worldwide.
The impact was particularly severe in 2020, when the pandemic affected most major countries in the world.
Since the onset of the COVID-19 pandemic in 2019, Vietnamese businesses have faced significant challenges, with the World Bank reporting that approximately 50% of small enterprises and over 40% of medium-sized businesses have had to close temporarily or permanently. In 2021 alone, nearly 55,000 businesses suspended operations, an 18% increase from the previous year, while 48,100 businesses awaited dissolution procedures, a rise of 27.8%. Although 16,700 businesses completed dissolution procedures, a decrease of 4.1%, 14,800 of those had capital below 10 billion VND, and the 211 dissolved enterprises with capital exceeding 100 billion VND represented a decline of 20.7%. On average, around 10,000 companies exit the market each month due to the ongoing impact of the pandemic.
The COVID-19 pandemic has significantly impacted the economy, leading to a rise in corporate bankruptcies, which underscores the critical role of internal credit rating systems in commercial banks. These systems are essential for accurately assessing lending risks, enabling banks to make informed lending decisions and effectively manage risks. In Vietnam, commercial banks increasingly recognize the importance of these credit rating systems, especially as they strive to comply with Basel II standards. The development and application of internal credit rating systems are crucial for enhancing credit institutions' ability to measure, evaluate, and manage risks, ultimately strengthening the banking system amidst financial challenges.
THE URGENCY OF THE RESEARCH
Amid the economic and financial challenges posed by the pandemic and fluctuating market conditions, the need for research on credit rating models has become more urgent and evident.
Current credit rating models have notable limitations and imperfections, leading to diversity and disagreement regarding their reliability. This variability poses challenges for researchers and risk managers, complicating the selection of suitable models. Despite these issues, credit ratings remain essential for evaluating and estimating the default probability of companies.
Research on credit rating models has evolved significantly, with studies by Aysegul Iscanoglu (2005) and Hayden & Daniel (2010) analyzing methodologies such as discriminant analysis, logit models, decision trees, artificial neural networks, and probit regression, and highlighting the strengths and weaknesses of each approach. Notably, Platt (1991) advocated the use of standard financial variables in bankruptcy reports, Lawrence (1992) applied a logit model to assess mortgage default probabilities, and Altman (1968) utilized a discriminant analysis model. This ongoing research into credit rating models is crucial for enhancing risk management practices and promoting the sustainable development of the global financial system.
Identifying financial indicators that impact credit ratings has long been a key focus in loss given default research. Between 1926 and 1936, scholars employed fundamental financial metrics for ranking purposes and introduced various methodologies. Notably, Ramser & Foster (1931) utilized an equity to total net income index, while Fitzpatrick (1932) analyzed the equity to fixed assets ratio.
In a subsequent phase, Altman (1968) implemented various financial indicators within analytical models to assess the probability of corporate bankruptcy. Key ratios included owner's equity to book value, corporate net income to total assets, operating income to total assets, profit after tax to total assets, and working capital to total assets.
In 1972, a novel approach was introduced that selected 14 distinct financial variables to evaluate bankruptcy risk. Key indicators included cash to current debt, real cash flow to total debt, and working capital relative to net sales.
Over the years, researchers have identified various financial indicators that significantly impact credit ratings. Notably, Blum (1974) incorporated market returns and quick ratios into his models, while Back, Laitinen, Sere, and Wesel (1996) explored 31 different indicators in their study to enhance predictive accuracy.
In Vietnam, the credit rating methods currently employed by commercial banks are largely subjective and qualitative, relying heavily on the personal experience of credit officers. This approach lacks a solid scientific foundation for accurately assessing a business's bankruptcy risk and serves mainly as a reference for lending decisions rather than a rigorous basis for them. Consequently, there is a significant gap in published research focused on developing models that estimate corporate default probabilities based on financial indicators. This highlights the urgent need for further research and the advancement of effective credit assessment and risk management methodologies within the Vietnamese market.
The government has established supportive measures and a favorable legal framework to enhance the effectiveness of credit rating agencies and improve financial information transparency. This initiative helps banks manage losses from defaults early and bolsters the stock and bond markets. Additionally, researching and implementing suitable rating models will significantly advance the growth of credit rating activities in Vietnam.
In this context, risk management measures and enhancing transparency in the financial sector play an important role. The government issued Decree No. 88/2014/ND-CP on September 26, 2014, regulating the business of credit rating services and the operating conditions of rating organizations in Vietnam. Additionally, the Prime Minister approved a plan to develop credit rating services by 2020, with a vision extending to 2030, through Decision No. 507/QD-TTg on April 17, 2015. A significant aspect of this initiative is the requirement that, starting in 2020, all corporate bonds issued must undergo a rating process to enhance transparency and safety within the financial market.
Selecting an appropriate model to estimate the likelihood of business default using relevant financial indicators is a crucial risk management strategy. This approach allows for the early classification of customers and aids in assessing the bank's liabilities in line with the Basel Committee's recommendations (Basel II, 2004). Consequently, it empowers financial institutions to make informed and prudent lending decisions.
This thesis centers on "Building a model to predict losses due to default using machine learning methods," aiming to establish a comprehensive theoretical framework and empirical evidence for selecting suitable loss given default models. By enhancing the efficiency of credit risk management in banks, this research seeks to contribute to the sustainable development of Vietnam's financial system.
RESEARCH OBJECTIVES
This research focuses on developing and testing risk estimation models for predicting Loss Given Default (LGD), aiming to enhance resources for future studies in credit risk assessment. It constructs an LGD prediction model using real data and analytical methods including OLS, Decision Tree, Random Forest, and XGBoost to evaluate the influence of various factors on LGD and identify key predictors of credit risk. The study compares model performance using metrics such as MAE, MSE, RMSE, R², and MAPE to determine the most effective LGD prediction model for practical applications in the financial sector. The findings provide methodologies and results for future research, advancing the use of machine learning in finance and improving risk management and decision-making processes.
RESEARCH QUESTION
To achieve the research goal, the thesis focuses on answering the main research questions:
(i) Which financial indicators have a major impact on building a loss given default prediction model?
(ii) How do machine learning methods influence the prediction of loss given default for businesses, and which models provide the best results in estimating default losses?
(iii) Which machine learning model is the most suitable for measuring Loss Given Default?
RESEARCH SUBJECT AND SCOPE
Research subject: The subject of this study is building a loss given default prediction model for businesses in Vietnam.
The research focuses on simulating machine learning methods by collecting data and financial information from businesses that have borrowed capital from Vietnamese commercial banks and those listed on Vietnam's financial market. Financial reports from these operating businesses were gathered for the period between 2009 and 2020, and all sensitive information was encrypted to maintain confidentiality and security.
RESEARCH METHODOLOGY
The research employed a combination of qualitative and quantitative methodologies to address the shortcomings of each approach and enhance the dependability of the findings.
Quantitative methods: machine learning applications are applied to data from Vietnam's commercial banking system.
The thesis examines the perspectives and factors affecting the prediction of Loss Given Default (LGD) for small and medium-sized enterprises (SMEs) in the context of commercial banking. To facilitate quantitative research, various measurement scales have been developed. Key methodologies include the analytic-synthetic method, which synthesizes and analyzes relevant data; the comparison method, which contrasts theoretical models with practical applications; and descriptive statistics, which compiles data based on significant characteristics.
Research methods such as decision trees, random forests, and boosting are utilized to enhance accuracy in risk estimation. These techniques provide banks with a scientific foundation for making informed credit granting decisions and effectively managing risk.
In addition, this chapter also introduces the use of Python.
RESEARCH CONTRIBUTIONS
The research results of the thesis have scientific and practical significance in the following aspects:
(i) Systematically analyze the basic theories and foundations related to loss given default prediction models and the criteria for selecting appropriate models.
(ii) Comprehensively review previous research and its limitations in selecting the optimal model for predicting business default losses. By employing machine learning techniques to analyze financial indicators, the study lays a robust groundwork for researchers to undertake new and more pertinent investigations in this field.
(iii) Propose a loss given default (LGD) prediction model built on financial indicators that reflect the true risk profile, supporting accurate forecasting of losses from business defaults. Implementing a robust LGD model can significantly enhance the effectiveness of credit risk control measures employed by commercial banks in Vietnam. By focusing on relevant financial metrics and ensuring the model's adaptability to local market conditions, banks can better manage potential losses and improve overall risk assessment strategies.
THE STRUCTURE OF RESEARCH
The thesis is divided into five chapters: introduction, literature review, research methodology, research results and discussion, and recommendations and conclusion. Specific details are as follows:
This chapter outlines the thesis structure, emphasizing the significance of the topic and the research problem it addresses. It details the research objectives and questions, clarifying the scope and focus of the study. The chosen research methods are discussed, highlighting their relevance to the investigation. Additionally, the contributions of the research are presented, underscoring its potential impact. Finally, the chapter concludes by offering a comprehensive overview of the entire research, guiding readers through the key elements of the study.
LITERATURE REVIEW
LOSS GIVEN DEFAULT
Loss Given Default (LGD) is the ratio of capital lost to the total outstanding debt when a customer defaults. It encompasses not only loan principal losses but also unpaid interest and associated administrative costs, such as mortgage processing and legal fees. LGD ranges from 0%, when capital is fully recovered, to 100%, when nothing is recovered. According to Basel Committee statistics, recovery rates typically range from 20% to 80%, so average rates may be misleading. The two key factors influencing a bank's capital recovery ability are the security of the loan's assets and the structure of the borrower's assets, which determine repayment priorities in bankruptcy. Generally, banks have a higher recovery rate from loans than from bonds due to their priority in debt repayment. Additionally, economic downturns lower recovery rates, and businesses in heavy industries tend to exhibit higher capital recovery rates than those in the service sector.
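For reference, this definition can be written compactly in the standard Basel-style notation (EAD denotes exposure at default, PD the probability of default, and EL the expected loss); the notation here is generic rather than taken from a specific bank's framework:

$$\mathrm{LGD} = 1 - \text{Recovery Rate} = \frac{\text{Economic loss at default (principal, unpaid interest, workout costs)}}{\mathrm{EAD}}, \qquad \mathrm{EL} = \mathrm{PD} \times \mathrm{LGD} \times \mathrm{EAD}.$$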
Estimating Loss Given Default (LGD) can be approached in three common ways: market LGD, implied LGD, and workout LGD. The market LGD method relies on the market value of defaulted bonds or loans immediately after default, but it is limited to listed unsecured bonds, making it less applicable to commercial loans with collateral, particularly in Vietnam where the bond market is still developing. Implied LGD uses the prices of risky non-defaulted bonds through various asset pricing models, though the lack of a standard model complicates its application. Lastly, workout LGD is determined by assessing the resolution of defaulted loans: banks estimate future recovery cash flows and the expected recovery time, and discount them at an appropriate rate.
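As a sketch of the workout approach just described (with generic notation assumed for illustration), the workout LGD of a defaulted facility discounts the expected recoveries net of workout costs back to the default date:

$$\mathrm{LGD}_{\text{workout}} = 1 - \frac{\displaystyle\sum_{t=1}^{T} \frac{R_t - C_t}{(1 + r)^{t}}}{\mathrm{EAD}},$$

where $R_t$ denotes recoveries received in period $t$, $C_t$ the workout and collection costs, $r$ the discount rate, $T$ the length of the recovery period, and $\mathrm{EAD}$ the exposure at default.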
FINANCIAL INDICATORS
Financial ratios are essential analytical tools derived from an organization's financial data, enabling the measurement and comparison of financial performance. Metrics such as return on investment (ROI), return on assets (ROA), and the debt-to-equity ratio provide insights into a business's financial health and performance through comparison against internal benchmarks or competitors. Data for these ratios can be sourced from formal financial reports such as balance sheets, income statements, and cash flow statements. Monitoring these financial indicators over time helps identify trends and fluctuations, offering a comprehensive overview of a business's performance. Both business owners and external stakeholders, including investors and financial institutions, use these metrics to assess financial health and profitability potential. In summary, financial ratios are critical tools for evaluating and predicting an organization's financial performance, reflecting the overall health of the business.
Table 2.1: Four types of financial ratios
Profitability Ratios How well does the company generate profits?
Leverage Ratios How extensively is the company using debt?
Liquidity Ratios Does the company have enough cash to pay the bills?
Efficiency Ratios How efficiently does the company use its assets and capital?
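As an illustration of how such ratios can be computed from financial statement data, the following Python sketch derives one ratio from each of the four groups in Table 2.1; the column names (total_debt, current_assets, and so on) are hypothetical and would need to be mapped to the fields of the actual reports.

```python
import pandas as pd

def add_financial_ratios(statements: pd.DataFrame) -> pd.DataFrame:
    """Add one illustrative ratio per group in Table 2.1 (column names are assumed)."""
    out = statements.copy()
    out["roa"] = out["pretax_profit"] / out["total_assets"]                    # profitability
    out["debt_to_assets"] = out["total_debt"] / out["total_assets"]            # leverage
    out["current_ratio"] = out["current_assets"] / out["current_liabilities"]  # liquidity
    out["asset_turnover"] = out["net_revenue"] / out["total_assets"]           # efficiency
    return out
```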
OVERVIEW OF THE MODELS USED TO PREDICT THE LOSS GIVEN DEFAULT
Recent research has extensively explored the development of Loss Given Default (LGD) prediction models for businesses, primarily categorized into two types: statistical methods and machine learning methods. Each approach offers distinct advantages and methodologies, highlighting the evolving landscape of predictive analytics in financial risk assessment.
Statistical models excel at inference by relying on specific assumptions regarding the relationships between variables, the number of estimable parameters, and the characteristics of the data distribution. In contrast, machine learning methods prioritize predictive accuracy and make minimal assumptions about the data-generating process, allowing them to identify complex, non-linear interactions between predictor and outcome variables. The computational intensity of machine learning arises from its capability to evaluate multiple models, ultimately selecting the most accurate one for prediction tasks. While this approach is beneficial for estimating Loss Given Default (LGD), it sacrifices some transparency, as machine learning models do not provide parameter estimates linking predictors to outcomes, making their reasoning and predictions harder to interpret. This section provides a brief overview of models used to estimate loss given default.
Evaluate models to estimate LGD
The Loss Given Default (LGD) research model plays a crucial role in credit risk management and banking finance, representing the percentage of a loan that a financial institution loses when a borrower defaults. By estimating the potential loss associated with lending, LGD models help financial institutions assess their exposure to defaults. Various traditional models are commonly employed to estimate Loss Given Default, enhancing the understanding of credit risk in the financial sector.
The Linear Regression Model estimates Loss Given Default (LGD) by utilizing independent variables such as the market value of collateral assets, overall debt ratio, credit score, customer applications, and various other financial factors.
The Logistic Regression Model classifies Loss Given Default (LGD) as a categorical variable, categorized into "low," "medium," and "high." By utilizing logistic regression, this model effectively determines the level of LGD based on various input variables.
Credit risk structural models, including the Merton and Black-Scholes models, utilize credit structure theory to estimate loss given default (LGD) These models establish a connection between the value of collateral and LGD through various financial relationships, providing a framework for assessing credit risk.
The Quantile Regression Model offers a comprehensive analysis of the Loss Given Default (LGD) distribution by estimating its quantiles rather than solely focusing on the mean value This approach enhances the understanding of LGD variability and provides deeper insights into its distribution characteristics.
Random Effects Regression Model: This model incorporates random effects between observations to estimate LGD. It can handle time-invariant, unobserved factors that may affect LGD.
The Panel Data Regression Model leverages panel data to estimate Loss Given Default (LGD) by capturing variations across both time and space This approach enables the analysis of factors that may impact LGD in a multidimensional context, providing a comprehensive understanding of its influencing elements.
With the advancement of technology, machine learning models are gaining popularity for estimating Loss Given Default (LGD) in banking and financial sectors Some widely used machine learning models in this context include:
The Random Forest (RF) model is an effective ensemble learning technique that combines multiple decision trees to enhance predictive performance It excels in handling heterogeneous data and is adept at modeling non-linear relationships between input variables and Loss Given Default (LGD).
Gradient Boosting Machines (GBM) is an effective ensemble learning technique that constructs decision trees in a sequential manner to enhance the optimization of the loss function Known for its strong performance, GBM excels in managing large and complex datasets.
Support Vector Machines (SVM) is a powerful classification technique that identifies optimal boundaries between different data classes In the context of Loss Given Default (LGD) estimation, SVM can effectively classify the severity of LGD by analyzing various input variables.
Neural networks (NN) are adept at learning intricate relationships between input variables and Loss Given Default (LGD) by utilizing multiple hidden layers These models can be customized to meet specific data requirements and prediction objectives, enhancing their effectiveness in various applications.
The Extreme Gradient Boosting (XGBoost) model is an advanced and optimized version of gradient boosting, known for its high performance and rapid training capabilities It is widely utilized in financial forecasting and is particularly effective for estimating Loss Given Default (LGD).
Decision trees are a straightforward yet effective technique for classification and prediction tasks They can be utilized independently or integrated into ensemble methods like Random Forest or Gradient Boosting Machines (GBM) for enhanced performance.
DATA AND METHOD OF RESEARCH
THEORETICAL FRAMEWORK
Modern models, particularly machine learning, outperform traditional models in handling complex data and achieving superior predictive capability. Researchers and financial institutions can select between these approaches based on the nature of the data, the prediction goals, and the problem context. Key differences include flexibility: machine learning excels at modeling intricate relationships, while traditional models struggle with non-linear dynamics. In terms of performance, machine learning often delivers better predictions by identifying complex patterns, whereas traditional models, while simpler, may fall short in complex scenarios. Additionally, machine learning is more adept at managing heterogeneous and noisy data, allowing for effective analysis of diverse datasets that traditional models may not handle well.
Machine learning models excel in predictive ability, particularly in complex scenarios with large datasets, while traditional models offer deeper insights into variable relationships but may struggle to predict accurately in such situations. Additionally, traditional models are generally easier for users to understand and interpret because they rely on established statistical methods, whereas machine learning models, especially deep and complex ones, often present challenges in interpretability.
This study utilizes machine learning models, particularly those with explanatory features such as decision trees, random forests, and boosting techniques, to develop models for estimating Loss Given Default (LGD) risk weights. The methodology for constructing these machine learning models is outlined below.
Figure 3.1: Steps to build a machine learning model
Source: Compiled by the author
The data collection phase is crucial in developing a machine learning model for predicting risk weights, as the quality and performance of the model are directly influenced by the data gathered. It is essential to focus on the following characteristics of the data during this process to ensure optimal results.
First, completeness: collect complete data that accurately represents all facets of the problem at hand. Incomplete data can result in inaccuracies and diminish the reliability of the models.
Second, consistency: information gathered from various sources must align seamlessly. Inconsistencies within the data can introduce noise, ultimately hindering the accuracy and reliability of the model's outcomes.
Third, quality: ensure high-quality data, as inaccuracies or noise can significantly diminish model accuracy. It is therefore essential to thoroughly check the data for errors to maintain the integrity of the analysis.
Fourth, privacy and security: protect individual privacy and ensure that data is collected and used in compliance with personal data protection regulations.
Fifth, balance: address data imbalance so that the models remain fair and accurate, as skewed class proportions can distort the results.
Sixth, absence of bias: avoid bias during data collection. Bias can arise from various sources, including the data collection methods and the selection of samples, potentially leading to misleading outcomes.
Seventh, correlation: examine the correlation and co-variation of variables to confirm that there is no excessive relationship between them. Strong correlations can negatively impact model performance and diminish forecast accuracy.
Eighth, diversity: collect diverse data that encompasses all aspects of the problem at hand. This diversity enables models to learn common behaviors and make accurate predictions.
Feature variables in a machine learning model are the attributes or pieces of information used to predict or classify a target. These variables play a crucial role in identifying relationships and patterns within the data, enabling the model to make accurate predictions or classifications.
Feature variables can be quantitative or categorical. In addition, they can be created by combining or transforming existing variables to produce new, more meaningful features.
In the LGD forecast model for corporate clients, key variables are frequently linked to the debt collection process and the financial status of the business post-default. Commonly used variables in LGD models for business customers include factors that assess the effectiveness of debt recovery and the overall financial health of the company following a default event.
Information about specific loans and debts:
- Initial loan amount
- Type of loan product (e.g., working capital loan, fixed asset loan)
- Interest rate type (fixed or variable)
Information about the business's finances:
- Financial history (revenue, profit, cash flow, etc.)
- Current financial status (assets, debt, equity, etc.)
- Credit rating of the business
Information about collateral (if any):
- Value of collateral
- Type of asset (e.g., plant, machinery, inventory)
Information about the debt collection process:
- Debt collection methods (e.g., auctions, negotiations)
- Time and cost of debt collection
- Current debt collection situation (recovery rate, amount collected, etc.)
Information about payment history and payment behavior:
- Loan payment history (number of late periods, number of late payments)
- Current outstanding debt of the business
Information about the market and industry:
- Market characteristics and business lines
- Economic situation and industry forecast
- Policies and regulations related to businesses
- Legal regulations applying to the business's lines of operation
Data splitting involves dividing the data into training, test, and validation sets during machine learning development. This separation is essential for evaluating model performance, preventing overfitting, and ensuring that the model generalizes well to unseen data.
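A minimal sketch of such a split in Python with scikit-learn is shown below; the synthetic data stand in for the 14 financial ratios and the observed LGD values used in this study.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1,386 observations with 14 features, mirroring this study's sample size.
rng = np.random.default_rng(42)
X = rng.normal(size=(1386, 14))          # feature matrix (X_1 ... X_14)
y = rng.uniform(0.0, 1.0, size=1386)     # LGD values in [0, 1]

# Hold out roughly 20% of the observations as an out-of-sample test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)
```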
Model training involves using a significant amount of data to enable the model to learn rules or patterns for predicting outcomes. This process is crucial for enhancing the model's accuracy and effectiveness in making predictions based on the provided data.
RESEARCH MODELS AND METHODS
Decision trees are versatile machine-learning algorithms capable of both classification and regression tasks, producing simpler models that are easier to interpret. Despite these advantages, they often lag behind more advanced supervised learning methods, such as random forests and ensemble models, in prediction accuracy (James, Witten, & Hastie, 2021). These models partition the data space into distinct regions and fit a constant value to each. For example, in a regression setting with continuous response Y and inputs X1 and X2, the space is first divided into two regions, and the response in each is represented by the mean of Y in that region. This splitting continues until a predefined stopping criterion is met.
Figure 3.2: The panel on the left partitions the space using recursive binary splitting; the panel on the right shows the tree corresponding to the partition on the left.
The diagram illustrates a recursive binary partition of the feature space. The first split occurs at $X_1 = t_1$, creating the regions $X_1 \le t_1$ and $X_1 > t_1$. The region $X_1 \le t_1$ is then split at $X_2 = t_2$, while the region $X_1 > t_1$ is split at $X_1 = t_3$; finally, the region $X_1 > t_3$ is split again at $X_2 = t_4$. The result is a partition into five regions $R_1, \dots, R_5$, and the corresponding regression model predicts $Y$ with a constant $c_m$ in region $R_m$, that is:

$$\hat{f}(X) = \sum_{m=1}^{5} c_m \, I\{(X_1, X_2) \in R_m\}.$$

The model is illustrated as a binary tree: the top node initiates the partitioning of the complete dataset, observations satisfying the condition at each node are sent to the left branch and the remainder to the right, and the terminal (leaf) nodes correspond to the regions $R_1, R_2, \dots, R_5$. More generally, the data consist of $p$ inputs and a response for each of $N$ observations, $(x_i, y_i)$ with $x_i = (x_{i1}, x_{i2}, \dots, x_{ip})$ for $i = 1, 2, \dots, N$, and the space is partitioned into $M$ regions $R_1, R_2, \dots, R_M$, with the response modeled as a constant $c_m$ in each region:

$$f(x) = \sum_{m=1}^{M} c_m \, I(x \in R_m).$$

Minimizing the sum of squares $\sum_i (y_i - f(x_i))^2$ over the given regions $R_m$ yields

$$\hat{c}_m = \operatorname{ave}(y_i \mid x_i \in R_m),$$

i.e., $\hat{c}_m$ is the average of $y_i$ in region $R_m$. A greedy algorithm is then applied to the entire dataset. For a splitting variable $j$ and split point $s$, the pair of half-planes is defined as

$$R_1(j, s) = \{X \mid X_j \le s\} \quad \text{and} \quad R_2(j, s) = \{X \mid X_j > s\},$$

where $j$ and $s$ are chosen to solve

$$\min_{j,\, s} \Bigg[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \Bigg].$$

For any choice of $j$ and $s$, the inner minimization is solved by

$$\hat{c}_1 = \operatorname{ave}(y_i \mid x_i \in R_1(j, s)) \quad \text{and} \quad \hat{c}_2 = \operatorname{ave}(y_i \mid x_i \in R_2(j, s)).$$

For each splitting variable $j$, the split point $s$ can be found quickly, and the optimal pair $(j, s)$ is identified by evaluating all possible pairs. The splitting process then continues on the resulting regions until a predetermined threshold is met, such as the minimum number of records required for a split or the minimum number of records allowed in a terminal node (Hastie et al., 2005).
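A minimal scikit-learn sketch of the regression tree just described is given below; it reuses the X_train/y_train split from the earlier example, and the stopping thresholds are illustrative choices rather than the settings used in the empirical part of this thesis.

```python
from sklearn.tree import DecisionTreeRegressor

# CART regression tree with explicit stopping thresholds.
tree = DecisionTreeRegressor(
    max_depth=4,            # maximum depth of the tree
    min_samples_split=20,   # minimum number of records required to attempt a split
    min_samples_leaf=10,    # minimum number of records allowed in a terminal node
    random_state=42,
)
tree.fit(X_train, y_train)
lgd_pred_tree = tree.predict(X_test)   # constant prediction c_m within each region R_m
```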
A decision tree can also be used as a predictive model for classification tasks, determining the category of the objects to be predicted. In this model, each internal node represents a variable, the connections to child nodes correspond to specific values of that variable, and the leaf nodes indicate the predicted outcomes. The decision tree uses a training dataset to learn how to predict the value of a categorical variable, selecting the root node and performing splits based on information gain. This recursive splitting process continues until no further splits can be made (Chuc & Hang, 2014).
Figure 3.3: The structure of the decision tree
Figure 3.3 shows the decision tree model used to classify previously unclassified data into appropriate groups and classes. In this model:
Root: the starting node, containing the value of the first variable used to create branches.
Internal node: internal points of the tree containing the attributes and data values used for subsequent branching.
Leaf node: terminal nodes of the tree containing the final value of the categorical variable.
Branch: represents the branching rules, i.e., the relationship between the value of the independent variable (internal node) and the value of the target variable (leaf node).
The random forest algorithm is an effective classification and regression method that combines multiple decision trees. Each tree in the forest is grown using a randomly sampled vector, and all trees share the same distribution. This ensemble approach enhances predictive accuracy and robustness, making random forests a popular choice in machine learning applications.
Random forests are a versatile machine learning technique used for both classification and regression tasks. They operate by creating a population of decision trees whose growth is governed by random vectors. In regression settings, these trees output numerical values instead of class labels, under the assumption that the training data are drawn independently from the distribution of the random vectors. The performance of any numeric predictor in this context can be evaluated using the mean squared generalization error.
Random forests use bagging, or bootstrap aggregation, to generate multiple regression trees from random samples of the training data. These trees do not require pruning, and the final output is the average of their individual predictions.
Standard bagging grows trees by choosing the optimal split point among all predictor variables at each node. The random forest algorithm modifies this approach by selecting the best split point from a randomly chosen subset of predictor variables.
The random forest algorithm mitigates the overfitting risk associated with single decision tree models by combining many trees, improving generalization accuracy. The algorithm is stable: adding new data points has little effect on the overall model. Random forests can also estimate internal errors, evaluate the importance of input variables, and handle weakly correlated variables. Their main drawback is that they are more complex and computationally intensive than single decision trees, which makes them slower to train.
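The following scikit-learn sketch illustrates the random forest regressor described above; the hyperparameter values are illustrative assumptions, not the tuned settings of this study.

```python
from sklearn.ensemble import RandomForestRegressor

# Ensemble of regression trees grown on bootstrap samples; each split considers
# a random subset of the predictors, and predictions are averaged across trees.
rf = RandomForestRegressor(
    n_estimators=500,
    max_features="sqrt",     # random subset of predictors considered at each split
    min_samples_leaf=5,
    random_state=42,
    n_jobs=-1,
)
rf.fit(X_train, y_train)
lgd_pred_rf = rf.predict(X_test)          # average of the individual tree outputs
importances = rf.feature_importances_     # internal estimate of variable importance
```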
Figure 3.4: Implementation process of random forest algorithm
Boosting is a powerful technique used for both classification and regression tasks, alongside decision trees and random forests. It is an additive model in which multiple simple tree models, known as base learners, are combined to form a more complex predictive function. Each tree in the sequence is fitted to correct the errors of the previous trees, resulting in improved overall performance.
Gradient Boosting Machine (GBM) constructs a model by aggregating the weighted outputs of multiple base learners. It is compatible with any differentiable loss function and employs gradient descent to reduce the errors of prior trees in the sequence. Gradient descent is a first-order iterative optimization algorithm for finding a minimum of a function.
To find a local minimum in function space, a procedure analogous to gradient descent in parameter space is used. At the $m$-th iteration, the descent direction is given by the negative gradient of the loss function $L$ with respect to the current model $f_{m-1}$:

$$g_m(x_i) = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f = f_{m-1}}.$$

At each iteration, a regression tree (base learner) $\phi_m$ is fitted to predict these negative gradients, using squared error as the surrogate loss:

$$\phi_m = \arg\min_{\phi} \sum_{i=1}^{N} \big(g_m(x_i) - \phi(x_i)\big)^2,$$

where $\arg\min$ denotes the minimizing argument, i.e., the value at which the objective reaches its minimum. The step length in the negative-gradient direction is then determined by

$$\rho_m = \arg\min_{\rho} \sum_{i=1}^{N} L\big(y_i, f_{m-1}(x_i) + \rho\, \phi_m(x_i)\big).$$

Multiplying the step length by a shrinkage factor, or learning rate, $\eta \in (0, 1)$ improves model performance. This gives the $m$-th prediction:

$$f_m(x) = f_{m-1}(x) + \eta\, \rho_m\, \phi_m(x),$$

where $f_0$ is initialized as a constant.
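A minimal sketch of this boosting procedure using scikit-learn's gradient boosting regressor and the XGBoost variant used later in the comparison is shown below; the learning rate plays the role of the shrinkage factor $\eta$, and the parameter values are illustrative.

```python
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor

# Sequential boosting with shrinkage: each tree is fitted to the negative gradients
# of the current ensemble's loss, then added with learning_rate as the shrinkage factor.
gbm = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05, max_depth=3, random_state=42)
gbm.fit(X_train, y_train)

xgb = XGBRegressor(n_estimators=300, learning_rate=0.05, max_depth=3, subsample=0.8, random_state=42)
xgb.fit(X_train, y_train)
lgd_pred_xgb = xgb.predict(X_test)
```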
The primary goal of a machine learning model is to deliver accurate outputs when faced with new data, a property known as generalization. The effectiveness of a model hinges on its capacity to generalize, and there are various methods to assess this performance. Constructing a model entails two interconnected steps.
RESEARCH RESULT AND DISCUSSION
RESEARCH DATA
To simulate the described machine learning methods, data and financial information were gathered from businesses borrowing capital from Vietnamese commercial banks and from companies listed on Vietnam's financial market. Financial reports from operating businesses were collected for the period between 2009 and 2020, and all business information was encrypted to ensure data security.
To distinguish between financially healthy businesses and those in default, the concept of technical bankruptcy is used. Technical bankruptcy refers to companies struggling to meet debt obligations or facing severe financial distress without a formal bankruptcy declaration. The criteria for classifying a business as technically bankrupt are: negative equity, an EBITDA-to-interest ratio below one for two consecutive years, negative operating profit for three consecutive years, and an unfavorable opinion from independent auditors (Fantazzini & Figini, 2009). Businesses meeting these criteria are classified as insolvent and labeled 1, while those that do not are labeled 0, indicating they are not insolvent.
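A sketch of this labelling rule in Python is given below; the column names are hypothetical placeholders for the fields available in the collected statements, and it is assumed here that satisfying any one criterion flags the firm as technically bankrupt.

```python
import pandas as pd

def technical_default_label(df: pd.DataFrame) -> pd.Series:
    """Return 1 for technically bankrupt firms, 0 otherwise (column names are assumed)."""
    neg_equity = df["equity"] < 0
    low_coverage = (df["ebitda_to_interest"] < 1) & (df["ebitda_to_interest_prev"] < 1)
    neg_op_profit = (
        (df["operating_profit"] < 0)
        & (df["operating_profit_prev1"] < 0)
        & (df["operating_profit_prev2"] < 0)
    )
    adverse_audit = df["audit_opinion_adverse"] == 1
    return (neg_equity | low_coverage | neg_op_profit | adverse_audit).astype(int)
```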
The processed dataset comprises 1,386 observations, including 151 companies classified as insolvent or suspected of insolvency (10.89% of the total), while 1,235 companies (89.1%) are deemed non-insolvent. The data are split into a training set and a test set at 80% and 20%, respectively. The training set is used to develop the predictive models, whose accuracy is then assessed on the test set.
Figure 4.1 Proportion of default companies
Source: Calculation by the author
Fourteen key business indicators, detailed in Table 4.1, were identified as input variables for the forecasting models. These indicators span several financial categories, including liquidity, performance, debt utilization, and profitability metrics.
Table 4.1 Predictor variables used in the model
X_1   Gross Profit Margin              Gross Profit / Net Revenue
X_2   Pre-Tax Profit Margin            Pre-Tax Profit / Net Revenue
X_3   Return on Assets (ROA)           Pre-Tax Profit / Total Assets
X_4   Return on Equity (ROE)           Pre-Tax Profit / Equity
X_5   Debt to Total Assets Ratio       Total Debt / Total Assets
X_6   Debt to Equity Ratio             Total Debt / Equity
X_7   Current Ratio                    Current Assets / Current Liabilities
X_8   Quick Ratio                      (Current Assets - Inventory) / Current Liabilities
X_9   Interest Coverage Ratio          Earnings Before Interest and Taxes (EBIT) / Interest Expense
X_10  Debt Service Coverage Ratio      Earnings Before Interest, Taxes, Depreciation, and Amortization (EBITDA) / Debt Principal and Interest Payments
X_11  Cash Return on Equity (CRE)      Cash Flow from Operating Activities / Equity
X_12  Inventory Turnover Ratio         Cost of Goods Sold / Average Inventory
X_13  Days Sales Outstanding (DSO)     Average Accounts Receivable × 365 / Net Revenue
X_14  Total Asset Turnover Ratio       Net Revenue / Total Assets
Source: Calculation by the author
The selection of 14 variables for the Loss Given Default (LGD) prediction model is crucial, as it incorporates loan attributes, borrower information, and macroeconomic factors that influence default risk. Key loan characteristics, including the debt ratio, interest rates, and loan term, directly affect a borrower's repayment ability and the potential loss upon default. Borrower-specific data such as credit history, income, and employment status offer insights into default likelihood and loss severity, while macroeconomic indicators such as exchange rates and inflation further affect repayment capacity. The variable selection process involves analyzing correlations with LGD and employing techniques such as regression analysis or machine learning to optimize model accuracy. Historical data analysis has revealed the significant impact of these variables on LGD, enabling the model to identify critical patterns in default risk and enhance prediction reliability.
Table 4.2: Descriptive statistics of the predictor variables in the model (count, mean, std, min, 25%, 50%, 75%, max)
Source: Calculation by the author
Table 4.3: Descriptive information about the dependent variables in the estimated model
Source: Calculation by the author
Tables 4.2 and 4.3 present descriptive statistics for the predictor and dependent variables in the model, including the mean, standard deviation, minimum, maximum, and percentile values (25%, 50%, 75%). Additionally, Figure 4.2 illustrates the correlation among the 14 variables, showing that most pairs exhibit low correlation, with values below 80%.
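A simple way to reproduce this correlation check in Python is sketched below; `features` is assumed to be a DataFrame holding the 14 predictor columns, and the 0.8 threshold matches the cut-off mentioned above.

```python
import pandas as pd

# features: DataFrame with columns X_1 ... X_14 (assumed to exist from the data preparation step)
corr = features.corr()
high = (corr.abs() > 0.8) & (corr.abs() < 1.0)      # flag strongly correlated pairs
pairs = [(a, b) for a in corr.columns for b in corr.columns if a < b and high.loc[a, b]]
print(corr.round(2))
print("Highly correlated pairs:", pairs)
```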
Figure 4.2: Correlation of the variables in the estimated model
Source: Calculation by the author
CRITERIA FOR MODEL VERIFICATION AND EVALUATION
Test and evaluate the model on out-of-sample data
Out-of-sample testing is crucial for developing and assessing predictive models, as it effectively evaluates the accuracy and effectiveness of machine learning models. By implementing this technique, we can tackle key challenges in model evaluation and ensure robust predictive capability.
Assess generalization ability: when a model is trained on a data set, its performance must be evaluated on a test data set it has not encountered before. This testing reveals how effectively the model can predict outcomes for new data, rather than merely learning specific characteristics of the training data.
Detect overfitting: overfitting occurs when a model is excessively tailored to the training data, resulting in diminished performance on unseen data. Out-of-sample testing identifies and assesses overfitting by comparing the model's performance on the training and test datasets.
Refine the model: out-of-sample results are used to improve the predictive accuracy of the model. This may include adjusting hyperparameters, exploring additional variables, or selecting a different prediction algorithm, all of which help optimize performance on new data.
Enhance the reliability of results: out-of-sample testing provides a more precise estimate of a model's performance on real-world data than evaluation on the training data alone.
Out-of-sample testing is crucial for verifying a model's predictive accuracy on new data and its ability to generalize beyond the training dataset. In this study, 80% of the data, totaling 1,109 observations, is used to develop the models, while the remaining 20%, comprising 277 observations, is used for retesting. The forecast outcomes of the models discussed in Section 4.3 are evaluated against these out-of-sample datasets.
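The five criteria used in this comparison can be computed on the out-of-sample set as sketched below; `y_test` and the model predictions are assumed to come from the fitting examples given earlier.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    mean_absolute_percentage_error,
    r2_score,
)

def evaluate_lgd(y_true, y_pred, name):
    """Print the evaluation criteria used in Table 4.4 for one model."""
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_pred)
    mape = mean_absolute_percentage_error(y_true, y_pred)
    print(f"{name}: MAE={mae:.4f} MSE={mse:.4f} RMSE={rmse:.4f} R2={r2:.3f} MAPE={mape:.3f}")

evaluate_lgd(y_test, lgd_pred_rf, "Random Forest")
evaluate_lgd(y_test, lgd_pred_xgb, "XGBoost")
```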
Figure 4.3 Estimation results of the models in LGD estimation
Source: Calculated from the inspection system
The OLS regression results illustrated in Figure 4.3 show that the coefficient of X_5 has a dominant influence on LGD, as indicated by its large value, while the other variables exhibit relatively small regression coefficients.
Meanwhile, all three machine learning models (Decision Tree, RF, and XGBoost) indicate that variables X_7 and X_11 play an important role in explaining LGD.
Thus, the research models show that the debt-to-total-assets ratio (X_5), the current ratio (X_7), and cash return on equity (X_11) play an important role in explaining the LGD risk parameter.
Figure 4.4 Estimation results of decision trees in LGD estimation
Source: Calculated from the inspection system
Table 4.4 LGD prediction results of models on out-of-sample data sets
Model Dataset MAE MSE RMSE R2 MAPE
Source: Calculated from the inspection system
The LGD prediction results highlight significant variation in model performance, with Random Forest and XGBoost being the most precise methods. Random Forest stands out with the lowest mean absolute error (MAE) of 0.112674, mean squared error (MSE) of 0.019933, and root mean squared error (RMSE) of 0.141183. Its R2 value of 0.698 indicates that it accounts for nearly 70% of the variance in the data, showcasing its reliability in predicting LGD values. The ensemble approach of Random Forest effectively mitigates overfitting and captures complex non-linear relationships in the dataset, leading to highly accurate predictions across varied data scenarios.
XGBoost also demonstrates strong performance, achieving an MAE of 0.121662, an MSE of 0.024155, and an RMSE of 0.155417, with an R2 value of 0.634, indicating that it explains approximately 63.4% of the variance. Although its error rates are slightly higher than those of Random Forest, XGBoost's gradient boosting technique handles complex datasets effectively and could match or surpass Random Forest's performance with additional parameter tuning and optimization.
The Decision Tree model shows moderate accuracy, with an MAE of 0.150053 and an R2 value of 0.448, indicating that it accounts for approximately 44.8% of the data's variability. Although it improves upon linear regression by capturing more non-linear patterns, it lacks the depth and ensemble benefits of Random Forest and XGBoost, which enhance generalization. As standalone models, decision trees are also susceptible to overfitting, potentially compromising their effectiveness on new data.
Linear Regression (OLS) performs poorly in predicting LGD, with a high MAE of 3.570076, an MSE of 359.566337, and an RMSE of 18.963026. The model's negative R-squared value of -5445.945487 indicates that it fails to capture the relationships between the variables, performing even worse than a simple mean prediction. This significant discrepancy from actual values suggests that OLS is not suitable for LGD prediction on this dataset, likely because of complex non-linear relationships that a linear model cannot address.
In summary, Random Forest is the most reliable and accurate model for predicting LGD, closely followed by XGBoost, which demonstrates strong performance and optimization potential. The Decision Tree is a viable but less robust option, while Linear Regression does not achieve the predictive accuracy needed for effective LGD forecasting. For practical applications, Random Forest is likely to provide the most consistent results in capturing the complexities of LGD, with XGBoost as a strong, tunable alternative.
Chapter 4 analyzed and tested the reliability of the risk estimation models using the research data and the testing and evaluation criteria. It detailed how the models were tested on out-of-sample data and covered specific criteria for evaluating the estimated models, such as accuracy, coverage, and the degree to which risk is reflected.
The estimation results reveal the LGD parameter, a crucial element in credit risk evaluation. This chapter highlighted the models' applicability and reliability in forecasting and managing risk, ultimately enhancing risk management processes within financial institutions.
Next, based on the research results, chapter 5 will recommend the application process and suggest a set of criteria for building a risk weight estimation model at commercial banks.
CONCLUSION AND RECOMMENDATIONS
RECOMMENDATIONS
Research indicates that machine learning models such as Decision Trees, Random Forests, and Gradient Boosting Machines (GBM) are effective in predicting Loss Given Default (LGD), offering banks enhanced insight into customer risk for improved capital and risk management. The Decision Tree model is particularly valued for its straightforward interpretability and rule-based decision-making. The Random Forest model enhances accuracy by combining multiple trees, which mitigates the risk of overfitting. GBM, in turn, excels at handling non-linear data and variable interactions, delivering highly accurate forecasting results.
To effectively implement these models, we recommend a streamlined six-step process. First, collect and prepare data, including customer credit history, loan information, and macroeconomic variables. Next, analyze and identify the key input variables for Loss Given Default (LGD) through statistical analysis and expert insight. Then, build and train the model using historical data, applying cross-validation to ensure accuracy and prevent overfitting. Following this, compare model performance to identify the most suitable model for the LGD index. Afterward, integrate the selected model into the bank's risk assessment and capital management processes. Finally, periodically evaluate and update the model to reflect new data and evolving market conditions.
Step 1: Collect and prepare data, including customer credit history, loan information, and macroeconomic variables
Data collection is crucial for deploying a machine learning model, as it gathers information from diverse sources such as customer credit records, historical loan data, and macroeconomic indicators. Each source offers unique insight into a customer's repayment capacity and the risk associated with the loan.
Data must be meticulously cleaned and prepared prior to model training, which includes removing inaccurate or missing entries, filling in gaps appropriately, and converting the data into an analyzable format. For instance, categorical data often requires encoding into numerical values for effective processing.
Before constructing the model, it is essential to identify and select the input variables that influence Loss Given Default (LGD). This involves statistical techniques such as correlation and variable importance analysis to ensure that only significant variables, such as the debt-to-income ratio, credit history, and loan fill rate, are included. This step establishes a solid foundation for developing an accurate forecasting model, enabling banks to evaluate risk comprehensively and in detail.
Step 2: Analyze and select important input variables for LGD based on statistical analysis and professional experience
After preparing the data, the next step is to analyze the characteristics of the variables to assess their impact on Loss Given Default (LGD). This analysis employs statistical methods to explore the relationships between the independent and dependent variables. Techniques such as correlation analysis, regression analysis, and variable importance analysis are used to identify and eliminate variables that do not improve the predictive model.
Selecting variables based on experience in bank credit granting activities
Effective variable selection in finance and risk management requires not only statistical analysis but also professional knowledge and industry experience. By leveraging insight into business processes and legal regulations, evaluators can pinpoint valuable variables that enhance the model's relevance. This approach ensures that the model captures both historical data and critical risk management factors, leading to more accurate and informed decision-making.
Key variables in Loss Given Default (LGD) models include the debt-to-total-assets ratio, current ratio, and cash return on equity, each offering distinct insight into a customer's financial health and risk profile. Identifying these critical variables is essential for developing precise forecasting models, enabling banks to enhance their funding strategies and risk management practices.
Step 3: Build and train the model using historical data. Apply cross-validation to evaluate accuracy and avoid overfitting
Once the input variables are established and prepared, the next step is constructing the machine learning model. Various models, including decision trees, random forests, and gradient boosting machines (GBM), are developed using the prepared data. Each model has distinct features and methodologies that enable it to capture the relationship between the input variables and the outcome variable, loss given default (LGD). The models created in this study can serve as a foundational framework for future research applications.
Model training involves teaching the model to predict outcomes from input data by using historical data to identify patterns and trends. To ensure accuracy and prevent overfitting, cross-validation techniques are employed, allowing the model's performance on new data to be evaluated effectively.
Model tuning involves adjusting parameters to enhance forecast accuracy, such as modifying the number or size of trees in Random Forests, altering the learning rate in Gradient Boosting Machines (GBM), or changing the depth of Decision Trees. After fine-tuning, the model is re-evaluated on a test dataset that it has not encountered during training. This ensures that the model generalizes effectively and makes accurate predictions on new data.
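One common way to combine this tuning with cross-validation is a grid search, sketched below for the random forest; the grid values are assumptions for illustration rather than the settings used in this thesis.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Illustrative tuning grid; the values are assumptions, not the thesis settings.
param_grid = {
    "n_estimators": [200, 500],
    "max_features": ["sqrt", 0.5],
    "min_samples_leaf": [5, 10, 20],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,                                  # 5-fold cross-validation on the training set
    scoring="neg_mean_absolute_error",     # aligned with the MAE criterion used in Chapter 4
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)
```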
After training and fine-tuning the model, the next step is optimizing it for deployment. This involves evaluating the candidate models against key criteria such as accuracy, complexity, and computational efficiency to select the best one. Model optimization must also take into account deployment in real-world conditions, including the selection of appropriate technology and infrastructure for scalable operation. This approach ensures that the model not only remains accurate in testing but also proves effective and stable in the bank's daily risk management activities.
Step 4: Compare the performance of the models to choose the most optimal model for the LGD index
After training and fine-tuning the models, the next step is assessing their performance across multiple dimensions, including accuracy, sensitivity, specificity, and the area under the ROC curve. This evaluation is crucial for understanding each model's overall predictive accuracy and its effectiveness in classifying specific LGD cases. Some models may excel in certain areas, and no single model is superior in every respect.
Charts and visualization tools make the differences between models easier to see: performance charts such as the ROC curve and the precision-recall curve allow quick identification of the model that best balances sensitivity and specificity, and variable importance plots highlight the key variables that drive each model's predictions, supporting a more objective choice of the optimal model.
Choosing the right model involves balancing performance and complexity, as a highly accurate model that demands extensive time and computational resources may not be practical. The optimal model should balance predictive performance with ease of implementation and remain understandable to stakeholders. This is particularly important in the banking and finance sector, where the interpretation of decisions carries significant weight.
SUGGESTIONS
Recommended set of indicators to build an LGD forecasting model
Debt to Total Assets Ratio (X_5)
The Debt to Total Assets Ratio is a crucial metric for assessing a business's reliance on debt to finance its operations and assets, and it significantly influences the prediction of Loss Given Default (LGD). A high ratio indicates greater dependence on debt, which can elevate LGD through increased financial vulnerability and default risk; when a business fails to meet its obligations, assets may have to be liquidated at reduced prices, further exacerbating losses. Conversely, a low debt-to-assets ratio reflects a stronger financial foundation and reduces LGD risk, as the business is better positioned to navigate financial difficulties. This ratio is therefore vital for evaluating risk and forecasting potential creditor losses in the event of default, enabling effective risk management strategies.
Current Ratio (X_7)
The current ratio is a crucial financial metric that assesses a company's ability to meet its short-term debt obligations with its current assets. In LGD forecasting, this ratio matters because it reflects the business's short-term liquidity, which influences the potential loss if debt repayment fails. A high current ratio indicates sufficient current assets to cover short-term liabilities, reducing LGD risk and the likelihood of default, while a low current ratio suggests potential difficulty in meeting short-term debts and heightens creditors' risk of loss. Evaluating current liquidity is therefore essential for forecasting LGD, enabling creditors and investors to gauge risk and implement effective risk management strategies.
Quick Ratio

The quick ratio is a crucial financial metric that assesses a business's ability to meet short-term debt obligations without relying on inventory sales. In the context of LGD forecasting, this ratio provides valuable insight into a company's immediate liquidity, which is essential for estimating potential loan losses in the event of default. A high quick ratio signifies ample current assets that can be swiftly converted into cash, thereby mitigating default risk and lowering LGD. Conversely, a low quick ratio may indicate a heightened risk of failing to meet short-term financial commitments, potentially leading to increased losses for creditors. Understanding quick liquidity is therefore vital for evaluating financial risk and forecasting LGD, enabling stakeholders to make informed financial decisions and manage risks effectively.
Cash and Cash Equivalents to Equity Ratio

The ratio of cash and cash equivalents to total equity is a crucial financial metric that indicates a business's liquidity and solvency, particularly in LGD forecasting. A high ratio suggests that the company holds sufficient cash resources to meet short-term obligations and repay debts, thereby reducing the risk of loss for creditors. In contrast, a low ratio raises concerns about the business's ability to fulfill its financial commitments during a crisis, indicating a higher potential risk for creditors. Analyzing this ratio is therefore essential for assessing a company's financial health and resilience against shocks, providing critical insight for effective credit risk management.
Proposed supplementary indicator: Free Cash Flow to Debt Ratio
The free cash flow to total debt ratio is a crucial metric that assesses a company's capability to generate cash flow, after covering operating and capital expenditures, relative to its overall liabilities. This ratio is essential for evaluating a business's long-term debt repayment capacity and profitability without relying on new debt. A high ratio indicates that the company can generate sufficient cash flow to service its debt, thereby reducing the risk of loss for lenders. In the LGD model, analyzing this ratio enables banks to gauge a company's financial recovery potential after default, leading to a more precise estimate of potential losses.
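For concreteness, the sketch below computes the five indicators proposed above from a financial-statement DataFrame. The column names (total_liabilities, total_assets, current_assets, inventory, current_liabilities, cash_and_equivalents, total_equity, free_cash_flow, total_debt) are hypothetical placeholders that would have to be mapped to the bank's actual data fields.

import pandas as pd

def add_lgd_indicators(fs: pd.DataFrame) -> pd.DataFrame:
    """Append the proposed LGD indicators to a financial-statement DataFrame.

    The input column names are illustrative placeholders, not a prescribed schema.
    """
    out = fs.copy()
    out["debt_to_total_assets"] = out["total_liabilities"] / out["total_assets"]   # X_5
    out["current_ratio"] = out["current_assets"] / out["current_liabilities"]
    out["quick_ratio"] = (out["current_assets"] - out["inventory"]) / out["current_liabilities"]
    out["cash_to_equity"] = out["cash_and_equivalents"] / out["total_equity"]
    out["fcf_to_debt"] = out["free_cash_flow"] / out["total_debt"]
    return out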
To enhance the evaluation of an enterprise's financial capacity, debt repayment ability, and overall business health, it is essential to select and propose additional indicators such as these. They give banks a deeper and more comprehensive understanding of the potential losses that businesses may incur, ultimately helping to minimize risks and improve the accuracy of LGD estimates.
LIMITATIONS AND POTENTIAL RESEARCH DIRECTIONS
While the thesis presents valuable findings, it is important to acknowledge its limitations, particularly the restricted data set. Owing to time and resource constraints, the investigation is based on only 1,386 observations covering the period from 2009 to 2020. The number of input variables was 14, which is consistent with this number of observations. However, because the sample does not extend to the most recent year (2023), an up-to-date comparison is not available.
The quality of input data is also a concern, as the auditing of corporate financial statements in Vietnam does not yet match the transparency and effectiveness found in developed countries. Companies often prepare multiple financial reports for different purposes, such as tax, banking, auditing, and internal control. Careful management and screening of the model's inputs is therefore crucial for achieving accurate results.
The Loss Given Default prediction models discussed in this thesis rely solely on financial data, mirroring the internal credit rating models currently employed by commercial banks in Vietnam. However, since financial reports often fail to provide a complete and accurate picture of a customer's business performance and financial health, banks must also consider non-financial information for effective customer screening and classification.
This thesis advocates expanded research that increases the data set to more than 2,000 businesses, which would enhance the reliability of the results. Additionally, collecting quarterly corporate financial statements instead of annual ones would improve the accuracy of the findings.
Larger and more disaggregated data for each customer group across various business sectors would provide more reliable and relevant insights, enhancing the applicability of the model for the organizations that adopt it.
In Vietnam, building risk models for Probability of Default (PD), Loss Given Default (LGD), and Exposure at Default (EAD) is essential for calculating the capital reserve ratio in accordance with Basel II. These key factors are crucial for analyzing customer types and determining credit limits. For effective post-lending risk management, it is important to clearly define the relationship between asset value and the PD, LGD, and EAD metrics, since the Expected Loss (EL) is obtained by multiplying these three indices (EL = PD x LGD x EAD).
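A minimal sketch of this relationship, using purely illustrative input values, is:

def expected_loss(pd_: float, lgd: float, ead: float) -> float:
    """Expected loss under the Basel II relationship EL = PD x LGD x EAD."""
    return pd_ * lgd * ead

# Illustrative example: 2% default probability, 45% loss given default,
# and an exposure at default of 10 billion VND
el = expected_loss(pd_=0.02, lgd=0.45, ead=10_000_000_000)
print(f"Expected loss: {el:,.0f} VND")   # 90,000,000 VND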
Chapter 5 provides detailed recommendations on applying machine learning models to estimate risk weights in the operations of commercial banks. It emphasizes the importance of selecting an appropriate model, preparing high-quality data, and applying a thorough validation approach to ensure the accuracy and reliability of the estimation results. The suggested set of criteria for building a risk weight estimation model covers the factors that need to be considered throughout the process, from clearly defining the input variables to choosing an appropriate algorithm and establishing model evaluation criteria.
This research examines the risk of default and the losses arising from it, and offers methods for their measurement. By systematically applying modern estimation models, it highlights advanced techniques for risk quantification and strategies for reducing credit risk.
Recent advancements in credit risk analysis and quantification have been driven by innovative models and research methods such as decision trees, random forests, and boosting techniques. The application of Python tools showcases the adaptability and effectiveness of information technology within the banking and finance industries.
Analyzing and testing the reliability of the estimated models demonstrates their effectiveness in the Vietnamese banking sector, enhancing risk management and competitiveness in the digital age. The thesis offers recommendations for developing risk estimation models in commercial banks, aimed at optimizing the use of machine learning tools, ensuring capital safety, and promoting the stability of the national financial system.
REFERENCES

Hoang Thanh Hai, Tran Dinh Chuc, & Nguyen Quynh Hoa (2018). Logistic regression model in measuring the probability of default for individual credit customers. Journal of Economics & Business Administration, 07.
Dang Thi Thu Hang (2019). Applying the logistic model in credit risk management. The Monetary and Financial Market, 11.
Dinh Duc Minh (2018). Evaluation of some credit risk prediction models at businesses.
Abellán, J., & Castellano, J. G. (2017). A comparative study on base classifiers in ensemble methods for credit scoring. Expert Systems with Applications, 73, 1-10.
Altman, E. I., & Sabato, G. (2007). Modelling credit risk for SMEs: Evidence from the US market. Abacus, 43(3), 332-357.
Basel Committee on Banking Supervision (2004). International Convergence of Capital Measurement and Capital Standards. BIS.
Basel Committee on Banking Supervision (1999). Credit Risk Modelling: Current Practices and Applications. BIS.
Basel Committee on Banking Supervision (2001). Working Paper on the Internal Ratings-Based Approach to Specialised Lending Exposures. BIS.
Beaver, W. H. (1966). Financial ratios as predictors of failure. Journal of Accounting Research, 4, 71-111.
Bonfim, D. (2009). Credit risk drivers: Evaluating the contribution of firm level information and of macroeconomic dynamics. Journal of Banking & Finance, 33, 281-299.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
Carling, K., Jacobson, T., Lindé, J., & Roszbach, K. (2007). Corporate credit risk modeling and the macroeconomy. Journal of Banking & Finance, 31, 845-868.
Chakraborty, C., & Joseph, A. (2017). Machine learning at central banks. Bank of England Staff Working Paper.
Dahlin, F., & Storkitt, S. (2014). Estimation of loss given default for low default portfolios.
De Carvalho, N., & Dermine, J. (2003). Bank loan losses-given-default: Empirical evidence. Tentative draft.
Fagerland, M. W., & Hosmer, D. W. (2012). A generalized Hosmer-Lemeshow goodness-of-fit test for multinomial logistic regression models. The Stata Journal, 12(3), 447-453.
Fantazzini, D., & Figini, S. (2009). Random survival forests models for SME credit risk measurement. Methodology and Computing in Applied Probability, 11, 29-45.
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 1189-1232.
Fuster, A., Goldsmith-Pinkham, P., Ramadorai, T., & Walther, A. (2018). Predictably unequal? The effects of machine learning on credit markets. Revise & resubmit, Journal of Finance.
Hamerle, A., Liebig, T., & Scheule, H. (2004). Forecasting credit portfolio risk. Deutsche Bundesbank Discussion Paper.
Hartmann-Wendels, T., Miller, P., & Töws, E. (2014). Loss given default for leasing: Parametric and nonparametric estimations. Journal of Banking & Finance, 40, 364-375.
Hayden, E. (2003). Are credit scoring models sensitive with respect to default definitions? Evidence from the Austrian market. EFMA, April 2003.
Jakubík, P. (2007). Macroeconomic environment and credit risk. Czech Journal of Economics and Finance (Finance a úvěr), 57(1-2), 60-78.
Jobst, R., Kellner, R., & Rösch, D. (2020). Bayesian loss given default estimation for European sovereign bonds. International Journal of Forecasting, 36(3), 1073-.
Kruppa, J., Schwarz, A., Arminger, G., & Ziegler, A. (2013). Consumer credit risk: Individual probability estimates using machine learning. Expert Systems with Applications, 40(13), 5125-5131.
Luo, S., & Murphy, A. (2020). Understanding the exposure at default risk of commercial real estate construction and land development loans.
McKinney, W. (2010). Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference.
Memić, D. (2015). Assessing credit default using logistic regression and multiple discriminant analysis: Empirical evidence from Bosnia and Herzegovina. Interdisciplinary Description of Complex Systems, 13(1).
Merton, R. C. (1974). On the pricing of corporate debt: The risk structure of interest rates. Journal of Finance, 29(2), 449-470.
Mital, A., Desai, A., & Subramanian, A. (2007). Product Development: A Structured Approach to Consumer Product Development, Design, and Manufacture. Oxford: Butterworth-Heinemann.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... Dubourg, V. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825-2830.
Rösch, D. (2003). Correlations and business cycles of credit risk: Evidence from bankruptcies in Germany. Financial Markets and Portfolio Management, 17(3), 309-331.
Schuermann, T. (2004). What do we know about loss given default? Wharton Financial Institutions Center Working Paper.
Pereira, T. L., & Cribari-Neto, F. (2010). A test for correct model specification in inflated beta regressions. Working paper, Instituto de Matemática, Estatística e Computação Científica, Universidade Estadual de Campinas.
Vasicek, O. A. (1984). Credit Valuation. KMV Corporation.
Huang, X., & Oosterlee, C. W. (2012). Generalized beta regression models for random loss given default. The Journal of Credit Risk, 7(4), 45-70.
Yashkir, O., & Yashkir, Y. (2013). Loss given default modelling: Comparative analysis. Munich Personal RePEc Archive, 46147.
APPENDIX: PYTHON CODE FOR LGD ESTIMATION

The Python code below is used to estimate Loss Given Default (LGD). It relies on standard libraries for data handling (os, glob, pprint, datetime, numpy, pandas), visualization (graphviz, matplotlib, seaborn) and modelling (scikit-learn's linear, tree, ensemble and SVM modules together with xgboost), with GridSearchCV and KFold available for tuning and cross-validation and confusion matrix, accuracy, recall, precision and ROC AUC scores available for evaluation.

import os
import glob
import pprint
from datetime import datetime

import numpy as np
import pandas as pd
import graphviz
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import linear_model
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.ensemble import (RandomForestRegressor, RandomForestClassifier,
                              GradientBoostingClassifier, GradientBoostingRegressor)
from sklearn.svm import SVC, SVR
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.metrics import (confusion_matrix, accuracy_score, recall_score,
                             precision_score, roc_auc_score)
from xgboost import XGBClassifier, XGBRegressor

np.random.seed(0)

# Load the modelling data set
df = pd.read_csv("/content/dataset1.csv")

# Explanatory variables X_1 ... X_14 and the LGD target
X = df[[f"X_{i}" for i in range(1, 15)]]
y = df["LGD"]

# Hold out 20% of the observations for out-of-sample testing
# (random_state assumed to be 42; this value was garbled in the source listing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Regression models compared for the LGD index
models = {
    'Decision Tree': DecisionTreeRegressor(random_state=42),
    'Random Forest': RandomForestRegressor(random_state=42),
    'XGBoost': XGBRegressor(random_state=42),
}

# Create a DataFrame to store the results
results_df = pd.DataFrame(columns=['Model', 'Dataset'])

# Train and evaluate each model
for model_name, model in models.items():
    print(datetime.now(), model_name)

    # Train the model
    model = model.fit(X_train, y_train)

    # Make predictions on the in-sample data
    y_pred_in_sample = model.predict(X_train)
    y_pred_prob_in_sample = model.predict(X_train)

    # Make predictions on the out-of-sample data
    y_pred_out_sample = model.predict(X_test)
    y_pred_prob_out_sample = model.predict(X_test)

    # Evaluate performance on out-of-sample data