The concept of credit scoring model
Credit appraisal is a crucial process that involves gathering and analyzing data to provide credit analysts with insights into customers, aiding in final financing decisions and mitigating operational risks for credit institutions. A key tool in this process is credit scoring, which can be broken down into two components: "credit," referring to the borrowing of money that must be repaid with interest, and "scoring," which involves measuring and categorizing customers based on various criteria relevant to credit institutions. Essentially, credit scoring uses statistical models to transform relevant data into variables that significantly affect customers' repayment abilities. These models help quantify customer characteristics and facilitate the classification of customers, enabling lenders to make informed decisions about who qualifies for financing and who should be declined.
Consumer loans represent a small segment of the total loan market, making individual loan evaluations less cost-effective. With advancements in technology and risk management research, lenders increasingly rely on scoring models to expedite approval processes and enhance decision accuracy. Mester (1997) noted that credit scoring models assessed 70% of small business loan applications in banks and lending institutions. Although credit scoring systems began in the 1950s in the U.S., their use expanded in the 1990s to include housing finance evaluations (Straka, 2000). Multiple discriminant analysis, introduced by Altman in 1968, is one of the earliest and most widely used credit scoring models. Since then, various techniques, including logit and probit models and neural networks, have emerged to identify the customer characteristics that significantly influence default risk.
According to Hörkkö (2010), the primary goal of credit scoring models is to assist lenders in identifying "good" and "bad" customers, thereby reducing credit risk and default rates. These models use various personal information variables, such as age, education, marital status, and number of children, as input data. Different statistical techniques are then applied to determine which characteristics effectively differentiate between customer types. Credit scoring models (CSMs) can either be purchased or developed in-house, with their design varying based on the bank's capabilities and the sample data used. Typically, historical data indicating default or non-default status serves as the binary dependent variable, while customer demographics are treated as independent variables. Through empirical modeling, CSMs generate a score for each applicant, which is compared against a predetermined cut-off value to inform lending decisions. Applications scoring above the threshold are accepted by lenders.
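The final step described above, comparing each applicant's score with a cut-off, can be sketched as follows. This is a minimal illustration rather than the procedure of any particular lender: the scores, the cut-off of 0.60, and the function name `decide` are hypothetical, and a higher score is assumed to indicate a better customer.

```python
def decide(score: float, cutoff: float) -> str:
    """Accept an application only when its score is strictly above the cut-off."""
    return "accept" if score > cutoff else "reject"


# Hypothetical model scores for three applicants (assumption: higher is better).
applicants = {"A": 0.82, "B": 0.41, "C": 0.60}
CUTOFF = 0.60  # hypothetical threshold chosen by the lender

decisions = {name: decide(score, CUTOFF) for name, score in applicants.items()}
print(decisions)
```

An applicant whose score equals the cut-off is rejected here because the text accepts only applications scoring above the threshold; a lender could equally choose to accept ties.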
Judgmental analysis method and credit scoring model
Credit assessment involves evaluating a customer's characteristics against those of past borrowers. If a customer's profile matches that of previous defaulters, their loan application is likely to be denied. Conversely, applicants resembling reliable payers are more likely to receive loan approval. The credit assessment process typically employs two main techniques: judgmental analysis and credit scoring models.
Each technique comes with its own advantages and disadvantages. The effectiveness of judgmental analysis is largely determined by the credit analyst's accumulated experience and analytical skills. This technique is susceptible to subjectivity, uncertainty, and personal perspective during credit decision-making. Despite these drawbacks, judgmental analysis offers benefits such as the ability to incorporate qualitative factors into the analysis and the potential for more accurate decisions when conducted by experienced credit analysts.
Credit scoring techniques use extensive data from existing customers to develop statistical models that classify potential clients as either high or low risk, enabling quick financial decisions. These models have significantly enhanced customer service for lenders by streamlining operations and reducing approval costs. However, they face criticism regarding their reliability, as statistical models inherently depend on their assumptions and on the quality of the data used. Despite these concerns, credit scoring systems have evolved rapidly and are now essential tools in the finance and banking industries.
Advantages and disadvantages of credit scoring models
Rapid advancements in credit scoring applications have significantly benefited users, particularly in the finance and banking sectors. A key advantage of these applications is their ability to generate classification results with minimal information, using only statistically significant variables that reflect customers' repayment capacity. In contrast, judgmental analysis reviews all customer information without excluding any elements. Credit scoring systems assess both good and bad customer characteristics, while judgmental analysis primarily focuses on negative traits. Built on extensive historical data and statistical models, credit scoring provides objective decision-making, whereas judgmental analysis introduces subjectivity, leading to potential discrimination issues. Credit scoring allows lenders to clearly understand the relationship between customer characteristics and payment behavior, which is often difficult to articulate in judgmental analysis. Additionally, credit scoring ensures consistent classification results across different evaluators, unlike judgmental methods.
Credit scoring models offer several advantages beyond those mentioned above, including improved time efficiency and faster decision-making. They help minimize approval costs and reduce errors compared to traditional judgmental analysis. Additionally, these models facilitate easier risk control processes and require less customer information for classification. Furthermore, credit scoring models can adapt their structure to meet various user needs.
While credit scoring models offer numerous advantages, they face criticism due to their reliance on various client characteristics as input variables, which can lead to ambiguous relationships with creditworthiness. Additionally, these models often overlook economic factors that influence a customer's repayment ability, and they can vary significantly across regions, lacking a universal standard. The costs associated with purchasing credit scoring systems and training staff to use them pose challenges for implementation. Furthermore, these models may mistakenly reject reliable customers who experience sudden life changes because no deeper analysis is performed, and their dependence on historical data can result in fixed variable weights, diminishing accuracy as customer patterns evolve over time.
Historical development of credit scoring models
Development in credit card and instant loan markets
Agarwal et al. (2009) examined the impact of customer characteristics on default likelihood using a dataset of 170,000 samples. Their findings indicated that factors such as monthly spending, debt levels, income, asset accumulation, economic conditions, legal environment, and demographic structure significantly influence customers' repayment capacity. Notably, individuals who have moved away from their birthplace are more likely to default, while married homeowners exhibit a low probability of default. Additionally, customers under 30 and over 60 tend to repay their loans more effectively than others. Lastly, high-income individuals and asset holders demonstrate a strong sense of repayment responsibility.
In 1999, Dunn and Kim conducted a study involving 500 customers in Ohio to identify factors influencing credit card default probabilities. Their findings revealed a significant correlation between the minimum monthly payment amounts and customers' income, which affects repayment capacity. Additionally, age, marital status, and the number of dependents were strongly linked to default risk, while education level, income, and ownership status showed no significant effect, contrary to initial assumptions.
Some studies prioritize profit maximization over classification, as noted by Boyes et al. (2002), who emphasize that credit analysis aims to provide accurate default probability estimations. These estimations allow lenders to offer loans at varying interest rates based on customer risk. Key factors influencing default probability include age, education, asset ownership, number of dependents, and income-to-spending ratio. Autio et al. (2009) surveyed 1,951 young adults aged 18 to 29, gathering data on demographics and credit statuses, including mortgages and student loans. Their findings revealed that individuals aged 18 to 23 frequently apply for instant loans, while those with higher incomes and stable employment prefer credit card loans. The study concluded that while gender has minimal impact on borrowing decisions, employment status, income, and family structure are significant influencing factors.
Development in mortgage markets
In the mortgage market, lenders typically require collateral from borrowers to secure their loans. The likelihood of default is significantly influenced by fluctuations in exchange rates and interest rates, particularly due to the long-term nature of these loans (Zorn and Lea, 1989).
In 2006, Vasanthi and Raja explored the link between income and customer characteristics, finding that the age of the household head plays a crucial role in default probability. Younger household heads are more likely to default due to financial stress and inexperience in money management. Conversely, customers with high incomes who borrow minimally demonstrate the lowest default rates, as they manage their finances effectively. Additionally, traditional factors such as education level and marital status significantly influence payment behavior; individuals with higher education tend to secure stable jobs, enhancing their financial stability. The study also revealed that younger customers and those who are divorced have a higher likelihood of defaulting, attributed to their lack of financial management experience and psychological instability.
Development in consumer credit market
In their 2009 study, Kocenda and Vojtek analyzed 3,403 customer observations across 21 characteristics to identify factors influencing payment behavior and to evaluate credit scoring model techniques. Their findings revealed that logistic analysis and CART analysis performed equally well, highlighting education level, marital status, borrowing purpose, asset accumulation, and transaction history as key factors affecting payment behavior. A notable limitation of the study was its reliance on a small dataset for credit scoring model development. The authors recommended considering non-parametric measurement as an alternative method for creating an effective credit scoring model.
In 1997, Arminger et al. developed a credit scoring model using three techniques: logistic discriminant analysis (LDA), classification tree analysis, and feedforward neural networks. They aimed to compare these methods to determine the most effective one by analyzing data from 8,163 observations collected between 1991 and 1997.
In 1992, a study conducted in Germany analyzed various inputs such as gender, experience, age, ownership, and marital status to evaluate credit scoring models. The results revealed that the three techniques demonstrated similarly high classification power; however, the LDA technique slightly outperformed the others. The findings also indicated that experienced customers, those with significant asset accumulation, females, and married individuals exhibited a lower likelihood of defaulting on credit.
In 2003, Jacobson and Roszbach conducted a study on credit scoring models, focusing on the bias introduced by the data selection process. Using a bivariate approach, they incorporated both rejected and approved loans as inputs for the model. Their research analyzed 13,338 observations from Sweden between 1994 and 1995, drawing from various financial and personal customer information sources. Initially, they identified 57 input variables, but ultimately only 16 were selected for the final model. The findings revealed that key factors such as income, age, annual income changes, and loan amounts significantly influenced a customer's probability of default.
In 2004, Roszbach analyzed a dataset to explore the relationship between customer default periods and loan cash flow. The study outlined two methods for calculating the net present value (NPV) of loans: one for timely repayments and another for defaults, which includes cash flow estimates during partial repayments and costs associated with non-performing loans. By employing a tobit model, Roszbach measured the default timing of customers. The findings revealed that lenders often misjudged the risk-return tradeoff and did not implement lending policies that incentivized profit maximization. Furthermore, the research indicated that lenders failed to differentiate the value of loans, suggesting a lack of consistency in their profit-maximization objectives. The tobit model enabled lenders to predict customer default durations, allowing for more effective lending strategies.
Dinh and Kleimeier (2007) used 56,037 observations from one of the largest banks in Vietnam to develop a credit scoring model, employing a forward-stepwise method for variable selection. However, the study encountered limitations due to a lack of essential information, a common challenge in research conducted in developing countries like Vietnam. The findings revealed that the length of the customer-lender relationship is the most significant factor influencing credit scoring, followed by gender and loan amount. The authors emphasized the importance of regularly updating credit scoring models to maintain their effectiveness in response to changing economic conditions.
Research by Updegrave (1987) identified key factors influencing customer payment behavior, including the number of variables in a model, payment history, work experience, duration of residence, income, ownership, age, and savings rate. This finding was corroborated by Steenackers and Goovaerts (1989) in their Belgian study, where they initially considered 19 variables for a credit scoring model but ultimately narrowed it down to 11 statistically significant factors. Their logistic regression analysis revealed that age, length of employment and residence, loan amounts, phone communication, employment sector, monthly income, and asset accumulation significantly affect customer payment behavior.
In 2004, Ozdemir used logistic regression to develop a model assessing the relationship between default risk and various demographic and financial factors within the retail credit market, using data from a Turkish bank. The study's findings indicated that demographic factors lacked statistical significance in influencing customer payment behavior, whereas financial factors were significant. Notably, higher interest rates and longer loan terms were identified as strong predictors of customer defaults, with borrowers facing increased risks due to potential economic fluctuations, exchange rate variations, or interest rate changes.
In 1997, Hand and Henley conducted a comprehensive review of research on credit scoring model development methods. Their findings indicated that there is no universally optimal approach; each method presents unique advantages and disadvantages that are largely influenced by the data structure and the specific objectives of the users.
Common variables in constructing credit scoring models
Credit scoring models play a vital role in categorizing customers into good or bad credit groups. With the swift advancement of credit scoring applications in developed countries such as the United States and England, these models have become essential for effective risk management and for streamlining the lending process. By using credit scoring models, lenders can efficiently analyze customer data, evaluate payment histories, and determine creditworthiness, facilitating informed credit decisions.
When developing a credit scoring model, it is essential to incorporate personal characteristics as input variables, including gender, age, education level, number of dependents, job type, and work experience. Additionally, other crucial factors such as the loan amount, asset accumulation, monthly income, savings rate, and the purpose of borrowing should also be taken into account to enhance the performance of the credit scoring model (Lee and Chen, 2005; Ong et al., 2005; Steenackers and Goovaerts, 1989).
Customer information serves as input for statistical models, with statistically significant variables being used in credit scoring to categorize clients. The swift advancement of credit scoring applications highlights the effectiveness of these models; however, there is a lack of explicit research detailing the rationale behind the selection of variables in these models. Moreover, the choice of variables in credit scoring relies heavily on the initial data structure used to develop the model. Initially, credit scoring models aimed to classify customers into two categories: "good" and "bad." As credit scoring evolved to meet more complex demands, models have expanded to classify customers into three groups: "good," "bad," and "confuse" … therefore lenders will have more information about customers to make final decisions.
There are no explicit requirements regarding the number of variables in a credit scoring model, as the selected variables largely depend on the data structure, cultural context, and economic conditions of each region. Typically, a robust credit scoring model incorporates around twenty variables. Research by Salchenberger et al. (1992) and Leshno and Spector suggests that increasing the number of variables can improve the model's performance.
Credit scoring models play a crucial role in finance and banking, yet there is ongoing debate about how to determine the optimal cut-off point for these models. Research indicates that this cut-off point varies with lenders' strategies: those aiming to expand lending may set a lower threshold, while those focused on risk management tend to adopt a stricter cut-off. Additionally, the effectiveness of credit scoring models is influenced by the number of variables and the data structure used, as well as the sample size during model development. Larger sample sizes generally enhance accuracy, but the availability of data often restricts this. Some studies, like those by Dutta et al. (1994) and Fletcher and Goss (1993), used smaller sample sizes of around 300 to 400 observations, while others, such as Bellotti and Crook (2009) and Hsieh (2004), employed much larger datasets with thousands of observations. Notably, constructing credit scoring models for the consumer credit market frequently involves smaller sample sizes, typically under 400 observations.
1100 observations (Sustersic et al, 2009; Lee and Chen 2005; Kim and Sohn, 2004)
Studies have highlighted a bias in credit scoring models due to the selection of only customers who received loan approvals as input data. This approach limits the representativeness of the model for the entire population, ultimately affecting its performance.
The table below presents commonly used independent variables from various studies, noting that their names may differ from those in previous research. Nevertheless, the underlying meanings of these variables remain consistent across the studies.
Table 01: Common variables in previous studies
Columns: Variables; Jacobson and Roszbach (2003); Dinh and Kleimeier (2007); Agarwal et al. (2009); Kocenda and Vojtek (2009)
Variables listed: Migrating out of state of birth; Time living in current place; Working time in current job
Note: *** the most significant variables in previous studies
Previous studies highlight that customer characteristics, such as age, education, marital status, and residential status, significantly influence payment ability (Agarwal et al., 2009; Dinh and Kleimeier, 2007; Kocenda and Vojtek, 2009). Additionally, factors like income, length of the customer relationship, loan maturity, and savings also play a crucial role in determining customers' payment capabilities (Vasanthi and Raja, 2006; Ozdemir and Boran, 2004; Jacobson and Roszbach, 2003).
Common techniques employed in credit scoring models
Arminger et al. (1997) conducted a comparative study on credit modeling using three methods: Logistic Regression (LR), Classification Trees (CT), and Neural Networks (NN). They analyzed independent variables including gender, duration in current job, age, and marital status, using a dataset from a major retail bank in Germany. Employing cross-validation for model setup and performance testing, they found that while all three techniques demonstrated similar predictive capabilities, LR slightly outperformed the others, with CT showing the least effectiveness.
In 1996, Desai and his team used neural networks, logit regression, and linear discriminant analysis to evaluate their effectiveness in developing credit scoring models (CSM), using data from 53 US credit institutions between 1988 and 1991. Their findings revealed that while neural networks excelled at predicting bad loans, logit regression and neural networks performed equally in classifying both good and bad loans, with logit regression consistently outperforming linear discriminant analysis. Similarly, a study by Lee et al. in 2002 compared the predictive capabilities of linear discriminant analysis, logit regression, neural networks, and a neural discriminant method, concluding that all four models exhibited comparable effectiveness in differentiating between good and bad customers.
Kocenda and Vojtek (2009) used the parametric and nonparametric techniques LR and CT to estimate the determinants of default. They found their results to be reliable and recommended the CT method for improved model creation. However, studies by Luo (2008) and Yang (2009) argued that Logistic Regression (LR) consistently outperforms other methods due to its effectiveness in identifying customer characteristics that influence default rates.
According to Hand and Henley (1997), the effectiveness of classification methods varies with the specific data and input variables, and each method offers its own advantages and disadvantages. For instance, logistic regression (LR) and nearest neighbor methods are user-friendly and yield easily interpretable results, while neural networks, despite their high predictive power, make it difficult to explain how their outcomes are produced. Paliwal (2009) noted a growing trend in the last decade in which lending institutions increasingly adopt neural networks as an alternative to traditional statistical models for constructing credit scoring models (CSM). Additionally, research has shown that hybrid models combining feed-forward neural networks with traditional statistical methods like discriminant analysis (DA) and LR can significantly enhance model performance (Cheng et al., 1994; Paliwal et al., 2009).
Table 02: Common methods in previous studies
Columns: Studies; LDA; CT; LR; NN; Hybrid Model
Note: *** the better method in their study
Logistic regression is widely recognized as a leading method for distinguishing between good and bad customers due to its strong performance (Cheng et al., 2003; Laitinen, 2000). It does not assume a linear relationship between the independent variables and the dependent variable itself, nor does it require the dependent variable to be normally distributed. In addition, research by Chen and Huang (2003) indicates that weak non-linearity is common in credit scoring datasets, which makes logistic regression a reliable estimation tool. Furthermore, previous studies have shown that estimations from logit or probit regression consistently outperform those from Discriminant Analysis (DA) (Wilson et al., 2000).
Recent studies suggest a novel approach to credit scoring model (CSM) development that integrates various techniques, recognizing that each model excels in specific segments or criteria (Koh et al., 2006). This hybrid methodology leverages the strengths of individual models to enhance overall performance. Supporting research by Lee et al. (2000) and Zhu et al. (2001) indicates that these combined models significantly outperform traditional methods.
Lee and Chen (2005) conducted a comparison of individual models, including Discriminant Analysis (DA) and Logistic Regression (LR), alongside a hybrid model that integrated neural networks with multivariate adaptive regression splines, ultimately yielding similar results.
This study aims to evaluate the performance of credit scoring models (CSM) by employing logistic regression, neural networks, and a hybrid model that combines these techniques.
METHODOLOGY
Data
This study analyzes a dataset from MBBank, a prominent commercial bank in Vietnam, comprising personal information of 690 customers who took out loans in 2012, with their loan status updated by the end of 2013. Key independent variables such as gender, age, education, and marital status are used to differentiate between good and bad customers.
As of the end of 2013, MBBank operated six branches in Ho Chi Minh City, serving nearly 10,000 personal customers. A random selection of 115 personal customers from each branch was conducted for this study, aiming for a distribution of 30% bad customers to 70% good customers. The personal information of these customers was sourced from their loan applications, securely stored in the bank's data center.
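The per-branch selection described above can be sketched as follows. This is only an illustration under stated assumptions: the customer identifiers and the figure of 1,600 customers per branch are hypothetical, the helper `sample_per_branch` is not from the study, and the sketch does not model how the 30% to 70% split between bad and good customers was targeted.

```python
import random


def sample_per_branch(customers_by_branch, per_branch, seed=0):
    """Draw the same number of customers at random from every branch."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    selected = []
    for branch in sorted(customers_by_branch):
        customers = customers_by_branch[branch]
        if len(customers) < per_branch:
            raise ValueError(f"{branch} has fewer than {per_branch} customers")
        # random.sample draws without replacement
        selected.extend(rng.sample(customers, per_branch))
    return selected


# Hypothetical population: six branches with 1,600 personal customers each.
branches = {
    f"branch_{b}": [f"b{b}_c{i}" for i in range(1600)] for b in range(1, 7)
}
selected = sample_per_branch(branches, per_branch=115)
print(len(selected))  # 6 branches x 115 customers = 690
```

Drawing 115 customers from each of six branches gives the 690 observations reported for the dataset.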
The table below describes the variables and their definitions.
Variable: Definition
Age: Divided into three groups: Under 30; From 30 to 40; Over 40. Two dummy variables.
Gender: Dummy. Male coded as 1, Female coded as 0.
Education: Dummy, education with two groups: University and higher coded as 1; College and lower coded as 0.
Marital status: Category, marital status with three groups: Single, Married, Divorce.
Current living status: Category, describes the place where the customer is living. Divided into three groups: Owner (homeownership), Live in parent's house, Renting.
Living time at current place: Category, describes the duration (number of years) that the customer has spent at the current living place. Divided into three groups: From 1 to 4 years; From 4 to 7 years; From …
Type of job: Category, describes the type of employment, divided into three groups: Manager, Officer, Private business.
Working time in current field: Category, describes the duration (number of years) that the customer has worked in the current field. Divided into three groups: Less than 4 years; From 4 to 7 years; Over 7 years.
Working time in current job: Category, describes the duration (number of years) that the customer has worked in the current company. Divided into three groups: Less than 2 years; From 2 to 4 years; Over 4 years.
Number of dependent people: Category, describes the number of people the customer has to support financially. Divided into three groups: No dependant; One dependant; Two dependants and more.
Historical payment: Dummy, historical payment of the customer on previous loans. Good historical payment coded as 0, Bad historical payment coded as 1.
Table 03: Variables and their definitions
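The coding scheme in Table 03 can be illustrated with a short sketch. It assumes the usual convention that a variable with three groups is represented by two dummy variables, with the first group as the reference category; the helper `dummy_encode`, the choice of reference groups, and the example customer are hypothetical.

```python
def dummy_encode(value, categories):
    """Encode a categorical value as len(categories) - 1 dummy variables.

    The first category is the reference group and maps to all zeros.
    """
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories[1:]]


AGE_GROUPS = ["Under 30", "30-40", "Over 40"]
MARITAL_STATUS = ["Single", "Married", "Divorce"]

# Hypothetical customer record using the codes from Table 03.
customer = {
    "age": dummy_encode("30-40", AGE_GROUPS),              # [1, 0]
    "gender": 1,                                           # Male = 1, Female = 0
    "education": 0,                                        # College and lower = 0
    "marital_status": dummy_encode("Single", MARITAL_STATUS),  # [0, 0]
    "historical_payment": 0,                               # Good = 0, Bad = 1
}
print(customer)
```

Dropping one group per variable avoids perfect collinearity with the constant term when the dummies enter a regression.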
In this part we formulate and explain our hypotheses. The variables used to test whether each hypothesis should be rejected or accepted are also presented.
Age: Age is relevant in determining the probability of default
This study anticipates that older borrowers exhibit a lower probability of default due to their potential risk aversion (Dunn and Kim, 1999; Arminger et al., 1997; Agarwal et al., 2009). In contrast, research by Autio et al. (2009) in Finland indicates that younger customers often rely on borrowing to cover overdue bills, reflecting a weaker financial position and a lack of money management skills.
Gender: Men are more likely to default than women
Dinh and Kleimeier (2007) found that women are less likely to be considered bad customers compared to men, as they tend to have a greater awareness of their loan repayment obligations and exhibit better control over their spending habits.
Education: Higher education is negatively correlated with the probability of default
We expect that people who are better educated will have higher and more stable incomes and will therefore be less likely to default. This expectation is supported by Steenackers and Goovaerts (1989).
Marital status: Different marital status can affect the default probability
In credit scoring models, marital status is a significant variable, with single individuals often deemed riskier due to perceived lower reliability and maturity compared to married counterparts. Research by Agarwal et al. (2009) indicates that married borrowers are 24 percent less likely to default on loans than single borrowers.
Current living status: Different living statuses have different impacts on the probability of default
Living status refers to the housing arrangements of customers, which can include owning a home, residing in a parent's house without incurring monthly fees, or living in an apartment where monthly payments are required. This classification is supported by the findings of Agarwal et al. (2009), whose study indicated that people who own their house are less likely to be bad customers.
Living time in the current place: Time at present address has an impact on the probability of default
People who change their place of residence frequently are considered riskier because of their instability (Agarwal et al., 2009; Steenackers and Goovaerts, 1989).
Type of job: Different type of job can affect the default probability
This study analyzes various job types, including officers, workers, and private business owners, to assess job stability in Vietnam. The type of employment serves as a reliable indicator of income level and stability, with officer positions typically offering higher salaries and greater job security compared to workers and private business roles. Consequently, officers are presumed to have a lower probability of default, while workers are considered the most at risk.
Working time in current job: Time in current job has an impact on the probability of default
The "Job Stability" variable, akin to the "Type of Job" variable, indicates the reliability of a borrower's employment status. Individuals who have maintained their current job for an extended period are generally considered less risky, as they face a lower likelihood of job changes or termination.
Working time in current field: More accumulated experience means a lower likelihood of default
The variable emphasizes the importance of a customer's work experience, suggesting that individuals with extensive tenure in their current field are perceived as less risky. Their accumulated experience, coupled with stable positions, increases their likelihood of receiving promotions and higher salaries.
Number of dependent people: A larger number of dependents increases the probability of default
This variable indicates the number of dependents a customer must financially support each month, including expenses like education and healthcare. As the number of dependents increases, so does the financial pressure on the customer, which may lead to difficulties in managing their payment obligations.
Historical payment: Payment history is relevant when estimating the probability of default
A history of paid debts is typically inversely related to the likelihood of default, as timely payments reflect responsible financial management and sound personal finance practices. Conversely, late payments may signal negligence and diminished creditworthiness, particularly when they are significantly overdue, which strongly correlates with an increased probability of default.
Methodology
This study will employ logistic regression, neural network and the hybrid model of them to create CSM It will be conducted follow steps illustrated in the below figure:
Figure 03: Steps to construct Credit scoring model
Source: Koh et al. (2006), A Two-step Method to Construct Credit Scoring Models With Data Mining Techniques, International Journal of Business and Information.
Logistic regression
Logistic regression is a parametric method used for predicting binary outcomes, where the dependent variable can only take on two values: 0 (non-default) or 1 (default). This technique utilizes both continuous and categorical independent variables to fit data to a logistic curve, allowing for the estimation of the probability of a specific event occurring. By employing maximum likelihood estimation through the logit function, logistic regression assumes that the dependent variables are logistically distributed, with the probability of default for a customer ranging exclusively between 0 and 1.
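The default probability of a customer is calculated as follows:
$$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}}$$
where p is the default probability of a customer, β0 is the constant, β1, …, βn are the weights of the independent variables, and x_i are the independent variables.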
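As a minimal sketch of the logistic form p = 1 / (1 + exp(-(β0 + Σ βi xi))), the snippet below evaluates the default probability for one customer. The coefficient values and the two dummy-coded features are hypothetical and chosen only for illustration; they are not estimates from this study.

```python
import math

def default_probability(intercept, weights, features):
    """Logistic default probability p = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients and dummy-coded customer, for illustration only.
beta0 = -1.0
betas = [0.8, -1.2]   # e.g. "age under 30", "bachelor's degree or higher"
customer = [1, 0]     # a young customer without a bachelor's degree

p = default_probability(beta0, betas, customer)
print(f"estimated default probability: {p:.3f}")
```

Whatever the coefficient values, the result stays strictly between 0 and 1, which is what allows it to be read as a probability.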
The odds ratio is used as an indicator of whether the categories of a variable have predictive power. It is defined as:
$$\text{Odds}_i = \left(\frac{bad_i}{Bad}\right) \times \left(\frac{Good}{good_i}\right)$$
where
bad_i: total number of defaulted observations of x_i
Bad: total number of defaulted observations
good_i: total number of non-defaulted observations of x_i
Good: total number of non-defaulted observations
Variables whose categories show an odds ratio of 1 lack predictive power, because such categories occur with the same probability in the defaulted and non-defaulted groups. As a result, they fail to differentiate between good and bad customers and are ineffective as predictors in a credit scoring model (CSM).
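The odds ratio for one category, Odds_i = (bad_i / Bad) × (Good / good_i), can be sketched as below. The totals of 246 bad and 444 good customers follow from the dataset described later in this study; the per-category counts are hypothetical.

```python
def odds_ratio(bad_i, good_i, bad_total, good_total):
    """Odds_i = (bad_i / Bad) * (Good / good_i) for one category of a variable."""
    return (bad_i / bad_total) * (good_total / good_i)

BAD_TOTAL, GOOD_TOTAL = 246, 444  # 690 observations in total

# Hypothetical category holding half of the bad and half of the good customers:
# it is equally common in both groups, so it carries no predictive power.
neutral = odds_ratio(123, 222, BAD_TOTAL, GOOD_TOTAL)

# Hypothetical category that is over-represented among bad customers.
risky = odds_ratio(150, 100, BAD_TOTAL, GOOD_TOTAL)

print(neutral, round(risky, 2))
```

A value well above or below 1 signals that the category separates the two groups; a value near 1 signals that it does not.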
In addition to the odds ratio, this study uses information value to assess how well a variable separates good from bad customers. The information value for a specific category of a variable is defined as follows:
$$IV_i = (good_i - bad_i) \times \log\left(\frac{good_i}{bad_i}\right)$$
Variables with higher information value are deemed to possess greater predictive power in distinguishing between good and bad customers. Kocenda and Vojtek (2009) propose that variables with an information value exceeding 0.2 are considered strong candidates for model input. In addition to evaluating the p-value of a variable, information value serves as a valuable indicator for selecting variables for inclusion in the model.
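A minimal sketch of the information value calculation follows. It assumes that good_i and bad_i enter the formula as shares of the Good and Bad totals, which is the common convention but is not stated explicitly in the text, and it sums the per-category values to obtain one figure per variable. The category counts are hypothetical.

```python
import math

def information_value(categories, good_total, bad_total):
    """Sum of (good_share - bad_share) * ln(good_share / bad_share) over categories.

    `categories` is a list of (good_count, bad_count) pairs. Every count must be
    positive, otherwise the logarithm is undefined.
    """
    total = 0.0
    for good_count, bad_count in categories:
        good_share = good_count / good_total
        bad_share = bad_count / bad_total
        total += (good_share - bad_share) * math.log(good_share / bad_share)
    return total

# Hypothetical two-category variable (counts add up to 444 good and 246 bad).
iv = information_value([(344, 96), (100, 150)], good_total=444, bad_total=246)
print(f"information value: {iv:.3f}",
      "strong candidate" if iv > 0.2 else "weak candidate")
```

The 0.2 threshold in the final line is the one attributed to Kocenda and Vojtek (2009) above.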
3.3.4.1 Log-likelihood ratio (LR) test:
The Likelihood Ratio (LR) test serves as an alternative to the traditional F-test, particularly in logistic regression, which does not require the dependent variable to be normally distributed as in Ordinary Least Squares (OLS) regression. This hypothesis testing method evaluates the log likelihood of both the initial and restricted models. The LR test is based on the ratio of the maximized values of the likelihood functions of the two models being compared. Essentially, it measures the difference in residual deviances between the constrained and unconstrained models.
$$LR = -2 \ln\left(\frac{L(m_1)}{L(m_2)}\right) = 2\left[\ln L(m_2) - \ln L(m_1)\right]$$
where L(m1) is the likelihood of the null model and L(m2) is the likelihood of the alternative model.
For example, if the null hypothesis is H0: β_x = β_y = 0, rejecting H0 means that the two coefficients are not both equal to zero.
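The test of a two-coefficient restriction such as β_x = β_y = 0 can be sketched as follows. The log-likelihood values are hypothetical. The p-value helper relies on the fact that, for exactly 2 degrees of freedom, the upper-tail chi-square probability is exp(-x / 2); other degrees of freedom would need a statistics library.

```python
import math

def lr_statistic(loglik_null, loglik_alt):
    """LR = -2 * (ln L(m1) - ln L(m2)), where m1 is the restricted (null) model."""
    return -2.0 * (loglik_null - loglik_alt)

def chi2_upper_tail_df2(x):
    """Upper-tail chi-square probability, valid only for 2 degrees of freedom."""
    return math.exp(-x / 2.0)

# Hypothetical maximized log-likelihoods of the restricted and unrestricted models.
lr = lr_statistic(loglik_null=-310.0, loglik_alt=-305.0)
p_value = chi2_upper_tail_df2(lr)

print(f"LR = {lr:.1f}, p-value = {p_value:.4f}")
print("reject H0" if p_value < 0.05 else "do not reject H0")
```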
In logistic regression, the chi-square test is commonly employed to assess the model's goodness-of-fit. This statistical test evaluates the independence and goodness-of-fit between two categorical variables. By comparing observed data with expected values based on the research hypothesis, the chi-square test helps identify significant differences in the data.
To identify the optimal model from a set of proposed options, the Akaike Information Criterion (AIC) is commonly utilized. AIC helps in selecting the best model by estimating the Kullback-Leibler distance between the model and the actual data, with lower values indicating a better fit. We begin with a basic model that includes only a constant and progressively incorporate additional variables, assessing each iteration using the AIC.
$$AIC = 2K - 2\ln(L)$$
where L is the maximized value of the likelihood function of the estimated model and K is the number of free parameters in the statistical model.
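As a rough illustration of how AIC = 2K − 2 ln(L) ranks candidate models, the sketch below compares three hypothetical models by their maximized log-likelihood and parameter count. The names and numbers are invented for the example and are not results from this study; the snippet shows only the comparison step, not the stepwise procedure.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2K - 2 ln(L); lower is better."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical candidates: (maximized log-likelihood, number of free parameters).
candidates = {
    "model A": (-300.0, 12),
    "model B": (-300.5, 11),
    "model C": (-306.0, 9),
}

scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)

for name, score in scores.items():
    print(name, score)
print("preferred:", best)
```

In this example model B gives up a little likelihood relative to model A but saves one parameter, so the penalty term tips the comparison in its favor.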
Neural Network
An artificial neural network, as defined by Haykin (1999), is an information processing paradigm that mimics the way biological nervous systems process information. These networks function similarly to the human brain, learning from examples, accumulating experience, and developing decision-making capabilities. In an artificial neuron, processing elements (PE) receive input variables, where neuron j aggregates information from multiple inputs (xi) weighted by different factors (w1j). A transformation function is then applied to convert the output into the desired value type. The output from neuron j may serve as input for other neurons, regardless of the number of hidden layers in the network. If the neural network contains multiple hidden layers, this aggregation of information is reiterated until it reaches the final layers.
Figure 04: Processing information in an Artificial Neuron
3.4.2 Components of artificial neural network:
A basic neural network consists of three key components: input layers, hidden layers, and output layers. Input layers receive variables with random weights, which are then processed in the hidden layers. The summation of these inputs, along with their random weights, serves as the input for subsequent hidden layers or the output layer if only one hidden layer exists. The network calculates the output from training examples and compares it to the desired output, determining the error. This process is repeated multiple times, updating the weights at each layer until the network's error is minimized or stabilized.
Figure 05: Neural network with one hidden layer
In each processing element of the network, a summation function computes the weighted sum of all input variables. The figure below illustrates the summation formula at a single processing element and across several processing elements.
Figure 06: Example of Summation function (panels: one processing element with n inputs; several processing elements)
At the output of each processing element in the network, a transformation function converts the neuron's summation into the desired value type. Because the dependent variable in this study is binary, the sigmoid transfer function is applied, defined as follows:
$$Y_T = \frac{1}{1 + e^{-Y}}$$
where Y_T is the transformed value and Y is the neuron's weighted sum.
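The two operations inside a processing element, the weighted sum followed by the sigmoid transfer function, can be sketched in a few lines. The input values and weights are hypothetical.

```python
import math

def weighted_sum(inputs, weights):
    """Summation function: Y = sum of x_i * w_i over all inputs."""
    return sum(x * w for x, w in zip(inputs, weights))

def sigmoid(y):
    """Sigmoid transfer function: Y_T = 1 / (1 + exp(-Y))."""
    return 1.0 / (1.0 + math.exp(-y))

# Hypothetical inputs (dummy-coded) and weights for one processing element.
inputs = [1.0, 0.0, 1.0]
weights = [0.4, -0.6, 0.1]

y = weighted_sum(inputs, weights)
y_t = sigmoid(y)
print(f"Y = {y:.2f}, Y_T = {y_t:.4f}")
```

Because the sigmoid maps any real-valued sum into the interval (0, 1), its output can be compared directly with the binary good/bad label.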
Figure 07: Example of Sigmoid function of ANN
3.4.3 Back Propagation Algorithm:
The back propagation algorithm is a fundamental technique in neural networks, as described by Rojas (2005). Before training starts, random weights are assigned to the input variables. The algorithm then proceeds in four steps. First, in the feed-forward computation, the input variables are processed to calculate the values of the hidden and output nodes. Second, back propagation to the output layer computes the error between the actual output and the desired output, which leads to weight adjustments using a learning rate and momentum. Third, back propagation continues to the hidden layer to estimate errors at the input nodes, resulting in new weight calculations. Fourth, the weights are updated for both the input and hidden layers. This iterative process continues until the network's error reaches a minimum or stabilizes.
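The steps described above can be sketched for a network with one hidden layer. This is an illustrative toy, not the configuration used in the study: it updates the weights after every example rather than in batches, trains on a four-row toy dataset, and the function names, learning rate, momentum, and epoch count are all assumptions.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(x, w_hidden, w_out):
    """Feed-forward pass; the last weight of each neuron acts on a bias input of 1."""
    xb = x + [1.0]
    hidden = [sigmoid(sum(w * v for w, v in zip(ws, xb))) for ws in w_hidden]
    hb = hidden + [1.0]
    output = sigmoid(sum(w * v for w, v in zip(w_out, hb)))
    return xb, hb, output

def total_error(data, w_hidden, w_out):
    """Sum of squared errors over the whole dataset."""
    return sum((t - forward(x, w_hidden, w_out)[2]) ** 2 for x, t in data)

def train(data, n_hidden=2, rate=0.3, momentum=0.5, epochs=2000, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    # Initialization: assign small random weights.
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                for _ in range(n_hidden)]
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    prev_hidden = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
    prev_out = [0.0] * (n_hidden + 1)
    start_error = total_error(data, w_hidden, w_out)
    for _ in range(epochs):
        for x, target in data:
            # Step 1: feed-forward computation.
            xb, hb, out = forward(x, w_hidden, w_out)
            # Step 2: back propagation to the output layer.
            delta_out = (target - out) * out * (1.0 - out)
            # Step 3: back propagation to the hidden layer.
            delta_hidden = [delta_out * w_out[j] * hb[j] * (1.0 - hb[j])
                            for j in range(n_hidden)]
            # Step 4: weight updates using the learning rate and momentum.
            for j in range(n_hidden + 1):
                prev_out[j] = rate * delta_out * hb[j] + momentum * prev_out[j]
                w_out[j] += prev_out[j]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    prev_hidden[j][i] = (rate * delta_hidden[j] * xb[i]
                                         + momentum * prev_hidden[j][i])
                    w_hidden[j][i] += prev_hidden[j][i]
    return start_error, total_error(data, w_hidden, w_out)

# Toy dataset (logical OR), used only to show the error shrinking during training.
toy_data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
            ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
error_before, error_after = train(toy_data)
print(f"sum of squared errors: {error_before:.3f} -> {error_after:.3f}")
```

The hidden-layer deltas are computed before the output weights change, so both layers are adjusted from the same forward pass.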
Figure 08: Back propagation algorithm of single neuron
3.5 The hybrid model:
Constructing the hybrid model comprises two steps:
+ First, identify influencing variables by using logistic regression
+ Next, those significant variables will be considered as the input variables of Back Propagation Neural Network (BPN)
The advantage of this approach is that it helps prevent over-fitting in the neural network model.
This study employs classification tables to evaluate the performance of the models, using three indicators: (1) the overall accuracy rate, (2) the Type I error rate, which measures the cases where "Good" customers are incorrectly predicted as "Bad" (good rejected), and (3) the Type II error rate, which measures the cases where "Bad" customers are mistakenly accepted (bad accepted).
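The three indicators can be computed from actual and predicted labels as sketched below. The labels are hypothetical, and expressing "good rejected" and "bad accepted" as shares of the actual good and actual bad customers is an assumption about the denominators.

```python
def classification_summary(actual, predicted):
    """Summarize a classification table; labels are 0 = good, 1 = bad customer."""
    pairs = list(zip(actual, predicted))
    correct = sum(1 for a, p in pairs if a == p)
    good_rejected = sum(1 for a, p in pairs if a == 0 and p == 1)
    bad_accepted = sum(1 for a, p in pairs if a == 1 and p == 0)
    n_good = sum(1 for a in actual if a == 0)
    n_bad = len(actual) - n_good
    return {
        "overall_accuracy": correct / len(pairs),
        "good_rejected_rate": good_rejected / n_good,
        "bad_accepted_rate": bad_accepted / n_bad,
    }

# Hypothetical labels for ten customers (six good, four bad).
actual = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

summary = classification_summary(actual, predicted)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```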
EMPIRICAL RESULTS
Data
This study analyzes a dataset from MBBank, one of Vietnam's largest joint-stock commercial banks, which includes detailed customer information such as gender, education, marital status, ownership, and number of dependents. Additionally, the dataset incorporates historical payment records of these customers.
The dataset comprises personal information from 690 customers who took out loans in 2012, with their loan status updated at the end of 2013. While customers may have multiple loans across different banks, this study specifically examines the earliest loan taken by each customer in 2012.
In this study, the dependent variable is a binary dummy variable, where a value of 0 indicates a good customer who repays loans on time, and a value of 1 signifies a bad customer who fails to repay loans. Following Basel II guidelines, borrowers are classified as bad customers if they delay payments for over 90 days. Out of 690 total observations, 246 are categorized as bad customers, representing 35.65% of the sample.
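The labeling rule and the reported share of bad customers can be checked with a short sketch. The function names and the days-past-due values are hypothetical; the counts of 246 and 690 are taken from the text.

```python
def is_bad_customer(days_past_due):
    """Dependent variable: 1 (bad) if payment is delayed more than 90 days, else 0."""
    return 1 if days_past_due > 90 else 0

def bad_share(n_bad, n_total):
    """Percentage of bad customers in the sample."""
    return 100.0 * n_bad / n_total

# Hypothetical delays in days for four customers.
labels = [is_bad_customer(d) for d in [0, 45, 90, 120]]
print(labels)

share = round(bad_share(246, 690), 2)
print(f"bad customers: {share}% of the sample")
```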
Figure 09: Ratio of good/bad customers in the dataset
4.1.2 Independent Variables:
In this study, customer age is categorized into three groups: under 30, between 30 and 40, and 40 and older, utilizing two dummy variables for differentiation. Previous research suggests that older individuals often exhibit a greater aversion to risk, carefully assess expenses, and align spending with their payment capabilities, making them less likely to be categorized as bad customers. Consistent with these findings, this study anticipates that younger customers are more prone to becoming bad customers compared to their older counterparts.
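Coding three age groups with two dummy variables can be sketched as follows. Treating "40 and older" as the reference group (both dummies equal to zero) and placing age 30 in the middle group are assumptions; the text does not say which group serves as the baseline.

```python
def age_dummies(age):
    """Return (under_30, from_30_to_39); (0, 0) marks the 40-and-older reference group."""
    under_30 = 1 if age < 30 else 0
    from_30_to_39 = 1 if 30 <= age < 40 else 0
    return (under_30, from_30_to_39)

for age in (25, 35, 52):
    print(age, age_dummies(age))
```

Using two dummies for three groups avoids perfect collinearity with the constant term in the logistic regression.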
The chart illustrates the varying ratios of good to bad customers across different age groups, revealing a clear disparity. In the oldest age group, the percentage of bad customers is the lowest at approximately 22%, while the youngest group exhibits the highest proportion of bad customers, accounting for 56%.
Bad customers, as defined by Autio et al. (2009), are individuals with weak financial stability who struggle to manage their finances. Typically aged between 18 and 29, these customers often resort to borrowing funds to cover existing debts or rental expenses.
Figure 10: Ratio of good/bad customers based on age of customer
The current living status of a customer significantly influences their repayment ability, categorized into three groups: homeowners, individuals living in their parents' house without paying rent, and renters who pay monthly fees. According to Agarwal et al. (2009), homeowners are 19% less likely to default on payments compared to non-homeowners. The analysis of customer ratios within these groups suggests that while homeowners demonstrate a stronger repayment capability, distinguishing between those living rent-free with parents and those renting remains inconclusive.
Figure 11: Ratio of good/bad customers based on Current living status
Education is one of the most powerful predictor variables for distinguishing good from bad customers.
This study categorizes education into two groups: individuals with a bachelor's degree or higher and those with lower educational attainment. It posits that higher education correlates with increased opportunities for promotions, higher salaries, and more stable work environments, thereby reducing the likelihood of becoming a problematic customer. This hypothesis is further supported by findings from Steenackers and Goovaerts (1989).
The data presented in the chart strongly supports the expectation that higher education correlates with a lower likelihood of becoming a bad customer. Specifically, individuals with a bachelor's degree or higher exhibit a default rate of only 22%, in stark contrast to a default rate of 62% among those with lower educational attainment.
Figure 12: Ratio of good/bad customers based on Education level
Gender is a critical factor in predictive modeling, particularly in credit scoring, where its use is restricted in some countries to prevent discrimination. Research by Arminger et al. (1997) revealed that females generally exhibit a greater aversion to risk compared to males, leading to a lower likelihood of becoming bad customers. However, recent data indicates that the default rates between genders are nearly identical, challenging traditional assumptions about gender and creditworthiness.
Figure 13: Ratio of good/bad customers based on Gender
Marital status is a significant factor in predicting customer behavior in credit scoring models, categorized into three groups: single, married, and divorced. Research by Arminger et al. (2009) indicates that married individuals are less likely to default on credit, attributed to their greater sense of responsibility and social standing. The default rate among singles is notably high at 61%, while it is much lower at 29% for married individuals, with divorced individuals falling in between at 50%. These findings reinforce the expectation that marital status plays a crucial role in credit risk assessment.
Figure 14: Ratio of good/bad customers based on Marital status
The duration of residence in a current location is categorized into three groups for this study: Group 1 consists of individuals who have lived in their current place for less than four years, Group 2 includes those who have resided there for five to seven years, and Group 3 encompasses individuals who have lived in their current location for over eight years. Research by Agarwal et al. (2009) and Steenackers and Goovaerts (1989) indicates that frequent movers or newcomers are more likely to become bad customers.
The data clearly shows a significant distinction between group 1 and group 2; however, drawing a similar conclusion for group 2 and group 3 is challenging, as the proportion of bad customers in both groups is nearly identical.
Figure 15: Ratio of good/bad customers based on Duration living at current place
The classification of customers into three job types (manager, officer, and private business) reflects the stability of borrowers' incomes. This study posits that managers represent a lower risk compared to other groups, while private business owners are considered the highest risk. The following figure provides further evidence to support this expectation.
Figure 16: Ratio of good/bad customers based on Type of job
The "Working time in present job" variable, akin to "Type of job," reflects the stability of a borrower's income and their loyalty to their current employer. This variable is categorized into three distinct groups.
Estimation results
Model 1 comprises all of the variables introduced in the earlier chapter. Forward and backward selection techniques are applied to examine which variables are statistically significant in predicting a customer's payment ability. The result of model 1 is presented in Table 20. This table confirms that gender is not statistically significant in predicting payment ability, which means that gender cannot separate good from bad customers. In addition, some subcategories of five variables, such as “Marital status”, “Current Living Status”, and “Working time in current field”, are not statistically significant.
The analysis reveals that "Gender," "Working time in current field," and "Number of dependent people" are the only variables with an information value below 0.01, indicating weak predictive power regarding a customer's payment ability. In addition, "Working time in current job" and "Number of dependent people" show a p-value greater than 0.1. These initial findings from model 1 inform the development of model 2.
Model 2 is built by excluding “Gender” from model 1, because it is not statistically significant and its information value is below 0.01. Although Gender is removed, the result of model 2 is not clearly better than that of model 1, and the predictive power of the two models is considered the same. The study then constructs model 3, which removes three variables: “Gender”, “Working time in current field”, and “Number of dependent people”. Model 3 performs worse than the two earlier models, but it achieves the lowest number of “bad accepted” cases among the three. “Bad accepted” denotes the situation in which a bank or lending institution approves a loan for a customer who ultimately turns out to be a bad customer. Banks lose substantial capital when they make this kind of decision.
Variables Model 1 Model 2 Model 3 Information Value
Living time at current place 0.18182
Working time in current field 0.00217
Working time in current job 0.10556
Table 04: Summary of selected variables in logit models
Note: Details of the information value calculations are listed in Figure 23 in the Appendix.
4.2.1.1 Log-likelihood ratio (LR) test:
Log-Likelihood LR chi-square p-value
Table 05: Log-likelihood ratio (LR) test
According to the LR test between model 1 and model 2, gender does not significantly influence customer repayment behavior, so the null hypothesis regarding gender's effect is not rejected. In contrast, the variables "Working time in current field" and "Number of dependent people" were excluded from model 3 despite their high statistical significance: the LR test rejects the null hypothesis for these variables. Ultimately, all three models demonstrate comparable predictive power and qualify as effective credit scoring models.
Pearson Chi-square test df Sig
Table 06: Pearson Chi-square test result
The results in the table above indicate a statistically significant relationship between the independent variables and the dependent variable in all three models.
Table 07: Akaike Information Criterion (AIC) result
The AIC values indicate that model 2 slightly outperforms model 1, while model 3, despite having the highest AIC value, shows the least variability among the three models. This suggests that a model with fewer suitable variables can yield better results than one with excessive variable exclusion. Ultimately, the findings conclude that model 2 demonstrates the best performance overall.
Table 08: Classification table of logit models
The classification tables reveal that all three models in this study demonstrate adequate goodness-of-fit, predicting good credit risks (reliable customers) better than bad credit risks (unreliable customers). Model 1 achieves the highest accuracy with 82% of predictions correct, while Model 3 minimizes incorrect decisions among the three. Implementing Model 3 as a credit scoring tool can help banks and lending institutions avoid potential capital losses by reducing the likelihood of approving loans for bad customers.
Table 09: Summary logit model comparison
All three models can be used as credit scoring models because their performance is nearly the same; the preferred model depends on the criterion applied.
Neural network
This study employs a multilayer perceptron algorithm to develop a neural network model with an architecture of 38-1-10-1, incorporating 38 independent input variables, one hidden layer with 10 neurons, and one dependent output variable. The sigmoid function is utilized to ensure the output values are normalized, fluctuating between 0 and 1. Additionally, the batch training method is chosen for its effectiveness with small datasets and its capability to directly minimize total error.
The table below presents the results of the neural network model, which used a sigmoid function and the sum-of-squares error. The model was trained to minimize this error and achieved a correct prediction rate of 84.2%, indicating high performance.
Table 10: Neural network model summary
Table 11: Classification of Neural network model
4.3.2 Importance of independent variables:
The table highlights the significance of independent variables based on two criteria: importance and normalized importance. Importance is determined by measuring the changes in the network's predicted value when independent variable values are altered, while normalized importance is calculated as the ratio of importance values to the highest importance value. According to the table, "Historical payment" emerges as the most crucial variable in the neural network model, followed by "Type of job," "Education," "Age," "Working time in current job," and "Current living status."
Variables | Subcategories | Importance | Normalized importance
Living time at current place | Spend1 | 0.051 | 32.3%
Working time in current field | Field1 | 0.049 | 31.4%
Working time in current job | CurJob1 | 0.053 | 33.4%
Number of dependent people | Dep1 | 0.069 | 44.0%
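Normalized importance, defined above as each importance value divided by the largest importance value, can be sketched as follows. The variable names and importance values here are hypothetical, because the highest importance value in the study's table is not reproduced in this section.

```python
def normalized_importance(importance):
    """Divide every importance value by the largest one (the top variable gets 1.0)."""
    top = max(importance.values())
    return {name: value / top for name, value in importance.items()}

# Hypothetical importance values for three input variables.
raw = {"var_a": 0.20, "var_b": 0.10, "var_c": 0.05}
normalized = normalized_importance(raw)

for name, value in normalized.items():
    print(f"{name}: {value:.1%}")
```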
Hybrid model
In this study, all variables identified through logistic regression, except for "gender," will be utilized as input variables for the Back Propagation Neural Network, forming a hybrid model.
The table below describes the result of Hybrid model 1. The model's percentage of correct predictions is 83.8%.
Table 14: Classification of Hybrid model 1
To construct hybrid model 2, the study excluded three variables: “gender”, “Working time in current field”, and “number of dependent people”.
The table below describes the result of Hybrid model 2. The model's percentage of correct predictions is 83.5%.
Table 16: Classification of Hybrid model 2
Summary comparison
Criterion | Overall accuracy rate | Good rejected | Bad accepted
Selected model | Neural network | Hybrid model 2 | Neural network and Hybrid model 1
The neural network-based credit scoring model generally outperforms the logistic regression model; however, it includes statistically insignificant variables like "Gender." Therefore, Hybrid Model 1 emerges as the superior choice for credit scoring in this study.