
VIETNAM – NETHERLANDS PROGRAMME FOR M.A. IN DEVELOPMENT ECONOMICS

CONSTRUCT CREDIT SCORING MODELS USING LOGISTIC REGRESSION, NEURAL NETWORK AND THE HYBRID MODEL

BY

LE MINH TIEN

MASTER OF ARTS IN DEVELOPMENT ECONOMICS

HO CHI MINH CITY, NOVEMBER 2015


A thesis submitted in partial fulfilment of the requirements for the degree of MASTER OF ARTS IN DEVELOPMENT ECONOMICS

By

LE MINH TIEN

Academic Supervisor:

DR PHAM DINH LONG


Abstract

The Vietnamese economy is facing many difficulties, and the ineffective operation of enterprises has pushed up the non-performing loan ratio of banks. Between 2007 and 2014, Viet Nam saw credit growth fall from 53.89% in 2007 to 11.8% in 2014, with no sign of a strong recovery in the next period. A decline in credit growth implies that enterprises are finding it difficult to obtain credit from lending institutions, and the enterprises that rely mainly on credit are the most strongly affected. The non-performing loan ratio of banks in Viet Nam rose from 2% in 2007 to 3.25% in 2014, peaking at 4.08% in 2012. In this period, most enterprises could not obtain bank loans, while banks feared a further increase in the non-performing loan ratio. At the same time, banks are competing strongly with domestic and foreign rivals to gain market share and maintain profit. Viet Nam is a densely populated country (a market of 90 million people with a high proportion of young people) and is considered a potential retail market in which banks can expand in the next period. To increase the competitiveness of banks and improve loan risk management, this study applies methods commonly used to build credit scoring models: logistic regression, neural network and a hybrid model. The credit scoring model is an application that has been developed and widely applied in finance and banking over recent decades, and it is useful for accelerating banks' credit analysis process. The final results confirm that characteristics such as age, education, marital status, current living status, living time in the current place, type of job, working time in the current job, working time in the current field, number of dependent people and historical payment have a statistically significant effect on the repayment capacity of a customer. Credit scoring models can classify customers according to the different strategic purposes of users, and the hybrid models appeared to perform better and more reliably than the separate models.


CHAPTER 1: INTRODUCTION
CHAPTER 2: LITERATURE REVIEW
2.1 The concept of credit scoring model:
2.2 Judgmental analysis method and credit scoring model:
2.3 Advantages and disadvantages of credit scoring models:
2.4 Historical development of credit scoring models:
2.4.1 Development in credit card and instant loan markets:
2.4.2 Development in mortgage markets:
2.4.3 Development in consumer credit market:
2.5 Common variables in constructing credit scoring models:
2.6 Common techniques employed in credit scoring models:
CHAPTER 3: METHODOLOGY
3.1 Data:
3.1.1 Variables:
3.1.2 Assumptions:
3.2 Methodology:
3.3 Logistic regression:
3.3.1 Theory:
3.3.2 Odds ratio:
3.3.3 Information value:
3.3.4 Quality of the model:
3.3.4.1 Log-likelihood ratio (LR) test:
3.3.4.2 Pearson Chi-Square test:
3.3.4.3 Akaike Information Criterion (AIC):
3.4 Neural Network:
3.4.1 Theory:
3.4.2 Components of artificial neural network:
3.4.3 Back Propagation Algorithm:
3.5 The hybrid model:
3.6 Comparison of models:
CHAPTER 4: EMPIRICAL RESULTS
4.1 Data:
4.1.1 Dependent Variable:
4.1.2 Independent Variables:
4.2 Estimation results:
4.2.1 Construction of Logit models:
4.2.1 Comparison of Logit models:
4.2.1.1 Log-likelihood ratio (LR) test:
4.2.1.2 Pearson Chi-square test:
4.2.1.3 Akaike Information Criterion (AIC):
4.2.1.4 Classification tables:
4.2.1.5 Comparison summary:
4.3 Neural network:
4.3.1 Measurement of Model performance:
4.3.2 Importance of independent variables:
4.4 Hybrid model:
4.4.1 Hybrid model 1:
4.4.2 Hybrid model 2:
4.5 Summary comparison:
CHAPTER 5: CONCLUSION
5.1 Research summary and implication:
5.1.1 Research summary:
5.1.2 Implication:
5.2 Limitations of the study:
References


List of tables

Table 01 Common variables in previous studies
Table 02 Common methods in previous studies
Table 03 Variables and their definitions
Table 04 Summary of selected variables in logit models
Table 05 Log-likelihood ratio (LR) test
Table 06 Pearson Chi-square test result
Table 07 Akaike Information Criterion (AIC) result
Table 08 Classification table of logit models
Table 09 Summary logit model comparison
Table 10 Neural network model summary
Table 11 Classification of Neural network model
Table 12 Importance of independent variables of Neural network model
Table 13 Hybrid model 1 summary
Table 14 Classification of Hybrid model 1
Table 15 Hybrid model 2 summary
Table 16 Classification of Hybrid model 2
Table 17 Selected model summary
Table 18 Correlation Matrix
Table 19 Collinearity Test
Table 20 Results of logit model 1
Table 21 Results of logit model 2
Table 22 Results of logit model 3
Table 23 Results of Neural network model
Table 24 Results of Hybrid model 1
Table 25 Results of Hybrid model 2
Table 26 Summary of Information value of variables

List of figures

Figure 01 Viet Nam credit growth in 2006-2014
Figure 02 Non performing loan ratio in 2006-2014
Figure 03 Steps to construct Credit scoring model
Figure 04 Processing information in an Artificial Neuron
Figure 05 Neural network with one hidden layer
Figure 06 Example of Summation function
Figure 07 Example of Sigmoid function of ANN
Figure 08 Back propagation algorithm of single neuron
Figure 09 Ratio of good/bad customer of dataset
Figure 10 Ratio of good/bad customer based on age of customer
Figure 11 Ratio of good/bad customer based on Current living status
Figure 12 Ratio of good/bad customer based on Education level
Figure 13 Ratio of good/bad customer based on Gender
Figure 14 Ratio of good/bad customer based on Marital status
Figure 15 Ratio of good/bad customer based on Living time at current place
Figure 16 Ratio of good/bad customer based on Type of job
Figure 17 Ratio of good/bad customer based on Working time in present job
Figure 18 Ratio of good/bad customer based on Working time in current field
Figure 19 Ratio of good/bad customer based on Number of dependent people
Figure 20 Ratio of good/bad customer based on Historical payment


CHAPTER 1: INTRODUCTION

In 2007, the financial crisis began in the United States (US) with a sharp decline in home prices, then affected the entire US economy and spread through the world economy. The deep cut in demand all over the world created many difficulties for Viet Nam's export sector at this time. Enterprises had to narrow their operations, and as a result the credit growth of the banking system has slowed in the recent period.

Figure 01: Viet Nam credit growth in 2006-2014

Source: The State Bank of Vietnam’s annual report 2006-2014

Banks are afraid of losing their capital because of the existing difficulties of the economy, while signs of economic recovery are still very weak, so they are careful in making lending decisions. Economists forecast that this situation will continue in the next few years. To survive and develop in this period, some economists suggest that retail banking could be the alternative strategy that helps banks develop their businesses, and may be the key to growth, because Viet Nam has a market of 90 million people (with a high proportion of young people). This market will generate opportunities for banks to expand services that help consumers increase asset value and manage their businesses better, as well as carry out daily payment activities. With some typical characteristics of a developing country, such as a dynamic young population, rising income and a desire to improve quality of life and lifestyle, Viet Nam has great potential for retail banking development. To take advantage of this opportunity, banks have to improve their procedures to make them more convenient and strengthen risk management in order to develop this new segment.

Figure 02: Non performing loan ratio in 2006-2014

Source: The State Bank of Vietnam’s annual report 2006-2014

The credit scoring model is a useful tool that was first introduced in 1940 and has developed rapidly over the last two decades. It is a statistical technique that helps banks or lending institutions predict the probability that a customer will pay back a loan on time (Mester, 1997). The model enables banks and financial institutions to classify and evaluate customers' risks easily and quickly, so that lending decisions are faster and more accurate than under a judgmental system. This paper builds credit scoring models using different techniques, namely logistic regression, neural network and a hybrid model of the two, to find the most suitable one and to draw some implications for assessing customer risk.

Research questions and objectives:

This study aims to build credit scoring models using three techniques (logistic regression, neural network and a hybrid model of the two) to identify which customer characteristics affect default probability, and then to compare the performance of the models and find the best one. The results of this study will answer the questions below:

• Which characteristics of a customer can be used to identify whether that customer can pay back the loan?

• Which technique is better for constructing a credit scoring model in this study?

Scope of this study:

The sample collected for this study comprises personal information on 690 customers of MBBank who took out a loan with this bank during 2012. The status of their loans (default/delinquent or good) was recorded at the end of 2013. The personal information covers customer characteristics such as gender, age, education, marital status, current living status, living time in the current place, type of job, working time in the current job, working time in the current field, number of dependent people and historical payment.

Structure of this study:

The first chapter of this study presents the reason for conducting the study, the research questions and the objectives. The second chapter presents an overview of credit scoring models, then discusses the independent variables that are commonly used and the methods most often applied to construct a credit scoring model. The third chapter gives details of the independent variables used in the research, such as the meaning of each variable and the assumptions about these variables, together with the steps used to construct the models. The first part of the fourth chapter presents an overview of the dataset used in this study and illustrates initial aspects of the relationship between the independent variables and the dependent variable. The second part of the fourth chapter outlines the steps taken to construct and select the final credit scoring model. Findings, implications and limitations of this study are discussed in the last chapter.


CHAPTER 2: LITERATURE REVIEW

This chapter presents an overview of credit scoring models, then discusses the independent variables that are commonly used and the methods most often applied to construct a credit scoring model.

2.1 The concept of credit scoring model:

Credit appraisal is the process of gathering, analyzing and classifying factors or variables to give the credit analyst an overview of a customer and so help in making the final financing decision. It is an important process that plays a key role in keeping the operations of credit institutions safe from risk. Credit scoring is a common tool that lenders use to classify the credit of customers. The term “credit scoring” can be understood through its two components, “credit” and “scoring”. “Credit” means that a person borrows money now and has to pay it back in the future, including principal and interest, while “scoring” denotes applying different measurements to assess or classify customers into separate groups according to the purposes of the credit institution. Hence, “credit scoring” is interpreted as the use of statistical models to convert the relevant data into variables that have a statistically significant effect on the payment ability of customers. A credit scoring model is a model that can quantify customer characteristics and classify customers according to the purposes of the credit institution. Through this classification, lenders can make final decisions on which customers should receive finance and which should be rejected.

Consumer loans usually account for a small amount of the total loan market, so evaluating each loan individually is not always cost-effective. Along with the development of technology and of research in risk management, banks and lenders depend heavily on scoring models when approving these loans, to shorten processing time and make approval decisions more accurate. According to Mester (1997), credit scoring models were applied to assess credit card applications, accounting for 70% of total small business applications in banks and other lending institutions. In the 1950s, credit scoring systems were first applied, but only to a limited extent, in banks in the US; not until the 1990s were they also applied to evaluating loans in housing finance (Straka, 2000). Multiple discriminant analysis, introduced by Altman in 1968, is known as the oldest and most common model in the history of credit scoring. Since then, other techniques such as logit and probit models, multiple discriminant models and neural networks have been developed in this field. All of these methods have succeeded in identifying the variables or customer characteristics that significantly affect a customer's probability of default.

Hörkkö (2010) claimed that the ultimate objective of such models is to help bankers or lenders distinguish “good” from “bad” customers and make accurate granting decisions, thereby minimizing credit risk and default rates. A credit scoring model (CSM) can be purchased or constructed in-house, depending on the bank's capability. There is no fixed CSM; it varies with the sample data and the techniques used to create it. In general, historical information such as defaulted or non-defaulted status is employed as the binary dependent variable. Next, personal information about customers, such as age, education, marital status and number of children, is used as the independent variables. Statistical techniques are then used to estimate which customer characteristics distinguish “good” customers from “bad” customers, with the support of empirical modeling. The CSM produces a score for each applicant, and this score is compared with the lender's required cut-off value to decide whether to grant the loan. If the score is higher than the required threshold, the lender accepts the application.
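The cut-off rule can be illustrated with a short Python sketch. The coefficients, the characteristic names and the cut-off values are hypothetical and are not estimates from this study; the score is assumed here to be a logistic probability of repaying on time.

```python
import math

# Hypothetical coefficients for illustration only (not estimated in this study).
INTERCEPT = -1.0
WEIGHTS = {"age_over_30": 0.8, "married": 0.5, "dependents": -0.4}


def score(applicant):
    """Return the modelled probability that the applicant repays on time."""
    z = INTERCEPT + sum(WEIGHTS[name] * applicant[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def decide(applicant, cutoff=0.5):
    """Accept the application only when the score exceeds the lender's cut-off."""
    return "accept" if score(applicant) > cutoff else "reject"


# Two hypothetical applicants.
applicant_a = {"age_over_30": 1, "married": 1, "dependents": 0}
applicant_b = {"age_over_30": 0, "married": 0, "dependents": 2}

print(decide(applicant_a), decide(applicant_b))
print(decide(applicant_a, cutoff=0.6))  # a stricter lender may reach a different decision
```

Raising or lowering the cut-off is how a lender could adapt the same model to a more conservative or a more expansionary lending strategy.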

2.2 Judgmental analysis method and credit scoring model:

Credit assessment is a process of reviewing a customer's characteristics and comparing them with those of past customers. If the customer's characteristics match those of past customers who did not pay back their loans on time, the credit officer will reject the loan application. By contrast, customers whose characteristics match those of past customers who paid back their loans on time will be approved for financing. Two techniques are commonly used in the credit assessment process: judgmental analysis and the credit scoring model.

Each analysis technique has its own advantages and disadvantages. The performance of the judgmental analysis technique depends heavily on the accumulated experience and analytical thinking of the credit analyst. This technique is therefore strongly affected by subjectivity, uncertainty and personal perspective when credit decisions are made. Besides these disadvantages, judgmental analysis has its own advantages, such as the ability to use quantitative factors in the analysis process and more accurate decisions when the credit analyst is experienced.

Credit scoring, on the other hand, is a process in which credit institutions use a large amount of data (customer characteristics) on past customers to create a quantitative/statistical model that can classify potential customers into a bad group or a good group and so produce a quick financing decision. Credit scoring models are said to have helped lenders improve customer service and save time and operating and approval costs in the analysis process. However, this technique also faces some criticism regarding the certainty of these models, because statistical models always contain assumptions and raise issues related to the type of data used to build them. Despite this criticism, credit scoring systems have developed rapidly and become a crucial technique in the finance and banking sectors in recent decades.
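As a minimal sketch of how data on past customers can be turned into a statistical classifier, the Python example below fits a one-variable logistic regression by plain batch gradient descent. The data, the single standardised characteristic and the training settings are all hypothetical, and this is not the estimation procedure used in this study.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.5, epochs=500):
    """Fit p(good | x) = sigmoid(w * x + b) by batch gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean log-loss with respect to w and b.
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(sigmoid(w * x + b) - y for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical past customers: one standardised characteristic and the
# recorded loan outcome (1 = good, 0 = bad).
past_x = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
past_y = [0, 0, 0, 1, 1, 1]

w, b = fit_logistic(past_x, past_y)


def classify(x):
    """Assign a potential customer to the good or bad group."""
    return "good" if sigmoid(w * x + b) >= 0.5 else "bad"


print(classify(1.5), classify(-0.5))
```

In practice a model of this kind would use many characteristics and a statistical package rather than a hand-written loop; the sketch only shows the principle of learning the classification rule from past outcomes.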

2.3 Advantages and disadvantages of credit scoring models:

The rapid development of credit scoring applications confirms that they bring many benefits to users, especially in the finance and banking sectors in recent decades. An outstanding advantage of these applications is that they require less information to produce a classification result. Credit scoring models use only those variables that are statistically significant in reflecting the repayment capacity of customers; variables that are not statistically significant are removed from the model. Judgmental analysis, meanwhile, makes decisions based on a review of all the information related to customers and does not exclude any element. In addition, a credit scoring system considers and evaluates the characteristics of both good and bad customers, while judgmental analysis focuses largely on the characteristics of bad customers. Credit scoring models are built by feeding a large amount of historical data into statistical models, while judgmental analysis is based primarily on the experience and analytical thinking of evaluators or credit officers. Credit scoring systems help lenders make objective decisions, while judgmental analysis contains the subjectivity of evaluators, which can cause issues related to discrimination. Lenders can understand the relationship between customer characteristics and payment behavior when applying the credit scoring technique, whereas this relationship is difficult to describe clearly in the judgmental analysis process. Another important advantage of the credit scoring technique is that the classification result for a customer is the same whichever credit evaluator applies it, which is the opposite of judgmental analysis.

Besides the benefits above, credit scoring models have other benefits compared with judgmental analysis: they are time-efficient, speed up decisions, minimize approval costs and reduce mistakes; they make it easier to provide information for the risk control process; they require less customer information to classify; and their structure can be changed to classify customers according to the different purposes of users.

Despite the many benefits that credit scoring models bring in practice, they attract some criticisms. First, a credit scoring model may use any feature or characteristic of a client as an input variable. The model will contain any variable that is statistically significant, even when the relationship between that variable and the creditworthiness of the customer is ambiguous or unclear. In addition, credit scoring models often leave out proxies for economic factors that may affect the repayment capacity of the customer. Credit scoring models can vary between regions, countries and cities because of regional differences, so there is no single standard credit scoring model for the whole world. Moreover, the budget to buy a credit scoring model and the cost of training analysts to use it are two serious issues for those who want to apply a credit scoring system. Sometimes a credit scoring model may reject a good customer because of a sudden change in the customer's circumstances (for example, a change of residence or employer), without the deeper analysis that the judgmental technique provides. Another weakness is that a credit scoring model is built on historical data, so the weights of the variables are fixed and the model loses accuracy when customer patterns change over time.

2.4 Historical development of credit scoring models:

Risk management plays an important role in the operation of banks and financial institutions. Banks are aware that banking activities influence and are influenced by the economy, so they have a responsibility to raise requirements that ensure the safe operation of the banking system and of the economy in general. At the same time, the tradeoff between risk and return has to be analyzed carefully, because of the common belief in “high risk, high return”. The key to risk management is combining all the information about a customer in order to analyze it and make the final financing decision. The performance of the customer classification system therefore helps lenders control risk and operate efficiently.

The rapid development of the credit industry all over the world in recent decades means that managing a large volume of loans is very difficult, and it is hard to control risk for the entire system. The credit scoring technique was established and developed to handle these issues. In recent decades, credit scoring models have been widely applied in the financial sector and have proved able to classify good and bad customers quickly. Applying a credit scoring model can reduce the cost of the approval process, reduce wrong final decisions, and save time and effort in the analysis process. Typically, the credit approval process is conducted using one of two common techniques, judgmental analysis or a credit scoring model. Judgmental analysis is a technique based on the knowledge, experience and analytical thinking of the credit analysis officer. Owing to the rapid development of the credit industry and the need to quantify risk in their operations, financial institutions have decided to apply credit scoring models in the credit analysis process.

Credit scoring systems can classify customers into good customers, who are expected to pay back loans on time, and bad customers, who are insolvent. Credit scoring models are said to classify customers more accurately than the judgmental analysis technique, and they allow banks to manage and cross-sell services to different customer groups. One of the main objectives of using credit scoring in the financial sector is to contribute to the development of credit management and to support the credit approval process effectively.

In developed countries, credit scoring models have been applied widely and effectively. The number of applications has increased rapidly because it is supported by good infrastructure and the availability of large amounts of data, while developing countries face many limitations in data availability and IT infrastructure that prevent them from applying credit scoring models effectively.

According to West (2000), credit scoring models are widely applied in the financial sector, and their primary purpose is to improve the process of gathering information and analyzing credit, reduce costs and speed up decision-making. According to statistics, about 82% of banks in the US used credit scoring models to decide which customers should be approved for credit card applications. Recently, some credit institutions and mortgage lenders have started to develop credit scoring models to support credit decisions, in an attempt to improve risk management.

The most important task in building credit scoring models is collecting customer information, and the accuracy and honesty of this process should be ensured. In general, the information is collected from customers' loan applications and other related sources. Personal information such as age, gender, marital status, education, income, type of current job, experience, homeownership, number of dependents and state of birth is used as input to build a credit scoring model.

2.4.1 Development in credit card and instant loan markets:

Agarwal et al. (2009) evaluated the effect of customer characteristics on the possibility of default. The dataset of the study included 170,000 observations. They observed the payment behavior of these customers and pointed out that factors such as monthly spending, amount of debt, income, asset accumulation, economic conditions, legal environment and demographic structure affect the repayment capacity of customers. The results also suggested that customers who had left their place of birth were more likely to default than others, while customers who were married and owned a house had a very low probability of default. Another interesting result is that customers under 30 and over 60 were better able to pay back their loans than the rest. Finally, customers with high income or assets were consistently responsible in repayment.

In 1999, Dunn and Kim conducted research to identify the factors that determine the probability of default in the credit card sector. They telephoned 500 customers living in the state of Ohio. The interview results showed that the ratio of the minimum monthly payment to income has a statistically significant effect on the repayment capacity of customers. Further results indicated that age, marital status and number of dependents are also strongly linked to the possibility of default, while education level, income and ownership status do not have the significant effect initially assumed.

Some studies did not focus on classification but concentrated on profit maximization. According to Boyes et al. (2002), the main purpose of credit analysis is to give a more accurate estimate of the probability of default; depending on the level of default probability, lenders can offer loans at different interest rates commensurate with customer risk. Their results showed that age, education, asset ownership, number of dependents and the ratio of spending to income are the most influential factors. Autio et al. (2009) conducted an internet survey of 1951 young adults between the ages of 18 and 29. They collected personal information such as age, gender, financial status, income, employment status and family structure. In addition, the status of respondents' credit, such as mortgages, student loans and small loans, was gathered. In particular, the study also measured respondents' attitudes towards borrowing money. The results showed that the group aged 18 to 23 tended to apply for instant loans more frequently than the others, while the group with high income and stable employment preferred credit card loans. The research also suggested that gender did not significantly affect a customer's borrowing decisions, while employment status, income and family structure were the most influential factors.

2.4.2 Development in mortgage markets:

In the mortgage market, lenders usually require collateral from customers to guarantee their loans. The probability of default in this market is strongly affected by changes in exchange rates or interest rates, because these loans are always long term (Zorn and Lea, 1989).

In 2006, Vasanthi and Raja investigated the relationship between income and customer characteristics. They concluded that the age of the head of the household is very important: for example, a younger household head has a higher probability of default because of high financial stress and less experience in money management. Customers who have high income and borrow only a small loan have the lowest probability of default, because they can control and manage their finances sensibly. Some traditional characteristics such as education level and marital status also significantly affected the payment behavior of customers. Those with a high education level find it easier to get a good and stable job and so have strong financial capacity. The results also suggested that young customers and divorced customers were more likely to default than others, because they lack experience in financial management and are psychologically unstable.


2.4.3 Development in consumer credit market:

Kocenda and Vojtek (2009) chose 3403 observations and collected 21 customer characteristics (variables) for their research. Their main purpose was to identify which customer characteristics significantly affect payment behavior and which technique is better for building a credit scoring model. The final results showed that the logistic analysis technique and the CART analysis technique performed equally well. Both techniques indicated that characteristics such as education level, marital status, purpose of borrowing, asset accumulation and the transaction history between lender and customer are the most influential factors. The disadvantage of this study is that it used a small dataset to set up the credit scoring model. The authors also suggested that non-parametric methods could be considered as an alternative for building a good credit scoring model.

In 1997, Arminger et al. applied three techniques (LDA, classification tree analysis and a feed-forward network) to build credit scoring models, compare them and identify the best technique. They collected information on 8163 observations in Germany between 1991 and 1992. The basic information used as inputs included gender, experience, age, ownership and marital status. The findings indicated that the three techniques performed similarly: all three have high classification power in building a credit scoring model, but the LDA technique performed slightly better than the others. Customers who are experienced, have high asset accumulation, are female or are married are less likely to default.

Jacobson and Roszbach conducted a study of credit scoring models in 2003 in which they calculated the bias in the data selection process. The study employed a bivariate approach to build the credit scoring model, using both rejected and approved loans as inputs. The research covered 13338 observations in Sweden during 1994 and 1995. The data included financial and personal information on customers. They collected 57 input variables, but only 16 were finally used to build the model. The results indicated that income, age, annual income changes and loan amount had the strongest influence on a customer's default probability.


In 2004, Roszbach used the same dataset to investigate the time until customers default. A typical loan involves many payments, creating a cash flow until customers pay off their loans or default. The study explained two ways to calculate the net present value (NPV) of a customer's loan. When customers pay back their loan on time, NPV is calculated as usual; when customers default, NPV is calculated by estimating the cash flow generated during the period in which part of the loan was repaid, plus the cost of handling the non-performing loan. The author applied a tobit model to measure the time at which customers default. The results indicated that lenders did not act rationally when they assessed the trade-off between risk and return, and that the lending policy of these institutions did not encourage extending loans to earn more profit. Another result showed that lenders did not differentiate loans by value. Roszbach also provided evidence that lenders did not act consistently with their objective of profit maximization. Using the tobit model, lenders could estimate how long a customer is likely to survive before defaulting, choose the customers who survive longer and set lending policy more effectively.

Dinh and Kleimeier (2007) used 56037 observations from one of the biggest banks in Viet Nam to build a credit scoring model. They applied a forward-stepwise method to select variables. The study faced many limitations because necessary information was lacking, a basic problem when conducting research in a developing country like Viet Nam. It indicated that the duration of the customer's relationship with the lender is the most influential factor, followed by gender and loan amount. The authors also suggested that credit scoring models should be updated frequently to maintain their performance as economic conditions change.

Updegrave (1987) showed that variables such as the number of variables in a model, payment history, working experience, time living at the current place, income, ownership, age and saving rate most strongly affect the payment behavior of customers. This result was supported by Steenackers and Goovaerts (1989), who conducted similar research in Belgium. The authors started with 19 variables, but only 11 were statistically significant in the end. The final model, built with logistic regression, indicated that factors such as age, time working and living at the current place, loan amount, phone call, working in the state or private sector, monthly income and asset accumulation strongly affected the payment behavior of customers.

In 2004, Ozdemir applied logistic regression to model the relationship between default risk and demographic and financial factors in the retail credit market. The observations were collected from a bank in Turkey. The results confirmed that demographic factors did not have a statistically significant effect on payment behavior, while financial factors did. In particular, interest rate and loan term strongly affected payment behavior: customers with a higher interest rate and a longer loan term are more likely to default than others. The authors explained that with long-term loans, customers have a higher probability of facing sudden changes such as economic, exchange rate or interest rate changes.

In 1997, Hand and Henley reviewed research on the methods used to build credit scoring models. They concluded that there is no optimal method for building a credit scoring model; each method has its own advantages and disadvantages, which depend heavily on the data structure and the purpose of the users.

2.5 Common variables in constructing credit scoring models:

The main task of credit scoring models is to classify customers into a good group or a bad group. With the rapid development of credit scoring applications in America, England and other developed countries, credit scoring models have become more important and are considered a crucial tool for managing risk and accelerating the lending process. By applying credit scoring models, lenders can more easily analyze customers, assess their payment history and identify their creditworthiness to make the final credit decision.

To build a credit scoring model, personal characteristics such as gender, age, education level, number of dependents, type of job and working experience are commonly used as input variables (Hand et al., 2005; Lee and Chen, 2005; Lee et al., 2002; Steenackers and Goovaerts, 1989). Other information, such as loan amount, asset accumulation, monthly income, saving rate and purpose of borrowing, should also be considered as input variables (Lee and Chen, 2005; Ong et al., 2005; Steenackers and Goovaerts, 1989) to enhance the performance of the credit scoring model.

All customer information can be entered into statistical models as inputs; the variables that are statistically significant are then used in the credit scoring model to classify customers. The rapid development of credit scoring applications has proved the usefulness of this kind of model. However, no research explicitly explains why these variables are used in credit scoring models to classify customers. Additionally, the variables selected depend heavily on the initial data structure provided to build the model. At the beginning, credit scoring models were built to classify customers into two groups, “good” and “bad”. With the rapid development of credit scoring and more complex requirements, models were also developed to classify customers into three groups, such as “good”, “bad” and “confuse”, so that lenders have more information about customers when making final decisions.

There is no explicit requirement on the number of variables in a credit scoring model, so the variables selected depend heavily on the data structure, the specific culture and the economic conditions in each region. A credit scoring model commonly contains approximately twenty variables. Some studies hold that increasing the number of variables enhances the performance of a credit scoring model, such as Salchenberger et al. (1992), Leshno and Spector (1996) and Dvir et al. (2006).

Credit scoring models have demonstrated their necessity and their important role in practical applications, especially in finance and banking. There is some criticism related to identifying the cut-off point of the model. Past research suggests that there is no optimal cut-off point; the choice depends heavily on the attitude of the lender. Lenders that want to increase lending growth and market share will set the cut-off point lower than usual; by contrast, lenders that want to control risk strictly will set it higher. Besides the number of variables and the data structure, researchers also pay attention to the sample size used to build a model. It is believed that the larger the sample, the higher the accuracy. However, the sample size used depends heavily on the availability of information. Some studies used small samples of only about 300 or 400 observations, such as Dutta et al. (1994) and Fletcher and Goss (1993), while others used large samples of thousands of observations, such as Bellotti and Crook (2009), Hsieh (2004) and Banasik et al. (2003). In particular, credit scoring models for the consumer credit market are commonly built on small samples of under 1100 observations (Sustersic et al., 2009; Lee and Chen, 2005; Kim and Sohn, 2004).

Finally, some studies faced a data bias problem because the authors used only customers whose loans had been approved as input data to build the credit scoring model. This limits how well the model represents the whole population and in turn affects its performance.

The table below shows the independent variables commonly used in some previous studies. The names of these variables may differ from those used in the original studies, but their meaning is the same across studies.

Variables; Jacobson and Roszbach (2003); Dinh et al. (2007); Agarwal et al. (2009); Kocenda and Vojtek (2009)

Migrating out of state of birth 

Time living in current place  

Table 01: Common variables in previous studies

Note :*** the most significant variables in previous studies


Most previous studies found that customer characteristics such as age, education, marital status and residential status have a significant effect on customers' ability to pay (Agarwal et al., 2009; Dinh and Kleimeier, 2007; Kocenda and Vojtek, 2009). Moreover, other characteristics such as income, length of relationship, maturity of the loan and savings also affect ability to pay (Vasanthi and Raja, 2006; Ozdemir and Boran, 2004; Jacobson and Roszbach, 2003).

2.6 Common techniques employed in credit scoring models:

Arminger et al. (1997) used three methods (LR, CT and NN) in credit modeling and compared their performance. Their independent variables included gender, time in present job, age, available/married and others. The dataset was collected from one of the largest retail banks in Germany. They used cross-validation to build the models and test their performance. The results implied that all three techniques have similar predictive power, but LR is slightly better than the others and CT performs worst.

Similarly, in 1996, Desai and his colleagues used neural networks, logit regression and linear discriminant analysis to test their performance in building CSM. Their data were collected from 53 different credit institutions in the US from 1988 to 1991. The results were mixed: NN outperformed the others in predicting bad loans, but LR and NN performed equally in classifying good and bad loans. Overall, LR was better than LDA. In another study, Lee et al. (2002) compared four techniques (LDA, LR, NN and a neural discriminant method) and found that the four models had the same predictive power in distinguishing good from bad customers.

The parametric and nonparametric techniques LR and CT were used by Kocenda and Vojtek (2009) to estimate the determinants of default. They claimed that both sets of results are reliable and suggested that the CT method could be used to create better models. However, Luo (2008) and Yang (2009) suggested that LR is the better-performing method because of its power in identifying which customer characteristics affect the default rate.


Hand and Henley (1997) and other previous studies revealed that the best method depends on the specific data and input variables used. Each classification method has its own advantages and disadvantages: LR and the nearest neighbour method are easy to apply and their results are easy to understand, while neural networks have high predictive power but it is difficult to explain exactly how their results are produced. Paliwal (2009) pointed out that over the last decade neural networks have become more popular and are applied broadly in lending institutions as an alternative to traditional statistical models for constructing CSM. Other studies found that a hybrid model combining feed-forward neural networks with traditional statistical methods such as DA and LR enhances model performance (Cheng et al., 1994; Paliwal et al., 2009).

Studies LDA CT LR NN Hybrid Model

Table 02: Common methods in previous studies

Note :*** the better method in their study

However, logistic regression is the most popular method and is proposed by many papers because of its high performance in distinguishing good from bad customers (Cheng et al., 2003; Laitinen, 2000). Some other studies also comment on this method's assumptions: it does not require a linear relationship between the independent variables and the dependent variable, and the dependent variable need not be normally distributed. Chen and Huang (2003) showed that non-linearity is weak in most credit scoring datasets, so logistic regression gives reliable estimates. Previous research has suggested that estimates from logit or probit regression are more accurate than those from DA (Wilson et al., 2000).

Recently, some studies have proposed a new approach to building CSM. They combine different techniques because each model has its own advantages in a specific segment or criterion (Koh et al., 2006), so combining techniques takes advantage of the strengths of the individual models to create a better CSM. The studies of Lee et al. (2000) and Zhu et al. (2001) support this view: their results showed that the hybrid model performed significantly better. Similarly, Lee and Chen (2005) compared individual models such as DA and LR with a hybrid model combining a neural network and multivariate adaptive regression splines and reached the same result.

Based on the above justifications, this study applies logistic regression, a neural network and the hybrid model of these techniques to build CSM and compares their performance.


CHAPTER 3: METHODOLOGY

This chapter introduces the independent variables used in the research, including the meaning of each variable and the assumptions about it. It also describes, step by step, the process followed in this study to build the credit scoring model. The theory of the methods used, the criteria for model selection, the way variables are selected and the criteria for comparing models are also covered in this chapter.

3.1 Data:

This study uses a dataset collected at MBBank, a large commercial bank in Vietnam. The dataset includes personal information on 690 customers who borrowed in 2012, with the status of their loans updated at the end of 2013. The information collected includes gender, age, education, marital status and other characteristics; these are used as the independent variables to distinguish between good and bad customers.

At the end of 2013, MBBank had six branches in Ho Chi Minh City and nearly 10,000 personal customers. From each branch, 115 personal customers were selected randomly. This study tried to obtain a 30/70 ratio of bad to good customers in each branch. Customers' personal information was taken from their loan applications, which are kept in the data center.

Variable: Definition

Age: Category, age of customer. Divided into three groups: Under 30; From 30 – 40; Over 40. Two dummy variables.

Gender: Dummy. Male coded as 1, Female coded as 0.

Education: Dummy, education with two groups: University and higher coded as 1; College and lower coded as 0.

Marital Status: Category, marital status with three groups: Single, Married, Divorce. Two dummy variables.

Current Living Status: Category, describes the place where the customer is living. Divided into three groups: Owner (homeownership), Live in parent’s house, Renting. Two dummy variables.

Living time at current place: Category, describes the duration (number of years) that the customer has spent at the current living place. Divided into three groups: From 1 to 4 years; From 4 to 7 years; From 7 years and longer. Two dummy variables.

Type of job: Category, describes the type of employment, divided into three groups: Manager, Officer, Private business. Two dummy variables.

Working time in current: Two dummy variables.

Working time in current: Two dummy variables.

Historical payment: Dummy, historical payment of customer in previous loans. Good historical payment coded as 0, Bad historical payment coded as 1.

Table 03: Variables and their definitions
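The "two dummy variables" entries above can be illustrated with a minimal sketch. It assumes that the first group of each three-group category serves as the reference group, which the text does not state; the function names, the level labels and the treatment of age 40 as part of the middle group are likewise hypothetical.

```python
# Illustrative sketch only: names, labels and the reference group are assumptions.

AGE_LEVELS = ["under_30", "30_to_40", "over_40"]  # assumed labels for the three groups


def age_group(age):
    """Map an age in years to one of the three groups (age 40 assumed to fall in the middle group)."""
    if age < 30:
        return "under_30"
    if age <= 40:
        return "30_to_40"
    return "over_40"


def to_dummies(value, levels):
    """Encode one category as len(levels) - 1 dummies; the first level is the reference."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    # The reference level maps to all zeros, every other level sets exactly one dummy.
    return [1 if value == level else 0 for level in levels[1:]]


young = to_dummies(age_group(25), AGE_LEVELS)
middle = to_dummies(age_group(35), AGE_LEVELS)
older = to_dummies(age_group(52), AGE_LEVELS)
print(young, middle, older)
```

Using one dummy fewer than the number of groups avoids perfect collinearity with the constant term in a logistic regression.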


3.1.2 Hypotheses:

In this part we formulate and explain our hypotheses. The variables used to test whether each hypothesis should be rejected or accepted are also presented.

Age: Age is relevant in determining the probability of default

This study expects that older borrowers have a lower default probability because they are likely to be more risk averse (Dunn and Kim, 1999; Arminger et al., 1997; Agarwal et al., 2009). According to Autio et al. (2009), whose study was conducted in Finland, young customers often borrow money to pay overdue bills, have a weaker financial position and lack money management ability.

Gender: Men are more likely to default than women

According to Dinh and Kleimeier (2007), women are less likely than men to become bad customers because they are more conscientious about paying the loan back and control their spending better.

Education: Higher education is negatively correlated with the probability of default

We expect that better educated people have higher and more stable incomes and thus a lower probability of default. This expectation is supported by Steenackers and Goovaerts (1989).

Marital status: Different marital status can affect the default probability

This is a common variable in credit scoring models. Single people are expected to be riskier because they are less reliable or mature than married people. Agarwal et al. (2009) show that married borrowers are 24 percent less likely to default than single borrowers.

Current Living Status: Different living status has a different impact on the probability of default

Living status describes whether customers live in their own house, in their parents' house (paying no monthly fee) or in an apartment (paying a monthly fee). Agarwal et al. (2009) indicated that people who own their house are less likely to become bad customers.


Living time in the current place: Time at the present address has an impact on the probability of default

People who move frequently are riskier because of their instability (Agarwal et al., 2009; Steenackers, 1989).

Type of job: Different types of job can affect the default probability

This study uses different types of job, such as officer, worker or private business. This variable measures job stability. In Vietnam, the type of job can be a good proxy for income level and stability. For example, officer positions are known to pay higher salaries than worker positions and to be more stable than private business, and thus have a lower probability of default. This study assumes that officers are less risky than the others and that, by contrast, workers are the riskiest.

Working time in current job: Time in the current job has an impact on the probability of default

Like the “Type of job” variable, this variable reflects the stability of the borrower's position. People who have worked in their present job for a long time are expected to be less risky than others because their probability of changing jobs or being fired is lower.

Working time in current field: More accumulated experience means a lower likelihood of default

This variable focuses on the customer's working experience. People who have worked in their current field for a long time are assumed to be less risky than others because they have more experience and a permanent position, and thus a better chance of promotion or a higher salary.

Number of dependent people: An increase in the number of dependent people will increase the probability of default

This variable describes the number of dependent people that a customer has to support financially each month, for example with education fees or health care. A customer with more dependants is under greater pressure from monthly expenses and is thus more likely to lose control of their ability to pay.


Historical payment: Payment history is relevant when estimating the probability of default

Past repaid debt should be negatively correlated with the probability of default. Late payments could be an indication of both negligence and low creditworthiness, but severely late payments, for example payments that are substantially overdue, should be strongly correlated with the probability of default. Payments on time, on the other hand, ought to indicate well-run personal finances and should have a decreasing effect on the probability of default.

3.2 Methodology:

This study employs logistic regression, a neural network and the hybrid model of the two to create CSM, following the steps illustrated in the figure below:

Figure 03: Steps to construct Credit scoring model

Source: Koh et al. (2006), A Two-step Method to Construct Credit Scoring Models With Data Mining Techniques, International Journal of Business and Information


The default probability of a customer will be calculated as follow:

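$$\text{Odds}_i = \left(\frac{\text{bad}_i}{\text{Bad}}\right)\left(\frac{\text{Good}}{\text{good}_i}\right)$$

$\text{bad}_i$: total number of defaulted observations of $x_i$

$\text{Bad}$: total number of defaulted observations

$\text{good}_i$: total number of non-defaulted observations of $x_i$

$\text{Good}$: total number of non-defaulted observations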


A category of a variable has no predictive power, that is, it has the same probability in the two groups (defaulted and non-defaulted), if its odds ratio equals 1. Such a category cannot explain the difference between good and bad customers and thus cannot be used as a predictor in CSM.
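As a rough illustration of the odds ratio, the sketch below computes it per category as (bad_i / Bad) × (Good / good_i). The counts are hypothetical and are not taken from the MBBank dataset; the function name is also an assumption.

```python
def odds_ratio(bad_i, good_i, bad_total, good_total):
    """Odds_i = (bad_i / Bad) * (Good / good_i) for one category of a variable."""
    if good_i == 0 or bad_total == 0:
        raise ValueError("good_i and bad_total must be non-zero")
    return (bad_i / bad_total) * (good_total / good_i)


# Hypothetical (bad, good) counts for three categories of one variable.
counts = {"A": (30, 70), "B": (20, 80), "C": (10, 90)}
bad_total = sum(bad for bad, _ in counts.values())      # 60 defaulted observations
good_total = sum(good for _, good in counts.values())   # 240 non-defaulted observations

odds = {name: odds_ratio(bad, good, bad_total, good_total)
        for name, (bad, good) in counts.items()}
for name, value in odds.items():
    print(name, round(value, 3))
```

In this made-up example category B has the same share among bad and good customers, so its odds ratio is 1 and it carries no information, while A is over-represented among bad customers and C among good ones.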

3.3.3 Information value:

Besides the odds ratio, this study also applies the information value to evaluate how well a variable separates good from bad customers. The information value of a category of a variable is defined as:

$$IV_i = (\text{good}_i - \text{bad}_i)\,\log\left(\frac{\text{good}_i}{\text{bad}_i}\right)$$

Variables with a higher information value are considered to have more predictive power in classifying good and bad customers. According to Kocenda and Vojtek (2009), variables with an information value above 0.2 are good input variables for the model. Besides the p-value of a variable, the information value is also a good signal for choosing which variables to put into the model.
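A minimal sketch of the information value follows. It assumes that the good and bad terms in the formula are read as shares, that is, the category's fraction of all good customers and of all bad customers, which is the conventional reading but is not spelled out in the text. It also assumes the variable's information value is the sum over its categories. The counts are hypothetical.

```python
import math


def information_value(counts):
    """Return (per-category IV, total IV) from {category: (bad, good)} counts."""
    bad_total = sum(bad for bad, _ in counts.values())
    good_total = sum(good for _, good in counts.values())
    per_category = {}
    for name, (bad, good) in counts.items():
        bad_share = bad / bad_total      # share of all bad customers in this category
        good_share = good / good_total   # share of all good customers in this category
        # Requires at least one good and one bad customer per category,
        # otherwise the logarithm is undefined.
        per_category[name] = (good_share - bad_share) * math.log(good_share / bad_share)
    return per_category, sum(per_category.values())


# Hypothetical (bad, good) counts, not taken from the thesis data.
counts = {"A": (30, 70), "B": (20, 80), "C": (10, 90)}
per_category, total = information_value(counts)
print(per_category)
print(round(total, 4))
```

Each term is non-negative because the difference and the logarithm always share the same sign, so a category only adds to the total when its good and bad shares differ.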

3.3.4 Quality of the model:

3.3.4.1 Log-likelihood ratio (LR) test:

The LR test is used as an alternative to the standard F-test because logistic regression does not assume that the dependent variable is normally distributed, as OLS regression usually does. A hypothesis is assessed by comparing the log-likelihood of the initial model with that of the restricted model. The LR test is based on the ratio of the maximized values of the likelihood function of the two models being compared; equivalently, it calculates the difference between the residual deviances of the constrained and unconstrained models. The formula for the LR test statistic is:

$$LR = -2\ln\left(\frac{L(m_1)}{L(m_2)}\right) = 2\left[\ln L(m_2) - \ln L(m_1)\right]$$

L(m1): likelihood of the null model

L(m2): likelihood of the alternative model


For example, if the null hypothesis is H0: βx = βy = 0 and we reject H0, it means that the two coefficients are not both equal to zero.
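The two-coefficient example can be sketched numerically. The log-likelihood values below are hypothetical, and the function names are assumptions. With two restrictions the statistic is compared with a chi-square distribution with 2 degrees of freedom, whose upper tail has the closed form exp(-x/2), so no statistics library is needed for this special case.

```python
import math


def lr_statistic(loglik_null, loglik_alt):
    """LR = -2 * (ln L(m1) - ln L(m2)), computed from log-likelihoods."""
    return -2.0 * (loglik_null - loglik_alt)


def p_value_two_restrictions(lr):
    """Upper-tail probability of a chi-square with 2 degrees of freedom: exp(-x / 2)."""
    return math.exp(-lr / 2.0)


# Hypothetical maximized log-likelihoods of the restricted (null) and full model.
loglik_null = -449.5
loglik_alt = -440.0

lr = lr_statistic(loglik_null, loglik_alt)
p = p_value_two_restrictions(lr)
reject_h0 = p < 0.05  # 5% level chosen for illustration
print(lr, p, reject_h0)
```

For a different number of restrictions the closed form above no longer applies and a general chi-square tail function would be needed.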

3.3.4.2 Pearson Chi-Square test:

A test commonly used in logistic regression to examine the goodness-of-fit of a model is the Pearson chi-square test. This test can examine both the independence of two categorical variables and goodness-of-fit. Chi-square is a statistical test that measures the difference between the observed data and the values we expect to obtain under the research hypothesis:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

O = an observed frequency

E = an expected (theoretical) frequency
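The statistic can be sketched for a small contingency table of category against good/bad status. The sketch assumes the expected frequencies are those implied by independence (row total × column total / N), and the observed counts are hypothetical.

```python
def pearson_chi_square(observed):
    """Sum of (O - E)^2 / E over all cells, with E taken from the table margins."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count under independence
            chi2 += (o - e) ** 2 / e
    return chi2


# Hypothetical counts: rows are two categories, columns are (good, bad).
observed = [[70, 30], [90, 10]]
chi2 = pearson_chi_square(observed)
print(chi2)
```

A 2 × 2 table has one degree of freedom, for which the 5% critical value is about 3.84, so a statistic well above that suggests the category and the good/bad status are not independent.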

3.3.4.3 Akaike Information Criterion (AIC):

To choose the best of a set of proposed models, we commonly use the Akaike Information Criterion (AIC). AIC lets us choose the best model based on the Kullback-Leibler distance between the model and the truth (lower is better). The initial model is set up in its simplest form, with only a constant. We then add more variables to the model and test each with the information criterion. AIC is defined as

$$AIC = -2\ln(L) + 2K$$

L: maximized value of the likelihood function of the estimated model

K: number of free parameters in the statistical model
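A short sketch of model comparison by AIC, using the standard form AIC = 2K − 2 ln(L). The candidate models, their log-likelihoods and their parameter counts are hypothetical.

```python
def aic(loglik, n_params):
    """AIC = 2K - 2 ln(L), where loglik is the maximized log-likelihood."""
    return 2 * n_params - 2 * loglik


# Hypothetical candidates: (maximized log-likelihood, number of free parameters).
candidates = {
    "constant_only": (-449.5, 1),
    "model_a": (-440.0, 3),
    "model_b": (-439.5, 5),
}

scores = {name: aic(loglik, k) for name, (loglik, k) in candidates.items()}
best = min(scores, key=scores.get)  # lower AIC is preferred
print(scores, best)
```

Here the larger model fits slightly better but not by enough to offset the penalty for its two extra parameters, so the smaller model is preferred.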


3.4 Neural Network:

3.4.1 Theory:

According to Haykin (1999), an artificial neural network is an information processing paradigm that processes information in a way similar to biological nervous systems. Neural networks operate in the same way as the human brain: learning by example, accumulating experience and forming the ability to make decisions. The figure below illustrates a processing element (PE) in an artificial neuron. Variables are put into the network as inputs. Neuron j sums the “information” from many inputs (x_i), each with its own weight (w_ij). A selected transformation function is then applied to transform the output value into the desired value type. The outcome at neuron j may or may not be used as the input of other neurons, depending on the number of hidden layers in the network. If the neural network has several hidden layers, this summation is repeated until the last layer of the network.

Figure 04: Processing information in an Artificial Neuron

3.4.2 Components of artificial neural network:

A basic neural network usually comprises three components: input layers, hidden layers and output layers. At the input layer, variables are put into the network with random weights. This information is then summed and processed at the hidden layer. Next, the weighted sum from a hidden layer serves as the input to the next hidden layer or to the output (if the network has one hidden layer). We calculate the output for a training example and compare it with the desired output; the difference is known as the error. These steps are repeated many times, updating the weights at each layer, until the error of the network is minimal or stable. A basic neural network is illustrated in the figure below:


Figure 05: Neural network with one hidden layer

At each processing element, we use a summation function to calculate the weighted sum of all inputs, that is, the variables put into the network. The summation formula at one processing element is illustrated in the figure below:

$$Y = \sum_{i=1}^{n} X_i W_i$$

One processing element with n inputs; several processing elements

Figure 06: Example of Summation function

At the end of each processing element of the network, a transformation function is applied to convert the neuron's summation into the desired value type. In this study the dependent variable is binary, so we apply the sigmoid transfer function, described as follows:

$$Y_T = \frac{1}{1 + e^{-Y}}$$

Y is the summation value

Y_T is the transformed value
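The summation and the sigmoid transformation at a single processing element can be sketched together. The function names, input values and weights below are illustrative assumptions.

```python
import math


def summation(inputs, weights):
    """Y = sum of X_i * W_i over the inputs of one processing element."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return sum(x * w for x, w in zip(inputs, weights))


def sigmoid(y):
    """Y_T = 1 / (1 + e^(-Y)), which maps any real Y into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))


# Hypothetical dummy-coded inputs and weights for one neuron.
inputs = [1, 0, 1]
weights = [0.4, -0.2, -0.4]

y = summation(inputs, weights)
y_t = sigmoid(y)
print(y, y_t)
```

Because the transformed value lies between 0 and 1, it can be read as an estimated probability for the binary dependent variable.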

Figure 07: Example of Sigmoid function of ANN

3.4.3 Back Propagation Algorithm:

The back propagation (BP) algorithm is the basic NN algorithm. According to Rojas (2005), the BP procedure can be illustrated in four main steps. At the beginning, random weights are assigned to every variable put into the network. We then use the back propagation algorithm to estimate the most relevant outcomes.

i) Feed-forward computation: we put variables with random weights into the network, then calculate the values of the hidden nodes and output nodes

ii) Back propagation to the output layer: the error between the output node and the desired output is calculated, and we then use backward propagation and weight adjustment to estimate the error at the hidden layer. A learning rate and momentum are employed to help compute this error

iii) Back propagation to the hidden layer: as in the second step, we continue to estimate the errors at the input nodes to get new weights for the input nodes

iv) Weight updates: after updating all the weights for the input layer and hidden layer, we estimate the output and the new error again. This process ends when the error of the network is minimal or stable
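The four steps can be sketched for a network with one hidden layer and one sigmoid output. This is a simplified illustration rather than the study's implementation: it adds a bias weight to each node, uses a squared-error criterion, omits momentum and stops after a fixed number of epochs instead of testing for a stable error. The toy data, parameter values and names are all assumptions.

```python
import math
import random


def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))


def train(samples, n_hidden=2, rate=0.5, epochs=2000, seed=0):
    """Train a one-hidden-layer network by online back propagation.

    Returns the hidden weights, the output weights and the mean squared
    error recorded at each epoch.
    """
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # Random initial weights; the last entry of each row is a bias weight.
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_o = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    history = []
    for _ in range(epochs):
        sq_err = 0.0
        for x, target in samples:
            xb = list(x) + [1.0]
            # (i) Feed-forward computation of hidden and output nodes.
            h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_h]
            hb = h + [1.0]
            out = sigmoid(sum(w * v for w, v in zip(w_o, hb)))
            # (ii) Error term at the output node.
            d_out = (target - out) * out * (1.0 - out)
            # (iii) Error terms propagated back to the hidden nodes
            # (computed with the output weights before they are updated).
            d_h = [d_out * w_o[j] * h[j] * (1.0 - h[j]) for j in range(n_hidden)]
            # (iv) Weight updates scaled by the learning rate.
            w_o = [w + rate * d_out * v for w, v in zip(w_o, hb)]
            w_h = [[w + rate * d_h[j] * v for w, v in zip(w_h[j], xb)]
                   for j in range(n_hidden)]
            sq_err += (target - out) ** 2
        history.append(sq_err / len(samples))
    return w_h, w_o, history


# Toy data (logical OR), used only to show that the error falls during training.
samples = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
w_h, w_o, history = train(samples)
print(history[0], history[-1])
```

In practice the loop would stop once the error changes by less than a chosen tolerance, which corresponds to the "minimal or stable" criterion in step iv.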

Figure 08: Back propagation algorithm of single neuron

3.5 The hybrid model:

Constructing the hybrid model comprises two steps:

+ First, identify influencing variables by using logistic regression

+ Next, those significant variables will be considered as the input variables of Back Propagation Neural Network (BPN)

The advantage of this approach is to prevent the problem of over-fitting of neural network model

3.6 Comparison of models:

To compare the performance of the different models, this study uses classification tables with three indicators: (1) Overall accuracy rate; (2) Type I accuracy rate (Good rejected): predicted “Bad”, actual “Good”; (3) Type II accuracy rate (Bad accepted): predicted “Good”, actual “Bad”
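The three indicators can be read off a classification table as in the sketch below. It assumes the coding 0 = good and 1 = bad, and it reports the "Good rejected" and "Bad accepted" cases as shares of the actual good and actual bad customers respectively; those denominators, the function name and the sample labels are assumptions.

```python
def classification_rates(actual, predicted):
    """Overall accuracy plus the 'good rejected' and 'bad accepted' rates (0 = good, 1 = bad)."""
    pairs = list(zip(actual, predicted))
    n_good = sum(1 for a, _ in pairs if a == 0)
    n_bad = sum(1 for a, _ in pairs if a == 1)
    correct = sum(1 for a, p in pairs if a == p)
    good_rejected = sum(1 for a, p in pairs if a == 0 and p == 1)
    bad_accepted = sum(1 for a, p in pairs if a == 1 and p == 0)
    return {
        "overall_accuracy": correct / len(pairs),
        "good_rejected_rate": good_rejected / n_good,  # predicted bad, actual good
        "bad_accepted_rate": bad_accepted / n_bad,     # predicted good, actual bad
    }


# Hypothetical labels for ten customers.
actual = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
rates = classification_rates(actual, predicted)
print(rates)
```

One minus each of the two misclassification rates gives the share of good and of bad customers that the model classifies correctly.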


CHAPTER 4: EMPIRICAL RESULTS

The first part of this chapter presents an overview of the dataset collected from MBBank. The figures in this part illustrate initial aspects of the relationship between the independent variables and the dependent variable. The next part outlines the steps taken to construct and select the final credit scoring model.

4.1 Data:

This study uses a dataset collected at MBBank, one of the largest joint-stock commercial banks in Vietnam. The data include specific characteristics such as gender, education, marital status, ownership and number of dependent people. The historical payment of these customers is also recorded in the dataset.

The dataset includes personal information on 690 customers who borrowed in 2012, with the status of their loans updated at the end of 2013. A customer can have several separate loans at the same time at a bank; this study focuses on each customer's earliest loan in 2012.

4.1.1 Dependent Variable:

The dependent variable is a dummy variable (taking only two values, 0 and 1). A value of 0 means a good customer (repays the loan on time), while a value of 1 means a bad customer (does not repay the loan on time). This study uses the Basel II definition to classify customers as good or bad: a borrower is considered a bad customer when payment is delayed by more than 90 days. Of the 690 observations in this study, 246 are bad customers, accounting for 35.65% of the total sample.


Figure 09: Ratio of good/bad customer of dataset

4.1.2 Independent Variables:

Age describes the age of a customer. In this study, the observations are divided into three age groups: group 1 includes customers under 30, group 2 customers aged 30 to 40, and group 3 customers aged 40 and older. To distinguish the groups, we use two dummy variables. Many previous studies indicated that older people tend to be risk averse; they plan reasonable expenses and spend in line with their ability to pay, and are therefore less likely to become bad customers. Like previous studies, this study expects that younger customers are more likely to become bad customers than older ones.

The chart below shows the ratio of good to bad customers within each age group. The ratio differs clearly across age groups: the oldest group has the smallest share of bad customers (about 22%), while the youngest group has the largest (56%).

According to Autio et al. (2009), bad customers are people who have a weak financial position and have difficulty controlling their money. These people are usually aged 18 to 29 because they often borrow to repay another maturing debt or to cover rent.

Ratio of good/bad customer of dataset: Good 64%, Bad 36%

Date posted: 10/12/2018, 23:49
