DOCUMENT INFORMATION

Basic information

Title: Customer Segmentation in Banking for Personal Consumption Loans: A Study on Individual Borrowers in a Digital Banking
Author: Lê Phan Anh Thư
Supervisor: Dr. Phạm Thị Việt Hương
Institution: Vietnam National University, Hanoi International School
Major: Business Data Analytics
Document type: Graduation Project
Year of publication: 2024
City: Hanoi
Pages: 54
File size: 2.9 MB


Structure

  • CHAPTER I. THEORETICAL BASIS
    • 1.1. Overview of Machine Learning
      • 1.1.1. History of Formation and Development
      • 1.1.2. Breakthroughs and Applications
      • 1.1.3. Development Trends
    • 1.2. Why Is Customer Clustering Important?
      • 1.2.1. Breakthroughs in Machine Learning for Customer Clustering
      • 1.2.2. Applications of Customer Clustering
  • CHAPTER II. EXPERIMENT
    • 2.1. Data Collection
    • 2.2. Data Preprocessing
    • 2.3. Data Visualization
      • 2.3.1. Univariate Visualization
      • 2.3.2. Multivariate Relationships
    • 2.4. Data Preprocessing
  • CHAPTER III. CONCLUSION
    • 3.1. Results

Content

Customer Segmentation in Banking for Personal Consumption Loans: A Study on Individual Borrowers in a Digital Banking

THEORETICAL BASIS

Overview of Machine Learning

1.1.1 History of Formation and Development

The origins of machine learning can be traced back to the 1940s, when researchers began exploring basic pattern recognition problems and studying neural networks. The early history of machine learning (ML) is marked by groundbreaking ideas and relentless efforts to create computers that could mimic human thinking processes. In 1943, Walter Pitts and Warren McCulloch devised the first mathematical model of an artificial neural network, laying the foundation for modern neural networks and the development of distributed machine learning tools (McCulloch & Pitts, 1943).

Pioneers of the early ML era include Donald Hebb, Alan Turing, and Arthur Samuel. While they were not the only initiators, their research and contributions significantly advanced the field of machine learning. Hebb's work on neural communication, Turing's artificial intelligence test, and Samuel's coining of the term "machine learning" all contributed to the burgeoning field of artificial intelligence (AI) and laid the groundwork for the myriad machine learning algorithms we know today (Hebb, 1949; Turing, 1950; Samuel, 1959).

Some notable milestones in the formation and development of machine learning include the following events:

- In 1943, Walter Pitts and Warren McCulloch developed the first machine learning model to address the challenge posed by John von Neumann: how can computers communicate with each other?

- In 1949, Donald Hebb introduced the concept of communication between neurons in the nervous system.

- In 1950, Alan Turing introduced the Turing Test, marking a significant milestone in the field of AI.

- In 1951, Marvin Lee Minsky invented the SNARC (Stochastic Neural Analog Reinforcement Calculator), an early neural network computer.

- In 1967, Cover and his colleagues created the k-Nearest Neighbors (kNN) algorithm.

- In 1980, the neocognitron, a multilayered artificial neural network, was introduced, serving as a precursor to convolutional neural networks (CNNs) (Fukushima, 1980).

- In 1997, IBM's Deep Blue shocked the world by defeating the reigning world chess champion, Garry Kasparov.

- In 2006, Geoffrey Hinton coined the term "Deep Learning" to describe new algorithms that allowed computers to "see" and differentiate objects and text in images and videos.

- In 2017, Google published its first research on the deep learning architecture called the Transformer (Vaswani et al., 2017).

From the early days of simple pattern recognition to today's complex learning models, the history of machine learning has been a fascinating journey. It is the story of humanity's effort to create computers that can learn, adapt, and make intelligent decisions, much like our own cognitive processes. This journey has reshaped industries, redefined human-computer interaction, and unlocked a world of untapped potential.

In recent years, machine learning has undergone a series of breakthroughs and innovations, sparking a revolution in science and technology. Some notable advancements include:

• The deep learning revolution in 2012.

• The development of reinforcement learning algorithms, exemplified by AlphaGo from DeepMind.

• Advances in natural language processing, including OpenAI's ChatGPT.

These advancements have significantly enhanced the capabilities and applications of artificial intelligence. They also demonstrate the potential of machines to understand and generate human-like language, paving the way for the development of more advanced systems.

As machine learning continues to evolve and adapt, we can anticipate ongoing advancements in areas such as quantum computing, unsupervised learning, and the establishment of cognitive services. These future trends are sure to shape the way we live and work, as machine learning continues to redefine the boundaries of what is possible in the field of artificial intelligence.

From healthcare to finance, transportation to education, the potential applications of machine learning are vast and far-reaching, promising to transform industries and redefine human-computer interaction in ways we can only imagine.

As machine learning continues to be researched and developed, it is expected to play an increasingly important role in shaping the future of technology and society.

Why is Customer Clustering Important?

Customer clustering, or customer segmentation, is a critical strategy for businesses, particularly in the banking and financial sectors. Here are several reasons why customer clustering is important:

By clustering customers based on their behaviors, needs, and preferences, banks can create highly targeted marketing campaigns. This personalization increases the relevance of marketing messages, leading to higher engagement and conversion rates.

Understanding the specific needs and preferences of different customer segments allows banks to tailor their products and services accordingly. This personalization improves customer satisfaction and loyalty, as customers feel valued and understood.

Customer clustering helps banks allocate their marketing and service resources more efficiently. By focusing efforts on high-value segments, banks can maximize their return on investment and avoid wasting resources on less profitable or unresponsive customer groups.

Identifying at-risk customer segments through clustering enables banks to implement targeted retention strategies. By addressing the specific concerns and needs of these segments, banks can reduce churn rates and maintain a stable customer base.

Insights from customer clustering can guide the development of new products and services. Banks can identify gaps in their offerings and tailor new products to meet the specific needs of different customer segments, enhancing their overall product portfolio.

Strategic Decision Making:

Customer clustering provides valuable insights into customer behavior and market trends. Banks can use this information to make data-driven strategic decisions, such as entering new markets, adjusting pricing strategies, or developing new marketing channels.

In a competitive market, understanding and effectively segmenting customers can provide a significant advantage. Banks that can offer personalized experiences and targeted solutions are more likely to attract and retain customers compared to those that rely on generic approaches.

Clustering can also help in identifying high-risk customer segments. By understanding the characteristics of these segments, banks can implement appropriate risk management strategies, such as adjusting credit policies or enhancing fraud detection measures.

Breakthroughs and Applications

1.2.1 Breakthroughs in Machine Learning for Customer Clustering

The deep learning revolution, particularly since 2012, has significantly enhanced the capabilities of machine learning models. Techniques such as neural networks, especially convolutional and recurrent neural networks, have enabled more accurate and complex customer clustering.

Advancements in Natural Language Processing:

Tools like GPT-3 and GPT-4 have improved the ability to analyze and interpret large volumes of unstructured data, such as customer reviews and social media interactions. This has opened new avenues for understanding customer sentiments and behaviors for clustering purposes.

Reinforcement learning algorithms, as exemplified by DeepMind's AlphaGo, have shown the potential to optimize customer segmentation strategies through continuous learning and adaptation based on customer interactions and feedback.

The introduction of transformer models, such as those used in Google's BERT and OpenAI's GPT series, has revolutionized the processing of sequential data. This is particularly useful in analyzing customer transaction histories and behaviors for more precise clustering.

Banks use customer clustering to design and implement more effective marketing campaigns. By targeting specific customer segments with tailored messages, banks can improve campaign effectiveness and increase response rates.

Clustering enables banks to offer personalized financial products and services, such as customized loan offers, investment advice, and savings plans, based on the unique needs of each customer segment.

By identifying customer segments that are at risk of churn, banks can proactively address their concerns with targeted retention strategies, such as personalized offers or improved customer support.

Cross-Selling and Up-Selling:

Customer clustering helps banks identify opportunities for cross-selling and up-selling. For instance, customers who are likely to need mortgage services can be targeted with home loan offers, while high-spending customers can be offered premium credit cards.

Risk Management and Credit Scoring:

Banks can use clustering to enhance their risk management strategies. By understanding the risk profiles of different customer segments, banks can tailor their credit scoring models and adjust their risk policies accordingly.

Clustering can also improve customer support by enabling banks to understand the specific needs and preferences of different segments. This allows for more personalized and effective support interactions, leading to higher customer satisfaction.

By leveraging the power of machine learning and data analytics, banks can transform their customer segmentation strategies, driving better business outcomes and enhancing customer experiences.

EXPERIMENT

Data collection

The dataset used in this project is a classic dataset in banking and marketing, sourced from a digital bank. Initially uploaded for research purposes, it provides detailed information about a marketing campaign conducted by a financial organization, including customer information, campaign details, and the ultimate outcome of that campaign.

Figure 2.1: The columns in the dataset

The dataset comprises 5,000 rows and 14 columns, described as follows:

Table 2.1: Column names and descriptions in the dataset

3. Customer-Period: Time as a customer (duration of the customer relationship)

4. Maximum-Spend: Maximum amount spent by the customer

7. Monthly-Average-Spend: Average monthly spending

9. Mortgage: Mortgage loan of the customer

10. Security-Asset: Collateral assets held with the bank

11. Fixed-Deposit-Account: Information about fixed deposit accounts

12. Internet-Banking: Customer uses online banking (Yes/No)

13. Credit-Card: Customer has a credit card (Yes/No)

14. Loan-On-Card: Customer has revolving credit on their credit card

The objective of data collection is to analyze and understand the factors influencing the effectiveness of marketing campaigns, and thereby to propose and optimize strategies for the bank's future campaigns. This process aims to enhance campaign effectiveness and boost the bank's success in attracting and retaining customers.

To obtain a standardized dataset and improve accuracy in subsequent steps, the collected data first needs to be cleaned through preprocessing. Let's begin by examining the overall information of the dataset, using the skimpy library in Python to view a general summary.
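As a sketch of this first inspection step, the same kind of overview (missing-value counts and dtypes) can be produced with plain pandas; skimpy's skim() prints a richer version of this in one call. The three-column frame below is a hypothetical miniature, not the real 5,000-row dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the bank dataset (real data: 5000 rows, 14 columns).
df = pd.DataFrame({
    "Age": [34, 45, 29, 51, 62],
    "Customer-Period": [5, 12, 3, 20, 33],
    "Loan-On-Card": [0.0, 1.0, np.nan, 0.0, 1.0],
})

# Missing-value counts, percentages, and dtypes per column.
missing = df.isna().sum()
overview = pd.DataFrame({
    "n_missing": missing,
    "pct_missing": (missing / len(df) * 100).round(1),
    "dtype": df.dtypes,
})
print(overview)
print(df.describe())  # mean, std, quartiles for the numeric columns
```

On the real dataset, this is where the 20 missing Loan-On-Card values (0.4% of records) show up.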

Figure 2.2: Overview of the dataset

Based on Figure 2.2 regarding the dataset overview, we can observe the following:

- Most columns do not contain any missing values (NA), except for the "Loan-On-Card" column, which has 20 missing values, accounting for 0.4% of all records. Therefore, there is no need to remove rows with NA values here.

- Statistical values such as the mean, standard deviation, and quartiles (Q1, Q2, Q3) are all valid.

- Columns with numeric data types, including integers and floats, are valid.

Therefore, with the current dataset, we can largely skip the preprocessing steps, as the data is valid, and proceed with the next steps.

The descriptive statistics of the Age column are as follows:

Figure 2.3: The distribution of values in the Age column and the box plot

According to the chart, the age of customers ranges predominantly from 25 to 70 years old. This age range typically represents individuals who are employed and capable of managing debt repayments punctually. The most concentrated age group falls between 35 and 55 years old, with 35 being the 25th percentile (Q1) and 55 the 75th percentile (Q3). Customers in this age bracket often have higher financial needs, such as loans, investments, and insurance, compared to other age groups. Therefore, banking products and services, such as mortgage loans and personal loans, can be optimized for this demographic. We also checked the Age column for outliers using the IQR method and found none.
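The IQR check used throughout this section can be sketched as follows. The ages below are hypothetical stand-ins, but the fences follow the standard Tukey rule: anything beyond 1.5 times the IQR outside Q1 or Q3 is flagged as an outlier.

```python
import pandas as pd

# Hypothetical ages; the real Age column spans roughly 25-70 with Q1=35, Q3=55.
age = pd.Series([25, 30, 35, 40, 45, 50, 55, 60, 65, 70])

q1, q3 = age.quantile(0.25), age.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # Tukey fences
outliers = age[(age < lower) | (age > upper)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}, fences=({lower}, {upper})")
print(f"{len(outliers)} outliers")
```

The same recipe applied to the other numeric columns yields the outlier counts reported below (96 for Maximum-Spend, 324 for Monthly-Average-Spend, 291 for Mortgage).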

2.3.1.2 Column Chart of Customer-Period

The distribution of values in the Customer-Period column and its box plot are shown below:

Figure 2.4: The distribution of values in the Customer-Period column and its box plot

In this chart, Customer-Period ranges from 0 to 40. However, some values appear below 0, which could be erroneous entries in the bank's dataset. The box plot shows Q1 at 10 and Q3 at 30, indicating that the majority of customers have been with the bank for a considerable period; the bank may therefore consider this factor when approving loans. We also applied the IQR method here, but no outlier values were detected.

2.3.1.3 Column Chart of Maximum-Spend

The distribution of values in the Maximum-Spend column is shown below.

Figure 2.5: The distribution of values in the Maximum-Spend column

The chart shows an uneven distribution in customer spending. In the same currency unit, the majority of customers have spending levels ranging from 5 to 150 units. The most concentrated range is from 30 to 100 units, indicating a group of customers with stable lifestyles or moderate incomes. However, the chart also reveals some higher spending values as outliers, which we identified as follows:

Figure 2.6: Outlier values in Maximum-Spend

Using the IQR method, we identified 96 customers with spending higher than the rest. This group likely represents affluent customers belonging to the upper class.

2.3.1.4 Column Chart of Monthly-Average-Spend

The statistics for the Monthly-Average-Spend column are as follows:

Figure 2.7: The distribution of values in the Monthly-Average-Spend column

The chart shows a significant difference in the distribution of the Monthly-Average-Spend column. While the majority of customers concentrate around the 0 to 3 range, a notable number of customers have exceptionally high spending levels. With Q1 around 1 and Q3 around 3, the box plot indicates a range of outliers beyond these values. These outliers include:

The result shows that 324 customers have higher-than-normal average spending values across the entire dataset. These customers may belong to the outlier group identified in the Maximum-Spend column, as presented in Figure 2.6.

Figure 2.9: The distribution of values in the Mortgage column

The distribution chart of the Mortgage column shows that the majority of customers do not currently have a mortgage with the bank. For those who do, the amounts are primarily concentrated between 75 and 300 currency units, with the highest concentration between 75 and 150. Additionally, a few customers have larger mortgages, reaching up to 500 and 600 units. Using the IQR method, we identified the following outlier values:

Figure 2.10: Outlier values in Mortgage

Thus, there are 291 customers with unusually high mortgage amounts compared to the majority of customers in the dataset. These customers may also have the high maximum and average monthly spending levels discussed earlier.

2.3.1.6 Column Chart of Hidden-Score

The Hidden-Score column ranges from 1 to 4, representing a hidden score assigned by the bank to each customer. The statistics for this column are as follows:

Figure 2.11: The count and distribution of values in the Hidden-Score column

The chart shows a relatively even distribution, with no significant disparity between values in the column: 1 accounts for 29.4%, 2 for 25.9%, 3 for 20.2%, and 4 for 24.4%.

2.3.1.7 Statistics of the Credit-Card column

The statistics for the number of customers with and without credit cards are as follows:

Figure 2.12: The distribution of values in the Credit-Card column

There are just over 3,500 customers without credit cards, accounting for 70.6% of the total. Meanwhile, around 1,500 customers, or 29.4%, hold credit cards.

2.3.1.8 Statistics of the Loan-On-Card column

The statistics for the number of customers with loans on their credit cards are as follows:

Figure 2.13: The distribution of values in the Loan-On-Card column

There is a significant disparity in the Loan-On-Card column: fewer than 500 customers have loans on their credit cards, accounting for only 9.6% of the total, while the remaining customers have no loans on their credit cards.

2.3.2.1 Relationship between Customer-Period and Loan-On-Card

Figure 2.14: The relationship between Customer-Period and Loan-On-Card

The chart shows that both Loan-On-Card groups have Q1 at 10 and Q3 at 30. This indicates a similar distribution of Customer-Period across both groups, suggesting that whether or not customers have loans on their credit cards does not significantly affect the duration of their relationship with the bank.

2.3.2.2 Relationship between Maximum-Spend and Loan-On-Card

The chart illustrates the relationship between Maximum-Spend and Loan-On-Card as follows:

Figure 2.15: The relationship between Maximum-Spend and Loan-On-Card

Unlike Customer-Period, Maximum-Spend shows different distributions between the two Loan-On-Card groups. For the group without loans (0), Q1 is at 40 and Q3 at 75, whereas for the group with loans (1), Q1 is at 125 and Q3 at 175.

Figure 2.16: Maximum-Spend within each Loan-On-Card group
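The numbers behind these side-by-side box plots are per-group quartiles, which can be computed with a groupby. The ten rows below are hypothetical samples chosen to echo the pattern described in the text, not the real data.

```python
import pandas as pd

# Hypothetical sample: Maximum-Spend split by Loan-On-Card status.
df = pd.DataFrame({
    "Maximum-Spend": [40, 55, 75, 60, 45, 125, 150, 175, 160, 140],
    "Loan-On-Card":  [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})

# Per-group Q1 / median / Q3 -- the statistics a grouped box plot draws.
stats = (df.groupby("Loan-On-Card")["Maximum-Spend"]
           .quantile([0.25, 0.5, 0.75])
           .unstack())
print(stats)
```

On this toy sample, the no-loan group sits far below the loan group at every quartile, mirroring the separation visible in Figure 2.15.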

Data Preprocessing

After visualizing the data, we have gained a better understanding of the distribution of values in the dataset and their impact. In this section, we focus on examining and preprocessing the data to standardize it before building the machine learning model. First, let's review the output of data.head() and data.tail().

It is observed that the dataset is quite stable; however, the Loan-On-Card column contains some undefined NaN values. These undefined values can affect the performance of the model and need to be addressed. We check the entire dataset to identify any undefined values in each column, as follows:

From the results, we can see that there are only 20 undefined values, all in the Loan-On-Card column. I will handle these values by defaulting them to 0, meaning the 20 customers with an undefined Loan-On-Card status are assigned a value of 0. After this adjustment, the Loan-On-Card column contains:

Figure 2.25: The number of instances in each class within Loan-On-Card
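The defaulting step described above amounts to a single fillna call. The six-row frame here is a hypothetical stand-in for the real column with its 20 NaN values.

```python
import numpy as np
import pandas as pd

# Hypothetical slice: in the real data, 20 of 5000 Loan-On-Card values are NaN.
df = pd.DataFrame({"Loan-On-Card": [1.0, 0.0, np.nan, 0.0, np.nan, 1.0]})

# Default undefined loan status to 0 (treated as "no loan on card").
df["Loan-On-Card"] = df["Loan-On-Card"].fillna(0)
print(df["Loan-On-Card"].value_counts())
```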

There is an issue here due to data imbalance: class 1.0 has only 480 instances, while class 0.0 has 4,520. Data imbalance can cause significant problems when building and deploying machine learning models, especially in classification tasks. Therefore, it is necessary to address this issue, and the chosen method is SMOTENC. SMOTENC is a variant of SMOTE (Synthetic Minority Over-sampling Technique) designed to handle imbalanced datasets with both categorical and continuous features. SMOTE randomly selects a minority class sample and finds its nearest neighbors, then generates synthetic samples by taking a point on the line segment between the selected sample and one of its neighbors. SMOTENC extends SMOTE to handle categorical features using a hybrid approach: for continuous features, it uses the same interpolation method as SMOTE; for categorical features, it uses the mode of the categorical values among the neighbors to determine the value in the synthetic sample.

The goal of SMOTENC is to increase the number of minority class samples by generating synthetic samples, thereby improving the performance of machine learning models on imbalanced datasets. The main steps in the SMOTENC algorithm are:

• Step 1: Randomly select a sample from the minority class.

• Step 2: Find the nearest neighbors of this sample in the multi-dimensional feature space.

• Step 3: Create a synthetic sample by interpolating the continuous values and taking the mode for the categorical features.

By generating synthetic samples for the minority class, SMOTENC helps the machine learning model better learn the characteristics of this class, thus improving performance on imbalanced datasets.
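The three steps above can be sketched in plain NumPy for a single synthetic sample. The data here is a toy minority class with two continuous features and one categorical feature; in practice, the imbalanced-learn library's SMOTENC class implements this at scale.

```python
from collections import Counter

import numpy as np

rng = np.random.default_rng(0)

# Toy minority-class samples: continuous features plus one categorical column.
cont = np.array([[1.0, 2.0], [1.2, 2.4], [0.8, 1.6], [2.0, 3.0]])
cat = np.array(["A", "A", "B", "A"])

def smotenc_sample(cont, cat, k=3):
    # Step 1: randomly pick a minority-class sample.
    i = rng.integers(len(cont))
    # Step 2: find its k nearest neighbours (Euclidean on continuous features).
    d = np.linalg.norm(cont - cont[i], axis=1)
    nn = np.argsort(d)[1:k + 1]          # skip index 0, the sample itself
    j = rng.choice(nn)
    # Step 3: interpolate the continuous part; take the mode of the
    # neighbours' categories for the categorical part.
    lam = rng.random()
    new_cont = cont[i] + lam * (cont[j] - cont[i])
    new_cat = Counter(cat[nn]).most_common(1)[0][0]
    return new_cont, new_cat

new_cont, new_cat = smotenc_sample(cont, cat)
print(new_cont, new_cat)
```

Because the continuous values are interpolated between two real samples, every synthetic point stays inside the observed feature range.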

Applied to the current problem, the dataset is divided into the target set Y, which is the Loan-On-Card column, and the feature set X, which includes the remaining columns. After balancing the data, the following results are obtained:

After the balancing process, both classes 0 and 1 have 4,520 instances each.

At this point, we can confirm that the data is balanced and proceed with the next steps of the problem.

The balanced dataset is divided into training and testing sets with an 80:20 ratio. The first model to be tested is Logistic Regression. The model is trained and tested, and the results are illustrated as follows:
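The split-and-fit step can be sketched with scikit-learn as follows. The synthetic dataset from make_classification is a stand-in for the balanced bank data; the real X and y come from the feature columns and Loan-On-Card.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the balanced bank dataset.
X, y = make_classification(n_samples=1000, n_features=8, random_state=42)

# 80:20 train/test split, stratified so both sets keep the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy:", model.score(X_test, y_test))
```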

Figure 2.27: Logistic Regression model accuracy

The results are quite good, with an accuracy rate of over 87% on both the training and testing sets. Let's take a look at the confusion matrix to see what it reveals about the model's performance.

Figure 2.28: Logistic Regression model confusion matrix

The results from the confusion matrix are very promising. For the Non-Loan holders class, the model correctly predicted 3,066 instances and incorrectly predicted 566. For the Loan holders class, it correctly predicted 3,243 instances and incorrectly predicted only 357. The detailed classification report also shows that precision, recall, and F1 score are all high, indicating that the Logistic Regression model performs well on the current problem.
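A confusion matrix and classification report of this kind are produced with scikit-learn's metrics module. The labels below are toy values, not the thesis's actual predictions; the layout (rows = true class, columns = predicted class) is the same.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Toy true labels and predictions for illustration only.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

cm = confusion_matrix(y_true, y_pred)   # rows: true class, cols: predicted
print(cm)
print(classification_report(y_true, y_pred,
                            target_names=["Non-Loan holders", "Loan holders"]))
```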

However, I also evaluated the following models: Decision Tree, Random Forest, Gradient Boosting, and KNN, to determine which achieves the highest accuracy and performance. Each model was evaluated on its precision, recall, and F1-score. Precision, in particular, is critical in the context of loan approval, as it measures the accuracy of the positive predictions (i.e., correctly identifying loan holders). The results of these models on the same dataset are as follows:
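A comparison loop over these four model families can be sketched as below. The data is again a synthetic stand-in, so the printed scores will not match the thesis's figures; the structure of the comparison is what matters.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import f1_score, precision_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the balanced bank dataset.
X, y = make_classification(n_samples=1000, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
}

scores = {}
for name, clf in models.items():
    pred = clf.fit(X_train, y_train).predict(X_test)
    scores[name] = (precision_score(y_test, pred), f1_score(y_test, pred))
    print(f"{name}: precision={scores[name][0]:.3f}, f1={scores[name][1]:.3f}")
```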

Figure 2.30: Decision Tree training results

The Decision Tree model demonstrated a higher precision of 0.93 for loan holders, meaning it was more accurate in predicting customers who actually held loans. Its F1-score for loan holders was 0.94, showing a balanced performance between precision and recall.

The Random Forest model performed exceptionally well, with a precision of 0.95 for loan holders. This high precision indicates that the model was very effective in identifying actual loan holders. Its F1-score was also high at 0.95, indicating robust performance across both metrics.

Figure 2.32: Gradient Boosting Classifier training results

The Gradient Boosting model achieved the highest precision, 0.96, for loan holders, making it the most accurate in identifying customers who actually held loans. Its F1-score for loan holders was also the highest at 0.96, reflecting an excellent balance between precision and recall.

The K-Nearest Neighbors model had a precision of 0.81 for loan holders. While lower than the tree-based models, it still showed a decent level of accuracy in predicting actual loan holders. Its F1-score of 0.84 indicates a reasonable balance, but slightly lower performance compared to the other models.

Based on the evaluations, the Gradient Boosting Classifier is the best model among those tested, achieving the highest accuracy of 0.958 and the highest weighted F1 score. The Random Forest Classifier is also a good option, with accuracy and F1 scores very close to those of the Gradient Boosting Classifier. If a balance between performance and model complexity is needed, the Random Forest Classifier might be the ideal choice due to its ensemble nature and its ability to handle medium to large datasets effectively.

CONCLUSION

Posted: 21/11/2024, 21:54

References

1. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction (2nd ed.). Springer.

2. Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785-794).

3. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O.

4. Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357.

5. Lundberg, S. M., & Lee, S. I. (2017). A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems (pp. 4765-4774).

6. Molnar, C. (2020). Interpretable machine learning: A guide for making black box models explainable. Lulu.com.

7. Raschka, S., & Mirjalili, V. (2019). Python machine learning: Machine learning and deep learning with Python, scikit-learn, and TensorFlow (3rd ed.). Packt Publishing.

8. Kuhn, M., & Johnson, K. (2013). Applied predictive modeling. Springer Science & Business Media.

9. Provost, F., & Fawcett, T. (2013). Data science for business: What you need to know about data mining and data-analytic thinking. O'Reilly Media.

10. Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32. Available at: https://link.springer.com/article/10.1023/A:1010933404324

13. Rygielski, C., Wang, J. C., & Yen, D. C. (2002). Data mining techniques for customer relationship management. Technology in Society, 24(4), 483-502. Available at: https://www.sciencedirect.com/science/article/abs/pii/S0957417404000132

14. Wedel, M., & Kamakura, W. A. (2000). Market segmentation: Conceptual and methodological foundations (2nd ed.). Springer. Available at: https://link.springer.com/book/10.1007/978-0-306-47630-3
