To obtain intelligent knowledge for preventing customer churn through data mining, the project group carried out four phases of work following the six steps below.
The four phases are:
1. Data Exploration Phase
2. Experimental Mining Phase
3. Data Mining Phase
4. Validation and Follow-ups
The six steps are:
1. Understand the process flow and the distribution of the data, and build up the data map.
2. Discuss the selected dataset and compile the data dictionary.
3. Pre-select the data fields.
4. Determine the processing method for the data mining fields.
5. Determine the data mining plan.
6. Using the MCLP software, input the integrated and cleaned data as a table in a TXT file, then partition the dataset, build the model, obtain the scoring model step by step, and visualize the collated results.
The following figure is the data map for customer churn prevention (Fig. 7.1):
Data collection and consolidation
1. Select a list according to the data fields of the data dictionary, and retrieve the data using the data retrieval method.
2. Compile the log processing program and transform the data into structured data.
3. Label the structured data based on the labels from the service department, authenticated by the technical department.
4. Data consolidation.
5. Cleaning, transformation, and discretization.
115 7.2 The Data Mining Process and Result Analysis
Through the above five steps, we obtained 4998 records of churn customers and 4998 records of normal customers. We then applied the cross-validation method ten times: each time, we randomly selected 500 records from each of the two data sets to form the training sample and used the remaining data as the test sample. This produced 10 groups of training samples and test samples, as shown in Fig. 7.2.
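The sampling procedure described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `make_splits`, the integer record ids, and the fixed random seed are assumptions introduced here. The procedure as described is repeated random subsampling, which is what the sketch implements.

```python
import random


def make_splits(churn_ids, normal_ids, n_rounds=10, n_train=500, seed=0):
    """Return n_rounds (train, test) pairs.

    Each training sample holds n_train randomly chosen records from each
    class; the test sample holds all remaining records of that class.
    """
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    splits = []
    for _ in range(n_rounds):
        train_churn = set(rng.sample(churn_ids, n_train))
        train_normal = set(rng.sample(normal_ids, n_train))
        train = {"churn": sorted(train_churn), "normal": sorted(train_normal)}
        test = {
            "churn": [r for r in churn_ids if r not in train_churn],
            "normal": [r for r in normal_ids if r not in train_normal],
        }
        splits.append((train, test))
    return splits


# Placeholder ids standing in for the 4998 + 4998 labelled records.
churn_ids = list(range(4998))
normal_ids = list(range(4998, 9996))
splits = make_splits(churn_ids, normal_ids)
```

Each of the 10 groups then contains 500 training records and 4498 test records per class.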
Data mining modeling
We built the evaluation model on the 10 groups of sample data. The training and testing results are shown in Fig. 7.3; the smooth lines indicate that our scoring models are highly stable.
In this project, we used the cross-validation algorithm to generate 9 groups of score voting machines; their predictive accuracy is shown in Table 7.1.
The combination of the 9 votes constitutes the 10 MCLP score models: if score[i] > 5, the customer is judged as normal; if score[i] <= 5, the customer is judged as a churn customer.
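The decision rule above can be written as a small function. This is only a sketch of the thresholding step: the name `judge` and the label strings are assumptions made here, and the way the votes are combined into a score is not shown because the text does not specify it.

```python
def judge(score, threshold=5):
    """Apply the rule from the text: score > 5 is normal, score <= 5 is churn."""
    return "normal" if score > threshold else "churn"


# Scores range over [1, 10]; apply the rule to every possible score.
labels = {score: judge(score) for score in range(1, 11)}
```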
Fig. 7.1 Data map for customer churn prevention
Fig. 7.2 Selections for training samples and test samples (legend: normal customers training, churn customers training, normal customers testing, churn customers testing)
116 7 Intelligent Knowledge Acquisition and Application in Customer Churn
There are many methods for measuring score precision, including the KS value method, which is based on the structure of the cumulative distribution. It has been confirmed to distinguish data sets efficiently and is widely applied in the field of credit risk management. Below, we explain the performance of the MCLP score model from the viewpoints of distribution density and cumulative distribution.
Table 7.2 below is a distribution density table for the two types of customers under the MCLP scoring system. The first column, Score, ranges over [1, 10]; the second column, LOST, gives the number and percentage of churn customers at each score; the third column, CURRENT, gives the number and percentage of normal customers at each score. As can be seen from the table, the 5382 churn customers fall mainly in the low score range, while the 69,473 normal customers fall mainly in the high score range. In particular, the churn customers are concentrated at a score of 1, which accounts for 55.797 % of all churn customers, and the normal customers are concentrated at a score of 10, which accounts for 46.196 % of all normal customers.
Fig. 7.3 Training and testing results (axes: sample number, accuracy rate; series: normal training, churn training, normal testing, churn testing)
Cross validation   Testing data sets (3382 churn + 65493 normal)
                   Churn   Accuracy (%)   Normal   Accuracy (%)
Dataset 1          2506    74.0982        46,777   71.4229
Dataset 2          2451    72.4719        47,336   72.2764
Dataset 3          2518    74.4530        46,940   71.6718
Dataset 4          2505    74.0686        46,728   71.3481
Dataset 5          2509    74.1869        46,844   71.5252
Dataset 6          2467    72.945         46,951   71.6886
Dataset 7          2565    75.8427        46,534   71.0510
Dataset 8          2535    74.9556        46,518   71.0274
Dataset 9          2475    73.1815        46,496   70.9938
Table 7.1 Cross validation table
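The accuracy columns of Table 7.1 can be recomputed from the counts. The sketch below assumes that the Churn and Normal columns are the numbers of correctly classified test records, an interpretation that is numerically consistent with the stated totals but is not spelled out in the text; the names `rows` and `accuracy_pct` are introduced here for illustration.

```python
# Totals of the testing data set, as stated in the header of Table 7.1.
TOTAL_CHURN = 3382
TOTAL_NORMAL = 65493

# (name, churn count, churn accuracy %, normal count, normal accuracy %),
# copied from Table 7.1.
rows = [
    ("Dataset 1", 2506, 74.0982, 46777, 71.4229),
    ("Dataset 2", 2451, 72.4719, 47336, 72.2764),
    ("Dataset 3", 2518, 74.4530, 46940, 71.6718),
    ("Dataset 4", 2505, 74.0686, 46728, 71.3481),
    ("Dataset 5", 2509, 74.1869, 46844, 71.5252),
    ("Dataset 6", 2467, 72.945, 46951, 71.6886),
    ("Dataset 7", 2565, 75.8427, 46534, 71.0510),
    ("Dataset 8", 2535, 74.9556, 46518, 71.0274),
    ("Dataset 9", 2475, 73.1815, 46496, 70.9938),
]


def accuracy_pct(count, total):
    """Accuracy in percent, assuming `count` records were classified correctly."""
    return 100.0 * count / total


# Recompute both accuracy columns from the counts and the stated totals.
recomputed = [
    (name, accuracy_pct(churn, TOTAL_CHURN), accuracy_pct(normal, TOTAL_NORMAL))
    for name, churn, _, normal, _ in rows
]

for name, churn_acc, normal_acc in recomputed:
    print(f"{name}: churn {churn_acc:.4f} %, normal {normal_acc:.4f} %")
```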
A more intuitive view of the density distributions is given in Fig. 7.4, where the yellow line represents the distribution of churn customers and the blue line represents the distribution of normal customers. As the figure shows, the two distributions are basically linearly separable, so the MCLP method is well suited to churn problems.
Summing up the above distribution density data gives the distribution function (cumulative distribution), listed in Table 7.3:
From the table, we can see that the maximum separation between churn customers and normal customers appears at the position score = 5, marked by the arrow. That is to say, if our model gives a customer a score of score[i] > 5, we can assume that the customer's loyalty is high and the customer will not churn; if the model gives a score of score[i] <= 5, we can assume that the customer is about to be lost, and we need to take measures to retain the customer.
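The cumulative distributions and the point of maximum separation can be recomputed from the percentages in Table 7.2. The sketch below is an illustration rather than the project's own code; the variable names are assumptions, and the separation is taken as the absolute difference between the two cumulative percentages, which is the usual definition of the KS value.

```python
from itertools import accumulate

scores = list(range(1, 11))

# Density percentages per score, copied from Table 7.2.
churn_pct = [55.797101, 7.785210, 4.923820, 3.994797, 3.493125,
             3.010033, 3.530286, 3.660349, 4.998142, 8.807135]
normal_pct = [13.196924, 4.115982, 2.961789, 3.118842, 3.269969,
              3.923370, 4.588624, 6.107300, 12.521299, 46.195902]

# Cumulative distributions: running sums of the densities.
churn_cdf = list(accumulate(churn_pct))
normal_cdf = list(accumulate(normal_pct))

# Absolute degree of separation at each score, and its maximum (KS value).
separation = [abs(c - n) for c, n in zip(churn_cdf, normal_cdf)]
ks_value = max(separation)
ks_score = scores[separation.index(ks_value)]

for s, c, n, d in zip(scores, churn_cdf, normal_cdf, separation):
    print(f"{s:>2}  churn {c:10.6f}  normal {n:10.6f}  separation {d:10.6f}")
print(f"maximum separation {ks_value:.6f} at score {ks_score}")
```

With these figures the maximum separation is about 49.33 percentage points and occurs at score 5, in agreement with the threshold used in the decision rule.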
Score   Churn (5382 records)   Normal (69,473 records)
        Percentage             Percentage
1       55.797101              13.196924
2       7.785210               4.115982
3       4.923820               2.961789
4       3.994797               3.118842
5       3.493125               3.269969
6       3.010033               3.923370
7       3.530286               4.588624
8       3.660349               6.107300
9       4.998142               12.521299
10      8.807135               46.195902
Table 7.2 The predictive accuracy of Churn prediction
Fig. 7.4 Density distributions figure (series: LOST, CURRENT)
Fig. 7.5 displays the KS graph:
Taking (0, 0) as the origin of coordinates, we see that the cumulative distribution of the churn customers (yellow line) makes a big jump at a score of 1, growing from 0 to 55.797 %, which shows that a large number of churn customers are accumulated at a score of 1; from 1 to 10 the growth is relatively gentle. For the normal customers (blue line), the cumulative distribution grows relatively gently over scores 1–9; when the score increases from 9 to 10, the cumulative distribution grows from 53.804098 to 100 %, which shows that a large number of normal customers are concentrated at this score. While the two
Table 7.3 Distribution function lists (column headings: SCORE; ACCUMULATE PERCENTAGE for Churn and Normal; Absolute degree of separation)
Fig. 7.5 KS score chart (series: LOST, CURRENT)