Application cases of data mining technology in credit card business
The credit card business is characterized by a huge number of overdraft transactions, each of a small amount, which makes the application of data mining technology in this business inevitable. Foreign credit card issuers have widely used data mining technology to promote the development of their credit card business and to achieve comprehensive performance management. Since China issued its first credit card in 1985, the credit card business has developed rapidly and accumulated a huge amount of data, and the importance of data mining in the credit card business has become increasingly apparent.
1. Application of data mining technology in credit card business
The main applications of data mining technology in credit card business include analytical customer relationship management (CRM), risk management and operation management.
1. Analytical CRM
Analytical CRM applications include market segmentation, customer acquisition, cross-selling and customer churn analysis. Credit card analysts collect and process large amounts of data and analyze it to discover patterns. From the characteristics, consumption habits, consumption tendencies and consumption needs of a given customer group, they infer that group's likely next purchases and use this as a basis to proactively market specific products to it. Compared with traditional mass marketing, which does not distinguish between types of consumers, this greatly reduces marketing costs and improves marketing results, thus bringing more profit to the bank. The choice of marketing method for a customer is based on the purchase probability predicted by a response model. For customers with a high response probability, more proactive and personal methods such as telephone marketing and door-to-door marketing are used; for customers with a low response probability, lower-cost methods such as email and letters can be chosen.
In addition to acquiring new customers, it is also important to maintain the loyalty of existing high-quality customers, because the cost of retaining an existing customer is much lower than the cost of acquiring a new one. In customer relationship management, data mining can identify the characteristics of lost customers and the patterns of their loss, so that targeted retention measures, such as compensation, can be offered to cardholders with similar characteristics before they leave, and high-quality customers continue to create value for the bank.
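The channel choice driven by the response model can be illustrated with a minimal sketch. The function name, the 0.5 cutoff and the sample probabilities are assumptions made for illustration; the article does not state a threshold.

```python
def choose_channel(response_prob, high_cutoff=0.5):
    """Map a predicted response probability to a marketing channel.

    high_cutoff is a hypothetical threshold, not a value from the article.
    """
    if not 0.0 <= response_prob <= 1.0:
        raise ValueError("response_prob must be a probability in [0, 1]")
    if response_prob >= high_cutoff:
        # High responders justify the more expensive, personal channels.
        return "telephone or door-to-door"
    # Low responders get the cheaper channels.
    return "email or letter"


# Illustrative response-model outputs for three customers.
customers = {"A": 0.82, "B": 0.15, "C": 0.50}
plan = {name: choose_channel(p) for name, p in customers.items()}
```

In practice the cutoff would be set from the cost of each channel and the expected profit per response.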
2. Risk management
Another important application of data mining in the credit card business is risk management, where it is used to build various credit scoring models. There are three main types: application scorecards, behavioral scorecards and collection scorecards, which provide credit risk control before, during and after the event respectively.
The application scoring model is used specifically to evaluate the credit of new customers at the credit card review stage. Using the personal information filled in by the applicant, it identifies and classifies good and bad customers quickly and effectively, decides whether to approve the application, and determines the initial credit limit for approved applicants, helping the card-issuing bank control risk at the source. The application scoring model does not rely on subjective human judgment or experience, which helps card issuers implement unified and standardized credit policies.
The behavioral scoring model is aimed at existing cardholders. It monitors and predicts cardholder behavior in order to assess credit risk. Based on the model results, the issuer can decide whether to adjust the customer's credit limit, whether to approve an authorization request, and whether to renew the card when it expires, and can receive early warning of possible emergencies.
The collection scoring model supplements the application and behavioral scoring models and is built for cardholders who are already overdue or in bad debt. Collection scorecards are used to predict and evaluate the effectiveness of actions taken on a particular bad debt, such as the likelihood that a customer will respond to a warning letter. In this way, the card-issuing bank can use the model's predictions to take appropriate measures for customers at different degrees of delinquency.
The data used to build these three scoring models is mainly demographic data and behavioral data. Demographic data includes age, gender, marital status, educational background, family member characteristics, housing situation, occupation, professional title, income status, etc. Behavioral data includes cardholders' past credit card usage, such as frequency of use, amounts, repayment status, etc. It can be seen that data mining enables banks to establish an effective credit risk control system before, during and after the event.
3. Operation management
Although operation management is not the most important area of application for data mining in the credit card business, many foreign card issuers have used it with great success to improve production efficiency, optimize processes, forecast funding and service needs, and analyze issues such as the order in which services are provided.
2. Commonly used data mining methods
In the application of the above data mining technology in the credit card field, there are many tools that can be used to develop prediction and description models. Some use statistical methods, such as linear regression and logistic regression; some use non-statistical or hybrid methods, such as neural networks, genetic algorithms, decision trees and regression trees. Only a few common typical methods are discussed here.
1. Linear regression
Simple linear regression is a statistical technique that quantifies the relationship between two continuous variables: the dependent variable (the variable being predicted) and the independent variable (the predictor). The method finds the line through the data that minimizes the sum of squared deviations between the data points and the line. When building models for marketing, risk and customer relationship management, there are usually several independent variables. Using multiple independent variables to predict a continuous variable is called multiple linear regression. Models built using linear regression methods are usually robust.
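As a minimal sketch of the simple (one-predictor) case, the least-squares line can be computed directly from the sample means. The income and spend figures are made-up illustrative data, and the function name is an assumption.

```python
def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must not be constant")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Illustrative data: monthly income vs. monthly card spend (both in thousands).
income = [3.0, 4.0, 5.0, 6.0, 7.0]
spend = [1.1, 1.4, 1.7, 2.0, 2.3]  # constructed as spend = 0.2 + 0.3 * income
intercept, slope = fit_simple_linear_regression(income, spend)
```

Because the data was constructed to lie exactly on a line, the fit recovers an intercept of about 0.2 and a slope of about 0.3.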
2. Logistic regression
Logistic regression is the most widely used modeling technique and is very similar to linear regression. The main difference between the two is that the dependent variable (the variable being predicted) in logistic regression is not continuous but discrete, that is, categorical. For an application scoring model, logistic regression can be used to select the key variables and determine the regression coefficients. Taking the applicant's key variables $x_1, x_2, \dots, x_m$ as the independent variables, the probability is
$$p(y=1) = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_m x_m}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_m x_m}}$$
In the formula, $\beta_0, \beta_1, \dots, \beta_m$ are constants; that is,
$$\ln\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_m x_m$$
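The logistic formula can be evaluated directly once the coefficients are known. The following is a minimal sketch; the coefficient values and the two-variable applicant are hypothetical and are not taken from the article's model.

```python
import math


def logistic_probability(betas, xs):
    """p(y=1) for coefficients [b0, b1, ..., bm] and inputs [x1, ..., xm]."""
    if len(betas) != len(xs) + 1:
        raise ValueError("betas needs one more entry than xs (the intercept)")
    z = betas[0] + sum(b * x for b, x in zip(betas[1:], xs))
    # 1 / (1 + e^-z) is algebraically the same as e^z / (1 + e^z).
    return 1.0 / (1.0 + math.exp(-z))


def log_odds(p):
    """ln(p / (1 - p)), which should equal the linear score b0 + b1*x1 + ..."""
    return math.log(p / (1.0 - p))


betas = [-1.0, 0.8, -0.5]  # hypothetical b0, b1, b2
applicant = [2.0, 1.0]     # hypothetical x1, x2
p = logistic_probability(betas, applicant)
```

Here the linear score is -1.0 + 0.8 * 2.0 - 0.5 * 1.0 = 0.1, so `log_odds(p)` returns approximately 0.1, which confirms that the two forms of the formula agree.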
3. Neural network
Neural network processing is very different from regression processing. It does not assume any probability distribution but imitates the function of the human brain, and can be thought of as extracting and learning information from each experience. A neural network consists of a series of nodes, similar to neurons in the human brain, interconnected in a network; given input data, they can identify patterns in the data. The network consists of an input layer, an intermediate (hidden) layer and an output layer that are connected to each other. The hidden layer consists of multiple nodes and does most of the network's work. The output layer outputs the results of the analysis.
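The three-layer structure can be sketched as a forward pass through a tiny network. The sigmoid activation, the layer sizes and all weight values are assumptions chosen for illustration. The sketch shows only how data flows from input to output, not how the weights are learned.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    """One fully connected layer: weighted sum plus bias per node, then sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]


def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Input layer -> hidden layer -> output layer."""
    hidden = layer(inputs, hidden_w, hidden_b)
    return layer(hidden, output_w, output_b)


# Hypothetical network: 2 inputs, 3 hidden nodes, 1 output node.
hidden_w = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
hidden_b = [0.0, 0.1, -0.1]
output_w = [[0.7, -0.5, 0.2]]
output_b = [0.05]
output = forward([1.0, 2.0], hidden_w, hidden_b, output_w, output_b)
```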
4. Genetic algorithm
Like neural networks, genetic algorithms do not assume any probability distribution; they are derived from the evolutionary process of "survival of the fittest". The algorithm first encodes the possible solutions to the problem in some form; the encoded solutions are called chromosomes. It randomly selects n chromosomes as the initial population and then calculates a fitness value for each chromosome using a predetermined evaluation function. Chromosomes that perform better have higher fitness values. Chromosomes with higher fitness values are selected for replication, and genetic operators are applied to generate a new group of chromosomes that is better adapted to the environment, forming a new population. This repeats until the population converges to the individual best adapted to the environment, which gives the optimal or near-optimal solution to the problem.
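The encode, evaluate, select and recombine loop can be sketched on a toy problem. Everything specific here is an assumption for illustration: the bit-string encoding, the "count the 1-bits" evaluation function, the population size, the mutation rate, and the use of elitism and single-point crossover as the genetic operators.

```python
import random


def fitness(chromosome):
    """Evaluation function for the toy problem: the number of 1-bits."""
    return sum(chromosome)


def run_genetic_algorithm(n_bits=20, pop_size=30, generations=40,
                          mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    # Initial population: pop_size random chromosomes.
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    initial_best = max(fitness(c) for c in population)
    for _ in range(generations):
        # Selection: the fitter half of the population become parents.
        population.sort(key=fitness, reverse=True)
        parents = population[:pop_size // 2]
        # Elitism: the best chromosome is copied over unchanged.
        children = [parents[0][:]]
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randint(1, n_bits - 1)  # single-point crossover
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - bit if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = children
    best = max(population, key=fitness)
    return best, initial_best


best, initial_best = run_genetic_algorithm()
```

Because the best chromosome is always carried over, the best fitness never decreases from one generation to the next.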
5. Decision tree
The goal of a decision tree is to split the data step by step into different groups or branches, at each step choosing the split that best separates the values of the dependent variable. Because the resulting classification rules are relatively intuitive, they are easy to understand. Figure 1 shows the decision tree of customer responses, from which it is easy to identify the group with the highest response rate.
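A single splitting step can be sketched as follows. The use of Gini impurity as the splitting criterion and the age and response data are assumptions for illustration; the article does not say which criterion its tree uses.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best 'value <= threshold' split."""
    best = (None, gini(labels))
    n = len(values)
    # The largest value is skipped because it would leave the right branch empty.
    for threshold in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Illustrative data: customer age and whether the customer responded (1) or not (0).
ages = [22, 25, 31, 38, 45, 52, 58, 63]
responded = [0, 0, 0, 1, 1, 1, 1, 0]
threshold, impurity = best_split(ages, responded)
```

On this data the best split is `age <= 31`: no customer in the younger branch responded, while four of the five customers in the older branch did. A full tree would apply the same step recursively to each branch.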
3. Example Analysis
The following uses the logistic regression method to establish a credit card application scoring model as an example to illustrate the application of data mining technology in the credit card business. Application scoring model design can be divided into 7 basic steps.
1. Define the criteria of good customers and bad customers
The criteria for good and bad customers are defined according to management needs. According to foreign experience, at least 1,000 good and 1,000 bad samples are required to build a risk model that predicts customer quality. In the early stage of the credit card market, banks' main sources of profit are merchant commissions, credit card interest, fee income and the spread on fund operations, and banks seek to avoid risk. Therefore, banks generally regard reducing the customer overdue rate as a major management goal.
For example, define a bad customer as a customer who has been overdue for more than 60 days, and define a good customer as a customer who has not been overdue for more than 30 days and is not currently overdue.
Generally speaking, in the same sample space, the number of good customers is much greater than the number of bad customers. To ensure that the model has a high ability to identify bad customers, the samples are drawn so that the ratio of good to bad customers is 1:1.
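One way to obtain a 1:1 ratio is to randomly undersample the larger group. This is an assumption about the procedure, since the article states only the target ratio. The record layout and the `is_bad` field name are hypothetical.

```python
import random


def balance_one_to_one(records, label_key="is_bad", seed=0):
    """Undersample the larger class so that good and bad samples are 1:1."""
    rng = random.Random(seed)
    bad = [r for r in records if r[label_key] == 1]
    good = [r for r in records if r[label_key] == 0]
    n = min(len(good), len(bad))
    sample = rng.sample(good, n) + rng.sample(bad, n)
    rng.shuffle(sample)  # avoid all good customers preceding all bad ones
    return sample


# Illustrative population: 200 customers, of whom 1 in 10 is bad.
records = [{"id": i, "is_bad": 1 if i % 10 == 0 else 0} for i in range(200)]
balanced = balance_one_to_one(records)
```

With 180 good and 20 bad customers, the balanced sample contains 20 of each.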
2. Determine the sample space
When determining the sample space, consider whether the sample is representative. Being a good customer means that the cardholder has performed well with the card throughout an observation period, whereas a customer is considered bad as soon as he or she has a single "bad" record. Therefore, the observation period for good customers is generally longer than that for bad customers. Good and bad customers can be selected from different time periods, that is, from different sample spaces. For example, the sample space of good customers could be applicants from November 2003 to December 2003, and the sample space of bad customers applicants from November 2003 to May 2004. This ensures that the performance period of good customers is long enough and that there are enough bad customer samples. Of course, both samples should still be representative.
3. Data source
In the United States, credit bureaus score personal credit, and the score is often called the "FICO score". Banks, credit card companies and financial institutions in the United States can use credit bureau reports on individuals when conducting credit risk analyses of customers. In China, because the credit reporting system is not yet complete, modeling data mainly comes from application forms. With the gradual improvement of China's national credit reporting system, part of the data for future modeling can be collected from credit reporting agencies.
4. Data preparation
Before a large amount of sampled data can actually enter the model, it must be cleaned and organized. When processing the data, check its logical consistency, distinguish "missing data" from "0", infer certain values based on logic, look for abnormal values and assess whether they are genuine. Finding the minimum, maximum and average values gives a preliminary check of whether the sampled data is random and representative.
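A minimal sketch of such a check follows, assuming that missing values are represented as `None` so that they stay distinct from a genuine 0. The income figures are made up, including one deliberately suspicious value.

```python
def summarize(values):
    """Min, max and mean of the non-missing values, plus missing and zero counts.

    None marks a missing value; 0 is a real observation and is kept.
    """
    present = [v for v in values if v is not None]
    summary = {
        "n_missing": len(values) - len(present),
        "n_zero": sum(1 for v in present if v == 0),
    }
    if present:
        summary.update(min=min(present), max=max(present),
                       mean=sum(present) / len(present))
    return summary


# Illustrative monthly income field from application forms.
monthly_income = [5200, 0, None, 8100, 4300, None, 990000]
stats = summarize(monthly_income)
```

Here the summary reports 2 missing values and 1 zero, and the maximum of 990000 pulls the mean up to 201520.0, which flags that record as an abnormal value to be checked.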
5. Variable selection
Variable selection must be both statistically sound and explainable in terms of the actual credit card business. Logistic regression seeks the independent variables that predict the dependent variable as accurately as possible and assigns a weight to each. If there are too few independent variables, the fit will be poor and the dependent variable cannot be predicted well; if there are too many, overfitting will occur and prediction will be equally poor. Therefore, the set of independent variables should be reduced and simplified, for example by using dummy variables to represent variables that cannot be quantified and by using univariate analysis and decision tree analysis to screen variables. Values of an independent variable that have similar correlations with the dependent variable can be grouped into one category. Take the effect of region on the probability of being a bad customer: if the correlations of Guangdong and Fujian provinces with bad customers are -0.381 and -0.380 respectively, the two provinces can be grouped into one region category. In addition, new independent variables can be constructed from the information on the application form. For example, based on experience and common sense, the "marital status" and "children" fields can be combined into a new variable, "married with children", which is then entered into the model analysis to test whether it is really statistically predictive.
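The grouping, dummy coding and variable construction described above can be sketched as follows. The field names (`region`, `marital_status`, `has_children`) and the group label are hypothetical, since the article does not give the layout of the application form.

```python
# Provinces with nearly identical correlations are merged into one category.
REGION_GROUPS = {"Guangdong": "Guangdong_Fujian", "Fujian": "Guangdong_Fujian"}


def build_features(application):
    """Turn raw application-form fields into model-ready 0/1 variables."""
    region = REGION_GROUPS.get(application["region"], application["region"])
    married = application["marital_status"] == "married"
    has_children = bool(application["has_children"])
    return {
        # Dummy variable for the merged region category.
        "region_Guangdong_Fujian": int(region == "Guangdong_Fujian"),
        # Dummy variable for a field that cannot be quantified directly.
        "is_married": int(married),
        # Constructed variable combining two form fields.
        "married_with_children": int(married and has_children),
    }


features = build_features(
    {"region": "Fujian", "marital_status": "married", "has_children": True}
)
```

Whether a constructed variable such as `married_with_children` stays in the model still depends on testing its predictive power.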
6. Model establishment
Use SAS 9 software to screen variables with the stepwise regression method. The algorithm designed here has 6 steps.
Step 1: Compute the correlation matrix of the variables (for dummy variables, a correlation above 0.5 counts as relatively strong; for general variables, above 0.7-0.8).
Step 2: Perform rotated principal component analysis (for general variables, a value above 0.8 counts as relatively strong; for dummy variables, above 0.6-0.7).
Step 3: Find 15 variables and 30 variables in the first principal component and the second principal component respectively.
Step 4: Calculate the correlation of all 30 variables with the good/bad indicator, find the variables with high correlation and add them to the variables obtained in step 3.
Step 5: Calculate the variance inflation factor (VIF). If a VIF value is relatively large, check the correlation matrix from step 1, analyze the effect on the model of the two correlated variables, and eliminate the one with the smaller correlation.
Step 6: Repeat steps 4 and 5 until all variables have been selected, the correlations in the correlation matrix are very small, and each variable contributes substantially to the model.
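The VIF check in step 5 can be sketched in plain Python. This is not the SAS procedure used in the article. It relies on the standard identity that the VIF of each variable equals the corresponding diagonal entry of the inverse of the correlation matrix. The variable names and data are made up for illustration.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: variables are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """columns maps variable name -> list of values; returns name -> VIF."""
    names = list(columns)
    corr = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    inverse = invert(corr)
    return {name: inverse[i][i] for i, name in enumerate(names)}


# Illustrative data: credit_limit is almost exactly twice income.
data = {
    "income": [3, 4, 5, 6, 7, 8],
    "credit_limit": [6.1, 7.9, 10.2, 11.8, 14.1, 16.0],
    "age": [25, 48, 31, 52, 29, 40],
}
vifs = vif(data)
```

Because `income` and `credit_limit` are nearly collinear, both receive a very large VIF, which is the signal in step 5 to go back to the correlation matrix and drop one of them.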
7. Model verification
When collecting data, divide all the prepared data into a modeling sample for building the model and a control sample for verifying it. The control sample is used to verify the overall predictive power and stability of the model. Test indicators for an application scoring model include the K-S value, ROC, AR and others. Although affected by objective factors such as unclean data, the K-S value of the application scoring model in this example exceeded 0.4, a level at which the model can be used.
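The K-S value is the largest gap between the cumulative score distributions of bad and good customers. A minimal sketch of the calculation on a control sample follows. The scores and labels are made up for illustration and are not the article's data.

```python
def ks_statistic(scores, is_bad):
    """Largest gap between the cumulative score distributions of bad and good customers."""
    n_bad = sum(is_bad)
    n_good = len(is_bad) - n_bad
    if n_bad == 0 or n_good == 0:
        raise ValueError("need at least one good and one bad customer")
    ks = 0.0
    for cutoff in sorted(set(scores)):
        # Share of each group scoring at or below the cutoff.
        bad_below = sum(1 for s, b in zip(scores, is_bad) if b == 1 and s <= cutoff)
        good_below = sum(1 for s, b in zip(scores, is_bad) if b == 0 and s <= cutoff)
        ks = max(ks, abs(bad_below / n_bad - good_below / n_good))
    return ks


# Illustrative control sample: model score and actual outcome (1 = bad customer).
scores = [520, 540, 560, 580, 600, 620, 640, 660, 680, 700]
is_bad = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
ks = ks_statistic(scores, is_bad)
```

On this sample the K-S value is 0.6, reached at a cutoff of 560, where 60% of bad customers but no good customers fall at or below the cutoff.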
4. Development prospects of data mining in the domestic credit card market
In foreign countries, the credit card business is highly computerized, a large amount of quantitative data is retained in databases, and various models have been applied very successfully. At present, domestic card-issuing banks use data mining first of all to build application scoring models, as the first step of its application in the credit card business, and many of them have built customized application scoring models from their own historical data. Generally speaking, however, the application of data mining in China's credit card business suffers from data quality problems, which make it difficult to build business models.
As various domestic card-issuing banks have established or started to establish data warehouses, data from different operating sources are stored in a centralized environment, and appropriate cleaning and conversion are performed. This provides a good operating platform for data mining and will bring various conveniences and functions to data mining. The People's Bank of China's personal credit reporting system has also been launched online, forming a nationwide concentration of personal credit data. On the basis of the continuous improvement of the internal and external environments, data mining technology will have increasingly broad application prospects in the credit card business.