1. Project background
Recent news reports describe users who find that money has been stolen from their cards after seemingly normal purchases or withdrawals. These are fraudulent transactions. Fraudulent transactions are a harmful phenomenon found in industries such as banking, insurance, and securities, and they cause great losses and threats to people's finances and lives. The problem is global, and developed countries have implemented powerful information management systems that detect, identify, and evaluate fraudulent transactions through data mining and artificial intelligence, effectively improving anti-fraud technology.
CRISP-DM, the Cross-Industry Standard Process for Data Mining (as shown below), is by far the most popular reference model for the data mining process. The links between the large and small nodes shown in the figure are cyclical and only loosely defined. The process itself is not the point. What matters is that the results of data mining can eventually be embedded into business processes to improve business efficiency and effectiveness.
CRISP-DM fits very well with SPSS Modeler, developed by SPSS. SPSS Modeler supports three major statistical methodologies (rigorous design, quasi-experimental research, and intelligent analysis) and is one of the most outstanding statistical software packages in the world. Here, SPSS Modeler 18 is used as the modeling tool. Non-real medical insurance data (a policyholder information table, a medical institution information table, a claim information table, and a medical diagnosis and procedure information table) serve as internal business data, and non-real small-loan data serve as a third-party customer data source, for data mining modeling and analysis aimed at detecting fraudulent transactions. The approach should also be a useful reference for other industries.
In the business understanding stage of CRISP-DM, the company first assesses its situation (resources, needs, risks, costs, and benefits) in order to determine the data mining goals.
The business analysis of medical insurance fraud risks is as follows:
1) Forms of domestic medical insurance fraud
These mainly include: impersonation (that is, falsifying eligibility for treatment); falsifying the cause of illness (presenting conditions not covered by medical insurance, such as car accidents, work injuries, fights, and suicides, as conditions that medical insurance pays for); exaggerating losses; forging bills; forging medical documents; faking inpatient stays (that is, "hanging" a bed in the hospital); and fabricating hospitalization records, outpatient special-disease records, and other related information to "cheat insurance".
2) The parties who commit fraud
Under the "third-party payment" system, medical staff and the insured may conspire to defraud insurance institutions.
There are three main roles: policyholders, medical institutions, and insurance companies. The possible sources of fraud include policyholders and medical institutions. The goals and directions of data mining based on business characteristics are as follows:
Data anomaly detection;
Classification research on policyholders, using user portraits and combining existing and external data to predict fraud scores for potential customers;
Classification research on medical institution information;
Medical claims detection.
Note: To keep the length manageable, this article is an overview. Specific ideas and algorithms will be discussed in future articles.
2. Data and model analysis
2.1 Data anomaly detection
Many data anomalies can be judged directly from experience with the business logic. For example, a customer's claim frequency and claim amount may increase significantly over a period of time, or the relationship between a policyholder's payment amount and medical expense data may be abnormal. All of these can be treated as suspected fraud, and the related process is not demonstrated technically here.
Benford's law and anomaly detection are methods widely used in auditing, securities, and other industries. Anomaly detection means finding objects that differ from most other objects, in other words finding outliers. Several anomaly detection methods can be used at the same time to improve the hit rate for fraudulent transactions. Benford's law is an interesting law that describes how the first digit is distributed in large data sets: the larger the first digit, the lower its frequency of occurrence (1 is the leading digit in about 30% of values, 9 in under 5%).
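As a rough illustration of the first-digit check, the sketch below compares observed first-digit frequencies with the Benford expectation log10(1 + 1/d). The function names and both sample lists are hypothetical, chosen only to show the contrast. They are not taken from the insurance data used in this article.

```python
import math
from collections import Counter

def benford_expected():
    """Expected first-digit probabilities under Benford's law: log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(value):
    """Return the leading non-zero digit of a non-zero number."""
    # Scientific notation puts the leading digit first, e.g. 0.042 -> '4.2...e-02'.
    return int(f"{abs(value):.15e}"[0])

def benford_deviation(amounts):
    """Mean absolute gap between observed and expected first-digit shares."""
    digits = [first_digit(a) for a in amounts if a != 0]
    counts = Counter(digits)
    n = len(digits)
    expected = benford_expected()
    return sum(abs(counts.get(d, 0) / n - expected[d]) for d in range(1, 10)) / 9

# Hypothetical data: powers of 2 follow Benford's law closely, while amounts
# packed just under a round limit (900..998) all start with the digit 9.
natural = [2 ** k for k in range(1, 200)]
suspicious = [900 + k for k in range(99)]

print("natural:   ", round(benford_deviation(natural), 3))
print("suspicious:", round(benford_deviation(suspicious), 3))
```

A large deviation is only a prompt for manual review. Many legitimate data sets, such as amounts capped by policy limits, do not follow Benford's law.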
Cluster modeling uses the medical institution number, payment amount, number of claims, and so on as input variables:
Filtering for a number of claims greater than 50 and a clustering distance greater than 0.2 gives the following suspected fraud report: "Healthcare Institution Number: 10083642887, Healthcare Institution Subcategory: Psychology, Number of Claims: 58" and "Healthcare Institution Number: 10085843968, Healthcare Institution Subcategory: Med Trans, Number of Claims: 71".
To widen the search for abnormal data, the dedicated anomaly detection method, Anomaly modeling, is used:
The following table lists the policyholders suspected of fraud, namely those with an anomaly index greater than 1.5 and an Anomaly flag of "T":
For each record, the model results also show the three most important factors that caused it to be treated as an outlier, together with their impact indexes. It is easy to see that DIAG (diagnosis), Procedure (processing), and MEDcode (medical measures) are important factors leading to suspected fraud.
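The idea of an anomaly index can be sketched outside SPSS Modeler as follows. This is a simplified stand-in and not the algorithm of the Anomaly node: it assumes a single peer group, standardizes each feature, and divides each record's distance from the group centre by the group's average distance. The records are hypothetical, and only the 1.5 cut-off comes from the text.

```python
import math
from statistics import mean, pstdev

def standardize(rows):
    """Z-score each column so no single feature dominates the distance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in rows]

def anomaly_indexes(rows):
    """Each record's distance from the group centre divided by the average distance."""
    z = standardize(rows)
    # After z-scoring, the group centre is the origin.
    dists = [math.sqrt(sum(v * v for v in row)) for row in z]
    avg = mean(dists)
    return [d / avg for d in dists]

# Hypothetical (payment amount, number of claims) records for one peer group.
records = [(1200, 3), (1100, 4), (1300, 2), (1250, 3), (1150, 4), (9800, 58)]
THRESHOLD = 1.5  # cut-off quoted in the text

flags = ["T" if idx > THRESHOLD else "F" for idx in anomaly_indexes(records)]
print(flags)
```

Only the last record, whose payment amount and claim count are both far from the rest of the group, is flagged "T".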
After review by the fraud department, the hit rates of the two algorithms can be compared.
2.2 Fraud analysis of policyholders
This covers cluster migration, fraud scoring, and user portraits.
2.2.1 Customer cluster migration
Generally speaking, over a relatively short period the status and behavior patterns of both institutions and individuals are fairly stable and do not change much. If policyholders are clustered into segments and a customer changes segment within a year or even half a year, a suspected fraud report can be submitted. Cluster modeling selects several key input variables (with reference to the RFM model), such as payment amount, number of payments, and insurance terms. Clustering is performed separately on the first and second years, customers whose group changed are marked, and a suspected fraud list is obtained.
Customer cluster analysis also reveals some groups with very few records. These are often ignored in marketing activities, but in fraud detection they are groups with abnormal behavior that deserve attention.
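A minimal sketch of the migration check is given below. It makes one simplifying assumption that differs from the description above: centroids are fitted on the first year only and reused for the second year, so that cluster labels stay comparable between years. The customer IDs, feature values, and the tiny k-means routine are all illustrative.

```python
import math

def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means. The first k points seed the centroids for reproducibility."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            [sum(col) / len(g) for col in zip(*g)] if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids

# Hypothetical (payment amount, number of payments) per customer, on comparable scales.
year1 = {"A": (1.0, 1.2), "B": (5.0, 5.1), "C": (0.9, 1.0), "D": (5.2, 4.8), "E": (1.1, 0.8)}
year2 = {"A": (1.1, 1.1), "B": (5.1, 5.0), "C": (1.0, 0.9), "D": (5.0, 5.0), "E": (4.9, 5.3)}

centroids = kmeans(list(year1.values()), k=2)
migrated = [
    cid for cid in year1
    if nearest(year1[cid], centroids) != nearest(year2[cid], centroids)
]
print(migrated)  # customers whose segment changed between the two years
```

If the two years are clustered independently, as in the text, the clusters of one year first have to be matched to those of the other (for example by nearest centroid) before labels can be compared.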
2.2.2 Fraud scoring: single classifier and ensemble learning
Personal credit systems are very mature in developed countries. The banking industry is the most familiar example, with professional applications such as credit approval, limit determination, and anti-fraud. In the U.S. banking industry, fraudulent credit card transactions amount to only about 100 million US dollars each year, about 0.02% of the total, which shows that mature data mining technology has achieved remarkable results.
Fraud scoring can be divided into three main steps: variable conversion, fitting a logistic regression model, and score conversion. The sample is randomly divided into two parts, one to build the model and the other to test it. Binning variables does lose some information, but because the work starts from business needs, it must be taken into account that binned variables are easier for business personnel to use and understand.
The input to the logistic regression model is the WOE (weight of evidence) value of each binned variable. The WOE value is calculated as: WOE = ln(proportion of good customers / proportion of bad customers) * 100.
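The WOE formula can be worked through on a small example. In the sketch below, "proportion of good customers" is read in the usual way for a bin, that is, the share of all good customers who fall into that bin, and likewise for bad customers. The age bins and counts are hypothetical.

```python
import math

def woe_by_bin(good_counts, bad_counts):
    """WOE per bin = ln(share of goods in bin / share of bads in bin) * 100."""
    total_good = sum(good_counts.values())
    total_bad = sum(bad_counts.values())
    # Note: a bin with zero goods or zero bads would need smoothing or merging first.
    return {
        b: math.log((good_counts[b] / total_good) / (bad_counts[b] / total_bad)) * 100
        for b in good_counts
    }

# Hypothetical counts of good (non-fraud) and bad (fraud) customers per age bin.
good = {"<27": 100, "27-45": 500, ">45": 400}
bad = {"<27": 40, "27-45": 40, ">45": 20}

woe = woe_by_bin(good, bad)
for bin_name, value in woe.items():
    print(f"{bin_name}: {value:.1f}")
```

A negative WOE marks a bin where bad customers are over-represented (here the youngest bin), and a positive WOE marks a bin dominated by good customers.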
Variable conversion includes the following steps:
1) Eliminate redundant variables (among variables with a large correlation coefficient, keep only one);
2) Binning of continuous variables and merging of categories for discrete variables;
3) Calculation of IV (information value) and WOE values. To improve predictive power, try to select variables with IV values greater than or equal to 0.02 and less than or equal to 0.05.
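Step 3 can be illustrated with the commonly used definition of information value, IV = sum over bins of (good share - bad share) * ln(good share / bad share). This definition is assumed here, since the article does not spell it out, and the two variables and their counts are hypothetical.

```python
import math

def information_value(good_counts, bad_counts):
    """IV = sum over bins of (good share - bad share) * ln(good share / bad share)."""
    total_good = sum(good_counts.values())
    total_bad = sum(bad_counts.values())
    iv = 0.0
    for b in good_counts:
        g = good_counts[b] / total_good  # share of all good customers in this bin
        d = bad_counts[b] / total_bad    # share of all bad customers in this bin
        iv += (g - d) * math.log(g / d)
    return iv

# Hypothetical binned variables: bin -> count of good / bad customers.
variables = {
    "age": ({"<27": 100, "27-45": 500, ">45": 400}, {"<27": 40, "27-45": 40, ">45": 20}),
    "gender": ({"M": 520, "F": 480}, {"M": 50, "F": 50}),
}

iv_table = {name: information_value(g, b) for name, (g, b) in variables.items()}
for name, iv in sorted(iv_table.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: IV = {iv:.4f}")
```

Ranking variables by IV in this way gives a quick first screen before the logistic regression is fitted. The screening bounds themselves are a business choice.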
The figure above shows part of the model and the output of the variable conversion data stream. Starting from the first output table, the default rate of the credit card data, a discrete variable, can then be calculated for category conversion.
After the logistic regression model is fitted with the stepwise method, statistical methods must be used to convert the regression coefficients into scores. The score conversion step involves a business-driven scaling process, which is not described in detail for now.
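For orientation only, one common industry convention for score scaling is sketched below. It is not necessarily the scaling used in this project. It assumes that the score is linear in the log-odds of being a good customer, that a chosen base score corresponds to chosen base odds, and that a fixed number of points doubles the odds (PDO). The values 600, 50, and 20 are illustrative.

```python
import math

def scaling_constants(base_score, base_odds, pdo):
    """Factor and offset so that base_odds maps to base_score and doubling the odds adds pdo points."""
    factor = pdo / math.log(2)
    offset = base_score - factor * math.log(base_odds)
    return factor, offset

def probability_to_score(p_bad, factor, offset):
    """Convert a predicted probability of fraud into a score (higher = lower risk)."""
    odds_good = (1 - p_bad) / p_bad
    return offset + factor * math.log(odds_good)

# Illustrative choices: 600 points at good:bad odds of 50:1, 20 points to double the odds.
factor, offset = scaling_constants(base_score=600, base_odds=50, pdo=20)

for p in (1 / 51, 1 / 101, 0.2):
    print(f"p_bad = {p:.4f} -> score = {probability_to_score(p, factor, offset):.1f}")
```

With WOE inputs, the same factor and offset can be distributed over the regression coefficients so that each bin of each variable receives its own points.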
The prediction model can be tested with the ROC curve, the KS (Kolmogorov-Smirnov) statistic, and similar methods. Because scorecard testing needs to show which score segment separates the groups most, the KS statistic is chosen:
Generally, KS > 0.2 is taken to indicate a model with good prediction accuracy.
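The KS statistic is the largest gap between the cumulative distributions of bad and good customers when records are ordered by score. A minimal sketch, using hypothetical scores and labels:

```python
def ks_statistic(scores, labels):
    """Maximum gap between cumulative shares of bad (1) and good (0) cases, scanning by score."""
    total_bad = sum(labels)
    total_good = len(labels) - total_bad
    cum_bad = cum_good = 0
    best = 0.0
    pairs = sorted(zip(scores, labels))
    for i, (score, label) in enumerate(pairs):
        if label == 1:
            cum_bad += 1
        else:
            cum_good += 1
        # Only compare the two curves after the last record of a tied score.
        if i + 1 == len(pairs) or pairs[i + 1][0] != score:
            best = max(best, abs(cum_bad / total_bad - cum_good / total_good))
    return best

# Hypothetical predicted fraud probabilities and true labels (1 = fraud).
scores = [0.10, 0.15, 0.20, 0.30, 0.35, 0.60, 0.70, 0.80, 0.85, 0.90]
labels = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1]

ks = ks_statistic(scores, labels)
print(f"KS = {ks:.3f}")
```

The score at which the maximum gap occurs also indicates where a cut-off separates the two groups best.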
Regression is one of the basic algorithms commonly used as a single classifier. The C5.0 decision tree can also be used for modeling.
The C5.0 model yields 8 rules for customer fraud. These rules reveal several salient characteristics that appear before fraudulent transactions occur, so that signs of customer fraud can be discovered and preventive action taken early. Rule 1 shows that customers under the age of 27, with a credit card type of "check" and a nationality of Greece or Yugoslavia, are one of the high-risk groups for fraudulent transactions.
Although single classifiers have been widely used in the past, they have obvious shortcomings. In recent years, the U.S. banking industry has adopted tree-based algorithm families on a large scale. Two main types of ensemble learning are currently in wide use: Boosting-based and Bagging-based methods. The most recent is the gradient boosted tree algorithm. These ensemble methods avoid the problem of interdependence between variables, their predictive power has steadily improved, and they have a wide range of applications. They have proven very effective in anti-fraud and other fields and deserve the attention of practitioners.
The main idea of the Boosting algorithm is to increase, in each of T iterations, the resampling weight of the samples that were misclassified, so that the next iteration pays more attention to them. The weak classifiers trained in this way are combined by weighted voting into a final classifier, which improves on the accuracy of the weak classification algorithm. We configure boosting with 50 decision-tree iterations:
Modeling and results:
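Independently of the SPSS Modeler stream, the reweighting idea behind Boosting can be sketched with AdaBoost and one-split decision stumps. This is a teaching sketch under stated assumptions: it uses stumps instead of C5.0 trees, 3 rounds instead of 50, and a hypothetical one-dimensional data set that no single stump can classify correctly.

```python
import math

def train_stump(X, y, w):
    """Find the (feature, threshold, polarity) stump with the lowest weighted error."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for polarity in (1, -1):
                preds = [polarity if row[f] <= t else -polarity for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, f, t, polarity)
    return best

def stump_predict(stump, row):
    _, f, t, polarity = stump
    return polarity if row[f] <= t else -polarity

def adaboost(X, y, rounds):
    """Reweight misclassified samples each round, then combine stumps by weighted vote."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = max(stump[0], 1e-10)  # avoid division by zero for a perfect stump
        if err >= 0.5:              # no better than chance: stop early
            break
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Misclassified samples get larger weights, correct ones smaller weights.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical data: label +1 = normal, -1 = fraud.
X = [[v] for v in range(10)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

for rounds in (1, 3):
    model = adaboost(X, y, rounds)
    correct = sum(predict(model, row) == yi for row, yi in zip(X, y))
    print(f"{rounds} round(s): {correct}/{len(y)} training samples correct")
```

The single stump misclassifies three samples. After reweighting, later stumps concentrate on exactly those samples, and the weighted vote of three stumps classifies the whole training set correctly.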
2.2.3 User portraits
User portraits, which have become popular in recent years, let a company trace its customers back to their origins and gain a more intuitive understanding of its customer base. They help the marketing department with precision marketing, and they combine internal data with external (third-party) data to build a large-scale data warehouse system that becomes one of the company's core assets. Users are usually described by several label systems, such as demographics, social group characteristics, financial business characteristics, and personal interests and hobbies. Researching user portraits and building these label systems helps us get to know our customers within minutes.
Generally speaking, banks hold rich transaction data, personal attribute data, consumption data, credit data, and customer data. Their demand for user profiling is greater, and they implemented it earlier. Currently, much of the information on social interests and hobbies is supplemented by third parties. Insurance products are long-term products, and the rate at which existing insurance customers buy insurance products again is very high, so user profiling is a necessary process there as well.
Based on business experience and ensemble learning theory (when the data set is large, it can be divided into subsets that are trained separately and then combined into one classifier), large companies such as those in banking and telecommunications can first segment their customer data by customer value (long tail theory) and then build different types of models for high-value customers, medium- and low-value customers, and so on, to achieve better classification. To meet the varied marketing needs of each campaign, the first step is to construct a subset of tag features from the huge customer tag system. The impact factor of each tag is then calculated with LR (a ranking model) or similar methods to assign weights to the tags. The top-ranked tags form the portrait of the target users that business personnel need to know. The same results can also give the marketing department a more accurate list of customers to target, greatly improving business efficiency.
Assume that the result of the Anomaly detection used at the beginning is correct. Add a customer attribute, "Yes/No fraud occurred", to the policyholder information table and mark each record according to those results. Then build a k-Means model, output the fraud proportion of each cluster, and view the result report:
From the output, we can focus on the group feature labels of the clusters with higher fraud proportions. SPSS Modeler lets you compare cluster features directly. The model features of cluster 7 are described as follows, which makes it possible to recognize, within minutes, unfamiliar customers who are committing fraudulent transactions.
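The per-cluster fraud proportion described above amounts to a simple grouped ratio. The sketch below assumes that cluster labels (for example from k-Means) and the fraud flags from anomaly detection are already available. The label and flag lists are hypothetical.

```python
from collections import defaultdict

def fraud_rate_by_cluster(cluster_labels, fraud_flags):
    """Share of records flagged as fraud within each cluster."""
    totals = defaultdict(int)
    frauds = defaultdict(int)
    for cluster, is_fraud in zip(cluster_labels, fraud_flags):
        totals[cluster] += 1
        frauds[cluster] += 1 if is_fraud else 0
    return {cluster: frauds[cluster] / totals[cluster] for cluster in totals}

# Hypothetical cluster assignments and "fraud occurred" flags for 12 policyholders.
clusters = [1, 1, 1, 1, 2, 2, 2, 7, 7, 7, 7, 7]
flagged = [False, False, False, True, False, False, False, True, True, True, False, True]

rates = fraud_rate_by_cluster(clusters, flagged)
ranked = sorted(rates.items(), key=lambda kv: kv[1], reverse=True)
for cluster, rate in ranked:
    print(f"cluster {cluster}: {rate:.0%} flagged as fraud")
```

Clusters at the top of this ranking are the ones whose feature labels are worth examining first.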
2.3 Classification research of medical institutions
The classification research of medical institutions can also start with cluster migration analysis (the same method used for policyholders). Abroad, anti-fraud technology has been deeply integrated into the management processes of various institutions and has achieved good results.
2.4 Detection of Medical Claims
Given the way each institution handles the medical service process, reviewing fraud manually is difficult and costly. Combining the concept and experience of clinical pathways with data mining models that automatically identify a series of characteristics of each specific medical service, such as radiotherapy courses and chemotherapy treatment levels, is a major advance in fraud detection for the medical insurance industry. Domestic research and application in this area have also begun to deepen.
3. Summary