The Risk Detection of Credit Card Transaction in Practice (3)

Up to now, we have finished importing the data set and observing its problems; the issues found in that step are what motivate the preprocessing below.

In this section, Xiaoyu will take you through the data preprocessing part; the implementation of data set splitting, undersampling, and oversampling will be covered in the next part of the series~

As mentioned earlier, before starting machine learning we want the model to treat every feature equally. This requires that the numeric ranges of the features not differ too much, so that each feature's contribution to the result is determined by the weight it is multiplied by, not by its raw scale.

For this reason, we often standardize or normalize the features to rescale the data set: the distribution of the samples stays the same, but all features are scaled into the same space.

There are three common scaling tools:

- standardization with StandardScaler;
- standardization with RobustScaler;
- normalization, which rescales each feature into a fixed range.

The anonymized features V1-V28 in the data set already satisfy the requirement that the mean is 0 and the standard deviation is 1, so their standardization is complete. We only need to standardize Amount and Time.

sklearn provides StandardScaler and RobustScaler for standardizing a data set. If the data contain large outliers, they can distort a feature's mean and variance, and therefore distort the standardization result.

In that case it is more effective to scale with the median and the interquartile range (IQR), which is exactly the standardization logic of RobustScaler.
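To make this concrete, here is a minimal toy sketch (my own example, not code from the original post) that reproduces the RobustScaler logic by hand:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy column with one extreme outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

# RobustScaler logic: (x - median) / IQR, where IQR = Q3 - Q1
q1, median, q3 = np.percentile(x, [25, 50, 75])
manual = (x - median) / (q3 - q1)

print(np.allclose(manual, RobustScaler().fit_transform(x)))  # True
```

Because the median and IQR ignore the tails, the single outlier barely affects how the four "normal" values are scaled.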

Below are the original distribution histograms of Amount and Time:
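The figure itself is not reproduced here; as a sketch, the histograms can be drawn roughly like this, assuming the data was loaded into a DataFrame named df as in the earlier parts (the creditcard.csv path is an assumption):

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('creditcard.csv')  # path assumed from the earlier parts

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.hist(df['Amount'], bins=50)
ax1.set_title('Transaction Amount')
ax2.hist(df['Time'], bins=50)
ax2.set_title('Transaction Time')
plt.show()
```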

The following is the process of standardizing the two features with StandardScaler:
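The original code is not shown here, so the following is a minimal sketch of the usual sklearn pattern; the column names std_Amount and std_Time are my own:

```python
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# fit_transform expects a 2-D array, hence the double brackets
std_scaler = StandardScaler()
df['std_Amount'] = std_scaler.fit_transform(df[['Amount']])
df['std_Time'] = std_scaler.fit_transform(df[['Time']])

# Histograms of the standardized features
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.hist(df['std_Amount'], bins=50)
ax1.set_title('std_Amount')
ax2.hist(df['std_Time'], bins=50)
ax2.set_title('std_Time')
plt.show()
```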

Results:

After Time is standardized with StandardScaler, its distribution is symmetric about the origin and scaled into a reasonable interval.

The effect on the transaction Amount is less obvious here, so we use a boxplot to take a closer look:
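A minimal sketch of the boxplot, reusing the std_Amount column created above:

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(4, 6))
plt.boxplot(df['std_Amount'])
plt.title('std_Amount')
plt.show()
```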

After the transaction amount is standardized with StandardScaler, most of the data cluster around 0, but there are still many outliers, with the largest values far greater than 1.

Next, we use RobustScaler for standardization:
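Again a hedged sketch following the same pattern, with rob_Amount and rob_Time as my own column names:

```python
import matplotlib.pyplot as plt
from sklearn.preprocessing import RobustScaler

# Same pattern as before, swapping in RobustScaler
rob_scaler = RobustScaler()
df['rob_Amount'] = rob_scaler.fit_transform(df[['Amount']])
df['rob_Time'] = rob_scaler.fit_transform(df[['Time']])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.hist(df['rob_Amount'], bins=50)
ax1.set_title('rob_Amount')
ax2.hist(df['rob_Time'], bins=50)
ax2.set_title('rob_Time')
plt.show()
```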

The plotted results:

After Time is standardized, its distribution lies between -1 and 1 and is symmetric about the origin. The Amount distribution looks wider and its outliers are larger. Let's compare the two scalers with a boxplot:
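A sketch of a side-by-side comparison, reusing the columns created above:

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(6, 6))
plt.boxplot([df['std_Amount'], df['rob_Amount']],
            labels=['StandardScaler', 'RobustScaler'])
plt.title('Scaled Amount')
plt.show()
```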

The analysis above shows that even after standardization, the Amount data are still not handled well: there are still many outliers. For this reason, Xiaoyu will use normalization to scale all data features into the range 0~1.

Plot the normalized Amount and Time distributions:
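The original normalization code is not shown; the sketch below uses sklearn's MinMaxScaler, which is my assumption for how the 0~1 normalization was done (the norm_Amount and norm_Time names are my own):

```python
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

# Min-max normalization squeezes each feature into [0, 1]
mm_scaler = MinMaxScaler()
df['norm_Amount'] = mm_scaler.fit_transform(df[['Amount']])
df['norm_Time'] = mm_scaler.fit_transform(df[['Time']])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.hist(df['norm_Amount'], bins=50)
ax1.set_title('norm_Amount')
ax2.hist(df['norm_Time'], bins=50)
ax2.set_title('norm_Time')
plt.show()
```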

The plotted result:

After normalization, the data are scaled into the range 0~1.