Methods for outlier detection in data mining

Outlier detection is an important part of data mining. Its task is to discover objects that differ significantly from most other objects. Most data mining methods treat such deviating objects as noise and discard them. In some applications, however, the rare data may hold greater research value than the ordinary data.

Outlier detection is widely used in areas such as telecommunications and credit card fraud detection, loan approval, e-commerce, network intrusion detection and weather forecasting.

The main causes of outliers are data originating from different classes, natural variation, and data measurement and collection errors.

From the perspective of data scope, outliers are divided into global outliers and local outliers. Viewed against the data set as a whole, some objects show no outlier characteristics, yet viewed against their local neighborhood they show a certain degree of outlierness.

From the perspective of data type, outliers are divided into numerical outliers and categorical outliers, according to the attribute types in the data set.

From the perspective of the number of attributes, outliers are divided into one-dimensional outliers and multi-dimensional outliers, since an object may have one or more attributes.

Most statistics-based outlier detection methods build a probability distribution model, calculate the probability that each object conforms to the model, and regard objects with low probability as outliers. These methods presuppose that the distribution the data set follows is known, and for high-dimensional data the test may perform poorly.
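The statistical approach can be sketched for the one-dimensional case as follows. This is a minimal illustration, not a definitive implementation: it assumes the data are roughly normally distributed, and the function name `gaussian_outliers`, the sample readings and the threshold of 2.5 standard deviations are all hypothetical choices made for the example.

```python
from statistics import NormalDist

def gaussian_outliers(values, z_threshold=2.5):
    """Flag values that are improbable under a normal model fitted to the data.

    Assumption: the data are roughly Gaussian. The threshold is a
    hypothetical choice; in a small sample one extreme value inflates the
    estimated standard deviation, so very high thresholds may never trigger.
    """
    model = NormalDist.from_samples(values)  # estimates mean and sample stdev
    return [x for x in values if abs(x - model.mean) / model.stdev > z_threshold]

# Hypothetical sensor readings with one suspicious value.
readings = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.1, 25.0, 10.0]
flagged = gaussian_outliers(readings)
print(flagged)
```

If the data do not follow the assumed distribution, the z-scores lose their probabilistic meaning, which is the limitation described above.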

Proximity-based methods rely on the fact that it is usually possible to define a proximity measure between data objects, and treat objects that are far away from most other points as outliers. Two-dimensional or three-dimensional data can be inspected visually in a scatter plot. However, the approach is not suitable for large data sets, it is sensitive to parameter selection, and because it relies on a global threshold it cannot handle data sets that contain regions of different density.
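One common way to turn a proximity measure into an outlier score is to use the distance to the k-th nearest neighbor. The sketch below assumes Euclidean distance and a brute-force comparison of all pairs; the function name `knn_distance_scores`, the sample points and the value of `k` are hypothetical.

```python
import math

def knn_distance_scores(points, k=2):
    """Score each point by the Euclidean distance to its k-th nearest neighbor.

    Brute force: every pair of points is compared, so the cost grows
    quadratically with the number of points.
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Hypothetical data: four points close together and one far away.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (5.0, 5.0)]
scores = knn_distance_scores(points, k=2)
most_outlying = points[scores.index(max(scores))]
print(most_outlying)
```

The quadratic number of distance computations is one reason the approach scales poorly, and the result depends directly on the chosen `k`.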

Density-based methods take into account the fact that a data set may contain regions of different density. From a density-based point of view, outliers are objects in low-density regions, and an object's outlier score is the inverse of the density around the object. This gives a quantitative measure of the degree to which an object is an outlier and works well even if the data has regions of different density. However, it is not suitable for large data sets, and parameter selection is difficult.
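A density-based score can be sketched as below. Two assumptions are made for illustration: density is taken as the reciprocal of the mean distance to the k nearest neighbors, and each object's density is compared with the density of its neighbors (a relative-density variant, swapped in here so that regions of different density are judged on their own scale). The names `knn` and `relative_density_scores` and the sample data are hypothetical.

```python
import math

def knn(points, i, k):
    """Return (distance, index) pairs for the k nearest neighbors of points[i]."""
    dists = sorted((math.dist(points[i], q), j)
                   for j, q in enumerate(points) if j != i)
    return dists[:k]

def relative_density_scores(points, k=2):
    """Score = mean density of a point's neighbors divided by its own density.

    Density is assumed to be 1 / (mean distance to the k nearest neighbors).
    Scores near 1 mean the point is as dense as its surroundings; large
    scores mean it sits in a comparatively sparse spot. Exact duplicate
    points would give a zero distance sum and are not handled here.
    """
    neighbors = [knn(points, i, k) for i in range(len(points))]
    density = [k / sum(d for d, _ in nb) for nb in neighbors]
    scores = []
    for i, nb in enumerate(neighbors):
        neighbor_density = sum(density[j] for _, j in nb) / k
        scores.append(neighbor_density / density[i])
    return scores

# Hypothetical data: a tight cluster, a loose cluster, and one point
# that lies just outside the tight cluster.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),          # dense region
    (10.0, 10.0), (12.0, 10.0), (10.0, 12.0), (12.0, 12.0),  # sparse region
    (1.0, 1.0),                                              # isolated point
]
scores = relative_density_scores(points, k=2)
most_outlying = points[scores.index(max(scores))]
print(most_outlying)
```

In this example the points of the sparse region have a lower absolute density than the isolated point, so a single global density threshold would flag them first; the relative score instead singles out the isolated point.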

One way to use clustering to detect outliers is to discard small clusters that are far away from the other clusters. A more systematic method is to first cluster all the objects and then evaluate the degree to which each object belongs to its cluster. Finding outliers with clustering techniques can be highly effective, but the quality of the clusters produced by the clustering algorithm has a strong impact on the quality of the outliers that are found.
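The second, more systematic method can be sketched as follows. The example assumes that cluster labels have already been produced by some clustering algorithm (they are hard-coded here), and it measures how well an object belongs to its cluster by its distance to the cluster centroid relative to the median such distance in that cluster. That scoring rule, the function name `cluster_membership_scores` and the cut-off of 2.0 are hypothetical choices for illustration.

```python
import math
from collections import defaultdict
from statistics import median

def cluster_membership_scores(points, labels):
    """Score each point by its distance to its cluster centroid, divided by
    the median centroid distance within the same cluster.

    Scores well above 1 indicate a point that fits its cluster poorly.
    Clusters whose median distance is zero (for example a single point)
    are not handled in this sketch.
    """
    members = defaultdict(list)
    for p, label in zip(points, labels):
        members[label].append(p)

    centroids = {
        label: tuple(sum(c) / len(pts) for c in zip(*pts))
        for label, pts in members.items()
    }
    typical = {
        label: median(math.dist(p, centroids[label]) for p in pts)
        for label, pts in members.items()
    }
    return [math.dist(p, centroids[l]) / typical[l] for p, l in zip(points, labels)]

# Hypothetical data and labels, as if returned by a clustering algorithm.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2), (1.0, 1.1), (3.0, 3.0),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.1), (8.1, 8.2)]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1]

scores = cluster_membership_scores(points, labels)
flagged = [p for p, s in zip(points, scores) if s > 2.0]
print(flagged)
```

Because the outlier itself pulls the centroid of its cluster toward it, a poor clustering directly degrades the scores, which is the dependence on cluster quality noted above.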

Outlier detection methods based on statistical models rest on statistical principles; when the data actually follow the assumed distribution, the test can be very effective. Proximity-based outlier detection methods are more general and easier to use than statistical methods, because it is easier to determine a meaningful proximity measure for a data set than to determine its statistical distribution. Density-based outlier detection is closely related to proximity-based outlier detection, because density is commonly defined in terms of proximity. One definition takes density as the reciprocal of the average distance to the k nearest neighbors: if that distance is small, the density is high. The other is the definition used by the DBSCAN clustering algorithm: the density around an object is the number of objects within a specified distance d of that object.
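The DBSCAN-style definition of density can be sketched in a few lines. This is only the neighborhood count, not the DBSCAN clustering algorithm itself; the function name `neighborhood_counts`, the sample points and the radius `d = 1.0` are hypothetical, and the sketch assumes Euclidean distance and does not count an object as its own neighbor.

```python
import math

def neighborhood_counts(points, d):
    """Density of each point = number of other points within distance d.

    Assumption: the point itself is excluded from its own count.
    """
    return [
        sum(1 for j, q in enumerate(points) if j != i and math.dist(p, q) <= d)
        for i, p in enumerate(points)
    ]

# Hypothetical data: four nearby points and one isolated point.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5), (4.0, 4.0)]
counts = neighborhood_counts(points, d=1.0)
print(counts)
```

Objects with a low count are candidates for outliers; the result depends strongly on the choice of `d`, which reflects the parameter sensitivity discussed for proximity-based and density-based methods.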