Overview of Data Mining Concepts

1. What is data mining

1.1. History of data mining

Over the past decade or so, the ability to generate and collect data with information technology has grown enormously. Tens of millions of databases are now used in business management, government administration, scientific research and engineering development, and this growth shows no sign of slowing. This raises a new challenge: in an era of information explosion, information overload is a problem nearly everyone faces. How can we avoid being overwhelmed by the ocean of information, discover useful knowledge in time, and raise the utilization rate of information? Data becomes a genuine corporate resource only when it is fully exploited to serve the company's own business decisions and strategic development; otherwise, large volumes of data become a burden or even garbage. Hence the challenge that "people are drowning in data yet starving for knowledge." On the other hand, artificial intelligence, another branch of computer technology, has made significant progress since its birth in 1956. After the game-playing period, natural language understanding, knowledge engineering and other stages, the current research hotspot is machine learning, the science of using computers to simulate human learning, whose more mature algorithms include neural networks and genetic algorithms. Using database management systems to store data and machine learning methods to analyze it, so as to mine the knowledge hidden behind massive amounts of data, is the combination that gave rise to knowledge discovery in databases (KDD). Data mining and knowledge discovery (DMKD) technology thus emerged as the times required and has flourished, showing ever stronger vitality.

Data mining is also known as knowledge discovery in databases (KDD), data analysis, data fusion and decision support. The term KDD first appeared at the 11th International Joint Conference on Artificial Intelligence, held in August 1989. KDD workshops followed in 1991, 1993 and 1994, bringing together researchers and application developers from many fields to focus on issues such as data statistics, algorithms for analyzing massive data, knowledge representation and knowledge application. As the number of participants grew, the KDD workshop developed into an annual international conference. The Fourth International Conference on Knowledge Discovery and Data Mining, held in New York in 1998, combined academic discussion with an exhibition in which more than 30 software companies demonstrated data mining products, many of which had already been applied in North America, Europe and elsewhere.

1.2. The concept of data mining

From 1989 to the present, the definition of KDD has been continually refined as research has deepened. The currently accepted definition is the one given by Fayyad et al.: KDD is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns from data sets. From this definition it can be seen that data mining is the process of extracting information and knowledge that is hidden, previously unknown, yet potentially useful from large amounts of incomplete, noisy, fuzzy and random data.

Raw data are viewed as the source from which knowledge is formed, much as ore is mined for metal. The raw data may be structured, such as data in relational databases, or semi-structured, such as text, graphics and image data, or even heterogeneous data distributed across a network. The methods for discovering knowledge may be mathematical or non-mathematical, deductive or inductive. The discovered knowledge can be used for information management, query optimization, decision support and process control, as well as for maintenance of the data itself. Data mining is therefore a broad interdisciplinary field that brings together researchers from different areas, especially scholars and engineers in databases, artificial intelligence, mathematical statistics, visualization and parallel computing.

It should be emphasized that data mining has been application-oriented from the start. It is not simply search, query and retrieval against a particular database; it also performs statistics, analysis, synthesis and reasoning on the data at the micro, meso and even macro level in order to guide the solution of practical problems, to discover correlations between events, and even to predict future activity from existing data.

Generally speaking, it is called KDD in the field of scientific research, and it is called data mining in the field of engineering.

2. Data mining steps

KDD includes the following steps:

1. Data preparation

The data that KDD processes is usually voluminous, stored in database systems, and accumulated over a long period. It is often not suitable for knowledge mining in its raw form, so data preparation is required. This generally includes data selection (choosing the relevant data), cleaning (removing noise and redundant data), imputation (filling in missing values), transformation (converting between discrete and continuous values, grouping and classifying data values, computing combinations of data items, and so on) and data reduction (reducing the volume of data). If the object of KDD is a data warehouse, much of this work has usually already been done when the warehouse was built. Data preparation is the first step of KDD and an important one: how well the data is prepared affects the efficiency and accuracy of the mining and the effectiveness of the final model.
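
As a rough sketch of these preparation operations, the example below uses pandas on a hypothetical customer table; the file name and column names (customers.csv, age, income, region) are invented for illustration.

    import pandas as pd

    # Load the raw, long-accumulated data (hypothetical file)
    df = pd.read_csv("customers.csv")

    # Selection: keep only the fields relevant to the mining task
    df = df[["age", "income", "region"]]

    # Purification: remove redundant (duplicate) records
    df = df.drop_duplicates()

    # Imputation: fill missing income values with the column median
    df["income"] = df["income"].fillna(df["income"].median())

    # Conversion: discretize the continuous age field into groups
    df["age_group"] = pd.cut(df["age"], bins=[0, 30, 60, 120],
                             labels=["young", "middle", "senior"])

    # Reduction: keep a 10% sample to reduce the data volume
    df_small = df.sample(frac=0.1, random_state=0)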

2. Data mining

Data mining is the most critical step of KDD and also the most technically difficult. Most KDD researchers work on data mining techniques; commonly used techniques include decision trees, classification, clustering, rough sets, association rules, neural networks and genetic algorithms. According to the goals of the KDD task, data mining selects an appropriate algorithm and its parameters, analyzes the data, and obtains the pattern models that may form knowledge.

3. Evaluate and explain the pattern model

The pattern models obtained above may have no practical meaning or value, may fail to reflect the true meaning of the data accurately, or may in some cases even contradict the facts. They therefore need to be evaluated to determine which are valid and useful patterns. The evaluation may draw on the user's accumulated experience, and some models can be checked for accuracy directly against the data. This step also involves presenting the patterns to the user in an easily understandable form.
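
As a minimal sketch of such an accuracy check (using scikit-learn, with the public iris data standing in for mined business data), a candidate pattern model can be evaluated on records it has never seen:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    # The "pattern model" here is a small decision tree built on the training part
    model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

    # Check the model against held-out data before accepting it as knowledge
    print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))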

4. Consolidate knowledge

Pattern models that the user understands and judges to be practical and valuable become knowledge. At the same time, the consistency of this knowledge must be checked, and conflicts and contradictions with previously obtained knowledge resolved, so that the knowledge is consolidated.

5. Apply knowledge

Discovering knowledge is for application, and how to enable knowledge to be applied is also one of the steps of KDD.

There are two ways to use knowledge. One is simply to consult the relationships or results described by the knowledge itself, which can support decision-making. The other is to apply the knowledge to new data, which may raise new problems and require the knowledge to be further refined.

3. Characteristics and functions of data mining

3.1. Characteristics of data mining

Data mining has the following characteristics, which are of course closely related to the data being processed and to the purpose of the mining.

1. The scale of the data processed is huge.

2. Queries are generally ad hoc queries posed by decision makers (users), and it is often impossible to formulate precise query requirements in advance.

3. Because the data changes rapidly and may quickly become outdated, it is necessary to respond quickly to dynamic data in order to provide decision support.

4. Because mining relies mainly on statistical regularities in large samples, the rules discovered may not apply to every data item.

3.2. Functions of data mining

Data mining can discover the following types of knowledge:

Generalized knowledge, which reflects the common properties of things of the same kind;

Characteristic knowledge, which reflects the characteristic properties of the various aspects of a thing;

Differential knowledge, which reflects differences in attributes between different things;

Association knowledge, which reflects dependencies or associations between things;

Predictive knowledge, which infers future data from historical and current data;

Deviation knowledge, which reveals anomalies where things depart from the norm.

All of this knowledge can be discovered at different conceptual levels. As one climbs the concept tree from micro to meso to macro, the knowledge can meet the needs of different users and different levels of decision-making. For example, a typical association rule found in a supermarket data warehouse might be "customers who buy bread and butter almost always buy milk," or "customers who buy food almost always pay by credit card." Such rules are very useful when merchants devise and carry out tailored sales plans and strategies. As for discovery tools and methods, commonly used ones include classification, clustering, dimensionality reduction, pattern recognition, visualization, decision trees, genetic algorithms and uncertainty handling. In summary, data mining provides the following functions:

Prediction/verification function: The prediction/verification function refers to using several known fields of the database to predict or verify the values of other, unknown fields. Prediction methods include statistical analysis methods, association rules, decision tree prediction methods, regression tree prediction methods, and so on.

Description function: The description function refers to finding an understandable pattern that describes the data. Description methods include the following: data classification, regression analysis, clustering, generalization, constructing dependency patterns, change and deviation analysis, pattern discovery, path discovery, etc.

4. Data Mining Patterns

The task of data mining is to discover patterns in data. A pattern is an expression E in a language L that describes the characteristics of the data in a data set F; the data described by E form a subset F_E of F. For E to qualify as a pattern, it must be simpler than simply enumerating all the elements of F_E. For example, "if the grade is between 81 and 90, then the grade is excellent" can be called a pattern, whereas "if the grade is 81, 82, 83, 84, 85, 86, 87, 88, 89 or 90, then the grade is excellent" cannot.
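
As a toy illustration (with invented grades), the sketch below writes this pattern as a compact predicate and applies it to a data set to obtain the subset it describes:

    grades = [55, 67, 81, 84, 90, 93, 78, 88]        # the data set F (invented)

    def pattern(grade):
        # E: "if the grade is between 81 and 90, then the grade is excellent"
        return 81 <= grade <= 90

    covered = [g for g in grades if pattern(g)]      # the subset F_E described by E
    print(covered)                                   # [81, 84, 90, 88]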

Patterns come in many types. By function they can be divided into two categories: predictive patterns and descriptive patterns.

A predictive pattern is one that can accurately determine an outcome from the values of data items. The data from which predictive patterns are mined already have known outcomes. For example, from data on various animals a pattern can be established: all animals that bear live young are mammals.

When new animal data is available, this model can be used to determine whether the animal is a mammal.

A descriptive pattern is a description of the regularities that exist in the data, or a grouping of the data according to its similarity. Descriptive patterns cannot be used directly for prediction. An example is the statement that 70% of the Earth's surface is covered by water and 30% is land.

In practical applications, patterns are often subdivided into the following six types according to their actual function:

1. Classification pattern

A classification pattern is a classification function (classifier) that maps data items in the data set to one of a given set of classes. Classification patterns are often represented as a classification tree: starting from the root, one follows the branch whose test the data satisfies, and the class is determined when a leaf is reached.
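
As a minimal sketch (using scikit-learn and a tiny invented animal table that echoes the example later in the text), a classification tree can be built and read off as follows:

    from sklearn.tree import DecisionTreeClassifier, export_text

    # features: [gives_live_birth, can_fly], 1 = yes, 0 = no (invented records)
    X = [[1, 0], [1, 1], [0, 1], [0, 0], [1, 0], [0, 1]]
    y = ["mammal", "mammal", "bird", "reptile", "mammal", "bird"]

    tree = DecisionTreeClassifier(random_state=0).fit(X, y)

    # Read the tree from the root: follow the branch the record satisfies
    # until a leaf is reached, which gives the class
    print(export_text(tree, feature_names=["gives_live_birth", "can_fly"]))

    print(tree.predict([[1, 1]]))    # classify a new animal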

2. Regression pattern

The function computed by a regression pattern is defined similarly to that of a classification pattern; the difference is that the value predicted by a classification pattern is discrete, while the value predicted by a regression pattern is continuous. For example, given the characteristics of an animal, a classification pattern can decide whether the animal is a mammal or a bird; given a person's education and work experience, a regression pattern can estimate the person's annual salary, for instance whether it is below 6,000 yuan, between 6,000 and 10,000 yuan, or above 10,000 yuan.
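
A minimal regression sketch, assuming an invented table of education years, work years and annual salary (in yuan), might look like this; the point is only that the predicted value is continuous:

    from sklearn.linear_model import LinearRegression

    X = [[12, 1], [16, 2], [16, 8], [19, 5], [22, 10]]   # [education years, work years]
    y = [4800, 7500, 11000, 12000, 20000]                # annual salary (continuous)

    model = LinearRegression().fit(X, y)

    # The prediction is a continuous salary figure, not a discrete class
    print(model.predict([[16, 4]]))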

3. Time series pattern

A time series pattern predicts future values from the trend of the data over time. Here the special nature of time must be taken into account: periodic units such as weeks, months, seasons and years; the possible influence of particular days such as holidays; the way dates themselves are calculated; and other considerations such as how strongly the past influences the future. Only by fully accounting for the time factor and using series of values that change over time can future values be predicted well.
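
A very simple sketch of respecting the periodic nature of time is to forecast a month's value from the same calendar month in earlier years; the monthly sales figures below are invented:

    monthly_sales = {                # (year, month) -> sales
        (2021, 12): 120, (2022, 12): 132, (2023, 12): 145,
        (2023, 10): 85,  (2023, 11): 90,
    }

    def seasonal_forecast(history, month):
        # Average all past observations that fall in the given calendar month
        same_month = [v for (y, m), v in history.items() if m == month]
        return sum(same_month) / len(same_month)

    print(seasonal_forecast(monthly_sales, 12))   # forecast for the next December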

4. Clustering pattern

A clustering pattern divides the data into groups such that the differences between groups are as large as possible and the differences within a group are as small as possible. Unlike classification, before clustering we do not know how many groups there will be or what they will look like, nor which data items will be used to define the groups. Generally, someone with rich business knowledge should be able to interpret the meaning of the resulting groups; if a group is incomprehensible or unusable, the pattern may be meaningless and it is necessary to return to an earlier stage and reorganize the data.
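
As a minimal sketch (using scikit-learn's k-means on invented two-dimensional points, with the number of groups fixed at two only for the example), clustering assigns group labels without any labels being given in advance:

    from sklearn.cluster import KMeans

    points = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],    # one natural group
              [8.0, 9.0], [8.2, 9.1], [7.9, 8.8]]    # another natural group

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

    print(km.labels_)            # group label assigned to each point
    print(km.cluster_centers_)   # centre of each discovered group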

5. Association pattern

An association pattern is an association rule between data items. An association rule is a rule of the following form: "Among those who are unable to repay their loans, 60% have a monthly income of less than 3,000 yuan."

6. Sequence pattern

A sequence pattern is similar to an association pattern, but it ties the association between data items to time. To discover sequence patterns it is necessary to know not only whether an event occurred but also when it occurred. For example, among people who buy a color TV, 60% will buy a video player within three months.

5. Discovery tasks of data mining

Data mining draws on many subject areas and methods, and there are many ways to classify it. By mining task, it can be divided into discovery of classification or prediction models, data summary, clustering, association rule discovery, sequence pattern discovery, discovery of dependency models, anomaly and trend discovery, and so on. By mining object, there are relational databases, object-oriented databases, spatial databases, temporal databases, text data sources, multimedia databases, heterogeneous databases, legacy databases and the World Wide Web. By mining method, it can be roughly divided into machine learning methods, statistical methods, neural network methods and database methods. Machine learning can be further subdivided into inductive learning methods (decision trees, rule induction, etc.), example-based learning, genetic algorithms and so on.

Statistical methods can be subdivided into regression analysis (multiple regression, autoregression, etc.), discriminant analysis (Bayesian discriminant, Fisher discriminant, non-parametric discriminant, etc.), cluster analysis (hierarchical clustering, dynamic clustering, etc.) and exploratory analysis (principal component analysis, correlation analysis, etc.). Neural network methods can be subdivided into feedforward neural networks (the BP algorithm, etc.) and self-organizing neural networks (self-organizing feature maps, competitive learning, etc.). Database methods are mainly multidimensional data analysis, or OLAP, methods; there are also attribute-oriented induction methods.

From the perspective of mining tasks and mining methods, there are four very important discovery tasks: data summary, classification discovery, clustering and association rule discovery.

5.1. Data summary

The purpose of data summary is to condense the data and give a compact description of it. The traditional and simplest method is to compute statistics such as the sum, average and variance of each field in the database, or to present the data graphically with histograms, pie charts and the like. Data mining is mainly concerned with data summary from the perspective of data generalization, the process of abstracting the relevant data in a database from a low conceptual level to a higher one. Because the data or objects in a database always carry the most primitive, basic information (so that no potentially useful information is missed), people sometimes want to process or browse the data from a higher-level view, and the data therefore needs to be generalized to different levels to suit various query requirements. There are currently two main techniques for data generalization: multidimensional data analysis methods and attribute-oriented induction methods.
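
As a minimal sketch of the traditional kind of summary (with invented sales figures), compact statistics of a field can be computed directly:

    import statistics

    sales = [120, 95, 143, 160, 110, 98, 175]    # one field of the database (invented)

    summary = {
        "count": len(sales),
        "sum": sum(sales),
        "mean": statistics.mean(sales),
        "variance": statistics.pvariance(sales),
        "min": min(sales),
        "max": max(sales),
    }
    print(summary)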

1. The multidimensional data analysis method is a data warehouse technique, also called on-line analytical processing (OLAP). A data warehouse is a subject-oriented, integrated, non-volatile, time-variant collection of historical data built for decision support. Decision-making rests on data analysis, and data analysis frequently uses aggregation operations such as sum, count, average, maximum and minimum, which are computationally expensive. A natural idea is therefore to pre-compute and store the results of these aggregations so that the decision support system can use them directly; the place where such aggregated results are stored is called a multidimensional database. Multidimensional data analysis has been applied successfully in decision support systems, for example in the well-known SAS data analysis package, Business Objects' decision support system BusinessObjects, and IBM's decision analysis tools.

The multidimensional data analysis method performs data summary over a data warehouse, which stores offline historical data.
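
As a rough sketch of this kind of pre-aggregation along dimensions (using pandas on an invented sales table), the measure can be summed over region and quarter:

    import pandas as pd

    sales = pd.DataFrame({
        "region":  ["north", "north", "south", "south"],
        "quarter": ["Q1", "Q2", "Q1", "Q2"],
        "amount":  [100, 120, 80, 95],
    })

    # Aggregate the 'amount' measure along the region and quarter dimensions
    cube = sales.pivot_table(values="amount", index="region",
                             columns="quarter", aggfunc="sum")
    print(cube)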

2. To handle online data, researchers proposed the attribute-oriented induction method. Its idea is to generalize directly the data view the user is interested in (which can be obtained with an ordinary SQL query), rather than storing generalized data in advance as the multidimensional data analysis method does. The method's proposers call this data generalization technique attribute-oriented induction. Applying generalization operations to the original relation yields a generalized relation, which summarizes the original, lower-level relation at a higher level. Once the generalized relation is available, various further operations can be performed on it to produce knowledge that meets the user's needs, such as generating characteristic rules, discriminant rules, classification rules and association rules from the generalized relation.

5.2. Classification discovery

Classification is a very important data mining task and currently the one most commonly used in business. The purpose of classification is to learn a classification function or classification model (often called a classifier) that maps data items in the database to one of a given set of classes. Both classification and regression can be used for prediction. The purpose of prediction is to derive automatically from historical data records a generalized description of the given data, so that future data can be predicted. The difference from regression is that the output of classification is a discrete class value, while the output of regression is a continuous value.

To construct a classifier, a training sample data set is required as input.

The training set consists of a set of database records, or tuples; each tuple is a feature vector made up of the values of the relevant fields (also called attributes or features), and each training sample additionally carries a class label. A specific sample can take the form (v1, v2, ..., vn; c), where vi is a field value and c is the class.

Classifier construction methods include statistical methods, machine learning methods and neural network methods. Statistical methods include Bayesian methods and non-parametric methods (nearest-neighbour learning or case-based learning), whose corresponding knowledge representations are discriminant functions and prototype cases. Machine learning methods include decision tree methods and rule induction methods; the former are represented by decision trees or discriminant trees, the latter generally by production rules. The neural network method is mainly the BP algorithm, whose model representation is a feedforward neural network (an architecture of nodes representing neurons and edges representing connection weights); the BP algorithm is essentially a nonlinear discriminant function. In addition, a newer method has emerged, rough sets, whose knowledge representation is production rules.

Different classifiers have different characteristics. There are three criteria for evaluating or comparing classifiers: (1) prediction accuracy; (2) computational complexity; (3) simplicity of the model description. Prediction accuracy is the most commonly used criterion, especially for predictive classification tasks; the currently accepted estimation method is 10-fold stratified cross-validation. Computational complexity depends on the implementation details and the hardware environment; since data mining operates on huge databases, space and time complexity are a very important concern. For descriptive classification tasks, the more concise the model description the better: classifiers represented as rules, for example, are more useful, whereas the results produced by neural network methods are hard to interpret.
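
As a minimal sketch of the 10-fold stratified cross-validation estimate of prediction accuracy (using scikit-learn on synthetic data in place of a real business table):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    # Split the data into 10 stratified folds and average the accuracy
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

    print("mean accuracy over 10 folds:", scores.mean())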

In addition, it should be noted that the effect of classification is generally related to the characteristics of the data. Some data are noisy, some have missing values, some are sparsely distributed, and some fields or attributes are highly correlated. Some attributes are discrete and some are continuous or mixed. It is generally believed that there is no method suitable for data with various characteristics.

5.3. Clustering

Clustering groups a set of individuals into classes according to their similarity ("birds of a feather flock together"). Its aim is to make the distance between individuals in the same class as small as possible and the distance between individuals in different classes as large as possible. Clustering methods include statistical methods, machine learning methods, neural network methods and database-oriented methods.

In statistics, clustering is called cluster analysis, one of the three main branches of multivariate data analysis (the other two being regression analysis and discriminant analysis). It mainly studies clustering based on geometric distance, such as Euclidean distance and Minkowski distance. Traditional statistical cluster analysis methods include hierarchical clustering, decomposition, joining, dynamic clustering, ordered-sample clustering, overlapping clustering and fuzzy clustering. These methods cluster by global comparison: all individuals must be examined before the classes can be determined, so all the data must be available in advance and new data objects cannot be added dynamically. Moreover, cluster analysis does not have linear computational complexity and is hard to apply when the database is very large.
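
For reference, the geometric distances mentioned above can be written in a few lines; Euclidean distance is the Minkowski distance with p = 2:

    def minkowski(x, y, p=2):
        # Minkowski distance; p = 2 gives Euclidean, p = 1 gives Manhattan
        return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

    a, b = (1.0, 2.0), (4.0, 6.0)
    print(minkowski(a, b, p=2))   # 5.0 (Euclidean)
    print(minkowski(a, b, p=1))   # 7.0 (Manhattan)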

In machine learning, clustering is called unsupervised or teacher-less induction, because, in contrast to classification learning, the examples or data objects to be clustered carry no class labels; the classes must be determined automatically by the clustering algorithm. In much of the artificial intelligence literature clustering is also called concept clustering, because the distance used here is no longer the geometric distance of statistical methods but is determined from descriptions of concepts. When objects can be added to clusters dynamically, concept clustering is called concept formation.

In neural networks, there is a type of unsupervised learning method: self-organizing neural network method; such as Kohonen self-organizing feature mapping network, competitive learning network, etc.

In the field of data mining, the reported neural network clustering method is mainly the self-organizing feature mapping method. IBM specifically mentioned the use of this method for database clustering and segmentation in its data mining white paper.

5.4. Association rule discovery

An association rule is a rule of the form "among customers who bought bread and butter, 90% also bought milk" (bread and butter → milk). The main object of association rule discovery is the transactional database, and the application it targets is sales data, also known as basket data. A transaction generally consists of the transaction time, the group of items purchased by the customer, and sometimes a customer identification number (such as a credit card number).

Thanks to bar-code technology, retailers can use point-of-sale registers to collect and store large volumes of sales data. Analyzing these historical transaction data yields extremely valuable information about customers' purchasing behaviour: for example, it can help decide how to arrange goods on the shelves (placing items that customers often buy together near each other) and how to plan promotions (which items to bundle). Discovering association rules from transaction data is therefore very important for improving decision-making in retailing and other commercial activities.

If support and confidence are not taken into account, there are infinitely many association rules in a transaction database. In practice people are generally interested only in association rules that meet certain support and confidence levels; in the literature, rules that satisfy given (sufficiently large) support and confidence thresholds are called strong rules. To find meaningful association rules, two thresholds therefore need to be given: minimum support and minimum confidence. The former is the minimum support that a user-specified association rule must satisfy, representing the minimum degree to which the item set must hold in a statistical sense; the latter is the minimum confidence that the rule must satisfy, reflecting its minimum reliability.
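
As a minimal sketch of these two thresholds (over a few invented basket transactions), rules between pairs of items can be kept only when they are strong:

    from itertools import combinations

    transactions = [
        {"bread", "butter", "milk"},
        {"bread", "butter"},
        {"bread", "milk"},
        {"butter", "milk"},
        {"bread", "butter", "milk"},
    ]
    min_support, min_confidence = 0.4, 0.7
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset
        return sum(itemset <= t for t in transactions) / n

    items = sorted(set().union(*transactions))
    for x, z in combinations(items, 2):
        for a, b in ((x, z), (z, x)):          # consider both rule directions
            supp = support({a, b})
            conf = supp / support({a})
            if supp >= min_support and conf >= min_confidence:
                print(f"{a} -> {b}  support={supp:.2f}  confidence={conf:.2f}")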

In practice, a more useful kind of association rule is the generalized association rule, which exploits the hierarchical relationships among item concepts: for example, jackets and ski shirts belong to the coat category, and coats and shirts belong to the clothing category. With such hierarchies, more meaningful rules can be discovered, for example "buys a coat → buys shoes" (here coat and shoes are higher-level items or concepts, so the rule is a generalized rule). Since a store or supermarket carries thousands of items, the support of any individual item (such as a ski shirt) is on average very low, so useful rules are sometimes hard to find; but higher-level items (such as coats) have higher support, so useful rules may be discovered at that level. In addition, the ideas behind association rule discovery can be used to find sequential patterns in customers' purchases, where the rules involve time or order: customers often buy items related to what they bought last time, and then later buy something related again.