Data mining algorithms and their application cases in everyday life

How do we filter spam, detect fraudulent transactions, grade the quality of a red wine, recognize text with a scanning app, decide whether an anonymous work was written by a famous author, or determine whether a cell is a tumor cell? All of these questions sound highly specialized and hard to answer. But with even a little knowledge of data mining, they start to feel much more approachable.

In this article, I want to briefly introduce the main types of data mining algorithms, and then illustrate each type with accessible, vivid real-life cases. Broadly speaking, data mining algorithms fall into four types: classification, prediction, clustering, and association. The first two belong to supervised learning; the latter two belong to unsupervised learning, which is descriptive pattern recognition and discovery.

Supervised learning: there is a target variable, so the task is to explore the relationship between the feature variables and the target variable, with the algorithm learned and optimized under the "supervision" of that target. A credit scoring model is a typical example of supervised learning: the target variable is "default or not", and the purpose of the algorithm is to study the relationship between the feature variables (demographics, asset attributes, and so on) and that target.

Classification algorithms: the biggest difference between classification and prediction is that the target variable of the former is discrete (overdue or not, tumor cell or not, spam or not), while the target variable of the latter is continuous. Common classification algorithms include logistic regression, decision trees, KNN, Bayesian discriminant analysis, SVM, random forests, neural networks, and so on.

Prediction algorithms: the target variable is generally a continuous variable. Common algorithms include linear regression, regression trees, neural networks, SVM, and so on.
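
As a minimal sketch of a prediction algorithm, here is ordinary least-squares linear regression with one feature, computed in closed form. The (x, y) sample is made up purely for illustration.

```python
# Minimal least-squares linear regression on a toy (x, y) sample.
# The data points are illustrative, not from any real dataset.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# slope = covariance(x, y) / variance(x); intercept makes the line pass
# through the point of means (mx, my).
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

print(round(slope, 2), round(intercept, 2))
```

The fitted line can then predict a continuous target value for any new x, which is exactly the "continuous target variable" that distinguishes prediction from classification.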

Unsupervised learning: there is no target variable; internal patterns and structure among the variables are identified from the data itself. Association analysis, for example, finds relationships between item A and item B from the data; cluster analysis divides all samples into several stable, distinguishable groups by distance. Both are pattern recognition and analysis without the supervision of a target variable.

Cluster analysis: the purpose is to segment the samples, so that samples within the same group are similar while samples in different groups differ markedly. Common clustering algorithms include k-means, hierarchical clustering, density-based clustering, and so on.

Association analysis: the purpose is to find the hidden relationships between items. It often takes the form of market-basket analysis, that is, finding which products consumers tend to buy at the same time (swim trunks and sunscreen, say), which helps merchants with bundled sales.

Cases and applications of data mining: the four types above (classification, prediction, clustering, and association) are the traditional, common ones. There are also other interesting algorithm families and application scenarios, such as collaborative filtering, outlier analysis, social network analysis, and text analysis. Below I introduce the real presence of data mining in daily life, organized by algorithm type, through examples that are easy to relate to.

Cases based on classification models: I want to introduce two. One is the classification of spam email; the other is an application in biomedicine, the identification of tumor cells.

How does a mail system decide whether an email is spam? This falls under text mining and is usually handled with the naive Bayes method. The main idea is to judge whether the words in the email body appear frequently in known spam. For example, if the body contains words such as "reimbursement", "invoice", and "promotion", the probability that the email is judged as spam increases.

Generally speaking, judging whether an email is spam involves the following steps.

First, the email body is decomposed into its words; suppose an email contains 100 words.

Second, using Bayes' rule, compute the probability that an email containing those 100 words is spam, and the probability that it is normal mail. If the spam probability is greater, the email is classified as spam.
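
The two steps above can be sketched as a tiny naive Bayes classifier. The training "emails" and word lists below are made-up illustrations, not a real corpus, and Laplace smoothing is added so unseen words do not zero out the probabilities.

```python
import math
from collections import Counter

# Toy training data: each email is a bag of words with a spam/ham label.
# All words and emails here are invented for illustration.
train = [
    (["invoice", "reimbursement", "promotion", "free"], "spam"),
    (["promotion", "discount", "invoice"], "spam"),
    (["meeting", "schedule", "report"], "ham"),
    (["project", "report", "deadline"], "ham"),
]

def train_naive_bayes(data):
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    vocab = set()
    for words, label in data:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return word_counts, class_counts, vocab

def log_posterior(words, label, word_counts, class_counts, vocab):
    total = sum(word_counts[label].values())
    # Log prior of the class, plus Laplace-smoothed word log-likelihoods.
    score = math.log(class_counts[label] / sum(class_counts.values()))
    for w in words:
        score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
    return score

def classify(words, model):
    word_counts, class_counts, vocab = model
    return max(("spam", "ham"),
               key=lambda c: log_posterior(words, c, word_counts, class_counts, vocab))

model = train_naive_bayes(train)
print(classify(["invoice", "promotion"], model))  # spam-flavored words
print(classify(["meeting", "report"], model))     # ham-flavored words
```

A production spam filter adds much more (tokenization, stop words, huge corpora), but the core comparison of two class probabilities is exactly this.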

How are tumor cells identified in medical diagnosis? Tumor cells differ from ordinary cells, but traditionally it takes a very experienced doctor reading pathological slides to tell them apart. With machine learning, the system can identify tumor cells automatically, and efficiency rises dramatically. Moreover, cross-validating the subjective (doctor) and objective (model) judgments can make the conclusion more reliable.

How does it work? Through a classification model, in two steps. First, a series of indicators describe cell characteristics, such as cell radius, texture, perimeter, area, smoothness, symmetry, and concavity; these make up the cell-feature data. Second, on this wide table of cell features, a classification model is built to judge whether a cell is a tumor cell.
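
As a minimal sketch of such a classifier, here is k-nearest-neighbors on a handful of hypothetical cell-feature rows (radius, perimeter, concavity). The feature values and labels are invented for illustration, not real pathology measurements.

```python
import math

# Hypothetical cell-feature rows: (radius, perimeter, concavity) with a label.
# Values are illustrative only, not real clinical data.
cells = [
    ((9.0, 58.0, 0.02), "benign"),
    ((10.5, 65.0, 0.04), "benign"),
    ((11.0, 70.0, 0.05), "benign"),
    ((17.5, 115.0, 0.24), "malignant"),
    ((19.0, 125.0, 0.30), "malignant"),
    ((20.5, 135.0, 0.35), "malignant"),
]

def knn_classify(features, data, k=3):
    # Vote among the k training cells closest in Euclidean distance.
    nearest = sorted(data, key=lambda row: math.dist(features, row[0]))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

print(knn_classify((18.0, 120.0, 0.28), cells))  # lands near the malignant group
```

In practice features on different scales should be standardized first, and any of the classification algorithms listed earlier (logistic regression, decision trees, SVM) could replace k-NN on the same wide table.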

Cases based on prediction models: here I mainly want to introduce two. One is judging and predicting the quality of red wine from its chemical characteristics; the other is predicting the fluctuation and trend of stock prices from search engine data.

How do you judge the quality of a red wine? Experienced drinkers will say that what matters most is the taste, which is influenced by many factors: vintage, origin, climate, wine-making technique, and so on. But statisticians do not have time to taste every kind of wine; they believe quality can be judged well from certain chemical properties. Many wineries in fact do the same thing, controlling the quality and taste of their wine by monitoring its chemical composition.

So how, concretely, is wine quality judged?

The first step is to collect many wine samples and measure their chemical characteristics, such as acidity, sugar content, chloride content, sulfur content, alcohol content, pH, density, and so on.

The second step is to predict the quality grade of the wine with a classification and regression tree (CART) model.
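
A regression tree grows by repeatedly splitting the data on the feature threshold that most reduces squared error. A one-split "stump" shows the core of that idea; the wine rows (alcohol, volatile acidity, quality score) below are toy values, not the real wine-quality dataset.

```python
# Toy wine rows: (alcohol, volatile_acidity) -> quality score. Illustrative only.
wines = [
    ((9.4, 0.70), 5), ((9.8, 0.88), 5), ((9.5, 0.76), 5),
    ((11.2, 0.28), 7), ((12.0, 0.30), 7), ((11.8, 0.32), 8),
]

def best_stump(data, feature):
    """Find the split threshold on one feature that minimizes squared error."""
    xs = sorted(x[feature] for x, _ in data)
    best = None
    for i in range(len(xs) - 1):
        t = (xs[i] + xs[i + 1]) / 2  # candidate threshold between two values
        left = [y for x, y in data if x[feature] <= t]
        right = [y for x, y in data if x[feature] > t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((y - ml) ** 2 for y in left)
               + sum((y - mr) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    return best[1:]  # (threshold, left mean, right mean)

def predict(stump, x, feature=0):
    t, ml, mr = stump
    return ml if x[feature] <= t else mr

stump = best_stump(wines, feature=0)  # split on alcohol content
print(predict(stump, (12.5, 0.25)))   # predicted quality for a new wine
```

A full CART model applies this split search recursively over all features; libraries such as scikit-learn implement it directly.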

Search volume and stock price fluctuations: a butterfly in the South American rain forest, with an occasional flap of its wings, can set off a tornado in Texas two weeks later. Can your online searches likewise affect the fluctuation of a company's stock price?

It was shown long ago that the search volume of certain Internet keywords (such as "influenza") can predict a flu outbreak in a region one to two weeks ahead of the CDC.

Similarly, some scholars have found that changes in a company's search volume on the Internet significantly affect the fluctuation and trend of its stock price, the so-called investor attention theory. According to this theory, a company's search volume in a search engine represents the degree of investor attention the stock receives. When the search frequency of a stock rises, investors are paying more attention to it, individual investors become more likely to buy it, and this in turn pushes the price up and produces positive returns. Numerous papers have verified this.

A case based on association analysis: Wal-Mart's "beer and diapers" is a very, very old story. Wal-Mart noticed an interesting phenomenon: placing diapers and beer together greatly increased the sales of both. The explanation given is that American mothers usually look after the children at home and often ask their husbands to buy diapers on the way home from work, and the husbands would pick up their favorite beer at the same time. Wal-Mart discovered this association in the data and shelved the two products side by side, greatly increasing sales of both.

The beer-and-diapers story is mainly about associations between products: if a large amount of data shows that consumers who buy product A also tend to buy product B, then A and B are associated. The bundled sales of two products that we often see in supermarkets are likely the result of association analysis.
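
The two basic measures behind association rules, support (how often items co-occur) and confidence (how often B follows given A), can be computed directly from a transaction log. The transactions below are invented to echo the beer-and-diapers story.

```python
# Toy transaction log; item names are illustrative.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"beer", "diapers", "milk"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in the set.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # P(consequent | antecedent) = support(A and B) / support(A)
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

print(support({"beer", "diapers"}, transactions))       # co-occurrence rate
print(confidence({"diapers"}, {"beer"}, transactions))  # rule: diapers -> beer
```

Algorithms such as Apriori or FP-Growth are essentially efficient ways of searching for itemsets whose support and confidence exceed chosen thresholds.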

A case based on cluster analysis: retail customer segmentation is quite common. Segmentation divides the customer base into groups whose members are similar within a group but differ between groups. The purpose is to identify distinct customer groups and then design and push products precisely for each group, saving marketing costs and improving marketing efficiency.

For example, to segment a commercial bank's retail customers, distances between customers are computed from the customers' feature variables (demographic, asset, liability, and settlement characteristics). Similar customers are then grouped together by distance, segmenting the base effectively, for example into wealth-management preference, fund preference, demand-deposit preference, treasury-bond preference, risk balancers, channel preference, and so on.
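
The distance-based grouping described above is what k-means does. Here is a plain sketch on hypothetical customers described by two pre-scaled features (assets, settlement volume); the data and the choice of two clusters are illustrative assumptions.

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: assign each point to its nearest centroid, then recompute."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[i].append(p)
        # New centroid = mean of the cluster; keep the old one if a cluster empties.
        centroids = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical customers as (assets, settlement volume), already standardized.
customers = [(1.0, 1.2), (0.9, 1.0), (1.1, 0.9),
             (5.0, 5.2), (5.1, 4.8), (4.9, 5.0)]
centroids, clusters = kmeans(customers, k=2)
print(sorted(len(c) for c in clusters))  # the two natural groups
```

Real segmentation projects standardize many more features, choose k with criteria such as the silhouette score, and then profile each cluster to give it a business label like "fund preference".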

A case based on outlier analysis: when Alipay screens payments for fraud, or when a credit card is swiped, the system judges in real time whether the transaction is likely fraudulent, based on the time, place, merchant name, amount, frequency, and other factors. The basic principle is to look for outliers; if a card transaction is judged abnormal, it may be terminated.

Outlier judgments rely on a fraud rule base, with roughly two types of rules: event rules and model rules. Event rules check whether the transaction time is abnormal (swiping in the small hours), the place is abnormal (an unfamiliar location), the merchant is abnormal (a blacklisted cash-out merchant), the amount is abnormal (deviating from the normal average by more than three standard deviations), or the frequency is abnormal (dense, high-frequency swiping). Model rules determine algorithmically whether a transaction is fraudulent, usually framed as a classification problem with a model built on payment, seller, and settlement data.
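
The "three standard deviations" event rule can be sketched directly: compare a new transaction amount against the card's own history. The amounts below are hypothetical.

```python
import statistics

# Hypothetical historical transaction amounts for one card.
history = [52.0, 47.5, 61.0, 55.0, 49.0, 58.0, 50.5, 53.0]

def is_anomalous(amount, history):
    """Flag an amount more than three standard deviations from the historical mean."""
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    return abs(amount - mean) > 3 * sd

print(is_anomalous(990.0, history))  # a sudden large purchase
print(is_anomalous(54.0, history))   # an ordinary purchase
```

Note that the baseline statistics are computed from past transactions only; including the suspicious amount itself would inflate the standard deviation and mask the outlier. Robust variants use the median and MAD for the same reason.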

A case based on collaborative filtering: e-commerce "guess what you like" recommendation engines should be the most familiar to everyone. When shopping at JD.COM or Amazon, you always see "Guess what you like", "Recommended for you based on your browsing history", "Customers who bought this item also bought", and "Customers who viewed this item ultimately bought". These are all the output of a recommendation engine at work.

I particularly like Amazon's recommendations: through "customers who bought this item also bought", I often discover books of higher quality and wider recognition. Generally speaking, an e-commerce "guess what you like" (i.e., the recommendation engine) is built on collaborative filtering, with a rule base tailored to its own business. That is, the algorithm considers the choices and behavior of other customers, builds an item-similarity matrix and a user-similarity matrix on that basis, and then finds the most similar customers or most related products to complete the recommendation.
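
The item-similarity side of collaborative filtering can be sketched with cosine similarity over a tiny user-item rating matrix. The users, items, and ratings below are entirely made up.

```python
import math

# Hypothetical user -> {item: rating} matrix; all names and scores invented.
ratings = {
    "alice": {"book_a": 5, "book_b": 4, "book_c": 1},
    "bob":   {"book_a": 4, "book_b": 5},
    "carol": {"book_b": 4, "book_c": 5, "book_d": 4},
    "dave":  {"book_a": 5, "book_b": 4, "book_d": 2},
}

def item_vector(item):
    # This item's ratings across all users (0 where unrated).
    return [ratings[u].get(item, 0) for u in sorted(ratings)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def most_similar(item, items):
    # "Customers who bought X also bought ..." = the item with the closest
    # rating pattern across users.
    others = [i for i in items if i != item]
    return max(others, key=lambda i: cosine(item_vector(item), item_vector(i)))

items = ["book_a", "book_b", "book_c", "book_d"]
print(most_similar("book_a", items))
```

Production systems handle millions of sparse rows, subtract per-user rating means, and mix in browsing and purchase signals, but the similarity-matrix idea is the same.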

A case based on social network analysis: seed customers and social networks first appeared in the telecommunications field. From people's call records, their relationship network can be sketched out. Networks in telecom are generally used to analyze the relationship between customer influence and customer churn, and to support product diffusion.

From call records, a customer influence index system can be established. The indicators roughly include first-degree contacts, second-degree contacts, third-degree contacts, average call frequency, average call volume, and so on. Analysis based on social influence shows that the churn of a high-influence customer tends to trigger the churn of related customers. And in product diffusion, choosing high-influence customers as the starting point of propagation makes it easier for new plans to spread and penetrate.
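
The simplest of those indicators, first-degree contacts, is just degree centrality in the call graph. Here is a sketch over a handful of invented (caller, callee) records.

```python
# Hypothetical call records as (caller, callee) pairs; names are made up.
calls = [
    ("ann", "bob"), ("ann", "cat"), ("ann", "dan"),
    ("bob", "cat"), ("dan", "ann"), ("eve", "ann"),
]

def degree_centrality(calls):
    """Count distinct contacts per customer as a simple influence proxy."""
    contacts = {}
    for a, b in calls:
        # Treat the graph as undirected: a call links both parties.
        contacts.setdefault(a, set()).add(b)
        contacts.setdefault(b, set()).add(a)
    return {person: len(c) for person, c in contacts.items()}

scores = degree_centrality(calls)
print(max(scores, key=scores.get))  # the most connected customer
```

Real influence scoring would weight edges by call frequency and volume and look beyond first-degree neighbors (e.g., PageRank-style measures), but degree is the usual starting point.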

In addition, social networks have many applications and cases in banking (guarantee networks), insurance (gang fraud), and the Internet (social networking).

Cases based on text analysis: I mainly want to introduce two. One is scanning apps such as "Scan King" (CamScanner), which turn photographed paper documents directly into electronic text; many people have used them, and I briefly explain the principle here. The other: it has long been rumored that the first eighty and last forty chapters of A Dream of Red Mansions were not all written by Cao Xueqin, and I discuss this from a statistical point of view.

Character recognition: scanning apps such as Scan King automatically recognize text when you photograph a document; they can scan books and convert the scanned content into a Word document. This involves image recognition and optical character recognition (OCR). Image recognition is the more complicated part; character recognition is easier to understand.

From what I have read, the general principle of character recognition is as follows, taking the character S as an example.

First, the character image is normalized to a standard pixel size, such as 12×16. Note that images are composed of pixels, and a character image mainly contains black and white pixels.

Second, the character's feature vector is extracted using two-dimensional histogram projection. The 12×16 pixel map is projected onto its two axes: the cumulative number of black pixels is computed for each pixel row and for each pixel column, giving 12 values in one direction and 16 in the other. Together these form a 28-dimensional character feature vector.
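
The projection step can be sketched directly. The 12×16 size follows the text; the toy "image" (a bar of black pixels across the top rows) is an invented stand-in for a real scanned character.

```python
# A hypothetical 12-wide by 16-tall binary character image (1 = black pixel).
WIDTH, HEIGHT = 12, 16

def projection_features(image):
    """Project black-pixel counts onto rows and columns.

    Returns a (HEIGHT + WIDTH)-dimensional feature vector:
    one count per pixel row, then one count per pixel column.
    """
    rows = [sum(row) for row in image]                                # per-row counts
    cols = [sum(image[r][c] for r in range(HEIGHT)) for c in range(WIDTH)]  # per-column
    return rows + cols

# Toy test image: a solid black bar across the top three rows.
image = [[1] * WIDTH if r < 3 else [0] * WIDTH for r in range(HEIGHT)]
features = projection_features(image)
print(len(features))  # 16 + 12 = 28 dimensions
```

Different characters leave different row/column black-pixel profiles, which is what makes this compact 28-dimensional vector usable as classifier input.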

Third, based on these character feature vectors, a neural network learns to recognize and classify characters effectively.

Literary works and statistics: the authorship of A Dream of Red Mansions is a famous and still unsettled debate. It is generally believed that the first 80 chapters were written by Cao Xueqin and the last 40 by Gao E. The core problem is to determine whether the first 80 and last 40 chapters differ significantly in wording and style.

This is the kind of problem that excites statisticians. Some scholars make judgments by counting the frequencies of nouns, verbs, adjectives, adverbs, and function words, and the correlations between different parts of speech. Some use function words (such as zhi, qi, huo, yi, le, de, bu, bie, and hao) to judge differences in writing style between the two parts. Others make statistical judgments from differences in the frequency of scene elements (flowers, trees, food, medicine, poetry). In short, style is quantified through indicators, and the indicators are tested for significant differences, from which the writing styles are compared.
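
The function-word approach can be sketched as building a frequency profile per text and measuring the distance between profiles. The English "texts" and word list below are tiny stand-ins; a real study would count classical Chinese function words across full chapters.

```python
import math

# Illustrative function-word list; a real study would use the classical
# Chinese particles mentioned in the text.
FUNCTION_WORDS = ["the", "of", "and", "to", "not"]

def frequency_profile(text, words=FUNCTION_WORDS):
    """Relative frequency of each function word in the text."""
    tokens = text.lower().split()
    return [tokens.count(w) / len(tokens) for w in words]

def style_distance(a, b):
    """Euclidean distance between two frequency profiles."""
    return math.dist(frequency_profile(a), frequency_profile(b))

early = "the garden of the house and the pond to the east"
late = "not the end and not the start of anything to come"

print(frequency_profile(early))
print(style_distance(early, late))
```

With chapter-level profiles, one would then run a significance test (e.g., a chi-square test on the counts) to ask whether the two sections plausibly share one author's style.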

The above covers the main data mining algorithms and their application cases in everyday life.