Models actually used in enterprise data mining
This article was written several years ago. I came across it again today, and I think it will be useful to many friends who work in data mining. It is worth keeping as a reference.
I listened to several talks about data mining and data models given by colleagues in the company and by outside experts. Overall they were very rewarding, although what I gained was not specific technical detail. It was more about viewpoints and concepts.
I have worked on many models before: from the most basic clustering, decision trees, logistic regression, regression analysis, survival analysis, and neural networks, to the conjoint analysis, perceptual mapping, and factor analysis/principal component analysis used in market research, and of course the more advanced structural equation models. During the year I spent at a futures company, I also worked on econometric models: the ARMA family, the ARCH family, VaR, and so on. At that time I had very little confidence in the models I built. Because their fit metrics (for example, R-squared) did not reach the 90%-plus levels seen in the legendary papers at school or when playing with models for fun, I felt that my models were not good enough and not perfect.
Last year, wanting to learn, I joined an Internet company with an extremely rich supply of data. I wanted to see how far large companies go with their data. Although I had talked with many talented people before, I always felt that it could not be as simple as they described.
After arriving at the new company, I met several colleagues who do modeling and listened to talks by external experts. To a certain extent, I felt relieved. I realized that when I built models in the past, it was more like doing academic research. Maybe that has something to do with my being a perfectionist.
For example, I used to worry about:
The assumptions of the model and the distribution requirements of the data;
The variable selection of the model and the various kinds of variable preprocessing;
Trying every model I could find on the Internet that served the final goal. For example, for a member churn problem I would try decision trees, logistic regression, and survival analysis, and then pick the one with the largest final lift value.
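The "try several models and keep the one with the largest lift" habit described above can be sketched roughly as follows. This is only an illustration under stated assumptions: lift is taken here to mean the response rate among the top-scored fraction of members divided by the overall response rate, and the labels, scores, and function names (top_fraction_lift, pick_by_lift) are all made up for the example. In practice the scores would come from each fitted model applied to held-out data.

```python
def top_fraction_lift(y_true, scores, fraction=0.1):
    """Lift among the top `fraction` of cases, ranked by score (highest first).

    Assumed definition: (churn rate in the top group) / (overall churn rate).
    """
    if len(y_true) != len(scores) or not y_true:
        raise ValueError("y_true and scores must be non-empty and the same length")
    overall_rate = sum(y_true) / len(y_true)
    if overall_rate == 0:
        raise ValueError("lift is undefined when there are no positive cases")
    n_top = max(1, round(len(y_true) * fraction))
    # Rank cases by model score, highest first, and keep the top group.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(label for _, label in ranked[:n_top]) / n_top
    return top_rate / overall_rate


def pick_by_lift(y_true, scores_by_model, fraction=0.1):
    """Return the name of the model with the largest lift, plus all lifts."""
    lifts = {
        name: top_fraction_lift(y_true, scores, fraction)
        for name, scores in scores_by_model.items()
    }
    best = max(lifts, key=lifts.get)
    return best, lifts


# Hypothetical churn labels (1 = churned) and hypothetical model scores.
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
scores_by_model = {
    "decision_tree": [0.9, 0.8, 0.1, 0.7, 0.2, 0.3, 0.1, 0.4, 0.2, 0.1],
    "logistic_regression": [0.9, 0.2, 0.1, 0.8, 0.3, 0.2, 0.1, 0.7, 0.2, 0.1],
}

best, lifts = pick_by_lift(y_true, scores_by_model, fraction=0.3)
print(best, lifts)
```

Choosing purely by lift in this way ignores stability and explainability, which is exactly the gap the rest of this article is about.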
But in fact, judging from what several colleagues and friends described, logistic regression is the model that many companies actually use.
Why not use a more "advanced" model? There are two reasons:
First: the robustness of the model. These models have proven in practice to be the best performers, or at least the most stable ones. The criteria they are judged by are simply stability, explainability (which is very important in business), and simplicity.
Second: a commercial application is already an established process that is not changed lightly, much like a production line. Even a slight change to the model can affect many other parts of the process, which makes it a big project.
Talking with them made me realize that I had forgotten one thing: all of this serves the business. Business processes should not be too complicated, and the best business model is often the simplest one, right?
My point of view: maybe it has something to do with my own work experience, but I think that for a data analyst or data modeler, although what you end up using may be very simple, what you have mastered should be broad and deep. It is precisely this foundation that lets you choose the best model. Therefore, when doing data mining or mathematical modeling in the service of a business, experience is very important. Of course, a solid grounding in the professional knowledge is also one of the most fundamental requirements.