Let's look at an example in detail.
Figure 4: Example of house prices
Take housing, a topic of national interest, as an example. Suppose I have a house to sell. How much should I list it for? The house is 100 m², so should the price be 1 million, 1.2 million, or 1.4 million?
Clearly, I want some rule relating house price to area. How do I find such a rule? Use the average price data from the newspaper? Look at similar neighborhoods? Neither approach seems very reliable.
What I really want is a reasonable rule that reflects the relationship between area and price as faithfully as possible. So I surveyed nearby houses with similar layouts and collected a set of data: the areas and prices of houses large and small. If I can extract the relationship between area and price from this data, I can work out the price of my own house.
Finding the rule is simple: fit a straight line that "passes through" all the points, with the distance to each point as small as possible.
This straight line gives me the rule that best reflects the relationship between house price and area. The line is also a function, expressed by the formula: house price = area × a + b.
Here a and b are the parameters of the straight line. Once I have these parameters, I can calculate the price of my house.
Suppose a = 0.75 and b = 50 (with price in units of 10,000). Then house price = 100 × 0.75 + 50 = 125, that is, 1.25 million. This differs from the 1 million, 1.2 million, and 1.4 million options I listed earlier. Because the straight line takes most of the observed cases into account, it is the most reasonable prediction in the "statistical" sense.
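As a minimal sketch, such a line can be fitted by ordinary least squares. The data points below are made up for illustration (they are not the survey data described above), and the library choice is an assumption:

```python
import numpy as np

# Hypothetical survey data: area in m², price in units of 10,000 (made up for illustration).
areas = np.array([60, 75, 88, 100, 120, 150], dtype=float)
prices = np.array([95, 105, 118, 124, 140, 165], dtype=float)

# Fit price = area * a + b by ordinary least squares (a degree-1 polynomial fit).
a, b = np.polyfit(areas, prices, deg=1)
print(f"a = {a:.2f}, b = {b:.2f}")

# Predict the price of a 100 m² house from the fitted line.
predicted = 100 * a + b
print(f"predicted price: {predicted:.1f} * 10,000")
```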
This process reveals two points:
1. The form of the house price model is determined by the type of function we fit. If we fit a straight line, the resulting equation is a straight line; if we fit another kind of curve, such as a parabola, we get a parabolic equation (see the sketch after this list). Machine learning offers many algorithms, and some of the more powerful ones can fit complex nonlinear models that capture situations a straight line cannot express.
2. The more data I have, the more situations the model can take into account, and the better it is likely to predict new cases. This embodies the "data is king" idea in machine learning: generally speaking (though not absolutely), more data leads to better predictions from the model that machine learning produces.
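To illustrate point 1, here is a minimal sketch of fitting a parabola instead of a straight line; the data and the choice of degree are assumptions for illustration, not the author's example:

```python
import numpy as np

# Same hypothetical data as before (area in m², price in units of 10,000).
areas = np.array([60, 75, 88, 100, 120, 150], dtype=float)
prices = np.array([95, 105, 118, 124, 140, 165], dtype=float)

# Fitting a degree-2 polynomial (a parabola) changes the form of the model,
# while the overall fit-then-predict procedure stays the same.
c2, c1, c0 = np.polyfit(areas, prices, deg=2)
predicted = c2 * 100**2 + c1 * 100 + c0
print(f"parabola prediction for 100 m²: {predicted:.1f} * 10,000")
```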
The process of fitting the straight line lets us review the complete machine learning workflow. First, we store the historical data in the computer. Then we process that data with a machine learning algorithm; this step is called "training", and its output, which we can use to predict new data, is called a "model". The step of predicting new data is called "prediction". "Training" and "prediction" are the two phases of machine learning, and the "model" is the intermediate output that connects them: "training" produces the "model", and the "model" guides "prediction".
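A minimal sketch of this train/predict cycle, using scikit-learn's LinearRegression (the library choice and the data are assumptions for illustration; the text does not specify a tool):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Historical data stored in the computer (hypothetical values, price in units of 10,000).
areas = np.array([[60], [75], [88], [100], [120], [150]], dtype=float)
prices = np.array([95, 105, 118, 124, 140, 165], dtype=float)

# "Training": the algorithm processes the historical data and produces a "model".
model = LinearRegression()
model.fit(areas, prices)

# "Prediction": the trained model is applied to new data.
new_house = np.array([[100.0]])
print(f"predicted price: {model.predict(new_house)[0]:.1f} * 10,000")
```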