Many languages can be used to write crawlers, but Python-based crawlers are more concise and convenient, and crawling has become an indispensable part of the Python ecosystem.
This article explains what a crawler is and its basic workflow; the next installment will look more closely at that workflow and at requests and responses.
What is a crawler?
A crawler is a web crawler, also called a web spider in English: literally, a spider crawling across the Internet. If the Internet is regarded as a big web, then a crawler is a spider crawling around on that web, and when it finds the food it wants, it takes it.
When we enter a URL in the browser and press Enter, we see the page of the website; at that moment the browser is requesting network resources from the website's server. A crawler is equivalent to simulating the browser: it sends a request and gets back the HTML code. HTML code usually contains tags and text, from which we can extract the information we want.
Usually, a crawler starts from one page of a website, crawls the content of that page, finds other link addresses in the page, and then follows those addresses to the next pages, crawling on and on in batches. In other words, a web crawler is a program that keeps fetching web pages to grab information.
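To make that idea concrete, here is a minimal sketch of the crawl-and-follow loop. It assumes the third-party requests library for HTTP (the article does not name a specific library) and uses a naive regular expression to find links; the start URL and page limit are purely illustrative.

```python
import re
from urllib.parse import urljoin

import requests

def crawl(start_url, max_pages=5):
    """Fetch a page, collect its links, then follow them in turn."""
    seen, queue = set(), [start_url]
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = requests.get(url, timeout=10).text
        # Naive link extraction; a real crawler would use an HTML parser
        for href in re.findall(r'href="(http[^"]+)"', html):
            queue.append(urljoin(url, href))
    return seen

pages = crawl("https://www.baidu.com")  # hypothetical starting point
```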
The basic process of a crawler:
1. Initiate a request:
Send a request to the target site through an HTTP library; the request can carry additional information such as headers, and then we wait for the server to respond. This process is like opening a browser, entering the URL www.baidu.com in the address bar, and pressing Enter: the browser acts as the client and sends a request to the server.
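For example, with the requests library (one common choice; the article does not prescribe a specific HTTP library), sending a request with a browser-like header might look like this:

```python
import requests

# A User-Agent header makes the request look like it came from a browser;
# many sites reject requests that lack one.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
response = requests.get("https://www.baidu.com", headers=headers, timeout=10)
print(response.status_code)  # 200 indicates a normal response
```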
2. Get the response content:
If the server responds normally, we get a response whose body is the content we want. It can be HTML, a JSON string, binary data (pictures, videos, etc.), and so on. This corresponds to the server receiving the client's request and sending the page's HTML file back to the browser.
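Continuing the requests-based sketch, those different response types map onto different attributes of the response object:

```python
import requests

response = requests.get("https://www.baidu.com", timeout=10)

if response.status_code == 200:
    html_text = response.text     # decoded text, e.g. an HTML page
    raw_bytes = response.content  # raw bytes, e.g. a picture or video
    # For a JSON API you would call response.json() instead
```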
3. Parse the content:
The content may be HTML, which can be parsed with regular expressions or web page parsing libraries. It may be JSON, which can be converted directly into a JSON object for parsing. It may also be binary data, which can be saved or processed further. This step is equivalent to the browser taking the files received from the server and interpreting and displaying them locally.
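As a brief illustration of each case (the sample HTML and JSON strings below are made up for the example, and BeautifulSoup is just one popular parsing library, not one the article prescribes):

```python
import json
import re

from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = "<html><head><title>Example Page</title></head></html>"

# Case 1: parse HTML with a regular expression
title = re.search(r"<title>(.*?)</title>", html).group(1)

# Case 2: parse HTML with a web page parsing library
soup = BeautifulSoup(html, "html.parser")
title = soup.title.string

# Case 3: a JSON string converts directly into a Python object
data = json.loads('{"title": "Example Page"}')
```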
4. Save the data:
The data can be saved as text, written to a database, or stored as files in specific formats such as jpg or mp4. This is equivalent to downloading a picture or video from a web page while browsing.
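For instance, saving text and binary data might look like the sketch below; the image URL is hypothetical and stands in for whatever resource the crawler found.

```python
import requests

# Save extracted text to a plain file
with open("page.txt", "w", encoding="utf-8") as f:
    f.write("extracted text goes here")

# Save binary data (e.g. a picture) with the matching file extension
img = requests.get("https://www.example.com/picture.jpg", timeout=10)  # hypothetical URL
with open("picture.jpg", "wb") as f:
    f.write(img.content)
```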