Can Python Crawlers Crawl Websites?
First, we need to know what a crawler is. A crawler is a program that automatically fetches web data, and it is an important component of search engines. Starting from a set of customized entry addresses, the program continuously extracts the links on each page, follows those links to discover pages it has not yet visited, and finally obtains the desired content.

Next, we must consider how to use a crawler to crawl web data:

1. First, be clear about three characteristics of web pages (a short code sketch follows this list):

1) Each web page has a unique Uniform Resource Locator (URL) to locate it;

2) The webpage uses Hypertext Markup Language (HTML) to describe the page information;

3) Web pages use Hypertext Transfer Protocol (HTTP/HTTPS) to transfer HTML data.
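These three characteristics map directly onto a few lines of Python. The following is a minimal sketch, assuming Python 3 and only the standard library; https://example.com/ is a placeholder address:

```python
from urllib.request import urlopen

url = "https://example.com/"                # 1) the Uniform Resource Locator
with urlopen(url) as response:              # 3) transferred over HTTP/HTTPS
    html = response.read().decode("utf-8")  # 2) the HTML that describes the page
print(html[:200])                           # first 200 characters of the markup
```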

2. Establish the crawler's design approach (a code sketch follows this list):

1) First, determine the URL of the webpage to be crawled;

2) Obtain the corresponding HTML page over the HTTP/HTTPS protocol;

3) Extract useful data from the HTML page:

A. If it is the data you need, save it.

B. If it is another URL on the page, return to step 2 and continue.
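Below is a minimal sketch of this loop, assuming Python 3 with only the standard library. The regex-based extract_title() and extract_links() helpers are simplified stand-ins for a real HTML parser, and a production crawler would also respect robots.txt and rate limits:

```python
import re
from collections import deque
from urllib.request import urlopen

def extract_title(html):
    """Step 3a: pull one piece of data (the page title) out of the HTML."""
    match = re.search(r"<title>(.*?)</title>", html, re.S | re.I)
    return match.group(1).strip() if match else ""

def extract_links(html):
    """Step 3b: pull absolute http(s) links out of the HTML."""
    return re.findall(r'href="(https?://[^"]+)"', html)

def crawl(start_url, max_pages=5):
    """Breadth-first crawl: fetch, save data, queue unseen links."""
    queue, seen, titles = deque([start_url]), {start_url}, []
    while queue and len(titles) < max_pages:
        url = queue.popleft()                      # 1) URL to crawl
        with urlopen(url) as resp:                 # 2) fetch over HTTP/HTTPS
            html = resp.read().decode("utf-8", errors="replace")
        titles.append(extract_title(html))         # 3a) save the needed data
        for link in extract_links(html):           # 3b) new URL: back to step 2
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return titles

print(crawl("https://example.com/"))
```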

For example, suppose we want to crawl Sina News content. Observing Sina's homepage, there are many categories along the top, such as news, finance, technology, sports, entertainment, and cars, and each category is divided into many subcategories, such as military, society, and international. Therefore, we start from Sina's homepage, find the URL links of the top-level categories, then find the URL links of the subcategories under each category, and finally find the URL of each news page, grabbing text and pictures as required. This is the idea behind crawling an entire site.
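To make this layered idea concrete, here is a hedged sketch that keeps only links on the target site and separates section pages from article pages by URL pattern. The patterns and sample URLs are illustrative assumptions, not Sina's actual URL scheme:

```python
import re
from urllib.parse import urlparse

def classify_links(links, site="sina.com.cn"):
    """Keep on-site links and split section pages from article pages."""
    sections, articles = [], []
    for url in links:
        if site not in urlparse(url).netloc:
            continue                                # ignore off-site links
        if re.search(r"/\d{4}-\d{2}-\d{2}/", url):  # assumed date-stamped article path
            articles.append(url)
        else:
            sections.append(url)                    # category or subcategory page
    return sections, articles

# Illustrative placeholder URLs, not real Sina addresses:
sections, articles = classify_links([
    "https://news.sina.com.cn/world/",
    "https://news.sina.com.cn/w/2024-01-01/doc-sample.shtml",
    "https://other-site.example/page",
])
print(sections)   # ['https://news.sina.com.cn/world/']
print(articles)   # ['https://news.sina.com.cn/w/2024-01-01/doc-sample.shtml']
```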

3. Choosing a language for crawlers

Many languages can be used to write crawlers, such as PHP, Java, C/C++, Python, and so on. ...

At present, Python has become the most widely used option because of its elegant syntax, concise code, high development efficiency, and the large number of modules it supports. Its HTTP request modules and HTML parsing modules are very rich, and it offers the powerful crawler framework Scrapy along with the mature and efficient scrapy-redis distributed strategy. In addition, Python makes it convenient to call other interfaces.
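As an illustration of how little code Scrapy requires, here is a minimal spider sketch, assuming Scrapy is installed (pip install scrapy); the start URL and CSS selectors are placeholders:

```python
import scrapy

class NewsSpider(scrapy.Spider):
    name = "news"
    start_urls = ["https://example.com/"]   # placeholder start page

    def parse(self, response):
        # Save the needed data: here, just the URL and the page title.
        yield {"url": response.url,
               "title": response.css("title::text").get()}
        # Follow in-page links and parse them with the same callback.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

Saved as news_spider.py, this can be run without creating a full project via: scrapy runspider news_spider.py -o items.json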