Python + urllib2 + regular expressions + bs4
or
Node.js + co, together with any DOM framework or HTML parser, plus Request and regular expressions, is also convenient.
For me the two options above are roughly equivalent, but since I'm more familiar with JS, these days I tend to pick the Node platform.
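As a rough illustration of the first option, here is a minimal Python 2 sketch; the URL, the tags being pulled, and the date regex are placeholders for illustration, not from any real project:

```python
# -*- coding: utf-8 -*-
# Minimal Python 2 sketch of the urllib2 + bs4 + regex approach.
# The URL, tags, and regex below are only placeholders.
import re
import urllib2

from bs4 import BeautifulSoup

url = 'http://example.com/articles'          # hypothetical listing page
html = urllib2.urlopen(url, timeout=10).read()

# Structured parsing with bs4: grab every link and its text.
soup = BeautifulSoup(html, 'html.parser')
for link in soup.find_all('a', href=True):
    print link['href'], link.get_text(strip=True)

# Regular expressions for the bits that are easier to grab from raw text,
# e.g. dates like 2015-06-01 scattered through the page.
dates = re.findall(r'\d{4}-\d{2}-\d{2}', html)
print dates
```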
For crawling at the scale of an entire site:
Python + Scrapy
If the DIY spiders in the two approaches above are "millet plus rifles" (i.e., bare-bones kit), then Scrapy is heavy artillery. It is extremely handy: customizable crawl rules, HTTP error handling, XPath, RPC, the pipeline mechanism, and so on. And because Scrapy is built on Twisted, it is also very efficient. Relatively speaking, its only drawback is that installation is more troublesome and the dependencies are heavier; I'm on a fairly new version of OS X and couldn't install Scrapy directly with pip.
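To give a feel for those pieces, here is a minimal Scrapy sketch with an XPath-based spider and a tiny item pipeline; the spider name, start URL, selectors, and the pipeline rule are all assumptions for illustration, and the pipeline would still need to be registered in ITEM_PIPELINES:

```python
# Minimal Scrapy sketch: one spider with XPath extraction and pagination,
# plus a small item pipeline. All names and paths here are placeholders.
import scrapy
from scrapy.exceptions import DropItem


class BlogSpider(scrapy.Spider):
    name = 'blog'
    start_urls = ['http://posts.example.com/']   # hypothetical start page

    def parse(self, response):
        # Extract each post's title and link with XPath.
        for post in response.xpath('//div[@class="post"]'):
            yield {
                'title': post.xpath('.//h2/a/text()').extract_first(),
                'url': post.xpath('.//h2/a/@href').extract_first(),
            }
        # Follow the "next page" link, if any.
        next_page = response.xpath('//a[@rel="next"]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)


class DropShortTitles(object):
    """Example pipeline: discard items whose title is missing or too short."""

    def process_item(self, item, spider):
        if not item.get('title') or len(item['title']) < 3:
            raise DropItem('bad title')
        return item
```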
In addition, if you write your spider with XPath and install an XPath plugin in Chrome, the parsing paths are clear at a glance and development efficiency is very high.
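For example, an XPath copied from a Chrome plugin can be tried against a Selector (or in the Scrapy shell) before it goes into the spider; the HTML fragment and both paths below are made up:

```python
# Sketch: testing a browser-copied XPath before committing it to the spider.
from scrapy.selector import Selector

body = '<div id="content"><h2 class="title"><a href="/p/1">Hello</a></h2></div>'
sel = Selector(text=body)

# What the browser typically hands you (absolute, brittle):
print(sel.xpath('/html/body/div/h2/a/text()').extract_first())

# What you usually keep in the spider (relative, resilient to layout shifts):
print(sel.xpath('//div[@id="content"]//h2[@class="title"]/a/text()').extract_first())
```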