spider related

Posted by hugozen on 2017-03-05 02:25:12

How to deploy

scrapyd + supervisord + crontab + redis
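
The post only names the stack, not how the pieces fit together. One common arrangement (an assumption here, not something the post states) is that supervisord keeps the scrapyd daemon alive, crontab periodically runs a small script that asks scrapyd to start a crawl, and redis holds shared state such as a request queue. A minimal sketch of that cron-triggered script, using scrapyd's `schedule.json` endpoint; the project name `myproject`, the spider name `lagou`, and the host/port are placeholders:

```python
import json
import urllib.parse
import urllib.request

# Assumed location of the scrapyd daemon (6800 is scrapyd's default port).
SCRAPYD_URL = "http://localhost:6800"


def build_schedule_request(project, spider, base_url=SCRAPYD_URL):
    """Build the POST request for scrapyd's schedule.json endpoint."""
    data = urllib.parse.urlencode({"project": project, "spider": spider})
    return urllib.request.Request(
        f"{base_url}/schedule.json",
        data=data.encode("utf-8"),
        method="POST",
    )


def schedule_spider(project, spider, base_url=SCRAPYD_URL, timeout=10):
    """Send the request and return scrapyd's JSON reply as a dict."""
    request = build_schedule_request(project, spider, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Example call (needs a running scrapyd with the project deployed):
#   schedule_spider("myproject", "lagou")
# Hypothetical crontab entry that runs a script wrapping this call hourly:
#   0 * * * * python /path/to/schedule_crawl.py
```

Splitting request construction from sending keeps the network call in one place, so the payload can be checked without a running daemon.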

Industry demand

[lagou]

Other

For example, how to avoid getting banned

Here are some tips to keep in mind when dealing with these kinds of sites:
- rotate your user agent from a pool of well-known ones from browsers (google around to get a list of them)
- disable cookies (see COOKIES_ENABLED) as some sites may use cookies to spot bot behaviour
- use download delays (2 or higher). See DOWNLOAD_DELAY setting.
- if possible, use Google cache to fetch pages, instead of hitting the sites directly
- use a pool of rotating IPs. For example, the free Tor project or paid services like ProxyMesh
- use a highly distributed downloader that circumvents bans internally, so you can just focus on parsing clean pages. One example of such downloaders is Crawlera
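
Several of these tips map directly onto Scrapy settings and downloader middlewares. The following is a rough sketch rather than a definitive setup: the class names, the `myproject.middlewares` module path, and the user-agent and proxy pools are all hypothetical placeholders, while `COOKIES_ENABLED`, `DOWNLOAD_DELAY`, `DOWNLOADER_MIDDLEWARES`, and `request.meta["proxy"]` are standard Scrapy names.

```python
import random

# Hypothetical pools; replace with a real list of browser user agents
# and with proxies you are actually allowed to use.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0",
]
PROXIES = ["http://127.0.0.1:8118"]  # placeholder, e.g. a local HTTP proxy


class RotateUserAgentMiddleware:
    """Downloader middleware that picks a random User-Agent per request."""

    def __init__(self, user_agents=USER_AGENTS):
        self.user_agents = list(user_agents)

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.user_agents)
        return None  # let Scrapy continue processing the request


class RotateProxyMiddleware:
    """Downloader middleware that picks a random proxy per request."""

    def __init__(self, proxies=PROXIES):
        self.proxies = list(proxies)

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware reads this meta key.
        request.meta["proxy"] = random.choice(self.proxies)
        return None


# settings.py fragment (module path "myproject.middlewares" is assumed)
COOKIES_ENABLED = False  # some sites use cookies to spot bot behaviour
DOWNLOAD_DELAY = 2       # seconds between requests to the same site
DOWNLOADER_MIDDLEWARES = {
    # Disable the built-in fixed user agent, then add the rotating ones.
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    "myproject.middlewares.RotateUserAgentMiddleware": 400,
    "myproject.middlewares.RotateProxyMiddleware": 410,
}
```

In a real project the two classes would live in `middlewares.py` and the last block in `settings.py`; they are shown together here only to keep the sketch in one place.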
