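Anyone who relies heavily on online resources will sooner or later need to write a crawler. In China most crawlers are currently written in Python. I have never used Python, so I write mine in Ruby. This post records how to write a powerful, easy-to-use crawler in Ruby.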
A Ruby development environment (the Ruby China wiki has a detailed tutorial)
The gems Nokogiri, byebug (for debugging), and mechanize
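First, take a look at the Gfan market (机锋市场).
The market's URL is http://apk.gfan.com
We use Gfan's search feature to find the apps to download, so the next step is to work out its search URL. Searching for 学习 and opening the second page of results gives http://apk.gfan.com/search/学习_2.shtml, so the keyword and the page number are both part of the URL path.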
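The search URL pattern above can be captured in a small helper. This is a minimal sketch: the names `BASE_URL`, `URL_PARSER`, and `search_url` are hypothetical and chosen only for illustration. Because `URI.escape` was removed in Ruby 3.0, the sketch escapes the URL with `URI::RFC2396_Parser`, which applies the same kind of percent-encoding.

```ruby
require 'uri'

# Hypothetical names for illustration; only the URL pattern comes from the site.
BASE_URL = 'http://apk.gfan.com'
URL_PARSER = URI::RFC2396_Parser.new

# Builds the search URL for a keyword and a 1-based page number.
def search_url(keyword, page)
  # Percent-encode non-ASCII characters such as 学习 so the URL is valid.
  URL_PARSER.escape("#{BASE_URL}/search/#{keyword}_#{page}.shtml")
end

puts search_url('学习', 2)
# prints http://apk.gfan.com/search/%E5%AD%A6%E4%B9%A0_2.shtml
```

Keeping the URL construction in one place makes it easy to change the keyword or step through pages without touching the rest of the crawler.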
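The complete script searches for the keyword and downloads the first 5 apps it finds:

require 'nokogiri'
require 'open-uri'
require 'byebug'   # only needed while debugging; call `byebug` to set a breakpoint
require 'mechanize'

# URI.escape was removed in Ruby 3.0; this parser provides the same escaping.
URL_PARSER = URI::RFC2396_Parser.new

# Downloads the app behind one search-result node.
# Returns 1 on success and 0 on failure so the caller can sum the results.
def download_app(app)
  detail_url = URL_PARSER.escape(app.at_css('a')['href'])
  doc = Nokogiri::HTML(URI.open(detail_url))
  download_url = URL_PARSER.escape(doc.at_css('#computerLoad')['href'])
  agent = Mechanize.new
  # Send the detail page as the referer when requesting the file.
  file = agent.get(download_url, [], detail_url)
  # APK files are binary, so write them in binary mode.
  File.open(file.filename, 'wb') { |f| f.write(file.body) }
  1
rescue StandardError => e
  puts "Download app exception: #{e.message}"
  0
end

begin
  base_url = 'http://apk.gfan.com'
  keyword = '学习'
  page = 1
  count = 5            # number of apps to download
  download_count = 0
  loop do
    search_url = URL_PARSER.escape("#{base_url}/search/#{keyword}_#{page}.shtml")
    doc = Nokogiri::HTML(URI.open(search_url))
    app_list = doc.css('.lp-app-list li')
    break if app_list.empty?   # no more search results
    app_list.each do |app|
      download_count += download_app(app)
      break if download_count >= count
    end
    break if download_count >= count
    page += 1
  end
rescue StandardError => e
  puts "Main method exception: #{e.message}"
end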
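The paging logic of the script (keep fetching result pages until a page comes back empty or enough apps have been downloaded) can be tried out offline by replacing the network calls with lambdas. The following is only a sketch under that assumption: `crawl`, `fetch`, `download`, and the fake `pages` data are hypothetical and do not come from the script or from the site.

```ruby
# Hypothetical sketch of the paging loop with the network calls injected.
# fetch_page: callable taking a page number and returning an array of items
# download:   callable taking an item and returning 1 on success, 0 on failure
def crawl(limit, fetch_page, download)
  downloaded = 0
  page = 1
  while downloaded < limit
    items = fetch_page.call(page)
    break if items.empty?              # no more search results
    items.each do |item|
      downloaded += download.call(item)
      break if downloaded >= limit     # quota reached in the middle of a page
    end
    page += 1
  end
  downloaded
end

# Fake data standing in for three result pages; the third one is empty.
pages = { 1 => %w[a b c], 2 => %w[d e], 3 => [] }
fetch = ->(page) { pages.fetch(page, []) }
# Pretend that item "b" fails to download.
download = ->(item) { item == 'b' ? 0 : 1 }

puts crawl(3, fetch, download)    # prints 3: the quota is reached on page 2
puts crawl(10, fetch, download)   # prints 4: the results run out first
```

Returning 1 or 0 from the download step, as the script does, lets the loop count only successful downloads, so a failed app does not use up the quota.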