Programming defensively requires knowing the input that your code should be able to handle. Typically, the programmer is intimately familiar with the type of data his or her code will encounter and can perform checks and catch exceptions with respect to the format of that data.
Web mining requires a lot more sophistication. In many cases the programmer does not know the full formatting of the data published on a web site, and that format may change over time. There are standards that apply to certain types of data on the web, but one cannot rely on web developers to follow them. For example, the RSS Advisory Board developed a convention for formatting web pages so that browsers can automatically discover the links to a site's RSS feeds. I have found in my research that approximately 95% of my sample actually implemented this convention. Not bad, but not perfect.
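(For the curious: the autodiscovery convention boils down to a <link rel="alternate" type="application/rss+xml" href="..."> tag in the page's <head>. A minimal, regex-based sketch of that kind of "Plan A" extraction might look like the code below; the function name is illustrative, attribute order varies in the wild, and a real HTML parser is the more robust choice.)

import re
import urllib2

# RSS autodiscovery: look for a <link> tag advertising an RSS feed.
# Simplified sketch: assumes type="..." appears before href="..." in the tag.
AUTODISCOVERY_RE = re.compile(
    r'<link[^>]+type=["\']application/rss\+xml["\'][^>]*href=["\']([^"\']+)["\']',
    re.IGNORECASE)

def discover_feed(url):
    html = urllib2.urlopen(url).read()
    match = AUTODISCOVERY_RE.search(html)
    if match:
        return match.group(1)   # the advertised feed URL
    return None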
Always Have a Plan B, C, D, …
One might say that 95% is good enough. I am a bit obsessive when it comes to data quality, so I wanted to extract a feed for 99% of the sites on my list. I am also always leery of bias: could there be something special about the sites that do not implement RSS autodiscovery? Clearly, there are exceptions to my Plan A, so it was time to move to Plan B. I found that some of the sites in this remaining 5% used FeedBurner for their feeds, so Plan B was to use a regular expression to extract FeedBurner URLs. This added only another 1% (actually less than that) to my coverage.
((?:http|feed)://feeds\.feedburner\.com/[^"'\s<>]+)
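In practice, Plan B is just a findall with that pattern over the raw HTML. A quick sketch (the function name is illustrative):

import re
import urllib2

# Plan B: scan the raw HTML for FeedBurner feed URLs.
FEEDBURNER_RE = re.compile(r"""((?:http|feed)://feeds\.feedburner\.com/[^"'\s<>]+)""")

def find_feedburner_feeds(url):
    html = urllib2.urlopen(url).read()
    return FEEDBURNER_RE.findall(html)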
Next, Plan C took the domain name, simply slapped /feed onto it, and hoped it would stick. I called this process "feed probing," and it added the remaining 3% that I was looking for. If Plans A, B, and C all failed to find a suitable RSS feed, all hope is lost and we just skip the site (1% error).
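A sketch of the probing step (the function name and the simple success test are illustrative; real code should also confirm that the response actually parses as RSS/Atom):

import httplib
import urllib2

def probe_feed(domain):
    # Plan C ("feed probing"): guess the feed location by appending /feed
    # to the domain and seeing whether anything sensible answers.
    candidate = "http://" + domain + "/feed"
    try:
        content = urllib2.urlopen(candidate).read()
    except (urllib2.URLError, httplib.HTTPException):
        return None
    if "<rss" in content or "<feed" in content:   # crude sanity check
        return candidate
    return None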
On the other hand, there are times when it is the HTTP client or server that cannot be trusted…
Common Python Exceptions in Web Mining
It is all too common to encounter an exception while web mining or crawling. Code must handle these errors gracefully by catching exceptions or failing without aborting. One method that works well is to provide a resume mechanism that restarts execution where the code left off, rather than having to start a multi-hour/day/week job over again! A sketch of that idea follows, and after it, a taxonomy of common problems (and their Python exceptions).
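One bare-bones version of such a resume mechanism (a sketch, not the exact code from my crawler; the file name and process() are stand-ins for whatever your job actually does) is to log each URL as it is completed and skip anything already logged on restart:

import os

DONE_FILE = "completed_urls.txt"   # checkpoint log

def load_completed():
    # URLs finished in a previous run
    if not os.path.exists(DONE_FILE):
        return set()
    return set(line.strip() for line in open(DONE_FILE))

def crawl(urls):
    completed = load_completed()
    log = open(DONE_FILE, "a")
    for url in urls:
        if url in completed:
            continue           # already handled in an earlier run
        process(url)           # the actual fetching/parsing work
        log.write(url + "\n")
        log.flush()            # make sure progress survives a crash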
HTTP Errors. These occur frequently. Some are recoverable, and others are worth just throwing out a record over. The most common ones appear in the example below, but for more information, refer to RFC 2616.
In Python, these can be caught as urllib2.HTTPError. It is also possible to specify actions based on the specific HTTP error code returned:
try:
    content = urllib2.urlopen(url).read()
except urllib2.HTTPError, e:
    if e.code == 404:
        print "Not Found"
    elif e.code == 500:
        print "Internal Server Error"
Server Errors (URLError). These occur frequently as well and seem to denote some sort of server or connection trouble, such as "Connection refused" or a site that does not exist. Usually, these are resolved by retrying the fetch. In Python, it is very important to note that HTTPError is a subclass of URLError, so when handling both exceptions distinctly, HTTPError must be caught first.
try:
    content = urllib2.urlopen(url).read()
except urllib2.HTTPError, e:
    ...
except urllib2.URLError, f:
    print f.reason
Other Bizarreness. The web is very chaotic. Sometimes weird stuff happens. The rare, elusive httplib.BadStatusLine exception technically means that the server returned a status line that the client does not understand, but it can also be thrown when the page being fetched is blank. On a recent project, I ran into a new one: httplib.IncompleteRead, which has little documentation. Both of these issues can usually be resolved by retrying the fetch, and both of these pesky errors (and more) can be handled by simply catching their parent exception: httplib.HTTPException.
try:
    content = urllib2.urlopen(url).read()
except httplib.HTTPException:
    # you've encountered a rare beast. You win a prize!
    pass
Everything Deserves a Second Chance
One common reaction to any error is to just throw the record out. URLError errors are so common, though, that this is probably unwise if you are actually using the data for something. Typically, these errors go away if you try again. I use the following loop to catch errors and react appropriately.
attempt = 0
while attempt < HTTP_RETRIES:
    attempt += 1
    try:
        temp = urllib2.urlopen(url).read()
        break
    except urllib2.HTTPError:
        break
    except urllib2.URLError:
        continue
    except httplib.HTTPException:
        continue
This code attempts to fetch the URL at most HTTP_RETRIES times. If the fetch is successful, Python breaks out of the loop. If a URLError or HTTPException occurs, we move on to another attempt. If we encounter an HTTP error (not found, restricted, etc.), we give up. Depending on the error, the code could be modified to retry on certain errors and abort on others, but for my purposes, I do not care.
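If you do care, one option is to branch on e.code inside the HTTPError handler. A sketch of that variant of the loop above (HTTP_RETRIES and url as before; which status codes count as retryable is a judgment call, and the 5xx choice here is just one reasonable default):

import httplib
import urllib2

attempt = 0
content = None
while attempt < HTTP_RETRIES:
    attempt += 1
    try:
        content = urllib2.urlopen(url).read()
        break
    except urllib2.HTTPError, e:
        if e.code in (500, 502, 503, 504):   # transient server-side errors: retry
            continue
        break                                # 404, 403, etc.: give up on this record
    except (urllib2.URLError, httplib.HTTPException):
        continue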
The Comatose Crawler
If you have ever done a large scale crawl on a web site, you are bound to encounter a state where your crawler becomes comatose – it is running, maybe using system resources, but is not outputting anything or reporting progress. It looks like an infinite no-op loop. I have encountered this problem since I started doing web mining in 2006 and did not, until just this past weekend, realize exactly why it was happening and how to prevent it.
Your crawler has sunk in a swamp, and is essentially trapped. For whatever reason, the HTTP server your code is communicating with maintains an open connection, but sends no data. I suppose this could be a deadlock-type situation where the HTTP server is waiting for an additional request (?), and the crawler is waiting for output from the HTTP server. It was my misunderstanding that the HTTP protocol had a built-in timeout, and I was relying on it. This is apparently not the case. There is a simple way to avoid this swamp, by setting a timeout on the socket sending the HTTP request:
import socket
...
HTTP_TIMEOUT = 5
socket.setdefaulttimeout(HTTP_TIMEOUT)
...
handle = urllib2.urlopen("http://www.google.com")
content = handle.read()
...
If a request to the socket goes unanswered after HTTP_TIMEOUT seconds, Python raises a urllib2.URLError exception that can be caught. In my code, I just skip these troublemakers. The resulting traceback looks like this:
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib64/python2.4/urllib2.py", line 130, in urlopen
    return _opener.open(url, data)
  File "/usr/lib64/python2.4/urllib2.py", line 358, in open
    response = self._open(req, data)
  File "/usr/lib64/python2.4/urllib2.py", line 376, in _open
    '_open', req)
  File "/usr/lib64/python2.4/urllib2.py", line 337, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.4/urllib2.py", line 1021, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib64/python2.4/urllib2.py", line 996, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error timed out>
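Skipping them is then just a matter of catching that exception around the fetch; a minimal sketch (with the timeout set as above, and url standing in for whatever site is being fetched):

import socket
import urllib2

HTTP_TIMEOUT = 5
socket.setdefaulttimeout(HTTP_TIMEOUT)

try:
    content = urllib2.urlopen(url).read()
except urllib2.URLError:
    content = None   # timed out or otherwise unreachable: skip this site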
With enough experience, dedication, blood, sweat, tears, and caffeine, data mining the jungle known as the World Wide Web becomes both simple and fun. Happy web mining!