Lately I have been doing a lot of work with the Wikipedia XML dump as a corpus. Wikipedia provides a wealth of information to researchers in easy-to-access formats, including XML, SQL and HTML dumps for all language properties, all of which are freely available from the Wikimedia Foundation.
The above resources are available not only for Wikipedia, but also for other Wikimedia Foundation projects such as Wiktionary, Wikibooks and Wikiquote.
As Wikipedia readers will notice, the articles are very well formatted, and this formatting is generated from a somewhat unusual markup language defined by the MediaWiki project. As Dirk Riehle stated:
There was no grammar, no defined processing rules, and no defined output like a DOM tree based on a well defined document object model. This is to say, the content of Wikipedia is stored in a format that is not an open standard. The format is defined by 5000 lines of php code (the parse function of MediaWiki). That code may be open source, but it is incomprehensible to most. That’s why there are 30+ failed attempts at writing alternative parsers.
For example, below is an excerpt of wiki syntax from the page on data mining.
'''Data mining''' (the analysis step of the '''knowledge discovery in databases''' process, or KDD), a relatively young and interdisciplinary field of [[computer science]] {{cite web|url=http://www.sigkdd.org/curriculum.php |title=Data Mining Curriculum | publisher=[[Association for Computing Machinery|ACM]] [[SIGKDD]] |date=2006-04-30 |accessdate=2011-10-28}} {{cite web | last = Clifton | first = Christopher | title = Encyclopedia Britannica: Definition of Data Mining | year = 2010 | url = http://www.britannica.com/EBchecked/topic/1056150/data-mining | accessdate = 2010-12-09}} is the process of discovering new patterns from large [[data set]]s involving methods at the intersection of [[artificial intelligence]], [[machine learning]], [[statistics]] and [[database system]]s. The goal of data mining is to extract knowledge from a data set in a human-understandable structure and involves database and [[data management]], [[Data Pre-processing|data preprocessing]], [[statistical model|model]] and [[Statistical inference|inference]] considerations, interestingness metrics, [[Computational complexity theory|complexity]] considerations, post-processing of found structure, [[Data visualization|visualization]] and [[Online algorithm|online updating]].
I was seriously worried that I would spend weeks writing my own parser and never complete the project I am working on at work. To my surprise, I found a fairly good parser. Since I am working on named entity and n-gram extraction, I only wanted to extract the plain text. If we take the above junk and extract only the plain text, we get
Data mining (the analysis step of the knowledge discovery in databases process, or KDD), a relatively young and interdisciplinary field of computer science is the process of discovering new patterns from large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics and database systems. The goal of data mining is to extract knowledge from a data set in a human-understandable structure and involves database and data management, data preprocessing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of found structure, visualization and online updating.
and from this we can remove punctuation (except the sentence terminators .?!), convert to lowercase and perform other text-mining pre-processing steps, as sketched in the snippet below. There are many, many Wikipedia parsers of varying quality. Some do not work at all, some work only on certain articles, some have been abandoned as incomplete and some are slow as molasses.
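As an aside, here is a minimal sketch of that cleanup step. It is not my exact pipeline, just an illustration that assumes the extracted plain text is already sitting in a Python string:

import re

def clean_text(text):
    """Lowercase and strip punctuation, keeping the sentence terminators . ? !"""
    text = text.lower()
    # Drop everything that is not a word character, whitespace or a sentence terminator.
    text = re.sub(r"[^\w\s.?!]", "", text)
    # Collapse the runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("Data mining (the analysis step of the KDD process), a relatively young field..."))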
I was delighted to stumble upon Wikipedia Extractor, a Python library developed by Antonio Fuschetto of the Multimedia Laboratory, Dipartimento di Informatica, Università di Pisa, that extracts plain text from the Wikipedia XML dump file. The script is heavily object-oriented and very easy to modify and extend for other purposes. For me, it is the easiest parser to use and yields the best-quality output, although there are other options.
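In the version I used, Wikipedia Extractor writes its output as files of articles wrapped in <doc id="..." url="..." title="..."> ... </doc> blocks. The snippet below is a rough sketch of reading that output back in; the exact attribute layout may differ between versions, and the file name wiki_00 is just a placeholder for one of the extractor's output files:

import re

# Each extracted file contains articles wrapped in <doc ...> ... </doc> blocks.
DOC_RE = re.compile(r'<doc id="(\d+)" url="[^"]*" title="([^"]*)">\n(.*?)\n</doc>', re.DOTALL)

def iter_articles(path):
    """Yield (id, title, text) tuples from one Wikipedia Extractor output file."""
    with open(path) as f:
        for doc_id, title, text in DOC_RE.findall(f.read()):
            yield doc_id, title, text

for doc_id, title, text in iter_articles("wiki_00"):
    print(doc_id, title, len(text.split()))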
Pros
- The easiest parser to use, and it yields the best-quality plain-text output I have found.
- Heavily object-oriented, so it is easy to modify and extend for other purposes.
Cons
- Slow: parsing the entire Wikipedia dump took approximately 13 hours on one core.
- Runs on a single core, so scaling up means distributing the work yourself.
Wikipedia Extractor is good for offline analysis, but users will probably want something that runs faster. Wikipedia Extractor parsed the entire Wikipedia dump in approximately 13 hours on one core, which is quite painful. Add in further parsing and the processing time becomes unbearable, even on multiple cores. A Hadoop Streaming job that used Wikipedia Extractor, plus far too much file I/O between Elastic MapReduce and S3, required 10 hours to complete on 15 c1.medium instances. A rough sketch of such a streaming mapper is below.
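For the curious, here is a minimal sketch of what a Hadoop Streaming mapper along these lines might look like. It is not the exact job I ran: it assumes each input record arrives as "title <tab> wiki markup" on one line, and the clean_markup() helper is a hypothetical stand-in for the call into Wikipedia Extractor that strips the wiki syntax.

#!/usr/bin/env python
# Hypothetical Hadoop Streaming mapper: reads one article per line on stdin
# and emits "title \t plain text" records on stdout.
import sys

def clean_markup(markup):
    # Placeholder for the Wikipedia Extractor call that strips wiki syntax;
    # here it just passes the text through unchanged.
    return markup

for line in sys.stdin:
    title, _, markup = line.rstrip("\n").partition("\t")
    text = clean_markup(markup)
    # Hadoop Streaming treats everything before the first tab as the key.
    print("%s\t%s" % (title, text))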
Ken Weiner (@kweiner) recently re-introduced me to the Cloud9 package by Jimmy Lin (@lintool) of Twitter, which fills in some of these gaps. I had avoided it at first because Java is not the first language I like to turn to. Cloud9 is written in Java and designed with Hadoop MapReduce in mind. The package includes a method that explicitly extracts the body text of each Wikipedia article, which in turn calls the Bliki Wikipedia parsing library. One common problem with Wikipedia parsers is that they often leave stray syntax in the output; Jimmy wraps Bliki with his own code to do a better job of extracting high-quality, text-only output. Cloud9 also has counters and functions that detect non-article content such as redirects, disambiguation pages and more.
Developers can introduce their own analysis, text mining and NLP code to process the article text in the mapper or reducer code. An example job distributed with Cloud9 that simply counts the number of pages in the corpus took approximately 15 minutes to run on 8 cores on an EC2 instance. A more substantial job required 3 hours to complete, and once the corpus was refactored as sequence files, the same job took approximately 90 minutes to run.
Conclusion
I am looking forward to playing with Cloud9 some more… I will take 90 minutes over 10 hours any day! Wikipedia Extractor is an impressive Python package that does a very good job of extracting plain text from Wikipedia articles, and for that I am grateful. Unfortunately, it is far too slow to be used on a pay-per-use system such as AWS or for quick turnaround processing. Cloud9 is a Java package designed with scalability and MapReduce in mind, allowing much quicker and more wallet-friendly processing.