Improving the extraction of Wikipedia data

I am happy to share some recent performance results of a new parser for Wikipedia data dumps that I have developed over the past 2 months.

The new parser is also written in Python, as was its predecessor included in WikiXRay. However, this new parser comes with notable improvements in speed and accuracy:

  • XML reading is now based on lxml event-driven parsing, offering better performance and improving code clarity and maintainability.
  • Very small memory footprint, even for parsing huge Wikipedia languages such as English.
  • Parallelization of the extraction process for English Wikipedia (multiple dump files) and multiple languages simultaneously (single dump files), along with improvements in data retrieval and organization. This follows ideas suggested by Dimitry Chichkov (in pymwdat) and more recently by A. Halfaker (wikimedia-utilities).
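To give a flavour of the event-driven approach, here is a minimal sketch (not the actual WikiDAT code). It assumes the usual page/revision layout of the dump files and, so that it runs without extra dependencies, it uses the standard library's xml.etree.ElementTree.iterparse as a stand-in; lxml.etree.iterparse offers a similar interface. The names iter_revisions and local_name, and the tiny SAMPLE document, are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}revision'."""
    return tag.rsplit("}", 1)[-1]


def iter_revisions(source):
    """Yield one dict per <revision>, discarding each <page> once it is done."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    page_title = None
    for event, elem in context:
        if event != "end":
            continue
        name = local_name(elem.tag)
        if name == "title":
            page_title = elem.text
        elif name == "revision":
            fields = {local_name(child.tag): child for child in elem}
            username = None
            contributor = fields.get("contributor")
            if contributor is not None:
                for child in contributor:
                    if local_name(child.tag) == "username":
                        username = child.text
            text = fields.get("text")
            yield {
                "page_title": page_title,
                "rev_id": int(fields["id"].text),
                "user": username,  # None when the author is missing
                "text": (text.text or "") if text is not None else "",
            }
            elem.clear()  # release the revision text as soon as it is consumed
        elif name == "page":
            root.clear()  # detach finished pages so the tree does not grow


# Hypothetical two-revision sample; the second revision has no author or text.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.6/">
<page><title>Foo</title><id>1</id>
<revision><id>10</id><contributor><username>Alice</username><id>5</id></contributor><text>hello</text></revision>
<revision><id>11</id><text /></revision>
</page></mediawiki>"""

for rev in iter_revisions(io.BytesIO(SAMPLE)):
    print(rev["page_title"], rev["rev_id"], rev["user"], len(rev["text"]))
```

Because each revision is cleared right after it is yielded and each page is dropped when its closing tag arrives, memory use depends on the largest single revision rather than on the size of the dump.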

New features in a nutshell:

  • Virtually 0% error rate in data retrieval (it even supports revisions with missing author, text, etc.).
  • Identification of featured articles (FAs) and featured lists (FLists) for different languages, over time (including cases of demotion and subsequent promotion).
  • A SHA-256 hash is computed for the text of every revision. This aids the detection of reverts to identical previous versions in any language.
  • It calculates the parent revision id (within the same page) and the number of characters per revision, and it identifies redirects.
  • User nicknames (field 'rev_user_text') have been moved to a separate 'people' table. This table will soon include some useful per-user statistics like timestamp of first edit, total number of edits, etc.
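The per-revision fields listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the parser's real code: the dict layout, the function name annotate_page and the 'reverts_to' field are hypothetical, revisions are assumed to arrive oldest first, and the redirect check only knows the English "#REDIRECT" keyword (other languages use localized keywords).

```python
import hashlib
import re

# Assumption: only the English redirect keyword is recognised here.
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\b", re.IGNORECASE)


def annotate_page(revisions):
    """Add derived fields to the revisions of one page, given oldest first.

    Each revision is a dict with 'rev_id' and 'text' (hypothetical layout).
    """
    seen = {}         # text hash -> rev_id of the first revision with that text
    parent_id = None  # the first revision of a page has no parent
    for rev in revisions:
        text = rev["text"] or ""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        rev["sha256"] = digest
        rev["parent_id"] = parent_id
        rev["num_chars"] = len(text)
        rev["is_redirect"] = bool(REDIRECT_RE.match(text))
        # An identical hash seen earlier in the page means this revision
        # restores that earlier text, i.e. it is a candidate revert.
        rev["reverts_to"] = seen.get(digest)
        seen.setdefault(digest, rev["rev_id"])
        parent_id = rev["rev_id"]
    return revisions


history = annotate_page([
    {"rev_id": 1, "text": "A"},
    {"rev_id": 2, "text": "vandalism"},
    {"rev_id": 3, "text": "A"},
    {"rev_id": 4, "text": "#REDIRECT [[B]]"},
])
for rev in history:
    print(rev["rev_id"], rev["parent_id"], rev["num_chars"],
          rev["is_redirect"], rev["reverts_to"])
```

Comparing fixed-length hashes instead of full texts is what keeps this check cheap and language independent. Note that two consecutive revisions with identical text would also be flagged, so a real analysis may want to filter those out.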

Running 6 subprocesses on a multi-core server (8 CPUs), the parser needs less than 40 hours to retrieve and calculate this information, storing 444,946,704 revisions from 27,023,430 wiki pages in the English Wikipedia. Depending on the number of filtering/detection actions to undertake, the number of parallel subprocesses and the hardware I/O speed, this time can be reduced.
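For readers curious about how work can be spread over several subprocesses, here is a minimal sketch using Python's standard multiprocessing module. It is only an assumption about the general shape of such a setup, not the actual implementation: the worker count_revisions is a placeholder that merely counts "<revision>" tags in an uncompressed file, and both function names are hypothetical.

```python
import multiprocessing
import sys


def count_revisions(path):
    """Placeholder worker: count '<revision>' tags in one uncompressed dump file."""
    total = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            total += line.count("<revision>")
    return path, total


def run_parallel(paths, workers=6):
    """Process every dump file in a pool of `workers` subprocesses."""
    with multiprocessing.Pool(processes=workers) as pool:
        # imap_unordered hands back each result as soon as its file is done,
        # so a slow file does not hold up the results of faster ones.
        return dict(pool.imap_unordered(count_revisions, paths))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python this_script.py dump_part1.xml dump_part2.xml ...
    for path, total in sorted(run_parallel(sys.argv[1:]).items()):
        print(path, total)
```

Since every dump file is independent of the others, one file per worker needs no coordination between processes; with more files than workers, the pool simply assigns the next file to whichever subprocess becomes free.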

I am currently following a similar approach to update our parser for the data dumps of the logging table, tracking administrative and reviewing activities (such as the approval of some revisions for languages using the flagged-revisions extension).

These new parsers will all be included in WikiDAT (Wikipedia Data Analysis Toolkit). WikiDAT is a new integrated framework to facilitate the analysis of Wikipedia data using Python, MySQL and R. Following the pragmatic paradigm "avoid reinventing the wheel", WikiDAT integrates some of the most efficient approaches for Wikipedia data analysis found in libre software code up to now.

WikiDAT has a twofold focus: running on multiple hardware platforms (so you do not need a Hadoop cluster to get decent performance, although it will run faster on a cluster or a multi-core server with a large amount of memory) and multilingual support (so you can analyse any Wikipedia language edition besides the ubiquitous English).

The code and accompanying documentation are expected to be released this month, so keep an eye on this blog to learn more details about WikiDAT.

I will present an excerpt of WikiDAT's new features, together with an introduction to Wikipedia data analysis for researchers, in a 3-hour workshop on June 28, at Wikipedia Academy 2012 in Berlin. Special thanks to Wikimedia-Deutschland for making this possible.