PhD Thesis

Description of my Ph.D. thesis "Wikipedia: A Quantitative Analysis"

Overview

This is the web page of my Ph.D. thesis, entitled "Wikipedia: A Quantitative Analysis". I successfully defended it at Universidad Rey Juan Carlos, on April 1st, 2009.

The reviewing committee of my Ph.D. included the following members:

[Image: the Ph.D. committee]

This thesis was written under the guidance of Jesús M. González Barahona.

Keywords

Wikipedia; Quantitative analysis; WikiXRay; Inequality analysis; Gini coefficient; Survival analysis; Model fitting; Pareto distribution; Upper truncated Pareto; Sustainability.

Download

The manuscript is available through the following link. The document is written in English, except for a short summary in Spanish, included in one of the appendices.

"Wikipedia: A Quantitative Analysis" (PDF, 9.2 MB).


Mass media news

On July 9, 2009, URJC published a press release including a summary of the main conclusions of my Ph.D. thesis.

Very soon, the news was picked up by many media outlets, including several national newspapers and radio stations.

Newspapers

  • Article in "El Mundo".
  • Article in "El País".
  • Article in "ABC".
  • Article in "Cinco Días", a leading business newspaper (the printed edition of July 10 included a full-page article and interview).

Radio Interviews


Abstract

Presently, the Wikipedia project hosts the largest collaborative community ever known in the history of mankind. Due to the large number of contributors, along with the remarkable popularity of Wikipedia on the Web, it has quickly become a topic of interest for researchers in many academic disciplines. However, in spite of the increasing significance of Wikipedia in scholarly publications over the past years, we often find studies concentrating either on very specific aspects of the project or on a single language version.

As a result, there is a need to broaden the scope of previous research to present a more complete picture of the Wikipedia project, its community of contributors and the evolution of this project over time. This doctoral thesis offers a quantitative analysis of the top ten language editions of Wikipedia, from different perspectives. The main goal has been to trace the evolution over time of key descriptive and organizational parameters of Wikipedia and its community of authors. The analysis focuses on logged authors (those editors who created a personal account to participate in the project). The comparative study encompasses general evolution parameters, a detailed analysis of the inner social structure and stratification of the Wikipedia community of logged authors, a study of the inequality level of contributions (among authors and articles), a demographic study of the Wikipedia community and some basic metrics to analyze the quality of Wikipedia articles and the trustworthiness level of individual authors. This work concludes with a study of the implications of the main findings presented in this thesis for the sustainability of Wikipedia in the coming years.

The analysis of the inequality level of contributions over time, and the evolution of additional key features identified in this thesis, reveals an unsustainable trend towards a progressive increase in the effort spent by the most active authors over time. This trend may eventually lead these authors to reach the upper limit on the number of revisions they can perform each month, thus starting a decline in the number of monthly revisions and an overall recession of the content creation and reviewing process in Wikipedia.

As far as we know, this is the first research work implementing a comparative analysis, from a quantitative point of view, of the top ten language editions of Wikipedia, presenting complementary results from different research perspectives. Therefore, we expect that this contribution will help the scientific community to enhance their understanding of the rich, complex and fascinating working mechanisms and behavioral patterns of the Wikipedia project and its community of authors. Likewise, we hope that WikiXRay will facilitate the hard task of developing empirical analyses on any language version of the encyclopaedia, thereby boosting the number of comparative studies like this one in many other scientific disciplines.

Methodology and tools

An important contribution of this thesis to the research community is WikiXRay, the software tool I have developed to perform the statistical analyses included in this work. This tool completely automates the process of retrieving the database dumps from the Wikimedia public repositories, processing them to obtain key metrics and descriptive parameters, and loading the results into a local database, ready to be used in empirical analyses. The code can be downloaded from the project's web page at BerliOS.

Currently, WikiXRay makes heavy use of statistical libraries from the GNU R project to implement all the analysis and statistical techniques included in this thesis. In the following sections we briefly present the most relevant methods and tools employed, along with some references from which to retrieve additional information.

Trellis graphs: the lattice package.

One of the main challenges in an extensive comparative analysis like this one is finding the best way to present all the results for each language version in a compact and comprehensible form. For this reason, throughout this thesis I make heavy use of (in my opinion) one of the most powerful packages in GNU R: lattice. This package implements a special type of statistical graph, the so-called Trellis graph, which makes it really easy to compare the behaviour and effects of several covariates and control variables on the same plot.
I highly recommend the book "Lattice. Multivariate Data Visualization with R", written by Deepayan Sarkar, the author of this package, to learn all the necessary skills to unleash the power of Trellis graphs in your work with GNU R.

Model fitting.

Another critical point in this thesis was how to find robust methods to fit statistical models to the empirical data obtained from Wikipedia database dumps, in particular for power laws and Pareto distributions. In this regard, I thank my colleague Dr. Israel Herráiz, who pointed me to the best paper on this matter. In fact, Aaron Clauset maintains a web page explaining the rationale behind this process, advising on best practices in this field and providing all the source code to implement the best available methods.
In addition to that, I found this paper very useful for fitting another, less common type of distribution, the truncated Pareto. This is the distribution followed by the effort spent by human authors in all language editions, both in terms of number of revisions performed and total number of different articles edited. The fitting process is supported in GNU R through the VGAM package.
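
As a rough illustration of what such a fit involves, the following Python sketch computes the standard maximum-likelihood estimate of the scaling exponent of a continuous power law, alpha = 1 + n / sum(ln(x_i / x_min)). It is not the code used in the thesis (which relies on GNU R and the VGAM package) and it does not handle the upper-truncated Pareto case. The function names are hypothetical, and the lower cut-off `x_min` is assumed to be known in advance rather than estimated from the data.

```python
import math
import random


def fit_power_law_alpha(values, x_min):
    """Maximum-likelihood estimate of the exponent of a continuous power law.

    Only observations >= x_min are used. Returns (alpha, std_err, n_tail),
    where std_err is the large-sample approximation (alpha - 1) / sqrt(n).
    """
    if x_min <= 0:
        raise ValueError("x_min must be positive")
    tail = [x for x in values if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    if not tail or log_sum == 0:
        raise ValueError("need observations strictly above x_min")
    n = len(tail)
    alpha = 1.0 + n / log_sum
    std_err = (alpha - 1.0) / math.sqrt(n)
    return alpha, std_err, n


def sample_power_law(n, alpha, x_min, seed=0):
    """Draw n synthetic values by inverse-transform sampling (for testing)."""
    rng = random.Random(seed)
    exponent = -1.0 / (alpha - 1.0)
    # 1 - u lies in (0, 1], so the power is always well defined.
    return [x_min * (1.0 - rng.random()) ** exponent for _ in range(n)]


if __name__ == "__main__":
    synthetic = sample_power_law(20000, alpha=2.5, x_min=1.0)
    alpha, std_err, n = fit_power_law_alpha(synthetic, x_min=1.0)
    print(f"alpha = {alpha:.3f} +/- {std_err:.3f} (n = {n})")
```

On synthetic data generated with a known exponent, the estimate should land close to that exponent. With real revision counts, which are discrete and bounded above, a dedicated fitting routine remains the safer choice.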

Analysis of inequalities.

The analysis of the inequalities found in terms of the effort spent by every author in each of the Wikipedia communities is a central point of this thesis. I have studied the distribution of inequalities by means of the Lorenz curve and the Gini coefficient. The ineq package in GNU R provides extensive support for these and other statistical tools to measure inequality distribution.
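
For readers unfamiliar with these two measures, here is a minimal Python sketch of how they can be computed from a list of per-author contribution counts. It is an illustration only, not the thesis code (which uses the ineq package in GNU R). The function names are hypothetical, and the Gini formula shown is the plain version without any small-sample correction.

```python
def lorenz_curve(contributions):
    """Return Lorenz curve points (share of authors, share of contributions).

    Authors are sorted from least to most active; the curve starts at (0, 0)
    and ends at (1, 1).
    """
    values = sorted(contributions)
    total = sum(values)
    if not values or values[0] < 0 or total <= 0:
        raise ValueError("need non-negative values with a positive total")
    n = len(values)
    points = [(0.0, 0.0)]
    running = 0
    for rank, value in enumerate(values, start=1):
        running += value
        points.append((rank / n, running / total))
    return points


def gini(contributions):
    """Gini coefficient: 0 means perfect equality, values near 1 mean that
    a few authors account for almost all contributions."""
    values = sorted(contributions)
    total = sum(values)
    if not values or values[0] < 0 or total <= 0:
        raise ValueError("need non-negative values with a positive total")
    n = len(values)
    # Rank-weighted sum over the ascending values (ranks start at 1).
    weighted = sum(rank * value for rank, value in enumerate(values, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


if __name__ == "__main__":
    revisions_per_author = [1, 1, 2, 3, 5, 8, 40, 400]  # made-up numbers
    print(f"Gini = {gini(revisions_per_author):.3f}")
```

When every author contributes the same amount the coefficient is 0, and when a single author out of n makes all the contributions it equals (n - 1) / n.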

Survival analysis.

Finally, the most interesting novelty in this thesis is the application of Survival Analysis to the demographic study of the different communities of authors included in this thesis. This is a well-known statistical technique already applied with successful results in other scientific areas such as Medicine (especially in Epidemiology studies), Demography and even in industrial environments (where it is better known as Failure Time Analysis).

This tool has been particularly useful in this case, since the huge amount of quantitative data gathered for each community lets us obtain remarkably precise results (allowing us, for instance, to calculate the median lifetime of authors in some language versions with a 95% C.I. of approx. 2-3 days around the estimated value).

The best book I can recommend for an introduction to this topic is "Survival Analysis: A Self-Learning Text". The survival package provides all the necessary support for these techniques in GNU R.
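
To give a flavour of the technique, the sketch below implements the Kaplan-Meier estimator, a standard non-parametric survival estimator, and reads the median lifetime off the resulting curve. This is an assumption about which estimator best illustrates the idea, not a description of the exact procedure followed in the thesis, which relies on the survival package in GNU R. The function names and the sample data are hypothetical. An author who is still active when the data were collected is treated as a censored observation (`observed` is False).

```python
def kaplan_meier(durations, observed):
    """Kaplan-Meier survival curve.

    durations: time each author was followed (e.g. days since first edit).
    observed:  True if the author left the project at that time,
               False if the author was still active (censored).
    Returns a list of (time, survival probability) at each departure time.
    """
    if len(durations) != len(observed):
        raise ValueError("durations and observed must have the same length")
    data = sorted(zip(durations, observed))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        time = data[i][0]
        departures = 0
        removed = 0
        # Group all authors sharing the same time stamp.
        while i < len(data) and data[i][0] == time:
            if data[i][1]:
                departures += 1
            removed += 1
            i += 1
        if departures:
            survival *= 1.0 - departures / at_risk
            curve.append((time, survival))
        at_risk -= removed
    return curve


def median_lifetime(curve):
    """First time at which the survival probability drops to 0.5 or below.

    Returns None when the curve never reaches 0.5 (too much censoring).
    """
    for time, survival in curve:
        if survival <= 0.5:
            return time
    return None


if __name__ == "__main__":
    days = [2, 3, 3, 5, 8]                       # made-up lifetimes in days
    left = [True, False, True, True, False]      # False = still active
    curve = kaplan_meier(days, left)
    print(curve, median_lifetime(curve))
```

Confidence intervals such as the ones mentioned above require an additional variance estimate, which this sketch leaves out.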

Publications and research events related to this thesis

Book chapters.

  • "Quantitative Analysis of the Top Ten Wikipedias". Felipe Ortega, Jesus M. Gonzalez-Barahona and Gregorio Robles. In "Software and Data Technologies", Communications in Computer and Information Science series, vol. 22, pages 257-268. Ed. Springer Berlin Heidelberg, 2009. ISBN 978-3-540-88654-9 (Print) 978-3-540-88655-6 (Online).

Conferences and Workshops.

  • "The Top-Ten Wikipedias - A Quantitative Analysis Using Wikixray" (Best Student Paper Award). Felipe Ortega, Jesus M. Gonzalez-Barahona and Gregorio Robles. Proceedings of the Second International Conference on Software and Data Technologies (ICSOFT 2007) pages 46-53; Volume ISDM/EHST/DC, Barcelona, Spain, July 22-25, 2007. INSTICC Press 2007, ISBN 978-989-8111-07-4.
  • "Quantitative Analysis of the Wikipedia Community of Users". Felipe Ortega, Jesus M. Gonzalez-Barahona. Proceedings of the 2007 International Symposium on Wikis, pages 75-86; Montreal, Quebec, Canada, October 21-25, 2007. ACM 2007, ISBN 978-1-59593-861-9.
  • "On the Inequality of Contributions to Wikipedia". Felipe Ortega, Jesus M. Gonzalez-Barahona and Gregorio Robles. Proceedings of the 41st Hawaii International Conference on System Sciences (HICSS-41 2008), page 304; 7-10 January 2008, Waikoloa, Big Island, HI, USA. IEEE Computer Society 2008.
  • "On the Analysis of Contributions from Privileged Users in Virtual Open Communities". Felipe Ortega, Daniel Izquierdo-Cortazar, Jesus M. Gonzalez-Barahona and Gregorio Robles. Proceedings of the 42nd Hawaii International Conference on System Sciences (HICSS-42 2009), section 1-10 (CD-ROM and online); 5-8 January 2009, Waikoloa, Big Island, HI, USA. IEEE Computer Society 2009.

Posters.

  • "In-depth Analysis of Wikipedia Community". José Felipe Ortega, Jesus M. Gonzalez-Barahona and Gregorio Robles. Proceedings of the 11th Intl. Conference of the International Society for Scientometrics and Informetrics (ISSI 2007), pages 910-911, 2007.
  • "Quantitative Analysis and Characterization of Wikipedia Requests". Antonio J. Reinoso, Jesus M. Gonzalez-Barahona, Felipe Ortega and Gregorio Robles. 2008 International Symposium on Wikis (WikiSym 2008).

Workshops.

  • 1st Workshop on Interdisciplinary Research on Wikipedia and Wiki Communities (WIRW 2008). Joseph Reagle, Felipe Ortega, Antonio J. Reinoso and Ruth Jesus. 2008 International Symposium on Wikis (WikiSym 2008).