|Title||Empirical Software Engineering Research on Libre Software: Data Sources, Methodologies and Results|
|Year of Publication||2006|
|Advisor||Gonzalez-Barahona, Jesus M.|
|Academic Department||Departamento de Informática, Estadística y Telemática|
|Degree||Doctor europeus of Philosophy in Computer Science|
|Number of Pages||276|
|University||Universidad Rey Juan Carlos|
|Thesis Type||Doctoral Thesis|
|Keywords||data mining, empirical studies, libre software, software archaeology, software engineering, software evolution, software maintenance|
With the emergence and spread of the Internet, new ways of developing software have arisen that make use of telematic tools, follow flexible methodologies and incorporate third-party contributions. A paradigmatic example of software development with these characteristics is libre (free/open source) software; of special interest are those projects that are large in number of participants and in software size.
Although at first sight these new environments are less controllable than traditional ones (because development is generally geographically distributed, there is no company behind the development that takes the lead, traditional hierarchical structures are not followed, or external contributions are hard to predict), we have access to a great deal of information: the software product itself and many of the by-products created during the development process (communication archives, bug-tracking systems and versioning systems, among others). These data sources are usually publicly available on the Internet, so we can perform exhaustive analyses on a large amount of data (much of which is hard to obtain in traditional, industrial environments).
The goal of this thesis is to identify the data sources that libre software projects offer publicly, to present some methodologies for analyzing these sources and the data that can be extracted from them, and to show the results obtained by applying these methodologies. Our intention is to better understand the libre software phenomenon in particular, but also software creation processes in general, since the acquired knowledge need not be specific to libre software and could be applied to many other development environments.
Thus, this thesis starts with a description of the data sources publicly available on the Internet and the data that can be extracted from them. Afterwards, several methods, which depend on the source, are used to obtain information from the data and to filter out interference. Finally, several methodologies are presented and applied to the data obtained from libre software projects that have been selected as case studies. The methodologies range from classical to novel ones. Among the classical ones, we analyze the growth of software systems, as known from software evolution, and we apply social network analysis, a technique from the social sciences. In both cases, the contribution of this thesis has been to apply them to libre software projects. Regarding novel methodologies, we propose the archaeological analysis of software systems, with the aim of determining what remains from previous versions; the generalization of software evolution to file types other than source code (for instance, documentation, translation or user interface files, among others); and the study of the evolution of volunteer participation and the regeneration of the leading "core" group. In addition, a series of tools has been created to automate, at least partially, the whole process. These tools make it possible to reuse these methodologies on other projects.
Among the main contributions of this thesis, we can state that this is the first exhaustive analysis of a large number of software projects; moreover, the proposed methodologies and the tools that have been developed will allow more projects to be studied in the near future. In addition, we have shown that technical analysis should be complemented with socio-technical analysis in order to fully understand the development process and many of the technical issues of (libre) software projects.