Workshop on Public Data about Software Development (WoPDaSD 2006) (June 10th 2006)
Co-located with the Second International Conference on Open Source Systems.
The proceedings of the WoPDaSD 2006 workshop are now available
The report of the WoPDaSD 2006 workshop is now available
Motivation
In recent years, especially thanks to the large amount of data about software development that can be obtained from libre (free, open source) projects, the research community has started to produce, use and exchange large data sets. These data sets have to be retrieved, cleaned and described, and can then be published for use by other groups. Their availability allows for the decoupling of research activities (some groups can focus on data retrieval and preliminary analysis, while others can devote themselves to more in-depth analysis without bothering with data retrieval), the reproducibility of research results, and even collaboration (and competition) in the analysis of data.
All this activity is being presented in several workshops and conferences, but a single place to exchange experiences does not exist yet. We propose this workshop as such a place, where researchers in the field can discuss these data sets specifically: how they are retrieved, how they can be analyzed and mined, how they can be exchanged and complemented, etc.
Main goals
The goal of this workshop is to foster the analysis of publicly available data sources about software development and the exchange of data between different research groups.
The workshop is aimed specifically at two kinds of studies:
- Analysis of some data collections about software development
(provided by the organizers, see below).
The analysis should show a methodology for exploring either of those data sets (or better, for relating both) in search of some specific result in the area of software development, and its application to the actual data sets. The study can be in the field of software engineering, economics, sociology, human resources, and others.
- Retrieval process and exchange formats of publicly available data
collections about software development.
The data collections presented should be publicly available, be based on public data (so that other groups can reproduce the data collection process), and be related to the field of software development. This includes, but is not limited to, data from source control systems, bug tracking systems, mailing lists, websites, source and binary code, quality assurance systems, etc. Although any kind of data collection can be considered, those including information about a large number of projects will be considered especially appropriate.
Detailed description
Following the goals described above, the workshop will accept papers about two specific issues:
- Analysis of two data collections about libre software development:
FLOSSMole and CVSAnaly-SF.
These collections, already available to any researcher, are offered for analysis. The studies submitted should detail how the collections have been used, which part of the information has been considered, and how the data have been validated, filtered and/or post-processed (if that is the case). The description should be detailed enough to let any other research group reproduce the study.
- Studies about the data retrieval and preparation for public consumption of data sets in the same realm, which could be proposed for analysis in future editions of the workshop.
FLOSSMole
FLOSSmole (formerly OSSmole) is a set of tools for gathering data (metrics) about the development of free/libre/open source projects. The FLOSSmole project also publishes the resulting analyses of FLOSS projects, and accepts data donations from other research groups. It offers this workshop a complete set of data gathered from the SourceForge development platform and the Freshmeat announcement system.
More information can be obtained from http://ossmole.sourceforge.net
CVSAnaly-SF
CVSAnalY is a tool created by the Libre Software Engineering Group at the Universidad Rey Juan Carlos that extracts statistical information from CVS (and, recently, Subversion) repository logs and transforms it into SQL database formats. It has been used to retrieve information for all projects that have an active CVS system at SourceForge. This data set is publicly offered for analysis in this workshop.
More information can be obtained from http://libresoft.urjc.es/Data
Target audience
The target audience is composed of research groups interested in empirical software engineering and quantitative studies of software development processes and methods. This includes not only software engineers, but also researchers from other fields who might use the data for economic, social and other studies.
Submissions
We solicit short position papers (3 pages) and research papers (6 pages). Short papers are expected to discuss controversial issues in the field, or describe interesting or thought-provoking ideas that are not yet fully developed, while full papers are expected to describe new research results and have a higher degree of technical rigor than short papers. The papers must be in ACM 2-column format. Authors may indicate their intent to submit a paper by 21st April 2006; the title and abstract of the paper should be sent by e-mail to grex@gsyc.escet.urjc.es. The full paper should be sent to grex@gsyc.escet.urjc.es by 28th April 2006 as PDF. Notification of acceptance will be sent by 21st May 2006. The final version of the paper is due on 2nd June 2006. Accepted papers will be published as part of the WoPDaSD proceedings (which will be delivered on a CD and on the Internet).
Tentative Program
| 9:00 | Welcome by Jesús González-Barahona (Universidad Rey Juan Carlos) |
| 9:10 | "Description of data sets" by Gregorio Robles (Universidad Rey Juan Carlos) |
| 9:25 | "Mining CVS Signals" by Jean-Michel Dalle (Univ. Pierre & Marie Curie), Laurent Daudet (Univ. Pierre & Marie Curie) and Mattheijs den Besten (UPMC & OII) |
| 10:05 | "Regurgitate: Using GIT For F/LOSS Data Collection" by Bart Massey (Portland State University) and Keith Packard (Open Source Technology Center at Intel) |
| 10:45 | Coffee break |
| 11:15 | "A tool for the measurement, storage, and pre-elaboration of data supporting the release of public datasets" by Aurelio Fausto Crisá (Geosystem SRL), Vieri del Bianco (CEFRIEL) and Luigi Lavazza (Università dell'Insubria and CEFRIEL) |
| 12:00 | Final discussion and wrap-up moderated by Megan Conklin (Elon University) |
| 13:00 |
Important Dates
- Intent to submit: 21st April 2006 (not mandatory, only for organizational purposes)
- Deadline for submission: 28th April 2006 (deadline extension: 6th May 2006)
- Paper notification: 21st May 2006
- Camera-ready paper due: 2nd June 2006
- Workshop date: 10th June 2006
Registration and Fees
The workshop advance fee is 50 € and covers participation in the workshop and coffee break(s). It also covers a copy of the workshop proceedings. In order to register, please fill in the registration form (WorkshopFormWoPDaSD.pdf or WorkshopFormWopDaSD.rtf) and return it to the conference organizers. Note that if you have registered for the main conference, you do not need to fill in the form.
Organizing Committee
- Jesús M. González-Barahona (Universidad Rey Juan Carlos, Spain)
- Megan Conklin (Elon University, USA)
- Gregorio Robles (Universidad Rey Juan Carlos, Spain)
Program Committee
- Stefan Koch (Wirtschaftsuniversität Vienna, Austria)
- Kevin Crowston (Syracuse University, USA)
- Bart Massey (Portland State University, USA)
- Sandeep Krishnamurthy (University of Washington, USA)
- Kieran Healy (University of Arizona, USA)
- Dawid Weiss (Poznan University of Technology, Poland)
- Daniel M. Germán (University of Victoria, Canada)
Further information
Should you need further information, you can email jgb @ gsyc.escet.urjc.es.
