Workshop on Public Data about Software Development (WoPDaSD 2008) (September 2008)

Co-located with the Fourth International Conference on Open Source Systems.

September 7th - 10th 2008, Milan, Italy

 

 

Motivation

In recent years, thanks especially to the wide availability of data about software development that can be obtained from libre (free, open source) projects, the research community has started to produce, use and exchange large data sets. These data sets have to be retrieved, cleaned and described before they can be published for use by other groups. Their availability allows for the decoupling of research activities (some groups can focus on data retrieval and preliminary analysis, while others can devote themselves to more in-depth analysis without bothering with data retrieval), the reproducibility of research results, and even collaboration (and competition) in the analysis of data.

All this activity is being presented in several workshops and conferences, but since they are not focused specifically on it, the exchange of experiences and the discussions are not as deep and fruitful as they could be. For the third year in a row, this workshop aims to be such a place, hosting researchers in the field to discuss specifically these kinds of data sets: how they are retrieved, how they can be analyzed and mined, how they can be exchanged and complemented, etc.

 

Main goals

The goal of this workshop is to foster the analysis of publicly available data sources about software development, and the exchange of data between different research groups.

The workshop is aimed at three different kinds of studies (although other related studies could also be considered):

  1. Analysis of specific projects (provided by the organizers, see below).
    The analysis should present a methodology for exploring the projects, and should also explain any "odd" things that appear in the data set. For instance, a company-driven project can show different behavior than a community-driven project. The study can be in the field of software engineering, economics, sociology, human resources, and others.

  2. Retrieval process and exchange formats of publicly available data collections about software development.
    The data collections presented should be publicly available, be based on public data (so that other groups can reproduce the data collection process), and be related to the field of software development. This includes, but is not limited to, data from source control systems, bug tracking systems, mailing lists, websites, source and binary code, quality assurance systems, etc. Although any kind of data collection can be considered, those including information about a large number of projects will be considered especially appropriate.

  3. Data mining activities and new retrieval tools.
    Working with a huge quantity of data is always a complex problem. Data mining techniques are welcome in this section, provided that papers include some conclusions about a specific set of projects. Again, this analysis should show a methodology to explore the data and explain the whole process. Cross-analysis of data sets, especially of those provided by the organizers (FLOSSMole and FLOSSMetrics databases), is particularly welcome.
    Also, new tools developed to obtain data from several data sources, such as forums, wikis, bug tracking systems and others, fit perfectly here.

Detailed description of data sources

Following the goals described above, the workshop will accept papers about two specific issues (not taking into account the development of new data mining tools):

  1. Analysis of two data collections about libre software development: FLOSSMole and FLOSSMetrics
    These collections, already available to any researcher, are offered for analysis by third parties (see below). The studies submitted should detail how they have been used, which part of the information has been considered, how they have been validated or filtered and/or post-processed (if that is the case). The description should be detailed enough to let any other research group reproduce the study.

  2. Studies about the data retrieval and preparation for public consumption of data sets in the same realm, which could be proposed for analysis in future editions of the workshop.

FLOSSMole

FLOSSMole (formerly OSSmole) is a set of tools for gathering data (metrics) about the development of free/libre/open source projects. The FLOSSMole project also publishes the resulting analysis about FLOSS projects, and accepts data donations from other research groups. It offers this workshop a complete set of data gathered from the SourceForge development platform and the Freshmeat announcement system.

More information can be obtained from http://ossmole.sourceforge.net


FLOSSMetrics

By the end of the FLOSSMetrics project, a huge database with data from thousands of projects is expected to be available. The project is already retrieving data, with information currently available for about 1,000 projects (mainly retrieved from CVS and SVN repositories, but also from mailing lists). These results are publicly available at http://data.flossmetrics.org

More information can be obtained from http://flossmetrics.org

 

Target audience

The target audience is composed of research groups interested in empirical software engineering and quantitative studies of software development processes and methods. This includes not only software engineers, but also researchers from other fields who might use the data for economic, social and other studies.

 

Submissions

We solicit short position papers (3 pages) and research papers (6 pages). Short papers will be expected to discuss controversial issues in the field, or describe interesting or thought-provoking ideas that are not yet fully developed, while full papers will be expected to describe new research results, and have a higher degree of technical rigor than short papers. The papers must be in ACM 2-column format. Authors may indicate their intent to submit a paper by June 16th 2008 (the title of the paper and abstract should be sent via e-mail to megan @ elon.edu, jgb @ gsyc.escet.urjc.es, and dizquierdo @ gsyc.escet.urjc.es). The full paper should be sent to the aforementioned addresses by June 25th 2008 as PDF. Notification of acceptance will be sent by July 7th 2008. The final version of the paper is due on July 14th 2008. Accepted papers will be published in agreement with the OSS requirements.

 

Intended Program

  • 9:00 Handshake session and welcome by WoPDaSD 2008 organizers
  • 9:05 QualipSO (slides)
  • 9:10 QualOSS (slides)
  • 9:15 SQO-OSS
  • 9:20 FLOSSMetrics (slides)
  • 9:25 FLOSSMole
  • 9:30 "Advances in the SourceForge Research Data Archive" by Greg Madey (University of Notre Dame, USA) (.pdf, slides)
  • 9:50 "Collecting data from distributed FOSS projects" by Fabian Fagerholm (University of Helsinki, Finland) (.pdf, slides)
  • 10:10 "Cross-repository data linking with RDF and OWL" by Kevin Crowston (Syracuse University, USA) (.pdf, slides)
  • 10:30 Coffee break
  • 11:00 "Improving community awareness in software forges by semantical aggregation of tools feeds" by Quang Vu DANG (Institut Telecom, France) (.pdf, slides)
  • 11:20 "Are FLOSS developers committing to CVS/SVN as much as they are talking in mailing lists? Challenges for Integrating data from Multiple Repositories" by Sulayman K. Sowe (Aristotle University, Greece) (.pdf, slides)
  • 11:40 "Author Entropy: A Metric for Characterization of Software Authorship Patterns" by Quinn C. Taylor (Brigham Young University, USA) (.pdf, slides)
  • 12:00 "Computer Support for Discovering OSS Processes" by Chris Jensen (University of California, USA) (.pdf, slides)
  • 12:20 Feedback and discussion about projects which offer data for research activities: FLOSSMole and FLOSSMetrics.
  • 13:00 Lunch time

Place (see details on the OSS 2008 website):

  • Milano Convention Center
  • Via Giovanni Gattamelata 5, Milan
  • Italy

 

Important Dates

  • Intent to submit: 16th June 2008 (not mandatory, only for organizational purposes)
  • Deadline for submission: 25th June 2008
  • Paper notification: 7th July 2008
  • Camera-ready paper due: 14th July 2008
  • Workshop date: 10th September 2008

 

Organizing Committee

  • Jesus M. Gonzalez-Barahona (Universidad Rey Juan Carlos, Spain)
  • Megan Squire (Elon University, USA)
  • Daniel Izquierdo-Cortazar (Universidad Rey Juan Carlos, Spain)

 

Program Committee (tentative)

  • Kevin Crowston (Syracuse University, USA)
  • Jean-Christophe Deprez (CETIC, Belgium)
  • Roberto Di Cosmo (University of Paris VII, France)
  • Justin Erenkrantz (Apache, USA)
  • Daniel M. Germán (University of Victoria, Canada)
  • Robert Gobeille (HP, USA)
  • Stefan Koch (Wirtschaftsuniversität Vienna, Austria)
  • Ken Krugler (Krugle, Inc., USA)
  • Bart Massey (Portland State University, USA)
  • Sandro Morasca (Università dell'Insubria, Italy)
  • Gregorio Robles (Universidad Rey Juan Carlos, Spain)
  • Francesco Rullani (Copenhagen Business School, Denmark)
  • Walt Scacchi (University of California at Irvine, USA)
  • Diomidis Spinellis (Athens University of Economics and Business, Greece)
  • Tony Wassermann (Carnegie Mellon West, USA)
  • Jim Whitehead (University of California at Santa Cruz, USA)
  • Thomas Zimmermann (University of Calgary, Canada)

 

Sponsoring projects

Some research projects sponsor this workshop (although it is open to anyone who registers). Some of these projects are funded in part by the European Commission, under the Information Society Technologies (IST) research programme of the Sixth Framework Programme. A list of the IST projects in the area of Software Technologies is available from http://cordis.europa.eu/ist/st/projects.htm.

 

Further information

Should you need further information, you can email jgb @ gsyc.escet.urjc.es.