Replication package for The Promises and Perils of Mining GitHub

1 Replication package

This is the replication package for the paper "The Promises and Perils of Mining GitHub", currently under review at MSR'14, the International Working Conference on Mining Software Repositories.

1.1 Main data set used in the study

The main source of information used in this study was GHTorrent, which is available for download. It is the most comprehensive data set for GitHub and is maintained by one of the authors (Georgios Gousios). We used the SQL data dump that includes events up to Jan 1, 2014, which is labelled 2014-01-02.

1.2 Sample of 432 repositories

To complement this study we used a sample of 432 repositories. We have included a CSV file that contains the following fields. Of particular interest are the last two fields, which were classified manually. The file is located here: 400.csv

| field        | description                                                      | manual |
|--------------+------------------------------------------------------------------+--------|
| repo_id      | repoId in the GHTorrent data set                                 |        |
| name         | name of the repository given by its owner                        |        |
| url          | URL of the repository                                            |        |
| description  | description given by its owner                                   |        |
| c_repos      | number of repositories in the project                            |        |
| latest       | date of latest commit                                            |        |
| earliest     | date of earliest commit                                          |        |
| c_commits    | number of commits recorded in GHTorrent                          |        |
| c_authors    | number of authors recorded in GHTorrent                          |        |
| c_committers | number of committers recorded in GHTorrent                       |        |
| created_at   | date when it was created                                         |        |
| language     | main programming language according to GitHub                    |        |
| deleted      | was it deleted?                                                  |        |
| subject      | manually classified detailed subject of the repository           | yes    |
| subject_agr  | manually classified less detailed subject (used for aggregation) | yes    |
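
As a hedged illustration of how the last two fields might be used, the sketch below tallies repositories per manually classified subject. It assumes that 400.csv is comma-separated and has a header row naming the fields listed above; the actual layout of the file is not confirmed here. The helper names (`make_sample`, `count_by_field`) and the sample rows are hypothetical, made up so that the example runs without the real file.

```python
import csv
import io
from collections import Counter

# Field names as listed in the table above (assumed to be the CSV header).
FIELDS = [
    "repo_id", "name", "url", "description", "c_repos", "latest",
    "earliest", "c_commits", "c_authors", "c_committers", "created_at",
    "language", "deleted", "subject", "subject_agr",
]


def make_sample():
    """Build a tiny in-memory CSV with invented rows (not real study data)."""
    rows = [
        {"repo_id": "1", "name": "example-a", "subject": "web framework",
         "subject_agr": "software"},
        {"repo_id": "2", "name": "example-b", "subject": "command line tool",
         "subject_agr": "software"},
        {"repo_id": "3", "name": "example-c", "subject": "course notes",
         "subject_agr": "documentation"},
    ]
    buf = io.StringIO()
    # restval="" leaves the fields we did not invent values for empty.
    writer = csv.DictWriter(buf, fieldnames=FIELDS, restval="")
    writer.writeheader()
    writer.writerows(rows)
    buf.seek(0)
    return buf


def count_by_field(csv_file, field="subject_agr"):
    """Count how many repositories fall under each value of `field`."""
    reader = csv.DictReader(csv_file)
    return Counter(row[field] for row in reader)


counts = count_by_field(make_sample())
print(counts)
```

With the real file, one could pass `open("400.csv", newline="")` to `count_by_field` instead of the in-memory sample, provided the header assumption holds.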

1.3 Aggregation of repositories into projects

The aggregation of repositories into projects is important to our study, and it is not available in the GHTorrent data. We wrote a Perl script, which we make available, to perform this analysis. It works by following each repository and its chain of forks until reaching a repository that was not itself forked from another. In very few cases a repository A was cloned from another repository B, but B appears to have been cloned from A (we speculate this is because A was deleted after B forked it, and then A was recreated from B). For this reason, when we detect such a tie, we break it by choosing the older repository as the origin. The script outputs a table that contains two columns important for this study.

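| column         | description                                                   |
|----------------+---------------------------------------------------------------|
| repo_id        | the id of the repo in GHTorrent                               |
| forked_from_id | the repository of the project from which all forks originated |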
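
The fork-following procedure described in this section can be sketched as follows. This is a minimal Python illustration, not the Perl script used in the study. The function name `find_origin`, the two input mappings, and the example ids and dates are all hypothetical. It also assumes that a repository which was never forked from another is its own origin, which the description above does not state explicitly.

```python
from datetime import datetime


def find_origin(repo_id, forked_from, created_at):
    """Follow fork links upward from repo_id to the project's origin.

    forked_from: maps repo id -> id of the repo it was forked from
                 (None or missing if it is not a fork).
    created_at:  maps repo id -> creation datetime, used only to break ties
                 when two repos each claim to be forked from the other.
    """
    seen = []
    current = repo_id
    while True:
        if current in seen:
            # We looped back: the repos from `current` onward form a cycle.
            # Break the tie by choosing the oldest repo in the cycle.
            cycle = seen[seen.index(current):]
            return min(cycle, key=lambda r: (created_at[r], r))
        seen.append(current)
        parent = forked_from.get(current)
        if parent is None:
            # Reached a repository that was not forked from anything.
            return current
        current = parent


# Invented example: 3 -> 2 -> 1 is a plain chain; 10 and 11 point at each other.
forked_from = {1: None, 2: 1, 3: 2, 10: 11, 11: 10, 12: 11}
created_at = {
    10: datetime(2013, 5, 1),
    11: datetime(2012, 3, 1),  # older, so it wins the tie
    12: datetime(2013, 6, 1),
}

origins = {r: find_origin(r, forked_from, created_at) for r in forked_from}
for repo, origin in sorted(origins.items()):
    print(repo, origin)
```

Recording the visited repositories in order means a cycle is detected wherever it occurs in the chain, so a fork of either A or B resolves to the same origin as A and B themselves.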

1.4 Questions?

If you have questions about this replication package, please do not hesitate to contact me.

–daniel german

Author: dmg

Created: 2014-02-13 Thu 09:06