Using data science to manage a software project in a GitHub organization, Part 1

Create a data science project from scratch



This series covers two problems: how to use data science to investigate project management around software engineering, and how to publish a data science tool to the Python Package Index.

Data science as a discipline is exploding, and many articles discuss the ins and outs of such topics as which algorithm to use. However, only a few explain how to collect data, create a project structure, and ultimately publish your software to the Python Package Index. This tutorial provides detailed hands-on instructions on both topics. The source code for this series is available on GitHub.

Software project management problems

Despite being around for decades, the software industry is still plagued by lingering issues of late delivery and poor quality. Other problems lie in evaluating the performance of teams and individual developers. In addition, a current trend in the software industry is to employ freelancers and contractors, which brings up yet another problem: How can an organization evaluate the talent of such non-employees? Furthermore, motivated professional software developers want to find a way to get better. What patterns can they emulate from the best software developers in the world?

Fortunately, developers leave behavioral logs that produce signals that can help answer these questions. Every time a developer commits to a repository, a signal is created. Management can then use data science to explore these signals.

It is important to note that unless properly evaluated, such signals can be misleading. It takes only minutes of getting into a discussion about source code metadata with a clever but devious developer to hear him or her say, "Oh, this doesn't mean anything. I can just game the system." As an example, consider the developer whose GitHub profile shows 3,000 commits in a year, which equates to roughly eight commits every calendar day. This looks truly heroic, but consider that these commits may have been created by an automated script or that they are possibly "fake" commits that merely add a line to a README file. Those are the kinds of phony signals you must look out for.

Exploratory questions to consider

When evaluating a project and its developer team, here is a partial list of initial questions to consider:

  • What are the characteristics of a good (or poor) software developer or team?
  • Are there signals that can predict faulty software?
  • How can a software project manager spot signals that call for action to turn around a troubled project?
  • Is there a difference between looking at open source and closed source projects?
  • Are there signals that identify a developer who is "gaming the system?"
  • Do you have unreliable developers, whose commits have large gaps, for instance?
  • Are you destroying your team's productivity by having too many meetings?

Here and in Part 2 that follows, I suggest some techniques that can help you answer these and other questions.

Creating an initial data science project skeleton

An often overlooked part of developing a new data science solution is the initial structure of the project. Before work is started, a best practice is to create a layout that will facilitate high-quality work and a logical organization. There are many ways to lay out a project structure, but here is one recommendation (see the listing immediately following this list for an actual output):

  • .circleci directory: This holds the configuration necessary to build the project using the CircleCI SaaS build service. (There are many similar services that work with open source software. For example, you could use an open source tool such as Jenkins.)
  • .gitignore: Be sure to ignore files that are not part of the project. This is a common misstep.
  • Code of conduct: It's a good idea to put into your project some information about how you expect contributors to behave.
  • CONTRIBUTING.MD: Explicit instructions about how you will accept contributions are helpful in recruiting assistance.
  • LICENSE: Having a license, such as MIT or BSD, is helpful. In some cases, a potential contributor may not be able to participate if you don't have a license.
  • Makefile: A makefile is a great tool for running tests and deploying and setting up an environment.
  • README: A good README answers basic questions such as how a user builds the project and what the project does. Additionally, the README might include "badges" that show the quality of the project, such as a passing build.
  • Command-line tool: In my example, I have a dml command-line tool. Having a command-line interface is helpful both in exploring your library and in creating an interface for testing.
  • Library directory with an __init__.py file: At the root of a project, you should create a library directory with an __init__.py file to indicate that it is importable. In this example, the library is called devml.
  • ext directory: This directory is a good place for things such as a config.json or a config.yml file. It is much better to put non-code in a place where it can be centrally referred to. A data subdirectory might be necessary as well to create some local, truncated samples to explore.
  • notebooks directory: A specific folder for holding Jupyter Notebooks makes it easy to centralize the development of notebook-related code. Additionally, it makes setting up automated testing of notebooks easier.
  • requirements.txt: A file that holds a list of packages necessary for the project.
  • setup.py: A configuration file that sets up the way a Python package is deployed. You can also use it to deploy to the Python Package Index.
  • tests directory: A directory in which to place tests.

Here is the output of an ls command listing the specific components discussed above.

(.devml) ➜ devml git:(master) ✗ ls -la 
drwxr-xr-x  3 noahgift staff   96 Oct 14 15:22 .circleci
-rw-r--r--  1 noahgift staff  1241 Oct 21 13:38 .gitignore
-rw-r--r--  1 noahgift staff  3216 Oct 15 11:44
-rw-r--r--  1 noahgift staff  357 Oct 15 11:44
-rw-r--r--  1 noahgift staff  1066 Oct 14 14:10 LICENSE
-rw-r--r--  1 noahgift staff  464 Oct 21 14:17 Makefile
-rw-r--r--  1 noahgift staff 13015 Oct 21 19:59
-rwxr-xr-x  1 noahgift staff  9326 Oct 21 11:53 dml
drwxr-xr-x  4 noahgift staff  128 Oct 20 15:20 ext
drwxr-xr-x  7 noahgift staff  224 Oct 22 11:25 notebooks
-rw-r--r--  1 noahgift staff  117 Oct 18 19:16 requirements.txt
-rw-r--r--  1 noahgift staff  1197 Oct 21 14:07
drwxr-xr-x 12 noahgift staff  384 Oct 18 10:46 tests
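A layout like this can also be created by a short script rather than by hand. The following is only a minimal sketch using the standard library: the scaffold function is a hypothetical helper that is not part of devml, and the README.md, setup.py, and __init__.py names are assumptions based on common Python conventions.

```python
from pathlib import Path

# Directories and files from the recommended layout. README.md, setup.py and
# devml/__init__.py are assumed names based on common Python conventions.
DIRECTORIES = [".circleci", "devml", "ext", "notebooks", "tests"]
FILES = [
    ".gitignore",
    "CONTRIBUTING.MD",
    "LICENSE",
    "Makefile",
    "README.md",
    "requirements.txt",
    "setup.py",
    "dml",
    "devml/__init__.py",
]


def scaffold(root):
    """Create an empty project skeleton under root and return the created paths."""
    root = Path(root)
    created = []
    for name in DIRECTORIES:
        directory = root / name
        directory.mkdir(parents=True, exist_ok=True)
        created.append(directory)
    for name in FILES:
        path = root / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch(exist_ok=True)  # leaves the content of existing files alone
        created.append(path)
    return created
```

Because existing directories and files are left untouched, the function can be run again safely on a project that is already partly set up.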


Collecting and transforming the data

As usual, the worst part of the problem is figuring out how to collect the data and transform it into something useful. There are several parts of this problem to solve. The first is how to collect a single repository and create a pandas DataFrame from it. To do this, you create a new module called mkdata.py inside the devml directory. This module addresses the issues around converting a Git repository's metadata to a pandas DataFrame.

Here is a portion of that module. The log_to_dict function takes a path to a single Git checkout on disk, then converts the output of a git log command into a list of Python dictionaries, one per commit.

import os
from subprocess import Popen, PIPE

# GIT_LOG_CMD, GIT_COMMIT_FIELDS, generate_repo_name, and log are defined
# elsewhere in the module.
def log_to_dict(path):
  """Converts Git Log To A Python Dict"""
  os.chdir(path) #change directory to process git log
  repo_name = generate_repo_name()
  p = Popen(GIT_LOG_CMD, shell=True, stdout=PIPE)
  (git_log, _) = p.communicate()
  try:
    git_log = git_log.decode('utf8').strip('\n\x1e').split("\x1e")
  except UnicodeDecodeError:
    log.exception("utf8 encoding is incorrect, trying ISO-8859-1")
    git_log = git_log.decode('ISO-8859-1').strip('\n\x1e').split("\x1e")
  #split each record into fields, then pair the fields with their names
  git_log = [row.strip().split("\x1f") for row in git_log]
  git_log = [dict(list(zip(GIT_COMMIT_FIELDS, row))) for row in git_log]
  for dictionary in git_log:
    dictionary["repo"] = repo_name #tag every commit with its repository
  repo_msg = "Found %s Messages For Repo: %s" % (len(git_log), repo_name)
  log.info(repo_msg)
  return git_log
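The listing relies on a GIT_LOG_CMD and a GIT_COMMIT_FIELDS constant that are not shown. The sketch below is one way the parsing step could look in isolation. FIELDS and LOG_FORMAT are hypothetical stand-ins for those constants, chosen to match the keys in the sample output, and the field and record separators (\x1f and \x1e) follow the listing above. Passing cwd to subprocess is a swapped-in alternative to os.chdir, so the caller's working directory never changes.

```python
import subprocess

# Hypothetical stand-ins for the module's GIT_COMMIT_FIELDS and GIT_LOG_CMD.
FIELDS = ["id", "author_name", "author_email", "date", "message"]
# In git's pretty format, %x1f and %x1e emit the bytes 0x1f and 0x1e, used
# here to separate fields and to end each record.
LOG_FORMAT = "%H%x1f%an%x1f%ae%x1f%ad%x1f%s%x1e"


def parse_git_log(raw, repo_name):
    """Turn raw `git log` bytes into a list of dicts, one per commit."""
    try:
        text = raw.decode("utf8")
    except UnicodeDecodeError:
        text = raw.decode("ISO-8859-1")  # same fallback as the listing above
    commits = []
    for record in text.strip("\n\x1e").split("\x1e"):
        values = record.strip("\n").split("\x1f")
        if len(values) != len(FIELDS):
            continue  # skip empty or malformed records
        commit = dict(zip(FIELDS, values))
        commit["repo"] = repo_name
        commits.append(commit)
    return commits


def read_git_log(path):
    """Run git log inside path without changing the process working directory."""
    cmd = ["git", "log", "--format=" + LOG_FORMAT]
    return subprocess.run(cmd, cwd=path, stdout=subprocess.PIPE, check=True).stdout
```

Keeping the parsing separate from the subprocess call makes it possible to test the parser with hand-built byte strings, without needing a real repository on disk.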

In the next two functions, a path on disk is used to call the function above. Note that logs are stored as items in a list, and this list is used to create a DataFrame in pandas:

import os
import pandas as pd

def create_org_df(path):
  """Returns a Pandas Dataframe of an Org"""
  original_cwd = os.getcwd()
  logs = create_org_logs(path)
  org_df = pd.DataFrame.from_dict(logs)
  #convert date to datetime format
  datetime_converted_df = convert_datetime(org_df)
  #Add A Date Index
  converted_df = date_index(datetime_converted_df)
  new_cwd = os.getcwd()
  cd_msg = "Changing back to original cwd: %s from %s" % (original_cwd, new_cwd)
  log.info(cd_msg)
  os.chdir(original_cwd) #log_to_dict changes directory, so restore it
  return converted_df

def create_org_logs(path):
  """Iterate through all paths in current working directory,
  make log dict"""
  combined_log = []
  for sdir in subdirs(path):
    repo_msg = "Processing Repo: %s" % sdir
    log.info(repo_msg)
    combined_log += log_to_dict(sdir)
  log_entry_msg = "Found a total log entries: %s" % len(combined_log)
  log.info(log_entry_msg)
  return combined_log
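The subdirs helper called above is not shown. As a rough sketch only, it could be written as follows, assuming it should return each immediate subdirectory that contains a .git directory. The working_directory context manager is a hypothetical addition that restores the current directory even if an exception is raised part way through.

```python
import contextlib
import os


def subdirs(path):
    """Yield the path of each immediate subdirectory that is a Git checkout."""
    with os.scandir(path) as entries:
        for entry in sorted(entries, key=lambda e: e.name):
            if entry.is_dir() and os.path.isdir(os.path.join(entry.path, ".git")):
                yield entry.path


@contextlib.contextmanager
def working_directory(path):
    """Temporarily change the current directory, restoring it on exit."""
    original = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(original)
```

With a context manager, the change of directory and its restoration live in one place instead of being split across two functions.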

In action, this code looks like this when run without collecting into a DataFrame.

In [5]: res = create_org_logs("/Users/noahgift/src/flask")
2017-10-22 17:36:02,380 - devml.mkdata - INFO - Found repo: /Users/noahgift/src/flask/flask
In [11]: res[0]
{'author_email': '',
 'author_name': 'Radoslav Gerganov',
 'date': 'Fri Oct 13 04:53:50 2017',
 'id': '9291ead32e2fc8b13cef825186c968944e9ff344',
 'message': 'Fix typo in logging.rst (#2492)',
 'repo': b'flask'}

The second section, which makes the DataFrame, looks like the listing below.

res = create_org_df("/Users/noahgift/src/flask")
In [14]: res.describe()
count  9552.0
mean    1.0
std    0.0
min    1.0
25%    1.0
50%    1.0
75%    1.0
max    1.0

At a high level, this is a pattern to get ad-hoc data from a third party such as a Git log. To dig into this in more detail, look at the source code in its entirety.

Talking to an entire GitHub organization

With the code in place that transforms Git repositories on disk into DataFrames, a natural next step is to collect all the repositories for an organization. A key problem in analyzing just one repository is that it is only part of the data that matters in the context of the organization. One way to fix this is to talk to the GitHub API and programmatically pull down all the repositories. I use the fetch_repo module to do this, the highlights of which are shown below.

from git.exc import GitCommandError

def clone_org_repos(oath_token, org, dest, branch="master"):
  """Clone All Organizations Repositories and Return Instances of Repos.
  """
  if not validate_checkout_root(dest):
    return False
  repo_instances = []
  repos = org_repo_names(oath_token, org)
  count = 0
  for name, url in list(repos.items()):
    count += 1
    log_msg = "Cloning Repo # %s REPO NAME: %s , URL: %s " %\
             (count, name, url)
    log.info(log_msg)
    try:
      repo = clone_remote_repo(name, url, dest, branch=branch)
      repo_instances.append(repo)
    except GitCommandError:
      #the clone fails when the requested branch does not exist
      log.exception("NO MASTER BRANCH...SKIPPING")
  return repo_instances

Both the PyGithub and the GitPython packages are used to do much of the heavy lifting. When this code is run, it iteratively finds each repo from the API and clones it. The previous code can then be used to create a combined DataFrame.
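The helpers org_repo_names and clone_remote_repo are not shown in the listing. The sketch below is one possible shape for them, not the actual devml implementation. It assumes PyGithub's Github.get_organization and get_repos calls and GitPython's Repo.clone_from, imports both lazily, and accepts the clone function as a parameter so the loop can be exercised without network access. The names gitpython_clone and clone_all are hypothetical.

```python
import os


def org_repo_names(token, org_name):
    """Map repository name to SSH URL for every repo in a GitHub organization."""
    from github import Github  # PyGithub, imported lazily

    org = Github(token).get_organization(org_name)
    return {repo.name: repo.ssh_url for repo in org.get_repos()}


def gitpython_clone(name, url, dest, branch="master"):
    """Clone one repository with GitPython and return the Repo instance."""
    from git import Repo  # GitPython, imported lazily

    return Repo.clone_from(url, os.path.join(dest, name), branch=branch)


def clone_all(repos, dest, clone_fn=gitpython_clone, branch="master"):
    """Clone every repo in a name -> URL mapping, recording failures."""
    cloned, skipped = [], []
    for name, url in sorted(repos.items()):
        try:
            cloned.append(clone_fn(name, url, dest, branch=branch))
        except Exception as error:  # e.g. GitCommandError for a missing branch
            skipped.append((name, str(error)))
    return cloned, skipped
```

Returning the skipped repositories alongside the cloned ones makes it easier to see afterwards which parts of the organization are missing from the analysis.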

Creating domain-specific statistics

All of this work has been done for one reason: to explore the data collected and to create domain-specific stats. To do that, you create a stats.py file. The most relevant portion to show is a function called author_unique_active_days. This function shows how many days a given developer was active for the records in the DataFrame. This is a unique domain-specific statistic that is rarely mentioned in discussions about statistics involving source code repositories.

The main function is shown below.

from pandas import DataFrame

def author_unique_active_days(df, sort_by="active_days"):
  """DataFrame of Unique Active Days by Author With Descending Order
  author_name	unique_days
  46	Armin Ronacher	271
  260	Markus Unterwaditzer	145
  """
  author_list = []
  count_list = []
  duration_active_list = []
  ad = author_active_days(df)
  for author in ad.index:
    author_list.append(author)
    vals = ad.loc[author]
    vals.reset_index(drop=True, inplace=True)
    #assumed: vals holds the author's active dates, so count them and
    #measure the span from first to last activity
    count_list.append(vals.count())
    duration_active_list.append(vals.max() - vals.min())
  df_author_ud = DataFrame()  
  df_author_ud["author_name"] = author_list
  df_author_ud["active_days"] = count_list
  df_author_ud["active_duration"] = duration_active_list
  df_author_ud["active_ratio"] = \
    round(df_author_ud["active_days"]/df_author_ud["active_duration"].dt.days, 2)
  df_author_ud = df_author_ud.iloc[1:] #first row is =
  df_author_ud = df_author_ud.sort_values(by=sort_by, ascending=False)
  return df_author_ud

When used from IPython, this code generates the output below.

In [18]: from devml.stats import author_unique_active_days
In [19]: active_days = author_unique_active_days(df)
In [20]: active_days.head()
       author_name active_days active_duration active_ratio
46     Armin Ronacher     241    2490 days     0.10
260 Markus Unterwaditzer      71    1672 days     0.04
119      David Lord      58    710 days     0.08
352      Ron DuPlain      47    785 days     0.06
107   Daniel Neuhäuser      19    435 days     0.04

The statistics create a ratio, called the active_ratio, which is the number of days on which a developer committed code divided by the number of days between that developer's first and last activity on the project. An interesting thing about a metric like this is that it shows engagement. With the best open source developers, there are some fascinating parallels. In the next section, these core components are hooked into a command-line tool, and I compare two open source projects using the code that is created.
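To make the arithmetic behind the ratio concrete, here is a small pandas-free sketch that works directly on commit dictionaries like the ones shown earlier. It is an illustration under stated assumptions rather than the devml implementation: the active_ratios function is hypothetical, and the date format is assumed from the sample output ('Fri Oct 13 04:53:50 2017').

```python
from collections import defaultdict
from datetime import datetime

GIT_DATE_FORMAT = "%a %b %d %H:%M:%S %Y"  # assumed from the sample output


def active_ratios(commits):
    """Return {author: (active_days, duration_days, ratio)} from commit dicts."""
    days_by_author = defaultdict(set)
    for commit in commits:
        when = datetime.strptime(commit["date"], GIT_DATE_FORMAT)
        days_by_author[commit["author_name"]].add(when.date())
    result = {}
    for author, days in days_by_author.items():
        duration = (max(days) - min(days)).days
        # An author with a single active day has zero duration, so the
        # ratio is undefined and reported as None.
        ratio = round(len(days) / duration, 2) if duration else None
        result[author] = (len(days), duration, ratio)
    return result
```

Several commits on the same calendar day count as one active day, which is what keeps a burst of scripted commits from inflating the figure.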

Wiring a data science project into a CLI

Earlier in this part, I showed how the components are created to get to the point that an analysis can be run. In this section, I show how they are wired into a flexible command-line tool that uses the Click framework. You can view the entire source code for dml. The pieces that are important are shown below.

First, the library is imported along with the Click framework.

#!/usr/bin/env python
import os
import click
from devml import state
from devml import fetch_repo
from devml import __version__
from devml import mkdata
from devml import stats
from devml import org_stats
from devml import post_processing

Then the previous code is wired in.

@click.option("--path", default=CHECKOUT_DIR, help="path to org")
@click.option("--sort", default="active_days", help="can sort by: active_days, active_ratio, active_duration")
def activity(path, sort):
  """Creates Activity Stats
  Example is run after checkout:
  python gstats activity --path /Users/noah/src/wulio/checkout
  """
  org_df = mkdata.create_org_df(path)
  activity_counts = stats.author_unique_active_days(org_df, sort_by=sort)
  click.echo(activity_counts) #print the resulting table to the terminal

Run from the command line, the tool looks like this.

# Linux Development Active Ratio

dml gstats activity --path /Users/noahgift/src/linux --sort active_days
           author_name             active_days active_duration active_ratio
14541      Takashi Iwai            1677    4590 days           0.370000
4382       Eric Dumazet            1460    4504 days           0.320000
3641       David S. Miller         1428    4513 days           0.320000
7216       Johannes Berg           1329    4328 days           0.310000
8717       Linus Torvalds          1281    4565 days           0.280000
275        Al Viro                 1249    4562 days           0.270000
9915       Mauro Carvalho Chehab   1227    4464 days           0.270000
9375       Mark Brown              1198    4187 days           0.290000
3172       Dan Carpenter           1158    3972 days           0.290000
12979      Russell King            1141    4602 days           0.250000
1683       Axel Lin                1040    2720 days           0.380000
400        Alex Deucher            1036    3497 days           0.300000

# CPython Development Active Ratio

           author_name             active_days active_duration active_ratio
146        Guido van Rossum        2256    9673 days           0.230000
301        Raymond Hettinger       1361    5635 days           0.240000
128        Fred Drake              1239    5335 days           0.230000
47         Benjamin Peterson       1234    3494 days           0.350000
132        Georg Brandl            1080    4091 days           0.260000
375        Victor Stinner          980     2818 days           0.350000
235        Martin v. Löwis         958     5266 days           0.180000
36         Antoine Pitrou          883     3376 days           0.260000
362        Tim Peters              869     5060 days           0.170000
164        Jack Jansen             800     4998 days           0.160000
24         Andrew M. Kuchling      743     4632 days           0.160000
330        Serhiy Storchaka        720     1759 days           0.410000
44         Barry Warsaw            696     8485 days           0.080000
52         Brett Cannon            681     5278 days           0.130000
262        Neal Norwitz            559     2573 days           0.220000

In this analysis, Guido of Python has a 23-percent probability of working on CPython on a given day, and Linus of Linux has a 28-percent chance of working on the kernel. What is fascinating about this particular form of analysis is that it shows behavior over a long period of time. In the case of CPython, many of these authors also had full-time jobs, so the output is even more incredible to observe. Another interesting analysis would be to look at the history of developers at an organization (combining all of the available repositories). I have noticed that in some cases very senior developers can output code at around an 85-percent active ratio when they are employed full time on a project.


In Part 1 of this series, I have shown how to create a basic data science skeleton and have explained the parts. The components are built one by one to pull data from a third-party location, transform it, analyze it, and then run it in a flexible way with a command-line interface. In Part 2, I will provide an in-depth data exploration using Jupyter Notebook and the code built in Part 1, and then show how to deploy the project to the Python Package Index.
