Using data science to manage a software project in a GitHub organization, Part 2

Explore your project with Jupyter Notebooks and deploy it to the Python Package Index


In Part 1 of this series, you created the basic structure of a data science project and downloaded the data programmatically from GitHub, transforming it so that it could be statistically analyzed with pandas. Here in Part 2, you use Jupyter Notebook to explore many aspects of a software project and learn how to deploy the project to the Python Package Index, both as a library and a command line tool.

Explore a GitHub organization using Jupyter Notebook

In the following sections, I explain how to use Jupyter Notebook to analyze and evaluate the development shop of a GitHub organization.

Pallets project analysis

As I pointed out in Part 1, one of the issues with looking at only a single repository is that it is only part of the data. The code that you created in Part 1 gives you the ability to clone an entire organization — with all of its repositories — and analyze it.

An example of a GitHub organization is the well-known Pallets project, which has multiple projects such as Click and Flask. The following steps detail how to perform a Jupyter Notebook analysis on the Pallets project.

  1. To start Jupyter from the command line, type jupyter notebook. Then, import the libraries that you will use:
    In [3]: import sys;sys.path.append("..")
       ...: import pandas as pd
       ...: from pandas import DataFrame
       ...: import seaborn as sns
       ...: import matplotlib.pyplot as plt
       ...: from sklearn.cluster import KMeans
       ...: %matplotlib inline
       ...: from IPython.core.display import display, HTML
       ...: display(HTML("<style>.container { width:100% !important; }</style>"))
  2. Next, run the code to download the organization:
    In [4]: from devml import (mkdata, stats, state, fetch_repo, ts)
    In [5]: dest, token, org = state.get_project_metadata("../project/config.json")
    In [6]: fetch_repo.clone_org_repos(token, org,
       ...:         dest, branch="master")
    Out[6]:
    [<git.Repo "/tmp/checkout/flask/.git">,
     <git.Repo "/tmp/checkout/pallets-sphinx-themes/.git">,
     <git.Repo "/tmp/checkout/markupsafe/.git">,
     <git.Repo "/tmp/checkout/jinja/.git">,
     <git.Repo "/tmp/checkout/werkzeug/.git">,
     <git.Repo "/tmp/checkout/itsdangerous/.git">,
     <git.Repo "/tmp/checkout/flask-website/.git">,
     <git.Repo "/tmp/checkout/click/.git">,
     <git.Repo "/tmp/checkout/flask-snippets/.git">,
     <git.Repo "/tmp/checkout/flask-docs/.git">,
     <git.Repo "/tmp/checkout/flask-ext-migrate/.git">,
     <git.Repo "/tmp/checkout/pocoo-sphinx-themes/.git">,
     <git.Repo "/tmp/checkout/website/.git">,
     <git.Repo "/tmp/checkout/meta/.git">]
  3. With the code living on disk, convert it to a pandas DataFrame:
    In [7]: df = mkdata.create_org_df(path="/tmp/checkout")
    In [9]: df.describe()
    Out[9]:
           commits
    count   8315.0
    mean       1.0
    std        0.0
    min        1.0
    25%        1.0
    50%        1.0
    75%        1.0
    max        1.0
  4. Calculate the active days:
    In [10]: df_author_ud = stats.author_unique_active_days(df)
        ...:
    In [11]: df_author_ud.head(10)
    Out[11]:
                  author_name  active_days active_duration  active_ratio
    86         Armin Ronacher          941       3817 days          0.25
    499  Markus Unterwaditzer          238       1767 days          0.13
    216            David Lord           94        710 days          0.13
    663           Ron DuPlain           56        854 days          0.07
    297          Georg Brandl           41       1337 days          0.03
    196     Daniel Neuhäuser           36        435 days          0.08
    169     Christopher Grebs           27       1515 days          0.02
    665    Ronny Pfannschmidt           23       2913 days          0.01
    448      Keyan Pishdadian           21        882 days          0.02
    712           Simon Sapin           21        793 days          0.03
  5. Create a seaborn plot by using sns.barplot to plot the top 10 contributors to the organization by the days that they are active in the project (that is, the days they actually checked in code); a sketch of plotting code that could produce such a chart follows this list. It is no surprise that the main author of many of the projects is almost three times more active than any other contributor.
    Figure 1. Seaborn active days plot (bar chart of active days by developer name)
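
The plotting code for Figure 1 is not shown in the transcript above. The following is a minimal sketch, assuming the df_author_ud DataFrame from step 4 is still in memory; the figure size, explicit sort, and title are assumptions rather than the original notebook's code.

    import matplotlib.pyplot as plt
    import seaborn as sns

    # df_author_ud comes from stats.author_unique_active_days(df) in step 4.
    # Sort by active days and keep the ten most active contributors.
    top10 = df_author_ud.sort_values(by="active_days", ascending=False).head(10)

    plt.figure(figsize=(10, 6))
    sns.barplot(y="author_name", x="active_days", data=top10)
    plt.title("Top 10 Pallets contributors by active days")
    plt.show()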

You could probably extrapolate similar observations for closed source projects across all of the repositories in a company. "Active days" could be a useful metric to show engagement, and it could be part of many metrics used to measure the effectiveness of teams and projects.

CPython project analysis

Next, let's look at a Jupyter notebook that shows the exploration of the metadata around the CPython project, the repository used to develop the Python language.

Relative churn

One of the metrics that is generated is called "relative churn." (See "Related topics" for an article from Microsoft Research about this metric.) Basically, the relative churn principle states that an increase in relative code churn results in an increase in system defect density. In other words, too many changes in a file, relative to its size, tend to result in defects.
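
As a concrete illustration, relative churn normalizes how often a file changes by how large the file is. The following is a minimal sketch of that idea with purely illustrative values; it is not necessarily the exact formula devml uses internally.

    import pandas as pd

    # Illustrative values: churn_count is how many commits touched the file,
    # line_count is the file's current length.
    df = pd.DataFrame({
        "files": ["Lib/string.py", "Lib/test/regrtest.py"],
        "churn_count": [90, 250],
        "line_count": [300, 400],
    })

    # A small file that changes constantly stands out more than a
    # large file that changes rarely.
    df["relative_churn"] = (df["churn_count"] / df["line_count"]).round(2)
    print(df)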

  1. As before, import the modules needed for the rest of the exploration:
    In [1]: import sys;sys.path.append("..")
       ...: import pandas as pd
       ...: from pandas import DataFrame
       ...: import seaborn as sns
       ...: import matplotlib.pyplot as plt
       ...: from sklearn.cluster import KMeans
       ...: %matplotlib inline
       ...: from IPython.core.display import display, HTML
       ...: display(HTML("<style>.container { width:100% !important; }</style>"))
  2. Generate churn metrics:
    In [2]: from devml.post_processing import (git_churn_df, file_len, git_populate_file_metatdata)
    In [3]: df = git_churn_df(path="/Users/noahgift/src/cpython")
    2017-10-23 06:51:00,256 - devml.post_processing - INFO - Running churn cmd: [git log --name-only --pretty=format:] at path [/Users/noahgift/src/cpython]
    In [4]: df.head()
    Out[4]:
                                                   files  churn_count
    0                         b'Lib/test/test_struct.py'          178
    1                      b'Lib/test/test_zipimport.py'           78
    2                           b'Misc/NEWS.d/next/Core'          351
    3                                             b'and'          351
    4  b'Builtins/2017-10-13-20-01-47.bpo-31781.cXE9S...            1
  3. A few pandas filters can then be used to find the Python (.py) files with the highest relative churn. See the output in Figure 2.
    In [14]: metadata_df = git_populate_file_metatdata(df)
    In [15]: python_files_df = metadata_df[metadata_df.extension == ".py"]
        ...: line_python = python_files_df[python_files_df.line_count> 40]
        ...: line_python.sort_values(by="relative_churn", ascending=False).head(15)
        ...:
    Figure 2. Top relative churn in CPython .py files (table of churn counts per file)

    One observation from this query is that tests have a lot of churn, which might be worth exploring more. Does this mean that the tests themselves also contain bugs? That might be interesting to explore in more detail. Also, a couple of Python modules have extremely high relative churn, such as the string.py module. Looking through the source code for that file, it does appear very complex for its size, and it contains metaclasses. It is possible that the complexity has made it prone to bugs. This seems like a module worth further data science exploration.

  4. Next, you can run some descriptive statistics to look at the median values across the project. These statistics show that over the couple of decades and more than 100,000 commits in the project's history, the median file is about 146 lines long, is changed five times, and has a relative churn of 10 percent. This leads to the conclusion that the ideal type of file to create is small and changes only a few times over the years.
    In [16]: metadata_df.median()
    Out[16]:
    churn_count         5.0
    line_count        146.0
    relative_churn      0.1
    dtype: float64
  5. Generating a seaborn plot for the relative churn makes the patterns even more clear:
    In [18]: import matplotlib.pyplot as plt
        ...: plt.figure(figsize=(10,10))
        ...: python_files_df = metadata_df[metadata_df.extension == ".py"]
        ...: line_python = python_files_df[python_files_df.line_count> 40]
        ...: line_python_sorted = line_python.sort_values(by="relative_churn", ascending=False).head(15)
        ...: sns.barplot(y="files", x="relative_churn",data=line_python_sorted)
        ...: plt.title('Top 15 CPython Absolute and Relative Churn')
        ...: plt.show()

    In Figure 3, the regrtest.py module sticks out quite a bit as the most modified file. It makes sense that it has been changed so much: while it is a small file, a regression test can be very complicated. This also might be a hot spot in the code that needs to be looked at.

    Figure 3. Top relative churn in CPython .py files (bar chart with regrtest.py the largest and test_winsound.py the smallest)

Deleted files

Another area of exploration is to look at files that have been deleted throughout the history of a project. Many research directions could grow out of this exploration, such as predicting whether a file will later be deleted (for example, if its relative churn is too high).

  1. To look at the deleted files, create another function in the post_processing directory:
    import re

    # Shell command that lists every file deleted in the git history.
    FILES_DELETED_CMD = \
        'git log --diff-filter=D --summary | grep delete'

    def files_deleted_match(output):
        """Extracts deleted file paths from subprocess output such as:

        wcase/templates/hello.html\n delete mode 100644

        Throws away everything except the path to the file.
        """
        files = []
        integers_match_pattern = '^[-+]?[0-9]+$'
        for line in output.split():
            # Skip the "delete" and "mode" tokens and the numeric mode bits,
            # keeping only the file paths.
            if line == b"delete":
                continue
            elif line == b"mode":
                continue
            elif re.match(integers_match_pattern, line.decode("utf-8")):
                continue
            else:
                files.append(line)
        return files

    This function looks for delete messages in the git log, does some pattern matching, and extracts the files to a list so that a pandas DataFrame can be created.

  2. Next, use the function in a Jupyter notebook:
    In [19]: from devml.post_processing import git_deleted_files
        ...: deletion_counts = git_deleted_files("/Users/noahgift/src/cpython")

    To inspect some of the files that have been deleted, view the last few records:

    In [21]: deletion_counts.tail()
    Out[21]:
                               files     ext
    8812  b'Mac/mwerks/mwerksglue.c'      .c
    8813        b'Modules/version.c'      .c
    8814      b'Modules/Setup.irix5'  .irix5
    8815      b'Modules/Setup.guido'  .guido
    8816      b'Modules/Setup.minix'  .minix
  3. See whether a pattern distinguishes deleted files from files that are kept. To do that, check each file's membership in the deleted files DataFrame and add the result as a column:
    In [22]: all_files = metadata_df['files']
        ...: deleted_files = deletion_counts['files']
        ...: membership = all_files.isin(deleted_files)
        ...:
    In [23]: metadata_df["deleted_files"] = membership
    In [24]: metadata_df.loc[metadata_df["deleted_files"] == True].median()
    Out[24]:
    churn_count        4.000
    line_count        91.500
    relative_churn     0.145
    deleted_files      1.000
    dtype: float64
    
    In [25]: metadata_df.loc[metadata_df["deleted_files"] == False].median()
    Out[25]:
    churn_count         9.0
    line_count        149.0
    relative_churn      0.1
    deleted_files       0.0
    dtype: float64

    Comparing the median values of the deleted files with the files that are still in the repository, you see some differences. Mainly, the relative churn number is higher for the deleted files. Perhaps problematic files tended to be deleted? It is impossible to know without more investigation.

  4. Next, create a correlation heatmap in seaborn on this DataFrame:
    In [26]: sns.heatmap(metadata_df.corr(), annot=True)

    Figure 4 shows that there is a correlation, a very small positive one, between relative churn and deleted files. This signal might be included in a machine learning model to predict the likelihood of a file being deleted; a sketch of such a model follows this list.

    Figure 4. Files deleted correlation heatmap
  5. Next, a final scatterplot shows some differences between deleted files and files that have remained in the repository:
    In [27]: sns.lmplot(x="churn_count", y="line_count", hue="deleted_files", data=metadata_df)

    Figure 5 shows three dimensions: line counts, churn counts, and the category of True/False for a deleted file.

    Figure 5. Scatterplot line counts and churn count
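
Building on that idea, the following is a minimal sketch, assuming the metadata_df built in the steps above is still in memory, of how the churn features and the deleted_files label could feed a scikit-learn classifier. The feature choice, the logistic regression model, and the train/test split are assumptions for illustration, not part of the original analysis.

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    # Features assumed to be present on metadata_df from the steps above.
    features = ["churn_count", "line_count", "relative_churn"]
    X = metadata_df[features].fillna(0)
    y = metadata_df["deleted_files"].astype(int)

    # Hold out 30 percent of the files to check the model on unseen data.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0, stratify=y)

    # class_weight="balanced" because far fewer files are deleted than kept.
    model = LogisticRegression(class_weight="balanced", max_iter=1000)
    model.fit(X_train, y_train)
    print(classification_report(y_test, model.predict(X_test)))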

Deploying a project to the Python Package Index

After all of the hard work of creating a library and command line tool, it makes sense to share the project with other people by submitting it to the Python Package Index. There are only a few steps to do this:

  1. Create an account on https://pypi.python.org/pypi.
  2. Install twine:
    pip install twine
  3. Create a setup.py file.

    The two parts that are the most important are the packages section, which ensures that the library is installed, and the scripts section. The scripts section includes the dml script that we used throughout this article.

    import sys
    if sys.version_info < (3,6):
        sys.exit('Sorry, Python < 3.6 is not supported')
    import os
    from setuptools import setup
    from devml import __version__
    # Fall back to an empty long description if README.rst is missing.
    LONG = ''
    if os.path.exists('README.rst'):
        LONG = open('README.rst').read()
    setup(
        name='devml',
        version=__version__,
        url='https://github.com/noahgift/devml',
        license='MIT',
        author='Noah Gift',
        author_email='consulting@noahgift.com',
        description="""Machine Learning, Statistics and Utilities around Developer Productivity,
            Company Productivity and Project Productivity""",
        long_description=LONG,
        packages=['devml'],
        include_package_data=True,
        zip_safe=False,
        platforms='any',
        install_requires=[
            'pandas',
            'click',
            'PyGithub',
            'gitpython',
            'sensible',
            'scipy',
            'numpy',
        ],
        classifiers=[
            'Development Status :: 4 - Beta',
            'Intended Audience :: Developers',
            'License :: OSI Approved :: MIT License',
            'Programming Language :: Python',
            'Programming Language :: Python :: 3.6',
            'Topic :: Software Development :: Libraries :: Python Modules'
        ],
        scripts=["dml"],
    )

    The scripts directive then installs the dml tool into the path of all users who pip install the module. (A sketch of what such a script might look like follows the deployment output at the end of this list.)

  4. Add a deploy step to the Makefile:
    deploy-pypi:
        pandoc --from=markdown --to=rst README.md -o README.rst
        python setup.py check --restructuredtext --strict --metadata
        rm -rf dist
        python setup.py sdist
        twine upload dist/*
        rm -f README.rst
  5. Finally, deploy:
    (.devml) ➜  devml git:(master) ✗ make deploy-pypi
    pandoc --from=markdown --to=rst README.md -o README.rst
    python setup.py check --restructuredtext --strict --metadata
    running check
    rm -rf dist
    python setup.py sdist
    running sdist
    running egg_info
    writing devml.egg-info/PKG-INFO
    writing dependency_links to devml.egg-info/dependency_links.txt
    ....
    running check
    creating devml-0.5.1
    creating devml-0.5.1/devml
    creating devml-0.5.1/devml.egg-info
    copying files to devml-0.5.1...
    ....
    Writing devml-0.5.1/setup.cfg
    creating dist
    Creating tar archive
    removing 'devml-0.5.1' (and everything under it)
    twine upload dist/*
    Uploading distributions to https://upload.pypi.org/legacy/
    Enter your username:
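
For reference, here is a minimal, illustrative sketch of what a click-based dml entry-point script could look like. The command name, option, and default path are assumptions rather than what devml actually ships; only mkdata.create_org_df and stats.author_unique_active_days come from the notebook sessions above.

    #!/usr/bin/env python
    """Illustrative sketch of a click-based dml entry-point script."""
    import click

    from devml import mkdata, stats


    @click.group()
    def cli():
        """Command line tool for GitHub organization analysis."""


    @cli.command()
    @click.option("--path", default="/tmp/checkout", help="Path to the checked-out repos")
    def active(path):
        """Show unique active days per author for an organization checkout."""
        df = mkdata.create_org_df(path=path)
        click.echo(stats.author_unique_active_days(df))


    if __name__ == "__main__":
        cli()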

Conclusion

Part 1 of this series showed you how to create a basic data science project skeleton and explained its parts. Here in Part 2, you used the code built in Part 1 to perform an in-depth data exploration with Jupyter Notebook, and you learned how to deploy the project to the Python Package Index.

This article should be a good building block for other data science developers to study as they build solutions that can be delivered as a Python library and a command line tool.


Related topics

