This module provides a thin wrapper around PyGithub for collecting statistics from GitHub for a particular organization.

Authentication

To access the GitHub API you need an access token. You can create one here: https://github.com/settings/tokens.

The access token requires the repo scope. When working with this module locally, it's probably easiest to put the token in a .env file and use python-dotenv to load it. See the python-dotenv documentation for further details. Alternatively, you may want to save the token as a GitHub Secret, especially if you plan to use this code as part of a GitHub Action.
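As a minimal sketch of the local setup described above, the helper below reads the token from the environment and, if python-dotenv happens to be installed, loads a .env file first. The name `read_token` and its error message are hypothetical and not part of this module; `GH_TOKEN` is the variable name used in the examples on this page.

```python
import os


def read_token(var_name: str = "GH_TOKEN") -> str:
    """Return the token stored in the environment variable `var_name`.

    `read_token` is a hypothetical helper for illustration only.
    """
    try:
        # python-dotenv is optional here. When available, load_dotenv()
        # looks for a .env file and adds its entries to os.environ
        # (by default without overriding variables that are already set).
        from dotenv import load_dotenv

        load_dotenv()
    except ImportError:
        pass

    token = os.getenv(var_name)
    if not token:
        raise RuntimeError(f"{var_name} is not set; add it to .env or export it")
    return token
```

Failing early with a clear message is a design choice for this sketch: it avoids passing an empty token on to the GitHub API, where the resulting error would be harder to interpret.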

create_github_session[source]

create_github_session(GH_TOKEN)

creates a session for GitHub
GH_TOKEN = os.getenv("GH_TOKEN")
create_github_session(GH_TOKEN)
<github.MainClass.Github at 0x1023d1e20>

OrgStats

OrgStats is a class that contains functionality for getting statistics for a GitHub organization.

class OrgStats[source]

OrgStats(GH_TOKEN:str, org:str)

Class for collecting GitHub statistics for an Organization
load_dotenv()
GH_TOKEN = os.getenv("GH_TOKEN")

To use OrgStats you need to pass in a token to authenticate with the GitHub API, and the name of a GitHub organization. We use ghorgstatstestorg for these examples.

test_org = OrgStats(GH_TOKEN, "ghorgstatstestorg")
test_org
OrgStats: ghorgstatstestorg 

OrgStats.get_repos[source]

OrgStats.get_repos(pub_status:Union[NoneType, str]=None)

Returns repositories for organization
optional `pub_status` filter for `public` or `private` repositories

get_repos returns all repositories associated with an organization. We can optionally filter by public status.

test_org.get_repos()
[Repository(full_name="ghorgstatstestorg/repo1"),
 Repository(full_name="ghorgstatstestorg/repo2"),
 Repository(full_name="ghorgstatstestorg/private_repo_1")]
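The optional filtering by public status can be pictured with the sketch below. It is not the module's implementation: `FakeRepo` and `filter_repos` are hypothetical stand-ins, and the only assumption about real repositories is that they expose a boolean `private` attribute, as the assertions later on this page suggest.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class FakeRepo:
    """Hypothetical stand-in for a repository object with a `private` flag."""
    full_name: str
    private: bool


def filter_repos(repos: Iterable[FakeRepo], pub_status: Optional[str] = None) -> List[FakeRepo]:
    """Return all repos, or only the 'public' or 'private' ones."""
    if pub_status is None:
        return list(repos)
    if pub_status not in ("public", "private"):
        raise ValueError("pub_status must be None, 'public' or 'private'")
    want_private = pub_status == "private"
    return [r for r in repos if r.private == want_private]


repos = [
    FakeRepo("ghorgstatstestorg/repo1", private=False),
    FakeRepo("ghorgstatstestorg/repo2", private=False),
    FakeRepo("ghorgstatstestorg/private_repo_1", private=True),
]
public_names = [r.full_name for r in filter_repos(repos, "public")]
```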

These can also be accessed via the repos, public_repos and private_repos attributes of OrgStats.

OrgStats.repos[source]

OrgStats.repos()

all repositories of `org`

repo_count[source]

repo_count()

count of all repositories of `org`

OrgStats.public_repos[source]

OrgStats.public_repos()

public repositories of `org`
test_org = OrgStats(GH_TOKEN, "ghorgstatstestorg")
assert L(test_org.public_repos).map(lambda x: x.private).unique()[0] == False

public_repo_count[source]

public_repo_count()

count of public repositories of `org`

OrgStats.private_repos[source]

OrgStats.private_repos()

private repositories of `org`
test_org = OrgStats(GH_TOKEN, "ghorgstatstestorg")
assert L(test_org.private_repos).map(lambda x: x.private).unique()[0] == True

private_repo_count[source]

private_repo_count()

count of private repositories of `org`

The repo attributes can be used to access repositories by type, for example only public repos via public_repos.

test_org.public_repos
[Repository(full_name="ghorgstatstestorg/repo1"),
 Repository(full_name="ghorgstatstestorg/repo2")]

Files

These methods retrieve information about the files in the repositories.

The files in a repository and the extension of those files can give some information about the kind of content repositories hold. For example if you promised a funder lots of tutorials you may expect more .ipynb files.

get_repo_files[source]

get_repo_files(repo:Union[str, Repository])

return files for `repo`

OrgStats.get_repo_files[source]

OrgStats.get_repo_files(repo:Union[str, Repository])

return files for `repo`

OrgStats.get_org_files[source]

OrgStats.get_org_files(pub_status:Union[NoneType, str]=None)

returns repo files for `org`

Files can also be accessed via the files, files_public and files_private attributes.

OrgStats.files[source]

OrgStats.files()

files for all repos

OrgStats.files_public[source]

OrgStats.files_public()

files for public repos

OrgStats.files_private[source]

OrgStats.files_private()

files for private repos

Helper function

get_ext[source]

get_ext(x)

get_repo_file_ext_frequency[source]

get_repo_file_ext_frequency(repo:Union[str, Repository])

returns frequencies of file extensions for `repo`

OrgStats.get_repo_file_ext_frequency[source]

OrgStats.get_repo_file_ext_frequency(repo:Union[str, Repository])

returns frequencies of file extensions for `repo`
test_org.get_repo_file_ext_frequency('repo2')
{'.md': 1, '.py': 1}
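One way to picture how such a frequency table could be built from a list of file paths is sketched below. This is an illustration under assumptions rather than the module's own code: `ext_frequency` is a hypothetical name, and the input is assumed to be plain path strings.

```python
from collections import Counter
from pathlib import PurePosixPath
from typing import Dict, Iterable


def ext_frequency(paths: Iterable[str]) -> Dict[str, int]:
    """Count how often each file extension occurs in `paths`.

    Files without an extension (for example LICENSE) are counted under ''.
    """
    return dict(Counter(PurePosixPath(p).suffix for p in paths))


freq = ext_frequency(["README.md", "src/main.py"])
# freq == {'.md': 1, '.py': 1}
```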

get_org_file_ext_frequency[source]

get_org_file_ext_frequency(pub_status:Union[NoneType, str]=None)

returns frequencies of file extensions for repos in `OrgStats` `org`

OrgStats.get_org_file_ext_frequency[source]

OrgStats.get_org_file_ext_frequency(pub_status:Union[NoneType, str]=None)

returns frequencies of file extensions for repos in `OrgStats` `org`

Snapshot stats

There are two flavours of stats accessible via GitHub: ones which are 'snapshots' in time, and ones which are cumulative over time. 'Snapshot' stats include 'forks' and 'stars'. Although these can go up and down over time, we mainly care about their current numbers.

get_org_snapshot_stats[source]

get_org_snapshot_stats(repos:Iterable[Repository])

Returns dictionary of star and fork counts for `repos`

OrgStats.get_org_snapshot_stats[source]

OrgStats.get_org_snapshot_stats(repos:Iterable[Repository])

Returns dictionary of star and fork counts for `repos`
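A rough sketch of collecting star and fork counts into a dictionary is shown below. It assumes repository objects expose `name`, `stargazers_count` and `forks_count` attributes; `FakeRepo`, `snapshot_counts` and the exact shape of the returned dictionary are hypothetical and may differ from what the module returns.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class FakeRepo:
    """Hypothetical stand-in carrying only the attributes used below."""
    name: str
    stargazers_count: int
    forks_count: int


def snapshot_counts(repos: Iterable[FakeRepo]) -> Dict[str, Dict[str, int]]:
    """Map each repo name to its current star and fork counts."""
    return {
        r.name: {"stars": r.stargazers_count, "forks": r.forks_count}
        for r in repos
    }


stats = snapshot_counts([FakeRepo("repo1", 1, 0), FakeRepo("repo2", 0, 0)])
```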

You can also access get_org_snapshot_stats via the snapshot_stats property of OrgStats.

OrgStats.snapshot_stats[source]

OrgStats.snapshot_stats()

Returns a Pandas DataFrame of star and fork counts for public repos

test_org = OrgStats(GH_TOKEN, "ghorgstatstestorg")
test_org.snapshot_stats
       stars  forks
repo1      1      0
repo2      0      0

Long view stats

These are the other flavour of GitHub stats: traffic stats, which include visits to a repository on GitHub and clones of organization repositories. By default GitHub only provides access to recent information for these stats. This means that if we want longer-term information, we need to store and update it on a regular basis ourselves. This is what the methods below do, in combination with GitHub Actions.
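The store-and-update idea can be sketched with the standard library alone. The function below merges freshly fetched daily counts into a CSV kept on disk, so that history accumulates across runs. The name `update_traffic_csv`, the column names and the file layout are assumptions made for this illustration and do not describe the CSVs this module writes.

```python
import csv
from pathlib import Path
from typing import Dict, Tuple, Union


def update_traffic_csv(
    csv_path: Union[str, Path],
    new_rows: Dict[str, Tuple[int, int]],
) -> Dict[str, Tuple[int, int]]:
    """Merge `new_rows` (ISO date -> (total, unique)) into the CSV at `csv_path`."""
    csv_path = Path(csv_path)
    merged: Dict[str, Tuple[int, int]] = {}

    # Load whatever history has been stored by earlier runs.
    if csv_path.exists():
        with csv_path.open(newline="") as f:
            for row in csv.DictReader(f):
                merged[row["date"]] = (int(row["total"]), int(row["unique"]))

    # Freshly fetched values win for dates that appear in both.
    merged.update(new_rows)

    # Rewrite the file sorted by date (ISO dates sort correctly as strings).
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    with csv_path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["date", "total", "unique"])
        for date in sorted(merged):
            writer.writerow([date, *merged[date]])
    return merged
```

Run on a schedule, for example from a GitHub Action, each call would extend the stored history by the days that are new since the previous run.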

Traffic stats

get_repo_views_traffic[source]

get_repo_views_traffic(repo:Union[str, Repository], save_dir:Union[str, Path]='view_data', load=False)

gets views traffic for `repo` and saves as csv in `save_dir`

Parameters
----------
repo : Union[str,github.Repository.Repository]
    repository from `org`
save_dir : Union[str, pathlib.Path], optional
    directory where output CSV should be saved, by default 'view_data'
load : bool, optional
    load data into a Pandas DataFrame, by default False

Returns
-------
pd.DataFrame
    contains unique and total views for `repo` with dates

OrgStats.get_repo_views_traffic[source]

OrgStats.get_repo_views_traffic(repo:Union[str, Repository], save_dir:Union[str, Path]='view_data', load=False)

Gets views traffic for repo and saves it as a CSV in save_dir. repo is a repository under the GitHub organization. save_dir is the directory where the output CSV should be saved, by default view_data. load is an optional flag which loads the data into a Pandas DataFrame, by default False.

test_org = OrgStats(GH_TOKEN, "ghorgstatstestorg")
test_org.get_repo_views_traffic(test_org.repos[0], 'test_dir',load=True).head(3)
            total_views  unique_views
_date
2020-11-30            2             1
2020-12-01            1             1

get_org_views_traffic[source]

get_org_views_traffic(public_only:bool=True, save_dir:Union[str, Path]='view_data', repos:Optional[Iterable[Repository]]=None, load=False)

Get view traffic for multiple repos from `Org`

Parameters
----------
public_only : bool, optional
    only get stats for public repos, by default True
save_dir : Union[str,pathlib.Path], optional
    directory where csvs of stats should be saved, by default 'view_data'
repos : Optional[Iterable[github.Repository.Repository]], optional
    to access stats for a specific set of repos, by default None
load : bool, optional
    whether to load views data into a DataFrame, by default False

Returns
-------
Union[None, pd.DataFrame]

OrgStats.get_org_views_traffic[source]

OrgStats.get_org_views_traffic(public_only:bool=True, save_dir:Union[str, Path]='view_data', repos:Optional[Iterable[Repository]]=None, load=False)

Get view traffic for multiple repos from `Org`

Parameters
----------
public_only : bool, optional
    only get stats for public repos, by default True
save_dir : Union[str,pathlib.Path], optional
    directory where csvs of stats should be saved, by default 'view_data'
repos : Optional[Iterable[github.Repository.Repository]], optional
    to access stats for a specific set of repos, by default None
load : bool, optional
    whether to load views data into a DataFrame, by default False

Returns
-------
Union[None, pd.DataFrame]
test_org = OrgStats(GH_TOKEN, "ghorgstatstestorg")
test_org.get_org_views_traffic(load=True).head(3)
                 repo1                    repo2
           total_views unique_views total_views unique_views
2020-11-30           2            1         8.0          1.0
2020-12-01           1            1         NaN          NaN
assert len(test_org.get_org_views_traffic(load=True).columns)/2 == len(test_org.public_repos)
assert len(test_org.get_org_views_traffic(repos=test_org.repos, load=True).columns)/2 == len(test_org.repos)

Clones

get_repo_clones_traffic[source]

get_repo_clones_traffic(repo:Repository, save_dir:Union[str, Path]='clone_data', load=False)

gets clones traffic for `repo` and saves as csv in `save_dir`

Parameters
----------
repo : Union[str,github.Repository.Repository]
    repository from `org`
save_dir : Union[str, pathlib.Path], optional
    directory where output CSV should be saved, by default 'clone_data'
load : bool, optional
    load data into a Pandas DataFrame, by default False

Returns
-------
pd.DataFrame
    contains unique and total clones for `repo` with dates

OrgStats.get_repo_clones_traffic[source]

OrgStats.get_repo_clones_traffic(repo:Repository, save_dir:Union[str, Path]='clone_data', load=False)

gets clones traffic for `repo` and saves as csv in `save_dir`

Parameters
----------
repo : Union[str,github.Repository.Repository]
    repository from `org`
save_dir : Union[str, pathlib.Path], optional
    directory where output CSV should be saved, by default 'clone_data'
load : bool, optional
    load data into a Pandas DataFrame, by default False

Returns
-------
pd.DataFrame
    contains unique and total clones for `repo` with dates
test_org = OrgStats(GH_TOKEN, "ghorgstatstestorg")
test_org.get_repo_clones_traffic('repo1',save_dir='test_dir', load=True)
assert len(test_org.get_repo_clones_traffic(test_org.public_repos[1], load=True).columns) == 2

get_org_clones_traffic[source]

get_org_clones_traffic(public_only:bool=True, repos:Optional[Iterable[Repository]]=None, save_dir:Union[str, Path]='clone_data', load=False)

get clone traffic for multiple repos from `Org`

Parameters
----------
public_only : bool, optional
    only get stats for public repos, by default True
save_dir : Union[str,pathlib.Path], optional
    directory where csvs of stats should be saved, by default 'clone_data'
repos : Optional[Iterable[github.Repository.Repository]], optional
    to access stats for a specific set of repos, by default None
load : bool, optional
    whether to load clones data into a DataFrame, by default False

Returns
-------
Union[None, pd.DataFrame]

OrgStats.get_org_clones_traffic[source]

OrgStats.get_org_clones_traffic(public_only:bool=True, repos:Optional[Iterable[Repository]]=None, save_dir:Union[str, Path]='clone_data', load=False)

get clone traffic for multiple repos from `Org`

Parameters
----------
public_only : bool, optional
    only get stats for public repos, by default True
save_dir : Union[str,pathlib.Path], optional
    directory where csvs of stats should be saved, by default 'clone_data'
repos : Optional[Iterable[github.Repository.Repository]], optional
    to access stats for a specific set of repos, by default None
load : bool, optional
    whether to load clones data into a DataFrame, by default False

Returns
-------
Union[None, pd.DataFrame]
test_org = OrgStats(GH_TOKEN, "ghorgstatstestorg")
assert type(test_org.get_org_clones_traffic(repos=test_org.repos, save_dir='test_dir',load=True)) == pd.core.frame.DataFrame
assert (len(test_org.get_org_clones_traffic(save_dir='test_dir',load=True).columns) /2)  == test_org.public_repo_count 