Authentication

To access the GitHub API you need an access token. You can create one here: https://github.com/settings/tokens.

The access token requires the repo scope. When working with this module locally it's probably easiest to put the token in a .env file and use python-dotenv to load it. See the python-dotenv documentation for details. Alternatively, you may want to save the token as a GitHub Secret, especially if you plan to use this code as part of a GitHub Action.
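As a minimal sketch of the local workflow described above, the helper below reads the token from the environment and fails early with a readable message if it is missing. The name `read_token` is hypothetical and not part of this module. It assumes the variable has already been populated, for example by python-dotenv or by a GitHub Secret mapped to an environment variable.

```python
import os


def read_token(name: str = "GH_TOKEN") -> str:
    """Return the token stored in the environment variable `name`.

    Hypothetical helper for illustration only; not part of this module.
    """
    # Treat an unset variable and a whitespace-only value the same way.
    token = os.environ.get(name, "").strip()
    if not token:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it first"
        )
    return token
```

Failing at this point tends to give a clearer error than an authentication failure from the first API call.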

creates a session for GitHub

GH_TOKEN = os.getenv("GH_TOKEN")

create_github_session(GH_TOKEN)

<github.MainClass.Github at 0x1023d1e20>

OrgStats

OrgStats is a class that contains functionality for getting statistics for a GitHub organization.

Class for collecting GitHub statistics for an Organization

import os
from dotenv import load_dotenv
load_dotenv()
GH_TOKEN = os.getenv("GH_TOKEN")

To use OrgStats you need to pass in a token to authenticate with the GitHub API, and the name of a GitHub organization. We use ghorgstatstestorg for these examples.

test_org = OrgStats(GH_TOKEN, "ghorgstatstestorg")
test_org

OrgStats: ghorgstatstestorg

Returns repositories for organisation
optional `pub_status` filter for `public` or `private` repositories

get_repos returns all repositories associated with an organization. We can optionally filter by public status.

test_org.get_repos()

[Repository(full_name="ghorgstatstestorg/repo1"),
 Repository(full_name="ghorgstatstestorg/repo2"),
 Repository(full_name="ghorgstatstestorg/private_repo_1")]
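The optional public-status filter can be pictured with the sketch below. It is not the module's implementation: `FakeRepo` is a stand-in for a PyGithub repository object that keeps only the `full_name` and `private` fields, and `filter_repos` is a hypothetical name chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class FakeRepo:
    """Stand-in for a repository object; only the fields used here."""
    full_name: str
    private: bool


def filter_repos(
    repos: Iterable[FakeRepo], pub_status: Optional[str] = None
) -> List[FakeRepo]:
    """Return all repos, or only the 'public' or 'private' ones."""
    if pub_status is None:
        return list(repos)
    if pub_status not in ("public", "private"):
        raise ValueError("pub_status must be 'public', 'private' or None")
    want_private = pub_status == "private"
    return [r for r in repos if r.private == want_private]


repos = [
    FakeRepo("ghorgstatstestorg/repo1", private=False),
    FakeRepo("ghorgstatstestorg/repo2", private=False),
    FakeRepo("ghorgstatstestorg/private_repo_1", private=True),
]
public_names = [r.full_name for r in filter_repos(repos, "public")]
private_names = [r.full_name for r in filter_repos(repos, "private")]
```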

These can also be accessed via the repos, public_repos and private_repos attributes of OrgStats.

all repositories of `org`

count of all repositories of `org`

public repositories of `org`

test_org = OrgStats(GH_TOKEN, "ghorgstatstestorg")
assert L(test_org.public_repos).map(lambda x: x.private).unique()[0] == False

count of public repositories of `org`

private repositories of `org`

test_org = OrgStats(GH_TOKEN, "ghorgstatstestorg")
assert L(test_org.private_repos).map(lambda x: x.private).unique()[0] == True

count of private repositories of `org`

The repo attributes can be used to access repositories by type, for example accessing only public repos via public_repos.

test_org.public_repos

[Repository(full_name="ghorgstatstestorg/repo1"),
 Repository(full_name="ghorgstatstestorg/repo2")]

Files

These methods retrieve information about the files in the repositories.

The files in a repository and the extension of those files can give some information about the kind of content repositories hold. For example if you promised a funder lots of tutorials you may expect more .ipynb files.

return files for `repo`

returns repo files for `org`

Files can also be accessed via the files, files_public and files_private attributes.

files for all repos

files for public repos

files for private repos

Helper function

returns frequencies of file extensions for `repo`

test_org.get_repo_file_ext_frequency('repo2')

{'.md': 1, '.py': 1}
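Counting extensions as in the output above can be sketched with the standard library alone. This is an illustration under assumptions, not the module's code: `ext_frequency` is a hypothetical name, and the two file paths are made up to reproduce the counts shown for repo2.

```python
from collections import Counter
from pathlib import PurePosixPath
from typing import Dict, Iterable


def ext_frequency(paths: Iterable[str]) -> Dict[str, int]:
    """Count file extensions; files without one are grouped under ''."""
    # PurePosixPath.suffix returns the last extension, including the dot.
    return dict(Counter(PurePosixPath(p).suffix for p in paths))


# Hypothetical file listing for a small repository.
freq = ext_frequency(["README.md", "src/stats.py"])
print(freq)  # {'.md': 1, '.py': 1}
```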

returns frequencies of file extensions for repos in `OrgStats` `org`

Snapshot stats

There are two flavours of stats accessible via GitHub: ones which are 'snapshots' in time, and ones which are cumulative over time. 'Snapshot' stats include 'forks' and 'stars'. Although these can go up and down over time, we mainly care about their current numbers.

Returns dictionary of star and fork counts for `repos`

You can also access the result of get_org_snapshot_stats via the snapshot_stats property of OrgStats.

Returns a Pandas DataFrame of star and fork counts for public repos

test_org = OrgStats(GH_TOKEN, "ghorgstatstestorg")
test_org.snapshot_stats
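The shape of a star and fork snapshot can be sketched as a plain dictionary keyed by repository name. The sketch assumes repository objects that expose `stargazers_count` and `forks_count`, as PyGithub repositories do. `FakeRepo` and `snapshot` are hypothetical names used only for illustration, and the counts are the ones shown for the test organization in this document.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class FakeRepo:
    """Stand-in for a repository object; only the fields used here."""
    name: str
    stargazers_count: int
    forks_count: int


def snapshot(repos: Iterable[FakeRepo]) -> Dict[str, Dict[str, int]]:
    """Map each repository name to its current star and fork counts."""
    return {
        r.name: {"stars": r.stargazers_count, "forks": r.forks_count}
        for r in repos
    }


stats = snapshot([FakeRepo("repo1", 1, 0), FakeRepo("repo2", 0, 0)])
```

A dictionary in this shape converts directly into a table with one row per repository, for example with `pd.DataFrame.from_dict(stats, orient="index")`.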

Long view stats

These are the other flavour of GitHub stats: traffic stats, which include visits to a repository on GitHub and clones of organization repositories. By default GitHub only provides access to recent information for these stats (the traffic API covers the last 14 days). This means that if we want longer-term information we need to store and update it on a regular basis ourselves. This is what the methods below do, in combination with GitHub Actions.
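The store-and-update idea can be sketched as follows. This is not the module's implementation, which works with Pandas; the sketch swaps in the standard library `csv` module to stay self-contained. The function name `update_traffic_csv` and the column names `date`, `total` and `unique` are assumptions made for illustration.

```python
import csv
from pathlib import Path
from typing import Dict, Tuple

Row = Tuple[int, int]  # (total, unique) counts for one day


def update_traffic_csv(path: Path, fresh: Dict[str, Row]) -> Dict[str, Row]:
    """Merge `fresh` daily counts into the CSV at `path` and rewrite it."""
    stored: Dict[str, Row] = {}
    if path.exists():
        with path.open(newline="") as f:
            for rec in csv.DictReader(f):
                stored[rec["date"]] = (int(rec["total"]), int(rec["unique"]))
    # Assumption: a freshly fetched value replaces a stored one for the same
    # date, in case the stored day was still incomplete when it was saved.
    stored.update(fresh)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["date", "total", "unique"])
        for date in sorted(stored):  # ISO dates sort chronologically
            writer.writerow([date, *stored[date]])
    return stored
```

Running a function like this on a schedule, with each run fetching the most recent window, lets the stored file grow past what the API returns in a single call.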

Traffic stats

gets views traffic for `repo` and saves as csv in `save_dir`

Parameters
----------
repo : Union[str,github.Repository.Repository]
    repository from `org`
save_dir : Union[str, pathlib.Path], optional
    directory where output CSV should be saved, by default 'view_data'
load : bool, optional
    load data into a Pandas DataFrame, by default False

Returns
-------
pd.DataFrame
    contains unique and total views for `repo` with dates

Gets views traffic for repo and saves it as a CSV in save_dir. repo is a repository under the GitHub organization. save_dir is the directory where the output CSV is saved (default: view_data). load is an optional flag that loads the data into a Pandas DataFrame (default: False).

test_org = OrgStats(GH_TOKEN, "ghorgstatstestorg")
test_org.get_repo_views_traffic(test_org.repos[0], 'test_dir',load=True).head(3)

Get view traffic for multiple repos from `Org`

Parameters
----------
public_only : bool, optional
    only get stats for public repos, by default True
save_dir : Union[str,pathlib.Path], optional
    directory where csvs of stats should be saved, by default 'view_data'
repos : Optional[Iterable[github.Repository.Repository]], optional
    to access stats for a specific set of repos, by default None
load : bool, optional
    whether to load views data into a DataFrame, by default False

Returns
-------
Union[None, pd.DataFrame]

test_org = OrgStats(GH_TOKEN, "ghorgstatstestorg")
test_org.get_org_views_traffic(load=True).head(3)

assert len(test_org.get_org_views_traffic(load=True).columns)/2 == len(test_org.public_repos)
assert len(test_org.get_org_views_traffic(repos=test_org.repos, load=True).columns)/2 == len(test_org.repos)
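The assertions above rely on the combined table holding two columns per repository (total and unique views). A minimal sketch of that layout, using plain dictionaries instead of a DataFrame, is shown below. `to_wide` is a hypothetical name, and missing days are represented as `None` here, where Pandas would show NaN.

```python
from typing import Dict, List, Optional, Tuple

# repo name -> date -> (total_views, unique_views)
PerRepo = Dict[str, Dict[str, Tuple[int, int]]]


def to_wide(
    per_repo: PerRepo,
) -> Tuple[List[Tuple[str, str]], Dict[str, List[Optional[int]]]]:
    """Lay out per-repo daily counts as one row per date."""
    # Two columns per repository, in the order the repos were given.
    columns = [
        (repo, metric)
        for repo in per_repo
        for metric in ("total_views", "unique_views")
    ]
    dates = sorted({d for by_date in per_repo.values() for d in by_date})
    rows: Dict[str, List[Optional[int]]] = {}
    for d in dates:
        row: List[Optional[int]] = []
        for repo in per_repo:
            total, unique = per_repo[repo].get(d, (None, None))
            row.extend([total, unique])
        rows[d] = row
    return columns, rows


columns, rows = to_wide({
    "repo1": {"2020-11-30": (2, 1), "2020-12-01": (1, 1)},
    "repo2": {"2020-11-30": (8, 1)},
})
```

With this layout, dividing the number of columns by two gives the number of repositories, which is the property the assertions check.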

Clones

gets clones traffic for `repo` and saves as csv in `save_dir`

Parameters
----------
repo : Union[str,github.Repository.Repository]
    repository from `org`
save_dir : Union[str, pathlib.Path], optional
    directory where output CSV should be saved, by default 'clone_data'
load : bool, optional
    load data into a Pandas DataFrame, by default False

Returns
-------
pd.DataFrame
    contains unique and total clones for `repo` with dates

test_org = OrgStats(GH_TOKEN, "ghorgstatstestorg")
test_org.get_repo_clones_traffic('repo1',save_dir='test_dir', load=True)

assert len(test_org.get_repo_clones_traffic(test_org.public_repos[1], load=True).columns) == 2

get clone traffic for multiple repos from `Org`

Parameters
----------
public_only : bool, optional
    only get stats for public repos, by default True
save_dir : Union[str,pathlib.Path], optional
    directory where csvs of stats should be saved, by default 'clone_data'
repos : Optional[Iterable[github.Repository.Repository]], optional
    to access stats for a specific set of repos, by default None
load : bool, optional
    whether to load clones data into a DataFrame, by default False

Returns
-------
Union[None, pd.DataFrame]

test_org = OrgStats(GH_TOKEN, "ghorgstatstestorg")
assert type(test_org.get_org_clones_traffic(repos=test_org.repos, save_dir='test_dir',load=True)) == pd.core.frame.DataFrame
assert (len(test_org.get_org_clones_traffic(save_dir='test_dir',load=True).columns) /2)  == test_org.public_repo_count

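Example output for the get_repo_views_traffic call above:

            total_views  unique_views
_date
2020-11-30            2             1
2020-12-01            1             1

Example output for the get_org_views_traffic call above:

                 repo1                      repo2
           total_views  unique_views  total_views  unique_views
2020-11-30           2             1          8.0           1.0
2020-12-01           1             1          NaN           NaN

Example output for snapshot_stats:

       stars  forks
repo1      1      0
repo2      0      0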

Stats

Authentication

create_github_session

OrgStats

class OrgStats

OrgStats.get_repos

OrgStats.repos

repo_count

OrgStats.public_repos

public_repo_count

OrgStats.private_repos

private_repo_count

Files

get_repo_files

OrgStats.get_repo_files

OrgStats.get_org_files

OrgStats.files

OrgStats.files_public

OrgStats.files_private

Helper function

get_ext

get_repo_file_ext_frequency

OrgStats.get_repo_file_ext_frequency

get_org_file_ext_frequency

OrgStats.get_org_file_ext_frequency

Snapshot stats

get_org_snapshot_stats

OrgStats.get_org_snapshot_stats

OrgStats.snapshot_stats

Long view stats

Traffic stats

get_repo_views_traffic

OrgStats.get_repo_views_traffic

get_org_views_traffic

OrgStats.get_org_views_traffic

Clones

get_repo_clones_traffic

OrgStats.get_repo_clones_traffic

get_org_clones_traffic

OrgStats.get_org_clones_traffic
