Reference#

Zoonyper consists of two classes: Project and Utils. They are documented below.

Project#

class zoonyper.project.Project(path: str = '', classifications_path: str = '', subjects_path: str = '', workflows_path: str = '', comments_path: str = '', tags_path: str = '', redact_users: bool = True, trim_paths: bool = True, parse_dates: str = '%Y-%m-%d', thumbnails_url: str = 'https://thumbnails.zooniverse.org/100x100/')#

Bases: Utils

A Zoonyper project represents a single Zooniverse project and contains all of the associated data required for analysis and visualization.

Parameters#

path (str, optional): A directory that contains all five required files for the project: classifications.csv, subjects.csv, workflows.csv, comments.json, and tags.json. If a path is not provided, the individual file paths can be passed instead.
classifications_path (str, optional): The path to the project’s classifications CSV file.
subjects_path (str, optional): The path to the project’s subjects CSV file.
workflows_path (str, optional): The path to the project’s workflows CSV file.
comments_path (str, optional): The path to the project’s comments JSON file.
tags_path (str, optional): The path to the project’s tags JSON file.
redact_users (bool, optional): Whether to redact user names in the classifications table. Defaults to True.
trim_paths (bool, optional): Whether to trim file paths in columns known to contain them. Defaults to True.
parse_dates (str, optional): The date format applied to the date columns (“created_at” and “updated_at”) when reading the CSV files. Defaults to “%Y-%m-%d”.
thumbnails_url (str, optional): Base URL for downloading thumbnails. Defaults to https://thumbnails.zooniverse.org/100x100/.

Raises#

RuntimeError: If the five individual file paths are not all provided, or if a general path is provided but does not contain all five files.
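
The check described above can be pictured with a small standard-library sketch. It is not Zoonyper's implementation: the helper name resolve_project_files and the error message are hypothetical, and only the five file names come from the parameter description.

```python
import tempfile
from pathlib import Path

# The five file names listed in the description of the `path` parameter.
REQUIRED_FILES = (
    "classifications.csv",
    "subjects.csv",
    "workflows.csv",
    "comments.json",
    "tags.json",
)


def resolve_project_files(path: str = "", **explicit_paths: str) -> dict:
    """Hypothetical helper that mirrors the documented check."""
    if path:
        found = {name: Path(path) / name for name in REQUIRED_FILES}
    else:
        # Map e.g. "classifications.csv" to the classifications_path argument.
        found = {
            name: Path(explicit_paths.get(f"{name.split('.')[0]}_path", ""))
            for name in REQUIRED_FILES
        }
    missing = [name for name, file in found.items() if not file.is_file()]
    if missing:
        raise RuntimeError(f"Missing project files: {', '.join(missing)}")
    return found


with tempfile.TemporaryDirectory() as tmp:
    for name in REQUIRED_FILES[:4]:  # deliberately leave out tags.json
        (Path(tmp) / name).touch()
    try:
        resolve_project_files(tmp)
        error = ""
    except RuntimeError as exc:
        error = str(exc)
```

In this sketch a directory without tags.json fails in the same way as a call where no paths are given at all.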

Notes#

The Project class provides a high-level interface for working with Zooniverse projects. Use the attributes to access the project’s data as pandas DataFrames, and use the methods to manipulate the data and perform analysis.

property annotations_flattened: DataFrame#

Strips the classifications down to a minimal set, preserving certain columns (passed as the include_columns parameter) with classification IDs as the index and each provided classification in column T0, T1, T2, etc.

New in version 0.1.0.

Parameters#

include_columns (list[str]): The list of columns to preserve from the classifications DataFrame, default: ["workflow_id", "workflow_version", "subject_ids"]

Raises#

NotImplementedError: If a type of data is encountered that cannot be interpreted by the script

Returns#

pandas.DataFrame: A DataFrame with the flattened annotations.

are_subjects_disambiguated() → bool#

Checks whether disambiguate_subjects has already been run successfully on the project.

Returns#

bool: True if subjects are disambiguated, False otherwise.

property boards: DataFrame#

Returns a preprocessed DataFrame of the project’s discussion boards.

New in version 0.1.0.

Returns#

pandas.DataFrame: A preprocessed DataFrame of discussion boards.

Notes#

The returned DataFrame is generated from the raw comments JSON data by performing the following steps:

Extracting the boards from the comments frame

Dropping duplicate information from the boards frame

The DataFrame is cached to reduce load times on subsequent calls.

classification_counts(workflow_id: int, task_number: int = 0) → Dict[int, Dict[str, int]]#

Provides the classification count by label and subject for a particular workflow.

New in version 0.1.0.

Parameters#

workflow_id (int): The workflow ID for which you want to extract classifications.
task_number (int, optional): The task number that you want to extract from across the workflow, by default 0.

Raises#

KeyError: If the provided task number does not appear in any classification across the project.

Returns#

dict[int, dict[str, int]]: A dictionary with the subject ID as the key and a nested dictionary as the value, which in turn has the task label as the key and the classification count for that task label as the value.
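
To make the shape of the return value concrete, here is a minimal sketch that builds the same kind of nested dictionary from made-up (subject ID, label) pairs. The data and variable names are illustrative; the method itself works on the project's classifications.

```python
from collections import Counter, defaultdict

# Made-up (subject_id, label) pairs standing in for one task of one workflow.
answers = [(101, "cat"), (101, "cat"), (101, "dog"), (102, "dog")]

# Count labels per subject.
counts = defaultdict(Counter)
for subject_id, label in answers:
    counts[subject_id][label] += 1

# Convert to plain dictionaries: subject ID -> {label: count}.
result = {subject_id: dict(labels) for subject_id, labels in counts.items()}
```

Here result is {101: {"cat": 2, "dog": 1}, 102: {"dog": 1}}.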

property classifications: DataFrame#

Returns a preprocessed DataFrame of the project’s classifications.

New in version 0.1.0.

Returns#

pandas.DataFrame: A preprocessed DataFrame of classifications.

Notes#

The returned DataFrame is generated from some of the raw classifications CSV data by performing the following steps:

Extracting the nested metadata JSON from the metadata column

Creating a new “seconds” column, denoting the seconds it took for the user to classify the subject(s) and dropping the original data used for the operation

Joining the extracted metadata back on the classifications DataFrame

Extracting the nested annotations data from the annotations column

Extracting all single list values as values in the annotations column

Joining the extracted and processed annotations back on the classifications DataFrame

Reducing the length of user names, user IP addresses and session identifiers to the shortest possible length that still maintains uniqueness

Dropping the metadata and annotations column

Preprocessing date columns using the _preprocess method

The DataFrame is cached to reduce load times on subsequent calls.
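
The “seconds” step in the list above could look roughly like the following sketch for a single metadata cell. The key names started_at and finished_at and the timestamp format are assumptions about the export format, not taken from this reference, and the real property operates on whole DataFrame columns.

```python
import json
from datetime import datetime

# A made-up metadata cell; the started_at/finished_at keys are assumptions.
raw_metadata = json.dumps({
    "started_at": "2023-01-05T10:00:00.000Z",
    "finished_at": "2023-01-05T10:00:42.500Z",
    "user_language": "en",
})


def parse_timestamp(value: str) -> datetime:
    # Assumed timestamp layout: ISO-like with fractional seconds and a Z suffix.
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")


metadata = json.loads(raw_metadata)
# Remove the two source values and keep only the derived duration.
started = parse_timestamp(metadata.pop("started_at"))
finished = parse_timestamp(metadata.pop("finished_at"))
metadata["seconds"] = (finished - started).total_seconds()
```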

property comments: DataFrame#

Returns a preprocessed DataFrame of the project’s comments.

New in version 0.1.0.

Returns#

pd.DataFrame: A pandas DataFrame containing the comments data.

Notes#

The DataFrame is loaded from the “comments” file.
The “comment_focus_id”, “comment_user_id”, “comment_created_at”, “comment_focus_type”, “comment_user_login” and “comment_body” columns are renamed to “focus_id”, “user_id”, “created_at”, “focus_type”, “user_login”, and “body” respectively.
The “board_id”, “discussion_id”, “board_title”, “board_description” and “discussion_title” columns are dropped from the DataFrame.
Duplicate data from comments is dropped.
The data is preprocessed if the parse_dates attribute is set.

disambiguate_subjects(downloads_directory: Optional[str] = None) → DataFrame#

Disambiguates subjects by identifying unique files based on their MD5 hashes and assigns a unique identifier to each disambiguated subject.

The method scans through the files in the specified downloads_directory, computes MD5 hashes of the files, and updates the subjects’ DataFrame with disambiguated subject IDs. If the method has already been run, the previously disambiguated subjects’ DataFrame is returned.

New in version 0.1.0.

Parameters#

downloads_directory (str, optional): The file path to an existing directory where the downloads from the method will be saved. If not specified, the default download_dir will be used. Defaults to None.

Raises#

RuntimeError

If the provided downloads_directory is invalid or not provided.

If the download directory is empty.

If there are multiple files with the same name but different hashes.

If not all subjects have been properly downloaded.

Returns#

pd.DataFrame: The disambiguated subjects’ DataFrame with an additional "subject_id_disambiguated" column containing unique identifiers for each disambiguated subject.
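
The idea of hash-based disambiguation can be sketched with the standard library alone. This is an illustration under assumptions, not the method's code: the function names are hypothetical, and numbering the IDs 1, 2, 3, ... is a choice made for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of(path: Path) -> str:
    # Hash the file's bytes; identical content gives an identical digest.
    return hashlib.md5(path.read_bytes()).hexdigest()


def disambiguate(directory: Path) -> dict:
    """Map each file name to an ID shared by all files with identical content."""
    ids_by_hash = {}
    result = {}
    for file in sorted(directory.iterdir()):
        digest = md5_of(file)
        # The first time a hash is seen it receives the next free ID.
        ids_by_hash.setdefault(digest, len(ids_by_hash) + 1)
        result[file.name] = ids_by_hash[digest]
    return result


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.jpg").write_bytes(b"same image")
    (Path(tmp) / "b.jpg").write_bytes(b"same image")
    (Path(tmp) / "c.jpg").write_bytes(b"other image")
    mapping = disambiguate(Path(tmp))
```

The two files with identical bytes end up with the same ID, which is the property the "subject_id_disambiguated" column is described as providing.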

property discussions: DataFrame#

Returns a preprocessed DataFrame of the project’s discussions.

New in version 0.1.0.

Returns#

pandas.DataFrame: A preprocessed DataFrame of discussions.

Notes#

The returned DataFrame is generated from some of the raw comments JSON data by performing the following steps:

Extracting the discussions from the comments frame

Dropping duplicate information from the discussions frame

The DataFrame is cached to reduce load times on subsequent calls.

download_all_subjects(download_dir: Optional[str] = None, timeout: int = 5, sleep: Tuple[int, int] = (2, 5), organize_by_workflow: bool = True, organize_by_subject_id: bool = True) → Literal[True]#

Downloads all subjects for each workflow in the project.

New in version 0.1.0.

Parameters#

download_dir (str, optional): The directory to download the subjects into. If None, defaults to the class attribute download_dir.
timeout (int, optional): Timeout between download attempts.
sleep (Tuple[int, int], optional): A tuple representing the amount of time to wait (in seconds) between download attempts.
organize_by_workflow (bool): Whether to organize downloaded subjects by workflow.
organize_by_subject_id (bool): Whether to organize downloaded subjects by subject ID.

Raises#

FileNotFoundError: If download_dir does not exist and cannot be created.
RuntimeError: If the download_dir is not writable.

Returns#

True: If all subjects were downloaded successfully.

download_workflow(workflow_id: int, download_dir: Optional[str] = None, timeout: int = 5, sleep: Optional[Tuple[int, int]] = (2, 5), organize_by_workflow: bool = True, organize_by_subject_id: bool = True) → Literal[True]#

Download all files associated with a particular workflow and store them in the specified directory. The download directory defaults to the value set for the download_dir parameter in the Project constructor.

New in version 0.1.0.

Parameters#

workflow_id (int): The ID of the workflow to download files for.
download_dir (str, optional): The directory to store downloaded files in. If not specified, uses the download_dir parameter set in the Project constructor.
timeout (int, optional): The time in seconds to wait for the server to respond before timing out.
sleep (tuple, optional): A tuple containing the range of time to sleep for between file downloads, in seconds.
organize_by_workflow (bool, optional): Whether to organize downloaded files by workflow ID.
organize_by_subject_id (bool, optional): Whether to organize downloaded files by subject ID.

Raises#

SyntaxError: If workflow_id is not an integer.

Returns#

True: Returns True if all files have been successfully downloaded and saved.

property frames: Dict[str, DataFrame]#

Get all the frames (classifications, subjects, workflows, comments, and tags).

New in version 0.1.0.

Raises#

SyntaxError: If the provided name is not one of the allowed options.

Returns#

Dict[str, pandas.DataFrame]: A dictionary with the frames’ names as keys and the pandas DataFrame as values.

get_all_classifications_by_date() → Dict#

Provides a dictionary with information about the number of classifications in the project at any given date, provided both for the project overall and by workflow.

New in version 0.1.0.

Returns#

dict: A dictionary keyed by the project’s workflow IDs, plus a project-wide “All workflows” key. Each value is a list of dictionaries with two keys, “date” and “close”, where “close” counts the classifications present in the filtered classifications DataFrame at that date, for every date in the classifications DataFrame’s date range.

get_classifications_for_workflow_by_dates(workflow_id: Optional[Union[int, str]] = None) → List[Dict[str, Union[str, int]]]#

Provides a list of the number of classifications in the project at any given date, filtered by workflow if provided (workflow_id).

New in version 0.1.0.

Parameters#

workflow_id (Optional[Union[int, str]]): The workflow ID used to filter the classifications, if provided.

Returns#

list[dict[str, Union[str, int]]]: A list of dictionaries which have two keys, “date” and “close”, where close counts the classifications present in the filtered classifications DataFrame at any given date in the classifications DataFrame’s date range.
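
A small sketch of the returned structure, using made-up dates. It assumes that “close” is a running total (the number of classifications made up to and including each date), which matches the description of growth over time but is an interpretation rather than something this reference states outright.

```python
from collections import Counter
from datetime import date, timedelta

# Made-up classification dates.
created = [date(2023, 1, 1), date(2023, 1, 1), date(2023, 1, 3)]

per_day = Counter(created)
start, end = min(created), max(created)

timeline = []
running_total = 0
# Walk every day in the range, including days without classifications.
for offset in range((end - start).days + 1):
    day = start + timedelta(days=offset)
    running_total += per_day[day]
    timeline.append({"date": day.isoformat(), "close": running_total})
```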

get_comments(include_staff: bool = True) → DataFrame#

Get comments for the project.

New in version 0.1.0.

Parameters#

include_staff (bool): Include staff comments or not, defaults to True.

Raises#

ValueError: If include_staff is False but staff property is not set (using set_staff method).

Returns#

pandas.DataFrame: A DataFrame containing comments for the project

get_disambiguated_subject_id(subject_id: int) → Union[List, int]#

Retrieves the disambiguated subject ID for a given subject ID.

The method returns the disambiguated subject ID associated with the provided subject ID. It raises a RuntimeError if the subjects have not been disambiguated using the zoonyper.project.Project.disambiguate_subjects() method.

New in version 0.1.0.

Parameters#

subject_id (int): The subject ID for which the disambiguated subject ID is to be retrieved.

Raises#

RuntimeError: If the subjects have not been disambiguated using the zoonyper.project.Project.disambiguate_subjects() method.

Returns#

Union[List, int]: The disambiguated subject ID associated with the provided subject ID. If the subject ID is not found in the subjects DataFrame, returns 0. If multiple disambiguated subject IDs are associated with the provided subject ID, returns a list of unique disambiguated subject IDs.
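
The three possible return shapes (0, a single ID, or a list of IDs) can be illustrated with a hypothetical lookup over made-up rows. The function name and data are illustrative only, and sorting the list is a choice made for the example.

```python
# Made-up rows of (subject_id, subject_id_disambiguated).
rows = [(10, 1), (11, 1), (12, 2), (12, 3)]


def lookup(subject_id: int):
    # Collect the unique disambiguated IDs recorded for this subject.
    matches = sorted({disamb for sid, disamb in rows if sid == subject_id})
    if not matches:
        return 0  # subject ID not found
    if len(matches) == 1:
        return matches[0]  # exactly one disambiguated ID
    return matches  # several disambiguated IDs
```

With this data, lookup(10) gives 1, lookup(12) gives [2, 3], and lookup(99) gives 0, so callers need to handle both an int and a list.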

get_subject_comments(subject_id, include_staff: bool = True) → DataFrame#

Returns all comments made on a specific subject by users, including staff if specified.

New in version 0.1.0.

Parameters#

subject_id (int or str): The ID of the subject you want to retrieve comments for
include_staff (bool): Whether or not to include comments made by staff members. If False, only comments made by non-staff users will be returned.

Returns#

pandas.DataFrame: A DataFrame containing all comments made on the specified subject

get_subject_paths(downloads_directory: str = '', organize_by_workflow: bool = True, organize_by_subject_id: bool = True)#

Retrieves a list of file paths for all subjects, organized by workflow and/or subject ID as specified.

The method generates a list of file paths for each subject based on the given organization options. It can take into account the workflow ID, subject ID, and subject file name to create the file paths.

New in version 0.1.0.

Parameters#

downloads_directory (str, optional): The file path to the directory where the subjects are downloaded. If not specified (default), the Project’s download_dir property will be used.
organize_by_workflow (bool, optional): If True (default), organizes the subject file paths by including the workflow ID in the path.
organize_by_subject_id (bool, optional): If True (default), organizes the subject file paths by including the subject ID (name) in the path.

Returns#

list: A list of file paths for all subjects, organized based on the specified options.

get_thumbnail_url(image_url: str) → str#

Get the thumbnail URL for the given image URL.

Parameters#

image_url (str): URL to get the thumbnail URL for.

Returns#

str: Thumbnail URL.
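
One plausible way to derive such a URL is sketched below. It assumes that the thumbnail service expects the image's host and path, without the scheme, appended to the base URL; that construction is an assumption and is not confirmed by this reference. The function name and the example.org address are hypothetical.

```python
# Default base URL from the Project constructor's thumbnails_url parameter.
THUMBNAILS_URL = "https://thumbnails.zooniverse.org/100x100/"


def thumbnail_url(image_url: str, base: str = THUMBNAILS_URL) -> str:
    # Assumption: drop the scheme ("https://") and append the rest to the base.
    without_scheme = image_url.split("://", 1)[-1]
    return base + without_scheme


example = thumbnail_url("https://example.org/images/subject.jpg")
```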

get_workflow_timelines(include_active: bool = True) → list#

Get the start and end dates for each workflow, and indicate whether they are active.

New in version 0.1.0.

Parameters#

include_active (bool): A boolean indicating whether active workflows should be included (True) or only inactive workflows (False). Defaults to True.

Returns#

list of dictionaries: A list of dictionaries, where each dictionary has keys: “workflow_id”, “start_date”, “end_date”, and “active”, which describe each workflow.

property inactive_workflow_ids: list#

Get a sorted list of the unique workflow IDs that are marked as inactive in the project’s workflows DataFrame.

New in version 0.1.0.

Returns#

list: A list of inactive workflow IDs.

load_frame(name: str) → DataFrame#

Load the raw dataframe specified by name and return it. If it is not loaded yet, then load it first and store it in the _raw_frames dictionary of the instance.

New in version 0.1.0.

Parameters#

name (str): The name of the raw dataframe to load.

Returns#

pandas.DataFrame: The loaded raw dataframe specified by name.

Raises#

SyntaxError: If name is not one of the valid raw dataframe names.

logged_in(workflow_id: Optional[int] = None) → Union[Dict, int]#

Get a count of the number of participants who were logged in to the Zooniverse platform at the time of their classification in a project’s workflows.

If a workflow_id is passed, this method returns the count of unique participants who were logged in at the time of classification in that workflow.

If no workflow_id is passed, this method returns a dictionary where each key is a workflow ID and the corresponding value is the count of unique participants who were logged in at the time of classification in that workflow.

To count the total number of participants who made classifications in a given workflow or across the project, use the participants_count method.

See also

zoonyper.project.Project.participants_count() for the general count of participants in a given workflow or across the project.

New in version 0.1.0.

Parameters#

workflow_id (int, optional): The ID of a specific workflow for which to get the logged-in participant count. If not specified, the logged-in participant count for each workflow is returned.

Raises#

RuntimeError: If the specified workflow_id is not recorded in the project.

Returns#

Union[Dict, int]: If a workflow_id was passed, this method returns an integer value describing the logged-in participant count for that workflow. If no workflow_id was passed, a dictionary with workflow IDs as keys and their corresponding logged-in participant counts as values is returned.

participants(workflow_id: int, by_workflow: bool = False) → Union[dict, list]#

Return a list of logged-in participant names who contributed to the specified workflow.

New in version 0.1.0.

Parameters#

workflow_id (int): The ID of the workflow to retrieve participants for.
by_workflow (bool, optional): Whether to return a dictionary with all workflows and their participants or just a list of participants for the specified workflow, by default False.

Returns#

Union[dict, list]: If by_workflow is False (default), return a list of all logged-in participants for the specified workflow. If by_workflow is True, return a dictionary with workflow IDs as keys and a list of logged-in participants as values.

participants_count(workflow_id: Optional[int] = None) → Union[Dict, int]#

Get a count of the number of participants who have made classifications in a project’s workflows.

If a workflow_id is passed, this method returns the count of unique participants who made classifications in that workflow.

If no workflow_id is passed, this method returns a dictionary where each key is a workflow ID and the corresponding value is the count of unique participants who made classifications in that workflow.

To count the number of participants who were logged in at the time of their classification in a given workflow or across the project, use the logged_in method.

See also

zoonyper.project.Project.logged_in() for the count of participants who were logged in at the time of their classification in a given workflow or across the project.

New in version 0.1.0.

Parameters#

workflow_id (int, optional): The ID of a specific workflow for which to get the participant count. If not specified, the participant count for each workflow is returned.

Raises#

RuntimeError: If the specified workflow_id is not recorded in the project.

Returns#

Union[Dict, int]: If a workflow_id was passed, this method returns an integer value describing the participant count for that workflow. If no workflow_id was passed, a dictionary with workflow IDs as keys and their participant counts as values is returned.
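
The difference between participants_count and logged_in can be sketched with made-up data. The sketch assumes that anonymous participants are recognizable by a "not-logged-in-" prefix in their user name; that convention is an assumption here, and the variable names are illustrative.

```python
from collections import defaultdict

# Made-up (workflow_id, user_name) pairs.
# Assumption: anonymous users carry a "not-logged-in-" prefix.
classifications = [
    (1, "alice"), (1, "alice"), (1, "not-logged-in-abc"),
    (2, "bob"), (2, "alice"),
]

all_users = defaultdict(set)
logged_in_users = defaultdict(set)
for workflow_id, user in classifications:
    all_users[workflow_id].add(user)
    if not user.startswith("not-logged-in-"):
        logged_in_users[workflow_id].add(user)

# Unique participants per workflow, with and without anonymous users.
participants_count = {wf: len(users) for wf, users in all_users.items()}
logged_in = {wf: len(users) for wf, users in logged_in_users.items()}
```

In this example workflow 1 has two participants but only one who was logged in, while workflow 2 has two of each.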

plot_classifications(workflow_id: Union[int, str] = '', width: int = 15, height: int = 5) → Figure#

Renders the classifications for the project or a particular workflow (if workflow_id is provided) as a matplotlib.figure.Figure.

See also

zoonyper.project.Project.get_classifications_for_workflow_by_dates(), the method that is used to generate the growth of classifications. The workflow_id passed to the plot_classifications method is passed on as-is.

New in version 0.1.0.

Parameters#

workflow_id (int): Any workflow ID from the project, which can be used for filtering
width (int): Figure width in inches, default: 15
height (int): Figure height in inches, default: 5

Raises#

SyntaxError: If width or height are not provided as integers

Returns#

matplotlib.figure.Figure: A line plot showing the growth of classifications

set_staff(staff: list) → None#

Set the staff members for the project.

New in version 0.1.0.

Parameters#

staff (list[str]): A list of staff member usernames.

Returns#

None

property subject_sets: dict#

Returns a dictionary of the project’s subject sets.

New in version 0.1.0.

Returns#

dict: A dictionary of subject sets, where the key is the subject set ID and the value is a list of subjects contained within that subject set.

property subject_urls#

Retrieves a dictionary of all subject URLs matched with the subject’s ID.

The property compiles a list of URLs from the subjects DataFrame’s locations column. If the _subject_urls attribute is empty, the method first ensures that the subjects DataFrame is set up.

New in version 0.1.0.

Returns#

list: A list of all subject URLs from the locations column of the subjects DataFrame.

property subjects: DataFrame#

Returns a preprocessed DataFrame of the project’s subjects.

Issues a warning for non-disambiguated subjects if the zoonyper.project.Project.disambiguate_subjects() method has not been run on the project instance.

New in version 0.1.0.

Returns#

pandas.DataFrame: A preprocessed DataFrame of subjects.

Notes#

The returned DataFrame is generated from some of the raw subjects CSV data by performing the following steps:

Extracting the nested metadata JSON from the metadata column

Joining the extracted metadata back on the subjects DataFrame

Dropping the metadata column

Preprocessing date columns using the _preprocess method

The DataFrame is cached to reduce load times on subsequent calls.

property tags: DataFrame#

Returns a preprocessed DataFrame of the project’s tags.

New in version 0.1.0.

Returns#

pandas.DataFrame: A preprocessed DataFrame of tags.

Notes#

The returned DataFrame is generated from the raw tags JSON data by performing the following steps:

Joining the tags and comments frames

Dropping duplicate information from the tags frame

Preprocessing date columns using the _preprocess method

The DataFrame is cached to reduce load times on subsequent calls.

property workflow_ids: list#

Provides a list of the workflow IDs associated with the project.

See also

zoonyper.project.Project.workflows, the attribute containing a pandas.DataFrame of all the workflows associated with the project, for which this method returns the index.

New in version 0.1.0.

Returns#

list: List of workflow IDs

workflow_subjects(workflow_id: int) → list#

Return a list of the subject IDs for a specific workflow ID.

New in version 0.1.0.

Parameters#

workflow_id (int): The ID of the workflow for which you want to get the subject IDs.

Raises#

RuntimeError: If the workflow_id argument is not an integer.

Returns#

list: A list of subject IDs for the given workflow ID.

property workflows: DataFrame#

Returns a DataFrame of the project’s workflows.

New in version 0.1.0.

Returns#

pandas.DataFrame: A DataFrame of workflows.

The DataFrame is cached to reduce load times on subsequent calls.

Utils#

class zoonyper.utils.Utils#

Superclass to Project, i.e. all the methods in this class are inherited by the Project class. This affects how some methods are used; see, for example, redact_username().

MAX_SIZE_OBSERVABLE#: Constant defining the maximum file size for Observable export. The default, 50MB, is the maximum size for files on Observable, but it can be set to another value should Observable allow larger files.

New in version 0.1.0.

static camel_case(string: str) → str#

Converts any string to CamelCase.

New in version 0.1.0.

Parameters#

string (str): String to make into camel case.

Returns#

str: String camel cased.

Notes#

Adapted from http://bit.ly/3yXqKs2.
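
As an illustration only, a camel-case converter could look like the sketch below. It is not the adapted implementation referenced above: it assumes words are separated by whitespace, underscores or hyphens, and it produces lower camelCase, the variant that export_observable describes for column names.

```python
import re


def camel_case(string: str) -> str:
    # Split on runs of whitespace, underscores or hyphens (an assumption).
    words = [word for word in re.split(r"[\s_\-]+", string.strip()) if word]
    if not words:
        return ""
    first, *rest = words
    # Lowercase the first word, capitalize the first letter of the others.
    return first.lower() + "".join(word.capitalize() for word in rest)
```

For example, camel_case("workflow_id") gives "workflowId" and camel_case("subject ids") gives "subjectIds".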

export(df: DataFrame, filename: str = '', filter_workflows: Optional[list] = None, drop_columns: Optional[list] = None) → None#

Export a pandas DataFrame to a CSV file with optional filtering and column removal. If a column contains only one unique value, it will not be exported to save space.

New in version 0.1.0.

Parameters#

df (pandas.DataFrame): The input DataFrame to be exported.
filename (str): The output CSV file name. Note: Should have the file suffix .csv.
filter_workflows (list, optional): A list of Zooniverse workflow IDs to filter the DataFrame before exporting. Default is None, which means no filtering.
drop_columns (list, optional): A list of column names to be removed from the DataFrame before exporting. Default is None, which means no removal.

Returns#

None: The method exports the DataFrame to a CSV file and doesn’t return any value.

Raises#

RuntimeError: If the required filename parameter is missing or if the first parameter is not a pandas DataFrame.
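
The space-saving rule described for export (columns holding a single unique value are not written) can be sketched with the csv module and made-up rows. The real method works on a pandas DataFrame and writes to a file; this version writes to an in-memory buffer to stay self-contained.

```python
import csv
import io

# Made-up rows; workflow_id and workflow_name never vary.
rows = [
    {"workflow_id": 1, "workflow_name": "Transcribe", "T0": "yes"},
    {"workflow_id": 1, "workflow_name": "Transcribe", "T0": "no"},
]

# Keep only the columns that hold more than one distinct value.
columns = [col for col in rows[0] if len({row[col] for row in rows}) > 1]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Only the T0 column survives here, because the other two columns carry the same value in every row.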

export_annotations_flattened(filename: str = 'annotations_flattened.csv', filter_workflows: List = [], drop_columns: List = []) → None#

Attempts to compress the project instance’s flattened annotations and exports them into CSV format.

New in version 0.1.0.

Parameters#

filename (str, optional): The output CSV file name. Note: Should have the file suffix .csv. Defaults to annotations_flattened.csv.
filter_workflows (list, optional): A list of Zooniverse workflow IDs to filter the annotations before exporting. Defaults to an empty list, which means no filtering.
drop_columns (list, optional): A list of column names to be removed from the annotations DataFrame before exporting. Defaults to an empty list, which means no removal.

Returns#

None: The method exports the annotations DataFrame to a CSV file and doesn’t return any value.

Raises#

RuntimeError: If the required filename parameter is missing.

export_classifications(filename: str = 'classifications.csv', filter_workflows: List = [], drop_columns: List = []) → None#

Attempts to compress the project instance’s classifications and exports them into CSV format.

New in version 0.1.0.

Parameters#

filename (str, optional): The output CSV file name. Note: Should have the file suffix .csv. Defaults to classifications.csv.
filter_workflows (list, optional): A list of Zooniverse workflow IDs to filter the classifications before exporting. Defaults to an empty list, which means no filtering.
drop_columns (list, optional): A list of column names to be removed from the classifications DataFrame before exporting. Defaults to an empty list, which means no removal.

Returns#

None: The method exports the classifications DataFrame to a CSV file and doesn’t return any value.

Raises#

RuntimeError: If the required filename parameter is missing.

export_observable(directory: str = 'output') → None#

Export the processed classifications and annotations data to the specified directory as CSV files, fit for uploading to ObservableHQ. Before exporting, it converts column names to camel case (camelCase). Finally, it checks if the output files exceed the allowed size and logs a warning if they do.

New in version 0.1.0.

Parameters#

directory (str, optional): The output directory path where the CSV files will be saved. Default is "output".

Returns#

None: The method exports the data to CSV files and doesn’t return any value.

redact_username(username: str) → Optional[str]#

Returns a sha256 encoded string for any given string (and caches the result to speed up repeated calls).

As Utils is inherited by Project, the method is accessible through self inside the class and through any Project instance. Here is an example:

project = Project("<path>")
project.classifications["user_name"].apply(project.redact_username)

New in version 0.1.0.

Parameters#

username (str): The username that you want to encode

Returns#

str, optional: Username that is encoded to not be clear to human eyes
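
A minimal sketch of hashing with a cache, using only the standard library. The function name redact, the use of lru_cache, and the handling of None are assumptions made for illustration; the method's actual caching mechanism is not described here.

```python
import hashlib
from functools import lru_cache
from typing import Optional


@lru_cache(maxsize=None)
def redact(username: Optional[str]) -> Optional[str]:
    # Assumption: a missing username stays missing rather than being hashed.
    if username is None:
        return None
    # sha256 hex digest: 64 hexadecimal characters, stable for equal input.
    return hashlib.sha256(username.encode("utf-8")).hexdigest()


first = redact("alice")
second = redact("alice")  # served from the cache on the second call
```

Because the digest is deterministic, the same user name always maps to the same redacted value, so counts per participant remain valid after redaction.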

static trim_path(path: Union[str, Path]) → str#

Shortcut that returns the name of the file from a file path, maintaining flexibility of the path’s type.

New in version 0.1.0.

Parameters#

path (Union[str, Path]): File path

Returns#

str: Filename from path
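
The behavior described above corresponds to taking the final component of a path with pathlib. The following is a sketch under that assumption rather than a copy of the method.

```python
from pathlib import Path
from typing import Union


def trim_path(path: Union[str, Path]) -> str:
    # Path() accepts both strings and Path objects; .name is the last component.
    return Path(path).name


from_string = trim_path("downloads/12/img_001.jpg")
from_path = trim_path(Path("downloads") / "12" / "img_001.jpg")
```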

zoonyper.utils.get_current_dir(download_dir: str, organize_by_workflow: bool, organize_by_subject_id: bool, workflow_id: int = 0, subject_id: int = 0) → Path#

Generate a Path object representing the current directory for storing downloaded files based on the specified organization options. The function can organize files by workflow, by subject ID, or both.

New in version 0.1.0.

Parameters#

download_dir (str): The base download directory path.
organize_by_workflow (bool): If True, organize files in subdirectories named after their respective workflow IDs.
organize_by_subject_id (bool): If True, organize files in subdirectories named after their respective subject IDs.
workflow_id (int, optional): The workflow ID to be used when organizing files by workflow. Default is 0.
subject_id (int, optional): The subject ID to be used when organizing files by subject ID. Default is 0.

Returns#

pathlib.Path: The Path object representing the current directory based on the organization options.
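
A sketch of how such a directory could be assembled. The nesting order (workflow directory first, subject directory inside it) and the use of the bare IDs as directory names are assumptions; the function name current_dir is hypothetical.

```python
from pathlib import Path


def current_dir(
    download_dir: str,
    organize_by_workflow: bool,
    organize_by_subject_id: bool,
    workflow_id: int = 0,
    subject_id: int = 0,
) -> Path:
    target = Path(download_dir)
    if organize_by_workflow:
        # Assumed layout: <download_dir>/<workflow_id>
        target = target / str(workflow_id)
    if organize_by_subject_id:
        # Assumed layout: .../<subject_id>
        target = target / str(subject_id)
    return target


both = current_dir("downloads", True, True, workflow_id=12, subject_id=345)
subject_only = current_dir("downloads", False, True, workflow_id=12, subject_id=345)
```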