Reference#
Zoonyper consists of two classes: Project and Utils. They are documented below.
Project#
- class zoonyper.project.Project(path: str = '', classifications_path: str = '', subjects_path: str = '', workflows_path: str = '', comments_path: str = '', tags_path: str = '', redact_users: bool = True, trim_paths: bool = True, parse_dates: str = '%Y-%m-%d', thumbnails_url: str = 'https://thumbnails.zooniverse.org/100x100/')#
Bases:
Utils
A Zoonyper project represents a single Zooniverse project and contains all of the associated data required for analysis and visualization.
Parameters#
- pathstr, optional
A directory that contains all five required files for the project: classifications.csv, subjects.csv, workflows.csv, comments.json, and tags.json. If a path is not provided, the individual file paths can be passed instead.
- classifications_pathstr, optional
The path to the project’s classifications CSV file.
- subjects_pathstr, optional
The path to the project’s subjects CSV file.
- workflows_pathstr, optional
The path to the project’s workflows CSV file.
- comments_pathstr, optional
The path to the project’s comments JSON file.
- tags_pathstr, optional
The path to the project’s tags JSON file.
- redact_usersbool, optional
Whether to redact user names in the classifications table. Defaults to True.
- trim_pathsbool, optional
Whether to trim file paths in columns known to contain them. Defaults to True.
- parse_datesstr, optional
The date format used to parse date columns into datetime objects when reading the CSV files. The default value is “%Y-%m-%d”, which is applied to the columns named “created_at” and “updated_at”.
- thumbnails_urlstr, optional
Base URL from which thumbnails are downloaded. Defaults to https://thumbnails.zooniverse.org/100x100/.
Raises#
- RuntimeError
If neither a general path nor the paths to all five individual files are provided, or if a general path is provided but does not contain all five files.
Notes#
The Project class provides a high-level interface for working with Zooniverse projects. Use the attributes to access the project’s data as pandas DataFrames, and use the methods to manipulate the data and perform analysis.
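The path handling described under Parameters and Raises can be sketched with the standard library alone. This is a minimal illustration, not Zoonyper's actual implementation: the helper name `resolve_project_files` and the constant `REQUIRED_FILES` are hypothetical, and only the five file names come from the documentation above.

```python
from pathlib import Path
from typing import Dict

# File names as listed in the `path` parameter description.
REQUIRED_FILES = {
    "classifications": "classifications.csv",
    "subjects": "subjects.csv",
    "workflows": "workflows.csv",
    "comments": "comments.json",
    "tags": "tags.json",
}


def resolve_project_files(path: str = "", **individual_paths: str) -> Dict[str, Path]:
    """Hypothetical helper: map each required file to a concrete path."""
    if path:
        # A directory was given: expect all five files inside it.
        resolved = {
            name: Path(path) / filename for name, filename in REQUIRED_FILES.items()
        }
    else:
        # No directory: fall back to keyword arguments such as `subjects_path`.
        resolved = {
            name: Path(individual_paths.get(f"{name}_path", ""))
            for name in REQUIRED_FILES
        }
    missing = [name for name, file_path in resolved.items() if not file_path.is_file()]
    if missing:
        raise RuntimeError(f"Missing required project files: {', '.join(missing)}")
    return resolved
```

Calling the helper with no arguments raises a RuntimeError, mirroring the behaviour documented for the constructor.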
- property annotations_flattened: DataFrame#
Strips the classifications down to a minimal set, preserving certain columns (passed as the include_columns parameter) with classification IDs as the index and each provided classification in column T0, T1, T2, etc.
New in version 0.1.0.
Parameters#
- include_columnslist[str]
The list of columns to preserve from the classifications DataFrame, default:
["workflow_id", "workflow_version", "subject_ids"]
Raises#
- NotImplementedError
If a type of data is encountered that cannot be interpreted by the script
Returns#
- pandas.DataFrame
A DataFrame with the flattened annotations.
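The flattening described above can be illustrated without pandas. The sketch below assumes that each raw classification carries an `annotations` JSON string holding a list of objects with `task` and `value` keys; that layout and the function name `flatten_annotations` are assumptions made for illustration, not a statement about Zoonyper's internals.

```python
import json
from typing import Any, Dict, List


def flatten_annotations(
    rows: List[Dict[str, Any]], include_columns: List[str]
) -> Dict[int, Dict[str, Any]]:
    """Hypothetical helper: one flat record per classification ID."""
    flattened = {}
    for row in rows:
        # Keep only the requested columns from the classification row.
        record = {column: row[column] for column in include_columns}
        # Spread every annotation into its own task column (T0, T1, ...).
        for annotation in json.loads(row["annotations"]):
            record[annotation["task"]] = annotation["value"]
        flattened[row["classification_id"]] = record
    return flattened


rows = [
    {
        "classification_id": 1,
        "workflow_id": 10,
        "workflow_version": "1.1",
        "subject_ids": 100,
        "annotations": json.dumps(
            [
                {"task": "T0", "value": "Yes"},
                {"task": "T1", "value": "A handwritten note"},
            ]
        ),
    }
]
flat = flatten_annotations(rows, ["workflow_id", "workflow_version", "subject_ids"])
```

Here `flat[1]` holds the three preserved columns plus one entry per task.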
- are_subjects_disambiguated() bool #
Checks whether disambiguate_subjects has already been run successfully on the project's subjects.
Returns#
- bool
True if subjects are disambiguated, False otherwise.
- property boards: DataFrame#
Returns a preprocessed DataFrame of the project’s discussion boards.
New in version 0.1.0.
Returns#
- pandas.DataFrame
A preprocessed DataFrame of discussion boards.
Notes#
The returned DataFrame is generated from the raw comments JSON data by performing the following steps:
Extracting the boards from the comments frame
Dropping duplicate information from the boards frame
The DataFrame is cached to reduce load times on subsequent calls.
- classification_counts(workflow_id: int, task_number: int = 0) Dict[int, Dict[str, int]] #
Provides the classification count by label and subject for a particular workflow.
New in version 0.1.0.
Parameters#
- workflow_idint
The workflow ID for which you want to extract classifications.
- task_numberint, optional
The task number that you want to extract from across the workflow, by default 0.
Raises#
- KeyError
If the provided task number does not appear in any classification across the project.
Returns#
- dict[int, dict[str, int]]
A dictionary with the subject ID as the key and a nested dictionary as the value, which in turn has the task label as the key and the classification count for that task label as the value.
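The shape of the return value is easier to see with a small example. The sketch below builds the same nested structure from hypothetical `(subject_id, label)` pairs; the function name and the sample data are illustrative only.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def count_labels_by_subject(
    classifications: Iterable[Tuple[int, str]]
) -> Dict[int, Dict[str, int]]:
    """Hypothetical helper: count how often each label was given per subject."""
    counts = defaultdict(Counter)
    for subject_id, label in classifications:
        counts[subject_id][label] += 1
    # Convert to plain dictionaries to match the documented return type.
    return {subject_id: dict(counter) for subject_id, counter in counts.items()}


votes = [(100, "cat"), (100, "cat"), (100, "dog"), (101, "dog")]
result = count_labels_by_subject(votes)
# result == {100: {"cat": 2, "dog": 1}, 101: {"dog": 1}}
```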
- property classifications: DataFrame#
Returns a preprocessed DataFrame of the project’s classifications.
New in version 0.1.0.
Returns#
- pandas.DataFrame
A preprocessed DataFrame of classifications.
Notes#
The returned DataFrame is generated from some of the raw classifications CSV data by performing the following steps:
Extracting the nested metadata JSON from the metadata column
Creating a new “seconds” column, denoting the seconds it took for the user to classify the subject(s) and dropping the original data used for the operation
Joining the extracted metadata back on the classifications DataFrame
Extracting the nested annotations data from the annotations column
Extracting all single list values as values in the annotations column
Joining the extracted and processed annotations back on the classifications DataFrame
Reducing the length of user names, user IP addresses and session identifiers to the shortest possible length that keeps them unique
Dropping the metadata and annotations columns
Preprocessing date columns using the _preprocess method
The DataFrame is cached to reduce load times on subsequent calls.
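One of the steps above shortens identifiers while keeping them unique. How Zoonyper does this is not documented here; the sketch below shows one plausible approach, truncating every identifier to the shortest common prefix length that still distinguishes all of them. The function name is hypothetical.

```python
from typing import Dict, Iterable


def shorten_identifiers(identifiers: Iterable[str], min_length: int = 1) -> Dict[str, str]:
    """Hypothetical helper: map each identifier to its shortest unique prefix."""
    unique = set(identifiers)
    if not unique:
        return {}
    longest = max(len(identifier) for identifier in unique)
    for length in range(min_length, longest + 1):
        shortened = {identifier: identifier[:length] for identifier in unique}
        # Stop at the first length where no two identifiers collide.
        if len(set(shortened.values())) == len(unique):
            return shortened
    # Only reached when min_length exceeds the longest identifier.
    return {identifier: identifier for identifier in unique}
```

For the identifiers "abc123", "abd456" and "xyz789", three characters are needed, because the first two share the prefix "ab".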
- property comments: DataFrame#
Returns a preprocessed DataFrame of the project’s comments.
New in version 0.1.0.
Returns#
- pd.DataFrame
A pandas DataFrame containing the comments data.
Notes#
The DataFrame is loaded from the “comments” file.
The “comment_focus_id”, “comment_user_id”, “comment_created_at”, “comment_focus_type”, “comment_user_login” and “comment_body” columns are renamed to “focus_id”, “user_id”, “created_at”, “focus_type”, “user_login”, and “body” respectively.
The “board_id”, “discussion_id”, “board_title”, “board_description” and “discussion_title” columns are dropped from the DataFrame.
Duplicate data from comments is dropped.
The data is preprocessed if the parse_dates attribute is set.
- disambiguate_subjects(downloads_directory: Optional[str] = None) DataFrame #
Disambiguates subjects by identifying unique files based on their MD5 hashes and assigns a unique identifier to each disambiguated subject.
The method scans through the files in the specified downloads_directory, computes MD5 hashes of the files, and updates the subjects’ DataFrame with disambiguated subject IDs. If the method has already been run, the previously disambiguated subjects’ DataFrame is returned.
New in version 0.1.0.
Parameters#
- downloads_directorystr, optional
The file path to an existing directory where the downloads from the method will be saved. If not specified, the default download_dir will be used. Defaults to None.
Raises#
- RuntimeError
If the provided downloads_directory is invalid or not provided.
If the download directory is empty.
If there are multiple files with the same name but different hashes.
If not all subjects have been properly downloaded.
Returns#
- pd.DataFrame
The disambiguated subjects’ DataFrame with an additional "subject_id_disambiguated" column containing unique identifiers for each disambiguated subject.
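The core idea, grouping files by the MD5 hash of their content, can be sketched with hashlib. This is only an illustration under assumptions: the helper names and the simple counter-based numbering are hypothetical, and the real method also updates the subjects DataFrame and performs the checks listed under Raises.

```python
import hashlib
from pathlib import Path
from typing import Dict


def md5_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large images need not fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def disambiguate_files(downloads_directory: str) -> Dict[str, int]:
    """Hypothetical helper: files with identical content share one ID."""
    ids_by_hash: Dict[str, int] = {}
    result: Dict[str, int] = {}
    for path in sorted(Path(downloads_directory).rglob("*")):
        if path.is_file():
            file_hash = md5_of_file(path)
            # Reuse the ID of a known hash, otherwise assign the next number.
            result[str(path)] = ids_by_hash.setdefault(file_hash, len(ids_by_hash) + 1)
    return result
```

Two subjects that were uploaded twice under different names end up with the same disambiguated ID because their file contents hash to the same value.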
- property discussions: DataFrame#
Returns a preprocessed DataFrame of the project’s discussions.
New in version 0.1.0.
Returns#
- pandas.DataFrame
A preprocessed DataFrame of discussions.
Notes#
The returned DataFrame is generated from some of the raw comments JSON data by performing the following steps:
Extracting the discussions from the comments frame
Dropping duplicate information from the discussions frame
The DataFrame is cached to reduce load times on subsequent calls.
- download_all_subjects(download_dir: Optional[str] = None, timeout: int = 5, sleep: Tuple[int, int] = (2, 5), organize_by_workflow: bool = True, organize_by_subject_id: bool = True) Literal[True] #
Downloads all subjects for each workflow in the project.
New in version 0.1.0.
Parameters#
- download_dirstr, optional
The directory to download the subjects into. If None, defaults to the class attribute download_dir.
- timeoutint, optional
The time in seconds to wait for the server to respond before a download attempt times out.
- sleepTuple[int, int], optional
A tuple representing the amount of time to wait (in seconds) between download attempts.
- organize_by_workflowbool
Whether to organize downloaded subjects by workflow.
- organize_by_subject_idbool
Whether to organize downloaded subjects by subject ID.
Raises#
- FileNotFoundError
If download_dir does not exist and cannot be created.
- RuntimeError
If the download_dir is not writable.
Returns#
- True
If all subjects were downloaded successfully.
- download_workflow(workflow_id: int, download_dir: Optional[str] = None, timeout: int = 5, sleep: Optional[Tuple[int, int]] = (2, 5), organize_by_workflow: bool = True, organize_by_subject_id: bool = True) Literal[True] #
Download all files associated with a particular workflow and store them in the specified directory. The download directory defaults to the value of the Project instance's download_dir attribute.
New in version 0.1.0.
Parameters#
- workflow_idint
The ID of the workflow to download files for.
- download_dirstr, optional
The directory to store downloaded files in. If not specified, the Project instance's download_dir attribute is used.
- timeoutint, optional
The time in seconds to wait for the server to respond before timing out.
- sleeptuple, optional
A tuple containing the range of time to sleep for between file downloads, in seconds.
- organize_by_workflowbool, optional
Whether to organize downloaded files by workflow ID.
- organize_by_subject_idbool, optional
Whether to organize downloaded files by subject ID.
Raises#
- SyntaxError
If workflow_id is not an integer.
Returns#
- True
Returns True if all files have been successfully downloaded and saved.
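The sleep parameter describes a range of seconds to pause between file downloads. A minimal sketch of such a loop follows; the function name is hypothetical, the fetching function is injected so the example does not touch the network, and drawing the pause with random.uniform is an assumption about how the range is used.

```python
import random
import time
from typing import Callable, Optional, Sequence, Tuple


def download_politely(
    urls: Sequence[str],
    fetch: Callable[[str], None],
    sleep: Optional[Tuple[int, int]] = (2, 5),
    pause: Callable[[float], None] = time.sleep,
) -> bool:
    """Hypothetical helper: fetch every URL, pausing between requests."""
    for index, url in enumerate(urls):
        fetch(url)
        # Pause for a random duration within the range, except after the last file.
        if sleep and index < len(urls) - 1:
            pause(random.uniform(*sleep))
    return True
```

Randomising the pause spreads requests out and avoids hitting the server at a fixed rhythm; passing `sleep=None` disables the pause.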
- property frames: Dict[str, DataFrame]#
Get all the frames (classifications, subjects, workflows, comments, and tags).
New in version 0.1.0.
Raises#
- SyntaxError
If the provided name is not one of the allowed options.
Returns#
- Dict[str, pandas.DataFrame]
A dictionary with the frames’ names as keys and the pandas DataFrame as values.
- get_all_classifications_by_date() Dict #
Provides a dictionary with information about the number of classifications in the project at any given date, provided both for the project overall and by workflow.
New in version 0.1.0.
Returns#
- dict
A dictionary with the project’s workflow IDs as keys (as well as a cross-project, “All workflows” key) and, as values, a list of dictionaries for each workflow (and the entire project) which have two keys, “date” and “close”, where close counts the classifications present in the filtered classifications DataFrame at any given date in the classifications DataFrame’s date range.
- get_classifications_for_workflow_by_dates(workflow_id: Optional[Union[int, str]] = None) List[Dict[str, Union[str, int]]] #
Provides a list of the number of classifications in the project at any given date, filtered by workflow if provided (workflow_id).
New in version 0.1.0.
Parameters#
- workflow_idOptional[Union[int, str]]
The workflow ID used to filter the classifications. Optional; if omitted, the classifications are not filtered by workflow.
Returns#
- list[dict[str, Union[str, int]]]
A list of dictionaries which have two keys, “date” and “close”, where close counts the classifications present in the filtered classifications DataFrame at any given date in the classifications DataFrame’s date range.
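The returned structure can be illustrated with a short sketch. It reads “close” as a running total of classifications up to and including each date, which is an interpretation of the description above rather than something the documentation states outright; the function name is hypothetical.

```python
from collections import Counter
from datetime import date, timedelta
from typing import Dict, List, Union


def cumulative_counts_by_date(
    classification_dates: List[date],
) -> List[Dict[str, Union[str, int]]]:
    """Hypothetical helper: running classification total for every day in range."""
    if not classification_dates:
        return []
    per_day = Counter(classification_dates)
    current, last = min(per_day), max(per_day)
    running_total = 0
    timeline = []
    while current <= last:
        # Days without classifications keep the previous total.
        running_total += per_day.get(current, 0)
        timeline.append({"date": current.isoformat(), "close": running_total})
        current += timedelta(days=1)
    return timeline


timeline = cumulative_counts_by_date(
    [date(2023, 1, 1), date(2023, 1, 1), date(2023, 1, 3)]
)
```

With two classifications on 1 January and one on 3 January, the list has three entries whose “close” values are 2, 2 and 3.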
- get_comments(include_staff: bool = True) DataFrame #
Get comments for the project.
New in version 0.1.0.
Parameters#
- include_staffbool
Include staff comments or not, defaults to True.
Raises#
- ValueError
If include_staff is False but the staff property is not set (using the set_staff method).
Returns#
- pandas.DataFrame
A DataFrame containing comments for the project
- get_disambiguated_subject_id(subject_id: int) Union[List, int] #
Retrieves the disambiguated subject ID for a given subject ID.
The method returns the disambiguated subject ID associated with the provided subject ID. It raises a RuntimeError if the subjects have not been disambiguated using the zoonyper.project.Project.disambiguate_subjects() method.
New in version 0.1.0.
Parameters#
- subject_idint
The subject ID for which the disambiguated subject ID is to be retrieved.
Raises#
- RuntimeError
If the subjects have not been disambiguated using the zoonyper.project.Project.disambiguate_subjects() method.
Returns#
- Union[List, int]
The disambiguated subject ID associated with the provided subject ID. If the subject ID is not found in the subjects DataFrame, returns 0. If multiple disambiguated subject IDs are associated with the provided subject ID, returns a list of unique disambiguated subject IDs.
- get_subject_comments(subject_id, include_staff: bool = True) DataFrame #
Returns all comments made on a specific subject by users, including staff if specified.
New in version 0.1.0.
Parameters#
- subject_idint or str
The ID of the subject you want to retrieve comments for
- include_staffbool
Whether or not to include comments made by staff members. If False, only comments made by non-staff users will be returned.
Returns#
- pandas.DataFrame
A DataFrame containing all comments made on the specified subject
- get_subject_paths(downloads_directory: str = '', organize_by_workflow: bool = True, organize_by_subject_id: bool = True)#
Retrieves a list of file paths for all subjects, organized by workflow and/or subject ID as specified.
The method generates a list of file paths for each subject based on the given organization options. It can take into account the workflow ID, subject ID, and subject file name to create the file paths.
New in version 0.1.0.
Parameters#
- downloads_directorystr, optional
The file path to the directory where the subjects are downloaded. If not specified (default), the Project’s download_dir property will be used.
- organize_by_workflowbool, optional
If True (default), organizes the subject file paths by including the workflow ID in the path.
- organize_by_subject_idbool, optional
If True (default), organizes the subject file paths by including the subject ID (name) in the path.
Returns#
- list
A list of file paths for all subjects, organized based on the specified options.
- get_thumbnail_url(image_url: str) str #
Get the thumbnail URL for the given image URL.
Parameters#
- image_urlstr
URL to get the thumbnail URL for.
Returns#
- str
Thumbnail URL.
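A thumbnail URL could be built by appending the image location to the thumbnails_url base from the Project constructor. The sketch below assumes that the scheme is stripped from the image URL before it is appended; that scheme, the function name, and the example image URL are assumptions for illustration and may differ from what the library does.

```python
def build_thumbnail_url(
    image_url: str,
    thumbnails_url: str = "https://thumbnails.zooniverse.org/100x100/",
) -> str:
    """Hypothetical helper: prefix an image location with the thumbnail base."""
    # Drop "https://" (or any other scheme) from the original image URL.
    without_scheme = image_url.split("://", 1)[-1]
    return thumbnails_url.rstrip("/") + "/" + without_scheme


url = build_thumbnail_url("https://example.org/subject_location/abc.jpeg")
```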
- get_workflow_timelines(include_active: bool = True) list #
Get the start and end dates for each workflow, and indicate whether they are active.
New in version 0.1.0.
Parameters#
- include_activebool
A boolean indicating whether active workflows should be included (True) or only inactive workflows (False). Defaults to True.
Returns#
- list of dictionaries
A list of dictionaries, where each dictionary has keys: “workflow_id”, “start_date”, “end_date”, and “active”, which describe each workflow.
- property inactive_workflow_ids: list#
Get a sorted list of unique workflow IDs, marked as inactive in the project’s workflows DataFrame.
New in version 0.1.0.
Returns#
- list
A list of inactive workflow IDs.
- load_frame(name: str) DataFrame #
Load the raw dataframe specified by name and return it. If it is not loaded yet, then load it first and store it in the _raw_frames dictionary of the instance.
New in version 0.1.0.
Parameters#
- namestr
The name of the raw dataframe to load.
Returns#
- pandas.DataFrame
The loaded raw dataframe specified by name.
Raises#
- SyntaxError
If name is not one of the valid raw dataframe names.
- logged_in(workflow_id: Optional[int] = None) Union[Dict, int] #
Get a count of the number of participants who were logged in to the Zooniverse platform at the time of their classification in a project’s workflows.
If a workflow_id is passed, this method returns the count of unique participants who were logged in at the time of classification in that workflow.
If no workflow_id is passed, this method returns a dictionary where each key is a workflow ID and the corresponding value is the count of unique participants who were logged in at the time of classification in that workflow.
To count the total number of participants who made classifications in a given workflow or across the project, use the participants_count method.
See also
zoonyper.project.Project.participants_count() for the general count of participants in a given workflow or across the project.
New in version 0.1.0.
Parameters#
- workflow_idint, optional
The ID of a specific workflow for which to get the logged-in participant count. If not specified, the logged-in participant count for each workflow is returned.
Raises#
- RuntimeError
If the specified workflow_id is not recorded in the project.
Returns#
- Union[Dict, int]
If a workflow_id was passed, this method returns an integer value describing the logged-in participant count for that workflow. If no workflow_id was passed, a dictionary with workflow IDs as keys and their corresponding logged-in participant counts as values is returned.
- participants(workflow_id: int, by_workflow: bool = False) Union[dict, list] #
Return a list of logged-in participant names who contributed to the specified workflow.
New in version 0.1.0.
Parameters#
- workflow_idint
The ID of the workflow to retrieve participants for.
- by_workflowbool, optional
Whether to return a dictionary with all workflows and their participants or just a list of participants for the specified workflow, by default False.
Returns#
- Union[dict, list]
If by_workflow is False (default), return a list of all logged-in participants for the specified workflow. If by_workflow is True, return a dictionary with workflow IDs as keys and a list of logged-in participants as values.
- participants_count(workflow_id: Optional[int] = None) Union[Dict, int] #
Get a count of the number of participants who have made classifications in a project’s workflows.
If a workflow_id is passed, this method returns the count of unique participants who made classifications in that workflow.
If no workflow_id is passed, this method returns a dictionary where each key is a workflow ID and the corresponding value is the count of unique participants who made classifications in that workflow.
To count the number of participants who were logged in at the time of their classification in a given workflow or across the project, use the logged_in method.
See also
zoonyper.project.Project.logged_in() for the count of participants who were logged in at the time of their classification in a given workflow or across the project.
New in version 0.1.0.
Parameters#
- workflow_idint, optional
The ID of a specific workflow for which to get the participant count. If not specified, the participant count for each workflow is returned.
Raises#
- RuntimeError
If the specified workflow_id is not recorded in the project.
Returns#
- Union[Dict, int]
If a workflow_id was passed, this method returns an integer value describing the participant count for that workflow. If no workflow_id was passed, a dictionary with workflow IDs as keys and their participant counts as values is returned.
- plot_classifications(workflow_id: Union[int, str] = '', width: int = 15, height: int = 5) Figure #
Renders the classifications for the project or a particular workflow (if workflow_id is provided) as a matplotlib.figure.Figure.
See also
zoonyper.project.Project.get_classifications_for_workflow_by_dates(), the method that is used to generate the growth of classifications. The workflow_id passed to the plot_classifications method is passed on as-is.
New in version 0.1.0.
Parameters#
- workflow_idint
Any workflow ID from the project, which can be used for filtering
- widthint
Figure width in inches, default: 15
- heightint
Figure height in inches, default: 5
Raises#
- SyntaxError
If width or height are not provided as integers
Returns#
- matplotlib.figure.Figure
A line plot showing the growth of classifications
- set_staff(staff: list) None #
Set the staff members for the project.
New in version 0.1.0.
Parameters#
- stafflist[str]
A list of staff member usernames.
Returns#
None
- property subject_sets: dict#
Returns a dictionary of the project’s subject sets.
New in version 0.1.0.
Returns#
- dict
A dictionary of subject sets, where the key is the subject set ID and the value is a list of subjects contained within that subject set.
- property subject_urls#
Retrieves a dictionary of all subject URLs matched with the subject’s ID.
The property compiles a list of URLs from the subjects DataFrame’s locations column. If the _subject_urls attribute is empty, the method first ensures that the subjects DataFrame is set up.
New in version 0.1.0.
Returns#
- list
A list of all subject URLs from the locations column of the subjects DataFrame.
- property subjects: DataFrame#
Returns a preprocessed DataFrame of the project’s subjects.
Issues a warning for non-disambiguated subjects if the zoonyper.project.Project.disambiguate_subjects() method has not been run on the project instance.
New in version 0.1.0.
Returns#
- pandas.DataFrame
A preprocessed DataFrame of subjects.
Notes#
The returned DataFrame is generated from some of the raw subjects CSV data by performing the following steps:
Extracting the nested metadata JSON from the metadata column
Joining the extracted metadata back on the subjects DataFrame
Dropping the metadata column
Preprocessing date columns using the _preprocess method
The DataFrame is cached to reduce load times on subsequent calls.
- property tags: DataFrame#
Returns a preprocessed DataFrame of the project’s tags.
New in version 0.1.0.
Returns#
- pandas.DataFrame
A preprocessed DataFrame of tags.
Notes#
The returned DataFrame is generated from the raw tags JSON data by performing the following steps:
Joining the tags and comments frames
Dropping duplicate information from the tags frame
Preprocessing date columns using the _preprocess method
The DataFrame is cached to reduce load times on subsequent calls.
- property workflow_ids: list#
Provides a list of the workflow IDs associated with the project.
See also
zoonyper.project.Project.workflows, the attribute containing a pandas.DataFrame of all the workflows associated with the project, for which this method returns the index.
New in version 0.1.0.
Returns#
- list
List of workflow IDs
- workflow_subjects(workflow_id: int) list #
Return a list of the subject IDs for a specific workflow ID.
New in version 0.1.0.
Parameters#
- workflow_idint
The ID of the workflow for which you want to get the subject IDs.
Raises#
- RuntimeError
If the workflow_id argument is not an integer.
Returns#
- list
A list of subject IDs for the given workflow ID.
Utils#
- class zoonyper.utils.Utils#
Superclass to Project, i.e. all the methods in this class are inherited by the Project class. This impacts the use of some methods, see for example redact_username().
- MAX_SIZE_OBSERVABLE#
Constant defining the maximum file size for Observable export. Default: 50MB, the maximum size for files on Observable. It can be set to another value should Observable allow larger files.
New in version 0.1.0.
- static camel_case(string: str) str #
Converts any string to camel case.
New in version 0.1.0.
Parameters#
- stringstr
String to make into camel case.
Returns#
- str
String camel cased.
Notes#
Adapted from http://bit.ly/3yXqKs2.
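A camel-case conversion can be sketched with a regular expression. The function below is an independent illustration rather than the library's code: the name `to_camel_case` is hypothetical, and producing lower camel case (as in “camelCase”) is an assumption about the intended output.

```python
import re


def to_camel_case(string: str) -> str:
    """Hypothetical helper: turn 'workflow_id' or 'subject ids' into camelCase."""
    # Split on anything that is not a letter or a digit.
    words = [word for word in re.split(r"[^0-9A-Za-z]+", string) if word]
    if not words:
        return ""
    first, *rest = words
    return first.lower() + "".join(word.capitalize() for word in rest)
```

Under these assumptions, "workflow_id" becomes "workflowId" and "subject ids" becomes "subjectIds".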
- export(df: DataFrame, filename: str = '', filter_workflows: Optional[list] = None, drop_columns: Optional[list] = None) None #
Export a pandas DataFrame to a CSV file with optional filtering and column removal. If a column contains only one unique value, it will not be exported to save space.
New in version 0.1.0.
Parameters#
- dfpandas.DataFrame
The input DataFrame to be exported.
- filenamestr
The output CSV file name. Note: Should have the file suffix .csv.
- filter_workflowslist, optional
A list of Zooniverse workflow IDs to filter the DataFrame before exporting. Default is None, which means no filtering.
- drop_columnslist, optional
A list of column names to be removed from the DataFrame before exporting. Default is None, which means no removal.
Returns#
- None
The method exports the DataFrame to a CSV file and doesn’t return any value.
Raises#
- RuntimeError
If the required filename parameter is missing or if the first parameter is not a pandas DataFrame.
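The space-saving rule mentioned above, skipping columns that contain only one unique value, can be illustrated with the csv module. The helper names are hypothetical and the sketch works on lists of dictionaries rather than a pandas DataFrame.

```python
import csv
import io
from typing import Dict, List


def drop_constant_columns(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Hypothetical helper: keep only columns with more than one distinct value."""
    if not rows:
        return rows
    keep = [column for column in rows[0] if len({row[column] for row in rows}) > 1]
    return [{column: row[column] for column in keep} for row in rows]


def to_csv(rows: List[Dict[str, str]]) -> str:
    """Serialise the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    {"id": "1", "project": "x", "label": "a"},
    {"id": "2", "project": "x", "label": "b"},
]
csv_text = to_csv(drop_constant_columns(rows))
```

The "project" column holds the same value in every row, so it is left out of `csv_text`.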
- export_annotations_flattened(filename: str = 'annotations_flattened.csv', filter_workflows: List = [], drop_columns: List = []) None #
Attempts to compress the project instance’s flattened annotations and exports them into CSV format.
New in version 0.1.0.
Parameters#
- filenamestr, optional
The output CSV file name. Note: Should have the file suffix .csv. Defaults to annotations_flattened.csv.
- filter_workflowslist, optional
A list of Zooniverse workflow IDs to filter the annotations before exporting. Default is an empty list, which means no filtering.
- drop_columnslist, optional
A list of column names to be removed from the annotations DataFrame before exporting. Default is an empty list, which means no removal.
Returns#
- None
The method exports the annotations DataFrame to a CSV file and doesn’t return any value.
Raises#
- RuntimeError
If the required filename parameter is missing.
- export_classifications(filename: str = 'classifications.csv', filter_workflows: List = [], drop_columns: List = []) None #
Attempts to compress the project instance’s classifications and exports them into CSV format.
New in version 0.1.0.
Parameters#
- filenamestr, optional
The output CSV file name. Note: Should have the file suffix .csv. Defaults to classifications.csv.
- filter_workflowslist, optional
A list of Zooniverse workflow IDs to filter the classifications before exporting. Default is an empty list, which means no filtering.
- drop_columnslist, optional
A list of column names to be removed from the classifications DataFrame before exporting. Default is an empty list, which means no removal.
Returns#
- None
The method exports the classifications DataFrame to a CSV file and doesn’t return any value.
Raises#
- RuntimeError
If the required filename parameter is missing.
- export_observable(directory: str = 'output') None #
Export the processed classifications and annotations data to the specified directory as CSV files, fit for uploading to ObservableHQ. Before exporting, it converts column names to camel case (camelCase). Finally, it checks if the output files exceed the allowed size and logs a warning if they do.
New in version 0.1.0.
Parameters#
- directory, str, optional
The output directory path where the CSV files will be saved. Default is "output".
Returns#
- None
The method exports the data to CSV files and doesn’t return any value.
- redact_username(username: str) Optional[str] #
Returns the SHA-256 hash of any given string (results are cached to speed up repeated calls).
As Utils is inherited by Project, the method is accessible on any Project instance (and through self inside the class). Here is an example:
project = Project("<path>")
project.classifications.user_name.apply(project.redact_username)
New in version 0.1.0.
Parameters#
- usernamestr
The username that you want to encode
Returns#
- str, optional
Username that is encoded to not be clear to human eyes
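Hashing with a cache can be sketched in a few lines using hashlib and functools. This is a standalone illustration, not Zoonyper's implementation: the function name `redact` is hypothetical, and the use of lru_cache and UTF-8 encoding are assumptions.

```python
import hashlib
from functools import lru_cache


@lru_cache(maxsize=None)
def redact(username: str) -> str:
    """Hypothetical helper: return the SHA-256 hex digest of a user name."""
    # The cache means each distinct user name is hashed only once.
    return hashlib.sha256(username.encode("utf-8")).hexdigest()
```

The same user name always maps to the same 64-character hexadecimal string, so redacted names can still be used to group classifications by participant.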
- zoonyper.utils.get_current_dir(download_dir: str, organize_by_workflow: bool, organize_by_subject_id: bool, workflow_id: int = 0, subject_id: int = 0) Path #
Generate a Path object representing the current directory for storing downloaded files based on the specified organization options. The function can organize files by workflow, by subject ID, or both.
New in version 0.1.0.
Parameters#
- download_dirstr
The base download directory path.
- organize_by_workflowbool
If True, organize files in subdirectories named after their respective workflow IDs.
- organize_by_subject_idbool
If True, organize files in subdirectories named after their respective subject IDs.
- workflow_idint, optional
The workflow ID to be used when organizing files by workflow. Default is 0.
- subject_idint, optional
The subject ID to be used when organizing files by subject ID. Default is 0.
Returns#
- pathlib.Path
The Path object representing the current directory based on the organization options.
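The directory layout described above can be sketched with pathlib. The nesting order (workflow directory first, then subject directory) and the function name `build_current_dir` are assumptions for illustration; the documentation only states that subdirectories are named after the respective IDs.

```python
from pathlib import Path


def build_current_dir(
    download_dir: str,
    organize_by_workflow: bool,
    organize_by_subject_id: bool,
    workflow_id: int = 0,
    subject_id: int = 0,
) -> Path:
    """Hypothetical helper: compose the target directory for one subject."""
    directory = Path(download_dir)
    if organize_by_workflow:
        # Assumed layout: <download_dir>/<workflow_id>/...
        directory = directory / str(workflow_id)
    if organize_by_subject_id:
        # Assumed layout: .../<subject_id>
        directory = directory / str(subject_id)
    return directory
```

With both options enabled, workflow 12 and subject 345 under "downloads" resolve to the path downloads/12/345; with both disabled, the base directory is returned unchanged.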