t_res.utils.deezy_processing module

t_res.utils.deezy_processing.obtain_matches(word: str, english_words: List[str], sims: List[str], fuzz_ratio_threshold: Optional[Union[float, int]] = 70) → Tuple[List[str], List[str]]

Classifies the top 100 nearest neighbors of the given word into positive and negative matches (or discards them).

Parameters:
  • word (str) – The input word.

  • english_words (list) – A list of English words as strings.

  • sims (list) – The list of 100 nearest neighbors from the OCR word2vec model.

  • fuzz_ratio_threshold (float) – The threshold used for thefuzz.fuzz.ratio <https://github.com/seatgeek/thefuzz#simple-ratio>. If the nearest neighbor word is an existing English word and the string similarity is below fuzz_ratio_threshold, it is considered a negative match, i.e. not an OCR variation. Defaults to 70.

Returns:

A tuple that contains two lists:

  1. The first list consists of positive matches for the input word.

  2. The second list consists of negative matches for the input word.

Return type:

Tuple[List[str], List[str]]
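
The classification rule described above can be sketched as follows. This is a minimal illustration, not the T-Res implementation: classify_neighbours and similarity are hypothetical names, difflib.SequenceMatcher from the standard library stands in for thefuzz.fuzz.ratio, and the rule for positive matches (a neighbour that is not an English word and scores at or above the threshold) is an assumption, since only the negative rule is stated in the parameter description.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score (stand-in for thefuzz.fuzz.ratio)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def classify_neighbours(
    word: str,
    english_words: List[str],
    sims: List[str],
    fuzz_ratio_threshold: float = 70,
) -> Tuple[List[str], List[str]]:
    """Hypothetical sketch: split neighbours into positive and negative matches."""
    english = set(english_words)  # set lookup is faster than list lookup
    positive, negative = [], []
    for candidate in sims:
        score = similarity(word, candidate)
        if candidate in english:
            # A dissimilar English word is a different word, not an OCR variant.
            if score < fuzz_ratio_threshold:
                negative.append(candidate)
        elif score >= fuzz_ratio_threshold:
            # Assumption: a similar non-dictionary token is an OCR variation.
            positive.append(candidate)
        # Everything else is discarded.
    return positive, negative


positives, negatives = classify_neighbours(
    "london",
    english_words=["london", "paris"],
    sims=["lonclon", "paris", "londen", "xyzzy", "london"],
)
```

In this sketch "xyzzy" is discarded because it is neither an English word nor similar to the input, and "london" itself is discarded because it is an English word whose similarity is not below the threshold.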

t_res.utils.deezy_processing.create_training_set(deezy_parameters: dict, strvar_parameters: dict, wikidata_to_mentions: dict) → None

Create a training set for DeezyMatch consisting of positive and negative string matches.

Given a word2vec model trained on OCR data and a list of words in the English language, this function creates a training set for DeezyMatch. The training set contains pairs of strings, where a positive match is an English word and its OCR variation (obtained from the top N word2vec neighbours and filtered by string similarity), and a negative match is an English word and a randomly selected OCR token.

Parameters:
  • deezy_parameters (dict) – Dictionary of DeezyMatch parameters for model training.

  • strvar_parameters (dict) – Dictionary of string variation parameters required to create a DeezyMatch training dataset.

  • wikidata_to_mentions (dict) – Mapping between Wikidata IDs and mentions.

Returns:

None.

Note

This function creates a new file with the string pairs dataset, called w2v_ocr_pairs.txt, inside the folder defined as dm_path in the DeezyMatch parameters (deezy_parameters).
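
As a rough illustration of what such a string pairs file could look like, the sketch below writes positive and negative pairs to w2v_ocr_pairs.txt. The helper name write_pairs, the shape of the matches dictionary, and the tab-separated layout with a TRUE/FALSE label are assumptions made for the example; they are not taken from the T-Res source.

```python
import os
import tempfile
from typing import Dict, List, Tuple


def write_pairs(dm_path: str, matches: Dict[str, Tuple[List[str], List[str]]]) -> str:
    """Hypothetical helper: write one labelled string pair per line."""
    os.makedirs(dm_path, exist_ok=True)
    out_path = os.path.join(dm_path, "w2v_ocr_pairs.txt")
    with open(out_path, "w", encoding="utf-8") as fh:
        for word, (positives, negatives) in matches.items():
            for variant in positives:
                # Assumed layout: string1 <TAB> string2 <TAB> label
                fh.write(f"{word}\t{variant}\tTRUE\n")
            for other in negatives:
                fh.write(f"{word}\t{other}\tFALSE\n")
    return out_path


# Usage: a temporary directory stands in for dm_path.
with tempfile.TemporaryDirectory() as tmp:
    path = write_pairs(tmp, {"london": (["londen", "lonclon"], ["paris"])})
    with open(path, encoding="utf-8") as fh:
        lines = fh.read().splitlines()
```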

t_res.utils.deezy_processing.train_deezy_model(deezy_parameters: dict, strvar_parameters: dict, wikidata_to_mentions: dict) → None

Train a DeezyMatch model using the provided DeezyMatch parameters and input files.

This function trains a DeezyMatch model based on the parameters specified in deezy_parameters and the required input files. If the overwrite_training parameter is set to True or the model does not exist, the function will train a new DeezyMatch model.

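Parameters:
  • deezy_parameters (dict) – Dictionary of DeezyMatch parameters for model training.

  • strvar_parameters (dict) – Dictionary of string variation parameters required to create a DeezyMatch training dataset.

  • wikidata_to_mentions (dict) – Mapping between Wikidata IDs and mentions.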

Returns:

None

Note

This function does not return the trained model. The DeezyMatch model is stored in the location specified in the DeezyMatch input_dfm.yaml file.
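
The retraining condition described above (train when overwrite_training is True or when no model exists yet) can be sketched as a small guard. The function name needs_training and the idea of checking for a model directory on disk are assumptions made for illustration; the actual check in T-Res may differ.

```python
import tempfile
from pathlib import Path


def needs_training(model_dir: Path, overwrite_training: bool) -> bool:
    """Hypothetical guard: retrain if forced, or if no model is found."""
    return bool(overwrite_training) or not Path(model_dir).exists()


# Usage: one existing and one missing model directory.
with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / "model"
    existing.mkdir()
    missing = Path(tmp) / "no_model"
    results = (
        needs_training(existing, False),  # model present, no overwrite
        needs_training(existing, True),   # model present, overwrite forced
        needs_training(missing, False),   # no model yet
    )
```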

t_res.utils.deezy_processing.generate_candidates(deezy_parameters: dict, mentions_to_wikidata: dict) → None

Obtain Wikidata candidates (Wikipedia mentions to Wikidata entities) and generate their corresponding vectors.

This function retrieves Wikidata candidates based on the mentions in mentions_to_wikidata and generates their corresponding vectors using the DeezyMatch model. It writes the candidates to a file and generates embeddings with the DeezyMatch model.

Parameters:
  • deezy_parameters (dict) – Dictionary of DeezyMatch parameters for model training.

  • mentions_to_wikidata (dict) – Mapping between mentions and Wikidata IDs.

Returns:

None.

Note

The resulting candidate vectors are stored in the output directories specified in the DeezyMatch parameters (deezy_parameters).