t_res.utils.deezy_processing module
- t_res.utils.deezy_processing.obtain_matches(word: str, english_words: List[str], sims: List[str], fuzz_ratio_threshold: Optional[Union[float, int]] = 70) → Tuple[List[str], List[str]]
Classifies the top 100 nearest neighbors of the given word into positive and negative matches (or discards them).
- Parameters:
word (str) – The input word.
english_words (list) – A list of English words as strings.
sims (list) – The list of 100 nearest neighbors from the OCR word2vec model.
fuzz_ratio_threshold (float) – The threshold used for thefuzz.fuzz.ratio <https://github.com/seatgeek/thefuzz#simple-ratio>. If the nearest neighbor word is an existing English word and the string similarity is below fuzz_ratio_threshold, it is considered a negative match, i.e. not an OCR variation. Defaults to 70.
- Returns:
A tuple that contains two lists:
The first list consists of positive matches for the input word.
The second list consists of negative matches for the input word.
- Return type:
Tuple[List[str], List[str]]
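The classification rule described above can be sketched as follows. This is a minimal illustration and not the library's implementation: the names `ratio` and `classify_neighbours` are hypothetical, `difflib.SequenceMatcher` is used as a standard-library stand-in for thefuzz.fuzz.ratio (scores may differ slightly), and the rule for positive matches (a non-English token whose similarity reaches the threshold) is an assumption, since only the negative rule is stated in the parameter description.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def ratio(a: str, b: str) -> int:
    """Stand-in for thefuzz.fuzz.ratio: string similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def classify_neighbours(
    word: str,
    english_words: List[str],
    sims: List[str],
    fuzz_ratio_threshold: float = 70,
) -> Tuple[List[str], List[str]]:
    """Split nearest neighbours of `word` into (positive, negative) matches."""
    vocabulary = set(english_words)  # set lookup is faster than scanning a list
    positive: List[str] = []
    negative: List[str] = []
    for sim in sims:
        score = ratio(sim, word)
        if sim in vocabulary and score < fuzz_ratio_threshold:
            # Documented rule: an existing English word that is dissimilar
            # to `word` is a negative match (not an OCR variation).
            negative.append(sim)
        elif sim not in vocabulary and score >= fuzz_ratio_threshold:
            # Assumed rule: a non-English token that is similar to `word`
            # is treated as an OCR variation, i.e. a positive match.
            positive.append(sim)
        # Every other neighbour is discarded.
    return positive, negative


pos, neg = classify_neighbours(
    "london",
    english_words=["london", "paris", "bridge"],
    sims=["londoa", "paris", "xyzzy", "bridge"],
)
print(pos, neg)
```

In this toy input, "xyzzy" is discarded because it is neither an English word nor similar to the input word.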
- t_res.utils.deezy_processing.create_training_set(deezy_parameters: dict, strvar_parameters: dict, wikidata_to_mentions: dict) → None
Create a training set for DeezyMatch consisting of positive and negative string matches.
Given a word2vec model trained on OCR data and a list of words in the English language, this function creates a training set for DeezyMatch. The training set contains pairs of strings, where a positive match is an English word and its OCR variation (obtained from the top N word2vec neighbours and filtered by string similarity), and a negative match is an English word and a randomly selected OCR token.
- Parameters:
- Returns:
None.
Note
This function creates a new file containing the string pairs dataset, called w2v_ocr_pairs.txt, inside the folder defined as dm_path in the DeezyMatch parameters used to set up the ranker (referred to here as myranker).
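The pairing scheme described above (an English word with its OCR variation as a positive pair, and with a randomly selected OCR token as a negative pair) can be sketched as below. This is an illustration under stated assumptions, not the function's actual code: `build_pairs` is a hypothetical helper, the tab-separated `string1 / string2 / TRUE|FALSE` line layout is assumed to be the pair format DeezyMatch expects, and drawing one negative per positive is an arbitrary choice made for the example.

```python
import random
from typing import Dict, List


def build_pairs(
    positives: Dict[str, List[str]],
    ocr_tokens: List[str],
    seed: int = 0,
) -> List[str]:
    """Return tab-separated string pairs labelled TRUE (match) or FALSE."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    lines: List[str] = []
    for word, variants in positives.items():
        # Tokens that are neither the word itself nor one of its variants
        # are eligible as random negatives.
        candidates = [t for t in ocr_tokens if t != word and t not in variants]
        for variant in variants:
            # Positive pair: English word and one of its OCR variations.
            lines.append(f"{word}\t{variant}\tTRUE")
            if candidates:
                # Negative pair: the same word and a randomly drawn OCR token.
                lines.append(f"{word}\t{rng.choice(candidates)}\tFALSE")
    return lines


lines = build_pairs({"london": ["londoa"]}, ["paris", "londoa", "xyzzy"])
print("\n".join(lines))
```

Each returned line could then be written to a file such as w2v_ocr_pairs.txt, one pair per line.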
- t_res.utils.deezy_processing.train_deezy_model(deezy_parameters: dict, strvar_parameters: dict, wikidata_to_mentions: dict) → None
Train a DeezyMatch model using the provided myranker parameters and input files.
This function trains a DeezyMatch model based on the parameters specified in the myranker object and the required input files. If the overwrite_training parameter is set to True or the model does not exist, the function will train a new DeezyMatch model.
- Parameters:
deezy_parameters (dict) – Dictionary of DeezyMatch parameters for model training.
- Returns:
None.
Note
The trained DeezyMatch model is not returned by this function; it is stored in the location specified in the DeezyMatch input_dfm.yaml file.
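The retraining condition described above (train when overwrite_training is True or when no model exists yet) can be sketched as a small check. The helper name `needs_training` and the idea of testing for a model directory are assumptions made for illustration; the actual function may locate the model differently.

```python
import tempfile
from pathlib import Path


def needs_training(model_dir: Path, overwrite_training: bool) -> bool:
    """Return True when a new model should be trained."""
    # Train if explicitly requested, or if no model directory exists yet.
    return overwrite_training or not model_dir.exists()


with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp)                      # stands in for a trained model
    missing = existing / "not_trained_yet"    # no model at this path
    decisions = (
        needs_training(missing, overwrite_training=False),
        needs_training(existing, overwrite_training=False),
        needs_training(existing, overwrite_training=True),
    )

print(decisions)
```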
- t_res.utils.deezy_processing.generate_candidates(deezy_parameters: dict, mentions_to_wikidata: dict) → None
Obtain Wikidata candidates (Wikipedia mentions to Wikidata entities) and generate their corresponding vectors.
This function retrieves Wikidata candidates based on the mentions stored in the myranker object and generates their corresponding vectors using the DeezyMatch model. It writes the candidates to a file and generates embeddings with the DeezyMatch model.
- Parameters:
- Returns:
None.
Note
The function saves the candidates to a file and generates embeddings using the DeezyMatch model. The resulting vectors are stored in the output directories specified in the DeezyMatch parameters of the ranker (referred to here as myranker).