t_res.geoparser.ranking.Ranker
- class t_res.geoparser.ranking.Ranker(method: Literal['perfectmatch', 'partialmatch', 'levenshtein', 'deezymatch'], resources_path: str, mentions_to_wikidata: Optional[dict] = {}, wikidata_to_mentions: Optional[dict] = {}, strvar_parameters: Optional[dict] = None, deezy_parameters: Optional[dict] = None, already_collected_cands: Optional[dict] = {})
The Ranker class implements a system for candidate selection through string variation ranking. It provides methods to select candidates based on different matching approaches, such as perfect match, partial match, Levenshtein distance, and DeezyMatch. The class also handles loading and processing of resources related to candidate selection.
- Parameters:
method (str) – The candidate selection and ranking method to use.
resources_path (str) – Relative path to the resources directory (containing Wikidata resources).
mentions_to_wikidata (dict, optional) – An empty dictionary which will store the mapping between mentions and Wikidata IDs, which will be loaded through the load_resources() method.
wikidata_to_mentions (dict, optional) – An empty dictionary which will store the mapping between Wikidata IDs and mentions, which will be loaded through the load_resources() method.
strvar_parameters (dict, optional) – Dictionary of string variation parameters required to create a DeezyMatch training dataset. For the default settings, see Notes below.
deezy_parameters (dict, optional) – Dictionary of DeezyMatch parameters for model training. For the default settings, see Notes below.
already_collected_cands (dict, optional) – Dictionary of already collected candidates. Defaults to dict() (an empty dictionary).
Example
>>> # Create a Ranker object:
>>> ranker = Ranker(
...     method="perfectmatch",
...     resources_path="/path/to/resources/",
... )
>>> # Load resources
>>> ranker.mentions_to_wikidata = ranker.load_resources()
>>> # Train the ranker (if applicable)
>>> ranker.train()
>>> # Perform candidate selection
>>> queries = ['London', 'Paraguay']
>>> candidates, already_collected = ranker.run(queries)
>>> # Find candidates for mentions
>>> mentions = [{'mention': 'London'}, {'mention': 'Paraguay'}]
>>> mention_candidates, mention_already_collected = ranker.find_candidates(mentions)
>>> # Print the results
>>> print("Candidate Selection Results:")
>>> print(candidates)
>>> print(already_collected)
>>> print("Find Candidates Results:")
>>> print(mention_candidates)
>>> print(mention_already_collected)
Note
The default settings for strvar_parameters:
strvar_parameters: Optional[dict] = {
    # Parameters to create the string pair dataset:
    "ocr_threshold": 60,
    "top_threshold": 85,
    "min_len": 5,
    "max_len": 15,
    "w2v_ocr_path": str(Path("resources/models/w2v/").resolve()),
    "w2v_ocr_model": "w2v_*_news",
    "overwrite_dataset": False,
}
The default settings for deezy_parameters:
deezy_parameters: Optional[dict] = {
    "dm_path": str(Path("resources/deezymatch/").resolve()),
    "dm_cands": "wkdtalts",
    "dm_model": "w2v_ocr",
    "dm_output": "deezymatch_on_the_fly",
    "ranking_metric": "faiss",
    "selection_threshold": 50,
    "num_candidates": 1,
    "verbose": False,
    "overwrite_training": False,
    "do_test": False,
}
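When only one or two DeezyMatch settings need changing, it can help to start from the documented defaults and override selected keys. The following is a minimal sketch, assuming that a dictionary passed to the initialiser replaces the defaults wholesale rather than being merged with them (this is not confirmed by the text above). The helper name with_overrides is hypothetical and not part of T-Res.

```python
from pathlib import Path

# Defaults copied from the Note above.
DEFAULT_DEEZY_PARAMETERS = {
    "dm_path": str(Path("resources/deezymatch/").resolve()),
    "dm_cands": "wkdtalts",
    "dm_model": "w2v_ocr",
    "dm_output": "deezymatch_on_the_fly",
    "ranking_metric": "faiss",
    "selection_threshold": 50,
    "num_candidates": 1,
    "verbose": False,
    "overwrite_training": False,
    "do_test": False,
}


def with_overrides(defaults: dict, **overrides) -> dict:
    """Hypothetical helper: copy `defaults`, replacing only the given keys."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        # Catch misspelt keys early instead of silently adding new ones.
        raise KeyError(f"Unknown parameter(s): {sorted(unknown)}")
    return {**defaults, **overrides}


deezy_parameters = with_overrides(
    DEFAULT_DEEZY_PARAMETERS, num_candidates=3, verbose=True
)
```

The resulting dictionary contains every default key, so it could be passed as deezy_parameters when creating the Ranker.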
- check_if_contained(query: str, row: Series) float
Returns the amount of overlap, if a mention is contained within a row in the dataset.
- Parameters:
query (str) – A mention identified in a text.
row (Series) – A pandas Series representing a row in the dataset with a “mentions” column, corresponding to a mention in the knowledge base.
- Returns:
The match score indicating the degree of containment, ranging from 0.0 to 1.0 (perfect match).
- Return type:
float
Example
>>> import pandas as pd
>>> ranker = Ranker(...)
>>> query = 'apple'
>>> row = pd.Series({'mentions': 'Delicious apple'})
>>> match_score = ranker.check_if_contained(query, row)
>>> print(match_score)
0.3333333333333333
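The score in the example above is consistent with dividing the length of the query by the length of the string that contains it (5 / 15). The following is a minimal sketch of that idea on plain strings, assuming case-insensitive comparison. It is not the library's implementation, the function name containment_score is hypothetical, and whether the library also scores the reverse case (the knowledge base mention contained in the query) is not stated here.

```python
def containment_score(query: str, mention: str) -> float:
    """Share of `mention` covered by `query`, or 0.0 if it is not contained."""
    q, m = query.lower(), mention.lower()
    if not q or not m:
        return 0.0
    if q in m:
        # Identical strings give 1.0; shorter queries give smaller scores.
        return len(q) / len(m)
    return 0.0


print(containment_score("apple", "Delicious apple"))  # 5 characters out of 15
```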
- damlev_dist(query: str, row: Series) float
Calculate the Damerau-Levenshtein distance between a mention and a row in the dataset.
- Parameters:
query (str) – A mention identified in a text.
row (Series) – A pandas Series representing a row in the dataset with a “mentions” column, corresponding to an alternate name of an entity in the knowledge base.
- Returns:
The similarity score between the query and the row, ranging from 0.0 to 1.0.
- Return type:
float
Note
This method computes the Damerau-Levenshtein distance between the lowercase versions of a mention and the “mentions” column value in the given row.
The normalized distance is then converted to a similarity score by subtracting it from 1.0.
Example
>>> import pandas as pd
>>> ranker = Ranker(...)
>>> query = 'apple'
>>> row = pd.Series({'mentions': 'orange'})
>>> similarity = ranker.damlev_dist(query, row)
>>> print(similarity)
0.1666666865348816
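To make the normalisation concrete, here is a minimal pure-Python sketch. It assumes the restricted (optimal string alignment) variant of the Damerau-Levenshtein distance and normalisation by the length of the longer string. Under those assumptions it gives 1 - 5/6 for 'apple' and 'orange', which agrees with the example value up to single-precision rounding. The function names are hypothetical and the library itself may rely on a different implementation.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all characters of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all characters of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def damlev_similarity(query: str, mention: str) -> float:
    """1.0 minus the length-normalised distance of the lowercased strings."""
    q, m = query.lower(), mention.lower()
    longest = max(len(q), len(m))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - osa_distance(q, m) / longest


print(damlev_similarity("apple", "orange"))
```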
- deezy_on_the_fly(queries: List[str]) Tuple[dict, dict]
Perform DeezyMatch (a deep neural network approach to fuzzy string matching) on-the-fly for a list of given mentions (queries).
- Parameters:
queries (list) – A list of mentions (strings) identified in a text to match.
- Returns:
A tuple containing two dictionaries:
The first dictionary maps each mention to its candidate list, where the candidate list is a dictionary with the mention variations as keys and their match scores as values.
The second dictionary stores the already collected candidates for each mention. It is an updated version of the Ranker’s already_collected_cands attribute.
- Return type:
Tuple[dict, dict]
Example
>>> ranker = Ranker(...)
>>> ranker.load_resources()
>>> queries = ['London', 'Shefrield']
>>> candidates, already_collected = ranker.deezy_on_the_fly(queries)
>>> print(candidates)
{'London': {'London': 1.0}, 'Shefrield': {'Sheffield': 0.03382000000000005}}
>>> print(already_collected)
{'London': {'London': 1.0}, 'Shefrield': {'Sheffield': 0.03382000000000005}}
Note
This method performs DeezyMatch on-the-fly for each mention in a given list of mentions identified in a text. If a query has already been matched perfectly, it skips the fuzzy matching process for that query. For the remaining queries, it uses the DeezyMatch model to generate candidates and ranks them based on the specified ranking metric and selection threshold, provided when initialising the Ranker() object.
- find_candidates(mentions: List[dict]) Tuple[dict, dict]
Find candidates for the given mentions using the selected ranking method.
- Parameters:
mentions (list) – A list of predicted mentions as dictionaries.
- Returns:
A tuple containing two dictionaries:
The first dictionary maps each original mention to a sub-dictionary, where the sub-dictionary maps the mention variations to a sub-sub-dictionary with two keys: "Score" (the string matching similarity score) and "Candidates" (a dictionary containing the Wikidata candidates, where the key is the Wikidata ID and the value is the relative mention-to-wikidata frequency).
The second dictionary stores the already collected candidates for each query.
The variation is found by the candidate ranker in the knowledge base, and for each variation, the candidate ranking score and the candidates from Wikidata are provided. E.g. for mention “Guadaloupe” in sentence “sn83030483-1790-03-31-a-i0004_1”, the candidates will show as follows:
{
    "Guadaloupe": {
        "Score": 1.0,
        "Candidates": {
            "Q17012": 0.003935458480913026,
            "Q3153836": 0.07407407407407407
        }
    }
}
- Return type:
Tuple[dict, dict]
Note
This method takes a list of mentions and finds candidates for each mention using the selected ranking method. It first extracts the queries from the mentions and then calls the appropriate method based on the ranking method chosen when initialising the Ranker() object.
The method returns a dictionary that maps each original mention to a sub-dictionary containing the mention variations as keys and their corresponding Wikidata match scores as values.
Additionally, it updates the already collected candidates dictionary (the Ranker object’s already_collected_cands attribute).
- load_resources() dict
Load the ranker resources.
- Returns:
The loaded mentions-to-wikidata dictionary, which maps a mention (e.g. "London") to the Wikidata entities that are referred to by this mention on Wikipedia (e.g. Q84, Q2477346). The data also includes, for each entity, its normalized “relevance”, i.e. number of in-links across Wikipedia.
- Return type:
dict
Note
This method loads the mentions-to-wikidata and wikidata-to-mentions dictionaries from the resources directory, specified when initialising the Ranker(). They are required for performing candidate selection and ranking.
It filters the dictionaries to remove noise and updates the class attributes accordingly.
The method also initialises pandarallel if needed by the candidate ranking method (if the method set in the initialiser of the Ranker was set to “partialmatch” or “levenshtein”).
- partial_match(queries: List[str], damlev: bool) Tuple[dict, dict]
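Perform partial matching for a list of given mentions (queries).
- Parameters:
queries (list) – A list of mentions (strings) identified in a text to match.
damlev (bool) – Whether to compute match scores with the Damerau-Levenshtein distance (True) or with string containment (False).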
- Returns:
A tuple containing two dictionaries:
The first dictionary maps each mention to its candidate list, where the candidate list is a dictionary with the mention variations as keys and their match scores as values.
The second dictionary stores the already collected candidates for each mention. It is an updated version of the Ranker’s already_collected_cands attribute.
- Return type:
Tuple[dict, dict]
Example
>>> ranker = Ranker(...)
>>> queries = ['apple', 'banana', 'orange']
>>> candidates, already_collected = ranker.partial_match(queries, damlev=False)
>>> print(candidates)
{'apple': {'apple': 1.0}, 'banana': {'bananas': 0.5, 'banana split': 0.75}, 'orange': {'orange': 1.0}}
>>> print(already_collected)
{'apple': {'apple': 1.0}, 'banana': {'bananas': 0.5, 'banana split': 0.75}, 'orange': {'orange': 1.0}}
Note
This method performs partial matching for each mention in the given list. If a mention has already been matched perfectly, it skips the partial matching process for that mention. For the remaining mentions, it calculates the match score based on the specified partial matching method: Levenshtein distance or containment.
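The control flow described in this note (reuse cached results, keep perfect matches, score only the rest) can be sketched as follows. This is a simplified illustration over a plain list of alternate names, with a containment-style scorer and a hypothetical top_n cut-off. The real method works on the loaded Wikidata resources and its selection rules may differ.

```python
def containment(query: str, name: str) -> float:
    """Toy scorer: share of `name` covered by `query` (0.0 if not contained)."""
    q, n = query.lower(), name.lower()
    return len(q) / len(n) if q and q in n else 0.0


def partial_match_sketch(queries, altnames, already_collected, top_n=2):
    """Return (candidates, already_collected) for a list of queries."""
    candidates = {}
    for query in queries:
        if query in already_collected:
            # Seen before: reuse the cached candidates.
            candidates[query] = already_collected[query]
            continue
        if query in altnames:
            # Perfect match: no partial matching needed.
            best = {query: 1.0}
        else:
            scored = [(name, containment(query, name)) for name in altnames]
            scored = [(name, score) for name, score in scored if score > 0.0]
            scored.sort(key=lambda pair: pair[1], reverse=True)
            best = dict(scored[:top_n])
        candidates[query] = best
        already_collected[query] = best
    return candidates, already_collected


altnames = ["London", "New London", "Londonderry", "Paris"]
candidates, cache = partial_match_sketch(["London", "Londo"], altnames, {})
print(candidates)
```

Passing the returned cache back in on a later call avoids re-scoring queries that were already processed.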
- perfect_match(queries: List[str]) Tuple[dict, dict]
Perform perfect matching between a provided list of mentions (queries) and the altnames in the knowledge base.
- Parameters:
queries (list) – A list of mentions (strings) identified in a text to match.
- Returns:
A tuple containing two dictionaries:
The first dictionary maps each mention to its candidate list, where the candidate list is a dictionary with the mention itself as the key and a perfect match score of 1.0 as the value.
The second dictionary stores the already collected candidates for each mention. It is an updated version of the Ranker’s already_collected_cands attribute.
- Return type:
Tuple[dict, dict]
Note
This method checks if each mention has an exact match in the mentions_to_wikidata dictionary. If a match is found, it assigns a perfect match score of 1.0 to the mention. Otherwise, an empty dictionary is assigned as the candidate list for the mention.
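The lookup described in this note amounts to a key test against the mentions_to_wikidata dictionary. The following is a minimal sketch on a toy dictionary. The function name is hypothetical, and caching only the successful matches is an assumption rather than documented behaviour.

```python
def perfect_match_sketch(queries, mentions_to_wikidata, already_collected):
    """Exact-key lookup: {query: 1.0} on a hit, {} on a miss."""
    candidates = {}
    for query in queries:
        if query in mentions_to_wikidata:
            candidates[query] = {query: 1.0}
            # Assumption: only hits are stored in the cache.
            already_collected[query] = {query: 1.0}
        else:
            candidates[query] = {}
    return candidates, already_collected


# Toy stand-in for the loaded resource; the values are omitted because
# only the keys matter for an exact lookup.
mentions_to_wikidata = {"London": {}, "Barcelona": {}}

candidates, cache = perfect_match_sketch(
    ["London", "Shefrield"], mentions_to_wikidata, {}
)
print(candidates)
```

The misspelt “Shefrield” gets an empty candidate list here, which is the case that the partial or DeezyMatch methods are meant to handle.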
- run(queries: List[str]) Tuple[dict, dict]
Run the appropriate ranking method based on the specified method.
- Parameters:
queries (list) – A list of mentions (strings) identified in a text to match.
- Returns:
A tuple containing two dictionaries. The resulting dictionaries will vary depending on the method set in the Ranker object. See Notes below for further information.
- Return type:
Tuple[dict, dict]
Example
>>> myranker = Ranker(method="perfectmatch", ...)
>>> myranker.mentions_to_wikidata = myranker.load_resources()
>>> queries = ['London', 'Barcelona', 'Bologna']
>>> candidates, already_collected = myranker.run(queries)
>>> print(candidates)
{'London': {'London': 1.0}, 'Barcelona': {'Barcelona': 1.0}, 'Bologna': {'Bologna': 1.0}}
>>> print(already_collected)
{'London': {'London': 1.0}, 'Barcelona': {'Barcelona': 1.0}, 'Bologna': {'Bologna': 1.0}}
Note
This method executes the appropriate ranking method based on the method parameter, selected when initialising the Ranker() object.
It delegates the execution to the corresponding method:
perfect_match()
partial_match()
levenshtein()
deezy_on_the_fly()
See the documentation of those methods for more details about their processing of the provided mentions (queries).
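The delegation described in this note can be pictured as a lookup from the method string to a handler. The sketch below uses a stand-in class whose matchers only report which path was taken. Routing "levenshtein" to partial_match() with damlev=True is an assumption suggested by the damlev flag in that method's signature, and RankerSketch is hypothetical, not the library's class.

```python
class RankerSketch:
    """Stand-in that only records which matcher would be called."""

    def __init__(self, method: str) -> None:
        self.method = method

    def perfect_match(self, queries):
        return ("perfect_match", queries)

    def partial_match(self, queries, damlev):
        return ("partial_match", queries, damlev)

    def deezy_on_the_fly(self, queries):
        return ("deezy_on_the_fly", queries)

    def run(self, queries):
        # Map each method name to the callable that should handle it.
        dispatch = {
            "perfectmatch": self.perfect_match,
            "partialmatch": lambda q: self.partial_match(q, damlev=False),
            "levenshtein": lambda q: self.partial_match(q, damlev=True),
            "deezymatch": self.deezy_on_the_fly,
        }
        if self.method not in dispatch:
            raise ValueError(f"Unknown method: {self.method!r}")
        return dispatch[self.method](queries)


print(RankerSketch("levenshtein").run(["London"]))
```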
- train() None
Train a DeezyMatch model. The training will be skipped if the model already exists and the overwrite_training key in the deezy_parameters passed when initialising the Ranker() object is set to False. The training will be run in test mode if the do_test key in the deezy_parameters passed when initialising the Ranker() object is set to True.
- Returns:
None.
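The two flags interact as a small decision table, sketched below. The function plan_training and its string return values are hypothetical and exist only to illustrate the rule stated above. How the library detects an existing model is not covered here, so that check is passed in as a boolean.

```python
def plan_training(model_exists: bool, deezy_parameters: dict) -> str:
    """Describe what train() would do under the documented rules."""
    if model_exists and not deezy_parameters["overwrite_training"]:
        return "skip"
    if deezy_parameters["do_test"]:
        return "train (test mode)"
    return "train"


params = {"overwrite_training": False, "do_test": False}
print(plan_training(model_exists=True, deezy_parameters=params))
print(plan_training(model_exists=False, deezy_parameters=params))
```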