t_res.geoparser.ranking.Ranker

class t_res.geoparser.ranking.Ranker(method: Literal['perfectmatch', 'partialmatch', 'levenshtein', 'deezymatch'], resources_path: str, mentions_to_wikidata: Optional[dict] = {}, wikidata_to_mentions: Optional[dict] = {}, strvar_parameters: Optional[dict] = None, deezy_parameters: Optional[dict] = None, already_collected_cands: Optional[dict] = {})

The Ranker class implements a system for candidate selection through string variation ranking. It provides methods to select candidates based on different matching approaches, such as perfect match, partial match, Levenshtein distance, and DeezyMatch. The class also handles loading and processing of resources related to candidate selection.

Parameters:
  • method (str) – The candidate selection and ranking method to use. One of "perfectmatch", "partialmatch", "levenshtein" or "deezymatch".

  • resources_path (str) – Relative path to the resources directory (containing Wikidata resources).

  • mentions_to_wikidata (dict, optional) – A dictionary, empty by default, that holds the mapping between mentions and Wikidata IDs once it has been loaded by the load_resources() method.

  • wikidata_to_mentions (dict, optional) – A dictionary, empty by default, that holds the mapping between Wikidata IDs and mentions once it has been loaded by the load_resources() method.

  • strvar_parameters (dict, optional) – Dictionary of string variation parameters required to create a DeezyMatch training dataset. For the default settings, see Notes below.

  • deezy_parameters (dict, optional) – Dictionary of DeezyMatch parameters for model training. For the default settings, see Notes below.

  • already_collected_cands (dict, optional) – Dictionary of already collected candidates. Defaults to dict() (an empty dictionary).

Example

>>> # Create a Ranker object:
>>> ranker = Ranker(
...     method="perfectmatch",
...     resources_path="/path/to/resources/",
... )
>>> # Load resources
>>> ranker.mentions_to_wikidata = ranker.load_resources()
>>> # Train the ranker (if applicable)
>>> ranker.train()
>>> # Perform candidate selection
>>> queries = ['London', 'Paraguay']
>>> candidates, already_collected = ranker.run(queries)
>>> # Find candidates for mentions
>>> mentions = [{'mention': 'London'}, {'mention': 'Paraguay'}]
>>> mention_candidates, mention_already_collected = ranker.find_candidates(mentions)
>>> # Print the results
>>> print("Candidate Selection Results:")
>>> print(candidates)
>>> print(already_collected)
>>> print("Find Candidates Results:")
>>> print(mention_candidates)
>>> print(mention_already_collected)

Note

  • The default settings for strvar_parameters:

    strvar_parameters: Optional[dict] = {
        # Parameters to create the string pair dataset:
        "ocr_threshold": 60,
        "top_threshold": 85,
        "min_len": 5,
        "max_len": 15,
        "w2v_ocr_path": str(Path("resources/models/w2v/").resolve()),
        "w2v_ocr_model": "w2v_*_news",
        "overwrite_dataset": False,
    }
    
  • The default settings for deezy_parameters:

    deezy_parameters: Optional[dict] = {
        "dm_path": str(Path("resources/deezymatch/").resolve()),
        "dm_cands": "wkdtalts",
        "dm_model": "w2v_ocr",
        "dm_output": "deezymatch_on_the_fly",
        "ranking_metric": "faiss",
        "selection_threshold": 50,
        "num_candidates": 1,
        "verbose": False,
        "overwrite_training": False,
        "do_test": False,
    }
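
To change a single DeezyMatch setting, one option is to copy the defaults listed above and override only the keys of interest before passing the complete dictionary to the Ranker. The sketch below assumes that a user-supplied deezy_parameters dictionary replaces the defaults as a whole rather than being merged with them, which is not confirmed here; the variable names are illustrative, and the overridden values are arbitrary.

```python
from pathlib import Path

# Defaults copied from the note above.
default_deezy_parameters = {
    "dm_path": str(Path("resources/deezymatch/").resolve()),
    "dm_cands": "wkdtalts",
    "dm_model": "w2v_ocr",
    "dm_output": "deezymatch_on_the_fly",
    "ranking_metric": "faiss",
    "selection_threshold": 50,
    "num_candidates": 1,
    "verbose": False,
    "overwrite_training": False,
    "do_test": False,
}

# Override only what differs; every other key keeps its default value.
# The defaults dictionary itself is left unchanged.
my_deezy_parameters = {
    **default_deezy_parameters,
    "num_candidates": 3,
    "verbose": True,
}

# Assumed usage (requires the T-Res package and its resources):
# ranker = Ranker(
#     method="deezymatch",
#     resources_path="/path/to/resources/",
#     deezy_parameters=my_deezy_parameters,
# )
```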
    
check_if_contained(query: str, row: Series) → float

Returns the amount of overlap, if a mention is contained within a row in the dataset.

Parameters:
  • query (str) – A mention identified in a text.

  • row (Series) – A pandas Series representing a row in the dataset with a “mentions” column, corresponding to a mention in the knowledge base.

Returns:

The match score indicating the degree of containment, ranging from 0.0 to 1.0 (perfect match).

Return type:

float

Example

>>> ranker = Ranker(...)
>>> query = 'apple'
>>> row = pd.Series({'mentions': 'Delicious apple'})
>>> match_score = ranker.check_if_contained(query, row)
>>> print(match_score)
0.3333333333333333
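
The score in the example above is consistent with a simple length ratio: 'apple' has 5 characters and 'Delicious apple' has 15, giving 5 / 15. The following is a minimal sketch of such a containment score, not the library's implementation. The function name containment_score is hypothetical, and the handling of the reverse case (the row contained in the query) and of non-matching strings (0.0) are assumptions.

```python
from typing import Mapping


def containment_score(query: str, row: Mapping[str, str]) -> float:
    """Length ratio of the contained string to the containing one.

    `row` only needs to support row["mentions"], so a plain dict or a
    pandas Series both work. Comparison is case-insensitive.
    """
    q = query.lower()
    m = row["mentions"].lower()
    if not q or not m:
        # Assumption: empty strings never count as a match.
        return 0.0
    if q in m:
        return len(q) / len(m)
    if m in q:
        # Assumption: containment is scored symmetrically.
        return len(m) / len(q)
    return 0.0


score = containment_score("apple", {"mentions": "Delicious apple"})
print(score)
```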

damlev_dist(query: str, row: Series) → float

Calculate the Damerau-Levenshtein distance between a mention and a row in the dataset.

Parameters:
  • query (str) – A mention identified in a text.

  • row (Series) – A pandas Series representing a row in the dataset with a “mentions” column, corresponding to an alternate name of an entity in the knowledge base.

Returns:

The similarity score between the query and the row, ranging from 0.0 to 1.0.

Return type:

float

Note

This method computes the Damerau-Levenshtein distance between the lowercase versions of a mention and the “mentions” column value in the given row.

The distance is then normalized to a similarity score by subtracting it from 1.0.
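
The computation described in this note can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the library's code: it uses the restricted (optimal string alignment) variant of the Damerau-Levenshtein distance, normalises by the length of the longer string, and treats two empty strings as identical. The function names osa_distance and damlev_similarity are hypothetical.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


def damlev_similarity(query: str, mention: str) -> float:
    """1.0 minus the distance normalised by the longer string's length."""
    q, m = query.lower(), mention.lower()
    longest = max(len(q), len(m))
    if longest == 0:
        return 1.0  # Assumption: two empty strings are identical.
    return 1.0 - osa_distance(q, m) / longest


print(damlev_similarity("apple", "orange"))
```

For 'apple' and 'orange' the distance is 5 and the longer string has 6 characters, so the similarity is 1 - 5/6, which agrees with the value in the example below up to floating-point precision.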

Example

>>> ranker = Ranker(...)
>>> query = 'apple'
>>> row = pd.Series({'mentions': 'orange'})
>>> similarity = ranker.damlev_dist(query, row)
>>> print(similarity)
0.1666666865348816

deezy_on_the_fly(queries: List[str]) → Tuple[dict, dict]

Perform DeezyMatch (a deep neural network approach to fuzzy string matching) on-the-fly for a list of given mentions (queries).

Parameters:

queries (list) – A list of mentions (strings) identified in a text to match.

Returns:

A tuple containing two dictionaries:

  1. The first dictionary maps each mention to its candidate list, where the candidate list is a dictionary with the mention variations as keys and their match scores as values.

  2. The second dictionary stores the already collected candidates for each mention. It is an updated version of the Ranker’s already_collected_cands attribute.

Return type:

Tuple[dict, dict]

Example

>>> ranker = Ranker(...)
>>> ranker.load_resources()
>>> queries = ['London', 'Shefrield']
>>> candidates, already_collected = ranker.deezy_on_the_fly(queries)
>>> print(candidates)
{'London': {'London': 1.0}, 'Shefrield': {'Sheffield': 0.03382000000000005}}
>>> print(already_collected)
{'London': {'London': 1.0}, 'Shefrield': {'Sheffield': 0.03382000000000005}}

Note

This method performs DeezyMatch on-the-fly for each mention in a given list of mentions identified in a text. If a query has already been matched perfectly, it skips the fuzzy matching process for that query. For the remaining queries, it uses the DeezyMatch model to generate candidates and ranks them based on the specified ranking metric and selection threshold, provided when initialising the Ranker() object.

find_candidates(mentions: List[dict]) → Tuple[dict, dict]

Find candidates for the given mentions using the selected ranking method.

Parameters:

mentions (list) – A list of predicted mentions as dictionaries.

Returns:

A tuple containing two dictionaries:

  1. The first dictionary maps each original mention to a sub-dictionary, where the sub-dictionary maps the mention variations to a sub-sub-dictionary with two keys: "Score" (the string matching similarity score) and "Candidates" (a dictionary containing the Wikidata candidates, where the key is the Wikidata ID and the value is the relative mention-to-wikidata frequency).

  2. The second dictionary stores the already collected candidates for each query.

    The variation is found by the candidate ranker in the knowledge base, and for each variation, the candidate ranking score and the candidates from Wikidata are provided. E.g. for mention “Guadaloupe” in sentence “sn83030483-1790-03-31-a-i0004_1”, the candidates will show as follows:

    {
        "Guadaloupe": {
            "Score": 1.0,
            "Candidates": {
                "Q17012": 0.003935458480913026,
                "Q3153836": 0.07407407407407407
            }
        }
    }
    

Return type:

Tuple[dict, dict]

Note

This method takes a list of mentions and finds candidates for each mention using the selected ranking method. It first extracts the queries from the mentions and then calls the appropriate method based on the ranking method chosen when initialising the Ranker() object.

The method returns a dictionary that maps each original mention to a sub-dictionary containing the mention variations as keys and their corresponding Wikidata match scores as values.

Additionally, it updates the already collected candidates dictionary (the Ranker object’s already_collected_cands attribute).
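
The nested structure returned as the first dictionary can be traversed as in the sketch below, which reuses the “Guadaloupe” values shown above. The helper best_candidate is hypothetical, and its selection rule (highest string matching score first, then highest mention-to-wikidata frequency) is an illustrative choice rather than the way T-Res itself resolves candidates.

```python
from typing import Dict, Optional, Tuple

# mention -> variation -> {"Score": float, "Candidates": {wikidata_id: frequency}}
mention_candidates = {
    "Guadaloupe": {
        "Guadaloupe": {
            "Score": 1.0,
            "Candidates": {
                "Q17012": 0.003935458480913026,
                "Q3153836": 0.07407407407407407,
            },
        }
    }
}


def best_candidate(variations: Dict[str, dict]) -> Optional[Tuple[str, str]]:
    """Return (variation, wikidata_id) with the highest (Score, frequency)."""
    best = None
    for variation, info in variations.items():
        for wikidata_id, frequency in info["Candidates"].items():
            key = (info["Score"], frequency)
            if best is None or key > best[0]:
                best = (key, variation, wikidata_id)
    if best is None:
        return None  # no variation, or no Wikidata candidates at all
    return best[1], best[2]


for mention, variations in mention_candidates.items():
    print(mention, best_candidate(variations))
```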

load_resources() → dict

Load the ranker resources.

Returns:

The loaded mentions-to-wikidata dictionary, which maps a mention (e.g. "London") to the Wikidata entities that are referred to by this mention on Wikipedia (e.g. Q84, Q2477346). The data also includes, for each entity, its normalized “relevance”, i.e. the number of in-links across Wikipedia.

Return type:

dict

Note

This method loads the mentions-to-wikidata and wikidata-to-mentions dictionaries from the resources directory, specified when initialising the Ranker(). They are required for performing candidate selection and ranking.

It filters the dictionaries to remove noise and updates the class attributes accordingly.

The method also initialises pandarallel if the candidate ranking method needs it, that is, if the Ranker was initialised with method set to “partialmatch” or “levenshtein”.

partial_match(queries: List[str], damlev: bool) → Tuple[dict, dict]

Perform partial matching for a list of given mentions (queries).

Parameters:
  • queries (list) – A list of mentions (strings) identified in a text to match.

  • damlev (bool) – A flag indicating whether to use the Damerau-Levenshtein distance for matching (True) or containment-based matching (False).

Returns:

A tuple containing two dictionaries:

  1. The first dictionary maps each mention to its candidate list, where the candidate list is a dictionary with the mention variations as keys and their match scores as values.

  2. The second dictionary stores the already collected candidates for each mention. It is an updated version of the Ranker’s already_collected_cands attribute.

Return type:

Tuple[dict, dict]

Example

>>> ranker = Ranker(...)
>>> queries = ['apple', 'banana', 'orange']
>>> candidates, already_collected = ranker.partial_match(queries, damlev=False)
>>> print(candidates)
{'apple': {'apple': 1.0}, 'banana': {'bananas': 0.5, 'banana split': 0.75}, 'orange': {'orange': 1.0}}
>>> print(already_collected)
{'apple': {'apple': 1.0}, 'banana': {'bananas': 0.5, 'banana split': 0.75}, 'orange': {'orange': 1.0}}

Note

This method performs partial matching for each mention in the given list. If a mention has already been matched perfectly, it skips the partial matching process for that mention. For the remaining mentions, it calculates the match score based on the specified partial matching method: Damerau-Levenshtein similarity (damlev=True) or containment (damlev=False).
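
The control flow described in this note can be sketched as follows: perfect matches are settled first, and only the remaining mentions are scored against every knowledge-base mention. This is a simplified illustration, not the library's implementation. The names partial_match_sketch, containment, kb_mentions and scorer are hypothetical, and both the top-n cut-off via num_candidates and the dropping of zero scores are assumptions.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def containment(query: str, mention: str) -> float:
    """Length ratio when one string contains the other, else 0.0."""
    q, m = query.lower(), mention.lower()
    if not q or not m:
        return 0.0
    if q in m:
        return len(q) / len(m)
    if m in q:
        return len(m) / len(q)
    return 0.0


def partial_match_sketch(
    queries: List[str],
    kb_mentions: Iterable[str],
    scorer: Callable[[str, str], float],
    already_collected: Dict[str, Dict[str, float]],
    num_candidates: int = 1,
) -> Tuple[Dict[str, Dict[str, float]], Dict[str, Dict[str, float]]]:
    kb = list(kb_mentions)
    kb_set = set(kb)
    candidates: Dict[str, Dict[str, float]] = {}
    for query in queries:
        if query in already_collected:
            # Reuse earlier results instead of scoring again.
            candidates[query] = already_collected[query]
            continue
        if query in kb_set:
            # Perfect match: skip partial matching entirely.
            found = {query: 1.0}
        else:
            scored = [(m, scorer(query, m)) for m in kb]
            scored = [(m, s) for m, s in scored if s > 0.0]
            scored.sort(key=lambda pair: pair[1], reverse=True)
            found = dict(scored[:num_candidates])
        candidates[query] = found
        already_collected[query] = found
    return candidates, already_collected


kb = ["apple", "bananas", "banana split", "orange"]
cands, seen = partial_match_sketch(["apple", "banana"], kb, containment, {}, num_candidates=2)
print(cands)
```

Passing a different scorer, such as a Damerau-Levenshtein similarity, would correspond to the damlev=True case.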

perfect_match(queries: List[str]) → Tuple[dict, dict]

Perform perfect matching between a provided list of mentions (queries) and the altnames in the knowledge base.

Parameters:

queries (list) – A list of mentions (strings) identified in a text to match.

Returns:

A tuple containing two dictionaries:

  1. The first dictionary maps each mention to its candidate list, where the candidate list is a dictionary with the mention itself as the key and a perfect match score of 1.0.

  2. The second dictionary stores the already collected candidates for each mention. It is an updated version of the Ranker’s already_collected_cands attribute.

Return type:

Tuple[dict, dict]

Note

This method checks if each mention has an exact match in the mentions_to_wikidata dictionary. If a match is found, it assigns a perfect match score of 1.0 to the mention. Otherwise, an empty dictionary is assigned as the candidate list for the mention.
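
The lookup described in this note amounts to a dictionary membership test, as in the minimal sketch below. The function name perfect_match_sketch and the toy mentions_to_wikidata content are hypothetical, and recording only the matched mentions in the already-collected dictionary is an assumption.

```python
from typing import Dict, List, Tuple


def perfect_match_sketch(
    queries: List[str],
    mentions_to_wikidata: Dict[str, dict],
    already_collected: Dict[str, Dict[str, float]],
) -> Tuple[Dict[str, Dict[str, float]], Dict[str, Dict[str, float]]]:
    candidates: Dict[str, Dict[str, float]] = {}
    for query in queries:
        if query in mentions_to_wikidata:
            # Exact match: the mention is its own candidate with score 1.0.
            candidates[query] = {query: 1.0}
            already_collected[query] = {query: 1.0}
        else:
            # No exact match: empty candidate list.
            candidates[query] = {}
    return candidates, already_collected


# Toy knowledge base; the frequency value is made up for illustration.
toy_mentions_to_wikidata = {"London": {"Q84": 0.9}}

cands, seen = perfect_match_sketch(["London", "Lundon"], toy_mentions_to_wikidata, {})
print(cands)
```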

run(queries: List[str]) → Tuple[dict, dict]

Run the appropriate ranking method based on the specified method.

Parameters:

queries (list) – A list of mentions (strings) identified in a text to match.

Returns:

A tuple containing two dictionaries. The resulting dictionaries will vary depending on the method set in the Ranker object. See Notes below for further information.

Return type:

Tuple[dict, dict]

Example

>>> myranker = Ranker(method="perfectmatch", ...)
>>> myranker.mentions_to_wikidata = myranker.load_resources()
>>> queries = ['London', 'Barcelona', 'Bologna']
>>> candidates, already_collected = myranker.run(queries)
>>> print(candidates)
{'London': {'London': 1.0}, 'Barcelona': {'Barcelona': 1.0}, 'Bologna': {'Bologna': 1.0}}
>>> print(already_collected)
{'London': {'London': 1.0}, 'Barcelona': {'Barcelona': 1.0}, 'Bologna': {'Bologna': 1.0}}

Note

This method executes the appropriate ranking method based on the method parameter, selected when initialising the Ranker() object.

It delegates the execution to the corresponding method:

  • perfect_match()

  • partial_match() (with damlev=False for the "partialmatch" method and damlev=True for the "levenshtein" method)

  • deezy_on_the_fly()

See the documentation of those methods for more details about their processing of the provided mentions (queries).
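
The delegation can be pictured with the stand-in class below, whose stub methods only report which call was made. It is a sketch and not the Ranker itself: the class name RankerDispatchSketch is hypothetical, the stubs return placeholder dictionaries, and routing "levenshtein" to partial_match with damlev=True is an assumption based on the damlev flag documented for partial_match().

```python
from typing import List, Tuple


class RankerDispatchSketch:
    """Stand-in that mimics how run() could pick a matching method."""

    def __init__(self, method: str) -> None:
        self.method = method

    # Stub methods: each returns a placeholder instead of real candidates.
    def perfect_match(self, queries: List[str]) -> Tuple[dict, dict]:
        return {"called": "perfect_match"}, {}

    def partial_match(self, queries: List[str], damlev: bool) -> Tuple[dict, dict]:
        return {"called": f"partial_match(damlev={damlev})"}, {}

    def deezy_on_the_fly(self, queries: List[str]) -> Tuple[dict, dict]:
        return {"called": "deezy_on_the_fly"}, {}

    def run(self, queries: List[str]) -> Tuple[dict, dict]:
        if self.method == "perfectmatch":
            return self.perfect_match(queries)
        if self.method == "partialmatch":
            return self.partial_match(queries, damlev=False)
        if self.method == "levenshtein":
            # Assumption: served by partial_match with damlev=True.
            return self.partial_match(queries, damlev=True)
        if self.method == "deezymatch":
            return self.deezy_on_the_fly(queries)
        raise ValueError(f"Unknown method: {self.method!r}")


for name in ("perfectmatch", "partialmatch", "levenshtein", "deezymatch"):
    print(name, "->", RankerDispatchSketch(name).run(["London"])[0]["called"])
```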

train() → None

Train a DeezyMatch model. Training is skipped if the model already exists and the overwrite_training key in the deezy_parameters passed when initialising the Ranker() object is set to False. Training is run in test mode if the do_test key in the deezy_parameters passed when initialising the Ranker() object is set to True.

Returns:

None.