t_res.utils.rel_utils module

t_res.utils.rel_utils.get_db_emb(cursor: Cursor, mentions: List[str], embtype: Literal['word', 'entity', 'snd']) → List[Optional[ndarray]]

Retrieve Wikipedia2Vec embeddings for a given list of words or entities.

Parameters:
  • cursor – The cursor with the open connection to the Wikipedia2Vec database.

  • mentions (List[str]) – The list of words or entities whose embeddings to extract.

  • embtype (Literal["word", "entity", "snd"]) – The type of embedding to retrieve. Possible values are "word", "entity", or "snd". If set to "word" or "snd", Wikipedia2Vec word embeddings are used; if set to "entity", Wikipedia2Vec entity embeddings are used.

Returns:

A list of arrays (or None) representing the embeddings for the given mentions.

Return type:

List[Optional[np.ndarray]]

Note

  • The embeddings are extracted from the Wikipedia2Vec database using the provided cursor.

  • If the mention is an entity, the prefix ENTITY/ is prepended to the mention before querying the database.

  • If the mention is a word, the string is converted to lowercase before querying the database.

  • If an embedding is not found for a mention, the corresponding element in the returned list is set to None.

  • Unlike the original REL implementation, this function uses Wikipedia2Vec embeddings for both "word" and "snd".
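
The lookup rules in the note above can be sketched as follows. This is a minimal illustration, not the package's implementation: the function name lookup_embeddings, the table name embeddings, and its word/emb columns are assumptions made for the example, and vectors are returned as plain lists of floats rather than NumPy arrays so that the sketch needs only the standard library.

```python
import sqlite3
from array import array
from typing import List, Optional


def lookup_embeddings(
    cursor: sqlite3.Cursor, mentions: List[str], embtype: str
) -> List[Optional[List[float]]]:
    """Hypothetical stand-in that mirrors the documented lookup rules."""
    results: List[Optional[List[float]]] = []
    for mention in mentions:
        if embtype == "entity":
            # Entities are stored under an "ENTITY/" prefix.
            key = "ENTITY/" + mention
        else:
            # "word" and "snd" both use lowercased word embeddings.
            key = mention.lower()
        row = cursor.execute(
            "SELECT emb FROM embeddings WHERE word = ?", (key,)
        ).fetchone()
        if row is None:
            # No embedding found: keep the position, store None.
            results.append(None)
        else:
            vec = array("f")
            vec.frombytes(row[0])
            results.append(vec.tolist())
    return results


# Toy in-memory database (assumed schema, for illustration only).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE embeddings (word TEXT PRIMARY KEY, emb BLOB)")
cur.execute(
    "INSERT INTO embeddings VALUES (?, ?)",
    ("london", array("f", [0.5, 1.0]).tobytes()),
)
cur.execute(
    "INSERT INTO embeddings VALUES (?, ?)",
    ("ENTITY/London", array("f", [2.0, 4.0]).tobytes()),
)

words = lookup_embeddings(cur, ["London", "Atlantis"], "word")
entities = lookup_embeddings(cur, ["London"], "entity")
```

The returned list always has the same length and order as mentions, so callers can pair each mention with its embedding (or None) by position.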

t_res.utils.rel_utils.eval_with_exception(str2parse: str, in_case: Optional[Any] = '') → Any

Parse a string in the form of a list or dictionary.

Parameters:
  • str2parse (str) – The string to parse.

  • in_case (Any, optional) – The value to return if parsing fails. Defaults to an empty string.

Returns:

The parsed value if successful, or the specified value in case of an error.

Return type:

Any
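
The parse-with-fallback behaviour described above can be sketched as below. This is an illustration under the assumption that parsing is done with the standard library's ast.literal_eval; the name parse_or_default is hypothetical and is not part of the module.

```python
import ast
from typing import Any


def parse_or_default(str2parse: str, in_case: Any = "") -> Any:
    """Hypothetical sketch: parse a literal, or return a fallback value."""
    try:
        # literal_eval only accepts Python literals (lists, dicts, numbers,
        # strings, ...), so it does not execute arbitrary code.
        return ast.literal_eval(str2parse)
    except (ValueError, SyntaxError):
        # Malformed or non-literal input: fall back to the given value.
        return in_case


parsed_list = parse_or_default("[1, 2, 3]")
parsed_dict = parse_or_default("{'a': 1}")
fallback = parse_or_default("not a list", in_case=[])
```

Passing an explicit in_case such as an empty list keeps downstream code from having to special-case the empty-string default.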

t_res.utils.rel_utils.prepare_initial_data(df: DataFrame) → dict

Generate the initial JSON data needed to train a REL model from a DataFrame.

Parameters:

df (pandas.DataFrame) – The DataFrame containing the linking training data.

Returns:

A dictionary with article IDs as keys and a list of mention dictionaries as values. Each mention dictionary contains information about a mention, excluding the “gold” field and candidates (at this point).

Return type:

dict

Note

The DataFrame passed to this function can be generated by the experiments/prepare_data.py script.
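
To make the shape of the returned dictionary concrete, the sketch below groups mention records by article ID. It is only an illustration of the structure: the helper group_mentions and the field names (article_id, mention, sentence) are hypothetical, and a list of plain dictionaries stands in for the DataFrame rows.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical rows; the real column names come from the prepared dataset.
rows = [
    {"article_id": "a1", "mention": "London", "sentence": "He left London."},
    {"article_id": "a1", "mention": "Leeds", "sentence": "She moved to Leeds."},
    {"article_id": "a2", "mention": "York", "sentence": "York is old."},
]


def group_mentions(rows: List[dict]) -> Dict[str, List[dict]]:
    """Group mention dictionaries under their article ID."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for row in rows:
        # Keep per-mention fields; no "gold" field or candidates yet.
        mention = {k: v for k, v in row.items() if k != "article_id"}
        grouped[row["article_id"]].append(mention)
    return dict(grouped)


initial = group_mentions(rows)
```

Each key of initial is an article ID and each value is the list of mention dictionaries found in that article, in row order.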

t_res.utils.rel_utils.rank_candidates(rel_json: dict, wk_cands: dict) → dict

Rank the candidates for each mention in the provided JSON data.

Parameters:
  • rel_json (dict) – The JSON data containing articles and mention information.

  • wk_cands (dict) – Dictionary of Wikidata candidates for each mention.

Returns:

A new JSON dictionary with ranked candidates for each mention.

Return type:

dict

t_res.utils.rel_utils.add_publication(rel_json: dict, publname: Optional[str] = '', publwqid: Optional[str] = '') → dict

Add publication information to the provided JSON data.

Parameters:
  • rel_json (dict) – The JSON data containing articles and mention information.

  • publname (str, optional) – The name of the publication. Defaults to an empty string.

  • publwqid (str, optional) – The Wikidata ID of the publication. Defaults to an empty string.

Returns:

A new JSON dictionary with the added publication information.

Return type:

dict

t_res.utils.rel_utils.prepare_rel_trainset(df: DataFrame, rel_params, ranker: Ranker, linker, dsplit: str) → dict

Prepare the data for training and testing a REL disambiguation model.

This function takes a pandas DataFrame (df) containing the dataset generated by the experiments/prepare_data.py script, along with a Linking object (linker) and a Ranking object (ranker). It converts the data into the format required to train and test a REL disambiguation model, using the candidates from the ranker.

Parameters:
  • df (pandas.DataFrame) – The pandas DataFrame containing the prepared dataset.

  • rel_params (dict) – Dictionary containing the parameters for performing entity disambiguation using the reldisamb approach.

  • ranker (geoparser.ranking.Ranker) – The Ranking object.

  • linker – The Linking object.

  • dsplit (str) – The split identifier for the data (e.g., "train", "test").

Returns:

The prepared data in the format of a JSON dictionary.

Return type:

dict

Note

This function stores the formatted dataset as a JSON file.