t_res.utils.rel_utils module
- t_res.utils.rel_utils.get_db_emb(cursor: Cursor, mentions: List[str], embtype: Literal['word', 'entity', 'snd']) List[Optional[ndarray]]
Retrieve Wikipedia2Vec embeddings for a given list of words or entities.
- Parameters:
cursor – The cursor with the open connection to the Wikipedia2Vec database.
mentions (List[str]) – The list of words or entities whose embeddings to extract.
embtype (Literal["word", "entity", "snd"]) – The type of embedding to retrieve. Possible values are "word", "entity", or "snd". If it is set to "word" or "snd", Wikipedia2Vec word embeddings are used; if it is set to "entity", Wikipedia2Vec entity embeddings are used.
- Returns:
A list of arrays (or None) representing the embeddings for the given mentions.
- Return type:
List[Optional[np.ndarray]]
Note
The embeddings are extracted from the Wikipedia2Vec database using the provided cursor.
If the mention is an entity, the prefix ENTITY/ is prepended to the mention before querying the database. If the mention is a word, the string is converted to lowercase before querying the database.
If an embedding is not found for a mention, the corresponding element in the returned list is set to None.
Differently from the original REL implementation, we use Wikipedia2Vec embeddings both for "word" and "snd".
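For illustration, a minimal usage sketch follows. The database path is a placeholder (an assumption, not a path shipped with the package); point it at your local Wikipedia2Vec SQLite database:

```python
import sqlite3

from t_res.utils import rel_utils

# Placeholder path: substitute the location of your Wikipedia2Vec database.
with sqlite3.connect("resources/rel_db/embeddings_database.db") as conn:
    cursor = conn.cursor()

    # Word embeddings: mentions are lowercased before querying.
    word_embs = rel_utils.get_db_emb(cursor, ["london", "bridge"], "word")

    # Entity embeddings: the "ENTITY/" prefix is prepended internally.
    entity_embs = rel_utils.get_db_emb(cursor, ["London"], "entity")

# Each element is a numpy array, or None if no embedding was found.
print([e is None for e in word_embs + entity_embs])
```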
- t_res.utils.rel_utils.eval_with_exception(str2parse: str, in_case: Optional[Any] = '') Any
Parse a string in the form of a list or dictionary, returning in_case if the string cannot be parsed.
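The intended behaviour (a safe literal parse that falls back to in_case on failure) can be sketched as follows; this is an illustrative re-implementation of the idea, not a verbatim copy of the t_res code:

```python
import ast
from typing import Any, Optional

def eval_with_exception_sketch(str2parse: str, in_case: Optional[Any] = "") -> Any:
    """Illustrative sketch: parse a Python literal (e.g. a list or dict) from a
    string, returning ``in_case`` if the string cannot be parsed."""
    try:
        return ast.literal_eval(str2parse)
    except (ValueError, SyntaxError, TypeError):
        return in_case

print(eval_with_exception_sketch("['London', 'Ashton-under-Lyne']"))  # ['London', 'Ashton-under-Lyne']
print(eval_with_exception_sketch("not a literal", in_case=[]))        # []
```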
- t_res.utils.rel_utils.prepare_initial_data(df: DataFrame) dict
Generate the initial JSON data needed to train a REL model from a DataFrame.
- Parameters:
df – The dataframe containing the linking training data.
- Returns:
A dictionary with article IDs as keys and a list of mention dictionaries as values. Each mention dictionary contains information about a mention, excluding the “gold” field and candidates (at this point).
- Return type:
dict
Note
The DataFrame passed to this function can be generated by the experiments/prepare_data.py script.
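A brief usage sketch, assuming a TSV produced by the experiments/prepare_data.py script (the file path below is illustrative):

```python
import pandas as pd

from t_res.utils import rel_utils

# Illustrative path: load the linking dataset produced by experiments/prepare_data.py.
df = pd.read_csv("experiments/outputs/data/lwm/linking_df_split.tsv", sep="\t")

rel_json = rel_utils.prepare_initial_data(df)

# Keys are article IDs; values are lists of mention dictionaries.
article_id = next(iter(rel_json))
print(article_id, len(rel_json[article_id]))
```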
- t_res.utils.rel_utils.rank_candidates(rel_json: dict, wk_cands: dict, mentions_to_wikidata: dict) dict
Rank the candidates for each mention in the provided JSON data.
- Parameters:
rel_json (dict) – The JSON dictionary containing the mentions (as produced by prepare_initial_data).
wk_cands (dict) – The candidates retrieved by the ranker for each mention.
mentions_to_wikidata (dict) – Dictionary mapping mentions to Wikidata entities, with counts.
- Returns:
A new JSON dictionary with ranked candidates for each mention.
- Return type:
dict
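A minimal sketch of how this function slots in after prepare_initial_data; the inputs are shown as empty placeholders standing in for the real objects produced in earlier steps:

```python
from t_res.utils import rel_utils

rel_json = {}               # output of prepare_initial_data(df)
wk_cands = {}               # candidates retrieved by the ranker for each mention
mentions_to_wikidata = {}   # mention -> Wikidata entities mapping, with counts

ranked_json = rel_utils.rank_candidates(rel_json, wk_cands, mentions_to_wikidata)
```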
- t_res.utils.rel_utils.add_publication(rel_json: dict, publname: Optional[str] = '', publwqid: Optional[str] = '') dict
Add publication information to the provided JSON data.
- Parameters:
rel_json (dict) – The JSON dictionary to which the publication information is added.
publname (str, optional) – The name of the publication.
publwqid (str, optional) – The Wikidata QID of the publication.
- Returns:
A new JSON dictionary with the added publication information.
- Return type:
dict
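For example (the publication name and QID below are illustrative values, not defaults of the function):

```python
from t_res.utils import rel_utils

ranked_json = {}  # output of rank_candidates, shown here as an empty placeholder

rel_json_with_publ = rel_utils.add_publication(
    ranked_json,
    publname="Dorset County Chronicle",  # illustrative publication name
    publwqid="Q123456",                  # illustrative Wikidata QID
)
```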
- t_res.utils.rel_utils.prepare_rel_trainset(df: DataFrame, rel_params, mentions_to_wikidata, myranker: Ranker, dsplit: str) dict
Prepare the data for training and testing a REL disambiguation model.
This function takes as input a pandas DataFrame (df) containing the dataset generated by the experiments/prepare_data.py script, along with a Ranking object (myranker). It prepares the data in the format required to train and test a REL disambiguation model, using the candidates from the ranker.
- Parameters:
df (pandas.DataFrame) – The pandas DataFrame containing the prepared dataset.
rel_params (dict) – Dictionary containing the parameters for performing entity disambiguation using the reldisamb approach.
mentions_to_wikidata (dict) – Dictionary mapping mentions to Wikidata entities, with counts.
myranker (geoparser.ranking.Ranker) – The Ranking object.
dsplit (str) – The split identifier for the data (e.g., "train", "test").
- Returns:
The prepared data in the format of a JSON dictionary.
- Return type:
dict
Note
This function stores the formatted dataset as a JSON file.
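To show how the pieces fit together, a hedged end-to-end sketch follows. The Ranker configuration (including the load_resources() call), the file paths, and the empty rel_params dictionary are assumptions to be adapted to your local setup:

```python
import pandas as pd

from t_res.geoparser.ranking import Ranker
from t_res.utils import rel_utils

# Assumed Ranker setup; the method and resources_path should match your installation.
myranker = Ranker(
    method="perfectmatch",
    resources_path="resources/wikidata/",
)
myranker.mentions_to_wikidata = myranker.load_resources()

# Illustrative path to the dataset produced by experiments/prepare_data.py.
df = pd.read_csv("experiments/outputs/data/lwm/linking_df_split.tsv", sep="\t")

rel_params = {}  # parameters for the reldisamb approach (placeholder)

train_json = rel_utils.prepare_rel_trainset(
    df,
    rel_params,
    myranker.mentions_to_wikidata,
    myranker,
    "train",
)
```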