t_res.geoparser.linking.Linker
- class t_res.geoparser.linking.Linker(method: Literal['mostpopular', 'reldisamb', 'bydistance'], resources_path: str, experiments_path: Optional[str] = '../experiments', linking_resources: Optional[dict] = {}, overwrite_training: Optional[bool] = False, rel_params: Optional[dict] = None)
The Linker class provides methods for entity linking, which is the task of associating mentions in text with their corresponding entities in a knowledge base.
- Parameters:
method (Literal["mostpopular", "reldisamb", "bydistance"]) – The linking method to use.
resources_path (str) – The path to the linking resources.
experiments_path (str, optional) – The path to the experiments directory. Defaults to "../experiments".
linking_resources (dict, optional) – Dictionary containing the necessary linking resources. Defaults to dict() (an empty dictionary).
overwrite_training (bool, optional) – Flag indicating whether to retrain and overwrite an existing model. Defaults to False.
rel_params (dict, optional) – Dictionary containing the parameters for performing entity disambiguation using the reldisamb approach (adapted from the Radboud Entity Linker, REL). For the default settings, see the Note below.
Example:
linker = Linker(
    method="mostpopular",
    resources_path="/path/to/resources/",
    experiments_path="/path/to/experiments/",
    linking_resources={},
    overwrite_training=True,
    rel_params={"with_publication": True, "do_test": True},
)
Note
To instantiate the Linker with the reldisamb method, wrap it in a context manager that opens a connection to the entity embeddings database and creates a cursor:
import sqlite3

from t_res.geoparser import linking

# wpubl and wmtops are booleans defined by the caller (compare with_publication
# and without_microtoponyms in the default settings).
with sqlite3.connect("../resources/rel_db/embeddings_database.db") as conn:
    cursor = conn.cursor()
    mylinker = linking.Linker(
        method="reldisamb",
        resources_path="../resources/",
        experiments_path="../experiments/",
        linking_resources=dict(),
        rel_params={
            "model_path": "../resources/models/disambiguation/",
            "data_path": "../experiments/outputs/data/lwm/",
            "training_split": "",
            "db_embeddings": cursor,
            "with_publication": wpubl,
            "without_microtoponyms": wmtops,
            "do_test": False,
            "default_publname": "",
            "default_publwqid": "",
        },
        overwrite_training=False,
    )
The default settings for rel_params are shown below. Note that db_embeddings defaults to None, but it should be assigned a cursor to the entity embeddings database, as described above:
rel_params: Optional[dict] = {
    "model_path": "../resources/models/disambiguation/",
    "data_path": "../experiments/outputs/data/lwm/",
    "training_split": "originalsplit",
    "db_embeddings": None,
    "with_publication": True,
    "without_microtoponyms": True,
    "do_test": False,
    "default_publname": "United Kingdom",
    "default_publwqid": "Q145",
}
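As a rough illustration of the point about db_embeddings, the sketch below copies the documented defaults and attaches a cursor before the dictionary would be passed to the Linker. The helper build_rel_params is hypothetical (it is not part of T-Res), and an in-memory SQLite database stands in for the real embeddings database.

```python
import sqlite3
from contextlib import closing

# Default settings copied from the rel_params listing in this note.
DEFAULT_REL_PARAMS = {
    "model_path": "../resources/models/disambiguation/",
    "data_path": "../experiments/outputs/data/lwm/",
    "training_split": "originalsplit",
    "db_embeddings": None,
    "with_publication": True,
    "without_microtoponyms": True,
    "do_test": False,
    "default_publname": "United Kingdom",
    "default_publwqid": "Q145",
}


def build_rel_params(cursor, **overrides):
    """Hypothetical helper: copy the defaults, attach the cursor, apply overrides."""
    unknown = set(overrides) - set(DEFAULT_REL_PARAMS)
    if unknown:
        raise KeyError(f"Unknown rel_params keys: {sorted(unknown)}")
    params = dict(DEFAULT_REL_PARAMS)  # shallow copy, so the defaults stay untouched
    params["db_embeddings"] = cursor
    params.update(overrides)
    return params


# Assumption: ":memory:" replaces the path to embeddings_database.db for this demo.
with closing(sqlite3.connect(":memory:")) as conn:
    cursor = conn.cursor()
    rel_params = build_rel_params(cursor, do_test=True)
```

Copying the defaults rather than mutating them keeps db_embeddings set to None in the shared dictionary, so a stale cursor is never reused by accident.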
- by_distance(dict_mention: dict, origin_wqid: Optional[str] = '') → Tuple[str, float, dict]
Select candidate based on distance to the place of publication.
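- Parameters:
dict_mention (dict) – Dictionary with all the relevant information needed to disambiguate a certain mention.
origin_wqid (str, optional) – The Wikidata ID of the place of publication. Defaults to "".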
- Returns:
A tuple containing the Wikidata ID of the closest candidate to the place of publication (e.g. "Q84") or "NIL", the confidence score of the predicted link as a float (rounded to 3 decimals), and a dictionary of all candidates and their confidence scores.
- Return type:
Tuple[str, float, dict]
Note
This method applies the "by distance" disambiguation approach, which links entities based on geographical distance. It is unsupervised: given a set of candidates and the place of publication of the original text, it predicts the candidate location closest to the place of publication.
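The idea behind this note can be sketched in a few lines. This is not the T-Res implementation: the function names, the coordinate dictionary, and the scoring formula (one minus the great-circle distance divided by the largest possible distance on Earth) are all assumptions made for illustration, as is returning 0.0 alongside "NIL".

```python
from math import asin, cos, pi, radians, sin, sqrt
from typing import Dict, Tuple

EARTH_RADIUS_KM = 6371.0
# Assumption: distances are scaled by half of Earth's circumference.
MAX_DISTANCE_KM = pi * EARTH_RADIUS_KM

Coords = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_km(a: Coords, b: Coords) -> float:
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(min(1.0, h)))  # clamp guards rounding error


def closest_candidate(
    candidates: Dict[str, Coords], origin: Coords
) -> Tuple[str, float, Dict[str, float]]:
    """Pick the candidate nearest to the place of publication (illustrative only)."""
    if not candidates:
        return "NIL", 0.0, {}
    scores = {
        wqid: round(1.0 - haversine_km(coords, origin) / MAX_DISTANCE_KM, 3)
        for wqid, coords in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best], scores


# Illustrative data: two places called "London", with a text published in Manchester.
candidates = {"Q84": (51.5074, -0.1278), "Q92561": (42.9849, -81.2453)}
best, score, all_scores = closest_candidate(candidates, origin=(53.4808, -2.2426))
```

Here the candidate in England scores higher than the one in Ontario because it lies much closer to the place of publication.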
- load_resources() → dict
Loads the linking resources.
- Returns:
Dictionary containing loaded necessary linking resources.
- Return type:
dict
Note
Different methods will require different resources.
- most_popular(dict_mention: dict) → Tuple[str, float, dict]
Select most popular candidate, given Wikipedia’s in-link structure.
- Parameters:
dict_mention (dict) – dictionary with all the relevant information needed to disambiguate a certain mention.
- Returns:
A tuple containing the most popular candidate’s Wikidata ID (e.g. "Q84") or "NIL", the confidence score of the predicted link as a float, and a dictionary of all candidates and their confidence scores.
- Return type:
Tuple[str, float, dict]
Note
This method applies the "most popular" disambiguation approach. Given a set of candidates for a mention, it predicts the most relevant Wikidata candidate, as determined from the in-link structure of Wikipedia.
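A minimal sketch of popularity-based selection follows. It assumes each candidate comes with a count of incoming Wikipedia links and that the confidence score is the candidate's share of the total; both the input shape and the scoring are assumptions for illustration, not the T-Res implementation.

```python
from typing import Dict, Tuple


def most_popular_candidate(
    inlink_counts: Dict[str, int]
) -> Tuple[str, float, Dict[str, float]]:
    """Pick the candidate with the most in-links (illustrative only)."""
    total = sum(inlink_counts.values())
    if total == 0:  # also covers the case of no candidates at all
        return "NIL", 0.0, {}
    # Assumption: confidence is the candidate's share of all in-links.
    scores = {wqid: round(count / total, 3) for wqid, count in inlink_counts.items()}
    best = max(scores, key=scores.get)
    return best, scores[best], scores


# Illustrative counts, not real Wikipedia statistics.
best, score, all_scores = most_popular_candidate({"Q84": 900, "Q92561": 100})
```

Unlike the distance-based approach, this ignores where the text was published, so a well-known place always wins over a lesser-known namesake.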
- run(dict_mention: dict) → Tuple[str, float, dict]
Executes the linking process based on the specified unsupervised method.
- Parameters:
dict_mention (dict) – Dictionary containing the mention information.
- Returns:
The result of the linking process. For details, see below:
If the method provided when initialising the Linker() object was "mostpopular", see most_popular().
If the method provided when initialising the Linker() object was "bydistance", see by_distance().
- Return type:
Tuple[str, float, dict]
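The dispatch described for run() can be pictured with a toy class. ToyLinker and its stubbed methods are hypothetical stand-ins, and raising ValueError for other methods is an assumption of this sketch; how the real run() treats "reldisamb" is not documented here.

```python
from typing import Tuple

LinkResult = Tuple[str, float, dict]


class ToyLinker:
    """Hypothetical stand-in that mimics only the method dispatch of run()."""

    def __init__(self, method: str) -> None:
        self.method = method

    def most_popular(self, dict_mention: dict) -> LinkResult:
        # Stub result; a real implementation would score the candidates.
        return "Q84", 1.0, {"Q84": 1.0}

    def by_distance(self, dict_mention: dict) -> LinkResult:
        # Stub result; a real implementation would compare distances.
        return "NIL", 0.0, {}

    def run(self, dict_mention: dict) -> LinkResult:
        handlers = {
            "mostpopular": self.most_popular,
            "bydistance": self.by_distance,
        }
        if self.method not in handlers:
            # Assumption of this sketch, not documented library behaviour.
            raise ValueError(f"run() does not handle method {self.method!r}")
        return handlers[self.method](dict_mention)


popular_result = ToyLinker("mostpopular").run({"mention": "London"})
distance_result = ToyLinker("bydistance").run({"mention": "London"})
```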
- train_load_model(myranker: Ranker, split: Optional[str] = 'originalsplit') → EntityDisambiguation
Trains or loads the entity disambiguation model.
- Parameters:
myranker (geoparser.ranking.Ranker) – The ranker object used for training.
split (str, optional) – The split type for training. Defaults to "originalsplit".
- Returns:
A trained Entity Disambiguation model.
- Return type:
EntityDisambiguation
Note
The training will be skipped if the model already exists and overwrite_training was set to False when initiating the Linker object, or if the disambiguation method is unsupervised. The training will be run in test mode if the do_test key of rel_params was set to True when initiating the Linker object.
Note
Credit:
This method is adapted from the REL: Radboud Entity Linker GitHub repository: Copyright (c) 2020 Johannes Michael van Hulst. See the permission notice.
Reference:
@inproceedings{vanHulst:2020:REL,
    author = {van Hulst, Johannes M. and Hasibi, Faegheh and Dercksen, Koen and Balog, Krisztian and de Vries, Arjen P.},
    title = {REL: An Entity Linker Standing on the Shoulders of Giants},
    booktitle = {Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval},
    series = {SIGIR '20},
    year = {2020},
    publisher = {ACM}
}
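The conditions under which train_load_model() skips training, as stated in its first note, can be expressed as a small predicate. This is a sketch under assumptions: the function should_train, the set of unsupervised method names, and the model file name are hypothetical and only mirror the documented rule.

```python
import tempfile
from pathlib import Path

# Assumption: these are the unsupervised methods, as handled by run().
UNSUPERVISED_METHODS = {"mostpopular", "bydistance"}


def should_train(method: str, model_file: Path, overwrite_training: bool) -> bool:
    """Return True when training would run under the documented rule."""
    if method in UNSUPERVISED_METHODS:
        return False  # nothing to train for unsupervised methods
    if Path(model_file).exists() and not overwrite_training:
        return False  # reuse the existing model
    return True


with tempfile.TemporaryDirectory() as tmp:
    model_file = Path(tmp) / "model.pt"  # hypothetical file name
    fresh = should_train("reldisamb", model_file, overwrite_training=False)
    model_file.touch()  # simulate a previously trained model
    cached = should_train("reldisamb", model_file, overwrite_training=False)
    forced = should_train("reldisamb", model_file, overwrite_training=True)

unsupervised = should_train("mostpopular", model_file, overwrite_training=True)
```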
- linking.RANDOM_SEED = 42