t_res.geoparser.pipeline.Pipeline
- class t_res.geoparser.pipeline.Pipeline(myner: Optional[Recogniser] = None, myranker: Optional[Ranker] = None, mylinker: Optional[Linker] = None, resources_path: Optional[str] = None, experiments_path: Optional[str] = None)
Represents a pipeline for processing a text using natural language processing, including Named Entity Recognition (NER), Ranking, and Linking, to geoparse any entities in the text.
- Parameters:
myner (recogniser.Recogniser, optional) – The NER (Named Entity Recogniser) object to use in the pipeline. If None, a default Recogniser will be instantiated. For the default settings, see Notes below.
myranker (ranking.Ranker, optional) – The Ranker object to use in the pipeline. If None, the default Ranker will be instantiated. For the default settings, see Notes below.
mylinker (linking.Linker, optional) – The Linker object to use in the pipeline. If None, the default Linker will be instantiated. For the default settings, see Notes below.
resources_path (str, optional) – The path to your resources directory.
experiments_path (str, optional) – The path to the experiments directory. Default is "../experiments".
Example
>>> # Instantiate the Pipeline object with a default setup
>>> pipeline = Pipeline()

>>> # Now you can use the pipeline for processing text or sentences
>>> text = "I visited Paris and New York City last summer."
>>> processed_data = pipeline.run_text(text)

>>> # Access the processed mentions in the document
>>> for mention in processed_data:
>>>     print(mention)
Note
The default settings for the Recogniser:

recogniser.Recogniser(
    model="Livingwithmachines/toponym-19thC-en",
    load_from_hub=True,
)

The default settings for the Ranker:

ranking.Ranker(
    method="perfectmatch",
    resources_path=resources_path,
)

The default settings for the Linker:

linking.Linker(
    method="mostpopular",
    resources_path=resources_path,
)
- format_prediction(mention, sentence: str, wk_cands: Optional[dict] = None, context: Optional[Tuple[str, str]] = ('', ''), sent_idx: Optional[int] = 0, place: Optional[str] = '', place_wqid: Optional[str] = '') dict
- run_candidate_selection(document_dataset: List[dict]) dict
Performs candidate selection on already identified toponyms resulting from the run_text_recognition method. Given a list of dictionaries corresponding to mentions, this method first extracts the subset of mentions for which to try to find candidates and then runs the find_candidates function from the Ranker object. It returns a dictionary of all mentions and their candidates, with a similarity score.
- Parameters:
document_dataset (List[dict]) – The list of mentions identified, formatted as dictionaries.
- Returns:
A three-level nested dictionary, as shown in the example in the Note below. The outermost key is the mention as it has been identified in the text; the first-level nested keys are candidate mentions found in Wikidata (i.e. potential matches for the original mention). The second-level nested keys are the match confidence score and the Wikidata entities that correspond to the candidate mentions, each with its associated normalised mention-to-wikidata relevance score.
- Return type:
dict
Note
{
    'Salop': {
        'Salop': {
            'Score': 1.0,
            'Candidates': {
                'Q201970': 0.0006031363088057901,
                'Q23103': 0.0075279261777561925
            }
        }
    }
}
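For illustration, the nested structure above can be traversed to pull out the highest-scoring Wikidata candidate for a mention. The helper below is a sketch written against the documented return shape; it is not part of the t_res API:

```python
def top_wikidata_candidate(wk_cands: dict, mention: str):
    """Return the (wikidata_id, relevance) pair with the highest
    normalised mention-to-wikidata relevance score for a mention,
    or None if the mention has no candidates."""
    best = None
    # First level: candidate mention variations; second level: Wikidata entities.
    for variation in wk_cands.get(mention, {}).values():
        for qid, relevance in variation.get("Candidates", {}).items():
            if best is None or relevance > best[1]:
                best = (qid, relevance)
    return best

# The example dictionary from the Note above:
wk_cands = {
    "Salop": {
        "Salop": {
            "Score": 1.0,
            "Candidates": {
                "Q201970": 0.0006031363088057901,
                "Q23103": 0.0075279261777561925,
            },
        }
    }
}
print(top_wikidata_candidate(wk_cands, "Salop"))
# ('Q23103', 0.0075279261777561925)
```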
- run_disambiguation(dataset, wk_cands, place: Optional[str] = '', place_wqid: Optional[str] = '')
Performs entity disambiguation given a list of already identified toponyms and selected candidates.
- Parameters:
dataset (List[dict]) – The list of mentions identified, formatted as dictionaries.
wk_cands (dict) – A three-level nested dictionary mapping mentions to potential Wikidata entities.
place (str, optional) – The place of publication associated with the text document as a human-legible string (e.g. "London"). Defaults to "".
place_wqid (str, optional) – The Wikidata ID of the place of publication provided in place (e.g. "Q84"). Defaults to "".
- Returns:
A list of dictionaries representing the identified and linked toponyms in the sentence. Each dictionary contains the following keys:
"mention" (str): The mention text.
"ner_score" (float): The NER score of the mention.
"pos" (int): The starting position of the mention in the sentence.
"sent_idx" (int): The index of the sentence.
"end_pos" (int): The ending position of the mention in the sentence.
"tag" (str): The NER label of the mention.
"sentence" (str): The input sentence.
"prediction" (str): The predicted entity linking result.
"ed_score" (float): The entity disambiguation score.
"string_match_score" (dict): A dictionary of candidate entities and their string matching confidence scores.
"prior_cand_score" (dict): A dictionary of candidate entities and their prior confidence scores.
"cross_cand_score" (dict): A dictionary of candidate entities and their cross-candidate confidence scores.
"latlon" (tuple): The latitude and longitude coordinates of the predicted entity.
"wkdt_class" (str): The Wikidata class of the predicted entity.
- Return type:
List[dict]
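A common follow-up step is to filter the disambiguated mentions by their entity-disambiguation score. The sketch below is a hypothetical helper (not part of t_res) that relies only on the documented output keys:

```python
def confident_links(mentions: list, min_ed_score: float = 0.5) -> list:
    """Keep only mentions whose entity-disambiguation score meets the
    threshold, returning (mention, prediction, ed_score) triples.
    Keys follow the documented run_disambiguation output format."""
    return [
        (m["mention"], m["prediction"], m["ed_score"])
        for m in mentions
        if m["ed_score"] >= min_ed_score
    ]

# Minimal sample records in the documented shape (values are made up):
sample = [
    {"mention": "Shrewsbury", "prediction": "Q23103", "ed_score": 0.89},
    {"mention": "Acton", "prediction": "Q20075", "ed_score": 0.12},
]
print(confident_links(sample))
# [('Shrewsbury', 'Q23103', 0.89)]
```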
- run_sentence(sentence: str, sent_idx: Optional[int] = 0, context: Optional[Tuple[str, str]] = ('', ''), place: Optional[str] = '', place_wqid: Optional[str] = '', postprocess_output: Optional[bool] = True, without_microtoponyms: Optional[bool] = False) List[dict]
Runs the pipeline on a single sentence.
- Parameters:
sentence (str) – The input sentence to process.
sent_idx (int, optional) – Index position of the target sentence in a larger text. Defaults to 0.
context (tuple, optional) – A tuple containing the previous and next sentences as context. Defaults to ("", "").
place (str, optional) – The place of publication associated with the sentence as a human-legible string (e.g. "London"). Defaults to "".
place_wqid (str, optional) – The Wikidata ID of the place of publication provided in place (e.g. "Q84"). Defaults to "".
postprocess_output (bool, optional) – Whether to postprocess the output, adding geographic coordinates. Defaults to True.
without_microtoponyms (bool, optional) – Specifies whether to exclude microtoponyms during processing. Defaults to False.
- Returns:
A list of dictionaries representing the processed identified and linked toponyms in the sentence. Each dictionary contains the following keys:
"sent_idx" (int): The index of the sentence.
"mention" (str): The mention text.
"pos" (int): The starting position of the mention in the sentence.
"end_pos" (int): The ending position of the mention in the sentence.
"tag" (str): The NER label of the mention.
"prediction" (str): The predicted entity linking result.
"ner_score" (float): The NER score of the mention.
"ed_score" (float): The entity disambiguation score.
"sentence" (str): The input sentence.
"prior_cand_score" (dict): A dictionary of candidate entities and their prior confidence scores.
"cross_cand_score" (dict): A dictionary of candidate entities and their cross-candidate confidence scores.
If postprocess_output is set to True, the dictionaries will also contain the following two keys:
"latlon" (tuple): The latitude and longitude coordinates of the predicted entity.
"wkdt_class" (str): The Wikidata class of the predicted entity.
- Return type:
List[dict]
Note
The run_sentence method processes a single sentence through the pipeline, performing tasks such as Named Entity Recognition (NER), ranking, and linking. It takes the input sentence along with optional parameters like the sentence index, context, the place of publication and its related Wikidata ID. By default, the method performs post-processing on the output. It first identifies toponyms in the sentence, then finds relevant candidates and ranks them, and finally links them to the Wikidata ID.
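When calling run_sentence directly on a pre-split document, the caller is responsible for building the (previous, next) context tuple for each sentence, a step run_text performs internally. The helper below is a hypothetical sketch of that bookkeeping, using "" at the document boundaries as in the parameter's default:

```python
def sentence_contexts(sentences: list) -> list:
    """For each sentence, build the (previous, next) context tuple that
    run_sentence expects, using "" at the document boundaries."""
    contexts = []
    for i in range(len(sentences)):
        prev_s = sentences[i - 1] if i > 0 else ""
        next_s = sentences[i + 1] if i < len(sentences) - 1 else ""
        contexts.append((prev_s, next_s))
    return contexts

sents = ["I visited Paris.", "Then I went to York.", "It rained."]
print(sentence_contexts(sents)[1])
# ('I visited Paris.', 'It rained.')
```

Each (sentence, context) pair, together with its index, could then be passed as the sentence, context, and sent_idx arguments of run_sentence.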
- run_text(text: str, place: Optional[str] = '', place_wqid: Optional[str] = '', postprocess_output: Optional[bool] = True) List[dict]
Runs the pipeline on a text document.
- Parameters:
text (str) – The input text document to process.
place (str, optional) – The place of publication associated with the text document as a human-legible string (e.g. "London"). Defaults to "".
place_wqid (str, optional) – The Wikidata ID of the place of publication provided in place (e.g. "Q84"). Defaults to "".
postprocess_output (bool, optional) – Whether to postprocess the output, adding geographic coordinates. Defaults to True.
- Returns:
A list of dictionaries representing the identified and linked toponyms in the text document. Each dictionary contains the following keys:
"sent_idx" (int): The index of the sentence.
"mention" (str): The mention text.
"pos" (int): The starting position of the mention in the sentence.
"end_pos" (int): The ending position of the mention in the sentence.
"tag" (str): The NER label of the mention.
"prediction" (str): The predicted entity linking result.
"ner_score" (float): The NER score of the mention.
"ed_score" (float): The entity disambiguation score.
"sentence" (str): The input sentence.
"prior_cand_score" (dict): A dictionary of candidate entities and their prior confidence scores.
"cross_cand_score" (dict): A dictionary of candidate entities and their cross-candidate confidence scores.
If postprocess_output is set to True, the dictionaries will also contain the following two keys:
"latlon" (tuple): The latitude and longitude coordinates of the predicted entity.
"wkdt_class" (str): The Wikidata class of the predicted entity.
- Return type:
List[dict]
Note
The run_text method processes an entire text through the pipeline, after splitting it into sentences, performing the tasks of Named Entity Recognition (NER), ranking, and linking. It takes the input text document along with optional parameters like the place of publication and its related Wikidata ID. By default, the method performs post-processing on the output. It first identifies toponyms in each of the text document's sentences, then finds relevant candidates and ranks them, and finally links them to the Wikidata ID.
This method runs the run_sentence() method for each of the document's sentences. The without_microtoponyms keyword passed to run_sentence comes from the rel_params parameter of the Linker (passed when initialising the Pipeline object). See geoparser.linking.Linker for instructions on how to set that up.
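With postprocess_output left at its default of True, each result dictionary carries a "latlon" tuple, which makes the output easy to map. The following is a hypothetical sketch (not part of t_res) that converts documented output records into a GeoJSON FeatureCollection; note that GeoJSON orders coordinates as (longitude, latitude):

```python
def to_geojson(mentions: list) -> dict:
    """Convert postprocessed output dictionaries (documented keys
    "latlon", "mention", "prediction") into a GeoJSON FeatureCollection.
    GeoJSON coordinates are ordered (longitude, latitude)."""
    features = []
    for m in mentions:
        lat, lon = m["latlon"]
        features.append({
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {
                "mention": m["mention"],
                "wikidata_id": m["prediction"],
            },
        })
    return {"type": "FeatureCollection", "features": features}

# A minimal sample record in the documented shape:
sample = [{"mention": "London", "prediction": "Q84", "latlon": (51.5072, -0.1276)}]
print(to_geojson(sample)["features"][0]["geometry"]["coordinates"])
# [-0.1276, 51.5072]
```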
- run_text_recognition(text: str, place: Optional[str] = '', place_wqid: Optional[str] = '') List[dict]
Runs the NER on a text document and returns the recognised entities in the format required by future steps: candidate selection and entity disambiguation.
- Parameters:
text (str) – The input text document to process.
place (str, optional) – The place of publication associated with the text document as a human-legible string (e.g. "London"). Defaults to "".
place_wqid (str, optional) – The Wikidata ID of the place of publication provided in place (e.g. "Q84"). Defaults to "".
- Returns:
A list of dictionaries representing the identified toponyms in the sentence, in the format required by the following steps in the pipeline: candidate selection and entity disambiguation. Each dictionary contains the following keys:
"mention" (str): The mention text.
"context" (list): A list of two strings corresponding to the context (i.e. the previous and next sentence).
"candidates" (list): The list of candidates, which at this point will be empty.
"gold" (list): A list containing the gold standard entity, which is and will remain ['NONE'].
"ner_score" (float): The NER score of the mention.
"pos" (int): The starting position of the mention in the sentence.
"sent_idx" (int): The index of the sentence.
"end_pos" (int): The ending position of the mention in the sentence.
"ngram" (str): The mention text (redundant).
"conf_md" (str): The NER score of the mention (redundant).
"tag" (str): The NER label of the mention.
"prediction" (str): The predicted entity linking result.
"sentence" (str): The input sentence.
- Return type:
List[dict]
Note
The run_text_recognition method runs Named Entity Recognition (NER) on a full text, one sentence at a time. It takes the input text (along with optional parameters like the place of publication and its related Wikidata ID), splits it into sentences, and then finds mentions in each sentence.
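Because every mention dictionary carries a "sent_idx" key, the flat list returned by run_text_recognition can be regrouped by sentence before candidate selection or inspection. The helper below is a hypothetical sketch using only the documented keys:

```python
from collections import defaultdict

def mentions_by_sentence(document_dataset: list) -> dict:
    """Group the mention dictionaries returned by run_text_recognition
    by their "sent_idx" key, preserving order within each sentence."""
    grouped = defaultdict(list)
    for mention in document_dataset:
        grouped[mention["sent_idx"]].append(mention["mention"])
    return dict(grouped)

# Minimal sample records in the documented shape (other keys omitted):
sample = [
    {"mention": "Paris", "sent_idx": 0},
    {"mention": "New York City", "sent_idx": 0},
    {"mention": "York", "sent_idx": 1},
]
print(mentions_by_sentence(sample))
# {0: ['Paris', 'New York City'], 1: ['York']}
```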