t_res.utils.process_data module
- t_res.utils.process_data.eval_with_exception(str2parse: str, in_case: Optional[Any] = '') → Any
Evaluate a string expression using ast.literal_eval(). If the evaluation succeeds, the result is returned. If a ValueError occurs during evaluation, the provided in_case value is returned instead.
- Parameters:
str2parse (str) – The string expression to be evaluated.
in_case (Any, optional) – The value to return in case of a ValueError. Defaults to "".
- Returns:
The evaluated result if successful, or the in_case value if an error occurs.
- Return type:
Any
Example
>>> eval_with_exception("[1, 2, 3]")
[1, 2, 3]
>>> eval_with_exception(None, [])
[]
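The behaviour above amounts to a small try/except wrapper around ast.literal_eval(). A minimal sketch of the idea (not necessarily the library's exact source):

import ast
from typing import Any, Optional

def eval_with_exception(str2parse: str, in_case: Optional[Any] = "") -> Any:
    """Parse a Python literal from a string, returning in_case on failure."""
    try:
        return ast.literal_eval(str2parse)
    except ValueError:
        # Malformed (or non-string) input: fall back to the provided default.
        return in_case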
- t_res.utils.process_data.prepare_sents(df: DataFrame) → Tuple[dict, dict, dict]
Prepares annotated data and metadata on a sentence basis.
- Returns:
A tuple consisting of three dictionaries:
dSentences: A dictionary in which we keep, for each article/sentence (expressed as e.g. "10732214_1", where "10732214" is the article_id and "1" is the order of the sentence in the article), the full original unprocessed sentence.
dAnnotated: A dictionary in which we keep, for each article/sentence, an inner dictionary mapping the position of an annotated named entity (i.e. its start and end character, as a tuple, as the key) to another tuple as its value, which consists of the type of named entity (such as LOC or BUILDING), the mention, and its annotated link, all extracted from the gold standard.
dMetadata: A dictionary in which we keep, for each article/sentence, its metadata: place (of publication), year, ocr_quality_mean, ocr_quality_sd, publication_title, publication_code, and place_wqid (Wikidata ID of the place of publication).
- Return type:
Tuple[dict, dict, dict]
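For orientation, the three dictionaries share the same "article-id_sentence-position" keys; the shapes below are a sketch with purely illustrative values:

dSentences, dAnnotated, dMetadata = prepare_sents(df)

dSentences["10732214_1"]  # -> "A sentence from the article."          (illustrative)
dAnnotated["10732214_1"]  # -> {(9, 15): ("LOC", "London", "Q84")}     (illustrative)
dMetadata["10732214_1"]   # -> {"place": "London", "year": 1860, ...}  (illustrative)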
- t_res.utils.process_data.align_gold(predictions: List[dict], annotations: dict) → List[dict]
Aligns the predictions of a BERT NER model with the gold standard labels by aligning their tokenization.
The gold standard tokenization is not aligned with the tokenization produced by the BERT model, as it uses its own tokenizer.
This function aligns the two tokenizations based on the start and end positions of each predicted token.
Predicted tokens are assigned the "O" label by default unless their position overlaps with an annotated entity, in which case they are relabeled according to the corresponding gold standard label.
- Parameters:
predictions (List[dict]) – A list of dictionaries representing the predicted mentions. Each dictionary contains the following keys:
start (int): The start position of the predicted token.
end (int): The end position of the predicted token.
entity (str): The predicted entity label (initially set to "O").
link (str): The predicted entity link (initially set to "O").
annotations (dict) – A dictionary where the keys are tuples representing the start and end positions of gold standard detections in a sentence, and the values are tuples containing the label type (e.g. "LOC"), the mention (e.g. "Point Petre"), and the link of the corresponding gold standard entity (e.g. "Q335322").
- Returns:
A list of dictionaries representing the aligned gold standard labels. Each dictionary contains the same keys as the predictions:
start (int): The start position of the aligned token.
end (int): The end position of the aligned token.
entity (str): The aligned entity label.
link (str): The aligned entity link.
score (float): The score for the aligned entity (set to 1.0 as it is manually annotated).
- Return type:
List[dict]
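The alignment reduces to an interval-overlap check between predicted token spans and annotated entity spans. A minimal sketch of the idea (the B-/I- prefixing follows the gold_tokenization example shown under ner_and_process() below; this is not the library's exact source):

from typing import List

def align_gold(predictions: List[dict], annotations: dict) -> List[dict]:
    gold_standard = []
    for pred in predictions:
        token = {
            "start": pred["start"],
            "end": pred["end"],
            "entity": "O",  # default label
            "link": "O",
            "score": 1.0,   # manually annotated, hence maximal confidence
        }
        for (ann_start, ann_end), (label, mention, link) in annotations.items():
            # Relabel the token if its span overlaps an annotated entity.
            if pred["start"] < ann_end and pred["end"] > ann_start:
                prefix = "B-" if pred["start"] == ann_start else "I-"
                token["entity"] = prefix + label
                token["link"] = prefix + link
        gold_standard.append(token)
    return gold_standard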
- t_res.utils.process_data.postprocess_predictions(predictions: List[dict], gold_positions: List[dict]) → dict
Postprocess predictions to be used later in the pipeline.
- Parameters:
predictions (list) – the output of the geoparser.recogniser.Recogniser.ner_predict() method, where, given a sentence, a list of dictionaries is returned, in which each dictionary corresponds to a recognised token, e.g.:
{"entity": "O", "score": 0.99975187, "word": "From", "start": 0, "end": 4}
gold_positions (list) – the output of the utils.process_data.align_gold() function, which aligns the gold standard text to the tokenisation performed by the named entity recogniser, to enable assessing the performance of the NER and linking steps.
- Returns:
A dictionary with three key-value pairs: sentence_preds is mapped to the list-of-lists representation of predictions, sentence_trues is mapped to the list-of-lists representation of gold_positions, and sentence_skys is the same as sentence_trues, but with empty links.
- Return type:
dict
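A hypothetical result for the single token shown above (shapes per the description; values illustrative):

processed = postprocess_predictions(predictions, gold_positions)
# processed["sentence_preds"] -> [["From", "O", "O", 0, 4, 0.99975187], ...]
# processed["sentence_trues"] -> [["From", "O", "O", 0, 4, 1.0], ...]
# processed["sentence_skys"]  -> like sentence_trues, but with the links emptied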
- t_res.utils.process_data.ner_and_process(dSentences: dict, dAnnotated: dict, myner) → Tuple[dict, dict, dict, dict, dict, dict]
Perform named entity recognition in the LwM way, and postprocess the output to prepare it for the experiments.
- Parameters:
dSentences (dict) – dictionary in which we keep, for each article/sentence (expressed as e.g. "10732214_1", where "10732214" is the article_id and "1" is the order of the sentence in the article), the full original unprocessed sentence.
dAnnotated (dict) – dictionary in which we keep, for each article/sentence, an inner dictionary mapping the position of an annotated named entity (i.e. its start and end character, as a tuple, as the key) to another tuple as its value, which consists of the type of named entity (such as LOC or BUILDING), the mention, and its annotated link, all extracted from the gold standard.
myner (recogniser.Recogniser) – a Recogniser object, for NER.
- Returns:
A tuple consisting of six dictionaries:
dPreds: A dictionary where the NER predictions are stored, where the key is the sentence_id (i.e. article_id + "_" + sentence_pos) and the value is a list of lists, where each element corresponds to one token in a sentence, for example:
["From", "O", "O", 0, 4, 0.999826967716217]
…where the elements, by their position, are:
the token,
the NER tag,
the link to Wikidata, set to "O" for now because we haven’t performed linking yet,
the starting character of the token,
the end character of the token, and
the NER prediction score.
This dictionary is stored as a JSON file in the outputs/data folder, with the suffix _ner_predictions.json.
.dTrues: A dictionary where the gold standard named entities are stored, which has the same format as dPreds above, but with the manually annotated data instead of the predictions.
This dictionary is stored as a JSON file in the outputs/data folder, with the suffix _gold_standard.json.
dSkys: A dictionary where the skyline will be stored, for the linking experiments. At this point, it will be the same as dPreds, without the NER prediction score. During linking, it will be filled with the gold standard entities when these have been retrieved using candidates.
This dictionary is stored as a JSON file in the outputs/data folder, with the suffix _ner_skyline.json.
gold_tokenization: A dictionary where the gold standard entities are stored, whose keys represent the sentence_id (i.e. article_id + "_" + sentence_pos) and whose values are lists of dictionaries, each looking like this:
{"entity": "B-LOC", "score": 1.0, "word": "Unitec", "start": 193, "end": 199, "link": "B-Q30"}
This dictionary is stored as a JSON file in the outputs/data folder, with the suffix _gold_positions.json.
dMentionsPred: A dictionary of detected, but not yet linked, mentions, for example:
{
    "sn83030483-1790-03-03-a-i0001_9": [
        {
            "mention": "Unitec ? States",
            "start_offset": 38,
            "end_offset": 40,
            "start_char": 193,
            "end_char": 206,
            "ner_score": 0.79,
            "ner_label": "LOC",
            "entity_link": "O"
        }
    ]
}
This dictionary is stored as a JSON file in the outputs/data folder, with the suffix _pred_mentions.json.
dMentionsGold: A dictionary consisting of gold standard mentions, analogous to the dictionary of detected mentions, but with the gold standard ner_label and entity_link.
- Return type:
Tuple[dict, dict, dict, dict, dict, dict]
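A hypothetical end-to-end usage, tying this function to prepare_sents() (the Recogniser construction is elided; values are illustrative):

dSentences, dAnnotated, dMetadata = prepare_sents(df)
dPreds, dTrues, dSkys, gold_tok, dMentionsPred, dMentionsGold = ner_and_process(
    dSentences, dAnnotated, myner
)
# dPreds["10732214_1"] -> [["From", "O", "O", 0, 4, 0.9998], ...]  (illustrative)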
- t_res.utils.process_data.update_with_linking(ner_predictions: dict, link_predictions: Series) → dict
Updates the NER predictions by incorporating linking results.
- Parameters:
ner_predictions (dict) – A dictionary containing NER predictions (token-per-token) for a given sentence.
link_predictions (pd.Series) – A pandas series corresponding to one row of the test_df, representing one mention.
- Returns:
A dictionary similar to ner_predictions, but with the added Wikidata link for each predicted entity.
- Return type:
dict
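A minimal sketch of the idea, assuming ner_predictions maps token positions to dictionaries like those produced by align_gold(), and assuming the mention row carries its span and resolved Wikidata ID in fields named along these lines (pred_wqid is an assumption; start_char and end_char mirror the dMentionsPred example above):

import copy

def update_with_linking(ner_predictions: dict, link_predictions) -> dict:
    resolved = copy.deepcopy(ner_predictions)
    for token in resolved.values():
        # Attach the predicted Wikidata ID to every token inside the mention span.
        if (token["start"] >= link_predictions["start_char"]
                and token["end"] <= link_predictions["end_char"]):
            token["link"] = link_predictions["pred_wqid"]  # assumed field name
    return resolved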
- t_res.utils.process_data.update_with_skyline(ner_predictions: dict, link_predictions: Series) → dict
Updates the NER predictions with the skyline link from entity linking.
- Parameters:
ner_predictions (dict) – A dictionary containing NER predictions (token-per-token) for a given sentence.
link_predictions (pd.Series) – A pandas series corresponding to one row of the test_df, representing one mention.
- Returns:
A dictionary similar to ner_predictions, but with the added skyline link to Wikidata. The skyline link represents the gold standard candidate if it has been retrieved through candidate ranking; otherwise it is set to "O".
- Return type:
dict
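The only difference from update_with_linking() is how the link is chosen; a sketch of that decision (the field names gold_entity_link and candidates are assumptions):

# Skyline: keep the gold link only if candidate ranking actually retrieved it.
gold_link = link_predictions["gold_entity_link"]  # assumed field name
retrieved = link_predictions["candidates"]        # assumed field name
skyline_link = gold_link if gold_link in retrieved else "O"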
- t_res.utils.process_data.prepare_storing_links(processed_data: dict, all_test: List[str], test_df: DataFrame, end_to_end_eval: bool) → dict
Updates the processed data dictionaries with predicted links for “preds” and the skyline for “skys”. The skyline represents the maximum achievable result by choosing the gold standard entity among the candidates.
- Parameters:
processed_data (dict) – A dictionary containing all processed data.
all_test (List[str]) – A list of article IDs in the current data split used for testing.
test_df (pd.DataFrame) – A DataFrame with one mention per row that will be used for testing in the current experiment.
end_to_end_eval (bool) – Whether to use mentions predicted by us or provided by the gold standard (for the end-to-end and entity-linking-only evaluation settings, respectively).
- Returns:
The updated processed data dictionary, with “preds” and “skys” incorporating the predicted links and skyline information.
- Return type:
dict
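A sketch of how this plausibly ties the two update functions together (the loop structure and the article_id/sentence_pos field names are assumptions, and the handling of end_to_end_eval is omitted):

def prepare_storing_links(processed_data, all_test, test_df, end_to_end_eval):
    for _, row in test_df.iterrows():
        # Only update sentences belonging to articles in the test split.
        if row["article_id"] in all_test:
            sent_id = str(row["article_id"]) + "_" + str(row["sentence_pos"])
            processed_data["preds"][sent_id] = update_with_linking(
                processed_data["preds"][sent_id], row
            )
            processed_data["skys"][sent_id] = update_with_skyline(
                processed_data["skys"][sent_id], row
            )
    return processed_data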
- t_res.utils.process_data.store_for_scorer(hipe_scorer_results_path: str, scenario_name: str, dresults: dict, articles_test: List[str]) → None
Stores the results in the required format for evaluation using the CLEF-HIPE scorer.
- Parameters:
hipe_scorer_results_path (str) – The first part of the output file path.
scenario_name (str) – The second part of the output file path.
dresults (dict) – A dictionary containing the results.
articles_test (list) – A list of sentences that are part of the split used for evaluating the performance in the provided experiment.
- Returns:
None.
Note
The function also creates a TSV file with the results in the CoNLL format required by the scorer.
For more information about the CLEF-HIPE scorer, see https://github.com/impresso/CLEF-HIPE-2020-scorer.
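The CLEF-HIPE-2020 scorer expects one token per row with tab-separated annotation columns. A minimal sketch of such a writer (the column layout follows the HIPE-2020 shared task TSV format, and the unpacking of dresults rows follows the dPreds description above; treat both as assumptions and check the scorer's documentation):

import os

HEADER = ("TOKEN\tNE-COARSE-LIT\tNE-COARSE-METO\tNE-FINE-LIT\tNE-FINE-METO\t"
          "NE-FINE-COMP\tNE-NESTED\tNEL-LIT\tNEL-METO\tMISC\n")

def store_for_scorer(hipe_scorer_results_path, scenario_name, dresults, articles_test):
    out_path = os.path.join(hipe_scorer_results_path, scenario_name + ".tsv")
    with open(out_path, "w", encoding="utf-8") as fw:
        fw.write(HEADER)
        for sent_id, tokens in dresults.items():
            # Keep only sentences whose article is in the evaluated split
            # (assuming "article-id_sentence-position" keys).
            if sent_id.split("_")[0] not in articles_test:
                continue
            for token, ner_tag, link, *_ in tokens:
                # Columns we do not predict are padded with "O"/"_".
                fw.write(f"{token}\t{ner_tag}\tO\tO\tO\tO\tO\t{link}\tO\t_\n")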