t_res.utils.ner module
- t_res.utils.ner.training_tokenize_and_align_labels(examples: dict, tokenizer: Union[PreTrainedTokenizer, PreTrainedTokenizerFast], label_encoding_dict: dict)
Tokenize and align labels during training.
This function takes a training instance, consisting of tokens and named entity recognition (NER) tags, and aligns the tokens with their corresponding labels. It uses a transformers tokenizer object to tokenize the input tokens and then maps the NER tags to label IDs based on the provided label encoding dictionary.
- Parameters:
examples (Dict) – A dictionary representing a single training instance with three keys: id (instance ID), tokens (list of tokens), and ner_tags (list of NER tags).
tokenizer (Union[PreTrainedTokenizer, PreTrainedTokenizerFast]) – A transformers tokenizer object, which is the tokenizer of the base model.
label_encoding_dict (Dict) – A dictionary mapping NER labels to label IDs, from label2id in train().
- Returns:
The tokenized inputs with aligned labels.
- Return type:
transformers.tokenization_utils_base.BatchEncoding
- Credit:
This function is adapted from HuggingFace.
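The alignment step can be sketched without loading a model. The helper below is an illustration, not the T-Res implementation: it assumes a word_ids list like the one a fast tokenizer's BatchEncoding.word_ids() returns (None for special tokens, otherwise the index of the source word), labels only the first sub-token of each word, and masks every other position with -100, the index that PyTorch's cross-entropy loss ignores by default. The name align_labels_sketch, the sub-token split in the usage lines, and the choice to mask continuation sub-tokens are all assumptions.

```python
from typing import Dict, List, Optional

IGNORE_INDEX = -100  # assumption: positions with this label are skipped by the loss


def align_labels_sketch(
    word_ids: List[Optional[int]],
    ner_tags: List[str],
    label_encoding_dict: Dict[str, int],
) -> List[int]:
    """Map word-level NER tags onto sub-token positions (illustrative only)."""
    label_ids = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            # Special tokens such as [CLS] or [SEP] have no source word.
            label_ids.append(IGNORE_INDEX)
        elif word_idx != previous_word_idx:
            # The first sub-token of a word carries the word's label ID.
            label_ids.append(label_encoding_dict[ner_tags[word_idx]])
        else:
            # Later sub-tokens of the same word are masked in this sketch.
            label_ids.append(IGNORE_INDEX)
        previous_word_idx = word_idx
    return label_ids


# Hypothetical split: word 0 -> 1 sub-token, word 1 -> 2 sub-tokens, word 2 -> 1.
word_ids = [None, 0, 1, 1, 2, None]
ner_tags = ["B-LOC", "I-LOC", "O"]
label2id = {"O": 0, "B-LOC": 1, "I-LOC": 2}
print(align_labels_sketch(word_ids, ner_tags, label2id))
# [-100, 1, 2, -100, 0, -100]
```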
- t_res.utils.ner.collect_named_entities(tokens: List[Tuple[str, str, str, int, int]]) → List[NamedTuple]
Collect named entities from a list of tokens and return a list of named tuples representing the entities.
This function iterates over the tokens and identifies named entities based on their entity type (entity_type), keeping the tokens that are not tagged as "O". Each token is represented as a tuple with the following format: (token, entity_type, link, start_char, end_char).
- Parameters:
tokens (List[tuple]) – A list of tokens, where each token is represented as a tuple containing the following elements:
  - token (str): The token text.
  - entity_type (str): The entity type (e.g., "B-LOC", "I-LOC", "O").
  - link (str): Empty string reserved for the entity link.
  - start_char (int): The start character offset of the token.
  - end_char (int): The end character offset of the token.
- Returns:
A list of named tuples (called Entity) representing the named entities. Each named tuple contains the following fields:
  - e_type (str): The entity type.
  - link (str): Empty string reserved for the entity link.
  - start_offset (int): The start offset of the entity (token position).
  - end_offset (int): The end offset of the entity (token position).
  - start_char (int): The start character offset of the entity.
  - end_char (int): The end character offset of the entity.
- Return type:
List[NamedTuple]
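A minimal sketch of this kind of collection is shown below. It is not the T-Res code: the function name collect_entities_sketch and the exact grouping rules (an "I-" tag extends the open entity only when its type matches, anything else starts a new entity) are assumptions made for illustration. The Entity fields follow the list above.

```python
from collections import namedtuple
from typing import List, Tuple

Entity = namedtuple(
    "Entity", "e_type link start_offset end_offset start_char end_char"
)


def collect_entities_sketch(
    tokens: List[Tuple[str, str, str, int, int]]
) -> List[Entity]:
    """Group BIO-tagged tokens into entities (illustrative only)."""
    entities = []
    current = None  # [e_type, link, start_offset, end_offset, start_char, end_char]
    for position, (_text, tag, link, start_char, end_char) in enumerate(tokens):
        if tag == "O":
            # A non-entity token closes any open entity.
            if current is not None:
                entities.append(Entity(*current))
                current = None
            continue
        prefix, e_type = tag[:2], tag[2:]
        if current is not None and prefix == "I-" and e_type == current[0]:
            # Continuation: extend the open entity to cover this token.
            current[3] = position
            current[5] = end_char
        else:
            # A "B-" tag, a type change, or a stray "I-" starts a new entity.
            if current is not None:
                entities.append(Entity(*current))
            current = [e_type, link, position, position, start_char, end_char]
    if current is not None:
        entities.append(Entity(*current))
    return entities


tokens = [
    ("Visit", "O", "", 0, 5),
    ("New", "B-LOC", "", 6, 9),
    ("York", "I-LOC", "", 10, 14),
    ("and", "O", "", 15, 18),
    ("Leeds", "B-LOC", "", 19, 24),
]
for entity in collect_entities_sketch(tokens):
    print(entity)
```

In this example the first entity spans token positions 1 to 2 and characters 6 to 14, and the second covers token position 4 only.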
- t_res.utils.ner.aggregate_mentions(predictions: List[List[Tuple[str, str, str, int, int]]], setting: Literal['pred', 'gold']) → List[dict]
Aggregate predicted or gold mentions into a consolidated format.
This function takes a list of predicted or gold mentions and aggregates them into a consolidated format. It reconstructs the text of each mention by combining the tokens and their corresponding white spaces. It also consolidates the NER label, NER score, and entity link for each mention.
- Parameters:
predictions (List[List]) – A list of token predictions, where each token prediction is represented as a list of values. For details on those values, see collect_named_entities().
setting (Literal["pred", "gold"]) – The setting for aggregation:
  - If set to "pred", the function aggregates predicted mentions. Entity links will be set to "O" (because linking has not been performed yet).
  - If set to "gold", the function aggregates gold mentions. The NER score will be set to 1.0, because gold mentions are identified manually.
- Returns:
A list of dictionaries representing the aggregated mentions, where each dictionary contains the following keys:
  - mention: The text of the mention.
  - start_offset: The start offset of the mention (token position).
  - end_offset: The end offset of the mention (token position).
  - start_char: The start character index of the mention.
  - end_char: The end character index of the mention.
  - ner_score: The consolidated NER score of the mention (0.0 for predicted mentions, 1.0 for gold mentions).
  - ner_label: The consolidated NER label of the mention.
  - entity_link: The consolidated entity link of the mention ("O" for predicted mentions, entity label for gold mentions).
- Return type:
List[dict]
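The text reconstruction described above (tokens plus the white space between them) can be sketched from character offsets alone. The helper name rebuild_mention_text and the padding strategy are assumptions for illustration; the real function may recover white space differently.

```python
from typing import List, Tuple


def rebuild_mention_text(tokens: List[Tuple[str, str, str, int, int]]) -> str:
    """Join the tokens of one mention, restoring the gaps between them.

    Each token is (token, entity_type, link, start_char, end_char). The gap
    between one token's end_char and the next token's start_char is filled
    with spaces, so adjacent tokens such as "Ashton" and "-" stay attached.
    """
    text = ""
    previous_end = None
    for token, _tag, _link, start_char, end_char in tokens:
        if previous_end is not None:
            # Multiplying a string by zero or a negative number yields "".
            text += " " * (start_char - previous_end)
        text += token
        previous_end = end_char
    return text


print(rebuild_mention_text([("New", "B-LOC", "", 6, 9), ("York", "I-LOC", "", 10, 14)]))
# New York
print(rebuild_mention_text([("Ashton", "B-LOC", "", 0, 6), ("-", "I-LOC", "", 6, 7), ("under", "I-LOC", "", 7, 12)]))
# Ashton-under
```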
- t_res.utils.ner.fix_capitalization(entity: dict, sentence: str) → dict
Correct capitalization errors in entities.
This function corrects capitalization errors in entities that occur as a result of the NER prediction. The NER prediction may return processed words with incorrect capitalization. This function replaces the processed word in the entity with the true surface form from the original dataset, using the character position information.
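As a rough illustration of using character positions to restore the surface form, the sketch below slices the original sentence and swaps it in when it differs from the predicted word only by case. The dictionary keys word, start, and end are assumptions modelled on the output of a Hugging Face token-classification pipeline; the structure T-Res actually passes in, and its handling of "##" sub-tokens, may differ.

```python
def fix_capitalization_sketch(entity: dict, sentence: str) -> dict:
    """Replace entity["word"] with the original casing (illustrative only)."""
    fixed = dict(entity)  # work on a copy so the input is left untouched
    surface = sentence[entity["start"]:entity["end"]]
    # Only swap when the two strings differ by capitalization alone.
    if surface.lower() == entity["word"].lower():
        fixed["word"] = surface
    return fixed


entity = {"word": "london", "start": 10, "end": 16}
print(fix_capitalization_sketch(entity, "I live in London.")["word"])
# London
```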
- t_res.utils.ner.fix_hyphens(lEntities: List[dict]) → List[dict]
Fix prefix assignment errors in hyphenated entities.
This function corrects prefix assignment errors that occur in some hyphenated entities, where multiple tokens connected by hyphens form a single entity but are incorrectly assigned different prefixes (i.e. B- and I-). It specifically addresses the issue of grouping in hyphenated entities, where a sequence of tokens connected by hyphens should be grouped as a single entity.
- Parameters:
lEntities (list) – A list of dictionaries corresponding to predicted tokens.
- Returns:
A list of dictionaries with corrected predictions regarding hyphenation.
- Return type:
List[dict]
Note
Description: There is a problem with grouping when there are hyphens in words. For example, the phrase “Ashton-under-Lyne” (["Ashton", "-", "under", "-", "Lyne"]) is incorrectly grouped as ["B-LOC", "B-LOC", "B-LOC", "B-LOC", "B-LOC"], when it should be grouped as ["B-LOC", "I-LOC", "I-LOC", "I-LOC", "I-LOC"].
Solution: If the current token or the previous token is a hyphen, and the entity type of both the previous and current tokens is the same and not "O", the current entity’s prefix is changed to "I-" to maintain the correct grouping.
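The rule in the note can be written down in a few lines. This is a sketch of the stated rule, not the library source: the keys word and entity on each token dictionary are assumptions (modelled on Hugging Face token-classification output), as are the helper names.

```python
from typing import List


def entity_type(tag: str) -> str:
    """Strip a leading "B-" or "I-"; return other tags (e.g. "O") unchanged."""
    return tag[2:] if tag[:2] in ("B-", "I-") else tag


def fix_hyphens_sketch(entities: List[dict]) -> List[dict]:
    """Re-tag tokens around hyphens as continuations (illustrative only)."""
    fixed = []
    for i, ent in enumerate(entities):
        new_ent = dict(ent)
        if i > 0:
            prev = fixed[-1]
            touches_hyphen = ent["word"] == "-" or prev["word"] == "-"
            cur_type = entity_type(ent["entity"])
            prev_type = entity_type(prev["entity"])
            if touches_hyphen and cur_type == prev_type and cur_type != "O":
                new_ent["entity"] = "I-" + cur_type
        fixed.append(new_ent)
    return fixed


words = ["Ashton", "-", "under", "-", "Lyne"]
tokens = [{"word": w, "entity": "B-LOC"} for w in words]
print([t["entity"] for t in fix_hyphens_sketch(tokens)])
# ['B-LOC', 'I-LOC', 'I-LOC', 'I-LOC', 'I-LOC']
```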
- t_res.utils.ner.fix_nested(lEntities: List[dict]) → List[dict]
Fix prefix assignment errors in nested entities.
This function corrects prefix assignment errors that occur in nested entities, where multiple tokens are part of the same entity but are incorrectly assigned different prefixes. It specifically addresses the issue of grouping in nested entities, where a sequence of tokens that form a single entity are assigned incorrect prefixes.
- Parameters:
lEntities (list) – A list of dictionaries corresponding to predicted tokens.
- Returns:
A list of dictionaries with corrected predictions regarding nested entities.
- Return type:
List[dict]
Note
Description: There is a problem with grouping in some nested entities. For example, the phrase “Island of Terceira” (["Island", "of", "Terceira"]) is incorrectly grouped as ["B-LOC", "I-LOC", "B-LOC"], when it should be ["B-LOC", "I-LOC", "I-LOC"], as we consider it to be one entity.
Solution: If the current token is the preposition "of" and the previous and current entity types are not "O", the current entity’s prefix is changed to "I-" to maintain the correct grouping.
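The sketch below implements the rule exactly as worded in the solution, so it only rewrites the tag of the token "of" itself. The dictionary keys word and entity, the function name, and the input tags in the usage lines are assumptions chosen for illustration; the library may apply further checks.

```python
from typing import List


def fix_nested_sketch(entities: List[dict]) -> List[dict]:
    """Turn an entity-tagged "of" into a continuation (illustrative only)."""
    fixed = []
    for i, ent in enumerate(entities):
        new_ent = dict(ent)
        if i > 0 and ent["word"] == "of":
            prev_tag = fixed[-1]["entity"]
            cur_tag = ent["entity"]
            # Both the previous token and "of" must be inside an entity.
            if prev_tag != "O" and cur_tag != "O":
                new_ent["entity"] = "I-" + cur_tag[2:]
        fixed.append(new_ent)
    return fixed


# Hypothetical mis-tagging in which "of" wrongly opens a new entity.
tokens = [
    {"word": "Island", "entity": "B-LOC"},
    {"word": "of", "entity": "B-LOC"},
    {"word": "Terceira", "entity": "I-LOC"},
]
print([t["entity"] for t in fix_nested_sketch(tokens)])
# ['B-LOC', 'I-LOC', 'I-LOC']
```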
- t_res.utils.ner.fix_startEntity(lEntities: List[dict]) → List[dict]
Fix prefix assignment errors in entity labeling.
This function corrects two different cases of prefix assignment errors in entity labeling:
  - The first token of a sentence can only be either "O" (not an entity) or "B-" (beginning of an entity). If it is incorrectly assigned the prefix "I-", this case is fixed by changing it to "B-".
  - If the first token of a grouped entity is assigned the prefix "I-", but the entity type of the previous token is different, it should be "B-" instead. This case is fixed by changing the prefix to "B-".
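Both cases can be sketched in one pass over the tokens. As with the other examples, this is an illustration under assumptions: each token is a dictionary with an entity key holding the tag, and the function name is made up.

```python
from typing import List


def fix_start_entity_sketch(entities: List[dict]) -> List[dict]:
    """Promote stray "I-" tags to "B-" (illustrative only)."""
    fixed = []
    for i, ent in enumerate(entities):
        new_ent = dict(ent)
        tag = ent["entity"]
        if tag.startswith("I-"):
            cur_type = tag[2:]
            if i == 0:
                # Case 1: a sentence cannot start with a continuation tag.
                new_ent["entity"] = "B-" + cur_type
            else:
                prev_tag = fixed[-1]["entity"]
                prev_type = "O" if prev_tag == "O" else prev_tag[2:]
                if prev_type != cur_type:
                    # Case 2: the previous token has a different type.
                    new_ent["entity"] = "B-" + cur_type
        fixed.append(new_ent)
    return fixed


tags = ["I-LOC", "O", "I-PER", "I-PER", "I-LOC"]
tokens = [{"entity": t} for t in tags]
print([t["entity"] for t in fix_start_entity_sketch(tokens)])
# ['B-LOC', 'O', 'B-PER', 'I-PER', 'B-LOC']
```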
- t_res.utils.ner.aggregate_entities(entity: dict, lEntities: List[dict]) → List[dict]
Aggregates entities by joining split tokens.
This function aggregates entities by joining split tokens that start with "##" with the previous detected entity. It takes the current entity and the list of all predicted entities as input and returns a new list of dictionaries with corrected predictions regarding split tokens.
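A minimal sketch of this joining step follows. The keys word, entity, start, and end are assumptions modelled on Hugging Face token-classification output, and the sketch ignores any score field; how T-Res combines scores or tags when merging is not covered here.

```python
from typing import List


def aggregate_entities_sketch(entity: dict, entities: List[dict]) -> List[dict]:
    """Append a token, or glue a "##" piece onto the previous one."""
    merged = [dict(e) for e in entities]  # copy so the input list is unchanged
    if entity["word"].startswith("##") and merged:
        previous = merged[-1]
        # Drop the "##" marker and extend the previous entity's span.
        previous["word"] += entity["word"][2:]
        previous["end"] = entity["end"]
    else:
        merged.append(dict(entity))
    return merged


pieces = [
    {"word": "Ash", "entity": "B-LOC", "start": 0, "end": 3},
    {"word": "##ton", "entity": "I-LOC", "start": 3, "end": 6},
    {"word": "market", "entity": "O", "start": 7, "end": 13},
]
merged = []
for piece in pieces:
    merged = aggregate_entities_sketch(piece, merged)
print([e["word"] for e in merged])
# ['Ashton', 'market']
```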