t_res.utils.REL.vocabulary module

class t_res.utils.REL.vocabulary.Vocabulary

A class representing a vocabulary object used for storing references to embeddings.

Note

Credit:

The code for this class and its methods is taken from the REL: Radboud Entity Linker GitHub repository (Copyright (c) 2020 Johannes Michael van Hulst). See the permission notice and the original script for more information.

Reference:

@inproceedings{vanHulst:2020:REL,
author =    {van Hulst, Johannes M. and Hasibi, Faegheh and Dercksen, Koen and Balog, Krisztian and de Vries, Arjen P.},
title =     {REL: An Entity Linker Standing on the Shoulders of Giants},
booktitle = {Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval},
series =    {SIGIR '20},
year =      {2020},
publisher = {ACM}
}
add_to_vocab(token: str) → None

Add the given token to the vocabulary.

Parameters:

token (str) – The token to be added to the vocabulary.

Returns:

None.

get_id(token: str) → int

Get the ID associated with the given token from the vocabulary.

Parameters:

token (str) – The token for which to retrieve the ID.

Returns:

The ID of the token in the vocabulary, or the ID of the unknown token if the token is not found.

Return type:

int
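
The fallback described above can be illustrated with a minimal stand-in. This is a sketch, not the t_res implementation: SketchVocabulary, its word2id dictionary, the sequential integer IDs, and the choice to register the unknown token first (giving it ID 0) are all assumptions made for illustration.

```python
class SketchVocabulary:
    """Hypothetical stand-in that mimics the documented lookup behaviour."""

    unk_token = "#UNK#"

    def __init__(self):
        # Assumption: tokens map to sequential integer IDs.
        self.word2id = {}
        # Assumption: the unknown token is registered first, so it gets ID 0.
        self.add_to_vocab(self.unk_token)

    def add_to_vocab(self, token: str) -> None:
        # Assign the next free ID; tokens already present keep their ID.
        if token not in self.word2id:
            self.word2id[token] = len(self.word2id)

    def get_id(self, token: str) -> int:
        # Tokens that were never added fall back to the unknown token's ID.
        return self.word2id.get(token, self.word2id[self.unk_token])


vocab = SketchVocabulary()
vocab.add_to_vocab("London")

known_id = vocab.get_id("London")      # ID assigned when the token was added
missing_id = vocab.get_id("Atlantis")  # never added, so the unknown ID is returned
unk_id = vocab.get_id(SketchVocabulary.unk_token)
```

In this sketch a lookup never raises for an unseen token. Callers that need to tell a real hit from a fallback would compare the result with the ID of unk_token.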

static normalize(token: str, lower: Optional[bool] = False, digit_0: Optional[bool] = False) → str

Normalize the given token according to the lower and digit_0 flags.

Parameters:
  • token (str) – The token to be normalized.

  • lower (bool) – Flag indicating whether the token should be converted to lowercase. Defaults to False.

  • digit_0 (bool) – Flag indicating whether digits should be replaced with '0' during normalization. Defaults to False.

Returns:

The normalized token.

Return type:

str
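
The effect of the two flags can be sketched as follows. The function normalize_sketch is a hypothetical re-implementation based only on the parameter descriptions above. It assumes that "digits" means the ASCII characters 0 to 9, and the actual method may apply further rules that this page does not describe.

```python
import re


def normalize_sketch(token: str, lower: bool = False, digit_0: bool = False) -> str:
    """Hypothetical re-implementation of the two documented flags."""
    if digit_0:
        # Assumption: only ASCII digits are replaced, each with '0'.
        token = re.sub(r"[0-9]", "0", token)
    if lower:
        token = token.lower()
    return token


examples = [
    normalize_sketch("Room101"),                           # both flags off
    normalize_sketch("Room101", lower=True),               # lowercase only
    normalize_sketch("Room101", digit_0=True),             # digits only
    normalize_sketch("Room101", lower=True, digit_0=True)  # both flags on
]
```

With both flags left at their defaults, the token comes back unchanged.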

size() → int

Get the size of the vocabulary.

Returns:

The number of tokens in the vocabulary.

Return type:

int

unk_token = '#UNK#'