t_res.utils.preprocess_data module
This script reads the original data sources and formats them for our experiments.
- t_res.utils.preprocess_data.turn_wikipedia2wikidata(wikipedia_title: str, wikipedia_path: str) Optional[str]
Convert a Wikipedia title to its corresponding Wikidata ID.
- Parameters:
wikipedia_title (str) – The Wikipedia title to convert, given as a full Wikipedia URL.
wikipedia_path (str) – The path to the directory containing the Wikipedia resources.
- Returns:
The corresponding Wikidata ID if available, or None if not.
- Return type:
Optional[str]
Example
>>> turn_wikipedia2wikidata("https://en.wikipedia.org/wiki/Colosseum", "../resources")
'Q10285'
>>> turn_wikipedia2wikidata("https://en.wikipedia.org/wiki/Ancient_Egypt", "../resources")
'Q11768'
>>> turn_wikipedia2wikidata("https://en.wikipedia.org/wiki/Invalid_Location", "../resources")
Warning: invalid_location is not in wikipedia2wikidata, the wkdt_qid will be None.
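The example output suggests that the function extracts the page title from the URL, normalises it, and looks it up in a precomputed Wikipedia-to-Wikidata mapping stored under wikipedia_path. A minimal sketch of such a lookup, assuming a hypothetical pickled dictionary (wikipedia2wikidata.pkl) and a hypothetical function name:

import os
import pickle
from urllib.parse import unquote

def lookup_wikidata_id(wikipedia_title: str, wikipedia_path: str):
    # Extract the page title from the full URL and normalise it the way
    # the warning in the example above suggests (unquoted, lowercased).
    title = unquote(wikipedia_title.split("/wiki/")[-1]).lower()
    # Hypothetical resource: a pickled dict mapping titles to Wikidata QIDs.
    with open(os.path.join(wikipedia_path, "wikipedia2wikidata.pkl"), "rb") as f:
        wikipedia2wikidata = pickle.load(f)
    if title not in wikipedia2wikidata:
        print(f"Warning: {title} is not in wikipedia2wikidata, the wkdt_qid will be None.")
        return None
    return wikipedia2wikidata[title]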
- t_res.utils.preprocess_data.reconstruct_sentences(dTokens: dict) dict
Reconstructs all sentences in the document based on the given dictionary of tokens (with their positional information in the document and associated annotations).
- Parameters:
dTokens (dict) – A dictionary of tokens with their positional information and annotations.
- Returns:
A dictionary mapping sentence IDs to their corresponding reconstructed sentences and character start positions.
- Return type:
dict
Note
This function takes into account white spaces to ensure character positions match.
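The exact internals are an implementation detail of the module, but the whitespace bookkeeping the note refers to could look roughly like the following sketch. The function name rebuild_sentences is hypothetical, the token-tuple layout is borrowed from process_tsv (documented below), and end-inclusive character offsets are assumed, as in the aggregate_hipe_entities example further down:

def rebuild_sentences(dTokens: dict) -> dict:
    # Values follow the six-element layout described under process_tsv:
    # (token, wiki_url, toponym_class, sentence_no, char_start, char_end).
    sentences = {}
    for token, _, _, sent_no, start, end in sorted(
        dTokens.values(), key=lambda t: (t[3], t[4])
    ):
        if sent_no not in sentences:
            # Remember the character offset at which the sentence starts.
            sentences[sent_no] = {"start": start, "text": token, "cursor": end}
        else:
            s = sentences[sent_no]
            # Pad with spaces so character positions keep matching the document.
            s["text"] += " " * max(0, start - s["cursor"] - 1) + token
            s["cursor"] = end
    return {n: (s["text"], s["start"]) for n, s in sentences.items()}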
- t_res.utils.preprocess_data.process_lwm_for_ner(tsv_topres_path: str)
Process LwM data for training a Named Entity Recognition (NER) model.
Each sentence in the LwM data is assigned a unique identifier, and consists of a list of tokens along with their associated NER tags using the BIO scheme, e.g.:
id: 10813493_1  # document_id + "_" + sentence_id
ner_tags: ['B-LOC', 'O']
tokens: ['INDIA', '.']
- Parameters:
tsv_topres_path (str) – The path to the top-level directory containing the annotated TSV files.
- Returns:
A DataFrame containing the processed LwM data for NER training, with the following columns:
    - id: The unique identifier of each sentence (<document_id>_<sentence_id>).
    - ner_tags: A list of NER tags assigned to each token in the sentence.
    - tokens: A list of tokens in the sentence.
- Return type:
DataFrame
Note
The function expects the annotated TSV files to be located in the annotated_tsv directory inside the passed tsv_topres_path directory. The tokens in each sentence are assumed to be ordered by their occurrence.
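A hypothetical usage sketch (the path is illustrative, and the expected values are taken from the example above):

from t_res.utils.preprocess_data import process_lwm_for_ner

# Hypothetical path to the top-level directory with the annotated TSV files.
ner_df = process_lwm_for_ner("../resources/topres19th/")

# Each row is one sentence: an id, its tokens, and the aligned BIO tags.
row = ner_df.iloc[0]
print(row["id"])        # e.g. "10813493_1"
print(row["tokens"])    # e.g. ['INDIA', '.']
print(row["ner_tags"])  # e.g. ['B-LOC', 'O']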
- t_res.utils.preprocess_data.process_lwm_for_linking(resources_dir: str, tsv_topres_path: str, gazetteer_ids: List[str]) DataFrame
Process LwM data for performing entity linking.
The function reads the annotated TSV files and generates a DataFrame with one article per row. Each row includes the annotation and resolution information of the article's toponyms.
- Parameters:
resources_dir (str) – The path to the directory containing the resources.
tsv_topres_path (str) – The path to the top-level directory containing the annotated TSV files.
gazetteer_ids (List[str]) – The list of Wikidata IDs covered by the gazetteer.
- Returns:
- A DataFrame with the following columns:
    - article_id: The identifier of the article.
    - sentences: A list of dictionaries containing the sentence position and text.
    - annotations: A list of dictionaries containing the annotation information for each toponym.
    - place: The place of publication.
    - decade: The decade of the publication.
    - year: The year of the publication.
    - ocr_quality_mean: The mean OCR quality score.
    - ocr_quality_sd: The OCR quality standard deviation.
    - publication_title: The title of the publication.
    - publication_code: The code of the publication.
- Return type:
DataFrame
Note
The function expects the annotated TSV files to be located in the annotated_tsv directory inside the passed tsv_topres_path directory. The metadata.tsv file containing metadata information needs to exist in the passed tsv_topres_path directory.
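A hypothetical usage sketch (paths and gazetteer IDs are illustrative, and the annotations column is assumed to hold the list of dictionaries described above):

from t_res.utils.preprocess_data import process_lwm_for_linking

linking_df = process_lwm_for_linking(
    resources_dir="../resources/",
    tsv_topres_path="../resources/topres19th/",
    gazetteer_ids=["Q84", "Q60"],
)

# Inspect the toponym annotations attached to the first article.
for annotation in linking_df.iloc[0]["annotations"]:
    print(annotation)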
- t_res.utils.preprocess_data.aggregate_hipe_entities(entity: dict, lEntities: List[dict]) List[dict]
Aggregate HIPE entities by joining consecutive tokens belonging to the same entity.
- Parameters:
entity (dict) – The current entity (token) to be aggregated.
lEntities (List[dict]) – The list of previously detected entities.
- Returns:
The updated list of entities after aggregating the current entity.
- Return type:
List[dict]
Example
>>> entity = {
...     "ne_type": "I-LOC",
...     "word": "York",
...     "wkdt_qid": "Q60",
...     "start": 12,
...     "end": 15,
...     "meto_type": "city",
... }
>>> lEntities = [
...     {
...         "ne_type": "B-LOC",
...         "word": "New",
...         "wkdt_qid": "Q60",
...         "start": 8,
...         "end": 10,
...         "meto_type": "city",
...     }
... ]
>>> updated_entities = aggregate_hipe_entities(entity, lEntities)
>>> print(updated_entities)
[
    {
        "ne_type": "B-LOC",
        "word": "New York",
        "wkdt_qid": "Q60",
        "start": 8,
        "end": 15,
        "meto_type": "city",
    }
]
Note
The function takes an entity and a list of entities and aggregates them by joining consecutive tokens that belong to the same entity. If the current entity is part of a multi-token entity (indicated by the "I-" prefix), it is joined with the previous detected entity. This helps to create complete and contiguous entities.
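The note above pins down the aggregation rule; a minimal sketch of it, using the field names from the example and assuming end-inclusive character offsets (as the "New"/"York" spans suggest):

from typing import List

def aggregate_entity(entity: dict, lEntities: List[dict]) -> List[dict]:
    # Hypothetical re-implementation for illustration only.
    if entity["ne_type"].startswith("I-") and lEntities:
        prev = lEntities[-1]
        # Join the token onto the previous entity, padding with spaces so
        # the merged surface form matches the original character span
        # (offsets appear end-inclusive in the example above).
        gap = entity["start"] - prev["end"] - 1
        prev["word"] += " " * max(0, gap) + entity["word"]
        prev["end"] = entity["end"]
    else:
        # A "B-" (or any non-continuation) token starts a new entity.
        lEntities.append(entity)
    return lEntities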
- t_res.utils.preprocess_data.process_hipe_for_linking(hipe_path: str, gazetteer_ids: List[str]) DataFrame
Process HIPE data for performing entity linking.
The function reads HIPE data from a file and generates a DataFrame with one article per row. Each row includes the annotation and resolution information of the article's toponyms.
- Parameters:
hipe_path (str) – The path to the file containing the HIPE data.
gazetteer_ids (List[str]) – The list of Wikidata IDs covered by the gazetteer.
- Returns:
- A DataFrame with the following columns:
    - article_id: The identifier of the article.
    - sentences: A list of dictionaries containing the sentence position and text.
    - annotations: A list of dictionaries containing the annotation information for each toponym.
    - place: The place of publication.
    - decade: The decade of the publication.
    - year: The year of the publication.
    - ocr_quality_mean: The mean OCR quality score (None).
    - ocr_quality_sd: The OCR quality standard deviation (None).
    - publication_title: The title of the publication (empty string).
    - publication_code: The code of the publication.
- Return type:
DataFrame
Note
The function assumes a specific format and structure of the HIPE data file.
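Since the returned DataFrame shares its column schema with the output of process_lwm_for_linking, the two outputs can be combined for joint experiments. A hypothetical sketch (all paths and gazetteer IDs are illustrative):

import pandas as pd

from t_res.utils.preprocess_data import (
    process_hipe_for_linking,
    process_lwm_for_linking,
)

gazetteer_ids = ["Q84", "Q60"]
lwm_df = process_lwm_for_linking("../resources/", "../resources/topres19th/", gazetteer_ids)
hipe_df = process_hipe_for_linking("../resources/hipe/hipe_data.tsv", gazetteer_ids)

# Both share the same columns, so they can be concatenated directly.
combined = pd.concat([lwm_df, hipe_df], ignore_index=True)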
- t_res.utils.preprocess_data.process_tsv(filepath: str) Tuple[dict, dict]
Process a TSV file in WebAnno 3.0 format containing token-level annotations.
- Parameters:
filepath (str) – The path to the TSV file.
- Returns:
- A tuple containing two dictionaries:
    - dMTokens: A dictionary of tokens with positional information and multi-token annotations. The keys in dMTokens are tuples of two elements (the sentence number in the document, and the character position).
    - dTokens: A dictionary of tokens with positional information, Wikipedia ID, label, and BIO annotations. The values of dTokens are tuples of six elements:
        1. the actual token,
        2. the Wikipedia URL,
        3. the toponym class,
        4. the sentence number in the document,
        5. the character position of a token in the document, and
        6. the character end position of a token in the document.
- Return type:
Tuple[dict, dict]
Note
This function assumes a specific format and structure of the TSV file.
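The second returned dictionary feeds directly into reconstruct_sentences (documented above). A hypothetical end-to-end sketch with an illustrative file path:

from t_res.utils.preprocess_data import process_tsv, reconstruct_sentences

dMTokens, dTokens = process_tsv("../resources/topres19th/annotated_tsv/10813493.tsv")

# Rebuild the document's sentences from the token-level dictionary.
for sent_id, sentence in reconstruct_sentences(dTokens).items():
    # Each value carries the reconstructed sentence text and its character
    # start position (the exact container shape is not specified here).
    print(sent_id, sentence)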