t_res.utils.preprocess_data module

This script reads the original data sources and formats them for our experiments.

t_res.utils.preprocess_data.turn_wikipedia2wikidata(wikipedia_title: str, wikipedia_path: str) → Optional[str]

Convert a Wikipedia title to its corresponding Wikidata ID.

Parameters:
  • wikipedia_title (str) – The title of the Wikipedia page.

  • wikipedia_path (str) – The path to the Wikipedia resources directory.

Returns:

The corresponding Wikidata ID if available, or None if not.

Return type:

Optional[str]

Example

>>> turn_wikipedia2wikidata("https://en.wikipedia.org/wiki/Colosseum", "../resources")
'Q10285'
>>> turn_wikipedia2wikidata("https://en.wikipedia.org/wiki/Ancient_Egypt", "../resources")
'Q11768'
>>> turn_wikipedia2wikidata("https://en.wikipedia.org/wiki/Invalid_Location", "../resources")
Warning: invalid_location is not in wikipedia2wikidata, the wkdt_qid will be None.
t_res.utils.preprocess_data.reconstruct_sentences(dTokens: dict) → dict

Reconstructs all sentences in the document based on the given dictionary of tokens (with their positional information in the document and associated annotations).

Parameters:

dTokens (dict) – A dictionary of tokens with their positional information and annotations.

Returns:

A dictionary mapping sentence IDs to their corresponding reconstructed sentences and character start positions.

Return type:

dict

Note

This function takes whitespace into account to ensure that character positions match.
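
Example

A minimal usage sketch. The token dictionary below is hypothetical and mirrors the structure produced by process_tsv (documented further down): keys are (sentence number, character start) tuples and values are six-element tuples ending with the token's start and end offsets.

from t_res.utils.preprocess_data import reconstruct_sentences

# Hypothetical token dictionary; in practice this comes from process_tsv.
dTokens = {
    (1, 0): ("Colosseum", "https://en.wikipedia.org/wiki/Colosseum", "BUILDING", 1, 0, 8),
    (1, 10): ("stands", None, None, 1, 10, 15),
    (1, 17): ("here", None, None, 1, 17, 20),
}

sentences = reconstruct_sentences(dTokens)
# sentences maps each sentence ID to its reconstructed text and start position,
# with whitespace re-inserted so that character offsets line up.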

t_res.utils.preprocess_data.process_lwm_for_ner(tsv_topres_path: str)

Process LwM data for training a Named Entity Recognition (NER) model.

Each sentence in the LwM data is assigned a unique identifier and consists of a list of tokens along with their associated NER tags in the BIO scheme, e.g.:

  • id: 10813493_1 # document_id + "_" + sentence_id

  • ner_tags: ['B-LOC', 'O']

  • tokens: ['INDIA', '.']

Parameters:

tsv_topres_path (str) – The path to the top-level directory containing the annotated TSV files.

Returns:

A DataFrame containing the processed LwM data for NER training, with the following columns:

  • id: The unique identifier of each sentence (<document_id>_<sentence_id>).

  • ner_tags: A list of NER tags assigned to each token in the sentence.

  • tokens: A list of tokens in the sentence.

Return type:

pandas.DataFrame

Note

The function expects the annotated TSV files to be located in the annotated_tsv directory inside the passed tsv_topres_path directory.

The tokens in each sentence are assumed to be ordered by their occurrence.
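
Example

A usage sketch; the path below is a placeholder for the top-level directory that contains the annotated_tsv folder.

from t_res.utils import preprocess_data

ner_df = preprocess_data.process_lwm_for_ner("../resources/topRes19th/")  # placeholder path
print(ner_df.columns.tolist())   # ['id', 'ner_tags', 'tokens']
print(ner_df.iloc[0]["id"])      # e.g. '10813493_1'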

t_res.utils.preprocess_data.process_lwm_for_linking(resources_dir: str, tsv_topres_path: str, gazetteer_ids: List[str]) → DataFrame

Process LwM data for performing entity linking.

The function reads the annotated TSV files and generates a DataFrame with one article per row. Each row includes the annotation and resolution information for the toponyms in that article.

Parameters:
  • resources_dir (str) – The path to the resources directory.

  • tsv_topres_path (str) – The path to the top-level directory containing the annotated TSV files.

  • gazetteer_ids (List[str]) – The set of Wikidata IDs in the gazetteer.

Returns:

A DataFrame with the following columns:
  • article_id: The identifier of the article.

  • sentences: A list of dictionaries containing the sentence position and text.

  • annotations: A list of dictionaries containing the annotation information for each toponym.

  • place: The place of publication.

  • decade: The decade of the publication.

  • year: The year of the publication.

  • ocr_quality_mean: The mean OCR quality score.

  • ocr_quality_sd: The OCR quality standard deviation.

  • publication_title: The title of the publication.

  • publication_code: The code of the publication.

Return type:

pandas.DataFrame

Note

The function expects the annotated TSV files to be located in the annotated_tsv directory inside the passed tsv_topres_path directory.

The metadata.tsv file containing metadata information needs to exist in the passed tsv_topres_path directory.
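
Example

A usage sketch; the paths and the gazetteer IDs below are placeholders for illustration (the gazetteer IDs would normally be loaded from the Wikidata-based gazetteer in the resources directory).

from t_res.utils import preprocess_data

linking_df = preprocess_data.process_lwm_for_linking(
    resources_dir="../resources/",                # placeholder path
    tsv_topres_path="../resources/topRes19th/",   # placeholder path
    gazetteer_ids=["Q84", "Q60"],                 # placeholder gazetteer IDs
)
print(linking_df[["article_id", "place", "year"]].head())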

t_res.utils.preprocess_data.aggregate_hipe_entities(entity: dict, lEntities: List[dict]) → List[dict]

Aggregate HIPE entities by joining consecutive tokens belonging to the same entity.

Parameters:
  • entity (dict) – The current entity to be aggregated.

  • lEntities (list) – The list of entities to be updated.

Returns:

The updated list of entities after aggregating the current entity.

Return type:

List[dict]

Example

>>> entity = {
        "ne_type": "I-LOC",
        "word": "York",
        "wkdt_qid": "Q60",
        "start": 12,
        "end": 15,
        "meto_type": "city",
    }
>>> lEntities = [
        {
            "ne_type": "B-LOC",
            "word": "New",
            "wkdt_qid": "Q60",
            "start": 8,
            "end": 10,
            "meto_type": "city",
        }
    ]
>>> updated_entities = aggregate_hipe_entities(entity, lEntities)
>>> print(updated_entities)
[
    {
        "ne_type": "B-LOC",
        "word": "New York",
        "wkdt_qid": "Q60",
        "start": 8,
        "end": 15,
        "meto_type": "city",
    }
]

Note

The function takes an entity and a list of entities and aggregates them by joining consecutive tokens that belong to the same entity. If the current entity is part of a multi-token entity (indicated by the "I-" prefix), it is joined with the previous detected entity. This helps to create complete and contiguous entities.
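
A sketch of how the aggregation is typically driven: token-level entities are folded into the running list one at a time, so that "I-"-prefixed tokens are merged into the preceding entity. The token sequence is taken from the example above.

from t_res.utils.preprocess_data import aggregate_hipe_entities

token_entities = [
    {"ne_type": "B-LOC", "word": "New", "wkdt_qid": "Q60", "start": 8, "end": 10, "meto_type": "city"},
    {"ne_type": "I-LOC", "word": "York", "wkdt_qid": "Q60", "start": 12, "end": 15, "meto_type": "city"},
]

aggregated = []
for entity in token_entities:
    aggregated = aggregate_hipe_entities(entity, aggregated)
# aggregated now contains a single "New York" entity spanning characters 8 to 15.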

t_res.utils.preprocess_data.process_hipe_for_linking(hipe_path: str, gazetteer_ids: List[str]) → DataFrame

Process HIPE data for performing entity linking.

The function reads the HIPE data from a file and generates a DataFrame with one article per row. Each row includes the annotation and resolution information for the toponyms in that article.

Parameters:
  • hipe_path (str) – The path to the HIPE data file.

  • gazetteer_ids (List[str]) – The set of Wikidata IDs in the gazetteer.

Returns:

A DataFrame with the following columns:
  • article_id: The identifier of the article.

  • sentences: A list of dictionaries containing the sentence position and text.

  • annotations: A list of dictionaries containing the annotation information for each toponym.

  • place: The place of publication.

  • decade: The decade of the publication.

  • year: The year of the publication.

  • ocr_quality_mean: The mean OCR quality score (None).

  • ocr_quality_sd: The OCR quality standard deviation (None).

  • publication_title: The title of the publication (empty string).

  • publication_code: The code of the publication.

Return type:

pandas.DataFrame

Note

The function assumes a specific format and structure of the HIPE data file.
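
Example

A usage sketch; the file path and the gazetteer IDs below are placeholders for illustration.

from t_res.utils import preprocess_data

hipe_df = preprocess_data.process_hipe_for_linking(
    hipe_path="../resources/hipe_test_en.tsv",  # placeholder path to the HIPE data file
    gazetteer_ids=["Q84", "Q60"],               # placeholder gazetteer IDs
)
print(hipe_df[["article_id", "year", "publication_code"]].head())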

t_res.utils.preprocess_data.process_tsv(filepath: str) → Tuple[dict, dict]

Process a TSV file in WebAnno 3.0 format containing token-level annotations.

Parameters:

filepath (str) – The path to the TSV file.

Returns:

A tuple containing two dictionaries:
  1. dMTokens: A dictionary of tokens with positional information and multi-token annotations. The keys in dMTokens are tuples of two elements (the sentence number in the document and the character position).

  2. dTokens: A dictionary of tokens with positional information, Wikipedia ID, label, and BIO annotations. The values of dTokens are tuples of six elements:

    1. the actual token,

    2. the Wikipedia URL,

    3. the toponym class,

    4. the sentence number in the document,

    5. the character position of a token in the document, and

    6. the character end position of a token in the document.

Return type:

tuple

Note

This function assumes a specific format and structure of the TSV file.
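
Example

A sketch showing how the two returned dictionaries can be combined with reconstruct_sentences (documented above); the file path is a placeholder.

from t_res.utils import preprocess_data

# Parse one annotated WebAnno 3.0 TSV file (placeholder path).
dMTokens, dTokens = preprocess_data.process_tsv("../resources/annotated_tsv/10813493.tsv")

# Rebuild the document's sentences from the token-level dictionary.
sentences = preprocess_data.reconstruct_sentences(dTokens)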

t_res.utils.preprocess_data.fine_to_coarse(l: List[str]) → List[str]

Convert a list of fine-grained tags to their corresponding coarse-grained equivalents.

Parameters:

l (list) – The list of fine-grained tags to be converted.

Returns:

A new list with the converted coarse-grained tags.

Return type:

list
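
Example

A minimal sketch of the kind of mapping this function performs, assuming fine-grained toponym tags such as B-BUILDING or I-STREET are collapsed to a coarse LOC class while the BIO prefix is preserved; the tag inventory and mapping shown here are assumptions for illustration, not the module's actual lookup table.

# Hypothetical stand-in for fine_to_coarse, shown only to illustrate the idea.
def fine_to_coarse_sketch(l):
    coarse = []
    for tag in l:
        if tag == "O":
            coarse.append(tag)             # non-entity tags pass through unchanged
        else:
            prefix, _ = tag.split("-", 1)  # keep the "B"/"I" prefix, drop the fine class
            coarse.append(prefix + "-LOC")
    return coarse

fine_to_coarse_sketch(["B-BUILDING", "I-BUILDING", "O"])  # ['B-LOC', 'I-LOC', 'O']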