t_res.utils.preprocess_data module

This script reads the original data sources and formats them for our experiments.

t_res.utils.preprocess_data.turn_wikipedia2wikidata(wikipedia_title: str, wikipedia_path: str) → Optional[str]

Convert a Wikipedia title to its corresponding Wikidata ID.

Parameters:
  • wikipedia_title (str) – The title of the Wikipedia page.

  • wikipedia_path (str) – The path to the Wikipedia resources directory.

Returns:

The corresponding Wikidata ID if available, or None if not.

Return type:

Optional[str]

Example

>>> turn_wikipedia2wikidata("https://en.wikipedia.org/wiki/Colosseum", "../resources")
'Q10285'
>>> turn_wikipedia2wikidata("https://en.wikipedia.org/wiki/Ancient_Egypt", "../resources")
'Q11768'
>>> turn_wikipedia2wikidata("https://en.wikipedia.org/wiki/Invalid_Location", "../resources")
Warning: invalid_location is not in wikipedia2wikidata, the wkdt_qid will be None.
t_res.utils.preprocess_data.reconstruct_sentences(dTokens: dict) → dict

Reconstructs all sentences in the document based on the given dictionary of tokens (with their positional information in the document and associated annotations).

Parameters:

dTokens (dict) – A dictionary of tokens with their positional information and annotations.

Returns:

A dictionary mapping sentence IDs to their corresponding reconstructed sentences and character start positions.

Return type:

dict

Note

This function takes whitespace into account to ensure that character positions match.
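
Example

A minimal usage sketch. The token dictionary below is hypothetical and mirrors the structure produced by process_tsv (documented further down): keys are (sentence number, character start) tuples and values are six-element tuples ending with the token's start and end offsets.

from t_res.utils.preprocess_data import reconstruct_sentences

# Hypothetical token dictionary; in practice this comes from process_tsv.
dTokens = {
    (1, 0): ("Colosseum", "https://en.wikipedia.org/wiki/Colosseum", "BUILDING", 1, 0, 8),
    (1, 10): ("stands", None, None, 1, 10, 15),
    (1, 17): ("here", None, None, 1, 17, 20),
}

sentences = reconstruct_sentences(dTokens)
# sentences maps each sentence ID to its reconstructed text and start position,
# with whitespace re-inserted so that character offsets line up.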

t_res.utils.preprocess_data.process_lwm_for_ner(tsv_topres_path: str)

Process LwM data for training a Named Entity Recognition (NER) model.

Each sentence in the LwM data is assigned a unique identifier and consists of a list of tokens along with their associated NER tags in the BIO scheme, e.g.:

  • id: 10813493_1 # document_id + "_" + sentence_id

  • ner_tags: ['B-LOC', 'O']

  • tokens: ['INDIA', '.']

Parameters:

tsv_topres_path (str) – The path to the top-level directory containing the annotated TSV files.

Returns:

A DataFrame containing the processed LwM data for NER training, with the following columns:

  • id: The unique identifier of each sentence (<document_id>_<sentence_id>).

  • ner_tags: A list of NER tags assigned to each token in the sentence.

  • tokens: A list of tokens in the sentence.

Return type:

pandas.DataFrame

Note

The function expects the annotated TSV files to be located in the annotated_tsv directory inside the passed tsv_topres_path directory.

The tokens in each sentence are assumed to be ordered by their occurrence.
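
Example

A usage sketch; the path below is a placeholder for the top-level directory that contains the annotated_tsv folder.

from t_res.utils import preprocess_data

ner_df = preprocess_data.process_lwm_for_ner("../resources/topRes19th/")  # placeholder path
print(ner_df.columns.tolist())   # ['id', 'ner_tags', 'tokens']
print(ner_df.iloc[0]["id"])      # e.g. '10813493_1'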

t_res.utils.preprocess_data.process_lwm_for_linking(resources_dir: str, tsv_topres_path: str, gazetteer_ids: List[str]) → DataFrame

Process LwM data for performing entity linking.

The function reads the annotated TSV files and generates a DataFrame with one article per row. Each row includes the annotation and resolution information for the toponyms in that article.

Parameters:
  • resources_dir (str) – The path to the resources directory.

  • tsv_topres_path (str) – The path to the top-level directory containing the annotated TSV files.

  • gazetteer_ids (List[str]) – The set of Wikidata IDs in the gazetteer.

Returns:

A DataFrame with the following columns:
  • article_id: The identifier of the article.

  • sentences: A list of dictionaries containing the sentence position and text.

  • annotations: A list of dictionaries containing the annotation information for each toponym.

  • place: The place of publication.

  • decade: The decade of the publication.

  • year: The year of the publication.

  • ocr_quality_mean: The mean OCR quality score.

  • ocr_quality_sd: The OCR quality standard deviation.

  • publication_title: The title of the publication.

  • publication_code: The code of the publication.

Return type:

pandas.DataFrame

Note

The function expects the annotated TSV files to be located in the annotated_tsv directory inside the passed tsv_topres_path directory.

The metadata.tsv file containing metadata information needs to exist in the passed tsv_topres_path directory.
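
Example

A usage sketch; the paths and the gazetteer IDs below are placeholders for illustration (the gazetteer IDs would normally be loaded from the Wikidata-based gazetteer in the resources directory).

from t_res.utils import preprocess_data

linking_df = preprocess_data.process_lwm_for_linking(
    resources_dir="../resources/",                # placeholder path
    tsv_topres_path="../resources/topRes19th/",   # placeholder path
    gazetteer_ids=["Q84", "Q60"],                 # placeholder gazetteer IDs
)
print(linking_df[["article_id", "place", "year"]].head())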

t_res.utils.preprocess_data.aggregate_hipe_entities(entity: dict, lEntities: List[dict]) → List[dict]

Aggregate HIPE entities by joining consecutive tokens belonging to the same entity.

Parameters:
  • entity (dict) – The current entity to be aggregated.

  • lEntities (list) – The list of entities to be updated.

Returns:

The updated list of entities after aggregating the current entity.

Return type:

List[dict]

Example

>>> entity = {
        "ne_type": "I-LOC",
        "word": "York",
        "wkdt_qid": "Q60",
        "start": 12,
        "end": 15,
        "meto_type": "city",
    }
>>> lEntities = [
        {
            "ne_type": "B-LOC",
            "word": "New",
            "wkdt_qid": "Q60",
            "start": 8,
            "end": 10,
            "meto_type": "city",
        }
    ]
>>> updated_entities = aggregate_hipe_entities(entity, lEntities)
>>> print(updated_entities)
[
    {
        "ne_type": "B-LOC",
        "word": "New York",
        "wkdt_qid": "Q60",
        "start": 8,
        "end": 15,
        "meto_type": "city",
    }
]

Note

The function takes an entity and a list of entities and aggregates them by joining consecutive tokens that belong to the same entity. If the current entity is part of a multi-token entity (indicated by the "I-" prefix), it is joined with the previous detected entity. This helps to create complete and contiguous entities.
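
A sketch of how the aggregation is typically driven: token-level entities are folded into the running list one at a time, so that "I-"-prefixed tokens are merged into the preceding entity. The token sequence is taken from the example above.

from t_res.utils.preprocess_data import aggregate_hipe_entities

token_entities = [
    {"ne_type": "B-LOC", "word": "New", "wkdt_qid": "Q60", "start": 8, "end": 10, "meto_type": "city"},
    {"ne_type": "I-LOC", "word": "York", "wkdt_qid": "Q60", "start": 12, "end": 15, "meto_type": "city"},
]

aggregated = []
for entity in token_entities:
    aggregated = aggregate_hipe_entities(entity, aggregated)
# aggregated now contains a single "New York" entity spanning characters 8 to 15.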

t_res.utils.preprocess_data.process_hipe_for_linking(hipe_path: str, gazetteer_ids: List[str]) → DataFrame

Process HIPE data for performing entity linking.

The function reads the HIPE data from a file and generates a DataFrame with one article per row. Each row includes the annotation and resolution information for the toponyms in that article.

Parameters:
  • hipe_path (str) – The path to the HIPE data file.

  • gazetteer_ids (List[str]) – The set of Wikidata IDs in the gazetteer.

Returns:

A DataFrame with the following columns:
  • article_id: The identifier of the article.

  • sentences: A list of dictionaries containing the sentence position and text.

  • annotations: A list of dictionaries containing the annotation information for each toponym.

  • place: The place of publication.

  • decade: The decade of the publication.

  • year: The year of the publication.

  • ocr_quality_mean: The mean OCR quality score (None).

  • ocr_quality_sd: The OCR quality standard deviation (None).

  • publication_title: The title of the publication (empty string).

  • publication_code: The code of the publication.

Return type:

pandas.DataFrame

Note

The function assumes a specific format and structure of the HIPE data file.
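
Example

A usage sketch; the file path and the gazetteer IDs below are placeholders for illustration.

from t_res.utils import preprocess_data

hipe_df = preprocess_data.process_hipe_for_linking(
    hipe_path="../resources/hipe_test_en.tsv",  # placeholder path to the HIPE data file
    gazetteer_ids=["Q84", "Q60"],               # placeholder gazetteer IDs
)
print(hipe_df[["article_id", "year", "publication_code"]].head())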

t_res.utils.preprocess_data.process_tsv(filepath: str) → Tuple[dict, dict]

Process a TSV file in WebAnno 3.0 format containing token-level annotations.

Parameters:

filepath (str) – The path to the TSV file.

Returns:

A tuple containing two dictionaries:
  1. dMTokens: A dictionary of tokens with positional information and multi-token annotations. The keys in dMTokens are tuples of two elements (the sentence number in the document and the character position).

  2. dTokens: A dictionary of tokens with positional information, Wikipedia ID, label, and BIO annotations. The values of dTokens are tuples of six elements:

    1. the actual token,

    2. the Wikipedia URL,

    3. the toponym class,

    4. the sentence number in the document,

    5. the character position of a token in the document, and

    6. the character end position of a token in the document.

Return type:

tuple

Note

This function assumes a specific format and structure of the TSV file.
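
Example

A sketch showing how the two returned dictionaries can be combined with reconstruct_sentences (documented above); the file path is a placeholder.

from t_res.utils import preprocess_data

# Parse one annotated WebAnno 3.0 TSV file (placeholder path).
dMTokens, dTokens = preprocess_data.process_tsv("../resources/annotated_tsv/10813493.tsv")

# Rebuild the document's sentences from the token-level dictionary.
sentences = preprocess_data.reconstruct_sentences(dTokens)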

t_res.utils.preprocess_data.fine_to_coarse(l: List[str]) → List[str]

Convert a list of fine-grained tags to their corresponding coarse-grained equivalents.

Parameters:

l (list) – The list of fine-grained tags to be converted.

Returns:

A new list with the converted coarse-grained tags.

Return type:

list
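
Example

A minimal sketch of the kind of mapping this function performs, assuming fine-grained toponym tags such as B-BUILDING or I-STREET are collapsed to a coarse LOC class while the BIO prefix is preserved; the tag inventory and mapping shown here are assumptions for illustration, not the module's actual lookup table.

# Hypothetical stand-in for fine_to_coarse, shown only to illustrate the idea.
def fine_to_coarse_sketch(l):
    coarse = []
    for tag in l:
        if tag == "O":
            coarse.append(tag)             # non-entity tags pass through unchanged
        else:
            prefix, _ = tag.split("-", 1)  # keep the "B"/"I" prefix, drop the fine class
            coarse.append(prefix + "-LOC")
    return coarse

fine_to_coarse_sketch(["B-BUILDING", "I-BUILDING", "O"])  # ['B-LOC', 'I-LOC', 'O']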