t_res.geoparser.recogniser.Recogniser

class t_res.geoparser.recogniser.Recogniser(model: str, train_dataset: Optional[str] = '', test_dataset: Optional[str] = '', pipe: Optional[Pipeline] = None, base_model: Optional[str] = '', model_path: Optional[str] = '', training_args: Optional[dict] = {'batch_size': 8, 'learning_rate': 5e-05, 'num_train_epochs': 10, 'weight_decay': 0.0}, overwrite_training: Optional[bool] = False, do_test: Optional[bool] = False, load_from_hub: Optional[bool] = False)

A class for training and using a toponym recogniser with the specified parameters.

Parameters:
  • model (str) – The name of the NER model.

  • train_dataset (str, optional) – Path to the training dataset (default: "").

  • test_dataset (str, optional) – Path to the testing dataset (default: "").

  • pipe (transformers.Pipeline, optional) – A pre-loaded NER pipeline (default: None).

  • base_model (str, optional) – The name of the base model, for fine-tuning (default: "").

  • model_path (str, optional) – Path to store the trained model (default: "").

  • training_args (dict, optional) – Additional fine-tuning training arguments (default: {"batch_size": 8, "num_train_epochs": 10, "learning_rate": 0.00005, "weight_decay": 0.0}).

  • overwrite_training (bool, optional) – Whether to overwrite an existing trained model (default: False).

  • do_test (bool, optional) – Whether to train in test mode (default: False).

  • load_from_hub (bool, optional) – Whether to load the model from HuggingFace model hub or locally (default: False).
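
When only one or two hyperparameters need changing, it can help to start from the documented defaults. The following is a minimal sketch: DEFAULT_TRAINING_ARGS and build_training_args are hypothetical helper names, not part of T-Res. It assumes that passing a complete dictionary is the safe choice, since this page does not state whether a partial training_args is merged with the defaults.

```python
# Documented defaults of the training_args parameter.
DEFAULT_TRAINING_ARGS = {
    "batch_size": 8,
    "num_train_epochs": 10,
    "learning_rate": 5e-05,
    "weight_decay": 0.0,
}


def build_training_args(**overrides):
    """Return a new, complete training_args dict with some values overridden.

    Hypothetical helper: the defaults dict itself is never mutated.
    """
    return {**DEFAULT_TRAINING_ARGS, **overrides}


# Lower the learning rate and shorten training; keep the other defaults.
training_args = build_training_args(learning_rate=3e-05, num_train_epochs=5)
```

The resulting dictionary could then be passed as the training_args argument when constructing a Recogniser.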

Example

>>> from t_res.geoparser.recogniser import Recogniser
>>> # Create an instance of the Recogniser class
>>> recogniser = Recogniser(
...     model="ner-model",
...     train_dataset="train.json",
...     test_dataset="test.json",
...     base_model="bert-base-uncased",
...     model_path="/path/to/model/",
...     training_args={
...         "batch_size": 8,
...         "num_train_epochs": 10,
...         "learning_rate": 0.00005,
...         "weight_decay": 0.0,
...     },
...     overwrite_training=False,
...     do_test=False,
...     load_from_hub=False,
... )
>>> # Create and load the NER pipeline
>>> pipeline = recogniser.create_pipeline()
>>> # Train the model
>>> recogniser.train()
>>> # Predict named entities in a sentence
>>> sentence = "I live in London."
>>> predictions = recogniser.ner_predict(sentence)
>>> print(predictions)
create_pipeline() → Pipeline

Creates and loads a Named Entity Recognition (NER) pipeline.

Returns:

The created NER pipeline.

Return type:

transformers.Pipeline

Note

This method creates and loads a NER pipeline for named entity recognition. The pipeline is initialised from the specified model name, combined with the model path when the model is stored locally rather than obtained from the HuggingFace model hub. The created pipeline is stored in the pipe attribute of the Recogniser object and is also returned by the method.
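
The choice between a hub identifier and a local path can be sketched as follows. This is an illustration only: resolve_model_location is a hypothetical helper, and joining model_path with the bare model name is an assumption, since this page does not specify how T-Res names the model on disk.

```python
import os


def resolve_model_location(model: str, model_path: str, load_from_hub: bool) -> str:
    """Return the identifier a NER pipeline would be loaded from.

    Hypothetical helper; the on-disk layout used here is an assumption.
    """
    if load_from_hub:
        # The model name is treated as a HuggingFace hub identifier.
        return model
    # Otherwise the model is expected under the local model path.
    return os.path.join(model_path, model)


hub_location = resolve_model_location("some-org/ner-model", "", load_from_hub=True)
local_location = resolve_model_location("ner-model", "/path/to/model/", load_from_hub=False)
```

A location resolved this way is the kind of value that would be handed to the transformers pipeline factory, for example as pipeline("ner", model=location).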

ner_predict(sentence: str) → List[dict]

Predicts named entities in a given sentence using the NER pipeline.

Parameters:

sentence (str) – The input sentence.

Returns:

A list of dictionaries representing the predicted named entities. Each dictionary contains the keys "word", "entity", "score", "start", and "end", representing the entity text, the entity label, the confidence score, and the start and end character positions of the entity in the sentence, respectively. For example:

{
    "word": "From",
    "entity": "O",
    "score": 0.99975187,
    "start": 0,
    "end": 4
}

Return type:

List[dict]
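
Because the returned list includes entries labelled "O", as in the example above, callers will often want to filter it. The sketch below works on a hand-written list in the documented format: keep_entities is a hypothetical helper, and the "B-LOC" label and the second score are illustrative assumptions rather than real model output.

```python
from typing import Dict, List


def keep_entities(predictions: List[Dict]) -> List[Dict]:
    """Drop predictions whose label is "O" (not part of any entity)."""
    return [p for p in predictions if p["entity"] != "O"]


# Hand-written sample in the documented output format (not real model output).
sample = [
    {"word": "From", "entity": "O", "score": 0.99975187, "start": 0, "end": 4},
    {"word": "London", "entity": "B-LOC", "score": 0.98, "start": 5, "end": 11},
]

entities = keep_entities(sample)
# Collect (text, start, end) triples for the remaining predictions.
spans = [(p["word"], p["start"], p["end"]) for p in entities]
```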

Note

This method takes a sentence as input and uses the NER pipeline to predict named entities in the sentence.

Any n-dash characters in the provided sentence are replaced with commas (,) to work around parsing issues caused by n-dashes in OCR text from historical newspapers.
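
The dash replacement can be illustrated with a short sketch. It assumes that "n-dash" means the en dash (U+2013); the exact character that T-Res replaces is not shown on this page, and replace_dashes is a hypothetical name.

```python
# Assumption: "n-dash" refers to the en dash, U+2013.
EN_DASH = "\u2013"


def replace_dashes(sentence: str) -> str:
    """Replace every dash with a comma, one character for one character."""
    return sentence.replace(EN_DASH, ",")


raw = "Manchester" + EN_DASH + "Liverpool railway"
cleaned = replace_dashes(raw)
```

In this sketch the substitution swaps one character for another, so the sentence length is unchanged and character offsets computed on the cleaned text still index the same positions in the original.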

train() → None

Trains a NER model.

Returns:

None.

Note

If the model is obtained from the HuggingFace model hub (load_from_hub=True) or if the model already exists at the specified model path and overwrite_training is False, training is skipped.

Otherwise, the training process is executed, including the loading of datasets, model, and tokenizer, tokenization and alignment of labels, computation of evaluation metrics, training using the Trainer object, evaluation, and saving the trained model.
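
The "tokenization and alignment of labels" step is the least obvious part of this process. The sketch below follows the convention used in HuggingFace token classification examples, where positions labelled -100 are ignored by the loss. Whether T-Res labels only the first sub-word of each word, as done here, is an assumption, and align_labels is a hypothetical name.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # label value conventionally ignored by the loss


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Expand word-level labels to sub-word tokens.

    word_ids maps each token to the index of its source word, with None
    for special tokens. The first piece of each word keeps the word's
    label; special tokens and later pieces get IGNORE_INDEX.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:
            aligned.append(word_labels[word_id])
        else:
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned


# Illustrative tokenisation: [CLS] I live in Lon ##don [SEP]
word_ids = [None, 0, 1, 2, 3, 3, None]
word_labels = [0, 0, 0, 1]  # one label id per word; 1 marks the toponym
aligned = align_labels(word_ids, word_labels)
```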

Training runs in test mode if do_test was set to True when the Recogniser object was initialised.
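
The skip conditions described in this note can be summarised as a small predicate. This is a sketch of the documented behaviour, not the library's code: should_skip_training is a hypothetical name, and model_exists stands for whatever check T-Res performs on the model path.

```python
def should_skip_training(
    load_from_hub: bool, model_exists: bool, overwrite_training: bool
) -> bool:
    """Return True when the documented conditions for skipping training hold."""
    if load_from_hub:
        # Models taken from the HuggingFace hub are never retrained.
        return True
    # A local model is retrained only if it is missing or overwriting is allowed.
    return model_exists and not overwrite_training


skip_existing = should_skip_training(False, model_exists=True, overwrite_training=False)
retrain_existing = should_skip_training(False, model_exists=True, overwrite_training=True)
```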

Credit:

This function is adapted from a HuggingFace tutorial.