.. _top-resources:

=================================
Resources and directory structure
=================================

T-Res requires several resources to work. Some can be downloaded and loaded
directly from the web; others need to be generated, following the instructions
provided in this section. This page describes the format of the files required
by T-Res, so that you can also supply your own resources (adapted to T-Res).

Toponym recognition and disambiguation training data
----------------------------------------------------

We provide the dataset we used to train T-Res for the tasks of toponym
recognition (i.e. a named entity recognition task) and toponym disambiguation
(i.e. an entity linking task focused on geographical entities) in English. The
dataset is based on the `TopRes19th dataset `_.

.. note::

    You can download the data (in the format required by T-Res) from the
    `British Library research repository `_.

By default, T-Res assumes the files are stored in the following location:

::

    T-Res/
    └── experiments/
        └── outputs/
            └── data/
                └── lwm/
                    ├── ner_fine_dev.json
                    ├── ner_fine_test.json
                    └── linking_df_split.tsv

Continue reading the sections below to learn more about the datasets and the
format expected by T-Res.

1. Toponym recognition dataset
##############################

.. note::

    You don't need a toponym recognition dataset if you load a NER model
    directly from the HuggingFace hub or from a local folder. In that case,
    you can skip this section.

T-Res can directly load a pre-trained BERT-based NER model, either locally or
from the HuggingFace models hub. If, instead, you want to train your own NER
model using either our dataset or a different one, continue reading.

T-Res requires that the data for training a NER model is provided as two JSON
files (one for training, one for testing) in the JSON Lines format, where each
line corresponds to a sentence. Each sentence is a dictionary with three
key-value pairs: ``id`` (a string identifying the sentence), ``tokens`` (the
list of tokens into which the sentence has been split), and ``ner_tags`` (the
list of annotations per token, in the BIO format). The length of ``tokens``
and ``ner_tags`` is therefore always the same. See below an example of three
lines from one of the JSON files, corresponding to three annotated sentences:

.. code-block:: json

    {"id":"3896239_29","ner_tags":["O","B-STREET","I-STREET","O","O","O","B-BUILDING","I-BUILDING","O","O","O","O","O","O","O","O","O","O"],"tokens":[",","Old","Millgate",",","to","the","Collegiate","Church",",","where","they","arrived","a","little","after","ten","oclock","."]}
    {"id":"8262498_11","ner_tags":["O","O","O","O","O","O","O","O","O","O","O","B-LOC","O","B-LOC","O","O","O","O","O","O"],"tokens":["On","the","'","JSth","November","the","ship","Santo","Christo",",","from","Monteveido","to","Cadiz",",","with","hides","and","copper","."]}
    {"id":"10715509_7","ner_tags":["O","O","O","B-LOC","O","O","O","O","O","O","O","O","O","O","O","O"],"tokens":["A","COACH","to","SOUTHAMPTON",",","every","morning","at","a","quarter","before","6",",","Sundays","excepted","."]}

Note that the list of NER labels is automatically detected from the training
data.
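As a quick check of the format, each line of these files should parse as a
standalone JSON object whose ``tokens`` and ``ner_tags`` lists are aligned.
The sketch below illustrates this; the file path is just an example:

.. code-block:: python

    import json

    # Hypothetical path: adjust to wherever your NER data file lives.
    path = "experiments/outputs/data/lwm/ner_fine_dev.json"

    with open(path, "r", encoding="utf-8") as f:
        # JSON Lines: one JSON object (one sentence) per line.
        sentences = [json.loads(line) for line in f if line.strip()]

    for sent in sentences[:3]:
        # Every sentence must have exactly one NER tag per token.
        assert len(sent["tokens"]) == len(sent["ner_tags"])
        print(sent["id"], list(zip(sent["tokens"], sent["ner_tags"]))[:5])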
2. Toponym disambiguation dataset
#################################

.. note::

    You won't need a toponym disambiguation dataset if you use the
    unsupervised approach for linking (i.e. ``mostpopular``). You will need
    one if you want to use one of the REL-based approaches.

Training and test examples are required to train a new entity disambiguation
(ED) model. They should be provided in a single TSV file named
``linking_df_split.tsv``, with one document per row and the following required
columns:

* ``article_id``: the article identifier, i.e. the number in the document file
  name in the original dataset (for example, the ``article_id`` of
  ``1218_Poole1860.tsv`` is ``1218``).

* ``sentences``: a list of dictionaries, one per sentence in the article, with
  two fields: ``sentence_pos`` (the position of the sentence in the article)
  and ``sentence_text`` (the text of the sentence). For example:

  .. code-block:: json

      [
          {
              "sentence_pos": 1,
              "sentence_text": "DUKINFIELD. "
          },
          {
              "sentence_pos": 2,
              "sentence_text": "Knutsford Sessions."
          },
          {
              "sentence_pos": 3,
              "sentence_text": "—The servant girl, Eliza Ann Byrom, who stole a quantity of clothes from the house where she lodged, in Dukiafield, was sentenced to two months’ imprisonment. "
          }
      ]

* ``annotations``: a list of dictionaries containing the annotated place
  names. Each dictionary corresponds to a named entity mentioned in the text,
  with (at least) the following fields: ``mention_pos`` (the order of the
  mention in the article), ``mention`` (the actual mention), ``entity_type``
  (the type of named entity), ``wkdt_qid`` (the Wikidata ID of the resolved
  entity), ``mention_start`` (the character start position of the mention in
  the sentence), ``mention_end`` (the character end position of the mention
  in the sentence), and ``sent_pos`` (the index of the sentence in which the
  mention is found). For example:

  .. code-block:: json

      [
          {
              "mention_pos": 0,
              "mention": "DUKINFIELD",
              "entity_type": "LOC",
              "wkdt_qid": "Q1976179",
              "mention_start": 0,
              "mention_end": 10,
              "sent_pos": 1
          },
          {
              "mention_pos": 1,
              "mention": "Knutsford",
              "entity_type": "LOC",
              "wkdt_qid": "Q1470791",
              "mention_start": 0,
              "mention_end": 9,
              "sent_pos": 2
          },
          {
              "mention_pos": 2,
              "mention": "Dukiafield",
              "entity_type": "LOC",
              "wkdt_qid": "Q1976179",
              "mention_start": 104,
              "mention_end": 114,
              "sent_pos": 3
          }
      ]

* ``place``: a string containing the place of publication of the newspaper to
  which the article belongs, for example "Manchester" or "Ashton-under-Lyne".

* ``place_wqid``: a string with the Wikidata ID of the place of publication.
  For example, if ``place`` is London (UK), then ``place_wqid`` should be
  ``Q84``.

Finally, the TSV contains a set of columns that indicate how to split the
dataset into training (``train``), development (``dev``) and testing
(``test``) documents, or documents to leave out (``left_out``). The Linker
requires the user to specify which of these columns should be used when
training the ED model. The code assumes the following columns:

* ``originalsplit``: the ``test`` set is the same as in the original dataset;
  the original training set is split into ``train`` (0.66) and ``dev`` (0.33).

* ``apply``: the articles are divided into ``train`` and ``dev``, with no
  articles left for testing. This split can be used to train the final entity
  disambiguation model, after the experiments.

* ``withouttest``: this split can be used during development. The articles in
  the original test set are left out, and the original training set is split
  into ``train``, ``dev`` and ``test``.
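To illustrate the expected layout, the sketch below loads the TSV with pandas
and selects the documents assigned to ``train`` in one of the split columns.
It assumes the list-valued columns (``sentences``, ``annotations``) are stored
as string representations of the lists shown above; adjust the parsing
function to match your own file:

.. code-block:: python

    from ast import literal_eval

    import pandas as pd

    # Hypothetical path: adjust to where linking_df_split.tsv is stored.
    df = pd.read_csv("experiments/outputs/data/lwm/linking_df_split.tsv", sep="\t")

    # Assumption: sentences/annotations are serialised as Python-style literals;
    # use json.loads instead if your file stores strict JSON.
    df["sentences"] = df["sentences"].apply(literal_eval)
    df["annotations"] = df["annotations"].apply(literal_eval)

    # Select the training documents according to the "originalsplit" column.
    train_df = df[df["originalsplit"] == "train"]
    print(len(train_df), "training documents")
    print(train_df.iloc[0]["annotations"][0])  # first annotation of first document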
`back to top <#top-resources>`_

Wikipedia- and Wikidata-based resources
---------------------------------------

T-Res requires a series of Wikipedia- and Wikidata-based resources:

* ``mentions_to_wikidata.json``
* ``mentions_to_wikidata_normalized.json``
* ``wikidata_to_mentions_normalized.json``
* ``wikidata_gazetteer.csv``
* ``entity2class.txt``

.. note::

    These files can be generated using the `wiki2gaz `_ GitHub repository
    (**[coming soon]**). For more information on how they are built, refer to
    the ``wiki2gaz`` documentation.

T-Res assumes these files are in the following default location:

::

    T-Res/
    └── resources/
        └── wikidata/
            ├── entity2class.txt
            ├── mentions_to_wikidata_normalized.json
            ├── mentions_to_wikidata.json
            ├── wikidata_gazetteer.csv
            └── wikidata_to_mentions_normalized.json

The sections below describe the contents and format of each file, in case you
prefer to provide your own resources (which should follow the same format).

``mentions_to_wikidata.json``
#############################

A JSON file containing a Python dictionary in which each key is a mention of a
place in Wikipedia (i.e. an anchor text) and each value is an inner
dictionary, whose keys are the QIDs of all Wikidata entities that can be
referred to by that mention and whose values are absolute counts (i.e. the
number of times the mention is used in Wikipedia to refer to that particular
entity). You can load the dictionary and access it as follows:

::

    >>> import json
    >>> with open('mentions_to_wikidata.json', 'r') as f:
    ...     mentions_to_wikidata = json.load(f)
    ...
    >>> mentions_to_wikidata["Wiltshire"]

In this example, the value assigned to the key "Wiltshire" is:

.. code-block:: json

    {
        "Q23183": 4457,
        "Q55448990": 5,
        "Q8023421": 1
    }

The mention "Wiltshire" maps the key ``Q23183`` to the value 4457. This means
that, on Wikipedia, "Wiltshire" is used 4457 times to refer to entity
`Q23183 `_ (through the mapping between the Wikidata entity ``Q23183`` and its
`corresponding Wikipedia page `_).

``mentions_to_wikidata_normalized.json``
########################################

A JSON file containing the normalised version of the
``mentions_to_wikidata.json`` dictionary. For example, the value of the
mention "Wiltshire" is now:

.. code-block:: json

    {
        "Q23183": 0.9767696690773614,
        "Q55448990": 1.0,
        "Q8023421": 0.03125
    }

Note that these scores do not add up to one, as they are normalised by entity,
not by mention: they measure how likely a given entity is to be referred to by
a given mention. In this example, we see that entity ``Q55448990`` is always
referred to as "Wiltshire".
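The relationship between the two files can be illustrated as follows: each
count is divided by the total number of times the corresponding entity is
referred to by any mention. This is a sketch of the normalisation described
above, not the ``wiki2gaz`` implementation:

.. code-block:: python

    import json
    from collections import defaultdict

    with open("mentions_to_wikidata.json", "r") as f:
        mentions_to_wikidata = json.load(f)

    # Total number of times each entity is referred to, across all mentions.
    entity_totals = defaultdict(int)
    for mention, candidates in mentions_to_wikidata.items():
        for qid, count in candidates.items():
            entity_totals[qid] += count

    # Normalise each count by its entity's total.
    mentions_to_wikidata_normalized = {
        mention: {qid: count / entity_totals[qid] for qid, count in candidates.items()}
        for mention, candidates in mentions_to_wikidata.items()
    }

    # e.g. "Q55448990" gets 1.0 if all of its Wikipedia anchors are "Wiltshire".
    print(mentions_to_wikidata_normalized["Wiltshire"])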
``wikidata_to_mentions_normalized.json``
########################################

A JSON file containing a Python dictionary in which each key is a Wikidata QID
and each value is an inner dictionary, whose keys are the mentions used in
Wikipedia to refer to that Wikidata entity and whose values are their relative
frequencies. You can load the dictionary and access it as follows:

::

    >>> import json
    >>> with open('wikidata_to_mentions_normalized.json', 'r') as f:
    ...     wikidata_to_mentions_normalized = json.load(f)
    ...
    >>> wikidata_to_mentions_normalized["Q23183"]

In this example, the value of entity `Q23183 `_ is:

.. code-block:: json

    {
        "Wiltshire, England": 0.005478851632697786,
        "Wilton": 0.00021915406530791147,
        "Wiltshire": 0.9767696690773614,
        "College": 0.00021915406530791147,
        "Wiltshire Council": 0.0015340784571553803,
        "West Wiltshire": 0.00021915406530791147,
        "North Wiltshire": 0.00021915406530791147,
        "Wilts": 0.0015340784571553803,
        "County of Wilts": 0.0026298487836949377,
        "County of Wiltshire": 0.010081087004163929,
        "Wilts.": 0.00021915406530791147,
        "Wiltshire county": 0.00021915406530791147,
        "Wiltshire, United Kingdom": 0.00021915406530791147,
        "Wiltshire plains": 0.00021915406530791147,
        "Wiltshire England": 0.00021915406530791147
    }

In this example, entity ``Q23183`` is referred to as "Wiltshire, England" in
Wikipedia about 0.5% of the time and as "Wiltshire" 97.7% of the time. These
values add up to one.

``wikidata_gazetteer.csv``
##########################

A CSV file consisting of (at least) the following four columns:

* the Wikidata ID (QID) of a location,
* its English label,
* its latitude, and
* its longitude.

You can load the CSV and show the first five rows as follows:

::

    >>> import pandas as pd
    >>> df = pd.read_csv("wikidata_gazetteer.csv")
    >>> df.head()
      wikidata_id                     english_label  latitude  longitude
    0    Q5059107                        Centennial  40.01140  -87.24330
    1    Q5059144                Centennial Grounds  39.99270  -75.19380
    2    Q5059153            Centennial High School  40.06170  -83.05780
    3    Q5059162            Centennial High School  38.30440 -104.63800
    4    Q5059178  Centennial Memorial Samsung Hall  37.58949  127.03434

Each row corresponds to a Wikidata geographic entity (i.e. a Wikidata entity
with coordinates).

.. note::

    The latitude and longitude are not used by the disambiguation method: they
    are only added as a post-processing step when rendering the output of the
    linking. Therefore, these columns can contain dummy values (of type
    ``float``) if you are not interested in obtaining the coordinates: the
    linking to Wikidata will not be affected. The ``english_label`` column can
    likewise be left empty.

``entity2class.txt``
####################

A file containing a Python dictionary (in JSON format) in which each Wikidata
entity is mapped to its most common Wikidata class. You can load the
dictionary and access it as follows:

::

    >>> import json
    >>> with open('entity2class.txt', 'r') as f:
    ...     entity2class = json.load(f)
    ...
    >>> entity2class["Q23183"]
    'Q180673'
    >>> entity2class["Q84"]
    'Q515'

For example, Wiltshire (`Q23183 `_) is mapped to `Q180673 `_, i.e. "ceremonial
county of England", whereas London (`Q84 `_) is mapped to `Q515 `_, i.e.
"city".

.. note::

    The entity2class mapping is not used by the disambiguation method: the
    Wikidata class is only added as a post-processing step when rendering the
    output of the linking. T-Res will complain if the file is not there, but
    values can be left empty if you are not interested in obtaining the
    Wikidata class of the predicted entity. The linking to Wikidata will not
    be affected.
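To illustrate how these two resources are typically used for post-processing,
the sketch below attaches coordinates and a Wikidata class to a predicted QID.
The QID and paths are examples; adjust them to your own setup:

.. code-block:: python

    import json

    import pandas as pd

    # Hypothetical paths: adjust to your resources folder.
    gazetteer = pd.read_csv("resources/wikidata/wikidata_gazetteer.csv")
    with open("resources/wikidata/entity2class.txt", "r") as f:
        entity2class = json.load(f)

    predicted_qid = "Q23183"  # example predicted entity

    row = gazetteer[gazetteer["wikidata_id"] == predicted_qid]
    if not row.empty:
        print("label:", row.iloc[0]["english_label"])
        print("coordinates:", row.iloc[0]["latitude"], row.iloc[0]["longitude"])
    print("class:", entity2class.get(predicted_qid, "unknown"))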
`back to top <#top-resources>`_

Entity and word embeddings
--------------------------

.. note::

    You will not need these embeddings if you use the ``mostpopular``
    disambiguation approach.

In order to perform toponym linking and resolution using the REL-based
approaches, T-Res requires a database of word2vec and wiki2vec embeddings. By
default, T-Res expects a database file called ``embeddings_database.db``
containing at least one table (``entity_embeddings``) with at least the
following columns:

* ``word``: either a lower-cased token (i.e. a word on Wikipedia) or a
  Wikidata QID preceded by ``ENTITY/``.
* ``emb``: the corresponding word or entity embedding.

The database should also contain the following two wildcard tokens:
``#ENTITY/UNK#`` and ``#WORD/UNK#``.
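The layout described above can be reproduced as in the following sketch, which
creates a toy database with truncated example vectors stored as float32 blobs
(matching the way the query examples further below decode the ``emb`` column).
It only illustrates the expected format; the note below describes how the real
database is generated:

.. code-block:: python

    import sqlite3
    from array import array

    # Truncated toy vectors for illustration only: real embeddings are derived
    # from REL's shared resources (see the steps in the note below).
    rows = {
        "lerwick": [-0.3257, -0.0099, -0.1342],
        "ENTITY/Q84": [-0.0147, 0.0079, -0.1809],
        "#WORD/UNK#": [0.0, 0.0, 0.0],
        "#ENTITY/UNK#": [0.0, 0.0, 0.0],
    }

    with sqlite3.connect("toy_embeddings_database.db") as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS entity_embeddings (word TEXT PRIMARY KEY, emb BLOB)"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO entity_embeddings (word, emb) VALUES (?, ?)",
            [(word, array("f", vec).tobytes()) for word, vec in rows.items()],
        )
        conn.commit()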
In our experiments, we derived the embeddings database from REL's shared
resources.

.. note::

    We are working towards improving this step in the pipeline. Meanwhile, to
    generate the ``embeddings_database.db``, please follow these steps:

    #. Make sure you have ``wikidata_gazetteer.csv`` in
       ``./resources/wikidata/`` (see `above
       <#wikipedia-and-wikidata-based-resources>`_).
    #. Generate a Wikipedia-to-Wikidata index, following `these instructions
       `_, and save it as ``./resources/wikipedia/index_enwiki-latest.db``.
    #. Run `this script `_ to create the embeddings database
       (**[coming soon]**).

You can load the file and access a token embedding as follows:

::

    >>> import sqlite3
    >>> from array import array
    >>> with sqlite3.connect("embeddings_database.db") as conn:
    ...     cursor = conn.cursor()
    ...     result = cursor.execute("SELECT emb FROM entity_embeddings WHERE word='lerwick'").fetchone()
    ...     result = result if result is None else array("f", result[0]).tolist()
    ...
    >>> result
    [-0.3257000148296356, -0.00989999994635582, -0.13420000672340393, ...]

You can load the file and access an entity embedding as follows:

::

    >>> import sqlite3
    >>> from array import array
    >>> with sqlite3.connect("embeddings_database.db") as conn:
    ...     cursor = conn.cursor()
    ...     result = cursor.execute("SELECT emb FROM entity_embeddings WHERE word='ENTITY/Q84'").fetchone()
    ...     result = result if result is None else array("f", result[0]).tolist()
    ...
    >>> result
    [-0.014700000174343586, 0.007899999618530273, -0.1808999925851822, ...]

T-Res expects the ``embeddings_database.db`` file to be stored as follows:

::

    T-Res/
    └── resources/
        └── rel_db/
            └── embeddings_database.db

`back to top <#top-resources>`_

DeezyMatch training set
-----------------------

In order to train a DeezyMatch model, a training set consisting of positive
and negative string pairs is required. We provide a dataset of positive and
negative OCR variations that can be used to train a DeezyMatch model, which in
turn can be used to perform fuzzy string matching to find candidates for
entity linking.

.. note::

    The DeezyMatch training set can be downloaded from the `British Library
    research repository `_. This dataset is only necessary if you want to use
    the DeezyMatch approach to perform candidate selection. It is not needed
    if you use ``perfectmatch``.

T-Res assumes by default that the DeezyMatch training set is named
``w2v_ocr_pairs.txt`` and is in the following location:

::

    T-Res/
    └── resources/
        └── deezymatch/
            └── data/
                └── w2v_ocr_pairs.txt

Alternatively, T-Res can generate a DeezyMatch training set from word2vec
embeddings trained on digitised texts. Continue reading the sections below for
more information about both types of resources.

1. DeezyMatch training set
##########################

T-Res can directly load the string pairs dataset required to train a new
DeezyMatch model. By default, the code assumes the dataset is called
``w2v_ocr_pairs.txt``. The dataset consists of three columns: ``word1``,
``word2``, and a boolean describing whether ``word2`` is an OCR variation of
``word1``. For example:

.. code-block::

    could might FALSE
    could wished FALSE
    could hardly FALSE
    could didnot FALSE
    could never FALSE
    could reusing FALSE
    could could TRUE
    could coeld TRUE
    could could TRUE
    could conld TRUE
    could could TRUE
    could couid TRUE

This dataset has been automatically generated from word2vec embeddings trained
on digitised historical news texts (i.e. with OCR noise), and has been
expanded with toponym alternate names extracted from Wikipedia. The dataset we
provide consists of 1,085,514 string pairs.

2. Word2Vec embeddings trained on noisy data
############################################

.. note::

    The 19thC word2vec embeddings **are not needed** if you already have the
    DeezyMatch training set ``w2v_ocr_pairs.txt`` (described in the `section
    above <#deezymatch-training-set>`_).

To create a new DeezyMatch training set using T-Res, you need to provide
Word2Vec models that have been trained on digitised historical news texts. In
our experiments, we used embeddings trained on a 4.2-billion-word corpus of
19th-century British newspapers using Word2Vec (you can download them from
`Zenodo `_), but you can also use your own word2vec embeddings. The embeddings
are divided into periods of ten years each.

By default, T-Res assumes that the word2vec models are stored in
``./resources/models/w2v/``, in directories named ``w2v_xxxxs_news/``, where
``xxxx`` corresponds to the decade (e.g. 1800 or 1810) of the models. See the
expected directory structure below:

::

    T-Res/
    └── resources/
        └── models/
            └── w2v/
                ├── w2v_1800s_news/
                │   ├── w2v.model
                │   ├── w2v.model.syn1neg.npy
                │   └── w2v.model.wv.vectors.npy
                ├── w2v_1810s_news/
                │   ├── w2v.model
                │   ├── w2v.model.syn1neg.npy
                │   └── w2v.model.wv.vectors.npy
                └── .../

Summary of resources and directory structure
--------------------------------------------

In the code and our tutorials, we assume the following directory structure for
the resources required to run the pipeline.

::

    T-Res/
    ├── t-res/
    │   ├── geoparser/
    │   └── utils/
    ├── app/
    ├── evaluation/
    ├── examples/
    ├── experiments/
    │   └── outputs/
    │       └── data/
    │           └── lwm/
    │               ├── linking_df_split.tsv [*?]
    │               ├── ner_fine_dev.json [*+?]
    │               └── ner_fine_train.json [*+?]
    ├── resources/
    │   ├── deezymatch/
    │   │   └── data/
    │   │       └── w2v_ocr_pairs.txt [?]
    │   ├── models/
    │   ├── news_datasets/
    │   ├── rel_db/
    │   │   └── embeddings_database.db [*+?]
    │   └── wikidata/
    │       ├── entity2class.txt [*]
    │       ├── mentions_to_wikidata_normalized.json [*]
    │       ├── mentions_to_wikidata.json [*]
    │       ├── wikidata_gazetteer.csv [*]
    │       └── wikidata_to_mentions_normalized.json [*]
    └── tests/

A question mark (``?``) indicates resources that are only required by some
approaches (for example, the ``rel_db/embeddings_database.db`` file is only
required by the REL-based disambiguation approaches). An asterisk (``*``)
means that the path can be changed when instantiating the T-Res objects, and a
plus sign (``+``) means that the file name can also be changed at
instantiation.

`back to top <#top-resources>`_
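As a quick sanity check, the hypothetical snippet below (not part of T-Res)
reports which of the resources listed above are present, so you can see at a
glance what still needs to be downloaded or generated for your chosen
approach:

.. code-block:: python

    from pathlib import Path

    # Paths relative to the T-Res root; adjust if you changed any defaults.
    expected = [
        "experiments/outputs/data/lwm/linking_df_split.tsv",
        "resources/deezymatch/data/w2v_ocr_pairs.txt",
        "resources/rel_db/embeddings_database.db",
        "resources/wikidata/entity2class.txt",
        "resources/wikidata/mentions_to_wikidata.json",
        "resources/wikidata/mentions_to_wikidata_normalized.json",
        "resources/wikidata/wikidata_gazetteer.csv",
        "resources/wikidata/wikidata_to_mentions_normalized.json",
    ]

    for rel_path in expected:
        status = "found" if Path(rel_path).exists() else "MISSING"
        print(f"{status:8} {rel_path}")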