Resources and directory structure
T-Res requires several resources to work. Some resources can be downloaded and loaded directly from the web; others must be generated following the instructions provided in this section. On this page, we describe the format of the files required by T-Res, thereby also giving the user the option to supply their own resources (adapted to T-Res).
Toponym recognition and disambiguation training data
We provide the dataset we used to train T-Res for the tasks of toponym recognition (i.e. a named entity recognition task) and toponym disambiguation (i.e. an entity linking task focused on geographical entities) in English. The dataset is based on the TopRes19th dataset.
Note
You can download the data (in the format required by T-Res) from the British Library research repository.
By default, T-Res assumes the files are stored in the following location:
T-Res/
└── experiments/
└── outputs/
└── data/
└── lwm/
├── ner_fine_dev.json
├── ner_fine_test.json
└── linking_df_split.tsv
Continue reading the sections below to learn more about the datasets, and for a description of the format expected by T-Res.
1. Toponym recognition dataset
Note
You don’t need a toponym recognition dataset if you load a NER model directly from the HuggingFace hub, or from a local folder. In that case, you can skip this section.
T-Res allows directly loading a pre-trained BERT-based NER model, either locally or from the HuggingFace models hub. If you want to train your own NER model instead, using either our dataset or a different dataset, continue reading.
T-Res requires that the data for training a NER model is provided as two JSON files (one for training, one for testing) in the JSON Lines format, where each line corresponds to a sentence. Each sentence is a dictionary with three key-value pairs: `id` (an identifier of the sentence, a string), `tokens` (the list of tokens into which the sentence has been split), and `ner_tags` (the list of annotations per token, in the BIO format). The length of `tokens` and `ner_tags` is therefore always the same. See below an example of three lines from one of the JSON files, corresponding to three annotated sentences:
{"id":"3896239_29","ner_tags":["O","B-STREET","I-STREET","O","O","O","B-BUILDING","I-BUILDING","O","O","O","O","O","O","O","O","O","O"],"tokens":[",","Old","Millgate",",","to","the","Collegiate","Church",",","where","they","arrived","a","little","after","ten","oclock","."]}
{"id":"8262498_11","ner_tags":["O","O","O","O","O","O","O","O","O","O","O","B-LOC","O","B-LOC","O","O","O","O","O","O"],"tokens":["On","the","'","JSth","November","the","ship","Santo","Christo",",","from","Monteveido","to","Cadiz",",","with","hides","and","copper","."]}
{"id":"10715509_7","ner_tags":["O","O","O","B-LOC","O","O","O","O","O","O","O","O","O","O","O","O"],"tokens":["A","COACH","to","SOUTHAMPTON",",","every","morning","at","a","quarter","before","6",",","Sundays","excepted","."]}
Note that the list of NER labels will be automatically detected from the training data.
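The format described above can be sanity-checked with a few lines of Python. In this sketch the annotated sentences are inlined as a string for illustration; normally they would be read line by line from the train and test files:

```python
import json

# Two annotated sentences in the JSON Lines format expected by T-Res
# (inlined here; normally read from e.g. the train or test file).
raw = "\n".join([
    '{"id": "s1", "tokens": ["A", "COACH", "to", "SOUTHAMPTON"], '
    '"ner_tags": ["O", "O", "O", "B-LOC"]}',
    '{"id": "s2", "tokens": ["Old", "Millgate"], '
    '"ner_tags": ["B-STREET", "I-STREET"]}',
])

sentences = [json.loads(line) for line in raw.splitlines()]

# tokens and ner_tags must always have the same length.
for s in sentences:
    assert len(s["tokens"]) == len(s["ner_tags"])

# The label set is detected automatically from the training data:
labels = sorted({tag for s in sentences for tag in s["ner_tags"]})
print(labels)  # ['B-LOC', 'B-STREET', 'I-STREET', 'O']
```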
2. Toponym disambiguation dataset
Note
You won’t need a toponym disambiguation dataset if you use the unsupervised approach for linking (i.e. `mostpopular`). You will need one if you want to use one of the REL-based approaches.
Train and test data examples are required for training a new entity disambiguation (ED) model. They should be provided in a single TSV file, named `linking_df_split.tsv`, one document per row, with the following required columns:
- `article_id`: the article identifier, which consists of the number in the document file name in the original dataset (for example, the `article_id` of `1218_Poole1860.tsv` is `1218`).
- `sentences`: a list of dictionaries, each dictionary corresponding to a sentence in the article, with two fields: `sentence_pos` (the position of the sentence in the article) and `sentence_text` (the text of the sentence). For example:

  [
    { "sentence_pos": 1, "sentence_text": "DUKINFIELD. " },
    { "sentence_pos": 2, "sentence_text": "Knutsford Sessions." },
    { "sentence_pos": 3, "sentence_text": "—The servant girl, Eliza Ann Byrom, who stole a quantity of clothes from the house where she lodged, in Dukiafield, was sentenced to two months’ imprisonment. " }
  ]

- `annotations`: a list of dictionaries containing the annotated place names. Each dictionary corresponds to a named entity mentioned in the text, with (at least) the following fields: `mention_pos` (order of the mention in the article), `mention` (the actual mention), `entity_type` (the type of named entity), `wkdt_qid` (the Wikidata ID of the resolved entity), `mention_start` (the character start position of the mention in the sentence), `mention_end` (the character end position of the mention in the sentence), and `sent_pos` (the index of the sentence in which the mention is found). For example:

  [
    { "mention_pos": 0, "mention": "DUKINFIELD", "entity_type": "LOC", "wkdt_qid": "Q1976179", "mention_start": 0, "mention_end": 10, "sent_pos": 1 },
    { "mention_pos": 1, "mention": "Knutsford", "entity_type": "LOC", "wkdt_qid": "Q1470791", "mention_start": 0, "mention_end": 9, "sent_pos": 2 },
    { "mention_pos": 2, "mention": "Dukiafield", "entity_type": "LOC", "wkdt_qid": "Q1976179", "mention_start": 104, "mention_end": 114, "sent_pos": 3 }
  ]

- `place`: a string containing the place of publication of the newspaper to which the article belongs, for example “Manchester” or “Ashton-under-Lyne”.
- `place_wqid`: a string with the Wikidata ID of the place of publication. For example, if `place` is London UK, then `place_wqid` should be `Q84`.
Finally, the TSV contains a set of columns which can be used to indicate how to split the dataset into training (`train`), development (`dev`), testing (`test`), or documents to leave out (`left_out`). The Linker requires that the user specifies which column should be used for training the ED model. The code assumes the following columns:

- `originalsplit`: the articles maintain the `test` set of the original dataset. The train set is split into `train` (0.66) and `dev` (0.33).
- `apply`: the articles are divided into `train` and `dev`, with no articles left for testing. This split can be used to train the final entity disambiguation model, after the experiments.
- `withouttest`: this split can be used for development. The articles in the test set of the original dataset are left out. The training set is split into `train`, `dev` and `test`.
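A minimal sketch of loading and parsing such a TSV with pandas follows. It assumes the `sentences` and `annotations` columns are stored as string-serialised lists (parseable with `ast.literal_eval`), and it uses a hypothetical one-row example inlined as a string rather than the real file:

```python
import ast
import io

import pandas as pd

# A one-row, tab-separated sketch of linking_df_split.tsv.
tsv = (
    "article_id\tsentences\tannotations\tplace\tplace_wqid\toriginalsplit\n"
    '1218\t[{"sentence_pos": 1, "sentence_text": "DUKINFIELD. "}]\t'
    '[{"mention_pos": 0, "mention": "DUKINFIELD", "entity_type": "LOC", '
    '"wkdt_qid": "Q1976179", "mention_start": 0, "mention_end": 10, '
    '"sent_pos": 1}]\t'
    "London\tQ84\ttrain\n"
)

# quoting=3 (csv.QUOTE_NONE) keeps the quotes inside the serialised columns.
df = pd.read_csv(io.StringIO(tsv), sep="\t", quoting=3)

# Parse the serialised columns back into Python objects.
df["sentences"] = df["sentences"].apply(ast.literal_eval)
df["annotations"] = df["annotations"].apply(ast.literal_eval)

print(df.loc[0, "annotations"][0]["wkdt_qid"])  # Q1976179
```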
Wikipedia- and Wikidata-based resources
T-Res requires a series of Wikipedia- and Wikidata-based resources:
mentions_to_wikidata.json
mentions_to_wikidata_normalized.json
wikidata_to_mentions_normalized.json
wikidata_gazetteer.csv
entity2class.txt
Note
These files can be generated using the wiki2gaz GitHub repository ([coming soon]). For more information on how they are built, refer to the wiki2gaz documentation.
T-Res assumes these files are stored in the following default location:
T-Res/
└── resources/
└── wikidata/
├── entity2class.txt
├── mentions_to_wikidata_normalized.json
├── mentions_to_wikidata.json
├── wikidata_gazetteer.csv
└── wikidata_to_mentions_normalized.json
The sections below describe the contents of the files, as well as their format, in case you prefer to provide your own resources (which should have the same format).
mentions_to_wikidata.json
A JSON file consisting of a Python dictionary in which each key is a mention of a place in Wikipedia (by means of an anchor text) and each value is an inner dictionary, where the inner keys are the QIDs of all Wikidata entities that can be referred to by the mention in question, and the inner values are the absolute counts (i.e. the number of times the mention is used in Wikipedia to refer to that particular entity).
You can load the dictionary, and access it, as follows:
>>> import json
>>> with open('mentions_to_wikidata.json', 'r') as f:
... mentions_to_wikidata = json.load(f)
...
>>> mentions_to_wikidata["Wiltshire"]
In the example, the value assigned to the key “Wiltshire” is:
{
"Q23183": 4457,
"Q55448990": 5,
"Q8023421": 1
}
In the example, we see that the mention “Wiltshire” maps the key `Q23183` to the value 4457. This means that, on Wikipedia, “Wiltshire” is used 4457 times to refer to entity Q23183 (through the mapping between Wikidata entity `Q23183` and its corresponding Wikipedia page).
mentions_to_wikidata_normalized.json
A JSON file containing the normalised version of the mentions_to_wikidata.json
dictionary. For example, the value of the mention “Wiltshire” is now:
{
"Q23183": 0.9767696690773614,
"Q55448990": 1.0,
"Q8023421": 0.03125
}
Note that these scores do not add up to one, as they are normalised by entity, not by mention. They are a measure of how likely an entity is to be referred to by a given mention. In the example, we see that entity `Q55448990` is always referred to as “Wiltshire”.
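To make the per-entity normalisation concrete, here is one plausible way such scores can be computed from the absolute counts (a sketch with a toy dictionary; the actual files are produced by wiki2gaz, so take the exact procedure there as authoritative):

```python
from collections import defaultdict

# Toy mentions_to_wikidata dictionary (absolute counts).
mentions_to_wikidata = {
    "Wiltshire": {"Q23183": 4457, "Q8023421": 1},
    "Wiltshire, England": {"Q23183": 25},
    "Wiltshire College": {"Q8023421": 31},
}

# Total number of times each entity is referred to, by any mention.
entity_totals = defaultdict(int)
for qids in mentions_to_wikidata.values():
    for qid, count in qids.items():
        entity_totals[qid] += count

# Normalise each count by the entity's total: the score measures how
# likely the entity is to be referred to by that particular mention.
normalized = {
    mention: {qid: count / entity_totals[qid] for qid, count in qids.items()}
    for mention, qids in mentions_to_wikidata.items()
}

print(normalized["Wiltshire"]["Q8023421"])  # 1 / 32 = 0.03125
```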
wikidata_to_mentions_normalized.json
A JSON file consisting of a Python dictionary in which each key is a Wikidata QID and each value is an inner dictionary, in which the inner keys are the mentions used in Wikipedia to refer to that Wikidata entity, and the values are their relative frequencies.
You can load the dictionary, and access it, as follows:
>>> import json
>>> with open('wikidata_to_mentions_normalized.json', 'r') as f:
... wikidata_to_mentions_normalized = json.load(f)
...
>>> wikidata_to_mentions_normalized["Q23183"]
In this example, the value of entity Q23183 is:
{
"Wiltshire, England": 0.005478851632697786,
"Wilton": 0.00021915406530791147,
"Wiltshire": 0.9767696690773614,
"College": 0.00021915406530791147,
"Wiltshire Council": 0.0015340784571553803,
"West Wiltshire": 0.00021915406530791147,
"North Wiltshire": 0.00021915406530791147,
"Wilts": 0.0015340784571553803,
"County of Wilts": 0.0026298487836949377,
"County of Wiltshire": 0.010081087004163929,
"Wilts.": 0.00021915406530791147,
"Wiltshire county": 0.00021915406530791147,
"Wiltshire, United Kingdom": 0.00021915406530791147,
"Wiltshire plains": 0.00021915406530791147,
"Wiltshire England": 0.00021915406530791147
}
In this example, we can see that entity `Q23183` is referred to as “Wiltshire, England” in Wikipedia 0.5% of the time and as “Wiltshire” 97.7% of the time. These values add up to one.
wikidata_gazetteer.csv
A CSV file consisting of (at least) the following four columns:
a Wikidata ID (QID) of a location,
its English label,
its latitude, and
its longitude.
You can load the csv, and show the first five rows, as follows:
>>> import pandas as pd
>>> df = pd.read_csv("wikidata_gazetteer.csv")
>>> df.head()
wikidata_id english_label latitude longitude
0 Q5059107 Centennial 40.01140 -87.24330
1 Q5059144 Centennial Grounds 39.99270 -75.19380
2 Q5059153 Centennial High School 40.06170 -83.05780
3 Q5059162 Centennial High School 38.30440 -104.63800
4 Q5059178 Centennial Memorial Samsung Hall 37.58949 127.03434
Each row corresponds to a Wikidata geographic entity (i.e. a Wikidata entity with coordinates).
Note
Note that the latitude and longitude are not used by the disambiguation method: they are only provided as a post-processing step when rendering the output of the linking. Therefore, these columns can contain dummy values (of type `float`) if the user is not interested in obtaining the coordinates: the linking to Wikidata will not be affected. The `english_label` column can likewise be left empty.
entity2class.txt
A text file containing a JSON dictionary in which each entity in Wikidata is mapped to its most common Wikidata class.
You can load the dictionary, and access it, as follows:
>>> import json
>>> with open('entity2class.txt', 'r') as f:
... entity2class = json.load(f)
...
>>> entity2class["Q23183"]
'Q180673'
>>> entity2class["Q84"]
'Q515'
For example, Wiltshire (Q23183) is mapped to Q180673, i.e. “ceremonial county of England”, whereas London (Q84) is mapped to Q515, i.e. “city”.
Note
Note that the entity2class mapping is not used by the disambiguation method: the Wikidata class is only provided as a post-processing step when rendering the output of the linking. T-Res will complain if the file is missing, but its values can be left empty if the user is not interested in obtaining the Wikidata class of the predicted entity; the linking to Wikidata will not be affected.
Entity and word embeddings
Note
Note that you will not need this if you use the `mostpopular` disambiguation approach.
In order to perform toponym linking and resolution using the REL-based approaches, T-Res requires a database of word2vec and wiki2vec embeddings.
By default, T-Res expects a database file called `embeddings_database.db` with at least one table (`entity_embeddings`) containing at least the following columns:

- `word`: either a lower-cased token (i.e. a word on Wikipedia) or a Wikidata QID preceded by `ENTITY/`. The database should also contain the two wildcard tokens `#ENTITY/UNK#` and `#WORD/UNK#`.
- `emb`: the corresponding word or entity embedding.
In our experiments, we derived the embeddings database from REL’s shared resources.
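As an illustration of the expected table layout, the following hypothetical sketch builds a tiny database with that schema. The float32 packing is an assumption chosen to match the `array("f", ...)` read examples shown later in this section; the embedding values are made up:

```python
import sqlite3
from array import array

# In-memory database for illustration; use "embeddings_database.db" on disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE entity_embeddings (word TEXT PRIMARY KEY, emb BLOB)")

# Embeddings stored as packed float32 bytes (dummy 3-d vectors).
rows = [
    ("lerwick", array("f", [-0.3257, -0.0099, -0.1342]).tobytes()),
    ("ENTITY/Q84", array("f", [-0.0147, 0.0079, -0.1809]).tobytes()),
    ("#WORD/UNK#", array("f", [0.0, 0.0, 0.0]).tobytes()),
    ("#ENTITY/UNK#", array("f", [0.0, 0.0, 0.0]).tobytes()),
]
cur.executemany("INSERT INTO entity_embeddings VALUES (?, ?)", rows)
conn.commit()

# Round-trip: unpack the blob back into a list of floats.
blob = cur.execute(
    "SELECT emb FROM entity_embeddings WHERE word='ENTITY/Q84'"
).fetchone()[0]
emb = array("f", blob).tolist()
print(len(emb))  # 3
```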
Note
We are working towards improving this step in the pipeline. Meanwhile, to generate the `embeddings_database.db`, please follow these steps:

1. Make sure you have `wikidata_gazetteer.csv` in `./resources/wikidata/` (see above).
2. Generate a Wikipedia-to-Wikidata index, following these instructions, and save it as `./resources/wikipedia/index_enwiki-latest.db`.
3. Run this script to create the embeddings database ([coming soon]).
You can load the file, and access a token embedding, as follows:
>>> import sqlite3
>>> from array import array
>>> with sqlite3.connect("embeddings_database.db") as conn:
... cursor = conn.cursor()
... result = cursor.execute("SELECT emb FROM entity_embeddings WHERE word='lerwick'").fetchone()
... result = result if result is None else array("f", result[0]).tolist()
...
>>> result
[-0.3257000148296356, -0.00989999994635582, -0.13420000672340393, ...]
You can load the file, and access an entity embedding, as follows:
>>> import sqlite3
>>> from array import array
>>> with sqlite3.connect("embeddings_database.db") as conn:
... cursor = conn.cursor()
... result = cursor.execute("SELECT emb FROM entity_embeddings WHERE word='ENTITY/Q84'").fetchone()
... result = result if result is None else array("f", result[0]).tolist()
...
>>> result
[-0.014700000174343586, 0.007899999618530273, -0.1808999925851822, ...]
T-Res expects the embeddings_database.db
file to be stored as follows:
T-Res/
└── resources/
└── rel_db/
└── embeddings_database.db
DeezyMatch training set
In order to train a DeezyMatch model, a training set consisting of positive and negative string pairs is required. We provide a dataset of positive and negative OCR variations that can be used to train a DeezyMatch model, which can then perform fuzzy string matching to find candidates for entity linking.
Note
The DeezyMatch training set can be downloaded from the British Library research
repository.
This dataset is only necessary if you want to use the DeezyMatch approach to perform candidate selection. It is not needed if you use `perfectmatch`.
By default, T-Res assumes the DeezyMatch training set is named `w2v_ocr_pairs.txt` and stored in the following location:
T-Res/
└── resources/
└── deezymatch/
└── data/
└── w2v_ocr_pairs.txt
T-Res also provides the option to generate a DeezyMatch training set from word2vec embeddings trained on digitised texts. Continue reading the sections below for more information about both types of resources.
1. DeezyMatch training set
T-Res can directly load the string pairs dataset required to train a new DeezyMatch model. By default, the code assumes the dataset is called `w2v_ocr_pairs.txt`. The dataset consists of three columns: `word1`, `word2`, and a boolean describing whether `word2` is an OCR variation of `word1`. For example:

could	might	FALSE
could	wished	FALSE
could	hardly	FALSE
could	didnot	FALSE
could	never	FALSE
could	reusing	FALSE
could	could	TRUE
could	coeld	TRUE
could	could	TRUE
could	conld	TRUE
could	could	TRUE
could	couid	TRUE
This dataset has been automatically generated from word2vec embeddings trained on digitised historical news texts (i.e. with OCR noise), and has been expanded with toponym alternate names extracted from Wikipedia.
The dataset we provide consists of 1,085,514 string pairs.
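A sketch of how such a file can be parsed, assuming tab-separated columns (here the pairs are inlined as a string rather than read from `w2v_ocr_pairs.txt`):

```python
# A few string pairs inlined for illustration; normally read from
# w2v_ocr_pairs.txt (assumed tab-separated).
pairs_txt = "could\tmight\tFALSE\n" \
            "could\thardly\tFALSE\n" \
            "could\tcoeld\tTRUE\n" \
            "could\tconld\tTRUE\n"

pairs = []
for line in pairs_txt.splitlines():
    word1, word2, flag = line.split("\t")
    # The third column marks whether word2 is an OCR variation of word1.
    pairs.append((word1, word2, flag == "TRUE"))

positives = [p for p in pairs if p[2]]
print(len(positives))  # 2
```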
2. Word2Vec embeddings trained on noisy data
Note
The 19thC word2vec embeddings are not needed if you already have the DeezyMatch training set `w2v_ocr_pairs.txt` (described in the section above).
To create a new DeezyMatch training set using T-Res, you need to provide Word2Vec models that have been trained on digitised historical news texts. In our experiments, we used embeddings trained with Word2Vec on a 4.2-billion-word corpus of 19th-century British newspapers (you can download them from Zenodo), but you can also use your own word2vec embeddings. The embeddings are divided into periods of ten years each.
By default, T-Res assumes that the word2vec models are stored in `./resources/models/w2v/`, in directories named `w2v_xxxxs_news/`, where `xxxx` corresponds to the decade (e.g. 1800 or 1810) of the models.
See the expected directory structure below:
T-Res/
└── resources/
└── models/
└── w2v/
├── w2v_1800_news/
│ ├── w2v.model
│ ├── w2v.model.syn1neg.npy
│ └── w2v.model.wv.vectors.npy
├── w2v_1810_news/
│ ├── w2v.model
│ ├── w2v.model.syn1neg.npy
│ └── w2v.model.wv.vectors.npy
└── .../
Summary of resources and directory structure
In the code and our tutorials, we assume the following directory structure for the resources required to run the pipeline.
T-Res/
├── t-res/
│ ├── geoparser/
│ └── utils/
├── app/
├── evaluation/
├── examples/
├── experiments/
│ └── outputs/
│ └── data/
│ └── lwm/
│ ├── linking_df_split.tsv [*?]
│ ├── ner_fine_dev.json [*+?]
│ └── ner_fine_train.json [*+?]
├── resources/
│ ├── deezymatch/
│ │ └── data/
│ │ └── w2v_ocr_pairs.txt [?]
│ ├── models/
│ ├── news_datasets/
│ ├── rel_db/
│ │ └── embeddings_database.db [*+?]
│ └── wikidata/
│ ├── entity2class.txt [*]
│ ├── mentions_to_wikidata_normalized.json [*]
│ ├── mentions_to_wikidata.json [*]
│ ├── wikidata_gazetteer.csv [*]
│ └── wikidata_to_mentions_normalized.json [*]
└── tests/
A question mark (`?`) indicates resources which are only required for some approaches (for example, the `rel_db/embeddings_database.db` file is only required by the REL-based disambiguation approaches). An asterisk (`*`) next to a resource means that the path can be changed when instantiating the T-Res objects, and a plus sign (`+`) means that the name of the file can be changed in the instantiation.