Creating More Training Data Without More Annotating¶
We previously listed ‘annotating’ more data as one way of improving our model. Since supervised learning requires labelled data, having more of it can help improve our model.
One obvious downside is that collecting more training data is time-consuming and may not always be practical. In a GLAM setting we may want to use machine learning to do a task which we wouldn’t otherwise have time to do; if we have to spend a lot of time creating our training data, the machine learning approach may itself become impractical in terms of resources.
Combining Domain Expertise and Machine Learning¶
The time taken to create training data is one weakness of machine learning for practical tasks. Another potential frustration for domain experts is that their knowledge isn’t always incorporated into the machine learning process. For our use case of trying to identify the genre of a book, we may already have a sense of some ways we could tell whether a book is fiction or non-fiction. For example, we may already know that titles for non-fiction books tend to be longer than fiction book titles (cf. ‘An account of the mining villages of Wales’ vs ‘Oliver Twist’). If we create our training data in the usual way, by labeling examples of our data with the correct label, we might not be able to incorporate this domain knowledge very easily. This might be okay in some cases, but we might be able to save time and get better results by trying to leverage what we already know (or can access via domain experts).
Programmatically Generating Training Data¶
One way in which we could do this is by writing a labelling function
to label titles as either fiction or non-fiction based on the length of the title, i.e. without any annotation by hand. However, a weakness of this approach is that it deals with averages, which won’t always be correct: some non-fiction book titles will be shorter than our threshold, and vice versa for fiction books. Of course, if we could reliably determine genre from the average length of titles alone, we could have skipped this whole machine learning process and been done already.
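To make this concrete before bringing in Snorkel, a rule like this could be written as a short plain-Python function. This is just an illustrative sketch: the 100-character threshold is an arbitrary assumption, not a value taken from our data.

def label_by_title_length(title, threshold=100):
    """Naive rule of thumb: longer titles are more likely to be non-fiction."""
    return "Non-fiction" if len(title) > threshold else "Fiction"

# The non-fiction title from our earlier example is only 42 characters long,
# so this rule mislabels it as fiction - exactly the kind of mistake described above.
label_by_title_length("An account of the mining villages of Wales")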
So our problem is that we have some sense of functions we could use to label our data, but these functions are likely to be wrong some of the time. In this notebook we’ll explore how we can use the Python library Snorkel to deal with this challenge and try to create additional annotations without doing any annotating by hand.
Generating New Genre Training Data¶
How will we try to approach this in our particular situation? As a reminder of our broad task, we have a collection of metadata related to the Microsoft Digitised Books collection. The ‘genre’ field isn’t yet fully populated. We have previously used a subset of this data to train a machine learning model.
What we want to do is to try and write some labelling functions that will add more labels to the full metadata dataset, so that we can give our models more examples to learn from. If we are able to do this we’ll hopefully be able to improve the performance of our model from our previous attempts.
We’ll start by installing the libraries we’ll be using in this notebook.
!pip install snorkel
!pip install fastai --upgrade
Since we already have some training data we can leverage this to help us develop labelling functions
(more on this below) and to test how well these work.
import pandas as pd
import numpy as np
dtypes = {
"BL record ID": "string",
"Type of resource": "category",
"Name": "category",
"Type of name": "category",
"Country of publication": "category",
"Place of publication": "category",
"Genre": "category",
"Dewey classification": "string",
"BL record ID for physical resource": "string",
"annotator_main_language": "category",
"annotator_summaries_language": "string",
}
df = pd.read_csv("https://raw.githubusercontent.com/Living-with-machines/genre-classification/main/genre_classification_of_bl_books/data/train_valid.csv", dtype=dtypes)
df.head(1)
BL record ID | Type of resource | Name | Dates associated with name | Type of name | Role | All names | Title | Variant titles | Series title | Number within series | Country of publication | Place of publication | Publisher | Date of publication | Edition | Physical description | Dewey classification | BL shelfmark | Topics | Genre | Languages | Notes | BL record ID for physical resource | classification_id | user_id | created_at | subject_ids | annotator_date_pub | annotator_normalised_date_pub | annotator_edition_statement | annotator_genre | annotator_FAST_genre_terms | annotator_FAST_subject_terms | annotator_comments | annotator_main_language | annotator_other_languages_summaries | annotator_summaries_language | annotator_translation | annotator_original_language | annotator_publisher | annotator_place_pub | annotator_country | annotator_title | Link to digitised book | annotated | is_valid | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 014616539 | Monograph | NaN | NaN | NaN | NaN | Hazlitt, William Carew, 1834-1913 [person] | The Baron's Daughter. A ballad by the author o... | Single Works | NaN | NaN | Scotland | Edinburgh | Ballantyne, Hanson | 1877 | NaN | 20 pages (4°) | <NA> | Digital Store 11651.h.6 | NaN | NaN | English | NaN | 000206670 | 263940444.0 | 3.0 | 2020-07-27 07:35:13 UTC | 44330917.0 | 1877 | 1877 | NONE | Fiction | 655 7 $aPoetry$2fast$0(OCoLC)fst01423828 | NONE | NaN | NaN | No | <NA> | No | NaN | Ballantyne Hanson & Co. | Edinburgh | stk | The Baron's Daughter. A ballad by the author o... | http://access.bl.uk/item/viewer/ark:/81055/vdc... | True | False |
We’ll use only the data from the training split so we can continue to use the validation split to compare our results.
df = df[df.is_valid == False]
Check how many examples we have to work with
len(df)
3262
What is a labeling function?¶
We briefly described a function we could use to label our data using the length of the title. When we take a programmatic approach to creating our training data, the functions we use to create our labels are referred to as labeling functions. We’ll follow a lot of the approaches outlined in the Snorkel tutorial in this notebook. The tutorial provides this example of a labeling function for the task of identifying whether a YouTube comment is spam:
from snorkel.labeling import labeling_function

# SPAM and ABSTAIN are label constants defined earlier in the Snorkel tutorial
ABSTAIN = -1
SPAM = 1

@labeling_function()
def lf_contains_link(x):
    # Return a label of SPAM if "http" in comment text, otherwise ABSTAIN
    return SPAM if "http" in x.text.lower() else ABSTAIN
There are a few things to note here, but, since we’re following a lot of what is covered in the Snorkel tutorial we won’t repeat things in too much detail.
The first thing we need is to import labeling_function
from snorkel
as we use this for declaring our labeling functions. The way in which we create a labeling function will depend on our data and how we might label it, but in this example we have a simple Python function which returns SPAM if the text http appears in the comment text; if it doesn’t, it returns ABSTAIN.
We use a Python decorator to indicate that this is a labeling function. If you aren’t familiar with Python decorators, you just need to remember that a decorator modifies the behaviour of the function it decorates, just as fairy lights decorate a Christmas tree and change its behaviour from ‘tree’ to ‘festive ornament’. This will make more sense in the context of Snorkel later on.
If you want to dig into decorators further, this article on Real Python provides a nice introduction, or, if you prefer to watch a video, this YouTube tutorial gives a nice overview too.
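If it helps to see the mechanics outside of Snorkel, here is a tiny, made-up decorator (nothing to do with labeling functions) which wraps the function it decorates and changes what calling it does:

def shout(func):
    # Wrap the decorated function so its return value is upper-cased
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs).upper()
    return wrapper

@shout
def greet(name):
    return f"hello {name}"

greet("snorkel")  # returns 'HELLO SNORKEL'

In the same way, the labeling_function decorator wraps our plain Python function and turns it into a LabelingFunction object that Snorkel knows how to apply to our data.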
We can see here that the labeling function makes use of the idea that people often include links in spam comments i.e. “plz checkout my etsy store at http:….”. Obviously this won’t be correct all the time but fortunately Snorkel has some ways to deal with this.
What Makes a Good Labelling Function?¶
One question we might already have is “what makes a good labeling function?”. The short, annoying, answer is that it depends on context. We often have intuitions about things that might work because we know the domain or have picked up ideas from working with some of the data already. In our particular example of distinguishing fiction from non-fiction books, we may think that some words are likely to indicate whether a book is fiction or non-fiction. We’ll start by exploring this.
Important Words?¶
A fairly naive approach to labeling a title as ‘fiction’ or ‘non-fiction’ would be to use some keywords. Let’s start by finding the 50 most common words. We can use the Counter
class from the delightful collections
module to do this.
from collections import Counter
Counter(" ".join(df["Title"]).split()).most_common(50)
[('of', 2255),
('the', 1785),
('and', 1536),
('...', 1054),
('in', 819),
('The', 693),
('A', 625),
('a', 625),
('etc', 557),
('by', 472),
('to', 413),
('With', 314),
('from', 268),
('with', 250),
('de', 228),
('van', 223),
('By', 201),
('its', 196),
('en', 194),
('der', 193),
('History', 179),
('J.', 159),
('on', 158),
('an', 157),
('[With', 157),
('[A', 152),
('illustrations', 152),
('New', 125),
('other', 121),
('for', 117),
('novel.]', 111),
('edition', 110),
('or,', 109),
('H.', 108),
('Illustrated', 98),
('A.', 96),
('und', 91),
('af', 88),
('G.', 87),
('den', 87),
('och', 75),
('C.', 73),
('or', 72),
('i', 71),
('het', 70),
('An', 68),
('Edited', 67),
('novel', 67),
('W.', 64),
('during', 64)]
We can see here that the most common words tend to be stop words. Since we want to know which words might be unique to fiction or non-fiction we’ll look at each of these separately.
df_fiction = df[df["annotator_genre"] == "Fiction"]
df_non_fiction = df[df["annotator_genre"] == "Non-fiction"]
most_frequent_fiction = Counter(" ".join(df_fiction["Title"]).split()).most_common(50)
most_frequent_fiction
[('of', 490),
('The', 331),
('A', 316),
('the', 260),
('and', 242),
('a', 184),
('...', 177),
('[A', 147),
('by', 138),
('in', 112),
('novel.]', 111),
('etc', 104),
('By', 104),
('other', 94),
('novel', 67),
('With', 53),
('tale', 50),
('der', 49),
('edition', 48),
('de', 47),
('author', 45),
('van', 45),
('or,', 41),
('en', 40),
('Poems', 39),
('J.', 39),
('story', 39),
('illustrations', 38),
('[i.e.', 35),
('A.', 34),
('stories', 30),
('romance', 29),
('H.', 28),
('poems', 28),
('or', 26),
('Second', 26),
('und', 26),
('C.', 25),
('poem', 25),
('with', 25),
('verse', 24),
('An', 24),
('from', 24),
('Tales', 24),
('New', 23),
('for', 20),
('acts', 20),
('collection', 20),
('het', 20),
('an', 19)]
most_frequent_non_fiction = Counter(
" ".join(df_non_fiction["Title"]).split()
).most_common(50)
most_frequent_non_fiction
[('of', 1765),
('the', 1525),
('and', 1294),
('...', 877),
('in', 707),
('etc', 453),
('a', 441),
('to', 397),
('The', 362),
('by', 334),
('A', 309),
('With', 261),
('from', 244),
('with', 225),
('its', 193),
('de', 181),
('van', 178),
('History', 176),
('en', 154),
('on', 153),
('[With', 144),
('der', 144),
('an', 138),
('J.', 120),
('illustrations', 114),
('New', 102),
('for', 97),
('By', 97),
('Illustrated', 94),
('af', 88),
('H.', 80),
('och', 75),
('den', 75),
('G.', 74),
('i', 71),
('or,', 68),
('und', 65),
('during', 64),
('edition', 62),
('A.', 62),
('history', 62),
('og', 61),
('account', 57),
('sketches', 54),
('W.', 53),
('P.', 51),
('through', 51),
('notes', 50),
('edition,', 50),
('het', 50)]
For our indicator words to be most reliable we would rather they didn’t appear frequently in both fiction and non-fiction titles. We can use a set difference to check which values appear in the non-fiction list but not in the fiction one.
set(most_frequent_non_fiction).difference(set(most_frequent_fiction))
{('...', 877),
('A', 309),
('A.', 62),
('By', 97),
('G.', 74),
('H.', 80),
('History', 176),
('Illustrated', 94),
('J.', 120),
('New', 102),
('P.', 51),
('The', 362),
('W.', 53),
('With', 261),
('[With', 144),
('a', 441),
('account', 57),
('af', 88),
('an', 138),
('and', 1294),
('by', 334),
('de', 181),
('den', 75),
('der', 144),
('during', 64),
('edition', 62),
('edition,', 50),
('en', 154),
('etc', 453),
('for', 97),
('from', 244),
('het', 50),
('history', 62),
('i', 71),
('illustrations', 114),
('in', 707),
('its', 193),
('notes', 50),
('och', 75),
('of', 1765),
('og', 61),
('on', 153),
('or,', 68),
('sketches', 54),
('the', 1525),
('through', 51),
('to', 397),
('und', 65),
('van', 178),
('with', 225)}
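Note that the comparison above is between (word, count) tuples rather than the words themselves, which is one reason common words such as ‘of’ and ‘the’ still show up (their counts differ between the two lists). A quick sketch comparing just the words, using the variables defined above, gives a slightly cleaner picture:

fiction_words = {word for word, _ in most_frequent_fiction}
non_fiction_words = {word for word, _ in most_frequent_non_fiction}

# Frequent non-fiction words which aren't also frequent fiction words
non_fiction_words - fiction_words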
These words are still fairly noisy so we might be wary of using many of them. There are some which make sense intuitively so we’ll try some of these out and see how they perform.
Creating our Labelling Functions¶
We’ll start by setting some constants for our labels.
ABSTAIN = -1
FICTION = 0
NON_FICTION = 1
It is important to note that we set an option for ABSTAIN. We often want to give our labeling functions the option of deferring from making a prediction: we usually write a labeling function to try and indicate one particular label, and the fact that its condition isn’t satisfied usually doesn’t tell us that the other label is correct.
Another important part of labeling functions is that we usually want to have many of them. Since we’re not relying on a single function to label our data, it’s often better for a labeling function to return ABSTAIN when its condition isn’t met rather than returning another label. This becomes even more important if we have multiple labels.
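To illustrate why abstaining matters more as the number of labels grows, here is a small hypothetical sketch: the extra classes and the keyword are invented for illustration and aren’t part of our actual genre task. Knowing that a title looks like poetry tells us nothing about which of the other classes would otherwise apply, so the function abstains rather than guessing.

# Hypothetical extra classes, purely for illustration (not part of our genre task)
POETRY, DRAMA, PROSE = 2, 3, 4

@labeling_function()
def lf_contains_sonnet(x):
    # "sonnet" is reasonable evidence for POETRY; its absence tells us nothing
    # about whether a title is DRAMA or PROSE, so we abstain rather than guess.
    return POETRY if "sonnet" in x.Title.lower() else ABSTAIN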
One function we could try to start with is checking if the word “Novel” appears in the title text. You may have noticed that in this particular dataset the word novel often appears as part of the title so this could be a useful indicator of fiction titles.
Warning
We want to be careful that our labeling functions are specific to our data. In the BL books title metadata that we’re trying to label we have noticed that things like A Novel by...
appear in the title. This may not be the case for other book titles in different catalogues.
Our first labeling function is basically the same as the spam example above except we look for the word “novel”.
@labeling_function()
def lf_contains_novel(x):
return FICTION if "novel" in x.Title.lower() else ABSTAIN
Now that we have our labeling function we can apply it to our data. We’ll start by doing this only with our existing labelled data, since we have the correct labels here to compare our functions against.
There are various ways in which Snorkel can apply our functions to our data. In this notebook we’ll stick with an approach designed to work with Pandas dataframes. If we have a larger amount of data to label we might want to explore the Dask applier functions, which use the Dask library to scale the application of labelling functions to very large datasets. We won’t need this here, but if you are planning to develop this approach with very large collections it could be worth exploring.
We put our labelling functions in a list called lfs, then create an applier object and pass in our dataframe.
from snorkel.labeling import PandasLFApplier
lfs = [lf_contains_novel]
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df)
100%|██████████| 3262/3262 [00:00<00:00, 39484.97it/s]
We store the output of this in a new variable L_train
. We can use LFAnalysis
to get a summary of what our current labeling function is doing.
from snorkel.labeling import LFAnalysis
LFAnalysis(L=L_train, lfs=lfs).lf_summary()
j | Polarity | Coverage | Overlaps | Conflicts | |
---|---|---|---|---|---|
lf_contains_novel | 0 | [0] | 0.058553 | 0.0 | 0.0 |
We can see a row for our current labeling function. We can also see that at the moment our coverage (i.e. how much of our data is labeled by our function) is very low. We don’t have any overlaps or conflicts yet, since these are relevant only when we have multiple labeling functions.
We have ground truth labels that we can use to evaluate how accurate our labeling function is. To do this we need to encode our ground truth with the same label values as Snorkel, so we’ll map our labels to the constants we made above.
ground_truth = df.annotator_genre.replace({"Fiction": 0, "Non-fiction": 1})
We can pass our ground truth labels to lf_empirical_accuracies to get a summary of the performance of our functions.
LFAnalysis(L=L_train, lfs=lfs).lf_empirical_accuracies(ground_truth)
array([1.])
We can see here that our current function is 100% accurate. We shouldn’t get too excited about this since our coverage is very low. We’ll need to write some more labeling functions to make sure that we have some chance of labeling more of our data than we currently have done.
Heuristics¶
We could also use heuristics for our labeling functions, for example the length of the title. I don’t have any idea what threshold to use for this, but since we have some labels we can try to identify a sensible one. First we’ll add a new column to our DataFrame which contains the length of our titles.
df["text_len"] = df["Title"].str.len()
We’ll now use a pandas groupby
to see what the lengths look like for fiction vs non-fiction books. Since it might be useful to have a sense of the distributions we’ll use describe
instead of mean.
df.groupby(["annotator_genre"])["text_len"].describe()
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
annotator_genre | ||||||||
Fiction | 1083.0 | 49.438596 | 35.095600 | 5.0 | 25.0 | 39.0 | 63.0 | 271.0 |
Non-fiction | 2179.0 | 92.317118 | 58.458339 | 8.0 | 50.0 | 78.0 | 125.0 | 469.0 |
Precision vs Recall: What Value to use for our Threshold?¶
We can see various values for mean, min etc. What would be a reasonable value to use for our threshold for a labeling function which labeled a title as ‘non-fiction’? This partly comes down to whether we want high coverage (or recall) or high precision. If we choose a threshold that is higher we will label fewer examples, but they will be more likely to be correct.
For example, if we use the max value for the length of a non-fiction title, 469, almost all titles will be shorter than this, so our function will ‘abstain’ from applying a label and we would only label a very small number of examples from our data. However, we also won’t have many (or any) wrongly-labeled examples, since the max title length for fiction here is 271. We need to balance these two aims of coverage and precision. Since we are writing more than one labeling function, we probably want to tend towards writing more precise labeling functions rather than aiming for high coverage if that is likely to introduce wrongly labeled examples.
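One practical way of choosing is to sweep a few candidate thresholds over the labelled data we already have and look at the coverage/precision trade-off directly. A rough sketch, using the text_len column we created above (the candidate values below are arbitrary assumptions):

for threshold in [100, 150, 200, 250, 300]:
    # Titles longer than the threshold would be labelled NON_FICTION by such a function
    labelled = df[df["text_len"] > threshold]
    coverage = len(labelled) / len(df)
    precision = (labelled["annotator_genre"] == "Non-fiction").mean()
    print(f"threshold {threshold}: coverage {coverage:.3f}, precision {precision:.3f}")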
Note
As we saw in previous chapters/notebooks, we have to be a bit careful in generalizing from what we see in our training and validation data since there may be some distribution drift between our training data (which wasn’t a completely randomized sample) and the full data that we want to label. In the error analysis notebook we saw that the performance of our model on new data was worse than it was on the validation data. We should keep this in mind when writing a labeling function, since we want our labeling functions to work well on new data which doesn’t have labels, not just on the data for which we already have labels.
We’ll use a threshold towards the upper end of the fiction title lengths we saw above. This will hopefully give us fairly decent coverage without too many mistakes.
@labeling_function()
def lf_is_long_title(x):
return NON_FICTION if x.text_len > 211.0 else ABSTAIN
We do the same as earlier, including our new labeling function
lfs = [lf_contains_novel, lf_is_long_title]
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df)
100%|██████████| 3262/3262 [00:00<00:00, 28729.01it/s]
LFAnalysis(L=L_train, lfs=lfs).lf_summary()
j | Polarity | Coverage | Overlaps | Conflicts | |
---|---|---|---|---|---|
lf_contains_novel | 0 | [0] | 0.058553 | 0.0 | 0.0 |
lf_is_long_title | 1 | [1] | 0.023299 | 0.0 | 0.0 |
We can see our coverage is still fairly low but at the moment we don’t have any conflicts. We can keep tweaking our length threshold but for now we’ll try a different approach to our labeling function.
Add Keywords¶
We already have a labeling function that uses the keyword ‘novel’ to identify likely fiction books. Since we often want to use keywords the Snorkel tutorial suggests a way we can do this more easily using keyword lookups.
from snorkel.labeling import LabelingFunction
def keyword_lookup(x, keywords, label):
if any(word in x.Title.lower() for word in keywords):
return label
return ABSTAIN
def make_keyword_lf(keywords, label=FICTION):
return LabelingFunction(
name=f"keyword_{keywords[0]}",
f=keyword_lookup,
resources=dict(keywords=keywords, label=label),
)
We can try two new keyword labels using this more concise approach:
keyword_tale = make_keyword_lf(keywords=["tale"])
keyword_poem = make_keyword_lf(keywords=["poem"])
Leveraging Other Models¶
So far we have leveraged some domain knowledge/exploration and our existing labeled data to create our labeling functions. However, we could also utilise other resources to help us label our data. Since we’re working with text we should be able to benefit from some existing NLP models to label our data. Snorkel supports this in a few different ways.
spaCy is a popular NLP library which supports a range of different models and NLP tasks. Here we’re particularly interested in some of the named entity recognition models supported by spaCy.
To work with this library we can use Snorkel’s SpacyPreprocessor. Preprocessors are used in Snorkel to do some preprocessing (hence the name) which is required by our labeling functions. These can be particularly useful if the processing takes some time and might be reused across different labelling functions. Let’s take a look at an example.
from snorkel.preprocess.nlp import SpacyPreprocessor
# The SpacyPreprocessor parses the text in text_field and
# stores the new enriched representation in doc_field
spacy = SpacyPreprocessor(text_field="Title", doc_field="doc", memoize=True)
Above we create a SpacyPreprocessor
which will use our title field and create a new doc
field. This doc
refers to the Spacy doc
container. This can be reused for multiple different labeling functions. We pass in memoize=True
to cache our results. This means we won’t have to wait for the preprocessing to be done multiple times for different labeling functions which reuse the doc
container.
Using named entities for labeling functions¶
spaCy has support for named entity recognition. Since these models are already created and can be used directly by us it might be worth seeing if named entities are of any benefit for our particular task.
We can again draw on our domain knowledge, intuition or guesses (depending on how confident we are) and say that it’s likely that we will see more named entities of the ORG type in non-fiction titles, since these will often be about organizations. We can combine this with a slightly softer threshold for length to label titles as likely non-fiction.
To create this function we replicate closely what we did before, except that we pass in our SpacyPreprocessor instance to let Snorkel know that this preprocessor is a requirement of this labeling function. Under the hood this means that if the preprocessor hasn’t already been run it will be triggered. If we reuse it for another function the preprocessing will already have been cached.
@labeling_function(pre=[spacy])
def has_many_org(x):
    # Count only the ORG entities, rather than all entities
    if len(x.doc) > 50 and sum(ent.label_ == "ORG" for ent in x.doc.ents) > 1:
        return NON_FICTION
    else:
        return ABSTAIN
We might also guess that there will be more location entities in non-fiction titles
@labeling_function(pre=[spacy])
def has_many_loc(x):
    # Count only the LOC entities
    if len(x.doc) > 50 and sum(ent.label_ == "LOC" for ent in x.doc.ents) > 2:
        return NON_FICTION
    else:
        return ABSTAIN
Similarly we might also assume that there will be more GPE
entities for non-fiction
@labeling_function(pre=[spacy])
def has_many_gpe(x):
    # Count only the GPE (geopolitical entity) entities
    if len(x.doc) > 50 and sum(ent.label_ == "GPE" for ent in x.doc.ents) > 2:
        return NON_FICTION
    else:
        return ABSTAIN
and law entities…
@labeling_function(pre=[spacy])
def has_law(x):
if any([ent.label_ == "LAW" for ent in x.doc.ents]):
return NON_FICTION
else:
return ABSTAIN
and if it’s long and has a date it might be a non-fiction title?
@labeling_function(pre=[spacy])
def is_long_and_has_date(x):
if len(x.doc) > 50 and any([ent.label_ == "DATE" for ent in x.doc.ents]):
return NON_FICTION
else:
return ABSTAIN
or it is long and has a FAC entity…
@labeling_function(pre=[spacy])
def is_long_and_has_fac(x):
if len(x.doc) > 50 and any([ent.label_ == "FAC" for ent in x.doc.ents]):
return NON_FICTION
else:
return ABSTAIN
We now have a bunch of labeling functions, so we’ll create a new list containing them and see how they do.
lfs = [
lf_contains_novel,
lf_is_long_title,
keyword_tale,
keyword_poem,
has_many_org,
has_many_loc,
has_many_gpe,
has_law,
is_long_and_has_date,
]
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df)
100%|██████████| 3262/3262 [00:34<00:00, 93.42it/s]
LFAnalysis(L=L_train, lfs=lfs).lf_summary()
j | Polarity | Coverage | Overlaps | Conflicts | |
---|---|---|---|---|---|
lf_contains_novel | 0 | [0] | 0.058553 | 0.000000 | 0.0 |
lf_is_long_title | 1 | [1] | 0.023299 | 0.011036 | 0.0 |
keyword_tale | 2 | [0] | 0.043532 | 0.000000 | 0.0 |
keyword_poem | 3 | [0] | 0.042918 | 0.000000 | 0.0 |
has_many_org | 4 | [1] | 0.011956 | 0.011956 | 0.0 |
has_many_loc | 5 | [1] | 0.011036 | 0.011036 | 0.0 |
has_many_gpe | 6 | [1] | 0.011036 | 0.011036 | 0.0 |
has_law | 7 | [1] | 0.004905 | 0.000000 | 0.0 |
is_long_and_has_date | 8 | [1] | 0.003372 | 0.003372 | 0.0 |
Again our coverage is quite low but we also don’t have too many conflicts. We can check the performance of these functions:
LFAnalysis(L=L_train, lfs=lfs).lf_empirical_accuracies(ground_truth)
array([1. , 0.94736842, 0.97183099, 1. , 0.92307692,
0.91666667, 0.91666667, 1. , 1. ])
These are all doing pretty well, so we might be okay with lower coverage for now. We also have a resource available to us which should boost our coverage a fair bit: our previously trained model.
Using our previous model¶
In a previous notebook we trained a model which didn’t perform terribly. Although we want to improve its performance (hence this notebook), it wasn’t so disastrous as to be unusable, particularly given the insight from the error analysis notebook that if we raise the confidence threshold at which we accept our model’s predictions, its performance increases quite a bit. We may therefore want to incorporate this model as another way of labeling more data.
There are various ways in which we could do this; we’ll look at one approach below.
We’ll start by importing fastai so we can load our previously trained model.
from fastai.text.all import *
If you don’t have a model saved from the previous notebook you can download one by uncommenting the cell below.
# !wget -O 20210928-model.pkl https://zenodo.org/record/5245175/files/20210928-model.pkl?download=1
--2021-11-11 14:12:20-- https://zenodo.org/record/5245175/files/20210928-model.pkl?download=1
Resolving zenodo.org (zenodo.org)... 137.138.76.77
Connecting to zenodo.org (zenodo.org)|137.138.76.77|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 158529715 (151M) [application/octet-stream]
Saving to: ‘20210928-model.pkl’
20210928-model.pkl 100%[===================>] 151.19M 11.1MB/s in 16s
2021-11-11 14:12:38 (9.54 MB/s) - ‘20210928-model.pkl’ saved [158529715/158529715]
learn = load_learner("20210928-model.pkl")
We can quickly check our vocab
learn.dls.vocab[1]
['Fiction', 'Non-fiction']
One way of using this model would be to create a preprocessor that will be used by Snorkel. This will do the setup required to use the model (as we saw with the spaCy example). We can do this by using the preprocessor decorator. Our function then does whatever we need to happen; in this case we store the predicted probability for each label.
# from snorkel.preprocess import preprocessor
# @preprocessor(memoize=True)
# def fastai_pred(x):
# with learn.no_bar():
# *_, probs = learn.predict(x.Title)
# x.fiction_prob = probs[0]
# x.non_fiction_prob = probs[1]
# return x
In this example we don’t want to use this approach, since we then wouldn’t benefit from doing our inference in batches. Instead we’ll just create some new columns to store our fastai model’s labels and confidence.
test_dl = learn.dls.test_dl(df.Title)
preds = learn.get_preds(dl=test_dl)
fiction_prob, non_fiction_prob = np.hsplit(preds[0].numpy(), 2)
fiction_prob
array([[0.9999399 ],
[0.9999399 ],
[0.9999399 ],
...,
[0.04363291],
[0.04363291],
[0.02832149]], dtype=float32)
df["fiction_prob"] = fiction_prob
df["non_fiction_prob"] = non_fiction_prob
We now have some new columns containing the probabilities for our labels from our previously created model.
We saw in the previous Assessing Where our Model is Going Wrong section that by only using predictions where our model was confident, we could get better results i.e. we only accept suggestions from our model where it is very confident. For example, we could accept a prediction only if it is above 95% confidence.
We’ll use this in our labelling functions to set a threshold at which we accept the previous model’s predictions. If the model is unsure we don’t use its prediction. This will mean less of our data ends up labelled because some predictions aren’t used; however, we will hopefully get better labels because we only use predictions where our model is confident.
@labeling_function()
def fastai_fiction_prob_v_high(x):
return FICTION if x.fiction_prob > 0.97 else ABSTAIN
@labeling_function()
def fastai_non_fiction_prob_v_high(x):
return NON_FICTION if x.non_fiction_prob > 0.97 else ABSTAIN
Again we add these to our existing labeling function list and apply it to our data.
lfs += [fastai_fiction_prob_v_high, fastai_non_fiction_prob_v_high]
lfs
[LabelingFunction lf_contains_novel, Preprocessors: [],
LabelingFunction lf_is_long_title, Preprocessors: [],
LabelingFunction keyword_tale, Preprocessors: [],
LabelingFunction keyword_poem, Preprocessors: [],
LabelingFunction has_many_org, Preprocessors: [SpacyPreprocessor SpacyPreprocessor, Pre: []],
LabelingFunction has_many_loc, Preprocessors: [SpacyPreprocessor SpacyPreprocessor, Pre: []],
LabelingFunction has_many_gpe, Preprocessors: [SpacyPreprocessor SpacyPreprocessor, Pre: []],
LabelingFunction has_law, Preprocessors: [SpacyPreprocessor SpacyPreprocessor, Pre: []],
LabelingFunction is_long_and_has_date, Preprocessors: [SpacyPreprocessor SpacyPreprocessor, Pre: []],
LabelingFunction fastai_fiction_prob_v_high, Preprocessors: [],
LabelingFunction fastai_non_fiction_prob_v_high, Preprocessors: []]
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df)
100%|██████████| 3262/3262 [00:34<00:00, 95.06it/s]
LFAnalysis(L=L_train, lfs=lfs).lf_summary()
j | Polarity | Coverage | Overlaps | Conflicts | |
---|---|---|---|---|---|
lf_contains_novel | 0 | [0] | 0.058553 | 0.056101 | 0.000000 |
lf_is_long_title | 1 | [1] | 0.023299 | 0.019926 | 0.000000 |
keyword_tale | 2 | [0] | 0.043532 | 0.034028 | 0.000613 |
keyword_poem | 3 | [0] | 0.042918 | 0.035561 | 0.000000 |
has_many_org | 4 | [1] | 0.011956 | 0.011956 | 0.000920 |
has_many_loc | 5 | [1] | 0.011036 | 0.011036 | 0.000920 |
has_many_gpe | 6 | [1] | 0.011036 | 0.011036 | 0.000920 |
has_law | 7 | [1] | 0.004905 | 0.002759 | 0.000000 |
is_long_and_has_date | 8 | [1] | 0.003372 | 0.003372 | 0.000000 |
fastai_fiction_prob_v_high | 9 | [0] | 0.223483 | 0.125996 | 0.000920 |
fastai_non_fiction_prob_v_high | 10 | [1] | 0.338749 | 0.021153 | 0.000613 |
We can see that the labelling functions which use our model’s outputs have much higher coverage of our data. This should be very helpful in labelling more examples, but we want to check that these labels are correct.
LFAnalysis(L=L_train, lfs=lfs).lf_empirical_accuracies(ground_truth)
array([1. , 0.94736842, 0.97183099, 1. , 0.92307692,
0.91666667, 0.91666667, 1. , 1. , 0.99314129,
0.99909502])
We can see that our labels all perform pretty well i.e. above 90%. We are also getting much better coverage now that we leverage our existing model.
Creating more training data¶
So far we have been using our existing labelled data to develop some potential labeling functions. Now that we are fairly satisfied with them, let’s apply them to the full data. We’ll quickly look at this process on our current data and then move to the full metadata file that we’ll use for creating more training data.
We use LabelModel to fit a model which takes as input all of the predictions from our labelling functions and predicts the probability of each label. This model is able to deal with some conflicts between labeling functions and will in most cases do much better than a naive majority-vote model, i.e. one which just accepts the most often predicted label. The details of this model are beyond the scope of this notebook, but if you are interested, Data Programming: Creating Large Training Sets, Quickly offers a fuller overview of the details of this method [8].
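For comparison, Snorkel also provides a simple majority-vote baseline. A rough sketch of how we might score it on the label matrix and ground truth we already have:

from snorkel.labeling.model import MajorityLabelVoter

# Baseline which just takes the most common non-abstain vote for each example
majority_model = MajorityLabelVoter(cardinality=2)
majority_model.score(
    L=L_train,
    Y=ground_truth,
    tie_break_policy="abstain",
    metrics=["precision", "recall", "f1"],
)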
from snorkel.labeling.model import LabelModel
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train=L_train, n_epochs=500, log_freq=100, seed=123)
Above we fit our LabelModel
for 500 epochs. Since we are still working with our labelled training data we can get a score for this model.
label_model.score(
L=L_train,
Y=ground_truth,
tie_break_policy="abstain",
metrics=["precision", "recall", "f1"],
)
WARNING:root:Metrics calculated over data points with non-abstain labels only
{'f1': 0.994250331711632,
'precision': 0.9964539007092199,
'recall': 0.9920564872021183}
This is looking pretty good and hopefully this performance will be similar for our full data. We’ll now load a dataframe that includes all of the BL books metadata.
df_full = pd.read_csv(
"https://bl.iro.bl.uk/downloads/e1be1324-8b1a-4712-96a7-783ac209ddef?locale=en",
dtype=dtypes,
)
We create a new column text_len
since we need this for some of our labeling functions.
df_full["text_len"] = df_full.Title.str.len()
We also get our fastai model’s predictions into new columns. This obviously takes some time since we’re now doing inference on a fairly large dataset.
test_dl = learn.dls.test_dl(df_full.Title)
preds = learn.get_preds(dl=test_dl)
fiction_prob, non_fiction_prob = np.hsplit(preds[0].numpy(), 2)
df_full["fiction_prob"] = fiction_prob
df_full["non_fiction_prob"] = non_fiction_prob
Now that we have all the same columns in place as previously, we can apply our labelling functions to all of our data.
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_full)
100%|██████████| 52695/52695 [09:27<00:00, 92.85it/s]
We can check what the coverage, overlaps and conflicts look like
LFAnalysis(L=L_train, lfs=lfs).lf_summary()
j | Polarity | Coverage | Overlaps | Conflicts | |
---|---|---|---|---|---|
lf_contains_novel | 0 | [0] | 0.066875 | 0.057178 | 0.000285 |
lf_is_long_title | 1 | [1] | 0.047367 | 0.032413 | 0.003036 |
keyword_tale | 2 | [0] | 0.036702 | 0.022165 | 0.001006 |
keyword_poem | 3 | [0] | 0.089952 | 0.049018 | 0.002619 |
has_many_org | 4 | [1] | 0.021520 | 0.021520 | 0.001974 |
has_many_loc | 5 | [1] | 0.021008 | 0.021008 | 0.001917 |
has_many_gpe | 6 | [1] | 0.021008 | 0.021008 | 0.001917 |
has_law | 7 | [1] | 0.004232 | 0.002106 | 0.000531 |
is_long_and_has_date | 8 | [1] | 0.007762 | 0.007762 | 0.000380 |
fastai_fiction_prob_v_high | 9 | [0] | 0.213834 | 0.122858 | 0.000626 |
fastai_non_fiction_prob_v_high | 10 | [1] | 0.192599 | 0.019015 | 0.000683 |
The coverage is lower than we had previously. This makes sense: we used the same data for developing our labeling functions as we used for training our model, so it’s not surprising our model is more confident about those examples. If we were being more diligent we might have held back a different dataset for developing our labelling functions, but since we’re being a bit pragmatic (lazy) here we won’t worry too much about this. We can again fit our model:
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train=L_train, n_epochs=500, log_freq=100, seed=123)
We now use this model to predict the probabilities from our labelling function outputs.
probs_train = label_model.predict_proba(L_train)
We currently have predictions for some of our data but not all of it. Since we want only the labelled examples, we use a function from Snorkel to filter out data which our labeling functions didn’t annotate.
from snorkel.labeling import filter_unlabeled_dataframe
df_train_filtered, probs_train_filtered = filter_unlabeled_dataframe(
X=df_full, y=probs_train, L=L_train
)
We now have the predicted probability for each label. We could work with these probabilities, but to keep things simple we’ll turn them into hard predictions, i.e. fiction or non-fiction rather than 87% fiction. Again Snorkel provides a handy function for doing this.
from snorkel.utils import probs_to_preds
preds_train_filtered = probs_to_preds(probs=probs_train_filtered)
Let’s see how much data we have now.
len(preds_train_filtered)
26566
As a reminder, we previously had 3262 labeled examples. We can see that we’ve now gained a lot more examples for relatively little work (especially if we compare how much time it would take to annotate these by hand).
26566 / 3262
8.144083384426732
We’ll store our new labels in a label column.
df_train_filtered["snorkel_label"] = preds_train_filtered
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
"""Entry point for launching an IPython kernel.
df_train_filtered["snorkel_label"]
0 0
1 0
2 0
3 0
8 0
..
52682 0
52689 0
52692 0
52693 0
52694 0
Name: snorkel_label, Length: 26566, dtype: int64
Creating our new training data¶
As a reminder of what we’ve done:
we had training data/annotations collected via a Zooniverse crowdsourcing task, with 2909 labeled examples in our validation set
we had previously used this to train a model that did fairly well
we used our existing training data to generate labeling functions; these leveraged:
our intuitions about our data
spaCy models
our previous model
we applied these labeling functions to the Microsoft Digitised Books metadata file. Once we excluded examples which weren’t labeled by our labeling functions we had 26566 labeled examples we could work with.
We now want to get all of this data into a format we can use to train new models with. There are a few things we need to do for this.
Map to our original labels¶
We’ll map these back to our original fiction and non-fiction labels. This isn’t super important but might be more explicit than 1 or 0 for our labels.
df_train_filtered["snorkel_genre"] = df_train_filtered["snorkel_label"].map(
{0: "Fiction", 1: "Non-fiction"}
)
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df_train_filtered.columns
Index(['BL record ID', 'Type of resource', 'Name',
'Dates associated with name', 'Type of name', 'Role', 'All names',
'Title', 'Variant titles', 'Series title', 'Number within series',
'Country of publication', 'Place of publication', 'Publisher',
'Date of publication', 'Edition', 'Physical description',
'Dewey classification', 'BL shelfmark', 'Topics', 'Genre', 'Languages',
'Notes', 'BL record ID for physical resource', 'text_len',
'fiction_prob', 'non_fiction_prob', 'snorkel_label', 'snorkel_genre'],
dtype='object')
Selecting required columns¶
Since we have only been using the title and the label (fiction or non-fiction) to train our models we will just keep these.
df["annotator_genre"]
0 Fiction
1 Fiction
2 Fiction
3 Fiction
4 Fiction
...
3257 Non-fiction
3258 Non-fiction
3259 Non-fiction
3260 Non-fiction
3261 Non-fiction
Name: annotator_genre, Length: 3262, dtype: object
df["snorkel_genre"] = df["annotator_genre"]
df_snorkel_train = pd.concat([df, df_train_filtered])
df_snorkel_train["snorkel_genre"].value_counts()
Fiction 15840
Non-fiction 13988
Name: snorkel_genre, dtype: int64
Prioritising human annotations¶
When we applied our labeling functions across the full Microsoft Digitised Books metadata file we didn’t do anything to exclude titles where a human annotator had already provided a label as part of the Zooniverse annotation task. Since we joined the full metadata and the human annotations together, we will now have some duplicates. We almost certainly want to prioritise the human annotations over our labeling function labels. We can use Pandas drop_duplicates and keep the first example (the human annotated one) to deal with this.
df_snorkel_train.duplicated(subset="Title")
0 False
1 True
2 True
3 True
4 True
...
52682 False
52689 False
52692 False
52693 False
52694 False
Length: 29828, dtype: bool
df_snorkel_train = df_snorkel_train.drop_duplicates(subset="Title", keep="first")
df_snorkel_train
BL record ID | Type of resource | Name | Dates associated with name | Type of name | Role | All names | Title | Variant titles | Series title | Number within series | Country of publication | Place of publication | Publisher | Date of publication | Edition | Physical description | Dewey classification | BL shelfmark | Topics | Genre | Languages | Notes | BL record ID for physical resource | classification_id | user_id | created_at | subject_ids | annotator_date_pub | annotator_normalised_date_pub | annotator_edition_statement | annotator_genre | annotator_FAST_genre_terms | annotator_FAST_subject_terms | annotator_comments | annotator_main_language | annotator_other_languages_summaries | annotator_summaries_language | annotator_translation | annotator_original_language | annotator_publisher | annotator_place_pub | annotator_country | annotator_title | Link to digitised book | annotated | is_valid | text_len | fiction_prob | non_fiction_prob | snorkel_genre | snorkel_label | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 014616539 | Monograph | NaN | NaN | NaN | NaN | Hazlitt, William Carew, 1834-1913 [person] | The Baron's Daughter. A ballad by the author of Poetical Recreations [i.e. William C. Hazlitt] . F.P | Single Works | NaN | NaN | Scotland | Edinburgh | Ballantyne, Hanson | 1877 | NaN | 20 pages (4°) | <NA> | Digital Store 11651.h.6 | NaN | NaN | English | NaN | 000206670 | 263940444.0 | 3.0 | 2020-07-27 07:35:13 UTC | 44330917.0 | 1877 | 1877 | NONE | Fiction | 655 7 $aPoetry$2fast$0(OCoLC)fst01423828 | NONE | NaN | NaN | No | <NA> | No | NaN | Ballantyne Hanson & Co. | Edinburgh | stk | The Baron's Daughter. A ballad by the author of Poetical Recreations [i.e. William C. Hazlitt] . F.P | http://access.bl.uk/item/viewer/ark:/81055/vdc_00000002F718 | True | False | 100 | 0.999940 | 0.000060 | Fiction | NaN |
5 | 014616561 | Monograph | Bingham, Ashton, Mrs | NaN | person | NaN | Bingham, Ashton, Mrs [person] | The Autumn Leaf Poems | NaN | NaN | NaN | Scotland | Edinburgh | Colston | 1891 | NaN | vi, 104 pages (8°) | <NA> | Digital Store 011649.e.105 | NaN | NaN | English | NaN | 000353271 | 268728281.0 | 3.0 | 2020-08-18 07:02:17 UTC | 44331070.0 | 1891 | 1891 | NONE | Fiction | 655 7 $aPoetry$2fast$0(OCoLC)fst01423828 | NONE | NaN | NaN | No | <NA> | No | NaN | Colston & Company | Edinburgh | stk | The Autumn Leaf Poems | http://access.bl.uk/item/viewer/ark:/81055/vdc_00000002F04C | True | False | 21 | 0.999486 | 0.000514 | Fiction | NaN |
10 | 014616607 | Monograph | Cartwright, William | NaN | person | writer | Cartwright, William, writer [person] | The Battle of Waterloo, a poem | NaN | NaN | NaN | England | London | Longman | 1827 | NaN | vii, 71 pages (8°) | <NA> | Digital Store 992.i.26 | Waterloo, Battle of (Belgium : 1815) | NaN | English | NaN | 000621918 | 263935396.0 | 3.0 | 2020-07-27 06:39:57 UTC | 44331748.0 | 1827 | 1827 | NONE | Fiction | 655 7 $aPoetry$2fast$0(OCoLC)fst01423828 | 647 7 $aBattle of Waterloo$c(Waterloo, Belgium :$d1815)$2fast$0(OCoLC)fst01172689 | NaN | NaN | No | <NA> | No | NaN | Longman, Rees, Orme, Brown & Green\nBurlton\nMerricks | London\nLeominster\nHereford | enk | The Battle of Waterloo, a poem | http://access.bl.uk/item/viewer/ark:/81055/vdc_00000002ED4C | True | False | 30 | 0.991599 | 0.008401 | Fiction | NaN |
15 | 014616686 | Monograph | Earle, John Charles | NaN | person | NaN | Earle, John Charles [person] | Maximilian, and other poems, etc | NaN | NaN | NaN | England | London | NaN | 1868 | NaN | NaN | <NA> | Digital Store 11648.i.8 | NaN | Poetry or verse | English | NaN | 001025896 | 265570129.0 | 3.0 | 2020-08-03 07:25:30 UTC | 44331725.0 | 1868 | 1868 | NONE | Fiction | 655 7 $aPoetry$2fast$0(OCoLC)fst01423828 | NONE | NaN | NaN | No | <NA> | No | NaN | Burns, Oates, & Co. | London | enk | Maximilian, and other poems, etc | http://access.bl.uk/item/viewer/ark:/81055/vdc_00000002F2AA | True | False | 32 | 0.982546 | 0.017454 | Fiction | NaN |
20 | 014616696 | Monograph | NaN | NaN | NaN | NaN | NaN | Fabellæ mostellariæ: or Devonshire and Wiltshire stories in verse; including specimens of the Devonshire dialect | NaN | NaN | NaN | England | Exeter ; London | Hamilton, Adams ; Henry S. Eland | 1878 | NaN | 77 pages (8°) | <NA> | Digital Store 11652.h.19 | NaN | NaN | English | NaN | 001187981 | 269169228.0 | 3.0 | 2020-08-20 12:32:34 UTC | 44331389.0 | 1878 | 1878 | NONE | Fiction | 655 7 $aPoetry$2fast$0(OCoLC)fst01423828 | NONE | NaN | NaN | No | <NA> | No | NaN | Hamilton, Adams, and Co.\nHenry S. Eland | London\nExeter | enk | Fabellæ mostellariæ: or Devonshire and Wiltshire stories in verse; including specimens of the Devonshire dialect | http://access.bl.uk/item/viewer/ark:/81055/vdc_00000002F90A | True | False | 112 | 0.983944 | 0.016056 | Fiction | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
52682 | 016289050 | Monograph | Hastings, Beatrice | NaN | person | NaN | Hastings, Beatrice [person] | The maids' comedy. A chivalric romance in thirteen chapters | NaN | NaN | NaN | England | London | Stephen Swift | 1911 | NaN | 199 pages, 20 cm | <NA> | Digital Store 012618.c.32 | NaN | NaN | English | Anonymous. By Beatrice Hastings | 004111105 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | <NA> | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 59 | 0.999444 | 0.000556 | Fiction | 0.0 |
52689 | 016289057 | Monograph | Garstang, Walter, M.A., F.Z.S. | NaN | person | NaN | Garstang, Walter, M.A., F.Z.S. [person] ; Shepherd, J. A. (James Affleck), 1867-approximately 1931 [person] | Songs of the Birds ... With illustrations by J.A. Shepherd | NaN | NaN | NaN | England | London | John Lane | 1922 | NaN | 101 pages, illustrations (8°) | 598.259 | Digital Store 011648.g.133 | NaN | NaN | English | Poems, with and introductory essay | 004158005 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | <NA> | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 58 | 0.993942 | 0.006058 | Fiction | 0.0 |
52692 | 016289060 | Monograph | Wellesley, Dorothy | 1889-1956 | person | NaN | Wellesley, Dorothy, 1889-1956 [person] | Early Poems. By M. A [i.e. Dorothy Violet Wellesley, Lady Gerald Wellesley.] | NaN | NaN | NaN | England | London | Elkin Mathews | 1913 | NaN | vii, 90 pages (8°) | <NA> | Digital Store 011649.eee.17 | NaN | NaN | English | NaN | 000000839 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | <NA> | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 76 | 0.987218 | 0.012782 | Fiction | 0.0 |
52693 | 016289061 | Monograph | A, T. H. E. | NaN | person | NaN | A, T. H. E. [person] | Of Life and Love [Poems.] By T. H. E. A, writer of 'The Message.' | NaN | NaN | NaN | England | London | J. M. Watkins | 1924 | NaN | 89 pages (8°) | <NA> | Digital Store 011645.e.125 | NaN | NaN | English | NaN | 000001167 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | <NA> | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 65 | 0.977032 | 0.022968 | Fiction | 0.0 |
52694 | 016289062 | Monograph | Abbay, Richard | NaN | person | NaN | Abbay, Richard [person] | Life, a Mode of Motion; or, He and I, my two selves [A poem.] | NaN | NaN | NaN | England | London | Jarrold | 1919 | NaN | volumes, 58 pages (8°) | <NA> | Digital Store 011649.g.81 | NaN | NaN | English | NaN | 000003140 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | <NA> | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 61 | 0.975888 | 0.024112 | Fiction | 0.0 |
25683 rows × 52 columns
Data leakage¶
We want to exclude data which is in our test set, so we drop these examples from our training data. Since we care about titles ‘leaking’, we look up whether any titles in our training data appear in our test data and remove them from the training data.
df_test = pd.read_csv("test_errors.csv")
Removing data which is in our test data¶
df_snorkel_train = df_snorkel_train[~df_snorkel_train.Title.isin(df_test.title)]
len(df_snorkel_train)
25683
Creating new splits¶
We create some new splits following the same process we used before. We can then use these splits to more accurately compare models trained using this dataset. Since we have kept the test data out of our ‘Snorkel dataset’ we will also continue to use that test data for final model evaluation.
from sklearn.model_selection import GroupShuffleSplit
train_inds, valid_ins = next(
GroupShuffleSplit(n_splits=2, test_size=0.2).split(
df_snorkel_train, groups=df_snorkel_train["Title"]
)
)
df_train, df_valid = (
df_snorkel_train.iloc[train_inds].copy(),
df_snorkel_train.iloc[valid_ins].copy(),
)
df_train["is_valid"] = False
df_valid["is_valid"] = True
df = pd.concat([df_train, df_valid])
df.snorkel_genre.value_counts()
Fiction 13918
Non-fiction 11765
Name: snorkel_genre, dtype: int64
We can see we still have a healthy number of examples to train our model on even after dropping titles which appear in our test data
len(df)
25683
Saving our new training data¶
We’ll save our new training data as a csv file.
df.to_csv("data/snorkel_train.csv", index=False)
df.head(1)
BL record ID | Type of resource | Name | Dates associated with name | Type of name | Role | All names | Title | Variant titles | Series title | Number within series | Country of publication | Place of publication | Publisher | Date of publication | Edition | Physical description | Dewey classification | BL shelfmark | Topics | Genre | Languages | Notes | BL record ID for physical resource | classification_id | user_id | created_at | subject_ids | annotator_date_pub | annotator_normalised_date_pub | annotator_edition_statement | annotator_genre | annotator_FAST_genre_terms | annotator_FAST_subject_terms | annotator_comments | annotator_main_language | annotator_other_languages_summaries | annotator_summaries_language | annotator_translation | annotator_original_language | annotator_publisher | annotator_place_pub | annotator_country | annotator_title | Link to digitised book | annotated | is_valid | text_len | fiction_prob | non_fiction_prob | snorkel_genre | snorkel_label | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 014616539 | Monograph | NaN | NaN | NaN | NaN | Hazlitt, William Carew, 1834-1913 [person] | The Baron's Daughter. A ballad by the author of Poetical Recreations [i.e. William C. Hazlitt] . F.P | Single Works | NaN | NaN | Scotland | Edinburgh | Ballantyne, Hanson | 1877 | NaN | 20 pages (4°) | <NA> | Digital Store 11651.h.6 | NaN | NaN | English | NaN | 000206670 | 263940444.0 | 3.0 | 2020-07-27 07:35:13 UTC | 44330917.0 | 1877 | 1877 | NONE | Fiction | 655 7 $aPoetry$2fast$0(OCoLC)fst01423828 | NONE | NaN | NaN | No | <NA> | No | NaN | Ballantyne Hanson & Co. | Edinburgh | stk | The Baron's Daughter. A ballad by the author of Poetical Recreations [i.e. William C. Hazlitt] . F.P | http://access.bl.uk/item/viewer/ark:/81055/vdc_00000002F718 | True | False | 100 | 0.99994 | 0.00006 | Fiction | NaN |
Next steps¶
We now have a larger training set which includes both our original training data produced through crowdsourcing and the training data we generated using our labeling functions and the Snorkel library.
Hopefully having more training data will result in being able to improve the models we can generate. In the next sections we’ll look at two approaches we can use for doing this:
training the same model as before but with more data
training a transformer-based model with more data
We will hopefully see some improvements now we have more data.
Note
The main things we tried to show in this notebook:
we can leverage our domain knowledge to help generate training data using a programmatic data labeling approach
this approach can leverage existing training data generated by humans
we can often use existing models to help generate training data even if the task is quite different