Creating More Training Data Without More Annotating

We previously listed ‘annotating’ more data as one way of improving our model. Since supervised learning requires labelled data, having more of it is potentially helpful for improving our model.

One obvious downside is that collecting more training data is time consuming and not always practical. In a GLAM setting we may want to use machine learning to do a task which we wouldn’t otherwise have time to do. If we have to spend a lot of time creating our training data, the machine learning approach may itself become impractical in terms of resources.

Combining Domain Expertise and Machine Learning

The time taken to create training data is one weakness of machine learning for practical tasks. Another potential frustration for domain experts is that their knowledge isn’t always incorporated into the machine learning process. For our use case of trying to identify the genre of a book, we may already have a sense of some possible ways in which we could identify whether a book is fiction or non-fiction. For example, we may already know that titles of non-fiction books tend to be longer than fiction titles (cf. ‘An account of the mining villages of Wales’ vs ‘Oliver Twist’). If we create our training data in the usual way, by labeling examples of our data with the correct label, we might not be able to incorporate this domain knowledge very easily. This might be okay in some cases, but we might be able to save time and get better results by leveraging what we already know (or can access via domain experts).

Programmatically Generating Training Data

One way in which we could do this is by writing a labelling function to label titles as either fiction or non-fiction based on the length of the title, i.e. without any annotation by hand. However, a weakness of this approach is that it deals with averages, which won’t always be correct: some non-fiction book titles will be shorter than our threshold, and vice versa for fiction books. If we could reliably determine genre from title length alone, we could have skipped this whole machine learning process and been done already.
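
To make this concrete, a length-based rule might look something like the sketch below. The 70-character threshold is an arbitrary illustration, not a value derived from this notebook’s data.

# A naive, hand-written rule for illustration only: the 70-character threshold
# is an arbitrary assumption, not something derived from our data.
def label_by_title_length(title: str) -> str:
    if len(title) > 70:
        return "non-fiction"
    return "fiction"


label_by_title_length("Oliver Twist")  # 'fiction'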

So our problem is that we have some sense of functions we could use to label our data, but these functions are likely to be wrong some of the time. In this notebook we’ll explore how we can use the Python library Snorkel to deal with this challenge and try to create additional annotations without doing any annotating by hand.

Generating New Genre Training Data

How will we try to approach this in our particular situation? As a reminder of our broad task, we have a collection of metadata related to the Microsoft Digitised Books collection. The ‘genre’ field isn’t yet fully populated. We have previously used a subset of this data to train a machine learning model.

What we want to do is to try and write some labelling functions that will add more labels to the full metadata dataset, so that we can give our models more examples to learn from. If we are able to do this we’ll hopefully be able to improve the performance of our model from our previous attempts.

We’ll start by installing the libraries we’ll be using in this notebook.

!pip install snorkel
!pip install fastai --upgrade

Since we already have some training data we can leverage this to help us develop labelling functions (more on this below) and to test how well these work.

import pandas as pd
import numpy as np
dtypes = {
    "BL record ID": "string",
    "Type of resource": "category",
    "Name": "category",
    "Type of name": "category",
    "Country of publication": "category",
    "Place of publication": "category",
    "Genre": "category",
    "Dewey classification": "string",
    "BL record ID for physical resource": "string",
    "annotator_main_language": "category",
    "annotator_summaries_language": "string",
}
df = pd.read_csv("https://raw.githubusercontent.com/Living-with-machines/genre-classification/main/genre_classification_of_bl_books/data/train_valid.csv", dtype=dtypes)
df.head(1)
BL record ID Type of resource Name Dates associated with name Type of name Role All names Title Variant titles Series title Number within series Country of publication Place of publication Publisher Date of publication Edition Physical description Dewey classification BL shelfmark Topics Genre Languages Notes BL record ID for physical resource classification_id user_id created_at subject_ids annotator_date_pub annotator_normalised_date_pub annotator_edition_statement annotator_genre annotator_FAST_genre_terms annotator_FAST_subject_terms annotator_comments annotator_main_language annotator_other_languages_summaries annotator_summaries_language annotator_translation annotator_original_language annotator_publisher annotator_place_pub annotator_country annotator_title Link to digitised book annotated is_valid
0 014616539 Monograph NaN NaN NaN NaN Hazlitt, William Carew, 1834-1913 [person] The Baron's Daughter. A ballad by the author o... Single Works NaN NaN Scotland Edinburgh Ballantyne, Hanson 1877 NaN 20 pages (4°) <NA> Digital Store 11651.h.6 NaN NaN English NaN 000206670 263940444.0 3.0 2020-07-27 07:35:13 UTC 44330917.0 1877 1877 NONE Fiction 655 7 $aPoetry$2fast$0(OCoLC)fst01423828 NONE NaN NaN No <NA> No NaN Ballantyne Hanson & Co. Edinburgh stk The Baron's Daughter. A ballad by the author o... http://access.bl.uk/item/viewer/ark:/81055/vdc... True False

We’ll use only the data from the training split so we can continue to use the validation split to compare our results.

df = df[df.is_valid == False]

Check how many examples we have to work with

len(df)
3262

What is a labeling function?

We briefly described a function we could use to label our data using the length of the title. When we use a programmatic approach to creating our training data, we refer to the functions we use to create our labels as labeling functions. We’ll follow a lot of the approaches outlined in the Snorkel tutorial in this notebook. The tutorial provides this example of a labeling function for the task of identifying whether a YouTube comment is spam:

from snorkel.labeling import labeling_function

# Label constants as used in the Snorkel spam tutorial
ABSTAIN = -1
SPAM = 1


@labeling_function()
def lf_contains_link(x):
    # Return a label of SPAM if "http" in comment text, otherwise ABSTAIN
    return SPAM if "http" in x.text.lower() else ABSTAIN

There are a few things to note here, but, since we’re following a lot of what is covered in the Snorkel tutorial we won’t repeat things in too much detail.

The first thing we need to do is import labeling_function from snorkel, which we use to declare our labeling functions. The way in which we create a labeling function will depend on our data and how we might label it, but in this example we have a simple Python function which returns SPAM if the text http appears in the comment text; if it doesn’t, it returns ABSTAIN.

We use a Python decorator to indicate that this is a labeling function. If you aren’t familiar with Python decorators, you should just remember that decorators are used to modify the behavior of the function they decorate, just as fairy lights decorate a Christmas tree and change its behaviour from ‘tree’ to ‘festive ornament’. This will make more sense in the context of Snorkel later on.
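
As a minimal illustration (not part of Snorkel), the decorator below wraps the function it decorates and changes what calling it returns:

# A toy decorator: `shout` wraps the decorated function and upper-cases its result.
def shout(func):
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs).upper()
    return wrapper


@shout
def greet(name):
    return f"hello {name}"


greet("library")  # 'HELLO LIBRARY'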

If you want to dig into decorators further, this article on Real Python provides a nice introduction, or, if you prefer to watch a video, this YouTube tutorial gives a nice overview too.

We can see here that the labeling function makes use of the idea that people often include links in spam comments, e.g. “plz checkout my etsy store at http:…”. Obviously this won’t be correct all the time, but fortunately Snorkel has some ways to deal with this.

What Makes a Good Labelling Function?

One question we might already have is “what makes a good labeling function?”. The short, annoying, answer is that it depends on context. We often have intuitions about things that might work because we know the domain or have picked up ideas from working with some of the data already. In our particular example of distinguishing fiction from non-fiction books we may think that some words are likely to indicate whether a book is fiction or non-fiction. We’ll start by exploring this.

Important Words?

A fairly naive approach to labeling a title as ‘fiction’ or ‘non-fiction’ would be to use some keywords. Let’s start by finding the 50 most common words. We can use the Counter class from the delightful collections module to do this.

from collections import Counter
Counter(" ".join(df["Title"]).split()).most_common(50)
[('of', 2255),
 ('the', 1785),
 ('and', 1536),
 ('...', 1054),
 ('in', 819),
 ('The', 693),
 ('A', 625),
 ('a', 625),
 ('etc', 557),
 ('by', 472),
 ('to', 413),
 ('With', 314),
 ('from', 268),
 ('with', 250),
 ('de', 228),
 ('van', 223),
 ('By', 201),
 ('its', 196),
 ('en', 194),
 ('der', 193),
 ('History', 179),
 ('J.', 159),
 ('on', 158),
 ('an', 157),
 ('[With', 157),
 ('[A', 152),
 ('illustrations', 152),
 ('New', 125),
 ('other', 121),
 ('for', 117),
 ('novel.]', 111),
 ('edition', 110),
 ('or,', 109),
 ('H.', 108),
 ('Illustrated', 98),
 ('A.', 96),
 ('und', 91),
 ('af', 88),
 ('G.', 87),
 ('den', 87),
 ('och', 75),
 ('C.', 73),
 ('or', 72),
 ('i', 71),
 ('het', 70),
 ('An', 68),
 ('Edited', 67),
 ('novel', 67),
 ('W.', 64),
 ('during', 64)]

We can see here that the most common words tend to be stop words. Since we want to know which words might be unique to fiction or non-fiction we’ll look at each of these separately.

df_fiction = df[df["annotator_genre"] == "Fiction"]
df_non_fiction = df[df["annotator_genre"] == "Non-fiction"]
most_frequent_fiction = Counter(" ".join(df_fiction["Title"]).split()).most_common(50)
most_frequent_fiction
[('of', 490),
 ('The', 331),
 ('A', 316),
 ('the', 260),
 ('and', 242),
 ('a', 184),
 ('...', 177),
 ('[A', 147),
 ('by', 138),
 ('in', 112),
 ('novel.]', 111),
 ('etc', 104),
 ('By', 104),
 ('other', 94),
 ('novel', 67),
 ('With', 53),
 ('tale', 50),
 ('der', 49),
 ('edition', 48),
 ('de', 47),
 ('author', 45),
 ('van', 45),
 ('or,', 41),
 ('en', 40),
 ('Poems', 39),
 ('J.', 39),
 ('story', 39),
 ('illustrations', 38),
 ('[i.e.', 35),
 ('A.', 34),
 ('stories', 30),
 ('romance', 29),
 ('H.', 28),
 ('poems', 28),
 ('or', 26),
 ('Second', 26),
 ('und', 26),
 ('C.', 25),
 ('poem', 25),
 ('with', 25),
 ('verse', 24),
 ('An', 24),
 ('from', 24),
 ('Tales', 24),
 ('New', 23),
 ('for', 20),
 ('acts', 20),
 ('collection', 20),
 ('het', 20),
 ('an', 19)]
most_frequent_non_fiction = Counter(
    " ".join(df_non_fiction["Title"]).split()
).most_common(50)
most_frequent_non_fiction
[('of', 1765),
 ('the', 1525),
 ('and', 1294),
 ('...', 877),
 ('in', 707),
 ('etc', 453),
 ('a', 441),
 ('to', 397),
 ('The', 362),
 ('by', 334),
 ('A', 309),
 ('With', 261),
 ('from', 244),
 ('with', 225),
 ('its', 193),
 ('de', 181),
 ('van', 178),
 ('History', 176),
 ('en', 154),
 ('on', 153),
 ('[With', 144),
 ('der', 144),
 ('an', 138),
 ('J.', 120),
 ('illustrations', 114),
 ('New', 102),
 ('for', 97),
 ('By', 97),
 ('Illustrated', 94),
 ('af', 88),
 ('H.', 80),
 ('och', 75),
 ('den', 75),
 ('G.', 74),
 ('i', 71),
 ('or,', 68),
 ('und', 65),
 ('during', 64),
 ('edition', 62),
 ('A.', 62),
 ('history', 62),
 ('og', 61),
 ('account', 57),
 ('sketches', 54),
 ('W.', 53),
 ('P.', 51),
 ('through', 51),
 ('notes', 50),
 ('edition,', 50),
 ('het', 50)]

For our indicator words to be most reliable we would rather they didn’t appear frequently in both fiction and non-fiction titles. We can use a set to check the values which aren’t in both fiction and non-fiction titles.

set(most_frequent_non_fiction).difference(set(most_frequent_fiction))
{('...', 877),
 ('A', 309),
 ('A.', 62),
 ('By', 97),
 ('G.', 74),
 ('H.', 80),
 ('History', 176),
 ('Illustrated', 94),
 ('J.', 120),
 ('New', 102),
 ('P.', 51),
 ('The', 362),
 ('W.', 53),
 ('With', 261),
 ('[With', 144),
 ('a', 441),
 ('account', 57),
 ('af', 88),
 ('an', 138),
 ('and', 1294),
 ('by', 334),
 ('de', 181),
 ('den', 75),
 ('der', 144),
 ('during', 64),
 ('edition', 62),
 ('edition,', 50),
 ('en', 154),
 ('etc', 453),
 ('for', 97),
 ('from', 244),
 ('het', 50),
 ('history', 62),
 ('i', 71),
 ('illustrations', 114),
 ('in', 707),
 ('its', 193),
 ('notes', 50),
 ('och', 75),
 ('of', 1765),
 ('og', 61),
 ('on', 153),
 ('or,', 68),
 ('sketches', 54),
 ('the', 1525),
 ('through', 51),
 ('to', 397),
 ('und', 65),
 ('van', 178),
 ('with', 225)}

These words are still fairly noisy so we might be wary of using many of them. There are some which make sense intuitively so we’ll try some of these out and see how they perform.

Creating our Labelling Functions

We’ll start by setting some constants for our labels.

ABSTAIN = -1
FICTION = 0
NON_FICTION = 1

It is important to note that we set an option for ABSTAIN. We often want our labeling functions to have the option of deferring from making a prediction: we write each labeling function to try to indicate a label, but if the function’s condition isn’t met, that doesn’t necessarily mean the other label is correct.

Another important point about labeling functions is that we usually want to have many of them and, since we’re not relying on a single function to label our data, it’s often better for a labeling function to return ABSTAIN when its condition isn’t met rather than returning another label. This becomes even more important if we have multiple labels.

One function we could try to start with is checking if the word “Novel” appears in the title text. You may have noticed that in this particular dataset the word novel often appears as part of the title so this could be a useful indicator of fiction titles.

Warning

We should bear in mind that our labeling functions are specific to our data. In the BL books title metadata that we’re trying to label we have noticed that things like A Novel by... appear in the title. This may not be the case for book titles in other catalogues.

Our first labeling function is basically the same as the spam example above except we look for the word “novel”.

@labeling_function()
def lf_contains_novel(x):
    return FICTION if "novel" in x.Title.lower() else ABSTAIN

Now that we have our labeling function we can apply it to our data. We’ll start by doing this only with the data we already have labels for, so we can compare our functions against the correct labels.

There are various ways in which Snorkel can apply our functions to our data. In this notebook we’ll stick with an approach designed to work with pandas DataFrames. If we had a larger amount of data to label we might want to explore Snorkel’s dask-based appliers, which use the dask library to scale the application of labelling functions to very large datasets. We won’t need this here, but if you are planning to develop this approach with very large collections it could be worth exploring.

We put our labelling functions in a list called lfs, then create an applier object and apply it to our dataframe.

from snorkel.labeling import PandasLFApplier

lfs = [lf_contains_novel]

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df)
100%|██████████| 3262/3262 [00:00<00:00, 39484.97it/s]

We store the output of this in a new variable L_train. We can use LFAnalysis to get a summary of what our current labeling function is doing.

from snorkel.labeling import LFAnalysis

LFAnalysis(L=L_train, lfs=lfs).lf_summary()
j Polarity Coverage Overlaps Conflicts
lf_contains_novel 0 [0] 0.058553 0.0 0.0

We can see a row for our current labeling function. We can also see that at the moment our coverage (i.e. how much of our data is labeled by our function) is very low. We don’t have any overlaps or conflicts yet, since these are only relevant when we have multiple labeling functions.
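
Coverage can also be computed directly from the label matrix. The one-liner below is a quick sketch (not part of the original notebook) which counts the fraction of rows where each function didn’t abstain:

# L_train has one row per title and one column per labeling function;
# a value of ABSTAIN (-1) means that function made no prediction for that row.
(L_train != ABSTAIN).mean(axis=0)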

We have ground truth labels that we can use to evaluate how accurate our labeling function is. To do this we need to encode our ground truth with the same label values Snorkel uses, so we’ll map our labels to the constants we made above.

ground_truth = df.annotator_genre.replace({"Fiction": 0, "Non-fiction": 1})

We can pass our ground truth labels to lf_empirical_accuracies to get a summary of the performance of our functions.

LFAnalysis(L=L_train, lfs=lfs).lf_empirical_accuracies(ground_truth)
array([1.])

We can see here that our current function is 100% accurate. We shouldn’t get too excited about this since our coverage is very low. We’ll need to write some more labeling functions to have a chance of labeling more of our data than we currently have.

Heuristics

We could also use heuristics for our labeling functions, for example the length of the title. We don’t have an obvious threshold to use for this, but since we have some labels we can try to identify a sensible one. First we’ll add a new column to our DataFrame which contains the length of our titles.

df["text_len"] = df["Title"].str.len()

We’ll now use a pandas groupby to see what the lengths look like for fiction vs non-fiction books. Since it might be useful to have a sense of the distributions we’ll use describe instead of mean.

df.groupby(["annotator_genre"])["text_len"].describe()
count mean std min 25% 50% 75% max
annotator_genre
Fiction 1083.0 49.438596 35.095600 5.0 25.0 39.0 63.0 271.0
Non-fiction 2179.0 92.317118 58.458339 8.0 50.0 78.0 125.0 469.0

Precision vs Recall: What Value to use for our Threshold?

We can see various values for mean, min, etc. What would be a reasonable threshold for a labeling function which labels a title as ‘non-fiction’? This partly comes down to whether we want high coverage (or recall) or high precision. If we choose a higher threshold we will label fewer examples, but they will be more likely to be correct.

For example, if we use the maximum length of a non-fiction title (469), most titles will be much shorter than this, so our function will ‘abstain’ from applying a label and we would only label a very small number of examples from our data. However, we also won’t have many (or any) wrongly-labeled examples, since the maximum value for fiction here is 271. We need to balance these two aims of coverage and precision. Since we are writing more than one labeling function, we probably want to tend towards writing more precise labeling functions rather than aiming for high coverage if that is likely to introduce wrongly labeled examples.
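
To get a feel for this trade-off we could sweep a few candidate thresholds against our labelled data. The sketch below is an illustration (the particular threshold values are arbitrary choices, not values from this notebook); for each threshold it prints the share of titles we would label and how many of those really are non-fiction.

# For each candidate threshold, coverage is the share of titles we would label
# and precision is how many of those labelled titles are actually non-fiction.
for threshold in (100, 150, 200, 250, 300):
    labelled = df[df["text_len"] > threshold]
    coverage = len(labelled) / len(df)
    precision = (labelled["annotator_genre"] == "Non-fiction").mean()
    print(f"threshold {threshold}: coverage {coverage:.3f}, precision {precision:.3f}")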

Note

As we saw in previous chapters/notebooks, we have to be a bit careful in generalizing from what we see in our training and validation data, since there may be some distribution drift between our training data (which wasn’t a completely randomized sample) and the full data that we want to label. In the error analysis notebook we saw that the performance of our model was worse on the test data than on the validation data. We should keep this in mind when writing a labeling function, since we want our labeling functions to work well on new data which doesn’t have labels, not just on the data for which we already have labels.

We’ll use a threshold towards the upper end of the fiction title lengths. This will hopefully give us fairly decent coverage without too many mistakes.

@labeling_function()
def lf_is_long_title(x):
    return NON_FICTION if x.text_len > 211.0 else ABSTAIN

We do the same as earlier, this time including our new labeling function.

lfs = [lf_contains_novel, lf_is_long_title]

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df)
100%|██████████| 3262/3262 [00:00<00:00, 28729.01it/s]
LFAnalysis(L=L_train, lfs=lfs).lf_summary()
j Polarity Coverage Overlaps Conflicts
lf_contains_novel 0 [0] 0.058553 0.0 0.0
lf_is_long_title 1 [1] 0.023299 0.0 0.0

We can see our coverage is still fairly low but at the moment we don’t have any conflicts. We can keep tweaking our length threshold but for now we’ll try a different approach to our labeling function.

Add Keywords

We already have a labeling function that uses the keyword ‘novel’ to identify likely fiction books. Since we often want to use keywords the Snorkel tutorial suggests a way we can do this more easily using keyword lookups.

from snorkel.labeling import LabelingFunction


def keyword_lookup(x, keywords, label):
    if any(word in x.Title.lower() for word in keywords):
        return label
    return ABSTAIN


def make_keyword_lf(keywords, label=FICTION):
    return LabelingFunction(
        name=f"keyword_{keywords[0]}",
        f=keyword_lookup,
        resources=dict(keywords=keywords, label=label),
    )

We can create two new keyword labeling functions using this more concise approach:

keyword_tale = make_keyword_lf(keywords=["tale"])
keyword_poem = make_keyword_lf(keywords=["poem"])
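
Because of the label argument, the same helper could also produce non-fiction keyword functions. The line below is a hypothetical addition (it isn’t used in this notebook); the keywords are guesses based on the frequent non-fiction words we saw earlier.

# A hypothetical non-fiction keyword labeling function; "history" and "account"
# are guesses drawn from the common non-fiction title words above.
keyword_history = make_keyword_lf(keywords=["history", "account"], label=NON_FICTION)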

Leveraging Other Models

So far we have leveraged some domain knowledge/exploration and our existing labeled data to create our labeling functions. However, we could also utilise other resources to help us label our data. Since we’re working with text we should be able to benefit from some existing NLP models to label our data. Snorkel supports this in a few different ways.

spaCy is a popular NLP library which supports a range of different models and NLP tasks. Here we’re particularly interested in some of the named entity recognition models supported by spaCy.

To work with this library we can use Snorkel’s SpacyPreprocessor. Preprocessors are used in Snorkel to do some preprocessing (hence the name) which is required for our labeling functions. They can be particularly useful if the processing takes some time and might be reused across different labelling functions. Let’s take a look at an example.

from snorkel.preprocess.nlp import SpacyPreprocessor

# The SpacyPreprocessor parses the text in text_field and
# stores the new enriched representation in doc_field
spacy = SpacyPreprocessor(text_field="Title", doc_field="doc", memoize=True)

Above we create a SpacyPreprocessor which will use our Title field and create a new doc field. This doc refers to the spaCy Doc container and can be reused by multiple different labeling functions. We pass in memoize=True to cache our results, which means we won’t have to wait for the preprocessing to be repeated for different labeling functions that reuse the doc container.

Using named entities for labeling functions

spaCy has support for named entity recognition. Since these models already exist and can be used directly, it is worth seeing whether named entities are of any benefit for our particular task.
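
Before writing any labeling functions it can help to peek at what the entity recogniser returns for one of our titles. The sketch below assumes spaCy’s small English model (en_core_web_sm) is installed, and uses an import alias so we don’t overwrite the spacy preprocessor variable created above.

import spacy as spacy_lib  # aliased so we don't shadow the `spacy` preprocessor above

nlp = spacy_lib.load("en_core_web_sm")
doc = nlp(df["Title"].iloc[0])
print([(ent.text, ent.label_) for ent in doc.ents])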

We can again draw from our domain knowledge, intuition or guesses (depending on how confident we are) and say that we are likely to see more named entities of the ORG type in non-fiction titles, since these will often be about organizations. We can combine this with a slightly softer threshold for length to label titles as likely non-fiction.

To create this function we replicate closely what we did before, except that we pass in our SpacyPreprocessor instance to let Snorkel know that this preprocessor is a requirement of this labeling function. Under the hood this means that if the preprocessor hasn’t been run already it will trigger the preprocessing. If we reuse it for another function the preprocessing will already have been cached.

@labeling_function(pre=[spacy])
def has_many_org(x):
    # Count the ORG entities (rather than all entities) in the title
    if len(x.doc) > 50 and sum(ent.label_ == "ORG" for ent in x.doc.ents) > 1:
        return NON_FICTION
    else:
        return ABSTAIN

We might also guess that there will be more location (LOC) entities in non-fiction titles.

@labeling_function(pre=[spacy])
def has_many_loc(x):
    # Count the LOC entities in the title
    if len(x.doc) > 50 and sum(ent.label_ == "LOC" for ent in x.doc.ents) > 2:
        return NON_FICTION
    else:
        return ABSTAIN

Similarly, we might also assume that there will be more GPE (geopolitical entity) entities for non-fiction.

@labeling_function(pre=[spacy])
def has_many_gpe(x):
    # Count the GPE entities in the title
    if len(x.doc) > 50 and sum(ent.label_ == "GPE" for ent in x.doc.ents) > 2:
        return NON_FICTION
    else:
        return ABSTAIN

and LAW entities…

@labeling_function(pre=[spacy])
def has_law(x):
    if any([ent.label_ == "LAW" for ent in x.doc.ents]):
        return NON_FICTION
    else:
        return ABSTAIN

and if it’s long and has a date it might be a non-fiction title?

@labeling_function(pre=[spacy])
def is_long_and_has_date(x):
    if len(x.doc) > 50 and any([ent.label_ == "DATE" for ent in x.doc.ents]):
        return NON_FICTION
    else:
        return ABSTAIN

or if it is long and has a FAC entity

@labeling_function(pre=[spacy])
def is_long_and_has_fac(x):
    if len(x.doc) > 50 and any([ent.label_ == "FAC" for ent in x.doc.ents]):
        return NON_FICTION
    else:
        return ABSTAIN

We now have a bunch of labeling functions, so we’ll create a new list containing them and see how they do.

lfs = [
    lf_contains_novel,
    lf_is_long_title,
    keyword_tale,
    keyword_poem,
    has_many_org,
    has_many_loc,
    has_many_gpe,
    has_law,
    is_long_and_has_date,
]
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df)
100%|██████████| 3262/3262 [00:34<00:00, 93.42it/s]
LFAnalysis(L=L_train, lfs=lfs).lf_summary()
j Polarity Coverage Overlaps Conflicts
lf_contains_novel 0 [0] 0.058553 0.000000 0.0
lf_is_long_title 1 [1] 0.023299 0.011036 0.0
keyword_tale 2 [0] 0.043532 0.000000 0.0
keyword_poem 3 [0] 0.042918 0.000000 0.0
has_many_org 4 [1] 0.011956 0.011956 0.0
has_many_loc 5 [1] 0.011036 0.011036 0.0
has_many_gpe 6 [1] 0.011036 0.011036 0.0
has_law 7 [1] 0.004905 0.000000 0.0
is_long_and_has_date 8 [1] 0.003372 0.003372 0.0

Again our coverage is quite low but we also don’t have too many conflicts. We can check the performance of these functions:

LFAnalysis(L=L_train, lfs=lfs).lf_empirical_accuracies(ground_truth)
array([1.        , 0.94736842, 0.97183099, 1.        , 0.92307692,
       0.91666667, 0.91666667, 1.        , 1.        ])

These are all doing pretty well, so we might be okay with lower coverage for now. We also have a resource available to us which should boost our coverage a fair bit: our previously trained model.

Using our previous model

In a previous notebook we trained a model which didn’t perform terribly. Although we want to improve its performance (hence this notebook), it wasn’t so disastrous as to be unusable, particularly given the insight from the error analysis notebook that if we raise the confidence threshold at which we accept our model’s predictions, its performance increases quite a bit. We may therefore want to incorporate this model as another way of labeling more data.

There are various ways in which we can do this, we’ll look at one approach below.

We’ll start by importing fastai so we can load our previously trained model.

from fastai.text.all import *

If you don’t have a model saved from the previous notebook you can download one by uncommenting the cell below.

# !wget -O 20210928-model.pkl  https://zenodo.org/record/5245175/files/20210928-model.pkl?download=1
--2021-11-11 14:12:20--  https://zenodo.org/record/5245175/files/20210928-model.pkl?download=1
Resolving zenodo.org (zenodo.org)... 137.138.76.77
Connecting to zenodo.org (zenodo.org)|137.138.76.77|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 158529715 (151M) [application/octet-stream]
Saving to: ‘20210928-model.pkl’

20210928-model.pkl  100%[===================>] 151.19M  11.1MB/s    in 16s     

2021-11-11 14:12:38 (9.54 MB/s) - ‘20210928-model.pkl’ saved [158529715/158529715]
learn = load_learner("20210928-model.pkl")

We can quickly check our vocab

learn.dls.vocab[1]
['Fiction', 'Non-fiction']

One way of using this model would be to create a preprocessor that will be used by Snorkel. This will do the setup required to use the model (as we saw with the spaCy example). We can do this by using the preprocessor decorator; our function then does whatever processing we need, in this case storing the predicted probabilities for each label.

# from snorkel.preprocess import preprocessor
# @preprocessor(memoize=True)
# def fastai_pred(x):
#     with learn.no_bar():
#         *_, probs = learn.predict(x.title)
#     x.fiction_prob = probs[0]
#     x.non_fiction_prob = probs[1]
#     return x

In this example we won’t use that approach, since we then wouldn’t benefit from doing our inference in batches. Instead we’ll just create some new columns to store our fastai model’s labels and confidence.

test_dl = learn.dls.test_dl(df.Title)
preds = learn.get_preds(dl=test_dl)
fiction_prob, non_fiction_prob = np.hsplit(preds[0].numpy(), 2)
fiction_prob
array([[0.9999399 ],
       [0.9999399 ],
       [0.9999399 ],
       ...,
       [0.04363291],
       [0.04363291],
       [0.02832149]], dtype=float32)
df["fiction_prob"] = fiction_prob
df["non_fiction_prob"] = non_fiction_prob

We now have some new columns containing the probabilities for our labels from our previously created model.

We saw in the previous Assessing Where our Model is Going Wrong section that by only using predictions where our model was confident, we could get better results i.e. we only accept suggestions from our model where it is very confident. For example, we could accept a prediction only if it is above 95% confidence.

We’ll use this in our labelling functions to set a threshold at which we accept the previous model’s predictions. If the model is unsure we don’t use its prediction. This means less of our data ends up labelled, because some predictions aren’t used, but we will hopefully get better labels because we only use predictions where our model is confident.

@labeling_function()
def fastai_fiction_prob_v_high(x):
    return FICTION if x.fiction_prob > 0.97 else ABSTAIN
@labeling_function()
def fastai_non_fiction_prob_v_high(x):
    return NON_FICTION if x.non_fiction_prob > 0.97 else ABSTAIN

Again we add these to our existing labeling function list and apply them to our data.

lfs += [fastai_fiction_prob_v_high, fastai_non_fiction_prob_v_high]
lfs
[LabelingFunction lf_contains_novel, Preprocessors: [],
 LabelingFunction lf_is_long_title, Preprocessors: [],
 LabelingFunction keyword_tale, Preprocessors: [],
 LabelingFunction keyword_poem, Preprocessors: [],
 LabelingFunction has_many_org, Preprocessors: [SpacyPreprocessor SpacyPreprocessor, Pre: []],
 LabelingFunction has_many_loc, Preprocessors: [SpacyPreprocessor SpacyPreprocessor, Pre: []],
 LabelingFunction has_many_gpe, Preprocessors: [SpacyPreprocessor SpacyPreprocessor, Pre: []],
 LabelingFunction has_law, Preprocessors: [SpacyPreprocessor SpacyPreprocessor, Pre: []],
 LabelingFunction is_long_and_has_date, Preprocessors: [SpacyPreprocessor SpacyPreprocessor, Pre: []],
 LabelingFunction fastai_fiction_prob_v_high, Preprocessors: [],
 LabelingFunction fastai_non_fiction_prob_v_high, Preprocessors: []]
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df)
100%|██████████| 3262/3262 [00:34<00:00, 95.06it/s]
LFAnalysis(L=L_train, lfs=lfs).lf_summary()
j Polarity Coverage Overlaps Conflicts
lf_contains_novel 0 [0] 0.058553 0.056101 0.000000
lf_is_long_title 1 [1] 0.023299 0.019926 0.000000
keyword_tale 2 [0] 0.043532 0.034028 0.000613
keyword_poem 3 [0] 0.042918 0.035561 0.000000
has_many_org 4 [1] 0.011956 0.011956 0.000920
has_many_loc 5 [1] 0.011036 0.011036 0.000920
has_many_gpe 6 [1] 0.011036 0.011036 0.000920
has_law 7 [1] 0.004905 0.002759 0.000000
is_long_and_has_date 8 [1] 0.003372 0.003372 0.000000
fastai_fiction_prob_v_high 9 [0] 0.223483 0.125996 0.000920
fastai_non_fiction_prob_v_high 10 [1] 0.338749 0.021153 0.000613

We can see that the labelling functions which use our model’s outputs have a much higher coverage of our data. This should be very helpful in labelling more examples, but we want to check that these labels are correct.

LFAnalysis(L=L_train, lfs=lfs).lf_empirical_accuracies(ground_truth)
array([1.        , 0.94736842, 0.97183099, 1.        , 0.92307692,
       0.91666667, 0.91666667, 1.        , 1.        , 0.99314129,
       0.99909502])

We can see that our labeling functions all perform pretty well, i.e. above 90% accuracy. We are also getting much better coverage now that we leverage our existing model.

Creating more training data

So far we have been using our existing labelled data to develop some potential labeling functions. Now that we are fairly satisfied with them, let’s apply them to the full data. We’ll quickly look at this process on our current data and then we’ll move to the full metadata file that we’ll use for creating more training data.

We use LabelModel to fit a model which takes as input all of the predictions from our labelling functions and predicts the probability for each label. This model is able to deal with some conflicts between labeling functions and will in most cases do much better than a naive majority vote model, i.e. one which just accepts the most often predicted label. The details of this model are beyond the scope of this notebook, but if you are interested, Data Programming: Creating Large Training Sets, Quickly offers a fuller overview of the method [8]

from snorkel.labeling.model import LabelModel

label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train=L_train, n_epochs=500, log_freq=100, seed=123)

Above we fit our LabelModel for 500 epochs. Since we are still working with our labelled training set we can get a score for this model.

label_model.score(
    L=L_train,
    Y=ground_truth,
    tie_break_policy="abstain",
    metrics=["precision", "recall", "f1"],
)
WARNING:root:Metrics calculated over data points with non-abstain labels only
{'f1': 0.994250331711632,
 'precision': 0.9964539007092199,
 'recall': 0.9920564872021183}
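
As a sanity check we could also compare this against a simple majority-vote baseline. The sketch below is an illustration rather than something run in this notebook, and assumes Snorkel’s MajorityLabelVoter can be scored in the same way as the LabelModel above.

# Score a simple majority-vote baseline on the same label matrix and ground
# truth, for comparison with the LabelModel results above.
from snorkel.labeling.model import MajorityLabelVoter

majority_model = MajorityLabelVoter(cardinality=2)
majority_model.score(
    L=L_train, Y=ground_truth, tie_break_policy="abstain", metrics=["f1"]
)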

This is looking pretty good and hopefully this performance will be similar for our full data. We’ll now load a dataframe that includes all of the BL books metadata.

df_full = pd.read_csv(
    "https://bl.iro.bl.uk/downloads/e1be1324-8b1a-4712-96a7-783ac209ddef?locale=en",
    dtype=dtypes,
)

We create a new column text_len since we need this for some of our labeling functions.

df_full["text_len"] = df_full.Title.str.len()

We also get our fastai model’s predictions into new columns. This obviously takes some time since we’re now doing inference on a fairly large dataset.

test_dl = learn.dls.test_dl(df_full.Title)
preds = learn.get_preds(dl=test_dl)
fiction_prob, non_fiction_prob = np.hsplit(preds[0].numpy(), 2)
df_full["fiction_prob"] = fiction_prob
df_full["non_fiction_prob"] = non_fiction_prob

Now that we have all the same columns in place as before, we can apply our labelling functions to all of our data.

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_full)
100%|██████████| 52695/52695 [09:27<00:00, 92.85it/s] 

We can check what the coverage, overlaps and conflicts look like

LFAnalysis(L=L_train, lfs=lfs).lf_summary()
j Polarity Coverage Overlaps Conflicts
lf_contains_novel 0 [0] 0.066875 0.057178 0.000285
lf_is_long_title 1 [1] 0.047367 0.032413 0.003036
keyword_tale 2 [0] 0.036702 0.022165 0.001006
keyword_poem 3 [0] 0.089952 0.049018 0.002619
has_many_org 4 [1] 0.021520 0.021520 0.001974
has_many_loc 5 [1] 0.021008 0.021008 0.001917
has_many_gpe 6 [1] 0.021008 0.021008 0.001917
has_law 7 [1] 0.004232 0.002106 0.000531
is_long_and_has_date 8 [1] 0.007762 0.007762 0.000380
fastai_fiction_prob_v_high 9 [0] 0.213834 0.122858 0.000626
fastai_non_fiction_prob_v_high 10 [1] 0.192599 0.019015 0.000683

The coverage is lower than we had previously. This makes sense since we previously used the same data for developing our labeling functions as we used for training our model, so it’s not surprising our model is more confident about those examples. If we were being more diligent we might have held back a different dataset for developing our labelling functions, but since we’re being a bit pragmatic (lazy) here we won’t worry too much about this. We can again fit our model:

label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train=L_train, n_epochs=500, log_freq=100, seed=123)

We now use this model to predict the probabilities from our labelling function outputs.

probs_train = label_model.predict_proba(L_train)

We currently have predictions for some of our data but not all of it. Since we want only the labelled examples, we use a function from Snorkel to filter out data which our labeling functions didn’t annotate.

from snorkel.labeling import filter_unlabeled_dataframe

df_train_filtered, probs_train_filtered = filter_unlabeled_dataframe(
    X=df_full, y=probs_train, L=L_train
)

We now have the predicted probability for each label. We could work with these probabilities, but to keep things simple we’ll turn them into hard predictions, i.e. fiction or non-fiction rather than 87% fiction. Again Snorkel provides a handy function for doing this.

from snorkel.utils import probs_to_preds

preds_train_filtered = probs_to_preds(probs=probs_train_filtered)

Let’s see how much data we have now.

len(preds_train_filtered)
26566

As a reminder, we previously had 3262 labeled examples. We can see that we’ve now gained a lot more examples for relatively little work (especially if we compare how much time it would take to annotate these by hand).

26566 / 3262
8.144083384426732

We’ll store our new labels in a snorkel_label column.

df_train_filtered["snorkel_label"] = preds_train_filtered
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
df_train_filtered["snorkel_label"]
0        0
1        0
2        0
3        0
8        0
        ..
52682    0
52689    0
52692    0
52693    0
52694    0
Name: snorkel_label, Length: 26566, dtype: int64

Creating our new training data

As a reminder of what we’ve done:

  • we had training data/annotations collected via a Zooniverse crowdsourcing task, with 2909 labeled examples in our validation set

  • we had previously used this to train a model that did fairly well

  • we used our existing training data to generate labeling functions, these leveraged:

    • our intuitions about our data

    • SpaCy models

    • our previous model

  • we applied these labeling functions to the Microsoft Digitised Books metadata. Once we excluded examples which weren’t labeled by our labeling functions, we had 26566 labeled examples we could work with.

We now want to get all of this data into a format we can use to train new models with. There are a few things we need to do for this.

Map to our original labels

We’ll map these back to our original Fiction and Non-fiction labels. This isn’t super important but is more explicit than 1 or 0 for our labels.

df_train_filtered["snorkel_genre"] = df_train_filtered["snorkel_label"].map(
    {0: "Fiction", 1: "Non-fiction"}
)
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
df_train_filtered.columns
Index(['BL record ID', 'Type of resource', 'Name',
       'Dates associated with name', 'Type of name', 'Role', 'All names',
       'Title', 'Variant titles', 'Series title', 'Number within series',
       'Country of publication', 'Place of publication', 'Publisher',
       'Date of publication', 'Edition', 'Physical description',
       'Dewey classification', 'BL shelfmark', 'Topics', 'Genre', 'Languages',
       'Notes', 'BL record ID for physical resource', 'text_len',
       'fiction_prob', 'non_fiction_prob', 'snorkel_label', 'snorkel_genre'],
      dtype='object')

Selecting required columns

Since we have only been using the title and the label (fiction or non-fiction) to train our models we will just keep these.

df["annotator_genre"]
0           Fiction
1           Fiction
2           Fiction
3           Fiction
4           Fiction
           ...     
3257    Non-fiction
3258    Non-fiction
3259    Non-fiction
3260    Non-fiction
3261    Non-fiction
Name: annotator_genre, Length: 3262, dtype: object
df["snorkel_genre"] = df["annotator_genre"]
df_snorkel_train = pd.concat([df, df_train_filtered])
df_snorkel_train["snorkel_genre"].value_counts()
Fiction        15840
Non-fiction    13988
Name: snorkel_genre, dtype: int64

Prioritising human annotations

When we applied our labeling functions across the full Microsoft Digitised Books metadata file we didn’t do anything to exclude titles where a human annotator had already provided a label as part of the Zooniverse annotation task. Since we joined the full metadata and the human annotations together, we will now have some duplicates. We almost certainly want to prioritise the human annotations over our labeling function labels. We can use pandas drop_duplicates, keeping the first example (the human-annotated one), to deal with this.

df_snorkel_train.duplicated(subset="Title")
0        False
1         True
2         True
3         True
4         True
         ...  
52682    False
52689    False
52692    False
52693    False
52694    False
Length: 29828, dtype: bool
df_snorkel_train = df_snorkel_train.drop_duplicates(subset="Title", keep="first")
df_snorkel_train
BL record ID Type of resource Name Dates associated with name Type of name Role All names Title Variant titles Series title Number within series Country of publication Place of publication Publisher Date of publication Edition Physical description Dewey classification BL shelfmark Topics Genre Languages Notes BL record ID for physical resource classification_id user_id created_at subject_ids annotator_date_pub annotator_normalised_date_pub annotator_edition_statement annotator_genre annotator_FAST_genre_terms annotator_FAST_subject_terms annotator_comments annotator_main_language annotator_other_languages_summaries annotator_summaries_language annotator_translation annotator_original_language annotator_publisher annotator_place_pub annotator_country annotator_title Link to digitised book annotated is_valid text_len fiction_prob non_fiction_prob snorkel_genre snorkel_label
0 014616539 Monograph NaN NaN NaN NaN Hazlitt, William Carew, 1834-1913 [person] The Baron's Daughter. A ballad by the author of Poetical Recreations [i.e. William C. Hazlitt] . F.P Single Works NaN NaN Scotland Edinburgh Ballantyne, Hanson 1877 NaN 20 pages (4°) <NA> Digital Store 11651.h.6 NaN NaN English NaN 000206670 263940444.0 3.0 2020-07-27 07:35:13 UTC 44330917.0 1877 1877 NONE Fiction 655 7 $aPoetry$2fast$0(OCoLC)fst01423828 NONE NaN NaN No <NA> No NaN Ballantyne Hanson & Co. Edinburgh stk The Baron's Daughter. A ballad by the author of Poetical Recreations [i.e. William C. Hazlitt] . F.P http://access.bl.uk/item/viewer/ark:/81055/vdc_00000002F718 True False 100 0.999940 0.000060 Fiction NaN
5 014616561 Monograph Bingham, Ashton, Mrs NaN person NaN Bingham, Ashton, Mrs [person] The Autumn Leaf Poems NaN NaN NaN Scotland Edinburgh Colston 1891 NaN vi, 104 pages (8°) <NA> Digital Store 011649.e.105 NaN NaN English NaN 000353271 268728281.0 3.0 2020-08-18 07:02:17 UTC 44331070.0 1891 1891 NONE Fiction 655 7 $aPoetry$2fast$0(OCoLC)fst01423828 NONE NaN NaN No <NA> No NaN Colston & Company Edinburgh stk The Autumn Leaf Poems http://access.bl.uk/item/viewer/ark:/81055/vdc_00000002F04C True False 21 0.999486 0.000514 Fiction NaN
10 014616607 Monograph Cartwright, William NaN person writer Cartwright, William, writer [person] The Battle of Waterloo, a poem NaN NaN NaN England London Longman 1827 NaN vii, 71 pages (8°) <NA> Digital Store 992.i.26 Waterloo, Battle of (Belgium : 1815) NaN English NaN 000621918 263935396.0 3.0 2020-07-27 06:39:57 UTC 44331748.0 1827 1827 NONE Fiction 655 7 $aPoetry$2fast$0(OCoLC)fst01423828 647 7 $aBattle of Waterloo$c(Waterloo, Belgium :$d1815)$2fast$0(OCoLC)fst01172689 NaN NaN No <NA> No NaN Longman, Rees, Orme, Brown & Green\nBurlton\nMerricks London\nLeominster\nHereford enk The Battle of Waterloo, a poem http://access.bl.uk/item/viewer/ark:/81055/vdc_00000002ED4C True False 30 0.991599 0.008401 Fiction NaN
15 014616686 Monograph Earle, John Charles NaN person NaN Earle, John Charles [person] Maximilian, and other poems, etc NaN NaN NaN England London NaN 1868 NaN NaN <NA> Digital Store 11648.i.8 NaN Poetry or verse English NaN 001025896 265570129.0 3.0 2020-08-03 07:25:30 UTC 44331725.0 1868 1868 NONE Fiction 655 7 $aPoetry$2fast$0(OCoLC)fst01423828 NONE NaN NaN No <NA> No NaN Burns, Oates, & Co. London enk Maximilian, and other poems, etc http://access.bl.uk/item/viewer/ark:/81055/vdc_00000002F2AA True False 32 0.982546 0.017454 Fiction NaN
20 014616696 Monograph NaN NaN NaN NaN NaN Fabellæ mostellariæ: or Devonshire and Wiltshire stories in verse; including specimens of the Devonshire dialect NaN NaN NaN England Exeter ; London Hamilton, Adams ; Henry S. Eland 1878 NaN 77 pages (8°) <NA> Digital Store 11652.h.19 NaN NaN English NaN 001187981 269169228.0 3.0 2020-08-20 12:32:34 UTC 44331389.0 1878 1878 NONE Fiction 655 7 $aPoetry$2fast$0(OCoLC)fst01423828 NONE NaN NaN No <NA> No NaN Hamilton, Adams, and Co.\nHenry S. Eland London\nExeter enk Fabellæ mostellariæ: or Devonshire and Wiltshire stories in verse; including specimens of the Devonshire dialect http://access.bl.uk/item/viewer/ark:/81055/vdc_00000002F90A True False 112 0.983944 0.016056 Fiction NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
52682 016289050 Monograph Hastings, Beatrice NaN person NaN Hastings, Beatrice [person] The maids' comedy. A chivalric romance in thirteen chapters NaN NaN NaN England London Stephen Swift 1911 NaN 199 pages, 20 cm <NA> Digital Store 012618.c.32 NaN NaN English Anonymous. By Beatrice Hastings 004111105 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN <NA> NaN NaN NaN NaN NaN NaN NaN NaN NaN 59 0.999444 0.000556 Fiction 0.0
52689 016289057 Monograph Garstang, Walter, M.A., F.Z.S. NaN person NaN Garstang, Walter, M.A., F.Z.S. [person] ; Shepherd, J. A. (James Affleck), 1867-approximately 1931 [person] Songs of the Birds ... With illustrations by J.A. Shepherd NaN NaN NaN England London John Lane 1922 NaN 101 pages, illustrations (8°) 598.259 Digital Store 011648.g.133 NaN NaN English Poems, with and introductory essay 004158005 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN <NA> NaN NaN NaN NaN NaN NaN NaN NaN NaN 58 0.993942 0.006058 Fiction 0.0
52692 016289060 Monograph Wellesley, Dorothy 1889-1956 person NaN Wellesley, Dorothy, 1889-1956 [person] Early Poems. By M. A [i.e. Dorothy Violet Wellesley, Lady Gerald Wellesley.] NaN NaN NaN England London Elkin Mathews 1913 NaN vii, 90 pages (8°) <NA> Digital Store 011649.eee.17 NaN NaN English NaN 000000839 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN <NA> NaN NaN NaN NaN NaN NaN NaN NaN NaN 76 0.987218 0.012782 Fiction 0.0
52693 016289061 Monograph A, T. H. E. NaN person NaN A, T. H. E. [person] Of Life and Love [Poems.] By T. H. E. A, writer of 'The Message.' NaN NaN NaN England London J. M. Watkins 1924 NaN 89 pages (8°) <NA> Digital Store 011645.e.125 NaN NaN English NaN 000001167 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN <NA> NaN NaN NaN NaN NaN NaN NaN NaN NaN 65 0.977032 0.022968 Fiction 0.0
52694 016289062 Monograph Abbay, Richard NaN person NaN Abbay, Richard [person] Life, a Mode of Motion; or, He and I, my two selves [A poem.] NaN NaN NaN England London Jarrold 1919 NaN volumes, 58 pages (8°) <NA> Digital Store 011649.g.81 NaN NaN English NaN 000003140 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN <NA> NaN NaN NaN NaN NaN NaN NaN NaN NaN 61 0.975888 0.024112 Fiction 0.0

25683 rows × 52 columns

Data leakage

We want to exclude data which is in the test set, so we drop these examples from our training data. Since we care about titles ‘leaking’, we check whether any titles in our training data appear in our test data and remove them from the training data.

df_test = pd.read_csv("test_errors.csv")

Removing data which is in our test data

df_snorkel_train = df_snorkel_train[~df_snorkel_train.Title.isin(df_test.title)]
len(df_snorkel_train)
25683

Creating new splits

We create some new splits following the same process we used before. We can then use these splits to more accurately compare across models trained using this dataset. Since we have kept the test data out of our ‘Snorkel dataset’ we will also continue to use this test data for final model evaluation.

from sklearn.model_selection import GroupShuffleSplit
train_inds, valid_ins = next(
    GroupShuffleSplit(n_splits=2, test_size=0.2).split(
        df_snorkel_train, groups=df_snorkel_train["Title"]
    )
)
df_train, df_valid = (
    df_snorkel_train.iloc[train_inds].copy(),
    df_snorkel_train.iloc[valid_ins].copy(),
)
df_train["is_valid"] = False
df_valid["is_valid"] = True
df = pd.concat([df_train, df_valid])
df.snorkel_genre.value_counts()
Fiction        13918
Non-fiction    11765
Name: snorkel_genre, dtype: int64

We can see we still have a healthy number of examples to train our model on, even after dropping titles which appear in our test data.

len(df)
25683

Saving our new training data

We’ll save our new training data as a csv file.

df.to_csv("data/snorkel_train.csv", index=False)
df.head(1)
BL record ID Type of resource Name Dates associated with name Type of name Role All names Title Variant titles Series title Number within series Country of publication Place of publication Publisher Date of publication Edition Physical description Dewey classification BL shelfmark Topics Genre Languages Notes BL record ID for physical resource classification_id user_id created_at subject_ids annotator_date_pub annotator_normalised_date_pub annotator_edition_statement annotator_genre annotator_FAST_genre_terms annotator_FAST_subject_terms annotator_comments annotator_main_language annotator_other_languages_summaries annotator_summaries_language annotator_translation annotator_original_language annotator_publisher annotator_place_pub annotator_country annotator_title Link to digitised book annotated is_valid text_len fiction_prob non_fiction_prob snorkel_genre snorkel_label
0 014616539 Monograph NaN NaN NaN NaN Hazlitt, William Carew, 1834-1913 [person] The Baron's Daughter. A ballad by the author of Poetical Recreations [i.e. William C. Hazlitt] . F.P Single Works NaN NaN Scotland Edinburgh Ballantyne, Hanson 1877 NaN 20 pages (4°) <NA> Digital Store 11651.h.6 NaN NaN English NaN 000206670 263940444.0 3.0 2020-07-27 07:35:13 UTC 44330917.0 1877 1877 NONE Fiction 655 7 $aPoetry$2fast$0(OCoLC)fst01423828 NONE NaN NaN No <NA> No NaN Ballantyne Hanson & Co. Edinburgh stk The Baron's Daughter. A ballad by the author of Poetical Recreations [i.e. William C. Hazlitt] . F.P http://access.bl.uk/item/viewer/ark:/81055/vdc_00000002F718 True False 100 0.99994 0.00006 Fiction NaN

Next steps

We now have a larger training set which includes both our original training data produced through crowdsourcing and the training data we generated using our labeling functions and the Snorkel library.

Hopefully having more training data will result in being able to improve the models we can generate. In the next sections we’ll look at two approaches we can use for doing this:

  • training the same model as before but with more data

  • training a transformer based model with more data

We will hopefully see some improvements now we have more data.

Note

The main things we tried to show in this notebook:

  • we can leverage our domain knowledge to help generate training data using a programmatic data labeling approach

  • this approach can leverage existing training data generated by humans

  • we can often use existing models to help generate training data even if the task is quite different