Creating More Training Data Without More Annotating

We previously listed ‘annotating’ more data as one way of improving our model. Since supervised learning requires labelled data, having more of it is potentially helpful for improving our model.

One obvious downside is that collecting more training data is time consuming and not always practical. In a GLAM setting we may want to use machine learning to do a task which we wouldn’t otherwise have time to do. If we have to spend a lot of time creating our training data, the machine learning approach may itself become impractical in terms of resources.

Combining Domain Expertise and Machine Learning

The time taken to create training data is one weakness of machine learning for practical tasks. Another potential frustration for domain experts is that their knowledge isn’t always incorporated into the machine learning process. For our use case of trying to identify the genre of a book, we may already have a sense of some possible ways in which we could identify whether a book is fiction or non-fiction. For example, we may already know that titles of non-fiction books tend to be longer than fiction titles (cf. ‘An account of the mining villages of Wales’ vs ‘Oliver Twist’). If we create our training data in the usual way, by labeling examples of our data with the correct label, we might not be able to incorporate this domain knowledge very easily. This might be okay in some cases, but we might be able to save time and get better results by leveraging what we already know (or can access via domain experts).

Programmatically Generating Training Data

One way in which we could do this is by writing a labelling function to label titles as either fiction or non-fiction based on the length of the title, i.e. without any annotation by hand. However, a weakness of this approach is that it deals with averages, which won’t always be correct: some non-fiction book titles will be shorter than our threshold, and vice versa for fiction books. If we could reliably determine genre from title length alone, we could have skipped this whole machine learning process and been done already.
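
To make this concrete, a length-based rule might look something like the sketch below. The 70-character threshold is an arbitrary illustration, not a value derived from this notebook’s data.

# A naive, hand-written rule for illustration only: the 70-character threshold
# is an arbitrary assumption, not something derived from our data.
def label_by_title_length(title: str) -> str:
    if len(title) > 70:
        return "non-fiction"
    return "fiction"


label_by_title_length("Oliver Twist")  # 'fiction'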

So our problem is that we have some sense of functions we could use to label our data, but these functions are likely to be wrong some of the time. In this notebook we’ll explore how we can use the Python library Snorkel to deal with this challenge and try to create additional annotations without doing any annotating by hand.

Generating New Genre Training Data

How will we try to approach this in our particular situation? As a reminder of our broad task, we have a collection of metadata related to the Microsoft Digitised Books collection. The ‘genre’ field isn’t yet fully populated. We have previously used a subset of this data to train a machine learning model.

What we want to do is to try and write some labelling functions that will add more labels to the full metadata dataset, so that we can give our models more examples to learn from. If we are able to do this we’ll hopefully be able to improve the performance of our model from our previous attempts.

We’ll start by installing the libraries we’ll be using in this notebook.

!pip install snorkel
!pip install fastai --upgrade

Since we already have some training data we can leverage this to help us develop labelling functions (more on this below) and to test how well these work.

import pandas as pd
import numpy as np
dtypes = {
    "BL record ID": "string",
    "Type of resource": "category",
    "Name": "category",
    "Type of name": "category",
    "Country of publication": "category",
    "Place of publication": "category",
    "Genre": "category",
    "Dewey classification": "string",
    "BL record ID for physical resource": "string",
    "annotator_main_language": "category",
    "annotator_summaries_language": "string",
}
df = pd.read_csv("https://raw.githubusercontent.com/Living-with-machines/genre-classification/main/genre_classification_of_bl_books/data/train_valid.csv", dtype=dtypes)
df.head(1)
BL record ID Type of resource Name Dates associated with name Type of name Role All names Title Variant titles Series title Number within series Country of publication Place of publication Publisher Date of publication Edition Physical description Dewey classification BL shelfmark Topics Genre Languages Notes BL record ID for physical resource classification_id user_id created_at subject_ids annotator_date_pub annotator_normalised_date_pub annotator_edition_statement annotator_genre annotator_FAST_genre_terms annotator_FAST_subject_terms annotator_comments annotator_main_language annotator_other_languages_summaries annotator_summaries_language annotator_translation annotator_original_language annotator_publisher annotator_place_pub annotator_country annotator_title Link to digitised book annotated is_valid
0 014616539 Monograph NaN NaN NaN NaN Hazlitt, William Carew, 1834-1913 [person] The Baron's Daughter. A ballad by the author o... Single Works NaN NaN Scotland Edinburgh Ballantyne, Hanson 1877 NaN 20 pages (4°) <NA> Digital Store 11651.h.6 NaN NaN English NaN 000206670 263940444.0 3.0 2020-07-27 07:35:13 UTC 44330917.0 1877 1877 NONE Fiction 655 7 $aPoetry$2fast$0(OCoLC)fst01423828 NONE NaN NaN No <NA> No NaN Ballantyne Hanson & Co. Edinburgh stk The Baron's Daughter. A ballad by the author o... http://access.bl.uk/item/viewer/ark:/81055/vdc... True False

We’ll use only the data from the training split so we can continue to use the validation split to compare our results.

df = df[df.is_valid == False]

Check how many examples we have to work with

len(df)
3262

What is a labeling function?

We briefly described a function we could use to label our data using the length of the title. When we use a programmatic approach to creating our training data, we refer to the functions we use to create our labels as labeling functions. We’ll follow a lot of the approaches outlined in the Snorkel tutorial in this notebook. The tutorial provides this example of a labeling function for the task of identifying whether a YouTube comment is spam:

from snorkel.labeling import labeling_function

# Label constants as used in the Snorkel spam tutorial
ABSTAIN = -1
SPAM = 1


@labeling_function()
def lf_contains_link(x):
    # Return a label of SPAM if "http" in comment text, otherwise ABSTAIN
    return SPAM if "http" in x.text.lower() else ABSTAIN

There are a few things to note here, but, since we’re following a lot of what is covered in the Snorkel tutorial we won’t repeat things in too much detail.

The first thing we need to do is import labeling_function from snorkel, which we use to declare our labeling functions. The way in which we create a labeling function will depend on our data and how we might label it, but in this example we have a simple Python function which returns SPAM if the text http appears in the comment text; if it doesn’t, it returns ABSTAIN.

We use a Python decorator to indicate that this is a labeling function. If you aren’t familiar with Python decorators, you should just remember that decorators are used to modify the behavior of the function they decorate, just as fairy lights decorate a Christmas tree and change its behaviour from ‘tree’ to ‘festive ornament’. This will make more sense in the context of Snorkel later on.
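
As a minimal illustration (not part of Snorkel), the decorator below wraps the function it decorates and changes what calling it returns:

# A toy decorator: `shout` wraps the decorated function and upper-cases its result.
def shout(func):
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs).upper()
    return wrapper


@shout
def greet(name):
    return f"hello {name}"


greet("library")  # 'HELLO LIBRARY'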

If you want to dig into decorators further, this article on Real Python provides a nice introduction, or, if you prefer to watch a video, this YouTube tutorial gives a nice overview too.

We can see here that the labeling function makes use of the idea that people often include links in spam comments, e.g. “plz checkout my etsy store at http:…”. Obviously this won’t be correct all the time, but fortunately Snorkel has some ways to deal with this.

What Makes a Good Labelling Function?

One question we might already have is “what makes a good labeling function?”. The short, annoying, answer is that it depends on context. We often have intuitions about things that might work because we know the domain or have picked up ideas from working with some of the data already. In our particular example of distinguishing fiction from non-fiction books we may think that some words are likely to indicate whether a book is fiction or non-fiction. We’ll start by exploring this.

Important Words?

A fairly naive approach to labeling a title as ‘fiction’ or ‘non-fiction’ would be to use some keywords. Let’s start by finding the 50 most common words. We can use the Counter class from the delightful collections module to do this.

from collections import Counter
Counter(" ".join(df["Title"]).split()).most_common(50)
[('of', 2255),
 ('the', 1785),
 ('and', 1536),
 ('...', 1054),
 ('in', 819),
 ('The', 693),
 ('A', 625),
 ('a', 625),
 ('etc', 557),
 ('by', 472),
 ('to', 413),
 ('With', 314),
 ('from', 268),
 ('with', 250),
 ('de', 228),
 ('van', 223),
 ('By', 201),
 ('its', 196),
 ('en', 194),
 ('der', 193),
 ('History', 179),
 ('J.', 159),
 ('on', 158),
 ('an', 157),
 ('[With', 157),
 ('[A', 152),
 ('illustrations', 152),
 ('New', 125),
 ('other', 121),
 ('for', 117),
 ('novel.]', 111),
 ('edition', 110),
 ('or,', 109),
 ('H.', 108),
 ('Illustrated', 98),
 ('A.', 96),
 ('und', 91),
 ('af', 88),
 ('G.', 87),
 ('den', 87),
 ('och', 75),
 ('C.', 73),
 ('or', 72),
 ('i', 71),
 ('het', 70),
 ('An', 68),
 ('Edited', 67),
 ('novel', 67),
 ('W.', 64),
 ('during', 64)]

We can see here that the most common words tend to be stop words. Since we want to know which words might be unique to fiction or non-fiction we’ll look at each of these separately.

df_fiction = df[df["annotator_genre"] == "Fiction"]
df_non_fiction = df[df["annotator_genre"] == "Non-fiction"]
most_frequent_fiction = Counter(" ".join(df_fiction["Title"]).split()).most_common(50)
most_frequent_fiction
[('of', 490),
 ('The', 331),
 ('A', 316),
 ('the', 260),
 ('and', 242),
 ('a', 184),
 ('...', 177),
 ('[A', 147),
 ('by', 138),
 ('in', 112),
 ('novel.]', 111),
 ('etc', 104),
 ('By', 104),
 ('other', 94),
 ('novel', 67),
 ('With', 53),
 ('tale', 50),
 ('der', 49),
 ('edition', 48),
 ('de', 47),
 ('author', 45),
 ('van', 45),
 ('or,', 41),
 ('en', 40),
 ('Poems', 39),
 ('J.', 39),
 ('story', 39),
 ('illustrations', 38),
 ('[i.e.', 35),
 ('A.', 34),
 ('stories', 30),
 ('romance', 29),
 ('H.', 28),
 ('poems', 28),
 ('or', 26),
 ('Second', 26),
 ('und', 26),
 ('C.', 25),
 ('poem', 25),
 ('with', 25),
 ('verse', 24),
 ('An', 24),
 ('from', 24),
 ('Tales', 24),
 ('New', 23),
 ('for', 20),
 ('acts', 20),
 ('collection', 20),
 ('het', 20),
 ('an', 19)]
most_frequent_non_fiction = Counter(
    " ".join(df_non_fiction["Title"]).split()
).most_common(50)
most_frequent_non_fiction
[('of', 1765),
 ('the', 1525),
 ('and', 1294),
 ('...', 877),
 ('in', 707),
 ('etc', 453),
 ('a', 441),
 ('to', 397),
 ('The', 362),
 ('by', 334),
 ('A', 309),
 ('With', 261),
 ('from', 244),
 ('with', 225),
 ('its', 193),
 ('de', 181),
 ('van', 178),
 ('History', 176),
 ('en', 154),
 ('on', 153),
 ('[With', 144),
 ('der', 144),
 ('an', 138),
 ('J.', 120),
 ('illustrations', 114),
 ('New', 102),
 ('for', 97),
 ('By', 97),
 ('Illustrated', 94),
 ('af', 88),
 ('H.', 80),
 ('och', 75),
 ('den', 75),
 ('G.', 74),
 ('i', 71),
 ('or,', 68),
 ('und', 65),
 ('during', 64),
 ('edition', 62),
 ('A.', 62),
 ('history', 62),
 ('og', 61),
 ('account', 57),
 ('sketches', 54),
 ('W.', 53),
 ('P.', 51),
 ('through', 51),
 ('notes', 50),
 ('edition,', 50),
 ('het', 50)]

For our indicator words to be most reliable we would rather they didn’t appear frequently in both fiction and non-fiction titles. We can use a set to check the values which aren’t in both fiction and non-fiction titles.

set(most_frequent_non_fiction).difference(set(most_frequent_fiction))
{('...', 877),
 ('A', 309),
 ('A.', 62),
 ('By', 97),
 ('G.', 74),
 ('H.', 80),
 ('History', 176),
 ('Illustrated', 94),
 ('J.', 120),
 ('New', 102),
 ('P.', 51),
 ('The', 362),
 ('W.', 53),
 ('With', 261),
 ('[With', 144),
 ('a', 441),
 ('account', 57),
 ('af', 88),
 ('an', 138),
 ('and', 1294),
 ('by', 334),
 ('de', 181),
 ('den', 75),
 ('der', 144),
 ('during', 64),
 ('edition', 62),
 ('edition,', 50),
 ('en', 154),
 ('etc', 453),
 ('for', 97),
 ('from', 244),
 ('het', 50),
 ('history', 62),
 ('i', 71),
 ('illustrations', 114),
 ('in', 707),
 ('its', 193),
 ('notes', 50),
 ('och', 75),
 ('of', 1765),
 ('og', 61),
 ('on', 153),
 ('or,', 68),
 ('sketches', 54),
 ('the', 1525),
 ('through', 51),
 ('to', 397),
 ('und', 65),
 ('van', 178),
 ('with', 225)}

These words are still fairly noisy so we might be wary of using many of them. There are some which make sense intuitively so we’ll try some of these out and see how they perform.

Creating our Labelling Functions

We’ll start by setting some constants for our labels.

ABSTAIN = -1
FICTION = 0
NON_FICTION = 1

It is important to note that we set an option for ABSTAIN. We often want our labeling functions to have the option of deferring from making a prediction: we write each labeling function to try to indicate a label, but if the function’s condition isn’t met, that doesn’t necessarily mean the other label is correct.

Another important point about labeling functions is that we usually want to have many of them and, since we’re not relying on a single function to label our data, it’s often better for a labeling function to return ABSTAIN when its condition isn’t met rather than returning another label. This becomes even more important if we have multiple labels.

One function we could try to start with is checking if the word “Novel” appears in the title text. You may have noticed that in this particular dataset the word novel often appears as part of the title so this could be a useful indicator of fiction titles.

Warning

We should bear in mind that our labeling functions are specific to our data. In the BL books title metadata that we’re trying to label we have noticed that things like A Novel by... appear in the title. This may not be the case for book titles in other catalogues.

Our first labeling function is basically the same as the spam example above except we look for the word “novel”.

@labeling_function()
def lf_contains_novel(x):
    return FICTION if "novel" in x.Title.lower() else ABSTAIN

Now that we have our labeling function we can apply it to our data. We’ll start by doing this only with the data we already have labels for, so we can compare our functions against the correct labels.

There are various ways in which Snorkel can apply our functions to our data. In this notebook we’ll stick with an approach designed to work with pandas DataFrames. If we had a larger amount of data to label we might want to explore Snorkel’s dask-based appliers, which use the dask library to scale the application of labelling functions to very large datasets. We won’t need this here, but if you are planning to develop this approach with very large collections it could be worth exploring.

We put our labelling functions in a list called lfs, then create an applier object and apply it to our dataframe.

from snorkel.labeling import PandasLFApplier

lfs = [lf_contains_novel]

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df)
100%|██████████| 3262/3262 [00:00<00:00, 39484.97it/s]

We store the output of this in a new variable L_train. We can use LFAnalysis to get a summary of what our current labeling function is doing.

from snorkel.labeling import LFAnalysis

LFAnalysis(L=L_train, lfs=lfs).lf_summary()
j Polarity Coverage Overlaps Conflicts
lf_contains_novel 0 [0] 0.058553 0.0 0.0

We can see a row for our current labeling function. We can also see that at the moment our coverage (i.e. how much of our data is labeled by our function) is very low. We don’t have any overlaps or conflicts yet, since these are only relevant when we have multiple labeling functions.
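
Coverage can also be computed directly from the label matrix. The one-liner below is a quick sketch (not part of the original notebook) which counts the fraction of rows where each function didn’t abstain:

# L_train has one row per title and one column per labeling function;
# a value of ABSTAIN (-1) means that function made no prediction for that row.
(L_train != ABSTAIN).mean(axis=0)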

We have ground truth labels that we can use to evaluate how accurate our labeling function is. To do this we need to encode our ground truth with the same label values Snorkel uses, so we’ll map our labels to the constants we made above.

ground_truth = df.annotator_genre.replace({"Fiction": 0, "Non-fiction": 1})

We can pass our ground truth labels to lf_empirical_accuracies to get a summary of the performance of our functions.

LFAnalysis(L=L_train, lfs=lfs).lf_empirical_accuracies(ground_truth)
array([1.])

We can see here that our current function is 100% accurate. We shouldn’t get too excited about this since our coverage is very low. We’ll need to write some more labeling functions to have a chance of labeling more of our data than we currently have.

Heuristics

We could also use heuristics for our labeling functions, for example the length of the title. We don’t have an obvious threshold to use for this, but since we have some labels we can try to identify a sensible one. First we’ll add a new column to our DataFrame which contains the length of our titles.

df["text_len"] = df["Title"].str.len()

We’ll now use a pandas groupby to see what the lengths look like for fiction vs non-fiction books. Since it might be useful to have a sense of the distributions we’ll use describe instead of mean.

df.groupby(["annotator_genre"])["text_len"].describe()
count mean std min 25% 50% 75% max
annotator_genre
Fiction 1083.0 49.438596 35.095600 5.0 25.0 39.0 63.0 271.0
Non-fiction 2179.0 92.317118 58.458339 8.0 50.0 78.0 125.0 469.0

Precision vs Recall: What Value to use for our Threshold?

We can see various values for mean, min, etc. What would be a reasonable threshold for a labeling function which labels a title as ‘non-fiction’? This partly comes down to whether we want high coverage (or recall) or high precision. If we choose a higher threshold we will label fewer examples, but they will be more likely to be correct.

For example, if we use the maximum length of a non-fiction title (469), most titles will be much shorter than this, so our function will ‘abstain’ from applying a label and we would only label a very small number of examples from our data. However, we also won’t have many (or any) wrongly-labeled examples, since the maximum value for fiction here is 271. We need to balance these two aims of coverage and precision. Since we are writing more than one labeling function, we probably want to tend towards writing more precise labeling functions rather than aiming for high coverage if that is likely to introduce wrongly labeled examples.
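
To get a feel for this trade-off we could sweep a few candidate thresholds against our labelled data. The sketch below is an illustration (the particular threshold values are arbitrary choices, not values from this notebook); for each threshold it prints the share of titles we would label and how many of those really are non-fiction.

# For each candidate threshold, coverage is the share of titles we would label
# and precision is how many of those labelled titles are actually non-fiction.
for threshold in (100, 150, 200, 250, 300):
    labelled = df[df["text_len"] > threshold]
    coverage = len(labelled) / len(df)
    precision = (labelled["annotator_genre"] == "Non-fiction").mean()
    print(f"threshold {threshold}: coverage {coverage:.3f}, precision {precision:.3f}")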

Note

As we saw in previous chapters/notebooks, we have to be a bit careful in generalizing from what we see in our training and validation data, since there may be some distribution drift between our training data (which wasn’t a completely randomized sample) and the full data that we want to label. In the error analysis notebook we saw that the performance of our model was worse on the test data than on the validation data. We should keep this in mind when writing a labeling function, since we want our labeling functions to work well on new data which doesn’t have labels, not just on the data for which we already have labels.

We’ll use a threshold towards the upper end of the fiction title lengths. This will hopefully give us fairly decent coverage without too many mistakes.

@labeling_function()
def lf_is_long_title(x):
    return NON_FICTION if x.text_len > 211.0 else ABSTAIN

We do the same as earlier, this time including our new labeling function.

lfs = [lf_contains_novel, lf_is_long_title]

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df)
100%|██████████| 3262/3262 [00:00<00:00, 28729.01it/s]
LFAnalysis(L=L_train, lfs=lfs).lf_summary()
j Polarity Coverage Overlaps Conflicts
lf_contains_novel 0 [0] 0.058553 0.0 0.0
lf_is_long_title 1 [1] 0.023299 0.0 0.0

We can see our coverage is still fairly low but at the moment we don’t have any conflicts. We can keep tweaking our length threshold but for now we’ll try a different approach to our labeling function.

Add Keywords

We already have a labeling function that uses the keyword ‘novel’ to identify likely fiction books. Since we often want to use keywords the Snorkel tutorial suggests a way we can do this more easily using keyword lookups.

from snorkel.labeling import LabelingFunction


def keyword_lookup(x, keywords, label):
    if any(word in x.Title.lower() for word in keywords):
        return label
    return ABSTAIN


def make_keyword_lf(keywords, label=FICTION):
    return LabelingFunction(
        name=f"keyword_{keywords[0]}",
        f=keyword_lookup,
        resources=dict(keywords=keywords, label=label),
    )

We can create two new keyword labeling functions using this more concise approach:

keyword_tale = make_keyword_lf(keywords=["tale"])
keyword_poem = make_keyword_lf(keywords=["poem"])
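
Because of the label argument, the same helper could also produce non-fiction keyword functions. The line below is a hypothetical addition (it isn’t used in this notebook); the keywords are guesses based on the frequent non-fiction words we saw earlier.

# A hypothetical non-fiction keyword labeling function; "history" and "account"
# are guesses drawn from the common non-fiction title words above.
keyword_history = make_keyword_lf(keywords=["history", "account"], label=NON_FICTION)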

Leveraging Other Models

So far we have leveraged some domain knowledge/exploration and our existing labeled data to create our labeling functions. However, we could also utilise other resources to help us label our data. Since we’re working with text we should be able to benefit from some existing NLP models to label our data. Snorkel supports this in a few different ways.

spaCy is a popular NLP library which supports a range of different models and NLP tasks. Here we’re particularly interested in some of the named entity recognition models supported by spaCy.

To work with this library we can use Snorkel’s SpacyPreprocessor. Preprocessors are used in Snorkel to do some preprocessing (hence the name) which is required for our labeling functions. They can be particularly useful if the processing takes some time and might be reused across different labelling functions. Let’s take a look at an example.

from snorkel.preprocess.nlp import SpacyPreprocessor

# The SpacyPreprocessor parses the text in text_field and
# stores the new enriched representation in doc_field
spacy = SpacyPreprocessor(text_field="Title", doc_field="doc", memoize=True)

Above we create a SpacyPreprocessor which will use our Title field and create a new doc field. This doc refers to the spaCy Doc container and can be reused by multiple different labeling functions. We pass in memoize=True to cache our results, which means we won’t have to wait for the preprocessing to be repeated for different labeling functions that reuse the doc container.

Using named entities for labeling functions

spaCy has support for named entity recognition. Since these models already exist and can be used directly, it is worth seeing whether named entities are of any benefit for our particular task.
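
Before writing any labeling functions it can help to peek at what the entity recogniser returns for one of our titles. The sketch below assumes spaCy’s small English model (en_core_web_sm) is installed, and uses an import alias so we don’t overwrite the spacy preprocessor variable created above.

import spacy as spacy_lib  # aliased so we don't shadow the `spacy` preprocessor above

nlp = spacy_lib.load("en_core_web_sm")
doc = nlp(df["Title"].iloc[0])
print([(ent.text, ent.label_) for ent in doc.ents])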

We can again draw from our domain knowledge, intuition or guesses (depending on how confident we are) and say that we are likely to see more named entities of the ORG type in non-fiction titles, since these will often be about organizations. We can combine this with a slightly softer threshold for length to label titles as likely non-fiction.

To create this function we replicate closely what we did before, except that we pass in our SpacyPreprocessor instance to let Snorkel know that this preprocessor is a requirement of this labeling function. Under the hood this means that if the preprocessor hasn’t been run already it will trigger the preprocessing. If we reuse it for another function the preprocessing will already have been cached.

@labeling_function(pre=[spacy])
def has_many_org(x):
    # Count the ORG entities (rather than all entities) in the title
    if len(x.doc) > 50 and sum(ent.label_ == "ORG" for ent in x.doc.ents) > 1:
        return NON_FICTION
    else:
        return ABSTAIN

We might also guess that there will be more location (LOC) entities in non-fiction titles.

@labeling_function(pre=[spacy])
def has_many_loc(x):
    # Count the LOC entities in the title
    if len(x.doc) > 50 and sum(ent.label_ == "LOC" for ent in x.doc.ents) > 2:
        return NON_FICTION
    else:
        return ABSTAIN

Similarly, we might also assume that there will be more GPE (geopolitical entity) entities for non-fiction.

@labeling_function(pre=[spacy])
def has_many_gpe(x):
    # Count the GPE entities in the title
    if len(x.doc) > 50 and sum(ent.label_ == "GPE" for ent in x.doc.ents) > 2:
        return NON_FICTION
    else:
        return ABSTAIN

and LAW entities…

@labeling_function(pre=[spacy])
def has_law(x):
    if any([ent.label_ == "LAW" for ent in x.doc.ents]):
        return NON_FICTION
    else:
        return ABSTAIN

and if it’s long and has a date it might be a non-fiction title?

@labeling_function(pre=[spacy])
def is_long_and_has_date(x):
    if len(x.doc) > 50 and any([ent.label_ == "DATE" for ent in x.doc.ents]):
        return NON_FICTION
    else:
        return ABSTAIN

or if it is long and has a FAC entity

@labeling_function(pre=[spacy])
def is_long_and_has_fac(x):
    if len(x.doc) > 50 and any([ent.label_ == "FAC" for ent in x.doc.ents]):
        return NON_FICTION
    else:
        return ABSTAIN

We now have a bunch of labeling functions, so we’ll create a new list containing them and see how they do.

lfs = [
    lf_contains_novel,
    lf_is_long_title,
    keyword_tale,
    keyword_poem,
    has_many_org,
    has_many_loc,
    has_many_gpe,
    has_law,
    is_long_and_has_date,
]
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df)
100%|██████████| 3262/3262 [00:34<00:00, 93.42it/s]
LFAnalysis(L=L_train, lfs=lfs).lf_summary()
j Polarity Coverage Overlaps Conflicts
lf_contains_novel 0 [0] 0.058553 0.000000 0.0
lf_is_long_title 1 [1] 0.023299 0.011036 0.0
keyword_tale 2 [0] 0.043532 0.000000 0.0
keyword_poem 3 [0] 0.042918 0.000000 0.0
has_many_org 4 [1] 0.011956 0.011956 0.0
has_many_loc 5 [1] 0.011036 0.011036 0.0
has_many_gpe 6 [1] 0.011036 0.011036 0.0
has_law 7 [1] 0.004905 0.000000 0.0
is_long_and_has_date 8 [1] 0.003372 0.003372 0.0

Again our coverage is quite low but we also don’t have too many conflicts. We can check the performance of these functions:

LFAnalysis(L=L_train, lfs=lfs).lf_empirical_accuracies(ground_truth)
array([1.        , 0.94736842, 0.97183099, 1.        , 0.92307692,
       0.91666667, 0.91666667, 1.        , 1.        ])

These are all doing pretty well, so we might be okay with lower coverage for now. We also have a resource available to us which should boost our coverage a fair bit: our previously trained model.

Using our previous model

In a previous notebook we trained a model which didn’t perform terribly. Although we want to improve its performance (hence this notebook), it wasn’t so disastrous as to be unusable, particularly given the insight from the error analysis notebook that if we raise the confidence threshold at which we accept our model’s predictions, its performance increases quite a bit. We may therefore want to incorporate this model as another way of labeling more data.

There are various ways in which we can do this, we’ll look at one approach below.

We’ll start by importing fastai so we can load our previously trained model.

from fastai.text.all import *

If you don’t have a model saved from the previous notebook you can download one by uncommenting the cell below.

# !wget -O 20210928-model.pkl  https://zenodo.org/record/5245175/files/20210928-model.pkl?download=1
--2021-11-11 14:12:20--  https://zenodo.org/record/5245175/files/20210928-model.pkl?download=1
Resolving zenodo.org (zenodo.org)... 137.138.76.77
Connecting to zenodo.org (zenodo.org)|137.138.76.77|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 158529715 (151M) [application/octet-stream]
Saving to: ‘20210928-model.pkl’

20210928-model.pkl  100%[===================>] 151.19M  11.1MB/s    in 16s     

2021-11-11 14:12:38 (9.54 MB/s) - ‘20210928-model.pkl’ saved [158529715/158529715]
learn = load_learner("20210928-model.pkl")

We can quickly check our vocab

learn.dls.vocab[1]
['Fiction', 'Non-fiction']

One way of using this model would be to create a preprocessor that will be used by Snorkel. This will do the setup required to use the model (as we saw with the spaCy example). We can do this by using the preprocessor decorator; our function then does whatever processing we need, in this case storing the predicted probabilities for each label.

# from snorkel.preprocess import preprocessor
# @preprocessor(memoize=True)
# def fastai_pred(x):
#     with learn.no_bar():
#         *_, probs = learn.predict(x.title)
#     x.fiction_prob = probs[0]
#     x.non_fiction_prob = probs[1]
#     return x

In this example we won’t use that approach, since we then wouldn’t benefit from doing our inference in batches. Instead we’ll just create some new columns to store our fastai model’s labels and confidence.

test_dl = learn.dls.test_dl(df.Title)
preds = learn.get_preds(dl=test_dl)
fiction_prob, non_fiction_prob = np.hsplit(preds[0].numpy(), 2)
fiction_prob
array([[0.9999399 ],
       [0.9999399 ],
       [0.9999399 ],
       ...,
       [0.04363291],
       [0.04363291],
       [0.02832149]], dtype=float32)
df["fiction_prob"] = fiction_prob
df["non_fiction_prob"] = non_fiction_prob

We now have some new columns containing the probabilities for our labels from our previously created model.

We saw in the previous Assessing Where our Model is Going Wrong section that by only using predictions where our model was confident, we could get better results i.e. we only accept suggestions from our model where it is very confident. For example, we could accept a prediction only if it is above 95% confidence.

We’ll use this in our labelling functions to set a threshold at which we accept the previous model’s predictions. If the model is unsure we don’t use its prediction. This means less of our data ends up labelled, because some predictions aren’t used, but we will hopefully get better labels because we only use predictions where our model is confident.

@labeling_function()
def fastai_fiction_prob_v_high(x):
    return FICTION if x.fiction_prob > 0.97 else ABSTAIN
@labeling_function()
def fastai_non_fiction_prob_v_high(x):
    return NON_FICTION if x.non_fiction_prob > 0.97 else ABSTAIN

Again we add these to our existing labeling function list and apply them to our data.

lfs += [fastai_fiction_prob_v_high, fastai_non_fiction_prob_v_high]
lfs
[LabelingFunction lf_contains_novel, Preprocessors: [],
 LabelingFunction lf_is_long_title, Preprocessors: [],
 LabelingFunction keyword_tale, Preprocessors: [],
 LabelingFunction keyword_poem, Preprocessors: [],
 LabelingFunction has_many_org, Preprocessors: [SpacyPreprocessor SpacyPreprocessor, Pre: []],
 LabelingFunction has_many_loc, Preprocessors: [SpacyPreprocessor SpacyPreprocessor, Pre: []],
 LabelingFunction has_many_gpe, Preprocessors: [SpacyPreprocessor SpacyPreprocessor, Pre: []],
 LabelingFunction has_law, Preprocessors: [SpacyPreprocessor SpacyPreprocessor, Pre: []],
 LabelingFunction is_long_and_has_date, Preprocessors: [SpacyPreprocessor SpacyPreprocessor, Pre: []],
 LabelingFunction fastai_fiction_prob_v_high, Preprocessors: [],
 LabelingFunction fastai_non_fiction_prob_v_high, Preprocessors: []]
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df)
100%|██████████| 3262/3262 [00:34<00:00, 95.06it/s]
LFAnalysis(L=L_train, lfs=lfs).lf_summary()
j Polarity Coverage Overlaps Conflicts
lf_contains_novel 0 [0] 0.058553 0.056101 0.000000
lf_is_long_title 1 [1] 0.023299 0.019926 0.000000
keyword_tale 2 [0] 0.043532 0.034028 0.000613
keyword_poem 3 [0] 0.042918 0.035561 0.000000
has_many_org 4 [1] 0.011956 0.011956 0.000920
has_many_loc 5 [1] 0.011036 0.011036 0.000920
has_many_gpe 6 [1] 0.011036 0.011036 0.000920
has_law 7 [1] 0.004905 0.002759 0.000000
is_long_and_has_date 8 [1] 0.003372 0.003372 0.000000
fastai_fiction_prob_v_high 9 [0] 0.223483 0.125996 0.000920
fastai_non_fiction_prob_v_high 10 [1] 0.338749 0.021153 0.000613

We can see that the labelling functions which use our model’s outputs have a much higher coverage of our data. This should be very helpful in labelling more examples, but we want to check that these labels are correct.

LFAnalysis(L=L_train, lfs=lfs).lf_empirical_accuracies(ground_truth)
array([1.        , 0.94736842, 0.97183099, 1.        , 0.92307692,
       0.91666667, 0.91666667, 1.        , 1.        , 0.99314129,
       0.99909502])

We can see that our labeling functions all perform pretty well, i.e. above 90% accuracy. We are also getting much better coverage now that we leverage our existing model.

Creating more training data

So far we have been using our existing labelled data to develop some potential labeling functions. Now that we are fairly satisfied with them, let’s apply them to the full data. We’ll quickly look at this process on our current data and then we’ll move to the full metadata file that we’ll use for creating more training data.

We use LabelModel to fit a model which takes as input all of the predictions from our labelling functions and predicts the probability for each label. This model is able to deal with some conflicts between labeling functions and will in most cases do much better than a naive majority vote model, i.e. one which just accepts the most often predicted label. The details of this model are beyond the scope of this notebook, but if you are interested, Data Programming: Creating Large Training Sets, Quickly offers a fuller overview of the method [8]

from snorkel.labeling.model import LabelModel

label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train=L_train, n_epochs=500, log_freq=100, seed=123)

Above we fit our LabelModel for 500 epochs. Since we are still working with our labelled training set we can get a score for this model.

label_model.score(
    L=L_train,
    Y=ground_truth,
    tie_break_policy="abstain",
    metrics=["precision", "recall", "f1"],
)
WARNING:root:Metrics calculated over data points with non-abstain labels only
{'f1': 0.994250331711632,
 'precision': 0.9964539007092199,
 'recall': 0.9920564872021183}
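
As a sanity check we could also compare this against a simple majority-vote baseline. The sketch below is an illustration rather than something run in this notebook, and assumes Snorkel’s MajorityLabelVoter can be scored in the same way as the LabelModel above.

# Score a simple majority-vote baseline on the same label matrix and ground
# truth, for comparison with the LabelModel results above.
from snorkel.labeling.model import MajorityLabelVoter

majority_model = MajorityLabelVoter(cardinality=2)
majority_model.score(
    L=L_train, Y=ground_truth, tie_break_policy="abstain", metrics=["f1"]
)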

This is looking pretty good and hopefully this performance will be similar for our full data. We’ll now load a dataframe that includes all of the BL books metadata.

df_full = pd.read_csv(
    "https://bl.iro.bl.uk/downloads/e1be1324-8b1a-4712-96a7-783ac209ddef?locale=en",
    dtype=dtypes,
)

We create a new column text_len since we need this for some of our labeling functions.

df_full["text_len"] = df_full.Title.str.len()

We also get our fastai model’s predictions into new columns. This obviously takes some time since we’re now doing inference on a fairly large dataset.

test_dl = learn.dls.test_dl(df_full.Title)
preds = learn.get_preds(dl=test_dl)
fiction_prob, non_fiction_prob = np.hsplit(preds[0].numpy(), 2)
df_full["fiction_prob"] = fiction_prob
df_full["non_fiction_prob"] = non_fiction_prob

Now that we have all the same columns in place as before, we can apply our labelling functions to all of our data.

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_full)
100%|██████████| 52695/52695 [09:27<00:00, 92.85it/s] 

We can check what the coverage, overlaps and conflicts look like

LFAnalysis(L=L_train, lfs=lfs).lf_summary()
j Polarity Coverage Overlaps Conflicts
lf_contains_novel 0 [0] 0.066875 0.057178 0.000285
lf_is_long_title 1 [1] 0.047367 0.032413 0.003036
keyword_tale 2 [0] 0.036702 0.022165 0.001006
keyword_poem 3 [0] 0.089952 0.049018 0.002619
has_many_org 4 [1] 0.021520 0.021520 0.001974
has_many_loc 5 [1] 0.021008 0.021008 0.001917
has_many_gpe 6 [1] 0.021008 0.021008 0.001917
has_law 7 [1] 0.004232 0.002106 0.000531
is_long_and_has_date 8 [1] 0.007762 0.007762 0.000380
fastai_fiction_prob_v_high 9 [0] 0.213834 0.122858 0.000626
fastai_non_fiction_prob_v_high 10 [1] 0.192599 0.019015 0.000683

The coverage is lower than we had previously. This makes sense since we previously used the same data for developing our labeling functions as we used for training our model, so it’s not surprising our model is more confident about those examples. If we were being more diligent we might have held back a different dataset for developing our labelling functions, but since we’re being a bit pragmatic (lazy) here we won’t worry too much about this. We can again fit our model:

label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train=L_train, n_epochs=500, log_freq=100, seed=123)

We now use this model to predict the probabilities from our labelling function outputs.

probs_train = label_model.predict_proba(L_train)

We currently have predictions for some of our data but not all of it. Since we want only the labelled examples, we use a function from Snorkel to filter out data which our labeling functions didn’t annotate.

from snorkel.labeling import filter_unlabeled_dataframe

df_train_filtered, probs_train_filtered = filter_unlabeled_dataframe(
    X=df_full, y=probs_train, L=L_train
)

We now have the predicted probability for each label. We could work with these probabilities, but to keep things simple we’ll turn them into hard predictions, i.e. fiction or non-fiction rather than 87% fiction. Again Snorkel provides a handy function for doing this.

from snorkel.utils import probs_to_preds

preds_train_filtered = probs_to_preds(probs=probs_train_filtered)

Let’s see how much data we have now.

len(preds_train_filtered)
26566

As a reminder, we previously had 3262 labeled examples. We can see that we’ve now gained a lot more examples for relatively little work (especially if we compare how much time it would take to annotate these by hand).

26566 / 3262
8.144083384426732

We’ll store our new labels in a snorkel_label column.

df_train_filtered["snorkel_label"] = preds_train_filtered
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
df_train_filtered["snorkel_label"]
0        0
1        0
2        0
3        0
8        0
        ..
52682    0
52689    0
52692    0
52693    0
52694    0
Name: snorkel_label, Length: 26566, dtype: int64

Creating our new training data

As a reminder of what we’ve done:

  • we had training data/annotations collected via a Zooniverse crowdsourcing task, with 2909 labeled examples in our validation set

  • we had previously used this to train a model that did fairly well

  • we used our existing training data to generate labeling functions, these leveraged:

    • our intuitions about our data

    • SpaCy models

    • our previous model

  • we applied these labeling functions to the Microsoft Digitised Books metadata. Once we excluded examples which weren’t labeled by our labeling functions, we had 26566 labeled examples we could work with.

We now want to get all of this data into a format we can use to train new models with. There are a few things we need to do for this.

Map to our original labels

We’ll map these back to our original Fiction and Non-fiction labels. This isn’t super important but is more explicit than 1 or 0 for our labels.

df_train_filtered["snorkel_genre"] = df_train_filtered["snorkel_label"].map(
    {0: "Fiction", 1: "Non-fiction"}
)
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
df_train_filtered.columns
Index(['BL record ID', 'Type of resource', 'Name',
       'Dates associated with name', 'Type of name', 'Role', 'All names',
       'Title', 'Variant titles', 'Series title', 'Number within series',
       'Country of publication', 'Place of publication', 'Publisher',
       'Date of publication', 'Edition', 'Physical description',
       'Dewey classification', 'BL shelfmark', 'Topics', 'Genre', 'Languages',
       'Notes', 'BL record ID for physical resource', 'text_len',
       'fiction_prob', 'non_fiction_prob', 'snorkel_label', 'snorkel_genre'],
      dtype='object')

Selecting required columns

Since we have only been using the title and the label (fiction or non-fiction) to train our models we will just keep these.

df["annotator_genre"]
0           Fiction
1           Fiction
2           Fiction
3           Fiction
4           Fiction
           ...     
3257    Non-fiction
3258    Non-fiction
3259    Non-fiction
3260    Non-fiction
3261    Non-fiction
Name: annotator_genre, Length: 3262, dtype: object
df["snorkel_genre"] = df["annotator_genre"]
df_snorkel_train = pd.concat([df, df_train_filtered])
df_snorkel_train["snorkel_genre"].value_counts()
Fiction        15840
Non-fiction    13988
Name: snorkel_genre, dtype: int64

Prioritising human annotations

When we applied our labeling functions across the full Microsoft Digitised Books metadata file we didn’t do anything to exclude titles where a human annotator had already provided a label as part of the Zooniverse annotation task. Since we joined the full metadata and the human annotations together, we will now have some duplicates. We almost certainly want to prioritise the human annotations over our labeling function labels. We can use pandas drop_duplicates, keeping the first example (the human-annotated one), to deal with this.

df_snorkel_train.duplicated(subset="Title")
0        False
1         True
2         True
3         True
4         True
         ...  
52682    False
52689    False
52692    False
52693    False
52694    False
Length: 29828, dtype: bool
df_snorkel_train = df_snorkel_train.drop_duplicates(subset="Title", keep="first")
df_snorkel_train
BL record ID Type of resource Name Dates associated with name Type of name Role All names Title Variant titles Series title Number within series Country of publication Place of publication Publisher Date of publication Edition Physical description Dewey classification BL shelfmark Topics Genre Languages Notes BL record ID for physical resource classification_id user_id created_at subject_ids annotator_date_pub annotator_normalised_date_pub annotator_edition_statement annotator_genre annotator_FAST_genre_terms annotator_FAST_subject_terms annotator_comments annotator_main_language annotator_other_languages_summaries annotator_summaries_language annotator_translation annotator_original_language annotator_publisher annotator_place_pub annotator_country annotator_title Link to digitised book annotated is_valid text_len fiction_prob non_fiction_prob snorkel_genre snorkel_label
0 014616539 Monograph NaN NaN NaN NaN Hazlitt, William Carew, 1834-1913 [person] The Baron's Daughter. A ballad by the author of Poetical Recreations [i.e. William C. Hazlitt] . F.P Single Works NaN NaN Scotland Edinburgh Ballantyne, Hanson 1877 NaN 20 pages (4°) <NA> Digital Store 11651.h.6 NaN NaN English NaN 000206670 263940444.0 3.0 2020-07-27 07:35:13 UTC 44330917.0 1877 1877 NONE Fiction 655 7 $aPoetry$2fast$0(OCoLC)fst01423828 NONE NaN NaN No <NA> No NaN Ballantyne Hanson & Co. Edinburgh stk The Baron's Daughter. A ballad by the author of Poetical Recreations [i.e. William C. Hazlitt] . F.P http://access.bl.uk/item/viewer/ark:/81055/vdc_00000002F718 True False 100 0.999940 0.000060 Fiction NaN
5 014616561 Monograph Bingham, Ashton, Mrs NaN person NaN Bingham, Ashton, Mrs [person] The Autumn Leaf Poems NaN NaN NaN Scotland Edinburgh Colston 1891 NaN vi, 104 pages (8°) <NA> Digital Store 011649.e.105 NaN NaN English NaN 000353271 268728281.0 3.0 2020-08-18 07:02:17 UTC 44331070.0 1891 1891 NONE Fiction 655 7 $aPoetry$2fast$0(OCoLC)fst01423828 NONE NaN NaN No <NA> No NaN Colston & Company Edinburgh stk The Autumn Leaf Poems http://access.bl.uk/item/viewer/ark:/81055/vdc_00000002F04C True False 21 0.999486 0.000514 Fiction NaN
10 014616607 Monograph Cartwright, William NaN person writer Cartwright, William, writer [person] The Battle of Waterloo, a poem NaN NaN NaN England London Longman 1827 NaN vii, 71 pages (8°) <NA> Digital Store 992.i.26 Waterloo, Battle of (Belgium : 1815) NaN English NaN 000621918 263935396.0 3.0 2020-07-27 06:39:57 UTC 44331748.0 1827 1827 NONE Fiction 655 7 $aPoetry$2fast$0(OCoLC)fst01423828 647 7 $aBattle of Waterloo$c(Waterloo, Belgium :$d1815)$2fast$0(OCoLC)fst01172689 NaN NaN No <NA> No NaN Longman, Rees, Orme, Brown & Green\nBurlton\nMerricks London\nLeominster\nHereford enk The Battle of Waterloo, a poem http://access.bl.uk/item/viewer/ark:/81055/vdc_00000002ED4C True False 30 0.991599 0.008401 Fiction NaN
15 014616686 Monograph Earle, John Charles NaN person NaN Earle, John Charles [person] Maximilian, and other poems, etc NaN NaN NaN England London NaN 1868 NaN NaN <NA> Digital Store 11648.i.8 NaN Poetry or verse English NaN 001025896 265570129.0 3.0 2020-08-03 07:25:30 UTC 44331725.0 1868 1868 NONE Fiction 655 7 $aPoetry$2fast$0(OCoLC)fst01423828 NONE NaN NaN No <NA> No NaN Burns, Oates, & Co. London enk Maximilian, and other poems, etc http://access.bl.uk/item/viewer/ark:/81055/vdc_00000002F2AA True False 32 0.982546 0.017454 Fiction NaN
20 014616696 Monograph NaN NaN NaN NaN NaN Fabellæ mostellariæ: or Devonshire and Wiltshire stories in verse; including specimens of the Devonshire dialect NaN NaN NaN England Exeter ; London Hamilton, Adams ; Henry S. Eland 1878 NaN 77 pages (8°) <NA> Digital Store 11652.h.19 NaN NaN English NaN 001187981 269169228.0 3.0 2020-08-20 12:32:34 UTC 44331389.0 1878 1878 NONE Fiction 655 7 $aPoetry$2fast$0(OCoLC)fst01423828 NONE NaN NaN No <NA> No NaN Hamilton, Adams, and Co.\nHenry S. Eland London\nExeter enk Fabellæ mostellariæ: or Devonshire and Wiltshire stories in verse; including specimens of the Devonshire dialect http://access.bl.uk/item/viewer/ark:/81055/vdc_00000002F90A True False 112 0.983944 0.016056 Fiction NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
52682 016289050 Monograph Hastings, Beatrice NaN person NaN Hastings, Beatrice [person] The maids' comedy. A chivalric romance in thirteen chapters NaN NaN NaN England London Stephen Swift 1911 NaN 199 pages, 20 cm <NA> Digital Store 012618.c.32 NaN NaN English Anonymous. By Beatrice Hastings 004111105 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN <NA> NaN NaN NaN NaN NaN NaN NaN NaN NaN 59 0.999444 0.000556 Fiction 0.0
52689 016289057 Monograph Garstang, Walter, M.A., F.Z.S. NaN person NaN Garstang, Walter, M.A., F.Z.S. [person] ; Shepherd, J. A. (James Affleck), 1867-approximately 1931 [person] Songs of the Birds ... With illustrations by J.A. Shepherd NaN NaN NaN England London John Lane 1922 NaN 101 pages, illustrations (8°) 598.259 Digital Store 011648.g.133 NaN NaN English Poems, with and introductory essay 004158005 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN <NA> NaN NaN NaN NaN NaN NaN NaN NaN NaN 58 0.993942 0.006058 Fiction 0.0
52692 016289060 Monograph Wellesley, Dorothy 1889-1956 person NaN Wellesley, Dorothy, 1889-1956 [person] Early Poems. By M. A [i.e. Dorothy Violet Wellesley, Lady Gerald Wellesley.] NaN NaN NaN England London Elkin Mathews 1913 NaN vii, 90 pages (8°) <NA> Digital Store 011649.eee.17 NaN NaN English NaN 000000839 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN <NA> NaN NaN NaN NaN NaN NaN NaN NaN NaN 76 0.987218 0.012782 Fiction 0.0
52693 016289061 Monograph A, T. H. E. NaN person NaN A, T. H. E. [person] Of Life and Love [Poems.] By T. H. E. A, writer of 'The Message.' NaN NaN NaN England London J. M. Watkins 1924 NaN 89 pages (8°) <NA> Digital Store 011645.e.125 NaN NaN English NaN 000001167 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN <NA> NaN NaN NaN NaN NaN NaN NaN NaN NaN 65 0.977032 0.022968 Fiction 0.0
52694 016289062 Monograph Abbay, Richard NaN person NaN Abbay, Richard [person] Life, a Mode of Motion; or, He and I, my two selves [A poem.] NaN NaN NaN England London Jarrold 1919 NaN volumes, 58 pages (8°) <NA> Digital Store 011649.g.81 NaN NaN English NaN 000003140 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN <NA> NaN NaN NaN NaN NaN NaN NaN NaN NaN 61 0.975888 0.024112 Fiction 0.0

25683 rows × 52 columns

Data leakage

We want to exclude data which is in the test set, so we drop these examples from our training data. Since we care about titles ‘leaking’, we check whether any titles in our training data appear in our test data and remove them from the training data.

df_test = pd.read_csv("test_errors.csv")

Removing data which is in our test data

df_snorkel_train = df_snorkel_train[~df_snorkel_train.Title.isin(df_test.title)]
len(df_snorkel_train)
25683

Creating new splits

We create some new splits following the same process we used before. We can then use these splits to more accurately compare across models trained using this dataset. Since we have kept the test data out of our ‘Snorkel dataset’ we will also continue to use this test data for final model evaluation.

from sklearn.model_selection import GroupShuffleSplit
train_inds, valid_ins = next(
    GroupShuffleSplit(n_splits=2, test_size=0.2).split(
        df_snorkel_train, groups=df_snorkel_train["Title"]
    )
)
df_train, df_valid = (
    df_snorkel_train.iloc[train_inds].copy(),
    df_snorkel_train.iloc[valid_ins].copy(),
)
df_train["is_valid"] = False
df_valid["is_valid"] = True
df = pd.concat([df_train, df_valid])
df.snorkel_genre.value_counts()
Fiction        13918
Non-fiction    11765
Name: snorkel_genre, dtype: int64

We can see we still have a healthy number of examples to train our model on, even after dropping titles which appear in our test data.

len(df)
25683

Saving our new training data

We’ll save our new training data as a csv file.

df.to_csv("data/snorkel_train.csv", index=False)
df.head(1)
BL record ID Type of resource Name Dates associated with name Type of name Role All names Title Variant titles Series title Number within series Country of publication Place of publication Publisher Date of publication Edition Physical description Dewey classification BL shelfmark Topics Genre Languages Notes BL record ID for physical resource classification_id user_id created_at subject_ids annotator_date_pub annotator_normalised_date_pub annotator_edition_statement annotator_genre annotator_FAST_genre_terms annotator_FAST_subject_terms annotator_comments annotator_main_language annotator_other_languages_summaries annotator_summaries_language annotator_translation annotator_original_language annotator_publisher annotator_place_pub annotator_country annotator_title Link to digitised book annotated is_valid text_len fiction_prob non_fiction_prob snorkel_genre snorkel_label
0 014616539 Monograph NaN NaN NaN NaN Hazlitt, William Carew, 1834-1913 [person] The Baron's Daughter. A ballad by the author of Poetical Recreations [i.e. William C. Hazlitt] . F.P Single Works NaN NaN Scotland Edinburgh Ballantyne, Hanson 1877 NaN 20 pages (4°) <NA> Digital Store 11651.h.6 NaN NaN English NaN 000206670 263940444.0 3.0 2020-07-27 07:35:13 UTC 44330917.0 1877 1877 NONE Fiction 655 7 $aPoetry$2fast$0(OCoLC)fst01423828 NONE NaN NaN No <NA> No NaN Ballantyne Hanson & Co. Edinburgh stk The Baron's Daughter. A ballad by the author of Poetical Recreations [i.e. William C. Hazlitt] . F.P http://access.bl.uk/item/viewer/ark:/81055/vdc_00000002F718 True False 100 0.99994 0.00006 Fiction NaN

Next steps

We now have a larger training set which includes both our original training data produced through crowdsourcing and the training data we generated using our labeling functions and the Snorkel library.

Hopefully having more training data will result in being able to improve the models we can generate. In the next sections we’ll look at two approaches we can use for doing this:

  • training the same model as before but with more data

  • training a transformer based model with more data

We will hopefully see some improvements now we have more data.

Note

The main things we tried to show in this notebook:

  • we can leverage our domain knowledge to help generate training data using a programmatic data labeling approach

  • this approach can leverage existing training data generated by humans

  • we can often use existing models to help generate training data even if the task is quite different