Model inference

Inference is the process of making predictions on new, unseen data. There are different approaches to carrying out inference, and the right one depends on the purpose of the model and how it will be used. Two main approaches are:

  • ‘real-time’ single-item predictions, i.e. calling an API to predict one example at a time

  • ‘batch inference’, i.e. running inference against a larger volume of data in one go

Since we have a set of data that we want to augment with additional machine-generated labels, we will use the second approach: batch inference. Because we are only likely to run this batch prediction process occasionally, for example when we train a better-performing model, we won’t spend much time worrying about how fast the inference process is.
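To make the distinction concrete, here is a minimal sketch of both approaches with a fastai Learner. The learner name and titles here are purely illustrative; we’ll do this for real with our own model below.

# 'Real-time' inference: one example in, one prediction out
label, idx, probs = learn.predict("A history of the French Navy")

# Batch inference: build a test DataLoader from many examples at once,
# then run predictions over the whole set in a single call
test_dl = learn.dls.test_dl(["First title", "Second title", "Third title"])
probs, _ = learn.get_preds(dl=test_dl)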

!pip install fastai==2.5.2
Collecting fastai==2.5.2
  Downloading fastai-2.5.2-py3-none-any.whl (186 kB)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.7/dist-packages (from fastai==2.5.2) (0.22.2.post1)
Requirement already satisfied: torch<1.10,>=1.7.0 in /usr/local/lib/python3.7/dist-packages (from fastai==2.5.2) (1.9.0+cu111)
Collecting fastcore<1.4,>=1.3.8
  Downloading fastcore-1.3.27-py3-none-any.whl (56 kB)
Requirement already satisfied: pandas in /usr/local/lib/python3.7/dist-packages (from fastai==2.5.2) (1.1.5)
Requirement already satisfied: spacy<4 in /usr/local/lib/python3.7/dist-packages (from fastai==2.5.2) (2.2.4)
Requirement already satisfied: pillow>6.0.0 in /usr/local/lib/python3.7/dist-packages (from fastai==2.5.2) (7.1.2)
Requirement already satisfied: pip in /usr/local/lib/python3.7/dist-packages (from fastai==2.5.2) (21.1.3)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.7/dist-packages (from fastai==2.5.2) (3.2.2)
Requirement already satisfied: packaging in /usr/local/lib/python3.7/dist-packages (from fastai==2.5.2) (21.0)
Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages (from fastai==2.5.2) (2.23.0)
Requirement already satisfied: fastprogress>=0.2.4 in /usr/local/lib/python3.7/dist-packages (from fastai==2.5.2) (1.0.0)
Collecting fastdownload<2,>=0.0.5
  Downloading fastdownload-0.0.5-py3-none-any.whl (13 kB)
Requirement already satisfied: torchvision>=0.8.2 in /usr/local/lib/python3.7/dist-packages (from fastai==2.5.2) (0.10.0+cu111)
Requirement already satisfied: scipy in /usr/local/lib/python3.7/dist-packages (from fastai==2.5.2) (1.4.1)
Requirement already satisfied: pyyaml in /usr/local/lib/python3.7/dist-packages (from fastai==2.5.2) (3.13)
Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from fastprogress>=0.2.4->fastai==2.5.2) (1.19.5)
Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /usr/local/lib/python3.7/dist-packages (from spacy<4->fastai==2.5.2) (4.62.3)
Requirement already satisfied: srsly<1.1.0,>=1.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy<4->fastai==2.5.2) (1.0.5)
Requirement already satisfied: plac<1.2.0,>=0.9.6 in /usr/local/lib/python3.7/dist-packages (from spacy<4->fastai==2.5.2) (1.1.3)
Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /usr/local/lib/python3.7/dist-packages (from spacy<4->fastai==2.5.2) (1.0.5)
Requirement already satisfied: wasabi<1.1.0,>=0.4.0 in /usr/local/lib/python3.7/dist-packages (from spacy<4->fastai==2.5.2) (0.8.2)
Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy<4->fastai==2.5.2) (3.0.5)
Requirement already satisfied: setuptools in /usr/local/lib/python3.7/dist-packages (from spacy<4->fastai==2.5.2) (57.4.0)
Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from spacy<4->fastai==2.5.2) (2.0.5)
Requirement already satisfied: blis<0.5.0,>=0.4.0 in /usr/local/lib/python3.7/dist-packages (from spacy<4->fastai==2.5.2) (0.4.1)
Requirement already satisfied: thinc==7.4.0 in /usr/local/lib/python3.7/dist-packages (from spacy<4->fastai==2.5.2) (7.4.0)
Requirement already satisfied: catalogue<1.1.0,>=0.0.7 in /usr/local/lib/python3.7/dist-packages (from spacy<4->fastai==2.5.2) (1.0.0)
Requirement already satisfied: importlib-metadata>=0.20 in /usr/local/lib/python3.7/dist-packages (from catalogue<1.1.0,>=0.0.7->spacy<4->fastai==2.5.2) (4.8.1)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata>=0.20->catalogue<1.1.0,>=0.0.7->spacy<4->fastai==2.5.2) (3.6.0)
Requirement already satisfied: typing-extensions>=3.6.4 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata>=0.20->catalogue<1.1.0,>=0.0.7->spacy<4->fastai==2.5.2) (3.7.4.3)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests->fastai==2.5.2) (3.0.4)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests->fastai==2.5.2) (1.24.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests->fastai==2.5.2) (2021.5.30)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests->fastai==2.5.2) (2.10)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib->fastai==2.5.2) (1.3.2)
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib->fastai==2.5.2) (2.8.2)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.7/dist-packages (from matplotlib->fastai==2.5.2) (0.10.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib->fastai==2.5.2) (2.4.7)
Requirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from cycler>=0.10->matplotlib->fastai==2.5.2) (1.15.0)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas->fastai==2.5.2) (2018.9)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn->fastai==2.5.2) (1.0.1)
Installing collected packages: fastcore, fastdownload, fastai
  Attempting uninstall: fastai
    Found existing installation: fastai 1.0.61
    Uninstalling fastai-1.0.61:
      Successfully uninstalled fastai-1.0.61
Successfully installed fastai-2.5.2 fastcore-1.3.27 fastdownload-0.0.5

In the previous notebook we exported our model. We can load it back using the load_learner function.

from fastai.text.all import *

If you don’t have a saved model, you can grab one by uncommenting this cell:

!wget -O 20210928-model.pkl https://zenodo.org/record/5245175/files/20210928-model.pkl?download=1
--2021-11-02 19:34:35--  https://zenodo.org/record/5245175/files/20210928-model.pkl?download=1
Resolving zenodo.org (zenodo.org)... 137.138.76.77
Connecting to zenodo.org (zenodo.org)|137.138.76.77|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 158529715 (151M) [application/octet-stream]
Saving to: ‘20210928-model.pkl’

20210928-model.pkl  100%[===================>] 151.19M  26.5MB/s    in 6.5s    

2021-11-02 19:34:43 (23.3 MB/s) - ‘20210928-model.pkl’ saved [158529715/158529715]
learn_class = load_learner("20210928-model.pkl", cpu=False)

Trying some examples of made-up books

To start with, let’s call the predict method on some made-up book titles to see if the model gives sensible answers:

learn_class.predict("A history of the French Navy")
('Non-fiction', tensor(1), tensor([0.0081, 0.9919]))
learn_class.predict("Communist Manifesto")
('Non-fiction', tensor(1), tensor([0.4674, 0.5326]))

These seem like sensible predictions. We can also see what information the predict method returns: the decoded label, the index of that label, and a tensor containing the probability for each possible label. The probabilities are particularly important to note, since we will likely want to keep this confidence information alongside our predictions.
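For example, we could unpack this tuple and pull out the probability assigned to the predicted label (a small sketch based on the output above):

label, idx, probs = learn_class.predict("Communist Manifesto")
confidence = probs[idx].item()  # probability of the predicted label
print(label, confidence)  # e.g. Non-fiction 0.5326...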

Predicting against the full BL Microsoft books metadata

We are now ready to run predictions against the full collection of metadata, which contains all of the titles we want genre labels for.

full_metadata_url = (
    "https://bl.iro.bl.uk/downloads/e4bf0f74-2c64-4322-93c7-0dcc5e5246da?locale=en"
)
dtypes = {
    "BL record ID": "string",
    "Type of resource": "category",
    "Name": "category",
    "Role": "category",
    "Title": "string",
    "Country of publication": "category",
    "Place of publication": "category",
    "Publisher": "category",
    "Genre": "category",
    "Languages": "category",
}
df_full = pd.read_csv(full_metadata_url, low_memory=False, dtype=dtypes)

As a reminder, we can check how big this dataset is:

len(df_full)
1752078
# Drop rows with no title, since the model needs text to predict on
df_full = df_full[df_full.Title.notna()]

Creating our test data

We need to make sure that our data is processed in the same way at inference time as it was during training. For example, our text needs to be tokenized in the same way. fastai makes this very easy with the test_dl method, which knows how to process data for our model; we just need to pass in the column containing our text.

titles = df_full.loc[:, "Title"]
# Load data in the main process to avoid multiprocessing issues in notebooks
learn_class.dls.num_workers = 0
%%time
test_data = learn_class.dls.test_dl(titles)
CPU times: user 30min 24s, sys: 1min 13s, total: 31min 38s
Wall time: 30min 19s

Once we have done this, we can use the get_preds method to run predictions against all of our data:

%%time
predictions = learn_class.get_preds(dl=test_data)
CPU times: user 3min 53s, sys: 8.72 s, total: 4min 1s
Wall time: 17min
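Since inference over the full dataset still takes a while in wall-clock terms, it might be worth checkpointing the raw predictions to disk so we don’t have to re-run this step if the session restarts. An optional sketch (torch is already imported via the fastai star import; the filename is arbitrary):

# Save the raw predictions tensor so inference needn't be repeated
torch.save(predictions[0], "bl_books_preds.pt")
# Later, e.g. after a restart:
# preds_tensor = torch.load("bl_books_preds.pt")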

All in all, this didn’t take too long considering the size of our data. We might want to double-check that our predictions match the length of our original data. If we just call len on predictions:

len(predictions)
2

You can see we get something back with length 2. Let’s take a look at what it contains.

predictions
(tensor([[0.0759, 0.9241],
         [0.1282, 0.8718],
         [0.9074, 0.0926],
         ...,
         [0.0986, 0.9014],
         [0.0675, 0.9325],
         [0.0834, 0.9166]]), None)

We can see that this is a tuple: the first element contains the tensor of probabilities we’re interested in, while the second element holds the targets, which is None here because our test data has no labels. Let’s get the length of the first element.

len(predictions[0])
1752072
assert len(predictions[0]) == len(df_full)

Since we only want the first element of our predictions tuple, let’s store it in a new variable, preds_tensor.

preds_tensor = predictions[0]
preds_tensor[0]
tensor([0.0759, 0.9241])

At the moment we have the probabilities for each label, but not the label names themselves. We can get the vocab from our learner’s dls attribute.

learn_class.dls.vocab[1]
['Fiction', 'Non-fiction']

To make this data easier to work with, let’s map our probabilities to this vocab. We’ll first store the argmax value for each prediction, i.e. the index of the highest-probability label.

df_full["predicted_label"] = preds_tensor.numpy().argmax(1)

We can then create a dictionary to map our 0 and 1 labels to the text versions:

decode = dict(enumerate(learn_class.dls.vocab[1]))
decode
{0: 'Fiction', 1: 'Non-fiction'}
df_full.predicted_label = df_full.predicted_label.replace(decode)
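If you prefer, the codes-to-labels step can also be done in one go with pandas’ Categorical.from_codes, which produces the same labels and gives the column a proper categorical dtype. An equivalent alternative sketch, not what we did above:

df_full["predicted_label"] = pd.Categorical.from_codes(
    preds_tensor.numpy().argmax(1), categories=list(learn_class.dls.vocab[1])
)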

We’ll create two new columns to store the probabilities for each of our labels.

import numpy as np

# Split the (n_rows, 2) probability array into one column per class
fiction_probs, non_fiction_probs = np.hsplit(preds_tensor.numpy(), learn_class.dls.c)
df_full["fiction_probs"] = fiction_probs
df_full["non_fiction_probs"] = non_fiction_probs

Let’s take a quick look at our new columns:

df_full[["Title", "predicted_label", "fiction_probs", "non_fiction_probs"]].head(5)
   Title                                                                                       predicted_label  fiction_probs  non_fiction_probs
0  Aabc [etc.] Jesus Vocales, eli äänelliset bokstawit Consonantes Luku-merkit                 Non-fiction      0.075868       0.924132
1  A che serve il Papa?                                                                        Non-fiction      0.128236       0.871764
2  A. for Apple [An illustrated alphabet.]                                                     Fiction          0.907428       0.092572
3  Á Grãa Bretanha                                                                             Non-fiction      0.262661       0.737339
4  A quien me entiende [On the factious spirit of the Mexican press. Signed: Uno de tantos.]   Non-fiction      0.479002       0.520998

This looks like a fairly reasonable format for storing our predictions. Let’s save it as JSON and CSV files.

df_full.to_json("bl_books_w_genre.json")
df_full.to_csv("bl_books_w_genre.csv", index=False)
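As a quick sanity check, we could read one of these files back and confirm our new columns survived the round trip (an optional check; usecols just keeps the output small):

pd.read_csv("bl_books_w_genre.csv", usecols=["Title", "predicted_label"]).head()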

Conclusion

We now have a full set of predictions that we can work with. However, we might want to dig further into the potential weaknesses of our model and try to improve on this initial version. We’ll do that in the next sections.