Assessing Where our Model is Going Wrong¶

The model we trained in the previous notebooks had a fairly high F1 score when we tested it on our validation set; however, when we tested it on a new ‘test’ dataset our performance was lower. In this notebook we’ll dig into this in a little more detail.

Addressing Worries about Domain Shift: Creating Test Data¶

We have previously discussed domain drift, but as a reminder this is broadly when the data we make predictions against is different from, or becomes different from, the training data. As an example, a model used to predict ice-cream sales whose training data came from the summer months is likely to do less well (or badly) at predicting ice-cream sales in the winter.

In our case the divergence might be more subtle. One potential issue is that the sampling used to generate our initial training data might not have been completely random. This could lead to subtle differences between our training data and the data we will be predicting against.

Potential Areas of Difference¶

There are a bunch of ways in which the training data we have might not match the full target dataset we want to predict on, including:

Language¶

The full British Library Microsoft Books corpus contains a range of different languages. As the Zooniverse annotation task was done by speakers of a subset of those languages, there might be an over- or under-representation of languages in the training data compared to the full data.

Dates¶

The full dataset covers a broad time period, with the titles likely varying quite a bit between the beginning of this period (the 1500s) and the later period (the 1800s). If the training data over-represents one period, we might expect the model to struggle with titles from other time periods.
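
One quick way to check for this kind of skew would be to compare the distribution of publication dates in the training data against the full corpus metadata. A minimal sketch, assuming the metadata has a numeric date column (the corpus metadata file name and the date column name here are illustrative, not the actual ones used):

import pandas as pd

# Illustrative file and column names -- adjust to the real metadata schema
df_train = pd.read_csv("data/annotations.csv")
df_full = pd.read_csv("data/full_corpus_metadata.csv")

# Bucket publication years into half-centuries and compare the proportions
for name, df in [("training", df_train), ("full corpus", df_full)]:
    half_century = (df["date"] // 50) * 50
    print(name)
    print(half_century.value_counts(normalize=True).sort_index().round(3))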

Difficult titles skipped¶

Genre is not always clear cut and it can be difficult to tell in some cases. A book could contain a mixture that could reasonably be labeled as either fiction or non-fiction: for example, a book of poems with significant commentary about the poems. If annotators skipped these examples, the training data won’t include as many ‘hard’ examples for the model to learn from. We could also be generous and say that if a human expert annotator struggles to classify a book’s genre, then a machine learning model might also struggle.

Creating a New Randomly Sampled Test Dataset¶

After developing a model we wanted to test how well it performed on completely unseen data randomly sampled from the full BL books corpus. To do this we decided to do some internal verification of the results of the previously trained model. We did this by:

  • sampling randomly from the full books corpus

  • verifying the models predictions

All of this was done using a crude Google sheet with a couple of people working through this data. Since the aim was to cover as many titles as possible, some work was sometimes required to translate titles, or to look at the digitized copy of the book in the BL catalogue for verification. This resulted in a new ‘test’ dataset which we used to evaluate our model.
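
The sampling step itself can be very simple. A rough sketch of what it might look like (the corpus metadata file name and sample size are illustrative; the verification of the model’s predictions was then done by hand):

import pandas as pd

# Illustrative file name for the full BL books metadata
df_corpus = pd.read_csv("data/full_corpus_metadata.csv")

# Draw a reproducible random sample of titles for human verification
sample = df_corpus.sample(n=1000, random_state=42)
sample.to_csv("data/test_sample_for_annotation.csv", index=False)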

This section will explore the results with this test data. First we import some packages

import pandas as pd
import numpy as np
import re
from sklearn.metrics import f1_score, classification_report, accuracy_score

We load our initial input data

df_training = pd.read_csv("data/annotations.csv")

and our new test dataset.

df_test = pd.read_csv("data/test_errors.csv")
len(df_test)
999

We do a bit of tidying to make sure the format of the labels matches


df_test.predicted_label.unique()
array(['Non-fiction', 'Fiction'], dtype=object)
df_test.true_label.unique()
array(['non_fiction', 'fiction', 'both', nan], dtype=object)
df_test = df_test[df_test["true_label"].isin(["non_fiction", "fiction"])]
df_test.true_label = df_test.true_label.str.lower()
df_test.predicted_label = df_test.predicted_label.str.lower()
df_test.predicted_label = df_test.predicted_label.str.replace("-", "_")

Our Test Data¶

Our new test dataframe includes the following columns which are relevant here:

df_test[
    [
        "predicted_label",
        "fiction_probs",
        "non_fiction_probs",
        "true_label",
        "free text comment",
    ]
]
predicted_label fiction_probs non_fiction_probs true_label free text comment
0 non_fiction 0.020685 0.979315 non_fiction NaN
1 non_fiction 0.037538 0.962462 non_fiction NaN
2 non_fiction 0.389111 0.610889 fiction NaN
3 non_fiction 0.050611 0.949388 non_fiction NaN
4 non_fiction 0.087175 0.912825 non_fiction NaN
... ... ... ... ... ...
994 fiction 0.927855 0.072145 fiction NaN
995 non_fiction 0.354936 0.645064 fiction NaN
996 fiction 0.537227 0.462773 non_fiction NaN
997 non_fiction 0.087564 0.912436 fiction NaN
998 non_fiction 0.068475 0.931525 non_fiction NaN

850 rows × 5 columns

We see above the predicted_label column; this is the label predicted by our previously trained model, and we also have the probabilities for these predictions. The true_label is a new human annotation of the correct label.

Metrics on our Test Data¶

As a start let’s see how our model performed on our new test data, i.e. how often its predictions matched the human annotation for that title.

print(
    classification_report(
        df_test.true_label,
        df_test.predicted_label,
    )
)
              precision    recall  f1-score   support

     fiction       0.96      0.72      0.82       296
 non_fiction       0.87      0.98      0.92       554

    accuracy                           0.89       850
   macro avg       0.91      0.85      0.87       850
weighted avg       0.90      0.89      0.89       850

We can notice a few different things here:

  ‱ the f1-score is lower than it was when we looked at the validation data in our previous notebook.

  • there is a fairly big difference in performance between fiction and non-fiction

  • the distribution of our labels is still uneven

This is already useful for us. We might decide to try to annotate more fiction examples to improve performance, or to train our model differently to account for the uneven distribution of labels. Before we jump to doing any of these things, let’s see if we can understand where our model is going wrong.

đŸ€Ą Where (and why) does our model suck?¶

Our single-number metrics can be useful for getting an overview of performance on our data as a whole, but we sometimes also want to dig into subsets of our data to see whether there are particular facets the model struggles with.

How Does Confidence Vary?¶

One thing we might want to look at is whether the confidence of the predictions has any relationship with errors, i.e. when the model is confident, is it wrong less often than when it is less confident?

Let’s create a new column argmax which stores the probability of the predicted label (as a reminder, this is the maximum of the predicted probabilities for the two possible labels).

df_test["argmax"] = df_test[["fiction_probs", "non_fiction_probs"]].max(axis=1)

We can quickly take a look at the distribution of the confidence for the predicted label

df_test["argmax"].describe()
count    850.000000
mean       0.912980
std        0.116743
min        0.502686
25%        0.904535
50%        0.966442
75%        0.982130
max        0.997301
Name: argmax, dtype: float64

We can see here that most of the predictions have above 90% confidence, and 50% of our predictions have above 96% confidence. This is expected because we are using a softmax function with our cross-entropy loss, which tends to push probabilities towards the extremes. The lowest predicted confidence is just above 50%, suggesting the model was pretty unsure about this example. To make this a bit easier to assess we’ll create a new column correct which records whether our model’s prediction matched the human annotation in our test data.

df_test["correct"] = df_test.true_label == df_test.predicted_label
df_test["correct"].head(3)
0     True
1     True
2    False
Name: correct, dtype: bool

We can quickly grab some examples where the model was not correct

df_test[df_test["correct"] == False][["title", "true_label"]]
title true_label
2 ['Colville of the Guards'] fiction
13 ['Metempsychosis. A poem, in two parts. By A. ... fiction
34 ['The Cunning-Man ... Originally written and c... fiction
55 ['Sheilah McLeod, etc'] fiction
63 ['PoppĂŠa'] fiction
... ... ...
989 ['Poems and Imitations of the British Poets, w... fiction
993 ['Ralph Royster Doyster, a comedy [in five act... fiction
995 ['Helga: a poem in seven Cantos. [With notes a... fiction
996 ['The Geological Observer'] non_fiction
997 ['Lays of Far Cathay and others. A collection ... fiction

91 rows × 2 columns

Looking at the Confidence of our Predictions¶

Our model (like most machine learning models) outputs ‘logits’: these are the ‘raw’ predictions. Often we then use a softmax or similar function to turn these into a probability distribution, and we take the argmax of these probabilities to choose a label, i.e. pick the label which the model is most ‘confident’ about.

We might want to use the values associated with these probabilities in some way to decide whether to accept a label. For example, if our model gives the following confidences:

  • fiction: 51%

  • non_fiction: 49%

We could take the argmax value and use ‘fiction’ as our label. It seems likely, however, that the model isn’t very sure here. In practice we tend not to get values this close very often, because of the properties of the softmax function, but we may still have some labels predicted much more confidently than others.
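
As a reminder of how we get from logits to these probabilities, here is a minimal sketch of the softmax function applied to some made-up logits for a single title:

import numpy as np

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

# Made-up logits in the order [fiction, non_fiction]
logits = np.array([0.1, 0.05])
probs = softmax(logits)
print(probs)           # roughly [0.51, 0.49]
print(probs.argmax())  # 0, i.e. 'fiction' would be the predicted label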

We’ll group our data by whether the prediction was correct or not and by the true label, and then look at the distribution of the argmax value (i.e. the probability of the label which was predicted).

df_test.groupby(["correct", "true_label"])["argmax"].describe()
count mean std min 25% 50% 75% max
correct true_label
False fiction 82.0 0.755182 0.148684 0.506073 0.612328 0.794254 0.891587 0.970296
non_fiction 9.0 0.747011 0.172843 0.537227 0.624419 0.707089 0.899563 0.989327
True fiction 214.0 0.869542 0.134010 0.502686 0.808385 0.924449 0.974263 0.996350
non_fiction 545.0 0.956519 0.060473 0.541530 0.957278 0.974239 0.983971 0.997301

Let’s start with the mean argmax value. When the model correctly predicted a fiction title, the mean argmax value was ~0.87; when it got a fiction title wrong (i.e. predicted ‘non_fiction’), the mean argmax value was ~0.76.

For non-fiction titles, the mean argmax value was ~0.96 when the model was correct and ~0.75 when it was wrong. In other words, the model tends to be less confident when it is making a mistake.

This possibly suggests that we might want to set a ‘threshold’ for when we accept the model’s predictions for each label. There is a trade-off here: if we set the threshold higher, there will be more titles where the model’s suggestions are ignored; if we set it too low, we will accept more mistakes. Let’s see what happens to our model’s performance if we set some thresholds based on the above.

df_subset = df_test.loc[
    ((df_test.predicted_label == "fiction") & (df_test.fiction_probs > 0.90))
    | ((df_test.predicted_label == "non_fiction") & (df_test.non_fiction_probs > 0.97))
]
print(
    classification_report(
        df_subset.true_label,
        df_subset.predicted_label,
    )
)
              precision    recall  f1-score   support

     fiction       0.98      0.99      0.99       125
 non_fiction       1.00      0.99      1.00       326

    accuracy                           0.99       451
   macro avg       0.99      0.99      0.99       451
weighted avg       0.99      0.99      0.99       451

We can see that if we pick these higher thresholds, our model does much better. This comes at the expense of coverage. Working out whether you prefer an accurate model or a model that labels all of your data depends in large part on how you are using your model’s predictions. You may decide, for example, to get humans to annotate the examples your model struggled with. If you are working at a very large scale and the alternative is no labels for genre at all, you might still prefer a noisy label to no label.
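
One way of making this trade-off concrete is to sweep a range of thresholds and report coverage (how much of the data we keep) alongside accuracy on what we keep. A rough sketch, using a single threshold on the argmax column rather than the per-label thresholds above:

# Sweep a single confidence threshold and report coverage vs accuracy
for threshold in [0.5, 0.7, 0.9, 0.95, 0.99]:
    kept = df_test[df_test["argmax"] > threshold]
    print(
        f"threshold={threshold:.2f}",
        f"coverage={len(kept) / len(df_test):.2%}",
        f"accuracy={kept['correct'].mean():.2%}",
    )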

Other Factors That Impact Model Performance?¶

There may be other types of title where our model is wrong more often. We can draw on our intuitions to some extent here, but one thing that can be very helpful is to have annotated some of the data yourself. If you completely outsource the process of creating training data you might not have seen enough examples to have any sense of what the data looks like. In the process of creating the test data we noticed that some titles are very short and others very long. If a title is very short it carries less information, and it’s possible our model will struggle with it. Let’s see if this is the case.

df_test["title_length"] = df_test.title.str.len()
df_test.groupby(["correct", "true_label"])["title_length"].describe()
count mean std min 25% 50% 75% max
correct true_label
False fiction 82.0 92.865854 59.902341 10.0 54.75 80.5 119.0 301.0
non_fiction 9.0 50.111111 22.273552 27.0 29.00 44.0 64.0 93.0
True fiction 214.0 50.714953 29.242680 15.0 31.00 41.0 61.0 179.0
non_fiction 545.0 118.253211 73.611723 9.0 66.00 102.0 144.0 499.0

We can see here that fiction titles the model got wrong have a mean length of ~93 characters, whereas fiction titles it got right have a mean length of ~51. This suggests that our model may get thrown off by longer fiction book titles. We see the opposite with non-fiction, where our model tends to correctly predict non-fiction titles when they are longer (a mean length of ~118 when correct versus ~50 when wrong).
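
One way to look at this a bit more closely is to bin the titles by length and see how often the model is correct within each bin. A minimal sketch:

# Bin title lengths and compute the proportion of correct predictions
# per (true label, length bin) combination
df_test["length_bin"] = pd.cut(df_test["title_length"], bins=[0, 50, 100, 200, 500])
print(df_test.groupby(["true_label", "length_bin"])["correct"].mean().round(2))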

Looking at Examples¶

We can look at some examples of titles to get a better intuition for how our model is behaving. Let’s start with incorrect predictions where the model was confident; we’ll look at the most confident wrong examples:

for i, row in enumerate(
    df_test[df_test["correct"] == False]
    .sort_values("argmax", ascending=False)
    .itertuples()
):
    if i > 10:
        break
    print(row.predicted_label, row.argmax)
    print(row.title)
    print("----" * 20)
fiction 0.98932695
['Janet Delille. [A novel.]']
--------------------------------------------------------------------------------
non_fiction 0.97029626
['Royston Winter Recreations in the days of Queen Anne: translated into Spenserian stanza by the Rev. W. W. Harvey, ... from a contemporary Latin poem [entitled “Bruma,” etc.] ... With illustrations by H. J. Thurnall, and notes on Royston Memorabilia by the Royston Publisher [J. Warren]']
--------------------------------------------------------------------------------
non_fiction 0.96185756
['Legends of the Afghan Countries. In verse. With various pieces, original and translated']
--------------------------------------------------------------------------------
non_fiction 0.9605836
['In Cornwall, and Across the Sea [in verse]; with poems written in Devonshire, etc']
--------------------------------------------------------------------------------
non_fiction 0.9512329
['[Pierre le Grand, comédie en quatre actes, et en prose mêlée de chants, etc.]']
--------------------------------------------------------------------------------
non_fiction 0.9509604
['The poetical works of Oliver Goldsmith. With remarks, attempting to ascertain ... the actual scene of the deserted village; and illustrative engravings, by Mr. Alkin, from drawings taken upon the spot. By the Rev. R. H. Newell', 'Collections. III. Poems']
--------------------------------------------------------------------------------
non_fiction 0.9478657
['The City of Dreadful Night. And other stories. [“With the Calcutta Police” from “The City of Dreadful Night” and other selected tales. With “Character Sketch of Rudyard Kipling. By Rev. C. O. Day.”]', 'Selections']
--------------------------------------------------------------------------------
fiction 0.9427181
['The Lost Manuscripts of a Blue Jacket']
--------------------------------------------------------------------------------
non_fiction 0.93851966
['[Envy at arms! or, Caloric alarming the Church. [A satire in verse by - Thom? occasioned by the opposition of the Ministers of Edinburgh to the election of J. Leslie to the professorship of mathematics in the University of Edinburgh.]]']
--------------------------------------------------------------------------------
non_fiction 0.9349319
['Select British Poets, or new elegant extracts from Chaucer to the present time, with critical remarks. By W. Hazlitt', 'Single Works']
--------------------------------------------------------------------------------
non_fiction 0.9300464
['The Select Poetical Works of Sir Walter Scott; comprising The Lay of the Last Minstrel; Marmion; The Lady of the Lake; ballads, lyrical pieces, etc. MS. notes', 'Collections of Works. Smaller Collections of Poems']
--------------------------------------------------------------------------------

We can do the same for examples where our model was correct but not confident:

for i, row in enumerate(
    df_test[df_test["correct"] == True]
    .sort_values("argmax", ascending=True)
    .itertuples()
):
    if i > 10:
        break
    print(row.predicted_label, row.argmax)
    print(row.title)
    print("----" * 20)
fiction 0.50268555
['A Legend of Fyvie Castle. By K. G. [i.e. Catherine J. B. Gordon.]']
--------------------------------------------------------------------------------
fiction 0.50299066
['Heera, the Maid of the Dekhan. A poem, in five cantos']
--------------------------------------------------------------------------------
fiction 0.5074
['Rambling Rhymes']
--------------------------------------------------------------------------------
fiction 0.5121436
['The Small House at Allington. With eighteen illustrations by J. E. Millais', 'Single Works']
--------------------------------------------------------------------------------
fiction 0.5274991
['Abd-el-Kader: a poem, in six cantos']
--------------------------------------------------------------------------------
fiction 0.528868
['A Cure for the Heart-ache; a comedy, in five acts [in prose], etc']
--------------------------------------------------------------------------------
fiction 0.53044933
["Auld Lang Syne. By the author of “The Wreck of the 'Grosvenor”' [i.e. William Clark Russell]"]
--------------------------------------------------------------------------------
non_fiction 0.5415301
['Wreck of the “London.” [With illustrations.]']
--------------------------------------------------------------------------------
fiction 0.5452555
['The State-farce: a lyrick. Written at Clermont, and inscribed to His Grace the Duke of Newcastle']
--------------------------------------------------------------------------------
fiction 0.5558798
['Miscellaneous Poems']
--------------------------------------------------------------------------------
fiction 0.5597714
["[The Actress's Ways and Means to industriously raise the Wind! Containing the moral and entertaining poetical effusions of Mrs. R. Beverley.]"]
--------------------------------------------------------------------------------

Things we Might Notice¶

  ‱ There might be some examples which are incorrectly labeled, or which are genuinely very hard to classify

  ‱ There might be particular title phrases, such as [With Illustrations], that throw our model off

  • Long titles with lots of proper nouns might confuse our model?

  • ???

We would need to do some more digging to see if there are other patterns in the errors of our model.
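
As one example of such digging, a rough check of whether particular phrases (like those flagged above) are associated with more errors might look like this. The phrases here are just guesses based on the examples we printed:

# Compare the error rate for titles containing a phrase with the overall rate
phrases = ["illustrations", "poem", "verse", "comedy"]
print("overall error rate:", round(1 - df_test["correct"].mean(), 3))
for phrase in phrases:
    contains = df_test["title"].str.contains(phrase, case=False, na=False)
    error_rate = 1 - df_test.loc[contains, "correct"].mean()
    print(f"'{phrase}': {contains.sum()} titles, error rate {error_rate:.3f}")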

Conclusion¶

We have seen that if we bump up the threshold for the model’s confidence, its performance improves quite a bit. This could be a pragmatic way of dealing with the outputs of this model. It’s not always possible to train a perfect model, but if we have a model that does quite well and we are willing to use only some of its outputs, we might still be able to do a lot more than we could without machine learning. This can be particularly useful when we use the outputs of external models that other people have trained: we might have less ability to control the training process, but we can still set thresholds for when to accept predictions.

We’ll now move on to trying to improve our model!

Note

The main things we tried to show in this notebook:

  • It is very helpful to have true test data to get a proper sense of how well your model will perform on unseen data

  ‱ The model’s performance was uneven across our labels

  • There might be some types of title that our model struggles with more

  ‱ We can choose a threshold for when we accept our model’s predictions; this can often improve the performance of our model