Assessing Where our Model is Going Wrong
The model we trained in the previous notebooks had a fairly high F1 score when we tested it on our validation set; however, when we tested it on a new "test" dataset, performance was lower. In this notebook we'll dig into this in a little more detail.
Addressing Worries about Domain Shift: Creating Test Data
We have previously discussed domain drift, but as a reminder, this is broadly when the data we make predictions against is different from, or becomes different from, the training data. As an example, a model used to predict ice-cream sales whose training data came from the summer months is likely to do less well (or badly) at predicting ice-cream sales in the winter.
In our case the divergence might be more subtle. One potential issue is that the sampling used to generate our initial training data might not have been completely random. This could lead to subtle differences between our training data and the data we will be predicting against.
Potential Areas of Difference
There are a number of ways in which our training data might not match the full target dataset we want to predict on, including:
Language
The full British Library Microsoft Books corpus contains a range of different languages. As the Zooniverse annotation task was done by speakers of a subset of those languages, there might be an over- or under-representation of some languages in the training data compared to the full data.
Dates
The full dataset covers a broad time period, with titles from the beginning of this period (the 1500s) likely differing quite a bit from those at the end (the 1800s). If the training data over-represents one period, we might expect the model to struggle with titles from a different time period.
Difficult titles skipped
Genre is not always clear cut and can be difficult to determine in some cases. A book could contain a mixture that could reasonably be labeled as either fiction or non-fiction: for example, a book of poems with significant commentary about the poems. If annotators skipped these examples, the training data won't include as many "hard" examples for the model to learn from. We could also be generous and say that if a human expert annotator struggles to classify a book's genre, a machine learning model is likely to struggle too.
Creating a New Randomly Sampled Test Dataset
After developing a model we wanted to test how well it performed on completely unseen data that was randomly sampled from the full BL books corpus. To do this we decided to do some internal verification of the results of the previously trained model. We did this by:
sampling randomly from the full books corpus
verifying the model's predictions
All of this was done using a crude Google Sheet, with a couple of people working through the data. Since the aim was to cover as many titles as possible, there was sometimes some work required to translate titles, or to look at the digitized copy of the book in the BL catalogue for verification. This resulted in a new "test" dataset which we used to evaluate our model.
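For reference, the sampling step itself can be done with a single pandas call. Below is a minimal sketch; the file names and sample size here are placeholders rather than the exact details we used.
import pandas as pd

# Placeholder path: metadata covering the full books corpus
df_full = pd.read_csv("data/full_corpus_metadata.csv")

# Draw a random sample of titles; a fixed random_state keeps the sample reproducible
df_sample = df_full.sample(n=1000, random_state=42)
df_sample.to_csv("data/test_sample.csv", index=False)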
This section will explore the results with this test data. First we import some packages
import pandas as pd
import numpy as np
import re
from sklearn.metrics import f1_score, classification_report, accuracy_score
We load our initial input data
df_training = pd.read_csv("data/annotations.csv")
and our new test dataset.
df_test = pd.read_csv("data/test_errors.csv")
len(df_test)
999
We do a bit of tidying to make sure the format of the labels matches…
df_test.predicted_label.unique()
array(['Non-fiction', 'Fiction'], dtype=object)
df_test.true_label.unique()
array(['non_fiction', 'fiction', 'both', nan], dtype=object)
# Keep only rows where the human annotator gave a definite fiction/non-fiction label
df_test = df_test[df_test["true_label"].isin(["non_fiction", "fiction"])].copy()
# Normalize the label strings so predicted and true labels share the same format
df_test.true_label = df_test.true_label.str.lower()
df_test.predicted_label = df_test.predicted_label.str.lower()
df_test.predicted_label = df_test.predicted_label.str.replace("-", "_")
Our Test Data
Our new test dataframe includes the following columns which are relevant here:
df_test[
    [
        "predicted_label",
        "fiction_probs",
        "non_fiction_probs",
        "true_label",
        "free text comment",
    ]
]
 | predicted_label | fiction_probs | non_fiction_probs | true_label | free text comment
---|---|---|---|---|---
0 | non_fiction | 0.020685 | 0.979315 | non_fiction | NaN |
1 | non_fiction | 0.037538 | 0.962462 | non_fiction | NaN |
2 | non_fiction | 0.389111 | 0.610889 | fiction | NaN |
3 | non_fiction | 0.050611 | 0.949388 | non_fiction | NaN |
4 | non_fiction | 0.087175 | 0.912825 | non_fiction | NaN |
... | ... | ... | ... | ... | ... |
994 | fiction | 0.927855 | 0.072145 | fiction | NaN |
995 | non_fiction | 0.354936 | 0.645064 | fiction | NaN |
996 | fiction | 0.537227 | 0.462773 | non_fiction | NaN |
997 | non_fiction | 0.087564 | 0.912436 | fiction | NaN |
998 | non_fiction | 0.068475 | 0.931525 | non_fiction | NaN |
850 rows × 5 columns
We see above the predicted_label column; this is the label predicted by our previously trained model. We also have the probabilities for these predictions. The true_label is a new human annotation of the correct label.
Metrics on our Test Data
As a start, let's see how our model performed on our new test data, i.e. how often its predictions matched the human annotation for each title.
print(
    classification_report(
        df_test.true_label,
        df_test.predicted_label,
    )
)
precision recall f1-score support
fiction 0.96 0.72 0.82 296
non_fiction 0.87 0.98 0.92 554
accuracy 0.89 850
macro avg 0.91 0.85 0.87 850
weighted avg 0.90 0.89 0.89 850
We can notice a few different things here:
the f1-score is lower than it was when we looked at the validation data in our previous notebook
there is a fairly big difference in performance between fiction and non-fiction
the distribution of our labels is still uneven (see the quick check below)
This is already useful for us. We might decide to annotate more fiction examples to try to improve performance, or to train our model differently to account for the uneven distribution of labels. Before we jump to doing any of these things, let's see if we can understand where our model is going wrong.
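As a quick sanity check on that last point, we can count the human labels in the test set (a one-line sketch using the dataframe defined above):
# How uneven are the true labels in our test data?
df_test["true_label"].value_counts()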
Where (and why) does our model suck?
Our single-number metrics are useful for getting an overview of performance on the data as a whole, but we sometimes also want to dig into subsets of our data to see whether there are particular facets the model struggles with.
How Does Confidence Vary?
One thing we might want to look at is whether the confidence of the predictions has any relationship to errors, i.e. when the model is confident, is it wrong less often than when it is unsure?
Let's create a new column argmax which stores the probability of the predicted label (as a reminder, this is the max of the two label probabilities).
df_test["argmax"] = df_test[["fiction_probs", "non_fiction_probs"]].max(axis=1)
We can quickly take a look at the distribution of the confidence for the predicted label
df_test["argmax"].describe()
count 850.000000
mean 0.912980
std 0.116743
min 0.502686
25% 0.904535
50% 0.966442
75% 0.982130
max 0.997301
Name: argmax, dtype: float64
We can see here that most of the predictions have above 90% confidence, and half of our predictions have above 96% confidence. This is expected because we are using a softmax function alongside our cross entropy loss, which tends to push probabilities towards the extremes. The lowest predicted confidence is just above 50%, suggesting the model was very unsure about that example. To make this easier to assess, we'll create a new column correct which records whether our model's prediction matched the human annotation in our test data.
df_test["correct"] = df_test.true_label == df_test.predicted_label
df_test["correct"].head(3)
0 True
1 True
2 False
Name: correct, dtype: bool
We can quickly grab some examples where the model was not correct
df_test[df_test["correct"] == False][["title", "true_label"]]
 | title | true_label
---|---|---
2 | ['Colville of the Guards'] | fiction |
13 | ['Metempsychosis. A poem, in two parts. By A. ... | fiction |
34 | ['The Cunning-Man ... Originally written and c... | fiction |
55 | ['Sheilah McLeod, etc'] | fiction |
63 | ['PoppĂŠa'] | fiction |
... | ... | ... |
989 | ['Poems and Imitations of the British Poets, w... | fiction |
993 | ['Ralph Royster Doyster, a comedy [in five act... | fiction |
995 | ['Helga: a poem in seven Cantos. [With notes a... | fiction |
996 | ['The Geological Observer'] | non_fiction |
997 | ['Lays of Far Cathay and others. A collection ... | fiction |
91 rows × 2 columns
Looking at the Confidence of our Predictions
Our model (like most machine learning models) outputs "logits"; these are the "raw" predictions. We then typically use a softmax or similar function to turn these into a probability distribution, and take the argmax of these probabilities to choose a label, i.e. pick the label the model is most "confident" about.
We might want to use the values associated with these probabilities to decide whether to accept a label. For example, if our model gives the following confidences:
fiction: 51%
non_fiction: 49%
We could take the argmax value and use "fiction" as our label. It seems likely, however, that the model isn't very sure here. In practice we tend not to see values this close very often, because of the properties of the softmax function, but some labels may still be predicted much more confidently than others.
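To make the logits-to-probabilities step concrete, here is a minimal sketch with made-up logit values (the numbers are purely illustrative, not taken from our model):
import numpy as np

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

logits = np.array([0.04, 0.0])  # made-up raw outputs for [fiction, non_fiction]
probs = softmax(logits)         # roughly [0.51, 0.49]
label = ["fiction", "non_fiction"][int(np.argmax(probs))]
print(probs, label)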
We'll group our data by whether the prediction was correct or not and by the true label, and then look at the distribution of the argmax value (i.e. the probability of the label which was predicted).
df_test.groupby(["correct", "true_label"])["argmax"].describe()
correct | true_label | count | mean | std | min | 25% | 50% | 75% | max
---|---|---|---|---|---|---|---|---|---
False | fiction | 82.0 | 0.755182 | 0.148684 | 0.506073 | 0.612328 | 0.794254 | 0.891587 | 0.970296
False | non_fiction | 9.0 | 0.747011 | 0.172843 | 0.537227 | 0.624419 | 0.707089 | 0.899563 | 0.989327
True | fiction | 214.0 | 0.869542 | 0.134010 | 0.502686 | 0.808385 | 0.924449 | 0.974263 | 0.996350
True | non_fiction | 545.0 | 0.956519 | 0.060473 | 0.541530 | 0.957278 | 0.974239 | 0.983971 | 0.997301
Let's start with the mean argmax value. For titles whose true label is fiction, the mean confidence was 0.87 when the model was correct, but only around 0.76 when it was wrong (i.e. when it predicted "non_fiction" for a fiction title).
For titles whose true label is non_fiction, the mean confidence was 0.96 when the model was correct, and around 0.75 when it was wrong (i.e. when it predicted "fiction" for a non-fiction title). In other words, the model's incorrect predictions tend to come with noticeably lower confidence.
This suggests that we might want to set a "threshold" for when we accept the model's predictions for each label. There is a trade-off here: if we set the threshold higher, there will be more titles where the model's suggestion is ignored; if we set it too low, we will accept more mistakes. Let's see what happens to our model's performance if we set some thresholds based on the numbers above.
# Keep only predictions above a per-label confidence threshold
df_subset = df_test.loc[
    ((df_test.predicted_label == "fiction") & (df_test.fiction_probs > 0.90))
    | ((df_test.predicted_label == "non_fiction") & (df_test.non_fiction_probs > 0.97))
]
print(
    classification_report(
        df_subset.true_label,
        df_subset.predicted_label,
    )
)
precision recall f1-score support
fiction 0.98 0.99 0.99 125
non_fiction 1.00 0.99 1.00 326
accuracy 0.99 451
macro avg 0.99 0.99 0.99 451
weighted avg 0.99 0.99 0.99 451
We can see that if we pick these higher thresholds, our model does much better. This comes at the expense of coverage: fewer titles get a label at all. Whether you prefer a more accurate model or a model that labels all of your data depends in large part on how you are using the predictions. You may decide, for example, to get humans to annotate the examples the model struggled with. If you are working at a very large scale and the alternative is no genre labels at all, you might still prefer a noisy label to no label.
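One way to make that trade-off explicit is to report coverage alongside the metrics, i.e. what fraction of the test set still receives a label once the thresholds are applied. A small sketch using the dataframes defined above:
# Fraction of the test set that the thresholds still cover
coverage = len(df_subset) / len(df_test)
print(f"Coverage: {coverage:.1%} ({len(df_subset)} of {len(df_test)} titles)")

# Accuracy on the covered subset compared to the full test set
print(f"Accuracy (covered subset): {(df_subset.true_label == df_subset.predicted_label).mean():.3f}")
print(f"Accuracy (all test data):  {df_test['correct'].mean():.3f}")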
Other Factors That Impact Model Performance?
There may be other types of title where our model is wrong more often. We can draw on our intuitions to some extent here, but one thing that can be very helpful is to have annotated some of the data yourself; if you completely outsource the process of creating training data, you might not have seen enough examples to have a sense of what it looks like. In the process of creating the test data we noticed that some titles are very short and others very long. A very short title carries less information, so it's possible our model will struggle with it. Let's see if this is the case.
df_test["title_length"] = df_test.title.str.len()
df_test.groupby(["correct", "true_label"])["title_length"].describe()
correct | true_label | count | mean | std | min | 25% | 50% | 75% | max
---|---|---|---|---|---|---|---|---|---
False | fiction | 82.0 | 92.865854 | 59.902341 | 10.0 | 54.75 | 80.5 | 119.0 | 301.0
False | non_fiction | 9.0 | 50.111111 | 22.273552 | 27.0 | 29.00 | 44.0 | 64.0 | 93.0
True | fiction | 214.0 | 50.714953 | 29.242680 | 15.0 | 31.00 | 41.0 | 61.0 | 179.0
True | non_fiction | 545.0 | 118.253211 | 73.611723 | 9.0 | 66.00 | 102.0 | 144.0 | 499.0
We can see that for titles whose true label is fiction, the mean title length is ~93 characters when the model was incorrect, compared to ~51 when it was correct. This suggests that our model may get thrown off by longer fiction titles. We see the opposite for non-fiction, where the model tends to get the longer titles right (mean length ~118 when correct versus ~50 when wrong).
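To dig into this further we could bin the titles by length and look at accuracy per bin, rather than relying on the mean alone. A sketch (quartile bins are an arbitrary choice here):
# Accuracy and number of titles within each title-length quartile, split by true label
df_test["length_bin"] = pd.qcut(df_test["title_length"], q=4)
df_test.groupby(["true_label", "length_bin"])["correct"].agg(["mean", "count"])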
Looking at Examples
We can look at some example titles to get a better intuition for how our model is behaving. Let's start with incorrect predictions where the model was confident, working through the most confidently wrong examples:
# Walk through the incorrect predictions, most confident first
for i, row in enumerate(
    df_test[df_test["correct"] == False]
    .sort_values("argmax", ascending=False)
    .itertuples()
):
    if i > 10:
        break
    print(row.predicted_label, row.argmax)
    print(row.title)
    print("----" * 20)
fiction 0.98932695
['Janet Delille. [A novel.]']
--------------------------------------------------------------------------------
non_fiction 0.97029626
['Royston Winter Recreations in the days of Queen Anne: translated into Spenserian stanza by the Rev. W. W. Harvey, ... from a contemporary Latin poem [entitled “Bruma,” etc.] ... With illustrations by H. J. Thurnall, and notes on Royston Memorabilia by the Royston Publisher [J. Warren]']
--------------------------------------------------------------------------------
non_fiction 0.96185756
['Legends of the Afghan Countries. In verse. With various pieces, original and translated']
--------------------------------------------------------------------------------
non_fiction 0.9605836
['In Cornwall, and Across the Sea [in verse]; with poems written in Devonshire, etc']
--------------------------------------------------------------------------------
non_fiction 0.9512329
['[Pierre le Grand, comédie en quatre actes, et en prose mêlée de chants, etc.]']
--------------------------------------------------------------------------------
non_fiction 0.9509604
['The poetical works of Oliver Goldsmith. With remarks, attempting to ascertain ... the actual scene of the deserted village; and illustrative engravings, by Mr. Alkin, from drawings taken upon the spot. By the Rev. R. H. Newell', 'Collections. III. Poems']
--------------------------------------------------------------------------------
non_fiction 0.9478657
['The City of Dreadful Night. And other stories. [“With the Calcutta Police” from “The City of Dreadful Night” and other selected tales. With “Character Sketch of Rudyard Kipling. By Rev. C. O. Day.”]', 'Selections']
--------------------------------------------------------------------------------
fiction 0.9427181
['The Lost Manuscripts of a Blue Jacket']
--------------------------------------------------------------------------------
non_fiction 0.93851966
['[Envy at arms! or, Caloric alarming the Church. [A satire in verse by - Thom? occasioned by the opposition of the Ministers of Edinburgh to the election of J. Leslie to the professorship of mathematics in the University of Edinburgh.]]']
--------------------------------------------------------------------------------
non_fiction 0.9349319
['Select British Poets, or new elegant extracts from Chaucer to the present time, with critical remarks. By W. Hazlitt', 'Single Works']
--------------------------------------------------------------------------------
non_fiction 0.9300464
['The Select Poetical Works of Sir Walter Scott; comprising The Lay of the Last Minstrel; Marmion; The Lady of the Lake; ballads, lyrical pieces, etc. MS. notes', 'Collections of Works. Smaller Collections of Poems']
--------------------------------------------------------------------------------
We can do the same for examples where our model was correct but not confident:
# Do the same for correct predictions, least confident first
for i, row in enumerate(
    df_test[df_test["correct"] == True]
    .sort_values("argmax", ascending=True)
    .itertuples()
):
    if i > 10:
        break
    print(row.predicted_label, row.argmax)
    print(row.title)
    print("----" * 20)
fiction 0.50268555
['A Legend of Fyvie Castle. By K. G. [i.e. Catherine J. B. Gordon.]']
--------------------------------------------------------------------------------
fiction 0.50299066
['Heera, the Maid of the Dekhan. A poem, in five cantos']
--------------------------------------------------------------------------------
fiction 0.5074
['Rambling Rhymes']
--------------------------------------------------------------------------------
fiction 0.5121436
['The Small House at Allington. With eighteen illustrations by J. E. Millais', 'Single Works']
--------------------------------------------------------------------------------
fiction 0.5274991
['Abd-el-Kader: a poem, in six cantos']
--------------------------------------------------------------------------------
fiction 0.528868
['A Cure for the Heart-ache; a comedy, in five acts [in prose], etc']
--------------------------------------------------------------------------------
fiction 0.53044933
["Auld Lang Syne. By the author of âThe Wreck of the 'Grosvenorâ' [i.e. William Clark Russell]"]
--------------------------------------------------------------------------------
non_fiction 0.5415301
['Wreck of the “London.” [With illustrations.]']
--------------------------------------------------------------------------------
fiction 0.5452555
['The State-farce: a lyrick. Written at Clermont, and inscribed to His Grace the Duke of Newcastle']
--------------------------------------------------------------------------------
fiction 0.5558798
['Miscellaneous Poems']
--------------------------------------------------------------------------------
fiction 0.5597714
["[The Actress's Ways and Means to industriously raise the Wind! Containing the moral and entertaining poetical effusions of Mrs. R. Beverley.]"]
--------------------------------------------------------------------------------
Things we Might Notice
There might be some examples which are incorrectly labeled or are very hard to tell
There might be particular title phrases, such as [With Illustrations], that might throw our model off
Long titles with lots of proper nouns might confuse our model?
???
We would need to do some more digging to see if there are other patterns in the errors of our model.
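As one example of that digging, we could check whether titles containing a bracketed cataloguer's note (like [With illustrations]) are wrong more often than the rest. A rough sketch:
# The title field is a stringified list, so every value starts with "[";
# titles with an inner note such as "[With illustrations.]" contain a second
# opening bracket, which we can use as a rough flag.
has_bracketed_note = df_test["title"].str.count(r"\[") > 1

# Compare how often the model is correct with and without such notes
df_test.groupby(has_bracketed_note)["correct"].agg(["mean", "count"])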
Conclusion
We have seen that if we raise the threshold for the model's confidence, its performance improves quite a bit. This could be a pragmatic way of dealing with the outputs of this model. It's not always possible to train a perfect model, but if we have a model that does quite well and we only use some of its outputs, we might still be able to do a lot more than we could without machine learning. This is particularly useful when we use the outputs of external models that other people have trained: we might have less ability to control the training process, but we can still set thresholds for when to accept predictions.
We'll now move on to trying to improve our model!
Note
The main things we tried to show in this notebook:
It is very helpful to have true test data to get a proper sense of how well your model will perform on unseen data
The model's performance was uneven across our labels
There might be some types of title that our model struggles with more than others
We can decide on a threshold for when we accept our model's predictions; this can often improve the effective performance of our model