In the previous notebook, we created a sample of this data and annotated it. Since we want to create a classifier to help us understand a historical phenomenon, i.e. the use of visuals in newspaper advertising, we need to check how robust our model is across the period covered by our data.
Since we have quite an extensive date range (1850-1950), we want to make sure our model doesn't perform significantly worse for some years. If it does, we may draw mistaken conclusions about the historical trends we're trying to detect with our classifier.
First we start with some imports. From nnanno we import one function
from nnanno.annotate import check_download_df_match
We import fastai and some callbacks for our model training
from fastai.callback.all import *
from fastai.vision.all import *
We also import a few extra libraries to help with file paths, data handling, and plotting
from pathlib import Path
import matplotlib
import pandas as pd
matplotlib.style.use("seaborn-notebook")
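# (Optional) If you don't already have the sample images on disk, uncommenting the lines below will download them with nnanno: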
#from nnanno.annotate import *
#sampler = nnSampler()
#df = load_annotations_csv("https://github.com/Living-with-machines/nnanno_example_data/blob/main/annotations/results.csv")
#sampler.download_sample("data/images", original=True, df=df)
We'll load the JSON file containing our annotated data. If you need a reminder of where this came from, refer to the previous notebook in this series where this data was created.
json_file = list(Path("data/images/").rglob("*.json"))[0]
df = pd.read_json(json_file)
We can use a small function from nnanno to check that the number of images we have downloaded matches the number of rows in our DataFrame
check_download_df_match("data/images/", df)
We convert the pub_date column to a datetime type so we can work with the publication year
df.pub_date = pd.to_datetime(df.pub_date)
df.pub_date.dt.year
We can also store the year in a new column
df["year"] = df.pub_date.dt.year
Evaluating decade accuracy
To see if our model performs worse for some years, we'll train a model on data drawn from the whole period, then run predictions on a held-out subset and see whether particular years perform worse.
Some years may be 'harder' for the model regardless of the training data. For example, the newspapers for some years might have been subject to more damage. These biases might be systematic, or they might be random; trying the model out across all years gives us a way of roughly checking this.
First we import train_test_split from sklearn to help us prepare subsamples
from sklearn.model_selection import train_test_split
We now want to create a train and a validation sample. We pass in the stratify option to make sure the splits are balanced across years
train, valid_test = train_test_split(
df.copy(), test_size=600, stratify=df.pub_date.dt.year
)
We split valid_test into a 'validation' and a 'test' set
valid, test_df = train_test_split(
valid_test.copy(), test_size=400, stratify=valid_test.pub_date.dt.year
)
Check the sizes of these
len(train), len(valid), len(test_df)
We can confirm this is roughly even across years
train.pub_date.dt.year.value_counts()
and with valid
valid.pub_date.dt.year.value_counts()
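As an optional extra check (a small addition on top of the original workflow), we can quantify the spread of the per-year counts in the training set directly:
year_counts = train.pub_date.dt.year.value_counts()
year_counts.max() - year_counts.min()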
This looks good enough. Some years differ by one example, but that shouldn't make a big difference. Now we can get to the training part.
As with the rest of the examples, we'll use fastai since it helps us experiment quickly and focus on the questions we're interested in exploring in this case.
Since we're going to be training multiple models, it's helpful to define the metrics once as a variable that we can pass into our model. This means we can change all the metrics for the model by changing this one line.
METRICS = [F1Score(average="micro"), Precision(), Recall(), accuracy]
We now add an extra is_valid column to our train and valid DataFrames. This means we can then quickly pass the entire DataFrame to fastai
train["is_valid"] = False
valid["is_valid"] = True
df = pd.concat([train, valid])
df.reset_index(drop=True, inplace=True)
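As a quick sanity check (an optional extra), we can confirm the combined DataFrame contains the expected number of training and validation rows:
df["is_valid"].value_counts()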
One of the beautiful things about the fastai API is that it is layered, and we can quickly jump 'down' a layer to get more control when needed. One way we can do this is with the DataBlock API. You can read more about this API in the DataBlock tutorial.
A tl;dr summary of the DataBlock API: it allows us to define how we want to load our data by specifying various pipeline components. For example, get_y is used to say how the labels should be loaded, and splitter is used to define how we split our data.
full_data = DataBlock(
blocks=(ImageBlock, CategoryBlock),
splitter=ColSplitter(),
get_x=ColReader("download_image_path", pref="data/images/"),
get_y=ColReader("label"),
item_tfms=Resize(224),
)
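If the data doesn't load as expected, fastai's DataBlock.summary method walks through the pipeline on a sample and reports where it fails; it's worth knowing about as an optional debugging aid:
# full_data.summary(df)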
We can use this 'template' for loading our data with different DataFrames. Let's load all of this data and have a quick look to check everything looks okay. So far, we have only created a DataBlock; this is a 'framework' for loading our data. To do the actual loading, we'll use the dataloaders method
dls = full_data.dataloaders(df, bs=8, num_workers=0)
dls.show_batch()
Train
Now we'll create a model and train it using two callbacks. SaveModelCallback keeps track of our best model by monitoring f1_score; at the end of training it will load the best-performing model back in. EarlyStoppingCallback stops training after a set amount of patience (i.e. epochs during which there is no improvement).
learn = cnn_learner(
dls, resnet18, metrics=METRICS, cbs=SaveModelCallback(monitor="f1_score")
)
with learn.no_bar():
learn.fine_tune(10, cbs=EarlyStoppingCallback(patience=4, monitor="f1_score"))
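Once training has finished (and SaveModelCallback has restored the best checkpoint), we can optionally re-run the metrics on the validation set to confirm the final scores:
learn.validate()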
We can now take a look at our model's predictions. First we create a test DataLoader. We can do this using dls.test_dl and passing in our test_df
test_data = learn.dls.test_dl(test_df)
Using test_dl ensures that the same 'transforms' used during training are applied when loading our test data. It's beyond the scope of this notebook to fully cover transforms in fastai, but we can take a look at what they are doing in this example
test_data.transform
We can now use this test data to get some predictions using get_preds.
preds = learn.get_preds(dl=test_data, with_decoded=True)
If we take a look at the type returned, you'll see we get back a tuple of length 3
type(preds), len(preds)
If we look at the first element of the tuple (slicing the first few examples), you'll see we have a tensor with two columns. As you might guess, each column holds the probability for one of the possible labels.
preds[0][:5]
The second value, which would normally hold the target labels, is empty here (our test DataLoader was created without labels)
preds[1]
The last value is a tensor of 0s and 1s; these are the decoded predicted labels for each image.
preds[2][:5]
We now want to check whether these predictions are correct. You'll notice that the decoded predictions don't contain our actual label names. We can quickly convert them using a list comprehension. We can access the vocab from our learner to check the order of the labels
learn.dls.vocab
pred_str = ["text_only" if y == 0 else "visual" for y in preds[2].tolist()]
pred_str[:3]
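An equivalent, slightly more general approach (sketched here as an optional alternative) is to index into the vocab directly rather than hard-coding the label names:
pred_str = [learn.dls.vocab[i] for i in preds[2].tolist()]
pred_str[:3]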
Now we can see where the predicted label matches the true label
test_df["correct"] = test_df["label"] == pred_str
We get back a new column which contains True or False depending on whether the prediction was correct or not
test_df["correct"].head(3)
To make it easier to count these correct labels, we convert True and False to 1 and 0
test_df["correct"].replace({True: 1, False: 0}, inplace=True)
We might also be interested not only in whether a prediction was wrong but also in how confident that prediction was. We'll create a new column, argmax_confidence, which stores the maximum predicted probability for each prediction. We can combine this with our correct column to compare the probabilities for accurate and wrong predictions
test_df["argmax_confidence"] = preds[0].numpy().max(1)
For example, we can filter by correct predictions and then look at the distribution of the predicted probability
test_df[test_df["correct"] == 1]["argmax_confidence"].describe()
Since we're particularly interested in any potential difference in accuracy based on year, we can use a groupby and then look at the mean value of the correct column for each year
test_df.groupby([test_df.pub_date.dt.year])["correct"].agg("mean").sort_values()
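Since there are only a handful of test images for each year, it can help to see the number of examples alongside the mean accuracy (an extra aggregation, not in the original):
test_df.groupby([test_df.pub_date.dt.year])["correct"].agg(["mean", "count"])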
We can quickly create a plot which might make the data easier to read
fig = (
test_df.groupby([test_df.pub_date.dt.year])["correct"]
.agg("mean")
.sort_index()
.plot(kind="barh")
)
We can see that there is some difference in the performance on the test set for various years. This could be down to noise in the training process and data, so we can't draw any hard conclusions from this small test, but it might be something we want to explore further. Since we also have the probabilities, let's see how their distribution compares across years when the label is correct
ax = (
test_df[test_df["label"] == pred_str]
.groupby([test_df.pub_date.dt.year])["argmax_confidence"]
.plot.kde(use_index=True, legend=True, figsize=(20, 10))
)
We see here that the distribution is fairly consistent across the years. We can also take a look at the distribution of confidence where the model was not correct
ax = (
test_df[test_df["label"] != pred_str]
.groupby([test_df.pub_date.dt.year])["argmax_confidence"]
.plot.kde(use_index=True, legend=True, figsize=(20, 10))
)
We can see that the argmax confidence distribution is much more varied when the model made wrong predictions. Let's check how many predictions the model got wrong in total on the test set
len(test_df[test_df["label"] != pred_str])
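To see how those errors are spread over time (another optional check), we can count the wrong predictions per year:
test_df[test_df["correct"] == 0].pub_date.dt.year.value_counts().sort_index()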
This is a pretty small number in total, so once we have done our groupby, looking at the per-year distributions is probably not particularly meaningful. Let's take a look across all years instead
ax = (
test_df[test_df["label"] != pred_str]["argmax_confidence"]
.plot.kde(use_index=True, legend=True, figsize=(20, 10))
)
This notebook has offered a very rough exploration of this topic, but hopefully it shows some of the potential areas of investigation when working with historical images. In the next notebook, we'll move on to model training in more detail.