import fastai
from fastai.vision.all import *
from fastcore.all import *

Training a model

Since we will use fastai for inference, we'll quickly train a model we can use to test the inference functionality. This model is only intended to help develop that functionality, so we won't worry about its performance.

dls = ImageDataLoaders.from_csv(
    "../ph/ads/",
    "ads_upsampled.csv",
    folder="images",
    fn_col="file",
    label_col="label",
    item_tfms=Resize(64, ResizeMethod.Squish),
    num_workers=0,
)
dls.show_batch()
learn = cnn_learner(dls, resnet18, metrics=F1Score())
learn.fine_tune(1)
epoch  train_loss  valid_loss  f1_score  time
0      0.962725    0.586650    0.769231  00:16

epoch  train_loss  valid_loss  f1_score  time
0      0.534687    0.473433    0.797386  00:17

Inference helpers

The next few functions are 'helper' functions used for doing inference with the Newspaper Navigator data.

Missing images

Because we are dealing with images requested via the web, we have to deal with the occasional hiccup: an image not being returned from an IIIF request, a network issue, etc. As a reminder of what load_url_image does:

doc(load_url_image)

load_url_image[source]

load_url_image(url:str, mode='RGB')

Attempts to load an image from url, returns None if the request times out or there is no image at url


As you can see above, load_url_image will sometimes return None. When we're running inference this can cause a problem, because we want to create batches of images to speed up inference, and we don't want to include Nones in a batch to predict on since fastai/PyTorch won't know what to do with them.

To get around this, we create a function which filters a batch of images and replaces None with a fake image. This function also returns the indexes of the items that were originally None. We later use this index to replace any predictions made for dummy images with np.nan.

We use this function to deal with Nones appearing in our image batches before passing them to a fastai learner.
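The implementation isn't reproduced here, but a minimal sketch of this kind of helper might look like the following. The function name, imports, and the blank dummy image are illustrative assumptions; the actual function used below is nnanno's _filter_replace_none_image.

import numpy as np
from fastcore.foundation import L
from fastai.vision.all import PILImage

def filter_replace_none_image(images):
    """Replace `None` entries with a dummy image and return the indexes that were `None`"""
    none_index = L(i for i, im in enumerate(images) if im is None)
    # a blank RGB image the model can consume; its predictions are later overwritten with np.nan
    dummy = PILImage.create(np.zeros((64, 64, 3), dtype=np.uint8))
    batch = [dummy if im is None else im for im in images]
    return batch, none_index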

Note

This could be a sub-optimal approach to this issue, but since we don't encounter too many missing images and the overhead is fairly low, it seems like an acceptable solution and allows us to work in batches rather than with a single image at a time. This is especially useful when we have a GPU available.

Notes on fastcore

You may have seen that the results are wrapped in L. This comes from fastcore and is a class which adds some extra bells and whistles to a standard Python list; see the fastcore docs for more details. In this case we can use it to grab the indexes of images which are None. Other nice features include displaying nicely in notebooks:

L([1, 2, 3, 4])
(#4) [1,2,3,4]

We get a handy #4 count indicating the length of the sequence. This is surprisingly useful, especially for someone as lazy as me.
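L also makes it easy to grab the indexes of items matching a condition, which is exactly the kind of thing we need for spotting None images. A small illustration (not the exact code nnanno uses):

from fastcore.foundation import L

L([None, 1, None, 2]).argwhere(lambda o: o is None)  # (#2) [0,2]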

im_files = (get_image_files("../ph/ads/images"))[:8]
results = list(map(PILImage.create, im_files))
results.append(None)
results = [None] + results
image_batch, none_image_index = _filter_replace_none_image(results)
assert len(results) == len(image_batch)
assert none_image_index.items == [
    0,
    9,
]  # check indexes are at the start and end of list

This is a helper to create a header for the metadata from Newspaper Navigator.

Predict

Below is the main code for doing inference. patch_to is used to add new methods to the nnPredict class. This is helpful for avoiding one massive code cell in a Jupyter notebook. Like other aspects of nnanno, this approach is probably upsetting.
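As a quick illustration of that pattern (a toy example rather than nnanno code), patch_to lets you attach a method to a class after it has been defined:

from fastcore.basics import patch_to

class Example:
    pass

@patch_to(Example)
def shout(self, word: str):
    "A method attached to `Example` after the class was defined"
    return word.upper()

Example().shout("hello")  # 'HELLO'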

At the moment inference assumes a classification model.

# check for multicategory tensor
# tensor preds can be passed directly to DF?

class nnPredict[source]

nnPredict(learner:fastai.learner, try_gpu:bool=True)

nnPredict is used in combination with a trained learner to run inference on Newspaper Navigator

class nnPredict:
    """`nnPredict` is used in combination with a trained leaner to run inference on Newspaper Navigator"""

    population = pd.read_csv(
        pkg_resources.resource_stream("nnanno", "data/all_year_counts.csv"), index_col=0
    )

    def __init__(self, learner: fastai.learner, try_gpu: bool = True):
        """Creates an ``nnPredict` instance from `learner`, puts on GPU if `try_gpu` is true and CUDA is avilable"""
        self.learner = learner
        self.try_gpu = try_gpu
        self.dls = learner.dls
        self.decode_map = {v: k for v, k in enumerate(self.learner.dls.vocab)}

    def __repr__(self):
        return f"{self.__class__.__name__} \n" f"learner vocab:{self.learner.dls.vocab}"

nnPredict is the class used to do the predictions. It takes as input a fastai learner and a boolean flag which controls whether the prediction methods should try to make use of a GPU.


nnPredict.predict_from_sample_df[source]

nnPredict.predict_from_sample_df(sample_df:DataFrame, bs:int=16, disable_pbar:bool=False)

Runs inference on sample_df using batch size bs; disable_pbar controls whether to show a progress bar. Returns a Pandas DataFrame containing the original dataframe and predictions, with labels taken from learner.dls.vocab

Most of the time it is likely we'll use predict_from_sample_df as part of predict_sample.
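That said, if you already have a sample of Newspaper Navigator metadata saved to disk, you could call it directly. A hypothetical sketch, where sample.json and the predictor variable name are assumptions:

# hypothetical direct use of predict_from_sample_df; `sample.json` stands in for a previously saved sample
predictor = nnPredict(learn)
sample_df = pd.read_json("sample.json")
df_with_preds = predictor.predict_from_sample_df(sample_df, bs=16)
df_with_preds.head()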


nnPredict.predict_sample[source]

nnPredict.predict_sample(kind:str, out_dir:str, sample_size:Union[int, float], bs:int=16, start_year:int=1850, end_year:int=1950, step:int=1, year_sample:bool=True, size=None, force_dir=False)

Runs inference on a sample of size sample_size of kind from Newspaper Navigator using batch size bs

predict_sample can be used to make predictions on new samples of the Newspaper Navigator dataset. The sampling works in mostly the same manner as in create_sample. The main difference is that we might also want to specify some changes to how images are requested via IIIF. For example, if our model was trained on images of 256 x 256 we probably want to use this size for inference too. bs controls the size of each batch.

Using the model trained at the top, we can pass our fastai learner into nnPredict.

pred = nnPredict(learn)

We have now created an nnPredict inference object we can use to make predictions against Newspaper Navigator data.

pred
nnPredict 
learner vocab:['illustrations', 'text-only']

We can use predict_sample to run inference given some parameters. We should be careful here about how large we make this sample; in this case we are running against ads, which is a particularly large dataset. We can either pass in a defined sample size for each year, or a percentage so we get a representative sample for each year. Here we just run a small sample as an example.

pred.predict_sample(
    "ads",
    "test/",
    0.001,
    start_year=1850,
    end_year=1852,
    bs=32,
    force_dir=True,
    size=(64, 64),
)

When predict_sample is called, a directory is created containing a JSON file with the predictions for each year's sample. Let's load these predictions:

df = pd.concat([pd.read_json(f) for f in (Path("test").rglob("*json"))])

If we take a look at the columns of the DataFrame we'll see some new columns:

df.columns
Index(['filepath', 'pub_date', 'page_seq_num', 'edition_seq_num', 'batch',
       'lccn', 'box', 'score', 'ocr', 'place_of_publication',
       'geographic_coverage', 'name', 'publisher', 'url', 'page_url',
       'iiif_url', 'pred_decoded', 'illustrations_prob', 'text-only_prob'],
      dtype='object')

pred_decoded contains the predicted labels for your classes:

df["pred_decoded"].value_counts()
text-only        20
illustrations     8
Name: pred_decoded, dtype: int64

and there is a column for the model's confidence for each possible class label:

df["text-only_prob"].describe()
count    28.000000
mean      0.687567
std       0.330806
min       0.023303
25%       0.363166
50%       0.847320
75%       0.964801
max       0.999390
Name: text-only_prob, dtype: float64


pred = nnPredict(learn)

nnPredict.predict[source]

nnPredict.predict(kind:str, out_dir:str, bs:int=32, start_year:int=1850, end_year:int=1950, step:int=1, size=None)

Predicts on the full dataset for a given kind from start_year until end_year using step size step
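A hypothetical call might look like the one below (the output directory and parameter values are just for illustration). Running over the full dataset can take a long time, so the sample-based methods above are usually a better starting point.

# hypothetical full run: predict on every ads image from 1850 to 1860 in two-year steps
# pred.predict("ads", "full_run/", bs=32, start_year=1850, end_year=1860, step=2, size=(64, 64))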

from nbdev.export import notebook2script

notebook2script()
Converted 00_core.ipynb.
Converted 01_sample.ipynb.
Converted 02_annotate.ipynb.
Converted 03_inference.ipynb.
Converted index.ipynb.