Sample, annotate and apply computer vision models to the Newspaper Navigator dataset

CI

tl;dr

nnanno is a modest collection of tools to help work with the delightful Newspaper Navigator data.

nnanno

Newspaper Navigator is a project which extracted visual content (pictures, maps etc.) from the Library of Congress Chronicling America digitised newspaper collection.

Newspaper Navigator has released data in a number of formats including json files which contain a range of metadata about the newspaper from which the image was taken from. nnanno was thrown together to help work with this collection as part of the preparation of some example datasets used in a series of Programming Historian tutorials (currently under review). Since this code was developed using the wonderful nbdev library it is possible to install it for your own use.

What nnanno tries to help with

nnanno doesn't to provide an end-to-end to end software for using machine learning with the Newspaper Navigator data. Instead it is a minimal collection of code that may help you if you want to work with the Newspaper Navigator data.

There are three particular areas where nnanno tries to help a little:

  • sampling subsets from the full Newspaper Navigator data
  • annotating this sample with additional labels using label studio
  • inference (experimental 😬) running inference i.e making new predictions with a machine learning model on the newspaper navigator dataset using IIIF.

Disclaimer

This code was written mainly to help develop some example datasets for a series of Programming Historian tutorials. It has some tests but there are likely bugs and issues with the code. The code in this repository was all written in notebooks, some people hate notebooks. Those people will probably hate this too 🤷‍♂️

If you want to work with the full Newspaper Navigator data you will likely be better of accessing it via the provided S3 bucket see news-navigator.labs.loc.gov/ for more information.

nbdev notes

This code was written using nbdev. This is a tool that helps use Jupyter notebooks for developing Python libraries. Inside the documentation you will see code cells followed by output. This is generated from a Jupyter notebook and shows the actual output of the code rather than something that has been copied and pasted for example:

import datetime
print(datetime.date.today())
2021-02-02

Is evaluated as Python code. This also means all of the documentation and examples can be opened in notebooks and the code inspected, changed and run.

Install

You can install nnanno via PyPI:

pip install nnanno

This will install all of the code for sampling from Newspaper Navigator data.

Optional extra packages

If you want to use all of the parts of nnanno you'll need to install some extra packages.

label studio

If you also want to use label-studio for annotating you will need to install this too. There are various ways in which you may want to install label-studio. See the label studio documentation for various options for setting this up.

fastai

If you want to use the experimental inference functionality you will need to install fastai. See the fastai docs for options for doing this.

Documentation

The documentation can be viewed as rendered pages with the option of opening as notebooks in Google Colab.

Illustrations in advertising: an 'end-to-end' example of using nnanno

You can find an 'end-to-end' example in examples folder of the documentation.

This example goes through the process of sampling, annotating, training a model and predicting this against the newspaper navigator data.

Functionality

The three main areas of nnanno are shown below. The examples in the documentation and in the "end-to-end" example shows this functionality in more detail.

Creating samples

nnanno can be used to create samples from the Newspaper Navigator data:

from nnanno.sample import *
sampler = nnSampler()
df = sampler.create_sample(1,
                           'photos',
                           start_year=1850, 
                           end_year=1855, 
                           step=1)

This returns a dataframe containing samples from the Newspaper Navigator data (loaded via JSON) into a Pandas DataFrame.

df.columns
Index(['filepath', 'pub_date', 'page_seq_num', 'edition_seq_num', 'batch',
       'lccn', 'box', 'score', 'ocr', 'place_of_publication',
       'geographic_coverage', 'name', 'publisher', 'url', 'page_url'],
      dtype='object')

Annotation

The annotation part of nnanno is mainly a little bit of documentation and a few functions to help setup annotation of a sample from Newspaper Navigator using IIIF urls and the label studio annotation tool. Since we can annotate via IIIF this offers a way of annotating without having to download large amounts of data locally.

from nnanno.annotate import create_label_studio_json

Inference

The inference section of nnanno attempts to show one possible way to use IIIF to run inference against samples of Newspaper Navigator using a trained fastai model. This will allow you to make predictions against larger parts of the Newspaper Navigator data.

from nnanno.inference import *
from fastai.vision.all import *
dls = ImageDataLoaders.from_csv('../ph/ads/', 
                                'ads_upsampled.csv',
                                folder='images', 
                                fn_col='file', 
                                label_col='label',
                                item_tfms=Resize(64,ResizeMethod.Squish),
                                num_workers=0)
learn = cnn_learner(dls, resnet18, metrics=F1Score())
learn.fit(1)
epoch train_loss valid_loss f1_score time
0 0.924742 0.963872 0.645570 00:12

With a trained fastai model we can predict on a sample from Newspaper Navigator

predictor = nnPredict(learn, try_gpu=False)
predictor.predict_sample('ads','testinference',0.01,end_year=1850)

This returns a json file for each year from the sample containing the original newspaper navigator data plus the predictions from your model

df = pd.read_json('testinference/1850.json')

We can access the 'decoded' predictions

df['pred_decoded'].value_counts()
text-only        50
illustrations    38
Name: pred_decoded, dtype: int64

or work with the probabilities directly

df.iloc[:5,-2]
0    0.152435
1    0.027183
2    0.096673
3    0.395800
4    0.957422
Name: illustrations_prob, dtype: float64

Acknowledgment

This work was support by Living with Machines. This project, funded by the UK Research and Innovation (UKRI) Strategic Priority Fund, is a multidisciplinary collaboration delivered by the Arts and Humanities Research Council (AHRC), with The Alan Turing Institute, the British Library and the Universities of Cambridge, East Anglia, Exeter, and Queen Mary University of London.