Using computer vision to explore ads in the Newspaper Navigator dataset

Overview

Working with digitized historic materials at scale poses some challenges. This series of notebooks is intended to give a rough overview of some of the considerations involved. It is not intended to model best practice.

Each of these notebooks looks at a particular stage in a machine learning pipeline.

Semi-minimal computing

Minimal computing refers to

"computing done under some set of significant constraints of hardware, software, education, network capacity, power, or other factors. Minimal computing includes both the maintenance, refurbishing, and use of machines to do DH work out of necessity along with the use of new streamlined computing hardware like the Raspberry Pi or the Arduino micro controller to do DH work by choice." - Source https://go-dh.github.io/mincomp/about/

Deep learning has a reputation for being resource-heavy in multiple ways. In particular, it is associated with needing lots of data and a large amount of compute. Although there is truth to this reputation, it is increasingly possible to use deep learning with smallish amounts of data and smallish amounts of compute. Since 'smallish' is a little vague, these notebooks observe the following constraints:

  • only one person did the annotating
  • the majority of the code was run on a consumer laptop (2018 MacBook Pro)
  • some of the notebooks were executed on the Google Colab platform to take advantage of the 'free' GPUs

Although these constraints aren't super minimal, they hopefully show that deep learning might be something that could be used in the absence of large grants or donations of cloud credits.

Our question

The question guiding these notebooks is how the visual content of newspaper advertising changed over the period covered by the Newspaper Navigator data. In particular, we want to see how much advertising was 'visual' and how much was text only. We may then explore how the relative frequency of these types of advertising intersects with other information we can find in the Newspaper Navigator data.
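To make the idea of 'relative frequency' concrete, here is a minimal sketch of how the share of visual ads per decade could be computed once each ad has a label. The label names ('visual', 'text-only'), the function name and the toy records are all assumptions made for illustration; they are not taken from the Newspaper Navigator data or from the later notebooks.

```python
from collections import Counter, defaultdict


def visual_share_by_decade(ads):
    """Return {decade: fraction of ads labelled 'visual'}.

    `ads` is an iterable of (year, label) pairs. The label names
    'visual' and 'text-only' are hypothetical placeholders.
    """
    counts = defaultdict(Counter)
    for year, label in ads:
        decade = (year // 10) * 10  # e.g. 1878 -> 1870
        counts[decade][label] += 1
    return {
        decade: label_counts["visual"] / sum(label_counts.values())
        for decade, label_counts in sorted(counts.items())
    }


# Made-up records, used only to show the shape of the computation.
example_ads = [
    (1871, "text-only"), (1875, "text-only"), (1878, "visual"), (1879, "text-only"),
    (1912, "visual"), (1915, "visual"), (1918, "text-only"), (1919, "visual"),
]
shares = visual_share_by_decade(example_ads)
print(shares)
```

Grouping by decade rather than by year is just one choice; any other grouping (by year, by state, by newspaper title) would follow the same pattern.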

Working at scale: sampling

The first proper notebook focuses on how to sample from the Newspaper Navigator data in a way that helps us train a computer vision model that is useful for answering our question.
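One common way to approach this kind of sampling is to stratify by some attribute, such as year, so that periods with few ads are not swamped by periods with many. The sketch below shows that idea using only the standard library. It is an illustration under assumptions (the `year` field, the function name and the toy records are hypothetical) and not necessarily the approach the sampling notebook takes.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, per_group, seed=42):
    """Sample up to `per_group` records from each group defined by `key`.

    Groups smaller than `per_group` are kept whole. A fixed seed makes
    the sample reproducible.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)

    sample = []
    for group_key in sorted(groups):
        members = groups[group_key]
        k = min(per_group, len(members))
        sample.extend(rng.sample(members, k))
    return sample


# Made-up records: many ads from 1900, very few from 1850.
records = [{"id": i, "year": 1900} for i in range(20)]
records += [{"id": 100 + i, "year": 1850} for i in range(3)]

sample = stratified_sample(records, key=lambda r: r["year"], per_group=5)
print(len(sample))
```

A plain random sample of the same size would mostly return ads from 1900, which is the kind of imbalance stratifying is meant to reduce.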

How our training data interacts with our models: can CNNs time travel?

Although we may want to use computer vision to answer humanities questions, we should still consider how these models work and where they might not work. This may not involve developing new computer vision model architectures as much as trying to find ways of evaluating existing models for biases on the data you will be working with.

The following notebook explores an example of this type of question by looking at whether classifiers trained on one decade will be effective at making predictions in a much later or earlier time period.
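A simple way to check for this kind of problem is to report a metric separately for each time period instead of a single overall score. The sketch below computes accuracy per decade from (year, true label, predicted label) triples. The function name, label names and example values are assumptions for illustration only and are not results from the notebooks.

```python
from collections import defaultdict


def accuracy_by_decade(examples):
    """Return {decade: accuracy} for (year, true_label, predicted_label) triples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for year, truth, predicted in examples:
        decade = (year // 10) * 10
        total[decade] += 1
        correct[decade] += int(truth == predicted)
    return {decade: correct[decade] / total[decade] for decade in sorted(total)}


# Made-up predictions: imagine a classifier trained only on ads from the 1880s.
examples = [
    (1881, "visual", "visual"),
    (1884, "text-only", "text-only"),
    (1886, "visual", "visual"),
    (1889, "text-only", "text-only"),
    (1951, "visual", "text-only"),
    (1953, "visual", "visual"),
    (1956, "text-only", "visual"),
    (1958, "text-only", "text-only"),
]
per_decade = accuracy_by_decade(examples)
print(per_decade)
```

An overall accuracy across all eight examples would hide the gap between the two periods, which is why the per-period breakdown is useful.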

Images as data: Inference and working with outputs

Once we are reasonably confident that our model does fairly well across all the data in our corpus, we move to inference (the process of creating predictions on new, unseen data). This stage includes a discussion of what to consider when running inference on extensive collections with relatively limited computational resources.
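One consideration with limited resources is memory: rather than loading a whole collection at once, we can process it in small batches and write predictions out as we go. The sketch below shows that pattern. `dummy_predict_batch` is a stand-in for a real model, and the file names, column names and batch size are all hypothetical.

```python
import csv
import io
from itertools import islice


def batched(iterable, batch_size):
    """Yield lists of at most `batch_size` items without materialising the input."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def run_inference(paths, predict_batch, out_file, batch_size=64):
    """Predict batch by batch, writing each row as soon as it is available."""
    writer = csv.writer(out_file)
    writer.writerow(["path", "label"])
    written = 0
    for batch in batched(paths, batch_size):
        for path, label in zip(batch, predict_batch(batch)):
            writer.writerow([path, label])
            written += 1
    return written


def dummy_predict_batch(batch):
    # Stand-in for a trained model: labels a path 'visual' if it contains 'img'.
    return ["visual" if "img" in path else "text-only" for path in batch]


# Made-up file names; an in-memory buffer stands in for a file on disk.
paths = [f"ad_{i}_img.jpg" if i % 2 == 0 else f"ad_{i}.jpg" for i in range(5)]
buffer = io.StringIO()
written = run_inference(paths, dummy_predict_batch, buffer, batch_size=2)
print(written)
```

Because rows are written incrementally, an interrupted run (for example, a Colab session timing out) still leaves the predictions made so far on disk when a real file is used in place of the buffer.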

The final notebook begins to look at possible approaches to working with the outputs of machine learning models.