This notebook will work with the predictions generated in the previous notebook and begin to look at some ways in which we can explore this data. This will only scratch the surface and will focus on two main areas:
- working with the predictions as a 'tabular' form of data using pandas
- visualizing the data using hvplot
We'll explain this in more detail as we go.
We'll start by importing a few packages. The most relevant one here is pandas, which we'll use for working with tabular data. We'll cover the others as we use them.
import pandas as pd
from PIL import Image
import requests
import io
We'll import matplotlib and set a different plotting style.
import matplotlib.pyplot as plt
plt.style.use('ggplot')
Install note
The process for installing hvplot can be found in their documentation.
We also import holoviews and hvplot, which we'll use for visualizing our data. These will be discussed in more detail later on.
import holoviews as hv
hv.extension('bokeh')
import hvplot.pandas
url = "https://github.com/Living-with-machines/nnanno_example_data/blob/main/ad_inference_full.json?raw=true"
We can pass this URL directly into the pandas read_json method. Depending on your internet connection speed this might take a little bit of time.
df = pd.read_json(url)
df.shape
We often want slightly more detail about our DataFrame. One way to do this is to use the info method on the DataFrame. If you pass in memory_usage="deep" you'll get a slightly more accurate view of memory usage.
df.info(memory_usage='deep')
This memory usage is pretty modest, so we don't need to worry too much about memory here. However, since we might end up working with more extensive predictions (especially if we don't rely on Colab) we'll treat this data as if it were more extensive. We can use this as an opportunity to look at some things we can do to make it slightly easier to work with large data (using modest hardware) in pandas. This delays the need to go and rent a VM with large amounts of RAM.
However, although we'll look a bit at memory and at tools for working with more extensive data, we'll restrict ourselves to pandas. For bigger data, you may need to explore other tools like Dask or cuDF.
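To give a flavour, here is a minimal sketch of what reading data with Dask might look like, assuming the predictions were stored as several newline-delimited JSON files; the file pattern below is hypothetical and not part of this dataset.
import dask.dataframe as dd
# Dask mirrors much of the pandas API but evaluates lazily, so larger-than-memory
# data can be processed in chunks. The path pattern is purely illustrative.
ddf = dd.read_json("predictions/*.json", lines=True)
# operations build a task graph; calling .compute() materialises a pandas object
ddf['pred_decoded'].value_counts().compute()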
The first thing we might want to look at again is the Dtype column. Dtype refers to the underlying storage pandas is using for different columns in your data. We won't dig into this in a ton of detail, but we'll see how being more specific about our dtypes can help us use less memory.
df.dtypes
One thing we might notice is that we have a lot of object dtypes. These refer to Python objects. Since pandas is supposed to integrate well with Python, this is often the lowest common denominator for storing data. However, we can often save some memory by choosing more specific dtypes for our data. Since this isn't the primary focus of this notebook, we won't dwell on this too long and will instead focus on one super magical dtype: category.
If you look at the column names you'll spot pred_decoded. Let's take a look at that.
df['pred_decoded'].head(3)
Since we know there are only two possible values here we should be able to store this value without using much memory. Let's see how much memory this is using at the moment.
df.memory_usage(deep=True)['pred_decoded'] / 1024 ** 2 # convert bytes to megabytes
We probably don't need to worry about this for this particular dataset, but we can shave this down further since we only need to record which of two categories each row falls into. The category dtype is intended for data where values are repeated many times, as is often the case with 'categorical' data, i.e. a column recording colour could include red, green, blue, etc., but there is likely to be a lot of repetition. For this type of data, category will use less memory. Let's convert our pred_decoded column to a category dtype.
df = df.astype({"pred_decoded":"category"})
We can inspect the memory usage of this specific column again:
df.memory_usage(deep=True)['pred_decoded'] / 1024 ** 2 # convert bytes to megabytes
We can see that this is a big saving in memory. Whilst it doesn't matter much here, this can often be very helpful when working with bigger data where you have a lot of repeated values. This column might not be the only one that could benefit. One good way of deciding where a category dtype might be helpful is to look at the number of unique values in a column. Taking the name column as an example:
df.name.unique().shape, len(df)
Again we can see that there must be some repetition, because the number of unique values is lower than the total length of the DataFrame. We can repeat this and inspect multiple columns. For this data we choose a few columns which might be compressed using the category dtype.
df = df.astype({"batch": "category", "lccn": "category", "name": "category", "publisher": "category", "pred_decoded":"category"})
Since we'll be working with dates a lot, we also parse our date column again.
df['pub_date'] = pd.to_datetime(df['pub_date'])
df['year'] = df.pub_date.dt.year
We'll often want to use dates as the main filter for our data, so we'll set the publication date as our index, replacing our previous index.
df.set_index(df.pub_date, drop=True, inplace=True)
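One benefit of a DatetimeIndex is that we can slice rows by date labels. As a small aside (not required for the rest of the notebook), sorting the index first keeps this kind of range slicing efficient:
# sort by the DatetimeIndex so date-range slicing works efficiently
df = df.sort_index()
# partial string indexing: select all rows from the 1880s
df.loc["1880":"1889"]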
Let's compare our memory usage again now that we've made some (modest) changes.
df.info(memory_usage='deep')
Again, for this data it's small enough that we probably didn't need to do this, but being careful about dtypes can help prolong the amount of time you can keep working on a local machine. Using The Pandas Category Data Type contains a fuller discussion of this dtype.
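If you find yourself doing this kind of conversion often, one approach (a sketch rather than part of the original workflow, with an arbitrary threshold) is a small helper that converts any object column with relatively few unique values:
# sketch: convert low-cardinality object columns to the 'category' dtype
def categorise_low_cardinality(frame, max_unique_ratio=0.5):
    for col in frame.select_dtypes(include="object").columns:
        if frame[col].nunique() / len(frame) < max_unique_ratio:
            frame[col] = frame[col].astype("category")
    return frame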
Now we'll start working with the data in earnest. First, let's check whether any rows are missing a prediction and drop them if so.
df[df['pred_decoded'].isna()].shape
df = df[df['pred_decoded'].notna()]
When working with a new dataset we'll often want to know some basic properties about its distribution. We can do this using the describe method.
df.describe()
In this particular case, some of these columns aren't useful. However, we can already see that we have columns for visual_prob and text_only_prob; as you can probably guess, these are the predicted probabilities of each class for each image. The distributions of these already give us some sense of what our underlying data looks like.
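As a quick aside, we can also get a visual sense of these distributions with a histogram via the pandas plot API (the number of bins here is an arbitrary choice):
# quick look at the distribution of the predicted probability of 'visual'
df['visual_prob'].plot.hist(bins=50, figsize=(10, 4))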
Visual data as tabular data
A question that faces us: what meaningful questions can we ask of this type of data? We're working with what looks like 'regular' tabular data, i.e. a row per observation with columns representing a mixture of categorical, ordinal and numeric data. However, the underlying 'data' points to images predicted as adverts, i.e. visual data. One of the challenges when working 'at scale' is that there is some distance from the underlying source, in this case the images. For this data, we have 21496 rows which each represent an image. This is likely too large for us to meaningfully interrogate each image individually. We have to think about new ways of working with visual data when we can't interact directly with the visual objects themselves. This notebook won't answer this question but will try to show some possible things we can do with this type of data.
df['pred_decoded'].value_counts(normalize=True)
So we can say that ~66% of ads are predicted as text only, and ~33% are predicted as visual.
On its own, this is not particularly interesting, but it already tells us what we might have suspected - that more ads are textual than visual. We are probably interested in digging into how this did or didn't change over time and how the relative frequency of visual ads might coincide with other data features. Let's look at a few examples of doing this.
df.groupby(['year', 'pred_decoded']).size().unstack(fill_value=0)
This already gives us much better insight into the data, but it can still be challenging to interpret a table of numbers like this, so we might want to visualize it. One quick way of doing this is via the plot API within pandas. This often gives us a reasonably good visual with minimal effort.
fig = df.groupby(['year', 'pred_decoded']).size().unstack(fill_value=0).plot.bar(figsize=(15,5))
This chart already gives a better picture of the overall trend. We can see that although the number of visual adverts increases over time, the growth of text ads until the 1920s is faster. We then see a massive drop-off. In the sampling notebook we discussed how this was likely due to changes in the underlying dataset rather than an actual sharp decline in the number of newspapers/adverts in the 1930s-50s. In fact, we might expect this period, particularly the 1950s, to include much more visual advertising with the growth of consumer culture. It's a bit hard to see these later years in the chart above because of their scale relative to the rest of the data. We can use our year column to filter to these later years.
fig = df[df.year >=1930].groupby(['year', 'pred_decoded']).size().unstack(fill_value=0).plot.bar(figsize=(15,5))
We may want to be extra sceptical of these later years because the underlying data may contain additional biases. However, we can see here that in the 1940s and 1950s, the number of visual adverts does begin to overtake the number of text-only ads (according to our model's predictions!). This likely confirms existing priors we had about the changes we were likely to see.
Did some publisher print more visual advertisements?
Beyond looking at trends over time, we might want to know whether some publishers tend to publish more visual ads than others. We can start by replicating our groupby from above, replacing year with publisher. We then grab the 'visual' column and sort the values.
df.groupby(['publisher', 'pred_decoded']).size().unstack(fill_value=0)['visual'].sort_values(ascending=False)
From this we can see that "W.D. Wallach & Hope" is the most prominent publisher, but if we look at the number of times this publisher appears in the data, we can see that this might also be because it appears more often in general in the data.
df[df['publisher']=='W.D. Wallach & Hope'].shape[0]
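One way to account for this (a sketch building on the groupby above) is to look at the proportion of each publisher's adverts predicted as visual, rather than the raw counts:
# proportion of each publisher's ads predicted as 'visual', rather than raw counts
counts = df.groupby(['publisher', 'pred_decoded']).size().unstack(fill_value=0)
(counts['visual'] / counts.sum(axis=1)).sort_values(ascending=False).head(10)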
Moving from labels to predictions
We have so far been working with the pred_decoded label column and treating the values as the 'truth'. However, we know that, in reality, these are predictions that contain mistakes. Working with the labels gives us a very 'binary' view of the prediction data. One way of introducing a bit more nuance is to work with the probability of each label rather than the label itself. Let's look again at the visual_prob column.
df['visual_prob'].describe()
This gives us the full range; we can also see what this looks like for values above 0.5, i.e. any predictions where the probability of visual is higher than text only.
df[df['pred_decoded']=='visual']['visual_prob'].describe()
We can see the breakdown by quartile. One thing we might take away from this distribution is that the predictions (where the prediction is visual) tend to be quite 'confident', i.e. above 90%. We should be careful here: the final layer of our model uses a softmax function, which 'wants' to push one of the possible values, i.e. visual or text, to be higher. This means that we should be careful not to 'translate' the visual_prob values into 'everyday' probabilities. With that disclaimer out of the way, let's start working with the probabilities.
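Before we do, to make the softmax point concrete, here is a tiny illustration with made-up logit values showing how even a modest gap between the two raw scores gets pushed towards a 'confident' probability:
import numpy as np

def softmax(logits):
    # subtract the max for numerical stability before exponentiating
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

# made-up logits for (text_only, visual): a gap of 3 already gives ~95% 'visual'
softmax(np.array([0.0, 3.0]))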
df.groupby(['publisher'])['visual_prob'].agg('mean').sort_values(ascending=False)
This time, once we have done our groupby we calculate the mean for each group (in this case the publisher). We can then see which publishers in general have higher predictions for visual content. Let's dig into the top value.
df[df['publisher']=='Wells Spicer']
This publisher only has one entry, which makes the 'mean' a little bit meaningless (sorry, that wasn't meant to be a pun). Even so, we might now be curious about what this really visual ad is. We can of course grab the image via the iiif_url column.
df[df['publisher']=='Wells Spicer']['iiif_url'].iloc[0]  # .iloc avoids relying on positional [0] against the date index
This link will take you to the image. Now we've indulged our curiosity, let's deal with the fact that there is likely to be a long tail of publishers that appear infrequently in the data. We can look at the distribution of the value counts.
df.publisher.value_counts().describe()
We could filter out these less frequent publishers, but we'll skip this for now; a sketch of what such a filter might look like follows below.
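A minimal sketch, using an arbitrary minimum count of 50:
# keep only publishers with at least 50 rows before aggregating
publisher_counts = df['publisher'].value_counts()
frequent_publishers = publisher_counts[publisher_counts >= 50].index
df_frequent = df[df['publisher'].isin(frequent_publishers)]
Returning to the full data, we can also look at the mean visual probability by year.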
df.groupby(df.pub_date.dt.year)['visual_prob'].agg('mean')
We might also want to look at this by month instead
df.groupby(df.pub_date.dt.month)['visual_prob'].agg('mean').sort_values()
It seems like August is the most bland month for adverts in this data! We could also look at the day of the week.
df.groupby(df.pub_date.dt.dayofweek)['visual_prob'].agg('mean')
Although we can see some patterns in these examples, it isn't always easy to spot them by scanning a list of numbers in this way. This is where some visualization comes in handy...
Visualizing the data
It is often helpful to generate visuals to try and look for patterns in our data. Sometimes these visuals spark new insights which we may not see so easily in the more 'raw data'. In this section, we'll look at some ways of pragmatically creating visuals for our data. We have a few criteria that we'll follow.
- we'll focus on quick visualization that can be done relatively 'directly' with the data we have
- we'll focus on tools with a 'standard' API (we'll cover what this means in more detail later)
- we will focus on tools that will continue to work even if our data grows much larger
Data visualization vs visualizing data
"Data visualization is an interdisciplinary field that deals with the graphic representation of data. It is a particularly efficient way of communicating when the data is numerous as for example a Time Series." - Source:https://en.wikipedia.org/wiki/Data_visualization Although we are using visuals to represent our data in this notebook, we focus on quick and easy ways of getting visuals about our data. Data Visualization as a field contains many consideration and best practices, many of which we may not follow in this notebook.
We can get some sense of the distribution of the data from the 'raw numbers', but some visualizations can help us see the distribution better. We've already seen the pandas plot API in action. Let's see if we can get a better sense of the distribution of visual vs non-visual predictions by day of the week by plotting this.
fig = df.groupby(df.pub_date.dt.dayofweek)[['visual_prob','text_only_prob']].agg('mean').plot.bar()
This already gives a better sense of our data, but we may want to be able to focus on particular years, or modify the parameters we use for plotting. We'll look at hvplot as a tool for doing slightly more interactive plots.
Plotting using hvplot
hvplot is a visualization library that aims to mirror the pandas plot API closely while offering some interactivity 'out of the box'. We'll briefly examine how this tool can help us explore the data in a slightly more interactive way. Since time is such an important dimension, we'll focus on mapping against it here. We have the time available in the index:
df.index
Let's start by replicating what we previously did and group our predicted labels by dayofweek. This produces a chart which we can already hover over as a way of interacting with it.
df.groupby(df.pub_date.dt.dayofweek).count()['pred_decoded'].hvplot(kind='bar')
We can also do the same with month
df.groupby(df.pub_date.dt.month).count()['pred_decoded'].hvplot(kind='bar')
This already gives us a bit more interactivity. If we want to make things more interactive still, we can move the groupby from pandas into hvplot. If we do this, we get a drop-down menu for the options inside the groupby.
df.hvplot.hist(y='year',groupby='pred_decoded')
This is a quick way of adding interactivity to our visualizations, which can be very helpful for spotting trends more easily. If we add another type of index to our groupby, such as month, we can try to spot any trends in visual vs non-visual ads by month of the year.
df.hvplot.hist(y='year',by='pred_decoded',groupby='index.month')
import panel as pn
Using panel we can add Widgets for various interactive components, in this case the unit of time to slice by.
group = pn.widgets.Select(name='group',options=['index.dayofweek', 'index.month', 'index.is_quarter_end'])
We can now use this widget to decide what to use for our groupby.
hist_plot = df.hvplot.hist(y='year', by='pred_decoded', groupby=group)  # avoid reusing the name `plt`, which would shadow matplotlib.pyplot
We can now add our group widget and the plot together in a 'dashboard'. You can see that as we update the groupby the second widget will update depending on what is required to interact with that data.
pn.Column(pn.WidgetBox(group), hist_plot)
Working with hvplot and probabilities
We have so far worked with the decoded labels, but sometimes we may want to work with the probabilities directly. hvplot includes a heatmap visualization that can be useful for helping us see the intensity of the predictions. We'll use year and month as the two axes for our visual and np.mean to aggregate the visual probability for each 'tile'.
import numpy as np
df.hvplot.heatmap(x='index.year',
                  y='index.month',
                  C='visual_prob',
                  aggregator=np.mean,
                  cmap='blues').aggregate(function=np.mean)
We can also do this with day of week
df.hvplot.heatmap(x='index.year', y='index.dayofweek',aggregator=np.mean,C='visual_prob', cmap='blues').aggregate(function=np.mean)
Conclusion
This is the end of this series of notebooks on working with ads data. We have only scratched the surface of how we can start working with this data. Some things we haven't done are to use the OCR or all of the available metadata in earnest. There would also be opportunities for mapping the data spatially.
Hopefully this series of notebooks has given some sense of how we can use computer vision to ask humanities questions without requiring large-scale resources.
fin