This notebook will work with the predictions generated in the previous notebook and begin to look at some ways in which we can explore this data. This will only scratch the surface and will focus on two main areas:
- working with the predictions as a 'tabular' form of data using pandas
- visualizing the data using hvplot
We'll explain this in more detail as we go.
We'll start by importing a few packages. The most relevant one here is pandas, which we'll use for working with tabular data. We'll cover the others as we use them.
import pandas as pd
from PIL import Image
import requests
import io
We'll import matplotlib and set a different plotting style.
import matplotlib.pyplot as plt
plt.style.use('ggplot')
Install note
The process for installing hvplot can be found in their documentation.
We also import holoviews and hvplot, which we'll use for visualizing our data. These will be discussed in more detail later on.
import holoviews as hv
hv.extension('bokeh')
import hvplot.pandas
url = "https://github.com/Living-with-machines/nnanno_example_data/blob/main/ad_inference_full.json?raw=true"
We can pass this URL directly into the pandas read_json method. Depending on your internet connection speed this might take a little bit of time.
df = pd.read_json(url)
df.shape
We often want slightly more detail about our DataFrame. One way to do this is to use the info method on the DataFrame. If you pass in memory_usage="deep" you'll get a slightly more accurate view of memory usage.
df.info(memory_usage='deep')
This memory usage is pretty modest, so we don't need to worry too much about memory here. However, since we might end up working with more extensive predictions (especially if we don't rely on Colab) we'll treat this data as if it were more extensive. We can use this as an opportunity to look at some things we can do to make it slightly easier to work with large data (using modest hardware) in pandas. This delays the need to go and rent a VM with large amounts of RAM.
However, although we'll look a bit at memory and at tools for working with more extensive data, we'll restrict ourselves to pandas. For bigger data, you may need to explore other tools like Dask or cuDF.
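To give a flavour, here is a minimal sketch of what reading data with Dask might look like, assuming the predictions were stored as several newline-delimited JSON files; the file pattern below is hypothetical and not part of this dataset.
import dask.dataframe as dd
# Dask mirrors much of the pandas API but evaluates lazily, so larger-than-memory
# data can be processed in chunks. The path pattern is purely illustrative.
ddf = dd.read_json("predictions/*.json", lines=True)
# operations build a task graph; calling .compute() materialises a pandas object
ddf['pred_decoded'].value_counts().compute()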
The first thing we might want to look at again is the Dtype column. Dtype refers to the underlying storage pandas is using for different columns in your data. We won't dig into this in a ton of detail, but we'll see how being more specific about our dtypes can help us use less memory.
df.dtypes
One thing we might notice is that we have a lot of object dtypes. These refer to Python objects. Since pandas is supposed to integrate well with Python, this is often the lowest common denominator for storing data. However, we can often save some memory by choosing more specific dtypes for our data. Since this isn't the primary focus of this notebook, we won't dwell on this too long and will instead focus on one super magical dtype: category.
If you look at the column names you'll spot pred_decoded. Let's take a look at that.
df['pred_decoded'].head(3)
Since we know there are only two possible values here we should be able to store this value without using much memory. Let's see how much memory this is using at the moment.
df.memory_usage(deep=True)['pred_decoded'] / 1024 ** 2 # convert bytes to megabytes
We probably don't need to worry about this for this particular dataset, but we can shave this down further since we only need to record which of two categories each row falls into. The category dtype is intended for data where values are repeated many times, as is often the case with 'categorical' data, i.e. a column recording colour could include red, green, blue, etc., but there is likely to be a lot of repetition. For this type of data, category will use less memory. Let's convert our pred_decoded column to a category dtype.
df = df.astype({"pred_decoded":"category"})
We can inspect the memory usage of this specific column again:
df.memory_usage(deep=True)['pred_decoded'] / 1024 ** 2 # convert bytes to megabytes
We can see that this is a big saving in memory. Whilst it doesn't matter much here, this can often be very helpful when working with bigger data where you have a lot of repeated values. This column might not be the only one that could benefit. One good way of deciding where a category dtype might be helpful is to look at the number of unique values in a column. Taking the name column as an example:
df.name.unique().shape, len(df)
Again we can see that there must be some repetition, because the number of unique values is lower than the total length of the DataFrame. We can repeat this and inspect multiple columns. For this data we choose a few columns which might be compressed using the category dtype.
df = df.astype({"batch": "category", "lccn": "category", "name": "category", "publisher": "category", "pred_decoded":"category"})
Since we'll be working with dates a lot, we also parse our date column again.
df['pub_date'] = pd.to_datetime(df['pub_date'])
df['year'] = df.pub_date.dt.year
We'll often want to use dates as the main filter for our data, so we'll set the publication date as our index, replacing our previous index.
df.set_index(df.pub_date, drop=True, inplace=True)
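One benefit of a DatetimeIndex is that we can slice rows by date labels. As a small aside (not required for the rest of the notebook), sorting the index first keeps this kind of range slicing efficient:
# sort by the DatetimeIndex so date-range slicing works efficiently
df = df.sort_index()
# partial string indexing: select all rows from the 1880s
df.loc["1880":"1889"]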
Let's compare our memory usage again now that we've made some (modest) changes.
df.info(memory_usage='deep')
Again, for this data it's small enough that we probably didn't need to do this, but being careful about dtypes can help prolong the amount of time you can keep working on a local machine. Using The Pandas Category Data Type contains a fuller discussion of this dtype.
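If you find yourself doing this kind of conversion often, one approach (a sketch rather than part of the original workflow, with an arbitrary threshold) is a small helper that converts any object column with relatively few unique values:
# sketch: convert low-cardinality object columns to the 'category' dtype
def categorise_low_cardinality(frame, max_unique_ratio=0.5):
    for col in frame.select_dtypes(include="object").columns:
        if frame[col].nunique() / len(frame) < max_unique_ratio:
            frame[col] = frame[col].astype("category")
    return frame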
Now we'll start working with the data in earnest. First, let's check whether any rows are missing a prediction and drop them if so.
df[df['pred_decoded'].isna()].shape
df = df[df['pred_decoded'].notna()]
When working with a new dataset we'll often want to know some basic properties about its distribution. We can do this using the describe method.
df.describe()
In this particular case, some of these columns aren't useful. However, we can already see that we have columns for visual_prob and text_only_prob; as you can probably guess, these are the predicted probabilities of each class for each image. The distributions of these already give us some sense of what our underlying data looks like.
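As a quick aside, we can also get a visual sense of these distributions with a histogram via the pandas plot API (the number of bins here is an arbitrary choice):
# quick look at the distribution of the predicted probability of 'visual'
df['visual_prob'].plot.hist(bins=50, figsize=(10, 4))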
Visual data as tabular data
A question that faces us: what meaningful questions can we ask of this type of data? We're working with what looks like 'regular' tabular data, i.e. a row per observation with columns representing a mixture of categorical, ordinal and numeric data. However, the underlying 'data' points to images predicted as adverts, i.e. visual data. One of the challenges when working 'at scale' is that there is some distance from the underlying source, in this case the images. For this data, we have 21496 rows which each represent an image. This is likely too large for us to meaningfully interrogate each image individually. We have to think about new ways of working with visual data when we can't interact directly with the visual objects themselves. This notebook won't answer this question but will try to show some possible things we can do with this type of data.
df['pred_decoded'].value_counts(normalize=True)
So we can say that ~66% of ads are predicted as text only, and ~33% are predicted as visual.
On its own, this is not particularly interesting, but it already tells us what we might have suspected - that more ads are textual than visual. We are probably interested in digging into how this did or didn't change over time and how the relative frequency of visual ads might coincide with other data features. Let's look at a few examples of doing this.
df.groupby(['year', 'pred_decoded']).size().unstack(fill_value=0)
This already gives us much better insight into the data, but it can still be challenging to interpret a table of numbers like this, so we might want to visualize it. One quick way of doing this is via the plot API within pandas. This often gives us a reasonably good visual with minimal effort.
fig = df.groupby(['year', 'pred_decoded']).size().unstack(fill_value=0).plot.bar(figsize=(15,5))
This chart already gives a better picture of the overall trend. We can see that although the number of visual adverts increases over time, the growth of text ads until the 1920s is faster. We then see a massive drop-off. In the sampling notebook we discussed how this was likely due to changes in the underlying dataset rather than an actual sharp decline in the number of newspapers/adverts in the 1930s-50s. In fact, we might expect this period, particularly the 1950s, to include much more visual advertising with the growth of consumer culture. It's a bit hard to see these later years in the chart above because of their scale relative to the rest of the data. We can use our year column to filter to these later years.
fig = df[df.year >=1930].groupby(['year', 'pred_decoded']).size().unstack(fill_value=0).plot.bar(figsize=(15,5))
We may want to be extra sceptical of these later years because the underlying data may contain additional biases. However, we can see here that in the 1940s and 1950s, the number of visual adverts does begin to overtake the number of text-only ads (according to our model's predictions!). This likely confirms existing priors we had about the changes we were likely to see.
Did some publisher print more visual advertisements?
Beyond looking at trends over time, we might want to know whether some publishers tend to publish more visual ads than others. We can start by replicating our groupby from above, replacing year with publisher. We then grab the 'visual' column and sort the values.
df.groupby(['publisher', 'pred_decoded']).size().unstack(fill_value=0)['visual'].sort_values(ascending=False)
From this we can see that "W.D. Wallach & Hope" is the most prominent publisher, but if we look at the number of times this publisher appears in the data, we can see that this might also be because it appears more often in general in the data.
df[df['publisher']=='W.D. Wallach & Hope'].shape[0]
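One way to account for this (a sketch building on the groupby above) is to look at the proportion of each publisher's adverts predicted as visual, rather than the raw counts:
# proportion of each publisher's ads predicted as 'visual', rather than raw counts
counts = df.groupby(['publisher', 'pred_decoded']).size().unstack(fill_value=0)
(counts['visual'] / counts.sum(axis=1)).sort_values(ascending=False).head(10)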
Moving from labels to predictions
We have so far been working with the pred_decoded label column and treating the values as the 'truth'. However, we know that, in reality, these are predictions that contain mistakes. Working with the labels gives us a very 'binary' view of the prediction data. One way of introducing a bit more nuance is to work with the probability of each label rather than the label itself. Let's look again at the visual_prob column.
df['visual_prob'].describe()
This gives us the full range; we can also see what this looks like for values above 0.5, i.e. any predictions where the probability of visual is higher than text only.
df[df['pred_decoded']=='visual']['visual_prob'].describe()
We can see the breakdown by quartile. One thing we might take away from this distribution is that the predictions (where the prediction is visual) tend to be quite 'confident', i.e. above 90%. We should be careful here: the final layer of our model uses a softmax function, which 'wants' to push one of the possible values, i.e. visual or text, to be higher. This means that we should be careful not to 'translate' the visual_prob values into 'everyday' probabilities. With that disclaimer out of the way, let's start working with the probabilities.
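Before we do, to make the softmax point concrete, here is a tiny illustration with made-up logit values showing how even a modest gap between the two raw scores gets pushed towards a 'confident' probability:
import numpy as np

def softmax(logits):
    # subtract the max for numerical stability before exponentiating
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

# made-up logits for (text_only, visual): a gap of 3 already gives ~95% 'visual'
softmax(np.array([0.0, 3.0]))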
df.groupby(['publisher'])['visual_prob'].agg('mean').sort_values(ascending=False)
This time, once we have done our groupby we calculate the mean for each group (in this case the publisher). We can then see which publishers in general have higher predictions for visual content. Let's dig into the top value.
df[df['publisher']=='Wells Spicer']
This publisher only has one entry, which makes the 'mean' a little bit meaningless (sorry, that wasn't meant to be a pun). Even so, we might now be curious about what this really visual ad is. We can of course grab the image via the iiif_url column.
df[df['publisher']=='Wells Spicer']['iiif_url'].iloc[0]  # .iloc avoids relying on positional [0] against the date index
This link will take you to the image. Now we've indulged our curiosity, let's deal with the fact that there is likely to be a long tail of publishers that appear infrequently in the data. We can look at the distribution of the value counts.
df.publisher.value_counts().describe()
We could filter out these less frequent publishers, but we'll skip this for now; a sketch of what such a filter might look like follows below.
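A minimal sketch, using an arbitrary minimum count of 50:
# keep only publishers with at least 50 rows before aggregating
publisher_counts = df['publisher'].value_counts()
frequent_publishers = publisher_counts[publisher_counts >= 50].index
df_frequent = df[df['publisher'].isin(frequent_publishers)]
Returning to the full data, we can also look at the mean visual probability by year.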
df.groupby(df.pub_date.dt.year)['visual_prob'].agg('mean')
We might also want to look at this by month instead
df.groupby(df.pub_date.dt.month)['visual_prob'].agg('mean').sort_values()
It seems like August is the most bland month for adverts in this data! We could also look at the day of the week.
df.groupby(df.pub_date.dt.dayofweek)['visual_prob'].agg('mean')
Although we can see some patterns in these examples, it isn't always easy to spot them by scanning a list of numbers in this way. This is where some visualization comes in handy...
Visualizing the data
It is often helpful to generate visuals to try and look for patterns in our data. Sometimes these visuals spark new insights which we may not see so easily in the more 'raw data'. In this section, we'll look at some ways of pragmatically creating visuals for our data. We have a few criteria that we'll follow.
- we'll focus on quick visualization that can be done relatively 'directly' with the data we have
- we'll focus on tools with a 'standard' API (we'll cover what this means in more detail later)
- we will focus on tools that will continue to work even if our data grows much larger
Data visualization vs visualizing data
"Data visualization is an interdisciplinary field that deals with the graphic representation of data. It is a particularly efficient way of communicating when the data is numerous as for example a Time Series." - Source:https://en.wikipedia.org/wiki/Data_visualization Although we are using visuals to represent our data in this notebook, we focus on quick and easy ways of getting visuals about our data. Data Visualization as a field contains many consideration and best practices, many of which we may not follow in this notebook.
We can get some sense of the distribution of the data from the 'raw numbers', but some visualizations can help us see the distribution better. We've already seen the pandas plot API in action. Let's see if we can get a better sense of the distribution of visual vs non-visual predictions by day of the week by plotting this.
fig = df.groupby(df.pub_date.dt.dayofweek)[['visual_prob','text_only_prob']].agg('mean').plot.bar()
This already gives a better sense of our data, but we may want to be able to focus on particular years, or modify the parameters we use for plotting. We'll look at hvplot as a tool for doing slightly more interactive plots.
Plotting using hvplot
hvplot is a visualization library that aims to mirror the pandas plot API closely while offering some interactivity 'out of the box'. We'll briefly examine how this tool can help us explore the data in a slightly more interactive way. Since time is such an important dimension, we'll focus on mapping against it here. We have the time available in the index:
df.index
Let's start by replicating what we previously did and group our predicted labels by dayofweek. This produces a chart which we can already hover over as a way of interacting with it.
df.groupby(df.pub_date.dt.dayofweek).count()['pred_decoded'].hvplot(kind='bar')
We can also do the same with month
df.groupby(df.pub_date.dt.month).count()['pred_decoded'].hvplot(kind='bar')
This already gives us a bit more interactivity. If we want to make things more interactive still, we can move the groupby from pandas into hvplot. If we do this, we get a drop-down menu for the options inside the groupby.
df.hvplot.hist(y='year',groupby='pred_decoded')
This is a quick way of adding interactivity to our visualizations, which can be very helpful for spotting trends more easily. If we add another type of index to our groupby, such as month, we can try to spot any trends in visual vs non-visual ads by month of the year.
df.hvplot.hist(y='year',by='pred_decoded',groupby='index.month')
import panel as pn
Using panel we can add Widgets for various interactive components, in this case the unit of time to slice by.
group = pn.widgets.Select(name='group',options=['index.dayofweek', 'index.month', 'index.is_quarter_end'])
We can now use this widget to decide what to use for our groupby.
hist_plot = df.hvplot.hist(y='year', by='pred_decoded', groupby=group)  # avoid reusing the name `plt`, which would shadow matplotlib.pyplot
We can now add our group widget and the plot together in a 'dashboard'. You can see that as we update the groupby the second widget will update depending on what is required to interact with that data.
pn.Column(pn.WidgetBox(group), hist_plot)
Working with hvplot and probabilities
We have so far worked with the decoded labels, but sometimes we may want to work with the probabilities directly. hvplot includes a heatmap visualization that can be useful for helping us see the intensity of the predictions. We'll use year and month as the two axes for our visual and np.mean to aggregate the visual probability for each 'tile'.
import numpy as np
df.hvplot.heatmap(x='index.year',
                  y='index.month',
                  C='visual_prob',
                  aggregator=np.mean,
                  cmap='blues').aggregate(function=np.mean)
We can also do this with day of week
df.hvplot.heatmap(x='index.year', y='index.dayofweek',aggregator=np.mean,C='visual_prob', cmap='blues').aggregate(function=np.mean)
Conclusion
This is the end of this series of notebooks on working with ads data. We have only scratched the surface of how we can start working with this data. Some things we haven't done are to use the OCR or all of the available metadata in earnest. There would also be opportunities for mapping the data spatially.
Hopefully this series of notebooks has given some sense of how we can use computer vision to ask humanities questions without requiring large-scale resources.
fin