Sample Inspector (Part I)

Looking at keywords over time

In this notebook, we investigate how the digital sample compares to the ‘universe’ of book publications in the 19th century. Even though the Microsoft Digitised Books corpus is a rich collection, it remains unclear what is in there and what’s not. Especially when one is interested in a specific topic, like ‘machines’, knowing what content we don’t have digital access to is critical if we want to make sense and of findings built on such digital resources.

To understand the digital sample to the population of printed works, we compare the keywords in titles between these two levels. We show how to load and process data in Pandas, build a quick and efficient method to gauge and visualize the presence of a set of selected keywords in book titles over time.

# first we need import all libraries and tools required in the rest of this notebook
%matplotlib inline
import json
from tqdm.notebook import tqdm
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from collections import Counter, defaultdict

Loading the data

Using Pandas, we load the metadata on the BL books collection and print the first three rows for inspection.

metadata_blb = pd.read_csv(
    "https://bl.iro.bl.uk/downloads/e1be1324-8b1a-4712-96a7-783ac209ddef?locale=en",
    dtype={"BL record ID": "string"},
    parse_dates=False,
)
metadata_blb.head(3)
BL record ID Type of resource Name Dates associated with name Type of name Role All names Title Variant titles Series title ... Date of publication Edition Physical description Dewey classification BL shelfmark Topics Genre Languages Notes BL record ID for physical resource
0 014602826 Monograph Yearsley, Ann 1753-1806 person NaN More, Hannah, 1745-1833 [person] ; Yearsley, A... Poems on several occasions [With a prefatory l... NaN NaN ... 1786 Fourth edition MANUSCRIPT note NaN NaN Digital Store 11644.d.32 NaN NaN English NaN 3996603
1 014602830 Monograph A, T. NaN person NaN Oldham, John, 1653-1683 [person] ; A, T. [person] A Satyr against Vertue. (A poem: supposed to b... NaN NaN ... 1679 NaN 15 pages (4°) NaN Digital Store 11602.ee.10. (2.) NaN NaN English NaN 1143
2 014602831 Monograph NaN NaN NaN NaN NaN The Aeronaut, a poem; founded almost entirely,... NaN NaN ... 1816 NaN 17 pages (8°) NaN Digital Store 992.i.12. (3.) Dublin (Ireland) NaN English NaN 22782

3 rows × 24 columns

Parsing all of the titles for a year

Next, we process the titles in this dataframe: we create a pd.Series object in which we map a year to a long string that contains all the titles of books published in that year.

# convert all titles to a string
metadata_blb["title_str"] = metadata_blb.Title.apply(lambda x: "".join([t for t in x]))
# in case of a date range select the first year
metadata_blb["Date of publication"] = metadata_blb["Date of publication"].apply(
    lambda x: str(x).split("-")[0]
)
# group all titles by year, i.e. for each year we concatenate all titles as one long string
titles_by_year_bl = metadata_blb.groupby("Date of publication")["title_str"].apply(
    " ".join
)
# slice the Series to retain only titles that fall within the target period
titles_by_year_bl = titles_by_year_bl["1800":"1899"]

These operations return a variable titles_by_year_bl of type pd.Series in which the index refers to the year of publication and the values are a concatenation of all titles

titles_by_year_bl
Date of publication
1800    Egbert; or, The Suicide. A tale [In verse.] So...
1801    The Old Hag in a Red Cloak. A romance [In vers...
1802    Elegy to the memory of the late Duke of Bedfor...
1803    [Britons strike home.] Songs &c. in Britons st...
1804    The Sports of the Genii [Etchings from drawing...
                              ...                        
1895    A Few Verses The Ghais o' Dennilair: a legend ...
1896    Book of Word & Music of Humorous Songs, Burles...
1897    Altenglische Spruchweisheit, alt- und mittelen...
1898    A Vision. (Penny edition.) The Ballad of the W...
1899    Hawthorn and Lavender: songs and madrigals We ...
Name: title_str, Length: 100, dtype: object
titles_by_year_bl["1800"][:200]
'Egbert; or, The Suicide. A tale [In verse.] Songs, Chorusses, etc, in the new pantomime of Harlequin Tour; or, the Dominion of Fancy, as performed at the Theatre Royal, Covent-Garden, etc A Journey to'

Loading catalogue metadata

Then we load a .csv file which contains an export of the British Library Catalogue. For sure, the BL catalogue won’t give us a complete list of all books in the 19th century, but at least it is converging to the universe of known printed works in this period. As a comparison and contextualisation tool, it helps us understand the contours and composition of the digital collection.

metadata_cat = pd.read_csv(
    "https://bl.iro.bl.uk/downloads/e4bf0f74-2c64-4322-93c7-0dcc5e5246da?locale=en",
    dtype={"BL record ID": "string"},
)
metadata_cat.head(3)
/usr/local/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3296: DtypeWarning: Columns (2,3,19) have mixed types.Specify dtype option on import or set low_memory=False.
  exec(code_obj, self.user_global_ns, self.user_ns)
BL record ID Type of resource BNB number ISBN Name Dates associated with name Type of name Role All names Title ... Publisher Date of publication Edition Physical description Dewey classification BL shelfmark Topics Genre Languages Notes
0 000000004 Monograph NaN NaN NaN NaN NaN NaN Carlbohm, Johan Arvid, printer [person] Aabc [etc.] Jesus Vocales, eli äänelliset boks... ... präntätty directörin J.A. Carlbohmin tykönä 1800 NaN 16 unnumbered pages, 17 cm (8°) NaN 12976.aa.3 Writing ; Reading ; Writing--Alphabets--Primer... NaN Finnish Finnish primer, beginning with the Lord's pray...
1 000000006 Monograph NaN NaN NaN NaN NaN NaN NaN A che serve il Papa? ... Tiberina 1889 NaN 32 pages, 14 cm NaN 3900.aaa.20. (4.) NaN NaN Italian NaN
2 000000007 Monograph NaN NaN NaN NaN NaN NaN NaN A. for Apple [An illustrated alphabet.] ... Ward & Lock 1894 NaN NaN NaN 12811.h.70 NaN NaN NaN NaN

3 rows × 25 columns

Parsing date ranges

Values in the 'Date of publication' column sometimes refer to a date range instead of a specific year. To simplify things, we take the first year as the date of publication. Admittedly, this choice is questionable, and we encourage that you craft other solutions depending on your research interests.

metadata_cat["Date of publication"].unique()
array(['1800', '1889', '1894', ..., '1855-1875', '1887-1937', '1893-1963'],
      dtype=object)
metadata_cat["date"] = metadata_cat["Date of publication"].apply(
    lambda x: str(x).split("-")[0]
)

We also do the same for the BL Books collection

metadata_blb["date"] = metadata_blb["Date of publication"].apply(
    lambda x: str(x).split("-")[0]
)

With its 5.1 billion tokens and close to 50k titles, the Microsoft Digitised Books is an impressive collection. However, it constitutes just 2.73% of all books printed in the 19th century. As a historian, the amount of information seems overwhelming, but one should realise it is still a small empirical basis when making claims about cultural evolution in the 19th century.

num_19thc_books_bl = len(
    metadata_blb[(metadata_blb.date >= "1800") & (metadata_blb.date < "1900")]
)
num_19thc_books_cat = len(
    metadata_cat[(metadata_cat.date >= "1800") & (metadata_cat.date < "1900")]
)
print("Number of 19thC works in the BL Books collection: ", num_19thc_books_bl)
print("Number of 19thC works in the Catologue: ", num_19thc_books_cat)
print(
    "The sample size is", round(num_19thc_books_bl / num_19thc_books_cat * 100, 2), "%"
)
Number of 19thC works in the BL Books collection:  46518
Number of 19thC works in the Catologue:  1686158
The sample size is 2.76 %

Comparing our corpus to a wider collection

Similar to what we did with BL Books metadata, we concatenate titles in the catalogue to one long string, one per year, and slice the pd.Series object.

metadata_cat["Title"] = metadata_cat["Title"].astype(
    str
)  # convert all titles to strings
titles_by_year_cat = metadata_cat.groupby("date")["Title"].apply(
    " ".join
)  # group titles by year and join as one long string
titles_by_year_cat = titles_by_year_cat["1800":"1899"]  # slice the series

We established that the digital corpus constitutes only 2.76% of the population. To assess changes over time, we can visualize the sample size (as a percentage of the population) for each year.

a = metadata_cat.groupby("date")["Title"].count()["1800":"1900"]
b = metadata_blb.groupby("date")["title_str"].count()["1800":"1900"]
(b / a * 100).plot()
<matplotlib.axes._subplots.AxesSubplot at 0x159360110>
_images/sample_inspector_i_25_1.png

In the cells below, we turn to analysing and comparing the content of the titles. We repeat the following steps for both the BL Books metadata and catalogue information.

  • We create a document-term matrix, where each row comprises the word counts for all titles in a specific year

  • We compute the total word counts per year

We use the CountVectorizer object provided by sklearn which takes a list of texts as input and converts this to a document-term matrix. The min_df argument allows us to discard words that only appear once, thus avoiding that the matrix becomes too large.

bl_counts = CountVectorizer(min_df=2)  # create a CountVectorizer object
bl_dtm = bl_counts.fit_transform(
    titles_by_year_bl
)  # fit the CountVectorizer on the titles
totals_bl = bl_dtm.sum(axis=1)
# sum row wise, i.e. totals for each year
totals_bl = np.squeeze(np.array(totals_bl))
cat_counts = CountVectorizer(min_df=10)
cat_dtm = cat_counts.fit_transform(titles_by_year_cat)
totals_cat = cat_dtm.sum(axis=1)
totals_cat = np.squeeze(np.array(totals_cat))

By amending the terms variable you can select the keywords you want to investigate and compare between sample and population.

terms = {"machinery", "machines", "machine", "engine", "engines"}
terms_bl = list(terms.intersection(set(bl_counts.get_feature_names())))
print(
    "Plotting relative frequency for following words in BL Books titles:\n\t", terms_bl
)
terms_cat = list(terms.intersection(set(cat_counts.get_feature_names())))
print()
print(
    "Plotting relative frequency for following words in BL Catalogue titles:\n\t",
    terms_cat,
)
Plotting relative frequency for following words in BL Books titles:
	 ['machinery', 'machines', 'engines']

Plotting relative frequency for following words in BL Catalogue titles:
	 ['machine', 'machines', 'engines', 'machinery', 'engine']

In the code cell below, we first collect frequencies of the given query terms in the BL Books corpus (and sum all counts if more than one keyword is given). We divide the yearly absolute counts by the total number of words in titles to obtain yearly relative frequencies. We repeat the same procedure for the catalogue data.

Lastly, we plot the relative frequencies for both the BL Books (blue) and catalogue (orange), slight smoothing the timeline by showing the rolling mean.

idx_bl = [bl_counts.vocabulary_[term] for term in terms_bl]
if len(idx_bl) <= 1:
    counts_bl = np.squeeze(bl_dtm[:, idx_bl].toarray())
else:
    counts_bl = np.squeeze(bl_dtm[:, idx_bl].toarray()).sum(axis=1)

pd.Series(counts_bl / totals_bl, index=titles_by_year_bl.index).rolling(2).mean().plot()
idx_cat = [cat_counts.vocabulary_[term] for term in terms_cat]

if len(idx_cat) <= 1:
    counts_cat = np.squeeze(cat_dtm[:, idx_cat].toarray())
else:
    counts_cat = np.squeeze(cat_dtm[:, idx_cat].toarray()).sum(axis=1)

pd.Series(counts_cat / totals_cat).rolling(2).mean().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x162b24590>
_images/sample_inspector_i_33_1.png

## Fin.