{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "Z8l3phE7O_fi" }, "source": [ "# Sample Inspector (Part I)\n", "## Looking at keywords over time\n", "\n", "In this notebook, we investigate how the digital sample compares to the 'universe' of book publications in the 19th century. Even though the Microsoft Digitised Books corpus is a rich collection, it remains unclear what is in there and what is not. Especially when one is interested in a specific topic, like 'machines', knowing what content we don't have digital access to is critical if we want to make sense of findings built on such digital resources.\n", "\n", "To understand how the digital sample relates to the population of printed works, we compare the keywords in titles between these two levels. We show how to load and process data in Pandas and build a quick and efficient method to gauge and visualize the presence of a set of selected keywords in book titles over time." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "zci-0phcO_fl" }, "outputs": [], "source": [ "# first we need to import all the libraries and tools required in the rest of this notebook\n", "%matplotlib inline\n", "import json\n", "from tqdm.notebook import tqdm\n", "import pandas as pd\n", "import numpy as np\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "from collections import Counter, defaultdict" ] }, { "cell_type": "markdown", "id": "9e850520", "metadata": {}, "source": [ "## Loading the data" ] }, { "cell_type": "markdown", "metadata": { "id": "znApsC3CO_fm" }, "source": [ "Using Pandas, we load the metadata on the BL books collection and print the first three rows for inspection." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "vQEZ2G0fO_fn" }, "outputs": [ { "data": { "text/html": [ "
\n", "
|  | BL record ID | Type of resource | Name | Dates associated with name | Type of name | Role | All names | Title | Variant titles | Series title | ... | Date of publication | Edition | Physical description | Dewey classification | BL shelfmark | Topics | Genre | Languages | Notes | BL record ID for physical resource |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 014602826 | Monograph | Yearsley, Ann | 1753-1806 | person | NaN | More, Hannah, 1745-1833 [person] ; Yearsley, A... | Poems on several occasions [With a prefatory l... | NaN | NaN | ... | 1786 | Fourth edition MANUSCRIPT note | NaN | NaN | Digital Store 11644.d.32 | NaN | NaN | English | NaN | 3996603 |
| 1 | 014602830 | Monograph | A, T. | NaN | person | NaN | Oldham, John, 1653-1683 [person] ; A, T. [person] | A Satyr against Vertue. (A poem: supposed to b... | NaN | NaN | ... | 1679 | NaN | 15 pages (4°) | NaN | Digital Store 11602.ee.10. (2.) | NaN | NaN | English | NaN | 1143 |
| 2 | 014602831 | Monograph | NaN | NaN | NaN | NaN | NaN | The Aeronaut, a poem; founded almost entirely,... | NaN | NaN | ... | 1816 | NaN | 17 pages (8°) | NaN | Digital Store 992.i.12. (3.) | Dublin (Ireland) | NaN | English | NaN | 22782 |

3 rows × 24 columns
\n", "\n", "
|  | BL record ID | Type of resource | BNB number | ISBN | Name | Dates associated with name | Type of name | Role | All names | Title | ... | Publisher | Date of publication | Edition | Physical description | Dewey classification | BL shelfmark | Topics | Genre | Languages | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 000000004 | Monograph | NaN | NaN | NaN | NaN | NaN | NaN | Carlbohm, Johan Arvid, printer [person] | Aabc [etc.] Jesus Vocales, eli äänelliset boks... | ... | präntätty directörin J.A. Carlbohmin tykönä | 1800 | NaN | 16 unnumbered pages, 17 cm (8°) | NaN | 12976.aa.3 | Writing ; Reading ; Writing--Alphabets--Primer... | NaN | Finnish | Finnish primer, beginning with the Lord's pray... |
| 1 | 000000006 | Monograph | NaN | NaN | NaN | NaN | NaN | NaN | NaN | A che serve il Papa? | ... | Tiberina | 1889 | NaN | 32 pages, 14 cm | NaN | 3900.aaa.20. (4.) | NaN | NaN | Italian | NaN |
| 2 | 000000007 | Monograph | NaN | NaN | NaN | NaN | NaN | NaN | NaN | A. for Apple [An illustrated alphabet.] | ... | Ward & Lock | 1894 | NaN | NaN | NaN | 12811.h.70 | NaN | NaN | NaN | NaN |

3 rows × 25 columns
\n", "