{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "tXis3tfFTNcG"
},
"source": [
"# Sample Inspector (Part II)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "NJJnti00TNcJ"
},
"source": [
"\n",
"\n",
"This notebook compares the Microsoft Digitised Books collection to the genre annotations. We'd like to know whether the annotated sample deviates from the digital collection, so we look at aspects that _should_ remain stable across both datasets."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "eDMj08-5TNcK"
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import json\n",
"import pandas as pd\n",
"from collections import Counter"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2gQ5Gj36djCZ"
},
"source": [
"## Data processing"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TxyORvXGTNcL"
},
"source": [
"First, we load the metadata of the Microsoft Digitised Books collection. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 555
},
"id": "eLGGRMUPTNcM",
"outputId": "b2f91cdf-1c41-4568-af74-cd85fd3694de"
},
"outputs": [
{
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>BL record ID</th><th>Type of resource</th><th>Name</th><th>Dates associated with name</th><th>Type of name</th><th>Role</th><th>All names</th><th>Title</th><th>Variant titles</th><th>Series title</th><th>Number within series</th><th>Country of publication</th><th>Place of publication</th><th>Publisher</th><th>Date of publication</th><th>Edition</th><th>Physical description</th><th>Dewey classification</th><th>BL shelfmark</th><th>Topics</th><th>Genre</th><th>Languages</th><th>Notes</th><th>BL record ID for physical resource</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>014602826</td><td>Monograph</td><td>Yearsley, Ann</td><td>1753-1806</td><td>person</td><td>NaN</td><td>More, Hannah, 1745-1833 [person] ; Yearsley, A...</td><td>Poems on several occasions [With a prefatory l...</td><td>NaN</td><td>NaN</td><td>NaN</td><td>England</td><td>London</td><td>NaN</td><td>1786</td><td>Fourth edition MANUSCRIPT note</td><td>NaN</td><td>NaN</td><td>Digital Store 11644.d.32</td><td>NaN</td><td>NaN</td><td>English</td><td>NaN</td><td>3996603</td></tr>\n",
"<tr><th>1</th><td>014602830</td><td>Monograph</td><td>A, T.</td><td>NaN</td><td>person</td><td>NaN</td><td>Oldham, John, 1653-1683 [person] ; A, T. [person]</td><td>A Satyr against Vertue. (A poem: supposed to b...</td><td>NaN</td><td>NaN</td><td>NaN</td><td>England</td><td>London</td><td>NaN</td><td>1679</td><td>NaN</td><td>15 pages (4°)</td><td>NaN</td><td>Digital Store 11602.ee.10. (2.)</td><td>NaN</td><td>NaN</td><td>English</td><td>NaN</td><td>1143</td></tr>\n",
"<tr><th>2</th><td>014602831</td><td>Monograph</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>The Aeronaut, a poem; founded almost entirely,...</td><td>NaN</td><td>NaN</td><td>NaN</td><td>Ireland</td><td>Dublin</td><td>Richard Milliken</td><td>1816</td><td>NaN</td><td>17 pages (8°)</td><td>NaN</td><td>Digital Store 992.i.12. (3.)</td><td>Dublin (Ireland)</td><td>NaN</td><td>English</td><td>NaN</td><td>22782</td></tr>\n",
"</tbody>\n",
"</table>"
],
"text/plain": [
" BL record ID Type of resource ... Notes BL record ID for physical resource\n",
"0 014602826 Monograph ... NaN 3996603\n",
"1 014602830 Monograph ... NaN 1143\n",
"2 014602831 Monograph ... NaN 22782\n",
"\n",
"[3 rows x 24 columns]"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"metadata_blb = pd.read_csv(\n",
"    \"https://bl.iro.bl.uk/downloads/e1be1324-8b1a-4712-96a7-783ac209ddef?locale=en\",\n",
"    dtype={\"BL record ID\": \"string\"},\n",
"    parse_dates=False,\n",
")\n",
"metadata_blb.head(3)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4aMzk5XheLr9"
},
"source": [
"## Computing the title's first character"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "hJ8L_EXGTNcM"
},
"source": [
"A simple test for comparing the sample to the Microsoft Digitised Books metadata is to compute the probability of each title's first character. This distribution should look similar for both datasets. Below we create a function `first_alpha_char` that returns the first alphabetical character of a string in lowercase. If none can be found, it returns a hash sign (`#`)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "CPLtjbSUTNcN"
},
"outputs": [],
"source": [
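"def first_alpha_char(x):\n",
"    \"\"\"Return the first alphabetical character of a string, lowercased, or \"#\" if there is none.\"\"\"\n",
"    try:\n",
"        # Scan the whole string so that leading digits or punctuation are skipped\n",
"        return next(c for c in x.lower() if c.isalpha())\n",
"    except (AttributeError, StopIteration):\n",
"        # Non-string input (e.g. NaN) or no alphabetical character at all\n",
"        return \"#\""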
]
},
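{
"cell_type": "markdown",
"metadata": {},
"source": "To make the behaviour described above concrete, here is a minimal, self-contained sketch. The helper name `first_alpha_char_sketch` and the example titles are hypothetical and used only for illustration. The sketch assumes that the intended behaviour is to scan the whole title for its first alphabetical character, and that missing titles arrive as non-string values such as `float('nan')`.

```python
def first_alpha_char_sketch(title):
    '''Return the first alphabetical character of `title`, lowercased, or '#' if there is none.'''
    if not isinstance(title, str):
        # Missing titles (for example NaN) are not strings
        return '#'
    for char in title:
        if char.isalpha():
            return char.lower()
    return '#'


# Hypothetical inputs: a plain title, a bracketed title, a title with no letters, a missing title
examples = ['Poems on several occasions', '[A Satyr against Vertue]', '1816', float('nan')]
first_chars = [first_alpha_char_sketch(t) for t in examples]
print(first_chars)
```

Titles that contain no letters and missing titles both end up in the `#` bucket, so that bucket mixes two different situations."
},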
{
"cell_type": "markdown",
"metadata": {
"id": "FZYJQODXTNcO"
},
"source": [
"With `.apply()` we can extract the first character from the `Title` column."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "AoxWf4e9TNcP"
},
"outputs": [],
"source": [
"metadata_blb[\"first_alpha_char\"] = metadata_blb.Title.apply(first_alpha_char)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ISXwu6L-TNcP"
},
"source": [
"Next, we use `Counter()` to count all elements in the `first_alpha_char` column and turn the counts into probabilities."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "59konXvgTNcQ"
},
"outputs": [],
"source": [
"# first_alpha_char never returns None, so no filtering is needed here\n",
"char_count = Counter(metadata_blb[\"first_alpha_char\"])\n",
"total = sum(char_count.values())\n",
"char_probs = {k: v / total for k, v in char_count.items()}"
]
},
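{
"cell_type": "markdown",
"metadata": {},
"source": "The step from counts to probabilities can be sketched on a toy list using only the standard library. The letters below are made up for illustration and the variable names are hypothetical.

```python
from collections import Counter

# Hypothetical first characters extracted from six titles
toy_chars = ['p', 'a', 't', 'a', '#', 't']

toy_count = Counter(toy_chars)
toy_total = sum(toy_count.values())

# Divide each count by the total so that the values sum to 1
toy_probs = {char: count / toy_total for char, count in toy_count.items()}
print(toy_probs)
```

Computing the total once, outside the comprehension, avoids summing all counts again for every key."
},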
{
"cell_type": "markdown",
"metadata": {
"id": "HKAGHnMQeUe_"
},
"source": [
"## Plotting distribution of first title character"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 283
},
"id": "SMfFfhNTTNcR",
"outputId": "f4b58450-0a7f-47dc-865d-5bc226bc338b"
},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"[Figure: line plot of first-character probabilities, a to z]"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"pd.Series(char_probs, index=sorted(char_probs.keys()))[\"a\":\"z\"].plot()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "uHRwDDmgTNcR"
},
"source": [
"Next, we apply the same procedure to the annotated sample, starting by loading it."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "NvIeNXJAeeB8"
},
"source": [
"## Comparing to the annotated data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 355
},
"id": "Mlqog-T3Tdq1",
"outputId": "f73c425f-3f1f-47bc-e2f6-23609b0882fe"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py:2718: DtypeWarning: Columns (17,35,37) have mixed types.Specify dtype option on import or set low_memory=False.\n",
" interactivity=interactivity, compiler=compiler, result=result)\n"
]
},
{
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>BL record ID</th><th>Type of resource</th><th>Name</th><th>Dates associated with name</th><th>Type of name</th><th>Role</th><th>All names</th><th>Title</th><th>Variant titles</th><th>Series title</th><th>Number within series</th><th>Country of publication</th><th>Place of publication</th><th>Publisher</th><th>Date of publication</th><th>Edition</th><th>Physical description</th><th>Dewey classification</th><th>BL shelfmark</th><th>Topics</th><th>Genre</th><th>Languages</th><th>Notes</th><th>BL record ID for physical resource</th><th>classification_id</th><th>user_id</th><th>created_at</th><th>subject_ids</th><th>annotator_date_pub</th><th>annotator_normalised_date_pub</th><th>annotator_edition_statement</th><th>annotator_genre</th><th>annotator_FAST_genre_terms</th><th>annotator_FAST_subject_terms</th><th>annotator_comments</th><th>annotator_main_language</th><th>annotator_other_languages_summaries</th><th>annotator_summaries_language</th><th>annotator_translation</th><th>annotator_original_language</th><th>annotator_publisher</th><th>annotator_place_pub</th><th>annotator_country</th><th>annotator_title</th><th>Link to digitised book</th><th>annotated</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>014602826</td><td>Monograph</td><td>Yearsley, Ann</td><td>1753-1806</td><td>person</td><td>NaN</td><td>More, Hannah, 1745-1833 [person] ; Yearsley, A...</td><td>Poems on several occasions [With a prefatory l...</td><td>NaN</td><td>NaN</td><td>NaN</td><td>England</td><td>London</td><td>NaN</td><td>1786</td><td>Fourth edition MANUSCRIPT note</td><td>NaN</td><td>NaN</td><td>Digital Store 11644.d.32</td><td>NaN</td><td>NaN</td><td>English</td><td>NaN</td><td>3996603</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>False</td></tr>\n",
"</tbody>\n",
"</table>"
],
"text/plain": [
" BL record ID Type of resource ... Link to digitised book annotated\n",
"0 014602826 Monograph ... NaN False\n",
"\n",
"[1 rows x 46 columns]"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"annotations = pd.read_csv(\n",
"    \"https://bl.iro.bl.uk/downloads/36c7cd20-c8a7-4495-acbe-469b9132c6b1?locale=en\",\n",
"    dtype={\"BL record ID\": str},\n",
"    low_memory=False,  # read the file in one pass to avoid the mixed-type DtypeWarning\n",
")\n",
"annotations.head(1)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fk5UuvpZXA1q"
},
"source": [
"This dataset includes both annotated and non-annotated rows. Since we only want the annotated data, we keep only the rows where the `annotated` column, a flag indicating whether a row has been annotated, is `True`. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "07WVi70SW-w5"
},
"outputs": [],
"source": [
"annotations = annotations[annotations[\"annotated\"] == True]"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zaIK-3w3XW7i"
},
"source": [
"Because of the way the annotations were collected, some books appear more than once. There are different ways to deal with these duplicates, but here we simply keep the first row for each value in the `Title` column. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "cbwmG3KbYJ3Z"
},
"outputs": [],
"source": [
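"# .copy() gives an independent DataFrame, so adding columns later does not trigger SettingWithCopyWarning\n",
"annotations = annotations.drop_duplicates(subset=\"Title\").copy()"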
]
},
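{
"cell_type": "markdown",
"metadata": {},
"source": "The two filtering steps above can be sketched on a toy table. The rows and genre labels below are invented for illustration; only the column names `Title`, `annotated` and `annotator_genre` come from the dataset.

```python
import pandas as pd

# Toy table: one unannotated row and a title that was annotated twice
toy = pd.DataFrame(
    {
        'Title': ['Poems', 'The Aeronaut', 'The Aeronaut', 'A Satyr'],
        'annotated': [False, True, True, True],
        'annotator_genre': [None, 'Fiction', 'Non-fiction', 'Fiction'],
    }
)

# Keep annotated rows only, then keep the first row for each title
toy_annotated = toy[toy['annotated']].drop_duplicates(subset='Title').copy()
print(toy_annotated)
```

Keeping the first row silently discards a conflicting second annotation (`Non-fiction` in this toy example), which is acceptable for comparing titles and dates but would matter when working with the genre labels themselves."
},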
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "lxqJUyUhTNcS",
"outputId": "e66d3f0d-29d7-4c9f-cd13-95952b5e3c39"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
"Try using .loc[row_indexer,col_indexer] = value instead\n",
"\n",
"See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
" \"\"\"Entry point for launching an IPython kernel.\n"
]
}
],
"source": [
"annotations[\"first_alpha_char\"] = annotations.Title.apply(first_alpha_char)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Z0Jj1_rZTNcT"
},
"outputs": [],
"source": [
"# first_alpha_char never returns None, so no filtering is needed here\n",
"char_count_anno = Counter(annotations[\"first_alpha_char\"])\n",
"total_anno = sum(char_count_anno.values())\n",
"char_probs_anno = {k: v / total_anno for k, v in char_count_anno.items()}"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "QbvjzIeMeh-g"
},
"source": [
"## Plotting a comparison between the annotated subset and full collection "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 283
},
"id": "BHX17wkUTNcU",
"outputId": "b5c121d3-8edd-4935-cdc1-8bb91c560cc2"
},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
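"# Align both distributions on one sorted index so the x-axes match; letters absent from a dataset get 0\n",
"char_df = pd.DataFrame({\"full collection\": char_probs, \"annotated sample\": char_probs_anno})\n",
"char_df.sort_index().loc[\"a\":\"z\"].fillna(0).plot()"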
]
},
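{
"cell_type": "markdown",
"metadata": {},
"source": "A plot shows whether two distributions look alike, but a single number can make the comparison easier to track. The sketch below computes the total variation distance (half the sum of absolute differences) between two probability dictionaries. This measure is not part of the original analysis; the function name `total_variation_distance` and the toy values are assumptions for illustration.

```python
def total_variation_distance(probs_a, probs_b):
    '''Return half the sum of absolute differences between two distributions.'''
    keys = set(probs_a) | set(probs_b)
    # A key missing from one distribution counts as probability 0
    return 0.5 * sum(abs(probs_a.get(k, 0.0) - probs_b.get(k, 0.0)) for k in keys)


# Toy distributions standing in for char_probs and char_probs_anno
toy_full = {'a': 0.5, 't': 0.3, 'p': 0.2}
toy_sample = {'a': 0.4, 't': 0.3, 'd': 0.3}

tvd = total_variation_distance(toy_full, toy_sample)
print(round(tvd, 3))
```

A value of 0 means the two distributions are identical and a value of 1 means they share no characters at all. With the notebook's variables the call would be `total_variation_distance(char_probs, char_probs_anno)`."
},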
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Nr2r6u61TNcU",
"outputId": "b059861c-af6a-4e22-8506-64f7b46a830a"
},
"outputs": [
{
"data": {
"text/plain": [
"array([nan, 'lat\\nger', 'gmh\\nlat\\nger', 'ger', 'ger\\ngmh\\nlat',\n",
" 'ger\\nfre', 'eng', 'lat\\ngmh\\nger', 'ger\\neng', 'ger\\nlat',\n",
" 'lat\\nfre', 'dut\\nfre\\nspa', 'fre\\nger', 'ger\\nfrs\\nlat',\n",
" 'ger\\ngmh', 'lat\\ndut\\nfrm\\nfre', 'ger\\nspa', 'eng\\nfre'],\n",
" dtype=object)"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"annotations[\"annotator_main_language\"].unique()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "EZyYLA4KTNcU"
},
"outputs": [],
"source": [
"# Flag rows whose first listed language code is English (\"eng\")\n",
"annotations[\"english\"] = annotations[\"annotator_main_language\"].apply(\n",
"    lambda x: str(x).lower().startswith(\"eng\")\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Pz-G_ALxeqDY"
},
"source": [
"## Compare distribution of publication date"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "hCyul5VqTNcV"
},
"source": [
"Lastly, we can compare the distribution over time to see if the sample is biased towards a certain period."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "8k_qgv4LY8pM"
},
"outputs": [],
"source": [
"metadata_blb = metadata_blb[metadata_blb[\"Date of publication\"].notna()].copy(deep=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "IcAm0xViZkHC"
},
"outputs": [],
"source": [
"metadata_blb[\"date\"] = metadata_blb[\"Date of publication\"].str.split(\"-\").str[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 283
},
"id": "y5-MSel-TNcV",
"outputId": "ea4e123b-253a-4b8b-916e-266fc1988ab2"
},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"year_counts = Counter(metadata_blb[\"date\"].values)\n",
"year_probs = {k: v / sum(year_counts.values()) for k, v in year_counts.items()}\n",
"pd.Series(year_probs, index=sorted(year_probs.keys()))[\"1800\":\"1900\"].plot()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "MltuiV8JTNcV"
},
"outputs": [],
"source": [
"year_counts_anno = Counter(annotations[\"annotator_normalised_date_pub\"].values)\n",
"year_probs_anno = {\n",
" k: v / sum(year_counts_anno.values()) for k, v in year_counts_anno.items()\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zTR9jXILTNcW"
},
"source": [
"Below, we define a small function that extracts the year from the date of publication field (if available) and returns it as an integer. We use it to drop entries that have no usable year."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ozwvgAArTNcW"
},
"outputs": [],
"source": [
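"def get_int(x):\n",
"    \"\"\"Return the year as an integer, or None if no year can be extracted.\"\"\"\n",
"    try:\n",
"        return int(x)\n",
"    except (TypeError, ValueError, OverflowError):\n",
"        pass\n",
"    try:\n",
"        # Ranges such as \"1850-1852\" yield their first year\n",
"        return int(x.split(\"-\")[0])\n",
"    except (AttributeError, ValueError):\n",
"        return None\n",
"\n",
"\n",
"# Keep only entries with a usable year\n",
"year_probs_anno = {str(k): v for k, v in year_probs_anno.items() if get_int(k)}"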
]
},
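{
"cell_type": "markdown",
"metadata": {},
"source": "To see what the year extraction does with typical inputs, here is a small self-contained sketch. The helper name `extract_year` and the sample values are hypothetical; the assumption is that dates appear either as a single year, as a range such as `1850-1852`, or as a missing or free-text value.

```python
def extract_year(value):
    '''Return the leading year of `value` as an int, or None if there is none.'''
    # Take the part before the first hyphen, so that ranges yield their start year
    text = str(value).split('-')[0].strip()
    return int(text) if text.isdigit() else None


# Hypothetical inputs: a year string, a range, an integer, free text, a missing value
samples = ['1786', '1850-1852', 1816, 'n.d.', float('nan')]
years = [extract_year(s) for s in samples]
print(years)
```

Returning `None` for unusable values makes it easy to filter them out before plotting."
},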
{
"cell_type": "markdown",
"metadata": {
"id": "NhXWgWVgexcb"
},
"source": [
"## Comparing publication dates across the annotated and full collection "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 283
},
"id": "bHguWCLqTNcW",
"outputId": "67e9bdfb-92e3-4527-baae-0b7094e303fc"
},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# Align both distributions on one sorted index so the x-axes match; years absent from a dataset get 0\n",
"year_df = pd.DataFrame({\"full collection\": year_probs, \"annotated sample\": year_probs_anno})\n",
"year_df.sort_index().loc[\"1800\":\"1900\"].fillna(0).plot()"
]
},
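{
"cell_type": "markdown",
"metadata": {},
"source": "Yearly probabilities for a small sample can be noisy. One option, sketched below as an assumption rather than as part of the original analysis, is to aggregate the yearly probabilities into decades before comparing the two curves. The function name `probs_by_decade` and the toy values are hypothetical.

```python
from collections import defaultdict


def probs_by_decade(year_probs):
    '''Sum probabilities keyed by year strings into decade buckets.'''
    decades = defaultdict(float)
    for year, prob in year_probs.items():
        decade = (int(year) // 10) * 10
        decades[decade] += prob
    return dict(sorted(decades.items()))


# Toy yearly probabilities keyed by strings, as in year_probs
toy_year_probs = {'1801': 0.25, '1809': 0.25, '1815': 0.5}
decade_probs = probs_by_decade(toy_year_probs)
print(decade_probs)
```

Keys that are not plain years would need to be filtered out first, since `int()` raises an error on them."
},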
{
"cell_type": "markdown",
"metadata": {
"id": "aSGEVQxXe3Ja"
},
"source": [
"## Conclusion \n",
"\n",
"Whilst this wasn't a rigorous assessment of the differences and similarities between the annotated sample and the full collection, it does give us a first impression of how closely they match. \n",
"\n",
"We are using our annotated data to train a model that we will then want to use on the whole collection (or at least a more significant part of it), so we want to be careful about how the two may differ. If there are substantial differences between them, our model may perform well on our training data but badly on new data. This concept is something we'll return to frequently in the remaining sections. "
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [],
"name": "sample_inspector_ii.ipynb",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 0
}