{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "tXis3tfFTNcG" }, "source": [ "# Sample Inspector (Part II)\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "NJJnti00TNcJ" }, "source": [ "\n", "\n", "This notebook compares the Microsoft Digitised Books collection to the genre annotations. We'd like to know if the annotated sample deviates from the digital collection and look at aspects that _should_ remains stable across both datasets." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "eDMj08-5TNcK" }, "outputs": [], "source": [ "%matplotlib inline\n", "import json\n", "import pandas as pd\n", "from collections import Counter" ] }, { "cell_type": "markdown", "metadata": { "id": "2gQ5Gj36djCZ" }, "source": [ "## Data processing" ] }, { "cell_type": "markdown", "metadata": { "id": "TxyORvXGTNcL" }, "source": [ "First, we load the metadata of the Microsoft Digitised Books collection. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 555 }, "id": "eLGGRMUPTNcM", "outputId": "b2f91cdf-1c41-4568-af74-cd85fd3694de" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
BL record IDType of resourceNameDates associated with nameType of nameRoleAll namesTitleVariant titlesSeries titleNumber within seriesCountry of publicationPlace of publicationPublisherDate of publicationEditionPhysical descriptionDewey classificationBL shelfmarkTopicsGenreLanguagesNotesBL record ID for physical resource
0014602826MonographYearsley, Ann1753-1806personNaNMore, Hannah, 1745-1833 [person] ; Yearsley, A...Poems on several occasions [With a prefatory l...NaNNaNNaNEnglandLondonNaN1786Fourth edition MANUSCRIPT noteNaNNaNDigital Store 11644.d.32NaNNaNEnglishNaN3996603
1014602830MonographA, T.NaNpersonNaNOldham, John, 1653-1683 [person] ; A, T. [person]A Satyr against Vertue. (A poem: supposed to b...NaNNaNNaNEnglandLondonNaN1679NaN15 pages (4°)NaNDigital Store 11602.ee.10. (2.)NaNNaNEnglishNaN1143
2014602831MonographNaNNaNNaNNaNNaNThe Aeronaut, a poem; founded almost entirely,...NaNNaNNaNIrelandDublinRichard Milliken1816NaN17 pages (8°)NaNDigital Store 992.i.12. (3.)Dublin (Ireland)NaNEnglishNaN22782
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ], "text/plain": [ " BL record ID Type of resource ... Notes BL record ID for physical resource\n", "0 014602826 Monograph ... NaN 3996603\n", "1 014602830 Monograph ... NaN 1143\n", "2 014602831 Monograph ... NaN 22782\n", "\n", "[3 rows x 24 columns]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "metadata_blb = pd.read_csv(\n", " \"https://bl.iro.bl.uk/downloads/e1be1324-8b1a-4712-96a7-783ac209ddef?locale=en\",\n", " dtype={\"BL record ID\": \"string\"},\n", " parse_dates=False,\n", ")\n", "metadata_blb.head(3)" ] }, { "cell_type": "markdown", "metadata": { "id": "4aMzk5XheLr9" }, "source": [ "## Computing the title's first character." ] }, { "cell_type": "markdown", "metadata": { "id": "hJ8L_EXGTNcM" }, "source": [ "A simple test for comparing the sample to the Microsoft Digitised Books metadata is computing the probabilities of the title's first character. This distribution should look similar for both datasets. Below we create a function `first_alpha_char` that returns the first character of a string. If none could be found, it returns a hashtag." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "CPLtjbSUTNcN" }, "outputs": [], "source": [ "def first_alpha_char(x):\n", " \"\"\"returns the first lowercased alphatical character of a string\"\"\"\n", " try:\n", " x = x[0].lower()\n", " x = \"\".join([c for c in x if c.isalpha()])\n", " return x[0]\n", " except:\n", " return \"#\"" ] }, { "cell_type": "markdown", "metadata": { "id": "FZYJQODXTNcO" }, "source": [ "With `.apply()` we can extract to first character from the `title` column." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "AoxWf4e9TNcP" }, "outputs": [], "source": [ "metadata_blb[\"first_alpha_char\"] = metadata_blb.Title.apply(first_alpha_char)" ] }, { "cell_type": "markdown", "metadata": { "id": "ISXwu6L-TNcP" }, "source": [ "Next, we use `Counter()` to count all elements in the `first_alpha_char` column and turn the counts into probabilities by dividing each count by the total." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "59konXvgTNcQ" }, "outputs": [], "source": [ "# first_alpha_char never returns None, so the column can be counted directly\n", "char_count = Counter(metadata_blb[\"first_alpha_char\"])\n", "total = sum(char_count.values())  # normaliser, computed once\n", "char_probs = {k: v / total for k, v in char_count.items()}" ] }, { "cell_type": "markdown", "metadata": { "id": "HKAGHnMQeUe_" }, "source": [ "## Plotting distribution of first title character" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 283 }, "id": "SMfFfhNTTNcR", "outputId": "f4b58450-0a7f-47dc-865d-5bc226bc338b" }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "pd.Series(char_probs, index=sorted(char_probs.keys())).plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 283 }, "id": "h-9kbTFwTNcR", "outputId": "80ad3166-989f-4a39-c432-992645cf0cda" }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "pd.Series(char_probs, index=sorted(char_probs.keys()))[\"a\":\"z\"].plot()" ] }, { "cell_type": "markdown", "metadata": { "id": "uHRwDDmgTNcR" }, "source": [ "Next, we apply the same procedure to the annotated sample. We'll start by loading the annotated sample." ] }, { "cell_type": "markdown", "metadata": { "id": "NvIeNXJAeeB8" }, "source": [ "## Comparing to the annotated data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 355 }, "id": "Mlqog-T3Tdq1", "outputId": "f73c425f-3f1f-47bc-e2f6-23609b0882fe" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py:2718: DtypeWarning: Columns (17,35,37) have mixed types.Specify dtype option on import or set low_memory=False.\n", " interactivity=interactivity, compiler=compiler, result=result)\n" ] }, { "data": { "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
BL record IDType of resourceNameDates associated with nameType of nameRoleAll namesTitleVariant titlesSeries titleNumber within seriesCountry of publicationPlace of publicationPublisherDate of publicationEditionPhysical descriptionDewey classificationBL shelfmarkTopicsGenreLanguagesNotesBL record ID for physical resourceclassification_iduser_idcreated_atsubject_idsannotator_date_pubannotator_normalised_date_pubannotator_edition_statementannotator_genreannotator_FAST_genre_termsannotator_FAST_subject_termsannotator_commentsannotator_main_languageannotator_other_languages_summariesannotator_summaries_languageannotator_translationannotator_original_languageannotator_publisherannotator_place_pubannotator_countryannotator_titleLink to digitised bookannotated
0014602826MonographYearsley, Ann1753-1806personNaNMore, Hannah, 1745-1833 [person] ; Yearsley, A...Poems on several occasions [With a prefatory l...NaNNaNNaNEnglandLondonNaN1786Fourth edition MANUSCRIPT noteNaNNaNDigital Store 11644.d.32NaNNaNEnglishNaN3996603NaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNNaNFalse
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ], "text/plain": [ " BL record ID Type of resource ... Link to digitised book annotated\n", "0 014602826 Monograph ... NaN False\n", "\n", "[1 rows x 46 columns]" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "annotations = pd.read_csv(\n", " \"https://bl.iro.bl.uk/downloads/36c7cd20-c8a7-4495-acbe-469b9132c6b1?locale=en\",\n", " dtype={\"BL record ID\": str},\n", ")\n", "annotations.head(1)" ] }, { "cell_type": "markdown", "metadata": { "id": "fk5UuvpZXA1q" }, "source": [ "This dataset includes both annotated data and non-annotated data. Since we only want the annotated data, we can filter these out using the `annotated` column which contains a flag to indicate if the data has been annotated. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "07WVi70SW-w5" }, "outputs": [], "source": [ "annotations = annotations[annotations[\"annotated\"] == True]" ] }, { "cell_type": "markdown", "metadata": { "id": "zaIK-3w3XW7i" }, "source": [ "Because of the way in which the annotations were collected we have some duplicates. There are different ways in which we can deal with these duplicates but here we will just drop the duplicates for the `Title` column. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "cbwmG3KbYJ3Z" }, "outputs": [], "source": [ "annotations = annotations.drop_duplicates(subset=\"Title\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "lxqJUyUhTNcS", "outputId": "e66d3f0d-29d7-4c9f-cd13-95952b5e3c39" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " \"\"\"Entry point for launching an IPython kernel.\n" ] } ], "source": [ "annotations[\"first_alpha_char\"] = annotations.Title.apply(first_alpha_char)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Z0Jj1_rZTNcT" }, "outputs": [], "source": [ "char_count_anno = Counter(\n", " annotations[annotations[\"first_alpha_char\"] != None].first_alpha_char\n", ")\n", "char_probs_anno = {\n", " k: v / sum(char_count_anno.values()) for k, v in char_count_anno.items()\n", "}" ] }, { "cell_type": "markdown", "metadata": { "id": "QbvjzIeMeh-g" }, "source": [ "## Plotting a comparison between the annotated subset and full collection " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 283 }, "id": "BHX17wkUTNcU", "outputId": "b5c121d3-8edd-4935-cdc1-8bb91c560cc2" }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "pd.Series(char_probs, index=sorted(char_probs.keys())).plot()\n", "pd.Series(char_probs_anno, index=sorted(char_probs_anno.keys())).plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Nr2r6u61TNcU", "outputId": "b059861c-af6a-4e22-8506-64f7b46a830a" }, "outputs": [ { "data": { "text/plain": [ "array([nan, 'lat\\nger', 'gmh\\nlat\\nger', 'ger', 'ger\\ngmh\\nlat',\n", " 'ger\\nfre', 'eng', 'lat\\ngmh\\nger', 'ger\\neng', 'ger\\nlat',\n", " 'lat\\nfre', 'dut\\nfre\\nspa', 'fre\\nger', 'ger\\nfrs\\nlat',\n", " 'ger\\ngmh', 'lat\\ndut\\nfrm\\nfre', 'ger\\nspa', 'eng\\nfre'],\n", " dtype=object)" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "annotations[\"annotator_main_language\"].unique()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "EZyYLA4KTNcU" }, "outputs": [], "source": [ "annotations[\"english\"] = annotations[\"annotator_main_language\"].apply(\n", " lambda x: str(x).lower().startswith(\"eng\")\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "Pz-G_ALxeqDY" }, "source": [ "## Compare distribution of publication date" ] }, { "cell_type": "markdown", "metadata": { "id": "hCyul5VqTNcV" }, "source": [ "Lastly, we can compare the distribution over time to see if the sample is biased towards are certain period." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "8k_qgv4LY8pM" }, "outputs": [], "source": [ "metadata_blb = metadata_blb[metadata_blb[\"Date of publication\"].notna()].copy(deep=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "IcAm0xViZkHC" }, "outputs": [], "source": [ "metadata_blb[\"date\"] = metadata_blb[\"Date of publication\"].str.split(\"-\").str[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 283 }, "id": "y5-MSel-TNcV", "outputId": "ea4e123b-253a-4b8b-916e-266fc1988ab2" }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "year_counts = Counter(metadata_blb[\"date\"].values)\n", "year_probs = {k: v / sum(year_counts.values()) for k, v in year_counts.items()}\n", "pd.Series(year_probs, index=sorted(year_probs.keys()))[\"1800\":\"1900\"].plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "MltuiV8JTNcV" }, "outputs": [], "source": [ "year_counts_anno = Counter(annotations[\"annotator_normalised_date_pub\"].values)\n", "year_probs_anno = {\n", " k: v / sum(year_counts_anno.values()) for k, v in year_counts_anno.items()\n", "}" ] }, { "cell_type": "markdown", "metadata": { "id": "zTR9jXILTNcW" }, "source": [ "Below, we made a small function to manipulate the date of publication field by extracting the year (if available) and returning it as an integer." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ozwvgAArTNcW" }, "outputs": [], "source": [ "def get_int(x):\n", " \"\"\"return year is integer\"\"\"\n", " try:\n", " return int(x)\n", " except:\n", " pass\n", " try:\n", " return int(x.split(\"-\")[0])\n", " except:\n", " False\n", "\n", "\n", "year_probs_anno = {str(k): v for k, v in year_probs_anno.items() if get_int(k)}" ] }, { "cell_type": "markdown", "metadata": { "id": "NhXWgWVgexcb" }, "source": [ "## Comparing publication dates across the annotated and full collection " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 283 }, "id": "bHguWCLqTNcW", "outputId": "67e9bdfb-92e3-4527-baae-0b7094e303fc" }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "pd.Series(year_probs, index=sorted(year_probs.keys()))[\"1800\":\"1900\"].plot()\n", "pd.Series(year_probs_anno, index=sorted(year_probs_anno.keys()))[\"1800\":\"1900\"].plot()" ] }, { "cell_type": "markdown", "metadata": { "id": "aSGEVQxXe3Ja" }, "source": [ "## Conclusion \n", "\n", "Whilst this wasn't a super rigorous assessment of the potential differences and similarities between the collections, it does give us some sense of this. \n", "\n", "Because we are using our annotated data to train a model that we will then want to use on the whole collection (or at least a more significant part of the collection). We want to be careful about how both of these collections may differ. If there are substantial differences between the two, our model may perform well on our training data but badly on new data. This concept is something we'll return to frequently in the remaining sections. " ] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "sample_inspector_ii.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 0 }