Sample Inspector (Part II)¶

This notebook compares the Microsoft Digitised Books collection to the genre annotations. We’d like to know whether the annotated sample deviates from the digital collection, so we look at aspects that should remain stable across both datasets.

%matplotlib inline
import json
import pandas as pd
from collections import Counter

Data processing¶

First, we load the metadata of the Microsoft Digitised Books collection.

metadata_blb = pd.read_csv(
    "https://bl.iro.bl.uk/downloads/e1be1324-8b1a-4712-96a7-783ac209ddef?locale=en",
    dtype={"BL record ID": "string"},
    parse_dates=False,
)
metadata_blb.head(3)

	BL record ID	Type of resource	Name	Dates associated with name	Type of name	Role	All names	Title	Variant titles	Series title	Number within series	Country of publication	Place of publication	Publisher	Date of publication	Edition	Physical description	Dewey classification	BL shelfmark	Topics	Genre	Languages	Notes	BL record ID for physical resource
0	014602826	Monograph	Yearsley, Ann	1753-1806	person	NaN	More, Hannah, 1745-1833 [person] ; Yearsley, A...	Poems on several occasions [With a prefatory l...	NaN	NaN	NaN	England	London	NaN	1786	Fourth edition MANUSCRIPT note	NaN	NaN	Digital Store 11644.d.32	NaN	NaN	English	NaN	3996603
1	014602830	Monograph	A, T.	NaN	person	NaN	Oldham, John, 1653-1683 [person] ; A, T. [person]	A Satyr against Vertue. (A poem: supposed to b...	NaN	NaN	NaN	England	London	NaN	1679	NaN	15 pages (4°)	NaN	Digital Store 11602.ee.10. (2.)	NaN	NaN	English	NaN	1143
2	014602831	Monograph	NaN	NaN	NaN	NaN	NaN	The Aeronaut, a poem; founded almost entirely,...	NaN	NaN	NaN	Ireland	Dublin	Richard Milliken	1816	NaN	17 pages (8°)	NaN	Digital Store 992.i.12. (3.)	Dublin (Ireland)	NaN	English	NaN	22782

Computing the title’s first character.¶

A simple test for comparing the sample to the Microsoft Digitised Books metadata is to compute the probability of each character appearing as the first character of a title. This distribution should look similar for both datasets. Below we create a function first_alpha_char that returns the first alphabetical character of a string, lowercased. If none can be found (for example, when the title is missing), it returns a hash sign (#).

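def first_alpha_char(x):
    """Return the first alphabetical character of a string, lowercased."""
    try:
        # Keep only alphabetical characters, then take the first one.
        x = "".join([c for c in x.lower() if c.isalpha()])
        return x[0]
    except (AttributeError, TypeError, IndexError):
        # Missing values (NaN) and strings without any letters fall back to "#".
        return "#"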

With .apply() we can extract the first character from the Title column.

metadata_blb["first_alpha_char"] = metadata_blb.Title.apply(first_alpha_char)

Next, we use Counter() to count all elements in the first_alpha_char column and turn the counts into probabilities.

# first_alpha_char never returns None, so no filtering is needed here.
char_count = Counter(metadata_blb["first_alpha_char"])
total_chars = sum(char_count.values())
char_probs = {k: v / total_chars for k, v in char_count.items()}
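The same counts-to-probabilities step can also be sketched with pandas alone. The snippet below is a minimal illustration on made-up titles, not the notebook's own method: the `titles` series is hypothetical, and it only looks at the very first character rather than searching for the first letter.

```python
import pandas as pd

# Toy titles standing in for the Title column (hypothetical data).
titles = pd.Series(["Poems", "A Satyr", "The Aeronaut", "poetry", None])

# First character, lowercased; missing titles become "#".
first_chars = titles.str[0].str.lower().fillna("#")

# normalize=True divides each count by the total, giving probabilities.
toy_probs = first_chars.value_counts(normalize=True).sort_index()
print(toy_probs)
```

Because value_counts(normalize=True) returns a Series, the result can be plotted directly with .plot().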

Plotting distribution of first title character¶

pd.Series(char_probs, index=sorted(char_probs.keys())).plot()


pd.Series(char_probs, index=sorted(char_probs.keys()))["a":"z"].plot()


Next, we apply the same procedure to the annotated sample, starting by loading it.

Comparing to the annotated data¶

annotations = pd.read_csv(
    "https://bl.iro.bl.uk/downloads/36c7cd20-c8a7-4495-acbe-469b9132c6b1?locale=en",
    dtype={"BL record ID": str},
    # Read the file in one pass so column dtypes are inferred consistently,
    # which avoids the mixed-type DtypeWarning.
    low_memory=False,
)
annotations.head(1)

	BL record ID	Type of resource	Name	Dates associated with name	Type of name	Role	All names	Title	Variant titles	Series title	Number within series	Country of publication	Place of publication	Publisher	Date of publication	Edition	Physical description	Dewey classification	BL shelfmark	Topics	Genre	Languages	Notes	BL record ID for physical resource	classification_id	user_id	created_at	subject_ids	annotator_date_pub	annotator_normalised_date_pub	annotator_edition_statement	annotator_genre	annotator_FAST_genre_terms	annotator_FAST_subject_terms	annotator_comments	annotator_main_language	annotator_other_languages_summaries	annotator_summaries_language	annotator_translation	annotator_original_language	annotator_publisher	annotator_place_pub	annotator_country	annotator_title	Link to digitised book	annotated
0	014602826	Monograph	Yearsley, Ann	1753-1806	person	NaN	More, Hannah, 1745-1833 [person] ; Yearsley, A...	Poems on several occasions [With a prefatory l...	NaN	NaN	NaN	England	London	NaN	1786	Fourth edition MANUSCRIPT note	NaN	NaN	Digital Store 11644.d.32	NaN	NaN	English	NaN	3996603	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	False

This dataset includes both annotated and non-annotated records. Since we only want the annotated ones, we keep the rows where the annotated column, a flag indicating whether a record has been annotated, is True.

annotations = annotations[annotations["annotated"] == True]

Because of the way in which the annotations were collected, we have some duplicates. There are different ways of dealing with them, but here we simply keep one row per distinct value in the Title column.

annotations = annotations.drop_duplicates(subset="Title")
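To make the effect of this choice concrete, here is a small sketch on a made-up table. The rows are hypothetical, and deduplicating on "BL record ID" is shown only as one possible alternative, assuming that identifier is unique per book; it is not what the notebook does.

```python
import pandas as pd

# Hypothetical records: record 001 was annotated twice, and record 002
# is a different book that happens to share the same title.
toy = pd.DataFrame(
    {
        "BL record ID": ["001", "001", "002", "003"],
        "Title": ["Poems", "Poems", "Poems", "Essays"],
        "annotator_genre": ["Fiction", "Non-fiction", "Fiction", "Non-fiction"],
    }
)

# Dropping on Title keeps the first row for each distinct title.
by_title = toy.drop_duplicates(subset="Title")

# Dropping on the record identifier keeps one row per record, so distinct
# books with identical titles both survive.
by_record = toy.drop_duplicates(subset="BL record ID")

print(len(by_title), len(by_record))
```

Dropping on Title is the stricter of the two: it also removes distinct books that share a title, which is worth keeping in mind when reading the counts that follow.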

# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
annotations = annotations.assign(
    first_alpha_char=annotations.Title.apply(first_alpha_char)
)

# first_alpha_char never returns None, so no filtering is needed here.
char_count_anno = Counter(annotations["first_alpha_char"])
total_chars_anno = sum(char_count_anno.values())
char_probs_anno = {k: v / total_chars_anno for k, v in char_count_anno.items()}

Plotting a comparison between the annotated subset and full collection¶

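# Align both distributions on the same characters before plotting;
# characters missing from one dataset get probability 0.
pd.DataFrame(
    {"full collection": char_probs, "annotated sample": char_probs_anno}
).fillna(0).sort_index().plot()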

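A plot gives a visual impression; a single number can complement it. The sketch below computes the total variation distance between two probability dictionaries. This measure is not part of the original analysis, and the values in `full_probs` and `sample_probs` are invented for illustration.

```python
import pandas as pd

# Hypothetical probabilities for two datasets; "z" is absent from the sample.
full_probs = {"a": 0.5, "t": 0.3, "z": 0.2}
sample_probs = {"a": 0.6, "t": 0.4}

# Align on the union of keys; characters missing from one side count as 0.
compared = pd.DataFrame({"full": full_probs, "sample": sample_probs}).fillna(0.0)
compared["abs_diff"] = (compared["full"] - compared["sample"]).abs()

# Total variation distance: half the sum of absolute differences.
# 0 means identical distributions, 1 means no overlap at all.
tv_distance = 0.5 * compared["abs_diff"].sum()
print(round(tv_distance, 3))
```

Applied to char_probs and char_probs_anno, a value close to 0 would suggest that the sample tracks the full collection on this feature.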

annotations["annotator_main_language"].unique()

array([nan, 'lat\nger', 'gmh\nlat\nger', 'ger', 'ger\ngmh\nlat',
       'ger\nfre', 'eng', 'lat\ngmh\nger', 'ger\neng', 'ger\nlat',
       'lat\nfre', 'dut\nfre\nspa', 'fre\nger', 'ger\nfrs\nlat',
       'ger\ngmh', 'lat\ndut\nfrm\nfre', 'ger\nspa', 'eng\nfre'],
      dtype=object)

annotations["english"] = annotations["annotator_main_language"].apply(
    lambda x: str(x).lower().startswith("eng")
)

Compare distribution of publication date¶

Lastly, we can compare the distribution of publication dates to see if the sample is biased towards a certain period.

metadata_blb = metadata_blb[metadata_blb["Date of publication"].notna()].copy(deep=True)

metadata_blb["date"] = metadata_blb["Date of publication"].str.split("-").str[0]

year_counts = Counter(metadata_blb["date"].values)
year_probs = {k: v / sum(year_counts.values()) for k, v in year_counts.items()}
pd.Series(year_probs, index=sorted(year_probs.keys()))["1800":"1900"].plot()


year_counts_anno = Counter(annotations["annotator_normalised_date_pub"].values)
year_probs_anno = {
    k: v / sum(year_counts_anno.values()) for k, v in year_counts_anno.items()
}

Below, we define a small function that parses the date of publication field: it extracts the year (if available) and returns it as an integer. We use it to discard entries without a usable year.

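def get_int(x):
    """Return the year as an integer, or None if it cannot be parsed."""
    try:
        return int(x)
    except (TypeError, ValueError):
        pass
    try:
        # Handle ranges such as "1850-1855" by taking the first year.
        return int(x.split("-")[0])
    except (AttributeError, TypeError, ValueError):
        return None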


year_probs_anno = {str(k): v for k, v in year_probs_anno.items() if get_int(k)}
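As an alternative sketch, the year can also be extracted with vectorised pandas string methods instead of a Python function. The `raw_dates` values below are made up, and the four-digit pattern is an assumption about how years appear in the field.

```python
import pandas as pd

# Toy values standing in for a date-of-publication column (hypothetical data).
raw_dates = pd.Series(["1850", "1861-1865", None, "n.d.", "1872"])

# Take the first run of four digits; values without one become NaN.
years = raw_dates.str.extract(r"(\d{4})", expand=False)

# Convert to a nullable integer dtype so missing years stay missing.
years = pd.to_numeric(years, errors="coerce").astype("Int64")
print(years.tolist())
```

Missing or unparseable dates stay as missing values rather than being dropped, so they can be counted before filtering.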

Comparing publication dates across the annotated and full collection¶

# Align both distributions on the same years before plotting;
# years missing from one dataset get probability 0.
pd.DataFrame(
    {"full collection": year_probs, "annotated sample": year_probs_anno}
).fillna(0).sort_index().loc["1800":"1900"].plot()


Conclusion¶

Whilst this wasn’t a rigorous assessment, it does give us a first sense of how the annotated sample differs from, and resembles, the full collection.

Because we are using our annotated data to train a model that we will then want to use on the whole collection (or at least a more significant part of it), we want to be careful about how the two may differ. If there are substantial differences between them, our model may perform well on our training data but badly on new data. This concept is something we’ll return to frequently in the remaining sections.

Classifying 19th Century British Library books using Crowdsourcing and Machine Learning