Using our new Hugging Face model

Now that we have our new Hugging Face model available on the model hub, we can use it as we would any other model on the hub 😀

Install our required packages

First we install our required packages.

!pip install git+https://github.com/huggingface/transformers
Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-drup2ss7
  Running command git clone -q https://github.com/huggingface/transformers /tmp/pip-req-build-drup2ss7
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.7/dist-packages (from transformers==4.15.0.dev0) (4.62.3)
Requirement already satisfied: importlib-metadata in /usr/local/lib/python3.7/dist-packages (from transformers==4.15.0.dev0) (4.8.2)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
     |████████████████████████████████| 3.3 MB 12.9 MB/s 
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.7/dist-packages (from transformers==4.15.0.dev0) (21.3)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.2.1-py3-none-any.whl (61 kB)
     |████████████████████████████████| 61 kB 573 kB/s 
Requirement already satisfied: filelock in /usr/local/lib/python3.7/dist-packages (from transformers==4.15.0.dev0) (3.4.0)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.7/dist-packages (from transformers==4.15.0.dev0) (2019.12.20)
Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages (from transformers==4.15.0.dev0) (2.23.0)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
     |████████████████████████████████| 596 kB 60.7 MB/s 
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.7/dist-packages (from transformers==4.15.0.dev0) (1.19.5)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
     |████████████████████████████████| 895 kB 56.4 MB/s 
Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.7/dist-packages (from huggingface-hub<1.0,>=0.1.0->transformers==4.15.0.dev0) (3.10.0.2)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from packaging>=20.0->transformers==4.15.0.dev0) (3.0.6)
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata->transformers==4.15.0.dev0) (3.6.0)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests->transformers==4.15.0.dev0) (2021.10.8)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests->transformers==4.15.0.dev0) (1.24.3)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests->transformers==4.15.0.dev0) (2.10)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests->transformers==4.15.0.dev0) (3.0.4)
Requirement already satisfied: six in /usr/local/lib/python3.7/dist-packages (from sacremoses->transformers==4.15.0.dev0) (1.15.0)
Requirement already satisfied: joblib in /usr/local/lib/python3.7/dist-packages (from sacremoses->transformers==4.15.0.dev0) (1.1.0)
Requirement already satisfied: click in /usr/local/lib/python3.7/dist-packages (from sacremoses->transformers==4.15.0.dev0) (7.1.2)
Building wheels for collected packages: transformers
  Building wheel for transformers (PEP 517) ... done
  Created wheel for transformers: filename=transformers-4.15.0.dev0-py3-none-any.whl size=3363835 sha256=a24943d58697e9444233dd50be9a5f8b24aefbd8b001457ac60bf9ebf7f405e2
  Stored in directory: /tmp/pip-ephem-wheel-cache-6za5soy6/wheels/35/2e/a7/d819e3310040329f0f47e57c9e3e7a7338aa5e74c49acfe522
Successfully built transformers
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstalling PyYAML-3.13:
      Successfully uninstalled PyYAML-3.13
Successfully installed huggingface-hub-0.2.1 pyyaml-6.0 sacremoses-0.0.46 tokenizers-0.10.3 transformers-4.15.0.dev0
!pip install datasets
Collecting datasets
  Downloading datasets-1.16.1-py3-none-any.whl (298 kB)
     |████████████████████████████████| 298 kB 12.5 MB/s eta 0:00:01
Requirement already satisfied: pandas in /usr/local/lib/python3.7/dist-packages (from datasets) (1.1.5)
Collecting xxhash
  Downloading xxhash-2.0.2-cp37-cp37m-manylinux2010_x86_64.whl (243 kB)
     |████████████████████████████████| 243 kB 72.1 MB/s 
Requirement already satisfied: packaging in /usr/local/lib/python3.7/dist-packages (from datasets) (21.3)
Requirement already satisfied: multiprocess in /usr/local/lib/python3.7/dist-packages (from datasets) (0.70.12.2)
Requirement already satisfied: pyarrow!=4.0.0,>=3.0.0 in /usr/local/lib/python3.7/dist-packages (from datasets) (3.0.0)
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2021.11.1-py3-none-any.whl (132 kB)
     |████████████████████████████████| 132 kB 93.1 MB/s 
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.7/dist-packages (from datasets) (1.19.5)
Requirement already satisfied: requests>=2.19.0 in /usr/local/lib/python3.7/dist-packages (from datasets) (2.23.0)
Collecting aiohttp
  Downloading aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
     |████████████████████████████████| 1.1 MB 84.2 MB/s 
Requirement already satisfied: huggingface-hub<1.0.0,>=0.1.0 in /usr/local/lib/python3.7/dist-packages (from datasets) (0.2.1)
Requirement already satisfied: dill in /usr/local/lib/python3.7/dist-packages (from datasets) (0.3.4)
Requirement already satisfied: importlib-metadata in /usr/local/lib/python3.7/dist-packages (from datasets) (4.8.2)
Requirement already satisfied: tqdm>=4.62.1 in /usr/local/lib/python3.7/dist-packages (from datasets) (4.62.3)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.7/dist-packages (from huggingface-hub<1.0.0,>=0.1.0->datasets) (3.10.0.2)
Requirement already satisfied: pyyaml in /usr/local/lib/python3.7/dist-packages (from huggingface-hub<1.0.0,>=0.1.0->datasets) (6.0)
Requirement already satisfied: filelock in /usr/local/lib/python3.7/dist-packages (from huggingface-hub<1.0.0,>=0.1.0->datasets) (3.4.0)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.7/dist-packages (from packaging->datasets) (3.0.6)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests>=2.19.0->datasets) (2021.10.8)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests>=2.19.0->datasets) (3.0.4)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests>=2.19.0->datasets) (1.24.3)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests>=2.19.0->datasets) (2.10)
Requirement already satisfied: charset-normalizer<3.0,>=2.0 in /usr/local/lib/python3.7/dist-packages (from aiohttp->datasets) (2.0.8)
Collecting asynctest==0.13.0
  Downloading asynctest-0.13.0-py3-none-any.whl (26 kB)
Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.7/dist-packages (from aiohttp->datasets) (21.2.0)
Collecting frozenlist>=1.1.1
  Downloading frozenlist-1.2.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (192 kB)
     |████████████████████████████████| 192 kB 90.7 MB/s 
Collecting multidict<7.0,>=4.5
  Downloading multidict-5.2.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (160 kB)
     |████████████████████████████████| 160 kB 89.5 MB/s 
Collecting async-timeout<5.0,>=4.0.0a3
  Downloading async_timeout-4.0.2-py3-none-any.whl (5.8 kB)
Collecting aiosignal>=1.1.2
  Downloading aiosignal-1.2.0-py3-none-any.whl (8.2 kB)
Collecting yarl<2.0,>=1.0
  Downloading yarl-1.7.2-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (271 kB)
     |████████████████████████████████| 271 kB 82.3 MB/s 
Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.7/dist-packages (from importlib-metadata->datasets) (3.6.0)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas->datasets) (2018.9)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas->datasets) (2.8.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.7.3->pandas->datasets) (1.15.0)
Installing collected packages: multidict, frozenlist, yarl, asynctest, async-timeout, aiosignal, fsspec, aiohttp, xxhash, datasets
Successfully installed aiohttp-3.8.1 aiosignal-1.2.0 async-timeout-4.0.2 asynctest-0.13.0 datasets-1.16.1 frozenlist-1.2.0 fsspec-2021.11.1 multidict-5.2.0 xxhash-2.0.2 yarl-1.7.2
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import torch
import pandas as pd

Loading our Model and Tokenizer

We create a tokenizer and a model from the pre-trained model we created earlier. We can use the handy Auto* classes for this.

tokenizer = AutoTokenizer.from_pretrained("BritishLibraryLabs/bl-books-genre")
model = AutoModelForSequenceClassification.from_pretrained("BritishLibraryLabs/bl-books-genre")
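If we want to double-check which labels the classifier can return, the mapping is stored on the model’s config; id2label is a standard attribute of transformers model configs.

# The label names the classifier can return (here Fiction / Non-fiction) live in the model config
model.config.id2label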

To make inference straightforward we can use a pipeline. Pipelines are intended to make using models for inference easy across a wide range of tasks. We need to tell the pipeline what kind of task we’re doing and pass in our model and tokenizer.

classifier = pipeline("text-classification", model=model, tokenizer=tokenizer, device=0)

If we have a GPU available we can set the device when creating the pipeline, as we did above with device=0. We can double-check which device the pipeline is using via the device attribute.

classifier.device
device(type='cuda', index=0)
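If you are not sure a GPU will be available every time this runs, one option is to pick the device at runtime instead of hard-coding device=0. This is a minimal sketch that reuses the model and tokenizer loaded above; the pipeline uses -1 to mean CPU.

# Pick the device at runtime: 0 = first GPU, -1 = CPU
device = 0 if torch.cuda.is_available() else -1
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer, device=device)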

Viewing Predictions

Let’s try it out on a couple of example book titles.

title = "The Coal Fields of South Wales"
classifier(title)
[{'label': 'Non-fiction', 'score': 0.9989659786224365}]
classifier("Oliver Twist")
[{'label': 'Fiction', 'score': 0.9980145692825317}]
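The pipeline will also happily accept a list of titles and return one prediction per input, which is handy for trying a few examples at once:

# Passing a list returns a list of predictions, one per title
classifier(["The Coal Fields of South Wales", "Oliver Twist"])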

We now essentially have a function that takes some text and returns a prediction. We can use this to predict against our full dataset. Since we’re not going to be doing this over and over, we won’t worry too much about the performance of our approach.

Getting the data we want to augment

Our original goal was to augment the data with additional genre labels. We can now grab that metadata to use for inference.

csv_url = "https://bl.iro.bl.uk/downloads/e4bf0f74-2c64-4322-93c7-0dcc5e5246da?locale=en"
dtypes = {
    "BL record ID": "string",
    "Type of resource": "category",
    "BNB number":"category",
    "ISBN":"category",
    "Name": "category",
    "Type of name": "category",
    "Country of publication": "category",
    "Place of publication": "category",
    "Genre": "category",
    "Dewey classification": "string",
    "BL record ID for physical resource": "string",
}
df = pd.read_csv(
    csv_url,
    dtype=dtypes,
    low_memory=False
)

🤗 datasets

We’ll again use one of the libraries from the 🤗 ecosystem to help make our prediction process efficient: a library called datasets. This library provides access to a huge number of existing datasets, but we can also use it as a way of processing datasets stored in other formats locally. We won’t go into the deep details of the library here.

We begin by importing the library

import datasets
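As an aside, the same library can pull an existing dataset straight from the Hugging Face Hub in one line. This is purely illustrative and not needed for this notebook (it will download the well-known imdb dataset):

# Purely illustrative: load an existing dataset from the hub (not needed below)
imdb = datasets.load_dataset("imdb", split="train")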

As an initial step we remove any records that are missing a title in the metadata.

df = df[~df.Title.isna()]

For inference we only want a subset of the DataFrame. We grab two columns and copy them to a new dataframe.

pred_df = df[['Title','BL record ID']].copy(deep=True)

Note

You might wonder why we include BL record ID when we don’t need this for prediction. We do this mainly because the datasets library expects a DataFrame not a Series (which is what we’d get with a single column).

Beyond this, when we use machine learning for these kinds of applications it is often important that we can link our new augmented data back to the existing metadata. BL record ID can also serve this function since it’s a unique ID.

Our model has a maximum length it can take as input. We could deal with this in various ways, but here we’ll take a slightly crude approach and just chop off any title text that extends beyond our model’s maximum length.

pred_df['text'] = pred_df['Title'].str[:512]
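If we want a sense of how much this crude cut-off actually costs us, we can count how many titles are longer than 512 characters (the assumption being that very few are):

# How many titles exceed the 512-character cut-off we just applied?
(pred_df['Title'].str.len() > 512).sum()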

We do a quick check to make sure our prediction DataFrame and our original DataFrame have equal lengths.

assert len(df) == len(pred_df)

We can now create a datasets.Dataset using the from_pandas method.

dataset = datasets.Dataset.from_pandas(pred_df)

Let’s take a look at what this looks like

dataset
Dataset({
    features: ['Title', 'BL record ID', 'text', '__index_level_0__'],
    num_rows: 1752072
})

You can see here that we have the columns we passed into our dataset plus some information about the number of rows. If we want to check again, we can assert that all the lengths are the same.

assert len(df) == len(pred_df) == len(dataset)

Note

Checking all of these lengths here might seem a bit silly since we’ve already seen what they are, but it can often be good to chuck the odd assert into our code. If we run this code in a script we won’t be able to check these things visually so easily. In that case having an assert statement acts as a ‘sort of test’ and will flag if something isn’t what we expect before we’ve spent hours training a model.
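For a script it can also help to give the assert a message, so that a failure tells us what went wrong; a small sketch:

# An assert with a message makes failures in a script easier to diagnose
assert len(df) == len(dataset), f"expected {len(df)} rows, got {len(dataset)}"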

Inference

We’re now ready for inference. We’ll import one more class to help with this: KeyDataset, which wraps a datasets.Dataset so that iterating over it yields just the values of a single column (here our text column) in the format the pipeline expects.

from transformers.pipelines.base import KeyDataset

We set a fairly high batch size. You may have to reduce this if you have less GPU memory available.

bs = 256

We now create a loop that will batch up our data and run it through our model. We then save the predictions in a new list.

all_preds = []
for pred in classifier(KeyDataset(dataset, "text"), batch_size=bs, truncation="only_first"):
    all_preds.append(pred)
CPU times: user 45min 48s, sys: 5.65 s, total: 45min 54s
Wall time: 42min 19s

This will take a bit of time but in the grand scheme of things isn’t too long to wait. If we were running this model in a ‘live’ system and really cared about latency we might need to explore other approaches.
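Since the loop does run for a while, one small optional tweak is to wrap it in tqdm so we can watch progress; tqdm is already installed as a dependency of transformers and the loop itself is unchanged.

from tqdm.auto import tqdm

all_preds = []
for pred in tqdm(classifier(KeyDataset(dataset, "text"), batch_size=bs, truncation="only_first"), total=len(dataset)):
    all_preds.append(pred)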

Again we can check the length of our predictions to see if it looks okay.

len(all_preds)
1752072

We now just do a bit of tidying to get our labels into the format we want. We’ll start by checking what the predictions look like

all_preds[0]
{'label': 'Non-fiction', 'score': 0.518064022064209}

We store these in a new column

df['raw_predictions'] = all_preds

We now grab the labels from our predictions

df["predicted_label"] = df['raw_predictions'].apply(lambda x: ["label"])

We now store the probabilities

df["prob"] = df['raw_predictions'].apply(lambda x: x["score"])
df['prob']
0          0.518064
1          0.954458
2          0.955602
3          0.997447
4          0.820073
             ...   
1752073    0.999866
1752074    0.992083
1752075    0.998614
1752076    0.524758
1752077    0.999718
Name: prob, Length: 1752072, dtype: float64
The pipeline only gives us the score for the predicted label, so we convert this into explicit fiction and non-fiction probabilities.

def get_fiction_prob(x):
    # The reported score belongs to the predicted label, so flip it for non-fiction predictions
    if x.predicted_label == "Fiction":
        return x.prob
    else:
        return 1 - x.prob

df["fiction_probs"] = df.apply(get_fiction_prob, axis=1)
df["non_fiction_probs"] = 1 - df["fiction_probs"]
df = df.drop(columns=["prob"])
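As a quick spot check on the new columns we can confirm that the two probabilities sum to one and eyeball a few rows:

# Spot check: the two probability columns should sum to 1 for every row
assert (df['fiction_probs'] + df['non_fiction_probs']).round(6).eq(1).all()
df[['Title', 'predicted_label', 'fiction_probs', 'non_fiction_probs']].head()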

Saving our updated metadata

Now we can save our results to a CSV file.

df.to_csv("bl_books_w_genre_transformer.csv")

Conclusion

We have now got to the point where we have a version of the Microsoft Books metadata with a bunch of additional genre metadata 🦾