Fine tuning our fastai model with new data
We have now created an expanded version of our training data using Snorkel. We can see if this improves the performance of our model compared to our previous attempt.
We could do this by creating a new model and training from scratch. However, our previous model was already doing quite well so it might make sense to try and benefit from what the model has already learned and fine-tune our previous model instead.
Why fine tune our model?¶
Updating our previous model may also allow us to train for less time, since the model isn’t starting from ‘scratch’. In particular, we saw previously that training a language model for our data was time consuming, so we may want to limit how much we repeat this step. In some contexts you will often want to train a model and then update it with new data. This new data may, for example, be arriving as part of a crowdsourcing task. Instead of waiting for all of the data to arrive before training a model, you will often want to train as you get new data. In other settings you may have data that is temporal. For example, a predictive model that tries to predict some outcome for the next month will at some point have the next month’s training data available.
We’ll start by installing the same version of fastai we used previously.
!pip install fastai==2.5.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchtext 0.11.0 requires torch==1.10.0, but you have torch 1.9.1 which is incompatible.
Successfully installed fastai-2.5.2 fastcore-1.3.27 fastdownload-0.0.5 torch-1.9.1 torchvision-0.10.1
We then import the necessary packages
import pandas as pd
import torch
from fastai.text.all import *
Loading our Snorkel-generated data¶
We’ll define some of the dtypes for the data we generated using Snorkel previously.
dtypes = {
    "BL record ID": "string",
    "Type of resource": "category",
    "Name": "category",
    "Type of name": "category",
    "Country of publication": "category",
    "Place of publication": "category",
    "Genre": "category",
    "Dewey classification": "string",
    "BL record ID for physical resource": "string",
    "annotator_main_language": "category",
    "annotator_summaries_language": "string",
}
We can now grab the new Snorkel data
!wget https://transfer.sh/H3zJcc/snorkel_train.csv
--2021-11-15 18:18:25-- https://transfer.sh/H3zJcc/snorkel_train.csv
Resolving transfer.sh (transfer.sh)... 144.76.136.153
Connecting to transfer.sh (transfer.sh)|144.76.136.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10211140 (9.7M) [text/csv]
Saving to: ‘snorkel_train.csv’
snorkel_train.csv 100%[===================>] 9.74M 4.22MB/s in 2.3s
2021-11-15 18:18:29 (4.22 MB/s) - ‘snorkel_train.csv’ saved [10211140/10211140]
df = pd.read_csv("snorkel_train.csv", dtype=dtypes)
df.head(2)
 | BL record ID | Type of resource | Name | Dates associated with name | Type of name | Role | All names | Title | Variant titles | Series title | Number within series | Country of publication | Place of publication | Publisher | Date of publication | Edition | Physical description | Dewey classification | BL shelfmark | Topics | Genre | Languages | Notes | BL record ID for physical resource | classification_id | user_id | created_at | subject_ids | annotator_date_pub | annotator_normalised_date_pub | annotator_edition_statement | annotator_genre | annotator_FAST_genre_terms | annotator_FAST_subject_terms | annotator_comments | annotator_main_language | annotator_other_languages_summaries | annotator_summaries_language | annotator_translation | annotator_original_language | annotator_publisher | annotator_place_pub | annotator_country | annotator_title | Link to digitised book | annotated | is_valid | text_len | fiction_prob | non_fiction_prob | snorkel_genre | snorkel_label |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 014616539 | Monograph | NaN | NaN | NaN | NaN | Hazlitt, William Carew, 1834-1913 [person] | The Baron's Daughter. A ballad by the author of Poetical Recreations [i.e. William C. Hazlitt] . F.P | Single Works | NaN | NaN | Scotland | Edinburgh | Ballantyne, Hanson | 1877 | NaN | 20 pages (4°) | <NA> | Digital Store 11651.h.6 | NaN | NaN | English | NaN | 000206670 | 263940444.0 | 3.0 | 2020-07-27 07:35:13 UTC | 44330917.0 | 1877 | 1877 | NONE | Fiction | 655 7 $aPoetry$2fast$0(OCoLC)fst01423828 | NONE | NaN | NaN | No | <NA> | No | NaN | Ballantyne Hanson & Co. | Edinburgh | stk | The Baron's Daughter. A ballad by the author of Poetical Recreations [i.e. William C. Hazlitt] . F.P | http://access.bl.uk/item/viewer/ark:/81055/vdc_00000002F718 | True | False | 100 | 0.999940 | 0.000060 | Fiction | NaN |
1 | 014616686 | Monograph | Earle, John Charles | NaN | person | NaN | Earle, John Charles [person] | Maximilian, and other poems, etc | NaN | NaN | NaN | England | London | NaN | 1868 | NaN | NaN | <NA> | Digital Store 11648.i.8 | NaN | Poetry or verse | English | NaN | 001025896 | 265570129.0 | 3.0 | 2020-08-03 07:25:30 UTC | 44331725.0 | 1868 | 1868 | NONE | Fiction | 655 7 $aPoetry$2fast$0(OCoLC)fst01423828 | NONE | NaN | NaN | No | <NA> | No | NaN | Burns, Oates, & Co. | London | enk | Maximilian, and other poems, etc | http://access.bl.uk/item/viewer/ark:/81055/vdc_00000002F2AA | True | False | 32 | 0.982546 | 0.017454 | Fiction | NaN |
Downloading our previous model¶
Since we’re going to try and fine tune the model we created previously we need to have it available. We can grab it using wget
!wget -O 20210928-model.pkl https://zenodo.org/record/5245175/files/20210928-model.pkl?download=1
--2021-11-15 18:18:30-- https://zenodo.org/record/5245175/files/20210928-model.pkl?download=1
Resolving zenodo.org (zenodo.org)... 137.138.76.77
Connecting to zenodo.org (zenodo.org)|137.138.76.77|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 158529715 (151M) [application/octet-stream]
Saving to: ‘20210928-model.pkl’
20210928-model.pkl 100%[===================>] 151.19M 15.1MB/s in 9.7s
2021-11-15 18:18:41 (15.6 MB/s) - ‘20210928-model.pkl’ saved [158529715/158529715]
We can use fastai’s load_learner function to load our model.
learn = load_learner("20210928-model.pkl", cpu=False)
We can make sure our model is using the GPU
learn.cuda()
SequentialRNN(
  (0): SentenceEncoder(
    (module): AWD_LSTM(
      (encoder): Embedding(17392, 400, padding_idx=1)
      (encoder_dp): EmbeddingDropout(
        (emb): Embedding(17392, 400, padding_idx=1)
      )
      (rnns): ModuleList(
        (0): WeightDropout(
          (module): LSTM(400, 1152, batch_first=True)
        )
        (1): WeightDropout(
          (module): LSTM(1152, 1152, batch_first=True)
        )
        (2): WeightDropout(
          (module): LSTM(1152, 400, batch_first=True)
        )
      )
      (input_dp): RNNDropout()
      (hidden_dps): ModuleList(
        (0): RNNDropout()
        (1): RNNDropout()
        (2): RNNDropout()
      )
    )
  )
  (1): PoolingLinearClassifier(
    (layers): Sequential(
      (0): LinBnDrop(
        (0): BatchNorm1d(1200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (1): Dropout(p=0.2, inplace=False)
        (2): Linear(in_features=1200, out_features=50, bias=False)
        (3): ReLU(inplace=True)
      )
      (1): LinBnDrop(
        (0): BatchNorm1d(50, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (1): Dropout(p=0.1, inplace=False)
        (2): Linear(in_features=50, out_features=2, bias=False)
      )
    )
  )
)
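When we export our fastai learner it includes the vocab for the model. This is made up of two vocabs: the first is the token vocab that came from the language model, and the second holds the class labels (we use it later as the target names in the classification report).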
learn.dls.vocab[0]
['xxunk',
'xxpad',
'xxbos',
'xxeos',
'xxfld',
'xxrep',
'xxwrep',
'xxup',
'xxmaj',
',',
'the',
'.',
'of',
'a',
'and',
'…',
'in',
'[',
']',
'by',
'with',
'etc',
'-',
'de',
'to',
"'",
':',
';',
'from',
"'s",
'an',
'on',
'or',
'history',
'la',
'novel',
'poems',
'et',
'edition',
'illustrations',
'(',
')',
'j.',
'und',
'der',
'other',
'des',
'for',
'illustrated',
'new',
'verse',
'notes',
'edited',
'life',
'author',
'w.',
'h.',
'von',
'its',
'a.',
'poem',
'c.',
'second',
'tale',
'les',
'at',
'en',
'story',
'i',
'e.',
'i.e.',
'm.',
'acts',
'historical',
'account',
'g.',
'his',
'le',
'du',
'die',
'f.',
'being',
'map',
'par',
'plates',
'two',
'which',
'&',
'geschichte',
'r.',
'maps',
'five',
'sketches',
'sur',
'england',
'translated',
's.',
'mit',
'1',
'histoire',
'john',
'3',
'years',
'guide',
't.',
'is',
'book',
'first',
'à',
'as',
'three',
'works',
'romance',
'some',
'prose',
'war',
'old',
'present',
'english',
'france',
'songs',
'den',
'original',
'l.',
'containing',
'county',
'narrative',
'sir',
'b.',
'including',
'south',
'during',
'verses',
'e',
'great',
'through',
'van',
'introduction',
'london',
'america',
'written',
'p.',
'letters',
'st',
'north',
'd.',
'their',
'del',
'voyage',
'di',
'description',
'states',
'mr',
'land',
'that',
'sketch',
'aus',
'travels',
'journal',
'french',
'world',
'comedy',
'y',
'tales',
'west',
'reprinted',
'british',
'third',
'm',
'city',
'love',
'state',
'lord',
'portrait',
'tragedy',
'country',
'zur',
'compiled',
'united',
'ancient',
'sea',
'society',
'general',
'lady',
'series',
'year',
'tour',
'under',
'descriptive',
'added',
'stories',
'africa',
'india',
'modern',
'documents',
'william',
'american',
'poetical',
'dr',
'george',
'king',
'time',
'ireland',
'her',
'preface',
'appendix',
'i.',
'revised',
'royal',
'town',
'man',
'paris',
'journey',
'adventures',
'avec',
'geography',
'engravings',
'nach',
'one',
'it',
'mrs',
'east',
'dem',
'upon',
'death',
'?',
'signed',
'dans',
'biographical',
'notices',
'au',
'thomas',
'charles',
'day',
'parish',
'church',
'our',
'various',
'report',
'all',
'ii',
'vol',
'2',
'historia',
'late',
'part',
'james',
'my',
'и',
'islands',
'last',
'century',
'collection',
'house',
'observations',
'historique',
'are',
'bis',
'into',
'york',
'scotland',
'queen',
'antiquities',
'herausgegeben',
'coast',
'memoir',
'island',
'four',
'italy',
'home',
'times',
'political',
'remarks',
'!',
'before',
'letter',
'henry',
'days',
'son',
'together',
'geology',
'visit',
'people',
'das',
'western',
'also',
'egypt',
'wales',
'china',
'short',
'castle',
'drama',
'published',
'ville',
'édition',
'rev',
'between',
'revolution',
'robert',
'enlarged',
'early',
'c',
'young',
'un',
'k.',
'spain',
'europe',
'ms',
'saint',
'comprising',
'af',
"l'histoire",
'mary',
'indian',
'einer',
'ses',
'additions',
'v.',
'stadt',
'essay',
'portraits',
'fourth',
'german',
'several',
'parts',
'empire',
"d'après",
'canada',
'notice',
'miss',
'poetry',
'made',
'por',
'ballads',
'gold',
'el',
'ein',
'most',
'woman',
'irish',
'handbook',
'directions',
'river',
'auf',
'selected',
'court',
'progress',
'esq',
'het',
'opera',
'collected',
'government',
'manners',
'expedition',
'brief',
'britain',
'ouvrage',
'travel',
'sailing',
'view',
'address',
'captain',
'round',
'della',
'delivered',
'n.',
'complete',
'past',
'little',
'rome',
'others',
'ode',
'summer',
'l.p',
'prince',
'wife',
'residence',
'twenty',
'natural',
'miscellaneous',
'ou',
'1848',
'events',
'social',
'play',
'six',
'révolution',
'records',
'no',
'historiques',
'chiefly',
'discovery',
'papers',
'depuis',
'who',
'relating',
'views',
'review',
'da',
'earliest',
'plans',
'con',
'geological',
'los',
'civil',
'une',
'men',
'satire',
'geographical',
'russia',
'neighbourhood',
'scenes',
'w',
'numerous',
'hundred',
'pictures',
'jahre',
'memoirs',
'edward',
'settlement',
"d'un",
'greece',
'essays',
'daughter',
'now',
'white',
'southern',
'period',
'hall',
'battle',
'family',
'въ',
'4',
'arranged',
'secret',
'scenery',
'cape',
'may',
'duke',
'drawings',
'über',
'fifth',
'introductory',
'australia',
'song',
'1789',
'eastern',
'dramatic',
'customs',
'critical',
'siècle',
'für',
'after',
'over',
'statistical',
'louis',
'central',
'northern',
'district',
'record',
'public',
'true',
'pieces',
"d'une",
'was',
'plan',
'physical',
'og',
'door',
'among',
'personal',
'lecture',
'topographical',
'popular',
'authors',
'military',
'translation',
'way',
'nature',
'voyages',
'cantos',
'comic',
'origin',
'peter',
'corrected',
'eine',
'be',
'books',
'school',
'work',
'lyrics',
'character',
'places',
'zum',
'epistle',
'hand',
'resources',
'months',
'einem',
'souvenirs',
'las',
'how',
'black',
'note',
'diary',
'deutschen',
'recollections',
'up',
'rambles',
'sous',
'holy',
'text',
'study',
'annals',
'afterwards',
'writings',
'proceedings',
'principal',
'8',
'poets',
'tom',
'thousand',
'sonnets',
'edinburgh',
'gazetteer',
'children',
'earl',
'literary',
'auflage',
'addressed',
'o.',
'switzerland',
'army',
'zu',
'a.d',
'yorkshire',
'pt',
'illustrative',
'rhymes',
'reminiscences',
'o',
'mexico',
'gravures',
'och',
'geschiedenis',
'taken',
'village',
'environs',
'subjects',
'extracts',
'lectures',
'palestine',
'roman',
'pays',
'richard',
's',
'david',
'pour',
'lands',
'reise',
'h',
'princess',
'condition',
'moral',
'wood',
'walter',
'scott',
'theatre',
'do',
'company',
'colony',
'survey',
'countries',
'française',
'campaign',
'illustré',
'g',
'scottish',
'1814',
'winter',
'spanish',
'friend',
'gentleman',
'editor',
'nebst',
'seine',
'victoria',
'reign',
'hon',
'inhabitants',
'1866',
'met',
'zealand',
'founded',
'alexander',
'étude',
"jusqu'à",
'many',
'entitled',
'musical',
'what',
'about',
'good',
'germany',
'extrait',
'picturesque',
'és',
'garden',
'sacred',
'oxford',
'ten',
'province',
'rather',
'pacific',
'against',
'abbey',
'lost',
'lake',
'smith',
'durch',
'few',
'directory',
'bay',
'pendant',
'quellen',
'paul',
'valley',
'ocean',
'provinces',
'jahren',
'1815',
'national',
'trip',
'list',
'red',
'act',
'al',
'down',
'selections',
'archives',
'impressions',
'majesty',
'hill',
'recent',
'pilot',
'literature',
'inédits',
'géographie',
'farce',
'out',
'iv',
'5',
'elizabeth',
'wild',
'christian',
'massachusetts',
'have',
'japan',
'california',
'carte',
'que',
'more',
'words',
'union',
'storia',
'joseph',
'girl',
'byron',
'not',
'right',
'those',
'magazine',
'catalogue',
'selection',
'coloured',
'beitrag',
'memory',
'b',
'relative',
'unter',
'prefixed',
'grand',
'heart',
'mining',
'portugal',
'use',
'printed',
'art',
'margaret',
'night',
'russian',
'mountains',
'supplement',
'colonies',
'manual',
'aux',
'1870',
'system',
'cruise',
'read',
'counties',
'adventure',
'med',
'drawn',
'fall',
'july',
'authentic',
'forest',
'nos',
'august',
'continent',
'descriptions',
'1830',
'isle',
'practical',
'monuments',
'vom',
'6',
'this',
'acted',
'dictionary',
'seven',
'performed',
'mystery',
'translations',
'members',
'zweite',
'mines',
'paper',
'local',
'eight',
'asia',
'institutions',
'l',
'holland',
'twelve',
'anecdotes',
'religion',
'official',
'index',
'strange',
'excursions',
'photographs',
'bearbeitet',
'field',
'x',
'don',
'fields',
'incidents',
'studies',
'marriage',
'cross',
'cathedral',
'pen',
'z',
'études',
'su',
'rose',
'v',
'committee',
'conquest',
'information',
'treatise',
'adapted',
'thoughts',
'african',
'australian',
'jours',
'съ',
'called',
'il',
'explanatory',
'turkey',
'sources',
'kingdom',
'1871',
'deutsche',
'virginia',
'religious',
'arthur',
'sixth',
'own',
'manchester',
'golden',
'science',
'channel',
'1849',
'republic',
'officer',
'7',
'domestic',
'cities',
'á',
'abbildungen',
'reference',
'd',
'club',
'private',
'hope',
'memorials',
'correspondence',
'borough',
'nouvelle',
'belgique',
'prize',
'peace',
'will',
'meeting',
'wanderings',
'norway',
'dessins',
'1812',
'trade',
'chronicle',
'full',
'publié',
'topography',
'rebellion',
'facts',
'coal',
'société',
'chronik',
'june',
'university',
'doctor',
'light',
'atlantic',
'across',
'christmas',
'march',
'charlotte',
'major',
'characters',
'college',
'ed',
'railway',
'zeit',
'music',
'parliament',
'passages',
'legend',
'1862',
'route',
'embracing',
'département',
'concerning',
'jack',
'maid',
'real',
'relation',
'long',
'rivers',
'chronicles',
'universal',
'green',
'chinese',
'plays',
'remarkable',
'extracted',
'corrections',
'till',
'tables',
'cartes',
'beiträge',
'deuxième',
'dream',
'1840',
'samuel',
'abroad',
'deux',
'ballad',
'earth',
'traduit',
'bishop',
'jerusalem',
'waterloo',
'lays',
'fair',
'traveller',
'italian',
'transactions',
'holiday',
'1850',
'publiés',
'karte',
'tot',
'remains',
'bristol',
'legends',
'engraved',
'place',
'best',
'coasts',
'qui',
'lines',
'elegy',
'only',
'anne',
'ward',
'thirty',
'battles',
'association',
'border',
'mission',
'greek',
'archæological',
'по',
'shakespeare',
'latin',
'picture',
'anniversary',
'invasion',
'rise',
'upper',
'commercial',
'bd',
'historie',
'indians',
'seiner',
'persons',
'professor',
'towns',
'temps',
'francis',
'accompanied',
'park',
'robinson',
'pilgrimage',
'dissertation',
'human',
'commerce',
'fifty',
'1793',
'til',
'sobre',
'исторіи',
'occasions',
'friends',
'oder',
'vicinity',
'wars',
'uit',
'servir',
'città',
'essai',
'volume',
'am',
'historic',
'september',
'pictorial',
'syria',
'ceylon',
'1863',
'dargestellt',
'nord',
'dedication',
'1887',
'pindar',
'search',
'autumn',
'excursion',
'baron',
'ship',
'alfred',
'rio',
'port',
'boston',
'norfolk',
'researches',
'xv',
'mémoires',
'revue',
'service',
'r',
'pope',
'eminent',
'sermon',
'temple',
'high',
'them',
'schools',
'water',
'constantinople',
'chapter',
'1861',
'manor',
'politique',
'bilder',
'siege',
'near',
'annual',
'poet',
'child',
'april',
'palace',
'additional',
'1813',
'chapters',
'experiences',
'accounts',
...]
We can quickly check what our columns are:
df.columns
Index(['BL record ID', 'Type of resource', 'Name',
'Dates associated with name', 'Type of name', 'Role', 'All names',
'Title', 'Variant titles', 'Series title', 'Number within series',
'Country of publication', 'Place of publication', 'Publisher',
'Date of publication', 'Edition', 'Physical description',
'Dewey classification', 'BL shelfmark', 'Topics', 'Genre', 'Languages',
'Notes', 'BL record ID for physical resource', 'classification_id',
'user_id', 'created_at', 'subject_ids', 'annotator_date_pub',
'annotator_normalised_date_pub', 'annotator_edition_statement',
'annotator_genre', 'annotator_FAST_genre_terms',
'annotator_FAST_subject_terms', 'annotator_comments',
'annotator_main_language', 'annotator_other_languages_summaries',
'annotator_summaries_language', 'annotator_translation',
'annotator_original_language', 'annotator_publisher',
'annotator_place_pub', 'annotator_country', 'annotator_title',
'Link to digitised book', 'annotated', 'is_valid', 'text_len',
'fiction_prob', 'non_fiction_prob', 'snorkel_genre', 'snorkel_label'],
dtype='object')
We’ll create a dataloaders object which will contain the new data we’ll use to update our previous model. This will probably look familiar from our previous notebook. The main difference here is that we pass in the vocab for our previous language model data to the text_vocab parameter. This makes sure that we have a consistent vocab.
dls = TextDataLoaders.from_df(
    df,
    text_col="Title",
    label_col="snorkel_genre",
    valid_col="is_valid",
    text_vocab=learn.dls.vocab[0],
)
/usr/local/lib/python3.7/dist-packages/numpy/core/_asarray.py:83: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
return array(a, dtype, copy=False, order=order)
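To see why a shared vocab matters, here is a minimal pure-Python sketch of numericalization. It is not fastai’s implementation: the `numericalize` helper and both toy vocabs are made up for illustration. Each token is replaced by its index in the vocab, and anything missing falls back to `xxunk` at index 0. Our model’s embedding layer was trained against those indices, so a vocab rebuilt from the new data could silently point the same token at a different embedding row.

```python
# Toy vocab laid out like learn.dls.vocab[0]: special tokens first.
old_vocab = ["xxunk", "xxpad", "xxbos", "xxeos", "the", "of", "history", "poems"]


def numericalize(tokens, vocab):
    """Map tokens to vocab indices, falling back to index 0 ('xxunk')."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, 0) for tok in tokens]


tokens = ["xxbos", "the", "history", "of", "snorkelling"]

# With the original vocab, known tokens keep the ids the model was trained on.
ids_old = numericalize(tokens, old_vocab)

# A vocab rebuilt from new data could order tokens differently,
# so the same token would point at a different embedding row.
new_vocab = ["xxunk", "xxpad", "xxbos", "xxeos", "history", "the", "of", "poems"]
ids_new = numericalize(tokens, new_vocab)

print(ids_old)  # [2, 4, 6, 5, 0]
print(ids_new)  # [2, 5, 4, 6, 0]
```

The out-of-vocab token maps to 0 in both cases, which is also what the many `xxunk` tokens in a batch of titles represent.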
dls.show_batch()
 | text | category |
---|---|---|
0 | xxbos xxmaj los xxmaj xxunk y las xxmaj maravillas del xxmaj mundo . … xxmaj anales del mundo desde los tiempos xxunk hasta nuestros dias . … xxmaj gran xxmaj memorandum histórico … que comprende xxunk las obras xxunk . xxmaj la xxmaj xxunk … xxmaj historia xxmaj universal , escrita por el xxunk xxmaj xxunk xxmaj xxunk y su xxunk celebrado xxmaj arte de xxunk los datos de las xxunk históricas , xxunk y otros antiguos documentos ; … continuada hasta hoy dia por xxup m. de xxmaj saint xxmaj xxunk ; la xxmaj historia de xxmaj xxunk el xxmaj grande , escrita por xxmaj xxunk xxmaj xxunk , la de xxmaj xxunk y xxmaj roma , xxmaj xxunk y los xxmaj xxunk , xxmaj xxunk y xxmaj xxunk , xxunk los xxunk xxmaj xxunk de este último ; la de la guerra de xxmaj xxunk y xxmaj xxunk | Non-fiction |
1 | xxbos xxmaj platonis , et quæ vel xxmaj platonis xxunk xxunk , vel xxmaj xxunk solent xxunk xxunk , xxmaj xxunk omnia . xxmaj ad xxunk xxunk recensuit xxunk inde xxunk xxunk xxunk xxup i. xxmaj xxunk . xxmaj annotationibus xxunk xxmaj xxunk , xxmaj xxunk , xxmaj xxunk , xxmaj wyttenbachii , xxmaj xxunk , xxmaj xxunk , xxunk xxunk non xxunk xxmaj xxunk , xxmaj xxunk , xxmaj xxunk , xxmaj xxunk , xxmaj xxunk , xxmaj xxunk , xxmaj xxunk et xxmaj xxunk , xxunk ex commentariis xxunk xxunk excerpta . ( xxunk xxmaj xxunk vita xxunk xxmaj xxunk vita xxunk vita xxunk et xxunk xxunk xxunk veterum xxunk xxunk de xxmaj xxunk ejusque xxunk et xxunk xxmaj xxunk et xxunk scriptis atque xxunk dissertatio xxup j. xxup a. xxunk codicum xxup mss . xxmaj platonis ejusque xxunk e xxmaj xxunk xxmaj xxunk . xxmaj xxunk codicum | Non-fiction |
2 | xxbos xxmaj narrative of xxmaj travels and xxmaj discoveries in xxmaj northern and xxmaj central xxmaj africa , in the years 1822 , 1823 , and 1824 , by xxmaj major xxmaj denham , xxmaj captain xxmaj clapperton and the late xxmaj doctor xxmaj xxunk … xxmaj with an appendix … by xxmaj major xxup d. xxmaj denham … and xxmaj captain xxup h. xxmaj clapperton [ including ' translations from the xxmaj arabic , of various letters and documents , brought from xxmaj xxunk and xxmaj soudan by xxmaj major xxmaj denham and xxmaj captain xxmaj clapperton . xxmaj by xxup a. xxmaj xxunk , ' ' botanical xxmaj appendix . xxmaj by xxmaj robert xxmaj brown , ' and ' letter to xxmaj major xxmaj denham , on the rock specimens brought from xxmaj africa . xxmaj by xxmaj charles xxmaj xxunk ' ] [ with plates and | Non-fiction |
3 | xxbos xxmaj the xxmaj history of xxmaj england ; from the invasion of xxmaj julius xxmaj caesar to the xxmaj revolution in 1688 : by xxup d. xxmaj hume … xxmaj with a continuation , from that period to the death of xxmaj george the xxmaj second , by xxmaj tobias xxmaj smollett … and xxmaj chronological xxmaj records to the coronation of his present xxmaj majesty , xxmaj george the xxmaj fourth , by xxmaj john xxmaj burke … xxmaj with numerous engravings . ( historical xxmaj questions comprising a series of studies upon the most important passages in xxmaj xxunk 's xxmaj universal xxmaj histories ; commencing with xxmaj hume and xxmaj xxunk 's xxmaj history of xxmaj england . xxmaj by xxmaj robert xxmaj xxunk . ) | Non-fiction |
4 | xxbos xxmaj additional xxmaj reasons , for our immediately xxunk xxmaj spanish xxmaj america , deduced from the … present crisis : and containing valuable information , respecting the late important events , both at xxmaj buenos xxmaj ayres , and in the xxmaj xxunk : as well as with respect to the present disposition and views of the xxmaj spanish xxmaj americans : being intended as a supplement to ' south xxmaj american xxmaj independence ' … xxmaj second edition , enlarged . ( letter to the xxmaj spanish xxmaj americans , by xxup d. xxmaj juan xxmaj xxunk xxmaj xxunk y xxmaj guzman , xxunk and proclamations by xxmaj general xxmaj miranda . ) | Non-fiction |
5 | xxbos xxmaj new xxmaj englands xxmaj xxunk xxmaj cast up at xxmaj london : or , a xxmaj relation of the xxmaj proceedings of the xxmaj court at xxmaj boston in new - england against divers xxunk and godly persons , for xxunk for government in the common - wealth , according to the xxunk of xxmaj england , and for xxunk of themselves and children to the xxunk in their churches … xxmaj as also a xxunk xxmaj answer to some passages in a late book , entituled xxmaj xxunk unmasked , set out by xxmaj mr . xxmaj xxunk , concerning the xxmaj independent xxmaj churches holding communion with the xxmaj reformed xxmaj churches | Non-fiction |
6 | xxbos xxmaj some xxmaj account of the xxmaj barony and xxmaj town of xxmaj xxunk … xxmaj including the journals kept by xxmaj messrs . xxmaj xxunk and xxmaj xxunk … from the 21 xxmaj james xxup i , to the death of xxmaj william xxrep 3 i . ; with notes genealogical , descriptive , and explanatory … xxmaj edited and enlarged from the collections made by xxup w. xxup b. xxmaj bridges , the xxmaj rev . xxup c. xxmaj thomas and the xxmaj rev . xxup h. xxup g. xxmaj fothergill … a new edition , with additional chapters , edited by xxup w. xxup h. xxup k. xxmaj wright | Non-fiction |
7 | xxbos xxmaj xxunk xxunk ? xxmaj amici : or , a xxmaj true and diverting account of a late battle between a priest and a porter . xxmaj in xxmaj hudibrastick verse . xxmaj address'd to the xxmaj orator of clare - market [ by xxmaj henry xxmaj price . ] xxmaj to which is added , the xxmaj fat vicar 's race : or , a merry new song of a wager between xxmaj parson xxmaj v - gh - n [ i.e. xxmaj vaughan ] … and a gentleman in xxmaj staffordshire which was won by the former . xxmaj wrote by a gentleman , who saw the race performed | Non-fiction |
8 | xxbos xxmaj memorias sobre el estado rural del xxmaj rio de la xxmaj plata en 1801 ; xxunk de límites entre el xxmaj brasil y el xxmaj paraguay á últimos del siglo xxup xv xxrep 3 i , é xxunk sobre varios xxunk de la xxmaj america xxunk española . xxmaj escritos xxunk de xxmaj don xxmaj felix de xxmaj azara … los publica … xxmaj don xxmaj xxunk de xxmaj azara … bajo la direccion de xxmaj don xxup b. xxup s. xxmaj xxunk de xxmaj xxunk … autor de las notas … a xxunk escritos , etc [ with a portrait of xxup f. de xxmaj azara . ] | Non-fiction |
We can now assign a new dataloader to our learner object. This will then be used to train our model.
learn.dls = dls
As we’ve done previously we’ll use the learning rate finder to try and find a sensible learning rate.
suggested = learn.lr_find()
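The general idea behind a learning rate finder is to train for a small number of mini-batches while increasing the learning rate exponentially and recording the loss at each step. The sketch below reproduces only the sweep of learning rates in plain Python. The `lr_sweep` helper is hypothetical, and the default values (1e-7 up to 10 over 100 iterations) are what we believe fastai uses, so treat them as assumptions rather than a description of the library’s internals.

```python
def lr_sweep(start_lr=1e-7, end_lr=10.0, num_it=100):
    """Return learning rates spaced evenly on a log scale."""
    ratio = end_lr / start_lr
    return [start_lr * ratio ** (i / (num_it - 1)) for i in range(num_it)]


lrs = lr_sweep()
print(f"{lrs[0]:.1e} -> {lrs[-1]:.1e} over {len(lrs)} steps")
```

Plotting the recorded loss against these learning rates is what gives the familiar curve from which a sensible value is picked.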
As before, we now want to train our model. Again we go a little bit wild with the number of epochs, but we use callbacks to save the best model and to stop training if the F1 score doesn’t improve for 10 epochs.
learn.fit_one_cycle(
    100,
    lr_max=1e-7,
    cbs=[
        ShowGraphCallback(),
        SaveModelCallback(monitor="f1_score"),
        EarlyStoppingCallback(monitor="f1_score", patience=10),
    ],
)
epoch | train_loss | valid_loss | accuracy | f1_score | time |
---|---|---|---|---|---|
0 | 0.407947 | 0.172334 | 0.932061 | 0.932026 | 04:12 |
1 | 0.401191 | 0.122557 | 0.951139 | 0.951041 | 04:14 |
2 | 0.437112 | 0.148886 | 0.940238 | 0.940177 | 04:14 |
3 | 0.451716 | 0.109705 | 0.956395 | 0.956275 | 04:13 |
4 | 0.455507 | 0.189706 | 0.926222 | 0.926204 | 04:14 |
5 | 0.421345 | 0.123881 | 0.950944 | 0.950846 | 04:14 |
6 | 0.416993 | 0.115451 | 0.954448 | 0.954336 | 04:12 |
7 | 0.393227 | 0.226075 | 0.912206 | 0.912206 | 04:17 |
8 | 0.398733 | 0.113009 | 0.955811 | 0.955692 | 04:12 |
9 | 0.417066 | 0.117806 | 0.953475 | 0.953368 | 04:12 |
10 | 0.423782 | 0.113344 | 0.954837 | 0.954719 | 04:15 |
11 | 0.452844 | 0.172289 | 0.931867 | 0.931830 | 04:12 |
12 | 0.398562 | 0.139598 | 0.945688 | 0.945607 | 04:12 |
13 | 0.442877 | 0.149645 | 0.939848 | 0.939788 | 04:12 |
Better model found at epoch 0 with f1_score value: 0.9320256479232389.
Better model found at epoch 1 with f1_score value: 0.9510406510603469.
Better model found at epoch 3 with f1_score value: 0.9562748837151331.
No improvement since epoch 3: early stopping
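To make the early stopping behaviour concrete, here is a small simulation in plain Python using the f1_score column from the training table. It is a sketch of the patience logic only, not fastai’s callback code; `simulate_early_stopping` is a hypothetical helper, and we assume that any strictly higher score counts as an improvement.

```python
# f1_score per epoch, copied from the training table above.
f1_by_epoch = [
    0.932026, 0.951041, 0.940177, 0.956275, 0.926204, 0.950846, 0.954336,
    0.912206, 0.955692, 0.953368, 0.954719, 0.931830, 0.945607, 0.939788,
]


def simulate_early_stopping(scores, patience):
    """Return (best_epoch, stop_epoch) for a higher-is-better metric."""
    best, best_epoch, wait = float("-inf"), None, 0
    for epoch, score in enumerate(scores):
        if score > best:
            best, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, None  # patience was never exhausted


best_epoch, stop_epoch = simulate_early_stopping(f1_by_epoch, patience=10)
print(best_epoch, stop_epoch)  # 3 13
```

The best score is at epoch 3, and the ten epochs that follow (4 to 13) never beat it, which is why training stops after epoch 13 rather than running all 100 epochs.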
Checking model performance¶
We can check our best model by calling validate.
learn.validate()
(#3) [0.10970497131347656,0.9563947916030884,0.9562748837151331]
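The list returned by validate is unlabelled, so it can help to attach names to the values. The sketch below assumes the order is the validation loss followed by the metrics in the order they appear in the training table (accuracy, then f1_score). The numbers line up with the epoch 3 row, which supports that assumption, but it is still an assumption.

```python
# Values copied from the output of learn.validate() above.
results = [0.10970497131347656, 0.9563947916030884, 0.9562748837151331]

# Assumed order: validation loss first, then the metrics in the order
# they appear in the training table (accuracy, f1_score).
names = ["valid_loss", "accuracy", "f1_score"]
labelled = dict(zip(names, results))

for name, value in labelled.items():
    print(f"{name}: {value:.4f}")
```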
However, as we have done previously we always want to go back to our test data to see how our model is doing there.
df_test = pd.read_csv("test_errors.csv")
df_test = df_test[["title", "true_label"]]
df_test = df_test.dropna(subset=["true_label"]).copy()
df_test = df_test[df_test.true_label.isin({"non_fiction", "fiction"})]
test_data = learn.dls.test_dl(df_test.loc[:, "title"])
preds = learn.get_preds(dl=test_data)
probs = preds[0]
predictions = probs.argmax(1)
true_labels = df_test.true_label.astype("category").cat.codes
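The comparison between predictions and true_labels only works if the integer codes on both sides mean the same thing. pandas assigns category codes following the sorted order of the unique values, while the model’s predictions index into learn.dls.vocab[1]. The plain-Python sketch below mimics that and checks the two orderings agree. The `normalise` helper and the four toy labels are made up for illustration.

```python
# Class order in learn.dls.vocab[1], as printed in the classification report.
model_classes = ["Fiction", "Non-fiction"]

# Toy stand-in for the true_label column of the test CSV.
test_labels = ["non_fiction", "fiction", "fiction", "non_fiction"]

# pandas assigns category codes by sorted order of the unique values;
# here we mimic that with plain Python.
categories = sorted(set(test_labels))
code_for = {label: code for code, label in enumerate(categories)}
codes = [code_for[label] for label in test_labels]


def normalise(name):
    """Make 'Non-fiction' and 'non_fiction' comparable."""
    return name.lower().replace("-", "_")


# The codes are only meaningful if both orderings agree.
aligned = [normalise(c) for c in model_classes] == categories
print(codes, aligned)  # [1, 0, 0, 1] True
```

Here both orderings put fiction at 0 and non-fiction at 1, so the codes can be compared directly with the argmax of the predicted probabilities.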
from sklearn.metrics import f1_score, classification_report, accuracy_score
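We can start with an F1 score. Note that with its default settings scikit-learn’s f1_score reports the score for the positive class only (label 1, which here is Non-fiction), so this number is not an average over both classes.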
f1_score(true_labels, predictions)
0.9557522123893805
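As a reminder of what this number measures, the F1 score for a class is the harmonic mean of its precision and recall. The sketch below computes it by hand for each class on a handful of made-up labels (not our test data) and then averages the two. The `f1_for_class` helper is ours, not part of scikit-learn.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 = 2 * precision * recall / (precision + recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels for illustration: 0 = fiction, 1 = non-fiction.
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0]

f1_non_fiction = f1_for_class(y_true, y_pred, positive=1)
f1_fiction = f1_for_class(y_true, y_pred, positive=0)
macro_f1 = (f1_fiction + f1_non_fiction) / 2

print(f1_non_fiction, f1_fiction, macro_f1)
```

With scikit-learn’s default `average="binary"` and `pos_label=1`, `f1_score(y_true, y_pred)` corresponds to `f1_non_fiction` here, which is why the per-class view in a classification report is a useful complement.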
But we probably want to use the classification report again to get a more detailed view of our model’s performance.
print(
    classification_report(
        true_labels,
        predictions,
        target_names=learn.dls.vocab[1],
    )
)
 | precision | recall | f1-score | support |
---|---|---|---|---|
Fiction | 0.95 | 0.88 | 0.91 | 296 |
Non-fiction | 0.94 | 0.97 | 0.96 | 554 |
accuracy | | | 0.94 | 850 |
macro avg | 0.94 | 0.93 | 0.93 | 850 |
weighted avg | 0.94 | 0.94 | 0.94 | 850 |
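The two average rows in the report differ in how they combine the per-class scores. As a rough check, the sketch below recomputes the macro and weighted F1 from the rounded per-class values in the report. Because the inputs are already rounded to two decimals, the results only approximately match the figures scikit-learn prints from the unrounded scores.

```python
# Per-class rows copied from the classification report (rounded values).
report = {
    "Fiction": {"f1": 0.91, "support": 296},
    "Non-fiction": {"f1": 0.96, "support": 554},
}

total = sum(row["support"] for row in report.values())

# Macro average: every class counts equally.
macro_f1 = sum(row["f1"] for row in report.values()) / len(report)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(row["f1"] * row["support"] for row in report.values()) / total

print(f"macro ~ {macro_f1:.3f}, weighted ~ {weighted_f1:.3f}")
```

Since Non-fiction has almost twice the support of Fiction, the weighted average sits closer to the Non-fiction score, while the macro average treats the weaker Fiction class as equally important.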
If you go back to our previous fastai notebook you will see that we do get some small improvements in our model. Small improvements like these can add up, and may be significant in practice, so we should be pleased to get this performance boost as a result of creating some more data.
We’ll save this new version of the model:
learn.export("20211115-model.pkl")
Next steps¶
There is still quite a bit of playing around we could do here to eke out some more performance from our model. We haven’t really done anything to improve our model training except look for a sensible learning rate. There are a whole bunch of other parameters we could try tweaking. For now we’ll move on and see if using a transformer model will help.