Tools to support creating and processing annotations for samples of Newspaper Navigator data using Label Studio
from nbdev import *

Annotating Newspaper Navigator data

Once you have created a sample of Newspaper Navigator data using sample, you might want to annotate it. These annotations may function as the input for a machine learning model or could be used directly to explore images in the Newspaper Navigator data. The Examples section of the documentation shows how annotations can be used to generate training data for machine learning tasks.

Setting up an annotation task

The bulk of the annotation work is outsourced to Label Studio, which provides a flexible annotation system that supports various data types, including images and text. This module provides a few steps to help process annotations produced through Label Studio; it is essentially a set of suggestions on how you can get Label Studio set up with data from Newspaper Navigator.

First, we'll create a small sample of images we want to annotate using sample. If you have already done this step, you can skip it.

from nnanno.sample import nnSampler  # import path assumed from the module layout

sampler = nnSampler()
df = sampler.create_sample(
    50, "photos", start_year=1910, end_year=1920, year_sample=False
)

There are a few ways in which we can use Label Studio to annotate. For example, we could download the images from our sample using sample.download_sample. However, if we have a large sample of images, we might want to do some annotating before downloading all of these images locally.

Label Studio supports annotating from a URL. Since IIIF is a flexible interface for requesting images, we can combine the two to annotate images without downloading them all first. IIIF also gives us the flexibility to annotate at a smaller resolution/size before downloading higher-resolution images.
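To make this concrete, the IIIF Image API encodes the region, size, rotation, quality and format of the requested image directly in the URL path. The sketch below shows the general pattern; the endpoint and identifier are illustrative, and the helper functions in this module build these URLs for you.

def iiif_url(identifier, size="500,500"):
    # IIIF Image API pattern: {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}
    base = "https://chroniclingamerica.loc.gov/iiif/2"  # assumed endpoint, for illustration
    return f"{base}/{identifier}/full/{size}/0/default.jpg"

iiif_url("example_id", size="!500,500")  # '!w,h' scales to fit within w x h, preserving aspect ratio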

Creating Label Studio annotation tasks

Label Studio supports a range of different ways of setting up 'tasks'. In this context, a 'task' is an image to be annotated. One way of setting up tasks is to import a JSON file that defines them. To do this, we take an existing sample DataFrame and add a column, image, which contains an IIIF URL.

create_label_studio_json[source]

create_label_studio_json(sample:Union[DataFrame, Type[nnSampler]], fname:Union[str, Path, NoneType]=None, original:bool=True, pct:Optional[int]=None, size:Optional[tuple]=None, preserve_asp_ratio:bool=True)

create a json file which can be used to upload tasks to label studio

We can pass either a DataFrame or an nnSampler to create_label_studio_json. This simple function creates a JSON file that can be used to create 'tasks' in Label Studio. In this example, we pass in a size parameter, which is used to generate an IIIF URL that requests images at this size.

create_label_studio_json(df, "tasks.json", size=(500, 500))
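The signature above also accepts a pct parameter; as a hedged alternative, you could request each image at 50% of its original size rather than a fixed pixel size:

create_label_studio_json(df, "tasks_pct.json", pct=50)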

Either call creates a JSON file we can use to load tasks into Label Studio.
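If you want a quick sanity check of what was generated, you can peek at the file; this assumes it is a list of task dicts, with fields mirroring the sample DataFrame plus the image column discussed above:

import json

with open("tasks.json") as f:
    tasks = json.load(f)
tasks[0]["image"]  # an IIIF URL like the ones sketched earlier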

Importing tasks into Label Studio

To avoid this documentation becoming out of date, I haven't included screenshots etc. However, you can currently (January 2021) create tasks in Label Studio via the GUI or by passing in tasks through the CLI. For example, to load the tasks and create a template for annotating classifications:

label-studio init project_name --template=image_classification --input-path=tasks.json

You can then start label-studio and complete the rest of the setup via the GUI.

label-studio start ./project_name

Setting up labelling

For a proper introduction to configuring your labels, consult the Label Studio documentation. One way you can set up labels is to use a template, as shown above; the template used here sets up an image classification task, and there are other templates for different tasks. Templates are XML documents that define your labels and how you want to label your images, and they make it easy to share these definitions with others. For example:

<View>
  <Choices name="choice" toName="image" showInLine="true" choice="multiple">
    <Choice value="human"/>
    <Choice value="animal"/>
    <Choice value="human-structure"/>
    <Choice value="landscape"/>
  </Choices>
  <Image name="image" value="$image"/>
</View>

You can change many other options in Label Studio. It also includes features such as adding a machine learning backend to support annotation.

Notes on labelling using IIIF images

There are a few things to be aware of when loading images via IIIF in Label Studio.

Missing images

Occasionally, when you are annotating IIIF URLs in Label Studio, you will get a missing image error. This is probably because the IIIF URL was generated incorrectly for that image, or because the image isn't available via IIIF. If this happens, you can 'skip' the image in the annotation interface.
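If you would rather catch broken URLs before annotating, one option (a sketch using the requests library, not part of this module) is to test each URL first:

import requests

def url_ok(url):
    # a HEAD request is enough to see whether the IIIF server can deliver the image
    try:
        return requests.head(url, timeout=5).ok
    except requests.RequestException:
        return False

url_ok("https://chroniclingamerica.loc.gov/iiif/2/bad_id/full/500,500/0/default.jpg")  # e.g. False for a 404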

Setting a comfortable size for viewing

You can take advantage of the flexibility of IIIF by requesting images at a specific size when you create the tasks. Requesting a smaller image that fits comfortably on your screen also speeds up loading each image.

Annotating vs training: image size, resolution, etc.

If you are annotating labels or classifications, you may decide to annotate at a smaller size or quality and work with a higher-quality image when you come to training a model. If you are annotating pixels or regions of the image, you will want to be careful to make sure these aren't lost when moving between different sizes of the image.
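For example, a bounding box drawn on a 500x500 IIIF rendering has to be rescaled before it lines up with the full-resolution image. A minimal sketch (rescale_box is a hypothetical helper, not part of this module):

def rescale_box(box, annotated_size, full_size):
    # box is (x0, y0, x1, y1) in pixels on the annotated (smaller) image
    sx = full_size[0] / annotated_size[0]
    sy = full_size[1] / annotated_size[1]
    x0, y0, x1, y1 = box
    return (x0 * sx, y0 * sy, x1 * sx, y1 * sy)

rescale_box((50, 80, 200, 220), annotated_size=(500, 500), full_size=(5000, 7000))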

Exporting and loading annotations from label studio

Label Studio supports a broad range of annotation tasks, which may require particular export formats, e.g. COCO or VOC for object detection. Since the processing of these outputs is task specific, this module only contains functionality for image classification and labelling tasks, since these were the tasks covered in the Programming Historian lessons for which this code was originally written.

Exporting and processing CSV

Once you have finished annotating all your images (or got too bored of annotating), you can export the annotations in various formats, including JSON and CSV. A CSV export is often sufficient for simple tasks and has the additional benefit of a lower barrier to entry than JSON for people who aren't coders.

We'll now process the annotations we generated above and labelled using Label Studio.

process_labels[source]

process_labels(x)

load_annotations_csv[source]

load_annotations_csv(csv:Union[str, Path], kind='classification')

from pathlib import Path
from typing import Union

import pandas as pd

def load_annotations_csv(csv: Union[str, Path], kind="classification"):
    if kind == "classification":
        # the 'box' column is stored as a string in the CSV; eval turns it back into a list
        df = pd.read_csv(csv, converters={"box": eval})
        df["label"] = df["choice"]
        return df
    if kind == "label":
        df = pd.read_csv(csv, converters={"box": eval})
        # labelling exports store the choices as a JSON-like string; extract the label list
        df["label"] = df["choice"].apply(process_labels)
        return df

As you can see above, this code doesn't do much to process the annotations into a DataFrame. The main thing to note is the kind parameter. The CSV export for labelling tasks includes a choice column containing a JSON with the labels, so in that case we use process_labels to grab the choices, which returns a list of labels (a pandas converter using eval also restores the box column).
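For reference, the idea behind process_labels is roughly the following (the library's actual implementation may differ):

import ast

def process_labels(x):
    # the labelling export stores choices as a stringified dict,
    # e.g. "{'choices': ['human', 'animal']}"
    return ast.literal_eval(x)["choices"]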

If we look at the columns of the annotation DataFrame, we'll see that Label Studio kept the original metadata. We now have a new column, label, that contains our annotations. We also have a column, choice, containing the original format from the Label Studio export; this will differ from the label column when processing labelling annotations.

annotation_df = load_annotations_csv("test_iiif_anno/label_studio_export.csv")
annotation_df.columns
Index(['batch', 'box', 'edition_seq_num', 'filepath', 'geographic_coverage',
       'image', 'lccn', 'name', 'ocr', 'page_seq_num', 'page_url',
       'place_of_publication', 'pub_date', 'publisher', 'score', 'url', 'id',
       'choice', 'label'],
      dtype='object')

We can now do the usual Pandas things to start exploring our annotations further. For example, we can see how many of each label option we have:

annotation_df["choice"].value_counts()
Human       52
no_human    16
Name: choice, dtype: int64
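Or, to see the class balance as proportions rather than counts:

annotation_df["choice"].value_counts(normalize=True)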

Downloading the images associated with annotations

Once we have some annotations done, we'll often want to get the original images to work with locally. This is particularly important if we are planning to train a machine learning model with these images. Although it is possible to train a model using the images from IIIF, we'll usually be grabbing these images multiple times for each epoch, so this isn't particularly efficient and isn't very friendly to the IIIF endpoint.

We can use the sampler.download_sample method to download our sample; we just pass in our annotation DataFrame, a folder we want to download the images to, and an optional name for saving a 'log' of the download. We can also pass in different parameters to request a different size etc. for the images. See the download_sample docs for more details.

sampler.download_sample(
    "test_iiif_anno/test_dl", df=annotation_df, original=True, json_name="test_dl"
)

Moving between local annotation and the cloud ☁

Although 'storage is cheap', it isn't free. One helpful feature of the IIIF annotation workflow is that it allows you to annotate 'locally', i.e. on a personal computer, and then quickly move the information required to download all the images into the cloud without having to pass the images themselves around. This is particularly useful if you will use a service like Google Colab to train a computer vision model, i.e. if you don't have the resources to rent GPUs.

If you are working with limited bandwidth, downloading a large set of images might also be relatively time-consuming. You can get around this by annotating using the IIIF images and then using a service like Google Colab when you want to grab the actual image files. Since Colab runs in the cloud with a big internet tube, this should be much more doable even if your own connection is limited.
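Concretely, the only file you need to move to the cloud is the small annotations export; the images can then be fetched from there. A sketch of what this might look like in a Colab cell (the nnanno import paths are assumptions):

# run in Colab after uploading the small CSV export
from nnanno.annotate import load_annotations_csv  # assumed import path
from nnanno.sample import nnSampler               # assumed import path

annotation_df = load_annotations_csv("label_studio_export.csv")
nnSampler().download_sample("images", df=annotation_df, original=True)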

Once you have downloaded your images, you may want to check whether any images failed to download. You can do this using the check_download_df_match function.

check_download_df_match[source]

check_download_df_match(dl_folder:Union[Path, str], df:DataFrame)

This will let you know if you have a different number of downloaded images compared to the number of rows in the DataFrame.

check_download_df_match("test_iiif_anno/test_dl", annotation_df)
Length of DataFrame 68 and number of images in test_iiif_anno/test_dl 68 match 😀

Working with the annotations

This will really depend on the framework or library you want to use. In fastai, the process is simple, since our data matches one of the fastai 'factory' methods for loading data.

Loading with fastai

from fastai.vision.all import *
df = pd.read_json("test_iiif_anno/test_dl/test_dl.json")
dls = ImageDataLoaders.from_df(
    df,
    path="test_iiif_anno/test_dl",
    fn_col="download_image_path",
    label_col="choice",
    item_tfms=Resize(64),
    bs=4,
)
dls.show_batch()

Processing completions directly

Label Studio stores annotations as JSON files, so we can work with these directly without using the exports from Label Studio. The code below shows how to do this, but the export-based approach above is likely to be more reliable.
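For orientation, each completion lives as a JSON file inside the project's completions directory; a sketch of reading one directly (the field names follow the Label Studio 0.x completions format and may vary between versions):

import json
from pathlib import Path

# grab the first completion file and pull out the chosen labels
fname = next(Path("../ph/ads/ad_annotations/completions").glob("*.json"))
completion = json.loads(fname.read_text())
completion["completions"][0]["result"][0]["value"]["choices"]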

load_df[source]

load_df(json_file:Union[str, Path])

load_completions[source]

load_completions(path:Union[str, Path])

df = load_completions("../ph/ads/ad_annotations/")
df.head(1)
   created_at      id  lead_time           result                                               data
0  1602237290  457001      1.014  [illustrations]  {'image': 'http://localhost:8081/data/upload/d...
# df = load_completions('../ph/photos/multi_label/')
# df.head(1)

anno_sample_merge[source]

anno_sample_merge(sample_df:DataFrame, annotation_df:DataFrame)

anno_sample_merge merges a DataFrame containing a sample from Newspaper Navigator with a DataFrame containing annotations

Parameters

sample_df : pd.DataFrame
    A Pandas DataFrame which holds a sample from Newspaper Navigator, generated by sample.nnSampler()
annotation_df : pd.DataFrame
    A Pandas DataFrame containing annotations loaded via the annotate.nnAnnotations class

Returns

pd.DataFrame
    A new DataFrame which merges the two input DataFrames

sample_df = pd.read_csv("../ph/ads/sample.csv", index_col=0)
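With a sample and a set of annotations loaded, the merge itself is a single call (shown here directly for illustration; the nnAnnotations class below wraps this for you):

merged_df = anno_sample_merge(sample_df, annotation_df)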

class nnAnnotations[source]

nnAnnotations(df)

nnAnnotations.from_completions[source]

nnAnnotations.from_completions(path, kind, drop_dupes=True, sample_df=None)

annotations = nnAnnotations.from_completions(
    "../ph/ads/ad_annotations/", "classification"
)
annotations
nnAnnotations #annotations:549
annotations.labels
array(['illustrations', 'text-only'], dtype=object)
annotations.label_counts
text-only        376
illustrations    173
Name: result, dtype: int64

nnAnnotations.merge_sample[source]

nnAnnotations.merge_sample(sample_df)

annotations.merge_sample(sample_df)
annotations.merged_df.head(2)
filepath pub_date page_seq_num edition_seq_num batch lccn box score ocr place_of_publication geographic_coverage name publisher url page_url created_at id lead_time result data
0 iahi_gastly_ver01/data/sn82015737/00279529091/... 1860-03-09 447 1 iahi_gastly_ver01 sn82015737 [Decimal('0.30762831315880534'), Decimal('0.04... 0.950152 ['JTO', 'TMCE', 'An', 't%E', '3eott', 'County'... Davenport, Iowa ['Iowa--Scott--Davenport'] Daily Democrat and news. [volume] Maguire, Richardson & Co. https://news-navigator.labs.loc.gov/data/iahi_... https://chroniclingamerica.loc.gov/data/batche... 1602237486 iahi_gastly_ver01/data/sn82015737/00279529091/... 0.838 text-only iahi_gastly_ver01_data_sn82015737_00279529091_...
1 ohi_cobweb_ver04/data/sn85026050/00280775848/1... 1860-08-17 359 1 ohi_cobweb_ver04 sn85026050 [Decimal('0.5799164973813336'), Decimal('0.730... 0.985859 ['9', 'BI.', 'I', '.QJtf', 'A', 'never', 'fall... Fremont, Sandusky County [Ohio] ['Ohio--Sandusky--Fremont'] Fremont journal. [volume] I.W. Booth https://news-navigator.labs.loc.gov/data/ohi_c... https://chroniclingamerica.loc.gov/data/batche... 1602236992 ohi_cobweb_ver04/data/sn85026050/00280775848/1... 7.593 illustrations ohi_cobweb_ver04_data_sn85026050_00280775848_1...

nnAnnotations.export_merged[source]

nnAnnotations.export_merged(out_fn)

annotations.export_merged("testmerge.csv")

annotations = nnAnnotations.from_completions(
    "../ph/ads/ad_annotations/", "classification"
)
annotations.annotation_df.head(2)
   created_at      id  lead_time         result                                               data
0  1602237290  457001      1.014  illustrations  txdn_argentina_ver01_data_sn84022109_002111018...
1  1602237157  179001      2.068      text-only  khi_earhart_ver01_data_sn85032814_00237283260_...
from nbdev.export import notebook2script

notebook2script()
Converted 00_core.ipynb.
Converted 01_sample.ipynb.
Converted 02_annotate.ipynb.
Converted 03_inference.ipynb.
Converted index.ipynb.