Annotating Newspaper Navigator data
Once you have created a sample of Newspaper Navigator data using the sample module, you might want to annotate it. These annotations can serve as the input for a machine learning model or can be used directly to explore images in the Newspaper Navigator data. The Examples section in the documentation shows how annotations can be turned into training data for machine learning tasks.
Setting up an annotation task
The bulk of the annotation work is outsourced to Label Studio, which provides a flexible annotation system that supports various data types, including images and text. This module contains a few helpers for processing annotations produced through Label Studio, along with some suggestions on how to set up Label Studio with data from Newspaper Navigator.
First, we'll create a small sample of images to annotate using the sample module. If you have already done this step, you can skip it.
sampler = nnSampler()
df = sampler.create_sample(
    50, "photos", start_year=1910, end_year=1920, year_sample=False
)
There are a few ways we can use Label Studio to annotate. For example, we could download the images from our sample using sample.download_sample. However, if we have a large sample of images, we might want to do some annotating before downloading all of these images locally.
Label Studio supports annotating from a URL. We can combine this with IIIF to annotate images without downloading them all first, since IIIF is a flexible interface for requesting images. IIIF also lets us annotate at a smaller resolution/size before downloading higher-resolution images.
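For context, the size request is encoded directly in the IIIF Image API URL. Below is a minimal sketch of how such a URL is built; the base URL and identifier are purely illustrative, and create_label_studio_json (shown below) generates the real URLs for you.

# Sketch of how an IIIF Image API URL encodes a size request.
# Pattern: {base}/{region}/{size}/{rotation}/{quality}.{format}
# The "!w,h" size syntax asks the server to fit the image inside a
# w x h box while preserving the aspect ratio.
def iiif_image_url(base: str, width: int, height: int) -> str:
    return f"{base}/full/!{width},{height}/0/default.jpg"

# hypothetical identifier, for illustration only
print(iiif_image_url("https://example.org/iiif/some-newspaper-page", 500, 500))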
Create Label Studio annotation tasks
Label Studio supports a load of different ways of setting up 'tasks'. In this context, a 'task' is an image to be annotated. One way of setting up tasks is to import a JSON file that defines them. To do this, we take an existing sample DataFrame and add a column called image, which contains an IIIF URL for each row.
We can pass either a DataFrame or an nnSampler to create_label_studio_json, a simple function that creates a JSON file defining 'tasks' for Label Studio. In this example, we also pass in a size parameter, which is used to generate IIIF URLs that request images at that size.
create_label_studio_json(df, "tasks.json", size=(500, 500))
This creates a JSON file we can use to load tasks into Label Studio.
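As a quick sanity check, we can peek at the file that was written. The exact fields depend on the columns in your sample DataFrame (and the snippet assumes the file is a list of tasks), but each task should include an image field pointing at an IIIF URL.

import json

# Peek at the first task in the generated file. The precise structure
# depends on your sample DataFrame; the key thing to look for is an
# "image" field containing an IIIF URL at the requested size.
with open("tasks.json") as f:
    tasks = json.load(f)

print(len(tasks), "tasks")
print(tasks[0])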
Importing tasks into Label Studio
To avoid this documentation becoming out of date, I haven't included screenshots etc. However, you can currently (January 2021) create tasks in Label Studio via the GUI or by passing in tasks through the CLI. For example, to load the tasks and create a template for annotating classifications:
label-studio init project_name --template=image_classification --input-path=tasks.json
You can then start label-studio and complete the rest of the setup via the GUI.
label-studio start ./project_name
Setting up labeling
For a proper introduction to configuring your labels, consult the Label Studio documentation. One way you can set up labels is to use a template, as shown above; this template sets up an image classification task, and there are other templates for different tasks. These templates are XML files that define your labels, letting you specify how you want to label your images and share those definitions with others. For example:
<View>
  <Choices name="choice" toName="image" showInLine="true" choice="multiple">
    <Choice value="human"/>
    <Choice value="animal"/>
    <Choice value="human-structure"/>
    <Choice value="landscape"/>
  </Choices>
  <Image name="image" value="$image"/>
</View>
You can change many other options in Label Studio. It also includes features such as adding a machine learning backend to assist with annotating.
Notes on labelling using IIIF images
There are a few things to be aware of when loading images via IIIF in Label Studio.
Missing images
Occasionally, when annotating from IIIF URLs in Label Studio, you will get a missing-image error. This usually means that the IIIF URL was generated incorrectly for that image, or that the image isn't available via IIIF. If this happens, you can 'skip' the image in the annotation interface.
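If you would rather catch these before you start annotating, one option is to check that each IIIF URL resolves before creating the tasks. Below is a rough sketch, assuming your DataFrame has an image column containing the IIIF URLs; note that some IIIF servers don't respond to HEAD requests, in which case a streamed GET is a fallback.

import requests

def url_ok(url: str, timeout: int = 10) -> bool:
    """Return True if the URL responds with a 2xx status code."""
    try:
        resp = requests.head(url, timeout=timeout, allow_redirects=True)
        return resp.ok
    except requests.RequestException:
        return False

# assumes df has an 'image' column with the IIIF URLs; this makes one
# request per row, so only do it for reasonably small samples
df["image_ok"] = df["image"].apply(url_ok)
df[~df["image_ok"]]  # rows whose images failed to resolve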
Setting a comfortable size for viewing
You can take advantage of IIIF's flexibility by requesting images at a specific size when you create the tasks. Requesting a smaller image that fits comfortably on your screen also speeds up loading each image.
Annotating vs training image size, resolution etc.
If you are annotating labels or classifications, you may decide to annotate at a smaller size or quality and then work with higher-quality images when you come to train a model. If you are annotating pixels or regions of the image, you will want to be careful that these annotations aren't lost or distorted when moving between different sizes of the image.
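For example, if you draw boxes on a 500-pixel-wide IIIF rendering and later download the full-resolution image, the pixel coordinates need to be rescaled. Below is a minimal sketch of that arithmetic; the function and argument names are made up for illustration.

def scale_box(box, annotated_size, target_size):
    """Rescale a pixel bounding box from one image size to another.

    box:            (x, y, width, height) in pixels on the annotated image
    annotated_size: (width, height) of the image you annotated on
    target_size:    (width, height) of the image you will train on
    """
    sx = target_size[0] / annotated_size[0]
    sy = target_size[1] / annotated_size[1]
    x, y, w, h = box
    return (x * sx, y * sy, w * sx, h * sy)

# a box drawn on a 500x700 rendering, mapped onto a 2000x2800 original
print(scale_box((50, 100, 200, 150), (500, 700), (2000, 2800)))

Note that some Label Studio export formats store region coordinates as percentages of the image dimensions, in which case they are already independent of the image size; check which convention your export uses.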
Exporting and loading annotations from label studio
Label Studio supports a broad range of annotation tasks, which may require particular export formats, e.g. COCO or VOC for object detection. Since processing these outputs is task specific, this module only contains functionality for image classification and labelling tasks, as these were the tasks covered in the Programming Historian lessons for which this code was originally written.
Exporting and processing CSV
Once you have finished annotating all your images or got too bored of annotating, you can export in various formats, including JSON and CSV. A CSV export is often sufficient for simple tasks and has the additional benefit of having a lower barrier to entry than JSON for people who aren't coders.
We'll now process the annotations we generated above and labelled using Label Studio.
from pathlib import Path
from typing import Union

import pandas as pd

def load_annotations_csv(csv: Union[str, Path], kind="classification"):
    if kind == "classification":
        # eval turns the stringified box column back into Python values
        df = pd.read_csv(csv, converters={"box": eval})
        df["label"] = df["choice"]
        return df
    if kind == "label":
        df = pd.read_csv(csv, converters={"box": eval})
        # process_labels (defined in this module) extracts a list of labels
        df["label"] = df["choice"].apply(process_labels)
        return df
As you can see above, this code doesn't do much to process the annotations into a DataFrame. The main thing to note is the kind parameter. The CSV export for labelling tasks includes a column that contains a JSON-like string with the labels. In this case, we use a pandas converter and eval, and grab the choices, which returns a list of labels.
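For reference, all process_labels needs to do is pull the list of labels out of that exported value. Below is a rough sketch of what's involved, not necessarily the exact implementation used in this module.

def process_labels(choice: str):
    # For labelling tasks the Label Studio CSV export stores the selected
    # labels as a small JSON-like string, e.g. '{"choices": ["human", "animal"]}'.
    # Evaluating it and grabbing "choices" gives us a plain list of labels.
    value = eval(choice)
    return value["choices"]

print(process_labels('{"choices": ["human", "animal"]}'))  # ['human', 'animal']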
If we look at the columns of the annotation DataFrame, we'll see that Label Studio kept the original metadata. We now have a new label column that contains our annotations. We also still have a choice column containing the original format from the Label Studio export, which will differ from the label column when processing labelling annotations.
annotation_df = load_annotations_csv("test_iiif_anno/label_studio_export.csv")
annotation_df.columns
We can now do the usual Pandas things to explore our annotations further. For example, we can see how many of each label option we have:
annotation_df["choice"].value_counts()
Downloading the images associated with annotations
Once we have some annotations done, we'll often want to download the original images to work with locally. This is particularly important if we plan to train a machine learning model with these images. Although it is possible to train a model on images fetched from IIIF, we would usually be requesting each image again for every epoch, which isn't particularly efficient and isn't very friendly to the IIIF endpoint.
We can use the sampler.download_sample method to download our sample; we just pass in our annotation DataFrame, a folder we want to download images to, and an optional name for the 'log' of the download. We can also pass in different parameters to request a different size etc. of the image. See the download_sample docs for more details.
sampler.download_sample(
    "test_iiif_anno/test_dl", df=annotation_df, original=True, json_name="test_dl"
)
Moving between local annotation and the cloud ☁
Although 'storage is cheap', it isn't free. One helpful feature of the IIIF annotation workflow is that it allows you to annotate 'locally', i.e. on a personal computer, and then quickly move the information required to download all the images into the cloud, without having to pass the images themselves around. This is particularly useful if you will use a service like Google Colab to train a computer vision model, e.g. because you don't have the resources to rent GPUs.
If you are working with limited bandwidth, it can also be time-consuming to download a large set of images. You can get around this by annotating using the IIIF images and then using a service like Google Colab when you want to grab the actual image files. Since Colab runs in the cloud with a fast connection, this should be much more manageable even if your own internet connection is limited.
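Concretely, the cloud side of that workflow might look something like the sketch below: move the exported annotations (a small CSV) to Colab, load them with load_annotations_csv as above, then let download_sample fetch the images there, just as we did locally. The import path and file paths are assumptions for illustration.

# Sketch of the 'annotate locally, download in the cloud' workflow,
# run from a Colab (or other cloud) notebook. Paths are placeholders.
from nnanno.sample import nnSampler  # assumed import path for nnSampler

# the CSV exported from Label Studio, uploaded to the cloud machine
annotation_df = load_annotations_csv("label_studio_export.csv")

sampler = nnSampler()
sampler.download_sample(
    "downloaded_images",       # folder to download the images into
    df=annotation_df,          # the annotations contain the image info needed
    original=True,             # request the original-size images
    json_name="download_log",  # save a log of the download
)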
Once you have downloaded your images, you may want to check whether any images failed to download. You can do this using the check_download_df_match function, which will let you know if the number of downloaded images differs from the number of rows in the DataFrame.
check_download_df_match("test_iiif_anno/test_dl", annotation_df)
# Quick check that the downloaded images and labels load correctly,
# using the download 'log' saved by download_sample
from fastai.vision.all import *

df = pd.read_json("test_iiif_anno/test_dl/test_dl.json")
dls = ImageDataLoaders.from_df(
    df,
    path="test_iiif_anno/test_dl",
    fn_col="download_image_path",
    label_col="choice",
    item_tfms=Resize(64),
    bs=4,
)
dls.show_batch()
# Load annotations from a directory of Label Studio 'completions' (the per-task JSON files)
df = load_completions("../ph/ads/ad_annotations/")
df.head(1)
# df = load_completions('../ph/photos/multi_label/')
# df.head(1)
# the original sample the annotations were created from, used below to merge metadata back in
sample_df = pd.read_csv("../ph/ads/sample.csv", index_col=0)
annotations = nnAnnotations.from_completions(
    "../ph/ads/ad_annotations/", "classification"
)
annotations
annotations.labels
annotations.label_counts
annotations.merge_sample(sample_df)
annotations.merged_df.head(2)
annotations.export_merged("testmerge.csv")
annotations = nnAnnotations.from_completions(
    "../ph/ads/ad_annotations/", "classification"
)
annotations.annotation_df.head(2)