from nbdev import *
This module defines functions that are used in other places within nnanno. This includes helpers for working with images, making requests, etc.
create_session
creates a requests session. We also add a Retry to increase the number of times we'll retry a failed request.
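A minimal sketch of how such a session can be built, assuming a retry strategy along these lines (the exact settings used by create_session may differ):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def _example_session(retries=5):
    # Build a session that retries failed requests with exponential backoff
    session = requests.Session()
    retry = Retry(total=retries, backoff_factor=0.5, status_forcelist=[500, 502, 503, 504])
    adapter = HTTPAdapter(max_retries=retry)
    # Mount the retrying adapter for both http and https requests
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session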
Cached requests
We might want to request the same data from the Newspaper Navigator dataset multiple times, e.g. if we want to generate different-sized samples. This isn't ideal, since it costs us time waiting for the responses and costs the LOC money to serve them. nnanno uses caching for most requests to the Newspaper Navigator dataset. When a request is sent to a URL for the first time, it is requested in the usual way. If that same URL is requested again, it will be returned from a cache.
This is done using the requests_cache library. requests_cache creates an SQLite database to store the cached requests. This database, url_cache.sqlite, will grow as you request more URLs from the Newspaper Navigator dataset. Some requests are too big for SQLite to ingest (the ads beyond 1870 and the headlines).
create_cached_session
creates a session that returns cached results if a URL has been requested previously. The cache is stored in an SQLite database, url_cache.sqlite.
session = create_cached_session()
r = session.get("https://google.com")
r.status_code, r.from_cache
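Under the hood, this kind of session can be built with requests_cache; a rough sketch (the exact parameters used by create_cached_session may differ):

import requests_cache


def _example_cached_session():
    # Store responses in url_cache.sqlite; repeat requests to the same URL
    # are answered from the cache instead of hitting the network again.
    return requests_cache.CachedSession(cache_name="url_cache", backend="sqlite")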
get_max_workers
returns the minimum of the CPU count and the length of the data. This isn't used everywhere and is a fairly crude heuristic.
list_data = [1, 2, 3, 4]
get_max_workers(list_data), get_max_workers()
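A rough sketch of the heuristic (the real function may differ, for example in how it behaves when no data is passed):

import os


def _example_max_workers(data=None):
    # Use at most one worker per CPU, and never more workers than items of data
    cpus = os.cpu_count() or 1
    return min(cpus, len(data)) if data is not None else cpus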
load_url_image
loads an image from a URL, if available, and returns it as a PIL Image in RGB mode by default, i.e. 3 channels. Since we might have some URL timeouts etc. from time to time, it will return None if it doesn't find an image or the request times out. This means downstream functions will usually want to check for and handle None as well as PIL Images.
url = "https://news-navigator.labs.loc.gov/data/dlc_fiji_ver01/data/sn83030214/00175040936/1900102801/0519/001_0_99.jpg"
im = load_url_image(url)
im
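A simplified sketch of the approach (the actual function may handle more cases, e.g. reuse a session with retries):

import io

import requests
from PIL import Image


def _example_load_url_image(url, timeout=30):
    # Return a PIL Image in RGB mode, or None if the request or decoding fails
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()
        return Image.open(io.BytesIO(r.content)).convert("RGB")
    except (requests.RequestException, OSError):
        return None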
Casting to RGB
Not used at the moment, but we may sometimes want to force all PIL Images into RGB mode to avoid tensor mismatches between images with different shapes.
Note
I haven't looked closely at the impact of casting to RGB versus using grayscale images for machine learning. Since we don't lose information by casting to RGB, this is probably a sensible starting point, especially as I assume loaded images will usually be used as part of a deep learning pipeline. This often involves transfer learning from models trained on three-channel images. There is still some experimentation/research to be done on the best way of handling this.
def _to_rgb(images):
    # Convert each PIL Image to RGB mode so every image has three channels
    return [im.convert("RGB") for im in images]
save_image
saves an image im with filename fname to directory out_dir.
im = Image.open(io.BytesIO(requests.get(url).content))
save_image(im, "test_iif.jpg")
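A sketch of what this involves (the actual function may handle output formats and paths differently):

from pathlib import Path


def _example_save_image(im, fname, out_dir="."):
    # Make sure the output directory exists, then save the image into it
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    im.save(out_path / fname)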
Combines load_url_image and save_image.
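For example, the two steps combined by hand look roughly like this (using the url defined above and a hypothetical output filename):

im = load_url_image(url)
if im is not None:  # load_url_image returns None if the image can't be fetched
    save_image(im, "example_download.jpg")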
IIIF
The International Image Interoperability Framework (IIIF, pronounced “Triple-Eye-Eff”) is a lovely API for requesting images.
We can utilise IIIF to download images from the Newspaper Navigator dataset because the API allows you to specify a region. As a bonus, we can also set a bunch of other useful things in our request, such as size, which helps ensure we only download the size of image we need. Thanks to Benjamin Lee for pointing out this API and for the code in https://github.com/LibraryOfCongress/newspaper-navigator/tree/master/news_navigator_app
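For reference, IIIF Image API requests follow a fixed URL template; this is a generic illustration, where the server, prefix, and identifier are placeholders rather than real LOC values:

# Generic shape of a IIIF Image API request (Image API 2.x syntax):
# {scheme}://{server}/{prefix}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}
example_iiif_url = (
    "https://example.org/iiif/some-image/pct:26.2,3.1,72.4,22.6/pct:10/0/default.jpg"
)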
parse_box
parses a predicted bounding box from the Newspaper Navigator data and returns it in a format suitable for making IIIF requests.
box = [0.2624743069, 0.0310959541, 0.9860676740000001, 0.2558741574]
parse_box(box)
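A sketch of the kind of conversion involved, assuming the box is stored as relative [x1, y1, x2, y2] coordinates and the IIIF region uses the pct: form (the exact output of parse_box may differ):

def _example_parse_box(box):
    # Assumes box holds [x1, y1, x2, y2] as fractions of the page dimensions;
    # returns a percentage-based IIIF region string.
    x1, y1, x2, y2 = box
    x, y = x1 * 100, y1 * 100
    w, h = (x2 - x1) * 100, (y2 - y1) * 100
    return f"pct:{x:.2f},{y:.2f},{w:.2f},{h:.2f}"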
create_iiif_url
takes a few fields from the Newspaper Navigator data for an image and returns an iiif_url which (mostly) points to that image via IIIF.
The required fields are the original url from Newspaper Navigator and box, the bounding box prediction from Newspaper Navigator.
box = [0.2624743069, 0.0310959541, 0.9860676740000001, 0.2558741574]
url = "https://news-navigator.labs.loc.gov/data/dlc_fiji_ver01/data/sn83030214/00175040936/1900102801/0519/001_0_99.jpg"
iiifurl = create_iiif_url(box, url, pct=10)
print(iiifurl)
load_url_image(iiifurl)
from nbdev.export import notebook2script
notebook2script()