core functionality used in nnanno
from nbdev import *

This module defines functions used elsewhere in nnanno, including helpers for working with images, making requests, etc.

Requests

Since we're going to be making requests quite often, we modify the default requests session slightly to better suit our needs.

create_session[source]

create_session()

returns a requests session

create_session creates a requests session. We also add a Retry to increase the number of times a failed request will be retried.
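A minimal sketch of what this might look like, using urllib3's Retry with an HTTPAdapter (the exact retry settings in nnanno may differ):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def create_session():
    "Returns a requests session with retries enabled (illustrative sketch)"
    session = requests.Session()
    # retry failed requests a few times, backing off between attempts
    retries = Retry(total=5, backoff_factor=0.5, status_forcelist=[500, 502, 503, 504])
    adapter = HTTPAdapter(max_retries=retries)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session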

Cached requests

We might want to request the same data from the Newspaper Navigator Dataset multiple times, e.g. if we want to generate different-sized samples. Repeating these requests isn't ideal, since it takes time for us to wait for the responses and costs the LOC money to serve them. nnanno uses caching for most requests to the Newspaper Navigator dataset. When a request is sent to a URL for the first time, it is requested in the usual way. If that same URL is requested again, it will be returned from a cache.

This is done by using the requests_cache library. requests_cache creates a sqlite database to store the cached requests.

This database, url_cache.sqlite, will grow as you request more URLs from the Newspaper Navigator dataset. Some responses are too large for SQLite to cache (ads beyond 1870 and headlines).

create_cached_session[source]

create_cached_session()

Creates a session which caches requests

create_cached_session creates a session that returns cached results if the URL has been previously requested. This cache is stored in an SQLite database, url_cache.sqlite.

session = create_cached_session()
r = session.get("https://google.com")
r.status_code, r.from_cache
(200, False)
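Requesting the same URL a second time should then come back from the cache, i.e. from_cache will be True. A minimal sketch of how such a session could be built with requests_cache (the cache name and arguments here are assumptions, not necessarily what nnanno uses):

import requests_cache

def create_cached_session():
    "Creates a session which caches requests (illustrative sketch)"
    # backend="sqlite" stores cached responses in url_cache.sqlite in the working directory
    return requests_cache.CachedSession(cache_name="url_cache", backend="sqlite")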

Multiprocessing

Quite a few of the things done by nnanno can be done in parallel. A crude heuristic for defining the maximum number of workers is based on the CPU count and the length of the data.

get_max_workers[source]

get_max_workers(data=None)

Returns int to pass to max_workers based on len of data if available or cpu_count()

get_max_workers returns the minimum of the CPU count and the length of the data. It isn't used everywhere and is a fairly crude heuristic.

list_data = [1, 2, 3, 4]
get_max_workers(list_data), get_max_workers()
(4, 12)
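A sketch of the heuristic, assuming it simply caps the worker count at the number of CPUs:

import os

def get_max_workers(data=None):
    "Returns int to pass to max_workers based on len of data if available or cpu_count() (illustrative sketch)"
    cpus = os.cpu_count() or 1
    if data is not None:
        # never ask for more workers than there are items to process
        return min(len(data), cpus)
    return cpus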

Images

This code deals with loading, saving and downloading images.

Loading

load_url_image[source]

load_url_image(url:str, mode='RGB')

Attempts to load an image from url; returns None if the request times out or there is no image at url

load_url_image loads an image from a URL if available and returns it as a PIL Image, in "RGB" mode (i.e. 3 channels) by default. Since we might have some URL timeouts etc. from time to time, it will return None if it doesn't find an image or the request times out. This means downstream functions will usually want to check for and handle None as well as PIL Images.

url = "https://news-navigator.labs.loc.gov/data/dlc_fiji_ver01/data/sn83030214/00175040936/1900102801/0519/001_0_99.jpg"
im = load_url_image(url)
im
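A minimal sketch of this kind of helper, assuming a plain requests call with a timeout (the real function may instead use the sessions defined above):

import io
import requests
from PIL import Image

def load_url_image(url: str, mode: str = "RGB"):
    "Attempts to load an image from url, returning None on failure (illustrative sketch)"
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        return Image.open(io.BytesIO(r.content)).convert(mode)
    except (requests.RequestException, OSError):
        # bad URL, timeout, or content that isn't an image
        return None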

Casting to RGB

Not used at the moment, but we may sometimes want to force all PIL Images to RGB mode to avoid tensor mismatches between different image shapes.

Note

I haven't looked closely at the impact of casting to RGB versus using grayscale images for machine learning. Since we don't lose information by casting to RGB, this is probably a sensible starting point, especially as I am assuming images loaded will usually be used as part of a deep learning pipeline. This often involves transfer learning from models trained on three-channel images. There is some experimentation/research to be done on the best way of handling this.

def _to_rgb(images):
    "Converts a list of PIL Images to RGB mode"
    rgb_images = []
    for im in images:
        rgb_images.append(im.convert("RGB"))
    return rgb_images

Saving images

save_image[source]

save_image(im:Image, fname:str, out_dir:Union[str, Path]='.')

Saves im as fname to out_dir

save_image saves an image im to filename fname and directory out_dir.

im = Image.open(io.BytesIO(requests.get(url).content))
save_image(im, "test_iif.jpg")
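A sketch of what save_image might do under the hood; whether the real function creates out_dir when it is missing is an assumption here:

from pathlib import Path
from typing import Union
from PIL import Image

def save_image(im: Image.Image, fname: str, out_dir: Union[str, Path] = "."):
    "Saves im as fname to out_dir (illustrative sketch)"
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # assumption: create the directory if needed
    im.save(out_dir / fname)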

Downloading images

download_image[source]

download_image(url:str, fname:str, out_dir:Union[str, Path]='.')

Attempts to load an image from url and save it as fname to out_dir. Returns None if the URL is bad or the request times out

Combines load_url_image and save_image
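Since it just combines the two helpers above, a sketch is straightforward (using the illustrative versions of load_url_image and save_image from earlier):

def download_image(url: str, fname: str, out_dir: Union[str, Path] = "."):
    "Attempts to load the image at url and save it as fname to out_dir (illustrative sketch)"
    im = load_url_image(url)
    if im is None:
        # bad URL or timeout: nothing to save
        return None
    save_image(im, fname, out_dir)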

IIIF

The International Image Interoperability Framework (IIIF, pronounced “Triple-Eye-Eff”) is a lovely API for requesting images.

We can utilise IIIF to download images from the Newspaper Navigator dataset because the API allows you to specify a region. As a bonus, we can also set a bunch of other useful things in our request, such as size, which helps ensure we only download images at the size we need. Thanks to Benjamin Lee for pointing out this API and for the code in https://github.com/LibraryOfCongress/newspaper-navigator/tree/master/news_navigator_app.

parse_box[source]

parse_box(box:Union[Tuple, List[T]])

Parses the box value from Newspaper Navigator data to prepare for an IIIF request

parse_box parses a predicted bounding box from the Newspaper Navigator data and returns it in a format suitable for making IIIF requests.

box = [0.2624743069, 0.0310959541, 0.9860676740000001, 0.2558741574]
parse_box(box)
(26.24, 3.11, 72.36, 22.48)
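Newspaper Navigator boxes are normalised [x1, y1, x2, y2] coordinates, while a IIIF pct: region wants x, y, width, height as percentages. A sketch of that conversion (the exact rounding used by nnanno is an assumption, so the last decimal place may differ slightly from the output above):

from typing import List, Tuple, Union

def parse_box(box: Union[Tuple, List]):
    "Parses a Newspaper Navigator box into an IIIF pct: region (illustrative sketch)"
    x1, y1, x2, y2 = box
    # convert corner coordinates to x, y, width, height percentages
    return (
        round(x1 * 100, 2),
        round(y1 * 100, 2),
        round((x2 - x1) * 100, 2),
        round((y2 - y1) * 100, 2),
    )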

create_iiif_url[source]

create_iiif_url(box:Union[Tuple, List[T]], url:str, original:bool=False, pct:int=None, size:tuple=None, preserve_asp_ratio:bool=True)

Returns a IIIF URL from bounding box and URL

create_iiif_url takes a few fields from the Newspaper Navigator data for an image and returns an iiif_url which (mostly) points to that image available via IIIF.

The required fields are the original url from Newspaper Navigator and box, the bounding box prediction from Newspaper Navigator.

box = [0.2624743069, 0.0310959541, 0.9860676740000001, 0.2558741574]
url = "https://news-navigator.labs.loc.gov/data/dlc_fiji_ver01/data/sn83030214/00175040936/1900102801/0519/001_0_99.jpg"
iiifurl = create_iiif_url(box, url, pct=10)
print(iiifurl)
load_url_image(iiifurl)
https://chroniclingamerica.loc.gov/iiif/2/dlc_fiji_ver01%2Fdata%2Fsn83030214%2F00175040936%2F1900102801%2F0519.jp2/pct:26.24,3.11,72.36,22.48/pct:10/0/default.jpg
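A rough sketch of how such a URL can be assembled from the Newspaper Navigator URL and box, building on the parse_box sketch above. The path handling and parameter defaults here are assumptions based on the example output, and the original and preserve_asp_ratio options are left out for brevity:

from urllib.parse import quote

def create_iiif_url(box, url, pct=None, size=None):
    "Returns a IIIF URL from a bounding box and a Newspaper Navigator URL (illustrative sketch)"
    # the IIIF identifier is the path after the first "/data/", minus the image
    # filename, with ".jp2" appended and its slashes percent-encoded
    path = url.split("/data/", 1)[1]
    identifier = quote("/".join(path.split("/")[:-1]) + ".jp2", safe="")
    region = "pct:{},{},{},{}".format(*parse_box(box))
    if pct is not None:
        size_param = f"pct:{pct}"
    elif size is not None:
        size_param = f"{size[0]},{size[1]}"
    else:
        size_param = "full"
    return (
        "https://chroniclingamerica.loc.gov/iiif/2/"
        f"{identifier}/{region}/{size_param}/0/default.jpg"
    )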

iiif_df_apply[source]

iiif_df_apply(row, original:bool=False, pct:int=50, size:tuple=None, preserve_asp_ratio:bool=True)

Creates IIIF urls from a pandas DataFrame containing newspaper navigator data
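This is designed to be used with DataFrame.apply and axis=1. A hypothetical usage, assuming a DataFrame df loaded from the Newspaper Navigator data with box and url columns:

# hypothetical: df holds Newspaper Navigator data with 'box' and 'url' columns
df["iiif_url"] = df.apply(iiif_df_apply, axis=1, pct=25)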

Convenience

Some convenience functions that might be used in other places

bytesto[source]

bytesto(bytes, to:str, bsize:int=1024)

Takes bytes and returns the value converted to to
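A sketch of this kind of conversion helper, assuming single-letter unit names such as 'k', 'm' and 'g':

def bytesto(bytes, to: str, bsize: int = 1024):
    "Takes bytes and returns the value converted to `to` (illustrative sketch)"
    units = {"k": 1, "m": 2, "g": 3, "t": 4, "p": 5}
    return bytes / (bsize ** units[to])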

from nbdev.export import notebook2script

notebook2script()
Converted 00_core.ipynb.
Converted 01_sample.ipynb.
Converted 02_annotate.ipynb.
Converted 03_inference.ipynb.
Converted index.ipynb.