from nbdev import *
This module defines functions that are used in other places within nnanno. This includes helpers for working with images, making requests, etc.
create_session
creates a requests session. We also add a Retry to increase the number of times we'll retry a failed request.
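A minimal sketch of how such a session can be built, assuming a retry strategy along these lines (the exact settings used by create_session may differ):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def _example_session(retries=5):
    # Build a session that retries failed requests with exponential backoff
    session = requests.Session()
    retry = Retry(total=retries, backoff_factor=0.5, status_forcelist=[500, 502, 503, 504])
    adapter = HTTPAdapter(max_retries=retry)
    # Mount the retrying adapter for both http and https requests
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session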
Cached requests
We might want to request the same data from the Newspaper Navigator dataset multiple times, e.g. if we want to generate different-sized samples. This isn't ideal, since it costs us time waiting for the responses and costs the LOC money to serve them. nnanno uses caching for most requests to the Newspaper Navigator dataset. When a request is sent to a URL for the first time, it is requested in the usual way. If that same URL is requested again, it will be returned from a cache.
This is done using the requests_cache library. requests_cache creates an SQLite database to store the cached requests. This database, url_cache.sqlite, will grow as you request more URLs from the Newspaper Navigator dataset. Some requests are too big for SQLite to ingest (the ads beyond 1870 and the headlines).
create_cached_session
creates a session that returns cached results if a URL has been requested previously. The cache is stored in an SQLite database, url_cache.sqlite.
session = create_cached_session()
r = session.get("https://google.com")
r.status_code, r.from_cache
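Under the hood, this kind of session can be built with requests_cache; a rough sketch (the exact parameters used by create_cached_session may differ):

import requests_cache


def _example_cached_session():
    # Store responses in url_cache.sqlite; repeat requests to the same URL
    # are answered from the cache instead of hitting the network again.
    return requests_cache.CachedSession(cache_name="url_cache", backend="sqlite")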
get_max_workers
returns the minimum of the CPU count and the length of the data. This isn't used everywhere and is a fairly crude heuristic.
list_data = [1, 2, 3, 4]
get_max_workers(list_data), get_max_workers()
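A rough sketch of the heuristic (the real function may differ, for example in how it behaves when no data is passed):

import os


def _example_max_workers(data=None):
    # Use at most one worker per CPU, and never more workers than items of data
    cpus = os.cpu_count() or 1
    return min(cpus, len(data)) if data is not None else cpus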
load_url_image
loads an image from a URL, if available, and returns it as a PIL Image in RGB mode by default, i.e. 3 channels. Since we might have some URL timeouts etc. from time to time, it will return None if it doesn't find an image or the request times out. This means downstream functions will usually want to check for and handle None as well as PIL Images.
url = "https://news-navigator.labs.loc.gov/data/dlc_fiji_ver01/data/sn83030214/00175040936/1900102801/0519/001_0_99.jpg"
im = load_url_image(url)
im
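A simplified sketch of the approach (the actual function may handle more cases, e.g. reuse a session with retries):

import io

import requests
from PIL import Image


def _example_load_url_image(url, timeout=30):
    # Return a PIL Image in RGB mode, or None if the request or decoding fails
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()
        return Image.open(io.BytesIO(r.content)).convert("RGB")
    except (requests.RequestException, OSError):
        return None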
Casting to RGB
Not used at the moment, but we may sometimes want to force all PIL Images into RGB mode to avoid tensor mismatches between images with different shapes.
Note
I haven't looked closely at the impact of casting to RGB versus using grayscale images for machine learning. Since we don't lose information by casting to RGB, this is probably a sensible starting point, especially as I assume loaded images will usually be used as part of a deep learning pipeline. This often involves transfer learning from models trained on three-channel images. There is still some experimentation/research to be done on the best way of handling this.
def _to_rgb(images):
    # Convert each PIL Image to RGB mode so every image has three channels
    return [im.convert("RGB") for im in images]
save_image
saves an image im with filename fname to directory out_dir.
im = Image.open(io.BytesIO(requests.get(url).content))
save_image(im, "test_iif.jpg")
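A sketch of what this involves (the actual function may handle output formats and paths differently):

from pathlib import Path


def _example_save_image(im, fname, out_dir="."):
    # Make sure the output directory exists, then save the image into it
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    im.save(out_path / fname)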
Combines load_url_image and save_image.
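For example, the two steps combined by hand look roughly like this (using the url defined above and a hypothetical output filename):

im = load_url_image(url)
if im is not None:  # load_url_image returns None if the image can't be fetched
    save_image(im, "example_download.jpg")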
IIIF
The International Image Interoperability Framework (IIIF, pronounced “Triple-Eye-Eff”) is a lovely API for requesting images.
We can utilise IIIF to download images from the Newspaper Navigator dataset because the API allows you to specify a region. As a bonus, we can also set a bunch of other useful things in our request, such as size, which helps ensure we only download the size of image we need. Thanks to Benjamin Lee for pointing out this API and for the code in https://github.com/LibraryOfCongress/newspaper-navigator/tree/master/news_navigator_app
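For reference, IIIF Image API requests follow a fixed URL template; this is a generic illustration, where the server, prefix, and identifier are placeholders rather than real LOC values:

# Generic shape of a IIIF Image API request (Image API 2.x syntax):
# {scheme}://{server}/{prefix}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}
example_iiif_url = (
    "https://example.org/iiif/some-image/pct:26.2,3.1,72.4,22.6/pct:10/0/default.jpg"
)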
parse_box
parses a predicted bounding box from the Newspaper Navigator data and returns it in a format suitable for making IIIF requests.
box = [0.2624743069, 0.0310959541, 0.9860676740000001, 0.2558741574]
parse_box(box)
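A sketch of the kind of conversion involved, assuming the box is stored as relative [x1, y1, x2, y2] coordinates and the IIIF region uses the pct: form (the exact output of parse_box may differ):

def _example_parse_box(box):
    # Assumes box holds [x1, y1, x2, y2] as fractions of the page dimensions;
    # returns a percentage-based IIIF region string.
    x1, y1, x2, y2 = box
    x, y = x1 * 100, y1 * 100
    w, h = (x2 - x1) * 100, (y2 - y1) * 100
    return f"pct:{x:.2f},{y:.2f},{w:.2f},{h:.2f}"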
create_iiif_url
takes a few fields from the Newspaper Navigator data for an image and returns an iiif_url which (mostly) points to that image via IIIF.
The required fields are the original url from Newspaper Navigator and box, the bounding box prediction from Newspaper Navigator.
box = [0.2624743069, 0.0310959541, 0.9860676740000001, 0.2558741574]
url = "https://news-navigator.labs.loc.gov/data/dlc_fiji_ver01/data/sn83030214/00175040936/1900102801/0519/001_0_99.jpg"
iiifurl = create_iiif_url(box, url, pct=10)
print(iiifurl)
load_url_image(iiifurl)
from nbdev.export import notebook2script
notebook2script()