Create samples from Newspaper Navigator

Newspaper Navigator JSON files

We need to work with the JSON files from the Newspaper Navigator data. The first thing that might be helpful is some code for generating the URLs for a particular year and kind. Since the URLs are systematically structured, this is easy.

get_json_url[source]

get_json_url(year:Union[str, int], kind:str='photos')

Returns url for the json data from news-navigator for given year and kind

assert (
    get_json_url(1860)
    == "https://news-navigator.labs.loc.gov/prepackaged/1860_photos.json"
)
assert (
    get_json_url(1950)
    == "https://news-navigator.labs.loc.gov/prepackaged/1950_photos.json"
)
assert (
    get_json_url(1950, "ads")
    == "https://news-navigator.labs.loc.gov/prepackaged/1950_ads.json"
)
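
Since the asserts above pin down the URL pattern exactly, a function like this can be written as a single f-string. A minimal sketch (not necessarily the library's exact implementation):

def get_json_url(year, kind="photos"):
    # Newspaper Navigator's prepackaged files follow a fixed URL pattern
    return f"https://news-navigator.labs.loc.gov/prepackaged/{year}_{kind}.json"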

load_json[source]

load_json(url:str)

Returns json loaded from url
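
Assuming load_json simply fetches the URL and parses the response body, a minimal sketch looks like this (the library's version may add error handling):

import json
import requests

def load_json(url: str):
    # Fetch the whole file and parse it as JSON; fine for small files,
    # but see 'Working with big JSON' below for where this breaks down
    with requests.get(url) as r:
        return json.loads(r.content)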

We can also test, inside the notebook, that this returns what we expect. These tests are often hidden in the documentation, but inside the notebook there will usually be a cell below a function definition that includes some tests.

test_json = load_json(
    "https://news-navigator.labs.loc.gov/prepackaged/1950_photos.json"
)
assert type(test_json[0]) == dict
assert type(test_json) == list

Working with big JSON

This works well for a smallish file, but if we try it with the 1910_ads.json file, which is ~3.3GB, we will likely run out of memory. For example, running

import json
import requests

with requests.get('https://news-navigator.labs.loc.gov/prepackaged/1910_ads.json') as r:
    data = json.loads(r.content)
len(data)

on a Google Colab instance with 25GB of RAM causes a crash.

Streaming JSON

One way to get around this would be to throw more RAM at the problem. However, since we only want to sample the JSON and don't need to work with the whole dataset, this seems wasteful. So instead, we'll use ijson, a Python library for streaming JSON.

We can see how this works for a URL from Newspaper Navigator if we create a request via Requests using stream=True to return a streaming version of the response.

r = requests.get(get_json_url(1850, "ads"), stream=True)

We can pass this response to ijson. In this case, we just parse one item at a time. If the JSON is really big, even this might be too much; ijson allows for much more granular parsing of JSON, but for what we need, parsing by item is fine. We can see what the return value of this looks like:

objects = ijson.items(r.raw, "item")
objects
<_yajl2.items at 0x7fa7204ce030>

We get back something from _yajl2; this is the underlying parser ijson is using. See the ijson docs for more on the available parsers.
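
As an aside, the granular parsing mentioned above is available via ijson.parse, which yields (prefix, event, value) events. A small illustration, using a fresh streaming request since a stream can only be consumed once:

r2 = requests.get(get_json_url(1850, "ads"), stream=True)
# Pull out just the 'lccn' field of the first item without building full dicts
for prefix, event, value in ijson.parse(r2.raw):
    if prefix == "item.lccn":
        print(value)
        break
r2.close()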

We can call next on this object to start iterating over it, one item at a time. If we look at the keys of the first response, we'll see that this is one entry from the original JSON data.

first = next(objects)
first.keys()
dict_keys(['filepath', 'pub_date', 'page_seq_num', 'edition_seq_num', 'batch', 'lccn', 'box', 'score', 'ocr', 'place_of_publication', 'geographic_coverage', 'name', 'publisher', 'url', 'page_url'])
r.close()

Counting the size of the data

If we want to sample from Newspaper Navigator, it is important to be able to know the size of the total population for a given year and kind of image, e.g. 10,000 photos for 1950.

Normally in Python, we would use len to count the length of a Python object:

data = {"a": "one", "b": "two", "c": "three"}
len(data)
3

If we try to do this with our objects, we get an error: TypeError: object of type '_yajl2.items' has no len(). This is because the whole point of ijson is to avoid loading the JSON into memory, so we don't know in advance how long the data will be.

We can get around this by using the toolz library's itertoolz.count function. count is similar to len, except that it works on lazy sequences, i.e. anything we can iterate over. Unfortunately, this ends up being relatively slow because we still have to go through all of the data: although we avoid loading the data into memory, we still need to stream all of it to get the length. We usually won't need to call this repeatedly, but if we do call this function multiple times, we cache the results so we don't calculate the length of the same data more than once.
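
For illustration (this demo isn't from the original notebook), count consumes a lazy iterator and returns how many items it yielded:

from toolz import itertoolz

lazy = (n * n for n in range(1000))  # a generator: it has no len()
itertoolz.count(lazy)  # iterates through the whole stream, returning 1000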

count_json_iter[source]

count_json_iter(url:str, session=None)

Returns count of objects in url json file using an iterator to avoid loading json into memory

count_json_iter counts the number of objects in a JSON file loaded via URL.

count_json_iter("https://news-navigator.labs.loc.gov/prepackaged/1850_photos.json")
22
url = "https://news-navigator.labs.loc.gov/prepackaged/1850_photos.json"
assert type(count_json_iter(url)) == int
assert len(json.loads(requests.get(url).content)) == count_json_iter(url)
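
Putting the streaming and counting pieces together, a function like count_json_iter might look roughly like this (a sketch under the assumptions above, not necessarily the library's code):

import ijson
import requests
from toolz import itertoolz

def count_json_iter_sketch(url: str) -> int:
    # Stream the JSON and count top-level items without holding them in memory
    with requests.get(url, stream=True) as r:
        return itertoolz.count(ijson.items(r.raw, "item"))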

get_year_size[source]

get_year_size(year:Union[int, str], kind:str)

returns size of a json dataset for a given year and kind; results are cached

Parameters

year : Union[int, str]
    year from newspaper navigator
kind : str
    {'ads', 'photos', 'maps', 'illustrations', 'comics', 'cartoons', 'headlines'}

Returns

size : dict
    a dict with year as key and size as value

get_year_size(1850, "photos")
{'1850': 22}

get_year_sizes[source]

get_year_sizes(kind:str, start:int=1850, end:int=1950, step:int=5)

Returns the sizes for json data files for kind between year start and end with step size 'step'

Parameters:
    kind (str): kind of image from news-navigator: {'ads', 'photos', 'maps', 'illustrations', 'comics', 'cartoons', 'headlines'}

Returns:
    pandas.DataFrame: holding data from the input json url

Returns the year sizes for a given kind, taking a step of size step. For example, to get the number of photos in the news-navigator dataset for every year between 1850 and 1855:

%%time
get_year_sizes("photos", 1850, 1855, step=1)
CPU times: user 146 ms, sys: 37.9 ms, total: 184 ms
Wall time: 1.2 s
photos_count
1850 22
1851 20
1852 22
1853 45
1854 221
1855 17
assert len(get_year_sizes("photos", 1850, 1860, step=1)) == 11
assert len(get_year_sizes("photos", 1850, 1860, step=2)) == 6

get_all_year_sizes[source]

get_all_year_sizes(start:int=1850, end:int=1950, step:int=1, save:bool=True)

Returns a dataframe with number of counts from year start to end

Creating Samples

Streaming sampling

Since we want a subset of the Newspaper Navigator datasets to work with, whether for annotation or for inference, we need to create samples. Sampling in Python can be more or less complicated depending on the type of population you are working with and the properties you want your sample to have. Usually, we can do something relatively simple. For example, if we want to sample from a selection of books, we could do:

import random

books = ["War and Peace", "Frankenstein", "If They Come in the Morning"]
random.sample(books, 1)
['War and Peace']

However, we run into the same problem as above when trying to get the length of a JSON dataset that doesn't fit into memory: we want to sample $k$ examples from one of our JSON files, but we can't load that file into memory. To get around this, we can use reservoir sampling:

Reservoir sampling is a family of randomized algorithms for choosing a simple random sample without replacement of k items from a population of unknown size n in a single pass over the items. The size of the population n is not known to the algorithm and is typically too large to fit all n items into main memory. The population is revealed to the algorithm over time, and the algorithm cannot look back at previous items.
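
The classic formulation of this is Algorithm R. The sample_stream function below may differ in its details, but a minimal sketch of the idea looks like this:

import random

def reservoir_sample(stream, k: int):
    # Keep the first k items, then replace a random reservoir slot with
    # decreasing probability, so every item ends up in the sample with
    # probability k/n
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = random.randint(0, i)  # uniform over [0, i] inclusive
            if j < k:
                reservoir[j] = item
    return reservoir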

sample_stream[source]

sample_stream(stream, k:int)

Return a random sample of k elements drawn without replacement from stream. Designed to be used when the elements of stream cannot easily fit into memory.

Now we can sample whilst only loading a small number of items into memory at any one time. This does come at some cost, mainly speed. There are faster ways of sampling from a stream, but this isn't the main bottleneck for sampling in this case. We can, for example, sample from a large range of numbers without memory issues.

sample_stream(range(1, 100000), 5)
array([62151, 45070, 43590, 71352, 61951])

We can still sample from lists

names = ["Karl Marx", "Rosa Luxenburg", "Raya Dunayevskaya", "CLR James"]
sample_stream(iter(names), 2)
array(['Karl Marx', 'Raya Dunayevsk'], dtype='<U14')

calc_frac_size[source]

calc_frac_size(url, frac, session=None)

returns fraction size from a json stream
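
Presumably this streams through the file once to get the population size and then takes a fraction of it. A hypothetical sketch, with a guard mirroring the at-least-one behaviour tested for sample_year below:

def calc_frac_size_sketch(url: str, frac: float) -> int:
    # Count the population by streaming, then take a fraction of it,
    # keeping at least one item
    return max(round(count_json_iter(url) * frac), 1)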

calc_year_from_total[source]

calc_year_from_total(total, start, end, step)

Calculate size of a year sample based on a total sample size

calc_year_from_total(10, 1850, 1950, 1)
1
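
This is consistent with spreading the total evenly across the sampled years while requesting at least one item per year. A hypothetical sketch:

def calc_year_from_total_sketch(total, start, end, step):
    # Divide the total sample size by the number of years sampled,
    # never going below one item per year
    n_years = len(range(start, end + 1, step))
    return max(total // n_years, 1)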

Reducing memory usage

Since we are trying to be a bit careful with memory usage, we will convert column dtypes to smaller types where possible.

reduce_df_memory[source]

reduce_df_memory(df)
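
One common approach, which may or may not match this function's implementation, is to downcast numeric columns with pandas:

import pandas as pd

def reduce_df_memory_sketch(df: pd.DataFrame) -> pd.DataFrame:
    # Downcast each numeric column to the smallest dtype that can hold it
    for col in df.select_dtypes("integer").columns:
        df[col] = pd.to_numeric(df[col], downcast="integer")
    for col in df.select_dtypes("float").columns:
        df[col] = pd.to_numeric(df[col], downcast="float")
    return df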

Sampling Newspaper Navigator

We now start building up a class, nnSampler, for doing our actual sampling.

class nnSampler[source]

nnSampler()

Sampler for creating samples from Newspaper Navigator data

sample_year[source]

sample_year(kind:str, sample_size:Union[int, float], year:int)

samples sample_size for year and kind

sample_year("photos", 1, 1850)
assert (
    len(sample_year("maps", 0.1, 1850)) == 1
)  # test we always have a sample size of at least one

nnSampler.create_sample[source]

nnSampler.create_sample(sample_size:Union[int, float], kind:str='photos', start_year:int=1850, end_year:int=1950, step:int=5, year_sample=True, save:bool=False, reduce_memory=True)

Creates a sample of Newspaper Navigator data for a given set of years and a kind

Parameters:
    sample_size (int, float): sample size; can either be a fixed number or a fraction of the total dataset size
    kind (str): kind of image from news-navigator: {'ads', 'photos', 'maps', 'illustrations', 'comics', 'cartoons', 'headlines'}

Returns:
    pandas.DataFrame: holding data from the input json url

create_sample returns a DataFrame sampled from Newspaper Navigator. year_sample controls whether sample_size applies to each year or to your entire sample. Setting year_sample to False will return a sample of a size close to what you define in sample_size. This is useful, for example, if you plan to annotate your sample with some new labels.

For any year where the requested sample size is larger than the number of items available, you simply get everything for that year.

sampler = nnSampler()
sampler.create_sample(5, step=2, end_year=1852, year_sample=False)
filepath pub_date page_seq_num edition_seq_num batch lccn box score ocr place_of_publication geographic_coverage name publisher url page_url
0 msar_icydrop_ver05/data/sn86074079/00295878502... 1850-12-05 1149 1 msar_icydrop_ver05 sn86074079 [0.4527419749698691, 0.07917078993055555, 0.75... 0.951330 [Wednesday, Dec., Uth, one, day, only., -, i, ... Canton, Miss. [Mississippi--Madison--Canton] The Madisonian. R.D. Price https://news-navigator.labs.loc.gov/data/msar_... https://chroniclingamerica.loc.gov/data/batche...
1 in_indianapolisolympians_ver02/data/sn86058217... 1850-04-03 470 1 in_indianapolisolympians_ver02 sn86058217 [0.19613312344418035, 0.41648075810185187, 0.3... 0.943215 [,, j, ii, i:, 1., 1, a, it, A, I, II, V, ...] Richmond, IA [i.e. Ind.] [Indiana--Wayne--Richmond] Richmond palladium. [volume] D.P. Holloway & B.W. Davis https://news-navigator.labs.loc.gov/data/in_in... https://chroniclingamerica.loc.gov/data/batche...
2 ohi_ingstad_ver01/data/sn85026051/00296027029/... 1850-12-07 115 1 ohi_ingstad_ver01 sn85026051 [0.30707743987524494, 0.6473851770787806, 0.44... 0.984739 [COME, IN,, WE, CALL, YOU!] Fremont, Sandusky County, Ohio [Ohio--Sandusky--Fremont] Fremont weekly freeman. [volume] J.S. Fouke https://news-navigator.labs.loc.gov/data/ohi_i... https://chroniclingamerica.loc.gov/data/batche...
3 ohi_edgar_ver01/data/sn85038121/00280775502/18... 1852-06-24 443 1 ohi_edgar_ver01 sn85038121 [0.5482568719219458, 0.05572789142773935, 0.80... 0.966192 [;, 'Y, '., ", i, ', -, 7, i, ', f, 1,, ft, ',... Gallipolis, Ohio [Ohio--Gallia--Gallipolis] Gallipolis journal. [volume] Alexander Vance https://news-navigator.labs.loc.gov/data/ohi_e... https://chroniclingamerica.loc.gov/data/batche...
4 msar_cloudchaser_ver01/data/sn87065038/0029587... 1852-10-07 262 1 msar_cloudchaser_ver01 sn87065038 [0.4655820986278216, 0.5739247633736971, 0.601... 0.968907 [i, "ntnTf-fhr-i, -irrr-T-, r, ., -, J, I] Columbus, Miss. [Mississippi--Lowndes--Columbus] The primitive Republican. F.G. Baldwin https://news-navigator.labs.loc.gov/data/msar_... https://chroniclingamerica.loc.gov/data/batche...
5 txdn_eastland_ver01/data/sn83025730/0027955983... 1852-07-10 611 1 txdn_eastland_ver01 sn83025730 [0.6132026808043348, 0.2620943737307529, 0.975... 0.982666 [1, i, ||, «, -, ', #, •«, ft,, %, >*<», I, <,... Marshall, Tex. [Texas--Harrison--Marshall] The Texas Republican. [volume] F.J. Patillo https://news-navigator.labs.loc.gov/data/txdn_... https://chroniclingamerica.loc.gov/data/batche...

Downloading sample images

nnSampler.download_sample[source]

nnSampler.download_sample(out_dir:str, json_name:Optional[str]=None, df:Optional[DataFrame]=None, original:bool=True, pct:Optional[int]=None, size:Optional[tuple]=None, preserve_asp_ratio:bool=True)

Download images associated with a sample. The majority of parameters relate to the options available in an IIIF image request; see https://iiif.io/api/image/3.0/#4-image-requests for further information.

Parameters

out_dir: the save directory for the images
json_name: optional str
df: optional DataFrame containing a sample
original: if True, will download original size images via IIIF
pct: optional value which scales the size of the images requested by pct
size: a tuple representing width by height; will be passed to the IIIF request
preserve_asp_ratio: whether to ask the IIIF request to preserve the aspect ratio of the image or not

Returns

None

download_sample is used to download the images associated with a sample.
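
For reference, an IIIF Image API request is a URL of the form {identifier}/{region}/{size}/{rotation}/{quality}.{format}; the pct and size parameters above map onto the size segment. A hypothetical example (the identifier here is made up):

image_id = "https://example.org/iiif/some-image"  # hypothetical identifier
# full region, scaled to 50% of the original size, no rotation, JPEG output
iiif_url = f"{image_id}/full/pct:50/0/default.jpg"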

sampler = nnSampler()
sampler
nnSampler
sampler.population
ads_count photos_count maps_count illustrations_count comics_count cartoons_count headlines_count total
1850 8841 22 5 671 9 0 11243 20791
1851 10065 20 6 457 7 0 12262 22817
1852 8764 22 10 671 10 8 13524 23009
1853 11517 45 5 1106 88 1 13224 25986
1854 15050 221 15 732 11 3 15282 31314
... ... ... ... ... ... ... ... ...
1946 185139 5945 1857 1053 3280 861 68275 266410
1947 181223 4188 1750 1115 3630 797 57018 249721
1948 152987 4282 1359 1154 3031 624 43432 206869
1949 154510 6015 1888 1280 3356 634 42904 210587
1950 154961 5630 1952 1223 3893 704 37854 206217

101 rows × 8 columns

df = sampler.create_sample(
    sample_size=10, kind="photos", start_year=1850, end_year=1855, reduce_memory=True
)
df.head(5)
filepath pub_date page_seq_num edition_seq_num batch lccn box score ocr place_of_publication geographic_coverage name publisher url page_url
0 ohi_ingstad_ver01/data/sn85026051/00296027029/... 1850-07-27 37 1 ohi_ingstad_ver01 sn85026051 [0.29913574490319106, 0.622813938380955, 0.430... 0.980025 [ht, I, ', Wll., ., III, tl, T, ., "', "', ", ... Fremont, Sandusky County, Ohio [Ohio--Sandusky--Fremont] Fremont weekly freeman. [volume] J.S. Fouke https://news-navigator.labs.loc.gov/data/ohi_i... https://chroniclingamerica.loc.gov/data/batche...
1 ohi_ingstad_ver01/data/sn85026051/00296027029/... 1850-07-20 33 1 ohi_ingstad_ver01 sn85026051 [0.3009427797781111, 0.6294158908847332, 0.433... 0.929614 [L, -, COME, IN,, WE, CALL, YOU, !, .v';:] Fremont, Sandusky County, Ohio [Ohio--Sandusky--Fremont] Fremont weekly freeman. [volume] J.S. Fouke https://news-navigator.labs.loc.gov/data/ohi_i... https://chroniclingamerica.loc.gov/data/batche...
2 ncu_hawk_ver02/data/sn84026472/00416156360/185... 1850-05-22 289 1 ncu_hawk_ver02 sn84026472 [0.6732673909317263, 0.042179068056539225, 0.8... 0.914908 [] Hillsborough, N.C. [North Carolina--Orange--Hillsboro] The Hillsborough recorder. [volume] Dennis Heartt https://news-navigator.labs.loc.gov/data/ncu_h... https://chroniclingamerica.loc.gov/data/batche...
3 msar_icydrop_ver05/data/sn86074079/00295878502... 1850-12-05 1149 1 msar_icydrop_ver05 sn86074079 [0.4527419749698691, 0.07917078993055555, 0.75... 0.951330 [Wednesday, Dec., Uth, one, day, only., -, i, ... Canton, Miss. [Mississippi--Madison--Canton] The Madisonian. R.D. Price https://news-navigator.labs.loc.gov/data/msar_... https://chroniclingamerica.loc.gov/data/batche...
4 vtu_londonderry_ver01/data/sn84023252/00200296... 1850-05-11 283 1 vtu_londonderry_ver01 sn84023252 [0.5275910554108796, 0.16344128086556137, 0.68... 0.915448 [Old, Dr., Jnaob, TownHond,, 'Jlu, Utigiml, l)... St. Johnsbury, Vt. [Vermont--Caledonia--Saint Johnsbury] The Caledonian. [volume] A.G. Chadwick https://news-navigator.labs.loc.gov/data/vtu_l... https://chroniclingamerica.loc.gov/data/batche...

Downloading a sample

We may want to work with images locally. We can download them using the download_sample method.

sampler.create_sample(
    sample_size=10, kind="ads", start_year=1850, end_year=1850, reduce_memory=True
)
sampler.download_sample("test")