assert (
    get_json_url(1860)
    == "https://news-navigator.labs.loc.gov/prepackaged/1860_photos.json"
)
assert (
    get_json_url(1950)
    == "https://news-navigator.labs.loc.gov/prepackaged/1950_photos.json"
)
assert (
    get_json_url(1950, "ads")
    == "https://news-navigator.labs.loc.gov/prepackaged/1950_ads.json"
)
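For reference, the behaviour pinned down by these asserts could be implemented with something as simple as the sketch below; the real get_json_url in the library may differ in details such as validation.
# A hypothetical sketch consistent with the asserts above: build the prepackaged
# URL from a year and a kind, with "photos" as the default kind.
def get_json_url_sketch(year, kind="photos"):
    return f"https://news-navigator.labs.loc.gov/prepackaged/{year}_{kind}.json"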
We can also test that this returns what we expect inside the notebook. These tests are often hidden in the documentation, but inside the notebook there will usually be a cell below a function definition that includes some tests.
test_json = load_json(
    "https://news-navigator.labs.loc.gov/prepackaged/1950_photos.json"
)
assert type(test_json[0]) == dict
assert type(test_json) == list
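Again for reference, load_json presumably does little more than download the file and parse all of it into memory. A minimal sketch, not the library's exact implementation, might look like this:
import json
import requests

# Hypothetical sketch: fetch the whole file and parse it in one go.
def load_json_sketch(url):
    with requests.get(url) as r:
        return json.loads(r.content)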
This works well for a smallish file, but if we try this with a much larger file such as 1910_ads.json, which is ~3.3GB, we will likely run out of memory. For example, running
with requests.get('https://news-navigator.labs.loc.gov/prepackaged/1910_ads.json') as r:
    data = json.loads(r.content)
len(data)
on a Google Colab instance with 25GB of RAM causes a crash.
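To get a sense of how large these files are before downloading them, we can ask the server for just the headers; if the server reports a Content-Length, that is the file size in bytes. This is a quick illustrative check rather than part of the library:
import requests

# HEAD request: headers only, no body. allow_redirects=True in case the URL redirects.
head = requests.head(
    "https://news-navigator.labs.loc.gov/prepackaged/1910_ads.json", allow_redirects=True
)
head.headers.get("Content-Length")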
Streaming JSON
One way to get around this would be to throw more RAM at the problem. However, since we only want to sample the JSON and don't need to work with the whole dataset, this seems wasteful. So instead, we'll use ijson, a Python library for streaming JSON.
We can see how this works for a URL from Newspaper Navigator if we create a request via Requests using stream=True to return a streaming version of the response.
r = requests.get(get_json_url(1850, "ads"), stream=True)
We can pass this response to ijson. In this case, we just parse one item at a time. If the JSON is really big, even a single item might be too much; ijson allows for much more granular parsing of JSON, but for what we need, parsing by item is fine. We can see what this returns below.
objects = ijson.items(r.raw, "item")
objects
We get back something from _yajl2; this is the underlying parser ijson is using. See the ijson docs for more on the available parsers.
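If we wanted to be explicit about which parser is used, ijson also ships a pure-Python backend that exposes the same interface; it is slower than the yajl2 C backend but works everywhere. The snippet below is an illustrative aside, and the backends available depend on how ijson was installed:
import ijson.backends.python as ijson_python
import requests

# Same interface, different parser: stream one item with the pure-Python backend.
with requests.get(get_json_url(1850, "ads"), stream=True) as r_py:
    first_py = next(ijson_python.items(r_py.raw, "item"))
first_py.keys()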
We can call next on this object to start iterating over it, one item at a time. If we look at the keys of the first response, we'll see that this is one entry from the original JSON data.
first = next(objects)
first.keys()
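If we want to peek at a few more entries without pulling down the whole file, we can take a small slice of the lazy iterator (purely illustrative):
from itertools import islice

# Take the next three items from the stream; only these are held in memory.
next_three = list(islice(objects, 3))
len(next_three)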
r.close()
data = {"a": "one", "b": "two", "c": "three"}
len(data)
If we try to do this with our objects, we get an error: TypeError: object of type '_yajl2.items' has no len(). This is because the whole point of ijson is to avoid loading the JSON into memory, so we don't know in advance how long the total data will be.
We can get around this by using the toolz library's itertoolz.count function. count is similar to len, except that it can work on lazy sequences, i.e. anything we can call next on. Unfortunately, this ends up being relatively slow: although we avoid loading the data into memory, we still need to stream through all of it to get the length. We usually won't need to call this repeatedly, but in case we do, we cache the results to make sure we don't calculate the length of the same data multiple times.
count_json_iter counts the length of a json file loaded via URL.
count_json_iter("https://news-navigator.labs.loc.gov/prepackaged/1850_photos.json")
url = "https://news-navigator.labs.loc.gov/prepackaged/1850_photos.json"
assert type(count_json_iter(url)) == int
assert len(json.loads(requests.get(url).content)) == count_json_iter(url)
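Under the hood, a streaming count along the lines described above could look roughly like this. This is a sketch combining ijson, toolz and a cache, not necessarily how count_json_iter is actually implemented:
import functools

import ijson
import requests
from toolz import itertoolz

# Hypothetical sketch: stream the items and count them without loading the JSON,
# caching the result so repeated calls for the same URL are free.
@functools.lru_cache(maxsize=None)
def count_json_iter_sketch(url):
    with requests.get(url, stream=True) as r:
        return itertoolz.count(ijson.items(r.raw, "item"))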
get_year_size(1850, "photos")
Returns the year sizes for a given kind, taking a step size step. For example, to get the number of photos in the Newspaper Navigator dataset for every year between 1850 and 1855:
%%time
get_year_sizes("photos", 1850, 1855, step=1)
assert len(get_year_sizes("photos", 1850, 1860, step=1)) == 11
assert len(get_year_sizes("photos", 1850, 1860, step=2)) == 6
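Given the helpers above, get_year_sizes presumably just loops over the requested years. Here is a rough sketch returning a plain dict; the library's version may return a DataFrame and differ in other details:
# Hypothetical sketch: one streaming count per year in the range.
def get_year_sizes_sketch(kind, start_year, end_year, step=1):
    return {
        year: count_json_iter(get_json_url(year, kind))
        for year in range(start_year, end_year + 1, step)
    }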
Streaming sampling
Since we want a subset of the Newspaper Navigator datasets that we can work with for annotation or inference, we need to create samples. Sampling in Python can be complicated depending on the type of population you are working with and the properties you want your sample to have. Usually, we can do something relatively simple. For example, if we want to sample from a selection of books, we could do:
import random
books = ["War and Peace", "Frankenstein", "If They Come in the Morning"]
random.sample(books, 1)
However, we run into the same problem as above when trying to get the length of a JSON dataset that doesn't fit into memory: we want to sample $k$ examples from one of our JSON files, but we can't load the file into memory. To get around this, we can use reservoir sampling:
Reservoir sampling is a family of randomized algorithms for choosing a simple random sample without replacement of k items from a population of unknown size n in a single pass over the items. The size of the population n is not known to the algorithm and is typically too large to fit all n items into main memory. The population is revealed to the algorithm over time, and the algorithm cannot look back at previous items.
Now we can sample whilst only loading a small number of items into memory at any one time. This does come at some cost, mainly speed. There are faster ways of sampling from a stream, but this isn't the main bottleneck for sampling in this case. We can, for example, sample from a large range of numbers without memory issues.
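As an illustration of the idea, a minimal reservoir sampler (Algorithm R) can be written in a few lines. sample_stream in the library may be implemented differently, but the principle is the same:
import itertools
import random

# Hypothetical sketch of Algorithm R: keep a k-item reservoir and let each later
# item displace a random reservoir slot with probability k/i.
def sample_stream_sketch(stream, k):
    stream = iter(stream)
    reservoir = list(itertools.islice(stream, k))
    for i, item in enumerate(stream, start=k + 1):
        j = random.randrange(i)
        if j < k:
            reservoir[j] = item
    return reservoir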
sample_stream(range(1, 100000), 5)
We can still sample from lists
names = ["Karl Marx", "Rosa Luxenburg", "Raya Dunayevskaya", "CLR James"]
sample_stream(iter(names), 2)
calc_year_from_total(10, 1850, 1950, 1)
Sampling Newspaper navigator
We now start building up a class, nnSampler, for doing our proper sampling.
sample_year("photos", 1, 1850)
assert (
    len(sample_year("maps", 0.1, 1850)) == 1
)  # test we always have a sample size of at least one
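Putting the earlier pieces together, sample_year can be imagined as something like the sketch below: stream one year's JSON and reservoir-sample from it, treating a float sample_size as a fraction of that year's items and always keeping at least one. This is an inference from the asserts above, not the library's actual code:
import ijson
import requests

# Hypothetical sketch built from get_json_url, count_json_iter and sample_stream.
def sample_year_sketch(kind, sample_size, year):
    url = get_json_url(year, kind)
    if isinstance(sample_size, float):
        # Interpret a float as a fraction of the year's items, keeping at least one
        sample_size = max(1, round(count_json_iter(url) * sample_size))
    with requests.get(url, stream=True) as r:
        return sample_stream(ijson.items(r.raw, "item"), sample_size)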
create_sample returns a DataFrame containing a sample from Newspaper Navigator. year_sample controls whether you want sample_size to apply to each year or to your entire sample. Setting year_sample to False will return a sample with a size close to what you define in sample_size. This is useful, for example, if you plan to annotate your sample with some new labels. For any years where the requested sample size is larger than what is available, you just get everything for that year.
sampler = nnSampler()
sampler.create_sample(5, step=2, end_year=1852, year_sample=False)
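For comparison, the same call with year_sample=True would aim for roughly sample_size rows per sampled year rather than for the whole range. This is an illustration of the behaviour described above, not an additional test from the library:
# With year_sample=True, sample_size applies to each sampled year,
# so this should return roughly 5 rows each for 1850 and 1852.
sampler.create_sample(5, step=2, end_year=1852, year_sample=True)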
download_sample is used to download images from a sample.
sampler = nnSampler()
sampler
sampler.population
df = sampler.create_sample(
    sample_size=10, kind="photos", start_year=1850, end_year=1855, reduce_memory=True
)
df.head(5)
Downloading a sample
We may want to work with images locally. We can download them using the download_sample method.
sampler.create_sample(
    sample_size=10, kind="ads", start_year=1850, end_year=1850, reduce_memory=True
)
sampler.download_sample("test")
from nbdev.export import notebook2script
notebook2script()