assert (
    get_json_url(1860)
    == "https://news-navigator.labs.loc.gov/prepackaged/1860_photos.json"
)
assert (
    get_json_url(1950)
    == "https://news-navigator.labs.loc.gov/prepackaged/1950_photos.json"
)
assert (
    get_json_url(1950, "ads")
    == "https://news-navigator.labs.loc.gov/prepackaged/1950_ads.json"
)
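For reference, the behaviour pinned down by these asserts could be implemented with something as simple as the sketch below; the real get_json_url in the library may differ in details such as validation.
# A hypothetical sketch consistent with the asserts above: build the prepackaged
# URL from a year and a kind, with "photos" as the default kind.
def get_json_url_sketch(year, kind="photos"):
    return f"https://news-navigator.labs.loc.gov/prepackaged/{year}_{kind}.json"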
We can also test that this returns what we expect inside the notebook. These tests are often hidden in the documentation, but inside the notebook there will usually be a cell below a function definition that includes some tests.
test_json = load_json(
    "https://news-navigator.labs.loc.gov/prepackaged/1950_photos.json"
)
assert type(test_json[0]) == dict
assert type(test_json) == list
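Again for reference, load_json presumably does little more than download the file and parse all of it into memory. A minimal sketch, not the library's exact implementation, might look like this:
import json
import requests

# Hypothetical sketch: fetch the whole file and parse it in one go.
def load_json_sketch(url):
    with requests.get(url) as r:
        return json.loads(r.content)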
This works well for a smallish file, but if we try this with a much larger file such as 1910_ads.json, which is ~3.3GB, we will likely run out of memory. For example, running
with requests.get('https://news-navigator.labs.loc.gov/prepackaged/1910_ads.json') as r:
    data = json.loads(r.content)
len(data)
on a Google Colab instance with 25GB of RAM causes a crash.
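To get a sense of how large these files are before downloading them, we can ask the server for just the headers; if the server reports a Content-Length, that is the file size in bytes. This is a quick illustrative check rather than part of the library:
import requests

# HEAD request: headers only, no body. allow_redirects=True in case the URL redirects.
head = requests.head(
    "https://news-navigator.labs.loc.gov/prepackaged/1910_ads.json", allow_redirects=True
)
head.headers.get("Content-Length")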
Streaming JSON
One way to get around this would be to throw more RAM at the problem. However, since we only want to sample the JSON and don't need to work with the whole dataset, this seems wasteful. So instead, we'll use ijson, a Python library for streaming JSON.
We can see how this works for a URL from Newspaper Navigator if we create a request via Requests using stream=True to return a streaming version of the response.
r = requests.get(get_json_url(1850, "ads"), stream=True)
We can pass this response to ijson. In this case, we just parse one item at a time. If the JSON is really big, even a single item might be too much; ijson allows for much more granular parsing of JSON, but for what we need, parsing by item is fine. We can see what this returns below.
objects = ijson.items(r.raw, "item")
objects
We get back something from _yajl2; this is the underlying parser ijson is using. See the ijson docs for more on the available parsers.
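If we wanted to be explicit about which parser is used, ijson also ships a pure-Python backend that exposes the same interface; it is slower than the yajl2 C backend but works everywhere. The snippet below is an illustrative aside, and the backends available depend on how ijson was installed:
import ijson.backends.python as ijson_python
import requests

# Same interface, different parser: stream one item with the pure-Python backend.
with requests.get(get_json_url(1850, "ads"), stream=True) as r_py:
    first_py = next(ijson_python.items(r_py.raw, "item"))
first_py.keys()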
We can call next on this object to start iterating over it, one item at a time. If we look at the keys of the first response, we'll see that this is one entry from the original JSON data.
first = next(objects)
first.keys()
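If we want to peek at a few more entries without pulling down the whole file, we can take a small slice of the lazy iterator (purely illustrative):
from itertools import islice

# Take the next three items from the stream; only these are held in memory.
next_three = list(islice(objects, 3))
len(next_three)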
r.close()
data = {"a": "one", "b": "two", "c": "three"}
len(data)
If we try to do this with our objects, we get an error: TypeError: object of type '_yajl2.items' has no len(). This is because the whole point of ijson is to avoid loading the JSON into memory, so we don't know in advance how long the total data will be.
We can get around this by using the toolz library's itertoolz.count function. count is similar to len, except that it can work on lazy sequences, i.e. anything we can call next on. Unfortunately, this ends up being relatively slow: although we avoid loading the data into memory, we still need to stream through all of it to get the length. We usually won't need to call this repeatedly, but in case we do, we cache the results to make sure we don't calculate the length of the same data multiple times.
count_json_iter counts the length of a json file loaded via URL.
count_json_iter("https://news-navigator.labs.loc.gov/prepackaged/1850_photos.json")
url = "https://news-navigator.labs.loc.gov/prepackaged/1850_photos.json"
assert type(count_json_iter(url)) == int
assert len(json.loads(requests.get(url).content)) == count_json_iter(url)
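Under the hood, a streaming count along the lines described above could look roughly like this. This is a sketch combining ijson, toolz and a cache, not necessarily how count_json_iter is actually implemented:
import functools

import ijson
import requests
from toolz import itertoolz

# Hypothetical sketch: stream the items and count them without loading the JSON,
# caching the result so repeated calls for the same URL are free.
@functools.lru_cache(maxsize=None)
def count_json_iter_sketch(url):
    with requests.get(url, stream=True) as r:
        return itertoolz.count(ijson.items(r.raw, "item"))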
get_year_size(1850, "photos")
Returns the year sizes for a given kind, taking a step size step. For example, to get the number of photos in the Newspaper Navigator dataset for every year between 1850 and 1855:
%%time
get_year_sizes("photos", 1850, 1855, step=1)
assert len(get_year_sizes("photos", 1850, 1860, step=1)) == 11
assert len(get_year_sizes("photos", 1850, 1860, step=2)) == 6
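Given the helpers above, get_year_sizes presumably just loops over the requested years. Here is a rough sketch returning a plain dict; the library's version may return a DataFrame and differ in other details:
# Hypothetical sketch: one streaming count per year in the range.
def get_year_sizes_sketch(kind, start_year, end_year, step=1):
    return {
        year: count_json_iter(get_json_url(year, kind))
        for year in range(start_year, end_year + 1, step)
    }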
Streaming sampling
Since we want a subset of the Newspaper Navigator datasets that we can work with for annotation or inference, we need to create samples. Sampling in Python can be complicated depending on the type of population you are working with and the properties you want your sample to have. Usually, we can do something relatively simple. For example, if we want to sample from a selection of books, we could do:
import random
books = ["War and Peace", "Frankenstein", "If They Come in the Morning"]
random.sample(books, 1)
However, we run into the same problem as above when trying to get the length of a JSON dataset that doesn't fit into memory: we want to sample $k$ examples from one of our JSON files, but we can't load the file into memory. To get around this, we can use reservoir sampling:
Reservoir sampling is a family of randomized algorithms for choosing a simple random sample without replacement of k items from a population of unknown size n in a single pass over the items. The size of the population n is not known to the algorithm and is typically too large to fit all n items into main memory. The population is revealed to the algorithm over time, and the algorithm cannot look back at previous items.
Now we can sample whilst only loading a small number of items into memory at any one time. This does come at some cost, mainly speed. There are faster ways of sampling from a stream, but this isn't the main bottleneck for sampling in this case. We can, for example, sample from a large range of numbers without memory issues.
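As an illustration of the idea, a minimal reservoir sampler (Algorithm R) can be written in a few lines. sample_stream in the library may be implemented differently, but the principle is the same:
import itertools
import random

# Hypothetical sketch of Algorithm R: keep a k-item reservoir and let each later
# item displace a random reservoir slot with probability k/i.
def sample_stream_sketch(stream, k):
    stream = iter(stream)
    reservoir = list(itertools.islice(stream, k))
    for i, item in enumerate(stream, start=k + 1):
        j = random.randrange(i)
        if j < k:
            reservoir[j] = item
    return reservoir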
sample_stream(range(1, 100000), 5)
We can still sample from lists
names = ["Karl Marx", "Rosa Luxenburg", "Raya Dunayevskaya", "CLR James"]
sample_stream(iter(names), 2)
calc_year_from_total(10, 1850, 1950, 1)
Sampling Newspaper navigator
We now start building up a class, nnSampler, for doing our proper sampling.
sample_year("photos", 1, 1850)
assert (
    len(sample_year("maps", 0.1, 1850)) == 1
)  # test we always have a sample size of at least one
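Putting the earlier pieces together, sample_year can be imagined as something like the sketch below: stream one year's JSON and reservoir-sample from it, treating a float sample_size as a fraction of that year's items and always keeping at least one. This is an inference from the asserts above, not the library's actual code:
import ijson
import requests

# Hypothetical sketch built from get_json_url, count_json_iter and sample_stream.
def sample_year_sketch(kind, sample_size, year):
    url = get_json_url(year, kind)
    if isinstance(sample_size, float):
        # Interpret a float as a fraction of the year's items, keeping at least one
        sample_size = max(1, round(count_json_iter(url) * sample_size))
    with requests.get(url, stream=True) as r:
        return sample_stream(ijson.items(r.raw, "item"), sample_size)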
create_sample returns a DataFrame containing a sample from Newspaper Navigator. year_sample controls whether you want sample_size to apply to each year or to your entire sample. Setting year_sample to False will return a sample with a size close to what you define in sample_size. This is useful, for example, if you plan to annotate your sample with some new labels. For any years where the requested sample size is larger than what is available, you just get everything for that year.
sampler = nnSampler()
sampler.create_sample(5, step=2, end_year=1852, year_sample=False)
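For comparison, the same call with year_sample=True would aim for roughly sample_size rows per sampled year rather than for the whole range. This is an illustration of the behaviour described above, not an additional test from the library:
# With year_sample=True, sample_size applies to each sampled year,
# so this should return roughly 5 rows each for 1850 and 1852.
sampler.create_sample(5, step=2, end_year=1852, year_sample=True)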
download_sample is used to download images from a sample.
sampler = nnSampler()
sampler
sampler.population
df = sampler.create_sample(
    sample_size=10, kind="photos", start_year=1850, end_year=1855, reduce_memory=True
)
df.head(5)
Downloading a sample
We may want to work with images locally. We can download them using the download_sample method.
sampler.create_sample(
    sample_size=10, kind="ads", start_year=1850, end_year=1850, reduce_memory=True
)
sampler.download_sample("test")
from nbdev.export import notebook2script
notebook2script()