Harvester for Earlydutchbooks

The following short Python script can be used to download page images from earlydutchbooksonline. Note: it is written for Python 2 (it uses raw_input, xrange, urllib2 and the print statement) and will not run unchanged under Python 3, so you are advised to install Python 2.7 if you want to use it.

# Python 2 script: relies on raw_input, xrange, urllib2 and the print statement.
import os
from urllib2 import urlopen

code = raw_input("Document number: ")
minnum = int(raw_input("First page: "))
maxnum = int(raw_input("Last page: "))
size = int(raw_input("Size: "))
directory = raw_input("Directory: ")

for page in xrange(minnum, maxnum + 1):
    # Output files are numbered from 001, whatever the first page number is.
    outnum = page - minnum + 1
    url = "http://imageviewer.kb.nl/ImagingService/imagingService?h=%i&id=dpo:%s:mpeg21:%04i:image" % (size, code, page)
    response = urlopen(url)
    try:
        # The target directory must already exist.
        with open(os.path.join(directory, "%03i.jpg" % outnum), "wb") as f:
            f.write(response.read())
    finally:
        response.close()
    print "saved page %04i as page %03i" % (page, outnum)

When running the script, you will be asked for five pieces of data:

The number of the book you want to harvest. It can be found in the URL of the page showing the book. For example, the book at http://www.earlydutchbooksonline.nl/nl/view/image/searchvalue/de/searchaction/list/id/dpo%3A8724%3Ampeg21 has number 8724 (the part between the two %3A sequences, which are URL-encoded colons).
The first page number to be harvested.
The last page number to be harvested.
The size of the page images to be created. The value is passed as the h parameter of the image URL, so presumably it is the height in pixels, but I am not 100% sure. Size 800 seems to give a good page size for both OCR and proofreading. If you want images for the post-processor, use a larger size such as 2000 or 3000.
The directory where the files are to be saved. This must be an existing directory; the script does not create it.
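
As a rough illustration of the first item, the book number can be pulled out of a viewer URL by decoding the %3A escapes and taking the digits after dpo:. The sketch below is written in Python 3 and is not part of the original script. The function name extract_document_number is made up for this example, and the pattern assumes the URL always contains an identifier of the form dpo:<digits>:mpeg21.

```python
import re
from urllib.parse import unquote


def extract_document_number(url):
    """Return the book number from an earlydutchbooksonline viewer URL.

    Hypothetical helper: assumes the URL contains an identifier shaped
    like dpo%3A<digits>%3Ampeg21, as in the example above.
    """
    decoded = unquote(url)  # turns each %3A back into ':'
    match = re.search(r"dpo:(\d+):mpeg21", decoded)
    if match is None:
        raise ValueError("no dpo:<number>:mpeg21 identifier found in URL")
    return match.group(1)


example = ("http://www.earlydutchbooksonline.nl/nl/view/image/searchvalue/de/"
           "searchaction/list/id/dpo%3A8724%3Ampeg21")
print(extract_document_number(example))  # prints 8724
```

The number is returned as a string, which is how the script uses it when building the image URL.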

The program will then fetch the pages from the earlydutchbooksonline server and save them as 001.jpg, 002.jpg, and so on in the specified directory. Numbering always starts at 001, whatever the first page number is.
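
For readers who only have Python 3, the same download loop might look roughly like the sketch below. It is an untested-against-the-server port, not the original author's script: the names harvest, fetch_bytes and URL_TEMPLATE are invented here, and the fetch parameter is an added assumption so that the loop can be tried out without network access. The URL pattern is copied from the script above.

```python
import os
from urllib.request import urlopen

# URL pattern taken from the Python 2 script: size, book number, page number.
URL_TEMPLATE = ("http://imageviewer.kb.nl/ImagingService/imagingService"
                "?h=%i&id=dpo:%s:mpeg21:%04i:image")


def fetch_bytes(url):
    """Download a URL and return the response body as bytes."""
    with urlopen(url) as response:
        return response.read()


def harvest(code, first, last, size, directory, fetch=fetch_bytes):
    """Save pages first..last as 001.jpg, 002.jpg, ... in directory.

    The directory must already exist. Returns the list of paths written.
    """
    saved = []
    for page in range(first, last + 1):
        outnum = page - first + 1  # output numbering always starts at 1
        url = URL_TEMPLATE % (size, code, page)
        path = os.path.join(directory, "%03i.jpg" % outnum)
        with open(path, "wb") as f:
            f.write(fetch(url))
        saved.append(path)
        print("saved page %04i as page %03i" % (page, outnum))
    return saved
```

With this in place, a call such as harvest("8724", 1, 10, 800, "pages") would request pages 1 to 10 of book 8724 at size 800 and write them to an existing directory named pages.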
