Harvester for Earlydutchbooks
From DPWiki
The following short Python script can be used to download pages from earlydutchbooksonline. Note: the script is written for Python 2 (it uses urllib2, raw_input, xrange and the print statement), so it will not run unchanged under Python 3. Install Python 2.7 if you want to use it as it stands.
    from urllib2 import urlopen

    # Ask for the five inputs described below.
    code = raw_input("Document number: ")
    minnum = int(raw_input("First page: "))
    maxnum = int(raw_input("Last page: "))
    size = int(raw_input("Size: "))
    directory = raw_input("Directory: ")

    for page in xrange(minnum, maxnum + 1):
        # Build the image URL for this page and fetch it.
        url = "http://imageviewer.kb.nl/ImagingService/imagingService?h=%i&id=dpo:%s:mpeg21:%04i:image" % (size, code, page)
        response = urlopen(url)
        # Save the pages as 001.jpg, 002.jpg, ... counting from the first page.
        f = open(directory + "/%03i.jpg" % (page - minnum + 1), "wb")
        f.write(response.read())
        f.close()
        response.close()
        print "saved page %04i as page %03i" % (page, page - minnum + 1)
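For readers who only have Python 3, the same loop can be sketched roughly as follows. This is an untested port, not a replacement endorsed by the original author: the URL pattern is copied from the script above and is assumed to be still valid, and the names URL_TEMPLATE, build_url, target_name and harvest are made up for this illustration.

```python
import os
from urllib.request import urlopen

# URL pattern copied from the Python 2 script; it is an assumption that the
# image service still answers at this address.
URL_TEMPLATE = ("http://imageviewer.kb.nl/ImagingService/imagingService"
                "?h=%i&id=dpo:%s:mpeg21:%04i:image")


def build_url(code, page, size):
    """Return the image URL for one page (hypothetical helper)."""
    return URL_TEMPLATE % (size, code, page)


def target_name(page, first_page):
    """Return the output file name; the first harvested page becomes 001.jpg."""
    return "%03i.jpg" % (page - first_page + 1)


def harvest(code, first_page, last_page, size, directory):
    """Download pages first_page..last_page into an existing directory."""
    for page in range(first_page, last_page + 1):
        name = target_name(page, first_page)
        # Fetch the image bytes, then write them out in binary mode.
        with urlopen(build_url(code, page, size)) as response:
            data = response.read()
        with open(os.path.join(directory, name), "wb") as f:
            f.write(data)
        print("saved page %04i as %s" % (page, name))
```

A call such as harvest("8724", 1, 10, 800, "pages") would then take the place of the five prompts, assuming the directory "pages" already exists.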
When running the script, you will be asked for five pieces of data:
- The number of the book you want to harvest. It can be found in the URL of the page that shows the book. For example, the book at http://www.earlydutchbooksonline.nl/nl/view/image/searchvalue/de/searchaction/list/id/dpo%3A8724%3Ampeg21 has number 8724 (the part between the two occurrences of %3A).
- The first page number to be harvested.
- The last page number to be harvested.
- The size of the page images to be created. The value is passed as the h parameter of the image URL, so it is presumably the height in pixels, but I am not 100% sure. Size 800 seems to give a good page size both for OCR and for proofreading. If you want images for the post-processor, use a larger size such as 2000 or 3000.
- The directory where the files are to be saved. This must be an existing directory; the script does not create it.
The program will then fetch the pages from the earlydutchbooksonline server and save them in the specified directory as 001.jpg, 002.jpg, and so on. Numbering always starts at 001 for the first harvested page, whatever its page number in the book.
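The book number described in the first item of the list can also be pulled out of the page URL programmatically. The following is a minimal Python 3 sketch; the helper name book_number is invented here, and the pattern assumes that the number consists only of digits, which is inferred from the single example URL above.

```python
import re
from urllib.parse import unquote


def book_number(url):
    """Return the number between 'dpo:' and ':mpeg21', or None if absent."""
    decoded = unquote(url)  # turns %3A back into ':'
    # Assumption: the book number is made up of digits only.
    match = re.search(r"dpo:(\d+):mpeg21", decoded)
    return match.group(1) if match else None


example = ("http://www.earlydutchbooksonline.nl/nl/view/image/searchvalue/de/"
           "searchaction/list/id/dpo%3A8724%3Ampeg21")
print(book_number(example))  # prints 8724
```

The result is returned as a string, which is the form the harvesting script expects for the document number.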