Harvesting

At Distributed Proofreaders (DP), a harvest (also called a proofraid) is the process of downloading the page images of a document from an online source.



This can be done manually, by loading and saving the page images one by one, but it is more frequently done using a script written for the purpose.

Two scripts are available at DP for such image harvesting: gharvest, which works only with Google book scans, and snatch, which is more generally applicable, having a modular design to work with various Image Provider sites.
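To illustrate what a harvesting script does in general, here is a minimal Python sketch. It is not the code of gharvest or snatch; it only shows the basic idea of fetching numbered page images in a loop. The URL template, the `{page}` placeholder, and the function names `page_urls` and `harvest` are hypothetical, and the sketch assumes the source site serves one PNG per page at a predictable address and permits automated downloading.

```python
from __future__ import annotations

import urllib.request
from pathlib import Path


def page_urls(template: str, first: int, last: int) -> list[str]:
    """Build one URL per page from a template containing a {page} placeholder."""
    return [template.format(page=n) for n in range(first, last + 1)]


def harvest(template: str, first: int, last: int, out_dir: str) -> list[Path]:
    """Download pages first..last and save them as 001.png, 002.png, ...

    Assumption: every page is a PNG image reachable at its template URL.
    """
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    saved = []
    urls = page_urls(template, first, last)
    for n, url in zip(range(first, last + 1), urls):
        dest = target / f"{n:03d}.png"
        with urllib.request.urlopen(url) as response:
            dest.write_bytes(response.read())
        saved.append(dest)
    return saved
```

A call such as `harvest("https://example.org/book/{page}.png", 1, 250, "scans")` would then save 250 files into a `scans` directory; the address is a placeholder, not a real Image Provider.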

What is Harvesting?

There are many online sources of scans suitable for processing through Distributed Proofreaders (DP). Content providers may download or "harvest" image scans from these websites, get copyright clearance for them, and then OCR and process them as they would their own scans. There are several coordinated scan harvesting efforts going on at DP.

Requirements for Harvested Documents

The requirements for a harvested document are the same as for any other document going through DP:

  • A harvested document must be in the public domain so that copyright clearance can be obtained. Be careful that you do not harvest and illegally distribute any copyrighted or otherwise restricted material.
  • All pages must be present. Be warned: quality control in some of these collections is weak, so before you start, make sure that scans are available for every page of the project. If you love a project, broken though it may be, please complete it with the aid of the Missing Pages Wiki before uploading it to DP.
  • Illustrations of sufficiently high quality must be included for the post-processor. Some of the collections provide lovely high-resolution scans, so please include these, in addition to the lower-quality scans for the proofers, at the time of project creation. (Please do not ask the post-processors to download these themselves; this is your job!) Check harvesting high-resolution images to see if there are tips for locating the best images to use at the site in question. If there are no high-quality illustrations available to harvest, please consider choosing another project, or completing the project with the aid of the Missing Pages Wiki before uploading. Though the text is often considered the more important part of these e-books, you may have trouble finding a post-processor for the project if the illustrations are not satisfactory.
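The "all pages must be present" requirement above can be partly checked automatically before a project is uploaded. The following Python sketch is one possible approach, not an official DP tool. It assumes the harvested files are named by page number (for example `001.png`); the function name `missing_pages` is hypothetical.

```python
import re


def missing_pages(filenames, first, last):
    """Return the page numbers in first..last that have no matching file.

    Assumption: each file is named <page number>.<extension>, e.g. 001.png.
    Files that do not follow this pattern are ignored.
    """
    present = set()
    for name in filenames:
        match = re.fullmatch(r"(\d+)\.\w+", name)
        if match:
            present.add(int(match.group(1)))
    return [n for n in range(first, last + 1) if n not in present]
```

Passing the output of `os.listdir` for the scan directory, together with the expected first and last page numbers, gives a list of gaps to investigate. A check like this only detects missing files; it cannot tell whether a scan is blank, duplicated, or cropped, so the images still need to be looked over by eye.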