Projects needing OCR

From DPWiki

This wiki page is for PMs who do not do their own OCR to list prospective projects they need OCR for, and for those who are willing to do OCR for them to volunteer to help.

See also: General information and guidelines on using the OCR Pool.

Using this page

When making an edit, please fill in the "Summary" field with a brief description of your change (e.g. "Added 'Crime and Punishment'"). You do not need to add your user name; the software adds it automatically.

Click on the 'history' link in the gray menu bar at the top of the page to see the change log entries for this page.

Procedure summary

Content Providers

If you are a Content Provider only, who wishes to find an adoptive PM, please use the Content Providers seeking Project Managers wiki page (also known as the Project Adoption Center) to list projects. Project managers please look there for projects to adopt.

Project Managers

  • Do not add any files to these pools until Copyright Clearances have been obtained from PGLAF. This means the Project must be cleared under US copyright laws. The clearance information should be in hand, but should not be uploaded with the scans.
  • Add an entry to the list of projects: copy & paste one of the existing entries, then change the details. Please be sure to mention the required OCR language, any processing the scans might need (page splitting, cropping, renumbering, despeckling), and any other issues that might concern a potential OCR volunteer.
  • When a request has been fulfilled, delete its entry.
  • Please only upload projects which are complete. If you have a project you'd like to contribute, but it is missing pages or good scans of illustrations, please post it on the Missing pages wiki page. Once the pages have been provided, you can upload the zip file and post the project here.
  • If you choose to include other text files, either within the zip file or as adjunct files with the same name, please save them as plain text files.

More detailed instructions are given here.

OCR volunteers

  • To claim the project, edit this page: update the project you wish to process with *Claimed by username* together with a date next to the project title. Typing two dashes followed by four tildes (--~~~~), which inserts your username and a timestamp, is the most convenient way of doing this.
  • Send a PM to the Content Provider saying you are working on OCR for their project.
  • After the OCR is complete, upload a zip of the OCR to the CPer's dpscans; PM the Content Provider to let them know the name of the zip, and that it's ready in their dpscans; edit the Wiki entry to read *Done by username* and date.
  • If there are any problems (such as missing pages), please leave a note on this Wiki page and let the PM know. If the PM has provided the scans or a link to an online scanset, it is his or her responsibility to make sure that everything is complete, and to resolve any problems.
  • It is the Project Manager's responsibility to remove the Wiki page entry.
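
For volunteers who script their workflow, the packaging step above can be sketched in Python. This is only an illustration, not a Distributed Proofreaders tool: the function name zip_ocr_text, the folder of per-page .txt files, and the zip file name are all assumptions, so adjust them to whatever the Content Provider asked for.

```python
import zipfile
from pathlib import Path


def zip_ocr_text(text_dir, zip_path):
    """Pack every .txt file in text_dir into zip_path; return the names stored.

    Hypothetical helper: the folder layout and naming are assumptions.
    """
    text_dir = Path(text_dir)
    # Sort so the archive lists the pages in a predictable order.
    text_files = sorted(
        p for p in text_dir.iterdir()
        if p.is_file() and p.suffix.lower() == ".txt"
    )
    if not text_files:
        raise ValueError(f"no .txt files found in {text_dir}")
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in text_files:
            # Store only the file name so the zip has no nested folders.
            archive.write(path, arcname=path.name)
    return [p.name for p in text_files]
```

Calling zip_ocr_text("ocr_output", "project_ocr.zip") (both names are placeholders) writes the archive and returns the list of page files it contains, which is handy to quote in the PM to the Content Provider.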

More detailed instructions are given here.

Projects needing OCR

Note that periodic maintenance sweeps will be done to remove user sections for those who have not been active in a year or more.


These (already cleared) projects need to be OCRed. Thank you!

New Projects (added 4/13/2018 - I do not need anything more than the OCR text files. :D I do everything else!):

Usual Template

  • ' ( - pages)
    • Claimed -- No.
    • Raws:
    • Source:
    • Description:
    • Clearance: Rule 1

Send a Private Message



hutcheson (harvest requests)


  • These books are desirable, mostly to fill out series already begun at Project Gutenberg.
  • I am willing to clear, PM, and/or postprocess, but will defer to other volunteers wherever possible.
  • I will coordinate to make sure there is a volunteer for each stage.


Each is ~100 pages with 12 watercolour illustrations. A number of books in this series are already online at Gutenberg or FadedPage. All are Full view at the Internet Archive, and the scans look pretty good to me.


Send a Private Message


  • LANGUAGE is English by default.
  • ZIP files, uploaded to a public Dropbox folder.
  • Covers and other illustrations (raw and digested) are not included. I provide these when I upload for proofing.
  • PNG files for proofing (1000PX wide, B&W) are not included. I will provide these when I upload for proofing.
  • TAILORED projects are scanTailored to cropped pages: 400DPI B&W TIF. They should be ready for OCR.
  • I can provide other formats, resolutions, or color modes, but don't yet know what is best.



Note: anything with a real URL in the scanset line is ready for OCR, with clearance pending. I've been consistently successful in getting clearances, so I expect these will not be a problem.



  • Crying Stones, by Harry Rimmer
  • Spiritual Folksongs of Early America
  • The Story of Old Ironsides
  • Signers of the Declaration [of Independence]
  • The SP Mystery
  • The Strange Likeness
  • Peaks District
  • The Outdoor Girls on a Hike
  • Pianist's Guide to Sight-Reading and Memorization, by Beryl Rubinstein
  • The Radio Boys Seek the Lost Atlantis, by Gerald Breckenridge


Send a Private Message

Uploaded to:

Title: Pericles and the Golden Age of Athens claimed by --Puppernutter 13:29, 18 June 2011 (PDT)
Author: Abbott, Evelyn
Number of PNGs: 390
Language: English
Year: 1895 printing of 1891 copyright
Notes: I usually OCR my own projects, but my OCR software doesn't handle ligatures and accented letters well, and this project has a huge number of Æ/æ ligatures, and a significant number of [OE]/[oe] ligatures in it. There are a fair number of illustrations to work around (including highly illustrated DropCaps at the start of every chapter), and there are some underlined passages, but otherwise the scans are quite clean. I split the two-column Index pages. Just in case you wonder when you look at the zip contents, I didn't include the front-matter pages in the OCR zip package because I OCRed them myself while I was trying to figure out how much of a mess my own software would make of the text. The text files can be e-mailed to me at kraester AT Thanks for the help.


Send a Private Message to srjfoo

--- I've decided to go on a Marion Harland (Mary Virginia Terhune) jaunt. I'm hoping to do all available ones. In general she didn't believe in writing a short book. Many recipes in these books have a cross + after the title to indicate easier recipes, so if you see those, don't worry, your OCR did not go crazy. Almost all are on the Internet Archive if that is faster:

Don't choose this one yet! It needs new scans of pages 379 and 553.

Uploaded to dpscans folder: (IA

Title: Marion Harland's Complete Cook Book
Author: Marion Harland
Language: English
Pages: 861
Illustrations: 50ish


Send a Private Message

Old Changelog for OCR Pool