OCR Pool

From DPWiki

The OCR Pool is the part of Distributed Proofreaders' (DP's) server used to store the scanned pages of documents that require pre-processing.


See Projects needing OCR or Content Providers seeking Project Managers for the lists of who needs what.

The process is described in more detail below.


How to Contribute Images to the OCR Pool

This section describes the extra steps involved in contributing images to the OCR Pool for someone else to OCR and preprocess. It does not repeat material covered elsewhere, such as how to scan or how to create projects.

1. Create a ZIP file containing all the page scans. See later for what to do with the higher-resolution illustrations. Name this ZIP file in the form <DPNick>_<Proj>.zip, where <DPNick> is your DP username[*], and <Proj> is a short string of characters that will help you identify which book the project is for; it might be the author's surname, or an abbreviated or keyword version of the title, or any combination thereof. Since it is only needed to identify the file to you among the other OCR Pool files of yours (all of which will start with <DPNick>), it need not be hugely long or complex.

[*] If your DP username has spaces or other characters unusable in a Unix filename, come as close as possible and include a text file in the ZIP stating your correct DP username. If you are not a registered DP user (in which case you will not be able to manage the project, only contribute scanned images), choose a nickname or a version of your real name for <DPNick>.

Example: DP User JoeyDoey wants to send images of "The Campfire Girls vs Smoky the Bear" by Not A. Realauthor. Possible names he might choose for the zip file include (among many others)

  • JoeyDoey_CampSmoky.zip
  • JoeyDoey_RealauthorCGSB.zip
  • JoeyDoey_CfireBearNAR.zip
  • JoeyDoey_CGVSBNAR.zip
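
The naming rule above can be sketched in a few lines of Python. This is only an illustration: the function name pool_zip_name and the choice to drop every character other than letters, digits and hyphens (including underscores, so that the single underscore always separates the two parts) are assumptions, not a DP requirement.

```python
import re


def pool_zip_name(dp_nick: str, proj: str) -> str:
    """Build a <DPNick>_<Proj>.zip name (hypothetical helper, not a DP tool)."""

    def clean(part: str) -> str:
        # Assumption: keep only letters, digits and hyphens; spaces,
        # slashes, underscores and anything else are dropped.
        return re.sub(r"[^A-Za-z0-9-]", "", part)

    nick, tag = clean(dp_nick), clean(proj)
    if not nick or not tag:
        raise ValueError("both parts must contain at least one usable character")
    return f"{nick}_{tag}.zip"


print(pool_zip_name("JoeyDoey", "CampSmoky"))  # JoeyDoey_CampSmoky.zip
```

If the cleaned nickname differs from your real DP username, remember to include the text file described in the footnote above.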


2. Decide whether you want to contribute the scans only and let someone else manage the project, or whether you want to be the Project Manager (PM) for this book yourself after the images have been OCRed and the text pre-processed by someone else.

If you just want to contribute the scans

Please follow the instructions on the Content Providers seeking Project Managers wiki page.

If you will be the PM for this project

  1. Upload the <DPNick>_<Proj>.zip file to your dpscans folder.
  2. Post an announcement about the files to the Projects needing OCR page saying they are there and ready for OCR. Mention the language, the type of book, and any other details about it you think may be of interest. Mention if you think the images may or will need splitting, converting, renumbering or cropping.
  3. One of the people doing OCR from the pool will edit the Wiki page to indicate that the project has been "claimed", and should PM you to request that you transfer the files to their dpscans.
  4. When they are finished with the preprocessing they will send you a PM and transfer the files back to your dpscans folder.
  5. Look for a folder or zip files named with <OCRNick>_<Proj>. Using these files (and any additional files for the illustrations, if appropriate), go ahead and create the project as usual. <OCRNick> is the DP username of the person who did the OCR/prepping, so you know who to contact in case of questions or if something needs to be redone, and who to credit when creating the project.
  6. After the project has been created, delete the files from your dpscans folder. The moment the project is created, all the files are copied elsewhere on the server, so they no longer need to remain in dpscans.

As an example, John Doe, DP user JoeyDoey, has left a file JoeyDoey_CampSmoky.zip in his dpscans directory. DP user Ziggurat claims the project and asks to have the files transferred. A few days later, JoeyDoey looks in his dpscans directory and finds two files, Ziggurat_CampSmokyImages.zip and Ziggurat_CampSmokyText.zip, from which he can create the project as usual. He knows that user Ziggurat did the OCR and can be contacted if, for example, a page has gone missing. After he creates the project, he deletes Ziggurat_CampSmokyImages.zip and Ziggurat_CampSmokyText.zip from his personal dpscans folder and updates the OCR Pool Wiki to reflect that the work has been completed.
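
If you have several returned files to sort through, the <OCRNick>_<Proj> convention can be read back mechanically. The sketch below is a hypothetical helper (split_pool_name is not part of any DP tool) and assumes the nickname itself contains no underscore, so the first underscore marks the boundary.

```python
def split_pool_name(filename: str) -> tuple[str, str]:
    """Split '<Nick>_<Proj>.zip' into (nick, proj) at the first underscore."""
    # Drop a trailing .zip if present; folder names are accepted unchanged.
    stem = filename[:-4] if filename.lower().endswith(".zip") else filename
    nick, sep, proj = stem.partition("_")
    if not sep or not nick or not proj:
        raise ValueError(f"not in <Nick>_<Proj> form: {filename!r}")
    return nick, proj


returned = ["Ziggurat_CampSmokyImages.zip", "Ziggurat_CampSmokyText.zip"]
for name in returned:
    nick, proj = split_pool_name(name)
    print(f"{name}: OCR by {nick}, project tag {proj}")
```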

Recommendations to make the OCRer's life easier

1. Ensure that the files are named so that alphabetical order matches the order of the pages in the book. This means they can be safely loaded into the OCR package in one operation. Unless there's a good reason not to, please name them 001.png, 002.png ... (or 0001... for long books), as this is what they will be called when they come out of Guiprep, and renaming them later can be a pain.
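
One possible way to produce such names is sketched below. It assumes you already have the scan file names listed in book order and that all scans are PNG files; plan_sequential_names and copy_renamed are hypothetical helper names. The files are copied to a new folder rather than renamed in place, so an existing 002.png cannot be overwritten halfway through.

```python
import shutil
from pathlib import Path


def plan_sequential_names(files, ext=".png"):
    """Map names (already in book order) to 001.png, 002.png, ...

    The number width grows to four or more digits for long books.
    """
    width = max(3, len(str(len(files))))
    return {old: f"{i:0{width}d}{ext}" for i, old in enumerate(files, start=1)}


def copy_renamed(src_dir, dst_dir, ordered_names):
    """Copy scans into dst_dir under their new names; originals stay untouched."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for old, new in plan_sequential_names(ordered_names).items():
        shutil.copy2(Path(src_dir) / old, dst / new)
```

For example, plan_sequential_names(["cover.png", "title.png"]) maps cover.png to 001.png and title.png to 002.png.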

2. Provide scans at the highest resolution you have, and keep preprocessing to a minimum. Although we want smaller file sizes for the proofers, we want the best possible OCR. Operations like despeckling usually help but sometimes don't (we've had projects where the OCR has missed virtually all the punctuation) so it's best to leave it to whoever's OCRing to decide. The one exception is that you should in most cases reduce the bit depth to 4-bit grayscale: more than this rarely improves the OCR and it cuts file size down considerably compared to 24-bit (or colour).

3. If your upload is large, it's worth running pngcrush or something similar if you can. 500 pages of 600dpi 4-bit scans is quite a lot of data, so reducing the size will help.

4. Include everything at the top level of the ZIP file (rather than down a long list of subdirectories).
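
A flat ZIP of this kind can be built with Python's standard zipfile module. This is a minimal sketch, not a DP-provided script: make_flat_zip is a hypothetical name, and it assumes all the scans sit directly in one folder and that the ZIP is written somewhere outside that folder.

```python
import zipfile
from pathlib import Path


def make_flat_zip(scan_dir: str, zip_path: str) -> list[str]:
    """Zip every regular file directly in scan_dir with no directory prefix.

    Returns the names stored in the archive, in sorted order.
    """
    files = sorted(p for p in Path(scan_dir).iterdir() if p.is_file())
    # PNG data is already compressed, so the files are stored as-is.
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_STORED) as zf:
        for p in files:
            # arcname=p.name strips the folder path, keeping entries at top level.
            zf.write(p, arcname=p.name)
    return [p.name for p in files]
```

Calling make_flat_zip("scans", "JoeyDoey_CampSmoky.zip") would, under these assumptions, produce an archive whose entries are just 001.png, 002.png and so on, with no subdirectories.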

What to do with illustrations

(In other words, the higher resolution version of the illustrations for the final HTML version.)

If you're PMing the project yourself, they don't need to go into the OCR pool, and it's best not to include them in your ZIP file at all.

If you're only CPing, then they need to go to the OCR pool for handover to the PM. If there are only a few, you can include them in the page scans ZIP (but ensure they're called something distinctive so they stand out).

If there are a lot of images, then please put them in a separate ZIP file. (If your original file is JoeyDoey_CampFire, then call this JoeyDoey_CampFire_images.) This means that the PM can upload them straight into the project without having to reupload them to dpscans, since uploading is usually much slower than downloading. Please bear in mind the guidelines in this section of the PM FAQ on naming and formats.

How to Process Images from the OCR Pool

This section describes the extra steps involved in obtaining images from the OCR Pool. It does not repeat material covered elsewhere, such as how to OCR or preprocess text or how to create projects.

Needs OCR only

If the file you are volunteering to OCR is from the Projects needing OCR page, the content provider intends to also manage the project through DP.

Look through the lists on the Wiki page and select a project you would like to work on. Follow the Wiki instructions and edit the page to indicate that you have "claimed" the project, and will be doing the OCR.

Send a PM to the DP user whose scans you have claimed, and request that they move the ZIP files into your dpscans directory.

The files here have names of the form <DPNick>_<Proj>.zip, where <DPNick> is the DP username of the person who scanned the images and <Proj> is a brief reminder for them of which book the images are from.

After you have finished the OCR, and have uploaded the zip(s) of all the text files, please send them a PM telling them the OCR is now ready. Thanks, you are done and can go on to another if you wish. :D

Needs OCR and Project Manager

If the file you are volunteering to OCR is from the Content Providers seeking Project Managers wiki page, the content provider has elected to be a content provider only, so you may manage the project.

As with the projects needing OCR only, look through the list on the Wiki page and select a project you would like to work on. Follow the Wiki instructions and edit the page to indicate that you have "claimed" the project, and will be doing the OCR. Then send a PM to the DP user whose scans you have claimed for OCR and project management, and request that they move the ZIP files into your dpscans directory.

The CPer should have provided a README containing clearance and scanner credit information, and any other information needed to set up the project on DP.

OCR and prep the images as per usual. The project is now yours to create and manage as usual. After project creation, please remember to delete the files from your dpscans directory: the moment the project is created, all the files are copied elsewhere on the server, so they no longer need to remain in dpscans. Please also delete the entry from the Content Providers seeking Project Managers wiki page to reflect that the work has been completed.

How to access dpscans

Users with a dpscans directory can access it using the Remote File Manager.