Content Provider's FAQ
See also: Scanning FAQ
So you've reached the rank of "Proofreader Extraordinaire" and figured
that you would branch out into different arenas.
These Guidelines are here to try to help you through the process.
Note that you don't necessarily have to do all these steps yourself. It's quite possible to do some steps and hand off your results to someone else. You can also elect to manage the project once the necessary files have been uploaded to the DP server. See the Project Manager's FAQ for details.
Frequently asked Questions - (separate page) Some common questions that are related to scanning, OCR, etc. that are not covered here.
What kinds of books do you have? :-)
Seriously, there are really few restrictions on what kind of text
you can contribute to DP. The biggest, and probably most important is: The
book MUST be in the public domain (i.e., the copyright must have
expired). In general this tends to mean books that were published before
1923. There are exceptions to the 1923 rule, but a lot of times
it is troublesome to try to prove them. There is a good detailed
discussion of what is and isn't eligible at the Project Gutenberg site
on this page.
For a discussion of copyright terms in other countries, check out this
The book should not already be on the Project Gutenberg site. This site exists as a feeder site to Project Gutenberg, and it makes little sense to spend all the time and effort on a text that is already there. A different version of an existing book is OK though. You can check the Project Gutenberg online catalog to see if a book is already on there.
There is also a site called David's In-Progress List listing all of the books that people are presently working on. Again, this is helpful to avoid duplication of effort. If you find your book listed but the clearance date is more than a few years old, then it is probably OK to go ahead and do it.
You might want to stick with a shorter fictional work for the first project you contribute. It is probably better to avoid books which contain a lot of illustrations, maps, charts, tables and pictures for your first project.
Non-English language texts are fine too, though keep in mind that at the moment PGDP uses Latin-1, not Unicode. Texts in most western European languages and a few others (e.g. English, French, German, Latin (sans length marks), Italian, Spanish, Swedish, Dutch, Swahili) are usually appropriate for PGDP. However, texts with many characters outside Latin-1 are probably better handled at DP-EU, which uses Unicode. The procedures for preparing texts are the same for both sites, but permission to be a PM must be obtained separately from the administrators of each site. If you have a query about whether a text in a non-English language is appropriate for DP, please post a question to the Providing Content Forum.
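If you are not sure whether a text fits into Latin-1, one rough way to find out is to type or OCR a sample page and look for characters that Latin-1 cannot represent. The Python sketch below is only an illustration: the function name and the sample string are made up for this example and are not part of any DP tool.

```python
def non_latin1_chars(text):
    """Return the distinct characters in text that Latin-1 cannot encode."""
    offenders = []
    for ch in text:
        # Latin-1 covers code points 0 through 255 only.
        if ord(ch) > 255 and ch not in offenders:
            offenders.append(ch)
    return offenders


# Hypothetical sample: German fits, the Polish city name does not.
sample = "Ein sch\u00f6ner Tag in \u0141\u00f3d\u017a"
print(non_latin1_chars(sample))
```

A sample that turns up only an occasional stray character may still be workable; one that turns up many is probably a candidate for DP-EU.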
It is helpful, though not strictly necessary, that you understand the language that the book is written in. In theory it makes it easier during post-processing to tell from context whether paragraph breaks should fall at page breaks; however, checking back against the original book can get you through too.
Libraries, flea markets, yard sales, auctions, estate sales, your parents/grandparents, the trash, (you'd be AMAZED at what people throw away!) used book shops, friends, schools, you name it, there's books there. It is better to have a book that you will have access to for the whole time the project is being worked on so you can refer back to it if problems turn up in the scans. (Happens depressingly often)
You may find many eligible books in the circulating (borrowable) collection of your local library, but do be careful, because the scanning process can be a little rough on books and they may get damaged.
There are also many on-line sites devoted to used books if you are trying to find a particular one:
There are also many sites which have books available online as PDF
or image files which can be downloaded and OCRed.
(Note that some PDF files do not contain actual page images,
but instead contain text resulting from OCR or retyping.
Since DP needs page images, we can't use those PDF files for DP.)
and historical societies seem to be rich sources. This is especially
helpful if you don't have access to a scanner or physical books. There are drawbacks: they are usually a
fairly intensive download, especially over dial-up; you don't have
access to the actual book to check against if there are later problems,
and the selection is limited. Not having to do the scanning is a big
There is a large list of possible scan source sites in the Content Providers Forum under the topic "Online sources of scanned book images". Please follow the individual site guidelines regarding acceptable use and protocol. We don't want to be bad neighbors.
If you do go this route, it is considered good form to credit the source of the scan when the text is submitted to Project Gutenberg.
Once you have found a book that you think might be a good candidate,
the first thing you should do is get a clearance
line. This is an approval, if you will, of the book for the
Project Gutenberg site, and also registers the book as being a work in
progress to let other people know it is reserved so as not to duplicate effort.
The preferred method for requesting a copyright clearance is the
web interface at this page
(http://copy.pglaf.org). There are quite a few handy tips
and links there also.
You probably should not invest too much time until you've received your clearance line.
Now you need to scan it.
There are too many scanners and scanning packages to give specific instructions here. In general, good all-purpose parameters for scanning: 300dpi, black and white (not grayscale), and average brightness unless the paper is very yellow. Higher dpi doesn't necessarily make for better OCR unless the text is extremely small. You want to end up with good, reasonably clean images that the OCR software won't choke on.
The following examples and explanations assume that you are using ABBYY FineReader. This FAQ tends to concentrate on using ABBYY FineReader Pro because:
ABBYY FineReader Pro 5.0 or higher (and most other high-end OCR programs) have built-in scanning functionality and will allow you to automate the process to a great extent. To open a new batch in FineReader, click on File->New Batch (Ctrl+N) and give it an appropriate name (the title of the book, abbreviated, is a good choice). This is where FineReader stores all of the interim files for the project. It is probably a good idea to make a separate batch directory in which to put all of your individual batches.
As a matter of fact, while we're on the subject, let's talk about
directory structure a little bit. It is a good idea to use a logical
directory structure to help keep track of things. There is no
"right" or "wrong" way to do this, it mostly depends on personal
preference. However, in order to use some of the features of the tools
that have been written to make things easier, a certain structure must be used.
Starting at the appropriate place in your directory structure
(Shown as "C:\" in this example, choose a place comfortable for
you.) Make two directories: "Batch" and "Projects".
Every time you start a new batch in FineReader, it automatically
generates a directory where it stores raw image and text data, named
with a batch name that you specify. Save this under the "Batch" directory.
Under the "Projects" directory make another directory. Name this
with the same name as the "Batch" name used in FineReader. Under that
directory make several more directories: "pngs", "textw" and "textwo".
These are where you will save the images and text files from FineReader.
"Textw" stands for text with line breaks and "textwo" stands for text
without line breaks. These will be explained more later.
Here's a little graphic to demonstrate, assuming a book named Book1:

C:\
    Batch\
        Book1\
    Projects\
        Book1\
            pngs\
            textw\
            textwo\
Some people like to put the batch from FineReader in the same directory as the png and text directories to keep track of them more easily. That is fine too if you prefer it that way. Personal preference and comfort come into this a lot.
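If you would rather not click the folders together by hand for every book, the layout described above (a "Batch" directory, plus "Projects" with "pngs", "textw" and "textwo" under the book name) can be created with a few lines of Python. This is just a sketch: the function name is invented for this example, and the demonstration runs in a temporary directory so nothing is written to your real drive.

```python
import tempfile
from pathlib import Path


def make_project_dirs(root, book):
    """Create Batch plus Projects/<book>/{pngs,textw,textwo} under root."""
    root = Path(root)
    # FineReader makes its own per-batch folder inside "Batch",
    # so only the parent directory is created here.
    wanted = [root / "Batch"]
    for sub in ("pngs", "textw", "textwo"):
        wanted.append(root / "Projects" / book / sub)
    for directory in wanted:
        directory.mkdir(parents=True, exist_ok=True)
    return wanted


# Demonstration in a throwaway directory; pass your own root instead.
with tempfile.TemporaryDirectory() as tmp:
    for directory in make_project_dirs(tmp, "Book1"):
        print(directory.relative_to(tmp))
```

To use it for real, call make_project_dirs with whatever starting place you chose (for example "C:/") and the same name you gave the batch in FineReader.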
When your batch directory is set up, select File->Scan Multiple images (Ctrl+Shift+K) in FineReader to start scanning the book. From here the procedure will vary greatly depending on what features your scanner has (automatic document feeder or not) and your personal preferences (acknowledge each scan or have a timed pause between). Obviously, other packages will be different; your best bet is to check the help files that came with your specific package.
If the scanner bed will accommodate it, scan 'two-up' images (two book pages per image), as this will speed up the scanning process. Try to keep the book in the same place on the scanner for each scan (say, tight into a corner). That will make it easier to do the cropping and splitting.
Crop the images, if necessary, to minimize black borders around the page image. If you are ending up with LARGE black borders around your page image, you should probably make your scanning "window" smaller to avoid scanning outside the area where the page lies on the scanner bed. Doing this will save you both time (the scanner doesn't have to scan such a large area) and space on your drive (smaller files). Don't crop the image down until there is little or no margin around the text; this can affect recognition and can cause difficulties during the proofreading process. Ideally, what you want is some white space around the text, but no black.
If you have two-up images, split them into individual (one-up) page images. Generally there are two easy ways to get one-up images from two-up images:
When you save your image files, save them as black and white
images, not color or grayscale; you probably want ".tif" or ".png"
format image files. Later you'll NEED ".png" format files, so if your
OCR software can handle them it might be better to use them now. Avoid
saving them as jpegs (lossy format) or .bmp bitmaps (huge files). Under
FineReader, to save all the image files at once, select them all
first (click in the thumbnail window and press Ctrl-A), then choose
File->Save Images (F12), and be sure to give the images a name since
it doesn't insert the batch name automatically. It will save them in a
series with the specified name, a hyphen, and a four digit counter.
(Book1 - 0001.png, Book1 - 0002.png... etc.) Save them to the
For e-texts/.pdf files, you want to end up in the same place. If the page images are available as single page .tifs, .gifs, or .pngs you'll need to download them, convert them to .pngs, and make sure the filenames follow the correct format. If you have multi-page images, you may need to split them first. With .pdf files you'll need to use one of the software utilities to extract the .tif (usually) images from the .pdf.
Note: ABBYY FineReader OCR 6.0 is capable of working directly
with .pdf files. You don't need to extract the images first. If you set
up a batch, it will extract .tif images to the batch directory
automatically as it is loading the .pdf files. These can then be
converted to .pngs for later use.
For more help with ABBYY FineReader, please see our FineReader Tips and Tricks forum topic.
Now you've got to run the images through an OCR (Optical Character
Recognition) program. Again, there are too many programs out there to
give useful specific directions for them all. You will need to wind up
at the same place though the path you take may be different.
If you don't have an OCR package, you can take advantage of the DP OCR Pool. Other DP volunteers who do have OCR packages are more than happy to OCR images on your behalf.
Assuming you DO have OCR software...
If you used FineReader for the scanning, you've already set up a batch and the images are already there.
If not, open up FineReader. Click on File->New Batch (Ctrl+N)
and name it appropriately. Click File->Open Image (Ctrl+O). Select
all of the images and click on "Open". You might want to open just
one or two at first to be sure everything is working, then do the rest.
Try to make sure that you select them in the order that they belong. If
they are named so that they will sort correctly in alphabetical order,
you can select them all at once.
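Before loading a few hundred images, it can be worth confirming that the names really do sort into page order and that no page went missing. The sketch below assumes the naming series described earlier (a name, a hyphen and a four digit counter, such as "Book1 - 0001.png"); the function name and the sample list are invented for this example.

```python
import re

# Matches the four digit counter just before a .png or .tif extension.
COUNTER = re.compile(r"(\d{4})\.(?:png|tif)$", re.IGNORECASE)


def missing_counters(filenames):
    """Return the counter values absent between the lowest and highest found."""
    numbers = set()
    for name in filenames:
        match = COUNTER.search(name)
        if match:
            numbers.add(int(match.group(1)))
    if not numbers:
        return []
    return [n for n in range(min(numbers), max(numbers) + 1) if n not in numbers]


# Hypothetical file list with page 0003 missing.
names = ["Book1 - 0002.png", "Book1 - 0001.png", "Book1 - 0004.png"]
print(sorted(names))            # zero-padded counters sort in page order
print(missing_counters(names))  # counters that never turned up
```

Because the counter is zero-padded, a plain alphabetical sort gives page order, which is why selecting everything at once works for files named this way.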
Check settings under "Tools->Options". Select the correct
language for the text. Hit Ctrl+Shift+R or the "read all" icon to
initiate the OCR sequence, then go away for another (usually shorter)
break. There is also an option under the "Process" menu to perform
background processing, which allows you to minimize the window and do
other things while waiting.
For complex or "busy" pages of text and illustrations, some extra work may be necessary. ABBYY FineReader tries to analyze the layout of a page as it does the OCR. For simple, two-column pages it usually gets the layout right, but if the columns are broken up by illustrations, tables, etc, it will almost certainly get the layout wrong.
It is possible to draw boxes on the scanned image to show FineReader which pieces of text to group together. Once the boxes are drawn, you can tell FineReader how to order them in the OCR'd text. In order to draw the boxes, click on the little box icon at the top of the icons along the left-hand side of the window. This is usually the default, so clicking on that icon may not be necessary. Find your starting point, hold down the mouse button and drag until the box is the right size. You can adjust the box in fine detail in the zoomed image at the bottom of the window. If you draw the boxes in the order that you want them processed then you don't have to do anything else. Just hit Ctrl-R and let FineReader OCR the page. Sometimes, however, it's not convenient to draw the boxes in the correct order. You can tell FineReader what order you want by clicking on the 123 icon on the left side of the window. Then click on the text/illustration boxes in the order that you want them. The numbers on the boxes will change to reflect the final output order. Note that when FineReader is actually doing the OCR, it may not process the boxes in the order you specified, but the result will come out in the correct order.
When doing OCR on a long, complicated project, it works well to let FineReader OCR all the pages, then go through and look briefly at each page to see if it needs manual tuning. You can move from page to page quite quickly by using Alt-down arrow. When you see a page that FineReader didn't get right, you can delete the OCR'd text only or the OCR'd text AND the text boxes, depending on how badly it got things wrong. Fix or redraw the boxes and fix the order as necessary, then move on to the next page. If you have Background Processing turned on, it will do the OCR while you are looking for the next problem page.
Note also that you can specify different recognition languages for different text boxes, but, at least in FineReader 5.0, you must manually change the language, and read each box in the correct order, making this quite time consuming.
When that is done, you'll need to save the text files to do further
processing on them. Depending what tools you will use in preprocessing,
the formats and locations you save them in will vary. To use the guiprep
script (highly recommended), you will
need to do something like the following:
up the text files:
In the "textw" directory, save the text with the settings: Save as type Rich text Format, Create a separate file for each page, Retain font and font size. On the RTF tab of the Formats Settings, check Keep page breaks and Keep line breaks and uncheck everything else. It doesn't matter what the File name is set to. The name of your batch is probably fine.
In the "textwo" directory, save the text with the settings: Save as type Rich text Format, Create a separate file for each page, Retain font and font size. On the RTF tab of the Formats Settings, check Keep page breaks and Remove optional hyphens and uncheck everything else. Make sure the File name is set the same as in the textw directory.
Using the script without RTF Markup Extraction:
If you don't want to do markup extraction, (or your OCR package won't support RTF files) you can skip saving the files as RTFs and just save them as plain text files. Again, to do dehyphenization, you will need to save the files in two directories, textw and textwo.
Save the text with line breaks in textw. The ISO Latin-1 code page will give you pretty good results for English and most European languages. The site works with ISO Latin-1 so that will be least problematic to fit into the character space used. If necessary, you can try other code pages but be aware that they may not be as easy to use on the site and may not yield satisfactory results with some of the script functions.
The textwo directory should use all of the same settings except that Keep line breaks needs to be unchecked. Be sure to use the same code page and file names in both the textw and textwo directories.
At this point the script is used exactly the same way except you'll skip the Extract Markup routine.
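To give a feel for what dehyphenization involves, here is a deliberately naive Python sketch that rejoins a word split by an end-of-line hyphen. It is not how guiprep does it (guiprep works from the textw and textwo versions saved above); the function name and the rule used here are assumptions made purely for illustration.

```python
def rejoin_hyphenated(lines):
    """Join words split by an end-of-line hyphen onto the earlier line."""
    work = [line.rstrip() for line in lines]
    for i in range(len(work) - 1):
        following = work[i + 1].lstrip()
        ends_with_hyphen = work[i].endswith("-") and not work[i].endswith("--")
        # Naive rule: only join when the next line starts with a lowercase letter.
        if ends_with_hyphen and following[:1].islower():
            head, _, rest = following.partition(" ")
            work[i] = work[i][:-1] + head  # drop the hyphen, pull the tail up
            work[i + 1] = rest
    return work


page = ["The proof-", "readers were happy."]
print(rejoin_hyphenated(page))
```

A rule this simple cannot tell a broken word from a genuinely hyphenated one (it would turn "well-" / "known" into "wellknown"), which is a good reason to let the OCR package and guiprep handle the job when you can.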
Using the script without RTF Markup Extraction or Dehyphenization:
If you are using a different OCR package that can't save as rtf or do automatic line rejoining, you may need to skip those two functions. Save the files in a directory named "text" using the same settings as for textw without RTF extraction above. Uncheck both Extract and Dehyphenate under the Process Text tab. It won't hurt to leave them checked but the script will complain that it can't find the other directories and/or files.
If you aren't using guiprep just save the files into the "text" directory. Save as plain text, keep line breaks, use blank line as paragraph separator.
Now you are going to need to do a little preprocessing on those text
files. The tools you use will dictate how you proceed. The
major tool (Guiprep) is covered here.
Guiprep is capable of extracting italic and bold markup from the
OCRed text (which saves lots of time for proofreaders), removing end-of-line
hyphens and rejoining the broken words, filtering out many, many
scanning errors, renaming the files in the format needed by Distributed
Proofreaders and checking for zero byte files, all automatically. It
also provides an interactive mechanism for header removal which is very
stable and user friendly. The documentation
included with the script is quite comprehensive and should be consulted
for any detailed questions.
A general overview of how to use it:
Open the script; a graphical user interface will pop up.
It uses a tabbed screen scheme, with similar functions grouped on different tabs.
The finished files will be in a directory named "text".
Guiprep can also automatically rename your .png files and provides a
front end to pngcrush to losslessly reduce the size of your .png files and
reduce your upload. It also has an FTP client built in which will
automate a lot of the upload.
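If you are preparing files without guiprep, the zero byte check mentioned above is easy to approximate on your own. The Python sketch below simply lists empty text files in a directory; the function name and the sample file names are invented for this example, and it is not guiprep's own code.

```python
import tempfile
from pathlib import Path


def zero_byte_files(directory, pattern="*.txt"):
    """Return the names of empty files in directory that match pattern."""
    return sorted(
        path.name
        for path in Path(directory).glob(pattern)
        if path.is_file() and path.stat().st_size == 0
    )


# Demonstration with two made-up pages, one of them empty.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "001.txt").write_text("Some page text\n", encoding="latin-1")
    (Path(tmp) / "002.txt").write_text("", encoding="latin-1")
    print(zero_byte_files(tmp))
```

An empty text file is not always a mistake (a blank page will produce one), so treat the list as pages to look at rather than pages to fix.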
If this is your first time contributing a project and/or you are not a
project manager, send an email to JulietS that includes
the author, title, etc. and, ideally, the clearance line and any comments
you may want included on the project page. Make sure you include
your name and a contact email address (if different from the sending
address). She will contact you with an FTP address and directory name
where you can upload the image and text files. Use an FTP client to
upload all of the .png and .txt files you generated earlier into that directory.
You can also upload a single .zip file of all the .png & .txt files.
(There are a few free FTP clients listed in the software
section, or, the guiprep toolkit has an FTP client built in that
will automate some of the process.)
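If you go the single .zip route, any of the archiving tools listed at the end of this FAQ will do the job. For those who like scripting, here is a Python sketch that gathers the .png and .txt files into one archive. The function name is made up, and putting every file at the top level of the zip (no subfolders) is an assumption of this example, so check what is expected before uploading.

```python
import tempfile
import zipfile
from pathlib import Path


def bundle_project(png_dir, text_dir, zip_path):
    """Write all .png and .txt files into one flat .zip; return member names."""
    members = sorted(Path(png_dir).glob("*.png")) + sorted(Path(text_dir).glob("*.txt"))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in members:
            # arcname drops the directory part so the archive stays flat.
            archive.write(path, arcname=path.name)
    return [path.name for path in members]


# Demonstration with made-up files in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "pngs").mkdir()
    (base / "text").mkdir()
    (base / "pngs" / "001.png").write_bytes(b"not a real image")
    (base / "text" / "001.txt").write_text("Page one.\n", encoding="latin-1")
    print(bundle_project(base / "pngs", base / "text", base / "Book1.zip"))
```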
Alternately, if you anticipate having several
projects, you may want to send a message to ldavies (Louise) and ask
to be made a project manager. This will open up access to some of the
project creation and control features. The same general procedures are
used once you are a project manager; you just need to create your own
project pages and set up your own upload directories. Details are given
on the project managers page.
Wow! That was fun, let's do another! :-)
Scanning / OCR
5.0 Pro is much cheaper than 6.0 and is still available (though not directly from ABBYY software) and does what is needed. If possible, stick with the Pro version though; the Home and Sprint versions don't have necessary features. Good for scanning, but a little finicky about which scanners it supports.
Text file processing tools:
Image viewing and manipulation:
[Win32] - Nice general purpose image manipulation and conversion
XnView Free [Win32] - Nice general purpose image manipulation and conversion software.
Firehand Ember Shareware [Win32] - Another nice
image viewing and conversion program.
netpbm Free [Win32, Unix] - A toolkit for manipulation of graphic images, including conversion of images between a variety of different formats.
[Win32] Nice very configurable utility for batch renaming files. Very
point 'n click.
File Archiving and Compression tools:
7-Zip Free-GPL [Win32, Unix] Free utility to uncompress .zip archives.
ICEOWS Freeware [Win32] Compress files in
ICE and ZIP formats and uncompress nearly any common format. Many
language interfaces available.
Info-ZIP Free-BSD [Nearly all OS's and
Platforms] A collection of utilities for working with zip format
compressed files. Support for a large number of platforms and OS's.
FILZIP Freeware [Win32] Point and click
manipulation of compressed files. GUI interface. Multiple file
extraction. Lots of nice features.
WinZip Shareware [Win32]
Utility to create and extract .zip archives. Free trial.
WS_FTP LE Shareware
[Win32] Easy to use FTP client. Free for non-commercial use.
Smart FTP Shareware [Win32] Another easy to
use FTP client. Free for non-commercial use.
Xpdf Free-GPL [DOS/Win, Unix] Utilities to extract images or text from .pdf files among other things.