Content Providing FAQ
A Content Provider does not have to be a registered Distributed Proofreaders (DP) volunteer on the DP website. However, it might be a little difficult to recruit a Project Manager for the project if you are not.
Selecting a Project
Which book you pick is up to you. The only requirement is that it be copyright clearable (discussion below). It is best if it is something in which you have interest. Chances are that you will find others who will work on it as well.
Finding a book.
There are several ways to find a project to CP (Content Provide). You can search the library, buy from a local bookstore, raid your own bookshelves, ask a friend, rescue a book from the trash, or find projects that are already scanned at one of the many on-line sources for scans. Be sure to pick a project on which you will enjoy working, because you will be shepherding it through up to 5 Rounds (if you choose to be the project manager) and on until it is posted. This may take several years from start to finish, but much of this time can be spent waiting for your project to be released in a round.
If you want the project to go through the system quickly, pick a popular genre; watch which release queues are moving fast, as this changes regularly.
If you choose to get a book from one of the on-line book archives, please follow the individual site guidelines regarding acceptable use and protocol. We don't want to be bad neighbors. It is considered good form to credit the source of the scan when the text is submitted to Project Gutenberg, so make sure the PM knows its source.
Projects that are part of another publication
If the project is a portion of another publication and the entire publication is no longer under copyright restrictions, please do not separate out the smaller portion of the publication as a separate project.
Check for completeness
At this juncture, regardless of the source, you should check the book for completeness before going any further. See the Project completeness checklist for a list of the types of things to check.
Digital Library of India/Public Library of India scansets
Some scansets from the Internet Archive that were provided by "Digital Library of India"/"Public Library of India" have incorrect publication information so that works appear to be in the public domain in the US when they aren't. If you are considering using one of these scansets, please verify publication information independently with other sources. If you can't confirm that the publication is Public Domain in the US, please don't run the book. For more information, please read this forum thread.
Difficulty.
Some things make a project harder than others, so consider how much time you wish to spend on it. Check the inner margin (gutter) of the book. The wider this is, the easier the book will be to scan, and the fewer extra measures you'll need to take during OCR and when answering forum questions. This does not mean that you should not work with books that have a narrow gutter, just that they will be much harder. Projects with a lot of illustrations are also harder and more time-consuming. This will be discussed more under Scan/Download images and Prepare the Illustrations.
Copyrights and clearances.
Do a preliminary check to see if it is clearable. For the current year (2023), usually that means it was published in or before 1927. See clearances, below, for more information.
On 1 January 2024 books first published in 1928 enter the Public Domain in the US. Project Gutenberg is now clearing books in anticipation of that day, but please don't load projects onto DP's server until they are fully in the public domain.
Each year on 1 January, there will be an update in the public domain publication year. For information about handling projects that enter the public domain on 1 January, please read the Handling projects entering Public Domain on 1 January section of the Project Managing FAQ.
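The yearly rollover can be sketched as simple arithmetic. This is only an illustration, assuming the 96-year gap implied by the two examples in this FAQ (2023 and 1927, 2024 and 1928) continues to hold; the function names are hypothetical, and the result is a preliminary check, not a substitute for a PG clearance.

```python
def latest_clearable_publication_year(current_year: int) -> int:
    """Return the latest publication year that is usually clearable.

    Assumption: the 96-year gap is inferred from this FAQ's examples
    (2023 -> 1927 and 2024 -> 1928); it is not an official PG formula.
    """
    return current_year - 96


def passes_preliminary_check(publication_year: int, current_year: int) -> bool:
    """True if the book was published in or before the usual cutoff year."""
    return publication_year <= latest_clearable_publication_year(current_year)
```

For example, a book first published in 1928 fails the preliminary check when `current_year` is 2023 and passes it when `current_year` is 2024.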
Check for Duplicate Projects
Make sure the book is not underway or already at PG: PG's In Progress List form searches the PG clearance and posted databases by title and/or author, so you can find out whether a book is already at PG, whether and when a clearance for it has been requested, and whether that clearance has been approved. The clearance listings it searches go back to 2004. The In Progress List also searches for projects at DP, DP Canada, and Faded Page.
If you don't see a match via the In Progress List, you should use DP's Project Search to double-check whether a book is in progress at DP as well as DP Canada's search (which also has the option to search Faded Page).
"Cleared" Status means someone has requested and received copyright clearance, but has not yet finished the project. If this clearance is several years old, it has probably (though not certainly) been abandoned.
After you get clearance, you will get an e-mail along with the other clearance holder, letting each of you know that the other is working on it. You can then communicate with them to find out if they are working on it, or if you are free to begin processing it.
Some projects, most notably periodicals and multi-volume editions, will have "blanket clearances." This does not mean that the person who requested the original clearance has all of the volumes ready to scan! Most of these clearances are associated with DP in some way, so if an Überproject doesn't exist for the periodical/set you have (where the PM will often list the volumes they have available), you can post in the Content Provider's forum to find out who's working on what.
If the project says "Posted" it has been posted to Project Gutenberg with the accompanying ebook number. It is a good idea to look at all of these lists by author and separately by title.
Running a project that is already in PG.
Even if a book is already in PG, it may be worth processing again. This will require some legwork to determine, so be sure you feel strongly about the book before pursuing this. PG welcomes different editions, illustrated versions, different translations, etc. In addition, many of the older ebooks have more errors than we would find acceptable today and reprocessing them through DP may be the best way to change that. If the book has a PG number under 10,000 then it probably doesn't have an illustrated version and might be a good candidate for an upgrade.
Below is a list of reasons you might provide an existing PG project through DP. You will need a copyright clearance for each of these cases. For reworks, the PM should put a note into the Project Comments section explaining why the project is being redone. If you are a CP only, then you should include details of why the project is being redone in a text file attached in the project zip file.
- Basic upgrade
- You have the same text version, there are no illustrations, and the PG version is riddled with errors: Be sure to let PG know, when you upload the final version, that this is a revision of an existing ebook, based on a paper copy in hand. If there are only a few problems, submit them via the PG errata process.
- Illustration Upgrade
- You have the same text version, but there are illustrations and they are not present in the PG version: Same as the basic upgrade except that you'll be submitting an illustrated html version.
- Different Translation
- PG will treat this as a completely different ebook and welcomes them. There are already at least half a dozen translations of the Iliad, for example, and more are always welcome.
- Different Edition
- Some books were published in very different editions. Where this is the case, PG welcomes them as separate ebooks. You will have to document the fact that your edition has significant differences from the version that is already in PG.
Projects at Faded Page
Occasionally Distributed Proofreaders volunteers ask how they can arrange to copy a project that is already at Distributed Proofreaders Canada's Faded Page site to Project Gutenberg. This request is often made when a volunteer is eager to see all of a particular series or author at Project Gutenberg and some of the volumes are already posted at Faded Page. DP has spoken with Distributed Proofreaders Canada, which has approved the following process.
- Check that the book isn't already at Project Gutenberg and that it is eligible to be at PG according to US copyright law.
- If the book isn't at PG and appears to be no longer under US copyright, apply for a PG copyright clearance.
- Download the HTML, text, and image files from Faded Page.
- Remove the FP licence from the text and HTML files.
- Arrange the files in the same directory structure as we use for DP uploads to PG.
- Validate the CSS and HTML.
- Zip the files for upload to PG.
- Ensure that the credit line for DPC is retained and entered in the "Extra Credits" field. If the person doing the uploading had to do a lot of updates after running it through our checks to meet PG requirements, it would be reasonable to add that name to the credits and, for books that don't require such work, uploaders may add their names to the credits along with their role in getting the book to PG. The important thing though is that DP Canada's people be credited for their work.
- Upload to PG.
Projects posted to PG in this way are fully Distributed Proofreaders Canada projects rather than Distributed Proofreaders projects. However, the information is included here because most of the queries received come from Content Providers and Project Managers.
Get a clearance
You have obtained a book, and have decided that it is both clearable and not already in PG or in progress, or you have a book you think is clearable and need to find out for sure. In both cases, it is time to ask the experts.
You will need to have scans of the Title Page and Verso (the back of the title page), also known as the TP&V. You may need scans of other material as well, such as an inscription on the fly-leaf, in order to establish the date.
Check for Duplicates
The clearance team does not check for duplicate clearances. In addition, having a clearance does not mean you "own" that title for some period of time.
Before you move forward with a clearance request, please make sure that the book isn’t already in progress at common proofreading sites:
- PG's In Progress List form: This important program searches the PG clearance and posted databases by title and/or author, so you can find out whether a book is already at PG, whether and when a clearance for it has been requested, and whether that clearance has been approved. The clearance listings it searches go back to 2004. The In Progress List also searches for projects at DP, DP Canada, and Faded Page.
- Already at Project Gutenberg
- In progress at DP, DP Canada. Note that the extended search at DP Canada, linked to above, can also search their FadedPage display site.
DP also has an in-progress check script that combines several searches, but carries the warning at the top of the page: Do not rely solely on the information returned by this script. It does not check DP Canada or their FadedPage display site or any other sites similar to ours which may have publicly available searches. It's often best to use multiple search terms and search sites individually, rather than depending on a composite search.
If your search string doesn't find anything, please try variations. Sometimes projects are duplicated because the search string was too restrictive. Common words may find too many results, but the shortest string that will uniquely identify a title is usually best. If the title is extremely common ("Poems", for example), try searching for both title and author.
Copyright Clearance.
Copyright clearance is a process by which Project Gutenberg determines if a book is in the public domain according to the copyright laws of the United States. Project Gutenberg maintains a set of Rules that are used to determine if a book is clearable. This DP site operates under U.S. law; if you cannot obtain a clearance, your book cannot be processed through this site.
Please read PG's copyright clearance rules for details.
If your book is not clearable under PG's rules, but the author and everyone else associated with the book (e.g., the illustrator, editor, or translator) have been deceased for at least 70 years, you may wish to send the book to one of our sister sites, DP Canada.
As of June, 2018, Project Gutenberg is approving clearances for books that will become Public Domain as of January 1, 2019 and January 1, 2020. Please do not upload projects that are not in the public domain in the US as of the time you upload the project files. See this post.
Create a PGLAF Account.
The next step is to set up an account at PGLAF (the branch of PG that handles clearances). If you have Direct Uploading or PPV access, you already have a PGLAF account.
To create a new account, browse to PGLAF and read the welcome page. It contains a lot of useful information on the clearance process, and a number of useful links. Next, click the New username link and fill out the form. Be sure that the email address you enter is valid and is checked regularly; this is the address where posted notices and clearance notifications go, and also where you will be contacted if a conflict occurs.
Submit a Clearance Request.
After completing the registration process, log in and select "Submit a New Clearance Request". A large form will appear; most of the information required should be available directly from the title page of your book. If not, you will have to do some research. Document any findings in the field provided; be sure to list the source of any information not found on your book's title page and verso (the back of the title page). If a date is listed twice in different contexts (separate publication date and copyright date, for example), enter it twice. When attaching images, remember that they should be small in size (100k is a reasonable maximum; most should be smaller), but the smallest text should still be legible. Multi-volume works can be cleared in a single clearance request if the dates are the same, or if you provide the earliest and latest title and verso.
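A quick way to spot attachments that exceed the suggested size is sketched below. It is an illustration only: the function name is hypothetical, and reading "100k" as 100 * 1024 bytes is an assumption rather than a documented PG limit.

```python
from pathlib import Path

# Assumption: the FAQ's "100k" is read here as 100 * 1024 bytes.
MAX_ATTACHMENT_BYTES = 100 * 1024


def oversized_attachments(paths, limit=MAX_ATTACHMENT_BYTES):
    """Return (filename, size_in_bytes) for every file larger than `limit`."""
    oversized = []
    for path in map(Path, paths):
        size = path.stat().st_size
        if size > limit:
            oversized.append((path.name, size))
    return oversized
```

Any file this reports is a candidate for re-saving at a smaller size, as long as the smallest text remains legible.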
With respect to providing URLs on the clearance request form, PG accepts URLs for Rule 6 requests and other complex clearance requests. However, for regular Rule 1 requests that use a website to establish the copyright or publication date, PG would like to see a screenshot, and either the screenshot or the accompanying text should indicate which website it came from. To this end, PG accepts a capture of the webpage at the Wayback Machine and a link to that capture, which eliminates the problem of link rot or changing webpages, provided the associated text points out that the link is to the Wayback Machine.
Checking the Clearance Registration Form details
Before you submit your clearance request, it is important to carefully review the title page scan of your book and the publication information and verify that the information you have entered on the form is correct. Please ensure that:
- Author field contains data
- Author's name is spelled correctly
- All authors listed on title page are listed in upload form
- Any illustrator, translator, editor, etc., listed on title page is listed in upload form
- Title is complete and all words are spelled correctly
- Titles of serials (if this isn't a blanket clearance) include volume number, issue number, and issue date
- Title is appropriately capitalized
- If English, the title and subtitle should be capitalized using Sentence case with appropriate capitalization of proper nouns.
Example: The story of the little red hen
Example: The life of Mildred Morris
- For titles in languages other than English (LOTE), please follow the capitalization used on the book's Title Page. However, if the title there is fully capitalized, please use the conventions for title capitalization common in the book's language (If in doubt, you may refer to the capitalization used for books in that language within the catalogs of major libraries such as the Library of Congress or WorldCat).
- Subtitle listed on title page is listed in upload form
- If you are uploading a periodical, please check Project Gutenberg for previously posted issues of that periodical and follow the title formatting they used.
- If your project is part of a multiple volume set, state the number, e.g., English History, (Vol. 2/6).
Note: It is very important that all the information you enter in the clearance registration form match the details of the project for which you are requesting clearance: The uploaded item MUST match the clearance. This certainly includes all the publication metadata (publisher, location and date). If the information does not match, then Project Gutenberg should not accept the upload of your project once it completes post-processing.
Types of Clearances.
There are several types of clearances. The most common is Rule 1, but some others are used on occasion. Project Gutenberg only clears based on the copyright laws of the United States. However, if you would like a detailed discussion of copyrights in other countries, visit The Online Books Page.
Wait.
All that is left now is to wait for the results of your request. Basic clearances using the standard rules are usually processed within several weeks. Rule 6 clearances, which require more research, usually take longer (and may require further research on your part before they clear). You must receive the clearance before loading the project onto DP.
You may get a response that says NOT OK. A reason for the denial of the clearance will always be given. Be sure to check that reason, since technical difficulties such as corrupted files can easily generate this response. Feel free to resubmit your clearance request after correcting whatever problem was noted.
Managing Clearances
You can view the status of your clearance requests and request cancellation of any of your clearances via Project Gutenberg's Clearance Management tool.
Scan/Download images
There are two ways to get these images. You can scan them yourself, or you can find an Image Provider that has already scanned them.
Scanning.
For the text of the project, it is best to scan within your OCR package. Many OCR packages deskew in a way that works great for text but mangles illustrations, so do not use the OCR package for the illustrations. If you have a few illustrations, it is best to make two passes with the scanner. The first pass scans every page in black and white, for the OCR package. On the second pass, scan only the illustrations. IrfanView and XnView both have a scanning interface that is good for this. Be sure to get full-color scans of all color illustrations, and grey-scale scans of all black and white or grey-scale illustrations. It is also nice to get a scan of the cover and spine of the book. The back is also nice if it is illustrated. If there are any advertisements in the book, please scan them as well. Page images should be scanned at 200-400 DPI (this varies depending upon the fonts used). Illustrations should be scanned at a higher resolution; 600 DPI is generally safe.
When you first use your scanner, check to see if it dithers in black and white mode. Dithering is a method of simulating colors you don't actually have available by scattering dots around and fooling the eye. A dithered image can actually be somewhat easier to read, but it will confuse the OCR program and inflate the file size. A thresholded image, in which each pixel becomes black or white depending only on whether it is darker or lighter than a cutoff, is the preferred result. If your scanner driver dithers, consider scanning in grey scale and letting your OCR engine convert the image to black and white.
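To make the idea of thresholding concrete, here is a minimal sketch that converts a tiny grey-scale "image" (rows of values from 0 for black to 255 for white) to pure black and white. The fixed cutoff of 128 and the function name are assumptions for illustration; real scanner drivers and OCR engines typically choose the cutoff in more sophisticated ways.

```python
def threshold(grey_rows, cutoff=128):
    """Map each grey value (0-255) to black (0) or white (255).

    Unlike dithering, every pixel is decided on its own value alone,
    so no scattered dot patterns are introduced.
    """
    return [[255 if value >= cutoff else 0 for value in row]
            for row in grey_rows]


# A 3x4 sample: light paper with a dark stroke and one mid-grey smudge.
page = [
    [250, 240, 30, 245],
    [235, 20, 25, 240],
    [130, 120, 15, 250],
]
black_and_white = threshold(page)
```

In this sample the mid-grey values 130 and 120 fall on opposite sides of the cutoff, which is why the choice of cutoff matters for faint print.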
Generally you should not despeckle the images, because this process often removes punctuation marks. If you find that despeckling improves the OCR quality, then do so, but use the non-despeckled versions for the page scans that you upload to DP.
For instructions on how to scan using Abbyy, see the Abbyy Scanning Documentation.
Scanning advice
- Preparing page scans
- Preparing illustration scans
Avoiding the most common pitfalls
Image providers
There are many online image archives that make available scans of public domain books. For a list of some of these sites see Details of Image Sources.
Please do not use scans from any archives that charge for the use of their service. These archives usually have a compilation copyright, and other restrictions on their use. Please follow the individual site guidelines regarding acceptable use and protocol. We don't want to be bad neighbors. It is considered good form to credit the source of the scan when the text is submitted to Project Gutenberg.
If you've downloaded images to process from an online source, it's important that you record the source of the scans. Filling out the "image provider" field when you create a project allows DP to coöperate with online image archives' policies. It's also nice to let Project Gutenberg know the source of the scans at clearance time, but it is not required.
DP accepts only PNG and JPG images for proofreading. If the images that you've downloaded are in a different format, you'll need to convert them as part of your preparation process.
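A simple way to list downloaded files that still need converting is sketched below. It assumes that the file extension reflects the actual format and that only ".png" and ".jpg" count as accepted; whether other spellings such as ".jpeg" are accepted is not stated here, so they are flagged too. The function name is hypothetical.

```python
import os

# Assumption: only these two extensions are treated as accepted.
ACCEPTED_EXTENSIONS = {".png", ".jpg"}


def files_needing_conversion(filenames):
    """Return the names whose extension is not in ACCEPTED_EXTENSIONS."""
    return [name for name in filenames
            if os.path.splitext(name)[1].lower() not in ACCEPTED_EXTENSIONS]
```

For instance, given `["001.png", "002.jp2", "003.tif", "004.JPG"]`, the function returns the `.jp2` and `.tif` files.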
You'll need to prepare the page scans, plus the illustration scans if there are any, prior to uploading the images to DP. See those links for recommendations. If this results in images of reduced quality, consider adding a link to the original images in the project comments, but do ensure that the images you provide for proofers are legible themselves without reference to outside sources.
Scanners.
Scanners are devices we use to create images of books. When choosing a scanner, or checking to see if a scanner is useful for DP, there are a few important factors to consider. First is form factor. The most common types of scanner are flatbed scanners, ADF (Automatic Document Feeder) scanners, and scanners with both a flatbed and an ADF. There are also some less common types of scanners discussed below. Flatbed scanners are useful for scanning books while they are still intact, while ADF scanning is faster but requires that the spine of the book be removed. Modern scanners almost always use a USB interface and have sufficient optical resolution for our needs, so we will focus on other aspects of the scanner.
If you have questions, or just want to see what others have discussed, there is a thread on scanner recommendations. There is also a wiki page of Scanner Reviews.
Flatbed Scanners.
Flatbed scanners have a number of advantages for providing content for DP. Most CPs start with a flatbed scanner. They are cheap and relatively common, you can scan material that is still bound, and they are moderately fast. You can also scan two pages at a time and have the OCR software or image preparation software separate them for you. Flatbed scanners have a fixed glass plate where you place the book, and an internal moving head that passes underneath the glass plate.
When choosing a flatbed scanner for providing content, there are a few key factors to consider: scanning speed, maximum size, and type of scan head. Speed is obvious; you're going to be scanning a few hundred impressions per book, and the difference between 10-second scans and 45-second scans adds up. Maximum size affects what type of material you can scan; most books will fit entirely on a standard A4/Letter-sized scanner, but periodicals and large folio-sized books are much easier to scan on A3/11x17-sized scanners. The type of scan head is important because of book gutters, the area between two facing pages where the pages curve up into the spine. You want a flatbed scanner with a CCD (charge-coupled device) scanning element, not a CIS (contact image sensor) scanning element. CCD scanners can focus much further above the glass plate than CIS scanners, and keep the letters in the gutter from getting too blurry. The best way to tell if a scanner is CCD or CIS is to look at the specifications, but a CCD scanner will be fairly thick and will have a fairly large mirror mounted at a 45-degree angle on the scanhead, while a CIS scanner has a fairly narrow scanhead with two rows of LEDs and sensors.
There are also a few specialized book scanners like the Plustek Optibook that avoid gutter problems by having a very narrow margin on one side, and scanning a single page at a time. For this type of scanner it doesn't matter what type of scanning element it uses, as the page lies flat upon the glass.
ADF Scanners.
ADF (Automatic Document Feeder) scanners pass the pages of a book over a stationary scanning head. They usually have a hopper that allows you to load a number of pages and let the computer handle the scanning. This can be much faster than a flatbed scanner.
Some important factors for selecting an ADF scanner include simplex/duplex (whether the scanner can digitize both sides of the paper at the same time), hopper size (how many pages the scanner can hold at once), paper path size (letter/A4 scanners are much more common, but can't handle folio/quarto-sized books or periodicals), double feed and jam detection, and ease of maintenance/availability of spares (the rubber rollers and other parts tend to wear out faster on old books).
Just as important as the scanner is a reliable method of removing the book spine. The best method is to use a professional-grade paper trimmer; this will slice cleanly through the entire book and reduce the odds of the paper double feeding or jamming. You may be able to find a local print shop that will do this for a small fee or for free. You can also use a band saw or scroll saw, but these tend to leave more ragged edges that induce more double feeding and jamming.
Pen scanners/handheld scanners.
These are primarily useful for scanning texts in a reference library with strict rules about scanners and cameras. Pen scanners scan a single line of text at a time, while handheld scanners scan a larger swath as you pass them over a page. These are not recommended for general scanning use because they are much slower than other methods.
Digital cameras.
A few CPs and many large-scale scanning operations use digital cameras instead of a traditional scanner. They have the advantage of not requiring the book to be laid flat, and can be very fast. They do tend to be more expensive and much larger than traditional scanners. Some even have automatic page turners. Results can vary depending upon the quality of the cameras, placement of the cameras relative to the pages, method of holding the pages flat, lighting (tricky to get diffuse and even), and vibration. See the Internet Archive's scanning robot and DIY Book Scanner for some examples.
Image Preparation
See Page scans for general information about proofing image prep and Illustration scans for general information about illustration image prep.
Depending on what Operating System your computer uses, your options for the final preparation of images will vary.
Scan Tailor
Scan Tailor is a free, cross-platform program that is useful for deskewing, splitting, resizing, adding or removing margins, equalizing illumination, and providing an output that is suitable as a source for quickly making final proofing images.
OCR
You have the scans; now you need the text. (Alternatively, this might be a type-in project.) Optical character recognition (OCR) is the process through which a program takes the image and "reads" it, producing the text files. There are many programs that do this. Some are very good, some are adequate, and many are not good at all. Some have more functions than others, and some are fairly expensive.
Please note, since the changeover to separate proofing and formatting rounds, pre-formatting should not be added to the project. Pre-formatting gets in the way of the proofers doing their jobs.
OCR Software.
If you do not have ABBYY FineReader Pro, do not feel that you need to go out and buy the software in order to OCR. You can use any OCR program; so long as you get the output into the correct format in the end, that is fine. The instructions given below are for ABBYY FineReader Pro. We will attempt to make them as general as possible, so that you can adapt them to other programs, but some software will not do everything that ABBYY FineReader does.
ABBYY FineReader
The most popular program is ABBYY FineReader. It does an excellent job, and you can find an older version on eBay without breaking the pocketbook. Try to stick with the Pro version. The Home and Sprint versions are much less expensive, but far less feature rich. Instead of getting the newest Home edition, get a Pro version that is one version old; you will be much happier. There is a forum thread, ABBYY FineReader Tips and Tricks, for help with ABBYY FineReader.
Readiris
Readiris v.11 has adequate character recognition, but does have some limitations when it comes to use for DP. It has a limit of 50 pages per batch recognition, meaning for a 300 page book, you need to run 6 separate batches of OCR, then rename the files.
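The batch arithmetic above can be sketched as follows. This is plain Python that only plans the page ranges; it does not call Readiris in any way, and the function name is hypothetical.

```python
def batch_ranges(total_pages, batch_size=50):
    """Return (first_page, last_page) pairs, 1-based and inclusive,
    covering `total_pages` in batches of at most `batch_size` pages."""
    return [(start, min(start + batch_size - 1, total_pages))
            for start in range(1, total_pages + 1, batch_size)]
```

For a 300-page book this yields six ranges, from (1, 50) through (251, 300); a 120-page book needs three, the last being (101, 120).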
Ocrad
See Ocrad.
Tesseract
Tesseract is an open source OCR engine; see Tesseract's homepage or Wikipedia for more details.
ABBYY FineReader Scanning Instructions.
Prepare the Illustrations.
OK. You have page scans and your text is ready, but you still have some illustrations to prepare. It is also a good idea to make a scan of the cover and the spine if they have any decoration on them. Some people will get them even if they have no decoration, as this gives a nice feel to the HTML version.
How to handle illustrations on a page.
Many books have illustrations within the text. We like to create HTML versions of all books with illustrations. This means the CP or PM must get these illustrations and include them in the project. It is not OK to just say "The illustrations will be provided by the PM at a later date" or "You can download the illustrations from this location" as the PM or site may not be around when the text is finished.
In order to get the illustrations, scan in full color, or greyscale as needed, in an application other than ABBYY FineReader. IrfanView, GIMP, or your scanner's software should all provide decent image scanning. ABBYY FineReader processes images in several ways that are effective on text, but unacceptable for illustrations.
Illustrations should be scanned at a sufficient resolution to capture fine detail. While it may not be needed now, it is important if the book is to be reprinted or screen technology improves. Generally speaking, 300 DPI is adequate for line art, continuous tone, and descreened images; screened images often require 600 DPI to avoid moiré effects.
Then crop around the illustration, leaving some space around it so that it can be rotated and cleaned up. Do not feel that you need to provide clean, rotated images in perfect, ready-to-post format. This can be done by the PP. If you do wish to clean them up, many PPs appreciate this; however, please leave them larger than you think the PP will need. This allows the PP to resize them as they prefer.
How to handle plates
Plates are handled in much the same way. Proofing images for both the plates and the blank backsides need to be placed in the correct order as they were placed in the book.
OCR Pool.
If you don't have an OCR package at all, don't want to bother with it, or really want it done with a good OCR program, then you can use the OCR Pool. This group will take the scans you provide and produce the text for you.
Check the project.
- Check that every page image is there and is complete. Include all pages: title, verso, all illustrations, plates, and all blank pages within the book. Leading (prior to the first printed page) and trailing (after the final printed page) blank pages should be removed. If any pages are missing or damaged, they should be replaced or repaired before continuing.
- Check that every page has been OCR'd. You should have one text file for each page image file, and they should have the same base name (e.g., 001.txt and 001.png). (If you're submitting a type-in project, just create empty text files with the appropriate names.)
- Image & text files should be named so that a simple sort of the filenames (e.g., an "order by name" listing of the files) puts them in the proper (book-binding) sequence.
- One common convention is to simply number each page serially starting from 001 (or 0001 if there are more than 999 pages). Note that typically, this serial number will not agree with the page number printed in the book, but the difference (or 'offset') will usually be consistent over the body of the book. Check for a consistent offset, and investigate any anomalies, as they may indicate missing or duplicated pages. Changes in the offset can also occur due to unnumbered content pages (plates/appendices/introductions/whatever), which are fine. (Don't try to achieve a consistent offset at the expense of the proper sequence of pages.)
- Another possibility is to name the image & text files according to the original printed page number. This is complicated by books with multiple page-numbering sequences (e.g., frontmatter numbered with roman numerals) and pages without an explicit or implied page number (e.g., plates). Such complications can be accommodated by judicious use of extra characters in the filename. For page files, the filename can be up to 12 characters long, so in practice the base name can have up to 8 characters. Allowed characters include digits, letters, underscore, hyphen, and dot. Just make sure that a simple sort of the filenames puts them in the proper sequence. (Don't rely on a particular collation for uppercase vs. lowercase letters. To be safe, only use one or the other.)
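The checks in this list can be partly automated. The following is a minimal sketch, assuming all page images and text files sit together in one directory and use lowercase .png and .txt extensions; the function name check_page_files is hypothetical and is not part of any DP tool.

```python
import re
from pathlib import Path

# Allowed characters per the guidance above: digits, letters, underscore, hyphen, dot.
ALLOWED = re.compile(r"^[A-Za-z0-9_.-]+$")
MAX_NAME_LENGTH = 12  # full filename, including the extension


def check_page_files(project_dir):
    """Return a list of human-readable problems found in project_dir."""
    folder = Path(project_dir)
    pngs = {p.stem: p.name for p in folder.glob("*.png")}
    txts = {p.stem: p.name for p in folder.glob("*.txt")}
    problems = []

    # Every page image needs a text file with the same base name, and vice versa.
    for stem in sorted(pngs.keys() - txts.keys()):
        problems.append(f"{pngs[stem]}: no matching text file")
    for stem in sorted(txts.keys() - pngs.keys()):
        problems.append(f"{txts[stem]}: no matching page image")

    # Filename length and character set.
    for name in sorted(list(pngs.values()) + list(txts.values())):
        if len(name) > MAX_NAME_LENGTH:
            problems.append(f"{name}: longer than {MAX_NAME_LENGTH} characters")
        if not ALLOWED.match(name):
            problems.append(f"{name}: contains a character outside the allowed set")

    # Sorting should not depend on how uppercase and lowercase letters collate.
    stems = set(pngs) | set(txts)
    has_upper = any(c.isupper() for s in stems for c in s)
    has_lower = any(c.islower() for s in stems for c in s)
    if has_upper and has_lower:
        problems.append("base names mix uppercase and lowercase letters")
    return problems
```

An empty list means none of these particular problems were found. It does not confirm that the pages are in the correct order, which still needs a visual check against the book.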
GuiPrep
GuiPrep is a software package created by DP's own Thundergnat. This is a great package that takes the OCR output, checks it for common OCR errors and then produces a ready-for-DP version of the text files. It will also renumber the images, and run Pngcrush to make the images smaller. This is a very handy tool indeed. You can find information about downloading it here.
Guiprep Installation and Upgrade
Initial installation
Guiprep requires Perl, and it was edited and tested under Strawberry Perl on Windows. If you have a different version of Perl, guiprep may work. There are reports from the wild of it working on Linux under a different Perl. If you do not have Perl, install Strawberry Perl. If you don't know if you have Perl, open a command prompt on your computer and type:
perl --version
If the reply is the version of Perl that is installed on your computer, then you don't need to install Perl.
Download the most recent version of guiprep from the Guiprep Releases page on GitHub and expand it in a directory.
After installing Perl successfully, you also need some additional modules. Windows users can double-click on a file in the distribution package, install_cpan_modules.pl. For other operating systems, see the INSTALL.md file in the distribution package.
Updating an existing installation
Updating an existing installation is not recommended. Using a copy of settings.rc from an earlier version may disable some of the newer features. Instead, rename your old guiprep folder, and then proceed as if it were a new installation.
Guiprep Initial Use
The most common use of Guiprep is to process the output from OCR, dehyphenating end-of-line words and preparing the text for upload to the Distributed Proofreaders website for proofreading. The OCR output may consist of two sets of text files. The first set is required: one file per page, with line breaks as in the images (in directory textw). The second set is optional: files without the line breaks, in which the OCR's dictionary has been used to resolve end-of-line hyphenation (in directory textwo).
Directory Setup
Guiprep expects to find textw and optionally the textwo in a project directory. The output of dehyphenation will be placed in the text directory, also in the project directory, which will be created if it is not present. If you are going to use guiprep to rename or optimize your png files, then there should also be a pngs directory as a sub-directory of the project directory, containing all the png files.
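Before starting guiprep, it can be useful to confirm that the project directory is laid out as described. The following is a minimal sketch; the function name check_project_layout and its will_process_pngs parameter are hypothetical, and the check is not part of guiprep itself.

```python
from pathlib import Path


def check_project_layout(project_dir, will_process_pngs=False):
    """Report whether project_dir has the subdirectories described above."""
    project = Path(project_dir)
    messages = []
    # textw (text with line breaks) is required.
    if not (project / "textw").is_dir():
        messages.append("missing required directory: textw")
    # textwo (text without line breaks) is optional, so only note its absence.
    if not (project / "textwo").is_dir():
        messages.append("note: optional directory textwo not present")
    # pngs is only needed when guiprep will rename or optimize the png files.
    if will_process_pngs and not (project / "pngs").is_dir():
        messages.append("missing directory: pngs (needed to rename or optimize png files)")
    return messages
```

The text output directory is not checked, since guiprep creates it when it is not present.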
Starting Guiprep
If your computer runs Windows, there is a file in the distribution called run_guiprep.bat. Double-clicking on this file will start guiprep. (Older distributions of guiprep contained winprep.exe or run_guiprep???.bat [where ??? is the version number]. If you have any of these files on your computer, you should consider following the instructions for upgrading shown above.)
In all cases, you can start guiprep from a command prompt. Guiprep will only work properly if started in the guiprep directory, the one that was unzipped during installation.
cd <guiprep directory>
perl guiprep.pl
For instance, on my computer I start the most recent version of guiprep with
cd \pgdp\guiprep
perl guiprep.pl
(The change directory command may have a different syntax on your computer.)
Select Options
When guiprep starts, it will open to the Select Options tab. Once you get the settings you want, you will need to look at this tab very infrequently.
Changing the "Default Markup" is not recommended.
In the first set of options:
- Make sure that Dehyphenate using German style hyphens... is not checked, unless your project uses them.
- Save hyphens.txt & dehyphen.txt... is primarily for debugging and should be unchecked unless requested by a support person.
- Make sure that you do not attempt to remove headers and footers in both the OCR program and guiprep. If they were removed in OCR, then uncheck those options here. If headers and footers are still in the files after the OCR, then check the boxes. If headers and footers are not present and you tell guiprep to remove headers and footers, it will remove a line or two of text from the top and bottom of each page.
- Build a standard upload batch... If you are not going to make any further changes to the text before uploading, this might be helpful. Most CPs at least take a look at the guiprep output and may want to make changes.
In the scrollable list of options below that, the following are primarily of historical interest, and generally should be unchecked:
- Convert £ to "Pounds".
- Convert ¢ to "Cents".
- Convert § to "Section".
- Convert ° to "Degrees".
The following option will put curious marks in your text if the OCR ran words together, and you should consider whether to use it or not:
- Mark possible missing spaces between word/sentences.
If you are working on a book that contains mathematics, then you may want to uncheck:
- Convert solitary 1 to l.
- Convert solitary 0 to O.
It is good to familiarize yourself with all of the options because some may be relevant for a specific project.
Change Directory
If you have multiple disk volumes on your computer, select the drive containing the project directory you want to process.
Use the windows to navigate to your prep files. Interactive mode is the easiest to use and the only mode that works for Search and Headers & Footers. To use interactive mode, navigate the left-hand window (Change To Directory) so that your text directory(s) and optionally your pngs directory appear in that window. Ignore the other directory listing (Select Directories To Batch Process).
Process Text
The options in the Process Text tab:
- Extract Markup -- A good idea if coming from rtf files or if the text was previously processed through DP formatting; otherwise you can leave this checked and it won't do anything.
- Dehyphenate -- That's why we are using this program.
- Rename Txt Files -- OCR programs frequently put funky names on text files. This changes them to 001.txt, 002.txt, ...
- Filter Files -- Fix some common character substitutions. A good idea.
- Fix Common Scannos -- Another good idea.
- Fix Olde Engliſh -- This looks for things that might be the long s which was used in old English (ſ) and converts them to s. Don't use this option unless you know your project contains long s, because it will try to change f to s. If your book does contain long s, then this option is desirable.
- Convert to ISO 8859-1. -- Don't use this for books which will be represented in UTF-8. Since that is most of our books today, uncheck this option.
- Rename Png Files -- If png files are present, they are renamed to match the Txt File renaming mentioned above, i.e. 001.png, 002.png, ...
- Run Pngcrush -- Pngcrush optimizes the png files for size without losing any information. There are other programs which will also optimize png files. It is important that png files get optimized before uploading to the Distributed Proofreaders website. If you don't do it here, then make sure you do it elsewhere before uploading.
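To make the effect of the two rename options concrete, the sketch below computes the kind of mapping they produce: files taken in sorted order and numbered 001, 002, and so on. This is an illustration only and is not guiprep's actual implementation; the function name plan_serial_rename is hypothetical, and the switch to four digits above 999 files follows the naming convention described earlier in this FAQ.

```python
from pathlib import Path


def plan_serial_rename(directory, extension):
    """Map each existing filename to a serial name, in sorted filename order."""
    files = sorted(p.name for p in Path(directory).glob(f"*{extension}"))
    # Three digits normally; four if there are more than 999 files.
    width = 3 if len(files) <= 999 else 4
    return {old: f"{i:0{width}d}{extension}" for i, old in enumerate(files, start=1)}
```

The function only returns the mapping and renames nothing. Running it once for ".txt" and once for ".png" shows whether both sets of files would end up with matching base names, which is only true when both sets sort into the same page order.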
At this point the status window in the lower left hand corner should say
Working in interactive mode.
Hit the Start Processing button and watch it run. If you are working on a large book, or you are running pngcrush, this can take some time. (If nothing happens, then you probably did not select the proper directory in Change Directory.) When it finishes, the last text in the large text window on the right will be
Finished all selected routines.
Your text files are now ready, and if you requested any work be done to your png files, that has been done as well.
Explore other options and tabs
This is a quick start guide, not a complete manual, and there are other options and ways of using guiprep. This document only attempts to show the most straightforward way of using guiprep for a beginner. It is safe to explore the other features and tabs, and you are encouraged to do so. The full user guide is in the distribution package and is linked to below.
See also
- the Guiprep wiki page.
- GuiPrep scanno file for French texts.
- the Guiprep user manual
FAQ, or what do I need to know?
What is the difference between a CP and a PM? And what do those abbreviations mean?
A: The CP or Content Provider supplies the scans to be processed at DP, and may also prepare the files for the proofreaders, but does not necessarily deal with the project beyond that. CPs do not have to be members of DP.
The PM or Project Manager is responsible for creating the project at DP, guiding it through the rounds, answering proofreader/formatter questions, and making decisions that will help create the most consistent output possible for the post-processor. PMs may provide their own content or acquire scans from another CP. The term PM in a different context means Private Message.
How much of my time will CPing take?
A: It depends on the book you choose to CP. If you choose a short novella with no illustrations, then it could take a couple hours to scan, OCR, check and prep your project. If, on the other hand, you are working on a thousand-plus page book on ship construction with 33 fold-out plates and a couple hundred illustrations, then it could take a year or more to finish the scanning alone.
What are the qualifications necessary to become a CP?
A: There are no qualification requirements to be a CP. You simply need to be able to get the images into good order and find a PM willing to work with you.
What kind of equipment do I need to CP?
A:
- If you wish to provide content for books which don't already have scans available on the internet, you will need a scanner that is capable of scanning the material you want to provide. Some libraries have scanners for public use if you do not have one. You can also harvest images from an online source such as the Internet Archive.
- A program with which to do image preparation.
- You will also need some sort of OCR package capable of providing the OCR text needed to start from. If you do not have an OCR package, there is an OCR Pool with volunteers willing to do this for you.
- You will also need Guiprep installed. However, if you use the OCR Pool, they can run Guiprep for you, if you ask them nicely.
Are there deadlines? Who sets the schedule? What if the schedule is not met?
A: The only deadlines and schedules are set by the CP. If as the CP you do not want to set a deadline or schedule, then don't. If you do set a deadline and it passes, then the only one who is going to come down on you is you. Some projects take very little time; others take a long time.
What files do I need to provide?
A: You should provide clear black and white png images of every page. These should be large enough to be read easily, but not too large to be downloaded over a dial-up modem. Usually if you can get them below 100K the latter is fine. For details, see Page scans.
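A quick way to spot page images that may be too large is to list the png files at or above the size mentioned above. The following is a minimal sketch; the function name oversized_pngs is hypothetical, and treating "100K" as 100 x 1024 bytes is an assumption.

```python
from pathlib import Path

SIZE_LIMIT_BYTES = 100 * 1024  # "100K" interpreted as 100 KiB; an assumption


def oversized_pngs(directory, limit=SIZE_LIMIT_BYTES):
    """Return (filename, size_in_bytes) for each png at or above the limit."""
    return sorted(
        (p.name, p.stat().st_size)
        for p in Path(directory).glob("*.png")
        if p.stat().st_size >= limit
    )
```

Files that appear in the result are candidates for further optimization, not automatic rejects; the size figure is a rule of thumb.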
You will also need to provide text files containing the OCR output of each page. The png and the text file must have the same base name. For example, 005.png goes with 005.txt; otherwise the upload software won't know what to do. (Note: Guiprep has a tool to help get the names to match, so long as both sets of files are in correct alphanumeric sort order.)
If there are any illustrations in the book, or a decorative cover, grey-scale or color images of each should be provided. If the title page and/or half-title are decorative, have some kind of graphic on them, or printed in multiple colors, they should be included as illustrations even if there is a usable cover. Otherwise, a high resolution copy of the title page is encouraged, but not required. If there is no decorative cover, a high resolution image of the title page can be used, in whole or in part, as a source of a cover for the epub edition. Photos and heavily shaded drawings are best provided in high-quality jpg format. Line art, with fewer shades of grey, may be more suitable as pngs. See Illustration scans for more detailed information.
If my project has music in it, is there anything special I need to do?
A: Consult the Music Guidelines for detailed help with projects containing music.