Note that there are additional references at the end of this document under See also ...
- 1 What is post-processing?
- 2 Who can post-process?
- 3 What help is available?
- 4 What tools can I use?
- 5 How do I choose a book to post-process?
- 6 Setting yourself as post-processor for a book
- 7 How long does post-processing take?
- 8 What if I change my mind, or don't have time?
- 9 How long can I keep a book checked out?
- 10 So what do I have to do to post-process a book?
- 10.1 First do some research
- 10.2 File formats
- 10.3 Keep a "To Do" list
- 10.4 Do a first-pass check
- 10.5 Save your work often
- 10.6 Check for comments left by proofreaders making you aware of questions/problems/markup
- 10.7 Check the markup
- 10.8 Check the text for other problems
- 10.9 Handle any illustrations
- 10.10 Create a Transcriber's Note
- 10.11 Curly quotes or straight quotes?
- 10.12 Plain Text Version, HTML Version — what do I do first?
- 10.13 Creating a plain text version
- 10.13.1 Plain text character encoding
- 10.13.2 Save a new version
- 10.13.3 Remove comments
- 10.13.4 Rejoin pages
- 10.13.5 Change markup
- 10.13.6 Ligatures
- 10.13.7 Dashes
- 10.13.8 Footnotes in the plain text version
- 10.13.9 Sidenotes in the plain text version
- 10.13.10 Indices in the plain text version
- 10.13.11 Illustrations in the plain text version
- 10.13.12 Greek in the plain text version
- 10.13.13 Rewrapping text
- 10.13.14 Poetry, tables and blockquotes, etc. that should not rewrap
- 10.13.15 Vertical spacing in the plain text version
- 10.13.16 Removing end-of-line spaces and rewrap markers
- 10.13.17 Straighten up the title page, front matter, table of contents, and list of illustrations
- 10.13.18 Check formatting and spacing
- 10.13.19 PPtext and other tools
- 10.13.20 Byte-Order Marks
- 10.14 Smooth Reading your Project
- 10.14.1 Smooth Reading for PP
- 10.14.2 Is my book suitable/will it be a benefit?
- 10.14.3 How much Post-Processing should I do first?
- 10.14.4 What formats should I upload for SR?
- 10.14.5 Should I include the .bin file?
- 10.14.6 File names
- 10.14.7 How do I prepare my files for uploading to SR?
- 10.14.8 Where do I find Smooth Readers?
- 10.14.9 How do I make my project available for Smooth Reading?
- 10.14.10 Can I add or replace a Smooth Reading file?
- 10.14.11 Can I replace just a single file?
- 10.14.12 What if I accidentally put the wrong project in the SR pool?
- 10.14.13 Can I remove my book from the Smooth Reading Pool
- 10.14.14 Getting notification when a project finishes Smooth Reading
- 10.14.15 What do I do with the feedback?
- 10.14.16 Can I get non-DPers involved?
- 10.14.17 Contacting Smooth Readers
- 10.15 Creating an HTML version
- 10.15.1 Code line lengths
- 10.15.2 Projects that are part of a series
- 10.15.3 Page numbering and page rejoining
- 10.15.4 Converting to HTML
- 10.15.5 HTML title
- 10.15.6 Header
- 10.15.7 HTML tables
- 10.15.8 Fonts
- 10.15.9 Text foreground and background colors
- 10.15.10 Use of <br>, and empty tags
- 10.15.11 Title pages and front matter in the HTML version
- 10.15.12 Headings and chapters in the HTML version
- 10.15.13 Using a style sheet
- 10.15.14 Greek in the HTML version
- 10.15.15 Preparing illustrations for the HTML version
- 10.15.16 Footnotes in the HTML version
- 10.15.17 Sidenotes in the HTML version
- 10.15.18 Indices in the HTML version
- 10.15.19 External Links
- 10.15.20 Dashes
- 10.15.21 Horizontal Rules
- 10.15.22 Use @media in your CSS for different media types
- 10.15.23 Updating your Transcriber's Note
- 10.15.24 Checking your formatting, HTML and style sheet
- 10.15.25 Smooth Reading reviews
- 10.15.26 Final HTML Checks
- 10.16 Checking that you haven't introduced errors into the book
- 11 I've finished preparing my book — now what...?
- 11.1 Uploading for verification (PPV)
- 11.2 Uploading to Project Gutenberg yourself
- 11.3 How to find out when your project is posted
- 11.4 What happens once my book is uploaded to Project Gutenberg?
- 12 Help! I've got a problem with ...
- 13 What's different about ...
- 13.1 Periodicals and Uberprojects
- 13.2 Drama
- 13.3 Music
- 13.4 Languages Other Than English (LOTE)
- 13.5 Maths (LaTeX)
- 13.6 Symbols and scripts, non-ASCII characters, non-Latin scripts, and downright weird things
- 13.7 Errata pages
- 14 See also ...
- 15 Change Log
What is post-processing?
On its journey through multiple proofreading and formatting rounds, the text may have been worked on by hundreds of volunteers. Post-processors must standardize the formatting of the book and adjust it to comply with Project Gutenberg's requirements. They must also deal with any detectable mistakes or inconsistencies that have survived all proofreading and formatting rounds.
The ultimate goal of post-processing is to create a consistently formatted etext that contains as few errors as possible and that accurately reflects the intentions of the author.
Both plain text (.txt file) and HTML versions (.html) are always needed, but some projects may require other formats. Except in rare situations, Project Gutenberg will automatically convert the HTML text into .epub and .mobi formats.
Who can post-process?
Post-processors require more experience than ordinary proofreaders. Since they prepare the text for uploading to Project Gutenberg, they make choices and decisions about the layout and "look" of the etext. Because of this, post-processing is usually available only to volunteers who have completed a specific number of pages in F1.
To see whether you have access to do post-processing, please refer to the "Site Progress Snapshot" chart on the Activity Hub page. If you have access, there will be a green check-mark to the right of the Post-Processing heading. Otherwise there will be a white "X" in a red circle. If you do not have access and believe you have completed sufficient F1 pages, simply click on the "Post-Processing" heading in the chart to access the Post-Processing page on which, under the "Entrance Requirements" heading, you will see a statement of whether or not you meet the requirements. If you do qualify to do post-processing, you may then click on the "click here to submit a request" link to gain access to do this function.
If you are not yet eligible, but have a special reason for wanting to post-process (for example, if you would be post-processing in a language for which we have very few post-processors), please request access by sending an email to the PPV coordinator at ppv-coord at pgdp.net.
Once you have access to post-processing, a "PP" link will appear in the top navigation bar on your Distributed Proofreaders site pages.
What help is available?
This FAQ contains a lot of information including a Help section. Our Post-Processing Forum also has helpful threads, especially the No Dumb Questions topic which is the best place to post new post-processing questions.
If you're new to post-processing, you should arrange for a mentor. A mentor will help you learn PPing faster and more easily. To engage a mentor, simply send a note to the PPV coordinators at ppv-coord at pgdp.net and they will find you a mentor. (You may also post in the Help! PPer seeking a PP Mentor forum thread or pick a mentor from the PP Mentors section of the wiki.)
Books recently-uploaded to Project Gutenberg
It is also a good idea to look at books that have been recently uploaded to Project Gutenberg to see how their post-processors prepared them. This step is especially useful if you can find a book similar to the one you are about to post-process (same author, related topic, etc.).
What tools can I use?
- a text editor for working on your plain text and HTML files.
- a monospace font such as DP Sans Mono that allows you to easily differentiate between ones and lower-case "l"s, etc. when you check your project for errors
- a program capable of opening and editing images such as .pngs, .jpgs, and (if you are working with LaTeX) .svgs
- checking tools such as our Post-Processing Workbench (whose PPtext program includes, among other things, a spellchecker and Gutcheck), PPTools online, and Gutcheck
- capability to create and handle zip folders
- viewers for e-reader formats
Other programs are also available that can be extremely useful and will usually save you a lot of time. Some, such as Guiguts, have Gutcheck built in. For further information on how Guiguts is used for post-processing, there is a PP Checklist using Guiguts.
How do I choose a book to post-process?
What to look for in a first project
For your first project, it's best to pick a fiction book with a relatively small number of pages (fewer than 200 or so). Here's why:
- A low page count makes the work go faster and is easier to handle.
- Fiction usually has fewer words per page and a simpler format than non-fiction, so it scans more clearly and is less likely to result in scanning errors and inconsistent formatting.
- Fiction generally lacks complicated features such as footnotes, tables, illustrations, poetry, and/or other items that could be difficult for a new post-processor to deal with.
Please do not work on projects that are still in copyright in your own country.
Finding a book
There are several good ways to find a book for post-processing:
- Check the Post-Processing page to see whether there is a book there that looks inviting.
- Check the Projects for new PPers thread for books that have completed the rounds and would make good starting points. You can also post there that you are in the market for a good book!
- Review the list of books in the rounds with no PPer assigned and contact the Project Manager if you find one you'd like to post-process once it finishes the rounds.
- Check the rounds for an interesting book in line with your skill level. You can tell whether a book already has a Post-Processor assigned by checking its Project Page. (If the book's Project Page shows no post-processor assigned or if the Project Manager is listed as the Post-Processor, you may contact the Project Manager to see whether he or she is willing to assign it to you, so that once it finishes the rounds, you may post-process it.)
- Contact a PP Mentor. Not only can mentors answer questions about the PPing process, but they can occasionally help provide more suitable projects or suggest alternatives to the above methods.
You think you've found a good book to post-process
Once you've found a book that looks like a good prospect, you should examine the instructions on that book's Project Page to see whether any post-processing instructions appear too difficult.
You should also check the project thread in the forums to see what the proofreaders and the Project Manager have been saying about it. This can alert you to issues that might make the work more difficult than you had realized.
Next, it's a good idea to download the book's text by going to its Project Page, scrolling to the bottom of the page, and selecting "Download Zipped Text". Once downloaded, you can scroll through the whole text to see whether there are any difficulties such as footnotes, poetry, foreign languages, dialects, and tables in it. That way, you will know what you will be dealing with before you commit to the project. If you see any of these items and are new to post-processing, you may want to pick a different project. But if you think you can handle it, give it a try! There's always lots of help available.
Setting yourself as post-processor for a book
If you decide you want to work on a book, and the book was not already checked out to you by the Project Manager, you can select it from the Post-Processing Pool by going to the PP Page and clicking on the book title. Then go to the Project Page, scroll down and select the "Check Out Book" button.
Once you've done that, you should make sure that the book shows as checked out to you, or you might end up working for hours on it only to find that someone else has checked it out and submitted it! The book will be listed near the middle of the Post-Processing Page under the heading "Books I Have Checked Out".
If the book you have selected is one that has taken a long time to process through the rounds or that has remained at any one processing stage for a long period, please search Project Gutenberg to make sure that this version of the book has not already been uploaded there by someone who is not a Distributed Proofreaders volunteer. If you find an already-posted book that looks very similar to the one you plan to work on, please contact db-req and explain the issue before you do any work on the book.
Downloading the book to work on
You're now the Post-processor for a book. What next?
- You should make a separate directory on your computer for your book so all the files will stay together as you work on them.
- Then you should download the text of the book (if you haven't done so already) by going to its Project Page, scrolling to the bottom of the page, and selecting "Download Zipped Text".
- Next, you should download the page images and illustrations for the book by clicking on the "Download Zipped Images" link. This file is sometimes rather large and may take time to transfer fully to your system.
- Once you've downloaded both zip files, you should extract their files into your directory. It's a good idea to put the numbered files (001.png, 002.png, etc.) from the images zip into a subdirectory called "pngs" and the illustration i_number files (i-123.jpg, i-177.png, etc.) into a subdirectory called "images". The text files should go directly into the root of your new directory.
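The directory layout described above can also be set up with a short script. The following is only a minimal sketch: the function name unpack_project and the zip file names in the usage note are hypothetical, and it assumes that illustration files can be recognized by a name starting with "i-" or "i_" while everything else in the images zip is a page scan.

```python
import zipfile
from pathlib import Path


def unpack_project(text_zip, images_zip, project_dir):
    """Extract the two project zips into the layout suggested above."""
    root = Path(project_dir)
    pngs = root / "pngs"
    images = root / "images"
    pngs.mkdir(parents=True, exist_ok=True)
    images.mkdir(parents=True, exist_ok=True)

    # Text files go directly into the root of the project directory.
    with zipfile.ZipFile(text_zip) as zf:
        zf.extractall(root)

    # Sort the image files: illustrations into "images", page scans into "pngs".
    with zipfile.ZipFile(images_zip) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            name = Path(info.filename).name
            # Assumption: illustration files start with "i-" or "i_".
            is_illustration = name.lower().startswith(("i-", "i_"))
            target = (images if is_illustration else pngs) / name
            target.write_bytes(zf.read(info))
    return root
```

A call such as unpack_project("text.zip", "images.zip", "my_book") would then leave the text files in my_book, the page scans in my_book/pngs, and the illustrations in my_book/images.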
How long does post-processing take?
It's very difficult to answer this question in advance — post-processing duration can vary from several hours to several days, weeks or even (in some cases) months. The time that a book will take to complete depends on the following factors:
- the difficulty and length of the work itself,
- problems discovered that were not noticed previously such as missing pages or images
- the tools being used,
- the time the Post-Processor has to dedicate to the work each day,
- the amount of experience the Post-Processor has,
- special factors such as having to wait for multiple parts of the book to be completed and merged, and
- Real Life intruding into a Post-Processor's time
Take it at your own pace — you will be the last person going through this book in detail before its posting (although a Smooth Reader may read it carefully for you and a Post-Processing Verifier (PPV) will verify your work unless you have earned Direct Upload (DU) privileges).
Try not to feel discouraged if it seems like it takes a long time to complete an "easy" book. Concentrate on learning the process of post-processing, familiarizing yourself with any tools you might be using, and doing a quality job, rather than on working quickly. You will speed up naturally with practice.
What if I change my mind, or don't have time?
If you realize that the project you've chosen is too complicated for you or if you find yourself short of time, it is perfectly okay to return the project. Think of it as allowing someone else the chance to work on it, and freeing up your own valuable time for another project or other Distributed Proofreaders work!
If the book was assigned to you by a Project Manager, please notify that person so that he or she may pass the project on appropriately. Otherwise, if you had assigned yourself to the book, you may simply return it to the post-processing pool.
You can look for an easier project straight away or take another one when you have more time.
How to return a project to the post-processing pool
To return a project, find the title of the book you are working on, either on the PP Page or under the Post-Processing Projects "Active" tab on your "My Projects" page, and click on its title to view the specific Project Page for that text. Then scroll to the Post-Processing Files section and click on the Return to Available button to go to the Return Project page.
The Return Project page provides the option for you to upload a zip of the partially-completed project (if you have already done some work on the book and wish to make it available to the next PPer). You can also add any comments that might be helpful to the next Post-Processor. Once you are ready, click the "Return project" button to return the project.
How long can I keep a book checked out?
The site software is set up to check at the beginning of every month, and send out reminder emails to all post-processors who have one or more books in their PP queues that have been checked out for 90 days or more without their having visited the book's Project Page since the timer passed the 90-day mark.
You don't have to finish post-processing a book in 90 days. All you have to do is visit the Project Page for each project you are notified about, and it will be checked out to you for another 90 days. The intent of this reminder is to make sure that PPers keep track of what they have in their queues. If you do not visit the project page after receiving this request, the project in question may be reclaimed.
The speed with which a project is post-processed is dependent on many things. If you check out a project to PP and find, for whatever reason, that you're not going to be able to complete it as planned, please ask for help. If the project has problems, such as missing pages or illustrations, let the Project Manager know what you have discovered; if you haven't gotten a response within a couple of weeks, please send the same information to db-req.
There is currently no limit to the number of times you can renew a project; however, please bear in mind that Distributed Proofreaders is not the only source for books posted to Project Gutenberg: there are other groups, as well as solo producers, and titles have been posted to PG by other producers while still in progress at Distributed Proofreaders. The longer a book sits in a PPer's queue, the more likely this is to happen. When that happens, the book is generally deleted from our server, and all the work that DPers have done has gone to waste.
If you know you're going to be away from the internet for an extended period of time, and not be able to renew projects you have checked out, notify db-req about when you'll be unavailable. This will guarantee that your projects will not be reclaimed. (Please note that "the next two years" is not a reasonable interval. If you expect to be gone that long, please consider returning anything to the pool that you can't finish before you'll be unavailable.)
What happens if I don't renew projects I have checked out for Post-Processing
As mentioned in the section above, an email notification is sent out on the first of each month. If you have not visited the project page by the 15th of the month (server time), you will receive a second notification email.
The sender for the email will be firstname.lastname@example.org. Elsewhere on the site, we recommend that you whitelist dphelp so that emails from DP will not be tagged as spam/junk.
If, a week or so after the second notification (and before the beginning of the following month), no response has been made, the project will be eligible to be reclaimed. Responses could include a reply to the email, project renewal, or notifying db-req of extenuating circumstances. If you have already notified db-req that you expect to be absent for an extended period of time, your project(s) will not be reclaimed during the expected absence. For more information about reclaims, please read the Post-Processing Reclaims document.
So what do I have to do to post-process a book?
Many post-processing tasks can be automated using tools designed to minimize the complexity and repetition of such jobs. Please refer to the software-specific tutorials or user guides to find out how to use these utilities effectively.
There is a lot of information in our wiki about post-processing. The DP Official Documentation portion of the wiki includes a large section on post-processing. Also the Main wiki page includes a section covering useful documentation regarding post-processing.
First do some research
Please read the Project Manager's comments on the book's Project Page and follow the Project Manager's instructions for the project. On the Project Page you will also find a link to the forum discussion area for that book. Please review the posts there, and, if the Project Manager, proofreaders, or formatters found anything of concern, make a note of it for special attention while processing the text.
If the project was previously worked on and returned by another post-processor, there may also be comments about the project on the Project Page and possibly a zip file of the partially-completed post-processing work.
File formats
The Project Gutenberg master formats are plain text and HTML. From the HTML version of each project Project Gutenberg also generates an epub and mobi format of the book. Consequently, every Distributed Proofreaders project submitted to Project Gutenberg must include an HTML version and a plain text file. An exception to this rule is the small number of LaTeX projects which involve mathematical notation (for more information about LaTeX projects please read the LaTeX section of this document.)
In addition to the text and HTML files, a Post-Processor will work with .png and .jpg image files and, if the book involves musical notation, with midi or mp3 audio files, musicXML and possibly even PDF for printable sheet music. Also, if the book involves complex mathematical formulas needing LaTeX, PPers may use Scalable Vector Graphics (SVG) files. For more information about handling music books, please read the Music section of this document and our Music Guidelines.
Note: Different post-processors use different processes to create their HTML and plain text versions. Some do the plain text first followed by the HTML, others do the HTML first followed by the plain text, and still others generate both the plain text and HTML versions from a single source file. For more information, please read the Plain Text Version, HTML Version — what do I do first? section of this document.
Keep a "To Do" list
Please review the Getting your PP Project Ready for PPV document. It covers all of the major aspects that need to be checked or completed before submitting a project to be post-processing verified (PPV). It is also a useful resource for experienced post-processors who already have Direct Upload capability. The document can form the basis of your checklist.
There is also a Guiguts PP Process Checklist that may be useful if you use Guiguts and one for ppgen and Guiguts. Even if you don't use Guiguts, these checklists provide a good start for creating your own detailed checklist in spreadsheet or text format.
If you'd like, you can also put notes about your progress in the "Post-Processor's Comments" box which can be found toward the bottom of your book's Project Page. These comments are visible to others. They can act as a "To Do" list, or notes on points to watch out for from proofreader comments, and are particularly useful if you have to take a break from post-processing for a short while, so you can start working with your project right where you left off. Also, if you leave DP, your notes could help another volunteer to take over where you have left off.
Do a first-pass check
Check through the text page by page, opening the corresponding page scan in your image viewer. This way, you'll quickly notice things such as missed bolding or italics, unmarked poems, page numbering problems, and block quotes etc.
You should especially check that new paragraphs that start at the top of a page have not been missed and that paragraphs that should continue from the previous page have not been treated as separate paragraphs.
Your page-by-page review is also a good time to remove the cross-page /# markups and the [Blank Page] words (once you've ensured that the pages really are blank) and to rejoin hyphenated cross-page words.
You should also check for missing pages (rare, but it does happen) and illustrations. If you find missing pages or illustrations, please contact the Project Manager before starting to post-process and, if you don't hear back from the PM within a couple of weeks, please contact db-req.
The first pass is a good time to check that footnotes and sidenotes are all present and correctly marked and to rejoin footnotes and sidenotes that have been split across pages. For more information about Footnotes, please read the Footnotes in the plain text version and Footnotes in the HTML version sections of this document. There is more information about Sidenotes in the Sidenotes in the plain text version and Sidenotes in the HTML version sections of this document.
For information on correcting errors, please read the Correcting issues asterisked by proofreaders section of this document. Please also remember to keep saving your project as you make changes, giving it a new file-name so you can recover if you make a mistake.
You will need to rejoin footnotes split across pages.
Make sure that the number/letter/symbol in the text matches the tag in the note itself. In-line footnotes (footnote within a line of text) are discouraged even when extremely short.
No anchor-text or anchor text without a footnote
If you can make out where the tag should go in the text, then it is probably best to insert it with a Transcriber's Note. If there is a tag without a footnote, then just a Transcriber's Note is probably best. See the section on Transcriber's Notes for details on how to word this and where to place it.
Footnotes are too small!
If neither you nor the proofreaders can figure out the footnote, please contact the Project Manager.
Save your work often
Remember to save your work often as you work on it, using a new filename each time, so that if you make a mistake, you can easily recover.
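If your editor does not keep versions for you, a small helper can make the numbered copies. This is a rough sketch only: the function name save_numbered_copy and the "_v001" naming pattern are illustrative choices, not a DP convention.

```python
import shutil
from pathlib import Path


def save_numbered_copy(path):
    """Copy `path` to the next unused name such as book_v001.txt, book_v002.txt."""
    src = Path(path)
    n = 1
    while True:
        candidate = src.with_name(f"{src.stem}_v{n:03d}{src.suffix}")
        if not candidate.exists():
            break
        n += 1
    # copy2 also preserves the file's modification time.
    shutil.copy2(src, candidate)
    return candidate
```

Calling save_numbered_copy("book.txt") before each major change leaves a trail of snapshots you can fall back on.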
Check for comments left by proofreaders making you aware of questions/problems/markup
Once you have checked through the book page by page, you should run a search for "*" to find notes proofreaders/formatters have made in the text to highlight questions/solutions and potential problems. Comments are usually entered with "[**" but by checking for all asterisks, you may find some comments that were entered incorrectly. As you check through all asterisked comments, you may find some that you decide not to implement; that's all right. Correct any issues you believe should be corrected and record any changes you make. You should remove the asterisked comments as you resolve them.
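Most editors can run this search directly, but a listing of every asterisked line can be handy as a checklist. The sketch below is one possible approach, not a DP tool: the function name find_asterisk_notes and the sample text are made up, and the "note"/"other" labels are only a rough guess at what each line contains.

```python
def find_asterisk_notes(text):
    """Return (line_number, kind, line) for every line containing an asterisk."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if "*" not in line:
            continue
        # "[**" is the usual way a note is opened; anything else may be a
        # mistyped note, a thought break, or block markup such as /* */.
        kind = "note" if "[**" in line else "other"
        hits.append((number, kind, line.strip()))
    return hits


sample = "He said [**typo?] hullo.\nPlain line.\nA stray * here."
for hit in find_asterisk_notes(sample):
    print(hit)
```

Lines labelled "other" deserve the closest look, since those are where an incorrectly entered comment is most likely to hide.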
Please be careful in correcting errors: what you might think is an obvious error may in fact be correct spelling/phrasing for the time the book was written. Some words marked as errors may actually have been normal usage for that author or for the period of the book. When in doubt, it is good to check the rest of the text to see whether and how a word was used. The Ngram Viewer can also help you determine whether a word flagged as an error was common when the book was published.
As the Post-Processor, you are responsible for resolving problems noted by the proofreaders and formatters. If you need advice or a second opinion, try any of the methods listed in the Help section of this document.
Correcting issues asterisked by proofreaders
Obvious printers' errors should be addressed in one or more of the following ways:
- Correct silently and state in the Transcriber's Note that all such errors have been corrected silently.
- Correct all such errors and note them in Transcriber's Note, linking (in the HTML version) each change to the change made in the text.
- Leave uncorrected and state in the Transcriber's Note that all such errors were left uncorrected. If you do this, it is important to keep track of which ones you have left uncorrected, since it's very easy to become confused later between those printers' errors that you're leaving "as is" and errors introduced to the text during our processing.
Note: Many post-processors fix what appear to be printer's "errers" (such as changing "errers" to "errors"). However, do not modernize or switch the spelling or grammar from British English to American English or the other way around. We are preserving history, not improving it.
For asterisked end-of-line hyphens, you will need to check whether the author hyphenated those words elsewhere in the text. If the words are hyphenated differently throughout the book, it is good practice to hyphenate or not according to what the author did most frequently for those words. If the words don't appear elsewhere in the text, it's good to check what the common hyphenation usage was for those words in the period in which the book was published. The Ngram Viewer can help you determine whether hyphenated words flagged as errors were in common usage in the period of the book.
For information about checking hyphenated words that are not at the end of lines, please read the Checking hyphenated words section of this document.
Check the markup
Make sure that the /# #/, etc. tags are balanced. Every <i> tag needs a closing and properly placed </i> tag and so on. If you haven't already done so, you should also ensure that no tags have been missed. To prevent later messes, please ensure that any poetry and indented text are correctly marked up. Also, if you haven't already done this, check any markup that ranges over a page break and make sure that redundant markup across a page break is deleted.
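A balance check of this kind can be sketched with a simple stack. The code below is an illustration only and not how any DP tool works: the function name check_markup is hypothetical, it covers just the inline tags <i>, <b> and <sc>, and it assumes that block markers such as /# and #/ sit alone on their own lines.

```python
import re

INLINE_TAG = re.compile(r"<(/?)(i|b|sc)>")
BLOCK_OPEN = {"/#": "#/", "/*": "*/"}
BLOCK_CLOSE = set(BLOCK_OPEN.values())


def check_markup(text):
    """Return a sorted list of (line_number, message) for unbalanced markup."""
    problems = []
    inline_stack = []  # entries: (tag name, line number)
    block_stack = []   # entries: (opening marker, line number)
    for number, line in enumerate(text.splitlines(), start=1):
        marker = line.strip()
        if marker in BLOCK_OPEN:
            block_stack.append((marker, number))
        elif marker in BLOCK_CLOSE:
            if block_stack and BLOCK_OPEN[block_stack[-1][0]] == marker:
                block_stack.pop()
            else:
                problems.append((number, f"unexpected {marker}"))
        for match in INLINE_TAG.finditer(line):
            closing, name = match.group(1) == "/", match.group(2)
            if not closing:
                inline_stack.append((name, number))
            elif inline_stack and inline_stack[-1][0] == name:
                inline_stack.pop()
            else:
                problems.append((number, f"unexpected </{name}>"))
    # Anything still on a stack was opened but never closed.
    for name, number in inline_stack:
        problems.append((number, f"<{name}> never closed"))
    for marker, number in block_stack:
        problems.append((number, f"{marker} never closed"))
    return sorted(problems)
```

Because it uses a stack rather than simple counts, the check also notices tags that are closed in the wrong order, not just tags that are missing.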
To review the various formatting tags, please read the Formatting Guidelines.
Check the text for other problems
You may find it easier to check for some of these problems while doing the plain text or HTML version of the book or while working on the ppgen source file. However, if you are updating just the plain text or the HTML version, please remember to update the other version before you complete your work.
There are numerous tools that can help you do this checking. Please check the What tools can I use? section of this document. One of the best, though, is the Post-Processing Workbench, which includes online versions of virtually all of the checking tools you'll need. It is also important that you do your checks using a font such as DP Sans Mono that allows you to easily differentiate between ones and lower-case "l"s, etc.
Your initial checking should check the following:
- end-of-line spaces need to be removed to prevent double spaces when text is rewrapped
- inconsistent line spacing around chapter and section headings
- spaces around hyphens
- spaces before punctuation
- spaces around quotes in English and/or other languages
- mismatched quotation marks
- he/be errors
- spaces around
- spaces within abbreviations
- inconsistencies in how periods are used with abbreviations
- incorrect use of oe and ae ligatures
- multiple spaces in non-marked text (other than for poetry and other specially indented text)
- incorrectly formatted thought breaks
- incorrectly formatted ellipses (according to the rules of the text's language, or ensuring they all match the original if that is what you prefer)
- dashes with three hyphens (---) instead of two (--) for an em-dash
- appropriate spacing of em- and long dashes (----)
- incorrect paragraph breaks
- incorrect or missed quotation marks
- letters hidden in numbers (for example, 60) and numbers hidden in words
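Several of the checks in this list are simple enough to express as regular expressions. The following is a rough sketch covering only four of them; the names CHECKS and scan_lines and the patterns themselves are illustrative assumptions, they will produce false positives (for example in poetry or in languages that put a space before punctuation), and they are no substitute for the dedicated checking tools.

```python
import re

# Each entry pairs a label with a pattern that is tested line by line.
CHECKS = [
    ("end-of-line space", re.compile(r"[ \t]+$")),
    # A space before punctuation, but not before an ellipsis such as " ...".
    ("space before punctuation", re.compile(r" [,;:!?]| \.(?!\.)")),
    # Exactly three hyphens; two or four are left alone.
    ("three hyphens instead of two", re.compile(r"(?<!-)---(?!-)")),
    # Two or more spaces between visible characters.
    ("multiple spaces", re.compile(r"\S {2,}\S")),
]


def scan_lines(text):
    """Return (line_number, label) for every check that fires on a line."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in CHECKS:
            if pattern.search(line):
                findings.append((number, label))
    return findings
```

Each finding is only a prompt to look at the line against the page scan; the decision about whether it is really an error stays with you.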
Checking hyphenated words
You should also check hyphenated words throughout the text and decide whether to standardize, and what to state in your Transcriber's Note. For example, if there are 20 occurrences of "to-morrow", but only one of "tomorrow", you could decide to change the irregular one, but if there is not a clear majority of either version, you will have to decide whether to leave them as written or to change the others.
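Counting the variants first makes this decision easier. The sketch below is one way to do it, under stated assumptions: the function name hyphen_variants is hypothetical, it only recognizes unaccented Latin letters, and it ignores case.

```python
import re
from collections import Counter


def hyphen_variants(text):
    """Map each hyphenated word to (hyphenated count, joined count)."""
    words = Counter(
        w.lower() for w in re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text)
    )
    report = {}
    for word, count in words.items():
        if "-" not in word:
            continue
        joined = word.replace("-", "")
        # Only report words that also occur without the hyphen.
        if joined in words:
            report[word] = (count, words[joined])
    return report


sample = "To-morrow and to-morrow, but tomorrow never. A well-known fact."
print(hyphen_variants(sample))
```

Here "to-morrow" is reported with two hyphenated uses against one joined use, while "well-known" is skipped because "wellknown" never appears.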
Paranoid text checks (stealth scannos, etc.)
These may be run by separate tools or by your main post-processing program. For more information, please refer to the manual or tutorial for the toolset you are using, or ask in the Post-Processing Forum.
These tools include "smart" programs which can check for irregularities such as he/be errors. Various regular expression (regex) searches are also available which flag unusual letter combinations such as
tb (possible scanno for
m). Some tools will run regexes as a set, through a search-and-replace box — again, please check the manual/tutorial for the software you're using.
One useful check is to run the regex
\n\n\n which catches all chapter and section spacing allowing you to confirm their consistency, as well as finding any extra line breaks between paragraphs — especially common after block quotes or poetry. It's a good idea to run this again on the text version, after you've removed markup such as
/* */ and
The Regular Expression Clinic for more information and help about regexes.
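If your tool does not offer a regex search, the same blank-line check can be sketched in a few lines of Python. This is only an illustration: the function name `blank_line_runs` is invented here, and the sketch assumes the text uses plain `\n` line endings and that blank lines contain no stray spaces.

```python
import re


def blank_line_runs(text, minimum=2):
    """Return (line_number, blank_count) for each run of `minimum` or more blank lines."""
    runs = []
    # A run of k blank lines appears as k + 1 consecutive newline characters.
    for match in re.finditer(r"\n{%d,}" % (minimum + 1), text):
        blank_count = len(match.group()) - 1
        # 1-based line number of the first blank line in the run.
        first_blank = text.count("\n", 0, match.start()) + 2
        runs.append((first_blank, blank_count))
    return runs


sample = "End of chapter.\n\n\n\n\nCHAPTER II\n\n\nIt was late.\n\nNext paragraph.\n"
for line_number, blank_count in blank_line_runs(sample):
    print(f"line {line_number}: {blank_count} blank lines")
```

Listing the counts side by side makes it easy to spot a chapter heading with three blank lines above it where the others have four.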
PPtext in our Post-processing Workbench is an excellent online tool to check for he/be errors and scanning errors.
Even if it looks like it's going to be a pain, spellchecking is always needed. Texts written before spelling was regularized might be the only reasonable exception but, even for those, spellchecking is often useful. Even books with dialect or other deliberate non-standard spelling can be spellchecked. You may want to leave this step until later in your checklist and/or repeat the spellcheck whenever you type in new information including a Transcriber's Note.
Note: Please do not modernize spelling. It is important to be certain that what you see as a potential error is actually an error. Spelling in the time of the book you are working on may have been different from modern spelling. If you have any doubts about whether something is an error or not, you can check elsewhere in the book to see whether it was spelled that way throughout. If you don't find anything similar to that word, you can try the Google Ngram Viewer to see how most writers spelled that word at the time of the book. You should also not change spelling to meet modern American or British spelling standards.
PPtext in our Post-Processing Workbench tool suite is an excellent tool for checking your spelling and finding word inconsistencies throughout the book.
Checking occasional text in other languages
If your book has made occasional use of other languages that you cannot readily check, you can check the Language Skills List to ask for help. If a native speaker hasn't been at DP for some weeks or can't help with your particular problem, have a look in the forums. In the Languages Other Than English (LOTE) section of this document, there is a list of all the forum areas in which you may obtain help for various languages. Don't worry if there are few members or the forum hasn't been posted to in a while — your question might be all it takes to create a lively and helpful community discussion.
Handle any illustrations
Move each illustration markup to an appropriate paragraph break. Some post-processors like to put illustrations just before or after the text they illustrate. Others prefer to place them at the end of the chapter, not wishing to interrupt the flow of the text. Do whatever you think is right for your book.
For information on handling illustrations for the HTML version, please read the Preparing illustrations for the HTML version section of this document.
Create a Transcriber's Note
The Transcriber's Note generally goes at the very end of the book after the footnotes and index section. Transcriber's Notes can help the reader understand how you've processed the text and what decisions you have made in preparing the text.
As you work on a book, you should prepare the information that you want to include in your Transcriber's Note. Your Transcriber's Note should always inform readers of important information about preparation of the book such as the fact you have moved illustrations near the text to which they refer.
Transcriber's Notes should also describe the extent to which you have or have not dealt with errors within the text. Some post-processors enter every single change they have made (and the relative page numbers) into the Transcriber's note. Others prepare a simpler Transcriber's Note to inform readers that punctuation has been normalized, that obvious errors have been corrected, or that all apparent printer's errors have been retained. It's up to you how detailed your Transcriber's Note is.
One benefit of the Transcriber's Note providing either a detailed list of all corrected errors or one stating that all apparent printer's errors have been retained is that it stops the PG whitewashers from getting long errata requests to "fix" your text. Stating that all printer's errors have been retained is not, however, an excuse for leaving in transcription errors such as bad Optical Character Recognition (OCR), scannos, or similar detectable problems.
If your book includes bolding, italics, etc., it is good practice to explain in the plain text version of the Transcriber's Note all the characters such as _ that you have used to represent that formatting.
It is a good idea to check recent Project Gutenberg postings of Distributed Proofreader books to see how other post-processors have handled Transcriber's Notes. If in doubt, you can also talk to the Project Manager or discuss the issue in the forums.
Please also check your own spelling and grammar in this section.
Note: Regardless of how much information you plan to put into the Transcriber's Note, you should personally keep track of every printer's error you find in the book. If you don't, you'll find that you have trouble as you run various checks on the book telling whether a problem is a printer's error or a new error introduced by the Optical Character Recognition or our proofreading and formatting rounds.
Transcriber's Note at the start of the text
Although the Transcriber's Note should usually be placed at the end of the book, some Post-Processors also include a short Transcriber's Note about reading the text at the start of the book before the title page. This type of note would include such things as instructions on how to play the music samples included with the book or information on viewing the graphics.
Occasionally, Post-Processors place a short warning here such as an alert that some of the recipes in the book may contain ingredients we now know to be poisonous. This practice is frowned upon by many Post-Processors, but, should you decide to include such a Transcriber's Note, it is important that it be short, professional, and non-judgmental.
More about detailed Transcriber's Notes
Sometimes detailed Transcriber's Notes can be quite lengthy.
Transcriber's Notes:
Page 13, "10,00 troops" changed to "10,000 troops." (We fought 10,000 troops at St Germaine.)
Page 27, "Faw-cett" [Note: please link each such change to the change made in the text] changed to "Fawcett". (Major Fawcett dictated the memo.)
etc. etc.
While we don't retain the individual page numbers in the plain text version, providing the page number in a detailed Transcriber's Note (if you are preparing a detailed Transcriber's Note) gives the reader an idea of where an error is within the book. The reader can also search for the corrected text you have included in the Note to find the exact location of your edit. In the HTML version, if you are preparing a detailed Transcriber's Note, you should link each change to the related text that was changed.
Where to find more information about Transcriber's Notes
For more details about Transcriber's Notes, please read the Transcriber's Note section of the Getting your PP Project Ready for PPV document. For information about Transcriber's Notes for books in languages other than English, please read this section.
Curly quotes or straight quotes?
Over the years there has been a great deal of dispute over which type of quote to use in our books. You will find a great deal of discussion of the pros and cons in our forums. The final decision rests with the individual post-processor. If you decide to convert your book from straight to curly quotes (or if your book has a mix of both straight and curly and you want all curly), you can use one of the Post-processing Workbench tools — PPsmq.
Plain Text Version, HTML Version — what do I do first?
Different post-processors use different tools and different approaches and will check for errors at different spots in their process. Some will prepare a plain text version, save it, and convert a copy of the text to HTML. Others will create an HTML version, save it, and use that copy to produce the plain text version. Still others create a source file from which they generate both the plain text and the HTML versions using a tool such as ppgen.
The important thing is that no new errors are introduced to the book through whatever method you use and that you check all your plain text and HTML very carefully (that's where the Post-Processing Workbench comes in).
Creating a plain text version
Some tools such as ppgen involve working on a master file that generates both the text and HTML versions at the same time. Nevertheless, even if you use such a tool, this section of the FAQ is worth reading because you will need to do all the main checks on the resulting text output.
In working on the plain text version, it's especially important that you use a monospaced font so you can align text for tables and verse, etc., since tabs cannot be used to space text (tabs do not translate within the text, HTML and e-reader formats). The DP Sans Mono Font is especially useful for this since it allows you to easily differentiate between ones and lower-case "l"s, etc., which is useful when you start to check your project for errors.
Plain text character encoding
Please save your plain text version as UTF-8. Both Project Gutenberg and Distributed Proofreaders now default to UTF-8 encoding. Project Gutenberg will now accept UTF-8 text format even if the same text could be represented fully as ASCII, Latin-1 (ISO-8859-1) or CP-1252 (Windows-1252). For more information about Project Gutenberg and UTF-8, please read Project Gutenberg's File Formats document. For information about UTF-8 and Post-Processing, please read this wiki article.
Save a new version
Saving a copy of your work so far for use with the HTML version
Changes made to convert the book to plain text will remove the markup that is needed for the HTML version. Therefore, if your process is to work first on the plain text version of the book and then create the HTML version, it's important, before you start work on the plain text version, to save a safe copy of the version of the text that you've been working on up to this point so you can use it later to create the HTML version.
You should give the to-be-saved-for-the-HTML version file a distinctive name such as "name-markup.txt" — so you don't mix it up with the plain text file you'll be working on for the next bit.
Creating a plain text file
For the plain text version of the book, you should save the project you've been working on with yet another name such as name.txt. As mentioned above, please keep a separate file for the work done so far with a different name for use in creating your HTML version.
Note that all file and folder names must be in lower-case characters to ensure there is no upper/lower-case conflict later in the process at post-processing verification (PPV) or at Project Gutenberg (PG). You should also give the plain text file a name that can easily be associated with the book you are working on. Please keep the name short but at least four characters long (files of three characters have been known in the past to cause PG problems).
As you work on the text version, please remember to save periodically with a new name such as "namea.txt", "nameb.txt", etc. so that there is no risk that you can make an error and lose all the work you've done.
Remove comments
Any remaining comments should be removed from the text version once they've been resolved.
Rejoin pages
For the plain text version, you must remove all the page separators and check either side of them to see whether the next page requires a blank line, is a section or chapter, or needs to be continuous text. You can rejoin words split across pages at this point if you haven't done so earlier. Some post-processing tools such as Guiguts do a good job of removing the page separators but it's also possible to remove them using search and replace tools.
Change markup
You should change all bold, italic, etc. markup (
<f>) and notate what you used for each markup in your notes so you can mention them in your Transcriber's Note at the end of the book.
It's most usual to replace italic markup with _ and bolding markup with =. For other markup, you might use ~ etc. For a discussion of the options, please see this forum thread.
For small capital text that is in <sc> tags, post-processors generally convert the text to all-caps and remove the <sc> tags entirely. For more information on how to handle smallcaps, please refer to the Guide to Small Caps.
Once you're done, you should do a quick search for the > characters to make sure that no markup code has slipped through into your plain text file.
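For post-processors who prefer scripting to search-and-replace, the usual conversions can be sketched in Python as below. This is a minimal sketch, not how Guiguts or any other tool does it: the function names are invented, it assumes the markers _ for italics and = for bold, and it leaves any other tag (such as `<f>`) untouched so you can decide what to do with it.

```python
import re


def convert_markup(text):
    """Replace inline formatting tags with plain-text markers."""
    text = re.sub(r"</?i>", "_", text)  # italics become _text_
    text = re.sub(r"</?b>", "=", text)  # bold becomes =text=
    # Small caps become ALL CAPS and the tags are removed.
    text = re.sub(
        r"<sc>(.*?)</sc>",
        lambda match: match.group(1).upper(),
        text,
        flags=re.DOTALL,
    )
    return text


def leftover_tags(text):
    """List anything that still looks like a tag so it can be checked by hand."""
    return re.findall(r"</?[a-z]+>", text)


sample = "<i>Hamlet</i> by <sc>Mr. Smith</sc>, <b>new</b> edition <f>x</f>"
converted = convert_markup(sample)
print(converted)
print(leftover_tags(converted))
```

Whatever markers you choose, note them down for the Transcriber's Note, and check the leftover list by hand rather than deleting tags blindly.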
Ligatures
For the HTML version of your project, please leave your ligatures as they appear in the proofing image. If œ ligatures have been entered into the text as [oe], please convert them to œ. For the plain text version, please also leave your ligatures as they appear in the proofing image. It is also acceptable, however, for Post-Processors to convert œ ligatures to oe for the plain text version.
You should also check that each œ ligature is really an œ and not an æ ligature, since the two ligatures are extremely similar and it is a common error for proofreaders to mix up œ and æ in both upper- and lower-case, especially when these ligatures are presented in italics. It is very useful to review the æ and œ ligatures document, which includes pictures of the two ligatures as they often appear in our books.
Dashes
For the HTML version of your project, please convert each hyphenated dash (--) to an em dash (—). Your HTML em dashes can be the Unicode character itself, or you can use a code such as &mdash;, &#8212;, or &#x2014; (for more information on using these codes, please consult an HTML instruction text). However, for the plain text version, it is common to use hyphenated dashes rather than em dashes, though em dashes are acceptable.
Footnotes in the plain text version
Now is a good time to tidy your footnotes, (that is, make them read
 text, rather than
[Footnote 1: text], etc.). You should renumber the footnotes so that each one in the book has a unique number, alphabetic letter, or Roman numeral in order to make it easier for the reader to search the text. Alphabetic letters and Roman numerals are not recommended if the book has more than 20 to 30 footnotes as they may become hard to read/distinguish. There may be some projects in which you may prefer to retain the numbering as in the original publication. As Post-Processor, that is your choice.
For the plain text version, you may put each footnote after the paragraph it refers to or at the end of the chapter or section depending on what you believe would make it easiest for the readers. Consider using end-of-paragraph footnotes if the footnotes are short, unique, and not infrequent. It is best to use end-of-section or -chapter footnotes for longer footnotes (such as those that have poetry or block quotes) or those that have multiple references in the text for one footnote. Whichever you choose, please be consistent within the work — use all end-of-paragraph footnotes or end-of-section/chapter footnotes within one work. Don't switch back and forth.
Sidenotes in the plain text version
Many post-processors panic when they see sidenotes. This is usually the wrong reaction (though it is sometimes justified). The simplest case of sidenote use is when there is no more than one sidenote per paragraph (often at the start of the paragraph). However, in some cases the situation becomes more complicated, with several sidenotes per paragraph.
In the plain text version it is probably best to leave sidenotes inside the [Sidenote: Text of sidenote.] markup, so the reader can tell what they are. Some people like to have them as headings, leaving a blank line between the sidenote and the paragraph.
In deciding where to put the sidenote, you have at least a couple of options for the plain text version of your book. The first option is to put each sidenote (still in its [Sidenote: Text of sidenote.] markup) at the start of the sentence to which it refers. This has the advantage that the text stays easy to read; however, if the sentences are long, the sidenotes could end up quite far from their referents. The second option is to try to place the sidenotes more exactly, by placing them in the middle of sentences. However, this can lead to a text that is much harder to read.
Indices in the plain text version
Please retain page numbers in your index.
Illustrations in the plain text version
The Illustrations section of the Formatting Guidelines specifies that formatters insert the words [Illustration] for captionless illustrations and [Illustration: (text of caption)] (for example, [Illustration: AS HE FIRED IT LARRY LEAPED TO ONE SIDE TO ESCAPE THE LION'S CLAWS.]) for ones with captions. It is quite acceptable for you to use that format for illustrations in your plain text version of the book.
Some Post-Processors like to specify whether an image without a caption is decorative or not (for example, [Illustration: Decorative Image]). Also, if the illustration is purely decorative and clearly used to separate segments of text within a chapter or other block (not between chapters), it could be treated as a <tb> (for more information on <tb>s, please read the Check formatting and spacing section of this document). If you decide to replace a decorative image with a <tb> in the plain text version, please add a comment to that effect to your Transcriber's Note.
Occasionally, a captionless illustration contains information that is important to understanding the text. For example, an illustration of a horse could exist in a book as a heading for information about horses. In such a case, some PPers describe the image (for example, [Illustration: Horses]) in their plain text version. If you decide to do this, please explain what you have done in your Transcriber's Note.
For information about placement of illustrations, please read the Handle any illustrations section of this document. For information on handling illustrations for the HTML version, please read the Preparing illustrations for the HTML version section of this document.
Greek in the plain text version
Greek may occasionally have been transliterated (converted to Latin letters) during proofreading at the request of the Project Manager. There are various ways to handle this.
In the plain text version, you can leave the transliteration, commonly removing the [Greek: Transliterated text.] markup, although you may wish to use another markup of your own, such as a +, and mention its use to indicate Greek in a Transcriber's Note. However, most PPers prefer to simply produce the Greek using the original Greek characters. Please post in the forums if you would like help with this.
For more information about handling Greek in this document, please read the Greek in the HTML version section.
Rewrapping text
Once you're ready, you should resave your work at this point with yet another new name and prepare to rewrap the text. Also, please continue to save backup copies of your file regularly as you work; it will be much easier to recover from a formatting decision gone terribly wrong.
Before you rewrap, please make sure that you have all the areas such as poetry and tables marked so that they will not be rewrapped with the rest of the text.
Please check the What tools can I use? section of this document for information about where to find tools to help with rewrap.
Plain text line length
According to Project Gutenberg's standards, text line length should be 60-70 characters for regular text with 75 characters wide being the maximum length. If you must go beyond that length for poetry or tables, etc., please include a note to the Post-processing verifiers to alert them to the reason (or in your upload form, if you have Direct Upload capability).
In post-processing the plain text version, you should aim at a line-length of 72 characters so that most lines will end up being 70 or less. There may be justification for 80 characters for tables or other essentials (long line poetry might be another example). If there's absolutely no way to shorten a feature such as a family tree, you can leave it as is. It is often worth posting in the Post-Processing Forum though as others may see a sensible way to condense or reformat the feature.
Short lines (usually less than 54 characters wide though it is preferable to have lines at least 60 characters wide) should be corrected. Of course poetry and tables of content, indices, etc. may require short lines of text.
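The line-length limits above can be checked with a short script once the text has been rewrapped. The Python below is a sketch under stated assumptions: the function name and the 75 and 54 defaults are taken from the figures in this section, indented lines are assumed to be poetry or tables and are not reported as short, and a short line is only reported when the next line continues the same paragraph.

```python
def line_length_report(text, max_len=75, min_len=54):
    """Return (too_long, too_short) lists of (line_number, length) pairs."""
    lines = text.split("\n")
    too_long, too_short = [], []
    for number, line in enumerate(lines, start=1):
        length = len(line)
        if length > max_len:
            too_long.append((number, length))
        elif (
            0 < length < min_len
            and not line.startswith(" ")  # indented text is assumed not to rewrap
            and number < len(lines)  # there is a following line...
            and lines[number].strip()  # ...and it continues the paragraph
        ):
            too_short.append((number, length))
    return too_long, too_short


sample = "\n".join(["x" * 80, "short line", "y" * 60, "", "last line of paragraph"])
long_lines, short_lines = line_length_report(sample)
print("too long:", long_lines)
print("too short:", short_lines)
```

Every reported line still needs a human decision; a long line in a table or family tree may be justified, as described above.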
Poetry, tables and blockquotes, etc. that should not rewrap
Once rewrapping is completed (and has been checked!), it's time to complete the formatting for the table of contents, poetry and tables etc. that were not rewrapped. If you need help formatting features such as tables, Greek, poetry, etc. please see the Help section.
Any text such as poetry or tables that should not be rewrapped should be indented by one to four spaces from the left margin. This is a Project Gutenberg requirement to prevent rewrapping in future versions of your text.
Indents within a poem, i.e., relative indents, should be added on to your chosen indent. For example, if a line is indented by 2 spaces from the line above and you are using a 4-space indent for poetry, in your final version this line will be indented 6 spaces altogether. Remember: You cannot use tab characters.
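The indent arithmetic can be sketched in Python as follows. The helper `indent_block` is hypothetical; it simply adds the chosen base indent in front of each line so that relative indents are preserved, and it refuses tab characters.

```python
def indent_block(lines, base=4):
    """Add a base indent of 1 to 4 spaces to each line, keeping relative indents."""
    if not 1 <= base <= 4:
        raise ValueError("base indent should be between one and four spaces")
    if any("\t" in line for line in lines):
        raise ValueError("tab characters cannot be used")
    # Blank lines stay empty so no end-of-line spaces are introduced.
    return [" " * base + line.rstrip() if line.strip() else "" for line in lines]


poem = ["Roses are red,", "  Violets are blue,"]
for line in indent_block(poem):
    print(repr(line))
```

With a base of 4, the second line, which was indented 2 spaces relative to the first, ends up indented 6 spaces altogether.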
As long as you make sure your rewrap markers have been set correctly so that the line separation of poems is not lost, post-processing poetry shouldn't be difficult. If you have difficulty, you could look at recently posted poetry books at Project Gutenberg for layout ideas. Some post-processing software has extra features for handling poetry; please refer to the user guide or manual for more information.
Blockquotes should also be indented to show their separation from the rest of the text. However, if block quotes in a book are not separated from the rest of the text, i.e., if they do not appear any different from regular paragraphs within the book, there is no need to indent them.
Tables, including tables of contents and lists of illustrations, also need to be indented to avoid rewrap/respacing. If you need help with a table, you should post in the Turn the Tables forum area. There is also good information about table creation in the wiki.
Vertical spacing in the plain text version
It is important to ensure that the number of blank lines above and between elements is correct:
- Place four blank lines at the top of the book. That ensures that the Project Gutenberg boilerplate text is well-separated from the book proper.
- There should be four blank lines between the frontispiece and the title page.
- Each new chapter should have four blank lines above it.
- If the chapter has a subchapter or related text, etc. that text should be separated from the chapter heading by one line.
- The main text of a chapter should be separated by two blank lines from the chapter heading and related chapter heading text.
- New sections should have two blank lines above them and one blank line after as per the DP Formatting Guidelines.
Removing end-of-line spaces and rewrap markers
Once your text is suitably rewrapped and non-rewrapped formatting handled, it's time to remove any end of line spaces.
Here again, use the post-processing software wherever possible! All current tools include this task. It is now time to remove the rewrap markers and double-check that the flow of the text looks OK.
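If you ever need to do this step by hand, removing end-of-line spaces is a one-line operation in most scripting languages. The Python sketch below is only an illustration (the function name is invented) and assumes the text uses plain `\n` line endings; it does not touch rewrap markers.

```python
def strip_trailing_whitespace(text):
    """Remove spaces and tabs at the end of every line, leaving the rest untouched."""
    return "\n".join(line.rstrip(" \t") for line in text.split("\n"))


sample = "First line.  \nSecond line.\t\n\n    Indented verse  \n"
print(repr(strip_trailing_whitespace(sample)))
```

Leading spaces are deliberately preserved, so indented poetry and tables keep their layout.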
Straighten up the title page, front matter, table of contents, and list of illustrations
When formatting the title page, you have some leeway. You can adjust the pieces if you like such as moving the author's name directly under the "by". Relative indenting is not required, but can be added if you wish.
You should block indent a consistent amount (from one to four spaces) if there are consecutive lines that should not be rejoined later if the text were rewrapped.
Please leave all the original information on the title page, including the edition, year of publication and any copyright notice (unless this is a reprint — check with the Project Manager if in doubt). It is better to display as much information as possible than to try to find it once the book has been posted for years.
You may also remove redundant half-titles and reorder front matter.
For the table of contents and list of illustrations, please retain the page numbers they list — and check to be sure they really are the correct page numbers (If they aren't you'll want to mention that in the Transcriber's Note as well as whether or not you have corrected them in the text).
You should line up the chapter titles and page numbers to make them look neat and easy to read. Copying the original format of the table of contents usually works fairly well.
Check formatting and spacing
In checking through the formatting and spacing within your plain text version, there are several things to check, including:
- Rewrap was completed correctly and that there are no overly long lines or lines that are too short. For information on line length, please see the Line Length section of this document.
- Vertical spacing should be done according to the Vertical spacing in the plain text version section of this document.
- All rewrap markers have been removed.
- Check spacing around such things as poetry and correspondence.
- [Blank Tags] have been removed.
- <tb> should be replaced with a line of asterisks: 7 spaces, followed by 5 stars, each separated from the next by 7 spaces, like this:
       *       *       *       *       *
- Sidenotes are correctly spaced
- Mac and *nix users need to change line endings to
Once your review is complete, you should do a final check using a tool such as Gutcheck or the Post-Processing Workbench's PPtext to make sure that there are no remaining problems and that no issues have been introduced during the tidy-up process (such as short lines being left after the removal of markup codes). PPtext is also useful if you are using curly quotes since it checks for suspect quotes.
PPtext and other tools
The PPtext tool was written specifically to pick out many of the most common problems we find in transcribing texts. It is probably the single most important check you will perform and is performed on the plain text file. If you are using a program such as ppgen to generate both the plain text and HTML files from a source file, you should run the PPtext tool against the text file and update your source as you locate issues to be corrected.
Gutcheck as well as several other important tools are included in the checks done by the PPtext portion of the Post-Processing Workbench. This tool is also useful for checking curly quotes. You may also use Project Gutenberg's online Gutcheck. If you do not want to use an online version, you can download Gutcheck from here, and run it according to the instructions given there.
You should either run the Gutcheck initially with all options turned on or run each check individually (making sure not to skip any).
Not all potential problems flagged by Gutcheck are genuine errors (for example, it may report short lines where the text contains poetry or a table) but each should be looked into and corrected if necessary. You should continue to run Gutcheck after each series of corrections until it doesn't flag any more "true" errors.
If you find problems that will also apply to the HTML version of the book, you should note the problem and make sure you also correct it in that version as well.
Some common PPtext/Gutcheck things to watch for
- Footnote markers are falsely flagged as "Wrongly spaced brackets". Check them anyway.
- Lengthy hyphenated words often cause short lines above or below. Try rewrapping just that paragraph a few spaces shorter to rearrange the words sufficiently to cure this error. Short lines for the table of contents, lines of poetry, etc. are okay.
- Unless you are checking a deliberately-ASCII version of your text, you do not need to worry about characters flagged by "Non-ASCII character".
- Wrongly spaced/missing quotes often appear where characters' quoted speech runs through several paragraphs. Check these, but if they are right according to the proofreading guidelines, that's good enough for Gutenberg posting. It is specifically for curly quote checking that the PPscan portion of PPtext was developed.
Byte-Order Marks
The Byte-Order Mark (BOM) should be removed from UTF-8 plain text or HTML files before uploading them for Smooth Reading, PPV or direct uploading to Project Gutenberg. We realize, however, that submitters use different toolsets and it's not always easy to know whether a BOM is included or to remove it if it is. Therefore, don't panic if you are not sure: there is automation in place at Project Gutenberg to ensure errant BOMs are not included in the final release. For more information on removing BOMs, please read this article. For information about UTF-8 and Post-Processing, please read this wiki article.
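If your editor gives no indication of whether a BOM is present, a small script can check and remove it. The Python sketch below uses the standard library constant `codecs.BOM_UTF8` (the three bytes EF BB BF); the function names are invented for this example and it is not the automation used at Project Gutenberg.

```python
import codecs


def strip_bom(data: bytes) -> bytes:
    """Return the bytes without a leading UTF-8 Byte-Order Mark, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def remove_bom_from_file(path):
    """Rewrite the file without its BOM; return True if a BOM was found and removed."""
    with open(path, "rb") as handle:
        data = handle.read()
    cleaned = strip_bom(data)
    if cleaned == data:
        return False
    with open(path, "wb") as handle:
        handle.write(cleaned)
    return True


print(strip_bom(b"\xef\xbb\xbfHello"))
```

Working on raw bytes means the rest of the file is written back exactly as it was, whatever characters it contains.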
Smooth Reading your Project
Smooth Reading for PP
Smooth Reading (SR) is a very important step in the post-processing process. SR involves having volunteers read through the book attentively and mark possible errors to report to the Post-Processor. An extra pair of eyes is always helpful in finding things you might have overlooked in the text, and a good way to find those extra eyes is by making use of the Smooth Reading Pool!
Is my book suitable/will it be a benefit?
Yes! -- Even if your book is extremely specialized and took months to crawl through the proofing rounds, it may be of interest to a Smooth Reader (SRer). As for benefit -- any mistakes picked up before posting to PG are good!
How much Post-Processing should I do first?
Post-Processors (PPers) may submit their book to Smooth Reading at different stages in their workflow depending on how they are processing the book. While some PPers may submit several formats to Smooth Reading at once, others may submit a single text format while they continue to prepare other formats.
What formats should I upload for SR?
At a minimum, please include a text file. This is not only best for those who are on slow or metered connections, but is also the easiest format for a Smooth Reader to use to insert comments about possible errors, even if they're actually SRing in a browser or e-reader. Text files should be encoded as UTF-8.
Allowed formats are:
- text (.txt extension)
- HTML (either .htm or .html, plus images)
- epub (.epub extension)
- mobi (.mobi extension)
See How do I prepare my files for uploading to SR? for more details.
Should I include the .bin file?
File names should be reasonably short, and contain only lower case letters a-z, the numbers 0-9, and the hyphen - and underscore _ characters, with a single dot separating the filename from the extension. No capital letters, spaces, or other special characters should be used. For example:
Please use filenames that correctly reflect the contents. This may help Smooth Readers to more readily locate the downloaded files later.
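The naming rule can be expressed as a single regular expression, which is handy if you want to check a folder of files before zipping them. The Python below is a sketch: the function name `is_valid_sr_filename` is invented, and the pattern only encodes the character rule described in this section, not the list of allowed formats or any length limit.

```python
import re

# Lower-case letters, digits, hyphen and underscore, then one dot and an extension.
SR_FILENAME = re.compile(r"[a-z0-9_-]+\.[a-z0-9]+")


def is_valid_sr_filename(name):
    """Return True if the name follows the character rules for SR uploads."""
    return SR_FILENAME.fullmatch(name) is not None


for candidate in ["us_history.txt", "US_History.txt", "us history.txt", "us.history.txt"]:
    print(candidate, is_valid_sr_filename(candidate))
```

Only the first candidate passes; the others contain capital letters, a space, or a second dot.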
How do I prepare my files for uploading to SR?
- Text File
- Please use the .txt extension. Post-processing files are now assumed to be UTF-8 and Project Gutenberg no longer requires the -utf8 suffix for filenames, so it is no longer necessary to include -utf8 in SR file names. This means, however, that if you use an encoding other than UTF-8 you must leave a note to the SRer mentioning what encoding your file uses.
- Please ensure that your file does not include a Byte-Order Mark (BOM).
- Multiple versions of the text file may be uploaded for SR, and each will be available both for download and for viewing online.
- HTML
- Include the HTML file and the images folder. The HTML extension can be either '.htm' or '.html'.
- Please ensure that your file does not include a Byte-Order Mark (BOM).
- Epub and Mobi
- Both Epub (.epub) and Mobi (.mobi) are already compressed files, and are self-contained. No special preparation is needed -- they should just be added to the master zip file for upload. These file types are available for download only.
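If you are not sure whether your editor wrote a Byte-Order Mark, you can test for it directly. This is a small sketch that assumes a UTF-8 file; the function names are illustrative only.

```python
import codecs
from pathlib import Path


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the bytes start with the UTF-8 Byte-Order Mark (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the bytes without a leading UTF-8 BOM, unchanged if none is present."""
    return data[len(codecs.BOM_UTF8):] if has_utf8_bom(data) else data


def remove_bom_from_file(path: str) -> bool:
    """Rewrite the file without its BOM; return True if the file was changed."""
    file_path = Path(path)
    data = file_path.read_bytes()
    if not has_utf8_bom(data):
        return False
    file_path.write_bytes(strip_utf8_bom(data))
    return True
```

For example, remove_bom_from_file("us_history.txt") would leave a BOM-free file untouched and report False. Keep a backup copy before rewriting any project file.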
Once you have prepared all the formats you plan to upload for Smooth Reading, please combine them into a single zip. For example:
- us_history.txt (at least one text format is required)
It is good practice to give your upload zip file a name similar to the file names you've included within it. Using the example above, us_history_sr.zip or us_history_upload.zip would be good choices. The DP system will store the uploaded zip on the server by renaming it to associate it with that project: the projectID followed by 'smooth_avail', but when a Smooth Reader unzips it, they will see your original filenames.
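Any zip tool will do for this step. If you prefer to script it, the sketch below packs the files into a single archive in memory and refuses to build one without a text file. The function name and the file names in the usage note are hypothetical examples.

```python
import io
import zipfile


def build_sr_zip(members):
    """Pack {archive_name: file_bytes} into one zip and return the zip as bytes.

    Raises ValueError if no .txt file is included, since at least one
    text format is required for Smooth Reading.
    """
    if not any(name.lower().endswith(".txt") for name in members):
        raise ValueError("at least one .txt file is required for Smooth Reading")
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content in members.items():
            # Names such as "images/i_004.jpg" keep the images folder structure.
            archive.writestr(name, content)
    return buffer.getvalue()
```

A typical use would read each prepared file with pathlib.Path(...).read_bytes(), pass the results to build_sr_zip, and write the returned bytes to a file such as us_history_sr.zip.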
Where do I find Smooth Readers?
Submit your project to the Smooth Reading Pool!
You can also advertise your project in the Smoooth Readers Team thread or in the Project Discussion for your project, and possibly other places. For instance, if your project is in a language other than English, or has significant parts that are in another language, it may be helpful to advertise it in that language's team thread.
It is very helpful to post a comment with the SR upload indicating special points of interest or value in the book. New listings are promoted regularly in the Smoooth Readers Team thread, and include any PPers' promotional comments. If you don't show enthusiasm about your project, why should someone smooth read it?
How do I make my project available for Smooth Reading?
Once your complete SR zip file is ready, go to the Project Page for your book and scroll down to the Smooth Reading section of the page. There, you will be able to upload the project to the pool for between 7 and 42 days. The default is 21 days, but that is easily changed by using the arrows, or simply typing in a number between 7 and 42.
Decide how long your project should stay in the SR Pool
When deciding how long you want your project to be in the Smooth Reading Pool, please take into consideration:
- Any known special deadlines you may have (e.g. an upcoming vacation, a special day).
- Size--not in pages (because page size can vary widely) but the size of the text file in kilobytes (KB). The larger the file, the more time the Smooth Readers will need to finish reading it.
A good starting estimate is to divide the size of the text file in KB (not zipped) by 25 for the number of days to read. This is based on approximately 300 words per page at 16-17 pages per day, with an average of 5 characters per word.
Consider adding about 5 days to that estimate to give Smooth Readers time to finish their current projects and to find your book in the listing. Although you may be perfectly happy to extend the project's time in the pool if requested, some SRers may be reluctant to ask for extensions, and may skip books they know they can't finish in the remaining time.
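The estimate above can be written as a small calculation: 300 words x 16.5 pages x 5 characters is roughly 25 KB per day. The sketch below applies that figure, adds the suggested 5 days, and keeps the result inside the 7 to 42 day range the upload form accepts. Rounding up and clamping are assumptions of this sketch, not DP rules.

```python
import math

MIN_DAYS, MAX_DAYS = 7, 42  # range accepted by the Smooth Reading upload form
KB_PER_DAY = 25             # rule of thumb from the text above
BUFFER_DAYS = 5             # extra time for readers to find and start the book


def estimate_sr_days(text_size_kb: float) -> int:
    """Suggest a number of days in the SR pool for an unzipped text file size in KB."""
    reading_days = math.ceil(text_size_kb / KB_PER_DAY)  # rounding up is an assumption
    return max(MIN_DAYS, min(MAX_DAYS, reading_days + BUFFER_DAYS))


if __name__ == "__main__":
    for size_kb in (25, 400, 2000):
        print(size_kb, "KB ->", estimate_sr_days(size_kb), "days")
```

Treat the result as a starting point only; known deadlines and the nature of the book still matter more than the arithmetic.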
If a Smooth Reader does ask for additional time to complete the book, or if you wish to extend the time because it has not been read yet, you can easily extend the time for an additional 1-42 days at any time before the initial period expires. Please use this feature carefully; once the time is extended, it cannot be reduced to a shorter period.
If the time is not extended before the SR period expires, the project will have to be re-uploaded to the SR pool like a new project if you wish it to be returned to the SR pool.
Finish submitting project to the SR pool
Once you've decided how long you want your project to spend in the SR Pool, simply choose the number of days, and click on the
You will be presented with a form that allows you to leave comments about your project for the Smooth Readers. In addition to any special instructions you may want to leave, such as what to look for or to ask for attention in a particular section, please consider including a short description or excerpt from the text to elicit interest in your book, especially if the title is not very descriptive.
Then click on the Choose File button and navigate to where the zip you wish to upload is stored on your computer, and then click on the Upload file button. Once you've submitted your project to the Smooth Reading Pool, this area will include the options to replace the Smooth Reading text, and to extend the project's time in the pool.
Can I add or replace a Smooth Reading file?
Yes. However, you will need to re-upload all formats. What you upload will overwrite what is there already, so anything you want to remain available must be in the new zip you upload to the project.
Can I replace just a single file?
No; when a replacement zip is uploaded to the project page, the "smooth" directory that contains all the different formats is recreated. Even if you need to replace only a single file, you will need to re-upload all formats.
What if I accidentally put the wrong project in the SR pool?
If the "wrong" project is within 1‒2 days of being ready for SR, replace your SR upload with a zip of a text file explaining what happened, and also let a Smooth Reading coordinator and the SR team know by posting in the Smoooth Readers Team thread. This text file can then be replaced with the correct zip file when it becomes available in a couple of days.
If the "wrong" project is not close to being ready, let a Smooth Reading coordinator and the SR team know by posting in the Smoooth Readers Team thread. A Smooth Reading coordinator will arrange to have it removed.
Upload the SR zip to the correct project.
Can I remove my book from the Smooth Reading Pool?
No, you cannot remove your book from the pool before the deadline, or decrease the amount of time a book is available for. This would be unfair to the Smooth Readers who may be working on your book already -- not all Smooth Readers use the "Volunteer" button to officially sign up for SRing a book; it's an option, not a requirement.
Getting notification when a project finishes Smooth Reading
It is important not to upload a project to Project Gutenberg or to PPV before its time in the Smooth Reading Pool is completed. Consequently, it's good to set an event subscription to alert you when the project finishes Smooth Reading. To do this, simply scroll down to the Event Subscriptions portion of the Project Page and click on the subscription box beside "Project finishes Smooth Reading".
What do I do with the feedback?
When a Smooth Reader finishes, they should upload the annotated SR text file back to the project with their comments. Once a report is uploaded, you can download it, or you can wait until the project leaves the Smooth Reading pool and download all SR reports from the bottom of the Smooth Reading section of the Project Page.
If a Smooth Reader notifies you of their findings directly, without uploading a file to the project page, please ask them to also upload it to the project if possible. If it is not possible, please post a message to the Smoooth Readers team topic as soon as possible. This will allow the SR coordinator to manually add the read to the round statistics. Notification must be posted before the end of the month in which the project leaves the SR pool to be recorded.
Search the text of each report for [** comments. You will need to consider any issues mentioned in the [** notes and deal with them as you would with any proofreaders' or formatters' notes when Post-Processing. Some genuine scannos or missing words may need to be corrected. Some 'errors' may be an author or editor mistake which you may want to address with a Transcriber's Note (TN). You do not have to act on every Smooth Reader suggestion. Use your best judgment. Make sure any corrections that you do use are made in both the text and HTML formats.
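Most editors can search for [** directly. If you have several reports to go through, a short script can list every marked line with its line number. This is a minimal sketch; the function name is illustrative.

```python
def find_sr_comments(report_text: str):
    """Return (line_number, line) pairs for every line containing a '[**' note."""
    results = []
    for number, line in enumerate(report_text.splitlines(), start=1):
        if "[**" in line:
            results.append((number, line.strip()))
    return results


if __name__ == "__main__":
    sample = "The cat sat.\nIt was teh[**the] end.\n"
    for number, line in find_sr_comments(sample):
        print(f"line {number}: {line}")
```

The line numbers refer to the Smooth Reader's returned file, which may differ slightly from your working copy, so search for the surrounding words when applying a correction.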
Can I get non-DPers involved?
In order to be able to upload an SR report to the project page, a Smooth Reader needs a DP account. If you know someone you think might be interested in Smooth Reading, please feel free to offer to act as their liaison with DP, and if they like it, encourage them to register an account!
Contacting Smooth Readers
Smooth Readers often are interested in learning what changes the Post-Processor makes based on the feedback they receive; why some things were changed and some not. It's not an official requirement to exchange Private Messages with Smooth Readers, but letting them know their efforts are useful and appreciated may also encourage them to read future projects you submit to SR.
Creating an HTML version
The HTML version of your e-book is very important. Project Gutenberg generates the e-book epub and mobi formats of your book from the HTML version. Although many people download the e-book formats to their e-readers, they usually depend on the HTML version when they want to preview or read the book online. Researchers also may use the comments stored within the HTML version to determine hidden information such as non-displayed page numbers. It's also possible for information such as headings, language, etc. to be scraped from the HTML version for other uses.
Post-processors who use the "Plain Text to HTML" process will start work on the HTML version once they have submitted their completed plain text version to Smooth Reading. If you are using a tool such as ppgen which generates both the text and HTML versions at the same time from a source file, you will still need to do all the main checks on the resulting HTML output, so you should definitely review the following information.
In working on the HTML version, it's a good idea to use a font such as DP Sans Mono that helps you to differentiate between the numeral "1" and the lower-case letter "l", etc., which is useful as you check your project for errors.
To start converting a plain text file to the HTML version, go back to your marked-up copy of the book that you had saved prior to working on the plain text version and save it again with a new file name such as "name.html" or "name.htm" (some operating systems don't allow more than three characters after the period). Make sure you keep a version of the marked-up file for backup and reference.
Code line lengths
As you prepare your HTML version, please keep your code line lengths reasonable so that they will be easier for others to review and update if necessary.
Projects that are part of a series
If you are working on a project that is part of a series or Uberproject or is a periodical, you should use the style guide if one is defined for you in the project comments or in the UberProjects forum. You can also check recent uploads to Project Gutenberg that have been done for the projects in that series.
Page numbering and page rejoining
Should you include page numbers?
Post-processors vary in whether or not they decide to display page numbers on the ebook pages. Some decide to always include them, others include them only for books that include multiple page number references, and still others never include page numbers. It's up to you as Post-processor to decide. Regardless, please do not remove specific references to page numbers in the table of contents, indices, etc.
Even if you do not intend to include page numbering in your completed book, it is good practice to include the book's page numbers as comments within the HTML text so that anyone wanting to quote formally from the book will have a way to check what the page number was. If you do that, please add a note to the Transcriber's Note to say you have done that.
Checking page numbers
You should check the page numbering throughout the book and record which book page number corresponds to each DP-numbered page (they are usually different) and which pages should use roman numerals, etc.
Very often the book page numbering skips blank pages and full-page illustrations, etc. and often numbering does not start until after the table of contents and may involve use of roman numerals before the start of the first chapter. Consequently, you will need to account for the page numbering starts and skips when you handle the page numbering for the book.
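One way to keep such a record is a simple table from DP page position to printed page label. The sketch below is only an illustration under stated assumptions: front matter is numbered in lower-case roman numerals, arabic numbering restarts at 1, and the listed unnumbered pages (blank pages, plates) do not consume a number. Real books vary, so always verify against the page scans.

```python
def to_roman(number: int) -> str:
    """Convert a positive integer to a lower-case roman numeral."""
    numerals = [
        (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
        (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
        (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
    ]
    result = ""
    for value, symbol in numerals:
        while number >= value:
            result += symbol
            number -= value
    return result


def label_pages(scan_count, first_arabic_scan, unnumbered=frozenset()):
    """Map 1-based DP page positions to printed page labels (None if unnumbered)."""
    labels = {}
    roman = arabic = 0
    for scan in range(1, scan_count + 1):
        if scan in unnumbered:
            labels[scan] = None          # assumption: does not consume a page number
        elif scan < first_arabic_scan:
            roman += 1
            labels[scan] = to_roman(roman)
        else:
            arabic += 1
            labels[scan] = str(arabic)
    return labels
```

For a six-page sample where the second scan is an unnumbered plate and arabic numbering begins on the fourth scan, label_pages(6, 4, {2}) gives i, None, ii, 1, 2, 3.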
Some post-processing tools such as Guiguts do a good job of handling the numbering and in removing the page separators.
When you remove your page separators (when you do this will depend on what process you're using to process your book), it is important that you have gone through the text and checked both sides of the separators to see if the next page requires a blank line, is a section or chapter, or needs to be continuous text. You should also have rejoined words split across pages. Please see the Do a first-pass check section of this document for more information.
Once your HTML version is mostly completed, you should check that page numbering (either visible or commented) is correct and that all table of contents, lists of illustrations, indices, and inter-page references display and link to the correct pages.
Please also check to ensure that your links clearly go to the right location. For example, readers who click on a link from a table of contents to a chapter head or from a list of illustrations to an image, should then see the chapter heading or image itself, rather than just the first line of text after that heading or image.
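Part of this check can be automated. The sketch below, which uses only the Python standard library, lists internal links whose target does not exist anywhere in the file. It cannot tell you whether a link lands on the heading or image itself rather than on the following text; that still needs a visual check.

```python
from html.parser import HTMLParser


class _AnchorCollector(HTMLParser):
    """Collect every id (or <a name>) and every internal link target."""

    def __init__(self):
        super().__init__()
        self.ids = set()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if attributes.get("id"):
            self.ids.add(attributes["id"])
        if tag == "a" and attributes.get("name"):
            self.ids.add(attributes["name"])  # older markup may use <a name="...">
        href = attributes.get("href")
        if tag == "a" and href and href.startswith("#") and len(href) > 1:
            self.targets.append(href[1:])


def find_broken_internal_links(html_text):
    """Return the sorted list of link targets that have no matching id or name."""
    collector = _AnchorCollector()
    collector.feed(html_text)
    return sorted({target for target in collector.targets if target not in collector.ids})
```

An empty result means every "#..." link has somewhere to go, not that it goes to the right place.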
Converting to HTML
There are numerous ways to convert a project to HTML. Some people use various search and replace routines. Others use a tool such as Guiguts. If you use Guiguts, please remember to clean up the HTML and CSS stylesheets that can render poorly. There is also PG2HTML which will generate a very basic HTML version for you to work with. Some people also use ppgen to generate both the plain text and HTML versions of a book at the same time.
Many people have learned HTML for the first time at Distributed Proofreaders as part of their post-processing, and it doesn't have to be terribly difficult! Even if you are using a product such as ppgen to generate your HTML, it is important to know something about HTML — many ppgeners add snippets of HTML to their ppgen source files when they work on more complex books. Also, as mentioned above, Guiguts-converted HTML requires considerable tweaking to the HTML.
The DP HTML Best Practices pages are a good source of information about best practices for HTMLing at DP. There are also several useful links to HTMLing information on the Post-Processing section of the DP wiki's main page.
What HTML version should you use?
Your HTML should validate as XHTML 1.0 Strict or 1.1 (HTML checks should generate no error or warning messages).
What CSS version should you use?
Note: Since August 2017, Project Gutenberg has accepted CSS 3 provided that the code has been marked as "completed work" and has "REC" (for Recommendation) status according to the W3 specifications.
If you use any CSS 3 elements (including use of the transparent element for dropcaps), please add a note to the PPVer concerning what CSS 3 has been included and why. (If you have Direct Upload, please include a similar note for the Whitewashers when you upload your book.) The acceptance of some CSS 3 by Project Gutenberg is a very new policy and may be adjusted based on any issues experienced with submissions, so Project Gutenberg has asked DP volunteers to watch https://upload.pglaf.org/ for any changes.
Please make sure your <title> is present and that it is correctly worded (for example, <title>The Project Gutenberg eBook of Alice's Adventures in Wonderland, by Lewis Carroll</title> or <title>Alice's Adventures in Wonderland, by Lewis Carroll—A Project Gutenberg eBook</title>).
For the dash separating A Project Gutenberg eBook from the book title, you may use two hyphens (--), a UTF-8 em dash (—), or an HTML entity for the em dash (&mdash;).
The <html> tag should be placed on its own line. This assists the Whitewashers at Project Gutenberg.
You should make sure that your HTML header includes language (please make sure it's the correct language!) and the version of HTML that you are using. Here is an example:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
Please ensure that your header does not include lines like this:
<?xml version="1.0" encoding="utf-8"?>
Tables should use left, right, and center justification and top and bottom alignment appropriately. You should also make sure your tables use the <th> element for table headings (for more information on table headings, please see the W3schools site).
You should be careful not to use tables for things that aren't actually tables. The HTML <table> element is designed for representing tabular data in an HTML document or for tables of contents, etc. It is not meant to be used for things such as creating a border around content, or centering text.
Because not all systems have every font you might use, please be careful not to set your HTML to use specific font faces. It is fine, though, to designate in your style sheet that certain font families such as serif or sans-serif should be used for certain styles. Our books rarely use monospace fonts; however, should you need to use such a font, please set it as such in your style sheet rather than by using
For font sizing, please use em or percent settings or use "small", "medium", "large", etc., and avoid px sizing units for items other than images or borders. According to W3schools, "Absolute length units are not recommended for use on screen, because screen sizes vary so much."
If the original printed version of your book used a smaller size of font for elements such as blockquotes, it is good practice to set your fonts to smaller for them as well.
Text foreground and background colors
If you are specifying either text foreground color or text background color, please always remember to specify both foreground and background color. By doing this, you will prevent the text from being hidden if the reader's display background or font-color choice happens to match the setting you failed to specify. It's especially common for e-readers to be set to display light text on a dark background for evening reading.
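A rough way to spot rules that set only one of the two colors is to scan the style sheet rule by rule. The sketch below uses simple regular expressions rather than a full CSS parser, and it only looks inside each individual rule, so a color inherited or set elsewhere in the cascade is not taken into account. The function name is illustrative.

```python
import re

_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
_RULE = re.compile(r"([^{}]+)\{([^{}]*)\}")


def rules_missing_color_pair(css_text):
    """Return selectors whose rule sets a text color or a background, but not both."""
    problems = []
    for selector, body in _RULE.findall(_COMMENT.sub("", css_text)):
        names = {
            declaration.split(":", 1)[0].strip().lower()
            for declaration in body.split(";")
            if ":" in declaration
        }
        has_foreground = "color" in names
        has_background = bool(names & {"background-color", "background"})
        if has_foreground != has_background:
            problems.append(selector.strip())
    return problems
```

Each selector reported is a candidate for review, not necessarily an error.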
Please do not use empty tags to indent or separate text. The best way to control the spacing around elements is to use the padding and margin CSS properties.
Title pages and front matter in the HTML version
The book title should be set as an h1 heading, and the rest of the page should be set up carefully using style sheet settings for margin, padding, font size, and bolding.
As with preparing the title page and front matter for the plain text version, you have some leeway. You can adjust the pieces if you like, such as by moving the author's name directly under the "by". Relative indenting is not required, but can be added if you wish.
Please leave all the original information on the title page, including the edition, year of publication and any copyright notice (unless this is a reprint — check with the Project Manager if in doubt). It is better to display as much information as possible than to try to find it once the book has been posted for years.
You may also remove redundant half-titles and reorder front matter.
Please do not use empty HTML tags or blank lines to separate text on this page. For more information about preparing title pages, please read the Title Pages section of the DP HTML Best Practices.
Headings and chapters in the HTML version
Please be careful not to use heading tags such as <h2>, etc. for things that are not headings. The headings within your book should be arranged hierarchically with a single <h1> (the book's title) and appropriate <h3> for the following chapters and subsections, etc.
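The hierarchy rule above can be checked with a short script. This sketch reports a missing or repeated <h1> and any heading that jumps more than one level deeper than the heading before it. Treating such a jump as a problem is this sketch's reading of "arranged hierarchically", not a stated DP rule.

```python
from html.parser import HTMLParser


class _HeadingCollector(HTMLParser):
    """Record the level of each h1-h6 heading in document order."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.levels.append(int(tag[1]))


def check_heading_hierarchy(html_text):
    """Return a list of human-readable problems with the heading structure."""
    collector = _HeadingCollector()
    collector.feed(html_text)
    problems = []
    h1_count = collector.levels.count(1)
    if h1_count != 1:
        problems.append(f"expected exactly one <h1>, found {h1_count}")
    for previous, current in zip(collector.levels, collector.levels[1:]):
        if current > previous + 1:
            problems.append(f"<h{current}> follows <h{previous}>, skipping a level")
    return problems
```

The script cannot tell whether a heading tag was used for something that is not a heading; that still needs a read-through.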
You should set a page break before each new chapter and also use <div class="chapter"> at chapter breaks to enable proper page breaks for e-readers. For more information on the HTML to use for headings, please read the Headings section of our Easy Epub guidelines. If you are using ppgen, it should add these <div> codes if you have entered the chapter and headings properly within the ppgen source file.
Multi-part headings should be encased within the related heading tag. For more information about headings, please read DP HTML Best Practices — Headings.
Your style sheet will allow you to add smallcaps to the text at the start of chapters if you'd like. You'll find some excellent advice about illustrated dropcaps in the DP HTML Best Practices and in the Dropcaps section of the Easy Epub document. The Easy Epub document is especially valuable for explaining how to use @media in your style sheets for dropcaps since the e-reader version of your book that is generated from your HTML has difficulty with dropcaps. There is also help available on dropcaps and illustrated dropcaps in the Post-processing forum.
You should be careful to check your book's page numbering before chapter breaks.
Using a style sheet
If a style attribute or set of attributes is used in several spots throughout your book, it is good practice to include it in your style sheet rather than to define it each time. That way, you can more easily change the way the element looks by simply changing it in the style sheet rather than having to go to each spot you've used it in the book. Use of styles defined by a style sheet can also help make the formatting of your book more consistent.
Use of style sheets also reduces the amount of HTML code you need to use throughout the book. For example, by specifying all the formatting details for a chapter heading in your style sheet, you simply have to specify that certain text is a chapter heading rather than having to provide the details about font-family, weight, vertical spacing, etc. each time you use a chapter heading.
Please give your styles meaningful names related to what they are used for so that it is easy to recognize what style does what (that can be invaluable when you have to troubleshoot what went wrong when your style doesn't do what you planned it to do). Also, it is good to use comments to make your style sheet easier to work with and to edit and troubleshoot. For example, you might consider adding the comment /* Poetry */ in your style sheet to mark where your styles related to poetry are listed.
Greek in the HTML version
Whether your file is saved as UTF-8 or not, the HTML version of your book can use Greek characters either via UTF-8 or HTML entities. Either way, Greek will display for readers if they have a relevant font installed. Some post-processors enclose the Greek in a <span> which uses the transliteration as a "title" attribute so that non-Greek readers can tell how the words would be pronounced.
Preparing illustrations for the HTML version
Where to find the illustrations and cover image for your book
The cover image and illustrations will generally have been included as higher resolution files included in the project zip file that you downloaded when you originally downloaded the book to post-process. Please use these higher resolution files (rather than the low-resolution images included on the proofreading pages) when you prepare the images for the book.
Where to store the images for the HTML version
Please store all the images that will be used in your HTML version inside a folder called "images" within the project directory of your HTML project. References to the images within your HTML will then be to files within the images directory (for example, <img src="images/i_004.jpg" alt="" />).
When you've completed work on the HTML, you should do a final check to make sure that all images are used correctly within each page and that you haven't included any temporary or redundant files (other than thumbnail.db — though it's not a bad idea to remove it too) in the project's images folder that you upload as part of your zip file.
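This final check can be partly scripted by comparing the image references in the HTML with the contents of the images folder. The sketch below treats any src or href beginning with "images/" as a reference; the function name and sample file names are hypothetical.

```python
from html.parser import HTMLParser


class _ImageReferenceCollector(HTMLParser):
    """Collect file names referenced through src or href values under images/."""

    def __init__(self):
        super().__init__()
        self.referenced = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("src", "href") and value and value.startswith("images/"):
                self.referenced.add(value[len("images/"):])


def compare_images(html_text, image_files):
    """Return (missing, unused): referenced files not in the folder, and the reverse."""
    collector = _ImageReferenceCollector()
    collector.feed(html_text)
    present = set(image_files)
    missing = sorted(collector.referenced - present)
    unused = sorted(present - collector.referenced)
    return missing, unused
```

The list of files in the folder could come from os.listdir("images"). A file reported as unused is not always redundant (a cover image may be needed even if nothing in the HTML points to it), so review the list before deleting anything.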
Preparing your images
For the HTML version, you should remove all major blemishes and rotation/distortion problems in your images and crop them appropriately. The Guide to Image Processing provides details on the acceptable dimensions of images and some help in how to prepare them with various tools (you're welcome of course to use your own image preparation tools). If you have problems preparing an image, there is help available in the Illustrators forum thread.
Some pointers for images in your HTML version are available in the Check your Images portion of the Getting your PP Project Ready for PPV document.
If the illustrations are seriously damaged or missing, please contact the Project Manager for replacements. Allow a reasonable amount of time for a response in case the PM is on vacation and, if you don't hear back within a week or two, please contact db-req and describe the problem.
Generally, you should use .png format for line-drawings and .jpg for photographs and images with complex tones and colors. For more information, please refer to JPG vs PNG. If you are working with LaTeX, you may use .svg images.
When you work on your images, please be careful with the number of times you save a .jpg image as you work on it. Any modification to a .jpg causes further degradation to the image, as .jpg is a "lossy" compression algorithm.
Please remember to include captions as text accompanying the image.
If there is a List of Illustrations, please link items to each image appropriately.
For information on where to place illustrations in the text and handling them prior to creating the HTML version, please read the Handle any illustrations and Illustrations in the plain text version sections of this document.
An "alt tag" is the term we usually use for the alt attribute for the HTML "img" tag. You should use alt tags appropriately.
If the image already has a caption, you should include an alt="" tag (this is called an "empty alt tag") since adding alt text there would be redundant. If the image does not have a caption but includes readable text, please include the text in your alt tag. Also, if the image does not have a caption or readable text but you believe that it would be good to describe the image for the reader (such as alt="Photograph of George Miller"), please place that text in your alt tag. If you decide to do this, please explain what you have done in your Transcriber's Note.
If the image is purely decorative, you may use an empty alt tag or place a description of the image in the alt text. Please remember though that if you use an empty alt tag here, and the image does not load, the reader may not realize that the image was supposed to be there.
Please do not place a space within the quotes of an empty alt tag since that may cause problems for some screen-readers.
Very long alt tags should be avoided.
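Some of the alt tag points above lend themselves to an automated check: a missing alt attribute, an "empty" alt that actually contains a space, and very long alt text. The sketch below flags all three. The 150-character threshold is a hypothetical value chosen for illustration, since no specific limit is given.

```python
from html.parser import HTMLParser


class _AltChecker(HTMLParser):
    """Flag <img> tags whose alt attribute is missing, whitespace-only, or very long."""

    def __init__(self, max_alt_length):
        super().__init__()
        self.max_alt_length = max_alt_length
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        source = attributes.get("src", "")
        if "alt" not in attributes:
            self.problems.append((source, "missing alt attribute"))
            return
        alt = attributes["alt"] or ""
        if alt and not alt.strip():
            self.problems.append((source, "alt contains only whitespace"))
        elif len(alt) > self.max_alt_length:
            self.problems.append((source, "alt text is very long"))


def check_alt_attributes(html_text, max_alt_length=150):
    """Return (src, problem) pairs; max_alt_length is an illustrative threshold."""
    checker = _AltChecker(max_alt_length)
    checker.feed(html_text)
    return checker.problems
```

Whether an empty alt is appropriate for a given image (captioned, decorative, or containing readable text) remains a judgment call that the script cannot make.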
There are two ways to include images in your book -- either directly as "inline" images that will display directly on the HTML page or as "linked-to" images that are included in your images directory but are accessible to the reader only by clicking on a link present within your HTML.
Here is a synopsis of the image size standards:
- Inline images: Normal illustrations should be no larger than 256K and up to 5000x5000 pixels in dimension.
- Linked-to images: Images that are targets of an href can be no larger than 1 MB and up to 5000x5000 pixels.
- Covers: Covers should be at least 650x1000 pixels (width x height) and up to 5000x5000 pixels. Larger dimensions are recommended. Cover file size should be no larger than 256K.
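The limits in this synopsis can be expressed as a small check. The sketch below takes the file size and pixel dimensions as plain numbers (you can read them from your image editor or file manager) and assumes that "256K" means 256 x 1024 bytes; the function and constant names are illustrative.

```python
KILOBYTE = 1024  # assumption: "K" in the size limits means 1024 bytes

MAX_BYTES = {
    "inline": 256 * KILOBYTE,
    "linked": 1024 * KILOBYTE,  # 1 MB
    "cover": 256 * KILOBYTE,
}
MAX_SIDE = 5000                            # maximum width or height in pixels
COVER_MIN_WIDTH, COVER_MIN_HEIGHT = 650, 1000


def check_image(kind, size_bytes, width, height):
    """Return a list of limit violations for an image of kind inline, linked, or cover."""
    problems = []
    if size_bytes > MAX_BYTES[kind]:
        problems.append(f"file size {size_bytes} bytes exceeds {MAX_BYTES[kind]} bytes")
    if width > MAX_SIDE or height > MAX_SIDE:
        problems.append(f"dimensions {width}x{height} exceed {MAX_SIDE}x{MAX_SIDE}")
    if kind == "cover" and (width < COVER_MIN_WIDTH or height < COVER_MIN_HEIGHT):
        problems.append(
            f"cover {width}x{height} is below the minimum {COVER_MIN_WIDTH}x{COVER_MIN_HEIGHT}"
        )
    return problems
```

Passing the limits does not mean an image is sensibly sized for its place in the text; that still needs checking in the HTML and e-reader formats.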
Inline images should be no larger than 256 KB and may have dimensions up to 5000 x 5000 pixels. (eBookmaker, the Project Gutenberg epub/mobi generator, will downsize any image that is larger than the above limits in ways that are beyond our control, resulting in an unpredictable image quality.)
Inline images should be sized sensibly. They should be no larger in file size than necessary for the clear display of the image. In addition, their dimensions should fit reasonably with the text, with large images relatively large and small images relatively small. Please check your images for quality and size in the HTML and e-reader formats.
It is also important to check that you haven't inadvertently distorted your images by inappropriate use of HTML sizing code for the image. Such sizing, if specified in your HTML code, should be in em or percentage units rather than in pixels.
Linked-to images may be used for displaying images such as large complex maps that must be larger in size than an inline image.
When using a linked-to image, you should link the larger image from a normally-sized inline image. Images that are targets of an href should be no larger than 1 MB in size and may have dimensions of up to 5000x5000 pixels. For accessibility, the anchor tag for linked-to images should have a description of the image in the title attribute, and an id attribute to facilitate user navigation.
You must include a cover image called cover.jpg (or, much more rarely, cover.png, if the image is appropriate as a "png") with your project. Cover images should be at least 650 x 1000 pixels (width x height) and may have dimensions up to 5000x5000 pixels. Larger dimensions are recommended. Cover file size should be no larger than 256 KB. If you specify image dimensions in your HTML code, please use em or percentage units rather than pixels.
As with regular illustrations, you should remove all major blemishes and rotation/distortion problems from the cover image. You should also remove any library code tags. Also, if your project has no cover, please do not include a cover from a different version of the book or created from stock images.
For more information about cover images and DP policy, please read the PP Guide to Cover Pages. If you would like to have another volunteer create a cover for you, simply post on the We've Got You Covered team area of the forums. You'll be asked to provide images from the book if they're available or possibly a copy of the title page.
Footnotes in the HTML version
You should number the footnotes in the HTML version in the same way you did in the plain text version.
For the HTML version, footnotes should be moved to the end of the chapter or section or to the end of the book.
Each footnote must be hyperlinked to the anchor tag to which it refers. Most post-processing tools will do this automatically for you. For information on how to do this, please refer to the tutorials, guides, or manual for whatever software you are using to prepare the HTML version. If you are doing the work manually you may have to use some complex search-and-replaces.
Sometimes several tags reference the same footnote. For example, you might have 18 anchors called "1" in the text, all referring to a single footnote. This is not a problem; just make sure that all anchors point to the right footnote.
Sidenotes in the HTML version
Sidenote placement is easier in the HTML version in that it is relatively easy to get them closer to their referents using style sheet settings. However, you should check your final product in several different browsers and at different browser window sizes, as it is very easy to get sidenotes that overlap other text.
In HTML, it is probably best to float sidenotes off to the side. You can choose whether to put them in the margin so that they don't interrupt the flow of the text, or whether you want them to stick into the text which will then flow around them. Many post-processors prefer to follow the original look of the book as much as they can. If you want sidenotes in the margin, you'll probably want to use a larger margin than usual for your book in order to make room for them. Your HTML version and the related e-reader versions may be easier to read if you put all your sidenotes on the same side.
Indices in the HTML version
As in the plain text version, you should keep the page numbers in the index. Each page number should link to the page or words to which it refers. For advice on how to do that, please read the TOC and Index section of the CSS Cookbook.
If you want to do your index linking semi-automatically you could ask in the Regular Expression Clinic. For further help or formatting queries, you can also try the Junkies, Index team. If you do use regular expressions to create your index links, please remember to save a copy of your project before doing so, and check the result right afterwards to make sure you haven't automatically made some other changes that may be hard to correct.
I know it can be boring, but you should manually check each of the links in your index once you complete your book.
External links (including embedded fonts) are not permitted by Project Gutenberg except for:
- the standard link to pgdp.net that is included in all our books, and
- links to other books in a series, within the Project Gutenberg collection (usually multi-volume works with a common index or other inter-volume references).
Links to original page scans should not be in the book (Project Gutenberg hopes to add those to the catalog database at some point).
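One way to look for stray external links is to list every `href` or `src` value that names a host. This is a rough Python sketch under stated assumptions: the function name `find_disallowed_links` and the `ALLOWED_HOSTS` tuple are illustrative choices, not part of any DP or Project Gutenberg tool, and fonts pulled in from inside the CSS (for example with `@import` or `url(...)`) are not detected.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumption: hosts treated as acceptable for this illustration only.
ALLOWED_HOSTS = ("pgdp.net", "gutenberg.org")


class _ExternalLinkCollector(HTMLParser):
    """Record every href/src attribute whose URL names a host."""

    def __init__(self):
        super().__init__()
        self.external = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                host = urlparse(value).hostname
                if host is not None:  # relative paths and #fragments have no host
                    self.external.append((tag, value, host))


def find_disallowed_links(html_text, allowed_hosts=ALLOWED_HOSTS):
    """Return (tag, url) pairs that point outside the allowed hosts."""
    collector = _ExternalLinkCollector()
    collector.feed(html_text)

    def is_allowed(host):
        return any(host == h or host.endswith("." + h) for h in allowed_hosts)

    return [(tag, url) for tag, url, host in collector.external
            if not is_allowed(host)]
```

Anything the function returns still needs a human decision, for example whether a link to another Project Gutenberg book really belongs to the same series.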
to any dashes (for example, Mr. ----) etc. or other elements that should not break up across lines or pages.
Horizontal rules sometimes do not translate correctly in the ebook version that is generated from your HTML. Therefore it is important that you center the rule and set the rule's margins appropriately. For information on how to do that, please read the HRs section of the Easy Epub document.
Use @media in your CSS for different media types
It is important for your book to display well in the e-reader versions that Project Gutenberg generates from your HTML. Most of your HTML will convert seamlessly to e-reader format, but some elements such as illustrated drop caps do not. In such cases, you should use @media declarations to specify how style elements should be used on various platforms.
Updating your Transcriber's Note
Once you have mostly completed preparing the HTML version of your project, you should update the Transcriber's Note to include anything specific to the HTML version and remove anything that is specific to the text version. You should also add (and check) the links there.
For more information about Transcriber's Notes, please check the Create a Transcriber's Note section of this document.
Checking your formatting, HTML and style sheet
Once you have mostly finished your HTML work, you should check through the book to ensure that special elements such as correspondence, tables, poems, sidenotes and footnotes etc. are correctly formatted and appear correctly regardless of the width of the browser window. This work often involves tweaks to the HTML and style sheets. You should also check your work using an HTML checker such as HTML Tidy and address any issues it finds (it's OK though to leave your tables without a summary even if Tidy comments on it). Please remove from your style sheet any styles that are not used in your text.
If you are using Guiguts to generate your HTML, you should remove /*<![CDATA[ XML blockout */ and /* XML end ]]>*/ from around the CSS. This isn't an error but it is redundant.
Smooth Reading reviews
Once you receive the smooth reading reviews from the smooth readers, you should implement any necessary changes in both the HTML and plain text versions and update your Transcriber's Note comments, etc.
Final HTML Checks
In addition to the checks already discussed in the Creating an HTML version section of this document, there are several final checks you should do.
Removing Byte Order Marks
Some systems and text editing programs add a special code called a Byte Order Mark (BOM) to the start of UTF-8 documents. To find out if you are somehow adding a BOM to your books, you can use the W3C Internationalization Checker online tool.
If you find that you have a BOM in your plain text or HTML versions, please remove that code before submitting the project for PPV review or to Project Gutenberg. For more information about BOMs and removing them, please read our Byte Order Mark wiki page. The W3 site also has useful information about BOMs. We realize however that submitters use different toolsets and it's not always easy to know whether a BOM is included or to remove it if it is. Therefore, please don't panic if you are not sure: there is automation in place at Project Gutenberg to ensure errant BOMs are not included in the final release.
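For those comfortable with a little scripting, the UTF-8 BOM is simply the three bytes EF BB BF at the very start of the file, so it can be detected and removed locally. The sketch below is one possible approach in Python; the function names are hypothetical and it is not an official DP tool. Please work on a copy of your file.

```python
import codecs
from pathlib import Path


def has_utf8_bom(data: bytes) -> bool:
    """True if the byte string starts with the UTF-8 Byte Order Mark."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 BOM (unchanged if none)."""
    return data[len(codecs.BOM_UTF8):] if has_utf8_bom(data) else data


def remove_bom_from_file(path) -> bool:
    """Rewrite the file without its BOM. Returns True if a BOM was removed."""
    path = Path(path)
    data = path.read_bytes()
    if not has_utf8_bom(data):
        return False
    path.write_bytes(strip_utf8_bom(data))
    return True
```

Calling `remove_bom_from_file("projectname.txt")` would return `True` only when the file actually started with a BOM, so it is safe to run on files that are already clean.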
Before doing your style sheet (CSS) check, you should remove /*<![CDATA[ XML blockout */ and /* XML end ]]>*/ from around the CSS code plus any redundant CSS code. There are several tools that will help you locate CSS redundancies including tools mentioned in the Useful Tools topic in the forums.
You should also check that the CSS validates as CSS 2.1 or below (CSS checks should generate no error or warning messages other than for use of the "transparent" element for dropcaps). Please use the CSS checker at W3 to do the verification.
Note: Since August 2017, Project Gutenberg accepts CSS 3 provided that code has been marked as "completed work," and has "REC" (for Recommendation) status according to the W3 specifications. As with illustrated dropcaps, however, it is necessary to add a note to the PPVer (or the Whitewasher if you have Direct Upload access) concerning what CSS3 has been included. (This is a very new policy and may be adjusted based on any issues experienced with submissions, so Project Gutenberg has asked DP volunteers to watch their site for changes.)
It is a good idea to take a look at your HTML with CSS turned off (many browsers have features that permit this) just to be sure that the book is at least readable even if CSS is not active.
You should use the W3 HTML validator to check that your HTML validates as XHTML 1.0 Strict or 1.1. Please correct any errors or warnings (other than table "summary") reported by the validator.
Other Final Checks for the HTML version
- Make sure all your links work. It is a good idea to manually check internal links such as footnotes, Tables of Contents, Lists of Illustrations, indices, and the Transcriber's Note. However, tools such as PPhtml (a Post-processing Workbench tool that checks the links and other issues in an uploaded zip file) or the W3 linkchecker should be used to check links as well.
- It is good to run your HTML through the Post-processing Workbench PPhtml tool. This tool runs a variety of checks and can yield quite useful results.
- Ensure that there are no redundant images in your images folder and that your images look OK in the HTML. For more information about image preparation and requirements, please read the Preparing your images section of this document.
- Check that your <title> is present and that it's correctly worded and spelled. For more information about HTML Titles, please read the HTML title section of this document.
- You should check that your Transcriber's Note is properly formatted and uses correct grammar and spelling.
- Please check your final HTML in more than one type of browser to ensure that you have not included browser-specific coding that could appear strangely in another browser. As you scroll through the book in the browser, please look carefully for anything strange such as overlapping sidenotes or page numbers, or missing or distorted images.
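The check for redundant images in the list above can be partly automated. The following Python sketch is an illustration only: `find_unused_images` is a hypothetical helper, it assumes images are referenced through `src` or `href` attributes beginning with `images/`, and it will not notice images referenced only from the CSS.

```python
from html.parser import HTMLParser


class _ImageRefCollector(HTMLParser):
    """Collect file names referenced under the images directory."""

    def __init__(self, images_dir):
        super().__init__()
        self.prefix = images_dir.rstrip("/") + "/"
        self.referenced = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("src", "href") and value and value.startswith(self.prefix):
                self.referenced.add(value[len(self.prefix):])


def find_unused_images(html_text, image_filenames, images_dir="images"):
    """Return the sorted file names that the HTML never references."""
    collector = _ImageRefCollector(images_dir)
    collector.feed(html_text)
    return sorted(set(image_filenames) - collector.referenced)
```

In practice `image_filenames` could come from `os.listdir("images")`, and every name returned should be looked at by hand before anything is deleted.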
And finally, once you are happy with your HTML version of the book, it's time to check how it appears when converted to e-reader format. Of course, if you make changes to the HTML to correct problems in the e-reader version, please remember to re-check the HTML version again once you're done.
Checking E-reader versions
Project Gutenberg expects our HTML to convert seamlessly to the epub and mobi (Kindle) formats. Before uploading to a PPVer or to Project Gutenberg, please use PG's ebookmaker converter to convert your book's HTML and review the result carefully.
When using ebookmaker, please avoid placing non-ASCII characters such as UTF-8 em-dashes (—) in the ebookmaker Title, Author, Encoding or E-book Number entry fields. Otherwise there are usually problems with the file conversion.
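If you are unsure whether a value you plan to type into one of those fields is pure ASCII, a quick test like the following Python sketch can help. The helper name `find_non_ascii` and the sample field values are invented for illustration; nothing here is part of ebookmaker itself.

```python
def find_non_ascii(value):
    """Return (position, character) pairs for each non-ASCII character."""
    return [(index, char) for index, char in enumerate(value)
            if not char.isascii()]


# Hypothetical field values; "\u2014" is the em-dash character.
fields = {
    "Title": "Poems\u2014A Selection",
    "Author": "A. N. Author",
}

for field_name, value in fields.items():
    for position, char in find_non_ascii(value):
        print(f"{field_name}: character {position} is U+{ord(char):04X}")
```

Any character reported this way would need to be replaced with an ASCII equivalent, such as two hyphens for a dash, before running the conversion.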
If you follow the guidelines presented in the Creating an HTML version section of this document, you'll usually find that your book will convert relatively easily to e-reader format.
The Easy Epub wiki pages provide very useful information on how to ready your project for e-readers. The wiki pages include information on how to view your project in e-reader format even if you don't have an e-reader. Please use a suggested viewer to test the epub and mobi versions of the book.
Once you've converted your HTML to e-reader format, you should page through it in your viewers to see if there are any issues to correct. Some common problem areas are:
Front and End of Book
- Title page layout
Body of Book
- Horizontal rules
- Obscured sections within the book such that text covers other text or blank areas occur where text should be
- If hovers were used in the HTML, all important "hovered" information should be present and readable in a non-hovered way within the e-reader version. Also, Transcriber's Note references to hovers should not display in the e-reader version.
- Page numbers (if present)
Checking that you haven't introduced errors into the book
In converting to plain text and HTML, post-processors sometimes inadvertently introduce errors to the text such as deleting or adding words or even paragraphs. Consequently it's a good idea to run a final check to compare the plain text and HTML with the original proofread text. Tools such as the Post-processing Workbench PPtext tool are useful for this.
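To illustrate the idea of such a comparison (this is not how PPtext works, just a minimal sketch), Python's standard `difflib` module can report words that were added, dropped or changed between two versions. The function name `word_level_changes` is hypothetical, and the sketch assumes both inputs are plain text with any proofing markup already removed.

```python
import difflib
import re


def word_level_changes(original_text, final_text):
    """List (operation, original words, final words) for every difference.

    Line breaks, spacing and punctuation are ignored, so rewrapping the
    text does not show up as a change.
    """
    original_words = re.findall(r"\w+", original_text)
    final_words = re.findall(r"\w+", final_text)
    matcher = difflib.SequenceMatcher(None, original_words, final_words,
                                      autojunk=False)
    changes = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            changes.append((op,
                            " ".join(original_words[i1:i2]),
                            " ".join(final_words[j1:j2])))
    return changes


print(word_level_changes("The quick brown\nfox jumps",
                         "The quick fox jumps over"))
# prints [('delete', 'brown', ''), ('insert', '', 'over')]
```

Intentional changes, such as corrections listed in the Transcriber's Note, will appear in the output too, so each reported difference needs to be judged individually.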
I've finished preparing my book — now what...?
Once you have done all your checks and have reviewed the Getting your PP Project Ready for PPV document to see if there's anything you might have missed, you are ready to submit your project.
Uploading for verification (PPV)
If you are relatively new to post-processing, you will need to upload your project to the Post-Processing Verification (PPV) pool. From there, an experienced post-processor known as a Post-Processing Verifier (PPVer) will select it. This person will carefully go over your work making sure that all of the requirements have been met, i.e., spellcheck has been done, images are correctly sized and formatted, it passes Gutcheck/PPtext, the HTML is valid, etc. For information on what the PPVer looks at, you can read the Post-Processing Verification Guidelines.
Sometimes a PPVer will return a project to you for further work. This does not mean you are not a good post-processor — it probably just means that you missed a step or two of the process. A private message or e-mail will accompany a return explaining why and what steps you can take to repair your file and usually offering assistance or suggesting where assistance can be obtained.
Once you have completed the recommended changes, go back to the Post-Processing Files section on the Project Page, click on the button labeled "Return to your current PPVer for further checking", and upload your updated zip.
Once the book passes all PPV checks, the PPVer will submit it to Project Gutenberg for posting.
Once you meet the requirements to obtain Direct Upload access, you will be able to upload your project to Project Gutenberg yourself. However, even when you do have Direct upload capability, you are welcome to submit a project to PPV for review should you need a second opinion. (If you do that, however, please mention in the project's upload comments that you have Direct Upload capability and your reason for submitting it to PPV).
Preparing your zip file for PPV
For PPV, you will need to create a zip file of the work you have done. To prepare this, you should locate the files and directories you will be uploading. It is often a good idea to create a new "clean" directory into which you move everything that you will be sending for verification, including the images directory and its contents.
Here are the files and directories you'll need:
- projectname.txt.bin (if you have it from Guiguts — this is helpful for post-processing verifiers who also use Guiguts. They won't upload this file to Project Gutenberg.)
- projectname.html.bin (if you have it from Guiguts)
- images folder and its entire contents (Any illustrations should already be stored inside this "images" folder) — Please check there are no extra images or files other than the images used in the books or the thumb.db file (though you're welcome to remove the thumb.db file)
- any other files that are part of the project (such as midi or mp3 files in a "music" directory)
- Please be sure not to include the page scans in your zip file.
Please keep the filenames short (but at least four letters long), with only letters, numbers, hyphens, and/or underscores; no spaces or special characters such as $, etc. Filenames and directory names must all be lower case (this is for DP standardization purposes, since our servers are case-sensitive and the Windows systems many volunteers use are not). Please use filenames that correctly reflect the contents. This is helpful both to the PPVer and the Project Gutenberg production team (whitewashers).
Depending on your zip software, you may have to adjust its settings to "Save Relative Paths". This prevents the PPVer from getting extra (undesired) folders on their computers. If you are using a Mac, you may need to "omit Finder files" too in order to omit invisible files.
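If you prefer to script this step, the naming rules and the relative-path requirement can both be handled with Python's standard library. The sketch below makes several assumptions that are not DP policy: the helper names `invalid_names` and `build_zip` are made up, "at least four letters long" is interpreted as the part of the name before the first dot, and the directory passed in is taken to be your "clean" directory.

```python
import re
import zipfile
from pathlib import Path

# Assumption: four or more lower-case letters, digits, hyphens or underscores,
# optionally followed by extensions such as ".txt" or ".txt.bin".
_NAME_RE = re.compile(r"^[a-z0-9_-]{4,}(\.[a-z0-9]+)*$")


def invalid_names(root):
    """Return relative paths of files and folders that break the naming rules."""
    root = Path(root)
    return [path.relative_to(root).as_posix()
            for path in sorted(root.rglob("*"))
            if not _NAME_RE.match(path.name)]


def build_zip(root, zip_path):
    """Zip everything under root, storing paths relative to root."""
    root = Path(root)
    zip_path = Path(zip_path)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(root.rglob("*")):
            # Skip folders, and the archive itself if it lives inside root.
            if path.is_file() and path.resolve() != zip_path.resolve():
                archive.write(path, arcname=path.relative_to(root).as_posix())
```

For example, `invalid_names("clean")` would list a file such as `images/Bad Name.JPG`, and `build_zip("clean", "projectname.zip")` would store `images/cover.jpg` rather than a full path from your own computer.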
Uploading for PPV review
- Go to the Project Page for your book, and select "Upload for Verification" from the buttons at the bottom of the page.
- The "Upload post-processed file for verification" page will appear.
- Use the "Browse..." button to select your zip file.
- Add any comments for the PPVer. Here you can include information about any checks you've done or point out special features of the work which the PPVer should be alert to. If you want the PPVer to write you directly at a specific e-mail address rather than using Private Messaging, please provide that e-mail address. Also, if you want your name to be different in the credits from what it is in your Site Preferences, please mention that in the comments.
- When you are ready, click the "Upload file" button to submit your project.
- Check your site preferences to make certain that the Name and Credits Wanted section in the General tab are correct. They determine whether and how your name is listed in the credits section of the book at Project Gutenberg.
- If you are re-uploading after having made recommended changes, the button to upload will be labeled "Return to your current PPVer for further checking"
Once you have submitted your project, it will enter the PPV pool from which the PPVers select projects to review.
When your project is reviewed, the PPVer will contact you to let you know of any corrections you should make to the project. He or she will also provide advice and guidance regarding best practices. In some cases, the project may go back-and-forth a few times between you, but once it's ready, the PPVer will upload the project to Project Gutenberg.
If you have not already received feedback from your PPVer, you can expect feedback after your project is posted to Project Gutenberg. This feedback will tell you the great things that you did along with any suggestions for improvements in future work. For information on what to do if you have received no feedback, please read the section "If you don't hear from your PPVer" below.
If you don't hear from your PPVer
If you have received no feedback and your project has posted, please contact your PPVer (if you don't know who your PPVer was, you can check your book's Project Page for the username). Also, if a PPVer has had your project for a very long time and still hasn't contacted you or posted the project, it is a good idea to send a polite note to the PPVer to ask about the delay.
Rarely, the PPVer to whom you return a project may be unavailable to finish a project when you return it to them. If, after a reasonable time, you have received no feedback from your PPVer, and the project is still in their PPV queue, it's reasonable to attempt to contact them.
In both cases, if you are unable to reach your PPVer, please contact the PPV Coordinator at ppv-coord at pgdp.net.
Uploading to Project Gutenberg yourself
If you have been granted Direct Upload (DU) access, you may upload your project directly to Project Gutenberg.
When you obtained DU, a Post-processing Verification Coordinator will have contacted you with detailed information on how to upload your project. You should also review the Guide to Direct Uploading (DU) and Posting to PG.
Before uploading, please ensure that your project has finished Smooth Reading and is no longer sitting in that pool.
When uploading, please preview your upload before hitting "Submit" in order to trigger the extensive error checking, encoding detection and validation within the upload program. From Preview, you will see a more complete analysis of your upload, allowing you to make adjustments if appropriate before submitting.
If you are uploading to Project Gutenberg yourself, please do not upload any Guiguts .bin files as part of your zip. Otherwise the zip will be basically the same as the one you used to send to the PPVers.
With respect to the .txt file you submit to PG, only one .txt format need now be submitted. Since DP is now using UTF-8, please post in UTF-8 format. PG will now accept UTF-8 text format even if the same text could be represented fully as ASCII, Latin-1 (ISO-8859-1) or CP-1252 (Windows-1252).
Also, please remember to use filenames that correctly reflect the contents. This is helpful to the Project Gutenberg production team (whitewashers).
How to find out when your project is posted
If you would like email notification when your book is posted, please check-mark the "Project posted to Project Gutenberg" box in the Events section of the Project Page.
What happens once my book is uploaded to Project Gutenberg?
Once your book is uploaded to Project Gutenberg by the PPVer or by you (if you have Direct Upload access), a friendly Project Gutenberg Whitewasher (WWer) will do a final check of the book and, if everything is fine, add the Project Gutenberg boilerplate of names and legal information. Sometimes a WWer will have a question for you (that question may come through your PPVer if the PPVer was the uploader).
Once your book has passed the Whitewasher checks, your project is then posted on Project Gutenberg for the world to enjoy! Congratulations!
Help! I've got a problem with ...
Missing or problem images or pages
Sometimes the content provider accidentally skips one or more pages when scanning or, although the page scan is present, part or all of it is unreadable. Project Managers (PMs) usually check through their books for such problems after they've prepared them for Distributed Proofreader work, but please don't rely on this — check for yourself, too.
If you find a problem, first, attempt to contact the Project Manager to see if you can get a better scan or a scan of the missing page. If the PM is for some reason unable to get a good scan for you, please contact db-req. In your note to them, please include the title and projectID for the book and describe the problem. You should include scan image file names (such as 01.png) as necessary. Please wait to hear from the PM or db-req before uploading your project to PPV or Project Gutenberg.
A problem after the project has posted!
Don't Panic! Everyone who post-processes has done this. If they haven't, they will eventually.
If the book was posted to Project Gutenberg quite recently (in the last week or two) and you do not yet have Direct Upload capability, please inform your PPVer of the problem. He or she will pass on the information to the Whitewasher (WWer) who archived your book and can most easily fix it. If you have Direct Upload access, however, please simply contact the WWer.
If your book has been posted for more than a week or two, you can send an errata note to Project Gutenberg giving the error information. In the note, please state that you are the Post-Processor of the book and include the PG text number, title, and author and a clear description of the problem and how to fix it. If you've checked the problem against the page images, please mention that too.
What's different about ...
Periodicals and Uberprojects
Some projects are part of a larger series of publications and require a common look and feel. The forum area for Uberprojects lists the forum topic areas related to many of the large multi-volume projects that are likely to be seen on Distributed Proofreaders. Periodicals are often included as Uberprojects since they usually need a standard style for the text and HTML versions to ensure a consistent look for the whole series.
Many people are put off proofreading, formatting, or post-processing periodicals because they are perceived as difficult. However, canny post-processors realize that mastering a periodical's style gives them access to many entertaining projects with little competition. A periodical's Style Guide puts an end to those hours spent mulling over whether a heading should be marked with
Although periodicals often have longer pages than usual and may have adverts or other unusual formatting issues, post-processors who have become familiar with that periodical's style will find such features relatively easy to handle. The Style Guide information for a specific periodical should be described on the Project Page itself or in the Uberproject thread. If not, or if the explanation is unclear, you can contact the Project Manager or post in its Uberproject forum area. An excellent way to see how to handle aspects of a particular periodical is to view the most recently posted issues on Project Gutenberg. Please make sure you select ones which have also been posted by DP to ensure absolute consistency of style.
You can also find help regarding periodicals in the Proofreading Periodicals team.
Many people are nervous about proofreading, formatting, or post-processing drama because it is perceived as difficult in some way. Sometimes it actually can be quite hard, for example, when the play is written in sixteenth-century English with little attention paid to spelling or grammatical niceties. Most drama, however, is quite straightforward.
For all plays, please check the Formatting Guidelines, and ensure your plain text version is in line with these.
You should format speakers' character names as similarly as possible to the original text. If the text is metrical (written like poetry in which line breaks are significant), please check the /* */ markers carefully before doing any rewrapping (or consider checking through by hand for sections that should rewrap, such as stage directions).
If the project contains unclosed brackets, be aware that checking tools such as Gutcheck or the Post-Processing Workbench PPtext tool will have many false positives.
There are various ways to format plays in HTML. It is a good practice, however, to make the HTML version look as much like the original text as is sensible. Searching Project Gutenberg for recent DP drama postings may give some ideas about how to handle the HTML. You are welcome as always also to post questions in the Post-Processing Forum. There is also information on how to handle Drama style sheets in the CSS Cookbook.
The Plays The Thing team can also offer help and advice about both the HTML and plain text versions.
Some Distributed Proofreader books are entirely about music or include sections of music or a short tune for a song sung in the text. A simple way to post-process a book containing music is to include all scores as illustrations in the HTML. However, much more value can be added to the book if you transcribe the music into a common notation format. This has three great advantages:
- an audio file such as midi or mp3 for HTML allows the Project Gutenberg reader to listen to the music
- the readers can download and edit the music notation using their preferred music software (something that is especially easy with midi and MusicXML formats).
- a clear musical score (PDF) for HTML can be useful in situations in which a reader may want to have a clean copy of the score to play.
Our wiki music guidelines contain information on how to handle music in both the plain text and HTML versions of a project. They also offer a detailed discussion of the different music transcription programs, such as MuseScore, Finale, Sibelius, Lilypond, and so forth, all of which can produce sound and image files, as well as editable source files. The guidelines also contain information for post-processors about different ways to present music in the HTML, along with a list of sample e-books containing music.
To obtain help with music transcription, simply post a message to the Music team thread or send a private message to one of the volunteer music transcribers listed at the end of the Music Guidelines page.
Languages Other Than English (LOTE)
It is important to be able to read and write a language fluently before post-processing a book entirely in that language. However, in some cases, if there are few or no native speakers available on the site, post-processors may take on a project in a language with which they have little experience, especially if they can find someone who does fluently read the language who can spellcheck and smooth read the book.
Some books have small bits of text in other languages. Even if you are familiar with those languages' modern versions, please remember that the usage and spelling in that language may have been different in the period in which the book was written.
As a Post-Processor, you have a bit of freedom in choosing the best format for your text. For LOTE texts this may sometimes lead to decisions which would be unusual or even plainly incorrect in English. If you make such a decision, you might get lots of Gutcheck or PPtext errors. If you have a good reason for your decision and if you have applied your decision consistently, you can ignore those errors. You might want to mention this decision in your upload notes. For example, for many languages it looks more natural to have spaces around em-dashes; if that's the case, it's perfectly fine to insert them.
You should replace specific English words which appear in the final text with translations of those words in the main language of the text, e.g. "Footnote" becomes "Nota de rodapé" in a Portuguese text.
Assistance with languages other than English
In case of doubt regarding foreign words and phrases, you should check the forums for assistance:
- Help with: Arabic
- Help with: Chinese
- Help with: Cyrillic
- Help with: Danish
- Help with: Dutch
- Help with: English
- Help with: French
- Help with: Gaelic
- Help with: German
- Help with: Greek
- Help with: Hebrew
- Help with: Indonesian
- Help with: Italian
- Help with: Japanese
- Help with: Latin
- Help with: Middle English
- Help with: Norwegian
- Help with: Old Swedish
- Help with: Polish
- Help with: Portuguese
- Help with: Sanskrit
- Help with: Spanish
You can also check the Language Skills List to find whom to ask for help or advice.
Transcriber's Notes for books in languages other than English
If at all possible for modern languages, please provide a Transcriber's Note in the language in which the book was printed. If you need assistance, please consult the Project Manager or our language forum areas.
Some of our books include a great deal of mathematical notation. These projects are usually handled by means of LaTeX, a markup language that is entirely distinct from other types of DP formatting but that is normally confined to projects containing a lot of math. LaTeX is an integrated collection of typesetting software which is oriented toward mathematical and scientific content not easily presentable in plain text or HTML. PPers working on books involving complex mathematical formulas needing LaTeX may use Scalable Vector Graphics (SVG) files.
LaTeX projects submitted to Project Gutenberg must include the LaTeX source as a single file together with any illustrations (in an "images" subdirectory, as for non-LaTeX projects). Projects must be compilable with TeX Live, which the Project Gutenberg Whitewasher will use to generate the uploaded PDF. The MiKTeX distribution for Windows may be assumed equivalent to TeX Live for formatting and post-processing.
The primary repositories of LaTeX information and advice at Distributed Proofreaders are:
Symbols and scripts, non-ASCII characters, non-Latin scripts, and downright weird things
If you are unsure about how to represent a symbol, please post a query with a link to the page image containing the symbol in the Post-Processing Forum. Asking there will net you a varied range of ideas about whether the problematic ink blob is in Unicode or can be improvised using ASCII-art or represented in another fashion. For information about UTF-8 and Post-Processing, please read this wiki article.
These should be included as printed. There are two ways to handle these: you can leave the amendments up to the reader, or you can make corrections in the text, adding a Transcriber's Note to state that you've done so.
See also ...
- DP Official Documentation regarding post-processing
- Getting your PP Project Ready for PPV
- DP Wiki post-processing links
- Post-Processing Workbench
- DP HTML Best Practices
- Post-Processing Forum, which contains a "No Dumb Questions" for PPers thread
- PP Tools
- W3 HTML Validator
- W3 CSS Validator
- Ngram Viewer
- Post-processing Advice
- Guiguts PP Process Checklist
- Miller's PP walkthrough using Guiguts (this is old but has some good information)
- Post-Processing German books
[View Log of Major Changes to this Document]
Version 3.13: last updated 8 June 2020 to update image (including cover) changes to reflect new Project Gutenberg guidelines.
Version 3.12: last updated 6 June 2020 to update information about ligatures in relation to our Unicode update.
Version 3.11: last updated 3 June 2020 to update information about dashes in relation to our Unicode update.
Version 3.10: last updated 22 April 2020 to include updated information on recent changes to Smooth Reading