From DPWiki

1. Initial Setup (1 hr.)

  • Go to Project page
    • Read details and requirements.
    • Bookmark the project URL and note the project ID number.
    • Read the project forum page and note any issues proofers raised.
  • Make a project folder, e.g. C:\dp\projects\bookname
  • Download the text and image files and unpack them in the new folder:
    • Text to bookname.txt.
    • page images (nnn.png) and hi-res illustration scans (imagenn.png) in subfolder pngs
    • empty subfolder images.
  • Create bookname-notes.txt for keeping track of issues, "to-do", notes for WW, etc.
  • (Optional) If you suspect that later proofers made changes for the worse: use "Download Concatenated Text" to get the output of the various rounds, and use a diff tool to check the changes.
  • Look up proofer names; change any that contain spaces, periods, underscores, etc. Make note of those changes in notes.txt.
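The optional round comparison above can also be sketched in Python with the standard difflib module. This is only an illustration: the function name round_diff and the sample strings are made up here, and a dedicated diff tool will be more comfortable for a whole book.

```python
import difflib


def round_diff(earlier, later, earlier_name="P1", later_name="P2"):
    """Return unified-diff lines showing what changed between two rounds."""
    return list(difflib.unified_diff(
        earlier.splitlines(),
        later.splitlines(),
        fromfile=earlier_name,
        tofile=later_name,
        lineterm="",   # no trailing newlines on the header lines
        n=0,           # no context lines, only the changes themselves
    ))


# Hypothetical sample text standing in for two concatenated-text downloads.
p1 = "He went to the be ach.\nIt was warm."
p2 = "He went to the beach.\nIt was warm."
for line in round_diff(p1, p2):
    print(line)
```

In practice you would read the two downloaded files into strings and pass them in; lines starting with "-" come from the earlier round and lines starting with "+" from the later one.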

2. Sequential Inspection of Text (4-20 hr.)

This is the only step in which you will examine the whole ASCII text in sequence; hereafter you navigate with searches. Some post-proofers still read the book carefully, although this is not as crucial as it used to be under the old two-round system. Others skim the text comparing it to the page images and double-checking format.

Either way, be sure to turn on automatic scanno highlighting before starting this pass.

Check for:

  • Proper markup of <i>italic</i> and <b>bold</b>.
    • Watch for punctuation wrongly contained in markups, such as <i>(ibid.</i> or <b>Subtopic.</b>.
    • Watch for upright words within italic passages which might have been missed.
  • Proper markup of Greek and other transliterations (content check later)
  • Block material all marked in some fashion:
    • poetry, misc. tabular in /* */
    • block quotes in /# #/
    • Fix block markups that cross page boundaries now
  • Remove [Blank Page]s, pages containing repeated verbiage, etc.
  • Figures properly in [Illustration: caption]
    • check: caption text agrees with List of Illustrations (if any)
    • consistent spelling, abbreviation, capitalization in captions
  • Fix Footnotes, Illustrations still inside a paragraph.
    • move outside paragraph to next or prior page as appropriate
    • don't worry about duplicate footnote number/symbol now
    • sidenotes handled later
  • Make notes of things that will need attention in the HTML:
    • Author cross-references like "(p. 150)" and "see page 222" that should become links.
    • How the editor laid out special sections such as tables and sidebars.
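If you would rather collect the cross-references mechanically than note them by hand, a small script can list candidates. The sketch below assumes references look like the two examples above ("p. 150", "see page 222"); the pattern and the function name find_page_refs are illustrative and will miss other wordings.

```python
import re

# Matches "p. 150", "pp. 12", "page 222", "pages 30" (case-insensitive).
PAGE_REF = re.compile(r"\b(?:pp?\.|pages?)\s+(\d+)", re.IGNORECASE)


def find_page_refs(text):
    """Return (line_number, matched_text) for each likely page reference."""
    refs = []
    for m in PAGE_REF.finditer(text):
        line_number = text.count("\n", 0, m.start()) + 1
        refs.append((line_number, m.group(0)))
    return refs


sample = "Intro (p. 150)\nand see page 222."
for line_number, ref in find_page_refs(sample):
    print(line_number, ref)
```

The output can be pasted into bookname-notes.txt as a to-do list for the HTML linking step.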

3. Fix Block Markups (15-60 min.)

  • Use the Search menu to step through all /* */ blocks.
    • check for a blank line before and after markup
    • make sure correct type of markup used
    • close-up where broken at page boundaries (in step above; leaving in case you forget)
    • apply specific indent value if desired
    • convert poetry from /*..*/ to /P..P/
    • make sure poetry line numbers are at least two spaces to the right of the line.
  • Use the Search menu to step through all /#..#/ blocks
    • check for a blank line before and after markup
    • make sure correct type of markup used
    • close-up where broken at page boundaries (in step above; leaving in case you forget)
    • check consistent indentation of block text
    • apply specific margin values if desired
  • Use Orphaned Markup dialog to check and correct orphans of each type in turn. Do not omit the lowly parenthesis, often mis-scanned as curly-brace.
  • Search&Replace: text: (?<!/)\*(?!/) (a literal asterisk, but one neither preceded nor followed by a slash), regex; keep clicking "Search" to check all asterisks in document.
    • look for malformed thought-breaks (5 stars)
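The asterisk regex above works unchanged in Python, which can be handy for getting a quick list of every hit before stepping through them in the editor. This is a sketch only; the function name stray_asterisks is made up for the example.

```python
import re

# A literal asterisk that is neither preceded nor followed by a slash,
# so the /* and */ block markers themselves are skipped.
STRAY_STAR = re.compile(r"(?<!/)\*(?!/)")


def stray_asterisks(text):
    """Return (line_number, line) for each line containing a stray asterisk."""
    return [
        (n, line)
        for n, line in enumerate(text.splitlines(), 1)
        if STRAY_STAR.search(line)
    ]


sample = "/*\npoem*\n*/\n* * * * *\nplain"
for n, line in stray_asterisks(sample):
    print(n, line)
```

Each reported line still needs a human decision: it may be a proofer note, a footnote marker, or a thought break typed as stars.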

4. Basic Fixup (10 min.)

  • Save file.
  • Run Fixup with all options checked.
    • If the file changed, save it under a different name and compare it with the previous version to see what changes were made (use a diff program such as WinMerge or CSDiff).
  • Run Remove end of line spaces

5. Format Front Matter (15 min.)

  • Format the title page, preserving as much of the original material as possible. Protect in /*...*/
  • Edit the TOC. Find each matching chapter head; make sure heads are 1:1 with TOC. Protect TOC with /X...X/.
  • If book has illustrations, edit or create List of Illustrations (Note: this is not a requirement). Make sure it is 1:1 with [Illustration] captions. Protect with /*...*/.
  • If book has music, might be nice to create a list like the list of illustrations above.

Regarding /*...*/ vs. /X...X/ for the title page, table of contents, etc.:

These markers indicate how the ToC/ToI will be formatted. In the text version, /X X/ preserves the list exactly as it is, whereas /* */ adds a small indent on the left. That indent tells future users of the document that the section it sets off should not be re-wrapped (other parts of the text may be re-wrapped if they find a different line length more convenient). I usually prefer to have that indent in both ToC and ToI.

In HTML, /X X/ will be replaced with <pre> and </pre> tags (<pre> stands for "preformatted"), but /* */ will probably add extraneous markup that you might not want. The point is somewhat moot, since I prefer to format my ToC/ToI manually or with regexes rather than use the Guiguts auto-generated output.

6. Remove Visible Page Breaks (10-30 min.)

7. Apply Word-Frequency Checks (10-60 min.)

Open the Word Frequency report.

  • Set the Frq switch; click All Words. List is now sorted by word frequency; scroll to the end and skim up the list of words that only appear 1 time looking for oddities and obvious misspellings.
  • Click Character Cnts.
    • Note characters that appear only once, check usage.
    • Check for equal counts of left & right parens and brackets.
  • Set the Alph switch; click All Words. Scroll to the word Footnote and write down count for later use. (If the count is large, click once on Footnote and click 1st Harm. The harmonic window shows you any of the common misspellings of "Footnote" that occur.)
  • Click Emdashes. This shows words with emdashes in them as well as similar words without emdashes (aka: suspects) marked with ****. Check suspects against the text and page images. Preserve author's intent even when inconsistent. Hint: Enable the Suspects flag and click Emdashes again to see only suspect words.
  • Click Hyphens. Same as Emdashes above but for Hyphens.
  • Click Ital/Bold. Check for any words that might need to be bolded/italicized.
  • Click Alpha/num. Scan list for one/ell and oh/zero errors.
  • Click ALL CAPS. Scan list looking for oddities.
  • Click MiXeD CasE. Scan list looking for letters such as o that sometimes OCR wrongly as uppercase. Oh/zero errors can show up here, too.
  • Click Check Accents. Scan list looking for mistakes, inconsistent usages.
  • Click Check , Upper. Scan list for comma-for-period errors.
  • Click Check . Lower. Scan list for period-for-comma errors.
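The character-count checks in this step (characters that occur only once, and equal counts of left and right brackets) are easy to reproduce outside the editor. The sketch below is not how the Word Frequency report works internally; it is a plain Python illustration with made-up function names.

```python
from collections import Counter

# Opening bracket -> closing bracket.
PAIRS = {"(": ")", "[": "]", "{": "}"}


def bracket_counts(text):
    """Map each bracket pair, e.g. '()', to (open_count, close_count)."""
    counts = Counter(text)
    return {o + c: (counts[o], counts[c]) for o, c in PAIRS.items()}


def unbalanced(text):
    """Return only the bracket pairs whose open and close counts differ."""
    return {pair: n for pair, n in bracket_counts(text).items() if n[0] != n[1]}


def singletons(text):
    """Return the non-space characters that appear exactly once."""
    return sorted(ch for ch, n in Counter(text).items() if n == 1 and not ch.isspace())


sample = "(a) [b {c}"
print(unbalanced(sample))
```

Equal counts do not prove the brackets are correctly paired, only that none is obviously missing; a curly brace that should be a parenthesis shows up here as two mismatched counts.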

8. Apply Scanno Checks (1-3 hr.)

See this topic for usage of the scanno checks.

  • If you have installed Jeebies, use Fixup> Run Jeebies and examine its report of possible he/be errors.
  • Start scanno searching based on eng-common.rc. Work through the list.
  • Apply scanno searching based on misspelled.rc. Work through the list.
  • Apply scanno searching based on regex.rc. Work through the list.
  • User:Camomiletea/Regexes

9. Apply Spellcheck (30-90 min.)

  • Run Word Frequency to get the word counts in the spellcheck.
  • Start the spellcheck process.
  • Proceed through the document, correcting words or adding them to the project dictionary as appropriate. (Add everything to project dictionary, except the wrong words.)
  • Review custom.dic file - this sometimes shows proper names with variant spellings.

10. Apply Gutcheck (10-45 min.)

Start the Gutcheck Process.

  • Work through the list, correcting as appropriate.

11. Edit Transliterations (0-? hr.)

Needs to be done after Spell-check/Gutcheck, since they act up with UTF-8.

  • Search&Replace: text: \[[^FIS] (left-bracket followed by anything other than F, I or S), regex. Check content of each transliteration. For Greek, use the Greek Transliteration Tool. I'd use a different regex for HTML from the one proposed in the article.

Text version:

  • Convert to beta code:
    • Search: \[Greek: +((.|\n)+?)\]
    • Replace: [Greek: \GB$1\E]
  • Remove accents, label Greek with {}
    • same search term
    • Replace: {\GA$1\E}

HTML version:

  • Add transliteration in the title attribute (beta code):
    • Search: \[Greek: +((.|\n)+?)\]
    • Replace: <ins class="greek" lang="grc" xml:lang="grc" title="[Greek: \GB$1\E]">$1</ins>
  • Remove accents and discard "[Greek: ]":
    • Search: \[Greek: +((.|\n)+?)\]
    • Replace: \GA$1\E
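To review the content of every Greek transliteration before converting anything, the same search pattern can be run in Python to list the spans. Note the assumptions: the \GB, \GA and \E codes in the replacements above are specific to the editor's Search & Replace and are not reproduced here, and the function name greek_spans is invented for this sketch.

```python
import re

# Same idea as the search term above: "[Greek: " then the shortest run of
# characters (including newlines) up to the next "]".
GREEK = re.compile(r"\[Greek: +((?:.|\n)+?)\]")


def greek_spans(text):
    """Return (line_number, transliteration) pairs, folding line breaks to spaces."""
    spans = []
    for m in GREEK.finditer(text):
        line_number = text.count("\n", 0, m.start()) + 1
        spans.append((line_number, " ".join(m.group(1).split())))
    return spans


sample = "a [Greek: logos] b\n[Greek: kai\nhoi] c [Footnote 1: x]"
for line_number, content in greek_spans(sample):
    print(line_number, content)
```

The list is only for checking; the actual conversion is still done with the replacements given above.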

12. Fix Sidenotes (0-? hr.)

Needs to be after spellcheck/gutcheck, since they are going to be moved. Read the discussion. Step through sidenotes with: Search&Replace of [S, not regex, not whole word, ignore case. Click Search to find each Sidenote.

  • Compare to page image. Move note above paragraph if feasible.
  • Otherwise, position it above the sentence to which it applies, with blank lines to prevent rewrapping if you decide that is best.

13. Fix Footnotes (0-? hr.)

Needs to be after spellcheck/gutcheck, since they are moved. Read the discussion and follow the steps on this page.

14. Fix Poetry Line Numbers (0-20 min.)

If the book has poetry that uses line numbers, read this page and align the line numbers consistently.

15. Check balanced markup

Search&Replace for \<(\w+)>\n?[^<]+<(?!/\1>) (any starting markup in <..> that doesn't end in an identical closing markup). (Note: this regular expression sees <tb> as unbalanced, and shows the text from the <tb> to the next markup as an error. If you can devise a better regex please do!) Because it includes a newline, the search may take several seconds to return the first result.

  • Correct the error and click search until no more are found.
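As an alternative to the regex, which trips over <tb>, a small stack-based check can be run outside the editor. This is a different technique from the search above, not a drop-in replacement: it assumes tags have no attributes (as in the proofing markup), treats <tb> as standalone, and uses made-up names.

```python
import re

TAG = re.compile(r"<(/?)(\w+)>")
STANDALONE = {"tb"}  # markup that never has a closing tag


def unbalanced_markup(text):
    """Return sorted (line_number, message) pairs for unmatched markup."""
    problems = []
    stack = []  # (tag_name, line_number) of tags opened and not yet closed
    for m in TAG.finditer(text):
        is_closing = m.group(1) == "/"
        name = m.group(2)
        line_number = text.count("\n", 0, m.start()) + 1
        if name in STANDALONE:
            continue
        if not is_closing:
            stack.append((name, line_number))
        elif stack and stack[-1][0] == name:
            stack.pop()
        else:
            problems.append((line_number, f"unexpected </{name}>"))
    problems.extend((line_number, f"unclosed <{name}>") for name, line_number in stack)
    return sorted(problems)


sample = "<i>ok</i>\n<tb>\n<b>bad\n</i>"
for line_number, message in unbalanced_markup(sample):
    print(line_number, message)
```

Unlike the regex, this accepts properly nested markup such as <i>a <b>b</b></i>, so decide whether nesting is something you want flagged before relying on it.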

16. Rewrap and Save markup

  • Use Edit>Select All then Selection>Rewrap Selection. Wait while rewrap completes. Do not clean up rewrap markers.
  • Save changes as bookname-rewrap.txt. This will be the source of both text and HTML.
  • Save again as bookname-latin.txt (or bookname-utf.txt as the case may be).

17. Resolve proofer notes

  • resolve proofer notes, which are indicated by asterisk
    • text: TN mentioning corrections; silent corrections + list of corrections at the end; make sure to rewrap affected text after removing notes and remove spaces at line ends.
    • HTML: Transcriber's note at the start explaining markup; <ins title="Explanation of the correction">corrected text</ins> within text + list of corrections at the end

18. Convert <tb>, Italic, Bold, and Smallcap (10 min.)

These steps are for the text document; HTML treated below.

  • Fix <tb> markup for the text version: In the Text Processing menu, select "Convert <tb> to asterisk break" which converts all in one step. (Note: you may not see the Text Processing menu if you do not have the latest version.)
    • Interactive replace: menu/Search -> Search & Replace to replace interactively: Search field, <tb>; Replace field,
             *       *       *       *       *
      . Use Search and Replace buttons to step through mark up; Rpl All if happy with the operation.
  • Fix italics: In the Text Processing menu, select "Convert Italics." Italic markup is replaced with underscores.
    • Interactive: Same as <tb>: Search field, </?i>; Replace, _. Set Regex checkbox.
  • Fix bold. Decide if you want to mark bold with =, or $, or by all uppercase.
    • For = or $, in the Text Processing menu select Options and set the appropriate character; then select Text Processing > Convert Bold.
    • Interactive: As for italics: Search, </?b>, Replace, =, $ or preferred character.
    • For uppercase, use a regex search for <b>(\n?[^<]+)</b> (<b> then anything including newline up to the first </b>). Replacement: \U$1\E.
Click Search, then Replace until you are confident it works; then Replace All. Afterward, search for b> and hand-edit any remaining bold.
  • Uppercase selected small-cap, which proofers have changed to <sc>Title-Cased-Text</sc>.

PG guidelines say that where only an opening word or phrase of a section is small-capped, it should be left as title case. Some works have whole headings small-cap; some have used small-cap as a means of emphasis. These should be uppercased in the text. To handle either case: regex find <sc>(\n?[^<]+)</sc> (<sc> then anything including newlines up to </sc>; note this will not find small-cap that spans other markup such as italic.) Replacement 1: \U$1\E Replacement 2: $1 alone. Click Search and evaluate the usage: click R&S opposite replacement 1 to uppercase; click opposite replacement 2 to just remove the markup. After, search for sc> and hand-edit any remaining markup.

  • Save the document.
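The regex replacements in this step translate directly to Python, which may help for testing them on a scrap of text first. The sketch below applies each replacement everywhere at once, whereas the steps above are interactive so that each small-cap usage can be judged individually; the function names are illustrative.

```python
import re

# <sc> then anything (including newlines) up to </sc>; like the regex above,
# this does not match small caps that span other markup such as <i>.
SMALLCAP = re.compile(r"<sc>(\n?[^<]+)</sc>")
ITALIC_TAG = re.compile(r"</?i>")


def uppercase_smallcaps(text):
    """Replacement 1: drop the markup and uppercase the content."""
    return SMALLCAP.sub(lambda m: m.group(1).upper(), text)


def strip_smallcaps(text):
    """Replacement 2: drop the markup and keep the title-cased content."""
    return SMALLCAP.sub(r"\1", text)


def italics_to_underscores(text):
    """Replace both <i> and </i> with an underscore."""
    return ITALIC_TAG.sub("_", text)


print(uppercase_smallcaps("<sc>Chapter One</sc> begins"))
print(strip_smallcaps("<sc>Chapter One</sc> begins"))
print(italics_to_underscores("an <i>ibid.</i> note"))
```

As with the interactive version, search for any leftover sc> afterwards and hand-edit what the pattern did not catch.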

19. Fix ASCII Tables (0-? hr.)

  • Use Search>Find Next /**/ Block to step through all tabular material.
    • Compare to page image; reformat to best convey author intent.
    • For complex tables, use Table Special Effects to reformat.

20. Clear Rewrap Markers (10-30 min.)

  • Page through entire text, looking for improper indentation. If found, re-open, clicking NO when asked if you want to save the edits. Find and fix broken rewrap markups. Repeat this step.
  • Open Fixup>Footnote Fixup; tidy up footnotes. See this discussion.
  • Remove all rewrap markers: see this page.
  • Use Fixup>Remove End-of-line Spaces.
  • Use Fixup>Run Gutcheck and resolve any new issues.
  • Save the document.

Food for Thought Later: Determine Character Coding (5-60 min.)

I've never done this; need to research/think on it

Character codes are described here. You need to understand the coding your etext uses.

First, apply Fixup > Convert Windows CP 1252 characters to Unicode. This gets rid of any Windows-unique characters but may insert Unicode characters in their place.

Search with the regex \P{IsASCII} (note uppercase P). If nothing is found, the book now contains only characters from the 7-bit ASCII set and you are done.

If 8-bit characters are found, you must take action. First apply Fixup> Run Word Frequency Routine. In the report window, click the Unicode>FF button. Words containing a multi-byte (Unicode) character are listed. If none are shown, the text is probably, but not certainly, Latin-1; at any rate Unicode characters are confined to non-word punctuation.

If your text has symbols from Latin-1 or Unicode, read or re-read this item of the Gutenberg FAQ. This section and this one in DP's post-processing FAQ have additional information about character sets and when to make more than one text version. Decide whether you will upload a single version or split the text into ASCII and high-bit versions. If you split it, then:

  • Use File>Save As to "fork" your single document into versions:
bookname-asc for a pure-ASCII version;
bookname-lt1 for a version with Latin-1 accented characters;
and/or bookname-utf8 for a version that has Unicode characters.

Note that the "-asc" and so forth should not replace the normal .txt at the end of the file name. You will end up with files named bookname-utf8.txt, etc., under this naming scheme.

  • Open bookname-asc.
  • Search with the regex \P{IsASCII} (note uppercase P) to step through each character not 7-bit ASCII
  • Replace each, using some consistent substitution scheme (for example, ['e] for é, etc.).
  • Add a "Transcriber's Note" to the head of the text to document your substitution scheme.
  • In a similar manner, search bookname-lt1 for Unicode characters and replace them with Latin-1 equivalents. Add a "Transcriber's Note" to document the substitutions.

Pure-ASCII etext bookname-asc and optional Latin-1 bookname-lt1 and bookname-utf8 are ready to upload!
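The pure-ASCII pass described above can be sketched in Python: count the characters outside 7-bit ASCII, then replace them from a substitution table and report anything the table does not cover. Only the ['e] for é mapping comes from the text above; the second table entry and the function names are assumptions added for illustration.

```python
from collections import Counter

SUBSTITUTIONS = {
    "é": "['e]",   # from the example in the step above
    "è": "[`e]",   # assumed by analogy; choose your own consistent scheme
}


def non_ascii_chars(text):
    """Count every character outside the 7-bit ASCII range."""
    return Counter(ch for ch in text if ord(ch) > 127)


def to_ascii(text, table):
    """Apply the table; return (new_text, sorted list of unhandled characters)."""
    out = []
    missing = set()
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        elif ch in table:
            out.append(table[ch])
        else:
            out.append(ch)      # leave it in place so it can be found later
            missing.add(ch)
    return "".join(out), sorted(missing)


converted, unhandled = to_ascii("café née ñ", SUBSTITUTIONS)
print(converted)
print(unhandled)
```

The table doubles as the raw material for the Transcriber's Note that documents the substitution scheme.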

21. Prepare HTML Edition (4-? hr.)

  • Open bookname-rewrap.txt that was saved in step 16.
  • Save as bookname-html.html.
  • If you will insert visible page numbers or anchors at page boundaries, then configure the page labels before proceeding
  • Don't remove the rewrap markers. These are needed for generation of proper HTML.
  • Open the HTML Palette and set optional switches as desired.
  • Apply Automatic HTML conversion and wait while it completes.
  • Save the file and open it in a browser.
  • Scroll through looking for systematic errors. (Title pages, tables, etc. will look terrible; no matter). If automatic conversion messed up, delete the file and start this step over with the backup file.
  • Page through the book looking for text that was not handled well by automatic HTML generation, in particular:
    • Title pages.
    • Tables (review Accessibility guidelines).
    • Tables of Contents and Indexes, which are best formatted using unordered lists.
      • For index, it's easier to work with the pre-rewrap/pre-HTML version. But remember to reinsert page numbers, fix up markup, non-Latin and Latin characters.
    • Illustrations (Note: GG removes markup from captions -- use regexes instead if captions have italics, etc.).
    • Small caps?
  • Use the EmEditor (for simple texts) or Dreamweaver (for more complex projects) to mark up these areas. Use regex replacements to make systematic changes.
  • Fix CSS for print and handheld. Some thoughts -- User:Camomiletea/Mobile.
  • Open the file in one or more web browsers (Internet Explorer and at least one other such as Firefox). Page through the entire book.
    • Where you see a problem, make a correction in your editing program, save the file, and click the "reload" button in each browser.
  • Hyperlink page references in text, TOC, and index (discussed here).
  • Work through Accessibility Recipes as needed/desired.
  • See step 17 for handling proofreader notes; decide on other Transcriber Notes you need to include.
  • In Firefox / Web Developer extension (User:Camomiletea/Tools#Firefox),
    • check Document Outline for sensibility
    • Disable CSS and see how well the book holds up
    • Upload to my webspace, and run validation (HTML, CSS, link, accessibility) correcting all flags.
  • Stress-check in various browsers, especially if there are floating elements and if I used max-width for poetry in CSS:
    • increase font size
    • resize the browser window
    • add font-family: monospace; to body
  • Remove unused CSS, manually or using a tool such as the Firefox addons Firebug (with CSS Usage extension) or Dust-Me Selectors.

NEW: You can upload an HTML file and see what the EPUB version will look like: Project Gutenberg Online EPubMaker

22. Process Hi-resolution Images (? hr.)

If the project manager provided high-resolution scans of the images in the text, use an image-processing program such as The Gimp or Adobe Photoshop Elements to optimize them. See

You can do this before, during, or after HTML conversion.

For each image:

  • Load image from the originals folder (see step 1)
  • Straighten it (almost all scanned images are off-perpendicular; some are trapezoidal owing to the page not being flat on the scan window).
  • Crop it to remove all redundant white space and borders (provide margins and borders with CSS styling of the <img> markup).
  • Correct the contrast (you must have calibrated your monitor, see this page).
  • Sharpen.
  • Correct any major scratches, freckles, dirt, etc.
  • Save in the subfolder images using appropriate type:
    • Line drawings in .png at 8 bits per pixel (not the default 24-bit RGB format).
      • remember to pngcrush
    • Photographs as .jpg with an appropriate compression level such as (Photoshop) level 6.
  • Page through entire HTML book making sure that each image is being loaded correctly. Test each thumbnail if used.
    • upload the HTML to my webspace and run Link check again

23. Optional final checks

  • Upload files for smoothreading. May be done once you've completed text, and whilst you are working on HTML / image processing.
  • Post the HTML for preview in the forums.
  • After stepping away from the project, run additional checks, e.g. ppvtxt / ppvhtml.
  • Re-run gutcheck.
  • Diff F2 / text.
  • Diff text / HTML.

24. Upload the Finished Project

  • Prepare a new folder with a short name. I usually use bookname.
  • Move into it only the files to be uploaded:
    • the etext file(s) bookname-latin.txt, and/or bookname-utf.txt.
    • the HTML file if one was made
    • the images folder if required by HTML
Do not include the original images or the page images; do not include any work files or scratch files or auto-backup editions. All filenames should contain lowercase letters only.
  • Use a zip utility to make a zip archive of this folder.
  • Windows users: The "images" folder will often contain a hidden file called thumbs.db. This shouldn't be included in the upload. The easiest way to get rid of it is to open the finished zip-file, navigate to the "images"-folder and delete it from there if present.
  • Follow the Guide to Direct Uploading
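The packaging rules above (skip thumbs.db, watch for uppercase in file names) can be scripted instead of fixing the zip by hand. This is a sketch under assumptions: it reads "lowercase letters only" loosely as "no uppercase letters", since names such as bookname-latin.txt also contain hyphens and dots, and all function names are made up.

```python
import zipfile
from pathlib import Path


def files_to_upload(folder):
    """List every file under folder except thumbs.db (in any letter case)."""
    folder = Path(folder)
    return sorted(
        p for p in folder.rglob("*")
        if p.is_file() and p.name.lower() != "thumbs.db"
    )


def uppercase_names(paths):
    """Return the file names that contain uppercase letters."""
    return [p.name for p in paths if p.name != p.name.lower()]


def make_zip(folder, zip_path):
    """Zip the folder, keeping its name as the top-level directory."""
    folder = Path(folder)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for p in files_to_upload(folder):
            zf.write(p, p.relative_to(folder.parent).as_posix())
        return zf.namelist()
```

A typical call would be make_zip("bookname", "bookname.zip") from the parent directory, after checking that uppercase_names(files_to_upload("bookname")) is empty. Keep the zip file outside the folder being archived.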