User:Tintazul/PP Checklist

From DPWiki

This is my own post-processing checklist. You may use it if you think it might be, well, useful. Feel free to drop me a line of comment, too. Other PP checklists and info pages exist as well.


A. Preliminary phase

Pick a project and check whether it's complete.

A.1 Find project

  1. Read project comments.
  2. Read project forum.
  3. When happy, commit to PP.

A.2 Create local setup

  1. Create folder for project, <author>_<title>.
  2. Create notes file, notes_<title>.txt and write anything special from comments/forum.
  3. Download text into directory, unzip and rename as <title>.txt (lowercase letters only).
  4. Download good_words.txt and bad_words.txt (if any)
  5. Download page scans into subfolder pngs.
  6. Download illustrations into subfolder originals.
  7. Keep zips as backup, in case you have to get back to the pre-PP versions.
  8. Check that there's an illustration matching each Illustration tag. If any are missing, talk to the PM, or use the Missing page wiki.
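The folder layout from steps 1, 2, 5 and 6 can be sketched as a small script. This is only an illustration: the function name `create_project_layout` and the sample author/title are made up, and lowercasing the folder name is my assumption (the checklist only asks for lowercase file names).

```python
from pathlib import Path
import tempfile

def create_project_layout(base, author, title):
    """Create <author>_<title>/ with pngs/, originals/ and an empty notes file."""
    # Assumption: folder and notes file names are lowercased.
    project = Path(base) / f"{author}_{title}".lower()
    for sub in ("pngs", "originals"):
        (project / sub).mkdir(parents=True, exist_ok=True)
    (project / f"notes_{title.lower()}.txt").touch()
    return project

# Demo in a throwaway directory, so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    project = create_project_layout(tmp, "dickens", "twocities")
    print(sorted(p.name for p in project.iterdir()))
```

Downloading and unzipping the project files is still a manual step here.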

A.3 Page labels

  1. Right-click bottom label reading, Lbl: Pg xxx
  2. Relabel pages according to images.
  3. If pages are missing, talk to the PM, or use the Missing page wiki.


B. Text phase, pre-fork

Deal with issues which affect both plain-text and HTML.

B.1 Sequential inspection of text

  1. Go through all scans, and note if markup is correct: blocks, quotes, illustrations, tables.
  2. Check if every paragraph is really a paragraph.
  3. Note anything that will need special treatment in the notes file: page references to turn into hyperlinks etc.

B.2 Apply errata

  1. If there is an errata list, apply it and leave a transcriber's note: "Errata on page XXX applied to text."
  2. Applying it silently is OK (in my opinion), since the typos were never the author's intention and so aren't considered part of the author's work.
  3. Addenda and corrigenda are not to be replaced, but should be left as is.

B.3 Resolve proofers' notes

  1. Search for regex [^/]\*[^/]
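Outside Guiguts, the same search could be run over the text file with a few lines of Python. A minimal sketch, assuming the text has already been read into a string; the function name `find_asterisks` and the sample text are invented for the example.

```python
import re

# Same pattern as above: an asterisk that is not part of /* or */.
NOTE_RE = re.compile(r"[^/]\*[^/]")

def find_asterisks(text):
    """Return (line_number, line) pairs for lines containing a stray asterisk."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Pad with spaces so an asterisk at either end of a line still matches.
        if NOTE_RE.search(f" {line} "):
            hits.append((number, line))
    return hits

sample = "/*\nA poem line\n*/\nThe cat[**missing word?] sat.\n"
print(find_asterisks(sample))
```

Each hit still needs a human decision; the script only lists where to look.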

B.4 Check blocks

  1. Search > Find orphaned brackets and markup – do it for all blocks
  2. Search > Find next /* ... */ block.
    • Note that to keep any block from rewrapping it must be indented, so the rule is /*[somenumber]
  3. Same for all other block types.

B.5 Basic Fixup

  1. Fixup > Run Fixup
  2. Fixup > Remove end-of-line spaces

B.6 Check punctuation

  1. Search for opening punctuation inside opening tag: regex (<(i|b|sc|g|f)>)([([{"«]) (and replace $3$1 if needed)
  2. Search for other punctuation inside closing tag: regex ([,;.:!?"»)\]}])(</(i|b|sc|g|f)>) (and replace $2$1 if needed)
  3. Search for other punctuation outside closing tag: regex (</(i|b|sc|g|f)>)([,;.:!?"»)\]}]) (and replace $3$1 if needed)
  4. Check ellipses according to http://www.pgdp.net/c/faq/proofreading_guidelines.php#period_p
  5. Check em dash -- against page images -- regex for regular dash: ([^-]|\n)--([^-]|\n)
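Because the replacements in steps 1 and 2 are only applied "if needed", a script is probably better used to list candidates than to change them. The sketch below does that for the first two searches; `suggest_moves` is a hypothetical name, and nothing is modified automatically.

```python
import re

TAG = r"(?:i|b|sc|g|f)"
# Opening punctuation right after an opening tag, e.g. <i>"
OPEN_INSIDE = re.compile(r"(<" + TAG + r">)" + r'([(\[{"«])')
# Other punctuation right before a closing tag, e.g. ,</i>
CLOSE_INSIDE = re.compile(r'([,;.:!?"»)\]}])' + r"(</" + TAG + r">)")

def suggest_moves(text):
    """Yield (found, suggestion) pairs; the text itself is left untouched."""
    for m in OPEN_INSIDE.finditer(text):
        yield m.group(0), m.group(2) + m.group(1)
    for m in CLOSE_INSIDE.finditer(text):
        yield m.group(0), m.group(2) + m.group(1)

sample = 'He said <i>"nonsense,</i> and left.'
print(list(suggest_moves(sample)))
```

Cases such as an italicised abbreviation, where the period belongs inside the tag, are exactly why the suggestions should be reviewed one by one.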

B.7 Check Transliterations

  1. Search for [Greek and check all against page scans
  2. Search regex: \[[^FIS]

B.8 Fix Sidenotes

  1. Fixup > Sidenote fixup

B.9 Fix Footnotes

  1. See PPTools/Guiguts/Footnotes

B.10 Fix Poetry Line Numbers

  1. See http://www.pgdp.net/wiki/PPTools/Guiguts/Fixup#Poetry_Line_Numbers

B.11 Remove Page Separators

  1. Save a copy of the text before removing separators.
  2. Fixup > Remove page separators

B.12 Apply Spellcheck

  1. Guiguts can't spellcheck accented chars. Copy text to another program to spellcheck. (Spellcheck wasn't working for me under Linux; it now works under Mac.)
  2. For English texts, run jeebies

B.13 Apply Word-Frequency Checks

  1. Go through all buttons.

B.14 Apply Scanno Checks

  1. See PPTools/Guiguts/Searching#Using_Scanno_Searches

B.15 Apply Gutcheck

  1. Run through whole list. See Gutcheck options because of HTML tags.
  2. Turn everything off except for Short line and Long line. Play around with the line length until warnings go away.

B.16 Check balanced markup

  1. If there are <tb> the check won't work. Change first <tb> to _THOUGHTBREAK_, then perform the test, then change back to <tb>.
  2. Search regex \<(\w+)>\n?[^<]+<(?!/\1>)
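As an alternative to the regex (not a replacement for the Guiguts check), a small stack-based checker can report unbalanced tags and simply skip <tb>, which avoids the _THOUGHTBREAK_ workaround. This is a sketch: it only looks at simple tags without attributes, and the names `unbalanced_tags` and `STANDALONE` are my own.

```python
import re

TAG_RE = re.compile(r"<(/?)(\w+)>")
STANDALONE = {"tb"}  # assumption: <tb> never has a closing tag

def unbalanced_tags(text):
    """Return (line_number, problem) pairs for mismatched simple tags."""
    problems, stack = [], []
    for m in TAG_RE.finditer(text):
        closing, name = m.group(1) == "/", m.group(2)
        line = text.count("\n", 0, m.start()) + 1
        if name in STANDALONE:
            continue
        if not closing:
            stack.append((name, line))
        elif stack and stack[-1][0] == name:
            stack.pop()
        else:
            problems.append((line, f"unexpected </{name}>"))
    # Anything still on the stack was opened but never closed.
    problems.extend((line, f"unclosed <{name}>") for name, line in stack)
    return problems

sample = "<i>fine</i>\n<tb>\n<b>never closed\n"
print(unbalanced_tags(sample))
```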

B.17 Check subscripts and superscripts

  1. Make sure there aren't any º and ^o which should be degree signs
  2. If your book is LOTE, then you know the basic text will be Latin-1, so you can change the ordinals:
    • Search regex \^o\b replace with º
    • Search regex \^a\b replace with ª
  3. Subscripts: search for _ – must be formatted as _{ }, see http://www.pgdp.net/c/faq/document.php#subscr
  4. Superscripts: search for ^ – may be marked as ^ or ^{ }, see http://www.pgdp.net/c/faq/document.php#supers

C. Post-fork

These changes only affect the text-only version (but note exceptions C.7 and C.8, which apply to all UTF-8 versions).

C.1 Fork for HTML

  1. Save your work.
  2. Save a second time, as <title>.html
  3. Continue working on .txt
  4. From now on, when changing anything to the file always consider applying same changes to <title>.html file

C.2 Convert <tb>, Italic, Bold, and Small caps

  1. <tb>: Fixup > Convert <tb> to asterisk break
  2. Bold surrounded by *? search regex </?b> replace *
  3. Italics surrounded by _? search regex </?i> replace _
  4. Small caps to uppercase? search regex <sc>(\n?[^<]+)</sc> replace \U$1\E
  5. Create Transcriber's Notes explaining any use of characters to represent formatting, e.g. * = _ ^{} []
  6. Search for < to see if you've missed anything (you'll find <f> if you have it)
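For reference, the same conversions written for Python's `re` module might look like the sketch below. Python has no `\U...\E` in replacement strings, so the small-caps step uses a function instead. The function names and the sample sentence are invented; the <tb> conversion is left to Guiguts.

```python
import re

def to_plain_text_markup(text):
    """Apply the bold, italic and small-caps conversions from steps 2 to 4."""
    text = re.sub(r"</?b>", "*", text)
    text = re.sub(r"</?i>", "_", text)
    # No \U...\E in Python, so uppercase the captured text in a function.
    text = re.sub(r"<sc>(\n?[^<]+)</sc>", lambda m: m.group(1).upper(), text)
    return text

def leftover_tags(text):
    """List any simple tags that survived, as in step 6."""
    return re.findall(r"</?\w+>", text)

sample = "<i>Hamlet</i>, by <sc>William Shakespeare</sc>, <b>new</b> edition"
print(to_plain_text_markup(sample))
print(leftover_tags("a <f>b</f> c"))
```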

C.3 Format Non-wrapping Text

  1. Check front matter. No need to center lines in text files.
  2. Fix ASCII Tables. See Fixup > ASCII Table Special Effects if needed.
  3. Align Tables of Contents.

C.4 Localize labels

  1. "Footnote" / "Sidenote" / "Illustration" / "Greek" must be translated for LOTE
  2. Describe any illustrations which don't have a caption (i.e., never leave the text edition with just a bare [Illustration]).

C.5 Replace ordinals

  1. Any ordinals found will mean that text will need a Latin-1/UTF-8 version
  2. Feminine ordinal: replace regex \^a\b with ª (apply automatically to text, not to scientific books)
  3. Masculine ordinal: replace regex \^o\b with º (apply automatically to text, not to scientific books)

C.6 Fork text according to character coding

  1. Search regex \P{IsASCII} – anything found means not just ASCII
  2. DP-INT: <name>-asc.txt mandatory for English, <name>-lt1.txt for LOTE; DP-Canada: <name>-lt1.txt mandatory in all cases
  3. optional <name>-utf8.txt
  4. For HTML, use character entities as much as possible to avoid encoding as UTF-8
  5. HTML <name>-lt1.txt or <name>-utf8.txt (as needed)
  6. Prepare different text versions, using ['e] etc. and leaving transcriber's notes as necessary
  7. Do C.3 Format Non-wrapping Text again on any text versions where characters were changed; note that replacing é with ['e] may cause alignment issues.
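Python's `re` module does not support `\P{IsASCII}`, so a script version of step 1 has to test code points directly. The sketch below counts non-ASCII characters and checks whether a text would fit in Latin-1; both function names are hypothetical.

```python
from collections import Counter

def non_ascii_chars(text):
    """Count every character outside the ASCII range."""
    return Counter(ch for ch in text if ord(ch) > 127)

def fits_latin1(text):
    """True if the whole text can be saved as Latin-1 (ISO-8859-1)."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False

sample = "café naïve café"
print(non_ascii_chars(sample))
print(fits_latin1(sample), fits_latin1("λόγος"))
```

An empty counter means a plain ASCII file is enough; a `False` from `fits_latin1` means a UTF-8 version (or bracketed substitutes) is needed.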

C.7 [Greek: tag to Greek characters

Note: This applies to all UTF-8 versions, text and HTML.

  1. Consider leaving the Latin transliteration as a mouse-over tip:
    • Regex (\[Greek: )([^]]+)(]) replace <abbr class="greek" title="$2">$1$2$3</abbr>
    • Tip readers to the existence of alternate text with CSS, e.g. abbr.greek{border-bottom:1px dotted black}
  2. Either search for each occurrence of [Greek:, or if you have lots of Greek try some systematic process, like this:
    • The searches below will break if you have a [ character inside a [Greek: tag.
    • Each of the below substitutions must be applied repeatedly, until no results are found. They catch only one occurrence inside each [Greek: tag.
    • Transform line breaks within Greek into spaces: regex (\[Greek: [^]]*)\n replace $1 (remember the space after $1 ).
    • Search for page tags within Greek: regex (\[Greek: [^]]*)span. Move the page tags manually outside, or break Greek tag in two and leave page tag in between.
    • Replace Latin letters with Greek letters. Pay attention to the order: digraphs and accented letters first. Don't forget similar letters like uppercase A and uppercase alpha.
      • Example: regex (\[Greek: [^]]*)C[Hh]([^]]*]) replace $1Χ$2 (a capital chi)
      • TODO: create systematic way of applying these replacements, instead of typing them each time (sed script?)
    • Check for anything left inside the tag: regex (\[Greek: [^]]*)([A-Za-z'`])([^]]*\]) replace $1_$2_$3 to mark it. Undo the replace and solve as needed. Search again.
    • Do you need to replace Greek punctuation?
    • After replacing systematically, go through each [Greek: tag, checking it against the page scan. Add breathing signs ῾ ᾽ etc.
  3. Delete Greek tags: regex \[Greek: ([^]]*)] replace $1
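One possible answer to the TODO above, in Python rather than sed: replace letters only inside [Greek: ...] tags, trying digraphs before single letters. This is a rough sketch under several assumptions. The letter table is illustrative and incomplete, accents and breathings are not handled, a Latin h is left untouched so that the "anything left inside the tag" search still flags it, and `transliterate_greek` is a made-up name. The tags themselves are kept so each one can still be checked against the scans.

```python
import re

# Illustrative, incomplete table; digraphs must come before single letters.
PAIRS = [
    ("th", "θ"), ("ph", "φ"), ("ch", "χ"), ("ps", "ψ"),
    ("ê", "η"), ("ô", "ω"),
    ("a", "α"), ("b", "β"), ("g", "γ"), ("d", "δ"), ("e", "ε"),
    ("z", "ζ"), ("i", "ι"), ("k", "κ"), ("l", "λ"), ("m", "μ"),
    ("n", "ν"), ("x", "ξ"), ("o", "ο"), ("p", "π"), ("r", "ρ"),
    ("s", "σ"), ("t", "τ"), ("u", "υ"), ("y", "υ"),
]
TABLE = dict(PAIRS)
TOKEN_RE = re.compile("|".join(re.escape(latin) for latin, _ in PAIRS),
                      re.IGNORECASE)
GREEK_TAG_RE = re.compile(r"(\[Greek: )([^]]+)(\])")

def _convert_token(m):
    token = m.group(0)
    greek = TABLE[token.lower()]
    # Keep capitalisation: "Th" and "TH" both become a capital theta.
    return greek.upper() if token[0].isupper() else greek

def _convert_tag(m):
    body = TOKEN_RE.sub(_convert_token, m.group(2))
    body = re.sub(r"σ\b", "ς", body)  # word-final sigma
    return m.group(1) + body + m.group(3)

def transliterate_greek(text):
    """Convert Latin letters to Greek inside [Greek: ...] tags only."""
    return GREEK_TAG_RE.sub(_convert_tag, text)

print(transliterate_greek("the word [Greek: logos] and [Greek: Theos]"))
```

Like the regexes above, this breaks if a [ character appears inside a [Greek: tag, and the result is a starting point for the manual pass, not a substitute for it.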

C.8 A.O.F.Ch. (Any Other Funky Characters)

Note: This applies to all UTF-8 versions, text and HTML.

  1. Do you have any other weird characters? Your UTF-8 versions will be much spiffier ;-) if they don't have any [ ] for characters which exist.
  2. For transliterating Arabic, Cyrillic, hieroglyphs etc. you may need to ask for expert help. See the Language Skills List or the language-related forums.
  3. Look for any characters encoded with [, find the corresponding character in a Character Map application in your computer and replace it.

C.9 Rewrap and Clear Rewrap Markers

  1. Save backup
  2. Select all, Selection > Rewrap Selection.
  3. Fixup > Clean rewrap markers
  4. Rerun Gutcheck, noting short and long lines only.

C.10 Last Re-check For Text

  1. Sometimes you may be left with five blanks instead of four, and three instead of two (why?). To fix:
    • Five blanks: Search regex (\n\n\n\n\n)\n replace $1
    • Three blanks: Search regex ([^\n]\n\n\n)\n([^\n]) replace $1$2
  2. Go through the text; fix issues as needed; if necessary reopen pre-rewrap saved file. Apply any needed changes also to .html file.
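The two blank-line fixes from step 1 translate directly to Python, where `$1` becomes `\1`. A minimal sketch with an invented function name; like the original searches, it does a single pass.

```python
import re

def normalize_blank_runs(text):
    """Trim five blank lines to four and three blank lines to two."""
    text = re.sub(r"(\n{5})\n", r"\1", text)
    text = re.sub(r"([^\n]\n{3})\n([^\n])", r"\1\2", text)
    return text

print(repr(normalize_blank_runs("a\n\n\n\nb")))
```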

C.11 Smooth Reading

  1. Unix/Mac: todos *.txt *.html (or unix2dos; do we break the .bin file here?)
  2. Optional. The more errors found during PP, the more reason to put up for SR. Offer least-modified text file for SR (usually Latin-1).
  3. Upload. Write any comments useful for Smooth Readers.
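If neither todos nor unix2dos is at hand, the line-ending conversion from step 1 can be sketched in Python. The function names are hypothetical. Only .txt and .html files are rewritten, so the .bin file is never touched by this script; whether Guiguts minds the changed line endings in the text file is a separate question this sketch does not answer.

```python
from pathlib import Path

def to_crlf(data: bytes) -> bytes:
    """Return data with CRLF line endings, without doubling existing ones."""
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")

def convert_folder(folder, suffixes=(".txt", ".html")):
    """Rewrite matching files in place; other files (e.g. .bin) are left alone."""
    for path in Path(folder).iterdir():
        if path.is_file() and path.suffix in suffixes:
            path.write_bytes(to_crlf(path.read_bytes()))

print(to_crlf(b"one\ntwo\r\nthree\n"))
```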


D. HTML phase

Produce the HTML version.

D.1 Prepare illustrations

  1. Image filenames (same as all other files) should not include uppercase characters
  2. Straighten up: rotate; crop
  3. Clean up: Brightness/contrast; for black-and-white, Levels; if really clean then Threshold.
  4. Define format: .png better for lots of uniform color (e.g. line art), .jpg for lots of mixed color (e.g. photograph)
  5. Downsize: reduce colour depth for line art
  6. Resize: inline illustration if maximum 800 px greatest dimension; otherwise,
    • Thumbnail imageXX.png max 400 px,
    • High-resolution image imageXXh.png: rule 1,200 pixels; can be larger, but as small as possible without losing readability / detail (e.g. maps)
  7. Save: inside images subfolder
  8. Optimize image sizes: pngcrush for png, jpegoptim for thumbnail jpgs

D.2 Generate HTML

  1. Create HTML using Guiguts (Fixup > HTML Fixup), and think of it as a draft!
  2. Refer to http://www.gutenberg.org/wiki/Gutenberg:HTML_FAQ and PPTools/Guiguts/HTML

D.3 Clean up Guiguts HTML

  1. Change document title to The Project Gutenberg eBook of <title>, by <author>.
  2. Guiguts places page markers wherever they occur – even inside words (ewww!) Make words whole: regex (\w|\d|-)(<span class="pagenum".+\[Pg[^]]+\]</a></span>)((\w|\d|-)+) replace $1$3$2
  3. The CSS for the page numbers makes tags for [Blank page]s appear on top of each other. The tags are placed in reverse order: the latest tag first, so to remove the second consecutive tag (corresponding to a blank page):
    • regex (<a name="Page_\w+" id="Page_\w+">\[Pg \w+]</a></span>)<span class="pagenum"><[^>]+>[^<]+</a></span> replace $1
    • Perform global replaces until nothing is found (to check for two or more blank pages in a row).
    • Note: this will not remove the tag for the last page if it is blank.
  4. If the HTML is in UTF-8, change line <meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1" /> to UTF-8
  5. Opening/closing tags on separate lines: html, head, body; blank line at end of HTML (or else the PG white-washing tool will break)
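The "make words whole" replacement from step 2 could be scripted as below. It is a sketch, not the exact Guiguts regex: I made the middle part non-greedy (`.+?`) so that two page markers on one line are not swallowed together, and the sample markup is only assumed to resemble what Guiguts generates.

```python
import re

SPLIT_WORD_RE = re.compile(
    r'([\w-])(<span class="pagenum".+?\[Pg[^]]+\]</a></span>)([\w-]+)'
)

def rejoin_split_words(html):
    """Move a page marker that sits inside a word to the end of that word."""
    return SPLIT_WORD_RE.sub(r"\1\3\2", html)

marker = '<span class="pagenum"><a name="Page_12" id="Page_12">[Pg 12]</a></span>'
sample = "inter" + marker + "national trade"
print(rejoin_split_words(sample))
```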

D.4 Start the Transcriber's Notes

Right after the opening <body> tag, place (in my case):

  <div class="mynote">
  <p><b>Transcriber's notes:</b></p>
  </div>

Place here any transcriber's notes you have already written down.

D.5 Clean up the auto-generated TOC

  1. Did the book already have a TOC?
    • Then you don't need the auto-generated one: merge it with the one already in the book.
    • If the location of the existing TOC isn't obvious, place a link to it in the TN: "A table of contents can be found <here>"
  2. Else, convert auto-generated TOC to a bullet list (unless you want to delete it):
    • Change <p> ... </p> to <ul> ... </ul> (or perhaps <ul class="toc"> if you're giving it any special formatting)
    • Cut the TOC items and paste them in a separate text editor
    • Find <a> replace <li><a>
    • Find <br /> replace </li>
    • Find <b> replace with nothing; same for </b>
    • Now paste the changed TOC back in the book, inside the TN, labelled <p><b>Table of Contents:</b></p>
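The find-and-replace steps for the auto-generated TOC can be combined into one function. This is a sketch that assumes the TOC is a single <p> block of `<a href=...>` links separated by `<br />`, with optional <b> tags; the real Guiguts output may differ, and `toc_to_list` is an invented name.

```python
import re

def toc_to_list(toc_html):
    """Turn a <p> block of links separated by <br /> into a <ul> list."""
    items = toc_html.replace("<b>", "").replace("</b>", "")
    items = items.replace("<a ", "<li><a ").replace("<br />", "</li>")
    items = re.sub(r"^<p>", "<ul>", items.strip())
    items = re.sub(r"</p>$", "</ul>", items)
    return items

toc = (
    "<p>\n"
    '<a href="#CHAPTER_I"><b>CHAPTER I</b></a><br />\n'
    '<a href="#CHAPTER_II"><b>CHAPTER II</b></a><br />\n'
    "</p>"
)
print(toc_to_list(toc))
```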

D.6 Subscripts and superscripts

  1. Feminine ordinal: replace regex \^a\b with ª (apply automatically to text, not to scientific books)
  2. Masculine ordinal: replace regex \^o\b with º (apply automatically to text, not to scientific books)
  3. Superscript with { }: replace regex \^{([^}]+)} with <sup>$1</sup>
  4. Superscript without { }: replace regex \^(\w+) with <sup>$1</sup>
  5. Search for ^ (no regex) to see if anything escaped
  6. Search regex \^{?(\w+)}? replace with <sup>$1</sup>
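A Python sketch of steps 1 to 5, with the ordinal replacement made optional so it can be switched off for scientific books. The function name and the `ordinals` parameter are my own; the regexes are the ones listed above, rewritten for Python's `re` syntax.

```python
import re

def convert_superscripts(text, ordinals=True):
    """Replace ^ markup with ordinal signs and <sup> tags."""
    if ordinals:
        text = re.sub(r"\^a\b", "ª", text)
        text = re.sub(r"\^o\b", "º", text)
    text = re.sub(r"\^\{([^}]+)\}", r"<sup>\1</sup>", text)
    text = re.sub(r"\^(\w+)", r"<sup>\1</sup>", text)
    return text

sample = "N^o 5, M^{lle} Dupont, y^e olde"
converted = convert_superscripts(sample)
print(converted)
print("carets left:", converted.count("^"))
```

The final count corresponds to step 5: anything above zero escaped the replacements and needs a manual look.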

D.7 Insert illustrations

  1. Replace each [Illustration tag with HTML code (see CSS Cookbook/Images)
  2. Consider creating an index of illustrations inside the TN. If the work already contained an index of illustrations, consider putting a link to it in the TN.

D.8 Sequential inspection of HTML

Go through the whole file fixing anything and checking everything. Refer to the page scans as needed.

  1. Check headings.
    • If you need several levels of headings below <h2>, check your HTML code thoroughly
    • Be consistent: pay attention to heading levels, subtitle styling etc.
  2. Format tables.
    • Check anything that needs turning into tables, and turn it
    • All tables must have a summary attribute
  3. Deal with special formatting needs: front matter, advertising, <f> etc.
    • Note: revert page number styles to normal after changing any parent elements. For instance, if you have
      dt{letter-spacing:0.1em}
      you should then also have
      dt .pagenum{letter-spacing:0}
      to get back to the regular font for page numbers. Same thing for other properties like font-size etc.
  4. Convert internal references to internal links, e.g.
    • page numbers: See page...
    • references to plates and illustrations
    • references to chapters, sections etc.
  5. Deal with any special coding needs. A few past examples:
    • hanging pilcrows in poetry and plays
  6. Whenever you fix something, ask yourself: should I change the plain text file(s) as well?

D.9 Check links

  1. Check all internal links with Fixup > HTML Markup > Link Checker
  2. Projects are not supposed to have external links. I believe there is an exception for links between several HTML files inside the same e-book, but I am not sure (perhaps for breaking overly long books into chunks?).

D.10 Clean up CSS

  1. For each CSS instruction, check if it's being used in the book; if not, remove it
  2. How to look for use of a class name: find regex class="[^"]*\bclassname\b
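Checking every class by hand gets tedious, so here is a sketch that lists classes defined in the <style> block but never used in a class attribute. It assumes simple CSS (no @media blocks, no comments containing selectors) and double-quoted class attributes; `unused_classes` is a hypothetical name.

```python
import re

def unused_classes(html):
    """Return class names that appear in CSS selectors but in no class attribute."""
    style = "\n".join(re.findall(r"<style[^>]*>(.*?)</style>", html, flags=re.S))
    body = re.sub(r"<style[^>]*>.*?</style>", "", html, flags=re.S)
    defined = set()
    # Only the selector part (the text before each "{") is inspected.
    for selector in re.findall(r"([^{}]+)\{", style):
        defined.update(re.findall(r"\.([A-Za-z_][\w-]*)", selector))
    used = set()
    for value in re.findall(r'class="([^"]*)"', body):
        used.update(value.split())
    return sorted(defined - used)

sample = """<style type="text/css">
p.poem {margin-left: 2.5em}
.pagenum {font-size: small}
.unused {color: red}
</style>
<body><p class="poem first">x<span class="pagenum">[Pg 1]</span></p></body>"""
print(unused_classes(sample))
```

Rules for plain elements (no class) still have to be checked by eye.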

D.12 Apply Smooth Reading

  1. Unix/Mac: todos *.html (or unix2dos)
  2. Check out results from Smooth Reading, apply them to the files
  3. If Smooth Reading wasn't done, consider reading the e-book

D.13 Validate HTML

  1. Validate HTML: http://validator.w3.org/
  2. If there is a warning about a byte-order mark (BOM), open the file in an editor where you can see the BOM (e.g. Emacs on Linux) and remove it.
  3. Validate again with HTML Tidy: http://infohound.net/tidy/
  4. The two checks above may be done with Firefox HTML Tidy extension; check all settings, including Accessibility.
  5. Validate CSS: http://jigsaw.w3.org/css-validator/
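If no editor that shows the BOM is available, removing a UTF-8 byte-order mark is a short byte-level operation. A minimal sketch; the function name is made up, and it only handles the UTF-8 BOM, not UTF-16 ones.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 byte-order mark, if there is one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

print(strip_utf8_bom(codecs.BOM_UTF8 + b"<html>"))
```

To apply it to a file, read the file with `open(path, "rb")`, pass the bytes through the function, and write them back in binary mode.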

E. Final phase

Almost done...

E.1 Upload / Hand over to PPV

  1. Create zip with: *.txt; *.html; *.bin; images/
  2. Write any comments for PPV/DU. Indicate previous PPVer so that same person checks your work again.
  3. Upload.

E.2 Deal with PPV/WW feedback

  1. Go through everything on list. Add things you got wrong to this list.

E.3 Thank proofers

  1. Wait for file to be published to get link.
  2. Get list of proofers. Linux: grep -e '-----File: ' filename.txt | sed -e 's:\\:\n:g' | sed '/^$/d' | sed '/^-/d' | sort | uniq
  3. Also possible to get list from Guiguts (how?)
  4. Send a PM to all of them (or ask someone who can send it).
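For those not on Linux, the grep/sed pipeline from step 2 can be approximated in Python. The sketch assumes page separator lines of the form `-----File: 001.png---\name1\name2\-----`, which is what the pipeline itself implies; the proofer names in the sample are invented.

```python
def proofer_names(text):
    """Collect unique proofer names from the page separator lines."""
    names = set()
    for line in text.splitlines():
        if "-----File: " not in line:
            continue
        for part in line.split("\\"):
            # Skip empty pieces and the dashed filler around the names.
            if part and not part.startswith("-"):
                names.add(part)
    return sorted(names)

sample = (
    "-----File: 001.png---\\alice\\bob\\-----\n"
    "Some page text.\n"
    "-----File: 002.png---\\bob\\carol\\-----\n"
)
print(proofer_names(sample))
```

This has to be run on a copy of the text saved before the page separators were removed (see B.11).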

E.4 Archive project

  1. Remove any backups which you're sure you'll never want to look at again anytime in the future. Be conservative.
  2. Zip project, and move to storage (e.g. external drive)
  3. Congratulations! Do another!