User:Tintazul/PP Checklist
This is my own post-processing checklist. You may use it if you think it might be, well, useful. Feel free to drop me a line of comment, too. There are other PP checklists and info elsewhere on the wiki.
A. Preliminary phase
Pick a project and check whether it's complete.
A.1 Find project
- Read project comments.
- Read project forum.
- When happy, commit to PP.
A.2 Create local setup
- Create folder for project, <author>_<title>.
- Create notes file, notes_<title>.txt and write anything special from comments/forum.
- Download text into directory, unzip and rename as <title>.txt – lowercase letters only.
- Download good_words.txt and bad_words.txt (if any)
- Download page scans into subfolder pngs.
- Download illustrations into subfolder originals.
- Keep zips as backup, in case you have to get back to the pre-PP versions.
- Check that there's an illustration matching each Illustration tag. If any missing, talk to PM, or use the Missing page wiki.
A.3 Page labels
- Right-click bottom label reading, Lbl: Pg xxx
- Relabel pages according to images.
- If pages missing, talk to PM, or use the Missing page wiki.
B. Text phase, pre-fork
Deal with issues which affect both plain-text and HTML.
B.1 Sequential inspection of text
- Go through all scans, and note if markup is correct: blocks, quotes, illustrations, tables.
- Check if every paragraph is really a paragraph.
- Note anything that will need special treatment in the notes file: page references to turn into hyperlinks etc.
B.2 Apply errata
- If there's an errata list, apply it and leave a transcriber's note: "Errata on page XXX applied to text."
- Applying it silently is OK (in my opinion), since the typos were never the author's intention and so aren't considered part of the author's work.
- Addenda and corrigenda are not to be replaced, but should be left as is.
B.3 Resolve proofers' notes
- Search for regex [^/]\*[^/]
B.4 Check blocks
- Search > Find orphaned brackets and markup – do it for all blocks
- Search > Find next /* ... */ block.
- Note that to keep any block from rewrapping it must be indented, so the rule is /*[somenumber]
- Same for all other block types.
B.5 Basic Fixup
- Fixup > Run Fixup
- Fixup > Remove end-of-line spaces
B.6 Check punctuation
- Search for opening punctuation inside opening tag: regex (<(i|b|sc|g|f)>)([([{"«]) (and replace $3$1 if needed)
- Search for other punctuation inside closing tag: regex ([,;.:!?"»)\]}])(</(i|b|sc|g|f)>) (and replace $2$1 if needed)
- Search for other punctuation outside closing tag: regex (</(i|b|sc|g|f)>)([,;.:!?"»)\]}]) (and replace $3$1 if needed)
- Check ellipses according to http://www.pgdp.net/c/faq/proofreading_guidelines.php#period_p
- Check em dash -- against page images -- regex for regular dash: ([^-]|\n)--([^-]|\n)
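The punctuation searches above can be tried outside Guiguts with a rough Python sketch like the one below. The function names `find_candidates` and `move_outside` are made up for this example; the patterns are the B.6 regexes rewritten in Python syntax (`\1` instead of `$1`, non-capturing group for the tag names).

```python
import re

TAGS = r"(?:i|b|sc|g|f)"

# Opening punctuation just inside an opening tag, e.g. <i>"  ->  "<i>
OPEN_INSIDE = re.compile(r'(<' + TAGS + r'>)([(\[{"«])')
# Other punctuation just inside a closing tag, e.g. ,</i>  ->  </i>,
CLOSE_INSIDE = re.compile(r'([,;.:!?"»)\]}])(</' + TAGS + r'>)')


def find_candidates(text):
    """Return (line_number, matched_text) pairs to check against the scans."""
    hits = []
    for pattern in (OPEN_INSIDE, CLOSE_INSIDE):
        for m in pattern.finditer(text):
            line_no = text.count("\n", 0, m.start()) + 1
            hits.append((line_no, m.group(0)))
    return sorted(hits)


def move_outside(text):
    """Move the punctuation outside the tags (only if that is what you want)."""
    text = OPEN_INSIDE.sub(r"\2\1", text)
    return CLOSE_INSIDE.sub(r"\2\1", text)
```

Running `find_candidates` first and replacing by hand is closer to the "if needed" in the checklist; `move_outside` applies every replacement blindly.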
B.7 Check Transliterations
- Search for [Greek and check all against page scans
- Use http://www.gutenberg.org/wiki/Gutenberg:Greek_How-To as a transliteration guide
- If you have Greek, then you'll benefit from a UTF-8 version (at least in the HTML)
- Search regex: \[[^FIS]
B.8 Fix Sidenotes
- Fixup > Sidenote fixup
B.9 Fix Footnotes
B.10 Fix Poetry Line Numbers
B.11 Remove Page Separators
- Save a copy of the text before removing separators.
- Fixup > Remove page separators
B.12 Apply Spellcheck
- Guiguts can't spellcheck accented chars. Copy text to other program to spellcheck. (Spellcheck wasn't working for me under Linux; it now works under Mac.)
- For English texts, run jeebies
B.13 Apply Word-Frequency Checks
- Go through all buttons.
B.14 Apply Scanno Checks
B.15 Apply Gutcheck
- Run through whole list. See Gutcheck options because of HTML tags.
- Turn everything off except for Short line and Long line. Play around with the line length until warnings go away.
B.16 Check balanced markup
- If there are <tb> the check won't work. Change first <tb> to _THOUGHTBREAK_, then perform the test, then change back to <tb>.
- Search regex \<(\w+)>\n?[^<]+<(?!/\1>)
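As a cross-check for the regex above, here is a small Python sketch that uses a stack instead of a regex (a different technique from the Guiguts search, named plainly). The function name `unbalanced_markup` and the set of inline tags are assumptions for this example. Because `<tb>` is simply skipped, the _THOUGHTBREAK_ workaround is not needed here; unlike the regex, properly nested tags are accepted.

```python
import re

TAG = re.compile(r"<(/?)(\w+)>")
INLINE = {"i", "b", "sc", "g", "f"}  # assumed list of paired inline tags


def unbalanced_markup(text):
    """Return (line_number, message) for each markup problem found."""
    problems = []
    stack = []  # (tag name, line number) of tags that are still open
    for m in TAG.finditer(text):
        closing, name = m.group(1) == "/", m.group(2)
        if name not in INLINE:
            continue  # <tb> and anything else is left alone
        line_no = text.count("\n", 0, m.start()) + 1
        if not closing:
            stack.append((name, line_no))
        elif stack and stack[-1][0] == name:
            stack.pop()
        else:
            problems.append((line_no, f"unexpected </{name}>"))
    problems.extend((line_no, f"<{name}> never closed") for name, line_no in stack)
    return problems
```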
B.17 Check subscripts and superscripts
- Make sure there aren't any º or ^o which should be degree signs (°)
- If your book is LOTE, then you know the basic text will be Latin-1, so you can change the ordinals:
- Search regex \^o\b replace with º
- Search regex \^a\b replace with ª
- Subscripts: search for _ – must be formatted as _{ }, see http://www.pgdp.net/c/faq/document.php#subscr
- Superscripts: search for ^ – may be marked as ^ or ^{ }, see http://www.pgdp.net/c/faq/document.php#supers
C. Post-fork
These changes only affect the text-only version (but note exceptions C.7 and C.8, which apply to all UTF-8 versions).
C.1 Fork for HTML
- Save your work.
- Save a second time, as <title>.html
- Continue working on .txt
- From now on, whenever you change anything in the .txt file, always consider applying the same change to the <title>.html file
C.2 Convert <tb>, Italic, Bold, and Small caps
- <tb>: Fixup > Convert <tb> to asterisk break
- Bold surrounded by *? search regex </?b> replace *
- Italics surrounded by _? search regex </?i> replace _
- Small caps to uppercase? search regex <sc>(\n?[^<]+)</sc> replace \U$1\E
- Create Transcriber's Notes explaining any use of characters to represent formatting, e.g. * = _ ^{} []
- Search for < to see if you've missed anything (you'll find <f> if you have it)
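The C.2 replacements can be sketched in Python as follows. Python's `re` has no `\U...\E`, so the small-caps step uses a replacement function instead. The names `markup_to_plain` and `leftover_tags` are invented for this sketch, and the choice of `*` and `_` follows the questions in the list above; `<tb>` is left for Guiguts.

```python
import re


def markup_to_plain(text):
    """Apply the C.2 conversions for the plain-text version."""
    text = re.sub(r"</?b>", "*", text)   # bold surrounded by *
    text = re.sub(r"</?i>", "_", text)   # italics surrounded by _
    # Small caps to uppercase: same pattern as above, uppercased by a function.
    text = re.sub(r"<sc>(\n?[^<]+)</sc>", lambda m: m.group(1).upper(), text)
    return text


def leftover_tags(text):
    """List any remaining <...> tags, like the final search for '<'."""
    return sorted(set(re.findall(r"</?\w+>", text)))
```

For example, `leftover_tags(markup_to_plain(text))` should come back empty, or show only tags such as `<f>` or `<tb>` that still need a decision.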
C.3 Format Non-wrapping Text
- Check front matter. No need to center lines in text files.
- Fix ASCII Tables. See Fixup > ASCII Table Special Effects if needed.
- Align Tables of Contents.
C.4 Localize labels
- "Footnote" / "Sidenote" / "Illustration" / "Greek" must be translated for LOTE
- Describe any illustrations which don't have a caption (i.e., never leave the text edition with a bare [Illustration])
C.5 Replace ordinals
- Any ordinals found will mean that text will need a Latin-1/UTF-8 version
- Feminine ordinal: replace regex \^a\b with ª (apply automatically to text, not to scientific books)
- Masculine ordinal: replace regex \^o\b with º (apply automatically to text, not to scientific books)
C.6 Fork text according to character coding
- Search regex \P{IsASCII} – anything found means not just ASCII
- DP-INT: <name>-asc.txt mandatory for English, <name>-lt1.txt for LOTE; DP-Canada: <name>-lt1.txt mandatory in all cases
- optional <name>-utf8.txt
- For HTML, use character entities as much as possible to avoid encoding as UTF-8
- HTML <name>-lt1.txt or <name>-utf8.txt (as needed)
- Prepare different text versions, using ['e] etc. and leaving transcriber's notes as necessary
- Do C.3 Format Non-wrapping Text again on any text versions where characters were changed; note that replacing é with ['e] may cause alignment issues.
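To decide which text versions are needed, it can help to see exactly which non-ASCII characters are in the file. The sketch below is a Python stand-in for the `\P{IsASCII}` search; the function name `non_ascii_report` and the shape of the result are assumptions of this example.

```python
import unicodedata
from collections import Counter


def non_ascii_report(text):
    """Return (char, Unicode name, count, fits_in_latin1) for each non-ASCII char."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    return [
        (ch, unicodedata.name(ch, "UNKNOWN"), n, ord(ch) <= 255)
        for ch, n in sorted(counts.items())
    ]
```

An empty list means plain ASCII is enough; if every entry has `fits_in_latin1` set to `True`, a Latin-1 version can hold the text; anything else points to a UTF-8 version (or bracketed replacements such as ['e]).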
C.7 [Greek: tag to Greek characters
Note: This applies to all UTF-8 versions, text and HTML.
- Consider leaving the Latin transliteration as a mouse-over tip:
- Regex (\[Greek: )([^]]+)(]) replace <abbr class="greek" title="$2">$1$2$3</abbr>
- Tip readers to the existence of alternate text with CSS, e.g. abbr.greek{border-bottom:1px dotted black}
- Either search for each occurrence of [Greek:, or if you have lots of Greek try some systematic process, like this:
- The searches below will break if you have a [ character inside a [Greek: tag.
- Each of the below substitutions must be applied repeatedly, until no results are found. They catch only one occurrence inside each [Greek: tag.
- Transform line breaks within Greek into spaces: regex (\[Greek: [^]]*)\n replace $1 (remember the space after $1 ).
- Search for page tags within Greek: regex (\[Greek: [^]]*)span. Move the page tags manually outside, or break Greek tag in two and leave page tag in between.
- Replace Latin letters with Greek letters. Pay attention to the order: digraphs and accented letters first. Don't forget similar letters like uppercase A and uppercase alpha.
- Example: regex (\[Greek: [^]]*)C[Hh]([^]]*]) replace $1Χ$2 (a capital chi)
- TODO: create systematic way of applying these replacements, instead of typing them each time (sed script?)
- Check for anything left inside the tag: regex (\[Greek: [^]]*)([A-Za-z'`])([^]]*\]) replace $1_$2_$3 to mark it. Undo the replace and solve as needed. Search again.
- Do you need to replace Greek punctuation?
- After replacing systematically, go through each [Greek: tag, checking it against the page scan. Add breathing signs ῾ ᾽ etc.
- Delete Greek tags: regex \[Greek: ([^]]*)] replace $1
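One possible answer to the TODO above, in Python rather than sed: a table-driven pass over every [Greek: tag. This is only a sketch. The letter table is illustrative and incomplete (an assumption for this example, not a copy of the Greek How-To); it ignores accents, breathings and Greek punctuation, and the names `to_greek` and `convert_greek_tags` are made up. Letters it doesn't know (such as h) are left in Latin so that the "anything left inside the tag" check still finds them.

```python
import re

# Illustrative, incomplete table: digraphs are tried before single letters.
DIGRAPHS = {"th": "θ", "ph": "φ", "ch": "χ", "ps": "ψ"}
SINGLES = {
    "a": "α", "b": "β", "g": "γ", "d": "δ", "e": "ε", "z": "ζ", "ê": "η",
    "i": "ι", "k": "κ", "l": "λ", "m": "μ", "n": "ν", "x": "ξ",
    "o": "\u03bf",  # Greek omicron, easy to confuse with Latin o
    "p": "π", "r": "ρ", "s": "σ", "t": "τ", "u": "υ", "y": "υ", "ô": "ω",
}
GREEK_TAG = re.compile(r"\[Greek: ([^\]]*)\]")


def to_greek(latin):
    """Replace Latin letters with Greek ones; unknown characters stay as they are."""
    out, i = [], 0
    while i < len(latin):
        pair = latin[i:i + 2]
        if pair.lower() in DIGRAPHS:
            greek, step = DIGRAPHS[pair.lower()], 2
        else:
            greek, step = SINGLES.get(latin[i].lower(), latin[i]), 1
        out.append(greek.upper() if latin[i].isupper() else greek)
        i += step
    # A sigma at the end of a word becomes a final sigma.
    return re.sub(r"σ\b", "ς", "".join(out))


def convert_greek_tags(text):
    """Convert the inside of each [Greek: ...] tag, keeping the tag for review."""
    def repl(m):
        inner = m.group(1).replace("\n", " ")  # line breaks inside Greek become spaces
        return "[Greek: " + to_greek(inner) + "]"
    return GREEK_TAG.sub(repl, text)
```

The tags are deliberately kept, so each one can still be checked against the page scan before the final "Delete Greek tags" step.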
C.8 A.O.F.Ch. (Any Other Funky Characters)
Note: This applies to all UTF-8 versions, text and HTML.
- Do you have any other weird characters? Your UTF-8 versions will be much spiffier ;-) if they don't have any [ ] for characters which exist.
- For transliterating Arabic, Cyrillic, hieroglyphs etc. you may need to ask for expert help. See the Language Skills List or the language-related forums.
- Look for any characters encoded with [, find the corresponding character in a Character Map application in your computer and replace it.
C.9 Rewrap and Clear Rewrap Markers
- Save backup
- Select all, Selection > Rewrap Selection.
- Fixup > Clean rewrap markers
- Rerun Gutcheck, noting short and long lines only.
C.10 Last Re-check For Text
- Sometimes you may be left with five blanks instead of four, and three instead of two (why?). To fix:
- Five blanks: Search regex (\n\n\n\n\n)\n replace $1
- Three blanks: Search regex ([^\n]\n\n\n)\n([^\n]) replace $1$2
- Go through the text; fix issues as needed; if necessary reopen pre-rewrap saved file. Apply any needed changes also to .html file.
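The two blank-line fixes can be written in Python roughly like this. The function name `fix_blank_runs` is invented for the sketch, and the first pattern is slightly more general than the regex above: it collapses any run of five or more blank lines to four in a single pass.

```python
import re


def fix_blank_runs(text):
    """Collapse 5+ blank lines to 4, and exactly 3 blank lines to 2."""
    # Six or more newlines in a row means five or more blank lines.
    text = re.sub(r"\n{6,}", "\n" * 5, text)
    # Exactly four newlines between two text lines means three blank lines.
    text = re.sub(r"(?<=[^\n])\n{4}(?=[^\n])", "\n" * 3, text)
    return text
```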
C.11 Smooth Reading
- Unix/Mac: convert line endings by running todos on *.txt and *.html (or unix2dos; do we break .bin file here?)
- Optional. The more errors found during PP, the more reason to put up for SR. Offer least-modified text file for SR (usually Latin-1).
- Upload. Write any comments useful for Smooth Readers.
D. HTML phase
Produce the HTML version.
D.1 Prepare illustrations
- Image filenames (same as all other files) should not include uppercase characters
- Straighten up: rotate; crop
- Clean up: Brightness/contrast; for black-and-white, Levels; if really clean then Threshold.
- Define format: .png better for lots of uniform color (e.g. line art), .jpg for lots of mixed color (e.g. photograph)
- Downsize: reduce colour depth for line art
- Resize: inline illustration if maximum 800 px greatest dimension; otherwise,
- Thumbnail imageXX.png max 400 px,
- High-resolution image imageXXh.png: rule 1,200 pixels; can be larger, but as small as possible without losing readability / detail (e.g. maps)
- Save: inside images subfolder
- Optimize image sizes: pngcrush for png, jpegoptim for thumbnail jpgs
D.2 Generate HTML
- create HTML using Guiguts – and think of it as a draft! – Fixup > HTML Fixup
- Refer to http://www.gutenberg.org/wiki/Gutenberg:HTML_FAQ and PPTools/Guiguts/HTML
D.3 Clean up Guiguts HTML
- Change document title to The Project Gutenberg eBook of <title>, by <author>.
- Guiguts places page markers wherever they occur – even inside words (ewww!) Make words whole: regex (\w|\d|-)(<span class="pagenum".+\[Pg[^]]+\]</a></span>)((\w|\d|-)+) replace $1$3$2
- The CSS for the page numbers makes tags for [Blank page]s appear on top of each other. The tags are placed in reverse order: the latest tag first, so to remove the second consecutive tag (corresponding to a blank page):
- regex (<a name="Page_\w+" id="Page_\w+">\[Pg \w+]</a></span>)<span class="pagenum"><[^>]+>[^<]+</a></span> replace $1
- Perform global replaces until nothing is found (to check for two or more blank pages in a row).
- Note: this will not remove the tag for the last page if it is blank.
- If the HTML is in UTF-8, change line <meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1" /> to UTF-8
- Opening/closing tags on separate lines: html, head, body; blank line at end of HTML (or else the PG white-washing tool will break)
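The two page-marker clean-ups above can be sketched in Python as below. It assumes the markers look like `<span class="pagenum"><a name="Page_12" id="Page_12">[Pg 12]</a></span>`, which is what the regexes in this section expect; the function names are invented, and the first pattern uses a non-greedy `.+?` where the original has `.+`.

```python
import re

PAGE_IN_WORD = re.compile(
    r'([\w-])(<span class="pagenum".+?\[Pg[^\]]+\]</a></span>)([\w-]+)')
STACKED_TAG = re.compile(
    r'(<a name="Page_\w+" id="Page_\w+">\[Pg \w+\]</a></span>)'
    r'<span class="pagenum"><[^>]+>[^<]+</a></span>')


def make_words_whole(html):
    """Move a page marker sitting inside a word to the end of that word."""
    return PAGE_IN_WORD.sub(r"\1\3\2", html)


def drop_stacked_page_tags(html):
    """Remove the second of two consecutive page tags, repeating until none are left."""
    while True:
        html, count = STACKED_TAG.subn(r"\1", html)
        if count == 0:
            return html
```

The loop in `drop_stacked_page_tags` is the "global replaces until nothing is found" step: one pass only removes every other tag in a longer run.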
D.4 Start the Transcriber's Notes
Right after the opening <body> tag, place (in my case):
<div class="mynote">
<p><b>Transcriber's notes:</b></p>
</div>
Place here any transcriber's notes you have already written down, if any.
D.5 Clean up the auto-generated TOC
- Did the book already have a TOC?
- Then you don't need the auto-generated one: merge it with the one already in the book.
- If the location of the existing TOC isn't obvious, place a link to it in the TN: "A table of contents can be found <here>"
- Else, convert auto-generated TOC to a bullet list (unless you want to delete it):
- Change <p> ... </p> to <ul> ... </ul> (or perhaps <ul class="toc"> if you're giving it any special formatting)
- Cut the TOC items and paste them in a separate text editor
- Find <a replace <li><a (the links carry an href, so search for the opening <a without its closing bracket)
- Find <br /> replace </li>
- Find <b> replace with nothing; same for </b>
- Now paste the changed TOC back in the book, inside the TN, labelled <p><b>Table of Contents:</b></p>
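Instead of a separate text editor, the find-and-replace steps can be done in one go with a Python sketch like this. It assumes each auto-generated entry has the form `<a href="...">...</a><br />` inside a single `<p> ... </p>`; that layout, the function name `toc_to_list` and the optional `css_class` parameter are assumptions for this example.

```python
import re


def toc_to_list(toc_html, css_class=None):
    """Turn the auto-generated <p>...<br />... TOC into a <ul> bullet list."""
    items = toc_html.replace("<b>", "").replace("</b>", "")
    items = items.replace("<a ", "<li><a ").replace("<br />", "</li>")
    opening = f'<ul class="{css_class}">' if css_class else "<ul>"
    items = re.sub(r"^\s*<p>", opening, items)
    return re.sub(r"</p>\s*$", "</ul>", items)
```

Calling `toc_to_list(toc, css_class="toc")` gives the `<ul class="toc">` variant mentioned above.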
D.6 Subscripts and superscripts
- Feminine ordinal: replace regex \^a\b with ª (apply automatically to text, not to scientific books)
- Masculine ordinal: replace regex \^o\b with º (apply automatically to text, not to scientific books)
- Superscript with { }: replace regex \^{([^}]+)} with <sup>$1</sup>
- Superscript without { }: replace regex \^(\w+) with <sup>$1</sup>
- Search for ^ (no regex) to see if anything escaped
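- Alternatively, do both superscript forms in one pass: search regex \^{?(\w+)}? replace with <sup>$1</sup> (this only catches braces containing nothing but word characters)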
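The D.6 replacements, sketched in Python. The function name `superscripts_to_html` and the `ordinals` switch are assumptions of this example; the switch is off by default because the ordinal replacement should not be applied to scientific books. Order matters: ordinals first, then the braced form, then the bare form.

```python
import re


def superscripts_to_html(text, ordinals=False):
    """Convert ^x and ^{xyz} to <sup> tags, optionally handling ordinals first."""
    if ordinals:
        text = re.sub(r"\^a\b", "\u00aa", text)  # feminine ordinal
        text = re.sub(r"\^o\b", "\u00ba", text)  # masculine ordinal
    text = re.sub(r"\^\{([^}]+)\}", r"<sup>\1</sup>", text)  # superscript with { }
    text = re.sub(r"\^(\w+)", r"<sup>\1</sup>", text)        # superscript without { }
    return text
```

Afterwards, `"^" in result` is the quick equivalent of searching for ^ to see if anything escaped.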
D.7 Insert illustrations
- Replace each [Illustration tag with HTML code (see CSS Cookbook/Images)
- Consider creating an index of illustrations inside the TN. If the work already contained an index of illustrations, consider putting a link to it in the TN.
D.8 Sequential inspection of HTML
Go through the whole file fixing anything and checking everything. Refer to the page scans as needed.
- Check headings.
- If you need several levels of headings below <h2>, check your HTML code thoroughly
- Be consistent: pay attention to heading levels, subtitle styling etc.
- Format tables.
- Check anything that needs turning into tables, and turn it
- All tables must have a summary attribute
- Deal with special formatting needs: front matter, advertising, <f> etc.
- Note: revert page number styles to normal after changing any parent elements. For instance, if you have
dt{letter-spacing:0.1em}
you should then also have
dt .pagenum{letter-spacing:0}
to get back to the regular font for page numbers. Same thing for other properties like font-size etc.
- Convert internal references to internal links, e.g.
- page numbers: See page...
- references to plates and illustrations
- references to chapters, sections etc.
- Deal with any special coding needs. A few past examples:
- hanging pilcrows in poetry and plays
- Whenever you fix something, ask yourself: should I change the plain text file(s) as well?
D.9 Check links
- Check all internal links with Fixup > HTML Markup > Link Checker
- Projects are not supposed to have external links; there's an exception for having several HTML files inside the same e-book, but I am not sure (for breaking overly long books into chunks, perhaps?)
D.10 Clean up CSS
- For each CSS instruction, check if it's being used in the book; if not, remove it
- how to look for use of a classname: Find regex class="[^"]*\bclassname\b
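Checking class names one by one gets tedious, so here is a rough Python sketch that lists every class declared in the embedded stylesheet but never used in a class attribute. It assumes the CSS sits in a `<style>` element in the same file and only looks at class selectors (not ids or element selectors); the function name `unused_classes` is made up, and nested blocks such as @media rules are not handled carefully.

```python
import re


def unused_classes(html):
    """Return class names declared in <style> but absent from every class="..."."""
    css = "".join(re.findall(r"<style[^>]*>(.*?)</style>", html, flags=re.S))
    css = re.sub(r"/\*.*?\*/", "", css, flags=re.S)   # drop CSS comments
    selectors = re.sub(r"\{[^}]*\}", " ", css)         # drop declaration bodies
    declared = set(re.findall(r"\.([A-Za-z_][\w-]*)", selectors))
    used = set()
    for value in re.findall(r'class="([^"]*)"', html):
        used.update(value.split())
    return sorted(declared - used)
```

Treat the result as a list of candidates: check each one by hand before deleting the rule.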
D.12 Apply Smooth Reading
- Unix/Mac: todos *.html (or unix2dos)
- Check out results from Smooth Reading, apply them to the files
- If Smooth Reading wasn't done, consider reading the e-book
D.13 Validate HTML
- Validate HTML: http://validator.w3.org/
- If there is a warning about a byte-order mark (BOM), open the file in an editor where you can see the BOM (e.g. Emacs on Linux) and remove it.
- Validate again with HTML Tidy: http://infohound.net/tidy/
- The two checks above may be done with Firefox HTML Tidy extension; check all settings, including Accessibility.
- Validate CSS: http://jigsaw.w3.org/css-validator/
E. Final phase
Almost done...
E.1 Upload / Hand over to PPV
- Create zip with: *.txt; *.html; *.bin; images/
- Write any comments for PPV/DU. Indicate previous PPVer so that same person checks your work again.
- Upload.
E.2 Deal with PPV/WW feedback
- Go through everything on list. Add things you got wrong to this list.
E.3 Thank proofers
- Wait for file to be published to get link.
- Get list of proofers. Linux: grep -e '-----File: ' filename.txt | sed -e 's:\\:\n:g' | sed '/^$/d' | sed '/^-/d' | sort | uniq
- Also possible to get list from Guiguts (how?)
- Send a PM to all of them (or ask someone who can send it).
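For anyone without grep and sed, the proofer list can be pulled out with a Python sketch that follows the same steps as the pipeline above: take the page separator lines, split them on backslashes, drop empty fields and fields starting with a dash, then sort and de-duplicate. The function name `proofer_names` is invented, and the separator layout (names between backslashes after the file name) is an assumption based on what that pipeline expects.

```python
def proofer_names(text):
    """Return the sorted, unique proofer names found in the page separators."""
    names = set()
    for line in text.splitlines():
        if "-----File: " not in line:
            continue  # not a page separator line
        for field in line.split("\\"):
            if field and not field.startswith("-"):
                names.add(field)
    return sorted(names)
```

Run it on the copy of the text saved before removing the page separators (B.11), since the names only exist in those lines.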
E.4 Archive project
- Remove any backups which you're sure you'll never want to look at again anytime in the future. Be conservative.
- Zip project, and move to storage (e.g. external drive)
- Congratulations! Do another!