User:Monicas wicked stepmother/MWS PP Checklist

From DPWiki

My Post Processing checklist (in progress)

Create new subfolder under PP for the downloaded files, with a name connected to the book title.

1. Download and unzip the files into folders. Unzip images into the "png" subfolder and then move the non-png page images into the "originals" subfolder. Make the originals folder and the projectXXXX.txt file read-only.

2. Open the projectxxx.txt file in GG, and immediately re-name as catchyname-master.txt. Create a text file in Notepad++ with the name catchynamePPnotes.txt in the project folder, for all your notes.

2. Check all pages are present - combine this with next step.

3. Recalculate page numbers. Find the first page of text with a page number, calculate back to the first page number of the book, and then right-click on LBL on the toolbar (usually at the bottom of GG) to get the page numbering tool. Check the last page (press the "End" key in XnView) to make sure that the last page number is also correct. You may need to repaginate if illustrations are not included in the page numbers. This is VITAL if the book has an index, or refers in the text or footnotes to other pages in the book. It also means that page numbers (if visible) aren't shown for the early (often blank) pages. Search through the pages for illustrations, and make those pages (and the blanks behind them) "No count" on the page numbering tool. IF YOU DON'T DO THIS BEFORE GENERATING THE HTML, ALL YOUR PAGE NUMBERS WILL BE INCORRECT (unless you are VERY lucky). If you need another reason to do this early, sometimes you find that page numbers aren't consecutive, which can mean a missing page or two - it's much better to find this BEFORE you get into PPing the project.

4 Carefully examine the text, check for words split over page breaks and move them. Make careful note of any ** notes, and change as necessary. Look at the png images to make sure all formatting is done (especially important for P2 skipped projects). Remove all [blank page]. I try to remove page breaks here, rejoining words split over pages (move second part of word up to previous page, DO NOT allow the word to be split over the page - this causes problems with search, text-to-speech of HTML and derivatives). Re-join split footnotes onto a single page.

Make a note of EVERYTHING that needs to be checked later, with page numbers, or png numbers if the page is not numbered (that's why repagination first is important - otherwise the TN at the end won't have the right page numbers). Things to note for later use: foreign languages if not in italics, abbreviations, and extra levels of headings (GG recognises 4 blank lines and 2 blank lines, but can't differentiate between sub and sub-sub headings).

5 Check all markup - if foreign words are used, make notes (the HTML will need to be marked up). Change front matter to /F F/, table of contents to /X X/, lists to /L L/ and poems to /P P/ (if you use capitals rather than lower case you will get a LOT fewer gutcheck/bookloupe flags). Remove markup that extends over pages (make sure it is still balanced). Make sure that all poem lines are correctly indented.

6 Check all asterisks without slash - this will point to all proofers' notes, EOL and EOP hyphens

7 Check for all orphaned brackets, and all orphaned DP markup.

8 Run basic fixup with all options checked (untick space around hyphens if required)

9 Run "remove end of line spaces"

10 Format front matter, enclose in /F F/

11 Edit transliterations, using the GG menu Tools -> Character Tools -> Search for transliterations

12 Remove physical page separators (I usually do this in careful reading), but check to see none left.

13 Word frequency to determine spelling/scannos

14 Scanno checks

15 Jeebies

16 Gutcheck

17 Spell check

18 FOOTNOTES - use footnote fixup AFTER checking that all footnotes are formatted correctly, and are joined if over multiple pages. DO NOT use "footnote tidy" on the footnotes palette - it is for text editions only.

19. Before splitting files, decide on character encoding. Save xxxmaster as Latin1, then convert to utf8 characters and save as xxxmaster_utf8.txt. Finally, save that file as xxx1_utf8.txt and xxx1.html. You are then left with two source files, the latin1 and utf8 text files, which are the basis for both the final txt and html files in case you need to restart, or want to compare the finished file to the start (to make sure no text etc. has been deleted).

Characters to convert:

  • fractions: Search: /(\d)
  • [oe]
  • all macrons: regex search for (?<=\[)([^FSIG\d])
  • greek
  • ordinals, superscripts and subscripts
  • emdash
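For anyone who wants to try a few of these conversions outside GG, here is a minimal Python sketch. It assumes DP-style bracket markup ([oe] and [OE] for the ligature, [=x] for a letter with a macron) and a double hyphen for the dash; the function name convert_markup is made up for illustration, and fractions, Greek, ordinals, superscripts and subscripts are not covered.

```python
import re
import unicodedata

# Assumed DP bracket markup for the oe ligature.
LIGATURES = {"[oe]": "\u0153", "[OE]": "\u0152"}


def convert_markup(text: str) -> str:
    """Convert a few bracketed markups to their UTF-8 characters."""
    for markup, char in LIGATURES.items():
        text = text.replace(markup, char)

    def macron(match: re.Match) -> str:
        # Base letter + combining macron, normalised to the precomposed form.
        return unicodedata.normalize("NFC", match.group(1) + "\u0304")

    text = re.sub(r"\[=([A-Za-z])\]", macron, text)

    # Only an isolated double hyphen becomes a dash; longer runs are left alone.
    text = re.sub(r"(?<!-)--(?!-)", "\u2014", text)
    return text


print(convert_markup("c[oe]ur, [=a]--[=E]"))
```

Longer hyphen runs (such as ---- for a long dash) are deliberately left untouched so they can be checked by hand.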


BEFORE generating HTML, use autoindex on the index section and surround it with /X X/. Optionally, change table markup to /X X/. Convert to utf8 characters now, if not done earlier.

Generate HTML

Use the GG HTML generator. Check the "page anchors", "CSS blockquote", keep utf8, Latin-1 characters. When finished, IMMEDIATELY run a validator test (in GG: HTML -> HTML Validator) - fix all problems, usually caused by page number code not being in block markup <p>, or page numbers with block markup within a list - delete the block markup around the page anchor and move inside the list markup. I find it much easier to KEEP an html file validated by using the tool regularly than the alternative of making lots of changes and then seeing what's gone wrong and trying to fix it. Sometimes the validator can give several screens of error messages, but they are all spawned from one simple error - so it's easier to test after each change you make to the file to make sure you have done it correctly. It is much harder to find the "real" error amongst the list of error messages, but often the first item listed is the real problem.

Format front matter

  • Make sure <h1> markup is correctly around the whole of the title, use <span> or <small> and <big> to make necessary font size changes.

Check chapter and other headings markup

  • Make sure all the chapter heading is inside the <h2> markup, with different sizes of font if necessary to replicate original. If there are lower level headings, make sure they have been given the correct markup.
  • To remove the hyperlinks from chapter headings:

Search: <h2><a name="([\w\s\p{IsPunct}\n]+?)" id="([\w\s\p{IsPunct}\n]+?)">((.|\n)+?)</a> Replace: <h2>$3

  • Use <div class="chapter"> around all chapter starts to force a 'chunk' break in epub. This stops a new chunk being made in the beginning of a chapter. If the chapters are very short, I only force a chunk break at every second, third etc. chapter - this is to avoid an epub file being made up of many, many small parts.
  • Change contents into table format, ditto for illustration list (if given)
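The chapter-heading search and replace above can also be tried in Python. This is only a sketch: Python's re module does not support \p{IsPunct}, so the attribute values are matched with [^"]+ instead, which is an assumption about what the generated anchors look like. The function name strip_heading_links is hypothetical.

```python
import re

# Matches <h2><a name="..." id="...">heading text</a>, heading may span lines.
H2_LINK = re.compile(r'<h2><a name="[^"]+" id="[^"]+">(.+?)</a>', re.DOTALL)


def strip_heading_links(html: str) -> str:
    """Remove the anchor wrapped around each <h2> heading, keeping the text."""
    return H2_LINK.sub(r"<h2>\1", html)


sample = '<h2><a name="CHAPTER_I" id="CHAPTER_I">CHAPTER I.</a></h2>'
print(strip_heading_links(sample))
```

If other pages link to those heading anchors (for example a table of contents), the id would need to be moved onto the <h2> itself rather than dropped.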

Convert italic markup

  • Convert italic markup to lang, cite or em

Search: <i>((.|\n)+?)</i>

Replace: <em>$1</em> Replace: <cite>$1</cite> Replace: <i lang="fr" xml:lang="fr">$1</i> (or whatever language is most common). If there are a LOT of different foreign language markups, change the third option to <lang>$1</lang>, and then run a second S/R with the three most common language markups in the replace fields:

Search: <lang>((.|\n)+?)</lang>

Replace: <i lang="fr" xml:lang="fr">$1</i> Replace: <i lang="la" xml:lang="la">$1</i> Replace: <i lang="it" xml:lang="it">$1</i>

  • similarly, convert bold to strong where appropriate:

Search: <b>((.|\n)+?)</b> Replace: <strong>$1</strong>

Convert poetry

  • Convert all poetry to Best Practices
  • This uses a few search / replace. I replace the front matter first:

Search: <div class="poem"> Replace: <div class="poetry-container"><div class="poetry">

  • then

Search: <span class="i0"> Replace: <div class="verse"> (note indent on replace)

Then each indented span separately - just keep the same code and increment the numbers by two until you're not getting any more hits e.g.: Search: <span class="i2"> Replace: <div class="verse indent2">

- you can do "replace all" on the first three S & R, there won't be any other code that could be confused with it. Take note of which indent numbers are found, then you can delete unwanted CSS indents from the header. For a final check, search for [<span class="i] to make sure none are left over.

Then the end of each line: Search: <br /></span> Replace: </div> (be careful here, don't do a "replace all" unless you are sure there are no other instances of <br /></span> outside poetry).

Finally: Search: </div></div> Replace: </div></div></div> (again, carefully here, there may be multiple closing divs outside poetry markup, e.g. images). I do the poetry changes immediately after the HTML is generated, because there are fewer instances of multiple closing divs then.

All this takes me about five minutes for an entire book of poetry. I do go through and add a bit of indentation to make it neater, but that's optional.
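The same sequence of poetry replacements can be sketched in Python. The sketch assumes that each generated poem opens with <div class="poem">, that its lines are <span class="iN"> elements ending in <br /></span>, and that the first </div></div> after the opening tag closes the poem; the function names are hypothetical. Restricting every replacement to the poem block is one way to avoid touching closing divs elsewhere in the file.

```python
import re

# One poem: from the opening div to the first double closing div.
POEM_BLOCK = re.compile(r'<div class="poem">.*?</div></div>', re.DOTALL)


def convert_poem(block: str) -> str:
    """Apply the search/replace steps to a single poem block."""
    block = block.replace(
        '<div class="poem">',
        '<div class="poetry-container"><div class="poetry">',
    )
    block = block.replace('<span class="i0">', '<div class="verse">')
    # Remaining indents: i2 -> verse indent2, i4 -> verse indent4, ...
    block = re.sub(r'<span class="i(\d+)">', r'<div class="verse indent\1">', block)
    block = block.replace("<br /></span>", "</div>")
    # One extra closing div for the new poetry-container wrapper.
    return block + "</div>"


def convert_poetry(html: str) -> str:
    """Convert every poem block in the document, leaving other markup alone."""
    return POEM_BLOCK.sub(lambda m: convert_poem(m.group(0)), html)


sample = (
    '<div class="poem"><div class="stanza">\n'
    '<span class="i0">First line<br /></span>\n'
    '<span class="i2">Second line<br /></span>\n'
    "</div></div>"
)
print(convert_poetry(sample))
```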

Check poetry signature, make sure it is indented as the original.


Change all blockquote to div class="blockquot" (or tick CSS blockquote in HTML generator). Two S/R: Search: <blockquote> Replace: <div class="blockquot">

Search: </blockquote> Replace: </div>

Language markup

Add a language attribute to non-italicised foreign-language text (noted in the sequential read), and to any <em> or <cite> text which is also in a foreign language.

Format Illustrations

format - use basic code, remove whatever is inapplicable. Cross-check captions, hyperlink with list of illustrations (if any). Finally run ppvimage to make sure all the dimensions etc. have been added correctly.

add covernote if cover created by PPer

Check text for "See p. XXX" Appendix, figure etc

Create index

The Auto-index is a bit greedy and can markup non-page numbers (e.g. years), so remove all non-page number hyperlinks. Add second link to page spans e.g. 110-113. Create hyperlinks within index (i.e. "see xxxx")
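Adding the second link to page spans can be sketched in Python as below. It assumes that index links have the form <a href="#Page_110">110</a> and that the end of a span is written out in full (110-113, not 110-13); both are assumptions, and the function name link_span_ends is made up.

```python
import re

# A linked page number, a hyphen or en dash, then an unlinked page number.
SPAN = re.compile(r'(<a href="#Page_\d+">\d+</a>)([-\u2013])(\d+)\b')


def link_span_ends(index_html: str) -> str:
    """Link the second page number of spans such as 110-113."""
    return SPAN.sub(r'\1\2<a href="#Page_\3">\3</a>', index_html)


entry = 'Bees, <a href="#Page_110">110</a>-113, <a href="#Page_200">200</a>'
print(link_span_ends(entry))
```

Abbreviated spans such as 110-13 would end up pointing at page 13, so those still need fixing by hand.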

Move links to notes, as the footnotes have been moved away from the original page.

Format tables

The html generator mucks up tables so much, you are probably better off starting with the ascii table. To do this, enclose tables in /X X/ markup before generating the html, then it will be left intact, with <pre> </pre> tags around it.

Convert footnote markup

I convert all footnotes numbers or letters with underscores to hyphens. One advantage of this is that if an underscore is seen, then it is a footnote (or anchor) that didn't get converted.

  • Make sure regex and start at beginning are ticked for all of these S & R:

Search: Footnote_([^_]+)_[^_]+?" Replace: Footnote-$1"


Search: FNanchor_([^_]+)_[^_]+?" Replace: FNAnchor-$1"

These two regexes will change Footnote_1_1 etc to Footnote-1 (and all the anchors etc.)

  • Add title attribute to footnotes

Search class="fnanchor">\[([^\]]+)\] Replace class="fnanchor" title="Go to footnote $1.">[$1]

Search href="#FNAnchor Replace title="Return to text." href="#FNAnchor

  • take the bracket out of the return links

Search (<a title="Return to text."[^>]+>)<span class="label">\[([^\]]+)\]</span></a> Replace <span class="label">[$1$2</a>]</span>

  • Change all FNAnchor to Anchor (no regex)
  • For multiple anchors to the same footnote, add

<a href="#Footnote-x" class="fnanchor">[x]</a> to the second and subsequent anchors, replacing x with the fn number

For some reason, when I hit "Replace All", the first found item doesn't get changed. So run each S & R again, but be careful with the fourth S & R (Return to text).
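The first two footnote regexes (underscores to hyphens) translate directly to Python, which can be handy for checking what they will do before running them in GG. This is a sketch only; the function name is hypothetical and the sample markup is an assumption about what the generator writes.

```python
import re


def hyphenate_footnote_ids(html: str) -> str:
    """Turn Footnote_1_1 into Footnote-1 and FNanchor_1_1 into FNAnchor-1."""
    html = re.sub(r'Footnote_([^_]+)_[^_]+?"', r'Footnote-\1"', html)
    html = re.sub(r'FNanchor_([^_]+)_[^_]+?"', r'FNAnchor-\1"', html)
    return html


sample = (
    '<a name="FNanchor_1_1" id="FNanchor_1_1"></a>'
    '<a href="#Footnote_1_1" class="fnanchor">[1]</a>'
)
converted = hyphenate_footnote_ids(sample)
print(converted)
# Any underscore left in a footnote id points to something that was missed.
print("_" in converted)
```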

Find and format section headers

Search: </p>\n\n\n<p>((.|\n)+?)</p> Replace: </p>\n\n\n<h3>$1</h3>

Format Abbreviations

Best Practices ask us to mark up abbreviations so that screen readers can use the expansion instead of trying to read the abbreviation as a word.

Roman numbers: Search: \b([IVXLCDM]+)\b

Replace: <abbr title="\C::arabic("$1")\E">$1</abbr>

OR for monarchs etc, replace with e.g. title="the fifth"
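Outside GG there is no \C...\E, so the Roman-to-Arabic conversion has to be written out. A minimal Python sketch follows; the helper names are hypothetical, and like the regex above it will also match ordinary words made only of these letters (I, MIX, CIVIL), so each hit still needs a look.

```python
import re

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
ROMAN = re.compile(r"\b([IVXLCDM]+)\b")


def roman_to_arabic(numeral: str) -> int:
    """Convert an uppercase Roman numeral to an integer (no validity checks)."""
    total = 0
    # Pair each symbol with its successor; a smaller symbol before a larger
    # one is subtracted (IV = 4, XC = 90).
    for symbol, following in zip(numeral, numeral[1:] + " "):
        value = ROMAN_VALUES[symbol]
        if ROMAN_VALUES.get(following, 0) > value:
            total -= value
        else:
            total += value
    return total


def mark_roman(text: str) -> str:
    """Wrap each Roman numeral in <abbr> with its Arabic value as the title."""
    def wrap(match: re.Match) -> str:
        numeral = match.group(1)
        return f'<abbr title="{roman_to_arabic(numeral)}">{numeral}</abbr>'

    return ROMAN.sub(wrap, text)


print(mark_roman("Chapter XIV"))
```

This is meant for plain text runs; applied to a whole HTML file it would also hit capital letters inside tags and attributes.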

St. Saint or Street?

Search: St.

Replace: <abbr title="Saint">St.</abbr> Replace: <abbr title="Street">St.</abbr>

Look for US States abbreviations

Format uppercase small caps

Some small cap markup has uppercase letters, but should be in lowercase - but we don't write the characters in lowercase because if the CSS doesn't work for small caps, the text will be shown as lowercase. What we do is search for <span class="smcap">, and replace with <span class="smcap lowercase"> when all the characters are uppercase.
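Finding the all-uppercase small cap spans can be sketched in Python like this. It assumes the spans contain plain text with no nested tags; the function name is hypothetical.

```python
import re

# Small cap spans with no nested markup inside.
SMCAP = re.compile(r'<span class="smcap">([^<]*)</span>')


def fix_uppercase_smcap(html: str) -> str:
    """Add the lowercase class to smcap spans whose letters are all uppercase."""
    def repl(match: re.Match) -> str:
        content = match.group(1)
        letters = [c for c in content if c.isalpha()]
        if letters and all(c.isupper() for c in letters):
            return f'<span class="smcap lowercase">{content}</span>'
        return match.group(0)  # mixed case: leave as it is

    return SMCAP.sub(repl, html)


sample = '<span class="smcap">CHAPTER</span> and <span class="smcap">Mr. Smith</span>'
print(fix_uppercase_smcap(sample))
```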

Check thought breaks

In some projects, <tb> has been used to mark a larger vertical space than usual between paragraphs. Remove the <hr class="tb"> markup and insert a class into the following paragraph e.g. <p class="mt2">

Check outline correct with appropriate hx markup


Convert italics and bold to symbols that aren't used in the text (usually underscores for italic and equal signs for bold).

Convert small cap markup

Check thought breaks

In some projects, <tb> has been used to mark a larger vertical space than usual between paragraphs. Remove the <tb> and replace with a blank line.

Rewrap, then check line length with the regex ^.{75,}
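The same line length check can be done outside GG with a few lines of Python. This is a sketch; the function name long_lines is made up, and the length test is equivalent to the regex ^.{75,} (75 characters or more).

```python
def long_lines(text: str, limit: int = 75) -> list[tuple[int, int]]:
    """Return (line number, length) for every line of `limit` or more characters."""
    return [
        (number, len(line))
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line) >= limit
    ]


sample = "short line\n" + "x" * 74 + "\n" + "y" * 75
for number, length in long_lines(sample):
    print(f"line {number}: {length} characters")
```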