User:Hutcheson/Postprocessing Tools

Basic Postprocessing Assumptions

  • I will be postprocessing files from various sources
    • DP-Canada, DP-US, solo projects, etc.
  • I will be required to generate multiple outputs.
    • I post to CCEL, PG-US, PG-Canada.
  • HTML is the format for the definitive product.
    • If someone wants typewriter-formatted text, I provide it--without asking why. But text versions lose formatting information and are much less readable on screens of various widths.
    • Any other markup language is much less portable: ePub is changing too fast, and none of my target archives want ODF yet.
  • UTF is the character set for the definitive product.
    • ASCII is fine, unless you have Greek, or Math, or Old English, or ... anything beyond Fortran. (I admit that my wife gave me a mug saying "If you can't say it in Fortran, don't say it".)
    • If someone wants Latin-1, I provide it--without asking questions. But how does a reader know whether it's Latin-1, Latin-2-through-n, Mac, Windows-whatever, or some other 8-bit code? I lived through the days of BCD. I am NOT going back.
  • All working copies of the text will be in ASCII charset.
    • As a user and programmer, I want to know that my intermediate files can be used by any tool, viewed with any viewer, using any font, with no confusion.
  • Flexibility is necessary.
    • Different books require different formatting. One-size designs don't fit anyone but the deformed idiot child of the designer.
  • Automation is good.
    • Manual work is what I have to get paid to do.

Workflow Overview

The names in brackets are Perl filters. Most filters are driven by pseudo-xml formatted stylesheets. (A minimal sketch of the common filter pattern appears after the workflow list.)

  • Convert input files into my favored (simple) internal format
    • From UTF source files (filters ordered to avoid interference with special character conventions):
      • [utfchars] Character-set transliteration from UTF to (numeric) entities.
      • [dp-pages] Convert page-break lines to "#123" form (for historical reasons)
      • [oddchars] Character-set transliteration from Latin-1 PG-diglyph to (extended HTML) entities.
      • [gxliter] Character-set transliteration from DP-like Greek to extended HTML entities
      • [from-utf] Character-set transliteration from numeric entities to extended HTML entities
    • From Latin-1 input files
      • [dp-pages]
      • [oddchars]
      • [gxliter] Character-set transliteration from DP-like Greek to extended HTML entities
  • Convert DP footnotes into CCEL (inlined) footnotes
    • [dp-fn]
  • Preprocess (in the C-compiler sense)
    • (manually) adjust Greek transliteration
    • [smartq] insert curly-quote entities
    • (manually) check single- and double-quotes that smartq couldn't resolve based on the immediate context
    • (manually) review "*" characters (cleaning up proofers' notes if possible)
    • (manually) add unusual or one-off HTML markup
    • [ppmusic] (optional) process [Illustration: ...] etc. DP markup
    • [dp] (optional) add boiler-plate HTML markup
    • (optionally) specialized tools, written when needed

Up to this point, I review and manually correct the file between steps. After this point, I change the source file and automatically rebuild output files.

  • Compile the source to create all output files
    • [ppmusic] (if not done above) process [Illustration: ...] etc. DP markup
    • [dp-table] convert tables (formatted with my special rules) into HTML (see /Table Support)
    • [dp] (if not done above) add boiler-plate HTML markup
    • [dp-pages] convert page breaks to HTML tags
    • [cnvttag] for text files, remove HTML tags and wrap/indent/format accordingly.
    • [txttable] for text files, convert HTML TABLE markup to spaced-out text
    • [cnvttag] for text files, replace HTML and extended HTML entities by ASCII or UTF characters
    • [cnvttag] for HTML files, replace extended HTML entities by HTML numbered entities
    • [cnvttag] for HTML files, replace extended HTML tags by standard HTML
    • [imgsize] scan image files, inserting actual image size into HTML
  • Run validity checks
    • (external tools) spellcheck, gutcheck, W3C validator
    • [chklinks] linkcheck
    • [epubmeta] hand-rolled ePub metadata and heading check
  • Publish
    • copy all output files to project directory
    • tweak title for intended publication website (DP-US, DP-Canada, etc.)
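
Most of these filters share one pattern: read the working file line by line, rewrite the lines that match some convention, and pass everything else through untouched. Here is a minimal sketch of that pattern, written as a dp-pages-like pass that turns the internal "#123" page-break lines into HTML anchors. The anchor markup is hypothetical, and the real filter is driven by a pseudo-xml stylesheet rather than hard-coded strings.

#!/usr/bin/perl
# Sketch of the common filter shape: a line-oriented stdin-to-stdout pass.
# Here: convert internal "#123" page-break lines into HTML page anchors.
# The emitted markup is hypothetical; the real dp-pages is stylesheet-driven.
use strict;
use warnings;

while (my $line = <STDIN>) {
    if ($line =~ /^#(\d+)\s*$/) {
        print qq{<a id="page$1"></a>\n};   # a page break becomes an anchor
    } else {
        print $line;                       # everything else passes through
    }
}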

Core Tools

dp

dp is driven by a parameter file. Processing each paragraph in the source file includes:

  • Determine type of paragraph based on type of preceding paragraph and number of preceding blank lines
  • Insert type-specific boilerplate before and after the paragraph
  • Characterize lines based on number of leading spaces and presence of internal whitespace
  • Insert boilerplate before, after, and at space breaks within each line

All the boilerplate, and the type of the next paragraph, is defined in a parameter file.

So the parameter file could say: imagine a "p" paragraph. It starts with <p>, ends with </p>, and is followed by one of: one blank line and another "p" paragraph; two blank lines and a "subheading"; or four blank lines and a "heading". Any line in the paragraph that begins with, say, one space is wrapped with <span class="rightjust"> and </span> tags.

<proc id="p" next="p;h3;h2">
<p:pref> <p>
<p:suff> </p>
<l:pref_1> <span class="rightjust">
<l:suff_1> </span>

Such next-paragraph logic handles simple narrative books from the Chapter 1 heading to the trailing advertisements. But sometimes help is needed: the paragraph type can be overridden by a "//" line as needed (see the example below).

I have separate parameter files for each project (book or series), although I tend to start each project with the parameter files from some previous similar project.
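
To make the mechanism concrete, here is a minimal sketch of the next-paragraph logic in Perl. It hard-codes one possible reading of next="p;h3;h2" (one blank line selects "p", two select "h3", four or more select "h2") and a few boilerplate strings; the real dp reads all of this from the parameter file, and the blank-count mapping shown here is an assumption.

#!/usr/bin/perl
# Sketch of dp's next-paragraph logic. The blank-count mapping and the
# boilerplate strings are assumptions; the real dp reads both from the
# parameter file, and also handles line-level boilerplate (omitted here).
use strict;
use warnings;

my %next = (                       # assumed reading of next="p;h3;h2"
    p  => { 1 => 'p', 2 => 'h3', 4 => 'h2' },
    h2 => { 1 => 'p' },
    h3 => { 1 => 'p' },
);
my %pref = ( p => '<p>',  h2 => '<h2>',  h3 => '<h3>' );
my %suff = ( p => '</p>', h2 => '</h2>', h3 => '</h3>' );

my $type   = 'p';                  # assume the file opens with a paragraph
my $blanks = 0;
my @para;

sub flush_para {
    return unless @para;
    my ($pre, $suf) = ($pref{$type} // '', $suff{$type} // '');
    print $pre, "\n", @para, $suf, "\n";
    @para = ();
}

while (my $line = <STDIN>) {
    if ($line =~ /^\s*$/) { $blanks++; next; }
    if ($line =~ m{^//(\w+)}) {    # manual override line, e.g. //h2
        flush_para();
        ($type, $blanks) = ($1, 0);
        next;
    }
    if ($blanks && @para) {        # a blank run ended: close the old
        flush_para();              # paragraph and pick the next type
        $type = $next{$type}{$blanks} // 'p';
    }
    $blanks = 0;
    push @para, $line;
}
flush_para();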


Example of Trivially-Handled Paragraph

//tocheader

CONTENTS


I. In Media Res 1

II. Mea Maxima Confusa 23



//h2

CHAPTER I

In Media Res


[The parameter file could set this paragraph to be a paragraph, or a subheading, whichever was more common in this particular book. Unusual cases can use the overrides.]
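
A matching parameter entry for the "tocheader" override might look like the following. The markup, class name, and next-list are hypothetical; only the shape of the entry follows the "p" example above.

<proc id="tocheader" next="tocheader;h3;h2">
<p:pref> <div class="toc">
<p:suff> </div>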

cnvttag

cnvttag takes two kinds of parameters: entity lists and tag processing commands. Each kind of output file (HTML, ThML, ASCII, Latin-1, UTF) has a separate transliteration table--all entities can be converted to UTF numbers (formatted differently in HTML versus UTF output), or "dumbed down" to Latin-1 or ASCII.

cnvttag can be run multiple times on the same source, each time with a different parameter file (my standard practice is one pass for HTML/ThML, involving only transliteration and tag substitutions, and three passes for UTF/Latin-1 output, because more file formatting is involved).

I have standard parameter files with a list of entities, including the standard HTML entities, extended at need into UTF space. (A sketch of such a pass follows the examples below.)

  • HTML defines AElig and aelig. But suppose you want AElig in UTF and "Ae" in ASCII? My "Aelig" is transliterated by the parameter file as appropriate.
  • HTML has frac12 (ASCII "1/2"); my cfrac12 is transliterated to frac12 or ASCII "-1/2".
  • HTML has alpha; I added alphaacute, alphasmooth, alpharoughgrave, etaacuteiota, etc.
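
A minimal sketch of such an entity pass, with hypothetical two-entry tables: the real cnvttag loads full entity lists from its parameter files, and the mappings below are illustrative only.

#!/usr/bin/perl
# Sketch of a cnvttag-style entity pass. The tables are hypothetical
# two-entry slices; the real tool loads them from parameter files.
use strict;
use warnings;

my %utf   = ( Aelig => '&#198;', cfrac12 => '&#189;' );   # UTF-numbered output
my %ascii = ( Aelig => 'Ae',     cfrac12 => '-1/2'   );   # dumbed-down output

my $table = (shift(@ARGV) // '') eq 'ascii' ? \%ascii : \%utf;

while (my $line = <STDIN>) {
    # Replace known entities; leave unknown ones untouched.
    $line =~ s/&(\w+);/exists $table->{$1} ? $table->{$1} : "&$1;"/ge;
    print $line;
}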

Scripts and non-filter tools

pp.bat

This is the standard preprocessing workhorse. It takes these parameters (a sample invocation follows the list):

  • inputfilename
  • -p parameterfilename [optional; multiple -p options allowed for each Perl filter]
  • filtername [multiple filters allowed]
    • The output files of the successive filters are a.hym, aa.hym, aaa.hym, etc.
    • The input file can be either inputfilename.hym or src\inputfilename.hym or inputfilename (pp searches for first match)
    • The parameter filename has an assumed suffix of ".pss", and can be found in ., .\pss, or ..\perl directories (pp searches for first match)
    • The Perl filter is ..\perl\filtername.pl.
    • Certain filters are assumed to have standard parameter files: for instance, dp uses dpcommon.pss and dplocal.pss. These are in addition to whatever is specified by -p parameters.
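
For example (all file and filter names hypothetical), a preprocessing run might look like:

pp mybook -p mybook dp-pages oddchars gxliter

This would run ..\perl\dp-pages.pl, ..\perl\oddchars.pl, and ..\perl\gxliter.pl in turn, each reading the previous step's output, and would leave a.hym, aa.hym, and aaa.hym behind for review between steps.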