User:Hutcheson/Postprocessing Tools
Basic Postprocessing Assumptions
- I will be postprocessing files from various sources
- DP-Canada, DP-US, solo projects, etc.
- I will be required to generate multiple outputs.
- I post to CCEL, PG-US, PG-Canada.
- HTML is the format for the definitive product.
- If someone wants typewriter-formatted text, I provide it--without asking why. But text versions lose formatting information and are much less readable on screens of various widths.
- Any other markup language is much less portable: ePub is changing too fast, and none of my target archives want ODF yet.
- UTF is the character set for the definitive product.
- ASCII is fine, unless you have Greek, or Math, or Old English, or ... anything beyond Fortran. (I admit that my wife gave me a mug saying "If you can't say it in Fortran, don't say it.")
- If someone wants Latin-1, I provide it--without asking questions. But how does a reader know whether it's Latin-1, Latin-2-through-n, Mac, Windows-whatever, or some other 8-bit code? I lived through the days of BCD. I am NOT going back.
- All working copies of the text will be in the ASCII charset.
- As a user and programmer, I want to know that my intermediate files can be used by any tool, viewed with any viewer, using any font, with no confusion.
- Flexibility is necessary.
- Different books require different formatting. One-size designs don't fit anyone but the deformed idiot child of the designer.
- Automation is good.
- Manual work is what I have to get paid to do.
Workflow Overview
The names in brackets are Perl filters. Most filters are driven by pseudo-XML-formatted stylesheets.
- Convert input files into my favored (simple) internal format
- From UTF source files (filters ordered to avoid interference with special character conventions):
- [utfchars] Character-set transliteration from UTF to (numeric) entities (a minimal sketch of this kind of filter appears after the preprocessing steps below)
- [dp-pages] Convert page-break lines to "#123" form (for historical reasons)
- [oddchars] Character-set transliteration from Latin-1 PG-diglyph to (extended HTML) entities.
- [gxliter] Character-set transliteration from DP-like Greek to extended HTML entities
- [from-utf] Character-set transliteration from numeric entities to extended HTML entities
- From Latin-1 input files:
- [dp-pages]
- [oddchars]
- [gxliter] Character-set transliteration from DP-like Greek to extended HTML entities
- Convert DP footnotes into CCEL (inlined) footnotes
- [dp-fn]
- Preprocess (in the C-compiler sense)
- (manually) adjust Greek transliteration
- [smartq] insert curly-quote entities
- (manually) check single- and double-quotes that smartq couldn't resolve based on the immediate context
- (manually) review "*" characters (cleaning up proofers' notes if possible)
- (manually) add unusual or one-off HTML markup
- [ppmusic] (optional) process [Illustration: ...] and similar DP markup
- [dp] (optional) add boiler-plate HTML markup
- (optional) specialized tools, written as needed
Up to this point, I review and manually correct the file between steps. After this point, I change the source file and automatically rebuild output files.
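(The flavor of these conversion filters can be sketched in a few lines of Perl. This is not any of the actual filters--those are driven by .pss stylesheets--just the core idea behind [utfchars]: push every non-ASCII character into a numeric entity so that later stages can work in pure ASCII.)

  #!/usr/bin/perl
  # Sketch of a utfchars-style pass: read UTF-8, emit pure ASCII with
  # every non-ASCII character replaced by a numeric entity.
  # (Illustrative only; the real filter is stylesheet-driven.)
  use strict;
  use warnings;

  binmode STDIN,  ':encoding(UTF-8)';
  binmode STDOUT, ':encoding(ascii)';

  while (my $line = <STDIN>) {
      $line =~ s/([^\x00-\x7F])/sprintf("&#%d;", ord($1))/ge;
      print $line;
  }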
- Compile the source to create all output files
- [ppmusic] (if not done above) process [Illustration: ...] and similar DP markup
- [dp-table] convert tables (formatted with my special rules) into HTML (see /Table Support)
- [dp] (if not done above) add boiler-plate HTML markup
- [dp-pages] convert page breaks to HTML tags
- [cnvttag] for text files, remove HTML tags and wrap/indent/format accordingly.
- [txttable] for text files, convert HTML TABLE markup to spaced-out text
- [cnvttag] for text files, replace HTML and extended HTML entities by ASCII or UTF characters
- [cnvttag] for HTML files, replace extended HTML entities by HTML numbered entities
- [cnvttag] for HTML files, replace extended HTML tags by standard HTML
- [imgsize] scan image files, inserting actual image size into HTML (the idea is sketched after the Publish steps below)
- Run validity checks
- (external tools) spellcheck, gutcheck, W3C validator
- [chklinks] linkcheck
- [epubmeta] hand-rolled epub metadata and heading check
- Publish
- copy all output files to project directory
- tweak title for intended publication website (DP-US, DP-Canada, etc.)
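Here is one way an [imgsize]-style pass could work, using the CPAN module Image::Size. This is a sketch, not the actual filter, and it assumes each <img> tag sits on a single line:

  #!/usr/bin/perl
  # Sketch of an imgsize-style pass: add width/height attributes to
  # <img> tags that lack them, using the CPAN module Image::Size.
  # Assumes one <img> tag per line; illustrative only.
  use strict;
  use warnings;
  use Image::Size;

  while (my $line = <>) {
      $line =~ s{<img([^>]*\bsrc="([^"]+)"[^>]*)>}{
          my ($attrs, $file) = ($1, $2);
          my ($w, $h) = imgsize($file);   # width is undef if unreadable
          (defined $w && $attrs !~ /\bwidth=/)
              ? qq(<img$attrs width="$w" height="$h">)
              : "<img$attrs>"
      }ge;
      print $line;
  }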
Core Tools
dp
dp is driven by a parameter file. Processing each paragraph in the source file includes:
- Determine type of paragraph based on type of preceding paragraph and number of preceding blank lines
- Insert type-specific boilerplate before and after the paragraph
- Characterize lines based on number of leading spaces and presence of internal whitespace
- Insert boilerplate before, after, and at space breaks within each line
All the boilerplate, and the type of the next paragraph, is defined in a parameter file.
For example, the parameter file could define a "p" paragraph: it starts with <p>, ends with </p>, and is followed by one of three things: one blank line and another "p" paragraph; two blank lines and a "subheading"; or four blank lines and a "heading". Any line in the paragraph that begins with, say, one space is wrapped in <span class="rightjust"> and </span> tags.
  <proc id="p" next="p;h3;h2">
    <p:pref> <p>
    <p:suff> </p>
    <l:pref_1> <span class="rightjust">
    <l:suff_1> </span>
Such next-paragraph logic handles simple narrative books from the first chapter heading to the trailing advertisements. But sometimes help is needed: the paragraph type can be overridden by a "//" line as needed (see the example below).
I have separate parameter files for each project (book or series), although I tend to start each project with the parameter files from some previous similar project.
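A toy version of the dispatch loop may make this concrete. This is not dp itself--it hard-codes one small sample table and skips the line-level boilerplate and the stylesheet parsing--but it shows how the preceding type plus the blank-line count selects each paragraph's type, and how a "//" line overrides it:

  #!/usr/bin/perl
  # Toy paragraph dispatcher in the spirit of dp.  Each paragraph's type
  # is chosen from the previous type and the number of blank lines before
  # it; a "//type" line forces the type.  The real tables come from a
  # pseudo-XML parameter file; these are hard-coded samples.
  use strict;
  use warnings;

  my %proc = (
      p  => { pref => '<p>',  suff => '</p>',
              next => { 1 => 'p', 2 => 'h3', 4 => 'h2' } },
      h2 => { pref => '<h2>', suff => '</h2>',
              next => { 1 => 'p', 2 => 'h3', 4 => 'h2' } },
      h3 => { pref => '<h3>', suff => '</h3>',
              next => { 1 => 'p', 2 => 'h3', 4 => 'h2' } },
  );

  my ($type, $blanks, @para) = ('p', 0);

  sub emit {
      return unless @para;
      my $p = $proc{$type} || $proc{p};   # unknown types fall back to "p"
      print $p->{pref}, "\n", @para, $p->{suff}, "\n";
      @para = ();
  }

  while (my $line = <>) {
      if ($line =~ /^\s*$/) {             # blank line ends a paragraph
          emit();
          $blanks++;
          next;
      }
      if (!@para) {                       # first line of a new paragraph
          $type = $proc{$type}{next}{$blanks} // 'p' if $blanks;
          $blanks = 0;
          if ($line =~ m{^//(\w+)}) {     # manual override, e.g. "//h2"
              $type = $1;
              next;
          }
      }
      push @para, $line;
  }
  emit();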
Example of Trivially-Handled Paragraph
  //tocheader
  CONTENTS
  I. In Media Res 1
  II. Mea Maxima Confusa 23
  //h2
  CHAPTER I
  In Media Res
[The parameter file could set this paragraph to be a paragraph, or a subheading, whichever was more common in this particular book. Unusual cases can use the overrides.]
cnvttag
cnvttag takes two kinds of parameters: entity lists and tag-processing commands. Each kind of output file (HTML, ThML, ASCII, Latin-1, UTF) has a separate transliteration table--all entities can be converted to UTF numbers (formatted differently in HTML versus UTF output), or "dumbed down" to Latin-1 or ASCII.
cnvttag can be run multiple times on the same source, each time with a different parameter file (my standard practice is one pass for HTML/ThML, involving only transliteration and tag substitutions, and three passes for UTF/Latin-1 output, because more file formatting is involved).
I have standard parameter files with a list of entities, including the standard HTML set, extended as needed into UTF space (the idea is sketched after this list):
- HTML defines AElig and aelig. But suppose you want AElig in UTF and "Ae" in ASCII? My "Aelig" is transliterated by the parameter file as appropriate.
- HTML has frac12 (ASCII "1/2"); my cfrac12 is transliterated to frac12 or to ASCII "-1/2" (for mixed numbers such as "2-1/2").
- HTML has alpha; I added alphaacute, alphasmooth, alpharoughgrave, etaacuteiota, etc.
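The transliteration side of this can be sketched as follows. The table here is a hard-coded sample; cnvttag's real tables live in the .pss parameter files:

  #!/usr/bin/perl
  # Sketch of cnvttag-style entity transliteration: one column per
  # output format, applied to every &name; in the text.  Only sample
  # entities are shown; the real tables come from parameter files.
  use strict;
  use warnings;

  my %xlit = (
      #            HTML output      ASCII output
      Aelig   => { html => '&#198;', ascii => 'Ae'   },
      aelig   => { html => '&#230;', ascii => 'ae'   },
      frac12  => { html => '&#189;', ascii => '1/2'  },
      cfrac12 => { html => '&#189;', ascii => '-1/2' },   # as in "2-1/2"
  );

  my $mode = shift(@ARGV) // 'html';    # first argument: "html" or "ascii"

  while (my $line = <>) {
      $line =~ s/&(\w+);/exists $xlit{$1} ? $xlit{$1}{$mode} : "&$1;"/ge;
      print $line;
  }

Run as, say, "perl xlit.pl ascii book.hym" (xlit.pl being a made-up name); unknown entities pass through untouched.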
Scripts and non-filter tools
pp.bat
This is the standard preprocessing workhorse. It takes these parameters (an example invocation follows the list):
- inputfilename
- -p parameterfilename [optional; multiple -p options allowed for each Perl filter]
- filtername [multiple filters allowed]
- The output files of the successive filters are a.hym, aa.hym, aaa.hym, etc.
- The input file can be inputfilename.hym, src\inputfilename.hym, or inputfilename (pp searches for the first match)
- The parameter filename has an assumed suffix of ".pss", and can be found in the ., .\pss, or ..\perl directories (pp searches for the first match)
- The Perl filter is ..\perl\filtername.pl.
- Certain filters are assumed to have standard parameter files: for instance, dp uses dpcommon.pss and dplocal.pss. These are in addition to whatever is specified by -p parameters.
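For example, a hypothetical run (all file names made up):

  pp romola -p romola dp-fn smartq dp

By the rules above, pp would find romola.hym (or src\romola.hym), run ..\perl\dp-fn.pl on it to produce a.hym, run ..\perl\smartq.pl on a.hym to produce aa.hym, and run ..\perl\dp.pl on aa.hym to produce aaa.hym; dp would also pick up its standard dpcommon.pss and dplocal.pss in addition to the romola.pss named by -p.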