User:Hutcheson/Postprocessing Tools

Basic Postprocessing Assumptions

  • I will be postprocessing files from various sources
    • DP-Canada, DP-US, solo projects, etc.
  • I will be required to generate multiple outputs.
    • I post to CCEL, PG-US, PG-Canada.
  • HTML is the format for the definitive product.
    • If someone wants typewriter-formatted text, I provide it--without asking why. But text versions lose formatting information and are much less readable on screens of various widths.
    • Any other markup language is much less portable: ePub is changing too fast, and none of my target archives want ODF yet.
  • UTF is the character set for the definitive product.
    • ASCII is fine, unless you have Greek, or Math, or Old English, or ... anything beyond Fortran. (I admit that my wife gave me a mug saying "If you can't say it in Fortran, don't say it".)
    • If someone wants Latin-1, I provide it--without asking questions. But how does a reader know whether it's Latin-1, Latin-2-through-n, Mac, Windows-whatever, or some other 8-bit code? I lived through the days of BCD. I am NOT going back.
  • All working copies of the text will be in ASCII charset.
    • As a user and programmer, I want to know that my intermediate files can be used by any tool, viewed with any viewer, using any font, with no confusion.
  • Flexibility is necessary.
    • Different books require different formatting. One-size designs don't fit anyone but the deformed idiot child of the designer.
  • Automation is good.
    • Manual work is what I have to get paid to do.

Workflow Overview

The names in brackets are Perl filters. Most filters are driven by pseudo-xml formatted stylesheets. (A minimal sketch of the common filter pattern appears after the workflow list.)

  • Convert input files into my favored (simple) internal format
    • From UTF source files (filters ordered to avoid interference with special character conventions):
      • [utfchars] Character-set transliteration from UTF to (numeric) entities.
      • [dp-pages] Convert page-break lines to "#123" form (for historical reasons)
      • [oddchars] Character-set transliteration from Latin-1 PG-diglyph to (extended HTML) entities.
      • [gxliter] Character-set transliteration from DP-like Greek to extended HTML entities
      • [from-utf] Character-set transliteration from numeric entities to extended HTML entities
    • From Latin-1 input files
      • [dp-pages]
      • [oddchars]
      • [gxliter] Character-set transliteration from DP-like Greek to extended HTML entities
  • Convert DP footnotes into CCEL (inlined) footnotes
    • [dp-fn]
  • Preprocess (in the C-compiler sense)
    • (manually) adjust Greek transliteration
    • [smartq] insert curly-quote entities
    • (manually) check single- and double-quotes that smartq couldn't resolve based on the immediate context
    • (manually) review "*" characters (cleaning up proofers' notes if possible)
    • (manually) add unusual or one-off HTML markup
    • [ppmusic] (optional) process [Illustration: ...] etc. DP markup
    • [dp] (optional) add boiler-plate HTML markup
    • (optionally) specialized tools, written when needed

Up to this point, I review and manually correct the file between steps. After this point, I change the source file and automatically rebuild output files.

  • Compile the source to create all output files
    • [ppmusic] (if not done above) process [Illustration: ...] etc. DP markup
    • [dp-table] convert tables (formatted with my special rules) into HTML (see /Table Support)
    • [dp] (if not done above) add boiler-plate HTML markup
    • [dp-pages] convert page breaks to HTML tags
    • [cnvttag] for text files, remove HTML tags and wrap/indent/format accordingly.
    • [txttable] for text files, convert HTML TABLE markup to spaced-out text
    • [cnvttag] for text files, replace HTML and extended HTML entities by ASCII or UTF characters
    • [cnvttag] for HTML files, replace extended HTML entities by HTML numbered entities
    • [cnvttag] for HTML files, replace extended HTML tags by standard HTML
    • [imgsize] scan image files, inserting actual image size into HTML
  • Run validity checks
    • (external tools) spellcheck, gutcheck, W3C validator
    • [chklinks] linkcheck
    • [epubmeta] hand-rolled ePub metadata and heading check
  • Publish
    • copy all output files to project directory
    • tweak title for intended publication website (DP-US, DP-Canada, etc.)
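
Most of these filters share one pattern: read the working file line by line, rewrite the lines that match some convention, and pass everything else through untouched. Here is a minimal sketch of that pattern, written as a dp-pages-like pass that turns the internal "#123" page-break lines into HTML anchors. The anchor markup is hypothetical, and the real filter is driven by a pseudo-xml stylesheet rather than hard-coded strings.

#!/usr/bin/perl
# Sketch of the common filter shape: a line-oriented stdin-to-stdout pass.
# Here: convert internal "#123" page-break lines into HTML page anchors.
# The emitted markup is hypothetical; the real dp-pages is stylesheet-driven.
use strict;
use warnings;

while (my $line = <STDIN>) {
    if ($line =~ /^#(\d+)\s*$/) {
        print qq{<a id="page$1"></a>\n};   # a page break becomes an anchor
    } else {
        print $line;                       # everything else passes through
    }
}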

Core Tools

dp

dp is driven by a parameter file. Processing each paragraph in the source file includes:

  • Determine type of paragraph based on type of preceding paragraph and number of preceding blank lines
  • Insert type-specific boilerplate before and after the paragraph
  • Characterize lines based on number of leading spaces and presence of internal whitespace
  • Insert boilerplate before, after, and at space breaks within each line

All the boilerplate, and the type of the next paragraph, is defined in a parameter file.

So the parameter file could say: imagine a "p" paragraph. It starts with <p>, ends with </p>, and is followed by one of: one blank line and another "p" paragraph; two blank lines and a "subheading"; or four blank lines and a "heading". Any line in the paragraph that begins with, say, one space is wrapped with <span class="rightjust"> and </span> tags.

<proc id="p" next="p;h3;h2">
<p:pref> <p>
<p:suff> </p>
<l:pref_1> <span class="rightjust">
<l:suff_1> </span>

Such next-paragraph logic handles simple narrative books from the Chapter 1 heading to the trailing advertisements. But sometimes help is needed: the paragraph type can be overridden by a "//" line as needed (see the example below).

I have separate parameter files for each project (book or series), although I tend to start each project with the parameter files from some previous similar project.
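
To make the mechanism concrete, here is a minimal sketch of the next-paragraph logic in Perl. It hard-codes one possible reading of next="p;h3;h2" (one blank line selects "p", two select "h3", four or more select "h2") and a few boilerplate strings; the real dp reads all of this from the parameter file, and the blank-count mapping shown here is an assumption.

#!/usr/bin/perl
# Sketch of dp's next-paragraph logic. The blank-count mapping and the
# boilerplate strings are assumptions; the real dp reads both from the
# parameter file, and also handles line-level boilerplate (omitted here).
use strict;
use warnings;

my %next = (                       # assumed reading of next="p;h3;h2"
    p  => { 1 => 'p', 2 => 'h3', 4 => 'h2' },
    h2 => { 1 => 'p' },
    h3 => { 1 => 'p' },
);
my %pref = ( p => '<p>',  h2 => '<h2>',  h3 => '<h3>' );
my %suff = ( p => '</p>', h2 => '</h2>', h3 => '</h3>' );

my $type   = 'p';                  # assume the file opens with a paragraph
my $blanks = 0;
my @para;

sub flush_para {
    return unless @para;
    my ($pre, $suf) = ($pref{$type} // '', $suff{$type} // '');
    print $pre, "\n", @para, $suf, "\n";
    @para = ();
}

while (my $line = <STDIN>) {
    if ($line =~ /^\s*$/) { $blanks++; next; }
    if ($line =~ m{^//(\w+)}) {    # manual override line, e.g. //h2
        flush_para();
        ($type, $blanks) = ($1, 0);
        next;
    }
    if ($blanks && @para) {        # a blank run ended: close the old
        flush_para();              # paragraph and pick the next type
        $type = $next{$type}{$blanks} // 'p';
    }
    $blanks = 0;
    push @para, $line;
}
flush_para();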


Example of Trivially-Handled Paragraph

//tocheader

CONTENTS


I. In Media Res 1

II. Mea Maxima Confusa 23



//h2

CHAPTER I

In Media Res


[The parameter file could set this paragraph to be a paragraph, or a subheading, whichever was more common in this particular book. Unusual cases can use the overrides.]
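
A matching parameter entry for the "tocheader" override might look like the following. The markup, class name, and next-list are hypothetical; only the shape of the entry follows the "p" example above.

<proc id="tocheader" next="tocheader;h3;h2">
<p:pref> <div class="toc">
<p:suff> </div>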

cnvttag

cnvttag takes two kinds of parameters: entity lists and tag processing commands. Each kind of output file (HTML, ThML, ASCII, Latin-1, UTF) has a separate transliteration table--all entities can be converted to UTF numbers (formatted differently in HTML versus UTF output), or "dumbed down" to Latin-1 or ASCII.

cnvttag can be run multiple times on the same source, each time with a different parameter file (my standard practice is one pass for HTML/ThML, involving only transliteration and tag substitutions, and three passes for UTF/Latin-1 output, because more file formatting is involved).

I have standard parameter files with a list of entities, including the standard HTML entities, extended at need into UTF space. (A sketch of such a pass follows the examples below.)

  • HTML defines AElig and aelig. But suppose you want AElig in UTF and "Ae" in ASCII? My "Aelig" is transliterated by the parameter file as appropriate.
  • HTML has frac12 (ASCII "1/2"); my cfrac12 is transliterated to frac12 or ASCII "-1/2".
  • HTML has alpha; I added alphaacute, alphasmooth, alpharoughgrave, etaacuteiota, etc.
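
A minimal sketch of such an entity pass, with hypothetical two-entry tables: the real cnvttag loads full entity lists from its parameter files, and the mappings below are illustrative only.

#!/usr/bin/perl
# Sketch of a cnvttag-style entity pass. The tables are hypothetical
# two-entry slices; the real tool loads them from parameter files.
use strict;
use warnings;

my %utf   = ( Aelig => '&#198;', cfrac12 => '&#189;' );   # UTF-numbered output
my %ascii = ( Aelig => 'Ae',     cfrac12 => '-1/2'   );   # dumbed-down output

my $table = (shift(@ARGV) // '') eq 'ascii' ? \%ascii : \%utf;

while (my $line = <STDIN>) {
    # Replace known entities; leave unknown ones untouched.
    $line =~ s/&(\w+);/exists $table->{$1} ? $table->{$1} : "&$1;"/ge;
    print $line;
}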

Scripts and non-filter tools

pp.bat

This is the standard preprocessing workhorse. It takes these parameters (a sample invocation follows the list):

  • inputfilename
  • -p parameterfilename [optional; multiple -p options allowed for each Perl filter]
  • filtername [multiple filters allowed]
    • The output files of the successive filters are a.hym, aa.hym, aaa.hym, etc.
    • The input file can be either inputfilename.hym or src\inputfilename.hym or inputfilename (pp searches for first match)
    • The parameter filename has an assumed suffix of ".pss", and can be found in ., .\pss, or ..\perl directories (pp searches for first match)
    • The Perl filter is ..\perl\filtername.pl.
    • Certain filters are assumed to have standard parameter files: for instance, dp uses dpcommon.pss and dplocal.pss. These are in addition to whatever is specified by -p parameters.
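
For example (all file and filter names hypothetical), a preprocessing run might look like:

pp mybook -p mybook dp-pages oddchars gxliter

This would run ..\perl\dp-pages.pl, ..\perl\oddchars.pl, and ..\perl\gxliter.pl in turn, each reading the previous step's output, and would leave a.hym, aa.hym, and aaa.hym behind for review between steps.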