User:Hutcheson/Postprocessing Tools/Table Support
Table formatting is, in the normal DP workflow, the most difficult part of a project. By dint of extraordinary sweat and copious tears, the DP formatters produce a text-based pre-wrapped layout based on their best guess of how best to use the limited horizontal space in the line. If the postprocessor disagrees with their decisions about column layout, each tweak involves manually revising every single line of the table. In any case, producing the HTML version involves even more sweat and tears to remove all the formatters' work, before sweating blood to add all the required HTML tags. And no part of that postprocessing effort can be effectively automated. Life just isn't long enough for that....
If the DP formatters produce a semantics-based version of the table (using my special formatting rules), and HTML is produced first (as in my postprocessing workflow), then (1) the conversion of DP output into HTML can be largely automated, (2) the conversion of HTML into text can be largely automated, and (3) the result can be tweaked by modifying a single "format" and rebuilding the files.
This automation is incorporated into two Perl scripts, (1) dp-table and (2) txttable, which may be used independently of my other workflow tools.
dp-table
dp-table reads a possibly-marked-up file (such as DP output, or an almost-fully-HTML-ized file, or anything in between). It looks for tables, which consist of:
- One or more blank lines
- A line beginning with the characters "//table "
- One or more lines beginning with the characters "<format ", intermingled with....
- One or more non-blank lines formatted according to my formatting rules, with <th> and/or <td> tags added to handle exceptional cases
- Another blank line terminating the table.
Example:
    //table {l;r}
    <format val="l30x15s1 r5">
    <th2>CATALOG
    <th>Book                         Cost
    Ferns of Denali National Park    0.25
    Poisonous Snakes of Texas|3.15
    ...                              ...
    The Fall of Wicheta Falls | 1.00
    Total | <span class="overline">25.15</span>
This table is exactly as the formatters left it, except for these changes manually made by the postprocessor:
- Removed the nowrap ending line, and replaced the beginning line with "//table ..."
- Added a "<format>" line to indicate the number of spaces for each column.
- Added tags (here, a "<span>") to allow for formatting not covered by DP rules.
- Added "<th>" tags to indicate which rows are table headings instead of table data.
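The table-recognition rule described above (blank line, "//table " line, "<format " lines mixed with row lines, terminating blank line) can be illustrated with a minimal sketch. This is Python, not the Perl of dp-table itself; the function name find_tables and the dictionary layout are hypothetical, and treating the start or end of the file as a blank line is my assumption.

```python
def find_tables(lines):
    """Collect table blocks: a '//table ' line after a blank line,
    then '<format ' lines and row lines, ended by a blank line."""
    tables = []
    current = None
    prev_blank = True  # assumption: start of file counts as a blank line
    for raw in lines:
        line = raw.rstrip("\n")
        is_blank = line.strip() == ""
        if current is None:
            if prev_blank and line.startswith("//table "):
                current = {"header": line, "formats": [], "rows": []}
        elif is_blank:
            tables.append(current)  # blank line terminates the table
            current = None
        elif line.startswith("<format "):
            current["formats"].append(line)
        else:
            current["rows"].append(line)
        prev_blank = is_blank
    if current is not None:  # assumption: end of file also ends a table
        tables.append(current)
    return tables
```

Everything outside the collected blocks would be passed through untouched; the sketch only shows where a table starts and stops.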
dp-table automatically:
- Ignores the <format> tag (which must somehow be removed from the HTML file before uploading to Gutenberg).
- Breaks up each line into cells, separated by multiple spaces or a single | character (empty cells are indicated by adjacent | characters).
- Adds the HTML tags <table> and </table> at the beginning and end of the table.
- Adds <tr> and </tr> tags to the beginning and end of each line.
- Adds <th> and </th> to every cell on any line where there was already one <th> tag.
- Adds <td> and </td> to every cell on every other line (retaining any existing tags).
- Adds a "class=" attribute to each <td> tag, based on the bracketed list of classes on the "//table" line (retaining any existing classes).
- Expands certain abbreviated <td> and <th> tags:
  - <td3> or <td3c> is expanded to <td colspan="3">
  - <tdc> or <td3c> is expanded to <td class="c"> (I assume classes l, r, c are defined in the stylesheet as text-align:left/right/center)
  - <th4> is expanded to <th colspan="4">
This (and a few lines in the CSS stylesheet) will create reasonable HTML results, for most tables and for most window-sizes. CSS is adjusted as for any project, but I seldom need much beyond:
    table.center { margin-right:auto; margin-left:auto; }
    table.center td.c { vertical-align:top; text-align:center; }
    table.center td.r { vertical-align:top; text-align:right; }
    table.center td.l { vertical-align:top; text-align:left; }
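To make the dp-table steps listed above concrete, here is a rough Python sketch of the row conversion: cell splitting, <th>/<td> wrapping, per-column classes, and expansion of the abbreviated tags. It is not the Perl script. All function names are hypothetical, the exact attribute spelling (colspan="3", class="c") is an assumption, and so are two design choices: a cell that already carries a class keeps it and gets no column class, and <th> cells receive column classes the same way <td> cells do.

```python
import re

# Matches <td>, <th>, and abbreviated forms such as <td3>, <tdc>, <td3c>, <th4>.
CELL_TAG = re.compile(r"<(td|th)(\d*)([lrc]?)>")


def parse_classes(header):
    """Pull the class list out of a line such as '//table {l;r}'."""
    m = re.search(r"\{([^}]*)\}", header)
    return m.group(1).split(";") if m else []


def split_cells(line):
    """Split on a single '|' or on a run of two or more spaces."""
    return [c.strip() for c in re.split(r"\s*\|\s*|\s{2,}", line.strip())]


def expand_abbrev(cell):
    """Expand e.g. <td3c> to <td colspan="3" class="c"> (assumed spelling)."""
    def repl(m):
        tag, span, cls = m.groups()
        attrs = f' colspan="{span}"' if span else ""
        if cls:
            attrs += f' class="{cls}"'
        return f"<{tag}{attrs}>"
    return CELL_TAG.sub(repl, cell)


def convert_row(line, classes):
    # One <th> anywhere on the line makes every untagged cell a heading cell.
    default_tag = "th" if re.search(r"<th\d*[lrc]?>", line) else "td"
    col = 0
    out = []
    for cell in split_cells(line):
        if not CELL_TAG.match(cell):
            cell = f"<{default_tag}>{cell}"
        cell = expand_abbrev(cell)
        m = re.match(r"<(t[dh])([^>]*)>", cell)
        tag, attrs = m.groups()
        # Assumption: an explicit class on the cell wins over the column class.
        if "class=" not in attrs and col < len(classes):
            cell = f'<{tag}{attrs} class="{classes[col]}">' + cell[m.end():]
        span = re.search(r'colspan="(\d+)"', attrs)
        col += int(span.group(1)) if span else 1  # skip spanned columns
        out.append(f"{cell}</{tag}>")
    return "<tr>" + "".join(out) + "</tr>"


def convert_table(header, rows):
    classes = parse_classes(header)
    return ["<table>"] + [convert_row(r, classes) for r in rows] + ["</table>"]
```

For instance, convert_row("Poisonous Snakes of Texas|3.15", ["l", "r"]) yields a <tr> holding one <td class="l"> cell and one <td class="r"> cell. The sketch drops <format> lines simply by never being handed them, whereas the text above only says dp-table ignores them.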
txttable
txttable assumes:
- Its input is Latin-1 (possibly with HTML entities for UTF characters).
- UTF characters, except for a handful of known exceptions, are exactly 1 space wide.
- Non-table tags (<i>, <b>, <span>, etc.) have already been removed or replaced by the usual text (_ for <i>, etc.).
- Each table row is completely contained in one text-file line (it doesn't care if that line is very, very long).
txttable reads the HTML file and writes it out unchanged (except for material between <table> and </table> tags). It:
- Replaces <table> and </table> tags with <pre> and </pre> (which, of course, must somehow be removed before uploading the text file to Gutenberg).
- For "<format>" lines, breaks the "val" into column descriptions (separated by spaces). Each column description can contain any combination of:
  - x{number}: number of leading spaces in this cell (defaults to 2)
  - y{number}: number of leading spaces in OVERFLOW lines in this cell (defaults to the same as x)
  - s{number}: a non-zero number tells txttable to watch for and handle "colspan" attributes (if there are any, s1 is probably what you want)
  - c{number}: center the value in a column that many spaces wide
  - l{number}: left-justify the value in a column that many spaces wide
  - r{number}: right-justify the value in a column that many spaces wide
- For <tr> lines, lays out each cell according to the format specification:
  - If the value will not fit within the allotted column, wraps the cell to another line.
  - If the value cannot be wrapped because a single word won't fit the column, prints an unpleasant error message.
- Prints a blank line between <th> and <td> rows (in either order). Does not print a blank line between successive <th> lines or successive <td> lines.
- Does not draw cell borders with "ASCII art."
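The format handling and cell layout described in the list above can be sketched as follows. This is an illustrative Python approximation, not txttable's Perl. The names parse_format, layout_cell and layout_row are hypothetical. I assume every column description carries one of the width codes, that left-justification is the default alignment, and that the leading spaces from x/y are added in front of the justified field rather than counted inside it. The s code is parsed but colspan handling is left out, as is the blank-line rule between heading and data rows.

```python
import re
import textwrap
from itertools import zip_longest


def parse_format(val):
    """Turn a val such as 'l30x15s1 r5' into one dict per column."""
    columns = []
    for spec in val.split():
        col = {"align": "l", "width": None, "x": 2, "y": None, "s": 0}
        for code, num in re.findall(r"([a-z])(\d+)", spec):
            if code in "lcr":
                col["align"], col["width"] = code, int(num)
            elif code in "xys":
                col[code] = int(num)
        if col["y"] is None:
            col["y"] = col["x"]  # overflow indent defaults to x
        columns.append(col)
    return columns


def layout_cell(text, col):
    """Wrap one cell to its column width and pad every resulting line."""
    width = col["width"]
    pieces = textwrap.wrap(text, width=width, break_long_words=False) or [""]
    if any(len(p) > width for p in pieces):
        raise ValueError(f"word too long for a {width}-space column: {text!r}")
    justify = {"l": str.ljust, "c": str.center, "r": str.rjust}[col["align"]]
    lines = []
    for i, piece in enumerate(pieces):
        lead = col["x"] if i == 0 else col["y"]
        lines.append(" " * lead + justify(piece, width))
    return lines


def layout_row(cells, columns):
    """Lay cells side by side; shorter cells are padded on overflow lines."""
    blocks = [layout_cell(c, col) for c, col in zip(cells, columns)]
    out = []
    for parts in zip_longest(*blocks):
        line = ""
        for part, col in zip(parts, columns):
            if part is None:  # this cell has no text on this overflow line
                part = " " * (col["y"] + col["width"])
            line += part
        out.append(line.rstrip())
    return out
```

With columns = parse_format("l10x0 r5x1"), a call to layout_row(["alpha beta gamma", "1.00"], columns) wraps the first cell onto a second line and keeps the price on the first.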
The decisions about blank lines and borders suit me. However, one obvious trivial enhancement would be to add a "d{number}" code to force the whole table to be double-spaced. It would be more work, but not difficult, to add "ASCII art" borders as another option.
In my normal workflow:
1. The tags which would cause txttable trouble are automatically removed elsewhere.
2. The leftover tags (<format> or <pre>) are automatically removed before a project is zipped for uploading.
3. There is provision for inserting blank lines in a table.
It would be exceedingly trivial to write Perl scripts which did these three things, and nothing else. But it would not be hard, in a trial run, to do that manually.
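As an illustration of how small such a helper can be, here is a hedged Python sketch of the second item only (removing leftover <format> and <pre> tags). The function name strip_leftover_tags is hypothetical, and the sketch assumes each leftover tag sits on a line of its own, so the whole line is dropped.

```python
import re


def strip_leftover_tags(text):
    """Drop lines that hold only a <format ...> tag or a <pre>/</pre> tag."""
    # Remove whole '<format ...>' lines.
    text = re.sub(r"(?m)^<format\b[^>]*>[ \t]*\n?", "", text)
    # Remove lines consisting only of <pre> or </pre>.
    text = re.sub(r"(?m)^[ \t]*</?pre>[ \t]*\n?", "", text)
    return text
```

Tags embedded in the middle of a line are left alone by this sketch, so a manual check of the result would still be prudent.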