PPTools/Ppgen/Tutorial/DotSR


Intro/Overview

Normally if you want to change what ppgen generates you will change your source file. Sometimes, though, that may prove difficult. It may mean that you have to make extensive use of .if and separate coding for text vs. HTML. Or it may mean having to edit the output by hand before testing it or submitting it for PPV or upload. Both of those can cause problems. For example, using .if, when you find an error in the actual text you may have to make changes in multiple places, and you may miss one of them, or make an error. Or you may forget to do some edits after the output has been generated.

In some cases the ppgen command .sr may help you avoid those problems. With .sr you can have ppgen perform search/replace operations, specified as regular expressions, after it generates the output files. You may specify a command that applies to both the text and HTML outputs, or only to the text, or only to the HTML. You may also specify .sr commands that are applied before ppgen does some of its processing, rather than after, which can help in some cases such as allowing better paragraph wrapping, or table layout.

Most users probably will not need to use .sr, but some users have found it very useful for their specific projects. You will need to understand regular expressions, specifically the dialect used by Python. (All regular expression implementations are similar, but most have subtle differences.) If you need tutorial info, I have found the website [regular-expressions.info] useful. If you understand regular expressions but would like information specifically on using them with Python, I recommend the [Regular Expression explanation in the Python manual] (especially section 6.2.1, and the description of the flags at the start of section 6.2.2 in the description of re.compile).

Notes:

  1. Beginning with version 3.53ca4 ppgen supports both the standard Python "re" package (documented here) and the optional "regex" package (documented here) if the PPer has installed it. ppgen sets the version number to VERSION0 so by default .sr processing will work the same whether you have installed the regex package or not. However, if you want to use the extended facilities provided by the regex package you can supply an override to that by using the VERSION1 global flag on any .sr command that needs it.
  2. Beginning with version 3.54e ppgen also allows you to perform the "replace" part of the search/replace operation using a macro written in Python. This can provide a way of handling even more complex cases than you can handle with regular expressions. Most users will not need this capability and can safely ignore it.

Syntax of the .sr Command

The .sr command has two operands. The first one tells ppgen which output file(s) to process, when to process them, and supplies additional options. The second one provides the search and replace strings.

 .sr <options> <search/replace strings>

The required <options> operand is one or more of the characters u, l, t, h, f, b, B, and/or p. This tells ppgen to apply the operation to one or more of the output files and specifies some additional processing options:

 u: The UTF-8 text output file
 l: The Latin-1 text output file
 t: Any text output file
 h: The HTML output file
 
 f: This applies the operation while operating in "filtering" mode (using the -f command line option)
 b: This applies the operation before processing (toward the end of pre-processing) rather than during post-processing,
      which may help paragraph, table, and .nf wrapping in some cases
 B: (starting with ppgen 3.54e) Like b, this also applies the operation before processing, but much earlier in the pre-processing phase. (Gory details below.)
 p: This tells ppgen to prompt the user whether to apply a replacement or not, on a case-by-case basis. This cannot be used if the search string contains \n

The <search/replace strings> are specified as /search/replace/ where "/" is any single character you choose that does not appear in the search and replace parts of the operand. In the strings,

  • search provides a Python regular expression that specifies the text to locate within the file, and
  • replace specifies the replacement operation to be performed. You may specify either:
    • a Python regular expression; or
    • (starting with ppgen 3.54e) a special form indicating a Python macro that you have provided to perform the replacement operation. This special form is: {{python <macro-name> <options>}} where:
      • <macro-name> specifies the name of the macro
      • <options> specifies optional additional information to be passed to the macro.

Examples:

 .abc.def. would tell ppgen to change all "abc" to "def"
 +abc+def+ or /abc/def/ would do the same thing

Those were simple examples, not really making use of regular expressions. You might also have something like

 .\[(\d+)].{\1}.

That one would find strings beginning with a "[", followed by a string of digits, then a "]". It would then replace the "[]" with "{}" keeping the original number between them. Note, though, that regular expressions often make use of the "." character, so this might not be a good choice of delimiter for complex regular expressions.
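
To see what that operand does, the following Python sketch splits it on its delimiter and applies the result with the standard re module (the regular expression dialect ppgen uses). The helper name split_sr_operand is hypothetical and only illustrates the delimiter rule described above; it is not ppgen's actual parser, and the sample sentence is invented.

```python
import re

def split_sr_operand(operand):
    """Split a /search/replace/ style operand on its first character.

    Hypothetical helper for illustration; ppgen's own parsing may differ.
    """
    delimiter = operand[0]
    parts = operand.split(delimiter)
    # A well-formed operand yields exactly: "", search, replace, ""
    if len(parts) != 4:
        raise ValueError("delimiter also appears inside the search or replace string")
    return parts[1], parts[2]

search, replace = split_sr_operand(r".\[(\d+)].{\1}.")
result = re.sub(search, replace, "See notes [12] and [3].")
print(result)
```

An operand such as .a.b.c. fails in this sketch because the "." delimiter also appears inside one of the strings, which is the situation the delimiter warning above is about.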

If you wanted to invoke a Python macro named mac1 to perform the replacement, then that prior example might be written as

 .\[(\d+)].{{python mac1}}.


For a more complex example, suppose you're working on a two volume book, with an index in volume 2 (II). In the index, topics have page numbers, and if the index refers to volume 1 (I) it shows that. Perhaps something like

 Stained Glass
   History  I 23, 55; 15, 78.

For that entry, you might want to link to volume I pages 23 and 55 and also to pages 15 and 78 within the current volume. With the standard support in ppgen you can create page number links very easily:

   History  I #23#, #55#; #15#, #78#.

However, those will all be links to pages in the current volume, so you need to do something different for the ones in volume 1. One approach is to use placeholders for now, while you work on your book. Eventually, before you upload it to PG (or submit it for PPV) you can contact one of the WWers at PG and request that they reserve two ebook numbers for you. Then, as a final step before you upload or submit for PPV you will modify the source to replace the placeholders with the actual ebook numbers that PG will use, and for that you can use the .sr command.

You might code the placeholders like this:

   History  I #23:vol1_23#, #55:vol1_55#; #15#, #78#.

Note that ppgen will not create 'Page_' links using this notation, but will use your placeholder string as-is:

   History  I <a href='#vol1_23'>23</a> ...

Then, you could use a .sr command to fix the volume 1 links so they actually work. During the early work on the source, you can reference the relative path to your working location:

  .sr h |#vol1_(\d+)|../vol1/volume1.html#Page_\1|

This allows you to test those external links prior to requesting the actual numbers from PG white-washers.

Then, just prior to submission, you can change the replacement pattern to point to the actual address of the posted project:

  .sr h |#vol1_(\d+)|http:/\/www.gutenberg.org/files/48154/48154-h/48154-h.htm#Page_\1|

(Hint: to avoid broken link messages during ppgen, create a set of placeholders at the end of your text:

  <target id=vol1_23>placeholder for vol1_23.

These can be deleted at the end or simply enclosed with .ig / .ig- in case future repairs are required.)
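
The effect of the working-copy .sr above can be previewed with Python's re module. This is only a sketch: the sample line of generated HTML is an assumption modelled on the fragment shown earlier (including the form of the in-volume 'Page_' link), not actual ppgen output.

```python
import re

# Assumed sample of generated HTML for the index entry.
generated = ("History  I <a href='#vol1_23'>23</a>, <a href='#vol1_55'>55</a>; "
             "<a href='#Page_15'>15</a>")

search = r"#vol1_(\d+)"
working_copy = r"../vol1/volume1.html#Page_\1"

fixed = re.sub(search, working_copy, generated)
print(fixed)
```

Only the placeholder links change; the link to page 15 in the current volume does not match the search string and is left alone.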

.sr Processing Notes

  • The .sr commands are processed in the order you specify them.
  • If the search string does not contain a "\n" then the lines of the file are processed individually, and the search string must be completely contained within the line. However, you can use the replace string to add a "\n" to a line, which will split it into two lines.
  • If the search string does contain a "\n" then ppgen will concatenate the entire file together, with lines separated by "\n", and the search and replacement strings may span lines. If you do not specify any of the (?aiLmsux) extension flags (described in section 6.2.1 of the Python manual referenced above) as part of the search string then:
    • Matching is case sensitive (T will not match t, for example). Specifying (?i) at the start of the search string will make it case insensitive.
    • The "." character will not match a newline ("\n"). Specifying (?s) allows the "." to match the newline character.
    • The "^" character will match only at the beginning of the entire file. Specifying (?m) will also allow "^" to match after each newline.
    • The "$" character will match only at the end of the entire file. Specifying (?m) will also allow the "$" to match before each newline.
  • Remember that, unless you specify the b or B option, the replacement occurs after wrapping. In text files a longer replacement can therefore create overlong lines, so where layout matters make sure that the replacement string is the same length as the original string.
  • Remember that the characters "//" signal the start of a comment. So, if your replacement string needs to contain a "//" (for example, if you're changing a placeholder into a URL) then you can't simply use http://... because ppgen will truncate the .sr directive after "http:". Instead you need to escape the second / character with a backslash, making it \/. For example, .sr h ~placeholder~http:/\/rest-of-url-goes-here~
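
The line handling and flag behaviour described in these notes can be tried out in plain Python. The function apply_sr below is a hypothetical approximation written for this tutorial, not ppgen's code: it joins the lines only when the search string contains "\n", and otherwise works line by line.

```python
import re

def apply_sr(lines, search, replace):
    """Approximate the line handling described in the notes (a sketch only)."""
    if r"\n" in search:
        # The search string mentions a newline: treat the file as one string.
        joined = "\n".join(lines)
        return re.sub(search, replace, joined).split("\n")
    result = []
    for line in lines:
        # A "\n" in the replacement splits the line into two lines.
        result.extend(re.sub(search, replace, line).split("\n"))
    return result

lines = ["alpha", "beta", "alpha"]

per_line = apply_sr(lines, r"alpha", "A")                     # each line on its own
spanning = apply_sr(lines, r"alpha\nbeta", "alpha beta")      # match spans two lines
splitting = apply_sr(["alpha, beta"], r", ", r",\n")          # replacement adds a line
file_start = apply_sr(lines, r"^alpha(\n|$)", r"X\1")         # "^" is start of file only
every_line = apply_sr(lines, r"(?m)^alpha(\n|$)", r"X\1")     # (?m): "^" after each newline
any_case = apply_sr(lines, r"(?i)ALPHA", "A")                 # (?i): ignore case

for name in ("per_line", "spanning", "splitting", "file_start", "every_line", "any_case"):
    print(name, globals()[name])
```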

User-contributed Examples of .sr Usage

(Please feel free to contribute .sr commands you've found useful, along with a basic explanation of what they do.)

Correct text alignment issues due to character length difference between — and --

When generating Latin-1 output from a UTF-8 encoded source file, ppgen converts UTF-8 em-dash characters '—' into two hyphen characters '--'. This can potentially cause alignment issues in the Latin-1 output as one character is being replaced by two. The .sr directive can be used to address alignment problems like these.

In the following example, a UTF-8 em-dash is found inside a table cell. Without the .sr statement, the '|' character representing the right edge of the table would be one character out of alignment. The .sr statement solves the issue by removing a space in the Latin-1 output to account for the extra hyphen character.

Source file (UTF-8 encoded):

.sr l ~\| Wealth--the same \|~| Wealth--the same|~

+-----------------+
| Wealth—the same |
+-----------------+

UTF-8 output:

+-----------------+
| Wealth—the same |
+-----------------+

Latin-1 output:

+-----------------+
| Wealth--the same|
+-----------------+
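
The arithmetic behind this fix can be checked with a short Python sketch. The strings are copied from the example above; the variable names are only for illustration.

```python
import re

border = "+" + "-" * 17 + "+"            # the table border shown above
latin1_row = "| Wealth--the same |"      # row once the em-dash has become "--"

fixed_row = re.sub(r"\| Wealth--the same \|", "| Wealth--the same|", latin1_row)

print(len(border), len(latin1_row), len(fixed_row))
```

The uncorrected Latin-1 row is one character wider than the border, and dropping the space before the closing "|" brings it back into line.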


Images embedded in text

Cases like this, where images are embedded and flow with the text, can be handled with the .sr directive.

Source file:

This paragraph has an image [il=i_001.jpg] inside it.

// Ancient Chinese glyphs
.de .glyph { display: inline-block; width: auto; height: auto; vertical-align: middle; }
.sr h ~\[il=(.+?)\]~<img class='glyph' src='images/\1' alt='' />~
.sr t ~\[il=(.+?)\]~[Chinese: **]~

HTML output:

                            +-----------+
                            | image of  |
This paragraph has an image | i_001.jpg | inside it.
                            +-----------+

Text output:

This paragraph has an image [Chinese: **] inside it.

Note: If the [il=i_001.jpg] statement contains any spaces, then there is a chance that word wrap will break the statement up and cause it to not be replaced.
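
Both replacements, and the wrapping problem mentioned in the note, can be reproduced with Python's re module. This is a sketch only: the file name "my glyph.jpg" and the way the line is broken are invented to illustrate the note.

```python
import re

search = r"\[il=(.+?)\]"
source_line = "This paragraph has an image [il=i_001.jpg] inside it."

html_line = re.sub(search, r"<img class='glyph' src='images/\1' alt='' />", source_line)
text_line = re.sub(search, "[Chinese: **]", source_line)
print(html_line)
print(text_line)

# Hypothetical case: a file name with a space, split across two lines by word wrap.
wrapped = ["This paragraph has an image [il=my", "glyph.jpg] inside it."]
after_sr = [re.sub(search, "[Chinese: **]", line) for line in wrapped]
print(after_sr == wrapped)   # neither line contains a complete [il=...] statement
```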

If your images are all of the same height, and you want the images to scale when the surrounding text size is adjusted, then use the following .de statement which expresses image height in em units:

.de .glyph { display: inline-block; width: auto; height: 0.75em; vertical-align: middle; }

Implementing rowspan for table cells

ppgen supports the colspan attribute for tables using the <span> tag, which allows a cell to span multiple columns of a table row. However, it does not yet have support for rowspan, which would allow a cell to span multiple rows. You can implement that yourself in many cases using the .sr directive, by "tagging" the cell data with a special character string, then recognizing that string using .sr and replacing it with appropriate HTML (or blanks, for the text version).

For example, suppose the generated HTML for your cell looks like this (a cell occupying a single row):

Artisans, etc.

but you need it to look like this (the same cell spanning two rows):

Artisans, etc.

Your .ta probably looks something like this:

 .ta ...
 Artisans, &c. | ...
  | ...
 ...
 .ta-

You could instead code it like this:

 .ta ...
 %%rs=2Artisans, &c. | ...
 %%dc| ...
 ...
 .ta-

Then, you could provide the following .sr directives somewhere in your ppgen -src.txt file:

 .sr t ~%%rs=.~~     // remove any %%rs= for text
 .sr t ~%%dc~    ~   // replace the %%dc with spaces in text

 .sr h ~(<td.*?)>%%rs=(.)~\1 rowspan=\2>~   // turn %%rs= into a rowspan specification for HTML
 .sr h ~%%dc~~       // for HTML, remove the cell covered by the rowspan of the cell above it.

Note that for the text version:

  1. This approach cannot remove any horizontal border below the cell. So, if you're using horizontal borders between rows this approach is not perfect for the text version.
  2. The %%rs= and %%dc tags increase the apparent width of that column, even though you'll delete them later. You will need to account for that in your table layout specifications on the .ta directive.
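
The %%rs= patterns can be tried against sample strings in Python. This is a sketch under assumptions: the td markup with class 'c1' is a made-up stand-in for whatever ppgen actually generates for the cell, and the text rows are simplified.

```python
import re

# HTML: move the tag from the cell text into the <td> as a rowspan attribute.
html_cell = "<td class='c1'>%%rs=2Artisans, etc.</td>"    # assumed markup
html_fixed = re.sub(r"(<td.*?)>%%rs=(.)", r"\1 rowspan=\2>", html_cell)
print(html_fixed)

# Text: drop the %%rs= tag, and swap %%dc for the same number of spaces.
text_row1 = re.sub(r"%%rs=.", "", "%%rs=2Artisans, &c. | ...")
text_row2 = re.sub(r"%%dc", "    ", "%%dc| ...")
print(repr(text_row1))
print(repr(text_row2))
```

The first text row ends up six characters shorter than it was when the table was laid out, which is why the tags need to be allowed for in the column widths.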

.sr using Python macros

Python macros within .sr processing operate much like other Python macros supported by ppgen. You start by defining the macro, but you do not define any parameters:

 .dm macro1 lang=python
 Python code goes here
 .dm-

You then invoke it on the .sr directive, e.g.,

 .sr th $[abc]+${{python macro1}}$

or, if you use the same macro for multiple purposes and want to give it a hint about what it should do, you might add one like this:

 .sr th $[abc]+${{python macro1 swapcase}}$

A more complete example of a Python macro in .sr

 .dm macro1 lang=python
 match = var["match"]          // variable "match" is an re match object (e.g., match = re.search( ... ))
 matched = match.group(0)      // match.group(0) is the text string that was matched
 
 if var["hint"] == "swapcase": // variable "hint" lets you see the options provided in the .sr after the macro name
   matched = matched.swapcase()  // swap the case of the matched string if hint is swapcase
 elif var["hint"] == "upper":
   matched = matched.upper()     // or make it upper case if the hint is upper
 elif var["hint"] == "lower":
   matched = matched.lower()     // or make it lower case if the hint is lower
 else:
   print("Unknown python-change option: {}".format(var["hint"]))
 
 var["out"] = matched          // you assign the result to variable "out"
 .dm-

Differences from other Python macros in ppgen

Python macros involved with .sr processing receive 3 additional variables:

  • var["match"] provides access to the re match object returned by the re.search operation performed by ppgen. For complete details on match objects see the Python re documentation here. Basically, though, after an assignment statement like match = var["match"] above you can access:
    • match.group(0) to see the complete string that was matched
    • match.group(1) to see group number 1 (which a normal regex would refer to as "\1")
    • match.group(2) to see group number 2 ("\2")
    • etc.
  • var["hint"] provides access to the optional parameters (or hints) provided after the macro name on the .sr: .sr th $whatever${{python macname hint}}$
  • var["srchfor"] provides access to the search string specified on the .sr directive.
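
To experiment with macro logic outside ppgen, the three variables can be imitated in ordinary Python. In the sketch below, simulate_sr is a hypothetical stand-in for ppgen's own processing (which is not shown in this tutorial), and the macro body is wrapped in a function only so that it can be called; inside ppgen you would write just the body between .dm and .dm-.

```python
import re

def macro1(var):
    """Body of the example macro, wrapped in a function for this sketch."""
    matched = var["match"].group(0)
    hint = var["hint"]
    if hint == "swapcase":
        matched = matched.swapcase()
    elif hint == "upper":
        matched = matched.upper()
    elif hint == "lower":
        matched = matched.lower()
    else:
        print("Unknown python-change option: {}".format(hint))
    var["out"] = matched

def simulate_sr(text, search, macro, hint=""):
    """Hypothetical harness: call the macro once per match, as a .sr might."""
    def replacer(match):
        var = {"match": match, "hint": hint, "srchfor": search}
        macro(var)
        return var["out"]
    return re.sub(search, replacer, text)

swapped = simulate_sr("cab Fare", "[abc]+", macro1, "swapcase")
print(swapped)
```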


Gory internals of ppgen processing

Knowledge of the processing structure (phases) of ppgen can sometimes prove useful when figuring out how to use the .sr directive. If you don't specify the f, b, or B options then the directive runs after the output (either text or HTML) has been built. If your replacement changes the length of any text at that point, you might affect the layout of tables, or create lines that are too long or short, or that could have wrapped differently. Depending on exactly what kind of changes your .sr operations make, that may not matter. But if it does, you can tell .sr to process the operations earlier.
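
The difference that timing makes can be illustrated with Python's textwrap module standing in for ppgen's paragraph wrapping. This is an analogy under assumptions, not ppgen's algorithm: the width of 30, the sample sentence, and the PLACEHOLDER string are all invented.

```python
import re
import textwrap

WIDTH = 30
source = "The quick brown fox PLACEHOLDER jumps over the lazy dog near the river bank"
search, replace = r"PLACEHOLDER", "a considerably longer phrase"

# Default timing: the text is wrapped first, then each line is searched.
after_wrap = [re.sub(search, replace, line) for line in textwrap.wrap(source, WIDTH)]

# Earlier timing (in the spirit of b or B): replace first, then wrap.
before_wrap = textwrap.wrap(re.sub(search, replace, source), WIDTH)

print(max(len(line) for line in after_wrap))    # longer than WIDTH
print(max(len(line) for line in before_wrap))   # no longer than WIDTH
```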

f option

The f option is used only when ppgen is being used as a "filter", an operation which takes the input file, performs some transformations on it, and creates a new version of the input file rather than creating a normal ppgen output file (-utf8.txt or .html). During filtering ppgen will, if requested, process Greek transliterations and/or characters with diacritical marks, but other than that will simply produce a character-for-character copy of the input file. With f the .sr will run during the filtering process. For ppgen users, filtering is most often useful as a preliminary step to learn what Greek or diacritical characters exist, and gauge the additional work that will be needed (marking accents, telling ppgen how to handle odd diacritical characters). However, filtering is also useful for users who do not use ppgen to process their files. For example, Guiguts or PPQT2 users can use filtering to transform their files with Greek transliterations or diacritical markup into files that instead have proper UTF-8 Greek transcriptions and proper UTF-8 diacritical characters, before using their other tools to finish the work.

B and b options

The B (new with ppgen 3.54e) and b options cause ppgen to run the .sr directive at different times during what ppgen calls the "pre-processing phase".

Phases of ppgen processing

ppgen has two major "control" phases, text and HTML. Within each of those ppgen has 3 major phases: pre-processing, processing, and post-processing. Pre-processing also has an inner phase that handles pre-processing tasks that are common to both the text and the HTML control phases.

  • During pre-processing for text, ppgen:
    • Runs the common pre-processing phase
    • Removes things that are irrelevant in the text output, such as:
      • <lang> and <abbr> tags
      • internal page links and page numbers (.pn directives)
    • Assigns footnote numbers
    • Handles inline markup (<i>, etc.)
    • Runs any .sr directives that specify the t and b options


  • During pre-processing for HTML, ppgen:
    • Runs the common pre-processing phase
    • Protects internal page links
    • Handles page numbering (.pn), transforming the numbers into a protected form and merging them down to an appropriate location
    • Converts any <br> outside of .li blocks to <br />
    • Assigns footnote numbers and handles footnote references
    • Replicates inline markup (<i>, etc.) that spans lines in .nf blocks
    • Handles inline markup of all kinds
    • Runs any .sr directives that specify the h and b options


  • During the common pre-processing phase, ppgen:
    • Loads the filter control file if the user requested filtering
    • Processes Greek and characters with diacritic markup
    • Runs and removes any .sr directives that specify the f option, if filtering was requested
    • Terminates if filtering, or if the user requested it via options on the .gk or .cv directives
    • Examines the complete file in several passes, during which it:
      • Removes comments (// text) and ignored lines (.ig)
      • Processes .if directives, either keeping or deleting lines as requested.
      • Finds, saves, and removes any .sr directives that specify the b or B option, or that specify none of the f, b, or B options
      • Processes .dm directives to define macros
      • Processes macro invocations (via .pm, then via <pm>)
      • Runs any saved .sr directives that specify the appropriate t or h option and the B option
      • Handles character mappings (.ma) if appropriate
      • Handles courtesy remaps of some directives (e.g., ".nf" becomes ".nf l")
      • Remaps some important characters or character strings to protected versions (ellipses, \_, other \<something> escaped characters, etc.)
      • Defines caption models (.cm)
      • Remaps ".ce" to ".nf c" and ".rj" to ".nf r"
      • Handles ".sp" directives within ".nf" blocks
      • Handles ".dt" (display title) and ".ci" (cover image) directives and removes them
      • Handles ".bn" directives
      • Adds protection to superscripted or subscripted characters
      • Converts <i> to <I> (also for b, sc) if needed


  • During the processing phase, ppgen performs all the work to build the output text.


  • During the post-processing phase for text, ppgen:
    • Combines any consecutive space (.sp) requests, keeping the larger amount
    • Restores any protected characters to their usual formats
    • Runs any .sr directives that specify the t option but not the f, b, or B options


  • During the post-processing phase for HTML, ppgen:
    • Restores any protected characters to their usual formats
    • Removes style= info and converts to dynamically generated class names
    • Adds in the HTML header and footer and generates the CSS
    • Runs any .sr directives that specify the h option but not the f, b, or B options
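
As a quick reference, the timing rules in the lists above can be condensed into a small Python function. The name sr_phase is hypothetical, and the order in which the options are checked when more than one timing option is given is an assumption; the lists above do not say how ppgen treats such combinations.

```python
def sr_phase(options):
    """Describe when a .sr with the given <options> operand runs.

    Hypothetical summary helper based on the phase lists above.
    """
    if "f" in options:
        return "common pre-processing, while filtering"
    if "B" in options:
        return "common pre-processing, after macro invocations"
    if "b" in options:
        return "end of pre-processing for text or HTML"
    return "post-processing"

for opts in ("h", "tb", "thB", "f"):
    print(opts, "->", sr_phase(opts))
```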