LaTeX postprocessing guidelines/Lprep

From DPWiki

Lprep: a LaTeX code remover

Lprep is a Perl program that strips most of the LaTeX commands out of your file, leaving an almost-plain text file which can usefully be checked with non-LaTeX PPing tools like gutcheck, guiguts, etc.

Lprep was conceived and written by rfrank, with several PPers contributing subsequent enhancements.

Using lprep

You need to have perl installed. (If you already use guiguts then you already have perl.) The program is called using

perl -w Lprep.pl basename

where the project is basename.tex. Unless lprep crashes, you should end up with basename.txt containing the de-LaTeXed text and basename.log containing any lprep messages.

Configuring lprep

Lprep has been built to deal automatically with the LaTeX code and packages most likely to be used in DP projects. However, if you have defined new commands in your preamble, or are using commands from packages lprep hasn't been programmed to handle automatically, you can extend lprep's capabilities. This requires a bit of understanding of how lprep works (and perhaps a tiny bit of Perl syntax and regex knowledge). See below for sample configuration code.

For portability, and to work with the LaTeX posting procedure at PG, since version 0.33 lprep reads the configuration data from the end of the LaTeX source. Just after \end{document}, put a line beginning ###, then the lprep configuration code on the following lines, and then another line beginning ### to flag the end of the configuration data. If you are also appending a LaTeX compilation log to your source (as required when uploading to PG), this should come after the lprep configuration data. Even if you don't need any customisation, it's good practice to conclude your LaTeX with

\end{document}
###

###

(This ensures lprep stops reading before the log file.) Lprep can also read configuration data from a basename.cfg file for the project, but use of this feature is now deprecated because it doesn't mesh with the PG upload procedure.
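To make the layout concrete, here is a minimal Perl sketch of how a script could pick out the lines between the two ### markers that follow \end{document}. It is not lprep's own code: the helper name extract_config and the sample lines are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not taken from lprep): return the lines lying between
# the first two lines beginning with ### that follow \end{document}.
sub extract_config {
    my @lines = @_;
    my ($seen_end, $in_config) = (0, 0);
    my @config;
    for my $line (@lines) {
        if (!$seen_end) {
            # Ignore everything up to and including \end{document}.
            $seen_end = 1 if $line =~ /^\\end\{document\}/;
            next;
        }
        if ($line =~ /^###/) {
            last if $in_config;    # second ### closes the block
            $in_config = 1;        # first ### opens it
            next;
        }
        push @config, $line if $in_config;
    }
    return @config;
}

# Invented sample source: body, configuration block, then a compilation log.
my @tex = map { "$_\n" } (
    'Last line of the book.',
    '\end{document}',
    '###',
    '$French = 1;',
    '###',
    'This is pdfeTeX ...',
);

print extract_config(@tex);    # the log line after the second ### is never read
```

Because reading stops at the second ###, anything appended afterwards, such as the compilation log, cannot be mistaken for configuration code.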


The basic idea is that the configuration file allows you to add to the lists of things processed in various steps of lprep's operation. (You don't need to worry about macros that only occur in math mode—lprep never looks inside math mode—or only within the preamble—lprep skips the preamble entirely.)

What lprep does

The program makes several passes through the source .tex file:

  1. On the first pass, everything before \begin{document} and after \end{document} is removed. Within the body of the document, comments are stripped out, and page separators are made to resemble standard DP ones. Many PPers sanitise their page separators with a leading % or %% and the program handles these automatically. If you use a different syntax then you can assign a page separator-recognising regex to $PageSeparator in the configuration data (consult sub preliminaries in lprep.pl for the default regex used, or see the example configuration file below); the program will replace the characters of what it recognises as a page separator by hyphens. On this pass some other things are sanitised in preparation for subsequent passes.
  2. On the second pass, any TeX conditionals are cleaned out. Specifically, any control word beginning \if is assumed to start a conditional clause. The negative clause (that is, what follows the \else if there is one) is retained and the rest removed.
  3. On the third pass, things like displayed mathematics, figures and tables are removed and replaced by a text flag. The standard environments are handled automatically, but additional ones can be provided by defining @MathEnvironments in the configuration file. Commands from the configuration file are executed first, so can be used to override the normal behaviour of the program. See below for the syntax expected.
  4. On the fourth pass, inline mathematics delimited by $...$ and displayed mathematics delimited by $$...$$ are also removed and replaced by a text flag. This pass is not configurable.
  5. On the fifth pass, the remaining code is checked for parts which may be deprecated LaTeX style, ineffective (eg \Large{foo}), or better suited to inclusion in the document preamble than the document body. This gratuitous pontification is not configurable.
  6. On the sixth pass, the mandatory and optional arguments of control words are dealt with. By "dealt with" we mean removed, modified, or retained. The most common situation of a single mandatory argument which should be retained as is (ie \foo{bar} becomes "bar" in the output) is handled automatically by the eleventh pass, so for this pass it is only necessary to detail how less common situations should be handled. Many standard LaTeX constructions are built in to the program, but you can add others by defining @ControlwordArguments in the configuration file. See below for the expected syntax. (If unexpected results are obtained, change the value of $trace in the subroutine code to get some idea of what is going wrong.) Sectioning and cross-referencing commands which normally add or generate a number will produce "00" where the number would usually appear. For example, \chapter{Blah blah.} will become "CHAPTER 00 Blah blah.", \subsection{Foobar.} will become "§00.00 Foobar.", and \ref{label:xx} will become " (00)".
  7. On the seventh pass, control words which should simply be replaced by some text (eg \textellipsis becomes "...") are dealt with. Additional replacements can be incorporated by defining @ControlwordReplace in the configuration file. See below for the expected syntax. If $French is set to be nonzero in the configuration file, a few French-related substitutions are added to the standard list.
  8. On the eighth pass, control symbols are subjected to similar treatment. Those covered can be extended by defining @ControlsymbolReplace in the configuration file: see below for the expected syntax.
  9. On the ninth pass, various TeX accents are turned into Latin1 accented characters. This can be customised by defining @AccentReplace in the configuration file. See below.
  10. The tenth pass will evaluate whatever is assigned to the string $CustomClean in the configuration file. Obviously this needs to be valid perl code, and must perform its own traversal through the file contents in @file.
  11. The eleventh pass strips out anything resembling a control word which has survived to this point. Each removal is logged, but to discover control words which may need to be configured into earlier passes it may be necessary to suppress this pass: this can be achieved by setting $StripEverything to 0 in the configuration file.
  12. The final pass removes (some) superfluous whitespace and any remaining braces, cleans up quote marks, and restores previously sanitised characters to their original form.
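The flavour of these passes can be sketched in a few lines of Perl. The following is a toy illustration only, covering the ideas behind the fourth, seventh, eleventh and final passes on a single line. The regular expressions, the function name toy_clean and the "<math>" flag are assumptions made for this sketch, and are not taken from the lprep source.

```perl
use strict;
use warnings;

# Toy replacement table in the spirit of the seventh pass
# (control word => text); not lprep's built-in list.
my @replacements = (
    [qr/\\textellipsis/, '...'],
);

sub toy_clean {
    my ($line) = @_;

    # Fourth-pass idea: swap inline $...$ mathematics for a text flag.
    $line =~ s{\$[^\$]+\$}{<math>}g;

    # Seventh-pass idea: replace known control words by plain text.
    for my $pair (@replacements) {
        my ($pattern, $text) = @$pair;
        $line =~ s/$pattern(?![A-Za-z])/$text/g;
    }

    # Eleventh-pass idea: strip any control word that has survived.
    $line =~ s/\\[A-Za-z]+//g;

    # Final-pass idea: remove the remaining braces.
    $line =~ tr/{}//d;

    return $line;
}

print toy_clean('If $x>0$ then \emph{wait}\textellipsis'), "\n";
```

The order matters in the same way as in lprep itself: a control word that needs special treatment has to be dealt with before the catch-all stripping step removes it.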

There are many legal TeX constructs which will not be handled—or will be handled incorrectly—by the program. There is a better chance of the program handling your code if you use proper LaTeX syntax and eschew plain TeX hacks.

Configuration syntax

The configuration variables beginning with @ are Perl arrays, each of whose elements is a reference to an anonymous array, written in square brackets, in one of the formats listed here. You should be able to get the idea from the sample code below. Anything from the configuration file takes precedence over what is already hard-coded into lprep, so you can override "normal" processing if necessary.

@MathEnvironments
[start string, end string, text replacement]
@ControlwordArguments
Syntax is
[control word, 1=Mandatory/0=Optional, 1=Keep/0=Delete, opening replacement, closing replacement, ...]

with the last four elements repeated for each argument to be processed. For an optional argument, the value of "Keep" can be a default text value (surrounded by a disposable [...]). A final mandatory argument which needs to be kept (minus braces) can be handled by the eleventh pass processing (so "1, 1, '', ''" at the end is redundant).

@ControlwordReplace
[control word, text replacement]
@ControlsymbolReplace
[control symbol, text replacement]
@AccentReplace
[TeX code for accented letter, Latin1 replacement]

These parameters obviously have to be presented in a form perl can digest: see the examples below, or consult the lprep.pl source code.
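As a worked reading of the @ControlwordArguments syntax, the fragment below annotates a single entry element by element. The control word \margintitle is hypothetical, invented for this illustration, and the entry simply follows the conventions of the samples on this page; it has not been checked against the lprep source.

```perl
# Hypothetical command, used as \margintitle[short form]{Full title}
@ControlwordArguments = (
  ['\\margintitle',   # control word (handled as a regular expression)
   0, 0, '', '',      # first argument: optional, deleted, no decoration
   1, 1, '[', ']'],   # second argument: mandatory, kept, wrapped in [ and ]
                      );
```

Under these assumptions, \margintitle[short form]{Full title} would be expected to come out as "[Full title]" in the output.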

Sample configuration code

Here are some examples of how the configuration data can be used to tailor lprep to a specific project.

  • If you use a control sequence rather than a TeX comment to hide page separators (so you can display the png numbers for example)
$PageSeparator = qr/^\\PGx?--/; # interpret lines beginning \PG-- or \PGx-- as page separators
  • To deal with a nonstandard mathematics environment: this will replace \begin{LRalign}...\end{LRalign} with "<aligned equation>"; the entries in the configuration array are handled as strings.
@MathEnvironments = (
  ['\\begin{LRalign}','\\end{LRalign}','<aligned equation>']
                   );
  • Dealing with project-specific control words that take arguments; the first element in each entry is handled as a regular expression, so characters like * that have special meanings in a regular expression need to be escaped
@ControlwordArguments = (
  ['\\chapindex', 1, 0, '', ''], # remove a single mandatory argument
  ['\\DPpdfbookmark', 0, 0, '', '', 1, 0, '', '', 1, 0, '', ''], # remove three arguments, the first of which is optional
  ['\\footnoteT', 0, 0, '', '', 1, 1, '~[Transcriber\'s note: ', ']'], # delete optional argument and decorate mandatory one
  # normally \section*{foo} is handled by stripcontolwords, but this PPer's version of sectioning commands adds a period, hence
  ['\\section\\*', 1, 1, '', '.~'],
  # and we also need to override the standard treatment of \section[foo][bar]{baz} for the same reason
  ['\\section', 0, 0, '', '', 0, 0, '', '', 1, 1, '§00 ', '.~'],
  ['\\Vpageref', 0, 0, '', '', 0, 0, '', '', 1, 0, 'On page (00)', ''], # remove two optional arguments and replace a mandatory one
  # an empty default value for optional argument ensures decorations are used even if the argument is absent
  ['\\begin{COROLLAIRE}', 0, [], 'Corollaire ', '---']
                      );
  • Replace project-specific shortcuts with appropriate text (the first entry is again handled as a regular expression)
@ControlwordReplace = (
 ['\\EG', 'Ex. gr.'],
 ['\\hoipolloi', '~[GREEK: hoi polloi] ']
                     );
  • Most LaTeX syntax can be dealt with systematically; plain TeX odds and ends are more difficult because they tend to be less amenable to being captured by a regular expression. The following will remove everything from \LP to end of line: if the LaTeX preamble includes \let\LP\empty then \LP will not affect the LaTeX code at all, but will serve as a "comment" to hide horrible TeX hacks from lprep. (To avoid unexpected loss of text, make sure no other control words begin with \LP, or modify the regex below so it matches nothing but \LP.) Here "end of line" may not be quite what you expect: when lprep removes TeX comments during the initial pass it joins the line containing the comment to the following line. Hence a line in the LaTeX source like ...\LP...%... will result in everything from \LP to the end of the next line being deleted by lprep.
$CustomClean = 'print "\\nCustom cleaning in progress...";
 my $cline = 0;
 while ($cline <= $#file) {
   $file[$cline] =~ s/\\\\LP.*$//; # strip marked raw TeX
   $cline++
 }
 print "done\\n";';
  • The configuration code is appended to the LaTeX source:
...
\end{document}

###
$PageSeparator = qr/^\\PGx?--/; # interpret lines beginning \PG-- or \PGx-- as page separators
@MathEnvironments = (
  ['\\begin{LRalign}','\\end{LRalign}','<aligned equation>']
                   );
@ControlwordArguments = (
  ['\\chapindex', 1, 0, '', ''], # remove a single mandatory argument
  ['\\DPpdfbookmark', 0, 0, '', '', 1, 0, '', '', 1, 0, '', ''], # remove three arguments, the first of which is optional
  ['\\footnoteT', 0, 0, '', '', 1, 1, '~[Transcriber\'s note: ', ']'], # delete optional argument and decorate mandatory one
  # normally \section*{foo} is handled by stripcontolwords, but this PPer's version of sectioning commands adds a period, hence
  ['\\section\\*', 1, 1, '', '.~'],
  # and we also need to override the standard treatment of \section[foo][bar]{baz} for the same reason
  ['\\section', 0, 0, '', '', 0, 0, '', '', 1, 1, '§00 ', '.~'],
  ['\\Vpageref', 0, 0, '', '', 0, 0, '', '', 1, 0, 'On page (00)', ''], # remove two optional arguments and replace a mandatory one
  # an empty default value for optional argument ensures decorations are used even if the argument is absent
  ['\\begin{COROLLAIRE}', 0, [], 'Corollaire ', '---']
                      );
@ControlwordReplace = (
 ['\\EG', 'Ex. gr.'],
 ['\\hoipolloi', '~[GREEK: hoi polloi] ']
                     );
$CustomClean = 'print "\\nCustom cleaning in progress...";
 my $cline = 0;
 while ($cline <= $#file) {
   $file[$cline] =~ s/\\\\LP.*$//; # strip marked raw TeX
   $cline++
 }
 print "done\\n";';
###

This is pdfeTeX, Version 3.141592-1.21a-2.2 (MiKTeX 2.4) (preloaded format=latex 2007.3.9)  5 DEC 2007 19:36
entering extended mode
...
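Before appending configuration code to a project, it can help to try the pieces out in isolation. The standalone Perl sketch below exercises the sample $PageSeparator pattern and the sample $CustomClean code on a few invented lines. It assumes that lprep holds one source line per element of @file and runs $CustomClean through a string eval; both points are assumptions of this sketch, so treat the result as a rough check rather than a guarantee of what lprep will do.

```perl
use strict;
use warnings;

# Same pattern as in the sample configuration.
my $PageSeparator = qr/^\\PGx?--/;

# Invented source lines for the check.
my @samples = (
    '\PG--File: 001.png',
    '\PGx--File: 002.png',
    'Text that mentions \PG-- in mid-line',
);
for my $line (@samples) {
    my $verdict = ($line =~ $PageSeparator) ? 'separator' : 'ordinary';
    printf "%-9s | %s\n", $verdict, $line;
}

# Stand-in for lprep's @file array (assumed: one source line per element).
my @file = (
    "Kept text \\LP\\hskip 0pt plus 1fil\n",
    "No marker on this line\n",
);

# Same code string as the sample configuration, minus the progress messages.
my $CustomClean = 'my $cline = 0;
 while ($cline <= $#file) {
   $file[$cline] =~ s/\\\\LP.*$//; # strip marked raw TeX
   $cline++
 }';

# Assumption: the string is evaluated where @file is visible.
eval $CustomClean;
die $@ if $@;
print @file;
```

Only the first two sample lines are reported as separators, because the pattern is anchored to the start of the line, and the first element of @file loses everything from \LP up to the end of the line.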

Caveats

  • Because lprep removes the preamble and the contents of math displays, tabulars, etc., any text that is buried in these will not be passed to the lprep output, and hence will not be available for checking with other tools. This probably means that such text will need close manual scrutiny to detect residual errors.
  • Because of the number of sanitisations, substitutions, etc that it carries out, Lprep does not always get interword/interblock spacing correct in the output: there will occasionally be additional whitespace, and occasionally whitespace will get lost, which may trigger spurious errors in gutcheck and friends.
  • Because lprep converts any automatic numbering to "00", you need to check consistency of numbering some other way.