PPTools/Ppgen/Tutorial/Diacritics
Overview
Note: Diacritic processing is available in ppgen 3.46j and later.
The DP site currently supports only Latin-1 characters for proofreading, and the Proofreading Guidelines document how to handle non-Latin-1 characters, especially those with diacritical marks. If your project has non-Latin-1 characters, such as ă (a with a breve over it), you should find that the proofers have provided diacritic markup for that character: [)a]
As the PPer you will probably want to convert those to the actual Unicode characters for your utf8.txt file and your .html file, and you can do that easily with the ppgen support for DP's diacritic markup style. PPgen has built-in support to recognize the diacritic markup for most characters in the Latin Extended-A and Latin Extended-B code pages, and many characters in the Latin Extended Additional code page. Within those code pages, ppgen supports any characters that are expressible in the standard DP markup (plus a few extensions) without conflicts, and that exist in Unicode as single characters. It does not have built-in support to generate characters that require use of combining diacritic marks, such as an a with an inverted breve underneath, but you can tell it how you want that processed if you have one in your book.
If you have characters that ppgen does not understand, or if you want something different from the standard handling, you can tell ppgen what you want it to generate.
By default ppgen will not perform diacritic processing. If you want it, you must enable that processing by using the .cv command.
Simple Diacritic Processing
To enable diacritic processing, simply place a .cv command anywhere in your source file. I recommend having it near the top for good documentation. If all you want is for ppgen to handle diacritics during processing, you can use the simplest form of .cv, without options:
.cv
If you need to specify additional diacritic handling (for example, a character that ppgen cannot handle automatically), then you can use 2 options of .cv, in= and out= as follows:
.cv in=[in-value] out=out-value
where:
in-value is a 1-8 character string that ppgen will find in the source text. It must have [ and ] around it on the .cv command.
out-value is the character string that ppgen should use to replace the [in-value], or the word "ignore" (without the quotes).
Note: If the in-value or the out-value contains blanks, you must put " or ' around the operand. Example: in="[in value]"
The diacritics you supply will be applied before any of the built-in ones, and thus you can override the standard ppgen handling if you need to. If for some reason you do not want ppgen to use one of the built-in diacritic transformations, you can use out=ignore to disable that one. For example, ppgen will recognize [s] as a long-s, but you may not want that. To disable that processing you could specify:
.cv in=[s] out=ignore
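To picture the ordering described above, here is a toy Python model. It is only a sketch and not ppgen's actual code: the BUILTIN table is a made-up two-entry stand-in for the real built-in list, and apply_cv is a hypothetical helper name. It shows user-supplied rules running first and out=ignore switching off a built-in rule.

```python
# Toy model of the rule ordering; NOT ppgen's implementation.
# BUILTIN is a hypothetical two-entry stand-in for the real built-in table.
BUILTIN = {
    "[)a]": "\u0103",  # a with breve
    "[s]": "\u017f",   # long s
}


def apply_cv(text, user_rules):
    """Apply (in_value, out_value) pairs first, then the built-in rules."""
    # Anything the user marked "ignore" is skipped in the built-in pass.
    ignored = {src for src, out in user_rules if out == "ignore"}

    # User rules go first, so they can override a built-in transformation.
    for src, out in user_rules:
        if out != "ignore":
            text = text.replace(src, out)

    # Built-in rules then handle whatever markup is still present.
    for src, out in BUILTIN.items():
        if src not in ignored:
            text = text.replace(src, out)
    return text


print(apply_cv("[)a] mi[s]s", []))                   # both built-ins applied
print(apply_cv("[)a] mi[s]s", [("[s]", "ignore")]))  # [s] left untouched
```

The second call corresponds to .cv in=[s] out=ignore: the [s] markup stays in the text while [)a] is still converted.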
For out-value you may specify a UTF-8 character directly, such as ă, or you may specify one or more Unicode code points in the form \unnnn, where nnnn is the hexadecimal code point. For example, to get an a with an inverted breve underneath you could use out=a\u032f
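If you want to check what an out-value like a\u032f really produces, a few lines of Python with the standard unicodedata module will tell you. This is a minimal sketch: decode_out_value is a hypothetical helper written for this illustration, not a ppgen function, and it assumes exactly four hexadecimal digits after \u.

```python
import re
import unicodedata


def decode_out_value(out_value):
    # Turn each backslash-u escape (four hex digits assumed) into its character.
    return re.sub(
        r"\\u([0-9a-fA-F]{4})",
        lambda m: chr(int(m.group(1), 16)),
        out_value,
    )


decoded = decode_out_value(r"a\u032f")
for ch in decoded:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+0061 LATIN SMALL LETTER A
# U+032F COMBINING INVERTED BREVE BELOW

# No single precomposed character exists for this pair, so NFC keeps two
# code points. Compare a + combining breve (U+0306), which composes to U+0103.
print(len(unicodedata.normalize("NFC", decoded)))             # 2
print(unicodedata.normalize("NFC", "a\u0306") == "\u0103")    # True
```

That difference is why a with a breve can be handled as a single built-in character, while a with an inverted breve underneath needs a combining mark supplied through out=.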
Recommendation: During diacritic processing, ppgen will list all of the characters it replaces. You may also find it helpful to specify the "-l" (log) command line option when running ppgen, which will cause ppgen to list any suspected diacritics that it did not handle and which may need further work on your part. In fact, you might consider making a run early in your work on the project, with a .cv command to enable the processing and -l to list the results, just to see what diacritics your project contains and what ppgen will do with them.
Example: ppgen.py -l -i your-src.txt
What characters are built-in?
To see the complete list of built-in characters in the version of ppgen that you're using, you can run ppgen with the -cvg or --listcvg option. With that option, ppgen will list all of the built-in diacritic and Greek characters to output file ppgen-cvglist.txt and terminate.
Example: ppgen.py -cvg
You'll notice some that do not fit the usual diacritic markup shown in the Proofreading Guidelines, but that have been used commonly enough over the years that I thought ppgen should support them.
Advanced Topics
Applying the transformations to your source file
If you run ppgen normally, and enable the handling of the diacritic markup, your source file will have the marked up characters and the output files (-utf8.txt, .html) will have the transformed characters. You may find it beneficial, however, to have the transformations applied to your source file so you can see them more easily while working on the source.
For example, that would allow you to see, in your source file, the characters you still need to work on. This may also be useful to someone who normally uses some other tool (Guiguts, PPQT) rather than ppgen, but wants a simple way to perform the diacritic transformation without having to deal with all the characters manually.
To do that you can use the -f (for filter) command line option.
Example:
ppgen.py -i file-src.txt -f filter.txt (and, optionally, -l to have more informational messages logged)
With the -f option specified, ppgen will read the specified filter file and process any .cv or .gk commands in it. Then ppgen will process the input file specified by -i (the name must still end with -src.txt), perform the diacritic and Greek transformations (if requested by .cv and .gk in the filter file or in the source file), and create a UTF-8 encoded output file named file-cvgout-utf8.txt which should be identical to the input file except for the .cv and .gk transformations.
If you're a ppgen user and happy with the results, you can then rename file-cvgout-utf8.txt to somename-src.txt and continue working on that as the next iteration of your source file.
If you're not a ppgen user, you can rename file-cvgout-utf8.txt to any name you would usually use, and continue working on it with your normal tools.
The filter file
The filter file should normally contain only .cv and .gk commands, which will serve to trigger the transformation processing. If desired, your commands can use the in= and out= operands to request additional character transformations beyond those built into ppgen. A minimal filter file to request both diacritic and Greek transformations would have simply:
.cv
.gk
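If you script your workflow, the filter run can be assembled with a few lines of Python. This is only a sketch: filter_text and filter_command are hypothetical helper names, and the file names are the placeholders used in the example above. It uses only the -i, -f and -l options described on this page.

```python
def filter_text(commands=(".cv", ".gk")):
    # One dot command per line, as in the minimal filter file shown above.
    return "\n".join(commands) + "\n"


def filter_command(src, filter_path, log=False):
    # Build the argument list for a filter run; -l is optional.
    cmd = ["ppgen.py", "-i", src, "-f", filter_path]
    if log:
        cmd.append("-l")
    return cmd


print(filter_text(), end="")
print(" ".join(filter_command("file-src.txt", "filter.txt", log=True)))
```

You would save the text from filter_text() as filter.txt and then run the printed command yourself (or hand the list to subprocess.run, adjusted for however you normally start ppgen on your system).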
Other options on .cv
The .cv command contains several other options intended for testing of ppgen itself, but which may also prove useful to a PPer.
pre= specifies a character string that ppgen will place in its output file just before the transformed characters.
suf= specifies a character string that ppgen will place in its output file just after the transformed characters.
keep= has values n (default), a, and b. With keep=a ppgen will retain the original diacritic markup and place it in the output file just after the transformed character. With keep=b ppgen will retain the original markup and place it in the output file just before the transformed character.
(Available in 3.46k) italic= has values n (default) and y. Sometimes formatters are confused about marking characters with diacritics that are italic and will format them as, e.g., [<i>)A</i>] rather than the correct <i>[)A]</i>. If you specify italic=y ppgen will look for these cases and correct them before applying the diacritic transformations. (Note: this transformation will apply to both Latin-1 and UTF-8 output files.)
(Available in 3.46k) bold= has values n (default) and y. Sometimes formatters are confused about marking characters with diacritics that are bold and will format them as, e.g., [<b>)A</b>] rather than the correct <b>[)A]</b>. If you specify bold=y ppgen will look for these cases and correct them before applying the diacritic transformations. (Note: this transformation will apply to both Latin-1 and UTF-8 output files.)
quit=y will cause ppgen to terminate processing immediately after performing the diacritic and/or Greek transformations.
done tells ppgen that there are no further .cv commands in the input file. This may save some time with large projects by allowing ppgen to stop looking for .cv commands and begin doing the transformation work.
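The italic= and bold= corrections described above can be pictured as a simple pattern rewrite. The Python below is an illustration only; the regular expression and the name fix_misplaced_tags are my own assumptions, not what ppgen actually uses internally.

```python
import re


def fix_misplaced_tags(text, italic=False, bold=False):
    # Move <i>/<b> tags found just inside [ ] markup to the outside,
    # e.g. [<i>)A</i>] becomes <i>[)A]</i>.
    tags = []
    if italic:
        tags.append("i")
    if bold:
        tags.append("b")
    for tag in tags:
        pattern = rf"\[<{tag}>([^\[\]<>]+)</{tag}>\]"
        text = re.sub(pattern, rf"<{tag}>[\1]</{tag}>", text)
    return text


print(fix_misplaced_tags("[<i>)A</i>]", italic=True))  # <i>[)A]</i>
print(fix_misplaced_tags("[<b>)A</b>]", bold=True))    # <b>[)A]</b>
print(fix_misplaced_tags("[<i>)A</i>]"))               # unchanged: both options off
```

Once the tags are outside the brackets, the markup [)A] is back in the standard form that the diacritic transformations expect.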
Examples:
Suppose the input file has a line with
This is an [alpha]
With diacritic transformations enabled, the output file will normally contain
This is an α
If a .cv command specifies ".cv keep=b" then the output for that line would be
This is an [alpha]α
If, in addition, a .cv command specifies ".cv pre=' >>' suf=<<" then the output would be
This is an [alpha] >>α<<
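The way pre=, suf= and keep= combine in the examples above can be modelled in a few lines of Python. This is a sketch, not ppgen's code: render is a hypothetical helper, and for keep=a I am assuming the markup lands after the suf= string, which this page does not spell out.

```python
def render(markup, char, pre="", suf="", keep="n"):
    # Transformed character wrapped in the pre= and suf= strings.
    core = f"{pre}{char}{suf}"
    if keep == "b":
        return markup + core   # original markup kept before
    if keep == "a":
        return core + markup   # original markup kept after (placement assumed)
    return core                # keep=n: markup dropped


line = "This is an "
print(line + render("[alpha]", "α"))
# This is an α
print(line + render("[alpha]", "α", keep="b"))
# This is an [alpha]α
print(line + render("[alpha]", "α", pre=" >>", suf="<<", keep="b"))
# This is an [alpha] >>α<<
```

The three printed lines match the three outputs given in the examples.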
You may find these options useful when initially working with the .cv support, or even later when initially working on a new project. You could, for example, filter your source file using ppgen and the -f option, with a filter file containing:
.cv pre=' >>' suf="<<" keep=b
and your transformed source file would then have the original diacritic markup, plus a clear indication of the output that resulted from the transformation. You can examine each transformed letter and make sure it is what you wanted, and matches the source image. Any diacritics that remain without a transformation will also be obvious and you can then figure out what .cv command you need to make the transformation work, or you may be able to determine that the proofers made a mistake in the diacritic markup and correct it to work on the next run. You could then iterate that until you have handled all the diacritic markup in the project, and move on to the next steps in your workflow.