Post-Processing With PGTEI 0.4

From DPWiki

Note: PGTEI is no longer used at DP. Information on this page may be out of date.

Note: The following text was written when the downloadable TEI text was much different from what it is now. A new version of this tutorial is in the works.

In the following documentation, we list what you, as post-processor of a PGDP project, may need to do to write a TEI file that produces both TXT and HTML output acceptable to PGDP PPVers. We do NOT cover producing acceptable PDF output: the PDF backend of gnutenberg-press 0.4 is brittle, and especially buggy regarding LOTE and images, in the humble opinion of this writer.

This page assumes you have read through the PGTEI documentation, have installed the gnutenberg-press 0.4 software (the Windows version of Guiguts comes bundled with the Gnutenberg Press; see here), and have at least once successfully created TXT, HTML, and PDF output using that software on a TEI file (be it your own file or one of the example files from gutenberg.org).

We also assume you have some means of filtering text, like a sophisticated word processor. The instructional examples here use vim's capability of marking passages of text and piping them through the sed program. However, the description should be general enough for you to be able to use any other program or language (like perl) to the same effect.

At the end, you should be able to take any non-HARD project, write a TEI file for it, and get it and the produced TXT/HTML files through PPV.

Best Practices

Please see the on-going "Best Practices" document. Discussion will be ongoing in the forums [here], while the results of that discussion will be located here.

Installation issues

There is one minor problem with the installation of PGTEI 0.4 concerning the PDF backend. If you get errors like

Cannot read font coverage file: $HOME/PGTEI/0.4/pdf/fonts/generated/pgtei-fonts.coverage
at $HOME/PGTEI/0.4/pdf/mod-pdf.pl line 811.

then there is manual work needed with the installation. The file pgtei-fonts.cfg.default in the directory pdf/fonts has to be checked and copied to pgtei-fonts.cfg, and the directory 'generated' has to be created in the same place by unpacking the included generated.tar.bz2 file.

How do I use the provided sed filters?

For each filter, we give a verbal description, so you can always use other means like perl to achieve the same effect. If you are a lucky Unix/Linux user and have sed on your machine, then you can pipe the text through sed on the command line; however, it's much more convenient to use a sophisticated editor like vim or emacs to mark the passages of text you want to filter and use an editor command to pipe them through sed.

For example, vim has the command '!' for piping. I have the following line in my .vimrc configuration file

vmap <F10> :!sed -f

This configures vim/gvim to behave like this: after you have marked text with one of the visual commands and pressed 'F10', the marked text is piped through a command consisting of 'sed -f ' plus a filter file name. You only need to type the file name after the -f and press 'RETURN'.

However you use your filters, you'll soon end up with half a dozen filter files in your working directories. This is the time to think about giving them useful names so you can recognize and reuse them later.

How to start

When you download your project files, you always can download a preliminary PGTEI version with the page images (see the link Download Zipped TEI Text on your project page). This file is in the PGTEI 0.3 format and will probably not work with your software. However, it contains some useful markup that we can include with our file. The alternative would be to write our PGTEI 0.4 file from scratch. There are no significant advantages/disadvantages to either of these approaches, and we encourage you to try both.

We will work with the provided file in this document, so if you follow us with your project, then please also download the provided TEI file of your project. The first three parts of your project TEI file (header, stylesheets, front matter), however, cannot be taken from that downloaded file, and must be written/copied from scratch, as follows.

Meta information

Although we want to use the given TEI, we start with a brand new header. First, create a new file with your project basename (let's call it book) and the .tei extension. The book.tei file should start with

<?xml version="1.0" encoding="iso-8859-1" ?>

<!DOCTYPE TEI.2 SYSTEM "http://www.gutenberg.org/tei/marcello/0.4/dtd/pgtei.dtd">

<TEI.2 lang="XX">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>XXXXXXXXXXXXXXXX</title>
        <author><name reg="YYYYYYY, XXXXXXX">XXXXXXXX YYYYYYYYY</name></author>
        <!--<editor><name reg="YYYYYYY, XXXXXXX">XXXXXXXX YYYYYYYYY</name></editor>-->
      </titleStmt>
      <editionStmt>
        <edition n="1">Project Gutenberg TEI Edition 1</edition>
      </editionStmt>
      <publicationStmt>
        <publisher>Project Gutenberg</publisher>
        <date>XXXXXXXXX, 200X</date>
        <idno type="etext-no">99999</idno>
        <availability>
          <p>This eBook is for the use of anyone anywhere
          at no cost and with almost no restrictions whatsoever.
          You may copy it, give it away or re-use it under
          the terms of the Project Gutenberg License online at
          www.gutenberg.org/license</p>
        </availability>
      </publicationStmt>
      <sourceDesc>
        <bibl>
        </bibl>
      </sourceDesc>
    </fileDesc>
    <encodingDesc>
    </encodingDesc>
  <profileDesc>
    <langUsage>
      <language id="XX"></language>
   </langUsage>
  </profileDesc>
  <revisionDesc>
    <change>
      <date value="200X-XX">XXXXXXX 200X</date>
      <respStmt>
        <resp>Produced by [project manager], [post-processor],
        and the Online Distributed Proofreading Team at
        &lt;http://www.pgdp.net/&gt;.</resp>
      </respStmt>
      <item>Project Gutenberg TEI edition 1</item>
    </change>
  </revisionDesc>
</teiHeader>

The fields marked with X have to be filled out. In particular,

  • the language of the etext should be given as lang attribute to the TEI.2 element, in the form of a language id. Language ids should be the well-known two- or three-letter code, and any id you use here and later in the book should be declared additionally within the langUsage element.
  • the title and author are mandatory; the editor is optional (e.g. when the text is part of a collection). Any non-ASCII characters in the title have to be given as HTML entities, e.g. &Auml; for Ä.
  • the etext number will be filled in by the whitewasher
  • the sourceDesc (source description) supplies a description of the source text(s) from which the etext was derived or generated. Please fill in the full bibliographic metadata using the <bibl> element and friends. See also the TEI documentation regarding the source description. This bibliography can also contain a pointer to where the scans were from.
  • the respStmt (statement of responsibility) should be taken from the project page because that's also the place where the PM adds possible credits for the scans (e.g. gallica). Note also that the TEI syntax gives you the name, orgName, and persName elements here to fill out. For loose text inside the respStmt you would use the resp element. See also the TEI documentation on the respStmt element.
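
The rule that every language id used in the book must also be declared within langUsage can be checked mechanically. The following is a minimal sketch in Python rather than sed; the function name is made up for illustration, the foreign element in the sample is only an example of a place where a lang attribute might occur, and the check is a plain text search, not a validation against the PGTEI DTD.

```python
import re

def undeclared_languages(tei):
    """Return language ids used in lang="..." but missing from <langUsage>."""
    # Ids declared as <language id="..."> in the header.
    declared = set(re.findall(r'<language\s+id=["\']([^"\']+)["\']', tei))
    # Ids used anywhere as a lang attribute (TEI.2, text, and so on).
    used = set(re.findall(r'\slang=["\']([^"\']+)["\']', tei))
    return used - declared

# Illustrative sample: "la" is used but never declared.
sample = '''<TEI.2 lang="de">
<langUsage><language id="de"></language></langUsage>
<text lang="de"><p><foreign lang="la">ex libris</foreign></p></text>
</TEI.2>'''
print(undeclared_languages(sample))
```

An empty result means every id found by this simple search is declared; anything else is a candidate for an additional language entry.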

Rendering Styles

After the teiHeader element, put some style macros ("stylesheet") like the following:

<pgExtensions>
  <pgStyleSheet>
    .boxed           { x-class: boxed }
    .shaded          { x-class: shaded }
    .rules           { x-class: rules; rules: all }
    .indent          { margin-left: 2 }
    .right           { margin-left: 16 }
    .bold            { font-weight: bold }
    .italic          { font-style: italic }
    .gesperrt       { font-weight: bold }
    .antiqua        { font-style: italic }
    .small           { margin-left: 2 }
    .smallcaps       { font-variant: small-caps }
    speaker          { font-variant: small-caps; font-weight: normal }
    figure           { text-align: center; }
    .w100            { }
    .w75             { }
    .w66             { }
    .w50             { }
    .w25             { }
    @media pdf {
      .w100            { width: 100% }
      .w75             { width: 75% }
      .w66             { width: 66% }
      .w50             { width: 50% }
      .w25             { width: 25% }
      }
  </pgStyleSheet>
  <pgCharMap formats="txt">
    <char id="U0x2014">
      <charName>mdash</charName>
      <desc>EM DASH</desc>
      <mapping>--</mapping>
    </char>
    <char id="U0x2003">
      <charName>emsp</charName>
      <desc>EM SPACE</desc>
      <mapping>  </mapping>
    </char>
    <char id="U0x2026">
      <charName>hellip</charName>
      <desc>HORIZONTAL ELLIPSIS</desc>
      <mapping>...</mapping>
    </char>
    <!-- FIXME: PG Footer Workaround -->
    <char id='U0x00A0'>
      <charName>nbsp</charName>
      <desc>NO-BREAK SPACE</desc>
      <mapping> </mapping>
    </char>
  </pgCharMap>
</pgExtensions>

The above code also contains three examples for Unicode character mappings. If you use characters outside Latin-1 in the project (by writing &#x0200; for example), then this is the way to define what should be put in the TXT output for them.
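
If a project needs many such mappings, the entries can be generated instead of typed. This is a sketch only: it copies the layout of the entries shown above, takes the entity name from Python's HTML entity table and the description from the Unicode character database, and uses a made-up fallback name when no HTML entity exists. Whether PGTEI 0.4 accepts such a fallback name is an assumption that has not been tested.

```python
import unicodedata
from html.entities import codepoint2name
from xml.sax.saxutils import escape

def char_map_entry(char, replacement):
    """Build one <char> entry in the layout used in the stylesheet above."""
    cp = ord(char)
    # Hypothetical fallback name for characters without an HTML entity.
    name = codepoint2name.get(cp, f"U{cp:04X}")
    desc = unicodedata.name(char)
    return (
        f'<char id="U0x{cp:04X}">\n'
        f'  <charName>{name}</charName>\n'
        f'  <desc>{desc}</desc>\n'
        f'  <mapping>{escape(replacement)}</mapping>\n'
        f'</char>'
    )

print(char_map_entry("\u2014", "--"))
```

The printed entry can be pasted between the pgCharMap tags; the replacement text is escaped so that characters such as & stay valid XML.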

Front Matter

The front matter is part of the text element, and it should start like this:

<text lang="XX">
  <front>
     <div>
       <divGen type="pgheader" />
     </div>
 
     <div>
       <divGen type="encodingDesc" />
     </div>
 
     <div rend="page-break-before: right">
       <divGen type="titlepage" />
     </div>

This is the simplest way to include a title page. From the line <divGen type="titlepage" />, the PGTEI 0.4 software will generate a standard title page using the data you have already given in the header. If such a title page doesn't suffice then you can construct your own title like this, consisting of several centered lines of text, and possibly a centered rule:

     <div rend="page-break-before: always">
       <p rend="font-size: xx-large; text-align: center; bold">Zimmerblattpflanzen</p>
       <p rend="font-size: large; text-align: center">Von</p>
       <p rend="font-size: x-large; text-align: center; bold">Prof. Udo Dammer</p>
       <p rend="font-size: small; text-align: center">Kustos des Kgl. Botanischen Gartens zu Dahlem-Berlin</p>
       <milestone unit="tb" rend="rule: 25%" />
       <p rend="font-size: large; text-align: center">Mit 48 Abbildungen</p>
       <p rend="font-size: small; text-align: center">Zweite Auflage</p>
       <p rend="text-align: center; bold">Berlin 1908</p>
     </div>

See the chapter on pictures for how to include a vignette here.

We close the front element by letting the software generate a table of contents:

    <div rend="page-break-before: always">
       <head>Contents</head>
       <divGen type="toc" />
     </div>
  </front>

We discuss this TOC in one of the next chapters. Note that you give the heading for it yourself so you can substitute 'Contents' with whatever is used in your project.

Preparing The Text Body

The text body is marked up as <body>, so after closing the front matter, we copy this line:

  <body>

Every further step we take to prepare the body works on the downloaded TEI file, so this is the time to copy only the marked-up text body from that downloaded file to our file book.tei. The header and footer from the downloaded file are discarded. We will now make extensive use of filtering to transform the body into what we need.

Page Numbers/Anchors

The downloaded body usually starts with page number markup like

<pb id='003.png' proofer1='arrianarose' proofer2='' proofer3='phil1980' proofer4='sailingdove' proofer5='Josephine'/>

We don't want the proofer information enlarging the final text, and dropping it also preserves some privacy. The page number id doesn't need the png extension, either. On the other hand, we'll want an anchor at every page break as a target to point our links at. So, what we need is to pipe the whole body through this sed filter:

s@<pb id=.\([0-9]\+\).*@<pb n="\1"/><anchor id="Pg\1"/>@g

This takes the first string of digits in a line starting with '<pb id=' and replaces the line with the string '<pb n="', followed by the number, the string '"/><anchor id="Pg', the number again, and the string '"/>'. This way, the example line above becomes

<pb n="003"/><anchor id="Pg003"/>

You might want to manipulate the filter in case your book doesn't need three-digit page numbers. Note that the format of the anchor id ('PgXXX') has to be the same as that which you will use later for targeting the links. Note also that the id attribute name of pb was changed to 'n', reflecting the change from PGTEI version 0.3 to version 0.4.
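
For readers who prefer another language to sed, the same page break substitution might look like this in Python. It is a sketch under the same assumptions as the sed filter (one pb element per line, digits directly after the opening quote); the function name and the shortened sample line are illustrative.

```python
import re

# '<pb id=', any quote character, the first run of digits, rest of the line.
PB = re.compile(r'<pb id=.([0-9]+).*')

def convert_page_breaks(text):
    # Keep only the digits, drop the proofer attributes, and add an anchor
    # that later <ref target="Pg..."> links can point at.
    return PB.sub(r'<pb n="\1"/><anchor id="Pg\1"/>', text)

line = "<pb id='003.png' proofer1='a' proofer2=''/>"
print(convert_page_breaks(line))
```

Lines without a pb element pass through unchanged, so the function can be applied to the whole body at once.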

Your page numbers probably don't match those of the original and have to be decreased by a constant amount. Arithmetic is possible but difficult with sed, so here, unlike in the rest of this document, we offer an awk script for the purpose. Set the off variable to the offset you need:

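# Split each line at the double quotes, so that $2 is the page number.
BEGIN { FS = "\""; off=12; }
{
  if ($1 ~ /^<pb n=$/)
  {
    # Rewrite both the pb number and the anchor id, shifted by off.
    printf ("%s\"%03d\"%s\"Pg%03d\"%s\n", $1, $2-off, $3, $2-off, $5);
  }
  else
    print $0;
}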

Note the format string assumes you want three-digit page numbers output.
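
The same renumbering can be sketched in Python, here with the number of digits as a parameter. The function name and the width parameter are illustrative, and the sketch assumes the pb and anchor pairs have exactly the form produced by the page break filter described earlier.

```python
import re

PB_ANCHOR = re.compile(r'<pb n="(\d+)"/><anchor id="Pg\d+"/>')

def shift_pages(text, off, width=3):
    """Decrease every page number by off and re-pad it to width digits."""
    def repl(match):
        page = int(match.group(1)) - off
        num = f"{page:0{width}d}"
        return f'<pb n="{num}"/><anchor id="Pg{num}"/>'
    return PB_ANCHOR.sub(repl, text)

print(shift_pages('<pb n="015"/><anchor id="Pg015"/>', off=12))
```

Pages before the offset (typically front matter) come out as zero or negative numbers and need to be renumbered by hand.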

Divisions

At this point, you should start thinking about which parts of the text should become major (or even minor) divisions. With hierarchically structured text this is evident, and most books also contain something like a preface or introduction, even if it doesn't have such a heading. Each of these text divisions has to be enclosed in <div>...</div> markup. Since we want all (at least!) of the major divisions in the main TOC, even if there is no original TOC, the actual markup would be

<div rend="page-break-before: always">
<index index="toc" />
<index index="pdf" />
<head>Preface</head>

Note that the index marks have to go before the heading for it to appear in the respective index. If you need a TOC entry for a division without heading, write something like

<div rend="page-break-before: always">
<index index="toc" level1="Introduction" />

If you don't want a page break before the division start in the PDF output, leave the rend assignment out of the <div>.

Table Of Contents and Other Poems

With eTexts of old books, there are three possibilities for giving a table of contents:

  • give the original TOC, together with original page numbers which become hyperlinks to respective anchors in the HTML document
  • mark up the text using text headings for anchors (employing div elements to span the corresponding text parts). This is more convenient for HTML output because the notion of pages does not exist there. It will also automatically fill the running header in the PDF output with the chapter name
  • both of the above, with the original TOC being a part of the text, i.e., a chapter of its own

We will employ the third variant for our project. If you don't want a generated TOC, leave the code snippet with "divgen...toc" out of the front matter. If you don't want the original TOC, just leave that out.

The original TOC is usually foofed by DP as 'poem', that is, between /* */ markers. So marked passages appear in the downloaded TEI as isolated lines each marked up with <l>...</l>. You only have to add line group markup <lg>...</lg> around them instead of <p>...</p> to make them valid PGTEI 0.4. This filter can do that for you:

/^<p>/ { N; s@.*\n<!-- poem -->@<lg>@; P; D; }
/^<!-- poem -->/ { N; s@.*\n</p>@</lg>@; P; D; }

It searches for two-line combinations that run before and after passages marked as poems such that

<p>
<!-- poem -->
...
<!-- poem -->
</p>

becomes

<lg>
...
</lg>
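
The same two-line replacement can be sketched in Python by working on the whole text at once instead of line pairs. The function name is illustrative, and unlike the sed filter the patterns only match lines that consist of exactly <p>, </p>, or the poem comment.

```python
import re

def poems_to_linegroups(text):
    # Opening pair: a <p> line directly followed by a <!-- poem --> line.
    text = re.sub(r'(?m)^<p>\n<!-- poem -->$', '<lg>', text)
    # Closing pair: a <!-- poem --> line directly followed by a </p> line.
    text = re.sub(r'(?m)^<!-- poem -->\n</p>$', '</lg>', text)
    return text

sample = (
    "<p>\n<!-- poem -->\n"
    "<l>Chapter I 1</l>\n<l>Chapter II 25</l>\n"
    "<!-- poem -->\n</p>\n"
)
print(poems_to_linegroups(sample))
```

Ordinary paragraphs are left alone, because a <p> line only matches when the poem comment follows it directly.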

For the hyperlinking, you will then need to replace page numbers with respective links, like in this sed filter:

s@ \([1-9][0-9][0-9]\)@ <ref target="Pg\1">\1</ref>@g
s@ \([1-9][0-9]\)@ <ref target="Pg0\1">\1</ref>@g
s@ \([1-9]\)@ <ref target="Pg00\1">\1</ref>@g

This filter does the following: it replaces every 1-to-3 digit number preceded by a space character with a string consisting of space, the string '<ref target="Pg', the number padded left with zeroes, the string '">', the number again, followed by the string '</ref>'. So for 'Toc entry 123', we get 'Toc entry <ref target="Pg123">123</ref>'.

Since a possible index uses the same page numbers, you might as well apply the above filter to the index, too, while you're at it. Keep an eye on other numbers that might be transformed, or page refs like III, IV, V that need their link set by another method. Use your own judgement where to apply this filter!
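
One way to reduce the risk of transforming other numbers is to refuse any match that is followed by a further digit, so that years such as 1908 are skipped. The following Python sketch does the padding arithmetically instead of with three separate rules; the function name is illustrative, and roman page numbers still need a separate treatment.

```python
import re

# A run of 1 to 3 digits after a space, not followed by another digit.
PAGE = re.compile(r' ([1-9][0-9]{0,2})(?![0-9])')

def link_page_numbers(line):
    return PAGE.sub(
        lambda m: f' <ref target="Pg{int(m.group(1)):03d}">{m.group(1)}</ref>',
        line,
    )

print(link_page_numbers("Toc entry 7"))
print(link_page_numbers("Toc entry 123"))
print(link_page_numbers("Berlin 1908"))
```

The visible number keeps its original form, while the link target is padded to the three digits used in the anchor ids.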

If the PPV demands that the page numbers be right-justified or all in the same column, then the best way to achieve this is to use a table without rules. Note that rend="rules: none" will leave lines around the cells in text mode, so just don't give the rules value at all and don't use any | characters in the latexcolumns and tblcolumns values.

Finally, please note that you must remove all instances of <index index='toc' /> whenever you remove <divGen type='toc' />, because PGTEI version 0.4 won't be able to generate PDF output otherwise.
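
Leftover index marks are easy to overlook in a long file. As a minimal sketch, the following Python function (its name is made up) counts toc index marks in a file that has no generated TOC; it is a plain text search and knows nothing about comments or the DTD.

```python
import re

def toc_marks_without_toc(tei):
    """Count <index index="toc"> marks if no <divGen type="toc"> exists."""
    has_toc = re.search(r'<divGen\s+type=["\']toc["\']', tei) is not None
    marks = re.findall(r'<index\s+index=["\']toc["\']', tei)
    return 0 if has_toc else len(marks)

sample = '<div><index index="toc" /><head>Preface</head></div>'
print(toc_marks_without_toc(sample))
```

A result other than zero means there are index marks left that should be removed before PDF output is attempted.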

Blockquotes

For bulk conversion of /#...#/ blockquotes, we use the following filter:

/^<p>/ { N; s@.*\n/\#@<quote rend="display">@; P; D; }
/^\#\// { N; s@.*\n</p>@</quote>@; P; D; }

It replaces line pairs

<p>
/#
...
#/
</p>

with

<quote rend="display">
...
</quote>

Font and font attribute changes

Since italic and bold markup is already converted to <hi>...</hi> pairs in the downloaded TEI file, we only have to do the same for possible 'gesperrt' (spaced out) text and 'antiqua' (unbroken font within fraktur text) passages, suggesting the simple filter

s@<g>@<hi rend="gesperrt">@g
s@</g>@</hi>@g
s@<f>@<hi rend="antiqua">@g
s@</f>@</hi>@g

Having the four rendering changes bold / italic / gesperrt / antiqua as style macros in our internal style sheet (see above) allows us to change the rendering of, e.g., all gesperrt passages to italic instead of bold in an easy and transparent way. It also preserves the original intent of the author by adding a layer of indirection. Note that, although it's possible to render gesperrt text using PGTEI 0.4's 'letter-spacing:' rend attribute, we advise against it, since the PDF backend outputs single isolated characters for it: not only is such text not searchable, it also allows line breaks between the characters, leading to bad results.

You might have noticed we didn't include <sc>...</sc> markup in the above discussion. The reason is that the four elements b/i/f/g are normally used indiscriminately for anything the author wants to highlight. The hi element of TEI is defined just as a fallback for whenever no specialized TEI markup applies.

Smallcaps: names and speakers

Smallcaps, however, is mostly used by authors to highlight names, suggesting its replacement with TEI's name element; it also has a special place in dramatic works, where it marks the name of the current speaker. In the former case, the simple filter

s@<sc>@<name>@g
s@</sc>@</name>@g

would apply. Dramas need the <speaker> element and, since PGTEI 0.4 renders that bold, we included in our DP-conformant style sheet this line to get normal smallcaps (and caps in TXT output):

     speaker          { font-variant: small-caps; font-weight: normal }

Note that regardless of whether <p> or <lg> follows the speaker, a line break is output in all cases. We don't know if it's possible to use the <speaker> element on the same line with what is being said.

In conclusion, the transformation of dramatic speech from downloaded PGTEI 0.3 to 0.4 results in

<p>
<sc>The Speaker.</sc> Speech
</p>

being changed to

<sp>
<speaker>The Speaker.</speaker><p>Speech</p>
</sp>

and the filter that works it looks like

/^<p>/ { 
  N;
  /.*\n<sc>/ {
    N;
    s@.*\n<sc>\(.*\)</sc> \+\([^<]*\)@<sp>\
<speaker>\1</speaker><p>\2@g;
    P; D;
    }
  }
s@</p>@</p>\
</sp>@g

although you have to be careful to apply it to the mentioned passages only.
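
A variant that only touches paragraphs beginning with <sc> markup, and therefore leaves the closing tags of other paragraphs alone, can be sketched in Python as follows. The function name is illustrative, and the sketch assumes the layout of the downloaded file shown above, with <p> and </p> on lines of their own.

```python
import re

# A paragraph whose first line starts with <sc>Name</sc> followed by speech.
SPEECH = re.compile(r'<p>\n<sc>([^<\n]*)</sc> +(.*?)\n</p>', re.DOTALL)

def mark_speeches(text):
    return SPEECH.sub(r'<sp>\n<speaker>\1</speaker><p>\2</p>\n</sp>', text)

sample = (
    "<p>\n<sc>The Speaker.</sc> Speech\n</p>\n"
    "<p>\nA stage direction.\n</p>\n"
)
print(mark_speeches(sample))
```

It should still be applied only to dramatic passages, because smallcaps at the start of an ordinary paragraph would be turned into a speaker as well.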

Illustrations, beware!

There are several restrictions for image usage, and all can be considered bugs in the PGTEI 0.4 software. All can be worked around, more or less.

First, do not change image format within the document. In most cases this would lead to the PDF output not coming out right, or not at all.

Secondly, the default behaviour of <figDesc> is identical output in TXT and HTML format, so in order to get DP-conformant descriptions we have to use <pgIf>. For example, to get "[Illustration: Heroes Returned from War]" in TXT but "Heroes Returned from War" as HTML caption, you need

<pgIf output="txt">
  <then>
    <p>[Illustration: Heroes Returned from War]</p>
  </then>
  <else>
    <p><figure url="test.png"><head>Heroes Returned from War</head></figure></p>
  </else>
</pgIf>

Also, don't ever use floating figures. Even if the PDF comes out right with them on your system, it might come out messed up on gutenberg.org, because they use slightly different parameters, which can make a big difference; mishaps that made part of the images disappear have happened frequently.
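
Since the <pgIf> construct for illustrations described earlier in this section is needed once per image, it may be worth generating. The Python sketch below reproduces exactly that construct from a file name and a caption; the function name is made up, and the only addition is that the caption and the file name are escaped for XML.

```python
from xml.sax.saxutils import escape, quoteattr

def illustration(url, caption):
    """Return the pgIf block for one image: bracketed text in TXT, figure elsewhere."""
    text = escape(caption)
    return (
        '<pgIf output="txt">\n'
        '  <then>\n'
        f'    <p>[Illustration: {text}]</p>\n'
        '  </then>\n'
        '  <else>\n'
        f'    <p><figure url={quoteattr(url)}><head>{text}</head></figure></p>\n'
        '  </else>\n'
        '</pgIf>'
    )

print(illustration("test.png", "Heroes Returned from War"))
```

Keeping the template in one place also makes it easy to change the wording of the TXT description for all images at once.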

Miscellaneous

s@&@&amp;@g
s@<tb>@<milestone unit="tb" rend="rule: 25%" />@g
s@"@@;s@"@@
s@<sc>@<name>@g
s@</sc>@</name>@g

Footnotes and back matter

After closing the text body with </body> the back matter should follow and look like

    <back>
      <div rend="page-break-before: right">
        <divGen type="pgfooter" />
      </div>
    </back>
  </text>
</TEI.2>

Possible end notes go in before the pgfooter division with

      <div rend="page-break-before: always">
        <head>Footnotes</head>
        <divGen type="endnotes" />
      </div>

Unfortunately, gathering the footnotes themselves and placing them back in the text is labor-intensive, and I haven't found a script that does it so far. In any case, you almost always want end notes for reasons of readability in all back ends. If you generate them per chapter, don't forget to set the target of <divGen> to the id of the respective division.

Usual procedure

For many of the following tasks, it's quite useful to work on the TXT output with gutcheck (on the main text body).

The normal procedure of preparing your text would follow here, that is, a spellcheck and the addition of transcribers' notes. Be aware, however, that TEI provides markup for corrections. We suggest you use <corr> (for corrections given by proofers and your own while spellchecking!) and gather the lines using grep afterwards for use in the transcribers' notes.
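
As an alternative to grep, the corrections can be collected as pairs ready for the transcribers' notes. The Python sketch below assumes that the original reading is recorded in a double-quoted sic attribute, as in TEI's <corr sic="...">; whether your file records it that way is an assumption, and corrections without the attribute are reported with None as the original. The function name is illustrative.

```python
import re

# <corr> with an optional sic="..." attribute holding the original reading.
CORR = re.compile(r'<corr(?:\s+sic="([^"]*)")?\s*>(.*?)</corr>', re.DOTALL)

def corrections(tei):
    """Return (original, corrected) pairs for every <corr> element found."""
    return [(m.group(1), m.group(2)) for m in CORR.finditer(tei)]

sample = '<p>It was <corr sic="teh">the</corr> best of times.</p>'
for sic, corrected in corrections(sample):
    print(f"{sic!r} changed to {corrected!r}")
```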

Also use gutcheck to find short lines in the TXT version. Fix them by placing <lb/> in strategic positions before or after the offending line.

This is also the time to search for remaining asterisks, brackets, and double hyphens.

Finally, if you cannot find any more problems, create the HTML and TXT versions, zip them together with the TEI file and a possible images/ directory, and upload the archive to PGDP. In order to use the right channels, place this comment in the upload:

When TXT and HTML check okay, please forward TEI file to David Widger.

Remember, any corrections that the PPVer wants on the TXT or HTML would have to be made to the TEI file, and TXT and HTML generated from that. Have fun.

LOTE issues

Using languages other than English will present the stringent PPer with a few problems, for which we try to give workarounds in the following.

We haven't tried full Unicode projects with PGTEI 0.4, due to lack of occasion. Please add your experiences here.

Using non-ASCII characters in title or chapter head will garble the PDF's running headers. There is no workaround.

Usually, you will set the lang attribute in the header and make sure that all language codes used appear in <langUsage>...</langUsage>.

Contrary to what the PGTEI documentation states, different quotation marks cannot be introduced by giving pre and post rendition variables; this does not work. The workaround is to leave the quotation marks in the quoted text and give a rend value of "pre: none; post: none;" in the internal stylesheet if you use ....

The German Halbgeviertstrich, which is used instead of mdash in German-language HTML, should appear as -- (Minus, Minus) in the text backend. We can readily achieve this with the following code in the pgExtensions header:

 <pgCharMap formats="txt">
   <char id="U0x2013">
     <charName>ndash</charName>
     <desc>EN DASH</desc>
     <mapping>--</mapping>
   </char>
 </pgCharMap>