From DPWiki

Note: PGTEI is no longer used at DP. Information on this page may be out of date.

See also:

Introduction to PGTEI Markup for the DP PPer

What is PGTEI?

PGTEI is short for Project Gutenberg - Text Encoding Initiative. It is a dialect of the full Text Encoding Initiative ([1]). TEI, in turn, is an XML-based coding specification that allows almost any type of information about a text to be encoded within the text itself in a consistent, easily machine-retrievable manner. For example, one could encode information about the source DP used to create an e-book within the flow of the text. The final output may not show that information to the casual reader, but it remains available to anyone who wants it and is easily retrievable by an automated cataloging system. Likewise, the page numbers found in the original source can be encoded in our file: end users can configure their reading experience to ignore that information completely, while those who need it, such as scholars looking to reference specific material, can find it just as easily.

Why use PGTEI?

The PG archive currently contains two main "encoding" schemes: plain text (usually Latin-1, sometimes ASCII or UTF-8) and HTML. Plain text, by its very definition, encodes very little beyond the letters and spaces themselves. Over time, certain conventions have crept in, such as using 4 blank lines to indicate a chapter break and underscore characters to indicate italics. But the markup possible in such a scheme is very limited and fairly haphazard: there is no defined standard.

HTML, while much better, fails to encode semantic data. It is very good at presentational markup such as bold, italics, and centering. It is not as good at semantic/metadata markup, such as marking a line as a chapter heading, a blockquote or a footnote. HTML is also not very good at divorcing content from layout, which makes it harder for users to change the presentational style to fit their needs.

NOTE: CSS has come a long way toward separating content and layout, but it is still tied to HTML, which is a presentational markup at its heart.

HTML is first concerned with how something looks and second with what something is. Conversely, TEI is first concerned with what something is and second with how that something looks. This is both a simple and yet extremely profound difference.

Why don't we use another pre-existing standard?

There really are two ways to answer this. First, we are using a pre-existing standard. TEI is a well-established, highly documented standard used the world over. Second, PGTEI allows us to specify the particular subset of the full 1500-page specification that we will support. The full specification is so big and complex that there are parts of it we will likely never use. By creating our own PGTEI subset, we let anyone familiar with TEI know that this file was encoded with a specific subset of features. Anyone who wants to know which subset we used can read the DTD files, which are publicly available at Project Gutenberg.

Does this new PGTEI format replace/supersede the old plain text (ASCII) format? HTML? Is PGTEI going to be required for future submission?

PGTEI will mean different things for different people. The first group (and many would argue, the most important) is the group of people who simply want to read a particular eBook in our collection. For these folks, nothing at all should change, especially in the near term. All eBooks encoded in PGTEI are converted to the old stand-bys, plain text and HTML (with the happy addition of PDF, which isn't standard now). The only noticeable differences to the end user are the universal availability of HTML and PDF versions (something not always available currently) and possibly a more consistent "layout", since the conversions to plain text and HTML are handled by a computer program instead of manually by different people.

The second group of people are those who create the texts. For right now, PGTEI is recommended, but by no means required. Will that change someday to required? That is possible, but it is far enough down the road that it shouldn't be an immediate concern. Think of this as a grand experiment with a lot of potential.

We are definitely committed to making this "transition" as smooth as possible, both in the experience of group one (the readers) and group two (the producers). We are trying to exhaustively test each step in the process and are committed to providing as much documentation and help as we can.

How does PGTEI dovetail with DP's formatting guidelines?

As much as possible, we have made an effort to have a one-to-one correspondence between the two. DP's formatting shortcut has the advantage of being fairly simple to remember and fairly easy to type. PGTEI is MUCH more wordy, but it is also much more flexible.

Please see the wiki page on converting our format to PGTEI for further information.

Common Elements of a PGTEI Document


<teiHeader>

The <teiHeader> contains much of the metadata about a particular eBook: the source edition it was derived from, the title, the author, the illustrator, the editor, and so on. It also contains the date the eBook was posted to the PG archives, the text number that was assigned to it, and a running record of changes that have been made to the text and the people responsible for those changes.

This is the section about the eBook.

Below is an example <teiHeader> section that you can modify for your own use. In the future, we want to have a web page where you fill in the appropriate information and it generates a fully formatted <teiHeader>. If anyone has web form writing experience and wants to help out, let me know. How? Add link! Vaguery 14:22, 29 May 2006 (PDT)

 <teiHeader>
   <fileDesc>
     <titleStmt>
       <title>Alice's Adventures in Wonderland</title>
       <respStmt><resp>Illustrated by</resp> <name>John Tenniel</name></respStmt>
       <author><name reg="Carroll, Lewis">Lewis Carroll</name></author>
       <editor role="illustrator"><name reg="Tenniel, John">John Tenniel</name></editor>
     </titleStmt>
     <editionStmt><edition n="30">Edition 30</edition></editionStmt>
     <publicationStmt>
       <publisher>Project Gutenberg</publisher>
       <date value="1991-01">January, 1991</date>
       <idno type="etext-no">11</idno>
       <availability>
         <p>This eBook is for the use of anyone anywhere at no cost and
         with almost no restrictions whatsoever. You may copy it, give it
         away or re-use it under the terms of the Project Gutenberg
         License online at</p>
       </availability>
     </publicationStmt>
     <seriesStmt>
       <title level="s">#1 in our series by Lewis Carroll</title>
       <idno type="vol">1</idno>
     </seriesStmt>
   </fileDesc>
   <encodingDesc>
     <classDecl>
       <taxonomy id="lc"><bibl><title>Library of Congress Classification</title></bibl></taxonomy>
     </classDecl>
   </encodingDesc>
   <profileDesc>
     <langUsage>
       <language id="en"></language>
       <language id="fr"></language>
     </langUsage>
     <textClass><classCode scheme="lc">PR</classCode></textClass>
   </profileDesc>
   <revisionDesc>
     <change>
       <date value="1991-01">January 1991</date>
       <item>Project Gutenberg edition 10</item>
     </change>
     <change>
       <date value="1994-03">March 1994</date>
       <item>Project Gutenberg edition 30</item>
     </change>
     <change>
       <date value="2003-03">March 2003</date>
       <respStmt><name>Marcello Perathoner</name></respStmt>
       <item>TEI Markup</item>
     </change>
   </revisionDesc>
 </teiHeader>

<divGen type="pgheader" />

This line tells the TEI conversion process to construct a standard PG boilerplate header. See divGen type="pgheader" for more specific information.

<divGen type="encodingDesc" />

This line tells the TEI conversion process to list any special character encoding done in this file. See divGen type="encodingDesc" for more specific information.

<divGen type="footnotes" rend="newpage" />

This line gathers all the footnotes and endnotes scattered throughout the text and places them at the end of the document (in HTML and Text) or at the end of the page (in PDF). This is a very powerful command and saves a lot of time in the text preparation stage.

This line is typically placed at the end of the document in the <back> section.

The Four Main Sections of a PGTEI Document


<teiHeader>

See the information in the previous section.

<front>

This includes the title page, the header, the Table of Contents, Introductions, Prefaces, etc.

<body>

The main bulk of the book.

<back>

The appendices, indices, footnotes, colophon, footer, etc.
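
As a rough sketch of how these four sections fit together, a minimal PGTEI file might be laid out as below. The outer <TEI.2> and <text> wrappers are assumed from the general TEI (P4) layout rather than taken from this page, and the positions of the <divGen> lines follow the descriptions above (the PG header in the front matter, the collected footnotes in <back>). Check the DTD for the exact content each section accepts.

```xml
<TEI.2>
  <teiHeader>
    <!-- metadata about the eBook; see the <teiHeader> example above -->
  </teiHeader>
  <text>
    <front>
      <!-- title page, PG header, Table of Contents, preface ... -->
      <divGen type="pgheader" />
    </front>
    <body>
      <!-- the main bulk of the book, one <div> per chapter -->
      <div type="chapter" rend="newpage">
        <head>Chapter 1</head>
        <p> ... </p>
      </div>
    </body>
    <back>
      <!-- appendices, index, collected footnotes ... -->
      <divGen type="footnotes" rend="newpage" />
    </back>
  </text>
</TEI.2>
```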

Basic Markup

Divisions or Chapters - <div>

This divides the text into divisions (such as chapters).

 <div type="chapter" rend="newpage">
   <index index="toc" /><index index="pdf" />
   <head>Chapter 1 - It Starts!</head>
   <p> ... </p>
 </div>

Paragraphs - <p>

Just like in HTML, this markup surrounds a paragraph. A paragraph can be said to be a distinct division of written or printed matter that consists of one or more complete sentences, and typically deals with a single thought or topic. If you encounter a distinct division of text which looks like a paragraph, but is not composed of at least one complete sentence, and you cannot discover a more appropriate TEI element to use in its place, use the <ab> ("anonymous block") element instead.

A paragraph:

 <p>This is an example paragraph.  Boring isn't it?</p>

Not a paragraph:

 <ab>DAVID A. WILSON - Attorney and Counselor at Law</ab>

Italics - <i></i>

This one is handled inside a rend attribute.

 <p rend="font-style: italic">This paragraph is in italics.</p>

renders as

 This paragraph is in italics.

But see also

 This text is <emph>emphasized</emph>.

or

 This text is in <foreign lang="fr" rend="font-style: italic">une langue étrangère</foreign>.

depending on why the text was italicized in the first place. (If it's neither, the <p>-based markup is fine.) The lang attribute should be one defined in the <langUsage> section in the header; it is used by the PDF backend for hyphenation purposes.
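
To illustrate the link between the two, here is a sketch in which the fr code used on <foreign> matches a <language> entry declared in the header's <langUsage> section. The <language> lines are taken from the example header earlier on this page; the sentence itself is made up for illustration.

```xml
<!-- in the <teiHeader>: declare every language the text uses -->
<langUsage>
  <language id="en"></language>
  <language id="fr"></language>
</langUsage>

<!-- in the text: lang="fr" refers to the declaration above -->
<p>She greeted us with a cheerful
<foreign lang="fr" rend="font-style: italic">bonjour</foreign>.</p>
```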

Bold - <b></b>

 <p rend="font-weight: bold">This paragraph is in bold.</p>

renders as

 This paragraph is in bold.

Small-caps - <sc></sc>

 <p rend="font-variant: small-caps">This paragraph is in small-caps.</p>

renders as

 This paragraph is in small-caps.

Sidenotes - [Sidenote: xxx]

 <note place="marginnote"><p>xxx</p></note>

See sidenotes in the Post-Processing FAQ.

Illustrations - [Illustration: xxx]

   <figure rend="width: 95%" url="images/image##.png">
   <figDesc>optional image description</figDesc>
   </figure>

See Illustration section in the PP-FAQ.

Poetry - <lg>, <l>

This is poetry markup. The <lg></lg> markup surrounds a stanza and the <l></l> surrounds a single line in that stanza.

 <lg>
 <l>Mary had a little lamb,</l>
 <l>Its fleece was white as snow.</l>
 <l>And everywhere that Mary went</l>
 <l>The lamb was sure to go.</l>
 </lg>

You can handle indentation through a rend="margin-left: X" attribute. Replace X with the number of indent spaces you want to use.

Note: I believe the latest convention is to use rend="style=margin-left: X" instead: anyone with a clue please confirm.

 <lg>
 <l>Mary had a little lamb,</l>
 <l rend="margin-left: 2">Its fleece was white as snow.</l>
 <l>And everywhere that Mary went</l>
 <l rend="margin-left: 2">The lamb was sure to go.</l>
 </lg>

NOTE: You can add additional information such as line numbers and the type of poem, but that is beyond the basic level of markup covered in this document.

Formatting Using rend="" Attributes

The rend attribute allows you to assign presentational information to some text. Since it is an attribute, it is written inside an element's opening tag (e.g., <p rend="center">).

rend values can be roughly divided into two types: font formatting and layout formatting.

For a list of layout-oriented rends ... See PGTEI Documentation [2]

For a list of font formatting-oriented rends ... See PGTEI Documentation [3]

Division or Chapter Titles - <head>

This marks text as the heading or title of the division. The conversion process will emit the correct level of <h1>, <h2>, etc. markup in the HTML conversion, or the correct number of blank lines around the heading in the text conversion. This process is completely automated; no rends are needed.

If you want this heading to appear in the Table of Contents, put the following line right before it.

 <index index="toc" /><index index="pdf" />

The toc entry adds the heading to the Table of Contents. The pdf entry adds it to the PDF's bookmarks.

NOTE: If you want an entry in the Table of Contents that does not conform to a <head></head> statement, you can do so like this:

 <index index="toc" level1="Text as it will appear in the Table of Contents" />
 <index index="pdf" level1="Text as it will appear in the Table of Contents" />

This can be useful in the pdf line, especially, since the PDF format only supports ISO-8859-1 characters, and you may need to manually type in a workable ISO-8859-1 character string.

You may see 'pdb' indexing on occasion; this is left over from version 0.3, and does nothing in version 0.4, as PDB support was removed. See this forum post on the topic.

Page Numbers - <pb n="x" />

This is a page marker. Every time there is a new page in the original, you can mark this with <pb n="x" /> where x is the page number.

NOTE: It is also recommended to add <anchor id="Pgxx" /> right after it, where xx is the page number. This provides an HTML anchor in that conversion, which makes linking to a page (say, from an index) much easier.
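
Putting the two together, the point where page 12 of the original ends and page 13 begins might be marked as in the sketch below. The page number and the surrounding words are invented for illustration.

```xml
<p>... the last words printed on page 12
<pb n="13" /><anchor id="Pg13" />
and the first words printed on page 13 ...</p>
```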

Footnotes - <note place="foot"><p>...</p></note>

Footnotes are handled inline in the text. For example:

 <p>This is an example sentence.  If I wanted to place a
 footnote<note place="foot"><p>Look I did!</p></note>,
 then I would use the note element.</p>

This would render roughly like this:

 This is an example sentence.  If I wanted to place a footnote[1], 
 then I would use the note element.

 [Footnote 1: Look I did!]

Thought Breaks - <milestone unit="tb" rend="stars: n" />

This is the equivalent of the <tb> markup. Replace n with the number of stars you want.

The full usage is as follows:

 <milestone unit="tb" rend="stars: n|rule: n%" />
   no rend         : insert small vertical gap
   rend="stars: n" : insert n stars
   rend="rule: n%" : insert horizontal rule n percent wide
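
As a concrete illustration of the three forms listed above, with the values 5 and 50% picked arbitrarily:

```xml
<!-- no rend: a small vertical gap -->
<milestone unit="tb" />

<!-- a row of five stars -->
<milestone unit="tb" rend="stars: 5" />

<!-- a horizontal rule 50 percent wide -->
<milestone unit="tb" rend="rule: 50%" />
```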

Block Quote - <quote rend="display">

This is roughly equivalent to the /# ... #/ markup we use around block quotes at DP.

 <quote rend="display">
   <p>This is an indented paragraph.</p>
 </quote>