User:Jhellingman/TEI Guidelines

From DPWiki

My personal TEI guidelines differ in a number of ways from the PGTEI guidelines. These differences are 90 percent historical (I developed my TEI conventions independently from Marcello, and will have to update over 500 projects to conform to PGTEI), and 10 percent philosophical (we have some different opinions on what constitutes good semantic mark-up).

Why TEI

HTML and plain vanilla text are the norm at DP, so why all the fuss about using TEI?

To me, the benefits of using TEI are plain and simple. It is all about separation of concerns, that is, divide and conquer. With TEI, I first decide what a certain element in the text is, and only later what it should look like. This has a number of benefits.

  1. We can use the semantic information for other purposes.
    • Extracting fragments with certain characteristics. (Since I also tag the language of elements, I can extract all phrases in a certain language from my TEI files; this feature is great for spell-checking.)
    • Automatic generation of tables of contents and indexes. (Although I normally keep the ones that exist in a book.)
  2. We can decide on the 'looks' using scripts, and thus achieve better consistency in formatting. (Although existing books are often not very consistent in that anyway.)
    • Whenever we improve the formatting scripts, we can re-run them, and have all our old HTML versions improved automatically.
    • We can have multiple versions of the formatted text available.
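As an illustration of the first benefit, here is a minimal Python sketch of extracting all phrases in one language. It assumes the file has already been converted to XML and that languages are marked with a lang attribute (as in TEI P3/P4; P5 uses xml:lang). The sample text and the function name are hypothetical, and elements that merely inherit a language from an ancestor are not picked up.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; real files would be read from disk.
SAMPLE = """<p>He greeted us with <foreign lang="la">pax vobiscum</foreign>
and left with a cheerful <foreign lang="fr">au revoir</foreign>,
then muttered <hi lang="la">sic transit</hi>.</p>"""


def phrases_in_language(xml_text, lang):
    """Return the text of every element whose lang attribute equals lang."""
    root = ET.fromstring(xml_text)
    return ["".join(el.itertext()).strip()
            for el in root.iter()
            if el.get("lang") == lang]


if __name__ == "__main__":
    # The resulting list could be fed to a spell-checker for that language.
    print(phrases_in_language(SAMPLE, "la"))
```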

But, I have to admit, TEI also has a number of drawbacks.

  1. TEI has a steep learning curve.
    • The basics are not more difficult than HTML, but you have to learn quite a lot about features used in books only a few times (for example, to support verse, plays, or dictionaries).
  2. TEI tooling is far less advanced than what you can find for HTML (even though both TEI and HTML are based on SGML).
    • Converting TEI to HTML (or any other format) can be automated, but is far from trivial to program (I've done my best to supply some XSLT scripts, though).
  3. It is sometimes hard to guess the semantics of a certain layout in a printed book.
    • So you'll have to use rendering hooks (or hints) every now and then.

Which Version of TEI?

Currently three versions of TEI are in use:

  • P3: Based on SGML and DTDs. I use this mainly because SGML is easier to type and the conversion to XML is automatic.
  • P4: A slightly revised version of P3, which made the shift to XML.
  • P5: A major overhaul of TEI, based on XML and Unicode, and using Relax NG schemas.

Much of my tooling is based on P3, but P5 is a considerable improvement that resolves a number of outstanding issues with both P3 and P4. I therefore use elements from P5 in the ebooks I work on, and can generate fully P5-compliant XML in my process. The steps I typically follow are:

  1. Validate my P3 master file against the DTD (with nsgmls).
  2. Convert any ad-hoc transcriptions I use to SGML entities (P3 is not XML based, and hence does not support Unicode out-of-the-box).
  3. Convert the master file to XML (with sx).
  4. Convert this XML file to TEI P4 (with saxon and the tei2xtei.xsl stylesheet).
  5. Convert the resulting XML file to HTML (with the tei2html.xsl stylesheet).
  6. Convert the XML from step 4 to TEI P5 (using a series of P4 to P5 scripts currently in development).

For the time being, however, I will be sticking with SGML and transcription schemes for my master files.

General Encoding Principles

  • Tagging is additional. This means that I consider a TEI encoded text as characters plus tagging, where the characters are all characters in the original source, and the tagging is an interpretation or indication of what those characters are, such as part of a heading, paragraph or line of poetry. No characters will be removed or replaced by tags, and when you remove the tags, the characters will remain. Of course, there are exceptions:
    • Corrections of obvious mistakes. Obvious mistakes appear in a corr element, but the correction replaces the original, which will be moved to a sic attribute.
    • Page numbers, headers, etc. Page numbers, page headers and other text that results from the physical characteristics of a book, such as signature indicators and repeated table headings, are dropped from the main text. Page numbers are nevertheless important to keep, as they are often used for internal and external cross-references, so they reappear in the n attribute of the pb element. Page headers and signature information may be re-included in fw elements if of particular interest; this is, however, not a high priority.
    • End-of-line hyphenation is resolved.
  • Retain original page numbers and footnote markers. To ease further processing of cross references, all original page numbers are kept (as value of the n attribute on a pb element), as are original footnote numbers. These may be renumbered or removed in the final rendering phases, but not in the TEI master files.
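The "characters plus tagging" principle can be checked mechanically. The Python sketch below is only an illustration: it assumes an XML version of the text, and the sample paragraph and function names are hypothetical. Removing the tags leaves the characters of the source, while the page number survives in the n attribute of pb.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment with a page break inside a paragraph.
SAMPLE = '<p>The <hi>quick</hi> brown fox<pb n="12"/> jumps.</p>'


def plain_text(xml_text):
    """Drop all tags and keep only the characters between them."""
    return "".join(ET.fromstring(xml_text).itertext())


def page_numbers(xml_text):
    """Collect the original page numbers recorded on pb elements."""
    return [pb.get("n") for pb in ET.fromstring(xml_text).iter("pb")]


if __name__ == "__main__":
    print(plain_text(SAMPLE))
    print(page_numbers(SAMPLE))
```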

Note that although TEI includes a large number of semantic tags, such as foreign, it is not always possible to use them, and we need to revert to appearance-oriented tags, such as hi (for highlighted text), when the original semantic intent cannot be deduced from the work. In many cases, semantic tags are also considerably more labor-intensive to apply, and are thus not used.

Common Additions

When preparing a text for Project Gutenberg, I may add the following additional information.

  • Expansions of uncommon abbreviations used in the text. In HTML, the expansion will appear as a tool-tip. Reader software should be able to read the expanded version of the abbreviation.
  • SI equivalents for all obsolete units of measure (such as pounds, stones, miles, feet, etc.) that appear in the text. In HTML, the SI equivalent will appear as a tool-tip.
  • Transcriptions for non-Latin scripts. In HTML, the transcription will appear as a tool-tip.

In the TEI header, I will list my common editorial practices, and any special considerations for the text in question. I may also add details on the author and other people involved in the production of the original.

The transformation to HTML can automatically generate a colophon with transcriber's notes, listing all corrections made.
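To show what such a list of corrections draws on, here is a small Python sketch that collects every corr element together with its sic attribute. It illustrates the idea only and is not the stylesheet that produces the colophon; the sample text and function name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with two corrected misprints.
SAMPLE = """<text><p>He <corr sic="recieved">received</corr> the letter on
<corr sic="Wendesday">Wednesday</corr>.</p></text>"""


def list_corrections(xml_text):
    """Return (source reading, correction) pairs in document order."""
    root = ET.fromstring(xml_text)
    return [(corr.get("sic"), "".join(corr.itertext()))
            for corr in root.iter("corr")]


if __name__ == "__main__":
    for source, correction in list_corrections(SAMPLE):
        print(f"{source} -> {correction}")
```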

Naming Conventions for Ids

I use the following naming conventions for internal ids in TEI documents. I only provide ids when they are needed to link to certain elements, or, in the case of pictures, to generate a file name for the external picture file. Note that wherever possible in the TEI to HTML transform, the ids are preserved in the final HTML, and thus can be used to deep-link into the HTML versions.

  • <pb>: pbn, where n is the actual page number. If page numbers are not unique, they can be disambiguated by prefixing one set of page numbers with a section prefix, for example, pb1.3 for page 3 in the first part of the work.
  • <note>: nn.m, where n is the page number the note occurs on, and m the note number. For m, I use the actual number if present, or one from the range A...Z if asterisks, daggers, etc. are used.
  • <div0>: ptn or bkn, depending on whether the <div0> is called a part or book. n is the actual number used, in Arabic figures. Note that I only use <div0> in exceptional cases.
  • <div1>: chn, where n is the actual number of the chapter in Arabic figures.
  • The main table of contents always gets the id toc.
  • The main list of illustrations always gets the id loi.
  • <div2>: chn.m, where n is the number of the chapter in Arabic figures, and m is the number of the section. <div3>, etc., extend this notation with further numbers for subsections, etc.
  • <figure>: pn, where n is either the page number of the page on which the figure appears, or the figure number as used in the source. When multiple unnumbered figures appear on a page, the id becomes pn-m, where m is a sequence number, counting the figures in the order a reader would encounter them in the text. When a figure appears on an unnumbered page the page number of the facing page is supplemented with a letter in the range a...z. When a source numbers illustrations, and makes a distinction between plates and figures, the plates get ids as indicated here, and the figures get ids starting with fig.

Ids starting with the letter x are reserved for automatically generated ids.
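The conventions above are regular enough to express as small helper functions. The following Python sketch is hypothetical (the function names are not part of any existing tool) and covers only the common cases: pages, notes, chapters with sections, and figures.

```python
def page_id(page, section=None):
    """pb3, or pb1.3 when a section prefix is needed to disambiguate."""
    return f"pb{page}" if section is None else f"pb{section}.{page}"


def note_id(page, marker):
    """n12.3 for note 3 on page 12; marker may be a letter for unnumbered notes."""
    return f"n{page}.{marker}"


def chapter_id(chapter, *sections):
    """ch5 for a chapter, ch5.2 for a section, ch5.2.1 for a subsection."""
    return "ch" + ".".join(str(number) for number in (chapter, *sections))


def figure_id(page_or_number, sequence=None):
    """p12, or p12-2 for the second unnumbered figure on page 12."""
    base = f"p{page_or_number}"
    return base if sequence is None else f"{base}-{sequence}"


if __name__ == "__main__":
    print(page_id(3, section=1), note_id(12, "A"), chapter_id(5, 2), figure_id(12, 2))
```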

Differences between my TEI guidelines and PGTEI

  • Use of SGML. I use the SGML version of TEI, which is slightly easier for human editors than its XML reincarnation. Since I employ an automated conversion from SGML to XML, this is no problem. The automatic conversion is performed with J. Clark's SX tool, available at www.jclark.com. After this I run the tei2tei.xsl stylesheet.
  • Use of Latin-1 only. Since my SGML work predates Unicode, I don't use Unicode and stick to Latin-1 only. All characters outside Latin-1 are encoded with entities. When including sections in non-Latin script, such as Greek, I use ad-hoc transcription schemes. Since I have tools to convert these to Unicode, this is no problem.
  • Use of extensions. I try to avoid extensions to TEI, and stick exclusively to TEI Lite, or borrow elements from the full-blown TEI on a case-by-case basis when required.
  • Use of the rend attribute. We both use the rend attribute to provide hints on rendering elements. I use the concept of rendition ladders, whereas PGTEI uses (since version 0.4) slightly modified CSS. Since this is mainly a syntactic distinction, I may migrate to CSS in the future. (Which means I'll have to write a conversion tool for this purpose.)
  • Use of <divGen> for tables of contents. Since we are digitizing pre-existing texts, I avoid the use of the <divGen> element for tables of contents and similar sections in favor of encoding these as they appear in the source. The only exception is where the source has no table of contents. Note that titles in original tables of contents often differ considerably from the actual headings used. Sometimes this is (apparently) intentional, sometimes a mistake, which I will then correct.
  • Use of <divGen> for footnotes. I automatically generate footnote sections at the end of the chapter they appear in. This requires some tweaks with nested texts in quoted sections, etc., but these can be handled by software easily.
  • Use of <q> elements. I try to follow the principle that TEI texts should be the characters in the source plus tagging, and thus encode all quotation marks with the proper characters or character entities, as they appear in the source. I only use the <q> element when required to add attributes to a run-in quotation. I do not object to tagging quotations, except that I typically don't consider it worth the trouble, but I insist on keeping the quotation marks.

Extensions Used

In my TEI master files, I use the following extensions to TEI Lite.

<xref url="">

To reference URLs directly in an <xref> element, I've added the @url attribute.

To reference another PG book, I simply write <xref url="pg:6737">, and the transform will turn it into a live HTML link. Ids inside another Project Gutenberg book can be referenced via a shortcut: <xref url="pg:6737#ch1">. Currently, such URLs will translate to http://www.gutenberg.org/etext/6737 and http://www.gutenberg.org/files/6737/6737-h/6737-h.htm#ch1, respectively.
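The mapping from the pg: shortcut to a full URL can be sketched in a few lines of Python. The function name is hypothetical and the real transform is an XSLT stylesheet; only the two URL patterns are taken from the description above.

```python
def expand_pg_url(url):
    """Expand a pg: shortcut to a full Project Gutenberg URL.

    URLs without the pg: prefix are returned unchanged.
    """
    if not url.startswith("pg:"):
        return url
    # Split "6737#ch1" into the book number and an optional fragment id.
    book, _, fragment = url[len("pg:"):].partition("#")
    if fragment:
        return (f"http://www.gutenberg.org/files/{book}/"
                f"{book}-h/{book}-h.htm#{fragment}")
    return f"http://www.gutenberg.org/etext/{book}"


if __name__ == "__main__":
    print(expand_pg_url("pg:6737"))
    print(expand_pg_url("pg:6737#ch1"))
```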

Note that in TEI P5, the xref and xptr elements have been retired, and the attribute @url has become the norm.

<ditto>

<ditto> is used to indicate parts of tables or lists that have been represented by repeat commas (ditto marks) in the source. These parts are always fully written-out, but surrounded by <ditto> tags. The idea here is that reading software can correctly read this text, while rendering software can re-insert the ditto marks while maintaining the proper spacing.
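A rough Python sketch of the two uses follows. It assumes, purely for illustration, that a renderer puts one ditto mark in place of each repeated word; the choice of mark, the function names and the sample row are all hypothetical.

```python
import re

DITTO = re.compile(r"<ditto>(.*?)</ditto>", re.S)
DITTO_MARK = "\u201d"  # assumption: a right double quotation mark as ditto mark


def render_dittos(text, mark=DITTO_MARK):
    """For display: replace each repeated word with a ditto mark."""
    def one_mark_per_word(match):
        return " ".join(mark for _ in match.group(1).split())
    return DITTO.sub(one_mark_per_word, text)


def read_aloud_text(text):
    """For reading software: keep the words, drop only the tags."""
    return DITTO.sub(lambda match: match.group(1), text)


if __name__ == "__main__":
    row = "<ditto>Province of</ditto> Cebu"
    print(render_dittos(row))
    print(read_aloud_text(row))
```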

<trans>

<trans> is used to indicate the transcription of sections in non-Latin script, such as Greek phrases cited in works. The transcription is given in the trans attribute. In HTML versions, the original will show, with the transcription as a pop-up. In ASCII versions, I will use the transcription. In future versions, I may start using the new <choice> element for this purpose.
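The two renderings described above can be sketched as follows in Python. The use of an HTML span with a title attribute for the pop-up is an assumption made for this illustration, as are the function names and the sample phrase.

```python
import re

# Matches <trans trans="...">original script</trans>.
TRANS = re.compile(r'<trans trans="([^"]*)">(.*?)</trans>', re.S)


def ascii_version(text):
    """Replace the original script with its transcription."""
    return TRANS.sub(lambda m: m.group(1), text)


def html_version(text):
    """Keep the original script and offer the transcription as a tool-tip."""
    return TRANS.sub(
        lambda m: f'<span title="{m.group(1)}">{m.group(2)}</span>', text)


if __name__ == "__main__":
    sample = 'the <trans trans="logos">λόγος</trans> of the text'
    print(ascii_version(sample))
    print(html_version(sample))
```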

<as>

<as> is used to mark arbitrary strings that need some special processing. A typical use is for rendering effects for which no other suitable element can be found. (The idea is to use semantic containers whenever possible, but we are not always able to guess the author's intention when working with old books.) When we use <as>, we have an element to attach attributes to.

The TextHeatMap tool, when applied to TEI files, inserts <as> elements, with the type attribute giving the 'temperature' of a word.

Geographic Locations

The tags <placeName> and <geogName> can be used to tag geographic names and locations. In addition, <geoLoc> can be used to tag a location as expressed in some coordinate system. The reg attribute can be used to provide a regular (or modern) name for the place indicated. Furthermore, the geoloc attribute can be used to encode the location in decimal coordinates.

Example:

The island province of <placeName geoloc="9.833333,124.166667">Bohol</placeName> has an 
area of 4,117.3 square kilometers, and is located at 
<geoLoc geoloc="9.833333,124.166667">9°50′N, 124°10′E</geoLoc>.

Tagging locations as such will make it easy to generate maps of the places mentioned within a book.
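As a sketch of how such a map could be fed, the Python below gathers every element carrying a geoloc attribute and returns its name with decimal latitude and longitude. It also shows the conversion from degrees and minutes to the decimal form used in the attribute. The function names are hypothetical and the input is assumed to be XML.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<p>The island province of
<placeName geoloc="9.833333,124.166667">Bohol</placeName> has an area of
4,117.3 square kilometers.</p>"""


def tagged_places(xml_text):
    """Return (name, latitude, longitude) for each element with a geoloc."""
    places = []
    for el in ET.fromstring(xml_text).iter():
        geoloc = el.get("geoloc")
        if geoloc is None:
            continue
        lat, lon = (float(part) for part in geoloc.split(","))
        places.append(("".join(el.itertext()).strip(), lat, lon))
    return places


def to_decimal(degrees, minutes, hemisphere):
    """Convert degrees and minutes to signed decimal degrees (6 places)."""
    value = round(degrees + minutes / 60, 6)
    return -value if hemisphere in ("S", "W") else value


if __name__ == "__main__":
    print(tagged_places(SAMPLE))
    print(to_decimal(9, 50, "N"), to_decimal(124, 10, "E"))
```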

A good (sometimes overwhelming) source for location information is the U.S. National Geospatial-Intelligence Agency. Further help may be found on Statoids.com and Fallingrain.com.

The HTML version of the text will use the geo microformat.

Dates and date-ranges

Amounts

<amount> is used to tag an amount of money, as expressed in a certain currency. It has the following attributes:

  • @currency: The currency code for this currency following ISO 4217. For obsolete currencies without an ISO code, we will introduce ad-hoc solutions.
  • @date: The date this amount was stated. Can be just a year, and defaults to the value found in the docDate element.
  • @amount: The actual amount, expressed as a decimal number of the main unit. (For example, in the older British system a shilling is GBP 0.05, and a penny GBP 0.0041667.) Whether to allow notations like "-/3/2" for such non-decimal systems is still under consideration.

This tag allows us to generate pop-ups with the equivalent in current-day currencies. Of course, an equivalent is sometimes difficult to define, especially if the economic conditions are considerably different, as they were a few centuries ago. Possible choices would be purchasing power parity or daily wage for unskilled labor.

Example:

This book cost me <amount currency="GBP" amount="0.158333" date="1912">three shillings two pence</amount>.

Some important non-decimal currency systems:

  • GBP until 1971: 1 pound = 20 shillings = 240 pence
  • INR until 1957: 1 rupee = 16 annas = 64 paise = 192 pies (= 1/15 mohur)
  • PKR until 1961: 1 rupee = 16 annas = 64 paise = 192 pies
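The decimal value for the @amount attribute follows directly from these ratios. The Python sketch below works it out for the pre-1971 British system, using exact fractions to avoid rounding errors until the final formatting step. The function names are hypothetical, and the parser for the "-/3/2" notation only illustrates a notation that is still under consideration.

```python
from fractions import Fraction


def lsd_to_pounds(pounds=0, shillings=0, pence=0):
    """Convert pounds, shillings and pence to an exact number of pounds."""
    return Fraction(pounds) + Fraction(shillings, 20) + Fraction(pence, 240)


def parse_lsd(notation):
    """Parse a pounds/shillings/pence notation such as "-/3/2"; "-" means zero."""
    pounds, shillings, pence = (
        0 if part == "-" else int(part) for part in notation.split("/"))
    return lsd_to_pounds(pounds, shillings, pence)


def amount_attribute(value, places=6):
    """Format the value as a decimal string for the amount attribute."""
    return f"{float(value):.{places}f}"


if __name__ == "__main__":
    # Three shillings two pence: 3/20 + 2/240 = 19/120 of a pound.
    print(amount_attribute(lsd_to_pounds(shillings=3, pence=2)))
```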

See Apples to Peeren converter.

Units

<measure> is used to tag a measure, as expressed in a certain unit. It has the following attributes:

  • reg: The equivalent of the measure in metric units.

Example:

He walked <measure reg="16 km">ten miles</measure> to get back home.
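The value of the reg attribute can be computed rather than looked up. A minimal Python sketch follows, with a small and deliberately incomplete table of conversion factors. The function name and the default of rounding to whole units are assumptions made for this illustration.

```python
# Exact definitions of a few customary units in SI terms.
UNIT_TO_SI = {
    "mile": (1.609344, "km"),
    "foot": (0.3048, "m"),
    "pound": (0.45359237, "kg"),
    "stone": (6.35029318, "kg"),
}


def reg_value(quantity, unit, digits=0):
    """Return the metric equivalent as a string for the reg attribute."""
    factor, si_unit = UNIT_TO_SI[unit]
    return f"{quantity * factor:.{digits}f} {si_unit}"


if __name__ == "__main__":
    # Ten miles is 16.09344 km, which rounds to the "16 km" of the example.
    print(reg_value(10, "mile"))
```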

Tools

I use the following tools to process my TEI files.

  • Patc, a home-grown program to quickly find-and-replace many things at once. Useful for converting sections in various transcriptions to Unicode.
  • Perl, the well-known scripting language.
  • James Clark's sp toolkit for parsing SGML (nsgmls) and converting SGML to XML (sx).
  • Michael Kay's saxon XSLT processor.
  • Home-grown XSLT style sheets to convert XML to HTML.
  • Tidy, to clean up the generated HTML.
  • A number of home-grown Perl scripts to glue everything together.

Links

http://www.lib.umich.edu/tcp/docs/index.html

Downloads

  • The source code of the tools, scripts and stylesheets is available from GitHub, where it can also be retrieved with a Subversion checkout.
  • My completed Project Gutenberg books' master files will appear on GitHub under the organization name GutenbergSource.