Unicode and You
Boilerplate: This text uses UTF-8 (Unicode) file encoding. If the apostrophes and quotation marks in this paragraph appear as garbage, you may have an incompatible browser or unavailable fonts. First, make sure that your browser’s “character set” or “file encoding” is set to Unicode (UTF-8). You may also need to change the default font.
This page is a compilation of answers to assorted questions about post-processing in Unicode, and about conversion in general: both UTF-8 to Latin-1, and Latin-1 to ASCII. The terms “Unicode” and “UTF-8” are used interchangeably. They are not really the same thing, but if you already know what UTF-16 is, you probably do not need to read this stuff.
Do I gotta?
Well, no, you don’t gotta. Unless your text is in Chinese. Or Armenian or Greek or Hindi or any of the other languages that use a non-Roman script. That one’s easy. But there are other situations where Latin-1 won’t work, including most Eastern European languages. Beyond that again, you’re in judgement-call territory. Does your text have a lot of Greek words? Is the main character named Phœbe? Does the printed original use „low-9 high-6“ quotation marks?
Yes, OK, but how?
The mechanics depend on your text editor or post-processing software. You have to do two things: save the text as UTF-8 in the first place, and make sure it is in UTF-8 every time you reopen a saved document. The text editor may or may not do this automatically. If the file encoding isn’t in plain sight, make it easy on yourself by parking a few unusual characters at the beginning of the file, like the Boilerplate paragraph at the top of this page. If the file encoding is wrong, a sequence like “þ” will turn into gibberish: âÃ¾â, say, or ‚Äú√æ‚Äù.
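A quick way to convince yourself of this is to reproduce the damage deliberately. Here is a minimal Python sketch (Python is just a convenient tool here, not part of any post-processing requirement) that encodes the marker as UTF-8 and then misreads the bytes the way a mis-set editor or browser would:

# Encode “þ” as UTF-8, then decode the raw bytes with the wrong encodings.
marker = "\u201c\u00fe\u201d"        # “þ”
raw = marker.encode("utf-8")
print(raw.decode("latin-1"))         # âÃ¾â plus invisible control characters
print(raw.decode("mac_roman"))       # ‚Äú√æ‚Äù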
Will the BOM make my computer explode?
The BOM or Byte Order Mark is an invisible character placed at the very beginning of some UTF-8 files. Explore your text editor’s settings to find out whether it uses the BOM and, if so, whether you can leave it off by default. In plain-text files, the presence or absence of a BOM makes no difference; the whitewashers’ software will deal with it. But the BOM should not be present in your HTML files, where it can cause trouble.
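If you would rather check a file directly than trust the editor’s settings, the UTF-8 BOM is easy to spot: it is the three bytes EF BB BF at the very start of the file. A minimal Python sketch, with a hypothetical file name:

# Report whether a file starts with the UTF-8 Byte Order Mark.
with open("mybook-utf-8.txt", "rb") as f:    # hypothetical file name
    print("BOM present" if f.read(3) == b"\xef\xbb\xbf" else "no BOM")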
What do I call the file?
Your plain-text file should have a name ending in “UTF-8” (capitalization and hyphens optional). This used to be required by the whitewashers’ software, which otherwise couldn’t tell. Now it is for the whitewashers’ and your own benefit, because the name confirms that this file was intended to be in Unicode. Or that it wasn’t.
Plain text
Your post-processed file will always have a plain-text version. So Lesson #1 is:
Plain Text Is Not The Same As ASCII
The terms “plain text” and “ASCII” are not interchangeable. If they have been sharing space in your brain, build a wall between them.
Plain text is a file format. It means exactly what it sounds like: text and nothing else. No italics, no size changes, no color-coded links, just words. Are you old enough to remember typewriters? That’s what plain text gives you.
ASCII is a file encoding or character set. It defines which letters you’re allowed to use. Basically, they’re the characters you see on an English-language keyboard: plain letters, numerals, and some common symbols like & and *. The special thing about ASCII is that it is shared by all file encodings—think of it as the lowest common denominator—so even if your text editor goes haywire and decides that your file is in ISO-Latin-9 or Baltic Rim DOS, any unaccented letters and numbers will be unaffected.
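You can verify the lowest-common-denominator claim directly. A small Python sketch (the list of encodings is illustrative):

# Pure ASCII text produces byte-for-byte identical files under many
# encodings, which is why unaccented letters survive a wrong guess.
text = "Plain text, 123 & *"
for enc in ["ascii", "latin-1", "utf-8", "cp1252", "cp775"]:  # cp775 = Baltic Rim DOS
    assert text.encode(enc) == text.encode("ascii")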
Now What?
When you convert your raw file into Unicode, it will not look any different. All you’ve done so far is make it possible to add or restore characters that were lost or altered in proofing.
Here are some changes to consider. If your text editor has a “match case” option, use it; second choice is to make it case-sensitive and do upper- and lower-case forms separately. Some of the following will be a lot faster and easier if you can use RegExes, but as with everything else in post-processing, there is always an alternative.
Caution: Think twice before changing all your double hyphens -- into em dashes —. Text files are often read in a monospaced font—if they’ve got tables or anything with significant horizontal formatting they almost have to be—and then em dashes become indistinguishable from single hyphens. On the other hand, if you like using en dashes between pairs of numbers, go ahead; it won’t matter if they do come out looking like hyphens.
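If you decide to convert anyway, a first pass is a one-line regular expression. A Python sketch; the spacing around the double hyphen is an assumption, so match the pattern to your book’s conventions:

import re
text = "It was--he said--quite impossible."
# Swallow any space around the double hyphen; some books space their dashes.
print(re.sub(r"\s*--\s*", "\u2014", text))   # It was—he said—quite impossible.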
Letters
Grab any [oe] and convert to œ. It’s ugly in monospace, but it’s still a recognizable character. Also check for unbracketed OE and Oe; some printers didn’t have the capital Œ.
If you can do Regular Expressions, do a global search for \[..\] (open bracket, any two characters, close bracket), or \[\D\D\] if there are a lot of numbered footnotes. Almost everything you net this way will be something with a diacritic that’s easily converted, such as [~e] to ẽ. Searching in this format will also turn up those minimalist [**] notes, which you may as well figure out sooner rather than later.
Careful! When you start doing the replacements, like [oe] to œ, either switch off RegExes or \escape the brackets, or yœu will bœ œxtrœmœly sœrry.
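Here is what both steps might look like in Python; the replacement list is a made-up sample, not anyone’s official table:

import re
text = "a word like ver[~e] in the original[**unclear]"
# The search: bracketed two-character groups, plus the [** notes.
print(re.findall(r"\[\D\D\]|\[\*\*[^\]]*\]", text))
# The replacement: plain string substitution sidesteps the
# bracket-as-regex trap entirely.
for old, new in {"[~e]": "\u1ebd", "[oe]": "\u0153", "[OE]": "\u0152"}.items():
    text = text.replace(old, new)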
Numbers
Almost all fractions can be restored. Search for \d/\d (digit, slash, digit) and see what comes up. Halves and quarters are in Latin-1; Unicode has the full set of /3 /5 /6 (no /7) and /8. Caution: If you have tables using the full range of fractions, they may come out a bit wiggly, depending on the fonts involved.
In some cases you may also need or want to replace superscripted numerals, not only ¹²³ but the whole ⁴⁵...⁹⁰ series, along with the subscripts ₀₁...₈₉. As with fractions, they may display inconsistently.
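Both jobs are mechanical enough to script. A Python sketch with a sample replacement table (extend it to whatever your text actually uses, and mind compound fractions like 1½ proofed as 11/2):

import re
text = "Add 1/2 cup, then 3/4, then 1/3."
print(re.findall(r"\d/\d", text))    # survey what the text contains
fractions = {"1/2": "\u00bd", "1/4": "\u00bc", "3/4": "\u00be",   # Latin-1
             "1/3": "\u2153", "1/8": "\u215b"}                    # Unicode only
for plain, glyph in fractions.items():
    text = text.replace(plain, glyph)
print(text)                          # Add ½ cup, then ¾, then ⅓.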
Sorry, but there are no Unicode characters for the old-fashioned high-low numerals; they are purely a feature of the font you are using.
Symbols and Punctuation
Make all quotation marks and apostrophes curly. Unless you are very talented with Regular Expressions, you will almost certainly have to do some of them manually. Watch out for multi-paragraph quotations, and for newer British texts that use single quotes. An utterance like
‘’Tain’t so!’ said M‘Guire
is not likely to be RegExed into submission.
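The easy cases, though, can be handled in a first automated pass. A rough Python sketch; it deliberately leaves the hard cases (nested quotes, multi-paragraph quotations, leading apostrophes like ’Tain’t) for you to fix by hand:

import re
text = "\"Come in,\" she said. \"It's open.\""
# A double quote at the start of a line or after whitespace opens;
# every other double quote closes.
text = re.sub(r'(^|(?<=\s))"', "\u201c", text, flags=re.M)
text = text.replace('"', "\u201d")
# An apostrophe between letters is always a right single quote.
text = re.sub(r"(?<=\w)'(?=\w)", "\u2019", text)
print(text)   # “Come in,” she said. “It’s open.”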
If your text uses “low-9” quotation marks „ „ as ditto marks, you can pick them up in a global search for spaced quotes—a search you’d be doing anyway.
If there are only a few widely spaced footnotes, you may choose to restore the original † and ‡ (daggers) markers.
Non-Roman Scripts
De-transliterate any Greek and, if possible, other scripts like Hebrew or Arabic. Depending on your available software, you may be able to do part of this automatically, or you may have to change one letter at a time. If there are a lot of scattered Greek words and phrases, you may find it easier to pull them out into a separate Greek-only file, de-transliterate in a batch, and put them back. The easiest way to restore diacritics that were lost in proofing is to post in one of the Greek (Hebrew, Arabic...) Help threads. Someone will do it for you, so all you have to do is copy and paste.
Don’t get rid of the original transliteration yet. You will need it later.
Before you go any further
Now that you’ve got everything in place, spin off a copy for the HTML version if you plan on making one. Exact timing will depend on your post-processing technique. But if your text contains anything in a non-Roman script, make sure you generate the HTML before you strip away the transliterations.
HTML
If your plain-text file is UTF-8, the HTML can be UTF-8 too (and this is probably easiest), but it need not be. In fact, you can make it UTF-8 even if the plain-text file was Latin-1. Since HTML has other ways to indicate its character set, the file name doesn’t need any special marker.
Alternatively, you could keep the HTML file as ASCII or ISO-8859-1, and insert characters outside this range as (numeric) entities, as described in the next section.
Headers, Character Sets and Entities
Your HTML header must match the file encoding you used while creating the text, or the person reading your HTML document may see garbage anywhere you have a non-ASCII character. If you’re in Latin-1, there will be a line near the top of your HTML file:
<meta http-equiv = "Content-Type" content = "text/html; charset=ISO-8859-1">
For UTF-8, change the last part to:
charset=UTF-8">
Don’t be fooled by the term charset. It sounds as if it means “permitted characters”, but in fact any HTML document can contain any character. The charset declaration simply tells the “user agent” (HTML-speak for “browser”) how to convert what you wrote into what the reader sees. If you use entities for all your non-ASCII characters, like &auml; for ä and &aelig; for æ, then it doesn’t matter what you put in the “charset” field; it can be US-ASCII for all the browser cares. But your HTML file will be a lot easier to read and edit if you keep the characters in their displayed form.
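You can check that equivalence yourself with Python’s html module:

import html
# The entity form and the literal character decode to the same text;
# the charset only tells the browser how to interpret literal bytes.
assert html.unescape("&auml; and &aelig;") == "\u00e4 and \u00e6"   # ä and æ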
If you are using XHTML, you may also wish to set the same character set in the XML declaration, that is, the first line of your file, which should read:
<?xml version="1.0" encoding="UTF-8"?>
When the Header Isn’t Enough
Some older browsers can’t read the charset declaration, but have no trouble with UTF-8 once you’ve told them to use it. So you may want to include a Transcriber’s Note at the beginning of the HTML telling readers to check their text encoding. This tends to be fairly easy to change, on a menu somewhere instead of buried deep in the Preferences. In browsers that don’t do font substitution, users may also need to change the default font. Unfortunately, this one is buried in the Preferences.
Greek and Other Horrors
When you first create the HTML file, the de-transliterated Greek should be side by side with the proofed transliteration, preferably in a consistent format such as μῆνιν [Greek: mênin]. This will make it easier to convert the package into Greek text plus transliteration popup:
<span class = "greek" title = "mênin">μῆνιν</span>
Other unusual characters—anything that your reader’s browser might not be able to display—should also get popups or descriptions of some kind. This can be targeted to the type of text and its likely audience. If every page of the book contains huge slabs of Greek, it is probably safe to assume that your readers can deal with it on their own and don’t need help. But if it’s something like an etymological dictionary, there will be readers who need to know what it says even if they can’t read it themselves.
Watch those Anchors
If your HTML includes anchors generated from the actual words in the text, such as Index entries or chapter titles, take a closer look. No matter what charset you’re using, convert your anchors to plain ASCII—preferably to letters, numbers and lowlines (underscores) only. Unpack any æ or œ; deal with accents in some consistent way; convert unusual letters like yoghs and thorns; replace spaces with _ or simply close them up.
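One way to mechanize all of that, sketched in Python (the special-case table is deliberately minimal; yoghs, thorns and the rest need their own entries):

import re
import unicodedata

def ascii_anchor(s):
    # Unpack ligatures first; decomposition will not split them.
    for src, dst in {"\u00e6": "ae", "\u00c6": "Ae",
                     "\u0153": "oe", "\u0152": "Oe"}.items():
        s = s.replace(src, dst)
    # Split accented letters into base + mark, then drop the marks.
    s = unicodedata.normalize("NFD", s)
    s = "".join(c for c in s if not unicodedata.combining(c))
    # Keep letters, numbers and lowlines only; spaces become lowlines.
    return re.sub(r"[^A-Za-z0-9]+", "_", s).strip("_")

print(ascii_anchor("Phœbe's Cañon"))   # Phoebe_s_Canon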
Caution! Neither the HTML validator nor the Link Checker objects to non-ASCII characters in links. And as long as you’re only navigating within your original file, it won’t matter. But links from other files, especially links that have to travel across the internet, risk becoming garbled in transit. So don’t take any chances.
Downshifting: Retro-Conversion to Latin-1
Do I gotta? If your primary text file is UTF-8, do you also need to make Latin-1 and/or ASCII files? The answer to this depends on what’s in your file. The whitewashers use a program called Unitame, which has a built-in list of conversions that it uses when auto-generating Latin-1 files. At this stage, any non-Latin-1 diacritics are deleted, while common punctuation marks are changed to the nearest equivalent. The letter œ is unpacked to oe (no brackets), and capital Œ becomes Oe. The em dash — changes back to two hyphens --.
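The same idea is easy to picture as a small table-driven pass. A Python sketch of the concept, not Unitame’s actual list:

# A few of the conversions described above, as a replacement table.
latin1_safe = {"\u0153": "oe", "\u0152": "Oe",   # œ, Œ unpacked
               "\u2014": "--",                    # em dash to double hyphen
               "\u201c": '"', "\u201d": '"',      # curly quotes straightened
               "\u2018": "'", "\u2019": "'"}
def downshift(text):
    for src, dst in latin1_safe.items():
        text = text.replace(src, dst)
    return text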
If there is anything in a non-Roman script, your upload acknowledgement will include the line “Unitame says so-and-so-many characters need to be handled manually”. This is mainly directed at the whitewasher, who will then say either “Oh, ###, there’s stuff I have to deal with” or “Whew, the PP did it themselves so I don’t have to.” You may think you’re safe because your file has no Greek in it, but watch out: the dagger †, double dagger ‡ and bullet • (not to be confused with the Latin-1 mid-dot ·) all have to be handled manually. So do any fractions in /3, /5, /6 or /8. If your file does have Greek, bring out that transliteration; if you threw it away, you’ll have to do it all over again.
If you are curious, download the Unitame program—it lives at Sourceforge in the same general area as Gutcheck—and open the unitame.dat file. You should be able to do this in a text editor. Next to each listed letter is either another letter or a blank space.
Here’s the catch: the line “Unitame says all characters can be handled automatically” may not be true for your specific project.
- If you’ve got a linguistics-heavy text, those non-Latin-1 diacritics may not be expendable; they’re probably essential to the meaning.
- High-low quotation marks „“ are auto-converted to "typewriter" quotes. This is fine for Swedish, which now uses American-style double quotes, so the reader won’t be faced with something unfamiliar. But for other languages you may prefer to change the „“ to »guillemets«.
- Capital Œ always becomes Oe, yielding all-capped words like OeDIPUS.
Think of your decision as weighing time. Each file that the whitewashers receive has to be separately checked with Gutcheck and who-knows-what-else, so the question is whether it would take them longer to do the changes by hand than to run one or two extra batteries of tests. Keep in mind that the changes will generally take longer for the whitewashers than for you, because they don’t know the text.
You may choose to retro-convert to something like the proofed form, with letters “unpacked” in brackets (ẽ becomes [~e]), or you may prefer to expand some letters (ẽ becomes en or em). Whatever approach you take, make sure your Transcriber’s Note explains what you have done.
... and then to ASCII
At this stage, the surviving diacritics disappear. If the conversion is done automatically, most diacritics are simply deleted, while ä, ö, ü become ae, oe, ue. The letter æ is unpacked to ae. Fractions revert to their proofed form, so ½ becomes 1/2.
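A sketch of that kind of auto-conversion in Python, which also shows exactly how “coöperate” becomes “cooeperate” (the table is a made-up sample, not any whitewasher’s actual program):

import unicodedata
table = {"\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",
         "\u00e6": "ae", "\u00bd": "1/2"}
def to_ascii(text):
    for src, dst in table.items():                # special cases first
        text = text.replace(src, dst)
    text = unicodedata.normalize("NFD", text)     # split letter + accent
    return "".join(c for c in text if ord(c) < 128)   # delete the rest
print(to_ascii("co\u00f6perate, \u00bd ca\u00f1on"))  # cooeperate, 1/2 canon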
As with Unicode-to-Latin-1 conversion, look at your non-ASCII characters and make sure the auto-conversion does everything you want and nothing you don’t want.
Capital letters depend on the whitewasher. One auto-conversion program goes to Title Case: Ä, Ö, Ü, Æ become Ae, Oe, Ue, and again Ae. Another program goes to ALL CAPS, so you get AE, OE and UE.
- If your text is loaded with words like “coöperate” (dieresis) or “cañon” (tilde), auto-conversion will give you “cooeperate” and “canon”.
- Check for words in ALL CAPS and Title Case. Auto-conversion may give you AeNEAS and UeBERRASCHUNG—or it may give you AEneas and UEberraschung.
- Compound fractions have to be cleaned up by hand: when ½ becomes 1/2, 1½ becomes 11/2.
Along with the characters themselves, look at what auto-conversion does to your line length. The biggest change is probably the degree sign °, which defaults to “ deg.” (leading space, following period) for a total of five characters. Two degree signs in a 72-character line, and you’re up to 80.
What do I tell the whitewashers?
It’s thoughtful to include a line in your upload notes explaining why you made a separate Latin-1 and/or ASCII file. The whitewashers will then know that your object was to save them trouble, not to create extra trouble.
PG no longer requires ASCII versions of everything, so if your text is heavily loaded with complicated stuff, you can include a “no ASCII-7” note with your upload. Texts in languages other than English don’t get an ASCII version unless you ask for one.
Reading your files
You’ve got the file working beautifully on your own computer. But what happens when it goes out into the world to be read by total strangers using unknown equipment?
Terminology
You are the producer. The person who reads your file is the user. The user agent is their software, such as a text reader or browser. You are the first “user”; often PPV is the second one.
File encoding
Plain text
Plain-text files don’t carry file-encoding information. It’s up to the text reader (the user’s word processor, text editor, cell phone, browser, or brand-named device yet to be invented) to figure it out. If it guesses wrong, the user sees garbage for any non-ASCII character. Some text readers have a “reinterpret” option; others require the user to close the file and change the encoding before re-opening. This is out of your control. All you can do is give the appropriate information.
HTML
The HTML header includes a line identifying the file encoding. Most of the time, that’s all you need. But some very old browsers (and possibly other devices that use HTML) can’t read this information, so it’s a good idea to include a bit of boilerplate telling the reader you’re in Unicode (UTF-8).
Fonts
Again, plain-text files don’t carry font information. They use whatever the user has set as a default. HTML can carry font information, but you already know the warnings about this approach. A font that is standard with browsers today may have gone out of fashion in five years.
In rare cases it may be appropriate to suggest a font. But in general, if the user is accustomed to dealing with Greek, Old English or whatever, they will probably already know what fonts they need. And if they’re not accustomed to it, they will most likely run straight to the transliteration anyway.
Font substitution
If your text includes unusual characters or non-Roman scripts, font substitution does most of the work for you. Any character that isn’t available in the default font will be pulled out of the nearest available font that does have it. But this is out of your (the producer’s) control, and generally out of the user’s control as well. Some applications and operating systems use font substitution as a matter of course; some don’t. So it’s a good idea to include a line in your boilerplate about changing the default font.
Rare characters
If possible, try to find out which characters in your file have good font support (Greek, for example, or vowels with macrons) and which ones exist only in a handful of specialist fonts. Any character that only showed up in the most recent version of Unicode (currently 5.0) is not likely to be widely available. In these cases it’s especially helpful to provide a transliteration and/or screenshot. For plain-text files, you might even do it with ASCII art.
If your text contains a considerable number of rare characters, it may be wise to include a character manifest in your Transcriber’s Note, listing each character with a descriptive name, so that readers know what each character should look like and can tell whether it is rendering correctly.
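Producing the manifest itself can be automated. A minimal Python sketch (hypothetical file name):

import unicodedata
with open("mybook-utf-8.txt", encoding="utf-8") as f:
    rare = sorted(set(c for c in f.read() if ord(c) > 127))
for c in rare:
    print(f"U+{ord(c):04X}  {c}  {unicodedata.name(c, 'UNKNOWN')}")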
Guiguts and UTF-8
As soon as you put a non-Latin-1 character (e.g. a Greek letter, the œ ligature, curly quotes) into the file in Guiguts and then save, Guiguts will automatically save the file as UTF-8.