Character Codes and Compatibility

From DPWiki
Article needs Updating

This article needs to be updated to reflect the latest available information. Discuss changes on the Discussion page or at the Documentation thread in the DP Forums.

Note

It looks like this was originally written to be a resource for Post-processing. The opening is rather confusing, and much of the information is probably available on other pages.

The characters in a text file are encoded as small binary numbers. In order for them to make sense, there must be agreement on how the numbers are to be decoded: agreement, for example, that the number 32 will be decoded as a space, and 33 as an exclamation mark. The many different ways of doing this create widespread confusion.
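As a small illustration of that agreement, the Python sketch below decodes a handful of numbers using the ASCII convention. The variable names are only for illustration.

```python
# Decode a few small numbers using the ASCII agreement.
codes = [72, 105, 32, 33]            # H, i, space, exclamation mark
text = bytes(codes).decode("ascii")
print(repr(text))                    # 'Hi !'

# The mapping also runs the other way: characters back to numbers.
print(ord(" "), ord("!"))            # 32 33
```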

You may find the information below helpful, but if not, or if what's written below makes you think you need to create lots of different text versions, it's best to ask for advice from a PP mentor, an experienced PP/Ver or in the forums. For any specific case, it's usually fairly simple - it just gets complicated to try to describe every circumstance.

To summarise, if your project contains lots of "unusual" characters, e.g. Greek, then you may only need to supply a utf8 version. If it only contains "regular" characters then you may only need to supply a Latin-1 version. Sometimes you may want to supply both versions, for example to provide Latin-1 transliterations of the Greek. Very rarely, you may want to supply an ASCII version.

If you use Guiguts, all that is needed to make a text file a utf8 file is to include at least one utf8 character, e.g. the oe ligature, and save the file. Guiguts detects that utf8 is needed and saves it as such automatically. If you remove all characters that require utf8, then the next time you save, Guiguts will save it as Latin-1 again.
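The same kind of decision can be sketched outside Guiguts. The function below is a hypothetical helper, not Guiguts' own code: it simply tries the narrower encodings first and falls back to UTF-8 when a character does not fit.

```python
def smallest_encoding(text: str) -> str:
    """Return the narrowest of ASCII, Latin-1 and UTF-8 that can hold text."""
    for name in ("ascii", "latin-1"):
        try:
            text.encode(name)        # raises if any character does not fit
            return name
        except UnicodeEncodeError:
            continue
    return "utf-8"

plain = smallest_encoding("plain text")        # only ASCII characters
accented = smallest_encoding("caf\u00e9")      # e-acute fits in Latin-1
ligature = smallest_encoding("Ph\u0153nician") # oe ligature needs UTF-8
print(plain, accented, ligature)
```

A single oe ligature is enough to push the result to UTF-8, which mirrors the behaviour described above.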

When directly uploading to PG, there are tick boxes on the upload form to indicate which text files you are uploading. Just tick Latin-1 or Unicode (utf8) or both, as well as HTML, depending on what you are uploading.

Seven-Bit ASCII

There is one standard encoding on which all common operating systems, web browsers, and text editors agree: the 7-bit ASCII code. It is called that because it is an agreement on the use of the numbers that can be represented in 7 binary bits, 0-127.

A seven-bit binary number can represent any of 128 values, but for technical reasons that made sense in the 1960s, the numbers 0-31 (and 127, the "delete" code) were reserved for control purposes. The remaining 95 values, 32-126, are used to encode the English alphabetics, the digits, and common punctuation. To see all the ASCII values, see Alan Wood's ASCII page.

The PG FAQ says "You should use plain ASCII for straight English texts." Seven-bit ASCII is favored by Project Gutenberg because an etext that uses only 7-bit ASCII can be read on any equipment, any software, anywhere. [**This advice is out of date - it is rarely necessary to submit an ASCII version. Don't bother unless you know what you are doing. If you really think you need to, ask your mentor, or a friendly, more experienced PP/Ver, or post in the No Dumb Questions for PPers thread]

Latin-1

The ISO (International Organization for Standardization, a body that coordinates the work of the national standards organizations of many countries) expanded 7-bit ASCII to create ISO-8859-1, Latin alphabet No. 1, also referred to as Latin-1. The Latin-1 code uses 8-bit numbers, allowing the use of numbers 128-255. However, for tedious technical reasons, the 32 codes 128-159 are reserved for control codes and carry no printable characters.

Latin-1 uses the 96 numbers 160-255 to provide most accented characters needed for western European languages, plus a variety of special symbols. To see the symbols beyond ASCII, see Alan Wood's page.

It is tempting to use Latin-1 because it contains all the accented vowels commonly found in European and North American books. However, the PG FAQ says you should use ISO-8859 when you must, but "also provide a 7-bit plain ASCII version with the accents stripped ... we make a point of always supplying an ASCII version where possible, even if the ASCII version is degraded when compared to the 8-bit original." [** This advice is out of date - PG automatically create an ASCII version if you upload a Latin-1 version so you don't need to. Unless you know what you are doing, don't bother about the ASCII version. If you want more advice, ask your mentor, a PP/Ver or post in the forums]

The regular expression [\x7f-\xff] finds all the Latin-1 special characters in a document. Use it to find accented characters that must be changed to make a 7-bit ASCII version of an etext. To convert a Latin-1 etext to 7-bit ASCII, you use the PG scheme for diacritical markup, described in the PGDP style guide. Many accented characters can be preserved in ASCII form using this markup.
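If you prefer to run this search from a script, a minimal Python sketch follows. It assumes the file has already been read and decoded as Latin-1; the function name and sample text are made up for the example.

```python
import re

# Same character class as the regular expression given above.
LATIN1_SPECIAL = re.compile(r"[\x7f-\xff]")

def find_latin1_specials(text):
    """Return (line number, character) pairs for every match."""
    hits = []
    # Split on "\n" only, so that no character in the searched range
    # is swallowed as a line separator.
    for lineno, line in enumerate(text.split("\n"), start=1):
        for match in LATIN1_SPECIAL.finditer(line):
            hits.append((lineno, match.group()))
    return hits

# "The café was closed." / "Naïve rôle."
sample = "The caf\u00e9 was closed.\nNa\u00efve r\u00f4le."
print(find_latin1_specials(sample))
```

Each hit tells you which line to visit when replacing the character with its diacritical markup.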

Unicode

If you are PPing you should read the information about UTF-8 in the Post-Processing FAQ.

There is no way to get all the characters of the world's languages into a set of 256 numbers. The only solution is to use more bits per character. Unicode is a standard that has assigned numeric codes to well over 100,000 characters, using numbers in the range of zero to about one million (hex 10FFFF). (Guiguts supports only Unicode values up to hex FFFF, or 65,535, a limitation of the Perl toolkit in which Guiguts is written.) The first 128 numbers are the same as the 128 codes of 7-bit ASCII, so an ASCII file is, in fact, a Unicode (UTF-8) file as well!

Each symbol in Unicode has a code number that may be stated either in decimal or hexadecimal, and a standardized, descriptive caption. For example the œ ligature in "Phœnician" has the number 339 decimal (or 153 hex) and the caption "LATIN SMALL LIGATURE OE."
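You can look up the number and caption of any character yourself. The short Python sketch below uses the standard unicodedata module; the variable names are only for illustration.

```python
import unicodedata

ch = "\u0153"                        # the oe ligature
number = ord(ch)
caption = unicodedata.name(ch)

print(number)                        # 339
print(hex(number))                   # 0x153
print(caption)                       # LATIN SMALL LIGATURE OE
```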

The Unicode symbols are assigned to numeric blocks of consecutive numbers. Each block has a name, for example "Latin Extended A" or "Greek Extended."

Numbers greater than 255 won't fit in a single byte. In the most common encoding, called UTF-8, each Unicode character is encoded as a sequence of from one to four bytes. Unicode UTF-8 uses the numbers 128-255 in 2-byte, 3-byte and 4-byte groups to represent characters. As a result, a Unicode text file is not compatible with a Latin-1 text file. Latin-1 uses individual bytes in the range of 128-255 as character codes; UTF-8 uses multiple combinations of 128-255 to represent single characters. When software treats one under the belief that it is the other, wrong special characters are displayed.
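The mismatch is easy to reproduce. The following Python sketch encodes one accented word both ways and then deliberately reads each result with the wrong decoder; the variable names are illustrative.

```python
word = "caf\u00e9"                       # "café"

latin1_bytes = word.encode("latin-1")    # b'caf\xe9'      (4 bytes)
utf8_bytes = word.encode("utf-8")        # b'caf\xc3\xa9'  (5 bytes)

# UTF-8 bytes read as Latin-1: one character turns into two wrong ones.
misread = utf8_bytes.decode("latin-1")
print(misread)                           # cafÃ©

# Latin-1 bytes read as UTF-8: a lone 0xE9 byte is not a valid sequence.
try:
    latin1_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)                        # False
```

Software that is more forgiving than Python's strict decoder typically shows a replacement symbol instead of raising an error.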

PG accepts etexts in UTF-8 coding when the additional characters are necessary to the book. To find out whether a document contains any characters beyond the one-byte ASCII range, search with this regular expression: \P{IsASCII} (follow the letter case exactly). This finds every non-ASCII character, including punctuation such as curly quotes. Using Guiguts you can also list all words containing such characters using the Word Frequency panel.
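Python's built-in re module does not accept the \P{IsASCII} form, so the sketch below uses the character class [^\x00-\x7f], which should match the same characters. The word-listing function is a rough, hypothetical stand-in for a word frequency check, not a copy of what Guiguts does.

```python
import re

# Anything outside the 7-bit ASCII range 0-127.
NON_ASCII = re.compile(r"[^\x00-\x7f]")

def non_ascii_words(text):
    """Return the set of whitespace-separated words holding a non-ASCII character."""
    return {word for word in text.split() if NON_ASCII.search(word)}

# Contains an oe ligature, an i-diaeresis and a pair of curly quotes.
sample = "The Ph\u0153nician na\u00efve \u201cquote\u201d test"
found = non_ascii_words(sample)
print(sorted(found))
```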

Further discussion of Unicode file formats is in Should I save in Unicode or UTF-8?.

HTML Character Entities

Character entities are only for use in HTML files. They consist of a sequence of ASCII letters that tell the browser to display a particular character. An entity always starts with the ampersand (&) and ends with the semicolon. For example, &frac14; is the entity for the character ¼. To see a list of all defined entities see the W3 standard page.

You cannot use HTML entities in the text file; a reader would not understand &frac14;. You do use entities in an HTML file so that the file itself, bookname.html, is a 7-bit ASCII file, yet the browser can display accented, Greek or mathematical symbols. (Previous versions of Guiguts converted all Latin-1 and Unicode symbols to entities automatically during HTML conversion.)

Besides the entities with names, like &frac14; (¼) and &pi; (π), you can specify any Unicode symbol as an entity by writing an ampersand, a number sign, the decimal code value, and a semicolon. For example, 1044 is the Unicode number for the Cyrillic letter DE, so &#1044; is the HTML entity for it (Д). Of course, just because you command a character doesn't mean it will appear (cf. Hotspur, "but will they come when you do call for them?"). If the font in use lacks that character, it will display as a blank or perhaps an open square.
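To check what an entity stands for, or to produce numeric entities for a piece of text, Python's standard library is enough. This is a minimal sketch; to_numeric_entities is a hypothetical helper name, not part of any DP or PG tool.

```python
import html

quarter = html.unescape("&frac14;")      # named entity
pi = html.unescape("&pi;")               # named entity
de = html.unescape("&#1044;")            # numeric entity, decimal code
print(quarter, pi, de)

def to_numeric_entities(text):
    """Replace every non-ASCII character with a decimal numeric entity."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

converted = to_numeric_entities("Ph\u0153nician")
print(converted)                         # Ph&#339;nician
```

The result of to_numeric_entities is pure 7-bit ASCII, which is what makes entities useful in an HTML file.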

The heading of an HTML or XHTML document is supposed to specify its character encoding. This is usually done with the following statement in the head section: <META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> Specifying charset=ISO-8859-1 tells the browser that the document might contain the full Latin-1 set, which most browsers support.

Windows Special Characters

Personal computers have always used the 8-bit byte as the natural unit of storage. When PCs were new, the 7-bit ASCII standard predominated, but different authorities made different choices about how to use the other 128 codes a byte permits.

As Windows was being developed, a series of 8-bit character encodings (called "code pages" by Microsoft) were defined for its use. They were based on the ISO 8859 family of encodings, but included extra characters in code points that the ISO had reserved for control characters. The corresponding encoding for ISO Latin-1 is called Windows CP 1252. Its numbers 0-127 are the same as 7-bit ASCII, but the numbers 128-159 were assigned to other symbols like trademark, bullet, and endash. (For a detailed critique of the Windows character set see this page.)

If you are using Windows, you can easily enter a symbol like a curved-double-quote that is not in Latin-1. To do so makes your document incompatible with Latin-1 and with Unicode. It is difficult to find these non-Latin-1 symbols manually. (Using Guiguts, the menu command Fixup> Convert Windows CP 1252 characters to Unicode changes all Windows-unique codes to their Unicode equivalents.)
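The effect of such a conversion can be sketched in Python. This is not how Guiguts implements its menu command; it only shows how CP 1252 bytes in the 128-159 range map to real Unicode characters, using made-up sample bytes.

```python
# Bytes as a Windows editor might save them: curly quotes, en dash, ellipsis.
raw = b"\x93Hello\x94 \x96 said the clerk\x85"

# Decoded as CP 1252, the bytes 128-159 become proper Unicode characters.
text = raw.decode("cp1252")
print(text)

# Decoded as Latin-1, the same bytes become invisible control codes.
as_latin1 = raw.decode("latin-1")
print(repr(as_latin1[0]))                # '\x93'

# Once decoded correctly, the text can be saved safely as UTF-8.
utf8_bytes = text.encode("utf-8")
```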

MacRoman

Early in the history of the Mac, Apple defined its own 8-bit set of 223 characters. The "MacRoman" code includes 7-bit ASCII but uses the codes 128-255 for other symbols—a different selection of symbols than MS-DOS used, but equally incompatible with Latin-1 or Unicode.

(This explains why special characters are jumbled in email between Windows and Mac users. The Windows mailer assumes Windows codes, and the Mac mailer assumes MacRoman codes.)
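The jumbling can be reproduced with Python's mac_roman and cp1252 codecs. The sketch below encodes the same word on one system and decodes it with the other system's assumption; the variable names are illustrative.

```python
word = "caf\u00e9"                       # "café"

# Written on a Mac as MacRoman, then read on Windows as CP 1252.
mac_bytes = word.encode("mac_roman")     # e-acute is byte 0x8E in MacRoman
mac_to_win = mac_bytes.decode("cp1252")

# Written on Windows as CP 1252, then read on a Mac as MacRoman.
win_bytes = word.encode("cp1252")        # e-acute is byte 0xE9 in CP 1252
win_to_mac = win_bytes.decode("mac_roman")

print(mac_to_win, win_to_mac)
```

In both directions the plain ASCII letters survive and only the accented letter changes, which is why such damage is easy to overlook.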

A Mac user who edits a text using a tool other than Guiguts (which never uses Windows or MacRoman codes) must take care not to pollute the document with special characters that appear correct but are coded in the MacRoman set.

TextEdit is the default application if you double-click a file of type "txt." It allows MacRoman by default, but you can make it safe. Open TextEdit> Preferences. Select the "Open and Save" button. Set the "Plain Text Encoding" for both Open and Save to the choice, "Western (ISO Latin 1)." (If this choice is not at first available in the pop-up menu, select "Customize Encodings List" from the end of the menu and enable the Latin-1 choice in the list of all encodings.) Also set "Western (ISO Latin 1)" as the encoding for saved HTML files.

BBEdit can use any code set, but you must tell it which to use. Open BBEdit> Preferences and select the page named "Text Files: Opening." Set the preference "If File's Encoding Can't Be Guessed, use: Western (ISO Latin 1)." Go to the page "Text Files: Saving" and set "Default Text Encoding: Western (ISO Latin 1)." Then, before you save any PG document, pull down the File Options menu in the document header (the icon is a tiny page symbol) and make sure that "Encoding: Western (ISO Latin 1)" is set.