Character Codes and Compatibility
Every computer file is made up of zeroes and ones. The character encoding is the system the computer uses to translate those zeroes and ones into readable text.
This used to be a complicated area, with different apps and operating systems encoding text in lots of different ways. These days Unicode has made things simpler, although there are still a few issues to be aware of.
For more general information on character encoding and file formats, see the Projet Gutenberg File Format document.
Unicode
Unicode is a unified system of character encoding that is meant to be able to represent almost any character. The latest version, released in 2024, includes encoding for more than 150,000 characters in more than 150 different scripts. Unicode is a widely accepted standard and almost every modern computer, operating system and application supports it.
Unicode is the current standard for Distributed Proofreaders and for Project Gutenberg, and every file submitted to Project Gutenberg should be a Unicode text file. Older formates, like ASCII and Latin-1, should not be used.
Unicode is a standard, but even within Unicode there are some choices that need to be made. Post-processors should ensure that any files submitted to Project Gutenberg should be encoded as UTF-8 text files with no BOM and CR/LF line endings. Each is explained below
UTF-8
There are actually three types of Unicode file: UTF-8, UTF-16, and UTF-32. These all use the same Unicode character set; the only difference is in the way they are encoded. UTF-8 is by far the most common of the three Unicode encodings. The others are used only in some specialized applications.
Any software you use to create text files will support UTF-8. The only thing you need to know is that if a "Save As" menu gives you different options, always choose "UTF-8".
Some software may give a confusing choice of saving as either "Unicode" or "UTF-8." If so, always choose UTF-8; "Unicode" may mean UTF-16.
BOM
BOM stands for "Byte Order Mark." This is a special character used in some Unicode applications for highly technical reasons. However, when software, including Project Gutenberg's, isn't expecting it, it can cause problems. All you need to know as a DP volunteer is that if you're given the option, be sure to save any postprocessing files as "No BOM."
Line Endings
For historical reasons, there are multiple ways to signal the end of a line of text in a text file. Some use the "Carriage Return" character, some use the "Line Feed" character, and some use both. You will often see these abbreviated as "CR", "LF", and "CR/LF".
For submitting to Project Gutenberg, use CR/LF encoding. This is the default with most Windows software, but Unix/Linux and MacOS users may need to take an extra step when saving to make sure they are saving with CR/LF.
Non-Unicode Encodings at Project Gutenberg
There are a few other encodings that may be used at Project Gutenberg for specialized types of document, such as musical notation or complex math. However, these should only be used in unusual circumstances.
Older Character Encodings
This is a brief overview of some of the encodings that were commonly used before Unicode became widely adopted.
Seven-Bit ASCII
Before Unicode was created, the most standard encoding was the 7-bit ASCII code. It is called that because it is an agreement on the use of the numbers that can be represented in 7 binary bits, 0-127.
A seven-bit binary number can represent any of 128 values, but for technical reasons that made sense in the 1950s, the numbers 0-31 were reserved for control purposes. The remaining 96 values, 32-127, were used to encode the English alphabet and common punctuation. To see all the ASCII values, see Alan Wood's ASCII page.
In the early days of Project Gutenberg, the rule was that any text that could be represented with only ASCII characters should be encoded as a seven-bit ASCII file. This made sense before Unicode, when ASCII was the only widely compatible file format. However, now that Unicode has achieved almost universal support, even files containing only basic ASCII characters should be submitted as Unicode text.
ASCII does survive in one sense, though. The first 128 characters of Unicode are the original ASCII character set.
Latin-1
ASCII worked well for English, but it did not provide a way to include characters with accents or diacriticals, meaning many European languages couldn't be encoded in ASCII. To solve this problem, the ISO (International Organization for Standards) expanded 7-bit ASCII to create ISO-8859-1, Latin alphabet No. 1, usually referred to as Latin-1. The Latin-1 code uses 8-bit numbers, doubling the number of characters that can be encoded from 128 to 256, with codes 128-159 skipped for tedious technical reasons.
The extra characters provided accented characters needed for many western European languages, plus a variety of special symbols. To see the symbols beyond ASCII, see Alan Wood's page.
Latin-1 was formerly used at Project Gutenberg for texts that needed characters beyond the simple English alphabet. However, all such texts should now be submitted in Unicode format.
Platform-Specific Encodings
Before Unicode, or even Latin-1, became standard, operating systems created their own 8-bit encodings, allowing them to work with languages other than English. Windows had an encoding called Windows CP 1252; Macs had Mac OS Roman.
Today, modern versions of both operating systems and almost all text editors and other applications use UTF-8 by default. You should not encounter these older encodings unless you are dealing with very old operating systems or very old files. If you do, most text editors can convert to UTF-8, and there is a command in the Guiguts Fixup menu to convert Windows CP 1252 to UTF-8.
Older Encodings for Non-English Texts
Before Unicode, languages that uses non-Roman alphabets, like Chinese, Arabic, and Hindi, could not be written in ASCII, and computers used individual and often conflicting standards for these files. In most cases, these texts can and should now be encoded in UTF-8.
HTML Character Entities
There is one more type of character encoding that is relevant to postprocessing.
Character entities are only for use in HTML files. They consist of a sequence of letters that tell the browser to display a particular character. An entity always starts with the ampersand (&) and ends with the semicolon. For example, ¼ is the entity for the character ¼. To see a list of all defined entities see the W3 standard page.
You cannot use HTML entities in a text file; a reader would not understand ¼. You do use entities in an HTML file so that the file itself, bookname.html, is a 7-bit ASCII file, yet the browser can display accented, Greek or mathematical symbols. (Previous versions of Guiguts converted all Latin-1 and Unicode symbols to entities automatically during HTML conversion.)
Besides the entities with names, like ¼ (¼) and π (π), you can specify any Unicode symbol as an entity by writing an ampersand, a number sign, and the decimal code value. For example, 1044 is the Unicode number for the Cyrillic letter DE, so Д is the HTML entity for it (Д). Of course, just because you command a character doesn't mean it will appear (cf Falstaff, "but will they come when you do call for them?") If the font in use lacks that character, it will display as a blank or perhaps an open square.
The heading of an HTML or XHTML document is supposed to specify its character encoding. If nothing is specified, the default for HTML5 is UTF-8; to specify a different encoding, you will need to specify this in the head section. For example: <META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> Specifying charset=ISO-8859-1 tells the browser that the document might contain the full Latin-1 set, which most browsers support.