PPing with utf-8

From DPWiki
Content of this page is out of date. Please review the Post-Processing FAQ for the newest information.

This page gives some information on utf-8 (sometimes written utf8, or sometimes referred to loosely as Unicode) and why you might want to use it when PPing. It is basically some information I sent to someone, who found it helpful and asked me to add it to the wiki. I hope you find it helpful too.

There are a couple of other wiki pages that mention utf-8: Character Codes and Compatibility and UTF-8.

HTML files

First, to clarify about HTML. The HTML source file that you create can only contain characters from the declared charset (near the top of the file). You've probably only used iso-8859-1 before (which includes A to Z, 0 to 9, most of the other characters you can type on your keyboard and a few fairly common symbols), and when you wanted a special character, like an em dash, you've put something like &amp;mdash; which the reader's browser will then display as "—". This can be done for any "unusual" character, either using a named ampersand code like &amp;mdash; or using a number, like &amp;#333; (which represents a small o with macron: "ō").
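To make the named and numeric forms concrete, here is a minimal Python sketch. It is not part of any DP or PG tool; the helper name to_numeric_entities is made up for this illustration. It relies only on the standard html module and on the fact that a decimal numeric code is simply the character's Unicode code point.

```python
import html

def to_numeric_entities(text: str) -> str:
    """Replace every non-ASCII character with a decimal numeric code.

    Hypothetical helper for illustration only.
    """
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)

sample = "sh\u014dgun \u2014 a title"   # contains o-macron and an em dash
encoded = to_numeric_entities(sample)
print(encoded)  # sh&#333;gun &#8212; a title

# html.unescape does what a browser does: it turns codes back into characters.
print(html.unescape(encoded) == sample)      # True
print(html.unescape("&mdash;") == "\u2014")  # True
```

Run over a whole file, something like this would give you pure-ASCII HTML source that still displays the special characters correctly.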

However, there is an alternative for HTML, which is to declare the charset to be utf-8, and then to include the actual characters, e.g. "—" and "ō". This has the advantage of making the source more readable, especially if there is a lot of Greek, or many characters with macrons, etc.
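As a rough illustration of the utf-8 alternative, the Python sketch below writes a tiny HTML page with the characters typed directly and then checks the bytes on disk. The file name sample.html and the function write_utf8_page are invented for the example, and the short meta charset line is the HTML5 form; your own project's header may declare the charset differently.

```python
from pathlib import Path
import tempfile

# A minimal page: the charset declaration sits near the top of the file,
# and the special characters are typed directly instead of as codes.
PAGE = """<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Sample</title>
</head>
<body>
<p>shōgun — typed directly, no ampersand codes needed</p>
</body>
</html>
"""

def write_utf8_page(path: Path) -> bytes:
    """Save PAGE as utf-8 and return the raw bytes that were written."""
    path.write_text(PAGE, encoding="utf-8")
    return path.read_bytes()

with tempfile.TemporaryDirectory() as tmp:
    raw = write_utf8_page(Path(tmp) / "sample.html")

# In utf-8 the em dash is stored as three bytes and the o-macron as two.
print(b"\xe2\x80\x94" in raw)  # True
print(b"\xc5\x8d" in raw)      # True
```

The important point is that the declared charset and the encoding the file is actually saved in must agree, otherwise the browser will show the wrong characters.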

Text files

Now to text files. PG wants UTF-8 files, so that is what we supply. For the plain text version, it is common to use double hyphens (--) rather than em dashes, though em dashes are acceptable.

The utf-8 text file is a bit like the HTML with a utf-8 charset described above: you just put in it whatever characters you want, e.g. "—" and "ō". It doesn't need a special header like the HTML file did though. When you save the file (if you use Guiguts), it automatically detects that the file needs saving as utf-8 and does that for you.

Guiguts and UTF8

As soon as you put a non-Latin-1 character (e.g. a Greek letter, the oe ligature, curly quotes, etc.) into the file in Guiguts and then save, Guiguts will automatically save the file as utf-8.
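If you want to check for yourself whether a piece of text would trigger this, the following Python sketch shows one way to do it. It is only an illustration of the idea and makes no claim about how Guiguts decides internally; the function name needs_utf8 is made up for the example.

```python
def needs_utf8(text: str) -> bool:
    """Return True if text contains any character outside Latin-1.

    Hypothetical helper: it simply tries to encode the text as Latin-1
    and reports whether that fails.
    """
    try:
        text.encode("latin-1")
    except UnicodeEncodeError:
        return True
    return False

print(needs_utf8("caf\u00e9 and na\u00efve"))  # False: accented letters in Latin-1
print(needs_utf8("\u201ccurly quotes\u201d"))  # True
print(needs_utf8("\u0153uvre"))                # True: the oe ligature
```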

Other information

In all this, you might find the regexp [^\x00-\xff] useful. It finds any characters that are outside the Latin-1 range, i.e. the ones that would need converting from utf-8 to a Latin-1 representation.
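The same regexp also works outside your editor. Here is a small Python sketch that lists each character it finds together with its position and code point; the function name find_non_latin1 is invented for the example.

```python
import re

# Same pattern as above: anything outside the range U+0000 to U+00FF.
NON_LATIN1 = re.compile(r"[^\x00-\xff]")

def find_non_latin1(text: str) -> list[tuple[int, str, str]]:
    """Return (position, character, code point) for each match.

    Hypothetical helper for illustration only.
    """
    return [
        (m.start(), m.group(), f"U+{ord(m.group()):04X}")
        for m in NON_LATIN1.finditer(text)
    ]

for position, char, code in find_non_latin1("caf\u00e9 \u2014 \u014d"):
    print(position, char, code)
# 5 — U+2014
# 7 ō U+014D
```

Note that the é is not reported, because it is inside the Latin-1 range.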

When you (or your PPVer) directly upload to PG, zip both text files together with your HTML file and images folder. Then tick the Latin-1 and HTML boxes as usual, and tick the Unicode (UTF8) box as well. Then the WWers will know to look out for it.