Should I save in Unicode or UTF-8?

From DPWiki

What difference does it make if you save a text as Unicode or as UTF-8?

As a save option, "Unicode" generally means UTF-16, which isn't handled in as many places as UTF-8 is. The difference is that UTF-8 encodes ASCII characters exactly as ASCII does, one byte each, and so can be handled with few modifications by programs designed to take ASCII or Latin-1. It takes 2-3 bytes for most other characters, and 4 bytes for a few (the "supplementary" or, more informally, "astral" characters).

UTF-16 takes up two bytes for all characters except for astral characters, where it takes 4.
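
The byte counts above can be checked with a short Python sketch. The sample characters and the helper name encoded_sizes are illustrative choices, not part of any DP tool.

```python
# Illustrative sample characters (an assumption, chosen to cover each size class):
# "A" (ASCII), e-acute (Latin-1), euro sign, and musical G clef (astral).
samples = ["A", "\u00e9", "\u20ac", "\U0001d11e"]


def encoded_sizes(ch: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    # "utf-16-le" is used because plain "utf-16" prepends a 2-byte
    # byte-order mark, which would inflate the count for a single character.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for ch in samples:
    utf8, utf16 = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {utf8} bytes, UTF-16 {utf16} bytes")
```

The ASCII letter takes 1 byte in UTF-8 and 2 in UTF-16, while the astral character takes 4 bytes in both.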

Once compressed, the size makes little difference. One test showed it didn't matter whether you encoded the characters in ASCII as themselves (e.g., A) or by spelling out their full Unicode name in ASCII (e.g., LATIN CAPITAL LETTER A); when using a modern compressor (like Bzip2 or Rar), it was about the same size at the end, and even older compressors like zip (or gzip) aren't affected much. So use UTF-8 no matter what the content of the text is.
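
For anyone who wants to try a comparison of this kind on their own text, here is a minimal sketch. It is not the original test: the helper names spell_out and sizes are hypothetical, it uses Python's bz2 module as the compressor, and the numbers will depend heavily on the text you feed it.

```python
import bz2
import unicodedata


def spell_out(text: str) -> str:
    """Replace each character with its Unicode name, one name per line."""
    # Characters without a name (e.g. control characters such as newline)
    # fall back to a U+XXXX code, an assumption made for this sketch.
    return "\n".join(unicodedata.name(ch, f"U+{ord(ch):04X}") for ch in text)


def sizes(text: str) -> dict[str, int]:
    """Return raw and bzip2-compressed byte counts for both representations."""
    plain = text.encode("utf-8")
    named = spell_out(text).encode("utf-8")
    return {
        "plain_raw": len(plain),
        "plain_bz2": len(bz2.compress(plain)),
        "named_raw": len(named),
        "named_bz2": len(bz2.compress(named)),
    }


if __name__ == "__main__":
    # Placeholder sample; a full book text gives a more meaningful comparison.
    sample = "The quick brown fox jumps over the lazy dog. " * 200
    for label, size in sizes(sample).items():
        print(f"{label}: {size} bytes")
```

A short repeated sentence like the placeholder compresses unusually well, so run it on a complete text before drawing conclusions.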

Project Gutenberg, to which we submit our ebooks for distribution, uses UTF-8.