UTF-8

UTF-8 is a widely used, standardized method of encoding Unicode characters as a sequence of bytes (also called octets: numbers between 0 and 255, inclusive). One benefit is that the first 128 Unicode characters are encoded exactly as they are in ASCII, so any ASCII text is also valid UTF-8.
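The ASCII compatibility described above can be checked with a short Python sketch. The sample strings are arbitrary illustrations, not taken from any DP project.

```python
# Sample text chosen only for illustration.
text = "Proof"

# Pure ASCII text produces identical bytes under both encodings.
utf8_bytes = text.encode("utf-8")
ascii_bytes = text.encode("ascii")
print(utf8_bytes == ascii_bytes)  # True

# A non-ASCII character needs more than one byte in UTF-8.
non_ascii = "é".encode("utf-8")
print(list(non_ascii))  # [195, 169]
```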

Encoding details

Byte values from 0 to 127, inclusive, represent the usual ASCII characters. Byte values from 128 to 191, inclusive, are continuation bytes: each carries a block of 6 bits from a larger Unicode code number. Byte values 192 and above are prefix bytes: each one determines how many 6-bit blocks follow and also holds the first few bits of the code number. In valid UTF-8 only the values 194 to 244 actually occur as prefixes; 192, 193, and 245 to 255 are never used.
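As a rough illustration of this byte layout, here is a minimal Python sketch that builds the UTF-8 bytes for a single code number by hand. The function name `encode_code_point` is made up for this example; in practice Python's built-in `str.encode("utf-8")` does the same job.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code number as UTF-8 by hand (illustrative only)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code number")
    if cp < 0x80:
        # 0 to 127: a single byte, identical to ASCII.
        return bytes([cp])
    if cp < 0x800:
        # Prefix 110xxxxx (192 + top 5 bits), then one 6-bit block.
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # Prefix 1110xxxx (224 + top 4 bits), then two 6-bit blocks.
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # Prefix 11110xxx (240 + top 3 bits), then three 6-bit blocks.
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# The euro sign (code number 0x20AC) takes three bytes.
print(list(encode_code_point(0x20AC)))  # [226, 130, 172]
```

Every byte after the prefix falls in the 128 to 191 range, which is what lets a decoder tell prefix bytes from continuation bytes.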

Incidentally, Latin-1 characters with Unicode numbers from 128 to 191, inclusive, are encoded as a byte with value 194 followed by (the code of) the character itself; Latin-1 characters from 192 to 255 are encoded as 195 followed by the character code minus 64.
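The two-byte rule for the upper half of Latin-1 (prefix byte 194 for codes 128 to 191, prefix byte 195 for codes 192 to 255) can be verified against Python's built-in encoder. This is only a sketch; `encode_latin1_char` is a hypothetical helper written for this check.

```python
def encode_latin1_char(code: int) -> bytes:
    """Apply the Latin-1 shortcut rule for codes 128 to 255 (illustrative only)."""
    if not 128 <= code <= 255:
        raise ValueError("rule only covers Latin-1 codes 128 to 255")
    if code <= 191:
        # Prefix 194, then the character code itself.
        return bytes([194, code])
    # Prefix 195, then the character code minus 64.
    return bytes([195, code - 64])


# Compare the shortcut with the real encoder for every code in range.
mismatches = [
    code for code in range(128, 256)
    if encode_latin1_char(code) != chr(code).encode("utf-8")
]
print(mismatches)  # []
```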

For information about Post-Processing and UTF-8, please review the Post-Processing FAQ.
