DP Code - Unicode
This page is intended for DP developers and other technical individuals. For a more proofer/PM-centric discussion, see Site conversion to Unicode.
Introduction
Unicode is a standard for uniquely representing (not quite) every character used in human language.
This wiki page covers some of the high-level concepts and challenges for DP to support Unicode.
Unicode vs UTF-8
Unicode and UTF-8 aren't the same thing, although they are often conflated. UTF-8 is one of several methods to encode Unicode "characters" or code points. Others are UTF-16 and UTF-32. UTF-8 is the most commonly used method and the lingua franca encoding used on the internet.
It's mostly semantics, but in this document I've taken care to use the proper term to distinguish Unicode-specific challenges from UTF-8-specific ones.
Unicode Concepts
I strongly encourage developers attempting to implement Unicode to read through Unicode Explained by Jukka Korpela.
For another look at Unicode, see xkcd 1726 and 1953 (don't forget the tooltip).
Multilingual planes
Unicode supports over a million code points. The first 65,536 are in what is called the Basic Multilingual Plane (BMP). These encompass most of the code points (i.e., characters) that DP will want to support, covering scripts like Latin, Greek, Cyrillic, and others.
Other planes support less common code points, like Egyptian hieroglyphs in the Supplementary Multilingual Plane (SMP), less-common Chinese and Japanese code points in the Supplementary Ideographic Plane (SIP), and others. Each plane is composed of 65,536 code points.
Unicode planes are important because of how code points in higher planes are encoded in UTF-8: the higher the code point, the more bytes are required to encode it. This is most relevant because of our underlying database.
This also impacts DP's other middleware like phpBB and the wiki.
UTF-8 encoding
As mentioned, UTF-8 is a way to encode Unicode code points into a byte stream. UTF-8 is optimized for ASCII characters (all 128 of them), meaning it takes only one byte to represent an ASCII character. Any non-ASCII character, such as the rest of Latin-1, requires two or more bytes to encode. This makes UTF-8 space-efficient for strings that are primarily composed of ASCII characters, and less space-efficient for those that aren't.
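For instance, here's a quick illustration of the byte counts (my own sample characters, not DP code; assumes PHP 7+ for \u{} escapes and the mbstring extension):

    <?php
    // Illustrative only: how many bytes UTF-8 needs per code point.
    $samples = [
        'U+0041 ASCII "A"'                => "\u{0041}",
        'U+00E9 Latin-1 é'                => "\u{00E9}",
        'U+20AC BMP euro sign'            => "\u{20AC}",
        'U+13000 SMP Egyptian hieroglyph' => "\u{13000}",
    ];
    foreach ($samples as $desc => $char) {
        // strlen() counts bytes; mb_strlen() counts characters.
        printf("%s: %d byte(s), %d character(s)\n",
               $desc, strlen($char), mb_strlen($char, 'UTF-8'));
    }
    // Prints 1, 2, 3, and 4 bytes respectively; each is one character.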
Byte Order Mark (BOM)
The UTF-16 and UTF-32 encodings use 2- and 4-byte units, respectively, to encode Unicode code points. Because byte order differs depending on the hardware platform, Unicode has the idea of a byte order mark, or BOM. This character (U+FEFF) indicates to the consumer which byte order to use when processing the string.
Because the UTF-8 encoding uses a single byte as its smallest unit, no BOM is required. While a BOM code point is valid in a UTF-8 string, it serves no purpose and its use is discouraged.
To simplify things, DP removes BOMs from all strings upon input and normalization.
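A minimal sketch of that clean-up step, assuming PHP 7+ for the \u{} escape syntax (the actual DP normalization code may differ):

    <?php
    // Hypothetical helper illustrating BOM removal on input.
    // U+FEFF is the BOM code point; in UTF-8 it encodes as EF BB BF.
    function strip_bom(string $text): string
    {
        // Remove a leading BOM, plus any stray BOMs elsewhere in the string.
        return str_replace("\u{FEFF}", '', $text);
    }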
Composition characters
Unicode has the idea of composition characters; that is, characters can be created by combining two Unicode code points. For instance, é could be either U+00E9 (a single precomposed code point) -- or -- U+0065 (e) + U+0301 (combining acute accent). Both are perfectly valid Unicode.
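To see that the two spellings really are distinct strings (illustrative PHP, not DP code):

    <?php
    // Two valid Unicode representations of "é":
    $composed   = "\u{00E9}";          // single precomposed code point
    $decomposed = "\u{0065}\u{0301}";  // "e" + combining acute accent
    var_dump($composed === $decomposed); // bool(false): different byte sequences
    // They render identically, which is why normalization (below) matters.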
Ligatures
Unicode contains ligature characters as well; for example, the fi ligature (U+FB01) represents f + i as a single character.
DP will need to establish best practices on when to use and not use ligatures.
Normalization
Because there are multiple ways to represent a single character, Unicode has the idea of normalization. [1] Normalizing strings is important to provide useful diffs between pages.
There are four normalization forms: two decomposed forms and two composed forms:
- NFD - Canonical decomposition
- NFC - Canonical decomposition, canonical composition
- NFKD - Compatibility decomposition
- NFKC - Compatibility decomposition, canonical composition
NFC is the one we use at DP. It decomposes all composable characters, then recomposes them into canonical form; that is, it replaces each decomposed sequence with a single Unicode character where one exists. This form does not change ligatures.
NFKC does the same thing as NFC, but also decomposes ligatures and other characters with compatibility mappings, so the fi ligature becomes f followed by i. While this might seem useful for normalizing ligatures, it impacts a wide range of other Unicode code points, such as turning the superscript two (U+00B2) into a plain 2, so 4² would be changed to 42.
The World Wide Web Consortium (W3C) favors NFC[2], as should DP.
Normalization functions are available in PHP [3] and JavaScript [4].
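For instance, with PHP's intl extension (JavaScript's equivalent is String.prototype.normalize); expected output is shown in the comments:

    <?php
    // Sketch using the intl extension's Normalizer class.
    $decomposed = "\u{0065}\u{0301}"; // e + combining acute accent

    // NFC recomposes the pair into the single code point U+00E9.
    var_dump(Normalizer::normalize($decomposed, Normalizer::FORM_C) === "\u{00E9}"); // bool(true)

    // NFC leaves the fi ligature (U+FB01) untouched...
    var_dump(Normalizer::normalize("\u{FB01}", Normalizer::FORM_C) === "\u{FB01}"); // bool(true)
    // ...while NFKC decomposes it, and also flattens superscripts:
    var_dump(Normalizer::normalize("\u{FB01}", Normalizer::FORM_KC));  // string(2) "fi"
    var_dump(Normalizer::normalize("4\u{00B2}", Normalizer::FORM_KC)); // string(2) "42"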
Challenges
Web server
Apache is configured to serve pages using a default encoding. PHP pages can override this, but for non-PHP pages, such as word lists, the Apache configuration is what tells the browser which encoding to use. Converting to Unicode will require re-encoding all local documents from Latin-1 to UTF-8 and updating the Apache config accordingly.
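For example, the default could be set with a directive along these lines (exactly where this belongs in our config is something to verify):

    # Serve files as UTF-8 by default rather than ISO-8859-1 (Latin-1).
    AddDefaultCharset UTF-8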
Database
UTF-8 support
MySQL versions before 5.5.3 have only the 'utf8' character set. It supports BMP code points only and uses between 1 and 3 bytes to encode a code point.
MySQL 5.5.3 and later introduced the 'utf8mb4' character set. It supports code points beyond the BMP and uses between 1 and 4 bytes to encode a code point.
https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-conversion.html
PROD and TEST are currently (2018/06/07) running 5.7. We should aim to support 5.5.3 and later with the utf8mb4 character set.
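The connection character set has to match as well; with mysqli that might look like this sketch (host, credentials, and database name are placeholders):

    <?php
    // Placeholder connection details; illustrative only.
    $db = new mysqli('localhost', 'dp_user', 'secret', 'dp_database');
    // Request utf8mb4 on the connection so 4-byte code points survive the
    // round trip between PHP and MySQL (requires MySQL 5.5.3+).
    $db->set_charset('utf8mb4');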
Column sizes
Prior to MySQL 4.1 the size of char() and varchar() fields represented the number of bytes; in 4.1 and later it represents the number of characters. So in MySQL 5.5 varchar(128) can store 128 Unicode characters. This means that we don't have to go through and increase the size of our database columns to accommodate the varying byte size of UTF-8 characters.
Fonts
In order for a user to view Unicode code points, the fonts used by the user must contain glyphs for those code points. Fonts included with modern operating systems usually include a large array of Unicode glyphs. Many other fonts, including DPMono, don't. Even Microsoft's Consolas, which supports more than the base Latin-1 set, is still severely lacking.
It's interesting to note how font fallback works for missing glyphs. This page confirms cpeel's suspicion that it's at least not uncommon for browsers to walk down the remainder of the font-family list when trying to replace a missing glyph. This implies that we should support a way for the code to specify a known font that covers all of the desired code points, so the browser at least falls back to a known glyph.
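A sketch of that idea in CSS (the class name is hypothetical, and DejaVu Sans Mono is just one candidate for a wide-coverage fallback):

    /* Prefer DPMono, but fall back to a font with broad Unicode coverage
       before hitting the browser's generic monospace default. */
    .proofing-text {
        font-family: DPMono, "DejaVu Sans Mono", monospace;
    }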
Some links to research more:
- https://en.wikipedia.org/wiki/Monospace_(Unicode)
- http://savannah.gnu.org/projects/freefont/
- http://ergoemacs.org/emacs/emacs_unicode_fonts.html
- https://dejavu-fonts.github.io/
DP is going to need to rethink fonts in order to support Unicode in the proofreading interface.
Here is a Summary of Considerations for a Proofing Font to support DP Proofing. It was started by summarizing items discussed in this thread. It initially concerns Latin-1 only.
Process & Code
Decisions
- Unicode contains a lot of characters. We are restricting which characters a project supports at the project level.
- To support the broadest set of Unicode characters we're going with MySQL's utf8mb4 character set. This means our new base MySQL version is 5.5.
- We're moving DP wholesale, including the database, in one fell (fail) swoop. It's simply going to be too complex to have the code support both Latin-1 and Unicode concurrently.
Discoveries
- Prior to MySQL 4.1 the size of char() and varchar() fields represented the number of bytes; in 4.1 and later it represents the number of characters (via https://stackoverflow.com/questions/1997540/mysql-varchar-lengths-and-utf-8). So varchar(128) can store 128 Unicode characters, and we don't have to increase our column sizes to accommodate multi-byte UTF-8 characters (see Column sizes above).
- MySQL has the ability to convert an entire table's encoding at one time; see the sketch after this list.
- Pseudo-related: MySQL 5.0.3 increased the maximum size of a varchar() column to 64k, although the actual size is limited by the other columns, as the maximum row size is also 64k.
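A sketch of such a whole-table conversion, with a hypothetical table name and placeholder connection details:

    <?php
    // Placeholder connection and table name; illustrative only.
    $db = new mysqli('localhost', 'dp_user', 'secret', 'dp_database');
    // CONVERT TO CHARACTER SET changes the table's default charset and
    // re-encodes the contents of every character column in one statement.
    $db->query('ALTER TABLE projects
                CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci');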
Tools
- GuiGuts has been updated to allow Bookloupe to be used in place of gutcheck, since gutcheck doesn't support UTF-8.