Site conversion to Unicode

Note

This article is no longer maintained following the roll-out of the UTF-8 version of the site on 19 May 2020. The sections that remain relevant will be copied to the DP Official Documentation.

Background

Since its inception in 2000, Distributed Proofreaders has only supported the Latin-1 character set -- both the web site itself and all of the projects. After years of discussion and over 2 years of development & testing, we rolled out Unicode support on May 19th, 2020.

See all of the related topics attached to Task 1791 for many of the historical discussions on UTF-8 support dating back to January 2004.

Why change to Unicode & UTF-8?

  • Unicode allows the use of characters not available in the ISO-8859-1 encoding, ultimately including selected non-Latin scripts that we cannot currently use directly, such as Greek, Cyrillic, and Hebrew.
  • UTF-8 is the most widely used of all the Unicode encodings and, where most of the text is ASCII, is more compact for storage than UTF-16 or UTF-32.
  • UTF-8 provides Post-Processors with text that does not need to be converted from ISO-8859-1, a conversion that can involve returning transliterated text to its original form.
  • It makes it easier for sites that commonly need non-ISO-8859-1 characters to install and use our code.
  • The DPF Board voted several years ago to urge moving the site to UTF-8.
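
The storage claim above can be checked with a few lines of Python. This is only an illustrative sketch: the function name encoded_sizes is made up for this example, and the little-endian codec variants are used so that no byte-order mark is counted.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes the text needs in each Unicode encoding."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),   # 2 bytes per BMP character
        "utf-32": len(text.encode("utf-32-le")),   # always 4 bytes per character
    }


# 19 ASCII characters: UTF-8 needs one byte each.
print(encoded_sizes("The quick brown fox"))
# A Latin-1 letter such as é costs two bytes in UTF-8, still no more than UTF-16.
print(encoded_sizes("café"))
```

For mostly-ASCII page text, UTF-8 therefore stays close to one byte per character, while UTF-16 and UTF-32 double or quadruple the size.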

What does this mean for me?

Proofreaders & Formatters

For proofreaders and formatters, very little will change. The proofreading interface will look and operate almost identically to before, with two small changes:

  1. If you enter a Unicode character that is not supported by the project, you are notified when you attempt to run WordCheck or save the page. You can then remove the character yourself, or continue and have the system remove it. This ensures that only valid characters are saved with a page. All of the characters in the character picker are valid for the project.
    • In IE11, the page text is not normalized before it is validated against the project's character suites. As a result, text that appears valid may be rejected when it uses combining characters that would normalize to a single character. For example, if you enter É from the character picker it will be fine, but if you enter a plain E followed by a combining acute accent (U+0301), it will look the same on the page but IE11 will flag it as a bad character.
  2. A new feature in the proofreading interface is automatic diacritical markup conversion. When you use diacritical markup, as soon as you type the closing ], the markup is automatically converted into the character itself, provided that character is valid in the project. This also works for a few ligatures. You can try it by typing [oe] or [:a]! Existing markup in existing projects is not expected to be converted to Unicode characters, so don't feel obligated to change markup from prior rounds.
    • In IE11, this function only works for the æ and œ ligatures.
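
The IE11 normalization issue described above can be reproduced outside the browser. The sketch below uses Python's unicodedata module rather than the site's JavaScript; the function is_valid_for_suite and the one-character suite are hypothetical and exist only for illustration.

```python
import unicodedata

precomposed = "\u00c9"   # É as a single code point
decomposed = "E\u0301"   # plain E followed by a combining acute accent


def is_valid_for_suite(text, suite, normalize=True):
    """Return True if every character of the text is in the suite."""
    if normalize:
        # NFC combines E + U+0301 into the single code point U+00C9.
        text = unicodedata.normalize("NFC", text)
    return all(ch in suite for ch in text)


suite = {precomposed}  # hypothetical suite containing only É

# With normalization (modern browsers) the decomposed form is accepted.
print(is_valid_for_suite(decomposed, suite))
# Without normalization (the IE11 case) the same text is rejected.
print(is_valid_for_suite(decomposed, suite, normalize=False))
```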

Project Managers

While the site is fully Unicode-capable, we are limiting the supported characters to those in the Basic Latin and Latin-1 Supplement Unicode blocks. This is very close to the same set of Latin-1 characters we have always supported, with the addition of Œ œ Š š Ž ž Ÿ ‹ ›, and gives us a chance to validate core site functionality with minimal changes elsewhere (note that Š š Ž ž and the micro sign µ remain in the character suite, but are not present in a picker set). These characters make up a character suite called Basic Latin, and all new and existing projects use this character suite. You can view the list of characters in the Basic Latin character suite.

We have additional character suites ready to enable after we are comfortable with the code and have the right processes and documentation in place. You can see them on the character suites page.

Character Enforcement

To ensure that projects only ever contain supported characters, we enforce the character restrictions in two places:

  1. Project load (add_files)
  2. Page save (saving in proofreading interface)

During project load, the page text encoding is guessed from one of UTF-8, UTF-16, UTF-32, Windows-1252, or ISO-8859-1, and the text is converted to UTF-8. Any Byte-Order Mark (BOM) is silently removed. The text is then normalized to NFC. Next, a set of standard character transforms converts some Unicode characters to their preferred ASCII equivalents (see #Unicode_to_ASCII_mapping). Finally, any remaining characters that are not in the project's supported character suites are removed. The page load interface (add_files) detects these changes and warns, page by page, about any changes that will be made. At that point, the Project Manager (PM) may choose to abort the load and fix the rejected characters themselves, or allow the load script to remove them entirely.
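
The load steps above can be sketched roughly as follows. This is not the site's PHP code: the guessing order (BOM check, then UTF-8, then Windows-1252, then ISO-8859-1 as a catch-all) is an assumption, and TRANSFORMS and SUITE are abbreviated, hypothetical stand-ins for the real transform list and character suite.

```python
import codecs
import unicodedata

# Hypothetical, abbreviated stand-ins for the real transform list and suite.
TRANSFORMS = {"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'", "\u2014": "--"}
SUITE = {chr(c) for c in range(0x20, 0x7F)} | {chr(c) for c in range(0xA1, 0x100)} | {"\n"}


def decode_page(raw: bytes) -> str:
    """Guess the encoding of a page and return it as text with any BOM removed."""
    # UTF-32 is checked first because its LE BOM begins with the UTF-16 LE BOM.
    for bom, encoding in (
        (codecs.BOM_UTF32_LE, "utf-32"), (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF16_LE, "utf-16"), (codecs.BOM_UTF16_BE, "utf-16"),
    ):
        if raw.startswith(bom):
            return raw.decode(encoding)  # these codecs consume the BOM
    for encoding in ("utf-8-sig", "cp1252"):  # utf-8-sig drops a UTF-8 BOM
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode("iso-8859-1")  # accepts every byte sequence


def prepare_page(raw: bytes, suite=SUITE, transforms=TRANSFORMS):
    """Return (cleaned_text, removed_characters) for one page."""
    text = unicodedata.normalize("NFC", decode_page(raw))
    text = text.translate(str.maketrans(transforms))
    removed = sorted({ch for ch in text if ch not in suite})
    cleaned = "".join(ch for ch in text if ch in suite)
    return cleaned, removed


raw = "\ufeff\u201cCafe\u0301\u201d \u2603".encode("utf-8")
cleaned, removed = prepare_page(raw)
print(repr(cleaned), removed)
```

The removed list plays the role of the per-page warning: a PM could inspect it before deciding whether to abort the load or let the characters be dropped.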

While the code does its best to guess the file encoding, CPs and PMs would be best served by providing OCR files in UTF-8.

In the proofreading interface, before every page save, JavaScript runs to validate that all characters being saved are in the project's supported character suites. If not, an interstitial page is shown highlighting the unsupported characters for the user to go back and correct or remove.

The code removes any unsupported characters in the text before saving it to the database, regardless of the code path, to ensure that the page text being saved always adheres to the project's supported character suites.
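
The two layers of enforcement at page save can be illustrated with a pair of small functions, one that reports unsupported characters (as the interstitial page does) and one that strips them (as the server does). Both function names and the SUITE set are hypothetical, and the real checks run in JavaScript and PHP rather than Python.

```python
# Hypothetical stand-in for a project's character suite.
SUITE = {chr(c) for c in range(0x20, 0x7F)} | {chr(c) for c in range(0xA1, 0x100)} | {"\n"}


def find_unsupported(text, suite=SUITE):
    """Return (offset, character) pairs for characters outside the suite."""
    return [(i, ch) for i, ch in enumerate(text) if ch not in suite]


def strip_unsupported(text, suite=SUITE):
    """Return the text with every character outside the suite removed."""
    return "".join(ch for ch in text if ch in suite)


page = "na\u00efve \u03b1"          # ï is in the suite, Greek alpha is not
print(find_unsupported(page))       # what the proofreader would be shown
print(repr(strip_unsupported(page)))  # what would reach the database
```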

FAQ

Everyone

How does pgdp.net's Unicode support differ from pgdpcanada.net?
DP Canada does not restrict the set of Unicode characters that are valid for a project, whereas this site does.
Why restrict characters for a project?
One of the advantages to the site being Latin-1 was that there was a very small set of characters that could end up on a project page and all characters had a fairly unique appearance. Unicode defines well over a hundred thousand characters, many of which look very similar but are actually different. Restricting projects to a subset of Unicode characters provides some guide rails for proofreaders, PMs, and PPs and helps prevent incorrect characters from being used in place of correct ones.
Who made decision X?
There were thousands of decisions made during the conversion to support Unicode. Not everyone will like all of them. Some were made by developers after consultation with squirrels and other stakeholders. Many were made after discussion with the community in numerous forum threads.
Why is Internet Explorer 11 not fully supported?
IE11 does not implement the String.prototype.normalize() function, which is used for various validation functions and the automatic diacritical markup conversion. IE11 users are encouraged to change to a more modern browser, such as Microsoft Edge, Firefox, Google Chrome, or Opera.

Proofreaders & Formatters

Why does the diacritical markup conversion appear to work inconsistently?
The diacritical markup conversion is triggered by typing in the closing ]. If you add the [] using the button, or add the [ last, the conversion will not kick in.
Diacritical markup is only converted to Unicode characters if those characters are part of the project's character suite. Not all characters are part of every project's character suites and it's perfectly OK for diacritical markup to remain on pages.
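
The suite check described above can be sketched as follows. Only [:a] and [oe] come from this article; the other entries in COMBINING_ABOVE and LIGATURES are assumed for illustration, and the function converts a whole string at once instead of reacting to the closing ] as it is typed.

```python
import re
import unicodedata

# Assumed subset of the markup; only [:a] and [oe] are taken from the article.
COMBINING_ABOVE = {":": "\u0308", "'": "\u0301", "`": "\u0300", "^": "\u0302", "~": "\u0303"}
LIGATURES = {"ae": "\u00e6", "AE": "\u00c6", "oe": "\u0153", "OE": "\u0152"}

MARKUP = re.compile(r"\[([^\[\]\s]{2})\]")  # two characters inside brackets


def convert_markup(text, suite):
    """Replace markup with its character when that character is in the suite."""
    def replace(match):
        body = match.group(1)
        if body in LIGATURES:
            candidate = LIGATURES[body]
        elif body[0] in COMBINING_ABOVE and body[1].isalpha():
            candidate = unicodedata.normalize("NFC", body[1] + COMBINING_ABOVE[body[0]])
        else:
            return match.group(0)
        # Keep the markup unless it became one character the project allows.
        if len(candidate) == 1 and candidate in suite:
            return candidate
        return match.group(0)

    return MARKUP.sub(replace, text)


suite = {"\u00e4", "\u0153"}  # hypothetical suite containing only ä and œ
print(convert_markup("M[:a]dchen, [oe]uvre, caf['e]", suite))
```

With this suite, [:a] and [oe] are converted while ['e] stays as markup, because é is not part of the hypothetical suite.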
Do I need any special fonts?
No. We're initially limiting characters to the Basic Latin set, and fonts that worked with the Latin-1 character set will work with this Unicode subset. Our new default font, DejaVu Sans Mono, already supports a wide range of Unicode characters for when additional character suites are supported, as does the new DP Sans Mono. You can try both of them in your user preferences.
If curly quotes are converted to straight quotes during the page load, why doesn't that happen in the proofreading interface automatically?
We made a very intentional decision to not modify the page text after the proofreader clicks save*. Instead, if characters are not valid for the project, the proofreader is informed of invalid characters before the page is saved and given the option to change them.
*Note that while we use client-side validation to ensure only valid characters are present on a page when it leaves the user, we enforce this server-side before every page save to the database as well in case the client-side fails or is bypassed.

Project Managers

Do I need to do anything for my projects after the conversion?
No, the conversion handles converting project page texts and project word lists. The conversion even corrects any HTML entities in project titles and authors that were inadvertently added in these fields.
Were all projects converted to UTF-8?
All project metadata was converted to UTF-8. The project tables (which include all of the page texts) were converted for all projects except those that have been archived. If a project needs to be un-archived, Site Admins have a tool to convert the project table to UTF-8. There are also safeguards in the code to prevent working on a project whose project table has not been converted.

Character Support

When will we support character suites besides Basic Latin?
We are currently limiting projects to only use the Basic Latin character suite. After the development team and squirrels feel confident that we have adequate testing, documentation, and processes to support additional character suites we can roll that out. This could be a few weeks or a few months.
Will PMs be able to add individual characters to projects instead of full character suites?
The underlying infrastructure supports this and we plan to allow Project Managers to add individual Unicode characters to their projects. This is intended to supplement character suites, not replace them. Development has not yet begun on this enhancement.
Why can't we create custom project character suites?
Limiting a project's characters to a set of suites isn't intended to enforce the absolute set of characters that should exist in a project. Rather they're meant to provide guide rails for proofreaders and PPs alike. Forcing a PM to create/tweak a character suite for every project was seen as both too much overhead for them and too much code complexity for too little return.
Are other project fields, like titles or authors, limited to the project's character suites?
No, project fields can contain any Unicode character, although you are strongly encouraged to not use Unicode characters outside the Basic Multilingual Plane (BMP), i.e. so-called Astral plane characters, in project titles and authors. While these will work perfectly well for all DP code, they will cause project thread creation to fail because the forum software doesn't support them. If this happens, simply remove the Astral characters from the title or author and project thread creation will succeed.
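
A title or author can be checked for Astral plane characters before saving by looking for code points above U+FFFF. The helper names below are made up for this sketch.

```python
def astral_characters(text):
    """Return the characters that lie outside the BMP (code point above U+FFFF)."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


def strip_astral(text):
    """Return the text with all characters outside the BMP removed."""
    return "".join(ch for ch in text if ord(ch) <= 0xFFFF)


title = "Songs of Innocence\U0001D11E"   # ends with U+1D11E MUSICAL SYMBOL G CLEF
print([f"U+{ord(ch):04X}" for ch in astral_characters(title)])
print(strip_astral(title))
```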
How do characters get on a project's character picker?
The contents of the character picker in the proofreading interface are derived from the project's supported character suites. Each character suite defines a set of character menus that will be rendered in the proofreading interface. The different picker menus are manually crafted and not automatically generated from the character suites.

Word Lists

Are words on project word lists limited to the project's character suites?
No, currently project word lists can contain any Unicode character, although words containing characters not in your project's set will never match a word seen by a proofreader. See Task 1888.
Was diacritical markup in word lists changed to the Unicode character during the conversion?
No, word list contents were not changed during conversion. PMs may want to review word lists for projects in-flight and add word forms that use the accented characters and not just the diacritical markup.
How will WordCheck handle words with characters in different languages?
If WordCheck finds words that contain characters from 2 or more different Unicode scripts, it will flag them as bad words. This can help detect when visually similar letters are used in a word that come from different scripts (such as the 'B' from Cyrillic being used in an English word). Note: this scenario can't happen until additional character suites are enabled beyond Basic Latin and Extended European Latin.
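
A rough sketch of such a mixed-script check is shown below. Python's standard library does not expose the Unicode Script property, so this example approximates it with the first word of each letter's Unicode name; that is an assumption made for illustration and is not how WordCheck is implemented. A few letters (for example the micro sign) have names that do not start with a script, so the approximation is not exact.

```python
import unicodedata


def scripts_in_word(word):
    """Approximate the set of scripts used by the letters of a word."""
    scripts = set()
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                # e.g. "CYRILLIC CAPITAL LETTER VE" gives "CYRILLIC"
                scripts.add(name.split()[0])
    return scripts


def has_mixed_scripts(word):
    """Return True if the word's letters come from two or more scripts."""
    return len(scripts_in_word(word)) >= 2


lookalike = "\u0412ook"   # Cyrillic В followed by Latin "ook"
print(scripts_in_word(lookalike), has_mixed_scripts(lookalike))
print(has_mixed_scripts("Book"))
```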

Post Processors

What happens to post-processing artifacts for projects currently in PP?
Zip files:
  • Zip file contents (text, TEI text) were not changed in the conversion so text files within zip files will still be in Latin-1.
  • If the post-processing files are manually regenerated by a PF or Squirrel, upon request, or if the Post-Processor uses the "Download" button to download the CTF instead of the "Download Zipped Text" link, text files within the zips will be in UTF-8.
Text files: All text files in the project directory, including concatenated text files (CTFs), were converted to Unicode during the conversion. When these files are downloaded or viewed in the browser they will be UTF-8.
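
Because a Post-Processor may end up with either Latin-1 or UTF-8 text depending on how the file was obtained, a tolerant reader can be handy. The following is a minimal sketch, assuming the file is one of those two encodings; the function name is hypothetical.

```python
def read_project_text(raw: bytes) -> str:
    """Decode as UTF-8 when the bytes are valid UTF-8, otherwise assume Latin-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Every byte sequence is valid ISO-8859-1, so this cannot fail.
        return raw.decode("iso-8859-1")


old_zip_bytes = "café".encode("iso-8859-1")   # as found in a pre-conversion zip
new_file_bytes = "café".encode("utf-8")       # as downloaded after the conversion
print(read_project_text(old_zip_bytes) == read_project_text(new_file_bytes))
```

Pure ASCII files are byte-for-byte identical in both encodings, so the distinction only matters for accented and other non-ASCII characters.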

Appendix

Unicode to ASCII mapping

The following Unicode characters will be changed to basic ASCII equivalents during page load into a new project. If any of the characters are valid for a project, they will be changed upon page save too. For the definitive list, see the get_utf8_to_ascii_codepoints() function in pinc/unicode.inc.

Character Name             Codepoint  ASCII
hyphen                     U+2010     -
non-breaking hyphen        U+2011     -
en-dash                    U+2013     -
minus sign                 U+2212     -
figure dash                U+2012     --
em-dash                    U+2014     --
horizontal bar             U+2015     --
open curly double quote    U+201C     "
close curly double quote   U+201D     "
open curly single quote    U+2018     '
close curly single quote   U+2019     '
horizontal ellipsis        U+2026     ...
horizontal tab             U+0009     [space]
no-break space             U+00A0     [space]
ogham space mark           U+1680     [space]
en quad                    U+2000     [space]
em quad                    U+2001     [space]
en space                   U+2002     [space]
em space                   U+2003     [space]
three-per-em space         U+2004     [space]
four-per-em space          U+2005     [space]
six-per-em space           U+2006     [space]
figure space               U+2007     [space]
punctuation space          U+2008     [space]
thin space                 U+2009     [space]
hair space                 U+200A     [space]
narrow no-break space      U+202F     [space]
medium mathematical space  U+205F     [space]
ideographic space          U+3000     [space]
vertical tab               U+000B     [newline]
form feed                  U+000C     [newline]
next line                  U+0085     [newline]
line separator             U+2028     [newline]
paragraph separator        U+2029     [newline]
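
For anyone who wants to apply the same substitutions to their own files before loading, the mapping can be expressed as a Python translation table. This sketch follows the table above, using the standard code point U+200A for hair space; the names UNICODE_TO_ASCII and to_ascii_equivalents are made up here, and the definitive list remains get_utf8_to_ascii_codepoints() in pinc/unicode.inc.

```python
# Code point to replacement string, grouped by the ASCII result.
UNICODE_TO_ASCII = {}
for codepoints, replacement in (
    ((0x2010, 0x2011, 0x2013, 0x2212), "-"),
    ((0x2012, 0x2014, 0x2015), "--"),
    ((0x201C, 0x201D), '"'),
    ((0x2018, 0x2019), "'"),
    ((0x2026,), "..."),
    # U+2000 through U+200A covers en quad up to and including hair space.
    ((0x0009, 0x00A0, 0x1680, *range(0x2000, 0x200B), 0x202F, 0x205F, 0x3000), " "),
    ((0x000B, 0x000C, 0x0085, 0x2028, 0x2029), "\n"),
):
    for codepoint in codepoints:
        UNICODE_TO_ASCII[codepoint] = replacement


def to_ascii_equivalents(text):
    """Apply the Unicode-to-ASCII substitutions to a string."""
    return text.translate(UNICODE_TO_ASCII)


print(to_ascii_equivalents("\u201cWait\u2014what\u2026\u201d"))
```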