User:Lvl/Devel/I18n TODO list

From DPWiki

My in-progress TODO list for the internationalisation (short: i18n) of the DP site. See also:


High priority

Have administrators install locales on the server (test and prod).

According to user-contributed comments on
  http://fr2.php.net/manual/fr/book.gettext.php
and
  http://fr2.php.net/manual/en/function.setlocale.php
a locale must be installed on the server before translations
in that locale can be displayed.

The default procedure seems to install utf-8 locales. Donovan
managed to install Latin-1 locales by 
editing /var/lib/locales/supported.d/local to indicate desired 
pairs and doing the usual dpkg-reconfigure locales.
That needs to be documented somewhere near the bottom of SETUP/installation.txt
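
For reference, a minimal sketch of that procedure (the locale pairs shown are illustrative; use whichever languages the site actually needs):

```shell
# Append the desired locale/charset pairs to the supported list
# (example pairs; adjust to the installed translations)
echo "fr_FR ISO-8859-1" >> /var/lib/locales/supported.d/local
echo "de_DE ISO-8859-1" >> /var/lib/locales/supported.d/local

# Regenerate the installed locales
dpkg-reconfigure locales

# Verify that the new locales are now available
locale -a
```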


Hunt for unwanted locale-specific behaviour

If we change the locale, any locale-specific behaviour has to be checked. For instance:

  • do not use \w, \W, \b and \B in regexes if we only have ASCII letters in mind (e.g. for user data validation);
  • check also whether \d can include characters other than [0-9] (Arabic-Indic digits, and so on);
  • check which sorting functions are locale-dependent.

TODO: find if there is a list of locale-dependent functions and external tools.
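
As an illustration of the first point, a sketch of ASCII-only validation (the function name and length limits are hypothetical):

```php
<?php
// Validate a username using an explicit ASCII character class.
// [A-Za-z0-9_] is locale-independent, whereas \w can match extra
// characters depending on locale and pattern modifiers.
function is_valid_username($name)
{
    return preg_match('/^[A-Za-z0-9_]{3,25}$/', $name) === 1;
}
```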


Fix and revise the gettext markup

This is in progress in User:Lvl/Devel/Distributed_gettext_fixing_effort. Note that this is only the first fix. Strings which cannot be translated will be reported by translators.

Handle context in gettext markup

In a second phase, I expect to investigate how to support context-dependent translation such as pgettext(). That should handle the following cases:

  • _("PM") is not translatable (project manager, or private message?).
  • Same for _("none") (because it may have different forms in the translated language depending on the subject).
  • Need for short names in the UI.

The standard procedure is to use the context extensions in the PO files. See --keyword[=keywordspec] in the xgettext manual. Unfortunately we currently have xgettext version 0.15.4, which is obsolete and does not support context. (The current version of GNU libintl is 0.18.)

Once we have a decent xgettext, we should be able to use this php function:

if (!function_exists('pgettext'))
{
    function pgettext($context, $msgid)
    {
        $contextString = "{$context}\004{$msgid}";
        $translation = _($contextString);
        if ($translation == $contextString)
            return $msgid;
        else
            return $translation;
    }
}

(as documented here and here, this is supposed to work because entries with context are stored in the MO file as the context and the msgid separated with ASCII character \004. Of course I shall test that it works correctly first, once a recent xgettext is installed.)

Agree on a way of handling translated Guidelines and FAQs

The code already contains a mechanism to put the German translation of faq/formatting_guidelines.php in faq/de/formatting_guidelines.php. Following this convention allows the code to automatically link to the correct Guideline from various places (the stats bar, the proofreading interface, etc.). I recommend moving the translated Proofreading and Formatting Guidelines there, with placeholder files in their previous places redirecting to the new location.

faq_central.php: this could be kept unique and gettexted, provided there is a way for language communities to link to additional material in the wiki.

Other documents: If translated as php code, I suggest they be put in faq/xx/ for language xx. If translated as wiki pages, see next item.

Agree on a way of handling links to various items (forums, wiki pages)


Have a more robust way of selecting the language preference.

  • remove the ?lang=xx and language cookie non-feature (discussed on Jabber with Donovan)
  • remove the ugly language selection box on the front page.
  • better implementation of HTTP_ACCEPT_LANGUAGE header field. See this article, the rfc 2616 and a script sample.

This is in progress, I hope to put some code in pinc/gettext_setup.inc for review soon.
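
A hedged sketch of what such a header-based selection could look like (function name and fallback behaviour are assumptions, pending the real code in pinc/gettext_setup.inc):

```php
<?php
// Pick the best supported language from an Accept-Language header,
// honouring the q-values described in RFC 2616 section 14.4.
function negotiate_language($accept_header, $supported)
{
    $choices = array();
    foreach (explode(',', $accept_header) as $part) {
        $bits = explode(';', trim($part));
        $tag = strtolower(trim($bits[0]));
        $q = 1.0;
        if (isset($bits[1]) && preg_match('/q=([0-9.]+)/', $bits[1], $m))
            $q = (float)$m[1];
        $choices[$tag] = $q;
    }
    // Sort by descending preference, keeping the tags as keys.
    arsort($choices);
    foreach ($choices as $tag => $q) {
        // exact match first, then primary-subtag match ("fr-CA" -> "fr")
        if (in_array($tag, $supported))
            return $tag;
        $primary = substr($tag, 0, 2);
        if (in_array($primary, $supported))
            return $primary;
    }
    return 'en'; // hypothetical site default
}
```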

Translation center

  • have ~lvl/c/locale/translator/new_index.php reviewed
  • decide if the previous translation center may be deleted entirely.
  • commit to CVS

Medium priority

LOTE random rules

Add a language field to the rules table, and select the random rules per round and language. This will require reworking the script used to extract the random rules from the Guideline source code files.


Translate emails

In theory when a mail is being sent to someone else, we should do something like:

  $temp = mysql_query("SELECT email, u_intlang FROM users WHERE username = '...'");
  $email = mysql_result($temp, 0, "email");
  $temp_intlang = mysql_result($temp, 0, "u_intlang");
  // temporarily change the current locale to the recipient's language.
  setlocale(LC_ALL, $temp_intlang);
            
  // build the mail subject and body, in the current locale
  $body = _("...");
  $subject = _("...");
 
  // Send the mail using $subject and $body 

  // restore the current locale
  global $intlang;
  setlocale(LC_ALL, $intlang);

So, check all use of function maybe_mail() and distinguish two cases:
1) mail is sent to a mailing list: then in conjunction with
the email address global variable, add an email language global variable,
e.g. in addition to

  $email_addr = $promotion_requests_email_addr;

add

  $email_lang = $promotion_requests_intlang;

and then use setlocale(LC_ALL, $email_lang) before building the body 
and subject. (that allows in theory another site to have its mailing 
lists in a different language, while sharing the same code)

Or to simplify we could use the same intlang value for all 
administrative purposes, and define a global variable $site_admin_intlang;

2) mail is sent to an individual (a proofreader, project manager, etc.);
use function 
  $email = get_email_and_setlocale($username)
before building the subject and body.

Similarly one could define a function
  get_email_and_setlocale_project_manager($project)
if needed.

And do not forget to restore the language after maybe_mail().


Check date translation

Dates are displayed:

  • by string concatenation (e.g. when presentation is done inside SQL queries)
    • Suggest that presentation code be moved outside of SQL queries.
    • This perhaps needs a generic table code to replace dpsql_dump_table
  • by date(). date() is not localised (month names will always display in English)
    • TODO, check that last assertion.
  • by strftime()

strftime() would be the correct method to display localised dates, if only the locales worked correctly. So, suggest to wait for confirmation that the locale issue is resolved. If locales are the right choice:

  • use strftime() for all localised dates
  • use date() for dates which should not be localised.

Else, the solution will depend upon the solution chosen for localisation.
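
A minimal illustration of the date() versus strftime() difference (assumes a French locale is installed; the exact locale name varies per system):

```php
<?php
$ts = mktime(0, 0, 0, 7, 14, 2010);

// date() ignores the locale entirely: month names are always English.
echo date('j F Y', $ts), "\n";                // "14 July 2010"

// strftime() follows the current LC_TIME locale, when one is installed.
if (setlocale(LC_TIME, 'fr_FR.utf8', 'fr_FR') !== false)
    echo strftime('%e %B %Y', $ts), "\n";     // e.g. "14 juillet 2010"
```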


Translate genres

In each file where the genre is displayed, include:

 include_once("genres.inc");

and use $GENRES[$genre] instead of simply $genre. (tools/project_manager/edit_common.inc already has the translated genres)

In pinc/filter_project_list.inc, genre should be made an associative array. Draft code to be further expanded:

1) include_once("genres.inc");
2) replace around line 225 with
        if($field == "special_code" || $field=="difficulty" || $field=="genre")

3) in function _load_project_filter_field_values, put 
    global $GENRES;
and add the three lines below inside the while loop:
        else if ($field == "genre")
            $return[$a_res[0]] = array_key_exists($a_res[0], $GENRES)
                ? $GENRES[$a_res[0]] : $a_res[0];


Translate languages

Problem statement

This one is a bit more complex because there are many language names, and because new language names seem to be continually created (the IANA registry as of 30 June 2010 contains 7843 different language names).

Note that there are several lists in the ISO 639 standard, as described in the ISO 639 FAQ. The 639-1 list contains two-letter codes for the most significant languages of today (but not, for instance, ancient Greek). The 639-2 list contains two distinct three-letter codes for the most significant languages, one of these codes being used in bibliographies. Our language list pinc/iso_lang_list was apparently derived from the 639-2 list but differs significantly. (Some of the differences are due to minor rewording of language names, and some are due to new codes or other changes.) In any case, the code does not enforce that all languages be present in this list (e.g. a project created from a MARC record can have an unknown language code).

Language names in other languages can be found:

  • for French, directly in the ISO 639-2 standard, since the official standard's code list gives all language names in both French and English.
  • the Unicode Common Locale Data Repository (CLDR) contains all the data. Note that some translated language names are only drafts, not standards. And language names are slightly different from the ISO list (language codes are mostly the same though).

Of course the CLDR does not identify languages by the same tags as the ISO 639-2, because for those languages that are found in the ISO 639-1 list (those having a two letter code), the CLDR uses the two letter code.

Generating a po file from the CLDR does not seem to be the right idea, notably because the language names in our database do not match those of the CLDR.

summary proposal

So, my proposal would be to store language names, either as include files or in a table in our database, together with a script that can refresh these files or that table from a newer release of the CLDR, or add new languages into which the site is translated. A strategy remains to be defined as to where the various translations should occur:

  • between our language name(1) to our 3-letter code,
  • then from our 3-letter code to the CLDR language tag (using the two letter code from ISO 639-2 when it exists),
  • and finally from the language tag to the localised language name.

(1) our current database stores the primary and secondary languages of a project in a language field containing the language names separated by "with": "English with German". There is a proposal by Casey to change that and use database identifiers instead. I'm not sure what kind of language identifiers Casey had in mind, but the more I think of it, the more I'm tempted to use language tags for everything internally.

detailed proposal (database version)

My proposal, which I hope other developers will review and agree to during my leave in August, is to proceed in several steps:

Step 1: add the database table

# tag: the language tag according to the CLDR convention.
# intlang: the language tag in which this language name is translated
#      (currently one of "fr", "de", "it", "pt", "es", and for 
#      the long term this is to be also a language tag according 
#      to the same convention)
# name: the name of the language identified by tag, in the intlang language.
# e.g. tag='en', intlang='fr', name='Anglais'
#      tag='zh_Hans', intlang='en', name='Simplified Chinese'
#      tag='enm', intlang='pt', name='inglês médio'
#      tag='enm', intlang='pt_PT', name='inglês medieval'
CREATE TABLE `language_names` (
 `tag` varchar(25) NOT NULL default '',
 `intlang` varchar(25) NOT NULL default '',
 `name` text NOT NULL,
 PRIMARY KEY  (`tag`,`intlang`)
) TYPE=MyISAM DEFAULT CHARSET=latin1;

If the site is in latin-1, the name will be stored in latin1, with HTML entities if there are characters outside latin-1 range (that seems to appear for some rare language names, even in otherwise latin-1 languages).

If the site is in utf-8, the name will be stored directly in utf-8.


Step 2: populate the database table from the CLDR data using a simple script.

The script will use a grep-like approach for parsing the CLDR files, not a full XML parser. That script is only intended to be run from time to time by system administrators, to add the language names when a new translation is added to the site.

If a need for a more robust, full XML parser arises later, that can be done then.


Step 3: add in the code a hardcoded mapping from our internal ISO-639-2-like language names to the CLDR language tag.

I suggest that this include file is put beside our current pinc/iso_lang_list.inc, i.e. in pinc/iso_lang_list.inc, currently containing

 $lang_list=array(
   ...
   array("lang_name" => "German", "lang_code" => "ger/deu"),
   ...
 );

add a tag field, like this:

   array("lang_name" => "German", "lang_code" => "ger/deu", "tag" => "de"),

(this is to guarantee that the list of our own lang_name values, which unfortunately are what is stored in our database presently, is at only one place in the code.)

In a longer term, this table should go away completely once we convert the database to use language tags, instead of our lang_name; the only conversion left needed will be a conversion from the ISO 639-2 bibliographical code found in MARC, to the language tag, i.e. only the

 "lang_code" => "ger/deu", "tag" => "de"

fields will be needed in the long term (possibly as an associative array whose key is the MARC language code, rather than requiring a loop searching for values). But that's for a longer term, not part of this here proposal.


Step 4: implement a utility function that translates a language field, either of the form "German", or "French with German".

Function to be added in pinc/languages.inc:

// return a translated version of the project language
// field in the user's language.
function translated_proj_lang($proj_lang)
{
    if (preg_match('/ with /', $proj_lang))
    {
        $languages = preg_split('/ with /', $proj_lang);
        // TRANSLATORS: %s are the primary and secondary language names.
        return sprintf(_("%s with %s"),
            translated_lang($languages[0]),
            translated_lang($languages[1]));
    }
    else
        return translated_lang($proj_lang);
}
// queries the language names database and returns
// the name of the given language in the user's language
function translated_lang($lang_name)
{
    global $intlang;
    if ($intlang == "en")
        return $lang_name;

    $tag = tag_for_langname($lang_name);
    if (isset($tag))
    {
        $result = mysql_query("SELECT name FROM language_names 
            WHERE tag = '" . mysql_real_escape_string($tag) . "'
            AND intlang = '" . mysql_real_escape_string($intlang) . "'");
        if ($result == FALSE || mysql_num_rows($result) == 0)
        {
            // translation not found, return the original string
            return $lang_name;
        }
        return mysql_result($result, 0, 'name');
    }
    // Unknown language name, return the original string
    return $lang_name;
} 

Step 5: call the above function whenever displaying a language name to the user.

Step 6: (optional improvement, not needed for first deployment) progressively replace the direct usage of the data in pinc/iso_lang_list.inc by calls to functions which return the expected results. For instance, there is a need for a function which would return a list of all pairs (lang identifier, lang label in user's language), where the lang identifier is either our current lang_name or the lang tag, depending on whether we move to lang tags soon or not.

This last item is not part of the proposal subject to review.


Future steps (not part of this proposal) could be to refactor the current language field and use language tags directly in the database.

Variant using php include files

Donovan pointed out to me that putting the CLDR-extracted data in php include files instead of a database table would be more efficient and more maintainable (because it is then carried along with the code, and admins don't have to administer it independently on several servers).

This basically replaces the steps 1 and 2 above with a script being used to create include files. And of course the functions in step 4 use the associative arrays instead of a mysql query.
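
A sketch of what such a generated include file might look like (the file path and the set of entries are illustrative):

```php
<?php
// pinc/lang_names/fr.inc -- generated from the CLDR, do not edit by hand.
// Maps a CLDR language tag to its name in French.
$LANG_NAMES = array(
    'en'  => 'anglais',
    'de'  => 'allemand',
    'enm' => 'moyen anglais',
);
```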

Remaining to do: I should investigate if anything special is needed to store UTF-8 data in these files.

Document what steps are needed to add a new translation

  • Document the tables in languages.inc and lang_data.inc


Rewrite entirely the faq/translate.php

  • think about what activities will be needed between the several sites (test site and live site)


Low priority

HTTP error pages (403, 404, ...)

Mark 403.php and 404.php for translation.

These files are in /0/htdocs in the test server. They are presently not part of CVS, which makes it difficult to both edit and have xgettext run on it. I suggest moving the pages into CVS and add instructions (and perhaps a script) in SETUP to have the .htaccess point to them directly.

Translated phpbb2 interface

  • The language packs installed lack some strings (nothing is displayed when the translation is missing). Obtain sources and investigate.
  • Did DP make changes to the forum (e.g. the "view unread posts") ?


jpgraphs

(copied from DP Code Text Localization.)

The collision between gettext() and the image cache needs to be resolved before we localize these files (stop using the image cache for translated sites? make the image cache smarter by caching per-language?)

According to the jpgraph documentation, if the cache name is specified as 'auto' then the cache name will be based on the basename of the script, with an extension indicating the image format used, i.e. JPG, GIF or PNG. Therefore, to cache per language, I suppose we just have to replace that 'auto' parameter with something that contains the current $intlang.

The fix that minimizes the number of impacted files is simply to construct the cache name in the various init_pages_graph() functions directly in stats/jpgraph/common.inc, based on $intlang and $_SERVER['PHP_SELF']. As an extension I suggest to pass the cache name as an optional parameter to these init_xxx_graph() functions, so as to allow creating several graphs in the same script.
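
A sketch of the cache-name construction suggested above (the helper name is hypothetical):

```php
<?php
// Build a per-language cache name from the script path and the
// current interface language, e.g. "pages_graph_fr.png".
function graph_cache_name($php_self, $intlang)
{
    return basename($php_self, '.php') . '_' . $intlang . '.png';
}
```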

number_format

This function is used to display 12345 as 12,345. It is not locale-aware, and should be replaced by a locale-aware version that would need to be written or obtained.
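
One possible sketch of such a replacement, built on localeconv() (assumes setlocale() has already been called with the user's locale):

```php
<?php
// Format a number using the separators of the current locale.
function locale_number_format($number, $decimals = 0)
{
    $lc = localeconv();
    // Fall back to "." if the locale does not define a decimal point.
    $dec = ($lc['decimal_point'] !== '') ? $lc['decimal_point'] : '.';
    return number_format($number, $decimals, $dec, $lc['thousands_sep']);
}
```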


Translation database

Do we need a translation database? It could hold things like the forum for the translation team, the place to report translation bugs, and the name of the last person who uploaded a given po file, if needed. To be thought through further.


Translator icon in the stats

This does not need a database change, merely three lines added in stats/include/member.inc, function showMbrRoles().


Translate News

I suggest proceeding as follows. Site admins and news admins will edit their news at will using the current tools. The only change is that translators will be alerted when a piece of news is newly created or edited.

Means of alerting translators can be:

  • sending an email to all members registered as translators according to the user_settings table?
  • writing a post in a dedicated forum topic (and expecting translators are watching that topic)?
  • other?

Each piece of news can receive one translation in each of the supported translation languages. This translation is stored in a new table.

CREATE TABLE `transl_news_items` (
 `id` int(11) NOT NULL default '0',
 `lang` char(2) NOT NULL default '',
 `date_updated` int(11) NOT NULL default '0',
 `content` text NOT NULL,
 PRIMARY KEY  (`id`,`lang`)
) TYPE=MyISAM DEFAULT CHARSET=latin1;

When displaying a news item, if the user's locale is different from English, the system displays the translated version if its date is later than the edit date of the English news item. Furthermore, if the current user is known as a translator, a link is given next to the news display to edit the translation of this news item. That links to an interface where the translator can see which English news items are more recent than their translated version.


Translate the walkthrough

Convert the walkthrough to php and use gettext markup. A lot of strings will be identical, so the additional work for translators should be small. Prior to doing that, make it more in sync with the current Activity Hub. Amy told me that a revised version exists on her sandbox, as a starting point.

Translate the quizzes

(To be further analysed. First understand what is preventing Amy's quizzes from being incorporated to the code. Analyse what is needed to have also LOTE quizzes, in addition to translations of english quizzes? ...)

Translate wiki interface?

Depending on the feasibility as reported by squirrels.


Document somewhere (in SETUP?) the limitations of using locales

Limitations of using locales:

  • that requires that apache does not run multithreaded.
  • does not work on windows.

And document how to run the site with no translation, for those users who cannot satisfy the above requirements.

No implementation needed

(for reference, these topics do not require implementation and are out of scope of this todo list)

  • encourage LOTE communities to translate first page of wiki
  • LOTE forums (cf LOTE committee)

Anything missing?

If you spot anything missing, please tell me!