Difference between revisions of "LocaleMapping"

From Apache OpenOffice Wiki

Revision as of 14:19, 17 May 2010

Mapping Unix Locale Identifiers to BCP-47

Some notes on plausibly mapping Unix Locale Identifiers (ok, basically the glibc ones) to BCP-47[1].

Format of Unix Locale Identifier

From POSIX:2008[2] we have

If the locale value has the form: language[_territory][.codeset] it refers to an implementation-provided locale, where settings of language, territory and codeset are implementation-dependent. [Some categories can be] defined to accept an additional field "@modifier ", which allows the user to select a specific instance of localisation data within a single category (for example, for selecting the dictionary as opposed to the character ordering of data). The syntax for these environment variables is thus defined as: [language[_territory][.codeset][@modifier]]

and from the gettext manual[3] we have

The functions recognize the format of the value of the environment variable. It can split the value is different pieces and by leaving out the only or the other part it can construct new values. This happens of course in a predictable way. To understand this one must know the format of the environment variable value. There is one more or less standardized form, originally from the X/Open specification: language[_territory[.codeset]][@modifier]

Clearly the territory, codeset and modifier sections are all equally optional. For example, in a string such as tt_RU@iqtelif, tt is the language, RU is the territory, the codeset is empty and the modifier is iqtelif. Equally, for tt_RU@iqtelif.foo[4], tt is the language, RU is the territory, the codeset is empty and the modifier is "iqtelif.foo". Parsing this by just searching for the first "_", then the subsequent "." and then the subsequent "@" would give tt for the language, "RU@iqtelif" for the territory and foo for the codeset, which is bogus[5].
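
The splitting order described above can be sketched as follows. This is only a minimal illustration, not code taken from glibc or glib; the names UnixLocale and parse_unix_locale are made up for this example, and missing parts are represented as empty strings.

```python
from typing import NamedTuple


class UnixLocale(NamedTuple):
    """Hypothetical container for the four parts of a Unix locale identifier."""
    language: str
    territory: str
    codeset: str
    modifier: str


def parse_unix_locale(identifier: str) -> UnixLocale:
    # Split off the modifier first: everything after the first "@"
    # belongs to it, including any "." it may contain.
    head, _, modifier = identifier.partition("@")
    # Only what is left of the "@" can carry a codeset.
    head, _, codeset = head.partition(".")
    # Finally separate language and territory at the first "_".
    language, _, territory = head.partition("_")
    return UnixLocale(language, territory, codeset, modifier)


print(parse_unix_locale("tt_RU@iqtelif.foo"))
print(parse_unix_locale("tt_RU.UTF-8@iqtelif"))
```

Handling "@" before "." and "." before "_" means a "." inside the modifier or a "_" inside the codeset can never be mistaken for a field separator.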

Mapping

For the most part the simple case is that, once the encoding is excluded, the Unix identifier is language_territory and language-territory forms a valid BCP-47 tag. There are other cases to consider, though, in two main categories:

Use of collective and/or obsolete languages

  1. glibc continues to support locales identified by long names, e.g. deutsch, japanese.sjis.
  2. glibc also continues to support the obsolete language codes no and iw.
  3. glibc has some locales like ber_DZ and ber_MA. ber is now classified as a collective language, so it's unclear from the identifier itself which specific language is truly indicated.

@modifiers

These are effectively free-form, but the existing modifiers in glibc break down into...

  1. @modifiers that indicate a particular currency, e.g. en_IE@euro
  2. @modifiers that indicate a non-default script, e.g. uz_UZ@cyrillic, be_BY@latin
  3. @modifiers that indicate a dialect of the language, e.g. aa_ER@saaho
  4. @modifiers that indicate a non-default collation rule, e.g. gez_ER@abegede

Best mapping

  1. Substitute any identifiers appearing in locale.alias according to those aliases
  2. Parse to language, territory, encoding, modifier
  3. Substitute language of iw to he
  4. Substitute language of no to nb
  5. Ignore @euro modifier, it's redundant now
  6. Convert the modifiers of...:
    1. "cyrillic" to script-tag of "Cyrl"
    2. "latin" to script-tag of "Latn"
    3. "devanagari" to script-tag of "Deva"
    4. "iqtelif" to script-tag of "Latn" (?)
  7. aa_ER@saaho claims "Afar language locale for Eritrea (Saaho Dialect)", but ssy denotes "Saho, A language of Eritrea. Very similar to Afar". Convert aa to ssy when the @modifier is saaho, i.e. ssy-ER
  8. gez_ER@abegede claims "Abegede Collation for Ge'ez". There seems to be no existing tag to indicate this anywhere, so suggest a private-use tag of x-abegede for the interim, i.e. gez-ER-x-abegede
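
The steps above could be combined roughly as in the following sketch. It is an illustration under stated assumptions, not an existing API: the function name unix_locale_to_bcp47 and the aliases parameter (a dict standing in for the parsed contents of locale.alias) are hypothetical, the obsolete Hebrew code is taken to be iw as listed among the obsolete codes earlier, and any modifier not covered by the list is simply dropped.

```python
# Modifier-to-script table from step 6; the "iqtelif" entry is the
# uncertain one flagged with "(?)" in the list.
SCRIPT_MODIFIERS = {
    "cyrillic": "Cyrl",
    "latin": "Latn",
    "devanagari": "Deva",
    "iqtelif": "Latn",
}

# Obsolete language codes and their replacements (steps 3 and 4).
LANGUAGE_REPLACEMENTS = {"iw": "he", "no": "nb"}


def unix_locale_to_bcp47(identifier, aliases=None):
    """Sketch: map a glibc-style locale identifier to a BCP-47 tag."""
    # Step 1: resolve long names; `aliases` is a hypothetical dict
    # built from locale.alias by the caller.
    identifier = (aliases or {}).get(identifier, identifier)

    # Step 2: parse, taking "@" first, then ".", then "_".
    head, _, modifier = identifier.partition("@")
    head, _, _codeset = head.partition(".")  # the codeset is not mapped
    language, _, territory = head.partition("_")

    # Steps 3 and 4: replace obsolete language codes.
    language = LANGUAGE_REPLACEMENTS.get(language, language)

    script = ""
    private_use = ""
    if modifier == "euro":
        pass  # step 5: redundant, ignore
    elif modifier in SCRIPT_MODIFIERS:
        script = SCRIPT_MODIFIERS[modifier]  # step 6
    elif language == "aa" and modifier == "saaho":
        language = "ssy"  # step 7
    elif language == "gez" and modifier == "abegede":
        private_use = "x-abegede"  # step 8
    # Assumption: any other modifier is silently dropped.

    # BCP-47 order: language, script, region, then private use.
    subtags = [language, script, territory, private_use]
    return "-".join(s for s in subtags if s)


for name in ("en_IE@euro", "uz_UZ@cyrillic", "aa_ER@saaho", "gez_ER@abegede"):
    print(name, "->", unix_locale_to_bcp47(name))
```

The ber_ locales are deliberately left alone here, since their mapping is debatable.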

Debatable issues surround the ber_ family

  1. The ber_DZ locale claims it's "Amazigh language locale for Algeria (latin)". Algeria has apparently standardized on Kabyle, written in Latin script, so ber_DZ could perhaps be converted to kab-DZ when the territory is DZ[6].
  2. The ber_MA locale claims it's "Amazigh language locale for Morocco (tifinagh)". It's a little unclear what exactly is specified here in the absence of a "Standard Amazigh"/"Standard Tamazight". (It's of no help to examine, for example, the translations in the glibc locale description file to see what language they were written in, because they are all just copied from the Azerbaijani locale file and so aren't in any Berber language!) But the languages being taught through Tifinagh in Morocco seem to be rif, tzm and shi, where tzm and shi have approximately 3 million speakers to rif's 1.5 million. In any case there seems to be, or to have been, plenty of controversy about the script itself, so adding the Tfng script subtag to give the tag some distinguishing information seems called for, especially as there's no Suppress-Script field in the language-subtag-registry entries for those languages.
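
One possible reading of these two points, purely as a sketch: the function name map_berber_locale is hypothetical, and keeping the collective ber subtag for Morocco (rather than picking rif, tzm or shi) is an assumption made here because the notes above do not settle on a specific language.

```python
def map_berber_locale(language, territory):
    """Sketch of a special case for the ber_ locales; returns None otherwise."""
    if language != "ber":
        return None
    if territory == "DZ":
        # Point 1: Algeria apparently standardized on Kabyle in Latin script.
        return "kab-DZ"
    if territory == "MA":
        # Point 2: the language stays undetermined, so only the Tifinagh
        # script subtag is added (assumption: keep the collective "ber").
        return "ber-Tfng-MA"
    # Any other territory: plain language-territory, or just the language.
    return "-".join(part for part in (language, territory) if part)


print(map_berber_locale("ber", "DZ"))
print(map_berber_locale("ber", "MA"))
```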

fontconfig

fontconfig[7] language tags have the form language-territory. There seems to be some confusion about this, though, and various .conf files can be seen with an @ in the language field.

References:
  1. BCP-47
  2. POSIX:2008, http://www.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html
  3. gettext manual
  4. rhbz#589138 Oddly named tt_RU@iqtelif.UTF-8/tt_RU.utf8@iqtelif.UTF-8 locales.
  5. gnome#618108 Fix glib locale splitter
  6. xdg#19881 Berber orthographies in Latin and Tifinagh
  7. xdg#19869 fontconfig should change to BCP 47 language tags