LocaleMapping

From Apache OpenOffice Wiki
Revision as of 19:31, 21 May 2010 by Caolan (Talk | contribs)


Mapping Unix Locale Identifiers to BCP-47

Some notes on plausibly mapping Unix Locale Identifiers (ok, basically the glibc ones) to BCP-47[1].

Format of Unix Locale Identifier

From POSIX:2008[2] we have

If the locale value has the form: language[_territory][.codeset] it refers to an implementation-provided locale, where settings of language, territory and codeset are implementation-dependent. [Some categories can be] defined to accept an additional field "@modifier", which allows the user to select a specific instance of localisation data within a single category (for example, for selecting the dictionary as opposed to the character ordering of data). The syntax for these environment variables is thus defined as: [language[_territory][.codeset][@modifier]]

and from the gettext manual[3] we have

The functions recognize the format of the value of the environment variable. It can split the value is different pieces and by leaving out the only or the other part it can construct new values. This happens of course in a predictable way. To understand this one must know the format of the environment variable value. There is one more or less standardized form, originally from the X/Open specification: language[_territory[.codeset]][@modifier]

Note that the territory, codeset and modifier sections are each independently optional. For example, in the string tt_RU@iqtelif, tt is the language, RU is the territory, the codeset is empty and the modifier is iqtelif. Equally, in tt_RU@iqtelif.foo[4], tt is the language, RU is the territory, the codeset is empty and the modifier is "iqtelif.foo". Parsing this by searching for the first "_", then the subsequent "." and then the subsequent "@" would give tt for the language, "RU@iqtelif" for the territory and foo for the codeset, which is bogus[5].
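A minimal sketch of a splitter that avoids this pitfall is given here in Python. The names UnixLocale and parse_unix_locale are hypothetical and are not taken from glibc, glib or gettext. The only idea it illustrates is the ordering: cut at the first "@" before looking for "." or "_", so that a modifier containing those characters stays intact.

```python
from typing import NamedTuple


class UnixLocale(NamedTuple):
    """The four fields of language[_territory][.codeset][@modifier]."""
    language: str
    territory: str
    codeset: str
    modifier: str


def parse_unix_locale(identifier: str) -> UnixLocale:
    """Split a Unix locale identifier; absent fields become empty strings."""
    # Cut at the first "@" before anything else: everything after it is the
    # modifier, even if it happens to contain "." or "_".
    head, _, modifier = identifier.partition("@")
    # In what remains, the codeset is whatever follows the first ".".
    head, _, codeset = head.partition(".")
    # What is left is the language, optionally followed by "_territory".
    language, _, territory = head.partition("_")
    return UnixLocale(language, territory, codeset, modifier)
```

With this ordering, parse_unix_locale("tt_RU@iqtelif.foo") yields language tt, territory RU, an empty codeset and the modifier "iqtelif.foo", matching the reading above. The sketch does no validation of the individual fields.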

Mapping

For the most part, once the encoding is excluded, the Unix identifier is simply language_territory, and language-territory forms a valid sequence of BCP 47 subtags. There are other cases to consider, though, which fall into two main categories:

Use of collective and/or obsolete languages

  1. glibc continues to list locales identified by long names, e.g. deutsch, japanese.sjis.
  2. glibc also continues to list obsolete language codes, e.g. iw.
  3. glibc has some locales like ber_DZ and ber_MA. ber is now classified as a collective language, so it's unclear from the identifier itself which specific language is actually meant. By comparison, no_NO is typically accepted, e.g. through gettext aliasing, as equivalent to nb_NO.

@modifiers

These are effectively free-form, but the existing modifiers in glibc break down into...

  1. @modifiers that indicate a particular currency, e.g. en_IE@euro
  2. @modifiers that indicate a non-default script, e.g. uz_UZ@cyrillic, be_BY@latin
  3. @modifiers that indicate a dialect or variant of the language, e.g. aa_ER@saaho, ca_ES@valencia
  4. @modifiers that indicate a non-default collation rule, e.g. gez_ER@abegede
  5. @modifiers that indicate that East Asian ambiguous width characters should default to being considered narrow, e.g. zh_CN@cjknarrow
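As a rough illustration, the five groups above can be written down as a lookup table. This is only a sketch: the table holds just the example modifiers cited in the list, glibc ships others, and the names MODIFIER_KINDS and modifier_kind as well as the category labels are hypothetical.

```python
# Example modifiers from the list above, keyed to an informal category label.
MODIFIER_KINDS = {
    "euro": "currency",
    "cyrillic": "script",
    "latin": "script",
    "saaho": "dialect",
    "valencia": "dialect",
    "abegede": "collation",
    "cjknarrow": "width",
}


def modifier_kind(modifier: str) -> str:
    """Return the informal category of a modifier, or "unknown"."""
    # Modifiers are free-form, so anything not in the table is left unclassified.
    return MODIFIER_KINDS.get(modifier.lower(), "unknown")
```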

Best mapping

  1. Substitute any identifiers appearing in locale.alias according to those aliases
  2. Parse to language, territory, encoding, modifier
  3. Substitute obsolete language codes with language-subtag-registry registered replacements, e.g. iw to he, in to id and ji to yi
  4. aa_ER@saaho claims "Afar language locale for Eritrea (Saaho Dialect)", but ssy denotes "Saho, A language of Eritrea. Very similar to Afar". Convert aa to ssy when the @modifier is saaho, i.e. aa_ER@saaho -> ssy-ER. Unicode Technical Standard #35[6] suggests the same.
  5. Ignore the @euro modifier; it's redundant now.
  6. Ignore the @cjknarrow modifier; it's orthogonal information (?)
  7. Convert any @modifier that appears as a "Property Value Alias" entry in iso-15924 to that script code, e.g. "cyrillic" to the "Cyrl" script-tag, latin to "Latn", Devanagari to "Deva", etc.
  8. Convert any @modifier that is equivalent to a language-subtag-registry registered variant to a variant-tag, e.g. valencia is a registered BCP 47 variant, so ca_ES@valencia -> ca-ES-valencia
  9. Debatable issues
    1. "iqtelif" to script-tag of "Latn" (?). The variant iqtel was requested but not granted. tt_RU@iqtelif -> tt-Latn-RU-x-iqtelif or tt-Latn-RU-x-iqtel
    2. gez_ER@abegede claims "Abegede Collation for Ge'ez", but there seems to be no existing tag to indicate this anywhere. The "u" extension might in the future give a route to supporting this. Maybe use a private tag of x-abegede for the interim, i.e. gez-ER-x-abegede (?).
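The steps above can be sketched roughly as follows in Python. This is an illustration under stated assumptions, not a definitive implementation: the function name unix_to_bcp47 and all table names are hypothetical, the tables hold only the examples named in the list rather than the full iso-15924 or language-subtag-registry data, step 1 (locale.alias substitution) is assumed to have been done already, and no validation of the resulting tag is attempted. Unrecognised modifiers fall back to a private-use subtag in the manner suggested for abegede; the debatable "iqtelif" to "Latn" mapping is not applied.

```python
# Step 3: obsolete language codes and their registered replacements.
OBSOLETE_LANGUAGES = {"iw": "he", "in": "id", "ji": "yi"}
# Step 7: modifiers naming a script (only the examples from the text).
SCRIPT_MODIFIERS = {"cyrillic": "Cyrl", "latin": "Latn", "devanagari": "Deva"}
# Step 8: modifiers that are also registered variants.
VARIANT_MODIFIERS = {"valencia"}
# Steps 5 and 6: modifiers that are dropped.
IGNORED_MODIFIERS = {"euro", "cjknarrow"}


def unix_to_bcp47(identifier: str) -> str:
    """Map an (already de-aliased) Unix locale identifier to a BCP 47 tag."""
    # Step 2: parse. The "@" is cut first so the modifier stays intact;
    # the codeset is discarded because BCP 47 has no place for it.
    head, _, modifier = identifier.partition("@")
    head, _, _codeset = head.partition(".")
    language, _, territory = head.partition("_")
    modifier = modifier.lower()

    language = OBSOLETE_LANGUAGES.get(language, language)  # step 3
    script = variant = private = ""
    if language == "aa" and modifier == "saaho":
        language = "ssy"                                   # step 4
    elif not modifier or modifier in IGNORED_MODIFIERS:
        pass                                               # steps 5 and 6
    elif modifier in SCRIPT_MODIFIERS:
        script = SCRIPT_MODIFIERS[modifier]                # step 7
    elif modifier in VARIANT_MODIFIERS:
        variant = modifier                                 # step 8
    else:
        private = "x-" + modifier                          # interim fallback (9.2)

    # BCP 47 order: language, script, region, variant, private use.
    subtags = [language, script, territory, variant, private]
    return "-".join(tag for tag in subtags if tag)
```

For instance, unix_to_bcp47("uz_UZ@cyrillic") gives uz-Cyrl-UZ and unix_to_bcp47("gez_ER@abegede") gives gez-ER-x-abegede. The collective ber_ locales are deliberately left untouched, since their mapping is still open.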

Debatable issues surrounding the ber collective code

  1. The ber_DZ locale claims it's "Amazigh language locale for Algeria (latin)". Algeria has apparently standardized on Kabyle, written in Latin script, so ber_DZ is probably practically equivalent to kab-DZ[7]. Where glibc previously used a collective code ("no"), it was aliased to a specific individual code. The same might be called for again here.
  2. The ber_MA locale claims it's "Amazigh language locale for Morocco (tifinagh)". It's a little unclear what exactly is specified here in the absence of a "Standard Amazigh"/"Standard Tamazight". (It's of no help to e.g. examine the translations in the glibc locale description file to see what language they were written in, because they are all just copied from the Azerbaijani locale file and so aren't in any Berber language!) The languages being taught through Tifinagh in Morocco seem to be rif, tzm and shi, where tzm and shi have approximately 3 million speakers to rif's 1.5 million. In any case there seems to be, or to have been, plenty of controversy about the script itself, so adding the Tfng script tag to give the tag some distinguishing information seems called for, especially as there's no suppress-script field in the language-subtag-registry entry for those languages.

fontconfig

fontconfig[8] language tags are language-territory. There seems to be some confusion about this, though, and I see various .conf files with @ in the language field.

References:
  1. BCP-47
  2. POSIX:2008, http://www.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html
  3. gettext manual
  4. rhbz#589138 Oddly named tt_RU@iqtelif.UTF-8/tt_RU.utf8@iqtelif.UTF-8 locales.
  5. gnome#618108 Fix glib locale splitter
  6. Unicode Technical Standard #35
  7. xdg#19881 Berber orthographies in Latin and Tifinagh
  8. xdg#19869 fontconfig should change to BCP 47 language tags