Difference between revisions of "LocaleMapping"

Revision as of 10:22, 11 May 2010

Mapping Unix Locale Identifiers to BCP-47

Some notes on plausibly mapping Unix Locale Identifiers (ok, basically the glibc ones) to BCP-47^[1].

Format of Unix Locale Identifier

From POSIX:2008^[2] we have

If the locale value has the form: language[_territory][.codeset] it refers to an implementation-provided locale, where settings of language, territory and codeset are implementation-dependent. [Some categories can be] defined to accept an additional field "@modifier ", which allows the user to select a specific instance of localisation data within a single category (for example, for selecting the dictionary as opposed to the character ordering of data). The syntax for these environment variables is thus defined as: [language[_territory][.codeset][@modifier]]

and from the gettext manual^[3] we have

The functions recognize the format of the value of the environment variable. It can split the value is different pieces and by leaving out the only or the other part it can construct new values. This happens of course in a predictable way. To understand this one must know the format of the environment variable value. There is one more or less standardized form, originally from the X/Open specification: language[_territory[.codeset]][@modifier]

Clearly the territory, codeset and modifier sections are all independently optional. For example, in a string such as tt_RU@iqtelif, tt is the language, RU is the territory, the codeset is empty and the modifier is iqtelif. Equally, for tt_RU@iqtelif.foo^[4], tt is the language, RU is the territory, the codeset is empty and the modifier is "iqtelif.foo". Parsing this by just searching for the first "_", then a subsequent "." and then a subsequent "@" would give tt for language, "RU@iqtelif" for territory and foo for encoding, which is bogus^[5].
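A minimal sketch of a splitter that avoids that mistake by cutting at the "@" first and only then looking for "." and "_" in what precedes it. The names UnixLocale and parse_unix_locale are made up for illustration and are not taken from glibc or glib.

```python
from typing import NamedTuple


class UnixLocale(NamedTuple):
    """Parts of language[_territory][.codeset][@modifier]; absent parts are ""."""
    language: str
    territory: str
    codeset: str
    modifier: str


def parse_unix_locale(identifier: str) -> UnixLocale:
    # Split off the modifier first: everything after the first "@" belongs
    # to it, including any "." it happens to contain.
    rest, _, modifier = identifier.partition("@")
    # Only now look for the codeset, in the part before the "@".
    rest, _, codeset = rest.partition(".")
    # What remains is language, optionally followed by "_territory".
    language, _, territory = rest.partition("_")
    return UnixLocale(language, territory, codeset, modifier)


print(parse_unix_locale("tt_RU@iqtelif.foo"))
```

With this ordering, tt_RU@iqtelif.foo yields an empty codeset and a modifier of "iqtelif.foo", matching the reading above.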

Mapping

For the most part the simple case, excluding the encoding, is that the Unix identifier is language_territory and that language-territory forms a valid BCP-47 tag. There are other cases to consider, though, which fall into two main categories:

use of collective and/or obsolete languages

glibc continues to support locales identified by long names, e.g. deutsch, japanese.sjis.
glibc also continues to support the obsolete language codes of no and iw
glibc has some locales like ber_DZ and ber_MA. ber is now classified as a collective language, so it's unclear from the identifier itself which specific language is truly indicated.
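Long names such as deutsch are resolved through glibc's locale.alias file (see "Best mapping" below). As a rough sketch, assuming the usual layout of one whitespace-separated "alias real-name" pair per line with "#" comment lines, the substitution could look like this. The function names are hypothetical and the sample text is only illustrative, not a copy of any shipped file.

```python
from typing import Dict


def parse_locale_alias(text: str) -> Dict[str, str]:
    """Build an alias -> real locale name table from locale.alias-style text."""
    aliases: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) >= 2:
            aliases[fields[0]] = fields[1]
    return aliases


def resolve_alias(identifier: str, aliases: Dict[str, str]) -> str:
    # Identifiers that are not aliases pass through unchanged.
    return aliases.get(identifier, identifier)


# Illustrative sample only.
SAMPLE = """
# alias         real name
deutsch         de_DE.ISO-8859-1
japanese.sjis   ja_JP.SJIS
"""

table = parse_locale_alias(SAMPLE)
print(resolve_alias("deutsch", table))
print(resolve_alias("tt_RU@iqtelif", table))
```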

@modifiers

These are effectively free-form, but the existing modifiers in glibc break down into...

@modifiers that indicate a particular currency, e.g. en_IE@euro
@modifiers that indicate a non-default script, e.g. uz_UZ@cyrillic, be_BY@latin
@modifiers that indicate a dialect of the language, e.g. aa_ER@saaho
@modifiers that indicate a non-default collation rule, e.g. gez_ER@abegede

Best mapping

Substitute any identifiers appearing in locale.alias according to those aliases
Parse to language, territory, encoding, modifier
Substitute language of iw to he
Substitute language of no to nb
Ignore @euro modifier, it's redundant now
Convert the modifiers of...:
1. "cyrillic" to script-tag of "Cyrl"
2. "latin" to script-tag of "Latn"
3. "devanagari" to script-tag of "Deva"
4. "iqtelif" to script-tag of "Latn" (?)
aa_ER@saaho claims "Afar language locale for Eritrea (Saaho Dialect)", but ssy denotes "Saho, A language of Eritrea. Very similar to Afar". Convert aa to ssy when the @modifier is saaho, i.e. ssy-ER.
gez_ER@abegede claims "Abegede Collation for Ge'ez". There seems to be no existing tag to indicate this anywhere, so I suggest a private-use subtag of x-abegede for the interim, i.e. gez-ER-x-abegede.
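The steps above, from the language substitutions onwards, can be sketched as a single function working on already-parsed components. This is only an illustration under stated assumptions: the name to_bcp47 and the table names are invented here, alias substitution and parsing are assumed to have happened beforehand, the codeset is ignored, and any modifier not listed above is simply dropped (the notes do not say what to do with those).

```python
# Obsolete language codes and their replacements, as listed above.
LANGUAGE_FIXES = {"iw": "he", "no": "nb"}

# Modifiers that select a script; the "iqtelif" entry is the uncertain one.
SCRIPT_MODIFIERS = {
    "cyrillic": "Cyrl",
    "latin": "Latn",
    "devanagari": "Deva",
    "iqtelif": "Latn",
}


def to_bcp47(language: str, territory: str = "", modifier: str = "") -> str:
    """Map parsed Unix locale parts to a BCP-47 tag (hypothetical helper)."""
    language = LANGUAGE_FIXES.get(language, language)
    script = SCRIPT_MODIFIERS.get(modifier, "")
    private = ""
    if language == "aa" and modifier == "saaho":
        # The "dialect" is really a separate language, Saho.
        language = "ssy"
    elif modifier == "abegede":
        # No registered subtag for this collation, so use private use.
        private = "x-abegede"
    # "euro" and any unrecognised modifier fall through and are dropped.
    # BCP-47 order is language, script, region, then private use.
    parts = [language, script, territory, private]
    return "-".join(part for part in parts if part)


for args in [("uz", "UZ", "cyrillic"), ("aa", "ER", "saaho"),
             ("gez", "ER", "abegede"), ("en", "IE", "euro"), ("iw", "IL")]:
    print(args, "->", to_bcp47(*args))
```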

Debatable issues surround the ber_ family

The ber_DZ locale claims it's "Amazigh language locale for Algeria (latin)". Algeria has apparently standardized on Kabyle, written in Latin script, so ber_DZ could perhaps be converted to kab-DZ when the territory is DZ^[6].
The ber_MA locale claims it's "Amazigh language locale for Morocco (tifinagh)". It's a little unclear what exactly is specified here in the absence of a "Standard Amazigh"/"Standard Tamazight". (It's of no help to examine the translations in the glibc locale description file to see what language they were written in, because they are all just copied from the Azerbaijani locale file and so aren't in any Berber language!) But the languages being taught through Tifinagh in Morocco seem to be rif, tzm and shi, where tzm and shi have approximately 3 million speakers to rif's 1.5 million. In either case there seems to be, or to have been, plenty of controversy about the script itself, so adding the Tfng script tag to give the tag some distinguishing information seems called for, especially as there's no suppress-script field in the language-subtag-registry entries for those languages.

fontconfig

fontconfig^[7] language tags are language-territory. There seems to be some confusion about this, though, and I see various .conf files with @ in the language field.

References:

[1] BCP-47
[2] POSIX:2008, http://www.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html
[3] gettext manual
[4] rhbz#589138 Oddly named tt_RU@iqtelif.UTF-8/tt_RU.utf8@iqtelif.UTF-8 locales.
[5] gnome#618108 Fix glib locale splitter
[6] xdg#19881 Berber orthographies in Latin and Tifinagh
[7] xdg#19869 fontconfig should change to BCP 47 language tags


@@ Line 52: / Line 52: @@
 Debatable issues surround the ber_ family
-# ber_DZ locale claims it's "Amazigh language locale for Algeria (latin)", Algeria has apparently standardized on Kabyle, writing in Latin. so maybe could convert ber_DZ to kab-DZ when territory is DZ<ref>[http://bugs.freedesktop.org/show_bug.cgi?id=19881 xdg#19881]</ref>.
+# ber_DZ locale claims it's "Amazigh language locale for Algeria (latin)", Algeria has apparently standardized on Kabyle, writing in Latin. so maybe could convert ber_DZ to kab-DZ when territory is DZ<ref>[http://bugs.freedesktop.org/show_bug.cgi?id=19881 xdg#19881] Berber orthographies in Latin and Tifinagh</ref>.
 # ber_MA locale claims it's "Amazigh language locale for Morocco (tifinagh)". Its a little unclear as to what exactly is specified here in the absence of a "Standard Amazigh"/"Standard Tamazight". (It's of no help to e.g. examine the translations in the glibc locale description file to see what language they were written in because they are all just copied from the Azerbaijani locale file, and so aren't in any Berber Language!) But the languages being [http://www.adrar.nl/indexEng.html taught through Tifinagh] in Morocco seem to be [http://www.ethnologue.com/show_language.asp?code=rif rif], [http://www.ethnologue.com/show_language.asp?code=tzm tzm] and [http://www.ethnologue.com/show_language.asp?code=shi shi] where tzm and shi have approximately 3 million speakers to rif's 1.5. In either case there seems to be or have been plenty of controversy about the script itself, so adding the Tfng script tag to add some distinguishing information to the tag seems called for, especially as there's no suppress-script field in the [http://www.iana.org/assignments/language-subtag-registry language-subtag-registry] entry for those language.
 === fontconfig ===
+[http://bugs.freedesktop.org/show_bug.cgi?id=19869 fontconfig]<ref>[http://bugs.freedesktop.org/show_bug.cgi?id=19869 xdg#19869] fontconfig should change to BCP 47 language tags</ref> language tags are [http://www.xemacs.org/Documentation/packages/html/fontconfig_2.html#SEC8 language-territory]. There seems to be some confusion around about this though and I see various .conf files with @ in the language field.
-Need to have a look see at the language tags in fontconfig, are they actually using language-territory@modifiers like I saw in one .conf file recently ? [http://www.xemacs.org/Documentation/packages/html/fontconfig_2.html#SEC8 docs] say just language-territory
 References: <references/>
