Difference between revisions of "User:Goranrakic/Localization/Texcat"

Latest revision as of 09:51, 16 July 2018

OpenOffice.org uses libtextcat to guess the language of a paragraph by n-gram text analysis. The guess is used to build the dynamic language list shown in the context menu of the language status bar.

To add or update support for your language, you need to prepare a language model for libtextcat. You can find existing language models in the OpenOffice.org source code in the /libtextcat/data/new_fingerprints directory, or in the basis3.3/share/fingerprint directory inside your OpenOffice.org installation.

Generating a new language model

OpenOffice.org uses a patched libtextcat library to support UTF-8 language models, but preparing such models can be a little tricky.

First get the original text_cat Perl script from http://odur.let.rug.nl/~vannoord/TextCat/ and prepare a plain text file with text that represents your language. You should run some tests, but according to this page a text of about 30 000 words should be enough. The text should not include any formatting (such as HTML or XML tags), and you may want to proofread it quickly to remove any parts written in a foreign language or a non-standard script. The text should tell OpenOffice.org: when you see something like this, recognize it as my language. Wikipedia can be a good source of representative text.
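A quick sanity check of the training text can be sketched as follows. The helper names and the sample sentence are hypothetical, and the tag stripping is a crude single-line regular expression rather than a real HTML parser, so treat it only as a first pass before proofreading.

```shell
#!/bin/sh
# Hypothetical helpers for checking a training text; the names are illustrative.

# Crudely drop anything that looks like an HTML/XML tag (tags must not span lines).
strip_tags() {
    sed -e 's/<[^>]*>//g'
}

# Count whitespace-separated words on standard input.
count_words() {
    wc -w | tr -d ' '
}

# A tiny sample instead of a real corpus.
sample='<p>Ovo je <b>mali</b> primer teksta.</p>'
printf '%s\n' "$sample" | strip_tags                 # text with the tags removed
printf '%s\n' "$sample" | strip_tags | count_words   # number of words left
```

On a real corpus, something like `strip_tags < serbian.txt | count_words` gives a figure to compare against the roughly 30 000 words mentioned above.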

The Perl script gives the best language model when it is run on text encoded in a one-byte-per-character code page, such as the ISO-8859 encodings. This lets text_cat build character-level n-grams instead of cutting through the byte sequences of multibyte encodings such as UTF-8.
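The difference is easy to see by counting bytes for a single letter. The sketch below assumes an iconv that knows the ISO8859-5 name used on this page; byte_count is a hypothetical helper, and the letter is written with octal escapes so the script does not depend on its own file encoding.

```shell
#!/bin/sh
# Hypothetical helper: number of bytes arriving on standard input.
byte_count() {
    wc -c | tr -d ' '
}

# Cyrillic letter "zhe" (U+0436) written as its two UTF-8 bytes.
printf '\320\266' | byte_count                                # 2 bytes in UTF-8
printf '\320\266' | iconv -f UTF-8 -t ISO8859-5 | byte_count  # 1 byte in ISO 8859-5
```

A tool that counts n-grams byte by byte would therefore treat one UTF-8 Cyrillic letter as a two-unit sequence, while in ISO 8859-5 each unit is a whole letter.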

Find the correct code page to represent your language. If your text is in UTF-8 encoding, you can use iconv to convert it to ISO-8859. For Serbian Cyrillic we convert from serbian.txt in UTF-8 to serbian-iso8859_5.txt in ISO8859-5 using:

iconv -c -f UTF-8 -t ISO8859-5 < serbian.txt > serbian-iso8859_5.txt
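Because -c silently omits characters that the target code page cannot represent, it may be worth checking whether the text survives the conversion unchanged. The following is a minimal sketch with a hypothetical helper name: it converts to the single-byte encoding and back, then compares the result with the original.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if a UTF-8 file is unchanged after being
# converted to the given single-byte encoding and back again.
survives_round_trip() {
    # $1 = UTF-8 input file, $2 = single-byte encoding such as ISO8859-5
    iconv -c -f UTF-8 -t "$2" < "$1" 2>/dev/null |
        iconv -f "$2" -t UTF-8 2>/dev/null |
        cmp -s - "$1"
}

tmp=$(mktemp)
printf '\320\266\n' > "$tmp"        # Cyrillic letter, present in ISO 8859-5
survives_round_trip "$tmp" ISO8859-5 && echo "nothing lost"
printf '\342\202\254\n' > "$tmp"    # euro sign, absent from ISO 8859-5
survives_round_trip "$tmp" ISO8859-5 || echo "some characters were lost"
rm -f "$tmp"
```

If the check fails on your own text, either the chosen code page is wrong for the language or the text still contains foreign characters that should be removed.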

Now run the Perl script on your input text to generate a language model:

./text_cat -n serbian-iso8859_5.txt > serbian-iso8859_5.lm

What is left is to convert the generated language model back to UTF-8:

iconv -f ISO8859-5 -t UTF-8 < serbian-iso8859_5.lm > serbian.lm
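To catch the case where the conversion back was forgotten, the resulting file can be checked for valid UTF-8. This sketch uses a hypothetical helper and assumes an iconv that exits with an error on invalid input, as GNU iconv does; other implementations may behave differently.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the file decodes cleanly as UTF-8.
# Converting to UTF-16 forces iconv to decode every input sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-16 < "$1" > /dev/null 2>&1
}

tmp=$(mktemp)
printf '\320\266\n' > "$tmp"   # a Cyrillic letter as UTF-8
is_valid_utf8 "$tmp" && echo "valid UTF-8"
printf '\326\n' > "$tmp"       # the same letter as a single ISO 8859-5 byte
is_valid_utf8 "$tmp" || echo "not valid UTF-8"
rm -f "$tmp"
```

Running `is_valid_utf8 serbian.lm` after the last step gives a quick confirmation that the model is in the encoding the patched library expects.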

Testing language model

Move your language model to the OpenOffice.org installation, inside the basis3.3/share/fingerprint directory. Edit fpdb.conf and add a new line that maps the language model to the OpenOffice.org language code and specifies the encoding, following the pattern of the existing entries.

Restart OpenOffice.org and type some text. While testing, you may want to comment out other models by placing a # character at the beginning of their lines. Read the specification document to understand what the language list should look like if your model is working correctly.
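Commenting out the other models by hand is tedious, so it can be scripted. The sketch below uses a hypothetical helper that prints a copy of a configuration file with every active line disabled except those mentioning the model under test. The demo entries are placeholders and do not show the real fpdb.conf syntax; work on a backup copy of the real file.

```shell
#!/bin/sh
# Hypothetical helper: print a configuration file with every active line
# commented out, except lines that mention the model being tested.
comment_out_others() {
    # $1 = model file name to keep active, $2 = configuration file to read
    awk -v keep="$1" '
        /^#/ || /^[ \t]*$/  { print; next }  # keep comments and blank lines
        index($0, keep) > 0 { print; next }  # keep the model being tested
        { print "#" $0 }                     # disable everything else
    ' "$2"
}

# Demonstration on a made-up file with placeholder entries.
tmp=$(mktemp)
printf '%s\n' 'first.lm placeholder-a' 'serbian.lm placeholder-b' '#third.lm placeholder-c' > "$tmp"
comment_out_others serbian.lm "$tmp"
rm -f "$tmp"
```

The helper writes to standard output, so the original file stays untouched until you decide to replace it.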

Integrating the model into OpenOffice.org

You should generate your language model, patch fpdb.conf, and define new packaging modules for your language models. Use issue 114751 as an example.
