Difference between revisions of "User:Goranrakic/Localization/Texcat"

From Apache OpenOffice Wiki

Revision as of 19:47, 4 October 2010

OpenOffice.org uses libtextcat to guess the language of a paragraph. The guess is used to generate a dynamic language list for the context menu.

To add or update support for your language, you need to prepare a language model for libtextcat. Existing language models can be found in the OpenOffice.org source code, in the /libtextcat/data/new_fingerprints directory, or in the basis3.3/share/fingerprint directory inside your OpenOffice.org installation.

Generating a new language model

OpenOffice.org uses a patched version of the library to support UTF-8 language models, but preparing them can be a little tricky. First get the original text_cat Perl script from http://odur.let.rug.nl/~vannoord/TextCat/ and prepare a plain text file with text that is representative of your language. You may experiment, but according to http://www.mnogosearch.org/guesser/ about 30,000 words should be enough. The text should not include any formatting (such as HTML or XML tags), and you may want to quickly proofread it to remove any parts written in a foreign language or a non-standard script. In effect, the text tells OpenOffice.org: when you see something like this, recognize it as my language. Wikipedia can be a good source.
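As a quick sanity check on corpus size, something like the following sketch can count the words in the text file. The file name serbian-utf8.txt and the helper corpus_words are hypothetical, chosen only for illustration, and the 30,000 threshold is simply the rough figure quoted above.

```shell
#!/bin/sh
# corpus_words FILE
# Print the number of whitespace-separated words in FILE.
corpus_words() {
    wc -w < "$1" | tr -d '[:space:]'
}

# Hypothetical corpus name; replace it with your own file.
corpus="serbian-utf8.txt"
if [ -f "$corpus" ]; then
    words=$(corpus_words "$corpus")
    if [ "$words" -lt 30000 ]; then
        echo "Only $words words; consider adding more text." >&2
    else
        echo "$words words; this is probably enough."
    fi
fi
```

The count is only a rough guide, since it says nothing about how clean or representative the text is.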

The Perl script gives the best results when it is run on text in a one-byte-per-character encoding, such as one of the ISO-8859 encodings. This lets text_cat build character-level n-grams instead of working on the byte sequences of multibyte encodings like UTF-8.
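If your source text is in UTF-8, one way to get it into a single-byte encoding is iconv, the same tool used further down for the reverse conversion. This is a minimal sketch: the input name serbian-utf8.txt and the helper to_single_byte are assumptions made for illustration, and ISO8859-5 suits Cyrillic text only, so pick the code page that matches your script.

```shell
#!/bin/sh
# to_single_byte ENCODING INPUT OUTPUT
# Convert a UTF-8 text file to a one-byte-per-character encoding.
# iconv normally exits with a non-zero status when a character cannot be
# represented in the target encoding, which hints at leftover foreign text.
to_single_byte() {
    iconv -f UTF-8 -t "$1" < "$2" > "$3"
}

# Hypothetical input name; the output name matches the text_cat example.
if [ -f serbian-utf8.txt ]; then
    to_single_byte ISO8859-5 serbian-utf8.txt serbian-iso8859_5.txt ||
        echo "Conversion failed; check the text for unsupported characters." >&2
fi
```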


Now run the Perl script on your input text to generate a language model:

./text_cat -n serbian-iso8859_5.txt > serbian-iso8859_5.lm

What is left now is to convert the generated language model back to UTF-8:

iconv -f ISO8859-5 -t UTF-8 < serbian-iso8859_5.lm > serbian.lm
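Before installing the model, it may be worth confirming that the converted file really is valid UTF-8. The sketch below relies on iconv rejecting invalid input when converting from UTF-8 to UTF-8; the helper name is_utf8 is hypothetical.

```shell
#!/bin/sh
# is_utf8 FILE
# Succeed if FILE decodes cleanly as UTF-8, fail otherwise.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 < "$1" > /dev/null 2>&1
}

if [ -f serbian.lm ]; then
    if is_utf8 serbian.lm; then
        # Show the first few entries of the model for a visual check.
        head -n 5 serbian.lm
    else
        echo "serbian.lm is not valid UTF-8; was the conversion skipped?" >&2
    fi
fi
```

A model left in ISO8859-5 by mistake will usually fail this check, because its high bytes rarely form valid UTF-8 sequences.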


Testing the language model

Move your language model to the OpenOffice.org installation, inside the basis3.3/share/fingerprint directory. Edit fpdb.conf and add a new line that maps the language model to the OpenOffice.org language code and specifies the encoding, following the pattern of the existing entries.

Restart OpenOffice.org and type some text. While testing, you may want to comment out the other models by placing a # character at the beginning of their lines. Read the specification document to understand what the language list should look like if your model is working correctly.
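Commenting out the other models by hand is tedious, so a small script can do it. The following is a sketch under stated assumptions: the helper isolate_model is hypothetical, each entry in fpdb.conf is assumed to mention its model file name on the same line, and the path to fpdb.conf is taken to be relative to the installation directory.

```shell
#!/bin/sh
# isolate_model MODEL CONF
# Print CONF with every active entry commented out except lines naming MODEL.
isolate_model() {
    awk -v model="$1" '
        # Keep blank lines, existing comments, and lines naming the model.
        $0 == "" || $0 ~ /^#/ || index($0, model) > 0 { print; next }
        # Disable every other entry by prefixing it with "#".
        { print "#" $0 }
    ' "$2"
}

# Assumed location, relative to the OpenOffice.org installation directory.
conf="basis3.3/share/fingerprint/fpdb.conf"
if [ -f "$conf" ]; then
    cp "$conf" "$conf.orig"    # keep a backup to restore after testing
    isolate_model serbian.lm "$conf.orig" > "$conf"
fi
```

Restoring fpdb.conf.orig over fpdb.conf brings the other models back once testing is done.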


Integrating the model into OpenOffice.org

To integrate the model, you should generate your language model, patch fpdb.conf, and define new packaging modules for your language model. Use issue 114751 as an example.
