Using TeX hyphenation patterns in OpenOffice.org

From Apache OpenOffice Wiki
Latest revision as of 09:52, 17 July 2018


This document describes how to properly convert TeX hyphenation patterns for OpenOffice.org and other software projects using the Hyphen hyphenation library.

  • Written by Martin Srebotnjak.
  • Some portions of text contributed by László Németh and Mojca Miklavec.
  • Version: 1.1 (July 24, 2010).

Introduction

OpenOffice.org uses Hyphen, part of the Hunspell project, as its hyphenation engine.

A set of hyphenation files consists of two files:

  • the patterns file (a text file with all the patterns and extra hyphenation rules; hyph_xx_YY.dic) and
  • the readme file (a text file with all the credits and licensing information; README_hyph_xx_YY.txt).

The language descriptor xx_YY is an ISO code; you can look it up in the following table: http://wiki.services.openoffice.org/wiki/Languages

From OpenOffice.org 3.0 onwards the hyphenation patterns are packed as an OpenOffice.org extension, usually as a part of a dictionary language pack (with a spell-checking dictionary for the same language and, optionally, a thesaurus). Here is a list of available dictionary language packs: http://extensions.services.openoffice.org/en/dictionaries

Using TeX patterns

Hyphen (and therefore OpenOffice.org) can use TeX hyphenation patterns, which is convenient because TeX patterns are available for more than 50 languages.

Because TeX and Hyphen handle patterns differently, the TeX patterns must first be converted. Without conversion, several issues can surface:

  • not all TeX patterns will work in OpenOffice.org, so hyphenation will be worse than in TeX;
  • if the code page is not set correctly, the TeX patterns can behave erratically in OpenOffice.org.

Conversion of TeX patterns

Follow these conversion steps in order:

1. Download up-to-date TeX hyphenation patterns

The TeX hyphenation repository (http://tug.org/tex-hyphen/) contains up-to-date TeX hyphenation patterns. They are located here:

http://tug.org/svn/texhyphen/trunk/hyph-utf8/tex/generic/hyph-utf8/patterns/txt/

Example: for the Slovenian language, download the file hyph-sl.pat.txt from the SVN repository.

2. Convert TeX hyphenation patterns file into proper character set

The Hyphen version in recent OpenOffice.org 3 releases has a problem with UTF-8 encoded hyphenation patterns: in special circumstances it misses the first hyphenation break point (in LEFTHYPHENMIN positions before letters with diacritics, e.g. "zaživeti", where "zaži" does not get split). Consider converting your UTF-8 encoded TeX patterns to an ISO-8859-X code page (or KOI8-R).

Example: the Slovenian language uses the ISO-8859-2 code page, so open the UTF-8 file in a text editor that supports code pages, then convert it and save it as ISO-8859-2 encoded text.
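If no suitable editor is at hand, the same conversion can be scripted. The following is only a minimal sketch: the function name transcode_patterns and the file names in the usage note are illustrative assumptions, not part of Hyphen or OpenOffice.org.

```python
from pathlib import Path


def transcode_patterns(src, dst, target_encoding="iso-8859-2"):
    """Re-encode a UTF-8 pattern file into a single-byte code page.

    Raises UnicodeEncodeError if a character in the patterns has no
    representation in the target code page, instead of silently dropping it.
    """
    text = Path(src).read_text(encoding="utf-8")
    # Encode explicitly and write bytes so no further newline translation happens.
    Path(dst).write_bytes(text.encode(target_encoding))
```

For Slovenian one might call transcode_patterns("hyph-sl.pat.txt", "hyph-sl.latin2.txt") and pass the second file to the next step; the output file name here is hypothetical.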

3. Run the substrings.pl conversion script

The Hyphen library (based on libhnj by Raph Levien) uses a time-optimized implementation of Liang's original TeX hyphenation algorithm, and this implementation requires the patterns to be converted with substrings.pl. You can download the latest version of the substrings.pl conversion script from the Hyphen repository (http://sourceforge.net/projects/hunspell/files/Hyphen). At the time of writing this was:

http://sourceforge.net/projects/hunspell/files/Hyphen/2.5/hyphen-2.5.tar.gz/download
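For orientation, the pattern matching at the heart of Liang's algorithm can be sketched as follows. This is a simplified illustration, not the Hyphen implementation: it ignores the substrings.pl optimization, NEXTLEVEL and non-standard hyphenation, and the function names and the toy patterns in the usage note are assumptions made for the example.

```python
def parse_pattern(pattern):
    """Split a pattern such as 'a1b' into its letters and inter-letter values."""
    letters = ""
    points = [0]
    for ch in pattern:
        if ch in "0123456789":
            points[-1] = int(ch)      # value for the gap before the next letter
        else:
            letters += ch
            points.append(0)          # open a new gap after this letter
    return letters, points


def hyphenate(word, patterns, left_min=2, right_min=2):
    """Return the indices k at which word[:k] / word[k:] may be split."""
    table = dict(parse_pattern(p) for p in patterns)
    padded = "." + word.lower() + "."          # '.' marks the word boundaries
    scores = [0] * (len(padded) + 1)           # scores[i]: gap before padded[i]
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            points = table.get(padded[start:end])
            if points is None:
                continue
            for offset, value in enumerate(points):
                # The highest value proposed by any matching pattern wins.
                scores[start + offset] = max(scores[start + offset], value)
    # Odd values allow a break, even values forbid it; the hyphenmin limits
    # exclude positions too close to either end of the word.
    return [k for k in range(left_min, len(word) - right_min + 1)
            if scores[k + 1] % 2 == 1]
```

With the toy patterns "a1b" and "a2b." the call hyphenate("abab", ["a1b", "a2b."], 1, 1) allows a break after the first letter only, because the higher even value from "a2b." suppresses the break that "a1b" proposes near the end of the word.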

The script takes the following parameters: the input file name, the output file name, the code-page setting (ISO8859-1, ISO8859-2, ..., ISO8859-10, KOI8-R) and the LEFTHYPHENMIN and RIGHTHYPHENMIN values, which define the minimum number of characters that must remain before and after a hyphenation point in a word.

Example: Slovenian uses the ISO-8859-2 code page, and the left and right hyphenmin values are both 2, so the command is:

./substrings.pl hyph-sl.pat.txt hyph_sl_SI.dic ISO8859-2 2 2

Warning: the code page is written ISO8859-2, not ISO-8859-2. Remember to omit the first hyphen in the ISO code-page name!
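When the conversion is scripted, the naming rule from the warning is easy to get wrong. The sketch below builds the argument list for the command shown above; both function names are hypothetical helpers, and the rule is applied only to names that start with "ISO-", so KOI8-R is passed through unchanged.

```python
def substrings_encoding_name(name):
    """Turn 'ISO-8859-2' into 'ISO8859-2'; leave other names (e.g. 'KOI8-R') as-is."""
    if name.upper().startswith("ISO-"):
        return "ISO" + name[4:]
    return name


def substrings_command(src, dst, encoding, left_min, right_min):
    """Build the argument list for substrings.pl in the order described above."""
    return ["./substrings.pl", src, dst,
            substrings_encoding_name(encoding), str(left_min), str(right_min)]
```

The resulting list can be passed to subprocess.run(..., check=True) from the directory that contains substrings.pl.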

4. Add hyphenation rules for special characters

Special characters (apostrophe, hyphen and, in version 3.2.0, also en dash and em dash ...) are word characters in OpenOffice.org, but its hyphenation does not treat them as word boundaries, which is incompatible with the TeX boundary hyphenation patterns. This can result in bad hyphenation of words that contain hyphens or other special characters. Consider adding the following lines at the end of the converted hyphenation patterns file (hyph_xx_YY.dic):

8-8
8a8-8
8b8-8
8c8-8
...
-a8
-b8
-c8
...
8'8
8a8'8
8b8'8
8c8'8
...
'a8
'b8
'c8
...

Note: "..." stands for the omitted lines for the remaining characters of your alphabet.
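Since one line per letter is needed, the list is tedious to type by hand. A minimal sketch that generates it in the same order as the listing above; the function name and its parameters are illustrative, and the alphabet must be supplied by you:

```python
def special_character_patterns(alphabet, separators=("-", "'")):
    """Generate the extra pattern lines for each separator and each letter."""
    lines = []
    for sep in separators:
        lines.append(f"8{sep}8")                                    # e.g. 8-8
        lines.extend(f"8{letter}8{sep}8" for letter in alphabet)    # e.g. 8a8-8
        lines.extend(f"{sep}{letter}8" for letter in alphabet)      # e.g. -a8
    return lines
```

For example, "\n".join(special_character_patterns("abcčdefghijklmnoprsštuvzž")) would produce the lines for a Slovenian alphabet; write the result in the same encoding as the rest of the patterns file.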

OpenOffice.org calculates wrong hyphen limits for words with apostrophes. To fix this calculation, add the following lines before your patterns:

1'.
NEXTLEVEL

The English hyphenation patterns use an extended version that also handles words ending in "'s" and "'t" correctly:

1'.
1's./'=s,1,2
1't./'=t,1,2
NEXTLEVEL

With UTF-8 patterns, you can also fix the hyphen limits for typographical apostrophes:

1'.
1’.
NEXTLEVEL
8'8
8a8'8
8b8'8
8c8'8
...
'a8
'b8
'c8
...
8’8
8a8’8
8b8’8
8c8’8
...
’a8
’b8
’c8
...
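Adding the lines "before your patterns" can also be scripted. The sketch below rests on an assumption that is not stated in this document: that the converted file starts with a line naming the character set, optionally followed by keyword lines such as LEFTHYPHENMIN and RIGHTHYPHENMIN, and that the fix belongs directly after that header. Check your own file before relying on it; the function name is hypothetical.

```python
# Assumed header keywords; anything else after the first line counts as a pattern.
HEADER_KEYWORDS = ("LEFTHYPHENMIN", "RIGHTHYPHENMIN",
                   "COMPOUNDLEFTHYPHENMIN", "COMPOUNDRIGHTHYPHENMIN")


def add_apostrophe_fix(dic_lines, fix_lines=("1'.", "NEXTLEVEL")):
    """Insert fix_lines after the assumed header and before the first pattern."""
    index = 1 if dic_lines else 0      # assumption: line 0 names the character set
    while index < len(dic_lines) and dic_lines[index].startswith(HEADER_KEYWORDS):
        index += 1
    return list(dic_lines[:index]) + list(fix_lines) + list(dic_lines[index:])
```

For the extended English variant, pass fix_lines=("1'.", "1's./'=s,1,2", "1't./'=t,1,2", "NEXTLEVEL") instead of the default.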

Note: Hyphen 2.7 (not yet integrated) supports a better method for handling this kind of hyphenation. See the NOHYPHEN feature in README.compound.

5. Create/update the appropriate readme file

The readme file contains the credits (author of the patterns, other collaborators ...) and the licensing information.

The official TeX hyphenation patterns are released under the GNU Lesser General Public License (LGPL), which makes them suitable for inclusion in OpenOffice.org.

If you use TeX hyphenation patterns from other sources, remember to check their license and to state it in the readme file.

Hyphenation patterns that are to be included in the official OpenOffice.org builds must also be accompanied by the filled-in form data from http://external.openoffice.org/.

6. Create/update an OpenOffice.org dictionary extension

Package the converted hyphenation patterns in a new OpenOffice.org extension (or update the existing one) and upload it to the OpenOffice.org extension repository (http://extensions.services.openoffice.org).

Before uploading, test extensively that it works properly with different versions of OpenOffice.org.

If the patterns perform well and are licensed under the LGPL, they could eventually make it into the official releases of OpenOffice.org.

For further details see Extensions development (http://wiki.services.openoffice.org/wiki/Extensions_development).

Future fixes

Hyphen 2.5 has already fixed the problem with UTF-8 encoded patterns on platforms that use an external Hyphen library. The integration of Hyphen 2.5 is planned for OpenOffice.org 3.4. You can restrict the UTF-8 encoded version of your hyphenation patterns to OpenOffice.org 3.4 or later by adding the following lines to the description.xml of your extension:

    <dependencies>
        <OpenOffice.org-minimal-version value="3.4" d:name="OpenOffice.org 3.4" />
    </dependencies>

This way, older versions of OpenOffice.org can keep using the older, non-UTF-8 version of your extension.

Conclusion

Since the Hyphen engine is also used by other open-source software projects (as are the OpenOffice.org hyphenation pattern files), following these instructions will provide correct patterns for programs such as Scribus, KOffice etc.

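
See Also

  • Automatic non-standard hyphenation in OpenOffice.org by László Németh (http://hunspell.sourceforge.net/tb87nemeth.pdf)
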
Content on this page is licensed under the Public Documentation License (PDL).