Difference between revisions of "Documentation/SL/Using TeX hyphenation patterns in OpenOffice.org"

From Apache OpenOffice Wiki
Line 5: Line 5:
 
some portions of text by László Németh and Mojca Miklavec
 
some portions of text by László Németh and Mojca Miklavec
  
OpenOffice.org uses http://sourceforge.net/projects/hunspell/files/Hyphen/ Hyphen, part of the Hunspell project, as its hyphenation tool. The hyphenation files, from OpenOffice.org 3.0 onwards packed as an OpenOffice.org extension, consist of three files: the dictionary file (a text file with all the patterns; example: for Slovenian it is hyph_sl_SI.dic), the rules file (a text file with hyphenation rules for the language; example: for Slovenian it is hyph_sl_SI.idx) and the release notes (a text file with all the credits and licensing information (example: for Slovenian it is
+
OpenOffice.org uses [http://sourceforge.net/projects/hunspell/files/Hyphen/ Hyphen], part of the [Hunspell project], as its hyphenation engine. The hyphenation files, from OpenOffice.org 3.0 onwards packed as an OpenOffice.org extension, consist of three files: the dictionary file (a text file with all the patterns; hyph_xx_YY.dic), the rules file (a text file with hyphenation rules for the language; hyph_xx_YY.idx) and the release notes (a text file with all the credits and licensing information (example: hyph_rel_notes_xx_YY.txt). The language descriptor xx_YY is used according to the following table:
  
 
Hyphen can use TeX hyphenation patters for hyphenation, but because of differences between TeX hyphenation and Hyphen the TeX hyphenation patterns must be first converted. If conversion is not applied, several issues can surprise:
 
Hyphen can use TeX hyphenation patters for hyphenation, but because of differences between TeX hyphenation and Hyphen the TeX hyphenation patterns must be first converted. If conversion is not applied, several issues can surprise:
1) not all TeX patterns work in OpenOffice.org which means that TeX patterns perform sub-quality;
+
* not all TeX patterns work in OpenOffice.org which means that TeX patterns perform sub-quality;
2) if code-page is not set correctly the TeX patterns can behave erratically in Hyphen;
+
* if code-page is not set correctly the TeX patterns can behave erratically in Hyphen;
  
 
Because of this the following conversion process must be followed step-by-step:
 
Because of this the following conversion process must be followed step-by-step:
  
 
1. Download up-to-date TeX hyphenation patterns
 
1. Download up-to-date TeX hyphenation patterns
http://tug.org/tex-hyphen/Tex hyphenation repository contains up-to-date TeX hyphenation patterns. They are located here:
+
[http://tug.org/tex-hyphen/ Tex hyphenation] repository contains up-to-date TeX hyphenation patterns. They are located here:
 
http://tug.org/svn/texhyphen/trunk/hyph-utf8/
 
http://tug.org/svn/texhyphen/trunk/hyph-utf8/
  
Line 25: Line 25:
  
 
3. Run the substrings.pl conversion script
 
3. Run the substrings.pl conversion script
Hyphen library (based on libhnj from Raph Levien) uses a time optimized implementation of the original Liang's algorithm of TeX, and substring.pl conversion is a requirement of this implementation. You can download the latest version of substrings.pl conversion script from http://sourceforge.net/projects/hunspell/files/Hyphen the Hyphen repository. At the time of writing this was http://sourceforge.net/projects/hunspell/files/Hyphen/2.5/hyphen-2.5.tar.gz/download hyphen-2.5.tar.gz
+
Hyphen library (based on libhnj from Raph Levien) uses a time optimized implementation of the original Liang's algorithm of TeX, and substring.pl conversion is a requirement of this implementation. You can download the latest version of substrings.pl conversion script from [http://sourceforge.net/projects/hunspell/files/Hyphen the Hyphen repository]. At the time of writing this was:
 +
http://sourceforge.net/projects/hunspell/files/Hyphen/2.5/hyphen-2.5.tar.gz/download hyphen-2.5.tar.gz
  
 
The script takes the following parameters: the input file name, the output file name, the code-page setting and the LEFTHYPHENMIN and RIGHTHYPHEMIN values that define the minimum left and right length of hyphenated words.
 
The script takes the following parameters: the input file name, the output file name, the code-page setting and the LEFTHYPHENMIN and RIGHTHYPHEMIN values that define the minimum left and right length of hyphenated words.
Line 51: Line 52:
 
Official TeX hyphenation patterns are  
 
Official TeX hyphenation patterns are  
  
6. Package the converted hyphenation patterns in a OpenOffice.org extension and upload it to the OOo extension repository. Try extensively if it works properly in OpenOffice.org. If quality is
+
6. Package the converted hyphenation patterns in a OpenOffice.org extension and upload it to the OOo extension repository. Try extensively if it works properly in OpenOffice.org. If the patterns perform well and the patterns are licensed under LGPL, the patterns can make it into the vanilla OpenOffice.org.
  
 
*On the roadmap*
 
*On the roadmap*
Line 57: Line 58:
 
./substrings.pl hyph-sl.pat.txt hyph_sl_SI.dic UTF-8 2 2
 
./substrings.pl hyph-sl.pat.txt hyph_sl_SI.dic UTF-8 2 2
 
But that change will not affect older OOo version and such patterns will not work with version before 3.4. If you decide to make an extension version with hyphenation patterns for OOo in UTF-8, do not forget to set the required version of OpenOffice.org to 3.4 or higher!
 
But that change will not affect older OOo version and such patterns will not work with version before 3.4. If you decide to make an extension version with hyphenation patterns for OOo in UTF-8, do not forget to set the required version of OpenOffice.org to 3.4 or higher!
 +
 +
Further reading:

Revision as of 12:00, 23 July 2010

Summary: this document describes how to prepare TeX hyphenation patterns for OpenOffice.org.

Credits: written by Martin Srebotnjak, with some portions of text by László Németh and Mojca Miklavec.

OpenOffice.org uses Hyphen, part of the Hunspell project, as its hyphenation engine. The hyphenation files, packed as an OpenOffice.org extension from OpenOffice.org 3.0 onwards, consist of three files: the dictionary file (a text file with all the patterns; hyph_xx_YY.dic), the rules file (a text file with hyphenation rules for the language; hyph_xx_YY.idx) and the release notes (a text file with all the credits and licensing information; hyph_rel_notes_xx_YY.txt). The language descriptor xx_YY is used according to the following table:

Hyphen can use TeX hyphenation patterns, but because of differences between TeX hyphenation and Hyphen, the TeX patterns must first be converted. If the conversion is skipped, several problems can arise:

  • not all TeX patterns work in OpenOffice.org, so the unconverted patterns give lower-quality hyphenation;
  • if the code page is not set correctly, the TeX patterns can behave erratically in Hyphen.

Because of this, the following conversion process must be followed step by step:

1. Download up-to-date TeX hyphenation patterns

The TeX hyphenation repository contains up-to-date TeX hyphenation patterns. They are located here: http://tug.org/svn/texhyphen/trunk/hyph-utf8/

Example: for the Slovenian language, one would download the file hyph-sl.pat.txt from the SVN repository.

2. Convert the TeX hyphenation patterns file into the proper character set

Hyphen for OpenOffice.org (prior to version 3.4) uses ISO-8859-X code pages, while TeX hyphenation patterns are in UTF-8. The downloaded patterns must therefore be converted into the right ISO-8859-X code page.

Example: the Slovenian language uses the ISO-8859-2 code page, so one would open the UTF-8 file in a code-page-savvy text editor, convert it and save it in the ISO-8859-2 code page.
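For those who prefer a script over a text editor, the re-encoding in Step 2 can be sketched in Python. This is only a minimal illustration, not part of the Hyphen tooling: the function name convert_patterns and the output file name are made up for this example, and "iso8859_2" is Python's codec name for ISO-8859-2.

```python
from pathlib import Path


def convert_patterns(src, dst, target_encoding="iso8859_2"):
    """Re-encode a UTF-8 pattern file into a legacy code page.

    Hypothetical helper for illustration only. Encoding is strict, so a
    character that does not exist in the target code page raises
    UnicodeEncodeError instead of being silently replaced.
    """
    text = Path(src).read_text(encoding="utf-8")
    Path(dst).write_bytes(text.encode(target_encoding))


if __name__ == "__main__":
    # File names are assumptions; adjust them to your own language.
    convert_patterns("hyph-sl.pat.txt", "hyph-sl.pat.iso.txt")
```

Failing loudly on unrepresentable characters is a deliberate choice here: a pattern that lost a letter during conversion would be hard to notice later.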

3. Run the substrings.pl conversion script

The Hyphen library (based on libhnj by Raph Levien) uses a time-optimized implementation of Liang's original algorithm from TeX, and the substrings.pl conversion is a requirement of this implementation. You can download the latest version of the substrings.pl conversion script from the Hyphen repository. At the time of writing this was:

http://sourceforge.net/projects/hunspell/files/Hyphen/2.5/hyphen-2.5.tar.gz/download hyphen-2.5.tar.gz

The script takes the following parameters: the input file name, the output file name, the code-page setting, and the LEFTHYPHENMIN and RIGHTHYPHENMIN values, which define the minimum number of characters that must remain to the left and to the right of a hyphenation point.

Example: for Slovenian the ISO-8859-2 code page is used and the left and right hyphenmin values are 2. So one would use:

./substrings.pl hyph-sl.pat.txt hyph_sl_SI.dic ISO8859-2 2 2

Warning: note that ISO8859-2 is used and not ISO-8859-2; remember to omit the first hyphen in the ISO-8859-2 code-page name!
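To make the roles of the patterns and of the two hyphenmin values more concrete, here is a toy Python sketch of Liang-style pattern matching. It is not the Hyphen library's implementation and does not reproduce what substrings.pl does; the function names and the three patterns in the usage section are invented for illustration.

```python
import re


def parse_pattern(pattern):
    """Split a TeX pattern such as '2na.' into its letters and gap values."""
    letters = re.sub(r"\d", "", pattern)
    values = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            values[pos] = int(ch)  # digit sits in the gap before letter `pos`
        else:
            pos += 1
    return letters, values


def hyphenate(word, patterns, left_min=2, right_min=2):
    """Toy Liang matcher: an odd value in a gap allows a break there."""
    table = dict(parse_pattern(p) for p in patterns)
    dotted = "." + word.lower() + "."  # dots mark the word boundaries
    scores = [0] * (len(dotted) + 1)
    for start in range(len(dotted)):
        for end in range(start + 1, len(dotted) + 1):
            values = table.get(dotted[start:end])
            if values:
                for offset, value in enumerate(values):
                    idx = start + offset
                    scores[idx] = max(scores[idx], value)
    # The gap before word[k] is the gap before dotted[k + 1].
    breaks = [k for k in range(left_min, len(word) - right_min + 1)
              if scores[k + 1] % 2 == 1]
    pieces, last = [], 0
    for k in breaks:
        pieces.append(word[last:k])
        last = k
    pieces.append(word[last:])
    return "-".join(pieces)


if __name__ == "__main__":
    toy_patterns = ["1ba", "1na"]  # made-up patterns, not real TeX data
    print(hyphenate("banana", toy_patterns))               # ba-na-na
    print(hyphenate("banana", toy_patterns, right_min=3))  # ba-nana
    print(hyphenate("banana", toy_patterns + ["2na."]))    # ba-nana
```

The second call shows a hyphenmin value suppressing a break that the patterns would otherwise allow; the third shows an even-valued pattern doing the same for one specific position.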

4. Add hyphenation rules for special characters

OpenOffice.org treats special characters (apostrophe, hyphen, en dash, em dash ...) as word characters, but its hyphenation does not treat them as boundary characters, which is incompatible with the TeX boundary hyphenation patterns. This can result in bad hyphenation of words that contain hyphens and other special characters. Please consider adding the following lines at the end of the converted hyphenation patterns file (.dic):

8-8
8a8-8
8b8-8
8c8-8
...
-a8
-b8
-c8
...
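The extra lines follow a regular shape, so they could be generated rather than typed by hand. The sketch below assumes that the elided entries simply continue the visible ones for every letter of the language's alphabet; that reading is an assumption, and the function name special_char_rules is made up for this example. Check the output against the needs of your language before appending it.

```python
def special_char_rules(alphabet, special="-"):
    """Generate extra pattern lines for one special character.

    Assumption: the elided lines repeat the visible '8a8-8' and '-a8'
    shapes for each letter in `alphabet`.
    """
    rules = ["8" + special + "8"]
    rules += ["8" + letter + "8" + special + "8" for letter in alphabet]
    rules += [special + letter + "8" for letter in alphabet]
    return rules


if __name__ == "__main__":
    # A shortened alphabet, just to show the shape of the output.
    for line in special_char_rules("abc"):
        print(line)
```

When appending the generated lines to the converted file, write them in the same code page as the rest of that file.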

4. Create the appropriate rules file

      • missing in action

5. Create the appropriate release notes

Official TeX hyphenation patterns are

6. Package the converted hyphenation patterns in an OpenOffice.org extension and upload it to the OOo extension repository. Test extensively that it works properly in OpenOffice.org. If the patterns perform well and are licensed under the LGPL, they can make it into vanilla OpenOffice.org.

On the roadmap

With OpenOffice.org 3.4, support for UTF-8 patterns will be introduced, which makes Step 2 (above) obsolete and changes the conversion line from Step 3 into:

./substrings.pl hyph-sl.pat.txt hyph_sl_SI.dic UTF-8 2 2

That change will not affect older OOo versions, however, and such patterns will not work with versions before 3.4. If you decide to make an extension with UTF-8 hyphenation patterns for OOo, do not forget to set the required version of OpenOffice.org to 3.4 or higher!
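The legacy and UTF-8 invocations differ only in the encoding argument, so a small Python helper can build either command line. This is a sketch under stated assumptions: the helper name substrings_argv is invented, and dropping the first hyphen for every ISO-8859-X name is a generalisation of the ISO-8859-2 warning above, not something verified for other code pages.

```python
def substrings_argv(src, dst, encoding="UTF-8", left_min=2, right_min=2,
                    script="./substrings.pl"):
    """Build the argument list for a substrings.pl run.

    Assumption: every ISO-8859-X name must be written without its first
    hyphen (ISO8859-X), as the text describes for ISO-8859-2.
    """
    enc = encoding.upper()
    prefix = "ISO-8859-"
    if enc.startswith(prefix):
        enc = "ISO8859-" + enc[len(prefix):]
    return [script, src, dst, enc, str(left_min), str(right_min)]


if __name__ == "__main__":
    # Pre-3.4 style (legacy code page) and 3.4+ style (UTF-8).
    print(" ".join(substrings_argv("hyph-sl.pat.txt", "hyph_sl_SI.dic", "ISO-8859-2")))
    print(" ".join(substrings_argv("hyph-sl.pat.txt", "hyph_sl_SI.dic")))
```

The returned list can be passed to subprocess.run(argv, check=True) from the directory that contains the script.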

Further reading:
