Difference between revisions of "User:LA2"

From Apache OpenOffice Wiki

Revision as of 00:13, 21 December 2006

LA2 is the username for Lars Aronsson, Sweden, also known from Wikipedia, Project Runeberg, and other projects.

Useful links

Diary

December 20, 2006: As I observed the other day, writely.com automatically detects which language I'm writing in and applies the right spelling dictionary. Apparently, Microsoft Word has a similar feature. Why have I never seen this feature in free software for spell checking? A quick web search indicates that I'm not the first to ask this question. Suggested solutions include trying an ad hoc list of common words, prefixes and suffixes from each language, or sampling trigrams. There is also an attempt at Bayesian language detection. Nothing indicates that the creators of these three approaches are familiar with Zipf's law. Looking for "you" and "me" is not optimal when "the" is known to be the most common word in English texts. The Bayesian filter approach probably comes closest to the optimum, but in an overly complicated way.

If we were to look for a single word from each language, it should be the most common word, which is "the" for English, "der" for German and "och" for Swedish. Since 7% of all English words are "the", we would need on average 14 words (1/0.07) to find one occurrence of "the". In Swedish, which mostly marks definiteness with a suffix rather than a separate article (the, der, le, los), the most common word is "och" (and), which makes up 3.8% of all words, so we'd need a sample of 26 words (1/0.038) to expect one occurrence of "och".

In order to determine the language from a shorter sample, we could add more words from the top of each language's frequency list. The 10 most frequent words in Swedish (och, i, att, en, av, som, den, till, med, på) together account for 14% of the words in any text corpus. The top 20 words (adding det, för, de, han, är, ett, sig, så, jag, var) account for 21%, the top 50 words for 29%, and the top 100 words for 36%. Some of these words also appear in English but aren't as frequent there, so they are better predictors for Swedish than for English. For each language, a weighted prediction score can be computed. Finding an "I" adds more to the Swedish score than to the English. These top-ranking word frequencies haven't changed much since Shakespeare, so a C program should be able to keep a static table of the weights for 100 words in source code. The downloadable text of Wikipedia and the written OpenOffice.org documentation could be used as a corpus for sampling the weights.
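
The weighted score could be sketched roughly as follows. This is a Python sketch rather than the C program mentioned, and most of the weights are assumptions: only the 7% for "the" and the 3.8% for "och" come from the figures in this entry. Every other number, and the names WEIGHTS, tokenize and guess_language, are made up for illustration and would have to be sampled from a real corpus.

```python
import re

# Static table of word weights (relative frequencies) per language.
# Only "the" (7%) and "och" (3.8%) come from the diary entry;
# all other values are placeholder assumptions.
WEIGHTS = {
    "en": {"the": 0.070, "of": 0.035, "and": 0.030, "to": 0.025, "i": 0.005},
    "sv": {"och": 0.038, "i": 0.025, "att": 0.020, "en": 0.018, "av": 0.012},
}


def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def guess_language(text):
    """Return the language with the highest summed weight, or None
    if no word of the sample appears in any table."""
    words = tokenize(text)
    scores = {
        lang: sum(table.get(word, 0.0) for word in words)
        for lang, table in WEIGHTS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_language("The cat sat on the mat"))
print(guess_language("Jag och du i en bil"))
```

With these assumed weights, a lone "i" adds 0.025 to the Swedish score and only 0.005 to the English one, which is the asymmetry described above. A sample with no table words at all gives no answer, so short samples may need a longer table.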

The paper by Géza Németh and Csaba Zainkó, Multilingual Statistical Text Analysis, Zipf's Law and Hungarian Speech Generation, explains that to cover 97.5% of an English corpus, you need a dictionary of 20,000 unique word forms, but to achieve the same coverage in German you need 80,000 word forms and for Hungarian 400,000 word forms. This is an important observation, as it sets a goal for how large spell checker dictionaries need to be. Presumably, Finnish and Estonian have the same characteristics as Hungarian, while the Scandinavian languages (Swedish, Danish, Norwegian) take a middle position between that and German. Our current dictionaries contain many specialized words that aren't very common. These words add to the dictionary's footprint without contributing as much to its coverage of a typical corpus as they should. For example, for each word we add in its basic form, we also try to include every other form of that word, even though those forms aren't very common. What we need for German are the 80,000 most common word forms, but when we add the really common word Haus, we also tend to add the less common Hauses, Häuser and Häusern. Perhaps we should aim for 80,000 basic forms (this is very close to what Björn Jacke has, see below) instead of 80,000 variations (he has 300,000 variations).
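
The coverage figure can be measured on any corpus by counting word forms and adding them up from the most frequent downwards. A minimal sketch, assuming the corpus is already split into tokens; the function name forms_needed and the toy corpus are illustrative and not taken from the paper.

```python
from collections import Counter


def forms_needed(tokens, target=0.975):
    """Return how many of the most frequent distinct word forms are
    needed to cover at least `target` of all tokens in the corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    for n, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered >= target * total:
            return n
    return len(counts)


# Toy corpus: 10 tokens, 4 distinct word forms.
corpus = "the the the the the the cat cat dog mat".split()
print(forms_needed(corpus, 0.5))
print(forms_needed(corpus))
```

Run against a large corpus and a dictionary's expanded word list, the same loop would show how much coverage each additional word form actually buys.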

December 18, 2006: I update the Nordic Words page and publish a version of my own Swedish dictionary (ss100.txt) with 221,599 words. I also do a little survey of the spell checking support in various programs:

Google's writely.com is a web word processor. It has a built-in spell checker that automatically recognizes the language. Its Swedish spell checker behaves exactly like OpenOffice.org 2.0.4, which indicates the same Ispell/Aspell/Myspell/Hunspell dictionary is used for Swedish (Göran Andersson's dictionary from 2003). When I pasted the words from my test page, there were so many errors that the spell checker automatically shut down and I had to turn it back on manually.

The Opera web browser (I tried version 9.10) has a built-in spell checker for web forms. The user interface is a bit old-fashioned, in that it doesn't underline the errors, but uses a dialog window that steps through the web form. On Apple's Mac OS X it uses the system's built-in spell checker, but on all other platforms it requires the user to install GNU Aspell.

The word processor Abiword has built-in spell checking support. The user interface underlines any errors. The Swedish spell checking is apparently based on Göran Andersson's 2003 dictionary, although I cannot find out which software it uses (GNU Aspell, ispell or Myspell).

KDE's editors kate and kwrite have built-in spell checking support, apparently based on GNU Aspell. The user interface doesn't underline errors, but provides a dialog window that steps through the text one word at a time.

Note to self: I should take a closer look at Freedict.de. Where do these dictionaries really come from? Are they maintained?

A look at the German ispell dictionaries by Björn Jacke:

          Occurrences            Affix
Nov. 2005  Feb. 2003  Nov. 1999  flag   Usage
 -------    -------    -------   ----   ----------------------
  79681      81191      75748           Basic forms
 307261     308860     294897           Unique words
 -------    -------    -------   ----   ----------------------
  13755      13933      11257     /S    Genitive -s
  11815      11723      11397     /A    Adjective inflexion
   8848       9319      10070     /P    Plural -en
   8166       8374       8048     /N    Plural -n
   7367       7346       7004     /D    Participle -d
   6837       6828       6595     /I    Regular verbs, present tense
   6620       6611       6358     /X    Regular verbs, present tense
   5310       5303       5140     /Y    Regular verbs, past tense
   4315       4578       4525     /T    Genitive -es
   4189       4406       4118     /E    Plural -e
   2066       2061       1991     /O    Participle inflexion
   1999       1971         82     /J    -ung and inflexions
   1846       1840       1813     /C    Adjective comparison
   1580       1656       1636     /p    Irregular plurals
   1452       1047        831     /F    -in and inflexions
    721        719        615     /Z    Non-regular verbs, past tense
    672        665        619     /U    Prefix un-
    619        620        615     /V    Prefix ver-
    574        569        486     /B    -bar and inflexions
    497        492        138     /W    Imperatives
    289        310        280     /R    Plural -er
    235        251        250     /Q    Plural -sse
    206        206        208     /G    Prefix ge-
     64         68         65     /q    Plural -sse, special case for feminines
     57         56         44     /M    -chen and inflexions
     20         21         22     /f    Words ending in -ph can also have -f
     18         17         19     /L    -lich and inflexions
      4          4          4     /H    -heit and inflexions
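
The first two rows of the table give the average number of unique words generated per basic form, which can be checked with a few lines. The counts are copied from the table; the variable names are only for illustration.

```python
# (basic forms, unique words) per release, copied from the table above.
releases = {
    "Nov. 2005": (79681, 307261),
    "Feb. 2003": (81191, 308860),
    "Nov. 1999": (75748, 294897),
}

# Average number of unique words that one basic form expands to.
ratios = {name: unique / basic for name, (basic, unique) in releases.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} unique words per basic form")
```

In all three releases, each basic form expands to a little under four unique words on average.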

December 15, 2006: Two Danish OpenOffice developers meet with CST, Center for Sprogteknologi, a commercial provider of Danish dictionaries, to discuss how to improve the Danish spell checking dictionary for OpenOffice. Brief report on the 'dansk' mailing list. To me it seems unlikely that any useful solution would be found this way.

December 11, 2006: Version 2.0.4 of OpenOffice.org has auto corrections (AutoKorrigeringar) for Swedish, based on a static list of about 100 word pairs, e.g. HJE -> hej, MEDECIN -> medicin. Where do they come from? They're not part of the spelling dictionary. There are also word pairs for Danish and German (both have longer lists), but none for Norwegian.

Firefox 2.0 offers spell checking for web forms (e.g. wiki editing). There is a Swedish spelling dictionary by Hasse Wallanger, based on the Swedish Myspell dictionary of August 14, 2003 ("baserad på den svenska ordlistan från 20030814 för Myspell"). It behaves a little differently than the Swedish spell checker in OpenOffice 2.0.4, in that it allows free concatenation of words. It also only spell checks an initial fraction of a web form. This is explained in bug 360434. Type about:config in the URL field and look for the variable extensions.spellcheck.inline.max-misspellings, which defaults to 500. Double-click on this value and change it to a much higher value, e.g. 15000.

December 7, 2006: I think we need a test case for the Swedish spell checking that is separate from the development of the dictionary. As a pilot test, I'm starting a subpage /Test av stavningskontrollen. Göran Andersson publishes version 1.22 of DSSO.

December 2, 2006: I sign up for various OpenOffice mailing lists, and this wiki. What takes me here is the poor spell checking support for Swedish in OpenOffice 2.0.2. The spelling dictionary is version 1.3.8 from sv.speling.org, which hasn't been updated since March 2002. It only contains 24490 words (basic forms), some of which are misspelled. The myspell affix file seems to have been automatically converted from the ispell affix file.

Timeline

  • November 25, 2006: Göran Andersson publishes version 1.19 of DSSO. Version 1.21 follows on December 1.
  • 2006: The Swedish Academy publishes the 13th edition of SAOL.
  • 2005: Volume 34 of SAOB ends at Tojs. The full work is expected to be completed in 2017.
  • January 2005: Project Runeberg's OCR spelling dictionaries for Swedish and Danish are published within Nordic Words.
  • April 2004: Public editing of susning.nu is closed. The user community migrates to the Swedish Wikipedia.
  • March 6, 2003: My posting Svensk ordlista on the SSLUG-LOCALE mailing list.
  • February 2003: On the Swedish web forum Gnuheter, I ask around for a business case for a Swedish dictionary (Affärsmodeller och fritt innehåll) without getting any useful answers.
  • 2003: Göran Andersson takes back control of the Swedish spelling dictionary, now dsso.se, dissatisfied with some modifications made to it during the time it was at sv.speling.org.
  • May 6, 2002: I join the sslug-locale mailing list for speling.org.
  • 2002-2003: I digitize two editions (58 volumes) of the classic Swedish encyclopedia Nordisk familjebok (1876-1926). This is more food for word frequencies and spelling dictionaries.
  • October 2001: I start susning.nu, a Swedish wiki, which grows very fast. As a spinoff I return to computing word frequencies and compiling my own spelling dictionary.
  • January 29, 1998: Göran Andersson hands over his Swedish ispell dictionary (now version 1.2.1) to sv.speling.org.
  • September 26, 1997: Göran Andersson's ispell dictionary version 1.2 accepts compound words. The list has 24082 basic forms, expanding to 117617 unique words.
  • February 23, 1997: Göran Andersson's ispell dictionary version 1.1 has 24722 basic forms, expanding to 84740 unique words.
  • January 15, 1997: Göran Andersson's ispell dictionary version 1.0 has 27737 basic forms, expanding to 76364 unique words. The brand new affix file is based on inspiration from a Danish affix file.
  • November 1996: Within Project Runeberg, the subproject "Nordic Words" is started, maintained by Anders Brun. No updates are made after December 1997.
  • 1993: The Swedish Academy introduces computers in editing SAOB.
  • December 1992: I start Project Runeberg, the Scandinavian e-text archive.
  • 1991-1993: I experiment with spelling dictionaries for spell and ispell.
  • 1970s-1980s: A Swedish morphological spellchecker "stava" is developed at FOA/QZ in Stockholm. Traces of this might be available at KTH. Viggo Kann would know. Several later Swedish spell checkers with the same name exist. Various dictionaries float around. Linguists have access to proprietary lists for research purposes, and are not interested in creating "open content".
  • 1970: Major Swedish newspapers abandon plural forms of verbs.
  • 1968: The polite use of "Ni" (You/Sie) is replaced with simple "du" (you/du), making Swedish conversation as simple as English.
  • 1960s: A young computational linguist Sture Allén uses paper tape from newspaper typesetters to compute word frequencies of the Swedish language. Laying the foundation for the Språkdata department at Gothenburg University, he later becomes secretary of the Swedish Academy.
  • 1945: Swedish public schools make plural endings of verbs optional. Students who opt not to use them must indicate this and then stick to their chosen style.
  • 1910: The polite use of "Ni" (You/Sie) is introduced in Swedish as a replacement for complicated titles, making Swedish conversation as simple as German.
  • 1907: A major Swedish spelling reform does away with the combinations dt, fv, and hv.
  • 1883: A new editor restarts the Academy's dictionary. The first fascicle is printed in 1893 and the first volume of "Svenska Akademiens Ordbok" (SAOB) is completed in 1898. The dictionary documents Swedish spelling since 1526.
  • 1874: The Swedish Academy publishes a spelling dictionary in one volume, "Svenska Akademiens Ordlista" (SAOL). Its 6th edition (1889) and 8th edition (1923) are out of copyright.
  • 1786: The Swedish Academy is founded by King Gustav III. One of its main tasks is to compile a dictionary of the Swedish language. Work begins immediately, but stops as early as 1814. New attempts are started in 1834 and 1855. A fascicle for the letter "A" is published in 1870.
  • 1526: Sweden's Lutheran church reformer Olaus Petri translates the New Testament into Swedish. The Old Testament follows in 1541. His style of writing marks the beginning of modern Swedish orthography.
  • 9th Century A.D.: At about the same time as Iceland is settled by Norwegians, Sweden's longest runic inscription, the Rök runestone, is carved. Runes are Scandinavian letters inspired by Greek/Latin alphabets but adapted for carving in stone or wood. Two different runic alphabets were used between c. 500 and 1000 A.D., the first with 24 letters, later simplified to one with 16 letters. With the introduction of Christianity around 1000 A.D., runes are gradually replaced with Latin script.