Difference between revisions of "User:LA2"

Revision as of 07:01, 7 January 2007

LA2 is the username for Lars Aronsson, Sweden, also known from Wikipedia, Project Runeberg, and other projects.

Useful links

Swedish spelling dictionaries: sv.speling.org — dsso.se — Nordic Words — Viggo Kann
OpenOffice projects: Lingucomponent — Localization — Native-lang — Danish (wiki) — Norwegian (wiki) — Swedish (wiki)
Grammar checking: LanguageTool — Summer of Code 2006 — Granska

Diary

January 6, 2007: Wikipedia history diff as a revision corpus, summary of experiments by Marcin Miłkowski, after we discussed this on the dev@lingucomponent mailing list. "In short, it seems that Lars' idea was brilliant".

December 30, 2006: Here's an experiment in coverage. One classic Swedish text is Nils Holgerssons underbara resa genom Sverige, by Nobel Prize winner Selma Lagerlöf, at the same time a geography textbook and a novel. The text at Project Runeberg is in modern spelling (post 1906), but uses plural forms of verbs (pre 1970). The text with HTML markup removed contains 198148 words or 1091951 characters. I'm assuming the spelling is all correct. Passing this through aspell with the 2003 Swedish dictionary (aspell -l sv list) outputs 6741 words as not recognized, or 3.4 percent of all words in the text body. This means the Aspell dictionary covers 96.6 percent of the text, which actually isn't so bad. Further filtering through my own dictionary of old spellings (including plural verbs) removes another 1842 words or 0.93 percent, increasing the coverage to 97.5 percent. If I instead switch to my own main dictionary, coverage increases only marginally. Apparently, the improvements I feel I have made don't really matter if you are spell checking Nils Holgersson.

 $ cat k*.html | wc -w
 198148
 $ cat k*.html | aspell -l sv list | wc -w
 6741
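
The coverage percentages follow directly from these two word counts. A minimal sketch of the arithmetic, assuming a POSIX awk is installed; the function name coverage is made up for illustration, and 4899 is simply 6741 minus the 1842 old-spelling words:

```shell
# coverage TOTAL UNKNOWN
# Print the percentage of TOTAL words that were not reported as unknown,
# rounded to one decimal.
coverage() {
    awk -v total="$1" -v unknown="$2" \
        'BEGIN { printf "%.1f\n", 100 * (total - unknown) / total }'
}

coverage 198148 6741    # Aspell 2003 dictionary alone: prints 96.6
coverage 198148 4899    # minus 1842 old-spelling words: prints 97.5
```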

The 4883 words that remain unrecognized after combining my own dictionaries (2.5 % of the text, i.e. 97.5 % coverage) are 3670 unique word forms, of which 2991 appear only once, 433 appear twice, 129 appear three times and 117 appear four or more times. If I added these 117 word forms to my dictionary, that would cover another 639 words or 0.32 percent of this text, pushing the coverage to 97.8 percent.

It turns out some of those words shouldn't be added to a dictionary because they are names of fictional characters that only appear in this book. A select few spelling errors are also found and will be corrected. Of the unrecognized words, many are minor variations (in case and punctuation) that are covered by just adding one word to the dictionary. After some work, my coverage is up to 98.49 percent, leaving 2991 words unrecognized, being 2674 unique word forms of which 2509 appear only once and 111 appear twice.
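
The tally of unique word forms by number of occurrences can be produced with a short pipeline. This is only a sketch: the function name freq_of_freq is invented here, and it assumes the unrecognized words arrive one per line, the way aspell list prints them.

```shell
# freq_of_freq: read one word per line on stdin and print, for each
# occurrence count, how many distinct word forms have that count.
# Output lines look like "<occurrences> <number of word forms>".
freq_of_freq() {
    sort | uniq -c | awk '{ forms[$1]++ }
        END { for (n in forms) print n, forms[n] }' | sort -n
}

# Toy input: two forms occur once, one occurs twice, one three times.
printf '%s\n' Akka Smirre Gorgo Gorgo Morten Morten Morten | freq_of_freq
```

On the real data the input would be something like the output of cat k*.html | aspell -l sv list, and the first output line would give the number of word forms that appear only once.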

December 21, 2006: Apparently the Swedish spell checker in Microsoft Word 6.0 accepts the following misspelled words from my test page: andledning, andvänd, andvändning, bakrund, ballett, diskusanalys, finlandsvensk, finness, fiskeläger, följetång, företeckning, förmögenhetskatt, hårddraget, innerbär, jämnlik, kolrot, Lindköping, lösensumma, majonäs, model, modellbetäckning, parantes, situationstecken, stadsbesök, stadschef, terass, trilogi, vädersträck, överrens

And Microsoft Word 2003 accepts these errors: andledning, alvarlig, andvändare, ballett, Ceasar, diskusanalys, europisk, fiskeläger, frisörsalong, följetång, företeckning, förmögenhetskatt, grejor, hårddraget, interesse, krigsföring, landsbyggd, Lindköping, lösensumma, mediespelare, modellbetäckning, parantes, San Fransisco, sattelit, situationstecken, stadsbesök, stadschef, Stockolm, Storbrittanninen, tabblett, tipps, utryck, våldtäckt, vädersträck, ytterliggare, åldersbestigna, överrens.

December 20, 2006: As I observed the other day, writely.com automatically understands which language I'm using, and applies the right spelling dictionary. Apparently, Microsoft Word has a similar feature. Why have I never seen this feature in free software for spell checking? A quick web search indicates that I'm not the first to ask this question. Suggested solutions include trying an ad hoc list of common words, prefixes and suffixes from each language, or sampling trigrams. There is also an attempt at Bayesian language detection. Nothing indicates that the creators of these three approaches are familiar with Zipf's law. Looking for "you" and "me" is not optimal, when in fact "the" is known to be the most common word in English texts. The Bayesian filter approach probably comes closest to the optimum, but in an overly complicated way.

If we were to look for a single word from each language, that would be the most common word, which is "the" for English, "der" for German and "och" for Swedish. Since 7% of all English words are "the", we would need on average 14 words to find one occurrence of "the". In Swedish, which usually marks definiteness with a noun suffix rather than a separate article (the, der, le, los), the most common word is "och" (and), which makes up 3.8% of all words, so we'd need a sample of 26 words to expect one occurrence of "och". In order to determine the language from a shorter sample, we could throw in some more words from each language, from the top of the frequency listing. The 10 most frequent words in Swedish (och, i, att, en, av, som, den, till, med, på) together account for 14% of the words in any text corpus. The top 20 words (adding det, för, de, han, är, ett, sig, så, jag, var) account for 21%, the top 50 words account for 29%, and the top 100 words account for 36%.

Some of these words also appear in English but aren't so frequent there, so they are better predictors for Swedish than for English. For each language, a weighted prediction score can be computed. Finding an "i" (Swedish for "in") adds more to the Swedish score than to the English one, where it only matches the pronoun "I". These top ranking word frequencies haven't changed much since Shakespeare, so a C program should be able to keep a static table of the weights for 100 words in source code. The downloadable text of Wikipedia and the written OpenOffice.org documentation could be used as a corpus for sampling the weights.
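
To make the weighted score concrete, here is a rough sketch as a shell pipeline rather than the C program suggested above. The function names words_needed and guess_language are invented for illustration. Only the weights for "the" (7.0) and "och" (3.8) are taken from the figures in this entry; every other weight is a placeholder that would have to be sampled from a real corpus.

```shell
# words_needed PERCENT: expected sample size (in words) before a word
# that makes up PERCENT of a text turns up once, i.e. 100 / PERCENT.
words_needed() {
    awk -v p="$1" 'BEGIN { printf "%.0f\n", 100 / p }'
}

# guess_language: read text on stdin, print the best-scoring language.
# Lowercasing via tr only affects ASCII letters, which is enough for
# the small table below.
guess_language() {
    tr -s '[:space:][:punct:]' '\n' | tr '[:upper:]' '[:lower:]' | awk '
        BEGIN {
            # Weights are percentages of running text. PLACEHOLDERS,
            # except "the" = 7.0 and "och" = 3.8.
            w["en","the"] = 7.0; w["en","and"] = 3.0; w["en","of"] = 3.0
            w["en","i"]   = 0.5
            w["sv","och"] = 3.8; w["sv","i"]   = 2.5; w["sv","att"] = 2.0
            w["sv","en"]  = 1.5
            w["de","der"] = 3.0; w["de","und"] = 3.0; w["de","die"] = 3.0
            nlang = split("en sv de", lang, " ")
        }
        {
            # Each input line is one word; add its weight per language.
            for (k = 1; k <= nlang; k++)
                if ((lang[k], $1) in w)
                    score[lang[k]] += w[lang[k], $1]
        }
        END {
            best = "unknown"; max = 0
            for (k = 1; k <= nlang; k++)
                if (score[lang[k]] > max) {
                    max = score[lang[k]]; best = lang[k]
                }
            print best
        }'
}

words_needed 7      # prints 14
words_needed 3.8    # prints 26
echo 'Nils Holgersson och gässen flyger i en vid båge' | guess_language  # prints sv
```

A plain sum of weights is the simplest possible score. With a full table of 100 words per language it would probably be worth normalizing by sample length, but that is beyond this sketch.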

The paper by Géza Németh and Csaba Zainkó, Multilingual Statistical Text Analysis, Zipf's Law and Hungarian Speech Generation, explains that to cover 97.5% of an English corpus, you need a dictionary of 20,000 unique word forms, but to achieve the same coverage in German you need 80,000 word forms and for Hungarian 400,000 word forms. This is an important observation, as it sets a goal for how large spell checker dictionaries need to be. Presumably, Finnish and Estonian have the same characteristics as Hungarian, while the Scandinavian languages (Swedish, Danish, Norwegian) take a middle position between that and German. Our current dictionaries contain many specialized words that aren't very common. These words add to the dictionary's footprint without contributing much to its coverage of a typical corpus. For example, for every word we add in its basic form, we also try to include every other form of that word, even though those forms aren't very common. What we need for German are the 80,000 most common word forms, but when we add the really common word Haus, we also tend to add the less common Hauses, Häuser and Häusern. Perhaps we should aim for 80,000 basic forms (this is very close to what Björn Jacke has, see below) instead of 80,000 variations (he has 300,000 variations).

I made some tests on a corpus of 17.99 million Swedish words from proofread texts in Project Runeberg. Many of these words use old spelling (and describe old concepts) and won't be found in modern dictionaries, so this is not a perfect test case for contemporary spell checking dictionaries. When I run this corpus through my own dictionaries, which do contain some words in old spelling, it leaves a remainder of 1.482 million words or 8.2% of the corpus, meaning that I now have 91.8% coverage of this corpus. If I combine my old spelling component with the existing Aspell dictionary (Göran Andersson's from 2003), it leaves a remainder of 1.837 million words or 10.4% of the corpus, meaning 89.6% coverage. So my progress beyond Göran's dictionary is indeed very small. Coverage of around 90% can be achieved for German with a dictionary of the 20,000 most common word forms, which can be compared to the 24,000 basic forms in Göran Andersson's 2003 dictionary. Even though my dictionary has many additional word forms, their contribution to the corpus coverage isn't very large.

Corpus coverage %	Required number of word forms	Comment
3.44	1	och
5	2	i
10	6	att, en, av, som
15	10	den, till, med, på
20	17	det, för, de, han, är, ett, sig
25	30
30	53
35	90
40	151
45	260
50	451
55	795
60	1387
65	2415
70	4227
75	7452
80	13606

The long tail
Corpus coverage %	Required number of word forms
85	26544
86	30731
87	35800
88	42026
89	49735
90	59402
91	71767
92	87837
93	109094
94	137919
95	178319
96	234459
97	320515
98	453458
99	633358
100	812979
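
A table like the two above can be derived from a plain word list by ranking word forms by frequency and accumulating their counts. The following is a sketch only: the function name forms_for_coverage is invented, the input is assumed to be one word per line, and the requested percentages must be given in ascending order.

```shell
# forms_for_coverage PCT...: read words (one per line) on stdin and print,
# for each requested coverage percentage, the smallest number of most
# frequent word forms whose occurrences add up to that share of the text.
forms_for_coverage() {
    sort | uniq -c | sort -rn | awk -v targets="$*" '
        { count[NR] = $1; total += $1 }   # counts arrive in rank order
        END {
            n = split(targets, pct, " ")
            t = 1; seen = 0
            for (i = 1; i <= NR && t <= n; i++) {
                seen += count[i]
                # One rank may satisfy several thresholds at once.
                while (t <= n && 100 * seen >= pct[t] * total) {
                    print pct[t], i
                    t++
                }
            }
        }'
}

# Toy corpus of 10 words: "a" x5, "b" x3, "c" x1, "d" x1.
printf '%s\n' a a a a a b b b c d | forms_for_coverage 50 80 100
```

For the toy corpus this prints "50 1", "80 2" and "100 4": one form covers half the text, two forms cover 80%, and all four are needed for full coverage.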

December 19, 2006: On the dev@lingucomponent list, Kevin Scannell discusses how to use precision and recall metrics for spell checkers.

December 18, 2006: I update the Nordic Words page and publish a version of my own Swedish dictionary (ss100.txt) with 221,599 words. I also do a little survey of the spell checking support in various programs:

Google's writely.com is a web word processor. It has a built-in spell checker that automatically recognizes the language. Its Swedish spell checker behaves exactly like OpenOffice.org 2.0.4, which indicates that the same Ispell/Aspell/Myspell/Hunspell dictionary is used for Swedish (Göran Andersson's dictionary from 2003). When I pasted the words from my test page, there were so many errors that the spell checker automatically shut down and I had to turn it back on manually.

The Opera web browser (I tried version 9.10) has a built-in spell checker for web forms. The user interface is a bit old-fashioned, in that it doesn't underline the errors, but uses a dialog window that steps through the web form. On Apple's Mac OS X it uses the system's built-in spell checker, but on all other platforms it requires the user to install GNU Aspell.

The word processor Abiword has built-in spell checking support. The user interface underlines any errors. The Swedish spell checking is apparently based on Göran Andersson's 2003 dictionary, although I cannot find out which software it uses (GNU Aspell, ispell or Myspell).

KDE's editors kate and kwrite have built-in spell checking support, apparently based on GNU Aspell. The user interface doesn't underline errors, but provides a dialog window that steps through the text one word at a time.

Note to self: I should take a closer look at Freedict.de. Where do these dictionaries really come from? Are they maintained?

A look at the German ispell dictionaries by Björn Jacke:

          Occurrences            Affix
Nov. 2005  Feb. 2003  Nov. 1999  flag   Usage
 -------    -------    -------   ----   ----------------------
  79681      81191      75748           Basic forms
 307261     308860     294897           Unique words
 -------    -------    -------   ----   ----------------------
  13755      13933      11257     /S    Genitive -s
  11815      11723      11397     /A    Adjective inflexion
   8848       9319      10070     /P    Plural -en
   8166       8374       8048     /N    Plural -n
   7367       7346       7004     /D    Participle -d
   6837       6828       6595     /I    Regular verbs, present tense
   6620       6611       6358     /X    Regular verbs, present tense
   5310       5303       5140     /Y    Regular verbs, past tense
   4315       4578       4525     /T    Genitive -es
   4189       4406       4118     /E    Plural -e
   2066       2061       1991     /O    Participle inflexion
   1999       1971         82     /J    -ung and inflexions
   1846       1840       1813     /C    Adjective comparison
   1580       1656       1636     /p    Irregular plurals
   1452       1047        831     /F    -in and inflexions
    721        719        615     /Z    Non-regular verbs, past tense
    672        665        619     /U    Prefix un-
    619        620        615     /V    Prefix ver-
    574        569        486     /B    -bar and inflexions
    497        492        138     /W    Imperatives
    289        310        280     /R    Plural -er
    235        251        250     /Q    Plural -sse
    206        206        208     /G    Prefix ge-
     64         68         65     /q    Plural -sse, special case for feminines
     57         56         44     /M    -chen and inflexions
     20         21         22     /f    Words ending in -ph can also have -f
     18         17         19     /L    -lich and inflexions
      4          4          4     /H    -heit and inflexions
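
A per-flag tally like this one can be computed from the dictionary source with awk. This is a sketch under an assumption about the file format: each entry is taken to be a word optionally followed by a slash and its affix flags, written either as word/AB or word/A/B. The function name flag_counts and the sample entries are made up and not taken from Björn Jacke's files.

```shell
# flag_counts: read ispell-style entries on stdin and print
# "<count> <flag>" for every affix flag, most used flag first.
# Entries without a slash carry no flags and are skipped.
flag_counts() {
    awk -F/ '{
        # Field 1 is the word; every later field holds one or more
        # single-character flags.
        for (f = 2; f <= NF; f++)
            for (i = 1; i <= length($f); i++)
                n[substr($f, i, 1)]++
    }
    END { for (c in n) print n[c], c }' | sort -k1,1nr -k2,2
}

# Illustrative entries only.
printf '%s\n' Haus/T Kind/TR Katze/N und | flag_counts
```

The number of basic forms is simply the number of lines in the same file, which wc -l gives directly.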

December 15, 2006: Two Danish OpenOffice developers meet with CST, Center for Sprogteknologi, a commercial provider of Danish dictionaries, to discuss how to improve the Danish spell checking dictionary for OpenOffice. Brief report on the 'dansk' mailing list. To me it seems unlikely that any useful solution would be found this way.

December 11, 2006: Version 2.0.4 of OpenOffice.org has auto corrections (AutoKorrigeringar) for Swedish, based on a static list of about 100 word pairs, e.g. HJE -> hej, MEDECIN -> medicin. Where do they come from? They're not part of the spelling dictionary. There are also word pairs for Danish and German (both have longer lists), but none for Norwegian.

Firefox 2.0 offers spell checking for web forms (e.g. wiki editing). There is a Swedish spelling dictionary by Hasse Wallanger, based on the Swedish Myspell dictionary of August 14, 2003 ("baserad på den svenska ordlistan från 20030814 för Myspell"). It behaves a little differently from the Swedish spell checker in OpenOffice 2.0.4, in that it allows free concatenation of words. It also spell checks only an initial fraction of a web form, which is explained in bug 360434. Type about:config in the URL field and look for the variable extensions.spellcheck.inline.max-misspellings, which defaults to 500. Double click on this value and change it to a much higher value, e.g. 15000.

December 7, 2006: I think we need a test case for the Swedish spell checking, that is separate from the development of the dictionary. As a pilot test, I'm starting a subpage /Test av stavningskontrollen. Göran Andersson publishes version 1.22 of DSSO.

December 2, 2006: I sign up for various OpenOffice mailing lists, and this wiki. What brings me here is the poor spell checking support for Swedish in OpenOffice 2.0.2. The spelling dictionary is version 1.3.8 from sv.speling.org, which hasn't been updated since March 2002. It only contains 24490 words (basic forms), some of which are misspelled. The Myspell affix file seems to have been automatically converted from the ispell affix file.

Timeline of Swedish dictionaries

November 25, 2006: Göran Andersson publishes version 1.19 of DSSO. Version 1.21 follows on December 1.
2006: The Swedish Academy publishes the 13th edition of SAOL.
2005: Volume 34 of SAOB ends at Tojs. The full work is expected to be completed in 2017.
January 2005: Project Runeberg's OCR spelling dictionaries for Swedish and Danish are published within Nordic Words.
April 2004: Public editing of susning.nu is closed. The user community migrates to the Swedish Wikipedia.
March 6, 2003: My posting Svensk ordlista on the SSLUG-LOCALE mailing list.
February 2003: On the Swedish web forum Gnuheter, I ask around for a business case for a Swedish dictionary (Affärsmodeller och fritt innehåll) without getting any useful answers.
2003: Göran Andersson takes back control of the Swedish spelling dictionary, now dsso.se, dissatisfied with some modifications made to it during the time it was at sv.speling.org.
May 6, 2002: I join the sslug-locale mailing list for speling.org.
2002-2003: I digitize two editions (58 volumes) of the classic Swedish encyclopedia Nordisk familjebok (1876-1926). This is more food for word frequencies and spelling dictionaries.
October 2001: I start susning.nu, a Swedish wiki, which grows very fast. As a spinoff I return to computing word frequencies and compiling my own spelling dictionary.
January 29, 1998: Göran Andersson hands over his Swedish ispell dictionary (now version 1.2.1) to sv.speling.org.
September 26, 1997: Göran Andersson's ispell dictionary version 1.2 accepts compound words. The list has 24082 basic forms, expanding to 117617 unique words.
February 23, 1997: Göran Andersson's ispell dictionary version 1.1 has 24722 basic forms, expanding to 84740 unique words.
January 15, 1997: Göran Andersson's ispell dictionary version 1.0 has 27737 basic forms, expanding to 76364 unique words. The brand new affix file is based on inspiration from a Danish affix file.
November 1996: Within Project Runeberg, the subproject "Nordic Words" is started, maintained by Anders Brun. No updates are made after December 1997.
1993: The Swedish Academy introduces computers in editing SAOB.
December 1992: I start Project Runeberg, the Scandinavian e-text archive.
1991-1993: I experiment with spelling dictionaries for spell and ispell.
1970s-1980s: A Swedish morphological spellchecker "stava" is developed at FOA/QZ in Stockholm. Traces of this might be available at KTH. Viggo Kann would know. Several later Swedish spell checkers with the same name exist. Various dictionaries float around. Linguists have access to proprietary lists for research purposes, and are not interested in creating "open content".
1970: Major Swedish newspapers abandon plural forms of verbs.
1968: The polite use of "Ni" (You/Sie) is replaced with simple "du" (you/du), making Swedish conversation as simple as English.
1960s: A young computational linguist Sture Allén uses paper tape from newspaper typesetters to compute word frequencies of the Swedish language. Laying the foundation for the Språkdata department at Gothenburg University, he later becomes secretary of the Swedish Academy.
1945: Swedish public schools make plural endings of verbs optional. Students who opt not to use them must indicate this and then stick to their chosen style.
1939: At Easter, with fascicle 156, the Swedish Academy celebrates SAOB being halfway (A--K) completed.
1910: The polite use of "Ni" (You/Sie) is introduced in Swedish as a replacement for complicated titles, making Swedish conversation as simple as German.
1907: A major Swedish spelling reform does away with the combinations dt, fv, and hv.
1883: A new editor restarts the Academy's dictionary. The first fascicle is printed in 1893 and the first volume of "Svenska Akademiens Ordbok" (SAOB) is completed in 1898. The dictionary documents Swedish spelling since 1526.
1874: The Swedish Academy publishes a spelling dictionary in one volume, "Svenska Akademiens Ordlista" (SAOL). Its 6th edition (1889) and 8th edition (1923) are out of copyright.
1786: The Swedish Academy is founded by King Gustav III. One of its main tasks is to compile a dictionary of the Swedish language. Work begins immediately, but stops as early as 1814. New attempts are started in 1834 and 1855. A fascicle for the letter "A" is published in 1870.
1526: Sweden's Lutheran church reformer Olaus Petri translates the New Testament into Swedish. The Old Testament follows in 1541. His style of writing marks the beginning of modern Swedish orthography.
9th Century A.D.: About the same time as Iceland is populated by the Norwegians, Sweden's longest runic inscription, the Rök runestone, is carved. Runes are Scandinavian letters inspired by Greek/Latin alphabets but adapted for carving in stone or wood. Two different runic alphabets were used between c. 500 and 1000 A.D., the first with 24 letters, later simplified to one with 16 letters. With the introduction of Christianity around 1000 A.D., runes are gradually replaced with Latin script.
