Difference between revisions of "User:LA2"

From Apache OpenOffice Wiki

Revision as of 06:15, 4 February 2007

LA2 is the username for Lars Aronsson, Sweden, also known from Wikipedia, Project Runeberg, and other projects.

Useful links

Diary

February 1, 2007: After the IETF devised a best current practice for language codes in Internet standards and protocols in RFC 3066 (January 2001), there was a need for more codes. In particular, Germans wanted codes for their language before and after the 1996 spelling reform. For some time, the Internet Assigned Numbers Authority (IANA) maintained a list of additional language tags, but this has been incorporated into a new series of three RFCs from September 2006: RFC 4645 (Initial Language Subtag Registry), RFC 4646 (Tags for Identifying Languages) and RFC 4647 (Matching of Language Tags). In addition to these rules, there is a new registry, operated by IANA. Here, de-1901 is the traditional German spelling (daß, illustrierte, Schiffahrt, Tier) and de-1996 is the new German spelling (dass, illustrierte, Schifffahrt, Tier). No other languages have codes with regard to spelling reform. And there is as yet no code for German before 1901 (daß, illustrirte, Schiffahrt, Thier).
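To make the shape of these tags concrete, here is a minimal Python sketch. It assumes only the simple language-variant form discussed above; the full RFC 4646 grammar also has script, region, extension and private-use subtags, which are ignored here. The names split_tag and GERMAN_VARIANTS are hypothetical and only serve the illustration.

```python
def split_tag(tag):
    """Split a tag such as 'de-1996' into (language, variant).

    Simplified on purpose: handles only 'language' or 'language-variant',
    not the full RFC 4646 grammar.
    """
    parts = tag.lower().split("-")
    language = parts[0]
    variant = parts[1] if len(parts) > 1 else None
    return language, variant


# The two German spelling variants described in the text above.
GERMAN_VARIANTS = {
    "1901": "traditional German spelling (daß, Schiffahrt)",
    "1996": "new German spelling (dass, Schifffahrt)",
}

for tag in ("de-1901", "de-1996", "de"):
    language, variant = split_tag(tag)
    description = GERMAN_VARIANTS.get(variant, "no spelling variant given")
    print(tag, "->", language, "/", description)
```

A spell checker could use the variant part to pick a dictionary and fall back to the plain language when no variant is given.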

I think it could make sense to propose the following language codes:

Code     Used for                                                Samples
-------  ------------------------------------------------------  ---------------------------------------------
da-1775  Danish orthography before the 1892 reform               Kjøbenhavn
                                                                 Dansk biografisk Lexikon, 1st ed.
da-1892  Danish spelling reforms of 1889-1892.                   København, Maade, kunde, skulde, vilde
         Plural verbs (ere, bleve) became optional in 1900       Salmonsens Konversationsleksikon, 2nd ed.
         and are almost completely absent from literature
         that follows this spelling
da-1948  Modern day Danish, reform of 1948                       måde, kunne, skulle, ville
sv-1801  Orthography of Carl Gustaf af Leopold                   elf, godt, jern, qvacksalfvare
                                                                 blefvo, gingo, åto, ega, äro
                                                                 Nordisk familjebok, 1st ed.
sv-1889  SAOL, 6th ed.                                           godt, järn, kvacksalfvare, älf
                                                                 blefvo, gingo, åto, äga, äro
                                                                 Nordisk familjebok, 2nd ed.
sv-1906  Modern day Swedish, spelling reform of Fridtjuv Berg.   gott, järn, kvacksalvare, älv
         Plural verbs (äro, blevo) become optional around 1940   blevo, gingo, åto, äga, äro
         and are completely absent around 1970                   Nils Holgerssons underbara resa genom Sverige
                                                                 SAOL, 8th ed.

For Norwegian, the situation is a lot more complex and I need to learn more before I can propose something like this:

Code     Used for                                                Samples
-------  ------------------------------------------------------  ------------------------------
nb-1862  First uniquely Norwegian (non-Danish) instruction on
         orthography
nb-1907  First official norm for riksmål                         mænd, ryg, hesterne
nb-1917  Reform introduces letter å, changes many æ to e,        menn, rygg, hestene
         removes r from plurals
nb-1982  Final adjustment of the 1938 reform.
         Is this really different from nb-1917?
nn-1853  Ivar Aasen's landsmål                                   Vin, Dyr, Sjo, kastade-kastat
nn-1901  Norwegian education ministry's norm for landsmål        ven, dør, sjø, kasta, hestarne
nn-1917  Reform introduces letter å, removes r from plurals      hestane
nn-1938  Reform


January 28, 2007: During the weekend I'm trying to figure out if there is any algorithm that can determine the language of a text. There are several approaches, such as comparing the most common words or counting bigrams and trigrams. In Perl, there is a CPAN module called Lingua::Ident. It seems to work fine for telling English apart from Spanish, but it is a whole different problem to separate Norwegian bokmål from nynorsk. Or to tell Swedish modern spelling apart from old spelling. Just from trigram analysis, Swedish and Norwegian are very similar. However, a spell checker will find the differences. Run a text through a spell checker for modern Swedish (or Danish or Norwegian) spelling, and all the words with old spelling come out.
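The trigram approach mentioned above can be sketched as follows. This is a toy illustration under stated assumptions, not the method used by Lingua::Ident: it compares character trigram counts by cosine similarity, and the function names and the tiny reference texts are made up for the example. Real use would need large reference corpora per language.

```python
from collections import Counter
import math


def trigram_profile(text):
    """Count overlapping character trigrams in lowercased, space-normalized text."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine_similarity(a, b):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def guess_language(sample, references):
    """Return the name of the reference text whose trigram profile is closest."""
    profile = trigram_profile(sample)
    return max(
        references,
        key=lambda name: cosine_similarity(profile, trigram_profile(references[name])),
    )


# Toy reference texts (far too small for real use).
REFERENCES = {
    "english": "the quick brown fox jumps over the lazy dog and the cat",
    "spanish": "el rápido zorro marrón salta sobre el perro perezoso y el gato",
}

print(guess_language("the dog and the fox", REFERENCES))
print(guess_language("el perro y el gato", REFERENCES))
```

For closely related languages or spelling variants the profiles are nearly identical, which is why a spell checker pass separates them better than trigram counts do.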

Last week (see Jan. 22 below) I released word frequency statistics for old Norwegian texts. I have now completely mapped all texts in Project Runeberg to language and year and started to look closer at Danish. There have been two major Danish spelling reforms in 1892 and 1948, as described in the timeline below. The following table shows how some large text bodies in Project Runeberg relate to these dates:

Works of Danish literature in Project Runeberg Years Size Occurrences of Comment
Volumes Pages Words Vocabulary foer fór skjøn... skøn... ere bleve vox.../vex... voks.../veks...
Salmonsens konversationsleksikon (*scanned so far, out of 26) 1915-1930 9* 9173 7,397,317 539,286 - - 6 2499 39 1 8 3587
Dagligt Liv i Norden i det sekstende Aarhundrede 1914-1915 14 3817 1,197,865 83,248 11 10 5 582 198 17 - 509
Gustav Wied : Mindeudgave 1920 8 3414 897,195 70,043 - 85 9 224 23 - 1 193
Nordisk illustreret Havebrugsleksikon 1920-1921 2 1130 846,839 71,720 - - 1 211 6 - 9 1275
Historiske Afhandlinger af A. D. Jørgensen 1898-1899 4 1864 587,235 57,421 - - 4 281 82 1 98 15 Also uses "å", non-capitalized nouns
Georg Brandes Levned 1905-1908 3 1212 331,587 39,455 15 - 3 377 3 - - 90
Pre-1892 spelling
Dansk biografisk Lexikon 1887-1905 17 12036 4,388,789 168,177 - 45 2192 6 2286 1579 901 4
Tidsskrift for Physik og Chemi 1871-1878 8 3068 832,077 78,905 1 - 239 1 2496 476 370 -
Illustreret dansk Litteraturhistorie 1902 3 2460 726,421 79,768 25 - 738 2 1021 112 197 -
Illustreret dansk Literaturhistorie. Danske Digtere i det 19de Aarhundrede 1907 1 814 328,631 42,407 6 - 386 26 91 20 124 3

The most visible sign of the 1892 spelling reform was the dropping of the silent j after g and k. This is shown here in the "skjø" and "skø" occurrences. Also, letters c/qv/x/z were changed to s/kv/ks/s in many words, as shown here in the change from vox/vex... to voks/veks...

Plural verbs (ere, bleve) were made optional in 1900, and in the texts above their disappearance largely coincides with the adoption of the 1892 spelling.

Dropping the silent e after long vowels is a less clear sign. The word "for", being a preposition with the same meaning as in English (Swedish "för"; German "für"), is one of the 20 most common words in Danish (ranking 11 through 19 in the texts above). However, it is also the imperfect of the verb "at fare" (to fare, to travel, to go, to leave; German "fuhr", Swedish "for"). In this capacity, it has a longer o sound (just like German "fuhr" and Swedish "for") and has historically been spelled "foer", then "fór" and in modern Danish just "for". As can be seen above, the occurrence of these older forms is one distinct feature of the spelling in the period 1880-1930. Adding to this complexity, "foer" can also be the spelling of another word (lining, the inner cloth of a jacket; Swedish "foder"). An unambiguous case is erfoer/erfór/erfor (experienced, learned; German "erfuhr"), but this is far too uncommon to be useful for statistics.
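Counts like those in the table can be approximated with a few regular expressions. The following Python sketch is an illustration only: how the figures above were actually produced is not stated here, and the case-insensitive, whole-word matching is an assumption. The names MARKERS, count_markers and looks_pre_1892 are hypothetical.

```python
import re
from collections import Counter

# Marker spellings taken from the table columns; \b marks a word boundary.
MARKERS = {
    "skjøn...": re.compile(r"\bskjøn\w*", re.IGNORECASE),
    "skøn...": re.compile(r"\bskøn\w*", re.IGNORECASE),
    "ere": re.compile(r"\bere\b", re.IGNORECASE),
    "bleve": re.compile(r"\bbleve\b", re.IGNORECASE),
}


def count_markers(text):
    """Count how often each marker spelling occurs in the text."""
    return Counter({name: len(pattern.findall(text)) for name, pattern in MARKERS.items()})


def looks_pre_1892(text):
    """Crude guess: more 'skjøn...' than 'skøn...' suggests pre-1892 spelling."""
    counts = count_markers(text)
    return counts["skjøn..."] > counts["skøn..."]


sample = "De ere skjønne, og de bleve skønnere. Skjønhed!"
print(count_markers(sample))
print(looks_pre_1892(sample))
```

A single marker is noisy on short texts, so in practice several markers would be combined and judged against the size of the text.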

Salmonsens encyclopedia seems to be useful as a reference, not only because it is the largest body of text, but also since it consistently sticks to the 1892 reform.

Of the total 8.3 million words in Salmonsen+Wied, there are 576K unique words, including some OCR errors. Here is the coverage distribution:

Corpus coverage %   Required number of word forms   Comment
3.38 1 og
8.62 3 i, af
16.81 10 en, til, er, den, at, de, med
28.28 30 der, som, det, for, paa, ved, han, et, -, var, har, sig, fra, ikke, men, blev, B., A., om, e
39.42 100 ... første (7648 occurrences)
49.91 300 ... smaa (2499 occurrences)
60.87 1000 ... hvorpaa (712 occurrences)
70.11 3000 ... Venstre (231 occurrences)
79.62 10000 ... Tiltrækning (60 occurrences)
86.73 30000 ... diplomatique (16 occurrences)
92.64 100000 ... Wanderjahre (3 occurrences)
96.66 300000
100.00 576534
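A coverage table like the one above can be computed from a plain list of word tokens. The Python sketch below shows the general technique, not the script used for the figures above; coverage_table is a hypothetical name and the sample tokens are made up.

```python
from collections import Counter


def coverage_table(words, sizes):
    """For each vocabulary size n, return (n, percent of the corpus covered
    by the n most frequent word forms)."""
    counts = sorted(Counter(words).values(), reverse=True)
    total = sum(counts)
    table = []
    for n in sizes:
        covered = sum(counts[:n])  # occurrences of the n most common forms
        table.append((n, 100.0 * covered / total))
    return table


tokens = "og i af og i og en".split()
for n, percent in coverage_table(tokens, [1, 2, 4]):
    print(f"{percent:6.2f} {n}")
```

Because word frequencies fall off roughly as Zipf's law predicts, each further step in coverage needs a much larger vocabulary than the previous one.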


January 24, 2007: What about translation dictionaries? Could that be a new component for OpenOffice? What's available and how are they used? Two command line applications for English-German are leo and translate, both available as Ubuntu packages. Below is a comparison screenshot of the two GUI applications OpenDict (left, using FreeDict dictionaries) and Ding (right), both showing a lookup of the word "fly" in the English-German translation dictionary. In this particular comparison, Ding wins on a number of points:

  • In Ding you don't have to click to see the different "fly" words, only scroll.
  • Ding shows word classes (adj.), gender of nouns (Fliege f.), and the inflection of verbs (flew, flown).

LA2-dictfly.png

January 22, 2007: As an experiment, I publish some Norwegian word frequency lists by year 1880-1935 based on Project Runeberg's texts.

January 18, 2007: Aspell has some very annoying limitations in that colons and digits cannot be parts of words. How should I handle Swedish words such as Maj:ts and p2p-överföring? I have tried to send my questions to hunspell-devel, but does anybody read that list? My two postings: colon in WORDCHARS (Jan. 4) and digits in words (today).
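The desired tokenization can be sketched in a few lines of Python. This only illustrates which strings should count as one word; it says nothing about how Aspell or Hunspell split words internally, and the pattern WORD and the function tokenize are hypothetical names.

```python
import re

# A word is a run of letters or digits, optionally continued by a single
# colon or hyphen followed by more letters or digits ("Maj:ts", "p2p-överföring").
# [^\W_] matches any Unicode letter or digit but not the underscore.
WORD = re.compile(r"[^\W_]+(?:[:\-][^\W_]+)*")


def tokenize(text):
    """Return the words in text, keeping internal colons, hyphens and digits."""
    return WORD.findall(text)


print(tokenize("Kungl. Maj:ts p2p-överföring: klar."))
```

A colon or hyphen is kept only when it sits between two word parts, so ordinary sentence punctuation is still stripped.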

January 14, 2007: The Finns are running their own software project for spell and grammar checking, Voikko. Fortunately for the rest of us, a description of their architecture is available in English. For more details on the project, see Harri Pitkänen's Hunspell-fi in Kesäkoodi 2006: Final Report (PDF, 14 pages).

January 7, 2007: A self-appointed committee, named "Stavekontrolden", for the improvement of the Danish spell checker holds a constituting assembly in Odense, as Finn Gruwier Larsen reports on the "dansk" mailing list. Chairman is Esben Aaberg. There is already a website at www.stavekontrolden.dk.

January 6, 2007: Wikipedia history diff as a revision corpus, summary of experiments by Marcin Miłkowski, after we discussed this on the dev@lingucomponent mailing list. "In short, it seems that Lars' idea was brilliant".

December 30, 2006: Here's an experiment in coverage. One classic Swedish text is Nils Holgerssons underbara resa genom Sverige, by Nobel Prize winner Selma Lagerlöf, at the same time a geography textbook and a novel. The text at Project Runeberg is in modern spelling (post 1906), but uses plural forms of verbs (pre 1970). The text with HTML markup removed contains 198148 words or 1091951 characters. I'm assuming the spelling is all correct. Passing this through aspell with the 2003 Swedish dictionary (aspell -l sv list) outputs 6741 words as not recognized or 3.4 percent of all words in the text body. This means the Aspell dictionary covers 96.6 percent of the text, which actually isn't so bad. Further filtering through my own dictionary of old spellings (including plural verbs) removes another 1842 words or 0.93 percent, increasing the coverage to 97.5 percent. If instead I change to my own main dictionary, coverage increases only marginally. Apparently, the improvements I feel I have made don't really matter if you are spell checking Nils Holgersson.

 $ cat k*.html | wc -w
 198148
 $ cat k*.html | aspell -l sv list | wc -w
 6741

The 4883 words still unrecognized by the combination of my own dictionaries (97.5 % coverage) consist of 3670 unique word forms, of which 2991 appear only once, 433 appear twice, 129 appear three times and 117 appear four or more times. If I added these 117 word forms to my dictionary, that would cover another 639 words or 0.32 percent of this text, pushing the coverage to almost 97.9 percent.

It turns out some of those words shouldn't be added to a dictionary because they are names of fictional characters that only appear in this book. A select few spelling errors are also found and will be corrected. Of the unrecognized words, many are minor variations (in case and punctuation) that are covered by just adding one word to the dictionary. After some work, my coverage is up to 98.49 percent, leaving 2991 words unrecognized, being 2674 unique word forms of which 2509 appear only once and 111 appear twice.
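The bookkeeping in this entry (coverage, unique word forms, and how many forms appear once, twice and so on) can be reproduced from the list of unrecognized words. The Python sketch below assumes the input is one word per list item, as produced by splitting the output of aspell list; summarize_unrecognized is a hypothetical name.

```python
from collections import Counter


def summarize_unrecognized(total_words, unrecognized):
    """Return (coverage percent, number of unique forms, forms per occurrence count)."""
    forms = Counter(unrecognized)        # occurrences of each unique word form
    by_count = Counter(forms.values())   # how many forms occur once, twice, ...
    coverage = 100.0 * (total_words - len(unrecognized)) / total_words
    return coverage, len(forms), by_count


coverage, unique_forms, by_count = summarize_unrecognized(20, ["a", "b", "a", "c"])
print(coverage, unique_forms, dict(by_count))

# The same arithmetic applied to the figures in the text above.
print(round(100.0 * (198148 - 6741) / 198148, 1))
print(round(100.0 * (198148 - 2991) / 198148, 2))
```

Sorting the forms by occurrence count shows which few additions to the dictionary would gain the most coverage.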

December 21, 2006: Apparently the Swedish spell checker in Microsoft Word 6.0 accepts the following misspelled words from my test page: andledning, andvänd, andvändning, bakrund, ballett, diskusanalys, finlandsvensk, finness, fiskeläger, följetång, företeckning, förmögenhetskatt, hårddraget, innerbär, jämnlik, kolrot, Lindköping, lösensumma, majonäs, model, modellbetäckning, parantes, situationstecken, stadsbesök, stadschef, terass, trilogi, vädersträck, överrens

And Microsoft Word 2003 accepts these errors: andledning, alvarlig, andvändare, ballett, Ceasar, diskusanalys, europisk, fiskeläger, frisörsalong, följetång, företeckning, förmögenhetskatt, grejor, hårddraget, interesse, krigsföring, landsbyggd, Lindköping, lösensumma, mediespelare, modellbetäckning, parantes, San Fransisco, sattelit, situationstecken, stadsbesök, stadschef, Stockolm, Storbrittanninen, tabblett, tipps, utryck, våldtäckt, vädersträck, ytterliggare, åldersbestigna, överrens.

December 20, 2006: As I observed the other day, writely.com automatically understands which language I'm using, and applies the right spelling dictionary. Apparently, Microsoft Word has a similar feature. Why have I never seen this feature in free software for spell checking? A quick web search indicates that I'm not the first to ask this question. Suggested solutions include trying an ad hoc list of common words, prefixes and suffixes from each language or sampling trigrams. There is also an attempt at Bayesian language detection. Nothing indicates that the creators of these three approaches are familiar with Zipf's law. Looking for "you" and "me" is not optimal, when in fact "the" is known to be the most common word in English texts. The Bayesian filter approach probably comes closest to the optimum, but in an overly complicated way.

If we were to look for a single word from each language, that would be the most common word, which is "the" for English, "der" for German and "och" for Swedish. Since 7% of all English words are "the", we would need on average 14 words to find one occurrence of "the". In Swedish, which usually marks definiteness with a suffix rather than a separate article (the, der, le, los), the most common word is "och" (and), which makes up 3.8% of all words, so we'd need a sample of 26 words to expect one occurrence of "och".

In order to determine the language from a shorter sample, we could throw in some more words from each language, from the top of the frequency listing. The 10 most frequent words in Swedish (och, i, att, en, av, som, den, till, med, på) together account for 14% of the words in a typical text corpus. The top 20 words (adding det, för, de, han, är, ett, sig, så, jag, var) account for 21%, the top 50 words account for 29%, and the top 100 words account for 36%. Some of these words also appear in English but aren't so frequent there, so they are better predictors for Swedish than for English. For each language, a weighted prediction score can be computed. Finding an "I" adds more to the Swedish score than to the English. These top ranking word frequencies haven't changed much since Shakespeare, so a C program should be able to keep a static table of the weights for 100 words in source code. The downloadable text of Wikipedia and the written OpenOffice.org documentation could be used as a corpus for sampling the weights.
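The weighted score could look roughly like the Python sketch below. Only the weights for "the" (7%) and "och" (3.8%) come from the figures above; the two weights for "i" are placeholders chosen for illustration, not measured values, and a real table would hold about 100 sampled weights per language. The names WEIGHTS, score and guess are hypothetical.

```python
# Relative word frequencies per language, used as weights.
WEIGHTS = {
    "en": {"the": 0.07, "i": 0.005},   # "i": placeholder weight, not measured
    "sv": {"och": 0.038, "i": 0.02},   # "i": placeholder weight, not measured
}


def score(sample):
    """Sum the weights of all known words in the sample, per language."""
    words = sample.lower().split()
    return {
        lang: sum(table.get(word, 0.0) for word in words)
        for lang, table in WEIGHTS.items()
    }


def guess(sample):
    """Return the best scoring language, or None if no known word was found."""
    scores = score(sample)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess("jag bor i Sverige och trivs"))
print(guess("the cat sat on the mat"))
print(guess("xyz"))
```

Summing plain frequencies is the simplest possible weighting; it favours words that are common in one language and rare in the other, which is the effect described above.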

The paper by Géza Németh and Csaba Zainkó, Multilingual Statistical Text Analysis, Zipf's Law and Hungarian Speech Generation explains that to cover 97.5% of an English corpus, you need a dictionary of 20,000 unique word forms, but to achieve the same coverage in German you need 80,000 word forms and for Hungarian 400,000 word forms. This is an important observation, as it sets a goal for how large spell checker dictionaries need to be. Presumably, Finnish and Estonian have the same characteristics as Hungarian, while the Scandinavian languages (Swedish, Danish, Norwegian) take a middle position between that and German. Our current dictionaries contain many specialized words that aren't very common. These words add to the dictionary's footprint without contributing much to its coverage of a typical corpus. For example, for each word we add in its basic form, we also try to include every other form of that word, even though those forms aren't very common. What we need for German are the 80,000 most common word forms, but when we add the really common word Haus, we also tend to add the less common Hauses, Häuser and Häusern. Perhaps we should aim for 80,000 basic forms (this is very close to what Björn Jacke has, see below) instead of 80,000 variations (he has 300,000 variations).

I made some tests on a corpus of 17.99 million Swedish words from proofread texts in Project Runeberg. Many of these words use old spelling (and describe old concepts) and won't be found in modern dictionaries, so this is not a perfect test case for contemporary spell checking dictionaries. When I run this corpus through my own dictionaries, which do contain some words in old spelling, it leaves a remainder of 1.482 million words or 8.2% of the corpus, meaning that I now have 91.8% coverage of this corpus. If I combine my old spelling component with the existing Aspell dictionary (Göran Andersson's from 2003), it leaves a remainder of 1.837 million words or 10.4% of the corpus, meaning a 89.6% coverage. So my progress beyond Göran's dictionary is indeed very small. This coverage around 90% can be achieved for German with a dictionary of the 20,000 most common word forms, which can be compared to the 24,000 basic forms in Göran Andersson's 2003 dictionary. Even though my dictionary has many additional word forms, their contribution to the corpus coverage isn't very large.

Corpus coverage %   Required number of word forms   Comment
3.44 1 och
5 2 i
10 6 att, en, av, som
15 10 den, till, med, på
20 17 det, för, de, han, är, ett, sig
25 30
30 53
35 90
40 151
45 260
50 451
55 795
60 1387
65 2415
70 4227
75 7452
80 13606
The long tail
Corpus coverage %   Required number of word forms
85 26544
86 30731
87 35800
88 42026
89 49735
90 59402
91 71767
92 87837
93 109094
94 137919
95 178319
96 234459
97 320515
98 453458
99 633358
100 812979

December 19, 2006: On the dev@lingucomponent list, Kevin Scannell discusses how to use precision and recall metrics for spell checkers.

December 18, 2006: I update the Nordic Words page and publish a version of my own Swedish dictionary (ss100.txt) with 221,599 words. I also do a little survey of the spell checking support in various programs:

Google's writely.com is a web word processor. It has a built-in spell checker that automatically recognizes the language. Its Swedish spell checker behaves exactly like OpenOffice.org 2.0.4, which indicates the same Ispell/Aspell/Myspell/Hunspell dictionary is used for Swedish (Göran Andersson's dictionary from 2003). When I pasted the words from my test page, there were so many errors that the spell checker automatically shut down and I had to turn it back on manually.

The Opera web browser (I tried version 9.10) has a built-in spell checker for web forms. The user interface is a bit old-fashioned, in that it doesn't underline the errors, but uses a dialog window that steps through the web form. On Apple's Mac OS X it uses the system's built-in spell checker, but on all other platforms it requires the user to install GNU Aspell.

The word processor Abiword has built-in spell checking support. The user interface underlines any errors. The Swedish spell checking is apparently based on Göran Andersson's 2003 dictionary, although I cannot find out which software it uses (GNU Aspell, ispell or Myspell).

KDE's editors kate and kwrite have built-in spell checking support, apparently based on GNU Aspell. The user interface doesn't underline errors, but provides a dialog window that steps through the text one word at a time.

Note to self: I should take a closer look at Freedict.de. Where do these dictionaries really come from? Are they maintained?

A look at the German ispell dictionaries by Björn Jacke:

          Occurrences            Affix
Nov. 2005  Feb. 2003  Nov. 1999  flag   Usage
 -------    -------    -------   ----   ----------------------
  79681      81191      75748           Basic forms
 307261     308860     294897           Unique words
 -------    -------    -------   ----   ----------------------
  13755      13933      11257     /S    Genitive -s
  11815      11723      11397     /A    Adjective inflexion
   8848       9319      10070     /P    Plural -en
   8166       8374       8048     /N    Plural -n
   7367       7346       7004     /D    Participle -d
   6837       6828       6595     /I    Regular verbs, present tense
   6620       6611       6358     /X    Regular verbs, present tense
   5310       5303       5140     /Y    Regular verbs, past tense
   4315       4578       4525     /T    Genitive -es
   4189       4406       4118     /E    Plural -e
   2066       2061       1991     /O    Participle inflexion
   1999       1971         82     /J    -ung and inflexions
   1846       1840       1813     /C    Adjective comparison
   1580       1656       1636     /p    Irregular plurals
   1452       1047        831     /F    -in and inflexions
    721        719        615     /Z    Non-regular verbs, past tense
    672        665        619     /U    Prefix un-
    619        620        615     /V    Prefix ver-
    574        569        486     /B    -bar and inflexions
    497        492        138     /W    Imperatives
    289        310        280     /R    Plural -er
    235        251        250     /Q    Plural -sse
    206        206        208     /G    Prefix ge-
     64         68         65     /q    Plural -sse, special case for feminines
     57         56         44     /M    -chen and inflexions
     20         21         22     /f    Words ending in -ph can also have -f
     18         17         19     /L    -lich and inflexions
      4          4          4     /H    -heit and inflexions

December 15, 2006: Two Danish OpenOffice developers meet with CST, Center for Sprogteknologi, a commercial provider of Danish dictionaries, to discuss how to improve the Danish spell checking dictionary for OpenOffice. Brief report on the 'dansk' mailing list. To me it seems unlikely that any useful solution would be found this way.

December 11, 2006: Version 2.0.4 of OpenOffice.org has auto corrections (AutoKorrigeringar) for Swedish, based on a static list of about 100 word pairs, e.g. HJE -> hej, MEDECIN -> medicin. Where do they come from? They're not part of the spelling dictionary. There are also word pairs for Danish and German (both have longer lists), but none for Norwegian.

Firefox 2.0 offers spell checking for web forms (e.g. wiki editing). There is a Swedish spelling dictionary by Hasse Wallanger, based on the Swedish Myspell dictionary of August 14, 2003 ("baserad på den svenska ordlistan från 20030814 för Myspell"). It behaves a little differently than the Swedish spell checker in OpenOffice 2.0.4, in that it allows free concatenation of words. It also spell checks only an initial fraction of a web form, as explained in bug 360434. Type about:config in the URL field and look for the variable extensions.spellcheck.inline.max-misspellings, which defaults to 500. Double click on this value and change it to a much higher value, e.g. 15000.

December 7, 2006: I think we need a test case for the Swedish spell checking, that is separate from the development of the dictionary. As a pilot test, I'm starting a subpage /Test av stavningskontrollen. Göran Andersson publishes version 1.22 of DSSO.

December 2, 2006: I sign up for various OpenOffice mailing lists, and this wiki. What takes me here is the poor spell checking support for Swedish in OpenOffice 2.0.2. The spelling dictionary is version 1.3.8 from sv.speling.org, which hasn't been updated since March 2002. It only contains 24490 words (basic forms), some of which are misspelled. The myspell affix file seems to have been automatically converted from the ispell affix file.

Timeline of Scandinavian orthography

  • November 25, 2006: Göran Andersson publishes version 1.19 of DSSO. Version 1.21 follows on December 1.
  • 2006: The Swedish Academy publishes the 13th edition of SAOL.
  • 2005: Volume 34 of SAOB ends at Tojs. The full work is expected to be completed in 2017.
  • 2005: Spelling reform in bokmål. Some forms from riksmål are introduced: frem.
  • January 2005: Project Runeberg's OCR spelling dictionaries for Swedish and Danish are published within Nordic Words.
  • April 2004: Public editing of susning.nu is closed. The user community migrates to the Swedish Wikipedia.
  • March 6, 2003: My posting Svensk ordlista on the SSLUG-LOCALE mailing list.
  • February 2003: On the Swedish web forum Gnuheter, I ask around for a business case for a Swedish dictionary (Affärsmodeller och fritt innehåll) without getting any useful answers.
  • 2003: Göran Andersson takes back control of the Swedish spelling dictionary, now dsso.se, dissatisfied with some modifications made to it during the time it was at sv.speling.org.
  • May 6, 2002: I join the sslug-locale mailing list for speling.org.
  • 2002-2003: I digitize two editions (58 volumes) of the classic Swedish encyclopedia Nordisk familjebok (1876-1926). This is more food for word frequencies and spelling dictionaries.
  • October 2001: I start susning.nu, a Swedish wiki, which grows very fast. As a spinoff I return to computing word frequencies and compiling my own spelling dictionary.
  • January 29, 1998: Göran Andersson hands over his Swedish ispell dictionary (now version 1.2.1) to sv.speling.org.
  • September 26, 1997: Göran Andersson's ispell dictionary version 1.2 accepts compound words. The list has 24082 basic forms, expanding to 117617 unique words.
  • February 23, 1997: Göran Andersson's ispell dictionary version 1.1 has 24722 basic forms, expanding to 84740 unique words.
  • January 15, 1997: Göran Andersson's ispell dictionary version 1.0 has 27737 basic forms, expanding to 76364 unique words. The brand new affix file is based on inspiration from a Danish affix file.
  • November 1996: Within Project Runeberg, the subproject "Nordic Words" is started, maintained by Anders Brun. No updates are made after December 1997.
  • 1993: The Swedish Academy introduces computers in editing SAOB.
  • December 1992: I start Project Runeberg, the Scandinavian e-text archive.
  • 1991-1993: I experiment with spelling dictionaries for spell and ispell.
  • 1986: Spelling reform in riksmål. Some words from bokmål are introduced: Etter, språk, nå.
  • 1970s-1980s: A Swedish morphological spellchecker "stava" is developed at FOA/QZ in Stockholm. Traces of this might be available at KTH. Viggo Kann would know. Several later Swedish spell checkers with the same name exist. Various dictionaries float around. Linguists have access to proprietary lists for research purposes, and are not interested in creating "open content".
  • June 1, 1981: Norwegian parliament adopts a proposal from Norsk språkråd (January 1979) to once again allow in bokmål many of the forms that were banned in 1938. Female gender inflections become optional. The new rules are introduced in schools during 1982.
  • 1979: 16.4 % of Norwegian children receive school education in nynorsk.
  • 1972: Norsk språkråd (Norwegian language council) replaces Norsk språknemnd. A paragraph on uniting the two languages is dropped from the mission statement. The new council includes representatives from the two protest organizations Riksmålsforbundet and Foreldreaksjonen mot samnorsk.
  • 1970: Major Swedish newspapers abandon plural forms of verbs.
  • 1968: The polite use of "Ni" (You/Sie) is replaced with simple "du" (you/du), making Swedish conversation as simple as English.
  • 1960s: A young computational linguist Sture Allén uses paper tape from newspaper typesetters to compute word frequencies of the Swedish language. Laying the foundation for the Språkdata department at Gothenburg University, he later becomes secretary of the Swedish Academy.
  • 1959: Norsk språknemnd (Norwegian language committee), formed by the government in 1951, publishes Ny læreboknormal 1959, which relaxes parts of the 1938 reform.
  • 1959: Friends of further reform and unification of Norwegian language form an association, Landslaget for språklig samling.
  • 1952: The Norwegian association "Riksmålsforbundet" protests against further reform and publishes its own dictionary, reinstating many words that were removed from bokmål in the 1938 reform.
  • 1951: Norwegian parliament unanimously decides to change counting words from the German/Danish pattern (tre-og-femti, three-and-fifty) to English/Swedish (femti-tre, fifty-three). In 1970 a poll shows that 70% of the population agree this was a good reform, but only 30% actually use it.
  • March 22, 1948: A Danish spelling reform introduces å (for aa) and removes capitalization of nouns. Also, the words kunde, skulde, vilde are replaced with kunne, skulle, ville.
  • 1945: Swedish public schools make plural endings of verbs optional. Students who opt not to use them must indicate this and then stick to their chosen style.
  • 1944: The percentage of Norwegian children that receive school education in nynorsk peaks at 34.1 %. Ongoing industrialization, urbanization and increased wealth benefit bokmål.
  • 1939: At Easter, with fascicle 156, the Swedish Academy celebrates SAOB being halfway (A--K) completed.
  • 1938: Spelling reform for both Norwegian languages aims to bring them closer to each other. Female gender is made mandatory in bokmål.
  • 1929: The two Norwegian languages get new names. Riksmål changes to bokmål, and landsmål changes to nynorsk. However, those who protested the 1938 reform of bokmål took up the old name riksmål.
  • 1917: Norwegian spelling reform for both languages. The letter Å is introduced. R is removed from plurals (hestane/hestene). In riksmål many æ change to e (menn, verk). Female gender is introduced in riksmål and made optional.
  • 1913: The proceedings of the Swedish parliament (riksdagens protokoll) adopt the spelling of the 1906 reform.
  • 1910: The polite use of "Ni" (You/Sie) is introduced in Swedish as a replacement for complicated titles, making Swedish conversation as simple as German.
  • 1907: The first official spelling standard for Norwegian riksmål. This is close to the language of Bjørnson (hesterne, mænd, mænn, værk, ryg). Many Danish b/d/g are changed to p/t/k. Nouns and verbs get Norwegian inflexions. In part this norm was guided by the idea to unify the two Norwegian languages (samnorsktanken).
  • 1906: A major Swedish spelling reform does away with the combinations dt, fv, and hv. This is introduced by minister of church and schools Fridtjuv Berg (1851-1916).
  • 1901: Norway's education (church) ministry defines a standard orthography for those school textbooks that are written in landsmål. Vin, Dyr, Sjo are changed to ven, dør, sjø. Verb forms kastade-kastat are changed to kasta-kasta. This reform of 1901 is also known as "Midlandsnormalen".
  • 1900: Danish education ministry allows the dropping of plural forms of verbs (ere, bleve).
  • 1892: Norway's school districts can decide whether they should teach landsmål or the common language known from books. Secondary schools introduce this reform in 1896.
  • 1888 or 1889, and revised in 1891 or 1892: Denmark's ministry of schools and churches (under minister Jacob Frederik Scavenius) authorizes spelling that allows (though does not require?) using j and v rather than i and u at the end of diphthongs, reducing the use of c, q, z, and x to foreign words, abandoning double vowels, abandoning the silent e, abandoning silent j after k and g. The dictionary by Viggo Såby (1835-1898), Ordbog med befalet Retskrivning til Brug for Skolene, becomes the norm for spelling in Danish schools.
  • 1889: The 6th edition of SAOL introduces many of the changes proposed by the 1869 congress. This includes the change from e to ä in elf/älf, jern/järn. It also allows a change from qv to kv, e.g. qvarn/kvarn, qvinna/kvinna.
  • 1885: Norwegian parliament rules that landsmål is a parallel official language.
  • 1883: A new editor restarts the Academy's dictionary. The first fascicle is printed in 1893 and the first volume of "Svenska Akademiens Ordbok" (SAOB) is completed in 1898. The dictionary documents Swedish spelling since 1526.
  • 1877: Norway no longer requires capitalization of nouns.
  • 1874: The Swedish Academy publishes a spelling dictionary in one volume, Svenska Akademiens Ordlista (SAOL). This 1st edition by Johan Erik Rydqvist (1800-1877) is very conservative in spelling, as a direct protest against Hazelius and the changes proposed by the 1869 congress. Its 6th edition (1889) and 8th edition (1923) are out of copyright.
  • 1869: A Scandinavian spelling congress (det nordiske Retskrivningsmøde, det nordiska rättstavningsmötet) is held in Stockholm, suggesting that nouns should no longer be capitalized (in Danish-Norwegian) and that ä should replace e in many places (in Swedish). Among the Norwegian representatives was Henrik Ibsen, who immediately adopted the new proposals in his own writing, such as changing from gj/kj to g/k (gerne, kærlighed, skemt, igen), from ei/øi to ej/øj (Freja, dreje, fløjel), from ch/x/qv to k/ks/kv (Krist, veksel, kvinde). In the 1870s he set out to republish his older works in this new language. The Swedish Academy was not invited, because the organizers wanted to achieve consensus in the direction of reform, and this would not have been accepted by the Academy. Secretary for the Swedish section was Artur Hazelius (1833-1901), who published Om svensk rättstafning. 1. Om rättstafningens grunder med särskildt afseende på svenska språket (1870, "On the foundations of orthography with special consideration on the Swedish language") and 2. Redogörelse för Nordiska rättstafningsmötets förslag till ändringar i stafningssättet jemte berättelse om mötet (1871, "Presentation of the Scandinavian spelling congress' proposals for changes in orthography and proceedings of the congress").
  • 1862: Norwegian instruction on orthography changes ph/ch/x/qv to f/k/ks/kv. Double vowels (Eed, Huus, siig, viid) and silent e (gaaer, roer, Tyrannie) are dropped (Ed, Hus, sig, vid, gaar, ror, Tyranni). This reform has no effect in Denmark.
  • 1853: Ivar Aasen publishes some samples of dialects in Prøver af Landsmaalet i Norge (1853). In this collection, some stories are also printed in a standardized version of Norwegian language, which marks the creation of landsmaal (in 1929 renamed nynorsk), one of the two Norwegian languages. Writers Aa. O. Vinje and Arne Garborg start to use the new language.
  • 1842-1846: Norwegian linguist Ivar Aasen travels the country to collect samples from dialects. His observations are summarized in a grammar and a dictionary: Det norske Folksprogs Grammatik (1848) and Ordbog over det norske Folksprog (1850).
  • 1842: Public schools are made compulsory by law in Sweden.
  • 1830: Sixteen years after Norway's separation from Denmark, the idea to create a Norwegian language is first mentioned.
  • 1826: Danish linguist Rasmus Rask (1787-1832) publishes a Forsøg til en videnskabelig dansk Retskrivningslære (Attempt at a scientific Danish orthography), in which he proposes a far-reaching reform of Danish spelling. Among other things, he proposes the introduction of å to replace aa (this reform took place in 1948). One of his disciples was Niels Mathias Petersen (1791-1862), who continued to work for reforming the Danish language.
  • 1814: Public schools are made compulsory by law in Denmark.
  • 1786: The Swedish Academy is founded by king Gustav III. One of its main tasks is to compile a dictionary of the Swedish language. Work begins immediately, but stops already in 1814. New attempts are started in 1834 and 1855. A fascicle for the letter "A" is published in 1870.
  • 1775: Danish government issues the first of instructions on spelling to higher schools.
  • 1750-1800: In the latter half of the 18th century, capitalization of nouns is introduced in Danish.
  • 1753: Swedish scholar Sven Hof publishes Swänska språkets rätta skrifsätt ("The right spelling of the Swedish language").
  • 1726: Danish-Norwegian playwright Ludvig Holberg (1684–1754) documents his own spelling in "Orthographiske Anmerkninger" in Metamorphosis. An online version is found here. However, that text does not use Holberg's unique orthography. This might seem odd, but is explained by the fact that book printers changed Holberg's very disciplined spelling to their own random spelling. A good background is given in this article on the history of Danish language in the encyclopedia Salmonsens Konversationsleksikon.
  • 1703: First attempt to use Antiqva (rather than Fraktur) for Danish books, but only very few books are printed this way. Fraktur continues to dominate.
  • 1647: A Danish Bible translation does away with male/female gender of nouns and always uses "den".
  • 1526: Sweden's Lutheran church reformer Olaus Petri translates the New Testament into Swedish. The Old Testament follows in 1541. His style of writing marks the beginning of modern Swedish orthography.
  • 9th Century A.D.: About the same time as Iceland is populated by the Norwegians, Sweden's longest runic inscription, the Rök runestone, is carved. Runes are Scandinavian letters inspired by Greek/Latin alphabets but adapted for carving in stone or wood. Two different runic alphabets were used between c. 500 and 1000 A.D., the first with 24 letters, later simplified to one with 16 letters. With the introduction of Christianity around 1000 A.D., runes are gradually replaced with Latin script.