Difference between revisions of "User:LA2"
Revision as of 07:01, 7 January 2007
LA2 is the username for Lars Aronsson, Sweden, also known from Wikipedia, Project Runeberg, and other projects.
Useful links
- Swedish spelling dictionaries: sv.speling.org — dsso.se — Nordic Words — Viggo Kann
- OpenOffice projects: Lingucomponent — Localization — Native-lang — Danish (wiki) — Norwegian (wiki) — Swedish (wiki)
- Grammar checking: LanguageTool — Summer of Code 2006 — Granska
Diary
January 6, 2007: Wikipedia history diff as a revision corpus, summary of experiments by Marcin Miłkowski, after we discussed this on the dev@lingucomponent mailing list. "In short, it seems that Lars' idea was brilliant".
December 30, 2006: Here's an experiment in coverage. One very classical Swedish text is Nils Holgerssons underbara resa genom Sverige, by Nobel Prize winner Selma Lagerlöf, at the same time a geography textbook and a novel. The text at Project Runeberg is in modern spelling (post 1906), but uses plural forms of verbs (pre 1970). The text with HTML markup removed contains 198148 words or 1091951 characters. I'm assuming the spelling is all correct. Passing this through aspell with the 2003 Swedish dictionary (aspell -l sv list) outputs 6741 words as not recognized, or 3.4 percent of all words in the text body. This means the Aspell dictionary covers 96.6 percent of the text, which actually isn't so bad. Further filtering through my own dictionary of old spellings (including plural verbs) removes another 1842 words or 0.93 percent, increasing the coverage to 97.5 percent. If I instead switch to my own main dictionary, coverage increases only marginally. Apparently, the improvements I feel I have made don't really matter if you are spell checking Nils Holgersson.
$ cat k*.html | wc -w
198148
$ cat k*.html | aspell -l sv list | wc -w
6741
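The coverage percentages in this entry follow directly from the two word counts. As a quick check of the arithmetic, here is a minimal Python sketch; the function name `coverage_percent` and the variable names are made up for illustration, and only the three numbers come from the entry.

```python
def coverage_percent(total_words, unrecognized_words):
    """Share of running words that the dictionary recognizes, in percent."""
    return 100.0 * (total_words - unrecognized_words) / total_words

total = 198148            # words in the text, from wc -w
aspell_misses = 6741      # words reported by aspell -l sv list
old_spelling_hits = 1842  # words additionally accepted by the old-spelling list

# Coverage with the Aspell dictionary alone.
aspell_only = coverage_percent(total, aspell_misses)

# Coverage after also filtering through the old-spelling dictionary.
with_old_spellings = coverage_percent(total, aspell_misses - old_spelling_hits)

print(f"{aspell_only:.1f}")         # 96.6
print(f"{with_old_spellings:.1f}")  # 97.5
```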
The 4883 words still unrecognized by the combination of my own dictionaries (97.5 % coverage) are 3670 unique word forms, of which 2991 appear only once, 433 appear twice, 129 appear three times and 117 appear four or more times. If I added these 117 word forms to my dictionary, that would cover another 639 words or 0.32 percent of this text, pushing the coverage to 97.8 percent.
It turns out some of those words shouldn't be added to a dictionary because they are names of fictional characters that only appear in this book. A select few spelling errors are also found and will be corrected. Of the unrecognized words, many are minor variations (in case and punctuation) that are covered by just adding one word to the dictionary. After some work, my coverage is up to 98.49 percent, leaving 2991 words unrecognized, being 2674 unique word forms of which 2509 appear only once and 111 appear twice.
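The breakdown of unrecognized words by how often each form occurs can be sketched as follows. This is only an illustration in Python, not the tooling actually used for this entry; the function names and the tiny sample list are hypothetical. The last lines recompute the 639 words from the counts quoted above.

```python
from collections import Counter

def frequency_profile(unrecognized_words):
    """Return (number of unique forms, {occurrences: number of forms})."""
    per_form = Counter(unrecognized_words)
    profile = Counter(per_form.values())
    return len(per_form), dict(profile)

def words_covered_by_frequent_forms(unrecognized_words, min_count):
    """Running words that would be covered by adding every form seen at least min_count times."""
    per_form = Counter(unrecognized_words)
    return sum(n for n in per_form.values() if n >= min_count)

# Made-up sample standing in for the output of the spell checker.
sample = ["Akka", "Akka", "Akka", "Akka", "Mårten", "Mårten", "voro", "gingo"]
unique_forms, profile = frequency_profile(sample)
covered = words_covered_by_frequent_forms(sample, 4)

# Check of the figures in the entry: forms seen once, twice and three times.
singles, doubles, triples = 2991, 433, 129
words_in_frequent_forms = 4883 - (singles + 2 * doubles + 3 * triples)
print(words_in_frequent_forms)  # 639
```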
December 21, 2006: Apparently the Swedish spell checker in Microsoft Word 6.0 accepts the following misspelled words from my test page: andledning, andvänd, andvändning, bakrund, ballett, diskusanalys, finlandsvensk, finness, fiskeläger, följetång, företeckning, förmögenhetskatt, hårddraget, innerbär, jämnlik, kolrot, Lindköping, lösensumma, majonäs, model, modellbetäckning, parantes, situationstecken, stadsbesök, stadschef, terass, trilogi, vädersträck, överrens
And Microsoft Word 2003 accepts these errors: andledning, alvarlig, andvändare, ballett, Ceasar, diskusanalys, europisk, fiskeläger, frisörsalong, följetång, företeckning, förmögenhetskatt, grejor, hårddraget, interesse, krigsföring, landsbyggd, Lindköping, lösensumma, mediespelare, modellbetäckning, parantes, San Fransisco, sattelit, situationstecken, stadsbesök, stadschef, Stockolm, Storbrittanninen, tabblett, tipps, utryck, våldtäckt, vädersträck, ytterliggare, åldersbestigna, överrens.
December 20, 2006: As I observed the other day, writely.com automatically understands which language I'm using, and applies the right spelling dictionary. Apparently, Microsoft Word has a similar feature. Why have I never seen this feature in free software for spell checking? A quick web search indicates that I'm not the first to ask this question. Suggested solutions include trying an ad hoc list of common words, prefixes and suffixes from each language or sampling trigrams. There is also an attempt at Bayesian language detection. Nothing indicates that the creators of these three approaches are familiar with Zipf's law. Looking for "you" and "me" is not optimal, when in fact "the" is known to be the most common word in English texts. The Bayesian filter approach probably comes closest to optimum, but in an overly complicated way.
If we were to look for a single word from each language, that would be the most common word, which is "the" for English, "der" for German and "och" for Swedish. Since 7% of all English words are "the", we would need on average 14 words to find one occurrence of "the". In Swedish, which usually marks definiteness with a suffix rather than a separate article (the, der, le, los), the most common word is "och" (and), which makes up 3.8% of all words, so we'd need a sample of 26 words to expect one occurrence of "och".
In order to determine the language from a shorter sample, we could throw in some more words from each language, from the top of the frequency listing. The 10 most frequent words in Swedish (och, i, att, en, av, som, den, till, med, på) together account for 14% of the words in a typical text corpus. The top 20 words (adding det, för, de, han, är, ett, sig, så, jag, var) account for 21%, the top 50 words account for 29%, and the top 100 words account for 36%. Some of these words also appear in English but aren't so frequent there, so they are better predictors for Swedish than for English. For each language, a weighted prediction score can be computed. Finding an "I" adds more to the Swedish score than to the English. These top ranking word frequencies haven't changed much since Shakespeare, so a C program should be able to keep a static table of the weights for 100 words in source code. The downloadable text of Wikipedia and the written OpenOffice.org documentation could be used as a corpus for sampling the weights.
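The weighted score idea could look roughly like this. The sketch is in Python rather than C to keep it short, and the weight table is illustrative: only 7 for "the" and 3.8 for "och" come from the percentages above, every other weight is a placeholder that would have to be sampled from a real corpus. The names `WEIGHTS` and `guess_language` are made up.

```python
import re

# Static table of per-language word weights (percent of running words).
# Only "the" (7.0) and "och" (3.8) are taken from the text; all other
# numbers are placeholders for illustration.
WEIGHTS = {
    "en": {"the": 7.0, "and": 3.0, "of": 3.0, "to": 2.5, "i": 0.5},
    "sv": {"och": 3.8, "i": 2.5, "att": 2.0, "en": 1.5, "som": 1.0},
}

def guess_language(text):
    """Return (best language or None, score per language) for a text sample."""
    # Lowercase and split into runs of letters; this keeps å, ä and ö.
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(table.get(word, 0.0) for word in words)
        for lang, table in WEIGHTS.items()
    }
    best = max(scores, key=scores.get)
    # If no listed word was seen at all, make no guess.
    return (best if scores[best] > 0 else None), scores

language, scores = guess_language("Jag bor i Stockholm och arbetar som lärare")
print(language)  # sv
```

With these placeholder weights, the word "i" adds 2.5 to the Swedish score but only 0.5 to the English one, which mirrors the remark that the same word can be a better predictor for one language than for another.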
The paper by Géza Németh and Csaba Zainkó, Multilingual Statistical Text Analysis, Zipf's Law and Hungarian Speech Generation, explains that to cover 97.5% of an English corpus, you need a dictionary of 20,000 unique word forms, but to achieve the same coverage in German you need 80,000 word forms and for Hungarian 400,000 word forms. This is an important observation, as it sets a goal for how large spell checker dictionaries need to be. Presumably, Finnish and Estonian have the same characteristics as Hungarian, while the Scandinavian languages (Swedish, Danish, Norwegian) take a middle position between that and German. Our current dictionaries contain many specialized words that aren't very common. These words add to the dictionary's footprint without contributing as much to the coverage of a typical corpus as they should. For example, for each word we add in its basic form, we also try to include every other form of that word, even though those forms aren't very common. What we need for German are the 80,000 most common word forms, but when we add the really common word Haus, we also tend to add the less common Hauses, Häuser and Häusern. Perhaps we should aim for 80,000 basic forms (this is very close to what Björn Jacke has, see below) instead of 80,000 variations (he has 300,000 variations).
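The relation between dictionary size and coverage can be measured on any word list. The following Python sketch counts how many of the most frequent word forms are needed to reach a target coverage; the function name `forms_needed` and the toy corpus are hypothetical, and this is not the method used in the paper.

```python
from collections import Counter

def forms_needed(corpus_words, target_coverage):
    """Number of most frequent word forms needed to cover the given share of the corpus."""
    counts = Counter(corpus_words)
    total = sum(counts.values())
    covered = 0
    # Walk down the frequency list, most common form first.
    for rank, (_, occurrences) in enumerate(counts.most_common(), start=1):
        covered += occurrences
        if covered / total >= target_coverage:
            return rank
    return len(counts)

# Toy corpus: one very common form and two rarer ones.
corpus = ["the"] * 7 + ["house"] * 2 + ["houses"]
print(forms_needed(corpus, 0.5))   # 1
print(forms_needed(corpus, 0.8))   # 2
print(forms_needed(corpus, 0.95))  # 3
```

Run on a real corpus with targets such as 0.975, the same loop would give figures comparable to the 20,000, 80,000 and 400,000 word forms quoted above.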
I made some tests on a corpus of 17.99 million Swedish words from proofread texts in Project Runeberg. Many of these words use old spelling (and describe old concepts) and won't be found in modern dictionaries, so this is not a perfect test case for contemporary spell checking dictionaries. When I run this corpus through my own dictionaries, which do contain some words in old spelling, it leaves a remainder of 1.482 million words or 8.2% of the corpus, meaning that I now have 91.8% coverage of this corpus. If I combine my old spelling component with the existing Aspell dictionary (Göran Andersson's from 2003), it leaves a remainder of 1.837 million words or 10.4% of the corpus, meaning 89.6% coverage. So my progress over Göran's dictionary is indeed very small. Coverage of around 90% can be achieved for German with a dictionary of the 20,000 most common word forms, which can be compared to the 24,000 basic forms in Göran Andersson's 2003 dictionary. Even though my dictionary has many additional word forms, their contribution to the corpus coverage isn't very large.
December 19, 2006: On the dev@lingucomponent list, Kevin Scannell discusses how to use precision and recall metrics for spell checkers.
December 18, 2006: I update the Nordic Words page and publish a version of my own Swedish dictionary (ss100.txt) with 221,599 words. I also do a little survey of the spell checking support in various programs:
Google's writely.com is a web word processor. It has a built-in spell checker that automatically recognizes the language. Its Swedish spell checker behaves exactly like OpenOffice.org 2.0.4, which indicates the same Ispell/Aspell/Myspell/Hunspell dictionary is used for Swedish (Göran Andersson's dictionary from 2003). When I pasted the words from my test page, there were so many errors that the spell checker automatically shut down and I had to turn it back on manually.
The Opera web browser (I tried version 9.10) has a built-in spell checker for web forms. The user interface is a bit old-fashioned, in that it doesn't underline the errors, but uses a dialog window that steps through the web form. On Apple's Mac OS X it uses the system's built-in spell checker, but on all other platforms it requires the user to install GNU Aspell.
The word processor Abiword has built-in spell checking support. The user interface underlines any errors. The Swedish spell checking is apparently based on Göran Andersson's 2003 dictionary, although I cannot find out which software it uses (GNU Aspell, ispell or Myspell).
KDE's editors kate and kwrite have built-in spell checking support, apparently based on GNU Aspell. The user interface doesn't underline errors, but provides a dialog window that steps through the text one word at a time.
Note to self: I should take a closer look at Freedict.de. Where do these dictionaries really come from? Are they maintained?
A look at the German ispell dictionaries by Björn Jacke:
           Occurrences         Affix
Nov. 2005 Feb. 2003 Nov. 1999  flag  Usage
--------- --------- ---------  ----  ----------------------
    79681     81191     75748        Basic forms
   307261    308860    294897        Unique words
--------- --------- ---------  ----  ----------------------
    13755     13933     11257  /S    Genitive -s
    11815     11723     11397  /A    Adjective inflexion
     8848      9319     10070  /P    Plural -en
     8166      8374      8048  /N    Plural -n
     7367      7346      7004  /D    Participle -d
     6837      6828      6595  /I    Regular verbs, present tense
     6620      6611      6358  /X    Regular verbs, present tense
     5310      5303      5140  /Y    Regular verbs, past tense
     4315      4578      4525  /T    Genitive -es
     4189      4406      4118  /E    Plural -e
     2066      2061      1991  /O    Participle inflexion
     1999      1971        82  /J    -ung and inflexions
     1846      1840      1813  /C    Adjective comparison
     1580      1656      1636  /p    Irregular plurals
     1452      1047       831  /F    -in and inflexions
      721       719       615  /Z    Non-regular verbs, past tense
      672       665       619  /U    Prefix un-
      619       620       615  /V    Prefix ver-
      574       569       486  /B    -bar and inflexions
      497       492       138  /W    Imperatives
      289       310       280  /R    Plural -er
      235       251       250  /Q    Plural -sse
      206       206       208  /G    Prefix ge-
       64        68        65  /q    Plural -sse, special case for feminines
       57        56        44  /M    -chen and inflexions
       20        21        22  /f    Words ending in -ph can also have -f
       18        17        19  /L    -lich and inflexions
        4         4         4  /H    -heit and inflexions
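Counts like these can be gathered by tallying the flags in the dictionary's word list. The Python sketch below assumes one entry per line in the form word/FLAGS, where each character after the slash is one affix flag; that layout, the function name and the sample entries are assumptions for illustration, not taken from Björn Jacke's files.

```python
from collections import Counter

def count_affix_flags(lines):
    """Return (number of basic forms, Counter of how many entries carry each flag)."""
    basic_forms = 0
    flag_counts = Counter()
    for line in lines:
        entry = line.strip()
        if not entry:
            continue  # skip blank lines
        basic_forms += 1
        _, _, flags = entry.partition("/")
        # Count each flag once per entry, tolerating "word/A/B" as well as "word/AB".
        flag_counts.update(set(flags.replace("/", "")))
    return basic_forms, flag_counts

# Made-up entries in the assumed layout.
sample_entries = ["Haus/T", "Zeitung/P", "schön/AC", "und"]
basic_forms, flag_counts = count_affix_flags(sample_entries)
for flag, count in sorted(flag_counts.items()):
    print(f"/{flag} {count}")
```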
December 15, 2006: Two Danish OpenOffice developers meet with CST, Center for Sprogteknologi, a commercial provider of Danish dictionaries, to discuss how to improve the Danish spell checking dictionary for OpenOffice. Brief report on the 'dansk' mailing list. To me it seems unlikely that any useful solution would be found this way.
December 11, 2006: Version 2.0.4 of OpenOffice.org has auto corrections (AutoKorrigeringar) for Swedish, based on a static list of about 100 word pairs, e.g. HJE -> hej, MEDECIN -> medicin. Where do they come from? They're not part of the spelling dictionary. There are also word pairs for Danish and German (both have longer lists), but none for Norwegian.
Firefox 2.0 offers spell checking for web forms (e.g. wiki editing). There is a Swedish spelling dictionary by Hasse Wallanger, based on the Swedish Myspell dictionary of August 14, 2003 ("baserad på den svenska ordlistan från 20030814 för Myspell"). It behaves a little differently from the Swedish spell checker in OpenOffice 2.0.4, in that it allows free concatenation of words. It also spell checks only an initial fraction of a web form, which is explained in bug 360434. Type about:config in the URL field and look for the variable extensions.spellcheck.inline.max-misspellings, which defaults to 500. Double click on this value and change it to a much higher value, e.g. 15000.
December 7, 2006: I think we need a test case for Swedish spell checking that is separate from the development of the dictionary. As a pilot test, I'm starting a subpage /Test av stavningskontrollen. Göran Andersson publishes version 1.22 of DSSO.
December 2, 2006: I sign up for various OpenOffice mailing lists, and this wiki. What takes me here is the poor spell checking support for Swedish in OpenOffice 2.0.2. The spelling dictionary is version 1.3.8 from sv.speling.org, which hasn't been updated since March 2002. It only contains 24490 words (basic forms), some of which are misspelled. The myspell affix file seems to have been automatically converted from the ispell affix file.
Timeline of Swedish dictionaries
- November 25, 2006: Göran Andersson publishes version 1.19 of DSSO. Version 1.21 follows on December 1.
- 2006: The Swedish Academy publishes the 13th edition of SAOL.
- 2005: Volume 34 of SAOB ends at Tojs. The full work is expected to be completed in 2017.
- January 2005: Project Runeberg's OCR spelling dictionaries for Swedish and Danish are published within Nordic Words.
- April 2004: Public editing of susning.nu is closed. The user community migrates to the Swedish Wikipedia.
- March 6, 2003: My posting Svensk ordlista on the SSLUG-LOCALE mailing list.
- February 2003: On the Swedish web forum Gnuheter, I ask around for a business case for a Swedish dictionary (Affärsmodeller och fritt innehåll) without getting any useful answers.
- 2003: Göran Andersson takes back control of the Swedish spelling dictionary, now dsso.se, dissatisfied with some modifications made to it during the time it was at sv.speling.org.
- May 6, 2002: I join the sslug-locale mailing list for speling.org.
- 2002-2003: I digitize two editions (58 volumes) of the classic Swedish encyclopedia Nordisk familjebok (1876-1926). This is more food for word frequencies and spelling dictionaries.
- October 2001: I start susning.nu, a Swedish wiki, which grows very fast. As a spinoff I return to computing word frequencies and compiling my own spelling dictionary.
- January 29, 1998: Göran Andersson hands over his Swedish ispell dictionary (now version 1.2.1) to sv.speling.org.
- September 26, 1997: Göran Andersson's ispell dictionary version 1.2 accepts compound words. The list has 24082 basic forms, expanding to 117617 unique words.
- February 23, 1997: Göran Andersson's ispell dictionary version 1.1 has 24722 basic forms, expanding to 84740 unique words.
- January 15, 1997: Göran Andersson's ispell dictionary version 1.0 has 27737 basic forms, expanding to 76364 unique words. The brand new affix file is based on inspiration from a Danish affix file.
- November 1996: Within Project Runeberg, the subproject "Nordic Words" is started, maintained by Anders Brun. No updates are made after December 1997.
- 1993: The Swedish Academy introduces computers in editing SAOB.
- December 1992: I start Project Runeberg, the Scandinavian e-text archive.
- 1991-1993: I experiment with spelling dictionaries for spell and ispell.
- 1970s-1980s: A Swedish morphological spellchecker "stava" is developed at FOA/QZ in Stockholm. Traces of this might be available at KTH. Viggo Kann would know. Several later Swedish spell checkers with the same name exist. Various dictionaries float around. Linguists have access to proprietary lists for research purposes, and are not interested in creating "open content".
- 1970: Major Swedish newspapers abandon plural forms of verbs.
- 1968: The polite use of "Ni" (You/Sie) is replaced with simple "du" (you/du), making Swedish conversation as simple as English.
- 1960s: A young computational linguist Sture Allén uses paper tape from newspaper typesetters to compute word frequencies of the Swedish language. Laying the foundation for the Språkdata department at Gothenburg University, he later becomes secretary of the Swedish Academy.
- 1945: Swedish public schools make plural endings of verbs optional. Students who opt not to use them, must indicate this and then stick to their chosen style.
- 1939: At Easter, with fascicle 156, the Swedish Academy celebrates SAOB being halfway (A--K) completed.
- 1910: The polite use of "Ni" (You/Sie) is introduced in Swedish as a replacement for complicated titles, making Swedish conversation as simple as German.
- 1907: A major Swedish spelling reform does away with the combinations dt, fv, and hv.
- 1883: A new editor restarts the Academy's dictionary. The first fascicle is printed in 1893 and the first volume of "Svenska Akademiens Ordbok" (SAOB) is completed in 1898. The dictionary documents Swedish spelling since 1526.
- 1874: The Swedish Academy publishes a spelling dictionary in one volume, "Svenska Akademiens Ordlista" (SAOL). Its 6th edition (1889) and 8th edition (1923) are out of copyright.
- 1786: The Swedish Academy is founded by king Gustav III. One of its main tasks is to compile a dictionary of the Swedish language. Work begins immediately, but stops already in 1814. New attempts are started in 1834 and 1855. A fascicle for the letter "A" is published in 1870.
- 1526: Sweden's Lutheran church reformer Olaus Petri translates the New Testament to Swedish. Old Testament follows in 1541. His style of writing marks the beginning of modern Swedish orthography.
- 9th Century A.D.: About the same time as Iceland is populated by the Norwegians, Sweden's longest runic inscription, the Rök runestone, is carved. Runes are Scandinavian letters inspired by Greek/Latin alphabets but adapted for carving in stone or wood. Two different runic alphabets were used between c. 500 and 1000 A.D., the first with 24 letters, later simplified to one with 16 letters. With the introduction of Christianity around 1000 A.D., runes are gradually replaced with Latin script.