Grammar Checking
Grammar checking is seen as a particular implementation of a text iteration and markup process. Other iteration/markup processes, such as spell checking or smart tagging, can basically work in the same way (though they are currently not implemented like this). Wherever grammar checking is mentioned in the following documentation, it can be read as a placeholder for the more general task of text markup. As the objects carrying out the text iteration are aware of the particular markup process they are used for, it is in principle possible to fine-tune the iteration for the needs of that process.
Involved objects
Outside view
The grammar checking process consists of
- one or more documents to be checked
- one or more grammar checker implementations, each supporting at least one language
- one or more grammar check dialogs (at most one instance per document)
- one context menu when clicking on text marked as incorrect
- a global grammar checking iterator (common to all documents) implemented as singleton, checking one sentence (of an arbitrary document) at a time.
- one thread object per grammar checker that is used to perform the checking without blocking the GUI
- objects iterating through the text of a document, one object representing a single grammar checking task that was requested
- objects representing text blocks in a text document (“flat paragraphs”) that abstract from the concrete structure of the document and provide access to the text through simple text strings and integer values describing the positions and lengths of substrings
It is assumed that grammar checking is done sentence by sentence and that a single language can be assigned to each sentence (though each sentence may have a different one). For mixing languages within one sentence, see below.
Overview of the UNO types involved
All involved objects (except the thread object, which is a C++ object derived from osl::Thread) communicate with each other through UNO interfaces. The whole process uses the following UNO types:
- interface com.sun.star.text.XFlatParagraph
- interface com.sun.star.text.XTextMarkup
- interface com.sun.star.container.XStringKeyMap
- struct com.sun.star.lang.Locale
- constants com.sun.star.text.TextMarkupType
- interface com.sun.star.text.XFlatParagraphIterator
- interface com.sun.star.text.XFlatParagraphIteratorProvider
- interface com.sun.star.linguistic2.XGrammarChecker
- interface com.sun.star.linguistic2.XGrammarCheckingIterator
- service com.sun.star.linguistic2.GrammarCheckingIterator
- interface com.sun.star.linguistic2.XGrammarCheckingResultListener
- struct com.sun.star.linguistic2.GrammarCheckingResult
- struct com.sun.star.linguistic2.SingleGrammarError
For a description of the types and hints for how to implement them see Grammar Checking API.
Objects and their interfaces
We have three parts working together. The first part comes from the document being checked and is an implementation specific to the particular type of document (e.g. Writer or Calc). It encapsulates access to the text of the document. A document that wants to be checked for grammar errors must support the interface com.sun.star.text.XFlatParagraphIteratorProvider. Through this interface it must be able to provide objects implementing com.sun.star.text.XFlatParagraphIterator, which in turn return objects implementing com.sun.star.text.XFlatParagraph. The latter interface is derived from com.sun.star.text.XTextMarkup. In the following we will call these objects "flat paragraph iterators" (FPIterator) and "flat paragraphs" (FP). Where the word "paragraph" is used, it also denotes an FP, not a real paragraph in the document, as the two are not always the same.
An FP is not necessarily a paragraph in the document's sense. It can be a collection of paragraphs (e.g. a list), and it contains not only the flow text but also other text content such as text frames, headers and footers. As only the document core can handle such FP objects efficiently, this is a document-specific implementation. The FP does not reveal the complete internal text structure or its attributes; its content is only accessible as a string containing the complete text block.
An FPIterator is an object that allows iterating through all the FP objects that together make up the text content of the document. The order in which the paragraphs are iterated is arbitrary and an implementation detail of the FPIterator. The "regular" text content should usually be provided in reading direction, but how other text such as headers and footers (which exist only once but are repeated on every page) or text frames (which may be embedded into the flow text) fits in is not predetermined. Iterating through text is always tied to a text markup process that is to treat the whole document. Thus the iteration wraps around at the end of the document and does not end before all paragraphs have been marked as "checked" for the particular markup process (such as grammar checking). Paragraphs marked as "checked" are skipped in the iteration. For clients, using an FPIterator is therefore simple: ask it for new FP objects until none is returned, and do not care about how it is implemented.
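The iteration contract described above (skip paragraphs already marked as checked, wrap around at the end of the document, return nothing once everything is checked) can be sketched in a few lines. This is only an illustration in Python, not the UNO implementation: the classes FlatParagraph and FlatParagraphIterator and the method get_next_para are hypothetical stand-ins for objects implementing XFlatParagraph and XFlatParagraphIterator.

```python
from typing import List, Optional


class FlatParagraph:
    """Hypothetical stand-in for an object implementing XFlatParagraph."""

    def __init__(self, text: str) -> None:
        self.text = text
        self.checked = False  # set once the markup process has handled it


class FlatParagraphIterator:
    """Hypothetical stand-in for an object implementing XFlatParagraphIterator.

    Hands out unchecked paragraphs, wraps around at the end of the
    document, and returns None once every paragraph is marked as checked.
    """

    def __init__(self, paragraphs: List[FlatParagraph], start: int = 0) -> None:
        self._paragraphs = paragraphs
        self._pos = start - 1  # the first call then yields index `start`

    def get_next_para(self) -> Optional[FlatParagraph]:
        count = len(self._paragraphs)
        for step in range(1, count + 1):
            index = (self._pos + step) % count  # wrap around at the end
            if not self._paragraphs[index].checked:
                self._pos = index
                return self._paragraphs[index]
        return None  # everything is checked: the iteration is finished


def check_all(iterator: FlatParagraphIterator) -> List[str]:
    """Client side: ask for paragraphs until none is returned."""
    visited = []
    para = iterator.get_next_para()
    while para is not None:
        visited.append(para.text)
        para.checked = True  # a real client would run the markup process here
        para = iterator.get_next_para()
    return visited
```

Starting in the middle of a three-paragraph document, for example, the client sees the last paragraph first and then the first two after the wrap-around.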
The second part is a grammar checker. A grammar checker is a component implementing the interface com.sun.star.linguistic2.XGrammarChecker. For each language there may be a particular component that is able to check for grammar errors in this language. The configuration tells which component is responsible for which language. The implementation of com.sun.star.linguistic2.XGrammarChecker representing a particular component encapsulates the "private" API of this grammar checking component. This private API can be UNO based or pure Java, a CLI or COM interface, a C API etc.: anything that can be used or bridged to inside an implementation of a UNO interface. As the interface is pretty small, it should not be very complicated to wrap existing grammar checkers for use in OpenOffice.org.
In the middle lies the third component, which mediates between the other two. It implements the "logic" of the grammar checking process. As it talks to the other two parts only through their defined UNO APIs, this middle part is independent of the particular document type or grammar checking component. A UNO service called com.sun.star.linguistic2.GrammarCheckingIterator is the component that actually carries out the grammar checking process for all supported scenarios. It is a singleton that controls all running grammar checking processes and thus also knows all existing grammar checking components. It implements the interface com.sun.star.linguistic2.XGrammarCheckingIterator and also provides an object implementing com.sun.star.linguistic2.XGrammarCheckingResultListener. In the following this object will be called the GCIterator.
Required tasks
- Automatic grammar checking: while the user is editing documents, they should be checked for grammar errors in the background. Errors found should be marked somehow so that the user becomes aware of them. The visible part of a document should preferably be checked first.
- Interactive grammar checking via context menu: when the user clicks on a text part that has been marked as containing a grammar error, the user should be provided with information and suggestions on how to fix the error, or be able to discard the markup
- Interactive grammar checking via dialog: the user wants to see the information and suggestions returned from the grammar checker immediately, so instead of marking the text the process presents the result in a dialog and asks the user how to proceed
Sample process of automatic grammar checking
The document will get access to the GCIterator and requests checking the document by calling startGrammarChecking() and providing:
- a unique interface to the document (to be used to identify this document); as this interface is for identification purposes only, com.sun.star.uno.XInterface is perhaps the appropriate type. If any other type is used, note that this type sets a precondition that every “document” wanting to use the grammar checking API must fulfill.
- a reference to an interface com.sun.star.text.XFlatParagraphIteratorProvider. Usually this will be the same object as the document, but we did not want to require that, so we pass both interfaces in the call
- a flag indicating that this request is for automatic checking only and thus no suggestions are required and no dialog must be displayed.
You might wonder why the starting paragraph is not passed. The simple explanation is that the flag for automatic processing will be passed when an FPIterator is created and so this one will know that the whole document has to be processed, starting from the first paragraph.
The GCIterator maintains a queue of sentences to be processed. When called with the above arguments, it creates an entry consisting of those values and adds it at the end of the queue. The entry must also contain the current paragraph and the starting position of the current sentence to be checked. For this reason the GCIterator creates an FPIterator, passing the flag for automatic checking and the right TextMarkupType (GRAMMAR) to it. Then it retrieves the paragraph to start with by calling getFirstPara() on the FPIterator. This FP and the starting position inside it (0 at this point) are put into the queue entry.
For the sake of simplicity for now let's assume there is only one document to be processed. In reality the queue may contain elements for several documents and the GCIterator will process the entries belonging to the same document one after another.
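As a rough illustration of the request just described, the following Python sketch builds such a queue entry. All names in it (GCIterator.start_grammar_checking, get_flat_paragraph_iterator, get_first_para, QueueEntry, the string constant GRAMMAR) are hypothetical stand-ins chosen for this example; they mirror the steps in the text rather than the exact UNO signatures.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Optional

GRAMMAR = "GRAMMAR"  # placeholder for the TextMarkupType constant


class FPIterator:
    """Hypothetical flat paragraph iterator over a list of strings."""

    def __init__(self, paragraphs: List[str], markup_type: str, automatic: bool) -> None:
        self.paragraphs = paragraphs
        self.markup_type = markup_type
        self.automatic = automatic

    def get_first_para(self) -> Optional[str]:
        return self.paragraphs[0] if self.paragraphs else None


class Document:
    """Hypothetical document that also acts as its own iterator provider."""

    def __init__(self, paragraphs: List[str]) -> None:
        self.paragraphs = paragraphs

    def get_flat_paragraph_iterator(self, markup_type: str, automatic: bool) -> FPIterator:
        return FPIterator(self.paragraphs, markup_type, automatic)


@dataclass
class QueueEntry:
    doc_id: object             # only used to identify the document
    provider: Document         # where the FPIterator came from
    automatic: bool            # True: no suggestions, no dialog
    iterator: FPIterator
    paragraph: Optional[str]   # current flat paragraph
    sentence_start: int        # start of the sentence to check next


class GCIterator:
    def __init__(self) -> None:
        self.queue: Deque[QueueEntry] = deque()

    def start_grammar_checking(self, doc_id: object, provider: Document,
                               automatic: bool) -> QueueEntry:
        iterator = provider.get_flat_paragraph_iterator(GRAMMAR, automatic)
        entry = QueueEntry(doc_id, provider, automatic, iterator,
                           iterator.get_first_para(), 0)
        self.queue.append(entry)  # automatic requests go to the end
        return entry
```

In this sketch the document is passed twice, once as identifier and once as provider, which corresponds to the usual case mentioned above.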
Now the GCIterator dequeues the first element from the queue (which is the one we just added). It retrieves the text of the paragraph, asks the BreakIterator for a suggested end-of-sentence position for the sentence (which is identified by its starting position) and, after identifying the languages to use, calls the respective grammar checker(s) one by one to check that single sentence. To avoid blocking the UI (grammar checking can take some time), all this happens asynchronously: a thread object is created for each grammar checker component in use, and all grammar checking steps are executed in this thread.
The GCIterator returns immediately after creating the thread object, which is initialized with a copy of the queue entry. We could also retrieve the text of the FP before returning and place it into the queue entry, thus avoiding access to the document from the grammar checker's thread. Results from the grammar checker are received in the callback method of an XGrammarCheckingResultListener interface provided by the GCIterator (preferably implemented as an individual object). The results are used to mark possible errors in the FP through its XTextMarkup interface, but only if the FP has not been modified. Otherwise the results are silently discarded. The "modified" flag is set if the FP has been changed or deleted since it was returned by the iterator, in which case it is clear that the checking was done on outdated content.
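The rule for handling results (mark the errors only if the paragraph is unchanged, otherwise drop them) is small enough to show directly. In this Python sketch the Paragraph class, its modified flag, commit_text_markup and on_checking_finished are hypothetical names standing in for the FP, its XTextMarkup interface and the listener callback.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Paragraph:
    """Hypothetical flat paragraph with a markup list and a modified flag."""

    text: str
    modified: bool = False  # set when the text changed after it was handed out
    markups: List[Tuple[int, int]] = field(default_factory=list)

    def commit_text_markup(self, start: int, length: int) -> None:
        self.markups.append((start, length))


def on_checking_finished(paragraph: Paragraph,
                         errors: List[Tuple[int, int]]) -> bool:
    """Listener callback: apply (start, length) error ranges if still valid.

    Returns True if the markup was applied, False if the results were
    discarded because they refer to outdated content.
    """
    if paragraph.modified:
        return False  # discard silently
    for start, length in errors:
        paragraph.commit_text_markup(start, length)
    return True
```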
It is possible to dispatch the markup step into the Office thread if concurrency problems can be an issue. Whether processing the text will be halted while that happens is open for debate.
When the last grammar checker result for this sentence has been processed and there is still unprocessed text left in the paragraph, the GCIterator continues by putting a new queue entry at the end of the queue that differs from the one just processed only in its new starting position. Then it proceeds as with the first queue entry. Once the paragraph has been checked completely this way, the getNextPara() function of the XFlatParagraphIterator interface is called to retrieve the next paragraph to be checked. If one is found, we start anew as described above with the new paragraph by creating a new queue entry and putting it at the end of the queue. More about handling the queue can be found below.
The whole iteration will be continued until all paragraphs have been marked as checked. This is indicated by the FPIterator by returning an empty reference in getNextPara().
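The loop described in the last paragraphs can be sketched as follows, for one document and without any threading. This is a simplified Python illustration under stated assumptions: sentence_end is a naive replacement for the BreakIterator, ParagraphSource is a hypothetical FPIterator that returns None when nothing is left, and "checking" a sentence just records it.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple


def sentence_end(text: str, start: int) -> int:
    """Naive stand-in for the BreakIterator.

    The sentence ends after the next '.', '!' or '?' plus trailing blanks,
    or at the end of the text if no such character follows.
    """
    for i in range(start, len(text)):
        if text[i] in ".!?":
            end = i + 1
            while end < len(text) and text[end] == " ":
                end += 1
            return end
    return len(text)


class ParagraphSource:
    """Hypothetical FPIterator: hands out paragraphs, then None."""

    def __init__(self, paragraphs: List[str]) -> None:
        self._remaining = list(paragraphs)

    def get_next_para(self) -> Optional[str]:
        return self._remaining.pop(0) if self._remaining else None


def check_document(source: ParagraphSource) -> List[str]:
    """Process one sentence per queue entry and requeue the rest."""
    queue: Deque[Tuple[str, int]] = deque()
    first = source.get_next_para()
    if first is not None:
        queue.append((first, 0))
    checked = []
    while queue:
        paragraph, start = queue.popleft()
        end = sentence_end(paragraph, start)
        checked.append(paragraph[start:end].strip())  # "check" the sentence
        if end < len(paragraph):
            queue.append((paragraph, end))  # same paragraph, new start
        else:
            following = source.get_next_para()
            if following is not None:
                queue.append((following, 0))  # first sentence of next paragraph
    return checked
```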
Please note that all the asynchronicity required for background grammar checking is implemented in the GCIterator only, and each grammar checker implementation should always run in the same thread. That makes life easier for the grammar checker component but still provides a sufficient amount of parallelism.
Sample process of interactive grammar checking
There are some basic differences when comparing interactive grammar checking with automatic checking:
- the results of grammar checking a sentence need to be interactively post-processed by the user.
- each grammar checker is allowed to make use of its own implementation of a grammar checking dialog, and of another dialog to view and modify implementation-specific options as well. The 'options dialog' should have two entry points: one accessible from a toolbar, and the other a button in the grammar checking dialog. If the grammar checker features only an options dialog but not a grammar checker dialog, the office internal dialog must be able to start that options dialog. (See the questions and problems section as well!)
- because some grammar checkers require the text of previous sentences in the paragraph to be known in order to determine whether the current one is correct, one cannot simply check one sentence after another once a change is applied. If, for example, the first two sentences are without error and the third sentence was corrected by the user, we cannot simply proceed to the fourth sentence. Because it cannot be figured out what the specific grammar checker implementation keeps track of, there is no choice but to throw everything away and tell that grammar checker that a new paragraph is to be started. Thus we need to have the grammar checker check the first three sentences (without reporting any error for them) in order to build up the internal data needed to check the fourth sentence. Only then can we pass the fourth sentence on to the grammar checker and expect the results to be correct. And for all the following sentences of that paragraph we have to do it all over again. A slightly different approach would be that the iterator does not pass all the previous sentences on to the checker again, but that the grammar checker does so itself implicitly if it needs to. After all, the grammar checker is always given the whole text along with the sentence start position. The grammar checker implementation needs to be aware, though, that by doing so it may encounter sentences in languages it does not know about and that would usually not have been passed to this specific checker.
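To make the last point more concrete, here is a toy example of a stateful checker and of the "reset and replay" approach. It is a Python sketch under invented assumptions: RepeatedOpenerChecker is a made-up checker that complains when a sentence starts with the same word as the previous one, and start_paragraph, check and check_sentence_with_context are hypothetical names, not part of the grammar checking API.

```python
from typing import List, Optional


class RepeatedOpenerChecker:
    """Toy stateful checker (invented for this example).

    It flags a sentence that starts with the same word as the previous
    sentence, so its verdict depends on what it has seen before.
    """

    def __init__(self) -> None:
        self._previous_opener: Optional[str] = None

    def start_paragraph(self) -> None:
        self._previous_opener = None  # throw away all internal state

    def check(self, sentence: str) -> bool:
        words = sentence.split()
        opener = words[0].lower() if words else None
        is_error = opener is not None and opener == self._previous_opener
        self._previous_opener = opener
        return is_error


def check_sentence_with_context(checker: RepeatedOpenerChecker,
                                sentences: List[str], index: int) -> bool:
    """Reset the checker, replay the earlier sentences, then check one."""
    checker.start_paragraph()
    for earlier in sentences[:index]:
        checker.check(earlier)  # result ignored, only builds up state
    return checker.check(sentences[index])
```

If the third sentence of "The cat sleeps. It purrs. It eats. It plays." is changed to "Then it eats.", the verdict for the fourth sentence changes as well, which is why the state has to be rebuilt after the edit.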
Going with the preferred way of having the grammar checker scan previous text implicitly if need be, interactive checking looks like this:
The document determines the first paragraph to be checked, usually the one where the cursor is located. As in the "automatic" case, the process is started by calling startGrammarChecking(), and the same kind of arguments are passed. The FPIterator again knows the paragraph to start with from the flag for automatic checking passed to it (now with the value "false"). To make it a little less complicated to determine whether the whole document was processed, we would probably like to start checking at the beginning of the paragraph and not at a specific sentence within it, even if the cursor is placed e.g. in the last sentence (this can be discussed though). So basically the general way of processing the document is the same as in the case of automatic checking. The starting paragraph is different, but the process also goes through the whole document. Of course the way the results are handled is different, as mentioned above.
Once called, the GCIterator creates a new entry for the queue, but now it places that entry at the start of the queue instead of at the end. This way interactive checking takes precedence over automatic checking, and the latest UI-triggered request is at the top of the queue and gets processed next.
Results must be processed by some code that implements the XGrammarCheckingResultListener interface; for simplicity we call this code "the dialog". It is called by the GCIterator when the GCIterator itself has received the callback from the grammar checker. If this code needs to be executed in the Office thread, it must take care of this by itself, since it will be called in the thread of the grammar checker. When the dialog is executed in the Office thread, either the grammar checker thread will be blocked or the dialog has to implement a queue for further results that come in as the checking proceeds.
For the sake of simplicity, this text sticks to one single grammar checking dialog used by all checkers!
If one or more of the grammar checkers report an error in the current sentence, the error reports from all the checkers are collected, and the grammar checking dialog is started (if not already open, see below) and filled with the necessary data by the GCIterator (the text and the complete list of errors). The iterator does not wait for the dialog to be finished or to advance to the next sentence; it continues with its own tasks (e.g. entering its main loop and starting to check a sentence from another document). The dialog only shows the very sentence the error was found in and has to allow for at least
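The two insertion points can be illustrated with a double-ended queue. This is a minimal Python sketch; the class CheckQueue and its method names are hypothetical and only model the ordering, not the entries themselves.

```python
from collections import deque
from typing import Deque, Optional


class CheckQueue:
    """Hypothetical model of the GCIterator queue ordering."""

    def __init__(self) -> None:
        self._entries: Deque[str] = deque()

    def add_automatic(self, entry: str) -> None:
        self._entries.append(entry)      # background work waits its turn

    def add_interactive(self, entry: str) -> None:
        self._entries.appendleft(entry)  # UI requests are processed next

    def next_entry(self) -> Optional[str]:
        return self._entries.popleft() if self._entries else None
```

With this ordering the most recent interactive request is always the next one to be processed, and automatic entries keep their relative order behind it.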
- showing all the error positions (preferably all at once),
- reviewing each error (displaying the detailed information about that error) and suggestions for corrections,
- modifying the sentence's text freely,
- changing the language of text parts or all the text,
- ignoring the errors and continuing with the next sentence,
- committing the changes made and continuing with checking (as long as the paragraph was not modified or invalidated meanwhile),
- if that very paragraph was modified meanwhile, there will be a button that allows the dialog to discard the changes (that are not yet applied) and restart checking with the sentence the cursor currently is in (which may be in a completely different paragraph) by adding that to the top of the queue (if anything is left),
- and if the paragraph was invalidated (deleted), the changes in the dialog are to be discarded as well, and getNextPara() should be called to continue checking, thus adding the next sentence to be checked (if anything is left) to the top of the queue,
- or canceling the interactive checking and closing the dialog.
If the changes are committed they are applied to the paragraph by using the XFlatParagraph interface.
Then, if there is still text left in the paragraph, the next sentence is added at the start of the queue (as described above). If the paragraph has been processed completely, the getNextPara() function is called to get the next paragraph to be checked. If no such paragraph is found, the iteration is finished and the dialog can be closed. Otherwise we continue by putting an entry for interactively checking the first sentence of the newly found paragraph at the start of the queue. (Either way the entry needs to have the XGrammarCheckingResultListener reference set in order to provide the dialog with new data to be displayed when the next sentence with errors is found.)
Then the dialog is left open and the GCIterator takes control again and can proceed with the next entry from the start of the queue. This way the process continues until the next error is found or the iteration over the document is finished.
If the dialog is closed (either because the iteration has finished or because the cancel button was pressed) the interactive checking is stopped simply by not adding another entry to the queue.
Please note that because the starting point for grammar checking the whole document may vary (be it automatic or interactive), this may result in different errors! For example: in German it is correct to write dolphin either as "Delfin" or as "Delphin". Still, one would probably want to enforce consistent use of only one of the two spellings. Thus, if a grammar checker wants to enforce this, it has to keep track internally of which spelling was encountered first and reject the other spelling henceforth.
Side note: The dialog needs to implement the XComponent interface and the GCIterator needs to be its listener.
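The effect of the starting point can be reproduced with a very small consistency check. The Python function below is an invented illustration of the "first spelling wins" rule for the Delfin/Delphin example; it is not how any real grammar checker is known to work, and the name inconsistent_spellings and its parameters are hypothetical.

```python
from typing import List, Optional

VARIANTS = {"delfin", "delphin"}  # the two accepted spellings


def inconsistent_spellings(sentences: List[str], start: int = 0) -> List[str]:
    """Check all sentences, beginning at index `start` and wrapping around.

    The first variant encountered becomes the accepted one; every later
    occurrence of the other variant is reported.
    """
    first_seen: Optional[str] = None
    flagged = []
    count = len(sentences)
    for step in range(count):
        for word in sentences[(start + step) % count].split():
            bare = word.strip(".,;:!?")
            if bare.lower() not in VARIANTS:
                continue
            if first_seen is None:
                first_seen = bare.lower()
            elif bare.lower() != first_seen:
                flagged.append(bare)
    return flagged
```

For the two sentences "Der Delfin schwimmt." and "Ein Delphin springt.", starting at the first sentence reports "Delphin", while starting at the second reports "Delfin".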
Opening the context menu by right clicking on a text part that is marked as being incorrect requires yet another approach. The differences here are:
- Only a single sentence should be checked (but still to do this correctly the grammar checker may need to scan all the previous text in the paragraph)
- and only those errors/corrections (or part of them if the list gets too long) should be displayed that belong to the respective marked text part. That is, corrections are needed only for a subset of all the errors in a sentence, which may leave some room for optimization.
Thus when the right-click takes place the document (when creating the menu which is to be done in the main thread) calls the respective function of the GCIterator and an entry similar to interactive checking of that very sentence is added to the start of the queue. The only differences will be that there are some additional values in that entry:
- one for the starting position of the marked text part, and one for its length. These indicate that the grammar checkers only need to find errors in that text range and that the return value (which usually should hold all errors/corrections for that sentence) only needs to cover that range as well. (On the other hand, it would be possible to retrieve all errors and thus behave exactly as in interactive checking, and just ignore the results that are outside the indicated range.)
- a flag needs to indicate that this is for the context menu only (and thus there is no need for an iteration to be started, i.e. no further queue entry will be added implicitly when processing this entry)
- also a reference to the XGrammarCheckingResultListener interface that is used by the GCIterator to provide the context menu with the results is needed. (Naturally this implementation of the interface is a different one than the one used in the dialog for interactive checking.)
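The alternative mentioned in the first item (retrieve all errors for the sentence and ignore those outside the marked range) amounts to a simple filter. The Python sketch below assumes that an error "belongs" to the marked text part if the two ranges overlap; that rule, the GrammarError class and the function errors_in_range are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GrammarError:
    """Hypothetical error record: position, length and a description."""

    start: int
    length: int
    message: str


def errors_in_range(errors: List[GrammarError],
                    range_start: int, range_length: int) -> List[GrammarError]:
    """Keep only the errors that overlap the marked text part."""
    range_end = range_start + range_length
    return [
        error for error in errors
        if error.start < range_end and error.start + error.length > range_start
    ]
```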
Since the call to the GCIterator is asynchronous, we need to wait for a reasonable, limited amount of time (e.g. 3 seconds) to receive the results via the callback. If we do get them in time, we can show the context menu as planned. If not, since we cannot wait forever, we have to display a fallback menu (either the regular one or one showing an entry like "grammar checking timed out").
Since the context menu may already be closed (either before the 3 seconds are over or after) when the GCIterator is finally ready to use the callback function to provide the results, the context menu needs to implement the XComponent interface and the GCIterator must be its listener. The GCIterator is required to register as such already when the context menu calls the function to trigger grammar checking for the sentence.
Right before the context menu gets displayed, it should already be disposed. This would be necessary later anyway, and doing it now should prevent the callback function from being executed belatedly if grammar checking was too slow (or did not return at all) and the fallback menu is displayed.
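The waiting and disposing logic can be sketched with a thread-safe event. This Python example is an illustration only: ContextMenuRequest, on_results, wait_and_dispose and build_menu are hypothetical names, the menu is modeled as a list of strings, and the disposal is reduced to a flag instead of the XComponent mechanism.

```python
import threading
from typing import List, Optional


class ContextMenuRequest:
    """Hypothetical receiver for the results of one context menu request."""

    def __init__(self) -> None:
        self._done = threading.Event()
        self._lock = threading.Lock()
        self._disposed = False
        self._results: Optional[List[str]] = None

    def on_results(self, results: List[str]) -> None:
        """Callback, invoked from the grammar checker's thread."""
        with self._lock:
            if self._disposed:
                return  # too late: the menu has already been built
            self._results = list(results)
        self._done.set()

    def wait_and_dispose(self, timeout: float) -> Optional[List[str]]:
        """Wait for results, then dispose so late callbacks are ignored."""
        self._done.wait(timeout)
        with self._lock:
            self._disposed = True
            return self._results


def build_menu(request: ContextMenuRequest, timeout: float = 3.0) -> List[str]:
    """Return the suggestion entries, or a fallback entry on timeout."""
    results = request.wait_and_dispose(timeout)
    if results is None:
        return ["grammar checking timed out"]
    return results
```

Disposing inside the same lock that guards the callback ensures that a result arriving just after the timeout cannot change a menu that is already being displayed.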
When everything went fine and the user was able to select a specific correction, the XFlatParagraph interface provided as part of the GrammarCheckingResult is used to make the changes in the text.
Checking several documents at the same time and mixing all the above tasks
When more than one document is available and we have to cope with mixing background and interactive checking we have the following requirements:
- computing time should be as evenly distributed over all automatic grammar checking tasks (each on one document) as possible
- interactive checking in the current window should always get precedence over all other pending tasks, interactive or automatic
- checking with context menu should always get highest priority
The first requirement can be fulfilled by always checking only one sentence and adding further entries for the rest of the text to the end of the queue. Other grammar checking tasks that have been started in the meantime can then take their turn, and all tasks interleave, as each task requeues at the end once it has finished a single sentence.
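The resulting round-robin behavior is easy to see in a small simulation. In this Python sketch each task is just a list of pre-split sentences, and interleave is a hypothetical helper that processes one sentence per queue entry and requeues the task at the end.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


def interleave(tasks: Dict[str, List[str]]) -> List[str]:
    """Return the order in which the sentences of all tasks get checked."""
    queue: Deque[Tuple[str, int]] = deque(
        (name, 0) for name, sentences in tasks.items() if sentences
    )
    order = []
    while queue:
        name, index = queue.popleft()
        order.append(f"{name}:{tasks[name][index]}")  # check one sentence
        if index + 1 < len(tasks[name]):
            queue.append((name, index + 1))  # requeue the rest at the end
    return order
```

For three documents with three, one and two sentences, the order is A:a1, B:b1, C:c1, A:a2, C:c2, A:a3, so no document has to wait for another one to be finished completely.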
All interactive tasks will get the necessary precedence by placing them at the start of the queue so that each user request can be fulfilled as fast as possible (the limiting factor is that the currently running checking process needs to end first). It's easy to see that interactive requests get precedence over others this way also.
If more than one grammar checker is present, and assuming one thread per checker, the queue is processed in such a way that each thread picks up those entries that belong to "its" grammar checker and leaves the others alone. Besides that, nothing changes in the procedure outlined above.
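One possible reading of this scheme is a shared queue from which each checker thread removes only the entries for its own language. The Python sketch below makes several simplifying assumptions: entries are (language, sentence) pairs, each checker handles exactly one language, and SharedQueue, take_for and run_checkers are hypothetical names.

```python
import threading
from collections import deque
from typing import Deque, Dict, List, Optional, Tuple

Entry = Tuple[str, str]  # (language, sentence)


class SharedQueue:
    """Queue shared by all checker threads, guarded by a lock."""

    def __init__(self, entries: List[Entry]) -> None:
        self._entries: Deque[Entry] = deque(entries)
        self._lock = threading.Lock()

    def take_for(self, language: str) -> Optional[Entry]:
        """Remove and return the first entry for this language, if any."""
        with self._lock:
            for entry in self._entries:
                if entry[0] == language:
                    self._entries.remove(entry)
                    return entry
        return None


def worker(queue: SharedQueue, language: str, handled: List[str]) -> None:
    """Checker thread: process own entries until none is left."""
    while True:
        entry = queue.take_for(language)
        if entry is None:
            return
        handled.append(entry[1])  # a real checker would check the sentence


def run_checkers(entries: List[Entry], languages: List[str]) -> Dict[str, List[str]]:
    queue = SharedQueue(entries)
    handled: Dict[str, List[str]] = {lang: [] for lang in languages}
    threads = [
        threading.Thread(target=worker, args=(queue, lang, handled[lang]))
        for lang in languages
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return handled
```

Entries for which no checker thread exists simply stay in the queue in this sketch.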
Problems and questions currently left open
Grammar checking of mixed language text
It is believed that even for sentences that use several languages there is only a single language the whole sentence is in. (How that language is identified is a completely different matter and probably a complex task though!) Thus that sentence should only be grammar checked in that single language. For example:
The German word for television is Fernseher.
This sentence should be grammar checked in English and not German.
If possible though (for example if language attributes are set correctly), it should be noted that Fernseher is not English, and thus at the very least no spelling error for English should be reported for that word. It is probably also impossible to report any grammar error that involves embedded foreign words. Thus the best to hope for is probably that the foreign word is recognized as correct by the respective spell checker.
Even with a completely embedded sentence like
In Gallica Caesar said 'Alea iacta est.' and continued his battle.
the above text is in a single language, English, and not Latin. Whether any existing grammar checker is smart enough to cope with embedded sentences of a different language, I don't know. To keep it simple for the time being, the whole text should be grammar checked as one sentence in English and in that language only.
Grammar checking and spell checking at the same time
Should spell checking have an iterator of its own with a thread of its own? Or should spell checking be handled by the GrammarCheckingIterator as well?
Other Questions / problems:
- checking is limited to paragraphs (unless the implementation of XFlatParagraph chooses to hide something more behind it, which is unlikely). Though one could think of enumerations as a possible application for this behavior.
- in the case of several grammar checkers for one language, what do we do if they report different end-of-sentence positions? We really can't handle each checker individually here.
- does a grammar checker that requires knowledge of the previous text in this paragraph need to have that text presented even if it is in a language it does not know?
- How to achieve consistency of usage (e.g. spelling) when having grammar checkers in multiple languages? E.g. e-mail vs. email? Or does it need to be consistent on a per-language basis only?
- How to determine the language of a sentence? Use the language of the first word, or language guessing, or the language with the most words,... ?
- Problems related to a specific UI, namely the grammar checking dialog that is still to be defined, are not yet covered.
- The troublesome case of having for example three grammar checkers for one language and two of them wanting to use their own dialog while the third will go with the office internal one is left out. Because if all of them report errors in the same sentence and like to use their own dialog as well we will have to cope with switching between three dialogs just to edit a single sentence. That's just plain awful to even think about. And I doubt there will be even one user to appreciate such a scenario.
- Should the document (e.g. XFlatParagraph) be in charge of determining the language for checking, or should it be the GrammarCheckingIterator? Probably the latter...
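Of the options listed above for determining the language of a sentence, "the language with the most words" is the easiest one to try out. The following Python sketch assumes that every word already carries a language attribute, which is itself an assumption about the document; the function majority_language and its tie-breaking rule (first language in reading order) are invented for this illustration and do not settle the open question.

```python
from collections import Counter
from typing import List, Optional, Tuple


def majority_language(words: List[Tuple[str, str]]) -> Optional[str]:
    """Pick the language that most words of the sentence are tagged with.

    `words` holds (word, language) pairs in reading order. On a tie the
    language that appears first in the sentence wins.
    """
    counts = Counter(language for _, language in words)
    if not counts:
        return None  # empty sentence: nothing to decide
    best = max(counts.values())
    return next(language for _, language in words if counts[language] == best)
```

For the example sentence "The German word for television is Fernseher.", six words are tagged as English and one as German, so the sentence would be checked in English.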