Difference between revisions of "Grammar Checking"

Revision as of 15:39, 26 September 2007

Grammar checking is seen as a particular implementation of a text iteration and markup process; other iteration/markup processes like spell checking or smart tagging can basically work in the same way (though currently they are not implemented like this). Wherever grammar checking is mentioned in the following documentation, it can be read as a placeholder for the more general task of text markup.

Involved objects

Outside view

The grammar checking process consists of

  • one or more documents to be checked
  • one or more grammar checker implementations, each supporting at least one language.
  • one or more grammar check dialogs (at most one instance per document)
  • one context menu when clicking on text marked as incorrect
  • a global grammar checking iterator (common to all documents) implemented as singleton, checking one sentence (of an arbitrary document) at a time.
  • one thread object per grammar checker that is used to perform the checking without blocking the GUI
  • objects iterating through the text of a document, one object representing a single grammar checking task that was requested
  • objects representing text blocks in a text document (“flat paragraphs”) that abstract from the concrete structure of the document and provide access to the text through simple text strings and integer values describing positions and lengths of substrings.

Overview of the UNO types involved

All involved objects (except the thread object, which is a C++ object derived from osl::Thread) communicate with each other through UNO interfaces. The whole process uses the following UNO types:

  • interface com.sun.star.text.XFlatParagraph
  • interface com.sun.star.text.XTextMarkup
  • interface com.sun.star.container.XStringKeyMap
  • struct com.sun.star.lang.Locale
  • constants com.sun.star.text.TextMarkupType
  • interface com.sun.star.text.XFlatParagraphIterator
  • interface com.sun.star.text.XFlatParagraphIteratorProvider
  • interface com.sun.star.linguistic2.XGrammarChecker
  • interface com.sun.star.linguistic2.XGrammarCheckingIterator
  • service com.sun.star.linguistic2.GrammarCheckingIterator
  • interface com.sun.star.linguistic2.XGrammarCheckingResultListener
  • interface com.sun.star.linguistic2.XGrammarCheckingListener
  • interface com.sun.star.linguistic2.XGrammarCheckerListener
  • struct com.sun.star.linguistic2.GrammarCheckingResult
  • struct com.sun.star.linguistic2.SingleGrammarError

Objects and their interfaces

We have three parts working together. The first part comes from the document being checked; it is an implementation that is specific to the particular type of document (e.g. Writer or Calc). It encapsulates the access to the text of the document. A document that wants to be checked for grammar errors must support the interface com.sun.star.text.XFlatParagraphIteratorProvider. Through this interface it must be able to provide objects implementing com.sun.star.text.XFlatParagraphIterator that themselves return objects implementing com.sun.star.text.XFlatParagraph. The latter interface is derived from com.sun.star.text.XTextMarkup. In the following we will call these objects "flat paragraph iterators" (FPIterator) and "flat paragraphs" (FP).

The second part is a grammar checker. A grammar checker is a component implementing the interface com.sun.star.linguistic2.XGrammarChecker. For each language there may be a particular component that is able to check for grammar errors in this language. The configuration will tell which component is responsible for which language. The implementation of com.sun.star.linguistic2.XGrammarChecker representing a particular component will encapsulate the "private" API of this grammar checking component. This private API can be UNO based or pure Java, a CLI or COM interface, a C API etc.: anything that can be used or bridged to inside an implementation of a UNO interface. As the interface is pretty small, it should not be very complicated to wrap existing grammar checkers for use in OpenOffice.org.

In the middle lies the third component, which mediates between the other two. It implements the "logic" of the grammar checking process. As it talks to the other two parts through their defined UNO API only, this middle part is independent of the particular document type or grammar checking component. A UNO service called com.sun.star.linguistic2.GrammarCheckingIterator is the component that actually carries out the grammar checking process for all supported scenarios. It is a singleton that controls all running grammar checking processes and thus also knows all existing grammar checking components. It implements the interface com.sun.star.linguistic2.XGrammarCheckingIterator and also provides an object implementing com.sun.star.linguistic2.XGrammarCheckingResultListener. In the following this object will be called the GCIterator.

For a description of some of the types and hints about their implementation see Grammar Checking API.

Required tasks

  • Automatic grammar checking: while the user is editing a document it should be checked for grammar errors in the background. Found errors should be marked somehow so that the user becomes aware of them. Preferably the visible part of a document should be checked first.
  • Interactive grammar checking via context menu: when the user clicks on a text part that has been marked as containing a grammar error, she should be provided with information and suggestions on how to fix it or discard the markup.
  • Interactive grammar checking via dialog: the user wants to see the information and suggestions returned from the grammar checker immediately, so instead of marking the text the process will present the result in a dialog and ask the user how to proceed.

Sample process of automatic grammar checking

The document will get access to the GCIterator and request checking of the document by calling startGrammarChecking() and providing:

  • a unique interface to the document (to be used to identify this document); as this interface is for identification purposes only perhaps css.uno.XInterface is the appropriate type. If any other type is used it should be considered that this type will set a precondition that “documents” must fulfill that want to use the grammar checking API.
  • a reference to an interface com.sun.star.text.XFlatParagraphIteratorProvider. Usually this will be the same object as the document, but we didn't want to require that, so we pass both interfaces in the call.
  • a flag indicating that this request is for automatic checking only and thus no suggestions are required and no dialog must be displayed.

The GrammarCheckingIterator maintains a queue of sentences to be processed. When called with the above arguments it creates an entry consisting of those values and adds it at the end of the queue.
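
The queue can be pictured with a minimal Python sketch. This is an illustration only, not the actual C++/UNO implementation: the names QueueEntry and GCQueue, the field names, and the Python spelling start_grammar_checking are hypothetical, and the document and provider are represented by arbitrary objects.

```python
from collections import deque
from dataclasses import dataclass
from typing import Any, Deque


@dataclass
class QueueEntry:
    """One pending checking request (hypothetical structure)."""
    doc_id: Any                 # unique interface identifying the document
    fp_iterator_provider: Any   # stands in for XFlatParagraphIteratorProvider
    automatic: bool             # True: background checking, no dialog
    sentence_start: int = 0     # start position of the next sentence to check


class GCQueue:
    """FIFO queue of pending requests, as kept by the GCIterator."""

    def __init__(self) -> None:
        self._entries: Deque[QueueEntry] = deque()

    def start_grammar_checking(self, doc_id: Any, provider: Any,
                               automatic: bool) -> None:
        # Automatic requests are appended at the end of the queue.
        self._entries.append(QueueEntry(doc_id, provider, automatic))

    def dequeue(self) -> QueueEntry:
        # The oldest entry is processed first.
        return self._entries.popleft()

    def __len__(self) -> int:
        return len(self._entries)
```

Note that the entry carries a sentence start position even though the caller does not pass one: it starts at 0 and is updated as the paragraph is worked through.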

You might wonder why the starting paragraph is not passed. The simple explanation is that the flag for automatic processing will be passed when an FPIterator is created and so this one will know that the whole document has to be processed, starting from the first paragraph.

For the sake of simplicity for now let's assume there is only one document to be processed. In reality the queue may contain elements for several documents and the GCIterator will process the entries belonging to the same document one after another. Whether processing the text will be halted while the results of a finished check are processed is open for debate.

Thus (since there are no further API calls) the GCIterator will dequeue the first element from the queue (which is the one we just added). It retrieves the text of the paragraph, asks the BreakIterator for a suggested end-of-sentence position (for the sentence identified by its starting position) and, after identifying the languages to use, calls the respective grammar checker(s) one by one to check that single sentence. To avoid blocking the UI (grammar checking can take some time) this will happen asynchronously, by creating a thread object for each grammar checker component in use and executing all grammar checking steps in this thread.
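
A rough Python sketch of this step follows. It is a sketch under assumptions, not the real design: end_of_sentence is a naive stand-in for the BreakIterator, the checker is a plain function instead of an XGrammarChecker, and the names check_async, toy_checker and on_result are hypothetical.

```python
import threading
from typing import Callable, List, Tuple

# (start, length) of a suspicious text range; a toy stand-in for an error record
Error = Tuple[int, int]


def end_of_sentence(text: str, start: int) -> int:
    """Naive stand-in for the BreakIterator: the sentence ends after the
    first '.', '!' or '?' at or after `start`, or at the end of the text."""
    for pos in range(start, len(text)):
        if text[pos] in ".!?":
            return pos + 1
    return len(text)


def check_async(checker: Callable[[str, int, int], List[Error]],
                text: str, start: int,
                on_result: Callable[[List[Error]], None]) -> threading.Thread:
    """Run one checker on one sentence in its own thread and return at once."""
    end = end_of_sentence(text, start)

    def work() -> None:
        # The checker receives the whole text plus the sentence boundaries.
        on_result(checker(text, start, end))

    thread = threading.Thread(target=work)
    thread.start()
    return thread


def toy_checker(text: str, start: int, end: int) -> List[Error]:
    """Hypothetical checker that only knows one mistake: 'teh'."""
    pos = text.find("teh", start, end)
    return [(pos, 3)] if pos >= 0 else []
```

The caller gets the thread object back immediately; the result arrives later through the on_result callback, which plays the role of the result listener.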

The GCIterator will return immediately after creation of the thread object. Results are received in the callback method of an XGrammarCheckingResultListener interface provided by the GCIterator (preferably implemented as an individual object).

Please note that all the asynchronicity we require for background grammar checking is implemented in the GCIterator only, and each grammar checker implementation should always run in the same thread. That makes life easier for the grammar checker component but still provides a sufficient amount of parallelism.

For the results returned by a grammar checker we first check whether the FP is unmodified (the modified flag will be set if the FP has been changed or deleted since it was returned by the iterator). If it is unmodified we mark all the incorrect text parts (this will automatically remove all previous outdated markings for this sentence). Otherwise we discard the results silently.
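
The rule for handling results can be sketched in Python as follows. The FlatParagraph class here is a toy stand-in, not the XFlatParagraph interface, and commit_markup and apply_result are hypothetical names chosen for the illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Error = Tuple[int, int]   # (start, length) of an incorrect text part


@dataclass
class FlatParagraph:
    """Toy stand-in for an FP; `modified` mirrors the flag described above."""
    text: str
    modified: bool = False
    markup: List[Error] = field(default_factory=list)

    def commit_markup(self, start: int, end: int, errors: List[Error]) -> None:
        # Drop outdated markings that begin inside the sentence [start, end),
        # keep markings of other sentences, then add the new ones.
        kept = [m for m in self.markup if not (start <= m[0] < end)]
        self.markup = sorted(kept + errors)


def apply_result(fp: FlatParagraph, start: int, end: int,
                 errors: List[Error]) -> bool:
    """Mark errors if the FP is unchanged; otherwise drop them silently."""
    if fp.modified:
        return False
    fp.commit_markup(start, end, errors)
    return True
```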

When the last grammar checker result for this sentence has been processed and there is still unprocessed text left in the paragraph, the GCIterator will update the queue entry with the new starting position and proceed with it.

If the paragraph has been checked completely this way, the getNextParagraph function of the XFlatParagraphIterator interface is called to retrieve the next paragraph to be checked. If one is found, we start anew with the new paragraph as described above. The whole iteration continues until all paragraphs have been marked as checked.
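
Taken together, the sentence and paragraph steps form a simple nested iteration. The Python sketch below shows only the order in which sentences are visited; it ignores the queue, threads and modification checks, and the names iterate_sentences and naive_end are hypothetical (naive_end is a crude replacement for the BreakIterator).

```python
from typing import Callable, Iterable, Iterator, Tuple


def naive_end(text: str, start: int) -> int:
    """Crude sentence boundary: just after the next '.', else end of text."""
    pos = text.find(".", start)
    return len(text) if pos < 0 else pos + 1


def iterate_sentences(paragraphs: Iterable[str],
                      end_of_sentence: Callable[[str, int], int]
                      ) -> Iterator[Tuple[int, int, int]]:
    """Yield (paragraph index, sentence start, sentence end) in checking order."""
    # Stepping through `paragraphs` plays the role of getNextParagraph.
    for index, text in enumerate(paragraphs):
        start = 0
        while start < len(text):
            end = end_of_sentence(text, start)
            yield index, start, end
            start = end   # the new starting position within this paragraph
```

For the two paragraphs "A. B." and "C" this visits two sentences in the first paragraph and one in the second.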

Each time a queue entry has been processed the GCIterator checks whether there is another entry for the same grammar checking component and continues with it by putting it up for processing in the thread assigned to the particular grammar checker. So not only the document but also the queue is accessed in the main thread.

Sample process of interactive grammar checking

There are three basic differences when comparing interactive grammar checking with automatic checking:

  • the results of grammar checking a sentence need to be interactively post-processed by the user.
  • each grammar checker is allowed to make use of its own implementation of a grammar checking dialog, and of another dialog to view and modify implementation-specific options as well. The 'options dialog' should have two entry points: one accessible from a tool-bar, and the other a button in the grammar checking dialog. If the grammar checker features only an options dialog but not a grammar checking dialog, the office internal dialog must be able to start that options dialog. (See the questions and problems section as well!)
  • because some grammar checkers need to know the text of previous sentences in the paragraph in order to determine whether the current one is correct, one cannot simply check one sentence after another once a change is applied. If, for example, the first two sentences are without error and the third sentence was corrected by the user, we can't simply proceed to the fourth sentence. Because we cannot know what state a specific grammar checker implementation keeps track of, there is no choice but to throw everything away and tell that grammar checker that a new paragraph is to be started. Thus we need to have the grammar checker check the first three sentences (without reporting any error for them) in order to build up the internal data needed to check the fourth sentence. Only then can we pass the fourth sentence on to the grammar checker and expect the results to be correct. And for all the following sentences of that paragraph we have to do it all over again. A slightly different approach would be that the iterator does not pass all the previous sentences on to the checker again; instead the grammar checker does this itself implicitly if it needs to. After all, the grammar checker is always given the whole text along with the sentence start position. But the grammar checker implementation needs to be aware that by doing so it may encounter sentences in languages it does not know about and that would usually not have been passed to this specific checker.

Going with the preferred way of having the grammar checker scan previous text implicitly if need be, interactive checking looks like this:

The document determines the first paragraph to be checked (for example the one where the cursor is displayed). To make it a little less complicated to determine whether the whole document has been processed, we would probably like to start checking at the beginning of the paragraph and not at a specific sentence within it, even if the cursor is placed e.g. in the last sentence (this can be discussed though).

When the starting paragraph is determined the document accesses the GrammarCheckingIterator and provides similar data as for automatic checking:

  • the unique reference to the document
  • an FPIterator object that has been initialized with the FP object where checking should start (usually the paragraph where the cursor is located)
  • the start-of-sentence position of the first sentence; here 0
  • the flag, now indicating interactive checking
  • and, in addition, a reference to an XGrammarCheckingResultListener interface, implemented by the dialog, that is used by the GrammarCheckingIterator as a call-back to provide the dialog with the text, data and results to be displayed. [Remark: if we do it synchronously we can get the results as a direct return value of the grammar checker. Keeping the API asynchronous would allow us to also do interactive checking in the background.]

The GCIterator waits for the current background processing step of the selected grammar checker to end (thus blocking the main thread) and creates a new entry for the queue, but now it places that entry at the start of the queue instead of at the end. This way interactive checking takes precedence over automatic checking, and the latest UI-triggered request will be at the top of the queue and get processed next. [Alternatively, if interactive checking is done in the thread too, the entry is just placed in the queue and the call returns.]
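
The difference in queue placement is small enough to show in a few lines of Python. This is a sketch only: entries are plain strings here, and the function name enqueue is hypothetical.

```python
from collections import deque
from typing import Deque


def enqueue(queue: Deque[str], entry: str, interactive: bool) -> None:
    """Interactive requests jump the queue; automatic ones wait their turn."""
    if interactive:
        queue.appendleft(entry)   # processed next
    else:
        queue.append(entry)       # processed after everything already queued


pending: Deque[str] = deque()
enqueue(pending, "auto: doc A, sentence 0", interactive=False)
enqueue(pending, "auto: doc B, sentence 0", interactive=False)
enqueue(pending, "dialog: doc A, sentence 0", interactive=True)
```

After these three calls the dialog request sits at the front of pending, ahead of both automatic requests, so it is the next one to be dequeued.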

As long as no error is found by the grammar checkers, the iteration and the tasks to be done in each iteration are the same as for automatic checking. The only differences are that the flag of new queue entries indicates interactive checking and that those entries are added at the start of the queue (and most probably no thread is used for doing the check).

For the sake of simplicity this text assumes a single grammar checking dialog that is used by all checkers!

If one or more of the grammar checkers report an error in the current sentence, the error reports from all the checkers are collected and the grammar checking dialog is started (if not already open, see below) and filled with the necessary data by the GrammarCheckingIterator (the text and the complete list of errors). The iterator will not wait for the dialog to be finished or to advance to the next sentence; it will continue with its own tasks (e.g. entering its main loop and starting to check a sentence from another document). The dialog will only show the very sentence the error was found in and has to allow for at least

  • showing all the error positions (preferably all at once),
  • reviewing each error (displaying the detailed information about that error) and the suggestions for corrections,
  • modifying the sentence's text freely,
  • changing the language of text parts or all the text,
  • ignoring the errors and continuing with the next sentence,
  • committing the changes made and continuing with checking (as long as the paragraph was not modified or invalidated meanwhile),
  • if that very paragraph was modified meanwhile there will be a button that allows the dialog to discard the changes (that are not yet applied) and restart checking with the sentence the cursor currently is in (which may be in a completely different paragraph) by adding that to the top of the queue (if anything is left),
  • and if the paragraph was invalidated (deleted) the changes in the dialog are to be discarded as well and getNextParagraph should be called to continue checking and (if anything is left) thus adding the next sentence to be checked to the top of the queue,
  • or canceling the interactive checking and closing the dialog.

If the changes are committed they are applied to the paragraph by using the XFlatParagraph interface.

Then, if there is still text left in the paragraph, the next sentence is added at the start of the queue (as described above). If the paragraph has been processed completely, the getNextParagraph function is called to get the next paragraph to be checked. If no such paragraph is found, the iteration is finished and the dialog can be closed. Otherwise we continue by putting an entry for interactively checking the first sentence of the newly found paragraph at the start of the queue. (Either way the entry needs to have the XGrammarCheckingResultListener reference set in order to provide the dialog with new data to be displayed when the next sentence with errors is found.)

Then the dialog is left open and the GrammarCheckingIterator takes control again and can proceed with the next entry from the start of the queue. This way the process continues until the next error is found or the iteration over the document is finished.

If the dialog is closed (either because the iteration has finished or because the cancel button was pressed) the interactive checking is stopped simply by not adding another entry to the queue.

Please note that because the starting point for grammar checking the whole document may vary (be it automatic or interactive), this may result in different errors! For example: in German it is correct to write dolphin either as "Delfin" or as "Delphin". But one would still probably want to enforce consistent use of only one of the two spellings. Thus if a grammar checker wants to enforce this, it has to keep track internally of which spelling was encountered first and reject the other spelling from then on.
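
A small Python sketch makes the order dependence concrete. The class SpellingConsistency and its method check are hypothetical; they only illustrate the kind of internal bookkeeping a grammar checker might do, not any existing checker.

```python
from typing import Dict, FrozenSet, Iterable, Tuple


class SpellingConsistency:
    """Accept whichever variant of a word is seen first, reject the others."""

    def __init__(self, variant_groups: Iterable[Tuple[str, ...]]) -> None:
        # Map every variant to the group of interchangeable spellings it is in.
        self._group: Dict[str, FrozenSet[str]] = {
            word: frozenset(group) for group in variant_groups for word in group
        }
        self._chosen: Dict[FrozenSet[str], str] = {}

    def check(self, word: str) -> bool:
        """Return True if `word` is consistent with what was seen so far."""
        group = self._group.get(word)
        if group is None:
            return True   # not a word with competing spellings
        first_seen = self._chosen.setdefault(group, word)
        return first_seen == word
```

A checker that starts at a paragraph containing "Delphin" will later flag "Delfin", while one that starts at a paragraph containing "Delfin" will flag "Delphin": the same document yields different error reports.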

Side note: The dialog needs to implement the XComponent interface and the GrammarCheckingIterator needs to be its listener.

Using the context menu with grammar checking

Opening the context menu by right clicking on a text part that is marked as being incorrect requires yet another approach. The differences here are:

  • Only a single sentence should be checked (but still to do this correctly the grammar checker may need to scan all the previous text in the paragraph)
  • and only those errors/corrections (or part of them if the list gets too long) should be displayed that belong to the respective marked text part. That is only for a subset of all the errors in a sentence the corrections are needed which may leave some room for optimization.

Thus when the right-click takes place the document (when creating the menu which is to be done in the main thread) calls the respective function of the GrammarCheckingIterator and an entry similar to interactive checking of that very sentence is added to the start of the queue. The only differences will be that there are some additional values in that entry:

  • one for the starting position of the marked text part, and one for its length. These indicate that the grammar checkers only need to find errors in that text range, and that the return value (which usually should hold all errors/corrections for that sentence) only needs to cover that range as well. (On the other hand it would be possible to retrieve all errors, thus behaving exactly as in interactive checking, and just ignore the results that are outside the indicated range.)
  • a flag indicating that this is for the context menu only (and thus there is no need for an iteration to be started, i.e. no further queue entry will be added implicitly when processing this entry)
  • a reference to the XGrammarCheckingResultListener interface that is used by the GrammarCheckingIterator to provide the context menu with the results. (Naturally this implementation of the interface is a different one than the one used in the dialog for interactive checking.)
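
The alternative mentioned in the first item, retrieving all errors and ignoring those outside the marked range, amounts to a simple overlap filter. In this Python sketch errors are (start, length) pairs and the function name errors_in_range is hypothetical.

```python
from typing import List, Tuple

Error = Tuple[int, int]   # (start, length) of an incorrect text part


def errors_in_range(errors: List[Error], start: int, length: int) -> List[Error]:
    """Keep only the errors that overlap the range [start, start + length)."""
    end = start + length
    return [(s, n) for (s, n) in errors if s < end and s + n > start]
```

For a sentence with errors at (0, 4), (8, 3) and (20, 2), a right-click on the marked range (8, 3) would keep only the second error.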

Since the call to the GrammarCheckingIterator is asynchronous, we need to wait a reasonably limited amount of time (e.g. 3 seconds) to receive the results via the call-back. If we do get them in time we can show the context menu as planned. If not, since we can't wait forever, we have to display a fallback menu (either the regular one or one showing an entry like "grammar checking timed out").

Since the context menu may already be closed (either before the 3 seconds are over or after) when the GrammarCheckingIterator is finally ready to use the call-back function to provide the results, the context menu needs to implement the XComponent interface and the GrammarCheckingIterator must be its listener. The iterator has to be registered as listener at the moment the context menu calls the function that triggers grammar checking for the sentence.

Right before the context menu gets displayed it should already be disposed. This would be necessary later anyway, and doing it now should prevent the call-back function from being executed belatedly if grammar checking was too slow (or did not return at all) and the fallback menu is displayed.
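
The wait-with-timeout and the early dispose can be sketched together in Python. This is an assumption-laden illustration: the class ContextMenuRequest and its methods are hypothetical, a threading.Event replaces the UNO call-back machinery, and disposing is reduced to a flag that makes late call-backs no-ops.

```python
import threading
from typing import List, Optional


class ContextMenuRequest:
    """Waits a bounded time for checker results, then ignores late ones."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._done = threading.Event()
        self._results: Optional[List[str]] = None
        self._disposed = False

    def on_results(self, results: List[str]) -> None:
        """Call-back invoked from the checker side when results are ready."""
        with self._lock:
            if self._disposed:
                return            # menu already shown: drop belated results
            self._results = results
        self._done.set()

    def wait_and_dispose(self, timeout: float) -> Optional[List[str]]:
        """Return the results if they arrived in time, else None (fallback)."""
        arrived = self._done.wait(timeout)
        with self._lock:
            self._disposed = True  # dispose right before the menu is displayed
            return self._results if arrived else None
```

A caller would build the grammar context menu when a list is returned and the fallback menu when None is returned.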

When everything went fine and the user was able to select a specific correction the XFlatParagraph interface provided as part of the XGrammarCheckingResult will be used to make the changes in the text.

Checking several documents at the same time and mixing all the above tasks

Other applications of the iterator concept:

The idea of having a global iterator that iterates over the document's text using the XFlatParagraphIterator interface and gives access to a paragraph through the XFlatParagraph interface, thereby doing "some task", should be applicable as well to the following tasks:

  • word count
  • smart tags
  • spell checking(?)

The different needs for the iteration order (or even skipping some paragraphs) might be implemented by using specific iterators or else by giving the iteration function a specific context for the iteration. For example:

getNext( eActionContext )

where eActionContext might be one of

CONTEXT_WORD_COUNT,

CONTEXT_SMART_TAGS,

CONTEXT_GRAMMAR_CHECKING
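
One possible reading of this idea, sketched in Python: the skipping rules below are invented purely for illustration (the text above does not define them), paragraphs are modelled as (text, already checked) pairs, and the names ActionContext and get_next are hypothetical Python spellings of eActionContext and getNext.

```python
from enum import Enum, auto
from typing import List, Optional, Tuple


class ActionContext(Enum):
    WORD_COUNT = auto()
    SMART_TAGS = auto()
    GRAMMAR_CHECKING = auto()


# A paragraph is modelled as (text, already_checked_for_grammar).
Paragraph = Tuple[str, bool]


def get_next(paragraphs: List[Paragraph], position: int,
             context: ActionContext) -> Optional[int]:
    """Return the index of the next paragraph relevant to `context`,
    searching after `position`, or None if there is none."""
    for index in range(position + 1, len(paragraphs)):
        text, checked = paragraphs[index]
        # Assumed rule: grammar checking skips paragraphs already checked.
        if context is ActionContext.GRAMMAR_CHECKING and checked:
            continue
        # Assumed rule: smart tagging skips paragraphs without any text.
        if context is ActionContext.SMART_TAGS and not text.strip():
            continue
        return index   # word count visits every paragraph
    return None
```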

Problems and questions currently left open

Grammar checking of mixed language text

It is believed that even for sentences that use several languages there is only a single language that the whole sentence is in. (How that language is identified is a completely different matter and probably a complex task though!) Thus such a sentence should only be grammar checked in that single language. For example:

The German word for television is Fernseher.

This sentence should be grammar checked in English and not in German.

If possible though (for example if language attributes are set correctly) it should be noted that Fernseher is not English, and thus at the very least no English spelling error should be reported for that word. It is probably also impossible to report any grammar error that involves embedded foreign words. Thus the best to hope for is probably that the foreign word is recognized as correct by the respective spell checker.


Even with a completely embedded sentence like

In Gallica Caesar said 'Alea iacta est.' and continued his battle.

the above text is in a single language, English and not Latin. I don't know whether any existing grammar checker is smart enough to cope with embedded sentences of a different language. To keep it simple for the time being, the whole text should be grammar checked as one sentence in English and in that language only.

Grammar checking and spell checking at the same time

Should spell checking have an iterator of its own with a thread of its own? Or should spell checking be handled by the GrammarCheckingIterator as well?

Other Questions / problems:

  • checking is limited to paragraphs (unless the implementation of XFlatParagraph chooses to hide something more behind it, which is unlikely). Though one could think of enumerations as a possible application for this behavior.
  • in the case of several grammar checkers for one language, what do we do if they report different end-of-sentence positions? We really can't handle each checker individually here.
  • does a grammar checker that requires knowledge of the previous text in this paragraph need to have that text presented even if it is in a language it does not know?
  • How to achieve consistency of usage (e.g. spelling) when having grammar checkers in multiple languages? E.g. e-mail vs. email? Or does it need to be consistent on a per-language basis only?
  • How to determine the language of a sentence? Use the language of the first word, or language guessing, or the language with the most words,... ?
  • Problems related to a specific UI, namely the grammar checking dialog still to be defined, not yet covered.
  • The troublesome case of having for example three grammar checkers for one language and two of them wanting to use their own dialog while the third will go with the office internal one is left out. Because if all of them report errors in the same sentence and like to use their own dialog as well we will have to cope with switching between three dialogs just to edit a single sentence. That's just plain awful to even think about. And I doubt there will be even one user to appreciate such a scenario.
  • Should the document (e.g. XFlatParagraph) be in charge of determining the language for checking, or should it be the GrammarCheckingIterator? Probably the latter...