OpenOffice.org Internship/Tasks/Proper paragraphs

The task is to implement correct import of text paragraphs. The current version of the extension can import only single lines, which is quite inconvenient when we try to edit the text.

Line importing in current extension

XPDF gives no notification that the BT tag (begin text object) was met, so we cannot determine where a new text object starts. Instead, a line is recognized by the position of consecutive glyphs (more precisely, of the rectangles containing the glyphs). If two consecutive rectangles are close enough to each other, they are treated as belonging to the same line. This solution is not perfect, because we have to decide what "close enough" means.
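As an illustration of the "close enough" test, here is a minimal sketch. It is not the extension's actual code: the GlyphRect type, the sameLine function and both tolerance factors are assumptions made up for this example, and real thresholds would need tuning against real documents.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical bounding rectangle of one glyph; y grows downwards.
struct GlyphRect {
    double x1, y1, x2, y2; // left, top, right, bottom
};

// Returns true when `next` plausibly continues the line ending with `prev`.
// Both tolerances are fractions of the glyph height (assumed defaults).
inline bool sameLine(const GlyphRect& prev, const GlyphRect& next,
                     double gapFactor = 0.5, double baselineFactor = 0.3)
{
    const double height = std::max(prev.y2 - prev.y1, next.y2 - next.y1);
    const double gap = next.x1 - prev.x2;              // horizontal distance
    const double shift = std::fabs(next.y2 - prev.y2); // bottom edge offset
    return gap <= gapFactor * height   // not too far to the right
        && gap >= -height              // not a jump back to the line start
        && shift <= baselineFactor * height;
}
```

Expressing the tolerances relative to the glyph height keeps the test independent of the font size, but it does not remove the underlying problem of choosing the factors.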

Idea of paragraph importing

To import whole paragraphs I suggest a solution similar to the one described above, but with lines and paragraphs in place of glyphs and lines. That is, when lines are close enough, they are treated as one paragraph. Several cases may occur, but most of them are quite easy to handle.
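The same idea applied to lines could look roughly like the sketch below. LineBox, groupLines and the gapFactor threshold are hypothetical names and values chosen for illustration only. The sketch looks just at the vertical gap between consecutive lines and ignores indentation, columns and font changes.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical vertical extent of one text line; y grows downwards.
struct LineBox {
    double top, bottom;
};

// Groups consecutive lines into paragraphs and returns the number of lines
// in each paragraph. A line joins the current paragraph when the vertical
// gap to the previous line is at most gapFactor times that line's height.
inline std::vector<std::size_t> groupLines(const std::vector<LineBox>& lines,
                                           double gapFactor = 0.6)
{
    std::vector<std::size_t> paragraphSizes;
    for (std::size_t i = 0; i < lines.size(); ++i) {
        bool joins = false;
        if (i > 0) {
            const LineBox& prev = lines[i - 1];
            const double height = prev.bottom - prev.top;
            const double gap = lines[i].top - prev.bottom;
            // A large negative gap means a jump upwards, e.g. to a new column.
            joins = gap >= -height && gap <= gapFactor * height;
        }
        if (joins)
            ++paragraphSizes.back();
        else
            paragraphSizes.push_back(1);
    }
    return paragraphSizes;
}
```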

Moreover, glyph processing is quite complex. It would be better to encapsulate it and delegate glyph processing to a standalone class. This would reduce the mess in PDFIProcessor, which contains methods responsible for every kind of processing. The main goal is to make PDFIProcessor a wrapper around smaller classes with separate responsibilities; this approach has many advantages.

Another solution

Another solution would be to modify Gfx and OutDev from XPDF. As said at the beginning of this page, there is no information about when BT is met, so the idea would be to change the code so that OutDev is informed about it. Unfortunately, I see some problems with this solution: a BT object sometimes contains much more than a single paragraph, and another problem is the position of glyphs within the drawn text objects. Moreover, it requires changes in the makefile (the extension code).

Description of implemented solution

Introduced classes

Changes in PDFIProcessor

All responsibility for glyph processing has been moved to the CharGlyphsProcessor class, which is initialized by passing a PDFIProcessor object to its only constructor. In fact it does not receive the whole object, but only the few required functions, exposed through the facade design pattern; modifying PDFIProcessor content from within CharGlyphsProcessor is discouraged. PDFIProcessor owns a CharGlyphsProcessor object and, instead of running processGlyphLine in the drawGlyphs function, executes CharGlyphsProcessor::process, which starts processing of the current glyph.
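The facade arrangement can be sketched as follows. None of the names below come from the extension: Processor stands in for PDFIProcessor, and emitText and resetState are invented methods that only show which calls are forwarded and which stay hidden.

```cpp
#include <string>
#include <vector>

// Stand-in for the real processor; only the shape matters here.
class Processor {
public:
    void emitText(const std::string& text) { m_output.push_back(text); }
    void resetState() { m_output.clear(); } // should not be reachable by helpers
    const std::vector<std::string>& output() const { return m_output; }
private:
    std::vector<std::string> m_output;
};

// Facade: forwards only the calls a glyph processor is allowed to make.
class ProcessorFacade {
public:
    explicit ProcessorFacade(Processor& rProcessor) : m_rProcessor(rProcessor) {}
    void emitText(const std::string& text) { m_rProcessor.emitText(text); }
private:
    Processor& m_rProcessor;
};

// Glyph processor built from the facade, so it cannot call resetState().
class GlyphProcessorSketch {
public:
    explicit GlyphProcessorSketch(ProcessorFacade facade) : m_facade(facade) {}
    void process(char c) { m_facade.emitText(std::string(1, c)); }
private:
    ProcessorFacade m_facade;
};
```

Because the glyph processor only ever sees ProcessorFacade, widening its access to the processor requires an explicit change to the facade rather than an accidental call.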

CharGlyphProcessor

The only constructor of the class receives a facade to the PDFIProcessor class. This allows calling the required methods of PDFIProcessor while hiding the rest, which prevents its content from being modified externally. Objects of the class hold a pointer to the paragraph currently being computed. The main function of the class is the "process" method, which receives the arguments rFontMatrix, aRect and the char to draw, and starts building the paragraph structure. The class tries to add every new glyph to the current paragraph; if that fails, it tries to create a new line within the current paragraph. If that fails as well, the paragraph is dropped and a new one is created to replace it. Every time a new glyph or line is added, the paragraph properties need to be updated, so that it can be correctly computed whether the next glyphs/lines may be contained in it. The second public method is "drop", which drops the content explicitly when the PDF "end of text object" command is met while parsing.
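The fallback order described here (current line, then a new line, then a new paragraph) can be sketched as below. This is a simplified illustration, not the real class: Glyph, Line, Paragraph and GlyphCascade are hypothetical, the acceptance tests are arbitrary, the current paragraph is held by value rather than through a pointer, and finished paragraphs are only collected in a vector instead of being written to the document tree.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical glyph: top-left corner, size and character.
struct Glyph {
    double x, y, width, height;
    char c;
};

class Line {
public:
    // Accepts a glyph that sits to the right of the last one at the same height.
    bool add(const Glyph& g) {
        if (!m_glyphs.empty()) {
            const Glyph& last = m_glyphs.back();
            const double gap = g.x - (last.x + last.width);
            const bool sameHeight = std::fabs(g.y - last.y) <= 0.1 * last.height;
            if (!sameHeight || gap < 0.0 || gap > last.height)
                return false;
        }
        m_glyphs.push_back(g);
        return true;
    }
    const Glyph& first() const { return m_glyphs.front(); }
private:
    std::vector<Glyph> m_glyphs;
};

class Paragraph {
public:
    bool add(const Glyph& g) {
        if (m_lines.empty())
            return startLine(g);
        if (m_lines.back().add(g))            // 1. fits the current line
            return true;
        const Glyph& ref = m_lines.back().first();
        const double gap = g.y - (ref.y + ref.height);
        if (gap < 0.0 || gap > ref.height)    // too far away for a new line
            return false;
        return startLine(g);                  // 2. starts a new line
    }
    std::size_t lineCount() const { return m_lines.size(); }
private:
    bool startLine(const Glyph& g) {
        m_lines.push_back(Line());
        return m_lines.back().add(g);
    }
    std::vector<Line> m_lines;
};

class GlyphCascade {
public:
    void process(const Glyph& g) {
        if (m_current.add(g))
            return;
        drop();                               // 3. close the paragraph
        m_current.add(g);                     //    and start a new one
    }
    void drop() {
        if (m_current.lineCount() > 0)
            m_finished.push_back(m_current);
        m_current = Paragraph();
    }
    const std::vector<Paragraph>& finished() const { return m_finished; }
private:
    Paragraph m_current;
    std::vector<Paragraph> m_finished;
};
```

Calling drop() once more after the last glyph flushes the paragraph still being built, which mirrors the explicit drop at the end of a text object.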

CharGlyphParagraph

The object contains a list of the lines found in the paragraph. It provides a public method "add" for adding new glyphs, which returns true when the new glyph was successfully added and false otherwise. A simple mathematical equation determines whether such an addition is possible. Another public function is the "drop" method, which calls the "drop" function of every line in the paragraph.

CharGlyphLine

The class represents a single line within a paragraph. Like CharGlyphParagraph, CharGlyphLine provides public methods "add" and "drop" with similar behavior. Moreover, an object of the class holds a list of the glyphs in the current line.

CharGlyph

The class represents a glyph object with all its properties, together with the functions that operate on it.

ParagraphLineElement

The class derives from the Element class and is the logical representation of a line in a paragraph. All transformations are still applied to the paragraph, but the class has been introduced in order to distinguish single lines within a paragraph. Introducing this class requires several changes in classes such as ElementFactory and TreeVisiting, and implementing a function that visits this kind of tree node.

Dropping

By dropping we mean the process of adding the current paragraph to the XML tree. When the drop function of GlyphProcessor is called and the current glyph cannot be added to the current paragraph, we need to save the current paragraph in the tree and create a new paragraph to replace it. Saving (dropping) is done in the following way:

Create a FrameElement in the GlyphProcessor "drop" function and execute "drop" of CharGlyphParagraph, passing the FrameElement as an argument.
Create a ParagraphElement in the CharGlyphParagraph "drop" function and execute the "drop" function of every line contained in the paragraph, passing the FrameElement and ParagraphElement as arguments. The parent of the ParagraphElement is the FrameElement.
Create a ParagraphLineElement in CharGlyphLine and execute the "drop" function of every glyph contained in the line, passing the FrameElement, ParagraphElement and ParagraphLineElement as arguments. The parent of the ParagraphLineElement is the ParagraphElement.
Create a TextElement and set its parent to the ParagraphLineElement.
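The nesting produced by these steps can be sketched with a generic tree. The Node type, appendChild and dropParagraph are hypothetical helpers written for this illustration; the real code uses the extension's own Element classes and ElementFactory, and only the element names are taken from the description above.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical tree node standing in for the extension's Element classes.
struct Node {
    std::string kind;
    std::string text;
    Node* parent = nullptr;
    std::vector<std::unique_ptr<Node>> children;
};

// Creates a child of the given kind under `parent` and returns it.
inline Node* appendChild(Node& parent, const std::string& kind,
                         const std::string& text = "")
{
    std::unique_ptr<Node> child(new Node());
    child->kind = kind;
    child->text = text;
    child->parent = &parent;
    parent.children.push_back(std::move(child));
    return parent.children.back().get();
}

// One paragraph as lines of glyph strings (assumed input shape).
using ParagraphText = std::vector<std::vector<std::string>>;

// Builds FrameElement > ParagraphElement > ParagraphLineElement > TextElement.
inline std::unique_ptr<Node> dropParagraph(const ParagraphText& paragraph)
{
    std::unique_ptr<Node> frame(new Node());
    frame->kind = "FrameElement";
    Node* para = appendChild(*frame, "ParagraphElement");
    for (const auto& line : paragraph) {
        Node* lineNode = appendChild(*para, "ParagraphLineElement");
        for (const auto& glyph : line)
            appendChild(*lineNode, "TextElement", glyph);
    }
    return frame;
}
```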

Now, while browsing the XML tree, the new objects and functions are used, which makes the logical structure more legible and easier to understand and maintain.

Summary

While implementing the feature, some problems have been met. First of all, there is a problem with char spaces, which occurs very often and makes the text look bad. The current solution for normalizing char spaces is quite odd and is going to be improved. Another task is to test more and to create an optimal method for recognizing paragraphs and new lines. Moreover, I still suggest moving complex processing tasks to small classes, each responsible for only one type of processing.