Proper paragraphs import

The task is to implement correct import of text paragraphs. The current version of the extension can import only single lines, which is quite inconvenient when editing longer text.

Line importing in the current extension

XPDF gives no notification that a BT tag was encountered, so we cannot tell when a new text object begins. A line is therefore recognized by the positions of consecutive glyphs (more precisely, of the rectangles containing them). If two consecutive rectangles are close enough to each other, they are treated as belonging to the same line. This solution is not perfect, because we have to define what "close enough" means, but it works correctly in most cases.
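
The "close enough" test can be sketched as follows. This is only an illustration: the GlyphRect type, the sameLine function and both tolerance values are assumptions made for the example and are not taken from the extension code.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical axis-aligned glyph box; the extension has its own rectangle type.
struct GlyphRect {
    double x1, y1, x2, y2;  // left, top, right, bottom
};

// True when `next` plausibly continues the line that ends with `prev`.
// Both tolerances are fractions of the previous glyph's height and are
// illustrative values, not the ones used by the extension.
inline bool sameLine(const GlyphRect& prev, const GlyphRect& next,
                     double maxGapFactor = 1.0, double maxShiftFactor = 0.5)
{
    assert(maxGapFactor >= 0.0 && maxShiftFactor >= 0.0);
    const double height = prev.y2 - prev.y1;
    if (height <= 0.0)
        return false;  // degenerate box: nothing sensible to compare against

    const double gap = next.x1 - prev.x2;               // horizontal distance between the boxes
    const double shift = std::fabs(next.y1 - prev.y1);  // vertical offset between the tops

    // A slightly negative gap is accepted so that overlapping (kerned) boxes still match.
    return gap >= -0.5 * height
        && gap <= maxGapFactor * height
        && shift <= maxShiftFactor * height;
}
```

Scaling the tolerances by the glyph height is one possible way to keep the test independent of font size; fixed thresholds would be another.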

Idea of paragraph importing

To import whole paragraphs I suggest a solution similar to the one described above, but with lines and paragraphs in place of glyphs and lines: when lines are close enough, they are treated as one paragraph. In the real implementation, only the first glyph of the next line would be used to determine whether that line can be appended to the current paragraph.
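
A minimal sketch of that check, using only the first glyph of the candidate line. The LineBox type, the continuesParagraph function, the tolerance values and the assumption that y grows downwards are all hypothetical and serve only to illustrate the idea.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical glyph box: left, top, right, bottom.
struct GlyphRect {
    double x1, y1, x2, y2;
};

// Hypothetical summary of the last line already placed in the paragraph.
struct LineBox {
    double left;    // x of the line's first glyph
    double top;     // y of the line's top edge
    double height;  // line height
};

// True when a line starting with `firstGlyph` may be appended to the paragraph
// whose last line is `last`. Assumes y grows downwards; the factors are
// illustrative only.
inline bool continuesParagraph(const LineBox& last, const GlyphRect& firstGlyph,
                               double maxLeadingFactor = 1.5,
                               double maxIndentFactor = 2.0)
{
    assert(last.height > 0.0);
    const double advance = firstGlyph.y1 - last.top;             // distance between line tops
    const double indent = std::fabs(firstGlyph.x1 - last.left);  // horizontal offset of line starts

    return advance > 0.0
        && advance <= maxLeadingFactor * last.height
        && indent <= maxIndentFactor * last.height;
}
```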

Moreover, glyph processing is quite complex. It would be better to encapsulate it and delegate it to a standalone class. This would reduce the mess in the PDFIProcessor code, where the methods responsible for every kind of processing are currently placed. The main goal is to make PDFIProcessor a wrapper around smaller classes with separate responsibilities. This approach has many advantages, the most important being scalability: instead of writing several more functions in PDFIProcessor and facing a big mess, a developer would add an internal class and call its methods from PDFIProcessor.

Another solution

Another solution would be to modify Gfx and OutDev from XPDF. As said at the beginning of this page, there is no notification when BT is encountered, so the solution would be to change the code so that OutDev is informed about it. Unfortunately, I see some problems with this solution: a BT object sometimes contains much more than a single paragraph, and another problem is the positioning of glyphs within drawn text objects. Moreover, it requires changes in the makefile and in the extension code.

Description of implemented solution

Introduced classes

Changes in PDFIProcessor

All responsibility for glyph processing has been moved to the CharGlyphsProcessor class, which is initialized by passing a PDFIProcessor object to its only constructor. In fact it is not the whole object that is passed, but only the few required functions, exposed through the facade design pattern: CharGlyphsProcessor should not modify PDFIProcessor content, at least not content unrelated to glyph processing. PDFIProcessor owns a CharGlyphsProcessor object, and instead of running processGlyphLine in the drawGlyphs function it executes CharGlyphsProcessor::process, which starts processing of the current glyph.
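
The relationship between the two classes might look roughly like the sketch below. All names here (GlyphFacade, GlyphProcessorSketch, HostSketch, appendText, endTextObject) are invented for the illustration and do not reflect the real class interfaces; only the general shape, a host that owns the processor and exposes a narrow facade to it, follows the description above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical facade: the only part of the host that glyph processing may touch.
class GlyphFacade {
public:
    virtual ~GlyphFacade() = default;
    virtual void appendText(const std::string& text) = 0;
};

// Stand-in for the glyph processor; it sees the facade, never the whole host.
class GlyphProcessorSketch {
public:
    explicit GlyphProcessorSketch(GlyphFacade& host) : m_host(host) {}

    void process(char ch) { m_pending += ch; }

    // Hand the collected text to the host and start over.
    void drop() {
        if (m_pending.empty())
            return;
        m_host.appendText(m_pending);
        m_pending.clear();
    }

private:
    GlyphFacade& m_host;
    std::string m_pending;
};

// Stand-in for PDFIProcessor. The facade is implemented privately, so outside
// code cannot reach appendText through this class.
class HostSketch : private GlyphFacade {
public:
    HostSketch() : m_glyphs(*this) {}
    HostSketch(const HostSketch&) = delete;
    HostSketch& operator=(const HostSketch&) = delete;

    void drawGlyphs(char ch) { m_glyphs.process(ch); }  // delegation instead of in-place processing
    void endTextObject() { m_glyphs.drop(); }
    const std::vector<std::string>& output() const { return m_output; }

private:
    void appendText(const std::string& text) override {
        assert(!text.empty());
        m_output.push_back(text);
    }

    std::vector<std::string> m_output;
    GlyphProcessorSketch m_glyphs;
};
```

The processor can call only what the facade declares, which is the point of the pattern here: glyph processing cannot accidentally change unrelated host state.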

CharGlyphProcessor

The only constructor of the class receives a facade to the PDFIProcessor class. This allows the required PDFIProcessor methods to be called while hiding the rest, which prevents its content from being modified externally. Objects of the class hold a pointer to the paragraph currently being built. The main function of the class is the "process" method, which receives the arguments rFontMatrix, aRect and the character to draw, and starts building the paragraph structure. The processor first tries to add every new glyph to the current paragraph; if that fails, it tries to start a new line within the current paragraph. If that fails as well, the paragraph is flushed and a new one is created to replace it. Every time a new glyph or line is added, the paragraph properties need to be updated so that it can be correctly determined whether the next glyphs or lines can be appended. The second public method is "drop", which explicitly drops the processor's content when the PDF "end of text object" command is encountered while parsing. Note that the previous version of the code had no such simple solution, which sometimes caused problems.
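
The three-step cascade (append to the current line, else start a new line, else flush and start a new paragraph) can be sketched as below. The types GlyphSketch, ParagraphSketch and GlyphProcessorSketch, the simplified glyph representation and the thresholds are assumptions for this example; the real "process" method takes rFontMatrix, aRect and the character, and its acceptance test is different.

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <vector>

// Hypothetical glyph: top-left position plus the character (y grows downwards).
struct GlyphSketch {
    double x, y;
    char ch;
};

// Illustrative thresholds, not the extension's values.
constexpr double kMaxGap = 15.0;      // largest horizontal step within a line
constexpr double kMaxLeading = 20.0;  // largest vertical step between lines

class ParagraphSketch {
public:
    bool empty() const { return m_lines.empty(); }

    // Append to the last line when the glyph is on the same row and close enough.
    bool add(const GlyphSketch& g) {
        if (m_lines.empty() || std::fabs(g.y - m_last.y) > 1.0)
            return false;
        const double gap = g.x - m_last.x;
        if (gap < 0.0 || gap > kMaxGap)
            return false;
        m_lines.back() += g.ch;
        m_last = g;  // update the properties used by the next test
        return true;
    }

    // Start a new line when the paragraph is empty or the glyph sits just below the last line.
    bool addLine(const GlyphSketch& g) {
        if (!m_lines.empty()) {
            const double advance = g.y - m_last.y;
            if (advance <= 0.0 || advance > kMaxLeading)
                return false;
        }
        m_lines.push_back(std::string(1, g.ch));
        m_last = g;
        return true;
    }

    const std::vector<std::string>& lines() const { return m_lines; }

private:
    std::vector<std::string> m_lines;
    GlyphSketch m_last{0.0, 0.0, '\0'};
};

class GlyphProcessorSketch {
public:
    void process(const GlyphSketch& g) {
        if (m_current.add(g))      // 1. same line
            return;
        if (m_current.addLine(g))  // 2. new line in the same paragraph
            return;
        drop();                    // 3. flush and start a new paragraph
        const bool started = m_current.addLine(g);  // an empty paragraph accepts any first glyph
        assert(started);
        (void)started;
    }

    // Explicit flush, e.g. at the end of a text object.
    void drop() {
        if (m_current.empty())
            return;
        m_done.push_back(m_current);
        m_current = ParagraphSketch();
    }

    const std::vector<ParagraphSketch>& paragraphs() const { return m_done; }

private:
    ParagraphSketch m_current;
    std::vector<ParagraphSketch> m_done;
};
```

In this sketch, glyphs at (0, 0), (10, 0) and (0, 14) end up in one paragraph with two lines, while a following glyph at (0, 100) is too far below and opens a second paragraph.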

CharGlyphParagraph

The object contains the list of lines found in the paragraph. It provides a public method "add" that adds a new glyph to the current paragraph and returns true when the glyph was successfully added, or false otherwise. A simple mathematical condition determines whether such an addition is possible. It is very similar to the one in the previous version of the code, with a small change needed because rotated text was previously always treated as single glyphs. Another public function is the "flush" method, which calls the "flush" function of every line in the paragraph.

CharGlyphLine

The class represents a single line within a paragraph. Like CharGlyphParagraph, CharGlyphLine provides public methods "add" and "flush" with similar behavior. In addition, an object of the class holds the list of glyphs in the current line.

CharGlyph

The class represents a glyph object with all its properties, together with the functions that operate on it.

ParagraphLineElement

The class derives from the Element class and is the logical representation of a line in a paragraph. All transformations are still applied to the paragraph; the class has been introduced only to distinguish single lines within it. Introducing this class requires several changes in classes such as ElementFactory and TreeVisiting, including a function that allows this kind of tree node to be visited. The class provides a visitedBy function.

Changes in TreeVisiting

Because of the new element type, I implemented new "visit" methods in treevisiting and drawtreevisiting. Only the draw version works for now; the methods in writertreevisiting are empty. The visiting order is now the following: (...), ParagraphElement, ParagraphLineElement, CharGlyph. In the visit function for ParagraphLineElement (visit(ParagraphLineElement &, ...)) we only add the tag <draw:break-line/> to move to the next line once every glyph in the line has been visited.
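
The visiting order and the line-break behaviour can be illustrated with a small double-dispatch sketch. The node and visitor types below (Node, ParagraphNode, LineNode, GlyphNode, DrawEmitterSketch) are simplified stand-ins invented for the example, and the <text:p> wrapper is a placeholder chosen for the illustration; only the <draw:break-line/> tag after the glyphs of a line is taken from the description above.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct ParagraphNode;
struct LineNode;
struct GlyphNode;

// Hypothetical visitor interface with one overload per node type.
struct Visitor {
    virtual ~Visitor() = default;
    virtual void visit(ParagraphNode&) = 0;
    virtual void visit(LineNode&) = 0;
    virtual void visit(GlyphNode&) = 0;
};

struct Node {
    virtual ~Node() = default;
    virtual void visitedBy(Visitor&) = 0;

    // Takes ownership of the child and returns a reference to it.
    Node& append(std::unique_ptr<Node> child) {
        assert(child);
        children.push_back(std::move(child));
        return *children.back();
    }

    std::vector<std::unique_ptr<Node>> children;
};

struct ParagraphNode : Node {
    void visitedBy(Visitor& v) override { v.visit(*this); }
};

struct LineNode : Node {
    void visitedBy(Visitor& v) override { v.visit(*this); }
};

struct GlyphNode : Node {
    explicit GlyphNode(char c) : ch(c) {}
    void visitedBy(Visitor& v) override { v.visit(*this); }
    char ch;
};

// Stand-in for the draw visitor: emits a flat string instead of real XML events.
struct DrawEmitterSketch : Visitor {
    std::string out;

    void visitChildren(Node& n) {
        for (auto& child : n.children)
            child->visitedBy(*this);
    }

    void visit(ParagraphNode& p) override {
        out += "<text:p>";  // placeholder wrapper tag
        visitChildren(p);
        out += "</text:p>";
    }

    void visit(LineNode& l) override {
        visitChildren(l);            // first every glyph of the line
        out += "<draw:break-line/>"; // then the break that moves to the next line
    }

    void visit(GlyphNode& g) override { out += g.ch; }
};
```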

Flushing

By flushing we mean the process of adding the current paragraph to the tree. When the flush function of GlyphProcessor is called and the current glyph cannot be added to the current CharGlyphParagraph, we need to save the current paragraph in the tree and create a new paragraph to replace it. Saving (flushing) is done in the following way:

1. Create a FrameElement in the GlyphProcessor "flush" function and call "flush" of CharGlyphParagraph, passing the FrameElement as an argument.
2. Create a ParagraphElement in the CharGlyphParagraph "flush" function and call the "flush" function of every line contained in the paragraph, passing the FrameElement and the ParagraphElement as arguments. The parent of the ParagraphElement is the FrameElement.
3. Create a ParagraphLineElement in CharGlyphLine and call the "flush" function of every glyph contained in the line, passing the FrameElement, the ParagraphElement and the ParagraphLineElement as arguments. The parent of the ParagraphLineElement is the ParagraphElement.
4. Create a TextElement and set the ParagraphElement as its parent.
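
The flush chain above can be sketched as nested calls that pass the already created elements downwards. Everything in this example is a simplified assumption: Element here is a generic tree node with a "kind" string rather than the real element classes, and GlyphSketch, LineSketch, ParagraphSketch and flushParagraph are invented names. The parent relations follow the steps as listed, including the TextElement being attached to the paragraph.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical generic tree node standing in for the real element classes.
struct Element {
    explicit Element(std::string k) : kind(std::move(k)) {}

    Element& createChild(const std::string& childKind) {
        assert(!childKind.empty());
        children.push_back(std::unique_ptr<Element>(new Element(childKind)));
        children.back()->parent = this;
        return *children.back();
    }

    std::string kind;
    Element* parent = nullptr;
    std::vector<std::unique_ptr<Element>> children;
};

struct GlyphSketch {
    char ch;
    // Step 4: the text element is parented to the paragraph, as in the steps above;
    // the frame and line are passed along but not used in this sketch.
    void flush(Element& /*frame*/, Element& paragraph, Element& /*line*/) const {
        paragraph.createChild(std::string("Text:") + ch);
    }
};

struct LineSketch {
    std::vector<GlyphSketch> glyphs;
    // Step 3: create the line element under the paragraph, then flush every glyph.
    void flush(Element& frame, Element& paragraph) const {
        Element& line = paragraph.createChild("ParagraphLine");
        for (const GlyphSketch& g : glyphs)
            g.flush(frame, paragraph, line);
    }
};

struct ParagraphSketch {
    std::vector<LineSketch> lines;
    // Step 2: create the paragraph element under the frame, then flush every line.
    void flush(Element& frame) const {
        Element& paragraph = frame.createChild("Paragraph");
        for (const LineSketch& l : lines)
            l.flush(frame, paragraph);
    }
};

// Step 1: create the frame and hand it to the paragraph.
inline Element& flushParagraph(Element& page, const ParagraphSketch& paragraph) {
    Element& frame = page.createChild("Frame");
    paragraph.flush(frame);
    return frame;
}
```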

Now, when the tree is traversed, the new objects and functions are used, which makes the logical structure more legible and easier to understand and maintain. Note that ParagraphLineElement is only a logical element that represents lines in the code. We do not apply any transformation to a line, because the line is contained within a paragraph and every transformation of the paragraph is applied to the line too.

Examples

Here is an example showing how paragraph importing works for a regular paragraph:

And another one for rotated paragraph:

When paragraph import "fails"

PDF syntax has an ET tag that tells the parser that the text object should be closed at this point. In this situation the flush function of CharGlyphProcessor is called, which results in a new paragraph being created. Some PDF files end every single line with ET, and in that case every line will be treated as a separate paragraph. This is not a failure of the import; it is simply how the PDF file is structured.

Summary

Some problems were encountered while implementing the feature. First of all, there is a problem with character spacing that occurs very often and makes the text look bad. The current solution for normalizing character spacing is quite ad hoc and is going to be improved. Another task is to test more and to work out an optimal method of recognizing paragraphs and new lines. Another problem concerned importing rotated text, where every glyph in a line was treated as a line with one glyph. Moreover, I still suggest moving complex processing tasks into small classes, each responsible for only one type of processing.