Pdf Import Extension/Current Architecture

From Apache OpenOffice Wiki

Currently, the PDF import extension uses xpdf to parse the PDF file and to generate a stream of low-level output operations, from which an ODF document is synthesized.

This is somewhat cumbersome: xpdf is GPL-licensed, so it has to run completely out-of-process from OOo, which is LGPL-licensed. A dedicated replacement parser is in the making (filter/source/pdfimport/pdfparse), but it will take some time to be on par with xpdf.

Currently, the way PDF files get imported looks like this:

Pdf architecture.png

That is, once triggered from the framework filter configuration, the importer component passes the filename of the PDF file to the xpdf executable, which loads and parses it and writes fairly low-level drawing commands (like "put a glyph at position (x,y)") to stdout. The office process reads this output back and puts it into a page-wise tree structure, which is then processed to combine glyphs, polygons etc. into pieces that make more sense to the user (draw shapes and actual paragraphs of text).
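The read-back and combining steps can be illustrated with a small sketch. This is not the extension's actual code or wire format: the line-based commands `startPage` and `drawGlyph`, the `Glyph` and `Page` classes, and the `max_gap` threshold are all hypothetical, chosen only to show how a flat command stream might become a page-wise structure whose glyphs are merged into text runs.

```python
from dataclasses import dataclass, field


@dataclass
class Glyph:
    x: float
    y: float
    char: str


@dataclass
class Page:
    number: int
    glyphs: list = field(default_factory=list)


def parse_commands(lines):
    """Group a flat stream of (hypothetical) commands into one Page per page."""
    pages = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "startPage":
            pages.append(Page(number=int(parts[1])))
        elif parts[0] == "drawGlyph" and pages:
            x, y, char = float(parts[1]), float(parts[2]), parts[3]
            pages[-1].glyphs.append(Glyph(x, y, char))
    return pages


def combine_glyphs(page, max_gap=8.0):
    """Merge glyphs that share a baseline and sit close together into text runs.

    Glyphs are visited top to bottom (PDF y grows upwards), then left to right.
    """
    runs = []
    last = None
    for glyph in sorted(page.glyphs, key=lambda g: (-g.y, g.x)):
        if last is not None and glyph.y == last.y and glyph.x - last.x <= max_gap:
            runs[-1] += glyph.char
        else:
            runs.append(glyph.char)
        last = glyph
    return runs
```

In the real importer the lines would come from the stdout of the out-of-process xpdf wrapper; here any iterable of strings works, which keeps the sketch testable without spawning a process.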

Tree classes

This is the inheritance graph of the classes representing the graphical document tree:

Pdfimport-tree-nodes.png

Output generation classes

This is the interface and the two existing classes generating actual document output:

Pdfimport-odfgenerator.png
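As a rough illustration of the "one interface, two generators" arrangement, the following sketch defines an abstract generator and two implementations. The class and method names are hypothetical and do not come from the extension. The sketch also assumes, purely for illustration, that one generator produces positioned drawing frames and the other flowing text paragraphs. Only the ODF element and attribute names (`draw:page`, `draw:frame`, `draw:text-box`, `text:p`, `svg:x`, `svg:y`) are real.

```python
from abc import ABC, abstractmethod
from xml.sax.saxutils import escape


class OutputGenerator(ABC):
    """Hypothetical interface: turns one page of text runs into ODF markup.

    Each run is a tuple (x, y, text) with coordinates in points.
    """

    @abstractmethod
    def emit_page(self, number, runs):
        ...


class DrawGenerator(OutputGenerator):
    """Keeps the layout: every run becomes an absolutely positioned frame."""

    def emit_page(self, number, runs):
        frames = "".join(
            f'<draw:frame svg:x="{x}pt" svg:y="{y}pt">'
            f"<draw:text-box><text:p>{escape(text)}</text:p></draw:text-box>"
            f"</draw:frame>"
            for x, y, text in runs
        )
        return f'<draw:page draw:name="page{number}">{frames}</draw:page>'


class WriterGenerator(OutputGenerator):
    """Drops the positions: every run becomes a plain paragraph."""

    def emit_page(self, number, runs):
        return "".join(f"<text:p>{escape(text)}</text:p>" for _, _, text in runs)
```

Keeping the generators behind a common interface means the tree-processing code does not need to know which kind of document is being written.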

Low-level event input

This is the interface and the existing implementation that receive the low-level output commands generated from the PDF file (the "draw glyph at (x,y)" type of input):

Pdfimport-contentsink.png

There's one more class of this type in the unit test directory filter/source/pdfimport/test, implementing a stub device that just checks basic event generation sanity.
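To make the idea of such a stub device concrete, here is a minimal sketch of a sink interface with a checking implementation. The method names (`start_page`, `draw_glyph`, `end_page`) and the specific checks are assumptions made for illustration; the real interface is the one shown in the diagram above and is likely to have many more events.

```python
from abc import ABC, abstractmethod


class ContentSink(ABC):
    """Hypothetical receiver of low-level events from the PDF parser."""

    @abstractmethod
    def start_page(self, width, height):
        ...

    @abstractmethod
    def draw_glyph(self, x, y, char):
        ...

    @abstractmethod
    def end_page(self):
        ...


class CheckingSink(ContentSink):
    """Stub device: produces no output, only checks that events are sane."""

    def __init__(self):
        self.size = None  # (width, height) while a page is open, else None
        self.pages = 0
        self.glyphs = 0

    def start_page(self, width, height):
        if self.size is not None:
            raise RuntimeError("start_page while a page is still open")
        if width <= 0 or height <= 0:
            raise RuntimeError("page has no area")
        self.size = (width, height)

    def draw_glyph(self, x, y, char):
        if self.size is None:
            raise RuntimeError("draw_glyph outside of a page")
        width, height = self.size
        if not (0 <= x <= width and 0 <= y <= height):
            raise RuntimeError("glyph lies outside the page")
        self.glyphs += 1

    def end_page(self):
        if self.size is None:
            raise RuntimeError("end_page without start_page")
        self.size = None
        self.pages += 1
```

A test would feed a known PDF through the parser with a `CheckingSink` attached and then compare the `pages` and `glyphs` counters against expected values.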
