Difference between revisions of "Pdf Import Extension/Current Architecture"

From Apache OpenOffice Wiki

Revision as of 16:14, 12 November 2007

Currently, the PDF import extension uses xpdf to parse the PDF file and to generate a stream of low-level output operations, from which an ODF document is then synthesized.

This is a bit cumbersome: xpdf is GPL-licensed, so it has to run completely out-of-process from OOo (which is LGPL-licensed). A dedicated replacement parser is in the making (filter/source/pdfimport/pdfparse), but it will take some time to be on par with xpdf.

Currently, the way PDF files get imported looks like this:

Pdf architecture.png

That is, once triggered from the framework filter configuration, the importer component passes the filename of the PDF file to the xpdf executable, which loads and parses it and writes a series of fairly low-level drawing commands (like "put a glyph at position (x,y)") to stdout. The office process reads this output back and puts it into a tree structure, page by page. The tree is then post-processed to combine glyphs, polygons etc. into pieces that make more sense to the user (draw shapes and actual paragraphs of text).
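
The flow described above can be sketched roughly as follows. This is only an illustration in Python, not the extension's actual code: the line-based command format (`page`, `glyph X Y CHAR`), the `Glyph` and `Page` types, and the functions `parse_events` and `merge_glyphs` are all assumptions made up for this example, and the merging heuristic is deliberately naive.

```python
from dataclasses import dataclass, field


@dataclass
class Glyph:
    x: float
    y: float
    char: str


@dataclass
class Page:
    glyphs: list = field(default_factory=list)


def parse_events(lines):
    """Build a page-wise structure from low-level drawing commands.

    Assumed (hypothetical) format, one command per line:
      page            -> start a new page
      glyph X Y CHAR  -> put CHAR at position (X, Y)
    """
    pages = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "page":
            pages.append(Page())
        elif parts[0] == "glyph" and pages:
            x, y, char = float(parts[1]), float(parts[2]), parts[3]
            pages[-1].glyphs.append(Glyph(x, y, char))
    return pages


def merge_glyphs(page, max_gap=10.0):
    """Combine glyphs on the same baseline into text runs (naive heuristic)."""
    runs = []
    last = None
    for g in sorted(page.glyphs, key=lambda g: (g.y, g.x)):
        if last is not None and g.y == last.y and g.x - last.x <= max_gap:
            runs[-1] += g.char  # close enough: extend the current run
        else:
            runs.append(g.char)  # new baseline or large gap: start a new run
        last = g
    return runs


# In the importer these lines would be read back from the parser's stdout;
# here a fixed sample stands in for that output.
SAMPLE = [
    "page",
    "glyph 10 100 H",
    "glyph 16 100 i",
    "glyph 10 120 !",
]
pages = parse_events(SAMPLE)
runs = merge_glyphs(pages[0])
```

With the sample input, the two glyphs on the first baseline end up in one run and the glyph on the second baseline in another.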

Tree classes

This is the inheritance graph of the classes representing the graphical document tree:

Pdfimport-tree-nodes.png

Output generation classes

This is the interface and the two existing classes generating actual document output:

Pdfimport-tree-nodes.png
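
As a rough illustration of "one interface, two implementations generating output", here is a minimal Python sketch. The page does not name the actual classes, so `OutputGenerator`, `ParagraphGenerator` and `FrameGenerator` are hypothetical placeholders, and the choice of what each one emits is an assumption; only the ODF element names (`text:p`, `draw:frame`, `draw:text-box`) are real.

```python
from abc import ABC, abstractmethod
from xml.sax.saxutils import escape


class OutputGenerator(ABC):
    """Hypothetical interface: turns processed text runs into document output."""

    @abstractmethod
    def emit(self, runs):
        """Return an XML fragment for the given list of text runs."""


class ParagraphGenerator(OutputGenerator):
    """Assumed variant that emits each run as a plain paragraph."""

    def emit(self, runs):
        return "".join(f"<text:p>{escape(r)}</text:p>" for r in runs)


class FrameGenerator(OutputGenerator):
    """Assumed variant that wraps each run in a drawing frame with a text box."""

    def emit(self, runs):
        return "".join(
            "<draw:frame><draw:text-box>"
            f"<text:p>{escape(r)}</text:p>"
            "</draw:text-box></draw:frame>"
            for r in runs
        )


# The caller only depends on the interface, so the generators are interchangeable.
fragments = [gen.emit(["a<b"]) for gen in (ParagraphGenerator(), FrameGenerator())]
```

The point of the split is that the tree processing does not need to know which kind of document is being produced.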

Low-level event input

This is the interface and the existing implementation receiving the low-level output commands from the pdf file (the "draw glyph at (x,y)" type of input):

Pdfimport-tree-nodes.png

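There's one more class of this type in the unit test directory, filter/source/pdfimport/test (http://framework.openoffice.org/source/browse/framework/filter/source/pdfimport/test/?only_with_tag=cws_src680_picom).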
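
To show why a second implementation of the input interface is handy in a unit test, here is a small Python sketch. Everything in it is hypothetical: the `EventSink` interface, its `start_page` and `draw_glyph` methods, the `RecordingSink` class and the `dispatch` helper are invented for this example and do not mirror the actual classes, and the `page` / `glyph X Y CHAR` line format is likewise an assumption.

```python
from abc import ABC, abstractmethod


class EventSink(ABC):
    """Hypothetical interface receiving low-level events from the PDF parser."""

    @abstractmethod
    def start_page(self):
        """Called when a new page begins."""

    @abstractmethod
    def draw_glyph(self, x, y, char):
        """Called for each 'draw glyph at (x, y)' event."""


class RecordingSink(EventSink):
    """Test-style sink: records every event instead of building a document."""

    def __init__(self):
        self.events = []

    def start_page(self):
        self.events.append(("page",))

    def draw_glyph(self, x, y, char):
        self.events.append(("glyph", x, y, char))


def dispatch(lines, sink):
    """Translate assumed text commands into calls on the sink."""
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "page":
            sink.start_page()
        elif parts[0] == "glyph":
            sink.draw_glyph(float(parts[1]), float(parts[2]), parts[3])


sink = RecordingSink()
dispatch(["page", "glyph 10 100 H", "", "glyph 16 100 i"], sink)
```

A test can then compare `sink.events` against the expected sequence without involving any document output.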
