Writer/ToDo/PDF Import

Motivation

PDF is a widely used format for exchanging documents containing text and graphics between different applications and platforms. OpenOffice.org can currently create PDF documents via export filters that are available in every major OpenOffice.org application. Unfortunately, OpenOffice.org cannot import PDF documents, although this is one of the more frequently requested features.

See issue 10384 for further details.

Most professional layout tools, such as QuarkXPress, can place a PDF page as a graphic. For now we use a workaround: we convert the PDF page to an EPS with a TIFF preview, which can be placed in an OOo document. The preview is low-resolution, but the EPS is printed at its original resolution and in the intended CMYK color space, because the PostScript print drivers do not convert the EPS to RGB. There are many reasons why this feature would be useful, but placing fully laid-out tables (made in Calc) into Writer documents is one of the most important. In Quark you can select a PDF, choose the page number, and place it as a normal graphic. If this is implemented in OOo, it will be important that CMYK is preserved.

A key task for many users is marking up documents downloaded for review or research and saving those personal highlights and comments. PDF import would be a key enabler of this activity. To expand: although it is not often thought of this way, PDF is also an archival format, that is, a frozen snapshot in time. When the original, unmodified form is what is needed every time, the PDF is often stored in a common repository for search and retrieval. For more personal use, however, such as a researcher identifying citations relevant to a given study, this original form must be either

a) printed and highlighted to bring attention to relevant excerpts, or

b) excerpted and copy/pasted into a separate file for such references.

It would be desirable to be able to save the highlighting mask, with possible commentary, either into a personalized version of the PDF file or as a separate "commentary" file. Commentaries stored separately would minimize what needs to be forwarded when collaborating, and would limit data growth when large groups of reviewers store such commentaries in the same central repository. Even if a first level of import capability did nothing more than support this highlighting overlay, it would address a widespread need.

Goals for a PDF import

The document created by importing a PDF file should resemble the original as closely as possible. However, PDF does not lend itself to that end easily: most PDF files contain no information about layout or document structure at all. A PDF file can therefore never be imported on a 1:1 basis. We have to set goals that define what level of similarity must be achieved, based on feasibility.

These goals should be treated as paramount:

All text that is visible in the original PDF document should be imported.
Text attributes should be imported together with the respective text: font family, font size, weight (bold or not bold), and style (italic or not italic).
All drawing elements (images, vector graphics) should be imported.
If the implementation has to choose between layout fidelity and editability, it should lean towards layout fidelity.

Additionally, there are some goals that would greatly enhance the import result. By their nature, all of these features can only be implemented with heuristic methods, since PDF (unless the file uses tagged PDF) does not contain structural information. The following text features should be detected, in descending order of importance:

Paragraphs
Enumerations
Titles
Underlined text
subscript/superscript
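
As an illustration of the kind of heuristic meant here, the following minimal sketch labels subscript and superscript runs within a single text line. It assumes the parser reports each run with a baseline position and font size; the `Fragment` structure, the function name, and both thresholds are hypothetical and not part of any existing OOo or xpdf API.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Fragment:
    """A run of text as a PDF parser might report it (hypothetical structure)."""
    text: str
    x: float         # left edge, in points
    baseline: float  # y of the baseline, in points, growing upwards
    size: float      # font size, in points


def classify_scripts(line, size_ratio=0.8, shift_ratio=0.15):
    """Label each fragment of one text line as 'normal', 'super' or 'sub'.

    A fragment counts as a script when its font is clearly smaller than the
    line's median size AND its baseline is shifted away from the median
    baseline by at least shift_ratio times the median size.
    Both thresholds are guesses that would need tuning on real documents.
    """
    ref_size = median(f.size for f in line)
    ref_base = median(f.baseline for f in line)
    labels = []
    for f in line:
        shift = f.baseline - ref_base
        is_small = f.size <= size_ratio * ref_size
        is_shifted = abs(shift) >= shift_ratio * ref_size
        if is_small and is_shifted:
            labels.append("super" if shift > 0 else "sub")
        else:
            labels.append("normal")
    return labels


line = [
    Fragment("E = mc", 72, 700, 12),
    Fragment("2", 110, 704, 8),    # raised and smaller
    Fragment(" and H", 116, 700, 12),
    Fragment("2", 150, 698, 8),    # lowered and smaller
    Fragment("O", 156, 700, 12),
]
print(classify_scripts(line))
```

Using the median as the reference keeps a few script runs from skewing the line's "normal" size and baseline; a line made up mostly of small text would defeat this simple approach.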

Use-Cases

There are three use cases for a PDF import filter. To better understand what should be developed, I will address each of them separately:

Text-Stream Import
Text + Layout Import (non-editable)
Text + Layout Import (fully editable)

Discoleo 18:37, 7 October 2007 (CEST)

Text-Stream Import

Sometimes people mainly want to import the text stream, so that they can edit it in their preferred program and use it in their own work. In these cases the exact layout is not that important, and the import filter should:

generate a continuous text stream [i.e. NOT just every line terminated by CR/LF, like the current Adobe Acrobat select tool]
ideally, detect some text structure:
- like sub-/super-script
- paragraphs
- tables
- underlined text
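
A continuous text stream could be approximated along the following lines. This is only a sketch under strong assumptions: a single-column page, lines already in reading order, and each line given as a hypothetical `(baseline_y, text)` pair with y decreasing down the page. The function name and the `gap_factor` threshold are made up for illustration.

```python
def reflow(lines, gap_factor=1.5):
    """Join (baseline_y, text) lines into a list of paragraph strings.

    The smallest positive gap between consecutive baselines is taken as the
    normal line spacing; a gap larger than gap_factor times that spacing
    starts a new paragraph. A trailing hyphen is treated as a soft hyphen
    and removed, which is naive: genuine hyphens at a line end get lost.
    """
    lines = [(y, text.strip()) for y, text in lines if text.strip()]
    if not lines:
        return []
    gaps = [upper[0] - lower[0] for upper, lower in zip(lines, lines[1:])]
    positive = [g for g in gaps if g > 0]
    spacing = min(positive) if positive else 0.0

    paragraphs = [lines[0][1]]
    for gap, (_, text) in zip(gaps, lines[1:]):
        if spacing and gap > gap_factor * spacing:
            paragraphs.append(text)                          # large gap: new paragraph
        elif paragraphs[-1].endswith("-"):
            paragraphs[-1] = paragraphs[-1][:-1] + text      # rejoin a split word
        else:
            paragraphs[-1] += " " + text                     # same paragraph
    return paragraphs


page = [
    (700, "PDF is a widely used for-"),
    (686, "mat to exchange documents."),
    (658, "OpenOffice.org can export"),
    (644, "such files."),
]
for paragraph in reflow(page):
    print(paragraph)
```

Each paragraph comes out as one unbroken string, rather than one string per visual line as with a plain copy-and-paste.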

Text + Layout (as background)

Users sometimes need to fill in a document or form. Governments and other institutions often publish official documents in PDF format (simple PDFs, NOT PDF forms), but one cannot add or write any text in these simple PDF documents.

The import-filter should therefore:

import the PDF (both text streams and layout) as a background
users should be able to write new text in the foreground, overlaid on the background document
- however, this should be handled better than pasting the PDF document as an image and writing over the image
- images increase file size, zoom poorly, and make it difficult to position new text accurately on an existing text line, ...
- it should be possible to position the cursor on the baseline of an existing text line, so that newly written text lines up with the existing text
- the tool should detect existing text-box boundaries, so that one can write new text extending from those boundaries
ideally, some minimal PDF-editing features should be possible
- move whole sections (text, graphics, and layout of the remaining document) downwards, e.g. if the new text does NOT fit in the existing free space
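
The baseline positioning mentioned above could work roughly as in this small sketch. It assumes the import has already collected the baseline y-coordinates of the background text; the function name and the tolerance value are hypothetical.

```python
def snap_to_baseline(click_y, baselines, tolerance=6.0):
    """Return the nearest known baseline if it lies within tolerance points
    of click_y; otherwise return click_y unchanged (free positioning)."""
    if not baselines:
        return click_y
    nearest = min(baselines, key=lambda b: abs(b - click_y))
    return nearest if abs(nearest - click_y) <= tolerance else click_y


form_baselines = [700.0, 686.0, 672.0]
print(snap_to_baseline(702.5, form_baselines))  # close to a line: snaps
print(snap_to_baseline(650.0, form_baselines))  # far from any line: unchanged
```

Falling back to the clicked position keeps free placement possible in empty areas of the form.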

Text + Layout (fully editable)

Of course, this would be a nice feature, but considering the PDF format, it seems rather elusive.

However, PDF documents saved by OOo should contain additional information that allows them to be imported back into OOo in a fully editable state. At least OOo-generated documents should allow this editing mode.

Another approach could be to allow importing PDF into Impress or Draw to edit the complete layout, treating it as a poster instead of a document (which may be repaginated etc.).

Implementation

We will try to come up with a first prototype soon, most probably using an out-of-process xpdf instance to do the parsing (due to license issues). Here's a list of things to do:

Area	Title	State	CWS
Parser	Wrap pdf parser with UNO	100%	picom
Parser	Connect to xpdf out-of-process	100%	picom
Tooling	Enhance rendering API to provide truly generic bitmap access	100%	picom
Canvas	Adapt Canvas implementations to the new API	100%	picom
Tooling	Adapt VCL's canvastools to be able to import XBitmap generically to VCL bitmap	100%	picom
Tooling	Enable GraphicImporter to use rendering::XBitmap	90%	picom
Import	Read content via UNO	100%	picom
Import	Combine low-level structure (like stroke and fill)	90%	picom
Import	Generate SAX events	100%	picom
Import	Generate ODF stream	100%	picom
Import	Detect text flow: portions	0%	picom
Import	Detect text flow: lines	0%	picom
Import	Detect text flow: paragraphs	0%	picom
Import	Detect text style	0%	picom
Import	Detect shape style (e.g. shadow)	0%	picom
Parser	Replacement for xpdf	0%	picom
CVS	Move pdf import to OOo CVS	90%	picom
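
To illustrate the out-of-process approach, here is a minimal sketch of an importer reading parse events from a child process over a pipe. The line protocol (`PAGE`, `TEXT`, `ENDPAGE`) and the function name are invented for this example and are not the protocol used by the prototype. A Python one-liner stands in for the real parser binary so that the sketch runs on its own.

```python
import subprocess
import sys


def read_events(command):
    """Run a parser as a child process and yield (keyword, fields) per line.

    The line protocol is invented for this sketch: one event per line,
    a keyword followed by a space and the remaining fields.
    """
    with subprocess.Popen(command, stdout=subprocess.PIPE, text=True) as proc:
        for raw in proc.stdout:
            line = raw.strip()
            if not line:
                continue
            keyword, _, fields = line.partition(" ")
            yield keyword, fields
    # Leaving the with-block waits for the child, so returncode is set here.
    if proc.returncode != 0:
        raise RuntimeError(f"parser exited with status {proc.returncode}")


# Stand-in for a real parser binary: a Python one-liner that prints events.
FAKE_PARSER = [
    sys.executable, "-c",
    "print('PAGE 1'); print('TEXT 72 700 12 Hello world'); print('ENDPAGE')",
]

for keyword, fields in read_events(FAKE_PARSER):
    print(keyword, fields)
```

Because the parser runs as a separate program and only a text stream crosses the process boundary, the importer never links against the parser's code.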
