Difference between revisions of "OpenOffice.org Internship/Projects/2010/Improve PDF Import"

(Current tasks)
(Current tasks)
 
(33 intermediate revisions by the same user not shown)
Line 32: Line 32:

  # [http://qa.openoffice.org/issues/show_bug.cgi?id=93793 Pop-up window which allows to replace fonts]
- # [http://qa.openoffice.org/issues/show_bug.cgi?id=94532 Allow import of only selected pages]
  # [[Native PDF forms]]
- # [[Proper paragraphs]]
  # [[Processing layout of LaTeX PDF]]
  # [[Import of complex vector graphics elements]]
Line 48: Line 46:
  This section contains the list of tasks that are being done right now.

- # Proper paragraphs
+ # [[User:Joekidd/OpenOffice.org/Internship/PDFImport/Tasks/Misplaced paragraphs | Misplaced paragraphs ]]

  ==What has been done so far==

- # [[Tasks/Introduction | Introduction ]]
+ # [[User:Joekidd/OpenOffice.org/Internship/PDFImport/Tasks/Introduction | Introduction ]]
- # [[Tasks/Issue109708solved | Fixing rotated text problem]]
+ # [[User:Joekidd/OpenOffice.org/Internship/PDFImport/Tasks/Issue109708 | Issue 109708]]
- # [[Tasks/Issue105133 | Issue 105133]]
+ # [[User:Joekidd/OpenOffice.org/Internship/PDFImport/Tasks/Issue105133 | Issue 105133]]
- # [[Tasks/Issue92919 | Issue 92919]]
+ # [[User:Joekidd/OpenOffice.org/Internship/PDFImport/Tasks/Issue92919 | Issue 92919]]
+ # [[User:Joekidd/OpenOffice.org/Internship/PDFImport/Tasks/Proper paragraphs | Proper paragraphs ]]
+ # [[User:Joekidd/OpenOffice.org/Internship/PDFImport/Tasks/Improving rotated text | Improving rotated text ]]
+ # [[User:Joekidd/OpenOffice.org/Internship/PDFImport/Tasks/Moving Proper paragraphs GlyphProcessor | Proper paragraphs required code changes]]
+ # [[User:Joekidd/OpenOffice.org/Internship/PDFImport/Tasks/Testing proper paragraphs | Testing proper paragraphs import]]
+ # [[User:Joekidd/OpenOffice.org/Internship/PDFImport/Tasks/Improving char spaces | Improving char spaces ]]
+ # [[User:Joekidd/OpenOffice.org/Internship/PDFImport/Tasks/Allow import of only selected pagesn | Allow import of only selected pages ]]

  ==Problematic Tasks==

- # [[Tasks/Issue90633 | Issue 90633]]
+ # [[User:Joekidd/OpenOffice.org/Internship/PDFImport/Tasks/Issue90633 | Issue 90633]]

  ==Project status==
  *The project is accepted for the OpenOffice summer internship program 2010

Latest revision as of 15:57, 11 October 2010


Abstract

The PDF Import Extension allows you to import and modify PDF documents. Best results with 100% layout accuracy can be achieved with the "PDF/ODF hybrid file" format, which this extension also enables. A hybrid PDF/ODF file is a PDF file that contains an embedded ODF source file. Hybrid PDF/ODF files will be opened in OpenOffice.org as an ODF file without any layout changes. Users without this extension can open the PDF part of the hybrid file with their PDF viewer.

The PDF Import Extension also allows you to import and modify PDF documents that are not hybrid PDF/ODF files. Such documents are imported in Draw to preserve the layout and to allow basic editing. For simply formatted documents, this is a good way to change dates, numbers or small portions of text with minimal loss of formatting information.

Goals for a PDF import

The document created by importing a PDF file should resemble the original as closely as possible. PDF itself, however, does not lend itself easily to that end: most PDF files contain no information about layout or document structure at all. A PDF file can therefore never be imported on a 1:1 basis, and we have to set goals that define what level of similarity is feasible.

These goals should be treated as paramount:

  • all text that is visible in the original PDF document should be imported
  • text attributes: font family, font size, weight (bold, not bold), style (italic, not italic) should be imported together with the respective text.
  • all drawing elements (images, vector graphics) should be imported.
  • if the implementation has to choose between layout fidelity and editability, lean towards layout.

Additionally, there are some goals that would greatly enhance the import result. By their nature, all of these features can only be implemented with heuristic methods, since PDF (unless the file uses tagged PDF) does not contain structural information. The following text features should be detected (in descending order of importance):

  • Paragraphs
  • Enumerations
  • Titles
  • Underlined text
  • subscript/superscript
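
To illustrate what a heuristic method for the first item (paragraphs) could look like, here is a minimal sketch. It is not the extension's actual implementation: the `TextLine` structure, the `group_into_paragraphs` function and the `gap_factor` threshold are all hypothetical names chosen for this example. The assumption is that each text line is known only by its content and its vertical position on the page, and that an unusually large vertical step between two consecutive lines signals a paragraph break.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TextLine:
    """One line of text with its page position (hypothetical structure)."""
    text: str
    top: float  # distance of the line from the top of the page, in points


def group_into_paragraphs(lines, gap_factor=1.5):
    """Group lines into paragraphs using a vertical-gap heuristic.

    A new paragraph starts whenever the step to the previous line is
    larger than gap_factor times the median step on the page.
    gap_factor is an assumed tuning value, not a known constant.
    """
    if not lines:
        return []
    ordered = sorted(lines, key=lambda ln: ln.top)
    # Vertical steps between consecutive lines, top to bottom.
    steps = [b.top - a.top for a, b in zip(ordered, ordered[1:])]
    typical = median(steps) if steps else 0.0
    paragraphs = [[ordered[0].text]]
    for step, line in zip(steps, ordered[1:]):
        if typical > 0 and step > gap_factor * typical:
            paragraphs.append([line.text])  # large gap: start a new paragraph
        else:
            paragraphs[-1].append(line.text)  # normal gap: same paragraph
    return [" ".join(p) for p in paragraphs]


page = [
    TextLine("First line of", 100.0),
    TextLine("the first paragraph.", 112.0),
    TextLine("Second paragraph", 138.0),
    TextLine("continues here.", 150.0),
]
paragraphs = group_into_paragraphs(page)
```

A real importer would need further signals (indentation, font changes, column boundaries), which is why the result can only approximate the original structure.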

Backlog

This section contains the list of tasks that are going to be done during the internship and haven't been started yet.

  1. Pop-up window which allows to replace fonts
  2. Native PDF forms
  3. Processing layout of LaTeX PDF
  4. Import of complex vector graphics elements
  5. Conversion of tables
  6. Import of EPS graphics
  7. RTL (right-to-left) text/font support
  8. Change ContentSink class
  9. Fix disappearing bookmarks
  10. Fix Ghostscript PDF import

Current tasks

This section contains the list of tasks that are being done right now.

  1. Misplaced paragraphs

What has been done so far

  1. Introduction
  2. Issue 109708
  3. Issue 105133
  4. Issue 92919
  5. Proper paragraphs
  6. Improving rotated text
  7. Proper paragraphs required code changes
  8. Testing proper paragraphs import
  9. Improving char spaces
  10. Allow import of only selected pages

Problematic Tasks

  1. Issue 90633

Project status

  • The project has been accepted for the OpenOffice summer internship program 2010