Difference between revisions of "OpenOffice.org Internship/Projects/2010/Improve PDF Import"

From Apache OpenOffice Wiki
Revision as of 12:07, 21 July 2010


Abstract

The PDF Import Extension allows you to import and modify PDF documents. The best results, with 100% layout accuracy, are achieved with the "PDF/ODF hybrid file" format, which this extension also enables. A hybrid PDF/ODF file is a PDF file that contains an embedded ODF source file. OpenOffice.org opens hybrid PDF/ODF files as ODF files, without any layout changes. Users without this extension can still open the PDF part of a hybrid file in their PDF viewer.

The extension also allows you to import and modify PDF documents that are not hybrid PDF/ODF files. These documents are imported into Draw to preserve the layout and to allow basic editing. This works well for changing dates, numbers, or small portions of text in simply formatted documents, with minimal loss of formatting information.

Goals for a PDF import

The document created by importing a PDF file should resemble the original as closely as possible. PDF itself does not lend itself to that end easily, however: most PDF files contain no information about layout or document structure at all. A PDF file can therefore never be imported on a 1:1 basis, and we have to define goals that state what level of similarity must be achieved, based on feasibility.

These goals should be treated as paramount:

  • All text that is visible in the original PDF document should be imported.
  • Text attributes should be imported together with the respective text: font family, font size, weight (bold or not bold), and style (italic or not italic).
  • All drawing elements (images, vector graphics) should be imported.
  • If the implementation has to choose between layout fidelity and editability, it should lean towards layout fidelity.

Additionally, there are some goals that would greatly enhance the import result. By their nature, all of these features can only be implemented with heuristic methods, since PDF (unless the file uses tagged PDF) does not contain structural information. The following text features should be detected (listed in descending order of importance):

  • Paragraphs
  • Enumerations
  • Titles
  • Underlined text
  • Subscript/superscript
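
To illustrate what a heuristic for the first feature (paragraphs) might look like, here is a minimal sketch. It is not the extension's actual implementation: the `TextLine` type, the `group_paragraphs` function, and the `max_gap_factor` threshold are all hypothetical names and values chosen for this example. The sketch assumes each imported text line carries a baseline position (measured from the top of the page) and a font size, and starts a new paragraph whenever the vertical gap to the previous line is unusually large.

```python
from dataclasses import dataclass


@dataclass
class TextLine:
    """One line of imported text (hypothetical structure for this sketch)."""
    text: str
    baseline_y: float  # distance from the top of the page, in points
    font_size: float   # in points


def group_paragraphs(lines, max_gap_factor=1.5):
    """Group text lines into paragraphs by vertical spacing.

    A new paragraph starts when the distance between two consecutive
    baselines exceeds max_gap_factor times the previous line's font size.
    The threshold is an assumption and would need tuning on real documents.
    """
    paragraphs = []
    current = []
    previous = None
    # Process lines from the top of the page to the bottom.
    for line in sorted(lines, key=lambda item: item.baseline_y):
        if previous is not None:
            gap = line.baseline_y - previous.baseline_y
            if gap > max_gap_factor * previous.font_size:
                paragraphs.append(current)
                current = []
        current.append(line.text)
        previous = line
    if current:
        paragraphs.append(current)
    # Join the lines of each paragraph into a single string.
    return [" ".join(paragraph) for paragraph in paragraphs]
```

A real detector would likely also have to consider indentation, column layout, and font changes; spacing alone is only the simplest possible signal.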

Backlog

This section lists the tasks that are planned for the internship but have not been started yet.



Current tasks for

This section contains the list of tasks that are being done right now.

  • Getting to know the code (estimated time: 32 hours).
  • Getting to know Mercurial (estimated time: 8 hours).
  • Setting up the environment and the first build (estimated time: 8 hours).
  • Creating the wiki page and backlog list (estimated time: 8 hours).

History

This section contains the history of the internship, divided by week.
