Difference between revisions of "CalcParser"

From Apache OpenOffice Wiki

Latest revision as of 12:30, 15 September 2021

This code's mission in life is to parse the XML of an OpenOffice spreadsheet. The code still needs more work; however, it does work with Python 2.3.4, which is the version included in the OpenOffice.org installation.

The code uses SAX (Simple API for XML), a parser API originally written for Java but ported to other languages such as Python.

The first thing I faced here was the change in the SAX library between Python versions: the research I did before writing this code was for version 2.4 and up, which wasn't compatible with 2.3.

Also, some of this code was written for the Excel XML file schema, and I needed to switch it to the OpenDocument spreadsheet XML.

The incompatibility problem was primarily with the handlers: the DefaultHandler originally used was deprecated in this library, and handlers are provided by a separate sub-module called handler. The class name also changed from DefaultHandler to ContentHandler. Please check the comments in the code.

#!/usr/bin/env python
import sys
from xml.sax import saxutils  # used by the original code written for Python 2.4
from xml.sax import parse
from xml.sax import handler  # Python 2.3 uses the handler module for ContentHandler

The next and more interesting step was to insert the proper tag names into the handler. By default SAX uses three callbacks: startElement, endElement and characters. In this script we created three lists for the different levels of the data structure within the content: chars, cells and rows.

    def startElement(self, name, atts):
        if name=="table:table-cell":    #we pick the table-cell tag
            self.chars=[]
        elif name=="table:table-row":   # and the rows
            self.cells=[]

characters is the handler that receives the text content of the tags; startElement is what triggers the processing, and here it is tied to the table:table-cell and table:table-row tags.
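
To make the order of the callbacks easier to see, here is a minimal sketch that only records the events SAX fires for a tiny hand-written fragment. The TraceHandler class and the SAMPLE data are made up for this illustration and are not part of the original script; the snippet also uses byte literals, so it assumes a more recent Python than 2.3.

```python
from xml.sax import parseString
from xml.sax import handler

# Hypothetical sample: one row with a single cell, using the
# OpenDocument table and text namespaces.
SAMPLE = (
    b'<table:table-row'
    b' xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0"'
    b' xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">'
    b'<table:table-cell><text:p>42</text:p></table:table-cell>'
    b'</table:table-row>'
)


class TraceHandler(handler.ContentHandler):
    """Records every callback so the order of SAX events is visible."""

    def __init__(self):
        handler.ContentHandler.__init__(self)
        self.events = []

    def startElement(self, name, atts):
        self.events.append(("start", name))

    def characters(self, content):
        # The text of one element may arrive in more than one call.
        self.events.append(("chars", content))

    def endElement(self, name):
        self.events.append(("end", name))


trace = TraceHandler()
parseString(SAMPLE, trace)
for event in trace.events:
    print(event)
```

Because the text of a cell can arrive in several characters calls, the script collects the pieces in chars and only joins them once the end of the cell is reached.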

#!/usr/bin/env python
import sys
from xml.sax import saxutils
from xml.sax import parse
from xml.sax import handler  # Python 2.3 uses the handler module

# Replace DefaultHandler with ContentHandler
# from the handler module
class CalcHandler(handler.ContentHandler):
    def __init__(self):
        handler.ContentHandler.__init__(self)
        self.chars=[]   # text pieces of the current cell
        self.cells=[]   # cell values of the current row
        self.rows=[]    # all rows read so far

    def characters(self, content):
        self.chars.append(content)

    def startElement(self, name, atts):
        if name=="table:table-cell":
            self.chars=[]
        elif name=="table:table-row":
            self.cells=[]

    def endElement(self, name):
        if name=="table:table-cell":
            self.cells.append(''.join(self.chars))
        elif name=="table:table-row":
            self.rows.append(self.cells)

calcHandler=CalcHandler()
parse(sys.argv[1], calcHandler)  # path to the spreadsheet XML file
print(calcHandler.rows)
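
The script above expects the path to the XML itself. An OpenDocument spreadsheet (.ods) is a zip archive whose cell data lives in a member called content.xml, so one possible way to feed it to the same handler is sketched below. The read_ods_rows helper is a hypothetical name introduced for this example, not part of the original code, and the sketch assumes a more recent Python than 2.3.

```python
import zipfile
from xml.sax import parseString
from xml.sax import handler


class CalcHandler(handler.ContentHandler):
    """Same handler as in the script: collects cell text row by row."""

    def __init__(self):
        handler.ContentHandler.__init__(self)
        self.chars = []
        self.cells = []
        self.rows = []

    def characters(self, content):
        self.chars.append(content)

    def startElement(self, name, atts):
        if name == "table:table-cell":
            self.chars = []
        elif name == "table:table-row":
            self.cells = []

    def endElement(self, name):
        if name == "table:table-cell":
            self.cells.append(''.join(self.chars))
        elif name == "table:table-row":
            self.rows.append(self.cells)


def read_ods_rows(path):
    """Hypothetical helper: parse content.xml from the .ods archive at path."""
    archive = zipfile.ZipFile(path)
    try:
        content = archive.read("content.xml")  # raw bytes of the XML
    finally:
        archive.close()
    calc_handler = CalcHandler()
    parseString(content, calc_handler)
    return calc_handler.rows
```

It could then be called as read_ods_rows("example.ods"), where the file name is only a placeholder. Note that this sketch, like the script, does not expand cells that the file compresses with the table:number-columns-repeated attribute, which is part of the "needs more work" mentioned at the top.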