FastParser

From Apache OpenOffice Wiki

DRAFT

Abstract

The service com.sun.star.xml.sax.FastParser provides an xml parser that generates sax-like events. It implements the interface XFastParser.

IDL

service : XFastParser;

Namespace

com::sun::star::xml::sax

Motivation

To implement the first OOo xml filters in 2000, we decided to use a sax api to parse the xml streams. At that time there were several proven sax implementations to choose from, and we picked expat, the XML Parser Toolkit. Under the namespace com.sun.star.xml.sax we developed a set of UNO interfaces that resembled a sax api. A component was implemented that used expat to provide a reader and a writer. This component is used in the XMLOFF project, which implements the import and export filters for the OpenDocumentFormat.

This solution has two drawbacks:

  • The initial sax api didn't know about xml namespaces, so namespaces have to be handled in each filter.
  • The utf8 strings from the xml stream have to be converted to the UNO strings that use utf16 for encoding. This is one of the major bottlenecks when dealing with larger xml documents.

An additional feature that is not provided by the sax api is the concept of import contexts. Usually a sax import filter implements one document handler that gets all xml events for the whole xml stream. Therefore this document handler must know about all valid elements. In the XMLOFF project we added a context stack to the document handler. Basically, we added a create child event to the existing sax events. This event is called on the current context to create a new child context for each direct child element of the current context. While the child element is processed, its context is put on top of the stack and becomes the current context. After the current context gets the end event, it is removed from the stack. This made implementing an importer for a rather complex format very easy.

The problem is that the namespace handling and the context feature are buried deep in the XMLOFF project, so new filter implementations would have to reimplement them.

To simplify implementations of xml import filters I designed a new sax like UNO api that implements the following key features:

  • Support for namespaces
  • Support for the context feature
  • Replacing xml names strings with integer tokens

The last feature, replacing xml names with integer tokens, should overcome the string performance issue we had with the old sax api. It also simplifies the implementation of an xml import filter.

For example consider the following code from XMLOFF

void AnimationNodeContext::StartElement( const Reference< XAttributeList >& xAttrList )
{
  const sal_Int16 nCount = xAttrList.is() ? xAttrList->getLength() : 0;
  for( sal_Int16 nAttribute = 0; nAttribute < nCount; nAttribute++ )
  {
    const OUString& rAttrName = xAttrList->getNameByIndex( nAttribute );
    const OUString& rValue = xAttrList->getValueByIndex( nAttribute );
 
    OUString aLocalName;
    sal_uInt16 nPrefix = GetImport().GetNamespaceMap().GetKeyByAttrName( rAttrName, &aLocalName );
 
    if( IsXMLToken( aLocalName, XML_PAR ) && nPrefix == XML_NAMESPACE_ANIMATION )
    {
      ...
    }
    else if( IsXMLToken( aLocalName, XML_NODE ) && nPrefix == XML_NAMESPACE_SMIL )
    {
      ...
    }
  }
}

For every attribute, this code performs multiple string conversions, namespace lookups and string comparisons.

With the fast sax parser api, a handler only has to compare integer tokens:

void AnimationNodeContext::startFastElement( sal_Int32 Element, const Reference< XFastAttributeList >& Attribs )
{
  switch( Element )
  {
    case NMSP_ANIMATION|XML_PAR:
      ...
      break;
    case NMSP_SMIL|XML_NODE:
      ...
      break;
  }
}

Implementing an xml format import filter with the fast sax api

To parse an xml stream you need to create an instance of the service com.sun.star.xml.sax.FastParser. This parser needs an XFastDocumentHandler, which you must set via the method setFastDocumentHandler(). The XFastDocumentHandler must be implemented by your filter component. Its most important method is createFastChildContext(), inherited from its parent interface XFastContextHandler. The parser calls it to create a context for the root element of the xml stream.

The fast context handler

When the XFastDocumentHandler returns an XFastContextHandler for the root element to the parser, the parser sends the sax events for the root element to that context.

For each child element, the parser asks the context of the parent element to return a context for the child. The returned context then gets all sax events for that child element.

A filter can return the same instance of an XFastContextHandler more than once.

Since the XFastDocumentHandler interface is derived from XFastContextHandler you can implement a filter with only one instance by always returning your XFastDocumentHandler as a child context. This way you disable the context feature and you get the sax events the same way as with an old sax api XDocumentHandler.
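
The dispatch described in this section can be sketched without any UNO headers. The following standalone C++ is only an illustration, not the FastParser implementation: Context, Driver and Recorder are hypothetical names, elements are plain integers, and attributes are left out. It shows the parent context being asked for a child context, a context returning itself more than once, and a null context causing an element and all its children to be ignored.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for XFastContextHandler (no UNO types involved).
struct Context
{
    virtual ~Context() = default;
    virtual void startElement( std::int32_t /*nElement*/ ) {}
    virtual void endElement( std::int32_t /*nElement*/ ) {}
    // Return the context for a child element, or nullptr to ignore it.
    virtual std::shared_ptr< Context > createChild( std::int32_t /*nElement*/ ) { return nullptr; }
};

// Hypothetical driver that plays the role of the parser.
class Driver
{
public:
    explicit Driver( std::shared_ptr< Context > xRoot ) { maStack.push_back( std::move( xRoot ) ); }

    void start( std::int32_t nElement )
    {
        // Ask the current context for a child context. If the parent was
        // ignored (nullptr), the whole subtree is ignored as well.
        std::shared_ptr< Context > xParent = maStack.back();
        std::shared_ptr< Context > xChild = xParent ? xParent->createChild( nElement ) : nullptr;
        if( xChild )
            xChild->startElement( nElement );
        maStack.push_back( xChild );
    }

    void end( std::int32_t nElement )
    {
        if( maStack.back() )
            maStack.back()->endElement( nElement );
        maStack.pop_back();
    }

private:
    std::vector< std::shared_ptr< Context > > maStack;
};

// Example context that records its events and reuses itself for its children.
struct Recorder : Context, std::enable_shared_from_this< Recorder >
{
    std::vector< std::string >& mrLog;
    explicit Recorder( std::vector< std::string >& rLog ) : mrLog( rLog ) {}

    void startElement( std::int32_t n ) override { mrLog.push_back( "start " + std::to_string( n ) ); }
    void endElement( std::int32_t n ) override { mrLog.push_back( "end " + std::to_string( n ) ); }

    std::shared_ptr< Context > createChild( std::int32_t n ) override
    {
        // Elements 1 and 2 are handled by this same instance, all others are ignored.
        if( n == 1 || n == 2 )
            return shared_from_this();
        return nullptr;
    }
};

// Feeds a small event sequence through the driver and returns the recorded log.
std::vector< std::string > runDemo()
{
    std::vector< std::string > aLog;
    Driver aDriver( std::make_shared< Recorder >( aLog ) );
    aDriver.start( 1 );   // root element, handled
    aDriver.start( 99 );  // not handled: ignored together with its children
    aDriver.start( 2 );   // child of an ignored element, not reported
    aDriver.end( 2 );
    aDriver.end( 99 );
    aDriver.start( 2 );   // handled
    aDriver.end( 2 );
    aDriver.end( 1 );
    return aLog;
}
```

In this sketch the Recorder passed to the Driver takes the role of the document handler: it is asked for the root element's context and then returns itself for every element it wants to see.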

The fast token handler

You should implement an object with the interface XFastTokenHandler. It is used by the parser to convert xml element local names and attribute names to integer tokens. Usually the parser will retrieve integer tokens by calling getTokenFromUTF8. This method takes a utf8 encoded string as a byte sequence. Since xml files are usually encoded in utf8, no format conversion is needed in the default case.

When the parser processes an xml element, it first asks the token handler whether it has a valid integer token for the local name of that element. If the token handler does not know the element local name, the parser handles this element as an unknown element. Elements with a namespace that was not registered at the parser are also handled as unknown elements, even if the local name is known by the token handler.
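
The decision between known and unknown elements can be sketched as follows. This is a standalone illustration without UNO types. TokenHandler, NamespaceRegistry, resolveElement and the token values are hypothetical, the sentinel -1 is an assumption of this sketch, and std::string stands in for the byte sequence that getTokenFromUTF8 receives.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Sentinel for "no token"; the value is an assumption of this sketch.
constexpr std::int32_t TOKEN_UNKNOWN = -1;

// Hypothetical stand-in for a token handler: maps utf8 names to tokens < 0x10000.
class TokenHandler
{
public:
    std::int32_t getTokenFromUTF8( const std::string& rName ) const
    {
        const auto it = maTokens.find( rName );
        return it == maTokens.end() ? TOKEN_UNKNOWN : it->second;
    }

private:
    std::unordered_map< std::string, std::int32_t > maTokens{ { "par", 1 }, { "node", 2 } };
};

// Hypothetical stand-in for the namespace registrations held by the parser.
class NamespaceRegistry
{
public:
    void registerNamespace( const std::string& rURL, std::int32_t nNamespaceId )
    {
        maIds[ rURL ] = nNamespaceId;
    }

    std::int32_t find( const std::string& rURL ) const
    {
        const auto it = maIds.find( rURL );
        return it == maIds.end() ? TOKEN_UNKNOWN : it->second;
    }

private:
    std::unordered_map< std::string, std::int32_t > maIds;
};

// Returns the combined element token, or TOKEN_UNKNOWN if the namespace is
// not registered or the local name has no token.
std::int32_t resolveElement( const NamespaceRegistry& rRegistry, const TokenHandler& rTokens,
                             const std::string& rNamespaceURL, const std::string& rLocalName )
{
    const std::int32_t nNamespace = rRegistry.find( rNamespaceURL );
    const std::int32_t nLocal = rTokens.getTokenFromUTF8( rLocalName );
    if( nNamespace == TOKEN_UNKNOWN || nLocal == TOKEN_UNKNOWN )
        return TOKEN_UNKNOWN;
    return nNamespace | nLocal; // namespace id in the upper bits, name token in the lower 16 bits
}

// Usage: one registered namespace, one known and one unknown local name.
void checkResolve()
{
    NamespaceRegistry aRegistry;
    aRegistry.registerNamespace( "http://sample_namespace_1", 1 << 16 );
    const TokenHandler aTokens;
    assert( resolveElement( aRegistry, aTokens, "http://sample_namespace_1", "par" ) == ( ( 1 << 16 ) | 1 ) );
    assert( resolveElement( aRegistry, aTokens, "http://sample_namespace_1", "other" ) == TOKEN_UNKNOWN );
    assert( resolveElement( aRegistry, aTokens, "http://not_registered", "par" ) == TOKEN_UNKNOWN );
}
```

Because the namespace id and the name token occupy disjoint bits, a handler can test both with a single integer comparison.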

Optimize your fast token handler

For maximum performance the fast token handler should know all valid element and attribute names. You can also add common attribute values. It is very easy to automatically generate such a token handler.

For the ms office 12 import filter I used gperf to create a perfect hash for all xml names of that format at compile time. See XFastTokenHandler for a sample implementation.
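
As a rough idea of what a generated lookup could look like, the sketch below uses a sorted compile-time table with a binary search. It is a simpler stand-in for a gperf perfect hash, not gperf output. The table contents, the token values and the name lookupToken are hypothetical, and the code assumes C++17 for std::string_view.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

struct TokenEntry
{
    std::string_view maName;
    std::int32_t mnToken;
};

// Must stay sorted by name; in practice a build step would generate this table.
constexpr std::array< TokenEntry, 4 > aTokenTable{ {
    { "name", 1 },
    { "node", 2 },
    { "par", 3 },
    { "sampleelement", 4 },
} };

// Looks up a utf8 name given as pointer and length, so the bytes from the
// xml stream are compared in place without copying or converting them.
// Returns -1 (sentinel chosen for this sketch) if the name is not in the table.
std::int32_t lookupToken( const char* pBytes, std::size_t nLength )
{
    const std::string_view aName( pBytes, nLength );
    const auto it = std::lower_bound(
        aTokenTable.begin(), aTokenTable.end(), aName,
        []( const TokenEntry& rEntry, std::string_view aKey ) { return rEntry.maName < aKey; } );
    return ( it != aTokenTable.end() && it->maName == aName ) ? it->mnToken : -1;
}

// Usage: a known name and a name that is missing from the table.
void checkLookups()
{
    assert( lookupToken( "par", 3 ) == 3 );
    assert( lookupToken( "parx", 4 ) == -1 );
}
```

A perfect hash avoids the logarithmic search of this sketch, but the interface seen by the parser stays the same: bytes in, integer token out.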

Pseudo code to parse an xml stream

// element and attribute tokens must be < 0x10000, namespace ids use the bits above
const sal_Int32 NMSP_SAMPLE_1 = (1 << 16);
const sal_Int32 NMSP_SAMPLE_2 = (2 << 16);
 
class MyContext : public XFastContextHandler
{
	// XFastContextHandler
    void startFastElement( sal_Int32 Element, const Reference< XFastAttributeList >& Attribs )
	{
		if( Element == (NMSP_SAMPLE_1|XML_sampleelement) ) // parentheses needed: == binds tighter than |
		{
			maName = Attribs->getOptionalValue( NMSP_SAMPLE_2|XML_name );
		}
	}
 
    void endFastElement( sal_Int32 Element ) {}
 
    void startUnknownElement( const OUString& Namespace, const OUString& Name, const Reference< XFastAttributeList >& Attribs ) {}
    void endUnknownElement( const OUString& Namespace, const OUString& Name ) {}
 
 
    Reference< XFastContextHandler > createFastChildContext( sal_Int32 Element, const Reference< XFastAttributeList >& Attribs )
	{
		if( Element == (NMSP_SAMPLE_1|XML_sampleelement) )
			return this;
		else
			return 0;
	}
 
    Reference< XFastContextHandler > createUnknownChildContext( const OUString& Namespace, const OUString& Name, const Reference< XFastAttributeList >& Attribs )
	{
		return 0;
	}
 
	void characters( const OUString& aChars ) {}
};
 
class MyHandler : public XFastDocumentHandler
{
	// XFastDocumentHandler
    void startDocument()
	{
		// parsing one xml file is started
	}
 
    void endDocument()
	{
		// parsing one xml file is finished
	}
 
    void SAL_CALL setDocumentLocator( const Reference< XLocator >& xLocator ) {}
 
	// XFastContextHandler
    void startFastElement( sal_Int32 Element, const Reference< XFastAttributeList >& Attribs )
	{
	}
 
    void startUnknownElement( const OUString& Namespace, const OUString& Name, const Reference< XFastAttributeList >& Attribs )
	{
	}
 
    void endFastElement( sal_Int32 Element )
	{
	}
 
    void endUnknownElement( const OUString& Namespace, const OUString& Name )
	{
	}
 
    Reference< XFastContextHandler > createFastChildContext( sal_Int32 Element, const Reference< XFastAttributeList >& Attribs )
	{
		if( Element == (NMSP_SAMPLE_1|XML_sampleelement) )
		{
			return new MyContext; // pipe sax events for this element to my context
		}
		return 0; // ignore sax events for this element and all child elements
	}
 
    Reference< XFastContextHandler > createUnknownChildContext( const OUString& Namespace, const OUString& Name, const Reference< XFastAttributeList >& Attribs )
	{
		// this is called for elements with an unknown namespace or elements which are unknown to the token handler
		// so in general they can be ignored as we don't know anything about them
		return 0;
	}
 
	void characters( const OUString& aChars ) {}
};
 
void parse( const Reference< XInputStream >& xInputStream, const OUString& sSystemId )
{
	xml::sax::InputSource aParserInput;
	aParserInput.sSystemId = sSystemId;
	aParserInput.aInputStream = xInputStream;
 
	Reference< XFastDocumentHandler > xHandler( new MyHandler );
 
	// can be reused for multiple xml file parsings
	// see xmlfilter02/oox/source/token for possible implementation
	Reference< XFastTokenHandler > xTokenHandler( new MyTokenHandler );
 
	Reference< XFastParser > xParser( mxServiceFactory->createInstance( "com.sun.star.xml.sax.FastParser" ), UNO_QUERY_THROW );
	xParser->setFastDocumentHandler( xHandler );
	xParser->setTokenHandler( xTokenHandler );
	xParser->registerNamespace( "http://sample_namespace_1", NMSP_SAMPLE_1 );
	xParser->registerNamespace( "http://sample_namespace_2", NMSP_SAMPLE_2 );
	xParser->parseStream( aParserInput );
}

Migration for XMLOFF project

Since the current major performance bottleneck of the filters inside the XMLOFF project is the string handling, a migration of the filters to the fast sax api should be considered. As this project is already a huge one, a migration where the code is adapted in small pieces may be advisable.

In the first step the class SvXMLImport must be rewritten to use the new sax parser api. The SvXMLImportContext would be derived from the XFastContextHandler interface. The base implementation of its methods could call the old sax methods. Therefore all classes derived from SvXMLImportContext that do not explicitly override the methods from XFastContextHandler will still work as before, until someone migrates them to the new api.
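
The forwarding idea can be sketched in standalone C++. This is not XMLOFF code: ImportContext, LegacyContext, MigratedContext and nameFromToken are hypothetical names, attribute lists are left out, and the token to name mapping is hard coded for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical reverse mapping from a token to a qualified name. A real
// implementation would have to consult the namespace map and the token handler.
std::string nameFromToken( std::int32_t nElement )
{
    switch( nElement )
    {
        case 1:  return "anim:par";
        case 2:  return "smil:node";
        default: return "unknown";
    }
}

// Hypothetical stand-in for SvXMLImportContext.
class ImportContext
{
public:
    virtual ~ImportContext() = default;

    // Old style event, identified by name.
    virtual void StartElement( const std::string& /*rName*/ ) {}

    // New style event, identified by token. The default forwards to the
    // old style event, so classes that are not migrated keep working.
    virtual void startFastElement( std::int32_t nElement )
    {
        StartElement( nameFromToken( nElement ) );
    }
};

// Not migrated yet: only overrides the old style event.
class LegacyContext : public ImportContext
{
public:
    std::string maSeen;
    void StartElement( const std::string& rName ) override { maSeen = rName; }
};

// Already migrated: handles the token directly, no string is created.
class MigratedContext : public ImportContext
{
public:
    std::int32_t mnSeen = 0;
    void startFastElement( std::int32_t nElement ) override { mnSeen = nElement; }
};

// Usage: both kinds of context receive the same new style event.
void checkMigration()
{
    LegacyContext aLegacy;
    MigratedContext aMigrated;
    ImportContext* pContexts[] = { &aLegacy, &aMigrated };
    for( ImportContext* pContext : pContexts )
        pContext->startFastElement( 1 );
    assert( aLegacy.maSeen == "anim:par" );
    assert( aMigrated.mnSeen == 1 );
}
```

Contexts that still go through the forwarding path keep the string overhead of the old api, so the performance gain only arrives class by class as overrides are migrated.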

todo, write more details

Todo List

Design an interface for an xml writer using tokens

The current solution only allows implementing an xml import filter. A writer component and interfaces would be needed to also support export filters.

Implement a connector that feeds an XFastDocumentHandler

There are already some implementations that do not use xml streams but feed the old sax api directly. It is very easy to add the old com.sun.star.xml.sax.XDocumentHandler interface to the FastParser service. A filter implemented with the new fast sax api could then be fed with events by a component that uses the old sax api.
