An experiment with incremental saving in Calc
For a while there has been the idea of speeding up the saving of small changes to large spreadsheet documents by generating XML elements only for the changed cells.
I did a little experiment to find out how much time this can really save. It's not a complete implementation, just a test to see what's possible. This is what I changed:
- Loading of the file is unchanged. No additional information is kept in memory after the file is loaded.
- What I do keep in memory is a list of changes, as long as only supported changes were made. For this test, only input of numbers into cells that weren't empty before is supported. If any other change is made, the list is discarded and the normal save code is used. For simplicity, it is assumed that the input doesn't change the cell's format. Changed formula results are ignored (in a real implementation, they would have to be tracked, too).
- When the file is saved and the content.xml stream is generated, the old content.xml is first opened and parsed for the positions of the affected cell elements (see below for details). Then the output is generated by copying the unaffected parts and inserting a new simple cell element in place of each affected cell.
- The styles.xml stream is copied in unchanged form.
To find the affected cell elements in the XML stream quickly, expat's C interface is used, searching for the qualified element names table:table, table:table-row, table:table-cell and table:covered-table-cell. Because this is just an optimization, I can assume that our own namespace prefixes are used. Parsing is stopped once the last affected cell has been found.
For testing, I used a large, simple file with 500000 cells, only text and numbers.
In CPU time (measured with callgrind), normal saving takes 20.1 billion cycles. Incremental saving with three values changed at the top of the file takes 6.3 billion cycles (about 31% of normal), and with three values changed at the bottom it takes 11.4 billion cycles (about 57%). Note that a real implementation would probably need some additional checks that might increase the time again.
In exchange for these saved CPU cycles, there is a bit of additional file access to read the old streams again. This isn't much, because the old streams can be read in their compressed form. Reliably measuring the total time isn't easy, but quick tests with a larger file show the time for incremental saving to be around 50% of normal for changes at the top, and 70% for changes at the bottom. On other machines, this might be totally different.
As mentioned above, the immediate goal of this was only to get numbers on what is possible. Before making a real implementation of this, we should see how far we can get by improving normal saving, because that would benefit all use cases, not only specific, limited modifications.
For reference, the code of this experiment is in the CWS "calcincsave".