Uno/Binary/Analysis/String Performance
author: Kay Ramme
date: 09/13/2005
type: analysis
version: 1.2
Some String Performance Thoughts
Motivation
While doing some performance investigations into OOo document load times, different profilers showed that a significant amount of time was spent in memory de-/allocation and interlock counter in-/decrements. A lot of these invocations were done on behalf of the OOo string implementations. These string implementations are purely heap based and allocate all strings with a length > 0 on the heap.
This document should give a first hint about how to calculate the memory and time consumption of a particular string implementation. Being able to calculate these e.g. allows one to judge the quality of current filter implementations with respect to strings, and may give a hint about excessive string usage.
Summary
The shared-heap-string-with-a-buffer performance is (much) better than the shared-heap-string performance, but needs slightly more memory, depending on the usage case.
The first thoughts / investigations below are probably not yet enough to really judge the potential advantages of the shared-heap-string-with-a-buffer over the shared-heap-string, but may already give some hints for further investigations.
Unfortunately, changing / extending OOo's string implementations would be an ABI incompatible change, so doing this should be well motivated.
Preface
While some simple optimizations, such as not acquiring / releasing the statically allocated empty string, can be done immediately, further thought seems to be necessary. Obviously, different aspects of strings can be optimized, partly with competing results, depending on the usage scenarios and the optimization focus (space vs. time). Some candidates are:
- Acquire heap blocks for unique strings only.
- Reduce heap de- / allocation calls.
- Reduce acquire / release calls.
- Improve string manipulation operations, such as +=.
This document aims to give ideas about time optimizations, as these seem to be the most pressing, regarding e.g. the current document load times.
While loading a big spreadsheet with integer values only, a lot of time is spent in heap de-/allocation and in in-/decrementing interlock counts. These functions are called by string construction and destruction operations; therefore this document only investigates the construction and destruction of strings.
Examined Operations
Examined operations are
- string construction and
- string destruction,
as these are the most costly string operations in a typical usage scenario, e.g. loading spreadsheet documents. Construction and destruction of empty strings have been ignored for simplicity reasons. Such operations can be implemented to execute in nearly zero time, compared to the non-empty string operations.
Examined Strings
This document only investigates the original shared-heap-string, as currently implemented and used in OOo, and a variant of this string, the shared-heap-string-with-a-buffer, which seems to be promising with respect to the examined operations (construction and destruction).
Parameters
The time relevant parameters for construction and destruction are
- heap allocation times,
- heap de-allocation times,
- acquire times,
- release times and the
- character copy times.
The space relevant parameters are the
- shared sequence size,
- character sequence length,
- string size and the
- character size.
The Shared-Heap-String
The shared-heap-string, as OOo's rtl::OUString, allocates every character sequence on the heap, while sharing these sequences in case of copy operations. The lifetime of the shared sequences is controlled by reference counting. In a multi-threaded environment, this reference counting must be implemented in a thread-safe manner. The thread-safe OOo reference counter used is the oslInterlockedCount.
{{{Uno/Note}}} As the shared-heap-string always does heap operations for creating / deleting shared sequences, it can become a major bottleneck for multi-threaded applications, as heap de-/allocations typically acquire a global mutex before performing their underlying function.
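For illustration, the following is a minimal sketch of this layout, loosely modelled after OOo's rtl_uString / rtl::OUString; the type and field names are illustrative only, not the actual OOo declarations.

/* A ref-counted character sequence on the heap, shared by all string
   objects that are copies of each other. */
typedef signed long oslInterlockedCount; /* thread-safe reference counter */
typedef unsigned short sal_Unicode;      /* one 2 byte character          */

typedef struct SharedSequence
{
    oslInterlockedCount refCount;  /* number of strings sharing this sequence */
    long                length;    /* number of characters in buffer          */
    sal_Unicode         buffer[1]; /* the characters follow inline            */
} SharedSequence;                  /* 4 + 4 + 2 = 10 bytes, the
                                      shared_sequence_size used below         */

/* The string object itself is just one pointer (4 bytes on a 32 bit
   platform, the string_size used below). */
typedef struct SharedHeapString
{
    SharedSequence * pSequence;
} SharedHeapString;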
Construction
Constructing a shared-heap-string means to
- acquire a shared sequence, in case the source is a shared-heap-string, or to
- allocate from the heap and to copy a number of characters, in case the source is a literal.
Destruction
Destructing a shared-heap-string means to
- release the shared sequence, in case this was not the last reference, or to
- deallocate the shared sequence, in case this was the last reference.
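As a sketch only, using the illustrative types from above and the interlocked-count functions shown in the appendix (plain malloc / free stand in for the real OOo allocator), the construction and destruction paths just described could look like this:

#include <stdlib.h>
#include <string.h>

/* construct from another shared-heap-string: share its sequence (acquire) */
static void construct_from_string(SharedHeapString * pThis,
                                  SharedHeapString const * pSource)
{
    pThis->pSequence = pSource->pSequence;
    osl_incrementInterlockedCount(&pThis->pSequence->refCount);
}

/* construct from a literal: allocate a sequence and copy the characters */
static void construct_from_literal(SharedHeapString * pThis,
                                   sal_Unicode const * pChars, long nLength)
{
    pThis->pSequence = malloc(sizeof (SharedSequence)
                              + (nLength - 1) * sizeof (sal_Unicode));
    pThis->pSequence->refCount = 1;
    pThis->pSequence->length = nLength;
    memcpy(pThis->pSequence->buffer, pChars, nLength * sizeof (sal_Unicode));
}

/* destruct: release the sequence, deallocate it on the last reference */
static void destruct(SharedHeapString * pThis)
{
    if (osl_decrementInterlockedCount(&pThis->pSequence->refCount) == 0)
        free(pThis->pSequence);
}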
Parameters
Usage dependent parameters are the
- share ratio, the fraction of constructions that copy (and thus share) an existing string, as opposed to constructing from a literal, and the
- string length distribution.
Costs
Time:
- destruction_time := heap_de-allocation_time * (1 - share_ratio) + release_time * share_ratio
- construction_time := acquire_time * share_ratio + (1 - share_ratio) * (heap_allocation_time + character_copy_time * string_length)
Space:
- memory_size := string_size + (1 - share_ratio) * (string_length * character_size + shared_sequence_size)
The Shared-Heap-String-with-a-Buffer
The shared-heap-string-with-a-buffer reserves some of the string's structural size for buffering characters, instead of always allocating memory from the heap. The shared-heap-string-with-a-buffer performance degenerates to the shared-heap-string performance in case of the buffer being of size 0.
The shared-heap-string-with-a-buffer is expected to give better performance with respect to time and concurrency, especially in scenarios where a lot of small temporary strings are used, for example when parsing / loading XML documents.
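One possible layout, again only a sketch with illustrative names: the string object carries a small inline character buffer and falls back to a shared heap sequence only when the content does not fit. Whether the content is inline is encoded here by pSequence being NULL; a real implementation may choose a different encoding.

#define BUFFER_LENGTH 16 /* illustrative; the buffer length used in the
                            spreadsheet scenario below                    */

typedef struct BufferedSharedHeapString
{
    SharedSequence * pSequence;             /* NULL: content lives in buffer,
                                               otherwise in the shared,
                                               ref-counted heap sequence    */
    sal_Unicode      buffer[BUFFER_LENGTH]; /* inline storage for short,
                                               zero terminated strings      */
} BufferedSharedHeapString;                 /* string_size + buffer_size    */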
Construction
Instead of being constant in time, the construction operation depends on the source string's length, in case it fits into the buffer; in case the source string is longer than the buffer, the operation is the same as the shared-heap-string construction operation.
{{{Uno/Note}}} The shared-heap-string-with-a-buffer construction operation is guaranteed to be cheaper than the pure heap-string construction operation, if the buffer size is selected in a way that the character_copy_time * string_length operation is cheaper than the acquire operation.
Constructing a shared-heap-string-with-a-buffer means to
- acquire a shared sequence, in case the source is a shared-heap-string-with-a-buffer and its length >= the buffer length, or to
- copy a number of characters, in case the source length < the buffer length, or to
- allocate from the heap and to copy a number of characters, in case the source is a literal whose length >= the buffer length.
Destruction
Destructing a shared-heap-string-with-a-buffer means to
- do nothing, in case the string's length < the buffer length, or,
- in case the string's length >= the buffer length, to
- release the shared sequence, in case this was not the last reference, or to
- deallocate the shared sequence, in case this was the last reference.
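Again as a sketch only, based on the layout assumed above and on the previous sketches (only the literal-construction and the destruction path are shown):

/* construct from a literal: copy into the inline buffer if it fits,
   otherwise behave like the plain shared-heap-string */
static void buffered_construct_from_literal(BufferedSharedHeapString * pThis,
                                             sal_Unicode const * pChars,
                                             long nLength)
{
    if (nLength < BUFFER_LENGTH)
    {
        pThis->pSequence = NULL; /* content is inline, no heap operation */
        memcpy(pThis->buffer, pChars, nLength * sizeof (sal_Unicode));
        pThis->buffer[nLength] = 0;
    }
    else
    {
        pThis->pSequence = malloc(sizeof (SharedSequence)
                                  + (nLength - 1) * sizeof (sal_Unicode));
        pThis->pSequence->refCount = 1;
        pThis->pSequence->length = nLength;
        memcpy(pThis->pSequence->buffer, pChars,
               nLength * sizeof (sal_Unicode));
    }
}

/* destruct: nothing to do for inline content, otherwise release the shared
   sequence and deallocate it on the last reference */
static void buffered_destruct(BufferedSharedHeapString * pThis)
{
    if (pThis->pSequence != NULL
        && osl_decrementInterlockedCount(&pThis->pSequence->refCount) == 0)
        free(pThis->pSequence);
}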
Parameters
Usage dependent parameters are the
- share ratio, the fraction of constructions that copy (and thus share) an existing string, as opposed to constructing from a literal, the
- string length distribution, and the
- buffer size.
Costs
Time:
- destruction_time := probability(sequence >= buffer) * heap-string-destruction_time
- construction_time := probability(sequence >= buffer) * heap-string-construction_time + probability(sequence < buffer) * char_copy_time * sequence_length
Space:
- memory_size := string_size + buffer_size + probability(sequence >= buffer) * (1 - share_ratio) * (sequence_length * character_size + shared_sequence_size)
Comparison
Independent Parameters
To find out the relationships of the different operations, I wrote a short test program (see appendix for details). This program gave me some numbers to start with:
localhost:~> ./perf_test.bin
fun calls 256 - average spendings: abs 0.000009ms proc 0.000000ms 326insts
malloc 256 - average spendings: abs 0.000008ms proc 0.000001ms 519insts
free 256 - average spendings: abs 0.000089ms proc 0.000001ms 465insts
acquire 256 - average spendings: abs 0.000008ms proc 0.000001ms 338insts
release 256 - average spendings: abs 0.000008ms proc 0.000001ms 338insts
copied characters 1024 - average spendings: abs 0.000001ms proc 0.000000ms 35insts
Any suggestions / improvements for measuring the execution are certainly welcome. For precision, I used the determined numbers of instructions (insts) for the further calculations.
Defining the costs of the operations gives
- heap_allocation_time = 519 insts,
- heap_de-allocation_time = 465 insts,
- acquire_time = 338 insts,
- release_time = 338 insts and
- character_copy_time = 35 insts.
String Parameters
The space relevant parameters for both strings are the
- shared_sequence_size == sizeof(rtl_uString) == 10 bytes,
- string_size == sizeof(rtl::OUString) == 4 bytes and the
- character_size == sizeof(sal_Unicode) == 2 bytes.
And additionally, the buffer size in case of a shared-heap-string-with-a-buffer
- buffer_size = character_size * buffer_length.
Scenario: Loading a Spreadsheet
A Calc document with 400 000 numeric cells creates many strings during loading. A cell stored in the document typically looks like this:
<table:table-cell office:value-type="float" office:value="10">
  <text:p>10</text:p>
</table:table-cell>
The loading procedure in general does the following:
- the XML file gets parsed; about 9 strings (for every tag, attribute and value) get created for every cell,
- the strings are passed via callbacks to the Calc engine,
- the Calc engine converts the content string into a number,
- the strings get destructed.
For the above cell description, no sharing of strings is possible, as the only duplicated string (“10”) appears in different contexts (attribute value and element content).
The performance parameters for this scenario are:
- share_ratio = 0
- average string length for the above cell description
- length(“table:table-cell”) = 16c
- length(“office:value-type”) = 17c
- length(“office:value”) = 12c
- length(“text:p”) = 6c
- = (16+17+12+6)c / 4 = 12.75c
- concurrency (max. number of concurrent instances) = 3 (one tag + two attributes)
Obviously, a spreadsheet with only numeric data should not do any heap allocations while loading, except for constructing the spreadsheet model.
Time and space performance depend on the number of cells to be loaded and on the concurrency (a small calculation sketch follows the formulas below).
- Time Costs := concurrency * 400000 * (construction_time + destruction_time)
- Space Costs := concurrency * (string_size + average_sequence_length * character_size + shared_sequence_size)
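To make the following hand calculations easy to verify, here is a small throwaway program that plugs the measured instruction counts into the cost formulas from above; it is only a sketch of the cost model, not part of the measurement program in the appendix.

#include <stdio.h>

int main(void)
{
    /* measured operation costs, in instructions */
    double heap_allocation   = 519;
    double heap_deallocation = 465;
    double acquire           = 338;
    double release           = 338;
    double character_copy    = 35;

    /* scenario parameters */
    double cells       = 400000;
    double concurrency = 3;
    double share_ratio = 0;     /* no sharing possible in this scenario */
    double length      = 12.75; /* average string length in characters  */
    double p_heap      = 0.5;   /* probability(sequence >= buffer)      */

    /* shared-heap-string construction / destruction costs per string */
    double construction = acquire * share_ratio
        + (1 - share_ratio) * (heap_allocation + character_copy * length);
    double destruction = heap_deallocation * (1 - share_ratio)
        + release * share_ratio;

    /* shared-heap-string-with-a-buffer costs per string */
    double b_construction = p_heap * construction
        + (1 - p_heap) * character_copy * length;
    double b_destruction = p_heap * destruction;

    printf("shared-heap-string:               %.0f insts\n",
           concurrency * cells * (construction + destruction));
    printf("shared-heap-string-with-a-buffer: %.0f insts\n",
           concurrency * cells * (b_construction + b_destruction));
    return 0;
}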
Shared-Heap-String
Time:
- concurrency * 400000 * (construction_time + destruction_time) :=
- destruction_time
== 3 * 400000 * (heap_de-allocation_time * (1 - share_ratio) + release_time * share_ratio)
== 3 * 400000 * 465insts * 1
== 558000000 insts
- + construction_time
== 3 * 400000 * (acquire_time * share_ratio + (1 - share_ratio) * (heap_allocation_time + character_copy_time * sequence_length))
== 3 * 400000 * (519insts + 35insts * 12.75)
== 1158300000 insts
- == 1716300000 insts
Space:
- concurrency * shared-heap-string_size
== concurrency * (string_size + (1 - share_ratio) * (sequence_length * character_size + shared_sequence_size))
== 3 * (4 bytes + 12.75 * 2 bytes + 10 bytes)
== 118.5 bytes
Shared-Heap-String-with-a-Buffer
Additional parameters for the shared-heap-string-with-a-buffer are the
- buffer_length == 16c (i.e. buffer_size == 32 bytes),
- probability(sequence >= buffer) == 50%,
- probability(sequence < buffer) == 50%.
Time:
- concurrency * 400000 * (construction_time + destruction_time) :=
- destruction_time
== 3 * 400000 * probability(sequence >= buffer) * heap-string-destruction_time
== 3 * 400000 * 0.5 * heap-string-destruction_time
== 3 * 400000 * 0.5 * (heap_de-allocation_time * (1 - share_ratio) + release_time * share_ratio)
== 3 * 400000 * 0.5 * heap_de-allocation_time
== 3 * 400000 * 0.5 * 465insts
== 279000000 insts
- + construction_time
== 3 * 400000 * (probability(sequence >= buffer) * heap-string-construction_time + probability(sequence < buffer) * char_copy_time * sequence_length)
== 3 * 400000 * (0.5 * heap-string-construction_time + 0.5 * char_copy_time * sequence_length)
== 3 * 400000 * (0.5 * (acquire_time * share_ratio + (1 - share_ratio) * (heap_allocation_time + character_copy_time * sequence_length)) + 0.5 * char_copy_time * sequence_length)
== 3 * 400000 * (0.5 * (heap_allocation_time + character_copy_time * sequence_length) + 0.5 * char_copy_time * sequence_length)
== 3 * 400000 * (0.5 * (519insts + 35insts * 12.75) + 0.5 * 35insts * 12.75)
== 846900000 insts
- == 1125900000 insts
Average Space:
- concurrency * shared-heap-string-with-a-buffer_size
== concurrency * (string_size + buffer_size + probability(sequence >= buffer) * (1 - share_ratio) * (sequence_length * character_size + shared_sequence_size))
== 3 * (string_size + buffer_size + probability(sequence >= buffer) * (1 - share_ratio) * (sequence_length * character_size + shared_sequence_size))
== 3 * (4 bytes + 16 * character_size + 0.5 * 1 * (12.75 * character_size + 10 bytes))
== 3 * (4 bytes + 16 * 2 bytes + 0.5 * 1 * (12.75 * 2 bytes + 10 bytes))
== 161.25 bytes
Conclusion
Comparing the heap-string-with-a-buffer with the heap-string in the above scenario shows a time advantage of about 35%, which in absolute numbers is about 590 million insts, and a space disadvantage of about 36%, which in absolute numbers is just about 43 bytes.
Appendix
To compile the following program, put it into a file called “perf_test.c” and invoke
gcc -I/opt/papi/include/ perf_test.c -L/opt/papi/lib/ -lpapi -lperfctr -o perf_test.bin
Obviously, this runs on Linux x86 only.
/*
** this short program uses the Intel Pentium performance counters,
** which can be accessed through the PAPI project
** (http://icl.cs.utk.edu/papi/software/index.html)
*/
#include <stdlib.h>
#include <stdio.h>
#include <papi.h>
#include <string.h>
static void report(char const * title, int calls, float rtime, float ptime, long_long ins) {
printf("%20.20s %4i - average spendings: abs %fms proc %fms %5liinsts\n",
title,
calls,
rtime / calls,
ptime / calls,
ins / calls);
}
void dummyFunction(void)
{
}
#define FUN_LOOP 256
static void test_fun_call(void)
{
float g_rtime = 0;
float g_ptime = 0;
long_long g_ins = 0;
float g_ipc = 0;
int i = 0;
for (i = 0; i < FUN_LOOP; ++ i)
{
float rtime = 0;
float ptime = 0;
long_long ins = 0;
float ipc = 0;
/* Start counting events */
if (PAPI_ipc(&rtime, &ptime, &ins, &ipc))
abort();
dummyFunction();
/* getting counted events */
if (PAPI_ipc(&rtime, &ptime, &ins, &ipc))
abort();
/* Stop counting events */
if (PAPI_stop_counters(NULL, 0) != PAPI_OK)
abort();
g_rtime += rtime;
g_ptime += ptime;
g_ins += ins;
g_ipc += ipc;
}
report("fun calls", FUN_LOOP, g_rtime, g_ptime, g_ins);
}
#define MAFR_LOOP 256
static void test_malloc_free(void)
{
float g_m_rtime = 0;
float g_m_ptime = 0;
long_long g_m_ins = 0;
float g_m_ipc = 0;
float g_f_rtime = 0;
float g_f_ptime = 0;
long_long g_f_ins = 0;
float g_f_ipc = 0;
int i = 0;
for (i = 0; i < MAFR_LOOP; ++ i)
{
float m_rtime = 0;
float m_ptime = 0;
long_long m_ins = 0;
float m_ipc = 0;
/* Start counting events */
if (PAPI_ipc(&m_rtime, &m_ptime, &m_ins, &m_ipc))
abort();
void * block = malloc(128);
if (PAPI_ipc(&m_rtime, &m_ptime, &m_ins, &m_ipc))
abort();
/* Stop counting events */
if (PAPI_stop_counters(NULL, 0) != PAPI_OK)
abort();
g_m_rtime += m_rtime;
g_m_ptime += m_ptime;
g_m_ins += m_ins;
g_m_ipc += m_ipc;
float f_rtime = 0;
float f_ptime = 0;
long_long f_ins = 0;
float f_ipc = 0;
/* Start counting events */
if (PAPI_ipc(&f_rtime, &f_ptime, &f_ins, &f_ipc))
abort();
free(block);
/* getting counted events */
if (PAPI_ipc(&f_rtime, &f_ptime, &f_ins, &f_ipc))
abort();
/* Stop counting events */
if (PAPI_stop_counters(NULL, 0) != PAPI_OK)
abort();
g_f_rtime += f_rtime;
g_f_ptime += f_ptime;
g_f_ins += f_ins;
g_f_ipc += f_ipc;
}
report("malloc", MAFR_LOOP, g_m_rtime, g_m_ptime, g_m_ins);
report("free", MAFR_LOOP, g_f_rtime, g_f_ptime, g_f_ins);
}
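/*
** thread-safe reference counting primitives for x86, mirroring OOo's
** osl_incrementInterlockedCount / osl_decrementInterlockedCount; these are
** the operations behind the string acquire / release measured above
*/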
typedef signed long oslInterlockedCount;
oslInterlockedCount osl_incrementInterlockedCount(oslInterlockedCount* pCount)
{
oslInterlockedCount nCount;
__asm__ __volatile__ (
"movl $1, %0\n\t"
"lock\n\t"
"xaddl %0, %2\n\t"
"incl %0"
: "=&r" (nCount), "=m" (*pCount)
: "m" (*pCount)
: "memory");
return nCount;
}
oslInterlockedCount osl_decrementInterlockedCount(oslInterlockedCount* pCount)
{
oslInterlockedCount nCount;
__asm__ __volatile__ (
"movl $-1, %0\n\t"
"lock\n\t"
"xaddl %0, %2\n\t"
"decl %0"
: "=&r" (nCount), "=m" (*pCount)
: "m" (*pCount)
: "memory");
return nCount;
}
#define ACRE_LOOP 256
static void test_acquire_release(void)
{
float g_a_rtime = 0;
float g_a_ptime = 0;
long_long g_a_ins = 0;
float g_a_ipc = 0;
float g_r_rtime = 0;
float g_r_ptime = 0;
long_long g_r_ins = 0;
float g_r_ipc = 0;
oslInterlockedCount bla = 0;
int i;
for (i = 0; i < ACRE_LOOP; ++ i)
{
float a_rtime = 0;
float a_ptime = 0;
long_long a_ins = 0;
float a_ipc = 0;
/* Start counting events */
if (PAPI_ipc(&a_rtime, &a_ptime, &a_ins, &a_ipc))
abort();
osl_incrementInterlockedCount(&bla);
/* getting counted events */
if (PAPI_ipc(&a_rtime, &a_ptime, &a_ins, &a_ipc))
abort();
/* Stop counting events */
if (PAPI_stop_counters(NULL, 0) != PAPI_OK)
abort();
g_a_rtime += a_rtime;
g_a_ptime += a_ptime;
g_a_ins += a_ins;
g_a_ipc += a_ipc;
float r_rtime = 0;
float r_ptime = 0;
long_long r_ins = 0;
float r_ipc = 0;
/* Start counting events */
if (PAPI_ipc(&r_rtime, &r_ptime, &r_ins, &r_ipc))
abort();
osl_decrementInterlockedCount(&bla);
/* getting counted events */
if (PAPI_ipc(&r_rtime, &r_ptime, &r_ins, &r_ipc))
abort();
/* Stop counting events */
if (PAPI_stop_counters(NULL, 0) != PAPI_OK)
abort();
g_r_rtime += r_rtime;
g_r_ptime += r_ptime;
g_r_ins += r_ins;
g_r_ipc += r_ipc;
}
report("acquire", ACRE_LOOP, g_a_rtime, g_a_ptime, g_a_ins);
report("release", ACRE_LOOP, g_r_rtime, g_r_ptime, g_r_ins);
}
#define CHARS_TO_COPY 1024
static void test_character_copy(void)
{
float g_rtime = 0;
float g_ptime = 0;
long_long g_ins = 0;
float g_ipc = 0;
int i = 0;
for (i = 0; i < 100; ++ i)
{
float rtime = 0;
float ptime = 0;
long_long ins = 0;
float ipc = 0;
short int buff1[CHARS_TO_COPY];
short int buff2[CHARS_TO_COPY];
/* Start counting events */
if (PAPI_ipc(&rtime, &ptime, &ins, &ipc))
abort();
memcpy(buff2, buff1, CHARS_TO_COPY);
/* getting counted events */
if (PAPI_ipc(&rtime, &ptime, &ins, &ipc))
abort();
/* Stop counting events */
if (PAPI_stop_counters(NULL, 0) != PAPI_OK)
abort();
g_rtime += rtime;
g_ptime += ptime;
g_ins += ins;
g_ipc += ipc;
}
report("copied characters", CHARS_TO_COPY, g_rtime, g_ptime, g_ins);
}
int main()
{
test_fun_call();
test_malloc_free();
test_acquire_release();
test_character_copy();
return 0;
}