Difference between revisions of "Performance/Reorder Symbols For Libraries"
(→Library sharing) |
|||
(13 intermediate revisions by one other user not shown) | |||
Line 1: | Line 1: | ||
+ | {{Performance}} | ||
The [http://wiki.services.openoffice.org/wiki/Performance/OOo31_LibrariesOnStartup#Cold_startup_Writer_.28without_AV_real-time_protection.29 comprehensive analysis] of the cold start up behavior of OpenOffice.org shows that file I/O is the main bottleneck. About 80% of the start up time is spent waiting for data from the disk. Most file I/O depends on library loading. This part describes what can be done to reduce I/O time for loading OpenOffice.org libraries. The main ideas are system independent but the solutions must be system/compiler specific. The following chapters describe in detail how we want to reorder code/data within the libraries. | The [http://wiki.services.openoffice.org/wiki/Performance/OOo31_LibrariesOnStartup#Cold_startup_Writer_.28without_AV_real-time_protection.29 comprehensive analysis] of the cold start up behavior of OpenOffice.org shows that file I/O is the main bottleneck. About 80% of the start up time is spent waiting for data from the disk. Most file I/O depends on library loading. This part describes what can be done to reduce I/O time for loading OpenOffice.org libraries. The main ideas are system independent but the solutions must be system/compiler specific. The following chapters describe in detail how we want to reorder code/data within the libraries. | ||
Line 15: | Line 16: | ||
=== Microsoft Visual Studio 2008 === | === Microsoft Visual Studio 2008 === | ||
− | OpenOffice.org uses the Microsoft Visual Studio 2008 C/C++ compiler suite for the Windows build, called wntmsci12[.pro]. Unfortunately Microsoft discontinued the Working Set Tuner application which was part of the Platform SDK. That application allowed developers to optimize the layout of application libraries. A successor called Smooth Working Set Tool is also not available for download. | + | OpenOffice.org uses the Microsoft Visual Studio 2008 C/C++ compiler suite for the Windows build, called wntmsci12[.pro]. Unfortunately Microsoft discontinued the Working Set Tuner application which was part of the Platform SDK. That application allowed developers to optimize the layout of application libraries. A successor called [http://msdn.microsoft.com/en-us/magazine/cc301382.aspx Smooth Working Set Tool] is also not available for download. |
So we have to look for a solution on our own. This has also the big advantage that the solution can be adapted to our needs. What options are available to support us reordering code/data in libraries? If you start the C/C++ compiler and linker with the help option you can see all supported options. The following section shows the options which can help us. | So we have to look for a solution on our own. This has also the big advantage that the solution can be adapted to our needs. What options are available to support us reordering code/data in libraries? If you start the C/C++ compiler and linker with the help option you can see all supported options. The following section shows the options which can help us. | ||
Line 327: | Line 328: | ||
There are some problems with the ORDER file and the linker. | There are some problems with the ORDER file and the linker. | ||
− | {{ | + | {{Warn|The linker crashes reproducibly if very long symbols are inside the ORDER file. A symbol with 1670 character length works, a symbol with 3345 chars results in a crash. It looks like the linker works with a predefined buffer size for the symbols in the ORDER file. It must be verified if the linker can order a symbol if it uses fewer characters. }} |
− | {{ | + | {{Warn|The compiler uses a random number for every type that is declared in an anonymous or counted namespace. This number is newly created for every new compile process. Therefore these symbols cannot be used in ORDER files as the trace code needs an instrumented build which has its own random numbers. }} |
=== Results === | === Results === | ||
Line 477: | Line 478: | ||
===Cold start up performance of OpenOffice.org 3.1 (DEV300m40)=== | ===Cold start up performance of OpenOffice.org 3.1 (DEV300m40)=== | ||
− | I made some start up tests with a standard, a optimized version (just have | + | I made some start up tests with a standard, a optimized version (just have 13 optimized/reordered symbols libraries) and one without rebased libraries (means every library has the same virtual base address so Windows needs to make relocations). Yuan Cheng from IBM reported that non-rebased libraries can boost cold start up performance significantly. |
Test machine: | Test machine: | ||
Line 539: | Line 540: | ||
|| Total CPU Time || 1,653s || 1,747s || +6% (average values) | || Total CPU Time || 1,653s || 1,747s || +6% (average values) | ||
|- | |- | ||
− | || Increase of page file usage || 27 MB || 61 MB || 226% | + | || Increase of page file usage || 27 MB || 61 MB || +226% |
|} | |} | ||
Line 589: | Line 590: | ||
|| 0x10000000||sal3.dll || sal3.dll || sal3.dll || sal3.dll | || 0x10000000||sal3.dll || sal3.dll || sal3.dll || sal3.dll | ||
|} | |} | ||
+ | |||
+ | Unfortunately Windows is not able to share the text part of a library between processes even if the text part is loaded at the same virtual address. If we look deeper into the Vmmap output you can see the following details for the vclmi.dll. | ||
+ | |||
+ | OpenOffice.org non-rebased vclmi.dll | ||
+ | |||
+ | {|width="80%" border="1" cellpadding="2" | ||
+ | !width="8%"|Virtual Address | ||
+ | !width="8%"|Type | ||
+ | !width="8%"|Size | ||
+ | !width="8%"|Committed | ||
+ | !width="8%"|Total WS | ||
+ | !width="8%"|Private WS | ||
+ | !width="8%"|Shareable WS | ||
+ | !width="8%"|Protection | ||
+ | !width="30%"|Details | ||
+ | |- | ||
+ | || 0x02330000||Image||3.016K||3.016K||2.624K|| 2.108K|| 516K|| Execute/Copy on Write||C:\Program Files\OpenOffice.org 3\Basis\program\vclmi.dll | ||
+ | |- | ||
+ | || 0x02330000||Image||4K||4K||4K|| ||4K||Read||Header | ||
+ | |- | ||
+ | || 0x02331000||Image||1.792K||1.792K||1.752K||1.752K|| ||Execute/Read||.text | ||
+ | |- | ||
+ | || 0x024F1000||Image||764K||764K||716K||316K||400K||Read||.rdata | ||
+ | |- | ||
+ | || 0x025B0000||Image||32K||32K||32K||32K|| ||Read/Write||.data | ||
+ | |- | ||
+ | || 0x025B8000||Image||272K||272K|| || || ||Copy on write||.data | ||
+ | |- | ||
+ | || 0x025FC000||Image||8K||8K||8K||8K|| ||Read/Write||.data | ||
+ | |- | ||
+ | || 0x025FE000||Image||144K||144K||112K|| ||112K||Read||.rsrc | ||
+ | |- | ||
+ | |} | ||
+ | |||
+ | The section .text is part of the Private WS and therefore not shared between processes. | ||
+ | |||
+ | OpenOffice.org with rebased vclmi.dll | ||
+ | |||
+ | {|width="80%" border="1" cellpadding="2" | ||
+ | !width="8%"|Virtual Address | ||
+ | !width="8%"|Type | ||
+ | !width="8%"|Size | ||
+ | !width="8%"|Committed | ||
+ | !width="8%"|Total WS | ||
+ | !width="8%"|Private WS | ||
+ | !width="8%"|Shareable WS | ||
+ | !width="8%"|Protection | ||
+ | !width="30%"|Details | ||
+ | |- | ||
+ | || 0x56F60000||Image||3.016K||3.016K||1.672K|| 24K|| 1.648K|| Execute/Copy on Write||C:\Program Files\OpenOffice.org 3\Basis\program\vclmi.dll | ||
+ | |- | ||
+ | || 0x56F60000||Image||4K||4K||4K|| ||4K||Read||Header | ||
+ | |- | ||
+ | || 0x56F61000||Image||1.792K||1.792K||1.128K|| ||1.128K||Execute/Read||.text | ||
+ | |- | ||
+ | || 0x57121000||Image||764K||764K||504K||8K||496K||Read||.rdata | ||
+ | |- | ||
+ | || 0x571E0000||Image||4K||4K||4K||4K|| ||Read/Write||.data | ||
+ | |- | ||
+ | || 0x571E1000||Image||12K||12K||4K|| ||4K||Copy on write||.data | ||
+ | |- | ||
+ | || 0x571E4000||Image||4K||4K||4K||4K|| ||Read/Write||.data | ||
+ | |- | ||
+ | || 0x571E5000||Image||284K||284K||4K|| ||4K ||Copy on write||.data | ||
+ | |- | ||
+ | || 0x5722C000||Image||8K||8K||8K||8K || ||Read/Write||.data | ||
+ | |- | ||
+ | || 0x5722E000||Image||144K||144K||12K|| ||12K||Read||.rsrc | ||
+ | |- | ||
+ | |} | ||
+ | |||
+ | === Microsoft Linker switch /SWAPRUN === | ||
+ | |||
+ | The Microsoft linker supports a switch called /SWAPRUN:[NET|CD]. It sets a flag within a library/executable that tells the loader to read the whole image and write it into the swap file for later use. The possible values NET and CD specify when Windows should activate this copy mechanism (NET=network drive, CD=removable drive). This helps applications to run normally (while the system uses on-demand paging) even when the network is down or the CD has been removed. We wanted to see whether this switch could be interesting for our goal of improving library loading on cold start up. Unfortunately Windows clearly distinguishes between normal and network/removable drives. This means the positive effect of synchronous and sequential library loading can only be seen for these drives. | ||
+ | |||
+ | === Microsoft Linker switch /DYNAMICBASE === | ||
+ | |||
+ | Starting with Visual Studio 2005 SP1 the Microsoft linker supports a new flag called /DYNAMICBASE. This flag controls a new security feature introduced for Windows Vista that is called ASLR (Address Space Layout Randomization). You find more information about ASLR here [http://technet.microsoft.com/en-us/magazine/cc162458.aspx http://technet.microsoft.com/en-us/magazine/cc162458.aspx]. Libraries which include this flag cannot be rebased to a certain virtual load address. | ||
== Linux == | == Linux == |
Latest revision as of 20:58, 13 July 2018
The comprehensive analysis of the cold start up behavior of OpenOffice.org shows that file I/O is the main bottleneck. About 80% of the start up time is spent waiting for data from the disk. Most file I/O depends on library loading. This part describes what can be done to reduce I/O time for loading OpenOffice.org libraries. The main ideas are system independent but the solutions must be system/compiler specific. The following chapters describe in detail how we want to reorder code/data within the libraries.
Contents
- 1 Main idea
- 2 System dependent solution
- 2.1 Windows
- 2.1.1 Microsoft Visual Studio 2008
- 2.1.2 Results
- 2.1.3 Example for the negative effect of limited symbol order optimization (fwkmi.dll)
- 2.1.4 Cold start up performance of OpenOffice.org 3.1 (DEV300m40)
- 2.1.5 Non-rebased OpenOffice.org
- 2.1.6 Microsoft Linker switch /SWAPRUN
- 2.1.7 Microsoft Linker switch /DYNAMICBASE
- 2.2 Linux
- 2.3 MacOS X
- 2.4 Solaris
Main idea
Normally the compiler and linker produce a library which consists of many object files. The order of the code/data depends on the strategy of the linker and the layout of the library format. During start up the application libraries are loaded on demand. Depending on the program flow, new code, and therefore new pages, are loaded from disk into memory. Unfortunately the linker doesn't know how the application accesses every library during the start up phase. Therefore the needed code/data is distributed all over the library, which causes many page faults and disk accesses.
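The effect of scattered code can be illustrated with a small sketch. It is not part of the OpenOffice.org tooling; the `CodeRange` type, the function name and the 4096 byte page size are assumptions chosen for illustration. It counts how many distinct pages must be read to execute a set of functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical description of one function inside a library image.
struct CodeRange {
    std::uint32_t offset; // start offset within the library
    std::uint32_t size;   // size in bytes; zero-sized entries are ignored
};

// Count the distinct pages that must be read to execute all given ranges.
// The page size of 4096 bytes is an assumption (typical for 32-bit x86).
std::size_t countTouchedPages(const std::vector<CodeRange>& ranges,
                              std::uint32_t pageSize = 4096)
{
    assert(pageSize > 0);
    std::set<std::uint32_t> pages;
    for (const CodeRange& r : ranges) {
        if (r.size == 0)
            continue;
        const std::uint32_t first = r.offset / pageSize;
        const std::uint32_t last = (r.offset + r.size - 1) / pageSize;
        for (std::uint32_t p = first; p <= last; ++p)
            pages.insert(p);
    }
    return pages.size();
}
```

Three small functions spread over the library touch three pages, while the same three functions placed next to each other fit into a single page. This is the gain that reordering aims for.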
System dependent solution
Windows
This chapter describes the solution for the Windows platform.
Microsoft Visual Studio 2008
OpenOffice.org uses the Microsoft Visual Studio 2008 C/C++ compiler suite for the Windows build, called wntmsci12[.pro]. Unfortunately Microsoft discontinued the Working Set Tuner application which was part of the Platform SDK. That application allowed developers to optimize the layout of application libraries. A successor called Smooth Working Set Tool is also not available for download.
So we have to look for a solution on our own. This has also the big advantage that the solution can be adapted to our needs. What options are available to support us reordering code/data in libraries? If you start the C/C++ compiler and linker with the help option you can see all supported options. The following section shows the options which can help us.
Compiler options
    Microsoft (R) 32-Bit C/C++-Optimizing Compiler Version 15.00.30729.01 for 80x86
    Copyright (C) Microsoft Corporation. All rights reserved.
    ...
    /Gh Enable _penter function call
    /GH Enable _pexit function call
    /Gy Enable Function-Level Linking
    ...

    Microsoft (R) Incremental Linker Version 9.00.30729.01
    Copyright (C) Microsoft Corporation. All rights reserved.

    Syntax: LINK [Options] [Files] [@Commandfile]

    Options:
    ...
    /ORDER:@Filename
    ...
The /Gh compiler option makes the compiler insert a call to a hook function, _penter, at the entry of every function. That means we have to rebuild all libraries that should be instrumented with the /Gh option set. The /GH option (hook on function exit) is useful if we want to measure timing, so we don't need it for recording function calls. The /Gy option is needed so that the linker can reorder symbols. Fortunately this option is set on official OpenOffice.org builds.
Looking at the documentation for the /Gh option Microsoft states that the hook function must be declared as naked. The function must also preserve all register content.
void __declspec(naked) _cdecl _penter( void );
A function declared with the naked attribute doesn't have prolog or epilog code. It enables a developer to write his own custom prolog/epilog code using the inline assembler. The following skeleton can be used for our target to record all function calls during the start up.
    extern "C" void __declspec(naked) _cdecl _penter( void )
    {
        _asm
        {
            push eax
            push ebx
            push ecx
            push edx
            push ebp
            push edi
            push esi
        }

        // TODO: Add code to determine the caller address and provide it
        // to a function which records the call.

        _asm
        {
            pop esi
            pop edi
            pop ebp
            pop edx
            pop ecx
            pop ebx
            pop eax
            ret
        }
    }
What we have to do is to retrieve the address of the caller function. This can be done with a little calculation as the address is on the stack. See the following code.
    extern "C" void __declspec(naked) _cdecl _penter( void )
    {
        _asm
        {
            push eax
            push ebx
            push ecx
            push edx
            push ebp
            push edi
            push esi

            // calculate the pointer to the return address
            // (7 saved registers * 4 bytes = 28 bytes above the stack pointer)
            mov ecx, esp
            add ecx, 28

            // retrieve return address from stack
            mov eax, dword ptr[ecx]

            // subtract 5 bytes as the instruction for call _penter is 5 bytes long
            // on 32-bit machines, e.g. E8 <00 00 00 00>
            sub eax, 5

            // provide the function start address to forwardFunctionCall
            push eax
            call forwardFunctionCall

            // remove the argument again; this assumes forwardFunctionCall uses the
            // default __cdecl convention (omit this line for a __stdcall function)
            add esp, 4

            pop esi
            pop edi
            pop ebp
            pop edx
            pop ecx
            pop ebx
            pop eax
            ret
        }
    }
The implementation of the _penter function provides the start address of the called function to an external function which can be implemented by C++ code.
Determine what functions are called during start up
With the help of the _penter function we are able to get the addresses of the functions which are called during start up. The _penter function calls a second function which can be implemented using C++. The function has to implement the following tasks:
- Determine to which module the address belongs
The order files for the linker must be created for every single library. We also need to know which map file must be checked later. It's not complicated to find the module for an address. The following code retrieves module information from a virtual address. The module information should be cached to minimize the overhead of the hook function; otherwise the runtime of the instrumented code can grow to an unacceptable level.
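    #include <windows.h>
    #include <tlhelp32.h>
    #include <wchar.h>

    // Finds the loaded module that contains pAddress and returns its handle,
    // name and size. Assumes a UNICODE build, so that MODULEENTRY32::szModule
    // is a WCHAR array.
    bool getModuleDataFromAddress(void* pAddress, HMODULE* pModule, WCHAR* pszModuleName, DWORD dwBufSize, DWORD* dwModuleSize)
    {
        // Determine current module name
        bool bFound = false;
        DWORD dwProcessID = GetCurrentProcessId();
        HANDLE hSnapshot = CreateToolhelp32Snapshot( TH32CS_SNAPMODULE, dwProcessID );
        if (hSnapshot != INVALID_HANDLE_VALUE)
        {
            MODULEENTRY32 me32;
            me32.dwSize = sizeof( MODULEENTRY32 );
            if( Module32First( hSnapshot, &me32 ) )
            {
                do
                {
                    BYTE* pModuleBaseAddress = me32.modBaseAddr;
                    DWORD dwSize = me32.modBaseSize;
                    // A module occupies the half-open range [base, base + size)
                    if ( pAddress >= pModuleBaseAddress && pAddress < ( pModuleBaseAddress + dwSize ))
                    {
                        bFound = true;
                        *pModule = me32.hModule;
                        *dwModuleSize = dwSize;
                        wcsncpy( pszModuleName, me32.szModule, dwBufSize);
                        // wcsncpy does not terminate the string if the name fills the buffer
                        if ( dwBufSize > 0 )
                            pszModuleName[dwBufSize - 1] = L'\0';
                        break;
                    }
                }
                while( Module32Next( hSnapshot, &me32 ) );
            }
            CloseHandle(hSnapshot);
        }
        return bFound;
    }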
- Maintain a counter for every module which tags every newly detected function with the current count. This gives us the opportunity to sort the function symbols by their call sequence.
- Maintain an access counter for every function, which gives us a chance to sort the function symbols by their importance.
- The trace code must be able to write the collected information into a trace file which can be processed later.
The implementation of this record function is time critical and should be optimized, as it is called on every function entry.
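The bookkeeping tasks listed above could be sketched as follows. This is a portable illustration and not the code used for the measurements on this page; `TraceRecorder`, `CallRecord` and all member names are hypothetical. Modules are registered once (the cache) and every recorded call updates a per-module sequence number and a per-function access counter.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical record kept for every function seen during start up.
struct CallRecord {
    std::uint32_t firstSeen;   // per-module sequence number of the first call
    std::uint32_t accessCount; // how often the function was entered
};

// Hypothetical recorder with a simple module cache.
class TraceRecorder {
public:
    // Register a module once, e.g. with the data returned by a lookup such
    // as getModuleDataFromAddress, so later calls are served from the cache.
    void addModule(const std::string& name, std::uintptr_t base, std::uintptr_t size)
    {
        assert(size > 0);
        Module m;
        m.name = name;
        m.base = base;
        m.size = size;
        m.nextSequence = 0;
        m_modules.push_back(m);
    }

    // Record one function entry. Returns false if no cached module contains
    // the address (a real tracer would then query the system and cache it).
    bool recordCall(std::uintptr_t address)
    {
        for (Module& m : m_modules) {
            if (address >= m.base && address - m.base < m.size) {
                CallRecord& rec = m.records[address]; // new entries start at {0, 0}
                if (rec.accessCount == 0)
                    rec.firstSeen = ++m.nextSequence; // tag with the current count
                ++rec.accessCount;
                return true;
            }
        }
        return false;
    }

    // Look up the record of a function, or nullptr if it was never called.
    const CallRecord* find(std::uintptr_t address) const
    {
        for (const Module& m : m_modules) {
            auto it = m.records.find(address);
            if (it != m.records.end())
                return &it->second;
        }
        return nullptr;
    }

private:
    struct Module {
        std::string name;
        std::uintptr_t base;
        std::uintptr_t size;
        std::uint32_t nextSequence;
        std::unordered_map<std::uintptr_t, CallRecord> records;
    };
    std::vector<Module> m_modules;
};
```

A linear search over the cached modules keeps the sketch short. A real hook would also need to be thread-safe and would probably use a faster lookup, since it runs on every function entry.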
How to create an ORDER file that is accepted by the linker
There is no real documentation about the decoration scheme Microsoft uses for its C++ compilers. A very comprehensive description can be found on the following Wikipedia page: http://en.wikipedia.org/wiki/Microsoft_Visual_C%2B%2B_Name_Mangling.
A second tool can create an order file from the trace file and the map file. The map file reveals the symbol for an address, and the size of the function can be calculated from it. Depending on the sort algorithm, the order file can then be written for each instrumented module.
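As a rough sketch of that second tool's last step, the following function turns already resolved symbols into the text of an order file (one decorated name per line). The `TracedSymbol` type and the function name are hypothetical. Sorting by first use is only one possible strategy, and the default length limit of 1670 characters is an assumption taken from the longest symbol this page reports as working in an ORDER file.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical result of resolving one traced address via the map file.
struct TracedSymbol {
    std::string decoratedName;
    std::uint32_t firstSeen;   // sequence number from the trace file
    std::uint32_t accessCount; // number of calls from the trace file
    bool isStatic;             // listed under "Static symbols" in the map file
};

// Build the text of an order file: one decorated name per line, sorted by
// first use. Static symbols and over-long names are left out.
std::string buildOrderFile(std::vector<TracedSymbol> symbols,
                           std::size_t maxNameLength = 1670)
{
    std::stable_sort(symbols.begin(), symbols.end(),
        [](const TracedSymbol& a, const TracedSymbol& b) {
            return a.firstSeen < b.firstSeen;
        });

    std::string out;
    for (const TracedSymbol& s : symbols) {
        if (s.isStatic || s.decoratedName.size() > maxNameLength)
            continue; // the linker cannot move or cannot handle these
        out += s.decoratedName;
        out += '\n';
    }
    return out;
}
```

The `accessCount` field is not used by this ordering. It is kept so that a different comparison, for example most frequently called first, can be swapped in.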
Use the map file information to map an address to a symbol
This information must be stored in trace files that can be analyzed by an additional tool. This tool uses the module's map file to determine the symbol for an address, and it can also detect whether the symbol is static. Static symbols cannot be moved by the linker.
Snippet from a typical map file
    splmi

    Timestamp is 49cce429 (Fri Mar 27 15:35:21 2009)

    Preferred load address is 10000000

    Start Length Name Class
    0001:00000000 0001266aH .text CODE
    0001:00012670 00003e58H .text$x CODE
    0001:000164d0 0000010cH .text$yc CODE
    0001:000165e0 000000d3H .text$yd CODE
    0002:00000000 000007c0H .idata$5 DATA
    0002:000007c0 00000004H .CRT$XCA DATA
    0002:000007c4 0000001cH .CRT$XCU DATA
    0002:000007e0 00000004H .CRT$XCZ DATA
    0002:000007e4 00000004H .CRT$XIA DATA
    0002:000007e8 00000004H .CRT$XIAA DATA
    0002:000007ec 00000004H .CRT$XIC DATA
    0002:000007f0 00000004H .CRT$XIZ DATA
    0002:00000800 00001bf8H .rdata DATA
    0002:000023f8 0000004dH .rdata$debug DATA
    0002:00002448 000013d8H .rdata$r DATA
    0002:00003820 0000038cH .rdata$sxdata DATA
    0002:00003bac 00000004H .rtc$IAA DATA
    0002:00003bb0 00000004H .rtc$IZZ DATA
    0002:00003bb4 00000004H .rtc$TAA DATA
    0002:00003bb8 00000004H .rtc$TZZ DATA
    0002:00003bc0 000044acH .xdata$x DATA
    0002:0000806c 0000012cH .idata$2 DATA
    0002:00008198 00000014H .idata$3 DATA
    0002:000081ac 000007c0H .idata$4 DATA
    0002:0000896c 00004fc4H .idata$6 DATA
    0002:0000d930 000000b9H .edata DATA
    0003:00000000 000009e8H .data DATA
    0003:000009e8 00000418H .bss DATA
    0004:00000000 000000ecH .rsrc$01 DATA
    0004:000000f0 00000278H .rsrc$02 DATA

    Address Publics by Value Rva+Base Lib:Object
    0000:00000000 __except_list 00000000 <absolute>
    0000:000000e3 ___safe_se_handler_count 000000e3 <absolute>
    0000:00009876 __ldused 00009876 <absolute>
    0000:00009876 __fltused 00009876 <absolute>
    0000:00000000 ___ImageBase 10000000 <linker-defined>
    0001:00000000 ??0SplashScreen@desktop@@AAE@ABV?$Reference@VXMultiServiceFactory@lang@star@sun@com@@@uno@star@sun@com@@@Z 10001000 f splash.obj
    0001:000002e0 ??1OUString@rtl@@QAE@XZ 100012e0 f i splash.obj
    0001:00000300 ??_GSplashScreen@desktop@@EAEPAXI@Z 10001300 f i splash.obj
    0001:00000300 ??_ESplashScreen@desktop@@EAEPAXI@Z 10001300 f i splash.obj
    0001:00000340 ??1?$WeakImplHelper2@VXStatusIndicator@task@star@sun@com@@VXInitialization@lang@345@@cppu@@UAE@XZ 10001340 f i splash.obj
    0001:00000390 ??1SplashScreen@desktop@@EAE@XZ 10001390 f splash.obj
    0001:000004d0 ?start@SplashScreen@desktop@@UAAXABVOUString@rtl@@J@Z 100014d0 f splash.obj
    0001:000005d0 ??1OGuard@vos@@UAE@XZ 100015d0 f i splash.obj
    0001:00000600 ??_GOGuard@vos@@UAEPAXI@Z 10001600 f i splash.obj
    0001:00000600 ??_EOGuard@vos@@UAEPAXI@Z 10001600 f i splash.obj
    0001:00000650 ?end@SplashScreen@desktop@@UAAXXZ 10001650 f splash.obj
    ...

    entry point at 0001:00012246

    Static symbols
    0001:fffff000 __unwindfunclet$?copy@OUString@rtl@@QBE?AV12@JJ@Z$0 10000000 f cfgfilter.obj
    0001:fffff000 __unwindfunclet$??0Exception@uno@star@sun@com@@QAE@ABVOUString@rtl@@ABV?$Reference@VXInterface@uno@star@sun@com@@@1234@@Z$0 10000000 f migration.obj
    ...
    0001:000164ad __ehhandler$?overrideProperty@CConfigFilter@desktop@@UAAXABVOUString@rtl@@FABVType@uno@star@sun@com@@E@Z 100174ad f cfgfilter.obj
    0001:000164d0 ??__E?_aMutex@SplashScreen@desktop@@0VMutex@osl@@A@@YAXXZ 100174d0 f splash.obj
    0001:00016500 ??__EpServices@@YAXXZ 10017500 f services_spl.obj
    ...
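A tool that reads such a map file has to split every symbol line into its columns. The following is a minimal sketch of that step, assuming the whitespace-separated layout shown in the snippet above; `MapEntry` and `parseMapLine` are hypothetical names. Whether a symbol is static is not visible in the line itself, so a caller would have to remember whether the "Static symbols" heading has already been passed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical parsed form of one symbol line from a linker map file.
struct MapEntry {
    std::uint32_t section;     // e.g. 0x0001
    std::uint32_t offset;      // offset within the section
    std::string symbol;        // decorated name
    std::uint32_t rvaPlusBase; // value of the "Rva+Base" column
};

// Parse a line such as
//   0001:000004d0 ?start@... 100014d0 f splash.obj
// Returns false if the line does not have that shape (headings, "...").
bool parseMapLine(const std::string& line, MapEntry& entry)
{
    std::istringstream in(line);
    std::string address, rva;
    if (!(in >> address >> entry.symbol >> rva))
        return false;

    const std::size_t colon = address.find(':');
    if (colon == std::string::npos)
        return false;

    try {
        entry.section = std::stoul(address.substr(0, colon), nullptr, 16);
        entry.offset = std::stoul(address.substr(colon + 1), nullptr, 16);
        entry.rvaPlusBase = std::stoul(rva, nullptr, 16);
    } catch (const std::exception&) {
        return false; // a column was not a hexadecimal number
    }
    return true;
}
```

The trailing columns (such as `f`, `i` and the object file name) are ignored here because the lookup only needs the symbol and its address.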
Currently the trace function writes the trace file at a defined time. This should be changed so this can be controlled from the application code. The trace file has the following format:
Part from a trace file
    splmi.dll, Baseaddress: 0x03aa0000
    1, 0x03ab74d0, 1
    2, 0x03ab7500, 1
    3, 0x03ab7530, 1
    4, 0x03aa6a20, 1
    5, 0x03aa6d30, 1
    6, 0x03aa6f60, 1
    7, 0x03aa70e0, 1
    8, 0x03aa7150, 2
    9, 0x03aa3d20, 3
    10, 0x03aa6150, 2
    11, 0x03aa6650, 2
    12, 0x03aa60e0, 2
    13, 0x03aa7230, 1
    14, 0x03aa4540, 1
    15, 0x03aa1000, 1
    16, 0x03aa2180, 1
    17, 0x03aa1d80, 6
    18, 0x03aa2080, 12
    19, 0x03aa20f0, 12
    20, 0x03aa2b50, 11
    21, 0x03aa4790, 5
    22, 0x03aa47b0, 5
    ...
The first column is just a sequence number for the first access, the second column is the virtual address of the function and the third column the number of accesses to the function.
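Reading one data line of this format could look like the following sketch. It assumes exactly the three comma-separated columns described above; `TraceEntry` and `parseTraceLine` are hypothetical names, not part of the actual tool.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Hypothetical parsed form of one trace line: "<sequence>, <address>, <count>".
struct TraceEntry {
    unsigned sequence;     // order of the first access
    std::uint32_t address; // virtual address of the function
    unsigned accessCount;  // number of accesses to the function
};

// Returns false for lines that are not data lines, e.g. the module header.
bool parseTraceLine(const char* line, TraceEntry& entry)
{
    unsigned sequence = 0, address = 0, count = 0;
    if (std::sscanf(line, "%u, 0x%x, %u", &sequence, &address, &count) != 3)
        return false;
    entry.sequence = sequence;
    entry.address = address;
    entry.accessCount = count;
    return true;
}
```

The header line with the module name and base address does not start with a number, so it is rejected and has to be handled separately.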
Combine information from the trace and map file to create the order file
Now we have all information necessary to create an order file for an optimized linker run. The symbol can be retrieved from the virtual address written to the trace file and the content of the map file. Let's see how this can be done using the first entry from the trace file snippet above.
    splmi.dll, Baseaddress: 0x03aa0000
    1, 0x03ab74d0, 1

    Calculating the RVA (relative virtual address):
    RVA = 0x03ab74d0 (virtual address) - 0x03aa0000 (virtual module base address)
    RVA = 0x000174d0

    Look up into the map file:
    --------------------------
    The base address for the map symbols is: 0x10000000

    Address Publics by Value Rva+Base Lib:Object
    0001:000164d0 ??__E?_aMutex@SplashScreen@desktop@@0VMutex@osl@@A@@YAXXZ 100174d0 f splash.obj

    RVA = 0x100174d0 (Rva+Base) - 0x10000000 (Base)
    RVA = 0x000174d0

    So we have a hit and the first function (symbol) that is called in the splmi.dll:
    decorated name : ??__E?_aMutex@SplashScreen@desktop@@0VMutex@osl@@A@@YAXXZ
    undecorated name (according to VisualStudio Debugger): dynamic initializer for 'desktop::SplashScreen::_aMutex'

Unfortunately this symbol/function is static, as it's located in the static part of the map. Therefore we cannot use it for the order file, as the linker ignores static symbols.
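The address arithmetic of this lookup is small enough to express directly in code. The function names below are illustrative only. A traced address and a map entry refer to the same function when both yield the same RVA.

```cpp
#include <cassert>
#include <cstdint>

// RVA of a traced function: virtual address minus the base address at which
// the module was actually loaded (first line of the trace file).
std::uint32_t rvaFromTrace(std::uint32_t virtualAddress, std::uint32_t loadedBase)
{
    assert(virtualAddress >= loadedBase);
    return virtualAddress - loadedBase;
}

// RVA of a map file symbol: "Rva+Base" column minus the preferred load
// address stated at the top of the map file.
std::uint32_t rvaFromMap(std::uint32_t rvaPlusBase, std::uint32_t preferredBase)
{
    assert(rvaPlusBase >= preferredBase);
    return rvaPlusBase - preferredBase;
}

// True if the traced address and the map entry denote the same function.
bool isSameFunction(std::uint32_t virtualAddress, std::uint32_t loadedBase,
                    std::uint32_t rvaPlusBase, std::uint32_t preferredBase)
{
    return rvaFromTrace(virtualAddress, loadedBase)
        == rvaFromMap(rvaPlusBase, preferredBase);
}
```

Because only the difference to the base address matters, the lookup works regardless of where Windows actually loaded the module during the traced run.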
Problems
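There are some problems with the ORDER file and the linker:
- The linker crashes reproducibly if very long symbols are inside the ORDER file. A symbol with a length of 1670 characters works, a symbol with 3345 characters results in a crash. It looks as if the linker works with a predefined buffer size for the symbols in the ORDER file. It must still be verified whether the linker can order such a symbol if it is given with fewer characters.
- The compiler uses a random number for every type that is declared in an anonymous or counted namespace. This number is newly created for every compile process. Therefore these symbols cannot be used in ORDER files, as the trace code needs an instrumented build which has its own random numbers.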
Results
First tests on Windows with reordered symbols (all symbols which are needed during start up are sorted to the start of the library) show that page faults can be reduced by up to 40% (measured with Process Monitor). The following table provides numbers for some libraries. These are very early test results with prototype code; hopefully it can be further optimized. There are also some strange results which must be analyzed in more detail.
Test machine
- Windows XP Professional SP3
- Athlon XP 2800+ (2083 MHz)
- 768MB RAM
- Samsung 120GB 3.5" 2MB IDE 7200 Hard Disk Drive
- Prefetch disabled
Module | Page Faults (non-optimized) | Page Faults (optimized) | Time for all Read Operations (non-optimized) | Time for all Read Operations (optimized) | Improvement (Page Faults/Time) |
---|---|---|---|---|---|
swmi.dll | 210 | 125 | 1378ms | 866ms | 37% / 37% |
sfxmi.dll | 110 | 86 | 912ms | 659ms | 22% / 28% |
vclmi.dll | 100 | 87 | 645ms | 525ms | 13% / 18% |
fwkmi.dll | 72 | 68 | 422ms | 544ms | 6% / -29% (*) |
tlmi.dll | 26 | 26 | 158ms | 195ms | 0% / -23% (*) |
svxmi.dll | 247 | 181 | 1208ms | 855ms | 27% / 30% |
svtmi.dll | 94 | 75 | 559ms | 424ms | 21% / 24% |
svlmi.dll | 32 | 31 | 189ms | 239ms | 3% / -26% (*) |
(*) There are some indications why some of the optimized libraries are loaded more slowly. Due to the limited capabilities of the Microsoft linker to order symbols, it's not possible to group together all symbols which are needed during start up. In the end, the sum of distances between all needed pages is higher than for the non-optimized version! Even reducing the number of page faults cannot resolve this problem. The libraries which see a performance hit only see a small reduction of page faults. See the table below, which provides some Performance Monitor data regarding the fwkmi.dll library.
Overall the load time for the first test set (13 optimized libraries) could be reduced by ~20%.
Example for the negative effect of limited symbol order optimization (fwkmi.dll)
Seq. of read operation | Non-optimized: duration of the read operation in sec. | Non-optimized: offset into the library | Non-optimized: number of bytes read | Non-optimized: distance to the read operation before | Optimized: duration of the read operation in sec. | Optimized: offset into the library | Optimized: number of bytes read | Optimized: distance to the read operation before | ||
---|---|---|---|---|---|---|---|---|---|---|
1 | 0,01011690 | 0 | 4096 | 0 | 0,01004480 | 0 | 4096 | 0 | ||
2 | 0,00934770 | 1310720 | 16384 | 1310720 | 0,00920080 | 1310720 | 16384 | 1310720 | ||
3 | 0,00697820 | 1758720 | 1536 | 448000 | 0,00698870 | 1758720 | 1536 | 448000 | ||
4 | 0,00024110 | 1654784 | 16384 | 103936 | 0,00021060 | 1654784 | 16384 | 103936 | ||
5 | 0,00020020 | 1228800 | 16384 | 425984 | 0,00019160 | 1228800 | 16384 | 425984 | ||
6 | 0,00023190 | 1671168 | 16384 | 442368 | 0,00022870 | 1671168 | 16384 | 442368 | ||
7 | 0,00117200 | 1708032 | 16384 | 36864 | 0,00116130 | 1708032 | 16384 | 36864 | ||
8 | 0,00025590 | 1691648 | 16384 | 16384 | 0,00019720 | 1691648 | 16384 | 16384 | ||
9 | 0,00016690 | 1687552 | 4096 | 4096 | 0,00015960 | 1687552 | 4096 | 4096 | ||
10 | 0,00022470 | 1724416 | 14336 | 36864 | 0,00021890 | 1724416 | 14336 | 36864 | ||
11 | 0,00024120 | 1742848 | 15872 | 18432 | 0,00024040 | 1742848 | 15872 | 18432 | ||
12 | 0,00466450 | 971776 | 32768 | 771072 | 0,00467340 | 971776 | 32768 | 771072 | ||
13 | 0,00032770 | 939008 | 32768 | 32768 | 0,00032300 | 939008 | 32768 | 32768 | ||
14 | 0,00242150 | 1197056 | 31744 | 258048 | 0,00241900 | 1197056 | 31744 | 258048 | ||
15 | 0,00016270 | 1738752 | 16384 | 541696 | 0,00015600 | 1738752 | 4096 | 541696 | ||
16 | 0,00024500 | 1245184 | 16384 | 493568 | 0,00022670 | 1245184 | 16384 | 493568 | ||
17 | 0,00023900 | 1261568 | 16384 | 16384 | 0,00023310 | 1261568 | 16384 | 16384 | ||
18 | 0,00024110 | 1277952 | 16384 | 16384 | 0,00022520 | 1277952 | 16384 | 16384 | ||
19 | 0,00037880 | 1294336 | 16384 | 16384 | 0,00040200 | 1294336 | 16384 | 16384 | ||
20 | 0,00498520 | 693248 | 32768 | 601088 | 0,00390660 | 480256 | 32768 | 814080 | ||
21 | 0,00033700 | 103424 | 32768 | 589824 | 0,00063340 | 447488 | 32768 | 32768 | ||
22 | 0,00380090 | 906240 | 32768 | 802816 | 0,02105090 | 771072 | 32768 | 323584 | ||
23 | 0,02214140 | 152576 | 32768 | 753664 | 0,00071310 | 570368 | 32768 | 200704 | ||
24 | 0,00033420 | 62464 | 32768 | 90112 | 0,00425740 | 414720 | 32768 | 155648 | ||
25 | 0,00033180 | 234496 | 32768 | 172032 | 0,00584710 | 29696 | 32768 | 385024 | ||
26 | 0,00034920 | 29696 | 32768 | 204800 | 0,00486590 | 824320 | 32768 | 794624 | ||
27 | 0,00031510 | 644096 | 32768 | 614400 | 0,01030090 | 656384 | 32768 | 167936 | ||
28 | 0,00026840 | 185334 | 32768 | 458762 | 0,00027560 | 381952 | 32768 | 274432 | ||
29 | 0,03073340 | 308224 | 32768 | 122890 | 0,00840460 | 738304 | 32768 | 356352 | ||
30 | 0,00063740 | 275456 | 32768 | 32768 | 0,00947990 | 1384448 | 16384 | 646144 | ||
... | ... | ... | ... | ... | ... | ... | ... | ... | ||
60 | 0,01025040 | 533504 | 12288 | 1072128 | 0,00590920 | 259072 | 32768 | 1223680 | ||
61 | 0,00523330 | 1037312 | 32768 | 503808 | 0,00020940 | 1622016 | 16384 | 1362944 | ||
62 | 0,00657450 | 1449984 | 16384 | 412672 | 0,00041330 | 1147904 | 16384 | 474112 | ||
63 | 0,00022770 | 1523712 | 16384 | 73728 | 0,00071820 | 1585152 | 16384 | 437248 | ||
64 | 0,01006730 | 1335296 | 16384 | 188416 | 0,00584720 | 291840 | 32768 | 1293312 | ||
65 | 0,00023860 | 1351680 | 16384 | 16384 | 0,00833610 | 324608 | 24576 | 32768 | ||
66 | 0,00499920 | 1638400 | 16384 | 286720 | ||||||
67 | 0,00021360 | 1482752 | 16384 | 155648 | ||||||
68 | 0,00021900 | 1622016 | 16384 | 139264 | ||||||
69 | 0,01140220 | 1147904 | 16384 | 474112 | ||||||
70 | 0,00022650 | 1585152 | 16384 | 437248 | ||||||
Sum | 0,25621180 | | 1587200 | 21660692 | 0,26112620 | | 1529856 | 28928000 | ||
Although the optimized version needs 5 fewer read operations and loads 4% less data into memory, the read operations needed about 2% more time. There are other runs on slower test machines where the difference is greater (up to 30%)! The sum of distances between consecutive read operations is 33% higher for the optimized library! That means we are not able to group the symbols together as necessary with the limited Microsoft linker capabilities.
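The "distance" column of the table is the absolute difference between the file offsets of two consecutive read operations. A small sketch of how the sum can be computed from a list of offsets follows; the function name is chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sum of the absolute offset differences between consecutive read operations.
// The first read has no predecessor and contributes a distance of 0.
std::uint64_t sumOfReadDistances(const std::vector<std::uint64_t>& offsets)
{
    std::uint64_t sum = 0;
    for (std::size_t i = 1; i < offsets.size(); ++i) {
        const std::uint64_t previous = offsets[i - 1];
        const std::uint64_t current = offsets[i];
        sum += current > previous ? current - previous : previous - current;
    }
    return sum;
}
```

For the first four reads of the table (offsets 0, 1310720, 1758720 and 1654784) this yields 1310720 + 448000 + 103936 = 1862656 bytes.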
Cold start up performance of OpenOffice.org 3.1 (DEV300m40)
I ran some start up tests with a standard version, an optimized version (with just 13 libraries whose symbols were optimized/reordered) and a version without rebased libraries (which means every library has the same virtual base address, so Windows needs to perform relocations). Yuan Cheng from IBM reported that non-rebased libraries can boost cold start up performance significantly.
Test machine:
- Windows Vista Ultimate 32-Bit
- Opteron 175 (Dual core) 2,2 GHz
- 4 GB RAM
- Deskstar 7K250 160GB 8MB Cache
- (Super Fetch and Prefetch disabled)
Optimization method | Test run 1 | Test run 2 | Test run 3 | Mean time |
---|---|---|---|---|
DEV300m40 (standard) | 16,05s | 16,38s | 16,24s | 16,23s (100%) |
DEV300m40 (reordered symbols) | 15,10s | 15,05s | 15,00s | 15,05s (-7,3%) |
DEV300m40 (not rebased/reordered) | 12,89s | 12,58s | 12,74s | 12,73s (-21,6%) |
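The "Mean time" column can be reproduced with a few lines of code. This is only a sketch of the arithmetic; the function names are illustrative. The percentage is the change of the mean relative to the standard build.

```cpp
#include <cassert>
#include <vector>

// Arithmetic mean of the measured start up times (in seconds).
double meanOf(const std::vector<double>& values)
{
    assert(!values.empty());
    double sum = 0.0;
    for (double v : values)
        sum += v;
    return sum / static_cast<double>(values.size());
}

// Change of `value` relative to `reference` in percent; negative = faster.
double relativeChangePercent(double reference, double value)
{
    assert(reference != 0.0);
    return (value - reference) / reference * 100.0;
}
```

For the reordered build, the three runs of 15,10s, 15,05s and 15,00s give a mean of 15,05s, which is about 7,3% less than the 16,23s of the standard build.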
It looks like reordering cannot provide the same performance boost that you get if you don't rebase the libraries. The positive effect of not rebasing is that most libraries are loaded into memory synchronously and in one piece. Therefore the disk is able to read the file sequentially. This is not possible by optimizing the order of symbols. The Microsoft linker offers only very limited means to influence the order of symbols. Static symbols cannot be moved, and anonymous namespaces, which are often used in OpenOffice.org, make it impossible to move those symbols, too. Symbols above a certain length lead to linker crashes. All these problems limit the possible optimizations, which as a result cannot match the "brute force" solution of not rebasing the libraries for better cold start behavior.
Non-rebased OpenOffice.org
There are some drawbacks to not rebasing libraries. It must be clarified how severe these drawbacks are:
Drawbacks
- Relocating needs some CPU power.
- During start up the application needs more memory. All pages that have code/data which need relocations will be read into memory.
- All pages which need fix-ups must be reloaded from the swap file and not from the image file. That means that non-rebased libraries decrease the amount of available virtual memory. See this web page for more information (although fairly old, it still holds for newer OS versions)
- Contiguous address space could be decreased. See the following web page for more information: Rebasing DLLs on Windows
- Symbol resolving for stack traces could be a problem. You need to know where the library was loaded when a crash occurred.
Findings
Performance and memory data
Some performance (warm start up) and memory values for DEV300m40 for both standard and non-rebased versions.
Attributes | DEV300m40 (standard) | DEV300m40 (non-rebased) | Difference (reference is standard) |
---|---|---|---|
Virtual Size | 283.040 KB | 283.040 KB | 0% |
Working Set (WS) | 55.152 KB | 86.392 KB | +57% |
WS Private | 18.048 KB | 57.204 KB | +217% |
WS Shareable | 37.104 KB | 29.204 KB | -21% |
WS Shared | 7.556 KB | 7.476 KB | -1% |
Total CPU Time | 1,653s | 1,747s | +6% (average values) |
Increase of page file usage | 27 MB | 61 MB | +126% |
Library sharing
Not rebasing libraries has an impact on memory usage. Pages with relocated code/data cannot be shared between processes when the processes don't use the same virtual address for the library. The following tests try to determine how severe this problem is. One would expect that starting the same executable (in this case swriter.exe -env:UserInstallation=<file URL to the user folder>) several times leads to a high degree of library sharing (at least for code and read-only data sections).
The Vmmap application from Microsoft (http://technet.microsoft.com/en-us/sysinternals/dd535533.aspx) shows how the libraries, heap and stack are spread across the process address space. The following table is a small part of the Vmmap output for four processes started one after the other.
Virtual Address | Process 1 | Process 2 | Process 3 | Process 4 |
---|---|---|---|---|
0x00070000 | uwinapi.dll | uwinapi.dll | uwinapi.dll | uwinapi.dll |
0x000A0000 | Shareable | Shareable | sofficeapp.dll | sofficeapp.dll |
0x00130000 | Heap Default(Private) | salhelper3MSC.dll | ||
0x001C0000 | sofficeapi.dll | sofficeapp.dll | ||
0x00220000 | comphelp4MSC.dll | comphelp4MSC.dll | ||
0x00230000 | comphelp4MSC.dll | |||
0x00240000 | comphelp4MSC.dll | |||
0x00310000 | cppuhelper3MSC.dll | cppuhelper3MSC.dll | ||
0x00320000 | cppuhelper3MSC.dll | |||
0x00330000 | cppuhelper3MSC.dll | |||
0x00390000 | salhelper3MSC.dll | salhelper3MSC.dll | salhelper3MSC.dll | |
0x003B0000 | cppu3.dll | cppu3.dll | cppu3.dll | cppu3.dll |
0x00400000 | soffice.bin | soffice.bin | soffice.bin | soffice.bin |
0x01BB0000 | stlport_vc7145.dll | stlport_vc7145.dll | stlport_vc7145.dll | stlport_vc7145.dll |
0x01C50000 | ucbhelper4MSC.dll | ucbhelper4MSC.dll | ucbhelper4MSC.dll | ucbhelper4MSC.dll |
0x01CC0000 | vos3MSC.dll | vos3MSC.dll | vos3MSC.dll | vos3MSC.dll |
0x10000000 | sal3.dll | sal3.dll | sal3.dll | sal3.dll |
Unfortunately, Windows is not able to share the text part of a relocated library between processes even if the text part is loaded at the same virtual address. A closer look at the Vmmap output shows the following details for vclmi.dll.
OpenOffice.org non-rebased vclmi.dll
Virtual Address | Type | Size | Committed | Total WS | Private WS | Shareable WS | Protection | Details |
---|---|---|---|---|---|---|---|---|
0x02330000 | Image | 3.016K | 3.016K | 2.624K | 2.108K | 516K | Execute/Copy on Write | C:\Program Files\OpenOffice.org 3\Basis\program\vclmi.dll |
0x02330000 | Image | 4K | 4K | 4K | | 4K | Read | Header |
0x02331000 | Image | 1.792K | 1.792K | 1.752K | 1.752K | | Execute/Read | .text |
0x024F1000 | Image | 764K | 764K | 716K | 316K | 400K | Read | .rdata |
0x025B0000 | Image | 32K | 32K | 32K | 32K | | Read/Write | .data |
0x025B8000 | Image | 272K | 272K | | | | Copy on write | .data |
0x025FC000 | Image | 8K | 8K | 8K | 8K | | Read/Write | .data |
0x025FE000 | Image | 144K | 144K | 112K | | 112K | Read | .rsrc |
The section .text is part of the Private WS and therefore not shared between processes.
OpenOffice.org with rebased vclmi.dll
Virtual Address | Type | Size | Committed | Total WS | Private WS | Shareable WS | Protection | Details |
---|---|---|---|---|---|---|---|---|
0x56F60000 | Image | 3.016K | 3.016K | 1.672K | 24K | 1.648K | Execute/Copy on Write | C:\Program Files\OpenOffice.org 3\Basis\program\vclmi.dll |
0x56F60000 | Image | 4K | 4K | 4K | | 4K | Read | Header |
0x56F61000 | Image | 1.792K | 1.792K | 1.128K | | 1.128K | Execute/Read | .text |
0x57121000 | Image | 764K | 764K | 504K | 8K | 496K | Read | .rdata |
0x571E0000 | Image | 4K | 4K | 4K | 4K | | Read/Write | .data |
0x571E1000 | Image | 12K | 12K | 4K | | 4K | Copy on write | .data |
0x571E4000 | Image | 4K | 4K | 4K | 4K | | Read/Write | .data |
0x571E5000 | Image | 284K | 284K | 4K | | 4K | Copy on write | .data |
0x5722C000 | Image | 8K | 8K | 8K | 8K | | Read/Write | .data |
0x5722E000 | Image | 144K | 144K | 12K | | 12K | Read | .rsrc |
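The difference between the two variants becomes clearer when the private part of the working set is expressed as a fraction of the total. The following Python sketch uses only the Total WS and Private WS values from the first row of each Vmmap table above. The function name is illustrative.

```python
def private_share(total_ws_kb: int, private_ws_kb: int) -> float:
    """Percentage of the working set that is private to the process."""
    return private_ws_kb / total_ws_kb * 100.0


# (Total WS, Private WS) in KB for vclmi.dll, first row of each table.
vclmi = {
    "non-rebased": (2624, 2108),
    "rebased": (1672, 24),
}

for variant, (total, private) in vclmi.items():
    print(f"{variant}: {private_share(total, private):.1f}% of the working set is private")
```

Roughly four fifths of the non-rebased library's working set is private to each process, whereas the rebased library is almost entirely shareable.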
Microsoft Linker switch /SWAPRUN
The Microsoft linker supports a switch called /SWAPRUN:[NET|CD]. It sets a flag in a library/executable that tells the loader to read the whole image and write it to the swap file for later use. The possible values NET and CD indicate when Windows should activate this copy mechanism (NET = network drive, CD = removable drive). This helps applications to run normally (the system uses on-demand paging) even when the network is down or the CD has been removed. We wanted to see whether this switch could help our goal of improving library loading on cold start up. Unfortunately, Windows strictly distinguishes between normal and network/removable drives, so the positive effect of synchronous and sequential library loading can only be seen for network and removable drives.
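Whether a given library was linked with /SWAPRUN can be checked in its PE header. The following Python sketch reads the Characteristics field of the COFF file header and tests the two bits defined in winnt.h for this purpose. The function name `swaprun_flags` is an illustrative choice, and the sketch does no validation beyond the PE signature.

```python
import struct

IMAGE_FILE_REMOVABLE_RUN_FROM_SWAP = 0x0400  # set by /SWAPRUN:CD
IMAGE_FILE_NET_RUN_FROM_SWAP = 0x0800        # set by /SWAPRUN:NET


def swaprun_flags(image: bytes) -> dict:
    """Report which /SWAPRUN flags are set in the given PE image bytes."""
    # Offset 0x3C of the DOS header holds the offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", image, 0x3C)
    if image[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("not a PE image")
    # Characteristics is the last 16-bit field of the 20-byte COFF header,
    # which starts right after the 4-byte signature.
    (characteristics,) = struct.unpack_from("<H", image, pe_offset + 4 + 18)
    return {
        "CD": bool(characteristics & IMAGE_FILE_REMOVABLE_RUN_FROM_SWAP),
        "NET": bool(characteristics & IMAGE_FILE_NET_RUN_FROM_SWAP),
    }
```

Passing the first few kilobytes of a DLL, for example `swaprun_flags(open(path, "rb").read(4096))`, is normally enough because the headers sit at the start of the file.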
Microsoft Linker switch /DYNAMICBASE
Starting with Visual Studio 2005 SP1, the Microsoft linker supports a new flag called /DYNAMICBASE. This flag controls a security feature introduced with Windows Vista called ASLR (Address Space Layout Randomization). More information about ASLR is available at http://technet.microsoft.com/en-us/magazine/cc162458.aspx. Libraries which include this flag cannot be rebased to a fixed virtual load address.
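To find out whether a library carries the /DYNAMICBASE flag and which preferred load address it has, the PE optional header can be inspected. The following Python sketch reads the ImageBase and DllCharacteristics fields. The constant is the one defined in winnt.h, while the function name `base_info` is only illustrative.

```python
import struct

IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE = 0x0040  # set by /DYNAMICBASE


def base_info(image: bytes) -> tuple:
    """Return (preferred image base, ASLR flag set) for PE image bytes."""
    (pe_offset,) = struct.unpack_from("<I", image, 0x3C)
    if image[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("not a PE image")
    opt = pe_offset + 4 + 20  # optional header follows the COFF header
    (magic,) = struct.unpack_from("<H", image, opt)
    if magic == 0x10B:    # PE32: 32-bit ImageBase at offset 28
        (image_base,) = struct.unpack_from("<I", image, opt + 28)
    elif magic == 0x20B:  # PE32+: 64-bit ImageBase at offset 24
        (image_base,) = struct.unpack_from("<Q", image, opt + 24)
    else:
        raise ValueError("unknown optional header magic")
    # DllCharacteristics sits at offset 70 in both header variants.
    (dll_chars,) = struct.unpack_from("<H", image, opt + 70)
    return image_base, bool(dll_chars & IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE)
```

If the flag is set, the image base stored in the header is only a hint, because the loader picks the actual address at run time.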
Linux
On Linux and Solaris, the soffice wrapper script spawns a helper tool pagein to pre-load the relevant libraries needed during start-up of soffice.bin. This greatly reduces the number of I/O-incurring page faults during start-up of soffice.bin. Currently (i.e., without symbol reordering), it is faster to let pagein pre-load everything than to let the OS demand-load what is actually needed. It would be interesting to see what the numbers would be with symbol reordering and without pagein.
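The idea behind such a pre-loading helper can be sketched in a few lines of Python. This is not the actual pagein implementation, only an illustration of the principle: reading a library file sequentially from start to end pulls its pages into the file cache of the operating system, so that later page faults in soffice.bin can be served from memory. The function name and the chunk size are assumptions.

```python
def page_in(path: str, chunk_size: int = 1 << 20) -> int:
    """Read a file sequentially so its pages end up in the OS file cache.

    Returns the number of bytes read. The data itself is discarded.
    """
    total = 0
    # Unbuffered binary mode: every read goes straight to the OS.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total
```

A wrapper script would call such a function once for every library on its list before starting the main executable.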
MacOS X
The following web page from Apple describes what must be done to reorder code/data of a library to improve locality.
The subject has been proposed to the students in the Education Project Effort. Direct link
Solaris
OpenOffice.org uses the Sun Studio C++ compiler suite for building on both Sparc and x86 CPU systems. The following web page describes what can be done to optimize the code layout of libraries with the Sun Studio C++ compiler suite.