This topic provides an overview of AQTime’s counters. It includes the following sections: General Information, Counter Descriptions and Counter Limitations.

General Information

The Performance and Function Trace profilers can gather different kinds of information about the application. Which characteristic the profiler measures depends on the selected counter. To select a counter, use the Active Counter profiler option.

All counters work for managed and unmanaged code and support 32-bit and 64-bit applications. The following counters are available in the current AQTime version:

- Elapsed Time
- User Time
- User+Kernel Time
- CPU Mispredicted Branches
- CPU Cache Misses
- Context Switches
- 64K Aliasing Conflicts
- Split Load Replays
- Split Store Replays
- Blocked Store Forwards Replays
- Soft Memory Page Faults
- Hard Memory Page Faults
- All Memory Page Faults

Depending on your operating system, installed updates, processor model and so on, you may have some issues when using counters. For more information about these issues and possible workarounds, see Counter Limitations below.

With the help of counters you can not only locate the application routines that perform poorly, but also investigate the reason for the performance issue. For instance, if a function operates slowly, the cause can be inefficient code, poor memory management or a call to a slow system function. By profiling a function with several different counters, you can find out the exact reason for the delay. Recommendations on using counters can be found in the Search for Bottleneck Reasons topic.

Counter Descriptions

Elapsed Time. When you select this counter, the profiler measures the function execution time. The resulting execution time is the time span between two points in time: the entry to and the exit from the routine. This period includes the time spent executing code in user mode, time spent executing code in kernel mode, time spent executing code in other applications, time spent switching between threads, etc. Use this counter to determine how fast your application executes the required actions in real time.

User Time. This counter is also used to time the function execution. It lets you determine the “pure” execution time of your code. That is, the resulting time includes only the time spent executing code in user mode. It does not include time spent executing code in kernel mode, time spent executing other applications or time spent switching between threads. Launching several applications during profiling will not affect this counter, since it ignores the time spent executing other threads and operating system code.

Though the User Time counter times the code execution in user mode only, you may still see slight inconsistencies in profiling results. This happens because the profiled application depends on other processes running in the system. For example, when the CPU switches the context from one thread to another, it may need to update its caches. The time spent on the cache update is added to the execution time of your code.

User+Kernel Time. This counter is similar to User Time. However, profiling results will include the time spent executing your application code as well as the time spent executing the kernel code that was called from your code. The results do not include time spent executing other applications, time spent switching between threads, etc.

Note:

Unlike the User Time and User+Kernel Time counters, Elapsed Time includes the time spent executing code in other threads in the function execution time. What does this mean? The CPU executes several threads concurrently by giving each thread a short period of time for execution. When the period given to the current thread is over, the CPU switches to another thread, executes it for a short period, then switches to the next thread, and so on. Since the time periods are short, the threads seem to run simultaneously.

Suppose there are 40 threads running in the system and one of them is your application’s thread. Imagine that the CPU executed the first few instructions of the FuncA routine in your thread, but then the time period given to your thread ended and the CPU switched to one of the other threads. The CPU will return to your thread and continue executing the FuncA code only after it “goes” through the other 39 threads (assuming that all threads have the same priority). Before FuncA finishes, the CPU may switch the thread context a hundred times.

If you use the Elapsed Time counter, the FuncA time in the profiling results will include the time spent executing other threads (both threads of other applications and other threads of your application). If you use User Time or User+Kernel Time, the profiling results for FuncA will not include this time.

The “non-time” counters work similarly to User Time and User+Kernel Time. For each application routine, they perform measurements “within” the routine’s thread only, and not in other threads to which the CPU switched during the routine execution.
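
The FuncA scenario from the note above can be turned into rough arithmetic. The following C sketch is a simplified model and not AQTime’s measurement code: the function names are hypothetical, and it assumes that the routine starts at the beginning of a time slice, that every thread has the same priority and always uses its full slice, and that the switch itself costs nothing.

```c
#include <stdint.h>

/* Hypothetical round-robin model (an assumption for illustration only).
   All times are expressed in the same arbitrary unit. */

/* Time the routine spends executing in its own thread. In this model,
   this is what User Time or User+Kernel Time would attribute to it. */
uint64_t thread_time(uint64_t cpu_needed)
{
    return cpu_needed;
}

/* Time between entry and exit as seen on a wall clock. In this model,
   this is what Elapsed Time would report. */
uint64_t elapsed_time(uint64_t cpu_needed, uint64_t slice, uint64_t threads)
{
    if (cpu_needed == 0 || slice == 0 || threads == 0)
        return cpu_needed;

    /* How many times the routine is preempted before it finishes. */
    uint64_t preemptions = (cpu_needed - 1) / slice;

    /* After each preemption, every other thread runs for one full slice. */
    return cpu_needed + preemptions * (threads - 1) * slice;
}
```

For example, with 40 threads and a slice of 100 units, a routine that needs 250 units of its own execution time is preempted twice, so `elapsed_time(250, 100, 40)` yields 250 + 2 * 39 * 100 = 8050 units, while `thread_time(250)` stays at 250.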

CPU Cache Misses. The CPU uses cache memory to access data from main memory faster: it loads data into the cache and then works with that data instead of reading it from main memory every time. Modern CPUs have several levels of cache. The CPU first reads data from the first-level cache. If the data is not there, the CPU attempts to load it from the second-level cache into the first-level cache. If the data is not in the second-level cache either, the CPU attempts to read it from the caches of the other levels or from main memory.

A cache miss is an event that occurs when the CPU tries to read data from the cache, but the data is not there. Cache misses reduce application performance because the CPU needs to access the next-level cache or main memory, both of which are slower than the upper-level caches. Using the CPU Cache Misses counter, you can determine how many times the CPU had to update the second-level cache during function execution. This counter helps you find routines that use inefficient algorithms for working with memory. The better a routine operates with data in memory, the fewer cache misses occur during its execution.

Note that CPU Cache Misses counts only those cache misses that occur in the thread where your routine executes. If the CPU switches context to other threads during the routine execution, cache misses that occur in these threads will not be added to the “routine’s” cache misses (see the note above).
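
As an illustration of how the memory access pattern of a routine can affect cache behavior, the sketch below sums the same two-dimensional array in two ways. It is a generic C example and is not taken from AQTime: the names `ROWS`, `COLS`, `sum_row_major` and `sum_column_major` are made up for the illustration. Both functions return the same result, but the second one jumps `COLS * sizeof(int)` bytes between consecutive accesses, which typically causes more cache misses on large arrays.

```c
#include <stddef.h>

enum { ROWS = 256, COLS = 256 };

/* Visits the elements in the order they are laid out in memory
   (C stores two-dimensional arrays row by row). */
long sum_row_major(int m[ROWS][COLS])
{
    long total = 0;
    for (size_t r = 0; r < ROWS; ++r)
        for (size_t c = 0; c < COLS; ++c)
            total += m[r][c];
    return total;
}

/* Computes the same sum, but walks down each column, so consecutive
   accesses are a whole row apart in memory. */
long sum_column_major(int m[ROWS][COLS])
{
    long total = 0;
    for (size_t c = 0; c < COLS; ++c)
        for (size_t r = 0; r < ROWS; ++r)
            total += m[r][c];
    return total;
}
```

Profiling both functions with the CPU Cache Misses counter is one way to check whether the traversal order matters for a particular array size and processor.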

Split Load Replays and Split Store Replays counters. The cache memory is organized as a set of “lines” (the number of bytes in each line depends on the processor model). Data loaded from memory into the cache may end up spread over several cache lines. For instance, an integer value consists of four bytes. Two of these bytes can be stored in one cache line and the other two bytes in another line. A split load is an event that occurs when the CPU reads data from the cache and one part of the data is located in one cache line and the other part in another line. A split store event is similar to a split load, but it occurs when the CPU writes data to the cache. These events result in a performance penalty since the CPU reads (or writes to) two cache lines instead of one.

The Split Load Replays and Split Store Replays counters allow you to determine whether performance slowdowns are caused by split load and split store events. They count replays that occur due to split loads and split stores. The lower the values in the profiler results, the fewer split load and split store events occurred during application profiling. To decrease the number of split load and split store events, use proper data alignment (for instance, 16-byte alignment) in your application.
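
Whether an access is split depends only on its address, its size and the cache line size. The following C sketch shows the arithmetic. The helper name `spans_two_lines`, the buffer `aligned_buffer` and the 64-byte line size used in the example are assumptions made for illustration (the actual line size depends on the processor model), and the `_Alignas` specifier requires a C11 compiler.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if an access of `size` bytes that starts at `address`
   touches more than one cache line of `line_size` bytes, 0 otherwise. */
int spans_two_lines(uintptr_t address, size_t size, size_t line_size)
{
    if (size == 0 || line_size == 0)
        return 0;

    uintptr_t first_line = address / line_size;
    uintptr_t last_line  = (address + size - 1) / line_size;
    return first_line != last_line;
}

/* A 16-byte buffer placed on a 16-byte boundary (C11 _Alignas).
   With a line size that is a multiple of 16, it fits in a single line. */
static _Alignas(16) unsigned char aligned_buffer[16];
```

With a 64-byte line, a four-byte integer at offset 62 occupies bytes 62 to 65 and is split, whereas the same integer at offset 60 is not. A 16-byte item aligned on a 16-byte boundary never crosses a 64-byte line.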

Blocked Store Forwards Replays counter. Use this counter to determine whether performance slowdowns are caused by blocked store-to-load forwards. Store-to-load forwarding means that the CPU forwards the data of a store operation directly to the load operation that follows the store.

Store forwarding occurs under certain conditions. If these conditions are violated, store-to-load forwarding is blocked. This typically happens in the following cases (for more information, see the Intel processor documentation at http://www.intel.com):
- The CPU writes a small amount of data and then reads more data from the same address (for example, the CPU writes one member of a structure and then reads the whole structure from memory).
- The CPU stores a large block of data and then loads a smaller block from it.
- The CPU operates with data that is not aligned properly.

The counter measures the number of replays that occur due to blocked store forwards.

Normally, blocked store forwards occur during each application run. However, an excessive number of replays indicates a performance issue. To avoid blocked store forwards, follow these rules where possible:
- A load that uses store data must have the same start point and alignment as the store data.
- The load data must be contained within the store data.
- Instead of several small loads after a large store to the same region of memory, use a single large load operation and then store data to the registers where possible.
- To obtain non-aligned data, read the smallest aligned portion of data that entirely includes the desired data and then shift or mask the data as needed.
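
The last two rules can be sketched in C as follows. This is an illustration only: the function name `extract_u16` is hypothetical, and the code assumes that an aligned 64-bit word has already been loaded into a register with a single load. The mapping of byte offsets to bit positions described in the comment applies to little-endian processors such as x86.

```c
#include <stdint.h>

/* Extracts the 16-bit field that starts `byte_offset` bytes into an
   aligned 64-bit word. The word is assumed to be in a register already,
   so the field is obtained by shifting and masking rather than by a
   separate, smaller load from memory.
   On a little-endian CPU, byte k of the word in memory occupies
   bits 8*k .. 8*k+7 of the loaded value. Valid offsets are 0..6. */
uint16_t extract_u16(uint64_t word, unsigned byte_offset)
{
    return (uint16_t)((word >> (8u * byte_offset)) & 0xFFFFu);
}
```

For instance, `extract_u16(0x1122334455667788, 1)` shifts the word right by eight bits and masks the low 16 bits, which gives `0x6677`.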

64K Aliasing Conflicts counter. Use this counter to determine the number of 64K aliasing conflicts that occur during application profiling. A 64K aliasing conflict occurs when the CPU accesses data at a virtual address that is a multiple of 64 KB apart from the address of data that is already in the first-level cache. Such accesses reduce application performance since the CPU needs to update the cache. 64K aliasing conflicts typically occur if the application works with a lot of small memory blocks that reside in memory far from one another.

CPU Mispredicted Branches. Modern pipelined processors include a branch prediction unit that predicts the outcome of conditional branches, such as those that follow comparison instructions. A correct prediction helps the CPU process instructions faster. A wrong prediction forces the CPU to refill the pipeline, which results in a time penalty. In other words, code that is more predictable executes faster than code that is less predictable.

The CPU Mispredicted Branches counter lets you determine how well your code can be predicted by the branch prediction unit. Small values mean that your code is more predictable and, therefore, faster. Higher values mean that the code is not very predictable and may need to be optimized. This does not mean you need to redesign your algorithm. It just means you can speed up your code by changing its structure. Suppose you have the following lines of code:

if (a == 0)
    c = 100;
else
    c = 200;

If the variable a takes only the values 0 and 1, you can avoid the comparison by creating an array of two elements and using a as the array index:

my_array[0] = 100;
my_array[1] = 200;
...
c = my_array[a];
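
A self-contained version of the two fragments above might look as follows. The function names `pick_with_branch` and `pick_with_table` are hypothetical. Note also that an optimizing compiler may already turn such a simple `if` into branch-free code, so it is worth checking the effect with the CPU Mispredicted Branches counter rather than assuming it.

```c
/* Branching version: the CPU has to predict which way the `if` goes. */
int pick_with_branch(int a)
{
    if (a == 0)
        return 100;
    else
        return 200;
}

/* Lookup table indexed by a. Valid only while a is 0 or 1. */
static const int my_array[2] = { 100, 200 };

/* Table version: no comparison, just an indexed load. */
int pick_with_table(int a)
{
    return my_array[a];
}
```

Both functions return 100 for `a == 0` and 200 for `a == 1`. The table version must not be called with any other value, because it would read outside the array.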

Note:

For more information on CPU cache misses, split load, split store and blocked store forwarding events, 64K aliasing conflicts and on optimization of branch prediction, see the Intel documentation at http://www.intel.com.

Context Switches. This counter allows you to assess how the operating system schedules threads to run on the processor. A context switch occurs when the kernel suspends one thread’s execution on the processor, records its current environment (“context”) and restores the context of the thread that is about to run. For instance, this happens when a thread with a higher priority than the running thread becomes ready. A low rate of context switches in a multi-processing system indicates that a program monopolizes the processor and does not leave much processor time for the other threads. A high rate of context switches means that the processor is being shared repeatedly, which may incur a considerable performance cost.

Hard Memory Page Faults, Soft Memory Page Faults and All Memory Page Faults counters. If you use these counters, AQTime monitors the application execution and counts how many page faults occur. A “page fault” means that the CPU requests data from memory, but the memory page that holds this data is not available at the moment. There is a difference between “soft” and “hard” page faults. A hard page fault means the operating system moved the memory page to a page file on the hard disk, so to provide the requested data, it has to load the memory page from the page file. A soft page fault occurs when the desired memory page is still located elsewhere in memory. A soft page fault also occurs when the application allocates memory blocks. The Hard Memory Page Faults counter reports hard page faults that occur during the routine execution, and Soft Memory Page Faults reports soft page faults. The All Memory Page Faults counter is simply the sum of Hard Memory Page Faults and Soft Memory Page Faults.

Page faults (especially hard page faults) have a dramatic impact on the application’s performance. A delay caused by a page fault is much longer than a delay caused by a cache miss. For example, a hard page fault can take 1,000,000 times longer to process than a cache miss. Therefore, your application will be faster if there are not many page faults.

Soft page faults occur more often than hard page faults and they are not as “dangerous”. However, a lot of soft page faults can significantly slow down the application execution. Typically, a large number of soft page faults means the application works with memory inefficiently and its memory-handling algorithm should be optimized.

Counter Limitations

There are several limitations when using counters:

Important: If your operating system includes the Kernel Patch Protection (KPP) feature, using a counter other than Elapsed Time may be unstable and may cause a system crash.

To work around the issue, you can try running the operating system in kernel debug mode (see instructions below). Depending on your Windows version and installed updates, this may help you to use the counters. If this does not help, you will have to use Elapsed Time.

Several notes:

- Running the operating system in debug mode will disable .NET debugging: the debugger of Microsoft Visual Studio will be disabled (that is, Visual Studio will not debug managed code) and AQTime’s Event View panel will not trace and report .NET-specific events. Profiling of managed code will still work, however.
- For Windows 10 users: We do not recommend enabling kernel debug mode in Windows 10 as it may cause an unrecoverable crash. You will have to use only the Elapsed Time counter.
- For Skype users: Applications that raise user-mode exceptions, such as Skype, can hang Windows running in kernel debug mode. To avoid the problem, prevent such applications from running automatically at startup before you reboot Windows in kernel debug mode. Also, do not use these applications while Windows is in debug mode.

The way you enable debug mode depends on your operating system:

Running Windows Vista, Windows 8, and Windows 8.1 in debug mode

Running Windows XP in debug mode

Important: If you have Windows DDK installed, then using some counters may cause the operating system to stop unexpectedly and display the error description on a blue screen.

To solve the problem, launch Driver Verifier (a tool from the Windows DDK package), which blocks the AQTime driver, and disable verification of the aqIPD8.sys driver (this driver is part of AQTime).

If you cannot disable verification of the aqIPD8.sys driver, you can still use the Elapsed Time, Context Switches, Soft Memory Page Faults, Hard Memory Page Faults and All Memory Page Faults counters.

AQTime supports a wide range of processors; however, not all counters are available for every processor model:
- The Intel Core i7, Intel Core 2 Duo, Intel Pentium II, Intel Pentium III, Intel Pentium M, AMD Phenom, AMD Athlon XP and AMD Athlon 64 processors do not support the Split Load Replays, Split Store Replays, Blocked Store Forwards Replays and 64K Aliasing Conflicts profiler counters.
  
  These processors support only the Elapsed Time, User Time, User+Kernel Time, CPU Cache Misses, CPU Mispredicted Branches, Context Switches, Hard Memory Page Faults, Soft Memory Page Faults and All Memory Page Faults counters.
- The Mobile Intel Pentium 4, AMD Opteron and AMD Turion processors support only the Elapsed Time, Context Switches, Hard Memory Page Faults, Soft Memory Page Faults and All Memory Page Faults counters.
- The Intel Xeon and Intel Xeon MP multi-core processors with the Hyper-Threading technology (for instance, Intel Xeon Duo Core) also support only the Elapsed Time, Context Switches, Hard Memory Page Faults, Soft Memory Page Faults and All Memory Page Faults counters. Single-core Intel Xeon and Intel Xeon MP processors support all of the counters.
- The Intel Pentium 4 and Intel Pentium D processors are free from these limitations and support all profiler counters.
- If your processor supports the SpeedStep technology, we recommend that you turn off the dynamic CPU frequency mode before you start the profiling. Otherwise, the Elapsed Time, User Time and User+Kernel Time counters may produce inaccurate timing results.

On virtual machines, you can use only the Elapsed Time, Context Switches, Hard Memory Page Faults, Soft Memory Page Faults and All Memory Page Faults counters. The following counters require a real CPU for timing and do not work on virtual machines: User Time, User+Kernel Time, CPU Cache Misses, Split Load Replays, Split Store Replays, Blocked Store Forwards Replays, 64K Aliasing Conflicts and CPU Mispredicted Branches.

A replay is an attempt to execute a micro-operation when the conditions for the correct execution of this operation are not satisfied. Replays may be caused by cache misses, store forwarding issues, etc. Normally, a certain number of replays always occurs during application execution. However, an excessive number of replays indicates a performance problem.
