PCL - The Performance Counter Library:
A Common Interface to Access Hardware Performance Counters on Microprocessors
(Version 2.2)

Rudolf Berrendorf
University of Applied Sciences Bonn-Rhein-Sieg
Computer Science Department
53754 Sankt Augustin, Germany
rudolf.berrendorf@h-brs.de

Bernd Mohr
Research Centre Juelich GmbH
Central Institute for Applied Mathematics
52425 Juelich, Germany
b.mohr@fz-juelich.de

Abstract:

A performance counter is that part of a microprocessor that measures and gathers performance-relevant events on the microprocessor. The number and type of available events differ significantly between existing microprocessors, because there is no commonly accepted specification, and because each manufacturer has different priorities on analyzing the performance of architectures and programs. Looking at the supported events on the different microprocessors, it can be observed that the functionality of these events differs from the requirements of an expert application programmer or a performance tool writer.

PCL, the Performance Counter Library, establishes a common platform for performance measurements on a wide range of computer systems. With a common interface on all systems and a set of application-oriented events defined, the application programmer is able to do program optimization in a portable way and the performance tool writer is able to rely on a common interface on different systems. A low-level interface gives the user direct access to the hardware performance counters to measure non-standard events.

PCL has functions to query the functionality, to start and to stop counters, and to read the values of counters. Performance counter values are returned as 64 bit integers (or floating point numbers for some events) on all systems. PCL supports nested calls to PCL functions thus allowing hierarchical performance measurements. Counting may be done either in system or in user mode. All interface functions are callable in C, C++, Fortran, and Java.



Introduction

This report describes performance counters on all major microprocessor families and introduces a common interface to access these counters. Performance counters count performance-critical events. These cover all aspects of the memory hierarchy (loads/stores, misses/hits, different cache levels, etc.), functional units or pipelines (operation counts, stalls, issues), the duration of requests, and so on.

As will be shown, the number and type of events, and the way they are accessed, differ significantly between processors. Moreover, the supported events may not be very helpful to application programmers or tool builders, whose demands on countable events may differ.

To overcome this lack of a common platform, we developed PCL, the Performance Counter Library. We first defined a set of events useful to the application programmer and tool builder, and then established a set of access functions to control and access the performance counters on different platforms. PCL is implemented on many of today's machines, ranging from a PC running Linux to parallel systems capable of Teraflops, and it is callable from application programs as well as from tools.

The Performance Counter Library PCL is available at

http://www2.inf.h-brs.de/rberre2m/PCL/

and
http://www.fz-juelich.de/zam/PCL/.


Requirements of Application Programmers

People from different areas of computer science and electrical engineering may see different events as most useful for their optimization purposes. Most of the events described so far in the description of the microprocessors are likely most useful to the computer architect, hardware engineer, or low-level device driver writer.

Application programmers optimizing their programs or performance tool writers wish to get performance relevant information related to their programs rather than counting signal switches on certain pins of a chip module. Therefore, those parts of the microprocessor which have appropriate counterparts in a program are most likely to be used by the application programmer to optimize programs. The memory hierarchy in a computer system corresponds directly to program variables and the functional units execute the operations specified in a program. Therefore, we concentrate on those aspects of a computer system.

Our impression is that taking the union of all available events of all microprocessors is not the right way to define an application interface for an application programmer or tool writer. Our approach is to define a set of events relevant to the user. If microprocessor architecture or programming methodology proceeds in a different direction (we don't see that for the near future!), the set of events might then be extended or changed.

Although hardware counters give numbers for a processor, performance numbers should be related to a process (representing the program). Therefore, either the executing process should be bound to a processor, or migrating a process to another processor should be transparent to the process (related to performance counting). Using the second approach needs support of the operating system.

We have grouped the useful events into the categories shown in the following sections.

Memory Hierarchy

Currently, most computer systems support four levels in the memory hierarchy: registers, 1st level cache, 2nd level cache, main memory. Registers are directly controlled by a compiler, so for example, the information how many registers keep live values could be better managed by a compiler. Although main memory statistics could be quite useful in performance analysis (e.g. bank conflicts), performance counters in microprocessors mostly see the main memory as a black box. Therefore, we concentrate on 1st and 2nd level caches.

Accesses to caches can be distinguished by read or write accesses, instruction loads and instruction stores (fetches from a higher level in the hierarchy), or data load/stores. An important performance aspect is the hit and miss rate, which can be calculated from the total number of accesses and either the number of misses or hits. Most microprocessors use (small) translation look-aside buffers (TLB) to speed up the translation of virtual to physical addresses. As misses in the TLB are time consuming, this number (and its relation to the number of hits or the total number of address lookups) is a relevant number for performance optimization.
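As a minimal sketch of the calculation described above, the following C helpers derive a miss rate from the total number of accesses and either the number of misses or the number of hits. The function names are illustrative and not part of PCL; the inputs are assumed to be counts already read from the counters (for example for PCL_L1DCACHE_READWRITE, PCL_L1DCACHE_MISS, or PCL_L1DCACHE_HIT).

```c
#include <assert.h>

/* Hypothetical helpers (not part of PCL): derive a miss rate from raw counts. */

/* Miss rate from total accesses and misses; 0.0 if there were no accesses. */
double miss_rate_from_misses(unsigned long long accesses,
                             unsigned long long misses)
{
    assert(misses <= accesses);
    return accesses == 0 ? 0.0 : (double)misses / (double)accesses;
}

/* Miss rate from total accesses and hits, using misses = accesses - hits. */
double miss_rate_from_hits(unsigned long long accesses,
                           unsigned long long hits)
{
    assert(hits <= accesses);
    return miss_rate_from_misses(accesses, accesses - hits);
}
```

The same two functions apply to TLB lookups if the total number of address lookups and the number of TLB misses or hits are passed instead.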

We distinguish between instruction and data caches on each level. For unified caches (i.e. instruction and data are buffered in the same cache), it is often possible to distinguish instruction and data loads. Therefore on those caches, PCL_LxICACHE_xxx and PCL_LxDCACHE_xxx refer to events concerning instruction and data accesses, respectively.

The available events concerning memory hierarchy are given in table 2.1.


Table 2.1: Events concerning memory hierarchy (x=1 or 2 for 1st or 2nd level cache)
cache  
PCL_LxCACHE_READ number of level-x cache reads
PCL_LxCACHE_WRITE number of level-x cache writes
PCL_LxCACHE_READWRITE number of level-x cache reads or writes
PCL_LxCACHE_HIT number of level-x cache hits
PCL_LxCACHE_MISS number of level-x cache misses
data cache  
PCL_LxDCACHE_READ number of level-x data cache reads
PCL_LxDCACHE_WRITE number of level-x data cache writes
PCL_LxDCACHE_READWRITE number of level-x data cache reads or writes
PCL_LxDCACHE_HIT number of level-x data cache hits
PCL_LxDCACHE_MISS number of level-x data cache misses
instruction cache  
PCL_LxICACHE_READ number of level-x instruction cache reads
PCL_LxICACHE_WRITE number of level-x instruction cache writes
PCL_LxICACHE_READWRITE number of level-x instruction cache reads or writes
PCL_LxICACHE_HIT number of level-x instruction cache hits
PCL_LxICACHE_MISS number of level-x instruction cache misses
TLB  
PCL_TLB_HIT number of hits in TLB
PCL_TLB_MISS number of misses in TLB
Instruction TLB  
PCL_ITLB_HIT number of hits in instruction TLB
PCL_ITLB_MISS number of misses in instruction TLB
Data TLB  
PCL_DTLB_HIT number of hits in data TLB
PCL_DTLB_MISS number of misses in data TLB


By definition, the sum of cache reads and cache writes should equal the number of cache read/writes, and the sum of cache hits and cache misses should equal the number of cache read/writes, too. Additionally, if two first level caches exist (instruction and data), the sum of instruction cache reads and data cache reads should equal the number of cache reads (and so on).
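These identities can serve as a sanity check on measured values. The sketch below is one possible way to express them in C; the struct and function names are hypothetical and not part of PCL, and the fields are assumed to hold counts obtained for the corresponding PCL_LxCACHE_xxx events.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical container (not a PCL type) for the five counts of one cache. */
typedef struct {
    unsigned long long reads;      /* e.g. count for PCL_L1CACHE_READ      */
    unsigned long long writes;     /* e.g. count for PCL_L1CACHE_WRITE     */
    unsigned long long readwrites; /* e.g. count for PCL_L1CACHE_READWRITE */
    unsigned long long hits;       /* e.g. count for PCL_L1CACHE_HIT       */
    unsigned long long misses;     /* e.g. count for PCL_L1CACHE_MISS      */
} cache_counts;

/* True if reads + writes == read/writes and hits + misses == read/writes. */
bool cache_counts_consistent(const cache_counts *c)
{
    assert(c != NULL);
    return c->reads + c->writes == c->readwrites
        && c->hits + c->misses == c->readwrites;
}

/* True if instruction cache reads + data cache reads == cache reads. */
bool split_reads_consistent(unsigned long long icache_reads,
                            unsigned long long dcache_reads,
                            unsigned long long cache_reads)
{
    return icache_reads + dcache_reads == cache_reads;
}
```

Exact equality is only a reasonable expectation when all counts stem from the same measurement interval; values gathered in separate runs would probably need a tolerance instead.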

Instructions

Instructions correspond to operations and flow control specified in a program. There are several categories of operations (e.g. integer, logical, floating point) which might be executed by different functional units in the microprocessor. Another aspect (in multiprocessor systems) is atomic operations (e.g. a test-and-set primitive), which can execute successfully (the lock could be set) or unsuccessfully (the lock could not be acquired because it was already set). We distinguish between the instruction categories shown in table 2.2.

Additionally, we have included a cycle count which gives the number of cycles spent in this process or on behalf of the process/thread (when counting in user-and-system mode). For clarification, it should be noted that the cycle count should not be used to count the number of elapsed cycles as on multiprogramming systems other processes might be scheduled to the same processor. To count the number of elapsed cycles, an additional event can be used (PCL_ELAPSED_CYCLES).

On some systems, the number of issued instructions might be different to the number of completed instructions due to some error conditions. We have chosen completed instructions, as they correspond more closely to the operations the programmer specified in his program.

Getting the number of operations out of the number of instructions is difficult. For example, on some systems a floating-point add and a floating-point multiply can be initiated by a single add-and-multiply instruction. In that case, 1 floating point instruction is counted but 2 floating point operations are executed. With PCL (and most hardware performance counter implementations), it is not possible to count the number of floating point operations or related numbers.


Table 2.2: Events concerning instruction categories
PCL_CYCLES spent cycles in process/thread (and possibly in system calls)
PCL_ELAPSED_CYCLES elapsed cycles
PCL_INTEGER_INSTR number of completed integer (or logical) instructions
PCL_FP_INSTR number of completed floating point instructions
PCL_LOAD_INSTR number of completed load instructions
PCL_STORE_INSTR number of completed store instructions
PCL_LOADSTORE_INSTR number of completed load or store instructions
PCL_INSTR sum of all completed instructions
PCL_JUMP_SUCCESS number of correctly predicted branches
PCL_JUMP_UNSUCCESS number of mispredicted branches
PCL_JUMP sum of all branches
PCL_ATOMIC_SUCCESS number of successful atomic instructions
PCL_ATOMIC_UNSUCCESS number of unsuccessful atomic instructions
PCL_ATOMIC sum of all instructions concerning atomic operations


Status of Functional Units

Functional units may be stalled due to blocked resources, missing operands, and so on. Table 2.3 gives the events defined for stalls. Unlike all other events, measuring such an event yields not the number of stalls but the number of cycles that all stalls of this event type have taken.


Table 2.3: Events concerning functional unit stalls (numbers given in cycles)
PCL_STALL_INTEGER number of cycles the integer/logical unit is stalled
PCL_STALL_FP number of cycles the floating point unit is stalled
PCL_STALL_JUMP number of cycles the branch unit is stalled
PCL_STALL_LOAD number of cycles the load unit is stalled
PCL_STALL_STORE number of cycles the store unit is stalled (write buffer)
PCL_STALL sum of all cycles a unit is stalled


Rates and Ratios

Often, it is useful to get a ratio or rate rather than an absolute number. Good examples are cache miss rates or floating point operations per second. Table 2.4 gives the events defined for such rates and ratios.


Table 2.4: Events concerning rates and ratios (numbers are floating point values)
PCL_MFLOPS millions of floating point instructions per second
PCL_IPC number of completed instructions per cycle
PCL_L1DCACHE_MISSRATE miss rate of L1 data cache
PCL_L2DCACHE_MISSRATE miss rate for L2 data cache
PCL_MEM_FP_RATIO ratio of memory references to floating point operations


Measuring these events will mostly be done by deriving the values from other performance numbers (see [1]). The definitions are as follows:
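A minimal sketch of such derived values in C follows. The formulas are assumptions inferred from the event descriptions in table 2.4 and from the per-processor expressions in table 3.14 (for example floating point instructions divided by cycles, multiplied by the clock rate in MHz); they are not quoted from the library, and the function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical helpers (not part of PCL) for rates and ratios. */

/* Millions of floating point instructions per second:
 * (fp instructions / cycles) * clock rate in MHz. */
double derived_mflops(unsigned long long fp_instr,
                      unsigned long long cycles, double mhz)
{
    assert(cycles > 0 && mhz > 0.0);
    return (double)fp_instr / (double)cycles * mhz;
}

/* Completed instructions per cycle. */
double derived_ipc(unsigned long long instr, unsigned long long cycles)
{
    assert(cycles > 0);
    return (double)instr / (double)cycles;
}

/* Ratio of memory references (loads and stores) to floating point
 * instructions. */
double derived_mem_fp_ratio(unsigned long long loadstore_instr,
                            unsigned long long fp_instr)
{
    assert(fp_instr > 0);
    return (double)loadstore_instr / (double)fp_instr;
}
```

Note that the numerator counts floating point instructions, not operations, so a fused add-and-multiply instruction contributes 1 rather than 2 to the MFLOPS value computed this way.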

PCL - The Performance Counter Library

The Performance Counter Library has a programming interface to access a set of performance counters with a defined set of countable events. In section 3.1, we specify which of the events defined in chapter 2 are available on what systems and in section 3.2 we define the programming interface. Additionally, we provide a uniform low-level interface to directly access performance counters on a microprocessor to measure non-standard events.


Countable Events

In the following tables we compare the events defined in the last section in tables 2.1 to 2.4 with the available events on the microprocessors currently supported by PCL.

The tables follow this scheme. Each entry specifies whether a processor supports the PCL event and, if so, what the corresponding hardware event is. The entry names correspond to the event names in the description of the microprocessors (see chapter A). Empty entries signal that such an event is not available on that microprocessor. Entries in italic are indirect events, computed as a combination of several other events that are directly countable by a (hardware) performance counter. Counters used for indirect events cannot be used at the same time to measure their own events. Therefore, on a given processor, any combination of PCL events is possible as long as it does not generate a resource conflict.

Table 3.1 shows events relevant to the 1st level cache (instruction and data), table 3.2 shows events relevant to the 1st level data cache, and table 3.3 shows events relevant to the 1st level instruction cache. Tables 3.4, 3.5, and 3.6 show events relevant to the 2nd level cache (instruction and data, data, instruction, respectively). If there is a unified cache for data and instructions (as it is on most systems), events defined for the 2nd level instruction cache refer to cache references done by instruction fetches, and for the data cache accordingly. Table 3.7 shows events for the translation look-aside buffers (instruction, data, instruction and data). Table 3.8 shows spent cycles, tables 3.10 and 3.9 show events relevant to instructions, table 3.11 shows events regarding branch instructions, and table 3.12 shows events regarding atomic instructions. Table 3.13 shows events concerning units which are blocked/stalled. Instead of counting the number of events, the numbers in this table give the number of cycles for the event type. Table 3.14 shows the events concerning rates and ratios.


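Table 3.1: 1st level cache
processor | PCL_L1CACHE_READ | PCL_L1CACHE_WRITE | PCL_L1CACHE_READWRITE | PCL_L1CACHE_HIT | PCL_L1CACHE_MISS
Alpha | | | | |
21164 | | | | |
21264 | | | | |
MIPS | | | | |
R10k | | | | | R10k_C1_9+R10k_C0_9
R12k | | | | | R12k_9+R12k_25
SPARC | | | | |
Ultra I/II | | | | |
Ultra III | | | | |
PowerPC | | | | |
PPC604 | | | | | PPC604_C0_5+PPC604_C1_6
PPC604e | | | | | PPC604e_C0_5+PPC604e_C1_6
POWER3 | | | | | POWER3_C0_6+POWER3_C1_9
POWER3-II | | | | | POWER3II_C0_6+POWER3II_C1_9
Intel | | | | |
Pentium MMX | | | | |
Pro,PII,PIII | | | | |
Pentium 4 | | | | |
AMD | | | | |
Athlon | | | | |
Hitachi | | | | |
SR8000 | | | | |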



Table 3.2: 1st level data cache
processor | PCL_L1DCACHE_READ | PCL_L1DCACHE_WRITE | PCL_L1DCACHE_READWRITE | PCL_L1DCACHE_HIT | PCL_L1DCACHE_MISS
Alpha | | | | |
21164 | | | 21164_C1_14 | 21164_C1_14-21164_C2_5 | 21164_C2_5
21264 | | | | |
MIPS | | | | |
R10k | | | | | R10k_C1_9
R12k | | | | | R12k_25
SPARC | | | | |
Ultra I/II | ULTRA_C0_5 | ULTRA_C0_6 | | | ULTRA_C0_1
Ultra III | ULTRA3_C0_9 | ULTRA3_C0_10 | | |
PowerPC | | | | |
PPC604 | | | | | PPC604_C1_6
PPC604e | | | | | PPC604e_C1_6
POWER3 | | | | | POWER3_C0_17,POWER3_C1_19,POWER3_C5_0
POWER3-II | | | | | POWER3II_C0_17,POWER3II_C1_19,POWER3II_C5_0
Intel | | | | |
Pentium MMX | Pentium_0 | Pentium_1 | Pentium_0+Pentium_1 | | Pentium_37
Pro,PII,PIII | | | | | PPro_1
Pentium 4 | | | | | P4_CG4_16
AMD | | | | |
Athlon | | | ATHLON_2 | ATHLON_2-ATHLON_3 | ATHLON_3
Hitachi | | | | |
SR8000 | | | | | SR8000_C3



Table 3.3: 1st level instruction cache
processor | PCL_L1ICACHE_READ | PCL_L1ICACHE_WRITE | PCL_L1ICACHE_READWRITE | PCL_L1ICACHE_HIT | PCL_L1ICACHE_MISS
Alpha | | | | |
21164 | | | 21164_C1_13 | 21164_C1_13-21164_C2_3 | 21164_C2_3
21264 | | | | |
MIPS | | | | |
R10k | | | | | R10k_C0_9
R12k | | | | | R12k_9
SPARC | | | | |
Ultra I/II | | | ULTRA_C0_4 | ULTRA_C1_4 | ULTRA_C0_4-ULTRA_C1_4
Ultra III | ULTRA3_C0_8 | | | ULTRA3_C0_8-ULTRA3_C1_8 | ULTRA3_C1_8
PowerPC | | | | |
PPC604 | | | | | PPC604_C0_5
PPC604e | | | | | PPC604e_C0_5
POWER3 | | | | POWER3_C4_1 | POWER3_C0_5,POWER3_C6_0
POWER3-II | | | | POWER3II_C4_1 | POWER3II_C0_5,POWER3II_C6_0
Intel | | | | |
Pentium MMX | Pentium_12 | | | | Pentium_14
Pro,PII,PIII | PPro_5 | | | | PPro_6
Pentium 4 | | | | | P4_CG1_0
AMD | | | | |
Athlon | ATHLON_18 | | | ATHLON_18-ATHLON_19 | ATHLON_19
Hitachi | | | | |
SR8000 | | | | | SR8000_C2



Table 3.4: 2nd level cache
processor | PCL_L2CACHE_READ | PCL_L2CACHE_WRITE | PCL_L2CACHE_READWRITE | PCL_L2CACHE_HIT | PCL_L2CACHE_MISS
Alpha | | | | |
21164 | 21164_C1_16 | 21164_C1_17 | 21164_C1_15 | 21164_C1_15-21164_C2_14 | 21164_C2_14
21264 | | | | |
MIPS | | | | |
R10k | | | | | R10k_C1_10+R10k_C0_10
R12k | | | | | R12k_26+R12k_10
SPARC | | | | |
Ultra I/II | | | ULTRA_C0_8 | ULTRA_C1_8 | ULTRA_C1_9
Ultra III | | | ULTRA3_C0_12 | ULTRA3_C0_12-ULTRA3_C1_12 | ULTRA3_C1_12
PowerPC | | | | |
PPC604 | | | | |
PPC604e | | | | |
POWER3 | | | | | POWER3_C1_21
POWER3-II | | | | | POWER3II_C1_21
Intel | | | | |
Pentium MMX | | | | |
Pro,PII,PIII | | | PPro_17 | | PPro_13
Pentium 4 | P4_CG1_10,11,12,13 | | | P4_CG1_10,11,12 | P4_CG1_13
AMD | | | | |
Athlon | | | ATHLON_16 | |
Hitachi | | | | |
SR8000 | | | | |



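Table 3.5: 2nd level data cache
processor | PCL_L2DCACHE_READ | PCL_L2DCACHE_WRITE | PCL_L2DCACHE_READWRITE | PCL_L2DCACHE_HIT | PCL_L2DCACHE_MISS
Alpha | | | | |
21164 | | | | |
21264 | | | | |
MIPS | | | | |
R10k | | | | | R10k_C1_10
R12k | | | | | R12k_26
SPARC | | | | |
Ultra I/II | | | | |
Ultra III | | | | |
PowerPC | | | | |
PPC604 | | | | |
PPC604e | | | | |
POWER3 | | | | |
POWER3-II | | | | |
Intel | | | | |
Pentium MMX | | | | |
Pro,PII,PIII | PPro_11 | PPro_12 | PPro_11+PPro_12 | |
Pentium 4 | | | | |
AMD | | | | |
Athlon | | | | |
Hitachi | | | | |
SR8000 | | | | |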



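Table 3.6: 2nd level instruction cache
processor | PCL_L2ICACHE_READ | PCL_L2ICACHE_WRITE | PCL_L2ICACHE_READWRITE | PCL_L2ICACHE_HIT | PCL_L2ICACHE_MISS
Alpha | | | | |
21164 | | | | |
21264 | | | | |
MIPS | | | | |
R10k | | | | | R10k_C0_10
R12k | | | | | R12k_10
SPARC | | | | |
Ultra I/II | | | | |
Ultra III | | | | |
PowerPC | | | | |
PPC604 | | | | |
PPC604e | | | | |
POWER3 | | | | |
POWER3-II | | | | |
Intel | | | | |
Pentium MMX | | | | |
Pro,PII,PIII | | | | |
Pentium 4 | | | | |
AMD | | | | |
Athlon | | | | |
Hitachi | | | | |
SR8000 | | | | |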



Table 3.7: Translation look-aside buffer
processor | PCL_TLB_HIT | PCL_TLB_MISS | PCL_ITLB_HIT | PCL_ITLB_MISS | PCL_DTLB_HIT | PCL_DTLB_MISS
Alpha | | | | | |
21164 | | | | 21164_C2_4 | | 21164_C2_6
21264 | | | | 21264_C1_5 | |
MIPS | | | | | |
R10k | | R10k_C1_7 | | | |
R12k | | R12k_23 | | | |
SPARC | | | | | |
Ultra I/II | | | | | |
Ultra III | | | | ULTRA3_C1_17 | | ULTRA3_C1_18
PowerPC | | | | | |
PPC604 | | PPC604_C0_6+PPC604_C1_7 | | PPC604_C1_7 | | PPC604_C0_6
PPC604e | | PPC604e_C0_6+PPC604e_C1_7 | | PPC604e_C1_7 | | PPC604e_C0_6
POWER3 | | POWER3_C0_19,POWER3_C8_0 | | | |
POWER3-II | | POWER3II_C0_19,POWER3II_C8_0 | | | |
Intel | | | | | |
Pentium MMX | | | | Pentium_13 | | Pentium_2
Pro,PII,PIII | | | | PPro_7 | |
Pentium 4 | | | | P4_CG1_9 | | P4_CG1_8
AMD | | | | | |
Athlon | | | | ATHLON_23 | | ATHLON_8
Hitachi | | | | | |
SR8000 | | | | SR8000_C0 | | SR8000_C1



Table 3.8: Cycles
processor | PCL_CYCLES | PCL_ELAPSED_CYCLES
Alpha | |
21164 | 21164_C0_0 | 21164_PCC (footnote 3.1)
21264 | 21264_C0_0, 21264_C1_0 | 21264_PCC
MIPS | |
R10k | R10k_C0_0, R10k_C1_0 |
R12k | R12k_0 |
SPARC | |
Ultra I/II | ULTRA_C0_0, ULTRA_C1_0 | ULTRA_TC
Ultra III | ULTRA3_C0_0, ULTRA3_C1_0 | ULTRA3_TC
PowerPC | |
PPC604 | PPC604_C0_1, PPC604_C1_1 |
PPC604e | PPC604e_C0_1, PPC604e_C1_1, PPC604e_C2_1, PPC604e_C3_1 |
POWER3 | POWER3_C0_1 and others |
POWER3-II | POWER3II_C0_1 and others |
Intel | |
Pentium MMX | Pentium_C0_4 (footnote 3.2) | Pentium_TSC
Pro,PII,PIII | PPro_61 | PPro_TSC
Pentium 4 | P4_CYCLES | P4_TSC
AMD | |
Athlon | ATHLON_15 | ATHLON_TSC
Hitachi | |
SR8000 | SR8000_C6 | special register



Table 3.9: Instructions (2)
processor | PCL_INTEGER_INSTR | PCL_FP_INSTR
Alpha | |
21164 | 21164_C1_9 | 21164_C1_10 (footnote 3.3)
21264 | |
MIPS | |
R10k | | R10k_C1_5
R12k | | R12k_21
SPARC | |
Ultra I/II | |
Ultra III | | ULTRA3_C0_24+ULTRA3_C1_39
PowerPC | |
PPC604 | PPC604_C0_14 | PPC604_C0_15
PPC604e | PPC604e_C0_14 | PPC604e_C0_15
POWER3 | POWER3_C5_2+POWER3_C6_1+POWER3_C7_4 | POWER3_C1_35+POWER3_C4_5
POWER3-II | POWER3II_C5_2+POWER3II_C6_1+POWER3II_C7_4 | POWER3II_C1_35+POWER3II_C4_5
Intel | |
Pentium MMX | | Pentium_30 (footnote 3.4)
Pro,PII,PIII | | PPro_C0_0
Pentium 4 | | P4_CG4_14
AMD | |
Athlon | |
Hitachi | |
SR8000 | | SR8000_C7



Table 3.10: Instructions (1)
processor | PCL_LOAD_INSTR | PCL_STORE_INSTR | PCL_LOADSTORE_INSTR | PCL_INSTR
Alpha | | | |
21164 | 21164_C1_11 | 21164_C1_12 | | 21164_C0_1 (footnote 3.5)
21264 | | | | AL264_0_1
MIPS | | | |
R10k | R10k_C1_2 | R10k_C1_3 | | R10k_C0_15, R10k_C1_1 (footnote 3.6)
R12k | R12k_18 | R12k_19 | R12k_18+R12k_19 (footnote 3.7) | R12k_15
SPARC | | | |
Ultra I/II | | | | ULTRA_C0_1
Ultra III | | | | ULTRA3_C0_1, ULTRA3_C1_1
PowerPC | | | |
PPC604 | PPC604_C1_18 | | | PPC604_C0_2, PPC604_C1_2
PPC604e | PPC604e_C1_18 | | | PPC604e_C0_2, PPC604e_C1_2, PPC604e_C2_2, PPC604e_C3_2
POWER3 | POWER3_C0_4, POWER3_C3_5 | POWER3_C2_9 | POWER3_C3_5+POWER3_C2_9 | POWER3_C0_1, POWER3_C1_0, POWER3_C2_2, POWER3_C3_2
POWER3-II | POWER3II_C0_4, POWER3II_C3_5 | POWER3II_C2_9 | POWER3II_C3_5+POWER3II_C2_9 | POWER3II_C0_1, POWER3II_C1_0, POWER3II_C2_2, POWER3II_C3_2
Intel | | | |
Pentium MMX | | | Pentium_36 | Pentium_20
Pro,PII,PIII | | | PPro_0 | PPro_44
Pentium 4 | P4_CG4_12 | P4_CG4_12 | P4_CG4_12 | P4_CG4_18, 19
AMD | | | |
Athlon | | | | ATHLON_28
Hitachi | | | |
SR8000 | | | SR8000_C4 | SR8000_C5



Table 3.11: Branch instructions
processor | PCL_JUMP_SUCCESS | PCL_JUMP_UNSUCCESS | PCL_JUMP
Alpha | | |
21164 | | 21164_C2_2 |
21264 | | | 21264_C1_1
MIPS | | |
R10k | R10k_C0_6-R10k_C1_8 | R10k_C1_8 | R10k_C0_6
R12k | R12k_6-R12k_24 | R12k_24 | R12k_6
SPARC | | |
Ultra I/II | | |
Ultra III | | ULTRA3_C0_21+ULTRA3_C1_29 | ULTRA3_C0_22
PowerPC | | |
PPC604 | PPC604_C1_8-PPC604_C0_7 | PPC604_C0_7 | PPC604_C1_8
PPC604e | PPC604e_C1_8-PPC604e_C0_7 | PPC604e_C0_7 | PPC604e_C1_8
POWER3 | POWER3_C1_8 | POWER3_C3_22-POWER3_C1_8 | POWER3_C3_22
POWER3-II | POWER3II_C1_8 | POWER3II_C3_22-POWER3II_C1_8 | POWER3II_C3_22
Intel | | |
Pentium MMX | Pentium_4 (footnote 3.8) | Pentium_16-Pentium_4 (footnote 3.9) | Pentium_16
Pro,PII,PIII | PPro_52 | PPro_51 | PPro_50
Pentium 4 | P4_CG4_2 | P4_CG4_4 | P4_CG4_4+P4_CG4_0, 2
AMD | | |
Athlon | ATHLON_32 | ATHLON_31 | ATHLON_35
Hitachi | | |
SR8000 | | |



Table 3.12: Atomic instructions
processor | PCL_ATOMIC_SUCCESS | PCL_ATOMIC_UNSUCCESS | PCL_ATOMIC
Alpha | | |
21164 | 21164_C2_13 | |
21264 | | |
MIPS | | |
R10k | R10k_C1_4-R10k_C0_5 | R10k_C0_5 | R10k_C1_4
R12k | R12k_20-R12k_5 | R12k_5 | R12k_20
SPARC | | |
Ultra I/II | | |
Ultra III | | |
PowerPC | | |
PPC604 | PPC604_C1_9 | |
PPC604e | PPC604e_C1_9 | |
POWER3 | | POWER3_C1_13 |
POWER3-II | | POWER3II_C1_13 |
Intel | | |
Pentium MMX | | |
Pro,PII,PIII | | |
Pentium 4 | | |
AMD | | |
Athlon | | |
Hitachi | | |
SR8000 | | |



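Table 3.13: Blocked units
processor | PCL_STALL_INTEGER | PCL_STALL_FP | PCL_STALL_JUMP | PCL_STALL_LOAD | PCL_STALL_STORE | PCL_STALL
Alpha | | | | | |
21164 | | | | | |
21264 | | | | | |
MIPS | | | | | |
R10k | | | | | |
R12k | | | | | |
SPARC | | | | | |
Ultra I/II | | | | | |
Ultra III | | | | | ULTRA3_C0_5 |
PowerPC | | | | | |
PPC604 | | | | | |
PPC604e | | PPC604e_C2_19 | PPC604e_C2_12 | | |
POWER3 | | | | | |
POWER3-II | | | | | |
Intel | | | | | |
Pentium MMX | | Pentium_C0_5 (footnote 3.10) | | Pentium_24 | Pentium_23 |
Pro,PII,PIII | | | | | | PPro_58
Pentium 4 | | | | | |
AMD | | | | | |
Athlon | | ATHLON_49 | | | |
Hitachi | | | | | |
SR8000 | | | | | |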



Table 3.14: Rates and Ratios
processor | PCL_MFLOPS | PCL_IPC | PCL_L1DCACHE_MISSRATE | PCL_L2DCACHE_MISSRATE | PCL_MEM_FP_RATIO
Alpha | | | | |
21164 | 21164_C1_10/21164_C2_11*MHz | 21164_C0_1/21164_C2_11 | 21164_C2_5/21164_C1_14 | 21164_C2_14/21164_C1_15 (footnote 3.11) |
21264 | | | | |
MIPS | | | | |
R10k | R10k_C1_5/R10k_C0_0*MHz | R10k_C0_15/R10k_C1_0 | | |
R12k | R12k_21/R12k_0*MHz | | | |
SPARC | | | | |
Ultra I/II | | ULTRA_C0_1/ULTRA_C1_0 | | ULTRA_C1_9/ULTRA_C0_11 |
Ultra III | (ULTRA_C0_24+ULTRA_C1_39)/MHz | ULTRA_C0_1/ULTRA_C1_1 | | ULTRA_C1_12/ULTRA_C0_12 |
PowerPC | | | | |
PPC604 | PPC604_C0_15/PPC604_C1_1*MHz | PPC604_C0_2/PPC604_C1_1 | | |
PPC604e | PPC604e_C0_15/PPC604e_C1_1*MHz | PPC604e_C0_2/PPC604e_C1_1 | | |
POWER3 | POWER3_C1_35/POWER3_C4_5*MHz | POWER3_C0_1/POWER3_C1_1 | | |
POWER3-II | POWER3II_C1_35/POWER3II_C4_5*MHz | POWER3II_C0_1/POWER3II_C1_1 | | |
Intel | | | | |
Pentium MMX | Pentium_30/Pentium_C0_4*MHz | Pentium_20/Pentium_C0_4 | Pentium_37/Pentium_36 | | Pentium_36/Pentium_30
Pro,PII,PIII | PPro_C0_0/PPro_61*MHz | PPro_44/PPro_61 | PPro_1/PPro_0 | |
Pentium 4 | P4_CG4_14/P4_CYCLES*MHz | P4_CG4_18, 19/P4_CYCLES | | P4_CG1_13/P4_CG1_10,11,12,13 | P4_CG4_12/P4_CG4_14
AMD | | | | |
Athlon | | ATHLON_28/ATHLON_15 | ATHLON_3/ATHLON_2 | |
Hitachi | | | | |
SR8000 | | | | |



Interface Functions

The interface functions to control the performance counters are given below. All functions are callable from C, C++, Fortran, and Java. All functions return status codes with the following meaning:

PCL_SUCCESS
    the function finished successfully
PCL_NOT_SUPPORTED
    the requested event is not supported on this hardware
PCL_TOO_MANY_EVENTS
    more events were requested than performance counters are available
PCL_TOO_MANY_NESTINGS
    there are more nested calls than allowed (PCL_MAX_NESTING_LEVEL)
PCL_TOO_ILL_NESTING
    either a different number or different types of events are requested in nested calls
PCL_ILL_EVENT
    the event identifier is illegal
PCL_MODE_NOT_SUPPORTED
    performance counting for that mode is not supported
PCL_FAILURE
    failure for some unspecified reason

Every PCL call needs a handle (denoted by descr) to work in a multi-threaded environment. Such a handle needs to be allocated once with a call to PCLinit before any other PCL function is called. A handle should be deallocated with PCLexit after all PCL functions were called.

High-Level Interface

The high-level interface is the interface to use in most cases, as it allows portable access to performance counters.

PCLinit

Allocates a thread-specific descriptor which must be passed to all subsequent PCL calls. The address of a descriptor must be passed.

int PCLinit(
             PCL_DESCR_TYPE* addr_descr   /* I/O: addr of handle */
           );

PCLexit

Releases the thread-specific descriptor.

int PCLexit(
             PCL_DESCR_TYPE descr         /* I: handle */
           );

PCLquery

This function queries whether a certain functionality is available on this machine. The user supplies in counter_list an array of ncounter event names (of type integer). Event names are any of those introduced in tables 3.1 to 3.13 in the last section. In mode, the user specifies the execution mode for which performance data should be gathered: PCL_MODE_USER specifies counting in user mode, PCL_MODE_SYSTEM specifies counting in system mode, and PCL_MODE_USER_SYSTEM specifies counting in both modes. The function returns PCL_SUCCESS if the requested functionality is possible (i.e. if the requested events can be counted in parallel); otherwise it returns an error code indicating why the requested events are not supported on this system. No resources are allocated by this call.

int PCLquery(
              PCL_DESCR_TYPE descr,    /* I: handle */
              int *counter_list,       /* I: requested event counters */
              int ncounter,            /* I: number of counters */
              unsigned int mode        /* I: mode flags (PCL_MODE_xxx) */
            );

PCLstart

With PCLstart, performance counting is started (if it is possible). The user supplies in counter_list an array of size ncounter of event names. Event names are any of those introduced in the tables 3.1 to 3.13 in the last section. mode has the same meaning as in the description of PCLquery. If the requested functionality is available, the appropriate performance counters are cleared and started. On success, PCL_SUCCESS is returned, otherwise an error code is returned.

int PCLstart(
             PCL_DESCR_TYPE descr,   /* I: handle */
             int *counter_list,      /* I: events to be counted */
             int ncounter,           /* I: number of counters */
             unsigned int mode       /* I: mode flags (PCL_MODE_xxx) */
            );

PCLread

Reads out performance counters and returns counter values. Each result value is written either into the (user supplied) integer-typed buffer i_result_list or into the (user supplied) floating point typed buffer fp_result_list, both of size ncounter. PCL_CNT_TYPE is a 64-bit integer type, PCL_FP_CNT_TYPE is a 64-bit floating point type. Which of the buffers is used for the i-th result depends on the requested i-th event type. If the i-th event type is less than PCL_MFLOPS, the result is an integer value which is stored in i_result_list[i]. If the i-th event type is greater than or equal to PCL_MFLOPS (i.e. belongs to the category rates and ratios), the result is a floating point value stored in fp_result_list[i]. If the i-th result is stored in i_result_list[i], the content of fp_result_list[i] is undefined, and vice versa.

The arguments supplied with the call to PCLread must correspond to the latest call to PCLstart, i.e. the number of requested performance counters must be equal. If no error occurs, PCL_SUCCESS is returned, otherwise an error code. The performance counters are (logically) not stopped.

int PCLread(
            PCL_DESCR_TYPE descr,             /* I: handle */
            PCL_CNT_TYPE *i_result_list,      /* O: int counter values */
            PCL_FP_CNT_TYPE * fp_result_list, /* O: fp counter values */
            int ncounter                      /* I: number of events */
            );
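The buffer selection rule for PCLread can be wrapped in a small helper, sketched below under stated assumptions: the typedefs are stand-ins for the PCL types, the function names are hypothetical, and the value of PCL_MFLOPS is passed in as a parameter (first_rate_event) so that the sketch does not depend on the PCL header.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the PCL result types (64-bit integer and 64-bit floating
 * point according to the text); real code would use the PCL typedefs. */
typedef long long PCL_CNT_TYPE;
typedef double PCL_FP_CNT_TYPE;

/* Nonzero if the event is a rate or ratio, i.e. its result is delivered in
 * the floating point buffer. first_rate_event is the value of PCL_MFLOPS. */
int event_is_rate(int event, int first_rate_event)
{
    return event >= first_rate_event;
}

/* Return the i-th result as a double, whichever buffer holds it. */
double result_as_double(const int *events, int i, int first_rate_event,
                        const PCL_CNT_TYPE *i_result_list,
                        const PCL_FP_CNT_TYPE *fp_result_list)
{
    assert(events != NULL && i_result_list != NULL && fp_result_list != NULL);
    assert(i >= 0);
    if (event_is_rate(events[i], first_rate_event))
        return fp_result_list[i];      /* rates and ratios */
    return (double)i_result_list[i];   /* plain event counts */
}
```

Converting a count to double loses precision above 2^53, so code that needs exact counts would read i_result_list[i] directly.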

PCLstop

Stops performance counting and returns counter values. Result values are written into the (user supplied) buffers i_result_list or fp_result_list both of size ncounter. See PCLread for a description how the results are stored in the two arrays. The arguments supplied with the call to PCLstop must correspond to the latest call to PCLstart, i.e. the number of requested performance counters must be equal. If no error occurs, PCL_SUCCESS is returned, otherwise an error code.

int PCLstop(
            PCL_DESCR_TYPE descr,              /* I: handle */
            PCL_CNT_TYPE *i_result_list,       /* O: int counter values */
            PCL_FP_CNT_TYPE * fp_result_list,  /* O: fp counter values */
            int ncounter                       /* I: number of events */
           );
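The typical call sequence of the high-level interface (PCLinit, PCLquery, PCLstart, PCLstop, PCLexit) might look like the sketch below. So that it compiles without PCL installed, the first section provides stand-in types, constants, and stub functions; their numeric values and bodies are placeholders and not the library's. The names measure_region and example_work are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* ---- Stand-ins so this sketch compiles without PCL installed. ----
 * With the real library, delete this section and include the PCL header
 * instead. All types, numeric values, and stub bodies are placeholders. */
typedef int PCL_DESCR_TYPE;
typedef long long PCL_CNT_TYPE;
typedef double PCL_FP_CNT_TYPE;
enum { PCL_SUCCESS = 0, PCL_FAILURE = -1 };
enum { PCL_CYCLES = 1, PCL_INSTR = 2 };
enum { PCL_MODE_USER = 1 };

static int PCLinit(PCL_DESCR_TYPE *addr_descr) { *addr_descr = 1; return PCL_SUCCESS; }
static int PCLexit(PCL_DESCR_TYPE descr) { (void)descr; return PCL_SUCCESS; }
static int PCLquery(PCL_DESCR_TYPE descr, int *counter_list, int ncounter,
                    unsigned int mode)
{
    (void)descr; (void)counter_list; (void)mode;
    return ncounter > 0 ? PCL_SUCCESS : PCL_FAILURE;
}
static int PCLstart(PCL_DESCR_TYPE descr, int *counter_list, int ncounter,
                    unsigned int mode)
{
    return PCLquery(descr, counter_list, ncounter, mode);
}
static int PCLstop(PCL_DESCR_TYPE descr, PCL_CNT_TYPE *i_result_list,
                   PCL_FP_CNT_TYPE *fp_result_list, int ncounter)
{
    (void)descr;
    for (int i = 0; i < ncounter; i++) {   /* fake counter values */
        i_result_list[i] = 1000LL * (i + 1);
        fp_result_list[i] = 0.0;
    }
    return PCL_SUCCESS;
}

/* ---- The measurement pattern itself. ---- */

/* Example workload to be measured. */
static volatile double example_sink;
void example_work(void)
{
    double sum = 0.0;
    for (int i = 0; i < 1000; i++)
        sum += 0.5 * i;
    example_sink = sum;
}

/* Count the given events in user mode while work() runs.
 * Returns PCL_SUCCESS or the first error code encountered. */
int measure_region(void (*work)(void), int *events, int nevents,
                   PCL_CNT_TYPE *i_results, PCL_FP_CNT_TYPE *fp_results)
{
    PCL_DESCR_TYPE descr;
    int rc;

    assert(work != NULL && events != NULL);
    assert(i_results != NULL && fp_results != NULL);

    rc = PCLinit(&descr);                       /* allocate the handle */
    if (rc != PCL_SUCCESS)
        return rc;

    rc = PCLquery(descr, events, nevents, PCL_MODE_USER);  /* countable? */
    if (rc == PCL_SUCCESS)
        rc = PCLstart(descr, events, nevents, PCL_MODE_USER);
    if (rc == PCL_SUCCESS) {
        work();                                 /* code under measurement */
        rc = PCLstop(descr, i_results, fp_results, nevents);
    }

    PCLexit(descr);                             /* release the handle */
    return rc;
}
```

Calling PCLquery first is optional in this pattern, since PCLstart also reports an error if the events cannot be counted; it is included here because it allocates no resources and gives an early answer.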

Low-Level Interface

The low-level interface should only be used in rare circumstances. It allows direct access to hardware performance counters in a uniform way. The user has to be aware of the events and event encodings of the processor in use. This is different from the high-level interface, where an abstraction layer hides all low-level and non-portable details. The low-level interface is only accessible from C/C++.

The low-level interface was introduced in version 2.0 and feedback on its design and usage is welcome. The interface may change in the future.

Before using any of the driver functions, a handle needs to be allocated by a call to PCLinit(PCL_DESCR_TYPE *descr). After using the driver routines, a call to PCLexit(PCL_DESCR_TYPE descr) must be issued to release the handle.

PCL_driver_info

This function returns information on the processor in use. The function returns PCL_SUCCESS if the operation didn't produce any error.

int PCL_driver_info(
                    PCL_PROCESSOR_INFO *info /* I: address of info struct */
                    );

The type PCL_PROCESSOR_INFO is a struct with at least the following components:

typedef struct
{
    char *vendor;   /* processor vendor */
    char *family;   /* processor family */
    char *model;    /* processor model */
    int mhz;        /* MHz rate */
    int ncounters;  /* number of counters (at least this number) */
}  PCL_PROCESSOR_INFO;

PCL_driver_open

This function has to be called once and before any other driver call to open the hardware driver interface. The function returns PCL_SUCCESS if the operation could be successfully done.

int PCL_driver_open(
                     PCL_DESCR_TYPE descr /* I: handle */
                   );

PCL_driver_start

Starts performance counting. max_counter specifies the maximum counter index (at most PCL_COUNTER_MAX). counter_used_mask is a bit field in which bits 0 to max_counter specify which counters should be started. For example, to measure counters 0, 3, and 4, pass 4 or higher for max_counter and a bit field that has a 1 at bit positions 0, 3, and 4 and a 0 elsewhere (bit numbering starts at 0). If no error occurs, PCL_SUCCESS is returned, otherwise an error code.

int PCL_driver_start(
        PCL_DESCR_TYPE descr,                /* I: handle */
        int max_counter,                     /* I: max. counter index */
        PCL_BIT_MASK_TYPE counter_used_mask, /* I: bit-field of counters to use */
        PCL_DRIVER_COMMAND_TYPE *commands,   /* I: event commands for counters */
        unsigned int count_mode              /* I: count mode */
        );
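
The construction of the bit field for counters 0, 3, and 4 can be sketched in plain C. This is only an illustration under stated assumptions: example_mask_t is a hypothetical stand-in for PCL_BIT_MASK_TYPE (whose real definition comes from pcl.h and may differ), and both helper functions are invented for this sketch and are not part of PCL.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for PCL_BIT_MASK_TYPE; the real type is defined in pcl.h. */
typedef unsigned long example_mask_t;

/* Build a bit field with bit i set for every counter index i in idx[0..n-1]. */
example_mask_t example_build_mask(const int *idx, size_t n)
{
    example_mask_t mask = 0;
    size_t k;

    for (k = 0; k < n; ++k) {
        /* each index must fit into the mask type */
        assert(idx[k] >= 0 && idx[k] < (int)(sizeof(example_mask_t) * CHAR_BIT));
        mask |= (example_mask_t)1 << idx[k];
    }
    return mask;
}

/* Highest counter index used, i.e. the smallest valid value for max_counter
   (-1 if no index is given). */
int example_max_index(const int *idx, size_t n)
{
    int max = -1;
    size_t k;

    for (k = 0; k < n; ++k)
        if (idx[k] > max)
            max = idx[k];
    return max;
}
```

For the index list {0, 3, 4}, example_build_mask yields 0x19 (binary 11001) and example_max_index yields 4, which matches the description above.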

PCL_driver_read

Reads out performance counters. The values read are returned in counter_values, a user-supplied buffer capable of storing max_counter values. max_counter specifies the maximum counter index (at most PCL_COUNTER_MAX). counter_used_mask is a bit field specifying the hardware counters to be read. If no error occurs, PCL_SUCCESS is returned, otherwise an error code.

int PCL_driver_read(
   PCL_DESCR_TYPE descr,    /* I: handle */
   int max_counter,         /* I: max. counter index */
   PCL_BIT_MASK_TYPE counter_used_mask,  /* I: bit-field of counters to be read */
   PCL_CNT_TYPE *counter_values     /* I/O: buffer for results */
   );

PCL_driver_stop

Stops performance counting. If no error occurs, PCL_SUCCESS is returned, otherwise an error code.

int PCL_driver_stop(
      PCL_DESCR_TYPE descr   /* I: handle */
      );

PCL_driver_close

Closes the driver and releases all allocated resources. If no error occurs, PCL_SUCCESS is returned, otherwise an error code.

int PCL_driver_close(
         PCL_DESCR_TYPE descr /* I: handle */
        );

Useful Macros

There are two macros defined:
  1. PCL_EVENT_IS_INT(e) determines whether the result of an event e is of integer type (64 bits) or of floating point type (64 bits).
  2. PCL_EVENT_IS_RATE(e) determines whether the result for an event e is an event count or an event rate (counts may be added; adding rates makes less sense).

Programming Aspects

The allowed calling sequence is one call to PCLstart, followed by zero or more calls to PCLread, followed by one call to PCLstop. Between a call to PCLstart and the matching call to PCLstop (and any calls to PCLread), further allowed calling sequences may be nested, provided they use the same number of events and the same event types.
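
One way to picture nested calling sequences is a stack of counter values recorded at each start. The following sketch is a stand-alone model and makes no claim about how PCL implements nesting internally; the type example_nest, the functions, and the depth limit are hypothetical, and the counter values are passed in by the caller instead of being read from hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_MAX_DEPTH 16   /* assumed nesting limit for this sketch */

typedef struct {
    uint64_t start_value[EXAMPLE_MAX_DEPTH]; /* counter value at each open start */
    int depth;                               /* number of currently open starts */
} example_nest;

void example_nest_init(example_nest *s)
{
    assert(s != NULL);
    s->depth = 0;
}

/* Open a (possibly nested) measurement. Returns 0 on success, -1 if too deep. */
int example_nest_start(example_nest *s, uint64_t counter_now)
{
    if (s->depth >= EXAMPLE_MAX_DEPTH)
        return -1;
    s->start_value[s->depth++] = counter_now;
    return 0;
}

/* Report the innermost open measurement without closing it. */
int example_nest_read(const example_nest *s, uint64_t counter_now, uint64_t *result)
{
    if (s->depth == 0)
        return -1;                /* read without a preceding start */
    *result = counter_now - s->start_value[s->depth - 1];
    return 0;
}

/* Report and close the innermost open measurement. */
int example_nest_stop(example_nest *s, uint64_t counter_now, uint64_t *result)
{
    if (example_nest_read(s, counter_now, result) != 0)
        return -1;                /* stop without a matching start */
    s->depth--;
    return 0;
}
```

In this model, an inner start/stop pair reports only the events of the inner section, while the enclosing pair reports the events of the whole outer section including the inner one.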

On systems with virtual (low-level) performance counters, migrating a process to another processor is possible (SGI, AIX). On the other systems, we either bind the executing process to a processor (DEC, Solaris), or the process cannot migrate (CRAY). On Solaris systems, if the process is not bound to a specific processor, it gets bound to processor 0 when executing the PCLstart function. On DEC systems, the process gets bound to the processor it is currently running on. If you use pthreads on Solaris systems, you must bind each thread to a processor.

Currently, performance counters are not saved on context switches on Linux systems by our library and therefore performance measurements should be done only on a lightly loaded system.

Currently, we do not check whether any other process uses the performance counters as well. Therefore, on certain systems, two distinct processes that use performance counters in parallel may disturb each other.

To avoid overflow, e.g. on systems with 32-bit hardware counters, an interval timer is set up on these systems (Solaris, AIX, Linux) which interrupts the process every second. Programs which use the setitimer system call (or the SIGVTALRM signal) may conflict with PCL.
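
Why periodic sampling is sufficient can be shown with a small sketch: as long as a 32-bit counter is sampled more often than it can wrap around, the unsigned difference between two consecutive samples gives the correct increment, and the increments can be accumulated in a 64-bit value. The code below is an illustration only, not PCL's implementation; the type example_wide_counter and both functions are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software extension of a 32-bit hardware counter to 64 bits. */
typedef struct {
    uint32_t last_raw;  /* raw hardware value seen at the previous sample */
    uint64_t total;     /* accumulated 64-bit event count */
} example_wide_counter;

void example_wide_init(example_wide_counter *c, uint32_t raw_now)
{
    assert(c != NULL);
    c->last_raw = raw_now;
    c->total = 0;
}

/* Must be called at least once per wrap period of the hardware counter,
   e.g. from a periodic timer handler. Returns the accumulated count. */
uint64_t example_wide_sample(example_wide_counter *c, uint32_t raw_now)
{
    /* The result is reduced modulo 2^32, so a single wrap-around
       between two samples still yields the correct increment. */
    uint32_t delta = raw_now - c->last_raw;

    assert(c != NULL);
    c->total += delta;
    c->last_raw = raw_now;
    return c->total;
}
```

If the raw counter moves from 0xFFFFFFF0 to 0x00000010 between two samples, the computed increment is 0x20, although the raw value has become smaller.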

Supported Systems

Currently, the Performance Counter Library is available on the systems listed above:

Examples

Simple Example

Below is a simple example program showing how to use the Performance Counter Library. First, the list of requested events (PCL_CYCLES for machine cycles, and PCL_INSTR for instructions) is put into the array counter_list. With the call to PCLquery we test whether it is possible to serve these two requested events simultaneously on the computer system where the program is executed. If this is possible, event counting is started with the call to PCLstart. After that follows the code to be measured and a call to PCLstop to stop performance counting and to read out the performance counter values. Then, the results are printed.

#include <stdio.h>
#include <pcl.h>

void do_work(){}

int main(int argc, char **argv)
{
    int counter_list[2];
    int ncounter;
    unsigned int mode;
    PCL_CNT_TYPE i_result_list[2];
    PCL_FP_CNT_TYPE fp_result_list[2];
    PCL_DESCR_TYPE descr;


    /* Allocate a handle */
    if(PCLinit(&descr) != PCL_SUCCESS)
        printf("cannot get handle\n");

    /* Define what we want to measure. */
    ncounter = 2;
    counter_list[0] = PCL_CYCLES;
    counter_list[1] = PCL_INSTR;

    /* define count mode */
    mode = PCL_MODE_USER;

    /* Check if this is possible on the machine. */
    if(PCLquery(descr, counter_list, ncounter, mode) != PCL_SUCCESS)
        printf("requested events not possible\n");

    /* Start performance counting.
       We have already checked the requested functionality
       with PCLquery, so no error check would be necessary. */
    if(PCLstart(descr, counter_list, ncounter, mode) != PCL_SUCCESS)
        printf("something went wrong\n");

    /* Here comes the work to be measured. */
    do_work();

    /* Stop performance counting and get the counter values. */
    if(PCLstop(descr, i_result_list, fp_result_list, ncounter) != PCL_SUCCESS)
        printf("problems with stopping counters\n");

    /* print out results */
    printf("%f instructions in %f cycles\n",
           (double)i_result_list[1], (double)i_result_list[0]);

    /* Deallocate handle */
    if(PCLexit(descr) != PCL_SUCCESS)
        printf("cannot release handle\n");

    return 0;
}

Example with Nested Calls

Below is an example showing how to use nested calls. In this example, the number of cycles spent is measured for the outer loop as a whole as well as for each individual iteration.

#include <stdio.h>
#include <pcl.h>

#define NITER 4

void do_work(){}

int main(int argc, char **argv)
{
    int counter_list[1];
    int ncounter, res, iter;
    unsigned int mode;
    PCL_CNT_TYPE i_all_result_list, i_result_list[NITER];
    PCL_FP_CNT_TYPE fp_all_result_list, fp_result_list[NITER];
    PCL_DESCR_TYPE descr;


    /* Allocate a handle */
    if(PCLinit(&descr) != PCL_SUCCESS)
        printf("cannot get handle\n");

    /* Define what we want to measure. */
    ncounter = 1;
    counter_list[0] = PCL_CYCLES;

    /* define count mode */
    mode = PCL_MODE_USER;

    /* Start performance counting. */
    res = PCLstart(descr,counter_list, ncounter, mode);

    for(iter = 0; iter < NITER; ++iter)
      {
        /* Start performance counting. */
        res = PCLstart(descr, counter_list, ncounter, mode);

        /* Here comes the work to be measured. */
        do_work();

        /* Stop performance counting and get counter values. */
        res = PCLstop(descr, &i_result_list[iter], &fp_result_list[iter], ncounter);
      }

    /* Stop performance counting and get the counter values. */
    res = PCLstop(descr, &i_all_result_list, &fp_all_result_list, ncounter);

    /* print out results */
    printf("used cycles: %f %f %f %f, total: %f\n",
           (double)i_result_list[0], (double)i_result_list[1],
           (double)i_result_list[2], (double)i_result_list[3],
           (double)i_all_result_list);

    /* Deallocate handle */
    if(PCLexit(descr) != PCL_SUCCESS)
        printf("cannot release handle\n");

    return 0;
}

Example in Java

Below is an example showing how to use PCL in Java.

public class pcl_jtest {
  static final int N = 200;               // matrix dimension
  static double[][] a = new double[N][N];
  static double[][] b = new double[N][N];
  static double[][] c = new double[N][N];

  // test method
  static void matadd(double[][] a, double[][] b, double[][] c) {
    int i, j;
    for (i = 0; i < N; ++i)
      for (j = 0; j < N; ++j)
        a[i][j] = b[i][j] + c[i][j];
  }

  // main program
  public static void main(String[] args) {
    int event;
    long descr = 0;                      // descriptor
    PCL pcl = new PCL();                 // instantiate PCL
    int mode = pcl.PCL_MODE_USER_SYSTEM; // count mode
    int[] events = new int[1];           // events; array required
    long[] i_result = new long[1];       // int results; array required
    double[] fp_result = new double[1];  // fp results


    if(pcl.PCLinit(descr) != pcl.PCL_SUCCESS) 
      System.out.println("problem with init");

    // test supported events
    for(event = 0; event < pcl.PCL_MAX_EVENT; ++event) {
        events[0] = event;
        if(pcl.PCLquery(descr, events, 1, mode) == pcl.PCL_SUCCESS) {
            // start counting
            if(pcl.PCLstart(descr, events,1,mode) != pcl.PCL_SUCCESS)
                System.out.println("problem with starting event");
			
            // test program
            matadd(a,b,c);

            // stop counting
            if(pcl.PCLstop(descr,i_result,fp_result,1) != pcl.PCL_SUCCESS)
                System.out.println("problem with stopping event");

            // print result for this event
            if(event < pcl.PCL_MFLOPS)
              // integer result
              System.out.println(pcl.PCLeventname(event)+":"+i_result[0]);
            else
              // floating point result
              System.out.println(pcl.PCLeventname(event)+":"+fp_result[0]);
          }
      }

    if(pcl.PCLexit(descr) != pcl.PCL_SUCCESS) 
      System.out.println("problem with exit");
   }
}

Using the low-level Interface

Below is an example showing how to use the low-level interface.

#include <stdio.h>
#include <pcl.h>

void do_work(){}

int main(int argc, char **argv)
{
    int res, counter_index;
    unsigned int count_mode;
    PCL_DESCR_TYPE descr;
    PCL_PROCESSOR_INFO info;
    PCL_BIT_MASK_TYPE counter_used_mask;
    PCL_DRIVER_COMMAND_TYPE commands[PCL_COUNTER_MAX];
    PCL_CNT_TYPE counter_values[PCL_COUNTER_MAX];


    /* Allocate a handle */
    if(PCLinit(&descr) != PCL_SUCCESS)
        printf("cannot get handle\n");

    /* get processor info */
    if((res = PCL_driver_info(&info)) != PCL_SUCCESS)
      printf("error on PCL_driver_info (%d)\n", res);
    else
      {
        printf("processor vendor  : %s\n", info.vendor);
        printf("processor family  : %s\n", info.family);
        printf("processor model   : %s\n", info.model);
        printf("processor speed   : %d MHz\n", info.mhz);
        printf("number of counters: >=%d\n", info.ncounters);
      }

    /* open driver */
    if((res = PCL_driver_open(descr)) != PCL_SUCCESS)
      printf("error on PCL_driver_open (%d)\n", res);

    /* This is for a DEC Alpha 21164: count FP operations */
    counter_index = 1;
    commands[counter_index] = 0x0a;

    counter_used_mask = (0x1 << counter_index);
    count_mode = PCL_MODE_USER;
    if((res = PCL_driver_start(descr, PCL_COUNTER_MAX, counter_used_mask,
			       commands, count_mode)) != PCL_SUCCESS)
      printf("error on PCL_driver_start (%d)\n", res);

    /* do some work */
    do_work();

    /* read counter */
    if((res = PCL_driver_read(descr, PCL_COUNTER_MAX, counter_used_mask,
			      counter_values)) != PCL_SUCCESS)
      printf("error on PCL_driver_read (%d)\n", res);
    else
      printf("%.0f floating point instructions\n",
	     (double)counter_values[counter_index]);

    /* stop counting */
    if((res = PCL_driver_stop(descr)) != PCL_SUCCESS)
      printf("error on PCL_driver_stop (%d)\n", res);

    /* close driver */
    if((res = PCL_driver_close(descr)) != PCL_SUCCESS)
      printf("error on PCL_driver_close (%d)\n", res);

    /* deallocate handle */
    if(PCLexit(descr) != PCL_SUCCESS)
        printf("cannot release handle\n");

    return 0;
}

Related Projects

The Parallel Tools Consortium has defined a subproject called PAPI. Its main aim is to define an API to access all system-specific hardware performance counters, i.e. to start, read out, and stop all hardware performance counters on a microprocessor with all events available on that system. This approach differs from ours, as we focus on a single framework on all systems, i.e. a uniform application interface as well as a well-defined set of events accessible with uniform names on all systems. For the PerfAPI project, have a look at http://www.cs.utk.edu/~mucci/pdsa/.

There are many interfaces to access performance counters on one specific system, e.g. libperfex on SGI systems with the R10000 processor or the pfm device on Digital Unix systems (21064 or 21164 processors). To establish a common platform for performance counting on all POWER and PowerPC microprocessors, IBM has defined an application interface called PMapi. Their approach, too, is to define the set of possible events as the union of all possible events on all POWER and PowerPC microprocessors. On Linux systems, libpperf supports all Pentium, PentiumPro, and Pentium II processors through a common interface.

Summary

PCL - the Performance Counter Library - is a common interface for portable performance counting on modern microprocessors. It is intended to be used by the expert application programmer who wishes to do detailed analysis on program performance, and it is intended to be used by tool writers who need a common platform to base their work on.

The application interface supports querying for functionality, starting and stopping performance counting, and reading out the values of the performance counters. Nested calls to the functions are possible (with the same events), which allows hierarchical performance measurements on sections and subsections of a program. Further, performance counting in user mode, system mode, and user-or-system mode can be distinguished. Language bindings are available for C, C++, Fortran, and Java.

PCL is available at

http://www2.inf.h-brs.de/rberre2m/PCL/

and

http://www.fz-juelich.de/zam/PCL/

Acknowledgments

We would like to thank those hardware vendors who supported us with detailed information on performance counters, access to test systems, and/or test hardware, namely Cray, Hitachi, IBM, Intel, and SGI.

Additionally, we would like to thank those people who have written software we based our work on. For the current PCL version this is perfctr from Mikael Pettersson.

For older PCL versions, we would like to thank Richard Enbody for perfmon on UltraSPARC-systems, and M. Patrick Goda and Michael S. Warren for libpperf which itself is based on the msr device implemented by Stephan Meyer on Linux version 2.0.x, 2.1.x, and 2.2.x.

Bibliography

1
Kirk W. Cameron and Yong Luo.
Performance evaluation using hardware performance counters.
http://www.c3.lanl.gov/~kirk/isca99/.

2
Digital Equipment Corporation, Maynard, Massachusetts.
man 7 pfm.

3
Digital Equipment Corporation, Maynard, Massachusetts.
Alpha AXP Architecture Handbook, version 2 edition, 1994.

4
Silicon Graphics Inc.
man libperfex.

5
MIPS Technologies Inc., Mountain View, California.
Definition of MIPS R12000 Performance-counter.

6
Marco Zagha et al.
Performance Analysis using the MIPS R10000 Performance Counters.
In Supercomputing 96. IEEE Computer Society, 1996.

7
Sun Microsystems, Palo Alto, California.
UltraSPARC User's Manual, 1997.

8
SPARC International, Inc.
The SPARC Architecture Manual, Version 9, 1997.

9
Motorola Inc., IBM.
The PowerPC Family : The Bus Interface for 32-Bit Microprocessors, 3 1997.

10
James E. Smith and Shlomo Weiss.
POWER and PowerPC.
Morgan Kaufmann Publishers, Inc., 1994.

11
Motorola Inc., IBM.
PowerPC 604e RISC Microprocessor User's Manual, 3 1998.

12
http://developer.intel.com/drg/mmx/AppNotes/perfmon.htm.

13
Intel Corp.
Pentium Pro Family Developers Manual 1-3, 1997.

14
Intel.
Intel IA-64 Architecture Software Developer's Manual, volume 4.
January 2000.


Performance Counters on Microprocessors

This chapter introduces performance counting aspects of commonly used microprocessors. Each section introduces a microprocessor family and is divided into three subsections: base information on the microprocessor, performance counter events sorted by each performance counter, and in the third subsection additional comments and references to existing implementations to access the performance counters on that specific microprocessor. The second part of each section, the description of the performance counters and their events, is given for each event as follows. The first line contains an event identifier which is composed of the name of the microprocessor (e.g. 21164 for the Alpha 21164), the number of the performance counter (e.g. C0 for counter 0), and a number giving the event number. We will refer to the whole name as a unique identifier in all chapters. The next line contains a manufacturer-specific name or definition (in italics) of the event as found in the manufacturer's literature. After that, a description of the event follows.

DEC Alpha

To use performance counters on DEC Alpha microprocessors, additional software support is necessary as the low-level interface is given in PAL-Code. Tru64 (formerly Digital Unix) has the pseudo device pfm [2] which has a high-level interface based on ioctl-calls to access the performance counters. The pfm-device on systems distinguishes between user and system mode event counting. Only one process per CPU can open the device, but child processes can be spawned which influence the performance counters as well.

On the CRAY T3E, which uses the 21164 microprocessor too, there is no software interface published to access the performance counters.

DEC Alpha 21164

The RISC-processor DEC Alpha 21164 has 3 performance counters. First, let's have a closer look at the architecture of the microprocessor. The first level of caches contains an instruction cache (ICACHE) and a data cache (DCACHE), each having a size of 8 KB. The second level cache (SCACHE) has a size of 96 KB and buffers both instructions and data. An additional option is an external third level cache (BCACHE). A detailed description of the Alpha architecture can be found in [3].

The 21164 contains pipelines of the following types:

The performance counter part on the DEC Alpha 21164 contains 3 counters with distinct purposes. Roughly speaking, counter 0 counts machine cycles or issued instructions, counter 1 counts successful operations, and counter 2 counts unsuccessful operations. For the counters, 2, 24, and 23 different events are defined, respectively, and the counters can operate in parallel. There is one restriction: when counting certain events on counter 2, counter 1 gathers special events. The counters are 16 bits (counters 0 and 1) and 14 bits (counter 2) wide. The cycle counter is 64 bits wide, but only the lower 32 bits contain cycle values; the upper 32 bits are OS specific.

Events countable on the DEC Alpha 21164 are:

DEC Alpha 21264

The DEC Alpha 21264 is a four-way out-of-order-issue microprocessor that performs dynamic scheduling, register renaming, and speculative execution. There are 4 integer execution units and 2 floating-point execution units. The processor includes a 64 KB 1st level instruction cache and a 64 KB 1st level data cache. The 21264 has 2 performance counters, each 20 bits wide. Counter 0 is capable of counting one of 2 different events, and counter 1 is capable of counting one of 7 different events. Therefore, the ability to do a detailed performance analysis on the 21264 is significantly reduced compared to the 21164.

Events countable on the DEC Alpha 21264 are:

MIPS Family

The microprocessors R10000 and R12000 of MIPS are 64 Bit RISC-microprocessors with integrated performance counters.

Software support for the performance counters on R10000 processors is available either on a lower level in IRIX 6.x through the /proc file system or on a higher level through the perfex library [4]. The kernel maintains data structures for 32 virtual performance counters with a size of 64 bits each. It is possible to distinguish between counting in user mode, system mode, or both. When running in user mode, performance counters are saved on context switches. For the perfex library, the routine start_counters zeroes out the internal counters, and read_counters stops the counters after reading them.

R10000

The R10000 processor has 64 physical registers and 32 logical registers. The 1st level cache is split into a data cache and an instruction cache, both of size 32 KB. The 2nd level cache can be between 512 KB and 16 MB in size and is a unified buffer, as it caches data as well as instructions. The main memory can be up to 1 TB.

The R10000 microprocessor has 2 performance counters (a description can be found at http://www.sgi.com/processors/r10k/performance.html) each capable of counting one of 16 different events. The hardware counters are 32 bit wide. The R10000 has 5 execution pipelines executing decoded instructions. There are 2 integer pipelines (ALU1, ALU2), 2 floating point pipelines (FPU1, FPU2), and 1 address pipeline (LOAD/STORE). The integer and floating point pipelines can operate in parallel. For a better understanding we define the two following terms:

Another term to be defined is SCTP-Logic which is the Secondary Cache Transaction Processing Logic, which has the task to store up to 4 internally generated or 1 externally generated 2nd level cache transactions.

R12000

Unlike the R10000, the R12000 has 4 counters, each capable of counting one of 32 events. For counter 1, a trigger mechanism was included such that an event is counted by counter 1 if any of the other counters has reached a certain value. Additionally, conditional counting is possible. For example, it is possible to count the number of cycles in which 4 instructions have been completed. Also, some semantic inaccuracies concerning the definition of events have been clarified [5]. An introduction to measurement and interpretation of events can be found in [6]. The counters are 32 bits wide.

The 4 counters may count any of the 32 events:

SUN ULTRASparc

Performance registers of UltraSPARC processors are controlled by the Performance Control Register (PCR), which can be accessed only in privileged mode. Accesses to the PIC registers may be either in user or privileged mode, depending on a bit in the PCR which can be changed in privileged mode. Event counting can be done for user mode, system mode, or both. The counters overflow silently. For accurate timing, event counts should be obtained by taking the difference between two reads of a performance counter.

Current versions of the Solaris operating system support performance counters in the form of a programming interface (see man cpc).

UltraSPARC I/II

The UltraSPARC I/II 64-bit microprocessors of SUN have the possibility to count performance relevant events. A detailed description of the SPARC V9 architecture can be found in [7]. Both variants have 8 times 24 64-bit registers which are organized in so-called windows to optimize argument passing on subroutine calls without time-consuming copying of registers to memory. The 1st level cache has a 16 KB data (D-cache) and a 16 KB instruction cache (I-Cache). The 2nd level cache (E-cache) has a size of 512 KB up to 4 MB on UltraSPARC I, and 512 KB up to 16 MB on UltraSPARC II. The main memory can be as large as 2 TB. Another important component of the supporting logic is the UPA, the Universal Port Architecture, which connects several processors over a high-speed crossbar-switch.

The microprocessor contains two performance counters (PIC0, PIC1), which are able to count different events. Each counter can count one of 12 different events, two events can be counted on both counters, which sums up to a total of 22 different events [8]. The counters are 32 bit wide. Additionally, there exists a 64-bit elapsed cycle counter.

There are hardware errata documenting problems in counting certain events. These problems might affect events of type PCL_L2CACHE_MISS, PCL_L1DCACHE_MISS, PCL_L1ICACHE_READWRITE, PCL_L1ICACHE_HIT, PCL_L1ICACHE_MISS.

UltraSPARC III

The microprocessor contains two performance counters (PIC0, PIC1), each of them 32 bit wide. Additionally, there exists a 64-bit elapsed cycle counter.

PowerPC

PowerPC 604

The PowerPC 604 has 2 performance counters.

PowerPC 604e

The PowerPC 604e is a 32-bit microprocessor with 32 32-bit integer and 32 32-bit floating point registers. The 1st level cache consists of a 32 KB data cache (D-cache) and a 32 KB instruction cache (I-cache). Unlike other microprocessors, the PowerPC 604e has no on-chip logic to control a 2nd level cache, but signals are available for additional cache logic [9]. Additionally, there exist performance counter events concerning the 2nd level cache. A detailed description of the PowerPC architecture can be found in [10]. The performance counters are 32 bits wide.

The pipelines of the PowerPC 604e consist of:

Sub-unit names are:

The PowerPC 604e has 4 performance counters (PMC1/PMC2/PMC3/PMC4) capable of counting 116 different events [11].

IBM has the PMapi library which supports access to the performance counters on different PowerPC and POWER chips. PMapi supports the distinction between supervisor mode, problem (user) mode, or both. On AIX versions 4.2 and higher, performance counter status is saved and restored on context switches.

POWER3

The POWER3 has 8 performance counters. Missing counter/event pairs mean an unused event for that counter. To be consistent with the numbering scheme, we start numbering with POWER3_C0_xx for counter 1.

POWER3-II

The POWER3-II has 8 performance counters. Missing counter/event pairs mean an unused event for that counter. To be consistent with the numbering scheme, we start numbering with POWER3II_C0_xx for counter 1.

Intel Pentium Family

Intel Pentium

The Intel Pentium is a 32-bit CISC microprocessor. The Pentium has 2 performance counters; most events are countable by either counter, and only some events are countable by a specific counter only (as noted). With the introduction of the MMX extensions, Pentium processors with MMX define additional events (marked as MMX-extensions). We have left out all events which are specific to the MMX functional unit, as compilers normally do not generate code for this unit. The performance counters are 40 bits wide; the time stamp counter is 64 bits wide.

The Time Stamp Counter counts the elapsed machine cycles:

The events countable by both counters are:

By default, the instructions RDMSR and WRMSR to access the performance counter registers are kernel-mode instructions (ring 0).

Software tools concerning the performance counters on Pentium-like processors are described in [12]. On Linux systems, libpperf is available to access the performance counters. It was written by M. Patrick Goda and Michael S. Warren from Los Alamos National Laboratory. libpperf itself is based on the msr device implemented by Stephan Meyer for Linux 2.0.x and 2.1.x.

Mikael Pettersson has written perfctr, a kernel patch for all recent kernel versions together with a user library, to access performance counters on Intel IA32 processors.

Intel PentiumPro/Pentium II/Pentium III

To keep binary compatibility with the predecessor processors, the PentiumPro, Pentium II, and Pentium III have 8 registers, 32 bit width each. First level cache is 8 KB for instructions (ICache) and 8 KB for data (DCache) on PentiumPro, and 16 KB for both caches on Pentium II and Pentium III. As the PentiumPro, Pentium II, and Pentium III are CISC-microprocessors (complex instruction set computer), every instruction is divided internally into micro-operations (UOP's) of fixed length. Dependent on the complexity of the instruction, the instruction is divided into 1-4 UOP's.

The PentiumPro, Pentium II, and Pentium III have 2 performance counters capable of counting a total of 77 different events (at most two at a time), some of them with an additional unit mask as parameter to further subdivide the event type. Some of the events are countable only by a specific counter. The Pentium III has 4 additional events concerning Streaming SIMD Extensions. Performance counter registers are 40 bits wide; the time stamp counter is 64 bits wide. With a special instruction (WRMSR) it is possible to write values into the performance counter registers. Care has to be taken, as this instruction writes only the lower 32 bits; the upper 8 bits are sign-extended from bit 31. For the time stamp counter, the upper 8 bits are set to 0.
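
The effect of the sign extension on a written counter value can be worked out with a few lines of C. The function below merely models the rule stated above (lower 32 bits written, bits 32 to 39 copied from bit 31); it does not execute WRMSR, and its name and the EXAMPLE_ macros are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define EXAMPLE_PMC_BITS 40
#define EXAMPLE_PMC_MASK ((UINT64_C(1) << EXAMPLE_PMC_BITS) - 1)

/* Model of the 40-bit counter contents after writing the 32-bit value low32:
   bits 0..31 are taken from low32, bits 32..39 are copies of bit 31. */
uint64_t example_pmc_after_write(uint32_t low32)
{
    uint64_t value = low32;

    if (low32 & UINT32_C(0x80000000))
        value |= UINT64_C(0xFF) << 32;          /* sign extension into bits 32..39 */

    assert((value & ~EXAMPLE_PMC_MASK) == 0);   /* result fits into 40 bits */
    return value;
}
```

Writing 0x7FFFFFFF leaves the upper 8 bits at 0, whereas writing 0x80000000 results in the 40-bit value 0xFF80000000. Writing 0xFFFFFFF0 (the 32-bit pattern of -16) gives 0xFFFFFFFFF0, which is 16 increments away from the 40-bit wrap-around.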

The Time Stamp Counter counts the elapsed machine cycles:

The events countable by both performance counters are:

All of the events can be counted on PentiumPro as well as on Pentium II and Pentium III. The Pentium II and Pentium III have additional events defined mainly for MMX-extensions [13].

The same remarks as stated above in the Pentium-section concerning software environments apply to the Pentium Pro, Pentium II, and Pentium III as well.

Intel Pentium 4

The performance monitoring mechanism provided in the Intel Pentium 4 processors is considerably different from that provided in the P6 family and Pentium processors (and all other microprocessors). The setup mechanism and MSR layouts are different and incompatible with the P6 family and Pentium processor mechanism.

There are 3 types of registers relevant for performance counting:

The performance counters in conjunction with the counter configuration control registers (CCCRs) are used for filtering and counting the events selected by the ESCRs. The Pentium 4 and Intel Xeon processors provide 18 performance counters organized into 9 pairs. A pair of performance counters is associated with a particular subset of events and ESCR's. Each performance counter is 40-bits wide.

The counter pairs are partitioned into four groups:

The Time Stamp Counter counts the elapsed machine cycles:

The Pentium 4 has 18 performance counters. The following list shows all measurable events. Each event can only be measured on specific counter pairs, which belong to a specific counter group. To be consistent with the numbering scheme we start numbering with P4_CG0_xx for counter group 0.

Some events take more than one counter/CCCR/ESCR. See the Intel manual for the details. Therefore the 1-to-1 mapping of low-level events and counters doesn't hold for the P4 anymore.

The following short description of an event contains the hexadecimal values for the ESCR event select, ESCR event mask and the CCCR select (e.g. IOQ_allocation: Bus request type 03H (ESCR event select) 01H (ESCR event mask) 06H (CCCR select)). Below the short description a combination of event class and event name is shown. All events of a class can be combined to a new one. By an addition of the event mask values you can measure a specific combination of events (e.g. branch_retired: If you want to count all taken branches you have to combine branch taken predicted (04H) and branch taken mispredicted (08H). In order to activate both events you must write 0CH in the event mask field.).
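
The combination of event mask values from the branch_retired example can be written down as a short C sketch. Only the two mask values 04H and 08H are taken from the text above; the enumerator names and the helper function are hypothetical. Because each mask value occupies its own bit, a bitwise OR gives the same result as the addition mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Event mask bits of the branch_retired example; the names are made up here. */
enum {
    EXAMPLE_BRANCH_TAKEN_PREDICTED    = 0x04,
    EXAMPLE_BRANCH_TAKEN_MISPREDICTED = 0x08
};

/* Combine several single-bit event mask values of one event class. */
unsigned example_combine_event_masks(const unsigned *bits, size_t n)
{
    unsigned mask = 0;
    size_t i;

    assert(bits != NULL || n == 0);
    for (i = 0; i < n; ++i)
        mask |= bits[i];
    return mask;
}
```

Combining EXAMPLE_BRANCH_TAKEN_PREDICTED and EXAMPLE_BRANCH_TAKEN_MISPREDICTED yields 0x0C, the value that has to be written into the event mask field to count all taken branches.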

Counter group 0 (Counter 1/2) is an exception and does not belong to the counter groups above.

Currently, the generic low level driver interface does not work with the P4.

AMD Athlon Family

AMD Athlon

The AMD Athlon is a 32-bit CISC microprocessor. The Athlon has 4 performance counters. The performance counters are 48 bit wide, the time stamp counter is 64 bit wide.

The Time Stamp Counter counts the elapsed machine cycles:

The events countable on all four counters are:

Intel IA64 Family

In Intel's literature [14], a performance counter is called a PMC. In the Itanium 1 processor there are 4 counters numbered PMC4 to PMC7. To be consistent with the rest of the document, we map those counter numbers to the names PMC0, PMC1, PMC2, and PMC3, e.g. PMC4 is called PMC0 in our document.

Hitachi SR8000

The SR8000 processors have 8 performance counter registers, each counting exactly one hard-coded event type.



Rudolf 2003-01-02