Perfmon is a tool that allows user-level code to access the performance counters present in the Ultra-series workstations and servers produced by Sun Microsystems. This is accomplished by a loadable driver that re-programs devices with performance counters so that user-level code can access these counters (normally, access to these counters is restricted to code running in privileged mode).
Currently, the following devices are supported:
After you have extracted the distribution, you will find several directories:
.il files needed for using the inline functions.
Before you can use Perfmon, you must have your system administrator add the Perfmon package to the system(s) you wish to use. A quick rundown of how this might work:
    <become root>
    cd <installation_dir>/pkgs
    pkgadd -d MSUperf MSUperf
Note that Perfmon will only work on UltraSPARC-based machines (where uname -m prints sun4u).
In case you ever need to uninstall Perfmon from a system, just do the following:
    <become root>
    pkgrm MSUperf
In order to write programs using Perfmon, you will need access to perfmon.h (found in the include directory), libperfmon.a, and optionally perfmon32.il or perfmon64.il. If you are using a compiler that generates 32-bit code (such as Sun's C compiler version 4.x and lower), you will need to use perfmon32.il. If you are using a compiler that generates 64-bit code (such as some versions of gcc or Sun's C compiler version 5.x), you will need perfmon64.il.
The .il files mentioned above are UltraSPARC-specific assembly language routines for inlining (for more information, see the inline(1) man page). This format is known to work with Sun's C and C++ compilers, but may need some modification for use with gcc.
Note that code written using Perfmon will only run on machines where the Perfmon driver is installed and loaded. Attempting to run programs using Perfmon on other machines may result in strange behavior, core dumps, illegal instruction errors, etc.
There are three types of Perfmon routines: inline routines, library routines, and driver routines.
Both the inline and library routines can be called from a user application as you would expect. The driver routines must be accessed via ioctl(). To use ioctl(), you must first open the Perfmon device (accessible through /dev/perfmon). After the device is open, you use ioctl() to tell the driver which routine you wish to run (passing arguments as necessary). Here is an example code segment which opens the device and issues a cache flush request on the current CPU:
    #include <stdio.h>
    #include <stdlib.h>   /* exit() */
    #include <unistd.h>   /* ioctl() on Solaris */
    #include <fcntl.h>
    #include "perfmon.h"

    int main(void)
    {
        int fd;
        int rc;

        /* Open the Perfmon device so the driver can be reached via ioctl() */
        fd = open("/dev/perfmon", O_RDONLY);
        if (fd == -1) {
            perror("open(/dev/perfmon)");
            exit(1);
        }

        /* Tell the driver to flush the cache of the current CPU */
        rc = ioctl(fd, PERFMON_FLUSH_CACHE);
        if (rc < 0) {
            perror("ioctl(PERFMON_FLUSH_CACHE)");
            exit(1);
        }

        return 0;
    }
The UltraSPARC CPU has two 64-bit registers that are used for gathering performance data: the Performance Control Register (PCR) and the Performance Instrumentation Counters (PIC). These registers reflect events that happen on a per-processor basis. For best results, run your program on an MP machine and bind your process to a specific CPU. This prevents the process from migrating to another CPU and losing the performance data that has been collected.
Access to the PCR is privileged. The PCR can only be accessed by using the PERFMON_GETPCR and PERFMON_SETPCR ioctl() routines in the Perfmon driver (see the next section for more details). The PCR has the following bitfields (taken from Appendix B of the UltraSPARC-I User's Manual):
Name | Bits | Description |
---|---|---|
PRIV | 0 | Privileged. If set, non-privileged access to the PIC will cause a privileged_action trap. For programs using Perfmon, this should always be set to 0. |
ST | 1 | System_trace. If set, events in privileged (system) mode are accumulated. This may be set along with PCR.UT to accumulate all events. |
UT | 2 | User_trace. If set, events in non-privileged (user) mode are accumulated. This may be set along with PCR.ST to accumulate all events. |
S0 | 4-7 | Designates the type of event to accumulate in PIC.D0 (PIC0). See the table below for more information. |
S1 | 11-14 | Designates the type of event to accumulate in PIC.D1 (PIC1). See the table below for more information. |
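The bit layout in the table above can be sketched in C as follows. This is only an illustration of how the fields pack into one 64-bit value: the `SKETCH_*` macros and the `sketch_pcr_*` helpers are hypothetical names, not part of Perfmon, and the numeric S0/S1 selector codes are not listed in this document, so they are left as parameters. Real programs would normally use the pre-shifted constants from perfmon.h instead.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions copied from the PCR table above (hypothetical macro names). */
#define SKETCH_PCR_PRIV_SHIFT 0     /* PRIV: bit 0      */
#define SKETCH_PCR_ST_SHIFT   1     /* ST:   bit 1      */
#define SKETCH_PCR_UT_SHIFT   2     /* UT:   bit 2      */
#define SKETCH_PCR_S0_SHIFT   4     /* S0:   bits 4-7   */
#define SKETCH_PCR_S1_SHIFT   11    /* S1:   bits 11-14 */
#define SKETCH_PCR_SEL_MASK   0xFu  /* S0 and S1 are each 4 bits wide */

/* Hypothetical helper: pack the individual fields into a PCR image. */
uint64_t sketch_pcr_encode(int priv, int st, int ut,
                           unsigned s0, unsigned s1)
{
    /* Selectors must fit in their 4-bit fields. */
    assert(s0 <= SKETCH_PCR_SEL_MASK && s1 <= SKETCH_PCR_SEL_MASK);

    return ((uint64_t)(priv != 0) << SKETCH_PCR_PRIV_SHIFT)
         | ((uint64_t)(st != 0)   << SKETCH_PCR_ST_SHIFT)
         | ((uint64_t)(ut != 0)   << SKETCH_PCR_UT_SHIFT)
         | ((uint64_t)s0          << SKETCH_PCR_S0_SHIFT)
         | ((uint64_t)s1          << SKETCH_PCR_S1_SHIFT);
}

/* Hypothetical helpers: recover the event selectors from a PCR image,
 * for example one read back from the driver. */
unsigned sketch_pcr_s0(uint64_t pcr)
{
    return (unsigned)((pcr >> SKETCH_PCR_S0_SHIFT) & SKETCH_PCR_SEL_MASK);
}

unsigned sketch_pcr_s1(uint64_t pcr)
{
    return (unsigned)((pcr >> SKETCH_PCR_S1_SHIFT) & SKETCH_PCR_SEL_MASK);
}
```

Decoding helpers like these can be handy when printing a PCR value during debugging, since the two selector fields do not sit on byte boundaries.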
If the PCR.PRIV bit is clear, the PIC register can be accessed by user mode programs. Hand-coded assembly routines for doing this are located in the perfmon.il files and in the perfmon library. The PIC register has the following format:
Name | Bits | Description |
---|---|---|
D0 | 0-31 | A 32-bit counter holding the number of accumulated events of the type selected by the PCR.S0 field. |
D1 | 32-63 | A 32-bit counter holding the number of accumulated events of the type selected by the PCR.S1 field. |
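Because each counter is only 32 bits wide, it can wrap during a long measurement. The following sketch shows one way to split a 64-bit PIC snapshot into its two halves and to take the difference of two snapshots so that a single wrap is absorbed. The struct and function names are hypothetical and are not part of Perfmon; the snapshots are assumed to be 64-bit values laid out as in the table above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the two 32-bit halves of the PIC. */
struct sketch_pic_pair {
    uint32_t d0;  /* PIC.D0, bits 0-31  */
    uint32_t d1;  /* PIC.D1, bits 32-63 */
};

/* Split a 64-bit PIC snapshot into D0 (low half) and D1 (high half). */
void sketch_pic_split(uint64_t pic, struct sketch_pic_pair *out)
{
    assert(out != NULL);
    out->d0 = (uint32_t)(pic & 0xFFFFFFFFu);
    out->d1 = (uint32_t)(pic >> 32);
}

/* Events counted between two snapshots. The result is reduced modulo 2^32,
 * so a counter that wrapped once between the snapshots still gives the
 * right difference. */
void sketch_pic_delta(uint64_t before, uint64_t after,
                      struct sketch_pic_pair *out)
{
    struct sketch_pic_pair b, a;

    assert(out != NULL);
    sketch_pic_split(before, &b);
    sketch_pic_split(after, &a);
    out->d0 = (uint32_t)(a.d0 - b.d0);
    out->d1 = (uint32_t)(a.d1 - b.d1);
}
```

A counter that wraps more than once between snapshots cannot be recovered this way, so long-running regions would need to be sampled more often.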
The include file perfmon.h has encodings for the various fields in the PCR register. These encodings are pre-shifted so that when they are inclusive-ORed together, they produce a value suitable for writing directly to the PCR register. The defined values are:
Name | PIC field | Description |
---|---|---|
PCR_PRIV_MODE | N/A | Sets PIC access to privileged-mode only. This should probably never be used in programs using Perfmon. |
PCR_SYS_TRACE | N/A | Causes events to be accumulated while in privileged (system) mode. |
PCR_USER_TRACE | N/A | Causes events to be accumulated while in non-privileged (user) mode. |
PCR_S0_CYCLE_CNT | PIC0 | Accumulated cycles. This is similar to the SPARC-V9 TICK register, except that cycle counting is controlled by the PCR.UT and PCR.ST fields. |
PCR_S0_INSTR_CNT | PIC0 | The number of instructions completed. Annulled, mispredicted or trapped instructions are not counted. |
PCR_S0_STALL_IC_MISS | PIC0 | I-buffer is empty due to an I-Cache miss. This includes E-Cache miss processing if an E-Cache miss also occurs. |
PCR_S0_STALL_STORBUF | PIC0 | The store buffer cannot hold additional stores, and a store instruction is the first instruction in the group. |
PCR_S0_IC_REF | PIC0 | I-Cache references. I-Cache references are fetches of up to four instructions from an aligned block of eight instructions. I-Cache references are generally prefetches and do not correspond exactly to the instructions executed. |
PCR_S0_DC_READ | PIC0 | D-Cache read references (including accesses that subsequently trap). Non-D-Cacheable accesses are not counted. Atomic instructions, block loads, "internal", and "external" bad ASIs, quad LDD, and MEMBARs also fall into this class. |
PCR_S0_DC_WRITE | PIC0 | D-Cache write references (including accesses that subsequently trap). Non-D-Cacheable accesses are not counted. |
PCR_S0_STALL_LOAD | PIC0 | An instruction in the execute stage depends on an earlier load result that is not yet available. This stalls all instructions in the execute and grouping stages. This also counts cases where no instructions are dispatched due to a one cycle load-load dependency on the first instruction presented to the grouping logic. |
PCR_S0_EC_REF | PIC0 | Total E-Cache references. Non-cacheable accesses are not counted. NOTE: The E-Cache write reference count is determined by subtracting the D-Cache read miss (D-Cache read references minus D-Cache read hits) and I-Cache misses (I-Cache references minus I-Cache hits) from the total E-Cache references. Because of store buffer compression, this is not the same as D-Cache write misses. |
PCR_S0_EC_WRITE_RO | PIC0 | E-Cache hits that do a read for ownership UPA transaction. |
PCR_S0_EC_SNOOP_INV | PIC0 | E-Cache invalidations from the following UPA transactions: S_INV_REQ, S_CPI_REQ. |
PCR_S0_EC_READ_HIT | PIC0 | E-Cache read hits from D-Cache misses. NOTE: The E-Cache write hit count is determined by subtracting the read hit and the instruction hit count from the total E-Cache hit count. |
PCR_S1_CYCLE_CNT | PIC1 | Accumulated cycles. This is similar to the SPARC-V9 TICK register, except that cycle counting is controlled by the PCR.UT and PCR.ST fields. |
PCR_S1_INSTR_CNT | PIC1 | The number of instructions completed. Annulled, mispredicted or trapped instructions are not counted. |
PCR_S1_STALL_MISPRED | PIC1 | I-buffer is empty from Branch misprediction. Branch misprediction kills instructions after the dispatch point, so the total number of pipeline bubbles is approximately twice as big as measured from this count. |
PCR_S1_STALL_FPDEP | PIC1 | First instruction in the group depends on an earlier floating point result that is not yet available, but only while the earlier instruction is not stalled for a PCR_S0_STALL_LOAD. Thus, PCR_S1_STALL_FPDEP and PCR_S0_STALL_LOAD are mutually exclusive counts. |
PCR_S1_IC_HIT | PIC1 | I-Cache hits. |
PCR_S1_DC_READ_HIT | PIC1 | D-Cache read hits are counted in one of two places: 1) When they access the D-Cache tags and do not enter the load buffer (because it is already empty). 2) When they exit the load buffer (due to a D-Cache miss or a non-empty load buffer) |
PCR_S1_DC_WRITE_HIT | PIC1 | D-Cache write hits. |
PCR_S1_LOAD_STALL_RAW | PIC1 | There is a load use in the execute stage and there is a read-after-write hazard on the oldest outstanding load. This indicates that load data is being delayed by completion of an earlier store. |
PCR_S1_EC_HIT | PIC1 | Total E-Cache hits. |
PCR_S1_EC_WRITEBACK | PIC1 | E-Cache misses that do writebacks. |
PCR_S1_EC_SNOOP_COPYBCK | PIC1 | E-Cache snoop copy-backs from the following UPA transactions: S_CPB_REQ, S_CPI_REQ, S_CPD_REQ, S_CPB_MSI_REQ. |
PCR_S1_EC_IC_HIT | PIC1 | E-Cache read hits from I-Cache misses. |
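For example, to count cycles in PIC1 and completed instructions in PIC0 while running in user mode, you would use the value PCR_S1_CYCLE_CNT | PCR_S0_INSTR_CNT | PCR_USER_TRACE. This value would then be passed to the PERFMON_SETPCR ioctl().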
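The NOTE entries in the event table derive the E-Cache write counts by subtraction. A minimal sketch of that arithmetic is shown here. The struct and function names are hypothetical, and the raw counts are assumed to have been collected already. Since only one PIC0 event and one PIC1 event can be counted at a time, the eight inputs would have to come from several runs, which is why the results are signed: counts from separate runs may not line up exactly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for raw event counts gathered with Perfmon. */
struct sketch_cache_counts {
    uint64_t ic_ref;       /* PCR_S0_IC_REF       */
    uint64_t ic_hit;       /* PCR_S1_IC_HIT       */
    uint64_t dc_read;      /* PCR_S0_DC_READ      */
    uint64_t dc_read_hit;  /* PCR_S1_DC_READ_HIT  */
    uint64_t ec_ref;       /* PCR_S0_EC_REF       */
    uint64_t ec_hit;       /* PCR_S1_EC_HIT       */
    uint64_t ec_read_hit;  /* PCR_S0_EC_READ_HIT  */
    uint64_t ec_ic_hit;    /* PCR_S1_EC_IC_HIT    */
};

/* E-Cache write references =
 *   total E-Cache references - D-Cache read misses - I-Cache misses */
int64_t sketch_ec_write_refs(const struct sketch_cache_counts *c)
{
    int64_t dc_read_miss, ic_miss;

    assert(c != NULL);
    dc_read_miss = (int64_t)c->dc_read - (int64_t)c->dc_read_hit;
    ic_miss      = (int64_t)c->ic_ref  - (int64_t)c->ic_hit;
    return (int64_t)c->ec_ref - dc_read_miss - ic_miss;
}

/* E-Cache write hits =
 *   total E-Cache hits - read hits - instruction hits */
int64_t sketch_ec_write_hits(const struct sketch_cache_counts *c)
{
    assert(c != NULL);
    return (int64_t)c->ec_hit - (int64_t)c->ec_read_hit
         - (int64_t)c->ec_ic_hit;
}
```

A negative result from either function would suggest that the runs being combined did not execute the same workload.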
ioctl() Routines
Function | Arguments | Description |
---|---|---|
PERFMON_FLUSH_CACHE | None | Flushes both L1 and L2 caches on the CPU that the calling thread is running on. |
PERFMON_GETPCR | Address of a 64-bit buffer (unsigned long long for Sun's C 4.x compilers) | Gets the current value of the PCR register of the CPU that the calling thread is running on and places it in the passed buffer. See the UltraSPARC User's Manual for details on register format. |
PERFMON_SETPCR | Address of a 64-bit buffer | Sets the value of the UltraSPARC PCR register to the value that is contained in the passed-in buffer. |
These library routines are prototyped in perfmon.h and can be included in your code by adding the compile time options: -L$PERFMON_HOME/lib -lperfmon. See the Inline Routines section for descriptions of library functions that are duplicated by inline functions.
Prototype | Description |
---|---|
void cpu_sync() | This routine executes a membar #Sync instruction, which does a barrier synchronization. After this instruction completes, all previous instructions and memory accesses are complete. |
void clr_pic() | This clears all the bits in both PIC0 and PIC1. |
There are several inline assembly language routines that are part of Perfmon. They are prototyped in perfmon.h:
Prototype | Description |
---|---|
unsigned long long get_tick() | This gets the current value of the TICK register. This register represents the number of clock cycles that have elapsed since the processor was last powered on (or reset). |
unsigned long long get_pic() | This atomically reads both PIC0 and PIC1. |
unsigned long get_pic0() | Read only the value in PIC0. |
unsigned long get_pic1() | Read only the value in PIC1. |
unsigned long extract_pic0(unsigned long long) | Given the 64-bit PIC, extract PIC.D0 (lower 32 bits). |
unsigned long extract_pic1(unsigned long long) | Given the 64-bit PIC, extract PIC.D1 (upper 32 bits). |
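Values returned by routines like these are usually turned into derived figures. The sketch below shows two common conversions: elapsed time from two TICK readings, and cycles per instruction from a cycle count and an instruction count. The function names are hypothetical and not part of Perfmon, and the CPU clock frequency is an assumption that the caller must supply (this document does not say how to obtain it).

```c
#include <assert.h>
#include <stdint.h>

/* Convert the difference of two TICK readings to seconds.
 * clock_hz is the CPU clock frequency, supplied by the caller. */
double sketch_ticks_to_seconds(uint64_t tick_start, uint64_t tick_end,
                               double clock_hz)
{
    assert(clock_hz > 0.0);
    return (double)(tick_end - tick_start) / clock_hz;
}

/* Cycles per instruction from a cycle count and an instruction count,
 * e.g. the two PIC halves when PCR selects cycles and instructions.
 * Returns 0.0 when no instructions were counted, to avoid dividing by zero. */
double sketch_cpi(uint32_t cycles, uint32_t instrs)
{
    if (instrs == 0)
        return 0.0;
    return (double)cycles / (double)instrs;
}
```

Both inputs to `sketch_cpi` should come from the same measurement interval and the same PCR.UT/PCR.ST settings, otherwise the ratio is not meaningful.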
To use the inline routines, include the appropriate .il file on your compile command line. Also, since the routines use SPARC-V9 specific instructions, you must add the -xarch=v8plusa flag to your compile line so that the compiler will allow V9 instructions in your final executable. For example:
    # Compile using inline routines
    cc -xarch=v8plusa -o tick tick.c perfmon32.il -L$PERFMON_HOME/lib -lperfmon

    # Compile without inline routines (use the library versions)
    cc -xarch=v8plusa -o tick tick.c -L$PERFMON_HOME/lib -lperfmon
buildit.sh in the same directory for optimization examples).
#ifdef in the source. Read the source for more details.