Perfmon is a tool that allows user-level code to access the performance counters present in the Ultra-series workstations and servers produced by Sun Microsystems. It does this with a loadable driver that reprograms devices with performance counters so that user-level code can access them (normally, access to these counters is restricted to code running in privileged mode). For some devices, such as the UltraSPARC CPU, accessing the performance counters requires special machine instructions. A user library component of Perfmon provides access to these instructions via C function calls. The library also provides other useful functions, such as memory and instruction barriers. Currently, the only supported devices are the UltraSPARC-I and UltraSPARC-II CPUs, and they are the only devices discussed in the remainder of this document. See the section on Future Work for devices that may be supported in future versions of Perfmon.
There are two parts to collecting performance data on UltraSPARC CPUs. The first is to program the Performance Control Register (PCR) with the types of events that you wish to count. Access to the PCR is always privileged and requires a call into the Perfmon device driver. The second part is to read the Performance Instrumentation Counter (PIC) register to get the current count of the watched events. Access to the PIC is normally privileged, but can be made non-privileged by turning off the low-order bit of the PCR. In addition to the PCR and PIC, Perfmon also enables user access to the UltraSPARC's TICK register, which is incremented once per machine clock cycle at all times. This is done by turning off the high-order bit of the TICK register (TICK.npt). See the Perfmon User's Guide and the UltraSPARC-I User's Manual for more information.
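The two bit manipulations described above can be sketched in portable C on plain 64-bit values. This is an illustration only: the real registers can be written only from privileged code inside the driver, and the macro and function names below are hypothetical. The bit positions follow the description in the text (lowest bit of the PCR, highest bit of TICK).

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions as described in the text; the names are illustrative. */
#define PCR_PRIV_BIT  (UINT64_C(1) << 0)   /* lowest bit of the PCR  */
#define TICK_NPT_BIT  (UINT64_C(1) << 63)  /* highest bit of TICK    */

/* PCR value with the bit that restricts PIC access turned off. */
static uint64_t pcr_allow_user_pic(uint64_t pcr)
{
    return pcr & ~PCR_PRIV_BIT;
}

/* TICK value with the bit that restricts TICK access turned off. */
static uint64_t tick_allow_user_read(uint64_t tick)
{
    return tick & ~TICK_NPT_BIT;
}

/* Cycle count held in a TICK value, ignoring the control bit. */
static uint64_t tick_cycles(uint64_t tick)
{
    return tick & ~TICK_NPT_BIT;
}

/* Elapsed cycles between two TICK reads taken on the same CPU. */
static uint64_t tick_elapsed(uint64_t start, uint64_t end)
{
    assert(tick_cycles(end) >= tick_cycles(start));
    return tick_cycles(end) - tick_cycles(start);
}
```

In a measurement, a program would read TICK (and the PIC) before and after the code of interest and subtract, which is what tick_elapsed() models.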
There were two basic requirements in designing Perfmon. The first was to allow user programs to access the performance counters, which is a privileged operation. The second was to have lightweight access to the accumulated data to minimize the amount of error introduced by the act of reading the performance registers.
Since access to the PCR is always privileged, and access to the PIC and TICK is by default privileged, it was necessary to write some code that runs in the kernel context. For maximum flexibility and ease of installation, it was decided to write a loadable device driver rather than have a specially modified kernel.
The loadable driver is a standard, autoconfiguring SVR4 character device driver. In addition to the static structures and functions needed to support any device driver (see Writing Device Drivers in AnswerBook for more details), two functions in the driver required Perfmon-specific code.
When the driver is initially loaded, its _init() routine is called. In this routine, a function is run on all present CPUs that turns off the TICK.npt bit. This allows user-level programs to access the TICK register. In order to guarantee that the function is run on all CPUs, a kernel mechanism called a cross-call was used. This allows one CPU to send a software interrupt packet to all other CPUs in the system telling them what operation to perform (in this case, run a specified function). This is the same mechanism used to update each CPU's TLB on a page miss.
Earlier during development, I was seeing some cases on MP machines where the driver would load, make the cross-call, and return. However, when I ran a user program that tried to read TICK, it would crash with an "Illegal instruction" error, indicating that the TICK.npt bit had not been turned off. If I waited a minute or two, the problem would go away and everything would work perfectly, implying that the cross-calls were working, but taking their time doing it. This was finally resolved by adding calls to xc_attention() and xc_dismissed() around the cross call. These functions basically force all CPUs into a tight loop, waiting to receive cross-calls, then release them. Since the installation of this code, I have not been able to reproduce my earlier problem, so I'm assuming that it's fixed.
After the driver has been fully loaded and attached (i.e. its device node has been created), it waits for users to send ioctl()s to it. The ioctl()s recognized by Perfmon are fully described in the Perfmon User's Guide. Implementing most of the supported ioctl()s was straightforward.
The only tricky ioctl() to implement was PERFMON_FLUSH_CACHE. This causes the cache on the current CPU to be flushed. The actual flushing is done by calling a pre-existing kernel routine (cpu_flush_ecache()) that accesses a region of memory that aliases with each cache line in the CPU. The tricky part was getting access to this routine. Under Solaris 2.6 (where I did my initial development), the cpu_flush_ecache() function is a global kernel symbol, meaning that I can just reference that function in my driver code, and it will be resolved when my driver is loaded. Under Solaris 2.5.1, however, this function is not a global symbol and cannot be resolved by the kernel module linker (krtld) at module load time. The symbol can still be resolved once the driver is loaded and running in kernel space, so to support this function I needed to call into krtld to resolve cpu_flush_ecache() myself. Luckily, this turned out to be less complicated than it sounds. The first time that a cache flush is requested by the user, I look up the address of cpu_flush_ecache() by using kobj_getsymvalue(). I then keep a pointer to this function around for later use, along with a flag indicating that I have already attempted the lookup (since it's possible that the symbol doesn't exist). To make the driver MT-safe, the lookup is protected by a mutex lock to avoid any possible race conditions.
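The lookup-once pattern described above can be sketched in user-space C. This is not the driver's code: resolve_symbol() is a hypothetical stand-in for the kernel symbol lookup (it does not reproduce the signature of kobj_getsymvalue()), fake_flush() stands in for cpu_flush_ecache(), and a POSIX mutex replaces the kernel mutex. Only the caching logic is the point.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

typedef void (*flush_fn_t)(void);

static int lookup_calls;  /* how many times the resolver ran   */
static int flush_calls;   /* how many times the flush ran      */

/* Hypothetical stand-in for the real cache flush routine. */
static void fake_flush(void) { flush_calls++; }

/* Hypothetical stand-in for the kernel symbol lookup. */
static flush_fn_t resolve_symbol(const char *name)
{
    assert(name != NULL);
    lookup_calls++;
    return strcmp(name, "cpu_flush_ecache") == 0 ? fake_flush : NULL;
}

static pthread_mutex_t lookup_lock = PTHREAD_MUTEX_INITIALIZER;
static int lookup_attempted;  /* set even when the symbol is missing */
static flush_fn_t flush_fn;   /* cached result, NULL if not found    */

/* Returns 0 if the flush routine ran, -1 if it is unavailable. */
static int flush_cache(void)
{
    flush_fn_t fn;

    pthread_mutex_lock(&lookup_lock);
    if (!lookup_attempted) {
        flush_fn = resolve_symbol("cpu_flush_ecache");
        lookup_attempted = 1;
    }
    fn = flush_fn;
    pthread_mutex_unlock(&lookup_lock);

    if (fn == NULL)
        return -1;
    fn();
    return 0;
}
```

Keeping a separate "attempted" flag, rather than testing the pointer for NULL, means a missing symbol is looked up only once instead of on every request.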
The user-land component of Perfmon was relatively easy and quick to implement. The library functions are all written in assembly since they need to use special machine instructions to do their work. Also, since most of the performance counter registers are 64-bit, and the Solaris compilers and OS are currently 32-bit, the library routines had to split the 64-bit registers into two separate registers so that the calling C code could deal with them properly.
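The register splitting itself is done in assembly in the library, but the arithmetic it implies can be shown in C. The following is a minimal sketch with hypothetical function names; it uses the modern fixed-width types from stdint.h for clarity, which is an assumption about the build environment rather than what the original library uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Split a 64-bit register value into its upper and lower 32-bit halves. */
static void split_u64(uint64_t value, uint32_t *hi, uint32_t *lo)
{
    assert(hi != NULL && lo != NULL);
    *hi = (uint32_t)(value >> 32);
    *lo = (uint32_t)(value & 0xFFFFFFFFu);
}

/* Recombine two 32-bit halves into the original 64-bit value. */
static uint64_t join_u64(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}
```

A caller that receives the two halves separately can use join_u64() to recover the full value before computing differences between two reads.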
Another issue for writing the user-level code was the fact that the performance counters are kept on a per-processor basis rather than a per-process basis. This means that if you run your program on an MP machine, and it migrates between CPUs during its run-time (which is pretty likely given Solaris' work-grabbing scheduler), the data read from the CPU performance counters is useless. Fortunately, there is a non-privileged system call named processor_bind() that will let you bind your process (or a single LWP) to a particular CPU.
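As a sketch of the binding step, the wrapper below pins the calling process to one CPU with processor_bind(). The wrapper name bind_self_to_cpu() is hypothetical, and the non-Solaris fallback branch is an assumption added only so the example compiles on other systems; it is not part of Perfmon.

```c
#include <assert.h>
#include <errno.h>

#if defined(__sun)
#include <sys/types.h>
#include <sys/processor.h>
#include <sys/procset.h>
#endif

/*
 * Bind the calling process to the given CPU.
 * Returns 0 on success, -1 on failure with errno set.
 */
static int bind_self_to_cpu(int cpu)
{
    assert(cpu >= 0);
#if defined(__sun)
    /* P_PID + P_MYID: bind every LWP of the calling process. */
    return processor_bind(P_PID, P_MYID, (processorid_t)cpu, NULL);
#else
    /* Assumed fallback: processor_bind() is Solaris-specific. */
    errno = ENOSYS;
    return -1;
#endif
}
```

To bind a single LWP instead of the whole process, the same call takes P_LWPID as the id type. The binding should be in place before the first counter read so that every read comes from the same CPU.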
Also, since most programs using Perfmon need some setup, a skeleton program was provided to reduce the effort of developing and testing such programs. The basic outline of the skeleton program is:
/dev/perfmon.
There are plenty of things that can be done to extend the features and usefulness of Perfmon. Some of the items that are planned for the future are: