Handling oprofile sample buffer overflows


 


What to do when sample buffer overflows are reported

OProfile is a system-wide profiler for Linux, capable of profiling all running code at low overhead.

As of OProfile release 0.9.6, when you run 'opcontrol --shutdown' after a profiling run, oprofile checks for lost samples and displays a warning message if any samples have been dropped. The message recommends reviewing oprofiled.log for more information on lost samples and trying to eliminate or minimize them.
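
For reference, a minimal profiling run that can trigger this warning looks something like the following (the vmlinux path and the workload are illustrative, not part of the original setup):

# Load the oprofile kernel module
$ opcontrol --init
# Point oprofile at the kernel image (path is an example)
$ opcontrol --setup --vmlinux=/boot/vmlinux
$ opcontrol --start
$ ./my_workload
# Stop profiling; the lost-samples warning (if any) is printed here
$ opcontrol --shutdown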

But how does one do that? And how does one interpret the information about lost samples in the log?

 


Oprofiled.log layout

Let's start out with the interpretation of the log information. The oprofiled.log is located in:

<session-dir>/samples

where <session-dir> is /var/lib/oprofile by default:

# cat /var/lib/oprofile/samples/oprofiled.log

At the end of a profiling run, a section titled "OProfile Statistics" is written to this log. The beginning (summary) part of this section includes several lines giving information about lost samples or events. For example:

oprofiled started Tue Jun 22 17:00:46 2010
kernel pointer size: 8
Tue Jun 22 17:01:21 2010
-- OProfile Statistics --
Nr. sample dumps: 4
Nr. non-backtrace samples: 4279
Nr. kernel samples: 3834
Nr. lost samples (no kernel/user): 0
Nr. lost kernel samples: 0
Nr. incomplete code structs: 0
Nr. samples lost due to sample file open failure: 0
Nr. samples lost due to no permanent mapping: 1
Nr. event lost due to buffer overflow: 0
Nr. samples lost due to no mapping: 11
Nr. backtraces skipped due to no file mapping: 0
Nr. samples lost due to no mm: 42

Following that is a "Statistics for cpu : <n>" section for each processor, with more information on lost samples. The oprofile kernel driver uses a two-level sample buffering scheme: one buffer per cpu, plus one main "event" buffer which periodically collects all the individual cpu buffer samples. The userspace oprofile daemon reads the samples from this main event buffer.

---- Statistics for cpu : 2
Nr. samples lost cpu buffer overflow: 0
Nr. samples received: 0
Nr. backtrace aborted: 0
Nr. samples lost invalid pc: 0
---- Statistics for cpu : 1
Nr. samples lost cpu buffer overflow: 0
Nr. samples received: 0
Nr. backtrace aborted: 0
Nr. samples lost invalid pc: 0
---- Statistics for cpu : 0
Nr. samples lost cpu buffer overflow: 0
Nr. samples received: 67
Nr. backtrace aborted: 0
Nr. samples lost invalid pc: 0
oprofiled stopped Tue Jun 22 17:01:21 2010
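
If you just want the overflow counters from both buffering levels, a plain grep over the log pulls them out (a convenience one-liner, not an oprofile feature):

$ grep 'buffer overflow' /var/lib/oprofile/samples/oprofiled.log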

Getting back to the summary part of the "OProfile Statistics" section:

The "Nr. non-backtrace samples" line shows the number of samples actually received.

  • Recently discovered (Dec 2011) Bug Alert: The "Nr. non-backtrace samples" statistic up through OProfile 0.9.7 is not accurate. This value is less than the actual number of samples received.
  • Looking at the "lost samples" lines, you'll almost always see some samples lost due to "no mapping" or "no mm". This can happen when samples are recorded by the oprofile kernel module for an application, but that application exits before the samples can be processed by the userspace oprofile daemon. There's not much the user can do about these. Proportionally, the number of such lost samples is usually very low, so they can generally be safely ignored.
  • On the other hand, the user often can (and should) take action to eliminate or minimize the number of "events lost due to buffer overflows", especially if that number is high (say, anything over 1%) compared to the total "Nr. non-backtrace samples". The same is true for "Nr. samples lost cpu buffer overflow" in the per-cpu stats sections. A quick way to compute that ratio is sketched below. OK, so what action(s) should be taken?
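
Here is one way to compute the overflow percentage from the log (a shell sketch keyed to the field names in the excerpts above; this script is not part of oprofile):

$ awk -F': ' '/Nr. event lost due to buffer overflow/ {lost=$2} /Nr. non-backtrace samples/ {total=$2} END {if (total) printf "%.2f%% of samples lost to main buffer overflow\n", 100*lost/total}' /var/lib/oprofile/samples/oprofiled.log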

 


Reducing/Eliminating Events lost due to buffer overflow

Note: The "events lost due to buffer overflow" statistic is mis-named; it should be named "samples lost due to buffer overflow". The following text uses the phrasing that the tool provides, but keep this in mind.

Generally speaking, if you're seeing "events lost due to buffer overflow" in the summary part of the "OProfile Statistics" section of the oprofiled.log, the easiest thing to try is lowering your sampling frequency (i.e., using a higher 'count' value for the event(s) you're profiling). But since OProfile is a statistical profiler, keep in mind that the fewer the samples over a given period of time, the less accurate your profile data will be. It's difficult to give a rule of thumb on what constitutes a statistically valid profile, but as a rough example, consider that gprof typically collects 100 samples per second. This is fine for finding the low-hanging fruit, but may not be fine-grained enough for detailed performance investigations. So, for time-based or cycles-based profiles, you should try to do at least an order of magnitude better than gprof.
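To make that tradeoff concrete, here is a back-of-the-envelope way to pick a 'count' for a target sampling rate (the 3 GHz clock and the 1,000 samples/sec target are assumptions for illustration):

$ cpu_hz=3000000000    # assumed 3 GHz clock
$ target_rate=1000     # target samples per second per CPU (10x gprof)
$ echo $((cpu_hz / target_rate))
3000000

A count of 3,000,000 cycles between samples would thus give roughly ten times gprof's 100 samples per second.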

Here is an example of how to change the 'count' value for the event PM_RUN_CYC_GRP1; the count value is the number that follows the colon:

$ opcontrol --event PM_RUN_CYC_GRP1:25000

You can check the new count value by issuing:

$ opcontrol --status
Daemon not running
Event 0: PM_RUN_CYC_GRP1:25000:0:1:1
Separate options: library
vmlinux file: /boot/vmlinux-2.6.32.12-0.7-ppc64
Image filter: none
Call-graph depth: 0

oprofile also provides the ability to tailor the sizes of the key working buffers it allocates:

$ opcontrol 2>&1 | grep buffer
   --buffer-size=num             kernel buffer size in sample units
   --buffer-watershed            kernel buffer watershed in sample units (2.6 only)
   --cpu-buffer-size=num         per-cpu buffer size in units (2.6 only)
   --note-table-size             kernel notes buffer size in notes units (2.4 only)

If, after lowering your sampling frequency, you're still getting too many overflows, try adjusting the buffer size along with the buffer watershed (which should be around 1/4 the value of the buffer size). Once you have eliminated the main buffer overflows, go after CPU buffer overflows (if any) by adjusting the cpu-buffer-size. The default size for the main event buffer is 131072, and the default size for the per-cpu buffers is 8192.

See the 'opcontrol' man page for details on setting these parameters. The man page uses "num samples" to indicate that the value you enter will size a buffer big enough to hold that many samples. The samples are unsigned longs, so each is 64 or 32 bits, depending on the bitness of the kernel (since these buffers are allocated by the kernel).
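
As a rough illustration of the kernel memory involved (this arithmetic is not an oprofile command; 8 bytes per sample assumes a 64-bit kernel):

$ buffer_size=1000000          # value you would pass to --buffer-size
$ echo $((buffer_size * 8))    # bytes of kernel memory, i.e. about 8 MB
8000000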

Here is a quick example:

# Set the event (main) buffer size
$ opcontrol --buffer-size=1000000
# Set the buffer watershed (about 1/4 of the buffer size)
$ opcontrol --buffer-watershed=250000
# Set the cpu buffer size
$ opcontrol --cpu-buffer-size=16000

Run opcontrol --status again to verify the "non-default" settings.

$ opcontrol --status
Daemon not running
Event 0: PM_RUN_CYC_GRP1:25000:0:1:1
Separate options: library
vmlinux file: /boot/vmlinux-2.6.32.12-0.7-ppc64
Image filter: none
Call-graph depth: 0
Buffer size: 1000000
CPU buffer watershed: 250000
CPU buffer size: 16000

To restore to the defaults, simply set each of the buffer values to zero (0).

$ opcontrol --buffer-size=0
$ opcontrol --buffer-watershed=0
$ opcontrol --cpu-buffer-size=0

 


Testing this on a POWER7 system

We did some testing on a POWER7 system running at 3 GHz, with 32 logical CPUs (i.e., 8 cores, with the default of SMT=4). The tables below show the results. You can see that, with a sampling interval of one sample every 25,000 PM_RUN_CYC events, we found no buffer size settings that prevented lost samples. But by adjusting the sampling interval and buffer parameters, we were able to collect profile data with very few (or no) lost samples.
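
To check the logical CPU count on your own system (the num_cpus figure used in the notes below), one quick way is:

$ grep -c '^processor' /proc/cpuinfo
32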

Notes

  • With a sampling interval of one sample for every 50,000 run cycles on a 3 GHz system, each CPU can generate at most 3 GHz / 50,000 cycles per sample = 60,000 samples per second. The total possible number of samples per second is therefore 60,000 x num_cpus (in this case 32 CPUs), assuming all CPUs are 100% busy: a nominal 1,920,000 samples per second for this testing scenario.
  • Without oprofile running, an example runtime for a single test application is 23 seconds; with the app running on all CPUs, the runtime is 53 seconds (we'll call this the "nominal runtime"). Thus, 32 apps should register 1,920,000 x 53 = 101,760,000 total samples over the collective runtime. This is a theoretical value, because when oprofile is running, the profiling overhead increases the apps' runtime.
  • The data in the tables show that, for this type of system, oprofile is indeed capable of processing 1,920,000 samples per second with appropriate profiling parameters. (The arithmetic above is repeated as shell expressions below.)
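
For convenience, the same arithmetic as shell expressions (clock rate, CPU count, and runtime taken from the notes above):

$ echo $((3000000000 / 50000))            # 60,000 samples/sec per CPU
$ echo $((3000000000 / 50000 * 32))       # 1,920,000 samples/sec across 32 CPUs
$ echo $((3000000000 / 50000 * 32 * 53))  # 101,760,000 samples over a 53-second run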

 


Sampling = 25,000 events

Every combination of profiling parameters tried in this case resulted in too many lost samples.

Sampling interval (PM_RUN_CYC) | --buffer-size | --buffer-watershed | --cpu-buffer-size | Events lost from buffer overflow | Avg. events lost per-cpu buffer | oprofile overhead (ratio of nominal) | Summary
25,000 | 2M  | .5M | 16K | 96 million | 3.5 million | 1.33 | Not good
25,000 | 4M  | 1M  | 16K | 45 million | 3.5 million | 1.33 | Not good
25,000 | 8M  | 2M  | 16K | 10 million | 3.5 million | 1.34 | Not good
25,000 | 16M | 4M  | 16K | 0          | 3.5 million | 1.33 | Not good

 


Sampling = 50,000 events

The last two did the best here. The first lost nearly all the samples.

Sampling interval (PM_RUN_CYC) | --buffer-size | --buffer-watershed | --cpu-buffer-size | Events lost from buffer overflow | Avg. events lost per-cpu buffer | oprofile overhead (ratio of nominal) | Summary
50,000 | default | default | default | 81.5 million | a few on some, several thousand on others | 1.21 | Not good
50,000 | 1M | .25M | 16K | 800,000 | 4 | 1.22 | Not good
50,000 | 2M | .5M  | 16K | 0       | 3 | 1.21 | Good
50,000 | 4M | 1M   | 16K | 0       | 3 | 1.23 | Good

In this example, the main buffer size didn't make a difference between 2M and 4M samples, so for this workload the 2M setting would have been fine. For other workloads, you may see more events lost. (The commands for the 2M configuration are sketched below.)
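
For reference, the "Good" 2M row above maps onto opcontrol commands like these (the event name follows the earlier PM_RUN_CYC_GRP1 examples; adjust it for your system):

$ opcontrol --event PM_RUN_CYC_GRP1:50000
$ opcontrol --buffer-size=2000000
$ opcontrol --buffer-watershed=500000
$ opcontrol --cpu-buffer-size=16000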

 


Sampling = 100,000 events

This set of profiling parameters uses even fewer resources and should provide as statistically valid a profile as a sampling interval of 50,000. Note that the oprofile overhead is also substantially lower.

Sampling interval (PM_RUN_CYC) | --buffer-size | --buffer-watershed | --cpu-buffer-size | Events lost from buffer overflow | Avg. events lost per-cpu buffer | oprofile overhead (ratio of nominal) | Summary
100,000 | 1M | .25M | 16K | 0 | 1.5 | 1.07 | Very good