A simple model for describing basic sources of possible performance problems

Source: Internet. Editor: 程序博客网. Date: 2024/05/21 17:28

In this section we describe a simple model of the basic sources of possible performance problems. The model is expressed in terms of operating system observables of fundamental subsystems and can be directly related back to the outputs of standard Unix command-line tools.

The model is based around a simple conception of a Java application running on a Unix or Unix-like operating system. Figure 3-7 shows the basic components of the model, which consist of:
• The hardware and operating system the application runs on
• The JVM (or container) the application runs in
• The application code itself
• Any external systems the application calls
• The incoming request traffic that is hitting the application

Any of these aspects of a system can be responsible for a performance problem. There are some simple diagnostic techniques that can be used to narrow down or isolate particular parts of the system as potential culprits for performance problems, as we will see in the next section.

Basic Detection Strategies

One definition for a well-performing application is that efficient use is being made of system resources. This includes CPU usage, memory and network or I/O bandwidth. If an application is causing one or more resource limits to be hit, then the result will be a performance problem.
It is also worth noting that the operating system itself should not normally be a major contributing factor to system utilisation. The role of an operating system is to manage resources on behalf of user processes, not to consume them itself. The only real exception to this rule is when resources are so scarce that the OS is having difficulty allocating anywhere near enough to satisfy user requirements. For most modern server-class hardware, the only time this should occur is when I/O (or occasionally memory) requirements greatly exceed capability.
A key metric for application performance is CPU utilisation. CPU cycles are quite often the most critical resource needed by an application, and so efficient use of them is essential for good performance. Applications should be aiming for as close to 100% usage as possible during periods of high load.

When analysing application performance, the system must be under enough load to exercise it. The behavior of an idle application is usually meaningless for performance work.

Two basic tools that every performance engineer should be aware of are vmstat and iostat. On Linux and other Unixes, these command-line tools provide immediate and often very useful insight into the current state of the virtual memory and I/O subsystems, respectively. The tools only provide numbers at the level of the entire host, but this is frequently enough to point the way to more detailed diagnostic approaches. Let’s take vmstat as an example, run as vmstat 1.

The parameter 1 in the command vmstat 1 indicates that we want ongoing output (until interrupted via Ctrl-C) rather than a single snapshot. A new output line is printed every second, which enables a performance engineer to leave the output running (or capture it into a log) whilst an initial performance test is performed.
The output of vmstat is relatively easy to understand, and contains a large amount of useful information, divided into sections.

1. The first two columns show the number of runnable and blocked processes.
2. In the memory section, the amount of swapped and free memory is shown, followed by the memory used as buffer and as cache.
3. The swap section shows the memory swapped from and to disk. Modern server class machines should not normally experience very much swap activity.
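4. The block in and block out counts (bi and bo) show the number of blocks per second that have been received from, and sent to, a block (I/O) device. The block size is often quoted as 512 bytes, although the Linux vmstat man page states that blocks are currently 1024 bytes.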
5. In the system section, the number of interrupts and the number of context switches per second are displayed.
6. The CPU section contains a number of directly relevant metrics, expressed as percentages of CPU time. In order, they are user time (us), kernel time (sy, for “system time”), idle time (id), waiting time (wa) and the “stolen time” (st, for virtual machines).
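To make the column layout above concrete, here is a minimal sketch that labels the fields of a single vmstat data line. It assumes the default Linux column order (r, b, swpd, free, buff, cache, si, so, bi, bo, in, cs, us, sy, id, wa, st); other platforms and newer vmstat versions may differ, and any extra trailing columns are simply ignored. The class name VmstatLine and the sample line are hypothetical, made up for illustration rather than captured from a real host.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: label the fields of one Linux vmstat data line with their column names. */
final class VmstatLine {
    // Assumed column order of the default Linux vmstat layout; other platforms differ.
    static final String[] COLUMNS = {
        "r", "b", "swpd", "free", "buff", "cache", "si", "so",
        "bi", "bo", "in", "cs", "us", "sy", "id", "wa", "st"
    };

    /** Parses one whitespace-separated data line into a map of column name to value. */
    static Map<String, Long> parse(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length < COLUMNS.length) {
            throw new IllegalArgumentException(
                "expected at least " + COLUMNS.length + " fields, got " + fields.length);
        }
        Map<String, Long> values = new LinkedHashMap<>();
        for (int i = 0; i < COLUMNS.length; i++) {
            values.put(COLUMNS[i], Long.parseLong(fields[i]));
        }
        return values;
    }

    public static void main(String[] args) {
        // Made-up sample line for illustration only.
        String sample = " 2  0      0 759860 248412 2572248    0    0     0    80   63  127  8  0 92  0  0";
        Map<String, Long> v = parse(sample);
        System.out.println("runnable=" + v.get("r")
            + " cs/s=" + v.get("cs")
            + " user%=" + v.get("us")
            + " idle%=" + v.get("id"));
    }
}
```

In practice the header lines that vmstat prints would need to be skipped before calling parse, since they are not numeric.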

Over the course of the remainder of this book, we will meet many other, more sophisticated tools. However, it is important not to neglect the basic tools at our disposal. Complex tools often have behaviors that can mislead us, whereas the simple tools that operate close to processes and the operating system can convey simple and uncluttered views of how our systems are actually behaving.
In the rest of this section, let’s consider some common scenarios and how even very simple tools such as vmstat can help us spot issues.

Context switching

In Section 3.4.3, we discussed the impact of a context switch, and saw the potential impact of a full context switch to kernel space in Figure 3-6. However, whether between user threads or into kernel space, context switches introduce unavoidable wastage of CPU resources.
A well-tuned program should be making maximum possible use of its resources, especially CPU. For workloads which are primarily dependent on computation (“CPU-bound” problems), the aim is to achieve close to 100% utilisation of CPU for userland work.

To put it another way, if we observe that the CPU utilisation is not approaching 100% user time, then the obvious next question is: why not? What is causing the program to fail to achieve that? Are involuntary context switches caused by locks the problem? Is it due to blocking caused by I/O contention?
The vmstat tool can, on most operating systems (especially Linux), show the number of context switches occurring, so on a vmstat 1 run, the analyst will be able to see the real-time effect of context switching. A process that is failing to achieve 100% userland CPU usage and is also displaying a high context switch rate is likely to be either blocked on I/O or suffering from thread lock contention.
However, vmstat is not enough to fully disambiguate these cases on its own. I/O problems can be seen from vmstat, as it provides a crude view of I/O operations as well. However, to detect thread lock contention in real time, tools like VisualVM that can show the states of threads in a running process should be used. One additional common tool is the statistical thread profiler that samples stacks to provide a view of blocking code.
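As a rough illustration of what "the states of threads in a running process" means, the following sketch counts the live threads of the current JVM by state using the standard java.lang.management API. It is a single snapshot taken from inside the process, not a replacement for VisualVM or a sampling profiler; the class name ThreadStateSnapshot is hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.EnumMap;
import java.util.Map;

/** Sketch: count the live threads of the current JVM by state. */
final class ThreadStateSnapshot {

    /** Returns how many live threads are currently in each Thread.State. */
    static Map<Thread.State, Integer> countByState() {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        // Pass false, false to skip monitor and synchronizer details and keep the dump cheap.
        ThreadInfo[] infos = ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        for (ThreadInfo info : infos) {
            if (info == null) {
                continue; // defensive: skip any entry without information
            }
            counts.merge(info.getThreadState(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Thread.State, Integer> counts = countByState();
        System.out.println("BLOCKED threads: " + counts.getOrDefault(Thread.State.BLOCKED, 0));
        System.out.println("All states: " + counts);
    }
}
```

A persistently large BLOCKED count across repeated snapshots would be one hint of lock contention, whereas a one-off snapshot says very little on its own.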

Garbage Collection

As we will see in Chapter 7, in the HotSpot JVM (by far the most commonly used JVM), memory is allocated at startup and managed from within user space. This means that system calls such as sbrk() are not needed to allocate memory. In turn, this means that kernel switching activity for garbage collection is quite minimal. Thus, if a system is exhibiting high levels of system CPU usage, then it is definitely not spending a significant amount of its time in GC, as GC activity burns user space CPU cycles and does not impact kernel space utilization.
On the other hand, if a JVM process is using 100% (or close to) of CPU in user space, then garbage collection is often the culprit. When analysing a performance problem, if simple tools (such as vmstat) show consistent 100% CPU usage, but with almost all cycles being consumed by userspace, then a key question that should be asked next is: “Is it the JVM or user code that is responsible for this utilization?” In almost all cases, high userspace utilization by the JVM is caused by the GC subsystem, so a useful rule of thumb is to check the GC log and see how often new entries are being added to it.
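Alongside the GC log, a JVM can also report its own accumulated collection time. The sketch below sums the figures from the standard GarbageCollectorMXBean interface and expresses them as a share of JVM uptime. Treat the result as a coarse indicator only: the reported time is approximate elapsed time, not CPU time, and its meaning varies between collectors. The class name GcOverhead and the helper names are hypothetical.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Sketch: estimate what share of JVM uptime has been attributed to garbage collection. */
final class GcOverhead {

    /** Sums the accumulated collection time (ms) reported by all collectors in this JVM. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 if the collector does not report a time
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    /** GC time as a percentage of elapsed time; 0 when no time has elapsed. */
    static double overheadPercent(long gcMillis, long uptimeMillis) {
        return uptimeMillis <= 0 ? 0.0 : 100.0 * gcMillis / uptimeMillis;
    }

    public static void main(String[] args) {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        System.out.printf("Approximate GC share of uptime: %.2f%%%n",
            overheadPercent(totalGcMillis(), uptime));
    }
}
```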
Garbage collection logging in the JVM is incredibly cheap, to the point that even the most accurate measurements of the overall cost cannot reliably distinguish it from random background noise. GC logging is also incredibly useful as a source of data for analytics. It is therefore imperative that GC logs be enabled for all JVM processes, especially in production.
We will have a great deal to say about GC and the resulting logs later in the book. However, at this point, we would encourage the reader to consult with their operations staff and confirm whether GC logging is on in production. If not, then one of the key points of Chapter 8 is to build a strategy to enable this.
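One way to check from inside a process whether GC logging was requested is to inspect the JVM's startup arguments. The following is a minimal sketch, assuming HotSpot-style flags: -Xlog:gc for unified logging on JDK 9 and later, and -Xloggc:, -XX:+PrintGCDetails, -XX:+PrintGC or -verbose:gc on older versions. It is a prefix match only, so it will miss forms such as -Xlog:all or a tag list that does not begin with gc. The class name GcLoggingCheck is hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

/** Sketch: check whether this JVM was started with a GC logging flag. */
final class GcLoggingCheck {

    /** True if any argument looks like one of the HotSpot GC logging options listed below. */
    static boolean hasGcLoggingFlag(List<String> jvmArgs) {
        for (String arg : jvmArgs) {
            if (arg.startsWith("-Xlog:gc")            // unified logging, JDK 9 and later
                    || arg.startsWith("-Xloggc:")     // GC log file option on older JDKs
                    || arg.equals("-XX:+PrintGCDetails")
                    || arg.equals("-XX:+PrintGC")
                    || arg.equals("-verbose:gc")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // getInputArguments() returns the arguments passed to the JVM itself, not to main().
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("GC logging flag present: " + hasGcLoggingFlag(jvmArgs));
    }
}
```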

I/O

File I/O has traditionally been one of the murkier aspects of overall system performance. Partly this comes from its closer relationship with messy physical hardware, with engineers making quips about “spinning rust” and similar, but it is also because I/O lacks abstractions as clean as those we see elsewhere in operating systems.
In the case of memory, the elegance of virtual memory as a separation mechanism works well. However, I/O has no comparable abstraction that provides suitable isolation for the application developer.
Fortunately, whilst most Java programs involve some simple I/O, the class of applications that make heavy use of the I/O subsystems is relatively small, and in particular, most applications do not try to saturate I/O at the same time as either CPU or memory.
Not only that, but established operational practice has led to a culture in which production engineers are already aware of the limitations of I/O, and actively monitor processes for heavy I/O usage.
For the performance analyst or engineer, it suffices to have an awareness of the I/O behavior of our applications. Tools such as iostat (and even vmstat) have the basic counters (e.g. blocks in or out) that are often all we need for basic diagnosis, especially if we make the assumption that only one I/O-intensive application is present per host.
Finally, it’s worth mentioning one aspect of I/O that is becoming more widely used across a class of Java applications that have a dependency on I/O but also stringent performance requirements.

Kernel Bypass I/O

For some high-performance applications, the cost of using the kernel to copy data from, for example, a buffer on a network card, and place it into a user space region is prohibitively high. Instead, specialised hardware and software is used to map data directly from a network card into a user-accessible area. This approach avoids a “double-copy” as well as crossing the boundary between user space and kernel, as we can see in Figure 3-8.

In some ways, this is reminiscent of Java’s New I/O (NIO) API that was introduced to allow Java I/O to bypass the Java heap and work directly with native memory and underlying I/O.

However, Java does not provide specific support for this model, and instead applications that wish to make use of it rely upon custom (native) libraries to implement the required semantics. It can be a very useful pattern and is increasingly commonly implemented in systems that require very high-performance I/O.
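The NIO comparison made above can be illustrated with a direct buffer, which lives in native memory rather than on the Java heap. The sketch below writes a string to a temporary file and reads it back through a direct ByteBuffer and a FileChannel. To be clear, this is ordinary NIO and still goes through the kernel, so it is not kernel bypass; the class name DirectBufferDemo and the temporary file prefix are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Sketch: file I/O through a direct (off-heap) NIO buffer. This is not kernel bypass. */
final class DirectBufferDemo {

    /** Writes text to a temporary file and reads it back using a direct ByteBuffer. */
    static String roundTrip(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        try {
            Path tmp = Files.createTempFile("direct-buffer-demo", ".bin");
            try {
                // allocateDirect requests native memory outside the Java heap.
                ByteBuffer buf = ByteBuffer.allocateDirect(bytes.length);
                buf.put(bytes);
                buf.flip(); // switch the buffer from being filled to being drained
                try (FileChannel out = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
                    while (buf.hasRemaining()) {
                        out.write(buf);
                    }
                }
                buf.clear(); // reuse the same native memory for reading
                try (FileChannel in = FileChannel.open(tmp, StandardOpenOption.READ)) {
                    while (buf.hasRemaining() && in.read(buf) != -1) {
                        // keep reading until the buffer is full or the file ends
                    }
                }
                buf.flip();
                byte[] result = new byte[buf.remaining()];
                buf.get(result);
                return new String(result, StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello, direct buffer"));
    }
}
```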
In this chapter so far we have discussed operating systems running on top of “bare metal”. However, increasingly, systems run in virtualised environments, so to conclude this chapter, let’s take a brief look at how virtualisation can fundamentally change our view of Java application performance.

Virtualisation

Virtualisation comes in many forms, but one of the most common is to run a copy of an operating system as a single process on top of an already-running OS. This leads to the situation represented in Figure 3-9, where the virtual environment runs as a process inside the unvirtualized (or “real”) operating system that is executing on bare metal.

A full discussion of virtualisation, the relevant theory and its implications for application performance tuning would take us too far afield. However, some mention of the differences that virtualisation causes seems appropriate, especially given the increasing number of applications running in virtual or cloud environments.
Although virtualisation was originally developed in IBM mainframe environments as early as the 1970s, it was not until recently that x86 architectures were capable of supporting “true” virtualisation. This is usually characterized by these three conditions:
• Programs running on a virtualized OS should behave essentially the same as when running on “bare metal” (i.e. unvirtualized)
• The hypervisor must mediate all accesses to hardware resources
• The overhead of the virtualization must be as small as possible, and not a significant fraction of execution time
In a normal, unvirtualized system, the OS kernel runs in a special, privileged mode (hence the need to switch into kernel mode). This gives the OS direct access to hardware. However, in a virtualized system, direct access to hardware by a guest OS is disallowed.
One common approach is to rewrite the privileged instructions in terms of unprivileged instructions. In addition, some of the OS kernel’s data structures need to be “shadowed” to prevent excessive cache flushing (e.g. of TLBs) during context switches.
Some modern Intel-compatible CPUs have hardware features designed to improve the performance of virtualized OSs. However, it is apparent that even with hardware assists, running inside a virtual environment presents an additional level of complexity for performance analysis and tuning.
In the next chapter we will introduce the core methodology of performance tests. We will discuss the primary types of performance tests, the tasks that need to be undertaken and the overall lifecycle of performance work. We will also catalogue some common best practices (and antipatterns) in the performance space.

Reading notes:

Optimizing Java

by Benjamin J Evans and James Gough

Printed in the United States of America.

Published by O’Reilly Media, Inc. , 1005 Gravenstein Highway North, Sebastopol, CA 95472.

1 0