Scaling Enterprise Java on 64-bit Multi-Core X86-Based Servers
by Michael Juntao Yuan and Dave Jaffe, 11/01/2006
Multi-core and 64-bit CPUs are the hottest commodities in the enterprise server market these days. In recent years, as the cost and power requirements of faster CPU clock speeds have increased, the growth in raw clock speed (usually measured in megahertz) of single CPUs has slowed down. Hardware manufacturers continue to improve X86-based server performance by increasing both the multitasking capability and internal data bandwidth. Both Intel and Advanced Micro Devices are shipping 64-bit processors with two internal CPU cores, and quad-core processors are soon to follow. Ninth-generation servers from Dell exploit this new generation of chips. The PowerEdge 1955 blade server, for example, supports up to two 64-bit dual-core processors in a blade configuration, with up to ten such blades in a seven-rack-unit (12.25-inch) chassis.
However, those new generations of servers also pose new challenges for the software. For instance, to take advantage of the multi-core CPUs, the software application must be able to execute tasks in parallel across the CPUs; to take advantage of the 64-bit memory bandwidth, the application must also be able to manage a large amount of memory efficiently. As a key software platform on enterprise servers, Java Enterprise Edition (Java EE) is on the forefront of this multi-core, 64-bit revolution. Java EE developers must adapt to those challenges to make the most out of hardware investment.
When Java first came out in the mid-1990s, the state-of-the-art PC had a single CPU running at less than 300MHz and less than 64MB of RAM. The first Java applications were mostly on the client side. High-performance multitasking and large-memory handling were clearly not priorities for Java's designers at that time. But as Java became widely adopted for server-side applications, things started to change. Web applications are inherently multithreaded, since each web request can be handled in a separate thread, parallel to other requests. The latest Java platform has greatly improved performance on modern server hardware.
In this article, we look at the current state of enterprise Java and analyze the challenges it faces with the new generation of servers. Based on our experience working on Java EE applications running on the JBoss Application Server in the Dell Scalable Enterprise Technology Center, we provide solutions and tips to scale your Java EE applications to the latest server hardware.
Tune the JVM
The core of the Java platform is the Java Virtual Machine (JVM). The entire Java application server runs inside a JVM. The JVM takes many startup parameters as command line flags, and some of them have great implications on the application performance. So, let's examine some of the important JVM parameters for server applications.
First, you should allocate as much memory as possible to the JVM using the -Xms<size> (minimum heap size) and -Xmx<size> (maximum heap size) flags. For instance, the -Xms1g -Xmx1g flags allocate 1GB of RAM to the JVM heap. If you don't specify a heap size in the JVM startup flags, the JVM limits the heap to a small default (as little as 64MB with pre-5.0 JVMs; JDK 5.0 on a server-class machine defaults to at most one quarter of physical memory, capped at 1GB), no matter how much physical memory you have on the server! More memory allows the application to handle more concurrent web sessions, and to cache more data to improve the slow I/O and database operations. We typically specify the same amount of memory for both flags to force the server to use all the allocated memory from startup. This way, the JVM doesn't need to resize the heap at runtime, which is a leading cause of JVM instability. For 64-bit servers, make sure that you run a 64-bit JVM on top of a 64-bit operating system to take advantage of all the RAM on the server. Otherwise, the JVM would only be able to utilize 2GB or less of memory space. 64-bit x86 JVMs are typically only available for JDK 5.0 and later.
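To verify that the flags took effect, you can query the heap limits from inside the JVM. This is a minimal sketch (the class name is ours, not from the article):

```java
public class HeapCheck {
    // Maximum heap the JVM will grow to (-Xmx), in megabytes.
    public static long maxHeapMB() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.println("Max heap:   " + maxHeapMB() + " MB");
        System.out.println("Total heap: " + rt.totalMemory() / (1024 * 1024) + " MB");
        System.out.println("Free heap:  " + rt.freeMemory() / (1024 * 1024) + " MB");
    }
}
```

Run it under the same flags as the server (e.g., java -Xms1g -Xmx1g HeapCheck); with -Xms equal to -Xmx, the total heap should already match the maximum at startup.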
With a large heap, the garbage collection (GC) operation can become a major performance bottleneck. It can take more than ten seconds for the GC to sweep through a multi-gigabyte heap. In JDK 1.3 and earlier, GC is a single-threaded operation that stops all other tasks in the JVM. That not only causes long and unpredictable pauses in the application, but also results in very poor performance on multi-CPU computers, since all other CPUs must sit idle while one CPU runs at 100% to free up heap memory space. It is crucial to select a JDK 1.4+ JVM that supports parallel and concurrent GC operations. In practice, the concurrent GC implementation in the JDK 1.4 series of JVMs is not very stable, so we strongly recommend upgrading to JDK 5.0. Using command line flags, you can choose between the following two GC algorithms; both are optimized for multi-CPU computers.
- If your priority is to increase the total throughput of the application and you can tolerate occasional GC pauses, you should use the -XX:+UseParallelGC and -XX:+UseParallelOldGC (the latter is only available in JDK 5.0) flags to turn on parallel GC. The parallel GC uses all available CPUs to perform the GC operation, and hence is much faster than the default single-threaded GC. It still pauses all other activities in the JVM during GC, however.
- If you need to minimize GC pauses, you can use the -XX:+UseConcMarkSweepGC flag to turn on the concurrent GC. The concurrent GC still pauses the JVM and uses parallel GC to clean up short-lived objects. However, it cleans up long-lived objects from the heap using a background thread running in parallel with other JVM threads. The concurrent GC drastically reduces GC pauses, but managing the background thread does add overhead and reduces the total throughput.
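As a sketch, the two choices above translate into startup lines like the following (the application jar name is hypothetical; flag spellings are for the Sun HotSpot JVM):

```shell
# Throughput-first: parallel collection, accepting occasional full-GC pauses
java -server -Xms1g -Xmx1g -XX:+UseParallelGC -XX:+UseParallelOldGC -jar app.jar

# Pause-first: concurrent mark-sweep collection of the old generation
java -server -Xms1g -Xmx1g -XX:+UseConcMarkSweepGC -jar app.jar
```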
Furthermore, there are a few more JVM parameters you can tune to optimize GC operations.
- On 64-bit systems, the call stack for each thread is allocated 1MB of memory space. Most threads do not use that much space. Using the -XX:ThreadStackSize=256k flag, you can decrease the stack size to 256k to allow for more threads.
- Use the -XX:+DisableExplicitGC flag to ignore explicit application calls to System.gc(). If the application calls this method frequently, it could be triggering a lot of unnecessary GCs.
- The -Xmn<size> flag lets you manually set the size of the "young generation" memory space for short-lived objects. If your application generates lots of new objects, you might improve GCs dramatically by increasing this value. The "young generation" size should almost never be more than 50% of the heap.
Since the GC has a big impact on performance, the JVM provides several flags to help you fine-tune the GC algorithm for your specific server and application. It's beyond the scope of this article to discuss GC algorithms and tuning tips in detail, but we'd like to point out that the JDK 5.0 JVM comes with an adaptive GC-tuning feature called ergonomics. It can automatically optimize GC algorithm parameters based on the underlying hardware, the application itself, and desired goals specified by the user (e.g., the max pause time and desired throughput). That saves you time trying different GC parameter combinations yourself. Ergonomics is yet another compelling reason to upgrade to JDK 5.0. Interested readers can refer to Tuning Garbage Collection with the 5.0 Java Virtual Machine. If the GC algorithm is misconfigured, it is relatively easy to spot the problems during the testing phase of your application. In a later section, we will discuss several ways to diagnose GC problems in the JVM.
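With ergonomics, instead of hand-picking detailed GC parameters you state goals and let the JVM adapt. A sketch (flag names per the HotSpot 5.0 documentation; the jar name is hypothetical):

```shell
# Goal-based tuning: keep GC pauses under 200ms, and spend at most
# 1/(1+19) = 5% of total run time in garbage collection
java -server -Xms1g -Xmx1g -XX:+UseParallelGC \
     -XX:MaxGCPauseMillis=200 -XX:GCTimeRatio=19 -jar app.jar
```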
Finally, make sure that you start the JVM with the -server flag. It optimizes the Just-In-Time (JIT) compiler to trade slower startup time for faster runtime performance. There are more JVM flags we have not discussed; for details on these, please check out the JVM options documentation page.
Use New Platform APIs
Besides the JVM, the Java platform libraries have also gone through extensive changes to accommodate the newer server hardware. We strongly recommend you upgrade your application to JDK 5.0+ in order to take advantage of all the performance enhancements built into the platform. Three new library APIs introduced in the last two major versions of the JDK are of particular importance for multi-CPU computers.
- The concurrency utility library (java.util.concurrent) in JDK 5.0 is very important for multithreaded applications. It simplifies the Java thread API and provides a thread-safe set of Collection implementations. For instance, the new ConcurrentHashMap is a thread-safe HashMap that you can read and write without a synchronized block. We will get to this in more detail later in this article.
- The NIO (New I/O) library was introduced in JDK 1.4. It allows multiple threads to share one physical connection (e.g., a socket) to the hard disk or network. A thread no longer needs to block on the I/O socket to read or write data. Using NIO, we can greatly reduce the thread waiting time caused by a limited number of blocking sockets. NIO is especially useful on multi-CPU computers where CPUs often wait on I/O, and where there are many threads.
- The logging library introduced in JDK 1.4 provides a convenient API to log information from the application to the console, logfiles, or network destinations. The important performance feature of the logging library is that you can configure the logging output by changing the logging level at runtime via configuration files. This helps us to reduce logging--which involves slow I/O operations and is a major cause of CPU waiting--at runtime, without recompiling the application code.
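As an illustration of the concurrency utilities, here is a sketch of a thread-safe hit counter built on ConcurrentHashMap; no synchronized block is needed, even with many request threads writing at once (the class and page names are ours, not from the article):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

public class HitCounter {
    // ConcurrentHashMap allows lock-free reads and finely striped writes,
    // so no synchronized block is needed around get/put.
    static final ConcurrentMap<String, AtomicInteger> hits =
            new ConcurrentHashMap<String, AtomicInteger>();

    public static int record(String page) {
        AtomicInteger n = hits.get(page);
        if (n == null) {
            // putIfAbsent is atomic: if two threads race, only one insert wins
            AtomicInteger fresh = new AtomicInteger();
            n = hits.putIfAbsent(page, fresh);
            if (n == null) n = fresh;
        }
        return n.incrementAndGet();
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(new Runnable() {
                public void run() {
                    for (int j = 0; j < 1000; j++) record("/index.jsp");
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();
        System.out.println(hits.get("/index.jsp")); // prints 4000
    }
}
```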
You should write new applications, and upgrade older applications, to use the concurrency, NIO, and logging APIs whenever possible. If you cannot upgrade, you should use alternative open source libraries that provide similar features. For instance, Doug Lea's util.concurrent library has many of the same features as the JDK 5.0 concurrency API, and the Apache Log4j library is comparable to the JDK 1.4+ logging library.
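For the logging API, the performance-relevant pattern is to guard expensive log statements with a level check, so that message strings are never built when the level is disabled at runtime. A minimal sketch using java.util.logging (the logger name and methods are ours, not from the article):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class OrderAudit {
    private static final Logger log = Logger.getLogger("com.example.orders");

    // True only when FINE-level logging is enabled via configuration,
    // so the string concatenation below is skipped in production.
    public static boolean fineEnabled() {
        return log.isLoggable(Level.FINE);
    }

    public static void process(String orderId) {
        if (fineEnabled()) {
            log.fine("processing order " + orderId + " in detail");
        }
        log.info("order " + orderId + " processed");
    }

    public static void main(String[] args) {
        process("A-100");
    }
}
```

By default the JDK logging level is INFO, so the FINE branch costs only a boolean check; lowering the level in logging.properties turns the detail back on without recompiling.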
Optimize Your Code
In the previous sections, we discussed the general guidelines to build and run high-performance Java EE applications for multiple CPU and large memory servers. However, each application is unique with its own performance requirements and bottlenecks. The only way to make sure that your application is optimized for your hardware is through extensive performance testing. In this section, we cover some of the basic techniques to diagnose performance problems in your application.
It's beyond the scope of this article to cover performance-testing tools and frameworks. In our tests, we used Grinder, an open source performance-testing framework in Java. It can simulate hundreds of thousands of concurrent users across multiple testing computers and gather statistics on a central console. It provides a utility for you to record your test scripts by going through your web application in a browser. The generated script is in Jython, and you can easily modify it to suit your own needs.
As we discussed before, tuning GC operations in the JVM is crucial for performance. The easiest way to see the effects of various GC algorithm parameters is to monitor the time the application spends on GC throughout the load testing. There are two simple ways to do it.
- You can add the -verbose:gc startup flag to the JVM. The JVM then prints out every GC operation and its duration to the console. If the server pauses due to long full GC operations, you can optimize the GC parameters accordingly. If the system runs lengthy full GCs very frequently, the application is probably allocating too many objects; if the heap usage measured right after each full GC keeps climbing, you likely have a memory leak somewhere in your application.
- If you use the JDK 5.0 JVM, you can also use the JConsole utility to monitor the server's resource usage. The JConsole GUI shows how the various regions of memory are utilized and how much time is spent on GC (see Figure 1). To use JConsole, you need to start the JVM with the -Dcom.sun.management.jmxremote flag and then run the jconsole command. JConsole can connect to a JVM running on the local computer, or to any JVM running on the network via RMI. You can use one JConsole instance to monitor multiple servers.
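The two monitoring options can be sketched as startup lines (the jar name is hypothetical; -XX:+PrintGCDetails is an optional companion to -verbose:gc on HotSpot):

```shell
# Option 1: log every collection and its duration to the console
java -server -verbose:gc -XX:+PrintGCDetails -Xms1g -Xmx1g -jar app.jar

# Option 2: expose JMX so JConsole (run the `jconsole` command) can attach
java -server -Dcom.sun.management.jmxremote -Xms1g -Xmx1g -jar app.jar
```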
Figure 1. JConsole in JDK 5.0 (click for full-size image)
To pinpoint the exact location of a memory leak, you can use an application profiler. The JBoss Profiler is an open source profiler for applications inside the JBoss Application Server.
When the application is fully loaded, the CPU should run between 80% and 100% of its capacity. If the CPU usage is substantially lower, you should look for other bottlenecks, such as whether the network or disk I/O is saturated. However, an underutilized CPU could also indicate contention points inside the application. For instance, as we mentioned before, if there is a synchronized block on the critical path of multiple threads (e.g., a code block frequently accessed by most requests), the multiple CPUs would not be fully utilized. To find those contention points, you can do a thread dump when the server is fully loaded:
- On a Windows machine, type Ctrl-Break in the console window where the server was started (i.e., the server console) to create a thread dump.
- On a Linux/Unix system, run the kill -QUIT process_id command, where process_id is the ID of the server JVM process, to create a thread dump.
The thread dump prints out detailed information (stack trace with source code line numbers) about all current threads in the server. If all the request-handling threads are waiting at the same point, it would indicate a contention point, and you can go back to the code to fix it.
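The kind of contention point a thread dump reveals often looks like the synchronized method below, where every request thread queues on one monitor; the ConcurrentMap-based accessors show the lock-free alternative (the cache is a hypothetical example, not code from the article):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical cache sitting on the hot path of every request.
public class PriceCache {
    private final ConcurrentMap<String, Double> prices =
            new ConcurrentHashMap<String, Double>();

    // Lock-free: readers never block one another.
    public Double get(String sku) { return prices.get(sku); }
    public void put(String sku, double price) { prices.put(sku, price); }

    // The pattern a thread dump would flag: many threads shown as
    // "waiting for monitor entry" on this object's lock.
    public synchronized Double getWithGlobalLock(String sku) {
        return prices.get(sku);
    }

    public static void main(String[] args) {
        PriceCache cache = new PriceCache();
        cache.put("sku-1", 9.99);
        System.out.println(cache.get("sku-1"));
    }
}
```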
Sometimes, the contention point is not in the application but in the application server itself. Most Java EE application servers have not completely evolved their code base to take advantage of JDK 5.0 APIs, especially the concurrent utility libraries. In this case, it is crucial to choose an open source application server, such as the JBoss Application Server, where you can make changes to the server code.
Collapse the Tiers
Traditionally, Java EE had been designed for the multitiered architecture. This architecture envisions that the web server, servlet container, EJB server, and database server each runs on its own physical computer, and those computers are tied together through remote call protocols on the local network.
But with the new generation of more powerful server hardware, a single computer is powerful enough to run all of those components for a medium-sized website. Running everything on the same physical machine is much more efficient than the distributed architecture described above. All communications are now inter-thread communications that can be handled efficiently by the same operating system, or even inside the same JVM in many cases. This eliminates the expensive object serialization requirements and high network latency associated with remote calls. Furthermore, since different components tend to use different kinds of server resources (e.g., the database is heavy on disk usage while Java EE is CPU-intensive), the integrated stack helps us to balance the server usage and reduce overall contention points.
Figure 2. Choose between call-by-reference and call-by-value in the JBoss AS installer (click for full-size image)
The JBoss Application Server has built-in optimizations to support the single JVM deployment. For instance, by default, JBoss AS makes call-by-reference method calls from the servlet to the EJB objects. Call-by-reference can be up to ten times faster than the standard Java EE call-by-value approach, because call-by-value requires object serialization and is primarily for remote calls across JVMs. Figure 2 shows that you can choose from the two call-isolation methods.
The JBoss Web Server project goes one step further and builds native Apache web server functionalities directly into the Java EE servlet container. It allows much tighter integration between components when deployed on the same server, and hence could deliver much better performance than older Java EE servers.
With the entire middleware stack running on the same physical server, we also drastically simplify deployment and management. When you need to scale the application up, you simply add a load balancer, move the shared database server to a different computer, and then add any number of server nodes with the integrated middleware stack (see Figure 3).
Figure 3. The load-balanced architecture
All web requests are made against the load balancer, which then forwards the requests to the application servers in a manner that ensures all nodes receive similar numbers of requests per unit time. The load balancer should be configured to forward all requests from the same user session to the same node (i.e., use sticky sessions). Most Java EE application servers also support automatic state replication between the nodes to avoid application state loss when a node fails. For instance, JBoss AS supports several state-replication strategies, including buddy replication, in which each node chooses a "buddy" node as its failover target. Such load balancing and state replication would be very difficult to deploy and manage if we had three or four tiers of servers.
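As a sketch of the sticky-session setup, a hypothetical Apache mod_jk workers.properties for two JBoss nodes might look like this (worker names, hosts, and exact property spellings should be checked against your mod_jk version):

```properties
# Two JBoss AS nodes reached over AJP, fronted by one load balancer
worker.list=lb
worker.node1.type=ajp13
worker.node1.host=10.10.20.111
worker.node1.port=8009
worker.node2.type=ajp13
worker.node2.host=10.10.20.112
worker.node2.port=8009
worker.lb.type=lb
worker.lb.balance_workers=node1,node2
# Pin each user session to the node that created it
worker.lb.sticky_session=1
```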
Virtualize the Hardware
The new multi-core 64-bit servers are capable of running heavy-load web applications. But for small web applications, they can be overkill. To fully utilize server capabilities, we sometimes run multiple small websites on the same physical server. Of course, you can deploy multiple applications on the same Java EE container. But to achieve the optimal performance, stability, and manageability, we often wish to run each application in its own Java EE container. How do we do that?
Technically, the primary challenge in running multiple Java EE server instances on the same physical server is avoiding port conflicts. Like any other server application, the Java EE application server listens on TCP/IP ports to provide services. For instance, the HTTP service listens for web requests on port 80 or 8080; the RMI service listens for RMI invocation requests on port 4444; the naming service listens on port 1099; etc. For the server to run properly, it must obtain exclusive control over each port it listens on. So, when you start multiple server instances on the same computer, you are likely to get port conflict errors. There are three ways to avoid port conflicts.
- Using virtualization software, such as VMware or Xen, you can run multiple operating systems on the same physical server, with a Java EE server instance in each guest OS. The benefit of this approach is its flexibility in architecture and management. Each OS virtual machine can be independently provisioned, managed, and backed up. You can also choose to run the database server, the load balancer, or other server components in their own virtual OSes. The drawback of this approach is the relatively heavy overhead of virtualizing an entire OS just to run the JVM.
- Many Java EE servers allow you to start a server instance bound to a specific IP address. For instance, in the case of JBoss AS, you can use the run.bat -b 10.10.20.111 command to start a server instance bound to IP address 10.10.20.111. You can assign multiple IP addresses to your server and then start a Java EE server instance on each of those IP addresses. Each server instance listens on the same port numbers at different IP addresses, and hence there is no conflict. This approach provides a good balance between server manageability and performance overhead. We recommend it if your application supports IP address binding.
- In the unlikely case that you cannot assign multiple IP addresses to a physical server, an alternative is to reconfigure the port numbers used by each server instance so that they do not conflict. That would typically require you to go through reams of tedious configuration files, and you must be intimately familiar with those files to look for port numbers. In JBoss AS, there is a simpler way: an MBean service called jboss.system:service=ServiceBindingManager can automatically shift the port numbers used in the current server instance (e.g., increase all port numbers by 100 from their default values). In general, we do not recommend messing with port numbers, since the application server is not regularly tested with nonstandard port numbers, and a number of complications and side effects could arise from such use.
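Under the IP-binding approach, two JBoss AS instances on one machine can be started as follows (a sketch; the server configuration names and addresses are hypothetical):

```shell
# Instance 1 bound to the first IP alias
./run.sh -c node1 -b 10.10.20.111 &

# Instance 2 bound to the second IP alias; same ports, different address
./run.sh -c node2 -b 10.10.20.112 &
```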
To achieve optimal results, we should run no more server instances than the number of physical CPUs in the server; otherwise, the server instances would wait on one another to use the CPUs, creating more contention points. We should also keep the memory allocation for each server instance at around 1GB for optimal GC results.
Acknowledgments
Michael Yuan would like to thank Phillip Thurmond of Red Hat for reviewing this article and providing helpful suggestions.
Resources
- Download JBoss AS.
- The Dell Scalable Enterprise Technology Center has a lot of information on how to work with Dell multi-core 64-bit servers.
- JBoss AS Clustering Guide has further information about clustering of JBoss AS.
- Grinder and JMeter are two popular open source web performance-testing frameworks.
Michael Juntao Yuan specializes in lightweight enterprise and web applications, and end-to-end mobile application development.
Dave Jaffe is an engineer in Dell's Scalable Enterprise Technology Center.
Showing messages 1 through 7 of 7.
- Call-by-reference considered harmful
2006-11-06 15:37:32 steve.loughran [Reply | View]
- hmmmm
2006-11-04 01:21:39 lhe [Reply | View]
thanks for this nice article. answers questions i'd need myself to ask before i have the need to ;-)
- No really reasonable suggestion
2006-11-03 15:20:48 kutzi [Reply | View]
Sorry, but I think that some of your suggestions are not very reasonable. Examples:
- "You should [..] upgrade older applications to use the concurrency, NIO, and logging APIs whenever possible."
So you are suggesting changing old, perfectly running applications to the new concurrency utils and NIO without any hard evidence that the current implementation is too slow? And, even more important, that the new implementation would be substantially better?
- "If the system runs lengthy full GC very frequently, you probably have a memory leak somewhere in your application."
No, that would more probably mean that you allocate too many unneeded objects. A memory leak would be indicated if the heap usage directly after a full GC keeps going up.
- "Most Java EE application servers have not completely evolved their code base to take advantage of JDK 5.0 APIs, especially the concurrent utility libraries. In this case, it is crucial to choose an open source application server [..] where you can make changes to the server code."
Are you really sure that you want to suggest that average developers should start building the new concurrency classes into the application server of their choice?
2006-11-03 20:49:34 michael_yuan [Reply | View]
Kutzi,
Hmm, first and foremost, I am suggesting that developers take advantage of Java 5.0 APIs in their own applications.
Now, if the application server itself has concurrency issues, I understand that the average application developer would not know how to fix it. But you can at least find the contention point in the source code (in the case of an open source app server) and raise an issue (bug report or feature request) with the open source developers in that community ...
2006-11-06 05:54:16 kutzi [Reply | View]
> first and foremost, I am suggesting developers to take advantage of Java 5.0 APIs in their own applications.
Yes, that is certainly a good thing to do, but not "whenever possible" as you said -- only if the gains outweigh the risks. E.g., java.util.concurrent Locks are faster than synchronized blocks, but they bear the additional risk of forgetting to release the lock. And, as Goetz says in Java Concurrency in Practice, performance is a moving target. While Locks used to be much faster than synchronized, in Java 6 the gap has nearly closed.
- Your information on defaults seems outdated
2006-11-02 09:50:57 austinmills [Reply | View]
As of J2SE 5.0, the default max heap is no longer 64M. For server-class machines (according to http://java.sun.com/j2se/1.5.0/docs/guide/vm/server-class.html), which a 64-bit server would be, the VM defaults to the server VM and the heap boundaries are (excerpted from http://java.sun.com/j2se/1.5.0/docs/guide/vm/gc-ergonomics.html):
initial heap size: Larger of 1/64th of the machine's physical memory on the machine or some reasonable minimum.
maximum heap size: Smaller of 1/4th of the physical memory or 1GB.
So, it's not perfect, but it's a lot more reasonable than the pre-5.0 days.
I agree that it's better to just statically define the min and max to the same number -- I personally haven't seen it result in VM crashes, but I have seen noticeable pauses due to allocating additional heap.
2006-11-03 20:55:06 michael_yuan [Reply | View]
austinmills,
Thanks for the correction. I should really update the text if I can ... :) But even the 1/4 physical RAM limit is way too low for a server that is primarily used as a Java app server. So, I guess my primary argument (i.e., not to use the default memory setting) still stands. :)
Leaving JBoss in its default "unified classloader" mode may deliver performance, but it is a potential support/maintenance nightmare, as there is no longer any way to load duplicate JARs, even across deployments. Furthermore, once you commit to it, the probability of your EAR even deploying on another system is minimal, let alone working, because you can stick JARs like Log4J in one webapp and have them used inside an EJB JAR that would normally be loaded by a different classloader.
It is not a simple "flip this switch for speed" option. Yes, it can deliver speed, if you are unlucky enough to have used EJB in the first place. But the price is serious. Besides, stick to Spring and Hibernate and you don't need the EJB stuff; everything runs in the webapp, so there is no cross-classloader call to speed up in the first place.