Scaling Your Java EE Applications


If an application is useful, its network of users will grow crazily fast at some point. As more and more mission-critical applications now run on Java EE, many Java developers care about scalability issues. However, most popular Web 2.0 sites are built with script languages, and many voices doubt the scalability of Java applications. In this article, Wang Yu uses real-world cases to explain ways to scale Java applications, based on his experience with laboratory projects, bringing together practice, science, algorithms, frameworks, and lessons from failed projects to help readers build highly scalable Java applications.

I have been working in an internal laboratory for years. This laboratory is always equipped with the latest big servers from our company and is free for our partners to use to test the performance of their products and solutions. Part of my job is to help them tune performance on all kinds of powerful CMT and SMP servers.

Over these years, I have helped test dozens of Java applications across a variety of solutions. Many products aim at the same industry domains and have very similar functionality, but their scalability differs enormously: some can not only scale up on 64-CPU servers but also scale out to more than 20 server nodes, while others can only run on machines with no more than 2 CPUs.

The key to the difference lies in the vision of the architects when designing the products. The applications that scaled well were prepared for scalability from the requirements-collection phase through the system-design and implementation phases of the product life cycle. Your Java application's scalability really depends on your vision.

Scalability, as a property of systems, is generally difficult to define and is often conflated with "performance". Scalability is certainly closely related to performance, and its purpose is to achieve high performance; but the measurement of "scalability" is different from that of "performance". In this article, we will take the definitions from Wikipedia:

Scalability is a desirable property of a system, a network, or a process, which indicates its ability to either handle growing amounts of work in a graceful manner, or to be readily enlarged. For example, it can refer to the capability of a system to increase total throughput under an increased load when resources (typically hardware) are added.

To scale vertically (or scale up) means to add resources to a single node in a system, typically involving the addition of CPUs or memory to a single computer. Such vertical scaling of existing systems also enables them to leverage virtualization technology more effectively, as it provides more resources for the hosted set of operating systems and application modules to share.

To scale horizontally (or scale out) means to add more nodes to a system, such as adding a new computer to a distributed software application. An example might be scaling out from one web server system to three. As computer prices drop and performance continues to increase, low-cost "commodity" systems can be used for high-performance computing applications such as seismic analysis and biotechnology workloads that could in the past only be handled by supercomputers. Hundreds of small computers may be configured in a cluster to obtain aggregate computing power which often exceeds that of traditional RISC-processor-based scientific computers.

The first installment of this article will discuss scaling Java applications vertically.

How to scale Java EE applications vertically

Many software designers and developers treat functionality as the most important factor of a product and regard performance and scalability as add-on features to be dealt with after the fact. Most of them believe that expensive hardware can close the gap on any performance issue.

They are sometimes wrong. Last month, there was an urgent project in our laboratory. After the product failed to meet the customer's performance requirement on a 4-CPU machine, the partner wanted to test it on a bigger (8-CPU) server. The result: performance was worse than on the 4-CPU server.

Why did this happen? Basically, if a system is a multiprocess or multithreaded application and it is running out of CPU resources, it will most likely scale well when more CPUs are added.

Java technology-based applications embrace threading in a fundamental way. Not only does the Java language facilitate multithreaded applications, but the JVM itself is a multi-threaded process that provides scheduling and memory management for Java applications. Java applications that can benefit directly from multi-CPU resources include application servers such as BEA's WebLogic, IBM's WebSphere, or the open-source GlassFish and Tomcat application servers. All applications running on a Java EE application server can immediately benefit from CMT and SMP technology.

But in my laboratory, I have found that many products cannot make full use of the CPU resources. Some of them occupy no more than 20% of the CPU on an 8-CPU server. Such applications benefit little when more CPU resources are added.

Hot lock is the key enemy of scalability

The primary tool for managing coordination between threads in Java programs is the synchronized keyword. Because of the rules involving cache flushing and invalidation, a synchronized block in the Java language is generally more expensive than the critical-section facilities offered by many platforms. Even when a program contains only a single thread running on a single processor, a synchronized method call is still slower than an unsynchronized method call.

To observe problems caused by the synchronized keyword, just send a QUIT signal to the JVM process, which gives you a thread dump (the jstack tool in JDK 5.0 and above does the same). If you see many thread stacks like the following in the thread dump file, your system has a "hot lock" problem:

...........
"Thread-0" prio=10 tid=0x08222eb0 nid=0x9 waiting for monitor entry [0xf927b000..0xf927bdb8]
    at testthread.WaitThread.run(WaitThread.java:39)
    - waiting to lock <0xef63bf08> (a java.lang.Object)
    - locked <0xef63beb8> (a java.util.ArrayList)
    at java.lang.Thread.run(Thread.java:595)
.........

The synchronized keyword forces the scheduler to serialize operations on the synchronized block. If many threads compete for a contended lock, only one thread executes the synchronized block at a time, and every other thread waiting to enter that block is stalled. If no other threads are available for execution, processors may sit idle. In such situations, more CPUs help little with performance.

A hot lock may involve many thread switches and system calls. When multiple threads contend for the same monitor, the JVM has to maintain a queue of threads waiting for that monitor (and this queue must be synchronized across processors), which means more time spent in JVM or OS code and less time spent in your program code.

To avoid hot-lock problems, the following suggestions may help:

Make synchronized blocks as short as possible

The shorter the time a thread holds a given lock, the lower the probability that another thread will contend for it. So while you should still use synchronization to access shared variables, you can move thread-safe code outside of the synchronized block. Take the following code as an example:

Code list 1:
public boolean updateSchema(HashMap nodeTree) {
    synchronized (schema) {
        String nodeName = (String) nodeTree.get("nodeName");
        List nodeAttributes = (List) nodeTree.get("attributes");
        if (nodeName == null)
            return false;
        else
            return schema.update(nodeName, nodeAttributes);
    }
}

This code intends to protect the shared variable "schema" while updating it. But the code that extracts the attribute values is thread safe and can be moved out of the block, making the synchronized block shorter:

Code list 2:
public boolean updateSchema(HashMap nodeTree) {
    String nodeName = (String) nodeTree.get("nodeName");
    List nodeAttributes = (List) nodeTree.get("attributes");
    synchronized (schema) {
        if (nodeName == null)
            return false;
        else
            return schema.update(nodeName, nodeAttributes);
    }
}

Reducing lock granularity

When you use the "synchronized" keyword, you have two choices of granularity: "method locks" or "block locks". If you put "synchronized" on a method, you implicitly lock on the "this" object.

Code list 3:
public class SchemaManager {
    private HashMap schema;
    private HashMap treeNodes;
    ....
    public synchronized boolean updateSchema(HashMap nodeTree) {
        String nodeName = (String) nodeTree.get("nodeName");
        List nodeAttributes = (List) nodeTree.get("attributes");
        if (nodeName == null)
            return false;
        else
            return schema.update(nodeName, nodeAttributes);
    }

    public synchronized boolean updateTreeNodes() {
        ......
    }
}

Compared with Code list 2, this code is worse, because it locks the entire object whenever the "updateSchema" method is called. To achieve finer granularity, lock just the "schema" instance variable instead of the whole "SchemaManager" instance, so that different methods can run in parallel.
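A minimal sketch of that finer-grained version, following the article's own pseudo-code (the update() calls on the HashMap fields are kept from Code list 3 and are not real HashMap API; note also that fields used as lock objects should be final so the lock never changes identity):

public class SchemaManager {
    private final HashMap schema = new HashMap();
    private final HashMap treeNodes = new HashMap();

    public boolean updateSchema(HashMap nodeTree) {
        String nodeName = (String) nodeTree.get("nodeName");
        List nodeAttributes = (List) nodeTree.get("attributes");
        synchronized (schema) {            // locks only the schema map
            if (nodeName == null)
                return false;
            return schema.update(nodeName, nodeAttributes);
        }
    }

    public boolean updateTreeNodes() {
        synchronized (treeNodes) {         // independent lock; can run in
            ......                         // parallel with updateSchema()
        }
    }
}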

Avoid locks on static methods

The worst choice is to put the "synchronized" keyword on static methods, because that locks the Class object itself, serializing callers across all instances of the class. One of the projects tested in our laboratory turned out to have exactly this issue. During testing, we found almost all working threads waiting on a static lock (a Class lock):

--------------------------------
at sun.awt.font.NativeFontWrapper.initializeFont(Native Method)
- waiting to lock <0xeae43af0> (a java.lang.Class)
at java.awt.Font.initializeFont(Font.java:316)
at java.awt.Font.readObject(Font.java:1185)
at sun.reflect.GeneratedMethodAccessor147.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:324)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:838)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1736)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1646)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1274)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1835)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1759)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1646)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1274)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1835)
at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:452)
at com.fr.report.CellElement.readObject(Unknown Source)
.........

When Java2D was used to generate font objects for the reports, a native static lock was held on the font "initialize" method. To be fair, this lock lived inside Sun's JDK 1.4 (HotSpot) rather than in the product's own code; after switching to JDK 5.0, the static lock disappeared.

Using lock-free data structures in Java SE 5.0

The "synchronized" keyword in Java is simply a relatively coarse-grained coordination mechanism,and as such, is fairly heavy for managing a simple operation such as incrementing a counter orupdating a value, like following code:

Code list 4:
public class OnlineNumber {
    private int totalNumber;

    public synchronized int getTotalNumber() {
        return totalNumber;
    }

    public synchronized int increment() {
        return ++totalNumber;
    }

    public synchronized int decrement() {
        return --totalNumber;
    }
}

The code above locks only very simple operations, and the "synchronized" blocks are very short. However, if the lock is heavily contended (threads frequently try to acquire it while it is already held by another thread), throughput suffers, because contended synchronization can be quite expensive.

Fortunately, in Java SE 5.0 and above, you can write wait-free, lock-free algorithms with the help of hardware synchronization primitives, without using native code. Almost all modern processors have instructions for updating shared variables in a way that can either detect or prevent concurrent access from other processors. These instructions are called compare-and-swap, or CAS.

A CAS operation takes three parameters: a memory location, the expected old value, and a new value. The processor updates the location to the new value if the value currently there matches the expected old value; otherwise it does nothing. In either case, it returns the value that was at the location prior to the CAS instruction. One way to use CAS for synchronization is as follows:

Code list 5:
// "value" is a hypothetical holder whose compareAndSwap(expected, new)
// atomically updates the location and returns the value it saw there.
public int increment() {
    int oldValue = value.getValue();
    int newValue = oldValue + 1;
    while (value.compareAndSwap(oldValue, newValue) != oldValue)
        oldValue = value.getValue();
    return oldValue + 1;
}

First, we read the value from the location, then perform a multi-step computation to derive a new value (this example just increments by one), and then use CAS to change the value at the location from oldValue to newValue. The CAS succeeds if the value at the location has not changed in the meantime. If another thread modified the variable concurrently, the CAS operation fails, but the loop detects this and retries. The best thing about CAS is that it is implemented in hardware and is extremely lightweight. If 100 threads execute this increment() method at the same time, in the worst case each thread will have to retry at most 99 times before its increment completes.
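With the real java.util.concurrent.atomic API, the same retry loop can be written against AtomicInteger.compareAndSet, which returns a boolean rather than the old value. A small sketch (the Counter class is illustrative):

import java.util.concurrent.atomic.AtomicInteger;

public class Counter {
    private final AtomicInteger value = new AtomicInteger(0);

    public int increment() {
        for (;;) {
            int oldValue = value.get();
            int newValue = oldValue + 1;
            // compareAndSet succeeds only if the value is still oldValue
            if (value.compareAndSet(oldValue, newValue))
                return newValue;
            // otherwise another thread won the race; re-read and retry
        }
    }
}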

The java.util.concurrent.atomic package in Java SE 5.0 and above provides classes that support lock-free, thread-safe programming on single variables. The atomic variable classes all expose a compare-and-set primitive, which is implemented using the fastest native construct available on the platform. Nine flavors of atomic variables are provided in this package: AtomicInteger, AtomicLong, AtomicReference, AtomicBoolean, array forms of atomic integer, long, and reference, and atomic marked-reference and stamped-reference classes, which atomically update a pair of values.

Using the atomic package is easy. Here is the increment method of Code list 5 rewritten:

Code list 6:
import java.util.concurrent.atomic.*;
....
private AtomicInteger value = new AtomicInteger(0);

public int increment() {
    return value.getAndIncrement();
}
....

Nearly all the classes in the java.util.concurrent package use atomic variables instead of synchronization, either directly or indirectly. Classes like ConcurrentLinkedQueue use atomic variables to directly implement wait-free algorithms, and classes like ConcurrentHashMap use ReentrantLock for locking where needed. ReentrantLock, in turn, uses atomic variables to maintain the queue of threads waiting for the lock.

One success story for lock-free algorithms is a financial system tested in our laboratory: after the "Vector" data structure was replaced with "ConcurrentHashMap", performance on our CMT machine (8 cores) increased more than three times.

Race conditions can also cause scalability problems

Too many "synchronized" keywords will cause the scalability problems. But in some special cases,lack of "synchronized" can also cause the system fail to scale vertically. The lack of"synchronized" can cause race conditions, allowing more than two threads to modify the sharedresources at the same time, and may corrupt some shared data. Why do I say it will cause thescalability problem?

Let's take a real-world case as an example: an ERP system for manufacturing. When we tested its performance on one of our latest CMT servers (2 CPUs, 16 cores, 128 hardware strands), we found CPU usage above 90%. This was a big surprise, because few applications scale that well on this type of machine. Our excitement lasted only 5 minutes before we discovered that the average response time was very high and the throughput unbelievably low. What were the CPUs doing? Weren't they busy? What were they busy with? Using the tracing tools in the OS, we found that almost all the CPUs were doing the same thing, "HashMap.get()", and it appeared that all CPUs were stuck in infinite loops. We then tested the application on servers with different numbers of CPUs. The result: the more CPUs the server had, the more often the infinite loop occurred.

The root cause of the infinite loop was an unprotected shared variable, a "HashMap" data structure. After adding the "synchronized" keyword to all of its access methods, everything returned to normal. Checking the source code of "HashMap" (Java SE 5.0), we found the potential for such an infinite loop when its internal structure is corrupted: as the following code shows, if concurrent updates cause the entries of the HashMap to form a cycle, "e.next" will never become null and the loop never terminates.

Code list 7:
public V get(Object key) {
    if (key == null)
        return getForNullKey();
    int hash = hash(key.hashCode());
    for (Entry<K,V> e = table[indexFor(hash, table.length)];
         e != null;
         e = e.next) {
        Object k;
        if (e.hash == hash && ((k = e.key) == key || key.equals(k)))
            return e.value;
    }
    return null;
}

Not only the get() method, but also put() and other methods are exposed to this risk. Is this a JVM bug? No; it was reported long ago (see http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6423457). Sun's engineers did not consider it a bug, and instead suggested using "ConcurrentHashMap". Take that into consideration when building a scalable system.
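A minimal sketch of that suggestion (the registry class and its method names are hypothetical, not from the tested product):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SessionRegistry {
    // Thread-safe without external synchronization; get() and put()
    // never corrupt the internal structure under concurrent access.
    private final Map<String, Object> sessions =
            new ConcurrentHashMap<String, Object>();

    public void register(String id, Object session) {
        sessions.put(id, session);
    }

    public Object lookup(String id) {
        return sessions.get(id);
    }
}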

Non-Blocking IO vs. Blocking IO

The java.nio package, introduced in Java 1.4, allows developers to achieve greater performance in data processing and offers better scalability. The non-blocking I/O operations provided by NIO let Java applications perform I/O in a style closer to what is available in lower-level languages like C. There are many NIO frameworks available today, such as Mina from Apache and Grizzly from Sun, which are widely used in many projects and products.

During the last 5 months, two Java EE projects were hosted in our laboratory solely to compare their products' performance on a traditional blocking-I/O server versus a non-blocking-I/O server. They chose Tomcat 5 as the blocking-I/O server and Glassfish as the non-blocking-I/O server.

First, they tested a few simple JSP pages and servlets and got the following results (on a 4-CPU server):

Concurrent Users    Average Response Time (ms)
                    Tomcat    Glassfish
       5              30        138
      15              35        142
      30              37        142
      50              41        151
     100              65        155

According to these results, the performance of Glassfish was far behind Tomcat's. The customer began to doubt the advantages of non-blocking I/O. Why do so many articles and technical reports praise the performance and scalability of NIO?

After testing more scenarios, they changed their minds, as the results gradually revealed the power of NIO. What they tested:

  1. More complex scenarios instead of simple JSPs and servlets, involving EJB, database, file IO, JMS, and transactions.
  2. Many more simulated concurrent users, from 1,000 up to 10,000.
  3. Different hardware environments, from 2 CPUs and 4 CPUs up to 16 CPUs.

The figure below shows the results of the testing on a 4-CPU server.

Figure 1: Throughput on a 4-CPU server

Traditional blocking I/O dedicates a working thread to each incoming request. The assigned thread is responsible for the whole life cycle of the request: reading the request data from the network, decoding the parameters, computing or calling other business logic functions, encoding the result, and sending it back to the requester. The thread then returns to the thread pool to be reused by other requests. This model, as used in Tomcat 5, is very effective for simple logic with a small number of concurrent users in a perfect network environment.
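A stripped-down sketch of that thread-per-request model (the port, pool size, and handle() method are illustrative stand-ins, not from any of the tested products):

import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BlockingServer {
    public static void main(String[] args) throws IOException {
        ExecutorService pool = Executors.newFixedThreadPool(200); // "maxThreads"
        ServerSocket server = new ServerSocket(8080);
        while (true) {
            final Socket client = server.accept();   // blocks for a connection
            pool.execute(new Runnable() {
                public void run() {
                    // This one thread owns the whole request life cycle:
                    // read request, decode, run business logic, encode, write.
                    // Every blocking read/write parks the thread.
                    handle(client);
                }
            });
        }
    }

    private static void handle(Socket client) { /* hypothetical */ }
}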

But if a request involves complex logic or interacts with external systems such as file systems, databases, or a message server, the working thread is blocked for most of the processing time, waiting for syscalls or network transfers to return. The blocked thread is held by the request until processing finishes, and the operating system parks it to free the CPU for other requests. If the network between the clients and the server is poor, network latency blocks threads even longer. Worse, when keep-alive is required, the current working thread may stay blocked long after request processing has finished. To better utilize the CPU resources, more working threads are needed.

Tomcat uses a thread pool, and each request is served by an idle thread from the pool. "maxThreads" determines the maximum number of threads Tomcat can create to service requests. If we set "maxThreads" too small, we cannot fully utilize the CPU resources and, more importantly, many requests get dropped or rejected by the server as the number of concurrent users grows. In this test we set "maxThreads" to 1000 (which is too large and unfair to Tomcat). With such a setting, Tomcat spawns a great many threads when concurrent users reach a high level.

A large number of Java threads keeps the JVM and OS busy with scheduling and maintenance work for those threads instead of processing business logic. Moreover, more threads consume more JVM heap memory (each thread stack occupies some memory) and cause more frequent garbage collection.

Glassfish doesn't need so many threads. In non-blocking IO, a working thread is not bound to a dedicated request; if a request blocks for any reason, the thread is reused by other requests. In this way, Glassfish can handle thousands of concurrent users with only tens of working threads. By limiting thread resources, non-blocking IO achieves much better scalability (see Figure 2 below). That's why Tomcat 6 has embraced non-blocking IO too.
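For contrast, a minimal sketch of the selector loop that underlies such non-blocking servers (real frameworks such as Grizzly or Mina add buffering, protocol handling, and worker thread pools on top; port and buffer size are illustrative):

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.*;
import java.util.Iterator;

public class SelectorServer {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.socket().bind(new InetSocketAddress(8080));
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        ByteBuffer buffer = ByteBuffer.allocate(4096);
        while (true) {
            selector.select();               // one thread waits on ALL channels
            Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
            while (keys.hasNext()) {
                SelectionKey key = keys.next();
                keys.remove();
                if (key.isAcceptable()) {
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    SocketChannel client = (SocketChannel) key.channel();
                    buffer.clear();
                    if (client.read(buffer) < 0) {  // no thread parked per request
                        client.close();
                    }
                    // hand decoded data off to a small worker pool here
                }
            }
        }
    }
}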

Figure 2: Scalability test result

The single-thread task problem

A Java EE-based ERP system was tested in our laboratory some months ago, and one of its test scenarios generated a very complex annual report. We tested this scenario on different servers and found that the cheapest AMD PC server delivered the best performance. That AMD server had only two 2.8GHz CPUs and 4GB of memory, yet it outperformed an expensive 8-CPU SPARC server with 32GB of memory.

The reason is that the scenario is a single-thread task, which can only be run by a single user (concurrent access by many users is meaningless in this case), so it can use only one CPU while running. Such a task cannot scale across multiple processors; most of the time, CPU clock frequency plays the leading role in performance for such cases.

Parallelization is the solution. To parallelize a single-thread task, you must find some level of independence in the order of operations, then use multiple threads to exploit it. In this case, the customer refactored their "annual report generation" task to generate monthly reports first and then build the annual report from those 12 monthly reports. The monthly reports are not just intermediate results, since such reports are useful to end users in their own right; more importantly, they can be generated concurrently and then used to assemble the final report quickly. In this way, the scenario scaled very well on a 4-CPU SPARC server and beat the AMD server by more than 80% on performance.
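A hedged sketch of that decomposition using a plain thread pool (ReportService and the report types are hypothetical stand-ins for the customer's code):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class AnnualReportJob {
    private final ReportService service = new ReportService();

    public AnnualReport generate(final int year) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try {
            List<Future<MonthlyReport>> futures =
                    new ArrayList<Future<MonthlyReport>>();
            for (int m = 1; m <= 12; m++) {
                final int month = m;
                // The 12 monthly reports are independent, so run them in parallel.
                futures.add(pool.submit(new Callable<MonthlyReport>() {
                    public MonthlyReport call() throws Exception {
                        return service.generateMonthly(year, month);
                    }
                }));
            }
            List<MonthlyReport> monthly = new ArrayList<MonthlyReport>();
            for (Future<MonthlyReport> f : futures)
                monthly.add(f.get());              // wait for all 12
            return service.aggregate(monthly);     // cheap final step
        } finally {
            pool.shutdown();
        }
    }
}

// Hypothetical stub types, only to make the sketch self-contained:
class MonthlyReport { }
class AnnualReport { }
class ReportService {
    MonthlyReport generateMonthly(int year, int month) { return new MonthlyReport(); }
    AnnualReport aggregate(List<MonthlyReport> parts) { return new AnnualReport(); }
}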

Re-architecting and re-coding a whole solution is time-consuming and error-prone work. One of the projects in our laboratory used JOMP to parallelize its single-thread tasks. JOMP is a Java API for thread-based SMP parallel programming. Just like OpenMP, JOMP uses compiler directives to insert parallel programming constructs into a regular program. In a Java program, the JOMP directives take the form of comments beginning with //omp. The JOMP program is run through a precompiler that processes the directives and produces the actual Java program, which is then compiled and executed. JOMP supports most features of OpenMP, including work-sharing parallel loops and parallel sections, shared variables, thread-local variables, and reduction variables. The following code is an example of JOMP programming.

Code list 8:
LinkedList<String> c = new LinkedList<String>();
c.add("this");
c.add("is");
c.add("a");
c.add("demo");
//omp parallel iterator
for (String s : c)
    System.out.println(s);

Like most parallelizing compilers, JOMP focuses on loop-level and collection parallelism, studying how to execute different iterations simultaneously. To be parallelized, two iterations must not have any data dependency; that is, neither may rely on calculations that the other performs.

Writing a JOMP program is not easy. First, you must be familiar with the OpenMP directives and with how they map onto the JVM memory model; then you need to know your business logic well enough to put the right directives in the right places.

Another choice is to use Parallel Java. Parallel Java (PJ), like JOMP, supports most features of OpenMP; but unlike JOMP, PJ's parallel constructs are obtained by instantiating library classes rather than by inserting precompiler directives, so "Parallel Java" needs no extra precompilation step. Parallel Java is useful not only for parallelization across multiple CPUs, but also for scalability across multiple nodes. The following code is an example of "Parallel Java" programming.

Code list 9:
// Classes from the Parallel Java library (assumed to live in the edu.rit.pj package):
import edu.rit.pj.IntegerForLoop;
import edu.rit.pj.ParallelRegion;
import edu.rit.pj.ParallelTeam;

static double[][] d;

new ParallelTeam().execute(new ParallelRegion() {
    public void run() throws Exception {
        for (int ii = 0; ii < n; ++ii) {
            final int i = ii;
            // Divide the row range [0, n-1] among the team's threads.
            execute(0, n - 1, new IntegerForLoop() {
                public void run(int first, int last) {
                    for (int r = first; r <= last; ++r) {
                        for (int c = 0; c < n; ++c) {
                            d[r][c] = Math.min(d[r][c], d[r][i] + d[i][c]);
                        }
                    }
                }
            });
        }
    }
});

Scale Up to More Memory

Memory is an important resource for your applications. Enough memory is critical to the performance of any application, especially database systems and other I/O-centric systems. More memory means larger shared memory spaces and larger data buffers, enabling applications to read more data from memory instead of from slow disks.

Java garbage collection relieves programmers of the burden of freeing allocated memory and thereby makes them more productive. The disadvantage of a garbage-collected heap is that it halts almost all working threads while garbage is being collected. In addition, programmers in a garbage-collected environment have less control over the scheduling of CPU time devoted to freeing objects that are no longer needed. For near-real-time applications, such as telco systems and stock-trading systems, this kind of delay and reduced control is a big risk.

Coming back to the question of whether Java applications scale when given more memory: the answer is yes, sometimes. Too little memory causes garbage collection to happen too frequently; enough memory keeps the JVM processing your business logic most of the time instead of collecting garbage.

But it is not always true. A real-world case in my laboratory was a telco system built on a 64-bit JVM. By using a 64-bit JVM, the application could break through the 4GB memory limit of a 32-bit JVM. It was tested on a 4-CPU server with 16GB of memory, of which 12GB was given to the Java application. To improve performance, the application cached more than 3,000,000 objects in memory at initialization time, to avoid creating too many objects at runtime. The product ran very fast during the first hour of testing; then suddenly the system halted for more than 30 minutes. We determined that it was garbage collection that had stopped the system for half an hour.

Garbage collection is the process of reclaiming memory occupied by unreferenced objects: objects the application can no longer reach because all references to them have gone out of scope. If a huge number of live objects exist in memory (like those 3,000,000 cached objects), the garbage collection process takes a long time to traverse them all. That's why the system halted for such a long, unacceptable time.

In other memory-centric Java applications tested in our laboratory, we found the following characteristics:

  1. Every request required big, complex objects.
  2. Too many objects were kept in the HttpSession of every session.
  3. The HttpSession timeout was too long, and HttpSession was never explicitly invalidated (see the sketch after this list).
  4. The thread pool, EJB pool, and other object pools were set too large.
  5. The object cache was set too large.
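For item 3, a small sketch of explicit session hygiene in a servlet (the class name is hypothetical and error handling is omitted):

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpSession;

public class LogoutServlet extends HttpServlet {
    protected void doPost(HttpServletRequest req, HttpServletResponse resp) {
        HttpSession session = req.getSession(false); // don't create one just to kill it
        if (session != null) {
            session.invalidate();   // frees the session's memory immediately,
        }                           // instead of waiting for the timeout
    }
}

At login time, session.setMaxInactiveInterval(15 * 60) (seconds; an illustrative value) keeps the timeout short rather than relying on a long container default.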

Those kinds of applications don't scale well. As the number of concurrent users increases, their memory usage grows dramatically. If a large number of live objects cannot be recycled in time, the JVM spends considerable time on garbage collection. On the other hand, if given too much memory (on a 64-bit JVM), the JVM still ends up spending considerable time on garbage collection after running for a relatively long time.

The conclusion is that Java applications are NOT made scalable just by giving them more memory. In most cases, 3GB of memory assigned to the Java heap (through the "-Xmx" option) is enough (on some operating systems, such as Windows and Linux, you may not be able to use more than 2GB of heap on a 32-bit JVM). If you have more memory than the JVM can use (memory is cheap these days), give it to other applications on the same system, or just leave it to the operating system; most OSs use spare memory as data buffers and caches to improve IO performance.

The Real-Time Specification for Java (JSR 1) gives the programmer control over memory collection. Applications can use this feature to tell the JVM, in effect: "this huge region of memory is my cache, I will take care of it myself, please don't collect it automatically." This functionality can make Java applications scale over huge memory resources. I hope JVM vendors will bring it into normal, free JVM versions in the near future.

To scale these memory-centric Java applications, you need multiple JVM instances or multiple machine nodes.

Other Scale Up Problems

Some scalability problems in Java EE applications do not originate in the applications themselves; limitations of external systems can become the scalability bottleneck. Such bottlenecks may include:

  • Database management systems: This is the most common bottleneck for enterprise and Web 2.0 applications, because the database is normally shared by all the JVM threads. The efficiency of database access and the isolation levels between database transactions therefore affect scalability significantly. We have seen many projects where most of the business logic resides in the database as stored procedures, keeping the Web tier very lightweight and doing little more than simple data filtering and invoking those stored procedures. This architecture causes a lot of scalability problems as the number of requests grows.
  • Disk IO and Network IO
  • Operating system: Sometimes the scalability bottleneck lies in a limitation of the operating system. For example, putting too many files in the same directory can make file systems slow at creating and finding files.
  • Synchronous logging: This is a common scalability problem. In some cases it was solved by sending log records to a dedicated logging server (for example, one built with Apache log4j); others have used JMS messages to turn synchronous logging into asynchronous logging, as sketched below.
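A minimal in-process sketch of that asynchronous-logging idea (the class name is hypothetical; request threads only enqueue, a single background thread does the slow disk writes; a JMS-based variant would replace the queue with a JMS destination):

import java.io.FileWriter;
import java.io.PrintWriter;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class AsyncLogger {
    private final BlockingQueue<String> queue =
            new LinkedBlockingQueue<String>(10000);

    public AsyncLogger(final String file) {
        Thread writer = new Thread(new Runnable() {
            public void run() {
                try {
                    PrintWriter out = new PrintWriter(new FileWriter(file, true));
                    while (true) {
                        out.println(queue.take()); // the only thread touching disk
                        out.flush();
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });
        writer.setDaemon(true);
        writer.start();
    }

    // Called from request threads; never blocks on disk IO.
    public void log(String message) {
        queue.offer(message); // drops the message if the queue is full
    }
}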

These problems are not specific to Java EE applications; they affect all systems on any platform. Resolving them requires help from database administrators, system engineers, and network analysts at every level of the system.

The second installment of this article will discuss problems with scaling horizontally.

Original article: http://www.theserverside.com/news/1363681/Scaling-Your-Java-EE-Applications