Byte Buffers and Non-Heap Memory


Most Java programs spend their time working with objects on the JVM heap, using getter and setter methods to retrieve or change the data in those objects. A few programs, however, need to do something different. Perhaps they're exchanging data with a program written in C. Or they need to manage large chunks of data without the risk of garbage collection pauses. Or maybe they need efficient random access to files. For all these programs, a java.nio.ByteBuffer is an alternative to traditional Java objects.

Prologue: The Organization of Objects

Let's start by comparing two class definitions. The first is C++, the second is Java. They both declare the same member variables, with (mostly) the same types, but there's an important difference: the C++ class describes the layout of a block of memory; the Java class doesn't.

class TCP_Header
{
    unsigned short  sourcePort;
    unsigned short  destPort;
    unsigned int    seqNum;
    unsigned int    ackNum;
    unsigned short  flags;
    unsigned short  windowSize;
    unsigned short  checksum;
    unsigned short  urgentPtr;
    char            data[1];
};
public class TcpHeader
{
    private short   sourcePort;
    private short   destPort;
    private int     seqNum;
    private int     ackNum;
    private short   flags;
    private short   windowSize;
    private short   checksum;
    private short   urgentPtr;
    private byte[]  data;
}

C++, like C before it, is a systems programming language. That means that it will be used to directly access objects like network protocol buffers, which are defined in terms of byte offsets from a base address. One way to do this is with pointer arithmetic and casts, but that's error-prone. Instead, C (and C++) allows you to use a structure or class definition as a “view” on arbitrary memory. You can take any pointer, cast it as a pointer-to-structure, and then access that memory using code like p->seqNum; the compiler does the pointer arithmetic for you.

In a real C++ program, of course, you'd define such structures using the fixed-width types from stdint.h; the int on one system may not be the same size as the int on another (I got to experience this first-hand as processors moved from 16 to 32 bits, and the experience scarred me for life). Java is better than C++ in this respect: an int is defined to be 32 bits, and a short is defined to be 16 bits, regardless of the processor on which your program is running.

However, Java comes with its own data access caveats. Foremost among them is that the in-memory layout of instance data is explicitly not defined. Code like obj.seqNum does not translate to pointer arithmetic; it translates to the bytecode operations getfield or putfield (depending on which side of an assignment it appears on). These operations are responsible for finding the particular field within the object and converting it (if necessary) to fit the fixed-width JVM stack.

By giving the JVM the flexibility to arrange its objects' fields as it sees fit, different implementations can make the most efficient use of their hardware. For example, consider a machine that allows access to memory only in 32-bit increments, and an object that has several byte fields: the JVM is allowed to re-arrange those fields and combine them into a contiguous 32-bit word, then use shifts and masks to extract the data for use.

ByteBuffer

The fact that Java objects may be laid out differently than defined is irrelevant to most programmers. Since a Java class cannot be used as a view on arbitrary memory, you'll never notice if the JVM has decided to shuffle its members. However, there are situations where it would be nice to have this ability; there is a lot of structured binary data in the real world. Prior to JDK 1.4, Java programmers had limited options: they could read data into a byte[] and use explicit offsets (along with bitwise operators) to combine those bytes into larger entities, or they could wrap that byte[] in a DataInputStream and get automatic conversion without random access.
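
For example, assembling a big-endian int from a byte[] by hand looks like this (a sketch of that first approach; data is the array just read):

// mask each byte to prevent sign extension, then shift it into place
int value = ((data[0] & 0xFF) << 24)
          | ((data[1] & 0xFF) << 16)
          | ((data[2] & 0xFF) <<  8)
          |  (data[3] & 0xFF);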

The ByteBuffer class arrived in JDK 1.4 as part of the java.nio package, and combines larger-than-byte data operations with random access. A ByteBuffer can be created in one of two ways: either by wrapping an existing byte array or by letting the implementation allocate its own underlying array. Most of the time, you'll want to do the former:

byte[] data = new byte[16];
ByteBuffer buf = ByteBuffer.wrap(data);

buf.putShort(0, (short)0x1234);
buf.putInt(2, 0x12345678);
buf.putLong(8, 0x1122334455667788L);

for (int ii = 0 ; ii < data.length ; ii++)
    System.console().printf("index %2d = %02x\n", ii, data[ii]);
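
If you run this, you should see the big-endian layout that is ByteBuffer's default: 12 34 at indices 0 and 1, 12 34 56 78 at indices 2 through 5, zeros at indices 6 and 7, and 11 22 33 44 55 66 77 88 at indices 8 through 15.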

When working with a ByteBuffer, it's important to keep track of your indices. In the code above, we wrote a two-byte value at index 0, and a four-byte value at index 2; what happens if we try to read a four-byte value at index 0?

System.console().printf(
        "retrieving value from wrong index = %04x\n",
        buf.getInt(0));
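
The answer is a plausible-looking but meaningless number: with the default big-endian ordering, getInt(0) combines the two bytes of the short (12 34) with the first two bytes of the int (12 34), and prints 12341234.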

The best way to ensure that you're always using the correct indices is to create a class that wraps the actual buffer and provides bean-style setters and getters. Taking the TCP header as an example:

public class TcpHeaderWrapper
{
    ByteBuffer buf;

    public TcpHeaderWrapper(byte[] data)
    {
        buf = ByteBuffer.wrap(data);
    }

    public short getSourcePort()
    {
        return buf.getShort(0);
    }

    public void setSourcePort(short value)
    {
        buf.putShort(0, value);
    }

    public short getDestPort()
    {
        return buf.getShort(2);
    }

    // and so on
}
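
Usage then looks like ordinary bean access (a sketch; readPacket() is a hypothetical helper that returns one packet's raw bytes):

byte[] packet = readPacket();   // hypothetical helper, returns raw packet bytes
TcpHeaderWrapper header = new TcpHeaderWrapper(packet);

// ports are unsigned 16-bit values, so mask off sign extension
System.out.printf("source port = %d%n", header.getSourcePort() & 0xFFFF);
System.out.printf("dest port   = %d%n", header.getDestPort() & 0xFFFF);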

Slicing a ByteBuffer

Continuing with the TCP example, let's consider the actual content of the TCP packet, an arbitrary-length array of bytes that follows the fixed header fields. The C++ version defines data, a single-element array. This is an idiomatic usage of C-style arrays, relying on the fact that the compiler treats an array name as a pointer to the first element of the array, and will emit code to access elements of the array without bounds checks. In Java, of course, arrays are first-class objects, and an array member variable holds a pointer to the actual data (apparently the very earliest implementations of C did the same thing, until Dennis Ritchie decided that the “array is pointer” approach made for a more natural programming style; I highly recommend his article on the history and evolution of C, as an example of how a language designer thinks).

There are two ways to extract an arbitrary array of bytes from a ByteBuffer. The first is to use the get() method to copy bytes into an array that you pre-allocate:

public byte[] getData()
{
    buf.position(getDataOffset());
    int size = buf.remaining();
    byte[] data = new byte[size];
    buf.get(data);
    return data;
}

This is usually the wrong approach, because it copies the data from the buffer into your array. Most of the time, you'll want to access the data via a ByteBuffer, perhaps using a different bean-style class. Extracting an array just so that you can wrap it is very inefficient. Instead, call slice():

public ByteBuffer getDataAsBuffer()
{
    buf.position(getDataOffset());
    return buf.slice();
}

If you looked closely, you may have noticed that the last two code snippets both called position(), while in earlier examples I passed in an explicit offset. Most ByteBuffer methods have two forms: one that takes an explicit position, for random access, and one that doesn't, used to read the buffer sequentially. I find little use for the latter, but when working with byte[] you have no choice: there aren't any methods that support random access.

Something else to remember: when you call slice() the new buffer shares the same backing store as the old. Any changes that you make in one buffer will appear in the other. If you don't want that to happen, you need to use get(), which copies the data.
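
A minimal sketch of that sharing:

byte[] data = new byte[8];
ByteBuffer original = ByteBuffer.wrap(data);

original.position(4);
ByteBuffer sliced = original.slice();   // index 0 of the slice is index 4 of the original

sliced.put(0, (byte)0x7F);              // writes through the shared backing array
System.out.println(data[4]);            // prints 127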

Beware Endianness

In Gulliver's Travels, the two societies of Lilliputians break their eggs from different ends, and that minor difference has led to eternal strife. Computer architectures suffer from a similar strife, based on the way that multi-byte values (eg, 32-bit integers) are stored in memory. “Little-endian” machines, such as the PDP-11, 8080, and 80x86 store low-order bytes first in memory: the integer value 0x12345678 is stored in the successive bytes 0x78, 0x56, 0x34, and 0x12. “Big-endian” machines, like the Motorola 68000 and Sun SPARC, put the high-order bytes first: 0x12345678 is stored as 0x12, 0x34, 0x56, and 0x78.

Java manages data in Big-Endian form. However, most Java programs run on Intel processors, which are Little-Endian. This can cause a lot of problems if you're trying to exchange data between a Java program and a C or C++ program running on the same machine. For example, here's a C program that writes a 4-byte signed integer to a file:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main(int argc, char** argv)
{
    int fd = creat("/tmp/example.dat", 0777);
    if (fd < 0)
    {
        perror("unable to create file");
        return(1);
    }

    int value = 0x12345678;
    write(fd, &value, sizeof(value));
    close(fd);
    return (0);
}

On a Linux system, you can use the od command to dump the file's content:

~, 524> od -tx1 /tmp/example.dat
0000000 78 56 34 12
0000004

When you write a naive Java program to retrieve that data, you see the same thing.

byte[] data = new byte[4];
FileInputStream in = new FileInputStream("/tmp/example.dat");
if (in.read(data) < 4)
    throw new Exception("unable to read file contents");

ByteBuffer buf = ByteBuffer.wrap(data);
System.console().printf("data = %x\n", buf.getInt(0));

If you want to see the correct data, you must explicitly tell the buffer that it's Little-Endian:

buf.order(ByteOrder.LITTLE_ENDIAN);
System.console().printf("data = %x\n", buf.getInt(0));

Here's the problem with that code: how do you know that the data is Little-Endian? One common solution is to start files with a “magic number” that indicates the byte order. For example, UTF-16 files begin with the value 0xFEFF: a reader can look at those two bytes, and select Big-Endian conversion if they're in the order 0xFE 0xFF, Little-Endian if they're in the order 0xFF 0xFE.
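
As a sketch, here's how a reader might use that marker (data holds the raw bytes, beginning with the two-byte marker):

ByteBuffer buf = ByteBuffer.wrap(data);
if (buf.getChar(0) == '\uFFFE')          // bytes FF FE: the writer was little-endian
    buf.order(ByteOrder.LITTLE_ENDIAN);
// bytes FE FF read as FEFF under the big-endian default, so no change is needed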

An alternative is to specify the ordering, and require writers to follow that specification. For example, the set of protocols collectively known as TCP/IP all require Big-Endian ordering, while the GIF graphics file format is Little-Endian.

Interlude: A Short Tour of Virtual Memory

ByteBuffer isn't just used to retrieve structured data from a byte[]. It also allows you to create and work with memory outside of the Java heap, including memory-mapped files. This latter feature is a great way to work with large amounts of structured data, as it lets you leverage the operating system's memory manager to move data in and out of memory in a way that's transparent to your program.

A program running on a modern operating system thinks that it has a large, contiguous allotment of memory: 2 gigabytes in the case of 32-bit editions of Windows and Linux, 8 terabytes or more for x64 editions (limited both by the operating system and the hardware itself). Behind the scenes, the operating system maintains a “page table” that identifies where in physical memory (or disk) the data for a given virtual address resides.

I've written elsewhere about how the JVM uses virtual memory: it assigns space for the Java heap, per-thread stacks, shared native libraries including the JVM itself, and memory-mapped files (primarily JAR files). On Linux, the program pmap will show you the virtual address space of a running process, divided into segments of different sizes, with different access permissions.

In thinking about virtual memory, there are two concepts that every programmer should understand: resident set size and commit charge. The second is easiest to explain: it's the total amount of memory that your program might be able to modify (ie, it excludes memory-mapped files and read-only program code). The potential commit charge for an entire system is the sum of RAM and swap space, and no program can exceed this. It doesn't matter how big your virtual address space is: if you have 2G of RAM, and 2G of swap, you can never work with more than 4G of in-memory data; there's no place to store it.

In practice, no one program can reach that maximum commit charge either, because there are always other programs running, and they have their own claims upon memory. If you try to allocate memory that would exceed the available commit charge, you will get an OutOfMemoryError.

The first concept, resident set size (RSS), refers to how many of your program's virtual pages are currently residing in RAM. If a page isn't in RAM, then it needs to be read from disk — faulted into RAM — before your program can access it. The important thing to know about RSS is that you have very little control over it. The operating system tries to minimize the number of system-wide page faults, typically by managing RSS on the basis of time and access frequency: pages that are infrequently accessed get swapped out, making room for pages that are actively accessed. RSS is one reason that “full” garbage collections can take a long time: the GC has to walk the list of live objects, which will involve touching every page in the heap and faulting-in those that haven't been accessed recently.

One final concept: pages in the resident set can be “dirty,” meaning that the program has changed their content. A dirty page must be written to swap space before its physical memory can be used by another page. By comparison, a clean (unmodified) page may simply be discarded; it will be reloaded from disk when needed. If you can guarantee that a page will never be modified, it doesn't count against a program's commit charge — we'll return to this topic when discussing memory-mapped files.

Direct ByteBuffers

There are three ways to create a ByteBuffer: wrap(), which you've already seen, allocate(), which will create the underlying byte array for you, and allocateDirect(). The API docs for this last method are somewhat vague on exactly where the buffer will be allocated, stating only that “the Java virtual machine will make a best effort to perform native I/O operations directly upon it,” and that they “may reside outside of the normal garbage-collected heap.” In practice, direct buffers always live outside of the garbage-collected heap.
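
For illustration, here are the three creation methods side by side; isDirect() tells you what you got:

byte[] array = new byte[4096];

ByteBuffer wrapped = ByteBuffer.wrap(array);          // uses your array as backing store
ByteBuffer onHeap  = ByteBuffer.allocate(4096);       // allocates its own byte[] on the heap
ByteBuffer direct  = ByteBuffer.allocateDirect(4096); // allocates outside the heap

System.out.println(wrapped.isDirect());   // false
System.out.println(onHeap.isDirect());    // false
System.out.println(direct.isDirect());    // true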

Knowing this, you might think that a direct buffer is a great way to extend the memory that your program can use. It isn't. The JVM is very good about growing the heap to the limits of physical and virtual memory, so if you've already maxed out your heap, there won't be any place to put a direct buffer.

In fact, the only reason that I can see for using direct buffers in a pure Java program is that they won't be moved during garbage collection. If you've read my article on reference objects, you'll remember that the garbage collector compacts the heap after disposing of dead objects. If you have large blocks of heap memory allocated as buffers, they may get moved as part of compaction, and no matter how fast your CPU, that takes time; it's not something you want to do on every full collection. Since the direct buffer lives outside of the heap, it isn't affected by collections. On the other hand, every data access is a JNI call. Only benchmarking will tell you whether this helps or hurts your particular application.

Direct buffers are useful in a program that mixes Java and native libraries: JNI provides methods to access the physical memory behind a direct buffer, and to allocate new buffers at known locations. Since this technique has a limited audience, it's outside of the scope of this article. If you're interested, I link to an example program at the end.

Mapped Files

While I don't see much reason to use direct buffers in a pure Java program, they're the foundation for mapping files into the virtual address space — a feature that is rarely used, but invaluable when you need it. Mapping a file gives you random access with — depending on your access patterns — a significant performance boost. To understand why, we'll need to take a short detour into the way that Java file I/O works.

The first thing to understand is that the Java file classes are simply wrappers around native file operations. When you call read() from a Java program, you invoke the POSIX system call with the same name (at least on Solaris/Linux; I'll assume Windows as well). When the OS is asked to read data, it first looks into its cache of disk buffers, to see if you've recently read data from the same disk block. If the data is there, the call can return immediately. If not, the OS will initiate a disk read, and suspend your program until the data is available.

The key point here is that “immediately” does not mean “quickly”: you're invoking the operating system kernel to do the read, which means that the computer has to perform a “context switch” from application mode to kernel mode. To make this switch, it will save the CPU registers and page table for your application, and load the registers and page table for the kernel; when the kernel call is done, the reverse happens. This is a matter of a few microseconds, but those add up if you're constantly accessing a file. At worst, the OS scheduler will decide that your program has had the CPU for long enough, and suspend it while another program runs.

With a memory-mapped file, by comparison, there's no need to invoke the OS unless the data isn't already in memory. And since the amount of RAM devoted to programs is larger than that devoted to disk buffers, the data is far more likely to be in memory.

Of course, whether or not your data is in memory depends on many things. Foremost is whether you're accessing the data sequentially: there's no point to replacing a FileInputStream with a mapped buffer, even though the JDK allows it, because you'll be constantly waiting for pages to load from disk.

The second important question is how big your file is, and how randomly you access it. If you have a multi-gigabyte file and bounce from one spot to another, then you'll be constantly waiting for pages to be read from disk. But most programs don't access their data in a truly random manner. Typically there's one group of blocks that are hit far more frequently than others, and these will remain in RAM. For example, a database server reads the root node of an index on almost every query, while individual data blocks are accessed far less frequently.

Even if you don't gain a speed benefit from memory-mapping your files, you may gain a maintenance benefit by accessing them via a bean-style wrapper class. This will also improve testability, as you can construct buffers around known test data, without any files involved.

Creating the Mapping

Creating a mapped file is a multi-step process, starting with a RandomAccessFile (you can also start with a FileInputStream or FileOutputStream, but there's no point to doing so). From there, you create a FileChannel, and then you call map() on that channel. It's easier to code than to describe:

File file = new File("/tmp/example.dat");
FileChannel channel = new RandomAccessFile(file, "r").getChannel();

ByteBuffer buf = channel.map(MapMode.READ_ONLY, 0L, file.length());
buf.order(ByteOrder.LITTLE_ENDIAN);
System.console().printf("data = %x", buf.getInt(0));

Although I assign the return value from map() to a ByteBuffer variable, it's actually a MappedByteBuffer. Most of the time there's no reason to differentiate, but the latter class has two methods that some programs may find useful: load() and force().

The load() method will attempt to load all of the file's data into RAM, trading an increase in startup time for a potential decrease in page faults later. I think this is a form of premature optimization. Unless your program constantly accesses those pages, the operating system may choose to use them for something else, meaning that you'll have to fault them in anyway. Let the OS do its job, and load pages as needed from disk.
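
If you do want to try it, the calls look like this (a sketch reusing the channel and file from the mapping example; note that load() is only a hint):

MappedByteBuffer mapped = channel.map(MapMode.READ_ONLY, 0L, file.length());
mapped.load();                          // ask the OS to page the whole file into RAM
System.out.println(mapped.isLoaded());  // a best-effort answer, not a guarantee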

The second method, force(), deserves its own section.

Read-Only versus Read-Write Mappings

You'll note that I created the RandomAccessFile in read-only mode (by passing the flag "r" to its constructor), and reiterated that when mapping the channel. This will prevent accidental writes, but more importantly, it means that the file won't count against the program's commit charge. On a 64-bit machine, you can map terabytes of read-only files. And in most cases, you don't need write access: you have a large dataset that you want to process, and don't want to keep reading chunks of it into heap memory.

Read-write files require some more thought. The first thing to consider is just how important your writes are. As I noted above, the memory manager doesn't want to constantly write dirty pages to disk. Which means that your changes may remain in memory, unwritten, for a very long time — which will become a problem if the power goes out. To flush dirty pages to disk, call the buffer's force() method.

buf.putInt(0, 0x87654321);
buf.force();

Those two lines of code are actually an anti-pattern: you don't want to flush dirty pages after every write, or you'll make your program IO-bound. Instead, take a lesson from database developers, and group your changes into atomic units (or better, if you're planning on a lot of updates, use a real database).
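
A sketch of the batched alternative, where updates is a hypothetical array of values to apply:

for (int ii = 0 ; ii < updates.length ; ii++)
    buf.putInt(ii * 4, updates[ii]);    // accumulate dirty pages in memory
buf.force();                            // then flush the whole batch at once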

Mapping Files Bigger than 2 GB

Depending on your filesystem, you can create files larger than 2GB. But if you look at the ByteBuffer documentation, you'll see that it uses an int for all indexes, which means that buffers are limited to 2GB. Which in turn means that you need to create multiple buffers to work with large files.

One solution is to create those buffers as needed. The same underlying FileChannel can support as many buffers as you can create, limited only by the OS and available virtual memory; simply pass a different starting offset each time. The problem with this approach is that creating a mapping is expensive, because it's a kernel call (and you're using mapped files to avoid kernel calls). In addition, a page table full of mappings will mean more expensive context switches. As a result, as-needed buffers aren't a good approach unless you can divide the file into large chunks that are processed as a unit.

A better approach, in my opinion, is to create a “super buffer” that maps the entire file and presents an API that uses long offsets. Internally, it maintains an array of mappings with a known size, so that you can easily translate the original index into a buffer and an offset within that buffer:

public int getInt(long index)
{
    return buffer(index).getInt();
}

private ByteBuffer buffer(long index)
{
    ByteBuffer buf = _buffers[(int)(index / _segmentSize)];
    buf.position((int)(index % _segmentSize));
    return buf;
}

That's straightforward, but what's a good value for _segmentSize? Your first thought might be Integer.MAX_VALUE, since this is the maximum index value for a buffer. While that would result in the fewest buffers to cover the file, it has one big flaw: you won't be able to access multi-byte values at segment boundaries.

Instead, you should overlap buffers, with the size of the overlap being the maximum sub-buffer (or byte[]) that you need to access. In my implementation, the segment size is Integer.MAX_VALUE / 2 and each buffer is twice that size; one sub-buffer starts halfway into its predecessor:

public MappedFileBuffer(File file, int segmentSize, boolean readWrite)
throws IOException
{
    if (segmentSize > MAX_SEGMENT_SIZE)
        throw new IllegalArgumentException(
                "segment size too large (max " + MAX_SEGMENT_SIZE + "): " + segmentSize);

    _segmentSize = segmentSize;
    _fileSize = file.length();

    RandomAccessFile mappedFile = null;
    try
    {
        String mode = readWrite ? "rw" : "r";
        MapMode mapMode = readWrite ? MapMode.READ_WRITE : MapMode.READ_ONLY;

        mappedFile = new RandomAccessFile(file, mode);
        FileChannel channel = mappedFile.getChannel();

        _buffers = new MappedByteBuffer[(int)(_fileSize / segmentSize) + 1];
        int bufIdx = 0;
        for (long offset = 0 ; offset < _fileSize ; offset += segmentSize)
        {
            long remainingFileSize = _fileSize - offset;
            long thisSegmentSize = Math.min(2L * segmentSize, remainingFileSize);
            _buffers[bufIdx++] = channel.map(mapMode, offset, thisSegmentSize);
        }
    }
    finally
    {
        // close quietly
        if (mappedFile != null)
        {
            try
            {
                mappedFile.close();
            }
            catch (IOException ignored) { /* */ }
        }
    }
}

There are two things to notice here. The first is my use of Math.min(). You can't create a mapped buffer that's larger than the actual file; map() will throw if you try. Since I specify segment size rather than number of segments, I need to ensure that the mappings fit reality. At most two buffers will be shrunk by this call, but it's less code to check on every buffer.

The second — and perhaps more important — thing is that I close the RandomAccessFile after creating the mappings. My original version of this class didn't; it had a close() method, along with a finalizer to catch programmer mistakes. Then one day I took a closer look at the FileChannel.map() docs, and discovered that the buffer will persist after the channel is closed — it's removed by the garbage collector (and this explains the reason that MappedByteBuffer doesn't have its own close() method).

Garbage Collection of Direct/Mapped Buffers

That brings up another topic: how does the non-heap memory for direct buffers and mapped files get released? After all, there's no method to explicitly close or release them. The answer is that they get garbage collected like any other object, but with one twist: if you don't have enough virtual memory space or commit charge to allocate a direct buffer, that will trigger a full collection even if there's plenty of heap memory available. Normally, this won't be an issue: you probably won't be allocating and releasing direct buffers more often than heap-resident objects. If, however, you see full GC's appearing when you don't think they should, take a look at your program's use of buffers.

Along the same lines, when you're using direct buffers and mapped files, you'll get to see some of the more esoteric variants of OutOfMemoryError. “Direct buffer memory” is one of the more common, and appears to indicate an OS-imposed limit (based on my reading of Bits.java). And when I tried to allocate more direct buffers than available commit charge, I received an OOM that didn't even have a message.

Enabling Large Direct Buffers

You may be surprised, the first time that you try to allocate direct buffers on a 64-bit machine, that you get OutOfMemoryError when there's plenty of RAM available. You can usually resolve this problem by passing the following options when starting the JVM:

-d64
This option instructs the JVM to run in 64-bit mode. Most 64-bit installs actually have 32-bit JVMs, and the 32-bit JVM may be more efficient for “small” programs, because of the reduced overhead for pointers. This option is documented only for Linux/Solaris JVMs, and the documentation has a lot of caveats regarding when and how a 64-bit JVM is invoked.
-XX:MaxDirectMemorySize
This option is not in the official list of non-standard JVM options, and it doesn't appear to affect the process memory map in any way, but in my experience it's absolutely critical if your program needs large direct buffers.

To summarize, if you're running a program that needs to allocate 12 GB of direct buffers, you'd use a command-line like this:

java -d64 -XX:MaxDirectMemorySize=12g com.example.MyApp

If you're working with large buffers (direct buffers or memory-mapped files), you should also use the -XX:+UseLargePages option:

java -d64 -XX:MaxDirectMemorySize=12g -XX:+UseLargePages com.example.MyApp

By default, the memory manager maps physical memory to the virtual address space in small chunks (4k is typical). This means that page faults can be handled more efficiently, because there's less data to read or write. However, small pages mean that memory management hardware has to keep track of more information to translate virtual addresses to physical. At best, this means less efficient usage of the TLB, which makes every memory access slower. At worst, you'll run out of entries in the page table (which is reported as OutOfMemoryError).

Thread Safety

ByteBuffer thread safety is covered in the Buffer JavaDoc; the short version is that buffers are not thread-safe. Clearly, you can't use relative positioning from multiple threads without a race condition, but even absolute positioning is not guaranteed (regardless of what you might think after looking at the implementation classes). Fortunately, the work-around is easy: give each thread its own buffer.

There are two methods that let you create a new buffer from an existing one: duplicate() and slice(). I've already described the latter: it creates a new buffer that starts at the current buffer's position. The former creates a new buffer that covers the entire original; it is equivalent to setting the original buffer's position at zero and then calling slice().

The JavaDoc for these methods states that “[c]hanges to this buffer's content will be visible in the new buffer, and vice versa.” However, I don't think this takes the Java memory model into account. To be safe, consider buffers with shared backing store equivalent to an object shared between threads: it's possible that concurrent accesses will see different values. Of course, this only matters when you're writing to the buffer; for read-only buffers, simply having a unique buffer per thread is sufficient.

That said, you still have the issue of creating buffers: you need to synchronize access to the slice() or duplicate() call. One way to do this is to create all of your buffers before spawning threads. However, that may be inconvenient, especially if your buffer is internal to another class. An alternative is to use ThreadLocal:

public class ByteBufferThreadLocal
extends ThreadLocal<ByteBuffer>
{
    private ByteBuffer _src;

    public ByteBufferThreadLocal(ByteBuffer src)
    {
        _src = src;
    }

    @Override
    protected synchronized ByteBuffer initialValue()
    {
        return _src.duplicate();
    }
}

In this example, the original buffer is never accessed by application code. Instead, it serves as a master for producing copies in a synchronized method, and those copies are used by the application. Once a thread finishes, the garbage collector will dispose of the buffer(s) that it used, leaving the master untouched.
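
Hypothetical usage, where masterBuf is the buffer to be shared:

final ByteBufferThreadLocal localBuf = new ByteBufferThreadLocal(masterBuf);

Runnable worker = new Runnable()
{
    public void run()
    {
        ByteBuffer buf = localBuf.get();    // this thread's private duplicate
        int value = buf.getInt(0);          // shared content, independent position
        // ...
    }
};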

For More Information

There are several example programs that go with this article:

  • SimpleExample is, as its name implies, a simple example of using ByteBuffer to wrap a byte[].
  • TcpHeaderWrapper is a bean-style class that hides the underlying ByteBuffer.
  • SliceExample demonstrates the ByteBuffer.slice() method.
  • DirectBufferAllocator allocates a direct ByteBuffer and then sleeps while you run pmap. Not very interesting, but add a loop and you can see how much non-heap memory you can allocate. Or you can run AllocationFailureExample.
  • DataReader demonstrates differences in endianness. It expects data written by data_writer.c, and if you're running on an x86 system will demonstrate how little-endian data appears to a Java program.
  • MappedDataReader is the same thing, but maps the source data file.
  • MappedSpeedTest compares the performance of memory-mapped files to random-access files. With the default settings, I see a better than 4x speedup, even though both programs are accessing in-memory data; the difference is almost entirely due to kernel context switches.
  • DirectSpeedTest is a similar program comparing direct and non-direct buffers. Surprisingly, the direct buffers seem to have a tiny speed advantage (I would have expected them to be slower, since they're accessed via JNI), but that could be benchmarking error.
  • If you want to use a ByteBuffer to access data produced by a C/C++ library, this example might be useful: it wraps a System V shared memory block in a direct buffer. Personally, every time I have to work with JNI I end up regretting it (and rewriting a lot of basic helper methods), but sometimes it's the best solution. Good luck.

I've also written some utility classes for working with buffers. They are all licensed for open-source consumption under the Apache 2.0 license, and are available on SourceForge (at present this library is not available from Maven Central).

  • MappedFileBuffer is my “super buffer” implementation for mapped files. It works by mapping the file as multiple 2GB buffers. Note, however, that it is not thread-safe.
  • ByteBufferThreadLocal and MappedFileBufferThreadLocal wrap buffers in a ThreadLocal to provide thread-safe access.
  • BufferFacade provides a common API over both ByteBuffer and MappedFileBuffer. This helps with writing testable code: you can use an in-memory buffer for tests, and a mapped file for production code.
  • ByteBufferInputStream and ByteBufferOutputStream provide stream-oriented access to and from a buffer. They're useful when you want to store data in an off-heap buffer, but pass that data to a framework (like a servlet request) that has no conception of channels.

I gave a presentation on ByteBuffers to the Philadelphia Java Users Group in November 2010, which focused on implementing an off-heap cache similar to EHCache BigMemory. You'll probably find it a bit sparse; I tend to use slides only as a starting point for an extended monologue. However, it does contain a nearly-complete off-heap cache implementation in a couple dozen lines of code.

Wikipedia has some nice articles on virtual memory and paging. However, I recommend ignoring the article on virtual address space; it is simultaneously too detailed and not detailed enough. There's also the article on the translation lookaside buffer that I linked earlier.

To enable large pages, you might need to change your OS configuration. The specific instructions are quite likely to change over time, so I won't repeat them here. I've found a couple of blog postings that are useful and authoritative: one from Sun, and one from Andrig Miller of JBoss. If you try using large pages and get an error message, take a look at these blogs and/or Google the message text.

Copyright © Keith D Gregory, all rights reserved