Translation: Improving Java I/O Performance

Source: Internet. Editor: 程序博客网. Time: 2024/05/16 10:06

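Note: This is my first translation and it has many shortcomings. Corrections from readers are welcome; I am still improving through practice.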

Original article: http://java.sun.com/developer/technicalArticles/Programming/PerfTuning/

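Overview
This article discusses and illustrates a range of techniques for improving Java I/O performance. Most of them focus on disk file reads and writes, and some also apply to network I/O and window output. The first part covers low-level I/O; the second part covers higher-level I/O such as compression, formatting, and serialization. The article does not discuss application design issues such as the choice of search algorithms and data structures, nor system-level topics such as file caching.

When discussing Java I/O, note that Java assumes two distinct kinds of disk file organization: one based on streams of bytes, the other on streams of characters. In Java a character (char) is represented by two bytes, rather than one byte as in C. Reading characters from a file therefore requires a conversion step. This distinction matters in practice, and several examples in this article illustrate it.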

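Basic Rules for Speeding Up I/O

1. Avoid accessing the disk.
2. Avoid accessing the underlying operating system.
3. Avoid method calls.
4. Avoid processing bytes and characters one at a time.

Clearly these rules cannot all be applied across the board, or no I/O would ever get done. To see how they can be used, the next three sections work through a single example: counting the lines in a file, that is, counting its newline bytes ('\n').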

Approach 1: Read Method
The first approach simply uses the read method of FileInputStream:

import java.io.*;

public class intro1 {
    public static void main(String args[]) {
        if (args.length != 1) {
            System.err.println("missing filename");
            System.exit(1);
        }
        try {
            FileInputStream fis = new FileInputStream(args[0]);
            int cnt = 0;
            int b;
            // one read() call, and one trip to the underlying system, per byte
            while ((b = fis.read()) != -1) {
                if (b == '\n')
                    cnt++;
            }
            fis.close();
            System.out.println(cnt);
        } catch (IOException e) {
            System.err.println(e);
        }
    }
}

However, this approach triggers a large number of calls to the underlying system, because FileInputStream.read is a low-level call that returns the next byte of the file.

Approach 2: Using a Large Buffer
The second approach avoids the problem of the first one by using a large buffer.

import java.io.*;

public class intro2 {
    public static void main(String args[]) {
        if (args.length != 1) {
            System.err.println("missing filename");
            System.exit(1);
        }
        try {
            FileInputStream fis = new FileInputStream(args[0]);
            BufferedInputStream bis = new BufferedInputStream(fis);
            int cnt = 0;
            int b;
            // read() is served from the in-memory buffer most of the time
            while ((b = bis.read()) != -1) {
                if (b == '\n')
                    cnt++;
            }
            bis.close();
            System.out.println(cnt);
        } catch (IOException e) {
            System.err.println(e);
        }
    }
}

BufferedInputStream.read takes the next byte from the input buffer and only rarely accesses the underlying system.

Approach 3: Direct Buffering
The third approach does not use BufferedInputStream. It does its own buffering instead, which eliminates the per-byte read method calls:

import java.io.*;

public class intro3 {
    public static void main(String args[]) {
        if (args.length != 1) {
            System.err.println("missing filename");
            System.exit(1);
        }
        try {
            FileInputStream fis = new FileInputStream(args[0]);
            byte buf[] = new byte[2048];
            int cnt = 0;
            int n;
            // each read fills buf with up to 2048 bytes; n is the number actually read
            while ((n = fis.read(buf)) != -1) {
                for (int i = 0; i < n; i++) {
                    if (buf[i] == '\n')
                        cnt++;
                }
            }
            fis.close();
            System.out.println(cnt);
        } catch (IOException e) {
            System.err.println(e);
        }
    }
}

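For a 1 MB input file, the execution times of the three programs, in seconds, are:

intro1    6.9
intro2    0.9
intro3    0.4

The ratio between the slowest and the fastest is about 17 to 1.

This speedup does not mean you should always choose the third approach, because it is more error-prone, particularly when handling end of file. What is worth remembering is where the time goes in each approach, and how to get the speed when you need it.

For most applications, approach 2 is usually the "right" one.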

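Buffering

Approaches 2 and 3 use buffering: a block of data is read from disk at once and is then accessed one byte or character at a time. Buffering is a basic and important technique for speeding up I/O, and several Java classes support it (BufferedInputStream for bytes, BufferedReader for characters).

An obvious question: does making the buffer bigger make I/O faster? Java buffers are 1024 or 2048 bytes long by default. A larger buffer does speed up I/O, but only by a small amount, typically 5 to 10%.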
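To make the buffer-size question concrete, the following is a minimal sketch that counts newline bytes through a BufferedInputStream whose buffer size is passed in explicitly. The class name BufferSizeDemo and the method countNewlines are hypothetical names chosen for illustration; they do not come from the original article.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class BufferSizeDemo {
    // Counts '\n' bytes, reading through a buffer of bufSize bytes.
    // (Hypothetical helper, not part of the original article.)
    public static int countNewlines(InputStream in, int bufSize) throws IOException {
        int cnt = 0;
        try (BufferedInputStream bis = new BufferedInputStream(in, bufSize)) {
            int b;
            while ((b = bis.read()) != -1) {
                if (b == '\n')
                    cnt++;
            }
        }
        return cnt;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "one\ntwo\nthree\n".getBytes(StandardCharsets.US_ASCII);
        // The count is the same for any buffer size; only the speed can differ.
        System.out.println(countNewlines(new ByteArrayInputStream(data), 2048));
        System.out.println(countNewlines(new ByteArrayInputStream(data), 64 * 1024));
    }
}
```

To measure the effect on a real file, pass a FileInputStream instead of the in-memory stream and time the call for a few buffer sizes.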

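Approach 4: Whole File
The extreme case is to determine the length of the file and read the whole file into a buffer.

import java.io.*;

public class readfile {
    public static void main(String args[]) {
        if (args.length != 1) {
            System.err.println("missing filename");
            System.exit(1);
        }
        try {
            int len = (int) (new File(args[0]).length());
            FileInputStream fis = new FileInputStream(args[0]);
            byte buf[] = new byte[len];
            // note: a single read is not guaranteed to fill buf;
            // robust code would loop until len bytes have been read
            fis.read(buf);
            fis.close();
            int cnt = 0;
            for (int i = 0; i < len; i++) {
                if (buf[i] == '\n')
                    cnt++;
            }
            System.out.println(cnt);
        } catch (IOException e) {
            System.err.println(e);
        }
    }
}

This approach is convenient because the file can be treated as an array of bytes. The obvious problem is that there may not be enough memory to read a very large file in one go.

Another aspect of buffering concerns text output to a terminal window. By default, System.out is line buffered, meaning that the output buffer is flushed whenever a newline character is encountered. This matters for interactivity, where you want an input prompt displayed before the user actually types anything.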

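Approach 5: Disabling Line Buffering
Line buffering can, however, be disabled, as in this example:

import java.io.*;

public class bufout {
    public static void main(String args[]) {
        FileOutputStream fdout = new FileOutputStream(FileDescriptor.out);
        BufferedOutputStream bos = new BufferedOutputStream(fdout, 1024);
        // second argument false: do not flush automatically on newline
        PrintStream ps = new PrintStream(bos, false);
        System.setOut(ps);
        final int N = 100000;
        for (int i = 1; i <= N; i++)
            System.out.println(i);
        ps.close();
    }
}

This program writes the integers 1 to 100000 and runs about three times as fast as the line-buffered default.

Buffering is also important in one of the examples below, where it is used to speed up random file access.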
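One consequence of turning line buffering off is that a prompt is no longer delivered on its own; it has to be flushed explicitly. The sketch below shows this with an in-memory sink standing in for the terminal. PromptFlushDemo and its buffered method are hypothetical names used only for this illustration.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;

public class PromptFlushDemo {
    // Wraps sink in a buffered PrintStream with autoflush turned off.
    // (Hypothetical helper, not part of the original article.)
    public static PrintStream buffered(OutputStream sink, int bufSize) {
        return new PrintStream(new BufferedOutputStream(sink, bufSize), false);
    }

    public static void main(String[] args) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream ps = buffered(sink, 1024);
        ps.print("name? ");
        // The prompt is still sitting in the 1024-byte buffer at this point.
        System.out.println("before flush: " + sink.size() + " bytes delivered");
        ps.flush();
        System.out.println("after flush: " + sink.size() + " bytes delivered");
    }
}
```

In an interactive program that has installed such a stream with System.setOut, the equivalent step would be calling System.out.flush() after printing each prompt.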

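Reading and Writing Text Files

As mentioned earlier, the overhead of method calls can be significant when reading characters from a file. Another similar example, a program that counts the lines in a text file, is the following:

import java.io.*;

public class line1 {
    public static void main(String args[]) {
        if (args.length != 1) {
            System.err.println("missing filename");
            System.exit(1);
        }
        try {
            FileInputStream fis = new FileInputStream(args[0]);
            BufferedInputStream bis = new BufferedInputStream(fis);
            DataInputStream dis = new DataInputStream(bis);
            int cnt = 0;
            // DataInputStream.readLine is deprecated; shown here for comparison
            while (dis.readLine() != null)
                cnt++;
            dis.close();
            System.out.println(cnt);
        } catch (IOException e) {
            System.err.println(e);
        }
    }
}

This program uses the deprecated DataInputStream.readLine method, which calls read to obtain each character. A newer way to write it is: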

import java.io.*;

public class line2 {
    public static void main(String args[]) {
        if (args.length != 1) {
            System.err.println("missing filename");
            System.exit(1);
        }
        try {
            FileReader fr = new FileReader(args[0]);
            BufferedReader br = new BufferedReader(fr);
            int cnt = 0;
            // BufferedReader.readLine works on characters, not raw bytes
            while (br.readLine() != null)
                cnt++;
            br.close();
            System.out.println(cnt);
        } catch (IOException e) {
            System.err.println(e);
        }
    }
}

This approach can be faster. For example, on a 6 MB file with 200,000 lines, the second program is about 20% faster than the first.

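But even if the second program were not faster, there is an important point to note. The first program triggers a deprecation warning from the Java 2 compiler, because DataInputStream.readLine is obsolete. It does not always convert bytes to characters correctly, so it is not an appropriate choice for text files that contain anything other than ASCII (Java uses the Unicode character set, not ASCII).

This is where the distinction between byte streams and character streams, mentioned earlier, comes into play. A program such as: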

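import java.io.*;

public class conv1 {
    public static void main(String args[]) {
        try {
            FileOutputStream fos = new FileOutputStream("out1");
            // PrintStream converts characters using the platform default encoding
            PrintStream ps = new PrintStream(fos);
            ps.println("\u4321\u1234");
            ps.close();
        } catch (IOException e) {
            System.err.println(e);
        }
    }
}

writes an output file, but does not necessarily preserve the Unicode characters that were actually output, because the result depends on the platform default encoding. The Reader/Writer classes are character-based and are designed to solve this kind of problem. The conversion from characters to bytes is implemented in the OutputStreamWriter class.

A program that uses PrintWriter to write Unicode characters looks like this: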

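import java.io.*;

public class conv2 {
    public static void main(String args[]) {
        try {
            FileOutputStream fos = new FileOutputStream("out2");
            // the encoding is named explicitly instead of relying on the default
            OutputStreamWriter osw = new OutputStreamWriter(fos, "UTF8");
            PrintWriter pw = new PrintWriter(osw);
            pw.println("\u4321\u1234");
            pw.close();
        } catch (IOException e) {
            System.err.println(e);
        }
    }
}

This program uses the UTF-8 encoding, which encodes ASCII text as itself and other characters, such as the two written here, as two or three bytes each.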
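The byte counts behind that statement can be checked with a small in-memory round trip. This is a sketch, not code from the original article: Utf8RoundTrip, encode, and decode are hypothetical names, and byte array streams stand in for the files out1 and out2.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {
    // Characters -> bytes, with the encoding named explicitly.
    public static byte[] encode(String s) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(bos, StandardCharsets.UTF_8)) {
            w.write(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Bytes -> characters, using the same encoding.
    public static String decode(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            int c;
            while ((c = r.read()) != -1)
                sb.append((char) c);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "A\u4321\u1234";
        byte[] bytes = encode(s);
        // 'A' takes one byte; each of the other two characters takes three.
        System.out.println(bytes.length + " bytes for " + s.length() + " chars");
        System.out.println(decode(bytes).equals(s));
    }
}
```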

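Formatting Costs
Actually writing data to a file is only part of the cost of output. Another significant cost is data formatting. Consider an example that writes out many lines of the form "The square of 5 is 25".

Approach 1
The first approach simply writes out a fixed string, to get an idea of the intrinsic I/O cost.

public class format1 {
    public static void main(String args[]) {
        final int COUNT = 25000;
        for (int i = 1; i <= COUNT; i++) {
            String s = "The square of 5 is 25\n";
            System.out.print(s);
        }
    }
}

Approach 2
The second approach does simple formatting with the "+" operator.

public class format2 {
    public static void main(String args[]) {
        int n = 5;
        final int COUNT = 25000;
        for (int i = 1; i <= COUNT; i++) {
            String s = "The square of " + n + " is " + n * n + "\n";
            System.out.print(s);
        }
    }
}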

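Approach 3
The third approach uses the java.text.MessageFormat class.

import java.text.*;

public class format3 {
    public static void main(String args[]) {
        // the pattern is parsed once, outside the loop
        MessageFormat fmt = new MessageFormat("The square of {0} is {1}\n");
        Object values[] = new Object[2];
        int n = 5;
        values[0] = new Integer(n);
        values[1] = new Integer(n * n);
        final int COUNT = 25000;
        for (int i = 1; i <= COUNT; i++) {
            String s = fmt.format(values);
            System.out.print(s);
        }
    }
}

These programs produce identical output. Their running times, in seconds, are:

format1   1.3
format2   1.8
format3   7.8

The fastest is about six times as fast as the slowest. The third program would be slower still if the format were not precompiled, that is, if the static method MessageFormat.format() were used in place of fmt.format().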

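Approach 4

import java.text.*;

public class format4 {
    public static void main(String args[]) {
        String fmt = "The square of {0} is {1}\n";
        Object values[] = new Object[2];
        int n = 5;
        values[0] = new Integer(n);
        values[1] = new Integer(n * n);
        final int COUNT = 25000;
        for (int i = 1; i <= COUNT; i++) {
            // the static method re-parses the pattern on every call
            String s = MessageFormat.format(fmt, values);
            System.out.print(s);
        }
    }
}

This program takes about one third longer to run than approach 3.

The fact that approach 3 is slower than approaches 1 and 2 does not mean you should not use it. But you should be aware of what each approach costs.

The MessageFormat class is very important in internationalization contexts. An application might read the format from a resource and then use it.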

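Random Access
RandomAccessFile is a Java I/O class for byte-level random access to a file. As in C/C++, the class provides a seek method that moves the file pointer to an arbitrary position, so that bytes can be read or written from there.

The seek method calls into the underlying runtime system and can be expensive. One way to cut this cost is to set up your own buffering on top of a RandomAccessFile and implement a method that reads bytes directly from the buffer. An example: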

import java.io.*;

public class ReadRandom {
    private static final int DEFAULT_BUFSIZE = 4096;
    private RandomAccessFile raf;
    private byte inbuf[];
    private long startpos = -1; // file offset of inbuf[0]
    private long endpos = -1;   // file offset of the last valid byte in inbuf
    private int bufsize;

    public ReadRandom(String name) throws FileNotFoundException {
        this(name, DEFAULT_BUFSIZE);
    }

    public ReadRandom(String name, int b) throws FileNotFoundException {
        raf = new RandomAccessFile(name, "r");
        bufsize = b;
        inbuf = new byte[bufsize];
    }

    // returns the byte at file offset pos as 0..255, or -1 at end of file or on error
    public int read(long pos) {
        if (pos < startpos || pos > endpos) {
            // buffer miss: load the aligned block that contains pos
            long blockstart = (pos / bufsize) * bufsize;
            int n;
            try {
                raf.seek(blockstart);
                n = raf.read(inbuf);
            } catch (IOException e) {
                return -1;
            }
            startpos = blockstart;
            endpos = blockstart + n - 1;
            if (pos < startpos || pos > endpos)
                return -1;
        }
        // mask with 0xff (not 0xffff) so that bytes >= 0x80 are not sign-extended
        return inbuf[(int) (pos - startpos)] & 0xff;
    }

    public void close() throws IOException {
        raf.close();
    }

    public static void main(String args[]) {
        if (args.length != 1) {
            System.err.println("missing filename");
            System.exit(1);
        }
        try {
            ReadRandom rr = new ReadRandom(args[0]);
            long pos = 0;
            int c;
            byte buf[] = new byte[1];
            while ((c = rr.read(pos)) != -1) {
                pos++;
                buf[0] = (byte) c;
                System.out.write(buf, 0, 1);
            }
            rr.close();
        } catch (IOException e) {
            System.err.println(e);
        }
    }
}

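The driver program simply reads the bytes in sequence and writes them out.

This technique helps when you have locality of access, that is, when nearby bytes in the file are read at about the same time. For example, it is useful if you are implementing a binary search on a sorted file. It helps much less if you are doing truly random access at arbitrary points in a large file.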
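The binary search case can be sketched as follows. This is an illustration under stated assumptions, not the article's code: the file is assumed to hold sorted, fixed-width records of one big-endian int each, the names SortedFileSearch, find, and write are hypothetical, and the sketch calls RandomAccessFile.seek directly rather than going through a block buffer such as ReadRandom.

```java
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

public class SortedFileSearch {
    private static final int RECORD_SIZE = 4; // assumption: one big-endian int per record

    // Returns the record index of key, or -1 if key is not in the file.
    public static long find(RandomAccessFile raf, int key) throws IOException {
        long lo = 0;
        long hi = raf.length() / RECORD_SIZE - 1;
        while (lo <= hi) {
            long mid = lo + (hi - lo) / 2;
            raf.seek(mid * RECORD_SIZE); // one seek per probe
            int value = raf.readInt();
            if (value < key)
                lo = mid + 1;
            else if (value > key)
                hi = mid - 1;
            else
                return mid;
        }
        return -1;
    }

    // Writes the given (already sorted) values as fixed-width records.
    public static void write(File f, int[] sorted) throws IOException {
        try (DataOutputStream dos = new DataOutputStream(new FileOutputStream(f))) {
            for (int v : sorted)
                dos.writeInt(v);
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("sorted", ".bin");
        f.deleteOnExit();
        write(f, new int[] {2, 3, 5, 7, 11, 13});
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            System.out.println(find(raf, 7)); // prints 3
            System.out.println(find(raf, 8)); // prints -1
        }
    }
}
```

The last few probes of a binary search land close to each other, so they would fall inside a single buffered block; that is the locality a buffer like the one in ReadRandom can exploit.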

Compression
Java provides classes for compressing and uncompressing byte streams in the java.util.zip package. These classes are also the basis of jar files.

The following program reads an input file and writes a compressed zip file.

import java.io.*;
import java.util.zip.*;

public class compress {
    public static void doit(String filein, String fileout) {
        FileInputStream fis = null;
        FileOutputStream fos = null;
        try {
            fis = new FileInputStream(filein);
            fos = new FileOutputStream(fileout);
            ZipOutputStream zos = new ZipOutputStream(fos);
            ZipEntry ze = new ZipEntry(filein);
            zos.putNextEntry(ze);
            final int BUFSIZ = 4096;
            byte inbuf[] = new byte[BUFSIZ];
            int n;
            while ((n = fis.read(inbuf)) != -1)
                zos.write(inbuf, 0, n);
            fis.close();
            fis = null;
            zos.close(); // also closes the underlying fos
            fos = null;
        } catch (IOException e) {
            System.err.println(e);
        } finally {
            try {
                if (fis != null)
                    fis.close();
                if (fos != null)
                    fos.close();
            } catch (IOException e) {
                // nothing more can be done if close itself fails
            }
        }
    }

    public static void main(String args[]) {
        if (args.length != 2) {
            System.err.println("missing filenames");
            System.exit(1);
        }
        if (args[0].equals(args[1])) {
            System.err.println("filenames are identical");
            System.exit(1);
        }
        doit(args[0], args[1]);
    }
}

The next program does the reverse: it takes a zip file that is assumed to contain a single entry and uncompresses that entry to the output file.

import java.io.*;
import java.util.zip.*;

public class uncompress {
    public static void doit(String filein, String fileout) {
        FileInputStream fis = null;
        FileOutputStream fos = null;
        try {
            fis = new FileInputStream(filein);
            fos = new FileOutputStream(fileout);
            ZipInputStream zis = new ZipInputStream(fis);
            // position the stream at the first (and only) entry
            ZipEntry ze = zis.getNextEntry();
            final int BUFSIZ = 4096;
            byte inbuf[] = new byte[BUFSIZ];
            int n;
            while ((n = zis.read(inbuf, 0, BUFSIZ)) != -1)
                fos.write(inbuf, 0, n);
            zis.close(); // also closes the underlying fis
            fis = null;
            fos.close();
            fos = null;
        } catch (IOException e) {
            System.err.println(e);
        } finally {
            try {
                if (fis != null)
                    fis.close();
                if (fos != null)
                    fos.close();
            } catch (IOException e) {
                // nothing more can be done if close itself fails
            }
        }
    }

    public static void main(String args[]) {
        if (args.length != 2) {
            System.err.println("missing filenames");
            System.exit(1);
        }
        if (args[0].equals(args[1])) {
            System.err.println("filenames are identical");
            System.exit(1);
        }
        doit(args[0], args[1]);
    }
}

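Whether compression helps or hurts I/O performance depends a great deal on your hardware setup, in particular on the relative speeds of the processor and the disk. Zip compression typically shrinks a file to about 50% of its size, at the cost of the time spent compressing and uncompressing. One experiment, reading 5 to 10 MB files on a 300 MHz Pentium PC with an IDE disk, showed that reading a compressed file was about one third faster than reading the uncompressed file directly.

One case where compression is useful is writing to very slow media such as floppy disks. A test showed that writing to a floppy with compression was nearly 50% faster than writing without it.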
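Before deciding whether compression is worth its CPU cost, it helps to check how much a given kind of data actually shrinks. The sketch below does a zip round trip entirely in memory and reports the sizes. ZipRatio, zip, and unzip are hypothetical names, the entry name "data" is arbitrary, and the highly repetitive sample input compresses far better than typical files would.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipRatio {
    // Compresses data into a zip archive with a single entry.
    public static byte[] zip(byte[] data) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry("data")); // arbitrary entry name
            zos.write(data, 0, data.length);
            zos.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Uncompresses the first entry of a zip archive held in memory.
    public static byte[] unzip(byte[] zipped) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipped))) {
            if (zis.getNextEntry() == null)
                throw new IOException("no entry in zip data");
            byte[] buf = new byte[4096];
            int n;
            while ((n = zis.read(buf, 0, buf.length)) != -1)
                bos.write(buf, 0, n);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = new byte[100000];
        Arrays.fill(data, (byte) 'a'); // repetitive sample data
        byte[] zipped = zip(data);
        System.out.println(data.length + " -> " + zipped.length + " bytes");
        System.out.println(Arrays.equals(unzip(zipped), data));
    }
}
```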

Caching
A detailed discussion of hardware caches is beyond the scope of this article, but software caching can sometimes speed up I/O. Consider a case where you want to read many lines of a text file in random order. One way to do this is to read in all the lines and store them in an ArrayList.

import java.io.*;
import java.util.ArrayList;

public class LineCache {
    private ArrayList list = new ArrayList();

    // reads every line of the file into memory
    public LineCache(String fn) throws IOException {
        FileReader fr = new FileReader(fn);
        BufferedReader br = new BufferedReader(fr);
        String ln;
        while ((ln = br.readLine()) != null)
            list.add(ln);
        br.close();
    }

    // returns line n (0-based), or null if n is past the last line
    public String getLine(int n) {
        if (n < 0)
            throw new IllegalArgumentException();
        return (n < list.size() ? (String) list.get(n) : null);
    }

    public static void main(String args[]) {
        if (args.length != 1) {
            System.err.println("missing filename");
            System.exit(1);
        }
        try {
            LineCache lc = new LineCache(args[0]);
            int i = 0;
            String ln;
            while ((ln = lc.getLine(i++)) != null)
                System.out.println(ln);
        } catch (IOException e) {
            System.err.println(e);
        }
    }
}

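The getLine method is then used to retrieve an arbitrary line. This technique is quite useful, but it obviously uses a lot of memory for large files. An alternative is to remember only the 100 most recently requested lines and read any other requested line from disk. That scheme works well when there is locality in the line requests, but it loses its benefit if the requests are truly random.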
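One possible shape for the "most recent lines" alternative is a small least-recently-used cache built on LinkedHashMap in access order. This is a sketch under assumptions: RecentLineCache is a hypothetical class, and the loader function stands in for whatever code fetches line n from disk, which the original article does not show.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntFunction;

public class RecentLineCache {
    private final int capacity;
    private final IntFunction<String> loader; // assumption: reads line n from disk
    private final LinkedHashMap<Integer, String> cache;
    private int misses = 0;

    public RecentLineCache(int capacity, IntFunction<String> loader) {
        this.capacity = capacity;
        this.loader = loader;
        // accessOrder = true keeps entries ordered from least to most recently used
        this.cache = new LinkedHashMap<Integer, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, String> eldest) {
                return size() > RecentLineCache.this.capacity;
            }
        };
    }

    // Returns line n, going to the loader only when it is not cached.
    public String getLine(int n) {
        String line = cache.get(n);
        if (line == null) {
            misses++;
            line = loader.apply(n);
            if (line != null)
                cache.put(n, line);
        }
        return line;
    }

    public int misses() {
        return misses;
    }

    public static void main(String[] args) {
        // The lambda fakes the disk read so that the sketch is self-contained.
        RecentLineCache lc = new RecentLineCache(100, n -> "line " + n);
        for (int i = 0; i < 1000; i++)
            lc.getLine(i % 50); // 50 distinct lines fit in the cache
        System.out.println(lc.misses()); // prints 50
    }
}
```

With a capacity of 100 this matches the scheme described above: requests that keep returning to the same lines are served from memory, while a stream of unrelated line numbers misses almost every time.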

Tokenization
Tokenization means breaking a sequence of bytes or characters into logical chunks such as words. Java provides the StreamTokenizer class, which is used like this:

import java.io.*;

public class token1 {
    public static void main(String args[]) {
        if (args.length != 1) {
            System.err.println("missing filename");
            System.exit(1);
        }
        try {
            FileReader fr = new FileReader(args[0]);
            BufferedReader br = new BufferedReader(fr);
            StreamTokenizer st = new StreamTokenizer(br);
            st.resetSyntax();
            // only lowercase letters form words
            st.wordChars('a', 'z');
            int tok;
            while ((tok = st.nextToken()) != StreamTokenizer.TT_EOF) {
                if (tok == StreamTokenizer.TT_WORD)
                    ;// st.sval has token
            }
            br.close();
        } catch (IOException e) {
            System.err.println(e);
        }
    }
}

This example tokenizes on runs of lowercase letters. If you implement the same thing yourself, it might look like this:

import java.io.*;

public class token2 {
    public static void main(String args[]) {
        if (args.length != 1) {
            System.err.println("missing filename");
            System.exit(1);
        }
        try {
            FileReader fr = new FileReader(args[0]);
            BufferedReader br = new BufferedReader(fr);
            int maxlen = 256;
            int currlen = 0;
            char wordbuf[] = new char[maxlen];
            int c;
            do {
                c = br.read();
                if (c >= 'a' && c <= 'z') {
                    if (currlen == maxlen) {
                        // word buffer is full: grow it by half and copy
                        maxlen *= 1.5;
                        char xbuf[] = new char[maxlen];
                        System.arraycopy(wordbuf, 0, xbuf, 0, currlen);
                        wordbuf = xbuf;
                    }
                    wordbuf[currlen++] = (char) c;
                } else if (currlen > 0) {
                    // a non-letter (or end of file) ends the current word
                    String s = new String(wordbuf, 0, currlen);
                    // do something with s
                    currlen = 0;
                }
            } while (c != -1);
            br.close();
        } catch (IOException e) {
            System.err.println(e);
        }
    }
}

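The second program runs about 20% faster than the first, at the cost of having to write some tricky, error-prone low-level code.
StreamTokenizer is a hybrid class: it can read from character streams (such as a BufferedReader) as well as from byte streams, but its syntax table is defined in terms of byte values, and it treats every character with a two-byte value (greater than 0xff) as an alphabetic character.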

Serialization
Serialization converts an arbitrary Java data structure into a stream of bytes, using a standard format. For example, the following program writes out a list of random integers.

import java.io.*;
import java.util.*;

public class serial1 {
    public static void main(String args[]) {
        ArrayList al = new ArrayList();
        Random rn = new Random();
        final int N = 100000;
        for (int i = 1; i <= N; i++)
            al.add(new Integer(rn.nextInt()));
        try {
            FileOutputStream fos = new FileOutputStream("test.ser");
            BufferedOutputStream bos = new BufferedOutputStream(fos);
            ObjectOutputStream oos = new ObjectOutputStream(bos);
            // writes the whole list, including every Integer in it
            oos.writeObject(al);
            oos.close();
        } catch (Throwable e) {
            System.err.println(e);
        }
    }
}

The following program reads the list back in:

import java.io.*;
import java.util.*;

public class serial2 {
    public static void main(String args[]) {
        ArrayList al = null;
        try {
            FileInputStream fis = new FileInputStream("test.ser");
            BufferedInputStream bis = new BufferedInputStream(fis);
            ObjectInputStream ois = new ObjectInputStream(bis);
            // rebuilds the list and all of its elements as new objects
            al = (ArrayList) ois.readObject();
            ois.close();
        } catch (Throwable e) {
            System.err.println(e);
        }
    }
}

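Note that buffering is used here to speed up the I/O.

Is there a faster way than serialization to write out large volumes of data and then read them back? Probably not, except in special cases. For example, suppose you decide to write out a 64-bit long integer as text instead of as a set of 8 bytes. The maximum length of a long as text is about 20 characters, 2.5 times the length of the binary representation, so this format is unlikely to be any faster. In other cases, such as bitmaps, a special-purpose format might be an improvement. However, using your own scheme works against the standard that serialization offers, so doing so involves a trade-off.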
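The 20-characters-versus-8-bytes comparison is easy to verify. The following sketch uses hypothetical names (LongSizes, textLength, binaryLength) and writes to an in-memory stream rather than a file.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class LongSizes {
    // Number of characters needed to write v as decimal text.
    public static int textLength(long v) {
        return Long.toString(v).length();
    }

    // Number of bytes DataOutputStream.writeLong produces for v.
    public static int binaryLength(long v) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DataOutputStream dos = new DataOutputStream(bos)) {
            dos.writeLong(v);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.size();
    }

    public static void main(String[] args) {
        long v = Long.MIN_VALUE; // longest textual form: a sign plus 19 digits
        System.out.println(textLength(v));   // prints 20
        System.out.println(binaryLength(v)); // prints 8
    }
}
```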

Beyond the actual I/O and formatting costs of serialization (using DataInputStream and DataOutputStream), there are other costs, for example the need to create new objects when deserializing.

Note that the DataOutputStream methods can be used to develop semi-custom data formats, for example:

import java.io.*;
import java.util.*;

public class binary1 {
    public static void main(String args[]) {
        try {
            FileOutputStream fos = new FileOutputStream("outdata");
            BufferedOutputStream bos = new BufferedOutputStream(fos);
            DataOutputStream dos = new DataOutputStream(bos);
            Random rn = new Random();
            final int N = 10;
            // format: a count, followed by that many ints
            dos.writeInt(N);
            for (int i = 1; i <= N; i++) {
                int r = rn.nextInt();
                System.out.println(r);
                dos.writeInt(r);
            }
            dos.close();
        } catch (IOException e) {
            System.err.println(e);
        }
    }
}

and:

import java.io.*;

public class binary2 {
    public static void main(String args[]) {
        try {
            FileInputStream fis = new FileInputStream("outdata");
            BufferedInputStream bis = new BufferedInputStream(fis);
            DataInputStream dis = new DataInputStream(bis);
            // read the count first, then that many ints
            int N = dis.readInt();
            for (int i = 1; i <= N; i++) {
                int r = dis.readInt();
                System.out.println(r);
            }
            dis.close();
        } catch (IOException e) {
            System.err.println(e);
        }
    }
}

These two programs write 10 integers to a file and then read them back.

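Obtaining Information About Files
So far the discussion has centered on reading and writing individual files. There is another side to I/O performance, which concerns looking up the properties of files. The following program prints the length of the file named on the command line:

import java.io.*;

public class length1 {
    public static void main(String args[]) {
        if (args.length != 1) {
            System.err.println("missing filename");
            System.exit(1);
        }
        File f = new File(args[0]);
        // length() has to ask the operating system
        long len = f.length();
        System.out.println(len);
    }
}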

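The Java runtime system itself does not know the length of a file, so it must ask the operating system for it. The same holds for other file properties, such as whether a path is a file or a directory and when it was last modified. The java.io.File class provides a set of methods for querying this information. Such queries are generally expensive in time and should be used as little as possible.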
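One way to keep the number of queries down, when several properties of the same file are needed, is to fetch them together. The sketch below uses java.nio.file.Files.readAttributes, a newer API that the original article does not cover; the class name AttrsOnce and the method describe are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;

public class AttrsOnce {
    // Reads the basic attributes in one bulk call and formats a few of them.
    public static String describe(Path p) throws IOException {
        BasicFileAttributes a = Files.readAttributes(p, BasicFileAttributes.class);
        return (a.isDirectory() ? "dir" : "file")
                + " size=" + a.size()
                + " modified=" + a.lastModifiedTime();
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("attrs", ".txt");
        p.toFile().deleteOnExit();
        Files.write(p, new byte[] {1, 2, 3, 4, 5});
        // type, size, and modification time all come from the same lookup
        System.out.println(describe(p));
    }
}
```

The returned attributes object is a snapshot, so it can be reused for repeated checks without going back to the operating system, at the price of not seeing later changes to the file.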

A program that recursively walks every file on the system looks like this:

import java.io.*;

public class roots {
    public static void visit(File f) {
        System.out.println(f);
    }

    public static void walk(File f) {
        visit(f);
        if (f.isDirectory()) {
            String list[] = f.list();
            // list() returns null if the directory cannot be read
            if (list == null)
                return;
            for (int i = 0; i < list.length; i++)
                walk(new File(f, list[i]));
        }
    }

    public static void main(String args[]) {
        File list[] = File.listRoots();
        for (int i = 0; i < list.length; i++) {
            if (list[i].exists())
                walk(list[i]);
            else
                System.err.println("not accessible: " + list[i]);
        }
    }
}

This program uses File methods such as isDirectory and exists to navigate the directory structure. Each file is queried exactly once for its type (plain file or directory).