Large file operations (e.g. 4 GB)

Source: Internet  Editor: 程序博客网  Time: 2024/04/29 15:52

A large file here means a file bigger than 4 GB. Operating on such a file on a 32-bit machine runs into problems, which are explained in detail below.

 

The large file problem

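First, on a 32-bit machine, opening and closing a large file with fopen/fclose is not a problem, and neither is sequential reading or writing with while(!feof(fp)){ fread / fgets / fscanf } or while(1){ fwrite / fputs / fprintf }. However, long is 32 bits on a 32-bit machine, so fseek(FILE *stream, long offset, int whence) and long ftell(FILE *stream) cannot reach positions in a file of 4 GB or more (a signed 32-bit long already tops out just below 2 GB). Use fseeko(FILE *stream, off_t offset, int whence) and off_t ftello(FILE *stream) in place of fseek and ftell. As long as the offset passed to fseeko is declared with a 64-bit type (off_t, long on a 64-bit machine, long long on a 32-bit machine, or int64_t/uint64_t), files larger than 4 GB can be handled.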

For details on fseeko and ftello, see the "ftello&fseeko" section at the end of this article.

 

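Note: open returns a file descriptor and does not read the file into memory. File content is brought into memory only when read is called, and only the part that was requested.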
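A small sketch of that note, using a made-up helper name read_prefix (not from the original article): open only yields a descriptor, and bytes are copied into memory at the read call.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read at most n bytes from the start of path
 * into buf. Returns the number of bytes read, or -1 on error.
 */
ssize_t read_prefix(const char *path, void *buf, size_t n)
{
    int fd = open(path, O_RDONLY);   /* only a descriptor; no file data is loaded */
    if (fd < 0)
        return -1;
    ssize_t got = read(fd, buf, n);  /* data is copied into buf here */
    close(fd);
    return got;
}
```

Because only n bytes are requested, the size of the file itself does not matter for this call.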

 

Solution

The type off_t is defined in <sys/types.h>:

# ifndef __USE_FILE_OFFSET64
typedef __off_t off_t;
# else
typedef __off64_t off_t;
# endif

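By default off_t is 32 bits on a 32-bit machine and 64 bits on a 64-bit machine. So on a 32-bit machine, add the macro definition #define _FILE_OFFSET_BITS 64 before the #include lines, or pass -D_FILE_OFFSET_BITS=64 when compiling. This tells the system to use 64-bit file offsets internally and turns off_t into __off64_t. Then replace ftell and fseek with the corresponding ftello and fseeko, and large files can be handled.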
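One way to confirm that the macro took effect is to check the width of off_t at compile time. The following is a sketch under the assumption of a C11 compiler (for _Static_assert) and a glibc-style system; the function name off_t_bits is invented for this example.

```c
#define _FILE_OFFSET_BITS 64   /* before any #include, or pass -D_FILE_OFFSET_BITS=64 */
#include <limits.h>
#include <stdio.h>
#include <sys/types.h>

/* Compile-time guard (C11): stop the build if off_t is narrower than 64 bits. */
_Static_assert(sizeof(off_t) * CHAR_BIT >= 64,
               "off_t is not 64-bit; large-file support is off");

/* Width of off_t in bits, as seen by this translation unit. */
int off_t_bits(void)
{
    return (int)(sizeof(off_t) * CHAR_BIT);
}
```

If the define is placed after a system header has already been included, it may have no effect on a 32-bit build, which is exactly the case the static assertion is meant to catch.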

I also came across an English description that is much the same as the solution above. It is pasted here to aid understanding:

In a nutshell for using LFS you can choose either of the following:

  • Compile your programs with "gcc -D_FILE_OFFSET_BITS=64". This forces all file access calls to use the 64 bit variants. Several types change also, e.g. off_t becomes off64_t. It's therefore important to always use the correct types and to not use e.g. int instead of off_t. For portability with other platforms you should use getconf LFS_CFLAGS which will return -D_FILE_OFFSET_BITS=64 on Linux platforms but might return something else on e.g. Solaris. For linking, you should use the link flags that are reported via getconf LFS_LDFLAGS. On Linux systems, you do not need special link flags.
  • Define _LARGEFILE_SOURCE and _LARGEFILE64_SOURCE. With these defines you can use the LFS functions like open64 directly.
  • Use the O_LARGEFILE flag with open to operate on large files.

 

Creating a large file

dd if=/dev/zero of=tt bs=1G seek=100 count=0

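bs=1G sets the block size to 1 GB, count=0 copies zero blocks, and seek=100 skips 100 blocks of the output file before writing. Since the block size is 1 GB, 100 GB is skipped in total. The result is a sparse 100 GB file into which no data has actually been written.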
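The same kind of file can be produced from C. The sketch below is an illustration rather than what dd does internally line by line: it sets the file length with ftruncate without writing any data. The helper name make_sparse_file is hypothetical, and the file only stays sparse on filesystems that support holes.

```c
#define _FILE_OFFSET_BITS 64   /* so that off_t can hold sizes above 2 GB on 32-bit builds */
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create (or truncate) path and set its length to
 * size bytes without writing any data.
 * Returns 0 on success, -1 on error.
 */
int make_sparse_file(const char *path, long long size)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    int rc = ftruncate(fd, (off_t)size);  /* extends the file with a hole */
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

For the 100 GB file used in this article the call would be make_sparse_file("tt", 100LL << 30).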
 

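Getting system parameters

To test properly you need to know whether your system is 32-bit or 64-bit. Four ways to find out are given below: uname -a, getconf LONG_BIT, checking for a lib64 directory, and sizeof in a program.
Output on a 64-bit machine: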
[root@SPA c]# uname -a
Linux SPA 2.6.18-194.17.1.b1.05 #3 SMP Fri Jan 25 15:14:45 CST 2013 x86_64 x86_64 x86_64 GNU/Linux
[root@SPA c]# getconf LONG_BIT
64
[root@SPA c]# /lib
lib/  lib64/
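Finally, you can find out programmatically: sizeof(long) or sizeof(size_t) gives the size in bytes, and multiplying it by 8 gives the system's word size in bits.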
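The programmatic check might look like the sketch below. The function names are made up for this example, and the result reflects how the program was compiled (a 32-bit build on a 64-bit kernel reports 32).

```c
#include <limits.h>
#include <stddef.h>

/* Width of long in bits: 32 on a 32-bit Linux build, 64 on a 64-bit one. */
int long_bits(void)
{
    return (int)(sizeof(long) * CHAR_BIT);
}

/* Width of size_t in bits. */
int size_t_bits(void)
{
    return (int)(sizeof(size_t) * CHAR_BIT);
}
```

CHAR_BIT is used instead of a literal 8 so the arithmetic stays correct by definition rather than by convention.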
 

Test programs

Test program 1:
#include <stdio.h>
#include <stdlib.h>      /* system() */
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <assert.h>

#define FILENAME "tt"
#define READBUFSIZE 100

int main (void)
{
    struct stat buf;
    FILE *stream = fopen (FILENAME, "r");   /* was "rw", which is not a valid mode; the file is only read */
    char readbuf[READBUFSIZE];
    size_t ret = 0;

    printf ("\nfollowing messages present system info.\n");
    printf ("sizeof(size_t) = %zu, sizeof(off_t) = %zu\n", sizeof (size_t), sizeof (off_t));
    system ("getconf LONG_BIT");
    system ("uname -a");

    printf ("\nfollowing messages present file info.\n\n");
    if (stat (FILENAME, &buf) != 0) {
        perror ("stat:");
        return -1;
    }
    printf ("file size is %lld Byte\n", (long long) buf.st_size);

    if (!stream) {
        perror ("fopen:");
        return -1;
    }

    /* read the first READBUFSIZE bytes */
    ret = fread (readbuf, READBUFSIZE, 1, stream);
    printf ("fread:%zu Byte\n", ret * READBUFSIZE);
    assert (ret == 1);
    printf ("current pos is %lld Byte\n", (long long) ftello (stream));

    /* jump to READBUFSIZE bytes before the end and read the tail */
    if (fseeko (stream, -READBUFSIZE, SEEK_END) != 0) {
        perror ("fseeko:");
        return -1;
    }
    ret = fread (readbuf, READBUFSIZE, 1, stream);
    printf ("fread:%zu Byte\n", ret * READBUFSIZE);
    assert (ret == 1);
    printf ("after read last %d Byte ,cur pos is %lld Byte\n", READBUFSIZE, (long long) ftello (stream));

    return 0;
}
Test program 2:
#include <stdio.h>
#include <stdlib.h>      /* system() */
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <assert.h>

#define FILENAME "tt"
#define READBUFSIZE 100

int main (void)
{
    struct stat buf;
    FILE *stream = fopen (FILENAME, "r");   /* was "rw", which is not a valid mode; the file is only read */
    char readbuf[READBUFSIZE];
    size_t ret = 0;

    printf ("\nfollowing messages present system info.\n");
    printf ("sizeof(size_t) = %zu, sizeof(off_t) = %zu\n", sizeof (size_t), sizeof (off_t));
    system ("getconf LONG_BIT");
    system ("uname -a");

    printf ("\nfollowing messages present file info.\n\n");
    if (stat (FILENAME, &buf) != 0) {
        perror ("stat:");
        return -1;
    }
    printf ("file size is %lld Byte\n", (long long) buf.st_size);

    if (!stream) {
        perror ("fopen:");
        return -1;
    }

    /* read the first READBUFSIZE bytes */
    ret = fread (readbuf, READBUFSIZE, 1, stream);
    printf ("fread:%zu Byte\n", ret * READBUFSIZE);
    assert (ret == 1);
    /* ftell() returns long; the %lld is kept from the original because the
       article discusses this mismatch later on */
    printf ("current pos is %lld Byte\n", ftell (stream));

    /* jump to READBUFSIZE bytes before the end and read the tail */
    if (fseek (stream, -READBUFSIZE, SEEK_END) != 0) {
        perror ("fseek:");
        return -1;
    }
    ret = fread (readbuf, READBUFSIZE, 1, stream);
    printf ("fread:%zu Byte\n", ret * READBUFSIZE);
    assert (ret == 1);
    printf ("after read last %d Byte ,cur pos is %lld Byte\n", READBUFSIZE, ftell (stream));

    return 0;
}

 

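Test results

On a 64-bit virtual machine the two programs above give identical results, as follows:
Create the 100 GB file: dd if=/dev/zero of=tt bs=1G seek=100 count=0
Compile: gcc -g -o bigfile bigfile.c
Run the program; the output is: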

following messages present system info.
sizeof(size_t) = 8, sizeof(off_t) = 8
64
Linux SPA 2.6.18-194.17.1.b1.05 #3 SMP Fri Jan 25 15:14:45 CST 2013 x86_64 x86_64 x86_64 GNU/Linux

following messages present file info.

file size is 107374182400 Byte
fread:100 Byte
current pos is 100 Byte
fread:100 Byte
after read last 100 Byte ,cur pos is 107374182400 Byte

On a 32-bit virtual machine the results differ, as listed below:
For test program 1:
Create the 100 GB file: dd if=/dev/zero of=tt bs=1G seek=100 count=0
Compile: gcc -g -o bigfile bigfile.c
Run the program; the output is:

following messages present system info.
sizeof(size_t) = 4, sizeof(off_t) = 4
32
Linux localhost.localdomain 2.6.32-220.el6.i686 #1 SMP Tue Dec 6 16:15:40 GMT 2011 i686 i686 i386 GNU/Linux

following messages present file info.

stat:: Value too large for defined data type

Create the 100 GB file: dd if=/dev/zero of=tt bs=1G seek=100 count=0
Compile: gcc -D_FILE_OFFSET_BITS=64 -g -o bigfile bigfile.c
Run the program; the output is:

following messages present system info.
sizeof(size_t) = 4, sizeof(off_t) = 8
32
Linux localhost.localdomain 2.6.32-220.el6.i686 #1 SMP Tue Dec 6 16:15:40 GMT 2011 i686 i686 i386 GNU/Linux

following messages present file info.

file size is 107374182400 Byte
fread:100 Byte
current pos is 100 Byte
fread:100 Byte
after read last 100 Byte ,cur pos is 107374182400 Byte

For test program 2:

The results are the same as for test program 1.

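Open question

The test results show no difference between ftell and ftello: as long as -D_FILE_OFFSET_BITS=64 is given, both programs run correctly. But why can the long returned by the final ftell print such a large value?

Perhaps the problem lies here: the return value of ftell is printed with the %lld format. I therefore changed every place that prints the ftell result to use %ld. The program still ran, but some of the ftell values overflowed, so the printed information was wrong.

So the API calls appear to work in every case, but only because the test cases check the wrong thing and need to be improved. When ftell or fseek is called and the file position is outside the range a 32-bit number can represent, fseek and ftell cannot work correctly.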
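The range argument comes down to simple arithmetic: a signed 32-bit long tops out at 2147483647, while the end of the 100 GB test file sits at 107374182400. A sketch of that check, with a made-up helper name that is not part of the original test programs:

```c
#include <stdint.h>

/*
 * Hypothetical check: would this file position fit in a signed 32-bit
 * long, the type ftell() returns on a 32-bit Linux build?
 * Returns 1 if it fits, 0 otherwise.
 */
int fits_in_32bit_long(long long pos)
{
    return pos >= INT32_MIN && pos <= INT32_MAX;
}
```

An improved test case could compare the position it expects (for example the file size taken from stat) against this limit and treat any ftell result beyond it as suspect.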

 

ftello&fseeko

Excerpt from the Linux man page for ftello:

NAME
       fseeko, ftello - seek to or report file position

SYNOPSIS
       #include <stdio.h>

       int fseeko(FILE *stream, off_t offset, int whence);

       off_t ftello(FILE *stream);

DESCRIPTION
       The  fseeko()  and ftello() functions are identical to fseek() and ftell() (see fseek(3)), respec-
       tively, except that the offset argument of fseeko() and the return value of ftello()  is  of  type
       off_t instead of long.

       On many architectures both off_t and long are 32-bit types, but compilation with
              #define _FILE_OFFSET_BITS 64
       will turn off_t into a 64-bit type.

RETURN VALUE
       On  successful  completion, fseeko() returns 0, while ftello() returns the current offset.  Other-
       wise, -1 is returned and errno is set to indicate the error.

 

ftell&fseek

Excerpt from the Linux man page for ftell:

NAME
       fgetpos, fseek, fsetpos, ftell, rewind - reposition a stream

SYNOPSIS
       #include <stdio.h>

       int fseek(FILE *stream, long offset, int whence);

       long ftell(FILE *stream);

       void rewind(FILE *stream);

       int fgetpos(FILE *stream, fpos_t *pos);
       int fsetpos(FILE *stream, fpos_t *pos);

DESCRIPTION
       The  fseek()  function sets the file position indicator for the stream pointed to by stream.  The new
       position, measured in bytes, is obtained by adding offset bytes to the position specified by  whence.
       If whence is set to SEEK_SET, SEEK_CUR, or SEEK_END, the offset is relative to the start of the file,
       the current position indicator, or end-of-file, respectively.  A successful call to the fseek() func-
       tion clears the end-of-file indicator for the stream and undoes any effects of the ungetc(3) function
       on the same stream.

       The ftell() function obtains the current value of the file position indicator for the stream  pointed
       to by stream.

       The  rewind()  function  sets  the file position indicator for the stream pointed to by stream to the
       beginning of the file.  It is equivalent to:

              (void) fseek(stream, 0L, SEEK_SET)

       except that the error indicator for the stream is also cleared (see clearerr(3)).

       The fgetpos() and fsetpos() functions are alternate interfaces  equivalent  to  ftell()  and  fseek()
       (with  whence set to SEEK_SET), setting and storing the current value of the file offset into or from
       the object referenced by pos.  On some non-Unix systems an fpos_t object may be a complex object  and
       these routines may be the only way to portably reposition a text stream.

RETURN VALUE
       The  rewind()  function  returns no value.  Upon successful completion, fgetpos(), fseek(), fsetpos()
       return 0, and ftell() returns the current offset.  Otherwise, -1 is returned  and  errno  is  set  to
       indicate the error.

ERRORS
       EBADF  The stream specified is not a seekable stream.

       EINVAL The whence argument to fseek() was not SEEK_SET, SEEK_END, or SEEK_CUR.

 

fstat

Excerpt from the Linux man page for fstat:

NAME
       stat, fstat, lstat - get file status

SYNOPSIS
       #include <sys/types.h>
       #include <sys/stat.h>
       #include <unistd.h>

       int stat(const char *path, struct stat *buf);
       int fstat(int filedes, struct stat *buf);
       int lstat(const char *path, struct stat *buf);

DESCRIPTION
       These  functions return information about a file.  No permissions are required on the file itself,
       but, in the case of stat() and lstat(), execute (search) permission is required on  all  of  the
       directories in path that lead to the file.

       stat() stats the file pointed to by path and fills in buf.

       lstat()  is  identical  to stat(), except that if path is a symbolic link, then the link itself is
       stat-ed, not the file that it refers to.

       fstat() is identical to stat(), except that the file to  be  stat-ed  is  specified  by  the  file
       descriptor filedes.

       All of these system calls return a stat structure, which contains the following fields:

          struct stat {
              dev_t     st_dev;     /* ID of device containing file */
              ino_t     st_ino;     /* inode number */
              mode_t    st_mode;    /* protection */
              nlink_t   st_nlink;   /* number of hard links */
              uid_t     st_uid;     /* user ID of owner */
              gid_t     st_gid;     /* group ID of owner */
              dev_t     st_rdev;    /* device ID (if special file) */
              off_t     st_size;    /* total size, in bytes */
              blksize_t st_blksize; /* blocksize for filesystem I/O */
              blkcnt_t  st_blocks;  /* number of blocks allocated */
              time_t    st_atime;   /* time of last access */
              time_t    st_mtime;   /* time of last modification */
              time_t    st_ctime;   /* time of last status change */
          };

The stat information is included here to show that the file size st_size obtained from stat is also of type off_t.
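Since st_size is an off_t, reading a file size safely follows the same rules as fseeko and ftello. A minimal sketch, assuming a glibc-style system and using the invented helper name file_size:

```c
#define _FILE_OFFSET_BITS 64   /* keeps st_size 64 bits wide on 32-bit builds */
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper: size of path in bytes, or -1 if stat() fails.
 * st_size is an off_t, so it is cast to long long before returning.
 */
long long file_size(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return -1;
    return (long long)st.st_size;
}
```

Returning long long lets the caller print the value with %lld regardless of how wide off_t happens to be.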