Building Your Own Java Virtual Machine (Part 1): Parsing the class File

Source: Internet | Editor: 程序博客网 | Time: 2024/04/29 23:46


I. Understanding the class file structure

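A Java source file with the .java suffix is compiled by javac into a bytecode (class) file with the following structure (taken from the JVM Specification, Java SE 8 edition):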

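ClassFile {
    u4             magic;               // magic number, 0xCAFEBABE; marks this as a Java class file
    u2             minor_version;       // minor version number
    u2             major_version;       // major version number
    u2             constant_pool_count; // number of entries in the constant_pool table plus one
    cp_info        constant_pool[constant_pool_count-1]; // constant pool table; indices start at 1
    u2             access_flags;        // access flags of this class or interface
    u2             this_class;          // the class or interface defined by this file: a constant pool index of a CONSTANT_Class_info structure
    u2             super_class;         // the direct superclass: a constant pool index of a CONSTANT_Class_info structure. The value is 0 only for java.lang.Object, the one class without a superclass
    u2             interfaces_count;    // number of direct superinterfaces of this class or interface
    u2             interfaces[interfaces_count]; // each value is a constant pool index of a CONSTANT_Class_info structure
    u2             fields_count;        // number of field_info structures in the fields table (fields declared by this class or interface, both class variables and instance variables)
    field_info     fields[fields_count]; // each entry is a field_info structure with the complete description of one field; fields inherited from superclasses or superinterfaces are not included
    u2             methods_count;       // number of method_info structures in the methods table
    method_info    methods[methods_count]; // each entry is a method_info structure with the complete description of one method; unless the method is ACC_NATIVE or ACC_ABSTRACT it also contains the JVM instructions (the method's code)
    u2             attributes_count;    // number of attribute_info structures in the attributes table of this class or interface
    attribute_info attributes[attributes_count];
}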

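Here u1, u2 and u4 denote unsigned quantities of 1, 2 and 4 bytes. Given this layout, it is natural to represent a class file with a C struct, using unsigned char for u1, unsigned short for u2 and unsigned int for u4 (these have the expected sizes on common desktop platforms, although the C standard does not guarantee it).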

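However, cp_info, field_info, method_info and attribute_info are compound types, so the information above is not yet enough to decide how to represent a class file with C structures. We have to read on: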

1. Main structures in the constant pool

The constant pool is the key piece: many JVM instructions refer to symbolic information in the constant pool by index.

Every constant pool entry has the following general form:

cp_info {
    u1 tag;    // one byte giving the type of the constant
    u1 info[]; // contents; the layout depends on tag
}

Table 1: Constant pool types and their tags (from the JVM Specification, Java SE 8 edition)

Constant pool type             tag
CONSTANT_Class                 7
CONSTANT_Fieldref              9
CONSTANT_Methodref             10
CONSTANT_InterfaceMethodref    11
CONSTANT_String                8
CONSTANT_Integer               3
CONSTANT_Float                 4
CONSTANT_Long                  5
CONSTANT_Double                6
CONSTANT_NameAndType           12
CONSTANT_Utf8                  1
CONSTANT_MethodHandle          15
CONSTANT_MethodType            16
CONSTANT_InvokeDynamic         18

Naturally, we can define these constants with #define:

#define CONSTANT_Class 7
#define CONSTANT_Fieldref 9
...
#define CONSTANT_InvokeDynamic 18
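When dumping a constant pool for debugging, it helps to turn a tag back into its name. The following is only a sketch of one way to do that: the lookup table and the function name cp_tag_name are made up for this illustration, and the tag values are the ones from Table 1.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical lookup table: tag values copied from Table 1. */
typedef struct {
    unsigned char tag;
    const char   *name;
} TagName;

static const TagName TAG_NAMES[] = {
    { 1,  "CONSTANT_Utf8" },
    { 3,  "CONSTANT_Integer" },
    { 4,  "CONSTANT_Float" },
    { 5,  "CONSTANT_Long" },
    { 6,  "CONSTANT_Double" },
    { 7,  "CONSTANT_Class" },
    { 8,  "CONSTANT_String" },
    { 9,  "CONSTANT_Fieldref" },
    { 10, "CONSTANT_Methodref" },
    { 11, "CONSTANT_InterfaceMethodref" },
    { 12, "CONSTANT_NameAndType" },
    { 15, "CONSTANT_MethodHandle" },
    { 16, "CONSTANT_MethodType" },
    { 18, "CONSTANT_InvokeDynamic" }
};

/* Returns the name for a tag, or NULL when the tag is not in Table 1. */
const char *cp_tag_name(unsigned char tag)
{
    size_t i;
    size_t n = sizeof TAG_NAMES / sizeof TAG_NAMES[0];

    assert(n == 14);  /* Table 1 lists 14 constant pool types */
    for (i = 0; i < n; i++) {
        if (TAG_NAMES[i].tag == tag) {
            return TAG_NAMES[i].name;
        }
    }
    return NULL;
}
```

A NULL result is a convenient signal that the parser has lost its place in the file or that the class file uses a constant type newer than the ones listed here.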

1.1 The CONSTANT_Class_info type

CONSTANT_Class_info represents a class or an interface:

CONSTANT_Class_info {
    u1 tag;        // always 7, i.e. CONSTANT_Class
    u2 name_index; // a constant pool index of a CONSTANT_Utf8_info structure
}

Naturally, we can define a C struct to represent it:

typedef struct _CONSTANT_Class_info {
    uchar tag;         // for convenience: typedef unsigned char uchar;
    ushort name_index; // for convenience: typedef unsigned short ushort;
} CONSTANT_Class_info;

1.2 The CONSTANT_Fieldref_info, CONSTANT_Methodref_info and CONSTANT_InterfaceMethodref_info types

These three types share a similar structure:

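.... {
    u1 tag;                 // 9 => CONSTANT_Fieldref, 10 => CONSTANT_Methodref, 11 => CONSTANT_InterfaceMethodref
    u2 class_index;         // a constant pool index of a CONSTANT_Class_info structure
    u2 name_and_type_index; // a constant pool index of a CONSTANT_NameAndType_info structure
}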

So we can define the following C structs:

typedef struct _CONSTANT_Fieldref_info {
    uchar tag;
    ushort class_index;
    ushort name_and_type_index;
    ushort findex; // index of this field within the object, reserved for later use
    uchar ftype;   // type of this field, reserved for later use
} CONSTANT_Fieldref_info;

typedef struct _CONSTANT_Methodref_info {
    uchar tag;
    ushort class_index;
    ushort name_and_type_index;
    void* ref_addr;  // address of the method, reserved for later use
    ushort args_len; // number of arguments of the method, reserved for later use
} CONSTANT_Methodref_info;

// CONSTANT_InterfaceMethodref_info is not covered for now, so it is the same as CONSTANT_Methodref_info
typedef CONSTANT_Methodref_info CONSTANT_InterfaceMethodref_info;

1.3 CONSTANT_String_info

This type represents a constant object of type java.lang.String. Its structure is:

CONSTANT_String_info {
    u1 tag;          // always 8, i.e. CONSTANT_String
    u2 string_index; // a constant pool index of a CONSTANT_Utf8_info structure holding the sequence of Unicode code points to which the String object is initialized
}

The corresponding C struct is:

typedef struct _CONSTANT_String_info {
    uchar tag;
    ushort string_index;
} CONSTANT_String_info;

1.4 CONSTANT_Integer_info and CONSTANT_Float_info

These two types represent 4-byte numeric constants: CONSTANT_Integer_info holds an int and CONSTANT_Float_info holds a float. Their structure is:

... {
    u1 tag;   // type tag: 3 => CONSTANT_Integer, 4 => CONSTANT_Float
    u4 bytes; // the 4 bytes of the int or float, stored in big-endian order
}

Accordingly, we can define the following C structs:

typedef struct _CONSTANT_Integer_info {
    uchar tag;
    int value;   // value of the int constant
} CONSTANT_Integer_info;

typedef struct _CONSTANT_Float_info {
    uchar tag;
    float value; // value of the float constant
} CONSTANT_Float_info;

1.5 CONSTANT_Long_info and CONSTANT_Double_info

These two types represent 8-byte numeric constants: CONSTANT_Long_info holds a long and CONSTANT_Double_info holds a double. Their structure is:

... {
    u1 tag;        // type tag: 5 => CONSTANT_Long, 6 => CONSTANT_Double
    u4 high_bytes; // the high four bytes
    u4 low_bytes;  // the low four bytes
}

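Note: if a CONSTANT_Long_info or CONSTANT_Double_info structure sits at constant pool index n, the next usable index is n+2; index n+1 must be valid but is considered unusable. (This is a little odd. The JVM Specification itself remarks that, in retrospect, making 8-byte constants take two constant pool entries was a poor choice.)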
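The index rule above affects the loop that walks the constant pool: the loop counter must advance by two after a long or double. A minimal sketch of that bookkeeping follows; the function names next_cp_index and pool_count_for are invented for this illustration and do not come from the article's code.

```c
#include <assert.h>

#define CONSTANT_Long   5
#define CONSTANT_Double 6

/* Given an entry with the given tag stored at index n, returns the
   index of the entry that follows it in the constant pool. */
unsigned next_cp_index(unsigned n, unsigned char tag)
{
    assert(n >= 1);  /* constant pool indices start at 1 */
    if (tag == CONSTANT_Long || tag == CONSTANT_Double) {
        return n + 2;  /* 8-byte constants occupy two slots */
    }
    return n + 1;
}

/* Given the tags of the entries in file order, returns the value that
   constant_pool_count must have for such a pool. */
unsigned pool_count_for(const unsigned char *tags, unsigned n_entries)
{
    unsigned index = 1;
    unsigned i;

    assert(tags != 0 || n_entries == 0);
    for (i = 0; i < n_entries; i++) {
        index = next_cp_index(index, tags[i]);
    }
    return index;  /* one past the last occupied slot */
}
```

For example, a pool holding a Utf8 entry, a long and a class entry stores three structures in the file but has a constant_pool_count of 5, because the long takes indices 2 and 3.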

As with CONSTANT_Integer_info and CONSTANT_Float_info, we define the following C structs to store these two types:

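typedef struct _CONSTANT_Long_info {
    uchar tag;
    long long value; // value of the long constant (long long is at least 64 bits; plain long is only 32 bits on some platforms)
} CONSTANT_Long_info;

typedef struct _CONSTANT_Double_info {
    uchar tag;
    double value;    // value of the double constant
} CONSTANT_Double_info;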
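To fill these value fields, the two 4-byte halves read from the file have to be joined into one 64-bit quantity. The sketch below shows one possible way, assuming high_bytes and low_bytes have already been decoded into host integers and that the host uses IEEE 754 doubles; make_long and make_double are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Joins the two halves of a CONSTANT_Long_info into a signed 64-bit value. */
int64_t make_long(uint32_t high_bytes, uint32_t low_bytes)
{
    uint64_t bits = ((uint64_t)high_bytes << 32) | low_bytes;
    int64_t value;

    /* Copy the bit pattern instead of casting, so that negative values
       do not rely on implementation-defined conversion. */
    memcpy(&value, &bits, sizeof value);
    return value;
}

/* Joins the two halves of a CONSTANT_Double_info into a double.
   Assumption: the host double is a 64-bit IEEE 754 value. */
double make_double(uint32_t high_bytes, uint32_t low_bytes)
{
    uint64_t bits = ((uint64_t)high_bytes << 32) | low_bytes;
    double value;

    assert(sizeof value == sizeof bits);
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

With these helpers, the halves 0x40404000 and 0x00000000 yield the double 32.5, and two halves of 0xFFFFFFFF yield the long -1.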

1.6 CONSTANT_NameAndType_info

This structure describes the name and type of a field or method:

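CONSTANT_NameAndType_info {
    u1 tag;              // type tag, always 12, i.e. CONSTANT_NameAndType
    u2 name_index;       // a constant pool index of a CONSTANT_Utf8_info structure giving the name of the field or method
    u2 descriptor_index; // a constant pool index of a CONSTANT_Utf8_info structure giving the descriptor (type) of the field or method
}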

The corresponding C struct is:

typedef struct _CONSTANT_NameAndType_info {
    uchar tag;
    ushort name_index;
    ushort descriptor_index;
} CONSTANT_NameAndType_info;

1.7 CONSTANT_Utf8_info

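This is probably the most frequently referenced structure: many constant pool structures have some *_index field that points to it. It represents a constant string value: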

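CONSTANT_Utf8_info {
    u1 tag;           // type tag, always 1, i.e. CONSTANT_Utf8
    u2 length;        // number of bytes in the string
    u1 bytes[length]; // the actual string bytes (not examined in depth here; see the JVM Specification for details)
}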

We can define the following C struct to store it:

typedef struct _CONSTANT_Utf8_info {
    uchar tag;
    ushort length;
    char *bytes; // C-style string terminated by a 0 byte; remember to allocate length+1 bytes
} CONSTANT_Utf8_info;
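The "allocate length+1 bytes" remark can be illustrated with a small parser. This is a sketch only: it reads from an in-memory buffer rather than a FILE*, and Utf8Entry and parse_utf8_entry are stand-in names invented for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for CONSTANT_Utf8_info. */
typedef struct {
    unsigned char  tag;
    unsigned short length;
    char          *bytes;
} Utf8Entry;

/* p points at the u2 length field, i.e. just after the tag byte.
   Returns 0 on success, -1 if memory could not be allocated.
   The caller owns out->bytes and must free() it. */
int parse_utf8_entry(const unsigned char *p, Utf8Entry *out)
{
    assert(p != NULL && out != NULL);

    out->tag = 1;  /* CONSTANT_Utf8 */
    out->length = (unsigned short)((p[0] << 8) | p[1]);  /* big-endian u2 */

    out->bytes = malloc((size_t)out->length + 1);  /* +1 for the terminator */
    if (out->bytes == NULL) {
        return -1;
    }
    memcpy(out->bytes, p + 2, out->length);
    out->bytes[out->length] = '\0';
    return 0;
}
```

Keeping length alongside the terminated copy means the string can be printed with %s while the exact byte count from the file is still available.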

1.8 Other structures

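The remaining constant pool structures are CONSTANT_MethodHandle_info, CONSTANT_MethodType_info and CONSTANT_InvokeDynamic_info. We will not study them for now and simply store their contents; they will be discussed later when we dig deeper.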

Their corresponding C structs are:

typedef struct _CONSTANT_MethodHandle_info {
    uchar tag;
    uchar reference_kind;
    ushort reference_index;
} CONSTANT_MethodHandle_info;

typedef struct _CONSTANT_MethodType_info {
    uchar tag;
    ushort descriptor_index;
} CONSTANT_MethodType_info;

typedef struct _CONSTANT_InvokeDynamic_info {
    uchar tag;
    ushort bootstrap_method_attr_index;
    ushort name_and_type_index;
} CONSTANT_InvokeDynamic_info;

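At this point every constant pool structure has a corresponding C struct to represent and store it. Because the structs are similar but not identical, no single one of them is a suitable type for the cp_info entries of constant_pool in ClassFile, so we use the generic pointer type void** instead.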

typedef void** cp_info;
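Because every entry struct begins with a uchar tag, the tag can be read through the untyped pointer before deciding which struct to cast to. The sketch below shows how a lookup in the spirit of get_class_name might work on such a pool; ClassEntry, Utf8Entry, entry_tag and class_name_at are simplified stand-ins written for this example, not the article's actual code, and no bounds checking is done because the pool size is not passed in.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char  uchar;
typedef unsigned short ushort;

/* Simplified stand-ins for CONSTANT_Class_info and CONSTANT_Utf8_info. */
typedef struct { uchar tag; ushort name_index; } ClassEntry;
typedef struct { uchar tag; ushort length; char *bytes; } Utf8Entry;

/* Reads the tag of the entry at index; valid because tag is the first
   member of every entry struct. */
uchar entry_tag(void **pool, ushort index)
{
    assert(pool != NULL && pool[index] != NULL);
    return *(const uchar *)pool[index];
}

/* Returns the class name for the CONSTANT_Class entry at index, or NULL
   if the index does not lead to a Class entry with a Utf8 name. */
const char *class_name_at(void **pool, ushort index)
{
    const ClassEntry *cls;
    const Utf8Entry *name;

    assert(pool != NULL);
    if (index == 0 || pool[index] == NULL || entry_tag(pool, index) != 7) {
        return NULL;  /* index 0 is never a valid entry; 7 = CONSTANT_Class */
    }
    cls = pool[index];
    if (pool[cls->name_index] == NULL || entry_tag(pool, cls->name_index) != 1) {
        return NULL;  /* 1 = CONSTANT_Utf8 */
    }
    name = pool[cls->name_index];
    return name->bytes;
}
```

The cost of the void** design is that the compiler cannot check these casts, so verifying the tag before every cast is a cheap safeguard.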

2. The field_info structure

Each field is described by a field_info structure:

field_info {
    u2 access_flags;     // access flags: ACC_PUBLIC (0x0001), ACC_PRIVATE (0x0002), ...
    u2 name_index;       // a constant pool index of a CONSTANT_Utf8_info structure giving the name of the field
    u2 descriptor_index; // a constant pool index of a CONSTANT_Utf8_info structure giving the descriptor of the field
    u2 attributes_count; // number of additional attributes of the field
    attribute_info attributes[attributes_count]; // attribute table; each entry is an attribute_info structure
}

Yet another structure pops up: attribute_info! The JVM Specification defines more than 20 kinds of attribute_info structure, all of which share a similar layout:

attribute_info {
    u2 attribute_name_index;
    u4 attribute_length;
    u1 info[attribute_length];
}

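Of course we will not analyze every attribute for now. But since the length of each attribute is known, we can read its contents in one go and then analyze only the attributes we care about.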

We define the following struct to represent and store attribute_info:

typedef struct _attribute_info{
    ushort attribute_name_index;
    uint attribute_length;
    uchar *info;
} attribute_info;
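Reading an attribute "in one go" could look roughly like the following. The sketch works on an in-memory buffer to stay self-contained; RawAttribute and parse_attribute are names invented for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for attribute_info. */
typedef struct {
    unsigned short attribute_name_index;
    unsigned int   attribute_length;
    unsigned char *info;  /* NULL when attribute_length is 0 */
} RawAttribute;

/* Parses one attribute starting at p and copies its body without
   interpreting it. Returns the number of bytes consumed, which is
   6 + attribute_length, or 0 if memory could not be allocated. */
size_t parse_attribute(const unsigned char *p, RawAttribute *out)
{
    assert(p != NULL && out != NULL);

    /* u2 attribute_name_index, big-endian */
    out->attribute_name_index = (unsigned short)((p[0] << 8) | p[1]);
    /* u4 attribute_length, big-endian */
    out->attribute_length = ((unsigned int)p[2] << 24) | ((unsigned int)p[3] << 16)
                          | ((unsigned int)p[4] << 8)  |  (unsigned int)p[5];

    out->info = NULL;
    if (out->attribute_length > 0) {
        out->info = malloc(out->attribute_length);
        if (out->info == NULL) {
            return 0;
        }
        memcpy(out->info, p + 6, out->attribute_length);
    }
    return 6 + (size_t)out->attribute_length;
}
```

Since the return value is the size of the whole attribute, a caller can step over attributes it does not understand by simply adding it to the current offset.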

So field_info can be defined as follows:

typedef struct _field_info{
    ushort access_flags;
    ushort name_index;
    ushort descriptor_index;
    ushort attributes_count;
    attribute_info **attributes;
    ushort findex; // field index (reserved for later use)
    uchar ftype;   // field type (reserved for later use)
} field_info;
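The access_flags value is a bit mask, so printing it in readable form means testing one bit at a time. Below is a sketch of such a formatter for field flags; the name format_field_flags and the caller-supplied buffer are choices made for this example, and the masks are the field access flags defined in the JVM Specification.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    unsigned short mask;
    const char    *name;
} FlagName;

/* Subset of the field access flags from the JVM Specification. */
static const FlagName FIELD_FLAGS[] = {
    { 0x0001, "ACC_PUBLIC" },
    { 0x0002, "ACC_PRIVATE" },
    { 0x0004, "ACC_PROTECTED" },
    { 0x0008, "ACC_STATIC" },
    { 0x0010, "ACC_FINAL" },
    { 0x0040, "ACC_VOLATILE" },
    { 0x0080, "ACC_TRANSIENT" }
};

/* Writes the names of all set flags, separated by spaces, into buf
   and returns buf. Output is cut short if buf is too small. */
char *format_field_flags(unsigned short flags, char *buf, size_t size)
{
    size_t i;
    size_t used = 0;

    assert(buf != NULL && size > 0);
    buf[0] = '\0';
    for (i = 0; i < sizeof FIELD_FLAGS / sizeof FIELD_FLAGS[0]; i++) {
        if (flags & FIELD_FLAGS[i].mask) {
            int n = snprintf(buf + used, size - used, "%s%s",
                             used ? " " : "", FIELD_FLAGS[i].name);
            if (n < 0 || (size_t)n >= size - used) {
                break;  /* no room left */
            }
            used += (size_t)n;
        }
    }
    return buf;
}
```

For a field declared public static, the flags value 0x0009 is rendered as "ACC_PUBLIC ACC_STATIC".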

3. The method_info structure

It describes a method of a class or interface:

method_info {
    u2 access_flags;
    u2 name_index;
    u2 descriptor_index;
    u2 attributes_count;
    attribute_info attributes[attributes_count];
}

It is similar to field_info, so there is little to add. The C struct is defined as follows:

typedef struct _method_info{
    ushort access_flags;
    ushort name_index;
    ushort descriptor_index;
    ushort attributes_count;
    attribute_info **attributes;
    void* code_attribute_addr; // address of the Code attribute (kept as a generic pointer)
    ushort args_len;           // number of formal parameters of the method
} method_info;
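A value such as args_len can be derived from the method descriptor string, for example "(II)I" for a method taking two ints and returning an int. The following sketch counts the parameters in a descriptor. It is one possible approach, not necessarily how the article's code fills args_len, and it counts parameters rather than local variable slots (a long or double parameter counts once here).

```c
#include <assert.h>
#include <stddef.h>

/* Counts the parameters in a method descriptor such as "(II)I".
   Returns -1 if the descriptor is malformed. */
int count_descriptor_args(const char *desc)
{
    int count = 0;
    const char *p;

    assert(desc != NULL);
    if (desc[0] != '(') {
        return -1;
    }
    p = desc + 1;
    while (*p != ')') {
        while (*p == '[') {
            p++;  /* array dimensions prefix the element type */
        }
        switch (*p) {
        case 'B': case 'C': case 'D': case 'F':
        case 'I': case 'J': case 'S': case 'Z':
            p++;  /* primitive type: a single character */
            break;
        case 'L':  /* object type: L<class name>; */
            while (*p != ';') {
                if (*p == '\0') {
                    return -1;
                }
                p++;
            }
            p++;  /* skip the ';' */
            break;
        default:
            return -1;  /* unknown type character or premature end */
        }
        count++;
    }
    return count;
}
```

For the sum and sub methods of the test class later in this article, whose descriptors are both "(II)I", the function returns 2.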

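For now, we care about the following attributes of method_info:

Code_attribute: contains the method's code plus some auxiliary information (such as the number of local variables and the maximum operand stack depth)
LineNumberTable_attribute: a table mapping JVM instruction indices to source line numbers; it is optional, used for debugging, and is contained inside Code_attribute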
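As an illustration of what the Code attribute holds, the sketch below decodes the first fields of its body as laid out in the JVM Specification: max_stack (u2), max_locals (u2), code_length (u4), followed by the bytecode itself. CodeHeader and parse_code_header are hypothetical names, and the exception table and nested attributes that follow the bytecode are left out.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    unsigned short       max_stack;    /* maximum operand stack depth */
    unsigned short       max_locals;   /* number of local variable slots */
    unsigned int         code_length;  /* number of bytecode bytes */
    const unsigned char *code;         /* points into info; not copied */
} CodeHeader;

/* info points at the attribute body, i.e. just after the u2 name index
   and u4 length. Returns 0 on success, -1 if the body is too short. */
int parse_code_header(const unsigned char *info, unsigned int info_len,
                      CodeHeader *out)
{
    assert(info != NULL && out != NULL);
    if (info_len < 8) {
        return -1;
    }
    out->max_stack   = (unsigned short)((info[0] << 8) | info[1]);
    out->max_locals  = (unsigned short)((info[2] << 8) | info[3]);
    out->code_length = ((unsigned int)info[4] << 24) | ((unsigned int)info[5] << 16)
                     | ((unsigned int)info[6] << 8)  |  (unsigned int)info[7];
    if (out->code_length > info_len - 8) {
        return -1;  /* the bytecode would run past the attribute body */
    }
    out->code = info + 8;
    return 0;
}
```

The code pointer refers to the caller's buffer, so that buffer must stay alive for as long as the CodeHeader is in use.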

II. Helper code

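As the structure descriptions above show, we often need to read 1, 2, 4 or 8 bytes, as well as a run of bytes of a given length. For numbers of type ushort, int, float, long and double we also have to convert from big-endian to little-endian byte order (class files are stored in big-endian order, while the author's CPU is little-endian). The conversion is simple: the bytes are still read sequentially, but they are stored in reverse order.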

Macros that read big-endian data into little-endian order (there is a slightly recursive flavor to them):

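#define READ_U1(fp, dest) fread((dest), 1, 1, (fp))
// each wider read stores the first (most significant) bytes at the higher addresses
#define READ_U2(fp, dest) \
    do { READ_U1(fp, (dest) + 1); READ_U1(fp, (dest)); } while (0)
#define READ_U4(fp, dest) \
    do { READ_U2(fp, (dest) + 2); READ_U2(fp, (dest)); } while (0)
#define READ_U8(fp, dest) \
    do { READ_U4(fp, (dest) + 4); READ_U4(fp, (dest)); } while (0)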

The functions that read ushort, int, float and so on can then be defined as follows:

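// requires <stdio.h> and <string.h>; memcpy copies the reversed bytes
// into the result without casting between incompatible pointer types
ushort readUShort(FILE *fp)
{
    uchar uc2[2];
    ushort v;
    READ_U2(fp, uc2);
    memcpy(&v, uc2, sizeof v);
    return v;
}

float readFloat(FILE *fp)
{
    uchar uc4[4];
    float v;
    READ_U4(fp, uc4);
    memcpy(&v, uc4, sizeof v);
    return v;
}

uint readUInt(FILE *fp)
{
    uchar uc4[4];
    uint v;
    READ_U4(fp, uc4);
    memcpy(&v, uc4, sizeof v);
    return v;
}

int readInt(FILE *fp)
{
    uchar uc4[4];
    int v;
    READ_U4(fp, uc4);
    memcpy(&v, uc4, sizeof v);
    return v;
}
... // long and double omitted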
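The byte-reversing approach above assumes a little-endian host. An alternative that gives the same result on any host is to assemble the value with shifts. The sketch below shows this variant; be16, be32 and read_be32 are hypothetical names and are not part of the article's code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Decodes a big-endian u2 from two bytes, whatever the host byte order. */
uint16_t be16(const unsigned char *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Decodes a big-endian u4 from four bytes, whatever the host byte order. */
uint32_t be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Reads a big-endian u4 from a file. Returns 0 on success and -1 when
   fewer than four bytes could be read. */
int read_be32(FILE *fp, uint32_t *out)
{
    unsigned char b[4];

    assert(fp != NULL && out != NULL);
    if (fread(b, 1, sizeof b, fp) != sizeof b) {
        return -1;
    }
    *out = be32(b);
    return 0;
}
```

Checking the fread result, as read_be32 does, also catches truncated class files, which the macro version silently ignores.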

III. Parsing the class file

With the preparation done, we can start parsing the class file.

The code is roughly as follows:

Class* loadClass(const char *filename)
{
    FILE *fp = fopen(filename, "rb");
    if (!fp) {
        printf("Cannot open: %s\n", filename);
        return NULL;
    }
    Class *pclass = (Class*)malloc(sizeof(Class));
    if (!pclass) {
        fclose(fp);
        return NULL;
    }
    pclass->parent_class = NULL;

    // step 1: read magic number
    pclass->magic = readUInt(fp);
    printf("Magic: 0x%X\n", pclass->magic);
    if (pclass->magic != 0xCAFEBABE) {
        printf("Invalid class file!\n");
        exit(1);
    }

    // step 2: read version
    pclass->minor_version = readUShort(fp);
    pclass->major_version = readUShort(fp);
    printf("minor_version: %d\n", pclass->minor_version);
    printf("major_version: %d\n", pclass->major_version);
    printf("--------------------------------------------\n");

    // step 3: read constant pool
    parseConstantPool(fp, pclass);
    printf("constant_pool_count: %d\n", pclass->constant_pool_count);
    showConstantPool(pclass);
    printf("--------------------------------------------\n");

    // step 4: read access_flags
    pclass->access_flags = readUShort(fp);
    printf("access_flag: %04X\t%s\n", pclass->access_flags, formatAccessFlag(pclass->access_flags));

    // step 5: this class
    pclass->this_class = readUShort(fp);
    printf("this_class: #%d\t%s\n", pclass->this_class, get_class_name(pclass->constant_pool, pclass->this_class));

    // step 6: super class
    pclass->super_class = readUShort(fp);
    if (pclass->super_class > 0) {
        printf("super_class: #%d\t%s\n", pclass->super_class, get_class_name(pclass->constant_pool, pclass->super_class));
    }
    printf("--------------------------------------------\n");

    // step 7: read interfaces
    parseInterface(fp, pclass);
    printf("interface_count: %d\n", pclass->interface_count);
    showInterface(pclass);
    printf("--------------------------------------------\n");

    // step 8: read fields
    parseFields(fp, pclass);
    printf("fields_count: %d\n", pclass->fields_count);
    showFields(pclass);
    printf("--------------------------------------------\n");

    // step 9: read methods
    parseMethods(fp, pclass);
    printf("methods_count: %d\n", pclass->methods_count);
    showMethods(pclass);
    printf("--------------------------------------------\n");

    // step 10: read attributes
    parseAttributes(fp, pclass);
    printf("attributes_count: %d\n", pclass->attributes_count);
    showAttributes(pclass, pclass->attributes, pclass->attributes_count);

    //setThisClassFieldIndex(pclass);
    // pclass is an extra member of CONSTANT_Class_info that is not shown in section 1.1
    ((CONSTANT_Class_info*)(pclass->constant_pool[pclass->this_class]))->pclass = pclass;

    fclose(fp);
    return pclass;
}

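There is too much code to list every function involved. In short, it is a matter of "cutting a path through the mountains and building a bridge across the river": follow the descriptions in the JVM Specification and handle each structure as it comes.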
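As a small, self-contained illustration of steps 1 and 2, the sketch below checks the magic number and decodes the version from the first 8 bytes of a class file held in memory. ClassHeader, parse_class_header and java_release_for_major are names invented for this example. The mapping from major version to Java release uses the fact that major version 52 corresponds to Java 8 and that each release since Java 5 (major 49) has increased it by one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t magic;
    uint16_t minor_version;
    uint16_t major_version;
} ClassHeader;

/* Decodes the first 8 bytes of a class file.
   Returns 0 on success, -1 if the buffer is too short,
   -2 if the magic number is not 0xCAFEBABE. */
int parse_class_header(const unsigned char *buf, size_t len, ClassHeader *out)
{
    assert(buf != NULL && out != NULL);
    if (len < 8) {
        return -1;
    }
    out->magic = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16)
               | ((uint32_t)buf[2] << 8)  |  (uint32_t)buf[3];
    if (out->magic != 0xCAFEBABEu) {
        return -2;
    }
    /* minor_version comes before major_version in the file */
    out->minor_version = (uint16_t)((buf[4] << 8) | buf[5]);
    out->major_version = (uint16_t)((buf[6] << 8) | buf[7]);
    return 0;
}

/* Maps a major version to a Java release number for Java 5 and later
   (49 => 5, ..., 52 => 8). Returns -1 for older versions. */
int java_release_for_major(unsigned major)
{
    return major >= 49 ? (int)major - 44 : -1;
}
```

Returning an error code instead of calling exit lets the caller decide how to react to a file that is not a class file.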

IV. Testing

Hello.java

package test;

public class Hello implements IMath
{
    static double C_DOUBLE = 12.45;
    public int xi = 789;
    protected long xl = 35;
    public float xf = -235.125f;
    public double db = 32.5;
    private int priv_i = 2;

    public int sum(int x, int y)
    {
        int s = 0;
        for(int i=x;i<=y;i++) {
            s+=i;
        }
        s += sub(x,y);
        return s+xi + this.priv_i;
    }

    private int sub(int x, int y) {
        return x-y;
    }
}

interface IMath {
    public int sum(int x, int y);
}

Compile it to bytecode with javac:

javac Hello.java

Then parse the Hello.class file. The output is as follows (compared with the output of javap):

[Figure: comparison of the constant pool output]
[Figure: comparison of the Code attribute output]

As the comparison shows, the parsing is correct.
