Android Native Development: NewString and NewStringUTF Explained

Source: Internet | Published by: e商盟软件 | Editor: 程序博客网 | Date: 2024/06/08 16:24

1. Origin of the problem

I recently ran into a native crash. The code that triggered it is shown below:

    jstring stringTojstring(JNIEnv* env, string str) {
        int len = str.length();
        wchar_t *wcs = new wchar_t[len * 2];
        int nRet = UTF82Unicode(str.c_str(), wcs, len);
        jchar* jcs = new jchar[nRet];
        for (int i = 0; i < nRet; i++)
        {
            jcs[i] = (jchar) wcs[i];
        }
        jstring retString = env->NewString(jcs, nRet);
        delete[] wcs;
        delete[] jcs;
        return retString;
    }

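The purpose of this code is to convert a C++ string into a JNI jstring. The crash was reported at delete[] jcs. The most likely cause we could trace is an empty str: the jchar array is then allocated with size 0, which is equivalent to jchar* jcs = new jchar[0], and on some versions deleting that zero-length array crashed under certain conditions. (Standard C++ allows both new T[0] and the matching delete[], so this points at the allocator or toolchain on those versions rather than at the language rules.)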

2. Code analysis and digging into the problem

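The crash itself was fixed simply by adding a check for an empty string. What puzzled us was why such a conversion function existed at all, so that is what we analyze next. Here is the source of the functions involved: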

    inline int UTF82UnicodeOne(const char* utf8, wchar_t& wch)
    {
        // Only a first byte >= 0xC0 starts a multi-byte sequence; anything
        // smaller is treated as a single ANSI character.
        unsigned char firstCh = utf8[0];
        if (firstCh >= 0xC0)
        {
            // The high bits of the first byte give the length of the UTF-8 sequence.
            int afters, code;
            if ((firstCh & 0xE0) == 0xC0)
            {
                afters = 2;
                code = firstCh & 0x1F;
            }
            else if ((firstCh & 0xF0) == 0xE0)
            {
                afters = 3;
                code = firstCh & 0xF;
            }
            else if ((firstCh & 0xF8) == 0xF0)
            {
                afters = 4;
                code = firstCh & 0x7;
            }
            else if ((firstCh & 0xFC) == 0xF8)
            {
                afters = 5;
                code = firstCh & 0x3;
            }
            else if ((firstCh & 0xFE) == 0xFC)
            {
                afters = 6;
                code = firstCh & 0x1;
            }
            else
            {
                wch = firstCh;
                return 1;
            }
            // With the length known, check the following bytes. If a check fails,
            // assume the UTF-8 is malformed (or the input is not UTF-8 at all)
            // and return the first byte as a single ANSI character.
            for(int k = 1; k < afters; ++ k)
            {
                if ((utf8[k] & 0xC0) != 0x80)
                {
                    // Not a valid continuation byte: return a single ANSI character.
                    wch = firstCh;
                    return 1;
                }
                code <<= 6;
                code |= (unsigned char)utf8[k] & 0x3F;
            }
            wch = code;
            return afters;
        }
        else
        {
            wch = firstCh;
        }
        return 1;
    }

    int UTF82Unicode(const char* utf8Buf, wchar_t *pUniBuf, int utf8Leng)
    {
        int i = 0, count = 0;
        while(i < utf8Leng)
        {
            i += UTF82UnicodeOne(utf8Buf + i, pUniBuf[count]);
            count ++;
        }
        return count;
    }

    jstring stringTojstring(JNIEnv* env, string str) {
        int len = str.length();
        wchar_t *wcs = new wchar_t[len * 2];
        int nRet = UTF82Unicode(str.c_str(), wcs, len);
        jchar* jcs = new jchar[nRet];
        for (int i = 0; i < nRet; i++)
        {
            jcs[i] = (jchar) wcs[i];
        }
        jstring retString = env->NewString(jcs, nRet);
        delete[] wcs;
        delete[] jcs;
        return retString;
    }

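As we know, there is a built-in function for converting a C++ string into a JNI jstring: env->NewStringUTF(str.c_str()). Why not call it directly instead of going through such a complicated conversion? The author of this code can no longer be reached, so we can only infer the reason from the code.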

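Start with UTF82Unicode. Its name is straightforward: it converts UTF-8 into Unicode (UTF-16). The second function, UTF82UnicodeOne, is harder to follow without knowing the UTF-8 and UTF-16 formats in detail, so we first take a closer look at these common encodings.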

3. UCS-2 and UTF-8 encodings

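One point needs to be clear first: what we casually call the "Unicode encoding" usually means UCS-2 or UTF-16. Unicode itself is a coded character set. It only assigns a number (a code point) to each character and does not say how that number is stored. Strictly speaking, UTF-8 and UCS-2 are both encoding forms of the Unicode character set; the former is variable-length and the latter is fixed-length.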

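UTF-8 is a variable-length encoding: it uses 1 to 4 bytes per character, depending on the character.
UCS-2 is a fixed-length encoding: it always uses 2 bytes per character.
UTF-16 is also variable-length: it uses 2 or 4 bytes per character and is identical to UCS-2 on the Basic Multilingual Plane.

The UCS character set is maintained by ISO (the International Organization for Standardization) as ISO/IEC 10646 and is kept in step with the Unicode Standard published by the Unicode Consortium. The 26 English letters plus other symbols fit comfortably in ASCII, but that is nowhere near enough for a language such as Chinese, which has tens of thousands of characters. To unify the encodings used in different countries, the aim was a single code covering all the world's scripts, letters and symbols in place of the regional schemes. It is named "Universal Multiple-Octet Coded Character Set", UCS for short, and commonly called "Unicode". The correspondence between Unicode code points and UTF-8 is: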

    Unicode code point range | UTF-8 encoding
    (hexadecimal)            | (binary)
    -------------------------+-------------------------------------
    0000 0000-0000 007F      | 0xxxxxxx
    0000 0080-0000 07FF      | 110xxxxx 10xxxxxx
    0000 0800-0000 FFFF      | 1110xxxx 10xxxxxx 10xxxxxx
    0001 0000-0010 FFFF      | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
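As a sketch of how the table is applied, the function below encodes a single code point into UTF-8. The name encodeUtf8 is made up for this illustration and does not appear in any of the code discussed in this article. Surrogate values are not rejected, so treat it as a simplification rather than a validating encoder.

```cpp
#include <cassert>
#include <string>

// Encode one Unicode code point as UTF-8, row by row as in the table.
// Hypothetical helper; surrogates (U+D800..U+DFFF) are not rejected here.
std::string encodeUtf8(char32_t cp) {
    assert(cp <= 0x10FFFF);  // upper limit of the Unicode code space
    std::string out;
    if (cp <= 0x7F) {                 // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {         // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {        // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                          // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, encodeUtf8(0x4E2D) yields the three bytes E4 B8 AD, which is the third row of the table applied to the code point of 中.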

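If the unified Unicode character set already exists, why not use UCS-2/UTF-16 everywhere? Because in English-speaking countries almost all text consists of ASCII characters, which UTF-8 stores in one byte each. A 2-byte encoding would simply waste space there.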

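Besides the encodings above there is UTF-32, also known as UCS-4, which uses a fixed 4 bytes per character. UTF-16 is an extension of UCS-2 (before Unicode introduced planes, the two were the same). On the Basic Multilingual Plane (BMP), UCS-2 and UTF-16 give identical results, but UTF-16 can also represent characters outside the BMP, using 4 bytes: the first two bytes form the high surrogate and the last two the low surrogate, and together they make a surrogate pair. Unicode has 17 planes in total: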

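    Plane        | Code point range     | Name                                | Abbreviation
    -------------+----------------------+-------------------------------------+-------------
    Plane 0      | U+0000 - U+FFFF      | Basic Multilingual Plane            | BMP
    Plane 1      | U+10000 - U+1FFFF    | Supplementary Multilingual Plane    | SMP
    Plane 2      | U+20000 - U+2FFFF    | Supplementary Ideographic Plane     | SIP
    Plane 3      | U+30000 - U+3FFFF    | Tertiary Ideographic Plane          | TIP
    Planes 4-13  | U+40000 - U+DFFFF    | (not yet used)                      |
    Plane 14     | U+E0000 - U+EFFFF    | Supplementary Special-purpose Plane | SSP
    Plane 15     | U+F0000 - U+FFFFF    | Private Use Area (A)                | PUA-A
    Plane 16     | U+100000 - U+10FFFF  | Private Use Area (B)                | PUA-B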
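The plane number and the surrogate pair of a code point both follow from simple bit arithmetic. The sketch below shows one way to compute them; planeOf and toSurrogatePair are illustrative names, not functions from Dalvik, ART or the JNI headers.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Plane index of a code point: 0 is the BMP, 1..16 are the other planes.
unsigned planeOf(char32_t cp) {
    assert(cp <= 0x10FFFF);
    return static_cast<unsigned>(cp >> 16);
}

// Split a code point outside the BMP (U+10000..U+10FFFF) into a UTF-16
// surrogate pair: first = high (leading) unit, second = low (trailing) unit.
std::pair<std::uint16_t, std::uint16_t> toSurrogatePair(char32_t cp) {
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    // Subtracting 0x10000 leaves a 20-bit value: 10 bits go into each unit.
    const std::uint32_t v = static_cast<std::uint32_t>(cp) - 0x10000;
    const std::uint16_t high = static_cast<std::uint16_t>(0xD800 + (v >> 10));
    const std::uint16_t low  = static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF));
    return {high, low};
}
```

For U+1F600, which lies in plane 1, this gives the pair 0xD83D 0xDE00.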

4. NewString and NewStringUTF

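Back to the question: why not call env->NewStringUTF directly, instead of first converting UTF-8 to UTF-16 by hand and then passing the resulting jchar array to env->NewString to build the jstring? This was clearly done on purpose, so we go down into the platform source to see why a separate UTF-8 to UTF-16 converter was written.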

Dalvik and ART implement these functions differently, so we look at each in turn:

4.1 Dalvik source walkthrough

First, the Dalvik source of the two functions. Both are declared in jni.h and implemented in jni.cpp (the source used here is Android 4.3.1):

    /*
     * Create a new String from Unicode data.
     *
     * If "len" is zero, we will return an empty string even if "unicodeChars"
     * is NULL.  (The JNI spec is vague here.)
     */
    static jstring NewString(JNIEnv* env, const jchar* unicodeChars, jsize len) {
        ScopedJniThreadState ts(env);
        StringObject* jstr = dvmCreateStringFromUnicode(unicodeChars, len);
        if (jstr == NULL) {
            return NULL;
        }
        dvmReleaseTrackedAlloc((Object*) jstr, NULL);
        return (jstring) addLocalReference(ts.self(), (Object*) jstr);
    }

    ....

    /*
     * Create a new java.lang.String object from chars in modified UTF-8 form.
     *
     * The spec doesn't say how to handle a NULL string.  Popular desktop VMs
     * accept it and return a NULL pointer in response.
     */
    static jstring NewStringUTF(JNIEnv* env, const char* bytes) {
        ScopedJniThreadState ts(env);
        if (bytes == NULL) {
            return NULL;
        }
        /* note newStr could come back NULL on OOM */
        StringObject* newStr = dvmCreateStringFromCstr(bytes);
        jstring result = (jstring) addLocalReference(ts.self(), (Object*) newStr);
        dvmReleaseTrackedAlloc((Object*)newStr, NULL);
        return result;
    }

The two functions follow the same basic steps: create a StringObject, then add it to the local reference table. They differ in the function that creates the StringObject: NewString calls dvmCreateStringFromUnicode, while NewStringUTF calls dvmCreateStringFromCstr. Both are declared in UtfString.h and implemented in UtfString.c:

    /*
     * Create a new java/lang/String object, using the given Unicode data.
     */
    StringObject* dvmCreateStringFromUnicode(const u2* unichars, int len)
    {
        /* We allow a NULL pointer if the length is zero. */
        assert(len == 0 || unichars != NULL);
        ArrayObject* chars;
        StringObject* newObj = makeStringObject(len, &chars);
        if (newObj == NULL) {
            return NULL;
        }
        if (len > 0) memcpy(chars->contents, unichars, len * sizeof(u2));
        u4 hashCode = computeUtf16Hash((u2*)(void*)chars->contents, len);
        dvmSetFieldInt((Object*)newObj, STRING_FIELDOFF_HASHCODE, hashCode);
        return newObj;
    }

    ....

    StringObject* dvmCreateStringFromCstr(const char* utf8Str) {
        assert(utf8Str != NULL);
        return dvmCreateStringFromCstrAndLength(utf8Str, dvmUtf8Len(utf8Str));
    }

    /*
     * Create a java/lang/String from a C string, given its UTF-16 length
     * (number of UTF-16 code points).
     *
     * The caller must call dvmReleaseTrackedAlloc() on the return value.
     *
     * Returns NULL and throws an exception on failure.
     */
    StringObject* dvmCreateStringFromCstrAndLength(const char* utf8Str,
        size_t utf16Length)
    {
        assert(utf8Str != NULL);
        ArrayObject* chars;
        StringObject* newObj = makeStringObject(utf16Length, &chars);
        if (newObj == NULL) {
            return NULL;
        }
        dvmConvertUtf8ToUtf16((u2*)(void*)chars->contents, utf8Str);
        u4 hashCode = computeUtf16Hash((u2*)(void*)chars->contents, utf16Length);
        dvmSetFieldInt((Object*) newObj, STRING_FIELDOFF_HASHCODE, hashCode);
        return newObj;
    }

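The main flow of the two functions is again similar. makeStringObject creates the StringObject and allocates memory for it; then memcpy or dvmConvertUtf8ToUtf16 copies the contents of the jchar array or the char array into the object; finally the hash is computed and stored in the StringObject. The real difference is therefore memcpy versus dvmConvertUtf8ToUtf16, so we look at these two.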

memcpy needs no analysis: it copies the jchar array pointed to by unichars into the content area of the StringObject. dvmConvertUtf8ToUtf16 deserves a closer look:

    /*
     * Convert a "modified" UTF-8 string to UTF-16.
     */
    void dvmConvertUtf8ToUtf16(u2* utf16Str, const char* utf8Str)
    {
        while (*utf8Str != '\0')
            *utf16Str++ = dexGetUtf16FromUtf8(&utf8Str);
    }

As the comment says, this function converts UTF-8 into UTF-16. It calls dexGetUtf16FromUtf8, which lives in DexUtf.h:

    /*
     * Retrieve the next UTF-16 character from a UTF-8 string.
     *
     * Advances "*pUtf8Ptr" to the start of the next character.
     *
     * WARNING: If a string is corrupted by dropping a '\0' in the middle
     * of a 3-byte sequence, you can end up overrunning the buffer with
     * reads (and possibly with the writes if the length was computed and
     * cached before the damage). For performance reasons, this function
     * assumes that the string being parsed is known to be valid (e.g., by
     * already being verified). Most strings we process here are coming
     * out of dex files or other internal translations, so the only real
     * risk comes from the JNI NewStringUTF call.
     */
    DEX_INLINE u2 dexGetUtf16FromUtf8(const char** pUtf8Ptr)
    {
        unsigned int one, two, three;

        one = *(*pUtf8Ptr)++;
        if ((one & 0x80) != 0) {
            /* two- or three-byte encoding */
            two = *(*pUtf8Ptr)++;
            if ((one & 0x20) != 0) {
                /* three-byte encoding */
                three = *(*pUtf8Ptr)++;
                return ((one & 0x0f) << 12) |
                       ((two & 0x3f) << 6) |
                       (three & 0x3f);
            } else {
                /* two-byte encoding */
                return ((one & 0x1f) << 6) |
                       (two & 0x3f);
            }
        } else {
            /* one-byte encoding */
            return one;
        }
    }

This function turns a UTF-8 string into UTF-16 one unit at a time. Let us trace it for the string "a中文", whose UTF-8 bytes are 0x61 0xE4 0xB8 0xAD 0xE6 0x96 0x87:

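1. The first statement executed is one = *(*pUtf8Ptr)++;. It reads the first byte of the string and advances the pointer. Here that byte is 0x61, the code for a. Since 0x61 & 0x80 = 0x00, this is a single-byte UTF-8 character and 0x61 is returned. The caller stores it as a u2 (typedef uint16_t u2), so the stored value is 0x0061.
2. The outer loop calls the function again, now at the second character. Its first byte is 0xE4, and 0xE4 & 0x80 = 0x80, so this is a two- or three-byte sequence and the next byte, 0xB8, is read into two. The code then tests one & 0x20: 0xE4 & 0x20 = 0x20, so this is a three-byte sequence and the third byte, 0xAD, is read. The function evaluates ((one & 0x0f) << 12) | ((two & 0x3f) << 6) | (three & 0x3f), which is the UTF-8 to Unicode mapping from the table in section 3. The result is 0x4E2D, the Unicode code point of 中, and the caller stores 0x4E2D.
3. The pointer has moved on, so on the next call one is 0xE6. The steps are the same as in step 2: the result is 0x6587, the Unicode code point of 文, and the caller stores 0x6587.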

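When the loop finishes, the UTF-8 input has been converted to UTF-16. Looking back over the whole process, both NewString and NewStringUTF produce a jstring holding UTF-16 data, so we can conclude that String objects in the Dalvik virtual machine are UTF-16 encoded. We confirm this conclusion later.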
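The trace can be reproduced outside Android with a small stand-alone re-implementation of the same bit tests. The names nextUnit3, decodeAll and kSample are invented for this sketch; only the branching logic mirrors dexGetUtf16FromUtf8, and like the original it trusts its input completely.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Same three branches as the decoder traced in the text (hypothetical name).
// No validation: the input must be well-formed and NUL-terminated.
std::uint16_t nextUnit3(const unsigned char** p) {
    const unsigned one = *(*p)++;
    if ((one & 0x80) == 0) {
        return static_cast<std::uint16_t>(one);                  // one byte
    }
    const unsigned two = *(*p)++;
    if ((one & 0x20) == 0) {                                     // two bytes
        return static_cast<std::uint16_t>(((one & 0x1F) << 6) | (two & 0x3F));
    }
    const unsigned three = *(*p)++;                              // three bytes
    return static_cast<std::uint16_t>(
        ((one & 0x0F) << 12) | ((two & 0x3F) << 6) | (three & 0x3F));
}

// Decode a whole NUL-terminated byte string into UTF-16 units.
std::vector<std::uint16_t> decodeAll(const unsigned char* s) {
    assert(s != nullptr);
    std::vector<std::uint16_t> out;
    while (*s != 0) {
        out.push_back(nextUnit3(&s));
    }
    return out;
}

// The sample string from the text as UTF-8 bytes, NUL-terminated.
const unsigned char kSample[] = {0x61, 0xE4, 0xB8, 0xAD, 0xE6, 0x96, 0x87, 0x00};
```

Calling decodeAll(kSample) returns the three units 0x0061, 0x4E2D and 0x6587, the same values as in steps 1 to 3.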

4.2 ART source analysis

With Dalvik done, we turn to the ART source. The flow is the same: the two functions are declared in jni.h and implemented in jni_internal.cc. Here are their definitions:

    static jstring NewString(JNIEnv* env, const jchar* chars, jsize char_count) {
        if (UNLIKELY(char_count < 0)) {
            JavaVmExtFromEnv(env)->JniAbortF("NewString", "char_count < 0: %d", char_count);
            return nullptr;
        }
        if (UNLIKELY(chars == nullptr && char_count > 0)) {
            JavaVmExtFromEnv(env)->JniAbortF("NewString", "chars == null && char_count > 0");
            return nullptr;
        }
        ScopedObjectAccess soa(env);
        mirror::String* result = mirror::String::AllocFromUtf16(soa.Self(), char_count, chars);
        return soa.AddLocalReference<jstring>(result);
    }

    ...

    static jstring NewStringUTF(JNIEnv* env, const char* utf) {
        if (utf == nullptr) {
            return nullptr;
        }
        ScopedObjectAccess soa(env);
        mirror::String* result = mirror::String::AllocFromModifiedUtf8(soa.Self(), utf);
        return soa.AddLocalReference<jstring>(result);
    }

The functions called are AllocFromUtf16 and AllocFromModifiedUtf8 respectively, both in string.cc:

    String* String::AllocFromUtf16(Thread* self, int32_t utf16_length, const uint16_t* utf16_data_in) {
        CHECK(utf16_data_in != nullptr || utf16_length == 0);
        gc::AllocatorType allocator_type = Runtime::Current()->GetHeap()->GetCurrentAllocator();
        const bool compressible = kUseStringCompression &&
            String::AllASCII<uint16_t>(utf16_data_in, utf16_length);
        int32_t length_with_flag = String::GetFlaggedCount(utf16_length, compressible);
        SetStringCountVisitor visitor(length_with_flag);
        ObjPtr<String> string = Alloc<true>(self, length_with_flag, allocator_type, visitor);
        if (UNLIKELY(string == nullptr)) {
            return nullptr;
        }
        if (compressible) {
            for (int i = 0; i < utf16_length; ++i) {
                string->GetValueCompressed()[i] = static_cast<uint8_t>(utf16_data_in[i]);
            }
        } else {
            uint16_t* array = string->GetValue();
            memcpy(array, utf16_data_in, utf16_length * sizeof(uint16_t));
        }
        return string.Ptr();
    }

    ....

    String* String::AllocFromModifiedUtf8(Thread* self, const char* utf) {
        DCHECK(utf != nullptr);
        size_t byte_count = strlen(utf);
        size_t char_count = CountModifiedUtf8Chars(utf, byte_count);
        return AllocFromModifiedUtf8(self, char_count, utf, byte_count);
    }

    String* String::AllocFromModifiedUtf8(Thread* self,
                                          int32_t utf16_length,
                                          const char* utf8_data_in,
                                          int32_t utf8_length) {
        gc::AllocatorType allocator_type = Runtime::Current()->GetHeap()->GetCurrentAllocator();
        const bool compressible = kUseStringCompression && (utf16_length == utf8_length);
        const int32_t utf16_length_with_flag = String::GetFlaggedCount(utf16_length, compressible);
        SetStringCountVisitor visitor(utf16_length_with_flag);
        ObjPtr<String> string = Alloc<true>(self, utf16_length_with_flag, allocator_type, visitor);
        if (UNLIKELY(string == nullptr)) {
            return nullptr;
        }
        if (compressible) {
            memcpy(string->GetValueCompressed(), utf8_data_in, utf16_length * sizeof(uint8_t));
        } else {
            uint16_t* utf16_data_out = string->GetValue();
            ConvertModifiedUtf8ToUtf16(utf16_data_out, utf16_length, utf8_data_in, utf8_length);
        }
        return string.Ptr();
    }

CountModifiedUtf8Chars and ConvertModifiedUtf8ToUtf16 are in utf.cc:

    /*
     * This does not validate UTF8 rules (nor did older code). But it gets the right answer
     * for valid UTF-8 and that's fine because it's used only to size a buffer for later
     * conversion.
     *
     * Modified UTF-8 consists of a series of bytes up to 21 bit Unicode code points as follows:
     * U+0001  - U+007F   0xxxxxxx
     * U+0080  - U+07FF   110xxxxx 10xxxxxx
     * U+0800  - U+FFFF   1110xxxx 10xxxxxx 10xxxxxx
     * U+10000 - U+1FFFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
     *
     * U+0000 is encoded using the 2nd form to avoid nulls inside strings (this differs from
     * standard UTF-8).
     * The four byte encoding converts to two utf16 characters.
     */
    size_t CountModifiedUtf8Chars(const char* utf8, size_t byte_count) {
      DCHECK_LE(byte_count, strlen(utf8));
      size_t len = 0;
      const char* end = utf8 + byte_count;
      for (; utf8 < end; ++utf8) {
        int ic = *utf8;
        len++;
        if (LIKELY((ic & 0x80) == 0)) {
          // One-byte encoding.
          continue;
        }
        // Two- or three-byte encoding.
        utf8++;
        if ((ic & 0x20) == 0) {
          // Two-byte encoding.
          continue;
        }
        utf8++;
        if ((ic & 0x10) == 0) {
          // Three-byte encoding.
          continue;
        }
        // Four-byte encoding: needs to be converted into a surrogate
        // pair.
        utf8++;
        len++;
      }
      return len;
    }

    void ConvertModifiedUtf8ToUtf16(uint16_t* utf16_data_out, size_t out_chars,
                                    const char* utf8_data_in, size_t in_bytes) {
      const char *in_start = utf8_data_in;
      const char *in_end = utf8_data_in + in_bytes;
      uint16_t *out_p = utf16_data_out;
      if (LIKELY(out_chars == in_bytes)) {
        // Common case where all characters are ASCII.
        for (const char *p = in_start; p < in_end;) {
          // Safe even if char is signed because ASCII characters always have
          // the high bit cleared.
          *out_p++ = dchecked_integral_cast<uint16_t>(*p++);
        }
        return;
      }
      // String contains non-ASCII characters.
      for (const char *p = in_start; p < in_end;) {
        const uint32_t ch = GetUtf16FromUtf8(&p);
        const uint16_t leading = GetLeadingUtf16Char(ch);
        const uint16_t trailing = GetTrailingUtf16Char(ch);
        *out_p++ = leading;
        if (trailing != 0) {
          *out_p++ = trailing;
        }
      }
    }

AllocFromUtf16 is a fairly simple element-wise copy or memcpy. AllocFromModifiedUtf8, on the other hand, chooses between memcpy and ConvertModifiedUtf8ToUtf16 depending on whether the string is compressible. Both functions test a compressible variable, so let us see how it is set, starting with AllocFromUtf16:

    const bool compressible = kUseStringCompression && String::AllASCII<uint16_t>(utf16_data_in, utf16_length);

If every character is ASCII, compressible is true (in the source examined, kUseStringCompression is set to true). Now the assignment in AllocFromModifiedUtf8:

    const bool compressible = kUseStringCompression && (utf16_length == utf8_length);

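If the character count of the UTF-8 string equals its byte count, every character is a single-byte UTF-8 character, and memcpy copies the bytes directly. Otherwise the string contains multi-byte characters and ConvertModifiedUtf8ToUtf16 converts it from UTF-8 to UTF-16. Let us look at that conversion more closely. For a string with non-ASCII characters, ConvertModifiedUtf8ToUtf16 runs its second for loop, which calls GetUtf16FromUtf8, GetLeadingUtf16Char and GetTrailingUtf16Char. These three functions are in utf-inl.h: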

    inline uint16_t GetTrailingUtf16Char(uint32_t maybe_pair) {
      return static_cast<uint16_t>(maybe_pair >> 16);
    }

    inline uint16_t GetLeadingUtf16Char(uint32_t maybe_pair) {
      return static_cast<uint16_t>(maybe_pair & 0x0000FFFF);
    }

    inline uint32_t GetUtf16FromUtf8(const char** utf8_data_in) {
      const uint8_t one = *(*utf8_data_in)++;
      if ((one & 0x80) == 0) {
        // one-byte encoding
        return one;
      }
      const uint8_t two = *(*utf8_data_in)++;
      if ((one & 0x20) == 0) {
        // two-byte encoding
        return ((one & 0x1f) << 6) | (two & 0x3f);
      }
      const uint8_t three = *(*utf8_data_in)++;
      if ((one & 0x10) == 0) {
        return ((one & 0x0f) << 12) | ((two & 0x3f) << 6) | (three & 0x3f);
      }
      // Four byte encodings need special handling. We'll have
      // to convert them into a surrogate pair.
      const uint8_t four = *(*utf8_data_in)++;
      // Since this is a 4 byte UTF-8 sequence, it will lie between
      // U+10000 and U+1FFFFF.
      //
      // TODO: What do we do about values in (U+10FFFF, U+1FFFFF) ? The
      // spec says they're invalid but nobody appears to check for them.
      const uint32_t code_point = ((one & 0x0f) << 18) | ((two & 0x3f) << 12)
          | ((three & 0x3f) << 6) | (four & 0x3f);
      uint32_t surrogate_pair = 0;
      // Step two: Write out the high (leading) surrogate to the bottom 16 bits
      // of the of the 32 bit type.
      surrogate_pair |= ((code_point >> 10) + 0xd7c0) & 0xffff;
      // Step three : Write out the low (trailing) surrogate to the top 16 bits.
      surrogate_pair |= ((code_point & 0x03ff) + 0xdc00) << 16;
      return surrogate_pair;
    }

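GetUtf16FromUtf8 first determines how many bytes the character occupies. A four-byte sequence needs special handling and is converted into a surrogate pair.
GetLeadingUtf16Char and GetTrailingUtf16Char are simple: they return the low 16 bits and the high 16 bits of that result. If the high 16 bits are non-zero, the two units are written out together as one four-byte UTF-16 character. The conclusion is that AllocFromModifiedUtf8 produces either an all-ASCII string stored one byte per character, or a UTF-16 string.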
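Two pieces of this path can be checked in isolation: the "one UTF-16 unit per byte" test behind compressible, and the way a surrogate pair travels inside a single 32-bit value. The sketch below is a stand-alone approximation, assuming well-formed input; countUtf16Units, isCompressible, packPair, leadingOf and trailingOf are names invented here and are not ART functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Count the UTF-16 units needed for a well-formed modified UTF-8 string,
// using the same lead-byte tests (0x80, 0x20, 0x10) as the quoted ART code.
std::size_t countUtf16Units(const char* utf8) {
    assert(utf8 != nullptr);
    const std::size_t n = std::strlen(utf8);
    std::size_t units = 0;
    for (std::size_t i = 0; i < n;) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        if ((lead & 0x80) == 0)      { i += 1; units += 1; }
        else if ((lead & 0x20) == 0) { i += 2; units += 1; }
        else if ((lead & 0x10) == 0) { i += 3; units += 1; }
        else                         { i += 4; units += 2; }  // surrogate pair
    }
    return units;
}

// One unit per byte means every byte is a single-byte (ASCII) character.
bool isCompressible(const char* utf8) {
    return countUtf16Units(utf8) == std::strlen(utf8);
}

// Pack a surrogate pair into 32 bits: leading unit in the low half,
// trailing unit in the high half.
std::uint32_t packPair(std::uint16_t leading, std::uint16_t trailing) {
    return static_cast<std::uint32_t>(leading) |
           (static_cast<std::uint32_t>(trailing) << 16);
}

std::uint16_t leadingOf(std::uint32_t packed) {
    return static_cast<std::uint16_t>(packed & 0xFFFF);
}

std::uint16_t trailingOf(std::uint32_t packed) {
    return static_cast<std::uint16_t>(packed >> 16);
}
```

A BMP character leaves the high half at zero, so trailingOf returns 0 and only one unit is written, which is the trailing != 0 test in the conversion loop.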

4.3 Conclusions and verifying the guesses

To check the conclusions above, the most direct idea is to measure how many bytes a String occupies on an Android 4.3 phone. The test code is:

    output = "hello from jni中文";
    byte[] bytes = output.getBytes();

The byte[] turns out to have length 20, not 32. That would mean the string is UTF-8 rather than UTF-16, which contradicts the earlier conclusion. The same code on an Android 6.0 phone also gives 20. To find out why, look at the source of getBytes (in String.java and Charset.java):

    /**
     * Encodes this {@code String} into a sequence of bytes using the
     * platform's default charset, storing the result into a new byte array.
     *
     * <p> The behavior of this method when this string cannot be encoded in
     * the default charset is unspecified.  The {@link
     * java.nio.charset.CharsetEncoder} class should be used when more control
     * over the encoding process is required.
     *
     * @return  The resultant byte array
     *
     * @since      JDK1.1
     */
    public byte[] getBytes() {
        return getBytes(Charset.defaultCharset());
    }

    /**
     * Returns the default charset of this Java virtual machine.
     *
     * <p>Android note: The Android platform default is always UTF-8.
     *
     * @return  A charset object for the default charset
     *
     * @since 1.5
     */
    public static Charset defaultCharset() {
        // Android-changed: Use UTF_8 unconditionally.
        synchronized (Charset.class) {
            if (defaultCharset == null) {
                defaultCharset = java.nio.charset.StandardCharsets.UTF_8;
            }
            return defaultCharset;
        }
    }
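The 20-versus-32 arithmetic from the experiment can be checked independently of Java. The following sketch counts UTF-8 bytes and UTF-16 code units for the same test string; utf16UnitCount, kText and checkSample are names made up for this illustration, and the input is assumed to be well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of UTF-16 code units for a well-formed UTF-8 string: every byte that
// is not a continuation byte (10xxxxxx) starts a character, and a 4-byte
// sequence (lead byte 11110xxx) needs a second unit for its surrogate pair.
std::size_t utf16UnitCount(const std::string& utf8) {
    std::size_t units = 0;
    for (unsigned char b : utf8) {
        if ((b & 0xC0) != 0x80) ++units;  // ASCII or lead byte
        if ((b & 0xF8) == 0xF0) ++units;  // extra unit for a surrogate pair
    }
    return units;
}

// The test string from the experiment, written out as UTF-8 bytes:
// 14 ASCII characters followed by two 3-byte Chinese characters.
const std::string kText = "hello from jni\xE4\xB8\xAD\xE6\x96\x87";

// Sanity check of the two figures quoted in the text.
void checkSample() {
    assert(kText.size() == 20);               // UTF-8: 14 + 2 * 3 bytes
    assert(utf16UnitCount(kText) * 2 == 32);  // UTF-16: 16 units of 2 bytes
}
```

So a length of 20 is exactly what a UTF-8 encoding of the string produces, while 32 would be the size of its UTF-16 form.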

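The getBytes and defaultCharset sources show clearly that getBytes returns the string encoded as UTF-8. So how can we find out the real internal encoding of a Java String? Can we look directly at the memory the object occupies? Android Studio can show the memory used by an object, so we first create a String: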

    String output = null;

    findViewById(R.id.btn_new_string).setOnClickListener(new View.OnClickListener() {
        @Override
        public void onClick(View view) {
            output = "hello from jni中文";
        }
    });

Then we inspect the object's memory usage in Android Profiler:

It shows 48 bytes, but the size of the character data inside the String cannot be read from it.

The second approach is to use lldb to inspect the memory of the underlying jstring object in detail, but reading a jstring with the memory read command fails.

The third approach: can the number of bytes occupied by a char variable support our conclusion?


Each char variable occupies two bytes, but this still does not fully prove that a String is UTF-16 encoded.

With nothing else left, the fourth approach is to consult authoritative official documentation. Two sources support the conclusion:

How is text represented in the Java platform?

The Java programming language is based on the Unicode character set, and several libraries implement the Unicode standard. Unicode is an international character set standard which supports all of the major scripts of the world, as well as common technical symbols. The original Unicode specification defined characters as fixed-width 16-bit entities, but the Unicode standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF. An encoding defined by the standard, UTF-16, allows to represent all Unicode code points using one or two 16-bit units.

The primitive data type char in the Java programming language is an unsigned 16-bit integer that can represent a Unicode code point in the range U+0000 to U+FFFF, or the code units of UTF-16. The various types and classes in the Java platform that represent character sequences - char[], implementations of java.lang.CharSequence (such as the String class), and implementations of java.text.CharacterIterator - are UTF-16 sequences. Most Java source code is written in ASCII, a 7-bit character encoding, or ISO-8859-1, an 8-bit character encoding, but is translated into UTF-16 before processing.

The Character class as an object wrapper for the char primitive type. The Character class also contains static methods such as isLowerCase() and isDigit() for determining the properties of a character. Since J2SE 5, these methods have overloads that accept either a char (which allows representation of Unicode code points in the range U+0000 to U+FFFF) or an int (which allows representation of all Unicode code points).

The key sentence is this one:

The various types and classes in the Java platform that represent character sequences - char[], implementations of java.lang.CharSequence (such as the String class), and implementations of java.text.CharacterIterator - are UTF-16 sequences.

String implements the CharSequence interface, so it is UTF-16 encoded.
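A practical consequence of a UTF-16 sequence is that its length counts 16-bit code units, not characters; Java's String.length() behaves this way. The same effect can be seen in C++ with std::u16string, as in the sketch below. codePointCount and kMixed are illustrative names, and the input is assumed to be well-formed UTF-16.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count code points in a well-formed UTF-16 string: a high surrogate
// (0xD800..0xDBFF) plus the low surrogate after it form one code point.
std::size_t codePointCount(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF) {
            // Well-formed input: a low surrogate always follows.
            assert(i + 1 < s.size() && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF);
            ++i;  // skip the low surrogate
        }
        ++n;
    }
    return n;
}

// 'a', U+4E2D and U+1F600; only the last one lies outside the BMP.
const std::u16string kMixed = u"a\u4E2D\U0001F600";
```

Here kMixed.size() is 4 while codePointCount(kMixed) is 3, because U+1F600 takes two 16-bit units.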

Java HotSpot VM Options

-XX:+UseCompressedStrings
Use a byte[] for Strings which can be represented as pure ASCII. (Introduced in Java 6 Update 21 Performance Release)

This HotSpot option presumably plays the same role as the kUseStringCompression constant seen in the ART source above.

5. Final conclusions

From the analysis above we can conclude:

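1. In Android Dalvik, String is encoded as UTF-16.
2. In Android ART, a String made up entirely of ASCII characters is stored one byte per character (ASCII, a subset of ISO-8859-1); in every other case it is UTF-16.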

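Conclusion 3 needs a short comparison. Return to the opening question: why not simply call env->NewStringUTF(str.c_str()), rather than writing a separate UTF82UnicodeOne function? Careful readers may already have noticed the difference between the UTF-8 to UTF-16 conversion functions in the Dalvik and ART sources above. Putting them side by side makes it clear: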

Dalvik:

    DEX_INLINE u2 dexGetUtf16FromUtf8(const char** pUtf8Ptr)
    {
        unsigned int one, two, three;

        one = *(*pUtf8Ptr)++;
        if ((one & 0x80) != 0) {
            /* two- or three-byte encoding */
            two = *(*pUtf8Ptr)++;
            if ((one & 0x20) != 0) {
                /* three-byte encoding */
                three = *(*pUtf8Ptr)++;
                return ((one & 0x0f) << 12) |
                       ((two & 0x3f) << 6) |
                       (three & 0x3f);
            } else {
                /* two-byte encoding */
                return ((one & 0x1f) << 6) |
                       (two & 0x3f);
            }
        } else {
            /* one-byte encoding */
            return one;
        }
    }

ART:

    inline uint16_t GetTrailingUtf16Char(uint32_t maybe_pair) {
      return static_cast<uint16_t>(maybe_pair >> 16);
    }

    inline uint16_t GetLeadingUtf16Char(uint32_t maybe_pair) {
      return static_cast<uint16_t>(maybe_pair & 0x0000FFFF);
    }

    inline uint32_t GetUtf16FromUtf8(const char** utf8_data_in) {
      const uint8_t one = *(*utf8_data_in)++;
      if ((one & 0x80) == 0) {
        // one-byte encoding
        return one;
      }
      const uint8_t two = *(*utf8_data_in)++;
      if ((one & 0x20) == 0) {
        // two-byte encoding
        return ((one & 0x1f) << 6) | (two & 0x3f);
      }
      const uint8_t three = *(*utf8_data_in)++;
      if ((one & 0x10) == 0) {
        return ((one & 0x0f) << 12) | ((two & 0x3f) << 6) | (three & 0x3f);
      }
      // Four byte encodings need special handling. We'll have
      // to convert them into a surrogate pair.
      const uint8_t four = *(*utf8_data_in)++;
      // Since this is a 4 byte UTF-8 sequence, it will lie between
      // U+10000 and U+1FFFFF.
      //
      // TODO: What do we do about values in (U+10FFFF, U+1FFFFF) ? The
      // spec says they're invalid but nobody appears to check for them.
      const uint32_t code_point = ((one & 0x0f) << 18) | ((two & 0x3f) << 12)
          | ((three & 0x3f) << 6) | (four & 0x3f);
      uint32_t surrogate_pair = 0;
      // Step two: Write out the high (leading) surrogate to the bottom 16 bits
      // of the of the 32 bit type.
      surrogate_pair |= ((code_point >> 10) + 0xd7c0) & 0xffff;
      // Step three : Write out the low (trailing) surrogate to the top 16 bits.
      surrogate_pair |= ((code_point & 0x03ff) + 0xdc00) << 16;
      return surrogate_pair;
    }

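Indeed: the Dalvik code has no handling for 4-byte UTF-8 sequences, while ART carries detailed comments explaining that a 4-byte UTF-8 sequence must be converted into a surrogate pair. As for why earlier Android versions did not handle 4-byte sequences, our guess is that they effectively worked in UCS-2 and ignored the planes beyond the BMP; once UTF-16 was adopted, 4-byte UTF-8 naturally had to be turned into UTF-16 surrogate pairs. The author of UTF82UnicodeOne presumably found this gap and wrote the function to work around it on Dalvik, which is impressive. (As quoted, stringTojstring casts each wchar_t to jchar, so a code point above U+FFFF is truncated to 16 bits rather than split into a surrogate pair; the helper avoids Dalvik's mis-parsing of 4-byte sequences but does not by itself produce correct UTF-16 for such characters.) This gives the last conclusion: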

3. In Android Dalvik, the UTF-8 to UTF-16 conversion function is flawed: it does not treat 4-byte UTF-8 sequences specially. The problem was fixed in ART.
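To make the flaw concrete, the sketch below contrasts a converter that emits surrogate pairs with what a decoder limited to three-byte sequences computes from the first three bytes of a 4-byte character. Both utf8ToUtf16 and threeByteDecode are hypothetical helpers written for this comparison, not platform code, and malformed input is handled only loosely. The resulting units could in principle be handed to env->NewString, but that call is left out so the example compiles without jni.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Decode well-formed UTF-8 into UTF-16, emitting a surrogate pair for each
// 4-byte sequence. An empty input simply yields an empty result.
std::u16string utf8ToUtf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::size_t len = 1;
        std::uint32_t cp = b0;
        if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        if (i + len > in.size()) break;  // truncated tail: stop decoding
        for (std::size_t k = 1; k < len; ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        }
        i += len;
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));    // high surrogate
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));  // low surrogate
        }
    }
    return out;
}

// What a decoder with only a "three-byte" branch computes when the lead byte
// actually starts a 4-byte sequence: it consumes just the first three bytes.
std::uint16_t threeByteDecode(unsigned char one, unsigned char two, unsigned char three) {
    assert((one & 0xE0) == 0xE0);  // lead byte would take the three-byte branch
    return static_cast<std::uint16_t>(
        ((one & 0x0F) << 12) | ((two & 0x3F) << 6) | (three & 0x3F));
}
```

For U+1F600 (bytes F0 9F 98 80), utf8ToUtf16 returns the pair 0xD83D 0xDE00, whereas threeByteDecode(0xF0, 0x9F, 0x98) gives 0x07D8, an unrelated BMP value, and leaves the fourth byte to be misread as the start of another character.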

