S57 Data Encapsulated in ISO8211: Causes of and Fixes for Garbled and Dropped Chinese Characters on Read

Source: Internet. Editor: 程序博客网. Time: 2024/04/29 06:30

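Many GIS enthusiasts and ECDIS developers who read S57 data files rely on ISO8211lib ("ISO8211lib is a C++ library for reading ISO8211-formatted files, such as SDTS and S-57 format"). In S57 data, the NATF field (the national attribute field) is encoded as double-byte Unicode, so NATF is the only field in S57 data whose parsing and handling involves double-byte data. NATF is moreover a variable-length field, yet the ISO8211lib function

"DDFSubfieldDefn::GetDataLength( const char * pachSourceData, int nMaxBytes, int * pnConsumedBytes )"

originally considered only the single-byte terminators UT = 31 (0x1F) and FT = 30 (0x1E) when computing the data length, and treated double-byte data as a kind of malformed data. See the comment in its source: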

/* We only check for the field terminator because of some buggy
* datasets with missing format terminators. However, we have found
* the field terminator is a legal character within the fields of
* some extended datasets (such as JP34NC94.000). So we don't check
* for the field terminator if the field appears to be multi-byte
* which we established by the first character being out of the
* ASCII printable range (32-127).
*/

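As a result the computed field length is wrong: individual bytes inside a double-byte string can easily coincide with a terminator byte, so the length comes out too short and characters are dropped. S57 specifies that for double-byte data the unit terminator is (0/0)(1/15) and the field terminator is (0/0)(1/14) (see S57 3.10), that is, the 16-bit values 0x001F and 0x001E. When comparing byte by byte, byte order also matters. My environment is Windows, which is little-endian, so the low byte comes first and the terminators are compared as the byte pairs 1F 00 and 1E 00. The length function above therefore needs to be modified. The original reads: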

int DDFSubfieldDefn::GetDataLength( const char * pachSourceData,
                                    int nMaxBytes, int * pnConsumedBytes )

{
    if( !bIsVariable ) // fixed-length subfield
    {
        if( nFormatWidth > nMaxBytes )
        {
            CPLError( CE_Warning, CPLE_AppDefined,
                      "Only %d bytes available for subfield %s with\n"
                      "format string %s ... returning shortened data.",
                      nMaxBytes, pszName, pszFormatString );

            if( pnConsumedBytes != NULL )
                *pnConsumedBytes = nMaxBytes;

            return nMaxBytes;
        }
        else
        {
            if( pnConsumedBytes != NULL )
                *pnConsumedBytes = nFormatWidth;

            return nFormatWidth;
        }
    }
    else // variable-length subfield
    {
        int nLength = 0;
        int bCheckFieldTerminator = TRUE;

        /* We only check for the field terminator because of some buggy
         * datasets with missing format terminators. However, we have found
         * the field terminator is a legal character within the fields of
         * some extended datasets (such as JP34NC94.000). So we don't check
         * for the field terminator if the field appears to be multi-byte
         * which we established by the first character being out of the
         * ASCII printable range (32-127).
         */

        // If the first byte is not a printable character, the data is assumed
        // to be double-byte and the field terminator is not checked.
        if( pachSourceData[0] < 32 || pachSourceData[0] >= 127 )
            bCheckFieldTerminator = FALSE;

        while( nLength < nMaxBytes
               && pachSourceData[nLength] != chFormatDelimeter )
        {
            if( bCheckFieldTerminator
                && pachSourceData[nLength] == DDF_FIELD_TERMINATOR )
                break;
            nLength++;
        }

        if( pnConsumedBytes != NULL )
        {
            if( nMaxBytes == 0 )
                *pnConsumedBytes = nLength;
            else
                *pnConsumedBytes = nLength+1;
        }

        return nLength;
    }
}
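
To make the dropped-character effect concrete, here is a minimal sketch of the same kind of single-byte scan, stripped of the library context. The function name scanSingleByte and the sample bytes are hypothetical and only serve as an illustration; the sketch assumes little-endian double-byte data as described above.

```cpp
#include <cstddef>

// Hypothetical helper (not part of ISO8211lib): counts bytes up to the
// first 0x1F byte, the way a single-byte unit-terminator scan does.
std::size_t scanSingleByte(const unsigned char* data, std::size_t maxBytes)
{
    std::size_t length = 0;
    while (length < maxBytes && data[length] != 0x1F)
        ++length;
    return length;
}
```

For the bytes 2D 4E 1F 4E 87 65 1F 00 (the code points U+4E2D, U+4E1F and U+6587 in little-endian order, followed by the two-byte unit terminator), this scan returns 2 instead of 6, because the low byte of the second character is 0x1F. Everything after the first character is lost.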

I believe it can be solved as follows (the code above this point is unchanged):

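        if( pachSourceData[0] < 32 || pachSourceData[0] >= 127 )
            bCheckFieldTerminator = FALSE;

        while( nLength < nMaxBytes )
        {
            if( bCheckFieldTerminator )
            {
                // Single-byte data: the unit terminator is one byte.
                if( pachSourceData[nLength] == chFormatDelimeter )
                    break;
            }
            else
            {
                // Double-byte data: the unit terminator is the byte pair 1F 00.
                // The bounds check avoids reading past the end of the buffer.
                if( pachSourceData[nLength] == chFormatDelimeter
                    && nLength + 1 < nMaxBytes
                    && pachSourceData[nLength+1] == 0 )
                    break;
            }

            // Field terminator that is the last available byte or is
            // followed by a non-zero byte.
            if( pachSourceData[nLength] == DDF_FIELD_TERMINATOR
                && ( nLength + 1 >= nMaxBytes || pachSourceData[nLength+1] ) )
                break;

            nLength++;
        }

        if( pnConsumedBytes != NULL )
        {
            if( nMaxBytes == 0 )
                *pnConsumedBytes = nLength;
            else
                *pnConsumedBytes = nLength+1;
        }

        return nLength;
    }
}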

In practice this solved the problem well.
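
A possible refinement, offered only as a sketch, is to step through the data two bytes at a time so that only aligned byte pairs are compared with the terminators. It assumes the caller already knows the data is little-endian double-byte text starting on a character boundary, which ISO8211lib itself does not know at this level. The names ScanResult and scanUcs2Le are hypothetical, and counting the two-byte terminator in consumedBytes is an assumption about how a caller would advance, not the library's behavior.

```cpp
#include <cstddef>

// Hypothetical result type for the sketch.
struct ScanResult {
    std::size_t dataLength;     // payload bytes before the terminator
    std::size_t consumedBytes;  // payload plus the two-byte terminator, if found
};

// Scans little-endian double-byte data in two-byte steps and stops at the
// unit terminator (1F 00) or the field terminator (1E 00).
ScanResult scanUcs2Le(const unsigned char* data, std::size_t maxBytes)
{
    std::size_t pos = 0;
    while (pos + 1 < maxBytes) {
        const bool highByteZero = (data[pos + 1] == 0x00);
        if (highByteZero && (data[pos] == 0x1F || data[pos] == 0x1E))
            return ScanResult{pos, pos + 2};
        pos += 2;  // move to the next aligned character
    }
    // No terminator found: treat everything available as data.
    return ScanResult{maxBytes, maxBytes};
}
```

For the bytes 2D 4E 1F 4E 87 65 1F 00 this returns a data length of 6 and 8 consumed bytes, so the character whose low byte is 0x1F is kept.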

The second problem: garbled Chinese characters

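In later processing, Chinese-language Windows normally represents Chinese text in GBK (code page 936), of which GB18030 (code page 54936) is a superset. To display these characters, the NATF field therefore also has to be converted from Unicode to GBK/GB18030.

The system conversion function can be used directly:

int len = WideCharToMultiByte(54936, 0, (LPCWSTR)tmpStr, m_strLength/2, tmpStr1, m_strLength, NULL, NULL);

Here 54936 is the code page identifier for GB18030. Note that GB18030 can take up to four bytes per character, so an output buffer of m_strLength bytes is only sufficient when every character converts to at most two bytes.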
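
Where the display side expects UTF-8 rather than GBK, or the code has to build outside Windows, the same decoding step can be written without the system function. The following is a minimal sketch of that alternative, not the conversion used above: the function name ucs2LeToUtf8 is hypothetical, the input is assumed to be little-endian double-byte data with terminators already stripped, and surrogate values are not given any special handling.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: converts little-endian double-byte (UCS-2) data
// to a UTF-8 string. A trailing odd byte, if any, is ignored.
std::string ucs2LeToUtf8(const unsigned char* data, std::size_t numBytes)
{
    std::string out;
    for (std::size_t i = 0; i + 1 < numBytes; i += 2) {
        // Low byte first, then high byte.
        const unsigned int cp =
            data[i] | (static_cast<unsigned int>(data[i + 1]) << 8);
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

For example, the byte pair 2D 4E (U+4E2D) becomes the three UTF-8 bytes E4 B8 AD.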

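In fact, the first problem was discovered while working on the garbled-character problem. At first I had only considered the character-encoding conversion, and it was only after the second problem was solved that the dropped-character problem turned up by chance. The same problem also exists in some commercial ECDIS products.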

These are only my own limited observations; corrections from the experts are welcome.

0 0